Background

Most documents in companies and other organizations are now stored in some electronic form. These documents can be in a number of formats, such as plain text, HTML, XML, Microsoft Word, and Adobe PDF. With falling storage costs, as well as regulations regarding traceability and accountability of documents, it is also common that both previous versions and deleted documents have to be stored. In addition to traditional document repositories, personal document repositories have recently emerged. An example of such a repository is a collection of personal email, stored either on a personal computer or on a central mail server.

Searching in document collections is now a relatively mature research area, and as the various web search engines show, scalability has also been achieved. However, in addition to the explicit information and knowledge that can be retrieved using text-search techniques, the documents also contain implicit knowledge, both within individual documents and across documents. In order to discover this knowledge, data mining techniques have to be applied. By using such techniques, patterns can be detected through association rule mining, similar documents can be found through clustering, and documents can be classified in order to ease subsequent searching.

Traditionally, text mining has been performed on a single text collection, and in the case of collections from several repositories, these collections have first been merged before the mining process is performed. So far, some research exists on distributed data mining in general, but it should be mentioned that for some text-mining approaches, e.g., clustering of large multi-domain text collections with large vocabularies and noise (for example web pages), there is still no scalable technique that gives high clustering quality. In the context of association rules (text association rules in the case of documents), some progress has also been made; however, the reported results show that both execution time and memory usage are such that the techniques are not yet scalable.

While merging of collections is possible in many cases, for many application areas it is not acceptable. For example, some repositories cannot be merged for legal reasons, while others cannot be merged because of the risk of revealing classified information. An everyday example of such repositories is email collections, as mentioned above. In a company, email collections can be mined to discover the aggregate knowledge of the organization.

Cooperative independent mining is mostly uncharted territory, but it can be assumed that a general approach to the problem is to process the collection at each repository independently, creating intermediate results that are sufficient to perform global mining. The main challenge in cooperative independent mining is to understand the form and contents of these intermediate results.
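As a rough illustration of this general approach, and not a technique developed in the project, the following sketch lets each repository compute a local summary (the number of documents containing each term set), after which only these summaries are combined into global support values. The function names, the choice of term-set counts as the intermediate result, and the sample data are all assumptions made for illustration. The sketch does not address whether such counts could themselves reveal confidential information.

```python
from collections import Counter
from itertools import combinations


def local_summary(documents, max_size=2):
    """Run at one repository (hypothetical helper).

    Returns the number of local documents and, for every term set of up to
    max_size terms, the number of local documents that contain it.
    The raw documents never leave this function.
    """
    counts = Counter()
    for text in documents:
        # Naive tokenization: lowercase, split on whitespace, drop duplicates.
        terms = sorted(set(text.lower().split()))
        for size in range(1, max_size + 1):
            counts.update(combinations(terms, size))
    return {"n_docs": len(documents), "counts": counts}


def global_frequent_sets(summaries, min_support):
    """Run at the coordinator (hypothetical helper).

    Merges the local summaries and keeps the term sets whose global support
    (fraction of all documents containing the set) reaches min_support.
    """
    total_docs = sum(s["n_docs"] for s in summaries)
    merged = Counter()
    for s in summaries:
        merged.update(s["counts"])
    return {
        terms: count / total_docs
        for terms, count in merged.items()
        if count / total_docs >= min_support
    }


# Two independent collections (invented sample data).
repo_a = ["budget meeting monday", "budget report"]
repo_b = ["budget meeting friday"]

summaries = [local_summary(repo_a), local_summary(repo_b)]
result = global_frequent_sets(summaries, min_support=0.6)
```

Because document counts are additive across disjoint collections, the global support computed from the summaries equals what would be obtained on the merged collection. For other mining tasks, such as clustering, a suitable intermediate form is much less obvious.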

The most closely related current research area is P2P-based resource discovery, where approaches exist that use a combination of local and global clustering to facilitate the subsequent search process. The results of the COMIDOR project will also be of value in this area. For example, a problem in resource discovery based on independent collections is the use of different schemas or semantics. This necessitates mining-conscious schema/ontology mediation.

The techniques that will be developed for independent collections can also be useful for collections that are not necessarily confidential but are based on different schemas/ontologies. Instead of mapping to a common schema/ontology, which might be non-trivial or even impossible, the complexity of the problem can be reduced by independently mining each collection into a neutral form.

The COMIDOR project is a research project funded by the Norwegian Research Council under the VERDIKT research programme.

For more information please contact the project leader, Dr. Kjetil Nørvåg.

For more information about the research group and the department, please visit the respective home pages:
The Data and Information Management Group
Department of Computer and Information Science