Background
Most documents in companies and other organizations are now stored in
some electronic form. These documents can be in a number of formats,
such as plain text, HTML, XML, Microsoft Word, and Adobe PDF. With the
advent of lower storage cost, as well as regulations regarding
traceability and accountability of documents, it is also common that
both previous versions and deleted documents have to be stored. In
addition to traditional document repositories, personal document
repositories have recently emerged. An example of such a repository is
a collection of personal email, stored either on the personal computer
or on a central mail server.
Searching in document collections is now a relatively mature research
area, and as the various web search engines show, scalability has also
been achieved. However, in addition to the explicit information and
knowledge that can be retrieved using text-search techniques, the
documents also contain implicit knowledge inside particular documents,
as well as inter-document knowledge. In order to discover this
knowledge, data mining techniques have to be applied. By using such
techniques, patterns can be detected through association rule mining,
similar documents can be found through clustering, and classification
of documents can be performed in order to ease subsequent searching.
Traditionally, text mining has been performed on a single text
collection, and in the case of collections from several repositories
these collections have first been merged before performing the mining
process. So far some research exists on distributed data mining in
general, but it should be mentioned that for
some text-mining approaches, e.g., clustering of large multi-domain
text collections with large vocabularies and noise (for example web
pages), there is still no scalable technique that gives high
clustering quality. In the context of association rules (text
association rules in the case of documents) some progress has also
been made; however, the reported results show that, in terms of both
execution time and memory usage, the techniques are not yet scalable.
While merging of collections is possible in many cases, for many
application areas it is not acceptable. For example, some
repositories cannot be merged for legal reasons, while others cannot
be merged because of the risk of revealing classified information. An
everyday example of such a repository is an email collection, as
mentioned above. In a company, email collections can be mined to discover the
aggregate knowledge of the organization.
Cooperative independent mining is mostly uncharted territory, but it
can be assumed that a general approach to the problem is independent
processing of collections at each repository, creating sufficient
intermediate results to perform global mining. The main challenge in
cooperative independent mining is to understand the form and contents of
the intermediate results.
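The general approach described above can be illustrated with a minimal
sketch. It assumes, purely for illustration, that the intermediate
result produced at each repository is a table of document-frequency
counts for single terms and term pairs, and that the global step mines
text association rules from the merged counts. The function names
local_counts and global_rules, the whitespace tokenization, and the
restriction to term pairs are hypothetical choices made for this
example; they are not the form of intermediate result chosen by the
project, which is the open question stated above.

```python
from collections import Counter
from itertools import combinations


def local_counts(documents):
    """Run independently at one repository.

    Returns the number of local documents and, for every single term and
    every term pair, the number of local documents that contain it.
    Only these counts leave the repository, never the documents.
    """
    counts = Counter()
    for text in documents:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        terms = sorted(set(text.lower().split()))
        for term in terms:
            counts[(term,)] += 1
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return len(documents), counts


def global_rules(summaries, min_support, min_confidence):
    """Run at the coordinator on the intermediate results only.

    Merges the per-repository counts and returns rules as tuples
    (lhs, rhs, support, confidence), sorted lexicographically.
    """
    total_docs = sum(n for n, _ in summaries)
    merged = Counter()
    for _, counts in summaries:
        merged.update(counts)  # Counter.update adds counts together

    rules = []
    for itemset, count in merged.items():
        support = count / total_docs
        if len(itemset) != 2 or support < min_support:
            continue
        first, second = itemset
        for lhs, rhs in ((first, second), (second, first)):
            confidence = count / merged[(lhs,)]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)
```

In this sketch each repository must report all of its counts, not only
the locally frequent ones, because a term pair that is rare in every
single collection can still be frequent in aggregate. The counts
themselves may also reveal sensitive vocabulary, so this form of
intermediate result would not be acceptable for every repository.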
The most related current research area is P2P-based resource
discovery, where approaches exist that use a combination of local and
global clustering to facilitate the subsequent search process. The
results of the COMIDOR project will also be
of value in this area. For example, a problem in resource discovery
based on independent collections is the use of different schemas or
semantics. This necessitates mining-conscious schema/ontology
mediation.
The techniques that will be developed for independent collections can
also be useful in the case of collections that are not necessarily
confidential but based on different schemas/ontologies. Instead of
mapping to a common schema/ontology, which might be non-trivial or
even impossible, the complexity of the problem can be reduced by
employing independent mining into a neutral form.
The COMIDOR project is a research project funded by the Norwegian
Research Council under the VERDIKT
research programme.
For more information please contact the project leader, Dr. Kjetil Nørvåg.
For more information about the research group and the department,
please visit the respective home pages:
The Data and Information
Management Group
Department of Computer and
Information Science