A Detailed Study On Text Mining Techniques
Plain, Structured, Text Mining, Web
are skilled in the art of producing summaries and carry out the
task as part of their professional life.
A2. Document Retrieval
Document retrieval is the task of identifying and returning
the most relevant documents. Traditional libraries provide
catalogues that allow users to identify documents based on
metadata. Metadata is a kind of highly structured document
summary, and successful
methodologies have been developed for manually extracting
metadata and for identifying relevant documents based on it;
these methodologies are widely taught in library schools.
Automatic extraction of metadata (e.g. subjects, language,
author, key-phrases) is a prime application of text mining
techniques. An alternative is full-text indexing, in which every
individual word in the document collection is indexed. This idea
underlies many effective and popular
document retrieval techniques.
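
The idea of indexing every individual word can be illustrated with a small inverted index. The following is a minimal sketch rather than a description of any particular retrieval system: the function names (tokenize, build_inverted_index, search), the sample documents, and the choice of boolean AND matching are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical three-document collection.
docs = {
    "d1": "Text mining extracts metadata from documents.",
    "d2": "Library catalogues list metadata such as author and subject.",
    "d3": "Full-text indexing covers every word in a collection.",
}
index = build_inverted_index(docs)
print(sorted(search(index, "metadata")))       # ['d1', 'd2']
print(sorted(search(index, "text indexing")))  # ['d3']
```

Because the index maps words to documents, a query only touches the entries for its own words instead of scanning the whole collection.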
A3. Information retrieval
Information retrieval is considered an extension of
document retrieval in which the documents that are returned are
processed to condense or extract the particular information
sought by the user. Thus document retrieval is followed by a
text summarization stage that focuses on the query posed by
the user, or by an information extraction stage. The granularity of
documents may be adjusted so that each individual subsection
or paragraph comprises a unit in its own right, in an attempt to
focus results on individual nuggets of information rather than
lengthy documents.
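
Treating each paragraph as a unit in its own right can be sketched as follows. This is an illustrative example only: it assumes paragraphs are separated by blank lines and scores each one by the number of distinct query words it contains, and the names split_into_passages and rank_passages are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_passages(docs):
    """Turn each blank-line-separated paragraph into its own retrievable unit."""
    passages = []
    for doc_id, text in docs.items():
        for position, paragraph in enumerate(text.split("\n\n")):
            paragraph = paragraph.strip()
            if paragraph:
                passages.append((doc_id, position, paragraph))
    return passages


def rank_passages(passages, query, top_k=1):
    """Return the top_k passages sharing the most distinct words with the query."""
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, position, paragraph in passages:
        overlap = len(query_terms & set(tokenize(paragraph)))
        if overlap:
            scored.append((overlap, doc_id, position, paragraph))
    # The sort is stable, so ties keep their original document order.
    scored.sort(key=lambda item: -item[0])
    return [(d, p, text) for _, d, p, text in scored[:top_k]]


# Hypothetical document with three paragraphs.
docs = {
    "report": (
        "Text mining covers many tasks.\n\n"
        "Document retrieval returns relevant documents.\n\n"
        "Summaries condense long documents."
    )
}
passages = split_into_passages(docs)
best = rank_passages(passages, "relevant documents retrieval")
print(best)  # [('report', 1, 'Document retrieval returns relevant documents.')]
```

The user receives the single matching paragraph rather than the whole report.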
A4. Assessing document similarity
Many text mining problems involve assessing the similarity
between different documents; for example, assigning
documents to pre-defined categories and grouping documents
into natural clusters. These are basic problems in data
mining too, and they have been a focus of research in text mining,
perhaps because the success of different techniques can be
evaluated and compared using standard, objective measures
of success.
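
One common way to assess document similarity, which the text above does not prescribe, is the cosine of the angle between term-frequency vectors. The sketch below assumes raw word counts as vector components; the function names term_vector and cosine_similarity and the sample sentences are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: the first two share most of their vocabulary.
doc_a = term_vector("text mining finds patterns in text")
doc_b = term_vector("data mining finds patterns in data")
doc_c = term_vector("library catalogues list authors")

print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # True
```

The score ranges from 0 for documents with no words in common to 1 for documents with identical word proportions, so it can be used both to assign a document to its nearest category and to group documents into clusters.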
A5. Text categorization
Text categorization is the assignment of natural language
documents to predefined categories according to their content.
The set of categories is often called a controlled vocabulary.
Document categorization is a long-standing traditional
technique for information retrieval in libraries, where subjects
rival authors as the predominant gateway to library contents,
although subjects are far harder to assign objectively than
authorship. Automatic text categorization has many practical
applications, including indexing for document retrieval,
automatically extracting metadata, word sense disambiguation
by detecting the topics a document covers, and organizing and
maintaining large catalogues of Web resources. As in other
areas of text mining, until the 1990s text categorization was
dominated by ad hoc techniques of knowledge engineering
that sought to elicit categorization rules from human experts
and code them into a system that could apply them
automatically to new documents. Since then, and particularly
in the research community, the dominant approach has been
to use techniques of machine learning to infer categories
automatically from a training set of pre-classified documents.
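
Inferring categories from a training set of pre-classified documents can be illustrated with a multinomial naive Bayes classifier using add-one smoothing. This learner is chosen here only as a simple, well-known example and is not named in the text; the function names, the four training sentences, and the two category labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(labelled_docs):
    """Collect per-category document and word counts from (text, category) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in sorted(doc_counts):
        score = math.log(doc_counts[category] / total_docs)  # class prior
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                count = word_counts[category][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical pre-classified training set.
training_set = [
    ("the match ended with a late goal", "sport"),
    ("the team won the league title", "sport"),
    ("the bank raised interest rates", "finance"),
    ("shares fell as the market closed", "finance"),
]
model = train(training_set)
print(classify(model, "the team scored a goal"))         # sport
print(classify(model, "the market and interest rates"))  # finance
```

No categorization rules are written by hand here: the word statistics gathered from the labelled examples are what decide the category of a new document.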
Indeed, text categorization is a hot topic in machine learning
today. The pre-defined categories are symbolic labels with no