A Brief Survey of Text Mining
Andreas Hotho
KDE Group
University of Kassel
[email protected]
Andreas Nürnberger
Information Retrieval Group
School of Computer Science
Otto-von-Guericke-University Magdeburg
[email protected]
Gerhard Paaß
Fraunhofer AiS
Knowledge Discovery Group
Sankt Augustin
[email protected]
May 13, 2005
Abstract
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining refers generally to the process of extracting interesting information and knowledge from unstructured text. In this article, we discuss text mining as a young and interdisciplinary field at the intersection of the related areas of information retrieval, machine learning, statistics, computational linguistics and, especially, data mining. We describe the main analysis tasks of preprocessing, classification, clustering, information extraction and visualization. In addition, we briefly discuss a number of successful applications of text mining.
1 Introduction
As computer networks become the backbones of science and economy, enormous quantities of machine-readable documents become available. It is estimated that 85% of business information lives in the form of text [TMS05]. Unfortunately, the usual logic-based programming paradigm has great difficulties in capturing the fuzzy and often ambiguous relations in text documents. Text mining aims at disclosing this concealed information by means of methods which, on the one hand, are able to cope with the large number of words and structures in natural language and, on the other hand, allow one to handle vagueness, uncertainty and fuzziness.
In this paper we describe text mining as a truly interdisciplinary method drawing on information retrieval, machine learning, statistics, computational linguistics and especially data mining. We first give a short sketch of these methods and then define text mining in relation to them. Later sections survey state-of-the-art approaches for the main analysis tasks: preprocessing, classification, clustering, information extraction and visualization. The last section exemplifies text mining in the context of a number of successful applications.
The analysis of data in KDD aims at finding hidden patterns and connections in these data. By data we understand a quantity of facts, which can be, for instance, records in a database, but also data in a simple text file. Characteristics that can be used to measure the quality of the patterns found in the data are their comprehensibility for humans, their validity in the context of given statistical measures, their novelty and their usefulness. Furthermore, different methods are able not only to discover new patterns but also to produce, at the same time, generalized models which represent the found connections. In this context, the expression "potentially useful" means that the patterns to be found for an application generate a benefit for the user. Thus the definition couples knowledge discovery with a specific application.
Knowledge discovery in databases is a process that is defined by several processing
steps that have to be applied to a data set of interest in order to extract useful patterns.
These steps have to be performed iteratively and several steps usually require interac-
tive feedback from a user. As defined by the CRoss Industry Standard Process for Data Mining (CRISP-DM¹) model [cri99] the main steps are: (1) business understanding², (2) data understanding, (3) data preparation, (4) modelling, (5) evaluation, (6) deployment (cf. Fig. 1³). Besides the initial problem of analyzing and understanding the overall
task (first two steps) one of the most time consuming steps is data preparation. This
is especially of interest for text mining which needs special preprocessing methods to
¹ http://www.crisp-dm.org/
² Business understanding could be defined as understanding the problem we need to solve. In the context of text mining this means, for example, that we are looking for groups of similar documents in a given document collection.
³ The figure is taken from http://www.crisp-dm.org/Process/index.htm
Figure 1: Phases of Crisp DM
convert textual data into a format which is suitable for data mining algorithms. The application of data mining algorithms in the modelling step, the evaluation of the obtained model and the deployment of the application (if necessary) close the process cycle. Here the modelling step is of main interest, as text mining frequently requires the development of new or the adaptation of existing algorithms.
In this connection, a database represents not only the medium for consistent storage and access, but has itself moved into the focus of research, since the analysis of data with data mining algorithms can be supported by databases, and thus the use of database technology in the data mining process might be useful. An overview of data mining from the database perspective can be found in [CHY96].
Machine Learning (ML) is an area of artificial intelligence concerned with the de-
velopment of techniques which allow computers to "learn" by the analysis of data sets.
The focus of most machine learning methods is on symbolic data. ML is also con-
cerned with the algorithmic complexity of computational implementations. Mitchell
presents many of the commonly used ML methods in [Mit97].
Statistics has its grounds in mathematics and deals with the science and practice of analyzing empirical data. It is based on statistical theory, which is a branch of applied mathematics. Within statistical theory, randomness and uncertainty are modelled by probability theory. Today many methods of statistics are used in the field of KDD. Good overviews are given in [HTF01, Be99, Mai02].
Text Mining = Information Extraction. The first approach assumes that text mining
essentially corresponds to information extraction (cf. section 3.3) — the extrac-
tion of facts from texts.
Text Mining = Text Data Mining. Text mining can also be defined — similar to data mining — as the application of algorithms and methods from the fields of machine learning and statistics to texts, with the goal of finding useful patterns. For this purpose it is necessary to pre-process the texts accordingly. Many authors use information extraction methods, natural language processing or some simple preprocessing steps in order to extract data from texts. Data mining algorithms can then be applied to the extracted data (see [NM02, Gai03]).
Text Mining = KDD Process. Following the knowledge discovery process model [cri99], text mining is frequently described in the literature as a process with a series of partial steps, including, among other things, information extraction as well as the use of data mining or statistical procedures. Hearst summarizes this in [Hea99] in a general manner as the extraction of not yet discovered information from large collections of texts. Also Kodratoff in [Kod99] and Gomez in [Hid02] consider text mining as a process-oriented approach to texts.
In this article, we consider text mining mainly as text data mining. Thus, our focus
is on methods that extract useful patterns from texts in order to, e.g., categorize or
structure text collections or to extract useful information.
Natural Language Processing (NLP). The general goal of NLP is to achieve a better understanding of natural language by use of computers [Kod99]. Others also include the employment of simple and robust techniques for the fast processing of text, as they are presented, e.g., in [Abn91]. The range of the assigned techniques reaches from the simple manipulation of strings to the automatic processing of natural language inquiries. In addition, linguistic analysis techniques are used, among other things, for the processing of text.
Information Extraction (IE). The goal of information extraction methods is the extraction of specific information from text documents. This information is stored in database-like patterns (see [Wil97]) and is then available for further use. For further details see section 3.3.
In the following, we will frequently refer to the above mentioned related areas of
research. We will especially provide examples for the use of machine learning methods
in information extraction and information retrieval.
2 Text Encoding
For mining large document collections it is necessary to pre-process the text documents
and store the information in a data structure, which is more appropriate for further pro-
cessing than a plain text file. Even though, meanwhile several methods exist that try to
exploit also the syntactic structure and semantics of text, most text mining approaches
are based on the idea that a text document can be represented by a set of words, i.e.
a text document is described based on the set of words contained in it (bag-of-words
representation). However, in order to be able to define at least the importance of a word
within a given document, usually a vector representation is used, where for each word a
numerical ”importance” value is stored. The currently predominant approaches based
on this idea are the vector space model [SWY75], the probabilistic model [Rob77] and
the logical model [van86].
In the following we briefly describe how a bag-of-words representation can be obtained. Furthermore, we describe the vector space model and corresponding similarity measures in more detail, since this model will be used by several of the text mining approaches discussed in this article.
Filtering methods remove words from the dictionary and thus from the documents.
A standard filtering method is stop word filtering. The idea of stop word filtering is
to remove words that bear little or no content information, like articles, conjunctions,
prepositions, etc. Furthermore, words that occur extremely often can be said to be of
little information content to distinguish between documents, and also words that occur
very seldom are likely to be of no particular statistical relevance and can be removed
from the dictionary [FBY92]. In order to further reduce the number of words in the
dictionary, also (index) term selection methods can be used (see Sect. 2.1.2).
Lemmatization methods try to map verb forms to the infinitive and nouns to the singular form. However, in order to achieve this, the word form has to be known, i.e. the part of speech of every word in the text document has to be assigned. Since this tagging process is usually quite time consuming and still error-prone, in practice stemming methods are frequently applied.
Stemming methods try to build the basic forms of words, i.e. strip the plural 's' from nouns, the 'ing' from verbs, or other affixes. A stem is a natural group of words with equal (or very similar) meaning. After the stemming process, every word is represented by its stem. A well-known rule-based stemming algorithm was originally proposed by Porter [Por80]. He defined a set of production rules to iteratively transform (English) words into their stems.
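To make these preprocessing steps concrete, the following minimal Python sketch combines stop word filtering and Porter stemming. The use of the NLTK library is an illustrative choice, not prescribed by the text; it assumes the nltk package and its stop word data are installed.

```python
# Minimal sketch: stop word filtering followed by Porter stemming (NLTK is
# an assumed dependency; any comparable stemmer and stop word list would do).
import nltk
nltk.download("stopwords", quiet=True)  # fetch the stop word list once
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Tokenize on whitespace, drop punctuation and stop words, return stems."""
    tokens = (t.lower().strip(".,;:!?\"'()") for t in text.split())
    return [STEMMER.stem(t) for t in tokens if t and t not in STOP_WORDS]

print(preprocess("Stemming methods try to build the basic forms of words"))
```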
A criterion frequently used for term selection is the entropy of a term t in the document collection D:

W(t) = 1 + \frac{1}{\log_2 |D|} \sum_{d \in D} P(d,t) \log_2 P(d,t) \quad \text{with} \quad P(d,t) = \frac{\mathrm{tf}(d,t)}{\sum_{l=1}^{n} \mathrm{tf}(d_l,t)},   (1)

where tf(d, t) is the frequency of term t in document d.
Here the entropy gives a measure of how well a word is suited to separate documents by keyword search. For instance, words that occur in many documents will have low entropy. The entropy can be seen as a measure of the importance of a word in the given domain context. As index words, a number of words that have a high entropy relative to their overall frequency can be chosen, i.e. of words occurring equally often, those with the higher entropy can be preferred.
In order to obtain a fixed number of index terms that appropriately cover the docu-
ments, a simple greedy strategy can be applied: From the first document in the collec-
tion select the term with the highest relative entropy (or information gain as described
in Sect. 3.1.1) as an index term. Then mark this document and all other documents con-
taining this term. From the first of the remaining unmarked documents select again the
term with the highest relative entropy as an index term. Then mark again this document
and all other documents containing this term. Repeat this process until all documents
are marked, then unmark them all and start again. The process can be terminated when
the desired number of index terms have been selected. A more detailed discussion of
the benefits of this approach for clustering - with respect to reduction of words required
in order to obtain a good clustering performance - can be found in [BN04].
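A sketch of this greedy covering strategy is given below. It is an illustrative implementation under the assumption that a relative entropy score (e.g. W(t) from Eq. 1) has already been precomputed for every term.

```python
# Greedy index-term selection as described above (illustrative sketch).
# `docs` maps document ids to term sets, `entropy` maps terms to precomputed
# relative entropy scores, `num_terms` is the desired number of index terms.
def select_index_terms(docs, entropy, num_terms):
    selected = []
    while len(selected) < num_terms:
        unmarked = set(docs)                 # unmark all documents
        progress = False
        for doc_id in sorted(docs):
            if doc_id not in unmarked:
                continue                     # document already marked
            candidates = [t for t in docs[doc_id] if t not in selected]
            if not candidates:
                unmarked.discard(doc_id)
                continue
            # take the highest-entropy term of the first unmarked document
            best = max(candidates, key=lambda t: entropy.get(t, 0.0))
            selected.append(best)
            progress = True
            # mark this document and all other documents containing the term
            unmarked -= {i for i in unmarked if best in docs[i]}
            if len(selected) == num_terms:
                return selected
        if not progress:
            break                            # no further terms available
    return selected
```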
An index term selection method that is more appropriate if we have to learn a classifier for documents is discussed in Sect. 3.1.1. This approach also considers the word distributions within the classes.
w(d,t) = \frac{\mathrm{tf}(d,t)\, \log(N/n_t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d,t_j)^2 \left(\log(N/n_{t_j})\right)^2}},   (2)

where N is the size of the document collection D and n_t is the number of documents in D that contain term t.
Based on a weighting scheme a document d is defined by a vector of term weights w(d) = (w(d, t_1), \ldots, w(d, t_m)), and the similarity S of two documents d_1 and d_2 (or the similarity of a document and a query vector) can be computed based on the inner product of the vectors (by which, if we assume normalized vectors, the cosine between the two document vectors is computed), i.e.

S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k).   (3)
However, the Euclidean distance should only be used for normalized vectors, since otherwise the different lengths of documents can result in a smaller distance between documents that share fewer words than between documents that have more words in common and should therefore be considered more similar.
Note that for normalized vectors the scalar product is not much different in behavior from the Euclidean distance, since for two vectors \vec{x} and \vec{y} it is

\cos\varphi = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}| \cdot |\vec{y}|} = 1 - \frac{1}{2}\, d^2\!\left(\frac{\vec{x}}{|\vec{x}|}, \frac{\vec{y}}{|\vec{y}|}\right).
For a more detailed discussion of the vector space model and weighting schemes
see, e.g. [BYRN99, Gre98, SB88, SWY75].
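As a small illustration of Eqs. (2) and (3), the following self-contained Python sketch computes tf-idf style weight vectors for a hypothetical toy collection and the cosine similarity between its documents.

```python
# Toy illustration of the weighting of Eq. (2) and the similarity of Eq. (3).
import math
from collections import Counter

docs = ["text mining finds patterns in text",
        "data mining algorithms analyze data",
        "text classification assigns classes to text documents"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
N = len(docs)
# n_t: the number of documents that contain term t
n = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def weight_vector(doc_tokens):
    tf = Counter(doc_tokens)
    w = [tf[t] * math.log(N / n[t]) for t in vocab]  # numerator of Eq. (2)
    norm = math.sqrt(sum(x * x for x in w))          # denominator of Eq. (2)
    return [x / norm for x in w] if norm else w

vectors = [weight_vector(doc) for doc in tokenized]

def similarity(v1, v2):                              # Eq. (3): inner product
    return sum(a * b for a, b in zip(v1, v2))

print(similarity(vectors[0], vectors[2]))  # documents share the term "text"
print(similarity(vectors[0], vectors[1]))  # documents share only "mining"
```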
Linguistic processing uses lexica and other resources as well as hand-crafted rules. If a set of examples is available, machine learning methods as described in section 3, especially in section 3.3, may be employed to learn the desired tags.
It turned out, however, that for many text mining tasks linguistic preprocessing is of limited value compared to the simple bag-of-words approach with basic preprocessing. The reason is that the co-occurrence of terms in the vector representation serves as an automatic disambiguation, e.g. for classification [LK02]. Recently some progress was made by enhancing the bag-of-words representation with linguistic features for text clustering and classification [HSS03, BH04].
3.1 Classification
Text classification aims at assigning pre-defined classes to text documents [Mit97]. An example would be to automatically label each incoming news story with a topic like "sports", "politics", or "art". Whatever the specific method employed, a data mining classification task starts with a training set D = (d_1, \ldots, d_n) of documents that are already labelled with a class L ∈ L (e.g. sports, politics). The task is then to determine a classification model

f : D → L,   f(d) = L   (5)

which is able to assign the correct class to a new document d of the domain.
To measure the performance of a classification model a random fraction of the la-
belled documents is set aside and not used for training. We may classify the documents
of this test set with the classification model and compare the estimated labels with
the true labels. The fraction of correctly classified documents in relation to the total
number of documents is called accuracy and is a first performance measure.
Often, however, the target class covers only a small percentage of the documents. Then we get a high accuracy if we assign each document to the alternative class. To avoid this effect, different measures of classification success are often used. Precision quantifies the fraction of retrieved documents that are in fact relevant, i.e. belong to the target class. Recall indicates which fraction of the relevant documents is retrieved.
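The following short sketch illustrates these measures on hypothetical true and predicted labels for a binary target class "sports".

```python
# Accuracy, precision and recall for a binary target class (hypothetical data).
true = ["sports", "other", "sports", "other", "other", "sports"]
pred = ["sports", "sports", "other", "other", "other", "sports"]

tp = sum(t == p == "sports" for t, p in zip(true, pred))              # hits
fp = sum(t != "sports" and p == "sports" for t, p in zip(true, pred))
fn = sum(t == "sports" and p != "sports" for t, p in zip(true, pred))

accuracy = sum(t == p for t, p in zip(true, pred)) / len(true)
precision = tp / (tp + fp)  # fraction of retrieved documents that are relevant
recall = tp / (tp + fn)     # fraction of relevant documents that are retrieved
print(accuracy, precision, recall)
```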
Note that each document is assumed to belong to exactly one of the k classes in L. The prior probability p(L) denotes the probability that an arbitrary document belongs to class L before its words are known. Often the prior probabilities of all classes may be taken to be equal. The conditional probability on the left is the desired posterior probability that the document with words t_1, \ldots, t_{n_i} belongs to class L_c. We may assign the class with the highest posterior probability to our document.
For document classification it turned out that the specific order of the words in a document is not very important. Moreover, we may assume that for documents of a given class a word appears in the document irrespective of the presence of other words. This leads to a simple formula for the conditional probability of words given a class L_c:

p(t_1, \ldots, t_{n_i} \mid L_c) = \prod_{j=1}^{n_i} p(t_j \mid L_c)
Combining this "naïve" independence assumption with the Bayes formula defines the Naïve Bayes classifier [Goo65]. Simplifications of this sort are required as many thousand different words occur in a corpus.
The naïve Bayes classifier involves a learning step which simply requires the estimation of the probabilities of words p(t_j|L_c) in each class by their relative frequencies in the documents of a training set which are labelled with L_c. In the classification step the estimated probabilities are used to classify a new instance according to the Bayes rule. In order to reduce the number of probabilities p(t_j|L_m) to be estimated, we can use index term selection methods as discussed above in Sect. 3.1.1.
Although this model is unrealistic due to its restrictive independence assumption, it yields surprisingly good classifications [DPHS98, Joa98]. It may be extended in several directions [Seb02].
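A minimal sketch of such a classifier is given below. The tiny training set is hypothetical, and the add-one smoothing of the relative frequencies is a common practical extension not discussed above.

```python
# Minimal naive Bayes text classifier: estimate p(t|L_c) by (smoothed)
# relative frequencies and assign the class with highest posterior probability.
import math
from collections import Counter, defaultdict

train = [("ball goal team win", "sports"),
         ("election vote party", "politics"),
         ("team match goal", "sports"),
         ("party government vote law", "politics")]

words_per_class = defaultdict(list)
for text, label in train:
    words_per_class[label].extend(text.split())
vocab = {t for text, _ in train for t in text.split()}
prior = {c: sum(1 for _, l in train if l == c) / len(train)
         for c in words_per_class}
counts = {c: Counter(ws) for c, ws in words_per_class.items()}

def classify(text):
    scores = {}
    for c in counts:
        total = sum(counts[c].values())
        log_p = math.log(prior[c])                 # log p(L_c)
        for t in text.split():                     # naive independence
            log_p += math.log((counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = log_p
    return max(scores, key=scores.get)

print(classify("goal and team"))  # -> sports
```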
As the effort for manually labeling the documents of a training set is high, some authors use unlabeled documents for training. Assume that from a small training set it has been established that word t_i is highly correlated with class L_c. If from unlabeled documents it may be determined that word t_j is highly correlated with t_i, then t_j is also a good predictor for class L_c. In this way unlabeled documents may improve classification performance. In [NMTM00] the authors used a combination of Expectation-Maximization (EM) [DLR77] and a naïve Bayes classifier and were able to reduce the classification error by up to 30%.
is the cosine similarity as defined in (3). Note that only a small fraction of all possible terms appears in these sums, as w(d, t) = 0 if the term t is not present in the document d.
Other similarity measures are discussed in [BYRN99].
For deciding whether document d_i belongs to class L_m, the similarity S(d_i, d_j) to all documents d_j in the training set is determined. The k most similar training documents (neighbors) are selected. The proportion of neighbors having the same class may be taken as an estimator for the probability of that class, and the class with the largest proportion is assigned to document d_i. The optimal number k of neighbors may be estimated from additional training data by cross-validation.
Nearest neighbor classification is a nonparametric method and it can be shown that
for large data sets the error rate of the 1-nearest neighbor classifier is never larger
than twice the optimal error rate [HTF01]. Several studies have shown that k-nearest
neighbor methods have very good performance in practice [Joa98]. Their drawback
is the computational effort during classification, where basically the similarity of a
document with respect to all other documents of a training set has to be determined.
Some extensions are discussed in [Seb02].
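To make the procedure concrete, a compact sketch follows, assuming normalized document weight vectors as in Sect. 2 so that the inner product equals the cosine similarity of Eq. (3).

```python
# k-nearest-neighbor classification sketch: the class of a new document is
# the majority class among its k most similar training documents.
from collections import Counter

def knn_classify(query_vec, train_vecs, train_labels, k=3):
    sims = [(sum(a * b for a, b in zip(query_vec, v)), label)   # Eq. (3)
            for v, label in zip(train_vecs, train_labels)]
    neighbors = sorted(sims, reverse=True)[:k]     # k most similar documents
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]              # class with most neighbors
```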
[Figure 2: A hyperplane separating documents of class 1 from documents of class 2 with margin ξ.]

A linear support vector machine classifies a document with term vector \vec{t}_d via the decision function given by the following equation:

y = f(\vec{t}_d) = b_0 + \sum_{j=1}^{N} b_j t_{dj}
The SVM algorithm determines a hyperplane which is located between the positive and
negative examples of the training set. The parameters bj are adapted in such a way that
the distance ξ – called margin – between the hyperplane and the closest positive and
negative example documents is maximized, as shown in Fig. 2. This amounts to a
constrained quadratic optimization problem which can be solved efficiently for a large
number of input vectors.
The documents having distance ξ from the hyperplane are called support vectors
and determine the actual location of the hyperplane. Usually only a small fraction of
documents are support vectors. A new document with term vector \vec{t}_d is classified into L_1 if the value f(\vec{t}_d) > 0 and into L_2 otherwise. In case the document vectors of
the two classes are not linearly separable a hyperplane is selected such that as few as
possible document vectors are located on the “wrong” side.
SVMs can be used with non-linear predictors by transforming the usual input features in a non-linear way, e.g. by defining a feature map

\phi(t_1, \ldots, t_N) = \left(t_1, \ldots, t_N,\; t_1^2, t_1 t_2, \ldots, t_N t_{N-1}, t_N^2\right)
especially suitable for the classification of texts [Joa98]. In the case of textual data the
choice of the kernel function has a minimal effect on the accuracy of classification:
Kernels that imply a high dimensional feature space show slightly better results in
terms of precision and recall, but they are subject to overfitting [LK02].
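As an illustration, a linear SVM text classifier can be assembled in a few lines with scikit-learn. This is an implementation choice not prescribed by the text, and the tiny training set is hypothetical.

```python
# Linear SVM on tf-idf document vectors, sketched with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["the team won the match", "parliament passed the law",
               "goal scored in the final minute", "the party lost the election"]
train_labels = ["sports", "politics", "sports", "politics"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["the final match of the team"]))
```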
3.2 Clustering
Clustering method can be used in order to find groups of documents with similar con-
tent. The result of clustering is typically a partition (also called) clustering P, a set
of clusters P . Each cluster consists of a number of documents d. Objects — in our
case documents — of a cluster should be similar and dissimilar to documents of other
clusters. Usually the quality of clusterings is considered better if the contents of the
documents within one cluster are more similar and between the clusters more dissimi-
lar. Clustering methods group the documents only by considering their distribution in
document space (for example, a n-dimensional space if we use the vector space model
for text documents).
Clustering algorithms compute the clusters based on the attributes of the data and
measures of (dis)similarity. However, the idea of what an ideal clustering result should look like varies between applications and might even differ between users. One can exert influence on the results of a clustering algorithm by using only subsets of attributes or by adapting the used similarity measures, and thus control the clustering process. To what extent the result of the cluster algorithm coincides with the ideas
of the user can be assessed by evaluation measures. A survey of different kinds of
clustering algorithms and the resulting cluster types can be found in [SEK03].
In the following, we first introduce standard evaluation methods and then present details for hierarchical clustering approaches, k-means, bi-section-k-means, self-organizing
maps and the EM-algorithm. We will finish the clustering section with a short overview
of other clustering approaches used for text clustering.
Statistical Measures In the following, we first discuss measures which cannot make use of a given classification L of the documents. They are called indices in the statistical literature and evaluate the quality of a clustering on the basis of statistical properties. A large number of indices can be found in the literature (see [Fic97, DH73]). One of the most well-known measures is the mean square error. It permits statements on the quality of the found clusters dependent on the number of clusters. Unfortunately, the computed quality is always better if the number of clusters is higher. In [KR90] an alternative measure, the silhouette coefficient, is presented which is independent of the number of clusters. We introduce both measures in the following.
Mean square error If one keeps the number of dimensions and the number of clusters constant, the mean square error (MSE) can likewise be used for the evaluation of the quality of a clustering. The mean square error is a measure for the compactness of the clustering and is defined as follows:

Definition 1 (MSE) The mean square error (MSE) for a given clustering P is defined as

MSE(\mathcal{P}) = \sum_{P \in \mathcal{P}} MSE(P),   (9)

where MSE(P) = \sum_{d \in P} \mathrm{dist}(\vec{t}_d, \mu_P)^2, \mu_P = \frac{1}{|P|} \sum_{d \in P} \vec{t}_d is the centroid of cluster P, and dist is a distance measure.
Silhouette Coefficient One clustering measure that is independent of the number of clusters is the silhouette coefficient SC(P) (cf. [KR90]). The main idea of the coefficient is to find out the location of a document in the space with respect to its own cluster and the next most similar cluster. For a good clustering the considered document is close to its own cluster, whereas for a bad clustering the document is closer to the next cluster. With the help of the silhouette coefficient one is able to judge the quality of a cluster or the entire clustering (details can be found in [KR90]). [KR90] gives characteristic values of the silhouette coefficient for the evaluation of the cluster quality. A value for SC(P) between 0.7 and 1.0 signals excellent separation between the found clusters, i.e. the objects within a cluster are very close to each other and are far away from other clusters; the structure was very well identified by the cluster algorithm. For the range from 0.5 to 0.7 the objects are clearly assigned to the appropriate clusters. A larger level of noise exists in the data set if the silhouette coefficient is within the range of 0.25 to 0.5, whereby clusters are still identifiable, but many objects could not be clearly assigned to one cluster by the cluster algorithm. At values under 0.25 it is practically impossible to identify a cluster structure and to calculate meaningful (from the view of the application) cluster centers; the cluster algorithm more or less "guessed" the clustering.
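As a sketch, the silhouette coefficient of a clustering can be computed with scikit-learn. This is an illustrative tooling choice, and the toy texts are hypothetical.

```python
# Judging a k-means clustering of tf-idf vectors by its silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

texts = ["goal team match", "vote party law", "team win goal",
         "election vote government", "match final team win"]
X = TfidfVectorizer().fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # values near 1.0 signal good separation
```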
results, microaveraged precision is identical to microaveraged recall (cf. e.g. [Seb02]). The F-measure works similarly to inverse purity, but it depreciates overly large clusters, as it includes the individual precision of these clusters in the evaluation.
While (inverse) purity and F-measure only consider 'best' matches between 'queries' and manually defined categories, the entropy indicates how large the information-content uncertainty of a clustering result with respect to the given classification is:

E(\mathcal{P}, \mathcal{L}) = \sum_{P \in \mathcal{P}} \mathrm{prob}(P) \cdot E(P), \quad \text{where}   (15)

E(P) = -\sum_{L \in \mathcal{L}} \mathrm{prob}(L|P) \log(\mathrm{prob}(L|P))   (16)

and prob(L|P) = Precision(P, L) and prob(P) = |P|/|D|. The entropy has the range [0, log(|L|)], with 0 indicating optimality.
clusters. Unfortunately, the selection of the appropriate linkage method depends on the
desired cluster structure, which is usually unknown in advance. For example, single
linkage tends to follow chain-like clusters in the data, while complete linkage tends
to create ellipsoid clusters. Thus prior knowledge about the expected distribution and
cluster form is usually necessary for the selection of the appropriate method (see also
[DH73]). However, substantially more problematic for the use of the algorithm on large data sets is the memory required to store the similarity matrix, which consists of n(n − 1)/2 elements where n is the number of documents. Also the runtime behavior of O(n²) is worse compared to the linear behavior of k-means as discussed in the following.
k-means is one of the most frequently used clustering algorithms in practice in the field of data mining and statistics (see [DH73, Har75]). The procedure, which originally comes from statistics, is simple to implement and can also be applied to large data sets. It turned out that especially in the field of text clustering k-means obtains good results. Proceeding from a starting solution in which all documents are distributed over a given number of clusters, one tries to improve the solution by specific changes of the allocation of documents to the clusters. Meanwhile a set of variants exists, whereas the basic principle goes back to Forgy 1965 [For65] or MacQueen 1967 [Mac67]. In the literature on vector quantization, k-means is also known under the name Lloyd-Max algorithm ([GG92]). The basic principle is the following: (1) determine a starting solution, i.e. an initial distribution of the documents over k clusters; (2) compute the cluster centroids; (3) assign each document to the nearest of the k centroids; (4) recompute the centroids on the basis of the new allocations; (5) repeat steps three and four until the centroids do not change any more.
k-means essentially consists of steps three and four of the algorithm, whereby the number of clusters k must be given. In step three the documents are assigned to the nearest of the k centroids (also called cluster prototypes). Step four calculates new centroids on the basis of the new allocations. We repeat the two steps in a loop (step five) until the cluster centroids do not change any more. The algorithm corresponds to a simple hill-climbing procedure which typically gets stuck in a local optimum (finding the global optimum is an NP-complete problem). Apart from a suitable method to determine the starting solution (step one), we require a measure for calculating the distance or similarity in step three (cf. section 2.1). Furthermore, the abort criterion of the loop in step five can be chosen differently, e.g. by stopping after a fixed number of iterations; a code sketch following the steps is given below.
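The following minimal sketch implements the steps as described, assuming documents are represented as plain lists of numeric term weights; Euclidean distance is an illustrative choice for dist.

```python
# Plain k-means following steps (1) and (3)-(5) described above.
import math
import random

def dist(x, y):  # Euclidean distance (an illustrative choice)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(vectors, k, max_iter=100):
    centroids = random.sample(vectors, k)          # step 1: starting solution
    for _ in range(max_iter):                      # step 5: loop
        clusters = [[] for _ in range(k)]
        for v in vectors:                          # step 3: nearest centroid
            i = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[i].append(v)
        new_centroids = [                          # step 4: recompute centroids
            [sum(xs) / len(cl) for xs in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)]
        if new_centroids == centroids:             # abort: centroids unchanged
            break
        centroids = new_centroids
    return clusters, centroids
```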
Bi-Section-k-means One fast text clustering algorithm, which is also able to deal with the large size of textual data, is the Bi-Section-k-means algorithm. In [SKK00] it was shown that Bi-Section-k-means is a fast and high-quality clustering algorithm for text documents which frequently outperforms standard k-means as well as agglomerative clustering techniques.
Bi-Section-k-means is based on the k-means algorithm. It repeatedly splits the largest cluster (using k-means) until the desired number of clusters is obtained. Another way of choosing the next cluster to be split is picking the one with the largest variance; [SKK00] showed that neither of these two options has a significant advantage.
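A sketch on top of the kmeans function above, using the first variant (splitting the largest cluster) and assuming k does not exceed the number of documents:

```python
# Bi-Section-k-means sketch: repeatedly bisect the largest cluster with 2-means.
def bisecting_kmeans(vectors, k):
    clusters = [list(vectors)]
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        halves, _ = kmeans(clusters.pop(largest), 2)   # split with k-means
        clusters.extend(h for h in halves if h)        # keep non-empty halves
    return clusters
```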
Self-Organizing Maps (SOMs) [Koh82] are a special architecture of neural networks
that cluster high-dimensional data vectors according to a similarity measure. The clus-
ters are arranged in a low-dimensional topology that preserves the neighborhood re-
lations in the high dimensional data. Thus, not only objects that are assigned to one
cluster are similar to each other (as in every cluster analysis), but also objects of nearby
clusters are expected to be more similar than objects in more distant clusters. Usually,
two-dimensional grids of squares or hexagons are used (cf. Fig. 3).
The network structure of a self-organizing map has two layers (see Fig. 3). The
neurons in the input layer correspond to the input dimensions, here the words of the
document vector. The output layer (map) contains as many neurons as clusters needed.
All neurons in the input layer are connected with all neurons in the output layer. The
weights of the connection between input and output layer of the neural network encode
positions in the high-dimensional data space (similar to the cluster prototypes in k-
means). Thus, every unit in the output layer represents a cluster center. Before the
learning phase of the network, the two-dimensional structure of the output units is fixed
and the weights are initialized randomly. During learning, the sample vectors (defining the documents) are repeatedly propagated through the network. The weights of the most similar prototype \vec{w}_s (winner neuron) are modified such that the prototype moves toward the input vector \vec{w}_i, which is defined by the currently considered document d, i.e. \vec{w}_i := \vec{t}_d (competitive learning). As similarity measure usually the Euclidean distance is used; however, for text documents the scalar product (see Eq. 3) can be applied. The weights \vec{w}_s of the winner neuron are modified according to the following equation:

\vec{w}_s' = \vec{w}_s + \sigma \cdot (\vec{w}_i - \vec{w}_s),

where σ is a learning rate.
To preserve the neighborhood relations, prototypes that are close to the winner neuron in the two-dimensional structure are also moved in the same direction. The weight change decreases with the distance from the winner neuron. Therefore, the adaption method is extended by a neighborhood function v (see also Fig. 3):

\vec{w}_j' = \vec{w}_j + \sigma \cdot v(j, s) \cdot (\vec{w}_i - \vec{w}_j),

where σ is again the learning rate and v(j, s) decreases with the grid distance of neuron j from the winner neuron s. By this learning procedure, the structure in the high-dimensional sample data is non-linearly projected to the lower-dimensional topology.
After learning, arbitrary vectors (i.e. vectors from the sample set or prior ‘unknown’
vectors) can be propagated through the network and are mapped to the output units.
Figure 3: Network architecture of self-organizing maps (left) and possible neighborhood function v for increasing distances from s (right)
For further details on self-organizing maps see [Koh84]. Examples for the application
of SOMs for text mining can be found in [LMS91, HKLK96, KKL+ 00, Nür01, RC01]
and in Sect. 3.4.2.
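A compact NumPy sketch of the training loop described above follows; the grid size, learning rate and the Gaussian neighborhood width are illustrative choices.

```python
# Self-organizing map training sketch: winner search by Euclidean distance,
# neighborhood-weighted update of all prototypes toward the input vector.
import numpy as np

def train_som(data, grid_w=4, grid_h=4, epochs=20, sigma=0.5, width=1.0):
    rng = np.random.default_rng(0)
    weights = rng.random((grid_w * grid_h, data.shape[1]))  # random init
    coords = np.array([(x, y) for x in range(grid_w) for y in range(grid_h)])
    for _ in range(epochs):
        for w_i in data:                                    # propagate sample
            s = np.argmin(np.linalg.norm(weights - w_i, axis=1))   # winner
            grid_dist = np.linalg.norm(coords - coords[s], axis=1)
            v = np.exp(-grid_dist**2 / (2 * width**2))      # neighborhood fn
            weights += sigma * v[:, None] * (w_i - weights) # move toward input
    return weights
```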
Fuzzy Clustering While most classical clustering algorithms assign each datum to
exactly one cluster, thus forming a crisp partition of the given data, fuzzy clustering allows for degrees of membership with which a datum belongs to different clusters [Bez81].
These approaches are frequently more stable. Applications to text are described in, e.g.,
[MS01b, BN04].
The Utility of Clustering We have described the most important types of clustering approaches, but we had to leave out many others. Obviously there are many ways to define clusters, and because of this we cannot expect to obtain something like the 'true' clustering. Still, clustering can be insightful. In contrast to classification, which relies on a prespecified grouping, cluster procedures label documents in a new way. By studying the words and phrases that characterize a cluster, for example, a company could gain new insights about its customers and their typical properties. A comparison of some clustering methods is given in [SKK00].
get the tag "O". This is done for each type of entity of interest. For the example above we have, for instance, the person-words "by (O) John (B) J. (I) Donner (I) Jr. (I) the (O)". Hence we have a sequential classification problem for the labels of each word, with the surrounding words as input feature vector. A frequent way of forming the feature vector is a binary encoding scheme. Each feature component can be considered as a test that asserts whether a certain pattern occurs at a specific position or not. For example, a feature component takes the value 1 if the previous word is the word "John" and 0 otherwise. Of course we may not only test the presence of specific words but also whether the word starts with a capital letter, has a specific suffix or is a specific part-of-speech. In this way results of previous analysis may be used.
Now we may employ any efficient classification method to classify the word labels
using the input feature vector. A good candidate is the Support Vector Machine because
of its ability to handle large sparse feature vectors efficiently. [TC02] used it to extract
entities in the molecular biology domain.
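The following sketch illustrates the binary feature encoding described above for a window around the current word; the concrete feature names are hypothetical.

```python
# Pattern tests for a word window, as input features for a label classifier.
def word_features(tokens, i, window=2):
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            w = tokens[j]
            feats[f"word[{offset}]={w.lower()}"] = 1        # word identity
            feats[f"cap[{offset}]"] = int(w[0].isupper())   # capital letter?
            feats[f"suffix[{offset}]={w[-3:].lower()}"] = 1 # 3-char suffix
    return feats

tokens = "by John J. Donner Jr. the".split()
print(word_features(tokens, 1))  # features for "John" (label B)
```

Feature dictionaries of this kind can then be passed to any classifier that copes with high-dimensional sparse input, such as the Support Vector Machine mentioned above.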
We consider the simple case that the words t = (t_1, t_2, \ldots, t_n) and the corresponding labels L_1, L_2, \ldots, L_n have a chain structure and that L_c depends only on the preceding and succeeding labels L_{c-1} and L_{c+1}. Then the conditional distribution p(L|t) has the form
p(L|t) = \frac{1}{\mathrm{const}} \exp\left( \sum_{j=1}^{n} \sum_{r=1}^{k_j} \lambda_{jr} f_{jr}(L_j, t) + \sum_{j=1}^{n-1} \sum_{r=1}^{m_j} \mu_{jr} g_{jr}(L_j, L_{j-1}, t) \right)   (20)
where f_{jr}(L_j, t) and g_{jr}(L_j, L_{j-1}, t) are different feature functions related to L_j and the pair L_j, L_{j-1} respectively. CRF models encompass hidden Markov models,
but they are much more expressive because they allow arbitrary dependencies in the
observation sequence and more complex neighborhood structures of labels. As for
most machine learning algorithms a training sample of words and the correct labels is
required. In addition to the identity of words, arbitrary properties of the words, like part-of-speech tags, capitalization, prefixes and suffixes, etc., may be used, sometimes leading to more than a million features. The unknown parameter values λ_{jr} and μ_{jr}
are usually estimated using conjugate gradient optimization routines [McC03].
McCallum [McC03] applies CRFs with feature selection to named entity recog-
nition and reports the following F1-measures for the CoNLL corpus: person names
93%, location names 92%, organization names 84%, miscellaneous names 80%. CRFs
also have been successfully applied to noun phrase identification [McC03], part-of-
speech tagging [LMP01], shallow parsing [SP03], and biological entity recognition
[KOT+ 04].
and the hardware. In the following we give a brief overview of visualization methods
that have been realized for text mining and information retrieval systems.
maps can be used to arrange documents based on their similarity. This approach opens up several appealing navigation possibilities. Most importantly, the surrounding grid cells of documents known to be interesting can be scanned for further similar documents. Furthermore, the distribution of keyword search results can be visualized by coloring the grid cells of the map with respect to the number of hits. This allows a user to judge, e.g., whether the search results are assigned to a small number of (neighboring) grid cells of the map, or whether the search hits are spread widely over the map and thus the search was most likely too unspecific.
A first application of self-organizing maps in information retrieval was presented
in [LMS91]. It provided a simple two-dimensional cluster representation (categoriza-
tion) of a small document collection. A refined model, the WEBSOM approach, ex-
tended this idea to a web based interface applied to newsgroup data that provides simple
zooming techniques and coloring methods [HKLK96, Hon97, KKL+ 00]. Further ex-
tensions introduced hierarchies [Mer98], supported the visualization of search results
[RC01] and combined search, navigation and visualization techniques in an integrated
tool [Nür01]. A screenshot of the prototype discussed in [Nür01] is depicted in Fig. 4.
SPIRE [WTP+ 95] applies a three-step approach: It first clusters documents in document space, then projects the discovered cluster centers onto a two-dimensional surface and finally maps the documents relative to the projected cluster centers. SPIRE offers a scatter-plot-like projection as well as a three-dimensional visualization. The visualization tool SCI-Map [Sma99] applies an iterative clustering approach to create a network using, e.g., references of scientific publications. The tool visualizes the structure by a map hierarchy with an increasing level of detail.
One major problem of most existing visualization approaches is that they create their output only by use of data-inherent information, i.e. the distribution of the documents in document space. User-specific information cannot be integrated in order to obtain, e.g., an improved separation of the documents with respect to user-defined criteria like keywords or phrases. Furthermore, the possibilities for a user to interact with the system in order to navigate or search are usually very limited, e.g., to boolean keyword searches and simple result lists.
4 Applications
In this section we briefly discuss successful applications of text mining methods in areas as diverse as patent analysis, text classification in news agencies, bioinformatics and spam filtering. Each of the applications has specific characteristics that had to be considered when selecting appropriate text mining methods.
to support companies and also the European patent office in their work. The challenges in patent analysis consist of the length of the documents, which are longer than the documents usually used in text classification, and the large number of available documents in a corpus [KSB01]. Usually every document contains about 5,000 words on average. More than 140,000 documents have to be handled by the European patent office (EPO) per year. They are processed by 2,500 patent examiners in three locations.
In several studies the classification quality of state-of-the-art methods was analyzed. [KSB01] reported very good results, with a 3% error rate for 16,000 full-text documents to be classified into 16 classes (mono-classification), and a 6% error rate in the same setting for abstracts only, using the Winnow [Lit88] and Rocchio [Roc71] algorithms. These results are possible due to the large amount of available training documents. Good results are also reported in [KZ02] for an internal EPO text classification application, with a precision of 81% and a recall of 78%.
Text clustering techniques for patent analysis are often applied to support the analysis of patents in large companies by structuring and visualizing the investigated corpus. Thus, these methods have found their way into a number of commercial products but are still also of interest for research, since there is still a need for improved performance. Companies like IBM offer products to support the analysis of patent text documents. Dörre describes in [DGS99] the IBM Intelligent Miner for Text in a scenario applied to patent text, and also compares it to data mining and text mining. Coupet [CH98] not only applies clustering but also provides some instructive visualizations. A similar scenario on the basis of SOMs is given in [LAHF03].
The Deutsche Presse-Agentur now routinely uses a text mining system in its news production workflow.
4.3 Bioinformatics
Bio-entity recognition aims to identify and classify technical terms in the domain of
molecular biology that correspond to instances of concepts that are of interest to biolo-
gists. Examples of such entities include the names of proteins, genes and their locations
of activity such as cells or organism names. Entity recognition is becoming increas-
ingly important with the massive increase in reported results due to high throughput
experimental methods. It can be used in several higher level information access tasks
such as relation extraction, summarization and question answering.
Recently the GENIA corpus was provided as a benchmark data set to compare
different entity extraction approaches [KOT+ 04]. It contains 2000 abstracts from the
MEDLINE database which were hand annotated with 36 types of biological entities.
The following sentence is an example: “We have shown that <protein> interleukin-
1 </protein> (<protein> IL-1 </protein>) and <protein> IL-2 </protein> control
<DNA> IL-2 receptor alpha (IL-2R alpha) gene </DNA> transcription in <cell line>
CD4-CD8- murine T lymphocyte precursors </cell line>”.
In the 2004 evaluation four types of extraction models were used: Support Vec-
tor Machines (SVMs), Hidden Markov Models (HMMs), Conditional Random Fields
(CRFs) and the related Maximum Entropy Markov Models (MEMMs). Varying types
of input features were employed: lexical features (words), n-grams, orthographic in-
formation, word lists, part-of-speech tags, noun phrase tags, etc. The evaluation shows
that the best five systems yield an F1-value of about 70% [KOT+ 04]. They use SVMs
in combination with Markov models (72.6%), MEMMs (70.1%), CRFs (69.8%), CRFs
together with SVMs (66.3%), and HMMs (64.8%). For practical applications the cur-
rent accuracy levels are not yet satisfactory and research currently aims at including
a sophisticated mix of external resources such as keyword lists and ontologies which
provide terminological resources.
96.5% and a recall of 89.3%. They conclude that these good results may be improved
by careful preprocessing and the extension of filtering to different languages.
5 Conclusion
In this article, we tried to give a brief introduction to the broad field of text mining. To this end, we motivated this field of research, gave a more formal definition of the terms used herein and presented a brief overview of currently available text mining methods, their properties and their application to specific problems. Even though it was impossible to describe all algorithms and applications in detail within the (size) limits of an article, we think that the ideas discussed and the provided references should give the interested reader a rough overview of this field and several starting points for further studies.
References
[Abn91] S. P. Abney. Parsing by chunks. In R. C. Berwick, S. P. Abney, and
C. Tenny, editors, Principle-Based Parsing: Computation and Psycholin-
guistics, pages 257–278. Kluwer Academic Publishers, Boston, 1991.
[All02] J. Allan, editor. Topic Detection and Tracking. Kluwer Academic Pub-
lishers, Norwell, MA, 2002.
[CH98] P. Coupet and M. Hehenberger. Text mining applied to patent analysis. In Annual Meeting of American Intellectual Property Law Association (AIPLA), Arlington, 1998.
[Chi97] N. Chinchor. MUC-7 named entity task definition version 3.5. Technical report, NIST, ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997.
[CHY96] M.-S. Chen, J. Han, and P. S. Yu. Data mining: an overview from a
database perspective. IEEE Transaction on Knowledge and Data Engi-
neering, 8(6):866–883, 1996.
[cri99] Cross industry standard process for data mining. http://www.crisp-dm.org/, 1999.
[CS96] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory
and results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthu-
rusamy, editors, Advances in Knowledge Discovery and Data Mining,
pages 153–180. AAAI/MIT Press, 1996.
[DDFL90] S. Deerwester, S.T. Dumais, G.W. Furnas, and T.K. Landauer. Indexing
by latent semantic analysis. Journal of the American Society for Informa-
tion Sciences, 41:391–407, 1990.
[DGS99] J. Dörre, P. Gerstl, and R. Seiffert. Text mining: finding nuggets in moun-
tains of textual data. In Proc. 5th ACM Int. Conf. on Knowledge Discovery
and Data Mining (KDD-99), pages 398–401, San Diego, US, 1999. ACM
Press, New York, US.
[FFKS99] K. L. Fox, O. Frieder, M. M. Knepper, and E. J. Snowberg. Sentinel: A
multiple engine information retrieval and visualization system. Journal
of the American Society of Information Science, 50(7):616–625, 1999.
[Fic97] N. Fickel. Clusteranalyse mit gemischt-skalierten merkmalen: Ab-
strahierung vom skalenniveau. Allg. Statistisches Archiv, 81(3):249–265,
1997.
[For65] E. Forgy. Cluster analysis of multivariate data: Efficiency versus inter-
pretability of classification. Biometrics, 21(3):768–769, 1965.
[FPSS96] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Knowledge discovery
and data mining: Towards a unifying framework. In Knowledge Discov-
ery and Data Mining, pages 82–88, 1996.
[Gai03] R. Gaizauskas. An information extraction perspective on text mining: Tasks, technologies and prototype applications. http://www.itri.bton.ac.uk/projects/euromap/TextMiningEvent/Rob_Gaizauskas.pdf, 2003.
[GG92] A. Gersho and R. M. Gray. Vector quantization and signal compression.
Kluwer Academic Publishers, 1992.
[Goo65] I. J. Good. The Estimation of Probabilities: An Essay on Modern
Bayesian Methods. MIT Press, Cambridge, MA, 1965.
[Gre98] W. R. Greiff. A theory of term weighting based on exploratory data anal-
ysis. In 21st Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, New York, NY, 1998. ACM.
[Har75] J. Hartigan. Clustering Algorithms. John Wiley and Sons, New York,
1975.
[Hea99] M. Hearst. Untangling text data mining. In Proc. of ACL’99 the 37th
Annual Meeting of the Association for Computational Linguistics, 1999.
[HHP+ 01] S. Havre, E. Hetzler, K. Perrine, E. Jurrus, and N. Miller. Interactive visualization of multiple query results. In Proc. of IEEE Symposium on Information Visualization 2001, pages 105–112. IEEE, 2001.
[Hid02] J. M. G. Hidalgo. Tutorial on text mining and internet content filtering. Tutorial Notes Online: http://ecmlpkdd.cs.helsinki.fi/pdf/hidalgo.pdf, 2002.
[HK97] M. A. Hearst and C. Karadi. Cat-a-Cone: An interactive interface for specifying searches and viewing retrieval results using a large category hierarchy. In Proc. of the 20th Annual Int. ACM SIGIR Conference, pages 246–255. ACM, 1997.
[HKLK96] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. Newsgroup exploration with the WEBSOM method and browsing interface. Technical report, Helsinki University of Technology, Neural Networks Research Center, Espoo, Finland, 1996.
[HKW94] M. Hemmje, C. Kunkel, and A. Willett. Lyberworld - a visualization user
interface supporting fulltext retrieval. In Proc. of ACM SIGIR 94, pages
254–259. ACM, 1994.
[Hof01] T. Hofmann. Unsupervised learning by probabilistic latent semantic anal-
ysis. Machine Learning Journal, 41(1):177–196, 2001.
[Hon97] T. Honkela. Self-Organizing Maps in Natural Language Processing. PhD
thesis, Helsinki Univ. of Technology, Neural Networks Research Center,
Espoo, Finland, 1997.
[HSS03] A. Hotho, S. Staab, and G. Stumme. Ontologies improve text document
clustering. In Proc. IEEE Int. Conf. on Data Mining (ICDM 03), pages
541–544, 2003.
[HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning. Springer, 2001.
[Joa98] T. Joachims. Text categorization with support vector machines: Learning
with many relevant features. In C. Nedellec and C. Rouveirol, editors,
European Conf. on Machine Learning (ECML), 1998.
[Kei02] D. A. Keim. Information visualization and visual data mining. IEEE
Transactions on Visualization and Computer Graphics, 7(2):100–107,
2002.
[KJ03] V. Kumar and M. Joshi. What is data mining? http://www-users.cs.umn.edu/~mjoshi/hpdmtut/sld004.htm, 2003.
[KKL+ 00] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paattero, and
A. Saarela. Self organization of a massive document collection. IEEE
Transactions on Neural Networks, 11(3):574–585, May 2000.
[Kod99] Y. Kodratoff. Knowledge discovery in texts: A definition and applica-
tions. Lecture Notes in Computer Science, 1609:16–29, 1999.
[Koh82] T. Kohonen. Self-organized formation of topologically correct feature
maps. Biological Cybernetics, 43:59–69, 1982.
[Koh84] T. Kohonen. Self-Organization and Associative Memory. Springer-
Verlag, Berlin, 1984.
[KOT+ 04] J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier. Introduction to
the bio-entity task at jnlpba. In N. Collier, P. Ruch, and A. Nazarenko,
editors, Proc. Workshop on Natural Language Processing in Biomedicine
and its Applications, pages 70–76, 2004.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding groups in data: an introduction
to cluster analysis. Wiley, New York, 1990.
[MAP+ 04] E. Michelakis, I. Androutsopoulos, G. Paliouras, G. Sakkis, and P. Stam-
atopoulos. Filtron: A learning-based anti-spam filter. In Proc. 1st Conf.
on Email and Anti-Spam (CEAS 2004), Mountain View, CA, USA, 2004.
[McC03] A. McCallum. Efficiently inducing features of conditional random fields. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI), 2003.
[Mer98] D. Merkl. Text classification with self-organizing maps: Some lessons
learned. Neurocomputing, 21:61–77, 1998.
[Mit97] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[MM99] I. Mani and M. T. Maybury, editors. Advances in Automatic Text Summa-
rization. MIT Press, 1999.
[MS01a] C. D. Manning and H. Schütze. Foundations of Statistical Natural Lan-
guage Processing. MIT Press, Cambridge, MA, 2001.
[MS01b] M. E. Mendes and Lionel Sacks. Dynamic knowledge representation
for e-learning applications. In Proc. of BISC International Workshop on
Fuzzy Logic and the Internet (FLINT 2001), pages 176–181, Berkeley,
USA, 2001. ERL, College of Engineering, University of California.
[NM02] U. Nahm and R. Mooney. Text mining with information extraction. In
Proceedings of the AAAI 2002 Spring Symposium on Mining Answers
from Texts and Knowledge Bases, 2002.
[Rab89] L. R. Rabiner. A tutorial on hidden markov models and selected applica-
tions in speech recognition. Proc. of IEEE, 77(2):257–286, 1989.
[SS00] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for
text categorization. Machine Learning, 39(2/3):135–168, 2000.
[SWY75] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic
indexing. Communications of the ACM, 18(11):613–620, 1975. (see also
TR74-218, Cornell University, NY, USA).
[TC02] K. Takeuchi and N. Collier. Use of support vector machines in extended
named entity recognition. In 6th Conf. on Natural Language Learning
(CoNLL-02), pages 119–125, 2002.
[TMS05] Text mining summit conference brochure. http://www.textminingnews.com/, 2005.
[UF01] U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
[van86] C. J. van Rijsbergen. A non-classical logic for information retrieval. The
Computer Journal, 29(6):481–485, 1986.