

Self-Organization of Very Large Document Collections: State of the Art
Teuvo Kohonen
Helsinki University of Technology, Neural Networks Research Centre
P.O. Box 2200, FIN-02015 HUT, Finland
Email: teuvo.kohonen@hut.fi

Abstract
The Self-Organizing Map (SOM) forms a nonlinear projection from a
high-dimensional data manifold onto a low-dimensional grid. A repre-
sentative model of some subset of data is associated with each grid point.
The SOM algorithm computes an optimal collection of models that ap-
proximates the data in the sense of some error criterion and also takes
into account the similarity relations of the models. The models then be-
come ordered on the grid according to their similarity. When the SOM
is used for the exploration of statistical data, the data vectors can be
approximated by models of the same dimensionality. When mapping
documents, one can represent them statistically by their word frequency
histograms or some reduced representations of the histograms that can
be regarded as data vectors. We have made SOMs of collections of over
one million documents. Each document is mapped onto some grid point,
with a link from this point to the document database. The documents
are ordered on the grid according to their contents and neighboring doc-
uments can be browsed readily. Keywords or key texts can be used to
search for the most relevant documents first. New effective coding and
computing schemes of the mapping are described.

1 Introduction
Visual overviews of large data sets can be produced by various clustering [1]
or projection [2] methods. The Self-Organizing Map (SOM) [3] forms the pro-
jection of a high-dimensional data distribution onto a two-dimensional regular
grid, whereby also the cluster structure of the data is preserved.
A representative model of some subset of observations is associated with
each grid point. The SOM algorithm computes the optimal collection of models
that approximates an arbitrary distribution of input observations in the sense
of some overall error criterion. This criterion also involves the spatial ordering
of the models: the most similar models shall be found at adjacent grid points,
and the more dissimilar ones shall be located farther away from each other on
the grid. In this sense the SOM is a similarity graph of data.
The grid may be made to act as a groundwork for various kinds of illustrative
displays. For instance, one can use shades of gray [4] on the groundwork to
indicate the clustering tendency (e.g., vectorial distances of the neighboring
model vectors), or the values of any component of all the model vectors can be
displayed separately to study their contribution to the cluster structure [5].
In the vast majority of SOM applications, the input data constitute high-
dimensional real feature vectors x ∈ R^n, and the model vectors mi ∈ R^n are
then approximations of the x in a somewhat similar sense as the codebook
vectors in classical vector quantization are. However, the models need not
necessarily be replicas of the input vectors: they may be, e.g., parametric repre-
sentations of operators that generate sequences of data [6]. On the other hand,
there exist means to approximate also nonvectorial data, e.g., sets of symbol
strings can be approximated by "average strings" [7].
In the SOMs that form similarity graphs of documents, the models can still
be taken as real vectors that describe collections of words in the documents.
The models can simply be weighted histograms of the words, but usually some
dimensionality reduction of the histograms is carried out, as we shall see next.

2 Statistical models of documents


2.1 The primitive vector space model
In the basic vector space model [8] the stored documents are represented as real
vectors in which each component corresponds to the frequency of occurrence
of a particular word in the document: the model or document vector can be
viewed as a weighted word histogram. For the weighting of a word according
to its importance one can use the Shannon entropy over document classes, or
the inverse of the number of documents in which the word occurs ("inverse
document frequency"). The main problem of the vector space model is the
large vocabulary in any sizable collection of free-text documents, which means
a vast dimensionality of the model vectors.
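As a minimal sketch of this model (in Python; the function and variable names are illustrative and not part of any system described here), a weighted word histogram over a fixed vocabulary could be computed as follows:

    from collections import Counter

    def idf_weights(documents, vocabulary):
        # weight of a word = inverse of the number of documents in which it occurs
        weights = {}
        for word in vocabulary:
            df = sum(1 for doc in documents if word in doc)
            weights[word] = 1.0 / df if df else 0.0
        return weights

    def document_vector(doc_words, vocabulary, weights):
        # weighted word histogram: occurrence counts scaled by the word weights
        counts = Counter(doc_words)
        return [counts[w] * weights[w] for w in vocabulary]

With tens of thousands of vocabulary words, each such vector is correspondingly long, which is the dimensionality problem mentioned above.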

2.2 Latent semantic indexing (LSI)


In an attempt to reduce the dimensionality of the document vectors, one often
first forms a matrix in which each column corresponds to the word histogram
of a document, and there is one column for each document. After that the
factors of the space spanned by the column vectors are computed by a method
called the singular-value decomposition (SVD), and the factors that have the
least influence on the matrix are omitted. The document vector formed of the
histogram of the remaining factors has then a much smaller dimensionality.
This method is called latent semantic indexing (LSI) [9].
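A compact sketch of this reduction, assuming the term-by-document matrix fits in memory as a dense numpy array (the truncation level k is an assumption of the example, not a value from the text):

    import numpy as np

    def lsi_reduce(term_doc_matrix, k):
        # term_doc_matrix: one row per vocabulary word, one column per document
        U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
        # keep only the k strongest factors; each document becomes a k-dimensional vector
        return (np.diag(s[:k]) @ Vt[:k, :]).T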

2.3 Randomly projected histograms


It has been shown experimentally that the dimensionality of the document vec-
tors can be reduced radically by a random projection method [10],[11] without
essentially losing the power of discrimination between the documents. Consider
the original document vector (weighted histogram) ni ∈ R^n and a rectangu-
lar random matrix R, the elements in each column of which are assumed to
be normally distributed. Let us form the document vectors as the projections
xi ∈ R^m, where m ≪ n:
xi = R ni .    (1)
It has transpired in our experiments that if m is at least of the order of 100,
the similarity relations between arbitrary pairs of projection vectors (xi ; xj ) are
very good approximations of the corresponding relations between the original
document vectors (ni ; nj ), and the computing load of the projections is rea-
sonable; on the other hand, with the radically decreased dimensionality of the
document vectors, the time needed to classify a document is radically decreased.
In our recent experiments we have always selected m = 315 (to compare our
results with earlier experiments in which this dimensionality was used).
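A sketch of the projection of eq. (1), with a normally distributed R whose columns are normalized (as in the experiments of Table 1 below); the dimensionalities, the seed, and the function name are illustrative:

    import numpy as np

    def random_project(doc_histograms, m=315, seed=0):
        # doc_histograms: array of shape (num_docs, n); returns (num_docs, m)
        n = doc_histograms.shape[1]
        rng = np.random.default_rng(seed)
        R = rng.standard_normal((m, n))
        R /= np.linalg.norm(R, axis=0)   # normalize the columns of R
        return doc_histograms @ R.T      # row i is the projection xi = R ni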

2.4 Histograms on the word category map


In the "self-organizing semantic map" method [12] the words of free natural
text are clustered onto neighboring grid points of a special SOM. Synonyms and
closely related words such as those with opposite meanings and those forming
a closed set of attribute values are often mapped onto the same grid point.
In this sense this clustering scheme is even more effective than the thesaurus
method in which sets of synonyms are found manually.
The input to the "self-organizing semantic map" usually consists of adjacent
words in the text taken over a moving window. Let a word in the vocabulary
be indexed by k and represented by a unique random vector rk . Let us then
scan all occurrences of word (k) in the text in the positions j (k), and construct
for word (k) its "average context vector"

    xk = [ E{rj(k)-1} ;  ε rj(k) ;  E{rj(k)+1} ] ,    (2)
where E means the average over all j(k), rj(k) is the random vector represent-
ing word (k) in position j = j(k) of the text, and ε is a scaling (balancing)
parameter. Notice that this expression has to be computed only once for each
different word, because the rj(k) for all the j = j(k) are identical.
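A small sketch of how the average context vectors of eq. (2) might be accumulated from a text given as a sequence of vocabulary words (the value of the balancing parameter and the names below are only placeholders):

    import numpy as np

    def average_context_vectors(text, word_vectors, eps=0.2):
        # text: list of words; word_vectors[w]: the unique random vector of word w
        dim = len(next(iter(word_vectors.values())))
        left, right, count = {}, {}, {}
        for j in range(1, len(text) - 1):
            w = text[j]
            left[w] = left.get(w, np.zeros(dim)) + word_vectors[text[j - 1]]
            right[w] = right.get(w, np.zeros(dim)) + word_vectors[text[j + 1]]
            count[w] = count.get(w, 0) + 1
        return {w: np.concatenate([left[w] / count[w],
                                   eps * word_vectors[w],
                                   right[w] / count[w]])
                for w in count}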
In making the "semantic SOM" or the word category map, all the xk from
all the documents are input iteratively a sufficient number of times. After that
each grid point is labeled by all those words (k) whose xk are mapped
to that point. In this way the grid points usually get multiple labels.
In forming the "word category histogram" for a document, the words of the
document are scanned and counted at those grid points of the SOM that were
labeled by that word. In counting, the words can be weighted by the Shannon
entropy or the inverse of the number of documents in the text corpus in which
this word had occurred (= "inverse document frequency").
The "word category histograms" can be computed reasonably fast, much
faster than, e.g., the LSI.
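A sketch of forming such a histogram, assuming a mapping from each vocabulary word to the index of the grid point that carries its label (all names are illustrative):

    import numpy as np

    def word_category_histogram(doc_words, word_to_point, weights, num_points):
        # count each word of the document at the grid point labeled by that word,
        # scaled by its entropy or inverse-document-frequency weight
        h = np.zeros(num_points)
        for w in doc_words:
            if w in word_to_point:
                h[word_to_point[w]] += weights.get(w, 1.0)
        return h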
2.5 Randomly projected word category histograms
In a great number of experiments performed by us it has transpired that if
the histograms on the word category maps are used as models, the ability
of our method to discriminate between the documents is reduced if the grid
points in the word category map contain more than, say, ten words on the
average: specific information contained in the words is then lost. We have been
interested in very large document collections that may contain, say, hundreds
of thousands of unique words, and even after discarding very rare words, the
remaining vocabulary consisted of tens of thousands of words. In order to
keep the number of words on each point of the word category map at the
tolerable level, the word category map therefore had to be reasonably large, for
example 13,432 grid points in some of our latest experiments. We then again
projected the histograms of this dimensionality randomly to form 315-dimensional
statistical document vectors.
The combination of word categorization and random projection guarantees
a certain degree of invariance with respect to the choice of, e.g., synonyms, while
a high degree of discrimination between documents can still be maintained, for
similar reasons as in the random projection method.

2.6 Construction of the random projections by pointers


Now I would like to report a new idea for speeding up the computation of the
document vectors. At the time of writing it is being programmed into the next
large demonstration.
2.6.1 Preliminary tests
It is advisable to read Sec. 3 before returning to this point.
Before a detailed description of the total system I have to present some ex-
perimental results that motivate the idea discussed in this section. Table 1
compares a few projection methods in which the model vectors, except in the
first case, were always 315-dimensional.
As the material in this experiment we used 18,540 English documents (dis-
cussions etc.) from 20 Usenet newsgroups of the Internet. When the text was
preprocessed as will be explained in Sec. 3.1.1, the remaining vocabulary con-
sisted of 5,789 words or word forms. The documents, represented by different
kinds of document vectors, were classified in the following way. When the doc-
ument map discussed more closely in Sec. 3 was formed, each document was
mapped onto one of its grid points. These points were then classi ed according
to the majority of newsgroup names in them. All documents that represented
a minority group at any grid point were counted as classification errors.
One has to take into account that many newsgroups have almost identical
topics although their names are different. However, misclassifications due to
this reason were simply counted as errors. Often the discussions are also so
diffuse that they do not identify the group. Therefore the "accuracies" reported
here seem much more pessimistic than they really are, and one must regard the
given figures as relative ones, meant for comparison of the different methods
only.
The classification accuracy of 68.0 per cent reported on the first row of
Table 1 refers to a classification that was carried out with the vector-space
model with full 5789-dimensional histograms as document vectors. In practice,
this kind of classi cation would be orders of magnitude too slow.
Random projection (with matrix R) of the original document vectors onto a
315-dimensional space, with normally distributed matrix elements and normal-
ized columns of R yielded, within the statistical accuracy of computation, the
same figures as the basic vector space method. This is reported on the second
row. The figures are averages from seven statistically independent tests, as in
the rest of the cases.
The remaining rows have the following meaning: third row, the matrix
elements of R were thresholded to +1 or -1; fourth row, exactly 5 randomly
distributed ones were generated in each column of R, and the other elements
were zeroes; fifth row, the number of ones was 3; and sixth row, the number of
ones was 2, respectively.
It can be concluded that a sparse binary projection matrix is almost as good
in practice as the normally distributed R, which again was about as good as
the vector space model. However, now we can apply a fast computing method.
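Such a sparse binary matrix, with exactly a given number of ones in each column as in the last rows of Table 1, might be generated as in the following sketch (function name and seed are illustrative):

    import numpy as np

    def sparse_binary_projection(m, n, ones_per_column, seed=0):
        # m x n matrix with exactly `ones_per_column` randomly placed ones per column
        rng = np.random.default_rng(seed)
        R = np.zeros((m, n))
        for j in range(n):
            rows = rng.choice(m, size=ones_per_column, replace=False)
            R[rows, j] = 1.0
        return R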

Table 1: Classification accuracies of documents, in per cent, with different
projection matrices R. The figures are averages from seven test runs with
different random elements of R.

                                   Accuracy   Standard deviation due to
                                              different randomization of R
    Vector space model               68.0         -
    Normally distributed R           68.0         0.2
    Thresholding to +1 or -1         67.9         0.2
    5 ones in each column            67.8         0.3
    3 ones in each column            67.4         0.2
    2 ones in each column            67.3         0.2

2.6.2 Fast computation of the projected histograms


The matrix product x = Rn in eq. (1) (where we drop the document index),
with a sparse matrix R, can be computed very fast. Consider first the following
trivial-looking piece of code:

    # accumulate x = R n for a sparse binary R, given as the list
    # of index pairs (i, j) for which R[i][j] == 1
    x = [0.0] * m
    for (i, j) in ones_of_R:
        x[i] += n[j]

This scheme is supposed to give us the idea that if we reserve a memory


array for x = (x1, x2, ..., xm) that acts like an accumulator, another array for
n = (n1, n2, ..., nn), and permanent address pointers from all the locations nj
to all the locations xi for which the matrix element Rij of R is equal to one,
we can accumulate the values of xi very fast by following the pointers. If R is
very sparse, this scheme works very fast.
After the above introduction it may be easier to understand the version of
the method that was actually used. Now we do not project ready histograms,
but the pointers are already used with each word in the text in the construction
of the low-dimensional document vectors.
Assume that we have precomputed for each word in the vocabulary its
weight (entropy or \inverse document frequency"). The vocabulary and its
weights reside in a table, the entries of which are found by hash coding (for
a textbook account, cf., e.g., [13] or [3]). The hash addresses are formed on
the basis of the ASCII codes of the words. At each hash address or its spare
location, corresponding to a word entry we store a small number of, say, three
random pointers to the elements of the x array.
When scanning the text, the hash address for each word is formed, and if
the word resides in the hash table, those elements of the x array that are found
by the (say, three) address pointers are incremented by the weight value of that
word.
The weighted, randomly projected word histogram obtained in the above
way may be normalized (optionally).
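The following sketch illustrates this pointer scheme, with an ordinary Python dictionary standing in for the hash table of the text (three pointers per word, as discussed above; everything else is illustrative):

    import numpy as np

    def build_pointer_table(vocabulary, weights, m, pointers_per_word=3, seed=0):
        # for each word, store its weight and a few random pointers into the x array
        rng = np.random.default_rng(seed)
        return {w: (weights[w], rng.choice(m, size=pointers_per_word, replace=False))
                for w in vocabulary}

    def project_document(words, table, m):
        # scan the text; for each word found in the table, add its weight
        # at the positions of the x array given by its pointers
        x = np.zeros(m)
        for w in words:
            if w in table:
                weight, pointers = table[w]
                x[pointers] += weight
        return x   # optionally normalize, e.g. x / np.linalg.norm(x)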
The computing time needed to form the histograms in the above case is
about 20 per cent of that of the matrix-product method. We have good reasons
to assume that at least the same speedup holds for larger maps, too.

3 Construction of the document map


Our original document-organization system named the WEBSOM
(http://websom.hut.fi/websom/) used word-category histograms as statistical models
of the documents. Certain reasons, among them the accuracy of classification,
have recently led us to prefer the straightforward random projection (or its
shortcut computation by the pointers) in forming the statistical models of the
documents. We have carried out numerous experiments with maps of very
different sizes, but the following comparable figures in Table 2 are given here
for the same document collection used earlier. In these experiments the word
category map had 1598 grid points, and the dimension of the projected model
was 270.
It must also be taken into account that with the word category map method
we have to deal with an extra self-organizing process, whereas forming the
random projection is a straightforward computation.

3.1 The WEBSOM method


Our method is a collection of programs that can be combined in different ways.
A brief overview of the computing phases is given in the following.
Table 2: Classification accuracies with similar material as in Table 1

                                      Matrix product   Pointer method
                                                       (3 pointers/column)
    Random projection                     68.0             67.5
    Randomly projected word
      category histogram                  66.0             67.0

3.1.1 Preprocessing
From the raw text, nontextual and otherwise nonrelevant information (punc-
tuation marks, articles and other stopwords, message headers, URLs, email
addresses, signatures, images, and numbers) was removed. The most common
words, and words occurring rarely (e.g., fewer than 50 times in the corpus) were
also discarded. Each remaining word was represented by a unique random
vector of dimensionality 90.
For a language like Finnish that has plenty of inflections, we have used
a stemmer. In our experiments we have so far regarded the various English
word forms as different "words" in the vocabulary, but a stemmer could be used
for English, too.
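A rough sketch of the preprocessing stage (the stopword removal, the 50-occurrence threshold, and the 90-dimensional random vectors follow the description above; the regular expression and the function names are illustrative assumptions):

    import re
    import numpy as np
    from collections import Counter

    def preprocess(raw_texts, stopwords, min_count=50, dim=90, seed=0):
        # keep only letter sequences, drop stopwords, discard rare words, and
        # assign each remaining word a unique random vector of dimensionality 90
        docs = [[w for w in re.findall(r"[a-z]+", t.lower()) if w not in stopwords]
                for t in raw_texts]
        counts = Counter(w for doc in docs for w in doc)
        rng = np.random.default_rng(seed)
        word_vectors = {w: rng.standard_normal(dim)
                        for w, c in counts.items() if c >= min_count}
        docs = [[w for w in doc if w in word_vectors] for doc in docs]
        return docs, word_vectors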
3.1.2 Formation of statistical models
To reduce the dimensionality of the models, we have used both randomly pro-
jected word category histograms and randomly projected word histograms,
weighted by the Shannon entropy or "inverse document frequency."
3.1.3 Formation of the document map
The document maps were formed automatically by the SOM algorithm, for
which the statistical models of documents were used as input. The size of the
SOM was determined so that on the average 10 to 15 articles were mapped
onto each grid point; this figure was mainly determined for the convenience of
browsing.
The speed of computation, especially of large SOMs, can be increased by
several methods: for instance, the winner search can be accelerated by starting
the search in the neighborhood of corresponding winners at the last cycle of
iteration ([3], Sec. 3.15.1), and increasing the size (number of grid nodes)
stepwise during learning using an estimation procedure ([3], Sec. 3.15.2).
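For completeness, a very small sketch of the basic on-line SOM update that such a computation builds on (learning-rate and neighborhood schedules here are illustrative and do not reproduce the accelerated methods cited above):

    import numpy as np

    def train_som(data, grid_h, grid_w, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        models = rng.standard_normal((grid_h * grid_w, data.shape[1]))  # model vectors
        coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)])
        for t in range(epochs):
            alpha = 0.5 * (1 - t / epochs)                    # learning rate
            sigma = max(1.0, grid_w / 2 * (1 - t / epochs))   # neighborhood width
            for x in rng.permutation(data):
                winner = np.argmin(np.linalg.norm(models - x, axis=1))
                d2 = np.sum((coords - coords[winner]) ** 2, axis=1)
                h = np.exp(-d2 / (2 * sigma ** 2))            # neighborhood function
                models += alpha * h[:, None] * (x - models)
        return models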
3.1.4 User interface
The document map was presented as a series of HTML pages that enable ex-
ploration of the grid points: when a grid point is clicked with the mouse, links to the
document database enable reading the contents of the articles. Depending on
the size of the grid, subsets of it can first be viewed by zooming. Usually we
use two zooming levels for bigger maps before reading the documents.
There is also an automatic method for assigning descriptive signposts to
map regions; in deeper zooming, more signs appear. The signposts are words
that appear often in the articles in that map region and rarely elsewhere.

3.1.5 Content-addressable search


The HTML page can be provided with a form field into which the user can type
a query of their own in the form of a short "document." This query is preprocessed
and a document vector (histogram) is formed in the same way as for the stored
documents. This histogram is then compared with the "models" of all grid
points, and a specified number of best-matching points are marked with a
round symbol whose diameter is larger the better the match is.
These symbols provide good starting points for browsing.
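A sketch of the matching step; Euclidean distance between the query vector and the model vectors is assumed here, since the text does not name the similarity measure:

    import numpy as np

    def best_matching_points(query_vector, models, k=20):
        # models: array of shape (num_grid_points, dim); returns the indices of
        # the k grid points whose models lie closest to the query vector
        distances = np.linalg.norm(models - query_vector, axis=1)
        return np.argsort(distances)[:k]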
A problem, however, may be encountered if the user wants to use a single
keyword or a few keywords only as a "key document." Such queries make very
bad "histograms." In this case it is more advisable to provide two different modes
of use of the WEBSOM: the user must then specify whether a document-type
or keyword-type query is to be used. In the former case the operation is
as described before; in the latter case one has to index each word of the
vocabulary by pointers to those documents where these words occur, and use
a rather conventional indexed search to find the matches.
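The keyword mode could be served by a conventional inverted index, roughly as in the following sketch (illustrative only):

    from collections import defaultdict

    def build_inverted_index(docs):
        # map each vocabulary word to the set of documents in which it occurs
        index = defaultdict(set)
        for doc_id, words in enumerate(docs):
            for w in words:
                index[w].add(doc_id)
        return index

    def keyword_search(keywords, index):
        # documents containing all of the given keywords
        sets = [index.get(w, set()) for w in keywords]
        return set.intersection(*sets) if sets else set()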

4 An example
The biggest document map we have made up to this writing consists of 104,040
grid points. Each model is 315-dimensional, and has been made by projecting a
word category map with 13,432 grid points randomly onto the 315-dimensional
space. The text material was taken from 80 very different Usenet newsgroups
and consisted of 1,124,134 documents with an average length of 218 words. The
size of the finally accepted vocabulary was 63,773 words. The words were
weighted by the Shannon entropy computed from the distribution of the words
into 80 classes (newsgroups). It took about 1 month to process the two SOMs
without our newest speedup methods; searching occurs in nearly real time.
The accuracy of classifying a document into one of the 80 groups was about
80 per cent.
Fig. 1 exemplifies a case of content-addressable search. The document map
has been depicted in the background, and the shades of gray correspond to
document clusters. The 20 grid points, the models of which matched best with
the short query, are visible as a small black heap on the left-hand side of the
document map. Using a browser, the documents mapped to grid points of the
document map can be read out from the HTML page. Two title pages are
shown.
Actually there is only one article in Fig. 1 that deals with NN chess. How-
ever, the other computer chess documents were so similar that they were re-
turned, too. About one fourth of the retrieved documents obviously do not deal
with chess.

Figure 1: Content-addressable search from a 1,124,134-document WEBSOM. The query was: "chess playing neural nets, NN chess player vs. human player".

5 Conclusions
It has transpired in our experiments that the encoding of documents for their
statistical identification can be performed much more effectively than believed
a few years ago [9]. In particular, the various random-projection methods are
as accurate in practice as the ideal theoretical vector space method, but much
faster to compute than the eigenvalue methods (e.g., LSI) that have been used
extensively to solve the problem of large dimensionality.
The content-addressable search must obviously be implemented differently
when complete new \documents" are used as key information vs. when only
a few keywords are used. To this end one must first identify the users' needs,
e.g., whether background information to a given article is wanted, or whether
the method is used as a kind of keyword-directed search engine.
Finally it ought to be emphasized that the order that ensues in the WEB-
SOM may not represent any taxonomy of the articles and does not serve as a
basis for any automatic indexing of the documents; the similarity relationships
better serve "finding" than "searching for" relevant information.

References
[1] Jain AK, Dubes RC. Algorithms for clustering data. Prentice Hall, Englewood
Cliffs, NJ, 1988
[2] Kruskal JB, Wish M. Multidimensional scaling. Sage University Paper series on
Quantitative Applications in the Social Sciences, no 07-011. Sage Publications,
Newbury Park, CA, 1978
[3] Kohonen T. Self-organizing maps. Series in Information Sciences, vol 30,
Springer-Verlag, Heidelberg, 1995; second ed 1997; Japanese ed 1996, Springer-
Verlag, Tokyo
[4] Ultsch A. Self-organizing networks for visualization and classification. In: Opitz
O, Lausen B, Klar R (eds) Information and classification. Springer-Verlag,
Berlin, 1993, pp 307-313
[5] Goser K, Hilleringmann U, Rueckert U, Schumacher K. VLSI technologies for
artificial neural networks. IEEE Micro 1989; 9:28-44
[6] Lampinen J, Oja E. Self-organizing maps for spatial and temporal AR models.
In: Pietikainen M, Roning J (eds) Proc 6SCIA, Scand Conf on Image Analysis.
Suomen Hahmontunnistustutkimuksen Seura ry, Helsinki, 1989, pp 120-127
[7] Kohonen T. Self-organizing maps of symbol strings. Report A42. Helsinki Uni-
versity of Technology, Laboratory of Computer and Information Science, Espoo,
Finland, 1996
[8] Salton G, McGill MJ. Introduction to modern information retrieval. McGraw-
Hill, New York, 1983
[9] Deerwester S, Dumais S, Furnas G, Landauer K. Indexing by latent semantic
analysis. J Am Soc Inform Sci, 1990; 41:391-407
[10] Kaski S. Data exploration using self-organizing maps. Acta Polytechnica Scan-
dinavica, Mathematics, Computing and Management in Engineering Series No
82, 1997. Dr Tech Thesis, Helsinki University of Technology, Finland
[11] Kaski S. Dimensionality reduction by random mapping. In: Proc of IJCNN'98,
Int Joint Conf on Neural Networks. IEEE Press, Piscataway, NJ, 1998, pp 413-
418
[12] Ritter H, Kohonen T. Self-organizing semantic maps. Biol Cyb, 1989; 61:241-254
[13] Kohonen T. Content-addressable memories. Springer-Verlag, Heidelberg, 1980;
second ed 1987
