Art Organization
Abstract
The Self-Organizing Map (SOM) forms a nonlinear projection from a
high-dimensional data manifold onto a low-dimensional grid. A repre-
sentative model of some subset of data is associated with each grid point.
The SOM algorithm computes an optimal collection of models that ap-
proximates the data in the sense of some error criterion and also takes
into account the similarity relations of the models. The models then be-
come ordered on the grid according to their similarity. When the SOM
is used for the exploration of statistical data, the data vectors can be
approximated by models of the same dimensionality. When mapping
documents, one can represent them statistically by their word frequency
histograms or some reduced representations of the histograms that can
be regarded as data vectors. We have made SOMs of collections of over
one million documents. Each document is mapped onto some grid point,
with a link from this point to the document database. The documents
are ordered on the grid according to their contents and neighboring doc-
uments can be browsed readily. Keywords or key texts can be used to
search for the most relevant documents first. New effective coding and
computing schemes of the mapping are described.
1 Introduction
Visual overviews of large data sets can be produced by various clustering [1]
or projection [2] methods. The Self-Organizing Map (SOM) [3] forms the pro-
jection of a high-dimensional data distribution onto a two-dimensional regular
grid, whereby also the cluster structure of the data is preserved.
A representative model of some subset of observations is associated with
each grid point. The SOM algorithm computes the optimal collection of models
that approximates an arbitrary distribution of input observations in the sense
of some overall error criterion. This criterion also involves the spatial ordering
of the models: the most similar models shall be found at adjacent grid points,
and the more dissimilar ones shall be located farther away from each other on
the grid. In this sense the SOM is a similarity graph of data.
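As an illustration of the basic computation, the following Python sketch implements one common incremental form of the SOM update; the Gaussian neighborhood function, the linear decay schedules, and all function names are illustrative assumptions, not the exact procedures used in the experiments reported here.

import numpy as np

def train_som(data, grid_h, grid_w, n_iter=10000, lr0=0.5, sigma0=None, seed=0):
    """Incremental SOM training; data is an (n_samples, dim) array."""
    rng = np.random.default_rng(seed)
    n_nodes = grid_h * grid_w
    sigma0 = sigma0 if sigma0 is not None else max(grid_h, grid_w) / 2.0
    # One model vector m_i per grid point, initialized from random samples.
    models = data[rng.integers(0, len(data), n_nodes)].astype(float)
    # Fixed two-dimensional grid coordinates for the neighborhood function.
    coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)], dtype=float)
    for t in range(n_iter):
        x = data[rng.integers(0, len(data))]
        # Winner search: the grid point whose model is closest to x.
        winner = np.argmin(((models - x) ** 2).sum(axis=1))
        # Learning rate and neighborhood radius shrink during learning.
        frac = 1.0 - t / n_iter
        lr = lr0 * frac
        sigma = sigma0 * frac + 1e-3
        # Gaussian neighborhood on the grid, centered at the winner.
        d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))
        # Move each model toward x, in proportion to its neighborhood weight.
        models += lr * h[:, None] * (x - models)
    return models.reshape(grid_h, grid_w, data.shape[1])

Because every update also moves the models of neighboring grid points, the trained models become spatially ordered on the grid according to their similarity.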
The grid may be made to act as a groundwork for various kinds of illustrative
displays. For instance, one can use shades of gray [4] on the groundwork to
indicate the clustering tendency (e.g., vectorial distances of the neighboring
model vectors), or the values of any component of all the model vectors can be
displayed separately to study their contribution to the cluster structure [5].
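A distance-based gray-shade display of the kind cited above can be computed as in the following sketch, which shades each grid point by the mean distance between its model vector and the model vectors of its grid neighbors; the four-neighbor grid and the use of the mean are assumptions made only for illustration.

import numpy as np

def neighbor_distance_shades(models):
    """models: (H, W, dim) array of SOM model vectors.
    Returns an (H, W) array of mean distances between each model and the
    models of its (up to four) grid neighbors; large values, drawn as dark
    shades, mark borders between clusters."""
    H, W, _ = models.shape
    shades = np.zeros((H, W))
    for r in range(H):
        for c in range(W):
            dists = [np.linalg.norm(models[r, c] - models[rr, cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < H and 0 <= cc < W]
            shades[r, c] = np.mean(dists)
    return shades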
In the vast majority of SOM applications, the input data constitute high-
dimensional real feature vectors x ∈ ℝ^n, and the model vectors m_i ∈ ℝ^n are
then approximations of the x in a somewhat similar sense as the codebook
vectors in classical vector quantization are. However, the models need not
necessarily be replicas of the input vectors: they may be, e.g., parametric representations of operators that generate sequences of data [6]. On the other hand, there also exist means to approximate nonvectorial data; e.g., sets of symbol strings can be approximated by "average strings" [7].
In the SOMs that form similarity graphs of documents, the models can still
be taken as real vectors that describe collections of words in the documents.
The models can simply be weighted histograms of the words, but usually some
dimensionality reduction of the histograms is carried out, as we shall see next.
3.1.1 Preprocessing
From the raw text, nontextual and otherwise nonrelevant information (punc-
tuation marks, articles and other stopwords, message headers, URLs, email
addresses, signatures, images, and numbers) was removed. The most common
words, and words occurring rarely (e.g., fewer than 50 times in the corpus) were
also discarded. Each remaining word was represented by a unique random
vector of dimensionality 90.
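The assignment of such random codes can be sketched as follows; the unit-normalized Gaussian vectors and the function name are illustrative choices, not necessarily those used in our experiments.

import numpy as np

def make_word_codes(vocabulary, dim=90, seed=0):
    """Give every remaining word a unique random real vector of the chosen
    dimensionality (here unit-length Gaussian codes, an illustrative choice)."""
    rng = np.random.default_rng(seed)
    codes = rng.standard_normal((len(vocabulary), dim))
    codes /= np.linalg.norm(codes, axis=1, keepdims=True)
    return {word: codes[i] for i, word in enumerate(vocabulary)}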
For a language like Finnish that has plenty of inflections, we have used a stemmer. In our experiments we have so far regarded the various English word forms as different "words" in the vocabulary, but a stemmer could be used for English, too.
3.1.2 Formation of statistical models
To reduce the dimensionality of the models, we have used both randomly pro-
jected word category histograms and randomly projected word histograms,
weighted by the Shannon entropy or "inverse document frequency."
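As a sketch of this step, the following Python function forms the weighted word histogram of one document and projects it to a lower-dimensional space with a fixed random matrix; the data structures and names are illustrative, and the weights (Shannon entropy or inverse document frequency) are taken as given.

import numpy as np

def document_vector(word_counts, word_index, word_weight, projection):
    """word_counts : dict, word -> number of occurrences in one document
    word_index  : dict, word -> row of the full histogram
    word_weight : dict, word -> entropy or idf weight (assumed given)
    projection  : (vocab_size, reduced_dim) fixed random matrix
    Returns the randomly projected, weighted word histogram of the document."""
    hist = np.zeros(projection.shape[0])
    for word, count in word_counts.items():
        if word in word_index:
            hist[word_index[word]] = count * word_weight[word]
    reduced = hist @ projection          # random projection to reduced_dim
    norm = np.linalg.norm(reduced)
    return reduced / norm if norm > 0 else reduced

The projection matrix is drawn once, e.g., with unit-variance Gaussian entries, and the same matrix is then applied to every document of the collection.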
3.1.3 Formation of the document map
The document maps were formed automatically by the SOM algorithm, for
which the statistical models of documents were used as input. The size of the
SOM was determined so that on the average 10 to 15 articles were mapped
onto each grid point; this figure was mainly determined for the convenience of
browsing.
The speed of computation, especially of large SOMs, can be increased by several methods: for instance, the winner search can be accelerated by starting the search in the neighborhood of the corresponding winners at the last cycle of iteration ([3], Sec. 3.15.1), and the size (number of grid nodes) of the SOM can be increased stepwise during learning, using an estimation procedure ([3], Sec. 3.15.2).
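The first of these shortcuts can be sketched as follows; the square search window and its radius are illustrative assumptions.

import numpy as np

def shortcut_winner(models, coords, x, prev_winner, radius=3):
    """Search for the best-matching model of input x only within `radius`
    grid steps of the winner found for the same input on the previous
    iteration cycle. models: (n_nodes, dim); coords: (n_nodes, 2)."""
    near = np.where(np.abs(coords - coords[prev_winner]).max(axis=1) <= radius)[0]
    d = ((models[near] - x) ** 2).sum(axis=1)
    return near[np.argmin(d)]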
3.1.4 User interface
The document map was presented as a series of HTML pages that enable exploration of the grid points: when a grid point is clicked with the mouse, links to the document database allow the contents of the articles to be read. Depending on the size of the grid, subsets of it can first be viewed by zooming. Usually we
use two zooming levels for bigger maps before reading the documents.
There is also an automatic method for assigning descriptive signposts to
map regions; in deeper zooming, more signs appear. The signposts are words
that appear often in the articles in that map region and rarely elsewhere.
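One simple way to realize such a labeling criterion is sketched below: words are scored by the ratio of their relative frequency within the documents of a map region to their relative frequency in the whole collection. The exact scoring rule and the function names are illustrative assumptions.

from collections import Counter

def signposts(region_docs, all_docs, n_labels=3):
    """region_docs, all_docs: lists of token lists (all_docs is assumed to
    contain the region's documents). Words are scored by their relative
    frequency inside the region divided by their relative frequency in the
    whole collection; the highest-scoring words become the signposts."""
    region = Counter(w for doc in region_docs for w in doc)
    total = Counter(w for doc in all_docs for w in doc)
    n_region, n_total = sum(region.values()), sum(total.values())
    score = {w: (region[w] / n_region) / (total[w] / n_total) for w in region}
    return sorted(score, key=score.get, reverse=True)[:n_labels]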
4 An example
The biggest document map we have made up to this writing consists of 104,040
grid points. Each model is 315-dimensional, and has been made by projecting a
word category map with 13,432 grid points randomly onto the 315-dimensional
space. The text material was taken from 80 very different Usenet newsgroups and consisted of 1,124,134 documents with an average length of 218 words. The size of the finally accepted vocabulary was 63,773 words. The words were
weighted by the Shannon entropy computed from the distribution of the words
into 80 classes (newsgroups). It took about 1 month to process the two SOMs
without our newest speedup methods; searching occurs in nearly real time.
The accuracy of classifying a document into one of the 80 groups was about
80 per cent.
Fig. 1 exemplifies a case of content-addressable search. The document map
has been depicted in the background, and the shades of gray correspond to
document clusters. The 20 grid points, the models of which matched best with
the short query, are visible as a small black heap on the left-hand side of the
document map. Using a browser, the documents mapped to grid points of the
document map can be read out from the HTML page. Two title pages are
shown.
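In such a search, the query text is encoded in the same way as the documents (cf. Sec. 3.1.2), and the grid points whose models best match the query vector are highlighted. A minimal sketch of the matching step, with illustrative names, is:

import numpy as np

def best_matching_nodes(models, query_vector, n_best=20):
    """models: (n_nodes, dim) model vectors of the document map;
    query_vector: the query text encoded like a document (same dim).
    Returns the indices of the n_best grid points whose models are closest
    to the query; the documents linked to these points are then listed."""
    d = ((models - query_vector) ** 2).sum(axis=1)
    return np.argsort(d)[:n_best]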
Actually there is only one article in Fig. 1 that deals with NN chess. How-
ever, the other computer chess documents were so similar that they were re-
turned, too. About one fourth of the retrieved documents obviously do not deal with chess.
5 Conclusions
It has transpired in our experiments that the encoding of documents for their statistical identification can be performed much more effectively than believed
a few years ago [9]. In particular, the various random-projection methods are
as accurate in practice as the ideal theoretical vector space method, but much
faster to compute than the eigenvalue methods (e.g., LSI) that have been used
extensively to solve the problem of large dimensionality.
The content-addressable search must obviously be implemented differently when complete new "documents" are used as key information vs. when only a few keywords are used. To this end one must first identify the users' needs,
e.g., whether background information to a given article is wanted, or whether
the method is used as a kind of keyword-directed search engine.
Finally it ought to be emphasized that the order that ensues in the WEBSOM may not represent any taxonomy of the articles and does not serve as a basis for any automatic indexing of the documents; the similarity relationships better serve "finding" than "searching for" relevant information.
References
[1] Jain AK, Dubes RC. Algorithms for clustering data. Prentice Hall, Englewood
Cliffs, NJ, 1988
[2] Kruskal JB, Wish M. Multidimensional scaling. Sage University Paper series on
Quantitative Applications in the Social Sciences, no 07-011. Sage Publications,
Newbury Park, CA, 1978
[3] Kohonen T. Self-organizing maps. Series in Information Sciences, vol 30,
Springer-Verlag, Heidelberg, 1995; second ed 1997; Japanese ed 1996, Springer-
Verlag, Tokyo
[4] Ultsch A. Self-organizing networks for visualization and classification. In: Opitz
O, Lausen B, Klar R (eds) Information and classification. Springer-Verlag,
Berlin, 1993, pp 307-313
[5] Goser K, Hilleringmann U, Rueckert U, Schumacher K. VLSI technologies for
artificial neural networks. IEEE Micro 1989; 9:28-44
[6] Lampinen J, Oja E. Self-organizing maps for spatial and temporal AR models.
In: Pietikainen M, Roning J (eds) Proc 6SCIA, Scand Conf on Image Analysis.
Suomen Hahmontunnistustutkimuksen Seura ry, Helsinki, 1989, pp 120-127
[7] Kohonen T. Self-organizing maps of symbol strings. Report A42. Helsinki Uni-
versity of Technology, Laboratory of Computer and Information Science, Espoo,
Finland, 1996
[8] Salton G, McGill MJ. Introduction to modern information retrieval. McGraw-
Hill, New York, 1983
[9] Deerwester S, Dumais S, Furnas G, Landauer K. Indexing by latent semantic
analysis. J Am Soc Inform Sci, 1990; 41:391-407
[10] Kaski S. Data exploration using self-organizing maps. Acta Polytechnica Scan-
dinavica, Mathematics, Computing and Management in Engineering Series No
82, 1997. Dr Tech Thesis, Helsinki University of Technology, Finland
[11] Kaski S. Dimensionality reduction by random mapping. In: Proc of IJCNN'98,
Int Joint Conf on Neural Networks. IEEE Press, Piscataway, NJ, 1998, pp 413-
418
[12] Ritter H, Kohonen T. Self-organizing semantic maps. Biol Cyb, 1989; 61:241-254
[13] Kohonen T. Content-addressable memories. Springer-Verlag, Heidelberg, 1980;
second ed 1987