Text Mining
Applications and Theory
Michael W. Berry
University of Tennessee, USA
Jacob Kogan
University of Maryland Baltimore County, USA
For details of our global editorial offices, for customer services and for information about how to apply
for permission to reuse the copyright material in this book please see our website at www.wiley.com.
The right of the author to be identified as the author of this work has been asserted in accordance with
the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise,
except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission
of the publisher.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print
may not be available in electronic books.
Designations used by companies to distinguish their products are often claimed as trademarks. All brand
names and product names used in this book are trade names, service marks, trademarks or registered
trademarks of their respective owners. The publisher is not associated with any product or vendor
mentioned in this book. This publication is designed to provide accurate and authoritative information
in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged
in rendering professional services. If professional advice or other expert assistance is required, the
services of a competent professional should be sought.
A catalogue record for this book is available from the British Library.
ISBN: 978-0-470-74982-1
List of Contributors
Loulwah AlSumait
Department of Information Science
Kuwait University, Kuwait.
[email protected]

Daniel Barbará
Department of Computer Science
George Mason University
Fairfax, VA, USA.
[email protected]

Michael W. Berry
University of Tennessee
Min H. Kao Department of Electrical Engineering and Computer Science
Knoxville, TN, USA.
[email protected]

Peter A. Chew
Sandia National Laboratories
Albuquerque, NM, USA.
[email protected]

Wendy Cowley
Pacific Northwest National Laboratory
Richland, WA, USA.
[email protected]

Nick Cramer
Pacific Northwest National Laboratory
Richland, WA, USA.
[email protected]

Lynne Edwards
Department of Media and Communication Studies
Ursinus College
Collegeville, PA, USA.
[email protected]

Dave Engel
Pacific Northwest National Laboratory
Richland, WA, USA.
[email protected]

Wilfried N. Gansterer
Research Lab Computational Technologies and Applications
University of Vienna, Austria.
[email protected]

Andreas G. K. Janecek
Research Lab Computational Technologies and Applications
University of Vienna, Austria.
[email protected]

Eric P. Jiang
University of San Diego
San Diego, CA, USA.
[email protected]

Amanda Leatherman
Department of Media and Communication Studies
Ursinus College
Collegeville, PA, USA.
[email protected]

Charles Nicholas
University of Maryland, Baltimore County
Baltimore, MD, USA.
[email protected]

Andrey A. Puretskiy
University of Tennessee
Min H. Kao Department of Electrical Engineering and Computer Science
Knoxville, TN, USA.
[email protected]

Stuart Rose
Pacific Northwest National Laboratory
Richland, WA, USA.
[email protected]

Wenyin Tang
Nanyang Technological University
Singapore.
[email protected]

Flora S. Tsai
Nanyang Technological University
Singapore.
[email protected]

Pu Wang
Department of Computer Science
George Mason University
Fairfax, VA
[email protected]

Paul Whitney
Pacific Northwest National Laboratory
Richland, WA
[email protected]
Preface
The proliferation of digital computing devices and their use in communication
continues to result in an increased demand for systems and algorithms capable of
mining textual data. Thus, the development of techniques for mining unstructured,
semi-structured, and fully structured textual data has become quite important in
both academia and industry. As a result, a one-day workshop on text mining was
held on May 2, 2009 in conjunction with the SIAM Ninth International Confer-
ence on Data Mining to bring together researchers from a variety of disciplines
to present their current approaches and results in text mining. The workshop sur-
veyed the emerging field of text mining, the application of techniques of machine
learning in conjunction with natural language processing, information extraction,
and algebraic/mathematical approaches to computational information retrieval.
Many issues are being addressed in this field, ranging from the development of
new document classification and clustering models to novel approaches for topic
detection, tracking, and visualization.
With over 40 applied mathematicians and computer scientists representing
universities, industrial corporations, and government laboratories from six dif-
ferent countries, the workshop featured both invited and contributed talks on
the use of techniques from machine learning, knowledge discovery, natural lan-
guage processing, and information retrieval to design computational models for
automated text analysis and mining. Most of the invited and contributed papers
presented at the workshop have been compiled and expanded for this volume.
Collectively, they span several major topic areas in text mining:
1. Keyword extraction
2. Classification and clustering
3. Anomaly and trend detection
4. Text streams.
This volume presents state-of-the-art algorithms for text mining from both
the academic and industrial perspectives. Each chapter is self-contained and
concludes with a list of references. A subject-level index is also provided at the
end of the volume. Familiarity with basic undergraduate-level mathematics is
needed for several of the chapters. The volume should be useful for a novice to
the field as well as for an expert in text mining research.
The inherent differences in the words written by authors and those used by
readers continue to fuel the development of effective search and retrieval algo-
rithms and software in the field of text mining. This volume demonstrates how
advancements in the fields of applied mathematics, computer science, machine
learning, and natural language processing can collectively capture, classify, and
interpret words and their contexts. The words alone are not enough.
1.1 Introduction
Keywords, which we define as sequences of one or more words, provide a
compact representation of a document’s content. Ideally, keywords represent in
condensed form the essential content of a document. Keywords are widely used
to define queries within information retrieval (IR) systems as they are easy to
define, revise, remember, and share. In comparison to mathematical signatures,
keywords are independent of any corpus and can be applied across multiple
corpora and IR systems.
Keywords have also been applied to improve the functionality of IR sys-
tems. Jones and Paynter (2002) describe Phrasier, a system that lists documents
related to a primary document’s keywords, and that supports the use of keyword
anchors as hyperlinks between documents, enabling a user to quickly access
related material. Gutwin et al. (1999) describe Keyphind, which uses keywords
from documents as the basic building block for an IR system. Keywords can also
be used to enrich the presentation of search results. Hulth (2004) describes Kee-
gle, a system that dynamically provides keyword extracts for web pages returned
from a Google search. Andrade and Valencia (1998) present a system that auto-
matically annotates protein function with keywords extracted from the scientific
literature that are associated with a given protein.
Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
2010, John Wiley & Sons, Ltd
1.1.1 Keyword extraction methods
Despite their utility for analysis, indexing, and retrieval, most documents do
not have assigned keywords. Most existing approaches focus on the manual
assignment of keywords by professional curators who may use a fixed taxonomy,
or rely on the authors’ judgment to provide a representative list. Research has
therefore focused on methods to automatically extract keywords from documents
as an aid either to suggest keywords for a professional indexer or to generate
summary features for documents that would otherwise be inaccessible.
Early approaches to automatically extract keywords focus on evaluating
corpus-oriented statistics of individual words. Jones (1972) and Salton et al.
(1975) describe positive results of selecting for an index vocabulary the
statistically discriminating words across a corpus. Later keyword extraction
research applies these metrics to select discriminating words as keywords for
individual documents. For example, Andrade and Valencia (1998) base their
approach on comparison of word frequency distributions within a text against
distributions from a reference corpus.
While some keywords are likely to be evaluated as statistically discriminating
within the corpus, keywords that occur in many documents within the corpus are
not likely to be selected as statistically discriminating. Corpus-oriented methods
also typically operate only on single words. This further limits the measurement of
statistically discriminating words because single words are often used in multiple
and different contexts.
To avoid these drawbacks, we focus our interest on methods of keyword
extraction that operate on individual documents. Such document-oriented
methods will extract the same keywords from a document regardless of the
current state of a corpus. Document-oriented methods therefore provide context-
independent document features, enabling additional analytic methods such as
those described in Engel et al. (2009) and Whitney et al. (2009) that characterize
changes within a text stream over time. These document-oriented methods are
suited to corpora that change, such as collections of published technical abstracts
that grow over time or streams of news articles. Furthermore, by operating on a
single document, these methods inherently scale to vast collections and can be
applied in many contexts to enrich IR systems and analysis tools.
Previous work on document-oriented methods of keyword extraction has used
natural language processing to identify part-of-speech (POS) tags, which are
then combined with supervised learning, machine-learning algorithms, or
statistical methods.
Hulth (2003) compares the effectiveness of three term selection approaches:
noun-phrase (NP) chunks, n-grams, and POS tags, with four discriminative fea-
tures of these terms as inputs for automatic keyword extraction using a supervised
machine-learning algorithm.
Mihalcea and Tarau (2004) describe a system that applies a series of syntactic
filters to identify POS tags that are used to select words to evaluate as key-
words. Co-occurrences of the selected words within a fixed-size sliding window
AUTOMATIC KEYWORD EXTRACTION 5
are accumulated within a word co-occurrence graph. A graph-based ranking
algorithm (TextRank) is applied to rank words based on their associations in
the graph, and then top ranking words are selected as keywords. Keywords that
are adjacent in the document are combined to form multi-word keywords. Mihal-
cea and Tarau (2004) report that TextRank achieves its best performance when
only nouns and adjectives are selected as potential keywords.
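A minimal sketch of the ranking step follows; all names are illustrative, and a faithful implementation would first apply the POS filter described above and would collapse adjacent top-ranked words into multi-word keywords:

```python
import re
from collections import defaultdict

def textrank_words(text, window=2, damping=0.85, iterations=50):
    """Rank words by a PageRank-style score over an undirected
    co-occurrence graph built with a fixed-size sliding window."""
    words = re.findall(r"[a-z]+", text.lower())
    # Build the undirected word co-occurrence graph.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    # Power iteration of the PageRank recurrence on the word graph.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        score = {w: (1 - damping) + damping *
                    sum(score[v] / len(neighbors[v]) for v in neighbors[w])
                 for w in neighbors}
    return sorted(score, key=score.get, reverse=True)

ranked = textrank_words("linear diophantine equations and linear constraints")
```

In this toy call the word linear, which co-occurs with the most distinct words, ranks first.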
Matsuo and Ishizuka (2004) apply a chi-square measure to calculate how
selectively words and phrases co-occur within the same sentences as a particular
subset of frequent terms in the document text. The chi-square measure is applied
to determine the bias of word co-occurrences in the document text which is
then used to rank words and phrases as keywords of the document. Matsuo and
Ishizuka (2004) state that the degree of biases is not reliable when term frequency
is small. The authors present an evaluation on full text articles and a working
example on a 27-page document, showing that their method operates effectively
on large documents.
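The core of this measure can be sketched as follows. This is a deliberately simplified sentence-level reading (the published method adds refinements such as clustering the frequent terms), and all names are illustrative:

```python
def chi_square_scores(sentences, frequent_terms):
    """Score each term w by how selectively it co-occurs, within
    sentences, with a subset G of frequent terms:
    chi2(w) = sum over g of (observed(w,g) - n_w * p_g)^2 / (n_w * p_g),
    where n_w is the number of sentences containing w and p_g the
    fraction of sentences containing g."""
    n_sentences = len(sentences)
    sentence_sets = [set(s) for s in sentences]
    # Number of sentences containing each term.
    df = {}
    for s in sentence_sets:
        for w in s:
            df[w] = df.get(w, 0) + 1
    p = {g: df[g] / n_sentences for g in frequent_terms}
    scores = {}
    for w in df:
        if w in frequent_terms:
            continue
        n_w = df[w]
        chi2 = 0.0
        for g in frequent_terms:
            observed = sum(1 for s in sentence_sets if w in s and g in s)
            expected = n_w * p[g]
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        scores[w] = chi2
    return scores

scores = chi_square_scores(
    [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["bird", "fish"]],
    {"cat"})
```

A large score flags a term whose co-occurrence with the frequent terms deviates strongly from chance in either direction.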
In the following sections, we describe Rapid Automatic Keyword Extrac-
tion (RAKE), an unsupervised, domain-independent, and language-independent
method for extracting keywords from individual documents. We provide details
of the algorithm and its configuration parameters, and present results on a bench-
mark dataset of technical abstracts, showing that RAKE is more computationally
efficient than TextRank while achieving higher precision and comparable recall
scores. We then describe a novel method for generating stoplists, which we use to
configure RAKE for specific domains and corpora. Finally, we apply RAKE to a
corpus of news articles and define metrics for evaluating the exclusivity, essential-
ity, and generality of extracted keywords, enabling a system to identify keywords
that are essential or general to documents in the absence of manual annotations.
Figure 1.1 A sample abstract from the Inspec test set and its manually assigned
keywords.
Words that carry meaning within a document are described as content bearing
and are often referred to as content words.
The input parameters for RAKE comprise a list of stop words (or stoplist), a
set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and
phrase delimiters to partition the document text into candidate keywords, which
are sequences of content words as they occur in the text. Co-occurrences of words
within these candidate keywords are meaningful and allow us to identify word co-
occurrence without the application of an arbitrarily sized sliding window. Word
associations are thus measured in a manner that automatically adapts to the style
and content of the text, enabling adaptive and fine-grained measurement of word
co-occurrences that will be used to score candidate keywords.
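A minimal sketch of this partitioning step follows; the stoplist here is a small illustrative subset, with punctuation characters acting as phrase delimiters and whitespace as the word delimiter:

```python
import re

def candidate_keywords(text, stop_words):
    """Split text into candidate keywords: maximal sequences of content
    words uninterrupted by stop words or phrase delimiters."""
    candidates = []
    # Split on phrase delimiters, then walk each fragment word by word.
    for fragment in re.split(r"[.,;:!?()\[\]\n]", text.lower()):
        phrase = []
        for word in fragment.split():
            if word in stop_words:
                # A stop word ends the current candidate keyword.
                if phrase:
                    candidates.append(" ".join(phrase))
                phrase = []
            else:
                phrase.append(word)
        if phrase:
            candidates.append(" ".join(phrase))
    return candidates

stops = {"of", "a", "the", "and", "in", "for", "is", "are", "over"}
cands = candidate_keywords("Compatibility of systems of linear constraints "
                           "over the set of natural numbers", stops)
# ['compatibility', 'systems', 'linear constraints', 'set', 'natural numbers']
```

Note how multi-word candidates such as linear constraints and natural numbers fall out of the partitioning without any sliding window.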
Figure 1.3 The word co-occurrence graph for content words in the sample
abstract.
Word            deg(w)   freq(w)   deg(w)/freq(w)
algorithms         3        2          1.5
bounds             2        1          2
compatibility      2        2          1
components         1        1          1
constraints        2        1          2
constructing       1        1          1
corresponding      2        1          2
criteria           2        2          1
diophantine        3        1          3
equations          3        1          3
generating         3        1          3
inequations        4        2          2
linear             5        2          2.5
minimal            8        3          2.7
natural            2        1          2
nonstrict          2        1          2
numbers            2        1          2
set                6        3          2
sets               3        1          3
solving            1        1          1
strict             2        1          2
supporting         3        1          3
system             1        1          1
systems            4        4          1
upper              2        1          2

Figure 1.4 Word scores calculated from the word co-occurrence graph.
minimal generating sets (8.7), linear diophantine equations (8.5), minimal supporting set
(7.7), minimal set (4.7), linear constraints (4.5), natural numbers (4), strict inequations (4),
nonstrict inequations (4), upper bounds (4), corresponding algorithms (3.5), set (2),
algorithms (1.5), compatibility (1), systems (1), criteria (1), system (1),
components (1), constructing (1), solving (1)
Figure 1.5 lists each candidate keyword from the sample abstract, using the
metric deg(w)/freq(w) to calculate individual word scores.
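The scoring step can be sketched as follows, taking candidate keywords produced by the partitioning step as input; the candidate list below is a subset of those from the sample abstract, chosen so that every candidate containing minimal or set is included:

```python
from collections import defaultdict

def score_candidates(candidates):
    """Score candidate keywords from word co-occurrences within them:
    freq(w) counts occurrences of w across candidates; deg(w) sums the
    lengths of the candidates containing w, i.e. w's co-occurrences
    (itself included). A candidate's score is the sum of its member
    words' deg(w)/freq(w) scores."""
    deg = defaultdict(int)
    freq = defaultdict(int)
    for cand in candidates:
        words = cand.split()
        for w in words:
            freq[w] += 1
            deg[w] += len(words)
    word_score = {w: deg[w] / freq[w] for w in freq}
    return {c: sum(word_score[w] for w in c.split()) for c in candidates}

scores = score_candidates(["minimal generating sets", "minimal supporting set",
                           "minimal set", "set"])
# deg(minimal) = 8 and freq(minimal) = 3, so 'minimal set' scores
# 8/3 + 2, i.e. about 4.7
```

The scores for minimal (2.7) and set (2) here agree with Figure 1.4, since all of their candidates are present in this subset.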
Because the extracted keyword natural numbers does not match the manually
assigned keyword set of natural numbers, for the purposes of the benchmark
evaluation it is considered a miss. There are therefore three false positives in the set of
extracted keywords, resulting in a precision of 67%. Comparing the six true
positives within the set of extracted keywords to the total of seven manually
assigned keywords results in a recall of 86%. Equally weighting precision and
recall generates an F-measure of 75%.
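The computation above can be reproduced directly (six true positives among nine extracted keywords, against seven manually assigned keywords):

```python
def evaluate(true_positives, num_extracted, num_assigned):
    """Precision, recall, and the balanced (equally weighted) F-measure."""
    precision = true_positives / num_extracted
    recall = true_positives / num_assigned
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = evaluate(true_positives=6, num_extracted=9, num_assigned=7)
# p ≈ 0.67, r ≈ 0.86, f = 0.75
```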
Method                              Extracted keywords    Correct keywords    Precision   Recall   F-measure
                                    Total      Mean       Total     Mean
TextRank
  Undirected, co-occ. window = 2     6784      13.6        2116      4.2        31.2      43.1     36.2
  Undirected, co-occ. window = 3     6715      13.4        1897      3.8        28.2      38.6     32.6
Hulth (2003)
  Ngram with tag                     7815      15.6        1973      3.9        25.2      51.7     33.9
  NP chunks with tag                 4788       9.6        1421      2.8        29.7      37.2     33.0
  Pattern with tag                   7012      14.0        1523      3.0        21.7      39.9     28.1
the, and, of, a, in, is, for, to, we, this, are, with, as, on, it, an, that, which, by, using, can,
paper, from, be, based, has, was, have, or, at, such, also, but, results, proposed, show,
new, these, used, however, our, were, when, one, not, two, study, present, its, sub, both,
then, been, they, all, presented, if, each, approach, where, may, some, more, use,
between, into, 1, under, while, over, many, through, addition, well, first, will, there,
propose, than, their, 2, most, sup, developed, particular, provides, including, other, how,
without, during, article, application, only, called, what, since, order, experimental, any
Results from Hulth (2003) and Mihalcea and Tarau (2004) are included for
comparison. The highest values for precision, recall, and F-measure are shown
in bold. As noted, perfect precision is not possible with any of the techniques,
as the manually assigned keywords do not always appear in the abstract text.
The highest precision and F-measure are achieved using
RAKE with a generated stoplist based on keyword adjacency, a subset of which
is listed in Figure 1.6. With this stoplist RAKE yields the best results in terms of
F-measure and precision, and provides comparable recall. With Fox’s stoplist,
RAKE achieves a high recall while experiencing a drop in precision.
Figure 1.7 Extraction time in milliseconds versus the number of vertices in the
word co-occurrence graph, for RAKE and TextRank.
Adjacency frequency reflects the number of times a word occurred adjacent to
an abstract’s keywords. Keyword frequency reflects the number of times the word
occurred within an abstract’s keywords.
Looking at the top 50 frequent words, in addition to the typical function
words, we can see that system, control, and method are highly frequent within
technical abstracts and highly frequent within the abstracts’ keywords. Selecting
solely by term frequency will therefore cause content-bearing words to be added
to the stoplist, particularly if the corpus of documents is focused on a particular
domain or topic. In those circumstances, selecting stop words by term frequency
presents a risk of removing important content-bearing words from analysis.
We therefore present the following method for automatically generating a
stoplist from a set of documents for which keywords are defined. The algorithm
is based on the intuition that words adjacent to, and not within, keywords are
less likely to be meaningful and therefore are good choices for stop words.
To generate our stoplist we identified for each abstract in the Inspec training
set the words occurring adjacent to words in the abstract’s uncontrolled key-
word list. The frequency of each word occurring adjacent to a keyword was
accumulated across the abstracts. Words that occurred more frequently within
keywords than adjacent to them were excluded from the stoplist.
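This procedure can be sketched as follows; the input format and names are illustrative, with each document pairing its token sequence with the set of words occurring in its assigned keywords:

```python
from collections import Counter

def generate_stoplist(documents):
    """documents: iterable of (tokens, keyword_words) pairs.
    Accumulate, across documents, how often each word occurs adjacent
    to a keyword word versus within the keywords themselves; words seen
    more often within keywords than adjacent to them are excluded."""
    adjacency = Counter()
    within = Counter()
    for tokens, keyword_words in documents:
        for i, w in enumerate(tokens):
            if w in keyword_words:
                within[w] += 1
            elif ((i > 0 and tokens[i - 1] in keyword_words) or
                  (i + 1 < len(tokens) and tokens[i + 1] in keyword_words)):
                adjacency[w] += 1
    return {w for w in adjacency if adjacency[w] >= within[w]}

stoplist = generate_stoplist(
    [(["the", "cat", "sat", "on", "the", "mat"], {"cat", "mat"})])
# {'the', 'sat'}
```

In this toy input, the and sat border the keyword words cat and mat without ever appearing inside a keyword, so both become stop words.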
Table 1.3 The 50 most frequent words in the Inspec training set listed in
descending order by term frequency.
Word        Term frequency   Document frequency   Adjacency frequency   Keyword frequency
the 8611 978 3492 3
of 5546 939 1546 68
and 3644 911 2104 23
a 3599 893 1451 2
to 3000 879 792 10
in 2656 837 1402 7
is 1974 757 1175 0
for 1912 767 951 9
that 1129 590 330 0
with 1065 577 535 3
are 1049 576 555 1
this 964 581 645 0
on 919 550 340 8
an 856 501 332 0
we 822 388 731 0
by 773 475 283 0
as 743 435 344 0
be 595 395 170 0
it 560 369 339 13
system 507 255 86 202
can 452 319 250 0
based 451 293 168 15
from 447 309 187 0
using 428 282 260 0
control 409 166 12 237
which 402 280 285 0
paper 398 339 196 1
systems 384 194 44 191
method 347 188 78 85
data 347 159 39 131
time 345 201 24 95
model 343 157 37 122
information 322 153 18 151
or 315 218 146 0
s 314 196 27 0
have 301 219 149 0
has 297 225 166 0
at 296 216 141 0
new 294 197 93 4
two 287 205 83 5
algorithm 267 123 36 96
results 262 221 129 14
used 262 204 92 0
was 254 125 161 0
these 252 200 93 0
also 251 219 139 0
such 249 198 140 0
problem 234 137 36 55
design 225 110 38 68
Table 1.5 Keywords extracted with word scores by deg(w) and deg(w)/freq(w).

                                                     Scored by deg(w)   Scored by deg(w)/freq(w)
Keyword                                              edf(w)   rdf(w)    edf(w)   rdf(w)
kyoto protocol legally obliged developed countries      2        2         2        2
eu leader urge russia to ratify kyoto protocol          2        2         2        2
kyoto protocol on climate change                        2        2         2        2
ratify kyoto protocol                                   2        2         2        2
kyoto protocol requires                                 2        2         2        2
1997 kyoto protocol                                     2        4         4        4
kyoto protocol                                         31       44         7       44
kyoto                                                  10       12         –        –
kyoto accord                                            3        3         –        –
kyoto pact                                              2        3         –        –
sign kyoto protocol                                     2        2         –        –
ratification of the kyoto protocol                      2        2         –        –
ratify the kyoto protocol                               2        2         –        –
kyoto agreement                                         2        2         –        –
We accumulate counts of how often each extracted keyword is referenced by documents in the
corpus. The referenced document frequency of a keyword, rdf(k), is the number of
documents in which the keyword occurred as a candidate keyword. The extracted
document frequency of a keyword, edf(k), is the number of documents from which
the keyword was extracted.
A keyword that is extracted from all of the documents in which it is referenced
can be characterized as exclusive or essential, whereas a keyword that is
referenced in many documents but extracted from only a few may be characterized
as general. Comparing the relationship of edf(k) and rdf(k) allows us to characterize
the exclusivity of a particular keyword. We therefore define keyword exclusivity
exc(k) as shown in Equation (1.1):
exc(k) = edf(k) / rdf(k). (1.1)
Of the 711 extracted keywords, 395 have an exclusivity score of 1, indicating
that they were extracted from every document in which they were referenced.
Within that set of 395 exclusive keywords, some occur in more documents than
others and can therefore be considered more essential to the corpus of documents.
In order to measure how essential a keyword is, we define the essentiality of a
keyword, ess(k), as shown in Equation (1.2):
ess(k) = exc(k) × edf(k). (1.2)
Figure 1.8 lists the top 50 essential keywords extracted from the MPQA cor-
pus, listed in descending order by their ess(k) scores. According to CERATOPS,
the MPQA corpus comprises 10 primary topics, listed in Table 1.6, which are
well represented by the 50 most essential keywords as extracted and ranked by
RAKE.
In addition to keywords that are essential to documents, we can also
characterize keywords by how general they are to the corpus.
united states (32), human rights (24), kyoto protocol (22), international space station (18),
mugabe (16), space station (14), human rights report (12), greenhouse gas emissions
(12), chavez (11), taiwan issue (11), president chavez (10), human rights violations (10),
president bush (10), palestinian people (10), prisoners of war (9), president hugo chavez
(9), kyoto (8), taiwan (8), israeli government (8), hugo chavez (8), climate change (8),
space (8), axis of evil (7), president fernando henrique cardoso (7), palestinian (7),
palestinian territories (6), taiwan strait (6), russian news agency interfax (6), prisoners (6),
taiwan relations act (6), president robert mugabe (6), presidential election (6), geneva
convention (5), palestinian authority (5), venezuelan president hugo chavez (5), chinese
president jiang zemin (5), opposition leader morgan tsvangirai (5), french news agency
afp (5), bush (5), north korea (5), camp x-ray (5), rights (5), election (5), mainland china
(5), al qaeda (5), president (4), south africa (4), global warming (4), bush administration
(4), mdc leader (4)
Figure 1.8 Top 50 essential keywords from the MPQA Corpus, with correspond-
ing ess(k) score in parentheses.
Table 1.6 MPQA Corpus topics and definitions.
Topic Description
argentina Economic collapse in Argentina
axisofevil Reaction to President Bush’s 2002 State of the Union Address
guantanamo US holding prisoners in Guantanamo Bay
humanrights Reaction to US State Department report on human rights
kyoto Ratification of Kyoto Protocol
mugabe 2002 Presidential election in Zimbabwe
settlements Israeli settlements in Gaza and West Bank
spacestation Space missions of various countries
taiwan Relations between Taiwan and China
venezuela Presidential coup in Venezuela
government (147), countries (141), people (125), world (105), report (91), war (85), united
states (79), china (71), president (69), iran (60), bush (56), japan (50), law (44), peace
(44), policy (43), officials (43), israel (41), zimbabwe (39), taliban (36), prisoners (35),
opposition (35), plan (35), president george (34), axis (34), administration (33), detainees
(32), treatment (32), states (30), european union (30), palestinians (30), election (29),
rights (28), international community (27), military (27), argentina (27), america (27),
guantanamo bay (26), official (26), weapons (24), source (24), eu (23), attacks (23),
united nations (22), middle east (22), bush administration (22), human rights (21), base
(20), minister (20), party (19), north korea (18)
Figure 1.9 Top 50 general keywords from the MPQA Corpus, with corresponding
gen(k) score in parentheses.
In other words, how often was a keyword referenced by documents from which it was not extracted?
In this case we define generality of a keyword, gen(k), as shown in Equation
(1.3):
gen(k) = rdf(k) × (1 − exc(k)). (1.3)
Figure 1.9 lists the top 50 general keywords extracted from the MPQA corpus,
listed in descending order by their gen(k) scores. It should be noted that general
keywords and essential keywords are not mutually exclusive. Within the top 50
for both metrics, there are several shared keywords: united states, president,
bush, prisoners, election, rights, bush administration, human rights, and north
korea. Keywords that are both highly essential and highly general are essential
to a set of documents within the corpus but also referenced by a significantly
greater number of documents within the corpus than other keywords.
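Given edf(k) and rdf(k) counts for each keyword, all three metrics can be computed in one pass. The exclusivity formula is Equation (1.1); the forms used below for essentiality and generality, exclusivity weighted by edf(k) and the non-exclusive share weighted by rdf(k), are assumptions and should be checked against Equations (1.2) and (1.3):

```python
def keyword_metrics(edf, rdf):
    """edf, rdf: dicts mapping each keyword to its extracted and
    referenced document frequencies across the corpus."""
    metrics = {}
    for k in edf:
        exc = edf[k] / rdf[k]           # Equation (1.1)
        metrics[k] = {
            "exc": exc,
            "ess": exc * edf[k],        # assumed form of essentiality
            "gen": rdf[k] * (1 - exc),  # assumed form of generality
        }
    return metrics

m = keyword_metrics({"kyoto protocol": 31}, {"kyoto protocol": 44})
# exc ≈ 0.70; under the assumed form, ess rounds to 22 (cf. Figure 1.8)
```

Using the Table 1.5 counts for kyoto protocol (edf = 31, rdf = 44), the assumed essentiality form yields a score of about 22, which matches the value reported for that keyword in Figure 1.8.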
1.6 Summary
We have shown that our automatic keyword extraction technology, RAKE,
achieves higher precision and similar recall in comparison to existing techniques.
In contrast to methods that depend on natural language processing techniques
to achieve their results, RAKE takes a simple set of input parameters and
automatically extracts keywords in a single pass, making it suitable for a wide
range of documents and collections.
Finally, RAKE’s simplicity and efficiency enable its use in many applications
where keywords can be leveraged. Given the variety and volume of existing
collections and the rate at which documents are created and collected, this
efficiency frees computing resources for other analytic methods.
1.7 Acknowledgements
This work was supported by the National Visualization and Analytics Center
(NVAC), which is sponsored by the US Department of Homeland Security
Program and located at the Pacific Northwest National Laboratory (PNNL), and
by Laboratory Directed Research and Development at PNNL. PNNL is managed
for the US Department of Energy by Battelle Memorial Institute under Contract
DE-AC05-76RL01830.
We also thank Anette Hulth for making available the dataset used in her
experiments.
References
Andrade M and Valencia A 1998 Automatic extraction of keywords from scientific
text: application to the knowledge domain of protein families. Bioinformatics 14(7),
600–607.
CERATOPS 2009 MPQA Corpus. http://www.cs.pitt.edu/mpqa/ceratops/corpora.html.
Engel D, Whitney P, Calapristi A and Brockman F 2009 Mining for emerging technolo-
gies within text streams and documents. Proceedings of the Ninth SIAM International
Conference on Data Mining. Society for Industrial and Applied Mathematics.
Fox C 1989 A stop list for general text. ACM SIGIR Forum, vol. 24, pp. 19–21. ACM,
New York, USA.
Gutwin C, Paynter G, Witten I, Nevill-Manning C and Frank E 1999 Improving browsing
in digital libraries with keyphrase indexes. Decision Support Systems 27(1–2), 81–104.
Hulth A 2003 Improved automatic keyword extraction given more linguistic knowledge.
Proceedings of the 2003 Conference on Empirical Methods in Natural Language
Processing, vol. 10, pp. 216–223. Association for Computational Linguistics, Morristown,
NJ, USA.
Hulth A 2004 Combining machine learning and natural language processing for automatic
keyword extraction. Stockholm University, Faculty of Social Sciences, Department of
Computer and Systems Sciences (together with KTH).
Jones K 1972 A statistical interpretation of term specificity and its application in retrieval.
Journal of Documentation 28(1), 11–21.
Jones S and Paynter G 2002 Automatic extraction of document keyphrases for use in
digital libraries: evaluation and applications. Journal of the American Society for Infor-
mation Science and Technology.
20 TEXT MINING
Matsuo Y and Ishizuka M 2004 Keyword extraction from a single document using word
co-occurrence statistical information. International Journal on Artificial Intelligence
Tools 13(1), 157–169.
Mihalcea R and Tarau P 2004 TextRank: Bringing order into texts. In Proceedings of
EMNLP 2004 (ed. Lin D and Wu D), pp. 404–411. Association for Computational
Linguistics, Barcelona, Spain.
Salton G, Wong A and Yang C 1975 A vector space model for automatic indexing.
Communications of the ACM 18(11), 613–620.
Whitney P, Engel D and Cramer N 2009 Mining for surprise events within text streams.
Proceedings of the Ninth SIAM International Conference on Data Mining, pp. 617–627.
Society for Industrial and Applied Mathematics.
2
2.1 Introduction
Pages on the World Wide Web have tremendous variation, covering a wide range
of topics and viewpoints. Some are news pages, others are blogs. Given the sheer
volume of documents on the Web, clustering these pages by topic would be a
challenging problem. But web pages could be in any language, which complicates
an already challenging text mining problem.
In a series of articles published largely in the computational linguistics lit-
erature, we have outlined a number of computational techniques for clustering
documents in a multilingual corpus. This chapter reviews these techniques, pro-
vides some additional insight into these techniques, and presents some recent
advances. Specifically, we present several recently developed algebraic models
for this problem that use matrix and tensor manipulations. These
methods can be applied not just to pairs of languages, but also to groups of
languages when a suitable multi-parallel corpus exists (Chew and Abdelali 2007).
In Sections 2.2 and 2.3, we review the problem and our experimental setup
for multilingual document clustering. Then, in Sections 2.4–2.9 we present our
various approaches and their results. Section 2.10 discusses our results and
summarizes our contribution.
2.2 Background
An early approach for dealing with documents in an information retrieval (IR)
setting was the vector space model (VSM) of Salton (Salton 1968; Salton and
McGill 1983). The principle behind the VSM is that a vector, with elements
representing individual terms, may encode a document’s meaning according to
the relative weights of these term elements. Then one may encode a corpus of
documents as a term-by-document matrix X of column vectors such that the rows
represent terms and the columns represent documents. Each element x_ij tabulates
the number of times term i occurs in document j. This matrix is sparse due to
the Zipfian distribution of terms in a language (Zipf 1935).
As a practical matter for better performance, the term counts in X often
are scaled. Many scaling approaches have been proposed, but the two most
popular, based on their widespread availability in software such as SAS, are
TFIDF (Term Frequency Inverse Document Frequency) and log-entropy scaling.
Other approaches have been considered by Chisholm and Kolda (1999). We
consider only the log-entropy scaling (see Equation (2.2)) in our approach here.
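For reference, the standard form of log-entropy weighting can be sketched as below; note that the chapter's Equation (2.2) is a variant with a tunable exponent α (used later with α = 1.8), whose exact placement we do not reproduce here:

```python
import numpy as np

def log_entropy(X):
    """Standard log-entropy weighting of a term-by-document count matrix X.
    Local weight: log2(1 + x_ij); global weight: 1 + sum_j p_ij log2(p_ij) / log2(n),
    where p_ij = x_ij / (row sum) and n is the number of documents."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    row_sums = X.sum(axis=1, keepdims=True)
    P = np.divide(X, row_sums, out=np.zeros_like(X), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(P > 0, P * np.log2(P), 0.0)
    g = 1.0 + plogp.sum(axis=1) / np.log2(n)   # entropy-based global weight in [0, 1]
    return g[:, None] * np.log2(1.0 + X)

W = log_entropy(np.array([[1.0, 1.0], [2.0, 0.0]]))
```

A term spread evenly across all documents gets a global weight near zero, while a term concentrated in few documents keeps a weight near one.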
In 1990, Deerwester et al. (1990) proposed analyzing term-by-document
matrices using the singular value decomposition (SVD) to organize terms and
documents into a common semantic space based upon term co-occurrence.
Because the approach claimed to organize the surface terms into their underlying
semantics, the approach became known as latent semantic analysis (LSA).
In LSA a singular value decomposition of the (scaled) term–document matrix
X is computed
X = USVT . (2.1)
Table 2.1 Lexical statistics of the translations of the Bible used for training.

Language (translation)         Unique terms    Total word count
English (King James)           12 335          789 744
French (Darby)                 20 428          812 947
Spanish (Reina Valera 1909)    28 456          704 004
Russian (Synodal 1876)         47 226          560 524
Arabic (Smith Van Dyke)        55 300          440 435
there are other languages with even starker differences. Payne (1997)
cites an illustrative example that comes from Yup’ik Eskimo, tuntussuqatarnik-
saitengqiggtuq, which means ‘he had not yet said again that he was going to
hunt reindeer’. This word is composed of many morphemes, as evidenced by
the fact that the English translation has multiple words. For example, the first
morpheme, tuntu, refers to reindeer. So if the concept were to change instead to
‘she was going to hunt reindeer’, then there would be a whole new unique word
starting with tuntu containing only some of the morphemes from the example
along with a different morpheme due to the change in gender of the subject.
Thus, it is easy to see why such a language would prove troublesome for VSMs.
Each word, which is packed with more meaning, is represented by a single direc-
tion in vector space instead of a collection of directions based on its constituent
morphemes.
Hence, these language differences provide a challenge to statistical techniques
that rely on co-occurrence patterns. Synthetic languages, which have more unique
terms representing more diverse concepts, will have fewer terms co-occurring
with other terms from an isolating language, making it more difficult to learn
relationships from co-occurrence patterns.
For our system, we do not consider traditional stemming or stoplists because
we want the most generalizable system, one that does not rely on expert knowl-
edge of a language. Instead, we rely solely on the statistical properties of the
corpus, so that the system remains extensible and may be applied to less common
or obscure languages.
The Bible has 31 226 verses, which we use as individual ‘documents’ in
our training set. The Quran has 114 suras (or chapters), which we use as the
documents in our test set. With the five languages, we have 570 individual test
queries. For each new query document, we project its vector representation into
the space of U S^{-1} and compute a cosine similarity with all other document
feature vectors. The highest similarity indicates the best match available, which
for our case should be a matching translation of the query document. We use
S^{-1} instead of other alternatives because if we consider the documents in X as
our test set, then the projection of X on U S^{-1} is close to the matrix V, which is
the document-by-concept matrix from the SVD.
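The projection and matching steps just described can be sketched as follows (the 4 × 3 toy matrix is ours, purely for illustration; the chapter's experiments used MATLAB on the full scaled corpus):

```python
import numpy as np

def lsa_fit(X, k):
    """Rank-k SVD of the (scaled) term-by-document matrix: X ~ U S V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def project(q, U, s):
    """Project a query term vector into concept space via U S^{-1}."""
    return (q @ U) / s

def best_match(q, U, s, doc_feats):
    """Index of the document feature vector with highest cosine similarity to q."""
    f = project(q, U, s)
    sims = doc_feats @ f / (np.linalg.norm(doc_feats, axis=1) * np.linalg.norm(f))
    return int(np.argmax(sims))

# illustrative 4-term x 3-document matrix (our own data)
X = np.array([[2., 0., 1.],
              [0., 3., 1.],
              [1., 1., 0.],
              [0., 1., 2.]])
U, s, V = lsa_fit(X, 2)
doc_feats = np.vstack([project(X[:, j], U, s) for j in range(3)])
```

Projecting the training columns of X themselves recovers the rows of V, which is why S^{-1} is the natural choice of scaling.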
To assess the performance of our techniques, we consider two measures of
precision used in multilingual IR. For the first, we split the test set into each of
the 25 possible language-pair combinations, where these include each language
to itself. For each pair, we have 228 distinct queries (i.e. chapters). The goal is
to retrieve the corresponding translation of that chapter in the other language.
We calculate the average precision at one document (P1), which is the average
percentage of times that the translation of the query ranked highest. P1 may be
calculated as an average over all queries for each language pair or as an overall
average, which we report here. P1 is a fairly strict measure of precision that
essentially measures success in retrieving documents when the source and target
languages are specified.
For the second measure, we considered average multilingual precision at five
documents (MP5), which is the average percentage of the top five documents
that are translations of the query document. We calculate MP5 as an average for
all queries and all languages. Essentially, MP5 measures success in multilingual
clustering. MP5 is a stricter measure than P1; since the target language is not
specified, there are more possibilities to choose from.
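Given ranked retrieval lists, both measures reduce to simple counting; a sketch under our own data layout (ranked document ids per query, with the relevant translation ids known):

```python
def precision_at_1(rankings, relevant):
    """P1: fraction of queries whose top-ranked document is the query's translation.
    rankings[q] is a ranked list of document ids; relevant[q] is the translation's id."""
    hits = sum(1 for q, ranked in enumerate(rankings) if ranked[0] == relevant[q])
    return hits / len(rankings)

def multilingual_precision_at_5(rankings, translations):
    """MP5: average fraction of the top five documents that are translations of the
    query; translations[q] is the set of translation ids across all languages."""
    scores = [len(set(ranked[:5]) & translations[q]) / 5
              for q, ranked in enumerate(rankings)]
    return sum(scores) / len(scores)

# two toy queries (illustrative ids only)
rankings = [[1, 2, 3, 4, 5, 6], [9, 7, 8, 0, 2, 5]]
p1 = precision_at_1(rankings, [1, 7])
mp5 = multilingual_precision_at_5(rankings, [{2, 3}, {9, 7, 8, 0, 2}])
```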
[Figure: the multilingual LSA model – the stacked term-by-document matrices for
English, French, Spanish, Russian, and Arabic are jointly factored into
language-specific term matrices U_1, ..., U_5 and a shared S V^T.]

[Figure: global term weight versus English term index on log-log axes, for
entropy exponents α = 1 and α = 1.8.]
where the notation X_k and A_k refers to the kth frontal slice of tensors X and
A, respectively, in what is called slab notation. The matrix V is the set of
principal eigenvectors of Σ_k X_k^T X_k (which is the same as the principal right
singular vectors of the matrix formed by stacking the slices X_k on top of each
other). Each matrix A_k is the matrix that best fits the data in a least squares
sense, which is just A_k = X_k V because V is orthonormal.
To use the same framework as outlined previously for multilingual LSA, where
we project new documents in the space of U_k S_k^{-1}, we normalize the columns in
each A_k so that they have unit length and the weight is stored in a diagonal
matrix S_k. Then the Tucker1 representation becomes

X_k ≈ U_k S_k V^T    for k = 1, . . . , K.    (2.4)
[Figure 2.4: the Tucker1 model – each language slice X_k (English, French,
Spanish, Russian, Arabic) is factored as X_k ≈ U_k S_k V^T with a shared V.]
For our case where the row dimension is not constant, however, we may assume
that the tensor has a row dimension of the largest matrix and that the other
smaller matrices are padded with rows of zeros in order to adapt the Tucker1
model. The resulting factor matrices Uk will have a corresponding number of
zero rows. Figure 2.4 shows the Tucker1 model.
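The Tucker1 fit described above can be sketched directly from its definition (V from the principal eigenvectors of Σ_k X_k^T X_k, then A_k = X_k V split into unit-length U_k and diagonal S_k); the tiny slices below are our own illustration, not the chapter's MATLAB/Tensor Toolbox implementation:

```python
import numpy as np

def tucker1(slices, r):
    """Fit X_k ~ U_k S_k V^T: V spans the top-r eigenvectors of sum_k X_k^T X_k;
    A_k = X_k V is split into unit-length columns U_k and column weights S_k."""
    G = sum(Xk.T @ Xk for Xk in slices)          # shared Gram matrix
    w, Q = np.linalg.eigh(G)                     # eigenvalues in ascending order
    V = Q[:, ::-1][:, :r]                        # top-r principal eigenvectors
    Us, Ss = [], []
    for Xk in slices:
        Ak = Xk @ V
        s = np.linalg.norm(Ak, axis=0)           # column weights -> diagonal of S_k
        Us.append(Ak / np.where(s > 0, s, 1.0))  # unit-length columns
        Ss.append(np.diag(s))
    return Us, Ss, V

# two toy slices sharing the column (document) dimension but not the row dimension
slices = [np.array([[1., 0.], [0., 2.]]), np.array([[3., 1.]])]
Us, Ss, V = tucker1(slices, 2)
```

Because the slices need not share a row dimension, no zero-padding is required in this formulation; the padding in the text is only needed to phrase the model as a single tensor.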
Using a rank 300 Tucker1 model, we get an average P1 score of 89.5%
and an average MP5 score of 71.3%. With this tensor representation, we see
a small increase over SVD in P1 (p-value = 8 × 10^{-3}) and a large increase in
multilingual precision (p-value = 4 × 10^{-11}). However, the fact that each U_k
does not form an orthogonal space in the Tucker1 model may be limiting the
performance of this tensor approach. When projecting new documents onto these
oblique axes to get document feature vectors, distances between features are
distorted, which could adversely affect cosine similarity calculations.
X_k ≈ U_k H S_k V^T    for k = 1, . . . , K,    (2.5)

[Figure 2.5: the PARAFAC2 model – each slice X_k is factored as U_k H S_k V^T.]
V is a dense matrix that is not necessarily orthogonal. Figure 2.5 shows the
PARAFAC2 model. We project new documents in the space of U_k S_k^{-1}.
The algorithm to fit a PARAFAC2 model is decidedly more complex than
for Tucker1, so we will only refer to the algorithm in Kiers et al. (1999), which
we implemented in MATLAB using the Tensor Toolbox (Bader and Kolda 2006,
2007a,b).
Due to memory constraints, we were not able to compute a rank 300
PARAFAC2 model. Instead we computed a rank 240 PARAFAC2 model, which
provided an average P1 score of 89.8% and an average MP5 score of 78.5%.
With this tensor representation, and even though the rank of the model is
lower than previously, we see a large and highly significant increase in MP5
over Tucker1 (p-value = 2 × 10^{-17}). However, the increase over Tucker1 is
insignificant for P1 (p-value = 0.6).
[Figure: the LSATA model – an eigendecomposition of the block matrix formed
from the term-alignment block D_1, the term-by-document matrix X, and its
transpose X^T, whose rows and columns are indexed by terms and documents.]
and populating the block so that D_ij = 1 if the term pair (i, j) occurs in a
dictionary, and zero otherwise. Another option, which was used in Bader and
Chew (2008), involves computing the pairwise mutual information (PMI) of
two terms appearing together in the same documents across the whole cor-
pus. This draws upon one of the ideas which underpins statistical machine
translation (SMT) (Brown et al. 1994). To preserve sparsity in the matrix, we
retain the value only for the pair (i, j ) that has the highest PMI in both direc-
tions. Because the resulting matrix is not symmetric and symmetry is needed
in D1 to obtain real eigenvalues, we symmetrize the matrix using a modi-
fied Sinkhorn balancing procedure. Sinkhorn balancing (Sinkhorn 1964) is also
needed to equalize contributions between terms. The standard Sinkhorn balancing
procedure normalizes the row and column sums to one, but we use a modi-
fied procedure that makes each row and column of D1 have unit length. This
modification was found to produce better results than creating a doubly
stochastic matrix D_1. Altogether, we call this technique LSA with term
alignments (LSATA).
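One plausible reading of the modified Sinkhorn procedure is an alternating rescaling of rows and columns toward unit Euclidean length (rather than unit sums); the sketch below shows only that balancing step, on a small positive matrix of our own, and omits the chapter's PMI computation and symmetrization details:

```python
import numpy as np

def sinkhorn_unit_length(A, n_iter=100):
    """Alternately rescale rows and then columns of A toward unit Euclidean length.
    Equivalent to classical (sum-based) Sinkhorn balancing applied to A**2."""
    A = A.astype(float).copy()
    for _ in range(n_iter):
        r = np.linalg.norm(A, axis=1, keepdims=True)
        A /= np.where(r > 0, r, 1.0)             # rows -> unit length
        c = np.linalg.norm(A, axis=0, keepdims=True)
        A /= np.where(c > 0, c, 1.0)             # columns -> unit length
    return A

B = sinkhorn_unit_length(np.array([[1., 2.], [3., 4.]]))
```

For strictly positive matrices the iteration converges, since it is Sinkhorn's classical iteration applied to the elementwise squares.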
By adding term-alignment information to the diagonal block, we strengthen
the co-occurrence information that LSA normally finds in the parallel corpus via
the SVD. To understand this mathematically, we consider the solution obtained
from LSA and then apply a power method to update U and V . Here is one
iteration of the power method on our block matrix:
U_new = D_1 U + X V,    (2.6)
V_new = X^T U.    (2.7)
The terms X V and X^T U are the standard relationships in LSA. The term D_1 U
is new, and it reinforces term–term relationships from external information
(although note that under our approach, the information is not ‘external’ in that it
is implied by the same corpus from which we get the term-by-document matrix).
A graphical representation of this interpretation is shown in Figure 2.7, where
in one of the concept vectors in U the term ‘house’ dominates the correspond-
ing terms in Spanish and French, for example. After multiplication with D1 , the
relationship between these three words is strengthened, and all three terms have
similar values.
This observation leads to another consideration: the weighting of D1 relative
to X. If the values of D1 are very small compared to X, then any contribution
from D1 will be negligible. The opposite happens if the values in D1 are very
large compared to X. Hence, the matrices D1 and X must be numerically balanced
by, say, multiplying D1 by some parameter β. For our corpus and particular scal-
ing of D1 (Sinkhorn-balanced PMI) and X (log-entropy with α = 1.8), we deter-
mined empirically that a value of β = 12 provides good results. Alternatively,
β can be determined automatically by routinely balancing the two contributions
from D1 U and XV in Equations (2.6)–(2.7). One possible approach is to set
β = ||X V||_F / ||D_1 U||_F.    (2.8)
Algorithmically, β could be computed iteratively inside an eigensolver or
externally by looping over an eigensolver and adjusting β until it converges to
a constant value.
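One way to combine the power-method updates (2.6)–(2.7) with the automatic balancing (2.8) is a block power iteration that recomputes β each sweep and re-orthonormalizes the bases by QR; this is our own sketch of the idea on tiny illustrative data, not the eigensolver used in the chapter's experiments:

```python
import numpy as np

def lsata_power_iteration(D1, X, r, n_iter=100):
    """Block power sweeps U <- beta*D1 U + X V and V <- X^T U, with beta from
    Equation (2.8) recomputed each sweep (Frobenius norms) and QR
    re-orthonormalization of both bases."""
    rng = np.random.default_rng(0)
    U = np.linalg.qr(rng.standard_normal((X.shape[0], r)))[0]
    V = np.linalg.qr(rng.standard_normal((X.shape[1], r)))[0]
    beta = 1.0
    for _ in range(n_iter):
        beta = np.linalg.norm(X @ V) / max(np.linalg.norm(D1 @ U), 1e-12)
        U = np.linalg.qr(beta * (D1 @ U) + X @ V)[0]
        V = np.linalg.qr(X.T @ U)[0]
    return U, V, beta

# illustrative 2-term x 3-document data (our own, not the chapter's corpus)
D1 = np.array([[1.0, 0.5],
               [0.5, 1.0]])
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
U, V, beta = lsata_power_iteration(D1, X, 1)
```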
In our numerical experiments, using a rank 300 LSATA model and β = 12,
we get an average P1 score of 91.8% and an average MP5 score of 80.7%. With
this matrix representation, we see a small increase over PARAFAC2 in P1 and in
[Figure 2.7: a concept vector in U in which the English term 'house' dominates
its translations 'casa' (Spanish) and 'maison' (French); multiplication with D_1
strengthens the relationship among the corresponding terms across English,
Spanish, French, Russian, and Arabic.]
2.11 Acknowledgements
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed
Martin Company, for the United States Department of Energy’s National Nuclear
Security Administration under contract DE-AC04-94AL85000.
References
Bader BW and Chew PA 2008 Enhancing multilingual latent semantic analysis with term
alignment information. COLING 2008 .
Bader BW and Kolda TG 2006 Algorithm 862: MATLAB tensor classes for fast algorithm
prototyping. ACM Transactions on Mathematical Software 32(4), 635–653.
Bader BW and Kolda TG 2007a Efficient MATLAB computations with sparse and factored
tensors. SIAM Journal on Scientific Computing 30(1), 205–231.
Bader BW and Kolda TG 2007b Tensor toolbox for MATLAB, version 2.2. http://
csmr.ca.sandia.gov/~tgkolda/TensorToolbox/.
Brown PF, Della Pietra VJ, Della Pietra SA and Mercer RL 1994 The mathematics of
statistical machine translation: Parameter estimation. Computational Linguistics 19(2),
263–311.
Chew P and Abdelali A 2007 Benefits of the massively parallel Rosetta Stone: Cross-
language information retrieval with over 30 languages. Proceedings of the Association
for Computational Linguistics, pp. 872–879.
36 TEXT MINING
Chew P, Kegelmeyer P, Bader B and Abdelali A 2008a The knowledge of good and evil:
Multilingual ideology classification with PARAFAC2 and machine learning. Language
Forum 34(1), 37–52.
Chew PA, Bader BW and Abdelali A 2008b Latent morpho-semantic analysis: Multilin-
gual information retrieval with character n-grams and mutual information. COLING
2008 .
Chew PA, Bader BW, Kolda TG and Abdelali A 2007 Cross-language information
retrieval using PARAFAC2. KDD’07: Proceedings of the 13th ACM SIGKDD Inter-
national Conference on Knowledge Discovery and Data Mining, pp. 143–152. ACM
Press, New York.
Chisholm E and Kolda TG 1999 New term weighting formulas for the vector space method
in information retrieval. Technical Report ORNL-TM-13756, Oak Ridge National Lab-
oratory, Oak Ridge, TN.
Deerwester SC, Dumais ST, Landauer TK, Furnas GW and Harshman RA 1990 Indexing
by latent semantic analysis. Journal of the American Society for Information Science
41(6), 391–407.
Harshman RA 1972 PARAFAC2: Mathematical and technical notes. UCLA Working
Papers in Phonetics 22, 30–47.
Hendrickson B 2007 Latent semantic analysis and Fiedler retrieval. Linear Algebra and
its Applications 421(2–3), 345–355.
Kiers HAL, Ten Berge JMF and Bro R 1999 PARAFAC2 – Part I. A direct fitting algo-
rithm for the PARAFAC2 model. Journal of Chemometrics 13(3–4), 275–294.
Kolda TG and Bader BW 2009 Tensor decompositions and applications. SIAM Review
51(3), 455–500.
Landauer TK and Littman ML 1990 Fully automatic cross-language document retrieval
using latent semantic indexing. Proceedings of the 6th Annual Conference of the UW
Centre for the New Oxford English Dictionary and Text Research, pp. 31–38, UW
Centre for the New OED and Text Research, Waterloo, Ontario.
Payne TE 1997 Describing Morphosyntax: A guide for field linguists. Cambridge Univer-
sity Press, Cambridge, UK.
Salton G 1968 Automatic Information Organization and Retrieval . McGraw-Hill, New
York.
Salton G and McGill M 1983 Introduction to Modern Information Retrieval . McGraw-Hill,
New York.
Sinkhorn R 1964 A relation between arbitrary positive matrices and doubly stochastic
matrices. Annals of Mathematical Statistics 35(2), 876–879.
Tucker LR 1966 Some mathematical notes on three-mode factor analysis. Psychometrika
31, 279–311.
Young P 1994 Cross language information retrieval using latent semantic indexing. Mas-
ter's thesis, University of Tennessee, Knoxville, TN.
Zipf GK 1935 The Psychobiology of Language. Houghton-Mifflin, Boston, MA.
3

Content-Based Spam Email Classification
3.1 Introduction
With the rapid growth of the Internet and advances in computer technology email
has become a preferred form of communication and information exchange for
both business and personal purposes. It is fast and convenient. In recent years,
however, the effectiveness and confidence in email have been diminished quite
noticeably by spam email, or bulk unsolicited and unwanted email messages.
Spam email has been a painful annoyance for email users with an overwhelm-
ing amount of unwelcome messages flowing into their mailboxes. Now, it has
also evolved into a primary medium for spreading phishing scams and malicious
viruses. The cost of spam in the United States alone in terms of decreased pro-
ductivity and increased technical expenses for businesses has reached tens of
billions of dollars annually.1 Worldwide spam volume has increased significantly
and during the first quarter of 2008, spam email accounted for more than nine
out of every ten email messages sent over the Internet.2
1 http://www.spamlaws.com/spam-stats.html
2
http://www.net-security.org/
Over the years, various spam filtering technology and anti-spam software
products have been developed and deployed. Some of them are designed to
detect and stop spam email at the TCP/IP or SMTP level and may rely on DNS
blacklists of domain names that are known to originate spam. This approach has
been commonly used. However, it can be insufficient due to the lack of accuracy
of the name lists, since spammers can now register hundreds of accounts with free
webmail services such as Hotmail and Gmail and then rotate them every few minutes during
a spam campaign. The other major type of spam filtering technology functions at
the client level. Once an email message is downloaded, its content can be exam-
ined to determine whether the message is spam or legitimate. Several supervised
machine-learning algorithms have been used in client-side spam detection and
filtering. Among them, naive Bayes (Mitchell 1997; Sahami et al. 1998), boost-
ing algorithms such as logitBoost (Androutsopoulos et al. 2004; Friedman et al.
2000), support vector machines (SVMs) (Christianini and Shawe-Taylor 2000;
Drucker et al. 1999), instance-based algorithms such as k-nearest neighbor (Aha
and Albert 1991), and Rocchio’s classifier (Rocchio 1997) are commonly cited.
More recently, a number of other interesting algorithms for spam filtering have
been developed. One uses an augmented latent semantic indexing (LSI) space
model (Jiang 2006) and another applies a radial basis function (RBF) neural
network (Jiang 2007).
This chapter considers five supervised machine-learning algorithms for an
evaluation study of spam filtering application. The algorithms selected in this
study include widely used ones with good classification results and some recently
proposed methods. More specifically, we evaluate these five classification algo-
rithms: naive Bayes classifier (NB), support vector machines (SVMs), logitBoost
algorithm (LB), augmented latent semantic indexing space model (LSI) and radial
basis function (RBF) networks.
Spam filtering is a cost-sensitive classification task since misclassifying legit-
imate email (a false positive error) is generally more costly than misclassifying
spam email (a false negative error). Fairly recently, there have been several stud-
ies (Androutsopoulos et al. 2004; Zhang et al. 2004) surveying machine-learning
techniques in spam filtering. Using a constant λ to measure the higher cost of
false positives, these studies have evaluated several algorithms on spam filter-
ing by integrating the λ value or a function of λ into the algorithms through
a variety of cost-sensitive adjustment strategies. This was done by increasing
algorithm thresholds on spam confidence scores, adding more weights on legit-
imate training samples, or empirically adjusting algorithm decision thresholds
using cross-validation. Different adjustment strategies have also been applied to
different algorithms in the studies. Since all the algorithms were designed with
cost-insensitive tasks in mind, applying such simple cost-sensitive adjustments
on the algorithms can produce unreliable results. Apparently this insufficiency
has been recognized and, for some algorithms, the studies reported only the best
results among several adjustment trials.
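For concreteness, the λ-weighted accuracy commonly used in these evaluations (following Androutsopoulos et al. 2004) treats each legitimate message as if it counted λ times; a sketch assuming that definition, with our own variable names:

```python
def weighted_accuracy(legit_correct, n_legit, spam_correct, n_spam, lam):
    """lambda-weighted accuracy: each legitimate message counts lam times,
    so a false positive (misclassified legitimate mail) is lam times costlier."""
    return (lam * legit_correct + spam_correct) / (lam * n_legit + n_spam)

# 10 legitimate and 90 spam test messages; one false positive, no false negatives
wacc = weighted_accuracy(legit_correct=9, n_legit=10,
                         spam_correct=90, n_spam=90, lam=9)
```

With λ = 9, a single false positive out of ten legitimate messages already drops the weighted accuracy to 0.95, which is why the figures later in the chapter report this measure at both λ = 1 and λ = 9.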
This chapter provides a related study of five machine-learning algorithms on
spam filtering from a different perspective. The main objective of the study is to
learn whether and to what extent the algorithms are adaptable and applicable to
the cost-sensitive email classification problem and to identify the characteristics
of the algorithms most suitable for adaptability. In this study, we selected two
benchmark email testing corpora for experiments that were constructed from two
different languages and have reverse ratios of the number of spam emails to the
number of legitimate emails in the training data. We also vary feature size in the
experiments to analyze the usefulness of feature selection for these algorithms.
The rest of the chapter is organized as follows. In Section 3.2, the five
machine-learning algorithms that are investigated for spam filtering applica-
tions are briefly described. In Section 3.3, several data preprocessing procedures,
including feature selection and message representation, are discussed. Spam filter-
ing is a cost-sensitive classification task and a related discussion of effectiveness
measures is included in Section 3.4. We then compare the algorithms, using two
popular email testing corpora. The experimental results and analysis are reported
in Section 3.5, and an empirical comparison of the characteristics of the five clas-
sifiers is presented in Section 3.6. Finally, some concluding remarks are provided
in Section 3.7.
c_MAP = arg max_{c ∈ {c_l, c_s}} P(c|d) = arg max_{c ∈ {c_l, c_s}} P(c) ∏_{k=1}^{m} P(t_k|c).    (3.2)
All model parameters, i.e. class priors and feature probability distributions, can
be estimated with relative frequencies from the training set D. Note that when
a given class and message feature do not occur together in the training set, the
corresponding frequency-based probability estimate will be zero, which would
make the right-hand side of Equation (3.3) undefined. This problem can be
mitigated by incorporating some correction such as Laplace smoothing in all
probability estimates.
NB is a simple probability learning model and can be implemented very
efficiently with a linear complexity. It applies a simplistic or naive assumption
that the presence or absence of a feature in a class is completely independent
of any other features. Despite the fact that this oversimplified assumption is
often inaccurate (in particular for text domain problems), NB is one of the most
widely used classifiers and possesses several properties (Zhang 2004) that make
it surprisingly useful and accurate.
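A minimal multinomial naive Bayes sketch implementing Equation (3.2) with Laplace smoothing is shown below; the toy spam/legitimate training data and tokenization are our own illustration:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors and Laplace-smoothed P(t|c) from tokenized docs."""
    classes = sorted(set(labels))
    vocab = {t for d in docs for t in d}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d)
    def cond(t, c):  # add-one (Laplace) smoothing avoids zero probabilities
        return (counts[c][t] + 1) / (sum(counts[c].values()) + len(vocab))
    return classes, prior, cond

def classify(doc, classes, prior, cond):
    """Return arg max_c  log P(c) + sum_k log P(t_k|c)  (Equation (3.2))."""
    score = lambda c: math.log(prior[c]) + sum(math.log(cond(t, c)) for t in doc)
    return max(classes, key=score)

docs = [["cheap", "pills", "buy"], ["buy", "cheap", "now"],
        ["meeting", "notes"], ["project", "meeting"]]
labels = ["spam", "spam", "legit", "legit"]
model = train_nb(docs, labels)
```

Working in log space, as above, also avoids numeric underflow when the product in Equation (3.2) runs over many features.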
3.2.2 LogitBoost
LB is a boosting algorithm that implements forward stagewise modeling to form
additive logistic regression (Friedman et al. 2000). Like other boosting methods,
LB adds base models or learners of the same type iteratively, and the construction
of each new model is influenced by the performance of those preceding ones.
This is accomplished by assigning weights to all training samples and adaptively
updating the weights through iterations. Suppose fm is the mth base learner and
fm (d) is the prediction value of message d. After fm is constructed and added
to the ensemble, the weights on training samples are updated in such a way that
the subsequent base learner fm+1 will focus more on those difficult samples to
classify by fm . In the iteration process, the probability of d being in class c
is estimated by applying a sigmoid function, which is also known as the logit
transformation, to the response of the ensemble that has been built so far, i.e.
P(c|d) = e^{F(d)} / (e^{F(d)} + e^{-F(d)}) = 1 / (1 + e^{-2F(d)}).    (3.4)
w · d + b = 0, (3.5)
min_w ||w||^2 / 2,    subject to c_i(w · d_i + b) ≥ 1.    (3.6)
The optimization problem in Equation (3.6) can be solved by the standard
Lagrange multiplier method with the new objective function:
L = ||w||^2 / 2 − Σ_i λ_i [c_i(w · d_i + b) − 1].    (3.7)
Since the Lagrangian involves a large number of parameters, this is still a dif-
ficult problem. Fortunately, the problem can be simplified by transforming the
Lagrangian in Equation (3.7) into the following dual formation that contains only
Lagrange multipliers:
L_D = Σ_i λ_i − (1/2) Σ_{i,j} λ_i λ_j c_i c_j d_i · d_j,    (3.8)

and the resulting decision boundary is

Σ_i λ_i c_i d_i · d + b = 0.    (3.9)
In order to deal with the cases where the training samples cannot be fully sep-
arated and also small misclassification errors are permitted, the so-called soft
margin method was developed for choosing a hyperplane that intends to reduce
the number of errors committed by the decision boundary while maximizing the
width of the margin. The method introduces a positive-valued slack variable ξ
that measures the degree of misclassification error on a sample and solves the
following modified optimization problem:
min_{w,ξ} ||w||^2 / 2 + C Σ_i ξ_i,    subject to c_i(w · d_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0.    (3.10)
a_{c_j} = (1/k) Σ_{i=1}^{k} d_{n_i},    d_{n_i} ∈ c_j,    (3.11)
and it can be used to represent the most important topic covered in the cluster
(Jiang 2006). Once the cluster centroids of a category c are identified, all training
samples from the other category are compared against the centroids and the most
similar ones are then chosen to add to the training set of c. Selecting the sizes
of clusters and augmented samples of a category can vary depending on the data
to be learned. The cluster size can also be set by a silhouette plot (Kaufman
and Rousseeuw 1990) on a given training dataset. In our experiments, we use
the augmented sample sizes of 18 and 70 for the corpora PU1 and ZH1 (see
Section 3.5), respectively.
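The centroid computation of Equation (3.11) and the cross-category augmentation step can be sketched as follows; the two-dimensional toy vectors are our own, and cosine similarity is assumed as the similarity measure:

```python
import numpy as np

def centroid(cluster):
    """Equation (3.11): the centroid is the mean of the k documents in the cluster."""
    return np.mean(cluster, axis=0)

def augment(centroids, other_docs, n):
    """Indices of the n documents from the *other* category that are most
    cosine-similar to any of the given cluster centroids."""
    C = np.array([c / np.linalg.norm(c) for c in centroids])
    D = other_docs / np.linalg.norm(other_docs, axis=1, keepdims=True)
    best = (D @ C.T).max(axis=1)          # best similarity to any centroid
    return np.argsort(best)[::-1][:n]     # n most similar documents

centroids = [centroid(np.array([[1., 0.], [3., 0.]]))]
others = np.array([[0., 1.], [5., 0.1], [1., 1.]])
picked = augment(centroids, others, 1)
```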
To use two separate augmented LSI spaces for classification, several
approaches have been considered and evaluated in Jiang (2006) that coordinate
and classify target email messages into their respective classes. For a given
target message, the first approach simply projects it onto both LSI spaces and
then uses the most semantically similar training sample to decide the class
for the message. The second approach classifies the message similarly but by
applying a fixed number of the top most similar training samples in the spaces
and using either the sum or average of computed similarity values from both
classes to make its classification decision. The third approach is a hybrid one
that intends to combine the ideas of the first two methods and also to mollify
some of their shortcomings. Essentially, it determines the class for the target
message by linearly balancing the votes or decisions made by the first two
methods. Experiments indicate that in general the hybrid approach delivers
significantly better classification results (Jiang 2006) and it is used in the study.
c_j = Σ_{i=1}^{k} w_{ij} φ_i(x),    j = 1, 2,    (3.13)
where wij is the weight connecting the ith neuron in the hidden layer to the j th
neuron in the output layer. The neuron activation φi is a nonlinear function of the
distance; the closer the distance, the stronger the activation. The most commonly
used basis function is the Gaussian
φ(x) = e^{−x^2 / (2σ^2)},    (3.14)
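Equations (3.13) and (3.14) combine into a forward pass that is only a few lines; the toy centers, width, and weights below are our own illustration (training the weights and choosing the centers are separate steps not shown):

```python
import numpy as np

def rbf_outputs(x, centers, sigma, W):
    """c_j = sum_i w_ij * phi_i(x), with phi(x) = exp(-||x - center_i||^2 / (2 sigma^2))."""
    d2 = ((centers - x) ** 2).sum(axis=1)    # squared distances to the k centers
    phi = np.exp(-d2 / (2 * sigma ** 2))     # Gaussian activations, Eq. (3.14)
    return phi @ W                           # linear output layer, Eq. (3.13)

centers = np.array([[0., 0.], [1., 0.]])
W = np.eye(2)                                # k = 2 hidden neurons, 2 output neurons
out = rbf_outputs(np.array([0., 0.]), centers, 1.0, W)
```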
where P(c) and P(t) denote the probability that a message belongs to category
c and the probability that a feature t occurs in a message, respectively, and
P(t, c) is the joint probability of t and c. All probabilities can be estimated
by frequency counts from the training data. Another popular feature selection
method is CHI. It measures the lack of independence between the occurrence of
feature t and the occurrence of class c. In other words, features are ranked with
respect to the quantity
OR(t, c) = P(t|c)(1 − P(t|c̄)) / ((1 − P(t|c)) P(t|c̄)).    (3.17)
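The odds ratio of Equation (3.17) can be estimated from simple document counts; the add-one smoothing below is our own choice to keep the denominators away from zero, not something the chapter specifies:

```python
def odds_ratio(n_tc, n_c, n_tc_bar, n_c_bar):
    """OR(t, c) from document counts: n_tc documents of class c contain t (out of
    n_c), n_tc_bar documents of the other class contain t (out of n_c_bar)."""
    p = (n_tc + 1) / (n_c + 2)            # smoothed P(t|c)
    p_bar = (n_tc_bar + 1) / (n_c_bar + 2)  # smoothed P(t|c-bar)
    return (p * (1 - p_bar)) / ((1 - p) * p_bar)

# a term in 9 of 10 spam messages but only 1 of 10 legitimate ones
score = odds_ratio(9, 10, 1, 10)
```

Scores far above 1 indicate features strongly associated with class c; scores below 1 indicate the opposite association.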
3.5 Experiments
In this section, we use two benchmark email testing corpora to compare the
efficacy of the five machine-learning algorithms, discussed in Section 3.2, for
spam email filtering and provide the experimental results and analysis. Note that
the input data to the classifiers is the preprocessed message vectors after both
feature selection and feature weighting.
[Plot: average weighted accuracy versus feature size (50–1650) for the SVM, NB,
RBF, LSI, and LB classifiers.]
Figure 3.1 Average weighted classification accuracy with λ = 1 (PU1).
[Plot: average weighted accuracy versus feature size (50–1650) for the SVM, NB,
RBF, LSI, and LB classifiers.]
Figure 3.2 Average weighted classification accuracy with λ = 9 (PU1).
other hand, since LSI, followed very closely by RBF, makes somewhat fewer
false positive errors than the other classifiers, its accuracy values are lifted,
making it the top performer. A detailed analysis of the error counts of LSI
and RBF suggests that a richer feature set generally helps these classifiers
characterize legitimate messages and improves classification of that category,
but it may not help them improve their classification of spam messages. One
possible explanation for this phenomenon may be related to the vocabularies used
in the respective email categories. It is hypothesized that spam email has a strong
correspondence between a small set of features and the category, while legitimate
email likely carries more sophisticated characteristics. The spam category can
attain good classification with a small vocabulary, while the legitimate category
requires a large vocabulary, which can be assisted by feature expansion.
[Plot: average weighted accuracy versus feature size (50–1650) for the SVM, NB,
RBF, LSI, and LB classifiers.]
Figure 3.3 Average weighted classification accuracy with λ = 1 (ZH1).
[Plot: average weighted accuracy versus feature size (50–1650) for the SVM, NB,
RBF, LSI, and LB classifiers.]
Figure 3.4 Average weighted classification accuracy with λ = 9 (ZH1).
Figure 3.4, but this time, at those feature sizes that are greater than 350, both
LSI and RBF become much more competitive than LB and SVM. All four of
these classifiers achieve high classification accuracy.
3.8 Acknowledgements
This work was in part supported by a faculty research grant from the University
of San Diego.
References
Aha W and Albert M 1991 Instance-based learning algorithms. Machine Learning 6,
37–66.
Androutsopoulos I, Paliouras G and Michelakis E 2004 Learning to filter unsolicited
commercial e-mail. Technical Report, NCSR Demokritos.
Berry M, Dumais S and O’Brien W 1995 Using linear algebra for intelligent information
retrieval. SIAM Review 37(4), 573–595.
Bishop C 1995 Neural Networks for Pattern Recognition. Oxford University Press.
Cristianini N and Shawe-Taylor J 2000 An Introduction to Support Vector Machines and
Other Kernel-based Learning Methods. Cambridge University Press.
Deerwester S, Dumais S, Furnas G, Landauer T and Harshman R 1990 Indexing by
latent semantic analysis. Journal of the American Society for Information Science 41,
391–407.
Drucker H, Wu D and Vapnik V 1999 Support vector machines for spam categorization.
IEEE Transactions on Neural Networks 10, 1048–1054.
Friedman J, Hastie T and Tibshirani R 2000 Additive logistic regression: A statistical
view of boosting. Annals of Statistics 28(2), 337–374.
Gee K 2003 Using latent semantic indexing to filter spam. Proceedings of the ACM
Symposium on Applied Computing, pp. 460–464.
Golub G and van Loan C 1996 Matrix Computations, third edn. Johns Hopkins University
Press.
Hidalgo J 2002 Evaluating cost-sensitive unsolicited bulk email categorization. Proceed-
ings of the 17th ACM Symposium on Applied Computing, pp. 615–620.
Jiang E 2006 Learning to semantically classify email messages. Lecture Notes in Control
and Information Sciences 344, 700–711.
Jiang E 2007 Detecting spam email by radial basis function networks. International Jour-
nal of Knowledge-based and Intelligent Engineering Systems 11, 409–418.
Jiang E 2009 Semi-supervised text classification using RBF networks. Lecture Notes in
Computer Science 5772, 95–106.
Joachims T 1998 Text categorization with support vector machines – learning with many
relevance features. Proceedings of the 10th European Conference on Machine Learning,
pp. 137–142.
Kaufman L and Rousseeuw P 1990 Finding Groups in Data. John Wiley & Sons, Inc.
Manning C, Raghavan P and Schutze H 2008 Introduction to Information Retrieval . Cam-
bridge University Press.
Mitchell T 1997 Machine Learning. McGraw-Hill.
Platt J 1999 Fast training of support vector machines using sequential minimal optimiza-
tion. In Advances in Kernel Methods: Support Vector Learning (ed. Schölkopf B, Burges
C and Smola A), MIT Press, pp. 185–208.
Rocchio J 1971 Relevance feedback in information retrieval. In The SMART Retrieval
System: Experiments in Automatic Document Processing (ed. Salton G) Prentice Hall,
pp. 313–323.
Sahami M, Dumais S, Heckerman D and Horvitz E 1998 A Bayesian approach to filtering
junk e-mail. Proceedings of AAAI Workshop, pp. 55–62.
Sebastiani F 2002 Machine learning in automated text categorization. ACM Computing
Surveys 1, 1–47.
Witten I and Frank E 2005 Data Mining, second edn. Morgan Kaufmann.
Yang Y and Pedersen J 1997 A comparative study on feature selection in text catego-
rization. Proceedings of the 14th International Conference on Machine Learning, pp.
412–420.
Zhang H 2004 The optimality of naive Bayes. Proceedings of the 17th International
FLAIRS Conference.
Zhang L, Zhu J and Yao T 2004 An evaluation of statistical spam filtering techniques.
ACM Transactions on Asian Language Information Processing 3, 243–369.
4
Utilizing Nonnegative Matrix Factorization
4.1 Introduction
About a decade ago, unsolicited bulk email (‘spam’) started to become one of
the biggest problems on the Internet. A vast number of strategies and techniques
were developed and employed to fight email spam, but none of them can be
considered a final solution to this problem. In recent years, phishing (‘password
fishing’) has become a severe problem in addition to spam email. The term
covers various criminal activities which try to fraudulently acquire sensitive data
or financial account credentials from Internet users, such as account user names,
passwords, or credit card details. Phishing attacks use both social engineering and
technical means. In contrast to unsolicited but largely harmless spam email, phishing
poses an enormous threat to all large Internet-based commercial operations.
Generally, email classification methods can be categorized into three groups,
according to their point of action in the email transfer process. These groups
are pre-send methods, post-send methods, and new protocols, which are based
on modifying the transfer process itself. Pre-send methods, which act before
the email is transported over the network, are very important because of their
potential to avoid the wasting of resources caused by spam. However, since the
efficiency of these methods depends on their widespread deployment, most of the
currently used email filtering techniques belong to the group of post-send meth-
ods. Amongst others, this group comprises techniques such as black-, white-, and
graylisting, or rule-based filters, which block email based on a predetermined
set of rules. Using these rules, features describing an email message can be
extracted. After extracting the features, a classification process can be applied
to predict the class (ham, spam, phishing) of unclassified email. An important
approach for increasing the speed of the classification process is to perform
feature subset selection (removal of redundant and irrelevant features) or dimen-
sionality reduction (use of low-rank approximations of the original data) prior to
the classification.
Low-rank approximations replace a large and often sparse data matrix with
a related matrix of much lower rank. The objective of these techniques – which
can be utilized in many data mining applications such as image processing, drug
discovery, or text mining – is to reduce the required storage space and/or to
achieve more efficient representations of the relationship between data elements.
Depending on the approximation technique used, great care must be taken in
terms of storage requirements. If the original data matrix is very sparse (as is the
case for many text mining problems), the storage requirements for the reduced
rank matrices might be higher than for the original data matrix with higher
dimensions (since the reduced rank matrices are often almost completely dense).
Besides well-known techniques like principal component analysis (PCA) and sin-
gular value decomposition (SVD), there are several other low-rank approximation
methods like vector quantization (Linde et al. 1980), factor analysis (Gorsuch
1983), QR decomposition (Golub and Van Loan 1996) or CUR decomposition
(Drineas et al. 2004). In recent years, another approximation technique for non-
negative data has been used successfully in various fields. The nonnegative matrix
factorization (NMF, see Section 4.2) determines reduced rank nonnegative fac-
tors W and H which approximate a given nonnegative data matrix A, such that
A ≈ WH.
In this chapter, we investigate the application of NMF to the task of email
classification. We consider the interpretability of the NMF factors in the email
classification context and try to take advantage of information provided by the
basis vectors in W (interpreted as basis emails or the basis features). Moti-
vated by this context, we also investigate a new initialization technique for
NMF based on ranking the original features. This approach is compared to stan-
dard random initialization and other initialization techniques for NMF described
in the literature. Our approach shows faster reduction of the approximation
error than random initialization and comparable results to existing but often
more time-consuming approaches. Moreover, we analyze classification meth-
ods based on NMF. In particular, we introduce a new method that combines
NMF with LSI (Latent Semantic Indexing) and compare this approach to stan-
dard LSI.
4.1.1 Related work
The utilization of low-rank approximations in the context of email classification
has been analyzed in Gansterer et al. (2008b). In this work, LSI was applied
successfully both on purely textual features and on features extracted by rule-
based filtering systems. Especially the features from rule-based filters allowed
for a strong reduction of the dimensionality without losing significant accuracy
in the classification process. Feature reduction is particularly important if time
constraints play a role, as in the online processing of email streams. In Gansterer
et al. (2008a) a framework for such situations was presented – an enhanced
self-learning variant of graylisting (temporarily rejecting email messages) was
combined with a reputation-based trust mechanism to separate SMTP communi-
cation from feature extraction and classification. This architecture minimizes the
workload on the client side and achieves very high spam classification rates. A
comparison of the classification accuracy achieved with feature subset selection
and low-rank approximation based on PCA in the context of email classification
can be found in Janecek et al. (2008).
NMF initialization. All algorithms for computing the NMF are iterative and
require initialization of W and H. While the general goal – to establish initializa-
tion techniques and algorithms that lead to better overall error at convergence – is
still an open issue, some initialization strategies can improve the NMF in terms of
faster convergence and faster error reduction. Although the benefits of good NMF
initialization techniques are well known in the literature, rather few algorithms
for non-random initializations have been published so far.
Wild et al. (Wild 2002; Wild et al. 2003, 2004) were among the first to investigate
the initialization problem of NMF. They used spherical k-means clustering
based on the centroid decomposition (Dhillon and Modha 2001) to obtain a struc-
tured initialization for W. More precisely, they partition the columns of A into
k clusters and select the centroid vectors for each cluster to initialize the corre-
sponding columns in W. Their results show faster error reduction than random
initialization, thus saving expensive NMF iterations. However, since this decom-
position must run a clustering algorithm on the columns of A, it is expensive as
a preprocessing step (cf. Langville et al. (2006)).
Langville et al. (2006) also provided some new initialization ideas and com-
pared the aforementioned centroid clustering approach and random seeding to
four new initialization techniques. While two algorithms (Random Acol and
Random C) only slightly decrease the number of NMF iterations and another
algorithm (Co-occurrence) turns out to contain very expensive computations, the
SVD–Centroid algorithm clearly reduces the approximation error and therefore
the number of NMF iterations compared to random initialization. The algo-
rithm initializes W based on a SVD–centroid decomposition (Wild 2002) of
the low-dimensional SVD factor V_{n×k}, which is much faster than a centroid
decomposition on A_{m×n} since V is much smaller than A. Nevertheless, the SVD
factor V must be available for this algorithm, and the computation of V can
obviously be time consuming.
Boutsidis and Gallopoulos (2008) initialized W and H using a technique
called nonnegative double singular value decomposition (NNDSVD) which is
based on two SVD processes, one approximating the data matrix A (rank k
approximation) and the other approximating positive sections of the resulting
partial SVD factors. The authors performed various numerical experiments and
showed that NNDSVD initialization is better than random initialization in terms
of faster convergence and error reduction in all test cases, and generally appears
to be better than the centroid initialization in Wild (2002).
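A compact sketch of the NNDSVD idea in NumPy (simplified from Boutsidis and Gallopoulos (2008); refinements such as their handling of zero entries are omitted, and the function name is ours):

```python
import numpy as np

def nndsvd(A, k):
    """NNDSVD initialization (after Boutsidis and Gallopoulos 2008), simplified:
    seed W and H from the positive/negative sections of a rank-k SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    m, n = A.shape
    W = np.zeros((m, k))
    H = np.zeros((k, n))
    # leading singular pair: for nonnegative A it can be chosen nonnegative
    W[:, 0] = np.sqrt(s[0]) * np.abs(U[:, 0])
    H[0, :] = np.sqrt(s[0]) * np.abs(Vt[0, :])
    for j in range(1, k):
        u, v = U[:, j], Vt[j, :]
        up, un = np.maximum(u, 0), np.maximum(-u, 0)   # positive/negative sections
        vp, vn = np.maximum(v, 0), np.maximum(-v, 0)
        mp = np.linalg.norm(up) * np.linalg.norm(vp)
        mn = np.linalg.norm(un) * np.linalg.norm(vn)
        if mp >= mn:                                   # keep the dominant section
            W[:, j] = np.sqrt(s[j] * mp) * up / (np.linalg.norm(up) + 1e-12)
            H[j, :] = np.sqrt(s[j] * mp) * vp / (np.linalg.norm(vp) + 1e-12)
        else:
            W[:, j] = np.sqrt(s[j] * mn) * un / (np.linalg.norm(un) + 1e-12)
            H[j, :] = np.sqrt(s[j] * mn) * vn / (np.linalg.norm(vn) + 1e-12)
    return W, H
```

The result is a deterministic nonnegative seed that already captures the dominant SVD structure of A.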
4.1.2 Synopsis
This chapter is organized as follows. In Section 4.2 we review some basics of
NMF and make some comments on the interpretability of the basis vectors in
W in the context of email classification (‘basis features’ and ‘basis emails’).
We also provide some information about the data and feature sets used in this
chapter. Some ideas about new NMF initialization techniques are discussed in
Section 4.3, and Section 4.4 focuses on new classification methods based on
NMF. We conclude our work in Section 4.5.
4.2 Background
In this section, we review the definition and characteristics of NMF and give
a brief overview of the two NMF algorithms considered in this work, as well
as their termination criteria and computational complexity. We then describe
the datasets used for experimental evaluation and make some remarks on
the interpretability of the NMF factors W and H in the context of email
classification problems.
Multiplicative update. The update steps for the MU algorithm given in Lee and
Seung (2001) are based on the mean squared error objective function. Adding
ε in each iteration avoids division by zero. A typical value used in practice is
ε = 10^{-9}.
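The MU update rules themselves are not reproduced in this excerpt; the following NumPy sketch shows the standard Lee–Seung multiplicative updates for the Frobenius objective, with ε added to the denominators as described above (a sketch, not the authors' MATLAB implementation; the function name is ours):

```python
import numpy as np

def nmf_mu(A, k, maxiter=100, eps=1e-9, seed=0):
    """Multiplicative-update NMF (Lee and Seung 2001) for the
    Frobenius objective ||A - WH||_F; eps avoids division by zero."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H, then W
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because both updates multiply by nonnegative ratios, W and H stay nonnegative, and the approximation error is nonincreasing from iteration to iteration.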
Both algorithms are iterative and depend on the initialization of W (and H).
Since the iterates generally converge to a local minimum, often several instances
of the algorithm are run using different random initializations, and the best of the
solutions is chosen. A proper nonrandom initialization of W and/or H (depending
on the algorithm) can avoid the need to repeat several factorizations. Moreover,
it may speed up convergence of a single factorization and reduce the error as
defined in Equation (4.1).
Termination criterion
Generally, the termination criterion for NMF algorithms comprises three com-
ponents. The first condition is based on the maximum number of iterations (the
algorithm iterates until the maximal number of iterations is reached). The second
condition is based on the required approximation accuracy (if the approximation
error in Equation (4.1) drops below a predefined threshold, the algorithm stops).
Finally, the third condition is based on the relative change of the factors W and
H from one iteration to another. If this change is below a predefined threshold
δ, then the algorithm also terminates.
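The three-part criterion can be sketched as a single helper (the names are illustrative, not from the chapter):

```python
def should_stop(iteration, maxiter, err, err_tol, rel_change_W, rel_change_H, delta):
    """Combined NMF stopping test: iteration cap, approximation-error
    threshold, and relative change of the factors W and H."""
    if iteration >= maxiter:                        # condition 1: iteration limit
        return True
    if err < err_tol:                               # condition 2: error below threshold
        return True
    if max(rel_change_W, rel_change_H) < delta:     # condition 3: factors stagnate
        return True
    return False
```

Any one of the three conditions suffices to terminate the factorization loop.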
4.2.3 Datasets
The datasets used for evaluation consist of 15 000 email messages, divided into
three groups – ham, spam, and phishing. The email messages were taken partly
from the Phishery 1 and partly from the 2007 TREC corpus.2 The email messages
are described by 133 features. A part of these features is purely text based, other
features comprise online features and features extracted by rule-based filters.
Some of the features specifically test for spam messages, while other features
specifically test for phishing messages. As a preprocessing step we scaled all
feature values to [0,1] to ensure that they have the same range.
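The [0, 1] scaling is a standard per-feature min–max normalization; a sketch (assuming features are stored as columns):

```python
import numpy as np

def scale_01(A):
    """Scale each feature (column) of A to [0, 1]; constant columns map to 0."""
    lo = A.min(axis=0)
    span = A.max(axis=0) - lo
    span[span == 0] = 1.0          # avoid division by zero for constant features
    return (A - lo) / span
```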
The structure of phishing messages tends to differ significantly from the
structure of spam messages, but it may be quite close to the structure of regular
ham messages (because for a phishing message it is particularly important to
look like a regular message from a trustworthy source). A detailed discussion
and evaluation of this feature set has been given in Gansterer and Pölz (2009).
The email corpus was split into two sets (for training and for testing), the
training set consisting of the oldest 4000 email messages of each class (12 000
messages in total), and the test set consisting of the newest 1000 email messages
of each class (3000 messages in total). This chronological ordering of historical
data allows for simulation of the changes and adaptations in spam and phishing
messages which occur in practice. Both email sets are ordered by the classes – the
first group in each set consists of ham messages, followed by spam and phishing
1 http://phishery.internetdefence.net
2 http://trec.nist.gov/data/spam.html
messages. Due to the nature of the features, the data matrices are rather sparse.
The larger (training) set has 84.7% zero entries, and the smaller (test) set has
85.5% zero entries.
4.2.4 Interpretation
A key characteristic of NMF is the representation of basis vectors in W and
the representation of basis coefficients in the second NMF factor H. With these
coefficients the columns of A can be represented in the basis given by the columns
of W. In the context of email classification, W may contain basis features or
basis emails, depending on the structure of the original data. If NMF is applied
to an email × feature matrix (i.e. every row in A corresponds to an email
message), then W contains k basis features. If NMF is applied on the transposed
matrix (feature × email matrix, i.e. every column in A corresponds to an email
message), then W contains k basis email messages.
Basis features. Figure 4.1 shows three basis features ∈ R^{12 000} (for k = 3) for
our training set when NMF is applied to an email × feature matrix. The three
different groups of objects – ham (first 4000 messages), spam (middle 4000 mes-
sages), and phishing (last 4000 messages) – are easy to identify. The group of
phishing emails tends to yield high values for basis feature 1, while basis feature 2
shows the highest values for the spam messages. The values of basis feature 3
are generally smaller than those of basis features 1 and 2, and this basis feature
is clearly dominated by the ham messages.
Figure 4.1 Three basis features for k = 3 (panels Basis Feature 1, 2, and 3),
plotted over the 12 000 training messages.

Basis email messages. The three basis email messages ∈ R^{133} (again for
k = 3) resulting from NMF on the transposed (feature × email) matrix are plotted
in Figure 4.2.

Figure 4.2 Three basis email messages for k = 3 (panels include Basis E-mail
Spam and Basis E-mail Ham), plotted over the 133 features.

The figure shows two features (16 and 102) that have a relatively
high value in all basis emails, indicating that these features do not distinguish
well between the three classes of email. Other features better distinguish between
classes. For example, features 89–91 and 128–130 have a high value in basis
email 1, and are (close to) zero in the other two basis emails. Investigation of the
original data shows that these features tend to have high values for phishing email,
indicating that the first basis email represents a phishing message. Using the same
procedure, the third basis email can be identified to represent ham messages
(indicated by features 100 and 101). Finally, basis email 2 represents spam.
This rich structure observed in the basis vectors should be exploited in the
context of classification methods. However, the structure of the basis vectors
heavily depends on the concrete feature set used. In the following, we discuss the
application of feature selection techniques in the context of NMF initialization.
Information gain. One option for ranking the features of email messages accord-
ing to how well they differentiate the three classes ham, spam, and phishing is
to use their information gain, which is also used to compute splitting criteria for
decision trees. The overall entropy I of a given dataset S is defined as
    I(S) := -\sum_{i=1}^{C} p_i \log_2 p_i,    (4.2)
where C denotes the total number of classes and p_i the portion of instances that
belong to class i. The reduction in entropy or the information gain is computed
for each attribute A according to
    IG(S, A) := I(S) - \sum_{v \in A} \frac{|S_{A,v}|}{|S|} I(S_{A,v}),    (4.3)

where v is a value of A and S_{A,v} is the set of instances where A has value v.
Gain ratio. Information gain favors features which assume many different values.
Since this property of a feature is not necessarily connected with the splitting
information of a feature, we also ranked the features based on their information
gain ratio, which normalizes the information gain and is defined as GR(S, A) :=
IG(S, A)/splitinfo(S, A), where
    splitinfo(S, A) := -\sum_{v \in A} \frac{|S_{A,v}|}{|S|} \log_2 \frac{|S_{A,v}|}{|S|}.    (4.4)
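Equations (4.2)–(4.4) can be sketched directly (assuming discrete attribute values; the chapter's continuous [0, 1] features would first need discretization):

```python
import math
from collections import Counter

def entropy(labels):
    """I(S) = -sum_i p_i log2 p_i over the class distribution of S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """IG(S, A): entropy reduction from splitting S on the values of attribute A."""
    n = len(labels)
    split = {}
    for v, y in zip(values, labels):
        split.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in split.values())

def gain_ratio(values, labels):
    """GR(S, A) = IG(S, A) / splitinfo(S, A), guarding against zero splitinfo."""
    n = len(labels)
    counts = Counter(values)
    splitinfo = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return info_gain(values, labels) / splitinfo if splitinfo > 0 else 0.0
```

Ranking the 133 features then amounts to sorting them by either score.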
4.3.2 FS initialization
After determining the feature ranking based on information gain and gain ratio,
we use the k top-ranked features to initialize W (denoted as FS initialization
in the following). Since feature selection aims at reducing the feature space,
our initialization is applied in the setup where W contains basis features (i.e.
every row in A corresponds to an email message, cf. Section 4.2.4). FS methods
are usually computationally inexpensive (see, e.g., Janecek et al. (2008) for a
comparison of information gain and PCA runtimes) and can thus be used as a
computationally cheap but effective initialization step. A detailed runtime com-
parison of information gain, gain ratio, NNDSVD, random seeding, and other
initialization methods as well as the initialization of H (at the moment H is
randomly seeded) are work in progress.
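A sketch of this FS initialization under one plausible reading (an assumption: the k top-ranked feature columns of the email × feature matrix A are copied into W, and H is seeded randomly, as stated above):

```python
import numpy as np

def fs_initialize(A, scores, k, seed=0):
    """Seed NMF via feature selection: W gets the k feature columns of A with
    the highest ranking scores (e.g. information gain); H stays random.
    A is email x feature, so each selected column is a candidate basis feature."""
    rng = np.random.default_rng(seed)
    top = np.argsort(scores)[::-1][:k]        # indices of the k best features
    W0 = A[:, top].astype(float).copy()
    H0 = rng.random((k, A.shape[1]))
    return W0, H0
```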
Results. Figures 4.3 and 4.4 show the NMF approximation error for our new
initialization strategy for both information gain (infogain) and gain ratio feature
ranking as well as for NNDSVD and random initialization when using the ALS
algorithm. As a baseline, the figures also show the approximation error based
on an SVD of A, which gives the best possible rank k approximation of A.
For rank k = 1, all NMF variants achieve the same approximation error as the
SVD, but for higher values of k the SVD has a smaller approximation error than
the NMF variants (as expected, since SVD gives the best rank k approxima-
tion in terms of approximation error). Note that when the maximum number of
iterations inside a single NMF factorization (maxiter) is high (maxiter = 30 in
Figure 4.4), the approximation errors are very similar for all initialization strate-
gies used and are very close to the best approximation computed with SVD. On
the other hand, with a small number of iterations (maxiter = 5 in Figure 4.3), it
is clearly visible that random seeding cannot compete with initialization based on
NNDSVD and feature selection. Moreover, for this small value of maxiter, the
FS initializations (both information gain and gain ratio ranking) show better error
Figure 4.3 Approximation error for different initialization strategies and varying
rank k using the ALS algorithm ( maxiter = 5).
Figure 4.4 Approximation error for different initialization strategies and varying
rank k using the ALS algorithm ( maxiter = 30).
reduction than NNDSVD with increasing rank k. For higher values of maxiter
the gap between the different initialization strategies decreases until the error
curves become basically identical when maxiter is about 30 (see Figure 4.4).
Runtime. In this subsection we analyze runtimes for computing NMF for differ-
ent values of rank k and different values of maxiter using the ALS algorithm. All
runtime comparisons in this chapter were measured on a SUN Fire X4600M2
with eight AMD quad-core Opteron 8356 processors (32 cores overall) with
2.3 GHz CPU and 64 GB of memory. Since the MATLAB implementation of the
ALS algorithm is not the best implementation in terms of runtime, we computed
the ALS update steps (see Algorithm 3) using an economy-size QR factoriza-
tion: that is, only the first n columns of the QR factorization factors Q and R are
computed (here n is the smaller dimension of the original data matrix A). This
saves computation time (about 3.7 times faster than the original ALS algorithm
implemented in MATLAB), but achieves identical results to the MATLAB imple-
mentation. The algorithms terminated when the number of iterations exceeded
the predefined threshold maxiter; that is, the approximation error was not inte-
grated in the stopping criterion. Consequently, the runtimes do not depend on
the initialization strategy used (neglecting the marginal runtime savings due to
sparse initializations). In this setup, a linear relationship between runtime and the
rank k can be observed. Reducing the number of iterations (lower values of
maxiter) brings important reductions in runtimes. This underlines the benefits of
our new initialization techniques. As Figure 4.3 has shown, our FS initialization
reduces the number of iterations required for achieving a certain approximation
error compared to existing approaches.
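One ALS sweep via economy-size QR can be sketched as follows (an illustration of the speed-up idea, not the chapter's Algorithm 3; projecting negative entries to zero is the usual ALS nonnegativity step, and R is assumed nonsingular):

```python
import numpy as np

def als_qr_step(A, W, H):
    """One ALS sweep for A ~ WH using economy-size QR: solve the two
    least-squares problems and project negative entries to zero."""
    Q, R = np.linalg.qr(W, mode='reduced')          # economy-size QR of W
    H = np.maximum(np.linalg.solve(R, Q.T @ A), 0)  # H step, then clip
    Q, R = np.linalg.qr(H.T, mode='reduced')        # economy-size QR of H^T
    W = np.maximum(np.linalg.solve(R, Q.T @ A.T).T, 0)  # W step, then clip
    return W, H
```

Only the first k columns of Q and the k × k triangle R are formed, which is what makes this cheaper than a full factorization.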
Table 4.1 compares runtimes needed to achieve different approximation error
thresholds with different values of maxiter for different initialization strategies.
Table 4.1 Runtime comparison.
||A−WH||F maxiter 5 maxiter 10 maxiter 15 maxiter 20 maxiter 25 maxiter 30
Gain ratio initialization
0.10 0.6 s (k = 17) 1.0 s (k = 11) 1.5 s (k = 11) 2.0 s (k = 11) 2.2 s (k = 10) 2.7 s (k = 10)
0.08 0.9 s (k = 27) 1.5 s (k = 22) 2.2 s (k = 21) 2.9 s (k = 19) 3.1 s (k = 19) 3.3 s (k = 19)
0.06 1.1 s (k = 32) 2.0 s (k = 30) 2.8 s (k = 28) 3.7 s (k = 28) 4.6 s (k = 27) 5.4 s (k = 26)
0.04 1.5 s (k = 49) 2.4 s (k = 40) 3.9 s (k = 40) 5.0 s (k = 40) 6.3 s (k = 38) 7.2 s (k = 38)
Information gain initialization
0.10 0.6 s (k = 18) 1.0 s (k = 12) 1.6 s (k = 12) 1.8 s (k = 10) 2.2 s (k = 10) 2.7 s (k = 10)
0.08 1.0 s (k = 28) 1.5 s (k = 22) 2.3 s (k = 22) 2.9 s (k = 19) 3.1 s (k = 19) 3.3 s (k = 19)
0.06 1.5 s (k = 48) 2.0 s (k = 30) 3.0 s (k = 30) 3.7 s (k = 28) 4.6 s (k = 28) 5.4 s (k = 26)
0.04 1.6 s (k = 50) 2.5 s (k = 42) 4.1 s (k = 42) 5.1 s (k = 41) 6.3 s (k = 38) 7.2 s (k = 38)
NNDSVD initialization
0.10 0.6 s (k = 15) 1.0 s (k = 12) 1.6 s (k = 12) 1.8 s (k = 10) 2.2 s (k = 10) 2.7 s (k = 10)
0.08 n.a. 1.7 s (k = 25) 2.6 s (k = 25) 2.6 s (k = 18) 3.1 s (k = 19) 3.2 s (k = 18)
0.06 n.a. 2.1 s (k = 32) 3.1 s (k = 32) 3.9 s (k = 29) 4.6 s (k = 28) 5.7 s (k = 30)
0.04 n.a. n.a. n.a. 5.1 s (k = 41) 6.3 s (k = 38) 7.2 s (k = 38)
Random initialization
0.10 n.a. 0.9 s (k = 10) 1.4 s (k = 10) 1.8 s (k = 10) 2.2 s (k = 10) 2.7 s (k = 10)
0.08 n.a. 1.5 s (k = 22) 2.3 s (k = 22) 2.5 s (k = 17) 3.1 s (k = 19) 3.3 s (k = 19)
0.06 n.a. n.a. n.a. 4.1 s (k = 31) 4.5 s (k = 26) 5.4 s (k = 26)
0.04 n.a. n.a. n.a. 5.4 s (k = 45) 6.7 s (k = 42) 7.3 s (k = 39)
Obviously, a given approximation error ||A − WH||F can be achieved much
faster with small maxiter and high rank k than with high maxiter and small rank
k. As can be seen in Table 4.1, an approximation error of 0.04 or smaller can
be achieved in 1.5 and 1.6 seconds, respectively, when using gain ratio and
information gain initialization (here, only five iterations (maxiter) are needed
to achieve an approximation error of 0.04). To achieve the same approximation
error with NNDSVD or random initialization, more than 5 seconds are needed
(here, 20 iterations are needed to achieve the same approximation error).
Classification results. For lower ranks (k < 30), the SVMinfogain results are
markedly below the results achieved with nonrandomly initialized NMF (info-
gain, gainratio, and nndsvd). This is not very surprising, since W contains
compressed information about all features (even for small values of k). Random NMF
Figure 4.5 SVM (RBF kernel) classification accuracy for different initialization
methods using the MU algorithm ( maxiter = 5).
Figure 4.6 SVM (RBF kernel) classification accuracy for different initialization
methods using the MU algorithm ( maxiter = 30).
Review of VSM and standard LSI. A VSM (Raghavan and Wong 1999) is a
widely used algebraic model where objects and queries are represented as vectors
in a potentially very high-dimensional metric vector space. Generally speaking,
given a query vector q, the distances of q to all objects in a given feature × object
matrix A can be measured (for example) in terms of the cosines of the angles
between q and the columns of A. The cosine of the angle ϕ_i between q and the
ith column of A can be computed as
    (VSM): cos ϕ_i = \frac{e_i^T A^T q}{\|A e_i\|_2 \, \|q\|_2}.    (4.5)
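Equation (4.5) in NumPy, scoring all columns of A at once:

```python
import numpy as np

def vsm_cosines(A, q):
    """cos(phi_i) between query q and every column of the
    feature x object matrix A, as in Equation (4.5)."""
    num = A.T @ q                                       # e_i^T A^T q for all i
    den = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    return num / den
```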
LSI (Langville 2005) is a variant of the basic VSM. Instead of the original
matrix A, the SVD is used to construct a low-rank approximation A_k of A, such
that A = U\Sigma V^T ≈ U_k \Sigma_k V_k^T =: A_k. When A is replaced with A_k,
then the cosine of the angle ϕ_i between q and the ith column of A is approximated as
    (SVD-LSI): cos ϕ_i ≈ \frac{e_i^T V_k \Sigma_k U_k^T q}{\|U_k \Sigma_k V_k^T e_i\|_2 \, \|q\|_2}.    (4.6)
Since some terms on the right-hand side of this equation only need to be computed
once for different queries (e_i^T V_k \Sigma_k and \|U_k \Sigma_k V_k^T e_i\|_2), LSI saves storage
and computational cost. Further, the approximated data often gives a cleaner and
more efficient representation of the relationship between data elements (Langville
et al. 2006) and can uncover latent information in the data.
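A sketch of SVD-based LSI with the query-independent parts of Equation (4.6) precomputed (function names are ours):

```python
import numpy as np

def lsi_svd_precompute(A, k):
    """One-time part of SVD-based LSI: the rows e_i^T V_k S_k and the
    column norms ||U_k S_k V_k^T e_i||_2 from Equation (4.6)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
    left = Vk @ Sk                                   # row i is e_i^T V_k S_k
    norms = np.linalg.norm(Uk @ Sk @ Vk.T, axis=0)   # column norms of A_k
    return Uk, left, norms

def lsi_svd_cosines(Uk, left, norms, q):
    """Per-query part: cos(phi_i) ~ (e_i^T V_k S_k U_k^T q) / (norms_i ||q||_2)."""
    return (left @ (Uk.T @ q)) / (norms * np.linalg.norm(q))
```

With k equal to the full rank, the scores coincide with the basic VSM cosines.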
Figure 4.7 Overview: (a) basic VSM; (b) LSI using SVD; (c) LSI using NMF.
The first classifier, which we call NMF-LSI, simply replaces the approximation
within LSI with a different approximation. Instead of using U_k \Sigma_k V_k^T, we
approximate A with A_k := W_k H_k
from the rank k NMF. Note that when using NMF, the value of k must be fixed
prior to the computation of W and H. The cosine of the angle between q and
the i th column of A can then be approximated as
    (NMF-LSI): cos ϕ_i ≈ \frac{e_i^T H_k^T W_k^T q}{\|W_k H_k e_i\|_2 \, \|q\|_2}.    (4.7)
To save computational cost, the leftmost term in the denominator and the
leftmost part of the numerator (both involving W_k and H_k) can be computed a
priori.
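The same precomputation idea for Equation (4.7):

```python
import numpy as np

def nmf_lsi_precompute(Wk, Hk):
    """One-time part of NMF-based LSI (Equation (4.7)): the numerator
    factor H_k^T W_k^T and the column norms ||W_k H_k e_i||_2."""
    left = Hk.T @ Wk.T
    norms = np.linalg.norm(Wk @ Hk, axis=0)
    return left, norms

def nmf_lsi_cosines(left, norms, q):
    """Per-query cosines: cos(phi_i) ~ e_i^T H_k^T W_k^T q / (norms_i ||q||_2)."""
    return (left @ q) / (norms * np.linalg.norm(q))
```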
The second classifier, which we call NMF-BCC (NMF Basis Coefficient
Classifier), is based on the idea that the basis coefficients in H can be used to
classify new email. These coefficients are representations of the columns of A in
the basis given by W. If W, H, and q are given, we can calculate a column vector
x minimizing \|Wx − q\|_2 and approximate the cosines as
    (NMF-BCC): cos ϕ_i ≈ \frac{e_i^T H^T x}{\|H e_i\|_2 \, \|x\|_2}.    (4.9)
It is obvious that the computation of the cosines in Equation (4.9) is much
faster than for both other LSI variants mentioned earlier (since usually H is a
much smaller matrix than A), but the computation of x causes additional cost.
These aspects will be discussed further at the end of this section.
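A sketch of NMF-BCC; since the exact minimization for x is not reproduced in this excerpt, an ordinary least-squares fit of q in the basis W is assumed here:

```python
import numpy as np

def nmf_bcc_cosines(W, H, q):
    """NMF-BCC: represent query q in the basis W by least squares
    (x minimizing ||Wx - q||_2, an assumption for the omitted step),
    then compare x against the basis coefficients in H (Equation (4.9))."""
    x, *_ = np.linalg.lstsq(W, q, rcond=None)
    return (H.T @ x) / (np.linalg.norm(H, axis=0) * np.linalg.norm(x))
```

Only the small k × n matrix H enters the per-query cosine computation, which is why this variant classifies faster than the LSI variants.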
Figure 4.8 Classification accuracy for different LSI variants and VSM
( maxiter = 5).
Figure 4.9 Classification accuracy for different LSI variants and VSM
( maxiter = 30).
NMF-BCC(mu) show comparable results (see Figure 4.9). For many values of
k, the NMF variants achieved better classification accuracy than a basic VSM
with all original features. Moreover, the standard ALS variant (NMF-LSI(als))
achieves very comparable results to LSI based on SVD, especially for small
values of rank k (between 5 and 10). Note that this improvement of a few percent
is substantial in the context of email classification. Moreover, as discussed in
Section 4.2.4, the purely nonnegative linear representation within NMF makes
the interpretation of the NMF factors much easier than that for the standard LSI
factors. It is interesting to note that initialization of the factors W and H does
not improve the classification accuracy when using the NMF-LSI and NMF-BCC
classifiers. This is in contrast to the previous sections – especially when maxiter
is small, the initialization was important for the SVM.
Runtimes. The computational runtime for all LSI variants comprises two steps.
Prior to the classification process, the low-rank approximations of SVD and NMF,
respectively, have to be computed. Afterward, any newly arriving email message
(a single query vector) has to be classified.
Figure 4.10 shows the runtimes needed for computing the low-rank approx-
imations, and Figure 4.11 shows the runtimes for the classification process of
a single query vector. As already mentioned in Section 4.3.2, the NMF run-
times depend almost linearly on the value of maxiter. Figure 4.10 shows that for
almost any given rank k, the computation of an SVD takes much longer than
an NMF factorization with maxiter = 5, but is faster than a factorization with
maxiter = 30. For computing the SVD we used MATLAB’s svds() function,
which computes only the first k largest singular values and associated singular
Figure 4.10 Runtimes for computing low-rank approximations based on SVD and
variants of NMF of a 12 000 × 133 matrix ( alsqr(30) refers to the ALS algorithm
computed with explicit QR factorization and maxiter set to 30).
Figure 4.11 Runtimes for classifying a single query vector.
vectors of a matrix. The computation of the complete SVD usually takes much
longer (but is not needed in this context). There is only a small difference in the
runtimes for computing the ALS algorithm (using the economy-size QR factor-
ization, cf. Section 4.3.2) and the MU algorithm, and, of course, no difference
between the NMF-LSI and the NMF-BCC runtimes (since the NMF factorization
has to be computed identically for both approaches). The difference in the compu-
tational cost between NMF-LSI and NMF-BCC is embedded in the classification
process of query vectors, not in the factorization process of the training data.
Looking at the classification runtimes in Figure 4.11, it can be seen that the
classification process using the basis coefficients (NMF-BCC) is faster than for
SVD-LSI and NMF-LSI. Although the classification times for a single email are
modest, they have to be considered for every single email that is classified. The
classification (performed in MATLAB) of all 3000 email messages in our test
sample took about 36 seconds for NMF-LSI, 24 seconds for SVD-LSI, and only
13 seconds for NMF-BCC (for rank k = 50).
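As a rough illustration of this trade-off, the sketch below (Python/NumPy standing in for the chapter's MATLAB code; the matrix size, seed, and all names are our own, and NumPy's full SVD is truncated to rank k where MATLAB's svds computes only the leading k triplets) compares the two low-rank factorizations on a random matrix:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1200, 133))  # stand-in for the 12 000 x 133 term-by-email matrix
k = 20

t0 = time.perf_counter()
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # svds would compute only k triplets
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]
t_svd = time.perf_counter() - t0

def nmf_mu(A, k, maxiter, seed=0):
    """NMF via the Lee-Seung multiplicative updates for min ||A - WH||_F."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    eps = 1e-9  # guards against division by zero
    for _ in range(maxiter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

t0 = time.perf_counter()
W, H = nmf_mu(A, k, maxiter=5)
t_nmf = time.perf_counter() - t0

err_svd = np.linalg.norm(A - (Uk * sk) @ Vtk)  # Frobenius error of rank-k SVD
err_nmf = np.linalg.norm(A - W @ H)
```

By the Eckart-Young theorem the truncated SVD always attains the smaller approximation error for a given rank; the question studied above is how much extra runtime that optimality costs.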
Rectangular versus square data. Since the dimensions of the email data matrix
used in this work are very imbalanced (12 000 × 133), we also compared runtime
and approximation errors for data of the same size, but with balanced dimensions.
We created square random matrices of dimension √(133 × 12 000) ≈ 1263 and
performed experiments on them identical to those in the previous section.
Figure 4.12 shows the runtime needed to compute the first k largest singular
values and associated singular vectors for SVD (again using the svds() function
from MATLAB) as well as the two NMF factorizations with different values of
maxiter. For square A, the computation of the SVD takes much longer than for
unbalanced dimensions. In contrast, both NMF approximations can be computed
much faster (cf. Figure 4.10). For example, the computation of an SVD of rank
k = 50 takes about eight times longer than the computation of an NMF of the
same rank.
The approximation error for square random data is shown in Figure 4.13. The
approximation error of both SVD and NMF is generally higher than for the email
dataset (see Figures 4.3 and 4.4). It is interesting to note that the approximation
error of the ALS algorithm decreases with increasing k until k ≈ 35, and then
increases again with higher values of k. Nevertheless, especially for smaller
values of k, the ALS algorithm achieves an approximation error comparable to
the SVD with much lower computational runtimes.
Figure 4.12 Runtimes for computing low-rank approximations based on SVD
and variants of NMF of a random 1263 × 1263 matrix.
Figure 4.13 Approximation error for low-rank approximations based on SVD
and variants of NMF on a random 1263 × 1263 matrix.
4.5 Conclusions
The application of nonnegative matrix factorization (NMF) to ternary email clas-
sification tasks (ham vs. spam vs. phishing messages) has been investigated. We
have introduced a fast initialization technique based on feature subset selection
(FS initialization) which significantly reduces the approximation error of the NMF
compared to randomized seeding of the NMF factors W and H. Comparison of
our approach to existing initialization strategies such as NNDSVD (Boutsidis
and Gallopoulos 2008) shows basically the same accuracy when many NMF
iterations are performed, and much better accuracy when the NMF algorithm is
restricted to a small number of iterations.
Moreover, we investigated and evaluated two new classification methods
which are based on NMF. We showed that using the basis features of W generally
achieves much better results than using the original features. While the number
of iterations (maxiter) in the iterative process for computing the NMF seems
to be a crucial factor for the classification accuracy when random initialization
is used, the classification results achieved with FS initialization and NNDSVD
depend only weakly on this parameter, leading to high classification accuracy
even for small values of maxiter (see Figures 4.5 and 4.6). This is in contrast to
the approximation error illustrated in Figures 4.3 and 4.4, where the number of
iterations is important for all initialization variants.
As a second classification method we constructed NMF-based classifiers to
be applied on newly arriving email messages without recomputing the NMF. For
this purpose, we introduced two LSI classifiers based on NMF (computed with
the ALS algorithm) and compared them to standard LSI based on SVD. Both
new variants achieved a classification accuracy comparable to standard LSI when
using the ALS algorithm and can often be computed faster, especially when the
dimensions of the original data matrix are close to each other (in this case, the
computation of the SVD usually takes much longer than an NMF factorization).
A copy of the codes used in this chapter is available from the authors or at
http://rlcta.univie.ac.at.
Future work. Our investigations indicate several important and interesting direc-
tions for future work. First of all, we will focus on analyzing the computational
cost of various initialization strategies (FS initialization vs. NNDSVD etc.). More-
over, we will look at updating schemes for our NMF-based LSI approach, since
for real-time email classification a dynamical adaptation of the training data (i.e.
adding new email to the training set) is essential. We also plan to work on
strategies for the initialization of H (currently, H is randomly initialized) for our
FS initialization (Section 4.3) and the comparison of the MU and ALS algo-
rithms to other NMF algorithms (gradient descent, algorithms with sparseness
constraints, etc.).
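For concreteness, the ALS iteration referred to throughout this chapter can be sketched as follows. This is an illustrative Python version under our own naming, not the chapter's alsqr code: it replaces the explicit economy-size QR factorization by NumPy's least-squares solver and projects negative entries to zero after each solve:

```python
import numpy as np

def nmf_als(A, k, maxiter=30, seed=0):
    """Alternating least squares NMF: solve each unconstrained
    least-squares subproblem exactly, then clip negatives to zero."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k))
    H = rng.random((k, A.shape[1]))
    for _ in range(maxiter):
        # A ~ W H: solve for H given W, then for W given H
        H = np.linalg.lstsq(W, A, rcond=None)[0].clip(min=0)
        W = np.linalg.lstsq(H.T, A.T, rcond=None)[0].T.clip(min=0)
    return W, H

A = np.random.default_rng(1).random((60, 40))
W, H = nmf_als(A, k=5)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Unlike the multiplicative updates, each ALS half-step solves its subproblem exactly before projecting, which is one reason the two algorithms can behave differently for small maxiter.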
4.6 Acknowledgements
We gratefully acknowledge financial support from the CPAMMS-Project (grant#
FS397001) in the Research Focus Area ‘Computational Science’ of the University
of Vienna. We also thank David Poelz for providing us with detailed information
about the characteristics of the email features.
References
Berry MW, Browne M, Langville AN, Pauca PV and Plemmons RJ 2007 Algorithms and
applications for approximate nonnegative matrix factorization. Computational Statistics
& Data Analysis 52(1), 155–173.
Boutsidis C and Gallopoulos E 2008 SVD based initialization: A head start for nonnegative
matrix factorization. Pattern Recognition 41(4), 1350–1362.
Chang CC and Lin CJ 2001 LIBSVM: a library for support vector machines. Software
available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
Dhillon IS and Modha DS 2001 Concept decompositions for large sparse text data using
clustering. Machine Learning 42(1), 143–175.
Dhillon IS and Sra S 2006 Generalized nonnegative matrix approximations with Bregman
divergences. Advances in Neural Information Processing Systems 18: Proceedings of
the 2005 Conference, pp. 283–290.
Drineas P, Kannan R and Mahoney MW 2004 Fast Monte Carlo algorithms for matrices
III: Computing a compressed approximate matrix decomposition. SIAM Journal on
Computing 36(1), 184–206.
Gansterer WN and Pölz D 2009 E-mail classification for phishing defense. In Advances in
Information Retrieval, 31st European Conference on IR Research, ECIR 2009, Toulouse,
France, April 6–9, 2009. Proceedings (ed. Boughanem M, Berrut C, Mothe J and
Soulé-Dupuy C), vol. 5478 of Lecture Notes in Computer Science. Springer.
Gansterer WN, Janecek A and Kumer KA 2008a Multi-level reputation-based greylisting.
Proceedings of Third International Conference on Availability, Reliability and Security
(ARES 2008), pp. 10–17. IEEE Computer Society, Barcelona, Spain.
Gansterer WN, Janecek A and Neumayer R 2008b Spam filtering based on latent semantic
indexing. In: Survey of Text Mining 2, vol. 2, pp. 165–183. Springer.
Golub GH and Van Loan CF 1996 Matrix Computations (Johns Hopkins Studies in Math-
ematical Sciences). The Johns Hopkins University Press.
Gorsuch RL 1983 Factor Analysis 2nd edn. Lawrence Erlbaum.
Janecek A, Gansterer WN, Demel M and Ecker GF 2008 On the relationship between
feature selection and classification accuracy. JMLR: Workshop and Conference Pro-
ceedings 4, 90–105.
Langville AN 2005 The linear algebra behind search engines. Journal of Online Mathe-
matics and its Applications (JOMA), 2005, Online Module.
Langville AN, Meyer CD and Albright R 2006 Initializations for the nonnegative matrix
factorization. Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Lee DD and Seung HS 1999 Learning the parts of objects by non-negative matrix fac-
torization. Nature 401(6755), 788–791.
Lee DD and Seung HS 2001 Algorithms for non-negative matrix factorization. Advances
in Neural Information Processing Systems 13, 556–562.
Li X, Cheung WKW, Liu J and Wu Z 2007 A novel orthogonal NMF-based belief com-
pression for POMDPs. Proceedings of the 24th International Conference on Machine
Learning, pp. 537–544.
Linde Y, Buzo A and Gray RM 1980 An algorithm for vector quantizer design. IEEE
Transactions on Communications 28(1), 702–710.
Paatero P and Tapper U 1994 Positive matrix factorization: A non-negative factor
model with optimal utilization of error estimates of data values. Environmetrics 5(2),
111–126.
Raghavan VV and Wong SKM 1999 A critical analysis of vector space model for informa-
tion retrieval. Journal of the American Society for Information Science 37(5), 279–287.
Robila S and Maciak L 2009 Considerations on parallelizing nonnegative matrix factor-
ization for hyperspectral data unmixing. Geoscience and Remote Sensing Letters 6(1),
57–61.
Wild SM 2002 Seeding non-negative matrix factorization with the spherical k-means
clustering. Master’s Thesis, University of Colorado.
Wild SM, Curry JH and Dougherty A 2003 Motivating non-negative matrix factorizations.
Proceedings of the Eighth SIAM Conference on Applied Linear Algebra.
Wild SM, Curry JH and Dougherty A 2004 Improving non-negative matrix factorizations
through structured initialization. Pattern Recognition 37(11), 2217–2232.
5

Constrained Clustering with k-Means Type Algorithms
5.1 Introduction
Clustering is a fundamental data analysis task that has numerous applications in
many disciplines. Clustering can be broadly defined as a process of partitioning
a dataset into groups, or clusters, so that elements of the same cluster are more
similar to each other than to elements of different clusters.
In many cases additional information about the desired type of clusters is
available (e.g. Basu et al. (2009)). When incorporated into the clustering pro-
cess this information may lead to better clustering results. Motivated by Basu
et al. (2004) we consider pairwise constrained clustering. In pairwise constrained
clustering, we may have information about pairs of vectors that may not belong
to the same cluster (cannot-links), information about pairs of vectors that must
belong to the same cluster (must-links), or both. (For the first introduction of
constrained clustering with a focus on instance-level constraints see Wagstaff
and Cardie (2000) and Wagstaff et al. (2001).)
We focus on three k-means type clustering algorithms and two different
distance-like functions. The clustering algorithms are k-means (Duda et al.
2000), smoka (Teboulle and Kogan 2005), and spherical k-means (Dhillon and
Modha 1999). The distance-like functions are ‘reverse Bregman divergence’ (see
e.g. Kogan (2007a)) and ‘cosine’ similarity (see e.g. Berry and Browne (1999)).
We show that for these algorithms and distance-like functions the pairwise
constrained clustering problem can be reduced to clustering with cannot-link
constraints only. We substitute cannot-link constraints by penalty, and propose
clustering algorithms that tackle clustering with penalties.
The chapter is organized as follows. In Section 5.2 we introduce basic nota-
tions, and briefly review batch and incremental versions of classical quadratic
k-means. Section 5.3 presents the clustering algorithm equipped with Bregman
divergences and constraints. We show by an example that a straightforward adop-
tion of batch k-means may lead to erroneous results, and introduce a modification
of incremental k-means that generates a sequence of partitions with improved
quality. We show that must-link constraints can be eliminated (the elimination
technique is based on the methodology proposed in Zhang et al. (1997)). When
information about a large number of must-linked vectors is available, the pro-
posed elimination technique may significantly reduce the size of the dataset.
Section 5.4 introduces a smoka type clustering with constraints (see e.g. Teboulle
and Kogan (2005) and Teboulle (2007)). Elimination of must-link constraints is
based on results reported in Kogan (2007b). Section 5.5 presents spherical k-
means with constraints. Numerical experiments that illustrate the usefulness of
constraints are collected in Section 5.6. Brief conclusions and future research
directions are given in Section 5.7.
    Q(A) = \sum_{i=1}^{m} d(c, a_i),  where  c = c(A)                           (5.2)
CONSTRAINED CLUSTERING WITH k-MEANS TYPE ALGORITHMS 83
(we set Q(∅) = 0 for convenience). Let \Pi = \{\pi_1, \ldots, \pi_k\} be a partition of A,
i.e.

    \bigcup_i \pi_i = A,  and  \pi_i \cap \pi_j = \emptyset  if  i \neq j.

We aim to find a partition \Pi^{\min} = \{\pi_1^{\min}, \ldots, \pi_k^{\min}\} that minimizes the value of
the objective function Q. The problem is known to be NP-hard (see e.g. Brucker
(1978)) and we seek algorithms that generate ‘reasonable’ solutions.
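For reference, a plain (unconstrained) batch k-means of the kind reviewed in this section might be sketched as follows (illustrative Python; names and toy data are ours):

```python
import numpy as np

def batch_kmeans(A, k, iters=100, seed=0):
    """Plain batch k-means: alternate the assignment step and the
    centroid update until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(iters):
        # squared Euclidean distance from every point to every centroid
        d = ((A[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([A[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# two well-separated toy blobs
blob = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05]])
A = np.vstack([blob, blob + 10.0])
labels, centroids = batch_kmeans(A, k=2, seed=3)
```

The constrained variants discussed below modify the assignment step of exactly this loop, which is where the penalties enter.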
It is easy to see that centroids and partitions are associated as follows:
1. Given a partition \Pi = \{\pi_1, \ldots, \pi_k\} of the set A one can define the
corresponding centroids \{c(\pi_1), \ldots, c(\pi_k)\} by Equation (5.1).

2. Given k centroids x_1, \ldots, x_k one can define a corresponding partition
\{\pi_1, \ldots, \pi_k\} of A, as in Equation (5.4), by assigning each a \in A to a
nearest centroid.
We first show that the assignment step (i.e. Equation (5.9)) may lead to erroneous
results.
Example 5.3.1 Consider the one-dimensional dataset A = \{a_1, \ldots, a_5\} shown in
Figure 5.1 and the three-cluster partition

    \Pi = \{\pi_1, \pi_2, \pi_3\}
with
π1 = {a1 , a2 }, π2 = {a3 }, π3 = {a4 , a5 }
(see Figure 5.1 where the clusters are encircled ). Note that
    Q(\Pi) = (2 + p) + 0 + (2 + p) = 4 + 2p = 12.
The assignment step then produces a partition \Pi' with

    Q(\Pi') = 0 + (3p + 2(0.9)^2) + 0 = 1.62 + 3p = 13.62 > Q(\Pi),

i.e. the reassignment increases the value of the objective function.
[Figure 5.1: the one-dimensional dataset a_1, \ldots, a_5 with the three clusters
\pi_1, \pi_2, \pi_3 encircled.]
[Figure 5.2: the dataset a_1, \ldots, a_5 with the partition \Pi'.]
The change in the objective function caused by reassignment of a vector a from
cluster \pi_i to cluster \pi_j is

    \frac{m_i}{m_i - 1}\,\|c_i - a\|^2 - \frac{m_j}{m_j + 1}\,\|c_j - a\|^2
    + \sum_{a' \in \pi_i} p(a, a') - \sum_{a' \in \pi_j} p(a, a')               (5.11)
(see Equation (5.6)). We denote by \Delta(a) the maximal value of the right hand
side of Equation (5.11) over j = 1, \ldots, k. We note that removal of a from \pi_i
and assigning it back to \pi_i is a reassignment with zero change of the objective.
Hence \Delta(a), the maximal value of the right hand side of Equation (5.11), is
always nonnegative. To minimize the objective we shall select a vector a whose
reassignment maximizes \Delta(a). The incremental k-means algorithm we propose
is given next. A single iteration of the algorithm applied to either one of the
partitions \Pi or \Pi' of Example 5.3.1 generates a partition
[Figure 5.3: the partition of a_1, \ldots, a_5 generated by the incremental
algorithm.]
The vector set B = \{b_1, \ldots, b_M\} is the new set to be clustered. For two vectors
b_i, b_j \in B the penalty is defined by P(b_i, b_j) = \sum_{a \in \pi_i} \sum_{a' \in \pi_j} p(a, a'),
and for a cluster \pi^B = \{b_{i_1}, \ldots, b_{i_p}\} we set

    Q_B(\pi^B) = \sum_{j=1}^{p} m_{i_j} \|c - b_{i_j}\|^2
    + \frac{1}{2} \sum_{l \neq j} P(b_{i_l}, b_{i_j}),                          (5.12)
where

    c = \frac{m_{i_1} b_{i_1} + \cdots + m_{i_p} b_{i_p}}{m_{i_1} + \cdots + m_{i_p}}

is the (weighted) arithmetic mean of the set \pi^B = \{b_{i_1}, \ldots, b_{i_p}\} and the
associated subset \bigcup_{j=1}^{p} \pi_{i_j} of A. The quality functions of the sets \pi^B
and \bigcup_{j=1}^{p} \pi_{i_j} are related as follows:
    Q\left(\bigcup_{j=1}^{p} \pi_{i_j}\right) = \sum_{j=1}^{p} q_{i_j} + Q_B(\pi^B)        (5.13)
(for the unconstrained case see Kogan (2007a)). Hence, for each pair of associated
partitions \Pi^B and \Pi^A the difference between Q(\Pi^A) and Q_B(\Pi^B) is the same
constant \sum_{i=1}^{M} q_i.
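The bookkeeping behind this reduction, replacing each must-linked group by its mean weighted by the group size, can be verified numerically. A small sketch for the quadratic case with no penalties (data and names invented):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.random((m, 3)) for m in (4, 2, 5)]   # must-linked groups of vectors
b = [g.mean(axis=0) for g in groups]               # representatives b_i
w = [len(g) for g in groups]                       # weights m_i = |group|
q = [((g - bi) ** 2).sum() for g, bi in zip(groups, b)]  # internal qualities q_i

# quality of the merged cluster, computed directly on the original vectors
allpts = np.vstack(groups)
Q_full = ((allpts - allpts.mean(axis=0)) ** 2).sum()

# the same quantity from the weighted representatives (the first term of (5.12))
c = sum(wi * bi for wi, bi in zip(w, b)) / sum(w)  # weighted arithmetic mean
QB = sum(wi * ((c - bi) ** 2).sum() for wi, bi in zip(w, b))
```

The identity Q = Σ q_i + Q_B is Equation (5.13); since the q_i do not depend on the partition, clustering the weighted representatives is equivalent to clustering the original vectors.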
Incremental clustering of the set B is identical to Algorithm 4 with the change in
the objective function caused by reassignment of a vector b from the cluster \pi_i^B
to the cluster \pi_j^B given by

    \frac{M_i\, m(b)}{M_i - m(b)} \|c_i - b\|^2 - \frac{M_j\, m(b)}{M_j + m(b)} \|c_j - b\|^2
    + \sum_{b' \in \pi_i^B} P(b, b') - \sum_{b' \in \pi_j^B} P(b, b'),          (5.14)

where M_l = \sum_{b \in \pi_l^B} m(b). In what follows, we extend these results to Bregman
distances.
With the kernel \psi(x) = \sum_{j=1}^{n} \left( x[j] \log x[j] - x[j] \right) (defined on R^n_+ with the
convention 0 \log 0 = 0), we obtain the Kullback–Leibler relative entropy distance

    D_\psi(x, y) = \sum_{j=1}^{n} \left( x[j] \log \frac{x[j]}{y[j]} + y[j] - x[j] \right)
    \quad \forall (x, y) \in R^n_+ \times R^n_{++}.                             (5.16)
Note that under the additional assumption \sum_{j=1}^{n} x[j] = \sum_{j=1}^{n} y[j] = 1,
the Bregman divergence D_\psi(x, y) reduces to \sum_{j=1}^{n} x[j] \log(x[j]/y[j]) (for
additional examples of Bregman distances see e.g. Banerjee et al. (2005) and
Teboulle et al. (2006)). Note that the Bregman distance D_\psi(x, y) is convex with
respect to the x variable. Hence, centroid computation in Equation (5.1) is an
'easy' optimization problem.
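A direct implementation of the distance in Equation (5.16) (illustrative Python, names ours):

```python
import numpy as np

def kl_divergence(x, y):
    """Kullback-Leibler relative entropy distance of Eq. (5.16):
    the Bregman divergence generated by psi(x) = sum x[j] log x[j] - x[j]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(x * np.log(x / y) + y - x))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.4, 0.5])
```

Since x and y above both sum to 1, kl_divergence(x, y) coincides with \sum_j x[j] \log(x[j]/y[j]), matching the reduction just noted.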
By reversing the order of variables in D_\psi, i.e.

    \overleftarrow{D}_\psi(x, y) = D_\psi(y, x) = \psi(y) - \psi(x) - \nabla\psi(x)(y - x)      (5.17)

(compare with Equation (5.15)) and using the kernel

    \psi(x) = \frac{\nu}{2}\|x\|^2 + \mu \sum_{j=1}^{n} \left( x[j] \log x[j] - x[j] \right),   (5.18)

we obtain

    \overleftarrow{D}_\psi(x, y) = D_\psi(y, x) = \frac{\nu}{2}\|y - x\|^2
    + \mu \sum_{j=1}^{n} \left( y[j] \log \frac{y[j]}{x[j]} + x[j] - y[j] \right).              (5.19)

While in general \overleftarrow{D}_\psi(x, y) given by Equation (5.17) is not necessarily convex
in x, when \psi(x) is given either by \|x\|^2 or by \sum_{j=1}^{n} x[j] \log x[j] - x[j] the
resulting functions \overleftarrow{D}_\psi(x, y) are strictly convex with respect to the first variable.
Extension of Algorithm 4 to ‘reversed’ Bregman distances requires the fol-
lowing:
1. The ability to compute c (π) for a finite set π (see Equation (5.1)).
2. A convenient expression for QB (π B ) of a subset π B = {bi1 , . . . , bip } ⊆ B
(see (5.12)).
3. A convenient formula for the change in the objective function caused
by reassignment of a vector b from the cluster πiB to the cluster πjB (see
(5.14)).
We next list results already available in the literature and relevant to the above
three points. The first result¹ holds for all Bregman divergences with reversed
order of variables \overleftarrow{D}_\psi(x, y) = D_\psi(y, x) (see Banerjee et al. (2005)):

Theorem 5.3.2 If z = (a_1 + \cdots + a_m)/m, then \sum_{i=1}^{m} D_\psi(a_i, z) \leq \sum_{i=1}^{m} D_\psi(a_i, x).
The result shows that the centroid of any set equipped with reversed Bregman
distance is given by the arithmetic mean.
The change in the objective Q caused by moving a vector a from cluster
\pi_i to cluster \pi_j is given by

    (m_i - 1)\left[\psi(c_i^-) - \psi(c_i)\right] - \psi(c_i)
    + (m_j + 1)\left[\psi(c_j^+) - \psi(c_j)\right] + \psi(c_j),                (5.20)

where m_i and m_j denote the size of the clusters \pi_i and \pi_j, c_i^- is the centroid of
\pi_i with a being removed, and c_j^+ is the centroid of \pi_j with a being added (see
Kogan (2007a)).
¹Note that this distance-like function is not necessarily convex with respect to x.
In text mining applications, due to sparsity of the data vector a, most coordinates
of the centroids c^-, c^+, and c coincide. Hence, when the function \psi is separable,
computation of \psi(c_i^-) and \psi(c_j^+) is relatively cheap.
Elimination of must-links requires an analogue of Equations (5.12) and (5.14).
The following two statements are provided by Kogan (2007a):
Theorem 5.3.3 If A = \pi_1 \cup \pi_2 \cup \cdots \cup \pi_k with m_i = |\pi_i|, c_i = c(\pi_i), i = 1, \ldots, k,

    c = c(A) = \frac{m_1}{m} c_1 + \cdots + \frac{m_k}{m} c_k,  where  m = m_1 + \cdots + m_k,

and \Pi = \{\pi_1, \pi_2, \ldots, \pi_k\}, then

    Q(\Pi) = \sum_{i=1}^{k} Q(\pi_i) + \sum_{i=1}^{k} m_i\, d(c, c_i)
           = \sum_{i=1}^{k} Q(\pi_i) + \sum_{i=1}^{k} m_i \left[\psi(c_i) - \psi(c)\right].     (5.21)
The change in the objective function caused by reassignment of a vector b from
the cluster \pi_i^B to the cluster \pi_j^B is given by the analogue of Equation (5.20)
augmented with the penalty terms

    + \sum_{b' \in \pi_i^B} P(b, b') - \sum_{b' \in \pi_j^B} P(b, b').
    Q(\Pi) = \sum_{i=1}^{k} \sum_{a \in \pi_i} \|x_i - a\|^2
           = \sum_{a \in A} \min\left\{ \|x_1 - a\|^2, \ldots, \|x_k - a\|^2 \right\}

           = \lim_{s \to 0} \sum_{a \in A} \left( -s \log \sum_{l=1}^{k} e^{-\|x_l - a\|^2 / s} \right).   (5.24)
The right hand side of Equation (5.24) shows that the problem of finding the
best k-cluster partition with no constraints can be restated as the problem of
identifying the k best centroids x1 , . . . , xk . While both expressions
    \sum_{a \in A} \left( -s \log \sum_{l=1}^{k} e^{-\|x_l - a\|^2 / s} \right)
    \quad \text{and} \quad
    \sum_{a \in A} \min\left\{ \|x_1 - a\|^2, \ldots, \|x_k - a\|^2 \right\}
are functions of x1 , . . . , xk , the one on the left is differentiable, while the one on
the right is not. This observation suggests use of the smooth approximation
    \sum_{a \in A} \left( -s \log \sum_{l=1}^{k} e^{-\|x_l - a\|^2 / s} \right)
in order to approximate optimal centroids. Application of smooth approximations
to k-means clustering appears, for example, in Rose et al. (1990), Marroquin and
Girosi (1993), Nasraoui and Krishnapuram (1995), Teboulle and Kogan (2005),
and Teboulle (2007).
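The smooth approximation just described is easy to experiment with numerically. The sketch below (names ours) evaluates the smoothed minimum with the usual max-shift for numerical stability:

```python
import numpy as np

def smooth_min(d, s):
    """-s log sum_l exp(-d_l / s): a differentiable lower bound on min(d)
    that tends to min(d) as the smoothing parameter s tends to 0."""
    t = np.asarray(d, float) / s
    m = t.min()                       # shift by the minimum to avoid underflow
    return float(s * m - s * np.log(np.exp(m - t).sum()))

d = [4.0, 1.0, 9.0]                   # squared distances to three centroids
```

Because one term of the sum always equals e^0 = 1 after the shift, the smoothed value never exceeds min(d), and it tightens as s shrinks.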
Next we briefly describe smoka clustering with cannot-link constraints only.
For two vectors a, a', and a set of k vectors x_1, \ldots, x_k, one has

    \lim_{s \to 0} \left( -s \log \sum_{i,j=1}^{k} e^{-(\|x_i - a\|^2 + \|x_j - a'\|^2)/s} \right)
    = \min_{i,j} \left\{ \|x_i - a\|^2 + \|x_j - a'\|^2 \right\}.               (5.25)
We denote the left hand side of (5.25) by \psi(a, a'), and define \phi(a, a') as

    \lim_{s \to 0} \left( -s \log \sum_{i=1}^{k} e^{-(\|x_i - a\|^2 + \|x_i - a'\|^2)/s} \right)
    = \min_{i} \left\{ \|x_i - a\|^2 + \|x_i - a'\|^2 \right\}.                 (5.26)
Clearly \psi(a, a') \leq \phi(a, a'), and the equality holds only when a and a' belong to
the same cluster. This observation motivates the introduction of a penalty function
for cannot-linked vectors a, a' as p(a, a') = \rho(\phi(a, a') - \psi(a, a')), where
\rho : R_+ \to R_+ is a monotonically increasing function with \rho(0) = 0, so that
p(a, a') = 0 when a and a' belong to the same cluster (the simplest but, perhaps,
not the best choice for the function \rho is \rho(t) = t).
Since we intend to approximate the right hand side of Equations (5.25) and
(5.26) by the corresponding expressions on the left hand side with 'small' values
of s, we shall consider the penalty function p_s(a, a') = \rho(\phi_s(a, a') - \psi_s(a, a')),
where

    \psi_s(a, a') = -s \log \sum_{i,j=1}^{k} e^{-(\|x_i - a\|^2 + \|x_j - a'\|^2)/s}            (5.27)

and

    \phi_s(a, a') = -s \log \sum_{i=1}^{k} e^{-(\|x_i - a\|^2 + \|x_i - a'\|^2)/s}.             (5.28)
The smoothed objective with cannot-link penalties then becomes

    F_s(x) = \sum_{a \in A} \left( -s \log \sum_{l=1}^{k} e^{-\|x_l - a\|^2 / s} \right)
    + \frac{1}{2} \sum_{a, a' \in A} p_s(x; a, a'),                             (5.29)

and after elimination of must-link constraints (with the weighted representatives
b_i and weights m_i of Section 5.3) the first term reduces to

    -s \sum_{i} m_i \log \sum_{l=1}^{k} e^{-\|x_l - b_i\|^2 / s},               (5.30)

where x = (x_1^T, \ldots, x_k^T)^T. The clustering algorithm is presented next (see
Algorithm 6). The following section describes the constrained clustering algorithm
designed to handle unit length vectors.
Throughout this section the data vectors reside on the unit sphere S^{n-1} in R^n
centered at the origin (when it does not lead to ambiguity we shall denote the
sphere just by S).
For a set of vectors A = \{a_1, \ldots, a_m\} \subset R^n, and the 'distance-like' function
d(x, a) = a^T x, we define the centroid c = c(A) of the set A as a solution of the
maximization problem

    c = \begin{cases} \arg\max_{x \in S} \sum_{a \in A} a^T x & \text{if } a_1 + \cdots + a_m \neq 0, \\ 0 & \text{otherwise.} \end{cases}                                      (5.33)
Note that:
1. For A ⊂ Rn+ (which is typical for many IR applications) the sum of the
vectors in A is never zero, and c (A) is a unit length vector.
2. The quality of the set A is just Q(A) = \sum_{a \in A} a^T c(A) = \|a_1 + \cdots + a_m\|.
3. While the motivation for spherical k-means is provided by IR applications
dealing with vectors with nonnegative coordinates residing on the unit
sphere, Equation (5.34) provides solutions to the maximization problem
in Equation (5.33) for any set A ⊂ Rn .
Spherical batch k-means is a procedure similar to the batch k-means algorithm
with the obvious substitution of min by max in Equation (5.4).
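A sketch of the spherical centroid of Equation (5.33) and of the resulting cluster quality (illustrative Python, names ours):

```python
import numpy as np

def spherical_centroid(A):
    """Eq. (5.33): the maximizer of sum_a a^T x over the unit sphere is
    the normalized vector sum (zero if the sum vanishes)."""
    s = A.sum(axis=0)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else np.zeros(A.shape[1])

A = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]])  # unit-length rows
c = spherical_centroid(A)
quality = float((A @ c).sum())  # equals ||a_1 + ... + a_m||
```

Checking the centroid against arbitrary unit vectors confirms that no other direction attains a larger sum of cosines.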
5.5.1 Spherical k-means with cannot-link constraints only
In the presence of cannot-link constraints we introduce a nonpositive symmetric
penalty function p(a, a') \leq 0. For a cluster \pi we define

    Q(\pi) = \left\| \sum_{a \in \pi} a \right\| + \frac{1}{2} \sum_{a, a' \in \pi} p(a, a').
Figure 5.4 Initial three-cluster partition.
Figure 5.5 Three-cluster partition generated by spherical batch k-means.
[Figure 5.6: a further three-cluster partition of the unit vectors a_1, \ldots, a_5.]
    Q(\pi \cup \pi') = \left\| \sum_{a \in \pi} a + \sum_{a' \in \pi'} a' \right\|
    + \sum_{a \in \pi,\, a' \in \pi'} p(a, a').                                 (5.37)

By setting b = \sum_{a \in \pi} a, b' = \sum_{a' \in \pi'} a', and P(b, b') = \sum_{a \in \pi,\, a' \in \pi'} p(a, a')
one gets

    Q(\pi \cup \pi') = \|b + b'\| + P(b, b').
The quality of the set \pi^B is equal to the quality of the associated subset
\pi^A = \bigcup_{j=1}^{p} \pi_{i_j} of A, i.e. Q_B(\pi^B) = Q(\pi^A).
The incremental spherical k-means algorithm for the dataset A with penalty
function p and must-link constraints is identical to Algorithm 7 applied to the
dataset B equipped with penalty function P and with no must-link constraints.
We denote the overall collection of 3891 documents by DC. Many clustering algo-
rithms are capable of partitioning DC into three clusters with small (but not zero)
‘misclassification’ (see e.g. Dhillon et al. (2003); Dhillon and Modha (2001)).
We preprocess all the text datasets following the methodology of Dhillon et al.
(2003), so that the clustering algorithm deals with 3891 vectors of dimension 600.
An application of PDDP (Principal Direction Divisive Partitioning; see Boley
(1998)) generates the initial three-cluster partition for DC. The confusion matrix
for the partition is given in Table 5.1. This partition is used later as an input for
both Algorithm 4 and Algorithm 7. Both algorithms are applied to the dataset
with no must-link constraints. The penalty function p(a, a ) is defined as follows.
For collection DC0 we sort all the document vectors a_{00}, a_{01}, a_{02}, \ldots with respect
to the distance to the collection average (a_{00} is the nearest). We select the first r_0
vectors a_{00}, a_{01}, \ldots, a_{0,r_0-1} and for each a not in DC0 define p(a_{0i}, a) = p > 0,
i = 0, \ldots, r_0 - 1. For the other two document collections DC1 and DC2 the
penalty function is defined analogously.
²Available from http://www.cs.utk.edu/~lsi.
5.6.1 Quadratic k-means
An application of Algorithm 4 with zero penalty and tol = 0.001 (i.e. just
incremental k-means) to the PDDP generated partition improves the confusion
matrix (see Table 5.2). Algorithm 4 with p = 0.01 generates the final parti-
tion with the confusion matrix given in Table 5.3. The penalty increase to 0.09
leads to the perfect diagonal confusion matrix given in Table 5.4. The values
of penalty versus ‘misclassification’ of final partitions generated by Algorithm 4
with tol = 0.001 are given in Table 5.5. In these experiments r0 = r1 = r2 = 1.
Selection of r0 = r1 = r2 = 2 and penalty values one-half of those shown in
Table 5.5 produce results similar to those collected in Table 5.5.
5.7 Conclusion
The chapter presents three clustering algorithms: constrained k-means, con-
strained spherical k-means, and constrained smoka. Each algorithm is capable
of clustering a vector dataset equipped with must-link constraints and a penalty
function that penalizes violations of cannot-link constraints.
Numerical experiments with the first two algorithms show improvement of
clustering performance in the presence of constraints. At the same time a single
iteration of each algorithm changes the cluster affiliation of one vector only.
A straightforward application of the algorithms to large datasets is, therefore,
impractical.
In contrast, a single iteration of the proposed constrained smoka clustering
changes all k clusters. Numerical experiments with constrained smoka and large
datasets with must-link and cannot-link constraints will be reported elsewhere.
Judicious selection of constraints is of paramount importance to the success of
clustering algorithms. We plan to perform and report experiments with large
datasets equipped with cannot-link and must-link constraints in the near future.
References
Banerjee A, Merugu S, Dhillon IS and Ghosh J 2005 Clustering with Bregman diver-
gences. Journal of Machine Learning Research 6, 1705–1749.
Basu S, Banerjee A and Mooney R 2004 Active semi-supervision for pairwise con-
strained clustering. Proceedings of SIAM International Conference on Data Mining,
pp. 333–344.
Basu S, Davidson I and Wagstaff K 2009 Constrained Clustering. Chapman & Hall/CRC.
Berry M and Browne M 1999 Understanding Search Engines. SIAM.
Boley DL 1998 Principal direction divisive partitioning. Data Mining and Knowledge
Discovery 2(4), 325–344.
Brucker P 1978 On the complexity of clustering problems. Lecture Notes in Economics
and Mathematical Systems, vol. 157, pp. 45–54. Springer.
Dhillon IS and Modha DS 1999 Concept decompositions for large sparse text data using
clustering. Technical Report RJ 10147, IBM Almaden Research Center.
Dhillon IS and Modha DS 2001 Concept decompositions for large sparse text data using
clustering. Machine Learning 42(1), 143–175. Also appears as IBM Research Report
RJ 10147, July 1999.
Dhillon IS, Kogan J and Nicholas C 2003 Feature selection and document clustering. In
Survey of Text Mining (ed. Berry M), pp. 73–100. Springer.
Duda RO, Hart PE and Stork DG 2000 Pattern Classification second edn. John Wiley &
Sons, Inc.
Kogan J 2007a Introduction to Clustering Large and High–Dimensional Data. Cambridge
University Press.
Kogan J 2007b Scalable clustering with smoka. Proceedings of International Conference
on Computing: Theory and Applications, pp. 299–303. IEEE Computer Society Press.
Marroquin J and Girosi F 1993 Some extensions of the k-means algorithm for image
segmentation and pattern classification. Technical Report A.I. Memo 1390, MIT, Cam-
bridge, MA.
Nasraoui O and Krishnapuram R 1995 Crisp interpretations of fuzzy and possibilistic
clustering algorithms. Proceedings of 3rd European Congress on Intelligent Techniques
and Soft Computing, pp. 1312–1318, Aachen, Germany.
Rockafellar RT 1970 Convex Analysis. Princeton University Press.
Rose K, Gurewitz E and Fox C 1990 A deterministic annealing approach to clustering.
Pattern Recognition Letters 11(9), 589–594.
Teboulle M 2007 A unified continuous optimization framework for center-based clustering
methods. Journal of Machine Learning Research 8, 65–102.
Teboulle M and Kogan J 2005 Deterministic annealing and a k-means type smoothing
optimization algorithm for data clustering. In Proceedings of the Workshop on Clus-
tering High Dimensional Data and its Applications (held in conjunction with the Fifth
SIAM International Conference on Data Mining) (ed. Dhillon I, Ghosh J and Kogan J),
pp. 13–22. SIAM, Philadelphia, PA.
Teboulle M, Berkhin P, Dhillon I, Guan Y and Kogan J 2006 Clustering with entropy-like
k-means algorithms. In Grouping Multidimensional Data: Recent Advances in Cluster-
ing (ed. Kogan J, Nicholas C and Teboulle M) Springer pp. 127–160.
Wagstaff K and Cardie C 2000 Clustering with instance-level constraints. Proceedings
of the Seventeenth International Conference on Machine Learning, pp. 1103–1110,
Stanford, CA.
Wagstaff K, Cardie C, Rogers S and Schroedl S 2001 Constrained k-means clustering
with background knowledge. Proceedings of the Eighteenth International Conference
on Machine Learning, pp. 577–584, San Francisco, CA.
Zhang T, Ramakrishnan R and Livny M 1997 BIRCH: A new data clustering algorithm
and its applications. Journal of Data Mining and Knowledge Discovery 1(2), 141–182.
Part II
ANOMALY AND TREND
DETECTION
6
where many different authors may be collaborating on a single document, this
often results in an incredibly complex and difficult to read plot.
Sometimes a quick, complete, and graphical summary of a large document is
all that the user requires. Tag clouds and other similar techniques have proven
highly useful in this area. A tag cloud is a summary of a document or a collection
of documents that relies upon font size, color, and/or text placement to indicate
the relative importance of key terms to the user. The key terms may be chosen
according to any number of schemes, some as simple as a straightforward term
count. Though perhaps not particularly useful for detailed analysis, a tag cloud
is highly effective in summarizing large amounts of text in an easily readable,
and understandable, visual manner.
Another major purpose of text visualization is general text exploration: that
is, a general search for interesting patterns or relationships within the data. Quite
often, the user has very limited prior information regarding the target of his or her
search, thus the term ‘exploration’ describes this type of analysis better. In order
to facilitate it, visualization software in this category typically creates an altered,
graphical term space representation – for example, an interconnected graph of all
of the terms in a book, where terms may be connected based on co-occurrence
within a single chapter or section. Many variations of this approach exist, but
one aspect that most of them have in common is that they are heavily reliant
upon the user’s attention and perception. The user’s ability to notice, interpret,
and understand patterns in the dataset is a critical part of the analysis process
when such software is utilized.
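A minimal version of such a term graph, with co-occurrence counted within each section of an invented toy text, might look like this (illustrative Python; all data and names are ours):

```python
from collections import defaultdict
from itertools import combinations

sections = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a graph of terms in a book",
]

# undirected edge weights: how many sections each term pair shares
edges = defaultdict(int)
for text in sections:
    terms = sorted(set(text.split()))       # distinct terms of one section
    for u, v in combinations(terms, 2):     # every unordered term pair
        edges[(u, v)] += 1
```

The resulting weighted graph is what exploration tools then lay out visually, leaving the detection of interesting patterns to the user's eye.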
Sentiment tracking (and its related visualization software) is a relative new-
comer to the text visualization arena, and yet it is a highly promising technique
that has a great capability for insightful analysis of textual data. Various tech-
niques for sentiment tracking exist. One common approach attempts to connect
adjectives from the text to one of a number of basic emotion descriptor adjectives
via a thesaurus synonym path. The length of the connecting paths determines how
each text adjective is categorized. A percentage breakdown plot may then be con-
structed to indicate the overall content of basic emotions or sentiments within
the text over time.
Many text mining procedures produce unlabeled, textual results (e.g. groups
of interrelated terms that describe features contained in the original input dataset).
In order to draw potentially useful conclusions, further interpretation of these
results is necessary. This often requires a great commitment of time and effort
on the part of human analysts. Visual postprocessing tools tailored for specific
text mining packages can therefore greatly facilitate the analysis process. This
chapter will discuss one such visual tool, FutureLens, in great detail.
Figure 6.1 A tag cloud of the paper in Shutt et al. (2009), generated by the
TagCrowd application.
tag. The font size and color, as well as the orientation of the text (vertical or
horizontal) and the proximity of tags to one another, may be used to convey
information to the observer (Kaser and Lemire 2007). A basic tag cloud gener-
ator is a relatively simple and straightforward program that obtains term counts
from textual data, then generates HTML that takes the term counts into consid-
eration. Frequently, the user is allowed to choose the total number of terms in
the tag cloud summary. The tag cloud generating code then selects these terms
based on the overall counts and generates HTML code where font sizes vary
according to the relative relationship between the overall term counts. Figure 6.1
demonstrates a straightforward and easy-to-use tag cloud generator application,
TagCrowd (Steinbock 2009). The text of the paper in Shutt et al. (2009) was
used to generate the tag cloud in the figure.
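The count-to-font-size pipeline just described can be sketched in a few lines. This is a minimal illustration, not TagCrowd's actual implementation; the function name and size range are our own choices.

```python
import html
from collections import Counter

def tag_cloud_html(text, num_terms=10, min_px=12, max_px=36):
    """Build a minimal tag cloud: count terms, keep the most frequent,
    and scale each term's font size by its relative count."""
    counts = Counter(text.lower().split())
    top = counts.most_common(num_terms)
    lo, hi = top[-1][1], top[0][1]
    spans = []
    for term, n in sorted(top):  # alphabetical ordering, a common tag cloud convention
        # Linear interpolation between the smallest and largest font size.
        scale = 0.0 if hi == lo else (n - lo) / (hi - lo)
        px = int(min_px + scale * (max_px - min_px))
        spans.append(f'<span style="font-size:{px}px">{html.escape(term)}</span>')
    return " ".join(spans)
```

A generator of this kind only needs the raw term counts; everything else is presentation, which is why tools such as Wordle can layer arbitrary fonts, colors, and orientations on top of the same underlying data.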
Figures 6.2 and 6.3 demonstrate a more complex application, Wordle (Fein-
berg 2009). This generator includes many additional graphical capabilities. It
Figure 6.2 A tag cloud of the paper in Shutt et al. (2009), generated by the
Wordle application using the ‘Vigo’ font type and a randomized predominant text
orientation.
110 TEXT MINING
Figure 6.3 A tag cloud of the paper in Shutt et al. (2009), generated by the
Wordle application using the ‘Boope’ font type and with the predominant text
orientation set to horizontal.
gives the user the ability to alter text and background color in a variety of ways.
Font type may be modified. The predominant orientation of the words in the
word cloud may be set in a variety of ways, ranging from completely horizontal,
to mostly horizontal or mostly vertical, to completely vertical. Wordle is capable
of automatically randomizing all of these parameters.
Both Steinbock (2009) and Feinberg (2009), as well as many other tag
cloud generators, allow free noncommercial use of the images and/or HTML
code that they generate. TagCrowd and Wordle both use the Creative Commons
license, meaning users are allowed to copy, distribute, and transmit the materi-
als (Commons 2009a,b). While Wordle does not limit usage to noncommercial
applications, TagCrowd allows noncommercial use only. It should be noted that
the source code of the generators is copyrighted by the respective authors and
does not fall under the Creative Commons license.
Figure 6.4 TextArc applied to Shakespeare’s Hamlet. Not surprisingly, the name
‘Hamlet’ figures prominently in the work.
Figure 6.5 TextArc allows the user to easily track the connections between var-
ious terms. Here, we see that the term ‘Hamlet’ is related to the term ‘lord’. It is
also possible to track either term further.
Figure 6.6 SEASR’s Sentiment Tracking project applied to Turn of the Screw,
by Henry James (1898). Each unit on the X-axis corresponds to a group of 12
sentences. The Y-axis shows the sentiment composition for all six of Parrott’s
core emotions (Parrott 2000).
SURVEY OF TEXT VISUALIZATION TECHNIQUES 113
Figure 6.7 SEASR’s Sentiment Tracking project applied to Turn of the Screw,
by Henry James (1898). This figure shows the presence of anger in the literary
work.
The SEASR (Software Environment for the Advancement of Scholarly Research) Sentiment Tracking project used Parrott’s six core emo-
tions in its sentiment tracking demonstration (Figures 6.6, 6.7, and 6.8): Love,
Joy, Surprise, Anger, Sadness, and Fear (Parrott 2000). The Sentiment Track-
ing project uses UIMA (Unstructured Information Management Architecture), a
component framework for analyzing unstructured content, including but not lim-
ited to text. UIMA began as a project at IBM, but evolved into an open source
project at the Apache Software Foundation (SEASR 2009b). Several different
metrics may be used in order to categorize the terms from the text. The approach
used by the SEASR/UIMA Sentiment Tracking project involves searching for
the shortest path through a thesaurus from each term within the text to one of the
descriptor adjectives. Synonym symmetry is another useful technique, and may
be helpful as a ‘tie breaker’ (SEASR 2009a).
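The shortest-path categorization can be illustrated with a breadth-first search over a small synonym graph. The graph and the emotion seed words below are toy examples, not SEASR's actual thesaurus data.

```python
from collections import deque

# Toy undirected synonym graph; a real system would use a full thesaurus.
SYNONYMS = {
    "furious": {"angry", "irate"},
    "irate": {"furious", "angry"},
    "angry": {"furious", "irate"},
    "cheerful": {"happy"},
    "happy": {"cheerful", "joyful"},
    "joyful": {"happy"},
}
# Descriptor adjectives mapped to Parrott's core emotions (illustrative subset).
EMOTIONS = {"angry": "Anger", "joyful": "Joy"}

def classify(adjective, synonyms=SYNONYMS, emotions=EMOTIONS):
    """Return (emotion, path length) for the nearest core-emotion descriptor
    reachable through synonym links, or None if no path exists."""
    seen, queue = {adjective}, deque([(adjective, 0)])
    while queue:
        word, dist = queue.popleft()
        if word in emotions:
            return emotions[word], dist
        for nxt in synonyms.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Because BFS explores by increasing distance, the first emotion descriptor reached is guaranteed to lie at the shortest synonym-path length, which is exactly the quantity used to categorize each adjective.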
Figure 6.8 SEASR’s Sentiment Tracking project applied to Turn of the Screw,
by Henry James (1898). This figure shows the presence of joy in the literary work.
6.7.1 Scenarios
The primary (crime and terrorism-based) scenarios depicted in the VAST 2007
Contest involved wildlife law enforcement incidents occurring in the fall of 2004.
Endangered species issues and ecoterrorism activities played key roles in the
underlying terrorist scenario/plot. The data used to describe the details of the
plot included text, images, and some statistics. Although activities of certain
animal rights groups, such as the People for the Ethical Treatment of Animals
(PETA) and Earth Liberation Front (ELF), were involved with the plot, the con-
test organizers did not consider them to be the primary (interesting) parties for
investigation. In fact, such sideplots were used to deflect attention from the main
criminal/terrorist scenarios, thus providing a realistic challenge.
While FeatureLens may sound suitable for the given task, it is not without
its shortcomings. For one, its design is rather complex as it requires a MySQL
database server, an HTTP server, and an Adobe Flash-enabled web browser to
function properly. As such, it is not a trivial task to set up an instance of Fea-
tureLens from scratch and may take an inexperienced user a significant amount
of time to get started. Datasets must be parsed and stored in the database, an
operation that an end user cannot perform, so examining arbitrary datasets is out
of the question. In implementing the architecture of FeatureLens, the designers
chose to use a variety of languages: Ruby for the back end, XML to communicate
between the front end and back end, and OpenLaszlo for the interface. Because
of this variety in languages, adapting and modifying FeatureLens would prove
quite difficult. Responsiveness of the interface also tends to degrade to the point
that it impacts usability when given even the simplest of tasks. Clearly a better
solution was needed.
All the basic functionality of FutureLens can be seen in this example. The
boxes along the bottom show the terms that are currently being investigated.
The intensity of the color in these boxes hints at the concentration of the term
throughout the documents. A graph of the percentage of documents containing the
term versus time is shown at the top, while the raw text of the selected document
is shown to the right with the selected terms highlighted in the appropriate color.
Multiple terms can easily be combined into extended patterns by dragging and
dropping. Terms may be combined into either collections or phrases. A collection
is created when the user drags and drops terms onto each other. Term adjacency
does not affect search results for a collection. If the user holds down the Copy
key (this key varies depending on the operating system; for example, on Mac OS
X this is the Alt key), a phrase rather than a collection will be created. In this
case, term adjacency will be considered when the software performs searching.
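The difference between the two search modes can be sketched as follows; this is an illustrative reimplementation of the matching logic, not FutureLens's actual code.

```python
def matches_collection(doc_tokens, terms):
    """A collection matches if every term appears anywhere in the document;
    term adjacency is ignored."""
    toks = set(doc_tokens)
    return all(t in toks for t in terms)

def matches_phrase(doc_tokens, terms):
    """A phrase matches only if the terms appear adjacently, in order."""
    n = len(terms)
    return any(doc_tokens[i:i + n] == list(terms)
               for i in range(len(doc_tokens) - n + 1))
```

For example, a document containing "chinchilla" and "Gil" in separate sentences satisfies the collection {chinchilla, Gil} but not the phrase "chinchilla Gil".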
While this presents an excellent overview of the data, it is also possible to load
the output (groups of terms and/or entities) derived from a data clustering method.
An example of this is shown in Figure 6.11.
Figure 6.11 FutureLens tracking the co-occurrences of grouped terms and enti-
ties (persons, locations, and organizations).
Figure 6.12 FutureLens with the bioterrorism NTF output group loaded. The
panel on the left shows the terms and entities relevant to the NTF output group.
The top-level graph summarizes the frequency of the selected terms and entities
over time. The monthly frequency plots in the center of the screen allow the user a
more detailed view of the term/entity occurrence over time. The monthly plots are
clickable; the results of that operation are demonstrated in the subsequent figures.
the user to easily identify a key news story within the large dataset. The article
shown in this figure contains a great amount of relevant information regarding an
outbreak of a potentially deadly virus, monkeypox, in the Los Angeles area. The
article implies that the outbreak may not have been accidental, and connects it to
an animal rights activist and chinchilla breeder named Cesar Gil. In order to fully
reconstruct the plotline, the user selects the names Cesar Gil and Gil from the
Entities list, as shown in Figure 6.15. However, this results in too many instances
of Gil being found, and most of them are probably irrelevant. Exploiting the link
between Gil and chinchilla breeding, the user combines the terms Chinchilla and
Gil into a collection. This helps the user to quickly identify a relevant article that
contains an advertisement for Gil’s chinchilla breeding business (Figure 6.16).
Figure 6.14 Key news story identification using FutureLens. The monthly plots
allow for convenient visualization of term co-occurrence over time. As demon-
strated in this figure, term co-occurrence allows the user to quickly extract the
most relevant and informative textual data from a large dataset. In this example,
the news article that contains all of the user’s selected terms contains a great deal
of information relevant to the chinchilla–bioterrorism plot. The context provided
by the article tells the user exactly in what way many of the terms and entities
within the NTF output group are relevant to the bioterrorism scenario that was
hidden within this textual dataset.
Not all of the articles that are relevant to this plotline have been shown in the
figures; however, FutureLens enables the user to quickly and easily identify them
all. FutureLens also helps the user to focus on the relevant parts of the article
(Shutt et al. 2009).
Figure 6.15 Entity of interest search using FutureLens. The key news article
demonstrated in Figure 6.14 revealed that an individual named Cesar Gil is a
key player in this scenario. FutureLens allows the user to expand the search by
including alternative forms of this individual’s name (e.g. Gil). However, this may
cause a significant number of irrelevant search results. Figure 6.16 demonstrates
how the user might use FutureLens’ collection creation capability to focus the
search.
discusses the use of trade in exotic pets (including tropical fish) as a cover for
drug smuggling (including cocaine trafficking). The next figure, Figure 6.19,
shows the selection of what appears to be a company name, Global Ways,
from the Entities list. As shown, the user is able to quickly find a story that
identifies Global Ways as a company that imports exotic tropical fish from South
America into the United States. Given the previously established connection
between drug trafficking and tropical fish imports, Global Ways may be worth
investigating further. As Figure 6.20 shows, shortly after publication of the
story advertising Global Ways’ import business, the Fish and Wildlife Service
had issued a warning to avoid handling shipments of tropical fish that may
have entered the United States through Miami. According to this story, the
packaging of some of these shipments appears to have been contaminated with
an unknown toxic substance. Global Ways is listed as one of the suspects.
Finally, Figure 6.21 identifies the owner of Global Ways as Madhi Kim,
thereby allowing the analyst to continue tracing relationships through the
dataset.
Figure 6.17 The drug trafficking NTF output group loaded into FutureLens.
Figure 6.18 Two types of term chaining, phrase creation and collection creation,
help the user to quickly identify relevant news stories.
Figure 6.19 Among entities of interest produced by NTF, there appears a com-
pany name, Global Ways. FutureLens enables the user to further explore the
relationship between this company, the tropical fish trade, and drug trafficking.
Figure 6.20 FutureLens helps the user to identify news stories that connect
Global Ways to drug trafficking.
Figure 6.21 The owner of Global Ways is identified with the help of FutureLens.
Further investigation of the owner’s connections and associations is possible at
this point.
interesting features in the dataset could be added to create a single analysis tool.
As it stands now, the output of data mining models such as that created by
nonnegative tensor factorization (see Bader et al. (2008b)) must be entered man-
ually into the software environment. Eliminating this human interaction would
greatly increase the efficiency of scenario discovery. An obvious extension for
dynamic (time-varying) datasets is certainly needed. The portability and intuitive
word/phrase tracking capability of FutureLens, however, make this public-domain
software environment a solid contribution to the text mining community.
References
Bader BW, Berry MW and Browne M 2008a Discussion tracking in Enron email using
PARAFAC. In Survey of Text Mining II: Clustering, Classification, and Retrieval (ed.
Berry M and Castellanos M) Springer-Verlag pp. 147–163.
Bader BW, Puretskiy AA and Berry MW 2008b Scenario discovery using nonnegative ten-
sor factorization. In Progress in Pattern Recognition, Image Analysis and Applications
(ed. Ruiz-Shulcloper J and Kropatsch WG) Springer-Verlag pp. 791–805.
Commons C 2009a Creative Commons Attribution License 3.0 http://
creativecommons.org/licenses/by/3.0/us/.
Commons C 2009b Creative Commons Non-Commercial Attribution license 3.0
http://creativecommons.org/licenses/by-nc/3.0/.
Don A, Zheleva E, Gregory M, Tarkan S, Auvil L, Clement T, Shneiderman B and
Plaisant C 2007 Discovering interesting usage patterns in text collections: integrating
text mining with visualization. HCIL Technical Report 2007-08.
Don A, Zheleva E, Gregory M, Tarkan S, Auvil L, Clement T, Shneiderman B and
Plaisant C 2008 Exploring and visualizing frequent patterns in text collections with
FeatureLens. http://www.cs.umd.edu/hcil/textvis/featurelens. Vis-
ited November 2008.
Feinberg J 2009 Wordle: Beautiful word clouds. http://www.wordle.net. Visited
July 2009.
Kaser O and Lemire D 2007 Tag-cloud drawing: Algorithms for cloud visualization.
CoRR.
Kumar A 2009 The MONK Project Wiki. https://apps.lis.uiuc.edu/wiki/
display/MONK/The+MONK+Project+Wiki. Last edited August 2008.
Paley WB 2009 TextArc. http://www.textarc.org/. Visited July 2009.
Parrott WG 2000 Emotions in social psychology: Volume overview. In Emotions in Social
Psychology: Essential readings (ed. Parrott WG) Psychology Press pp. 1–19.
Scholtz J, Plaisant C and Grinstein G 2007 IEEE VAST 2007 Contest. http://
www.cs.umd.edu/hcil/VASTcontest07.
SEASR 2009a Sentiment tracking from UIMA data. http://seasr.org/
documentation/uima-and-seasr/sentiment-tracking-from-uima-
data/. Visited July 2009.
SEASR 2009b UIMA and SEASR. http://seasr.org/documentation/uima-
and-seasr/. Visited July 2009.
Shutt GL, Puretskiy AA and Berry MW 2009 FutureLens: Software for text visualization
and tracking. Text Mining Workshop, Proceedings of the Ninth SIAM International
Conference on Data Mining, Sparks, NV.
Steinbock D 2009 TagCrowd: Create your own tag cloud from any text to visualize word
frequency. http://www.tagcrowd.com. Visited July 2009.
Viégas FB, Wattenberg M and Dave K 2004 Studying cooperation and conflict between
authors with History Flow visualizations. Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, pp. 575–582. ACM Press.
Viégas FB, Wattenberg M and Dave K 2009 History Flow: Visualizing the editing history
of Wikipedia pages http://www.research.ibm.com/visual/projects/
history_flow/index.htm.
7
Adaptive Threshold Setting for Novelty Mining
7.1 Introduction
In the age of information, it is easy to accumulate various documents such as news
articles, scientific papers, blogs, advertisements, etc. These documents contain
rich information as well as useless or redundant information. People who are
interested in a certain topic may only want to track the new developments of
an event or the different opinions on the topic. This motivates the study of
novelty mining, or novelty detection, which aims to retrieve novel, yet relevant,
information, given a specific topic defined by a user (Zhang and Tsai 2009a). A
typical novelty mining system consists of two modules: (1) categorization; and
(2) novelty mining. The categorization module classifies each incoming document
into its relevant topic bin. Then, the novelty mining module detects the documents
containing enough novel information in the topic bin. This chapter will focus on
the latter module. Due to its importance in information retrieval, a great deal of
attention has been given to novelty mining in the past few years. The pioneering
work for novelty mining was performed at the document level (Zhang et al.
2002). Later, more contributions were made to novel sentence mining, such as
those reported in TREC 2002–2004 Novelty Track (Harman 2002; Soboroff
2004; Soboroff and Harman 2003), those in comparing various novelty metrics
Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
2010, John Wiley & Sons, Ltd
(Allan et al. 2003; Tang and Tsai 2009; Zhao et al. 2006), and those in integrating
various natural language processing (NLP) techniques (Kwee et al. 2009; Ng et al.
2007; Zhang and Tsai 2009b).
Novelty mining is a process of mining novel text in the relevant documents
of a given topic. The novelty of any document or sentence is quantitatively mea-
sured by a novelty metric based on its history documents and represented by a
novelty score. The final decision on whether a document or sentence is novel
or not depends on whether the novelty score falls above or below a threshold.
As an adaptive filtering task, novelty mining is one of the most challenging
problems in information retrieval. One primary challenge is how to set the
threshold of novelty scores adaptively. In the novelty mining system, since there
is little or no training information available, the threshold cannot be predefined
with confidence. The motivations for designing an adaptive threshold setting for
the novelty mining system are manifold. There is little training information in
the initial stages of novelty mining and different users may have different defi-
nitions about novelty. Motivations of adaptive threshold setting will be analyzed
in detail later (in Section 7.2.2).
To the best of our knowledge, few studies have focused on adaptive threshold
setting in novelty mining. A simple threshold setting algorithm was proposed in
Zhang et al. (2002), which decreases the redundancy threshold a little if a redun-
dant document is retrieved as a novel one based on a user’s feedback. Clearly
it is a weak algorithm because it can only decrease the redundancy threshold.
This chapter addresses the problem of setting an adaptive threshold by modeling
the score distributions of both novel and nonnovel documents. Although score
distribution-based threshold-setting algorithms have been proposed for relevant
document/sentence retrieval (Arampatzis et al. 2000; Robertson 2002; Zhai et al.
1999; Zhang and Callan 2001), the novelty score in novelty mining has its dis-
tinctive characteristics. In our experimental study, we find that scores from the
novel and nonnovel classes heavily overlap. This is intuitive because novel and
nonnovel information are always interlaced in one document, while in the rele-
vance retrieval problem most of the nonrelevant documents show little similarity
with relevant ones. Second, we find that the score distributions for both novel
and nonnovel classes can be approximated by Gaussian distributions (detailed in
Section 7.2.3). In the relevance retrieval problem, however, the scores of nonrele-
vant documents follow an exponential distribution (Arampatzis et al. 2000). This
also implies that most nonrelevant documents are dissimilar to relevant ones.
The score distributions of the classes provide the global information necessary
for constructing an optimization criterion for threshold setting; the threshold
that optimizes this criterion is the best we can obtain until new user feedback
is provided. Our proposed method, the Gaussian-based adaptive threshold set-
ting (GATS) algorithm, is a general algorithm, which can be tuned according to
different performance requirements, by employing different optimization criteria,
such as the Fβ score (Equation (7.7)), which is the weighted harmonic average
of precision and recall where β controls the trade-off between them.
ADAPTIVE THRESHOLD SETTING FOR NOVELTY MINING 131
The novelty mining system combined with GATS has been tested on both
document-level and sentence-level data and compared to the novelty mining
system using various fixed thresholds. The experimental results show that a good
performance of GATS can be obtained at both levels.
The remainder of this chapter is organized as follows. Section 7.2 first ana-
lyzes the motivations of threshold setting in novelty mining, and then introduces
the GATS algorithm. Section 7.3 tests GATS at both the sentence level and
document level. Section 7.4 concludes the chapter.
and where Ncos (d) denotes the cosine similarity-based novelty score of document
d and wk (d) is the weight of the kth word in document weighted vector d. The
weighting function used in our work is the term frequency.
The final decision on whether a document is novel or not depends on whether
the novelty score falls above or below a threshold. The document predicted as
‘novel’ will be pushed into the history document list.
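This scoring-and-decision step can be sketched as follows, assuming the common formulation in which Ncos(d) is one minus the maximum cosine similarity between d and any history document (the precise equation precedes this passage in the chapter; the helper names here are our own).

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency weighted vector of a document."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def novelty_score(doc, history):
    """One minus the highest cosine similarity to any history document."""
    d = tf_vector(doc)
    if not history:
        return 1.0
    return 1.0 - max(cosine(d, tf_vector(h)) for h in history)

def is_novel(doc, history, threshold):
    """Predict 'novel' when the score exceeds the threshold; the caller
    then pushes predicted-novel documents into the history list."""
    return novelty_score(doc, history) > threshold
```

A document identical to something already in the history scores near zero; a document sharing no terms with the history scores 1.0.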
When novelty mining adopts a fixed threshold, no user feedback is considered
and the whole process is unsupervised. When novelty mining adopts an adaptive
threshold setting algorithm, the system needs to respond to any new feedback
from the user. Based on the feedback from the user, the new threshold output by
this algorithm will replace the current one and be used for future incoming doc-
uments until new feedback is available. Note that when no feedback is received,
the system will fix the threshold at the initial threshold.
7.2.2 Motivation
There are several reasons motivating us to design an adaptive threshold setting
algorithm for novelty mining. First of all, there is little or no training informa-
tion in the initial stages of novelty mining. Therefore, the threshold can hardly
be predefined with confidence. The training information that is necessary for
threshold setting includes the statistics of data and users’ reading habits. For
example, a topic with 90% novel documents needs a relatively low threshold for
novelty scores to retrieve most of the documents. On the other hand, different
users may have different definitions of ‘novel’ information. For example, one
user might regard a document with 50% novel information as a novel document
while another user might only regard a document with 80% novel information
as a novel document. The threshold of novelty scores should be higher for the
user with a stricter definition of the ‘novel’ document. As novelty mining is an
accumulating system, more training information will be available for threshold
setting, based on user feedback given over time. The adaptive threshold setting
algorithm is able to utilize this available information and customize the novelty
mining system to the user’s needs.
Satisfying different performance requirements is another important motivation
for employing an adaptive threshold setting algorithm for novelty mining. For
example, when users do not want to miss any novel information, a high-recall
system that only filters out very redundant documents is desired. When users
want to read the most novel documents first, a high-precision system that only
retrieves very novel documents is preferred. Therefore, the threshold should be
tuned according to different performance requirements.
Next, we will introduce the proposed method, namely GATS, and explain
how it works with novelty mining.
µk = (1/nk) Σ_{i ∈ ck} xi,  (7.4)
Figure 7.1 Empirical and probability distribution approximation for TREC 2004
Novelty Track data topic N54. [Two panels: distributions of novelty scores of
novel and nonnovel sentences for TREC2004–N54; x-axis: novelty scores.]
The Gaussian probability density function estimated for each class is represented
by the dashed lines in Figures 7.1 and 7.2. It would appear that novelty scores
from both the novel and nonnovel classes can be well fitted by Gaussian
distributions.
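Fitting these per-class Gaussians requires only the sample mean and standard deviation of each class's novelty scores, as the following minimal sketch shows.

```python
import math

def fit_gaussian(scores):
    """Estimate (mean, std) of a Gaussian from one class's novelty scores."""
    n = len(scores)
    mu = sum(scores) / n
    var = sum((x - mu) ** 2 for x in scores) / n
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
```

Fitting one Gaussian to the scores of the novel class and another to the scores of the nonnovel class yields the dashed curves of Figures 7.1 and 7.2.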
Optimization criterion
Assume we have an incoming document stream d1, d2, . . . , dn, of which n1 are
novel. After filtering the document stream by the novelty mining system with a
threshold θ , any document can be classified in one of four classes as shown in
Table 7.1.
Precision and recall are two widely used measures for evaluating the quality of
results in information retrieval. Precision can be seen as a measure of exactness,
whereas recall is a measure of completeness. In novelty mining, precision reflects
how likely the system-retrieved documents are truly novel and recall reflects how
likely the truly novel documents can be retrieved by the system. Precision and
Figure 7.2 Empirical and probability distribution approximation for TREC 2004
Novelty Track data topic N69. [Two panels: distributions of novelty scores of
novel and non-novel sentences for TREC2004–N69; x-axis: novelty scores.]
F = (2 × precision × recall)/(precision + recall).  (7.7)
The F score is a special case of the Fβ score, i.e. the weighted harmonic average
of precision and recall
Fβ = 1/(β/precision + (1 − β)/recall),  (7.8)
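Equations (7.7) and (7.8) translate directly into a small helper. Note that this Fβ is the chapter's weighted harmonic average, with β weighting precision and 1 − β weighting recall; β = 0.5 recovers the standard F score.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic average of precision and recall (Equation 7.8):
    beta weights precision, 1 - beta weights recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (beta / precision + (1.0 - beta) / recall)
```

Choosing β close to 1 favors a high-precision system, while β close to 0 favors a high-recall system, matching the performance requirements discussed in Section 7.2.2.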
Substituting Equations (7.9) and (7.10) into Equation (7.6), precision and recall
can be rewritten as functions of the threshold θ , as follows:
Pc1 P (x > θ |c1 )
precision(θ ) = , (7.11)
Pc1 P (x > θ |c1 ) + Pc0 P (x > θ |c0 )
recall(θ ) = P (x > θ |c1 ), (7.12)
where Pc1 and Pc0 are the prior probabilities of the novel and nonnovel classes,
which can be estimated from the proportions of novel and nonnovel documents in
the feedback received so far. The optimal threshold is then the value maximizing
the Fβ criterion:
θ* = arg max_θ  P(x > θ|c1) / (β[P(x > θ|c1) + (Pc0/Pc1) P(x > θ|c0)] + (1 − β)).
[Flowchart: the novelty mining workflow with adaptive threshold setting. START;
process document i; update the threshold by GATS when user feedback is received
(i = i + 1); repeat until the end of the data; END.]
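This feedback-driven workflow amounts to the following loop. It is an illustrative sketch; the `score`, `update_threshold`, and `get_feedback` callables are placeholders for the components described in this chapter.

```python
def novelty_mining_loop(documents, score, initial_threshold,
                        update_threshold, get_feedback):
    """Process a document stream: score each document against the history,
    mark it novel if its score exceeds the current threshold, and re-run
    the threshold-setting step whenever user feedback arrives."""
    history, labels = [], []
    threshold = initial_threshold
    for doc in documents:
        s = score(doc, history)
        novel = s > threshold
        labels.append(novel)
        if novel:
            history.append(doc)   # predicted-novel documents join the history
        feedback = get_feedback(doc, novel)
        if feedback is not None:  # no feedback: keep the current threshold
            threshold = update_threshold(feedback, threshold)
    return labels, threshold
```

With no feedback the loop behaves exactly like a fixed-threshold system; each piece of feedback triggers one threshold update that applies to all subsequent documents.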
11. He said that Mikael was in his 20s and that he had used Internet service
providers in the Philippines to spread his programs.
12. Bjorck said Mikael had published information on how to get rid of the ‘I
Love You’ program.
14. The ICSA.net researchers said they had disassembled one of the four
components of the ‘I Love You’ program and had discovered that its
instructions closely matched two similar programs that they had captured
last fall and in January.
15. Once a computer was infected, the program was set up to fetch the
password-stealing component from a Philippine Web site.
16. After it was installed in the computer it was programmed to relay the
stolen passwords to an e-mail account also in the Philippines.
17. But after the ‘I Love You’ outbreak was detected on Wednesday, the
company running the Philippine Web site, Sky Internet, quickly removed
the password program from its system.
18. Computer investigators said that both the ‘I Love You’ program and the
password-stealing modules discovered earlier had references to Amable
Mendoza Aguiluz Computer College, which they said had seven campuses
in the Philippines.
Figure 7.4 Sentence-level novelty mining results for TREC03 topic N39.
Figure 7.5 Threshold adjustment to 0.6000 for sentence 16 after running GATS.
Figure 7.6 Sentence-level novelty mining results for TREC03 topic N39 after
running GATS.
[Figure: precision–recall curves comparing novelty mining using GATS with fixed
thresholds; y-axis: Precision; legend: ‘fixed threshold with the best F score’;
fixed-threshold values from 0.05 to 0.95 annotated along the curve.]
mining with GATS will not fall within the regions of the extreme values, in
which the F score can be very low. In practice, our users usually require a high-
recall system with the precision no lower than a lower bound, or a high-precision
system with the recall no lower than a lower bound. An extremely high recall
with an extremely low precision is useless because this system just marks almost
all documents as novel. On the other hand, an extremely high precision with an
extremely low recall means that the system only marks very few documents as
novel. Both cases make little sense.
Moreover, since there is no prior information available for a user to choose a
suitable fixed threshold, the system with a predefined threshold can hardly lead
to a suitable trade-off between precision and recall, and hence can hardly obtain
a good F score. On the contrary, GATS will optimize the F score automatically,
based on user feedback over time.
Besides the PR curve, we also compare the two algorithms using the Fβ score.
Table 7.2 shows the performance of the two algorithms evaluated with Fβ scores
of β = 0.2, 0.5, and 0.8. For novelty mining employing GATS, the parameter
β is set to 0.2, 0.5, and 0.8 accordingly. For novelty mining employing the
fixed threshold, the highest Fβ scores are reported in tables after various trial-
and-error attempts. From Table 7.2, by comparing to the best fixed threshold,
we discovered that GATS can obtain similar or slightly better results for TREC
Table 7.2 Comparison of performance evaluated by Fβ
(β = 0.2, 0.5, 0.8) on TREC 2004 Novelty Track data.
Performance of the novelty mining system
Adaptive threshold Best fixed threshold
by GATS (β) by trial and error (θ )
F0.2 0.7706 (0.2) 0.7758 (0.15)
F0.5 0.6155 (0.5) 0.6126 (0.45)
F0.8 0.5396 (0.8) 0.5281 (0.60)
2004 Novelty Track data. The best fixed thresholds for F0.2 , F0.5 , and F0.8 are
0.15, 0.45, and 0.60, respectively. Examination of the PR curves in Figure 7.7
suggests that the corresponding region of the best fixed thresholds is covered by
the PR curve of GATS. This implies that GATS can be effective in searching for
the best threshold in novelty mining, under different performance requirements.
In the following subsections, we test GATS by assuming complete feed-
back for document-level novelty mining (NM) data with low, medium, and high
novelty ratios. This will provide some guidelines on how GATS should be used.
[Figure: precision–recall curves of novelty mining with fixed threshold vs.
adaptive threshold by GATS; y-axis: Precision; legend: GATS, fixed threshold;
F score contours and threshold values annotated.]
[Plot for Figure 7.9: precision–recall curves; x-axis: Recall; y-axis: Precision;
legend: GATS, fixed threshold; F score contours with threshold and β values
annotated.]
Figure 7.9 Precision–recall curves of novelty mining with fixed threshold vs.
adaptive threshold by GATS (tuning for Fβ ) with complete user feedback on
document-level TREC 2004 Novelty Track data (with PNS threshold 0.03).
[Figure 7.10 plot: precision (y-axis) vs. recall (x-axis), comparing the GATS
adaptive-threshold curve with the fixed-threshold curve and F score contours.]

Figure 7.10 Precision–recall curves of novelty mining with fixed threshold vs.
adaptive threshold by GATS (tuning for Fβ) with complete user feedback on
document-level TREC 2004 Novelty Track data (with PNS threshold 0.5).
Discussion
Although both the fixed threshold and the GATS parameter β control the trade-
off between precision and recall, they play different roles in novelty mining. The
fixed threshold does not reflect the trade-off between precision and recall directly.
Since different datasets have different characteristics and different metrics
output different ranges of novelty scores, a fixed threshold can hardly be
predefined with confidence. In contrast, the parameter β in GATS reflects
the weights of precision and recall directly (β is the weight of precision while
1 − β is the weight of recall), and hence can be set directly from the performance
requirement.
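Since β directly weights precision (and 1 − β weights recall) in a harmonic mean, the trade-off can be made concrete in a few lines. This is an illustrative sketch of the chapter's Fβ definition; the function name and example values are ours, and note that this parameterization differs from the more common Fβ = (1 + β²)PR/(β²P + R).

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    Here beta is the weight on precision and (1 - beta) the weight on
    recall, matching the formulation used in this chapter (not the
    conventional F_beta = (1 + beta^2)PR / (beta^2 P + R)).
    """
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (beta / precision + (1.0 - beta) / recall)

# beta = 0.5 weights precision and recall equally (the balanced F score);
# beta = 0.2 down-weights precision and so favors high-recall systems.
balanced = f_beta(0.6, 0.9, 0.5)
recall_heavy = f_beta(0.6, 0.9, 0.2)
```

With β = 0.5 the score reduces to the usual harmonic mean of precision and recall, which is why the balanced setting coincides with F1.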
From our experimental results on document-level NM data with low, medium,
and high novelty ratios, we find that GATS is extremely useful for data with low
novelty ratios, useful for data with medium novelty ratios, but not as useful as
the best fixed threshold for data with a novelty ratio higher than 75%. Therefore,
GATS is not recommended for topics with high novelty ratios. In this case, setting
a lower fixed threshold to force most of the documents to be ‘novel’ would be a
better choice.
7.4 Conclusion
This chapter addressed the problem of setting an adaptive threshold by utiliz-
ing user feedback over time. The proposed method, the Gaussian-based adaptive
threshold setting (GATS) algorithm, modeled the distributions of novelty scores
for the novel and nonnovel classes as Gaussians. Class distributions learnt from
user feedback provide the global view of the data used to construct an
optimization criterion for finding the best threshold. GATS is a general method
that can be tuned to different performance requirements by pairing it with
different optimization criteria. In this
chapter, the most commonly used performance evaluation measure in NM, the
Fβ score, has been employed as the optimization criterion. The Fβ score is the
weighted harmonic average of precision and recall, where β and (1 − β) are
weights for precision and recall, respectively.
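Putting these pieces together, the core of GATS might be sketched as follows. This is a reconstruction under our own assumptions, not the authors' implementation: two Gaussians are fitted to the feedback scores of each class, and a simple grid search picks the threshold that maximizes the expected Fβ implied by those class models.

```python
from statistics import NormalDist, mean, stdev

def gats_threshold(novel_scores, nonnovel_scores, beta=0.5, grid=200):
    """Grid-search the novelty-score threshold maximizing expected F_beta.

    Each class's scores (gathered from user feedback) are modelled as a
    Gaussian; documents scoring >= theta are labelled 'novel'.  Function
    and parameter names are illustrative, not from the chapter.
    """
    g_novel = NormalDist(mean(novel_scores), stdev(novel_scores))
    g_nonnovel = NormalDist(mean(nonnovel_scores), stdev(nonnovel_scores))
    p_novel = len(novel_scores) / (len(novel_scores) + len(nonnovel_scores))
    lo = min(min(novel_scores), min(nonnovel_scores))
    hi = max(max(novel_scores), max(nonnovel_scores))
    best_theta, best_f = lo, -1.0
    for i in range(grid + 1):
        theta = lo + (hi - lo) * i / grid
        recall = 1.0 - g_novel.cdf(theta)        # P(score >= theta | novel)
        fp_rate = 1.0 - g_nonnovel.cdf(theta)    # P(score >= theta | nonnovel)
        retrieved = p_novel * recall + (1.0 - p_novel) * fp_rate
        if retrieved == 0.0 or recall == 0.0:
            continue
        precision = p_novel * recall / retrieved
        # chapter's F_beta: beta weights precision, (1 - beta) weights recall
        f = 1.0 / (beta / precision + (1.0 - beta) / recall)
        if f > best_f:
            best_theta, best_f = theta, f
    return best_theta, best_f
```

On well-separated score distributions the search lands between the two class means; as feedback accumulates, refitting the Gaussians lets the threshold adapt over time.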
In the experimental study, the NM system employing the GATS algorithm
was tested on experimental datasets with complete user feedback on data with
low, medium, and high novelty ratios (percentage of novel sentences/documents).
The experimental results suggest that GATS is very effective in finding the best
threshold in the NM system. Moreover, GATS is able to meet the different per-
formance requirements by setting the weights of precision and recall externally.
GATS has been shown to be extremely effective for data with a low novelty
ratio, useful for data with a medium novelty ratio, and not as effective for data
with a high novelty ratio.
References
Allan J, Wade C and Bolivar A 2003 Retrieval and novelty detection at the sentence level.
SIGIR 2003, Toronto, Canada, pp. 314–321. ACM.
Arampatzis A, Beney J, Koster CHA and Weide TP 2000 KUN on the TREC-9 filtering
track: Incrementality, decay, and threshold optimization for adaptive filtering systems.
TREC 9 – the 9th Text Retrieval Conference.
Davis J and Goadrich M 2006 The relationship between precision-recall and ROC curves.
Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
Harman D 2002 Overview of the TREC 2002 Novelty Track. TREC 2002 – the 11th Text
Retrieval Conference, pp. 46–55.
Kwee AT, Tsai FS and Tang W 2009 Sentence-level novelty detection in English and
Malay. Lecture Notes in Computer Science (LNCS) vol. 5476 Springer pp. 40–51.
Ng KW, Tsai FS, Goh KC and Chen L 2007 Novelty detection for text documents using
named entity recognition. 6th International Conference on Information, Communica-
tions and Signal Processing, pp. 1–5.
Robertson S 2002 Threshold setting and performance optimization in adaptive filtering.
Information Retrieval 5(2–3), 239–256.
Soboroff I 2004 Overview of the TREC 2004 Novelty Track. TREC 2004 – the 13th Text
Retrieval Conference.
Soboroff I and Harman D 2003 Overview of the TREC 2003 Novelty Track. TREC
2003 – the 12th Text Retrieval Conference.
Tang W and Tsai FS 2009 Intelligent novelty mining for the business enterprise. Technical
Report.
Zhai C, Jansen P, Stoica E, Grot N and Evans DA 1999 Threshold calibration in CLARIT
adaptive filtering. Proceedings of the Seventh Text Retrieval Conference, TREC-7, pp.
149–156.
Zhang Y and Callan J 2001 Maximum likelihood estimation for filtering thresholds. ACM
SIGIR 2001, pp. 294–302.
Zhang Y and Tsai FS 2009a Chinese novelty mining. EMNLP’09: Proceedings of the
Conference on Empirical Methods in Natural Language Processing, pp. 1561–1570.
Zhang Y and Tsai FS 2009b Combining named entities and tags for novel sentence
detection. ESAIR’09: Proceedings of the WSDM’09 Workshop on Exploiting Semantic
Annotations in Information Retrieval, pp. 30–34.
Zhang Y, Callan J and Minka T 2002 Novelty and redundancy detection in adaptive
filtering. ACM SIGIR 2002, Tampere, Finland, pp. 81–88.
Zhao L, Zheng M and Ma S 2006 The nature of novelty detection. Information Retrieval
9, 527–541.
8
Text Mining and Cybercrime
8.1 Introduction
According to the most recent 2008 online victimization research, approximately
1 in 7 youths (ages 10 to 17 years) experience a sexual approach or solicitation
by means of the Internet (National Center for Missing and Exploited Children
2008). In response to this growing concern, law enforcement collaborations and
nonprofit organizations have been formed to deal with sexual exploitation on the
Internet. Most notable is the Internet Crimes Against Children (ICAC) task force
(Internet Crimes Against Children 2009). The ICAC Task Force Program was
created to help state and local law enforcement agencies enhance their investiga-
tive response to offenders who use the Internet, social networking websites, or
other computer technology to sexually exploit children. The program is currently
composed of 59 regional task force agencies and is funded by the United States
Department of Justice, Office of Juvenile Justice and Delinquency Prevention.
The National Center for Missing and Exploited Children (NCMEC) has set
up a CyberTipLine for reporting cases of child sexual exploitation including child
pornography, online enticement of children for sex acts, molestation of children
outside the family, sex tourism of children, child victims of prostitution, and
unsolicited obscene material sent to a child. All calls to the tip line are referred
to appropriate law enforcement agencies – and the magnitude of the calls is
staggering. From March 1998, when the CyberTipLine began operations, until
April 20, 2009, there were 44 126 reports of ‘Online Enticement of Children for
Sexual Acts’, one of the reporting categories. There were 146 in the week of April
20th, 2009 alone (National Center for Missing and Exploited Children 2008).

Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
2010, John Wiley & Sons, Ltd
The owners of Perverted-Justice.com (PJ) began a grassroots effort to identify
cyberpredators in 2002. PJ volunteers pose as youths in chat rooms and respond
when approached by an adult seeking to begin a sexual relationship with a child.
We are currently working with the data collected by PJ from these conversations
in an effort to understand cyberpredator communications.
Cyberbullying, according to the National Crime Prevention Council, is using
the Internet, cell phones, video game systems, or other technology to send or post
text or images intended to hurt or embarrass another person – and is a growing
threat among children. In 2004, half of US youths surveyed stated that they or
someone they knew had been victims or perpetrators of cyberbullying (National
Crime Prevention Council 2009a). Being a victim of cyberbullying is a common
and painful experience. Nearly 20% of teens had a cyberbully pretend to be
someone else in order to trick them online, getting the victim to reveal personal
information; 17% of teens were victimized by someone lying about them to others
online; 13% of teens learned that a cyberbully was pretending to be them while
communicating with someone else; and 10% of teens were victimized by someone
posting unflattering pictures of them online, without permission (National Crime
Prevention Council 2009b).
The anonymous nature of the Internet may contribute to the prevalence of
cyberbullying. Kids respond to cyberbullying by avoiding communication tech-
nologies or messages altogether. They rarely report the conduct to parents (for
fear of losing phone/Internet privileges) or to school officials (for fear of getting
into trouble for using cell phones or the Internet in class) (Agatston et al. 2007;
Williams and Guerra 2007).
As we analyzed cyberbullying and cyberpredator transcripts from a variety of
sources, we were struck by the similar communicative tactics employed by both
cyberbullies and cyberpredators – in particular, masking identity and deception.
We were also struck by the similar responses of law enforcement and youth advo-
cacy groups: reporting and preventing. Victims are physically and psychologically
abused by predators and bullies who trap them in vicious communicative cycles
using modern technologies; their only recourse is to report the act to authorities
after it has occurred. By the time a report is made, unfortunately the aggressor
has moved on to a new victim.
Cyberbullying and Internet predation frequently occur over an extended
period of time and across several technological platforms (i.e. chat rooms,
social networking sites, cell phones, etc.). Techniques that link multiple online
identities would help law enforcement and national security agencies identify
criminals, as well as the forums in which they participate. The threat to youth
is of particular interest to researchers, law enforcement, and youth advocates
because of the potential for it to get worse as membership of online communities
continues to grow (Backstrom et al. 2006; Kumar et al. 2004; Leskovec et al.
2008) and as new social networking technologies emerge (Boyd and Ellison
2007). Much of modern communication takes place via online chat media
in virtual communities populated by millions of anonymous members who
use a variety of chat technologies to maintain virtual relationships based on
daily (if not hourly) contact (Ellison et al. 2007; O’Murchu et al. 2004). MSN
Messenger, for example, reports 27 million users and AOL Instant Messenger
has the largest share of the instant messaging market (52% as of 2006) (IM
MarketShare 2009); however, Facebook, the latest social networking craze,
reported over 90 million users worldwide (Nash 2008). These media, along with
MySpace, WindowsLive, Google, and Yahoo, all have online chat technologies
that can be easily accessed by anyone who chooses to create a screen name and
to log on; no proof of age, identity, or intention is required. A recent update
to Facebook also allows users to post and receive Facebook messages via text
messaging on their cell phones (FacebookMobile 2009).
We describe the current state of research in the areas of cyberbullying and
Internet predation in Section 8.2. In Section 8.3, we describe several commercial
products which claim to provide chat and social networking site monitoring for
home use. Finally in Section 8.4 we offer our conclusions and discuss opportu-
nities for future research into this interesting and timely field.
A statistical approach
Pendar used the PJ transcripts to separate predator communication from vic-
tim communication (Pendar 2007). In this study, the author downloaded the PJ
transcripts and indexed them. After preprocessing to reduce some of the prob-
lems associated with Internet communication (i.e. handling netspeak), the author
developed attributes for each chat log. The attributes consisted of word unigrams,
bigrams, and trigrams. Terms that appeared in only one log or in more than 95%
of the logs were removed from the index. Afterward approximately 10 000 uni-
grams, 43 000 bigrams, and 13 000 trigrams remained. The author describes using
701 log files.1 Each log file was split into victim communication and predator
communication, resulting in 1402 total input instances, each with 10 000–43 000
attributes, depending on the model being tested. Additional feature extraction and
weighting completed the indexing process.
1 It appears as if the perverted-justice.com site has changed its method of presenting the chat
data in recent years.
The data file was split into a 1122 instance training set and a 280 instance test
set, stratified by class (i.e. the test set contained 140 predator instances and 140
victim instances). Classification was then attempted using both support vector
machine (SVM) and distance-weighted k-nearest neighbor (k-NN) classifiers.
The F -measure (see also Sections 3.4 and 7.2.3) reported by the author ranged
from 0.415 to 0.943. The k-NN classifier was a better classifier for this task
and trigrams were shown to be more effective than unigrams and bigrams. The
maximum performance (F -measure = 0.943) was obtained when 30 nearest
neighbors were used and 10 000 trigrams were extracted and used as attributes.
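A drastically simplified stand-in for this pipeline (word trigrams, cosine similarity, similarity-weighted k-NN voting) can be written in a few lines. This is not Pendar's system: it omits the preprocessing, feature weighting, and scale of the original, and any chat lines fed to it here are invented for illustration.

```python
from collections import Counter
from math import sqrt

def ngrams(text, n=3):
    """Bag of word n-grams (trigrams by default) for one side of a chat."""
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(query, labelled, k=3):
    """Similarity-weighted vote among the k most similar training chats."""
    q = ngrams(query)
    ranked = sorted(((cosine(q, ngrams(text)), label) for text, label in labelled),
                    reverse=True)
    votes = Counter()
    for sim, label in ranked[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]
```

Even this toy version reflects the shape of the task: each chat side becomes a sparse n-gram vector, and the class label is decided by its nearest labelled neighbours.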
During the gaining access phase, the predator maneuvers him- or herself into
professional and social positions where he or she can interact with the child in
a seemingly natural way, while still maintaining a position of authority over the
child. For example, gaining employment at an amusement park or volunteering
for a community youth sports team. The next phase, entrapping the victim in
a deceptive relationship, is a communicative cycle that consists of grooming,
isolation, and approach. Grooming involves subtle communication strategies that
desensitize victims to sexual terminology and reframe sexual acts in child-like
terms of play or practice. In this stage, offenders also isolate their victims from
family and friend support networks before approaching the victim for the third
phase: sexual contact and long-term abuse.
In previous work, we expanded and modified the luring theory to accom-
modate the difference between online luring and real-world luring (Leatherman
2009). For example, the concept ‘gaining access’ was revised to include the initial
entrance into the online environment and initial greeting exchange by offenders
and victims, which is different from meeting kids at the amusement park or
through a youth sports league. Communicative desensitization was modified to
include the use of slang, abbreviations, netspeak, and emoticons in online conver-
sations. The core concept underpinning entrapment is the ongoing deceptive trust
that develops between victims and offenders. In online luring communications,
this concept is defined as perpetrator and victim sharing personal information,
information about activities, relationship details, and compliments.
Communications researchers define two primary goals for content analysis
(Riffe et al. 1998):
1. describe the communication; and
2. draw inferences about its meaning.
In order to perform a content analysis for Internet predation, we developed
a codebook and dictionary to distinguish among the various constructs defined
in the luring communication theoretical model. The coding process occurred in
several stages. First, a dictionary of luring terms, words, icons, phrases, and net-
speak for each of the three luring communication stages was developed. Second,
a coding manual was created. This manual has explicit rules and instructions for
assigning terms and phrases to their appropriate categories. Finally, software that
mimics the manual coding process was developed (this software is referred to as
ChatCoder below).
Twenty-five transcripts from the PJ website were carefully analyzed for
the development of the dictionary. These 25 online conversations ranged from
349 to 1500 lines of text. The perpetrators ranged from 23 to 58 years of age,
were all male, and were all convicted of sexual solicitation of minors over the
Internet.
We captured key terms and phrases that were frequently used by online sex-
ual predators, and identified their appropriate category labels within the luring
model: deceptive trust development, grooming, isolation, and approach (Leather-
man 2009; Olson et al. 2007). The dictionary included terms and phrases common
to net culture in general, and luring language in particular. Some examples appear
in Table 8.1. The version of the coding dictionary used in these experiments
contained 475 unique phrases. A breakdown of the phrase count by category appears
in Table 8.2.
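The phrase-matching step of this coding process can be illustrated with a toy dictionary. The handful of phrases below are drawn from Table 8.1; the full 475-phrase codebook and the actual ChatCoder software are far richer, so treat this as a hypothetical sketch rather than the real coder.

```python
# Mini-dictionary sampled from Table 8.1; the real codebook has 475 phrases.
CODEBOOK = {
    "approach": ["are you safe to meet", "i just want to meet"],
    "communicative desensitization": ["how cum", "i just want to gobble you up"],
    "compliment": ["you are a really cute girl", "you are a sweet girl"],
    "isolation": ["are you alone", "do you have many friends"],
    "reframing": ["let's have fun together", "let's play a make believe game"],
}

def code_transcript(lines):
    """Count codebook-phrase hits per luring category in one side of a chat."""
    counts = {cat: 0 for cat in CODEBOOK}
    for line in lines:
        low = line.lower()
        for cat, phrases in CODEBOOK.items():
            counts[cat] += sum(low.count(p) for p in phrases)
    return counts
```

The resulting per-category counts are exactly the kind of eight-attribute instance vectors used in the classification experiments described next.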
In order to provide a baseline for the usefulness of the codebook for detecting
online predation, we ran two small categorization experiments. In the first exper-
iment, we coded 16 transcripts in two ways: first we coded the predator dialogue
(so only phrases used by the predator were recorded), and then we coded the
victim dialogue. Thus, we had 32 instances, each with a count of the phrases
in each of the coding categories (eight attributes). Our class attribute was binary
(predator or victim).
We used the J48 classifier within the Weka suite of data mining tools (Witten
and Frank 2005) to build a decision tree to predict whether the coded dialogue
was predator or victim. The J48 classifier builds a C4.5 decision tree with reduced
error pruning (Quinlan 1993). This experiment is similar to that in Pendar (2007),
but Pendar used a bag-of-words approach and an instance-based learner. The clas-
sifier correctly predicted the class 60% of the time, a slight improvement over the
50% baseline. This is remarkable when we consider the fact that we were cod-
ing individuals who were in conversation with each other, and therefore the
Table 8.1 Sample excerpts from the codebook for Internet predation.
Phrase Coding category
are you safe to meet Approach
i just want to meet Approach
i just want to meet and mess around Approach
how cum Communicative desensitization
if i don’t cum right back Communicative desensitization
i want to cum down there Communicative desensitization
i just want to gobble you up Communicative desensitization
you are a really cute girl Compliment
you are a sweet girl Compliment
are you alone Isolation
do you have many friends Isolation
let’s have fun together Reframing
let’s play a make believe game Reframing
there is nothing wrong with doing that Reframing
[Figure 8.1 bar chart: for clusters Cluster0–Cluster3, the proportion of phrases
in each coding category (Personal Information, Activities, Compliments, Relations,
Reframing, Communicative Desensitization, Isolation, Approach).]

Figure 8.1 Initial clustering of predator type.
8.2.5 Cyberbullying detection
In 2006, the Conference on Human Factors in Computing Systems (CHI) ran a
workshop on the misuse and abuse of interactive technologies, and in 2008 Rawn
and Brodbeck showed that participants in first-person shooter games had a high
level of verbal aggression, although in general there was no correlation between
gaming and aggression (Rawn and Brodbeck 2008).
Most recently, in 2009 the Content Analysis for the Web 2.0 (CAW 2.0)
workshop was formed and held in conjunction with WWW2009. As noted above,
the CAW 2.0 organizers devised a shared task to deal with online harassment,
and also developed a dataset to be used for research in this area. Only one
submission was received for the misbehavior detection task. A brief summary of
that paper follows.
Yin et al. define harassment as communication in which a user intentionally
annoys another user in a web community. In Yin et al. (2009), detection of
harassment is presented as a classification problem with two classes: a positive
class for posts that contain harassment and a negative class for posts that
do not.
The authors combine a variety of methods to develop the attributes for input
to their classifier. They use standard term weighting techniques, such as TFIDF
(Term Frequency–Inverse Document Frequency) to extract index terms and give
appropriate weight to each term. They also develop a rule-based system for
capturing sentiment features. For example, a post that contains foul language
and the word ‘you’ (which can appear in many forms in online communication)
is likely to be an insult directed at someone, and therefore could be perceived as a
bullying post. Finally, some web communities seem to engage in friendly banter
or ‘trash talk’ that may appear to be bullying, but is instead just a communicative
style. The authors also were able to identify contextual features by comparing a
post to a window of neighboring posts. Posts that are unusual or which generate
a cluster of similar activity from other users are more likely to be harassing.
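The sentiment and contextual features just described can be caricatured in a few lines. The word lists and function below are hypothetical stand-ins, not the authors' feature extractor: a real system would use a curated profanity lexicon, the many surface forms of 'you', and the full TFIDF feature set.

```python
SECOND_PERSON = {"you", "u", "ur", "your", "youre"}
FOUL = {"stupid", "idiot", "loser"}  # stand-in for a curated profanity lexicon

def jaccard(a, b):
    """Word-set overlap between two posts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def harassment_features(post, window):
    """Toy sentiment + contextual features in the spirit of Yin et al. (2009)."""
    toks = set(post.lower().split())
    # rule-based sentiment feature: foul language aimed at 'you'
    directed_insult = int(bool(toks & SECOND_PERSON) and bool(toks & FOUL))
    # contextual feature: how dissimilar the post is to its neighbours
    context_novelty = 1.0 - max((jaccard(post, n) for n in window), default=0.0)
    return {"directed_insult": directed_insult, "context_novelty": context_novelty}
```

Features like these would then be appended to the local TFIDF weights before training the SVM described below.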
After extracting relevant features, the authors developed an SVM classifier
for detecting bullying behavior in three of the datasets provided by the CAW 2.0
conference organizers. They chose two different types of communities: Kongre-
gate, which captures IM conversations during game play; and Slashdot/MySpace,
which tend to be more asynchronous discussion-style forums where users write
longer messages and discussion may continue over days or weeks. The authors
manually labeled the three datasets. The level of harassment in general was very
sparse. Overall only 42 of the 4802 posts in the Kongregate dataset represented
bullying behavior. The ratio of bullying to nonbullying in Slashdot was similar
(60 out of 4303 posts). MySpace was a little higher with 65 out of 1946 posts.
The authors employed an SVM to develop a model for classifying harassing
posts. Their experimental results show that including the contextual and sentiment
features improves the classification over the local weighting (TFIDF) baseline
for the three datasets. The maximum recall was achieved with the chat-style
collection (recall was 0.595 for Kongregate). Precision was best when the dataset
contained more harassment (precision was 0.417 for MySpace). Overall the
F-measure ranged from 0.298 to 0.442, so there is much room for improvement.
measure ranged from 0.298 to 0.442, so there is much room for improvement.
A random chance baseline would be less than 1%, however, so the experimental
results show that detection of cyberbullying is possible.
8.5 Acknowledgements
This work was supported in part by the Ursinus College Summer Fellows pro-
gram. The authors thank Dr Susan Gauch and her students for providing the
ChatTrack data, and Dr Nick Pendar for his helpful advice on acquiring the
Perverted-justice.com transcripts. We also thank Fundación Barcelona Media
(FBM) for compiling and distributing the CAW 2.0 shared task datasets. Our
thanks extend to the many students and colleagues in both the Mathematics and
Computer Science and Media and Communication Studies Departments at Ursi-
nus College who have provided support and input to this project, as well as to
the editors for their patience and feedback.
References
Acar E, Camtepe S, Krishnamoorthy M and Yener B 2005 Modeling and multiway anal-
ysis of chatroom tensors. IEEE International Conference on Intelligence and Security
Informatics.
Agatston P, Kowalski R and Limber S 2007 Students perspectives on cyber bullying.
Journal of Adolescent Health 41(6), S59–S60.
Axlerod H and Jay DR 1999 Crime and punishment in cyberspace: Dealing with law
enforcement and the courts. SIGUCCS’99: Proceedings of the 27th Annual ACM
SIGUCCS Conference on User Services, pp. 11–14.
Backstrom L, Huttenlocher D, Kleinberg J and Lan X 2006 Group formation in
large social networks: Membership, growth, and evolution. Proceedings of the 12th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD’06 .
Bengel J, Gauch S, Mittur E and Vijayaraghavan R 2004 ChatTrack: Chat room topic
detection using classification. Second Symposium on Intelligence and Security Infor-
matics.
Boyd D and Ellison N 2007 Social network sites: Definition, history, and scholarship.
Journal of Computer-Mediated Communication 13(1), 210–230.
Bsecure 2009 http://www.bsecure.com/Products/Family.aspx.
Burmester M, Henry P and Kermes LS 2005 Tracking cyberstalkers: A cryptographic
approach. ACM SIGCAS Computers and Society 35(3), 2.
Camtepe S, Krishnamoorthy M and Yener B 2004 A tool for Internet chatroom surveil-
lance. Second Symposium on Intelligence and Security Informatics.
CAW2.0 2009 http://caw2.barcelonamedia.org/.
Consumer Search 2008 Parental control software review. http://www.consumersearch.com/parental-control-software/review.
Cooke E, Jahanian F and Mcpherson D 2005 The zombie roundup: Understanding, detect-
ing, and disrupting botnets. Workshop on Steps to Reducing Unwanted Traffic on the
Internet (SRUTI), pp. 39–44.
CyberPatrol 2009 http://www.cyberpatrol.com/family.asp.
Dewes C, Wichmann A and Feldmann A 2003 An analysis of Internet chat systems.
IMC’03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement ,
pp. 51–64.
eBlaster 2008 http://www.eblaster.com/.
Ellison N, Steinfield C and Lampe C 2007 The benefits of Facebook ‘friends’: Social
capital and college students’ use of online social network sites. Journal of Computer-
Mediated Communication 12(4), 1143–1168.
FacebookMobile 2009 http://www.facebook.com/mobile/.
Gianvecchio S, Xie M, Wu Z and Wang H 2008 Measurement and classification of
humans and bots in internet chat. SS’08: Proceedings of the 17th Conference on Security
Symposium, pp. 155–169.
Hartigan J and Wong MA 1979 A k-means clustering algorithm. Applied Statistics 28(1),
100–108.
IamBigBrother 2009. http://www.iambigbrother.com/.
ICQ-Sniffer 2009 icq-sniffer.qarchive.org/.
IM MarketShare 2009 http://www.bigblueball.com/forums/general-other-im-news/34413-im-market-share.html/.
Internet Crimes Against Children 2009. http://www.icactraining.org/.
InternetSafety 2009 http://www.internetsafety.com/safe-eyes-parental-control-software.php.
Jones Q, Moldovan M, Raban D and Butler B 2008 Empirical evidence of information
overload constraining chat channel community interactions. Proceedings of the ACM
2008 Conference on Computer Supported Cooperative Work .
Kontostathis A, Edwards L and Leatherman A 2009 ChatCoder: Toward the tracking and
categorization of Internet predators. Proceedings of the Text Mining Workshop 2009
held in conjunction with the Ninth SIAM International Conference on Data Mining
(SDM 2009).
Kumar R, Novak J, Raghavan P and Tomkins A 2004 Structure and evolution of blogspace.
Communications of the ACM 47(12), 35–39.
Leatherman A 2009 Luring language and virtual victims: Coding cyber-predators’ online
communicative behavior. Technical report, Ursinus College, Collegeville, PA.
Leskovec J, Lang KJ, Dasgupta A and Mahoney MW 2008 Statistical properties of
community structure in large social and information networks. WWW’08: Proceedings of
the 17th International Conference on World Wide Web, pp. 695–704.
Muller M, Raven M, Kogan S, Millen D and Carey K 2003 Introducing chat into business
organizations: Toward an instant messaging maturity model. Proceedings of the 2003
International ACM SIGGROUP Conference on Supporting Group Work .
Nash KS 2008 A peek inside Facebook. http://www.pcworld.com/businesscenter/article/150489/a_peek_inside_facebook.html.
National Center for Missing and Exploited Children 2008 http://www.missingkids.com/en_US/documents/CyberTiplineFactSheet.pdf.
National Crime Prevention Council 2009a http://www.ncpc.org/topics/by-audience/cyberbullying/cyberbullying-faq-for-teens.
National Crime Prevention Council 2009b http://www.ojp.usdoj.gov/cds/internet_safety/NCPC/Stop_CyberbullyingBeforeItStarts.pdf.
Net Nanny 2008 http://www.netnanny.com/.
Olson L, Daggs J, Ellevold B and Rogers T 2007 Entrapping the innocent: Toward a
theory of child sexual predators’ luring communication. Communication Theory 17(3),
231–251.
O’Murchu I, Breslin J and Decker S 2004 Online social and business networking com-
munities. Technical report, Digital Enterprise Research Institute (DERI).
PC Mag 2008 Net Nanny 6.0. http://www.pcmag.com/article2/0,2817,2335485,00.asp.
Pendar N 2007 Toward spotting the pedophile: Telling victim from predator in text
chats. Proceedings of the First IEEE International Conference on Semantic Computing,
pp. 235–241.
Personal Communication 2008 Trooper Paul Iannace, Pennsylvania State Police, Cyber
Crimes Division.
Perverted-Justice.com 2008 Perverted justice. www.perverted-justice.com.
Quinlan R 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rawn RWA and Brodbeck DR 2008 Examining the relationship between game type, player
disposition and aggression. Future Play ’08: Proceedings of the 2008 Conference on
Future Play, pp. 208–211.
Riffe D, Lacy S and Fico F 1998 Analyzing Media Messages: Using Quantitative Content
Analysis in Research. Lawrence Erlbaum Associates.
Sipior JC and Ward BT 1999 The dark side of employee email. Communications of the
ACM 42(7), 88–95.
TigerDirect 2009 http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3728335&CatId=986.
TopTenReviews 2009 http://monitoring-software-review.toptenreviews.com/i-am-big-brother-review.html.
Tuulos V and Tirri H 2004 Combining topic models and social networks for chat data
mining. Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web
Intelligence, pp. 235–241.
Van Dyke N, Lieberman H and Maes P 1999 Butterfly: A conversation-finding agent for
Internet relay chat. Proceedings of the 4th International Conference on Intelligent User
Interfaces.
Williams K and Guerra N 2007 Prevalence and predictors of Internet bullying. Journal
of Adolescent Health 41(6), S14–S21.
Witten I and Frank E 2005 Data Mining: Practical Machine Learning Tools and Tech-
niques. Morgan Kaufmann.
Yin D, Xue Z, Hong L, Davison BD, Kontostathis A and Edwards L 2009 Detection of
harassment on Web 2.0. Proceedings of the Content Analysis in the Web 2.0 (CAW2.0)
Workshop at WWW2009 .
Part III
TEXT STREAMS
9
9.1 Introduction
Text streams – collections of documents or messages that are generated and
observed over time – are ubiquitous. Our research and development are aimed
at algorithms that find and characterize changes in topic within text
streams. To date, this research has demonstrated the ability to detect and describe
(1) short-duration atypical events and (2) the emergence of longer term shifts
in topical content. This technology has been applied to predefined, temporally
ordered document collections but is also suitable for near-real-time textual
data streams.
Massive amounts of text stream data exist and are readily available, especially
over the Internet. Analyzing this text data for content and for detecting change in
topic or sentiment can be a daunting task. Mathematical and statistical methods
in the area of data mining can be very helpful to the analyst looking for these
changes. Specifically, we have implemented some of these techniques into a
surprise event and emerging trend detection technology designed to monitor a
stream of text or messages for changes within the content of that data stream.
Some of the event types that one might want to detect in a text stream (which
could be a sequence of news articles, a sequence of messages, or an evolving
dialogue) are shown in Figure 9.1. In each case, time is along the x-axis. The
y-axis corresponds to some measure of topic (such as the number of words or events
Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
2010, John Wiley & Sons, Ltd
Figure 9.1 Typical event or trend types.
that occur within the data). In the context of a text stream, a point discontinuity
in topics could correspond to a single time step with a relatively unique content.
A jump discontinuity could correspond to an abrupt change in the content of the
text stream. A slope discontinuity could correspond to a ramping up (or down)
in a topic for that text stream.
Typically, jump and point discontinuities are detected more readily than slope
discontinuities (Eubank and Whitney 1989). For our terminology, we refer
to the instantaneous discontinuity types (point or jump) as a surprise event (see
Grabo (2004) for more information on surprise events). We define an emerging
trend as a change in topic for an extended period of time, as illustrated by the
jump discontinuity or the slope discontinuity (see Kontostathis et al. (2003) for
a more concise definition of emerging trend).
Much of the research in information mining from text streams focuses either
on describing new events and salient features or on clustering documents (He et al.
2007; Kumaran and Allan 2004; Mei and Zhai 2005). For instance, the goal of
the Topic Detection and Tracking (TDT) Research Program (Allan 2002) was to
break down the text into individual news stories, to monitor the stories for events
that have not been seen before, and to gather the stories into groups that each
discuss a single topic. This program used a training set to identify stories (topics)
to track. A good source of research in trend analysis was compiled in Survey of
Text Mining: Clustering, Classification, and Retrieval (Kontostathis et al. 2003)
and also in the article ‘Detecting emerging trends from scientific corpora’ (Le
et al. 2005). In both, the main focus is tracking defined topics and trying to detect
changes.
The difference in our approach is that we monitor and evaluate the occur-
rence of individual terms (the least common denominator between documents)
for changes over time. Once individual terms have been determined to be surprising
or emerging, terms related temporally are identified to help the analyst
identify the story/topic involved with the surprising (emerging) terms. As a pre-
processing step, a text analysis tool is used to extract words from the text stream
and give information about terms within the documents. With this information,
mathematical algorithms are used to score each term. Using these scores (statis-
tical metrics, which we call surprise or emergence statistics), we evaluate each
term over the period represented by the text stream. When a sufficiently surpris-
ing (emerging) term occurs, related terms (based on the temporal profile) are
found and are useful in explaining the broader nature of the event.
Detected events and the explanatory terms can be represented in a variety
of ways. From our experience, graphical representations tend to be the most
desirable (if not most useful) form for the analysts.
A description of the data (text streams) and the extraction and reduction
of relevant features are discussed in the next two sections. The methodology
for the detection of (surprising) events and (emerging) trends is discussed in
Sections 9.4 and 9.5. In Section 9.6, we discuss temporally related terms and
present an example to illustrate the capabilities of our technology. The last two
sections discuss differences in our algorithms, contrast our algorithms with other
topicality measures, and summarize our technology development.
In our modeling, the (document frequency) temporal profile for each term
is the dependent variable. Therefore, the selection of terms that represent the
document set is a key task. This task is accomplished within IN-SPIRE. An
important feature of this capability is the automatic keyword extraction. This
capability allows keywords to be single words or phrases that reflect the content
of a document. An in-depth description of this technology is included in the first
chapter of this book.
The goal is to find the times when these two measurements (counts) are not
(statistically) the same. Think of this like a hypothesis test in statistics: we define
the null hypothesis ($H_0$) and alternate hypothesis ($H_a$) as

$$H_0: x_i = \frac{1}{n_p} \sum_{j=i-n_p}^{i-1} x_j, \qquad\text{and}\qquad
H_a: x_i \neq \frac{1}{n_p} \sum_{j=i-n_p}^{i-1} x_j.$$
The goal of a hypothesis test is to reject the null hypothesis and accept
the alternate hypothesis. We have developed our algorithms with this in mind.
The first surprise algorithm is based on a chi-square statistic (Pearson method)
constructed from the following 2 × 2 table (Agresti 2002):

$$\begin{array}{cc}
x_i & N_i - x_i \\[6pt]
\displaystyle\sum_{j=i-n_p}^{i-1} x_j & \displaystyle\sum_{j=i-n_p}^{i-1} N_j - \sum_{j=i-n_p}^{i-1} x_j
\end{array}$$

where, for this table, $x_i$ is the count (the number of documents containing a
specific term) at the $i$th time step/bin, $N_i$ is the total number of documents at
the $i$th time step, $\sum x_j$ is the sum of the document counts containing the term
in the $n_p$ time steps prior to the $i$th time step, and $\sum N_j$ is the total number
of documents in the $n_p$ time steps prior to time $t$ (the time at the $i$th time
step). The amount
of time (both the width of a time interval and the number of time windows) is
a user-selected parameter of the procedure. A value sufficiently large for a chi-
square statistic is one way to flag a surprising event/term. This statistic looks
for deviations in the number of occurrences of a specific term normalized by the
total number of documents (within the same time interval).
The formula used for the chi-square statistic is

$$\chi^2 = \frac{n_{..}\left(\,|n_{11} n_{22} - n_{12} n_{21}| - \tfrac{1}{2} Y n_{..}\right)^2}{n_{1.}\, n_{2.}\, n_{.1}\, n_{.2}}, \qquad (9.1)$$

where the previous 2 × 2 frequency table is rewritten as

$$\begin{array}{cc} n_{11} & n_{12} \\ n_{21} & n_{22} \end{array}$$

$Y$ is an indicator for Yates' continuity correction ($Y = 1$ when the correction is
applied and $0$ otherwise), and

$$n_{1.} = n_{11} + n_{12}, \quad n_{2.} = n_{21} + n_{22}, \quad
n_{.1} = n_{11} + n_{21}, \quad n_{.2} = n_{12} + n_{22}, \quad
n_{..} = n_{11} + n_{12} + n_{21} + n_{22}.$$
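Equation (9.1) is straightforward to compute directly. The sketch below is an illustrative implementation with invented counts, not the chapter's code; SciPy's `scipy.stats.chi2_contingency` computes an equivalent corrected statistic:

```python
def yates_chi_square(n11, n12, n21, n22, correction=True):
    """Chi-square statistic for a 2 x 2 table, with the optional
    Yates continuity correction as in Equation (9.1)."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    if min(row1, row2, col1, col2) == 0:
        return 0.0  # degenerate table: no evidence of a deviation
    diff = abs(n11 * n22 - n12 * n21)
    if correction:
        diff = max(diff - n / 2.0, 0.0)
    return n * diff ** 2 / (row1 * row2 * col1 * col2)

# Surprise table for one term at time step i (invented counts):
# row 1 -- documents at step i containing / not containing the term;
# row 2 -- the same counts pooled over the previous np time steps.
x_i, N_i = 9, 20    # term in 9 of 20 documents at step i
x_p, N_p = 5, 100   # term in 5 of 100 documents in the previous window
score = yates_chi_square(x_i, N_i - x_i, x_p, N_p - x_p)
```

A sufficiently large score (for instance, above the chi-square critical value for one degree of freedom) flags the term as surprising at that time step.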
The second surprise algorithm is based on a Gaussian statistic that standardizes
the current count against the mean of the previous window,

$$G = \frac{x_i - \frac{1}{n_p}\sum_{j=i-n_p}^{i-1} x_j}{s},$$

where $n_p$ is the number of time intervals in the previous time window and $s$ is
the standard deviation of the counts in that window.
Finally, combining the previous algorithms (chi-square and Gaussian) forms
the final two algorithms within our toolkit for the surprise statistic. Each combined
statistic is accomplished by taking the square root of the chi-square statistic
plus the absolute value of the Gaussian statistic, as follows:

$$C_{\mathrm{surprise}} = \sqrt{\chi^2} + |G|. \qquad (9.4)$$
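Taking the Gaussian surprise statistic to be the current count standardized by the previous window's mean and sample standard deviation (an assumption about its exact form), Equation (9.4) can be sketched as follows; all counts are invented:

```python
import math

def gaussian_surprise(x_i, prev_counts):
    """Current count standardized against the previous window
    (an assumed form of the Gaussian surprise statistic)."""
    np_ = len(prev_counts)
    mean = sum(prev_counts) / np_
    var = sum((x - mean) ** 2 for x in prev_counts) / (np_ - 1)
    s = math.sqrt(var)
    return (x_i - mean) / s if s > 0 else 0.0

def combined_surprise(chi_square, g):
    """Equation (9.4): square root of the chi-square statistic plus |G|."""
    return math.sqrt(chi_square) + abs(g)

prev = [2, 3, 2, 4, 3, 2, 3]    # counts over the previous np = 7 steps
g = gaussian_surprise(9, prev)  # today's count is unusually high
c = combined_surprise(22.14, g)
```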
For detecting emerging trends, the corresponding hypothesis test compares the
mean count in the current time window with that in the previous window:

$$H_0: \frac{1}{n_c}\sum_{j=i}^{i+n_c} x_j = \frac{1}{n_p}\sum_{j=i-n_p}^{i-1} x_j, \qquad\text{and}\qquad
H_a: \frac{1}{n_c}\sum_{j=i}^{i+n_c} x_j > \frac{1}{n_p}\sum_{j=i-n_p}^{i-1} x_j.$$
The chi-square (Pearson) algorithm for emergence is constructed from the
analogous 2 × 2 table:

$$\begin{array}{cc}
\displaystyle\sum_{j=i}^{i+n_c} x_j & \displaystyle\sum_{j=i}^{i+n_c} N_j - \sum_{j=i}^{i+n_c} x_j \\[10pt]
\displaystyle\sum_{j=i-n_p}^{i-1} x_j & \displaystyle\sum_{j=i-n_p}^{i-1} N_j - \sum_{j=i-n_p}^{i-1} x_j
\end{array}$$
where, for this table, xj in the first row is the sum of all the documents
containing the individual term in the current time window, Nj in the first row
is the total number of documents within this time period, xj in the second
row is the sum of all the documents containing the term within the period prior
to the current time step (previous window), and Nj in the second row is the
total number of documents within this (previous) time window. The number of
time steps within each interval (previous window and current window) is a user-
selected parameter of the procedure. (Note that these window sizes need not be
equal.)
For detecting trends, the Gaussian algorithm is modified from the surprise
implementation to incorporate the multiple time steps in the current time window
(time past the current time step, $i$). The new Gaussian algorithm is defined by

$$G = \frac{\dfrac{1}{n_c}\displaystyle\sum_{j=i}^{i+n_c} x_j - \dfrac{1}{n_p}\displaystyle\sum_{j=i-n_p}^{i-1} x_j}{\sqrt{\dfrac{s_i}{n_c} + \dfrac{s_j}{n_p}}},$$
where si is the standard deviation of counts in the current time window and sj
is the standard deviation of the counts in the previous time window.
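A sketch of this trend statistic with invented window counts; following the formula above, the denominator pools the standard deviations (not the variances) of the two windows:

```python
import math

def gaussian_emergence(curr_counts, prev_counts):
    """Gaussian trend statistic: difference between the current- and
    previous-window mean counts, scaled as in the formula above."""
    nc, np_ = len(curr_counts), len(prev_counts)
    mean_c = sum(curr_counts) / nc
    mean_p = sum(prev_counts) / np_

    def std(xs, m):
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

    denom = math.sqrt(std(curr_counts, mean_c) / nc +
                      std(prev_counts, mean_p) / np_)
    return (mean_c - mean_p) / denom if denom > 0 else 0.0

# A term ramping up in the current window versus a flat previous window:
g = gaussian_emergence([4, 6, 7, 9], [2, 3, 2, 3, 2, 3])
```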
Figure 9.5 Temporal profiles sorted by the chi-square (Pearson) surprise score
(ProMed-mail; 437 documents).
Figure 9.6 Temporal profiles sorted by the chi-square (Pearson) emergence score
(ProMed-mail). Terms shown, with maximum daily document counts: h1n1 (4),
alert (5), influenza a h1n1 (4), cent (4), mexico (5), swine influenza (4), swine
flu (4), swine (4), worldwide (5), texas (4), pandemic (5), patients (4), united (6),
novel (4), developing (3), regional (5), america (4), director (5), california (4),
germany (3).
The ProMed-mail dataset was analyzed with the surprise and emergence
chi-square (Pearson) algorithms. The temporal profiles for the top 20 surprising
terms are shown in Figure 9.5. The temporal profiles for the top 20 emerging
terms are shown in Figure 9.6. From these two plots, the main topic within this
dataset becomes obvious (H1N1, the swine flu outbreak of 2009). On April 24,
the surprise analysis (Figure 9.5) starts to select terms that first appear about the
swine flu outbreak (serious, vaccination, epidemic). However, the results of the
emergence analysis (Figure 9.6) clearly explain when and what occurred. The
results of using the Gaussian algorithms to analyze this ProMed-mail dataset are
shown in Figures 9.7 and 9.8. The results from the Gaussian surprise analysis
show that no swine flu outbreak terms were selected as significantly surprising
for this analysis. The results of the emergence analysis, however, did show the
selection of several (swine flu) relevant terms (Figure 9.8).
Similarities between terms within a given set can give an analyst more infor-
mation than just a single term can provide (including multi-term keywords).
We assess similarity based on the distances between vectors of the temporal
occurrence of each term. There are a large number of candidate algorithms for
calculating distances between temporal profiles. Our preferred implementation is
based on the correlation function between the vectors and is, for two such vectors
(x, y), equal to 1 − |corr(x, y)|. This distance often results in interpretable term
groupings (Kaufman and Rousseeuw 1990). Using combined related term pro-
files, one can gain more detailed information about the events. For illustration,
Figure 9.9 shows the related terms for the term mexico (from the analysis of the
Figure 9.7 Temporal profiles for the ProMed-mail dataset, sorted by the Gaussian
surprise score. Terms shown, with maximum daily document counts: mpp (8),
northern (8), postings (6), amp (12), similar (6), common (6), dead (7), mhj (8),
suspected (7), exposed (6), research (6), human (8), started (6), victims (5),
recent (8), person (6), acute (5), don (6), health (11), killed (5).
Figure 9.8 Temporal profiles for the ProMed-mail dataset, sorted by the Gaussian
emergence score. Terms shown, with maximum daily document counts: disease (11),
infected (9), animals (9), infection (8), map (11), outbreak (13), h1n1 (4), swine
flu (4), alert (5), 1st (9), influenza a h1n1 (4), cent (4), mhj (8), mexico (5),
united (6), virus (8), htm (6), reported (11), swine (4), swine influenza (4).
Figure 9.9 Temporal profiles for the term mexico and the top nine related terms,
grouped around mexico using similar words (ProMed-mail dataset): influenza a
h1n1, h1n1, swine flu, mexican, swine influenza, kong, hong, california, swine.
ProMed-mail dataset). From this, it is obvious that the main topic about this term
(mexico) is the 2009 swine flu (H1N1) outbreak.
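The distance used for grouping can be sketched directly from its definition, 1 − |corr(x, y)|, over two temporal profiles; the profiles below are invented for illustration:

```python
import math

def corr_distance(x, y):
    """Distance between two temporal profiles: 1 - |corr(x, y)|."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # a flat profile carries no correlation information
    return 1.0 - abs(cov / (sx * sy))

# Profiles that rise and fall together are close; unrelated ones are far.
mexico    = [0, 0, 1, 4, 5, 3]
swine_flu = [0, 0, 2, 8, 10, 6]
noise     = [3, 1, 3, 1, 3, 1]
close = corr_distance(mexico, swine_flu)
far = corr_distance(mexico, noise)
```

Terms with a small distance to a surprising or emerging term are reported as its related terms.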
9.7 Discussion
In the previous section, the surprise and emergence algorithms were used to
analyze the ProMed-mail dataset. From Figure 9.4, we see that the maximum
number of documents (reports) for a single day (from March 13 through May
13) was 17. In Figure 9.6, we see that the maximum number of documents that
contained the term h1n1 was only 4 (number on the right hand side of each
temporal profile). Because of the low number of term occurrences and document
counts, the surprise algorithms did not produce the desired results compared to
the results from the emergence algorithms.
A comparison of the surprise statistic (maximum value for each term) and the
emergence statistic is shown in Figure 9.10. Also shown in this figure is a com-
parison of the IN-SPIRE topicality score to the surprise and emergence statistic.
The IN-SPIRE topicality score is a measure that defines discriminating terms
within a set of documents. This comparison was done using the ProMed-mail
dataset and the chi-square (Pearson) algorithms. The fundamental observation is
that the metrics are uncorrelated, at least for this corpus, because no correlation
is seen in any of these plots (or very low correlation for the surprise–emergence
Figure 9.10 Comparison of topicality (topicality), event detection (surprise
score), and trend detection (emergence score) algorithms (ProMed-mail dataset).
plot), which suggests that these three statistics provide different information about
the dataset.
9.8 Summary
Mathematical and statistical methods in the area of text mining can be very
helpful for the analysis of the massive amounts of text stream data that exist.
Analyzing this data for content and for detecting change can be a daunting task.
Therefore, we have implemented some of these text mining techniques into a sur-
prise event and emerging trend detection technology that is designed to monitor
a stream of text or messages for changes within the content of that data stream.
In this chapter, we have described our algorithmic development in the area of
detecting evolving content in text streams (events and trends). We have compared
our results to text analysis results on a static document collection and found that
our techniques produce results that differ from, and complement, those results.
A recent dataset was analyzed using our surprise and emergence algorithms.
In this analysis, the emergence algorithms did a very good job of finding the
emergence of the most relevant subject matter (H1N1, swine flu outbreak) and
when the event began (April 24, 2009).
To help understand the important topics defined by each term (keyword),
related terms are found. For the swine flu analysis, the term mexico was found to
be a significant emerging term. The related term analysis showed that this term
was temporally related to the swine flu (H1N1) outbreak (2009).
9.9 Acknowledgements
The authors of this chapter would like to thank Andrea Currie for her editorial
review and Guang Lin for his LaTeX help and expertise.
References
Agresti A 2002 Categorical Data Analysis 2nd edn. John Wiley & Sons, Inc.
Allan J 2002 Topic Detection and Tracking: Event-based Information Organization.
Kluwer Academic.
Dunning T 1993 Accurate methods for the statistics of surprise and coincidence. Compu-
tational Linguistics 19(1), 61–74.
Engel D, Whitney P, Calapristi A and Brockman F 2009 Mining for emerging technologies
within text streams and documents. Ninth SIAM International Conference on Data
Mining. Society for Industrial and Applied Mathematics.
Eubank R and Whitney P 1989 Convergence rates for estimation in certain partially linear
models. Journal of Statistical Planning and Inference 23, 33–43.
Fleiss J 1981 Statistical Methods for Rates and Proportions 2nd edn. John Wiley & Sons,
Inc.
Grabo C 2004 Anticipating Surprise: Analysis for Strategic Warning. University Press of
America.
He Q, Chang K, Lim E and Zhang J 2007 Bursty feature representation for clustering
text streams. Seventh SIAM International Conference on Data Mining, pp. 491–496.
Society for Industrial and Applied Mathematics.
Hetzler E, Crow V, Payne D and Turner A 2005 Turning the bucket of text into a pipe.
IEEE Symposium on Information Visualization, pp. 89–94.
IN-SPIRE 2009 http://in-spire.pnl.gov. Pacific Northwest National Laboratory.
Kaufman L and Rousseeuw P 1990 Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley & Sons, Inc.
Kontostathis A, Galitsky L, Pottenger W, Roy S and Phelps D 2003 A survey of emerging
trend detection in textual data mining. In Survey of Text Mining: Clustering, Classifi-
cation, and Retrieval. Springer.
Kumaran G and Allan J 2004 Text classification and named entities for new event detec-
tion. ACM SIGIR Conference pp. 297–304.
Le M, Ho T and Nakamori Y 2005 Detecting emerging trends from scientific corpora.
ACM SIGIR Conference pp. 45–50.
Mei Q and Zhai C 2005 Discovering evolutionary theme patterns from text: An explo-
ration of temporal text mining. KDD, 11th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 198–207.
ProMED-mail 2009 http://www.promedmail.org.
Whitney P, Engel D and Cramer N 2009 Mining for surprise events within text streams.
Ninth SIAM International Conference on Data Mining, pp. 617–627. Society for Indus-
trial and Applied Mathematics.
10
Embedding semantics
in LDA topic models
Loulwah AlSumait, Pu Wang, Carlotta Domeniconi
and Daniel Barbará
10.1 Introduction
The huge advancement in databases and the explosion of the Internet, intranets,
and digital libraries have resulted in giant text databases. It is estimated that
approximately 85% of worldwide data is held in unstructured formats with an
increasing rate of roughly 7 million digital pages per day (White 2005). Such huge
document collections hold useful yet implicit and nontrivial knowledge about
the domain. Text mining (TM) is an integral part of data mining that is aimed
at automatically extracting such knowledge from the unstructured textual data.
The main tasks of TM include text classification, text summarization, document
and/or word clustering, in addition to classical natural language processing tasks
such as machine translation and question-answering. The learning tasks are more
complex when processing text documents that arrive in discrete or continuous
streams over time.
Topic modeling is a newly emerging approach to analyze large volumes of
unlabeled text (Steyvers and Griffiths 2005). It specifies a statistical sampling
technique to describe how words in documents are generated based on (a small
set of) hidden topics. In this chapter, we investigate the role of prior knowledge
semantics in estimating the topical structure of large text data in both batch and
online modes under the framework of latent Dirichlet allocation (LDA) topic
modeling (Blei et al. 2003). The objective is to enhance the descriptive and/or
predictive model of the data’s thematic structure based on the embedded prior
knowledge about the domain’s semantics.
The prior knowledge can be either external semantics from prior-knowledge
sources, such as ontologies and large universal datasets, or data-driven semantics,
i.e. domain knowledge extracted from the data itself. This
chapter investigates the role of semantic embedding in two main directions. The
first is to embed semantics from an external prior-knowledge source to enhance
the generative process of the model parameters. The second direction, which suits
the online knowledge discovery problem, is to embed data-driven semantics. The
idea is to construct the current LDA model based on information propagated from
topic models that were learned from previously seen documents of the domain.
10.2 Background
Given the unstructured nature of text databases, many challenges face TM
algorithms. First, there is a very large number of possible features to represent a
document. Such features can be derived from all the words and/or phrase types
in the language. Furthermore, in order to unify the data structure of documents, it
is necessary to use a dictionary of all the words to represent a document, which
results in a very sparse representation. Another critical challenge stems from the
complex relationships between concepts and from the ambiguity and context
sensitivity of words in text. Thus, a good TM algorithm must process such large
and challenging data efficiently, so that documents are represented by short
descriptions in which only the essential and most discriminative information is
solve this problem, then the LDA topic models will be introduced in Section 10.3.
where $C^{WK}_{w,j,\neg i}$ is the number of times word $w$ is assigned to topic $j$, not including
the current token instance $i$; and $C^{DK}_{d,j,\neg i}$ is the number of times topic $j$ is assigned
to some word token in document $d$, not including the current instance $i$. From
this distribution, i.e. $p(z_i \mid \mathbf{z}_{\neg i}, \mathbf{w})$, a topic is sampled and stored as the new topic
assignment for this word token. After a sufficient number of sampling iterations,
the approximated posterior can be used to get estimates of $\phi$ and $\theta$ by examining
the counts of word assignments to topics and topic occurrences in documents.
Given the direct estimate of topic assignments $z$ for every word, it is important
to obtain its relation to the required parameters $\Phi$ and $\Theta$. This is achieved
by sampling new observations based on the current state of the Markov chain
(Steyvers and Griffiths 2005). Thus, estimates $\acute{\Phi}$ and $\acute{\Theta}$ of the word–topic and
topic–document distributions can be obtained from the count matrices:

$$\acute{\phi}_{ik} = \frac{C^{WK}_{i,k} + \beta_{i,k}}{\sum_{v=1}^{W}\left(C^{WK}_{v,k} + \beta_{v,k}\right)}, \qquad
\acute{\theta}_{dk} = \frac{C^{DK}_{d,k} + \alpha_{d,k}}{\sum_{j=1}^{K}\left(C^{DK}_{d,j} + \alpha_{d,j}\right)}. \qquad (10.3)$$
Gibbs sampling has been empirically tested to determine the required length
of the burn-in phase, the way to collect samples, and the stability of inferred
topics (Griffiths and Steyvers 2004; Heinrich 2005; Steyvers and Griffiths 2005).
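The sampler can be sketched compactly as follows. This is a toy illustration of the scheme above, not production code; the corpus and hyperparameters are invented, and burn-in handling and sample averaging are omitted:

```python
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each token's topic
    conditioned on all other assignments, then estimate phi."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    W = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    C_WK = [[0] * K for _ in range(W)]           # word-topic counts
    C_DK = [[0] * K for _ in range(len(docs))]   # document-topic counts
    C_K = [0] * K                                # tokens per topic
    z = []
    for d, doc in enumerate(docs):               # random initial assignments
        zd = []
        for w in doc:
            k = rng.randrange(K)
            C_WK[widx[w]][k] += 1; C_DK[d][k] += 1; C_K[k] += 1
            zd.append(k)
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], widx[w]
                C_WK[v][k] -= 1; C_DK[d][k] -= 1; C_K[k] -= 1  # exclude token i
                weights = [(C_WK[v][j] + beta) / (C_K[j] + W * beta)
                           * (C_DK[d][j] + alpha) for j in range(K)]
                k = rng.choices(range(K), weights)[0]
                C_WK[v][k] += 1; C_DK[d][k] += 1; C_K[k] += 1
                z[d][i] = k
    # point estimate of the word-topic distributions, as in (10.3)
    phi = [[(C_WK[v][k] + beta) / (C_K[k] + W * beta) for v in range(W)]
           for k in range(K)]
    return vocab, phi

docs = [["river", "stream", "bank"], ["money", "loan", "bank"],
        ["river", "bank", "stream"], ["loan", "money", "bank"]]
vocab, phi = lda_gibbs(docs, K=2)
```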
10.3.3 Online latent Dirichlet allocation (OLDA)
OLDA is an online version of the LDA model that is able to process text streams
(AlSumait et al. 2008). The OLDA model considers the temporal ordering infor-
mation and assumes that the documents arrive in discrete time slices. At each
time slice $t$ of a predetermined size $\varepsilon$, e.g. an hour, a day, or a year, a stream of
documents, $S^t = \{d_1, \ldots, d_{D^t}\}$, of variable size $D^t$, is received and ready to be
processed. A document $d$ received at time $t$ is represented as a vector of word
tokens, $\mathbf{w}^t_d = \{w^t_{d1}, \ldots, w^t_{dN_d}\}$. Then, an LDA topic model with K components
}. Then, an LDA topic model with K components
is used to model the newly arrived documents. The generated model, at a given
time, is used as a prior for LDA at the successive time slice, when a new data
stream is available for processing (see Figure 10.2 for an illustration). The
hyperparameters β can be interpreted as the prior observation counts on the number
of times words are sampled from a topic before any word from the corpus is
observed (Steyvers and Griffiths 2005). So, the count of words in topics,
resulting from running LDA on documents received at time t, can be used as the
priors for the t + 1 stream.
Thus, the per-topic distribution over words at time $t$, $\phi^{(t)}_k$, is drawn from a
Dirichlet distribution governed by the inferred topic structure at time $t - 1$ as
follows:

$$\phi^{(t)}_k \sim \mathrm{Dir}\left(\beta^{(t)}_k\right).$$

[Figure 10.2: graphical representation of OLDA across time slices $t - 1$ and $t$,
showing priors construction, topic evolution tracking, emerging topic detection,
and topic significance ranking.]
where each entry $B_k(v, t)$ is the weight of word $v$ under topic $k$ at time $t$.¹ Thus,
working with the evolutionary matrix will allow for tracking the drifts of existing
topics, detection of emerging topics, and visualizing the data in general.
Thus, the generative model for time slice t of the proposed OLDA model can
be summarized as follows:
¹ New observed terms at time $t$ are assumed to have 0 count in $\phi$ for all topics in previous
streams.
Maintaining the models’ priors as Dirichlet is essential to simplify the infer-
ence problem by making use of the conjugacy property of Dirichlet and multi-
nomial distributions. In fact, by tracking the history as prior patterns, the data
likelihood and, hence, the posterior inference of LDA are left the same. Thus,
implementing Gibbs sampling in Equation (10.2) in OLDA is straightforward.
The main difference of the online approach is that the sampling is performed
over the current stream only. This makes the time complexity and memory usage
of OLDA efficient and practical. In addition, the β under OLDA are constructed
from historic observations rather than fixed values.
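In its simplest form, the hand-off can be sketched as: the word–topic counts inferred at time t, plus a small base mass, become the β priors for stream t + 1 (the evolution-matrix weighting described above generalizes this). The base-mass value below is an invented placeholder:

```python
def next_priors(word_topic_counts, base=0.01):
    """OLDA prior hand-off (sketch): word-topic counts inferred at time t
    become the Dirichlet priors beta for the stream at time t + 1.
    A small base mass keeps still-unseen words at nonzero probability."""
    return [[c + base for c in row] for row in word_topic_counts]

# counts from stream t: 3 vocabulary words x 2 topics (invented)
counts = [[5, 0], [4, 1], [0, 6]]
beta_next = next_priors(counts)
```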
Table 10.3 Topic distributions of dynamic simulated data over three streams.
A dash (–) indicates that the corresponding word or topic has not yet emerged.

Stream        t = 1              t = 2              t = 3
Topic         k1   k2   k3       k1   k2   k3       k1   k2   k3
Weight        40%  60%  0%       40%  50%  10%      30%  40%  30%
Dictionary    p(wi|kj)           p(wi|kj)           p(wi|kj)
river         0.2  0    –        0.4  0    0        0.37 0    0
stream        0.4  0    –        0.2  0    0        0.41 0    0
bank          0.3  0.35 –        0.25 0.36 0.1      0.22 0.28 0
money         0    0.3  –        0    0.24 0        0    0.3  0.07
loan          0    0.25 –        0.05 0.22 0.1      0    0.2  0
debt          –    –    –        0    0.08 0        0    0.12 0
factory       –    –    –        0    0    0.37     0    0    0.33
product       –    –    –        0    0    0.33     0    0    0.25
labor         –    –    –        –    –    –        0    0    0.25
news          0.05 0.05 –        0.05 0.05 0.05     0.05 0.05 0.05
reporter      0.05 0.05 –        0.05 0.05 0.05     0.05 0.05 0.05
Given the same dictionary, three streams of documents are generated from
evolving descriptions of topics to demonstrate the OLDA model. Table 10.3
shows the distributions of topics in the three time epochs. Topic 3 emerges as a
new topic at the second time epoch. In addition to the new terms introduced by
Table 10.4 Topics discovered by OLDA from dynamic simulated data.

ID   t = 1           t = 2                      t = 3
1    news reporter   news reporter              reporter news
2    bank            bank                       bank
3    money loan      money loan debt            money loan debt
4    stream river    river stream               river stream
5    bank news       bank factory production    production factory labor
topic 3, a number of terms such as debt and labor gradually emerge. The weight
(importance) of topics also varies between the streams. The OLDA topic model
is trained on the corresponding documents of each stream with K set to 5. At
each time epoch, OLDA is trained on the currently generated documents only.
Table 10.4 lists the most important words under each topic of the evolving
simulated data that were discovered by OLDA with K set to 5 at each time
epoch. After 50 iterations of Gibbs sampling on each stream, OLDA converged
to aligned topic models that correspond to the true topic densities and evolution.
Another observation stems from the setting of K, i.e. the number of compo-
nents. When K is set to the true number of topics, the topic distributions included
some common words in addition to the semantically descriptive ones, see for
example the words news and reporter in topics T1 , T2 , and T3 in Table 10.2.
When K is increased to 5, the topics became more focused as the common words
are mapped into individual topics; see topics 1 and 2 in Table 10.4.
$$\phi_{ik} = \frac{C^{WK}_{i,k} + \beta_i}{\sum_{v=1}^{W}\left(C^{WK}_{v,k} + \beta_v\right)}, \qquad
\theta_{mk} = \frac{C^{DK}_{m,k} + \alpha_k}{\sum_{j=1}^{K}\left(C^{DK}_{m,j} + \alpha_j\right)}, \qquad (10.6)$$

where $m$ is the index of the Wikipedia article. Within the related Wikipedia
articles, $C^{WK}_{i,k}$ is the number of times word $i$ is assigned to topic $k$ and $C^{DK}_{m,k}$
is the number of times topic $k$ is assigned to some word token in Wikipedia
article $m$.
The prior distributions $\phi$ and $\theta$ are then updated into posteriors using the test
documents. Specifically, the topic–word distribution $\phi$ is updated to a new $\hat{\phi}$,
and a new topic–document distribution $\hat{\theta}$ is learned from scratch using the test
documents:

$$\hat{\phi}_{ik} = \frac{C^{WK}_{i,k} + \overline{C}^{WK}_{i,k} + \beta_i}{\sum_{v=1}^{W}\left(C^{WK}_{v,k} + \overline{C}^{WK}_{v,k} + \beta_v\right)}, \qquad
\hat{\theta}_{dk} = \frac{\overline{C}^{DK}_{d,k} + \alpha_k}{\sum_{j=1}^{K}\left(\overline{C}^{DK}_{d,j} + \alpha_j\right)}, \qquad (10.7)$$

where $\overline{C}$ denotes the corresponding counts accumulated from the test documents.
$$\beta^{(t)}_k = \mathbf{B}^{(t-1)}_k\, \boldsymbol{\omega} \qquad\qquad (10.8)$$
$$\;\; = \hat{\phi}^{(t-\delta)}_k \omega_1 + \cdots + \hat{\phi}^{(t-2)}_k \omega_{\delta-1} + \hat{\phi}^{(t-1)}_k \omega_\delta. \qquad (10.9)$$
Given the equality in Equation (10.8), the per-topic distribution over words at
time $t$, $\phi^{(t)}_k$, is drawn from a Dirichlet distribution governed by the evolutionary
matrix of the topic as follows:

$$\phi^{(t)}_k \sim \mathrm{Dir}\left(\mathbf{B}^{(t-1)}_k\, \boldsymbol{\omega}\right).$$
By updating the priors as described above, the structure of the model is kept
simple, as all the historic knowledge patterns are printed in the priors rather than
in the structure of the graphical model itself. In addition, the learning process
on the new stream of data starts from what has been learned so far, rather than
starting from arbitrary settings that do not relate to the underlying distributions.
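Equations (10.8)–(10.9) reduce to a weighted sum of the last δ inferred topic–word distributions. A sketch with invented values, where ω holds one weight per stored model, oldest first:

```python
def history_prior(B_k, omega):
    """beta_k(t) = B_k(t-1) * omega: weighted sum of the delta stored
    per-word distributions for topic k (Equations (10.8)-(10.9))."""
    assert len(B_k) == len(omega)
    W = len(B_k[0])
    return [sum(B_k[s][v] * omega[s] for s in range(len(B_k)))
            for v in range(W)]

# delta = 3 past models over a two-word vocabulary, equal weights
B_k = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5]]
beta_k = history_prior(B_k, [1 / 3, 1 / 3, 1 / 3])
```

Larger weights on recent columns let the present dominate; equal weights spread the influence evenly over the window.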
8:  $\beta^{(t)}_k = \mathbf{B}^{(t-1)}_k \boldsymbol{\omega}$, $k \in \{1, \ldots, K\}$
9:  end if
10: $\alpha^{(t)}_d = a$, $d = 1, \ldots, D^{(t)}$
11: initialize $\phi^{(t)}$ and $\theta^{(t)}$ to zeros
12: initialize topic assignment, $z^{(t)}$, randomly for all word tokens in $S^{(t)}$
13: $[\phi^{(t)}, \theta^{(t)}, z^{(t)}] = \mathrm{GibbsSampling}(S^{(t)}, \beta^{(t)}, \alpha^{(t)})$
14: if $t < \delta$ then
15:   $\mathbf{B}^{(t)}_k = \mathbf{B}^{(t-1)}_k \cup \hat{\phi}^{(t)}_k$, $k \in \{1, \ldots, K\}$
16: else
17:   $\mathbf{B}^{(t)}_k = \mathbf{B}^{(t-1)}_k(1 : W^{(t)}, 2 : \delta) \cup \hat{\phi}^{(t)}_k$, $k \in \{1, \ldots, K\}$
18: end if
19: end loop
Figure 10.3 Perplexity of OLDA on Reuters with and without Wikipedia articles.
Figure 10.4 Perplexity of OLDA on Reuters for various window sizes compared
to OLDAFixed.
Figure 10.5 Perplexity of OLDA on NIPS for various window sizes compared to
OLDAFixed.
Figure 10.6 Average perplexity of OLDA on Reuters and NIPS under different
weights of history contribution compared to OLDA with fixed β.
Reuters’ documents span a short period of time while the streams of NIPS
are yearly based. As a result, the Reuters’ topics are homogeneous and more
stable. So, letting the current generative model be heavily influenced by the past
topical structure will eventually result in a better description of the data. On the
other hand, although there is a set of predefined publication domains in NIPS,
like algorithms, applications, and visual processing, these topics are very broad
and interrelated. Furthermore, research papers usually cover more topics and
continuously introduce novel ideas and topics. Hence, the influence of previous
semantics should not exceed the topical structure of the present.
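The perplexity reported in Figures 10.3–10.6 is the usual held-out measure: the exponential of the negative average per-token log-likelihood, with lower values indicating a better predictive model. A minimal sketch with toy numbers:

```python
import math

def perplexity(stream_log_likelihoods, stream_token_counts):
    """Perplexity over held-out streams: exp of the negative average
    per-token log-likelihood."""
    return math.exp(-sum(stream_log_likelihoods) / sum(stream_token_counts))

# two held-out streams: total log-likelihood and token count of each (toy)
p = perplexity([-6907.75, -7003.2], [1000, 1050])
```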
References
AlSumait L and Domeniconi C 2008 Text clustering with local semantic kernels. In
Survey of Text Mining: Clustering, Classification, and Retrieval (ed. Berry M and
Castellanos M) 2nd edn Springer.
AlSumait L, Barbará D and Domeniconi C 2008 Online LDA: Adaptive topic model for
mining text streams with application on topic detection and tracking. Proceedings of
the IEEE International Conference on Data Mining.
AlSumait L, Barbará D and Domeniconi C 2009 The role of semantic history on online
generative topic modeling. Proceedings of the Workshop on Text Mining, held in con-
junction with the SIAM International Conference on Data Mining.
Andrzejewski D, Zhu X and Craven M 2009 Incorporating domain knowledge into topic
modeling via Dirichlet forest priors. Proceedings of the International Conference on
Machine Learning.
Blei D, Ng A and Jordan M 2003 Latent Dirichlet allocation. Journal of Machine Learning
Research 3, 993–1022.
Cristianini N, Shawe-Taylor J and Lodhi H 2002 Latent semantic kernels. Journal of
Intelligent Information Systems 18(2–3), 127–152.
Deerwester S, Dumais S, Furnas G, Landauer T and Harshman R 1990 Indexing by
latent semantic analysis. Journal of the American Society for Information Science 41(6),
391–407.
Griffiths T and Steyvers M 2004 Finding scientific topics. Proceedings of the National
Academy of Sciences, pp. 5228–5235.
Heinrich G 2005 Parameter Estimation for Text Analysis. Springer.
Hofmann T 1999 Probabilistic latent semantic indexing. Proceedings of the 15th Confer-
ence on Uncertainty in Artificial Intelligence.
Mimno D and McCallum A 2007 Organizing the OCA: Learning faceted subjects from a
library of digital books. Proceedings of the Joint Conference on Digital Libraries.
Minka T and Lafferty J 2002 Expectation-propagation for the generative aspect model.
Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence.
Papadimitriou C, Tamaki H, Raghavan P and Vempala S 2000 Latent semantic indexing:
A probabilistic analysis. Journal of Computer and System Sciences 61(2), 217–235.
Phan X, Nguyen L and Horiguchi S 2008 Learning to classify short and sparse text
and web with hidden topics from large-scale data collections. Proceedings of the
International Conference on World Wide Web.
Salton G 1983 Introduction to Modern Information Retrieval. McGraw-Hill.
Steyvers M and Griffiths T 2005 Probabilistic topic models. In Latent Semantic Analysis:
A Road to Meaning (ed. Landauer T, McNamara D, Dennis S and Kintsch W) Lawrence
Erlbaum Associates.
Story R 1996 An explanation of the effectiveness of latent semantic indexing by means of a
Bayesian regression model. Information Processing and Management 32(3), 329–344.
Sun Q, Li R, Luo D and Wu X 2008 Text segmentation with LDA-based Fisher kernels.
Proceedings of the Association for Computational Linguistics.
Wang X, McCallum A and Wei X 2007 Topical n-grams: Phrase and topic discovery,
with an application to information retrieval. Proceedings of the 7th IEEE International
Conference on Data Mining.
Wei X and Croft B 2006 LDA-based document models for ad-hoc retrieval. Proceedings
of the Conference on Research and Development in Information Retrieval.
White C 2005 Consolidating, accessing and analyzing unstructured data.
Wikipedia 2009 Wikipedia: The free encyclopedia.
Index
adaptive threshold setting, 132

centroid, 82, 83
chat rooms, 151
chi-square statistic, 173
clustering
    constrained, 81
    pairwise constrained, 81
confusion matrix, 99
constrained optimization, 83
constraints
    cannot-link, 82, 84, 87, 93, 102
    instance-level, 81
    must-link, 82, 87, 94, 102
    must–must-link, 99
correlation function, 178
cybercrime, 161
cyberbullying, 150
cyberpredators, 150

dataset
    CISI collection, 99
    Cranfield collection, 99
    Medlars collection, 99
    MPQA corpus, 15
    NIPS, 198
    Reuters, 198
distance
    Bregman, 89
    Kullback–Leibler, 82, 89
    reversed Bregman, 90
    squared Euclidean, 82
divergence
    Bregman, 89
    Kullback–Leibler, 82
    reverse Bregman, 81

event types, 167
external prior-knowledge, 193

FeatureLens, 113
first variation, 83
function
    closed, proper, convex, 89
    distance-like, 83
FutureLens, 113

gain ratio, 66
Gaussian algorithm, 173
Gaussian distribution, 133

history flow, 110
hypothesis test, 172

information gain, 66
information retrieval (IR), 22, 95
isolating language, 23, 32

k-means, 81
    batch, 83
    constrained, 101
    incremental, 82, 83, 86, 100
    quadratic, 82, 95
    quadratic batch, 82
    spherical, 81
    spherical batch, 95, 96
    spherical constrained, 101
    spherical incremental, 97, 99
    spherical with constraints, 82
keyword, 3, 170
    applications, 3, 4, 15
    extraction methods, 4, 5
    keyphrase, 3, 5
    metrics, 17, 18
    variants, 16

latent Dirichlet allocation, 186
    generative process, 187
    inference, 187
    Gibbs sampling, 188
    online LDA, 189
latent morpho-semantic analysis (LMSA), 32
    with term alignments (LMSATA), 33
latent semantic analysis, 185
    probabilistic, 185
latent semantic analysis (LSA), 22
    with term alignments (LSATA), 30
latent semantic indexing, 72
LDA, see latent Dirichlet allocation, 186
log-entropy term weighting, 26
LSI, see latent semantic indexing, 72
luring communication, 154

mean
    arithmetic, 90
misbehavior detection, 152
multi-parallel corpus, 21
multilingual document clustering, 21
multilingual LSA, 25
multilingual precision at five documents (MP5), 25

NMF, see nonnegative matrix factorization, 60
NMF-BCC, 74
NMF-LSI, 72
nonnegative matrix factorization, 60
    alternating least squares algorithm, 62
    classification, 70
    initialization, 65
    multiplicative update algorithm, 62
novelty mining, 130
NP-hard problem, 83

online
    communities, 150
    luring, 154
    victimization, 149

pairwise mutual information (PMI), 30
PARAFAC2, 28
partition, 83
    nextFV (Π), 84
    quality, 83
PDDP, 99, 100
penalty, 82, 84, 87, 88, 93, 96, 99, 102
power method, 30
precision at one document (P1), 24
predatory behavior, 159

RAKE, 5
    algorithm, 6, 7
    evaluation, 9, 10, 15
    input parameters, 6
RSS, 169

SEASR, 111
semantic embedding
    data driven, 194
    external, 193
sentiment tracking, 111
sexual
    exploitation, 149
    predation, 150
singular value decomposition, 22
Sinkhorn balancing, 30
smoka, 81, 92
    constrained, 101
SMT, see statistical machine translation, 30
social networking, 149
statistical machine translation, 30
stoplist, 6, 10
    generation, 11
stopwords, 5
SVD, see singular value decomposition, 22
swine flu, 178
synthetic language, 23, 32

tag
    cloud, 108
    crowd, 108
temporal profiles, 171
tensors, 27
text stream, 169
TextArc, 111
topicality, 180
transitive closure, 88, 94, 98
TREC novelty track data, 138
Tucker1, 27

UIMA, 111

vector space model, 22, 184
vocabulary
    indexing, 4
    term frequency, 12
    term selection, 4
VSM, see vector space model, 22

Wikipedia, see external prior-knowledge, 193
Wordle, 108

Text Mining: Applications and Theory edited by Michael W. Berry and Jacob Kogan
2010, John Wiley & Sons, Ltd