From: KDD-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Discovering Trends in Text Databases
Brian Lent* and Rakesh Agrawal and Ramakrishnan Srikant
IBM Almaden Research Center
San Jose, California 95120, U.S.A.
[email protected], {ragrawal,srikant}@almaden.ibm.com
Introduction
We address the problem of discovering trends in text databases. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. We are given a database D of documents. Each document consists of one or more text fields and a timestamp. The unit of text is a word and a phrase is a list of words. (We defer the discussion of more complex structures till the “Methodology” section.) Associated with each phrase is a history of the frequency of occurrence of the phrase, obtained by partitioning the documents based upon their timestamps. The frequency of occurrence in a particular time period is the number of documents that contain the phrase. (Other measures of frequency are possible, e.g. counting each occurrence of the phrase in a document.) A trend is a specific subsequence of the history of a phrase that satisfies the user’s query over the histories. For example, the user may specify a “spike” query to find those phrases whose frequency of occurrence increased and then decreased.

Approach
Our system uses several data mining techniques in novel ways and demonstrates a method for visualizing the trends. We have two major mining components: phrase identification using sequential patterns mining (Srikant & Agrawal 1996) and trend identification using shape queries (Agrawal et al. 1995). We begin by cleansing and parsing the input data, and separating the documents based on their timestamps. We then assign a transaction ID to each word of every document, treating the words as items in the data mining algorithms (the details of this assignment are discussed in the “Methodology” section). This transformed data is then mined for dominant words and phrases, and the results saved. The user’s query is translated into a shape query and this query is then executed over the mined data, yielding the desired trends. The final step in the process is to visualize the results. We give experiences from applying this system to the IBM Patent Server, a database of U.S. patents.

Related Work
An approach to discovering interesting patterns and trend analysis on text documents was presented in (Feldman & Dagan 1995). The text is first annotated with a set of concepts, organized as a hierarchy. Treating the concept hierarchy as a distribution of probabilities, they identify several model distributions to which a given concept hierarchy can be compared. Interesting concepts are those that differ from their model distribution. Analyzing trends involves the comparison of concept distributions using old data with distributions using new data.
In (Feldman & Hirsh 1996), the authors find associations between the keywords or concepts labeling the documents using background knowledge about relationships among the keywords. The purpose of the knowledge base is to supply unary or binary relations amongst the keywords labeling the documents.
Using words and phrases to describe themes and concepts in text documents has been studied by the information retrieval community. The work on Latent Semantic Indexing (LSI) (Deerwester et al. 1990) describes a mathematical model of relating word associations as weighted vectors that represent “concepts” found within the documents. Using LSI, a query can retrieve a document even when they share no words, but do share a similar concept. However, building the model takes O(t k^4 d) time, where t is the number of terms or words, k is the number of major concepts in the model (typically from 100 to 300), and d is the number of documents.
The use of phrases to build more advanced queries is discussed in (Croft, Turtle, & Lewis 1991). In this work, the authors identify phrases as concepts and as relationships between concepts. The usefulness of phrases is shown in (Lewis & Croft 1990) where the quality of text categorization is improved by us-

* Current address: Department of Computer Science, Stanford University. Continuing support has been provided by a graduate fellowship of the Department of Defense Office of Naval Research.
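The phrase histories and the “spike” query described in the Introduction can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes that documents are (period, text) pairs, that a phrase must occur as contiguous words (the system described here also allows gaps between words), and that a spike covers the whole history rather than a subsequence of it. The function names and the sample data are hypothetical.

```python
from collections import Counter


def contains_phrase(words, phrase):
    """Return True if `phrase` (a list of words) occurs contiguously in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def phrase_history(documents, phrase, periods):
    """Count, for each time period, the documents that contain `phrase`."""
    counts = Counter()
    for period, text in documents:
        if contains_phrase(text.lower().split(), phrase):
            counts[period] += 1
    return [counts[p] for p in periods]


def is_spike(history):
    """Return True if the history strictly rises to one interior peak, then strictly falls."""
    if len(history) < 3:
        return False
    peak = history.index(max(history))
    if peak == 0 or peak == len(history) - 1:
        return False
    rising = all(a < b for a, b in zip(history[:peak], history[1:peak + 1]))
    falling = all(a > b for a, b in zip(history[peak:], history[peak + 1:]))
    return rising and falling


# Hypothetical sample data: (time period, document text).
documents = [
    (1990, "laser fusion reactor"),
    (1991, "laser fusion"),
    (1991, "a laser fusion device"),
    (1992, "fusion laser"),
    (1992, "laser fusion"),
]
history = phrase_history(documents, ["laser", "fusion"], [1990, 1991, 1992])
```

Counting documents rather than occurrences follows the frequency measure defined in the Introduction; counting every occurrence would only change the body of `phrase_history`.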
Figure 2: The PatentMiner system
(Components shown: Patent Server; DB2 Client, driven by an SQL query; Data Cleansing and Parsing; Sequential Patterns, with min word gap and max word gap as inputs; Shape Query Engine, with a predefined or user-created shape as input; Trend Visualization.)

Identifying trends
By maintaining a support history for each supported k-phrase we can query the set of histories to select those phrases that have some specific shape in their histories. We propose the use of a shape definition language called SDL (Agrawal et al. 1995) to define the users’ queries and retrieve the associated objects. There are several benefits to using a shape query language such as SDL to identify trends. First, the language is small, yet powerful, allowing a rich combination of operators. Second, it is a fairly straightforward task to rewrite a shape the user may define graphically, as is done in our PatentMiner system described in the “Experience” section, into the SDL set of operators. Third, SDL allows a “blurry” match where the user may care about the overall shape but does not care about specific details of each interval of the shape. Finally, SDL allows itself to be implemented efficiently since most of the operators are designed to be greedy to reduce non-determinism, which in turn reduces the amount of back-tracking that must be done across the histories.
Trends are simply those k-phrases selected by the shape query with the additional information of the time periods in which the trend is supported.

Experience: The PatentMiner System
Figure 2 shows a high-level view of our system to compute and visualize the word-phrase trends, which we now describe.
The PatentMiner prototype is a system we developed to discover trends among patents granted in different categories. The system is connected to an IBM DB2 database containing all granted U.S. Patents and patent data is retrieved using a dynamically generated SQL query based upon the selection criteria specified by the user. The system allows selection of patents in a specific classification or by keywords appearing in the title or abstract of the patents. Once retrieved, a histogram displaying the number of patents for each year is shown and the user may then specify a range of years upon which the system will focus.
Next, the user can choose the maximum and minimum gap desired between words in the phrases to be mined, as well as the minimum support all phrases must meet, for each time period between the start and ending years. Finally, a shape matching the desired trend (such as “recent upwards trend”, “recent spike in usage”, “downwards trend”, and “resurgence of usage”) is selected and the mining process begins. Alternatively, users can define their own shape by using a visual shape editor. Once the phrases matching the shape query are found, they are presented in a visual display.
Once a shape query has been defined, either internally or using the graphical editor, a rewriting of the query into SDL (Agrawal et al. 1995) is performed. Given the shape query in Figure 3, the rewriting of this query into SDL is shown in Figure 4. The rewriting happens as follows. For every partitioned time period of documents there is a corresponding interval in the shape query graph that has associated with it beginning and ending relative levels of support. In the case where every interval has a specific beginning or ending value, the rewriting into SDL is straightforward in that the slope of each interval determines the basic shape query that is used for that interval. For example, intervals with a positive slope translate to an “Up” shape of length one, while intervals with a negative slope translate to a “down” shape of length one. The concatenation of all of these base shapes then defines our SDL query. In the case where only some of the intervals of a shape query have been specified, as in intervals three to six in Figure 3, then the same concatenation occurs but the resulting SDL shape can have any support value to match the unspecified intervals.

Figure 3: A recent uptrend shape query

(shape strongUp ( ) (comp Bigup Bigup Bigup ))
(files “list-hist” )
(query (window 1) ((strongup) (support 2 end)))
(quit)
Figure 4: Recent uptrend query

We present some of the trends our system found from U.S. Patents classified in the category “Induced Nuclear Reactions: Processes, Systems, and Elements” in Figure 5. These example phrases matched a shape query that represented an increasing trend of their usage in recent years. Without knowing a priori the kind of patents filed in this category, we are able to look at the trends and determine some of the popular topics of recently granted patents.

Figure 5: Some recent upwards trends (horizontal axis: Time Periods)

A potential problem with this system is that the number of phrases that match a query can be quite large. There are two types of pruning we use to reduce the number of phrases to a more reasonable number. The first form of pruning is to drop non-maximal phrases when their support is near that of a maximal phrase that is a superset. The second form of pruning involves the use of a syntactic hierarchical ordering of phrases. The intuition is that if phrase X is a syntactic sub-phrase of phrase Y, then the concept corresponding to X is usually a generalization of the concept corresponding to Y. Users initially see only the most general concepts, and can explore lower-level concepts by selecting some of the phrases.

Conclusion
We presented a system for identifying trends in text documents collected over a period of time. Our system uses several data mining techniques such as sequential patterns and shape queries in novel ways and demonstrates a trend visualization method. We described our experience in applying this system to the IBM Patent Server, a database of U.S. patents. Scaleup experiments show that our system, PatentMiner, scales approximately linearly with the number of documents.

Acknowledgments We are grateful to the IBM Almaden Patent Server team, especially Laura Anderson, Steve Boyer and Tom Griffin for their ongoing contributions and suggestions.

References
Agrawal, R.; Psaila, G.; Wimmers, E.; and Zait, M. 1995. Querying shapes of histories. In Proceedings of the 21st International Conference on Very Large Databases.
Croft, W.; Turtle, H.; and Lewis, D. 1991. The use of phrases and structured queries in information retrieval. In 14th International ACM SIGIR Conference on Research and Development in Information Retrieval, 32-45.
Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391-407.
Feldman, R., and Dagan, I. 1995. Knowledge discovery in textual databases (KDT). In Proceedings of the 1st International Conference on Knowledge Discovery in Databases and Data Mining.
Feldman, R., and Hirsh, H. 1996. Mining associations in text in the presence of background knowledge. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining.
Gay, L., and Croft, W. 1990. Interpreting nominal compounds for information retrieval. Information Processing and Management 26(1):21-38.
Lewis, D., and Croft, W. 1990. Term clustering of syntactic phrases. In 13th International ACM SIGIR Conference on Research and Development in Information Retrieval, 385-404.
Renouf, A. 1993a. Making sense of text: automated approaches to meaning extraction. 17th International Online Information Meeting Proceedings 77-86.
Renouf, A. 1993b. What the linguist has to say to the information scientist. Journal of Document and Text Management 1(2):173-190.
Salton, G.; Allan, J.; Buckley, C.; and Singhal, A. 1994. Automatic analysis, theme generation, and summarization of machine readable texts. Science 264(5164):1421-1426.
Salton, G.; Singhal, A.; Buckley, C.; and Mitra, M. 1996. Automatic text decomposition using text segments and text themes. In Proceedings of Hypertext, 53-65.
Srikant, R., and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT).
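As an illustration of the rewriting step described in the “Experience” section, the following Python sketch maps the intervals of a graphically drawn shape to one base shape each and checks a support history against the result. It is a sketch under stated assumptions, not SDL itself and not the PatentMiner code. The labels "Up", "Down", "Stable" and "any", the handling of a zero slope, and both function names are hypothetical, and an unspecified endpoint is represented by `None`.

```python
def rewrite_intervals(levels):
    """Map consecutive relative support levels to one base shape per interval.

    `levels` holds the level at each interval endpoint; None marks an
    endpoint the user left unspecified (an assumption of this sketch).
    """
    shapes = []
    for start, end in zip(levels, levels[1:]):
        if start is None or end is None:
            shapes.append("any")       # unspecified interval: matches anything
        elif end > start:
            shapes.append("Up")        # positive slope
        elif end < start:
            shapes.append("Down")      # negative slope
        else:
            shapes.append("Stable")    # zero slope (assumed, not stated in the text)
    return shapes


def matches(history, shapes):
    """Return True if each step of `history` agrees with the corresponding base shape."""
    if len(history) != len(shapes) + 1:
        return False
    for (a, b), shape in zip(zip(history, history[1:]), shapes):
        if shape == "Up" and not b > a:
            return False
        if shape == "Down" and not b < a:
            return False
        if shape == "Stable" and b != a:
            return False
    return True


# Hypothetical query: first two intervals unspecified, then two rising intervals.
recent_uptrend = rewrite_intervals([None, None, 0.2, 0.5, 0.9])
```

Concatenating the per-interval labels mirrors the concatenation of base shapes that defines the final query; a real rewriting would emit SDL operators instead of Python strings.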
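The first form of pruning, dropping non-maximal phrases whose support is near that of a maximal super-phrase, can be sketched in Python as follows. This is a hedged illustration rather than the system's actual procedure. It assumes that phrases are tuples of words, that a sub-phrase is a contiguous run of words inside a longer phrase, and that “near” means within a relative tolerance; the `tolerance` parameter, the function names and the sample supports are hypothetical.

```python
def is_subphrase(short, long):
    """Return True if `short` is a strictly shorter contiguous run of words in `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def prune_non_maximal(support, tolerance=0.1):
    """Drop non-maximal phrases whose support is close to that of a maximal super-phrase.

    `support` maps each phrase (a tuple of words) to its support count.
    """
    phrases = list(support)
    # A phrase is maximal if no other phrase in the set contains it.
    maximal = [p for p in phrases if not any(is_subphrase(p, q) for q in phrases)]
    kept = {}
    for p in phrases:
        redundant = any(
            is_subphrase(p, m) and support[p] - support[m] <= tolerance * support[p]
            for m in maximal
        )
        if not redundant:
            kept[p] = support[p]
    return kept


# Hypothetical supports for one time period.
support = {
    ("nuclear",): 100,
    ("nuclear", "reactor"): 40,
    ("nuclear", "reactor", "core"): 38,
    ("reactor", "core"): 39,
}
pruned = prune_non_maximal(support)
```

In this example the two-word phrases are dropped because the three-word phrase explains almost all of their support, while the single word "nuclear" is kept because it occurs far more often on its own.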