From: KDD-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Discovering Trends in Text Databases

Brian Lent* and Rakesh Agrawal and Ramakrishnan Srikant

IBM Almaden Research Center
San Jose, California 95120, U.S.A.
[email protected], {ragrawal,srikant}@almaden.ibm.com

Introduction

We address the problem of discovering trends in text databases. Trends can be used, for example, to discover that a company is shifting interests from one domain to another. We are given a database D of documents. Each document consists of one or more text fields and a timestamp. The unit of text is a word and a phrase is a list of words. (We defer the discussion of more complex structures till the "Methodology" section.) Associated with each phrase is a history of the frequency of occurrence of the phrase, obtained by partitioning the documents based upon their timestamps. The frequency of occurrence in a particular time period is the number of documents that contain the phrase. (Other measures of frequency are possible, e.g. counting each occurrence of the phrase in a document.) A trend is a specific subsequence of the history of a phrase that satisfies the user's query over the histories. For example, the user may specify a "spike" query to find those phrases whose frequency of occurrence increased and then decreased.

Approach

Our system uses several data mining techniques in novel ways and demonstrates a method for visualizing the trends. We have two major mining components: phrase identification using sequential patterns mining (Srikant & Agrawal 1996) and trend identification using shape queries (Agrawal et al. 1995). We begin by cleansing and parsing the input data, and separating the documents based on their timestamps. We then assign a transaction ID to each word of every document, treating the words as items in the data mining algorithms (the details of this assignment are discussed in the "Methodology" section). This transformed data is then mined for dominant words and phrases, and the results saved. The user's query is translated into a shape query and this query is then executed over the mined data, yielding the desired trends. The final step in the process is to visualize the results. We give experiences from applying this system to the IBM Patent Server, a database of U.S. patents.

Related Work

An approach to discovering interesting patterns and trend analysis on text documents was presented in (Feldman & Dagan 1995). The text is first annotated with a set of concepts, organized as a hierarchy. Treating the concept hierarchy as a distribution of probabilities, they identify several model distributions to which a given concept hierarchy can be compared. Interesting concepts are those that differ from their model distribution. Analyzing trends involves the comparison of concept distributions using old data with distributions using new data.

In (Feldman & Hirsh 1996), the authors find associations between the keywords or concepts labeling the documents using background knowledge about relationships among the keywords. The purpose of the knowledge base is to supply unary or binary relations amongst the keywords labeling the documents.

Using words and phrases to describe themes and concepts in text documents has been studied by the information retrieval community. The work on Latent Semantic Indexing (LSI) (Deerwester et al. 1990) describes a mathematical model of relating word associations as weighted vectors that represent "concepts" found within the documents. Using LSI, a query can retrieve a document even when the two share no words, provided they share a similar concept. However, building the model takes O(tk^4 d) time, where t is the number of terms or words, k is the number of major concepts in the model (typically from 100 to 300), and d is the number of documents.

The use of phrases to build more advanced queries is discussed in (Croft, Turtle, & Lewis 1991). In this work, the authors identify phrases as concepts and as relationships between concepts. The usefulness of phrases is shown in (Lewis & Croft 1990) where the quality of text categorization is improved by us-

* Current address: Department of Computer Science, Stanford University. Continuing support has been provided by a graduate fellowship of the Department of Defense Office of Naval Research.
Copyright © 1997, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
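The phrase history defined in the Introduction (for each time period, the number of documents that contain the phrase) can be illustrated with a small Python sketch. The function names `contains_phrase` and `phrase_history`, the `(period, text)` document layout, and the use of a calendar year as the time period are assumptions made for illustration and do not come from the paper. The sketch matches phrases only as contiguous word sequences, so it ignores the word gaps that PatentMiner lets the user configure.

```python
def contains_phrase(words, phrase):
    """Return True if `phrase` (a list of words) occurs contiguously in `words`."""
    n = len(phrase)
    return any(words[i:i + n] == phrase for i in range(len(words) - n + 1))


def phrase_history(documents, phrase):
    """Map each time period to the number of documents containing `phrase`.

    `documents` is an iterable of (period, text) pairs. The period is
    assumed here to be the year taken from the document timestamp.
    """
    target = phrase.lower().split()
    history = {}
    for period, text in documents:
        history.setdefault(period, 0)  # periods with no match still appear
        if contains_phrase(text.lower().split(), target):
            history[period] += 1       # count documents, not occurrences
    return dict(sorted(history.items()))


# Hypothetical toy corpus, not data from the paper.
docs = [
    (1990, "a nuclear reactor design"),
    (1990, "fuel rod assembly"),
    (1991, "nuclear reactor core cooling"),
    (1991, "cooling a nuclear reactor"),
    (1992, "fuel rod cladding"),
]
history = phrase_history(docs, "nuclear reactor")
```

Counting each document once per period follows the frequency measure stated in the Introduction; counting every occurrence instead would be the alternative measure the authors mention.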

Figure 2: The PatentMiner system (diagram stages: DB2 client and Patent Server, accessed through an SQL query; data cleansing, parsing, and timestamp generation; sequential patterns, with min and max word gap; shape query engine, with predefined or user-created shape; trend visualization)

Identifying trends

By maintaining a support history for each supported k-phrase we can query the set of histories to select those phrases that have some specific shape in their histories. We propose the use of a shape definition language called SDL (Agrawal et al. 1995) to define the users' queries and retrieve the associated objects. There are several benefits to using a shape query language such as SDL to identify trends. First, the language is small, yet powerful, allowing a rich combination of operators. Second, it is a fairly straightforward task to rewrite a shape the user may define graphically, as is done in our PatentMiner system described in the "Experience" section, into the SDL set of operators. Third, SDL allows a "blurry" match where the user may care about the overall shape but does not care about specific details of each interval of the shape. Finally, SDL can be implemented efficiently since most of the operators are designed to be greedy to reduce non-determinism, which in turn reduces the amount of back-tracking that must be done across the histories.

Trends are simply those k-phrases selected by the shape query with the additional information of the time periods in which the trend is supported.

Experience: The PatentMiner System

Figure 2 shows a high-level view of our system to compute and visualize the word-phrase trends, which we now describe.

The PatentMiner prototype is a system we developed to discover trends among patents granted in different categories. The system is connected to an IBM DB2 database containing all granted U.S. patents, and patent data is retrieved using a dynamically generated SQL query based upon the selection criteria specified by the user. The system allows selection of patents in a specific classification or by keywords appearing in the title or abstract of the patents. Once retrieved, a histogram displaying the number of patents for each year is shown and the user may then specify a range of years upon which the system will focus.

Next, the user can choose the maximum and minimum gap desired between words in the phrases to be mined, as well as the minimum support all phrases must meet, for each time period between the start and ending years. Finally, a shape matching the desired trend (such as "recent upwards trend", "recent spike in usage", "downwards trend", and "resurgence of usage") is selected and the mining process begins. Alternatively, users can define their own shape by using a visual shape editor. Once the phrases matching the shape query are found, they are presented in a visual display.

Once a shape query has been defined, either internally or using the graphical editor, a rewriting of the query into SDL (Agrawal et al. 1995) is performed. Given the shape query in Figure 3, the rewriting of this query into SDL is shown in Figure 4. The rewriting happens as follows. For every partitioned time period of documents there is a corresponding interval in the shape query graph that has associated with it beginning and ending relative levels of support. In the case where every interval has a specific beginning or ending value, the rewriting into SDL is straightforward in that the slope of each interval determines the basic shape query that is used for that interval. For example, intervals with a positive slope translate to an "up" shape of length one, while intervals with a negative slope translate to a "down" shape of length one. The concatenation of all of these base shapes then defines our SDL query. In the case where only some of the intervals of a shape query have been specified, as in intervals three to six in Figure 3, then the same concatenation occurs, but the resulting SDL query shape can have any support value match the unspecified intervals.

Figure 3: A recent uptrend shape query

    (shape strongUp ( ) (comp Bigup Bigup Bigup ))
    (files "list-hist" )
    (query (window 1) ((strongup) (support 2 end)))
    (quit)

Figure 4: Recent uptrend query

We present some of the trends our system found from U.S. patents classified in the category "Induced Nuclear Reactions: Processes, Systems, and Elements" in Figure 5. These example phrases matched a shape query that represented an increasing trend of their usage in recent years. Without knowing a priori the kind of patents filed in this category, we are able to look at the trends and determine some of the popular topics of recently granted patents.

Figure 5: Some recent upwards trends (x-axis: time periods)

A potential problem with this system is that the number of phrases that match a query can be quite large. There are two types of pruning we use to reduce the number of phrases to a more reasonable number. The first form of pruning is to drop non-maximal phrases when their support is near that of a maximal phrase that is a superset. The second form of pruning involves the use of a syntactic hierarchical ordering of phrases. The intuition is that if phrase X is a syntactic sub-phrase of phrase Y, then the concept corresponding to X is usually a generalization of the concept corresponding to Y. Users initially see only the most general concepts, and can explore lower-level concepts by selecting some of the phrases.

Conclusion

We presented a system for identifying trends in text documents collected over a period of time. Our system uses several data mining techniques such as sequential patterns and shape queries in novel ways and demonstrates a trend visualization method. We described our experience in applying this system to the IBM Patent Server, a database of U.S. patents. Scaleup experiments show that our system, PatentMiner, scales approximately linearly with the number of documents.

Acknowledgments We are grateful to the IBM Almaden Patent Server team, especially Laura Anderson, Steve Boyer and Tom Griffin for their ongoing contributions and suggestions.

References

Agrawal, R.; Psaila, G.; Wimmers, E.; and Zait, M. 1995. Querying shapes of histories. In Proceedings of the 21st International Conference on Very Large Databases.

Croft, W.; Turtle, H.; and Lewis, D. 1991. The use of phrases and structured queries in information retrieval. In 14th International ACM SIGIR Conference on Research and Development in Information Retrieval, 32-45.

Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; and Harshman, R. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6):391-407.

Feldman, R., and Dagan, I. 1995. Knowledge discovery in textual databases (KDT). In Proceedings of the 1st International Conference on Knowledge Discovery in Databases and Data Mining.

Feldman, R., and Hirsh, H. 1996. Mining associations in text in the presence of background knowledge. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining.

Gay, L., and Croft, W. 1990. Interpreting nominal compounds for information retrieval. Information Processing and Management 26(1):21-38.

Lewis, D., and Croft, W. 1990. Term clustering of syntactic phrases. In 13th International ACM SIGIR Conference on Research and Development in Information Retrieval, 385-404.

Renouf, A. 1993a. Making sense of text: automated approaches to meaning extraction. 17th International Online Information Meeting Proceedings 77-86.

Renouf, A. 1993b. What the linguist has to say to the information scientist. Journal of Document and Text Management 1(2):173-190.

Salton, G.; Allan, J.; Buckley, C.; and Singhal, A. 1994. Automatic analysis, theme generation, and summarization of machine readable texts. Science 264(5164):1421-1426.

Salton, G.; Singhal, A.; Buckley, C.; and Mitra, M. 1996. Automatic text decomposition using text segments and text themes. In Proceedings of Hypertext, 53-65.

Srikant, R., and Agrawal, R. 1996. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology (EDBT).
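To make the idea of a shape query over phrase histories more concrete, the following Python sketch labels each step of a history and then looks for three consecutive large increases, loosely mirroring the `strongUp` shape in the recent uptrend query. It is not SDL and does not reproduce SDL's operators, windows, or support clause. The label names other than `Bigup`, the threshold `big`, and the function names `classify` and `find_shape` are assumptions chosen for illustration, since the paper does not give the alphabet definitions.

```python
def classify(prev, curr, big):
    """Label the transition between two consecutive support values.

    `big` is an assumed threshold separating a large change from a
    small one; the paper does not specify how Bigup is defined.
    """
    delta = curr - prev
    if delta >= big:
        return "Bigup"
    if delta > 0:
        return "up"
    if delta <= -big:
        return "Bigdown"
    if delta < 0:
        return "down"
    return "stable"


def find_shape(history, shape, big):
    """Return the start indices where the transition labels equal `shape`."""
    labels = [classify(a, b, big) for a, b in zip(history, history[1:])]
    n = len(shape)
    return [i for i in range(len(labels) - n + 1) if labels[i:i + n] == shape]


# Three large increases in a row, as in the strongUp definition.
STRONG_UP = ["Bigup", "Bigup", "Bigup"]

# Hypothetical yearly document counts for one phrase.
counts = [2, 2, 5, 9, 14]
matches = find_shape(counts, STRONG_UP, big=3)
```

A simple sliding window is enough here because the shape has a fixed length. Shapes of variable length, which SDL supports, would need the greedy matching strategy the paper attributes to SDL.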

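The first form of pruning described in the paper, dropping a non-maximal phrase when its support is near that of a maximal phrase containing it, can be sketched in Python as follows. Several details are assumptions, because the paper does not define them: "superset" is read as contiguous containment, "near" is read as a relative difference of at most `tolerance`, and the names `is_subphrase` and `prune_non_maximal` are hypothetical.

```python
def is_subphrase(short, long):
    """True if `short` is a proper, contiguous sub-phrase of `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def prune_non_maximal(support, tolerance=0.1):
    """Drop phrases whose support is close to that of a maximal superset.

    `support` maps phrases (tuples of words) to support counts.
    `tolerance` is an assumed relative threshold for "near".
    """
    phrases = list(support)
    # A phrase is maximal if no other phrase in the set contains it.
    maximal = [p for p in phrases if not any(is_subphrase(p, q) for q in phrases)]
    kept = {}
    for p in phrases:
        redundant = any(
            is_subphrase(p, m)
            and support[p] - support[m] <= tolerance * support[p]
            for m in maximal
        )
        if not redundant:
            kept[p] = support[p]
    return kept


# Hypothetical supports, not results from the paper.
support = {
    ("nuclear",): 50,
    ("nuclear", "reactor"): 21,
    ("nuclear", "reactor", "core"): 20,
}
pruned = prune_non_maximal(support)
```

In this toy input, ("nuclear", "reactor") is dropped because almost every document that supports it also supports the longer phrase, while ("nuclear",) is kept because its support is much higher than that of the maximal phrase.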
