0% found this document useful (0 votes)

15 views78 pages

Probabilistic Topic Models

Uploaded by

mhc2023006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views78 pages

Probabilistic Topic Models

Uploaded by

mhc2023006

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 78

Probabilistic Topic Models for

ChengXiang Zhai
Text Mining ( 翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]

1
What Is Text Mining?

“The objective of Text Mining is to exploit information

contained in textual documents in various ways, including
…discovery of patterns and trends in data, associations
among entities, predictive rules, etc.” (Grobelnik et al.,
2001)

“Another way to view text data mining is as a process of

exploratory data analysis that leads to heretofore
unknown information, or to answers for questions for
which the answer is not currently known.” (Hearst, 1999)

(Slide from Rebecca Hwa’s “Intro to Text Mining”)

2
Two Different Views of Text Mining

• Data Mining View: Explore patterns in textual data

– Find latent topics
Shallow mining
– Find topical trends
– Find outliers and other hidden patterns
• Natural Language Processing View: Make
inferences based on partial understanding natural
language text
– Information extraction Deep mining
– Question answering

3
Applications of Text Mining

• Direct applications: Go beyond search to ﬁnd knowledge

– Question-driven (Bioinformatics, Business Intelligence, etc): We
have speciﬁc questions; how can we exploit data mining to answer
the questions?
– Data-driven (WWW, literature, email, customer reviews, etc): We
have a lot of data; what can we do with it?
• Indirect applications
– Assist information access (e.g., discover latent topics to better
summarize search results)
– Assist information organization (e.g., discover hidden structures)

4
Text Mining Methods
• Data Mining Style: View text as high dimensional data
– Frequent pattern ﬁnding
– Association analysis
– Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
– Topic extraction
– Exploit term weighting and text similarity measures
• Natural Language Processing Style: Information Extraction
– Entity extraction
– Relation extraction
– Sentiment analysis
– Question answering
• Machine Learning Style: Unsupervised or semi-supervised learning
– Mixture models
– Dimension reduction
Topic of this lecture

5
Outline
• The Basic Topic Models:
– Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
– Latent Dirichlet Allocation (LDA) [Blei et al. 02]
• Extensions
– Contextual Probabilistic Latent Semantic Analysis (CPLSA)
[Mei & Zhai 06]

6
Basic Topic Model: PLSA

7
PLSA: Motivation
What did people say in their blog articles about “Hurricane Katrina

Query = “Hurricane Katrina”

Results:

8
Probabilistic Latent Semantic
Analysis/Indexing (PLSA/PLSI) [Hofmann 99]

• Mix k multinomial distributions to generate a document

• Each document has a potentially different set of mixing
weights which captures the topic coverage
• When generating words in a document, each word may be
generated using a DIFFERENT multinomial distribution (this
is in contrast with the document clustering model where,
once a multinomial distribution is chosen, all the words in a
document would be generated using the same model)
• We may add a background distribution to “attract”
background words

9
PLSA as a Mixture Model
k

pd ( w)   B p ( w |  B )  (1   B )   d , j p ( w |  j )
j 1

log p ( d )   c ( w , d ) log [ B p ( w |  B )  (1  B )   d , j p ( w |  j ) ]
wV j 1

Document d
warning ?
Topic 1 0.3 system? d,1
1 “Generating” word w
0.2.. in doc d in the collection
aid 0.1
? 2
Topic 2 donation ? d,2 1 - B
0.05 ?
support 0.02 d, k W
… k
..
statistics 0.2
?
loss 0.1
? B
Topic k dead 0.05
? .. B
is 0.05
?
Background B ?
the 0.04 Parameters:
a 0.03
? .. B=noise-level (manually set)
’s and ’s are estimated with Maximum Likelihood
10
How to Estimate j: EM Algorithm

the 0.2
Known a 0.1
Background we 0.01 Observed Doc(s)
p(w | B) to 0.02
…

Unknown … Suppose,
topic model text =? we know
mining =? ML
p(w|1)=? association =? the identity Estimator
word =? of each
“Text mining” …
word ...
…
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =?
“information document =?
retrieval” …

11
How the Algorithm Works
c(w,d)(1 - p(zd,w = B))p(zd,w=j)
c(w,d)p(zd,w = B) πd1,1 πd1,2
c(w, d) ( P(θ1|d1) ) ( P(θ2|d1) )
aid 7
d1 price 5
Initial value
oil 6
πd2,1 πd2,2
aid 8 ( P(θ1|d2) ) ( P(θ2|d2) )
d2 price 7
oil 5 Initial value

P(w| θ) Topic 1 Topic 2

Iteration
Initializing
Iteration
Iteration
1: E2:
2: 1:
πStep:
d,3,
M
j and
4,
Step:
split
5,P(w|
…re-
word
θj)
aid counts
estimateUntil
with πconverging
with
random
d,different
j and values
P(w|topics
θj) by
price Initial value adding(byand
computing
normalizing z’ s) the12
splitted word counts
oil

12
Parameter Estimation
E-Step:
Word w in doc d is generated Application of Bayes rule
- from cluster j
- from background

M-Step:
Re-estimate
- mixing weights
- cluster LM

Sum over all docs Fractional counts contributing to

- using cluster j in generating d
(in multiple collections)
- generating w from cluster j
m = 1 if one collection
13
PLSA with Prior Knowledge

• There are different ways of choosing aspects (topics)

– Google = Google News + Google Map + Google scholar, …
– Google = Google US + Google France + Google China, …

• Users have some domain knowledge in mind, e.g.,

– We expect to see “retrieval models” as a topic in IR.
– We want to show the aspects of “history” and “statistics” for Youtube

• A ﬂexible way to incorporate such knowledge as priors of

PLSA model
• In Bayesian, it’s your “belief” on the topic distributions
14
14
Adding Prior
*  arg max p (  ) p ( Data |  )

Most likely 
Document d
warning
Topic 1 0.3 system d,1
1 “Generating” word w
0.2.. in doc d in the collection
aid 0.1 2
Topic 2 donation d,2 1 - B
0.05
support 0.02 d, k W
… k
..
statistics 0.2
loss 0.1 B
Topic k dead 0.05 .. B
is 0.05
Background B the 0.04 Parameters:
a 0.03 .. B=noise-level (manually set)
’s and ’s are estimated with Maximum Likelihood
15
15
Adding Prior as Pseudo Counts
Observed Doc(s)
the 0.2
Known a 0.1
Background we 0.01
p(w | B) to 0.02
…
MAP
Unknown … Suppose, Estimator
topic model text =? we know
mining =?
p(w|1)=? association =? the identity
word =? of each
“Text mining” …
word ... Pseudo Doc
…
Unknown …
topic model information =?
p(w|2)=? retrieval =?
query =? Size = μ
document =? text
“information
retrieval” …
16
mining
16
Maximum A Posterior (MAP) Estimation

Pseudo counts of w from prior ’

+p(w|’j)
+

Sum of all pseudo counts

What if =0? What if =+?

17
17
Basic Topic Model: LDA

The following slides about LDA are taken from Michael C. Mozer’s course lecture
http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/

18
LDA: Motivation
– “Documents have no generative probabilistic semantics”
•i.e., document is just a symbol
– Model has many parameters
•linear in number of documents
•need heuristic methods to prevent overﬁtting
– Cannot generalize to new documents
Unigram Model

p(w )   p ( wn )
n 1
Mixture of Unigrams

p(w )   p ( z )  p ( wn | z)
z n 1
Topic Model / Probabilistic LSI

p ( d , wn )  p(d )  p(w n
| z) p( z | d )

•d is a localist representation of (trained) documents

•LDA provides a distributed representation

LDA
•Vocabulary of |V| words
•Document is a collection of words from vocabulary.
•N words in document
•w = (w1, ..., wN)

•Latent topics
•random variable z, with values 1, ..., k

•Like topic model, document is generated by sampling

a topic from a mixture and then sampling a word from a
mixture.
•But topic model assumes a ﬁxed mixture of topics (multinomial
distribution) for each document.
•LDA assumes a random mixture of topics (Dirichlet distribution) for each
topic.
Generative Model

•“Plates” indicate looping structure

•Outer plate replicated for each document
•Inner plate replicated for each word
•Same conditional distributions apply for each replicate

•Document probability
Fancier Version

( ik1  i )  1
p (  )  k 1  k
1
 k 1

i 1 (  i )
Inference

p ( , z | w ,  ,  )
p (  , z , w | ,  )
 p ( w | ,  )

p ( , z , w ,  )  p (  )  p( z
n 1
n
) p(w n
zn ,  )

 N  k
p ( w  ,  )   p (   )   p ( z n  ) p ( w n z n ,  ) d 
 n 1 z 
 n 
Inference
•In general, this formula is intractable:
 N  k
p ( w  ,  )   p (   )   p ( z n  ) p ( w n z n ,  ) d 
 n 1 z 
 n 

•Expanded version: 1 if wn is the j'th vocab word

( i  i )  k  1  N k V 
|,  )      i   ( i  ij )
 d 
j
w
p(w i n


i ( i )  i 1  n 1 i 1 j 1 
Variational Approximation
•Computing log likelihood and introducing Jensen's inequality:
log(E[x]) >= E[log(x)]

•Find variational distribution q such that the above equation is

computable.
– q parameterized by γ and φn
– Maximize bound with respect to γ and φn to obtain best approximation to p(w
| α, β)
– Lead to variational EM algorithm
•Sampling algorithms (e.g., Gibbs sampling) are also common
Data Sets

C. Elegans Community abstracts

 5,225abstracts
 28,414 unique terms
TREC AP corpus (subset)
 16,333 newswire articles
 23,075 unique terms

Held-out data – 10%

Removed terms
50 stop words, words appearing once
C. Elegans

Note: fold in hack for pLSI to allow it to handle novel documents.

Involves refitting p(z|dnew) parameters -> sort of a cheat
AP
Summary: PLSA vs. LDA
• LDA adds a Dirichlet distribution on top of PLSA to
regularize the model
• Estimation of LDA is more complicated than PLSA
• LDA is a generative model, while PLSA isn’t
• PLSA is more likely to over-fit the data than LDA
• Which one to use?
– If you need generalization capacity, LDA
– If you want to mine topics from a collection, PLSA may be
better (we want overfitting!)

32
Extension of PLSA:
Contextual Probabilistic Latent
Semantic Analysis (CPLSA)

33
A General Introduction to EM

Data: X (observed) + H(hidden) Parameter: 

“Incomplete” likelihood: L( )= log p(X| )

“Complete” likelihood: Lc( )= log p(X,H| )
EM tries to iteratively maximize the incomplete likelihood:

Starting with an initial guess (0),

1. E-step: compute the expectation of the complete likelihood

Q ( ; 
( n 1)
)  E ( n 1 ) [ L c ( ) | X ]   p ( H  hi | X ,
( n 1)
) lo g P ( X , hi )
hi

2. M-step: compute (n) by maximizing the Q-function

 ( n )  arg m ax  Q ( ;  ( n 1) )  arg m ax   p ( H  hi | X ,  ( n 1) ) lo g P ( X , hi )
hi

34
Convergence Guarantee

Goal: maximizing “Incomplete” likelihood: L( )= log p(X| )

I.e., choosing (n), so that L((n))-L((n-1))0

Note that, since p(X,H| ) =p(H|X, ) P(X| ) , L() =Lc() -log p(H|X, )
L((n))-L((n-1)) = Lc((n))-Lc( (n-1))+log [p(H|X,  (n-1) )/p(H|X, (n))]

Taking expectation w.r.t. p(H|X, (n-1)),

L((n))-L((n-1)) = Q((n);  (n-1))-Q( (n-1);  (n-1)) + D(p(H|X,  (n-1))||p(H|X,  (n)))

Doesn’t contain H

EM chooses (n) to maximize Q KL-divergence, always non-negative

Therefore, L((n))  L((n-1))!

35
Another way of looking at EM

Likelihood p(X| )
L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) ) + D(p(H|X,  (n-1) )||p(H|X,  ))

L((n-1)) + Q(; (n-1)) -Q( (n-1);  (n-1) )

next guess
current guess

Lower bound
(Q function)

E-step = computing the lower bound
M-step = maximizing the lower bound
36
Why Contextual PLSA?

37
Motivating Example:
Comparing Product Reviews
IBM Laptop APPLE Laptop DELL Laptop
Reviews Reviews Reviews

Common Themes “IBM” specific “APPLE” specific “DELL” specific

Battery Life Long, 4-3 hrs Medium, 3-2 hrs Short, 2-1 hrs

Hard disk Large, 80-100 GB Small, 5-10 GB Medium, 20-50 GB

Speed Slow, 100-200 Mhz Very Fast, 3-4 Ghz Moderate, 1-2 Ghz

Unsupervised discovery of common topics and their variations

38
Motivating Example:
Comparing News about Similar Topics
Vietnam War Afghan War Iraq War

Common Themes “Vietnam” specific “Afghan” specific “Iraq” specific

United nations … … …
Death of people … … …
… … … …

Unsupervised discovery of common topics and their variations

39
Motivating Example:
Discovering Topical Trends in Literature
Theme Strength

Time
1980 1990 1998 2003
TF-IDF Retrieval Language Model
IR Applications Text Categorization

Unsupervised discovery of topics and their temporal variations

40
Motivating Example:
Analyzing Spatial Topic Patterns

• How do blog writers in different states respond to

topics such as “oil price increase during Hurricane
Karina”?
• Unsupervised discovery of topics and their variations
in different locations
41
Motivating Example:
Sentiment Summary

Unsupervised/Semi-supervised discovery of topics and

different sentiments of the topics

42
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a uniﬁed approach
?
• How can we bring human into the loop?

43
Contextual Text Mining
• Given collections of text with contextual information (meta-
data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
– Summarizing search results
– Federation of text information
– Opinion analysis
– Social network analysis
– Business intelligence
– ..

44
Context Features of Text (Meta-data)

Weblog Article
communities

Author

source

Time Location
Author’s Occupation

45
Context = Partitioning of Text

papers written
1998 in 1998

1999 Papers about Web

…… ……

2005

2006 papers written by

authors in US

WWW SIGIR ACL KDD SIGMOD

46
Themes/Topics

Theme 1
government [ Criticism of government response to the
0.3 hurricane primarily consisted of criticism of its
response 0.2..
response to the approach of the storm and its
donate 0.1
aftermath, specifically in the delayed response ] to
Theme 2 relief 0.05
help 0.02 .. the [ flooding of New Orleans. … 80% of the 1.3
… million residents of the greater New Orleans
metropolitan area evacuated ] …[ Over seventy
city 0.2
countries pledged monetary donations or other
Theme k new 0.1 assistance]. …
orleans 0.05
..
Background B
Is 0.05
the 0.04 • Uses of themes:
a 0.03 .. – Summarize topics/subtopics
– Navigate in a document space
– Retrieve documents
– Segment documents
– …
47
View of Themes:
Context-Specific Version of Views
Context: After 1998 (Language models)

vector
space
TF-IDF
Okapi Theme 2: vector
retrieve LSI
Theme 1: feedback Rocchio
model retrieval
judge
Feedback weighting
relevance feedback
Retrieval documen expansion term
Model t query pseudo
query

language mixture
model model
smoothing estimate
query EM
generation feedback
pseudo

Context: Before 1998 (Traditional models)

48
Coverage of Themes:
Distribution over Themes
Oil Price
Government Criticism of government response to the
Response hurricane primarily consisted of criticism of its
Aid and response to … The total shut-in oil production
donation from the Gulf of Mexico … approximately 24% of
Background
the annual production and the shut-in gas
production … Over seventy countries pledged
Context: Texas monetary donations or other assistance. …

Oil Price
Government
Response
• Theme coverage can depend on context
Aid and
donation
Background

Context: Louisiana

49
General Tasks of Contextual Text Mining

• Theme Extraction: Extract the global salient themes

– Common information shared over all contexts
• View Comparison: Compare a theme from different views
– Analyze the content variation of themes over contexts
• Coverage Comparison: Compare the theme coverage of different
contexts
– Reveal how closely a theme is associated to a context
• Others:
– Causal analysis
– Correlation analysis

50
A General Solution: CPLSA
• CPLAS = Contextual Probabilistic Latent Semantic Analysis
• An extension of PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics

• Process of contextual text mining

– Instantiation of CPLSA (context, views, coverage)
– Fit the model to text data (EM algorithm)
– Compute probabilistic topic patterns

51
“Generation” Process of CPLSA
View1 View2 View3 Choose a theme
Themes Draw a word from i
Criticism of government
government 0.3 response to government
the hurricane
response 0.2.. primarily consisted of
Document
government response
criticism of its response to
context:
… The total shut-in oil
donate 0.1 Time =from
production Julythe
2005
Gulf of
relief 0.05 MexicoLocation = Texas
… approximately
donat
donation 24% ofAuthor
the annual
= xxx
help 0.02 .. aid e help
production and the shut-in
Occup. = Sociologist
gas production … Over
city 0.2 Agecountries
seventy Group =pledged
45+
New new 0.1 …Orleans
monetary donations or
Orleans orleans 0.05 other assistance.
new …
..
Texas July sociolo
2005 gist
Choose a view

Theme
coverages: …… Choose a
Coverage
Texas July 2005 document
52
Probabilistic Model
• To generate a document D with context feature set C:
– Choose a view vi according to the view distribution p ( vi | D , C )

– Choose a coverage кj according to the coverage distribution

p ( j | D , C )

– Choose a theme according to the coverage кj

– Generate a word using  il

– The likelihood of the document collection is:

n m k

log p ( D )    c ( w , D ) log(  p ( v
( D , C ) D wV i 1
i
| D ,C )  p (
j 1
j
| D ,C )  p(l | 
l 1
j
) p ( w |  il ) )

53
Parameter Estimation: EM Algorithm
• Interesting patterns:
   p( z
m k

– Theme content p
( t 1 )
( vi | D ,C ) 
wV
c ( w, D )
j 1 l 1 w ,i , j ,l
 1)

  c ( w, D )   p ( z
n m k
 1)
variation for each i ' 1 wV j ' 1 l ' 1 w , i ', j ', l '

view: p ( z w ,i , j ,l  1) 
p
(t)
( vi | D , C ) p
(t)
( j | D , C ) p
(t)
(l |  j ) p
(t)
( w |  il )

p ( w |  il )
  ( j ' | D , C )  ( l '|  j ' ) p ( w |  i 'l ' )
n (t) m (t) k (t) (t)
p ( vi ' | D , C ) p p
i ' 1 j ' 1 l ' 1

– Theme strength p
( t 1 )
( j | D ,C ) 
 wV
c ( w, D )   p( z
n

i 1
k

l 1 w ,i , j ,l
 1)

variation for each   c ( w, D )   p ( z

m n k

j ' 1 wV i ' 1 l ' 1 w , i ', j ', l '

 1)

context
   p( z
n
c ( w, D )  1)
p(l |  j ) (l |  j
( t 1 ) ( D , C ) D wV i 1 w ,i , j ,l
p ) 
   c ( w, D )  p ( z
l n

l ' 1 ( D , C ) D w 'V i ' 1 w , i ', j , l '

 1)

• Prior from a user can be  c ( w, D ) 

m
p ( z w ,i , j ,l  1)
incorporated using MAP ( w |  il
( t 1 ) ( D , C ) D j 1
p ) 
  
m

w 'V ( D , C ) D
c ( w', D )
j ' 1
p ( z w ', i , j ', l  1)
estimation

54
Regularization of the Model
• Why?
– Generality high complexity (inefﬁcient, multiple local maxima)
– Real applications have domain constraints/knowledge

• Two useful simpliﬁcations:

– Fixed-Coverage: Only analyze the content variation of themes (e.g., author
-topic analysis, cross-collection comparative analysis )
– Fixed-View: Only analyze the coverage variation of themes (e.g.,
spatiotemporal theme analysis)
• In general
– Impose priors on model parameters
– Support the whole spectrum from unsupervised to supervised
learning

55
Interpretation of Topics

Statistical
term 0.1599 term 0.1599 term 0.1599
relevance 0.0752 relevance 0.0752 relevance 0.0752
weight 0.0660 weight 0.0660 weight 0.0660

topic models feedback

model
0.0372
independence 0.0311
0.0310
feedback 0.0372
independence 0.0311
model 0.0310
feedback 0.0372
independence 0.0311
model 0.0310
frequent 0.0233 frequent 0.0233 frequent 0.0233
probabilistic 0.0188 probabilistic 0.0188 probabilistic 0.0188
document 0.0173 document 0.0173 document 0.0173
… … …

Multinomial topic models

Collection (Context) Coverage; Discrimination

Relevance Score Re-ranking

clustering algorithm;
distance measure;
database system, clustering algorithm, …
NLP Chunker r tree, functional dependency, iceberg
Ngram stat. cube, concurrency control, Ranked List
index structure … of Labels
Candidate label pool

56
Relevance: the Zero-Order Score

• Intuition: prefer phrases covering high probability words

Clustering

Good Label (l1):

“ clustering
dimensional
p(l | ) algorithm”
p(l )
Latent algorithm …
Topic 
birch

shape
Bad Label (l2):
… “ body shape”
p(w|) body

57
Relevance: the First-Order Score

• Intuition: prefer phrases with similar context (distribution)

Clustering
Clustering Clustering

dimension
dimension dimension
Good Label ( l1): Bad Label ( l2):
“clustering “hash join”
Topic  algorithm” …
partition partition
algorithm
algorithm algorithm
… … join
C: SIGMOD
Score (l,  )
  p ( w |  ) PMI
hash hash Proceedings
hash ( w, l | C )
P(w|) P(w|l1) P(w|l2) w

D(||l1) < D(||l2)

58
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis

59
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars

Cluster 1 Cluster 2 Cluster 3

Common united 0.042 killed 0.035 …

nations 0.04 month 0.032
Theme … deaths 0.023
…
Iraq n 0.03 troops 0.016 …
Weapons 0.024 hoon 0.015
Theme Inspections 0.023 sanches 0.012
… …
Northern 0.04 taleban 0.026 …
alliance 0.04 rumsfeld 0.02
Afghan kabul 0.03 hotel 0.012
Theme taleban 0.025 front 0.011
aid 0.02 …
…

Collection-speciﬁc themes indicate different roles of “United Nations” in the two wars

60
Comparing Laptop Reviews

Top words serve as “labels” for common themes

(e.g., [sound, speakers], [battery, hours], [cd,drive])

These word distributions can be used to segment text and

add hyperlinks between documents

61
Spatiotemporal Patterns in Blog Articles

• Query= “Hurricane Katrina”

• Topics in the results:

• Spatiotemporal patterns

62
Theme Life Cycles for Hurricane Katrina

Oil Price
price 0.0772
oil 0.0643
New Orleans gas 0.0454
increase 0.0210
product 0.0203
fuel 0.0188
company 0.0182
…

city 0.0634
orleans 0.0541
new 0.0342
louisiana 0.0235
ﬂood 0.0227
evacuate 0.0211
storm 0.0177
…

63
Theme Snapshots for Hurricane Katrina
Week2: The discussion moves towards the north and west

Week1: The theme is the strongest along the Gulf of Mexico Week3: The theme distributes more uniformly over the states

Week4: The theme is again strong along the east coast and the Gulf of Mexico

Week5: The theme fades out in most states

64
Theme Life Cycles: KDD

gene 0.0173
expressions 0.0096
probability 0.0081
microarray 0.0038
…
marketing 0.0087
customer 0.0086
model 0.0079
business 0.0048
…
rules 0.0142
association 0.0064
support 0.0053
…

Global Themes life cycles of KDD Abstracts

65
Theme Evolution Graph: KDD
1999 2000 2001 2002 2003 2004 T

web 0.009
SVM 0.007 classifica – mixture 0.005
criteria tion 0.007 random 0.006
0.007 features0.00 cluster 0.006
classifica – 6 clustering
tion 0.006 topic 0.005 0.005
linear 0.005 … variables 0.005 topic 0.010
… … mixture 0.008
decision 0.006 LDA 0.006
tree 0.006 … semantic
classifier 0.005 Classifica 0.005
class 0.005 - tion 0.015 Informa …
Bayes 0.005 text 0.013 - tion 0.012
… unlabeled web 0.010
0.012 social 0.008
document retrieval 0.007
0.008 distance 0.005
… networks
labeled 0.008
…
… 0.004
learning 0.007
… …
66
Blog Sentiment Summary
(query=“Da Vinci Code”)
Neutral Positive Negative
... Ron Howards selection of Tom Hanks stars in the But the movie might get
Tom Hanks to play Robert movie,who can be mad at delayed, and even killed off if
Langdon. that? he loses.
Facet 1:
Movie Directed by: Ron Howard Tom Hanks, who is my protesting ... will lose your faith
Writing credits: Akiva favorite movie star act the by ... watching the movie.
Goldsman ... leading role.

After watching the movie I Anybody is interested in it ... so sick of people making such
went online and some ? a big deal about a FICTION book
research on ... and movie.

I remembered when i first read Awesome book. ... so sick of people making such
the book, I finished the book a big deal about a FICTION book
Facet 2: in two days. and movie.
Book
I’m reading “Da Vinci Code” So still a good book to This controversy book cause
now. past time. lots conflict in west society.
…

67
Results: Sentiment Dynamics

Facet: the book “ the da vinci code”. Facet: religious beliefs ( Bursts
( Bursts during the movie, Pos > during the movie, Neg > Pos )
Neg )

68
Event Impact Analysis: IR Research
xml 0.0678
vector email 0.0197
0.0514 model
concept 0.0191
Theme: retrieval 0.0298 collect 0.0187
extend judgment
SIGIR papers
models 0.0297 0.0102
model rank 0.0097
0.0291 subtopic
term 0.1599
space 0.0236 0.0079
relevance 0.0752 Publication of the paper “A language
boolean 0.0151 …
weight 0.0660 modeling approach to information retrieval”
function 0.01231992
feedback 0.0372
independence
feedback year
0.0077 1998
0.0311 Starting of the TREC conferences
…
model 0.0310 model 0.1687
frequent 0.0233 language 0.0753
probabilistic 0.0188 probabilist estimate 0.0520
document 0.0173 0.0778 parameter 0.0281
… model distribution 0.0268
0.0432 probable 0.0205
logic 0.0404 smooth 0.0198
ir 0.0338 markov 0.0137
boolean likelihood 0.0059
0.0281 …
algebra 0.0200
estimate 69
0.0119
Temporal-Author-Topic Analysis
close 0.0805 index 0.0440
project 0.0444 pattern 0.0720 graph 0.0343
itemset 0.0433 sequential 0.0462 web 0.0307
intertransaction min_support gspan 0.0273
0.0397 0.0353 substructure
support 0.0264 threshold 0.0207 0.0201
associate 0.0258 top-k 0.0176 gindex 0.0164
frequent 0.0181 Author fp-tree 0.0102 bide 0.0115
closet 0.0176
preﬁxspan 0.0170
… Jiawei Han xml 0.0109
…
…
Author A
Global theme:
frequent patterns time
2000
pattern 0.1107 Author B research 0.0551
frequent 0.0406 Rakesh Agrawal next 0.0308
frequent-pattern transaction 0.0308
0.039 panel 0.0275
sequential 0.0360 technical 0.0275
pattern-growth article 0.0258
0.0203 revolution 0.0154
constraint 0.0184 innovate 0.0154
push 0.0138 …
…
70
Modeling Topical Communities
(Mei et al. 08)

Community 1:
Information Retrieval
Community 2:
Data Mining

Community 3:
Machine Learning
71
71
Other Extensions (LDA Extensions)

• Many extensions of LDA, mostly done by David

Blei, Andrew McCallum and their co-authors
• Some examples:
– Hierarchical topic models [Blei et al. 03]
– Modeling annotated data [Blei & Jordan 03]
– Dynamic topic models [Blei & Lafferty 06]

– Pachinko allocation [Li & McCallum 06])

• Also, some speciﬁc context extension of PLSA,

e.g., author-topic model [Steyvers et al. 04]

72
Future Research Directions
• Topic models for text mining
– Evaluation of topic models
– Improve the efﬁciency of estimation and inferences
– Incorporate linguistic knowledge
– Applications in new domains and for new tasks
• Text mining in general
– Combination of NLP-style and DM-style mining algorithms
– Integrated mining of text (unstructured) and unstructured data (e.g.,
Text OLAP)
– Interactive mining:
• Incorporate user constraints and support iterative mining
• Design and implement mining languages

73
Lecture 5: Key Points
• Topic models coupled with topic labeling are quite
useful for extracting and modeling subtopics in text
• Adding context variables signiﬁcantly increases a
topic model’s capacity of performing text mining
– Enable interpretation of topics in context
– Accommodate variation analysis and correlation analysis of
topics over context
• User’s preferences and domain knowledge can be
added as prior or soft constraint

74
Readings
• PLSA:
– http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf

• LDA:
– http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
– Many recent extensions, mostly done by David Blei and Andrew
McCallums
• CPLSA:
– http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf
– http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf

75
Discussion
• Topic models for mining multimedia data
– Simultaneous modeling of text and images
• Cross-media analysis
– Text provides context to analyze images and vice versa

76
Course Summary
Integrated Multimedia Data Analysis
Scope of the course
-Mutual reinforcement (e.g., text images)
Information Retrieval
-Simultaneous mining of text + images +video…
1. Evaluation Retrieval models/framework
Multimedia Data Text Data
2. User modeling Evaluation Feedback
3. Ranking Contextual topic models
4. Computer Vision
Learning with little Natural Language Processing
supervision
Machine Learning

Looking forward to collaborations…

Statistics

77
Thank You!

Information Retrieval On Cranfield Dataset
No ratings yet
Information Retrieval On Cranfield Dataset
15 pages
Graboplast ENG PDF
No ratings yet
Graboplast ENG PDF
11 pages
Probabilistic Topic Models
No ratings yet
Probabilistic Topic Models
78 pages
Topic Models in Natural Language Processing
No ratings yet
Topic Models in Natural Language Processing
64 pages
SNLP Overview
No ratings yet
SNLP Overview
43 pages
Unit 2, Part 2:topic Modeling
No ratings yet
Unit 2, Part 2:topic Modeling
26 pages
WINSEM2018-19 - CSE6019 - ETH - SJT421 - VL2018195001554 - Reference Material I - 3.3 PLSI
No ratings yet
WINSEM2018-19 - CSE6019 - ETH - SJT421 - VL2018195001554 - Reference Material I - 3.3 PLSI
22 pages
NLP Notes-1
No ratings yet
NLP Notes-1
54 pages
oneata
No ratings yet
oneata
7 pages
Apex Institute of Technology Natural Language Processing (CST-354)
No ratings yet
Apex Institute of Technology Natural Language Processing (CST-354)
22 pages
Information Retrieval - Lsi, Plsi and Lda: Jian-Yun Nie
No ratings yet
Information Retrieval - Lsi, Plsi and Lda: Jian-Yun Nie
34 pages
Information Retrieval Using Effective Bigram Topic Modeling
No ratings yet
Information Retrieval Using Effective Bigram Topic Modeling
8 pages
Word and Document Embeddings
No ratings yet
Word and Document Embeddings
94 pages
bag_of_words nlp
No ratings yet
bag_of_words nlp
23 pages
CS772 Lec21
No ratings yet
CS772 Lec21
18 pages
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
No ratings yet
Probabilistic Topic Modeling and Its Variants - A Survey: Padmaja CH V R S Lakshmi Narayana
5 pages
ME314 Day11
No ratings yet
ME314 Day11
77 pages
Predictive Methods For Text Mining
No ratings yet
Predictive Methods For Text Mining
75 pages
Ai Lecture22
No ratings yet
Ai Lecture22
32 pages
Lecture 8-1 - Text Classification, Naïve Bayes, Vector Space Classification
No ratings yet
Lecture 8-1 - Text Classification, Naïve Bayes, Vector Space Classification
38 pages
07 - Topic Modeling
No ratings yet
07 - Topic Modeling
122 pages
01_Introduction to Text Analytics_part2
No ratings yet
01_Introduction to Text Analytics_part2
48 pages
Topic Model For LDA
No ratings yet
Topic Model For LDA
9 pages
Topic Modelling
No ratings yet
Topic Modelling
14 pages
Data Science Interview Preparation Questions (#Day06)
No ratings yet
Data Science Interview Preparation Questions (#Day06)
10 pages
2023F DM (E) 13
No ratings yet
2023F DM (E) 13
66 pages
Machine Learning and Statistical Natural Language Processing
No ratings yet
Machine Learning and Statistical Natural Language Processing
27 pages
Q16-1028
No ratings yet
Q16-1028
16 pages
UTOPIC 2023.eacl-main.132
No ratings yet
UTOPIC 2023.eacl-main.132
16 pages
AI Lec 04+05 - Naive Bayes
No ratings yet
AI Lec 04+05 - Naive Bayes
55 pages
Machine Learning for data science Unit-5
No ratings yet
Machine Learning for data science Unit-5
10 pages
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
No ratings yet
20200728204914D5872 - COMP6639 - Session 28 - Natural Language Processing
29 pages
Cs224n 2024 Lecture02 Wordvecs2
No ratings yet
Cs224n 2024 Lecture02 Wordvecs2
45 pages
Topic Models in Natural Language Processing
No ratings yet
Topic Models in Natural Language Processing
55 pages
Clustering Search Results Using PLSA
No ratings yet
Clustering Search Results Using PLSA
29 pages
Latent Dirichlet Allocation: An Example of A Graphical Model
No ratings yet
Latent Dirichlet Allocation: An Example of A Graphical Model
47 pages
Session 2
No ratings yet
Session 2
58 pages
module5_DS_ppt
No ratings yet
module5_DS_ppt
38 pages
Topic Modelling Using NLP
No ratings yet
Topic Modelling Using NLP
18 pages
Module_5-Natural_language_processing[1]
No ratings yet
Module_5-Natural_language_processing[1]
13 pages
CS109/Stat121/AC209/E-109 Data Science: Bayesian Methods Continued, Text Data
No ratings yet
CS109/Stat121/AC209/E-109 Data Science: Bayesian Methods Continued, Text Data
35 pages
ir_answers
No ratings yet
ir_answers
15 pages
Pert23 - NLP
No ratings yet
Pert23 - NLP
30 pages
Week 12
No ratings yet
Week 12
19 pages
Dealing With Textual Data
No ratings yet
Dealing With Textual Data
67 pages
NLP_Module 2(1)
No ratings yet
NLP_Module 2(1)
77 pages
An Introduction To Text: Mining
No ratings yet
An Introduction To Text: Mining
39 pages
Lecture03 Naive Bayes
No ratings yet
Lecture03 Naive Bayes
33 pages
Vector Semantics
No ratings yet
Vector Semantics
83 pages
Text Mining
No ratings yet
Text Mining
41 pages
Exploring Trends in A Topic-Based Search Engine: Wray Buntine, Jukka Perki O, Sami Perttu
No ratings yet
Exploring Trends in A Topic-Based Search Engine: Wray Buntine, Jukka Perki O, Sami Perttu
7 pages
IIT-P ADS Week 22 Transcripts
No ratings yet
IIT-P ADS Week 22 Transcripts
4 pages
Data Mining:: Concepts and Techniques
No ratings yet
Data Mining:: Concepts and Techniques
37 pages
wordembed
No ratings yet
wordembed
31 pages
Introduction To Text Mining
No ratings yet
Introduction To Text Mining
82 pages
Unit5 Notes
No ratings yet
Unit5 Notes
17 pages
Lecture 02
No ratings yet
Lecture 02
31 pages
Chapter-6
No ratings yet
Chapter-6
79 pages
Inex03 Pto
No ratings yet
Inex03 Pto
8 pages
IR Chapter 2
No ratings yet
IR Chapter 2
37 pages
A Short Course in Discrete Mathematics
From Everand
A Short Course in Discrete Mathematics
Edward A. Bender
3/5 (1)
orlitzky-benjamin-2001-corporate-social-performance-and-firm-risk-a-meta-analytic-review
No ratings yet
orlitzky-benjamin-2001-corporate-social-performance-and-firm-risk-a-meta-analytic-review
28 pages
Helicopter Route
No ratings yet
Helicopter Route
10 pages
Audiocodes mp108
No ratings yet
Audiocodes mp108
238 pages
(eBook PDF) Responsive Web Design with HTML 5 & CSS 9th Edition instant download
100% (1)
(eBook PDF) Responsive Web Design with HTML 5 & CSS 9th Edition instant download
54 pages
Overview of Transaction Processing and Enterprise Resource Planning Systems
No ratings yet
Overview of Transaction Processing and Enterprise Resource Planning Systems
34 pages
Eel 316
No ratings yet
Eel 316
2 pages
Inventions and Discoveries: Invention - Inventor
No ratings yet
Inventions and Discoveries: Invention - Inventor
13 pages
11 JULY Monthly Test
No ratings yet
11 JULY Monthly Test
2 pages
Question Pack 1 (2)
No ratings yet
Question Pack 1 (2)
23 pages
CO PO Format (Probability and Statistics, BCADS-3rd SEM)
No ratings yet
CO PO Format (Probability and Statistics, BCADS-3rd SEM)
7 pages
HS Food Safety and Sanitation Test Answers
No ratings yet
HS Food Safety and Sanitation Test Answers
3 pages
Postgresql Install
No ratings yet
Postgresql Install
4 pages
Sheehan’s Take on the Newly Released Documents
No ratings yet
Sheehan’s Take on the Newly Released Documents
24 pages
Alumnos de Doctorado
No ratings yet
Alumnos de Doctorado
135 pages
Toyota Hilux 4x4
No ratings yet
Toyota Hilux 4x4
4 pages
Thin Walled Pressure Vessels: ASEN 3112 - Structures
No ratings yet
Thin Walled Pressure Vessels: ASEN 3112 - Structures
11 pages
mx00_1193240481
No ratings yet
mx00_1193240481
76 pages
Pots and Pan New Catalog 1
No ratings yet
Pots and Pan New Catalog 1
68 pages
AMOM Lecture1 - Fundamentals
No ratings yet
AMOM Lecture1 - Fundamentals
50 pages
Pinout Cable de Consola: Db9 Female Db9 Female Db9 Female RJ45
100% (2)
Pinout Cable de Consola: Db9 Female Db9 Female Db9 Female RJ45
2 pages
ElementsFifthEdition 8-22-23
No ratings yet
ElementsFifthEdition 8-22-23
240 pages
Lecture 08 - Reflection and Transmission of Waves
67% (3)
Lecture 08 - Reflection and Transmission of Waves
38 pages
ECHOLENS_Smart_Glasses_for_Real-time_speech_displa
No ratings yet
ECHOLENS_Smart_Glasses_for_Real-time_speech_displa
6 pages
Coffee Seedlings Distribution Guidelines
No ratings yet
Coffee Seedlings Distribution Guidelines
2 pages
Business Logistics
No ratings yet
Business Logistics
22 pages
Rohde Schwarz LTE Poster
No ratings yet
Rohde Schwarz LTE Poster
2 pages
KEI Profile - Tech
No ratings yet
KEI Profile - Tech
77 pages
Make It Work Conventional Fire Alarms Mobile
100% (1)
Make It Work Conventional Fire Alarms Mobile
17 pages
Mahamarathon For Gpat 2024 Schedule
100% (1)
Mahamarathon For Gpat 2024 Schedule
11 pages