Probabilistic Topic Models
Text Mining
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
1
What Is Text Mining?
2
Two Different Views of Text Mining
3
Applications of Text Mining
4
Text Mining Methods
• Data Mining Style: View text as high dimensional data
– Frequent pattern finding
– Association analysis
– Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
– Topic extraction
– Exploit term weighting and text similarity measures
• Natural Language Processing Style: Information Extraction
– Entity extraction
– Relation extraction
– Sentiment analysis
– Question answering
• Machine Learning Style: Unsupervised or semi-supervised learning
– Mixture models
– Dimension reduction
(Topic of this lecture)
5
Outline
• The Basic Topic Models:
– Probabilistic Latent Semantic Analysis (PLSA) [Hofmann 99]
– Latent Dirichlet Allocation (LDA) [Blei et al. 02]
• Extensions
– Contextual Probabilistic Latent Semantic Analysis (CPLSA)
[Mei & Zhai 06]
6
Basic Topic Model: PLSA
7
PLSA: Motivation
What did people say in their blog articles about “Hurricane Katrina”?
8
Probabilistic Latent Semantic
Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
9
PLSA as a Mixture Model
The document language model mixes a background model with k topic models:

$$p_d(w) = \lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)$$

$$\log p(d) = \sum_{w \in V} c(w,d)\, \log\Big[\lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)\Big]$$

[Figure: "generating" a word w in document d in the collection. Each topic is a word distribution, e.g., Topic 1 (warning 0.3, system 0.2, ...), Topic 2 (aid 0.1, donation 0.05, support 0.02, ...), Topic k (statistics 0.2, loss 0.1, dead 0.05, ...); the background model θB favors common words (is 0.05, the 0.04, a 0.03, ...). A word is drawn from the background with probability λB, or from topic θj with probability (1-λB)πd,j.]

Parameters:
– λB = noise level (manually set)
– θ's and π's are estimated with Maximum Likelihood
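Worked example (not from the slides): a minimal numpy sketch of evaluating this mixture likelihood; the vocabulary, counts, and all probability values below are made-up toy numbers.

```python
import numpy as np

# Hypothetical toy setup: vocabulary of 5 words, k = 2 topics, one document.
vocab = ["warning", "aid", "donation", "statistics", "the"]
c_wd = np.array([2.0, 3.0, 1.0, 0.0, 5.0])       # word counts c(w,d)
p_w_B = np.array([0.05, 0.05, 0.05, 0.05, 0.80])  # background p(w|theta_B)
p_w_topics = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],   # p(w|theta_1)
                       [0.05, 0.40, 0.40, 0.10, 0.05]])  # p(w|theta_2)
pi_d = np.array([0.6, 0.4])  # document-specific topic weights pi_{d,j}
lambda_B = 0.9               # noise level, manually set

# p_d(w) = lambda_B * p(w|B) + (1 - lambda_B) * sum_j pi_{d,j} p(w|theta_j)
p_d_w = lambda_B * p_w_B + (1 - lambda_B) * pi_d @ p_w_topics

# log p(d) = sum_w c(w,d) * log p_d(w)
log_p_d = float(c_wd @ np.log(p_d_w))
print(log_p_d)
```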
10
How to Estimate j: EM Algorithm
[Figure: the background model p(w|θB) is known (e.g., the 0.2, a 0.1, we 0.01, to 0.02, ...), while the topic models are unknown: p(w|θ1) for "Text mining" (text = ?, mining = ?, association = ?, word = ?, ...) and p(w|θ2) for "information retrieval" (information = ?, retrieval = ?, query = ?, document = ?, ...). Suppose we knew the identity (topic) of each word in the observed documents; then the ML estimator would be straightforward. Since we don't, EM alternates between guessing the identities and re-estimating the parameters.]
11
How the Algorithm Works
[Figure: two toy documents with word counts, d1: aid 7, price 5, oil 6; d2: aid 8, price 7, oil 5. The topic weights πd1,1, πd1,2, πd2,1, πd2,2 (i.e., p(θ1|d1), p(θ2|d1), p(θ1|d2), p(θ2|d2)) start from initial values. Each iteration computes the hidden-variable posteriors p(zd,w = B) and p(zd,w = j), then re-estimates the weights from the expected counts c(w,d)(1 - p(zd,w = B)) p(zd,w = j).]
12
Parameter Estimation
E-Step: compute the probability that word w in doc d is generated from cluster j or from the background (an application of Bayes' rule):

$$p(z_{d,w}=j) = \frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}{\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}$$

$$p(z_{d,w}=B) = \frac{\lambda_B\, p(w \mid \theta_B)}{\lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}$$

M-Step: re-estimate the mixing weights and the cluster language models from the expected counts:

$$\pi_{d,j}^{(n+1)} = \frac{\sum_{w \in V} c(w,d)\,(1-p(z_{d,w}=B))\, p(z_{d,w}=j)}{\sum_{j'} \sum_{w \in V} c(w,d)\,(1-p(z_{d,w}=B))\, p(z_{d,w}=j')}$$

$$p^{(n+1)}(w \mid \theta_j) = \frac{\sum_{d} c(w,d)\,(1-p(z_{d,w}=B))\, p(z_{d,w}=j)}{\sum_{w' \in V} \sum_{d} c(w',d)\,(1-p(z_{d,w'}=B))\, p(z_{d,w'}=j)}$$
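A hedged numpy sketch of these updates, run on the toy aid/price/oil counts from the previous slide; the initialization, k = 2, and λB = 0.5 are arbitrary choices.

```python
import numpy as np

# Shapes: D docs x V words, k topics; all starting values are made up.
counts = np.array([[7.0, 5.0, 6.0],      # d1: aid, price, oil
                   [8.0, 7.0, 5.0]])     # d2
p_w_B = np.array([1/3, 1/3, 1/3])        # known background model
lam_B = 0.5                              # manually set noise level
rng = np.random.default_rng(0)
p_w_t = rng.dirichlet(np.ones(3), size=2)    # p(w|theta_j), k = 2
pi = np.full((2, 2), 0.5)                    # pi_{d,j}

for _ in range(50):
    # E-step: posterior of topic j for each (d, w), and of the background
    mix = pi @ p_w_t                                   # D x V
    p_z_B = lam_B * p_w_B / (lam_B * p_w_B + (1 - lam_B) * mix)
    p_z_j = (pi[:, :, None] * p_w_t[None, :, :]) / mix[:, None, :]  # D x k x V
    # M-step: expected counts c(w,d)(1 - p(z=B)) p(z=j)
    ec = counts[:, None, :] * (1 - p_z_B)[:, None, :] * p_z_j
    pi = ec.sum(axis=2) / ec.sum(axis=(1, 2), keepdims=True)[:, :, 0]
    p_w_t = ec.sum(axis=0) / ec.sum(axis=0).sum(axis=1, keepdims=True)

print(np.round(pi, 3))     # estimated topic coverage per document
print(np.round(p_w_t, 3))  # estimated topic word distributions
```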
17
Basic Topic Model: LDA
The following slides about LDA are taken from Michael C. Mozer’s course lecture
http://www.cs.colorado.edu/~mozer/courses/ProbabilisticModels/
18
LDA: Motivation
• Shortcomings of pLSI that LDA addresses:
– “Documents have no generative probabilistic semantics”
• i.e., a document is just a symbol (an index)
– Model has many parameters
• linear in the number of documents
• need heuristic methods to prevent overfitting
– Cannot generalize to new documents
Unigram Model
$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$
Mixture of Unigrams
$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$
Topic Model / Probabilistic LSI
$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$

• Latent topics
– random variable z, with values 1, ..., k
• Document probability
Fancier Version
The topic proportions θ are drawn from a Dirichlet prior:

$$p(\theta \mid \alpha) = \frac{\Gamma\big(\sum_{i=1}^{k}\alpha_i\big)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\; \theta_1^{\alpha_1-1} \cdots \theta_k^{\alpha_k-1}$$
Inference
$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)$$

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta$$
Inference
• In general, this formula is intractable:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta$$

• Written out in full, the coupling of θ and β is explicit:

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\big(\sum_{i}\alpha_i\big)}{\prod_{i}\Gamma(\alpha_i)} \int \Big(\prod_{i=1}^{k} \theta_i^{\alpha_i - 1}\Big) \Big(\prod_{n=1}^{N} \prod_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j}\Big)\, d\theta$$

Variational Approximation
• Computing the log likelihood and introducing Jensen's inequality: log E[x] >= E[log x], which yields a tractable lower bound.
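For intuition, a minimal numpy sketch of the generative story that this inference inverts; the vocabulary, α, and β below are invented toy values.

```python
import numpy as np

# Hedged sketch of LDA's generative process (not the inference).
rng = np.random.default_rng(0)
vocab = ["model", "data", "topic", "gene", "cell"]
k, V, N = 2, 5, 10
alpha = np.ones(k) * 0.5                  # Dirichlet prior over topics
beta = rng.dirichlet(np.ones(V), size=k)  # beta[i] = p(w | z=i)

theta = rng.dirichlet(alpha)              # 1. draw topic proportions theta
doc = []
for _ in range(N):
    z_n = rng.choice(k, p=theta)          # 2. draw topic z_n ~ Mult(theta)
    w_n = rng.choice(V, p=beta[z_n])      # 3. draw word w_n ~ p(w | z_n, beta)
    doc.append(vocab[w_n])
print(theta, doc)
```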
32
Extension of PLSA:
Contextual Probabilistic Latent
Semantic Analysis (CPLSA)
33
A General Introduction to EM
34
Convergence Guarantee
Note that, since p(X,H|θ) = p(H|X,θ) p(X|θ), we have L(θ) = Lc(θ) - log p(H|X,θ). Therefore

$$L(\theta^{(n)}) - L(\theta^{(n-1)}) = L_c(\theta^{(n)}) - L_c(\theta^{(n-1)}) + \log \frac{p(H \mid X, \theta^{(n-1)})}{p(H \mid X, \theta^{(n)})}$$
35
Another way of looking at EM
The likelihood p(X|θ) satisfies the identity

$$L(\theta) = L(\theta^{(n-1)}) + Q(\theta;\,\theta^{(n-1)}) - Q(\theta^{(n-1)};\,\theta^{(n-1)}) + D\big(p(H \mid X, \theta^{(n-1)}) \,\|\, p(H \mid X, \theta)\big)$$

Since the KL-divergence term is nonnegative, dropping it leaves a lower bound (the Q function) that touches the likelihood at the current guess θ^(n-1); the next guess maximizes this bound.

[Figure: the likelihood curve p(X|θ), with the current guess, the next guess, and the lower bound (Q function) drawn beneath it.]

E-step = computing the lower bound
M-step = maximizing the lower bound
36
Why Contextual PLSA?
37
Motivating Example:
Comparing Product Reviews
                  IBM Laptop Reviews     APPLE Laptop Reviews    DELL Laptop Reviews
Battery Life      Long, 4-3 hrs          Medium, 3-2 hrs         Short, 2-1 hrs
Speed             Slow, 100-200 MHz      Very Fast, 3-4 GHz      Moderate, 1-2 GHz

[Another motivating example: comparing news about similar topics, where common themes (e.g., "United nations", "Death of people") play different roles across collections.]

[Another motivating example: discovering temporal trends in research literature (1980, 1990, 1998, 2003), e.g., the rise and fall of themes such as TF-IDF retrieval, language models, text categorization, and other IR applications.]
Motivating Example:
Analyzing Spatial Topic Patterns
42
Research Questions
• Can we model all these problems generally?
• Can we solve these problems with a unified approach?
• How can we bring human into the loop?
43
Contextual Text Mining
• Given collections of text with contextual information (meta-data)
• Discover themes/subtopics/topics (interesting word clusters)
• Compute variations of themes over contexts
• Applications:
– Summarizing search results
– Federation of text information
– Opinion analysis
– Social network analysis
– Business intelligence
– ..
44
Context Features of Text (Meta-data)
[Figure: a weblog article with its meta-data: author, author's occupation, time, location, source, and communities.]
45
Context = Partitioning of Text
[Figure: a context partitions the text, e.g., the subset of papers written in 1998, within a collection spanning 1998-2005.]
46
Themes/Topics
Example (blog articles about Hurricane Katrina): themes are word distributions:

Theme 1: government 0.3, response 0.2, ...
Theme 2: donate 0.1, relief 0.05, help 0.02, ...
Theme k: city 0.2, new 0.1, orleans 0.05, ...
Background θB: is 0.05, the 0.04, a 0.03, ...

Text with theme spans marked: "[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. ... 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] ... [Over seventy countries pledged monetary donations or other assistance]. ..."

• Uses of themes:
– Summarize topics/subtopics
– Navigate in a document space
– Retrieve documents
– Segment documents
– ...
47
View of Themes:
Context-Specific Version of Views
[Figure: a context-specific view of two themes. Theme 1 (Retrieval Model): retrieval, model, judge, relevance, document, query, ... Theme 2 (Feedback): feedback, expansion, pseudo, query, mixture, ... Under the context "After 1998 (Language models)", the view emphasizes words such as language model, mixture model, smoothing, estimate, EM, query generation, pseudo feedback; an alternative view emphasizes vector space, TF-IDF, Okapi, LSI, Rocchio, weighting, term.]
48
Coverage of Themes:
Distribution over Themes
Example text: "Criticism of government response to the hurricane primarily consisted of criticism of its response to ... The total shut-in oil production from the Gulf of Mexico ... approximately 24% of the annual production and the shut-in gas production ... Over seventy countries pledged monetary donations or other assistance. ..."

[Figure: the same themes (Oil Price, Government Response, Aid and Donation, Background) receive different coverage in the context "Texas" than in the context "Louisiana".]

• Theme coverage can depend on context
49
General Tasks of Contextual Text Mining
50
A General Solution: CPLSA
• CPLSA = Contextual Probabilistic Latent Semantic Analysis
• An extension of PLSA model ([Hofmann 99]) by
– Introducing context variables
– Modeling views of topics
– Modeling coverage variations of topics
51
“Generation” Process of CPLSA
[Figure: the generation process. A document carries context features: Time = July 2005, Location = Texas, Author = xxx, Occupation = Sociologist, Age Group = 45+. Themes are word distributions, e.g., government (government 0.3, response 0.2, ...), donation (donate 0.1, relief 0.05, help 0.02, ...), New Orleans (city 0.2, new 0.1, orleans 0.05, ...), each with several views (View1, View2, View3) tied to context features such as Texas, July 2005, sociologist; theme coverages likewise vary with context (e.g., a Texas coverage vs. a July 2005 coverage). To generate a word: choose a view, choose a coverage for the document, choose a theme θi, and draw the word from θi. Example document: "Criticism of government response to the hurricane primarily consisted of criticism of its response to ... The total shut-in oil production from the Gulf of Mexico ... approximately 24% of the annual production and the shut-in gas production ... Over seventy countries pledged monetary donations or other assistance. ..."]
52
Probabilistic Model
• To generate a document D with context feature set C:
– Choose a view vi according to the view distribution p(vi | D, C)
– Choose a coverage κj according to the coverage distribution p(κj | D, C)
– Choose a theme θil according to the coverage κj, then draw each word from θil

The resulting log-likelihood of the collection is:

$$\log p(\mathcal{D}) = \sum_{(D,C)\in\mathcal{D}}\; \sum_{w\in V} c(w,D)\, \log\!\Big( \sum_{i=1}^{n} p(v_i \mid D, C) \sum_{j=1}^{m} p(\kappa_j \mid D, C) \sum_{l=1}^{k} p(\theta_l \mid \kappa_j)\, p(w \mid \theta_{il}) \Big)$$
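A minimal numpy sketch (toy dimensions and made-up probability tables) of evaluating this log-likelihood for a single (document, context) pair:

```python
import numpy as np

# Hedged sketch of the CPLSA log-likelihood for one (D, C) pair.
rng = np.random.default_rng(0)
n_views, m_covs, k_themes, V = 2, 2, 3, 6

c_wD = rng.integers(0, 5, size=V).astype(float)          # word counts c(w, D)
p_view = rng.dirichlet(np.ones(n_views))                 # p(v_i | D, C)
p_cov = rng.dirichlet(np.ones(m_covs))                   # p(kappa_j | D, C)
p_theme = rng.dirichlet(np.ones(k_themes), size=m_covs)  # p(theta_l | kappa_j)
p_word = rng.dirichlet(np.ones(V), size=(n_views, k_themes))  # p(w | theta_il)

# p(w | D, C) = sum_i p(v_i) sum_j p(kappa_j) sum_l p(l | kappa_j) p(w | theta_il)
p_w = np.einsum("i,j,jl,ilw->w", p_view, p_cov, p_theme, p_word)
log_p_D = float(c_wD @ np.log(p_w))
print(log_p_D)
```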
53
Parameter Estimation: EM Algorithm
• Interesting patterns read off the estimated parameters:
– Theme content variation for each view: p(w | θil)
– Theme strength variation over contexts: p(κj | D, C)

E-Step (the hidden variable z indicates which view, coverage, and theme generated w):

$$p(z_{w,i,j,l}=1) = \frac{p^{(t)}(v_i \mid D,C)\; p^{(t)}(\kappa_j \mid D,C)\; p^{(t)}(\theta_l \mid \kappa_j)\; p^{(t)}(w \mid \theta_{il})}{\sum_{i'=1}^{n} \sum_{j'=1}^{m} \sum_{l'=1}^{k} p^{(t)}(v_{i'} \mid D,C)\; p^{(t)}(\kappa_{j'} \mid D,C)\; p^{(t)}(\theta_{l'} \mid \kappa_{j'})\; p^{(t)}(w \mid \theta_{i'l'})}$$

M-Step (each right-hand side is normalized over its free index):

$$p^{(t+1)}(v_i \mid D,C) \propto \sum_{w\in V} c(w,D) \sum_{j=1}^{m} \sum_{l=1}^{k} p(z_{w,i,j,l}=1)$$

$$p^{(t+1)}(\kappa_j \mid D,C) \propto \sum_{w\in V} c(w,D) \sum_{i=1}^{n} \sum_{l=1}^{k} p(z_{w,i,j,l}=1)$$

$$p^{(t+1)}(\theta_l \mid \kappa_j) \propto \sum_{(D,C)} \sum_{w\in V} c(w,D) \sum_{i=1}^{n} p(z_{w,i,j,l}=1)$$

$$p^{(t+1)}(w \mid \theta_{il}) \propto \sum_{(D,C)} c(w,D) \sum_{j=1}^{m} p(z_{w,i,j,l}=1)$$
54
Regularization of the Model
• Why?
– Generality brings high complexity (inefficiency, multiple local maxima)
– Real applications have domain constraints/knowledge
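One standard remedy, sketched here under the assumption of a conjugate Dirichlet prior (an illustration, not the slides' specific recipe): domain knowledge about a topic's words enters the M-step as pseudo counts.

```python
import numpy as np

# Hedged sketch: a user-specified prior word distribution for one topic,
# with strength mu, added as pseudo counts in the M-step (MAP estimate).
# 'expected_counts', 'prior', and mu are made-up illustration values.
expected_counts = np.array([4.2, 1.1, 0.3])   # from the E-step, for one topic
prior = np.array([0.5, 0.5, 0.0])             # user says words 0 and 1 matter
mu = 10.0                                     # prior strength (pseudo counts)

# MAP estimate: add mu * prior as pseudo counts before normalizing
# (prior sums to 1, so the denominator gains exactly mu)
p_w_topic = (expected_counts + mu * prior) / (expected_counts.sum() + mu)
print(p_w_topic)
```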
55
Interpretation of Topics
A statistical topic model outputs a multinomial word distribution, e.g.:

term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, ...

To interpret it, an NLP chunker and ngram statistics extract a candidate label pool from the collection (e.g., "clustering algorithm", "distance measure", "database system", "r tree", "functional dependency", "iceberg cube", "concurrency control", "index structure", ...), and the candidates are scored against the topic to produce a ranked list of labels.
56
Relevance: the Zero-Order Score
[Figure: the zero-order score judges a candidate label l directly by the probabilities the topic p(w|θ) assigns to the label's own words; illustrated with the bad label l2 = "body shape" and the words "body" and "shape" under p(w|θ).]
57
Relevance: the First-Order Score
[Figure: the first-order score compares a good label l1 = "clustering algorithm" with a bad label l2 = "hash join" for a topic whose high-probability words include clustering, dimension, partition, algorithm, hash, join. The score rewards labels whose words co-occur with the topic's words in the context collection C (e.g., C = SIGMOD proceedings):]

$$\mathrm{Score}(l, \theta) = \sum_{w} p(w \mid \theta)\, \mathrm{PMI}(w, l \mid C)$$
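A toy sketch of this first-order score; the mini-corpus, topic distribution, candidate labels, and the crude document-level PMI estimate are all invented for illustration.

```python
import math

# Hedged sketch: score candidate labels for a topic via PMI co-occurrence.
docs = [{"clustering", "algorithm", "partition", "dimension"},
        {"hash", "join", "query", "algorithm"},
        {"clustering", "partition", "distance", "algorithm"}]
topic = {"clustering": 0.4, "partition": 0.3, "algorithm": 0.2, "hash": 0.1}
labels = ["clustering algorithm", "hash join"]

def pmi(w, l, docs):
    # PMI(w, l | C): document-level co-occurrence of word w and label l
    n = len(docs)
    label_words = set(l.split())
    p_w = sum(w in d for d in docs) / n
    p_l = sum(label_words <= d for d in docs) / n
    p_wl = sum(w in d and label_words <= d for d in docs) / n
    if p_wl == 0 or p_w == 0 or p_l == 0:
        return 0.0  # simplistic smoothing choice for this sketch
    return math.log(p_wl / (p_w * p_l))

for l in labels:
    score = sum(p * pmi(w, l, docs) for w, p in topic.items())
    print(l, round(score, 3))  # "clustering algorithm" should score higher
```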
58
Sample Results
• Comparative text mining
• Spatiotemporal pattern mining
• Sentiment summary
• Event impact analysis
• Temporal author-topic analysis
59
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
60
Comparing Laptop Reviews
61
Spatiotemporal Patterns in Blog Articles
• Spatiotemporal patterns
62
Theme Life Cycles for Hurricane Katrina
[Figure: theme life cycles (strength over time) for blog discussion of Hurricane Katrina.]

Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, ...

New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, ...
63
Theme Snapshots for Hurricane Katrina
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
64
Theme Life Cycles: KDD
[Figure: theme life cycles in KDD abstracts; three example themes:]

Theme (gene expression): gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, ...
Theme (marketing): marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, ...
Theme (association rules): rules 0.0142, association 0.0064, support 0.0053, ...
65
Theme Evolution Graph: KDD
[Figure: theme evolution graph over KDD abstracts, 1999-2004. Nodes are theme snapshots, e.g.: (web 0.009, classification 0.007, features 0.006, topic 0.005, ...); (SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, ...); (decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, ...); (mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, ...); (topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, ...); (classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, ...); (information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, ...). Edges link themes that evolve into one another over time.]
66
Blog Sentiment Summary
(query=“Da Vinci Code”)
Facet 1: Movie
- Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard. Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and did some research on ..."
- Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star, act the leading role." / "Anybody is interested in it?"
- Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Facet 2: Book
- Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now."
- Positive: "Awesome book." / "So still a good book to past time."
- Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."
67
Results: Sentiment Dynamics
Facet: the book "the da vinci code" (bursts during the movie, Pos > Neg). Facet: religious beliefs (bursts during the movie, Neg > Pos).
68
Event Impact Analysis: IR Research
[Figure: impact of two events on the "retrieval models" theme in SIGIR papers.]

Theme (retrieval models): term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, ...

Event (1992): starting of the TREC conferences. Before: vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, ... After: xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, ...

Event (1998): publication of the paper "A language modeling approach to information retrieval". Before: probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, ... After: model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, ...

69
Temporal-Author-Topic Analysis
[Figure: evolution of a global theme within individual authors' work over time (marker: 2000).]

Global theme (frequent patterns): pattern 0.1107, frequent 0.0406, frequent-pattern 0.039, sequential 0.0360, pattern-growth 0.0203, constraint 0.0184, push 0.0138, ...

Author A (Jiawei Han): project 0.0444, itemset 0.0433, intertransaction 0.0397, support 0.0264, associate 0.0258, frequent 0.0181, closet 0.0176, prefixspan 0.0170, ... then close 0.0805, pattern 0.0720, sequential 0.0462, min_support 0.0353, threshold 0.0207, top-k 0.0176, fp-tree 0.0102, ... then index 0.0440, graph 0.0343, web 0.0307, gspan 0.0273, substructure 0.0201, gindex 0.0164, bide 0.0115, xml 0.0109, ...

Author B (Rakesh Agrawal): research 0.0551, next 0.0308, transaction 0.0308, panel 0.0275, technical 0.0275, article 0.0258, revolution 0.0154, innovate 0.0154, ...
70
Modeling Topical Communities
(Mei et al. 08)
Community 1:
Information Retrieval
Community 2:
Data Mining
Community 3:
Machine Learning
71
Other Extensions (LDA Extensions)
72
Future Research Directions
• Topic models for text mining
– Evaluation of topic models
– Improve the efficiency of estimation and inferences
– Incorporate linguistic knowledge
– Applications in new domains and for new tasks
• Text mining in general
– Combination of NLP-style and DM-style mining algorithms
– Integrated mining of text (unstructured) and structured data (e.g., Text OLAP)
– Interactive mining:
• Incorporate user constraints and support iterative mining
• Design and implement mining languages
73
Lecture 5: Key Points
• Topic models coupled with topic labeling are quite
useful for extracting and modeling subtopics in text
• Adding context variables significantly increases a topic model's capacity for text mining
– Enable interpretation of topics in context
– Accommodate variation analysis and correlation analysis of topics over context
• A user's preferences and domain knowledge can be added as priors or soft constraints
74
Readings
• PLSA:
– http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf
• LDA:
– http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
– Many recent extensions, mostly done by David Blei and Andrew McCallum
• CPLSA:
– http://sifaka.cs.uiuc.edu/czhai/pub/kdd06-mix.pdf
– http://sifaka.cs.uiuc.edu/czhai/pub/www08-net.pdf
75
Discussion
• Topic models for mining multimedia data
– Simultaneous modeling of text and images
• Cross-media analysis
– Text provides context to analyze images and vice versa
76
Course Summary
[Figure: scope of the course. Information Retrieval over text data: 1. Evaluation, 2. User modeling, 3. Ranking, 4. Learning with little supervision, covering retrieval models/frameworks, evaluation, feedback, and contextual topic models, and drawing on Natural Language Processing and Machine Learning. Integrated multimedia data analysis extends this to multimedia data (drawing on Computer Vision): mutual reinforcement (e.g., text and images) and simultaneous mining of text + images + video ...]
77
Thank You!
78