all v, 1 ≤ v ≤ V. This constraint ensures that topics vary smoothly and consistently across corpora.

Third, for simplicity, we assume that topic parameters associated with different terms are mutually independent. In other words, the probability for a particular term in a topic is only directly affected by the probabilities for the same term across corpora. With this additional constraint, the precision matrix ∆_{1:V,1:K,1:W} is block-diagonal with blocks ∆_{1:V,k,w}, 1 ≤ k ≤ K and 1 ≤ w ≤ W.

We further experimented with tying the blocks of precision parameters across words to the same value, that is, ∆_{1:V,k,w} = ∆_{1:V,k} for all w. We found that this constraint is too simplistic for the problem at hand. In this case, the precisions associated with the many terms with low probabilities (which in fact do not characterize the topics) overwhelmed the estimation of the tied precisions. Terms with higher topic parameter values are more important to a topic. In order to ensure dominance of the topic parameters for the characteristic terms, we instead scale the tying by the weight of the expected number of observations for each term. That is, the block of precision parameters associated with term w is scaled by the factor

\[ g_{k,w} \;=\; \frac{W \exp(m_{k,w})}{\sum_{w} \exp(m_{k,w})} . \tag{2} \]

Note that Σ_w g_{k,w} = W. If we set g_w ≡ 1, we return to the unscaled model.
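As a concrete check of the scaling in (2), here is a minimal sketch in Python for one topic; the toy vocabulary size and all names are our own, not the paper's:

```python
import numpy as np

def precision_scaling(m_k):
    """Per-term scaling of the tied precision block for one topic (Eq. 2):
    g_{k,w} = W * exp(m_{k,w}) / sum_w exp(m_{k,w})."""
    W = m_k.shape[0]
    e = np.exp(m_k - m_k.max())      # subtract the max for numerical stability
    return W * e / e.sum()

m_k = np.random.default_rng(0).normal(size=1000)  # toy mean parameters, W = 1000
g_k = precision_scaling(m_k)
assert np.isclose(g_k.sum(), m_k.shape[0])        # sum_w g_{k,w} = W, as noted above
```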
Putting these three constraints together, the parameterization of the distribution in (1) simplifies dramatically. The distribution can now be represented by K independent V × W-dimensional block-diagonal Gaussians with a V-dimensional block for each term w. Each block defines the distribution for a term in a specific topic across corpora, and is constrained as follows:

\[ \beta_{1:V,1:K,1:W} \;\sim\; \prod_{k=1}^{K} \prod_{w=1}^{W} \mathcal{N}_V\!\big( m_{k,w}\,\mathbf{1}_{1:V},\; g_{k,w}\,\Delta_{1:V,k} \big), \tag{3} \]

where \mathbf{1}_{1:V} denotes a V-dimensional vector of ones.

Finally, in a Markov topic model, structural relations between corpora may restrict the distributions in (3) further. For example, the corpora could be local news stories, and one could have reason to believe that topic parameters evolve smoothly through a geo-spatial structure of neighboring locations. The structure in this way defines the Markov properties that the distribution for β_{1:V,k,w} has to obey, i.e., a GMRF. As an alternative to a priori decided structural constraints, one could also choose to learn a structure via model selection methods, e.g., Meinshausen and Bühlmann (2006).
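As an illustration of such a priori structure, the sketch below builds a positive-definite precision ∆_{1:V,k} from a neighborhood graph over corpora and draws one entry β_{1:V,k,w} from (3). The Laplacian-plus-diagonal construction, the reading of the second argument of (3) as a precision, and all names are illustrative assumptions, not the paper's prescription.

```python
import numpy as np

def gmrf_precision(adjacency, coupling=1.0, diag=1.0):
    """One simple way to obtain a positive-definite GMRF precision over V
    corpora from a neighborhood graph: a scaled graph Laplacian plus a
    diagonal term. Zero off-diagonal entries encode the conditional
    independencies implied by the (e.g., geo-spatial) structure."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    return coupling * laplacian + diag * np.eye(A.shape[0])

def sample_beta_entry(m_kw, g_kw, delta_k, rng):
    """Draw beta_{1:V,k,w} ~ N_V(m_{k,w} 1, g_{k,w} Delta_{1:V,k}), reading the
    second argument as a precision, so the covariance is (g * Delta)^{-1}."""
    V = delta_k.shape[0]
    cov = np.linalg.inv(g_kw * delta_k)
    return rng.multivariate_normal(m_kw * np.ones(V), cov)

# Toy chain of three neighboring corpora (e.g., geographically adjacent locations).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
delta = gmrf_precision(adj)
beta_vw = sample_beta_entry(m_kw=-2.0, g_kw=1.3, delta_k=delta,
                            rng=np.random.default_rng(0))
```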
In some modeling situations, we would like multiple corpora to share a set of common “background” topics. Background topics can be modeled as isolated topics in the GMRF representation. Notice that if all topics in the model are background topics, the model simplifies to a standard LDA model (with logistic normal smoothing of the topic parameters). The generative process of MTMs with B shared background topics is a simple extension of basic MTMs. To generate a document, we follow the same procedure as described in this section, except that for each corpus we now consider K + B topics instead of just the K corpus-specific topics.

3 Inference and Estimation

In this section, we present the approximate inference and parameter estimation for MTMs. The models are learned by the variational EM algorithm, which is described in the following two sections.

3.1 Approximate Inference: E-step

The E-step computes the posterior distribution of the latent topic structure conditioned on the observed documents and the current values for the GMRF parameterization of the topic distributions (defined by m_{1:K,1:W} and ∆_{1:V,1:K}). In an MTM, the latent topic structure comprises the per-document topic proportions at each corpus θ_{v,d}, the per-word topic assignments at each corpus z_{v,d,n}, and the K Markov structures of topics β_{1:V,k,1:W}. The true posterior is not tractable. We appeal to an approximation.

We derive an efficient variational approximation for MTMs. The main idea behind variational methods is to posit a simple family of distributions over the latent variables, indexed by free variational parameters, and to find the member of that family which is closest in Kullback-Leibler divergence to the true posterior. Good overviews of this methodology can be found in Jordan et al. (1999) and Wainwright and Jordan (2003). The fully factorized variational distribution over the latent variables is:

\[ q(\beta, z, \theta \mid \hat{\beta}, \phi, \gamma) \;=\; \prod_{k=1}^{K}\prod_{w=1}^{W} q(\beta_{1:V,k,w} \mid \hat{\beta}_{1:V,k,w}) \;\times\; \prod_{v=1}^{V}\prod_{d=1}^{D_v}\Big( q(\theta_{v,d} \mid \gamma_{v,d}) \prod_{n=1}^{N_{v,d}} q(z_{v,d,n} \mid \phi_{v,d,n}) \Big). \tag{4} \]

The free variational parameters are the Dirichlets γ_{v,d} for the per-document topic proportions, the multinomials φ_{v,d,n} for each word's topic assignment, and the variational parameters β̂_{1:V,k,w} for β_{1:V,k,w}. The updates for the document-level variational parameters θ_{v,d} and z_{v,d,n} follow forms similar to those in Blei et al. (2003); the difference is that we replace the topic distribution parameters with their variational expectations.
\[
\begin{aligned}
\sum_v \sum_w n_{v,w}\, \mathbb{E}_q\!\Big[ \beta_{v,w} - \log \sum_w \exp(\beta_{v,w}) \Big]
&= \sum_v \sum_w n_{v,w}\, \hat{m}_{v,w} \;-\; \sum_v n_v\, \mathbb{E}_q\!\Big[ \log \sum_w \exp(\beta_{v,w}) \Big] \\
&\geq \sum_v \sum_w n_{v,w}\, \hat{m}_{v,w} \;-\; \sum_v n_v \log\!\Big( \sum_w \exp(\hat{m}_{v,w}) + \hat{\Sigma}_{v,v} \Big),
\end{aligned}
\tag{7}
\]

where the count of term w in document w_v is n_{v,w}, n_v = Σ_w n_{v,w}, and \hat{Σ}_{v,v} is the entry (v, v) in the matrix \hat{Σ} = \hat{∆}^{-1}. The last inequality comes from Jensen's inequality.

\[ \mathbb{E}_q[\log p(\beta \mid m, \Delta)] \;=\; \sum_w \mathbb{E}_q[\log p(\beta_{1:V,w} \mid m, \Delta)], \]

\[
\mathcal{L}[\hat{m}_{1:V,w}] \;=\; \sum_v n_{v,w}\, \hat{m}_{v,w} \;-\; \sum_v n_v \log \sum_w \exp(\hat{m}_{v,w}) \;-\; \frac{g_w}{2}\, (\hat{m}_{1:V,w} - m_w \mathbf{1})^{\top} \Delta\, (\hat{m}_{1:V,w} - m_w \mathbf{1}). \tag{13}
\]

By taking the derivative w.r.t. \hat{m}_{1:V,w}, we have

\[
\frac{\partial \mathcal{L}}{\partial \hat{m}_{1:V,w}} \;=\; n_{1:V,w} - \zeta_{1:V,w} - g_w \Delta\, (\hat{m}_{1:V,w} - m_w \mathbf{1}), \tag{14}
\]

where

\[ \zeta_{v,w} \;=\; n_v\, \frac{\exp(\hat{m}_{v,w})}{\sum_w \exp(\hat{m}_{v,w})}. \tag{15} \]
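A small sketch of the gradient in (14)-(15), assuming the shapes noted below (counts per corpus, the variational means for all terms, the per-term scalings g, the precision ∆, and the shared means m); the topic index k is dropped and the choice of optimizer is left open:

```python
import numpy as np

def zeta(n_v, m_hat):
    """zeta_{v,w} = n_v * exp(m_hat_{v,w}) / sum_w exp(m_hat_{v,w})  (Eq. 15)."""
    e = np.exp(m_hat - m_hat.max(axis=1, keepdims=True))   # stabilized softmax
    return n_v[:, None] * e / e.sum(axis=1, keepdims=True)

def grad_m_hat(w, n_vw, n_v, m_hat, g, delta, m):
    """Gradient w.r.t. m_hat_{1:V,w} for a single term w (Eq. 14):
    n_{1:V,w} - zeta_{1:V,w} - g_w * Delta (m_hat_{1:V,w} - m_w 1).
    Shapes: n_vw (V, W) term counts per corpus; n_v (V,) totals;
    m_hat (V, W) variational means; g (W,) scalings; delta (V, V); m (W,)."""
    z = zeta(n_v, m_hat)[:, w]
    return n_vw[:, w] - z - g[w] * delta @ (m_hat[:, w] - m[w])
```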
To derive the derivative with respect to m_w, we first compute

\[
\frac{\partial g_{w'}}{\partial m_w} =
\begin{cases}
g_w\,(1 - g_w / W) & \text{if } w' = w, \\
-\,g_w\, g_{w'} / W & \text{otherwise.}
\end{cases}
\]
By taking the derivative w.r.t. m_w, we have

\[
\frac{\partial \mathcal{L}[m]}{\partial m_w} \;=\; \frac{V}{2}\,(1 - g_w) \;-\; \frac{g_w}{2}\Big( f_w + \frac{1}{W} \sum_{w'} f_{w'} - g_w f_w \Big), \tag{21}
\]
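The closed-form Jacobian of g used in this derivation can be checked numerically; a throwaway sketch with a toy vocabulary size and our own names:

```python
import numpy as np

def g_of_m(m):
    """g_w = W * exp(m_w) / sum_w exp(m_w)  (Eq. 2, topic index dropped)."""
    e = np.exp(m - m.max())
    return m.shape[0] * e / e.sum()

W = 5
m = np.random.default_rng(1).normal(size=W)
g = g_of_m(m)
eps = 1e-6
for w in range(W):
    m_plus = m.copy()
    m_plus[w] += eps
    numeric = (g_of_m(m_plus) - g) / eps
    analytic = -g[w] * g / W                 # off-diagonal case: -g_w g_{w'} / W
    analytic[w] = g[w] * (1.0 - g[w] / W)    # diagonal case
    assert np.allclose(numeric, analytic, atol=1e-4)
```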
We analyzed the abstracts from six international conferences: CIKM⁶, ICML, KDD⁷, NIPS, SIGIR and WWW⁸. These conferences were held between 2005 and 2008. The publications from these conferences cover a wide range of topics related to information processing. For example, CIKM mainly covers “databases”, “information

⁶ ACM Conference on Information and Knowledge Management.
⁷ ACM International Conference on Knowledge Discovery & Data Mining.
⁸ International World Wide Web Conference.
CONF.   YEARS   #DOCS   #WORDS   AVG. WORDS
CIKM    05-07   410     27609    67.3
ICML    06-08   447     28419    63.6
KDD     06-08   374     29179    78.0

The vocabulary size is 3733. Year: the years when the conferences were held; #Docs: the total number of documents.

[Figure: per-word predictive perplexity as a function of the number of topics (0-60) for LDA, LDA-idv, MTM and MTM-bg, with one panel per conference; panels include CIKM, ICML, SIGIR and WWW.]
MTM and MTM-bg achieve the best perplexity around K = 10, and LDA achieves its best perplexity around K = 20. Most importantly, modeling interrelated local corpora with MTM and MTM-bg outperforms standard LDA and the individual LDA models, with MTM achieving the overall lowest predictive perplexity for this application.

All three models begin to overfit after K = 20. As K increases, the overfitting effect of MTM and MTM-bg is worse than for LDA. There is a natural explanation for this fact. In MTM and MTM-bg, each corpus (modeled by a local LDA model) has K topics, and these topics are tied to the topics from other corpora. Therefore, the “effective” number of topics for MTM or MTM-bg is larger than K, though smaller than KV. For each individual corpus, Figure 2 shows similar results. (Note that for different corpora, the numbers of topics giving the best performance are not the same. How to discover the right number of topics for each corpus under the MTM framework is a question for future work.)

Observe that MTM-bg always has higher perplexity than MTM, indicating that the background topic is not of use in this data. We do not expect this finding to carry over to different types of documents, but rather attribute it to the fact that we have been analyzing abstracts, where the writing style is quite constrained. In an abstract, authors can use only a few concise sentences to outline the whole paper, and these sentences must therefore be very relevant to the main content. It is unlikely that all abstracts would share the same background information.
would share the same background information.
independent topics. Methods such as these do not provide
a mechanism for modeling topic relations among multiple
4.3 Qualitative: Topic Pattern Discovery corpora, as we have developed here for MTM.
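To make the precision-matrix remark above concrete, restricting dependence to the time line over consecutive time-slice corpora corresponds to a tridiagonal (chain-graph) precision. A small illustrative sketch; the Laplacian-plus-diagonal form is our assumption, not a construction taken from the DTM papers:

```python
import numpy as np

def time_chain_precision(T, coupling=1.0, diag=1e-3):
    """Precision over T consecutive time-slice corpora that only allows
    dependence along the time line: neighbors are adjacent slices, so all
    entries beyond the first off-diagonal are zero."""
    A = np.zeros((T, T))
    idx = np.arange(T - 1)
    A[idx, idx + 1] = A[idx + 1, idx] = 1.0
    laplacian = np.diag(A.sum(axis=1)) - A
    return coupling * laplacian + diag * np.eye(T)

print(time_chain_precision(5))   # tridiagonal: only adjacent slices interact
```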
topic: clustering

CIKM         ICML          KDD          NIPS         SIGIR         WWW
clustering   clustering    clustering   clustering   clustering    spam
data         graph         data         graph        semantic      clustering
similarity   data          mining       similarity   similarity    similarity
algorithms   kernels       patterns     data         filtering     mining
algorithm    constraints   algorithm    cluster      based         detection
patterns     relational    frequent     clusters     document      algorithms
time         based         algorithms   algorithms   cluster       extraction
mining       similarity    clusters     matching     information   based
method       pairwise      set          spectral     spam          data
set          cluster       cluster      kernels      clusters      web
series       spectral      graph        shape        algorithm     patterns
based        algorithms    pattern      set          items         existing