
Markov Topic Models

Chong Wang∗ (Computer Science Dept., Princeton University, Princeton, NJ 08540)
Bo Thiesson, Christopher Meek (Microsoft Research, One Microsoft Way, Redmond, WA 98052)
David Blei (Computer Science Dept., Princeton University, Princeton, NJ 08540)

Abstract

We develop Markov topic models (MTMs), a novel family of generative probabilistic models that can learn topics simultaneously from multiple corpora, such as papers from different conferences. We apply Gaussian (Markov) random fields to model the correlations of different corpora. MTMs capture both the internal topic structure within each corpus and the relationships between topics across the corpora. We derive an efficient estimation procedure with variational expectation-maximization. We study the performance of our models on a corpus of abstracts from six different computer science conferences. Our analysis reveals qualitative discoveries that are not possible with traditional topic models, and improved quantitative performance over the state of the art.

∗ Part of this work was done when Chong Wang was an intern at Microsoft Research.

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

1 Introduction

Algorithmic tools for analyzing, indexing and managing large collections of online documents are becoming increasingly important. In recent years, algorithms based on topic models have become a widely used approach for exploratory and predictive analysis of text. Topic models, such as latent Dirichlet allocation (LDA) (Blei et al. 2003) and the more general discrete component analysis (Buntine 2004), are hierarchical Bayesian models of discrete data that use "topics", i.e., patterns of word use, to explain an observed document collection. Probabilistic topic models have been extended and applied to a variety of applications, including collaborative filtering (Marlin 2003), authorship (Rosen-Zvi et al. 2004), computer vision (Fei-Fei and Perona 2005), web blogs (Mei et al. 2006) and information retrieval (Wei and Croft 2006). For a good review, see Griffiths and Steyvers (2006).

Most previous topic models assume that the documents are part of a single corpus, and are exchangeable within it. For many text analysis problems, however, this assumption is not appropriate. For example, papers from different scientific conferences and journals can be viewed as a collection of multiple corpora, related to each other inasmuch as they discuss similar scientific themes. Articles from newspapers and blogs can also be viewed as multiple corpora, again related to each other in the overlap of their content.

In this paper we study the problem of modeling documents from different corpora, respecting the boundaries of the collections but accounting for and estimating the similarities among their content. Our intuition is that although documents across different corpora should not be assumed exchangeable, they may show different degrees of relationship. As an example, consider papers from multiple computer science conferences. In general, papers from ICML^1 are more likely to be similar to those in NIPS^2 than to those in SIGIR^3. However, some papers in ICML and SIGIR, specifically those dealing with text processing and information retrieval, can be very similar as well. As the different topics can be considered high-level semantic summarizations of a corpus from different aspects, our goal is to discover the relations among multiple corpora at the topic level. Thus, the models are able to discover how ICML, SIGIR, and NIPS are correlated, rather than simply saying that ICML and NIPS are more similar.

^1 ICML: International Conference on Machine Learning.
^2 NIPS: Neural Information Processing Systems.
^3 SIGIR: International Conference on Research and Development in Information Retrieval.

We introduce Markov topic models (MTMs), which use Gaussian Markov random fields (GMRFs) to model the topic relationships among multiple corpora. The models not only capture internal topic structures within each corpus, but also discover the relations between the topics across multiple corpora. Moreover, our approach provides a natural way for smoothing topic parameters using information from multiple collections.

We explain MTMs in detail in Section 2. In Section 3, we present an efficient variational EM algorithm for model learning. In Section 4, we present quantitative and qualitative results on an analysis of the abstracts from different computer science conferences. Our analysis reveals qualitative discoveries that are not possible with traditional topic models, and improved quantitative performance over the state of the art.

2 Markov Topic Models

The class of MTMs is an extension of LDA-based topic models, where we apply a Markovian framework to the topic parameters for different corpora. Figure 1 shows a graphical representation of a Markov topic model with four corpora. The topic parameters β_1, ..., β_4 are vertices in a Markov random field that governs the relations between corpora, each modeled by an LDA topic model. The standard topic model, for one single corpus, and individual topic models, without any relations between corpora, are both special cases that we will consider in our empirical evaluation.

Figure 1: Graphical model for MTMs with multiple corpora. The left part illustrates high-level relations of topics among multiple corpora and the right part illustrates the local LDA model associated with each corpus.

Before describing how an MTM addresses multiple corpora, we describe the standard topic modeling assumptions made for each. We assume that all V corpora cover the same set of W terms (this is accomplished by considering the union of terms across corpora). We will also assume that all corpora contain the same number of topics K. Following Blei et al. (2003), each document in a corpus is represented as a random mixture over latent topics, where each topic is characterized by a Multinomial distribution over the terms. Let β_{v,k,1:W} = [β_{v,k,1}, ..., β_{v,k,W}]^T be the W-dimensional vector of parameters for topic k, 1 ≤ k ≤ K, in corpus v, 1 ≤ v ≤ V.^4

^4 We use subscripts to indicate a particular value for a dimension (e.g. for a corpus, topic, or term) and colon notation (e.g. 1:W) to denote a range of values. We use various combinations of subscripts and ranges to denote relevant sets of parameters.

Given the (marginal) distributions over terms β_{v,1:K,1:W} for the K topics at corpus v, the generative process for that corpus is defined by a local LDA model as follows. For each document d, 1 ≤ d ≤ D_v, in corpus v:

1. Draw θ_{v,d} ∼ Dir(α_v).
2. For each word w_{v,d,n} in the document d:
   (a) Draw z_{v,d,n} ∼ Mult(θ_{v,d}).
   (b) Draw w_{v,d,n} ∼ Mult(β_{v,z_{v,d,n},1:W}).

Note that α_v and θ_{v,d} are both K-length vectors.^5

^5 We don't write α_v and θ_{v,d} as α_{v,1:K} and θ_{v,d,1:K} explicitly, since we don't access α_{v,k} and θ_{v,d,k}, 1 ≤ k ≤ K, individually in the rest of the paper.

We now turn to the topic distributions, where our goal is to statistically tie these parameters across corpora. The standard representation of a Multinomial distribution is by its mean parameters, with uncertainty about these parameters represented by the conjugate Dirichlet distribution (Griffiths and Steyvers 2006). We instead represent a Multinomial topic-parameter distribution by its natural parameters in the exponential family representation, and we model uncertainty about this parameter by a Gaussian (Aitchison 1982). The w'th mean parameter of the W-dimensional multinomial is denoted π_w. The w'th natural parameter is the mapping β_w = log(π_w / π_W), and the reverse mapping is π_w = exp(β_w) / Σ_{w'} exp(β_{w'}).
In an MTM, the (marginal) topic parameters associated with local LDA models for different corpora are related, as the graphical structure in Figure 1 suggests. We are therefore considering a huge V × K × W dimensional joint Gaussian over all topic parameters in the model, with mean m and precision ∆ (the corresponding covariance matrix is Σ = ∆^{-1}). That is,

    β_{1:V,1:K,1:W} ∼ N_{V×K×W}(m, ∆).    (1)

We apply several constraints to this Gaussian. First, we assume that the per-term parameters across the K topics are mutually independent, as is standard for topic models.

Second, a topic is characterized by the terms with high probabilities in the topic distribution over terms, and different topics will typically focus on different terms. Given a particular topic, we tie the mean for a particular term to the same value across corpora. That is, m_{v,k,w} = m_{k,w} for all v, 1 ≤ v ≤ V. This constraint ensures that topics vary smoothly and consistently across corpora.

Third, for simplicity, we assume that topic parameters associated with different terms are mutually independent. In other words, the probability for a particular term in a topic is only directly affected by the probabilities for the same term across corpora. With this additional constraint, the precision matrix ∆_{1:V,1:K,1:W} is block-diagonal with blocks ∆_{1:V,k,w}, 1 ≤ k ≤ K and 1 ≤ w ≤ W.

We further experimented with tying the blocks of precision parameters across words to the same value, that is, ∆_{1:V,k,w} = ∆_{1:V,k} for all w. We found that this constraint is too simplistic for the problem at hand. In this case, the precisions associated with the many terms with low probabilities (which in fact do not characterize the topics) overwhelmed the estimation of the tied precisions. Terms with higher topic parameter values are more important to a topic. In order to ensure dominance of the topic parameters for the characteristic terms, we instead scale the tying by the weight of the expected number of observations for each term. That is, the block of precision parameters associated with term w is scaled by the factor

    g_{k,w} = W exp(m_{k,w}) / Σ_{w'} exp(m_{k,w'}).    (2)

Note that Σ_w g_{k,w} = W. If we set g_{k,w} ≡ 1, we return to the unscaled model.

Putting these three constraints together, the parameterization of the distribution in (1) simplifies dramatically. The distribution can now be represented by K independent V × W-dimensional block-diagonal Gaussians with a V-dimensional block for each term w. Each block defines the distribution for a term in a specific topic across corpora, and is constrained as follows,

    β_{1:V,1:K,1:W} ∼ ∏_{k=1}^{K} ∏_{w=1}^{W} N_V( m_{k,w} 1_{1:V}, g_{k,w} ∆_{1:V,k} ),    (3)

where 1_{1:V} denotes a V-dimensional vector of ones.
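As a concrete reading of (2) and (3), the sketch below builds the scaling weights g_{k,w} from the tied means and draws each V-dimensional block β_{1:V,k,w} from a Gaussian with mean m_{k,w}·1 and precision g_{k,w}·∆_{1:V,k}. It is a toy illustration under our own shape conventions, not the authors' implementation.

```python
import numpy as np

def scaling_weights(m):
    """g_{k,w} = W * exp(m_{k,w}) / sum_w' exp(m_{k,w'}); each row sums to W.
    m : (K, W) tied topic means."""
    K, W = m.shape
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return W * e / e.sum(axis=1, keepdims=True)

def sample_topic_parameters(m, Delta, rng=None):
    """Draw beta_{1:V,1:K,1:W} from the prior in Equation (3).

    m     : (K, W)    tied means m_{k,w}
    Delta : (K, V, V) per-topic precision blocks Delta_{1:V,k}
    returns beta with shape (V, K, W)
    """
    if rng is None:
        rng = np.random.default_rng()
    K, W = m.shape
    V = Delta.shape[-1]
    g = scaling_weights(m)
    beta = np.empty((V, K, W))
    for k in range(K):
        for w in range(W):
            cov = np.linalg.inv(g[k, w] * Delta[k])     # covariance of this V-dim block
            beta[:, k, w] = rng.multivariate_normal(m[k, w] * np.ones(V), cov)
    return beta
```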
Finally, in a Markov topic model, structural relations between corpora may restrict the distributions in (3) further. For example, the corpora could be local news stories, and one could have reason to believe that topic parameters evolve smoothly through a geo-spatial structure of neighboring locations. The structure in this way defines the Markov properties that the distribution for β_{1:V,k,w} has to obey, i.e., a GMRF. As an alternative to a priori decided structural constraints, one could also choose to learn a structure via model selection methods, e.g., Meinshausen and Bühlmann (2006).

In some modeling situations, we would like multiple corpora to share a set of common "background" topics. Background topics can be modeled as isolated topics in the GMRF representation. Notice that if all topics in the model are background topics, the model simplifies to a standard LDA model (with logistic normal smoothing of the topic parameters). The generative process of MTMs with B shared background topics is a simple extension of basic MTMs. To generate a document, we follow the same procedure as described in this section, except that we will now for each corpus consider K + B topics instead of just the K corpus-specific topics.

3 Inference and Estimation

In this section, we present the approximate inference and parameter estimation for MTMs. The models are learned by the variational EM algorithm, which is described in the following two sections.

3.1 Approximate Inference: E-step

The E-step computes the posterior distribution of the latent topic structure conditioned on the observed documents and the current values for the GMRF parameterization of the topic distributions (defined by m_{1:K,1:W} and ∆_{1:V,1:K}). In an MTM, the latent topic structure comprises the per-document topic proportions at each corpus θ_{v,d}, the per-word topic assignments at each corpus z_{v,d,n}, and the K Markov structures of topics β_{1:V,k,1:W}. The true posterior is not tractable. We appeal to an approximation.

We derive an efficient variational approximation for MTMs. The main idea behind variational methods is to posit a simple family of distributions over the latent variables, indexed by free variational parameters, and to find the member of that family which is closest in Kullback-Leibler divergence to the true posterior. Good overviews of this methodology can be found in Jordan et al. (1999) and Wainwright and Jordan (2003). The fully-factorized variational distribution over the latent variables is:

    q(β, z, θ | β̂, φ, γ) = ∏_{k=1}^{K} ∏_{w=1}^{W} q(β_{1:V,k,w} | β̂_{1:V,k,w}) × ∏_{v=1}^{V} ∏_{d=1}^{D_v} ( q(θ_{v,d} | γ_{v,d}) ∏_{n=1}^{N_{v,d}} q(z_{v,d,n} | φ_{v,d,n}) ).    (4)

The free variational parameters are the Dirichlets γ_{v,d} for the per-document topic proportions, the multinomials φ_{v,d,n} for each word's topic assignment, and the variational parameters β̂_{1:V,k,w} for β_{1:V,k,w}. The updates for the document-level variational parameters, γ_{v,d} for θ_{v,d} and φ_{v,d,n} for z_{v,d,n}, follow similar forms to those in Blei et al. (2003), where the difference is that we replace the topic distribution parameters with their variational expectations.

We now turn to variational inference for the topic distributions. For clarity of presentation, we focus on a model with only one topic and assume that each corpus has only one document. These calculations are simpler versions of those we need for the full model, but exhibit the essential features of the algorithm. Generalization to the full model is straightforward.

In this case, we only need to compute q(β | β̂). Note that we drop some indices to make the following part easier to follow. Specifically, we don't need the subscripts k and d anymore. Further simplifying notation, we use ∆ (∆̂) to represent ∆_{1:V} (∆̂_{1:V}).

We use the following variational posterior, for term w,

    q(β_{1:V,w} | β̂_{1:V,w}) = ϕ_V( m̂_{1:V,w}, ∆̂ ),    (5)

where β̂_{1:V,w} = {m̂_{1:V,w}, ∆̂} and ϕ_V(m̂_{1:V,w}, ∆̂) indicates the Gaussian density with mean m̂_{1:V,w} and precision ∆̂. Unlike m_w, which is the same for all corpora, the m̂_{1:V,w} are free parameters to fit. ∆̂ is chosen to be the same for all terms, which is required for numerical stability. We will see that ∆̂ is able to preserve the structure of the GMRF if ∆ represents a non-dense GMRF.

Recall that, for simplicity, we assume each corpus has only one document. Let w_v be the observed document for corpus v. With the variational posterior distributions (5) in hand, we turn to the details of posterior inference. Minimizing the KL divergence is equivalent to tightening the bound on the likelihood of the observations given by Jensen's inequality (Jordan et al. 1999),

    log p(w_{1:V} | m, ∆) ≥ E_q[log p(w_{1:V} | β)] + E_q[log p(β | m, ∆)] + H(q) = L(m̂, ∆̂; m, ∆),    (6)

where H(q) is the entropy of the variational distribution. Now we expand the right side of Equation 6 term by term,

    E_q[log p(w_{1:V} | β)] = Σ_v E_q[log p(w_v | β_{v,1:W})]
        = Σ_v Σ_w n_{v,w} E_q[ β_{v,w} − log Σ_{w'} exp(β_{v,w'}) ]
        ≥ Σ_v Σ_w n_{v,w} m̂_{v,w} − Σ_v n_v ( log Σ_w exp(m̂_{v,w}) + Σ̂_{v,v}/2 ),    (7)

where the count of term w in document w_v is n_{v,w}, n_v = Σ_w n_{v,w}, and Σ̂_{v,v} is the entry (v, v) in the matrix Σ̂ = ∆̂^{-1}. The last inequality comes from Jensen's inequality. For the second term, E_q[log p(β | m, ∆)] = Σ_w E_q[log p(β_{1:V,w} | m, ∆)], where

    E_q[log p(β_{1:V,w} | m, ∆)] = −(V/2) log 2π + (V/2) log g_w + (1/2) log |∆|
        − (g_w/2) Tr(∆ Σ̂) − (g_w/2) (m̂_{1:V,w} − m_w 1)^T ∆ (m̂_{1:V,w} − m_w 1).    (8)

The entropy of the variational distribution is

    H(q) = (VW/2) log 2π − (W/2) log |∆̂| + VW/2.    (9)

Now we proceed to compute the required derivatives for ∆̂ and m̂_{1:V,w}. First, we isolate the terms that contain ∆̂,

    L[∆̂] = −(1/2) Σ_v n_v Σ̂_{v,v} − (W/2) log |∆̂| − (W/2) Tr(∆ Σ̂)
         = −(W/2) ( log |∆̂| + Tr( (∆ + diag(n)/W) Σ̂ ) ),    (10)

where we have used Σ_w g_w = W in the first "=" and n = [n_1, n_2, ..., n_V]. The optimal value of ∆̂ is obtained by:

    ∆̂ = ∆ + diag(n)/W,    (11)

where we use the following Equation 12:

    log |X| + Tr(X^{-1} A) ≥ log |A| + d,    (12)

where both X and A are d × d positive definite matrices and the equality holds if and only if X = A. Equation 11 means that to obtain ∆̂, one only needs to add a diagonal matrix diag(n)/W to ∆. Then if ∆ is sparse, ∆̂ preserves the sparsity. Recall that n_v is the total count of all terms in corpus v. Then if n_v becomes larger (∆̂_{v,v} becomes larger and Σ̂_{v,v} becomes smaller), the marginal variational distribution q(β_{v,w}) tends to peak at m̂_{v,w}.

Numerical approaches, such as L-BFGS (Liu and Nocedal 1989), can be used to estimate m̂_{1:V,w}. After isolating the terms that contain m̂_{1:V,w}, we have

    L[m̂_{1:V,w}] = Σ_v Σ_w n_{v,w} m̂_{v,w} − Σ_v n_v log( Σ_w exp(m̂_{v,w}) )
        − (g_w/2) (m̂_{1:V,w} − m_w 1)^T ∆ (m̂_{1:V,w} − m_w 1).    (13)

By taking the derivative w.r.t. m̂_{1:V,w}, we have

    ∂L/∂m̂_{1:V,w} = n_{1:V,w} − ζ_{1:V,w} − g_w ∆ (m̂_{1:V,w} − m_w 1),    (14)

where

    ζ_{v,w} = n_v exp(m̂_{v,w}) / Σ_{w'} exp(m̂_{v,w'}).    (15)
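These two updates translate directly into code: ∆̂ is a one-line correction of ∆ (Equation 11), and m̂_{1:V,w} can be fit with any gradient-based optimizer using Equation 14. Below is a small sketch of the single-topic case, using plain gradient ascent in place of L-BFGS; shapes and names are our own assumptions.

```python
import numpy as np

def update_Delta_hat(Delta, n, W):
    """Equation (11): Delta_hat = Delta + diag(n)/W, with n[v] the token count of corpus v."""
    return Delta + np.diag(n) / W

def grad_m_hat(m_hat, m, Delta, counts, g):
    """Gradient of the bound w.r.t. m_hat (Equations (14)-(15)).

    m_hat  : (V, W) variational means
    m      : (W,)   tied prior means m_w
    Delta  : (V, V) model precision
    counts : (V, W) term counts n_{v,w}
    g      : (W,)   scaling weights g_w
    """
    n_v = counts.sum(axis=1, keepdims=True)                   # (V, 1)
    e = np.exp(m_hat - m_hat.max(axis=1, keepdims=True))
    zeta = n_v * e / e.sum(axis=1, keepdims=True)             # Equation (15)
    prior_term = g[None, :] * (Delta @ (m_hat - m[None, :]))  # g_w * Delta (m_hat - m_w 1)
    return counts - zeta - prior_term

def fit_m_hat(m, Delta, counts, g, lr=1e-3, n_steps=500):
    """Plain gradient ascent on m_hat; the paper uses L-BFGS instead."""
    m_hat = np.tile(m, (Delta.shape[0], 1))                   # start at the prior mean
    for _ in range(n_steps):
        m_hat += lr * grad_m_hat(m_hat, m, Delta, counts, g)
    return m_hat
```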

3.2 Parameter estimation: M-step

Parameter estimation is done in the M-step, which maximizes the lower bound on the log likelihood of the data obtained by the variational approximation in Section 3.1. In other words, the variational E-step computes the variational posterior q(β, z, θ) given the current settings of the model parameters m_{1:K,1:W} and ∆_{1:V,1:K}, and the M-step then finds the maximum likelihood estimate of these model parameters. The variational EM algorithm alternates between the two steps until the lower bound converges.

Recall that we consider a single-topic model here. Let Σ = ∆^{-1}. First, isolating the terms that contain ∆ from (6), we have

    L[∆] = (W/2) ( log |∆| − Tr(∆ ∆̂^{-1}) − Tr(∆ M) )
         = (W/2) ( log |∆| − Tr( ∆ (∆̂^{-1} + M) ) )
         = −(W/2) ( log |Σ| + Tr( Σ^{-1} (∆̂^{-1} + M) ) ),    (16)

where

    M = (1/W) Σ_w g_w (m̂_{1:V,w} − m_w 1)(m̂_{1:V,w} − m_w 1)^T.    (17)

Applying Equation 12 to Equation 16, we obtain the optimal value of ∆ as:

    ∆^{-1} = Σ = ∆̂^{-1} + M.    (18)

Clearly, M is a combination of the per-term contributions, weighted by the relative importance of the terms, g_w. According to the form of ∆̂ in Equation 11, ∆ is thus determined by M together with the counts of all terms (or expected counts for K-topic models) in each corpus.
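For a dense GMRF the M-step for the precision is therefore closed form: accumulate the weighted scatter matrix M of Equation 17 and invert ∆̂^{-1} + M as in Equation 18. A small sketch under our own shapes and names:

```python
import numpy as np

def m_step_precision(m_hat, m, Delta_hat, g):
    """Closed-form M-step for a dense GMRF (Equations (17)-(18)).

    m_hat     : (V, W) variational means of the topic parameters
    m         : (W,)   tied means m_w
    Delta_hat : (V, V) variational precision
    g         : (W,)   scaling weights g_w
    returns the new model precision Delta
    """
    V, W = m_hat.shape
    dev = m_hat - m[None, :]                  # column w holds m_hat_{1:V,w} - m_w 1
    M = (g[None, :] * dev) @ dev.T / W        # (1/W) sum_w g_w dev_w dev_w^T
    Sigma = np.linalg.inv(Delta_hat) + M      # Equation (18)
    return np.linalg.inv(Sigma)
```

In the full model this is done per topic k; the tied means m_{k,w} themselves are updated with the gradient in Equation 21 below.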
Second, isolating the terms that contain m from (6), we have

    L[m] = (V/2) Σ_w log g_w − (1/2) Σ_w g_w f_w,    (19)

where

    f_w = (m̂_{1:V,w} − m_w 1)^T ∆ (m̂_{1:V,w} − m_w 1).    (20)

To derive the derivative w.r.t. m_w, we first compute

    ∂g_{w'}/∂m_w = g_w (1 − g_w/W)   if w' = w,
    ∂g_{w'}/∂m_w = −g_w g_{w'}/W     otherwise.

By taking the derivative w.r.t. m_w, we have

    ∂L[m]/∂m_w = (V/2)(1 − g_w) − (g_w/2) ( f_w + f'_w − (1/W) Σ_{w'} g_{w'} f_{w'} ),    (21)

where f'_w = ∂f_w/∂m_w, a linear function of m_w.

What if ∆ is sparse? If ∆ is sparse, i.e., ∆ represents a non-dense GMRF, it becomes difficult to obtain an analytical solution like Equation 18. We then choose to use iterative proportional fitting (IPF) (Ruschendorf 1995). We outline the procedure as follows. Let S = ∆̂^{-1} + M. Recall that L[∆] can be written as

    L[∆] = (W/2) ( log |∆| − Tr( ∆ (∆̂^{-1} + M) ) )
         = (W/2) ( log |∆| − Tr(∆ S) ).    (22)

Viewing S as the sufficient statistics for the Gaussian distribution N(0, ∆), this optimization falls in the IPF framework. Let G be the graph that ∆ represents and C be the collection of cliques of G. For a ∈ C, a^c (the complement of a) contains all the other vertices in G. Define ∆_{ab} = {∆_{i,j}}_{(i,j) ∈ a×b}, a, b ∈ C, and S_{ab} = {S_{i,j}}_{(i,j) ∈ a×b}, a, b ∈ C. Algorithm 1 computes the optimal ∆ for Equation 22.

Algorithm 1: IPF algorithm for ∆
  Input: S, C and an initial guess ∆_0
  Output: the optimal ∆_opt
  repeat
    for a ∈ C do
      ∆_{aa} ← S_{aa}^{-1} + ∆_{a,a^c} ∆_{a^c,a^c}^{-1} ∆_{a^c,a}
    end for
  until converged
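A direct reading of Algorithm 1 in code, for the case where the cliques are given as index lists over the V corpora. This is a sketch under our own conventions (dense numpy arrays and a fixed sweep budget instead of a convergence test), not the authors' implementation.

```python
import numpy as np

def ipf_precision(S, cliques, Delta0, n_sweeps=100):
    """Iterative proportional fitting for maximizing log|Delta| - Tr(Delta S)
    subject to the sparsity pattern implied by the cliques (Algorithm 1).

    S       : (V, V) sufficient statistics, S = Delta_hat^{-1} + M
    cliques : list of index arrays, the cliques of the GMRF graph
    Delta0  : (V, V) initial guess respecting the sparsity pattern
    """
    V = S.shape[0]
    Delta = Delta0.copy()
    for _ in range(n_sweeps):
        for a in cliques:
            ac = np.setdiff1d(np.arange(V), a)          # complement of the clique
            S_aa = S[np.ix_(a, a)]
            if ac.size:
                D_aac = Delta[np.ix_(a, ac)]
                D_acac = Delta[np.ix_(ac, ac)]
                # Delta_aa <- S_aa^{-1} + Delta_{a,ac} Delta_{ac,ac}^{-1} Delta_{ac,a}
                correction = D_aac @ np.linalg.solve(D_acac, D_aac.T)
            else:
                correction = 0.0
            Delta[np.ix_(a, a)] = np.linalg.inv(S_aa) + correction
    return Delta
```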
4 Experimental Results

In this section, we demonstrate the use of MTMs on a multi-corpora dataset constructed from several international conferences held in the last few years. We report predictive perplexity, compared to LDA models, and interesting topical patterns. The Dirichlet parameter α is fixed to be a symmetric prior (2.0) for every model, and we use a dense GMRF in the MTM.

4.1 Multi-corpora Dataset

We analyzed the abstracts from six international conferences: CIKM^6, ICML, KDD^7, NIPS, SIGIR and WWW^8. These conferences were held between 2005 and 2008. The publications from these conferences cover a wide range of topics related to information processing. For example, CIKM mainly covers "databases", "information retrieval" and "knowledge management", while SIGIR focuses on all aspects of "information retrieval". WWW covers all aspects of the World Wide Web, also including "web information retrieval". We expect that these conferences are correlated in some sense. For example, artificial intelligence and machine learning techniques are studied and used in all of these areas, but in many different ways.

^6 CIKM: ACM Conference on Information and Knowledge Management.
^7 KDD: ACM International Conference on Knowledge Discovery & Data Mining.
^8 WWW: International World Wide Web Conference.

Abstracts from the same conference form a corpus. After pruning the vocabulary by removing the functional terms and the terms that occurred less than 5 times or in less than 3 documents, the entire dataset contains 170K words split among the 6 corpora. The vocabulary size is 3733. Table 1 shows some statistical information about these corpora.

    Conf.   Years   #Docs   #Words   Avg.Words
    CIKM    05-07   410     27609    67.3
    ICML    06-08   447     28419    63.6
    KDD     06-08   374     29179    78.0
    NIPS    07-08   355     25031    70.5
    SIGIR   06-08   573     34607    60.4
    WWW     07-08   439     27718    63.1
    TOTAL   05-08   2598    172563   66.4

Table 1: Information about the multi-corpora dataset. The vocabulary size is 3733. Years: the years when the conferences were held; #Docs: the total number of documents (abstracts of papers or posters); #Words: the total number of words; Avg.Words: the average number of words in a document.

Figure 2: Per-word predictive perplexity comparison (LDA, LDA-idv, MTM and MTM-bg; per-word predictive perplexity vs. number of topics). MTM and MTM-bg achieve their best performances when K is around 10, while LDA achieves its best performance when K is around 20. MTM gives the lowest predictive perplexity, around K = 10.

Figure 3: Per-word predictive perplexity comparison for each corpus (CIKM, ICML, KDD, NIPS, SIGIR and WWW; standard errors are not shown). As we can see, MTM generally gives the best performance for all the corpora.

4.2 Quantitative: Predictive Perplexity

In our quantitative evaluation, we compare the following models: a standard LDA model over all corpora (LDA), individual LDA models for each corpus (LDA-idv)^9, the basic MTM (MTM), and an extension of the MTM with one background topic (MTM-bg). We use 5-fold cross validation for the evaluation. In each fold, 80% of the documents from each of the six conferences are chosen as the training set and the remaining 20% is used as the testing set. We compute the per-word predictive perplexity over a test dataset D_test as our test criterion. This perplexity is defined as

    perplexity_pw = exp{ − Σ_{d ∈ D_test} log p(w_d | β) / Σ_{d ∈ D_test} N_d },    (23)

where β denotes all the estimated topic parameters in a model.

^9 We achieve this by removing all the edges in the GMRF in the MTM.

For LDA, we use variational inference to approximate log p(w_d) with a lower bound (Blei et al. 2003). The situation is slightly different for the local LDA models in the MTM and MTM-bg. For these local models, we in fact learn variational posterior distributions for the topic parameters (see Section 3.2), and we instead use the mean values as the estimated parameterization. To be clear, for corpus v, the topic parameter for the kth topic is estimated by β̃_{v,k,w} ≈ exp(m̂_{v,k,w}) / Σ_{w'} exp(m̂_{v,k,w'}). The perplexity computation now proceeds as for a standard LDA model, except that we pick the estimated parameterization according to the corpus of each document.
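The sketch below shows this evaluation step: converting the variational means into per-corpus topic distributions and accumulating Equation 23 over held-out documents. The document log-likelihood log p(w_d | β) would itself come from a variational lower bound as in Blei et al. (2003); here it is left as a function argument, and all names are our own.

```python
import numpy as np

def estimated_topics(m_hat_v):
    """beta_tilde_{v,k,w} ~ exp(m_hat_{v,k,w}) / sum_w' exp(m_hat_{v,k,w'})."""
    e = np.exp(m_hat_v - m_hat_v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def per_word_perplexity(test_docs, log_likelihood_fn):
    """Equation (23): exp( - sum_d log p(w_d | beta) / sum_d N_d ).

    test_docs         : list of (word_ids, corpus_id) pairs
    log_likelihood_fn : callable returning (a bound on) log p(w_d | beta) for a
                        document, given its word ids and corpus id
    """
    total_ll, total_words = 0.0, 0
    for word_ids, corpus_id in test_docs:
        total_ll += log_likelihood_fn(word_ids, corpus_id)
        total_words += len(word_ids)
    return np.exp(-total_ll / total_words)
```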

We studied the performance of the models for a wide range of numbers of topics: K = 3, 5, 10, 20, 30, 40, 50, 60. Figure 2 shows the overall performance and Figure 3 shows the performance over each corpus. (Note that lower perplexity is better.) We see that MTM and MTM-bg achieve the best perplexity around K = 10, and LDA achieves its best perplexity around K = 20. Most importantly, modeling interrelated local corpora with MTM and MTM-bg outperforms standard LDA and the individual LDA models, with MTM achieving the overall lowest predictive perplexity for this application.

All three models begin to overfit after K = 20. As K increases, the overfitting effect of MTM and MTM-bg is worse than for LDA. There is a natural explanation for this fact. In MTM and MTM-bg, each corpus (modeled by a local LDA model) has K topics, and these topics are tied to the topics from other corpora. Therefore, the "effective" number of topics for MTM or MTM-bg is larger than K, though smaller than KV. For each individual corpus, from Figure 3, we can see similar results. (Note that for different corpora, the numbers of topics for the best performance are not the same. How to discover the right number of topics for each corpus under the MTM framework is a question for future work.)

Observe that MTM-bg always has higher perplexity than MTM, indicating that the background topic is not of use in this data. We do not expect this finding to carry over to different types of documents, but rather attribute it to the fact that we have been analyzing abstracts, where the writing style is quite constrained. In abstracts, people are only allowed to use a few concise sentences to outline the whole paper, and these sentences must therefore be very relevant to the main content. It is unlikely that all abstracts would share the same background information.

4.3 Qualitative: Topic Pattern Discovery

The analysis in this section is based on the 10-topic MTM of the previous section. In Figure 4, we visualize the correlation coefficients (rescaled) for two topics, using the covariance matrices from the variational posterior distributions. The whiter the square is, the more correlated the two conferences are on this topic. Figures 4(a) and 4(b) correspond to Table 2 and Table 3, where we visualize the topics using their top 12 terms due to the limited space.

Figure 4: Correlation coefficient analysis (rescaled), shown as conference-by-conference matrices over CIKM, ICML, KDD, NIPS, SIGIR and WWW. (a) The correlation coefficient analysis of the topic in Table 2. (b) The correlation coefficient analysis of the topic in Table 3.

In Figure 4(a), the topic is about clustering: almost all the conferences have "clustering, data, similarity" among their top 12 terms. However, different conferences may have different aspects of this clustering topic. Among these, for example, we see that ICML and NIPS are highly correlated; they also share "graph, kernels, spectral", while CIKM and WWW are also quite correlated on "pattern, mining". Another example is shown in Table 3, where the topic is about learning & classification. ICML and NIPS are mainly on the theoretical side (NIPS also has image classification papers though), while CIKM, SIGIR and WWW are on the application side. KDD seems right in the middle.
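A small sketch of how such a correlation map can be produced from the variational posterior: take the V × V covariance block for a topic, convert it to correlation coefficients, and rescale for display. The rescaling convention below is our own assumption; the paper only states that the coefficients are rescaled.

```python
import numpy as np

def correlation_map(Sigma_hat):
    """Convert a V x V posterior covariance block into correlation coefficients,
    rescaled to [0, 1] for plotting (whiter = more correlated)."""
    std = np.sqrt(np.diag(Sigma_hat))
    corr = Sigma_hat / np.outer(std, std)   # correlation coefficients in [-1, 1]
    return (corr + 1.0) / 2.0               # simple rescaling for display

# usage with hypothetical names: Sigma_hat_k = np.linalg.inv(Delta_hat_k)
# image = correlation_map(Sigma_hat_k)      # plot with, e.g., matplotlib's imshow
```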
5 Related Work

Previous work, including dynamic topic models (DTM) (Blei and Lafferty 2006) and continuous time dynamic topic models (cDTM) (Wang et al. 2008), has studied the problem of topic evolution when time information is available. If documents from the same time period are considered as a corpus, then DTM and cDTM fall within the framework of MTM, by designing a precision matrix that only allows dependence along the time line.

Several topic models have considered meta information, such as times, locations or authors, in estimating topics (Wang and McCallum 2006; Mei et al. 2006, 2008; Rosen-Zvi et al. 2004). In principle, corpus assignment can be considered a type of meta information. However, all of these previous models assume a single set of global and independent topics. Methods such as these do not provide a mechanism for modeling topic relations among multiple corpora, as we have developed here for MTM.

6 Conclusions

In this paper, we developed MTMs for simultaneously modeling multiple corpora. Across corpora, MTMs use GMRFs to model the correlations between their topics. These models not only capture the internal topic structures within one corpus, but also discover the relationships of the topics across many.

While we examined MTMs in the context of LDA-based models, we emphasize that the MTM framework can be integrated into many other topic models. The inference and estimation procedures provide a general way of incorporating multiple corpora into topic analysis. In future work, we plan to study other datasets, e.g., local news articles, and explore other possible representations of relationships between topics.

topic: clustering
CIKM ICML KDD NIPS SIGIR WWW
clustering clustering clustering clustering clustering spam
data graph data graph semantic clustering
similarity data mining similarity similarity similarity
algorithms kernels patterns data filtering mining
algorithm constraints algorithm cluster based detection
patterns relational frequent clusters document algorithms
time based algorithms algorithms cluster extraction
mining similarity clusters matching information based
method pairwise set spectral spam data
set cluster cluster kernels clusters web
series spectral graph shape algorithm patterns
based algorithms pattern set items existing

Table 2: The corresponding topic visualization of Figure 4(a).

topic: learning & classification


CIKM ICML KDD NIPS SIGIR WWW
classification learning model learning classification learning
learning model data model text models
text data classification data image topic
features models models models features images
training algorithm learning image learning classification
models bayesian labels inference labeled image
classifier approach training bayesian data text
model using labeling structure training topics
image structure labeled features using approach
approach semi-supervised algorithm classification classifier method
categorization markov text using algorithm features
based multiple multiple images segmentation framework

Table 3: The corresponding topic visualization of Figure 4(b).


Acknowledgments

David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft.

References

J. Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society, Series B, 44(2):139–177, 1982.

D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

W. Buntine. Applying discrete PCA in data analysis. In UAI. AUAI Press, 2004.

L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.

T. Griffiths and M. Steyvers. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning, 2006.

M. I. Jordan, Z. Ghahramani, T. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503–528, 1989.

B. Marlin. Modeling user rating profiles for collaborative filtering. In NIPS. MIT Press, 2003.

Q. Mei, C. Liu, H. Su, and C. Zhai. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW. ACM, 2006.

Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In WWW, 2008.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2006.

M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In UAI, 2004.

L. Ruschendorf. Convergence of the iterative proportional fitting procedure. The Annals of Statistics, 23(4):1160–1174, 1995.

M. J. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Technical Report 649, UC Berkeley, Dept. of Statistics, 2003.

C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In UAI, 2008.

X. Wang and A. McCallum. Topics over time: a non-Markov continuous-time model of topical trends. In KDD, 2006.

X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In SIGIR. ACM, 2006.
