Topic Modeling in Embedding Spaces

Figure 2. A topic about Christianity found by the ETM on The New York Times. The topic is a point in the word embedding space.

Figure 3. Topics about sports found by the ETM. Each topic is a point in the word embedding space.
2017; Zhao et al., 2017a) or the topic assignment priors (Xie et al., 2015). For example, Petterson et al. (2010) use a word similarity graph (as given by a thesaurus) to bias LDA towards assigning similar words to similar topics. As another example, Xie et al. (2015) model the per-word topic assignments of LDA using a Markov random field to account for both the topic proportions and the topic assignments of similar words. These methods use word similarity as a type of "side information" about language; in contrast, the ETM directly models the similarity (via embeddings) in its generative process of words.

Other work has extended LDA to directly involve word embeddings. One common strategy is to convert the discrete text into continuous observations of embeddings, and then adapt LDA to generate real-valued data (Das et al., 2015; Xun et al., 2016; Batmanghelich et al., 2016; Xun et al., 2017). With this strategy, topics are Gaussian distributions with latent means and covariances, and the likelihood over the embeddings is modeled with a Gaussian (Das et al., 2015) or a von Mises-Fisher distribution (Batmanghelich et al., 2016). The ETM differs from these approaches in that it is a model of categorical data, one that goes through the embeddings matrix. Thus it does not require pre-fitted embeddings and, indeed, can learn embeddings as part of its inference process.

There have been a few other ways of combining LDA and embeddings. Nguyen et al. (2015) mix the likelihood defined by LDA with a log-linear model that uses pre-fitted word embeddings; Bunk and Krestel (2018) randomly replace words drawn from a topic with their embeddings drawn from a Gaussian; and Xu et al. (2018) adopt a geometric perspective, using Wasserstein distances to learn topics and word embeddings jointly.

Another thread of recent research improves topic modeling inference through deep neural networks (Srivastava and Sutton, 2017; Card et al., 2017; Cong et al., 2017; Zhang et al., 2018). Specifically, these methods reduce the dimension of the text data through amortized inference and the variational auto-encoder (Kingma and Welling, 2014; Rezende et al., 2014). To perform inference in the ETM, we also avail ourselves of amortized inference methods (Gershman and Goodman, 2014).

Finally, as a document model, the ETM also relates to works that learn per-document representations as part of an embedding model (Le and Mikolov, 2014; Moody, 2016; Miao et al., 2016). In contrast to these works, the document variables in the ETM are part of a larger probabilistic topic model.

3 Background

The ETM builds on two main ideas, LDA and word embeddings. Consider a corpus of D documents, where the vocabulary contains V distinct terms. Let wdn ∈ {1, . . . , V} denote the nth word in the dth document.

Latent Dirichlet allocation. LDA is a probabilistic generative model of documents (Blei et al., 2003). It posits K topics β1:K, each of which is a distribution over the vocabulary. LDA assumes each document comes from a mixture of topics, where the topics are shared across the corpus and the mixture proportions are unique for each document. The generative process for each document
is the following:

1. Draw topic proportion θd ∼ Dirichlet(αθ).
2. For each word n in the document:
   (a) Draw topic assignment zdn ∼ Cat(θd).
   (b) Draw word wdn ∼ Cat(βzdn).

Here, Cat(·) denotes the categorical distribution. LDA places a Dirichlet prior on the topics, βk ∼ Dirichlet(αβ) for k = 1, . . . , K. The concentration parameters αβ and αθ of the Dirichlet distributions are fixed model hyperparameters.
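To make this generative process concrete, here is a minimal NumPy sketch that samples one synthetic document from LDA. The vocabulary size, number of topics, document length, and concentration values are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumed for the example).
V, K, N_d = 1000, 5, 50              # vocabulary size, topics, words in the document
alpha_theta, alpha_beta = 0.1, 0.01

# Topics: K distributions over the vocabulary, beta_k ~ Dirichlet(alpha_beta).
beta = rng.dirichlet(alpha_beta * np.ones(V), size=K)        # K x V

# 1. Draw topic proportions theta_d ~ Dirichlet(alpha_theta).
theta_d = rng.dirichlet(alpha_theta * np.ones(K))

# 2. For each word n: draw a topic assignment, then a word from that topic.
z_d = rng.choice(K, size=N_d, p=theta_d)                     # z_dn ~ Cat(theta_d)
w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])      # w_dn ~ Cat(beta_{z_dn})
```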
Word embeddings. Word embeddings provide models of language that use vector representations of words (Rumelhart and Abrahamson, 1973; Bengio et al., 2003). The word representations are fitted to relate to meaning, in that words with similar meanings will have representations that are close. (In embeddings, the "meaning" of a word comes from the contexts in which it is used.)

We focus on the continuous bag-of-words (CBOW) variant of word embeddings (Mikolov et al., 2013b). In CBOW, the likelihood of each word wdn is

wdn ∼ softmax(ρ⊤ αdn).   (1)

The embedding matrix ρ is a L × V matrix whose columns contain the embedding representations of the vocabulary, ρv ∈ R^L. The vector αdn is the context embedding. The context embedding is the sum of the context embedding vectors (αv for each word v) of the words surrounding wdn.
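As a small illustration of Eq. 1, the following sketch forms a context embedding by summing the context vectors of the surrounding words and scores the center word with a softmax over the vocabulary. The embedding dimension, the window contents, and the randomly initialized matrices are assumptions of the example, standing in for fitted CBOW parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
L, V = 300, 1000                        # embedding dimension, vocabulary size (assumed)

rho = rng.normal(size=(L, V))           # word embedding matrix rho (columns = words)
alpha_ctx = rng.normal(size=(L, V))     # one context embedding vector alpha_v per word

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Context embedding: sum of the context vectors of the words surrounding w_dn.
window = [12, 47, 803, 5]                     # indices of surrounding words (assumed)
alpha_dn = alpha_ctx[:, window].sum(axis=1)   # L-vector

# Eq. 1: w_dn ~ softmax(rho^T alpha_dn); e.g., probability assigned to word 47.
p_w = softmax(rho.T @ alpha_dn)
print(p_w[47])
```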
4 The Embedded Topic Model

The ETM is a topic model that uses embedding representations of both words and topics. It contains two notions of latent dimension. First, it embeds the vocabulary in an L-dimensional space. These embeddings are similar in spirit to classical word embeddings. Second, it represents each document in terms of K latent topics.

In traditional topic modeling, each topic is a full distribution over the vocabulary. In the ETM, however, the kth topic is a vector αk ∈ R^L in the embedding space. We call αk a topic embedding—it is a distributed representation of the kth topic in the semantic space of words.

In its generative process, the ETM uses the topic embedding to form a per-topic distribution over the vocabulary. Specifically, the ETM uses a log-linear model that takes the inner product of the word embedding matrix and the topic embedding. With this form, the ETM assigns high probability to a word v in topic k by measuring the agreement between the word's embedding and the topic's embedding.

Denote the L × V word embedding matrix by ρ; the column ρv is the embedding of v. Under the ETM, the generative process of the dth document is the following:

1. Draw topic proportions θd ∼ LN(0, I).
2. For each word n in the document:
   a. Draw topic assignment zdn ∼ Cat(θd).
   b. Draw the word wdn ∼ softmax(ρ⊤ αzdn).

In Step 1, LN(·) denotes the logistic-normal distribution (Aitchison and Shen, 1980; Blei and Lafferty, 2007); it transforms a standard Gaussian random variable to the simplex. A draw θd from this distribution is obtained as

δd ∼ N(0, I); θd = softmax(δd).   (2)

(We replaced the Dirichlet with the logistic normal to more easily use reparameterization in the inference algorithm; see Section 5.)

Steps 1 and 2a are standard for topic modeling: they represent documents as distributions over topics and draw a topic assignment for each observed word. Step 2b is different; it uses the embeddings of the vocabulary ρ and the assigned topic embedding αzdn to draw the observed word from the assigned topic, as given by zdn.

The topic distribution in Step 2b mirrors the CBOW likelihood in Eq. 1. Recall CBOW uses the surrounding words to form the context vector αdn. In contrast, the ETM uses the topic embedding αzdn as the context vector, where the assigned topic zdn is drawn from the per-document variable θd. The ETM draws its words from a document context, rather than from a window of surrounding words.

The ETM likelihood uses a matrix of word embeddings ρ, a representation of the vocabulary in a lower dimensional space. In practice, it can either rely on previously fitted embeddings or learn them as part of its overall fitting procedure. When the ETM learns the embeddings as part of the fitting procedure, it simultaneously finds topics and an embedding space.
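The following sketch spells out the ETM's per-topic distributions and generative process in NumPy, under the definitions above. The sizes are illustrative, and ρ and α are random placeholders for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
L, V, K, N_d = 300, 1000, 50, 80      # illustrative sizes (assumed)

rho = rng.normal(size=(L, V))         # word embeddings rho (L x V)
alpha = rng.normal(size=(K, L))       # topic embeddings alpha_k in R^L

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Each topic is a distribution over the vocabulary through the embeddings:
# beta_k = softmax(rho^T alpha_k).
beta = np.stack([softmax(rho.T @ alpha[k]) for k in range(K)])   # K x V

# 1. theta_d ~ LN(0, I): draw a Gaussian and map it to the simplex (Eq. 2).
delta_d = rng.normal(size=K)
theta_d = softmax(delta_d)

# 2. Draw topic assignments and words.
z_d = rng.choice(K, size=N_d, p=theta_d)                 # z_dn ~ Cat(theta_d)
w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # w_dn ~ softmax(rho^T alpha_{z_dn})
```

Because each βk exists only through ρ and αk, the same inner product can also score words that never appear in the training corpus, a point the next paragraph returns to.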
When the ETM uses previously fitted embeddings, it learns the topics of a corpus in a particular embedding space. This strategy is particularly useful when there are words in the embedding that are not used in the corpus. The ETM can hypothesize how those words fit in to the topics because it can calculate ρv⊤ αk, even for words v that do not appear in the corpus.

5 Inference and Estimation

We are given a corpus of documents {w1, . . . , wD}, where wd is a collection of Nd words. How do we fit the ETM?

The marginal likelihood. The parameters of the ETM are the embeddings ρ1:V and the topic embeddings α1:K; each αk is a point in the embedding space. We maximize the marginal likelihood of the documents.

To fit the ETM, we use variational inference, optimizing a bound on the log of the marginal likelihood of Eq. 4. There are two sets of parameters to optimize: the model parameters, as described above, and the variational parameters, which tighten the bounds on the marginal likelihoods.

To begin, posit a family of distributions of the untransformed topic proportions q(δd; wd, ν). We use amortized inference, where the variational distribution of δd depends on both the document wd and shared variational parameters ν. In particular, q(δd; wd, ν) is a Gaussian whose mean and variance come from an "inference network," a neural network parameterized by ν (Kingma and Welling, 2014). The inference network ingests the document wd and outputs a mean and variance of δd. (To accommodate documents of varying length, we form the input of the inference network by normalizing the bag-of-words representation of the document by the number of words Nd.)
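A minimal PyTorch sketch of such an inference network, assuming a fully connected 3-layer architecture and an arbitrary hidden size; the text specifies only that the network ingests the length-normalized bag-of-words and outputs a mean and variance for δd.

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """q(delta_d; w_d, nu): a Gaussian whose mean and variance come from a neural net."""
    def __init__(self, vocab_size: int, num_topics: int, hidden: int = 800):
        super().__init__()
        # Hidden size and activation are assumptions of this sketch.
        self.body = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)

    def forward(self, bow_counts: torch.Tensor):
        # Normalize the bag-of-words by the document length N_d.
        x = bow_counts / bow_counts.sum(dim=1, keepdim=True)
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def sample_theta(net: InferenceNetwork, bow_counts: torch.Tensor) -> torch.Tensor:
    # Reparameterized draw of delta_d, then theta_d = softmax(delta_d).
    mu, logvar = net(bow_counts)
    delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    return torch.softmax(delta, dim=-1)
```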
6 Empirical Study

We study two variants of the ETM, one where the word embeddings are pre-fitted and one where they are learned jointly with the rest of the parameters. The variant with pre-fitted embeddings is called the "labeled ETM." We use skip-gram embeddings (Mikolov et al., 2013b).

Algorithm settings. Given a corpus, each model comes with an approximate posterior inference problem. We use variational inference for all of the models and employ stochastic variational inference (SVI) (Hoffman et al., 2013) to speed up the optimization. The minibatch size is 1,000 documents. For LDA, we set the learning rate as suggested by Hoffman et al. (2013): the delay is 10 and the forgetting factor is 0.85.

Within SVI, LDA enjoys coordinate ascent variational updates, with 5 inner steps to optimize the local variables. For the other models, we use amortized inference over the local variables θd. We use 3-layer inference networks and we set the local learning rate to 0.002. We use ℓ2 regularization on the variational parameters (the weight decay parameter is 1.2 × 10−6).
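Read as configuration, these settings might look as follows. The use of Adam is an assumption of this sketch; the text reports only the learning rate and the weight decay.

```python
import torch

def make_variational_optimizer(params):
    # Local learning rate 0.002 and l2 regularization (weight decay) 1.2e-6,
    # as reported above; the choice of Adam is an assumption, not stated here.
    return torch.optim.Adam(params, lr=0.002, weight_decay=1.2e-6)
```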
Qualitative results. We first examine the embeddings. The ETM, NVDM, and ∆-NVDM all involve a word embedding. We illustrate them by fixing a set of terms and calculating the words that occur in the neighborhood around them. For comparison, we also illustrate word embeddings learned by the skip-gram model.

Table 1 illustrates the embeddings of the different models. All the methods provide interpretable embeddings—words with related meanings are close to each other. The ETM and the NVDM learn embeddings that are similar to those from the skip-gram. The embeddings of ∆-NVDM are different; the simplex constraint on the local variable changes the nature of the embeddings.

We next look at the learned topics. Table 2 displays the 7 most used topics for all methods, as given by the average of the topic proportions θd. LDA and the ETM both provide interpretable topics. Neither NVDM nor ∆-NVDM provide interpretable topics; their model parameters β are not interpretable as distributions over the vocabulary that mix to form documents.

Quantitative results. We next study the models quantitatively. We measure the quality of the topics and the predictive performance of the model. We found that among models with interpretable topics, the ETM provides the best predictions.

We measure topic quality by blending two metrics: topic coherence and topic diversity. Topic coherence is a quantitative measure of the interpretability of a topic (Mimno et al., 2011).
Table 2. Top five words of seven most used topics from different document models on 1.8M documents
of the New York Times corpus with vocabulary size 212,237 and K = 300 topics.
LDA
time year officials mr city percent state
day million public president building million republican
back money department bush street company party
good pay report white park year bill
long tax state clinton house billion mr
NVDM
scholars japan gansler spratt assn ridership pryce
gingrich tokyo wellstone tabitha assoc mtv mickens
funds pacific mccain mccorkle qtr straphangers mckechnie
institutions europe shalikashvili cheetos yr freierman mfume
endowment zealand coached vols nyse riders filkins
∆-NVDM
concerto servings nato innings treas patients democrats
solos tablespoons soviet scored yr doctors republicans
sonata tablespoon iraqi inning qtr medicare republican
melodies preheat gorbachev shutout outst dr senate
soloist minced arab scoreless telerate physicians dole
Labeled ETM
music republican yankees game wine court company
dance bush game points restaurant judge million
songs campaign baseball season food case stock
opera senator season team dishes justice shares
concert democrats mets play restaurants trial billion
ETM
game music united wine company yankees art
team mr israel food stock game museum
season dance government sauce million baseball show
coach opera israeli minutes companies mets work
play band mr restaurant billion season artist
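As a usage note, a ranking like the one in Table 2 can be computed from fitted parameters by averaging the topic proportions θd over documents and listing each top topic's most probable words. The function below is a hedged sketch; the array layouts and names are assumptions of the example, not the paper's code.

```python
import numpy as np

def most_used_topics(theta, beta, vocab, num_topics=7, num_words=5):
    """theta: D x K topic proportions; beta: K x V topic-word distributions;
    vocab: list of V word strings. Returns the top words of the most used topics."""
    usage = theta.mean(axis=0)                      # average proportion per topic
    top_topics = np.argsort(-usage)[:num_topics]    # most used topics
    return {
        int(k): [vocab[v] for v in np.argsort(-beta[k])[:num_words]]
        for k in top_topics
    }
```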
It is the average pointwise mutual information of two words drawn randomly from the same document (Lau et al., 2014),

TC = (1/K) Σ_{k=1}^{K} (1/45) Σ_{i=1}^{10} Σ_{j=i+1}^{10} f(w_i^(k), w_j^(k)),

where {w_1^(k), . . . , w_10^(k)} denotes the top-10 most likely words in topic k. Here, f(·, ·) is the normalized pointwise mutual information,

f(w_i, w_j) = log[ P(w_i, w_j) / (P(w_i) P(w_j)) ] / ( −log P(w_i, w_j) ).

The quantity P(w_i, w_j) is the probability of words w_i and w_j co-occurring in a document and P(w_i) is the marginal probability of word w_i. We approximate these probabilities with empirical counts.

The idea behind topic coherence is that a coherent topic will display words that tend to occur in the same documents. In other words, the most likely words in a coherent topic should have high mutual information. Document models with higher topic coherence are more interpretable topic models.

We combine coherence with a second metric, topic diversity. We define topic diversity to be the percentage of unique words in the top 25 words of all topics. Diversity close to 0 indicates redundant topics; diversity close to 1 indicates more varied topics. We define the overall metric for the quality of a model's topics as the product of its topic diversity and topic coherence.
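Both metrics translate directly into code. The sketch below computes the NPMI-based topic coherence and the topic diversity defined above, and multiplies them into topic quality; it assumes docs is a list of token lists and topics is a list of ranked word lists, and it adopts a simple convention for word pairs that never co-occur.

```python
import numpy as np
from itertools import combinations

def topic_coherence(topics, docs, top_n=10):
    """Average normalized PMI over the top-n word pairs of each topic."""
    D = len(docs)
    doc_sets = [set(d) for d in docs]

    def p(*words):
        # Empirical probability that all given words co-occur in a document.
        return sum(all(w in s for w in words) for s in doc_sets) / D

    scores = []
    for topic in topics:
        f_vals = []
        for wi, wj in combinations(topic[:top_n], 2):   # 45 pairs when top_n = 10
            pij, pi, pj = p(wi, wj), p(wi), p(wj)
            if pij == 0:
                f_vals.append(-1.0)                     # convention (assumed) for no co-occurrence
            else:
                f_vals.append(np.log(pij / (pi * pj)) / -np.log(pij))
        scores.append(np.mean(f_vals))
    return float(np.mean(scores))

def topic_diversity(topics, top_n=25):
    """Fraction of unique words among the top-25 words of all topics."""
    top = [w for topic in topics for w in topic[:top_n]]
    return len(set(top)) / len(top)

def topic_quality(topics, docs):
    return topic_coherence(topics, docs) * topic_diversity(topics)
```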
Figure 4. Performance on the 20NewsGroups and the New York Times datasets for different vocabulary sizes. On both plots, better models are on the top right corner. Overall, the ETM is a better topic model. (Each panel plots interpretability against predictive power; the compared models are LDA, NVDM, ∆-NVDM, the labeled ETM, and the ETM.)

(a) Topic quality as measured by normalized product of topic coherence and topic diversity (the higher the better) vs. predictive performance as measured by normalized log-likelihood on document completion (the higher the better) on the 20NewsGroup dataset. Panels correspond to vocabulary sizes V = 3102, 8496, 18625, 29461, and 52258.

(b) Topic quality as measured by normalized product of topic coherence and topic diversity (the higher the better) vs. predictive performance as measured by normalized log-likelihood on document completion (the higher the better) on the New York Times dataset.
A good topic model also provides a good distribution of language. To measure predictive quality, we calculate log likelihood on a document completion task (Rosen-Zvi et al., 2004; Wallach et al., 2009). We divide each test document into two sets of words. The first half is observed: it induces a distribution over topics which, in turn, induces a distribution over the next words in the document. We then evaluate the second half under this distribution. A good document model should provide higher log-likelihood on the second half. (For all methods, we approximate the likelihood by setting θd to the variational mean.)
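A sketch of this document-completion evaluation, assuming the caller supplies θd for each test document (approximated by its variational mean, as above) along with the topic-word distributions; the smoothing constant and per-word normalization are choices of the example.

```python
import numpy as np

def completion_log_likelihood(theta_first_half, beta, second_halves):
    """theta_first_half: D x K topic proportions inferred from the observed halves;
    beta: K x V topic-word distributions; second_halves: list of held-out token-id arrays."""
    log_lik, num_words = 0.0, 0
    word_dist = theta_first_half @ beta          # D x V distribution over next words
    for d, held_out in enumerate(second_halves):
        log_lik += np.sum(np.log(word_dist[d, held_out] + 1e-12))  # small smoothing (assumed)
        num_words += len(held_out)
    return log_lik / num_words                   # per-word held-out log-likelihood
```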
We study both corpora with different vocabularies. Figure 4 shows topic quality as a function of predictive power. (To ease visualization, we normalize both metrics by subtracting the mean and dividing by the standard deviation.) The best models are on the upper right corner.

LDA predicts worst in almost all settings. On 20NewsGroups, the NVDM's predictions are in general better than LDA but worse than for the other methods; on the New York Times, the NVDM gives the best predictions. However, topic quality for the NVDM is far below the other methods. (It does not provide "topics", so we assess the interpretability of its β matrix.) In prediction, both versions of the ETM are at least as good as the simplex-constrained ∆-NVDM.

These figures show that, of the interpretable models, the ETM provides the best predictive performance while keeping interpretable topics. It is robust to large vocabularies.

6.1 Stop words

We now study a version of the New York Times corpus that includes all stop words.
Figure 5. A topic containing stop words found by the ETM on The New York Times. The ETM is robust even in the presence of stop words. (The figure shows Topic 181, whose words include "can", "good", "passing", "better", "our", "us", "fine", "we", "going", "best", "lot", "always", "never", "just", "why", "right", "what", "how", "together", "way", and "very".)

Table 3. Topic quality on the New York Times data in the presence of stop words. Topic quality is the product of topic coherence and topic diversity (higher is better). The labeled ETM is robust to stop words; it achieves similar topic coherence to when there are no stop words.

              Coherence   Diversity   Quality
LDA           0.13        0.14        0.0173
∆-NVDM        0.17        0.11        0.0187
Labeled ETM   0.18        0.22        0.0405
We remove infrequent words to form a vocabulary of size 10,283. Our goal is to show that the labeled ETM provides interpretable topics even in the presence of stop words, another regime where topic models typically fail. In particular, given that stop words appear in many documents, traditional topic models learn topics that contain stop words, regardless of the actual semantics of the topic. This leads to poor topic interpretability.

We fit LDA, the ∆-NVDM, and the labeled ETM with K = 300 topics. (We do not report the NVDM because it does not provide interpretable topics.) Table 3 shows topic quality (the product of topic coherence and topic diversity). Overall, the labeled ETM gives the best performance in terms of topic quality.

While the ETM has a few "stop topics" that are specific to stop words (see, e.g., Figure 5), ∆-NVDM and LDA have stop words in almost every topic. (The topics are not displayed here for space constraints.) The reason is that stop words co-occur in the same documents as every other word; traditional topic models therefore have difficulty telling apart content words and stop words. The labeled ETM recognizes the location of stop words in the embedding space; it sets them off in their own topic.

7 Conclusion

We developed the ETM, a generative model of documents that marries LDA with word embeddings. The ETM assumes that topics and words live in the same embedding space, and that words are generated from a categorical distribution whose natural parameter is the inner product of the word embeddings and the embedding of the assigned topic.

The ETM learns interpretable word embeddings and topics, even in corpora with large vocabularies. We studied the performance of the ETM against several document models. The ETM learns both coherent patterns of language and an accurate distribution of words.

Acknowledgments

This work is funded by ONR N00014-17-1-2131, NIH 1U01MH115727-01, DARPA SD2 FA8750-18-C-0130, ONR N00014-15-1-2209, NSF CCF-1740833, the Alfred P. Sloan Foundation, 2Sigma, Amazon, and NVIDIA. FJRR is funded by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 706760. ABD is supported by a Google PhD Fellowship.

References

J. Aitchison and S. Shen. 1980. Logistic normal distributions: Some properties and uses. Biometrika, 67(2):261–272.
K. Batmanghelich, A. Saeedi, K. Narasimhan, and S. Gershman. 2016. Nonparametric spherical topic modeling with word embeddings. In Association for Computational Linguistics.
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.
Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer.
D. M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.
D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. 2017. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.
D. M. Blei and J. D. Lafferty. 2007. A correlated topic model of Science. The Annals of Applied Statistics, 1(1):17–35.
D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
J. Boyd-Graber, Y. Hu, and D. Mimno. 2017. Applications of topic models. Foundations and Trends in Information Retrieval, 11(2–3):143–296.
S. Bunk and R. Krestel. 2018. WELDA: Enhancing topic models by incorporating local word context. In Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries, pages 293–302. ACM.
D. Card, C. Tan, and N. A. Smith. 2017. A neural framework for generalized topic models. arXiv preprint arXiv:1705.09296.
Y. Cong, B. Chen, H. Liu, and M. Zhou. 2017. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In International Conference on Machine Learning.
R. Das, M. Zaheer, and C. Dyer. 2015. Gaussian LDA for topic models with word embeddings. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
S. J. Gershman and N. D. Goodman. 2014. Amortized inference in probabilistic reasoning. In Annual Meeting of the Cognitive Science Society.
M. D. Hoffman, D. M. Blei, and F. Bach. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
D. P. Kingma and J. L. Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
D. P. Kingma and M. Welling. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
J. H. Lau, D. Newman, and T. Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Conference of the European Chapter of the Association for Computational Linguistics.
Q. Le and T. Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning.
O. Levy and Y. Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Neural Information Processing Systems, pages 2177–2185.
Y. Li and Y. Tao. 2018. Word Embedding for Understanding Natural Language: A Survey. Springer International Publishing.
Y. Miao, L. Yu, and P. Blunsom. 2016. Neural variational inference for text processing. In International Conference on Machine Learning.
T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. ICLR Workshop Proceedings. arXiv:1301.3781.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, pages 3111–3119.
D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum. 2011. Optimizing semantic coherence in topic models. In Conference on Empirical Methods in Natural Language Processing.
A. Mnih and K. Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Neural Information Processing Systems, pages 2265–2273.
C. E. Moody. 2016. Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
D. Q. Nguyen, R. Billingsley, L. Du, and M. Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.
J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, volume 14, pages 1532–1543.
J. Petterson, W. Buntine, S. M. Narayanamurthy, T. S. Caetano, and A. J. Smola. 2010. Word features for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1921–1929.
D. J. Rezende, S. Mohamed, and D. Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. 2004. The author-topic model for authors and documents. In Uncertainty in Artificial Intelligence.
M. Rudolph, F. J. R. Ruiz, S. Mandt, and D. M. Blei. 2016. Exponential family embeddings. In Advances in Neural Information Processing Systems.
D. Rumelhart and A. Abrahamson. 1973. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28.
B. Shi, W. Lam, S. Jameel, S. Schockaert, and K. P. Lai. 2017. Jointly learning word embeddings and latent topics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 375–384. ACM.
A. Srivastava and C. Sutton. 2017. Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488.
M. K. Titsias and M. Lázaro-Gredilla. 2014. Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
H. M. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. 2009. Evaluation methods for topic models. In International Conference on Machine Learning.
P. Xie, D. Yang, and E. Xing. 2015. Incorporating word correlation knowledge into topic modeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 725–734.
H. Xu, W. Wang, W. Liu, and L. Carin. 2018. Distilled Wasserstein learning for word embedding and topic modeling. In Advances in Neural Information Processing Systems.
G. Xun, V. Gopalakrishnan, F. Ma, Y. Li, J. Gao, and A. Zhang. 2016. Topic discovery for short texts using word embeddings. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1299–1304. IEEE.
G. Xun, Y. Li, W. X. Zhao, J. Gao, and A. Zhang. 2017. A correlated topic model using word embeddings. In IJCAI, pages 4207–4213.
H. Zhang, B. Chen, D. Guo, and M. Zhou. 2018. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In International Conference on Learning Representations.
H. Zhao, L. Du, and W. Buntine. 2017a. A word embeddings informed focused topic model. In Asian Conference on Machine Learning, pages 423–438.
H. Zhao, L. Du, W. Buntine, and G. Liu. 2017b. MetaLDA: A topic model that efficiently incorporates meta information. In 2017 IEEE International Conference on Data Mining (ICDM), pages 635–644. IEEE.