Topic Modelling Meets Deep Neural Networks - A Survey

He Zhao1, Dinh Phung1,2, Viet Huynh1, Yuan Jin1, Lan Du1, Wray Buntine1
1 Department of Data Science and Artificial Intelligence, Monash University, Australia
2 VinAI Research, Vietnam
{ethan.zhao, dinh.phung, viet.huynh, yuan.jin, lan.du, wray.buntine}@monash.edu

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Survey Track
can be represented as a sequence of words, which can be denoted by a vector of natural numbers, s ∈ N^L, where L is the length of the document and s_j ∈ {1, · · · , V} is the index in the vocabulary (with size V) of the token for the j-th (j ∈ {1, · · · , L}) word. A more common representation in topic modelling is the bag-of-words model, which represents a document by a vector of word counts, b ∈ Z^V_{≥0}, where b_v indicates the occurrences of the vocabulary token v ∈ {1, · · · , V} in the document. One can readily obtain b for a document from its word sequence vector s.
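To make the two representations concrete, here is a minimal sketch (with a toy, hypothetical vocabulary size and 0-based token indices) that converts a word-sequence vector s into its bag-of-words count vector b; it only illustrates the notation above and is not tied to any particular model.

```python
import numpy as np

V = 8  # hypothetical vocabulary size, chosen only for illustration
# word-sequence representation s: token indices in {0, ..., V-1}, length L = 6
s = np.array([3, 1, 3, 0, 5, 1])

# bag-of-words representation b: b[v] counts how often token v occurs in the document
b = np.bincount(s, minlength=V)
print(b)  # [1 2 0 2 0 1 0 0]
```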
Notations of Latent Variables. A central concept is a topic, which is usually interpreted as a cluster of words describing a specific semantic meaning. A topic is or can be normalised into a distribution over the tokens in the vocabulary, named the word distribution, t ∈ ∆^V, where ∆^V is a V-dimensional simplex and t_v indicates the weight or relevance of token v under this topic. Usually, a document's semantic content is assumed to be captured or generated by one or more topics shared across the corpus. Therefore, a document is commonly associated with a distribution (or a vector that can be normalised into a distribution) over K (K ≥ 1) topics, named the topic distribution, z ∈ ∆^K, where z_k indicates the weight of the k-th topic for this document. We further use D, Z, and T to denote the corpus with all the document data, the collection of topic distributions of all the documents, and the collection of word distributions of all the topics, respectively.
Notations of Architectures and Learning. With these notations, the task for a topic model is to learn the latent variables Z and the parameters T from the observed data D. More formally, a topic model learns a projection parameterised by θ from a document's data to its topic distribution, z = θ(b), and a set of global variables for the word distributions of the topics, T. To learn these parameters, one can generate or reconstruct a document's BoW data from its topic distribution, which is modelled by another projection parameterised by φ: b̃ = φ(z, T). Note that the majority of topic models belong to the category of probabilistic generative models, where z and b are latent and observed random variables respectively, assumed to be generated from certain distributions. The projection from the latent variables to the observed ones is named the generative process, which we further denote as b̃ ∼ p^b_φ(z, T), where z is sampled from the prior distribution z ∼ p^z. The inverse projection is named the inference process, denoted as z ∼ q^z_θ(b), where q^z is the posterior distribution of z. For NTMs, these probabilities are typically parameterised by deep neural networks.
2.2 Evaluation
It is still challenging to comprehensively evaluate and compare the performance of topic models including NTMs. Based on the nature and applications of topic models, the commonly-used metrics are as follows.
Predictive accuracy. It has been common to measure the log-likelihood of a model on held-out test documents, i.e., the predictive accuracy. A more popular metric based on log-likelihood is perplexity, which captures how surprised a model is by new (test) data and is inversely proportional to the average log-likelihood per word. Although log-likelihood or perplexity gives a direct numerical comparison between models, there remain issues: 1) As topic models are not for predicting unseen data but for learning interpretable topics and representations of seen data, predictive accuracy does not reflect the main use of topic models. 2) Predictive accuracy does not capture topic quality. Predictive accuracy and human judgement on topic quality are often not correlated [Chang et al., 2009], and even sometimes slightly anti-correlated. 3) The estimation of the predictive probability is usually intractable for Bayesian models, and different papers may apply different sampling or approximation techniques [Wallach et al., 2009; Buntine, 2009]. For NTMs, the computation of log-likelihood is even more inconsistent, making it harder to compare results across different papers.
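As a concrete illustration, the sketch below computes perplexity in the usual way, as the exponential of the negative average log-likelihood per word; how the per-document log-likelihoods themselves are estimated is model-specific, which is part of the comparability issue raised above. The function and variable names are ours, for illustration only.

```python
import numpy as np

def perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity: exp(-total log-likelihood / total word count).

    doc_log_likelihoods: per-document (estimated) log-likelihoods on the test set.
    doc_lengths: number of word tokens in each test document.
    Note: how the log-likelihoods are estimated (sampling, ELBO, etc.) differs
    between models and papers, which is exactly the issue discussed above.
    """
    total_ll = float(np.sum(doc_log_likelihoods))
    total_words = float(np.sum(doc_lengths))
    return float(np.exp(-total_ll / total_words))

# toy usage with made-up numbers
print(perplexity([-350.2, -512.7], [60, 85]))
```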
Topic Coherence. Experiments show that topic coherence (TC), computed with the coherence between a topic's most representative words (e.g., its top 10 words), is in line with human evaluation of topic interpretability [Lau et al., 2014]. As various formulations have been proposed to compute TC, we refer readers to [Röder et al., 2015] for more details. Most formulations require computing the general coherence between two words, which is estimated based on word co-occurrence counts in a reference corpus. Regarding TC, we have the following remarks: 1) The ranking of TC scores may vary under different formulations. Therefore, it is encouraged to report TC scores of different formulations or report the average score. 2) The choice of the reference corpus can also affect the TC scores, due to the change of lexical usage, i.e., the shift of word distribution. For example, computing TC for a machine learning paper collection with a tweet dataset as the reference may generate inaccurate results. Popular choices of the reference corpus are the target corpus itself or an external corpus such as a large dump of Wikipedia. 3) To exclude less interpretable "background" topics, one can select the topics (e.g., the top 50%) with the highest TC or the largest proportions and report the average score over those selected topics [Zhao et al., 2018a], or vary the proportion of the selected topics (e.g., from 10% to 100%) and plot the TC score at each proportion [Zhao et al., 2021].

Topic Diversity. Topic diversity (TD), as its name implies, measures how diverse the discovered topics are. It is preferable that discovered topics describe different semantic topical meanings. Specifically, [Dieng et al., 2020] defines topic diversity to be the percentage of unique words in the top 25 words of all topics.
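As a concrete illustration of these two metrics, the sketch below implements one common NPMI-style coherence formulation based on document-level co-occurrence in a reference corpus, together with the topic-diversity measure just described; other TC formulations surveyed in [Röder et al., 2015] differ in how the probabilities are estimated and aggregated, so treat this as one possible choice rather than the standard.

```python
import itertools
import numpy as np

def npmi_coherence(top_words, ref_docs, eps=1e-12):
    """Average NPMI over word pairs in one topic's top words.

    top_words: list of the topic's most representative words (e.g., top 10).
    ref_docs: reference corpus as a list of sets of words (document-level
              co-occurrence, one common choice among several).
    """
    n = len(ref_docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in ref_docs) / n
    scores = []
    for wi, wj in itertools.combinations(top_words, 2):
        p_ij, p_i, p_j = p(wi, wj), p(wi), p(wj)
        if p_ij <= 0:
            scores.append(-1.0)  # common convention when a pair never co-occurs
        else:
            pmi = np.log(p_ij / (p_i * p_j + eps))
            scores.append(pmi / (-np.log(p_ij) + eps))
    return float(np.mean(scores))

def topic_diversity(topics, top_n=25):
    """Percentage of unique words among the top-n words of all topics [Dieng et al., 2020]."""
    top = [w for t in topics for w in t[:top_n]]
    return len(set(top)) / len(top)
```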
Downstream Application Performance. The topic distribution z of a document learned by a topic model can be viewed as the semantic representation of the document, which can be used in document classification, clustering, retrieval, visualisation, and elsewhere. For document classification, one can train a classification model with the topic distributions learned by a topic model as features and report the classification performance to compare different topic models. Document clustering can be conducted by two strategies: 1) Similar to classification, one can run a clustering model (e.g., K-means with different numbers of clusters) on the topic distributions, such as in [Zhao et al., 2021]; 2) Alternatively, topics can be viewed as clusters of documents, so one can use the most significant topic of a document (i.e., the topic with the largest weight in its topic distribution) as the cluster assignment, such as in [Nguyen et al., 2015]. For document retrieval, we can use the distance between the topic distributions of two documents as their semantic distance and report retrieval accuracy as a metric of topic modelling [Larochelle and Lauly, 2012]. For qualitative analysis, a straightforward way is to plot the most significant words of topics. Recently, [Doogan and Buntine, 2021] shows that it can be more insightful to show and analyse the typical documents for a topic.
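The two clustering strategies above can be sketched as follows, assuming a matrix Z of per-document topic distributions from some trained topic model is already available (the toy data and the use of scikit-learn's K-means are illustrative assumptions); evaluating the resulting clusters, e.g., by purity or NMI against gold labels, is omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Z: (num_docs, K) matrix of topic distributions from a trained topic model (assumed given)
Z = np.random.dirichlet(alpha=np.ones(10), size=100)  # placeholder data for illustration

# Strategy 1: run a clustering algorithm on the topic distributions
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)

# Strategy 2: use each document's most significant topic as its cluster assignment
argmax_labels = Z.argmax(axis=1)

print(kmeans_labels[:10], argmax_labels[:10])
```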
3 Neural Topic Models with Amortised Variational Inference
The recent success of deep generative models such as variational autoencoders (VAEs) and amortised variational inference (AVI) has shed light on extending the generative process and amortising the inference process of BPTMs, which is the most popular framework for NTMs. We name this series of models VAE-NTMs. The basic framework of a VAE follows the description in Section 2.1, where b and z are the observed and latent variables respectively, and the generative and inference processes are modelled by the DNN-based decoder and encoder respectively. Following [Kingma and Welling, 2014; Rezende et al., 2014], one can learn a VAE model by maximising the Evidence Lower BOund (ELBO) of the marginal likelihood of the BoW data b in terms of θ, φ, and T: E_{z∼q^z}[log p(b | z)] − KL[q^z ‖ p^z], where the second term is the Kullback-Leibler (KL) divergence. To compute/estimate gradients, tricks like reparameterisation are usually used to back-propagate gradients through the expectation in the first term, and approximations are applied when the analytical form of the KL divergence is unavailable.

To adapt the VAE framework for topic modelling, there are two key questions to be answered: 1) Different from other applications, the input data of topic modelling has its unique properties, i.e., b is a high-dimensional, sparse, count-valued vector and s is variable-length sequential data. How to deal with such data is the first question for designing a VAE topic model. 2) Interpretability of topics is extremely important in topic modelling. When it comes to a VAE model, how to explicitly or implicitly incorporate the word distributions of topics (i.e., T) to interpret the latent representations or each dimension remains another question. [Miao et al., 2016] proposes the first answers to the above questions, where the decoder is developed by specifying the data distribution p^b as p^b := Multi(softmax(T^T z + c)). Here z ∈ R^K models the topic distribution of a document, T ∈ R^{K×V} models the word distributions of the topics, and c ∈ R^V is the bias. That is to say, φ := {c}¹ and T := {T}. For the encoder, which takes b as input and outputs (the samples of) z, the paper follows the original VAE: p^z := N(0, diag_K(1)) and q^z := N(µ, diag_K(σ²)), where π = θ_0(b), µ = θ_1(π), and log σ = θ_2(π). Here, θ := {θ_0, θ_1, θ_2}, all of which are multi-layer perceptrons (MLPs). To better address the above questions, various configurations of the prior distribution p^z, data distribution p^b, and posterior distribution q^z, as well as different architectures of the decoder φ, encoder θ, and word distributions of the topics T, have been proposed for VAE-NTMs.

¹With a slight abuse of notation, we use θ and φ to denote the projections or the parameters of the projections.
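To make this basic recipe concrete, below is a minimal PyTorch sketch in the spirit of the VAE-NTM described above: an MLP encoder producing µ and log σ, a reparameterised Gaussian sample of z, a decoder of the form softmax(T^T z + c), and a negative-ELBO training loss with a standard normal prior. Layer sizes, the non-linearity, and initialisation are our illustrative assumptions, not a faithful reproduction of any single paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicVAENTM(nn.Module):
    def __init__(self, vocab_size, num_topics, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())  # theta_0
        self.mu = nn.Linear(hidden, num_topics)                             # theta_1
        self.logsigma = nn.Linear(hidden, num_topics)                       # theta_2
        self.T = nn.Parameter(torch.randn(num_topics, vocab_size) * 0.02)   # topic-word matrix
        self.c = nn.Parameter(torch.zeros(vocab_size))                      # bias

    def forward(self, b):
        pi = self.enc(b)
        mu, logsigma = self.mu(pi), self.logsigma(pi)
        z = mu + torch.randn_like(mu) * logsigma.exp()       # reparameterisation trick
        log_probs = F.log_softmax(z @ self.T + self.c, dim=-1)
        rec = -(b * log_probs).sum(-1)                        # negative multinomial log-likelihood
        kl = -0.5 * (1 + 2 * logsigma - mu.pow(2) - (2 * logsigma).exp()).sum(-1)
        return (rec + kl).mean()                              # negative ELBO to minimise

# toy usage: a batch of 4 BoW vectors over a 500-word vocabulary, 20 topics
model = BasicVAENTM(vocab_size=500, num_topics=20)
b = torch.randint(0, 3, (4, 500)).float()
loss = model(b)
loss.backward()
```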
3.1 Variants of Distributions
Given the knowledge and experience of BPTMs, z's prior plays an important role in the quality of topics and document representations in topic models. Thus, various constructions of the prior distributions and their corresponding posterior distributions have been proposed for VAE-NTMs, aiming to be better alternatives to the normal distributions used in the original models.

Variants of Prior Distributions for z. Note that the application of the Dirichlet distribution is one of the key successes of LDA for encouraging topic smoothness and sparsity. For VAE-NTMs, one can apply p^z := Dir(α_0) and q^z := Dir(θ(b)). However, it is difficult to develop an effective reparameterisation function (RF) for the Dirichlet distribution, making it hard to compute the gradient of the expectation in the ELBO. Therefore, various approximations have been proposed. For example, [Srivastava and Sutton, 2017] uses the Laplace approximation, where Dirichlet samples are approximated by those sampled from a logistic normal distribution whose mean and covariance are specifically configured. Recall that the Dirichlet distribution can be simulated by normalising gamma variables. Although the gamma distribution still does not have a non-central differentiable RF, it is easier to approximate. Several works have been proposed in this line, such as using the Weibull distribution as an approximation of the gamma in [Zhang et al., 2018], approximating the cumulative distribution function of the gamma with an auxiliary uniform variable in [Joo et al., 2020], and leveraging the proposal function of a rejection sampler of the gamma distribution as the RF in [Burkhardt and Kramer, 2019]. Recently, [Tian et al., 2020] proposes to tackle this challenge by using the so-called rounded RF, which approximates Dirichlet samples by those drawn from the rounded posterior distribution. Other than the Dirichlet, [Miao et al., 2017] introduces a Gaussian softmax (GSM) function in the encoder, q^z := softmax(N(µ, diag_K(σ²))), and [Silveira et al., 2018] proposes to use a logistic-normal mixture distribution for the prior of z. To further enhance the sparsity in z, [Lin et al., 2019] introduces the sparsemax function to replace the softmax in GSM.
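For the Laplace-approximation route, the sketch below shows one way the logistic-normal mean and (diagonal) variance can be configured for a Dirichlet(α) prior, following our reading of the construction popularised by [Srivastava and Sutton, 2017]; the exact form and its use inside the KL term should be checked against that paper.

```python
import numpy as np

def logistic_normal_prior(alpha):
    """Diagonal logistic-normal approximation of a Dirichlet(alpha) prior.

    Returns (mu, sigma2) of a Gaussian in the softmax basis; samples are mapped
    to the simplex via softmax. This follows our reading of the Laplace
    approximation used by ProdLDA-style models and is only a sketch.
    """
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    mu = np.log(alpha) - np.log(alpha).mean()
    sigma2 = (1.0 / alpha) * (1.0 - 2.0 / K) + (1.0 / K**2) * np.sum(1.0 / alpha)
    return mu, sigma2

def sample_simplex(mu, sigma2, rng=np.random.default_rng(0)):
    h = mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
    e = np.exp(h - h.max())
    return e / e.sum()  # softmax: an (approximate) Dirichlet draw on the simplex

mu, sigma2 = logistic_normal_prior(np.full(50, 0.02))
print(sample_simplex(mu, sigma2).sum())  # 1.0
```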
Nonparametric Prior for z. Bayesian nonparametric priors such as the Dirichlet process, the Indian buffet process, and the gamma process have been successfully applied in Bayesian topic modelling, enabling models to automatically infer the prior proportions and number of topics (i.e., K), e.g., in [Teh et al., 2006; Williamson et al., 2010; Buntine and Mishra, 2014; Zhou et al., 2016; Zhao et al., 2018b]. As a flexible construction of the Dirichlet process, the stick-breaking process (SBP) is able to generate probability vectors with infinite dimensions, which has been used as the prior of z in VAE-NTMs. Given z ∼ SBP(α_0), we have z_1 = v_1 and z_k = v_k ∏_{j<k}(1 − v_j) for k > 1, where v_k ∼ Beta(1, α_0). This procedure can be viewed as iteratively breaking a length-one stick into multiple pieces, where the k-th iteration breaks the stick at the point v_k. Although not for NTMs, [Nalisnick and Smyth, 2017] uses the SBP to generate z for VAEs, where its VI is done by various approximations to the beta distribution of v_k with truncation. [Ning et al., 2020] adapts this SBP construction for VAE-NTMs and also proposes to impose an SBP on the corpus level, which serves as the prior for the document-level SBP, forming a hierarchical model. In [Miao et al., 2017], the break points v_k are generated from a posterior modelled by a recurrent neural network (RNN) with normal noises as input, making the model able to automatically infer K in a truncation-free manner. Recently, [Wu et al., 2020a] uses the (truncated) gamma negative binomial process to generate discrete vectors for z (i.e., each entry of z is equivalently generated by an independent Poisson distribution), which gives the model a certain ability to be nonparametric.
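A truncated draw from the stick-breaking construction above can be sketched in a few lines; the truncation level is an assumption made purely for illustration, and truncation-free or amortised treatments are exactly what the works cited above provide.

```python
import numpy as np

def sample_sbp(alpha0, trunc=50, rng=np.random.default_rng()):
    """Truncated stick-breaking draw: z_1 = v_1, z_k = v_k * prod_{j<k}(1 - v_j)."""
    v = rng.beta(1.0, alpha0, size=trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # prod_{j<k}(1 - v_j)
    z = v * remaining
    z[-1] = 1.0 - z[:-1].sum()  # fold the leftover stick into the last entry so z sums to one
    return z

z = sample_sbp(alpha0=5.0)
print(z[:10], z.sum())
```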
Variants of Data Distribution p^b. In addition to manipulating the distributions on z, [Zhao et al., 2020] proposes to replace the multinomial data distribution used in other NTMs with the negative-binomial distribution to capture overdispersion, making the model more robust: b ∼ NB(φ_0(z), φ_1(z)), where two separate decoders φ_0 and φ_1 are proposed to generate the two parameters of the negative-binomial distribution from z.
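A hedged sketch of how such a likelihood might be parameterised with two decoders, using PyTorch's negative-binomial distribution; mapping φ_0 to the count parameter and φ_1 to the logits, as well as the positivity transform, are our illustrative choices rather than the exact construction of [Zhao et al., 2020].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import NegativeBinomial

class NBDecoder(nn.Module):
    """Two decoders map z to the two parameters of a per-word negative binomial."""
    def __init__(self, num_topics, vocab_size):
        super().__init__()
        self.dec_r = nn.Linear(num_topics, vocab_size)       # phi_0: dispersion-related parameter
        self.dec_logits = nn.Linear(num_topics, vocab_size)  # phi_1: probability logits

    def log_likelihood(self, z, b):
        r = F.softplus(self.dec_r(z)) + 1e-4                 # total_count must be positive
        dist = NegativeBinomial(total_count=r, logits=self.dec_logits(z))
        return dist.log_prob(b).sum(-1)                      # sum over the vocabulary

decoder = NBDecoder(num_topics=20, vocab_size=500)
z = torch.randn(4, 20)
b = torch.randint(0, 5, (4, 500)).float()
print(decoder.log_likelihood(z, b).shape)  # torch.Size([4])
```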
Variants of Word Distributions T. Conventionally, the collection of the word distributions of the topics, T, is a K × V matrix, i.e., T ∈ R^{K×V} with KV free parameters. In BPTMs, it has been popular to factorise this matrix into a product of topic and word embeddings, meaning that the relevance between a topic and a word is captured by their distance in the embedding space [Zhao et al., 2017a]. This construction has been studied in NTMs, e.g., in [Jung and Choi, 2017; Dieng et al., 2020; Ding et al., 2018].
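A minimal sketch of this factorisation: with topic embeddings and word embeddings in a shared space (dimensionality chosen arbitrarily here), the topic-word matrix is recovered from their inner products followed by a softmax over the vocabulary; models that measure relevance by a distance rather than an inner product follow the same pattern.

```python
import numpy as np

K, V, E = 20, 500, 100                   # topics, vocabulary size, embedding dimension (assumed)
topic_emb = np.random.randn(K, E) * 0.1  # one embedding per topic
word_emb = np.random.randn(V, E) * 0.1   # one embedding per word (could be pre-trained)

logits = topic_emb @ word_emb.T          # (K, V): topic-word relevance via inner products
T = np.exp(logits - logits.max(axis=1, keepdims=True))
T = T / T.sum(axis=1, keepdims=True)     # each row is a word distribution on the simplex
print(T.shape, T.sum(axis=1)[:3])        # (20, 500) [1. 1. 1.]
```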
3.2 Correlated and Structured Topics
Topics discovered by conventional topic models like LDA are usually independent. An important research direction is to explicitly capture topic correlations (e.g., pairwise relations between topics) or structures (e.g., tree structures of topics), which has been studied in NTMs as well. Following the framework of VAEs with Householder flows, which enables drawing z from a normal posterior with a non-diagonal covariance matrix, [Liu et al., 2019] develops a more efficient centralised transformation flow for NTMs, which is able to discover pairwise topic correlations through the covariance matrix. In terms of discovering tree-structured topics, [Isonuma et al., 2020] introduces an approach that generates a series of topics from the root to the leaves of a topic tree with a doubly-recurrent neural network [Alvarez-Melis and Jaakkola, 2017]. When applied in topic modelling, the gamma belief network (GBN) [Zhou et al., 2016] can be viewed as a Bayesian model that also discovers tree-structured topics, whose inference is done by Gibbs sampling. [Zhang et al., 2018] introduces the NTM counterpart of GBN, which leverages AVI as the inference process and significantly improves the test time of GBN. [Esmaeili et al., 2019] proposes a structured VAE-NTM that discovers topics with respect to different aspects, specialising in modelling user reviews.
3.3 NTMs with Meta-data
Conventionally, topic models learn from documents in an unsupervised way. However, documents are usually associated with rich sets of meta-data on both the document and word levels, such as document labels, authorship, and pre-trained word embeddings, which can be used to improve topic quality or document representation quality [Zhao et al., 2017b] for supervised tasks (e.g., accuracy of predicting document meta-data). [Card et al., 2018] proposes a VAE-NTM that is able to incorporate various kinds of meta-data, where the BoW data b of a document and its labels (e.g., sentiment) are generated with a joint process conditioned on the document's covariates (e.g., publication year) in the decoder, and the encoder generates z by conditioning on all types of data of the document: BoW, covariates, and labels. Instead of specifying the generative model as a directed network as in most topic models, [Korshunova et al., 2019] introduces the logistic LDA model, whose generative process can be viewed as an undirected graph. In addition to the BoW data, a document's label is also an observed variable in the graph. Following a few factorisation assumptions in the generative process, the paper manually specifies the complete conditional distributions in the graph, with the interactions between the latent variables captured by neural networks. The inference is done by mean-field VI, and z in the model is further trained to be more discriminative for the classification of labels. Given a set of documents with labels, [Wang and Yang, 2020] uses a VAE-NTM to model a document's BoW data and an RNN classifier to predict a document's label based on its sequential data in a joint training process. The paper combines the two models by introducing an attention mechanism in the RNN which takes documents' topics into account. [Bai et al., 2018] proposes to incorporate relational graphs (e.g., citation graphs) of documents into NTMs, where the topic distributions of two documents are fed into a neural network to predict whether they should be connected.
3.4 NTMs for Short Texts
Texts generated on the internet (e.g., tweets, news headlines, and product reviews) can be short, meaning that each individual document contains insufficient word co-occurrence information. This results in degraded performance for both BPTMs and NTMs. To tackle this issue, one can limit a model's capacity and enhance the contextual information of short texts. [Zeng et al., 2018] proposes a combination of an NTM and a memory network for short text classification, in a similar spirit to [Wang and Yang, 2020]. The main difference is that the memory network, instead of an RNN, is responsible for classification, informed by the topic distributions learned by the NTM. To enhance the contextual information of short documents, [Zhu et al., 2018] proposes an NTM whose encoder is a graph neural network (GNN) taking the biterm graph of the words in sampled documents as input and outputting the topic distribution for the whole corpus. The model also learns a decoder that reconstructs the input biterm graph. Despite the novel idea, the model might not be able to generate the topic distribution of an individual document. To limit a short document to focus on several salient topics, [Lin et al., 2020] introduces Archimedean copulas to regularise the discreteness of topic distributions for short texts. [Wu et al., 2020b] introduces an NTM with vector quantisation over z, i.e., a document's topic distribution can only be one vector in the dictionary learned in the vector quantisation process. In addition to maximising the likelihood of the input documents, the paper introduces minimising the likelihood of negatively-sampled "fake documents". Although not directly addressing the short-text problem for topic modelling, [He et al., 2018] introduces NTMs for modelling microblog conversations by leveraging their unique meta-data and structures.

3.5 Sequential NTMs
The flexibility of VAE-NTMs enables leveraging various neural network architectures for the encoder and decoder. With the help of sequential networks like RNNs, unlike other NTMs working with BoW data (i.e., b), sequential NTMs (SNTMs) usually take the word sequences of documents (i.e., s) as input and are able to capture the orders of words, sentences, and topics. [Nallapati et al., 2017] proposes an SNTM working with s, which samples a topic for each sentence of an input document according to z and then generates the word sequence of the sentence with an RNN conditioned on the sentence's topic. Note that z is attached to a document and shared across all its sentences. In [Zaheer et al., 2017], given s, a word's topic is conditioned on its previous word's topic, and this order dependency is captured by a long short-term memory (LSTM) model. Around the same period of time, [Dieng et al., 2017] independently proposes an SNTM whose generative process is similar to [Zaheer et al., 2017], with an additional variable modelling stop words and several variants in the inference process. Recently, [Panwar et al., 2020] proposes to use an LSTM with attention as the encoder taking s as input, where the attention incorporates topical information with a context vector constructed from topic embeddings and document embeddings. [Rezaee and Ferraro, 2020] introduces an SNTM that is related to [Dieng et al., 2017], where instead of marginalising out the discrete topic assignments, the paper proposes to generate them from an RNN model. This helps to avoid using reparameterisation tricks in the variational inference.
3.6 NTMs with Pre-trained Language Models
Recently, pre-trained transformer-based language models such as BERT are becoming ubiquitous in NLP. Pre-trained on large corpora, such models usually have a fine-grained ability to capture aspects of linguistic context, which can be partially represented by contextual word embeddings. These contextual word embeddings can provide richer context information than BoW or sequential data, and have recently been used to assist the training of topic models. Instead of using the BoW or sequential data of a document as the input of the encoder, [Bianchi et al., 2020] proposes to use the document embedding vector generated by Sentence-BERT [Reimers and Gurevych, 2019] and to keep the remaining part of the NTM the same as [Srivastava and Sutton, 2017]. [Thompson and Mimno, 2020] shows that the clusters obtained by running clustering algorithms (e.g., K-means) on the contextual word embeddings generated by various pre-trained models such as BERT and GPT-2 can be interpreted as topics, similar to those discovered by LDA. Sharing a similar idea with [Zeng et al., 2018; Wang and Yang, 2020], [Chaudhary et al., 2020] proposes to combine an NTM with a fine-tuned BERT model by concatenating the topic distribution and the learned BERT embedding of a document as the features for document classification. [Hoyle et al., 2020] proposes an NTM learned by distilling knowledge from a pre-trained BERT model. Specifically, given a document, the BERT model generates a predicted probability for each word; the paper then averages those probabilities to generate a pseudo BoW vector for the document. An NTM following [Card et al., 2018] is used to reconstruct both the actual and pseudo BoW data.
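The distillation target described above can be sketched as follows, assuming a matrix of per-position vocabulary probabilities produced by some pre-trained language model is already available; how these probabilities are obtained and how the pseudo BoW enters the training objective are detailed in [Hoyle et al., 2020] and are not reproduced here.

```python
import numpy as np

def pseudo_bow(word_probs):
    """Average per-position vocabulary distributions into one pseudo BoW vector.

    word_probs: (L, V) array; row t is a probability distribution over the
    vocabulary predicted by a pre-trained language model for position t
    (how these are obtained is model-specific and assumed here).
    """
    return word_probs.mean(axis=0)  # shape (V,), still sums to one

# toy usage: L = 12 positions over a V = 500 vocabulary
probs = np.random.dirichlet(np.ones(500), size=12)
target = pseudo_bow(probs)
print(target.shape, target.sum())
```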
4 NTMs based on Other Frameworks
Besides VAE-NTMs, there are other frameworks for NTMs that also draw research attention.

NTMs based on Autoregressive Models. VAE-NTMs gained popularity after VAEs were invented. Before that, NTMs based on the autoregressive framework had been studied. Specifically, [Larochelle and Lauly, 2012] proposes an autoregressive NTM, named DocNADE, similar in spirit to RNNs, where the predictive probability of a word in a document is conditioned on its hidden state, which is further conditioned on the previous words. A hidden unit can be interpreted as a topic, and a document's hidden states capture its topic distribution. The learning is done by maximising the likelihood of the input documents. Recently, [Gupta et al., 2019a] extends DocNADE by introducing a structure similar to the bi-directional RNN, which allows modelling bi-directional dependencies between words. [Gupta et al., 2019b] combines DocNADE with an LSTM for incorporating external knowledge. [Gupta et al., 2020] extends DocNADE into the lifelong learning setting.

NTMs based on Generative Adversarial Nets. Besides VAEs, generative adversarial networks (GANs) are another popular series of deep generative models. Recently, there have been a few attempts at adapting the GAN framework for topic modelling. [Wang et al., 2019] proposes a GAN generator that takes a random sample of the Dirichlet distribution as a topic distribution z̃ and generates the word distribution of a "fake" document conditioned on z̃. A discriminator is introduced to distinguish between generated word distributions and real word distributions obtained by normalising the TF-IDF vectors of real documents. Although the proposed model is able to discover interpretable topics, it cannot learn topic distributions for documents. To address this issue, [Wang et al., 2020] introduces an additional encoder that learns z for a given document. Moreover, z is concatenated with the word distribution of a document as a real datum, and z̃ is concatenated with the generated word distribution as a fake datum. The discriminator is designed to distinguish between the real and fake ones. [Hu et al., 2020] further extends the above model with a CycleGAN framework.

NTMs based on Graph Neural Networks. Instead of viewing a document as a sequence or bag of words, one can consider graph representations of a corpus of documents. This perspective enables leveraging a variety of GNNs to discover latent topics. As discussed in Section 3.4, [Zhu et al., 2018] views a collection of documents as a biterm word graph, while [Yang et al., 2020; Zhou et al., 2020] model a corpus as a bipartite graph, with documents and words as the two separate parties, connected by the occurrences of words in documents. The former directly uses the word occurrences of documents as the weights of the connections between them, and the latter uses TF-IDF values instead.
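As a small illustration of the second graph view, the sparse TF-IDF document-term matrix can directly serve as the biadjacency matrix between document nodes and word nodes; the toy corpus and the use of scikit-learn below are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural topic models learn topics",
        "graph neural networks model documents",
        "topic models for short documents"]   # toy corpus (assumption)

vectorizer = TfidfVectorizer()
biadjacency = vectorizer.fit_transform(docs)  # (num_docs, num_words) sparse matrix
# Entry (d, w) is the TF-IDF weight of the edge between document node d and word node w;
# a zero entry means there is no edge.
print(biadjacency.shape)
```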
Other NTMs. In addition to the above frameworks, other kinds of NTMs have also been developed. An NTM is developed in [Cao et al., 2015] that takes n-gram embeddings (obtained from word embeddings) and a document index as input and then predicts whether the n-gram is in the document. [Chen and Zaki, 2017] proposes an autoencoder model for NTMs where the neurons in the hidden layer of the autoencoder compete with each other, encouraging them to specialise in recognising specific data patterns. [Peng et al., 2018] proposes an NTM based on matrix factorisation. [Gui et al., 2019] proposes a reinforcement learning framework for NTMs, where the encoder and decoder of an NTM are kept, and in addition, an agent takes actions to select topically coherent words from a document and uses the selected words as the input document for the encoder. The reward to the agent is the topic coherence of the reconstructed document from the decoder. [Nan et al., 2019] adapts the framework of Wasserstein autoencoders (WAEs), which minimises the Wasserstein distance between documents reconstructed by the decoder and real documents, similarly to VAE-NTMs. Recently, topic models based on optimal transport have been developed, such as in [Huynh et al., 2020]. [Zhao et al., 2021] introduces an NTM based on optimal transport, which minimises the optimal transport distance between the topic distribution learned by an encoder and the word distribution of a document.
5 Discussion
This paper is the first survey focusing on the specific area of neural topic models, which is the most popular research trend of topic modelling in the deep learning era. Due to their appealing flexibility, effectiveness, and efficiency, NTMs show promising potential in a range of applications. Having provided an overview of existing approaches to NTMs, we would like to discuss the following challenges and opportunities for NTMs in this section.

Better evaluation. As stated in Section 2.2, evaluation of topic models is challenging. This is mainly because there has not been a unified system of evaluation metrics, and indeed some metrics are not always appropriate, making comparisons across different NTMs harder due to the variety of frameworks, architectures, and datasets. For example, VAE-NTMs calculate perplexity using the ELBO, which is tied to models trained with variational inference and cannot be compared with models without an ELBO. Also, for topic coherence and downstream performance, the evaluation processes, metrics, and settings usually vary across papers. A topic model should be evaluated with comprehensive metrics, including those on topic quality, predictive accuracy, document representation, and downstream applications. It could be tendentious to use only one kind of metric (e.g., topic coherence), which can reflect just one aspect of a model. Therefore, unified platforms and benchmarks for NTMs are needed.

Richer architectures and applications. Compared to BPTMs, NTMs offer better flexibility for representing topic distributions for documents and word distributions for topics. Particularly, projecting documents, topics, and words into a unified embedding space transforms the thinking about the relationships between the three. Given this flexibility, NTMs are expected to be integrated with the most recent neural architectures and play a unique role in richer applications.

More external knowledge. With the development of topic models including NTMs, people have not stopped seeking to leverage external knowledge to help the learning, from document meta-data to pre-trained word embeddings. Recently-proposed pre-trained language models (e.g., BERT) provide more advanced, finer-grained, and higher-level representations of semantic knowledge (e.g., contextual word embeddings over global embeddings), which can be leveraged in NTMs to boost performance. Although the marriage between NTMs and language models is still an emerging area, we expect to see more developments in this important direction.

References
[Alvarez-Melis and Jaakkola, 2017] David Alvarez-Melis and Tommi S Jaakkola. Tree-structured decoding with doubly-recurrent neural networks. In ICLR, 2017.
[Bai et al., 2018] Haoli Bai, Zhuangbin Chen, Michael R Lyu, Irwin King, and Zenglin Xu. Neural relational topic models for scientific article analysis. In CIKM, 2018.
[Bianchi et al., 2020] Federico Bianchi, Silvia Terragni, and Dirk Hovy. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv, 2020.
[Buntine and Mishra, 2014] Wray L Buntine and Swapnil Mishra. Experiments with non-parametric topic models. In SIGKDD, pages 881–890, 2014.
[Buntine, 2009] Wray Buntine. Estimating likelihoods for topic models. In ACML, 2009.
[Burkhardt and Kramer, 2019] Sophie Burkhardt and Stefan Kramer. Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. JMLR, 2019.
[Cao et al., 2015] Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. A novel neural topic model and its supervised extension. In AAAI, 2015.
[Card et al., 2018] Dallas Card, Chenhao Tan, and Noah A Smith. Neural models for documents with metadata. In ACL, 2018.
[Chang et al., 2009] Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M Blei. Reading tea leaves: How humans interpret topic models. In NeurIPS, 2009.
[Chaudhary et al., 2020] Yatin Chaudhary, Pankaj Gupta, Khushbu Saxena, Vivek Kulkarni, Thomas Runkler, and Hinrich Schütze. TopicBERT for energy efficient document classification. In EMNLP, 2020.
[Chen and Zaki, 2017] Yu Chen and Mohammed J Zaki. KATE: K-competitive autoencoder for text. In SIGKDD, 2017.
[Dieng et al., 2017] Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. TopicRNN: A recurrent neural network with long-range semantic dependency. In ICLR, 2017.
[Dieng et al., 2020] Adji B Dieng, Francisco JR Ruiz, and David M Blei. Topic modeling in embedding spaces. TACL, 2020.
[Ding et al., 2018] Ran Ding, Ramesh Nallapati, and Bing Xiang. Coherence-aware neural topic modeling. In EMNLP, 2018.
[Doogan and Buntine, 2021] Caitlin Doogan and Wray Buntine. Topic model or topic twaddle? Re-evaluating semantic interpretability measures. In NAACL, 2021.
[Esmaeili et al., 2019] Babak Esmaeili, Hongyi Huang, Byron Wallace, and Jan-Willem van de Meent. Structured neural topic models for reviews. In AISTATS, 2019.
[Gui et al., 2019] Lin Gui, Jia Leng, Gabriele Pergola, Ruifeng Xu, and Yulan He. Neural topic model with reinforcement learning. In EMNLP-IJCNLP, 2019.
[Gupta et al., 2019a] Pankaj Gupta, Yatin Chaudhary, Florian Buettner, and Hinrich Schütze. Document informed neural autoregressive topic models with distributional prior. In AAAI, 2019.
[Gupta et al., 2019b] Pankaj Gupta, Yatin Chaudhary, Florian Buettner, and Hinrich Schütze. Texttovec: Deep contextualized neural autoregressive topic models of language with distributed compositional prior. In ICLR, 2019.
[Gupta et al., 2020] Pankaj Gupta, Yatin Chaudhary, Thomas Runkler, and Hinrich Schuetze. Neural topic modeling with continual lifelong learning. In ICML, 2020.
[He et al., 2018] Ruifang He, Xuefei Zhang, Di Jin, Longbiao Wang, Jianwu Dang, and Xiangang Li. Interaction-aware topic model for microblog conversations through network embedding and user attention. In COLING, 2018.
[Hoyle et al., 2020] Alexander Miserlis Hoyle, Pranav Goel, and Philip Resnik. Improving neural topic models using knowledge distillation. In EMNLP, 2020.
[Hu et al., 2020] Xuemeng Hu, Rui Wang, Deyu Zhou, and Yuxuan Xiong. Neural topic modeling with cycle-consistent adversarial training. In EMNLP, 2020.
[Huynh et al., 2020] Viet Huynh, He Zhao, and Dinh Phung. OTLDA: A geometry-aware optimal transport approach for topic modeling. NeurIPS, 2020.
[Isonuma et al., 2020] Masaru Isonuma, Junichiro Mori, Danushka Bollegala, and Ichiro Sakata. Tree-structured neural topic model. In ACL, 2020.
[Joo et al., 2020] Weonyoung Joo, Wonsung Lee, Sungrae Park, and Il-Chul Moon. Dirichlet variational autoencoder. Pattern Recognition, 2020.
[Jung and Choi, 2017] Namkyu Jung and Hyeong In Choi. Continuous semantic topic embedding model using variational autoencoder. arXiv, 2017.
[Kingma and Welling, 2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[Korshunova et al., 2019] Iryna Korshunova, Hanchen Xiong, Mateusz Fedoryszak, and Lucas Theis. Discriminative topic modeling with logistic LDA. In NeurIPS, 2019.
[Larochelle and Lauly, 2012] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. NeurIPS, 2012.
[Lau et al., 2014] Jey Han Lau, David Newman, and Timothy Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In ACL, 2014.
[Lin et al., 2019] Tianyi Lin, Zhiyue Hu, and Xin Guo. Sparsemax and relaxed Wasserstein for topic sparsity. In WSDM, 2019.
[Lin et al., 2020] Lihui Lin, Hongyu Jiang, and Yanghui Rao. Copula guided neural topic modelling for short texts. In SIGIR, 2020.
[Liu et al., 2019] Luyang Liu, Heyan Huang, Yang Gao, Yongfeng Zhang, and Xiaochi Wei. Neural variational correlated topic modeling. In WWW, 2019.
[Miao et al., 2016] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In ICML, 2016.
[Miao et al., 2017] Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In ICML, 2017.
[Nalisnick and Smyth, 2017] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. In ICLR, 2017.
[Nallapati et al., 2017] Ramesh Nallapati, Igor Melnyk, Abhishek Kumar, and Bowen Zhou. Sengen: Sentence generating neural variational topic model. arXiv, 2017.
[Nan et al., 2019] Feng Nan, Ran Ding, Ramesh Nallapati, and Bing Xiang. Topic modeling with Wasserstein autoencoders. In ACL, 2019.
[Nguyen et al., 2015] Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. Improving topic models with latent feature word representations. TACL, 2015.
[Ning et al., 2020] Xuefei Ning, Yin Zheng, Zhuxi Jiang, Yu Wang, Huazhong Yang, Junzhou Huang, and Peilin Zhao. Nonparametric topic modeling with neural inference. Neurocomputing, 2020.
[Panwar et al., 2020] Madhur Panwar, Shashank Shailabh, Milan Aggarwal, and Balaji Krishnamurthy. TAN-NTM: Topic attention networks for neural topic modeling. arXiv, 2020.
[Peng et al., 2018] Min Peng, Qianqian Xie, Yanchun Zhang, Hua Wang, Xiuzhen Jenny Zhang, Jimin Huang, and Gang Tian. Neural sparse topical coding. In ACL, 2018.
[Reimers and Gurevych, 2019] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP-IJCNLP, 2019.
[Rezaee and Ferraro, 2020] Mehdi Rezaee and Francis Ferraro. A discrete variational recurrent topic model without the reparametrization trick. NeurIPS, 2020.
[Rezende et al., 2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[Röder et al., 2015] Michael Röder, Andreas Both, and Alexander Hinneburg. Exploring the space of topic coherence measures. In WSDM, 2015.
[Silveira et al., 2018] Denys Silveira, André Carvalho, Marco Cristo, and Marie-Francine Moens. Topic modeling using variational auto-encoders with Gumbel-softmax and logistic-normal mixture distributions. In IJCNN, 2018.
[Srivastava and Sutton, 2017] Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. In ICLR, 2017.
[Teh et al., 2006] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical Dirichlet processes. JASA, 101(476):1566–1581, 2006.
[Thompson and Mimno, 2020] Laure Thompson and David Mimno. Topic modeling with contextualized word representation clusters. arXiv, 2020.
[Tian et al., 2020] Runzhi Tian, Yongyi Mao, and Richong Zhang. Learning VAE-LDA models with rounded reparameterization trick. In EMNLP, 2020.
[Wallach et al., 2009] Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In ICML, 2009.
[Wang and Yang, 2020] Xinyi Wang and Yi Yang. Neural topic model with attention for supervised learning. In AISTATS, 2020.
[Wang et al., 2019] Rui Wang, Deyu Zhou, and Yulan He. ATM: Adversarial-neural topic model. Information Processing & Management, 2019.
[Wang et al., 2020] Rui Wang, Xuemeng Hu, Deyu Zhou, Yulan He, Yuxuan Xiong, Chenchen Ye, and Haiyang Xu. Neural topic modeling with bidirectional adversarial training. In ACL, 2020.
[Williamson et al., 2010] Sinead Williamson, Chong Wang, Katherine A Heller, and David M Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
[Wu et al., 2020a] Jiemin Wu, Yanghui Rao, Zusheng Zhang, Haoran Xie, Qing Li, Fu Lee Wang, and Ziye Chen. Neural mixed counting models for dispersed topic discovery. In ACL, 2020.
[Wu et al., 2020b] Xiaobao Wu, Chunping Li, Yan Zhu, and Yishu Miao. Short text topic modeling with topic distribution quantization and negative sampling decoder. In EMNLP, 2020.
[Yang et al., 2020] Liang Yang, Fan Wu, Junhua Gu, Chuan Wang, Xiaochun Cao, Di Jin, and Yuanfang Guo. Graph attention topic modeling network. In WWW, 2020.
[Zaheer et al., 2017] Manzil Zaheer, Amr Ahmed, and Alexander J Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In ICML, 2017.
[Zeng et al., 2018] Jichuan Zeng, Jing Li, Yan Song, Cuiyun Gao, Michael R Lyu, and Irwin King. Topic memory networks for short text classification. In EMNLP, 2018.
[Zhang et al., 2018] Hao Zhang, Bo Chen, Dandan Guo, and Mingyuan Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018.
[Zhao et al., 2017a] He Zhao, Lan Du, and Wray Buntine. A word embeddings informed focused topic model. In ACML, 2017.
[Zhao et al., 2017b] He Zhao, Lan Du, Wray Buntine, and Gang Liu. MetaLDA: A topic model that efficiently incorporates meta information. In ICDM, 2017.
[Zhao et al., 2018a] He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou. Dirichlet belief networks for topic structure learning. In NeurIPS, 2018.
[Zhao et al., 2018b] He Zhao, Lan Du, Wray Buntine, and Mingyuan Zhou. Inter and intra topic structure learning with word embeddings. In ICML, 2018.
[Zhao et al., 2020] He Zhao, Piyush Rai, Lan Du, Wray Buntine, Dinh Phung, and Mingyuan Zhou. Variational autoencoders for sparse and overdispersed discrete data. In AISTATS, 2020.
[Zhao et al., 2021] He Zhao, Dinh Phung, Viet Huynh, Trung Le, and Wray Buntine. Neural topic model via optimal transport. In ICLR, 2021.
[Zhou et al., 2016] Mingyuan Zhou, Yulai Cong, and Bo Chen. Augmentable gamma belief networks. JMLR, 2016.
[Zhou et al., 2020] Deyu Zhou, Xuemeng Hu, and Rui Wang. Neural topic modeling by incorporating document relationship graph. In EMNLP, 2020.
[Zhu et al., 2018] Qile Zhu, Zheng Feng, and Xiaolin Li. GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. In EMNLP, 2018.