Sievert, C., & Shirley, K. E. LDAvis: A Method for Visualizing and Interpreting Topics.
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pages 63–70,
Baltimore, Maryland, USA, June 27, 2014.
© 2014 Association for Computational Linguistics
Figure 1: The layout of LDAvis, with the global topic view on the left, and the term barcharts (with Topic 34 selected) on the right. Linked selections allow users to reveal aspects of the topic-term relationships compactly.
just this determination. A topic in LDA is a multinomial distribution over the (typically thousands of) terms in the vocabulary of the corpus. To interpret a topic, one typically examines a ranked list of the most probable terms in that topic, using anywhere from three to thirty terms in the list. The problem with interpreting topics this way is that common terms in the corpus often appear near the top of such lists for multiple topics, making it hard to differentiate the meanings of these topics.

Bischof and Airoldi (2012) propose ranking terms for a given topic in terms of both the frequency of the term under that topic as well as the term’s exclusivity to the topic, which accounts for the degree to which it appears in that particular topic to the exclusion of others. We propose a similar measure that we call the relevance of a term to a topic that allows users to flexibly rank terms in order of usefulness for interpreting topics. We discuss our definition of relevance, and its graphical interpretation, in detail in Section 3.1. We also present the results of a user study conducted to determine the optimal tuning parameter in the definition of relevance to aid the task of topic interpretation in Section 3.2, and we describe how we incorporate relevance into our interactive visualization in Section 4.

2 Related Work

Much work has been done recently regarding the interpretation of topics (i.e. measuring topic “coherence”) as well as visualization of topic models.

2.1 Topic Interpretation and Coherence

It is well-known that the topics inferred by LDA are not always easily interpretable by humans. Chang et al. (2009) established via a large user study that standard quantitative measures of fit, such as those summarized by Wallach et al. (2009), do not necessarily agree with measures of topic interpretability by humans. Ramage et al. (2009) assert that “characterizing topics is hard” and describe how using the top-k terms for a given topic might not always be best, but offer few concrete alternatives.

AlSumait et al. (2009), Mimno et al. (2011), and Chuang et al. (2013b) develop quantitative methods for measuring the interpretability of topics based on experiments with data sets that come with some notion of topical ground truth, such as document metadata or expert-created topic labels. These methods are useful for understanding, in a global sense, which topics are interpretable (and why), but they don’t specifically attempt to aid the user in interpreting individual topics.

Blei and Lafferty (2009) developed “Turbo Topics”, a method of identifying n-grams within LDA-inferred topics that, when listed in decreasing order of probability, provide users with extra information about the usage of terms within topics. This two-stage process yields good results on experimental data, although the resulting output is still simply a ranked list containing a mixture of terms and n-grams, and the usefulness of the method for topic interpretation was not tested in a user study.

Newman et al. (2010) describe a method for ranking terms within topics to aid interpretability called Pointwise Mutual Information (PMI) ranking. Under PMI ranking of terms, the ten most probable terms within a topic are ranked in decreasing order of approximately how often each occurs in close proximity to the nine other most probable terms from that topic in some large, external “reference” corpus, such as Wikipedia or Google n-grams. Although this method correlated highly with human judgments of term importance within topics, it does not easily generalize to topic models fit to corpora that don’t have a readily available external source of word co-occurrences.

In contrast, Taddy (2011) uses an intrinsic measure to rank terms within topics: a quantity called lift, defined as the ratio of a term’s probability within a topic to its marginal probability across the corpus. This generally decreases the rankings of globally frequent terms, which can be helpful. We find that it can be noisy, however, by giving high rankings to very rare terms that occur in only a single topic, for instance. While such terms may contain useful topical content, if they are very rare the topic may remain difficult to interpret.

Finally, Bischof and Airoldi (2012) develop and implement a new statistical topic model that infers both a term’s frequency and its exclusivity, that is, the degree to which its occurrences are limited to only a few topics. They introduce a univariate measure called a FREX score (“FRequency and EXclusivity”) which is a weighted harmonic mean of a term’s rank within a given topic with respect to frequency and exclusivity, and they recommend it as a way to rank terms to aid topic interpretation. We propose a similar method that is a weighted average of the logarithms of a term’s probability and its lift, and we justify it with a user study and incorporate it into our interactive visualization.

2.2 Topic Model Visualization Systems

A number of visualization systems for topic models have been developed in recent years. Several of them focus on allowing users to browse documents, topics, and terms to learn about the relationships between these three canonical topic model units (Gardner et al., 2010; Chaney and Blei, 2012; Snyder et al., 2013). These browsers typically use lists of the most probable terms within topics to summarize the topics, and the visualization elements are limited to barcharts or word clouds of term probabilities for each topic, pie charts of topic probabilities for each document, and/or various barcharts or scatterplots related to document metadata. Although these tools can be useful for browsing a corpus, we seek a more compact visualization, with the narrower focus of quickly and easily understanding the individual topics themselves (without necessarily visualizing documents).

Chuang et al. (2012b) develop such a tool, called “Termite”, which visualizes the set of topic-term distributions estimated in LDA using a matrix layout. The authors introduce two measures of the usefulness of terms for understanding a topic model: distinctiveness and saliency. These quantities measure how much information a term conveys about topics by computing the Kullback-Leibler divergence between the distribution of topics given the term and the marginal distribution of topics (distinctiveness), optionally weighted by the term’s overall frequency (saliency). The authors recommend saliency as a thresholding method for selecting which terms are included in the visualization, and they further use a seriation method for ordering the most salient terms to highlight differences between topics.

Termite is a compact, intuitive interactive visualization of the topics in a topic model, but by only including terms that rank high in saliency or distinctiveness, which are global properties of terms, it is restricted to providing a global view of the model, rather than allowing a user to deeply inspect individual topics by visualizing a potentially different set of terms for every single topic. In fact, Chuang et al. (2013a) describe the use of a

[Figure residue: scatterplot panel titled “Topic 29 of 50 (20 Newsgroups data)”; the only recoverable point labels are “exhaust”, “plastic”, “oil”, and “lights”.]
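The intrinsic term-ranking quantities discussed in Section 2 can be illustrated with a small sketch. It assumes a topic-term matrix `phi` (rows are topics, entries strictly positive so that logarithms are defined) and a vector of marginal topic probabilities `topic_weights`. The function names, the weight name `lam`, and the toy numbers are hypothetical and chosen only for illustration; this is not the LDAvis implementation. The sketch computes lift, the weighted average of log probability and log lift, and saliency (distinctiveness weighted by overall term probability).

```python
import math

def marginal_term_probs(phi, topic_weights):
    """p(w) = sum_k p(k) * phi[k][w]: corpus-wide probability of each term."""
    n_terms = len(phi[0])
    return [sum(p_k * row[w] for p_k, row in zip(topic_weights, phi))
            for w in range(n_terms)]

def lift(phi, topic_weights, k):
    """Lift of each term in topic k: phi[k][w] / p(w)."""
    p_w = marginal_term_probs(phi, topic_weights)
    return [phi[k][w] / p_w[w] for w in range(len(p_w))]

def relevance(phi, topic_weights, k, lam):
    """Weighted average of log probability (weight lam) and log lift (weight 1 - lam)."""
    p_w = marginal_term_probs(phi, topic_weights)
    return [lam * math.log(phi[k][w]) + (1 - lam) * math.log(phi[k][w] / p_w[w])
            for w in range(len(p_w))]

def saliency(phi, topic_weights):
    """p(w) times distinctiveness, where distinctiveness is KL(p(topic | w) || p(topic))."""
    p_w = marginal_term_probs(phi, topic_weights)
    scores = []
    for w, pw in enumerate(p_w):
        posterior = [p_k * row[w] / pw for p_k, row in zip(topic_weights, phi)]
        distinctiveness = sum(q * math.log(q / p_k)
                              for q, p_k in zip(posterior, topic_weights) if q > 0)
        scores.append(pw * distinctiveness)
    return scores

def rank_terms(scores, vocab):
    """Terms sorted by decreasing score."""
    order = sorted(range(len(vocab)), key=lambda w: scores[w], reverse=True)
    return [vocab[w] for w in order]

# Toy model (made-up numbers): two equally prevalent topics, three terms.
vocab = ["the", "court", "cruel"]
phi = [[0.70, 0.25, 0.05],
       [0.90, 0.09, 0.01]]
topic_weights = [0.5, 0.5]

for lam in (1.0, 0.5, 0.0):
    print(lam, rank_terms(relevance(phi, topic_weights, 0, lam), vocab))
print("saliency:", rank_terms(saliency(phi, topic_weights), vocab))
```

With these toy numbers, a weight of 1 ranks the globally common term first (pure probability), while a weight of 0 ranks the rare, high-lift term first (pure lift), which mirrors the trade-off described above.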
rences) in documents from a single Newsgroup, such as Topic 38, which was the estimated topic for 15,705 tokens in the corpus, 14,233 of which

[Figure residue: panel titled “Trial data for middle tercile of topics”.]
1, the estimated proportions of correct responses were closer to 53% and 63%, respectively. We view this as evidence that ranking terms according to relevance, where λ < 1 (i.e. not strictly in decreasing order of probability), can improve topic interpretability.

Note that in our experiment, we used the collection of single-posted 20 Newsgroups documents to define our “ground truth” data. An alternative method for collecting “ground truth” data would have been to recruit experts to label topics from an LDA model. We decided against this option because doing so would present a classic “chicken-or-egg” problem: If we use expert-labeled topics in an experiment to learn how to summarize topics so that they can be interpreted (i.e. “labeled”), we would only re-learn the way that our experts were instructed, or allowed, to label the topics in the first place! If, for instance, the experts were presented with a ranked list of the most probable terms for each topic, this would influence the interpretations and labels they give to the topics, and the experimental result would be the circular conclusion that ranking terms by probability allows users to recover the “expert” labels most easily. To avoid this, we felt strongly that we should use data in which documents have metadata associated with them. The 20 Newsgroups data provides an externally validated source of topic labels, in the sense that the labels were presented to users (in the form of Newsgroup names), and users subsequently filled in the content. It represents, essentially, a crowd-sourced collection of tokens, or content, for a certain set of topic labels.

4 The LDAvis System

Our interactive, web-based visualization system, LDAvis, has two core functionalities that enable users to understand the topic-term relationships in a fitted LDA model, and a number of extra features that provide additional perspectives on the model.

First and foremost, LDAvis allows one to select a topic to reveal the most relevant terms for that topic. In Figure 1, Topic 34 is selected, and its 30 most relevant terms (given λ = 0.34, in this case) populate the barchart to the right (ranked in order of relevance from top to bottom). The widths of the gray bars represent the corpus-wide frequencies of each term, and the widths of the red bars represent the topic-specific frequencies of each term. A slider allows users to change the value of λ, which can alter the rankings of terms to aid topic interpretation. By default, λ is set to 0.6, as suggested by our user study in Section 3.2. If λ = 1, terms are ranked solely by φ_kw, which implies the red bars would be sorted from widest (at the top) to narrowest (at the bottom). By comparing the widths of the red and gray bars for a given term, users can quickly understand whether a term is highly relevant to the selected topic because of its lift (a high ratio of red to gray), or its probability (absolute width of red). The top 3 most relevant terms in Figure 1 are “law”, “court”, and “cruel”. Note that “law” is a common term which is generated by Topic 34 in about 40% of its corpus-wide occurrences, whereas “cruel” is a relatively rare term with very high lift in Topic 34; it occurs almost exclusively in this topic. Such properties of the topic-term relationships are readily visible in LDAvis for every topic.

In the left panel, two visual features provide a global perspective of the topics. First, the areas of the circles are proportional to the relative prevalences of the topics in the corpus. In the 50-topic model fit to the 20 Newsgroups data, the first three topics comprise 12%, 9%, and 6% of the corpus, and all contain common, non-specific terms (although there are interesting differences: Topic 2 contains formal debate-related language such as “conclusion”, “evidence”, and “argument”, whereas Topic 3 contains slang conversational language such as “kinda”, “like”, and “yeah”). In addition to visualizing topic prevalence, the left panel shows inter-topic differences. The default for computing inter-topic distances is Jensen-Shannon divergence, although other metrics are enabled. By default, the set of inter-topic distances is scaled using Principal Components, but other algorithms are also enabled.

The second core feature of LDAvis is the ability to select a term (by hovering over it) to reveal its conditional distribution over topics. This distribution is visualized by altering the areas of the topic circles such that they are proportional to the term-specific frequencies across the corpus. This allows the user to verify, as discussed in Chuang et al. (2012a), whether the multidimensional scaling of topics has faithfully clustered similar topics in two-dimensional space. For example, in Figure 4, the term “file” is selected. In the majority of this term’s occurrences, it is drawn from one of several topics located in the upper left-hand region of the
Figure 4: The user has chosen to segment the fifty topics into four clusters, and has selected the green
cluster to populate the barchart with the most relevant terms for that cluster. Then, the user hovered over
the ninth bar from the top, “file”, to display the conditional distribution over topics for this term.
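As a rough illustration of the quantities behind the global topic view, the sketch below computes pairwise Jensen-Shannon divergences between topics, circle radii whose areas are proportional to given values, and the conditional distribution over topics for a selected term. The function names and toy numbers are hypothetical, the code is not taken from LDAvis, and the step that scales the distance matrix down to two dimensions is deliberately left out.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distance_matrix(phi):
    """Pairwise inter-topic distances from the topic-term distributions."""
    n = len(phi)
    return [[js_divergence(phi[i], phi[j]) for j in range(n)] for i in range(n)]

def topic_given_term(phi, topic_weights, w):
    """p(topic | term w), proportional to p(topic) * phi[topic][w]."""
    joint = [p_k * row[w] for p_k, row in zip(topic_weights, phi)]
    total = sum(joint)
    return [j / total for j in joint]

def circle_radii(values, scale=1.0):
    """Radii chosen so that each circle's area is proportional to its value."""
    return [scale * math.sqrt(v / math.pi) for v in values]

# Toy model (made-up numbers): two equally prevalent topics, three terms.
phi = [[0.70, 0.25, 0.05],
       [0.90, 0.09, 0.01]]
topic_weights = [0.5, 0.5]

print(distance_matrix(phi))
# Default view: circle areas follow topic prevalence.
print(circle_radii(topic_weights))
# Term selected: circle areas follow the term's distribution over topics.
print(circle_radii(topic_given_term(phi, topic_weights, 2)))
```

Scaling by the term's corpus-wide frequency instead of normalizing to a distribution would change every circle by the same factor, so the relative areas are the same either way.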
global topic view. Upon inspection, this group of topics can be interpreted broadly as a discussion of computer hardware and software. This verifies, to some extent, their placement, via multidimensional scaling, into the same two-dimensional region. It also suggests that the term “file” used in this context refers to a computer file. However, there is also conditional probability mass for the term “file” on Topic 34. As shown in Figure 1, Topic 34 can be interpreted as discussing the criminal punishment system where “file” refers to court filings. Similar discoveries can be made for any term that exhibits polysemy (such as “drive” appearing in computer- and automobile-related topics, for example).

Beyond its within-browser interaction capability using D3 (Bostock et al., 2011), LDAvis leverages the R language (R Core Team, 2014) and specifically, the shiny package (RStudio, 2014), to allow users to easily alter the topical distance measurement as well as the multidimensional scaling algorithm to produce the global topic view. In addition, there is an option to apply k-means clustering to the topics (as a function of their two-dimensional locations in the global topic view). This is merely an effort to facilitate semantic zooming in an LDA model with many topics where ‘after-the-fact’ clustering may be an easier way to estimate clusters of topics, rather than fitting a hierarchical topic model (Blei et al., 2003), for example. Selecting a cluster of topics (by clicking the Voronoi region corresponding to the cluster) reveals the most relevant terms for that cluster of topics, where the term distribution of a cluster of topics is defined as the weighted average of the term distributions of the individual topics in the cluster. In Figure 4, the green cluster of topics is selected, and the most relevant terms, displayed in the barchart on the right, are predominantly related to computer hardware and software.

5 Discussion

We have described a web-based, interactive visualization system, LDAvis, that enables deep inspection of topic-term relationships in an LDA model, while simultaneously providing a global view of the topics, via their prevalences and similarities to each other, in a compact space. We
also propose a novel measure, relevance, by which to rank terms within topics to aid in the task of topic interpretation, and we present results from a user study that show that ranking terms in decreasing order of probability is suboptimal for topic interpretation. The LDAvis visualization system (including the user study data) is currently available as an R package on GitHub:

For future work, we anticipate performing a larger user study to further understand how to facilitate topic interpretation in fitted LDA models, including a comparison of multiple methods, such as ranking by Turbo Topics (Blei and Lafferty, 2009) or FREX scores (Bischof and Airoldi, 2012), in addition to relevance. We also note the need to visualize correlations between topics, as this can provide insight into what is happening on the document level without actually displaying entire documents. Last, we seek a solution to the problem of visualizing a large number of topics (say, 100 to 500 topics) in a compact way.

Jason Chuang, Christopher D. Manning and Jeffrey Heer. 2012b. Termite: Visualization Techniques for Assessing Textual Topic Models. AVI.

Jason Chuang, Yuening Hu, Ashley Jin, John D. Wilkerson, Daniel A. McFarland, Christopher D. Manning and Jeffrey Heer. 2013a. Document Exploration with Topic Modeling: Designing Interactive Visualizations to Support Effective Analysis Workflows. NIPS Workshop on Topic Models: Computation, Application, and Evaluation.

Jason Chuang, Sonal Gupta, Christopher D. Manning and Jeffrey Heer. 2013b. Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment.

Matthew J. Gardner, Joshua Lutes, Jeff Lund, Josh Hansen, Dan Walker, Eric Ringger, and Kevin Seppi. 2010. The topic browser: An interactive tool for browsing topic models. NIPS Workshop on Challenges of Data Visualization.

Brynjar Gretarsson, John O’Donovan, Svetlin Bostandjieb, Tobias Hollerer, Arthur Asuncion, David Newman, and Padhraic Smyth. 2011. TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling. ACM Transactions on Intelligent Systems and Technology, pp 1-26.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding
