Visualizing Topic Models
[Figure 2 component labels: associated topics; associated
documents; related documents (Eq. 4); associated terms;
related topics (Eq. 1); terms present in the document.]
Figure 2: A topic page and document page from the navigator of Wikipedia. We have labeled how we compute each component of these
pages from the output of the topic modeling algorithm.
Visualizing the Elements of a Topic Model
The navigator has two main types of pages: one for
displaying discovered topics and another for the documents.
There are also overview pages, which illustrate the overall
structure of the corpus; they are a launching point for
browsing.
These pages display the corpus and the discovered
structure. But this is not sufficient: we also use the topic
model inference to find connections between these
visualizations. With these connections, a user can move
between summary and document-level presentations.
Hence, in our visualization every element on a page links
a user to a new view. With these links, a user can easily
traverse the network of relationships in a given corpus. For
example, from a topic page a user can link to view a specific
document. This document might link to several topics, each
of which the user can explore:
[Inline figure: the topic {son, year, death} and the
document "Juris Doctor".]
Finally, related topics are also listed with corresponding
links, allowing a user to explore the high-level topic space.
Topic similarity is not inferred directly with LDA, but can
be computed from the topic distributions that it discovers.
Related topics are shown in the right column of the topic
page, ordered by the pairwise topic dissimilarity score
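$$\xi_{ij} = \sum_{v \in V} \mathbf{1}_{\mathbb{R}_{\neq 0}}(\beta_{iv}) \, \mathbf{1}_{\mathbb{R}_{\neq 0}}(\beta_{jv}) \, \left| \log(\beta_{iv}) - \log(\beta_{jv}) \right| \qquad (1)$$
where the indicator function $\mathbf{1}_A(x)$ is defined as
1 if $x \in A$ and 0 otherwise. This is related to the
average log odds ratio of the probability of each term in the
two topics. This metric finds topics that have similar
distributions.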
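As a rough illustration of Eq. 1, the sketch below computes
the pairwise score with NumPy. It assumes the topic-term
probabilities are available as a K x V array; the names
`beta` and `topic_dissimilarity` are hypothetical and do not
come from the released code.

```python
import numpy as np

def topic_dissimilarity(beta):
    """Pairwise Eq. 1 score for a K x V array of topic-term probabilities.

    `beta` is an assumed input layout: one row per topic, one column per term.
    """
    beta = np.asarray(beta, dtype=float)
    nonzero = beta != 0  # the indicator: True where the probability is nonzero
    # Zero entries are masked out below, so replace them with 1 (log 1 = 0)
    # purely to avoid log(0) warnings.
    log_beta = np.log(np.where(nonzero, beta, 1.0))
    # both[i, j, v] is True only when term v is nonzero in topics i and j.
    both = nonzero[:, None, :] & nonzero[None, :, :]
    diff = np.abs(log_beta[:, None, :] - log_beta[None, :, :])
    # Sum the absolute log differences over the vocabulary.
    return np.where(both, diff, 0.0).sum(axis=2)
```

The intermediate arrays have shape K x K x V, which is fine
for a few dozen topics; a larger model would probably need to
loop over topic pairs instead.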
Continuing with the topic from Figure 1, this metric
scores the following topics highly.
{son, year, death}
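$$\sigma_{ij} = \sum_{k \in K} \mathbf{1}_{\mathbb{R}_{\neq 0}}(\theta_{ik}) \, \mathbf{1}_{\mathbb{R}_{\neq 0}}(\theta_{jk}) \, \left| \log(\theta_{ik}) - \log(\theta_{jk}) \right|. \qquad (2)$$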
This metric says that a document is similar to other docu-
ments that exhibit a similar combination of topics.
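One way to turn Eq. 2 into a "related documents" list is
sketched below. It assumes the per-document topic proportions
are stored as a D x K NumPy array and that lower scores mean
more closely related documents; the names `theta` and
`related_documents` are hypothetical.

```python
import numpy as np

def related_documents(theta, i, n=5):
    """Indices of the n documents with the lowest Eq. 2 score against document i.

    `theta` is an assumed D x K array of per-document topic proportions.
    """
    theta = np.asarray(theta, dtype=float)
    nonzero = theta != 0
    # Masked entries never contribute, so log(1) = 0 stands in for log(0).
    log_theta = np.log(np.where(nonzero, theta, 1.0))
    # Keep only topics with nonzero proportion in both document i and document j.
    both = nonzero & nonzero[i]
    scores = np.where(both, np.abs(log_theta - log_theta[i]), 0.0).sum(axis=1)
    scores[i] = np.inf  # never list a document as related to itself
    return np.argsort(scores, kind="stable")[:n]
```

Because a term contributes nothing when either proportion is
zero, two documents with no topics in common would score 0
under this formula. With smoothed LDA estimates, exact zeros
should be rare.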
Overview Pages. Overview pages are the entry points to
exploring the corpus. In the simplest of these pages, we rank
the topics by their relative presence in the corpus and
display each in a bar with width proportional to the topic's
presence score $p_k$: the sum of the topic proportions for a
given topic over all documents,
$$p_k = \sum_{d \in D} \theta_{dk}. \qquad (3)$$
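The presence score in Eq. 3 is a column sum, as the short
sketch below shows. The D x K array `theta` and its toy
values, the helper names, and the `max_width` scaling are all
illustrative assumptions, not part of the released code.

```python
import numpy as np

def topic_presence(theta):
    """Eq. 3: sum each topic's proportion over all documents (theta is D x K)."""
    return np.asarray(theta, dtype=float).sum(axis=0)

def bar_widths(presence, max_width=100.0):
    """Scale scores so the most present topic gets max_width (arbitrary units)."""
    presence = np.asarray(presence, dtype=float)
    return max_width * presence / presence.max()

# Toy proportions for three documents and three topics (made-up numbers).
theta = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.6, 0.2]])
presence = topic_presence(theta)
ranking = np.argsort(-presence)  # topic indices, most present first
widths = bar_widths(presence)
```

Here the second topic has the largest presence, so it is
ranked first and receives the full bar width.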
Examples of this view can be found in Figure 3. From this
figure, we see that many documents are related to the topic
{household, population, female}; this is consistent with our
observations of the corpus, which includes many Wikipedia
articles on individual cities, towns, and townships.
Implementation and Study
We provide an open source implementation of the topic
modeling visualization. There are three steps in applying our
method to visualizing a corpus: (1) run LDA inference on
the corpus to obtain posterior expectations of the latent
variables, (2) generate a database, and (3) create the web
pages to navigate the corpus.
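Step (2) could look roughly like the sketch below, which
loads posterior expectations into SQLite with Python's
standard `sqlite3` module. The schema, table names, and
function names are hypothetical; they are not the schema used
by the released code.

```python
import sqlite3
import numpy as np

def build_database(theta, beta, vocab, titles, path=":memory:"):
    """Store document-topic and topic-term scores in a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT);
        CREATE TABLE doc_topic (doc INTEGER, topic INTEGER, score REAL);
        CREATE TABLE topic_term (topic INTEGER, term INTEGER, score REAL);
    """)
    conn.executemany("INSERT INTO docs VALUES (?, ?)", list(enumerate(titles)))
    conn.executemany("INSERT INTO terms VALUES (?, ?)", list(enumerate(vocab)))
    # Cast NumPy scalars to plain Python types before handing them to sqlite3.
    conn.executemany(
        "INSERT INTO doc_topic VALUES (?, ?, ?)",
        [(int(d), int(k), float(s)) for (d, k), s in np.ndenumerate(np.asarray(theta))])
    conn.executemany(
        "INSERT INTO topic_term VALUES (?, ?, ?)",
        [(int(k), int(v), float(s)) for (k, v), s in np.ndenumerate(np.asarray(beta))])
    conn.commit()
    return conn

def top_terms(conn, topic, n=3):
    """Return the n highest-scoring terms for a topic, e.g. for a topic page."""
    rows = conn.execute(
        "SELECT terms.term FROM topic_term "
        "JOIN terms ON terms.id = topic_term.term "
        "WHERE topic_term.topic = ? "
        "ORDER BY topic_term.score DESC LIMIT ?", (topic, n))
    return [row[0] for row in rows]
```

With the scores in a database, step (3) can generate each
page from simple ordered queries such as `top_terms`.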
Any open-source LDA package can be used; we used
LDA-C (footnote 3). We implemented the remainder of the
pipeline in Python. It can be found at
http://code.google.com/p/tmve.
We created three examples of navigators using our
visualization. We analyzed 100,000 Wikipedia articles with
a 50-topic LDA model (http://bit.ly/wiki100). We analyzed
61,000 US Federal Cases (footnote 4) with a 30-topic model
(http://bit.ly/case-demo). We analyzed 3,000 New York
Times articles with a 20-topic model
(http://bit.ly/nyt-demo). A page from each of these three
demos can be seen in Figure 3. One week after we released
the source code, we received links to a navigator of arXiv
(a large archive of scientific preprints) that was generated
using our code; it is at http://bit.ly/arxiv-demo.
Footnote 3: http://www.cs.princeton.edu/~blei/lda-c
Footnote 4: http://www.infochimps.com/datasets/text-of-us-federal-cases
Preliminary User Study. We conducted a preliminary user
study with seven individuals, asking for qualitative
feedback on the Wikipedia navigator. The reviews were
positive, all noting the value of presenting the high-level
structure of a corpus with its low-level content. One
reviewer felt it was organized similarly to how he thinks.
Six individuals responded that they discovered connections
that would have remained obscure by using Wikipedia
traditionally. For example, one user explored articles about
economics and discovered countries with inflation or
deflation problems of which he had previously been unaware.
Acknowledgements
David M. Blei is supported by ONR 175-6343, NSF CAREER
0745520, AFOSR 09NL202, the Alfred P. Sloan Foundation,
and a grant from Google.