2016 Acl Doclabel
@inproceedings{Poursabzi-Sangdeh:Boyd-Graber:Findlater:Seppi-2016,
Author = {Forough Poursabzi-Sangdeh and Jordan Boyd-Graber and Leah Findlater and Kevin Seppi},
Url = {docs/2016_acl_doclabel.pdf},
Booktitle = {Proceedings of the Association for Computational Linguistics},
Location = {Berlin, Germany},
Year = {2016},
Title = {ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling},
}
ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling

Forough Poursabzi-Sangdeh, Computer Science, University of Colorado, [email protected]
Jordan Boyd-Graber, Computer Science, University of Colorado, [email protected]
Figure 1: Our annotation system. Initially, the user sees lists of documents organized in either a list format
or grouped into topics (only two topics are shown here; users can scroll to additional documents). The
user can click on a document to label it.
While purity, RI, and NMI are all normalized within [0, 1] (higher is better), they measure different aspects of the agreement between two clusterings: purity is sensitive to the number of clusters, and it is not symmetric; NMI measures the shared information between two clusterings; and because RI can compare different numbers of clusters, it sometimes gives high scores for clusterings by chance. Given the diverse nature of these metrics, if a labeling does well in all three of them, we can be relatively confident that it is not a degenerate solution that games the system.

Figure 4: Synthetic results on US Congressional Bills and 20 Newsgroups data sets. Topic models help guide annotation attention to diverse segments of the data.

5 Synthetic Experiments
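As a concrete reference for the three evaluation metrics used in these experiments, a minimal sketch in Python (the NMI normalization shown, the geometric mean of the entropies, is one common choice among several):

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt

def purity(pred, gold):
    """Fraction of documents whose induced cluster's majority gold label
    matches their own gold label (higher is better; not symmetric)."""
    clusters = {}
    for p, g in zip(pred, gold):
        clusters.setdefault(p, []).append(g)
    return sum(max(Counter(gs).values()) for gs in clusters.values()) / len(pred)

def rand_index(pred, gold):
    """Fraction of document pairs on which the two clusterings agree
    (placed together in both, or apart in both)."""
    agree = sum((pred[i] == pred[j]) == (gold[i] == gold[j])
                for i, j in combinations(range(len(pred)), 2))
    return agree / (len(pred) * (len(pred) - 1) // 2)

def nmi(pred, gold):
    """Mutual information between the two clusterings, normalized by the
    geometric mean of their entropies."""
    n = len(pred)
    cp, cg = Counter(pred), Counter(gold)
    joint = Counter(zip(pred, gold))
    mi = sum(c / n * log(c * n / (cp[a] * cg[b])) for (a, b), c in joint.items())
    hp = -sum(c / n * log(c / n) for c in cp.values())
    hg = -sum(c / n * log(c / n) for c in cg.values())
    # degenerate single-cluster case: fall back to 1.0 by convention
    return mi / sqrt(hp * hg) if hp and hg else 1.0
```

A perfect labeling scores 1.0 on all three; a labeling independent of the gold clusters scores 0 NMI but can still receive non-trivial purity and RI, which is why the paper reports all three together.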
Before running a user study, we test our hypothesis that topic model overviews and active learning selection improve final cluster quality compared to standard baselines: list overview and random selection. We simulate the four conditions on Congressional Bills and 20 Newsgroups.

Since we believe annotators create more specific labels compared to the gold labels, we use sub-labels as simulated user labels and labels as gold labels (we give examples of labels and sub-labels in Section 4.1). We start with two randomly selected documents that have different sub-labels, assign the corresponding sub-labels, then add more labels based on each condition's preference function (Section 3.3). We follow the condition's preference function and incrementally add labels until 100 documents have been labeled (100 documents are representative of what a human can label in about an hour). Given these labels, we compute purity, RI, and NMI over time. This procedure is repeated fifteen times (to account for the randomness of initial document selections and the preference functions with randomness).[11]

Synthetic results validate our hypothesis that topic overview and active learning selection can help label a corpus more efficiently (Figure 4). LA shows early gains, but tends to falter eventually compared to both topic overview and topic overview combined with active learning selection (TR and TA).

However, these experiments do not validate ALTO. Not all documents require the same time or effort to label, and active learning focuses on the hardest examples, which may confuse users. Thus, we need to evaluate how effectively actual users annotate a collection's documents.

6 User Study

Following the synthetic experiments, we conduct a user study with forty participants to evaluate ALTO (TA condition) against three alternatives that lack topic overview (LA), active learning selection (TR), or both (LR) (Sections 6.1 and 6.2). Then, we conduct a crowdsourced study to compare the overall

[11] Synthetic experiment data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/synthetic_exp
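The simulation procedure above (seed with two documents of different sub-labels, then repeatedly take the document preferred by the condition) can be sketched as follows; `prefer` is a stand-in for the per-condition preference functions of Section 3.3, whose exact signature is an assumption here:

```python
import random

def simulate_condition(docs, sub_label, prefer, budget=100, seed=0):
    """One synthetic run: seed with two random documents that have
    different sub-labels, then label whichever document the condition's
    preference function ranks highest, until the budget is exhausted.
    `sub_label` maps a document to its (simulated user) sub-label, and
    `prefer(labeled, unlabeled)` returns the next document to label."""
    rng = random.Random(seed)
    labeled = {}
    # seed: two randomly selected documents with different sub-labels
    first = rng.choice(docs)
    second = rng.choice([d for d in docs if sub_label[d] != sub_label[first]])
    for d in (first, second):
        labeled[d] = sub_label[d]
    while len(labeled) < min(budget, len(docs)):
        unlabeled = [d for d in docs if d not in labeled]
        pick = prefer(labeled, unlabeled)
        labeled[pick] = sub_label[pick]  # simulated user assigns the sub-label
    return labeled
```

Running this fifteen times with different seeds, and scoring the labeled set after each addition, reproduces the shape of the experiment: purity, RI, and NMI traced over the number of documents labeled.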
Figure 5: User study results on US Congressional Bills dataset. Active learning selection helps initially, but the combination of active learning selection and topic model overview has highest quality labels by the end of the task.

effectiveness of the label set generated by the participants in the four conditions (Section 6.3).

6.1 Method

We use the freelance marketplace Upwork to recruit online participants.[12] We require participants to have more than 90% job success on Upwork, English fluency, and US residency. Participants are randomly assigned to one of the four conditions and we recruited ten participants per condition. Participants completed a demographic questionnaire, viewed a video of task instructions, and then interacted with the system and labeled documents until satisfied with the labels or forty minutes had elapsed.[13] The session ended with a survey, where participants rated mental, physical, and temporal demand, and performance, effort, and frustration on 20-point scales, using questions adapted from the NASA Task Load Index (Hart and Staveland, 1988, TLX). The survey also included 7-point scales for ease of coming up with labels, usefulness and satisfaction with the system, and—for TR and

For statistical analysis, we primarily use 2 × 2 (overview × selection) ANOVAs with Aligned Rank Transform (Wobbrock et al., 2011, ART), which is a non-parametric alternative to a standard ANOVA that is appropriate when data are not expected to meet the normality assumption of ANOVA.

                    F                        p
              Overview  Selection    Overview  Selection
final purity     81.03       7.18      < .001       .011
final RI         39.89       6.28      < .001       .017
final NMI        70.92       9.87      < .001       .003
df(1, 36) for all reported results

Table 2: Results from 2 × 2 ANOVA with ART: overview and selection both had significant effects on quality scores.

6.2 Document Cluster Evaluation

We analyze the data by dividing the forty-minute labeling task into five minute intervals. If a participant stops before the time limit, we consider their final dataset to stay the same for any remaining intervals. Figure 5 shows the measures across study conditions, with similar trends for all three measures.

Topic model overview and active learning both significantly improve final dataset measures. The topic overview and active selection conditions significantly outperform the list overview and random selection, respectively, on the final label quality metrics. Table 2 shows the results of separate 2 × 2 ANOVAs with ART with each of final purity, RI, and NMI scores. There are significant main effects of overview and selection on all three metrics; no interaction effects were significant.

TR outperforms LA. Topic models by themselves outperform traditional active learning strategies (Figure 5). LA performs better than LR; while active learning was useful, it was not as useful as the topic model overview (TR and TA).

LA provides an initial benefit. Average purity, NMI and RI were highest with LA for the earliest labeling time intervals. Thus, when time is very

[12] http://Upwork.com
[13] Forty minutes of activity, excluding system time to classify and update documents. Participants nearly exhausted the time: 39.3 average minutes in TA, 38.8 in TR, 40.0 in LA, and 35.9 in LR.
[14] User study data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/user_exp
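The Aligned Rank Transform used for these analyses strips out every effect except the one of interest, then ranks the aligned responses before running a standard ANOVA on the ranks. The align-and-rank step for one main effect in a two-factor design might look like the following sketch (the subsequent ANOVA on the ranks is omitted):

```python
from statistics import mean

def art_align_rank(data):
    """Aligned Rank Transform step for the main effect of factor A in a
    two-factor design (after Wobbrock et al., 2011): remove all effects,
    add back only A's estimated effect, then assign average ranks.
    data: list of (a_level, b_level, response) observations."""
    cells = {}
    for a, b, y in data:
        cells.setdefault((a, b), []).append(y)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    grand = mean(y for _, _, y in data)
    a_mean = {a: mean(y for aa, _, y in data if aa == a)
              for a in {a for a, _, _ in data}}
    # aligned response = residual + estimated effect of A only
    aligned = [(a, b, (y - cell_mean[(a, b)]) + (a_mean[a] - grand))
               for a, b, y in data]
    # average (midrank) ranks of the aligned responses, 1-based
    ys = sorted(v for _, _, v in aligned)
    def avg_rank(v):
        lo = ys.index(v) + 1
        return (lo + (lo + ys.count(v) - 1)) / 2
    return [(a, b, avg_rank(v)) for a, b, v in aligned]
```

A parametric ANOVA run on these ranks tests only factor A; each effect (the other main effect and the interaction) gets its own alignment and ranking before its test.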
limited, using traditional active learning (LA) is preferable to topic overviews; users need time to explore the topics and a subset of documents within them. Table 3 shows the metrics after ten minutes. Separate 2 × 2 ANOVAs with ART on the means of purity, NMI and RI revealed a significant interaction effect between overview and selection on mean NMI (F(1, 36) = 5.58, p = .024), confirming the early performance trends seen in Figure 5 at least for NMI. No other main or interaction effects were significant, likely due to low statistical power.

      M ± SD [median]
      purity               RI                   NMI
TA    0.31 ± 0.08 [0.32]   0.80 ± 0.05 [0.80]   0.19 ± 0.08 [0.21]
TR    0.32 ± 0.09 [0.31]   0.82 ± 0.04 [0.82]   0.21 ± 0.09 [0.20]
LA    0.35 ± 0.05 [0.35]   0.82 ± 0.04 [0.81]   0.27 ± 0.05 [0.28]
LR    0.31 ± 0.04 [0.31]   0.79 ± 0.04 [0.79]   0.19 ± 0.03 [0.19]

Table 3: Mean, standard deviation, and median purity, RI, and NMI after ten minutes. NMI in particular shows the benefit of LA over other conditions at early time intervals.

Subjective ratings. Table 4 shows the average scores given for the six NASA-TLX questions in different conditions. Separate 2 × 2 ANOVA with ART for each of the measures revealed only one significant result: participants who used the topic model overview find the task to be significantly less frustrating (M = 4.2 and median = 2) than those who used the list overview (M = 7.3 and median = 6.5) on a scale from 1 (low frustration) to 20 (high frustration) (F(1, 36) = 4.43, p = .042), confirming that the topic overview helps users organize their thoughts and experience less stress during labeling.

Participants in the TA and TR conditions rate topic information to be useful in completing the task (M = 5.0 and median = 5) on a scale from 1 (not useful at all) to 7 (very useful). Overall, users are positive about their experience with the system. Participants in all conditions rate overall satisfaction with the interface positively (M = 5.8 and median = 6) on a scale from 1 (not satisfied at all) to 7 (very satisfied).

Discussion. One can argue that using topic overviews for labeling could have a negative effect: users may ignore the document content and focus on topics for labeling. We tried to avoid this issue by making it clear in the instructions that they need to focus on document content and use topics as guidance. On average, the participants in TR created 1.96 labels per topic and the participants in TA created 2.26 labels per topic. This suggests that participants are going beyond what they see in topics for labeling, at least in the TA condition.

6.3 Label Evaluation Results

Section 6.2 compares clusters of documents in different conditions against the gold clusters but does not evaluate the quality of the labels themselves. Since one of the main contributions of ALTO is to accelerate inducing a high quality label set, we use crowdsourcing to assess how the final induced label sets compare in different conditions.

For completeness, we also compare labels against a fully automatic labeling method (Aletras and Stevenson, 2014) that does not require human intervention. We assign automatic labels to documents based on their most prominent topic.

We ask users on a crowdsourcing platform to vote for the "best" and "worst" label that describes the content of a US congressional bill (we use Crowdflower restricted to US contributors). Five users label each document and we use the aggregated results generated by Crowdflower. The user gets $0.20 for each task.

We randomly choose 200 documents from our dataset (Section 4.1). For each chosen document, we randomly choose a participant from all four conditions (TA, TR, LA, LR). The labels assigned in different conditions and the automatic label of the document's prominent topic construct the candidate labels for the document.[15] Identical labels are merged into one label to avoid showing duplicate labels to users. If a merged label gets a "best" or "worst" vote, we split that vote across all the identical instances.[16] Figure 6 shows the average number of "best" and "worst" votes for each condition and the automatic method. ALTO (TA) receives the most "best" votes and the fewest "worst" votes. LR receives the most "worst" votes. The automatic labels, interestingly, appear to do at least as well as the list view labels, with a similar number of best votes and fewer worst votes. This indicates that automatic labels have reasonable quality compared to at least some manually generated labels. However, when users are provided with a topic model overview—

[15] Some participants had typos in the labels. We corrected all the typos using the pyEnchant (http://pythonhosted.org/pyenchant/) spellchecker. If the corrected label was still wrong, we corrected it manually.
[16] Evaluation data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/label_eval
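The vote-splitting rule for merged labels can be sketched as follows (`tally_votes` and its input shapes are illustrative, not the paper's actual pipeline):

```python
from collections import defaultdict

def tally_votes(candidates, votes):
    """Identical candidate labels are shown to voters once; a "best" or
    "worst" vote for a merged label is split evenly across all the
    conditions that produced that identical label.
    candidates: list of (condition, label) pairs for one document.
    votes: list of (label, kind) pairs, kind in {"best", "worst"}."""
    instances = defaultdict(list)
    for cond, label in candidates:
        instances[label].append(cond)
    score = defaultdict(lambda: {"best": 0.0, "worst": 0.0})
    for label, kind in votes:
        conds = instances[label]
        for cond in conds:
            score[cond][kind] += 1.0 / len(conds)
    return dict(score)
```

For example, if TA and LR both produced the label "health" and that merged label wins a "best" vote, each condition is credited half a vote.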
Condition  Mental Demand    Physical Demand  Temporal Demand  Performance      Effort            Frustration
TA         9.8 ± 5.6 [10]   2.9 ± 3.4 [2]    9 ± 7.8 [7]      5.5 ± 5.8 [1.5]  9.4 ± 6.3 [10]    4.5 ± 5.5 [1.5]
TR         10.6 ± 4.5 [11]  2.4 ± 2.8 [1]    7.4 ± 4.1 [9]    8.8 ± 6.1 [7.5]  9.8 ± 3.7 [10]    3.9 ± 3.0 [3.5]
LA         9.1 ± 5.5 [10]   1.7 ± 1.3 [1]    10.2 ± 4.8 [11]  8.6 ± 5.3 [10]   10.7 ± 6.2 [12.5] 6.7 ± 5.1 [5.5]
LR         9.8 ± 6.1 [10]   3.3 ± 2.9 [2]    9.3 ± 5.7 [10]   9.4 ± 5.6 [10]   9.4 ± 6.2 [10]    7.9 ± 5.4 [8]

Table 4: Mean, standard deviation, and median (M ± SD [median]) results from NASA-TLX post-survey. All questions are scaled 1 (low)–20 (high), except performance, which is scaled 1 (good)–20 (poor). Users found topic model overview conditions, TR and TA, to be significantly less frustrating than the list overview conditions.

Figure 6: Best and worst votes for document labels. Error bars are standard error from bootstrap sample. ALTO (TA) gets the most best votes and the fewest worst votes.

with or without active learning selection—they can generate label sets that improve upon automatic labels and labels assigned without the topic model overview.

7 Related Work

Text classification—a ubiquitous machine learning tool for automatically labeling text (Zhang, 2010)—is a well-trodden area of NLP research. The difficulty is often creating the training data (Hwa, 2004; Osborne and Baldridge, 2004); coding theory is an entire subfield of social science devoted to creating, formulating, and applying labels to text data (Saldana, 2012; Musialek et al., 2016). Crowdsourcing (Snow et al., 2008) and active learning (Settles, 2012) can decrease the cost of annotation, but only after a label set exists.

ALTO's corpus overviews aid text understanding, building on traditional interfaces for gaining both local and global information (Hearst and Pedersen, 1996). More elaborate interfaces (Eisenstein et al., 2012; Chaney and Blei, 2012; Roberts et al., 2014) provide richer information given a fixed topic model. Alternatively, because topic mod-
Hoque and Carenini, 2015).

Summarizing document collections through discovered topics can happen through raw topics labeled manually by users (Talley et al., 2011), automatically (Lau et al., 2011), or by learning a mapping from labels to topics (Ramage et al., 2009). When there is not a direct correspondence between topics and labels, classifiers learn a mapping (Blei and McAuliffe, 2007; Zhu et al., 2009; Nguyen et al., 2015). Because we want topics to be consistent between users, we use a classifier with static topics in ALTO. Combining our interface with dynamic topics could improve overall labeling, perhaps at the cost of introducing confusion as topics change during the labeling process.

8 Conclusion and Future Work

We introduce ALTO, an interactive framework that combines active learning selection with topic model overviews to help users both induce a label set and assign labels to documents. We show that users can more effectively and efficiently induce a label set and create training data using ALTO in comparison with other conditions, which lack either topic overview or active selection.

We can further improve ALTO to help users gain better and faster understanding of text corpora. Our current system limits users to viewing only 20K documents at a time and allows for one label assignment per document. Moreover, the topics are static and do not adapt to better reflect users' labels. Users should have better support for browsing documents and assigning multiple labels.

Finally, with slight changes to what the system considers a document, we believe ALTO can be extended to NLP applications other than classification, such as named entity recognition or semantic role labeling, to reduce the annotation effort.
Acknowledgments

We thank the anonymous reviewers, David Mimno, Edward Scott Adler, Philip Resnik, and Burr Settles for their insightful comments. We also thank Nikolaos Aletras for providing the automatic topic labeling code. Boyd-Graber and Poursabzi-Sangdeh's contribution is supported by NSF Grant NCSE-1422492; Findlater, Seppi, and Boyd-Graber's contribution is supported by collaborative NSF Grants IIS-1409287 (UMD) and IIS-1409739 (BYU). Any opinions, findings, results, or recommendations expressed here are of the authors and do not necessarily reflect the view of the sponsor.

References

E. Scott Adler and John Wilkerson. 2006. Congressional bills project. NSF, 880066:00880061.

Nikolaos Aletras and Mark Stevenson. 2014. Labelling topics using unsupervised graph-based methods. In Proceedings of the Association for Computational Linguistics, pages 631–636.

Pranav Anand, Joseph King, Jordan L. Boyd-Graber, Earl Wagner, Craig H. Martell, Douglas W. Oard, and Philip Resnik. 2011. Believe me-we can do this! Annotating persuasive acts in blog text. In Computational Models of Natural Argument.

David M. Blei and Jon D. McAuliffe. 2007. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.

David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.

Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In The Biennial GSCL Conference, pages 31–40.

Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Their Applications. CRC Press, Boca Raton, FL, USA.

Ian Budge. 2001. Mapping Policy Preferences: Estimates for Parties, Electors, and Governments, 1945–1998, volume 1. Oxford University Press.

Bob Carpenter. 2008. LingPipe 4.1.0. http://alias-i.com/lingpipe.

Allison Chaney and David Blei. 2012. Visualizing topic models. In International AAAI Conference on Weblogs and Social Media.

Jaegul Choo, Changhyun Lee, Chandan K. Reddy, and Haesun Park. 2013. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 19(12):1992–2001.

Sanjoy Dasgupta and Daniel Hsu. 2008. Hierarchical sampling for active learning. In Proceedings of the International Conference of Machine Learning.

Jacob Eisenstein, Duen Horng Chau, Aniket Kittur, and Eric Xing. 2012. TopicViz: Interactive topic exploration in document collections. In International Conference on Human Factors in Computing Systems.

Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology, 52:139–183.

M. A. Hearst and J. O. Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.

Enamul Hoque and Giuseppe Carenini. 2015. ConVisIT: Interactive topic modeling for exploring asynchronous online conversations. In Proceedings of the 20th International Conference on Intelligent User Interfaces, IUI '15.

Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Machine Learning, 95(3):423–469.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.

Rebecca Hwa. 2004. Sample selection for statistical parsing. Computational Linguistics, 30(3):253–276.

Mohit Iyyer, Peter Enns, Jordan L. Boyd-Graber, and Philip Resnik. 2014. Political ideology detection using recursive neural networks. In Proceedings of the Association for Computational Linguistics.

Edward F. Kelly and Philip J. Stone. 1975. Computer Recognition of English Word Senses, volume 13. North-Holland.

Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the Association for Computational Linguistics, page 1367.

Hans-Dieter Klingemann, Andrea Volkens, Judith Bara, Ian Budge, Michael D. McDonald, et al. 2006. Mapping Policy Preferences II: Estimates for Parties, Electors, and Governments in Eastern Europe, European Union, and OECD 1990–2003. Oxford University Press, Oxford.

Ken Lang. 2007. 20 Newsgroups data set. http://www.ai.mit.edu/people/jrennie/20Newsgroups/.

Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. 2011. Automatic labelling of topic models. In Proceedings of the Association for Computational Linguistics, pages 1536–1545.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag New York, Inc.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

Marina Meilă. 2003. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer.

Chris Musialek, Philip Resnik, and S. Andrew Stavisky. 2016. Using text analytic techniques to create efficiencies in analyzing qualitative data: A comparison between traditional content analysis and a topic modeling approach. In American Association for Public Opinion Research.

Grace Ngai and David Yarowsky. 2000. Rule writing or annotation: Cost-efficient resource usage for base noun phrase chunking. In Proceedings of the Association for Computational Linguistics.

Thang Nguyen, Jordan Boyd-Graber, Jeff Lund, Kevin Seppi, and Eric Ringger. 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Conference of the North American Chapter of the Association for Computational Linguistics.

Sonya Nikolova, Jordan Boyd-Graber, and Christiane Fellbaum. 2011. Collecting semantic similarity ratings to connect concepts in assistive communication tools. In Modeling, Learning, and Processing of Text Technological Data Structures, pages 81–93. Springer.

Miles Osborne and Jason Baldridge. 2004. Ensemble-based active learning for parse selection. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 89–96.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of Empirical Methods in Natural Language Processing.

William M. Rand. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4):1064–1082.

J. Saldana. 2012. The Coding Manual for Qualitative Researchers. SAGE Publications.

Burr Settles. 2012. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool.

H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of Empirical Methods in Natural Language Processing.

Alexander Strehl and Joydeep Ghosh. 2003. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Edmund M. Talley, David Newman, David Mimno, Bruce W. Herr, Hanna M. Wallach, Gully A. P. C. Burns, A. G. Miriam Leenders, and Andrew McCallum. 2011. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443–444, May.

Katrin Tomanek, Joachim Wermter, and Udo Hahn. 2007. An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In Proceedings of Empirical Methods in Natural Language Processing, pages 486–495.

Paul M. B. Vitányi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li. 2009. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer.

Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 143–146. ACM.

Tong Zhang. 2010. Fundamental statistical techniques. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing. Chapman & Hall/CRC, 2nd edition.

Ying Zhao and George Karypis. 2001. Criterion functions for document clustering: Experiments and analysis. Technical report, University of Minnesota.

Jun Zhu, Amr Ahmed, and Eric P. Xing. 2009. MedLDA: Maximum margin supervised topic models for regression and classification. In Proceedings of the International Conference of Machine Learning.