
Forough Poursabzi-Sangdeh, Jordan Boyd-Graber, Leah Findlater, and Kevin Seppi. ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling. Association for Computational Linguistics, 2016.

@inproceedings{Poursabzi-Sangdeh:Boyd-Graber:Findlater:Seppi-2016,
Author = {Forough Poursabzi-Sangdeh and Jordan Boyd-Graber and Leah Findlater and Kevin Seppi},
Url = {docs/2016_acl_doclabel.pdf},
Booktitle = {Association for Computational Linguistics},
Location = {Berlin, Brandenburg},
Year = {2016},
Title = {ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling},
}

Downloaded from http://cs.colorado.edu/~jbg/docs/2016_acl_doclabel.pdf

ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling

Forough Poursabzi-Sangdeh, Computer Science, University of Colorado, [email protected]
Jordan Boyd-Graber, Computer Science, University of Colorado, [email protected]
Leah Findlater, iSchool and UMIACS, University of Maryland, [email protected]
Kevin Seppi, Computer Science, Brigham Young University, [email protected]
Abstract

Effective text classification requires experts to annotate data with labels; these training data are time-consuming and expensive to obtain. If you know what labels you want, active learning can reduce the number of labeled documents needed. However, establishing the label set remains difficult. Annotators often lack the global knowledge needed to induce a label set. We introduce ALTO: Active Learning with Topic Overviews, an interactive system to help humans annotate documents: topic models provide a global overview of what labels to create and active learning directs them to the right documents to label. Our forty-annotator user study shows that while active learning alone is best in extremely resource-limited conditions, topic models (even by themselves) lead to better label sets, and ALTO's combination is best overall.

1 Introduction

Many fields depend on texts labeled by human experts; computational linguistics uses such annotation to determine word senses and sentiment (Kelly and Stone, 1975; Kim and Hovy, 2004), while social science uses "coding" to scale up and systematize content analysis (Budge, 2001; Klingemann et al., 2006).

Classification takes these labeled data as a training set and labels new data automatically. Creating a broadly applicable and consistent label set that generalizes well is time-consuming and difficult, requiring expensive annotators to examine large swaths of the data. Effective NLP systems must measure (Hwa, 2004; Osborne and Baldridge, 2004; Ngai and Yarowsky, 2000) and reduce annotation cost (Tomanek et al., 2007). Annotation is hard because it requires both global and local knowledge of the entire dataset. Global knowledge is required to create the set of labels, and local knowledge is required to annotate the most useful examples to serve as a training set for an automatic classifier. The former's cost is often hidden in multiple rounds of refining annotation guidelines.

We create a single interface—ALTO (Active Learning with Topic Overviews)—to address both global and local challenges using two machine learning tools: topic models and active learning (we review both in Section 2). Topic models address the need for annotators to have a global overview of the data, exposing the broad themes of the corpus so annotators know what labels to create. Active learning selects documents that help the classifier understand the differences between labels and directs the user's attention locally to them. We provide users four experimental conditions to compare the usefulness of a topic model or a simple list of documents, with or without active learning suggestions (Section 3). We then describe our data and evaluation metrics (Section 4).

Through both synthetic experiments (Section 5) and a user study (Section 6) with forty participants, we evaluate ALTO and its constituent components by comparing results from the four conditions introduced above. We first examine user strategies for organizing documents, user satisfaction, and user efficiency. Finally, we evaluate the overall effectiveness of the label set in a post-study crowdsourced task.
Topic words: metropolitan, carrier, rail, freight, passenger, driver, airport, traffic, transit, vehicles
Document title: A bill to improve the safety of motorcoaches, and for other purposes.

Topic words: violence, sexual, criminal, assault, offense, victims, domestic, crime, abuse, trafficking
Document title: A bill to provide criminal penalties for stalking.

Topic words: agricultural, farm, agriculture, rural, producer, dairy, crop, producers, commodity, nutrition
Document title: To amend the Federal Crop Insurance Act to extend certain supplemental agricultural disaster assistance programs through fiscal year 2017, and for other purposes.

Table 1: Given a dataset—in this case, the US congressional bills dataset—topics are automatically discovered, sorted lists of terms that summarize segments of a document collection. Topics are also associated with documents. These topics give users a sense of documents' main themes and help users create high-quality labels.

2 Topic Overviews and Active Learning

ALTO,[1] a framework for assigning labels to documents that uses both global and local knowledge to help users create and assign document labels, has two main components: topic overview and active learning selection. We explain how ALTO uses topic models and active learning to aid label induction and document labeling.

[1] Code available at https://github.com/Foroughp/ALTO-ACL-2016

Topic Models. Topic models (Blei et al., 2003) automatically induce structure from a text corpus. Given a corpus and a constant K for the number of topics, topic models output (i) a distribution over words for each topic k (φ_{k,w}) and (ii) a distribution over topics for each document (θ_{d,k}). Each topic's most probable words and associated documents can help a user understand what the collection is about. Table 1 shows examples of topics and their highest associated documents from our corpus of US congressional bills.

Our hypothesis is that showing documents grouped by topics will be more effective than having the user wade through an undifferentiated list of random documents and mentally sort the major themes themselves.

Active Learning. Active learning (Settles, 2012) directs users' attention to the examples that would be most useful to label when training a classifier. When user time is scarce, active learning builds a more effective training set than random labeling: uncertainty sampling (Lewis and Gale, 1994) or query by committee (Seung et al., 1992) direct users to the most useful documents to label.

In contrast to topic models, active learning provides local information: this document is the one you should pay attention to. Our hypothesis is that active learning directing users to documents most beneficial to label will not only be more effective than randomly selecting documents but will also complement the global information provided by topic models. Section 3.3 describes our approaches for directing users' local attention.

3 Study Conditions

Our goal is to characterize how local and global knowledge can aid users in annotating a dataset. This section describes our four experimental conditions and outlines the user's process for labeling documents.

3.1 Study Design

The study uses a 2 × 2 between-subjects design, with factors of document collection overview (two levels: topic model or list) and document selection (two levels: active or random). The four conditions, with the TA condition representing ALTO, are:

1. Topic model overview, active selection (TA)
2. Topic model overview, random selection (TR)
3. List overview, active selection (LA)
4. List overview, random selection (LR)

3.2 Document Collection Overview

The topic and list overviews offer different overall structure but the same basic elements for users to create, modify, and apply labels (Section 3.4). The topic overview (Figure 1a) builds on Hu et al. (2014): for each topic, the top twenty words are shown alongside twenty document titles. Topic words w are sized based on their probability φ_{k,w} in topic k, and the documents with the highest probability of that topic (θ_{d,k}) are shown. The list overview, in contrast, presents documents as a simple, randomly ordered list of titles (Figure 1b). We display the same number of documents (20K, where K is the total number of topics) in both the topic model and list overviews, but the list overview provides no topic information.
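The overview content follows directly from the topic model's two outputs. As a small, hypothetical sketch (the actual interface code lives in the repository linked above), phi, theta, vocab, and titles stand for the word distributions, document-topic distributions, vocabulary, and document titles:

```python
import numpy as np

def topic_overview(phi, theta, vocab, titles, n_words=20, n_docs=20):
    """For each topic k, return its n_words most probable words (phi_{k,w})
    and the n_docs documents with the highest proportion theta_{d,k}."""
    overview = []
    for k in range(phi.shape[0]):
        top_words = [vocab[w] for w in np.argsort(-phi[k])[:n_words]]
        top_docs = [titles[d] for d in np.argsort(-theta[:, k])[:n_docs]]
        overview.append({"topic": k, "words": top_words, "documents": top_docs})
    return overview
```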
[Figure 1 image: Main Interface, showing (a) Topic Overview (TA and TR) or (b) List Overview (LA and LR).]

Figure 1: Our annotation system. Initially, the user sees lists of documents organized in either a list format or grouped into topics (only two topics are shown here; users can scroll to additional documents). The user can click on a document to label it.

[Figure 2 image: labeled elements include Raw Text, Classifier Label (if available), and User Label.]

Figure 2: After clicking on a document from the list or topic overview, the user inspects the text and provides a label. If the classifier has a guess at the label, the user can confirm the guess.

[Figure 3 image: labeled elements include User-Labeled Documents, Classifier-Labeled Documents, and Selected Document.]

Figure 3: After the user has labeled some documents, the system can automatically label other documents and select which documents would be most helpful to annotate next. In the random selection setting, random documents are selected.

3.3 Document Selection

We use a preference function U to direct users' attention to specific documents. To provide consistency across the four conditions, each condition will highlight the document that scores the highest for the condition's preference function. For the random selection conditions, TR and LR, document selection is random, within a topic or globally. We expect this to be less useful than active learning. The document preference functions are:

LA: LA uses traditional uncertainty sampling:

  U_d^LA = H_C[y_d],  (1)

where H_C[y_d] = − Σ_i P(y_i | d) log P(y_i | d) is the classifier entropy. Entropy measures how confused (uncertain) classifier C is about its prediction of a document d's label y. Intuitively, it prefers documents for which the classifier suggests many labels instead of a single, confident prediction.

LR: LR's approach is the same as LA's except we replace H_C[y_d] with a uniform random number:

  U_d^LR ∼ unif(0, 1).  (2)

In contrast to LA, which suggests the most uncertain document, LR suggests a random document.
TA: Dasgupta and Hsu (2008) argue that clustering should inform active learning criteria, balancing coverage against classifier accuracy. We adapt their method to flat topic models—in contrast to their hierarchical cluster trees—by creating a composite measure of document uncertainty within a topic:

  U_d^TA = H_C[y_d] θ_{d,k},  (3)

where k is the prominent topic for document d. U_d^TA prefers documents that are representative of a topic (i.e., have a high value of θ_{d,k} for that topic) and are informative for the classifier.

TR: TR's approach is the same as TA's except we replace H_C[y_d] with a uniformly random number:

  U_d^TR = unif(0, 1) θ_{d,k}.  (4)

Similar to TA, U_d^TR prefers documents that are representative of a topic, but not any particular document in the topic. Incorporating the random component encourages covering different documents in diverse topics.

In LA and LR, the preference function directly chooses a document and directs the user to it. On the other hand, U_d^TA and U_d^TR are topic dependent. TA emphasizes documents that are both informative to the classifier and representative of a topic; if a document is not representative, the surrounding context of a topic will be less useful. Therefore, the factor θ_{d,k} appears in both. Thus, they require that a topic be chosen first; the document with maximum preference U within that topic can then be chosen. In TR, the topic is chosen randomly. In TA, the topic is chosen by

  k* = argmax_k ( median_d ( H_C[y_d] θ_{d,k} ) ).  (5)

That is, TA selects the topic with the maximum median U. The median encodes how "confusing" a topic is:[2] topic k* is the topic whose documents most confuse the classifier.

[2] Outliers skew other measures (e.g., max or mean).
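The four preference functions and the topic choice in Eq. (5) are simple to state in code. The following is a minimal sketch, assuming the classifier entropies H_C[y_d] and the topic proportions θ_{d,k} have already been computed; it illustrates the definitions above rather than reproducing the released ALTO implementation.

```python
import numpy as np

def preference(H_C, theta, condition, rng=None):
    """Score documents under one condition (Eqs. 1-5).

    H_C   : shape (D,) array of classifier entropies H_C[y_d]
    theta : shape (D, K) array of topic proportions theta_{d,k}
    Returns (scores, k_star); k_star is None for the list conditions.
    """
    rng = rng or np.random.default_rng()
    D, K = theta.shape
    prominent = theta.argmax(axis=1)            # prominent topic k of each d
    theta_dk = theta[np.arange(D), prominent]   # theta_{d,k} at that topic

    if condition == "LA":                       # Eq. (1): uncertainty sampling
        return H_C, None
    if condition == "LR":                       # Eq. (2): uniform random
        return rng.uniform(0.0, 1.0, size=D), None
    if condition == "TA":                       # Eq. (3): entropy * topic weight
        scores = H_C * theta_dk
        # Eq. (5): topic whose documents have the largest median score
        k_star = max(range(K), key=lambda k:
                     np.median(scores[prominent == k])
                     if np.any(prominent == k) else -np.inf)
        return scores, k_star
    if condition == "TR":                       # Eq. (4): random, topic-weighted
        return rng.uniform(0.0, 1.0, size=D) * theta_dk, int(rng.integers(K))
    raise ValueError(f"unknown condition: {condition}")
```

In the list conditions the suggested document is the global argmax of the scores; in the topic conditions it is the argmax restricted to the chosen topic k*.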
3.4 User Labeling Process

The user's labeling process is the same in all four conditions. The overview (topic or list) allows users to examine individual documents (Figure 1). Clicking on a document opens a dialog box (Figure 2) with the text of the document and three options:

1. Create and assign a new label to the document.
2. Choose an existing label for the document.
3. Skip the document.

Once the user has labeled two documents with different labels, the displayed documents are replaced based on the preference function (Section 3.3) every time the user labels (or updates labels for) a document. In TA and TR, each topic's documents are replaced with the twenty highest ranked documents. In LA and LR, all documents are updated with the top 20K ranked documents.[3]

[3] In all conditions, the number of displayed unlabeled documents is adjusted based on the number of manually labeled documents, i.e., if the user has labeled n documents in topic k, the n manually labeled documents followed by the top 20 − n uncertain documents will be shown in topic k.

The system also suggests one document to consider by auto-scrolling to it and drawing a red box around its title (Figure 3). The user may ignore that document and click on any other document. After the user labels ten documents, the classifier runs and assigns labels to other documents.[4] For classifier-labeled documents, the user can either approve the label or assign a different label. The process continues until the user is satisfied or time runs out (forty minutes in our user study, Section 6). We use time to control for the varying difficulty of assigning document labels: active learning will select more difficult documents to annotate, but they may be more useful; time is a fairer basis of comparison in real-world tasks.

[4] To reduce user confusion, for each existing label, only the top 100 documents get a label assigned in the UI.

4 Data and Evaluation Metrics

In this section, we describe our data, the machine learning techniques to learn classifiers from examples, and the evaluation metrics to know whether the final labeling of the complete document collection was successful.

4.1 Datasets

Data. Our experiments require corpora to compare user labels with gold standard labels. We experiment with two corpora: 20 Newsgroups (Lang, 2007) and US congressional bills from GovTrack.[5] For US congressional bills, GovTrack provides bill information such as the title and text, while the Congressional Bills Project (Adler and Wilkerson, 2006) provides labels and sub-labels for the bills. Examples of labels are agriculture and health, while sub-labels include agricultural trade and comprehensive health care reform.
The twenty top-level labels have been developed by consensus over many years by a team of top political scientists to create a reliable, robust dataset. We use the 112th Congress; after filtering,[6] this dataset has 5558 documents. We use this dataset in both the synthetic experiments (Section 5) and the user study (Section 6).

[5] https://www.govtrack.us/
[6] We remove bills that have less than fifty words, no assigned gold label, duplicate titles, or have the gold label GOVERNMENT OPERATIONS or SOCIAL WELFARE, which are broad and difficult for users to label.

The 20 Newsgroups corpus has 19,997 documents grouped in twenty news groups that are further grouped into six more general topics. Examples are talk.politics.guns and sci.electronics, which belong to the general topics of politics and science. We use this dataset in synthetic experiments (Section 5).

4.2 Machine Learning Techniques

Topic Modeling. To choose the number of topics (K), we calculate average topic coherence (Lau et al., 2014) on US Congressional Bills between ten and forty topics and choose K = 19, as it has the maximum coherence score. For consistency, we use the same number of topics (K = 19) for the 20 Newsgroups corpus. After filtering words based on TF-IDF, we use Mallet (McCallum, 2002) with default options to learn topics.
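The paper trains topics with Mallet and scores them with the coherence measure of Lau et al. (2014). As a rough, hypothetical stand-in for that pipeline (the exact preprocessing and coherence variant are assumptions here), a gensim-based selection loop over candidate K values could look like this:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def pick_num_topics(tokenized_docs, k_range=range(10, 41)):
    """Fit one LDA model per candidate K and keep the most coherent one."""
    dictionary = Dictionary(tokenized_docs)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    best_k, best_score, best_model = None, float("-inf"), None
    for k in k_range:
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=k,
                       passes=5, random_state=0)
        # NPMI-style coherence as a stand-in for the Lau et al. (2014) measure
        score = CoherenceModel(model=lda, texts=tokenized_docs,
                               dictionary=dictionary,
                               coherence="c_npmi").get_coherence()
        if score > best_score:
            best_k, best_score, best_model = k, score, lda
    return best_k, best_model   # the paper reports K = 19 for the bills corpus
```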
documents. The user Ωl and gold Gj labels are in-
Features and Classification A logistic regres- terpreted as sets containing all documents assigned
sion predicts labels for documents and provides to that label.
the classification uncertainty for active learning.
To make classification and active learning updates Rand index (RI) RI is a pair counting measure,
efficient, we use incremental learning (Carpenter, where cluster evaluation is considered as a series
2008, LingPipe). We update classification param- of decisions. If two documents have the same gold
eters using stochastic gradient descent, restarting label and the same user label (TP) or if they do not
with the previously learned parameters as new la- have the same gold label and are not assigned the
beled documents become available.7 We use cross same user label (TN), the decision is right. Other-
validation, using argmax topics as surrogate labels, wise, it is wrong (FP, FN). RI measures the percent-
to set the parameters for learning the classifier.8 age of decisions that are right:
The features for classification include topic prob- TP + TN
RI = . (7)
abilities, unigrams, and the fraction of labeled doc- TP + FP + TN + FN
uments in each document’s prominent topic. The Normalized mutual information (NMI) NMI is
intuition behind adding this last feature is to allow an information theoretic measure that measures
active learning to suggest documents in a diverse the amount of information one gets about the gold
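The classifier above is LingPipe's SGD-trained logistic regression; the sketch below uses scikit-learn's SGDClassifier as a hedged stand-in to show the warm-started update and the entropy fed to the preference functions. Feature construction is reduced to the topic proportions plus the labeled-fraction feature (unigrams omitted), and the hyperparameter names differ from the LingPipe settings in footnote 8.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def build_features(theta, labeled_mask):
    """Topic proportions plus, per document, the fraction of labeled
    documents in its prominent topic (unigram features omitted here)."""
    prominent = theta.argmax(axis=1)
    frac = np.array([labeled_mask[prominent == k].mean()
                     if np.any(prominent == k) else 0.0
                     for k in range(theta.shape[1])])
    return np.hstack([theta, frac[prominent][:, None]])

def make_classifier():
    # Logistic loss, effectively unregularized, constant step size 0.1;
    # warm_start lets later fit() calls restart from the current weights.
    return SGDClassifier(loss="log_loss", alpha=1e-10, warm_start=True,
                         learning_rate="constant", eta0=0.1, max_iter=1000)

def update(clf, X_labeled, y_labeled):
    # Re-run SGD from the previously learned parameters as new labels arrive.
    # (When the label set itself changes, build a fresh classifier instead.)
    clf.fit(X_labeled, y_labeled)
    return clf

def uncertainty(clf, X):
    """Classifier entropy H_C[y_d], the input to the preference functions."""
    p = clf.predict_proba(X)
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```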
4.3 Evaluation Metrics

Our goal is to create a system that allows users to quickly induce a high-quality label set. We compare the user-created label sets against the data's gold label sets. Comparing different clusterings is a difficult task, so we use three clustering evaluation metrics: purity (Zhao and Karypis, 2001), Rand index (Rand, 1971, RI), and normalized mutual information (Strehl and Ghosh, 2003, NMI).[10]

[10] We avoided using the adjusted Rand index (Hubert and Arabie, 1985) because it can yield negative values, which is not consistent with purity and NMI. We also computed variation of information (Meilă, 2003) and normalized information distance (Vitányi et al., 2009) and observed consistent trends. We omit these results for the sake of space.

Purity. The documents labeled with a good user label should only have one (or a few) gold labels associated with them: this is measured by cluster purity. Given each user cluster, it measures what fraction of the documents in a user cluster belong to the most frequent gold label in that cluster:

  purity(Ω, G) = (1/N) Σ_l max_j |Ω_l ∩ G_j|,  (6)

where L is the number of labels the user creates, Ω = {Ω_1, Ω_2, ..., Ω_L} is the user clustering of documents, G = {G_1, G_2, ..., G_J} is the gold clustering of documents, and N is the total number of documents. The user Ω_l and gold G_j labels are interpreted as sets containing all documents assigned to that label.

Rand index (RI). RI is a pair counting measure, where cluster evaluation is considered as a series of decisions. If two documents have the same gold label and the same user label (TP) or if they do not have the same gold label and are not assigned the same user label (TN), the decision is right. Otherwise, it is wrong (FP, FN). RI measures the percentage of decisions that are right:

  RI = (TP + TN) / (TP + FP + TN + FN).  (7)

Normalized mutual information (NMI). NMI is an information theoretic measure of the amount of information one gets about the gold clusters by knowing what the user clusters are:

  NMI(Ω, G) = 2 I(Ω, G) / (H_Ω + H_G),  (8)
where Ω and G are the user and gold clusters, H is the entropy, and I is mutual information (Bouma, 2009).

While purity, RI, and NMI are all normalized within [0, 1] (higher is better), they measure different things. Purity measures the intersection between two clusterings; it is sensitive to the number of clusters, and it is not symmetric.

On the other hand, RI and NMI are less sensitive to the number of clusters and are symmetric. RI measures pairwise agreement, in contrast to purity's emphasis on intersection. Moreover, NMI measures shared information between two clusterings.

None of these metrics are perfect: purity can be exploited by putting each document in its own label, RI does not distinguish separating similar documents with distinct labels from giving dissimilar documents the same label, and NMI's ability to compare different numbers of clusters means that it sometimes gives high scores for clusterings by chance. Given the diverse nature of these metrics, if a labeling does well in all three of them, we can be relatively confident that it is not a degenerate solution that games the system.
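Written directly from Equations (6)-(8), the three metrics fit in a few lines. This sketch takes two parallel lists of per-document user and gold labels and is meant as an illustration, not the authors' evaluation script.

```python
from collections import Counter
from itertools import combinations
from math import log

def purity(user, gold):
    """Eq. (6): fraction of documents matching the majority gold label
    inside each user cluster."""
    clusters = {}
    for u, g in zip(user, gold):
        clusters.setdefault(u, Counter())[g] += 1
    return sum(c.most_common(1)[0][1] for c in clusters.values()) / len(user)

def rand_index(user, gold):
    """Eq. (7): pairwise agreement between the two labelings."""
    agree = sum((u1 == u2) == (g1 == g2)
                for (u1, g1), (u2, g2) in combinations(zip(user, gold), 2))
    n = len(user)
    return agree / (n * (n - 1) / 2)

def nmi(user, gold):
    """Eq. (8): 2 I(user; gold) / (H_user + H_gold)."""
    n = len(user)
    pu, pg, pj = Counter(user), Counter(gold), Counter(zip(user, gold))
    h_u = -sum(c / n * log(c / n) for c in pu.values())
    h_g = -sum(c / n * log(c / n) for c in pg.values())
    mi = sum(c / n * log((c / n) / (pu[u] / n * pg[g] / n))
             for (u, g), c in pj.items())
    return 2 * mi / (h_u + h_g) if h_u + h_g > 0 else 0.0
```

For example, purity(['a', 'a', 'b'], ['x', 'y', 'x']) is 2/3: each user cluster contributes its single most frequent gold label.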
5 Synthetic Experiments

Before running a user study, we test our hypothesis that topic model overviews and active learning selection improve final cluster quality compared to standard baselines: list overview and random selection. We simulate the four conditions on Congressional Bills and 20 Newsgroups.

Since we believe annotators create more specific labels compared to the gold labels, we use sub-labels as simulated user labels and labels as gold labels (we give examples of labels and sub-labels in Section 4.1). We start with two randomly selected documents that have different sub-labels, assign the corresponding sub-labels, then add more labels based on each condition's preference function (Section 3.3). We follow the condition's preference function and incrementally add labels until 100 documents have been labeled (100 documents are representative of what a human can label in about an hour). Given these labels, we compute purity, RI, and NMI over time. This procedure is repeated fifteen times (to account for the randomness of initial document selections and the preference functions with randomness).[11]

[11] Synthetic experiment data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/synthetic_exp
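A minimal sketch of one synthetic run is given below. The callables score_fn and evaluate_fn are assumptions (for instance, a classifier retrained on the current labels feeding the preference sketch from Section 3.3, and the purity/RI/NMI functions above); the sketch mirrors the procedure described in this section rather than reproducing the released experiment code.

```python
import numpy as np

def simulate_run(sub_labels, theta, score_fn, evaluate_fn, budget=100, seed=0):
    """One synthetic annotator: label `budget` documents with their gold
    sub-labels, always taking the condition's highest-preference document."""
    rng = np.random.default_rng(seed)
    D = len(sub_labels)
    # Seed with two random documents that carry different sub-labels.
    first = int(rng.integers(D))
    second = int(rng.choice([d for d in range(D)
                             if sub_labels[d] != sub_labels[first]]))
    labeled = {first: sub_labels[first], second: sub_labels[second]}
    history = []
    while len(labeled) < budget:
        scores, k_star = score_fn(labeled)          # retrain + preference
        if k_star is None:                          # list conditions
            pool = range(D)
        else:                                       # topic conditions
            pool = np.flatnonzero(theta.argmax(axis=1) == k_star)
        candidates = ([d for d in pool if d not in labeled]
                      or [d for d in range(D) if d not in labeled])
        d = max(candidates, key=lambda i: scores[i])
        labeled[d] = sub_labels[d]                  # simulated user answer
        history.append(evaluate_fn(labeled))        # purity, RI, NMI over time
    return history

# Repeated fifteen times per condition; the paper reports median curves.
```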
[Figure 4: line plots of median purity, RI, and NMI (over 15 runs) versus the number of documents labeled, for conditions LA, LR, TA, and TR, on Congress (Synth) and Newsgroups (Synth).]

Figure 4: Synthetic results on US Congressional Bills and 20 Newsgroups data sets. Topic models help guide annotation attention to diverse segments of the data.

Synthetic results validate our hypothesis that topic overview and active learning selection can help label a corpus more efficiently (Figure 4). LA shows early gains, but tends to falter eventually compared to both topic overview and topic overview combined with active learning selection (TR and TA).

However, these experiments do not validate ALTO. Not all documents require the same time or effort to label, and active learning focuses on the hardest examples, which may confuse users. Thus, we need to evaluate how effectively actual users annotate a collection's documents.

6 User Study

Following the synthetic experiments, we conduct a user study with forty participants to evaluate ALTO (TA condition) against three alternatives that lack topic overview (LA), active learning selection (TR), or both (LR) (Sections 6.1 and 6.2). Then, we conduct a crowdsourced study to compare the overall effectiveness of the label set generated by the participants in the four conditions (Section 6.3).
[Figure 5: line plots of median purity, RI, and NMI (over participants) versus elapsed time in minutes, for conditions LA, LR, TA, and TR.]

Figure 5: User study results on US Congressional Bills dataset. Active learning selection helps initially, but the combination of active learning selection and topic model overview has the highest quality labels by the end of the task.

6.1 Method

We use the freelance marketplace Upwork to recruit online participants.[12] We require participants to have more than 90% job success on Upwork, English fluency, and US residency. Participants are randomly assigned to one of the four conditions and we recruited ten participants per condition. Participants completed a demographic questionnaire, viewed a video of task instructions, and then interacted with the system and labeled documents until satisfied with the labels or forty minutes had elapsed.[13] The session ended with a survey, where participants rated mental, physical, and temporal demand, and performance, effort, and frustration on 20-point scales, using questions adapted from the NASA Task Load Index (Hart and Staveland, 1988, TLX). The survey also included 7-point scales for ease of coming up with labels, usefulness and satisfaction with the system, and—for TR and TA—topic information helpfulness. Each participant was paid fifteen dollars.[14]

[12] http://Upwork.com
[13] Forty minutes of activity, excluding system time to classify and update documents. Participants nearly exhausted the time: 39.3 average minutes in TA, 38.8 in TR, 40.0 in LA, and 35.9 in LR.
[14] User study data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/user_exp

For statistical analysis, we primarily use 2 × 2 (overview × selection) ANOVAs with the Aligned Rank Transform (Wobbrock et al., 2011, ART), a non-parametric alternative to a standard ANOVA that is appropriate when data are not expected to meet the normality assumption of ANOVA.

6.2 Document Cluster Evaluation

We analyze the data by dividing the forty-minute labeling task into five-minute intervals. If a participant stops before the time limit, we consider their final dataset to stay the same for any remaining intervals. Figure 5 shows the measures across study conditions, with similar trends for all three measures.

Topic model overview and active learning both significantly improve final dataset measures. The topic overview and active selection conditions significantly outperform the list overview and random selection, respectively, on the final label quality metrics. Table 2 shows the results of separate 2 × 2 ANOVAs with ART with each of the final purity, RI, and NMI scores. There are significant main effects of overview and selection on all three metrics; no interaction effects were significant.

                 F                       p
                 Overview    Selection   Overview    Selection
  final purity   81.03       7.18        < .001      .011
  final RI       39.89       6.28        < .001      .017
  final NMI      70.92       9.87        < .001      .003
  df(1, 36) for all reported results

Table 2: Results from 2 × 2 ANOVA with ART analyses on the final purity, RI, and NMI metrics. Only main effects for the factors of overview and selection are shown; no interaction effects were statistically significant. Topics and active learning both had significant effects on quality scores.

TR outperforms LA. Topic models by themselves outperform traditional active learning strategies (Figure 5). LA performs better than LR; while active learning was useful, it was not as useful as the topic model overview (TR and TA).

LA provides an initial benefit. Average purity, NMI, and RI were highest with LA for the earliest labeling time intervals. Thus, when time is very
limited, using traditional active learning (LA) is preferable to topic overviews; users need time to explore the topics and a subset of documents within them. Table 3 shows the metrics after ten minutes.

        purity (M ± SD [median])   RI (M ± SD [median])   NMI (M ± SD [median])
  TA    0.31 ± 0.08 [0.32]         0.80 ± 0.05 [0.80]     0.19 ± 0.08 [0.21]
  TR    0.32 ± 0.09 [0.31]         0.82 ± 0.04 [0.82]     0.21 ± 0.09 [0.20]
  LA    0.35 ± 0.05 [0.35]         0.82 ± 0.04 [0.81]     0.27 ± 0.05 [0.28]
  LR    0.31 ± 0.04 [0.31]         0.79 ± 0.04 [0.79]     0.19 ± 0.03 [0.19]

Table 3: Mean, standard deviation, and median purity, RI, and NMI after ten minutes. NMI in particular shows the benefit of LA over the other conditions at early time intervals.

Separate 2 × 2 ANOVAs with ART on the means of purity, NMI, and RI revealed a significant interaction effect between overview and selection on mean NMI (F(1, 36) = 5.58, p = .024), confirming the early performance trends seen in Figure 5, at least for NMI. No other main or interaction effects were significant, likely due to low statistical power.

Subjective ratings. Table 4 shows the average scores given for the six NASA-TLX questions in different conditions. Separate 2 × 2 ANOVAs with ART for each of the measures revealed only one significant result: participants who used the topic model overview found the task to be significantly less frustrating (M = 4.2 and median = 2) than those who used the list overview (M = 7.3 and median = 6.5) on a scale from 1 (low frustration) to 20 (high frustration) (F(1, 36) = 4.43, p = .042), confirming that the topic overview helps users organize their thoughts and experience less stress during labeling.

Participants in the TA and TR conditions rated topic information to be useful in completing the task (M = 5.0 and median = 5) on a scale from 1 (not useful at all) to 7 (very useful). Overall, users are positive about their experience with the system. Participants in all conditions rated overall satisfaction with the interface positively (M = 5.8 and median = 6) on a scale from 1 (not satisfied at all) to 7 (very satisfied).

Discussion. One can argue that using topic overviews for labeling could have a negative effect: users may ignore the document content and focus on topics for labeling. We tried to avoid this issue by making it clear in the instructions that they need to focus on document content and use topics as guidance. On average, the participants in TR created 1.96 labels per topic and the participants in TA created 2.26 labels per topic. This suggests that participants are going beyond what they see in topics for labeling, at least in the TA condition.

6.3 Label Evaluation Results

Section 6.2 compares clusters of documents in different conditions against the gold clusters but does not evaluate the quality of the labels themselves. Since one of the main contributions of ALTO is to accelerate inducing a high-quality label set, we use crowdsourcing to assess how the final induced label sets compare in different conditions.

For completeness, we also compare labels against a fully automatic labeling method (Aletras and Stevenson, 2014) that does not require human intervention. We assign automatic labels to documents based on their most prominent topic.

We ask users on a crowdsourcing platform to vote for the "best" and "worst" label that describes the content of a US congressional bill (we use Crowdflower restricted to US contributors). Five users label each document and we use the aggregated results generated by Crowdflower. The user gets $0.20 for each task.

We randomly choose 200 documents from our dataset (Section 4.1). For each chosen document, we randomly choose a participant from all four conditions (TA, TR, LA, LR). The labels assigned in different conditions and the automatic label of the document's prominent topic construct the candidate labels for the document.[15] Identical labels are merged into one label to avoid showing duplicate labels to users. If a merged label gets a "best" or "worst" vote, we split that vote across all the identical instances.[16] Figure 6 shows the average number of "best" and "worst" votes for each condition and the automatic method. ALTO (TA) receives the most "best" votes and the fewest "worst" votes. LR receives the most "worst" votes. The automatic labels, interestingly, appear to do at least as well as the list view labels, with a similar number of best votes and fewer worst votes. This indicates that automatic labels have reasonable quality compared to at least some manually generated labels.

[15] Some participants had typos in the labels. We corrected all the typos using the pyEnchant (http://pythonhosted.org/pyenchant/) spellchecker. If the corrected label was still wrong, we corrected it manually.
[16] Evaluation data available at http://github.com/Pinafore/publications/tree/master/2016_acl_doclabel/data/label_eval
  Condition   Mental Demand     Physical Demand   Temporal Demand   Performance       Effort              Frustration
  TA          9.8 ± 5.6 [10]    2.9 ± 3.4 [2]     9 ± 7.8 [7]       5.5 ± 5.8 [1.5]   9.4 ± 6.3 [10]      4.5 ± 5.5 [1.5]
  TR          10.6 ± 4.5 [11]   2.4 ± 2.8 [1]     7.4 ± 4.1 [9]     8.8 ± 6.1 [7.5]   9.8 ± 3.7 [10]      3.9 ± 3.0 [3.5]
  LA          9.1 ± 5.5 [10]    1.7 ± 1.3 [1]     10.2 ± 4.8 [11]   8.6 ± 5.3 [10]    10.7 ± 6.2 [12.5]   6.7 ± 5.1 [5.5]
  LR          9.8 ± 6.1 [10]    3.3 ± 2.9 [2]     9.3 ± 5.7 [10]    9.4 ± 5.6 [10]    9.4 ± 6.2 [10]      7.9 ± 5.4 [8]
  (entries are M ± SD [median])

Table 4: Mean, standard deviation, and median results from the NASA-TLX post-survey. All questions are scaled 1 (low)–20 (high), except performance, which is scaled 1 (good)–20 (poor). Users found the topic model overview conditions, TR and TA, to be significantly less frustrating than the list overview conditions.

[Figure 6: bar chart of the mean number of "best" and "worst" votes for conditions LA, LR, TA, TR, and the automatic labels.]

Figure 6: Best and worst votes for document labels. Error bars are standard error from a bootstrap sample. ALTO (TA) gets the most best votes and the fewest worst votes.

However, when users are provided with a topic model overview—with or without active learning selection—they can generate label sets that improve upon automatic labels and labels assigned without the topic model overview.

7 Related Work

Text classification—a ubiquitous machine learning tool for automatically labeling text (Zhang, 2010)—is a well-trodden area of NLP research. The difficulty is often creating the training data (Hwa, 2004; Osborne and Baldridge, 2004); coding theory is an entire subfield of social science devoted to creating, formulating, and applying labels to text data (Saldana, 2012; Musialek et al., 2016). Crowdsourcing (Snow et al., 2008) and active learning (Settles, 2012) can decrease the cost of annotation, but only after a label set exists.

ALTO's corpus overviews aid text understanding, building on traditional interfaces for gaining both local and global information (Hearst and Pedersen, 1996). More elaborate interfaces (Eisenstein et al., 2012; Chaney and Blei, 2012; Roberts et al., 2014) provide richer information given a fixed topic model. Alternatively, because topic models are imperfect (Boyd-Graber et al., 2014), refining underlying topic models may also improve users' understanding of a corpus (Choo et al., 2013; Hoque and Carenini, 2015).

Summarizing document collections through discovered topics can happen through raw topics labeled manually by users (Talley et al., 2011), automatically (Lau et al., 2011), or by learning a mapping from labels to topics (Ramage et al., 2009). When there is not a direct correspondence between topics and labels, classifiers learn a mapping (Blei and McAuliffe, 2007; Zhu et al., 2009; Nguyen et al., 2015). Because we want topics to be consistent between users, we use a classifier with static topics in ALTO. Combining our interface with dynamic topics could improve overall labeling, perhaps at the cost of introducing confusion as topics change during the labeling process.

8 Conclusion and Future Work

We introduce ALTO, an interactive framework that combines active learning selection with topic model overviews to help users both induce a label set and assign labels to documents. We show that users can more effectively and efficiently induce a label set and create training data using ALTO in comparison with the other conditions, which lack either the topic overview or active selection.

We can further improve ALTO to help users gain a better and faster understanding of text corpora. Our current system limits users to viewing only 20K documents at a time and allows for one label assignment per document. Moreover, the topics are static and do not adapt to better reflect users' labels. Users should have better support for browsing documents and assigning multiple labels.

Finally, with slight changes to what the system considers a document, we believe ALTO can be extended to NLP applications other than classification, such as named entity recognition or semantic role labeling, to reduce the annotation effort.
Acknowledgments

We thank the anonymous reviewers, David Mimno, Edward Scott Adler, Philip Resnik, and Burr Settles for their insightful comments. We also thank Nikolaos Aletras for providing the automatic topic labeling code. Boyd-Graber and Poursabzi-Sangdeh's contribution is supported by NSF Grant NCSE-1422492; Findlater, Seppi, and Boyd-Graber's contribution is supported by collaborative NSF Grants IIS-1409287 (UMD) and IIS-1409739 (BYU). Any opinions, findings, results, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

E. Scott Adler and John Wilkerson. 2006. Congressional Bills Project. NSF, 880066:00880061.

Nikolaos Aletras and Mark Stevenson. 2014. Labelling topics using unsupervised graph-based methods. In Proceedings of the Association for Computational Linguistics, pages 631–636.

Pranav Anand, Joseph King, Jordan Boyd-Graber, Earl Wagner, Craig H. Martell, Douglas W. Oard, and Philip Resnik. 2011. Believe me—we can do this! Annotating persuasive acts in blog text. In Computational Models of Natural Argument.

David M. Blei and Jon D. McAuliffe. 2007. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems.

David M. Blei, Andrew Ng, and Michael Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.

Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. In The Biennial GSCL Conference, pages 31–40.

Jordan Boyd-Graber, David Mimno, and David Newman. 2014. Care and feeding of topic models: Problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Their Applications. CRC Press, Boca Raton, FL, USA.

Ian Budge. 2001. Mapping Policy Preferences: Estimates for Parties, Electors, and Governments, 1945-1998, volume 1. Oxford University Press.

Bob Carpenter. 2008. LingPipe 4.1.0. http://alias-i.com/lingpipe.

Allison Chaney and David Blei. 2012. Visualizing topic models. In International AAAI Conference on Weblogs and Social Media.

Jaegul Choo, Changhyun Lee, Chandan K. Reddy, and Haesun Park. 2013. UTOPIAN: User-driven topic modeling based on interactive nonnegative matrix factorization. IEEE Transactions on Visualization and Computer Graphics, 19(12):1992–2001.

Sanjoy Dasgupta and Daniel Hsu. 2008. Hierarchical sampling for active learning. In Proceedings of the International Conference of Machine Learning.

Jacob Eisenstein, Duen Horng Chau, Aniket Kittur, and Eric Xing. 2012. TopicViz: Interactive topic exploration in document collections. In International Conference on Human Factors in Computing Systems.

Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology, 52:139–183.

M. A. Hearst and J. O. Pedersen. 1996. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval.

Enamul Hoque and Giuseppe Carenini. 2015. ConVisIT: Interactive topic modeling for exploring asynchronous online conversations. In Proceedings of the 20th International Conference on Intelligent User Interfaces, IUI '15.

Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014. Interactive topic modeling. Machine Learning, 95(3):423–469.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.

Rebecca Hwa. 2004. Sample selection for statistical parsing. Computational Linguistics, 30(3):253–276.

Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. 2014. Political ideology detection using recursive neural networks. In Proceedings of the Association for Computational Linguistics.

Edward F. Kelly and Philip J. Stone. 1975. Computer Recognition of English Word Senses, volume 13. North-Holland.

Soo-Min Kim and Eduard Hovy. 2004. Determining the sentiment of opinions. In Proceedings of the Association for Computational Linguistics, page 1367.

Hans-Dieter Klingemann, Andrea Volkens, Judith Bara, Ian Budge, Michael D. McDonald, et al. 2006. Mapping Policy Preferences II: Estimates for Parties, Electors, and Governments in Eastern Europe, European Union, and OECD 1990-2003. Oxford University Press, Oxford.

Ken Lang. 2007. 20 Newsgroups data set. http://www.ai.mit.edu/people/jrennie/20Newsgroups/.
Jey Han Lau, Karl Grieser, David Newman, and Timothy Baldwin. 2011. Automatic labelling of topic models. In Proceedings of the Association for Computational Linguistics, pages 1536–1545.

Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12. Springer-Verlag New York, Inc.

Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit. http://www.cs.umass.edu/~mccallum/mallet.

Marina Meilă. 2003. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, pages 173–187. Springer.

Chris Musialek, Philip Resnik, and S. Andrew Stavisky. 2016. Using text analytic techniques to create efficiencies in analyzing qualitative data: A comparison between traditional content analysis and a topic modeling approach. In American Association for Public Opinion Research.

Grace Ngai and David Yarowsky. 2000. Rule writing or annotation: Cost-efficient resource usage for base noun phrase chunking. In Proceedings of the Association for Computational Linguistics.

Thang Nguyen, Jordan Boyd-Graber, Jeff Lund, Kevin Seppi, and Eric Ringger. 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Conference of the North American Chapter of the Association for Computational Linguistics.

Sonya Nikolova, Jordan Boyd-Graber, and Christiane Fellbaum. 2011. Collecting semantic similarity ratings to connect concepts in assistive communication tools. In Modeling, Learning, and Processing of Text Technological Data Structures, pages 81–93. Springer.

Miles Osborne and Jason Baldridge. 2004. Ensemble-based active learning for parse selection. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 89–96.

Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of Empirical Methods in Natural Language Processing.

William M. Rand. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336):846–850.

Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science, 58(4):1064–1082.

J. Saldana. 2012. The Coding Manual for Qualitative Researchers. SAGE Publications.

Burr Settles. 2012. Active Learning (Synthesis Lectures on Artificial Intelligence and Machine Learning). Long Island, NY: Morgan & Claypool.

H. Sebastian Seung, Manfred Opper, and Haim Sompolinsky. 1992. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 287–294. ACM.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of Empirical Methods in Natural Language Processing.

Alexander Strehl and Joydeep Ghosh. 2003. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Edmund M. Talley, David Newman, David Mimno, Bruce W. Herr, Hanna M. Wallach, Gully A. P. C. Burns, A. G. Miriam Leenders, and Andrew McCallum. 2011. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443–444, May.

Katrin Tomanek, Joachim Wermter, and Udo Hahn. 2007. An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In Proceedings of Empirical Methods in Natural Language Processing, pages 486–495.

Paul M. B. Vitányi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li. 2009. Normalized information distance. In Information Theory and Statistical Learning, pages 45–82. Springer.

Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The aligned rank transform for nonparametric factorial analyses using only ANOVA procedures. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 143–146. ACM.

Tong Zhang. 2010. Fundamental statistical techniques. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing. Chapman & Hall/CRC, 2nd edition.

Ying Zhao and George Karypis. 2001. Criterion functions for document clustering: Experiments and analysis. Technical report, University of Minnesota.

Jun Zhu, Amr Ahmed, and Eric P. Xing. 2009. MedLDA: Maximum margin supervised topic models for regression and classification. In Proceedings of the International Conference of Machine Learning.
