TR1648
Computer Sciences Department
Active Learning Literature Survey
Burr Settles
January 2009
The key idea behind active learning is that a machine learning algorithm can
achieve greater accuracy with fewer labeled training instances if it is allowed to
choose the data from which it learns. An active learner may ask queries in the
form of unlabeled instances to be labeled by an oracle (e.g., a human annotator).
Active learning is well-motivated in many modern machine learning problems,
where unlabeled data may be abundant but labels are difficult, time-consuming,
or expensive to obtain.
This report provides a general introduction to active learning and a survey of
the literature. This includes a discussion of the scenarios in which queries can
be formulated, and an overview of the query strategy frameworks proposed in
the literature to date. An analysis of the empirical and theoretical evidence for
active learning, a summary of several problem setting variants, and a discussion
of related topics in machine learning research are also presented.
Contents
1 Introduction
1.1 What is Active Learning?
1.2 Active Learning Examples
1.3 Further Reading
2 Scenarios
2.1 Membership Query Synthesis
2.2 Stream-Based Selective Sampling
2.3 Pool-Based Active Learning
Bibliography
1 Introduction
This report provides a general review of the literature on active learning. There
have been a host of algorithms and applications for learning with queries over
the years, and this document is an attempt to distill the core ideas, methods, and
applications that have been considered by the machine learning community. To
make this survey more useful in the long term, an online version will be updated
and maintained indefinitely at:
http://pages.cs.wisc.edu/~bsettles/active-learning/
This document is written for a machine learning audience, and assumes the reader
has a working knowledge of supervised learning algorithms (particularly statisti-
cal methods). For a good introduction to general machine learning, I recommend
Mitchell (1997) or Duda et al. (2001). This review is by no means comprehensive.
My research deals primarily with applications in natural language processing and
bioinformatics, thus much of the empirical active learning work I am familiar with
is in these areas. Active learning (like so many subfields in computer science) is
growing and evolving rapidly, so it is difficult for one person to provide an ex-
haustive summary. I apologize in advance for any oversights or inaccuracies, and
encourage interested readers to submit additions, comments, and corrections to
me at: [email protected].
1.1 What is Active Learning?
Active learning (also called “query learning,” or sometimes “optimal experimental
design” in the statistics literature) is a subfield of machine learning and, more gen-
erally, artificial intelligence. The key hypothesis is that, if the learning algorithm
is allowed to choose the data from which it learns—to be “curious,” if you will—it
will perform better with less training. Why is this a desirable property for learning
algorithms to have? Consider that, for any supervised learning system to perform
well, it must often be trained on hundreds (even thousands) of labeled instances.
Sometimes these labels come at little or no cost, such as the “spam” flag you
mark on unwanted email messages, or the five-star rating you might give to films
on a social networking website. Learning systems use these flags and ratings to
better filter your junk email and suggest movies you might enjoy. In these cases
you provide such labels for free, but for many other more sophisticated supervised
learning tasks, labeled instances are very difficult, time-consuming, or expensive
to obtain. Here are a few examples:
Active learning systems attempt to overcome the labeling bottleneck by asking
queries in the form of unlabeled instances to be labeled by an oracle (e.g., a human
annotator). In this way, the active learner aims to achieve high accuracy using
as few labeled instances as possible, thereby minimizing the cost of obtaining
labeled data. Active learning is well-motivated in many modern machine learning
problems where data may be abundant but labels are scarce or expensive to obtain.
Note that this kind of active learning is related in spirit, though not to be confused,
with the family of instructional techniques by the same name in the education
literature (Bonwell and Eison, 1991).
[Figure 1: The pool-based active learning cycle, showing the labeled training set L, the unlabeled pool U, query selection, and the oracle (e.g., human annotator).]
There are several scenarios in which active learners may pose queries, and
there are also several different query strategies that have been used to decide which
instances are most informative. In this section, I present two illustrative examples
in the pool-based active learning setting (in which queries are selected from a
large pool of unlabeled instances U) using an uncertainty sampling query strategy
(which selects the instance in the pool about which the model is least certain how
to label). Sections 2 and 3 describe all the active learning scenarios and query
strategy frameworks in more detail.
Figure 2: An illustrative example of pool-based active learning. (a) A toy data set of
400 instances, evenly sampled from two class Gaussians. The instances are
represented as points in a 2D feature space. (b) A logistic regression model
trained with 30 labeled instances randomly drawn from the problem domain.
The line represents the decision boundary of the classifier (accuracy = 0.7).
(c) A logistic regression model trained with 30 actively queried instances using
uncertainty sampling (accuracy = 0.9).
Figure 1 illustrates the pool-based active learning cycle. A learner may begin
with a small number of instances in the labeled training set L, request labels for
one or more carefully selected instances, learn from the query results, and then
leverage its new knowledge to choose which instances to query next. Once a
query has been made, there are usually no additional assumptions on the part of
the learning algorithm. The new labeled instance is simply added to the labeled
set L, and the learner proceeds from there in a standard supervised way. There are
a few exceptions to this, such as when the learner is allowed to make alternative
types of queries (Section 5.4), or when active learning is combined with semi-
supervised learning (Section 6.1).
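The cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the literature surveyed here: the names `active_learning_loop`, `train`, `select_query`, and `oracle` are hypothetical, and the toy usage assumes a one-dimensional threshold learner.

```python
def active_learning_loop(labeled, pool, train, select_query, oracle, n_queries):
    """Generic pool-based cycle: train, pick a query, ask the oracle, repeat.

    labeled      -- list of (x, y) pairs (the labeled set L); not modified
    pool         -- list of unlabeled instances (the pool U); not modified
    train        -- hypothetical callback: list of (x, y) -> model
    select_query -- hypothetical callback: (model, pool) -> index into pool
    oracle       -- hypothetical callback: x -> label (e.g., a human annotator)
    """
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)
    for _ in range(n_queries):
        if not pool:
            break
        i = select_query(model, pool)
        x = pool.pop(i)                  # remove the chosen instance from U
        labeled.append((x, oracle(x)))   # add the newly labeled instance to L
        model = train(labeled)           # standard supervised retraining
    return model, labeled


# Toy usage (an illustrative assumption): the true concept is "x > 0.62", and
# the model is a threshold placed midway between the largest labeled negative
# and the smallest labeled positive.
def train_threshold(labeled):
    neg = max(x for x, y in labeled if y == 0)
    pos = min(x for x, y in labeled if y == 1)
    return (neg + pos) / 2.0


def closest_to_threshold(theta, pool):
    return min(range(len(pool)), key=lambda i: abs(pool[i] - theta))


pool = [i / 100.0 for i in range(1, 100)]
seed = [(0.0, 0), (1.0, 1)]
theta, labeled = active_learning_loop(
    seed, pool, train_threshold, closest_to_threshold,
    oracle=lambda x: int(x > 0.62), n_queries=8)
```

Keeping the learner, the query strategy, and the oracle as separate callbacks mirrors the text: once a query is answered, the rest is ordinary supervised training.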
Figure 2 shows the potential of active learning in a way that is easy to visu-
alize. This is a toy data set generated from two Gaussians centered at (-2,0) and
(2,0) with standard deviation σ = 1, each representing a different class distribu-
tion. Figure 2(a) shows the resulting data set after 400 instances are sampled (200
from each class); instances are represented as points in a 2D feature space. In a
real-world setting these instances may be available, but their labels usually are not.
Figure 2(b) illustrates the traditional supervised learning approach after randomly
selecting 30 instances for labeling, drawn i.i.d. from the unlabeled pool U. The
line shows the linear decision boundary of a logistic regression model (i.e., where
the posterior equals 0.5) trained using these 30 points. Notice that most of the la-
beled instances in this training set are far from zero on the horizontal axis, which
Figure 3: Learning curves for text classification: baseball vs. hockey. Curves plot clas-
sification accuracy as a function of the number of documents queried for two se-
lection strategies: uncertainty sampling (active learning) and random sampling
(passive learning). We can see that the active learning approach is superior here
because its learning curve dominates that of random sampling.
is where the Bayes optimal decision boundary should probably be. As a result,
this classifier only achieves accuracy = 0.7 on the remaining unlabeled points.
Figure 2(c), however, tells a very different story. The active learner uses uncer-
tainty sampling to focus on instances closest to its decision boundary, assuming it
can adequately explain those in other parts of the input space characterized by U.
As a result, it avoids requesting labels for redundant or irrelevant instances, and
achieves accuracy = 0.9 with a mere 30 labeled instances. That is a 67% reduction
in error compared to “passive” supervised learning (i.e., random sampling), and
less than 10% of the data was labeled.
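Both figures quoted above follow directly from the reported accuracies and pool size; as a quick check:

```latex
\[
\frac{(1 - 0.7) - (1 - 0.9)}{1 - 0.7} = \frac{0.3 - 0.1}{0.3} \approx 0.67,
\qquad
\frac{30}{400} = 0.075 < 0.10 .
\]
```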
Now let us consider active learning for a real-world learning task: text classifi-
cation. In this example, a learner must distinguish between baseball and hockey
documents from the 20 Newsgroups corpus (Lang, 1995), which consists of 2,000
Usenet documents evenly divided between the two classes. Active learning al-
gorithms are generally evaluated by constructing learning curves, which plot the
evaluation measure of interest (e.g., accuracy) as a function of the number of
new instance queries that are labeled and added to L. Figure 3 presents learning
curves for the first 100 instances labeled using uncertainty sampling and random
sampling. The reported results are for a logistic regression model averaged over
ten folds using cross-validation. After labeling 30 new instances, the accuracy of
uncertainty sampling is 0.810, while the random baseline is only 0.730. As can be
seen, the active learning curve dominates the baseline curve for all of the points
shown in this figure. We can conclude that an active learning algorithm is superior
to some other approach (e.g., a random baseline that represents traditional passive
supervised learning) if it dominates the other for most or all of the points along
their learning curves.
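The dominance criterion can be made concrete with a small helper. This is only a sketch: the function name `dominates` and the `min_fraction` parameter (one way to encode "most or all of the points") are assumptions, and the curves below are hypothetical except for the two values at 30 queries quoted in the text.

```python
def dominates(curve_a, curve_b, min_fraction=1.0):
    """Return True if curve_a is at least as high as curve_b at the required
    fraction of points and strictly higher at one or more of them."""
    if len(curve_a) != len(curve_b) or not curve_a:
        raise ValueError("curves must be non-empty and the same length")
    at_least = sum(a >= b for a, b in zip(curve_a, curve_b))
    strictly = any(a > b for a, b in zip(curve_a, curve_b))
    return strictly and at_least / len(curve_a) >= min_fraction


# Hypothetical accuracies at 10, 20, 30, and 40 queries. Only the values at
# 30 queries (0.810 and 0.730) come from the text; the rest are made up.
active = [0.62, 0.75, 0.810, 0.84]
passive = [0.60, 0.68, 0.730, 0.77]
active_is_better = dominates(active, passive)
```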
2 Scenarios
There are several different problem scenarios in which the learner may be able to
ask queries. The three main settings that have been considered in the literature
are (i) membership query synthesis, (ii) stream-based selective sampling, and (iii)
pool-based active learning. Figure 4 illustrates the differences among these three
scenarios. The remainder of this section provides an overview of the different
active learning settings.
[Figure 4, partially recovered: membership query synthesis, in which the model generates a query de novo.]
which can execute a series of autonomous biological experiments to discover
metabolic pathways in the yeast Saccharomyces cerevisiae. Here, an instance
is a mixture of chemical solutions that constitute a growth medium, as well as
a particular yeast mutant. A label, then, is whether or not the mutant thrived
in the growth medium. All experiments are autonomously synthesized using an
active learning approach based on inductive logic programming, and physically
performed using a laboratory robot. This active method results in a three-fold decrease
in the cost of experimental materials compared to naïvely running the least
expensive experiment, and a 100-fold decrease in cost compared to randomly gen-
erated experiments. In domains where labels come not from human annotators,
but from experiments such as this, query synthesis may be a promising direction
for automated scientific discovery.
the labeled data, but disagree on some unlabeled instance, then that instance lies
within the region of uncertainty. Calculating this region completely and explicitly
is computationally expensive, however, and it must be maintained after each new
query. As a result, approximations are used in practice (Seung et al., 1992; Cohn
et al., 1994; Dasgupta et al., 2008).
The stream-based scenario has been studied in several real-world tasks, includ-
ing part-of-speech tagging (Dagan and Engelson, 1995), sensor scheduling (Kr-
ishnamurthy, 2002), and learning ranking functions for information retrieval (Yu,
2005). Fujii et al. (1998) employ selective sampling for active learning in word
sense disambiguation, e.g., determining if the word “bank” means land alongside
a river or a financial institution in a given context (only they study Japanese words
in their work). The approach not only reduces annotation effort, but also limits
the size of the database used in nearest-neighbor learning, which in turn expedites
the classification algorithm.
It is worth noting that some authors (e.g., Thompson et al., 1999; Moskovitch
et al., 2007) use “selective sampling” to refer to the pool-based scenario described
in the next section. Under this interpretation, the term merely signifies that queries
are made with a select set of instances sampled from a real data distribution.
However, in most of the literature selective sampling refers to the stream-based
scenario described here.
and retrieval (Yan et al., 2003; Hauptmann et al., 2006), speech recognition (Tur
et al., 2005), and cancer diagnosis (Liu, 2004) to name a few.
The main difference between stream-based and pool-based active learning is
that the former scans through the data sequentially and makes query decisions
individually, whereas the latter evaluates and ranks the entire collection before
selecting the best query. While the pool-based scenario appears to be much more
common among application papers, one can imagine settings where the stream-based
approach is more appropriate, for example when memory or processing power is
limited, as with mobile and embedded devices.
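The contrast between the two scenarios can be sketched as follows. The function names, the fixed `threshold` used for the per-instance decision, and the toy posteriors are all assumptions made for illustration.

```python
def stream_based_queries(stream, uncertainty, threshold):
    """Scan instances one at a time and decide immediately whether to query.
    Only the current instance needs to be held in memory."""
    for x in stream:
        if uncertainty(x) >= threshold:
            yield x


def pool_based_query(pool, uncertainty):
    """Evaluate the whole pool and return the single most uncertain instance."""
    return max(pool, key=uncertainty)


# Hypothetical binary posteriors P(y = 1 | x) for four instances.
posterior = {"a": 0.95, "b": 0.55, "c": 0.20, "d": 0.48}


def uncertainty(x):
    # 1 at a posterior of 0.5, falling to 0 at a posterior of 0 or 1.
    return 1.0 - abs(2.0 * posterior[x] - 1.0)


instances = ["a", "b", "c", "d"]
streamed = list(stream_based_queries(instances, uncertainty, threshold=0.8))
best = pool_based_query(instances, uncertainty)
```

The stream-based version may issue several queries or none, depending on the threshold, while the pool-based version always returns exactly one instance per round.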
in machine learning. For binary classification, entropy-based uncertainty sam-
pling is identical to choosing the instance with posterior closest to 0.5. However,
the entropy-based approach can be generalized easily to probabilistic multi-label
classifiers and probabilistic models for more complex structured instances, such
as sequences (Settles and Craven, 2008) and trees (Hwa, 2004). An alternative to
entropy in these more complex settings involves querying the instance whose best
labeling is the least confident:
$$x^*_{LC} = \operatorname*{argmin}_x P(y^* \mid x; \theta),$$
where $y^* = \operatorname*{argmax}_y P(y \mid x; \theta)$ is the most likely class labeling. This sort of strategy
has been shown to work well, for example, with conditional random fields
or CRFs (Lafferty et al., 2001) for active learning in information extraction tasks
(Culotta and McCallum, 2005; Settles and Craven, 2008). For binary classifica-
tion, this approach is equivalent to the entropy-based strategy.
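A small sketch may help make the relationship between the two measures concrete. The function names and the posteriors below are hypothetical; the binary pool shows the two strategies agreeing, and the three-class pool shows that they can differ once there are more than two labels.

```python
import math


def entropy(probs):
    """Shannon entropy of a label distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def least_confidence(probs):
    """1 - P(y*|x): larger means the best labeling is less confident."""
    return 1.0 - max(probs)


def select(pool_probs, score):
    """Index of the instance with the highest uncertainty score."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))


# Hypothetical posteriors for a pool of four instances (binary task).
binary_pool = [[0.9, 0.1], [0.35, 0.65], [0.58, 0.42], [0.2, 0.8]]
# Hypothetical three-class posteriors where the two measures disagree.
multi_pool = [[0.5, 0.5, 0.0], [0.55, 0.225, 0.225]]

binary_choice = (select(binary_pool, entropy), select(binary_pool, least_confidence))
multi_choice = (select(multi_pool, entropy), select(multi_pool, least_confidence))
```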
Uncertainty sampling strategies may also be employed with non-probabilistic
models. One of the first works to explore uncertainty sampling used a decision tree
classifier (Lewis and Catlett, 1994) by modifying it to have probabilistic output.
Similar approaches have been applied to active learning with nearest-neighbor
(a.k.a. “memory-based” or “instance-based”) classifiers (Fujii et al., 1998; Lin-
denbaum et al., 2004), by allowing each neighbor to vote on the class label of x,
with the proportion of these votes representing the posterior label probability.
Tong and Koller (2000) also experiment with an uncertainty sampling strategy
for support vector machines or SVMs (Cortes and Vapnik, 1995), that involves
querying the instance closest to the linear decision boundary. This last approach
is analogous to uncertainty sampling with a probabilistic binary linear classifier,
such as logistic regression or naïve Bayes.
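For a linear classifier, querying the instance closest to the decision boundary amounts to minimizing the absolute decision value. The sketch below assumes a model given by hypothetical weights `w` and bias `b`; it is not the procedure of any particular cited paper.

```python
def decision_value(w, b, x):
    """Signed score w.x + b of a linear classifier (the sign gives the class)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def closest_to_boundary(w, b, pool):
    """Index of the pool instance with the smallest |w.x + b|."""
    return min(range(len(pool)), key=lambda i: abs(decision_value(w, b, pool[i])))


# Hypothetical weights and pool, for illustration only.
w, b = [1.0, -2.0], 0.5
pool = [(3.0, 0.0), (1.0, 1.0), (0.0, 2.0), (-1.0, -0.4)]
query_index = closest_to_boundary(w, b, pool)
```

Dividing by the norm of `w` would give the geometric distance but would not change the ranking. For logistic regression, where the posterior is a sigmoid of the same score, the smallest absolute score is also the posterior closest to 0.5.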
3.2 Query-By-Committee
Another, more theoretically-motivated query selection framework is the query-
by-committee (QBC) algorithm (Seung et al., 1992). The QBC approach involves
maintaining a committee $\mathcal{C} = \{\theta^{(1)}, \ldots, \theta^{(C)}\}$ of models which are all trained on
the current labeled set L, but represent competing hypotheses. Each committee
member is then allowed to vote on the labelings of query candidates. The most
informative query is considered to be the instance about which they most disagree.
The fundamental premise behind the QBC framework is minimizing the ver-
sion space, which is (as mentioned in Section 2.2) the set of hypotheses that are
Figure 5: Version space examples for (a) linear and (b) axis-parallel box classifiers. All
hypotheses are consistent with the labeled training data in L (as indicated by
shaded polygons), but each represents a different model in the version space.
consistent with the current labeled training data L. Figure 5 illustrates the concept
of version spaces for (a) linear functions and (b) axis-parallel box classifiers in
different binary classification tasks. If we view machine learning as a search for
the “best” model within the version space, then our goal in active learning is to
constrain the size of this space as much as possible (so that the search can be more
precise) with as few labeled instances as possible. This is exactly what QBC does,
by querying in controversial regions of the input space. In order to implement a
QBC selection algorithm, one must:
Seung et al. (1992) accomplish the first task simply by sampling a commit-
tee of two random hypotheses that are consistent with L. For generative model
classes, this can be done more generally by randomly sampling an arbitrary num-
ber of models from some posterior distribution P (θ|L). For example, McCallum
and Nigam (1998) do this for naïve Bayes by using the Dirichlet distribution over
model parameters, whereas Dagan and Engelson (1995) sample hidden Markov
models or HMMs by using the Normal distribution. For other model classes,
such as discriminative or non-probabilistic models, Abe and Mamitsuka (1998)
have proposed query-by-boosting and query-by-bagging, which employ the well-
known ensemble learning methods boosting (Freund and Schapire, 1997) and bag-
ging (Breiman, 1996) to construct committees. Melville and Mooney (2004) pro-
pose another ensemble-based method that explicitly encourages diversity among
committee members. There is no general agreement in the literature on the appro-
priate committee size to use, which may in fact vary by model class or applica-
tion. However, even small committee sizes (e.g., two or three) have been shown
to work well in practice (Seung et al., 1992; McCallum and Nigam, 1998; Settles
and Craven, 2008).
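As a rough illustration of building a committee with bagging, each member can be trained on a bootstrap resample of the labeled set. This sketches the general bagging idea only, not the exact query-by-bagging procedure of the cited work; `train` is a hypothetical callback and the majority-label learner is a deliberately trivial stand-in.

```python
import random


def bagging_committee(labeled, train, committee_size, seed=0):
    """Train each member on a bootstrap resample of the labeled set L.

    train is a hypothetical callback mapping a list of (x, y) pairs to a model.
    """
    rng = random.Random(seed)
    committee = []
    for _ in range(committee_size):
        resample = [rng.choice(labeled) for _ in labeled]  # with replacement
        committee.append(train(resample))
    return committee


# Toy learner (an assumption for illustration): predict the majority label
# of the resample, ignoring x entirely.
def train_majority(data):
    labels = [y for _, y in data]
    return max(sorted(set(labels)), key=labels.count)


labeled = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.7, 1)]
committee = bagging_committee(labeled, train_majority, committee_size=3)
```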
For measuring the level of disagreement, two main approaches have been pro-
posed. The first is vote entropy (Dagan and Engelson, 1995):
$$x^*_{VE} = \operatorname*{argmax}_x \; -\sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C},$$
where yi again ranges over all possible labelings, and V (yi ) is the number of
“votes” that a label receives from among the committee members’ predictions.
This can be thought of as a QBC generalization of entropy-based uncertainty sam-
pling. Another disagreement measure that has been proposed is average Kullback-
Leibler (KL) divergence (McCallum and Nigam, 1998):
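$$x^*_{KL} = \operatorname*{argmax}_x \frac{1}{C} \sum_{c=1}^{C} D(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}),$$
where:
$$D(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}) = \sum_i P(y_i \mid x; \theta^{(c)}) \log \frac{P(y_i \mid x; \theta^{(c)})}{P(y_i \mid x; \mathcal{C})}.$$
Here $\theta^{(c)}$ represents a particular model in the committee, and $\mathcal{C}$ represents the
committee as a whole, thus $P(y_i \mid x; \mathcal{C}) = \frac{1}{C} \sum_{c=1}^{C} P(y_i \mid x; \theta^{(c)})$ is the “consensus”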
probability that yi is the correct label. KL divergence (Kullback and Leibler, 1951)
is an information-theoretic measure of the difference between two probability dis-
tributions. So this disagreement measure considers the most informative query to
be the one with the largest average difference between the label distributions of
any one committee member and the consensus.
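Both disagreement measures are short to compute for a single candidate instance. The sketch below follows the two definitions above; the function names and the committee outputs are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Vote entropy for one instance; votes holds each member's predicted label."""
    c = len(votes)
    return -sum((v / c) * math.log(v / c) for v in Counter(votes).values())


def average_kl(member_probs):
    """Mean KL divergence of each member's label distribution from the consensus.

    member_probs[c][i] is P(y_i | x; theta^(c)).
    """
    c = len(member_probs)
    n_labels = len(member_probs[0])
    consensus = [sum(p[i] for p in member_probs) / c for i in range(n_labels)]
    total = 0.0
    for p in member_probs:
        # Terms with p[i] == 0 contribute nothing to the divergence.
        total += sum(p[i] * math.log(p[i] / consensus[i])
                     for i in range(n_labels) if p[i] > 0.0)
    return total / c


# Hypothetical committee outputs for two candidate instances.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
split = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
```

A query strategy would compute one of these scores for every candidate and pick the instance with the largest value.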
Aside from the QBC framework, several other query strategies attempt to min-
imize the version space as well. For example, Cohn et al. (1994) describe a re-
lated selective sampling algorithm for neural networks using a combination of the
“most specific” and “most general” models, which lie at two extremes of the version
space given the current training set L. Tong and Koller (2000) propose a pool-
based query strategy that tries to minimize the version space for support vector
machine classifiers directly. The membership query algorithms of Angluin (1988)
and King et al. (2004) can also be interpreted as synthesizing de novo instances
that limit the size of the version space. However, Haussler (1994) shows that the
size of the version space can grow exponentially with the size of L. This means
that, in general, the version space of an arbitrary model class cannot be explicitly
represented in practice. The QBC framework, rather, uses a committee which is a
subset-approximation of the full version space.
where $\|\cdot\|$ is the Euclidean norm of each resulting gradient vector. Note that, at
query time, $\|\nabla \ell(\mathcal{L}; \theta)\|$ should be nearly zero since $\ell$ converged at the previous
round of training. Thus, we can approximate $\nabla \ell(\mathcal{L} \cup \langle x, y_i \rangle; \theta) \approx \nabla \ell(\langle x, y_i \rangle; \theta)$
for computational efficiency, because the training instances are assumed to be
independent.
The intuition behind this framework is that it prefers instances that are likely
to most influence the model (i.e., have greatest impact on its parameters), regard-
less of the resulting query label. This approach has been shown to work well in
empirical studies, but can be computationally expensive if both the feature space
and set of labelings are very large.
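For binary logistic regression the expectation has a simple form, which makes the idea concrete. The sketch below assumes the log-likelihood gradient for a single instance ⟨x, y⟩ is (y − p)x, with p = P(y = 1 | x; θ), and uses only the gradient of the candidate itself. The function names and parameter values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expected_gradient_length(theta, x):
    """Sum over y in {0, 1} of P(y|x) * ||gradient of the log-likelihood at <x, y>||."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    length_if_1 = abs(1.0 - p) * x_norm   # ||(1 - p) x||
    length_if_0 = abs(0.0 - p) * x_norm   # ||(0 - p) x||
    return p * length_if_1 + (1.0 - p) * length_if_0


def select_egl(theta, pool):
    """Index of the candidate with the largest expected gradient length."""
    return max(range(len(pool)),
               key=lambda i: expected_gradient_length(theta, pool[i]))


theta = [1.0, -1.0]                       # hypothetical parameters
pool = [(2.0, 2.0), (0.1, 0.1), (3.0, 0.0)]
query_index = select_egl(theta, pool)
```

Under these assumptions the score reduces to 2p(1 − p)‖x‖, so it favors instances that are both uncertain and large in norm.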
$$E_T\left[(o - y)^2 \mid x\right] = E\left[(y - E[y \mid x])^2\right] + \left(E_{\mathcal{L}}[o] - E[y \mid x]\right)^2 + E_{\mathcal{L}}\left[(o - E_{\mathcal{L}}[o])^2\right],$$
where $E_{\mathcal{L}}[\cdot]$ is an expectation over some labeled set L of a given size, $E[\cdot]$ is an
expectation over the conditional density $P(y \mid x)$, and $E_T$ is an expectation over
both. Here also $o = g(x; \theta)$ is shorthand for the model’s predicted output for a
given instance x (g is the learned function parameterized by θ), while y indicates
the true label of the instance.
The first term on the right-hand side of this equation is the noise, i.e., the
variance of the true label y given only x, which does not depend on the model
or training data. Such noise may result from stochastic effects of the method
used to obtain the true labels, for example, or because the feature representation
is inadequate. The second term is the bias, which represents the error due to the
model class itself, e.g., if a linear model is used to learn a function that is only
approximately linear. This component of the overall error is invariant given a
fixed model class. The third term is the model’s variance, which is the remaining
component of the learner’s mean squared error with respect to the true regression
function. Minimizing the variance, then, is guaranteed to minimize the future
generalization error of the model (since the learner itself can do nothing about the
noise or bias components).
Cohn et al. (1996) then use the estimated distribution of the model’s output
to estimate $\tilde{\sigma}_o^2$, the variance of the learner after some new instance x̃ has been
labeled and added to L, and then query the instance resulting in the greatest future
variance reduction:
$$x^*_{VR} = \operatorname*{argmin}_{\tilde{x}} \tilde{\sigma}_o^2.$$
They show that this can be done in closed-form for neural networks, Gaussian
mixture models, and locally-weighted linear regression. In particular, for neural
networks the output variance is approximated by (MacKay, 1992):
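$$\sigma_o^2 \approx S(\mathcal{L}; \theta) \left[\frac{\partial o}{\partial \theta}\right]^T \left[\frac{\partial^2}{\partial \theta^2} S(\mathcal{L}; \theta)\right]^{-1} \left[\frac{\partial o}{\partial \theta}\right],$$
where $S(\mathcal{L}; \theta) = \frac{1}{L} \sum_{l=1}^{L} (o^{(l)} - y^{(l)})^2$ is the mean squared error of the current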
model θ on the training set L. In the equation above, the second and last terms are
computed using the gradient of the model’s predicted output with respect to model
parameters θ. The middle term is the inverse of a covariance matrix representing
a second-order expansion around the objective function S with respect to θ. A
closed-form expression for $\tilde{\sigma}_o^2$ can then be derived, given the assumptions that
$\partial o / \partial \theta$ is locally linear (true for most network configurations) and that variance is
Gaussian and constant for all x; further details are given by Cohn (1994). Since
the equation is a smooth function and differentiable with respect to any query x̃
in the input space, gradient methods can be used to search for the best possible
query that minimizes future variance, and therefore future error. This approach is
derived from statistical theories of optimal experimental design (Federov, 1972).
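As a worked special case (an illustration under stated assumptions, not a result taken from the cited papers), consider a linear model $o = \theta^T x$. Then $\partial o / \partial \theta = x$, and the second derivative of the mean squared error is $\frac{2}{L} \sum_l x^{(l)} x^{(l)T}$, so the output variance approximation becomes:

```latex
\[
\sigma_o^2 \approx S(\mathcal{L};\theta)\; x^{T}
\left[\frac{2}{L}\sum_{l=1}^{L} x^{(l)} x^{(l)T}\right]^{-1} x .
\]
```

In this case the estimate is largest for inputs x that point in directions poorly covered by the training inputs, which is where a new label would be expected to reduce variance the most.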
However, the approach of Cohn et al. (1996) applies only to regression tasks,
and synthesizes new queries de novo. For many learning problems like text clas-
sification, this technique cannot be used. More recently, though, Zhang and Oles
(2000) have proposed an analogous approach for selecting optimal queries in
a pool-based setting for discriminative classifiers based on Fisher information.
Formally, Fisher information I(θ) is the variance of the score, which is the par-
tial derivative of the log-likelihood function with respect to model parameters θ
(Schervish, 1995). Fisher information is given by:
$$I(\theta) = -\int_x P(x) \int_y P(y \mid x; \theta) \frac{\partial^2}{\partial \theta^2} \log P(y \mid x; \theta),$$
and can be interpreted as the overall uncertainty about an input distribution P (x)
with respect to the estimated model parameters. For a model with multiple pa-
rameters, Fisher information takes the form of a covariance matrix. The optimal
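instance to query, then, is the one which minimizes the Fisher information ratio:
$$x^*_{FIR} = \operatorname*{argmin}_x \operatorname{tr}\left(I_x(\theta)^{-1} I_{\mathcal{U}}(\theta)\right),$$
where $I_x(\theta)$ is the Fisher information matrix for an unlabeled query candidate
$x \in \mathcal{U}$, and $I_{\mathcal{U}}(\theta)$ is the analogous matrix integrated over the entire unlabeled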
pool. The trace function tr(·) is the sum of the terms along the principal diagonal
of a matrix, thus the equation above provides us with a ratio given by the inner
product of Ix (θ)’s inverse matrix and IU (θ).
The key idea behind the Fisher information ratio is that Ix (θ) will tell us not
only how uncertain the model is about x (e.g., the magnitude of the matrix di-
agonal), but it also tells us which model parameters are most responsible for this
uncertainty, as it is encoded in the matrix. Likewise, IU (θ) can tell us the same
information about the entire unlabeled pool. By minimizing the ratio above, the
learner will tend to query the instance whose model variance is most similar to
the overall input distribution approximated by U. A more formal explanation as
to why this is the optimal approach stems from the Cramér-Rao lower-bound on
asymptotic efficiency, as explained by Zhang and Oles (2000). They apply this
method to text classification using binary logistic regression. Hoi et al. (2006a)
extend this approach to active text classification in the batch-mode setting (see
Section 5.2) in which a set of queries Q is selected all at once in an attempt to
minimize the ratio between IQ (θ) and IU (θ). Settles and Craven (2008) have
also generalized the Fisher information ratio approach to probabilistic sequence
models such as CRFs.
The query strategies of variance reduction (Cohn et al., 1996) and Fisher in-
formation ratio (Zhang and Oles, 2000), while designed for different tasks and
active learning scenarios, are grouped together here because they can be viewed
as strategies under a more general variance minimization framework. Both are
grounded in statistics, and both select the optimal query to reduce model variance
given the assumptions. There are some practical disadvantages to these methods,
however, in terms of computational complexity. In both strategies, estimating the
variance requires inverting a K × K matrix for each new instance, where K is the
number of parameters in the model θ, resulting in a time complexity of $O(UK^3)$,
where U is the size of the query pool U. This quickly becomes intractable for
large K, which is a common occurrence in, say, natural language tasks. For variance
estimation with neural networks, Paass and Kindermann (1995) propose a
sampling approach based on Markov chains to address this problem. For invert-
ing the Fisher information matrix, Hoi et al. (2006a) use principal component
analysis to reduce the dimensionality of the parameter space. Alternatively, Set-
tles and Craven (2008) approximate the matrix with its diagonal vector, which
can be inverted in only O(K) time. However, these methods are still empirically
much slower than simpler query strategies like uncertainty sampling.
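The diagonal shortcut can be sketched for binary logistic regression. This is a rough illustration, not the implementation from any of the cited papers: it assumes the per-instance Fisher information is p(1 − p)xxᵀ, keeps only the diagonal, averages over the pool to stand in for the pool-level matrix, and adds a small hypothetical smoothing constant `delta` so that zero-valued features do not cause division by zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_diagonal(theta, x, delta=1e-6):
    """Diagonal of p(1-p) x x^T, plus a small constant to keep it invertible."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [p * (1.0 - p) * xi * xi + delta for xi in x]


def fisher_ratio_diag(theta, x, pool):
    """tr(I_x^-1 I_U) with both matrices replaced by their diagonals.

    Inverting a diagonal costs O(K); the pool diagonal is recomputed on every
    call here only to keep the sketch short.
    """
    k = len(theta)
    pool_diag = [0.0] * k
    for u in pool:
        for j, v in enumerate(fisher_diagonal(theta, u)):
            pool_diag[j] += v / len(pool)
    x_diag = fisher_diagonal(theta, x)
    return sum(pool_diag[j] / x_diag[j] for j in range(k))


def select_fir(theta, pool):
    """Index of the candidate that minimizes the diagonal Fisher ratio."""
    return min(range(len(pool)),
               key=lambda i: fisher_ratio_diag(theta, pool[i], pool))


theta = [0.5, -0.5]                                  # hypothetical parameters
pool = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
query_index = select_fir(theta, pool)
```

Candidates with a zero feature are heavily penalized here because they carry no information about the corresponding parameter, which is one visible consequence of the diagonal approximation.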
other model classes, this is not the case. For example, a binary logistic regression
model would require $O(ULG)$ time complexity simply to choose the next query,
where U is the size of the unlabeled pool U, L is the size of the current training set
L, and G is the number of gradient computations required by the optimization
procedure until convergence. A classification task with three or more labels using
a MaxEnt model (Berger et al., 1996) would require $O(M^2 ULG)$ time complexity,
where M is the number of class labels. For a sequence labeling task using
CRFs, the complexity explodes to $O(T M^{T+2} ULG)$, where T is the length of an
input sequence. Because of this, the applications of the estimated error reduc-
tion framework have mostly only considered simple binary classification tasks.
Moreover, because the approach is often still impractical, some researchers have
resorted to subsampling the pool U when selecting queries (Roy and McCallum,
2001) or using only approximate training techniques (Guo and Greiner, 2007).
Figure 6: An illustration of when uncertainty sampling can be a poor strategy for classifi-
cation. Shaded polygons represent labeled instances in L, and circles represent
unlabeled instances in U. Since A is on the decision boundary, it would be
queried as the most uncertain. However, querying B is likely to result in more
information about the data distribution as a whole.
4 Analysis of Active Learning
This section discusses some of the empirical and theoretical evidence for how and
when active learning works in practice.
simple binary thresholding function g parameterized by θ:
$$g(x; \theta) = \begin{cases} 1 & \text{if } x > \theta, \text{ and} \\ 0 & \text{otherwise.} \end{cases}$$
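One standard way to see why a thresholding function like this is friendly to active learning is binary search: on a sorted, perfectly separable one-dimensional pool, the boundary can be located with a number of label queries that is logarithmic in the pool size. The sketch below is an illustration under those assumptions; `binary_search_threshold`, `oracle`, and the target value are hypothetical.

```python
def g(x, theta):
    """Binary thresholding function: 1 if x > theta, 0 otherwise."""
    return 1 if x > theta else 0


def binary_search_threshold(sorted_pool, oracle):
    """Locate the boundary in a sorted, separable 1-D pool with O(log n) queries.

    Returns (index of the first positive instance, number of label queries).
    oracle is a hypothetical callback mapping x to its true label.
    """
    lo, hi, queries = 0, len(sorted_pool), 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(sorted_pool[mid]) == 1:
            hi = mid          # boundary is at mid or to its left
        else:
            lo = mid + 1      # boundary is to the right of mid
    return lo, queries


pool = [i / 1000.0 for i in range(1000)]
true_theta = 0.3141                       # hypothetical target concept
first_pos, n_queries = binary_search_threshold(pool, lambda x: g(x, true_theta))
```

Labeling the same pool passively would require asking about every instance in the worst case, whereas this search asks about at most ten of the thousand.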
Dasgupta et al. (2005) propose a variant of the perceptron update rule which
can achieve the same sample complexity bounds as reported for QBC, but for a
single linear classifier. In earlier work, Dasgupta (2004) also provided a variety
of theoretical upper and lower bounds for active learning in more general pool-
based settings. In particular, if using linear classifiers the sample complexity can
explode to $O(1/\epsilon)$ in the worst case, which offers no improvement over standard
supervised learning, but is also no worse. However, Balcan et al. (2008) also show
that, under an asymptotic setting, active learning is always better than supervised
learning in the limit.
Most of these results have used theoretical frameworks similar to the standard
PAC model, and necessarily assume that the learner knows the correct concept
class in advance. Put another way, they assume that some model in our hypothesis
class can perfectly classify the instances, and that the data are also noise-free. To
address these limitations, there has been some more recent theoretical work in ag-
nostic active learning (Balcan et al., 2006), which only requires that the unlabeled
instances be drawn i.i.d. from a fixed distribution, and even noisy distributions are
allowed. Hanneke (2007) extends this work by providing upper bounds on query
complexity for the agnostic setting, and Dasgupta et al. (2008) propose a some-
what more efficient query selection algorithm. Cesa-Bianchi et al. (2005) have
also shown that active learning is possible in the “regret” framework, also known
as online adversarial learning.
However, most positive theoretical results to date have been based on in-
tractable algorithms, or methods otherwise too prohibitively complex and par-
ticular to be used in practice. The few analyses performed on efficient algorithms
have assumed uniform or near-uniform input distributions (Balcan et al., 2006;
Dasgupta et al., 2005), or severely restricted hypothesis spaces. Furthermore,
these studies have largely been limited to simple (often binary) classification
problems, with few implications for more complex models (e.g., those that label
structured instances like sequences and trees), which are central to many large-scale
information management tasks addressed by the machine learning community today.
[Figure 7, panels (a) and (b)]
based on a probabilistic finite state machine, such as CRFs or HMMs. An exam-
ple sequence model is shown in Figure 7(b).
Settles and Craven (2008) present and evaluate a large number of active learn-
ing algorithms for sequence labeling tasks using probabilistic sequence models
like CRFs. Most of these algorithms can be generalized to other probabilistic
sequence models, such as HMMs (Dagan and Engelson, 1995; Scheffer et al.,
2001) and probabilistic context-free grammars (Baldridge and Osborne, 2004;
Hwa, 2004). Thompson et al. (1999) also propose query strategies for structured
output tasks like semantic parsing and information extraction using inductive logic
programming methods.
part, these approaches show improvements over random batch sampling, which in
turn is generally better than simple “N -best” batch construction.
• In some domains, annotation costs are not (approximately) constant across
instances, and can vary considerably.
• The cost of annotating an instance may not be intrinsic, but may instead
vary based on the person doing the annotation.
[Figure 8: (a) bag: image = { instances: segments }; (b) bag: document = { instances: passages }]
instances are negative. A bag is labeled positive, however, if at least one of its
instances is positive (note that positive bags may also contain negative instances).
The MI setting was formalized by Dietterich et al. (1997) in the context of drug
activity prediction, and has since been applied to a wide variety of tasks including
content-based image retrieval (Maron and Lozano-Perez, 1998; Andrews et al.,
2003; Rahmani and Goldman, 2006) and text classification (Andrews et al., 2003;
Ray and Craven, 2005).
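The bag-labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the standard MI assumption, not code from any of the cited papers; the function name `bag_label` and the toy bags are hypothetical.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[bool]) -> bool:
    """Return the bag label under the standard MI assumption.

    A bag is positive if at least one of its instances is positive,
    and negative only if every instance is negative.
    """
    return any(instance_labels)


# Hypothetical toy bags: each list holds the (normally hidden) instance labels.
image_bag = [False, False, True, False]   # one segment contains the object
document_bag = [False, False, False]      # no passage matches the concept

print(bag_label(image_bag))      # True
print(bag_label(document_bag))   # False
```

In practice the instance labels are usually unobserved and only the bag label is given, which is why a positive bag says little about any single instance inside it.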
Figure 8 illustrates how the MI representation can be applied to (a) content-
based image retrieval (CBIR) and to (b) text classification. For the CBIR task,
images are represented as bags and instances correspond to segmented regions of
the image. A bag representing a given image is labeled positive if the image con-
tains some object of interest. The MI paradigm is well suited to this task because
only a few regions of an image may represent the object of interest, such as the
gold medal in Figure 8(a). An advantage of the MI representation here is that it is
significantly easier to label an entire image than it is to label each segment, or even
a subset of the image segments. For the text classification task, documents can be
represented as bags and instances correspond to short passages (e.g., paragraphs)
that comprise each document. The MI representation is compelling for classifi-
cation tasks for which document labels are freely available or cheaply obtained
(e.g., from online indexes and databases), but the target concept is represented by
only a few passages.
For MI learning tasks such as these, it is possible to obtain labels both at the
bag level and directly at the instance level. Fully labeling all instances, however,
is expensive. Often the rationale for formulating the learning task as an MI prob-
lem is that it allows us to take advantage of coarse labelings that may be available
at low cost, or even for free. In MI active learning, however, the learner is some-
times allowed to query for labels at a finer granularity than the target concept,
e.g., querying passages rather than entire documents, or segmented image regions
rather than entire images. Settles et al. (2008b) focus on this type of active learn-
ing with a generalization of logistic regression. Vijayanarasimhan and Grauman
(2009) have extended the idea to SVMs for the image retrieval task, and also ex-
plore an approach that interleaves queries at varying levels of granularity.
Raghavan et al. (2006) have proposed a related idea for traditional classification
problems called tandem learning, in which the learner is allowed to query
for the labels of features as well as entire instances. They report not only that
interleaving document-level and word-level queries is very effective for a text
classification problem, but also that words (features) are often much easier for
human annotators to label in user studies.
the process repeats. A complementary technique in active learning is uncertainty
sampling (see Section 3.1), where the instances about which the model is least
confident are selected for querying.
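As a minimal sketch of this selection rule, the following Python snippet picks the instance whose most probable label has the lowest posterior probability (the "least confident" variant of uncertainty sampling). The function name `least_confident` and the posterior values are hypothetical; a real learner would obtain the posteriors from its current model.

```python
from typing import Dict, List


def least_confident(posteriors: List[Dict[str, float]]) -> int:
    """Return the index of the instance whose top predicted label
    has the lowest probability (least-confidence uncertainty sampling)."""
    confidences = [max(p.values()) for p in posteriors]
    return min(range(len(posteriors)), key=lambda i: confidences[i])


# Hypothetical model posteriors over two labels for three unlabeled instances.
pool = [
    {"pos": 0.95, "neg": 0.05},
    {"pos": 0.55, "neg": 0.45},
    {"pos": 0.20, "neg": 0.80},
]
query_index = least_confident(pool)
print(query_index)  # 1
```

The second instance is selected because its best label has probability 0.55, the lowest top-label confidence in the pool.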
Similarly, multi-view learning (de Sa, 1994) and co-training (Blum and Mitchell,
1998) use ensemble methods for semi-supervised learning. Initially, separate
models are trained with the labeled data (usually using separate, conditionally
independent feature sets), which then classify the unlabeled data, and “teach” the
other models with a few unlabeled examples (with predicted labels) about which
they are most confident. This helps to reduce the size of the version space, i.e.,
the models must agree on the unlabeled data as well as the labeled data. Query-
by-committee (see Section 3.2) is an active learning complement here, as the
committee represents different parts of the version space, and is used to query the
unlabeled instances about which its members do not agree.
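To make the notion of committee disagreement concrete, here is a small Python sketch that scores each unlabeled instance by the entropy of the committee's votes and queries the most disputed one. Vote entropy is used here as one common disagreement measure, which is an assumption of this example; the names `vote_entropy` and `most_disputed` and the toy votes are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def vote_entropy(votes: Sequence[str]) -> float:
    """Entropy of the committee's label votes for one instance.

    Zero means the committee is unanimous; larger values mean more disagreement.
    """
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_disputed(committee_votes: List[Sequence[str]]) -> int:
    """Index of the unlabeled instance the committee disagrees on most."""
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))


# Hypothetical votes from a three-member committee on three instances.
votes = [
    ["pos", "pos", "pos"],   # unanimous
    ["pos", "neg", "pos"],   # split 2-1
    ["neg", "neg", "neg"],   # unanimous
]
print(most_disputed(votes))  # 1
```

Co-training would instead pass along the instances the models label most confidently, which is the opposite end of the same ranking.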
Through these illustrations, we begin to see that active learning and semi-
supervised learning attack the same problem from opposite directions. While
semi-supervised learning exploits what the learner thinks it already knows about
the unlabeled data, active learning attempts to explore the unknown aspects. It is
therefore natural to think about combining the two. Some example formulations
of semi-supervised active learning include McCallum and Nigam (1998), Muslea
et al. (2000), Zhu et al. (2003), Zhou et al. (2004), and Tur et al. (2005).
tain about the outcome, just as an active learner requests labels for instances it is
uncertain how to label. This is often called the “exploration-exploitation” trade-
off in the reinforcement learning literature. Furthermore, Mihalkova and Mooney
(2006) consider an explicitly active reinforcement learning approach that aims to
reduce the number of actions required to find an optimal policy.
6.5 Active Feature Acquisition and Classification
In some learning domains, instances may have incomplete feature descriptions.
For example, many data mining tasks in modern business are characterized by
naturally incomplete customer data, for reasons such as data ownership, client
disclosure, or technological limitations. Consider a credit card company that wishes
to model its most profitable customers; the company has access to data on client
transactions using its own cards, but no data on transactions using cards from
other companies. Here, the task of the model is to classify a customer using
incomplete purchase information as the feature set. Similarly, consider a learning
model used in medical diagnosis that has access to some patient symptom
information, but not to other symptoms that require complex or expensive procedures.
Here, the task of the model is to suggest a diagnosis using incomplete symptom
information as the feature set.
In these domains, active feature acquisition (Zheng and Padmanabhan, 2002;
Melville et al., 2004) seeks to alleviate these problems by allowing the learner
to request more complete feature information. The assumption is that some ad-
ditional features can be obtained at a cost, such as leasing transaction records
from other credit card companies, or running additional diagnostic procedures.
The goal in active feature acquisition is to select the most informative features
to obtain, rather than randomly or exhaustively acquiring all new features for all
training instances. The difference between this learning setting and typical active
learning is that these models request salient feature values rather than instance
labels. Similarly, work in active classification (Greiner et al., 2002) considers the
case in which features may be obtained during classification rather than training.
model classes), providing comprehensible, symbolic interpretations. Buciluǎ et al.
(2006) have adapted this idea to “compress” very large and computationally ex-
pensive model classes, such as complex ensembles, into smaller and more efficient
model classes, such as neural networks.
These approaches can be thought of as active learning methods where the oracle
is in fact another machine learning model (i.e., the one being parroted or
compressed) rather than, say, a human annotator. In both cases, the “oracle model” can
be trained using a small set of the available labeled data, and the “parrot model” is
allowed to query the oracle model for the labels of either (i) any unlabeled data that
is available, or (ii) new instances synthesized de novo. These two model parroting
and compression approaches correspond to the pool-based and membership query
scenarios for active learning, respectively.
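The pool-based variant can be illustrated with a toy Python sketch in which a small ensemble plays the role of the oracle model and a single threshold rule plays the parrot. Everything here is an assumption made for illustration: the one-dimensional inputs, the names `make_oracle` and `fit_parrot`, and the choice of threshold rules are not taken from the cited work.

```python
from typing import Callable, List


def make_oracle(thresholds: List[float]) -> Callable[[float], int]:
    """A stand-in 'oracle model': majority vote over simple threshold rules."""
    def oracle(x: float) -> int:
        votes = sum(1 for t in thresholds if x > t)
        return 1 if votes * 2 > len(thresholds) else 0
    return oracle


def fit_parrot(pool: List[float], oracle: Callable[[float], int]) -> float:
    """Fit a single-threshold 'parrot model' to the oracle's labels on the pool.

    Pool-based variant: the parrot only queries instances that already exist.
    """
    labels = [oracle(x) for x in pool]  # query the oracle model, not a human

    def agreement(t: float) -> int:
        # Number of pool instances on which threshold t matches the oracle.
        return sum(1 for x, y in zip(pool, labels) if (1 if x > t else 0) == y)

    return max(pool, key=agreement)  # best candidate threshold from the pool


oracle = make_oracle([0.2, 0.5, 0.8])        # hypothetical ensemble
pool = [0.1, 0.3, 0.45, 0.5, 0.6, 0.9]       # hypothetical unlabeled pool
threshold = fit_parrot(pool, oracle)
parrot = lambda x: 1 if x > threshold else 0
print(all(parrot(x) == oracle(x) for x in pool))  # True
```

A membership query variant would differ only in where the inputs come from: instead of iterating over a fixed pool, the parrot would generate new values of `x` and ask the oracle to label them.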
Acknowledgements
I would like to thank Mark Craven, Jerry Zhu, Jude Shavlik, David Page, Andrew
McCallum, Rong Jin, John Langford, Aron Culotta, Greg Druck, Steve Hanneke,
Robbie Haertel, Ashish Kapoor, Clare Monteleoni, and the other colleagues who
have discussed active learning with me, both online and in person.
References
N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging.
In Proceedings of the International Conference on Machine Learning (ICML),
pages 1–9. Morgan Kaufmann, 1998.
M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active
learning. In Proceedings of the Conference on Learning Theory (COLT), pages
45–56. Springer, 2008.
J. Baldridge and M. Osborne. Active learning and the total cost of annotation.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 9–16. ACL Press, 2004.
E.B. Baum and K. Lang. Query learning can work poorly when a human oracle
is used. In Proceedings of the IEEE International Joint Conference on Neural
Networks, 1992.
A.L. Berger, V.J. Della Pietra, and S.A. Della Pietra. A maximum entropy ap-
proach to natural language processing. Computational Linguistics, 22(1):39–
71, 1996.
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training.
In Proceedings of the Conference on Learning Theory (COLT), pages 92–100.
Morgan Kaufmann, 1998.
D. Cohn. Neural network exploration using optimal experiment design. In Ad-
vances in Neural Information Processing Systems (NIPS), volume 6, pages
679–686. Morgan Kaufmann, 1994.
D. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical models.
Journal of Artificial Intelligence Research, 4:129–145, 1996.
T. Dietterich, R. Lathrop, and T. Lozano-Perez. Solving the multiple-instance
problem with axis-parallel rectangles. Artificial Intelligence, 89:31–71, 1997.
Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the
query by committee algorithm. Machine Learning, 28:133–168, 1997.
S. Hanneke. A bound on the label complexity of agnostic active learning. In Pro-
ceedings of the International Conference on Machine Learning (ICML), pages
353–360. ACM Press, 2007.
A. Hauptmann, W. Lin, R. Yan, J. Yang, and M.Y. Chen. Extreme video retrieval:
joint maximization of human and computer performance. In Proceedings of the
ACM Workshop on Multimedia Image Retrieval, pages 385–394. ACM Press,
2006.
S.C.H. Hoi, R. Jin, and M.R. Lyu. Large-scale text categorization by batch mode
active learning. In Proceedings of the International Conference on the World
Wide Web, pages 633–642. ACM Press, 2006a.
S.C.H. Hoi, R. Jin, J. Zhu, and M.R. Lyu. Batch mode active learning and its
application to medical image classification. In Proceedings of the International
Conference on Machine Learning (ICML), pages 417–424. ACM Press, 2006b.
R.D. King, K.E. Whelan, F.M. Jones, P.G. Reiser, C.H. Bryant, S.H. Muggleton,
D.B. Kell, and S.G. Oliver. Functional genomic hypothesis generation and
experimentation by a robot scientist. Nature, 427(6971):247–252, 2004.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic
models for segmenting and labeling sequence data. In Proceedings of the In-
ternational Conference on Machine Learning (ICML), pages 282–289. Morgan
Kaufmann, 2001.
K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Inter-
national Conference on Machine Learning (ICML), pages 331–339. Morgan
Kaufmann, 1995.
D. Lewis and J. Catlett. Heterogeneous uncertainty sampling for supervised learn-
ing. In Proceedings of the International Conference on Machine Learning
(ICML), pages 148–156. Morgan Kaufmann, 1994.
D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In
Proceedings of the ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 3–12. ACM/Springer, 1994.
M. Lindenbaum, S. Markovitch, and D. Rusakov. Selective sampling for nearest
neighbor classifiers. Machine Learning, 54(2):125–152, 2004.
Y. Liu. Active learning with support vector machine applied to gene expression
data for cancer classification. Journal of Chemical Information and Computer
Sciences, 44:1936–1941, 2004.
R. Lomasky, C.E. Brodley, M. Aernecke, D. Walt, and M. Friedl. Active class
selection. In Proceedings of the European Conference on Machine Learning
(ECML), pages 640–647. Springer, 2007.
D. MacKay. Information-based objective functions for active data selection. Neu-
ral Computation, 4(4):590–604, 1992.
O. Maron and T. Lozano-Perez. A framework for multiple-instance learning. In
Advances in Neural Information Processing Systems (NIPS), volume 10, pages
570–576. MIT Press, 1998.
A. McCallum and K. Nigam. Employing EM in pool-based active learning for
text classification. In Proceedings of the International Conference on Machine
Learning (ICML), pages 359–367. Morgan Kaufmann, 1998.
P. Melville and R. Mooney. Diverse ensembles for active learning. In Proceedings
of the International Conference on Machine Learning (ICML), pages 584–591.
Morgan Kaufmann, 2004.
P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. Active feature-value
acquisition for classifier induction. In Proceedings of the IEEE Conference on
Data Mining (ICDM), pages 483–486. IEEE Press, 2004.
S. Ray and M. Craven. Supervised versus multiple instance learning: An empir-
ical comparison. In Proceedings of the International Conference on Machine
Learning (ICML), pages 697–704. ACM Press, 2005.
N. Roy and A. McCallum. Toward optimal active learning through sampling es-
timation of error reduction. In Proceedings of the International Conference on
Machine Learning (ICML), pages 441–448. Morgan Kaufmann, 2001.
T. Scheffer, C. Decomain, and S. Wrobel. Active hidden Markov models for infor-
mation extraction. In Proceedings of the International Conference on Advances
in Intelligent Data Analysis (CAIDA), pages 309–318. Springer-Verlag, 2001.
A.I. Schein and L.H. Ungar. Active learning for logistic regression: An evaluation.
Machine Learning, 68(3):235–265, 2007.
C.A. Thompson, M.E. Califf, and R.J. Mooney. Active learning for natural lan-
guage parsing and information extraction. In Proceedings of the International
Conference on Machine Learning (ICML), pages 406–414. Morgan Kaufmann,
1999.
S. Tong. Active Learning: Theory and Applications. PhD thesis, Stanford Uni-
versity, 2001.
S. Tong and E. Chang. Support vector machine active learning for image re-
trieval. In Proceedings of the ACM International Conference on Multimedia,
pages 107–118. ACM Press, 2001.
S. Tong and D. Koller. Support vector machine active learning with applications to
text classification. In Proceedings of the International Conference on Machine
Learning (ICML), pages 999–1006. Morgan Kaufmann, 2000.
G. Tur, D. Hakkani-Tür, and R.E. Schapire. Combining active and semi-
supervised learning for spoken language understanding. Speech Communica-
tion, 45(2):171–186, 2005.
L.G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):
1134–1142, 1984.
V.N. Vapnik and A. Chervonenkis. On the uniform convergence of relative fre-
quencies of events to their probabilities. Theory of Probability and Its Applica-
tions, 16:264–280, 1971.
S. Vijayanarasimhan and K. Grauman. Multi-level active prediction of useful im-
age annotations for recognition. In Advances in Neural Information Processing
Systems (NIPS), volume 21. MIT Press, 2009.
Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active
learning for relevance feedback. In Proceedings of the European Conference
on IR Research (ECIR), pages 246–257. Springer-Verlag, 2007.
R. Yan, J. Yang, and A. Hauptmann. Automatically labeling video data using
multi-class active learning. In Proceedings of the International Conference on
Computer Vision, pages 516–523. IEEE Press, 2003.
D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised meth-
ods. In Proceedings of the Association for Computational Linguistics (ACL),
pages 189–196. ACL Press, 1995.
H. Yu. SVM selective sampling for ranking with application to data retrieval.
In Proceedings of the International Conference on Knowledge Discovery and
Data Mining (KDD), pages 354–363. ACM Press, 2005.
C. Zhang and T. Chen. An active learning framework for content based informa-
tion retrieval. IEEE Transactions on Multimedia, 4(2):260–268, 2002.
T. Zhang and F.J. Oles. A probability analysis on the value of unlabeled data
for classification problems. In Proceedings of the International Conference on
Machine Learning (ICML), pages 1191–1198. Morgan Kaufmann, 2000.
Z.H. Zhou, K.J. Chen, and Y. Jiang. Exploiting unlabeled data in content-based
image retrieval. In Proceedings of the European Conference on Machine Learn-
ing (ECML), pages 425–435. Springer, 2004.