Review
A Survey on Active Learning: State-of-the-Art, Practical
Challenges and Research Directions
Alaa Tharwat * and Wolfram Schenck
Center for Applied Data Science Gütersloh (CfADS), FH Bielefeld-University of Applied Sciences,
33619 Bielefeld, Germany
* Correspondence: [email protected]
Abstract: Despite the availability and ease of collecting a large amount of free, unlabeled data, the
expensive and time-consuming labeling process is still an obstacle to labeling a sufficient amount of
training data, which is essential for building supervised learning models. Here, with low labeling
cost, the active learning (AL) technique could be a solution, whereby a few, high-quality data points
are queried by searching for the most informative and representative points within the instance space.
This strategy ensures high generalizability across the space and improves classification performance
on data we have never seen before. In this paper, we provide a survey of recent studies on active
learning in the context of classification. This survey starts with an introduction to the theoretical
background of the AL technique, AL scenarios, AL components supported with visual explanations,
and illustrative examples to explain how AL simply works and the benefits of using AL. In addition
to an overview of the query strategies for the classification scenarios, this survey provides a high-level
summary to explain various practical challenges with AL in real-world settings; it also explains how
AL can be combined with various research areas. Finally, the most commonly used AL software
packages and experimental evaluation metrics with AL are also discussed.
MSC: 68T05
query the most informative unlabeled points, AL has also been used to reduce laboratory
experiments by finding the most informative experiments in large biological networks [8].
Similarly, AL could be used in simulation models with a large number of parameters to
reduce the number of parameter combinations actually evaluated [9]. This means that AL
could be combined with other technologies to solve many problems. Therefore, in this
survey, one of our goals is to provide a comprehensive overview of active learning and
explain how and why it can be combined with other research directions. Moreover, instead
of using AL as a black box, in this paper, we provide a comprehensive and up-to-date
overview of various active learning techniques in the “classification framework”. Our goal
is to illustrate the theoretical background of AL by using new visualizations and illustrative
examples in a step-by-step approach to help beginners implement AL rather than just
use it as a black box. In addition, some survey papers introduced a taxonomy of AL
from only one perspective, whereas in this paper different taxonomies of query strategies
from different perspectives are presented. Furthermore, several practical challenges related
to AL in real-world environments are presented. This highlights a research gap where
different research questions could be presented as future research directions. Moreover, the
most commonly used AL software packages and experimental evaluation metrics using AL
are discussed. We have also added a new software package that contains all the illustrative
examples in this paper and some other additional examples. These clear, simple, and
well-explained software examples could be the starting point for implementing newer AL
versions in many applications. Furthermore, different applications of AL are also presented.
However, several reviews have already been published from various other perspectives,
with the goal of introducing the active learning technique and explaining how it
works in different applications. Some examples are as follows.
• The most important study in the field of active learning is the one presented by
Burr Settles in 2009 [3]. It alone has collected more than 6000 citations, which reflects its
importance. The paper explains AL scenarios, query strategies, the analysis of different
active learning techniques, some solutions to practical problems, and related research
areas. In addition, Burr Settles presents several studies that explain the active learning
technique from different perspectives such as [10,11].
• In [12], the authors present a comprehensive overview of the instance selection of
active learners. Here, the authors introduced a novel taxonomy of active learning
techniques, in which active learners were categorized, based on “how to select unla-
beled instances for labeling”, into (1) active learning based only on the uncertainty of
independent and identically distributed (IID) instances (we refer to this as information-
based query strategies as in Section 3.1), and (2) active learning by further taking into
account instance correlations (we refer to this as representation-based query strate-
gies as in Section 3.2). Different active learning algorithms from each category were
discussed including theoretical basics, different strengths/weaknesses, and practical
comparisons.
• Kumar et al. introduced a very elegant overview of AL for classification, regression, and
clustering techniques [13]. In that overview, the focus was on presenting different work
scenarios of the active learning technique with classification, regression, and clustering.
• In [14], from a theoretical perspective, the basic problem settings of active learning
and recent research trends were presented. In addition, Hanneke gave an overview of
the theoretical issues that arise when no assumptions are made about the
noise distribution [15].
• An experimental survey was presented in [16] to compare many active learners. The
goal was to show how to fairly compare different active learners. Indeed, the study
showed that using only one performance measure or one learning algorithm is not fair,
and that changing the algorithm or the performance metric may change the experimental
results and thus the conclusions. In another study, to compare the most well-known
active learners and investigate the relationship between classification algorithms
and active learning strategies, a large experimental study was performed by using
75 datasets, different learners (5NN, C4.5 decision tree, naive Bayes (NB), support
vector machines (SVMs) with radial basis function (RBF), and random forests (RFs)),
and different active learners [17].
• There are also many surveys on how AL is employed in different applications. For
example, in [18], a survey of active learning in multimedia annotation and retrieval
was introduced. The focus of this survey was on two application areas: image/video
annotation and content-based image retrieval. Sample selection strategies used in
multimedia annotation and retrieval were categorized into five criteria: risk reduction,
uncertainty, variety, density, and relevance. Moreover, different classification models
such as multilabel learning and multiple-instance learning were discussed. In the same
area, another recent small survey was also introduced in [19]. In a similar context,
in [20], a literature review of active learning in natural language processing and related
tasks such as information extraction, named entity recognition, text categorization,
part-of-speech tagging, parsing, and word sense disambiguation was presented. In
addition, in [21], an overview of some practical issues in using active learning in
some real-world applications was given. Mehdi Elahi et al. introduced a survey
of active learning in collaborative filtering recommender systems, where the active
learning technique is employed to obtain data that better reflect users’ preferences;
this enables the generation of better recommendations [22]. Another survey of AL for
supervised remote sensing image classification was introduced in [23]. This survey
covers only the main families of active learning algorithms that were used in the remote
sensing community. Some experiments were also conducted to show the performance
of some active learners that label uncertain pixels by using three challenging remote
sensing datasets for multispectral and hyperspectral classification. Another recent
survey that uses satellite-based Earth-observation missions for vegetation monitoring
was introduced in [24].
• A review of deep active learning, which is one of the most important and recent reviews,
has been presented in [25]. In this review, the main differences between classical AL
algorithms, which always work in low-dimensional space, and deep active learning
(DAL), which can be used in high-dimensional spaces, are discussed. Furthermore,
this review also explains the problems of DAL, such as (i) the requirement for a large
amount of training/labeled data, which is addressed, for example, by using pseudolabeled data
and generating new samples (i.e., data augmentation) by using generative adversarial
networks (GANs), (ii) the challenge of computing uncertainty compared to classical
ALs, and (iii) the processing pipeline of deep learning, because feature learning and
classifier training are jointly optimized in deep learning. In the same field, another
review of the DAL technique has been recently presented, and the goal is to explain
(i) the challenge of training DAL on small datasets and (ii) the inability of neural networks
to quantify reliable uncertainties on which the most commonly used query strategies are
based [26]. To this end, a taxonomy of query strategies, which distinguishes between data-
based, model-based, and prediction-based instance selection, was introduced besides
the investigation of the applicability of these classes in recent research studies. In a
related study, Qiang Hu et al. examined some practical limitations of AL with deep neural
networks [27].
The rest of the survey is organized as follows. In Section 2, we provide a theoretical
background on active learning including an analysis of the AL technique, illustrative exam-
ples to show how the AL technique works, AL scenarios, and AL components. Section 3
introduces an overview of the main query strategies and different taxonomies of AL.
Section 4 presents the main practical challenges of AL in real environments. Many
research areas are linked with AL, and Section 5 introduces some of these research
areas. Section 6 introduces some of the applications of AL. Section 7 introduces the most
well-known software packages of AL. Section 8 introduces the most widely used exper-
imental evaluation metrics that are utilized in research studies that use AL. Finally, we
conclude the survey in Section 9.
where n_l is the number of labeled points and R_emp(h) is the average loss over all training
samples. This is called the in-sample error or empirical risk because it is calculated by using
the empirical data taken as a sample rather than the whole data. After training a model,
the aim is to predict the outputs for new or unseen data. Among the generated hypotheses,
the best hypothesis is the one that minimizes the expected value of the loss over the whole
input space, and this is called risk or out-of-sample error (R), and it is defined as follows:
R(h) = E_out(h) = E_{(x,y)∼P(X,Y)} [L(y, h(x))].  (2)
Because the joint distribution P( X, Y ) is unknown (i.e., the test data set is un-
known/unlimited), the risk cannot be calculated accurately. Therefore, the goal is not
to minimize the risk but to minimize the gap (this is called the generalization gap) between
R_emp and R, which can be bounded as follows, as proved in [31]:

P[ |R(h) − R_emp(h)| > ε ] ≤ 2|H| e^{−2 n_l ε²},  (3)

where |H| is the size of the hypothesis space and ε is a small positive number. The right-hand side of
Equation (3) indicates that increasing the size of the hypothesis space (i.e., |H| → ∞) widens
the generalization gap even if the training error is low, whereas increasing the number of training
points improves the results by decreasing the generalization gap. In supervised learning,
because the test error for the data that we have never seen before cannot be calculated, the
hypothesis with the lowest empirical risk (h∗ ) is selected and considered the best hypothesis.
In this context, the question that arises is how the active learners with a small query
budget (i.e., a small number of labeled points) can achieve promising results (sometimes
better than the passive learners). The answer is that for passive learners, the training data
is randomly selected; therefore, there is a chance of finding many points at approximately
the same position within the space, and there are some other parts that are not yet covered.
In other words, the chance of covering the whole space is low (more details about the
problem of random generation and different generation methods are in [32]). This problem
may lead learning models to extrapolate (i.e., use a trained model to make predictions
for data that are outside (geometrically far away) from the training and validation set).
The AL strategy attempts to solve this problem by trying to cover a large portion of the
space by selecting and annotating a few highly informative and representative points that
cover a large portion of the space, especially uncertain regions. In [33], after a theoretical
analysis of the query-by-committee (QBC) algorithm and under a Bayesian assumption,
the authors found that a classifier with an error of less than η could be achieved after seeing
O(D/η) unlabeled points and requesting only O(D log(1/η)) labels, where D is the Vapnik–
Chervonenkis (VC) [34] dimension of the model space (more details are in [14]). In another
study, Dasgupta et al. reported that the standard perceptron update rule, which makes a poor
active learner in general, requires O(1/η²) labels as a lower bound [35].
Figure 1. Visualization of our active learning example. (a) The original data with three classes, each
with a different color. (b) The unlabeled data are in black color and the initial training/labeled data
are the highlighted points with pink circles. (c,d) The correctly and incorrectly classified points of
the trained model on three and five training points (i.e., initially and after querying two points),
respectively. (e) The classification accuracy during the annotation process.
Iteratively, in our example, a simple active learner is used to query one of the most
uncertain points; this active learner uses the entropy method [10]. As can be seen in
Figure 1d, after annotating two points, the accuracy increased from 52% to 88%. This is
because one of the newly annotated points belongs to the first class; hence, the current
training data includes the three (i.e., all) classes and as shown from the confusion matrix,
all points from the first class are correctly classified. Figure 1e shows the classification
accuracy during the annotation process, where each point represents the accuracy after
annotating a new point. Additionally, the confusion matrix is shown at some points
to illustrate the number of correctly classified points from each class. As shown, the
accuracy increased to 88% after annotating only two points, one of which belongs to the
first class. Furthermore, the accuracy continues to increase as more points are annotated,
and as shown, the accuracy is approximately stable after the sixth point. (The code of this
example is available at https://github.com/Eng-Alaa/AL_SurveyPaper/blob/main/AL_
Iris_SurveyPaper.py or https://github.com/Eng-Alaa/AL_SurveyPaper/blob/main/AL_
IrisData_SurveyPaper.ipynb [access date on 28 December 2022]).
This example shows how active learners simply search for highly informative points
to label them. This iteratively improves the quality of the labeled/trained data, and,
consequently, enhances the accuracy of the learner, which improves the generalizability of
the model on data it has never seen before.
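To make the loop described above concrete, the following is a minimal sketch of entropy-based uncertainty sampling on the Iris data with a random forest. It is not the code from the linked repository; the initialization (the first three pool points), the query budget, and the variable names are illustrative.

```python
# Minimal sketch of the entropy-based AL loop described above (illustrative,
# not the code from the linked repository).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

labeled_idx = list(range(3))                      # tiny initial labeled set
unlabeled_idx = list(range(3, len(X_pool)))       # the rest acts as the unlabeled pool

model = RandomForestClassifier(random_state=0)
for _ in range(10):                               # query budget of 10 points
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])
    probs = model.predict_proba(X_pool[unlabeled_idx])
    # Entropy of the predicted class distribution = informativeness of each point.
    scores = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    query = unlabeled_idx[int(np.argmax(scores))]
    labeled_idx.append(query)                     # the oracle reveals the label
    unlabeled_idx.remove(query)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```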
2.4. AL Scenarios
There are three main scenarios for ALs:
• In the membership query synthesis scenario, the active learner generates synthetic
instances in the space and then requests labels for them (see Figure 2). This scenario
is suitable for finite problem domains, and because no processing on unlabeled data
is required in this scenario, the learner can quickly generate query instances [3]. The
major limitation of this scenario is that it can artificially generate instances that are
impossible to reasonably label [36]. For example, some of the artificially generated
images for classifying handwritten characters contained no recognizable symbols [37].
• In the stream-based selective sampling scenario, the learning model decides whether to
annotate the unlabeled point based on its information content [4]. This scenario is also
referred to as sequential AL because the unlabeled data points are drawn iteratively,
one at a time. In many studies such as [38,39], the selective sampling scenario was
considered in a slightly different manner from the pool-based scenario (this scenario is
explained below) because, in both scenarios, the queries are performed by selecting a
set of instances sampled from a real data distribution, and the main difference between
them is that the first scenario (selective sampling) scans the data sequentially, whereas
the second scenario samples a large set of points (see Figure 2) [3]. This increases the
applicability of the stream-based scenario when memory and/or processing power is
limited, such as with mobile devices [3]. In practice, the data stream-based selective
sampling scenario may not be suitable in nonstationary data environments due to the
potential for data drift.
• The pool-based scenario is the most well-known scenario, in which a query strategy
is used to measure the informativeness of some/all instances in the large set/pool
of available unlabeled data to query some of them [40]. Figure 3 shows that there is
labeled data (DL ) for training a model (h) and a large pool of unlabeled data (DU ). The
trained model is used to evaluate the information content of some/all of the unlabeled
points in DU and ask the expert to label/annotate the most informative points. The
newly annotated points are added to the training data to further improve the model.
These steps show that this scenario is very computationally intensive, as it iteratively
evaluates many/all instances in the pool. This process continues until a termination
condition is met, such as reaching a certain number of queries (this is called query
budget) or when there are no clear improvements in the performance of the trained
model.
Figure 3. An illustrative example to show the steps of the pool-based active learning scenario.
In some studies, as in [41], the combination of the pool-based and the membership
query synthesis scenarios solved the problem of generating arbitrary points by finding the
nearest original neighbours to the ones that were generated synthetically.
2.5. AL Components
Any active learner (especially in the pool-based scenario) consists of four main components.
• Data: The first component is the data which consists of labeled and unlabeled data.
The unlabeled data (DU ) represents the pool from which a new point is selected, and
the labeled portion of the data (DL ) is used to train a model (h).
• Learning algorithm: The trained model (h) on DL is the second component and it is
used to evaluate the current annotation process and find the most uncertain parts
within the space for querying new points there.
• Query strategy: The third component is the query strategy (this is also called the
acquisition function [14]) which uses a specific utility function (u) for evaluating the
instances in DU for selecting and querying the most informative and representative
point(s) in DU . The active learners are classified in terms of the number of queries at a
time into one query and batch active learners.
– One query: Many studies assume that only one query is queried at a time, which
means that the learning models should be retrained every time a new sample
[Figure: a taxonomy of query strategies, with branches including information-based (e.g., SVM-based, expected model change), density-based, diversity-based, cluster-based, data-based, model-based, prediction-based, and meta-active learning.]
where x∗ is the least confident instance, ŷ = argmax_y Ph(y|x) is the class label of x with
the highest posterior probability using the model h, and Ph(y|x) is the conditional class
probability of the class y given the unlabeled point x. Hence, this method only considers
information about the most likely label(s) and neglects the information about the rest of
the distribution [40]. Therefore, Scheffer et al. introduced the margin sampling method,
which calculates the margin between the first and the second most probable class labels as
follows [43],
x∗ = argmin_{x∈DU} ( Ph(ŷ1|x) − Ph(ŷ2|x) ),  (5)
where ŷ1 and ŷ2 are the first and second most probable class labels, respectively, under the
model h. Instances with small margins are ambiguous, and hence asking about their labels
could enhance the model for discriminating between them. In other words, a small margin
means that it is difficult for the trained model (h) to differentiate between the two most
likely classes (e.g., overlapped classes). For large label sets, the margin sampling method
ignores the output distribution of the remaining classes. Here, the entropy method, which
takes all classes into account, could be used for measuring the uncertainty as follows,

x∗ = argmax_{x∈DU} − ∑_i Ph(yi|x) log Ph(yi|x),  (6)

where yi ranges over all possible class labels and Ph(yi|x) is the conditional class probability
of the class yi for the given unlabeled point x [44]. The instance with the largest entropy
value is queried. This means that the learners query the instance for which the model has
the highest output variance in its prediction.
For example, suppose we have two instances (x1 and x2 ) and three classes (A, B, and
C) and want to measure the informativeness of each point to select which one should
be queried. The posterior probabilities that x1 belongs to the classes A, B, and C are 0.9, 0.08,
and 0.02, respectively; similarly, for x2 the probabilities are 0.3, 0.6, and 0.1. With the LC
approach, the learner is fairly certain that x1 belongs to the class A with probability 0.9,
whereas x2 belongs to B with probability 0.6. Hence, the learner selects x2 to query its
actual label because it is the least confident. With the margin sampling method, the margin
between the two most probable class labels of x1 is 0.9 − 0.08 = 0.82 and the margin of x2
is 0.6 − 0.3 = 0.3. The small margin of x2 shows that it is more uncertain than x1 ; hence,
the learner queries the instance x2 . In the entropy sampling method, the entropy of x1
is calculated as −(0.9 log₂ 0.9 + 0.08 log₂ 0.08 + 0.02 log₂ 0.02) = 0.5412, and similarly the
entropy of x2 is 1.2955. Therefore, the learner selects x2 which has the maximum entropy.
Therefore, all three approaches query the same instance. However, in some cases, the
approaches query different instances. For example, changing the posterior probability
of x1 to 0.4, 0.4, 0.2, and of x2 to 0.26, 0.35, 0.39, the LC and entropy methods select
x2 whereas the margin approach selects x1 . A more detailed analysis of the differences
between these approaches shows that the LC and margin methods are more appropriate
when the objective is to reduce the classification error to achieve better discrimination
between classes, whereas the entropy method is more useful when the objective function is
to minimize the log-loss [3,44,45].
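The three scores in this worked example can be reproduced with a few lines of code; the helper functions below are illustrative and simply implement the definitions above.

```python
# Reproduces the worked example above for posteriors (0.9, 0.08, 0.02) and (0.3, 0.6, 0.1).
import numpy as np

def least_confident(p):            # smaller max posterior = more uncertain
    return 1.0 - np.max(p)

def margin(p):                     # smaller gap between the top two classes = more uncertain
    top2 = np.sort(p)[-2:]
    return top2[1] - top2[0]

def entropy_bits(p):               # larger base-2 entropy = more uncertain
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

x1, x2 = [0.9, 0.08, 0.02], [0.3, 0.6, 0.1]
print(least_confident(x1), least_confident(x2))  # 0.10 vs 0.40 -> query x2
print(margin(x1), margin(x2))                    # 0.82 vs 0.30 -> query x2
print(entropy_bits(x1), entropy_bits(x2))        # ~0.5412 vs ~1.2955 -> query x2
```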
The uncertainty approach could also be employed with nonprobabilistic classifiers,
such as (i) support vector machines (SVMs) [46] by querying instances near the decision
boundary, (ii) NN with probabilistic backpropagation (PBP) [47], and (iii) nearest-neighbour
classifier [48] by allowing each neighbour to vote on the class label of each unlabeled point,
and having the proportion of these votes represent the posterior probability.
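As a sketch of the SVM case, the snippet below queries the unlabeled instance closest to the decision boundary via the absolute value of the decision function; the synthetic data, the RBF kernel, and all names here are illustrative choices rather than the setup of [46].

```python
# Sketch of uncertainty sampling with a nonprobabilistic classifier: query the
# unlabeled instance closest to an SVM decision boundary.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [3, 3]], 10, axis=0)
y_lab = np.repeat([0, 1], 10)                  # two labeled clusters
X_unlab = rng.normal(size=(50, 2)) * 2 + 1.5   # unlabeled pool between the clusters

svm = SVC(kernel="rbf").fit(X_lab, y_lab)
# |decision_function| behaves like a distance to the boundary; the smallest
# absolute value marks the most uncertain (closest) point.
query_idx = int(np.argmin(np.abs(svm.decision_function(X_unlab))))
print("query point:", X_unlab[query_idx])
```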
Illustrative Example
The aim of this example is to explain in a step-by-step approach how active learn-
ing works (The code of this example is available at https://github.com/Eng-Alaa/AL_
SurveyPaper/blob/main/AL_NumericalExample.py and https://github.com/Eng-Alaa/
AL_SurveyPaper/blob/main/AL_NumericalExample.ipynb [access date on 28 December
2022]). In this example, there are three training/labeled data points, each with a different
color and belonging to a different class, as shown in Figure 6a. Moreover, there are 10
unlabeled data points in black color. The initial labeled points are used for training a
learning model (in this example, we used the RF algorithm). Then, the trained model is
used to predict the unlabeled points. As shown in Figure 6b, most of the unlabeled points
were classified to the green class. In addition to the predictions, the learning algorithm also
provides the class probabilities for each point. For example, the class probabilities of the
point x1 are 0.1, 0.8, and 0.1, which means that the probability that x1 belongs to the red,
green, and blue classes are 0.1, 0.8, and 0.1, respectively. Consequently, x1 belongs to the
green class, which has the maximum class probability. Similarly, the class probabilities of
all unlabeled points were calculated. From these class probabilities, the four highlighted
points were identified as the most uncertain points by using the entropy method, and the
active learner was asked to query one of these points. As shown, all the uncertain points
lie between two classes (i.e., within the uncertain regions). In our example, we queried
the point x2 as shown in Figure 6c. After adding this new annotated point to the labeled
data and retraining the model, the predictions of the unlabeled points did not change (this
is not always the case), but the class probabilities did change as shown in Figure 6d. As
shown, after annotating a point from the red class, some of the nearby unlabeled points are
affected, which is evident from the class probabilities of the points x1 , x3 , and x6 , whose
class probabilities have changed (compare between Figure 6b and Figure 6d). Finally,
according to the class probabilities in Figure 6d, our active learner will annotate the point
x9 . This process continues until a stopping condition is satisfied.
Figure 6. Illustration of the steps of how active learning queries points. (a) Labeled points are colored
while the black squares represent the unlabeled points. After training a model on the initial labeled
data in (a), the trained model is used to predict the class labels of the unlabeled points. (b) The
predictions and the class probabilities of the unlabeled points and the most uncertain points. (c) One
of the most uncertain points (x2 ) is queried and added to the labeled data. (d) The predictions
and class probabilities of the trained model on the newly labeled points (i.e., after adding the new
annotated point).
hypotheses have been trained and agree on DL (i.e., both classify the labeled points perfectly;
these are called consistent hypotheses) but disagree on some unlabeled points, those points
lie within the uncertainty region. Hence, finding this region is expensive, especially if
it should be maintained after each new query. One famous example of this approach
is the committee-by-boosting and the committee-by-bagging techniques, which employ
well-known boosting and bagging learning methods for constructing committees [50].
The goal of active learners is to constrain the size of the version space given a few
labeled points. This could be done by using the QBC approach by querying controversial
regions within the input space. However, there is not yet agreement on the appropriate
committee size, but a small committee size has produced acceptable results [49]. For
example, in [4], the committee consists of only two neural networks, and it obtained
promising results.
Figure 7 shows an example explaining the version space. As shown, with two classes,
there are three hypotheses (hi , h j , and hk ), where hi ∈ H is the most general hypothesis and
hk ∈ H is the most specific one. Both hypotheses (hi and hk ) and all the hypotheses between
them including h j are consistent with the labeled data (i.e., the version space consists of
the two hypotheses (hi and hk ) and all the hypotheses between them). Mathematically,
given a set of hypotheses hi ∈ H, i = 1, 2, . . ., the version space is defined as VS_{H,DL} =
{h ∈ H : h(xi) = yi, ∀xi ∈ DL}. Furthermore, as shown, the four points A, B, C, and
D do not have the same degree of uncertainty, where A and D are certain (because all
hypotheses agree on them, i.e., hi , h j , and hk classify them identically), whereas B and C are
uncertain with different levels of uncertainty. As shown, h j and hk classify C to the red class,
whereas hi classifies the same point to the blue class. Therefore, there is a disagreement on
classifying the point C. The question here is, how do we measure the disagreement among
the committee members?
Input space
Figure 7. An illustrative example for explaining the version space. hi , h j , and hk are consistent with
D L (the colored points), where hi is the most general hypothesis and hk is the most specific one. The
points (A and D) are certain (i.e., all hypotheses agree on them), whereas the points B and C are
uncertain with different uncertainty levels (e.g., hk classifies B to the red class, whereas hi classifies B
to the blue class).
x∗ = argmax_x − ∑_i (V(yi)/m) log (V(yi)/m),  (7)

where yi ranges over all possible labels, m indicates the number of classifiers (i.e., the number of commit-
tee members), and V(yi) represents the number of votes that a label receives from the predic-
tions of all classifiers. For example, consider three classes (ω1, ω2, ω3), a committee of m = 15
members, and two instances x1 and x2. For x1, let the votes be as follows: V(y1 = ω1) = 12, V(y1 = ω2) = 1, and
V(y1 = ω3) = 2. Hence, the vote entropy of x1 is −((12/15) log(12/15) + (1/15) log(1/15) + (2/15) log(2/15)) = 0.5714.
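The vote entropy of x1 can be checked numerically. Note that reproducing the reported value of 0.5714 requires normalizing the natural-log entropy by the logarithm of the number of classes (an assumption on our part about the convention used); an unnormalized logarithm only rescales the scores and does not change the ranking of the instances.

```python
# Numerical check of the vote entropy of x1 with votes (12, 1, 2) and m = 15.
import numpy as np

def vote_entropy(votes, m, normalize=True):
    p = np.asarray(votes, dtype=float) / m      # V(yi)/m for every label
    p = p[p > 0]                                # labels with zero votes contribute nothing
    h = -np.sum(p * np.log(p))
    return h / np.log(len(votes)) if normalize else h

print(vote_entropy([12, 1, 2], m=15))           # ~0.5714
```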
where Ph+⟨x∗,y∗⟩ is the new model after retraining it with DL ∪ ⟨x∗, y∗⟩. Therefore, a valida-
tion set is required in this category for evaluating the performance of the learned hypotheses.
Initially, an initial hypothesis is trained on the available labeled data. Next, the trained
model selects a point from the unlabeled pool, labels it, and then adds it to the labeled data.
After that, the hypothesis is retrained by using the updated set of labeled data. This process
is repeated by assigning the selected point to all possible classes to calculate the average
expected loss. This active learning strategy was employed in [59] for text classification.
However, because this strategy iteratively retrains the model after labeling each new point,
it requires a high computational cost. Moreover, calculating the future error over DU for
each query dramatically increases the computational costs.
Another variant of this strategy is the variance reduction method [3]. In this method,
active learners query points that minimize the model’s variance, which consequently
minimizes the future generalization error of the model. This method is considered a variant
from the expected error reduction because minimizing the expected error can be interpreted
as a reduction of the output variance.
determine the value of an action in a certain state, and to take a decision on whether to label
this unlabeled point. In another example in [76], the deep reinforcement learning technique
was employed for designing the acquisition function that is updated dynamically with the
input distribution. Recently, the problem of finding the optimal query is closely related to
the bandit problem, and in [77–79], the acquisition function was designed as a multi-armed
bandit problem.
the posterior probability of some classes, which changes the class labels of some historical
labeled data points; hence, these points become noisy-labeled.
One of the trivial solutions for handling the noisy data problem is to relabel these
noisy points again by asking many weak labelers (nonexperts or noisy experts) who might
return noisy labels as in [81–83]. This relies on the redundancy of queried labels of noisy
labeled points from multiple annotators, which certainly increases the labeling cost. For
example, for an expert, if the probability to annotate some points incorrectly is 10%, with
two annotators, this drops to 0.1 × 0.1 = 0.01 = 1%, which is better and may be sufficient
in some applications. However, repeatedly asking experts for labeling some instances
over multiple rounds could be an expensive and impractical solution, especially if the
labelers should be experts, such as in medical image labeling, or if the labeling process is
complicated [84]. The noisy labeled data problem could also be solved by modelling the
expert’s knowledge and asking the expert to label an instance if it belongs to his knowledge
domain [85]. If the expert is uncertain about the annotations of some instances, the active
learner can accept or reject the labels [86]. However, for real challenges such as concept
drift, imbalanced data, and streaming data, it may be difficult to characterize the uncertain
knowledge of each expert. There are many reasons for this, e.g., each expert’s uncertain
domain may change due to drift. In [87], with the aim of cleaning the data, the QActor
model uses a novel measure CENT, which considers both the cross-entropy and the entropy
measures to query informative and noisy labeled points.
There are still many open research questions related to noisy labeled data [82]. For example:
RQ1: What happens if there are no experts who know the ground truth?
RQ2: How might the active learner deal with experts whose quality fluctuates
over time (e.g., at the end of a long annotation task)?
As far as we know, the authors in [69,70] have introduced active learners that adapt
themselves to imbalanced and balanced data without predefined knowledge, and they
have achieved promising results.
However, most of the current active learners assume that initial knowledge is avail-
able. For example, active learners in [41,95,96,102] require initial labeled points, and the
models in [41,96,102–104] were initialized with the number of classes, and some of them
only handle binary classification data. In addition, some active learners only work under
the condition that they have initial labeled points from all classes. For example, the initial
training data in [105] should contain 15 instances from each class, and even if the data
is expected to be imbalanced, the initial training data should also contain points from
the minority classes [41,102]. However, some recent studies have taken this problem into
account and introduced novel active learners that do not require prior knowledge [69,70].
Figure 8. Example of the concept drift phenomenon. The left panel shows the original data before
the drift, which is in gray, and it consists of three classes; each has a different shape. After two types
of drift (in a,b), the new data points are colored, and the data distributions are changed. (a) Virtual
drift: the changes are in the distribution of the data points (P(t)(X) ≠ P(t+1)(X)). (b) Real drift: The
changes are in the posterior probability (P(t)(Y|X) ≠ P(t+1)(Y|X)); therefore, the class labels of some
historical points (the two highlighted instances with blue dashed circles) are changed. The shaded
yellow area between the old decision boundaries (dashed lines) and the new decision boundaries
(solid lines) illustrates the changes in the decision boundaries before and after the drift.
There are many methods to detect the drift. The simplest method is to periodically
train a new model by using the most recently obtained data and replace the old model;
these methods are called blind methods. However, it is better to adjust the current model
than to discard it completely. Therefore, in some studies, the drift detection step has
been incorporated into active learners to monitor the received data and check if the data
distributions change with the data stream [107]. In [108], the adaptive window (ADWIN)
method compares the mean values of two subwindows; drift is detected when these
subwindows differ significantly enough. The drift can also be detected from the results
of the learning models [109], such as the online error rate, or even from the parameters of
the learning models [110]. After detecting the drift, some of the historical data should be
removed, whereas the rest is kept for revising the current learning model; if a remarkable
change is detected, the current learning model should be adapted to the current data,
for example, by retraining it on the new data. In some studies, adaptive
ensemble ML algorithms are used to deal with the concept drift by adding/removing
ensemble ML algorithms are used to deal with the concept deviation by adding/removing
some weak classifiers. For example, in [111], the dynamically weighted majority (DWM)
model reduces the weight of a weak learner that misclassifies an instance and removes
the weak learners whose weights are below a predefined threshold. In another example,
in [89], the weak learners are weighted according to the prediction error rates of the latest
streaming data, and the weak learners with low error rates are replaced with new ones.
However, detecting drift in real environments and adjusting the model during drift is
still an open question (RQ9), especially in real environments that present other practical
problems.
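As a rough sketch of the windowing idea behind such detectors, the snippet below flags drift when the means of an older and a newer sub-window differ by more than a fixed threshold; the fixed split and threshold are illustrative, whereas the actual ADWIN bound is adaptive.

```python
# Much-simplified sketch of the windowing idea behind ADWIN-style detectors.
import numpy as np

def drift_detected(stream, split, threshold=0.5):
    # Compare the means of the older and the newer sub-window.
    old, new = np.asarray(stream[:split]), np.asarray(stream[split:])
    return abs(old.mean() - new.mean()) > threshold

rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0.0, 1.0, 500),    # data before the drift
                         rng.normal(2.0, 1.0, 500)])   # data after the drift
print(drift_detected(stream, split=500))               # True: the mean has shifted
```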
many methods to quantify uncertainty and some of them are mainly based on ML models,
the following question arises:
RQ10: In which way can we quantify uncertainty to obtain an indicator of the termina-
tion condition of AL?
be tuned. All these problems related to ML-based active learners motivate us to ask the
following research question:
RQ13: Can AL find uncertain regions without using ML models?
model was introduced, in which the prototypical networks (PN) are used for producing
clustered data in the embedding space, but the initial prototypes are estimated by using
the labeled data. Next, one of the clustering algorithms such as K-means is then performed
on the embeddings of both labeled and unlabeled data. AL was employed for reducing the
errors due to the incorrect labeling of the clusters [139,140]. In [75], reinforcement learning
and one-shot learning techniques are combined to allow the model to decide which data
points are worth labeling during classification. AL was also combined with zero-shot learning:
without using annotated target data, zero-shot learning exploits the relation
between the source task and the target one to predict the label distribution of the unlabeled
target data [141]. The obtained results act as prior knowledge for AL.
of fitness evaluations in dynamic job shop scheduling by using the genetic algorithm
(GA) [147].
From a different perspective, some optimization algorithms are used for finding the
most informative points in AL. For example, in [148], PSO was used to select from massive
amounts of unlabeled medical instances those considered informative. Similarly, in [149],
the uncertainty-based strategy was formulated as an objective function and PSO was used
for finding the optimal solutions, which represent the most informative points within the
instance space.
Figure 9. Visualization of how AL is combined with optimization algorithms. (a) Using only four
fitness evaluations (i.e., four initial points), a surrogate model ( f̂ ) is built to approximate the original
fitness function ( f ). (b) Some initial points (x1, x2, . . . , xn) will be evaluated by using the original
fitness function ( f ), and these initial points with their fitness values ({(x1, y1), . . . , (xn, yn)}) will be
used for training a surrogate model ( f̂ ), which helps to find better solutions.
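The following is a minimal sketch of the surrogate idea in Figure 9: a few points are evaluated with the (expensive) fitness function f, a cheap surrogate f̂ is fitted to them, and the surrogate screens candidate solutions before further real evaluations. The toy fitness function and the Gaussian-process surrogate are illustrative choices, not the models used in the cited studies.

```python
# Sketch of surrogate-assisted evaluation: fit a cheap model f_hat on a few
# real evaluations of f and use it to propose the next point to evaluate.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                                    # stands in for an expensive fitness function
    return np.sin(3 * x) + 0.5 * x

X_init = np.array([[0.0], [1.0], [2.0], [3.0]])        # four real evaluations
y_init = f(X_init).ravel()

surrogate = GaussianProcessRegressor().fit(X_init, y_init)   # f_hat
candidates = np.linspace(0.0, 3.0, 200).reshape(-1, 1)
mean, std = surrogate.predict(candidates, return_std=True)
best = candidates[int(np.argmin(mean))]      # promising point according to the surrogate
print("next point to evaluate with f:", best)
```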
scores of all tasks are estimated, and the point will be queried based on the combination of
these scores. In another study, based on the adaptive fixed interaction matrix of tasks used
to derive update rules for all tasks, the informativeness of newly arrived instances across
all tasks could be estimated to query the labels of the most informative instances [169].
6. Applications of AL
The active learning technique is widely used in many applications. Table 2 illustrates
the applications of some recent references including some details about (i) the dataset (e.g.,
number of classes, number of dimensions, data size, whether the data are balanced or
unbalanced) and (ii) the active learner (e.g., initial labeled data, query budget, and stopping
condition).
• In the field of natural language processing (NLP), AL has been used in the categoriza-
tion of texts to find out which class each text belongs to as in [36,40,46]. Moreover,
AL has been employed in named-entity recognition (NER): given an unstructured
text (the entity), NER is the process of identifying a word or phrase in that entity and
classifying it as belonging to a particular class (the entity type) [172,173]. AL is thus
used here to reduce the required annotation cost while maximizing the performance
of ML-based models [174]. In sentiment analysis, AL was employed for classifying
the given text as positive or negative [175,176]. AL was also utilized in information
extraction to extract some valuable information [177].
• AL has been employed in the image and video-related applications, for example,
image classification [123,178]. In image segmentation, AL is used, for example, to find
highly informative images and reduce the diversity in the training set [179,180]. For
example, in [181], AL improved the results with only 22.69% of the available data. AL
has been used for object detection and localization to detect objects [182,183]. This was
clear in a recent study that introduced two metrics for quantifying the informativeness
of an object hypothesis, allowing AL to be used to reduce the amount of annotated
data to 25% of the available data and produce promising results [184]. Major
challenges in remote sensing image classification include the complexity of the problem,
limited funding in some cases, and high intraclass variance. These challenges can
cause a learning model to fail if it is trained with a suboptimal dataset [23,185]. In this
context, AL is used to rank the unlabeled pixels according to the uncertainty of their
class membership and query the most uncertain pixels. In video annotation, AL could
be employed to select which frames a user should annotate to obtain highly accurate
tracks with minimal user effort [18,186]. In human activity recognition, the real
environment depends on humans, so collecting and labeling data in a nonstationary
environment is likely to be very expensive and unreliable. Therefore, AL could help
here to reduce the required amount of labeled data by annotating novel activities and
ignoring obsolete ones [187,188].
• In medical applications, AL plays a role in finding optimal solutions of many problems.
For example, AL has been used for compound selection to help in the formation of tar-
get compounds in drug discovery [189]. Moreover, AL has been used for the selection
of protein pairs that could interact (i.e., protein–protein interaction prediction) [190],
for predicting the protein structure [191,192], and clinical annotation [193].
• In agriculture, AL has been used to select high-quality samples to develop efficient
and intelligent ML systems as in [194,195]. AL has also been used for semantic
segmentation of crops and weeds for agricultural robots as in [196]. Furthermore, AL
was applied for detecting objects in various agricultural studies [197].
• In industry, AL has been employed to handle many problems. Trivially, it is used to
reduce the labeling cost in ML-based problems by querying only informative unlabeled
data. For example, in [198], a cost-sensitive active learner has been used to detect
faults. In another direction, AL is used for quantifying the uncertainties to build cheap,
fast, and accurate surrogate models [199,200]. In data acquisition, AL is used to build
active inspection models that select some products in uncertain regions for further
investigation with advanced inspections [144].
Table 2. Details of the dataset and the active learner in some recent AL applications.

Ref   | C | d         | N         | Initial Data | Balanced Data | Query Budget | Stopping Cond. | Application
[118] | 2 | 4–500     | >5000     | √            | I             | ≈50%         | Q+U            | General
[201] | 2 | 13        | 487       | √ (5%)       | B             | 120          | Q              | Medical
[202] | M | >10,000   | >10,000   | √            | I             | −            | −              | Material Science
[203] | M | ≈30       | <1000     | √ (2/c)      | I             | −            | −              | General
[204] | M | 48 × 48   | 35,886    | √            | I             | −            | −              | Multimedia
[205] | 2 | 20        | 3168      | √ (≈2%)      | B             | −            | −              | Acoustical signals
[206] | M | 176       | >50,000   | √ (≈40)      | I             | 234–288      | Q              | Remote sensing
[207] | M | 352 × 320 | 617,775   | √            | B             | ≈40%         | C.P            | Medical
[208] | M | 7310      | −         | −            | B             | 17.8%        | C.P            | Text Classification
[209] | M | 41        | 9159      | −            | I             | −            | C.P            | Network Traffic Classification
[210] | M | H         | >10,000   | √ (2%)       | I             | 50%          | Q              | Text Classification
[211] | M | −         | 44,030    | −            | I             | 2000         | Q              | Text Classification
[70]  | M | <13       | <625      | x            | I             | 5%           | Q              | General
[69]  | 2 | <9        | <600      | x            | I             | 30           | Q              | General
[212] | M | H         | 610 × 610 | √ (250)      | I             | 100/B        | Q              | Remote sensing
[213] | M | −         | 2008      | √ (3/c)      | I             | 1750         | Q              | Medical
[96]  | M | 5–54      | 57k–830k  | √ (500)      | I             | 20%          | Q              | General
[214] | M | H         | 1.2M      | √ (20%)      | I             | 40%          | Q              | Image classification and segmentation

C: number of classes; M, multiple classes; 2, two classes. d: number of dimensions; H, high-dimensional data. N:
data size (i.e., number of data points). Initial Data: n/c, n initial labeled points for each class; √, there is initial data;
x, no initial data. Balanced Data: B, balanced data; I, imbalanced data. Query Budget: n/B, maximum number
of labeled points is n for each batch. Stopping Condition: Q, query budget; U, uncertainty; C.P, classification
performance; −, unavailable information.
7. AL Packages/Software
There are many implementations for the active learning technique and most of them
use Python, but the most well-known packages are the following.
• A modular active learning framework for Python (modAL) (https://modal-python.
readthedocs.io/en/latest/, https://github.com/modAL-python/modAL [access date
on 28 December 2022]) is a small package that implements the most common sampling
methods, such as the least confident method, the margin sampling method, and the
entropy-based method. This package is easy to use and employs simple Python
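A minimal sketch of an AL loop using these sampling methods through modAL is given below; the dataset, the initial labeled points, and the query budget are illustrative, and the calls reflect recent versions of the package.

```python
# Minimal sketch of an AL loop with modAL's entropy sampling (illustrative setup).
import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import entropy_sampling
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
initial_idx = np.array([0, 50, 100])                 # one labeled point per class
learner = ActiveLearner(estimator=RandomForestClassifier(),
                        query_strategy=entropy_sampling,
                        X_training=X[initial_idx], y_training=y[initial_idx])

X_pool, y_pool = np.delete(X, initial_idx, axis=0), np.delete(y, initial_idx)
for _ in range(10):                                  # query budget of 10
    query_idx, _ = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])   # oracle supplies the label
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx)
```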
with multiclass datasets, and it represents the mean of the diagonal of the confusion
matrix [141].
• For imbalanced data, sensitivity (or true positive rate (TPR), hit rate, or recall), speci-
ficity (true negative rate (TNR), or inverse recall), and geometric mean (GM) metrics
are used (a small computation sketch is given after this list). For example, the sensitivity
and specificity metrics were used in [69,96], and GM was also used in [96]. Moreover, the
false positive rate (FPR) was used in [96] when the data was imbalanced. In [70,219], with
multiclass imbalanced datasets, the authors counted the number of annotated points from
each minority class. This is referred to as the number of annotated points from the minority class (N_min). This
metric is useful and representative in showing how the active learner scans the mi-
nority class. As an extension of this metric, the authors in [70] counted the number
of annotated points from each class to show how the active learner scans all classes,
including the minority classes.
• Receiver operating characteristic (ROC) curve: This metric visually compares the
performance of different active learners, where the active learner that obtains the
largest area under the curve (AUC) is the best one [141]. This is suitable for binary
classification problems. For multiclass datasets with imbalanced data, the multiclass
area under the ROC curve (MAUC) is used [96,220]. This metric is an extension of the
ROC curve that is only applicable in the case of two classes. This is done by averaging
pairwise comparisons.
• In [70], the authors counted the number of runs in which the active learner failed to
query points from all classes, and they called this metric the number of failures (NoF).
This metric is more appropriate for multiclass data and imbalanced data to ensure
that the active learner scans the space and finds representative points from all classes.
• Computation time: This metric is very effective because some active learners require
high computational costs and therefore cannot query enough points in real time.
For example, the active learner in [69] requires high computational time even in
low-dimensional spaces.
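The following sketch illustrates the metrics listed above with toy predictions: sensitivity, specificity, and their geometric mean for a binary case, and the multiclass AUC (MAUC) computed by one-vs-one averaging; all arrays are illustrative.

```python
# Toy illustration of sensitivity, specificity, GM, and MAUC (one-vs-one averaging).
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate (recall)
specificity = tn / (tn + fp)                 # true negative rate
gm = np.sqrt(sensitivity * specificity)      # geometric mean of the two rates
print(sensitivity, specificity, gm)

# MAUC: pairwise (one-vs-one) averaging of the AUC from class probabilities.
y_true_mc = np.array([0, 0, 1, 1, 2, 2])
proba = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1],
                  [0.3, 0.5, 0.2], [0.1, 0.2, 0.7], [0.2, 0.3, 0.5]])
print(roc_auc_score(y_true_mc, proba, multi_class="ovo"))
```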
9. Conclusions
The active learning technique provides a solution for achieving high prediction accu-
racy with low labeling cost, effort, and time by searching and querying the most informative
and/or representative points from the available unlabeled points. Therefore, this is an
ever-growing area in machine learning research. In this review, the theoretical background
of AL is discussed, including the components of AL and illustrative examples to explain
the benefits of using AL. In addition, from different perspectives, an overview of the query
strategies for the classification scenarios is provided. A clear overview of various practical
challenges with AL in real-world environments and the combination between AL and vari-
ous research domains is also provided. In addition to discussing key practical challenges,
numerous research questions are also presented. As we introduced in Section 5, because
AL searches for the most informative and representative points, it was employed in many
research directions to find the optimal/best solution(s) in a short time. Table 3 shows how
AL is used in many research directions. Furthermore, an overview of AL software packages
and the most well-known evaluation metrics used in AL experiments is provided. A simple
software package for applying AL in classical ML and DL frameworks is also presented.
This package also contains the illustrative examples presented in this paper.
These examples and many more in the package are very simple and explained step by
step, so they can be considered as a cornerstone for implementing other active learners and
applying active learners in many applications.
Table 3. Comparison between different research directions and how AL could be used to assist each
of them.
Author Contributions: Introduced the research plan, A.T.; revision and summarization of research
studies, A.T.; writing—original draft preparation, A.T.; review and editing of different drafts of the
paper, A.T. and W.S.; supervision and project administration, W.S. All authors have read and agreed
to the published version of the manuscript.
Funding: This research was conducted within the framework of the project “SAIL: SustAInable
Lifecycle of Intelligent Socio-Technical Systems”. SAIL is receiving funding from the programme
“Netzwerke 2021” (grant number NW21-059), an initiative of the Ministry of Culture and Science of
the State of Northrhine Westphalia. The sole responsibility for the content of this publication lies with
the authors.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Mitchell, T. Machine Learning; McGraw-Hill: New York, NY, USA, 1997.
2. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv.
(CSUR) 2020, 53, 1–34.
3. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report; Department of Computer Sciences, University
of Wisconsin-Madison: Madison, WI, USA, 2009; p. 1648.
4. Cohn, D.; Atlas, L.; Ladner, R. Improving generalization with active learning. Mach. Learn. 1994, 15, 201–221. [CrossRef]
5. Wang, M.; Fu, K.; Min, F.; Jia, X. Active learning through label error statistical methods. Knowl.-Based Syst. 2020, 189, 105140.
[CrossRef]
6. Krawczyk, B. Active and adaptive ensemble learning for online activity recognition from data streams. Knowl.-Based Syst. 2017,
138, 69–78. [CrossRef]
7. Wang, H.; Jin, Y.; Doherty, J. Committee-based active learning for surrogate-assisted particle swarm optimization of expensive
problems. IEEE Trans. Cybern. 2017, 47, 2664–2677. [CrossRef]
8. Sverchkov, Y.; Craven, M. A review of active learning approaches to experimental design for uncovering biological networks.
PLoS Comput. Biol. 2017, 13, e1005466. [CrossRef]
9. Cevik, M.; Ergun, M.A.; Stout, N.K.; Trentham-Dietz, A.; Craven, M.; Alagoz, O. Using active learning for speeding up calibration
in simulation models. Med. Decis. Mak. 2016, 36, 581–593. [CrossRef]
10. Settles, B. Curious Machines: Active Learning with Structured Instances. Ph.D. Thesis, University of Wisconsin-Madison,
Madison, WI, USA, 2008.
11. Settles, B. From theories to queries: Active learning in practice. In Proceedings of the Active Learning and Experimental Design
Workshop in Conjunction with AISTATS 2010. JMLR Workshop and Conference Proceedings, Sardinia, Italy, 16 May 2011;
pp. 1–18.
12. Fu, Y.; Zhu, X.; Li, B. A survey on instance selection for active learning. Knowl. Inf. Syst. 2013, 35, 249–283. [CrossRef]
13. Kumar, P.; Gupta, A. Active learning query strategies for classification, regression, and clustering: A survey. J. Comput. Sci.
Technol. 2020, 35, 913–945.
14. Hino, H. Active learning: Problem settings and recent developments. arXiv 2020, arXiv:2012.04225.
15. Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on
Machine Learning, Corvalis, OR, USA, 20–24 June 2007; pp. 353–360.
16. Ramirez-Loaiza, M.E.; Sharma, M.; Kumar, G.; Bilgic, M. Active learning: An empirical study of common baselines. Data Min.
Knowl. Discov. 2017, 31, 287–313. [CrossRef]
17. Pereira-Santos, D.; Prudêncio, R.B.C.; de Carvalho, A.C. Empirical investigation of active learning strategies. Neurocomputing
2019, 326, 15–27. [CrossRef]
18. Wang, M.; Hua, X.S. Active learning in multimedia annotation and retrieval: A survey. Acm Trans. Intell. Syst. Technol. (TIST)
2011, 2, 1–21. [CrossRef]
19. Xu, Y.; Sun, F.; Zhang, X. Literature survey of active learning in multimedia annotation and retrieval. In Proceedings of the Fifth
International Conference on Internet Multimedia Computing and Service, Huangshan, China, 17–18 August 2013; pp. 237–242.
20. Olsson, F. A Literature Survey of Active Machine Learning in the Context of Natural Language Processing, SICS Technical Report
T2009:06 -ISSN: 1100-3154. 2009. Available online: https://www.researchgate.net/publication/228682097_A_literature_survey_
of_active_machine_learning_in_the_context_of_natural_language_processing (accessed on 15 December 2022).
21. Lowell, D.; Lipton, Z.C.; Wallace, B.C. Practical obstacles to deploying active learning. arXiv 2018, arXiv:1807.04801.
22. Elahi, M.; Ricci, F.; Rubens, N. A survey of active learning in collaborative filtering recommender systems. Comput. Sci. Rev. 2016,
20, 29–50. [CrossRef]
23. Tuia, D.; Volpi, M.; Copa, L.; Kanevski, M.; Munoz-Mari, J. A survey of active learning algorithms for supervised remote sensing
image classification. IEEE J. Sel. Top. Signal Process. 2011, 5, 606–617. [CrossRef]
24. Berger, K.; Rivera Caicedo, J.P.; Martino, L.; Wocher, M.; Hank, T.; Verrelst, J. A survey of active learning for quantifying vegetation
traits from terrestrial earth observation data. Remote Sens. 2021, 13, 287. [CrossRef]
25. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Gupta, B.B.; Chen, X.; Wang, X. A survey of deep active learning. ACM Comput.
Surv. (CSUR) 2021, 54, 1–40. [CrossRef]
26. Schröder, C.; Niekler, A. A survey of active learning for text classification using deep neural networks. arXiv 2020,
arXiv:2008.07267.
27. Hu, Q.; Guo, Y.; Cordy, M.; Xie, X.; Ma, W.; Papadakis, M.; Le Traon, Y. Towards Exploring the Limitations of Active Learning: An
Empirical Study. In Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering
(ASE), Melbourne, Australia, 15–19 November 2021; pp. 917–929.
28. Sun, L.L.; Wang, X.Z. A survey on active learning strategy. In Proceedings of the 2010 International Conference on Machine
Learning and Cybernetics, Qingdao, China, 11–14 July 2010; Volume 1, pp. 161–166.
29. Bull, L.; Manson, G.; Worden, K.; Dervilis, N. Active Learning Approaches to Structural Health Monitoring. In Special Topics in
Structural Dynamics; Springer: Berlin/Heidelberg, Germany, 2019; Volume 5, pp. 157–159.
30. Pratama, M.; Lu, J.; Lughofer, E.; Zhang, G.; Er, M.J. An incremental learning of concept drifts using evolving type-2 recurrent
fuzzy neural networks. IEEE Trans. Fuzzy Syst. 2016, 25, 1175–1192. [CrossRef]
31. Abu-Mostafa, Y.S.; Magdon-Ismail, M.; Lin, H.T. Learning from Data; AMLBook: New York, NY, USA, 2012; Volume 4.
32. Tharwat, A.; Schenck, W. Population initialization techniques for evolutionary algorithms for single-objective constrained
optimization problems: Deterministic vs. stochastic techniques. Swarm Evol. Comput. 2021, 67, 100952. [CrossRef]
33. Freund, Y.; Seung, H.S.; Shamir, E.; Tishby, N. Selective sampling using the query by committee algorithm. Mach. Learn. 1997,
28, 133–168. [CrossRef]
34. Vapnik, V.N.; Chervonenkis, A.Y. On the uniform convergence of relative frequencies of events to their probabilities. In Measures
of Complexity; Springer: Cham, Switzerland, 2015; pp. 11–30.
35. Dasgupta, S.; Kalai, A.T.; Monteleoni, C. Analysis of perceptron-based active learning. In Proceedings of the 18th Annual
Conference on Learning Theory, COLT 2005, Bertinoro, Italy, 27–30 June 2005; Springer: Berlin/Heidelberg, Germany; pp.
249–263.
36. Angluin, D. Queries and concept learning. Mach. Learn. 1988, 2, 319–342. [CrossRef]
37. Baum, E.B.; Lang, K. Query learning can work poorly when a human oracle is used. In Proceedings of the International Joint
Conference on Neural Networks, Baltimore, MD, USA, 7–11 June 1992; Volume 8, p. 8.
38. Moskovitch, R.; Nissim, N.; Stopel, D.; Feher, C.; Englert, R.; Elovici, Y. Improving the detection of unknown computer worms
activity using active learning. In Proceedings of the Annual Conference on Artificial Intelligence; Springer: Berlin/Heidelberg,
Germany, 2007; pp. 489–493.
39. Thompson, C.A.; Califf, M.E.; Mooney, R.J. Active learning for natural language parsing and information extraction. In
Proceedings of the ICML, Bled, Slovenia, 27–30 June 1999; pp. 406–414.
40. Lewis, D.D.; Gale, W.A. A sequential algorithm for training text classifiers: Corrigendum and additional data. In ACM SIGIR Forum;
ACM: New York, NY, USA, 1995; Volume 29, pp. 13–19.
41. Wang, L.; Hu, X.; Yuan, B.; Lu, J. Active learning via query synthesis and nearest neighbour search. Neurocomputing 2015,
147, 426–434. [CrossRef]
42. Sharma, M.; Bilgic, M. Evidence-based uncertainty sampling for active learning. Data Min. Knowl. Discov. 2017, 31, 164–202.
[CrossRef]
43. Scheffer, T.; Decomain, C.; Wrobel, S. Active hidden Markov models for information extraction. In Proceedings of the International
Symposium on Intelligent Data Analysis (IDA); Springer: Berlin/Heidelberg, Germany, 2001; pp. 309–318.
44. Settles, B.; Craven, M. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference
on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; pp. 1070–1079.
45. Schein, A.I.; Ungar, L.H. Active learning for logistic regression: An evaluation. Mach. Learn. 2007, 68, 235–265. [CrossRef]
46. Tong, S.; Koller, D. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2001,
2, 45–66.
47. Hernández-Lobato, J.M.; Adams, R. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In
Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1861–1869.
48. Fujii, A.; Inui, K.; Tokunaga, T.; Tanaka, H. Selective sampling for example-based word sense disambiguation. arXiv 1999,
arXiv:cs/9910020.
49. Seung, H.S.; Opper, M.; Sompolinsky, H. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational
Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 287–294.
50. Abe, N. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine
Learning (ICML98), Madison, WI, USA, 24–27 July 1998; pp. 1–9.
51. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [CrossRef]
52. Melville, P.; Yang, S.M.; Saar-Tsechansky, M.; Mooney, R. Active learning for probability estimation using Jensen-Shannon divergence. In Proceedings of the European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2005; pp. 268–279.
53. Körner, C.; Wrobel, S. Multi-class ensemble-based active learning. In Proceedings of the European Conference on Machine Learning;
Springer: Berlin/Heidelberg, Germany, 2006; pp. 687–694.
54. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
55. Kremer, J.; Steenstrup Pedersen, K.; Igel, C. Active learning with support vector machines. Wiley Interdiscip. Rev. Data Min. Knowl.
Discov. 2014, 4, 313–326. [CrossRef]
56. Schohn, G.; Cohn, D. Less is more: Active learning with support vector machines. In Proceedings of the ICML, Stanford, CA,
USA, 29 June–2 July 2000; Volume 2, p. 6.
57. Zhang, Y.; Lease, M.; Wallace, B. Active discriminative text representation learning. In Proceedings of the AAAI Conference on
Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
58. Vezhnevets, A.; Buhmann, J.M.; Ferrari, V. Active learning for semantic segmentation with expected change. In Proceedings of
the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3162–3169.
59. Roy, N.; McCallum, A. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the
International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001.
60. Wu, Y.; Kozintsev, I.; Bouguet, J.Y.; Dulong, C. Sampling strategies for active learning in personal photo retrieval. In Proceedings
of the 2006 IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9–12 July 2006; pp. 529–532.
61. Ienco, D.; Bifet, A.; Žliobaitė, I.; Pfahringer, B. Clustering based active learning for evolving data streams. In Proceedings of the
International Conference on Discovery Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 79–93.
62. Kang, J.; Ryu, K.R.; Kwon, H.C. Using cluster-based sampling to select initial training set for active learning in text classification.
In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2004;
pp. 384–388.
63. Brinker, K. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International
Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 59–66.
64. Xu, Z.; Akella, R.; Zhang, Y. Incorporating diversity and density in active learning for relevance feedback. In Proceedings of the
European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2007; pp. 246–257.
65. Osugi, T.; Kim, D.; Scott, S. Balancing exploration and exploitation: A new algorithm for active machine learning. In Proceedings
of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; 8p.
66. Yin, C.; Qian, B.; Cao, S.; Li, X.; Wei, J.; Zheng, Q.; Davidson, I. Deep similarity-based batch mode active learning with exploration-
exploitation. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21
November 2017; pp. 575–584.
67. Huang, S.J.; Jin, R.; Zhou, Z.H. Active learning by querying informative and representative examples. Adv. Neural Inf. Process.
Syst. 2010, 23, 892–900. [CrossRef] [PubMed]
68. Cebron, N.; Berthold, M.R. Active learning for object classification: From exploration to exploitation. Data Min. Knowl. Discov.
2009, 18, 283–299. [CrossRef]
69. Tharwat, A.; Schenck, W. Balancing Exploration and Exploitation: A novel active learner for imbalanced data. Knowl.-Based Syst.
2020, 210, 106500. [CrossRef]
70. Tharwat, A.; Schenck, W. A Novel Low-Query-Budget Active Learner with Pseudo-Labels for Imbalanced Data. Mathematics
2022, 10, 1068. [CrossRef]
71. Nguyen, H.T.; Smeulders, A. Active learning using pre-clustering. In Proceedings of the Twenty-First International Conference
on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 79.
72. Ebert, S.; Fritz, M.; Schiele, B. Ralf: A reinforced active learning formulation for object class recognition. In Proceedings of the
2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3626–3633.
73. Konyushkova, K.; Sznitman, R.; Fua, P. Learning active learning from data. Adv. Neural Inf. Process. Syst. 2017, 30, 4228–4238.
74. Fang, M.; Li, Y.; Cohn, T. Learning how to active learn: A deep reinforcement learning approach. arXiv 2017, arXiv:1708.02383.
75. Woodward, M.; Finn, C. Active one-shot learning. arXiv 2017, arXiv:1702.06559.
76. Wassermann, S.; Cuvelier, T.; Casas, P. RAL-Improving stream-based active learning by reinforcement learning. In Proceedings of
the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD)
Workshop on Interactive Adaptive Learning (IAL), Würzburg, Germany, 16 September 2019.
77. Baram, Y.; El-Yaniv, R.; Luz, K. Online choice of active learning algorithms. J. Mach. Learn. Res. 2004, 5, 255–291.
78. Hsu, W.N.; Lin, H.T. Active learning by learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence,
Austin, TX, USA, 25–30 January 2015.
79. Chu, H.M.; Lin, H.T. Can active learning experience be transferred? In Proceedings of the 2016 IEEE 16th International Conference
on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 841–846.
80. Frénay, B.; Hammer, B. Label-noise-tolerant classification for streaming data. In Proceedings of the 2017 International Joint
Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 1748–1755.
81. Donmez, P.; Carbonell, J.G. Proactive learning: Cost-sensitive active learning with multiple imperfect oracles. In Proceedings of
the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA, 26–30 October 2008; pp. 619–628.
82. Yan, Y.; Rosales, R.; Fung, G.; Dy, J.G. Active learning from crowds. In Proceedings of the ICML, Bellevue, WA, USA, 28 June–2
July 2011.
83. Shu, Z.; Sheng, V.S.; Li, J. Learning from crowds with active learning and self-healing. Neural Comput. Appl. 2018, 30, 2883–2894.
[CrossRef]
84. Sheng, V.S.; Provost, F.; Ipeirotis, P.G. Get another label? Improving data quality and data mining using multiple, noisy labelers.
In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, NV,
USA, 24–27 August 2008; pp. 614–622.
85. Fang, M.; Zhu, X. Active learning with uncertain labeling knowledge. Pattern Recognit. Lett. 2014, 43, 98–108. [CrossRef]
86. Tuia, D.; Munoz-Mari, J. Learning user’s confidence for active learning. IEEE Trans. Geosci. Remote Sens. 2012, 51, 872–880.
[CrossRef]
87. Younesian, T.; Zhao, Z.; Ghiassi, A.; Birke, R.; Chen, L.Y. QActor: Active Learning on Noisy Labels. In Proceedings of the Asian
Conference on Machine Learning, Virtual, 17–19 November 2021; pp. 548–563.
88. Zhang, L.; Chen, C.; Bu, J.; Cai, D.; He, X.; Huang, T.S. Active learning based on locally linear reconstruction. IEEE Trans. Pattern
Anal. Mach. Intell. 2011, 33, 2026–2038. [CrossRef]
89. Elwell, R.; Polikar, R. Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw. 2011,
22, 1517–1531. [CrossRef] [PubMed]
90. Vaquet, V.; Hammer, B. Balanced SAM-kNN: Online Learning with Heterogeneous Drift and Imbalanced Data. In Proceedings of
the International Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2020; pp. 850–862.
91. Wang, S.; Minku, L.L.; Yao, X. Dealing with Multiple Classes in Online Class Imbalance Learning. In Proceedings of the IJCAI,
New York, NY, USA, 9–15 July 2016; pp. 2118–2124.
92. Gao, J.; Fan, W.; Han, J.; Yu, P.S. A general framework for mining concept-drifting data streams with skewed distributions. In
Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 3–14.
93. Chen, S.; He, H. SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining. In Proceedings of
the 2009 International Joint Conference on Neural Networks, Atlanta, GA, USA, 14–19 June 2009; pp. 522–529.
94. Zhang, Y.; Zhao, P.; Niu, S.; Wu, Q.; Cao, J.; Huang, J.; Tan, M. Online adaptive asymmetric active learning with limited budgets.
IEEE Trans. Knowl. Data Eng. 2019, 33, 2680–2692. [CrossRef]
95. Žliobaitė, I.; Bifet, A.; Pfahringer, B.; Holmes, G. Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn.
Syst. 2013, 25, 27–39. [CrossRef]
96. Liu, W.; Zhang, H.; Ding, Z.; Liu, Q.; Zhu, C. A comprehensive active learning method for multiclass imbalanced data streams
with concept drift. Knowl.-Based Syst. 2021, 215, 106778. [CrossRef]
97. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Chen, X.; Wang, X. A survey of deep active learning. arXiv 2020, arXiv:2009.00236.
98. Tomanek, K.; Hahn, U. A comparison of models for cost-sensitive active learning. In Proceedings of the Coling 2010: Posters,
Beijing, China, 23–27 August 2010; pp. 1247–1255.
99. Settles, B.; Craven, M.; Friedland, L. Active learning with real annotation costs. In Proceedings of the NIPS Workshop on
Cost-Sensitive Learning, Vancouver, BC, Canada, 13 December 2008; Volume 1.
100. Margineantu, D.D. Active cost-sensitive learning. In Proceedings of the IJCAI, Edinburgh, Scotland, 30 July–5 August 2005;
Volume 5, pp. 1622–1623.
101. Kapoor, A.; Horvitz, E.; Basu, S. Selective Supervision: Guiding Supervised Learning with Decision-Theoretic Active Learning.
In Proceedings of the IJCAI, Hyderabad, India, 6–12 January 2007; Volume 7, pp. 877–882.
102. Kee, S.; Del Castillo, E.; Runger, G. Query-by-committee improvement with diversity and density in batch active learning. Inf.
Sci. 2018, 454, 401–418. [CrossRef]
103. Yin, L.; Wang, H.; Fan, W.; Kou, L.; Lin, T.; Xiao, Y. Incorporate active learning to semi-supervised industrial fault classification.
J. Process. Control. 2019, 78, 88–97. [CrossRef]
104. He, G.; Li, Y.; Zhao, W. An uncertainty and density based active semi-supervised learning scheme for positive unlabeled
multivariate time series classification. Knowl.-Based Syst. 2017, 124, 80–92. [CrossRef]
105. Wang, Z.; Du, B.; Zhang, L.; Zhang, L. A batch-mode active learning framework by querying discriminative and representative
samples for hyperspectral image classification. Neurocomputing 2016, 179, 88–100. [CrossRef]
106. Straat, M.; Abadi, F.; Göpfert, C.; Hammer, B.; Biehl, M. Statistical mechanics of on-line learning under concept drift. Entropy
2018, 20, 775. [CrossRef] [PubMed]
107. Lindstrom, P.; Mac Namee, B.; Delany, S.J. Drift detection using uncertainty distribution divergence. Evol. Syst. 2013, 4, 13–25.
[CrossRef]
108. Bifet, A.; Gavalda, R. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International
Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 443–448.
109. Gama, J.; Medas, P.; Castillo, G.; Rodrigues, P. Learning with drift detection. In Proceedings of the Brazilian Symposium on Artificial
Intelligence; Springer: Berlin/Heidelberg, Germany, 2004; pp. 286–295.
110. Syed, N.A.; Liu, H.; Sung, K.K. Handling concept drifts in incremental learning with support vector machines. In Proceedings of
the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 15–18 August
1999; pp. 317–321.
111. Kolter, J.Z.; Maloof, M.A. Dynamic weighted majority: An ensemble method for drifting concepts. J. Mach. Learn. Res. 2007,
8, 2755–2790.
112. Brinker, K. On active learning in multi-label classification. In From Data and Information Analysis to Knowledge Engineering; Springer:
Berlin/Heidelberg, Germany, 2006; pp. 206–213.
113. Wu, J.; Sheng, V.S.; Zhang, J.; Li, H.; Dadakova, T.; Swisher, C.L.; Cui, Z.; Zhao, P. Multi-label active learning algorithms for image
classification: Overview and future promise. ACM Comput. Surv. (CSUR) 2020, 53, 1–35. [CrossRef] [PubMed]
114. Tsoumakas, G.; Katakis, I.; Vlahavas, I. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook; Springer:
Boston, MA, USA, 2009; pp. 667–685.
115. Reyes, O.; Morell, C.; Ventura, S. Effective active learning strategy for multi-label learning. Neurocomputing 2018, 273, 494–508.
[CrossRef]
116. Zhu, J.; Wang, H.; Hovy, E.; Ma, M. Confidence-based stopping criteria for active learning for data annotation. ACM Trans. Speech
Lang. Process. (TSLP) 2010, 6, 1–24. [CrossRef]
117. Li, M.; Sethi, I.K. Confidence-based active learning. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1251–1261.
118. Nguyen, V.L.; Shaker, M.H.; Hüllermeier, E. How to measure uncertainty in uncertainty sampling for active learning. Mach.
Learn. 2022, 111, 89–122. [CrossRef]
119. Karamcheti, S.; Krishna, R.; Fei-Fei, L.; Manning, C.D. Mind your outliers! Investigating the negative impact of outliers on active
learning for visual question answering. arXiv 2021, arXiv:2107.02331.
120. Klidbary, S.H.; Shouraki, S.B.; Ghaffari, A.; Kourabbaslou, S.S. Outlier robust fuzzy active learning method (ALM). In Proceedings
of the 2017 7th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 26–27 October 2017;
pp. 347–352.
121. Napierala, K.; Stefanowski, J. Types of minority class examples and their influence on learning classifiers from imbalanced data.
J. Intell. Inf. Syst. 2016, 46, 563–597. [CrossRef]
122. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of tricks for image classification with convolutional neural networks.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June
2019; pp. 558–567.
123. Wang, K.; Zhang, D.; Li, Y.; Zhang, R.; Lin, L. Cost-effective active learning for deep image classification. IEEE Trans. Circuits Syst.
Video Technol. 2016, 27, 2591–2600. [CrossRef]
124. Tran, T.; Do, T.T.; Reid, I.; Carneiro, G. Bayesian generative active deep learning. In Proceedings of the International Conference
on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6295–6304.
125. Guo, Y.; Schuurmans, D. Discriminative batch mode active learning. Adv. Neural Inf. Process. Syst. 2007, 20, 593–600.
126. Tomanek, K.; Wermter, J.; Hahn, U. An approach to text corpus construction which cuts annotation costs and maintains reusability
of annotated data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 486–495.
127. Vijayanarasimhan, S.; Grauman, K. Large-scale live active learning: Training object detectors with crawled data and crowds. Int.
J. Comput. Vis. 2014, 108, 97–114. [CrossRef]
128. Long, C.; Hua, G.; Kapoor, A. Active visual recognition with expertise estimation in crowdsourcing. In Proceedings of the IEEE
International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3000–3007.
129. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F. Imagenet: A large-scale hierarchical image database. In Proceedings of the
2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
130. Zhang, J.; Wu, X.; Sheng, V.S. Active learning with imbalanced multiple noisy labeling. IEEE Trans. Cybern. 2014, 45, 1095–1107.
[CrossRef]
131. Siméoni, O.; Budnik, M.; Avrithis, Y.; Gravier, G. Rethinking deep active learning: Using unlabeled data at model training.
In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021;
pp. 1220–1227.
132. Hossain, H.S.; Roy, N. Active deep learning for activity recognition with context aware annotator selection. In Proceedings of the
25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019;
pp. 1862–1870.
133. Zhdanov, F. Diverse mini-batch active learning. arXiv 2019, arXiv:1901.05954.
134. Sener, O.; Savarese, S. Active learning for convolutional neural networks: A core-set approach. arXiv 2017, arXiv:1708.00489.
135. Wang, D.; Shang, Y. A new active labeling method for deep learning. In Proceedings of the 2014 International Joint Conference
on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 112–119.
136. Gal, Y.; Ghahramani, Z. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv 2015,
arXiv:1506.02158.
137. Gal, Y.; Islam, R.; Ghahramani, Z. Deep Bayesian active learning with image data. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 1183–1192.
138. Kirsch, A.; Van Amersfoort, J.; Gal, Y. BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning. Adv.
Neural Inf. Process. Syst. 2019, 32, 7026–7037.
139. Boney, R.; Ilin, A. Semi-supervised and active few-shot learning with prototypical networks. arXiv 2017, arXiv:1711.10856.
140. Boney, R.; Ilin, A. Active one-shot learning with prototypical networks. In Proceedings of the European Symposium on Artificial
Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 24–26 April 2019; pp. 583–588.
141. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 36, 453–465. [CrossRef] [PubMed]
142. Zheng, Z.; Padmanabhan, B. On active learning for data acquisition. In Proceedings of the 2002 IEEE International Conference on
Data Mining, Maebashi, Japan, 9–12 December 2002; pp. 562–569.
143. Greiner, R.; Grove, A.J.; Roth, D. Learning cost-sensitive active classifiers. Artif. Intell. 2002, 139, 137–174. [CrossRef]
144. Shim, J.; Kang, S.; Cho, S. Active inspection for cost-effective fault prediction in manufacturing process. J. Process. Control. 2021,
105, 250–258. [CrossRef]
145. Jin, Y. A comprehensive survey of fitness approximation in evolutionary computation. Soft Comput. 2005, 9, 3–12. [CrossRef]
146. Lye, K.O.; Mishra, S.; Ray, D.; Chandrashekar, P. Iterative surrogate model optimization (ISMO): An active learning algorithm for
PDE constrained optimization with deep neural networks. Comput. Methods Appl. Mech. Eng. 2021, 374, 113575. [CrossRef]
147. Karunakaran, D. Active Learning Methods for Dynamic Job Shop Scheduling Using Genetic Programming under Uncertain Environment. Ph.D. Thesis, Open Access Te Herenga Waka-Victoria University of Wellington, Wellington, New Zealand, 2019.
148. Zemmal, N.; Azizi, N.; Sellami, M.; Cheriguene, S.; Ziani, A.; AlDwairi, M.; Dendani, N. Particle swarm optimization based
swarm intelligence for active learning improvement: Application on medical data classification. Cogn. Comput. 2020, 12, 991–1010.
[CrossRef]
149. Zemmal, N.; Azizi, N.; Sellami, M.; Cheriguene, S.; Ziani, A. A new hybrid system combining active learning and particle swarm
optimisation for medical data classification. Int. J. Bio-Inspired Comput. 2021, 18, 59–68. [CrossRef]
150. Lookman, T.; Balachandran, P.V.; Xue, D.; Yuan, R. Active learning in materials science with emphasis on adaptive sampling
using uncertainties for targeted design. NPJ Comput. Mater. 2019, 5, 1–17. [CrossRef]
151. Jinnouchi, R.; Miwa, K.; Karsai, F.; Kresse, G.; Asahi, R. On-the-fly active learning of interatomic potentials for large-scale
atomistic simulations. J. Phys. Chem. Lett. 2020, 11, 6946–6955. [CrossRef]
152. Chabanet, S.; El-Haouzi, H.B.; Thomas, P. Coupling digital simulation and machine learning metamodel through an active
learning approach in Industry 4.0 context. Comput. Ind. 2021, 133, 103529. [CrossRef]
153. Diaw, A.; Barros, K.; Haack, J.; Junghans, C.; Keenan, B.; Li, Y.; Livescu, D.; Lubbers, N.; McKerns, M.; Pavel, R.; et al. Multiscale
simulation of plasma flows using active learning. Phys. Rev. E 2020, 102, 023310. [CrossRef] [PubMed]
154. Hodapp, M.; Shapeev, A. In operando active learning of interatomic interaction during large-scale simulations. Mach. Learn. Sci.
Technol. 2020, 1, 045005. [CrossRef]
155. Smith, J.S.; Nebgen, B.; Lubbers, N.; Isayev, O.; Roitberg, A.E. Less is more: Sampling chemical space with active learning. J. Chem.
Phys. 2018, 148, 241733. [CrossRef]
156. Ahmed, W.; Jackson, J.M. Emerging Nanotechnologies for Manufacturing; Elsevier William Andrew: Waltham, MA, USA, 2015.
157. Chen, C.T.; Gu, G.X. Generative deep neural networks for inverse materials design using backpropagation and active learning.
Adv. Sci. 2020, 7, 1902607. [CrossRef]
158. Zhang, C.; Amar, Y.; Cao, L.; Lapkin, A.A. Solvent selection for Mitsunobu reaction driven by an active learning surrogate model.
Org. Process. Res. Dev. 2020, 24, 2864–2873. [CrossRef]
159. Zhang, Y.; Wen, C.; Wang, C.; Antonov, S.; Xue, D.; Bai, Y.; Su, Y. Phase prediction in high entropy alloys with a rational selection
of materials descriptors and machine learning models. Acta Mater. 2020, 185, 528–539. [CrossRef]
160. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference
on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
161. Tur, G.; Hakkani-Tür, D.; Schapire, R.E. Combining active and semi-supervised learning for spoken language understanding.
Speech Commun. 2005, 45, 171–186. [CrossRef]
162. Zhu, X.; Lafferty, J.; Ghahramani, Z. Combining active learning and semi-supervised learning using Gaussian fields and harmonic
functions. In Proceedings of the ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning
and Data Mining, Washington, DC, USA, 21–24 August 2003; Volume 3.
163. Shen, P.; Li, C.; Zhang, Z. Distributed active learning. IEEE Access 2016, 4, 2572–2579. [CrossRef]
164. Chen, X.; Wujek, B. Autodal: Distributed active learning with automatic hyperparameter selection. In Proceedings of the AAAI
Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3537–3544.
165. Huang, S.J.; Zong, C.C.; Ning, K.P.; Ye, H.B. Asynchronous Active Learning with Distributed Label Querying. In Proceedings of
the International Joint Conference on Artificial Intelligence (IJCAI 2021), Montreal, QC, Canada, 19–27 August 2021;
pp. 2570–2576.
166. Baxter, J. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Mach. Learn. 1997, 28, 7–39.
[CrossRef]
167. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [CrossRef]
168. Zhang, Y. Multi-task active learning with output constraints. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial
Intelligence, Atlanta, GA, USA, 11–15 July 2010.
169. Saha, A.; Rai, P.; Daumé, H., III; Venkatasubramanian, S. Active online multitask learning. In Proceedings of the ICML 2010
Workshop on Budget Learning, Haifa, Israel, 21–24 June 2010.
170. Ghai, B.; Liao, Q.V.; Zhang, Y.; Bellamy, R.; Mueller, K. Explainable active learning (XAL): Toward AI explanations as interfaces for
machine teachers. Proc. ACM Hum.-Comput. Interact. 2021, 4, 1–28. [CrossRef]
171. Phillips, R.; Chang, K.H.; Friedler, S.A. Interpretable active learning. In Proceedings of the Conference on Fairness, Accountability
and Transparency, New York, NY, USA, 23–24 February 2018; pp. 49–61.
172. Zhu, X.; Zhang, P.; Lin, X.; Shi, Y. Active learning from stream data using optimal weight classifier ensemble. IEEE Trans. Syst.
Man Cybern. Part B Cybern. 2010, 40, 1607–1621.
173. Tran, V.C.; Nguyen, N.T.; Fujita, H.; Hoang, D.T.; Hwang, D. A combination of active learning and self-learning for named entity
recognition on twitter using conditional random fields. Knowl.-Based Syst. 2017, 132, 179–187. [CrossRef]
174. Chen, Y.; Lasko, T.A.; Mei, Q.; Denny, J.C.; Xu, H. A study of active learning methods for named entity recognition in clinical text.
J. Biomed. Inform. 2015, 58, 11–18. [CrossRef]
175. Aldoğan, D.; Yaslan, Y. A comparison study on active learning integrated ensemble approaches in sentiment analysis. Comput.
Electr. Eng. 2017, 57, 311–323. [CrossRef]
176. Zhou, S.; Chen, Q.; Wang, X. Active deep learning method for semi-supervised sentiment classification. Neurocomputing 2013,
120, 536–546. [CrossRef]
177. Wang, P.; Zhang, P.; Guo, L. Mining multi-label data streams using ensemble-based active learning. In Proceedings of the 2012
SIAM International Conference on Data Mining, Anaheim, CA, USA, 26–28 April 2012; pp. 1131–1140.
178. Boutell, M.R.; Luo, J.; Shen, X.; Brown, C.M. Learning multi-label scene classification. Pattern Recognit. 2004, 37, 1757–1771.
[CrossRef]
179. Casanova, A.; Pinheiro, P.O.; Rostamzadeh, N.; Pal, C.J. Reinforced active learning for image segmentation. arXiv 2020,
arXiv:2002.06583.
180. Mahapatra, D.; Bozorgtabar, B.; Thiran, J.P.; Reyes, M. Efficient active learning for image classification and segmentation using a
sample selection and conditional generative adversarial network. In Proceedings of the International Conference on Medical Image
Computing and Computer-Assisted Intervention, Granada, Spain; Springer: Berlin/Heidelberg, Germany, 2018; pp. 580–588.
181. Nath, V.; Yang, D.; Landman, B.A.; Xu, D.; Roth, H.R. Diminishing uncertainty within the training pool: Active learning for
medical image segmentation. IEEE Trans. Med. Imaging 2020, 40, 2534–2547. [CrossRef]
182. Bietti, A. Active Learning for Object Detection on Satellite Images. Available online: https://citeseerx.ist.psu.edu/document?
repid=rep1&type=pdf&doi=31243e163e02eb151e5564ae8c01dcd5c7dc225a (accessed on 28 December 2022).
183. Brust, C.A.; Käding, C.; Denzler, J. Active learning for deep object detection. arXiv 2018, arXiv:1809.09875.
184. Kao, C.C.; Lee, T.Y.; Sen, P.; Liu, M.Y. Localization-aware active learning for object detection. In Proceedings of the Asian Conference
on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018; pp. 506–522.
185. Tuia, D.; Ratle, F.; Pacifici, F.; Kanevski, M.F.; Emery, W.J. Active learning methods for remote sensing image classification. IEEE
Trans. Geosci. Remote Sens. 2009, 47, 2218–2232. [CrossRef]
186. Liao, H.; Chen, L.; Song, Y.; Ming, H. Visualization-based active learning for video annotation. IEEE Trans. Multimed. 2016,
18, 2196–2205. [CrossRef]
187. Mohamad, S.; Sayed-Mouchaweh, M.; Bouchachia, A. Online active learning for human activity recognition from sensory data
streams. Neurocomputing 2020, 390, 341–358. [CrossRef]
188. Hossain, H.S.; Khan, M.A.A.H.; Roy, N. Active learning enabled activity recognition. Pervasive Mob. Comput. 2017, 38, 312–330.
[CrossRef]
189. Reker, D.; Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug Discov. Today 2015, 20, 458–465.
[CrossRef]
190. Mohamed, T.P.; Carbonell, J.G.; Ganapathiraju, M.K. Active learning for human protein-protein interaction prediction. BMC
Bioinform. 2010, 11, S57. [CrossRef]
191. Osmanbeyoglu, H.U.; Wehner, J.A.; Carbonell, J.G.; Ganapathiraju, M.K. Active learning for membrane protein structure prediction. BMC Bioinform. 2010, 11 (Suppl. 1), S58. [CrossRef]
192. Warmuth, M.K.; Rätsch, G.; Mathieson, M.; Liao, J.; Lemmen, C. Active Learning in the Drug Discovery Process. In Proceedings
of the NIPS, Vancouver, BC, Canada, 3–8 December 2001; pp. 1449–1456.
193. Figueroa, R.L.; Zeng-Treitler, Q.; Ngo, L.H.; Goryachev, S.; Wiechmann, E.P. Active learning for clinical text classification: Is it
better than random sampling? J. Am. Med. Inform. Assoc. 2012, 19, 809–816. [CrossRef]
194. Yang, Y.; Li, Y.; Yang, J.; Wen, J. Dissimilarity-based active learning for embedded weed identification. Turk. J. Agric. For. 2022,
46, 390–401. [CrossRef]
195. Yang, J.; Lan, G.; Li, Y.; Gong, Y.; Zhang, Z.; Ercisli, S. Data quality assessment and analysis for pest identification in smart
agriculture. Comput. Electr. Eng. 2022, 103, 108322. [CrossRef]
196. Sheikh, R.; Milioto, A.; Lottes, P.; Stachniss, C.; Bennewitz, M.; Schultz, T. Gradient and log-based active learning for semantic
segmentation of crop and weed for agricultural robots. In Proceedings of the 2020 IEEE International Conference on Robotics and
Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 1350–1356.
197. Chandra, A.L.; Desai, S.V.; Balasubramanian, V.N.; Ninomiya, S.; Guo, W. Active learning with point supervision for cost-effective
panicle detection in cereal crops. Plant Methods 2020, 16, 1–16. [CrossRef] [PubMed]
198. Peng, P.; Zhang, W.; Zhang, Y.; Xu, Y.; Wang, H.; Zhang, H. Cost sensitive active learning using bidirectional gated recurrent
neural networks for imbalanced fault diagnosis. Neurocomputing 2020, 407, 232–245. [CrossRef]
199. Agarwal, D.; Srivastava, P.; Martin-del Campo, S.; Natarajan, B.; Srinivasan, B. Addressing uncertainties within active learning
for industrial IoT. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14
June–31 July 2021; pp. 557–562.
200. Rahman, M.; Khan, A.; Anowar, S.; Al-Imran, M.; Verma, R.; Kumar, D.; Kobayashi, K.; Alam, S. Leveraging Industry 4.0—Deep
Learning, Surrogate Model and Transfer Learning with Uncertainty Quantification Incorporated into Digital Twin for Nuclear
System. arXiv 2022, arXiv:2210.00074.
201. El-Hasnony, I.M.; Elzeki, O.M.; Alshehri, A.; Salem, H. Multi-label active learning-based machine learning model for heart
disease prediction. Sensors 2022, 22, 1184. [CrossRef]
202. Yadav, C.S.; Pradhan, M.K.; Gangadharan, S.M.P.; Chaudhary, J.K.; Singh, J.; Khan, A.A.; Haq, M.A.; Alhussen, A.; Wechtaisong, C.;
Imran, H.; et al. Multi-Class Pixel Certainty Active Learning Model for Classification of Land Cover Classes Using Hyperspectral
Imagery. Electronics 2022, 11, 2799. [CrossRef]
203. Zhao, G.; Dougherty, E.; Yoon, B.J.; Alexander, F.; Qian, X. Efficient active learning for Gaussian process classification by error
reduction. Adv. Neural Inf. Process. Syst. 2021, 34, 9734–9746.
204. Yao, L.; Wan, Y.; Ni, H.; Xu, B. Action unit classification for facial expression recognition using active learning and SVM. Multimed.
Tools Appl. 2021, 80, 24287–24301. [CrossRef]
205. Karlos, S.; Aridas, C.; Kanas, V.G.; Kotsiantis, S. Classification of acoustical signals by combining active learning strategies with
semi-supervised learning schemes. Neural Comput. Appl. 2021, 35, 3–20. [CrossRef]
206. Xu, M.; Zhao, Q.; Jia, S. Multiview Spatial-Spectral Active Learning for Hyperspectral Image Classification. IEEE Trans. Geosci.
Remote. Sens. 2021, 60, 1–15. [CrossRef]
207. Wu, X.; Chen, C.; Zhong, M.; Wang, J.; Shi, J. COVID-AL: The diagnosis of COVID-19 with deep active learning. Med. Image Anal.
2021, 68, 101913. [CrossRef] [PubMed]
208. Al-Tamimi, A.K.; Bani-Isaa, E.; Al-Alami, A. Active learning for Arabic text classification. In Proceedings of the 2021 International
Conference on Computational Intelligence and Knowledge Economy (ICCIKE), Dubai, United Arab Emirates, 17–18 March 2021;
pp. 123–126.
209. Shahraki, A.; Abbasi, M.; Taherkordi, A.; Jurcut, A.D. Active learning for network traffic classification: A technical study. IEEE
Trans. Cogn. Commun. Netw. 2021, 8, 422–439. [CrossRef]
210. Liu, Q.; Zhu, Y.; Liu, Z.; Zhang, Y.; Wu, S. Deep Active Learning for Text Classification with Diverse Interpretations. In
Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Online, 1–5 November 2021;
pp. 3263–3267.
211. Prabhu, S.; Mohamed, M.; Misra, H. Multi-class text classification using BERT-based active learning. arXiv 2021, arXiv:2104.14289.
212. Cao, X.; Yao, J.; Xu, Z.; Meng, D. Hyperspectral image classification with convolutional neural network and active learning. IEEE
Trans. Geosci. Remote Sens. 2020, 58, 4604–4616. [CrossRef]
213. Rodríguez-Pérez, R.; Miljković, F.; Bajorath, J. Assessing the information content of structural and protein–ligand interaction
representations for the classification of kinase inhibitor binding modes via machine learning and active learning. J. Cheminform.
2020, 12, 1–14. [CrossRef] [PubMed]
214. Sinha, S.; Ebrahimi, S.; Darrell, T. Variational adversarial active learning. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5972–5981.
215. Danka, T.; Horvath, P. modAL: A modular active learning framework for Python. arXiv 2018, arXiv:1805.00979.
216. Tang, Y.P.; Li, G.X.; Huang, S.J. ALiPy: Active learning in Python. arXiv 2019, arXiv:1901.03802.
217. Yang, Y.Y.; Lee, S.C.; Chung, Y.A.; Wu, T.E.; Chen, S.A.; Lin, H.T. libact: Pool-based active learning in Python. arXiv 2017,
arXiv:1710.00379.
218. Lin, B.Y.; Lee, D.H.; Xu, F.F.; Lan, O.; Ren, X. AlpacaTag: An active learning-based crowd annotation framework for sequence
tagging. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, 28 July–2
August 2019.
219. Yu, H.; Yang, X.; Zheng, S.; Sun, C. Active learning from imbalanced data: A solution of online weighted extreme learning
machine. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1088–1103. [CrossRef]
220. Hand, D.J.; Till, R.J. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach.
Learn. 2001, 45, 171–186. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.