Active learning for data streams: a survey

Davide Cacciarelli1,2* and Murat Kulahci1,3

1 Department of Applied Mathematics and Computer Science, Technical University of Denmark,
Kgs. Lyngby, Denmark.
2 Department of Mathematical Sciences, Norwegian University of Science and Technology,
Trondheim, Norway.
3 Department of Business Administration, Technology and Social Sciences, Luleå University of
Technology, Luleå, Sweden.

arXiv:2302.08893v4 [stat.ML] 29 Nov 2023

*Corresponding author(s). E-mail(s): [email protected];

Abstract
Online active learning is a paradigm in machine learning that aims to select the most informative data
points to label from a data stream. The problem of minimizing the cost associated with collecting labeled
observations has gained a lot of attention in recent years, particularly in real-world applications where data is
only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making
it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies
have been proposed in the last decades, aiming to select the most informative observations for labeling in
order to improve the performance of machine learning models. These approaches can be broadly divided
into two categories: static pool-based and stream-based active learning. Pool-based active learning involves
selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many
surveys and literature reviews. However, the growing availability of data streams has led to an increase in
the number of approaches that focus on online active learning, which involves continuously selecting and
labeling observations as they arrive in a stream. This work aims to provide an overview of the most recently
proposed approaches for selecting the most informative observations from data streams in real time. We
review the various techniques that have been proposed and discuss their strengths and limitations, as well as
the challenges and opportunities that exist in this area of research.

Keywords: stream-based active learning; online active learning; data streams; online learning; unlabeled data; query
strategies; selective sampling; concept drift; experimental design; bandits.

1 Introduction
The deployment of machine learning models in real-world applications is often reliant on the availability of
significant amounts of annotated data. While recent advancements in sensor technology have facilitated the
collection of larger amounts of data, this data is not always labeled and ready for use in training models. Indeed,
the process of obtaining labeled observations for supervised learning models can be cost-prohibitive and time-
consuming, as it often requires quality inspections or manual annotation. In such cases, active learning proves
to be a valuable strategy to identify the most informative data points for use in training, thereby reducing the
overall cost of labeling and improving the performance of the model. Over the years, a plethora of active learning
approaches have been proposed in the literature, each with its own benefits and limitations. These approaches
seek to strike a balance between the cost of labeling and the quality of the model. By carefully selecting which
observations to query, active learning reduces the amount of labeled data required and makes the learning
process more efficient.

Published in Machine Learning by Springer (2023). https://doi.org/10.1007/s10994-023-06454-2

While several surveys have been published on pool-based active learning (Aggarwal et al, 2014; Settles, 2009;
Fu et al, 2013; Kumar and Gupta, 2020), which involves selecting a fixed set of observations from a pool of
unlabeled data, the dynamic and sequential nature of many real-world problems often renders these approaches
impractical. This has led to growing interest in the online variant of active learning, also referred to as stream-
based active learning, which involves continuously selecting and labeling observations as they arrive in a stream,
allowing for real-time adaptation to changing data distributions. Lughofer (2017) provided a review of online
active learning approaches with a focus on fuzzy models. However, since its publication, numerous other online
active learning approaches have been proposed, and to the best of our knowledge, no other surveys have been
published to synthesize these developments. Moreover, surveys purely focusing on online learning from data
streams (Lu et al, 2018; Tieppo et al, 2022; Lima et al, 2022; Hoi et al, 2021) discuss methods that assume a
complete availability of labels, which is not the case in many real-world applications. The aim of this review is to
fill this gap by providing a comprehensive overview¹ of the most recently developed query strategies for online
active learning. It is worth noting that in certain cases, stream-based active learning is narrowly defined as the
act of selecting the most informative observations from a data stream to fit a predictive model. Instead, the act
of determining which observations to query while making predictions is referred to as online selective sampling
(Hanneke and Yang, 2021). In this work, we cover and examine all the methods that address the crucial problem
of selecting the most informative data points to label from a data stream in an online fashion. We will present the
techniques that have been proposed so far, discussing their strengths and limitations, as well as the challenges
and opportunities that exist in this field. In addition, we will provide an overview of evaluation strategies for
online active learning algorithms and highlight some real-world applications. Finally, we will identify potential
future research directions in this area.
This survey explores various facets of active learning, covering both theoretical foundations and practical
challenges. In doing so, we aim to shed light on pertinent research questions, including:
1. Query strategy. What sampling strategy should be used to maximize learning efficiency in a streaming context?
2. Timing of queries. When and how often should data points be queried to balance learning and resource
constraints?
3. Model updates. When should predictive models be updated and how can they adapt to changing data
distributions and concept drift?
4. Scalability. How can active learning methods be made scalable and efficient for high-velocity data streams?
5. Evaluation. What are appropriate evaluation metrics for assessing the performance of stream-based active
learning algorithms?
The structure of this paper is as follows. In Section 2, we provide an overview of active learning, including
the main instance selection criteria, an overview of the main active learning scenarios, and the connection
between active learning and semi-supervised learning. Section 3 represents the core of the review, with a brief
overview of how online active learning approaches have been classified, followed by a detailed description of the
state-of-the-art approaches. In Section 4, we examine evaluation strategies for online active learning algorithms.
Section 5 highlights real-world applications and challenges. Section 6 provides a summary of the most common
online active learning methods and highlights potential directions for future research. Finally, Section 7 provides
conclusions and summarizes the key contributions of the review.

2 Preliminaries on active learning

In supervised learning, we seek to learn a function that can predict the output variable, also known as response,
given a set of input variables, also known as covariates. This function is often learned by training a model on
a labeled dataset that consists of a large number of input-output pairs. However, obtaining labeled examples is
not always straightforward, and it may not be possible or practical to label all the available data. In these cases,
active learning can be used to select a subset of the data for labeling in order to improve the performance of the
model, when there is a budget constraint on the number of unlabeled observations that can be queried. Indeed,
there are many examples of how a classification or regression model can achieve a performance that is similar to
what can be achieved when all the labels are available, using only a small fraction of the available observations.

¹ We conducted a search on SCOPUS and Google Scholar using the following keywords: "on-line active learning", "online active learning",
"stream-based active learning", "single pass active learning", "online selective sampling", "sequential selective sampling", and "active
learning" combined with "data stream". Each paper was reviewed individually to determine its relevance to online active learning. We
eliminated irrelevant papers and manually added some papers that did not contain these keywords but used online active learning methods
or were relevant to our discussion. Additionally, we included related papers that were necessary to understand the bigger picture from the
references of the reviewed strategies.

2.1 Instance selection criteria
The main challenge in active learning is deciding which data points to label. There are many strategies for
selecting data points in active learning, and most of them can be associated with one of these groups:
• Uncertainty-based query strategies. These approaches focus on selecting data points that the model is least
confident about, in order to reduce its uncertainty (Lu et al, 2016; Tong and Koller, 2002). When using
classification models, the most widely used is the margin-based query strategy, where data points close to the
decision boundary are selected (Roth and Small, 2006; Balcan et al, 2007).
• Expected error or variance minimization. These strategies estimate the future error or variance, when a newly
labeled example is made available, and try to minimize it directly (Cohn et al, 1996; Roy and Mccallum, 2001).
• Expected model change maximization. This strategy involves selecting data points that would have the greatest
impact on the estimate of the current model parameters if they were labeled and added to the training set
(Cai et al, 2013).
• Disagreement-based query strategies. These approaches focus on selecting data points where there is disagreement
among multiple models or experts (Hanneke, 2014; Wang, 2011; Steve and Liu, 2014; Sheng et al, 2008).
One of the most common approaches is query by committee (Seung et al, 1992; Freund et al, 1997; Burbidge
et al, 2007), which uses an ensemble of models to identify instances where the models have conflicting
predictions.
• Diversity- and density-based approaches. These methods exploit the structural information of the instances
and try to select data points that are diverse and representative of the overall distribution of the data. One
example of this approach is the use of Mahalanobis distance to seek observations that are far from the currently
labeled data points (Ge, 2014; Cacciarelli et al, 2022a). Clustering may be applied to label representative data
points (Nguyen and Smeulders, 2004; Min et al, 2020; Ienco et al, 2013), and graph-based methods can be
employed to explore the structure information of labeled and unlabeled data points (Zhang et al, 2020b) or
to build upon the semi-supervised label propagation strategy (Long et al, 2008).
• Hybrid strategies. These are active learning algorithms that combine multiple instance selection criteria (Don-
mez et al, 2007; Huang et al, 2014). For example, by combining margin-based sampling with clustering the
learner can select the most uncertain observations within different areas of the input space.
By considering these different strategies, one can select the most appropriate approach for a given problem
based on the characteristics of the data and the specific requirements of the application.
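
To make two of the criteria above concrete, the following minimal sketch computes a margin-based uncertainty score and a committee disagreement score (vote entropy). It is an illustration under our own assumptions rather than the formulation of any cited paper: the function names, the toy probability vectors, and the four-member committee are all hypothetical.

```python
import math
from collections import Counter


def margin_score(probs):
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def vote_entropy(votes):
    """Entropy of the committee's label votes (large value = strong disagreement)."""
    counts = Counter(votes)
    total = len(votes)
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Hypothetical predicted class probabilities for two unlabeled points.
confident = [0.90, 0.07, 0.03]
ambiguous = [0.40, 0.38, 0.22]
print(margin_score(confident), margin_score(ambiguous))

# Hypothetical committee of four models voting on a single point.
unanimous = ["a", "a", "a", "a"]
split = ["a", "b", "a", "b"]
print(vote_entropy(unanimous), vote_entropy(split))
```

A margin-based learner would query the point with the smaller margin, whereas a committee-based learner would query the point with the larger vote entropy.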

2.2 Active learning scenarios

Active learning can be broadly categorized into three macro scenarios, based on how the unlabeled instances are
supplied to the learner and then selected to be labeled by an oracle. Regardless of the particular query strategy
being employed, these macro scenarios provide a framework for understanding the flow of information and the
decision-making steps involved in active learning. These scenarios serve as a high-level categorization of different
methods for approaching the active learning problem, each with its own set of advantages and disadvantages
depending on the specific use case. Understanding these macro scenarios is crucial for selecting the appropriate
active learning technique for a particular problem and for comparing different active learning algorithms. In the
next subsections, each of the three macro scenarios will be discussed.

2.2.1 Membership query synthesis active learning

This scenario represents the case when the learner is given complete freedom to ask for the label of any data point
belonging to the input space or for a synthetically generated one. Some examples of membership query synthesis
active learning include image classification, where the learner can generate modified versions of existing images
to be labeled, or object detection, where the learner can generate new instances by combining and transforming
existing instances. In natural language processing (NLP) tasks such as text classification or sentiment analysis,
the learner might generate synthetic examples in the form of sentences or paragraphs that cover a wider range
of variations in the language. Also, in speech recognition, the learner might generate synthetic speech samples
in different accents, pronunciations, or speaking styles in order to improve the recognition accuracy. However,
as highlighted by Baum and Lang (1992) and Settles (2009), the main drawback of this strategy is that it
could generate unlabeled examples for which no labels can be associated by a human annotator (e.g., a mixture
between a number and a letter). A general flowchart for this scenario is reported in Figure 1, where the scheme
is repeated until a budget constraint on the requested labels is met, or a stopping criterion on the achieved
performance is satisfied.

[Figure 1: flowchart with the boxes Model; Synthetically generate instance(s); Ask for the label(s); Labeled data; Train/update model.]

Fig. 1 Membership query synthesis active learning.

In the context of deep active learning (Ren et al, 2022), the membership query synthesis scenario can be
addressed by using generative models. For instance, generative adversarial networks (GANs) have been used
to generate additional instances from the input space that may provide more informative labels for the learner
(Goodfellow et al, 2014). This can be done by using GANs for data augmentation, as GANs are capable of
generating diverse and high-quality instances (Zhu and Bento, 2017). Another approach is to combine the use of
variational autoencoders (VAEs) (Kingma and Welling, 2013) and Bayesian data augmentation, as demonstrated
by Tran et al (2017, 2019). The authors used VAEs to generate instances from the disagreement
regions between multiple models, and Bayesian data augmentation to incorporate the uncertainty of the generated
instances in the learning process.

2.2.2 Pool-based active learning

Pool-based active learning is one of the most widely studied scenarios in the machine learning literature. The
goal is to select the most informative subset of observations from a closed, static set of unlabeled data points.
The majority of the proposed pool-based active learning approaches have been developed for classification tasks
(Cai et al, 2013), with image classification being a common application in computer vision (Li and Guo, 2013),
as manually labeling large image datasets can be a challenging task.

[Figure 2: flowchart with the boxes Unlabeled data; Rank observations; Select top k instance(s); Ask for the label(s); Labeled data; Train/update model; Model.]

Fig. 2 Pool-based active learning.

The flowchart in Figure 2 provides an overview of pool-based active learning sampling schemes, where k
represents the number of unlabeled instances whose label is queried at each round. Traditional machine learning
models that do not require substantial computational resources to train are typically associated with a choice of
k equal to one (Vahdat et al, 2019). This allows a timely update of the instance selection criteria, avoiding the
redundant labeling of similar data points. However, larger values of k have also been used in practice, such as
the analysis performed by Ge (2014) for values ranging from 5 to 30 or the approach used by Cai et al (2013)
to add 3% of the total number of observations to the training set each time. Using a higher k value may be
more practical when working with large models, as repeated training can be computationally expensive and
challenging. To this extent, batch mode active learning is generally considered to be a more efficient and effective
option for image classification or detection tasks compared to the one-by-one query strategy, as the latter can
be resource-intensive and time-consuming when working with large neural networks (Ren et al, 2022). This is
because re-training the model with just one new data point with high input dimensionality may not result in
significant improvement (Ren et al, 2022). In general, the choice of k may be problem- or model-specific, as it
represents a trade-off between computational efficiency and the risk of querying redundant labels.
To enhance pool-based active learning, many approaches combine uncertainty-based instance selection criteria
with acquisition functions such as entropy (Shannon, 1948; Wu et al, 2022), mutual information (Haussmann
et al, 2020), or variation ratio (Schmidt et al, 2020). Entropy is commonly used as an acquisition function in
active learning because it provides a way to measure the uncertainty of the model predictions for a given data
point. The entropy of a probability distribution is a measure of the amount of disorder or randomness in the
distribution. In the context of active learning, the entropy of a model’s predicted class probabilities for a data
point can be used as a measure of the model’s uncertainty about the correct class label for that data point.
Acquiring examples with the highest uncertainty is one way to select data points for annotation, but it is not
the only way. Mutual information and variation ratio can also be used on the predictions obtained with the
current model, in order to seek a diverse set of data points for which the predictions are the most uncertain. For
a more comprehensive discussion on pool-based active learning, readers are referred to the surveys (Aggarwal
et al, 2014; Settles, 2009; Fu et al, 2013; Kumar and Gupta, 2020).
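
As a small illustration of the acquisition functions discussed above, the sketch below ranks a toy pool by predictive entropy and also computes the variation ratio. The pool contents and function names are hypothetical, natural logarithms are an arbitrary choice, and mutual information is left out because it needs several stochastic predictions per point.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0.0)


def variation_ratio(probs):
    """One minus the probability assigned to the most likely class."""
    return 1.0 - max(probs)


# Hypothetical pool: point name -> predicted class probabilities.
pool = {
    "x1": [0.98, 0.01, 0.01],
    "x2": [0.34, 0.33, 0.33],
    "x3": [0.70, 0.20, 0.10],
}

# Rank the pool from most to least uncertain according to entropy.
ranked = sorted(pool, key=lambda name: predictive_entropy(pool[name]), reverse=True)
print(ranked)
```

With k equal to one, the learner would ask for the label of the first element of `ranked`; a batch-mode learner would take the first k elements instead.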

2.2.3 Online active learning

In this type of active learning, we cannot greedily select the most informative observations from a static pool,
as the instances are generated in a continuous stream and cannot be stored in their entirety before a decision is
made. This is similar to the famous statistical puzzle known as the secretary problem (Freeman, 1983), where a
hiring manager must make a hiring decision for each applicant as they are interviewed, without the benefit of
seeing all applicants first. In general, online active learning is a crucial scenario for various real-world applications
where the ability to make a sampling decision in real-time is of utmost importance. A few examples are:
• Chemical or manufacturing processes. In these applications, a learner is tasked with predicting the quality of
the final product but may only have a short timeframe to make the sampling decision, to avoid traceability
issues, particularly in high-volume production (Schmitt et al, 2013; Lieber et al, 2012). Also, tasks like predic-
tive maintenance and visual inspection might benefit from a real-time selection of new examples to be labeled
and included in the training set (Rožanec et al, 2022).
• Video streaming and clinical trials. In these cases, a decision must be made on the fly, as users arrive or
volunteers appear sequentially, and there may not be enough time to accumulate a pool of potential users or
patients (Fowler et al, 2023; Riquelme, 2017).
• Text classification. In NLP, online active learning can be used for tasks such as sentiment analysis and spam
detection, where the learner continuously learns from new incoming data points which need to be labeled to
update the model in real-time and improve accuracy (Kranjc et al, 2015).
• Fraud detection. To effectively detect fraudulent activities, the learner must continuously select new examples
to label so that it can continuously update its decision-making process (Carcillo et al, 2018, 2017).
• Online customer service. Online customer service agents can use online active learning to improve their per-
formance by continuously learning from customer interactions. To do this, the learner must continuously select
new examples to label or customer information to obtain, so that it can predict the best response based on
past interactions and improve its accuracy over time (Zheng and Padmanabhan, 2006).
• Marketing. Online active learning could also be applied in the field of marketing to select informative examples
in real-time and continuously optimize customer targeting and personalization (Carnein and Trautmann, 2019;
Jamil and Khan, 2016).

[Figure 3: flowchart with the boxes Data stream; Observe an unlabeled data point; Is it useful? (No: Discard the observation; Yes: Ask for the label); Labeled data; Train/update model; Model.]

Fig. 3 Single-pass online active learning.

One of the defining features of online active learning strategies is their data processing capabilities. Figure 3
and Figure 4 provide a visual representation of the two main approaches: single-pass and window-based.
Single-pass algorithms observe and evaluate each incoming data point on the fly, whereas window-based algorithms,
also referred to as batch-based methods, observe a fixed-size chunk of data at a time. In the window-based
approach, the learner evaluates the entire batch of data and selects the top k observations as the most informative
ones to be labeled. This approach is referred to as best-out-of-window sampling. The specific value of k and the
size of the buffer can vary based on the storage capabilities of the system and the computational time required to
update the model. Window-based methods are useful in situations where data is generated in large quantities
and the algorithm does not have a tight constraint on the time available for decision-making. In contrast,
single-pass methods are necessary when the algorithm needs to make a decision immediately after observing a
specific data point.
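
The difference between the two processing modes can be sketched as follows. This is a simplified illustration, not a method from the reviewed literature: the fixed threshold, the `uncertainty` function (which treats each stream element as a hypothetical predicted probability of the positive class), and the toy stream are all assumptions.

```python
import heapq


def single_pass(stream, score, threshold):
    """Decide on each point immediately: query it if its score reaches the threshold."""
    for x in stream:
        if score(x) >= threshold:
            yield x


def best_out_of_window(stream, score, window_size, k):
    """Buffer `window_size` points, then query the k highest-scoring ones."""
    buffer = []
    for x in stream:
        buffer.append(x)
        if len(buffer) == window_size:
            yield from heapq.nlargest(k, buffer, key=score)
            buffer = []  # start filling the next window


def uncertainty(p):
    """Toy score: 1 when the predicted probability is 0.5, 0 when it is 0 or 1."""
    return 1.0 - 2.0 * abs(p - 0.5)


# Hypothetical predicted probabilities for eight incoming points.
stream = [0.52, 0.10, 0.49, 0.95, 0.60, 0.45, 0.05, 0.50]
print(list(single_pass(stream, uncertainty, threshold=0.85)))
print(list(best_out_of_window(stream, uncertainty, window_size=4, k=1)))
```

The single-pass variant can spend its budget quickly when many points exceed the threshold, while the window-based variant always issues exactly k queries per full window at the cost of delaying the decision.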

[Figure 4: flowchart with the boxes Data stream; Observe an unlabeled data point; Buffer full? (Yes: Select top k instance(s)); Ask for the label(s); Labeled data; Train/update model; Model.]

Fig. 4 Window- or batch-based online active learning.

Another critical property in the design of an effective online active learning strategy is the assumption
made about the data stream distribution. One important difference to consider is whether the data stream is
stationary or drifting. A stationary data stream is characterized by a stable data generating process where the
statistical properties of the data distribution remain constant over time. Conversely, a drifting data stream is
marked by changing statistical properties of the data distribution over time, potentially due to alterations in the
underlying data generating process. The distinction between stationary and drifting data streams is significant
because it affects the performance of the active learning strategies. Online active learning strategies that have
been developed for stationary data streams may lead to suboptimal performance when applied to drifting data
streams. This is because concept drift can alter the scale of the informativeness measure of unlabeled data points
or even require a complete change of the model, with the acquisition of more observations to accommodate the
new concept. Therefore, it is important to accurately assess the nature of the data stream distribution in the
design of an active learning strategy. A failure to do so can result in a suboptimal performance and a reduced
ability to effectively leverage the strengths of active learning. Another important property to consider when
designing an active learning strategy is the label delay or verification latency. This refers to the time needed by
the oracle to provide the label when it is requested by the learner. In some cases, there may be a delay L in the
oracle providing the label after it has been requested. This property must be taken into account when designing
a sampling strategy as there may be redundant label requests for similar instances if this issue is not properly
addressed. Label delay can be classified into null latency, intermediate latency, or extreme latency (Souza et al,
2018). The case with null latency, or immediate availability of the label upon request, is commonly used in the
stream mining community, but may not be realistic for many practical applications. Extreme latency, where
labels are never made available to the learner, is closer to an unsupervised learning task. Intermediate latency
assumes a delay 0 < L < ∞ in the availability of the labels from the oracle.
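
The effect of intermediate latency can be illustrated with a small simulation in which a requested label only becomes usable L steps after the request. This is a hedged sketch under our own assumptions: the queue-based bookkeeping, the `novelty_query` rule (which skips points close to already labeled or still-pending ones in order to limit redundant requests), and the toy oracle are hypothetical and not taken from the cited works.

```python
from collections import deque


def run_with_label_delay(stream, should_query, oracle, delay):
    """Simulate a stream in which a requested label arrives `delay` steps later."""
    pending = deque()  # (time at which the label becomes available, x)
    labeled = []       # (x, y) pairs the learner can actually train on
    for t, x in enumerate(stream):
        # Release every label whose delay has elapsed.
        while pending and pending[0][0] <= t:
            _, x_ready = pending.popleft()
            labeled.append((x_ready, oracle(x_ready)))
        if should_query(x, labeled, [p[1] for p in pending]):
            pending.append((t + delay, x))
    return labeled, [p[1] for p in pending]


def novelty_query(x, labeled, pending, radius=0.1):
    """Query only if x is far from every labeled point and every pending request."""
    known = [x_l for x_l, _ in labeled] + pending
    return all(abs(x - z) > radius for z in known)


stream = [0.10, 0.12, 0.50, 0.11, 0.90, 0.52]
labeled, pending = run_with_label_delay(
    stream, novelty_query, oracle=lambda x: int(x > 0.5), delay=3
)
print(labeled, pending)
```

Because the rule also looks at pending requests, the points 0.12 and 0.11 are not queried even though the label of 0.10 has not arrived yet when 0.12 is observed.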
Finally, the training efficiency of the online active learning algorithms should also be taken into consideration.
There are two main training approaches in active learning: incremental training and complete re-training.
Incremental training involves updating model parameters with a small batch of new data, without starting the
training process from scratch (Polikar et al, 2001; Wu et al, 2019; Shilton et al, 2005; Istrate et al, 2018). This
approach allows the model to learn from new data while preserving its existing knowledge. This can be achieved
through fine-tuning the model parameters with the new data, or by using techniques such as elastic weight
consolidation, which prevent previous knowledge from being erased. Complete re-training, on the other hand,
involves training a new model from scratch using the entire labeled data collected so far. This approach discards
the previous knowledge of the model and starts anew, which may result in the loss of knowledge learned from
previous data. Complete re-training is typically used when the amount of new data is substantial, the previous
model is no longer relevant, or when the model architecture needs to be altered. It is important to note that
the choice of training approach in online active learning algorithms can have a significant impact on the overall
performance and effectiveness of the model.
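
The contrast between the two training approaches can be shown on a deliberately tiny model. The sketch below assumes a one-parameter linear model y ≈ w·x, uses a single stochastic gradient step as the incremental update, and uses closed-form least squares as the complete re-training; the learning rate and the toy data are arbitrary choices made for illustration.

```python
def sgd_update(w, x, y, lr=0.05):
    """Incremental training: one gradient step on the squared error of a new point."""
    return w - lr * 2.0 * (w * x - y) * x


def refit(data):
    """Complete re-training: closed-form least squares on all labeled data so far."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return sxy / sxx


# Hypothetical labeled observations acquired one at a time.
labeled = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

w_incremental = 0.0
for x, y in labeled:
    w_incremental = sgd_update(w_incremental, x, y)  # touches only the new point

w_full = refit(labeled)  # revisits every labeled point
print(w_incremental, w_full)
```

The incremental update has a constant cost per new label but depends on the learning rate and the order of arrival, whereas the full refit grows in cost with the size of the labeled set.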

2.3 Connection between active learning and semi-supervised learning
Semi-supervised learning is a field of research that is closely related to active learning, as both methods are
developed to deal with limited labeled data. While active learning aims to minimize the amount of labeled data
required to train a model, semi-supervised learning is a technique that trains a model using a combination of
labeled and unlabeled data. Active learning can be considered a special case of semi-supervised learning, as it
allows the model to actively select which data points it wants to be labeled, rather than relying on a fixed set of
labeled data. In the context of online learning, Kulkarni et al (2016) conducted a study that provided an overview
of semi-supervised learning techniques for classifying data streams. These techniques do not address the primary
question of active learning, which is when to query, but they are useful in exploiting the information contained
in the unlabeled data points and in addressing issues related to model update and retraining in limited labeled
data environments. It is also worth noting that semi-supervised learning can be used in combination with active
learning to improve the data selection strategy. By leveraging the strengths of both methods, it is possible to
achieve better performance and more efficient learning compared to using either method alone.
Semi-supervised learning approaches can be distinguished into three categories: unsupervised preprocessing,
wrapper methods, and graph-based methods. Unsupervised preprocessing refers to the use of unsupervised
learning techniques, such as dimensionality reduction (Cacciarelli and Kulahci, 2023), clustering, or feature
extraction, to preprocess the entire dataset, labeled and unlabeled, before it is fed to the supervised model
(Frumosu and Kulahci, 2018). The goal is to transform the data into a more useful representation that can be
learned more easily by a supervised model and can support the sampling of more informative data points. This
strategy can also help reduce the dimensionality of the learning problem, thus improving the model parameter
estimation when only a few queries can be made. Related to the online active learning problem, Rožanec et al
(2022) used a pre-trained network to extract salient features from unlabeled images before starting the sampling
routine. Similarly, Cacciarelli et al (2022a) used an autoencoder trained on all the available unlabeled data points
to improve the performance of online active learning for linear regression models.
Wrapper methods, on the other hand, use one or more supervised learners that are trained on labeled data
and pseudo-labeled unlabeled data. There are two main variants of wrapper methods: self-training and
co-training. Self-training uses a single supervised model that is trained on labeled data, and pseudo-labels are used
for the data points with confident predictions. Co-training, on the other hand, extends self-training to multiple
supervised models, where two or more models exchange the most confident predictions to obtain pseudo-labels.
Pseudo-labels can be very beneficial in label-scarce environments, but one must be mindful of the confirmation
bias issue, where the model might rely on incorrect self-created labels. This problem has been extensively analyzed
by Baykal et al (2022) in the active distillation scenario, which is a strategy where a smaller model, known as
the student model, is trained to mimic the behavior of a larger pre-trained model, known as the teacher model
(Hoang et al, 2021; Kwak et al, 2022). In this context, confirmation bias refers to the student model's tendency
to reproduce the predictions of the teacher model, even when the teacher predictions are incorrect. This can
happen when the student model is trained to mimic the teacher model output too closely, without considering the
underlying errors. To mitigate this, active distillation techniques use sample selection methods that encourage
the student model to learn from data points where the teacher model makes errors, rather than just reproducing
the teacher model predictions. In the more general active learning framework, confirmation bias might also refer
to the tendency of an active learning algorithm to select examples that confirm its current hypothesis, rather
than selecting examples that would challenge or improve it.
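
A single round of self-training can be sketched as follows. The nearest-centroid model, the distance-ratio confidence, and the 0.9 threshold are simplifying assumptions chosen to keep the example short; the sketch expects one-dimensional inputs and at least two classes in the labeled set.

```python
def fit_centroids(data):
    """Mean input value per class label (a tiny nearest-centroid model)."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return the nearest class and a confidence in [0.5, 1] from the distance ratio."""
    (d0, y0), (d1, _) = sorted((abs(x - c), y) for y, c in centroids.items())[:2]
    total = d0 + d1
    return y0, (d1 / total if total > 0 else 1.0)


def self_train(labeled, unlabeled, min_confidence=0.9):
    """Pseudo-label confident points, then refit on labeled plus pseudo-labeled data."""
    centroids = fit_centroids(labeled)
    pseudo = []
    for x in unlabeled:
        y_hat, confidence = predict_with_confidence(centroids, x)
        if confidence >= min_confidence:
            pseudo.append((x, y_hat))
    return fit_centroids(labeled + pseudo), pseudo


labeled = [(0.0, "neg"), (1.0, "pos")]
unlabeled = [0.05, 0.45, 0.98]
model, pseudo = self_train(labeled, unlabeled)
print(model, pseudo)
```

The point 0.45 is left unlabeled because the model is not confident about it. Lowering `min_confidence` would add more pseudo-labels but also increases the risk of the confirmation bias described above.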
Finally, graph-based methods construct a graph on all available data and fit a supervised model, where
the loss comprises a supervised loss and a regularization term that penalizes the difference between the labels
predicted for connected data points. In the online active learning scenario, the graph structure can be used to
model the similarity between data points, and the active learning algorithm can select the examples to label
based on their position on the graph, such as selecting examples that are in low-density regions or are distant
from other labeled examples.
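The regularized loss used by graph-based methods can be sketched as follows. The squared difference between connected predictions, the edge weights, and the `strength` parameter are assumptions made for illustration, since the text above does not fix a specific penalty.

```python
def graph_regularizer(predictions, edges):
    """Weighted squared difference between predictions on connected points."""
    return sum(weight * (predictions[i] - predictions[j]) ** 2
               for i, j, weight in edges)


def graph_based_loss(supervised_loss, predictions, edges, strength=1.0):
    """Supervised loss plus the graph smoothness penalty."""
    return supervised_loss + strength * graph_regularizer(predictions, edges)


# Three data points; edges are (node_i, node_j, similarity_weight).
predictions = [1.0, 0.9, 0.0]
edges = [(0, 1, 1.0), (1, 2, 0.5)]
penalty = graph_regularizer(predictions, edges)
total = graph_based_loss(0.2, predictions, edges)
```

The penalty is dominated by the edge between points 1 and 2, whose predictions differ the most.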

3 Online active learning approaches

In this review, we organize online active learning strategies into a taxonomy of four categories:
1. Stationary data stream classification approaches. These methods are designed to tackle online classification
tasks, where the model is updated on the fly using newly labeled examples selected from a stream of data
that does not change significantly over time. These methods are particularly useful in scenarios where the
data distribution is relatively stable, such as quality control in industrial processes, where stationarity is often
ensured by control actions taken at regular intervals and continuous maintenance of the components of the
system (Bisgaard and Kulahci, 2011). Another example is represented by human activity recognition using
wearable devices, where data is collected over time from wearable devices such as fitness trackers to identify
patterns of activity like walking, running, or sleeping. This scenario would fall into this category because the
data stream is relatively stable, and the model can be updated in real-time as new labeled examples become
available (Miu et al, 2015).
2. Drifting data stream classification approaches. These online active learning strategies are designed to handle
classification tasks in dynamic environments where the data distribution changes over time, and to adapt to
those changes in order to maintain high classification accuracy. Typical real-world applications are fraud
detection and intrusion detection. In financial fraud
detection, fraudsters often change their methods to evade detection, so a classification model used for fraud
detection must be able to adapt to new patterns of fraud as they emerge or to new customer habits (Zhang
et al, 2022). In real-time intrusion detection, the systems monitoring computer networks must be able to detect
new forms of cyberattacks as they appear, so the classification models used must be able to adapt to changes
in the data distribution over time (Nixon et al, 2021). This scenario would fall into this category because the
data stream is constantly changing, and the model must be able to adapt to changes in the data distribution
over time to maintain high accuracy.
3. Evolving fuzzy system approaches. These approaches are based on a type of fuzzy system that can adapt and
change over time, in response to new data or changes in the environment (Gu et al, 2023). In traditional fuzzy
systems, the rules and membership functions that define the system are fixed and do not change over time.
Evolving fuzzy systems, on the other hand, are able to adapt their rules and membership functions based
on new data or changes in the environment. This is particularly useful in applications where the data or the
environment is non-stationary and evolves over time, such as in control systems for autonomous vehicles,
where the controller must be able to adapt to changes in the environment, such as traffic patterns, road conditions, and
weather (Naranjo et al, 2007; Wang et al, 2015).
4. Experimental design and bandit approaches. These methods, mostly related to regression models, actively
select the most informative data points to improve model predictions. This category includes online active
linear regression and sequential decision-making strategies like bandit algorithms or reinforcement learning.
These methods adaptively select the most promising options in a given situation. An example is given by
online advertising, where a model is used to select the most promising advertisements to display to users
based on their browsing history and other factors (Avadhanula et al, 2021). This scenario would fall into
this category because the model must adaptively select the most promising options in real-time based on
the information available at that time. Also, in clinical trials, a model is used to select the most promising
patients to enroll in a clinical trial based on their medical history and other personal information. Finally, in
drug development studies (Réda et al, 2020), online active learning can be used to select the most promising
compounds for further testing and development, based on their potential efficacy and safety.
This categorization provides an overview of the different types of online active learning strategies
and of how they can be applied in various scenarios. The simplest active learning strategy, random sampling,
selects data points at random from the stream for annotation. We will primarily focus
on more specialized strategies designed to address scenarios where informed decisions are crucial due to resource
constraints or where the data distribution is non-stationary.
Figure 5 depicts a general framework illustrating the essential components shared by the various categories
of online active learning algorithms. The accompanying callouts highlight key options utilized by these methods.
The following sections will provide an in-depth analysis of these strategies. For a more detailed flowchart regarding
the drift detection and adaptation process, please refer to Lu et al (2018) and Lima et al (2022).

3.1 Stationary data stream classification approaches

In online active learning, a commonly employed strategy is to request labels for data points that are considered
to be informative enough based on a pre-determined threshold. This threshold can be established through a
variety of techniques, depending on the instance selection criterion used to evaluate the informativeness of the
unlabeled observations. Another method, sometimes referred to as b-sampling, is to calculate the probability
that a data point will be queried by adjusting the parameter of a Bernoulli random variable, as proposed by
Cesa-Bianchi et al. in one of the pioneering studies on online active learning (Cesa-Bianchi et al, 2004, 2006).
They used a linear predictor characterized by the weight vector $w \in \mathbb{R}^d$ and, at each time step $t$, after observing
the current data point $x_t$, the binary output $y_t \in \{-1, +1\}$ is predicted using

$$\hat{y}_t = \mathrm{SGN}\left(w_{t-1}^\top x_t\right) \tag{1}$$

Fig. 5 Online active learning: general framework. The flowchart components are the data stream, the model, instance evaluation, instance selection, the sampling strategy, the model update, drift detection (e.g., DDM, ADWIN), and drift adaptation (e.g., adjusting the sampling rate, replacing the model, changing the ensemble weights).

where $w_{t-1}$ is the weight vector estimated with the previously seen labeled examples $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$.
The value $w_{t-1}^\top x_t$ is the margin, $\hat{p}_t$, of $w_{t-1}$ on the instance $x_t$. If the learner queries the label $y_t$, a new weight
vector is estimated using the newly added labeled example $(x_t, y_t)$ with the regular perceptron update rule
(Rosenblatt, 1958) as in

$$w_t = w_{t-1} + M_t y_t x_t \tag{2}$$

where $M_t$ represents the indicator function of the event $\hat{y}_t \neq y_t$. If the label is not requested, the model remains
unchanged, and we have $w_t = w_{t-1}$. At each time step $t$, the learner decides whether to query the label of a
data point $x_t$ by drawing a Bernoulli random variable $Z_t \in \{0, 1\}$, whose parameter is given by

$$P_t = \frac{b}{b + |\hat{p}_t|} \tag{3}$$

where $b > 0$ is a positive smoothing constant that can be tuned to adjust the labeling rate. In general, as $\hat{p}_t$
approaches 0, the sampling probability $P_t$ converges to 1, suggesting that the labels are requested for highly
uncertain observations. The sampling scheme introduced by Cesa-Bianchi et al (2004) is referred to as selective
sampling perceptron, and it is reported in Algorithm 1.
A similar approach to the one proposed by Cesa-Bianchi et al (2004) was investigated by Dasgupta et al
(2005), who presented one of the first thresholding techniques for online active learning. They suggested setting a
threshold on the margin, with the idea of sampling data points $x_t$ with a value of $|\hat{p}_t|$ lower than a given threshold
Γ. The threshold is initially set at a high value and iteratively divided by two until enough misclassifications
occur among the queried points. The linear classifier is updated using the reflection concept [60] to give more
focus to recent data points. Sculley (2007) built on the works of Cesa-Bianchi and Dasgupta to analyze the
online active learning scenarios for real-time spam filtering. The author compares two models, a perceptron
and a support vector machine (SVM), and tries three different instance selection criteria: the fixed thresholding
approach by Dasgupta et al (2005), the Bernoulli-based approach by Cesa-Bianchi et al (2004), and a newly
developed logistic margin sampling. The perceptron is updated as per Dasgupta et al (2005), while the SVM is
retrained on all available labeled observations each time a new data point is added. According to the logistic
margin sampling strategy, the sampling decision is taken by drawing a Bernoulli random variable $Z_t \in \{0, 1\}$
with a parameter given by

$$P_t = e^{-\gamma |\hat{p}_t|} \tag{4}$$

As in the traditional b-sampling approach introduced by Cesa-Bianchi et al (2004), this sampling strategy
depends on the uncertainty, measured as the distance from the decision hyperplane. The main difference between
the two strategies is the shape of the resulting sampling distribution, which can be observed in Figure 6.
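To make the difference between the sampling rules in Equations 3 and 4 concrete, the following minimal sketch evaluates both query probabilities on a few margin values. The function names and the settings b = γ = 1 are illustrative assumptions.

```python
import math


def b_sampling_probability(margin, b):
    """Query probability of b-sampling (Equation 3)."""
    return b / (b + abs(margin))


def logistic_sampling_probability(margin, gamma):
    """Query probability of logistic margin sampling (Equation 4)."""
    return math.exp(-gamma * abs(margin))


margins = [0.0, 0.5, 1.0, 2.0]
b_probs = [b_sampling_probability(m, b=1.0) for m in margins]
logistic_probs = [logistic_sampling_probability(m, gamma=1.0) for m in margins]
```

With these settings both rules query with probability 1 at a zero margin, while the logistic rule decays faster as the margin grows.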

Algorithm 1 Selective sampling perceptron
Require: a data stream $S$, an initial model $w_0 = (0, \ldots, 0)^\top$, a time horizon $T$, a sampling budget $B$, a
parameter $b$.
t ← 1 ▷ Timestamp
c ← 0 ▷ Labeling cost
while c ≤ B and t ≤ T do
    Observe an incoming data point $x_t \in S$ and set $\hat{p}_t = w_{t-1}^\top x_t$
    Predict the label $\hat{y}_t = \mathrm{SGN}(\hat{p}_t)$
    Draw a Bernoulli random variable $Z_t$ of parameter $P_t = b / (b + |\hat{p}_t|)$
    if $Z_t = 1$ then ▷ Sampling decision
        Ask for the true label $y_t$ and update the model
        c ← c + 1 ▷ Pay for the label
    else
        Discard $x_t$
    end if
    t ← t + 1
end while
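Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference implementation: the stream is a finite list of (x, y) pairs whose label is only read when it is queried, SGN(0) is taken to be +1, and the loop stops as soon as `budget` labels have been bought. The function and variable names are hypothetical.

```python
import random


def sign(value):
    # Convention assumed here: SGN(0) = +1.
    return 1 if value >= 0 else -1


def selective_sampling_perceptron(stream, b, budget, seed=0):
    """Return the final weight vector and the number of queried labels."""
    rng = random.Random(seed)
    w = [0.0] * len(stream[0][0])
    queried = 0
    for x, y in stream:
        if queried >= budget:
            break
        margin = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = sign(margin)
        # Bernoulli draw with parameter b / (b + |margin|), as in Equation 3.
        if rng.random() < b / (b + abs(margin)):
            queried += 1
            if y_hat != y:  # perceptron update only on mistakes (Equation 2)
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, queried


# Toy linearly separable stream: the label is the sign of the first feature.
toy_stream = [([1.0, 0.5], 1), ([-1.0, 0.2], -1),
              ([0.8, -0.3], 1), ([-0.7, -0.9], -1)] * 5
weights, n_queries = selective_sampling_perceptron(toy_stream, b=1.0, budget=10)
```

In this toy stream the first two points have zero margin and are therefore always queried; the single mistake on the second point already yields a separating weight vector.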

Fig. 6 Shape of the sampling distributions for b-sampling (a) and logistic sampling (b), for different values of b and γ.

The selective sampling perceptron approach has also been investigated by Lu et al (2016), who proposed an
online passive-aggressive active learning variant of the algorithm. Similarly to the b-sampling approach, at each
time step t, a Bernoulli random variable Zt ∈ {0, 1} is drawn to decide whether to query the label of the current
data point $x_t$ or not. In this case, the parameter of $Z_t$ is given by

$$P_t = \frac{\delta}{\delta + |\hat{p}_t|} \tag{5}$$

where $\delta \geq 1$ is a smoothing parameter. Besides not allowing the smoothing parameter to assume a value lower
than 1, the sampling distribution is the same as the one governed by the parameter in Equation 3. The main
difference lies in the passive-aggressive approach used for updating the weight vector. Indeed, while the traditional
perceptron update, shown in Equation 2, only uses misclassified examples to update the model, the passive-
aggressive approach updates the weight vector w ∈ Rd whenever the current loss ℓt (wt−1 ; (xt , yt )) is nonzero
(Crammer et al, 2006). The new parameter wt is found using

wt = wt−1 + τt yt xt (6)
where $\tau_t$ represents the step size, and can be computed according to three different policies

$$\tau_t = \begin{cases} \dfrac{\ell_t(w_{t-1}; (x_t, y_t))}{\|x_t\|^2} \\[2ex] \min\left\{\kappa, \dfrac{\ell_t(w_{t-1}; (x_t, y_t))}{\|x_t\|^2}\right\} \\[2ex] \dfrac{\ell_t(w_{t-1}; (x_t, y_t))}{\|x_t\|^2 + \frac{1}{2\kappa}} \end{cases} \tag{7}$$

where κ is a penalty cost parameter. Passive-aggressive algorithms are known for their aggressive approach in
updating the model, which is motivated by the fact that traditional perceptron updates might waste data points
that have been correctly classified but with low prediction confidence.
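The three step-size policies in Equation 7 can be sketched as follows. The hinge loss max(0, 1 − y w·x) is assumed here as the loss, and the variant labels "PA", "PA-I", and "PA-II" as well as the function names are chosen for illustration; the input x is assumed to be nonzero.

```python
def hinge_loss(w, x, y):
    """Hinge loss of the linear model w on the labeled example (x, y)."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - margin)


def pa_step_size(loss, x, kappa=1.0, variant="PA"):
    """Step size tau_t for the three policies of Equation 7."""
    sq_norm = sum(xi * xi for xi in x)
    if variant == "PA":
        return loss / sq_norm
    if variant == "PA-I":
        return min(kappa, loss / sq_norm)
    if variant == "PA-II":
        return loss / (sq_norm + 1.0 / (2.0 * kappa))
    raise ValueError(f"unknown variant: {variant}")


def pa_update(w, x, y, kappa=1.0, variant="PA"):
    """Passive-aggressive update of Equation 6."""
    tau = pa_step_size(hinge_loss(w, x, y), x, kappa, variant)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


w0, x, y = [0.0, 0.0], [1.0, 1.0], 1
tau_pa = pa_step_size(hinge_loss(w0, x, y), x, variant="PA")
tau_pa1 = pa_step_size(hinge_loss(w0, x, y), x, kappa=0.25, variant="PA-I")
tau_pa2 = pa_step_size(hinge_loss(w0, x, y), x, kappa=1.0, variant="PA-II")
w1 = pa_update(w0, x, y)
```

After the unconstrained update the loss on the same example drops to zero, whereas the other two policies take a smaller step controlled by κ.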
A related issue concerning the update of the weight vector $w_t$ was emphasized by Bordes et al (2005), who noted that
always picking the most misclassified example is a reasonable sampling strategy only when the labels of the training
examples can be trusted. When dealing with noisy labels, this strategy could lead to the selection of mislabeled
examples or examples lying on the wrong side of the optimal decision boundary. To address this, they suggested
a more conservative approach that selects examples for updating $w_t$ based on a minimax gradient strategy.
In addition to confidence in the labels of the training examples, confidence in the model itself must be
considered when the sampling strategy is based solely on model predictions. Hao et al (2018b) pointed out that
a margin-based sampling strategy may be suboptimal when the classifier is not precise, especially in the early
rounds of active learning when the model performance may be poor due to limited training feedback, leading to
misleading sampling decisions. This issue is also referred to as cold-start active learning (Houlsby et al, 2014;
Yuan et al, 2020; Jin et al, 2022). To address this, Hao et al (2018b) propose considering second-order information
in addition to margin value when deciding whether or not to query the label of a data point xt . In general,
first-order online active learning strategies only consider the margin value, while second-order methods also take
into account the confidence associated with it. To do this, they assume that the weight vector of the classifier
w ∈ Rd is distributed as

w ∼ N (µ, Σ) (8)
where the values µi and Σi,i encode the model knowledge and confidence in the weight vector for the ith feature
wi . The covariance between the ith and jth features is captured by the term Σi,j . The smaller the variance
associated with the coefficient wi , the more confident the learner is about its mean value µi . The objective
of the proposed method is to take into account the confidence of the model when updating the model and
making the sampling decision. With regards to the model update, when the true label yt of xt is queried, the
Gaussian distribution in Equation 8 is updated by minimizing an objective function based on the Kullback-
Leibler divergence (Joyce, 2011) to ensure the updated model is not too different from the previous one. The
sampling decision uses an additional parameter to the margin $\hat{p}_t$, which is defined as

$$c_t = \frac{-\eta}{\frac{1}{2}\frac{1}{\nu_t} + \gamma} \tag{9}$$

where η, γ > 0 are two fixed hyper-parameters and νt represents the variance of the margin related to the data
point xt . The intuition is that, when the variance νt is high, the model has not been sufficiently trained on
instances similar to xt , and querying its label would lead to a model improvement. Then, a soft margin-based
approach is employed by computing

$$\rho_t = |\hat{p}_t| + c_t \tag{10}$$

If $\rho_t \leq 0$, the label is always queried as the model is extremely uncertain about the margin. Instead, when
$\rho_t > 0$, the model is more confident, and the labeling decision is taken by drawing a Bernoulli random variable
of parameter

$$P_t = \frac{\delta}{\delta + \rho_t} \tag{11}$$
where δ > 0 is a smoothing parameter. Finally, Hao et al (2018b) also introduced a cost-sensitive variant of the
loss function, for dealing with class-imbalanced applications. For a comprehensive discussion on imbalanced data
stream analysis, please see Aguiar et al (2023).
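The second-order query rule of Equations 10 and 11 can be sketched as below. The correction term c_t is passed in as a precomputed non-positive number rather than derived inside the code, and the margin variance is computed as x⊤Σx, which follows from the Gaussian assumption in Equation 8. All function names are hypothetical.

```python
def margin_variance(x, sigma):
    """Variance of the margin w^T x when w ~ N(mu, Sigma): x^T Sigma x."""
    n = len(x)
    return sum(x[i] * sigma[i][j] * x[j] for i in range(n) for j in range(n))


def second_order_query_probability(margin, c_t, delta):
    """Probability of querying the label given the soft margin rho_t."""
    rho = abs(margin) + c_t
    if rho <= 0:
        return 1.0  # the label is always queried
    return delta / (delta + rho)


nu = margin_variance([1.0, 2.0], [[0.5, 0.0], [0.0, 0.25]])
p_uncertain = second_order_query_probability(margin=0.2, c_t=-0.5, delta=0.5)
p_confident = second_order_query_probability(margin=1.0, c_t=-0.5, delta=0.5)
```

A more negative c_t, which corresponds to a less trusted margin, pushes ρ_t toward zero and therefore raises the probability of requesting the label.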
The cold-start issue related to the application of active learning to imbalanced datasets has also been high-
lighted by Qin et al (2021), who used extreme learning machines (Huang et al, 2006) and extended the active
learning framework initially proposed by Yu et al (2015) to the multiclass classification scenario. They highlighted
the challenge of the lack of instances for certain classes in imbalanced datasets, which can seriously impact the
predictive ability of the model for those classes. To address this issue, they propose a sampling strategy that
considers both diversity and uncertainty. The diversity is calculated by computing pairwise Manhattan distance
between the unlabeled observations. The uncertainty of a data point xt is computed by taking the difference
between the largest two posterior probabilities as in

margin (xt ) = p (y = cb | xt ) − p (y = csb | xt ) (12)

where cb and csb are the classes with the highest posterior probabilities. This approach is also referred to as
best-versus-second-best margin and, as highlighted by Joshi et al (2009), is a good indicator of uncertainty when
a large number of classes are present in the data. It should be noted that the sampling strategy introduced by
Qin et al (2021) is not suited for single-pass active learning as it requires computing similarity and uncertainty
measures for all the unlabeled observations in the current batch.
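The best-versus-second-best margin of Equation 12 is straightforward to compute from a vector of class posteriors. The snippet below is a minimal sketch with a hypothetical function name and toy posteriors.

```python
def bvsb_margin(posteriors):
    """Difference between the two largest class posterior probabilities."""
    best, second_best = sorted(posteriors, reverse=True)[:2]
    return best - second_best


uncertain_margin = bvsb_margin([0.5, 0.3, 0.2])
confident_margin = bvsb_margin([0.05, 0.9, 0.05])
```

A small margin indicates that the two most likely classes are hard to tell apart, so the corresponding data point is a good candidate for labeling.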
Another approach to deal with class imbalance in active learning was proposed by Ferdowsi et al (2013), who
used linear SVMs and a sampling strategy that switches between multiple instance selection criteria online. This
approach, however, is limited to a pool-based setting and requires predicting an unsupervised evaluation score
for all available unlabeled instances. The impact of the last queried observations on the scores associated with
the unlabeled data points is evaluated, and a greedy approach is used to decide which instance selection criterion
to trust. SVMs have also been used by Ghassemi et al (2016), who proposed a differentially private approach
to online active learning. The privacy concerns are tackled both during the instance selection and the training
phase, by randomizing the strategy introduced by Tong and Koller (2002). The informativeness of a data point
xt is measured by its closeness to the current hyperplane wt as in

c(t) = exp (−d (xt , wt )) ∈ [0, 1] (13)

where the distance function $d(x_t, w_t)$ is defined as

$$d(x_t, w_t) \triangleq \frac{|\langle w_t, x_t \rangle|}{\|w_t\|} \tag{14}$$

In the traditional framework, the label $y_t$ is queried if we have $c(t) > \Gamma$, where $\Gamma$ is a pre-defined threshold. It
should be noted that $c(t) > \Gamma$ is equivalent to $d(x_t, w_t) < \log(1/\Gamma)$, which means that the observation $x_t$ is in a
sampling region of width $2 \log(1/\Gamma)$ around $w_t$. However, to avoid a deterministic decision process on the labeling
and ensure privacy, some randomness needs to be introduced. This can be done in two ways. First, the labeling
decision can be modeled as a Bernoulli random variable of parameter p if c(t) < Γ or (1 − p) if c(t) ≥ Γ, where
p < 1/2. Another approach is based on the exponential mechanism introduced by McSherry and Talwar (2007).
According to this strategy, the algorithm sets a constant probability of labeling data points within a sampling
region defined by α, and a decaying probability for points outside of it. The selection strategy is represented by
a Bernoulli of parameter

$$q(t) = \begin{cases} e^{-\alpha\epsilon/\Delta} & d(x_t, w_t) \leq \alpha \\ e^{-d(x_t, w_t)\epsilon/\Delta} & d(x_t, w_t) > \alpha \end{cases} \tag{15}$$

where ϵ > 0 and ∆ = (1 − α/M )M . The authors assumed all data points belonging to the stream to be bounded
in norm by M , ∥xt ∥ ≤ M for t = 1, . . . , T . To tackle the privacy concerns while training, the authors propose two
mini-batch strategies, to avoid the problem of slow convergence that may result from introducing noise according
to the private stochastic gradient descent scheme (Bassily et al, 2014; Song et al, 2013; Duchi et al, 2013).
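The randomized selection rule of Equation 15 can be sketched as follows, using the distance of Equation 14 and Δ = (1 − α/M)M as stated above. The function names and the numerical values are illustrative assumptions, and α < M is assumed so that Δ is positive.

```python
import math


def distance_to_hyperplane(x, w):
    """Distance d(x, w) = |<w, x>| / ||w|| from Equation 14."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot) / math.sqrt(sum(wi * wi for wi in w))


def private_query_probability(x, w, alpha, epsilon, max_norm):
    """Bernoulli parameter q(t) of Equation 15."""
    delta = (1.0 - alpha / max_norm) * max_norm
    d = distance_to_hyperplane(x, w)
    if d <= alpha:
        return math.exp(-alpha * epsilon / delta)  # constant inside the region
    return math.exp(-d * epsilon / delta)  # decays outside the region


w = [1.0, 0.0]
q_near = private_query_probability([0.1, 0.3], w, alpha=0.2, epsilon=1.0, max_norm=1.0)
q_far = private_query_probability([0.6, 0.0], w, alpha=0.2, epsilon=1.0, max_norm=1.0)
```

Points inside the sampling region all receive the same probability, so the labeling decision reveals less about their exact position relative to the hyperplane.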
Two different approaches have been proposed by Ma et al (2016) and Shah and Manwani (2020). Ma et al
(2016) proposed a query-while-learning strategy for decision tree classifiers. They used entropy intervals extracted
from the evidential likelihood to determine the dominant attributes, which are ordered based on the information
gain ratio. When a new data point xt is observed, its label is queried only if there does not exist a dominant
attribute. This will help to identify one and narrow the entropy interval. However, it should be noted that
the authors consider a query-while-learning framework that only partially relates to online active learning.
Shah and Manwani (2020) investigated the online active learning problem for reject option classifiers. Given
the high cost that is sometimes associated with a misclassification error, these models are given the option of
not predicting anything, for example when dealing with a highly ambiguous instance. A typical application of
reject option classifiers is in the medical field, when making a diagnosis with ambiguous symptoms might be
particularly difficult. In this case, it could be more beneficial not to provide a prediction but suggest further
tests instead. They proposed an approach based on a non-convex double ramp loss function ℓdr (Manwani et al,
2013), where the label of the current example xt is queried only if it falls in the linear region of the loss given
by |ft (xt )| ∈ [ρt − 1, ρt + 1], which is the region where the parameter would be updated. Here, ρ refers to the
bandwidth parameter of the reject option classifier that determines the rejection region.
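The query condition for reject option classifiers can be illustrated with a short sketch. The prediction rule used here (predict +1 above ρ, −1 below −ρ, and reject in between) is an assumption about how the rejection region is applied, and the function names are hypothetical.

```python
def reject_option_predict(score, rho):
    """Return +1, -1, or 0 (reject) for a real-valued score f(x)."""
    if score > rho:
        return 1
    if score < -rho:
        return -1
    return 0  # the classifier abstains


def in_query_region(score, rho):
    """True when |f(x)| falls in [rho - 1, rho + 1], the linear region of the loss."""
    return rho - 1.0 <= abs(score) <= rho + 1.0


scores = [0.2, 1.2, 3.0, -1.7]
predictions = [reject_option_predict(s, rho=1.5) for s in scores]
query_flags = [in_query_region(s, rho=1.5) for s in scores]
```

Only the scores close to the boundary of the rejection region trigger a label request; clearly rejected or clearly classified points are skipped.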

Fujii and Kashima (2016) investigated the problem of Bayesian online active learning. They provided a general
framework based on policy-adaptive submodularity to handle data streams in an online setting. The authors
distinguish between the stream setting, where the labeling decision can be made within a given timeframe, and the
secretary setting, introduced in Section 2, where the labeling decision must be made immediately. The proposed
framework can be applied in a variety of active learning scenarios, such as active classification, active clustering,
and active feature selection. The framework is based on the concept of adaptive submodular maximization,
which extends the idea of submodular maximization. A set function is considered to be submodular if it satisfies
the property of diminishing returns, meaning that adding an element to a smaller set has a greater impact on
the function value than adding the same element to a larger set. Adaptive submodular maximization allows the
model to adapt to the changing distribution of data over time, by adjusting the set function to reflect the current
state of knowledge. This leads to more efficient use of available data and improved performance.
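The diminishing returns property can be checked on a simple set coverage function, a standard example of a submodular function. This sketch illustrates plain (non-adaptive) submodularity only; the item names and the helper functions are assumptions made for illustration.

```python
def coverage(selected, item_sets):
    """Number of distinct elements covered by the selected items."""
    covered = set()
    for name in selected:
        covered |= item_sets[name]
    return len(covered)


def marginal_gain(item, selected, item_sets):
    """Increase in coverage obtained by adding `item` to `selected`."""
    return coverage(selected | {item}, item_sets) - coverage(selected, item_sets)


item_sets = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}
gain_empty = marginal_gain("b", set(), item_sets)
gain_small = marginal_gain("b", {"a"}, item_sets)
gain_large = marginal_gain("b", {"a", "c"}, item_sets)
```

The gain of adding the same item shrinks as the set it is added to grows, which is exactly the diminishing returns behavior described above.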
So far, we discussed several single model approaches to active learning, which have shown promising results
in various applications. However, it is important to note that single models have their limitations and can
sometimes struggle to capture complex patterns and diverse representations present in the data. To address these
limitations, researchers have proposed the use of ensembles or committees as an alternative (Krawczyk et al,
2017). An ensemble or committee refers to a group of multiple models that collaborate to produce a more robust
and accurate prediction by combining their individual predictions. The models in an ensemble or committee can
be trained on different subsets of the data or with varying hyperparameters, and the final prediction is typically
made through either voting or weighted averaging. Ensembles or committees can also be regarded as a collection
of models that work together to make a prediction, either by exchanging information or learning from one another.
Among this class of methods, a common sampling strategy is represented by disagreement-based active learning.
A framework to perform disagreement-based active learning in online settings was recently introduced by Huang
et al (2022). They characterized the learner by a hypothesis space H of Vapnik-Chervonenkis (VC) dimension
d, which is composed of all the classifiers currently under consideration, and a Tsybakov noise model (Mammen
and Tsybakov, 1999; Tsybakov, 2004). Each classifier $h \in \mathcal{H}$ is a measurable function mapping the observation $x_t$
to a binary output $y_t \in \{0, 1\}$. The disagreement between two classifiers is given by $d(h_1, h_2) = P[h_1(x) \neq h_2(x)]$
and the disagreement region is defined as

$$D(h_1, h_2) = \{x \in \mathcal{X} : h_1(x) \neq h_2(x)\} \tag{16}$$

The online active learning strategy is represented by the policy π = ({vt } , {λt }), where {vt }t≥1 is the map of the
queried data points, and {λt }t≥1 is the sequence of prediction rules. Over the time horizon T , the performance of
the policy π is evaluated using the label complexity and the regret. The label complexity measures the expected
number of labels queried, with respect to the stochastic process induced by $\pi$, and it is given by

$$\mathbb{E}[Q(T)] = \mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}[v_t = 1]\right] \tag{17}$$

The regret, on the other hand, represents the expected number of excess classification errors with respect to $h^*$,
and it is obtained as

$$\mathbb{E}[R(T)] = \mathbb{E}\left[\sum_{t \leq T : v_t = 0} \left(\mathbf{1}[\lambda_t \neq y_t] - \mathbf{1}[h^*(x_t) \neq y_t]\right)\right] \tag{18}$$

The objective of the algorithm is to minimize the label complexity with a constraint on the regret. At the first
round, the initial version space is the entire hypothesis space H, while the initial region of disagreement is the
whole input space X . Then, at time step t, the learner updates the version space Ht using the M collected labels,
and computes a new region of disagreement as

$$D(\mathcal{H}_t) = \{x \in \mathcal{X} : \exists h_1, h_2 \in \mathcal{H}_t, \; h_1(x) \neq h_2(x)\} \tag{19}$$

If xt ∈ D (Ht ), then the label of the current data point is queried, otherwise a prediction is produced using
an arbitrary classifier in Ht . At the end of the iteration t, the set Zt of M collected labeled examples is used
to estimate the empirical error ϵZt (h) of the classifiers in H and identify the best currently available classifier.
Then, the version space is updated by removing all the suboptimal hypotheses whose empirical error exceeds the
one obtained with h∗t by a threshold ∆Zt (h, h∗t ). The threshold regulates the trade-off between reducing label
complexity by narrowing the region of disagreement and increasing the regret by eliminating good classifiers.
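The query rule based on the region of disagreement can be sketched on a toy problem. The example assumes a finite hypothesis space of one-dimensional threshold classifiers and noise-free labels, so that hypotheses are pruned as soon as they contradict a queried label; this is a simplification of the error-threshold pruning described above, and all names are hypothetical.

```python
def predict(threshold, x):
    """Threshold classifier: output 1 when x is at or above the threshold."""
    return 1 if x >= threshold else 0


def in_disagreement_region(version_space, x):
    """True when at least two hypotheses in the version space disagree on x."""
    return len({predict(theta, x) for theta in version_space}) > 1


def disagreement_based_stream(stream, thresholds):
    """Query labels only inside the region of disagreement."""
    version_space = list(thresholds)
    queries = 0
    for x, y in stream:
        if in_disagreement_region(version_space, x):
            queries += 1  # the label y is requested only here
            version_space = [th for th in version_space if predict(th, x) == y]
    return version_space, queries


# Labels generated by the threshold 0.5 (noise-free, for illustration only).
stream = [(0.6, 1), (0.2, 0), (0.95, 1), (0.4, 0), (0.45, 0)]
version_space, queries = disagreement_based_stream(stream, [0.1, 0.3, 0.5, 0.7, 0.9])
```

Two of the five points fall outside the region of disagreement and are predicted without spending a label, while the three queried points are enough to isolate the correct threshold.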
The disagreement concept was also used by Desalvo et al (2021), while proposing an approach to online
active learning for binary classification tasks based on surrogate losses. The overall framework is similar to
the disagreement-based one used by Huang et al (2022), with the main difference being the use of weak labels
to optimize the sampling strategy. At each time step $t$, the learner observes the unlabeled data point $x_t$ and
either decides to request its label or assigns a pseudo-label $\hat{y}_t$. Then, the pseudo-labels $\hat{y}_t$ and the true labels
$y_t$ processed so far are used together to obtain an estimate of the empirical risk $\epsilon_{S_t}(h)$, where $S_t$ is obtained
by combining the collected labeled examples $Z_t$ with the pseudo-labeled ones $\hat{Z}_t$. This represents an example of
combining active learning and semi-supervised learning, as highlighted in Section 2.3.
Loy et al (2012) presented a Bayesian framework that leverages the principle of committee consensus to bal-
ance exploration and exploitation in online active learning. The aim of exploration is to discover new, previously
unknown classes, while exploitation focuses on refining the decision boundary for known classes. To address the
issue of unknown classes, the framework uses a Pitman-Yor process (PYP) prior (Pitman and Yor, 1997)
with a Dirichlet process mixture model (DPMM). A DPMM is a non-parametric clustering and classification
model that models the data generating process using a mixture of probability distributions. Each data point is
assigned to a cluster, which is associated with a probability distribution over the classes. The number of clusters
is modeled using a Dirichlet process, which is a distribution over distributions that allows for an infinite number
of clusters but ensures that the number of actual clusters is always finite. At each time step t, the learner samples
two random hypotheses $h_1$ and $h_2$ from the model. Then, it computes the posterior probability of each
class $k$, $p(c = k \mid x_t)$, under each of the two hypotheses. Finally, $h_i(x_t) = \arg\max_k p(c = k \mid x_t)$ is
calculated for $i = 1, 2$. The label of the current data point is queried in two cases: first, if $h_1(x_t) \neq h_2(x_t)$,
meaning the two hypotheses disagree, and second, if $h_i(x_t) = K + 1$ for all $i$, where $K$ is the number of currently
known classes, meaning the current data point is predicted to belong to a new class.
The DPMM has also been used by Mohamad et al (2020), who proposed a semi-supervised strategy for
performing active learning in online human activity recognition with sensory data. To account for the possibility
of dealing with different sensor network layouts, the authors proposed pre-training a conditional restricted
Boltzmann machine (Taylor and Hinton, 2009; Taylor et al, 2006) and used it to extract generic features from the
sensory input. The instance selection strategy follows a Bayesian approach, in trying to minimize the uncertainty
about the model parameters. To assess the usefulness of labeling the data point xt , they measure the discrepancy
between the model uncertainty computed from the data observed until the time step t and the expected risk
associated with yt . This gives a hint of how the current label would impact the current model uncertainty. A
dynamically adaptive threshold Γ is finally used to determine whether the current expected risk is greater
than the current risk.
A different kind of committee has been considered by Hao et al (2018a). They proposed a framework for
minimizing the number of queries made by an online learner that is trying to make the best possible forecast,
given the advice received from a pool of experts. To do so, they adapted the exponentially weighted average
forecaster (EWAF) and the greedy forecaster (GF) to the online active learning scenario. A comprehensive
analysis of forecasters to perform prediction with expert advice can be found in the book by Cesa-Bianchi and
Lugosi (2006). In general, at each time step t, the learner or forecaster has access to the predictions for the data
point xt made by the N experts, fi,t (xt ) : Rd → [0, 1] with i = 1, . . . , N . Based on these predictions, it outputs
its own prediction pt for the outcome yt . Then, if the label is revealed, the predictions made by the forecaster
and the experts are scored using a nonnegative loss function ℓ. The objective of the learner is to minimize the
cumulative regret over the time horizon T , which can be seen as the difference between its loss and the one
obtained with each expert $i$ as in

$$R_{i,T} = \sum_{t=1}^{T} \left(\ell(p_t, y_t) - \ell(f_{i,t}(x_t), y_t)\right) = \hat{L}_T - L_{i,T} \tag{20}$$

The simplest approach to obtain a prediction $p_t$ from the learner is to compute a weighted average of the
experts' predictions as in

$$p_t = \frac{\sum_{i=1}^{N} \omega_{i,t} f_{i,t}(x_t)}{\sum_{i=1}^{N} \omega_{i,t}} \tag{21}$$

where $\omega_{i,t} \geq 0$ is the weight assigned at time $t$ to the $i$th expert. With the EWAF, the weight for the $i$th expert
is obtained using

$$\omega_{i,t} = \frac{e^{\eta R_{i,t-1}}}{\sum_{j=1}^{N} e^{\eta R_{j,t-1}}} \tag{22}$$

where $\eta$ is a positive decay factor and $R_{i,t-1}$ is the cumulative regret with respect to expert $i$ (Equation 20) accumulated up to step $t-1$. The
exponential decay factor $\eta$ determines the weight given to the past losses, with more recent losses having a higher
weight and older losses having a lower weight. Instead, the GF works by minimizing, at each time step, the
largest possible increase of the potential function for all the possible outcomes of $y_t$. The potential function is the
function that assigns a potential value to each expert, which captures the quality of an expert's advice based on
its past performance. Hao et al (2018a) extended the EWAF and GF by proposing the active EWAF (AEWAF)
and active GF (AGF). The key idea is that, while the standard EWAF and GF assume the availability of the
true label yt after each prediction, in the online active learning framework the loss ℓ can only be measured a
limited number of times. To factor this in, a binary variable Zt ∈ {0, 1} is introduced to decide whether or not
at round t the label is requested. Consequently, the cumulative loss suffered by the ith expert on the instances
queried by the active forecaster is given by
$$
\widehat{L}_{i,T} = \sum_{t=1}^{T} \ell(f_{i,t}(x_t), y_t) \cdot Z_t \tag{23}
$$
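Equations 21 to 23 can be illustrated with a minimal sketch. It assumes the squared loss, and it writes the weights in the equivalent loss form ω ∝ exp(−η L̂i,t−1), since the forecaster's own loss cancels when Equation 22 is normalized. The function names and the toy stream are illustrative and do not come from Hao et al (2018a).

```python
import math

def ewaf_weights(expert_losses, eta):
    """Normalized exponential weights from cumulative expert losses."""
    # Shifting by the smallest loss avoids underflow; the shift cancels
    # in the normalization.
    base = min(expert_losses)
    raw = [math.exp(-eta * (loss - base)) for loss in expert_losses]
    total = sum(raw)
    return [w / total for w in raw]

def ewaf_predict(weights, advice):
    """Weighted average of the expert advice (Equation 21)."""
    return sum(w * f for w, f in zip(weights, advice)) / sum(weights)

def update_losses(expert_losses, advice, y, z):
    """Add the squared loss only when the label was queried (z = 1), as in Equation 23."""
    return [L + z * (f - y) ** 2 for L, f in zip(expert_losses, advice)]

# Toy stream of (expert advice, true label, query indicator Z_t).
losses = [0.0, 0.0, 0.0]
stream = [
    ([0.9, 0.2, 0.6], 1.0, 1),
    ([0.8, 0.1, 0.5], 1.0, 0),  # label not requested: losses stay unchanged
    ([0.7, 0.3, 0.4], 1.0, 1),
]
for advice, y, z in stream:
    weights = ewaf_weights(losses, eta=2.0)
    prediction = ewaf_predict(weights, advice)
    losses = update_losses(losses, advice, y, z)
```

After the loop, the first expert has the smallest cumulative loss on the queried rounds and therefore receives the largest weight at the next step.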
The sampling strategy is based on the determination of a confidence condition on the difference between the
prediction pt of the fully supervised forecaster and the prediction p̂t made by the active forecaster. For the active
forecaster we have that p̂t = π[0,1](pt), where pt depends on the chosen model. The AEWAF is based upon the
observation that if we have

$$
\max_{1 \leq i,j \leq N} \left| f_{i,t}(x_t) - f_{j,t}(x_t) \right| \leq \delta \tag{24}
$$
then |pt − p̂t| ≤ δ, where δ is a tolerance threshold. This means that the prediction of the forecaster is close to
the one obtained in the fully supervised setting if the maximum difference of advice between any two experts
is not too large. This assumption might not hold in the presence of noisy or bad experts and, to tackle this
problem, the authors proposed a robust variant of the AEWAF. The AGF uses instead a confidence condition
based on the fact that if

$$
\max_{1 \leq i \leq N} \left| f_{i,t}(x_t) - p_t \right| \leq \delta \tag{25}
$$
then |pt − p̂t| ≤ δ. The general scheme for performing online active learning with expert advice is reported in
Algorithm 2.

Algorithm 2 Online active learning with expert advice

Require: a data stream S, a loss function ℓ, a time horizon T, a set of N experts, a tolerance threshold δ, a
sampling budget B.
t ← 1                                                  ▷ Timestamp
c ← 0                                                  ▷ Labeling cost
while c < B and t ≤ T do
    Observe an incoming data point xt ∈ S
    Receive advice from the experts {fi,t(xt) : i = 1, . . . , N}
    Generate prediction pt for the label yt and set p̂t = π[0,1](pt)
    Draw a Bernoulli random variable Zt of parameter Pt = b/(b + |p̂t|)
    if Equation 24 or 25 is satisfied then             ▷ Sampling decision
        Discard xt
    else
        Ask for the true label yt
        c ← c + 1                                      ▷ Pay for the label
    end if
    t ← t + 1
end while
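The loop of Algorithm 2 can be sketched as follows, using the disagreement condition of Equation 24 as the sampling rule. This is a simplified illustration under stated assumptions: squared loss, exponential weights, an `oracle` callback that returns the label of round t, and a forecaster that keeps predicting without querying once the budget is spent (Algorithm 2 stops instead). All names are hypothetical.

```python
import math

def active_forecaster(stream, oracle, n_experts, delta, budget, eta=2.0):
    """Query a label only when the experts disagree by more than delta."""
    losses = [0.0] * n_experts              # cumulative losses on queried rounds
    cost, predictions = 0, []
    for t, advice in enumerate(stream):
        raw = [math.exp(-eta * L) for L in losses]
        p = sum(w * f for w, f in zip(raw, advice)) / sum(raw)
        p_hat = min(1.0, max(0.0, p))       # projection onto [0, 1]
        predictions.append(p_hat)
        # The largest pairwise gap |f_i - f_j| equals max(advice) - min(advice).
        disagreement = max(advice) - min(advice)
        if disagreement > delta and cost < budget:
            y = oracle(t)                   # pay for the label
            cost += 1
            losses = [L + (f - y) ** 2 for L, f in zip(losses, advice)]
    return predictions, cost

# Two experts that agree on rounds 0 and 2 and disagree on rounds 1 and 3.
stream = [[0.9, 0.8], [0.9, 0.1], [0.2, 0.25], [0.1, 0.9]]
labels = [1.0, 1.0, 0.0, 0.0]
preds, cost = active_forecaster(
    stream, lambda t: labels[t], n_experts=2, delta=0.3, budget=5
)
```

In this toy run only the two rounds with strong disagreement are labeled, and the expert that was right on the first queried round dominates the last prediction.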

A similar framework, in conjunction with multiple kernel learning (MKL), has been investigated by Chae
and Hong (2021). They propose an active MKL (AMKL) algorithm based on random feature approximation.
In general, online MKL based on random feature approximation is a method for online learning and prediction
that combines multiple kernel functions to improve the performance of a learning algorithm (Jin et al, 2010;
Hoi et al, 2013). In MKL, multiple kernel functions are used to capture different aspects of the data, and the
optimal combination of kernels is learned from the data. The online version of MKL based on random feature
approximation is designed to handle data that arrives sequentially, and the learning algorithm is updated after
each new data point. In kernel-based learning, the target function f(x) is assumed to belong to a reproducing
kernel Hilbert space (RKHS). In the proposed AMKL the learner uses an ensemble of N kernel functions. At
each time step t, two main steps are implemented. First, each kernel function f̂i,t(xt), with i = 1, . . . , N, is
optimized independently of the other kernel functions. This is referred to as the local step. Then, in the global step,
the learner seeks the best function approximation f̂t(xt) by combining the N kernel functions as in
$$
\widehat{f}_t(x_t) = \sum_{i=1}^{N} \widehat{v}_{i,t} \, \widehat{f}_{i,t}(x_t) \tag{26}
$$
where v̂i,t refers to the weight for the ith kernel function at round t. Similarly to the case with expert advice,
the weights are determined by minimizing the regret over the time horizon T, which is defined as the difference
between the loss of the learner and the one obtained with the best kernel function f∗i,t. To do so, the weights are
computed based on the past losses ℓ as
 
$$
\widehat{\omega}_{i,t} = \exp\left( -\eta_g \sum_{\tau \in A_{t-1}} \ell\left( \widehat{f}_{i,\tau}(x_\tau), y_\tau \right) \right) \tag{27}
$$

where ηg > 0 is a tunable parameter and At−1 is the set of time stamps, up to t − 1, of the instances for which
a label has been requested, which makes it possible to measure the loss. Then, the weights v̂i,t are obtained from
ω̂i,t as follows

$$
\widehat{v}_{i,t} = \frac{\widehat{\omega}_{i,t}}{\sum_{j=1}^{N} \widehat{\omega}_{j,t}} \tag{28}
$$
Finally, the instance selection criterion is based on a confidence condition, with threshold δ > 0, on the
similarity of the learned kernel functions, which is similar to the condition used by Hao et al (2018a) in the
formulation of the AEWAF
$$
\max_{1 \leq j \leq N} \sum_{i=1}^{N} \widehat{v}_{i,t} \, \ell\left( \widehat{f}_{i,t}(x_t), \widehat{f}_{j,t}(x_t) \right) \leq \delta \tag{29}
$$
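As a rough illustration of Equations 27 to 29, the sketch below computes the normalized kernel weights from cumulative losses and then checks the confidence condition. The squared loss and the function names are assumptions made for this example; the random feature approximation and the local update of each kernel function are not shown.

```python
import math

def kernel_weights(cum_losses, eta_g):
    """Equations 27 and 28: exponential weights, normalized to sum to one."""
    omega = [math.exp(-eta_g * L) for L in cum_losses]
    total = sum(omega)
    return [w / total for w in omega]

def squared_loss(a, b):
    return (a - b) ** 2

def is_confident(v, outputs, delta, loss=squared_loss):
    """Equation 29: the weighted loss against every kernel output stays within delta."""
    worst = max(
        sum(v_i * loss(f_i, f_j) for v_i, f_i in zip(v, outputs))
        for f_j in outputs
    )
    return worst <= delta

# Cumulative losses measured on the rounds where a label was requested.
v = kernel_weights([0.2, 0.2, 3.0], eta_g=1.0)
skip_label = is_confident(v, [0.50, 0.52, 0.49], delta=0.01)   # kernels agree
ask_label = not is_confident(v, [0.10, 0.90, 0.50], delta=0.01)  # kernels disagree
```

When the condition holds the learner can skip the label request; otherwise the instance is sent to the oracle.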

3.2 Drifting data stream classification approaches

Active learning strategies belonging to this category aim to tackle online classification tasks in time-varying data
streams affected by distribution shifts. We can classify distribution shifts into three main categories, depending
on whether they concern the feature space x or the output dimension y. A shift that only affects the input
distribution p(x), and not the conditional distribution p(y | x), is referred to as covariate shift (Zhou et al, 2021;
Wu et al, 2021; Li et al, 2021) or virtual drift (Baier et al, 2021). In these circumstances, for two different time
steps, ti and ti+∆, we have that p_{t_i}(x) ≠ p_{t_{i+∆}}(x) and p_{t_i}(y | x) = p_{t_{i+∆}}(y | x), meaning that the underlying
model is not being altered by phenomena like class swaps or coefficient changes. Conversely, in the presence of
a real concept drift (Baier et al, 2021; Suárez-Cetrulo et al, 2023), the conditional distribution changes, and
we have p_{t_i}(y | x) ≠ p_{t_{i+∆}}(y | x). In this scenario, the predictive performance of the fitted model dramatically
deteriorates, and a model update or replacement becomes necessary. An example of this kind of distribution
shift can be identified in the changes in consumer behavior over time, or following a major event such as the
COVID-19 pandemic (Zwanka and Buff, 2021). However, it should be noted that virtual drifts and real concept
drifts often occur together (Tsymbal et al, 2008), leading to a situation where we have both p_{t_i}(x) ≠ p_{t_{i+∆}}(x)
and p_{t_i}(y | x) ≠ p_{t_{i+∆}}(y | x) (Lu et al, 2018). Lastly, we can encounter a label distribution shift (Wu et al, 2021)
when the shift only affects p(y), leading to p_{t_i}(y) ≠ p_{t_{i+∆}}(y). This situation can be observed in many real-world
scenarios where the target distribution changes over time. A typical example is the prediction of diseases like
influenza, whose distribution can dramatically change depending on the season, or in the presence of sudden
outbreaks.
Another key characteristic of distribution shifts is represented by the change rate, namely how fast the new
concept or distribution is introduced into the data stream. To this extent, we can identify four kinds of drifts
(Lu et al, 2018; Lima et al, 2022), which are illustrated in Figure 7. A sudden or abrupt drift is a drift that
can be immediately detected from two consecutive time steps, ti and ti+1 . It refers to a sudden and clearly
identifiable change in the data distribution. An example of this would be a sudden change in the weather, which
would affect the behavior of customers at a retail store. The change is noticeable, and the model needs to be
updated immediately. A gradual drift exhibits a transition phase, where a mixture or overlap between the two
distributions pti and pti+∆ exists. In this case, the change is slower and more difficult to detect, making it
challenging to update the model. An example would be a change in consumer behavior over time, which is hard
to detect but can have a significant impact on a business. Another type of drift is the incremental drift, whose
extremely low transition rate makes it very difficult to detect changes between the data points
observed in the transition period. This type of drift is often caused by changes in the data generating process
that happen gradually over time, in small steps rather than all at once. An example would be changes in the
types of products that are popular among customers, which happen gradually and are hard to detect. Finally,
a data stream can also be affected by recurring concepts, which sequentially alternate over time. An example
would be a retail store where the same types of products are popular at different times of the year, such as
winter coats and summer dresses. The model needs to be able to detect and adapt to these recurring concepts
in order to maintain good performance.
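The four change rates can be made concrete with a small generator of concept schedules. This is only one possible way to simulate them: the change point `t0`, the transition `width`, and the `period` of the recurring concepts are hypothetical parameters, with 0 standing for concept C1 and 1 for concept C2.

```python
import random

def concept_at(t, kind, t0=100, width=50, period=100, rng=random):
    """Return a value in [0, 1]: 0 means concept C1, 1 means concept C2."""
    if kind == "abrupt":
        return 0.0 if t < t0 else 1.0
    progress = min(1.0, max(0.0, (t - t0) / width))
    if kind == "gradual":
        # Mixture phase: each point comes from C2 with growing probability.
        return 1.0 if rng.random() < progress else 0.0
    if kind == "incremental":
        # The concept itself moves in small steps through intermediate states.
        return progress
    if kind == "recurring":
        # The two concepts alternate every `period` time steps.
        return float((t // period) % 2)
    raise ValueError(f"unknown drift type: {kind}")

schedule = {kind: [concept_at(t, kind) for t in range(300)]
            for kind in ("abrupt", "gradual", "incremental", "recurring")}
```

In the gradual case the stream alternates between the two concepts during the transition, whereas in the incremental case it passes through intermediate concepts, which is what makes consecutive points so hard to tell apart.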

Fig. 7 Different types of drifts that can affect the data stream: abrupt drift (a), gradual drift (b), incremental drift (c), recurring
concepts (d). In each panel the horizontal axis is time t and the vertical axis spans the two concepts, C1 and C2, that might
characterize the data distribution.

In online active learning for drifting data streams, some approaches address the presence of concept drifts
by combining active learning strategies with drift detectors (Zhang et al, 2020a; Krawczyk et al, 2018). Drift
detectors are algorithms that try to detect distribution shifts and identify when the context is changing. They
can be divided into three macro-categories (Lu et al, 2018). The first group of methods is represented by the
error-based drift detectors, which try to detect online changes in the error rate of a base classifier. Among these,
one of the most commonly employed strategies is the drift detection method (DDM) proposed by Gama et al
(2004). Another popular approach is the adaptive window (ADWIN) strategy proposed by Bifet and Gavaldà
(2007). The second class of drift detectors is called data distribution-based drift detection, and the third class is
represented by multiple hypothesis testing strategies. While the first class contains the majority of the proposed
approaches, it assumes that we are able to observe the labels of all the incoming data points to assess the error
rate. Instead, the last two classes could be implemented even in an unsupervised manner. An exhaustive overview
of unsupervised drift detection methods has been provided by Gemaque et al (2020). While the unsupervised
nature of the data distribution-based and multiple hypothesis testing strategies makes them ideal for the active
learning scenario, it should be noted that real concept drifts can hardly be detected in a completely unsupervised
fashion. Indeed, in a circumstance when the input distribution p(x) remains unaltered while the underlying model
relating the input variables x to the label y changes, it would not be possible to detect the change of concept
without collecting labels. This is why Krawczyk et al (2018) propose to apply an error-based drift detector to
the few labels collected during the online active learning routine. To this extent, they use the ADWIN (Bifet
and Gavaldà, 2007) method to detect drifts and decide when the current model needs to be updated or replaced.
The proposed general framework for dealing with online active learning with drifting data streams is reported
in Algorithm 3.
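To illustrate how an error-based detector can be fed only with the few labels collected by the active learner, the sketch below uses the simpler DDM rule of Gama et al (2004) in place of ADWIN, which is more involved. It tracks the error rate p and its standard deviation s, raises a warning when p + s exceeds the best recorded level by two standard deviations and a drift at three; the 30-sample warm-up is the customary default. The class is a simplified illustration, not the reference implementation of either method.

```python
import math

class DDM:
    """Error-rate monitor following the DDM rule, updated only on queried labels."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n, self.errors = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, is_error):
        """Return 'drift', 'warning', or 'stable' after one labeled example."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s      # best level observed so far
        if p + s > self.p_min + 3 * self.s_min:
            self.reset()                       # start monitoring the new concept
            return "drift"
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"

detector = DDM()
# 100 queried labels with a 10% error rate, then the classifier starts failing.
before = [detector.update(i % 10 == 9) for i in range(100)]
after = [detector.update(True) for _ in range(20)]
```

In this toy run the detector stays stable during the first phase, then emits warnings followed by a drift signal a few labeled errors after the change.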
Moreover, the authors proposed the use of a time-variable threshold to balance the budget use over time.
Their approach is based on the intuition that, when a new concept is introduced, more labeling effort will be
required to quickly collect representative observations belonging to the new concept and replace the outdated
model. This is obtained by lowering the uncertainty threshold whenever the drift detector signals a change. Given a
threshold Γ on the uncertainty of the classifier and a labeling rate adjustment r ∈ [0, 1], the threshold is reduced
to Γ − r when ADWIN raises a warning and to Γ − 2r when a real drift is detected. Thus, when allocating the
labeling budget, the key requirement is that the labeling rate employed when a drift is detected should be strictly
larger than the one used in static conditions. A similar thresholding idea has also been used by Castellani et al
(2022), who proposed an active learning strategy for non-stationary data streams in the presence of verification
latency. They used a piece-wise constant budget function, where the labeling rate α is increased to αhigh when
a drift is detected and, after a while, reduced to αlow. Finally, the labeling rate is restored to its nominal value
α. A visual representation of the labeling approach is shown in Figure 8. The length of the time segments where
the labeling rate is altered depends on the desired values for αhigh and αlow, constraining the overall labeling
rate to be equal to α.

Algorithm 3 Online active learning with drifting data streams

Require: a data stream S, a classifier Ψ, a drift detector Θ, a sampling strategy Υ, a labeling rate α, a sampling
budget B.
t ← 1                                                  ▷ Timestamp
c ← 0                                                  ▷ Labeling cost
while c < B and t ≤ |S| do
    Observe incoming data point xt ∈ S
    if Υ(xt) = True then                               ▷ Sampling decision
        Ask for the true label yt
        c ← c + 1                                      ▷ Pay for the label
        Update classifier Ψ with the labeled example (xt, yt)
        Update drift detector Θ with the labeled example (xt, yt)
        if drift warning = True then
            Start to train a new classifier Ψnew
            Increase labeling rate α
        else
            if drift detected = True then              ▷ A detection is always preceded by a warning
                Replace Ψ with Ψnew
                Further increase α
            else
                Return to initial labeling rate α
            end if
        end if
        if Ψnew exists then                            ▷ Keeps being updated in the background until replacement
            Update classifier Ψnew with the labeled example (xt, yt)
        end if
    end if
    t ← t + 1
end while

Fig. 8 Piece-wise constant budget function introduced by Castellani et al (2022), showing the labeling rate over time t. The
sampling rate α is increased to αhigh when a drift is detected (tdrift), then reduced to αlow between tr1 and tr2, before being
restored to its nominal value.
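One way to read the constraint on the overall labeling rate is that the extra labels spent during the high-rate phase must be saved during the low-rate phase. If the high phase lasts Th steps and the low phase Tl steps, then αhigh·Th + αlow·Tl = α·(Th + Tl), which gives Tl = Th·(αhigh − α)/(α − αlow). The sketch below encodes this reading; it is an assumption about how the segments could be sized, and the function and parameter names are illustrative rather than taken from Castellani et al (2022).

```python
def low_phase_length(alpha, alpha_high, alpha_low, high_length):
    """Length of the low-rate phase that brings the average rate back to alpha."""
    if not alpha_low < alpha < alpha_high:
        raise ValueError("expected alpha_low < alpha < alpha_high")
    return high_length * (alpha_high - alpha) / (alpha - alpha_low)

def labeling_rate(t, t_drift, alpha, alpha_high, alpha_low, high_length):
    """Piece-wise constant labeling rate at time t."""
    t_r1 = t_drift + high_length
    t_r2 = t_r1 + low_phase_length(alpha, alpha_high, alpha_low, high_length)
    if t_drift <= t < t_r1:
        return alpha_high      # spend more labels right after the drift
    if t_r1 <= t < t_r2:
        return alpha_low       # pay the extra labels back
    return alpha               # nominal rate before the drift and after t_r2

# Nominal rate 10%, raised to 30% for 100 steps, then lowered to 5%.
low_length = low_phase_length(0.1, 0.3, 0.05, high_length=100)
```

With these values the low-rate phase has to last about 400 steps, four times longer than the high-rate phase, to keep the average rate at 10%.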

The authors also tackled the verification latency issue by considering the spatial information of a queried
point for which the label has not been made available yet by the oracle. In this way, it is possible to avoid
oversampling from regions where many close points have a high utility, namely a low classification confidence.
While assessing the utility of the incoming data points the authors use real and pseudo-labels by propagating
the information contained in the already labeled observations, as suggested by Pham et al (2022). The idea is to
use the spatial information of the queried labels by estimating the still missing labels with a weighted majority
vote over the labels of their k nearest neighbors, where the weight for each nearest neighbor depends on the
arrival time of the labels. The verification latency issue in online active learning with drifting data streams was
also extensively analyzed by Pham et al (2022). Consider the general case where at time txi we draw an instance
xi , and find it interesting enough to send it to the oracle, which will send back the label yi only at time tyi ,
where tyi > txi . Before the requested label arrives, we might encounter another instance similar to xi and ask
again for its label, since the learner could not update its utility function or threshold. Similarly, we might use
outdated information when updating the policy in a future window. To tackle these issues, the authors propose
a forgetting and simulating strategy to avoid using soon-to-be outdated observations and prevent redundant
labeling. The instance selection is based upon the variable uncertainty strategy proposed by Zliobaite et al (2014)
and the balanced incremental quantile filter by Kottke et al (2015). If we denote the current sliding window at
time txn as Wn = [txn − ∆, txn ) and use windows of fixed size ∆, we know that the sliding window that would be
used for training when the label yn related to xn arrives will be given by Dn = [tyn − ∆, tyn ). The forgetting step
is then implemented by discarding outdated labeled examples that are included in Wn but will not be included
in Dn . If ai is a Boolean variable indicating whether the ith observation has been labeled, the set of instances
selected to be forgotten is given by

$$
O_n = \left[ (x_i, y_i) \;\; \forall i < n : a_i = 1 \wedge t_{x_i}, t_{y_i} \in W_n \setminus D_n \right]. \tag{30}
$$

Similarly, there is a second set of observations, with time stamps Dn+ = Dn \Wn = [txn , tyn ), where there might
be instances that have been queried but whose label is not currently available. To avoid losing such information
and redundantly asking for the label of similar instances, the algorithm simulates incoming labels with a bagging
approach by averaging across multiple utility estimations. They also consider an alternative simulation approach
based on fuzzy labeling.
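The forgetting step of Equation 30 amounts to a window membership test. The sketch below assumes that each past instance is stored as a tuple (x, y, time drawn, time labeled, labeled flag); this record layout and the function name are hypothetical, and the simulation of pending labels is not covered.

```python
def forgetting_set(history, t_x_n, t_y_n, delta):
    """Labeled examples inside W_n = [t_x_n - delta, t_x_n) but outside
    D_n = [t_y_n - delta, t_y_n), following Equation 30."""

    def in_w(t):
        return t_x_n - delta <= t < t_x_n

    def in_d(t):
        return t_y_n - delta <= t < t_y_n

    return [
        (x, y)
        for (x, y, t_x, t_y, labeled) in history
        if labeled and all(in_w(t) and not in_d(t) for t in (t_x, t_y))
    ]

# Windows of size 10: W_n = [10, 20) and D_n = [15, 25), so W_n \ D_n = [10, 15).
history = [
    ("a", 0, 11, 13, True),   # drawn and labeled inside [10, 15): forgotten
    ("b", 1, 12, 17, True),   # label arrived inside D_n: kept
    ("c", 0, 16, 18, True),   # entirely inside D_n: kept
    ("d", 1, 11, 12, False),  # never labeled: not part of the set
]
outdated = forgetting_set(history, t_x_n=20, t_y_n=25, delta=10)
```

Only the first example is discarded, because it will have left the training window by the time the requested label arrives.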
Similarly to Krawczyk et al (2018), the ADWIN drift detector has also been used by Zhang et al (2020a)
while proposing a method for dealing with online active learning in environments characterized by concept drifts
and class imbalance. The instance selection criterion is based on the predictive uncertainty, which they estimate
using the best-versus-second-best margin value (Equation 12), as they tackle a multi-class classification problem.
An initial pool of n observations is passively collected from the stream to initialize the active learning strategy.
Then, a threshold Γi is estimated for each class as in
$$
\Gamma_i = \begin{cases} \dfrac{n m}{n_i L} & \dfrac{n}{n_i L} \geq 1 \\[2ex] m & \dfrac{n}{n_i L} < 1 \end{cases} \tag{31}
$$
where i = 1, . . . , L indexes the classes, L is the number of classes, and m is a pre-defined constant used to
control the size of the threshold. The model is represented by an ensemble of N classifiers and, when ADWIN
detects a concept drift, the classifier with the highest error is replaced with a newly trained one. Finally, the class imbalance issue is also
taken into account in two ways, during the training of the ensemble with the use of class-specific weights, and
during the active learning routine, by dynamically adjusting the threshold to select more observations belonging
to the minority class.
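A short sketch of the class-specific thresholds in Equation 31 follows. It assumes that ni denotes the number of observations of class i in the initial pool of n observations, which the text above does not state explicitly, and that every class appears at least once. The function name is illustrative.

```python
def class_thresholds(class_counts, m):
    """Per-class thresholds following Equation 31."""
    n = sum(class_counts)          # size of the initial pool
    L = len(class_counts)          # number of classes
    thresholds = []
    for n_i in class_counts:       # assumed: n_i > 0 for every class
        ratio = n / (n_i * L)
        thresholds.append(m * ratio if ratio >= 1 else m)
    return thresholds

# Imbalanced initial pool: 80, 15 and 5 observations for the three classes.
gammas = class_thresholds([80, 15, 5], m=0.2)
```

Under this reading the majority class keeps the base threshold m, while rarer classes receive proportionally larger thresholds, so that more of their uncertain instances are queried.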
Recently, Cheng et al (2023) presented another approach to combine online active learning with drift detection.
Their method involves segmenting the data stream S into fixed-length chunks and then detecting drifts by
comparing the distributions of adjacent chunks. After a drift is detected, a multi-objective optimization problem
is formulated in order to identify the most relevant and diverse data points within the current batch. For a data
point xt , relevance is defined as its contribution to the new concept, and diversity as the Pearson correlation
coefficient with other instances in the same region. Instead, Martins et al (2023) proposed to sample the most
uncertain data points from each chunk, using a meta-learning framework to fine-tune the threshold used for each
window. This reduces the need for labels while maintaining a steady adaptation to the new concepts.
Another window-based approach to perform active learning from data streams has been proposed by Zhu et al
(2007). The authors developed an ensemble E by partitioning the data stream S into chunks and then training
each of the k models composing the ensemble E on a different chunk of data. In this way, even if the previous
observations become unavailable, the models can be used when taking the sampling decision in order to take
into account a global uncertainty measure, which is a more robust approach than treating each chunk as a static
dataset. At time step t, the learner receives a data chunk St , which is used to build the current classifier Ct . At this
point, the ensemble is composed of Ct , together with the most recent k − 1 classifiers, Ct−k+1 , . . . , Ct−1 , trained
on the labeled examples sampled from the previously observed data chunks, Lt−k+1 , . . . , Lt−1 . At each iteration,
the objective is to predict the remaining unlabeled data points from the current chunk, Ut . The ensemble-based
active learning framework is depicted in Figure 9. The instances selected to be queried are the ones with the
largest ensemble variance, and the predictions are obtained by combining the predictions of the single classifiers
using the weights ωt−k+1 , . . . , ωt−1 . Finally, a weight updating rule is used to adapt to dynamic data streams.

Fig. 9 Ensemble-based active learning framework for data streams proposed by Zhu et al (2007). Each data chunk is split into a
labeled part L and an unlabeled part U, one classifier C is trained per chunk, and the weighted classifiers form the ensemble E
used for prediction.

Shan et al (2019) and Zhang et al (2018) developed online active learning strategies by building upon the
pairwise classifiers strategy introduced by Xu et al (2016). The pairwise strategy makes use of two models, a
stable classifier Cs and a dynamic classifier Cd , and divides the data stream into batches as in (Zhu et al, 2007).
The prediction for an incoming data point xt is obtained with a weighted average of the predictions obtained
from the two classifiers as in

fE (xt ) = ωs fCs (xt ) + ωd fCd (xt ) (32)

where ωs and ωd are the weights associated with the stable and the dynamic classifier, respectively. At time
t, the stable classifier Cs is trained on the labeled portions of all the batches processed so far, L1 , . . . , Lt−1 .
Conversely, Cd is trained exclusively on Lt−1 . The key idea is that whenever the dynamic (reactive) classifier starts to
outperform the stable classifier, the stable classifier is replaced by the reactive one, which is eventually reset.
This replacement allows the learner to adapt to the drift and focus on the most recent instances, forgetting the
seemingly obsolete data points. The main drawback of this approach is that it cannot effectively address gradual
drifts as the replacement with the classifier trained on the most recent observations makes the learner forget
about observations away from the current window. Hence, similarly to the approach of Zhu et al (2007), Shan et al
(2019) proposed an extension of this approach, based on an ensemble of classifiers, to simultaneously
address gradual drifts and abrupt drifts. In their strategy, the stable classifier learns from all the labeled instances
and the reactive classifier is replaced by an ensemble of dynamic classifiers, trained on multilevel sliding windows
to capture changes in the data stream at different time intervals. The instance selection approach combines
random sampling and uncertainty sampling, where the latter is based on the margin value of the predictions
obtained by the ensemble. It should be noted that the prediction fE for the data point xt is obtained as a
weighted combination of the predictions obtained with the stable and dynamic classifiers as in
$$
f_E(x_t) = \omega_s f_{C_s}(x_t) + \sum_{d=1}^{D} \omega_d f_{C_d}(x_t) \tag{33}
$$
The stable classifier has a constant weight ωs = 0.5 and plays a crucial role in trying to learn the overall trend
and direction of concept drift. Conversely, the dynamic classifiers have gradually decaying weights, according
to a damped sliding window approach where each weight is initialized at 1/D and then reduced according to its
creation time
$$
\omega_d = \begin{cases} \omega_d \left( 1 - \dfrac{1}{D} \right) & d = 1, \ldots, D - 1 \\[2ex] \dfrac{1}{D} & d = D \end{cases} \tag{34}
$$
The most recent classifiers are useful in detecting sudden concept drifts and have the highest weights, while the
old dynamic classifiers have lower weights and can help to identify gradual drifts. The same pairwise strategy
based on an ensemble composed of a stable classifier and D dynamic classifiers was used by Zhang et al (2022).
They modified the original strategy by introducing a reinforcement mechanism to adjust the weights ωd according
to the prediction performance and the class imbalance issue. The weights adjustment strategy is described by
Algorithm 4. It should be noted that this procedure is only implemented after the true label yt has been revealed
by the oracle. The damped class imbalance ratio (DCIR) value is obtained by taking into account the number of
observations for each class collected so far. This is expected to be useful when dealing with imbalanced classes.
With regards to the instance selection criterion, the authors consider a hybrid strategy combining uncertainty
sampling and random sampling, since approaches solely based on uncertainty could ignore a concept change that
is not close to the boundary. Woźniak et al (2023) recently proposed another ensemble-based active learning
strategy where the data points to be labeled are selected from the current chunk using the budget labeling
active learning strategy introduced by Zyblewski et al (2020). According to this approach, the learner selects
both random and informative data points, where the informativeness is determined using the support function
threshold, which in the case of binary classification problems can be interpreted as a distance from the decision
boundary.

Algorithm 4 Weight adjustment for dynamic classifiers

Require: a labeled observation (xt, yt), number of classes K, number of dynamic classifiers D, current weights
ωd with d = 1, . . . , D, DCIR for each class DCIRκ with κ = 1, . . . , K.
if DCIR[yt] < 1/K then                         ▷ Check if it belongs to the minority class
    for d = 1, . . . , D do
        if Cd(xt) = yt then                    ▷ Check if the prediction made by Cd is correct
            ωd ← ωd (1 + 1/D)                  ▷ Increase weight of classifier Cd
        else
            ωd ← ωd (1 − 1/D)                  ▷ Decrease weight of classifier Cd
        end if
    end for
end if
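The damped weights of Equation 34 and the adjustment of Algorithm 4 can be sketched together. The code assumes that Equation 34 is applied each time a new dynamic classifier is created, with the oldest classifier dropped once D are available, and that the DCIR values are supplied as a mapping from class to ratio; both points are interpretations, and the function names are illustrative.

```python
def add_dynamic_classifier(weights, D):
    """Equation 34: decay the existing weights and give the newest classifier 1/D."""
    decayed = [w * (1 - 1 / D) for w in weights]
    decayed.append(1 / D)
    return decayed[-D:]            # keep at most D dynamic classifiers

def adjust_weights(weights, predictions, y, dcir, K):
    """Algorithm 4: reward or penalize each dynamic classifier on a minority-class label."""
    D = len(weights)
    if dcir[y] >= 1 / K:           # not a minority class: weights are left unchanged
        return list(weights)
    return [
        w * (1 + 1 / D) if pred == y else w * (1 - 1 / D)
        for w, pred in zip(weights, predictions)
    ]

# Three successive batches create three dynamic classifiers.
weights = []
for _ in range(3):
    weights = add_dynamic_classifier(weights, D=3)

# A minority-class label arrives: classifiers 1 and 3 were right, classifier 2 was wrong.
updated = adjust_weights(weights, predictions=[1, 0, 1], y=1,
                         dcir={0: 0.8, 1: 0.2}, K=2)
```

After three batches the weights increase with recency, and the adjustment step then raises the weights of the classifiers that predicted the minority class correctly.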

Another way to perform online active learning in time-varying data streams is to use clustering-based
approaches. Halder et al (2023) extended the framework based on stable and dynamic classifiers by introducing
a clustering step that aims to train the new stable classifier Cs on the most informative and representative
instances from each data block. Similarly, Ienco et al (2013) investigated a clustering-based approach in a
batch-based scenario, where only a fraction of the incoming block of observations can be labeled. They extend the
pre-clustering approach (Nguyen and Smeulders, 2004), which had been previously studied in the pool-based
scenario, to the stream-based case. The sampling strategy takes into account an extra-cluster metric, to sort
the clusters, and an intra-cluster one, to sort the observations within each cluster. When a new batch arrives,
observations are clustered, and clusters are sorted based on the homogeneity of the clusters, which is measured
taking into account the number of (predicted) classes within each cluster. If a cluster is balanced in the number
of expected classes, it is regarded as an uncertain cluster that covers a more difficult area of the input space.
Within each cluster, the certainty of an observation is determined by its representativeness, namely the distance
from the centroid, and its uncertainty, meant as the maximum a posteriori probability among all the predicted
classes for xt. When the clusters and observations are ranked, the learner starts to iteratively query the labels of
the observations in an alternating fashion. To sample the most representative data points from each batch, Zhang et al
(2023) suggested the use of density-peak clustering to recognize the incomplete clusters in the dynamic feature
space through the altitude of these data points. This makes it possible to query the observations belonging to those
regions in the following iterations.
Recently, Yin et al (2023) proposed an adaptive data stream classification method based on microclustering.
After initializing micro-clusters from the initial training data, they collected new labels using a mixed strategy
that combines random sampling with a class-weighted margin score. Then, the micro-cluster learning model is
dynamically updated to adapt to the presence of concept drifts.
Another approach that tries to exploit the clustering nature of the incoming observations has been proposed
by Mohamad et al (2018), with the use of a bi-criteria active learning algorithm that considers both density in the
input space and label uncertainty. The density-based criterion makes use of the growing Gaussian mixture model
(GGMM) proposed by Bouchachia and Vanaret (2014), which is used to find clusters in the data and estimate
its density. This model creates a new cluster when a new data point xt has a Mahalanobis distance greater than
a given closeness threshold from the nearest cluster, among the currently available ones. A flowchart describing
the main steps of the GGMM is depicted in Figure 10.
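The cluster creation rule can be illustrated with a small sketch. To stay dependency-free it assumes diagonal covariance matrices, which is a simplification of the full Mahalanobis distance, and it leaves out the parameter updates, weight decay and pruning of the GGMM. The dictionary layout, the `init_var` parameter and the function names are hypothetical.

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))

def assign_or_create(x, clusters, threshold, init_var=1.0):
    """Return the index of the matching cluster, creating a new one if none is close enough."""
    if clusters:
        dists = [mahalanobis_diag(x, c["mean"], c["var"]) for c in clusters]
        nearest = min(range(len(clusters)), key=dists.__getitem__)
        if dists[nearest] <= threshold:
            return nearest
    # No cluster within the closeness threshold: start a new one at x.
    clusters.append({"mean": list(x), "var": [init_var] * len(x)})
    return len(clusters) - 1

clusters = []
first = assign_or_create([0.0, 0.0], clusters, threshold=3.0)   # creates cluster 0
close = assign_or_create([1.0, 1.0], clusters, threshold=3.0)   # matches cluster 0
far = assign_or_create([5.0, 5.0], clusters, threshold=3.0)     # creates cluster 1
```

A point that falls farther than the threshold from every existing cluster seeds a new component, which is how the mixture grows as new regions of the input space appear.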
Fig. 10 Main steps of the growing Gaussian mixture model used by Mohamad et al (2018). The flowchart contains the following
steps: observe an unlabeled data point; compute the probability of match with each Gaussian; decision "Gaussian matches?"
(yes/no); update the parameters of the Gaussian; decay the weight of all Gaussians; remove the least contributing Gaussian;
initialize a new Gaussian with the current data point.

A Bayesian logistic regression model is used for addressing the label uncertainty criterion and the concept
drift. As the classifier parameters wt are assumed to evolve over time, the model is incrementally updated using
a discrepancy measure, which is computed as the difference between the uncertainty of the model in xt before
and after the true label yt is added to the training set. The query strategy follows the b-sampling approach,
in trying to sample, with high probability, the observations that contribute the most to the current error. The
combination of density and uncertainty is also employed by Liu et al (2021), who proposed a cognitive dual query
strategy for online active learning in the presence of concept drifts and noise. The local density measure is used
to obtain representative instances and the uncertainty criterion aims to select data points where the classifier is
less confident. The cognitive aspect takes into account Ebbinghaus’s law of human memory (Ebbinghaus, 2013)
to determine an optimal replacement policy. The proposed strategy tries to tackle both gradual and abrupt
drifts. The drift is generally considered as a change in the underlying joint probability distribution from one
time step t to another, namely pt(x, y) ≠ pt+1(x, y). The local density of an observation xt is defined by the
number of times that xt is the nearest neighbor of other instances (Ienco et al, 2014). Since we are in an online
framework, the authors proposed to measure the local density using a sliding window model, referred to as a
cognition window. Based on the concept of memory strength, the model determines when the current window
is full and needs to be updated. Finally, the labeling decision is taken by using two thresholds, one for the local
density and one for the classifier uncertainty.
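The local density used here, namely the number of times a point is the nearest neighbor of other instances, can be computed over a window with a brute-force pass. The sketch assumes Euclidean distance and a window small enough for a quadratic scan; the function name is illustrative and the memory-strength mechanism that manages the cognition window is not shown.

```python
def local_density(window):
    """For each point, count how many other points in the window have it as nearest neighbor."""
    counts = [0] * len(window)
    for i, x in enumerate(window):
        best, best_dist = None, float("inf")
        for j, z in enumerate(window):
            if i == j:
                continue
            dist = sum((a - b) ** 2 for a, b in zip(x, z))  # squared Euclidean distance
            if dist < best_dist:
                best, best_dist = j, dist
        if best is not None:
            counts[best] += 1
    return counts

# Three points close together and one isolated point.
window = [(0.0, 0.0), (1.0, 0.0), (0.9, 0.1), (5.0, 5.0)]
density = local_density(window)
```

The third point is the nearest neighbor of all the others and gets the highest density, while the isolated point gets zero, so a density threshold would favor the former as a representative instance.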
A different sliding window-based online active learning strategy is the one proposed by Kurlej and Woźniak
(2011). The authors proposed a sliding window approach based on a nearest neighbors classifier. The reference
set for the k-nearest neighbors model is a window, and it is updated in two ways: in a first-in-first-out manner
or using the examples selected by the active learning strategy. Since the reference set is updated over time, this
method can effectively deal with concept drift and time-varying data streams. The sampling strategy is also
based on two criteria. The first one is similar to the margin-based approaches: an instance is queried if it has
a low distance from two observations belonging to different classes. The second criterion, similar to the greedy
sampling strategy, seeks observations that have a large minimum distance from the observations in the current
reference set. Both criteria are implemented by setting a threshold on the distances.
A simpler approach for taking into account the time-varying aspect of evolving data streams is to force
the model to focus on the most recent observations. Along these lines, Chu et al (2011) propose a framework
based on a Bayesian probit model and a time-decay variant. Online Bayesian learning is used to maintain a
posterior distribution of the weight vector of a linear classifier over time wt , and the time-decay strategies are
employed to tackle the concept drift and give more importance to recent observations. They also propose an
online approximation technique that can handle weighted examples, which is based upon Minka (2001). They
tested different sampling strategies, built upon an online probit classifier. The instance selection criteria are
based on entropy, function-value, and random sampling.

3.3 Evolving fuzzy systems approaches

An alternative way to take into account the time-varying nature of evolving data streams is the use of evolving
fuzzy systems (EFS) (Lughofer, 2011), which are soft computing techniques that can efficiently deal with novelty
and knowledge expansion. EFS are self-developing, self-learning fuzzy rule-based or neuro-fuzzy systems that
self-adapt both their parameters and their structure on-the-fly. They try to mimic human-like reasoning by
modeling it with a dynamically developing fuzzy rule-based structure and implementing it utilizing data streams
using a formal learning process. The basic rule structure of a fuzzy model is given by

$$\text{Rule}_i:\ \text{if } (x_1 \text{ is } X_{i1}) \text{ and } \dots \text{ and } (x_n \text{ is } X_{in}) \text{ then } (y_i = a_{i0} + a_{i1} x_1 + \dots + a_{in} x_n) \tag{35}$$

where Rule i with i = 1, 2, . . . , R is one of several fuzzy rules in the current rule base; xj (j = 1, 2, . . . , n) are
input variables; yi denotes the output of the ith fuzzy rule; Xij denotes the jth prototype (focal point) of the
ith fuzzy rule; aij denotes the jth parameter of the ith fuzzy rule. For a more thorough discussion on EFS and
their use in online learning, please see (Lughofer, 2017, 2011; Ge and Zeng, 2020; Gu et al, 2022). The main
components of an EFS are shown in Figure 11. The two key components of an EFS are the structure evolving
scheme, which contains the rule generation and simplification modules, and the parameters updating scheme. The
rule generation module defines when a new rule needs to be added to the current model. The rule merging and
pruning steps simplify the models by removing redundant rules and combining two rules when their similarity is
larger than a given threshold. The parameter updating modules are used to keep track of the model evolution.
These learning modules are used to update the EFS every time a new labeled example (xt , yt ) is made available.

[Figure 11: structure evolving scheme (rule generation, rule merging, rule pruning) and parameters updating scheme (antecedent, consequent).]

Fig. 11 Learning modules of an EFS (Ge and Zeng, 2020).
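To make the rule structure of Equation 35 concrete, the following sketch evaluates a small rule base. It assumes Gaussian memberships centered on the focal points with a shared width, and a firing-strength-weighted average of the rule outputs; neither choice is specified in the text above, and the function name ts_predict is hypothetical.

```python
import math


def ts_predict(x, rules, width=1.0):
    """Evaluate a fuzzy rule base with linear consequents.

    rules: list of (focal_point, coeffs) with coeffs = [a_i0, a_i1, ..., a_in].
    width: assumed common spread of the Gaussian memberships.
    """
    weights, outputs = [], []
    for focal, coeffs in rules:
        # Firing strength: how well x matches the antecedent of the rule.
        sq_dist = sum((xj - fj) ** 2 for xj, fj in zip(x, focal))
        weights.append(math.exp(-sq_dist / (2.0 * width ** 2)))
        # Consequent: y_i = a_i0 + a_i1 x_1 + ... + a_in x_n.
        outputs.append(coeffs[0] + sum(a * xj for a, xj in zip(coeffs[1:], x)))
    total = sum(weights)  # may underflow to zero far from every focal point
    return sum(w * y for w, y in zip(weights, outputs)) / total


rules = [((0.0,), [1.0, 0.0]), ((2.0,), [3.0, 0.0])]
```

An input halfway between the two focal points fires both rules equally, so the prediction is the average of the two consequents.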

The first single-pass active learning approach based on the use of evolving classification models has been
proposed by Lughofer (2012). The proposed algorithm is based on two key concepts, conflict and ignorance.
The former is related to an incoming data point lying close to the boundary between any two classes; the latter
considers the distance of the incoming observation from the currently labeled training set, in the feature space.
A large distance suggests that the data point falls within a region that has not been thoroughly explored by the learner.
Later on, Lughofer and Pratama (2018) also proposed the first online active learning approach for evolving
regression models. Similarly to their previous work (Lughofer, 2012), the authors consider the ignorance about
the input space in the instance selection criterion. Moreover, they also consider the uncertainty in the model
outputs and in the model parameters. The predictive uncertainty is assessed in terms of confidence intervals
using locally adaptive error bars. The error bars are inspired by (Škrjanc, 2009) and the authors propose a new
merging approach for dealing with the case of overlapping fuzzy rules. The uncertainty in the model parameters
is instead evaluated using the A-optimality criterion, which will be discussed in Section 3.4 together with other
alphabetic optimality criteria. Instead of leveraging the uncertainty about the output, Pratama et al (2015) set
a dynamic threshold based on the variable uncertainty strategy introduced by Zliobaite et al (2014) while trying
to address the what-to-learn question in the training of a recurrent fuzzy classifier. The key idea is that the
model is iteratively retrained using data points that fall within rules with low support, which were formed using
the smallest amount of observations. Recently, Lughofer and Škrjanc (2023) proposed an online active learning
strategy for fuzzy models based on three criteria.
• D-optimality in the consequent space to reduce parameter uncertainty, as in Cacciarelli et al (2022b).
• Overlap degree in the antecedent space to reduce the number of data points lying in the overlap regions of
two different rules.
• Novelty content in the antecedent space, indicating the required knowledge expansion through rule evolution.
A different kind of threshold, based on the spherical potential theory, has been suggested by Subramanian
et al (2014), with the proposal of a meta-cognitive component that evaluates the novelty content of incoming
data points. This is done using a knowledge measure represented by the spherical potential, which has been
thoroughly investigated in kernel-based approaches (Hoffmann, 2007). The spherical potential is used to set a
threshold and decide whether to add a new rule to capture the knowledge in the current sample. It should be
noted that the authors also used a threshold based on the prediction error, which cannot be used when labels
are scarce. The prediction error is assessed using the hinge loss error function (Suresh et al, 2008; Zhang, 2004).
Fuzzy models have also been used to solve computer vision tasks. Weigl et al (2016) analyze the visual
inspection quality control case, which is also considered by Rožanec et al (2022). They assess the usefulness of
the images in a single-pass manner, but the instances that are selected to be queried are accumulated in a buffer,
which is later on assigned to an oracle for labeling. Choosing the size of the buffer represents a trade-off problem
between timely updating the classifier and requiring continuous interventions from a human annotator. The
active learning strategy works by setting a threshold on the certainty of the model with regards to the incoming
data points. The authors take into account two model classes, a random forest classifier and an evolving fuzzy
classifier. When using random forest, certainty is computed using the best-versus-second-best margin score.
Instead, when using evolving fuzzy classifiers, the sample selection criterion takes into account the conflict and
ignorance concepts as in Lughofer (2012).
Finally, Cernuda et al (2014) combine the use of fuzzy models with a sampling approach inspired by the
multivariate statistical process control literature. Indeed, using a latent structure model, they propose a query
strategy based on the Hotelling $T^2$ and the squared prediction error (SPE) statistics, which have been extensively
used in anomaly detection problems (Cacciarelli and Kulahci, 2022; Gajjar et al, 2018; Vanhatalo and Kulahci,
2016; Vanhatalo et al, 2017). Ge (2014) used these statistics for pool-based active learning in conjunction with a
principal component regression model. The key idea is to use the Hotelling $T^2$ and the SPE statistics to measure
the distance between the currently labeled training set and a new unlabeled data point. A high value in one of
the two statistics would most likely suggest that the new observation is violating the current model, and thus
its inclusion in the training set could bring some valuable information. Similarly, Cernuda et al (2014) use the
Hotelling $T^2$ and the SPE statistics with a partial least squares model. Then, when a new observation is added
to the training set, they retrain a TS fuzzy model using a sliding window approach.
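The idea of measuring the distance between a new point and the labeled set with the Hotelling $T^2$ and SPE statistics can be sketched as below. The example uses a plain principal component model rather than the partial least squares model of the cited work, and the function names and the two control limits are hypothetical.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on the labeled inputs X (rows are observations)."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components].T                                # loadings, d x k
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)   # variance of each score
    return mean, P, score_var


def t2_and_spe(x, mean, P, score_var):
    xc = x - mean
    scores = P.T @ xc
    t2 = float(np.sum(scores ** 2 / score_var))   # distance inside the latent space
    residual = xc - P @ scores
    spe = float(residual @ residual)              # squared distance from the latent space
    return t2, spe


def should_query(x, model, t2_limit, spe_limit):
    """Query when either statistic exceeds its (assumed) control limit."""
    t2, spe = t2_and_spe(x, *model)
    return t2 > t2_limit or spe > spe_limit


# Toy labeled set lying close to the line x2 = 2 * x1.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X_labeled = np.column_stack([t, 2.0 * t + 0.01 * rng.normal(size=200)])
model = fit_pca(X_labeled, n_components=1)
```

In this toy setting, a point far along the line inflates $T^2$, whereas a point away from the line inflates the SPE; either case would trigger a label request.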

3.4 Experimental design and bandit approaches

Optimal experimental design (Karlin and Studden, 1966) is a research field that is closely related to active
learning. It deals with the design of experiments that allow for efficient estimation of model parameters or
improved prediction performance while minimizing the number of required labeled examples, also referred to as
the number of runs N. Many optimality criteria have been developed to strike a balance between
efficient use of resources and ensuring good performance of the model. The traditional framework of optimal
experimental designs focuses on linear regression models of the form

$$y = X\beta + \varepsilon \tag{36}$$
where, given d input variables, y is an N × 1 vector of response variables, X is an N × d model matrix, β is a d × 1
vector of regression coefficients, and ε is an N × 1 vector representing the noise, with covariance matrix $\sigma^2 I$. If
the matrix $X^\top X$ is of full rank, an ordinary least squares (OLS) estimator for β can be obtained using

$$\hat{\beta} = (X^\top X)^{-1} X^\top y \tag{37}$$
In general, design optimality criteria leverage the information contained in the moment matrix, which is defined
as $M = X^\top X / N$. The matrix $X^\top X$ plays a crucial role in the estimation of the model coefficients β, and it
is important to perceive information about the design geometry. Indeed, with Gaussian noise characterized by
$\varepsilon \sim N(0, \sigma^2 I)$, we know that

$$\hat{\beta} \mid X \sim N\left(\beta, (X^\top X)^{-1} \sigma^2\right) \tag{38}$$
and we can define a 100(1 − α)% confidence ellipsoid related to the solutions of β using

$$\frac{(b - \hat{\beta})^\top X^\top X (b - \hat{\beta})}{d s^2} \le F_{\alpha, d, N-d} \tag{39}$$
where $s^2$ represents the residual mean square, $F_{\alpha,d,N-d}$ is the 100(1 − α) percentile derived from the Fisher
distribution, and b indicates all the possible vectors that could be the true model parameter β. The ellipsoid
can also be expressed as $(b - \hat{\beta})^\top X^\top X (b - \hat{\beta}) \le C$, where $C = d s^2 F_{\alpha,d,N-d}$. The volume of this ellipsoid is
inversely proportional to the square root of the determinant of $X^\top X$, and the length of its axes is proportional
to $1/\sqrt{\lambda_i}$, where $\lambda_i$ represents the ith eigenvalue of $X^\top X$, with i = 1, . . . , d. The so-called alphabetic optimality
criteria pursue efficient designs by exploiting these properties (Kiefer, 1959). The most commonly employed
optimality criteria for good parameter estimation are A-, D- and E-optimality:

• A-optimality. This criterion pursues good model parameter estimation by minimizing the sum of the variances
of the regression coefficients. Knowing that the coefficient variances appear on the diagonal of the matrix
$(X^\top X)^{-1}$, it can be shown that an A-optimal design is given by a design $D^*$ that satisfies
$\min_D \operatorname{tr}[M(D)]^{-1} = \operatorname{tr}[M(D^*)]^{-1}$.
• D-optimality. This criterion takes into account both the variance and covariance of the regression coefficients,
directly minimizing the total volume of the confidence ellipsoid (Myers et al, 2016). A D-optimal design is
given by a design $D^*$ that satisfies $\max_D |M(D)| = |M(D^*)|$ (John and Draper, 1975).
• E-optimality. This strategy tries to shrink the ellipsoid by minimizing the maximum eigenvalue of the
covariance matrix.
The geometrical intuition behind these criteria is illustrated, in the two-dimensional case, in Figure 12.

[Figure 12: panels (a), (b), and (c), each showing the confidence ellipsoid centered on $\hat{\beta}$ in the $(b_1, b_2)$ plane.]

Fig. 12 Confidence ellipsoid around the model parameters and optimality criteria: A-optimality (a) shrinks the hyperrectangle
enclosing the confidence ellipsoid (Asprey and Macchietto, 2002; Galvanin, 2010), D-optimality (b) aims to shrink the total volume
of the ellipsoid, and E-optimality (c) tries to reduce the length of the longest axis (Jamieson, 2018).
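The three criteria can be compared numerically on small designs. The sketch below computes them from the eigenvalues of the moment matrix; the function name design_criteria and the two toy designs are assumptions introduced for illustration, and the design matrix is assumed to be of full column rank.

```python
import numpy as np


def design_criteria(X):
    """A-, D- and E-criteria of a design, computed from M = X^T X / N."""
    N = X.shape[0]
    M = X.T @ X / N
    eigvals = np.linalg.eigvalsh(M)  # real eigenvalues in ascending order
    return {
        "A": float(np.sum(1.0 / eigvals)),  # tr(M^{-1}), to be minimized
        "D": float(np.prod(eigvals)),       # |M|, to be maximized
        "E": float(1.0 / eigvals[0]),       # largest eigenvalue of M^{-1}, to be minimized
    }


# A two-level factorial design versus four runs clustered in one corner.
factorial = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
clustered = np.array([[0.5, 0.5], [0.5, 0.6], [0.6, 0.5], [0.6, 0.6]])
crit_factorial = design_criteria(factorial)
crit_clustered = design_criteria(clustered)
```

The spread-out factorial design has an identity moment matrix, while the clustered design is nearly singular, so it scores worse on all three criteria.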

Finally, there are also optimality criteria that focus on developing models with good predictive properties.
Within this class, G-optimality represents a criterion that is used to seek protection against the worst-case
prediction variance in a region of interest R. This is achieved by solving

$$\min_{D} \max_{x \in R} v(x) \tag{40}$$

where v(x) represents the scaled prediction variance of the current model in the data point x, which can be
computed as
$$v(x) = N x^{(m)\top} (X^\top X)^{-1} x^{(m)} \tag{41}$$
where $x^{(m)}$ represents the data point where the variance is being estimated, expanded to the model form. It
should be noted that G-optimality can be highly influenced by anomalous observations, as it protects against
the highest possible variance over all the region R. This issue can be tackled by using I- or V-optimality, which
estimate the overall prediction variance over R by integrating or averaging, respectively. For a more extensive
discussion on optimal designs, please see Montgomery (2012) or Myers et al (2016).
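The scaled prediction variance of Equation 41 and the worst case targeted by G-optimality can be evaluated as in the sketch below. The region R is approximated here by a finite list of candidate points already expanded to model form, which is an assumption of the example, as are the function names.

```python
import numpy as np


def scaled_prediction_variance(x_m, X):
    """v(x) = N x^T (X^T X)^{-1} x for a point x_m expanded to model form."""
    N = X.shape[0]
    return float(N * x_m @ np.linalg.solve(X.T @ X, x_m))


def g_criterion(X, region):
    """Worst-case scaled prediction variance over a finite set of points."""
    return max(scaled_prediction_variance(x, X) for x in region)


# First-order model with intercept, fitted on a two-level factorial design.
X = np.array([[1.0, -1.0, -1.0], [1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [1.0, 1.0, 1.0]])
center = np.array([1.0, 0.0, 0.0])
worst_case = g_criterion(X, list(X) + [center])
```

For this design the variance is lowest at the center and highest at the corners, which is where the worst case is attained.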
The use of optimality criteria has proven to be highly beneficial in offline experimental design, allowing
practitioners to pre-determine the location of each design point with ease. However, these methods require
modification to be applied in a stream-based scenario where data points arrive sequentially. A common approach
for obtaining a near-optimal design with streaming observational data is represented by thresholding. Riquelme
(2017) proposed a thresholding algorithm for online active linear regression, which is related to the A-optimality
criterion. Their approach uses a norm-thresholding algorithm, where only observations with large, scaled norms
are selected. The design is augmented with the observations x whose norm exceeds a threshold Γ given by

$$P(\lVert x \rVert \ge \Gamma) = \alpha \tag{42}$$
where α is the ratio of observations we are willing to label out of the incoming data stream. Another approach
related to the A-optimality criterion was proposed by Fontaine et al (2021), who studied online optimal design
under heteroskedasticity assumptions, with the objective of optimally allocating the total labeling budget between
covariates in order to balance the variance of each estimated coefficient. Cacciarelli et al (2022b) further extended
the thresholding approach introduced by Riquelme (2017) by proposing a conditional D-optimality (CDO) algorithm.
The term conditional refers to the fact that the design is marginally optimal, given an initial set of labeled
observations to be augmented. The main steps of the CDO approach are reported in Algorithm 5. The authors
exploited the connection between D-optimality and prediction variance previously highlighted by Myers et al
(2016). The sampling strategy selects observations by setting a threshold Γ given by
$$P\left(x_t^\top (X^\top X)^{-1} x_t \ge \Gamma\right) = \alpha \tag{43}$$
where X is the current set of labeled observations and xt is the data point that is currently under evaluation.
The threshold is estimated using kernel density estimation (KDE) on a set of j unlabeled observations, which are
taken passively from the data stream without querying any label. This provides an initial set of data, referred
to as warm-up set, that can be used to estimate the covariance matrix and the threshold.

Algorithm 5 Online active learning using CDO

Require: an initial random design X, a data stream S, a warm-up length j, a sampling rate α, a budget B
t ← 1 ▷ Timestamp
c ← 0 ▷ Labeling cost
Set W = ∅ ▷ Warm-up set to estimate Σ and Γ
while t ≤ j do
    Observe incoming data point x_t ∈ S
    Select x_t: W = W ∪ {x_t}
    t ← t + 1
end while
Estimate the covariance matrix Σ of W and perform eigendecomposition Σ = UΛU⊤
Whiten the initial design by computing Z = Λ^(−1/2) U⊤ X
Whiten the warm-up observations by computing V = Λ^(−1/2) U⊤ W
Estimate Γ using KDE on V with the desired sampling rate α using Equation 43 with Z and V
while c < B and t ≤ |S| do
    Observe incoming data point x_t ∈ S
    Whiten x_t by computing z_t = Λ^(−1/2) U⊤ x_t
    if z_t⊤ (Z⊤ Z)^(−1) z_t ≥ Γ then
        Ask for the label y_t and augment the labeled dataset: Z ← Z ∪ {z_t}
        c ← c + 1 ▷ Pay for the label
        Update threshold Γ to measure the prediction variance of the enlarged design
    else
        Discard x_t
    end if
    t ← t + 1
end while
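A compact sketch of the steps in Algorithm 5 is given below. It is not the authors' code: the threshold is set with an empirical quantile of the warm-up scores instead of KDE, observations are stored as rows (so whitening is applied by right-multiplication), the warm-up covariance is assumed to be of full rank, and the function name cdo_select is hypothetical.

```python
import numpy as np


def cdo_select(X_init, stream, warmup_len, alpha, budget):
    """Return the stream indices whose labels would be requested."""
    stream = np.asarray(stream, dtype=float)
    W = stream[:warmup_len]                      # unlabeled warm-up set
    # Whitening transform estimated from the warm-up covariance.
    eigvals, U = np.linalg.eigh(np.cov(W, rowvar=False))
    T = U / np.sqrt(eigvals)                     # equals U @ diag(eigvals ** -0.5)
    Z = np.asarray(X_init, dtype=float) @ T      # whitened labeled design
    V = W @ T                                    # whitened warm-up points

    def threshold(Z):
        # Empirical (1 - alpha) quantile of v^T (Z^T Z)^{-1} v over the
        # warm-up points; the source fits a KDE here instead.
        inv = np.linalg.inv(Z.T @ Z)
        scores = np.einsum("ij,jk,ik->i", V, inv, V)
        return np.quantile(scores, 1.0 - alpha), inv

    gamma, inv = threshold(Z)
    queried = []
    for t in range(warmup_len, len(stream)):
        if len(queried) >= budget:
            break                                # budget spent
        z = stream[t] @ T
        if z @ inv @ z >= gamma:
            queried.append(t)                    # pay for the label
            Z = np.vstack([Z, z])
            gamma, inv = threshold(Z)            # refresh for the enlarged design
    return queried


rng = np.random.default_rng(1)
X_init = rng.normal(size=(5, 2))
stream = rng.normal(size=(500, 2))
queried = cdo_select(X_init, stream, warmup_len=100, alpha=0.1, budget=10)
```

Recomputing the threshold after every accepted point keeps the acceptance rate close to α as the design grows, since the prediction variance shrinks with each added observation.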

Cacciarelli et al (2023) also investigated how the presence of outliers affects the performance of online active linear
regression strategies. They showed how the design optimality-based sampling strategies might be attracted to
outliers, whose inclusion in the design eventually degrades the predictive performance of the model. This issue
can be tackled by bounding the search area of the learner with two thresholds, as in
$$P\left(\Gamma_1 \le x_t^\top (X^\top X)^{-1} x_t \le \Gamma_2\right) = \alpha \tag{44}$$
where the choice of $\Gamma_2$ represents a trade-off between seeking protection against outliers and exploring uncertain
regions of the input space.
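One possible way to set the two thresholds from a sample of prediction-variance scores is sketched below. The rule used here, which reserves an upper tail of assumed size outlier_frac and places the lower threshold so that the band has mass α, is an illustrative assumption and not necessarily the procedure of the cited work.

```python
import numpy as np


def band_thresholds(scores, alpha, outlier_frac):
    """Return (gamma_1, gamma_2) so that the band holds a fraction alpha."""
    gamma_2 = np.quantile(scores, 1.0 - outlier_frac)          # upper bound
    gamma_1 = np.quantile(scores, 1.0 - outlier_frac - alpha)  # lower bound
    return float(gamma_1), float(gamma_2)


def in_band(score, gamma_1, gamma_2):
    """Query only points that are informative but not extreme."""
    return gamma_1 <= score <= gamma_2
```

A larger outlier_frac gives more protection against anomalous points at the price of ignoring the most uncertain regions of the input space.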
The norm-thresholding approach was also extended by Riquelme et al (2017a) to the case where the learner
tries to estimate uniformly well a set of models, given a shared budget. This scenario is similar to a multi-armed
bandit (MAB) problem where the learner wants to estimate the mean of a finite set of arms by setting a budget
on the number of allowed pulls (Ruan et al, 2020; Audibert and Munos, 2010; Jamieson and Nowak, 2014; Soare
et al, 2013). The authors propose a trace upper confidence bound (UCB) algorithm to simultaneously estimate
the difficulty of each model and allocate the shared labeling budget proportionally to these estimates. UCB is
a common algorithm used in MAB problems to balance exploration and exploitation (Carpentier et al, 2015;
Garivier and Moulines, 2008), which takes into account the predicted mean value and the predicted standard
deviation, weighted by an adjustable parameter (Thompson et al, 2022). This makes it possible to balance the
exploitation of data points with a high predicted value and the exploration of areas with high uncertainty.
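The exploration and exploitation balance of UCB can be illustrated with the classical UCB1 index, in which a bonus that shrinks with the number of pulls plays the role of the predicted standard deviation. This is a generic sketch, not the trace-UCB algorithm discussed above; the parameter c and the Bernoulli arms are assumptions of the example.

```python
import math
import random


def ucb_index(mean, n_pulls, t, c=2.0):
    """Estimated mean plus an uncertainty bonus weighted by c."""
    return mean + math.sqrt(c * math.log(t) / n_pulls)


def run_ucb(arms, horizon, c=2.0):
    """arms: list of zero-argument callables, each returning a reward."""
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for t in range(1, horizon + 1):
        if t <= len(arms):
            k = t - 1  # pull each arm once to initialize the estimates
        else:
            k = max(range(len(arms)),
                    key=lambda i: ucb_index(means[i], counts[i], t, c))
        reward = arms[k]()
        counts[k] += 1
        means[k] += (reward - means[k]) / counts[k]  # running average
    return counts, means


# Two Bernoulli arms with success probabilities 0.2 and 0.8.
arms = [lambda: float(random.random() < 0.2),
        lambda: float(random.random() < 0.8)]
```

Rarely pulled arms keep a large bonus and are revisited occasionally, while most pulls concentrate on the arm with the highest estimated mean.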
In general, MAB problems can be seen as a special case of sequential experimental design, where the goal is to
sequentially choose experiments to perform with the aim of maximizing some outcome. The typical framework of
a MAB problem can be regarded as an optimization problem where the learner must identify the option or arm
with the highest reward, among a set of available arms characterized by different reward distributions. Both MAB
and active learning paradigms involve a sequential decision-making process where the learner aims to maximize
a reward or improve model accuracy by selecting an arm to pull or a data point to label, respectively, and
receiving feedback (in the form of a reward or a label) for each selection. There are two main approaches
to tackle MAB problems:
• Regret minimization. This approach is coherent with the objective of maximizing the cumulative reward
observed over many trials. In this case, the learner must balance exploration, namely trying out different arms
to learn more about the reward distributions, with exploitation, i.e., using current knowledge to choose the
most promising arm. These kinds of algorithms strike a balance between learning a good model and obtaining
high rewards. A few examples might be treatment design, online advertising and recommender systems.
• Pure exploration. In this case, we are interested in finding the most promising arm, with a certain confidence or
given a fixed budget on the number of pulls. To do so, the objective is to learn a good model while minimizing
the number of measurements or labels required. This scenario is suggested in circumstances where, due to
safety constraints, we are not given complete freedom to change the variable levels and we are mostly interested
in understanding the underlying model governing the system. Possible examples include drug discovery or soft
sensor development (Fortuna et al, 2007; Shi and Xiong, 2018; Chan et al, 2018; Tang et al, 2018).
The pure exploration approach is particularly useful when coupled with the study of linear bandits, which are a
type of contextual bandit algorithms that assume a linear relationship between the features of the context and
the expected reward of each arm. In this type of problem, when an arm $x \in \mathcal{X}$ is pulled, the learner observes a
reward r(x) that depends on an unknown parameter $\theta^* \in \mathbb{R}^d$ according to the linear model

$$r(x) = x^\top \theta^* + \varepsilon \tag{45}$$
where ε is a zero-mean i.i.d. noise. This is similar to active linear regression in that, in both cases, the learner
aims to select the most informative data points to learn about the underlying model or system (Audibert and
Munos, 2010; Jamieson and Nowak, 2014). Soare et al (2014) investigated this problem, in the offline setting,
using the G-optimality criterion and a newly proposed $\mathcal{XY}$-allocation algorithm. Jedra and Proutiere (2020)
proposed a fixed-confidence algorithm for the same problem, while Azizi et al (2022) analyzed the fixed-budget
case, extending the framework to the case where the underlying model is represented by a generalized linear
model (Filippi et al, 2010). An interesting variant of this problem is presented in the study of transductive
experimental designs. A transductive design is a problem where we can pull arms from a set $\mathcal{X} \subset \mathbb{R}^d$, with the
objective of identifying the best arm or improving the predictions over a separate set of observations $\mathcal{Z} \subset \mathbb{R}^d$,
which is given, in an unlabeled form, beforehand. A practical example of this case is when we are trying to infer
the user preferences over a set of products, but we can only do that by pulling arms from a limited set of free
trials. Alternatively, we might be interested in estimating the efficacy of a drug over a certain population, while
doing experiments on a population with different characteristics. This problem has been tackled with an active
learning approach by Yu et al (2006), with the idea of exploiting unlabeled data points in Z while evaluating
the informativeness of the data points in X . The transductive case of sequential experimental design has been
explored by Fiez et al (2019), but instead of performing active learning, they were interested in inferring the best
reward over Z, only pulling the arms in X . Finally, this has been extended to the online scenario by Camilleri
et al (2021), balancing the trade-off between time complexity and label complexity, namely between the number
of unlabeled observations spanned and the number of labels queried in order to stop the learning procedure and
declare the best arm.
In addition to MAB, reinforcement learning-based approaches can also be applied to active learning in order
to optimize a decision-making policy that balances the exploration of uncertain data with the exploitation of
information learned from previous observations. This can be particularly useful in applications where the goal
is to maximize the expected cumulative reward over time, such as in robotics or game playing. Compared to
MAB, reinforcement learning-based approaches offer a more general and flexible framework for active learning,
allowing for a wider range of problem formulations and feedback signals (Menard et al, 2021; Fang et al, 2017;
Rudovic et al, 2019). One approach to combining active learning and reinforcement learning is through modeling
the sampling routine as a contextual-bandit problem, as proposed by Wassermann et al (2019). In this approach,
the rewards are based on the usefulness of the query behavior of the learner. The key intuition behind the use
of reinforcement learning in online active learning is that the learner gets feedback after the requested label,
based on how useful the request actually was. In contrast to the traditional active learning view, where most
of the effort is dedicated to the instance selection phase, the learner is penalized ex-post for querying useless
instances. The learner gets a positive reward ρ+ if it asks for the label when it would have otherwise predicted
the wrong class, and a negative reward ρ− when querying was unnecessary as the model would have predicted
the right label. The contextual bandit problem is implemented by building an ensemble of different models, with
each expert suggesting whether to query or not based on whether its prediction certainty exceeds a threshold Γ.
The models are assigned a decision power based on how past suggestions were rewarded and how coherent they
were with the other experts’ suggestions. When an observation is sent to the oracle for labeling, the reward is
computed, and the objective function of the learner is to maximize the total reward over a time horizon T .
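The ex-post reward and the committee of experts can be sketched as follows. The reward function follows the description above, whereas the weighted majority vote and the exponential weight update are simplifying assumptions made for illustration and do not reproduce the decision-power rule of the cited work; all names and default values are hypothetical.

```python
import math


def query_reward(queried, model_was_correct, rho_plus=1.0, rho_minus=-1.0):
    """Reward a useful query and penalize an unnecessary one."""
    if not queried:
        return 0.0
    return rho_minus if model_was_correct else rho_plus


def committee_decision(certainties, weights, gamma):
    """Each expert votes to query when its certainty is below gamma."""
    votes = [c < gamma for c in certainties]
    query_mass = sum(w for w, v in zip(weights, votes) if v)
    return query_mass >= 0.5 * sum(weights), votes


def update_weights(weights, votes, reward, eta=0.5):
    """Assumed update: experts that voted to query share the reward."""
    new = [w * math.exp(eta * reward) if v else w for w, v in zip(weights, votes)]
    total = sum(new)
    return [w / total for w in new]
```

After a query that turned out to be useful, the experts that suggested it gain weight, so their future suggestions count more in the committee decision.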
Another reinforcement learning-based approach has been proposed by Woodward and Finn (2017). They
considered the case where at each time step t the learner needs to decide whether to predict the label of the
unlabeled data point xt or pay to request its label yt . The reinforcement learning framework is used to find an
optimal policy $\pi^*(s_t)$ that takes into account the cost of asking for a label and the cost of making an incorrect
prediction, where $s_t$ represents the state that is given as input at time t to a policy $\pi(s_t)$ that outputs the
suggested action $a_t$. The authors approximate the action-value function using a long short-term memory (LSTM)
neural network with a linear output layer. The optimal policy is determined by maximizing the long-term reward,
after assigning a reward to a label request $R_{req}$, a correct prediction $R_{corr}$, and an incorrect prediction $R_{inc}$. It
should be noted that $R_{req}$ and $R_{inc}$ should be negative rewards, as they are associated with costly actions.

4 Evaluation strategies
The use of active learning approaches is becoming increasingly common in machine learning, allowing models to
be trained more efficiently by selecting the most informative examples for labeling. To evaluate the performance
of these approaches, it is typical to compare them to a passive random sampling strategy by generating learning
curves that plot the model performance (e.g., accuracy, F1 score, or root mean square error) on a holdout test
set over the number of labeled examples used for training. Learning curves are a useful tool for comparing the
asymptotic performance of different strategies and their sample efficiency, with the slope of the curve reflecting
the rate at which the model performance improves with additional labeled examples. A steeper slope indicates a
more sample-efficient strategy. When multiple sampling strategies are being compared, a visual inspection of the
learning curves may not be sufficient, and more rigorous statistical tests may be necessary. Reyes et al (2018)
recommend the use of non-parametric statistical tests to analyze the effectiveness of active learning strategies
for classification tasks. The sign test (Steel, 1959) or the Wilcoxon signed-ranks test (Wilcoxon, 1945) can be
used to compare two strategies, while the Friedman test (Friedman, 1940), the Friedman aligned-ranks test
(Hodges and Lehmann, 1962), the Friedman test with Iman-Davenport correction (Iman and Davenport, 1980),
or the Quade test (Quade, 1979) can be used when evaluating more than two strategies. These statistical tests
can provide insight into whether the difference in performance between the active learning and passive random
sampling strategies is statistically significant.
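As a simple example of such a comparison, the two-sided sign test can be computed directly from paired results, for instance the final accuracy of two strategies over several datasets or repetitions. The function name and the input format are assumptions of this sketch.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided sign test on paired scores; ties are discarded."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

If one strategy wins on all ten paired comparisons, the p-value is 2/1024, whereas an even split gives a p-value of 1.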

Algorithm 6 Prequential evaluation for online active learning

Require: an initial model w0 , a data stream S, a budget B, an active learning strategy Q.
t ← 1 ▷ Timestamp
c ← 0 ▷ Labeling cost
P ← ∅ ▷ Storing predictions
while c < B and t ≤ |S| do
    Observe the data point x_t ∈ S
    Predict the label ŷ_t and store it: P ← P ∪ {ŷ_t}
    if Q(x_t) = True then ▷ Sampling decision
        Ask for the true label y_t and update the model
        c ← c + 1 ▷ Pay for the label
    else
        Discard x_t
    end if
    t ← t + 1
end while

Overall, the use of learning curves and statistical tests can provide valuable insights into the effectiveness
and efficiency of different active learning strategies. By understanding the statistical significance of differences
in performance between these strategies, researchers can make informed decisions about which approaches are
more effective for a particular task or dataset. Furthermore, the choice of the evaluation scheme is crucial when
assessing the performance of active learning approaches. If we use an evaluation scheme based on a holdout
test set, at each learning step t the performance of the model is assessed using the same test set. This can be
a reasonable approach if we are dealing with a stationary data stream, which does not evolve over time. Under
these assumptions, using the same test set we might be able to better assess the prediction improvement as
more labeled examples are included in the design. However, this approach might not be ideal when dealing with
drifting data streams. In these circumstances, a prequential evaluation scheme can be more useful to monitor
the evolution of the prediction error over time (Suárez-Cetrulo et al, 2021; Cerqueira et al, 2020; Tieppo et al,
2022; Cacciarelli and Boresta, 2021). In online learning, prequential evaluation is also referred to as the test-then-train
approach, and it involves using each incoming instance first to measure the prediction error, and then to
be included in the training set (Suárez-Cetrulo et al, 2023). The main steps of the test-then-train approach are
reported in Algorithm 6. The key idea is that at each time step t, we first test the model by making a prediction,
then we decide whether to query the true label, and finally we update our model.
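The test-then-train loop can be sketched in a few lines. The learner interface (predict and update), the toy MajorityClass model, and the function name prequential_run are assumptions of this example; in a simulation the true label is available for scoring every prediction, whereas the model only sees the labels that were paid for.

```python
class MajorityClass:
    """Toy learner that predicts the most frequent label seen so far."""

    def __init__(self, default=0):
        self.counts = {}
        self.default = default

    def predict(self, x):
        if not self.counts:
            return self.default
        return max(self.counts, key=self.counts.get)

    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_run(stream, model, query, budget):
    """stream: iterable of (x, y); query(x) returns True to request a label."""
    errors, cost = [], 0
    for x, y in stream:
        if cost >= budget:
            break                          # stop once the budget is spent
        y_hat = model.predict(x)           # test first ...
        errors.append(int(y_hat != y))
        if query(x):
            model.update(x, y)             # ... then train on the paid label
            cost += 1
    return errors, cost
```

The list of errors can then be averaged over a sliding or fading window to monitor how the prediction error evolves along the stream.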
An in-depth comparison of the holdout test set and the prequential evaluation
scheme for streaming data has been provided by Gama et al (2009, 2013), who suggested the use of a prequential
evaluation scheme with forgetting mechanisms. For scenarios with imbalanced data streams, a specialized
prequential variant of the area under the curve metric has been proposed by Brzezinski and Stefanowski (2015,
2017). From an implementation perspective, Bifet et al (2010) developed an open source software suite called
MOA for data stream mining, which includes both the holdout and prequential strategies. This framework has
found widespread application in the evaluation of online active learning strategies, as evidenced by the studies
conducted by Liu et al (2021); Shan et al (2019); Weigl et al (2016); Zhang et al (2020a); Alabdulrahman et al
(2016).

| Evaluation strategy | Works |
|---|---|
| Holdout test set | Desalvo et al (2021); Wassermann et al (2019); Rožanec et al (2022); Narr et al (2016); Ferdowsi et al (2013); Bordes et al (2005); Suzuki et al (2021); Ghassemi et al (2016); Qin et al (2021); Woodward and Finn (2017); Riquelme et al (2017b); Cacciarelli et al (2022b, 2023); Manjah et al (2023) |
| Prequential/Test-then-train | Zhang et al (2022); Pham et al (2022); Castellani et al (2022); Chu et al (2011); Zhang et al (2018); Krawczyk et al (2018); Xu et al (2016); Mohamad et al (2020); Weigl et al (2016); Ienco et al (2013); Zhang et al (2020a) |

Table 1 Evaluation strategies.

In Table 1, we categorize the studies based on the experimental protocols they employed to evaluate the sampling
strategies. The table exclusively includes approaches where the evaluation strategy was explicitly defined.
In most cases, when assessing active learning methods in the context of drifting data streams, a prequential
approach is favored. Conversely, for scenarios where the methods are ill-suited to handle concept drifts, holdout
test sets tend to be the preferred choice. In approaches not featured in the table, the evaluation strategies
exhibited some variations or lacked explicit specification. For instance, in the work by Fujii and Kashima (2016),
their evaluation strategy involved training models on the queried data and subsequently testing them with the
entire dataset. This approach differs from the conventional test-then-train paradigm since, in this case, models
are tested on data they encountered during training, at least in part. Another example is found in Zhu et al
(2007), who utilized a window-based approach, assessing prediction accuracy across all observations in the current
batch. On a different note, Hao et al (2018a) employed the per-round regret metric, which quantifies the
loss difference between the forecaster and the best expert at each iteration of the active learning process. In some
instances, none of the previously mentioned methods were employed, as the analysis took a more theoretical perspective.
This is exemplified by the works of Dasgupta et al (2005); Chae and Hong (2021); Huang et al (2022).
Lastly, bandit algorithms employed a distinct evaluation approach, often aiming to identify the most promising
arm with a fixed confidence or budget. In the fixed confidence setting, performance typically hinges on comparing
label complexity to problem dimensionality or the number of arms pulled, as observed in Fiez et al (2019).
Alternatively, regret or error metrics were evaluated against the required number of trials, as demonstrated in
the studies by Riquelme et al (2017a); Sudarsanam and Ravindran (2018); Fontaine et al (2021).

5 Real-world applications and challenges
5.1 Applications
Online active learning has been recognized as a powerful technique in scenarios where data is arriving at a high
velocity, labeling data is expensive, and it is infeasible to store all the unlabeled data before making a decision
about which observations to query to update the model. These techniques have proven particularly
useful in dynamic and ever-evolving environments, where models need to adapt to new data in real time by
selectively querying the most informative instances. One of the first real-world applications of online active
learning has been presented by Sculley (2007), who investigated the scenario of low-cost active spam filtering
(Figure 13) where a filter is updated online by selecting the most informative emails in real time. Another
application of online active learning in the field of IT has been recently presented by Zhang et al (2020a). They
analyzed the scenario of network protocol identification and proposed a method (presented in Section 3.2) to
select the most representative instances on the fly and adapt the model to dynamic data distributions.

[Figure 13: flowchart with the nodes "Stream of emails", "Receive an email", "Filter", "Classify email",
"Query label?" (Yes / No), and "Update filter".]

Fig. 13 Low-cost active spam filtering (Sculley, 2007).
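
The loop in Figure 13 can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions and not the exact setup of Sculley (2007): it uses a bag-of-words perceptron as the filter and a randomized margin-based query rule in which the label is requested with probability b / (b + |margin|). The function name `active_spam_filter`, the parameter `b` and the token-list input format are hypothetical choices made for this example.

```python
import random


def active_spam_filter(stream, b=1.0, seed=0):
    """Single-pass filtering: classify each email, decide whether to query, update if needed.

    `stream` yields (tokens, label) pairs with label +1 (spam) or -1 (ham).
    The label is only looked at when the query rule fires.
    """
    rng = random.Random(seed)
    weights = {}      # token -> weight of the linear filter
    n_queries = 0
    for tokens, label in stream:
        margin = sum(weights.get(tok, 0.0) for tok in tokens)
        prediction = 1 if margin >= 0 else -1
        # Query with high probability when the margin is small (uncertain email).
        if rng.random() < b / (b + abs(margin)):
            n_queries += 1
            if label != prediction:
                # Perceptron update on queried mistakes only.
                for tok in tokens:
                    weights[tok] = weights.get(tok, 0.0) + label
    return weights, n_queries
```

Larger values of `b` make the rule query more often; as the filter becomes confident on recurring patterns, the query probability for those emails drops and labeling effort concentrates on ambiguous messages.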

Computer vision is another interesting area where online active learning can be applied. Deep learning models
require a large amount of annotated data, making manual annotation of thousands of images one of the most
challenging aspects of model development. However, it is important to note that the most effective deep active
learning methods proposed so far are not easily adaptable to a stream-based setting. Many of these methods
involve clustering or measuring pairwise similarity among image embeddings (Sener and Savarese, 2017; Agarwal
et al, 2020; Ash et al, 2019; Citovsky et al, 2021; Prabhu et al, 2020), which cannot be easily done in a single-
pass manner. As a result, most online applications of active learning in computer vision rely on the use of
traditional models with uncertainty-based sampling. Narr et al (2016) analyzed the stream-based active learning
problem for the classification of 3D objects. They used a Mondrian forest classifier (Lakshminarayanan et al,
2014), which is an efficient alternative to random forests for the online learning scenario, and selected images
with high classification uncertainty to be labeled. Rožanec et al (2022) used online active learning to reduce the
data labeling effort while performing vision-based process monitoring. Initially, features are extracted from the
images using a pre-trained ResNet-18 model (He et al, 2015) and then, using the mutual information criterion
(Kraskov et al, 2004), only $\sqrt{n}$ features (Hua et al, 2005) are retained to fit an online classifier, where $n$ is
the total number of observations in the training set. The authors combine a simple active learning strategy
based on model uncertainty with five streaming classification algorithms, including Hoeffding tree (Hulten et al,
2001), Hoeffding adaptive tree (Bifet and Gavaldà, 2009), stochastic gradient tree (Gouk et al, 2019), streaming
logistic regression, and streaming k-nearest neighbors. Recently, Saran et al (2023) proposed a novel approach
to streaming active learning with deep neural networks. Given a neural network $f$ with parameters $\theta$,
last-layer parameters $\theta_L$, and the cross-entropy loss function $\ell$, they compute the gradient
representation of the data point $x_t$, which is given by

$$g(x_t) = \frac{\partial}{\partial \theta_L} \ell\left(f(x_t; \theta), \hat{y}_t\right) \tag{46}$$

where $\hat{y}_t = \operatorname{argmax} f(x_t; \theta)$. Then, the data points to be included in the batch for training the
model are chosen by using a probability $p_t$ proportional to the contribution of the current example to the
covariance matrix of the examples collected so far, as in

$$p_t \propto \det\left(\hat{\Sigma}_t + g(x_t) g(x_t)^\top\right) \tag{47}$$

where $\hat{\Sigma}_t$ is the covariance matrix of the data points that have been selected to be included in the current batch,
up to the time step $t$.
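
To make Equations (46) and (47) concrete, the following sketch computes both quantities for a toy model. It rests on assumptions that are not stated in the text: the last layer is taken to be linear with a softmax output, so the gradient of the cross-entropy with respect to the last-layer weights is the outer product of (predicted probabilities minus the one-hot pseudo-label) and the penultimate features, and the matrix for the current batch is passed in explicitly (an identity matrix is used in the example). The function names are hypothetical, and this is not the authors' implementation.

```python
def last_layer_gradient(probs, features):
    """Eq. (46) for a linear softmax last layer, using the predicted class as pseudo-label."""
    y_hat = max(range(len(probs)), key=lambda k: probs[k])
    residual = [p - (1.0 if k == y_hat else 0.0) for k, p in enumerate(probs)]
    # Flattened outer product residual x features.
    return [r * h for r in residual for h in features]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(a[r][i]))
        if abs(a[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            a[i], a[pivot] = a[pivot], a[i]
            det = -det  # a row swap flips the sign
        det *= a[i][i]
        for r in range(i + 1, n):
            factor = a[r][i] / a[i][i]
            for c in range(i, n):
                a[r][c] -= factor * a[i][c]
    return det


def selection_score(sigma, g):
    """Unnormalized sampling weight of Eq. (47): det(sigma + g g^T)."""
    n = len(g)
    updated = [[sigma[i][j] + g[i] * g[j] for j in range(n)] for i in range(n)]
    return determinant(updated)
```

With an identity matrix and features `[1.0, 2.0]`, an uncertain prediction such as `[0.7, 0.3]` receives a larger score than a confident one such as `[0.99, 0.01]`. For an invertible matrix, the matrix determinant lemma gives det(Σ + g gᵀ) = det(Σ)(1 + gᵀ Σ⁻¹ g), so the score grows with the size of the gradient measured relative to what the batch already contains.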

Online active learning has also been explored for object detection tasks. Manjah et al (2023) proposed
a stream-based active distillation (SBAD) framework by combining the concepts of active learning and self-
supervision as described in Section 2.3. The SBAD framework enables the deployment of scalable deep-learning
models as it does not rely on human annotators and takes into account the imperfection of the oracle when
distilling knowledge from a large teacher model to a lightweight student. Indeed, the authors suggest setting
a threshold on the confidence of the images and only querying images with high confidence, in an attempt to avoid
confirmation bias. The threshold is determined using a warm-up phase, similarly to the approach proposed by
Cacciarelli et al (2022b) presented in Algorithm 5. The SBAD pipeline for model development and evaluation is
reported in Figure 14.
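
A thresholding rule of this kind could look like the sketch below. How the threshold is derived from the warm-up phase is not specified in this section, so the empirical-percentile rule, the `percent` parameter and both function names are assumptions made purely for illustration.

```python
def warmup_threshold(warmup_confidences, percent=80):
    """Assumed rule: take an empirical percentile of the confidences seen during warm-up."""
    ordered = sorted(warmup_confidences)
    index = min(len(ordered) - 1, len(ordered) * percent // 100)
    return ordered[index]


def select_frames(stream, threshold):
    """Single pass over (frame_id, confidence) pairs: keep frames that clear the threshold."""
    return [frame_id for frame_id, confidence in stream if confidence >= threshold]
```

For example, a warm-up sample of `[0.2, 0.9, 0.5, 0.7, 0.6]` with `percent=60` yields a threshold of 0.7, after which each incoming frame is accepted or discarded as soon as it is seen, without storing the stream.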

Fig. 14 SBAD framework (Manjah et al, 2023): sampling, fine-tuning and evaluation. The sampling is performed in a single-pass
manner via thresholding.

The problem of performing active learning for object detection with streaming data has also been explored
by Beck et al (2023). In the case of a camera placed on an autonomous vehicle, the collected data encompasses
various scenarios, including clear weather, foggy conditions, and rainy weather, all of which require the model
to perform effectively. However, the frequency of these scenarios can vary significantly. In situations where
one scenario is prevalent, a passive sampling strategy tends to sample very few examples from the
rarest slices. Instead, the proposed Streamline approach attempts to allocate the budget so as to obtain
more observations from the slices where the model is underperforming. The case of autonomous cars was also
considered by Yan et al (2023), who used a diversity-based online active learning strategy to reduce the false alarm
rate and learn unseen faults.
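
The idea of steering the labeling budget toward underperforming slices can be illustrated with a simple proportional rule. The sketch below is not the Streamline algorithm: it assumes that an integer error count per slice is available (for instance from a small validation set) and splits the budget in proportion to those counts, using largest remainders so that the allocations sum exactly to the budget. The function name and the slice names in the example are hypothetical.

```python
def allocate_budget(slice_errors, budget):
    """Split an integer labeling budget across slices in proportion to integer error counts."""
    total = sum(slice_errors.values())
    if total == 0:
        # No errors observed anywhere: fall back to an equal split.
        slice_errors = {name: 1 for name in slice_errors}
        total = len(slice_errors)
    quotas = {name: budget * err // total for name, err in slice_errors.items()}
    remainders = {name: budget * err % total for name, err in slice_errors.items()}
    leftover = budget - sum(quotas.values())
    # Hand out the remaining labels to the slices with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[name] += 1
    return quotas
```

With error counts `{"clear": 1, "fog": 6, "rain": 3}` and a budget of 50 labels, the foggy slice receives 30 labels, whereas passive sampling would allocate labels according to how often each condition occurs in the stream.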
Another interesting industrial application has been recently presented by Ghiasi et al (2023). They proposed
a deployable framework that combines a thermodynamics-based compressor model and a Gaussian Process-based
surrogate model with an online active learning module. The objective of the study was to minimize the power
absorbed by the machine during the boil-off process of a centrifugal compressor. In the proposed framework, the
simulator, the surrogate model, and the optimizer interact in real time to determine the new experimental points.

5.2 Challenges
When applying online active learning strategies to real-world problems, there are several potential issues to
consider, including:
• Algorithm scalability. Online active learning algorithms need to be efficient and scalable to handle large datasets
and high-velocity data streams. As the amount of data grows, the computational demands of active learning
can become prohibitive, making it difficult to deploy in practice. The time required to make the sampling
decision needs to be shorter than the interval between consecutive arrivals in the process being analyzed. If the algorithm is too slow, it
may require a buffer, which reduces the benefits of online active learning.
• Labeling quality. Most online active learning strategies rely heavily on the quality of labeled data, which can
be challenging to ensure in real-world scenarios. Human annotators may make errors, introduce biases, or
interpret labeling instructions differently. For this reason, in real-life situations, it may be necessary to consider
oracle imperfections like in the knowledge distillation case (Baykal et al, 2022). Another difficult aspect related
to labeling quality is the delay or latency, which has been described in Section 2.2.3.

• Data drift. In real-world settings, data distributions may shift over time, making it challenging for models to
adapt and continue providing accurate predictions. Changes in the data distribution may also affect the quality
of the labeled data, as the criteria for selecting informative instances may become less effective. Methods from
Sections 3.2 and 3.3 should be used when dynamic and ever-changing behaviors are expected.
• Model interpretability. Besides simply asking for the most informative instances from a modeling perspective,
it might be useful to provide additional information on why a particular instance is beneficial for improving
the performance of the current model. In fields like healthcare and manufacturing this might help practitioners
to improve their understanding of the underlying problem.
• Evaluation. When developing active learning methods from a research perspective, the different query strate-
gies are evaluated assuming the ground-truth labels to be available for a held-out test set, or for the data
stream being analyzed. However, in real life, the key motivation behind active learning is label scarcity and
thus it might be difficult to thoroughly assess the effectiveness of the deployed sampling strategy.
• Human-computer interaction. In the context of active learning for data streams, the synergy between human
labelers and computer systems plays a pivotal role in the labeling process. While the majority of online active
learning methods focus on querying the most informative data points in real-time, we can distinguish between
two distinct labeling scenarios:
1. Real-time annotation. In most of the presented works, it is assumed that labels are immediately available
when a data point is queried from the stream. This immediate access to true labels enables an optimized
active learning routine, as the model can be promptly updated and can recommend exploration of new
regions based on up-to-date information. However, this approach poses some implementation challenges
that need to be addressed with the use of advanced data annotation tools (Feuz and Cook, 2013).
2. Postponed annotation. There are cases where we must allow for a delay between data querying and labeling.
For instance, methods that consider verification latency (Castellani et al, 2022; Pham et al, 2022) take
into account the possibility of delayed labels. This is particularly relevant in situations where a physical
quality inspection or medical treatment must occur before the label is revealed. Another example is in the
training of deep neural networks, where real-time sampling from a data stream is necessary due to memory
constraints (Manjah et al, 2023), but the labeling and model update phase may occur when a batch is
collected, following a batch-mode active learning strategy (Ren et al, 2022).
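
The difference between real-time and postponed annotation can be made explicit with a small simulation. The sketch below is an assumption-laden illustration rather than any of the cited methods: it models verification latency as a fixed `delay` in time steps, and `should_query` and `update` are hypothetical callbacks standing in for the query strategy and the model update.

```python
from collections import deque


def run_with_label_delay(stream, should_query, update, delay):
    """Simulate a stream where a queried label arrives `delay` time steps after the query.

    Returns the list of (arrival_time, x) updates that were applied and the
    queries whose labels had not yet arrived when the stream ended.
    """
    pending = deque()  # entries: (time the label becomes available, x, y)
    applied = []
    for t, (x, y) in enumerate(stream):
        if should_query(x):
            pending.append((t + delay, x, y))
        # Apply every label that has become available by time t.
        while pending and pending[0][0] <= t:
            _, x_old, y_old = pending.popleft()
            update(x_old, y_old)
            applied.append((t, x_old))
    return applied, list(pending)
```

Setting `delay=0` reproduces the real-time annotation scenario, in which the model is updated in the same step as the query. With a positive delay, sampling decisions made in the meantime rely on a model that has not yet seen the pending labels.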

6 Summary and future directions

This survey outlines the challenge of conducting active learning with data streams and investigates different
approaches for selecting the most informative data points in real-time.
Table 2 provides a summary of the relevant state-of-the-art approaches, highlighting their main properties
and settings. Our examination reveals that existing research has predominantly concentrated on creating online
classification models, which can operate with both stationary and drifting data streams. However, there has
been comparatively limited effort devoted to online active linear regression or dedicated to constructing online
regression models in general.
We believe that there are several promising directions for future research in this field. First, we recommend
further investigation into online active learning strategies specifically designed for regression models. Given the
limited work in this area, there is a need for more advanced methods that can be applied to nonlinear models,
beyond linear models or linear bandits. For example, there has been a recent surge of interest in the use of
Bayesian optimization for active learning in nonlinear regression problems (Mohamadi and Amindavar, 2020; Riis
et al, 2022). Additionally, model-agnostic methods that can be applied to a variety of regression models could be
valuable as they would provide a more general solution to the problem. Second, we believe that there is potential
for research into single-pass online sampling strategies for dynamic data streams. Ensemble models and batch-
based approaches have been the dominant methods in online classification, but some of their assumptions or
requirements may not hold in many real-world applications. For instance, in some applications, data may arrive
in a continuous stream, and it may not be possible to divide it into batches due to time or memory constraints.
In such cases, single-pass online sampling strategies that do not require the use or update of multiple models
would be more practical. Moreover, it could be beneficial to develop online active learning strategies that are able
to tackle all the types of distribution shifts introduced in Section 3.2. Finally, the combination of reinforcement
learning and active learning in pool-based scenarios is an area of ongoing research. We believe that the study of
online reinforcement learning to optimize sampling strategies could provide valuable insights into how to best
perform active learning in dynamic environments.

| Data processing | Data stream | Task | Model | Work(s) |
|---|---|---|---|---|
| Single-pass | Stationary | Classification | Single model | Cesa-Bianchi et al (2004, 2006); Dasgupta et al (2005); Sculley (2007); Lu et al (2016); Hao et al (2018b); Ghassemi et al (2016); Shah and Manwani (2020); Mohamad et al (2020); Saran et al (2023); Rožanec et al (2022); Woodward and Finn (2017) |
| Single-pass | Stationary | Classification | Ensemble | Huang et al (2022); Desalvo et al (2021); Loy et al (2012); Hao et al (2018a); Chae and Hong (2021) |
| Single-pass | Stationary | Regression | Single model | Riquelme (2017); Fontaine et al (2021); Cacciarelli et al (2022b, 2023, 2022a) |
| Single-pass | Stationary | Object detection | Single model | Manjah et al (2023) |
| Single-pass | Drifting | Classification | Single model | Krawczyk et al (2018); Castellani et al (2022); Pham et al (2022); Yin et al (2023); Mohamad et al (2018); Liu et al (2021); Kurlej and Woźniak (2011); Chu et al (2011) |
| Single-pass | Drifting | Classification | Ensemble | Zhang et al (2020a); Shan et al (2019); Zhang et al (2018, 2022) |
| Single-pass | Evolving | Classification | Single model | Lughofer (2012); Pratama et al (2015) |
| Single-pass | Evolving | Regression | Single model | Lughofer and Pratama (2018); Lughofer and Škrjanc (2023) |
| Batch | Stationary | Classification | Single model | Bordes et al (2005); Qin et al (2021); Fujii and Kashima (2016) |
| Batch | Stationary | Object detection | Single model | Beck et al (2023) |
| Batch | Drifting | Classification | Single model | Cheng et al (2023); Martins et al (2023); Ienco et al (2013); Zhang et al (2023); Yan et al (2023) |
| Batch | Drifting | Classification | Ensemble | Zhu et al (2007); Woźniak et al (2023); Halder et al (2023) |
| Batch | Evolving | Classification | Single model | Subramanian et al (2014); Weigl et al (2016); Cernuda et al (2014) |

Table 2 Online active learning strategies: summary based on data processing capabilities,
assumptions about the data stream, task of the model and model characteristics.

7 Conclusion
The field of online active learning with data streams is a rapidly evolving and highly relevant area of research
in machine learning. The ability to effectively learn from data streams in real-time is becoming increasingly
important, as the amount of data generated by modern applications continues to grow at an exponential rate.
However, obtaining annotated data to train complex prediction and decision-making models presents a major
roadblock. This hinders the proper integration of artificial intelligence models with real-world applications such
as healthcare, autonomous driving and industrial production. Our survey provides a comprehensive overview of
the current state of the art in this field and highlights the challenges and opportunities that researchers face
when developing methods for online active learning. We reviewed a wide range of strategies for selecting the most
informative data points in online active learning, including methods based on uncertainty sampling, diversity
sampling, query by committee, and reinforcement learning, among others. Our analysis has shown that these
strategies have been applied in a variety of contexts, including online classification, online regression, and online
semi-supervised learning. We hope that this survey will inspire further research in the field of online active
learning with data streams and encourage the development of new and advanced methods for handling this type
of data. In particular, we believe that there is significant potential for the development of model-agnostic and
single-pass online active learning strategies that can be applied in practical settings.

Acknowledgments
The authors gratefully acknowledge the support of the DTU Strategic Alliances Fund, which made this research
possible. We would also like to extend our sincere thanks to John Sølve Tyssedal for his invaluable help and
support throughout the project.

References
Agarwal S, Arora H, Anand S, et al (2020) Contextual diversity for active learning. European Conference on
Computer Vision 2020 https://doi.org/10.1007/978-3-030-58517-4_9, URL http://arxiv.org/abs/2008.05723

Aggarwal CC, Kong X, Gu Q, et al (2014) Data Classification (Chapter: "Active Learning: A Survey"). Taylor
& Francis, URL http://charuaggarwal.net/active-survey.pdf

Aguiar G, Krawczyk B, Cano A (2023) A survey on learning from imbalanced data streams: taxonomy, challenges,
empirical study, and reproducible experimental framework. Machine Learning pp 1–79

Alabdulrahman R, Viktor H, Paquet E (2016) An active learning approach for ensemble-based data stream
mining. In: International Conference on Knowledge Discovery and Information Retrieval, SCITEPRESS, pp
275–282

Ash JT, Zhang C, Krishnamurthy A, et al (2019) Deep batch active learning by diverse, uncertain gradient lower
bounds. 2020 International Conference on Learning Representations URL http://arxiv.org/abs/1906.03671

Asprey S, Macchietto S (2002) Designing robust optimal dynamic experiments. Journal of Process Control
12:545–556. https://doi.org/10.1016/S0959-1524(01)00020-8

Audibert JY, Munos R (2010) Best arm identification in multi-armed bandits. COLT - 23th Conference on
Learning Theory URL http://certis.enpc.fr/~audibert/Mes%20articles/COLT10.pdf

Avadhanula V, Colini Baldeschi R, Leonardi S, et al (2021) Stochastic bandits for multi-platform budget
optimization in online advertising. In: Proceedings of the Web Conference 2021, pp 2805–2817

Azizi MJ, Kveton B, Ghavamzadeh M (2022) Fixed-budget best-arm identification in structured bandits.
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22) URL
https://www.ijcai.org/proceedings/2022/0388.pdf

Baier L, Schlör T, Schöffer J, et al (2021) Detecting concept drift with neural network model uncertainty. Hawaii
International Conference on System Sciences (HICSS) 2023 URL http://arxiv.org/abs/2107.01873

Balcan MF, Broder A, Zhang T (2007) Margin based active learning. COLT - 23th Conference on Learning
Theory 4739. https://doi.org/10.1007/978-3-540-72927-3_5

Bassily R, Smith A, Thakurta A (2014) Private empirical risk minimization: Efficient algorithms and tight
error bounds. 2014 IEEE 55th Annual Symposium on Foundations of Computer Science pp 464–473.
https://doi.org/10.1109/FOCS.2014.56

Baum E, Lang K (1992) Query learning can work poorly when a human oracle is used. Proceedings of the IEEE
International Joint Conference on Neural Networks

Baykal C, Trinh K, Iliopoulos F, et al (2022) Robust active distillation. URL http://arxiv.org/abs/2210.01213

Beck N, Kothawade S, Shenoy P, et al (2023) Streamline: Streaming active learning for realistic multi-
distributional settings. arXiv preprint arXiv:2305.10643

Bifet A, Gavaldà R (2007) Learning from time-changing data with adaptive windowing. Proceedings of the 2007
SIAM International Conference on Data Mining pp 443–448. https://doi.org/10.1137/1.9781611972771.42

Bifet A, Gavaldà R (2009) Adaptive learning from evolving data streams. IDA 2009: Advances in Intelligent
Data Analysis VIII pp 249–260. https://doi.org/10.1007/978-3-642-03915-7_22

Bifet A, Holmes G, Pfahringer B, et al (2010) Moa: Massive online analysis, a framework for stream classification
and clustering. In: Proceedings of the first workshop on applications of pattern analysis, PMLR, pp 44–50

Bisgaard S, Kulahci M (2011) Time series analysis and forecasting by example. John Wiley & Sons

Bordes A, Ertekin S, Weston J, et al (2005) Fast kernel classifiers with online and active learning. The Journal
of Machine Learning Research 6. URL https://jmlr.csail.mit.edu/papers/v6/bordes05a.html

Bouchachia A, Vanaret C (2014) Gt2fc: An online growing interval type-2 self-learning fuzzy classifier. IEEE
Transactions on Fuzzy Systems 22:999–1018. https://doi.org/10.1109/TFUZZ.2013.2279554

Brzezinski D, Stefanowski J (2015) Prequential auc for classifier evaluation and drift detection in evolving data
streams. 3rd International Workshop on New Frontiers in Mining Complex Patterns, (NFMCP 2014) pp
87–101. https://doi.org/10.1007/978-3-319-17876-9_6

Brzezinski D, Stefanowski J (2017) Prequential auc: properties of the area under the roc curve for data
streams with concept drift. Knowledge and Information Systems 52:531–562.
https://doi.org/10.1007/s10115-017-1022-8

Burbidge R, Rowland JJ, King RD (2007) Active learning for regression based on query by committee. 8th
International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2007
https://doi.org/10.1007/978-3-540-77226-2_22

Cacciarelli D, Boresta M (2021) What drives a donor? a machine learning-based approach for predicting responses
of nonprofit direct marketing campaigns. International Journal of Nonprofit and Voluntary Sector Marketing
https://doi.org/10.1002/nvsm.1724

Cacciarelli D, Kulahci M (2022) A novel fault detection and diagnosis approach based on orthogonal autoen-
coders. Computers & Chemical Engineering 163:107853. https://doi.org/10.1016/j.compchemeng.2022.107853

Cacciarelli D, Kulahci M (2023) Hidden dimensions of the data: Pca vs autoencoders. Quality Engineering pp
1–10

Cacciarelli D, Kulahci M, Tyssedal J (2022a) Online active learning for soft sensor development using semi-
supervised autoencoders. ICML 2022 Workshop on Adaptive Experimental Design and Active Learning in the
Real World URL https://arxiv.org/abs/2212.13067

Cacciarelli D, Kulahci M, Tyssedal JS (2022b) Stream-based active learning with linear models. Knowledge-Based
Systems 254:109664. https://doi.org/10.1016/j.knosys.2022.109664

Cacciarelli D, Kulahci M, Tyssedal JS (2023) Robust online active learning. Quality and Reliability Engineering
International https://doi.org/10.1002/qre.3392, URL https://onlinelibrary.wiley.com/doi/abs/10.1002/qre.3392,
https://onlinelibrary.wiley.com/doi/pdf/10.1002/qre.3392

Cai W, Zhang Y, Zhou J (2013) Maximizing expected model change for active learning in regression. Proceedings
- IEEE International Conference on Data Mining, ICDM pp 51–60. https://doi.org/10.1109/ICDM.2013.104

Camilleri R, Xiong Z, Fazel M, et al (2021) Selective sampling for online best-arm identification. 35th Conference
on Neural Information Processing Systems (NeurIPS 2021) URL http://arxiv.org/abs/2110.14864

Carcillo F, Le Borgne YA, Caelen O, et al (2017) An assessment of streaming active learning strategies for
real-life credit card fraud detection. In: 2017 IEEE International Conference on Data Science and Advanced
Analytics (DSAA), IEEE, pp 631–639

Carcillo F, Le Borgne YA, Caelen O, et al (2018) Streaming active learning strategies for real-life credit card
fraud detection: assessment and visualization. International Journal of Data Science and Analytics 5:285–300

Carnein M, Trautmann H (2019) Customer segmentation based on transactional data using stream clustering.
In: Advances in Knowledge Discovery and Data Mining: 23rd Pacific-Asia Conference, PAKDD 2019, Macau,
China, April 14-17, 2019, Proceedings, Part I 23, Springer, pp 280–292

Carpentier A, Lazaric A, Ghavamzadeh M, et al (2015) Upper-confidence-bound algorithms for active learning
in multi-armed bandits

Castellani A, Schmitt S, Hammer B (2022) Stream-based active learning with verification latency in non-
stationary environments. https://doi.org/10.1007/978-3-031-15937-4_22, URL http://arxiv.org/abs/2204.06822

Cernuda C, Lughofer E, Mayr G, et al (2014) Incremental and decremental active learning for optimized self-
adaptive calibration in viscose production. Chemometrics and Intelligent Laboratory Systems 138:14–29.
https://doi.org/10.1016/j.chemolab.2014.07.008

Cerqueira V, Torgo L, Mozetič I (2020) Evaluating time series forecasting models: an empirical study on
performance estimation methods. Machine Learning 109:1997–2028. https://doi.org/10.1007/s10994-020-05910-7

Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games. Cambridge University Press,
https://doi.org/10.1017/CBO9780511546921

Cesa-Bianchi N, Gentile C, Zaniboni L (2004) Worst-case analysis of selective sampling for linear-threshold
algorithms. Advances in Neural Information Processing Systems URL
https://proceedings.neurips.cc/paper_files/paper/2004/hash/92426b262d11b0ade77387cf8416e153-Abstract.html

Cesa-Bianchi N, Gentile C, Zaniboni L (2006) Worst-case analysis of selective sampling for linear classification.
The Journal of Machine Learning Research 7. URL
https://www.jmlr.org/papers/volume7/cesa-bianchi06b/cesa-bianchi06b.pdf

Chae J, Hong S (2021) Stream-based active learning with multiple kernels. 2021 International Conference on
Information Networking (ICOIN) pp 718–722. https://doi.org/10.1109/ICOIN50884.2021.9333940

Chan LLT, Wu QY, Chen J (2018) Dynamic soft sensors with active forward-update learning for selection
of useful data from historical big database. Chemometrics and Intelligent Laboratory Systems 175:87–103.
https://doi.org/10.1016/j.chemolab.2018.01.015

Cheng J, Zheng Z, Guo Y, et al (2023) Active broad learning with multi-objective evolution for data stream
classification. Complex & Intelligent Systems pp 1–18

Chu W, Zinkevich M, Li L, et al (2011) Unbiased online active learning in data streams. Proceedings of the
17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’11 p 195.
https://doi.org/10.1145/2020408.2020444

Citovsky G, DeSalvo G, Gentile C, et al (2021) Batch active learning at scale. 35th Conference on Neural
Information Processing Systems, NeurIPS 2021 URL http://arxiv.org/abs/2107.14263

Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. Journal of Artificial
Intelligence Research 4:129–145. https://doi.org/10.1613/jair.295

Crammer K, Dekel O, Keshet J, et al (2006) Online passive-aggressive algorithms. The Journal of Machine
Learning Research URL https://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf

Dasgupta S, Kalai AT, Monteleoni C (2005) Analysis of perceptron-based active learning. COLT ’05 -
International Conference on Computational Learning Theory pp 249–263. https://doi.org/10.1007/11503415_17

Desalvo G, Gentile C, Thune TS (2021) Online active learning with surrogate loss functions. Advances in Neural
Information Processing Systems 34 (NeurIPS 2021) URL
https://proceedings.neurips.cc/paper/2021/hash/c1619d2ad66f7629c12c87fe21d32a58-Abstract.html

Donmez P, Carbonell J, Bennet P (2007) Dual strategy active learning. 18th European Conference on Machine
Learning, ECML 2007 4701. https://doi.org/10.1007/978-3-540-74958-5_14

Duchi JC, Jordan MI, Wainwright MJ (2013) Local privacy and statistical minimax rates. 2013 IEEE 54th
Annual Symposium on Foundations of Computer Science pp 429–438. https://doi.org/10.1109/FOCS.2013.53

Ebbinghaus H (2013) Memory: A contribution to experimental psychology. Annals of Neurosciences 20.
https://doi.org/10.5214/ans.0972.7531.200408

Fang M, Li Y, Cohn T (2017) Learning how to active learn: A deep reinforcement learning approach. URL
https://arxiv.org/abs/1708.02383

Ferdowsi Z, Ghani R, Settimi R (2013) Online active learning with imbalanced classes. 2013 IEEE 13th
International Conference on Data Mining pp 1043–1048. https://doi.org/10.1109/ICDM.2013.12

Feuz KD, Cook DJ (2013) Real-time annotation tool (rat). In: Workshops at the Twenty-Seventh AAAI
Conference on Artificial Intelligence

Fiez T, Jain L, Jamieson K, et al (2019) Sequential experimental design for transductive linear bandits. 33rd
Conference on Neural Information Processing Systems (NeurIPS 2019) URL
https://proceedings.neurips.cc/paper_files/paper/2019/file/8ba6c657b03fc7c8dd4dff8e45defcd2-Paper.pdf

Filippi S, Cappe O, Garivier A, et al (2010) Parametric bandits: The generalized linear case. Advances in Neural
Information Processing Systems 23 (NIPS 2010) URL
https://papers.nips.cc/paper_files/paper/2010/hash/c2626d850c80ea07e7511bbae4c76f4b-Abstract.html

Fontaine X, Perrault P, Valko M, et al (2021) Online a-optimal design and active linear regression. URL
http://proceedings.mlr.press/v139/fontaine21a/fontaine21a.pdf

Fortuna L, Graziani S, Rizzo A, et al (2007) Soft sensors for monitoring and control of industrial processes,
vol 22. Springer, URL https://link.springer.com/book/10.1007/978-1-84628-480-9

Fowler K, Kokilepersaud K, Prabhushankar M, et al (2023) Clinical trial active learning. In: The 14th ACM
Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB)

Freeman PR (1983) The secretary problem and its extensions: A review. International Statistical Review
51:189–206. URL https://www.jstor.org/stable/1402748

Freund Y, Seung HS, Shamir E, et al (1997) Selective sampling using the query by committee algorithm. Machine
Learning 28:133–168. https://doi.org/10.1023/a:1007330508534

Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. The Annals
of Mathematical Statistics 11:86–92. https://doi.org/10.1214/aoms/1177731944

Frumosu FD, Kulahci M (2018) Big data analytics using semi-supervised learning methods. Quality and
Reliability Engineering International 34:1413–1423. https://doi.org/10.1002/qre.2338

Fu Y, Zhu X, Li B (2013) A survey on instance selection for active learning. Knowledge and Information Systems
35:249–283. https://doi.org/10.1007/s10115-012-0507-8

Fujii K, Kashima H (2016) Budgeted stream-based active learning via adaptive submodular maximization. 30th
Annual Conference on Neural Information Processing Systems, NIPS 2016 URL
https://proceedings.neurips.cc/paper/2016/hash/07cdfd23373b17c6b337251c22b7ea57-Abstract.html

Gajjar S, Kulahci M, Palazoglu A (2018) Real-time fault detection and diagnosis using sparse principal
component analysis. Journal of Process Control 67:112–128. https://doi.org/10.1016/j.jprocont.2017.03.005

Galvanin F (2010) Optimal model-based design of experiments in dynamic systems: novel techniques and
unconventional applications. Thesis URL https://hdl.handle.net/11577/3427095

Gama J, Medas P, Castillo G, et al (2004) Learning with drift detection. Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 3171:286–295.
https://doi.org/10.1007/978-3-540-28645-5_29

Gama J, Sebastiao R, Rodrigues PP (2009) Issues in evaluation of stream learning algorithms. In: Proceedings
of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 329–338

Gama J, Sebastiao R, Rodrigues PP (2013) On evaluating stream learning algorithms. Machine learning 90:317–
346

Garivier A, Moulines E (2008) On upper-confidence bound policies for non-stationary bandit problems. URL
https://arxiv.org/abs/0805.3415

Ge D, Zeng XJ (2020) Learning data streams online — an evolving fuzzy system approach with self-
learning/adaptive thresholds. Information Sciences 507:172–184. https://doi.org/10.1016/j.ins.2019.08.036

Ge Z (2014) Active learning strategy for smart soft sensor development under a small number of labeled data
samples. Journal of Process Control 24:1454–1461. https://doi.org/10.1016/j.jprocont.2014.06.015

Gemaque RN, Costa AFJ, Giusti R, et al (2020) An overview of unsupervised drift detection methods. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10. https://doi.org/10.1002/widm.1381

Ghassemi M, Sarwate AD, Wright RN (2016) Differentially private online active learning with applications
to anomaly detection. AISec 2016 - Proceedings of the 2016 ACM Workshop on Artificial Intelligence and
Security, co-located with CCS 2016 pp 117–128. https://doi.org/10.1145/2996758.2996766

Ghiasi S, Pazzi G, Del Grosso C, et al (2023) Combining thermodynamics-based model of the centrifugal
compressors and active machine learning for enhanced industrial design optimization. In: 1st Workshop on the
Synergy of Scientific and Machine Learning Modeling@ ICML2023

Goodfellow IJ, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial networks. URL
https://arxiv.org/abs/1406.2661

Gouk H, Pfahringer B, Frank E (2019) Stochastic gradient trees. URL
http://proceedings.mlr.press/v101/gouk19a/gouk19a.pdf

Gu X, Han J, Shen Q, et al (2022) Autonomous learning for fuzzy systems: a review. Artificial Intelligence
Review https://doi.org/10.1007/s10462-022-10355-6

Gu X, Han J, Shen Q, et al (2023) Autonomous learning for fuzzy systems: a review. Artificial Intelligence
Review 56(8):7549–7595

Halder B, Hasan KA, Amagasa T, et al (2023) Autonomic active learning strategy using cluster-based ensemble
classifier for concept drifts in imbalanced data stream. Expert Systems with Applications p 120578

Hanneke S (2014) Theory of disagreement-based active learning. Foundations and Trends in Machine Learning
7:131–309. https://doi.org/10.1561/2200000037

Hanneke S, Yang L (2021) Toward a general theory of online selective sampling: Trading off mistakes and
queries. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics URL
https://proceedings.mlr.press/v130/hanneke21a.html

Hao S, Hu P, Zhao P, et al (2018a) Online active learning with expert advice. ACM Transactions on Knowledge
Discovery from Data 12. https://doi.org/10.1145/3201604

Hao S, Lu J, Zhao P, et al (2018b) Second-order online active learning and its applications. IEEE Transactions
on Knowledge and Data Engineering 30:1338–1351. https://doi.org/10.1109/TKDE.2017.2778097

Haussmann E, Fenzi M, Chitta K, et al (2020) Scalable active learning for object detection. Proceedings 31st
IEEE Intelligent Vehicles Symposium (IV) https://doi.org/10.1109/IV47402.2020.9304793

He K, Zhang X, Ren S, et al (2015) Deep residual learning for image recognition. Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition
https://doi.org/10.1109/CVPR.2016.90

Hoang TN, Hong S, Xiao C, et al (2021) Aid: Active distillation machine to leverage pre-trained black-box
models in private data settings. Proceedings of the Web Conference 2021 pp 3569–3581.
https://doi.org/10.1145/3442381.3449944

Hodges J, Lehmann E (1962) Rank methods for combination of independent experiments in analysis of variance.
The Annals of Mathematical Statistics

Hoffmann H (2007) Kernel pca for novelty detection. Pattern Recognition 40:863–874.
https://doi.org/10.1016/j.patcog.2006.07.009

Hoi SC, Sahoo D, Lu J, et al (2021) Online learning: A comprehensive survey. Neurocomputing 459:249–289.
https://doi.org/10.1016/j.neucom.2021.04.112

Hoi SCH, Jin R, Zhao P, et al (2013) Online multiple kernel classification. Machine Learning 90:289–316.
https://doi.org/10.1007/s10994-012-5319-2

Houlsby N, Hernandez-Lobato JM, Ghahramani Z (2014) Cold-start active learning with robust ordinal matrix
factorization. 31st International Conference on Machine Learning URL
https://proceedings.mlr.press/v32/houlsby14.html

Hua J, Xiong Z, Lowey J, et al (2005) Optimal number of features as a function of sample size for various
classification rules. Bioinformatics 21:1509–1515. https://doi.org/10.1093/bioinformatics/bti171

Huang B, Salgia S, Zhao Q (2022) Disagreement-based active learning in online settings. IEEE Transactions on
Signal Processing 70:1947–1958. https://doi.org/10.1109/TSP.2022.3159388

Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: Theory and applications. Neurocomputing
70:489–501. https://doi.org/10.1016/j.neucom.2005.12.126

Huang SJ, Jin R, Zhou ZH (2014) Active learning by querying informative and representative examples. IEEE
Transactions on Pattern Analysis and Machine Intelligence 36:1936–1949.
https://doi.org/10.1109/TPAMI.2014.2307881

Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. Proceedings of the seventh ACM
SIGKDD international conference on Knowledge discovery and data mining - KDD ’01 pp 97–106.
https://doi.org/10.1145/502512.502529

Ienco D, Bifet A, Žliobaitė I, et al (2013) Clustering based active learning for evolving data streams. 16th
International Conference on Discovery Science https://doi.org/10.1007/978-3-642-40897-7_6

Ienco D, Pfahringer B, Žliobaitė I (2014) High density-focused uncertainty sampling for active learning over
evolving stream data. BIGMINE’14: Proceedings of the 3rd International Conference on Big Data, Streams
and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications URL
https://proceedings.mlr.press/v36/ienco14.html

Iman RL, Davenport JM (1980) Approximations of the critical region of the Friedman statistic. Communications
in Statistics - Theory and Methods 9:571–595. https://doi.org/10.1080/03610928008827904

Istrate R, Malossi ACI, Bekas C, et al (2018) Incremental training of deep convolutional neural networks. URL
https://arxiv.org/abs/1803.10232

Jamieson K (2018) Online and adaptive machine learning. regression (part 7). URL
https://courses.cs.washington.edu/courses/cse599i/18wi/

Jamieson K, Nowak R (2014) Best-arm identification algorithms for multi-armed bandits in the fixed confidence
setting. 2014 48th Annual Conference on Information Sciences and Systems (CISS) pp 1–6.
https://doi.org/10.1109/CISS.2014.6814096

Jamil S, Khan A (2016) Churn comprehension analysis for telecommunication industry using alba. In: 2016
International Conference on Emerging Technologies (ICET), IEEE, pp 1–5

Jedra Y, Proutiere A (2020) Optimal best-arm identification in linear bandits. 34th Conference on Neural
Information Processing Systems (NeurIPS 2020) URL
https://proceedings.neurips.cc/paper/2020/file/7212a6567c8a6c513f33b858d868ff80-Paper.pdf

Jin Q, Yuan M, Li S, et al (2022) Cold-start active learning for image classification. Information Sciences
616:16–36. https://doi.org/10.1016/j.ins.2022.10.066

Jin R, Hoi S, Yang T (2010) Online multiple kernel learning: Algorithms and mistake bounds. Proceedings of the
21st International Conference on Algorithmic Learning Theory https://doi.org/10.1007/978-3-642-16108-7_31

John RCS, Draper NR (1975) D-optimality for regression designs: A review. Technometrics 17:15–23.
https://doi.org/10.1080/00401706.1975.10489266

Joshi AJ, Porikli F, Papanikolopoulos N (2009) Multi-class active learning for image classification. 2009 IEEE
Conference on Computer Vision and Pattern Recognition pp 2372–2379.
https://doi.org/10.1109/CVPR.2009.5206627

Joyce JM (2011) Kullback-leibler divergence. https://doi.org/10.1007/978-3-642-04898-2_327

Karlin S, Studden WJ (1966) Optimal experimental designs. The Annals of Mathematical Statistics 37:783–815.
URL https://www.jstor.org/stable/2238570

Kiefer J (1959) Optimum experimental designs. Journal of the Royal Statistical Society Series B (Methodological)
URL https://www.jstor.org/stable/2983802

Kingma DP, Welling M (2013) Auto-encoding variational bayes. 2nd International Conference on Learning
Representations, ICLR URL https://arxiv.org/abs/1312.6114

Kottke D, Krempl G, Spiliopoulou M (2015) Probabilistic active learning in datastreams.
https://doi.org/10.1007/978-3-319-24465-5_13

Kranjc J, Smailović J, Podpečan V, et al (2015) Active learning for sentiment analysis on data streams:
Methodology and workflow implementation in the clowdflows platform. Information Processing & Management
51(2):187–203

Kraskov A, Stögbauer H, Grassberger P (2004) Estimating mutual information. Physical Review E 69:066138.
https://doi.org/10.1103/PhysRevE.69.066138

Krawczyk B, Minku LL, Gama J, et al (2017) Ensemble learning for data stream analysis: A survey. Information
Fusion 37:132–156

Krawczyk B, Pfahringer B, Wozniak M (2018) Combining active learning with concept drift detection for data
stream mining. 2018 IEEE International Conference on Big Data (Big Data) pp 2239–2244.
https://doi.org/10.1109/BigData.2018.8622549

Kulkarni RV, Patil SH, Subhashini R (2016) An overview of learning in data streams with label scarcity.
Proceedings of the International Conference on Inventive Computation Technologies, ICICT 2016 2.
https://doi.org/10.1109/INVENTIVE.2016.7824874

Kumar P, Gupta A (2020) Active learning query strategies for classification, regression, and clustering: A survey.
Journal of Computer Science and Technology 35:913–945. https://doi.org/10.1007/s11390-020-9487-4

Kurlej B, Woźniak M (2011) Learning curve in concept drift while using active learning paradigm.
https://doi.org/10.1007/978-3-642-23857-4_13

Kwak B, Kim Y, Kim YJ, et al (2022) Trustal: Trustworthy active learning using knowledge distillation. The
Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22) URL https://arxiv.org/abs/2201.11661

Lakshminarayanan B, Roy D, Teh YW (2014) Mondrian forests: Efficient online random forests. Advances in
Neural Information Processing Systems (NIPS) URL
https://proceedings.neurips.cc/paper_files/paper/2014/file/d1dc3a8270a6f9394f88847d7f0050cf-Paper.pdf

Li A, Boyd A, Smyth P, et al (2021) Detecting and adapting to irregular distribution shifts in bayesian online
learning. 35th Conference on Neural Information Processing Systems (NeurIPS 2021) URL
https://papers.nips.cc/paper/2021/file/362387494f6be6613daea643a7706a42-Paper.pdf

Li X, Guo Y (2013) Adaptive active learning for image classification. 2013 IEEE Conference on Computer Vision
and Pattern Recognition pp 859–866. https://doi.org/10.1109/CVPR.2013.116

Lieber D, Konrad B, Deuse J, et al (2012) Sustainable interlinked manufacturing processes through real-time
quality prediction. In: Leveraging Technology for a Sustainable World: Proceedings of the 19th CIRP
Conference on Life Cycle Engineering, University of California at Berkeley, Berkeley, USA, May 23-25, 2012,
Springer, pp 393–398

Lima M, Neto M, Filho TS, et al (2022) Learning under concept drift for regression—a systematic literature
review. IEEE Access 10:45410–45429. https://doi.org/10.1109/ACCESS.2022.3169785

Liu S, Xue S, Wu J, et al (2021) Online active learning for drifting data streams. IEEE Transactions on Neural
Networks and Learning Systems https://doi.org/10.1109/TNNLS.2021.3091681

Long J, Yin J, Zhao W, et al (2008) Graph-based active learning based on label propagation. MDAI 2008:
Modeling Decisions for Artificial Intelligence pp 179–190. https://doi.org/10.1007/978-3-540-88269-5_17

Loy CC, Hospedales TM, Xiang T, et al (2012) Stream-based joint exploration-exploitation active learning.
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition pp
1560–1567. https://doi.org/10.1109/CVPR.2012.6247847

Lu J, Zhao P, Hoi SCH (2016) Online passive-aggressive active learning. Machine Learning 103:141–183.
https://doi.org/10.1007/s10994-016-5555-y

Lu J, Liu A, Dong F, et al (2018) Learning under concept drift: A review. IEEE Transactions on Knowledge
and Data Engineering pp 1–1. https://doi.org/10.1109/TKDE.2018.2876857

Lughofer E (2011) Evolving Fuzzy Systems – Methodologies, Advanced Concepts and Applications, vol 266.
Springer Berlin Heidelberg, https://doi.org/10.1007/978-3-642-18087-3

Lughofer E (2012) Single-pass active learning with conflict and ignorance. Evolving Systems 3:251–271.
https://doi.org/10.1007/s12530-012-9060-7

Lughofer E (2017) On-line active learning: A new paradigm to improve practical useability of data stream
modeling methods. Information Sciences 415-416:356–376. https://doi.org/10.1016/j.ins.2017.06.038

Lughofer E, Pratama M (2018) Online active learning in data stream regression using uncertainty sampling
based on evolving generalized fuzzy models. IEEE Transactions on Fuzzy Systems 26:292–309.
https://doi.org/10.1109/TFUZZ.2017.2654504

Lughofer E, Škrjanc I (2023) Online active learning for evolving error feedback fuzzy models within a multi-
innovation context. IEEE Transactions on Fuzzy Systems

Ma L, Destercke S, Wang Y (2016) Online active learning of decision trees with evidential data. Pattern
Recognition 52:33–45. https://doi.org/10.1016/j.patcog.2015.10.014

Mammen E, Tsybakov AB (1999) Smooth discrimination analysis. The Annals of Statistics 27.
https://doi.org/10.1214/aos/1017939240

Manjah D, Cacciarelli D, Standaert B, et al (2023) Stream-based active distillation for scalable model deployment.
Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR) Workshops

Manwani N, Desai K, Sasidharan S, et al (2013) Double ramp loss based reject option classifier. 19th
Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD)
https://doi.org/10.1007/978-3-319-57454-7_53

Martins VE, Cano A, Junior SB (2023) Meta-learning for dynamic tuning of active learning on stream
classification. Pattern Recognition 138:109359

McSherry F, Talwar K (2007) Mechanism design via differential privacy. 48th Annual IEEE Symposium on
Foundations of Computer Science (FOCS’07) pp 94–103. https://doi.org/10.1109/FOCS.2007.41

Menard P, Domingues OD, Jonsson A, et al (2021) Fast active learning for pure exploration in reinforcement
learning. Proceedings of the 38th International Conference on Machine Learning URL
http://proceedings.mlr.press/v139/menard21a/menard21a-supp.pdf

Min F, Zhang SM, Ciucci D, et al (2020) Three-way active learning through clustering selection. International
Journal of Machine Learning and Cybernetics 11:1033–1046. https://doi.org/10.1007/s13042-020-01099-2

Minka TP (2001) A family of algorithms for approximate bayesian inference. Thesis URL
https://hd.media.mit.edu/tech-reports/TR-533.pdf

Miu T, Missier P, Plötz T (2015) Bootstrapping personalised human activity recognition models using online
active learning. In: 2015 IEEE International Conference on Computer and Information Technology; Ubiquitous
Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and
Computing, IEEE, pp 1138–1147

Mohamad S, Bouchachia A, Sayed-Mouchaweh M (2018) A bi-criteria active learning algorithm for dynamic
data streams. IEEE Transactions on Neural Networks and Learning Systems 29:74–86.
https://doi.org/10.1109/TNNLS.2016.2614393

Mohamad S, Sayed-Mouchaweh M, Bouchachia A (2020) Online active learning for human activity recognition
from sensory data streams. Neurocomputing 390:341–358. https://doi.org/10.1016/j.neucom.2019.08.092

Mohamadi S, Amindavar H (2020) Deep bayesian active learning, a brief survey on recent advances. URL
https://arxiv.org/abs/2012.08044

Montgomery DC (2012) Design and Analysis of Experiments. John Wiley & Sons, Inc.,
https://doi.org/10.1002/9781118147634

Myers RH, Montgomery D, Anderson-Cook CM (2016) Response surface methodology: process and product
optimization using designed experiments. Wiley, URL
https://www.wiley.com/en-au/Response+Surface+Methodology:+Process+and+Product+Optimization+Using+Designed+Experiments,+4th+Edition-p-9781118916018

Naranjo JE, Sotelo MA, Gonzalez C, et al (2007) Using fuzzy logic in automated vehicle control. IEEE intelligent
systems 22(1):36–45

Narr A, Triebel R, Cremers D (2016) Stream-based active learning for efficient and adaptive classification of
3d objects. Proceedings - IEEE International Conference on Robotics and Automation 2016-June:227–233.
https://doi.org/10.1109/ICRA.2016.7487138

Nguyen HT, Smeulders A (2004) Active learning using pre-clustering. Proceedings of the twenty-first
international conference on Machine learning https://doi.org/10.1145/1015330.1015349

Nixon C, Sedky M, Hassan M (2021) Reviews in online data stream and active learning for cyber intrusion
detection-a systematic literature review. In: 2021 Sixth International Conference on Fog and Mobile Edge
Computing (FMEC), IEEE, pp 1–6

Pham T, Kottke D, Krempl G, et al (2022) Stream-based active learning for sliding windows under the influence
of verification latency. Machine Learning 111:2011–2036. https://doi.org/10.1007/s10994-021-06099-z

Pitman J, Yor M (1997) The two-parameter poisson-dirichlet distribution derived from a stable subordinator.
The Annals of Probability 25. URL https://www.jstor.org/stable/20680193

Polikar R, Upda L, Upda S, et al (2001) Learn++: an incremental learning algorithm for supervised neural
networks. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 31:497–508.
https://doi.org/10.1109/5326.983933

Prabhu V, Chandrasekaran A, Saenko K, et al (2020) Active domain adaptation via clustering
uncertainty-weighted embeddings. URL https://github.com/virajprabhu/CLUE

Pratama M, Anavatti SG, Lu J (2015) Recurrent classifier based on an incremental metacognitive-based
scaffolding algorithm. IEEE Transactions on Fuzzy Systems 23:2048–2066.
https://doi.org/10.1109/TFUZZ.2015.2402683

Qin J, Wang C, Zou Q, et al (2021) Active learning with extreme learning machine for online imbalanced
multiclass classification. Knowledge-Based Systems 231:107385. https://doi.org/10.1016/j.knosys.2021.107385

Quade D (1979) Using weighted rankings in the analysis of complete blocks with additive block effects. Journal
of the American Statistical Association 74:680. https://doi.org/10.2307/2286991

Réda C, Kaufmann E, Delahaye-Duriez A (2020) Machine learning applications in drug development.
Computational and structural biotechnology journal 18:241–252

Ren P, Xiao Y, Chang X, et al (2022) A survey of deep active learning. ACM Computing Surveys 54:1–40.
https://doi.org/10.1145/3472291

Reyes O, Altalhi AH, Ventura S (2018) Statistical comparisons of active learning strategies over multiple datasets.
Knowledge-Based Systems 145:274–288. https://doi.org/10.1016/j.knosys.2018.01.033

Riis C, Antunes F, Hüttel FB, et al (2022) Bayesian active learning with fully bayesian gaussian processes.
In Proceedings of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) URL
https://proceedings.neurips.cc/paper_files/paper/2022/file/4f1fba885f266d87653900fd3045e8af-Paper-Conference.pdf

Riquelme C (2017) Online active learning with linear models. Thesis URL http://purl.stanford.edu/rp382fv8012

Riquelme C, Ghavamzadeh M, Lazaric A (2017a) Active learning for accurate estimation of linear models.
Proceedings of the 34th International Conference on Machine Learning URL
http://proceedings.mlr.press/v70/riquelme17a/riquelme17a.pdf

Riquelme C, Johari R, Zhang B (2017b) Online active linear regression via thresholding. Thirty-First AAAI
Conference on Artificial Intelligence URL www.aaai.org

Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain.
Psychological Review 65:386–408. https://doi.org/10.1037/h0042519

Roth D, Small K (2006) Margin-based active learning for structured output spaces. Machine Learning: ECML
2006 pp 413–424. https://doi.org/10.1007/11871842_40

Roy N, Mccallum A (2001) Toward optimal active learning through sampling estimation of error reduction.
Proceedings of the Eighteenth International Conference on Machine Learning URL
https://dl.acm.org/doi/10.5555/645530.655646

Rožanec JM, Trajkova E, Dam P, et al (2022) Streaming machine learning and online active learning for
automated visual inspection. IFAC-PapersOnLine 55:277–282. https://doi.org/10.1016/j.ifacol.2022.04.206

Ruan Y, Yang J, Zhou Y (2020) Linear bandits with limited adaptivity and learning distributional optimal
design. STOC 2021: Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing
https://doi.org/10.1145/3406325.3451004

Rudovic O, Zhang M, Schuller B, et al (2019) Multi-modal active learning from human data: A deep reinforcement
learning approach. 2019 International Conference on Multimodal Interaction pp 6–15.
https://doi.org/10.1145/3340555.3353742

Saran A, Yousefi S, Krishnamurthy A, et al (2023) Streaming active learning with deep neural networks. In:
Krause A, Brunskill E, Cho K, et al (eds) Proceedings of the 40th International Conference on Machine
Learning, Proceedings of Machine Learning Research, vol 202. PMLR, pp 30005–30021, URL
https://proceedings.mlr.press/v202/saran23a.html

Schmidt S, Rao Q, Tatsch J, et al (2020) Advanced active learning strategies for object detection. 2020 IEEE
Intelligent Vehicles Symposium (IV) pp 871–876. https://doi.org/10.1109/IV47402.2020.9304565

Schmitt R, Jatzkowski P, Peterek M (2013) Traceable measurements using machine tools. In: Laser metrology
and machine performance X: 10th International Conference and Exhibition on Laser Metrology, Machine Tool,
CMM & Robotic Performance, Lamdamap, pp 20–21

Sculley D (2007) Online active learning methods for fast label efficient spam filtering. Proceedings of the Fourth
Conference on Email and AntiSpam

Sener O, Savarese S (2017) Active learning for convolutional neural networks: A core-set approach. ICLR

Settles B (2009) Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison
Department of Computer Sciences URL https://burrsettles.com/pub/settles.activelearning.pdf

Seung HS, Opper M, Sompolinsky H (1992) Query by committee. Proceedings of the fifth annual workshop on
Computational learning theory - COLT ’92 pp 287–294. https://doi.org/10.1145/130385.130417

Shah K, Manwani N (2020) Online active learning of reject option classifiers. Proceedings of the AAAI Conference
on Artificial Intelligence 34:5652–5659. https://doi.org/10.1609/aaai.v34i04.6019

Shan J, Zhang H, Liu W, et al (2019) Online active learning ensemble framework for drifted data streams. IEEE
Transactions on Neural Networks and Learning Systems 30:486–498.
https://doi.org/10.1109/TNNLS.2018.2844332

Shannon CE (1948) A mathematical theory of communication. The Bell System Technical Journal

Sheng VS, Provost F, Ipeirotis PG (2008) Get another label? improving data quality and data mining using
multiple, noisy labelers. Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery
and data mining - KDD 08 p 614. https://doi.org/10.1145/1401890.1401965

Shi X, Xiong W (2018) Approximate linear dependence criteria with active learning for smart soft sensor design.
Chemometrics and Intelligent Laboratory Systems 180:88–95. https://doi.org/10.1016/j.chemolab.2018.07.009

Shilton A, Palaniswami M, Ralph D, et al (2005) Incremental training of support vector machines. IEEE
Transactions on Neural Networks 16:114–131. https://doi.org/10.1109/TNN.2004.836201

Soare M, Lazaric A, Munos R (2013) Active learning in linear stochastic bandits. Bayesian Optimization in
Theory and Practice URL
https://www.univ-orleans.fr/lifo/Members/soare/files/active_learning_linear_bandit.pdf

Soare M, Lazaric A, Munos R (2014) Best-arm identification in linear bandits. 27th Conference on Neural
Information Processing Systems (NeurIPS 2014)

Song S, Chaudhuri K, Sarwate AD (2013) Stochastic gradient descent with differentially private updates.
2013 IEEE Global Conference on Signal and Information Processing pp 245–248.
https://doi.org/10.1109/GlobalSIP.2013.6736861

Souza V, Pinho T, Batista G (2018) Evaluating stream classifiers with delayed labels information. 2018 7th
Brazilian Conference on Intelligent Systems (BRACIS) pp 408–413.
https://doi.org/10.1109/BRACIS.2018.00077

Steel RGD (1959) A multiple comparison sign test: Treatments versus control. Journal of the American Statistical
Association 54:767. https://doi.org/10.2307/2282500

Steve H, Liu Y (2014) Minimax analysis of active learning. Journal of Machine Learning Research URL
https://www.jmlr.org/papers/volume16/hanneke15a/hanneke15a.pdf

Subramanian K, Das AK, Sundaram S, et al (2014) A meta-cognitive interval type-2 fuzzy inference
system and its projection based learning algorithm. Evolving Systems 5:219–230.
https://doi.org/10.1007/s12530-013-9102-9

Sudarsanam N, Ravindran B (2018) Using linear stochastic bandits to extend traditional offline designed
experiments to online settings. Computers & Industrial Engineering 115:471–485

Suresh S, Sundararajan N, Saratchandran P (2008) Risk-sensitive loss functions for sparse multi-category
classification problems. Information Sciences 178:2621–2638. https://doi.org/10.1016/j.ins.2008.02.009

Suzuki K, Sunagawa T, Sasaki T, et al (2021) Annotation cost reduction of stream-based active learning by
automated weak labeling using a robot arm. 2021 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS) pp 9000–9007. https://doi.org/10.1109/IROS51168.2021.9636355

Suárez-Cetrulo AL, Kumar A, Miralles-Pechuán L (2021) Modelling the covid-19 virus evolution with incremental
machine learning. 29th Irish Conference on Artificial Intelligence and Cognitive Science, AICS 2021 URL
https://ceur-ws.org/Vol-3105/paper1.pdf

Suárez-Cetrulo AL, Quintana D, Cervantes A (2023) A survey on machine learning for recurring concept drifting
data streams. Expert Systems with Applications 213:118934. https://doi.org/10.1016/j.eswa.2022.118934

Tang Q, Li D, Xi Y (2018) A new active learning strategy for soft sensor modeling based on feature reconstruction
and uncertainty evaluation. Chemometrics and Intelligent Laboratory Systems 172:43–51.
https://doi.org/10.1016/j.chemolab.2017.11.001

Taylor G, Hinton G (2009) Factored conditional restricted boltzmann machines for modeling motion style.
Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009
https://doi.org/10.1145/1553374.1553505

Taylor G, Hinton G, Roweis S (2006) Modeling human motion using binary latent variables. Advances in Neural
Information Processing Systems 19 (NIPS 2006) URL
https://papers.nips.cc/paper_files/paper/2006/hash/1091660f3dff84fd648efe31391c5524-Abstract.html

Thompson J, Walters WP, Feng JA, et al (2022) Optimizing active learning for free energy calculations. Artificial
Intelligence in the Life Sciences 2:100050. https://doi.org/10.1016/j.ailsci.2022.100050

Tieppo E, dos Santos RR, Barddal JP, et al (2022) Hierarchical classification of data streams: a systematic
literature review. Artificial Intelligence Review 55:3243–3282. https://doi.org/10.1007/s10462-021-10087-z

Tong S, Koller D (2002) Support vector machine active learning with applications to text classification. The
Journal of Machine Learning Research 2. https://doi.org/10.1162/153244302760185243

Tran T, Pham T, Carneiro G, et al (2017) A bayesian data augmentation approach for learning deep models.
31st Conference on Neural Information Processing Systems (NIPS 2017) URL
https://proceedings.neurips.cc/paper_files/paper/2017/file/076023edc9187cf1ac1f1163470e479a-Paper.pdf

Tran T, Do TT, Reid I, et al (2019) Bayesian generative active deep learning. Proceedings of the 36th
International Conference on Machine Learning URL https://arxiv.org/abs/1904.11643

Tsybakov AB (2004) Optimal aggregation of classifiers in statistical learning. The Annals of Statistics
https://doi.org/10.1214/aos/1079120131

Tsymbal A, Pechenizkiy M, Cunningham P, et al (2008) Dynamic integration of classifiers for handling concept
drift. Information Fusion 9:56–68. https://doi.org/10.1016/j.inffus.2006.11.002

Vahdat A, Belbahri M, Nia VP (2019) Active learning for high-dimensional binary features. 15th Interna-
tional Conference on Network and Service Management (CNSM) URL https://www.computer.org/csdl/
proceedings-article/cnsm/2019/09012676/1hQr3hscsJG

Vanhatalo E, Kulahci M (2016) Impact of autocorrelation on principal components and their use in statistical
process control. Quality and Reliability Engineering International 32:1483–1500. https://doi.org/10.1002/qre.
1858

Vanhatalo E, Kulahci M, Bergquist B (2017) On the structure of dynamic principal component analysis used in
statistical process monitoring. Chemometrics and Intelligent Laboratory Systems 167:1–11.
https://doi.org/10.1016/j.chemolab.2017.05.016

Wang L (2011) Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. The
Journal of Machine Learning Research URL https://www.jmlr.org/papers/volume12/wang11b/wang11b.pdf

Wang X, Fu M, Ma H, et al (2015) Lateral control of autonomous vehicles based on fuzzy logic. Control
Engineering Practice 34:1–17

Wassermann S, Cuvelier T, Casas P (2019) RAL: improving stream-based active learning by reinforcement learning.
URL https://hal.archives-ouvertes.fr/hal-02265426

Weigl E, Heidl W, Lughofer E, et al (2016) On improving performance of surface inspection systems by online
active learning and flexible classifier updates. Machine Vision and Applications 27:103–127.
https://doi.org/10.1007/s00138-015-0731-9

Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin 1:80.
https://doi.org/10.2307/3001968

Woodward M, Finn C (2017) Active one-shot learning. NIPS 2016, Deep Reinforcement Learning Workshop
URL http://arxiv.org/abs/1702.06559

Woźniak M, Zyblewski P, Ksieniewicz P (2023) Active weighted aging ensemble for drifted data stream
classification. Information Sciences 630:286–304

Wu J, Chen J, Huang D (2022) Entropy-based active learning for object detection with progressive diversity
constraint. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
https://doi.org/10.1109/CVPR52688.2022.00918

Wu R, Guo C, Su Y, et al (2021) Online adaptation to label distribution shift. 35th Conference on Neural
Information Processing Systems (NeurIPS 2021) URL https://www.kaggle.com/Cornell-University/arxiv

Wu Y, Chen Y, Wang L, et al (2019) Large scale incremental learning. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) https://doi.org/10.1109/CVPR.2019.00046

Xu W, Zhao F, Lu Z (2016) Active learning over evolving data streams using paired ensemble framework.
2016 Eighth International Conference on Advanced Computational Intelligence (ICACI) pp 180–185.
https://doi.org/10.1109/ICACI.2016.7449823

Yan X, Sarkar M, Lartey B, et al (2023) An online learning framework for sensor fault diagnosis analysis in
autonomous cars. IEEE Transactions on Intelligent Transportation Systems

Yin C, Chen S, Yin Z (2023) Clustering-based active learning classification towards data stream. ACM
Transactions on Intelligent Systems and Technology 14(2):1–18

Yu H, Sun C, Yang W, et al (2015) AL-ELM: One uncertainty-based active learning algorithm using extreme
learning machine. Neurocomputing 166:140–150. https://doi.org/10.1016/j.neucom.2015.04.019

Yu K, Bi J, Tresp V (2006) Active learning via transductive experimental design. Proceedings of the 23rd
International Conference on Machine Learning https://doi.org/10.1145/1143844.1143980

Yuan M, Lin HT, Boyd-Graber J (2020) Cold-start active learning through self-supervised language modeling.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
https://doi.org/10.18653/v1/2020.emnlp-main.637

Zhang H, Liu W, Shan J, et al (2018) Online active learning paired ensemble for concept drift and class imbalance.
IEEE Access 6:73815–73828. https://doi.org/10.1109/ACCESS.2018.2882872

Zhang H, Liu W, Sun L, et al (2020a) Analyzing network traffic for protocol identification: An ensemble
online active learning method. Proceedings - 2020 6th International Conference on Big Data and Information
Analytics, BigDIA 2020 pp 167–172. https://doi.org/10.1109/BigDIA51454.2020.00035

Zhang H, Ravi SS, Davidson I (2020b) A graph-based approach for active learning in regression. Proceedings of
the 2020 SIAM International Conference on Data Mining (SDM) https://doi.org/10.1137/1.9781611976236.32

Zhang H, Liu W, Liu Q (2022) Reinforcement online active learning ensemble for drifting imbalanced data
streams. IEEE Transactions on Knowledge and Data Engineering 34:3971–3983.
https://doi.org/10.1109/TKDE.2020.3026196

Zhang K, Liu S, Chen Y (2023) Online active learning framework for data stream classification with density-peaks
recognition. IEEE Access 11:27853–27864

Zhang T (2004) Statistical behavior and consistency of classification methods based on convex risk minimization.
The Annals of Statistics 32. https://doi.org/10.1214/aos/1079120130

Zheng Z, Padmanabhan B (2006) Selectively acquiring customer information: A new data acquisition problem
and an active learning-based solution. Management Science 52(5):697–712

Zhou C, Ma X, Michel P, et al (2021) Examining and combating spurious features under distribution shift.
Proceedings of the 38th International Conference on Machine Learning URL https://github.com/violet-zct/

Zhu JJ, Bento J (2017) Generative adversarial active learning. URL https://arxiv.org/abs/1702.07956

Zhu X, Zhang P, Lin X, et al (2007) Active learning from data streams. Proceedings - IEEE International
Conference on Data Mining, ICDM pp 757–762. https://doi.org/10.1109/ICDM.2007.101

Zliobaite I, Bifet A, Pfahringer B, et al (2014) Active learning with drifting streaming data. IEEE Transactions
on Neural Networks and Learning Systems 25:27–39. https://doi.org/10.1109/TNNLS.2012.2236570

Zwanka RJ, Buff C (2021) COVID-19 generation: A conceptual framework of the consumer behavioral shifts to
be caused by the COVID-19 pandemic. Journal of International Consumer Marketing 33:58–67.
https://doi.org/10.1080/08961530.2020.1771646

Zyblewski P, Ksieniewicz P, Woźniak M (2020) Combination of active and random labeling strategy in the
non-stationary data stream classification. In: International Conference on Artificial Intelligence and Soft
Computing, Springer, pp 576–585

Škrjanc I (2009) Confidence interval of fuzzy models: An example using a waste-water treatment plant.
Chemometrics and Intelligent Laboratory Systems 96:182–187. https://doi.org/10.1016/j.chemolab.2009.01.009



Published in Machine Learning by Springer (2023). https://doi.org/10.1007/s10994-023-06454-2
