
A Clustering-based Framework for Classifying Data Streams

Xuyang Yan1, Abdollah Homaifar1∗, Mrinmoy Sarkar1, Abenezer Girma1 and Edward Tunstel2
1 North Carolina A&T State University, Greensboro, NC, 27401, USA
2 Raytheon Technologies Research Center, East Hartford, CT, 06108, USA
[email protected], [email protected], [email protected], [email protected], [email protected]
∗ Contact Author

arXiv:2106.11823v1 [cs.LG] 22 Jun 2021

Abstract

The non-stationary nature of data streams strongly challenges traditional machine learning techniques. Although some solutions have been proposed to extend traditional machine learning techniques for handling data streams, these approaches either require an initial label set or rely on specialized design parameters. The overlap among classes and the labeling of data streams constitute other major challenges for classifying data streams. In this paper, we propose a clustering-based data stream classification framework to handle non-stationary data streams without utilizing an initial label set. A density-based stream clustering procedure is used to capture novel concepts with a dynamic threshold, and an effective active label querying strategy is introduced to continuously learn the new concepts from the data streams. The sub-cluster structure of each cluster is explored to handle the overlap among classes. Experimental results and quantitative comparison studies reveal that the proposed method provides statistically better or comparable performance than the existing methods.

1 Introduction

The recent advances in information technology have led to an increasing trend in the application of data streams, and data stream classification has captured the attention of many researchers. Unlike traditional classification problems, data streams are continuously evolving such that the data distributions can change dynamically and new data patterns may appear. This non-stationary nature of data streams, known as concept drift and concept evolution [Masud et al., 2010; Gama et al., 2014], requires a continuous learning capability for traditional machine learning techniques to handle data streams [Aggarwal et al., 2006; Bouchachia, 2011; Mu et al., 2017]. The scarcity of data labels and expensive labeling costs also limit the conventional classification techniques [Din et al., 2020]. Besides, limited time and memory space pose an additional layer of challenges for classifying the large volume of data streams.

To address these challenges, a great number of studies have been conducted on classifying data streams considering two primary aspects: data stream classification through (i) supervised and (ii) semi-supervised learning. For supervised data stream classification frameworks, most studies [Bifet et al., 2009; Bifet and Gavaldà, 2009; Bifet et al., 2010b; Losing et al., 2018; Gomes et al., 2017] focus on extending traditional offline machine learning techniques under the assumption that the label of each incoming sample will be available after it is classified. However, this assumption does not always hold in real-world data stream classification problems, and the labeling of data streams is time-consuming and expensive. Additionally, to adapt to changes in data streams, the optimization of the learnable parameters also imposes a high computational overhead on supervised approaches.

Semi-supervised approaches [Masud et al., 2012; Shao et al., 2016; Wagner et al., 2018; Zhu and Li, 2020; Din et al., 2020] have been developed as an alternative solution that performs data stream classification using a small portion of labeled data, and they have shown substantial success. However, several challenges still remain. First, the detection of a novel class strongly depends on certain threshold parameters, and the determination of these threshold parameters is an open challenge. Second, the overlap among different classes is common in data stream classification problems, while very little research has been conducted on handling data streams with overlapped classes. Third, most existing semi-supervised classification frameworks classify data streams with partially labeled data in a passive learning manner; for better classification performance, an effective labeling strategy is required to actively guide the learning procedure.

Considering the limitations of the existing methods described above, we propose a Clustering-based Data Stream Classification framework with Active Learning, namely CDSC-AL, to handle non-stationary data streams. The proposed framework consists of three main steps: (i) concept drift and novel class detection through clustering, (ii) novel concept learning and concept drift adaptation through active learning, and (iii) classification of data streams using label propagation. With these three steps, CDSC-AL aims to perform non-stationary data stream classification while reducing labeling costs and handling the overlap among classes. In summary, the contributions of this paper are as follows:
• Extend a density-based data stream clustering method to capture concept drift and concept evolution for non-stationary data streams with a dynamic threshold.
• Develop an effective distance-based active learning procedure to query a small portion of labels for adapting to changes in data streams.
• Propose a classification procedure that handles the overlap among classes by investigating the sub-cluster structure inside clusters.

The remainder of this paper is organized as follows. Section 2 provides a review of the related work. The details of the CDSC-AL framework are discussed in Section 3, and the complexity analysis of the CDSC-AL framework is described in Section 4. Section 5 presents the experimental results and comparison studies with the state-of-the-art methods. Concluding remarks and future work are outlined in Section 6.

2 Related Work

Over the past few decades, great efforts have been made to adapt offline supervised learning techniques for classifying streaming data. In [Aggarwal et al., 2006], the authors proposed a dynamic data classification framework using traditional classifiers to classify streaming data. A general framework was developed for adapting conventional classification techniques to perform data stream classification using clustering in [Masud et al., 2010; Masud et al., 2012]. In [Bifet and Gavalda, 2007; Bifet et al., 2013], the ADaptive sliding WINdow (ADWIN) is proposed as a detector that captures the change of data streams and works collaboratively with traditional classification techniques for handling data streams. Later, ADWIN was widely used in ensemble-based data stream classification frameworks [Bifet et al., 2010b; Gomes et al., 2017]. To enhance time and memory efficiency, a Self Adjusting Memory (SAM) model is introduced for offline classifiers to deal with non-stationary data streams in [Losing et al., 2018].

Unlike supervised approaches, semi-supervised methods rely on a more realistic assumption about the availability of labels for data streams. The Linear Neighborhood Propagation method [Wang and Zhang, 2007] is used to perform data stream classification using a limited amount of labeled data. In [Masud et al., 2012], ECSMiner is proposed to employ a clustering procedure for data stream classification in a semi-supervised manner. A combination of semi-supervised Support Vector Machines and k-means clustering is employed in [Zhang et al., 2010] for classifying data streams. In [Hosseini et al., 2016], data streams are divided into chunks and an ensemble of cluster-based classifiers is used to propagate labels within each chunk. The Co-forest algorithm [Li and Zhou, 2007] is extended for data stream classification using a small amount of labeled data in [Wang and Li, 2018].

Recently, online graph-based semi-supervised data stream classification frameworks were investigated in [Ravi and Diao, 2016; Wagner et al., 2018]. In [Ravi and Diao, 2016], the authors continuously maintain a graph over both the labeled and unlabeled data and then predict the labels of incoming unlabeled instances through label propagation. Another online graph-based semi-supervised approach is introduced to reduce the time and memory complexity [Wagner et al., 2018]; primarily, it utilizes a temporal label propagation procedure to not only label the new instances but also learn from them.

A few recent studies on data stream classification incorporate active learning into the semi-supervised learning framework [Lughofer et al., 2016; Mohamad et al., 2018]. Such approaches handle the high labeling costs of data streams by actively querying labels for a small subset of interesting samples. In our proposed framework, we introduce a new active learning strategy to address the high labeling cost issue and combine it with a label propagation procedure to classify data streams in a semi-supervised manner.

3 Proposed Approach

In this section, the basic notations and assumptions for the CDSC-AL framework are introduced first. Then, we provide an overview of the CDSC-AL framework and discuss its three main components: concept drift and evolution detection through clustering, the adaptation of drifted and novel concepts using active learning, and classification through label propagation.

3.1 Notations and Assumptions

Notations. Let DS be an unknown data stream and CH_t a chunk of data from DS such that DS = ∪_{t=1}^{∞} {CH_t}, where t refers to the time index of a data chunk. For each data chunk, CH_t = ∪_{i=1}^{|CH_t|} {x_t^i | x_t^i ∈ R^m}, where x_t^i denotes a data sample in CH_t and m is the dimension of x_t^i. Assume HS_t = {GS_t, LS_t} is the summary of the data stream, where GS_t and LS_t represent the macro-level and the micro-level summary of DS, respectively. We use Z_t^i to represent the i-th cluster center of DS at time t, and SC_t^{i,j} refers to the j-th sub-cluster center of cluster i at time t. The notation y_t^{i,j} denotes the class label of SC_t^{i,j}.

Assumptions. Considering the non-stationary nature of data streams, the proposed CDSC-AL framework assumes that:
• The characteristics of data streams can change abruptly or gradually.
• Multiple novel concepts may arise simultaneously.
• Overlapped classes will appear over time.

3.2 An Overview of the CDSC-AL Framework

A general algorithmic description of the CDSC-AL framework is presented in Algorithm 1. We extended the recently developed density-based stream clustering algorithm, namely dynamic fitness proportionate sharing (DFPS-clustering) [Yan et al., 2019; Yan et al., 2020], to perform the classification of data streams in several respects. First, a new merge procedure between clusters from the incoming data chunk and historical clusters is employed, and this modified merge procedure is used to detect novel classes and drifted classes. Second, two levels of cluster summary are maintained continuously to reflect the characteristics of the data stream through an active learning procedure. Third, an effective classification procedure using the k-nearest-neighbor (KNN) rule [Altman, 1992] is introduced to classify the incoming data chunk based on the summary and the queried labels. Additionally, the overlap among classes is addressed by exploring the sub-cluster information from the micro-level summary.
Algorithm 1 An overview of the CDSC-AL framework
Input: DS
Parameters: C_t: the set of clusters at time t; CH_t: the current data chunk; C_{CH_t}: the set of clusters discovered in CH_t; Q_t: a small set of samples for active label querying in CH_t; Y_{Q_t}: the label set of the queried samples Q_t; Y_{CH_t}: the label set for CH_t.
Output: HS_t
1: for t = 1 to ∞ do
2:   Conduct recursive density evaluation on CH_t and rank all samples of CH_t according to their density values
3:   Perform the search of possible clusters in CH_t
4:   Merge highly overlapped clusters to obtain C_{CH_t}
5:   if t == 1 then
6:     HS_t = ∅, C_t = C_{CH_t}
7:     [Q_t, Y_{Q_t}] = ActiveQuery(HS_t, C_t, CH_t)
8:     Y_{CH_t} = Classify(HS_t, Q_t, Y_{Q_t}, CH_t)
9:     HS_t = ClusteringModel(HS_t, C_{CH_t}, CH_t)
10:  else
11:    C_t = CheckMerge(HS_t, C_{CH_t}, CH_t)
12:    [Q_t, Y_{Q_t}] = ActiveQuery(HS_t, C_t, CH_t)
13:    Y_{CH_t} = Classify(HS_t, Q_t, Y_{Q_t}, CH_t)
14:    HS_t = ClusteringModel(HS_t, C_{CH_t}, CH_t)
15:  end if
16:  Return HS_t
17: end for

Algorithm 2 New cluster merge procedure
Parameters: L_{NC}: a list of paired clusters between a historical cluster and its neighboring clusters in CH_t; X_B: a set of boundary samples between each pair of clusters in L_{NC}.
1: procedure CheckMerge(HS_t, C_{CH_t}, CH_t)
2:   Identify the paired neighboring historical clusters in HS_t for C_{CH_t} to obtain L_{NC}
3:   Extract X_B for each pair of neighboring clusters in L_{NC}
4:   Evaluate the density of X_B and check for a density drop for each pair of neighboring clusters
5:   Merge each pair of neighboring clusters when there is no density drop in X_B
6:   Mark unmerged clusters as novel clusters and merged clusters as updated clusters
7:   Validate the existence of novel clusters using Remark 1
8:   Return C_t = [novel clusters, updated clusters]
9: end procedure
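To make the control flow of Algorithm 1 concrete, the following Python sketch mirrors its chunk-by-chunk loop. It is a minimal illustration rather than the released implementation: cluster_chunk, check_merge, active_query, classify, and clustering_model are hypothetical placeholders for the procedures described in Algorithms 2–4 and Sections 3.3–3.5.

```python
# Minimal sketch of the Algorithm 1 loop. The helper callables are hypothetical
# placeholders for the CDSC-AL procedures, not the authors' released code.
def cdsc_al_stream(chunks, cluster_chunk, check_merge,
                   active_query, classify, clustering_model):
    """Process a data stream chunk by chunk and collect per-chunk predictions."""
    summary = None              # HS_t: the two-level summary of the stream
    predictions = []
    for t, chunk in enumerate(chunks, start=1):
        chunk_clusters = cluster_chunk(chunk)   # density-based clustering of CH_t
        if t == 1:
            clusters = chunk_clusters           # first chunk: no history to merge with
        else:
            # separate novel clusters from updated (drifted) clusters
            clusters = check_merge(summary, chunk_clusters, chunk)
        queried, queried_labels = active_query(summary, clusters, chunk)
        labels = classify(summary, queried, queried_labels, chunk)  # label propagation
        summary = clustering_model(summary, chunk_clusters, chunk)  # update GS_t and LS_t
        predictions.append(labels)
    return predictions
```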

3.3 Concept Drifts and Evolution Detection through Clustering

To capture the non-stationary property of data streams, we modified the DFPS-clustering algorithm substantially by employing a new cluster merge procedure between the historical clusters and the new clusters. Then, we use the new DFPS-clustering method to distinguish novel concepts from drifted concepts. Algorithm 2 summarizes the new cluster merge procedure for the detection of drifted and novel concepts.

As shown in Algorithm 2, the new merge procedure utilizes the density of boundary instances to decide whether a merge should happen between a historical cluster and its neighboring cluster from CH_t. It then generates two types of clusters: (i) novel clusters and (ii) updated clusters. Clusters that are not merged with historical clusters are considered novel clusters, while the merged clusters are defined as updated clusters. Based on these two types of clusters, novel concepts are captured by the novel clusters and drifted concepts are identified by the updated clusters. To validate the existence of novel clusters, we use the mean and standard deviation of the density values of the historical clusters to compute a dynamic density threshold. Let C_t^{no} be a novel cluster and F_{C_t^{no}} be its density value. The mean and standard deviation of the density values of the historical clusters are denoted as µ(F_{C_t}) and σ(F_{C_t}), respectively. The following remark is defined for novel cluster detection.

Remark 1. If F_{C_t^{no}} ≥ |µ(F_{C_t}) − σ(F_{C_t})|, then C_t^{no} is a true novel cluster.

Remark 1 is derived from the three-sigma principle of the Gaussian distribution used in [Yan et al., 2019]; it means that a cluster is considered a valid novel cluster if its density value falls inside the one-sigma distance from the average density of the historical clusters at time t. With Remark 1, a novel cluster detector with a dynamic density threshold is used to validate the existence of novel clusters.

a merge should happen between a historical cluster and its Where Fti and Rti denote the density value and radius of the
neighboring cluster from CHt . Then, it generates two types ith cluster at time t, respectively. yti,j refers to the class la-
of clusters: (i) novel clusters and (ii) updated clusters. Clus-
bel of SCti,j and all samples from SCti,j share the same la-
ters that are not merged with historical clusters are considered
bel. LSt is used to explore the sub-cluster structure of each
as novel clusters while the merged clusters are defined as up-
cluster when classes are highly overlapped. Specifically, we
dated clusters. Based on these two types of clusters, the novel
split each cluster into a set of sub-clusters such that each sub-
concepts are captured by novel clusters and drifted concepts
cluster only has a unique class label. Instead of computing
are identified as updated clusters. To validate the existence
the mean vector for each sub-cluster, we consider only the
of novel clusters, we use the mean and standard deviation of
sample with the highest density value in each sub-cluster as
the density values of historical clusters to compute a dynamic
the sub-cluster center and use it for label propagation.
density threshold. Let Ctno be a novel cluster and FCtno be its
These two levels of summary are continuously updated to
density value. The mean and standard deviations of historical
adapt to the change of DS using an active learning procedure.
clusters are denoted as µ(FCt ) and σ(FCt ), respectively. The
Since two types of clusters can be obtained from the cluster-
following Remark is defined for novel cluster detection.
ing analysis, a hybrid active learning strategy of informative-
Remark 1. If FCtno ≥ |µ(FCt ) − σ(FCt )|, then Ctno is a true based and representative-based sampling is introduced to re-
novel cluster. duce the labeling costs. The adaptation procedure of these
Remark 1 is derived from the three-sigma principle of two levels of summary is provided in Algorithm 3. In Al-
Gaussian distribution used in [Yan et al., 2019] and it means gorithm 3, for novel clusters, the representative-based query
Algorithm 3 Adaptation of drifted and novel concepts
Parameters: X_no: a set of samples from novel clusters; X_up: a set of samples from updated clusters; Q_I: a set of queried samples using informative-based sampling; Q_R: a set of queried samples using representative-based sampling; Y_{CH_t}: the label set for CH_t.
1: procedure ClusteringModel(HS_t, C_t, CH_t)
2:   [Q_t, Y_{Q_t}] = ActiveQuery(C_t, CH_t)
3:   Y_{CH_t} = Classify(HS_t, Q_t, Y_{Q_t}, CH_t)
4:   Update GS_t according to C_t
5:   Update LS_t using Y_{CH_t}
6:   return HS_t = [GS_t, LS_t]
7: end procedure
8: procedure ActiveQuery(C_t, CH_t)
9:   Extract novel clusters and updated clusters from C_t
10:  Identify samples that are close to the novel clusters as X_no
11:  Identify samples that are close to the updated clusters as X_up
12:  Representative-based sampling for X_no to obtain Q_R
13:  Informative-based sampling for X_up to obtain Q_I
14:  Q_t = Q_I ∪ Q_R
15:  Query labels from human experts to obtain Y_{Q_t}
16:  return [Q_t, Y_{Q_t}]
17: end procedure

Algorithm 4 Classification through label propagation
Parameters: Y_{CH_t}: the label set for CH_t; Sub_r: a set of representatives from sub-clusters; P_t: a set of prototypes with labels.
1: procedure Classify(HS_t, Q_t, Y_{Q_t}, CH_t)
2:   Extract the sub-cluster centers and their labels from HS_t as Sub_r
3:   P_t = Sub_r ∪ [Q_t, Y_{Q_t}]
4:   Propagate labels from the prototype set to the samples in CH_t using the KNN rule and obtain the predicted label set Y_{CH_t}
5:   return Y_{CH_t}
6: end procedure

Dataset    Samples  Dimensions  Classes  Overlap
Syn-1      18900    2           9        False
Syn-2      11400    2           10       True
Sea        60000    2           3        False
KDD99      494021   34          5        False
Forest     581012   11          7        True
GasSensor  13910    128         6        True
Shuttle    58000    9           7        False
MNIST      70000    784         10       False
CIFAR10    60000    3072        10       False

Table 1: Dataset descriptions.
In Algorithm 3, for novel clusters, the representative-based query [Gu et al., 2019] is performed by sampling from the centers of the clusters. For updated clusters, on the other hand, we conduct an informative-based query [Gu et al., 2019] through a distance-based strategy: unlike the entropy-based sampling strategy of [Lughofer et al., 2016], samples that are relatively far from the updated clusters are selected as informative samples for label querying. Let Q_t be the set of queried samples and Y_{Q_t} be the label set for Q_t. After the active label query, the label propagation procedure predicts the labels of the remaining samples in CH_t using Y_{Q_t} and HS_t. Finally, the predicted labels are used to update LS_t with a two-step procedure. First, we update the centers of sub-clusters within updated clusters with new samples that have higher density values. Second, we create a set of new sub-clusters for each novel cluster to capture the characteristics of the novel concepts.
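The sketch below illustrates this hybrid query under simple assumptions: representative-based sampling takes the samples closest to a novel cluster's center, while the distance-based informative query takes the samples farthest from all updated cluster centers. Euclidean distance and a fixed per-cluster budget are illustrative choices, not the exact settings of Algorithm 3.

```python
import numpy as np

def representative_query(novel_samples, novel_center, budget=5):
    """Representative-based sampling: the samples nearest the novel cluster center."""
    d = np.linalg.norm(novel_samples - novel_center, axis=1)
    return novel_samples[np.argsort(d)[:budget]]

def informative_query(updated_samples, updated_centers, budget=5):
    """Distance-based informative sampling: the samples farthest from every
    updated cluster center (in contrast to entropy-based sampling)."""
    d = np.min(np.linalg.norm(updated_samples[:, None, :]
                              - updated_centers[None, :, :], axis=2), axis=1)
    return updated_samples[np.argsort(d)[-budget:]]

def hybrid_query(novel_samples, novel_center, updated_samples, updated_centers):
    """Q_t = Q_R (representatives of novel clusters) plus Q_I (informative samples);
    their labels would then be requested from a human expert."""
    q_r = representative_query(novel_samples, novel_center)
    q_i = informative_query(updated_samples, updated_centers)
    return np.vstack([q_r, q_i])
```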
3.5 Classification through Label Propagation

To classify an incoming data chunk, we employ an effective label propagation procedure based on HS_t and Q_t. First, a set of prototypes with label information is obtained from HS_t and Q_t. Then, the KNN-based classification procedure is employed to propagate the labels of the prototypes to the samples in CH_t. Here, we set the value of k to five; the classification procedure is presented in Algorithm 4.
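A minimal sketch of this step follows: the prototype set P_t combines the labeled sub-cluster centers from HS_t with the queried samples, and a k-nearest-neighbor vote (k = 5) assigns labels to the rest of the chunk. Using scikit-learn's KNeighborsClassifier is an illustrative shortcut, not the authors' own implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def propagate_labels(subcluster_centers, subcluster_labels,
                     queried_samples, queried_labels, chunk, k=5):
    """Build the prototype set P_t and propagate its labels to the chunk via KNN."""
    prototypes = np.vstack([subcluster_centers, queried_samples])
    prototype_labels = np.concatenate([subcluster_labels, queried_labels])
    knn = KNeighborsClassifier(n_neighbors=min(k, len(prototypes)))
    knn.fit(prototypes, prototype_labels)
    return knn.predict(chunk)   # predicted label set Y_CH_t for the incoming chunk
```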
4 Complexity Analysis

Here, we discuss the time and memory complexity of the CDSC-AL framework using the following parameters:
• n: the number of samples in an incoming data chunk
• m: the number of dimensions of an incoming instance
• |SC_t|: the total number of sub-clusters at time t
• |C_t|: the total number of clusters at time t
• k: the number of nearest neighbors
• |Q_t|: the number of queried samples at time t

Time Complexity. The density evaluation of the new DFPS-clustering procedure requires O(n^2 m) distance calculations. To update HS_t, there will be O(nm|C_t| + nm|SC_t|) ≤ O(2nm|SC_t|) distance calculations, where |C_t| ≤ |SC_t|. The classification of the incoming data chunk requires O(knm(|SC_t| + |Q_t|)) distance calculations. Thus, the total time complexity for a single data chunk, in terms of distance calculations, is O(n^2 m + 2nm|SC_t| + knm(|SC_t| + |Q_t|)).

Memory Complexity. For LS_t, the memory space complexity is O(|SC_t|(m + 1)) ≈ O(m|SC_t|). In terms of GS_t, it takes O(|C_t|(m + 2)) ≈ O(m|C_t|) memory space. The total memory complexity is therefore O(m(|SC_t| + |C_t|)).

5 Experimental Studies and Discussions

In this section, experiments are conducted on the CDSC-AL framework using nine benchmark datasets, and comparison studies with the state-of-the-art methods are presented.

5.1 Experimental Setup

Datasets and Streaming Settings. Nine multi-class benchmark datasets, including three synthetic datasets and six well-known real datasets from [Dheeru and Karra Taniskidou, 2017], are used in the experiments for performance evaluation. Table 1 summarizes these datasets in terms of sample size, dimensionality, number of classes, and class overlap. According to Table 1, the Forest, Syn-2, CIFAR10, and GasSensor datasets have highly overlapped classes. Following the setting in [Aggarwal et al., 2006; Din et al., 2020], we partition each benchmark dataset into a number of data chunks with a fixed size of 1000 samples and pass them sequentially to each algorithm in the experiment. For the Syn-1, Syn-2, Sea, and Shuttle datasets, the data chunks are arranged in an order that simulates abrupt concept drifts; for the remaining datasets, we arrange the data chunks in an order that generates gradual concept drifts.
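The streaming protocol itself is straightforward; the generator below is one way to cut a dataset into fixed-size chunks of 1000 samples and feed them in sequence. It is a generic sketch of the setting, not the authors' experiment script, and it assumes the drift-inducing chunk ordering has already been prepared.

```python
def stream_chunks(X, y, chunk_size=1000):
    """Yield (features, labels) chunks of a fixed size in their stored order."""
    for start in range(0, len(X), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

# Every compared algorithm receives the same sequence of chunks:
# for chunk_x, chunk_y in stream_chunks(X, y):
#     process_chunk(chunk_x)          # hypothetical per-chunk interface
```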
Dataset    Metric  LNP        OReSSL     CDSC-AL
Syn-1      BA      0.8869(3)  0.9307(2)  0.9459(1)
           Fmac    0.7939(3)  0.9318(2)  0.9490(1)
Syn-2      BA      0.8338(3)  0.8481(1)  0.8459(2)
           Fmac    0.6375(3)  0.7899(2)  0.8149(1)
Sea        BA      0.5262(3)  0.8206(2)  0.9691(1)
           Fmac    0.6019(3)  0.8275(2)  0.9729(1)
KDD99      BA      0.5362(3)  0.6831(2)  0.8364(1)
           Fmac    0.5311(3)  0.7076(2)  0.7921(1)
Forest     BA      0.5181(3)  0.7153(2)  0.8465(1)
           Fmac    0.5258(3)  0.7114(2)  0.8230(1)
GasSensor  BA      0.6479(3)  0.8841(2)  0.8916(1)
           Fmac    0.6611(3)  0.8674(2)  0.8995(1)
Shuttle    BA      0.4172(3)  0.4709(2)  0.4744(1)
           Fmac    0.4119(3)  0.4862(1)  0.4789(2)
MNIST      BA      0.7682(3)  0.8806(2)  0.9669(1)
           Fmac    0.7725(3)  0.8828(2)  0.9676(1)
CIFAR10    BA      0.4158(3)  0.6421(2)  0.7857(1)
           Fmac    0.4195(3)  0.6344(2)  0.7869(1)
Avg. ranks BA      3.00       1.78       1.22
           Fmac    3.00       1.89       1.11

Table 2: Performance comparison with semi-supervised methods. (Relative rank of each algorithm is shown within parentheses.)

Figure 1: Comparison of CDSC-AL against semi-supervised methods with the Nemenyi test with α = 0.05 using BA.

Figure 2: Comparison of CDSC-AL against semi-supervised methods with the Nemenyi test with α = 0.05 using Fmac.
Compared Methods. We compared the CDSC-AL framework with the state-of-the-art methods considering two aspects: (i) comparison with semi-supervised approaches, and (ii) comparison with supervised approaches. We selected two existing semi-supervised methods, namely OReSSL [Din et al., 2020] and LNP [Wang and Zhang, 2007]; the results are summarized in Table 2. Four supervised methods, including Leverage Bagging (LB) [Bifet et al., 2010b], OZA Bag ADWIN (OBA) [Bifet et al., 2009], Adaptive Hoeffding Tree (AHT) [Bifet and Gavaldà, 2009], and SAMkNN [Losing et al., 2018], are used for the second comparison study; the results are presented in Table 3. The MATLAB code for the semi-supervised methods is released by their authors, and all codes for the supervised methods can be found in the Massive Online Analysis (MOA) framework [Bifet et al., 2010a]. The Python code of the CDSC-AL framework is available at https://github.com/XuyangAbert/CDSC-AL. All experiments are conducted on an Intel Xeon(R) machine with 64GB RAM running Microsoft Windows 10.

Parameter Setting. For the semi-supervised methods and CDSC-AL, the portion of labeled data of each incoming data chunk is set to 10%. For the supervised methods, the labels of all samples from an incoming data chunk are provided to update the classifier after classification, while CDSC-AL utilizes only 10% labeled data.

Evaluation Metrics. Due to the imbalanced class distributions in the benchmark datasets, we use the balanced classification accuracy (BA) [Brodersen et al., 2010] and the macro-average of the F-score (Fmac) [Kelleher et al., 2020] as performance evaluation metrics. We recorded these two metrics over the entire data stream classification and report the average values for performance evaluation. The best results are highlighted in boldface. The Friedman and Nemenyi post-hoc tests [Demšar, 2006] are employed to statistically analyze the experimental results with a significance level of 0.05.

5.2 Results and Discussions

Comparison with Semi-supervised Methods. To compare the CDSC-AL method with the OReSSL and LNP methods, we repeated each experiment ten times; the average results are presented in Table 2. From Table 2, the balanced classification accuracy shows that the CDSC-AL method outperforms the other two methods on most data streams, with the lowest average ranks of 1.22 (BA) and 1.11 (Fmac). In terms of Fmac, CDSC-AL also provides better performance on most data streams. For data streams with abrupt concept drifts, CDSC-AL still achieves slightly better or comparable performance. From the Nemenyi post-hoc test, Figures 1 and 2 reveal that there is a statistically significant difference between CDSC-AL and LNP in terms of BA and Fmac. Although CDSC-AL shows statistically comparable performance with the OReSSL method, it does not require any parameter optimization or an initial set of labeled data.

Comparison with Supervised Methods. Table 3 presents the results of CDSC-AL and the four supervised methods. Using only 10% of the labels, Table 3 demonstrates that the CDSC-AL method achieves the best performance on six of the benchmark data streams, including Syn-1, Syn-2, Sea, GasSensor, MNIST, and CIFAR10. In Figures 3 and 4, CDSC-AL shows statistically comparable performance to the LB, OBA, and AHT methods on the remaining three data streams according to the Nemenyi test. Also, CDSC-AL has statistically better performance than the SAMkNN approach. For data streams with abrupt concept drifts, CDSC-AL presents slightly better or comparable performance relative to the supervised approaches. In summary, the comparison study with supervised methods reveals that CDSC-AL always provides statistically better or comparable performance than the supervised methods using only a small proportion of labeled data.
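For reference, the two reported metrics can be computed per chunk with scikit-learn, as in the sketch below, and then averaged over the whole stream. This is a generic illustration of BA and Fmac, not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def chunk_scores(y_true, y_pred):
    """Balanced accuracy (BA) and macro-averaged F-score (Fmac) for one chunk."""
    return (balanced_accuracy_score(y_true, y_pred),
            f1_score(y_true, y_pred, average="macro"))

def stream_average(per_chunk_scores):
    """Average the per-chunk (BA, Fmac) pairs over the entire stream."""
    return tuple(np.mean(per_chunk_scores, axis=0))
```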
Dataset    Metric  LB         OBA        AHT        SAMkNN     CDSC-AL
Syn-1      BA      0.7910(2)  0.6640(3)  0.6354(4)  0.6247(5)  0.9459(1)
           Fmac    0.7965(2)  0.6675(3)  0.6513(4)  0.6313(5)  0.9490(1)
Syn-2      BA      0.7124(2)  0.7204(3)  0.6926(4)  0.6784(5)  0.8459(1)
           Fmac    0.7218(2)  0.7219(3)  0.6977(4)  0.6864(5)  0.8149(1)
Sea        BA      0.8204(2)  0.7498(3)  0.7493(4)  0.7205(5)  0.9691(1)
           Fmac    0.8227(2)  0.7501(4)  0.7505(3)  0.7345(5)  0.9729(1)
KDD99      BA      0.7585(4)  0.7812(3)  0.8541(1)  0.7495(5)  0.8364(2)
           Fmac    0.7564(4)  0.7798(3)  0.8012(1)  0.7682(5)  0.7921(2)
Forest     BA      0.8888(1)  0.8707(2)  0.8612(3)  0.8545(4)  0.8465(5)
           Fmac    0.8901(1)  0.8709(2)  0.8688(3)  0.8588(4)  0.8230(5)
GasSensor  BA      0.7185(2)  0.6345(4)  0.6111(3)  0.6357(5)  0.8916(1)
           Fmac    0.7199(2)  0.6361(4)  0.6188(3)  0.6412(5)  0.8995(1)
Shuttle    BA      0.4789(1)  0.4477(4)  0.4508(3)  0.4424(5)  0.4744(2)
           Fmac    0.5187(1)  0.5112(2)  0.4978(3)  0.4894(4)  0.4789(5)
MNIST      BA      0.8909(2)  0.8498(4)  0.8393(5)  0.8549(3)  0.9669(1)
           Fmac    0.8946(2)  0.8501(4)  0.8412(5)  0.8596(3)  0.9676(1)
CIFAR10    BA      0.7199(3)  0.6208(5)  0.7366(2)  0.6218(4)  0.7857(1)
           Fmac    0.7208(2)  0.6325(4)  0.7381(3)  0.6295(5)  0.7869(1)
Avg. ranks BA      2.00       2.86       3.57       4.57       1.86
           Fmac    1.75       2.63       3.00       4.00       2.00

Table 3: Performance comparison with supervised methods. (Relative rank of each algorithm is shown within parentheses.)
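Figures 1–4 report Nemenyi tests at α = 0.05 on top of a Friedman test. A rough way to reproduce this kind of analysis is sketched below with SciPy; the Nemenyi post-hoc step is shown with the third-party scikit-posthocs package as an assumed extra dependency (any critical-difference implementation would do). The example rows reuse BA values from Table 2.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Rows: datasets, columns: algorithms (LNP, OReSSL, CDSC-AL); BA values from Table 2.
scores = np.array([
    [0.8869, 0.9307, 0.9459],   # Syn-1
    [0.8338, 0.8481, 0.8459],   # Syn-2
    [0.5262, 0.8206, 0.9691],   # Sea
    [0.5362, 0.6831, 0.8364],   # KDD99
    [0.5181, 0.7153, 0.8465],   # Forest
])

stat, p_value = friedmanchisquare(*scores.T)   # one array of scores per algorithm
print(f"Friedman statistic = {stat:.3f}, p = {p_value:.4f}")

# Post-hoc Nemenyi test (assumed dependency: pip install scikit-posthocs):
# import scikit_posthocs as sp
# print(sp.posthoc_nemenyi_friedman(scores))
```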

Figure 3: Comparison of CDSC-AL against supervised methods with the Nemenyi test with α = 0.05 using BA.

Figure 4: Comparison of CDSC-AL against supervised methods with the Nemenyi test with α = 0.05 using Fmac.
6 Conclusions and Future Work

We presented a clustering-based data stream classification framework with active learning (CDSC-AL) to handle non-stationary data streams. We developed a new cluster merge procedure as part of the new DFPS-clustering procedure and used it to capture drifted concepts and novel concepts. Two levels of cluster summaries were maintained to effectively classify the incoming data samples in the presence of overlapped classes, and an active learning technique was introduced to learn the change of data distributions. The primary advantages of the CDSC-AL method can be summarized from several points of view: (i) no dependency on an initial label set, (ii) effective detection of drifted concepts and novel concepts with a dynamic threshold, and (iii) high-quality classification performance in the presence of overlapped classes with a small proportion of labeled data. The comparison studies with the semi-supervised and supervised methods justify the efficacy of the CDSC-AL method on both synthetic and real data.

In the future, we will investigate solutions to reduce the time complexity of CDSC-AL and implement the CDSC-AL framework in real-world applications such as text/image stream classification.

Acknowledgements

This work is supported by the Air Force Research Laboratory and the OSD under agreement number FA8750-15-2-0116. Also, this work is partially funded by the NASA University Leadership Initiative (ULI), the National Science Foundation, and the OSD RTL under grant numbers 80NSSC20M0161, 2000320, and W911NF-20-2-0261, respectively. The authors would like to thank them for their support.

References

[Aggarwal et al., 2006] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.

[Altman, 1992] Naomi S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

[Bifet and Gavalda, 2007] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 443–448. SIAM, 2007.

[Bifet and Gavaldà, 2009] Albert Bifet and Ricard Gavaldà. Adaptive learning from evolving data streams. In International Symposium on Intelligent Data Analysis, pages 249–260. Springer, 2009.

[Bifet et al., 2009] Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Richard Kirkby, and Ricard Gavaldà. New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 139–148, 2009.

[Bifet et al., 2010a] Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA: Massive online analysis. Journal of Machine Learning Research, 11:1601–1604, 2010.

[Bifet et al., 2010b] Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Leveraging bagging for evolving data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 135–150. Springer, 2010.

[Bifet et al., 2013] Albert Bifet, Bernhard Pfahringer, Jesse Read, and Geoff Holmes. Efficient data stream classification via probabilistic adaptive windows. In Proceedings of the 28th Annual ACM Symposium on Applied Computing, pages 801–806, 2013.

[Bouchachia, 2011] Abdelhamid Bouchachia. Incremental learning with multi-level adaptation. Neurocomputing, 74(11):1785–1799, 2011.

[Brodersen et al., 2010] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. The balanced accuracy and its posterior distribution. In 2010 20th International Conference on Pattern Recognition, pages 3121–3124. IEEE, 2010.

[Demšar, 2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.

[Dheeru and Karra Taniskidou, 2017] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.

[Din et al., 2020] Salah Ud Din, Junming Shao, Jay Kumar, Waqar Ali, Jiaming Liu, and Yu Ye. Online reliable semi-supervised learning on evolving data streams. Information Sciences, 2020.

[Gama et al., 2014] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1–37, 2014.

[Gomes et al., 2017] Heitor Murilo Gomes, Jean Paul Barddal, Fabrício Enembreck, and Albert Bifet. A survey on ensemble learning for data stream classification. ACM Computing Surveys (CSUR), 50(2):1–36, 2017.

[Gu et al., 2019] Shilin Gu, Yang Cai, Jincheng Shan, and Chenping Hou. Active learning with error-correcting output codes. Neurocomputing, 364:182–191, 2019.

[Hosseini et al., 2016] Mohammad Javad Hosseini, Ameneh Gholipour, and Hamid Beigy. An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams. Knowledge and Information Systems, 46(3):567–597, 2016.

[Kelleher et al., 2020] John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press, 2020.

[Li and Zhou, 2007] Ming Li and Zhi-Hua Zhou. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Transactions on Systems, Man, and Cybernetics–Part A: Systems and Humans, 37(6):1088–1098, 2007.

[Losing et al., 2018] Viktor Losing, Barbara Hammer, and Heiko Wersing. Tackling heterogeneous concept drift with the self-adjusting memory (SAM). Knowledge and Information Systems, 54(1):171–201, 2018.

[Lughofer et al., 2016] Edwin Lughofer, Eva Weigl, Wolfgang Heidl, Christian Eitzinger, and Thomas Radauer. Recognizing input space and target concept drifts in data streams with scarcely labeled and unlabelled instances. Information Sciences, 355:127–151, 2016.

[Masud et al., 2010] Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering, 23(6):859–874, 2010.

[Masud et al., 2012] Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, and Nikunj C. Oza. Facing the reality of data stream classification: coping with scarcity of labeled data. Knowledge and Information Systems, 33(1):213–244, 2012.

[Mohamad et al., 2018] Saad Mohamad, Moamar Sayed-Mouchaweh, and Abdelhamid Bouchachia. Active learning for classifying data streams with unknown number of classes. Neural Networks, 98:1–15, 2018.

[Mu et al., 2017] Xin Mu, Feida Zhu, Juan Du, Ee-Peng Lim, and Zhi-Hua Zhou. Streaming classification with emerging new class by class matrix sketching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

[Ravi and Diao, 2016] Sujith Ravi and Qiming Diao. Large scale distributed semi-supervised learning using streaming approximation. In Artificial Intelligence and Statistics, pages 519–528, 2016.

[Shao et al., 2016] Junming Shao, Chen Huang, Qinli Yang, and Guangchun Luo. Reliable semi-supervised learning. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1197–1202. IEEE, 2016.

[Wagner et al., 2018] Tal Wagner, Sudipto Guha, Shiva Kasiviswanathan, and Nina Mishra. Semi-supervised learning on data streams via temporal label propagation. In International Conference on Machine Learning, pages 5095–5104, 2018.

[Wang and Li, 2018] Yi Wang and Tao Li. Improving semi-supervised co-forest algorithm in evolving data streams. Applied Intelligence, 48(10):3248–3262, 2018.

[Wang and Zhang, 2007] Fei Wang and Changshui Zhang. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, 20(1):55–67, 2007.

[Yan et al., 2019] Xuyang Yan, Mohammad Razeghi-Jahromi, Abdollah Homaifar, Berat A. Erol, Abenezer Girma, and Edward Tunstel. A novel streaming data clustering algorithm based on fitness proportionate sharing. IEEE Access, 7:184985–185000, 2019.

[Yan et al., 2020] Xuyang Yan, Shabnam Nazmi, Berat A. Erol, Abdollah Homaifar, Biniam Gebru, and Edward Tunstel. An efficient unsupervised feature selection procedure through feature clustering. Pattern Recognition Letters, 2020.

[Zhang et al., 2010] Peng Zhang, Xingquan Zhu, Jianlong Tan, and Li Guo. Classifier and cluster ensembles for mining concept drifting data streams. In 2010 IEEE International Conference on Data Mining, pages 1175–1180. IEEE, 2010.

[Zhu and Li, 2020] Yong-Nan Zhu and Yu-Feng Li. Semi-supervised streaming learning with emerging new labels. In AAAI, pages 7015–7022, 2020.
