CDSC-AL
a merge should happen between a historical cluster and its neighboring cluster from CH_t. Then, it generates two types of clusters: (i) novel clusters and (ii) updated clusters. Clusters that are not merged with historical clusters are considered novel clusters, while the merged clusters are defined as updated clusters. Based on these two types of clusters, novel concepts are captured by novel clusters and drifted concepts are identified as updated clusters. To validate the existence of novel clusters, we use the mean and standard deviation of the density values of historical clusters to compute a dynamic density threshold. Let C_t^{no} be a novel cluster and F_{C_t^{no}} be its density value. The mean and standard deviation of the density values of historical clusters are denoted as µ(F_{C_t}) and σ(F_{C_t}), respectively. The following remark is defined for novel cluster detection.

Remark 1. If F_{C_t^{no}} ≥ |µ(F_{C_t}) − σ(F_{C_t})|, then C_t^{no} is a true novel cluster.

Remark 1 is derived from the three-sigma principle of the Gaussian distribution used in [Yan et al., 2019], and it means that the density value of a true novel cluster should not fall below the mean density of the historical clusters by more than one standard deviation.

Where F_t^i and R_t^i denote the density value and radius of the i-th cluster at time t, respectively. y_t^{i,j} refers to the class label of SC_t^{i,j}, and all samples from SC_t^{i,j} share the same label. LS_t is used to explore the sub-cluster structure of each cluster when classes are highly overlapped. Specifically, we split each cluster into a set of sub-clusters such that each sub-cluster has only a unique class label. Instead of computing the mean vector for each sub-cluster, we consider only the sample with the highest density value in each sub-cluster as the sub-cluster center and use it for label propagation.

These two levels of summary are continuously updated to adapt to the change of DS using an active learning procedure. Since two types of clusters can be obtained from the clustering analysis, a hybrid active learning strategy of informative-based and representative-based sampling is introduced to reduce the labeling costs. The adaptation procedure of these two levels of summary is provided in Algorithm 3. In Algorithm 3, for novel clusters, the representative-based query
Algorithm 3 Adaptation of drifted and novel concepts
Parameters: X_no: a set of samples from novel clusters; X_up: a set of samples from updated clusters; Q_I: a set of queried samples using informative-based sampling; Q_R: a set of queried samples using representative-based sampling; Y_{CH_t}: the label set for CH_t.
1: procedure CLUSTERINGMODEL(HS_t, C_t, CH_t)
2:   [Q_t, Y_{Q_t}] = ACTIVEQUERY(C_t, CH_t)
3:   Y_{CH_t} = CLASSIFY(HS_t, Q_t, Y_{Q_t}, CH_t)
4:   Update GS_t according to C_t
5:   Update LS_t using Y_{CH_t}
6:   return HS_t = [GS_t, LS_t]
7: end procedure
8: procedure ACTIVEQUERY(C_t, CH_t)
9:   Extract novel clusters and updated clusters from C_t
10:  Identify samples that are close to the novel clusters as X_no
11:  Identify samples that are close to the updated clusters as X_up
12:  Representative-based sampling for X_no to obtain Q_R
13:  Informative-based sampling for X_up to obtain Q_I
14:  Q_t = Q_I ∪ Q_R
15:  Query labels from human experts to obtain Y_{Q_t}
16:  return [Q_t, Y_{Q_t}]
17: end procedure

Algorithm 4 Classification through label propagation
Parameters: Y_{CH_t}: the label set for CH_t; Sub_r: a set of representatives from sub-clusters; P_t: a set of prototypes with labels.
1: procedure CLASSIFY(HS_t, Q_t, Y_{Q_t}, CH_t)
2:   Extract sub-cluster centers and their labels from HS_t as Sub_r
3:   P_t = Sub_r ∪ [Q_t, Y_{Q_t}]
4:   Propagate labels from the prototype set to samples in CH_t using the KNN rule and obtain the predicted label set Y_{CH_t}
5:   return Y_{CH_t}
6: end procedure

Datasets     Samples   Dimensions   Classes   Overlap
Syn-1         18900         2          9       False
Syn-2         11400         2         10       True
Sea           60000         2          3       False
KDD99        494021        34          5       False
Forest       581012        11          7       True
GasSensor     13910       128          6       True
Shuttle       58000         9          7       False
MNIST         70000       784         10       False
CIFAR10       60000      3072         10       False

Table 1: Dataset descriptions.
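The classification step of Algorithm 4 can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the authors' released implementation: prototypes are formed from the sub-cluster centers plus the actively queried samples, and each sample in the incoming chunk receives the majority label of its k nearest prototypes.

```python
from collections import Counter
import math

def propagate_labels(prototypes, labels, chunk, k=5):
    """Assign each unlabeled sample the majority label of its k nearest
    labeled prototypes (sub-cluster centers plus actively queried samples).

    prototypes: list of feature tuples; labels: one label per prototype;
    chunk: unlabeled samples from the incoming data chunk.
    """
    preds = []
    for x in chunk:
        # Euclidean distance from x to every prototype.
        dists = sorted(
            (math.dist(x, p), y) for p, y in zip(prototypes, labels)
        )
        # Majority vote among the k nearest prototypes.
        top = [y for _, y in dists[:k]]
        preds.append(Counter(top).most_common(1)[0][0])
    return preds
```

With k = 5, as in the paper's setup, ties in the vote are resolved by whichever label appears first among the nearest prototypes.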
[Gu et al., 2019] is performed by sampling from the centers of clusters. On the other hand, we conduct the informative-based query [Gu et al., 2019] in updated clusters through a distance-based strategy. Unlike the entropy-based sampling strategy [Lughofer et al., 2016], samples that are relatively far from the updated clusters are selected as informative samples for label querying. Let Q_t be the set of queried samples and Y_{Q_t} be the label set for Q_t. After the active label query, the label propagation procedure begins to predict the labels of the remaining samples in CH_t using Y_{Q_t} and HS_t. Finally, the predicted labels are used to update LS_t with a two-step procedure. First, we update the centers of sub-clusters within updated clusters with new samples that have higher density values. Second, we create a set of new sub-clusters for each novel cluster to capture the characteristics of novel concepts.

3.5 Classification through Label Propagation
To classify an incoming data chunk, we employ an effective label propagation procedure based on HS_t and Q_t. First, a set of prototypes with label information is obtained from HS_t and Q_t. Then, a KNN-based classification procedure is employed to propagate the labels of the prototypes to samples in CH_t. Here, we set the value of k to five, and the classification procedure is presented in Algorithm 4.

4 Complexity Analysis
Here, we discuss the time and memory complexity of the CDSC-AL framework using the following parameters:
• n: the number of samples in an incoming data chunk
• m: the number of dimensions of an incoming instance
• |SC_t|: the total number of sub-clusters at time t
• |C_t|: the total number of clusters at time t
• k: the number of nearest neighbors
• |Q_t|: the number of queried samples at time t

Time Complexity. The density evaluation of the new DFPS-clustering procedure requires O(n^2 m) distance calculations. To update HS_t, there will be O(nm|C_t| + nm|SC_t|) ≤ O(2nm|SC_t|) distance calculations, where |C_t| ≤ |SC_t|. The classification of the incoming data chunk requires O(knm(|SC_t| + |Q_t|)) distance calculations. Thus, the total time complexity for a single data chunk, in terms of distance calculations, is O(n^2 m + 2nm|SC_t| + knm(|SC_t| + |Q_t|)).

Memory Complexity. For LS_t, the memory complexity is O(|SC_t|(m + 1)) ≈ O(m|SC_t|). For GS_t, it takes O(|C_t|(m + 2)) ≈ O(m|C_t|) memory space. The total memory complexity is O(m(|SC_t| + |C_t|)).

5 Experimental Studies and Discussions
In this section, experiments are conducted on the CDSC-AL framework using nine benchmark datasets, and comparison studies with state-of-the-art methods are presented.

5.1 Experimental Setup
Datasets and Streaming Settings. Nine multi-class benchmark datasets, including three synthetic datasets and six well-known real datasets from [Dheeru and Karra Taniskidou, 2017], are used in the experiments for performance evaluation. Table 1 summarizes these datasets in terms of sample size, dimensionality, number of classes, and class overlap. According to Table 1, the Forest, Syn-2, CIFAR10, and GasSensor datasets have highly overlapped classes. Following the setting in [Aggarwal et al., 2006; Din et al., 2020], we partition each benchmark dataset into a number of data chunks with a fixed size of 1000 samples and pass them sequentially to each algorithm in the experiment. In Syn-1,
Dataset Metric LNP OReSSL CDSC-AL
BA 0.8869(3) 0.9307(2) 0.9459(1)
Syn-1
Fmac 0.7939(3) 0.9318(2) 0.9490(1)
BA 0.8338(3) 0.8481(1) 0.8459(2)
Syn-2
Fmac 0.6375(3) 0.7899(2) 0.8149(1)
BA 0.5262(3) 0.8206(2) 0.9691(1)
Sea
Fmac 0.6019(3) 0.8275(2) 0.9729(1) Figure 1: Comparison of CDSC-AL against semi-supervised meth-
BA 0.5362(3) 0.6831(2) 0.8364(1) ods with the Nemenyi test with α = 0.05 using BA.
KDD99
Fmac 0.5311(3) 0.7076(2) 0.7921(1)
BA 0.5181(3) 0.7153(2) 0.8465(1)
Forest
Fmac 0.5258(3) 0.7114(2) 0.8230(1)
BA 0.6479(3) 0.8841(2) 0.8916(1)
GasSensor
Fmac 0.6611(3) 0.8674(2) 0.8995(1)
BA 0.4172(3) 0.4709(2) 0.4744(1)
Shuttle
Fmac 0.4119(3) 0.4862(1) 0.4789(2)
BA 0.7682(3) 0.8806(2) 0.9669(1) Figure 2: Comparison of CDSC-AL against semi-supervised meth-
MNIST
Fmac 0.7725(3) 0.8828(2) 0.9676(1) ods with the Nemenyi test with α = 0.05 using Fmac .
BA 0.4158(3) 0.6421(2) 0.7857(1)
CIFAR10
Fmac 0.4195(3) 0.6344(2) 0.7869(1)
BA 3.00 1.78 1.22 mance evaluation metrics. We recorded these two metrics
Avg. ranks
Fmac 3.00 1.89 1.11 over the entire data stream classification and reported the av-
erage values for performance evaluation. The best results are
Table 2: Performance comparison with semi-supervised methods. highlighted in bold-face. The Friedman and Nemenyi post-
(Relative rank of each algorithm is shown within parentheses.) hoc tests [Demšar, 2006] are employed to statistically analyze
the experimental results with a significance level of 0.05.
Syn-2, Sea, and Shuttle datasets, data chunks are arranged 5.2 Results and Discussions
in an order to simulate abrupt concept drifts. For remaining
datasets, we arrange data chunks in an order that generates Comparison with Semi-supervised Methods. To compare
the gradual concept drifts. the CDSC-AL method with the OReSSL and LNP methods,
we repeated each experiment ten times and the average re-
Compared Methods. We compared the CDSC-AL frame- sults are presented in Table 2. From Table 2, the balanced
work with the state-of-the-art methods considering two as- classification accuracy shows that the CDSC-AL method out-
pects: (i) comparison with semi-supervised approaches, and performs the other two methods on most data streams with
(ii) comparison with supervised approaches. We selected two the lowest rank of 1.11. In terms of Fmac , CDSC-AL also
existing semi-supervised methods, namely OReSSL [Din et provides better performance on most data streams. For data
al., 2020] and LNP [Wang and Zhang, 2007], and results are streams with abrupt concept drifts, CDSC-AL still achieves
summarized in Table 2. Four supervised methods, including slightly better or comparable performance. From the Ne-
Leverage Bagging (LB) [Bifet et al., 2010b], OZA Bag AD- menyi post-hoc test, Figures 1 and 2 reveal that there is a sta-
WIN (OBA) [Bifet et al., 2009], Adaptive Hoeffding Tree tistically significant difference between CDSC-AL and LNP
(AHT) [Bifet and Gavaldà, 2009], and SAMkNN [Losing et in terms of BA and Fmac . Although CDSC-AL shows statis-
al., 2018], are used for the second comparison study. The re- tically comparable performance with the OReSSL method, it
sults are presented in Table 3. The MATLAB code for semi- does not require any parameter optimization or an initial set
supervised methods is released by the authors and all codes of labeled data.
for supervised methods can be found on the Massive Online
Analysis (MOA) framework [Bifet et al., 2010a]. The python Comparison with Supervised Methods. Table 3 presents
code of the CDSC-AL framework is available at the link1 . the results of CDSC-AL and the four supervised methods.
All experiments are conducted on an Intel Xeon (R) machine Using only 10% of the labels, Table 3 demonstrates that the
with 64GB RAM operating on Microsoft Windows 10. CDCS-AL method achieves the best performance on six of
the benchmark data streams including Syn-1, Syn-2, Sea,
Parameter Setting. For the semi-supervised methods and GasSensor, MNIST, and CIFAR10. In Figures 3 and 4,
CDSC-AL, the portion of labeled data of each incoming data CDCS-AL shows statistically comparable performance to the
chunk is set as 10%. For supervised methods, the labels of all LB, OBA, and AHT methods on the remaining three data
samples from an incoming data chunk are provided to update streams from the Nemenyi test. Also, CDSC-AL has statis-
the classifier after classification while CDSC-AL utilized only tically better performance than the SAMkNN approach. For
10% labeled data. data streams with abrupt concept drifts, CDSC-AL presents
Evaluation Metrics. Due to the imbalanced class distribu- slightly better or comparable performance relative to super-
tions in benchmark datasets, we use the balanced classifica- vised approaches. In summary, the comparison study with
tion accuracy (BA) [Brodersen et al., 2010] and the macro- supervised methods reveals that CDSDF-AL always provides
average of F-score (Fmac ) [Kelleher et al., 2020] as perfor- statistically better or comparable performance than the super-
vised methods using only a small proportion of labeled data.
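Both evaluation metrics can be sketched as follows. This is a simplified multi-class illustration, not the exact implementations of [Brodersen et al., 2010] or [Kelleher et al., 2020]: BA is taken as the mean of per-class recalls, and F_mac as the unweighted mean of per-class F-scores.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; robust to imbalanced class distributions."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return sum(recalls) / len(classes)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F-scores (F_mac)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

Because both metrics average over classes rather than samples, a majority-class classifier scores poorly even when plain accuracy looks high, which is why they are preferred for the imbalanced streams above.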
1 https://github.com/XuyangAbert/CDSC-AL
Dataset     Metric   LB          OBA         AHT         SAMkNN      CDSC-AL
Syn-1       BA       0.7910(2)   0.6640(3)   0.6354(4)   0.6247(5)   0.9459(1)
            F_mac    0.7965(2)   0.6675(3)   0.6513(4)   0.6313(5)   0.9490(1)
Syn-2       BA       0.7124(2)   0.7204(3)   0.6926(4)   0.6784(5)   0.8459(1)
            F_mac    0.7218(2)   0.7219(3)   0.6977(4)   0.6864(5)   0.8149(1)
Sea         BA       0.8204(2)   0.7498(3)   0.7493(4)   0.7205(5)   0.9691(1)
            F_mac    0.8227(2)   0.7501(4)   0.7505(3)   0.7345(5)   0.9729(1)
KDD99       BA       0.7585(4)   0.7812(3)   0.8541(1)   0.7495(5)   0.8364(2)
            F_mac    0.7564(4)   0.7798(3)   0.8012(1)   0.7682(5)   0.7921(2)
Forest      BA       0.8888(1)   0.8707(2)   0.8612(3)   0.8545(4)   0.8465(5)
            F_mac    0.8901(1)   0.8709(2)   0.8688(3)   0.8588(4)   0.8230(5)
GasSensor   BA       0.7185(2)   0.6345(4)   0.6111(3)   0.6357(5)   0.8916(1)
            F_mac    0.7199(2)   0.6361(4)   0.6188(3)   0.6412(5)   0.8995(1)
Shuttle     BA       0.4789(1)   0.4477(4)   0.4508(3)   0.4424(5)   0.4744(2)
            F_mac    0.5187(1)   0.5112(2)   0.4978(3)   0.4894(4)   0.4789(5)
MNIST       BA       0.8909(2)   0.8498(4)   0.8393(5)   0.8549(3)   0.9669(1)
            F_mac    0.8946(2)   0.8501(4)   0.8412(5)   0.8596(3)   0.9676(1)
CIFAR10     BA       0.7199(3)   0.6208(5)   0.7366(2)   0.6218(4)   0.7857(1)
            F_mac    0.7208(2)   0.6325(4)   0.7381(3)   0.6295(5)   0.7869(1)
Avg. ranks  BA       2.00        2.86        3.57        4.57        1.86
            F_mac    1.75        2.63        3.00        4.00        2.00

Table 3: Performance comparison with supervised methods. (Relative rank of each algorithm is shown within parentheses.)
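The average ranks in the last rows of Tables 2 and 3 follow a standard recipe: on each dataset, rank the methods by score (rank 1 is best), then average each method's ranks across datasets. A minimal sketch (ignoring tied scores, which would normally receive averaged ranks):

```python
def average_ranks(scores):
    """scores: one list per dataset, each holding one score per method
    (higher is better). Returns each method's average rank, where
    rank 1 is the best method on a dataset."""
    n_methods = len(scores[0])
    totals = [0.0] * n_methods
    for row in scores:
        # Sort method indices by descending score; position gives the rank.
        order = sorted(range(n_methods), key=lambda i: row[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            totals[i] += rank
    return [t / len(scores) for t in totals]
```

These average ranks are the input to the Friedman test and the Nemenyi post-hoc comparison [Demšar, 2006] used in Figures 1 and 2.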