
JAISCR, 2020, Vol. 10, No. 4, pp. 287–298
10.2478/jaiscr-2020-0019

A NOVEL DRIFT DETECTION ALGORITHM BASED ON FEATURES' IMPORTANCE ANALYSIS IN A DATA STREAMS ENVIRONMENT

Piotr Duda 1,*, Krzysztof Przybyszewski 2, Lipo Wang 3

1 Department of Computer Engineering, Czestochowa University of Technology, Częstochowa, Poland
2 Information Technology Institute, University of Social Sciences, 90-113 Łódź, and Clark University, Worcester, MA 01610, USA
3 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
* E-mail: [email protected]

Submitted: 5th November 2019; Accepted: 18th May 2020

Abstract

The training set consists of many features that influence the classifier to different degrees. Choosing the most important features and rejecting those that do not carry relevant information is of great importance for the operation of the learned model. In the case of data streams, the importance of the features may additionally change over time. Such changes affect the performance of the classifier but can also be an important indicator of occurring concept drift. In this work, we propose a new algorithm for data stream classification, called Random Forest with Features Importance (RFFI), which uses the measure of feature importance as a drift detector. The RFFI algorithm adapts solutions inspired by the Random Forest algorithm to data stream scenarios. The proposed algorithm combines the ability of ensemble methods to handle slow changes in a data stream with a new method for detecting the occurrence of concept drift. The work contains an experimental analysis of the proposed algorithm, carried out on synthetic and real data.

Keywords: data stream mining, random forest, features importance

1 Introduction

The crucial stage in creating machine learning models is to gather a training set that reflects the considered problem in the best possible way. On the other hand, the nature of many real-world problems changes over time. As a result, it is not possible to access at one moment the data corresponding to all possible scenarios, and consequently to prepare a single model able to handle every change. This problem has become particularly important in recent years as the amount of collected data has increased. Therefore, researchers have paid special attention to a field of artificial intelligence called data stream mining (DSM) [1–13]. In the data stream scenario, instead of a static training set, we assume that the data come to the system continuously, one after the other. DSM is focused on models that can adapt to changes in incoming data. Moreover, these algorithms should minimize two additional criteria: the number of data stored in the system and the learning time. The model must be able to provide output at any time, and the resources used by the model should be strictly limited. DSM algorithms have found many applications, e.g. in network traffic analysis [14], financial data analysis [15], or credit card fraud detection [16]. Recently, the possibilities of combining stream processing methods with deep learning techniques have been explored [13, 17].

One of the most popular families of techniques for data stream mining are ensemble algorithms [18, 19]. In the classic approach, their main idea is to combine the outputs of models built on only a part of the data. This allows for achieving better results with respect to a single component. A simple modification of a classic ensemble algorithm allows us to effectively adapt the model to the changes observed in incoming data. Training new components on chunks of recent data can keep the model up-to-date. The appropriate criterion for including a new component in the ensemble is an important factor that affects the performance of the model. This issue is currently the subject of many studies [20–23].

The non-stationarity phenomenon in the context of data streams is called concept drift. In the literature, two types of concept drift are distinguished: virtual, when changes in the distribution of data do not affect the decision boundaries, and real, when the decision boundaries change. There are a few approaches that allow the algorithms to be updated to operate in a new environment. One of them is the passive approach [24, 25]. It is based on the continuous adaptation of the model to current data and is used, inter alia, in ensemble algorithms. Another method, the so-called active approach, is based on permanent monitoring of the stream itself to indicate the moment at which the concept drift took place. The methods that indicate the moments of significant change in the data distribution are called drift detectors (DDs). In this approach, the model is updated only if the DD signals that concept drift has occurred.

The most popular techniques for creating ensembles of classifiers are bagging and random forests (RF). These methods allow the creation of many different models from one training set. For this purpose, they use the bootstrap sampling technique. This method consists in generating several subsets by sampling with replacement from the training set. The idea of bagging is to learn independent models based on these subsets. The random forests algorithm also uses bootstrap samples, but additionally, different features are excluded in different subsets. It should be noted that decision trees are used as weak classifiers in the RF algorithm. The adaptation of these algorithms to work in the case of data streams requires some modifications, due to the need to minimize the time for processing the available data.

The motives mentioned above inspired us to propose a new algorithm for data stream classification using the RF method. Our algorithm combines the ability of ensemble methods to handle slow changes in a data stream (passive reaction) with a new method, based on feature importance (FI), developed herein to detect (active reaction) the occurrence of concept drift. The developed methods can potentially be applied to monitor various industrial processes, see e.g. [26–31].

The rest of the paper consists of the following sections. Section 2 presents the main trends in the area of ensemble classifiers, random forests, and feature importance. In Section 3, the descriptions of the RF algorithm and the method of computing FI are given. The proposed ensemble algorithm and the new drift detector are presented in Section 4. Section 5 presents the results obtained in simulations performed on synthetic and real data. The article ends with the conclusions presented in Section 6.

2 Related works

In this Section, we recall the most significant and the most recent papers about ensemble methods, feature importance, and drift detectors.

Ensemble methods are popular techniques of data mining in a static environment. They owe their popularity to the possibility of using them to solve many real-world problems, see e.g. [32]. The most significant features that distinguish various ensemble methods are the method of creating new components and the methods for aggregating outputs. However, their adaptation to operate in the data stream scenario also requires important modifications, in particular, a special approach to data preprocessing.
In the Streaming Ensemble Algorithm (SEA) [18], the authors proposed to create a new classifier based on chunks of data (subsequently gathered from the stream and forgotten after processing). To decide about the classification of a new instance, a majority voting strategy was applied. In [33], the authors proposed the Accuracy Weighted Ensemble (AWE) algorithm, which improves on the SEA algorithm by weighting the vote of each component according to its accuracy. Additionally, the authors proved that the decision made by the ensemble will always be at least as good as that made by a single classifier. A resampling method inspired by AdaBoost was proposed in the Learn++ algorithm [34], originally in a static environment. Additionally, the authors proposed a new way to establish the weights for the base classifiers. This idea was adapted to the data stream scenario in [35]. In [22], the authors proposed a method that includes a newly created component only if it ensures an increase in accuracy not only for the current chunk of data but also for the whole data stream. In [23], a new weighting strategy was proposed, assuming that the weak learners are decision trees. Instead of assigning a weight to the whole tree, the authors propose to establish weights in the leaves. The online version of Bagging and Boosting was proposed in [19], and this approach was extended in [36]. For more recent information about ensemble algorithms, the reader is referred to [37] and [38].

In the paper [39], the author presents the random forest procedure in a static environment, introducing randomness both in the training set and in the feature set. This idea has been tailored to the data stream scenario in several ways. The Dynamic Streaming Random Forest (DSRF) was proposed in [40]. In this approach, after the initial phase of generating a finite number of trees, the algorithm updates the statistics defining the thresholds for decision tree construction. Then the algorithm updates the forest with a fixed percentage of the trees. In the DSRF algorithm, the entropy of the incoming data is measured for drift detection. If drift is detected, all the parameters of the algorithm are reset to their initial values, and the algorithm replaces a specific number of trees in the forest, where this number depends on the value of the measured entropy. These ideas are extended in the paper [41]. In [42], the authors propose the Adaptive Random Forests algorithm, which combines the classical random forest procedure with Hoeffding decision trees [43]. To react to changes in the data stream, a procedure based on the ADWIN algorithm [44] and the Page-Hinkley test [45] is applied.

As a consequence of model training based on nonstationary data, the significance of particular features can change over time. This type of drift is called contextual concept drift [46], or feature drift [47]. In [48], the authors adapt the off-line feature importance evaluation procedure to operate in the on-line scenario with classification models. They proposed two models, based on the mean decrease in Gini impurity and the mean decrease in accuracy, respectively. In [49], the authors investigate the statistical properties of the feature importance measure to propose a novel algorithm called Reinforcement Learning Trees. The method called Iterative Subset Selection was proposed in [50]. This method first ranks the features and then iteratively selects the best features from the ranking. More about feature selection in static and streaming environments can be found in [51, 52], and [53].

In the literature, there exist many drift detection methods. One of the most popular DDs is the CUSUM algorithm [45]. It is based on tracing a performance measure (e.g., the accuracy of a classifier). If this measure exceeds a fixed threshold in consecutive steps, the cumulative sum starts to grow. If the sum becomes higher than a certain threshold, the algorithm indicates concept drift. The Page-Hinkley test examines the differences between the current observations and the means of the previously analyzed data, in a similar way to the CUSUM algorithm. The DDM (Drift Detection Method) algorithm [54] treats the data from a stream as Bernoulli trials (assigning them values of 0 or 1 depending on whether they were correctly classified by the current model). The final decision is based on a test that takes into account the means and standard deviations of the previous trials. It was enhanced to deal with abrupt concept drift as the EDDM algorithm [55]. The ADWIN algorithm [44] is based on a sliding window. It searches for a point in the current window that yields two sub-windows with significantly different mean values. The decision is made on the basis of Hoeffding's bound. Moreover, various methods of tracing moving averages can be used as drift detectors. One of the most popular is GMADM (the geometric moving average detection method) [56].
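To make the detector mechanics above concrete, the following is a minimal sketch of a one-sided CUSUM-style detector driven by a per-example error signal. It illustrates the idea described in this paragraph, not the exact scheme of [45]; the parameter names and default values are our assumptions.

```python
# Minimal one-sided CUSUM-style drift detector (illustrative sketch;
# parameter names and defaults are assumptions, not taken from [45]).
class CusumDetector:
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta          # tolerated per-step drift magnitude
        self.threshold = threshold  # alarm level for the cumulative sum
        self.cum_sum = 0.0

    def update(self, error):
        """Feed one error value (e.g., 1 if the current model misclassified
        the example, 0 otherwise). Returns True when the cumulative sum
        exceeds the threshold, i.e., when concept drift is signaled."""
        # The sum grows only while the error exceeds the tolerance.
        self.cum_sum = max(0.0, self.cum_sum + error - self.delta)
        if self.cum_sum > self.threshold:
            self.cum_sum = 0.0  # reset after raising the alarm
            return True
        return False
```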
3 The formalisms

Let us consider a training set $D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where $X_i$ is a $d$-dimensional feature vector and $Y_i \in \mathcal{Y} = \{1, \ldots, l\}$ is a class label, for $i = 1, \ldots, n$. From the classic machine learning point of view, the task is to create a mapping $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X} = f_1 \times f_2 \times \cdots \times f_d$ and $f_j$ corresponds to a single feature, for $j = 1, \ldots, d$.

In the case of ensemble models, the mapping $f$ consists of many so-called weak classifiers $h_i$, for $i = 1, \ldots, T$, where $T$ is the size of the ensemble. The decision about the predicted class for a vector $x$ is indicated by majority voting

$$\hat{y} = f(x) = \arg\max_{c \in \mathcal{Y}} \left[ \mathrm{card}\left(\{h_i(x) = c \mid i = 1, \ldots, T\}\right) \right], \qquad (1)$$

where $\mathrm{card}(A)$ denotes the cardinality of the set $A$. In the random forest algorithm, the ideas of bagging [57] and random subspaces [58] are combined to create a family of classifiers. Based on the training set $D$, new (smaller) training sets are created by sampling with replacement. New weak classifiers are trained on these newly created training sets, but restricted to some subspace of the features.

Different features have different importance. We can say that the $i$-th feature is unimportant if

$$P(f_A(x) = y) = P(f_{A \setminus f_i}(x) = y), \qquad (2)$$

for every $x$ and $y$, where $f_A$ is a mapping from the set $A \subseteq \mathcal{X}$ into the set $\mathcal{Y}$, $f_A: A \to \mathcal{Y}$. On the other hand, we say that the feature is important if $P(f_{\mathcal{X}}(x) = y) \neq P(f_{\mathcal{X} \setminus f_i}(x) = y)$. Measuring the level of feature importance is not a trivial task. In particular, the following procedure can be applied (see [39]):

1. Calculate the outputs of every tree on the testing set ($B$).
2. For each of the features separately:
3. Permute the values of the considered feature in the test set.
4. Calculate the outputs of every tree on the changed testing set ($\tilde{B}$).
5. Set the value of the feature importance as the difference between the accuracies obtained on the original and the permuted test set, divided by the standard deviation of the outputs.

By applying such an approach, the accuracy is calculated on two different sets whose marginal distributions are identical. In the next Section, this idea will be extended to deal with data streams.
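The five-step procedure can be written down compactly. The sketch below is a minimal illustration using a scikit-learn random forest as the ensemble (our choice for readability; any model with a `predict` method would do) and reporting the raw accuracy drop of step 5, without the normalization by the standard deviation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_importance(model, X_test, y_test, rng=None):
    """Permutation feature importance in the spirit of [39]:
    the drop in accuracy caused by permuting each feature."""
    rng = rng or np.random.default_rng(0)
    base_acc = np.mean(model.predict(X_test) == y_test)      # step 1
    importances = np.empty(X_test.shape[1])
    for j in range(X_test.shape[1]):                         # step 2
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])         # step 3
        perm_acc = np.mean(model.predict(X_perm) == y_test)  # step 4
        importances[j] = base_acc - perm_acc                 # step 5 (unnormalized)
    return importances

# Usage sketch:
# rf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
# vi = permutation_importance(rf, X_test, y_test)
```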
4 Random Forest with Feature Importance Drift Detector

In this Section, the proposed method of adapting the random forest to streaming data is described. It uses a chunk-based approach, a popular pre-processing method. Suppose we have a stream of data $S = \{(X_1, Y_1), (X_2, Y_2), \ldots\}$, and the data come to the system continuously, one after the other. In the chunk-based approach, we try to gather a fixed number of data elements. After obtaining the first $n$ data elements from the stream (chunk $B_0$), one can run a standard RF algorithm, which generates $M_0$ trees. In the next step, a new chunk of data ($B_t$), $t = 1, 2, \ldots$, is gathered. Before creating a new component, the current chunk of data is used to assess the previously established ensemble (this procedure is called sequential evaluation). Aside from computing the accuracy, we can also use this chunk of data to obtain the values of the features' importance. We can assume that every particular chunk of data is a set of independent random variables. Let $f_t$ be the RF after processing $t$ chunks, $t = 1, 2, \ldots$; then the values of the function $\varphi$, comparing predictions with actual values, are also random variables, defined as

$$\varphi(X_i) = \begin{cases} 1, & \text{if } f_t(X_i) = Y_i \\ 0, & \text{if } f_t(X_i) \neq Y_i, \end{cases} \qquad (3)$$

for $X_i \in B_{t+1}$, $i = 1, 2, \ldots, \mathrm{card}(B_{t+1})$. Then the accuracy of the classifier, given by the following formula,

$$Acc(B_{t+1}) = \frac{\sum_{X \in B_{t+1}} \varphi(X)}{\mathrm{card}(B_{t+1})}, \qquad (4)$$

is a mean of random variables taking values from a binomial distribution.

Comparing the values of the accuracy on the original ($B_{t+1}$) and permuted ($\tilde{B}^j_{t+1}$) sets, we can define the values of the feature importance

$$VI_j = Acc(B_{t+1}) - Acc(\tilde{B}^j_{t+1}), \qquad (5)$$

for $j = 1, \ldots, d$.
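As a small illustration of how the majority vote of equation (1) and the chunk statistics of equations (3)-(5) fit together, here is a hedged numpy sketch; `trees` is assumed to be any list of fitted classifiers whose `predict` method returns integer labels.

```python
import numpy as np

def majority_vote(trees, X):
    """Equation (1): the class returned most often by the weak classifiers."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape (T, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])

def chunk_accuracy(trees, X_chunk, y_chunk):
    """Equations (3)-(4): the mean of the 0/1 indicators over the chunk."""
    phi = (majority_vote(trees, X_chunk) == y_chunk)  # eq. (3)
    return phi.mean()                                 # eq. (4)

# Equation (5) then follows by comparing chunk accuracies:
#   VI_j = chunk_accuracy(trees, B, y) - chunk_accuracy(trees, B_perm_j, y)
```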
The initial values of the FI ($VI^0_j$) are computed based on chunk $B_0$. Every subsequent value of the FI is computed on the unseen testing set $B_{t+1}$. The idea of the proposed drift detection method is to compare the new values of $VI_j$ with the previously obtained ones. The significance of the changes is tested by application of Hoeffding's bound [59]

$$VI^0_j - VI_j < \sqrt{\frac{R^2 \ln(1/\alpha)}{2n}}, \qquad (6)$$

where $R$ is the range of the considered random variable (in this particular case equal to 2) and $\alpha$ is a fixed parameter. If inequality (6) is satisfied, no changes are made to the ensemble. New trees can still be trained on $B_{t+1}$ and added to the ensemble; the number of additional trees, equal to $M$, is fixed by the user. In the other case, we replace the forest with a new one trained on the current chunk of data and replace the $VI^0_j$ values with the most recently obtained ones.
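The test of inequality (6) reduces to one threshold comparison per chunk. Below is a minimal sketch with R = 2, as stated above; taking n to be the chunk size is our reading of the bound and should be treated as an assumption.

```python
import math

def drift_detected(vi_ref, vi_new, n, alpha=0.9, R=2.0):
    """Return True when the largest drop VI^0_j - VI_j violates
    inequality (6), i.e., when concept drift is signaled."""
    eps = math.sqrt(R * R * math.log(1.0 / alpha) / (2.0 * n))
    worst_drop = max(v0 - v for v0, v in zip(vi_ref, vi_new))
    return worst_drop >= eps
```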
computed differences;
In this paper, we examine three different strategies for computing the $VI_j$ values; a sketch contrasting them is given after the list.

FP - Fixed Permutation. In this approach, one permutation is used for every feature and every chunk of data during the whole data stream processing.

SP - Single Permutation. In this approach, one permutation is used for every feature, but with every chunk of data a new permutation is chosen.

MP - Multiple Permutations. The specific choice of permutation can result in different values of feature importance. In this approach, we average the results obtained with many permutations. This approach allows us to obtain more robust results; however, the number of considered permutations is a bottleneck in terms of the speed of the incoming data.
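The three strategies differ only in how the permutations plugged into equation (5) are drawn; the sketch below makes that difference explicit (the function names are ours, introduced for illustration, and k = 25 mirrors the averaging used in Section 5).

```python
import numpy as np

rng = np.random.default_rng(42)

def fp_permutations(n, _cache={}):
    """FP: one permutation drawn once and reused for every chunk."""
    if n not in _cache:
        _cache[n] = rng.permutation(n)
    return [_cache[n]]

def sp_permutations(n):
    """SP: a single fresh permutation drawn for each new chunk."""
    return [rng.permutation(n)]

def mp_permutations(n, k=25):
    """MP: k permutations per chunk; the resulting VI values are averaged."""
    return [rng.permutation(n) for _ in range(k)]
```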
The proposed algorithm is called RFFI (Random Forest with Features Importance), and its pseudo-code is given below.

Algorithm 1. The RFFI algorithm

Data: Data stream S in the form of data chunks B0, B1, B2, ...; number of initial trees M0; number of additional trees M
Result: Ensemble of classifiers
t = 0;
Take the first chunk Bt from the stream;
Train a Random Forest on Bt (M0 trees);
Compute VI_j^0 on Bt by equation (5), for j = 1, 2, ..., d;
while new data chunks are available do
    t = t + 1;
    Take the next chunk Bt from the stream;
    Compute VI_j on Bt by equation (5), for j = 1, 2, ..., d;
    Compute the differences VI_j^0 - VI_j for j = 1, ..., d;
    Choose the feature F which maximizes the computed differences;
    if inequality (6) is not satisfied for feature F then
        VI_j^0 = VI_j;
        Train a new Random Forest on Bt (M0 trees);
    else
        Do not make any changes;
        Add M new random trees, trained on Bt, to the forest;
    end
end
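For readers who prefer code, here is a hedged Python rendering of the Algorithm 1 loop. It is a sketch under stated assumptions, not the authors' reference implementation: the component trees are scikit-learn decision trees grown on bootstrap samples with random feature subsets, the SP strategy is used for the permutation, and `majority_vote` and `drift_detected` are the helpers sketched earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class RFFI:
    """Sketch of Algorithm 1 (RFFI); assumes the majority_vote() and
    drift_detected() helpers sketched earlier in this document."""

    def __init__(self, m0=10, m=0, alpha=0.9, rng=None):
        self.m0, self.m, self.alpha = m0, m, alpha
        self.rng = rng or np.random.default_rng(0)
        self.trees = []
        self.vi_ref = None  # the VI^0_j values of Algorithm 1

    def _grow_trees(self, X, y, n_trees):
        """Bootstrap samples plus random feature subsets, in the spirit of RF [39]."""
        n = len(y)
        grown = []
        for _ in range(n_trees):
            idx = self.rng.integers(0, n, n)  # bootstrap sample of the chunk
            tree = DecisionTreeClassifier(max_features="sqrt")
            grown.append(tree.fit(X[idx], y[idx]))
        return grown

    def _vi(self, X, y):
        """Equation (5) with the SP strategy: one fresh permutation per chunk."""
        acc = np.mean(majority_vote(self.trees, X) == y)
        perm = self.rng.permutation(len(y))
        vi = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            Xp = X.copy()
            Xp[:, j] = X[perm, j]
            vi[j] = acc - np.mean(majority_vote(self.trees, Xp) == y)
        return vi

    def init_chunk(self, X0, y0):
        """Train the initial forest on B0 and record VI^0_j."""
        self.trees = self._grow_trees(X0, y0, self.m0)
        self.vi_ref = self._vi(X0, y0)

    def process_chunk(self, Xt, yt):
        """One pass of the while-loop of Algorithm 1 for chunk Bt."""
        vi = self._vi(Xt, yt)
        if drift_detected(self.vi_ref, vi, n=len(yt), alpha=self.alpha):
            # Inequality (6) violated: replace the forest, reset VI^0_j.
            self.vi_ref = vi
            self.trees = self._grow_trees(Xt, yt, self.m0)
        elif self.m > 0:
            # No drift signaled: optionally add M new trees trained on Bt.
            self.trees += self._grow_trees(Xt, yt, self.m)
```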
5 Experimental results

In this Section, the results of the simulation experiments are presented. Even though many parameters can have an important influence on the performance of the RFFI algorithm, because of the lack of space the experiments are focused on the reaction to different types of concept drift. Determining the permutation of the values for a given feature is a key issue for the speed of learning of the ensemble. It is worth noting that the accuracy can be computed in parallel on a new data chunk and its permutations. On the other hand, different permutations can result in a different output of the drift detector.
In order to evaluate the repeatability of the FI values, synthetic data was generated using the Random Tree Generator implemented in the MOA software [60]. The data were described by 25 numerical features and one of five class labels. The first chunk of the stream, containing 2000 data elements, was used to train the RFFI algorithm. The second chunk, obtained from the same stream (without concept drift), was used to compute the FI. The values obtained by the SP, FP, and MP methods (see Section 4) were computed 124 times independently. The results for one feature, in the form of a boxplot, are presented in Figure 1. The MP values are the results of averaging 25 samples.

Figure 1. FI values, after 124 iterations, computed by the FP (Fixed Permutation), SP (Single Permutation) and MP (Multiple Permutations) methods.

One can see that the widest spread of values occurs in the case of the FP method. The remaining methods seem to provide similar values. The MP method gives slightly more stable outputs, but at the expense of an increasing number of computations, so it seems to be the worse choice for data stream processing.

In the following subsections, the SP method is applied to investigate various types of concept drift.
5.1 Abrupt concept drift

The Random Tree Generator was applied to generate 100000 data elements. The first 50000 were taken from the first concept and the rest of the data from the second one. Both concepts have 25 features, and each element was assigned to one of two classes. The maximal depths of the trees (used to generate the data) were set for the considered concepts to 20 and 15, respectively. In the first experiment, the performance of the RFFI algorithm was compared with different Random Forest-based algorithms, in particular the Adaptive Random Forest (ARF) algorithm equipped with various drift detection methods. In the simulations, the following DD methods, described in Section 2, were used: ADWIN, CUSUM, DDM, EDDM, GMADM, and no-change (NoCh). Those algorithms use VFDT as the weak classifier. The number of components in the RFFI algorithm was set to 10, M = 0, and the level of confidence for the drift detector was equal to α = 0.9. The ID3 algorithm was applied as the weak classifier. The results obtained after each 1000 data elements are presented in Figure 2.

Figure 2. The accuracies (in percent) for the ARF algorithm with various DDs and RFFI, computed after every 1000 data elements on the synthetic dataset with abrupt concept drift.

One can see that most of the considered algorithms give similar results. This should not come as a surprise, due to the fact that all the algorithms are based on the same approach (RF). However, it is worth noticing that the RFFI algorithm detects the change of concept the earliest. It reacts as soon as the first chunk of data from the new concept arrives. All the other DDs require more data.
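The abrupt-drift protocol is easy to reproduce in outline. The sketch below builds a comparable two-concept stream with plain numpy instead of MOA's Random Tree Generator; the linear concepts are an illustrative simplification, not the same generator.

```python
import numpy as np

rng = np.random.default_rng(7)
d, n_total, n_switch = 25, 100_000, 50_000

# Two random linear "concepts" standing in for the two random trees.
w1, w2 = rng.normal(size=d), rng.normal(size=d)

X = rng.normal(size=(n_total, d))
y = np.where(
    np.arange(n_total) < n_switch,
    (X @ w1 > 0).astype(int),  # concept 1: the first 50000 elements
    (X @ w2 > 0).astype(int),  # concept 2: abrupt switch afterwards
)

# Chunked processing, as in the paper's evaluation (chunks of 1000):
chunks = [(X[i:i + 1000], y[i:i + 1000]) for i in range(0, n_total, 1000)]
```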
To compare with different state-of-the-art data stream classifiers, the results were computed for the following algorithms:

– Oza algorithm with 10 components and VFDT as weak classifier [Oza-HT-10]
– Oza algorithm with 50 components and VFDT as weak classifier [Oza-HT-50]
– Oza algorithm with 10 components, VFDT as weak classifier, and ADWIN [Oza-B-10]
– VFDT algorithm with entropy as impurity measure and confidence level equal to δ = 0.05 [HT-E-0.05]
– VFDT algorithm with entropy as impurity measure and confidence level equal to δ = 0.01 [HT-E-0.01]
– VFDT algorithm with Gini index as impurity measure and confidence level equal to δ = 0.05 [HT-G-0.05]
– VFDT algorithm with Gini index as impurity measure and confidence level equal to δ = 0.01 [HT-G-0.01]
The results are depicted in Figure 3. One can see that RFFI turned out to be better than most of the other algorithms. Only the accuracy of Oza-B-10 was better than that of RFFI at the end of the stream. This is due to the type of weak classifier used. Oza-B-10 uses the VFDT algorithm, which allows the components to be trained during stream processing. In the case of the proposed algorithm, static components were used. However, it is worth noticing that the proposed DD reacts to the change earlier than the other algorithms.

Figure 3. The accuracies (in percent) for RFFI and state-of-the-art algorithms, computed after every 1000 data elements on the synthetic dataset with abrupt concept drift.
5.2 Recurring concept drift

The abilities of the proposed algorithm to react to recurring concept drift were compared with the same algorithms as in subsection 5.1. The data were also generated as in the previous subsection. The only change was the number of concept changes. The initial 50,000 data elements were generated from one concept. Then, data from two different concepts began to be appended to the stream, alternately, each in a package of 2000 elements. Ultimately, the stream contains 100000 elements. The comparisons with other random forest-based and state-of-the-art algorithms are presented in Figures 4 and 5, respectively.

Figure 4. The accuracies (in percent) for the ARF algorithm with various DDs and RFFI, computed after every 1000 data elements on the synthetic dataset with recurring concept drift.

Figure 5. The accuracies (in percent) for RFFI and state-of-the-art algorithms, computed after every 1000 data elements on the synthetic dataset with recurring concept drift.

Against the background of the ARF-based algorithms, it is clearly seen that the type of change and its frequency did not allow any of the considered drift detectors to indicate the moment of drift correctly. All the algorithms re-create the same components for one concept. The use of static trees instead of VFDT allowed the proposed method to achieve better accuracy. However, this does not change the fact that all these methods cannot cope with the detection of rapidly changing abrupt drifts. Comparison with the other classifiers shows that they also have problems analyzing such frequently changing data.

5.3 Incremental concept drift

To illustrate the performance of the algorithms on data with slow incremental changes in the concept, the Hyperplane Generator from the MOA software was applied. The data consist of 25 features, and 10 of them were subject to concept drift. The magnitude of the changes after each data element was set to 0.02, and the probability of reversing the direction of the changes was set to 0.1.

Compared to the algorithms in subsection 5.1, only the setting of RFFI was changed. Relying only on the drift detector does not bring satisfactory results in the case of such changes in the distribution. For efficient operation, it is necessary to use the passive property of ensemble algorithms. For this purpose, each time drift was not detected, four new components, generated from the most recently arrived data, were attached to the forest (M = 4). The other parameters remained unchanged. The results obtained by RFFI and the ARF-based algorithms are presented in Figure 6.

Figure 6. The accuracies (in percent) for the ARF algorithm with various DDs and RFFI, computed after every 1000 data elements on the synthetic dataset with incremental concept drift.

The results are similar for all the methods. This shows that the DDs do not play a key role here. The comparison with the other classifiers is presented in Figure 7.

Figure 7. The accuracies (in percent) for RFFI and state-of-the-art algorithms, computed after every 1000 data elements on the synthetic dataset with incremental concept drift.

One can see that only the Oza-B-10 algorithm outperforms the others. This is a consequence of the application of the ADWIN algorithm. The application of sliding windows seems to be especially beneficial in the case of such changes.

5.4 Real-world data

To perform simulations on real-world data, the popular benchmark dataset called electricity was applied [54]. The data contain eight features and belong to one of two classes. The number of data elements is equal to 45312. The results obtained after processing data chunks, each consisting of 1000 elements, are depicted in Figure 8. The RFFI algorithm was applied with the same values of the parameters as in subsection 5.3.

Figure 8. The accuracies (in percent) for RFFI and state-of-the-art algorithms, computed after every 2000 data elements on the real-world dataset.
In the following experiment, the performance of the RFFI algorithm is compared with the best-performing algorithms from the previous subsections, i.e. Oza-10, Oza-B-10, HT-E-0.05, and ARF-CUSUM. The actual moments of concept-drift occurrence are not known in the case of real data. As prequential evaluation was used, during the learning process the algorithms obtain various accuracies on the subsequent data chunks. The proposed algorithm achieves the best result, 93.9, of all the algorithms. However, one can see that the most stable accuracy was provided by ARF-Adwin, which is probably due to the use of sliding windows. None of the data chunk-based approaches gave better results.

The aggregated values of accuracy and standard deviation obtained after processing the whole stream, and the maximal values of accuracy, are presented in Table 1.

Table 1. Average accuracies (Aa) and standard deviations (Sd), in percent, and the maximum value obtained by the algorithms on the real dataset

Algorithm    Aa      Sd     max
Oza-10       82.5    6.25   91.4
Oza-B-10     84.28   4.89   91.7
HT-E-0.05    82.54   4.7    92
ARF-Adwin    88.8    2.22   93.5
RFFI         85.37   5.79   93.9

The presented results demonstrate that the RFFI algorithm can be effectively used to analyze real data.

6 Conclusions

In this article, we proposed a new algorithm for classification in the data stream scenario. Our proposal is based on the random forest algorithm. To enable the algorithm to adapt to changes in the environment, two approaches were used: the mechanism of incorporating newly learned trees into the ensemble and an innovative drift detector. By combining both these techniques, the algorithm can operate in environments with different types of non-stationarity. The proposed drift detector works no worse than other commonly used methods. Besides, when catching changes, it provides information about which feature (or features) has the most significant impact on the detected drift.

As part of further work, we will improve the drift detector to enhance its operation in a rapidly changing environment. In the presented version, the detector operates on data chunks. In the future, we will try to develop a fully on-line version.
References

[1] P. Duda, M. Jaworski, L. Pietruczuk, and L. Rutkowski, A novel application of Hoeffding's inequality to decision trees construction for data streams, in Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014, pp. 3324–3330.

[2] L. Rutkowski, L. Pietruczuk, P. Duda, and M. Jaworski, Decision trees for mining data streams based on the McDiarmid's bound, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp. 1272–1279, 2013.

[3] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, Decision trees for mining data streams based on the Gaussian approximation, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 108–119, 2014.

[4] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, The CART decision tree for mining data streams, Information Sciences, vol. 266, pp. 1–15, 2014.

[5] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, The Parzen kernel approach to learning in non-stationary environment, in Neural Networks (IJCNN), 2014 International Joint Conference on. IEEE, 2014, pp. 3319–3323.

[6] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, A new method for data stream mining based on the misclassification error, IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 5, pp. 1048–1059, 2015.

[7] P. Duda, M. Jaworski, and L. Rutkowski, Knowledge discovery in data streams with the orthogonal series-based generalized regression neural networks, Information Sciences, 2017.

[8] M. Jaworski, P. Duda, and L. Rutkowski, New splitting criteria for decision trees in stationary data streams, IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2017.

[9] M. Jaworski, P. Duda, L. Rutkowski, P. Najgebauer, and M. Pawlak, Heuristic regression function estimation methods for data streams with concept drift, in Lecture Notes in Computer Science. Springer, 2017, pp. 726–737.

[10] M. Jaworski, P. Duda, and L. Rutkowski, On applying the restricted Boltzmann machine to active concept drift detection, in Computational Intelligence (SSCI), 2017 IEEE Symposium Series on. IEEE, 2017, pp. 1–8.

[11] M. Jaworski, Regression function and noise variance tracking methods for data streams with concept drift, International Journal of Applied Mathematics and Computer Science, vol. 28, no. 3, pp. 559–567, 2018.

[12] P. Duda, M. Jaworski, and L. Rutkowski, Convergent time-varying regression models for data streams: Tracking concept drift by the recursive Parzen-based generalized regression neural networks, International Journal of Neural Systems, vol. 28, no. 02, p. 1750048, 2018.

[13] P. Duda, M. Jaworski, A. Cader, and L. Wang, On training deep neural networks using a streaming approach, Journal of Artificial Intelligence and Soft Computing Research, vol. 10, no. 1, 2020.

[14] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang, Data streaming algorithms for estimating entropy of network traffic, in ACM SIGMETRICS Performance Evaluation Review, vol. 34, no. 1. ACM, 2006, pp. 145–156.

[15] C. Phua, V. Lee, K. Smith, and R. Gayler, A comprehensive survey of data mining-based fraud detection research, arXiv preprint arXiv:1009.6119, 2010.

[16] A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi, Credit card fraud detection: A realistic modeling and a novel learning strategy, IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3784–3797, August 2018.

[17] S. Disabato and M. Roveri, Learning convolutional neural networks in presence of concept drift, in 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.

[18] W. N. Street and Y. Kim, A streaming ensemble algorithm (SEA) for large-scale classification, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2001, pp. 377–382.

[19] N. C. Oza, Online bagging and boosting, in Systems, Man and Cybernetics, 2005 IEEE International Conference on, vol. 3. IEEE, 2005, pp. 2340–2345.

[20] P. Duda, On ensemble components selection in data streams scenario with gradual concept-drift, in International Conference on Artificial Intelligence and Soft Computing. Springer, 2018, pp. 311–320.

[21] P. Duda, M. Jaworski, and L. Rutkowski, On ensemble components selection in data streams scenario with reoccurring concept-drift, in 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2017, pp. 1–7.

[22] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, A method for automatic adjustment of ensemble size in stream data mining, in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 9–15.

[23] L. Pietruczuk, L. Rutkowski, M. Jaworski, and P. Duda, How to adjust an ensemble size in stream data mining?, Information Sciences, vol. 381, pp. 46–54, 2017.

[24] G. Ditzler, M. Roveri, C. Alippi, and R. Polikar, Learning in nonstationary environments: A survey, IEEE Computational Intelligence Magazine, vol. 10, no. 4, pp. 12–25, 2015.

[25] P. Duda, L. Rutkowski, M. Jaworski, and D. Rutkowska, On the Parzen kernel-based probability density function learning procedures over time-varying streaming data with applications to pattern classification, IEEE Transactions on Cybernetics, vol. 50, no. 4, pp. 1683–1696, 2020.

[26] E. Rafajlowicz and W. Rafajlowicz, Testing (non-)linearity of distributed-parameter systems from a video sequence, Asian Journal of Control, vol. 12, no. 2, pp. 146–158, 2010.

[27] E. Rafajlowicz, H. Pawlak-Kruczek, and W. Rafajlowicz, Statistical classifier with ordered decisions as an image based controller with application to gas burners, Lecture Notes in Artificial Intelligence, vol. 8467. Springer, 2014, pp. 586–597.

[28] E. Rafajlowicz and W. Rafajlowicz, Iterative learning in optimal control of linear dynamic processes, International Journal of Control, vol. 91, no. 7, pp. 1522–1540, 2018.

[29] P. Jurewicz, W. Rafajlowicz, J. Reiner, et al., Simulations for tuning a laser power control system of the cladding process, Lecture Notes in Computer Science, vol. 9842. Springer, 2016, pp. 218–229.

[30] E. Rafajlowicz and W. Rafajlowicz, Iterative learning in repetitive optimal control of linear dynamic processes, in 15th International Conference on Artificial Intelligence and Soft Computing (ICAISC), vol. 9692. Springer, 2016, pp. 705–717.

[31] E. Rafajlowicz and W. Rafajlowicz, Control of linear extended nD systems with minimized sensitivity to parameter uncertainties, Multidimensional Systems and Signal Processing, vol. 24, no. 4, pp. 637–656, 2013.

[32] S. A. Ludwig, Applying a neural network ensemble to intrusion detection, Journal of Artificial Intelligence and Soft Computing Research, vol. 9, no. 3, pp. 177–188, 2019.

[33] H. Wang, W. Fan, P. S. Yu, and J. Han, Mining concept-drifting data streams using ensemble classifiers, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003, pp. 226–235.

[34] R. Polikar, L. Upda, S. S. Upda, and V. Honavar, Learn++: An incremental learning algorithm for supervised neural networks, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 31, no. 4, pp. 497–508, 2001.

[35] R. Elwell and R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1517–1531, 2011.

[36] A. Beygelzimer, S. Kale, and H. Luo, Optimal and adaptive algorithms for online boosting, in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 2323–2331.

[37] H. M. Gomes, J. P. Barddal, F. Enembreck, and A. Bifet, A survey on ensemble learning for data stream classification, ACM Computing Surveys (CSUR), vol. 50, no. 2, p. 23, 2017.

[38] B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Wozniak, Ensemble learning for data stream analysis: A survey, Information Fusion, vol. 37, pp. 132–156, 2017.

[39] L. Breiman, Random forests, Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[40] H. Abdulsalam, D. B. Skillicorn, and P. Martin, Classifying evolving data streams using dynamic streaming random forests, in International Conference on Database and Expert Systems Applications. Springer, 2008, pp. 643–651.

[41] H. Abdulsalam, P. Martin, and D. Skillicorn, Streaming random forests, 2008.

[42] H. M. Gomes, A. Bifet, J. Read, J. P. Barddal, F. Enembreck, B. Pfharinger, G. Holmes, and T. Abdessalem, Adaptive random forests for evolving data stream classification, Machine Learning, vol. 106, no. 9-10, pp. 1469–1495, 2017.

[43] P. Domingos and G. Hulten, Mining high-speed data streams, in Proc. 6th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 71–80.

[44] A. Bifet and R. Gavaldà, Adaptive learning from evolving data streams, in International Symposium on Intelligent Data Analysis. Springer, 2009, pp. 249–260.

[45] E. S. Page, Continuous inspection schemes, Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.

[46] J. P. Barddal, H. M. Gomes, F. Enembreck, and B. Pfahringer, A survey on feature drift adaptation: Definition, benchmark, challenges and future directions, Journal of Systems and Software, 07 2016.

[47] H.-L. Nguyen, Y.-K. Woon, W.-K. Ng, and L. Wan, Heterogeneous ensemble for feature drifts in data streams, in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2012, pp. 1–12.

[48] A. P. Cassidy and F. A. Deviney, Calculating feature importance in data streams with concept drift using online random forest, in 2014 IEEE International Conference on Big Data (Big Data). IEEE, 2014, pp. 23–28.

[49] R. Zhu, D. Zeng, and M. R. Kosorok, Reinforcement learning trees, Journal of the American Statistical Association, vol. 110, no. 512, pp. 1770–1784, 2015.

[50] L. Yuan, B. Pfahringer, and J. P. Barddal, Iterative subset selection for feature drifting data streams, in Proceedings of the 33rd Annual ACM Symposium on Applied Computing. ACM, 2018, pp. 510–517.

[51] L. C. Molina, L. Belanche, and À. Nebot, Feature selection algorithms: A survey and experimental evaluation, in 2002 IEEE International Conference on Data Mining, 2002. Proceedings. IEEE, 2002, pp. 306–313.

[52] G. Ditzler, J. LaBarck, J. Ritchie, G. Rosen, and R. Polikar, Extensions to online feature selection using bagging and boosting, IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 9, pp. 4504–4509, 2018.

[53] J. P. Barddal, H. M. Gomes, F. Enembreck, and B. Pfahringer, A survey on feature drift adaptation: Definition, benchmark, challenges and future directions, Journal of Systems and Software, 07 2016.

[54] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, Learning with drift detection, in Brazilian Symposium on Artificial Intelligence. Springer, 2004, pp. 286–295.
Piotr Duda received the M.Sc. degree in mathematics from the Department of Mathematics, Physics, and Chemistry, University of Silesia, Katowice, Poland, in 2009. He obtained the Ph.D. and Sc.D. degrees in computer science from Częstochowa University of Technology, Częstochowa, Poland, in 2015 and 2019, respectively. His current research interests include deep learning and data stream mining.

Krzysztof Przybyszewski is a professor at the University of Social Sciences in Łódź. His adventure with applied computer science began in the 1980s with a simulation of non-quantum collective processes (the subject of his Ph.D. dissertation). At present, he is involved in research and applications of various artificial intelligence technologies and soft computing methods in selected IT problems (in particular, in expert systems supporting the management of education quality in universities, using fuzzy numbers and sets). As a deputy dean at the University of Social Sciences, he is the designer and organizer of the Computer Science Faculty education program. He is the author of over 80 publications in the field of computer science and IT applications.

Dr. Lipo Wang received the Bachelor degree from the National University of Defense Technology (China) and the Ph.D. from Louisiana State University (USA). His research interest is artificial intelligence/machine learning with applications to communications, image/video processing, biomedical engineering, and data mining. He has authored 320 papers, of which 110 are in journals. He has authored 2 monographs and edited 20 books. His work has been cited 7,800 times in Google Scholar. He was/will be keynote speaker for 40 international conferences. He was President of the Asia-Pacific Neural Network Assembly (APNNA) and received the APNNA Excellent Service Award.
