Online Entropy-Based Discretization for Data Streaming Classification

Article history: Received 8 November 2017; Received in revised form 19 January 2018; Accepted 3 March 2018; Available online 9 March 2018

Keywords: Data stream; Concept drift; Data preprocessing; Data reduction; Discretization; Online learning

Abstract

Data quality is deemed a determinant factor in the knowledge extraction process. Low-quality data normally implies low-quality models and decisions. Discretization, as part of data preprocessing, is considered one of the most relevant techniques for improving data quality.

In static discretization, output intervals are generated at once and maintained along the whole process. However, many contemporary problems demand rapid approaches capable of self-adapting their discretization schemes to an ever-changing nature. Other major issues for stream-based discretization, such as interval definition, interval labeling, or how the interaction between the learning and discretization components is implemented, are also discussed in this paper.

In order to address all the aforementioned problems, we propose a novel, online and self-adaptive discretization solution for streaming classification which aims at reducing the negative impact of fluctuations in evolving intervals. Experiments with a long list of standard streaming datasets and discretizers have demonstrated that our proposal performs significantly more accurately than the other alternatives. In addition, our scheme is able to leverage class information without incurring an excessive cost, being ranked as one of the most rapid supervised options.
etc.) output continuous data in the form of batches or individual instances (online) [9]. These unbounded and dynamic data [10] (data streams) demand novel learning schemes that not only adapt well, but also constantly revise their time and memory requirements [11,12]. The ideal scenario is one in which instances are processed once and then discarded. Another requirement to face is the likely non-stationarity of incoming data (concept drift) [13]. Sudden or abrupt changes in the data distribution [14] require outstanding adaptation abilities to follow drifting movements in decision borders.

Online discretization [15] also suffers from concept drift, as the data distribution is strongly connected with the evolving intervals. Ideally, discrete intervals should adapt as smoothly as possible to drifts in order to avoid significant drops in accuracy. Adjustments should also not imply complex rebuilding processes; they should be resolved rapidly. To date, few supervised approaches for online discretization have been presented in the literature. Despite being relevant, these proposals tend to produce abrupt and imprecise splits [15], or they are too costly for streaming systems.
How interval labels are defined and assigned by online discretizers, or what type of discrete information is passed to online learners, are other open problems that have received even less attention in the literature. Any minor alteration in the meaning and/or definition of the discrete intervals implies a certain subsequent drop in learning accuracy. As shown in [15], the standard labeling technique inherited from the static environment is unable to cope with these questions and shows a deficient behavior in this new paradigm. Hence, novel and improved schemes that explicitly address the interval labeling and interaction problems are required in the streaming field.

The aim of this work is to tackle the previous problems by developing a new solution that smoothly and efficiently adapts to incoming drifts. Our method, henceforth called Local Online Fusion Discretizer (LOFD), mainly relies on highly-informative class statistics to generate accurate intervals at every step. Furthermore, the local nature of the operations implemented in LOFD offers low response times, thereby making it suitable for high-speed streaming systems. Finally, we detail two alternatives that can be used by online discretizers to effectively improve the interaction between the discretization and learning phases. The first approach naturally provides reliable histogram information to some learners, whereas the second one is a renovated version of the standard scheme which is valid for all learners. The improvements introduced here aim at minimizing the drawbacks associated with the dynamic relabeling and interaction phenomena, described in Section 2.3.
The main contributions of this paper are as follows.

1. Highly-informative and adaptive discretization schemes that reduce the impact of concept shift in online discretization.
2. Efficient evaluation of cut points by reducing the number of intervals considered in each local operation.
3. Formal definition of the interval labeling and definition problems, as well as of the learner–discretizer interaction in online environments, together with an embedded two-sided potential solution for all the problems enumerated above.
4. Comprehensive experimental evaluation of methods, supported by nonparametric and Bayesian statistical testing.

Our approach will be evaluated using a thorough experimental framework, which includes a list of 12 streaming datasets, two online learning algorithms, and the state of the art for online discretization presented in [15]. A thorough analysis based on nonparametric and Bayesian statistical tests is performed to assess the quality of the results. Additionally, a study concerning the impact of the novel relabeling approaches and a case study are also included for illustrative purposes.

The remaining paper is structured as follows. Section 2 introduces the discretization topic from two different perspectives: its formal definition and its adaptation to the online environment. Section 3 describes thoroughly the solution proposed, highlighting the main contributions introduced to solve the problem. Section 4 presents the results obtained and the subsequent analysis. Finally, some relevant conclusions are provided in Section 5.
2. Background

In this section we detail the discretization problem and some related concepts, such as the use of border points as an optimization. Then the problem of discretizing streaming data is presented, as well as a list of related methods from the literature. Lastly, the issue of interval definition in the online environment is thoroughly analyzed.

2.1. Discretization: related concepts and ideas

Discretization is a data reduction technique that aims at projecting a set of continuous values onto a discrete and finite space [3,16]. Let D refer to a labeled dataset formed by a set of instances N, a set of attributes M, and a set of classes C. All training instances are labeled with a label from C. A discretization algorithm generates a set of disjoint intervals for each continuous attribute A ∈ M. The generated discretization scheme I_A consists of a set of cut points which define the limits of each interval:

I_A = \{ \forall g_i \in \mathrm{Dom}(A) : g_1 < g_2 < \cdots < g_k \},   (1)

where I_A is the discretization scheme for A, and g_1 and g_k are the inferior and superior limits for A, respectively. Notice that the original scheme considers all distinct points in A at the start, where k ≤ |N|.

As a preliminary approach to interval labeling, we can associate each interval with the same index as g_{i−1}, forming the interval set I = \{ I_{A_1}, I_{A_2}, \ldots, I_{A_k} \}, with |I| = k − 1. The labeling process (also called indexing) determines how intervals are retrieved in the subsequent learning process. Following the previous description, the membership of continuous points to a given interval I_{A_j} is defined as follows:

\forall p \in \mathrm{Dom}(A), \; \exists j \in \{1, 2, \ldots, k\} \;|\; p \in I_{A_j} \iff g_{j-1} < p \leq g_j.   (2)

For simplification purposes, the g_{j−1} value of each interval can be removed, so that intervals are uniquely defined by their g_j. The attribute is bounded above by g_k.
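To make the previous definitions concrete, the following minimal Python sketch (illustrative only; the function name and the clamping of out-of-range values are our own choices, not part of any discretizer discussed here) stores a scheme as its ordered cut points and resolves the membership of Eq. (2) with a binary search.

```python
import bisect

def interval_index(cut_points, p):
    """Return j such that g_{j-1} < p <= g_j (Eq. (2)).

    `cut_points` is the sorted list [g_1, ..., g_k]; values above g_k
    are clamped to the last interval for simplicity.
    """
    j = bisect.bisect_left(cut_points, p)      # first g_j >= p
    return min(j, len(cut_points) - 1)

# Toy scheme for one attribute: three cut points -> three intervals.
scheme = [2.5, 4.0, 7.1]
print(interval_index(scheme, 3.2))  # -> 1, since 2.5 < 3.2 <= 4.0
print(interval_index(scheme, 2.5))  # -> 0, since p <= g_1 falls in the first interval
```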
The supervised discretization problem (as described above) is an NP-complete [17] optimization problem whose search space consists of all candidate cut points for each A ∈ M, namely, all distinct points in the input dataset considering each attribute independently. This initial space can be simplified by considering only those points on the borders, which are known to be optimal according to several measures in the literature [18].

Among the wide set of evaluation measures that benefit from the inclusion of boundary points, those based on entropy are distinguished by their outstanding results in discretization. For instance, FUSINTER [19], which integrates quadratic entropy in its evaluations, has proven to be one of the most flexible and competitive discretizers according to [16]. In each iteration, FUSINTER fuses those adjacent intervals whose merging would most improve the aggregated criterion value, defined for each interval as follows:

\mu(I_{A_\beta}) = \alpha \sum_{j=1}^{|C|} \frac{c_{+j}}{|N|} \left( \sum_{i=1}^{|C|} \frac{c_i + \lambda}{c_{+j} + |C|\lambda} \left( 1 - \frac{c_i + \lambda}{c_{+j} + |C|\lambda} \right) \right) + (1 - \alpha) \frac{|C|\lambda}{c_{+j}},   (3)

where c_i represents the number of elements in I_{A_\beta} for a given class, c_{+j} the total amount of elements contained in I_{A_\beta}, and α and λ are two control factors.
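The criterion in Eq. (3) can be read, for a single interval, as a smoothed quadratic-entropy term weighted by the interval size, plus a penalty on small intervals. The sketch below is our hedged interpretation of that per-interval value (it is not FUSINTER's reference implementation, and the control factors used in the example are arbitrary illustrative values, not settings taken from the paper):

```python
def quadratic_entropy(class_counts, n_total, alpha, lam):
    """One reading of the per-interval criterion of Eq. (3).

    `class_counts` is the class histogram of the interval, `n_total` is |N|,
    and `alpha`/`lam` are the two control factors.  Lower values indicate
    purer (more desirable) intervals.
    """
    m = len(class_counts)                        # |C|
    n_int = sum(class_counts)                    # elements in the interval
    p = [(c + lam) / (n_int + m * lam) for c in class_counts]
    impurity = sum(pi * (1.0 - pi) for pi in p)  # quadratic entropy term
    return alpha * (n_int / n_total) * impurity + (1.0 - alpha) * (m * lam) / n_int

# Two example intervals over a stream of 500 instances and 2 classes.
print(quadratic_entropy([40, 2], 500, alpha=0.5, lam=1.0))   # nearly pure: small value
print(quadratic_entropy([20, 22], 500, alpha=0.5, lam=1.0))  # highly mixed: larger value
```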
2.2. Online discretization for data streams: related work

Data streaming describes the scenario in which instances arrive sequentially in the form of unbounded instances or batches [20]. Standard algorithms are not originally designed to cope with unbounded data, since they typically assume that the entire training set is available beforehand. Dynamic systems can also be affected by the changes in data distribution introduced by new data [21]. This phenomenon, called concept drift, is well categorized and described in the literature [14].

Several learning strategies have been proposed in the literature to overcome the concept drift problem. Explicitly, algorithms can leverage an external drift detector [22] to detect drifts and rebuild the model whenever one appears. On the other hand, learners can hold a self-adaptive strategy based on sliding windows, ensembles [23], or online learners [24], which build the model incrementally with each novel instance. The emergence of drifts in dynamic environments poses a major challenge for online discretizers [15], which must adjust their intervals properly over time. Interval adaptation should be as smooth as possible, while at the same time reducing its time consumption.

Early proposals on online discretization usually follow an unsupervised approach, which defines a preset number of intervals in advance. Some proposals compute quantile points (equal frequency) in an approximate or exact manner; the quantiles then serve to delimit the intervals. One of the most relevant proposals in quantile-based discretization is the Incremental Discretization Algorithm (IDA) [25]. IDA employs a reservoir sample to track the data distribution and its quantiles. Equal-width discretization is another alternative, which only demands the number of bins into which the attribute will be split.
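As a schematic view of this unsupervised, quantile-based family (inspired by, but not reproducing, the window-based IDA of [25]), the following sketch keeps a bounded window of recent values and places cut points at its empirical quantiles; the class and parameter names are our own.

```python
from collections import deque

class WindowEqualFrequency:
    """Schematic equal-frequency discretizer over a sliding window."""

    def __init__(self, n_bins=5, window_size=1000):
        self.n_bins = n_bins
        self.window = deque(maxlen=window_size)
        self.cut_points = []

    def update(self, value):
        """Insert a new observation and recompute the quantile cut points."""
        self.window.append(value)
        data = sorted(self.window)
        step = len(data) / self.n_bins
        # n_bins - 1 interior cut points at the 1/k, 2/k, ... quantiles.
        self.cut_points = [data[int(i * step)] for i in range(1, self.n_bins)]
        return self.cut_points

disc = WindowEqualFrequency(n_bins=4, window_size=100)
for x in [5.1, 0.3, 2.2, 9.7, 4.4, 1.8, 7.0, 3.3]:
    cuts = disc.update(x)
print(cuts)   # three interior cut points delimiting four roughly equal-frequency bins
```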
In contrast to unsupervised discretizers, class-guided algorithms do not impose a constant number of intervals; instead, they alternate splits and merges to generate more informative cut points [26]. Few proposals in the literature address online discretization from a supervised perspective. PiD [27] was the first proposal to leverage class information, through its two-layer model. The first layer produces preliminary cut points by summarizing the data, whereas the second one performs class-guided merges on the previous splits. Recurrent updates in the first layer are performed whenever a primal interval grows beyond its size limit. In turn, the second layer uses a parametric approach to merge candidates. PiD has been criticized [25] for several reasons: firstly, the correspondence between layers dilutes as time passes (see Section 3); secondly, high skewness in the data might provoke a dramatic increase in the number of intervals; and finally, repetitive values might also undermine performance due to the generation of different intervals with common points.

In [28], the authors developed an online version of ChiMerge (OC) that offers results identical to those of the original proposal. OC relies on a sliding window technique, as well as several complex structures, to emulate the original ChiMerge. Nevertheless, the complexity of the data structures introduced conveys a barely acceptable cost in time and memory requirements.
2.3. Interval definition, labeling and interaction in the streaming scenario

The evolving nature of discretization in streaming contexts poses a major challenge for the close interaction existing between supervised discretizers and learners. The 1-step definition used in static discretization plainly ignores this problem by assuming no further modifications in the set of cut points. However, a 1-step definition is constantly threatened by the never-ceasing arrival of unseen data in the streaming context. This unpleasant situation hinders not only the discretizer's ability to partition the continuous space, but also the subsequent interaction with the learning operator. In this scenario, online learners are forced to incessantly forget old patterns and learn new ones from the data. This section aims at addressing all these problems.

In the literature, we can find two similar strategies for interval labeling originally designed for static discretization: one based on directly adopting cut points as labels (e.g., "values lower than 2.5"), and another one based on literal human-readable terms, usually set by experts (e.g., "<2.5" is replaced by "low" income). The interval definition is usually composed of the set of cut points for each feature. Cut-point-based intervals represent a quite versatile option, as they do not require expert labeling. However, this strategy can be considered less appropriate for dynamic learning, because cut points constantly move across the feature space. The previous scheme might be replaced by explicit labels that would allow points to vary freely. Nevertheless, learners that rely on literal labels are known to suffer a natural drop in accuracy, because many of them directly depend on these labels to generalize. Additionally, new labels appear and some old ones disappear as a consequence of the natural evolution of the discretization.

Although explicit labeling suffers from definition drift, the cut-point-based strategy – as defined by [18] – can be directly stated as outdated in streaming contexts. This is mainly because intervals maintain class consistency by constantly shifting their limits (and hence their labels). To illustrate this problem, suppose a scenario where all cut points in a scheme I_A at time t are slightly displaced at time t + 1, for example when a new element is inserted in the lowest bin in IDA. In this case the online discretizer is forced to update the label of every bin, thus yielding a completely new scheme. This issue justifies the adoption of explicit labels to track intervals, and relegates cut points to a secondary role (exclusively for definition).
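The difference between the two labeling strategies can be shown with a toy, purely illustrative sketch (not taken from any of the discretizers discussed): when every cut point is displaced slightly, cut-point-based labels change for all bins, whereas explicit labels attached to the intervals persist and only their definitions are updated.

```python
# Cut points of one attribute at time t and after a slight shift at t+1.
cuts_t  = [2.5, 4.0, 7.1]
cuts_t1 = [2.4, 3.9, 7.0]   # every limit displaced a little

# Strategy 1: the cut point itself is the label -> every bin is renamed.
labels_cutpoint_t  = [f"<= {g}" for g in cuts_t]
labels_cutpoint_t1 = [f"<= {g}" for g in cuts_t1]
changed = sum(a != b for a, b in zip(labels_cutpoint_t, labels_cutpoint_t1))
print(changed, "of", len(cuts_t), "labels changed")      # 3 of 3

# Strategy 2: explicit labels attached to the intervals persist across the
# shift; only the definition (the limits) is updated.
intervals = {"I1": 2.5, "I2": 4.0, "I3": 7.1}
for name, new_limit in zip(intervals, cuts_t1):
    intervals[name] = new_limit
print(list(intervals))                                   # ['I1', 'I2', 'I3']
```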
It is important to distinguish between how the interval definition and labeling tasks are accomplished, and how information is transferred between the discretization phase and the subsequent learning one (interval interaction). Most of the time, labels also act as the bridge between both phases, beyond being an explicit denomination for intervals. However, there exist some special situations where labeling and interaction differ. For instance, the algorithm in [29] relies on cut points to define its boundaries, but transfers updated class histograms to the learner. From these statistics, it is possible to derive the conditional likelihood that an attribute value belongs to an interval given its class: P(I_{A_j} | Class). This scheme, called statistic-based discretization, does not require explicit labels, but it is exclusive to some learners, such as Bayesian-based algorithms.
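A minimal sketch of this statistic-based interaction, assuming a simple additive smoothing of our own choosing, would let the learner derive P(I_{A_j} | Class) directly from the per-interval class histograms it receives:

```python
def interval_given_class(histograms, smoothing=1.0):
    """Turn per-interval class counts into P(interval | class).

    `histograms[j][c]` is the number of instances of class `c` observed in
    interval `j` of one attribute.  The smoothing constant is an assumption
    of this sketch, not a setting taken from the paper.
    """
    classes = {c for h in histograms for c in h}
    smoothed = [{c: h.get(c, 0) + smoothing for c in classes} for h in histograms]
    totals = {c: sum(s[c] for s in smoothed) for c in classes}
    return [{c: s[c] / totals[c] for c in classes} for s in smoothed]

# Three intervals of one attribute with their class histograms.
hist = [{"A": 30, "B": 2}, {"A": 10, "B": 12}, {"A": 1, "B": 25}]
cond = interval_given_class(hist)
print(round(cond[0]["A"], 3))   # conditional likelihood of interval 1 given class A
```

A Bayesian learner such as Naive Bayes can then combine these conditional likelihoods across attributes without ever seeing an interval label.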
Explicit labeling also entails a wide range of major issues in online learning, such as abrupt changes in the original definition (label swap, label creation) or the constant transfer of instances between bins (instance relabeling). All of them deeply affect the underlying interaction between discrete values and the learning phase, as shown in [15]. This is especially remarkable in algorithms that uniquely rely on labels, such as linear gradient-based or rule-based algorithms. Furthermore, not only is the meaning of an interval susceptible to alteration, but the number of intervals may also change. The aim here is to reduce as much as possible the number of intervals and instances affected by relabeling, as the bookkeeping sketch below illustrates.
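A toy bookkeeping sketch (ours, and independent of how LOFD actually decides) makes the cost of a split explicit: whichever side of the split keeps the original label, only the instances on the other side need to be relabeled, so the cheaper choice is the one that leaves the larger side untouched.

```python
def relabel_cost(interval_sizes, j, left_size, keep="left"):
    """Count instances whose label changes when interval j is split in two.

    The original label stays with either the left or the right part
    (`keep`); the other part receives a brand-new label, so only its
    instances need to be relabeled.  Purely illustrative bookkeeping.
    """
    right_size = interval_sizes[j] - left_size
    return right_size if keep == "left" else left_size

sizes = [120, 400, 80]            # instances currently held by each interval
print(relabel_cost(sizes, 1, left_size=350, keep="left"))   # 50 instances relabeled
print(relabel_cost(sizes, 1, left_size=350, keep="right"))  # 350 instances relabeled
```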
In order to illustrate the previous problem, we propose an example where a given interval (numeric label i) is split into two new intervals. This causes the number of original intervals and labels to increase, and a new scheme to be generated. If the right resulting interval is deemed the new interval, the new definition [...]
[...] which more strongly reduces entropy (according to Eq. (3)) will be applied. The previous process is repeated recursively until no more merges are available or convenient. Note that a merge is performed iff the quadratic entropy of the resulting interval is lower than the sum of those of both parts. Notice also that the merge is responsible for mixed histograms, formed by boundary points from different classes, in intervals I1 and I3.

This procedure avoids recomputing values for intact intervals (the vast majority), so that only one interval needs to re-evaluate its entropy and potential merges in each iteration.
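The procedure described above can be sketched as follows (our interpretation of the text, not the authors' implementation): after a local change, only the affected interval is compared against its neighbors, and adjacent intervals are fused while the quadratic entropy of the fusion remains below the sum of those of its parts. The control factors are again arbitrary illustrative values.

```python
def quad_entropy(counts, n_total, alpha=0.5, lam=1.0):
    """Per-interval quadratic-entropy criterion (cf. Eq. (3)); illustrative factors."""
    m, n = len(counts), sum(counts)
    p = [(c + lam) / (n + m * lam) for c in counts]
    return (alpha * (n / n_total) * sum(pi * (1 - pi) for pi in p)
            + (1 - alpha) * m * lam / n)

def merge_locally(intervals, j, n_total):
    """Recursively fuse interval j with its neighbors while fusing reduces entropy."""
    merged = True
    while merged:
        merged = False
        for k in (j - 1, j + 1):
            if 0 <= k < len(intervals) and k != j:
                fused = [a + b for a, b in zip(intervals[j], intervals[k])]
                if quad_entropy(fused, n_total) < (quad_entropy(intervals[j], n_total)
                                                   + quad_entropy(intervals[k], n_total)):
                    lo = min(j, k)
                    intervals[lo:lo + 2] = [fused]   # replace the pair by its fusion
                    j = lo
                    merged = True
                    break
    return intervals

# Class histograms (one list per interval) for a toy attribute.
hist = [[40, 1], [38, 2], [3, 30], [1, 33]]
print(merge_locally(hist, 1, n_total=148))   # the two class-0-dominated intervals fuse
```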
4. Experiments and case study

This section provides a comparative analysis between our proposal and the other state-of-the-art discretizers. As LOFD offers two alternatives for interval labeling, two independent sections are issued here. In Section 4.2, LOFD adopts smooth shifting, whereas the rest of the discretizers assume standard labeling to interact with NB. In Section 4.3, LOFD and PiD directly interact with NB through histograms. Finally, a case study is included to illustrate the effect of concept drift on evolving discretization intervals.

4.1. Experimental framework

Here we outline all the details related to our experiments, such as the datasets processed, the parameters involved and their values, the validation scheme, etc. Evaluation has been performed in terms of prediction ability (accuracy), evaluation time (spent on discretizing and prediction), and reduction ability on continuous features (# discrete intervals).

Table 1 shows some basic information about the data. Half of the datasets were artificially created with the Massive Online Analysis (MOA) tool [30], ranging from sudden drift to different types of gradual drift. For more details about the parameter settings used to generate these datasets, please refer to our code repository.³ The remaining datasets were collected from the official MOA webpage, except for kddcup_10, which was processed and generated by Dr. Gama.⁴

Table 1
Basic information concerning the datasets: total number of instances (#Inst.), total number of attributes (#Atts.), number of numerical (#Num.) and nominal (#Nom.) attributes, and number of output labels (#Cl.).

Data set                   #Inst.    #Atts.   #Num.   #Nom.   #Cl.
airlines                   539,383   6        3       3       2
elecNormNew                45,311    8        7       1       2
kddcup_10                  494,020   41       39      2       2
poker-lsn                  829,201   10       5       5       10
covtypeNorm                581,011   54       10      44      7
blips                      500,000   20       20      0       4
sudden_drift               500,000   3        3       0       2
gradual_drift              500,000   3        3       0       2
gradual_recurring_drift    500,000   20       20      0       4
incremental_fast           500,000   10       10      0       4
incremental_slow           500,000   10       10      0       4

In order to evaluate the performance of LOFD, several state-of-the-art discretizers have been included in this framework for comparison purposes. They range from supervised (OC and PiD) to unsupervised schemes (IDA, window-based version). All the described methods have been thoroughly analyzed and categorized in [15]. Gaussian Naïve Bayes (GB) has been elected as the reference classifier for our tests, because of the reasons exposed in Section 2. For the remaining methods, the Gaussian estimation is replaced by the discrete intervals generated. Alternatively, a Hoeffding Tree (HT) [31] is incorporated to test the discretization effect of our solution on other learning models. Table 2 shows the parameters involved in the experiments, as well as the values set according to the authors' criteria.

Table 2
Description of parameters. Default values are shown in the first row. Unless specified otherwise, these values are common to all methods.

Method        Parameters
All disc.     initial elements = 100, window size = 1 (default)
Gaussian NB   –
Gaussian HT   10 splits
IDA [25]      window size = 1000, # bins = 5
OC [28]       –
PiD [27]      α = 0.75, # bins = 500, update layer2 = 10³, min/max = 0/1
LOFD          max. size by histogram = 10,000, initial elements = 5000

The evaluation is performed following a standard evaluation technique for online learning, called interleaved test-then-train [32]. In this scheme, incoming examples are first evaluated by the current model. Afterwards, the examples are incorporated into the model in the training phase. Note that this technique is more appropriate for streaming environments than hold-out evaluation.
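Schematically, interleaved test-then-train reduces to a single pass in which every example is first used for testing and immediately afterwards for training. The sketch below assumes a hypothetical incremental model exposing predict/partial_fit methods and uses a trivial baseline only to make it runnable.

```python
def test_then_train(stream, model):
    """Prequential evaluation: test on each example, then train on it."""
    correct = total = 0
    for x, y in stream:
        total += 1
        if model.predict(x) == y:      # 1) evaluate with the current model
            correct += 1
        model.partial_fit(x, y)        # 2) then incorporate the example
    return correct / total             # prequential accuracy

class MajorityClass:
    """Tiny incremental baseline used only to make the sketch runnable."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def partial_fit(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([0.2], "A"), ([0.7], "B"), ([0.3], "A"), ([0.9], "B")]
print(test_then_train(stream, MajorityClass()))
```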
All experiments have been executed on a single commodity machine with the following characteristics: 2 Intel Core i7 930 processors (4 cores/8 threads, 2.8 GHz, 8 MB cache), 24 GB DDR2 RAM, 1 TB HDD (3 Gb/s), Ethernet network, CentOS 6.4 (Linux). All algorithms, including our new approach, have been packaged in an extension library for MOA (v16.04).⁵ All the experiments have been launched in MOA.

4.2. Analytical comparison: smooth shift vs. static labeling

This section focuses on studying the effect of the LOFD discretizer with smooth labeling in online learning, as well as on comparing our solution with other alternatives which utilize standard labeling.

4.2.1. Accuracy results (predictive ability)

Table 3 contains average accuracy results as a summary of the entire learning process performed by Naive Bayes. From these outcomes we can outline the following conclusions:

• There exists a clear advantage in using LOFD over the other alternatives. LOFD is on average 5% more precise than its closest competitor (IDA), which means 2.5 × 10³ more instances correctly classified.⁶
• Up to now, supervised alternatives have generated worse solutions than those presented by unsupervised approaches, despite the former leveraging class information. On the contrary, LOFD exploits this knowledge to overcome the previous problem and thus generates better schemes.
• LOFD overcomes its competitors in 9/11 datasets, with similar results in the remaining ones. This fact proves the superiority of LOFD, and its great versatility for both real and artificial datasets, as well as for different types of trends and drifts.

To reaffirm the positive results obtained by LOFD, a thorough statistical analysis is performed on the accuracy results through two non-parametric tests [33]. Table 4 reports the results for the Wilcoxon signed-ranks test (1vs1) and the Friedman–Holm test (1vsN) with a significance level α = 0.05. Both tests assert that LOFD clearly outperforms the other options, and no method is close to it in performance (no ties in its row). Additionally, the p-values derived from Holm's test prove to be highly significant (<0.01).
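For reference, the same kind of comparison can be reproduced with standard tooling; the snippet below is an independent illustration on made-up accuracy vectors (not the paper's results) using the Wilcoxon signed-ranks and Friedman tests available in SciPy.

```python
from scipy.stats import wilcoxon, friedmanchisquare

# Per-dataset accuracies of three hypothetical methods (toy numbers).
lofd = [0.78, 0.69, 0.76, 0.84, 0.81, 0.82]
ida  = [0.73, 0.59, 0.66, 0.81, 0.76, 0.77]
pid  = [0.69, 0.55, 0.75, 0.66, 0.74, 0.66]

# 1 vs 1: Wilcoxon signed-ranks test on paired per-dataset results.
stat, p = wilcoxon(lofd, ida)
print(f"Wilcoxon LOFD vs IDA: p = {p:.3f}")

# 1 vs N: Friedman test over all methods; a post hoc correction such as
# Holm's procedure would then adjust the pairwise p-values.
stat, p = friedmanchisquare(lofd, ida, pid)
print(f"Friedman: p = {p:.3f}")
```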
³ https://github.com/sramirez/MOAReduction.
⁴ http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift.
⁵ http://moa.cms.waikato.ac.nz/moa-extensions/.
⁶ Considering 5 × 10⁵ instances per dataset on average.
Table 3
Classification test accuracy on discretized data (Naïve Bayes).
PiD IDA OC GB LOFD
airlines 63.0057 64.1563 65.0723 64.5504 65.0868
elecNormNew 71.9522 76.6905 74.0731 73.3625 77.1517
kddcup_10 99.1474 98.4644 98.1404 97.1908 99.2901
poker-lsn 55.0335 59.4337 58.5465 59.5528 69.3981
covtypeNorm 66.6306 62.7235 64.2254 60.5208 69.2387
blips 74.5680 66.4494 64.2148 60.9060 76.3668
sudden_drift 65.7736 81.3168 77.8808 83.8144 83.5752
gradual_drift_med 60.8404 82.8908 80.1032 84.7000 84.2794
gradual_recurring_drift 65.1678 58.5250 58.5612 56.7450 67.9446
incremental_fast 73.9900 75.6472 75.6036 76.3642 80.7308
incremental_slow 65.6074 76.9186 75.4316 78.0688 81.6210
MEAN 69.2470 73.0197 71.9866 72.3432 77.6985
Table 4
Wilcoxon pairwise test on accuracy, and ranking of methods generated by Friedman's procedure with adjusted p-values by Holm's test. The first two columns represent the Wilcoxon comparisons, where '+' indicates the number of wins achieved by each algorithm, and '±' the number of wins/ties. The best achievement is highlighted. The third column shows the ranking of methods according to Friedman's test, starting from the control method (topmost). The last column details the adjusted p-values generated by the post hoc Holm's test.

[...] platform. Table 6 shows the mean accuracy results following the previous learning procedure but using HT instead of NB. In this scenario, the baseline model (HT with Gaussian approximation, 10 splits) stands out as the best method on average. Nevertheless, LOFD represents an interesting option for real data, outperforming its competitors in 3/5 cases. In general, LOFD is advantageous in the whole spectrum of problems (5/11). Finally, our solution stands, together with IDA, as the best discretization alternative on average.

In summary, LOFD has shown to be compatible with other online algorithms beyond NB. Although HT can be considered less susceptible to discretization than NB, our solution also stands as a positive alternative in some problems.
Table 6
Classification test accuracy on discretized data. Hoeffding tree used as learner.
PiD IDA OC HT LOFD
airlines 64.3951 64.5158 65.3619 65.0784 65.0008
elecNormNew 79.8442 79.8354 70.2132 79.1954 80.7645
kddcup_10 99.8389 99.7929 99.8368 99.7413 99.5120
poker-lsn 57.9820 69.8381 55.4892 76.0685 76.1936
covtypeNorm 77.6671 75.8652 70.1681 80.3119 81.8190
blips 73.6652 86.0112 35.7974 90.9808 79.3036
sudden_drift 69.5128 82.9856 61.3936 84.8418 86.7238
gradual_drift_med 64.6858 84.1394 51.1838 85.5088 86.5246
gradual_recurring_drift 68.2206 83.7164 35.6192 88.3368 77.8664
incremental_fast 71.1508 78.6526 50.6528 82.7748 77.0852
incremental_slow 66.3744 76.7644 50.5308 83.1052 70.9906
MEAN 72.1215 80.1924 58.7497 83.2676 80.1622
Table 7
Runtime performance in seconds. The first columns show the total evaluation time (discretization + prediction), while the last two represent the fraction of time spent by each LOFD component: interval creation (c1) and discretizer–learner interaction (c2).
PiD IDA OC GB LOFD LOFD-c1 LOFD-c2
airlines 114.62 160.16 595.29 14.04 261.48 0.72 0.28
elecNormNew 28.25 9.17 12.57 0.67 16.49 0.77 0.23
kddcup_10 526.50 158.69 3,850.95 18.59 341.49 0.33 0.67
poker-lsn 129.11 104.30 1,769.26 11.72 390.99 0.09 0.91
covtypeNorm 408.86 275.40 1,694.97 28.28 690.43 0.27 0.73
blips 1,610.87 487.77 1,013.60 12.02 780.17 0.40 0.60
sudden_drift 91.56 74.39 183.15 2.49 490.70 0.94 0.06
gradual_drift_med 210.77 94.08 172.43 3.63 672.94 0.94 0.06
gradual_recurring_drift 1,152.51 429.20 1,038.21 12.52 741.98 0.87 0.13
incremental_fast 986.21 274.86 518.33 5.68 853.27 0.87 0.13
incremental_slow 335.08 246.33 615.29 6.06 506.64 0.72 0.28
MEAN 508.58 210.40 1,042.19 10.52 522.42 0.59 0.41
Fig. 3. Detailed plots of prequential accuracy, and CPU time over the data stream progress (# instances processed) on blips.
Fig. 4. Detailed plots of prequential accuracy, and CPU time over the data stream progress (# instances processed) on poker-lsn.
Table 9
Classification test accuracy (%) and total time on discretized data (histogram scheme). The last two columns represent the fraction of time spent by each LOFD component: interval creation (c1) and discretizer–learner interaction (c2).
Datasets Accuracy Time (s) LOFD comp. runtime (rate)
PiD LOFD PiD LOFD LOFD-c1 LOFD-c2
airlines 53.4763 64.6136 108.08 221.01 0.84 0.16
elecNormNew 74.1989 75.1324 21.18 13.81 0.94 0.06
kddcup_10 97.9902 99.2079 514.84 140.38 0.71 0.29
poker-lsn 0.1117 61.0778 108.71 55.26 0.54 0.46
covtypeNorm 63.1194 62.9710 407.16 217.69 0.71 0.20
blips 70.9794 72.7270 1,243.54 315.70 0.81 0.19
sudden_drift 38.6880 83.3610 89.53 479.93 0.99 0.01
gradual_drift_med 51.1544 84.4526 180.98 480.45 0.99 0.01
gradual_recurring_drift 62.9858 64.0668 1,201.22 351.77 0.82 0.18
incremental_fast 75.6676 75.9816 868.54 500.65 0.98 0.02
incremental_slow 77.3006 77.2844 278.62 267.95 0.96 0.04
MEAN 60.5157 74.6251 456.58 276.78 0.84 0.16
4.4. Case study: sudden drift scenario

This section illustrates the different discretization solutions offered by the discretizers studied, as well as how they adapt their solutions after the appearance of concept drifts. Fig. 5 depicts the solutions generated for the sudden drift dataset, before and after (+1 × 10⁴ instances later) a drift appears in attribute #2. The drift concretely appears after iteration 3.75 × 10⁵.

Along with the cut points limiting the intervals (vertical lines), a simplified class histogram of the last 1 × 10³ points is included in the figure. The left subplots show the density of points before the drift, where most of the points are blue and skewed to the right. After the drift (right subplots), more red points appear on the left side, thus practically removing the skewness. The expected output here is that more intervals appear in the leftmost part of the histogram, in order to follow the trend and thus better separate the classes.

As observed in Fig. 5, only LOFD generates more cut points to the left of the midpoint after the drift, whereas the rest look practically identical to their previous solution. In fact, we can observe that the other supervised schemes look quite misguided, given that their intervals are quite concentrated, thus showing a high level of overlapping (especially those of PiD). Concerning IDA, its discretization solution fits an equal-frequency approach perfectly, as expected. LOFD intervals also look well distributed, but at the same time they respect and follow the class borders.
Fig. 5. Density plots before and after a concept drift in attribute #2 (sudden drift dataset). Each row corresponds to a different discretizer (LOFD, IDA, PiD, OC), each column to the distribution of the data before (iteration 3.75 × 10⁵) and after (iteration 3.85 × 10⁵) the drift, and each vertical line to one of the intervals generated.
5. Concluding remarks

In this paper we have studied several major issues faced by contemporary online discretizers. How discretizers should adapt their intervals, and how intervals are labeled and interact with learners, are two of the main axes around which further developments should revolve. As a potential solution for the interval labeling and interaction problems, we have proposed and analyzed two opposing strategies. The first one is a renovated solution valid for every online learning system, whereas the other one relies on histograms to provide direct information to, for example, Bayesian learners. Both alternatives have shown in the experiments their positive effect on the transition between consecutive discretization states.

To solve the adaptation problem, we have implemented all the labeling schemes in a novel online discretization algorithm, called LOFD. This discretizer produces self-adaptive and highly-informative discretization schemes, in which precise intervals are supported by updated class statistics. LOFD also presents a high level of responsiveness thanks to the fully local strategy implemented, which is mainly based on fast interval fusions and splits.
The complex experimental framework employed, with 12 datasets and 3 algorithms, has proven that LOFD is by far the most competitive solution in terms of predictive accuracy. This has also been confirmed by the statistical analysis carried out with a significance level α ≤ 0.01. LOFD is also ranked as one of the most rapid supervised discretizers. Compared with the other alternatives, which either barely cover the search space or generate too many meaningless intervals, LOFD is able to achieve an excellent trade-off between simple and precise solutions.

Acknowledgments

This work is supported by the Spanish National Research Project TIN2014-57251-P, the Foundation BBVA project 75/2016 BigDaPTOOLS, and the Andalusian Research Plan P11-TIC-7765. S. Ramírez-Gallego holds an FPU scholarship from the Spanish Ministry of Education and Science (FPU13/00047).

References

[1] S. García, J. Luengo, F. Herrera, Data Preprocessing in Data Mining, Springer, 2015.
[2] S. García, J. Luengo, F. Herrera, Tutorial on practical tips of the most influential data preprocessing algorithms in data mining, Knowl.-Based Syst. 98 (2016) 1–29.
[3] H. Liu, F. Hussain, C.L. Tan, M. Dash, Discretization: An enabling technique, Data Mining Knowl. Discov. 6 (4) (2002) 393–423.
[4] S. Ramírez-Gallego, S. García, H. Mouriño Talín, D. Martínez-Rego, V. Bolón-Canedo, A. Alonso-Betanzos, J.M. Benítez, F. Herrera, Data discretization: Taxonomy and big data challenge, Wiley Interdiscip. Rev.: Data Mining Knowl. Discov. 6 (1) (2016) 5–21.
[5] H. Chen, T. Li, C. Luo, S.J. Horng, G. Wang, A rough set-based method for updating decision rules on attribute values' coarsening and refining, IEEE Trans. Knowl. Data Eng. 26 (12) (2014) 2886–2899.
[6] Y. Yang, G.I. Webb, Discretization for Naive-Bayes learning: Managing discretization bias and variance, Mach. Learn. 74 (1) (2009) 39–74.
[7] X. Wang, Y. He, D.D. Wang, Non-naive Bayesian classifiers for classification problems with continuous attributes, IEEE Trans. Cybern. 44 (1) (2014) 21–39.
[8] Q. Wu, D. Bell, M. McGinnity, G. Prasad, G. Qi, X. Huang, Improvement of decision accuracy using discretization of continuous attributes, in: Proceedings of the Third International Conference on Fuzzy Systems and Knowledge Discovery, FSKD'06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 674–683.
[9] J. Lu, P. Zhao, S.C.H. Hoi, Online passive-aggressive active learning, Mach. Learn. 103 (2) (2016) 141–183.
[10] J. Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC, 2010.
[11] M.-A. Aufaure, R. Chiky, O. Curé, H. Khrouf, G. Kepeklian, From business intelligence to semantic data stream management, Future Gener. Comput. Syst. 63 (Supplement C) (2016) 100–107, Modeling and Management for Big Data Analytics and Visualization.
[12] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, F. Herrera, Big data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce, Inform. Fusion 42 (Supplement C) (2018) 51–61.
[13] R. Pears, S. Sakthithasan, Y.S. Koh, Detecting concept change in dynamic data streams, Mach. Learn. 97 (3) (2014) 259–293.
[14] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Comput. Surveys 46 (4) (2014) 44:1–44:37.
[15] S. Ramírez-Gallego, B. Krawczyk, S. García, M. Woźniak, F. Herrera, A survey on data preprocessing for data stream mining: Current status and future directions, Neurocomputing 239 (2017) 39–57.
[16] S. García, J. Luengo, J.A. Sáez, V. López, F. Herrera, A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning, IEEE Trans. Knowl. Data Eng. 25 (4) (2013) 734–750.
[17] B.S. Chlebus, S.H. Nguyen, On finding optimal discretizations for two attributes, in: L. Polkowski, A. Skowron (Eds.), Rough Sets and Current Trends in Computing: First International Conference, RSCTC'98, Warsaw, Poland, June 22–26, 1998, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 537–544.
[18] T. Elomaa, J. Rousu, General and efficient multisplitting of numerical attributes, Mach. Learn. 36 (1999) 201–244.
[19] D.A. Zighed, S. Rabaséda, R. Rakotomalala, FUSINTER: A method for discretization of continuous attributes, Internat. J. Uncertain. Fuzziness Knowledge-Based Systems 6 (3) (1998) 307–326.
[20] M.M. Gaber, Advances in data stream mining, Wiley Interdiscip. Rev.: Data Mining Knowl. Discov. 2 (1) (2012) 79–85.
[21] M. Tennant, F.T. Stahl, O. Rana, J.B. Gomes, Scalable real-time classification of data streams with concept drift, Future Gener. Comput. Syst. 75 (2017) 187–199.
[22] S. Sakthithasan, R. Pears, A. Bifet, B. Pfahringer, Use of ensembles of Fourier spectra in capturing recurrent concepts in data streams, in: 2015 International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.
[23] B. Krawczyk, L.L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: A survey, Inform. Fusion 37 (2017) 132–156.
[24] B. Krawczyk, M. Woźniak, One-class classifiers with incremental learning and forgetting for data streams with concept drift, Soft Comput. 19 (12) (2015) 3387–3400.
[25] G. Webb, Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data, in: IEEE International Conference on Data Mining, ICDM, 2014, pp. 1031–1036.
[26] S. Ramírez-Gallego, S. García, J.M. Benítez, F. Herrera, Multivariate discretization based on evolutionary cut points selection for classification, IEEE Trans. Cybern. 46 (3) (2016) 595–608.
[27] J. Gama, C. Pinto, Discretization from data streams: Applications to histograms and data mining, in: Proceedings of the 2006 ACM Symposium on Applied Computing, SAC '06, 2006, pp. 662–667.
[28] P. Lehtinen, M. Saarela, T. Elomaa, Online ChiMerge algorithm, in: D.E. Holmes, L.C. Jain (Eds.), Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 199–216.
[29] T. Elomaa, P. Lehtinen, Maintaining optimal multi-way splits for numerical attributes in data streams, in: Advances in Knowledge Discovery and Data Mining, 12th Pacific-Asia Conference, PAKDD 2008, Osaka, Japan, May 20–23, 2008, Proceedings, 2008, pp. 544–553.
[30] A. Bifet, G. Holmes, R. Kirkby, B. Pfahringer, MOA: Massive online analysis, J. Mach. Learn. Res. 11 (2010) 1601–1604.
[31] G. Hulten, L. Spencer, P. Domingos, Mining time-changing data streams, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, 2001, pp. 97–106.
[32] A. Bifet, R. Kirkby, Data Stream Mining: A Practical Approach, The University of Waikato, 2009.
[33] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, Inform. Sci. 180 (10) (2010) 2044–2064, Special Issue on Intelligent Distributed Information Systems.
[34] A. Benavoli, G. Corani, F. Mangili, M. Zaffalon, F. Ruggeri, A Bayesian Wilcoxon signed-rank test based on the Dirichlet process, in: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 2014, pp. 1026–1034.

Sergio Ramírez-Gallego received the M.Sc. degree in Computer Science in 2012 from the University of Jaén, Spain. He is currently a Ph.D. student at the Department of Computer Science and Artificial Intelligence, University of Granada, Spain. He has published in journals such as IEEE Transactions on Cybernetics, IEEE Transactions on Systems, Man, and Cybernetics: Systems, Expert Systems with Applications, and Neurocomputing. His research interests include data mining, data preprocessing, big data and cloud computing.

Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. He has published more than 45 papers in international journals. As editing activities, he has co-edited two special issues in international journals on different Data Mining topics and is a member of the editorial board of the Information Fusion journal. He is a co-author of the book "Data Preprocessing in Data Mining", published by Springer. His research interests include data mining, data preprocessing, data complexity, imbalanced learning, semi-supervised learning, statistical inference, evolutionary algorithms and biometrics.

Francisco Herrera (SM'15) received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada. He has been the supervisor of 40 Ph.D. students. He has published more than 300 journal papers that have received more than 49000 citations (Google Scholar, H-index 112). He is coauthor of the books "Genetic Fuzzy Systems" (World Scientific, 2001), "Data Preprocessing in Data Mining" (Springer, 2015), "The 2-tuple Linguistic Model. Computing with Words in Decision Making" (Springer, 2015), "Multilabel Classification. Problem Analysis, Metrics and Techniques" (Springer, 2016), and "Multiple Instance Learning. Foundations and Algorithms" (Springer, 2016).
He currently acts as Editor in Chief of the international journals "Information Fusion" (Elsevier) and "Progress in Artificial Intelligence" (Springer), and serves as an editorial board member of a dozen journals.
He has received the following honors and awards: ECCAI Fellow 2009, IFSA Fellow 2013, the 2010 Spanish National Award on Computer Science ARITMEL to the "Spanish Engineer on Computer Science", the International Cajastur "Mamdani" Prize for Soft Computing (Fourth Edition, 2010), the IEEE Transactions on Fuzzy Systems Outstanding 2008 and 2012 Paper Awards (bestowed in 2011 and 2015, respectively), the 2011 Lotfi A. Zadeh Prize Best Paper Award of the International Fuzzy Systems Association, the 2013 AEPIA Award to a scientific career in Artificial Intelligence, and the 2014 XV Andalucía Research Prize Maimónides (by the regional government of Andalucía).
His current research interests include, among others, soft computing (including fuzzy modeling and evolutionary algorithms), information fusion, decision making, biometrics, data preprocessing, data science and big data.