
Future Generation Computer Systems 86 (2018) 59–70


Online entropy-based discretization for data streaming classification


S. Ramírez-Gallego a,*, S. García a, F. Herrera a,b

a Department of Computer Science and Artificial Intelligence, CITIC-UGR, University of Granada, 18071 Granada, Spain
b Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
* Corresponding author. E-mail addresses: sramirez@decsai.ugr.es (S. Ramírez-Gallego), salvagl@decsai.ugr.es (S. García), herrera@decsai.ugr.es (F. Herrera).

highlights

• We propose LOFD, an online, self-adaptive discretizer for streaming classification.
• LOFD smoothly adapts its interval limits, reducing the negative impact of shifts.
• Interval labeling and interaction problems in data streaming are analyzed.
• The discretizer–learner interaction is addressed by providing two alternative solutions in LOFD.
• The model is compared to the state-of-the-art, using several real-world problems.

Article history: Received 8 November 2017; Received in revised form 19 January 2018; Accepted 3 March 2018; Available online 9 March 2018

Keywords: Data stream; Concept drift; Data preprocessing; Data reduction; Discretization; Online learning

Abstract

Data quality is deemed a determinant factor in the knowledge extraction process. Low-quality data normally imply low-quality models and decisions. Discretization, as part of data preprocessing, is considered one of the most relevant techniques for improving data quality.

In static discretization, output intervals are generated at once and maintained along the whole process. However, many contemporary problems demand rapid approaches capable of self-adapting their discretization schemes to an ever-changing nature. Other major issues for stream-based discretization, such as interval definition, labeling, or how the interaction between the learning and discretization components is implemented, are also discussed in this paper.

In order to address all the aforementioned problems, we propose a novel, online and self-adaptive discretization solution for streaming classification which aims at reducing the negative impact of fluctuations in evolving intervals. Experiments with a long list of standard streaming datasets and discretizers have demonstrated that our proposal performs significantly more accurately than the other alternatives. In addition, our scheme is able to leverage class information without incurring an excessive cost, being ranked as one of the most rapid supervised options.

https://doi.org/10.1016/j.future.2018.03.008
© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Learning models and their subsequent results are highly dependent on the quality of the input data. Incorrect decisions can be taken if raw data are not properly cleaned and structured. The data preprocessing task [1,2] is an essential step in data mining, which aims at transforming raw data extracted from databases into polished datasets. This goal is achieved by removing negative factors inherent in the data, such as noise, missing values, and meaningless or redundant data. Data reduction is a family of preprocessing techniques that focuses on obtaining a reduced representation of the data while maintaining its original structure. This can be done, for example, by selecting the most informative features or instances, or by simplifying the feature space.

Data discretization follows a reduction strategy which converts complex continuous attributes into a finite set of discrete intervals. Discretization has recently become very popular in the data science community [3,1,4], mainly due to the need of many learning algorithms for discrete values. For instance, standard implementations of decision rules [5] or Naïve Bayes [6,7] (NB) only admit categorical data in their processes. Even though other methods do not explicitly require discrete values, many of them benefit from simplified spaces [8]. In general, discrete data usually convey faster learning processes and more precise models, thus following the Occam's razor principle.

Standard discretization algorithms require the entire dataset to be in memory as a preliminary requirement. However, an increasing number of current problems in industry (sensors, logs, etc.) output continuous data in the form of batches or individual instances (online) [9]. These unbounded and dynamic data [10] (data streams) demand novel learning schemes that not only adapt well, but also constantly revise their time and memory requirements [11,12]. The ideal scenario is one in which instances are processed once and then discarded. Another requirement to face is the likely non-stationarity of incoming data (concept drift) [13]. Sudden or abrupt changes in the data distribution [14] require outstanding adaptation abilities to follow drifting movements in the decision borders.

Online discretization [15] also suffers from concept drift, as the data distribution is strongly connected with the evolving intervals. Ideally, discrete intervals should adapt as smoothly as possible to drifts in order to avoid significant drops in accuracy. Adjustments should also not imply complex rebuilding processes, but should be solved rapidly. To date, few supervised approaches for online discretization have been presented in the literature. Despite being relevant, these proposals tend to produce abrupt and imprecise splits [15], or they are too costly for streaming systems.

How interval labels are defined and assigned by online discretizers, or what type of discrete information is passed to online learners, are other open problems that have received even less attention in the literature. Any minor alteration in the meaning and/or the definition of the discrete intervals means a certain subsequent drop in learning accuracy. As shown in [15], the standard labeling technique inherited from the static environment is unable to cope with these questions, and it shows a deficient behavior in this new paradigm. Hence, novel and improved schemes that explicitly address the interval labeling and interaction problems are required in the streaming field.

The aim of this work is to tackle the previous problems by developing a new solution that smoothly and efficiently adapts to incoming drifts. Our method, henceforth called Local Online Fusion Discretizer (LOFD), mainly relies on highly-informative class statistics to generate accurate intervals at every step. Furthermore, the local nature of the operations implemented in LOFD offers low response times, thereby making it suitable for high-speed streaming systems. Finally, we detail two alternatives that can be used by online discretizers to effectively improve the interaction between the discretization and learning phases. The first approach naturally provides reliable histogram information to some learners, whereas the second one is a renovated version of the standard scheme which is valid for all learners. The improvements introduced here aim at minimizing the drawbacks associated with the dynamic relabeling and interaction phenomena, described in Section 2.3.

The main contributions of this paper are as follows.

1. Highly-informative and adaptive discretization schemes that reduce the impact of concept shift in online discretization.
2. Efficient evaluation of cut points by reducing the number of intervals considered in each local operation.
3. A formal definition of the interval labeling and definition problems, as well as of the learner–discretizer interaction in online environments, together with an embedded two-sided potential solution for all the problems enumerated above.
4. A comprehensive experimental evaluation of methods supported by nonparametric and Bayesian statistical testing.

Our approach will be evaluated using a thorough experimental framework, which includes a list of 12 streaming datasets, two online learning algorithms, and the state-of-the-art for online discretization presented in [15]. A thorough analysis based on non-parametric and Bayesian statistical tests is performed to assess the quality of the results. Additionally, a study concerning the impact of the novel relabeling approaches and a case study are also included for illustrative purposes.

The remainder of the paper is structured as follows. Section 2 introduces the discretization topic from two different perspectives: its formal definition and its adaptation to the online environment. Section 3 thoroughly describes the proposed solution, highlighting the main contributions introduced to solve the problem. Section 4 presents the results obtained and the subsequent analysis. Finally, some relevant conclusions are provided in Section 5.

2. Background

In this section we detail the discretization problem and some related concepts, such as the use of border points as an optimization. Then the problem of discretizing streaming data is presented, together with a list of related methods from the literature. Lastly, the issue of interval definition in the online environment is thoroughly analyzed.

2.1. Discretization: related concepts and ideas

Discretization is a data reduction technique that aims at projecting a set of continuous values onto a discrete and finite space [3,16]. Let D refer to a labeled dataset formed by a set of instances N, a set of attributes M, and a set of classes C. All training instances are labeled with a label from C. A discretization algorithm generates a set of disjoint intervals for every continuous attribute A ∈ M. The generated discretization scheme I_A consists of a set of cutpoints which define the limits of each interval:

I_A = {∀ g_i ∈ Dom(A) : g_1 < g_2 < ... < g_k},   (1)

where I_A is the discretization scheme for A, and g_1 and g_k are the inferior and superior limits of A, respectively. Notice that the original scheme considers all distinct points in A at the start, where k ≤ |N|.

As a preliminary approach to interval labeling, we can associate each interval with the same index as g_{i−1}, forming the interval set I = {I_A1, I_A2, ..., I_Ak}, with |I| = k − 1. The labeling process (also called indexing) determines how intervals are retrieved in the subsequent learning process. Following the previous description, we can define the membership of continuous points to a given interval I_Aj as follows:

∀ p ∈ Dom(A), ∃ j ∈ {1, 2, ..., k} | p ∈ I_Aj ⟺ g_{j−1} < p ≤ g_j.   (2)

For simplification purposes, the g_{j−1} value for each attribute can be removed, so that intervals are uniquely defined by their g_j. The attribute is upper-bounded by g_k.
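As a simple illustration of Eq. (2), the interval containing a point can be located by a binary search over the upper limits g_j. The following sketch is hypothetical (names and the clamping of values above g_k are choices of this illustration, not part of the original formulation):

from bisect import bisect_left

def interval_index(p, cutpoints):
    """Return the 0-based index j of the interval containing p, where
    `cutpoints` is the sorted list [g_1, ..., g_k] of upper limits
    (Eq. (2)): p falls in interval j iff g_{j-1} < p <= g_j."""
    j = bisect_left(cutpoints, p)        # first g_j such that g_j >= p
    return min(j, len(cutpoints) - 1)    # clamp points above the upper limit g_k

cuts = [1.5, 3.0, 7.2]                   # hypothetical scheme for one attribute
print(interval_index(2.4, cuts))         # -> 1, i.e. the interval (1.5, 3.0]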
The supervised discretization problem (as described above) is an NP-complete [17] optimization problem whose search space consists of all candidate cut points for each A ∈ M, namely, all distinct points in the input dataset, considering each attribute independently. This initial space can be simplified by only considering those points on the borders, which are known to be optimal according to several measures in the literature [18].

Among the wide set of evaluation measures that benefit from the inclusion of boundaries, those based on entropy are distinguished by their outstanding results in discretization. For instance, FUSINTER [19], which integrates quadratic entropy in its evaluations, has proven to be one of the most flexible and competitive discretizers according to [16]. In each iteration, FUSINTER fuses those adjacent intervals whose merging would most improve the aggregated criterion value, defined for each interval as follows:

μ(I_Aβ) = α · Σ_{j=1..|C|} (c_{+j} / |N|) · Σ_{i=1..|C|} [(c_i + λ) / (c_{+j} + |C|λ)] · [1 − (c_i + λ) / (c_{+j} + |C|λ)] + (1 − α) · |C|λ / c_{+j},   (3)

where c_i represents the number of elements of a given class in I_Aβ, c_{+j} the total number of elements contained in I_Aβ, and α and λ are two control factors.
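As an illustration (not the authors' exact implementation), the per-interval quadratic-entropy score of Eq. (3) and the merge-acceptance rule used later in Section 3 (a fusion is kept only if it lowers the summed criterion) could be sketched as follows. The outer sum over intervals is omitted and the default values of α and λ are assumptions of this sketch:

def quadratic_entropy(counts, n_total, alpha=0.975, lam=1.0):
    """Score of a single interval from its per-class counts (sketch of Eq. (3));
    counts = [c_1, ..., c_|C|] inside the interval, n_total = |N|."""
    c_tot = sum(counts)
    if c_tot == 0:
        return 0.0
    m = len(counts)                                   # |C|
    impurity = sum((c + lam) / (c_tot + m * lam)
                   * (1.0 - (c + lam) / (c_tot + m * lam)) for c in counts)
    return alpha * (c_tot / n_total) * impurity + (1.0 - alpha) * m * lam / c_tot

def merge_improves(left_counts, right_counts, n_total):
    """A fusion of two adjacent intervals is kept only if the merged interval
    scores lower than the sum of its parts (the acceptance rule stated in Section 3)."""
    merged = [a + b for a, b in zip(left_counts, right_counts)]
    return quadratic_entropy(merged, n_total) < (
        quadratic_entropy(left_counts, n_total)
        + quadratic_entropy(right_counts, n_total))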
hinders the discretizer’s ability to partition the continuous space,
2.2. Online discretization for data streams: related work

Data streaming describes the scenario in which instances arrive sequentially, in the form of unbounded single instances or batches [20]. Standard algorithms are not originally designed to cope with unbounded data, since they typically assume that the entire training set is available beforehand. Dynamic systems can also be affected by the changes in data distribution introduced by new data [21]. This phenomenon, called concept drift, is well categorized and described in the literature [14].

Several learning strategies have been proposed in the literature to overcome the concept drift problem. Explicitly, algorithms can rely on an external drift detector [22] to detect drifts and rebuild the model whenever one appears. Alternatively, learners can hold a self-adaptive strategy based on sliding windows, ensembles [23], or online learners [24], which build the model incrementally with each new instance. The emergence of drifts in dynamic environments poses a major challenge for online discretizers [15], which must adjust their intervals properly over time. Interval adaptation should be as smooth as possible, while at the same time keeping its time consumption low.

Early proposals on online discretization usually follow an unsupervised approach, which defines a preset number of intervals in advance. Some proposals compute quantile points (equal-frequency) in an approximate or exact manner. The quantiles then serve to delimit further intervals. One of the most relevant proposals in quantile-based discretization is the Incremental Discretization Algorithm (IDA) [25]. IDA employs a reservoir sample to track the data distribution and its quantiles. Equal-width discretization is another alternative, which only demands the number of bins the attribute will be split into.
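For illustration, the equal-frequency idea behind this family of methods can be sketched with a plain reservoir sample, as below. This is a rough, hypothetical sketch; the actual IDA algorithm [25] maintains its quantiles incrementally and differs in detail:

import random

class ReservoirQuantileBins:
    """Toy equal-frequency discretizer: keep a fixed-size reservoir sample of the
    stream and cut it into `n_bins` equally populated intervals on demand."""
    def __init__(self, n_bins=5, capacity=1000, seed=0):
        self.n_bins, self.capacity = n_bins, capacity
        self.sample, self.seen = [], 0
        self.rng = random.Random(seed)

    def update(self, value):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:                                   # classic reservoir replacement
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def cutpoints(self):
        s = sorted(self.sample)
        step = max(1, len(s) // self.n_bins)
        return [s[i] for i in range(step, len(s), step)][: self.n_bins - 1]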
In contrast to unsupervised discretizers, class-guided algorithms do not impose a constant number of intervals; instead, they alternate splits and merges to generate more informative cutpoints [26]. Few proposals in the literature address online discretization from a supervised perspective. PiD [27] was the first proposal to leverage class information in its two-layer model. The first layer produces preliminary cutpoints by summarizing data, whereas the second performs class-guided merges on the previous splits. Recurrent updates in the first layer are performed whenever a primal interval exceeds its size limit. In counterpart, the second layer uses a parametric approach to merge candidates. PiD has been criticized [25] for several reasons: firstly, the correspondence between layers dilutes as time passes (see Section 3); secondly, high skewness in the data might provoke a dramatic increase in the number of intervals; and finally, repetitive values might also undermine the performance due to the generation of different intervals with common points.

In [28], the authors developed an online version of ChiMerge (OC) that offers results identical to those claimed by the original proposal. OC relies on a sliding window technique, as well as several complex structures, to emulate the original ChiMerge. Nevertheless, the complexity of the data structures introduced conveys a barely acceptable cost in time and memory requirements.

2.3. Interval definition, labeling and interaction in the streaming scenario

The evolving nature of discretization in streaming contexts poses a major challenge for the close interaction existing between supervised discretizers and learners. The 1-step definition used in static discretization plainly ignores this problem by assuming no further modifications in the set of cutpoints. However, the 1-step definition is constantly threatened by the never-ceasing arrival of unseen data in the streaming context. This unpleasant situation hinders not only the discretizer's ability to partition the continuous space, but also the subsequent interaction with the learning operator. In this scenario, online learners are forced to incessantly forget and learn old and new patterns from the data. This section aims at addressing all these problems.

In the literature, we can find two alternative strategies for interval labeling originally designed for static discretization: one based on directly assuming cut-points as labels (e.g., values lower than point 2.5), and another based on literal human-readable terms, usually set by experts (e.g., <2.5 is replaced by "low" income). The interval definition is usually composed of the set of cut points for each feature. Cut-point based intervals represent a quite versatile option, as they do not require expert labeling. However, this strategy can be considered less appropriate for dynamic learning because cut-points constantly move across the feature space. The previous scheme might be replaced by explicit labels that would allow points to vary freely. Nevertheless, learners that rely on literal labels are known to suffer from a natural drop in accuracy, because many of them directly depend on these labels to generalize. Additionally, new labels appear, and some old ones disappear, as a consequence of the natural evolution of the discretization.

Although explicit labeling suffers from definition drift, the cut-point based strategy, as defined by [18], can be directly stated as outdated in streaming contexts. This is mainly because intervals maintain class consistency by constantly shifting their limits (and hence their labels). To illustrate this problem, suppose a scenario where all cut points in a scheme I_A at time t are slightly displaced at time t + 1, for example when a new element is inserted into the lowest bin in IDA. In this case the online discretizer is forced to update the label of each equidistant bin, thus yielding a completely new scheme. The previous issue justifies the adoption of explicit labels to track intervals, and relegates cut points to a secondary role (exclusively for definition).

It is important to distinguish between how the interval definition and labeling tasks are accomplished, and how information is transferred between the discretization phase and the subsequent learning one (interval interaction). Most of the time, labels also act as the bridge between phases, beyond being an explicit denomination for intervals. However, there exist some special situations where labeling and interaction differ. For instance, the algorithm in [29] relies on cut-points to define its boundaries but transfers updated class histograms to the learner. From these statistics, it is possible to derive the conditional likelihood that an attribute value belongs to an interval given its membership to a class: P(I_Aj | Class). This scheme, called statistic-based discretization, does not require labels explicitly, but it is exclusive to some learners, such as Bayesian-based algorithms.

Explicit labeling also entails a wide range of major issues in online learning, such as abrupt changes in the original definition (label swap, label creation), or the constant transfer of instances between bins (instance relabeling). All of them deeply affect the underlying interaction between discrete values and the learning phase, as shown in [15]. This is especially remarkable in algorithms that rely solely on labels, such as linear gradient-based or rule-based algorithms. Furthermore, not only the meaning is susceptible to alteration, but also the number of intervals may change. The aim here is to reduce as much as possible the number of intervals and instances affected by relabeling.

In order to illustrate the previous problem, we propose an example where a given interval (numeric label i) is split into two new intervals. This causes the amount of original intervals and labels to augment, and a new scheme to be generated. If the right resulting interval is deemed the new interval, the new definition forces interval i to borrow the label from the old interval i + 1. This process is repeated sequentially from interval i + 1 till the last one, which is labeled as |I| = 4.

Fig. 1 depicts a toy example of how the labeling of intervals evolves over time after a split occurs under the standard scheme. The topmost row represents the shape of the intervals at time t. After a new cut point appears, interval I2 is split into two new intervals. This causes intervals I2 and I3 to re-adapt their labels. In the middle row, the right partition (I3) borrows the label from the next interval, and the rightmost one receives a new label, I4. Our alternative (smooth shift) is represented by the bottommost row, in which only the right split changes. In this case, only one interval (I4) is affected, whereas the other intervals remain invariant. This scheme will be further explained in Section 3.

Fig. 1. Evolution of intervals before and after a split: a comparison of labeling techniques. Numbers in squares correspond to interval labels, black boxes to the intervals affected by the split, and numbers on vertical lines to the threshold values delimiting the intervals.

Although current discretizers have properly solved the problem of dynamically discerning between irrelevant and useful intervals in the online context, the interval labeling and interaction problems have received little attention in the literature. Only PiD [27] explicitly addresses them, by providing a solution that offers free accurate histograms to the attached classifier (NB), similar to that in [29]. Other methods in the literature directly assume the deficient standard scheme based on interval labels inherited from static discretization.

3. LOFD: an online entropy-based discretizer

In this section we present LOFD, an online, local, supervised, bottom-up discretizer [16] which smoothly adapts its online discretization scheme through a set of accurate histograms. LOFD is entirely local and self-adaptive, applying local merges and splits whenever a new boundary point appears. By default LOFD relies on the smooth strategy (Section 2.3) to tackle the labeling problem; however, it can be configured to provide likelihood information if required. As evaluation measure, the quadratic entropy of Eq. (3) is used to evaluate the fitness of potential merges.

Firstly, Section 3.1 presents two alternative perspectives to address the interval labeling and interaction problems, both included in LOFD. Section 3.2 explains the other features included in LOFD, as well as the split and merge operations.

3.1. Interval labeling in LOFD: smooth shifting

As mentioned before, online discretizers tend to adopt the standard discretization scheme by default. In this scheme, any change in the shape or number of intervals is solved by creating a new scheme whose relationship with the previous one will determine how the prediction model performs in further steps. The most common idea is to label intervals by attaching an ordinal list of integers to the interval set defined by the cut points. However, plenty of label movements between intervals arise as the training progresses. Our idea is to break the requirement of using ordinal labels, and to replace them by unrelated labels which minimize the number of intervals affected. In this new scheme, henceforth called smooth shift, splits are solved by attaching a new label to the minority partition (see Fig. 1). In case of merges, the interval adopts the label from the larger partition in terms of number of instances. In both cases, the remaining intervals do not vary, which considerably reduces the impact of evolving discretization.

Beyond providing smoother transitions, "smooth shifting" is completely valid for any classifier that admits categorical variables. Some algorithms (random forest) natively work with categorical variables, whereas others (logistic regression) assume an implicit order in the values that, in the case of categorical terms, is erroneously imposed. This problem can be easily solved by introducing a binary encoding (one-hot) that equally separates labels in the feature space.

For those learners that can leverage statistics, LOFD also offers a scheme similar to that presented in PiD (statistic-based labeling). Through the maintenance of augmented histograms, LOFD provides free likelihood information to NB. In this scenario, intervals are defined and ordered by cut points, and labels are directly ignored. For each incoming test value, the interval bounding the point is retrieved and the required information provided. This direct interaction avoids revisiting and swapping interval labels, which makes both the discretization and learning processes more lightweight.
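As a minimal, hypothetical sketch of the two smooth-shift rules just described (not the authors' MOA implementation): on a split, the larger partition keeps the old label and the smaller one receives a brand-new label; on a merge, the fused interval adopts the label of the larger partition.

class LabeledInterval:
    def __init__(self, label, points):
        self.label, self.points = label, points   # points kept only for counting here

def smooth_split(interval, cut, next_label):
    """Split under smooth shift: the larger half keeps the old label,
    the smaller half receives a brand-new (unordered) label."""
    left = [p for p in interval.points if p <= cut]
    right = [p for p in interval.points if p > cut]
    if len(left) >= len(right):
        return (LabeledInterval(interval.label, left),
                LabeledInterval(next_label, right))
    return (LabeledInterval(next_label, left),
            LabeledInterval(interval.label, right))

def smooth_merge(a, b):
    """Merge under smooth shift: the fused interval adopts the label of the
    larger partition, so the rest of the scheme is left untouched."""
    keep = a.label if len(a.points) >= len(b.points) else b.label
    return LabeledInterval(keep, a.points + b.points)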
3.2. The LOFD algorithm

In this section we present the strategy implemented by LOFD to adapt its scheme over time, as well as the other relevant features and improvements introduced, outlined below:

• Highly-informative splits: in online environments, it is complex and costly to track the real distribution of points, given that algorithms are constrained to a certain memory bound. However, most discretization decisions heavily depend on statistical measures that require accurate information, for example those based on entropy or mutual information. Thus, we can assert that the more accurately intervals track the distributions, the wiser the decisions applied will be. In LOFD, and for each interval, we build an independent memory-constrained histogram that accurately tracks distributions. This model differs from PiD in that the latter suffers from a weak correspondence between layers, which makes its histograms imprecise in most cases. LOFD histograms are only limited by the memory requirements. By imposing size limits, we can adjust the trade-off between performance and accuracy according to our requirements.
• Bi-directional discretization: LOFD considers both splits and merges, since both insertions and removals of points are relevant in the streaming scenario. Natural fluctuations in the scheme are addressed by considering both actions, thus increasing the competitiveness of the output solutions. Whereas merges are naturally applied, splits are much more complex, since they demand accurately conserved distributions.
• Extended merges: local changes in intervals may cause an adjacent merge that was previously rejected to become beneficial. In order to improve the performance of merges, we propose to evaluate all potential combinations among the novel interval and its adjacent intervals (see Algorithm 1). For splits far from the extremes, four intervals must be considered: the two splits and their neighbors.
3.2.1. Main process: instance-level

Algorithm 1 LOFD algorithm
1: INPUT: D, initTh, maxHist
2: // D is the input dataset
3: // initTh: number of instances before initializing intervals
4: // maxHist: maximum number of elements in interval histograms
5: I = on the first batch (i = 1 ... initTh), apply the static discretization process explained in [19]
6: for i = initTh + 1 → N do
7:   for A ∈ M do
8:     ceil = retrieve the ceiling interval that contains D_iA
9:     if ceil ≠ null then
10:      isBound = check whether D_iA is a boundary point
11:      Insert D_iA into ceil and update its criterion
12:      if isBound == true then
13:        (ceil, new) = split ceil into two intervals with D_iA as cutpoint
14:        Evaluate local merges between ceil, new, and the surrounding intervals until no improvement is achieved
15:        Insert the resulting set into I_A
16:      end if
17:    else
18:      last = create a new interval on the right side with D_iA as upper limit
19:      Insert last into I_A
20:      Evaluate a merge with the old maximum interval
21:    end if
22:  end for
23:  Add D_i to the timestamped queue
24:  for int ∈ I_A do
25:    if |histogram(int)| > maxHist then
26:      Remove old points from the timestamped queue, and subsequently from the local histograms, until |histogram(int)| ≤ maxHist; remove empty intervals
27:    end if
28:  end for
29: end for

The main procedure of LOFD is explained here. Firstly, the discrete intervals are initialized following the static process defined in FUSINTER [19] (line 5). Discretization is performed on the first batch of elements, formed by initTh instances. From this point, LOFD updates the scheme of intervals in each iteration, and for each single attribute A.

For each new single point val, LOFD retrieves its bounding interval (ceiling) from I_A (line 8), which is internally implemented as a red–black tree (https://en.wikipedia.org/wiki/Red-black_tree). If the point is over the upper feature limit (lines 17–21), LOFD generates a new interval at this point, val being the new maximum for the attribute. A merge between the old last interval and the new one is also evaluated by computing the quadratic entropy value of the potential merge. If the merged entropy is lower than the sum of the parts, the change is accepted.

If the ceiling exists, val is inserted in the histogram of the ceiling (also a red–black tree). Using this histogram, we check whether val is a boundary point or not (Section 2.1). If affirmative (lines 12–16), the ceiling is split into two parts with val as midpoint. Afterwards, different merge combinations are evaluated between the resulting intervals and their neighbors (as explained in Section 3.2.2). The resulting set is inserted in I_A.

Finally, each point is added to the timestamped queue to enable later removals in case of histogram overflow (histogram size ≥ maxHist). This condition is checked on the entire set of intervals in lines 24–28. If needed, LOFD retrieves points from the queue in ascending order (by age), and removes these points from the histograms until enough space for further points is available in interval int. This mechanism is mainly used to avoid heavy searches in overpopulated histograms. Further memory control can be programmed by adding a parameter that limits the growth of the timestamped queue, which would help to avoid memory overflow in scenarios where splits occur continuously.

Complexity is mainly determined by the boundary evaluation, O(|M| · log(maxHist)), and by the split/merge process, O(|M| · maxHist), which requires fetching the whole inner histogram (the logarithmic contribution of interval searches is omitted from this formula, since |I_A| tends to be negligible; see Section 4). In either case, the trade-off between performance and effectiveness can ultimately be controlled through maxHist. Hence, shorter histograms imply less accurate decisions, but more agile evaluations.
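For illustration, the per-instance flow of Algorithm 1 can be compressed into the following deliberately simplified Python skeleton for a single attribute. A sorted list of upper limits stands in for the red–black trees, every new inner value is treated as if it were a boundary point, and the entropy-guided merge steps (lines 14 and 20) are omitted; see the quadratic-entropy sketch after Eq. (3) for how such merges would be scored. This is a hypothetical sketch, not the authors' MOA code.

from bisect import bisect_left
from collections import deque

class SimpleLOFD:
    """Simplified rendering of Algorithm 1 for one attribute."""

    def __init__(self, max_hist=1000):
        self.cuts = []        # sorted upper limits, one per interval
        self.points = []      # per-interval list of (value, class) pairs (its "histogram")
        self.queue = deque()  # global timestamped queue, oldest point first
        self.max_hist = max_hist

    def learn_one(self, v, y):
        if not self.cuts or v > self.cuts[-1]:       # lines 17-21: new rightmost interval
            self.cuts.append(v)
            self.points.append([(v, y)])
        else:                                        # lines 8-16: insert into the ceiling
            i = bisect_left(self.cuts, v)
            self.points[i].append((v, y))
            if v < self.cuts[i]:                     # crude stand-in for the boundary test
                left = [(x, c) for x, c in self.points[i] if x <= v]
                right = [(x, c) for x, c in self.points[i] if x > v]
                self.cuts[i:i + 1] = [v, self.cuts[i]]
                self.points[i:i + 1] = [left, right]
        self.queue.append((v, y))
        self._evict()                                # lines 23-28: memory control

    def _evict(self):
        # Remove the oldest points until no interval histogram exceeds max_hist.
        while any(len(p) > self.max_hist for p in self.points):
            old = self.queue.popleft()
            for i, p in enumerate(self.points):
                if old in p:
                    p.remove(old)
                    if not p:                        # drop intervals left empty
                        del self.points[i]
                        del self.cuts[i]
                    break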

3.2.2. Merge and splits: interval-level

Fig. 2 depicts a simplified scheme of the split process, which occurs whenever a new boundary point is processed. The new boundary point introduced (2.2) causes interval I2 to be separated into two intervals. Interval I2 now contains those points from the histogram lower than or equal to 2.2, and preserves the same label because it contains more elements than the new interval I4. Larger intervals keep their original label in order to reduce as much as possible the effect of relabeling on inner points. The new interval receives the label I4 (the next unseen integer) and those points higher than the cutpoint (2.2).

Whenever a new split/interval is generated, the merge process is launched on the divided intervals and their neighbor intervals. The merge process can be deemed the opposite of the split, since it basically consists of the fusion of class counters and inner histograms. From the set of potential combinations, the merge action which most strongly reduces entropy (according to Eq. (3)) will be applied. This process is repeated recursively until no more merges are available or convenient. Note that a merge is performed iff the quadratic entropy of the resulting interval is lower than the sum of both parts. Notice also that the merge is responsible for the mixed histograms formed by boundary points from different classes in intervals I1 and I3.

This procedure avoids recomputing values for intact intervals (the vast majority), so that only one interval requires re-evaluating its entropy and potential merges in each iteration.

Fig. 2. Flowchart describing a split in LOFD (smooth labeling) with three original intervals (first row) and their histograms. A new point (2.2) is introduced, generating a split and a new interval I4. In LOFD, splits are performed whenever a new boundary point is processed. Numbers in squares correspond to interval labels, numbers in brackets to class distributions, and vertical lines to the cut points considered.

4. Experiments and case study

This section provides a comparative analysis between our proposal and the other state-of-the-art discretizers. As LOFD offers two alternatives for interval labeling, two independent sections are issued here. In Section 4.2, LOFD adopts smooth shifting, whereas the rest of the discretizers assume standard labeling to interact with NB. In Section 4.3, LOFD and PiD directly interact with NB through histograms. Finally, a case study is included to illustrate the effect of concept drift on evolving discretization intervals.

4.1. Experimental framework

Here we outline all the details related to our experiments, such as the datasets processed, the parameters involved and their values, the validation scheme, etc. Evaluation has been performed in terms of prediction ability (accuracy), evaluation time (spent on discretizing and prediction), and reduction ability on continuous features (number of discrete intervals).

Table 1 shows some basic information about the data. Half of the datasets were artificially created with the Massive Online Analysis (MOA) tool [30], ranging from sudden drift to different types of gradual drift. For more details about the parameter settings used to generate these datasets, please refer to our code repository (https://github.com/sramirez/MOAReduction). The remaining datasets were collected from the official MOA webpage, except for kddcup_10, which was processed and generated by Dr. Gama (http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift).

Table 1
Basic information concerning the datasets. For each row, the total number of instances (#Inst.), the total number of attributes (#Atts.) split into numerical (#Num.) and nominal (#Nom.), and the number of output labels (#Cl.) are depicted.

Data set                  #Inst.    #Atts.   #Num.   #Nom.   #Cl.
airlines                  539,383   6        3       3       2
elecNormNew               45,311    8        7       1       2
kddcup_10                 494,020   41       39      2       2
poker-lsn                 829,201   10       5       5       10
covtypeNorm               581,011   54       10      44      7
blips                     500,000   20       20      0       4
sudden_drift              500,000   3        3       0       2
gradual_drift             500,000   3        3       0       2
gradual_recurring_drift   500,000   20       20      0       4
incremental_fast          500,000   10       10      0       4
incremental_slow          500,000   10       10      0       4

In order to evaluate the performance of LOFD, several state-of-the-art discretizers have been included in this framework for comparison purposes. They range from supervised (OC and PiD) to unsupervised schemes (IDA, window-based version). All the described methods have been thoroughly analyzed and categorized in [15]. Gaussian Naïve Bayes (GB) has been elected as the reference classifier for our tests, because of the reasons exposed in Section 2. For the remaining methods, the Gaussian estimation is replaced by the discrete intervals generated. Alternatively, the Hoeffding Tree (HT) [31] is incorporated to test the discretization effect of our solution on other learning models. Table 2 shows the parameters involved in the experiments, as well as the values set according to the authors' criteria.

Table 2
Description of parameters. Default values are shown in the first row; unless specified otherwise, these values are common to all methods.

Method         Parameters
All disc.      initial elements = 100, window size = 1 (default)
Gaussian NB    –
Gaussian HT    10 splits
IDA [25]       window size = 1000, # bins = 5
OC [28]        –
PiD [27]       α = 0.75, # bins = 500, update layer2 = 10^3, min/max = 0/1
LOFD           max. size by histogram = 10,000, initial elements = 5000

The evaluation is performed following a standard evaluation technique for online learning, called interleaved test-then-train [32]. In this scheme, incoming examples are first evaluated by the current model. Afterwards, the examples are incorporated into the model in the training phase. Note that this technique is more appropriate for streaming environments than hold-out evaluation.
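As an illustration of the interleaved test-then-train procedure, the following sketch assumes a hypothetical incremental model exposing predict_one/learn_one methods; it is not tied to MOA's API.

def interleaved_test_then_train(model, stream):
    """Prequential evaluation sketch: every example is first used to test the
    current model and only afterwards to train it. `stream` yields (x, y) pairs."""
    correct = total = 0
    for x, y in stream:
        if total > 0:                     # nothing sensible to predict with an empty model
            correct += int(model.predict_one(x) == y)
        model.learn_one(x, y)
        total += 1
    return correct / max(1, total - 1)    # accuracy over the tested examples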
All experiments have been executed on a single commodity machine with the following characteristics: 2 Intel Core i7 930 processors (4 cores/8 threads, 2.8 GHz, 8 MB cache), 24 GB DDR2 RAM, 1 TB HDD (3 Gb/s), Ethernet network, CentOS 6.4 (Linux). All algorithms, including our new approach, have been packaged in an extension library for MOA 16.04v (http://moa.cms.waikato.ac.nz/moa-extensions/). All the experiments have been launched in MOA.

4.2. Analytical comparative: smooth shift vs. static labeling

This section focuses on studying the effect of the LOFD discretizer with smooth labeling in online learning, as well as on comparing our solution with the other alternatives, which utilize standard labeling.

4.2.1. Accuracy results (predictive ability)

Table 3 contains average accuracy results as a summary of the entire learning process performed by Naïve Bayes. From these outcomes we can outline the following conclusions:

• There exists a clear advantage in using LOFD over the other alternatives. LOFD is on average 5% more precise than its closest competitor (IDA), which means about 2.5 × 10^4 more instances correctly classified (considering 5 × 10^5 instances per dataset on average).
• Up to now, supervised alternatives have generated worse solutions than those presented by unsupervised approaches, despite the former leveraging class information. On the contrary, LOFD utilizes this knowledge to overcome the previous problem and thus generates better schemes.
• LOFD overcomes its competitors in 9/11 datasets, with similar results in the other datasets. This fact proves the superiority of LOFD and its great versatility for both real and artificial datasets, as well as for different types of trends and drifts.

To reaffirm the positive results obtained by LOFD, a thorough statistical analysis is performed on the accuracy results through two non-parametric tests [33]. Table 4 reports the results for the Wilcoxon Signed-Ranks test (1 vs 1) and the Friedman–Holm test (1 vs N) with a significance level α = 0.05. Both tests assert that LOFD clearly outperforms the other options, and no method is close to it in performance (no ties in its row). Additionally, the p-values derived from Holm's test are highly significant (<0.01).
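Both non-parametric tests can be reproduced, for instance, with SciPy. The sketch below uses a subset of the Table 3 accuracies and is meant only to illustrate the calls; Holm's post hoc correction over the Friedman ranking is not shown.

import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Accuracies of (PiD, IDA, OC, GB, LOFD) on five of the Table 3 datasets.
acc = np.array([
    [63.0057, 64.1563, 65.0723, 64.5504, 65.0868],   # airlines
    [71.9522, 76.6905, 74.0731, 73.3625, 77.1517],   # elecNormNew
    [99.1474, 98.4644, 98.1404, 97.1908, 99.2901],   # kddcup_10
    [55.0335, 59.4337, 58.5465, 59.5528, 69.3981],   # poker-lsn
    [66.6306, 62.7235, 64.2254, 60.5208, 69.2387],   # covtypeNorm
])

# 1 vs 1: Wilcoxon signed-rank test between LOFD (last column) and IDA.
w_stat, w_pvalue = wilcoxon(acc[:, 4], acc[:, 1])

# 1 vs N: Friedman test over all five methods.
f_stat, f_pvalue = friedmanchisquare(*(acc[:, j] for j in range(acc.shape[1])))
print(w_pvalue, f_pvalue)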

Table 3
Classification test accuracy on discretized data (Naïve Bayes).
PiD IDA OC GB LOFD
airlines 63.0057 64.1563 65.0723 64.5504 65.0868
elecNormNew 71.9522 76.6905 74.0731 73.3625 77.1517
kddcup_10 99.1474 98.4644 98.1404 97.1908 99.2901
poker-lsn 55.0335 59.4337 58.5465 59.5528 69.3981
covtypeNorm 66.6306 62.7235 64.2254 60.5208 69.2387
blips 74.5680 66.4494 64.2148 60.9060 76.3668
sudden_drift 65.7736 81.3168 77.8808 83.8144 83.5752
gradual_drift_med 60.8404 82.8908 80.1032 84.7000 84.2794
gradual_recurring_drift 65.1678 58.5250 58.5612 56.7450 67.9446
incremental_fast 73.9900 75.6472 75.6036 76.3642 80.7308
incremental_slow 65.6074 76.9186 75.4316 78.0688 81.6210
MEAN 69.2470 73.0197 71.9866 72.3432 77.6985

Table 4
Wilcoxon pairwise test on accuracy, and ranking of methods generated by Friedman's procedure + adjusted p-values by Holm's test. The first two columns represent the Wilcoxon comparisons, where '+' indicates the number of wins achieved by each algorithm, and '±' the number of wins/ties. The best achievement is highlighted in bold. The third column shows the ranking of methods according to Friedman's test, starting from the control method (topmost). The last column details the adjusted p-values generated by the post hoc Holm's test.

We also include a Bayesian sign and signed-rank study [34] based on pairwise comparisons between methods (on the accuracy results). In these tests, a Dirichlet process is assumed over the likelihood distributions, such that the marginals on finite partitions are Dirichlet distributed. In Bayesian tests, two classifiers are assumed to be practically equivalent if their mean difference of accuracies lies within the interval [−0.01, 0.01]. This interval defines what is called the region of practical equivalence (rope). Column rope in Table 5 gives the probability that two methods are equal, and corresponds to the area of the posterior within the rope. Column left gives the probability that method A is practically superior to B, whereas column right gives the probability that method B is practically superior to A. Both values correspond to the area to the left and to the right of the rope, respectively. The signed-rank test differs from the sign test in that the closed form used to compute the distribution is replaced by a Monte Carlo sampling of weights, and the latter does not require a symmetric distribution.
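The following is a rough Monte Carlo sketch of a Bayesian signed-rank test with a rope, in the spirit of [34]. The prior placement, weights, rope units (accuracy differences expressed as fractions) and sample size are assumptions of this illustration, not the exact procedure used in the paper.

import numpy as np

def bayesian_signed_rank(diffs, rope=0.01, prior=0.5, n_samples=20000, seed=0):
    """Posterior probabilities P(left), P(rope), P(right) for paired accuracy
    differences `diffs` (method A - method B, one value per dataset)."""
    rng = np.random.default_rng(seed)
    z = np.concatenate(([0.0], np.asarray(diffs, dtype=float)))  # pseudo-observation at 0
    alpha = np.concatenate(([prior], np.ones(len(diffs))))
    m = (z[:, None] + z[None, :]) / 2.0                          # pairwise pseudo-medians
    wins = np.zeros(3)                                           # counts for left/rope/right
    for _ in range(n_samples):
        w = rng.dirichlet(alpha)
        ww = np.outer(w, w)
        p_left, p_right = ww[m < -rope].sum(), ww[m > rope].sum()
        wins[np.argmax((p_left, 1.0 - p_left - p_right, p_right))] += 1
    return wins / n_samples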
Table 5
Bayesian sign and signed-rank tests between LOFD and its competitors. In each cell, the first number represents the sign-test probability, and the second one the signed-rank probability.

Algorithms (LOFD vs. ?)    left (≪)        rope (=)     right (≫)
PiD                        1.0–1.0         0.0–0.0      0.0–0.0
IDA                        1.0–1.0         0.0–0.0      0.0–0.0
OC                         1.0–1.0         0.0–0.0      0.0–0.0
GB                         0.8181–0.9973   0.0–0.0      0.1818–0.0002

Table 5 shows the posterior probability that LOFD is practically superior to the other alternatives. It is close to one for NB, and exactly one for all the discretizers.

Another classifier (HT) is included in the experiments in order to show that LOFD is a versatile solution, not exclusive to Naïve Bayes. With the inclusion of HT, we have also covered all the online classifiers amenable to discretization that are provided by the MOA platform. Table 6 shows the mean accuracy results following the previous learning procedure but using HT instead of NB. In this scenario, the baseline model (HT with Gaussian approximation, 10 splits) stands out as the best method on average. Nevertheless, LOFD represents an interesting option for real data, outperforming its competitors in 3/5 cases. In general, LOFD is advantageous in the whole spectrum of problems (5/11). Finally, our solution stands together with IDA as the best discretization alternative on average.

In summary, LOFD has shown to be compatible with other online algorithms beyond NB. Although HT can be considered less susceptible to discretization than NB, our solution also stands as a positive alternative in some problems.

4.2.2. Time results (runtime performance)

It is well known that supervised approaches tend to be more time-consuming than unsupervised ones; in return, they are able to leverage class information. In streaming environments, it is crucial to control the rapidness of algorithms. Table 7 compares the methods in terms of evaluation time (discretization + learning), as well as the overhead introduced by each component of LOFD. Unsupervised methods run faster on average than supervised ones, as previously discussed. The IDA discretizer is 2 times faster than the closest supervised option (PiD), whereas LOFD (522.42 s) is ranked as one of the fastest supervised options, behind PiD (508.58 s). If the throughput rate (time per instance) is computed, we can observe that LOFD is able to process each example in approximately 1 ms (522.42 s over the roughly 5 × 10^5 instances of the benchmark datasets), which seems suitable for most streaming systems. Regarding the runtime decomposition of LOFD, the interaction module (LOFD-c2) introduces a considerable overhead in the total process (more than 40%). This fact can be easily explained by the high cost associated with the interval creation and statistics update operations repeated after each arrival.

4.2.3. Number of intervals (model complexity)

Although sometimes irrelevant, the third component to consider here is the number of intervals generated by the discretizers. A reduced set of intervals usually implies simpler learning processes and, subsequently, simpler models [16]. However, a reduced number of intervals is typically associated with poor learning capabilities, inasmuch as class separability is not fully accomplished. Table 8 depicts information about the simplicity of the discretization schemes in terms of the number of intervals generated. OC is elected as the best alternative, since its schemes typically consist of few intervals. Nevertheless, in this scenario, simple solutions do not correspond with accurate solutions (see the OC results in Table 4). Although LOFD solutions can be deemed much more complex (more intervals), they lead the accuracy ranking in return.

Table 6
Classification test accuracy on discretized data. Hoeffding tree used as learner.
PiD IDA OC HT LOFD
airlines 64.3951 64.5158 65.3619 65.0784 65.0008
elecNormNew 79.8442 79.8354 70.2132 79.1954 80.7645
kddcup_10 99.8389 99.7929 99.8368 99.7413 99.5120
poker-lsn 57.9820 69.8381 55.4892 76.0685 76.1936
covtypeNorm 77.6671 75.8652 70.1681 80.3119 81.8190
blips 73.6652 86.0112 35.7974 90.9808 79.3036
sudden_drift 69.5128 82.9856 61.3936 84.8418 86.7238
gradual_drift_med 64.6858 84.1394 51.1838 85.5088 86.5246
gradual_recurring_drift 68.2206 83.7164 35.6192 88.3368 77.8664
incremental_fast 71.1508 78.6526 50.6528 82.7748 77.0852
incremental_slow 66.3744 76.7644 50.5308 83.1052 70.9906
MEAN 72.1215 80.1924 58.7497 83.2676 80.1622

Table 7
Runtime performance in seconds. The first five columns give the total evaluation time (discretization + prediction), while the last two give the fraction of LOFD's runtime spent in each component: interval creation (c1) and discretizer–learner interaction (c2).
PiD IDA OC GB LOFD LOFD-c1 LOFD-c2
airlines 114.62 160.16 595.29 14.04 261.48 0.72 0.28
elecNormNew 28.25 9.17 12.57 0.67 16.49 0.77 0.23
kddcup_10 526.50 158.69 3,850.95 18.59 341.49 0.33 0.67
poker-lsn 129.11 104.30 1,769.26 11.72 390.99 0.09 0.91
covtypeNorm 408.86 275.40 1,694.97 28.28 690.43 0.27 0.73
blips 1,610.87 487.77 1,013.60 12.02 780.17 0.40 0.60
sudden_drift 91.56 74.39 183.15 2.49 490.70 0.94 0.06
gradual_drift_med 210.77 94.08 172.43 3.63 672.94 0.94 0.06
gradual_recurring_drift 1,152.51 429.20 1,038.21 12.52 741.98 0.87 0.13
incremental_fast 986.21 274.86 518.33 5.68 853.27 0.87 0.13
incremental_slow 335.08 246.33 615.29 6.06 506.64 0.72 0.28
MEAN 508.58 210.40 1,042.19 10.52 522.42 0.59 0.41

Table 8
Number of intervals generated by each discretizer. The best (lowest) value per row is highlighted in bold.

                          PiD      IDA     OC      LOFD
airlines                  17       48      29      39
elecNormNew               81       54      33      50
kddcup_10                 300      138     158     153
poker-lsn                 51       55      43      42
covtypeNorm               344      330     96      82
blips                     1,924    126     120     552
sudden_drift              22       24      18      28
gradual_drift_med         17       24      18      30
gradual_recurring_drift   1,829    126     120     504
incremental_fast          1,085    66      60      55
incremental_slow          313      66      60      75
MEAN                      543.91   96.09   68.64   146.36

4.2.4. Case study: drift effect on discretization

Figs. 3 (blips) and 4 (poker-lsn) aim at depicting the effect of drifts on the discretizers' performance. In Fig. 3, some abrupt peaks can be observed, as well as LOFD's ability to recover from them properly. This ability allows LOFD to outperform the other methods from the early stages. Also notice that LOFD is the only algorithm capable of sustaining its accuracy rate after the drift in the midway of the series.

No remarkable drift can be distinguished in Fig. 4; however, we can notice that LOFD's accuracy rate is much more competitive and less fluctuating than that presented by the other methods. Regarding the time plots, LOFD shows an efficiency order close to linear for both problems.

4.3. Analytical comparative: statistic-based labeling

Beyond standard labeling, augmented histograms provide enough information for correct NB-based learning. Labels are not required anymore in this context. This section presents an analysis of how valuable the adoption of this scheme could be in online discretization, and whether it is more advantageous for PiD or for LOFD. It is noteworthy that the histogram scheme is the default strategy presented in [27], and the only one tested in that work.

Table 9 provides evidence of the negative effect of the histogram version on PiD (from 69.25% to 60.52% in accuracy), and of its bad results compared to LOFD. Especially remarkable is the case of poker-lsn, where almost all instances are incorrectly predicted. If we focus on this dataset, we can notice several deficiencies in PiD. First of all, if new values fall outside the range defined by the parameters min/max, several iterations are required to create the needed intervals. Likewise, as no interval is defined for the new overflowed point, PiD will provide neither histograms nor likelihoods to NB. Therefore, subsequent predictions will be almost misguided. In poker-lsn the effect is much more striking simply because NB has more options (classes) to choose from. All these evidences, plus those presented in Section 2.2, show that the combination "histogram + PiD" does not work properly.

LOFD outperforms PiD in 8/11 datasets, and its mean accuracy (74.63%) across all datasets is even superior to that of each method in Table 3. This shows that standard/smooth labeling seems to perform better than the histogram alternative in terms of accuracy rate.

Regarding the time outcomes, discretization is now more straightforward, as reflected in LOFD's results: almost half of the time shown in Table 7. The time improvement in PiD is less noticeable, but still competitive and relevant. The component analysis in Table 9 also witnesses the lightness of the discretizer–learner interaction in the histogram version, which is 25% more rapid than in smooth shift. The runtime savings here are explained by the removal of the update phase on the learning side. Fresh statistics are now directly passed by LOFD.

In summary, standard/smooth labeling contributes to obtaining much more precise models, and it is a more versatile strategy that can work with any online classifier, as mentioned in Section 3.1. Nevertheless, the free information provided by histogram-based discretizers allows classifiers to perform faster predictions.

Fig. 3. Detailed plots of prequential accuracy (panel a, % of correct predictions) and CPU time (panel b, seconds) over the data stream progress (# instances processed) on blips.

Fig. 4. Detailed plots of prequential accuracy (panel a, % of correct predictions) and CPU time (panel b, seconds) over the data stream progress (# instances processed) on poker-lsn.

Table 9
Classification test accuracy (%) + total time on discrete data (histogram scheme). Last two columns represent time rate
spent by each component in LOFD: interval creation (c1), and interaction discretizer-learner (c2).
Dataset                   Acc. PiD   Acc. LOFD   Time PiD (s)   Time LOFD (s)   LOFD-c1   LOFD-c2
airlines 53.4763 64.6136 108.08 221.01 0.84 0.16
elecNormNew 74.1989 75.1324 21.18 13.81 0.94 0.06
kddcup_10 97.9902 99.2079 514.84 140.38 0.71 0.29
poker-lsn 0.1117 61.0778 108.71 55.26 0.54 0.46
covtypeNorm 63.1194 62.9710 407.16 217.69 0.71 0.20
blips 70.9794 72.7270 1,243.54 315.70 0.81 0.19
sudden_drift 38.6880 83.3610 89.53 479.93 0.99 0.01
gradual_drift_med 51.1544 84.4526 180.98 480.45 0.99 0.01
gradual_recurring_drift 62.9858 64.0668 1,201.22 351.77 0.82 0.18
incremental_fast 75.6676 75.9816 868.54 500.65 0.98 0.02
incremental_slow 77.3006 77.2844 278.62 267.95 0.96 0.04
MEAN 60.5157 74.6251 456.58 276.78 0.84 0.16

4.4. Case study: sudden drift scenario

This section illustrates the different discretization solutions offered by the discretizers studied, as well as how they adapt their solutions after the appearance of concept drifts. Fig. 5 depicts the solutions generated for the sudden drift dataset, before and after (+1 × 10^4 instances later) a drift appears in attribute #2. The drift concretely appears after iteration 3.75 × 10^5.

Along with the cut points limiting the intervals (vertical lines), a simplified class histogram of the last 1 × 10^3 points is included in the figure. The left subplots show the density of points before the drift, where most points are blue and skewed to the right. After the drift (right subplots), more red points appear on the left side, thus practically removing the skewness. The expected output here is that more intervals appear in the leftmost part of the histogram, in order to follow the trend and thus better separate the classes.

As observed in Fig. 5, only LOFD generates more cut points to the left of the midpoint after the drift, whereas the rest look practically identical to their previous value. In fact, we can observe that the other supervised schemes look quite misguided, given that their intervals are quite concentrated and thus show a high level of overlapping (especially those of PiD). Concerning IDA, its discretization solutions fit an equal-frequency approach perfectly, as expected. LOFD intervals also look well distributed, but at the same time they respect and follow the class borders.

(a) LOFD: density function (iteration: 3.75 × 105 ). (b) LOFD: density function (iteration: 3.85 × 105 ).

(c) IDA: density function (iteration: 3.75 × 105 ). (d) IDA: density function (iteration: 3.85 × 105 ).

(e) PiD: density function (iteration: 3.75 × 105 ). (f) PiD: density function (iteration: 3.85 × 105 ).

(g) OC: density function (iteration: 3.75 × 105 ). (h) OC: density function (iteration: 3.85 × 105 ).

Fig. 5. Density plots before and after a concept drift in attribute #2 (sudden drift dataset). Each row represents a different discretizer, each column the distribution of data before
and after the drift, and each vertical line the intervals generated. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of
this article.)

To solve the adaptation problem we have implemented all level of responsiveness thanks to the fully local strategy imple-
the labeling schemes in a novel online discretization algorithm, mented, mainly based on fast interval fusions and splits.
called LOFD. This discretizer produces self-adaptive and highly- The complex experimental framework performed, with 12
informative discretization schemes, in which precise intervals are datasets and 3 algorithms, has proven that LOFD is by far the
supported by updated class statistics. LOFD also presents a high most competitive solution in terms of predictive accuracy. It has
S. Ramírez-Gallego et al. / Future Generation Computer Systems 86 (2018) 59–70 69

also been confirmed by the statistical analysis carried out with a [23] B. Krawczyk, L.L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learn-
significance level α ≤ 0.01. LOFD is also ranked as one of the most ing for data stream analysis: A survey, Inform. Fusion 37 (2017) 132–156.
[24] B. Krawczyk, M. Woźniak, One-class classifiers with incremental learning and
rapid supervised discretizers. Compared with the other alterna-
forgetting for data streams with concept drift, Soft Comput. 19 (12) (2015)
tives, which either barely cover the search space or generate too 3387–3400.
many meaningless intervals, LOFD is able to achieve an excellent [25] G. Webb, Contrary to popular belief incremental discretization can be sound,
trade-off between simple and precise solutions. computationally efficient and extremely useful for streaming data, in: IEEE
Acknowledgments
This work is supported by the Spanish National Research Project TIN2014-57251-P, the Foundation BBVA project 75/2016 BigDaPTOOLS, and the Andalusian Research Plan P11-TIC-7765. S. Ramírez-Gallego holds an FPU scholarship from the Spanish Ministry of Education and Science (FPU13/00047).
Sergio Ramírez-Gallego received the M.Sc. degree in Computer Science in 2012 from the University of Jaén, Spain. He is currently a Ph.D. student at the Department of Computer Science and Artificial Intelligence, University of Granada, Spain. He has published in journals such as IEEE Transactions on Cybernetics, IEEE Transactions on Systems, Man, and Cybernetics: Systems, Expert Systems with Applications or Neurocomputing. His research interests include data mining, data preprocessing, big data and cloud computing.

Salvador García received the M.Sc. and Ph.D. degrees in Computer Science from the University of Granada, Granada, Spain, in 2004 and 2008, respectively. He is currently an Associate Professor in the Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain. He has published more than 45 papers in international journals. As editorial activities, he has co-edited two special issues in international journals on different Data Mining topics and is a member of the editorial board of the Information Fusion journal. He is a co-author of the book entitled "Data Preprocessing in Data Mining" published in Springer. His research interests include data mining, data preprocessing, data complexity, imbalanced learning, semi-supervised learning, statistical inference, evolutionary algorithms and biometrics.

Francisco Herrera (SM'15) received his M.Sc. in Mathematics in 1988 and Ph.D. in Mathematics in 1991, both from the University of Granada, Spain. He is currently a Professor in the Department of Computer Science and Artificial Intelligence at the University of Granada.
He has been the supervisor of 40 Ph.D. students. He has published more than 300 journal papers that have received more than 49000 citations (Scholar Google, H-index 112).
He is coauthor of the books "Genetic Fuzzy Systems" (World Scientific, 2001), "Data Preprocessing in Data Mining" (Springer, 2015), "The 2-tuple Linguistic Model. Computing with Words in Decision Making" (Springer, 2015), "Multilabel Classification. Problem Analysis, Metrics and Techniques" (Springer, 2016) and "Multiple Instance Learning. Foundations and Algorithms" (Springer, 2016).
He currently acts as Editor in Chief of the international journals "Information Fusion" (Elsevier) and "Progress in Artificial Intelligence" (Springer), and serves as an editorial board member of a dozen journals.
He has received the following honors and awards: ECCAI Fellow 2009, IFSA Fellow 2013, 2010 Spanish National Award on Computer Science ARITMEL to the "Spanish Engineer on Computer Science", International Cajastur "Mamdani" Prize for Soft Computing (Fourth Edition, 2010), IEEE Transactions on Fuzzy Systems Outstanding 2008 and 2012 Paper Awards (bestowed in 2011 and 2015, respectively), 2011 Lotfi A. Zadeh Prize Best Paper Award of the International Fuzzy Systems Association, 2013 AEPIA Award to a scientific career in Artificial Intelligence, and 2014 XV Andalucía Research Prize Maimónides (awarded by the regional government of Andalucía).
His current research interests include, among others, soft computing (including fuzzy modeling and evolutionary algorithms), information fusion, decision making, biometrics, data preprocessing, data science and big data.