
Decision Trees for Mining Data Streams

João Gama¹, Ricardo Fernandes², and Ricardo Rocha²

¹ LIACC, FEP - University of Porto
Rua de Ceuta, 118-6, 4050-190 Porto, Portugal
[email protected]
² Department of Mathematics
University of Aveiro, Aveiro, Portugal
{tbs,ricardor}@mat.ua.pt

Abstract. In this paper we study the problem of constructing accurate decision tree models from data streams. Data streams are incremental tasks that require incremental, online, and any-time learning algorithms. One of the most successful algorithms for mining data streams is VFDT. We have extended VFDT in three directions: the ability to deal with continuous data, the use of more powerful classification techniques at tree leaves, and the ability to detect and react to concept drift. The VFDTc system can incorporate and classify new information online, with a single scan of the data, in constant time per example. The most relevant property of our system is its ability to obtain a performance similar to that of a standard decision tree algorithm even for medium-sized datasets. This is relevant due to the any-time property. We also extend VFDTc with the ability to deal with concept drift, by continuously monitoring differences between two class distributions of the examples: the distribution when a node was built and the distribution in a time window of the most recent examples. We study the sensitivity of VFDTc with respect to drift, noise, the order of examples, and the initial parameters on different problems, and demonstrate its utility on large and medium data sets.

Key words: Data Streams, Concept Drift, Incremental Decision Trees.

1 Introduction

Databases are rich in information that can be used in decision processes. Nowadays, the majority of companies and organizations possess gigantic databases that grow by millions of records per day. Continuous data streams arise naturally, for example, in the network installations of large telecommunications and Internet service providers, where detailed usage information from different parts of the network needs to be continuously collected and analyzed for interesting trends. The usual data mining techniques have problems in dealing with this huge volume of data. In traditional applications of data mining, the volume of data is the main obstacle to the use of memory-based techniques, due to restrictions in the computational resources: memory, time, or disk space. Therefore, in the majority of these systems, using all the available data becomes impossible, which can result in underfitting. Constructing KDD systems that use this entire amount of data while keeping the accuracy of traditional systems becomes problematic.
Decision trees, due to their characteristics, are one of the most used techniques for data mining. Decision tree models are non-parametric, distribution-free, and robust to the presence of outliers and irrelevant attributes. Tree models have a high degree of interpretability: global and complex decisions can be approximated by a series of simpler and local decisions. Univariate trees are invariant under all (strictly) monotone transformations of the individual input variables. The usual algorithms that construct decision trees from data use a divide-and-conquer strategy: a complex problem is divided into simpler problems, the same strategy is applied recursively to the sub-problems, and the solutions of the sub-problems are combined in the form of a tree to yield the solution of the complex problem. Formally, a decision tree is a directed acyclic graph in which each node is either a decision node, with two or more successors and a condition based on attribute values, or a leaf node. A leaf node is labeled with a constant that minimizes a given loss function; in the classification setting, the constant that minimizes the 0-1 loss function is the mode of the classes that reach the leaf. Nevertheless, several authors have studied the use of other functions at tree leaves [20, 10].
Data streams are, by definition, problems where the training examples used to construct decision models arrive over time, usually one at a time. A natural approach for this incremental task is incremental learning algorithms. A key requirement for any such learning algorithm is the ability to update the decision model in order to incorporate new information in an incremental way. In the field of incremental tree induction, we can distinguish two main research lines. In the first one, a tree is constructed using a greedy search, and the incorporation of new information involves re-structuring the actual tree [29, 31, 17]. This is done using operators that pull up or push down decision nodes, as in systems like ID5 [29], ITI [31], or ID5R [17]. The second research line does not use the greedy search of standard tree induction. It maintains at each decision node a set of sufficient statistics and only makes a decision (installs a split-test at that node) when there is enough statistical evidence in favor of a particular split test [13]. A notable example is the Very Fast Decision Tree (VFDT) system [6]. VFDT is an online decision tree algorithm, designed to maintain a decision tree over a data stream. It assumes a continuous flow of training examples, possibly mixed with unlabeled examples. VFDT continuously trains on labeled data and can classify unlabeled data at any time.
In this paper we propose the VFDTc system, which incorporates three main extensions to the VFDT system. The first is the ability to deal with numerical attributes. The second is the ability to apply naive Bayes classifiers in tree leaves. The third is the ability to detect and react to drift in a time-evolving data stream; in such cases the decision model is updated to reflect the most recent class distribution of the examples. We show that incremental tree induction methods that wait until there is enough statistical support to install a split-test benefit greatly from using more appropriate classification strategies at tree leaves [12]. The paper is organized as follows. The next section describes VFDT and other related work that is the basis for our work. In Section 3 we present our extensions to VFDT, leading to the VFDTc system, detailing the major options we implemented and the differences with respect to VFDT and other available systems. The system has been implemented and evaluated; the experimental evaluation and sensitivity analysis are presented in Section 4. The last section concludes the paper, summarizing the main contributions of this work.

2 Related Work
In this section we analyze related work along three dimensions. One dimension is the use of more powerful classification strategies at tree leaves, the second is methods for incremental tree induction, and the last is methods to react to drift.

Functional Tree Leaves The standard algorithm to construct a decision tree installs at each leaf a constant that minimizes a given loss function. In the classification setting, the constant that minimizes the 0-1 loss function is the mode of the target attribute of the examples that fall at the leaf. Several authors have studied the use of other functions at tree leaves [20, 11]. One of the earliest works is the Perceptron tree algorithm [30], where leaf nodes may implement a general linear discriminant function. Kohavi [20] has also presented the naive Bayes tree, which uses functional leaves. NBtree is a hybrid algorithm that generates a regular univariate decision tree, but the leaves contain a naive Bayes classifier built from the examples that fall at that node. The approach retains the interpretability of naive Bayes and decision trees, while resulting in classifiers that frequently outperform both constituents, especially on large datasets. Seewald et al. [27] have presented an interesting extension to this topology by allowing leaf nodes to use different kinds of models: naive Bayes, multi-response linear regression, and instance-based models. The results indicate a certain performance improvement. The use of functional leaves and functional nodes for classification and regression has been studied in [11], where the author shows that the use of functional leaves is a variance reduction method. This paper explores this idea in the context of learning from data streams. As we show in the experimental section, there are strong advantages in the performance of the resulting decision models.

Incremental Tree Induction In many interesting domains, the information required to learn concepts is rarely available a priori. Over time, new pieces of information become available, and decision structures should be revised. This learning mode has been identified and studied in the machine learning community under several designations: incremental learning, online learning, sequential learning, theory revision, etc. In the case of tree models, we can distinguish two main research lines. In the first one, a tree is constructed using a greedy search, and the incorporation of new information involves re-structuring the actual tree. This is done using operators that pull up or push down decision nodes, as in systems like ID4 [32], ID5 [29], ITI [31], or ID5R [17]. The second research line does not use the greedy search of standard tree induction. It maintains a set of sufficient statistics at each decision node and only makes a decision, i.e., installs a split-test at that node, when there is enough statistical evidence in favor of a particular split test. This is the case of [13, 6]. A notable example is the VFDT system [6], which can manage thousands of examples using few computational resources, with a performance similar to a batch decision tree given enough examples.

The VFDT System A decision tree is learned by recursively replacing leaves with decision nodes. Each leaf stores the sufficient statistics about attribute-values. The sufficient statistics are those needed by a heuristic evaluation function that evaluates the merit of split-tests based on attribute-values. When an example is available, it traverses the tree from the root to a leaf, testing the appropriate attribute at each node and following the branch corresponding to the attribute's value in the example. When the example reaches a leaf, the sufficient statistics are updated. Then, each possible condition based on attribute-values is evaluated. If there is enough statistical support in favor of one test over the others, the leaf is changed to a decision node. The new decision node will have as many descendant leaves as the number of possible values for the chosen attribute (therefore the tree is not necessarily binary). Decision nodes only maintain the information about the split-test installed at that node.
The initial state of the tree consists of a single leaf: the root of the tree. The heuristic evaluation function is the Information Gain (denoted by H)¹. The sufficient statistics for estimating the merit of a nominal attribute are the counts n_ijk: the number of examples with value i of attribute j and class k that reached the leaf. The Information Gain measures the amount of information that is necessary to classify an example that reached the node.

H(A_j) = info(examples) − info(A_j)    (1)

The information of the attribute j is given by:

info(A_j) = Σ_i P_i ( Σ_k −P_ik log2(P_ik) )    (2)

where P_ik = n_ijk / Σ_a n_ajk is the probability of observing value i of attribute j given class k, and P_i = Σ_b n_ijb / Σ_a Σ_b n_ajb is the probability of observing value i of attribute j.

¹ The original description of VFDT is general enough for other evaluation functions (e.g. Gini). Without loss of generality, we restrict ourselves here to the information gain.
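As an illustration, the information gain in equations (1) and (2) can be computed directly from the counts n_ijk. The following Python sketch assumes the counts are stored in dictionaries; the function and variable names are ours and are not part of the VFDT implementation.

import math
from collections import defaultdict

def information_gain(nijk, class_counts):
    """H(A_j) = info(examples) - info(A_j), computed from the counts n_ijk.

    nijk: dict mapping (attribute_value_i, class_k) -> count
    class_counts: dict mapping class_k -> number of examples at the leaf
    """
    total = sum(class_counts.values())

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    # info(examples): entropy of the class distribution at the leaf
    info_examples = entropy(list(class_counts.values()))

    # group the counts by attribute value i
    by_value = defaultdict(dict)
    for (i, k), c in nijk.items():
        by_value[i][k] = c

    # info(A_j) = sum_i P_i * entropy of the classes within value i
    info_attr = 0.0
    for i, counts_k in by_value.items():
        n_i = sum(counts_k.values())
        info_attr += (n_i / total) * entropy(list(counts_k.values()))

    return info_examples - info_attr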
The main innovation of the VFDT system is the use of Hoeffding bounds to decide how many examples must be seen before installing a split-test at a given leaf. Suppose we have made n independent observations of a random variable r whose range is R. The Hoeffding bound states that, with probability 1 − δ, the true average of r, r̄, is in the range r̂ ± ε, where ε = √( R² ln(1/δ) / (2n) ). Let H be the evaluation function of an attribute. For the information gain, the range R of H is log2(#classes). Let x_a be the attribute with the highest H, x_b the attribute with the second-highest H, and ∆H = H(x_a) − H(x_b) the difference between the two best attributes. Then, if ∆H > ε with n examples observed at the leaf, the Hoeffding bound guarantees with probability 1 − δ that x_a is really the attribute with the highest value of the evaluation function. In this case the leaf is transformed into a decision node that splits on x_a.
It turns out that it is not efficient to compute H every time an example arrives. Moreover, it is highly improbable that ∆H becomes greater than ε with the arrival of a single new example. For this reason a user-defined constant, nmin, is used: the number of examples that must reach a leaf before H is recomputed. Contrary to [17], it uses the entropy difference to determine how many examples will be necessary before it is verified that ∆H > ε.
When two or more attributes continuously have very similar values of H, even given a large number of examples, the Hoeffding bound will not decide between them. To solve this problem, VFDT uses a user-defined constant τ for tie-breaking: if ∆H < ε < τ, the leaf is transformed into a decision node and the split-test is based on the best attribute.
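The split decision described above can be summarized in a small sketch. The following Python code is illustrative only (the function names are ours); the default values of δ and τ mirror those used in the experiments of Section 4.

import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))

def should_split(best_H, second_H, n_classes, n, delta=5e-6, tau=5e-3):
    """Decide whether to turn a leaf into a decision node.

    best_H, second_H: evaluation-function values of the two best attributes
    n: number of examples seen at the leaf
    """
    R = math.log2(n_classes)            # range of the information gain
    eps = hoeffding_bound(R, delta, n)
    delta_H = best_H - second_H
    # split if the best attribute is clearly better, or in case of a tie
    return delta_H > eps or (delta_H < eps and eps < tau)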
Another interesting characteristic of VFDT is the ability to deactivate the less promising leaves when the maximum available memory is reached. A leaf can later be reactivated if it becomes more promising than the active ones. VFDT can also be initialized with a tree produced by a conventional algorithm.

Concept Drift In the machine learning literature, several methods have been presented to deal with time-changing concepts [26, 34, 19, 23, 18, 16]. The two basic approaches are temporal windows, where a window of the most recent examples defines the training set for the learning algorithm, and example weighting, which ages the examples, shrinking the importance of the oldest ones. These basic methods can be combined and used together. Both weighting examples and temporal windows are used with incremental learning.
Any method that dynamically chooses the set of most recent examples used to learn the new concept faces several difficulties. It has to select enough examples for the learning algorithm and also keep old data from disturbing the learning process when the older data have a different probability distribution from the new concept. A larger set of examples allows a better generalization if the distribution generating the examples is stationary [34]. Systems based on weighting examples use partial memory to select the more recent examples, which are therefore probably within the new context. Recent examples are more important (their weight increases) while the weight of older examples shrinks, and eventually they are forgotten [18]. When concept drift occurs, the older examples become irrelevant. We can apply a time window over the training examples to learn the new concept description only from the most recent examples. The time window can be improved by adapting its size. Widmer [34] and Klinkenberg [18] present several methods that choose a time window dynamically, adjusting its size using heuristics to track the learning process. The methods select the time window so as to include only examples of the current target concept. Kubat and Widmer [34] describe a system that adapts to drift in continuous domains. Klinkenberg [18] shows the application of several methods for handling concept drift with an adaptive time window on the training data, by selecting representative training examples or by weighting the training examples. Those systems automatically adjust the window size, the example selection, and the example weighting to minimize the estimated generalization error.
Concept drift in the context of data streams appears, for example, in [28, 16]. Wang et al. [33] train ensembles of batch learners from sequential chunks of data and use error estimates on the test data under the time-evolving environment. CVFDT [16] is an extension of the VFDT algorithm for mining decision trees from continuously-changing data streams. CVFDT works by keeping its model consistent with a sliding window of the most recent examples. When a new example arrives, it increments the counts corresponding to the new example and decrements the counts corresponding to the oldest example in the window, which is then forgotten. Each node in the tree maintains the sufficient statistics needed to compute the splitting-test, and each time an example traverses a node these statistics are updated. Periodically, the splitting-test is recomputed. If a new test is chosen, CVFDT starts growing an alternate subtree. The old one is replaced only when the new subtree becomes more accurate.
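The sliding-window bookkeeping described for CVFDT can be illustrated with a minimal sketch. The code below is our simplified illustration, not CVFDT's actual implementation; the class and attribute names are hypothetical.

from collections import deque, defaultdict

class SlidingWindowCounts:
    """Keep per-(attribute, value, class) counts consistent with a window of
    the most recent examples, in the spirit of CVFDT's sliding window."""

    def __init__(self, window_size):
        self.window = deque()
        self.window_size = window_size
        self.counts = defaultdict(int)   # (attribute, value, klass) -> count

    def add(self, example, klass):
        # increment the counts corresponding to the new example
        for attribute, value in example.items():
            self.counts[(attribute, value, klass)] += 1
        self.window.append((example, klass))
        # decrement the counts of the oldest example once the window is full
        if len(self.window) > self.window_size:
            old_example, old_klass = self.window.popleft()
            for attribute, value in old_example.items():
                self.counts[(attribute, value, old_klass)] -= 1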
Other relevant works in the area use a multiple-model approach [34, 22, 28, 33]. The basic idea is to monitor the error of the multiple models on the most recent examples; the models with lower error are used to classify test examples. An interesting ensemble algorithm is the Dynamic Weighted Majority (DWM) [22]. DWM maintains an ensemble of base learners, makes predictions using a weighted-majority vote of these experts, and dynamically creates and deletes experts in response to changes in performance.

3 The VFDTc System


We implemented a system based on VFDT [6] that maintains all the desirable properties of VFDT. The main extensions we propose include an efficient method to deal with numerical attributes, the use of functional leaves to classify test examples, and the ability to automatically detect and react to concept drift.

Numerical attributes Most real-world problems contain numerical attributes, and practical applications of learning algorithms should address this issue. For batch decision tree learners, this ability requires a sort operation, which is the most time-consuming operation [4]. In this section we provide an efficient method to deal with numerical attributes in the context of online decision tree learning.
Fig. 1. Algorithm to insert value xj of an example labeled with class y into a binary tree. Each node of the Btree contains: i, the value assigned to the node; VE, a vector with the per-class counts of values less than or equal to i; VH, a vector with the per-class counts of values greater than i.
Procedure InsertValueBtree(xj, y, Btree)
Begin
  If (Btree == NULL) then
    return NewNode(i = xj, VE[y] = 1)
  ElseIf (xj == i) then
    VE[y]++
    return
  ElseIf (xj < i) then
    VE[y]++
    InsertValueBtree(xj, y, Btree.Left)
  ElseIf (xj > i) then
    VH[y]++
    InsertValueBtree(xj, y, Btree.Right)
End

In VFDTc, a decision node that contains a split-test based on a continuous attribute has two descendant branches. The split-test is a condition of the form attri ≤ cut_point, and the descendant branches correspond to the values TRUE and FALSE for the split-test. The cut point is chosen from all the observed values for that attribute. In order to evaluate the goodness of a split, we need to compute the class distributions of the examples where the attribute value is less than and greater than the cut point.
The counts n_ijk are fundamental for computing all the necessary statistics. They are kept using the following data structure: in each leaf of the decision tree we maintain a vector with the class distribution of the examples that reach the leaf, and for each continuous attribute j the system maintains a binary tree structure. A node in the binary tree is identified by a value i (a value of attribute j seen in an example) and by two vectors, VE and VH (of dimension k), used to count the values per class that cross that node. The vectors VE and VH contain the counts of values respectively ≤ i and > i for the examples labeled with class k. When an example reaches a leaf, all the binary trees are updated. Figure 1 presents the algorithm to insert a value into the binary tree. Insertion of a new value into this structure is O(log n) in the best case and O(n) in the worst case, where n represents the number of distinct values of the attribute seen so far.
To obtain the Information Gain of a given attribute we use an exhaustive method that evaluates the merit of all possible cut points. In our case, any value observed in the examples so far can be used as a cut point.
Fig. 2. Algorithm to compute #(Aj ≤ z) for a given attribute j and class k.
Procedure LessThan(z, k, Btree)
Begin
  if (Btree == NULL) return 0
  if (i == z) return VE[k]
  if (i < z) return VE[k] + LessThan(z, k, Btree.Right)
  if (i > z) return LessThan(z, k, Btree.Left)
End
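Figures 1 and 2 translate directly into a small binary-tree structure. The following Python sketch is an illustrative rendering; the class and function names (BTreeNode, insert_value, less_than) are ours, not part of VFDTc.

class BTreeNode:
    """Per-attribute binary tree kept at each leaf (Figures 1 and 2).
    VE[k] counts values <= i seen with class k; VH[k] counts values > i."""

    def __init__(self, value, n_classes):
        self.i = value
        self.VE = [0] * n_classes
        self.VH = [0] * n_classes
        self.left = None
        self.right = None

def insert_value(node, xj, y, n_classes):
    """Insert value xj of an example labeled with class y; returns the subtree root."""
    if node is None:
        node = BTreeNode(xj, n_classes)
        node.VE[y] = 1
        return node
    if xj == node.i:
        node.VE[y] += 1
    elif xj < node.i:
        node.VE[y] += 1
        node.left = insert_value(node.left, xj, y, n_classes)
    else:
        node.VH[y] += 1
        node.right = insert_value(node.right, xj, y, n_classes)
    return node

def less_than(node, z, k):
    """Return #(A_j <= z) for class k by traversing the binary tree."""
    if node is None:
        return 0
    if node.i == z:
        return node.VE[k]
    if node.i < z:
        return node.VE[k] + less_than(node.right, z, k)
    return less_than(node.left, z, k)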

For each possible cut point, we compute the information of the two partitions using equation (3):

info(Aj(i)) = P(Aj ≤ i) · iLow(Aj(i)) + P(Aj > i) · iHigh(Aj(i))    (3)

where i is the split point, iLow(Aj(i)) is the information of Aj ≤ i (equation (4)) and iHigh(Aj(i)) is the information of Aj > i (equation (5)). We choose the split point that minimizes (3).

iLow(Aj(i)) = − Σ_K P(K = k | Aj ≤ i) · log2( P(K = k | Aj ≤ i) )    (4)

iHigh(Aj(i)) = − Σ_K P(K = k | Aj > i) · log2( P(K = k | Aj > i) )    (5)

These statistics are easily computed using the counts n_ijk and the algorithm presented in Figure 2. For each attribute, it is possible to compute the merit of all possible cut points by traversing the binary tree once. A split on a numerical attribute is binary: the examples are divided into two subsets, one representing the True value of the split-test installed at the decision node and the other the False value. VFDTc only considers a possible cut point if the number of examples in each of the two subsets is higher than pmin² percent of the total number of examples seen at the node.
² Where pmin is a user-defined constant.
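A candidate cut point can then be scored with equations (3)-(5) from the class counts on each side of the cut. The sketch below is illustrative; it relies on the less_than function from the previous sketch, and the helper names are ours.

import math

def partition_info(counts):
    """Entropy of a class-count vector (equations (4) and (5))."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def cut_point_info(btree_root, cut, n_classes, total_per_class):
    """info(A_j(i)) for a candidate cut point (equation (3)).

    total_per_class[k] is the number of examples of class k seen at the leaf;
    less_than() is the traversal sketched above (Figure 2).
    """
    low = [less_than(btree_root, cut, k) for k in range(n_classes)]
    high = [total_per_class[k] - low[k] for k in range(n_classes)]
    total = sum(total_per_class)
    p_low = sum(low) / total
    p_high = sum(high) / total
    return p_low * partition_info(low) + p_high * partition_info(high)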

Functional tree leaves To classify a test example, the example traverses the tree from the root to a leaf, where it is classified with the most representative class of the training examples that fall at that leaf. One of the innovations of our algorithm is the ability to use naive Bayes [8] classifiers at tree leaves: a test example is classified with the class that maximizes the posterior probability given by Bayes rule, assuming the independence of the attributes given the class. There is a simple motivation for this option. VFDT only changes a leaf to a decision node when there is a sufficient number of examples to support the change; usually hundreds or even thousands of examples are required. To classify a test example, the majority class strategy only uses the information about the class distribution and does not look at the attribute values. It uses only a small part of the available information, a crude approximation to the distribution of the examples. On the other hand, naive Bayes takes into account not only the prior distribution of the classes but also the conditional probabilities of the attribute values given the class. In this way, there is a much better exploitation of the available information at each leaf.
Given an example e = (x1, ..., xj) and applying Bayes theorem, we obtain: P(Ck | e) ∝ P(Ck) · Π_j P(xj | Ck).
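At classification time, the leaf's sufficient statistics can be plugged into this rule. The following sketch is a simplified illustration for nominal attributes; the Laplace smoothing and the function names are our additions, not part of VFDTc (continuous attributes would first be discretized as described below).

import math

def naive_bayes_classify(example, class_counts, cond_counts, smoothing=1.0):
    """Classify a test example at a leaf with naive Bayes.

    class_counts: dict class_k -> number of training examples of class k at the leaf
    cond_counts: dict (attribute, value, class_k) -> count (the leaf's statistics)
    example: dict attribute -> value
    """
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for k, nk in class_counts.items():
        # log P(C_k) + sum_j log P(x_j | C_k), with Laplace smoothing
        score = math.log((nk + smoothing) / (total + smoothing * len(class_counts)))
        for attribute, value in example.items():
            njk = cond_counts.get((attribute, value, k), 0)
            score += math.log((njk + smoothing) / (nk + smoothing))
        if score > best_score:
            best_class, best_score = k, score
    return best_class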
Several reasons justify the choice of naive Bayes. The first is that naive Bayes
is naturally incremental. It deals with heterogeneous data and missing values.
Moreover, it has been observed [7] that naive Bayes is a very competitive algo-
rithm for small datasets.
To compute the conditional probabilities P(xj | Ck) we should distinguish between nominal and continuous attributes. For the former the problem is trivial, using the sufficient statistics already needed to compute the information gain. For the latter, there are two usual approaches: either assuming that each attribute follows a normal distribution, or discretizing the attributes. Although the normal distribution assumption allows the sufficient statistics to be computed on the fly, some authors [7] point out that this is a strong assumption that could increase the error rate. As an alternative, it is possible to compute the required statistics from the binary-tree structure stored at each leaf. This is the method implemented in VFDTc. Any numerical attribute is discretized into min(10, number of different values) intervals [7]. To count the number of examples per class that fall in each interval we use the algorithm described in Figure 3. This algorithm is run only once at each leaf for each discretization bin. Those counts are used to estimate P(xj | Ck).
We should note that the use of naive Bayes classifiers at tree leaves does not introduce any overhead in the training phase. In the application phase, and for nominal attributes, the sufficient statistics constitute (directly) the naive Bayes tables. For continuous attributes, the naive Bayes contingency tables are efficiently derived from the Btrees used to store the numeric attribute values. The overhead introduced is proportional to the depth of the Btree, that is, at most log(n), where n is the number of different values observed for a given attribute.

Conditions for applying naive Bayes. When a decision tree is very deep, the number of examples that reach a leaf is small. In these cases, the use of a more conservative strategy, like classification using the majority class, would be more appropriate. We therefore established two simple heuristics to trigger the use of naive Bayes. First, we only use naive Bayes at a leaf when the number of examples in the leaf is greater than 100. Second, when applying a naive Bayes classifier, we only consider a class k if the number of examples labeled with that class is greater than a user-defined constant.

Computing nmin When an example reaches a leaf, it is very expensive to (re)compute the evaluation function; therefore, in VFDT, the user inputs the number of (new) examples, nmin, that must reach a leaf before the evaluation function is computed.
Fig. 3. Algorithm to compute P(xj | Ck) for a numeric attribute xj and class k at a given leaf.
Procedure PNbc(xj, k, Btree, Xh, Xl, Nj)
  Xh: the highest value of xj observed at the leaf
  Xl: the lowest value of xj observed at the leaf
  Nj: the number of different values of xj observed at the leaf
Begin
  if (Btree == NULL) return 0
  nintervals = min(10, Nj)          /* the number of intervals */
  inc = (Xh − Xl) / nintervals      /* the increment */
  Let Count[] be the cumulative counts and Counts[] the number of examples per interval
  For i = 1 to nintervals
    Count[i] = LessThan(Xl + inc * i, k, Btree)
    If (i > 1) then
      Counts[i] = Count[i] − Count[i−1]
    Else
      Counts[i] = Count[i]
    If (xj ≤ Xl + inc * i) then
      return Counts[i] / Leaf.nrExs[k]
  return Counts[nintervals] / Leaf.nrExs[k]
End
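A direct Python rendering of Figure 3 could look as follows; it reuses the less_than traversal sketched earlier, and the function and parameter names are ours.

def p_numeric_given_class(btree_root, xj, k, x_low, x_high, n_distinct, n_class_k):
    """Estimate P(x_j | C_k) for a numeric attribute by discretizing the
    observed range into at most 10 intervals (Figure 3).

    n_class_k: number of examples of class k at the leaf (Leaf.nrExs[k]).
    Relies on less_than() from the binary-tree sketch above.
    """
    if btree_root is None or n_class_k == 0 or n_distinct == 0:
        return 0.0
    nintervals = min(10, n_distinct)
    inc = (x_high - x_low) / nintervals
    previous_cumulative = 0
    for i in range(1, nintervals + 1):
        cumulative = less_than(btree_root, x_low + inc * i, k)
        in_interval = cumulative - previous_cumulative
        previous_cumulative = cumulative
        if xj <= x_low + inc * i:
            return in_interval / n_class_k
    return in_interval / n_class_k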

In VFDTc, a leaf must see 200 examples before the evaluation function is computed for the first time. From then on, the value of nmin is computed automatically. After a tie, we can determine ∆H = H(xa) − H(xb) from the results of the first computation. Let NrExs be the total number of examples in the leaf; we can compute the contribution, Cex, of each example to ∆H:

Cex = ∆H / NrExs    (6)

To transform a leaf into a decision node, we must have ∆H > ε. If each example contributes Cex, then at least

nmin = (ε − ∆H) / Cex    (7)

examples will be needed to make ∆H > ε.
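This rule translates into a few lines. The sketch below is illustrative; the function name is ours and it assumes 0 < ∆H ≤ ε.

import math

def dynamic_nmin(delta_H, epsilon, n_examples):
    """Estimate how many more examples are needed before delta_H can exceed
    epsilon, following equations (6) and (7)."""
    c_ex = delta_H / n_examples                     # eq. (6): contribution per example
    return math.ceil((epsilon - delta_H) / c_ex)    # eq. (7)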

Missing Values In each leaf we store the sufficient statistics needed to compute the splitting evaluation function and the naive Bayes probabilities. We use these sufficient statistics to handle missing attribute values. When a leaf is expanded and becomes a decision node, all the sufficient statistics are released, except those needed to deal with missing values: we store the mean and the mode for continuous and nominal attributes, respectively.

3.1 Detecting and reacting to Concept Drift


In most real-world applications, concepts change over time. The causes of change are diverse: hidden contexts, regime changes, etc. Our method to detect drift is based on the assumption that, whatever the cause of the drift, the decision surface moves. As a side effect, there is a change in the class distribution in particular regions of the instance space: in these regions, the class distribution of the most recent examples is different from the one observed in past examples.
In a decision tree, each node corresponds to a hyper-rectangle in the instance space. A change in the class distribution of the examples in a region of the instance space can be captured at the corresponding node. Moreover, in Hoeffding trees a splitting-test is installed only if there is enough evidence that it is the right test. If the distribution is stationary, the splitting-test is stable and will remain forever; in non-stationary distributions the test should be revised.
This is the basic idea behind Sequential regularization. Each time a training example is processed, the example traverses the tree from the root to a leaf; on this path, statistics about the class distribution at each node are updated. After reaching a leaf, the algorithm traverses the tree again, from the leaf back to the root (Figure 4). On this inverse path, the distribution at each node is compared to the sum of the distributions of its descendant leaves. If a change in these distributions is detected, the sufficient statistics of the leaves are pushed up to the node that detects the change and the subtree rooted at that node is pruned.
In Hoeffding trees, the stream of examples traverses the tree down to a leaf. The examples that fall in each leaf correspond to a sort of time window over the most recent examples. We are therefore comparing two distributions from two different time windows of examples: the oldest one corresponds to the set of examples seen when the node was still a leaf, the other corresponds to a time window of the most recent examples.
Based on this idea, we present two methods to estimate the class distributions at each node of the decision tree: one based on pessimistic estimates of the error, the other based on the affinity coefficient [2]. Both methods continuously monitor the presence of drift. Once drift is detected, the model must be adapted to the most recent distribution. The adaptation method is presented in the last subsection of this section.

Drift Detection based on Error Estimates (EE) This method is similar to the Error Based Pruning (EBP) method used for pruning standard decision trees, which is considered to be one of the most robust pruning methods [9]. The method uses, for each decision node i, two estimates of the classification error: the static error (SEi) and the backed-up error (BUEi). The value of SEi is the estimate of the error at node i. The value of BUEi is the weighted sum of the error estimates of all the descendant leaves of node i. With these estimates, we can detect a concept change by verifying the condition SEi ≤ BUEi.
The basic idea of EE is the following. Each new example traverses the tree from the root to one of the leaves (see Figure 4). On this path, the sufficient statistics and the class distributions of the nodes are updated. When the example arrives at a leaf, the value of SE is updated. After that, the algorithm makes the opposite path, on which the values of SE and BUE are updated for each decision node and the regularization condition is verified.
Fig. 4. Sequential regularization method.

The verification is only made if a minimum number of examples exists in the children³. If SEi ≤ BUEi, then a concept drift is detected at node i. The method assumes that the set of examples that crosses the decision node is a random sample of a population. Classifying these examples gives an estimate of the error, f = E/N, where E is the number of misclassified examples and N the total number of examples. The question is: what is the error p of the population? This value cannot be computed exactly, but we can compute a confidence interval in which the value of p lies with a given probability. The observed error f in the sample can be seen as E events in N attempts. Assuming a Binomial distribution for the error and fixing the confidence level c, a confidence interval can be defined, p ∈ [Umin, Umax], where Umax is a pessimistic estimate of the error for the node. After some algebra, we can derive:

Umax = ( f + z²/(2N) + z · √( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

For a Normal distribution, the value of c and the corresponding value of z are tabulated. The usual default value for the confidence level is c = 25%. Thus, the error estimate for node i is SEi = Umax. Finally, the value of BUEi is given by the formula:

BUEi = Σ_{j ∈ children} (#Ej / #ET) × ej

where #Ej indicates the number of examples in child j, #ET the total number of examples of all the children, and ej = SEj.
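The pessimistic estimate and the SE ≤ BUE check can be sketched as follows. The value z ≈ 0.69 for the 25% confidence level is our assumption, taken from the usual C4.5-style tables; the function names are ours.

import math

def pessimistic_error(errors, n, z=0.69):
    """Upper confidence bound U_max on the true error rate, given `errors`
    misclassified examples out of `n` (z ~= 0.69 is assumed to correspond to
    the usual 25% confidence level)."""
    if n == 0:
        return 1.0
    f = errors / n
    numerator = f + (z * z) / (2 * n) + z * math.sqrt(f / n - (f * f) / n + (z * z) / (4 * n * n))
    return numerator / (1 + (z * z) / n)

def drift_detected(node_errors, node_n, children):
    """EE check at a decision node: drift is signaled when SE_i <= BUE_i.

    node_errors, node_n: errors and examples recorded at the node itself;
    children: list of (errors_j, n_j) pairs for the descendant leaves."""
    se = pessimistic_error(node_errors, node_n)
    total = sum(n_j for _, n_j in children)
    if total == 0:
        return False
    bue = sum((n_j / total) * pessimistic_error(e_j, n_j) for e_j, n_j in children)
    return se <= bue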
It is important to note that the error estimates we obtain at each node do not correspond to the training set error (as in the batch algorithm). In Hoeffding trees each example is processed only once. When a training example arrives at a leaf, it is classified, using the majority class at that leaf, before the leaf statistics are updated. The error estimates we obtain are therefore unbiased estimates.

³ The best value found in our experiments for this minimum number was 2 examples.
Drift Detection based on Affinity Coefficient (AC) The basic idea of this method is to directly compare two distributions at a decision node i. One distribution (Qi) records the class distribution before the node expansion, that is, before the moment when the leaf became a decision node. The other distribution (Si) is the sum of the class distributions in all the leaves of the subtree rooted at that node. If the two distributions are significantly different, this is interpreted as a change of the target concept. These distributions are updated at each node every time an example is processed by the algorithm. The comparison between distributions is done using the Affinity Coefficient [2]:

ac(S, Q) = Σ_{class=1}^{p} √( S_class × Q_class )

where S_class and Q_class indicate the frequency of each class in the distributions S and Q, respectively, and p is the number of classes in the data set. ac(S, Q) takes values in the range [0, 1]. If this value is near 1, the two distributions are similar; otherwise the two distributions are different and a drift signal occurs. The comparison is only started after the number of examples in the decision node exceeds a minimum number of examples (nminExp). The chosen value was 300 examples, the best value found in several experiments. In addition, it is necessary to define a threshold for the value of ac(S, Q) to decide when the distributions are different:

ac(Si, Qi) ≤ ϕ ⇔ ∆Ci = ϕ − ac(Si, Qi) ≥ ε    (8)

where ϕ is the threshold and ε is given by the Hoeffding bound. In the experiments reported in this paper, ϕ = 0.97.
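The AC check can be sketched in a few lines. The code below is illustrative: the helper names are ours and, for brevity, only the threshold test ac(S, Q) ≤ ϕ of equation (8) is implemented.

import math

def affinity_coefficient(S, Q):
    """ac(S, Q) = sum over classes of sqrt(S_class * Q_class), with S and Q
    given as class-count dictionaries and normalized to relative frequencies."""
    s_total, q_total = sum(S.values()), sum(Q.values())
    if s_total == 0 or q_total == 0:
        return 0.0
    classes = set(S) | set(Q)
    return sum(math.sqrt((S.get(k, 0) / s_total) * (Q.get(k, 0) / q_total))
               for k in classes)

def ac_drift_detected(S_leaves, Q_node, n_examples, phi=0.97, min_examples=300):
    """Signal drift at a node when ac(S, Q) <= phi, once at least
    `min_examples` examples have crossed the node."""
    if n_examples < min_examples:
        return False
    return affinity_coefficient(S_leaves, Q_node) <= phi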

Reacting to Drift After detecting a concept change, the system should update the decision model. We must take into account that the most recent information contains the useful information about the current concept, and the latest examples incorporated into the tree are stored in the leaves. Therefore, supposing that the change of concept was detected at node i, the reaction method pushes up all the information of the descendant leaves to node i, namely the sufficient statistics and the class distributions. The decision node becomes a leaf and the subtree rooted at that node is removed. This is a forgetting mechanism that removes the information that is no longer correct. To build the sufficient statistics stored at the new leaf i, an algorithm joins, for each attribute j (continuous or nominal), the corresponding binary trees stored in all the leaves of that subtree into the structure of the new leaf i (Figure 5). Pushing up the sufficient statistics and class distributions of the leaves to the node allows a faster future expansion of that node, so the method adapts more quickly to the new concept.
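One simple way to realize this push-up, under the binary-tree sketch given earlier, is to enumerate the exact per-value class counts stored in each leaf's tree and re-insert them into a single tree for the new leaf. This is our illustrative reconstruction, not necessarily the join algorithm used in VFDTc, and it is not optimized (re-insertion is proportional to the number of stored examples).

def exact_counts(node, n_classes):
    """Yield (value, class, count) triples stored in a per-attribute binary tree.

    The count of examples with value exactly node.i and class k is VE[k] minus
    everything accounted for by the left subtree (which holds values < i)."""
    if node is None:
        return
    left_totals = [0] * n_classes
    if node.left is not None:
        for k in range(n_classes):
            left_totals[k] = node.left.VE[k] + node.left.VH[k]
    for k in range(n_classes):
        c = node.VE[k] - left_totals[k]
        if c > 0:
            yield node.i, k, c
    yield from exact_counts(node.left, n_classes)
    yield from exact_counts(node.right, n_classes)

def push_up(leaf_btrees, n_classes):
    """Join the binary trees of the pruned leaves into a single tree for the
    new leaf by re-inserting every stored value with its class counts
    (insert_value is the function sketched earlier)."""
    merged = None
    for btree in leaf_btrees:
        for value, k, count in exact_counts(btree, n_classes):
            for _ in range(count):
                merged = insert_value(merged, value, k, n_classes)
    return merged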

4 Experimental Evaluation
In this section we empirically evaluate VFDTc. We consider three dimensions of analysis: error rate, learning time, and tree size.
Fig. 5. Pushing-up the sufficient statistics from leaves.

The goals of this experimental study are twofold: first, to provide evidence that the use of functional leaves improves the performance of VFDT and, more importantly, its any-time characteristic; second, to discuss the efficiency of the proposed drift detection methods.
We should note that VFDTc, like VFDT, was designed for fast induction of decision models from large data streams using a single scan of the input data. Practical applications of these systems assume that the algorithm is running online: the data stream contains labeled examples, used to train the model, and unlabeled examples, which are classified by the current model. Nevertheless, in the methodology used in this work to evaluate VFDTc we follow the standard approach, using independent training and test sets.
In the first set of experiments we analyze the behavior of VFDTc on stationary datasets. We study the effects of two different strategies when classifying test examples: classifying using the majority class (VFDTcMC) and classifying using naive Bayes (VFDTcNB) at the leaves. In the second set of experiments we use non-stationary datasets and study the efficiency of the drift detection methods. In the last set of experiments we analyze the behavior of VFDTc on two real-world datasets.

4.1 Experimental Set-Up: Stationary datasets

The experimental work has been done using the Waveform and LED datasets, two well-known artificial datasets. We have used the two versions of the Waveform dataset available at the UCI repository [1]. Both versions are problems with three classes; the first version is defined by 21 numerical attributes and the second one contains 40 attributes. It is known that the optimal Bayes error is 14%. The LED problem has 24 binary attributes (17 of which are irrelevant) and 10 classes; the optimal Bayes error is 26%. The choice of these datasets was motivated by the existence of dataset generators at the UCI repository: we can generate datasets with any number of examples and build the learning curves needed to evaluate our claim about the any-time property of VFDTc.
For the Waveform and LED datasets, we generate training sets of varying size, from 10k to 1000k examples. The test set contains 250k examples.
The VFDTc algorithm was used with the parameter values δ = 5 × 10⁻⁶, τ = 5 × 10⁻³, and nmin = 200. All algorithms were run on a Pentium IV at 1GHz with 256 MB of RAM, running Linux RedHat 7.1. Detailed results are presented in Table 5.

Classification strategy: Majority Class vs. naive Bayes In this subsection, we compare the classification performance of VFDTc using two different classification strategies at the leaves: naive Bayes and majority class. Our goal is to show that using stronger classification strategies at tree leaves can improve, sometimes drastically, the performance of the classifier.
On these datasets, VFDTcNB consistently outperforms VFDTcMC (Figures 6(a) and 6(b)). We should note that, on the Waveform data, as the number of examples increases the performance of VFDTcMC approaches that of VFDTcNB (Figure 6). The most relevant aspect in all learning curves is that the performance of VFDTcNB is almost constant, independently of the number of training examples. For example, the difference between the best and worst performance (over all the experiments) is:
                     C4.5    VFDTcNB   VFDTcMC
Waveform (21 atts)   4.28    2.89      18.54
Waveform (40 atts)   3.97    5.00      13.65
LED                  0.41    0.30       7.97
These experiments support our claim that the use of an appropriate classification scheme improves the any-time property.
With respect to the other dimensions of analysis, the size of the tree does not depend on the classification strategy. With respect to learning times, the use of naive Bayes classifiers introduces an overhead, due to two factors. The first factor only applies when there are numeric attributes and is related to the construction of the contingency tables from the Btrees; the size of these tables is usually small (in our case 10 × #Classes) and independent of the number of examples, so in our experiments it was the less important factor. The second factor is that the application of naive Bayes requires the estimation, for each test example, of #Classes × #Attributes probabilities, whereas the majority class strategy only requires #Classes probabilities. When the number of test cases is large (as in our experiments) this is the most relevant factor. Nevertheless, the impact of the overhead shrinks as the training time increases, which is why the overhead is more visible for a reduced number of training examples (Figure 7).
From now on we focus our analysis on VFDTcNB.

C4.5 versus VFDTcNB In this subsection, we compare VFDTcNB against C4.5 [25].
[Figure 6: learning curves of error rate (%) versus number of training examples (10k to 1000k) for C4.5, VFDTcMC, and VFDTcNB on the Waveform (21 atts) and LED datasets.]
Fig. 6. Learning curves of error rates of VFDTcMC, VFDTcNB, and C4.5 on Waveform data (21 atts) and LED data.

                     VFDTcNB / C4.5   p-value
Waveform (21 atts)   0.89             0.0011
Waveform (40 atts)   0.90             0.0001
LED                  0.98             0.0017

Table 1. Ratio between the error rates of VFDTcNB and C4.5. The last column shows the p-values of the Wilcoxon test.

VFDTc, like VFDT, was designed for fast induction of interpretable and accurate models from large data streams using a single scan of the input data. As pointed out by the authors of VFDT [6]:

“A key property of Hoeffding trees is that it is possible to guarantee under realistic assumptions that the trees it produces are asymptotically arbitrarily close to the ones produced by a batch learner.”

The motivation for these experiments is to compare the relative performance of an online learning algorithm with a standard batch learner. We would expect a faster convergence rate of VFDTcNB in comparison to VFDTcMC.
On the LED dataset the performance of both systems is quite similar for all sizes of the training set. On both Waveform datasets VFDTcNB outperforms C4.5. Table 1 shows the mean of the ratios of the error rates (VFDTcNB / C4.5) and the p-value of the Wilcoxon test for the three datasets under study.

Tree size and Learning Times. In this work, we measure the size of a tree model as the number of decision nodes plus the number of leaves. We should note that VFDTcNB and VFDTcMC generate exactly the same tree model.
In all the experiments we have done, VFDTc generates decision models that are at least one order of magnitude smaller than those generated by C4.5. Figure 7(a) plots the concept size as a function of the number of training examples.
[Figure 7: tree size (number of nodes, log scale) and learning time (seconds) as a function of the number of training examples (10k to 1000k) on the Waveform (21 atts) and LED datasets.]
Fig. 7. Tree size (left), in logarithmic scale, and learning times (right) of C4.5 and VFDTcNB as a function of the number of training examples. For learning times, the figure plots the learning time of VFDTc and C4.5; the other two lines refer to learning time plus classification time of VFDTcMC and VFDTcNB, illustrating the overhead introduced by using functional leaves.

Dataset              Waveform 21       LED               Forest
nmin automatic       Yes      No       Yes      No       Yes      No
Eval                 656      3659     229      4992     121      2509
Tree size            375      363      155      145      61       185
Execution time (s)   115.2    153.2    25.0     27.7     40.3     62.8
Error rate (%)       16.78    16.73    25.92    25.89    39.98    41.2

Table 2. Experimental study comparing the dynamic setting of nmin with the static setting.

The size of the C4.5 trees grows much more with the number of examples, just as one would expect.
In another dimension, we measured the time needed by each algorithm to generate a decision model. The analysis done in the previous subsection, comparing VFDTcNB with VFDTcMC, also applies to the comparison of VFDTcNB with C4.5. VFDTcNB is very fast in the training phase: it scans the entire training set once, and the time needed to process each example is negligible. In the application phase there is an overhead due to the use of naive Bayes at the leaves. In Figure 7(b) we plot the learning plus classification time as a function of the number of training examples. For small datasets (fewer than 100k examples), the overhead introduced in the application phase is the most important factor.

Computing nmin We used three datasets in our experiments: Waveform 21, LED, and the Forest Covertype dataset. For each dataset we constructed two decision trees, one computing nmin dynamically and one with nmin fixed by the user (the default is 200 examples). We measured the number of times the evaluation function was computed, the number of nodes in the decision tree, the execution time, and the error rate (Table 2). The number of times the evaluation function is calculated and the execution time are improved with automatic nmin, without compromising the error rate.

Model dependency on the τ and δ parameters. Our experience with VFDTc indicates that, for a given dataset, the best generalization error requires a good compromise between δ and τ. We conducted a set of preliminary experiments on the sensitivity to these parameters using the Waveform data. The training set contains 750k examples and the test set 250k examples. We varied τ from 0.3 to 0.03 and δ from 5 × 10⁻⁶ to 5 × 10⁻⁷; the choice of these bounds was based on our experience with the algorithm. We measured the error rate, learning time, and size of the different trees produced. The results indicate that as τ grows inside this range there is a small decrease in learning time and error rate, while the tree size increases. This behavior is an indication that, in these datasets, ties occur between relevant attributes.

Sensitivity to the Order of the Examples. Several authors report that decision models built using incremental algorithms are very dependent on the order of the examples. We have analyzed this effect in VFDTc. We conducted an experiment using the Waveform dataset; the test set has 250k examples and the training set 750k examples. We shuffled the training set ten times and measured the error rate, the learning time, and the attribute chosen at the root of the decision tree produced by VFDTc. An analysis of the results shows that the variance of the error rate is 0.05, the learning time is not affected, and the attribute chosen at the root was either Att6 or Att14; these are the most informative attributes, with similar values of the evaluation function.

Sensitivity to Noise. We want to study the behavior of VFDTc in the presence of noise. We have used the LED dataset to control the noise introduced into the training set. In our experiment the test set was fixed at 250k examples with 10% of noise, while the training set was fixed at 750k examples with noise varying from 0 to 70%. The analysis of the results of VFDTcNB shows that up to 40% of noise the error rate is not affected; after 50%, the error rate increases drastically, as expected. Using the majority class as the classification strategy (VFDTcMC), the decision tree is strongly affected by the presence of noise.

Bias-Variance Decomposition of the Error. An interesting analysis of the classification error is given by the so-called bias-variance decomposition [3, 5]. Several authors note that there is a trade-off between the systematic errors due to the representational language used by an algorithm (the bias) and the variance due to the dependence of the model on the training set [3].
We have used the Waveform (21 attributes) dataset. The experimental methodology was as follows: we generated a test set with 50k examples and 10 independent training sets with 75k examples each. VFDTc and C4.5 learned on each training set and the corresponding models were used to classify the test set. The predictions were used to compute the terms of the bias and variance equations using the definition presented in [5]. The decomposition for C4.5 is a bias of 15.01 and a variance of 4.9; for VFDTcNB we obtain 17.2 and 2.4, respectively. We observe that while VFDTc exhibits lower variance, C4.5 exhibits lower bias. With respect to the variance, these results were expected: decision nodes in VFDTc should be much more stable than greedy decisions, and naive Bayes is known to have low variance. With respect to the bias component, these results are somewhat surprising: they indicate that sub-optimal decisions could contribute to a bias reduction⁴.

4.2 Detecting Drift


This study intends to evaluate two situations: not detecting concept changes when drift does not exist, and detecting concept changes when drift exists. We consider two dimensions of analysis: error rate and tree size. In all the experiments the learning algorithm used to construct the decision model is VFDTc. VFDTc was run with the ability to detect drift set off, and with the AC (VFDTc-AC) and EE (VFDTc-EE) methods set on. In every case, all the methods learn on the same training set and are evaluated on the same independent test set.
It is important to note that in the previous experiments both methods to deal with concept drift, EE and AC, did not detect any concept change, a strong indication that both methods are robust to false alarms.

Non-stationary Datasets For the second situation, we used three artificial data sets: SEA [28], H3, and Tree Data [15]. The artificial data set SEA consists of three attributes, of which only two are relevant: xi ∈ [0, 10], with i = 1, 2, 3. The target concept is x1 + x2 ≤ β, where β ∈ {7, 8, 9, 9.5}. The training set has four blocks: for the first block the target concept uses β = 8; for the second, β = 9; for the third, β = 7; and for the fourth, β = 9.5 (a generator in this style is sketched at the end of this subsection). The H3 data set is defined by three attributes, of which only the first two are relevant: xi ∈ [0, 10], with i = 1, 2, 3. The first concept is defined by the relation x1 > x2, while the second is defined by x1 ≤ −0.8 · x2 + 9 (Figure 8). The Tree Data generator, a tool from VFML [15], creates a synthetic data set by sampling from a randomly generated decision tree: it generates a synthetic binary random tree and then uses it to label data, which can then be used to evaluate learning algorithms. We generated two data sets with the same seed but with a different conceptSeed and then joined the two data sets, simulating a concept change. The data set has 110 attributes (10 continuous and 100 categorical). At the end of this section we evaluate VFDTc on two real-world datasets.
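For reference, a SEA-style stream with the four blocks described above can be generated with a few lines of Python. This is our simplified reconstruction, not the original generator of [28]; the function name and defaults are illustrative.

import random

def sea_stream(n_per_block=100_000, betas=(8, 9, 7, 9.5), seed=0):
    """Yield (x1, x2, x3, label) tuples for a SEA-style stream: the label is
    1 when x1 + x2 <= beta, and the threshold beta changes at each block,
    simulating abrupt concept drift."""
    rng = random.Random(seed)
    for beta in betas:
        for _ in range(n_per_block):
            x1, x2, x3 = (rng.uniform(0, 10) for _ in range(3))
            yield x1, x2, x3, int(x1 + x2 <= beta)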

Experimental Set-Up The training set for the SEA data had 400k examples; the concept change was made every 100k examples (four blocks) and the test set had 100k examples. The training set for the H3 data had 200k examples; the concept change was made at example 50k and the test set had 15k examples. The training set for the Tree Data had 50k examples; the concept change was made at example 25k and the test set had 20k examples.
⁴ This aspect deserves further research. A possible explanation is that C4.5 generates much larger trees. Nevertheless, the relation between tree size and bias is not linear [5].
Fig. 8. H3 dataset: (a) first concept; (b) second concept.

All test sets have the same concept as the last concept of the training set. The VFDTc algorithm was used with the parameter values δ = 5 × 10⁻⁶, τ = 5 × 10⁻³, and nmin = 200.
In a first set of experiments we use the H3 dataset. This is a noise-free dataset containing two sequences of examples, randomly generated from two concepts (see Figure 8). We ran VFDTc with the ability to detect and react to drift set on and off. Both methods (EE and AC) were able to detect the concept change a few examples after drift occurred. Nevertheless, setting the drift detection on increases the error on an independent test set generated from the second concept (although without statistical significance). Similar experiments with other noise-free datasets (SEA concepts) confirm this observation. A similar phenomenon has been reported in the machine learning literature when pruning a regular tree on noise-free data: pruning will never decrease the error on an independent test set, and the accuracy of larger trees is always equal to or greater than that of a simplified tree. Nevertheless, in the presence of concept drift there is a justification for pruning: the decision model itself. Figure 9 presents the projection of the decision tree over the instance space; the left figure presents the pruned tree, the right figure the unpruned tree. The unpruned tree is more accurate, but the pruned tree represents the underlying concept much better.
There is a simple explanation for this fact. Consider a leaf in a tree with a certain class distribution. After drift, that class distribution no longer corresponds to the actual concept. Any expansion of this tree will generate two new leaves, and the statistics at these leaves correspond to the new concept: they are based on the most recent examples.
Figure 10(a) shows the online error curve for the three methods. The online error is measured as follows: each time an example is processed, it is classified before the sufficient statistics are updated, and the curve plots the successive errors measured in this way. When the concept changes, the three methods show a jump in error; however, all methods allow the accuracy to recover quickly. VFDTc-AC had better results over the training stream, because the decision model continuously adapted to the new concept.
Fig. 9. Covering space for the final tree created for the H3 dataset. Tree after regularization (left) and tree without regularization (right).

In Figure 10(b), we can see that the tree grows over time and that the VFDTc-AC and VFDTc-EE methods construct less complex trees without compromising accuracy.
[Figure 10: online error rate (%) and tree size of VFDTc, VFDTc-AC, and VFDTc-EE as a function of the number of examples on the SEA dataset.]
Fig. 10. Learning curve of the online error (left) and tree size (right) on the SEA dataset.

To obtain statistical support for the results, we performed a Student's t-test to compare the averages of the results. The objective is to find out whether there are significant differences between using and not using the methods to deal with concept drift. For the two-sided test, the null hypothesis is µx = µy, where µx represents the average of the samples using the method and µy the average of the samples not using the method. A significance level of 5% (α = 0.05) was used. We reject the null hypothesis if the p-value is less than the significance level.
              Error rate (MC)                   Tree size
              Sea       H3       TreeData       Sea     H3      TreeData
VFDTc         13.49%    1.65%    42.00%         143     202     69
VFDTc-AC      11.60%    1.68%    *33.47%        180     113     18
VFDTc-EE      11.43%    1.69%    *35.52%        264     173     27
CVFDT         13.96%    3.29%    42.95%         130     123     71

              Error rate (NB)
              Sea       H3       TreeData
VFDTc         12.51%    1.51%    37.06%
VFDTc-AC      *11.91%   1.60%    *30.02%
VFDTc-EE      12.13%    1.61%    *31.7%

Table 3. Average error rates. Significant differences are indicated by an asterisk (*), denoting better results for sequential regularization. Tree size reports the average number of nodes.

Table 3 shows the average error rate using the MC and NB classification strategies for the three datasets under study.

VFDTc versus CVFDT We have experimentally compared Sequential regularization against the method of growing alternate trees used in CVFDT [16]. Results of VFDTc, VFDTc-AC, VFDTc-EE, and CVFDT are presented in Table 3. We can observe that there are significant differences between the averages of the error rate, even when VFDTc uses the majority class as the classification strategy. In this study, the two drift detection methods have better accuracy for all the datasets considered.

4.3 Discussion

We have presented two detection methods based on the same idea but using different metrics. In the set of experiments reported here, the AC method is more robust: it is more consistent in detecting drift, both for gradual and for abrupt changes, although it requires more examples to do so. The EE method has faster detection rates; in the case of abrupt concept changes, drift was detected at the upper nodes of the tree. For gradual changes, however, the EE method is almost insensitive, and in that case AC is much more appropriate.
The main advantage of the proposed method is its efficiency. Drift can be checked each time a new training example traverses the tree. This is different from other systems [22, 28, 33, 16], where changes are evaluated only from time to time.
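A minimal sketch of such a per-example check at a node is given below. It is an illustration only: the total-variation distance stands in for the AC and EE metrics of the paper, and the window size and threshold are assumed parameters, not the statistics actually used by VFDTc.

```python
from collections import Counter

class NodeDriftMonitor:
    """Keeps the class distribution observed when the node was built and a
    sliding window of the most recent class labels that traverse the node."""

    def __init__(self, build_distribution, window_size=1000, threshold=0.1):
        self.reference = Counter(build_distribution)  # distribution at node construction
        self.window = []                              # most recent labels seen at this node
        self.window_size = window_size
        self.threshold = threshold                    # assumed sensitivity parameter

    def _as_probs(self, counter):
        # Only classes known at construction time are considered in this sketch.
        total = sum(counter.values()) or 1
        return {c: counter[c] / total for c in self.reference}

    def observe(self, label):
        """Called every time a training example traverses this node."""
        self.window.append(label)
        if len(self.window) > self.window_size:
            self.window.pop(0)
        p = self._as_probs(self.reference)
        q = self._as_probs(Counter(self.window))
        # Total-variation distance as a stand-in for the AC/EE metrics.
        distance = 0.5 * sum(abs(p[c] - q[c]) for c in p)
        return distance > self.threshold              # True signals possible drift
```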
In time-varying data streams, the most common approach to detecting and reacting to drift is based on multiple models [22, 28, 33]. The method we propose takes advantage of the tree structure and of the dynamic windows directly derived from the strategy used to grow the tree. On the other hand, the method is restricted to decision trees.
                   KddCup 2000                        PMatE
            Error Rate      Tree size        Error Rate      Tree size
             MC     NB                        MC     NB
VFDTc       28.37  30.38       641           21.12  19.94       5507
VFDTc-AC    27.21  30.58       608           22.52  21.38       3184
VFDTc-EE    28.07  30.61       604           21.36  20.31       4871

Table 4. Error rates and tree size of VFDTc (with and without drift detection) on the KDDCup2000 and PMatE datasets.

4.4 Results on real data


We have carried out experiments on two real datasets: KDD Cup 2000 and PmatE. The KDD Cup 2000 competition consists of five questions; we use the aggregated data set of question 1 [21]. The original training and test sets contain 234,954 and 50,558 records, respectively. There are 296 attributes, but we removed the time and date attributes, the attributes with more than 20 different values, and the attributes with no relevant information. The final data (training and test set) contain 53 attributes (14 nominal and 39 continuous). The PmatE data set contains data extracted from the logs of a tutoring system. This data comes from a set of training games played by the users of the PmatE system [24]. Each game is structured in levels, and the objective is to predict whether the user is able to pass to the next level. The PmatE problem is defined by 16 attributes (14 nominal and 2 continuous), and the dataset contains 2,320,354 examples.
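As an illustration, attribute filtering of the kind described above could be done with pandas roughly as follows; the column-name heuristics and the target name are assumptions for the sketch, not the preprocessing script actually used for KDD Cup 2000.

```python
import pandas as pd

def filter_attributes(df: pd.DataFrame, target="label", max_distinct=20,
                      drop_like=("time", "date")):
    """Drop time/date-like columns and nominal columns with more than
    `max_distinct` different values, keeping the target column untouched."""
    keep = []
    for col in df.columns:
        if col == target:
            keep.append(col)
            continue
        if any(token in col.lower() for token in drop_like):
            continue  # time/date attributes
        if df[col].dtype == object and df[col].nunique() > max_distinct:
            continue  # nominal attributes with too many distinct values
        keep.append(col)
    return df[keep]
```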
Table 4 shows the accuracy and tree size obtained with the same training and test sets for VFDTc with and without drift detection. The VFDTc-EE method did not detect any concept change. Figure 11 plots the learning curves of error rate and tree size for VFDTc and VFDTc-AC. The size of the VFDTc trees grows faster with the number of examples (Figure 11, right), and in Figure 11 (left) we can see that the performance of VFDTc-AC is slightly better than that of VFDTc. The interest of these datasets is that they are real-world data: we do not know when drift occurs, or even whether there is drift at all.

Fig. 11. Learning curve of error-rates and tree size of VFDTcNB-AC and VFDTcNB on KddCup2000 data.

5 Conclusions
Hulten and Domingos [14] have presented a number of desirable criteria for learning systems that mine high-speed data streams efficiently. These criteria include constant time and memory to process an example, a single scan of the database, availability of a classifier at any time, performance equivalent to a standard batch algorithm, and adaptability to time-changing concepts.
In this paper, we analyze and extend VFDT, one of the most promising al-
gorithms for tree induction from data streams. VFDTc [12] extends the VFDT
system in three main dimensions. The first one is the ability to deal with nu-
merical attributes. The second one is the ability to apply naive Bayes classifiers
in tree leaves. While the former extends the domain of applicability of the algorithm to heterogeneous data, the latter reinforces the any-time characteristic, an important property for any learning algorithm for data streams. The third contribution is a method capable of detecting and reacting to concept drift. We have presented two detection methods, both of which continuously monitor the evolution of the data stream.

We should note that the extensions we propose are integrated. In the training phase, only the sufficient statistics required to compute the information gain are stored at each leaf. In the application phase, for nominal attributes, the sufficient statistics directly constitute the naive Bayes tables; for continuous attributes, the naive Bayes tables are efficiently derived from the B-trees used to store the numeric attribute values. Nevertheless, the application of naive Bayes introduces an overhead with respect to the use of the majority class, because the former requires the estimation of many more probabilities than the latter.
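For nominal attributes, the idea can be sketched as follows: the per-leaf counts n(class) and n(attribute = value, class), already maintained for the split test, double as the naive Bayes tables. The crude smoothing used here is an assumption for illustration, and continuous attributes (handled through the B-tree counts) are omitted.

```python
def naive_bayes_predict(x, class_counts, attr_value_class_counts):
    """x: dict attribute -> observed nominal value.
    class_counts: dict class -> n(class) at the leaf.
    attr_value_class_counts: dict (attribute, value, class) -> count.
    These are the same sufficient statistics kept for the split test."""
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = n_c / total                        # prior P(class)
        for attr, value in x.items():              # P(value | class) from the counts
            n_vc = attr_value_class_counts.get((attr, value, c), 0)
            score *= (n_vc + 1) / (n_c + 1)        # crude add-one smoothing (assumption)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```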
VFDTc maintains all the desirable properties of VFDT. It is an incremental algorithm: new examples can be incorporated as they arrive, it works online, it sees each example only once, and it uses a small processing time per example. The experimental evaluation of VFDTc clearly illustrates the advantages of using more powerful classification techniques. In the datasets under study, VFDTcNB exhibits performance similar to C4.5 while using much less time and memory. We evaluate VFDTc on datasets in the presence of concept change and on real-world data. The results indicate that VFDTc does not compromise accuracy and produces less complex, adapted trees. We also evaluate, analyze, and discuss the sensitivity of VFDTc to noise, the order of examples, and missing values. Sensitivity analysis is a fundamental aspect of understanding the behavior of the algorithm.
Acknowledgments
The authors are grateful to the reviewers, whose comments greatly improved the text. We thank the financial support given by FEDER, the Plurianual support attributed to LIACC, project ALES II (POSI/EIA/55340/2004), and project Retinae.

References
1. C. Blake, E. Keogh, and C. Merz. UCI repository of Machine Learning databases,
1999.
2. H. Bock and E. Diday. Analysis of symbolic data: Exploratory Methods for Ex-
tracting Statistical Information from Complex Data. Springer Verlag, 2000.
3. L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
4. J. Catlett. Megainduction: a test flight. In Machine Learning: Proceedings of the 8th International Conference. Morgan Kaufmann, 1991.
5. P. Domingos. A unified bias-variance decomposition and its applications. In P. Lan-
gley, editor, Machine Learning, Proceedings of the 17th International Conference.
Morgan Kaufmann, 2000.
6. P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge
Discovery and Data Mining, pages 71–80, 2000.
7. P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29:103–129, 1997.
8. R. Duda, P. Hart, and D. Stork. Pattern Classification. New York, Willey and
Sons, 2001.
9. F. Esposito, D. Malerba, and G. Semeraro. A comparative analysis of methods
for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(5):476–491, 1997.
10. J. Gama. An analysis of functional trees. In C. Sammut, editor, Machine Learning,
Proceedings of the 19th International Conference. Morgan Kaufmann, 2002.
11. J. Gama. Functional trees. Machine Learning, 55(3):219–250, 2004.
12. J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed data streams. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 523–528, 2003.
13. J. Gratch. Sequential inductive learning. In Proceedings of Thirteenth National
Conference on Artificial Intelligence, volume 1, pages 779–786, 1996.
14. G. Hulten and P. Domingos. Catching up with the data: research issues in min-
ing data streams. In Proc. of Workshop on Research issues in Data Mining and
Knowledge Discovery, 2001.
15. G. Hulten and P. Domingos. VFML – a toolkit for mining high-speed time-changing
data streams, 2003. http://www.cs.washington.edu/dm/vfml/.
16. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 97–106, San Francisco, CA, 2001. ACM Press.
17. D. Kalles and T. Morris. Efficient incremental induction of decision trees. Machine
Learning, 24(3):231–242, 1996.
18. R. Klinkenberg. Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, 8(3), 2004.
19. R. Klinkenberg and S. Rüping. Concept drift and the importance of examples.
Text Mining – Theoretical Aspects and Applications, pages 55–77, 2003.
20. R. Kohavi. Scaling up the accuracy of naive Bayes classifiers: a decision tree hybrid.
In Proceedings of the 2nd International Conference on Knowledge Discovery and
Data Mining. AAAI Press, 1996.
21. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000
organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
http://www.ecn.purdue.edu/KDDCUP.
22. J. Kolter and M. Maloof. Dynamic weighted majority: A new ensemble method for tracking concept drift. Technical Report CSTR-20030610-3, 2003.
23. M. Lazarescu, S. Venkatesh, and H. Bui. Using multiple windows to track concept drift. Intelligent Data Analysis Journal, 8(1):29–59, 2004.
24. PmatE. Projecto matemática ensino, 2005. http://pmate.ua.pt.
25. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
Inc., 1993.
26. J. Schlimmer and R. Granger. Beyond incremental processing: Tracking concept
drift. In Proceedings of the Fifth National Conference on Artificial Intelligence,
pages 502–507. AAAI Press, 1986.
27. A. Seewald, J. Petrak, and G. Widmer. Hybrid decision tree learners with alternative leaf classifiers: an empirical study. In Proceedings of the FLAIRS Conference. AAAI, 2001.
28. W. N. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale
classification. In Proceedings of the seventh ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, pages 377–382. ACM Press, 2001.
29. P. Utgoff. ID5: An incremental ID3. In Fifth International Conference on Machine
Learning, pages 107–120. Morgan Kaufmann Publishers, 1988.
30. P. Utgoff. Perceptron trees - a case study in hybrid concept representation. In
Proceedings of the Seventh National Conference on Artificial Intelligence. Morgan
Kaufmann, 1988.
31. P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision tree induction based on
efficient tree restructuring. Machine Learning, 29(1):5–44, 1997.
32. W. Van de Velde. Incremental induction of topologically minimal trees. In B. Porter
and R. Mooney, editors, Machine Learning, Proceedings of the 7th International
Conference. Morgan Kaufmann, 1990.
33. H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams us-
ing ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 735–740. ACM Press,
2003.
34. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden
contexts. Machine Learning, 23:69–101, 1996.
            Error Rate (%)            Learning Times (seconds)             Tree Size
                                   Training+Classification    Training
Nr. Exs   C4.5  VFDTcNB  VFDTcMC    C4.5   VFDTcNB  VFDTcMC    VFDTc      C4.5  VFDTc
Waveform dataset - 21 Attributes
10k 22.45 19.74 38.70 13.9 13.4 6.5 1.4 1059 5
20k 21.48 19.74 32.45 25.4 16.6 8.1 3.2 2061 9
30k 21.20 19.09 29.49 40.6 18.3 9.9 4.9 2849 15
40k 20.93 18.82 27.91 59.5 19.8 11.8 6.8 3773 19
50k 20.63 18.49 26.20 78.7 21.7 13.6 8.5 4269 25
75k 20.28 17.66 26.20 146.8 26.9 18.4 13.3 6283 35
100k 19.98 17.85 25.46 229.2 31.2 23.4 18.3 7901 51
150k 19.76 17.14 23.49 456.6 41.6 33.4 28.2 11815 73
200k 19.58 17.78 22.90 753.9 52.3 43.4 38.4 15071 105
300k 19.23 17.16 22.24 1552.9 73.7 64.6 59.3 21115 149
400k 18.97 16.85 21.60 2600.6 95.4 85.6 80.2 27229 191
500k 18.96 17.28 21.15 3975.1 116.5 106.9 101.5 37787 249
750k 18.65 16.92 20.57 8237.4 169.9 158.4 153.2 47179 359
1000k 18.17 17.15 20.16 14960.8 210.1 201.5 195.3 61323 495
Waveform dataset - 40 Attributes
10k 23.09 22.03 33.47 24.5 29.2 12.3 2.9 1199 7
20k 22.61 20.73 30.21 43.5 32.8 16.1 6.4 2287 11
30k 21.99 18.98 30.21 67.1 36.4 19.7 9.9 3211 15
40k 21.63 20.27 27.05 96.7 39.8 23.4 13.6 4205 23
50k 21.75 18.34 25.27 129.4 43.9 27.1 17.3 5319 25
75k 21.41 20.66 27.05 230.0 53.5 36.7 26.8 7403 43
100k 21.02 18.50 24.82 347.8 64.2 46.2 36.2 9935 51
150k 20.52 19.15 23.48 686.3 84.2 65.7 55.7 14971 79
200k 20.17 18.02 22.29 1074.9 105.2 85.0 75.0 18351 101
300k 19.82 17.57 22.04 2175.8 146.6 125.1 115.0 26823 143
400k 19.64 17.96 21.56 3605.8 186.9 164.8 154.7 33617 203
500k 19.63 17.17 21.29 5570.8 227.4 155.3 195.1 43725 243
750k 19.24 17.51 20.32 12544.5 377.8 304.9 294.6 61713 375
1000k 19.12 17.03 19.82 22402.7 431.3 408.2 397.4 80933 491
LED dataset - 24 Attributes
10k 26.76 25.94 81.90 3.2 16.4 3.9 0.2 899 3
20k 26.64 25.92 81.94 4.3 16.7 4.2 0.4 1809 7
30k 26.46 26.01 74.59 5.5 16.9 4.4 0.6 2425 11
40k 26.57 26.17 74.67 6.9 17.2 4.7 0.8 3347 17
50k 26.54 25.96 74.73 8.2 17.4 5.0 1.0 4207 19
75k 26.41 25.96 74.69 11.5 18.0 5.6 1.6 5953 31
100k 26.53 25.87 74.67 15.1 18.8 6.3 2.2 7923 39
150k 26.49 25.96 74.65 22.1 19.8 7.7 3.4 11507 61
200k 26.42 26.06 74.59 29.6 21.2 9.3 4.7 15281 81
300k 26.38 25.96 74.14 45.8 24.1 12.5 7.5 21875 123
500k 26.39 25.95 73.97 80.6 31.7 19.9 14.1 35413 207
750k 26.37 25.90 74.01 144.8 42.9 31.4 24.3 51267 301
1000k 26.35 26.05 73.98 190.5 59.9 45.3 36.8 65119 403
Table 5. Learning curves for the Waveform-21, Waveform-40, and LED datasets. Learning times are reported in seconds. In all experiments the test set consists of 250k examples.
