Decision Trees For Mining Data Streams
1 Introduction
Databases are rich in information that can be used in decision processes.
Nowadays, the majority of companies and organizations possess gigantic
databases that grow by millions of records per day. Continuous data streams
arise naturally, for example, in the network installations of large Telecommu-
nications and Internet service providers, where detailed usage information from
different parts of the network needs to be continuously collected and analyzed for
interesting trends. The usual techniques of data mining have problems in dealing
with this huge volume of data. In traditional applications of data mining, the
volume of data is the main obstacle to the use of memory-based techniques, due
to the restrictions in computational resources: memory, time, or space on hard
disk. Therefore, in the majority of these systems the use of all available data
becomes impossible and can result in underfitting. Constructing KDD systems
that use this entire amount of data while keeping the accuracy of the traditional
systems becomes problematic.
Decision trees, due to their characteristics, are one of the most used techniques
for data mining. Decision tree models are non-parametric, distribution-free, and
robust to the presence of outliers and irrelevant attributes. Tree models have a high
degree of interpretability: global and complex decisions can be approximated
by a series of simpler and local decisions. Univariate trees are invariant under
all (strictly) monotone transformations of the individual input variables. The usual
algorithms that construct decision trees from data use a divide-and-conquer
strategy: a complex problem is divided into simpler problems, and the same
strategy is applied recursively to the sub-problems. The solutions of the sub-problems
are combined in the form of a tree to yield the solution of the complex problem.
Formally, a decision tree is a directed acyclic graph in which each node is either a
decision node, with two or more successors, or a leaf node. A decision node contains
a condition based on attribute values. A leaf node is labeled with a constant;
the standard algorithm to construct a decision tree installs at each leaf the
constant that minimizes a given loss function. In the classification setting, the
constant that minimizes the 0-1 loss function is the mode of the classes of the
examples that reach that leaf. Nevertheless, several
authors have studied the use of other functions at tree leaves [20, 10].
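For concreteness, the leaf constant can be written as the minimizer of the empirical loss over the training examples reaching the leaf; under the 0-1 loss this is the class mode (under the squared loss, in regression, it would be the mean of the target values):

    c(\ell) = \arg\min_{c} \sum_{(x,y)\in\ell} L(y,c), \qquad
    L_{0/1}(y,c) = 1[y \neq c] \;\Rightarrow\; c(\ell) = \mathrm{mode}\{\, y : (x,y)\in\ell \,\}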
Data streams are, by definition, problems where the training examples used
to construct decision models arrive over time, usually one at a time. A natural ap-
proach for this task is incremental learning algorithms. A desirable property
of any such algorithm is the ability to change a decision model in order to
incorporate new information in an incremental way. In the field of incremental
tree induction, we can distinguish two main research lines. In the first one, a
tree is constructed using a greedy search, and the incorporation of new information
involves re-structuring the actual tree [29, 31, 17]. This is done using operators
that pull up or push down decision nodes, as in systems like
ID5 [29], ITI [31], or ID5R [17]. The second research line does not use the greedy
search of standard tree induction. It maintains at each decision node a set of
sufficient statistics and only makes a decision (installs a split-test at that node)
when there is enough statistical evidence in favor of a particular split test [13].
A notable example is the Very Fast Decision Tree (VFDT) system [6]. VFDT is
an online decision tree algorithm, designed to maintain a decision tree over a data stream.
It assumes a continuous flow of training examples that can be mixed with
unlabeled examples. VFDT continuously trains on the labeled data and can
classify unlabeled data at any time.
In this paper we propose the VFDTc system, which incorporates three main
extensions to the VFDT system. The first one is the ability to deal with numerical
attributes. The second one is the ability to apply naive Bayes classifiers in tree
leaves. The third extension is the ability to detect and react to the presence
of drift in a time-evolving data stream. In such cases the decision model is
updated to reflect the most recent class distribution of the examples. We show
that incremental tree induction methods that wait for enough statistical support
before installing a split-test benefit greatly from using more appropriate classification
strategies at tree leaves [12]. The paper is organized as follows. The next section
describes VFDT and other related work that is the basis for our work. In
Section 3, we present our extensions to VFDT leading to the VFDTc system; we
detail the major options that we implemented and the differences with respect to VFDT
and other available systems. The system has been implemented and evaluated.
Experimental evaluation and sensitivity analysis are presented in Section 4. The last section
concludes the paper, summarizing the main contributions of this work.
2 Related Work
In this section we analyze related work along three dimensions. One dimension
concerns the use of more powerful classification strategies at tree leaves, another
concerns methods for incremental tree induction, and the last concerns methods
to react to drift.
where P_ik = n_ijk / (Σ_a n_ajk) is the probability of observing value i of attribute j
given class k, and P_i = (Σ_b n_ijb) / (Σ_a Σ_b n_ajb) is the probability of observing
value i of attribute j.
1. The original description of VFDT is general enough for other evaluation functions
(e.g., Gini). Without loss of generality, we restrict ourselves here to the information gain.
The main innovation of the VFDT system is the use of the Hoeffding bound
to decide how many examples are necessary to observe before installing a split-test
at a given leaf. Suppose we have made n independent observations of a random
variable r whose range is R. The Hoeffding bound states that, with probability 1 − δ,
the true average of r, r̄, is in the range r̂ ± ε, where

    ε = √( R² ln(1/δ) / (2n) )

Let H be the evaluation function of an attribute. For the information gain, the range R of
H is log2(#classes). Let xa be the attribute with the highest H, xb the attribute
with the second-highest H, and ∆H = H(xa) − H(xb) the difference between the
two best attributes. Then, if ∆H > ε with n examples observed in the leaf, the
Hoeffding bound states with probability 1 − δ that xa is really the attribute with the
highest value of the evaluation function. In this case the leaf must be transformed
into a decision node that splits on xa.
It turns out that it is not efficient to compute H every time one example
arrives; moreover, it is highly improbable that ∆H becomes greater than ε with
the arrival of a single new example. For this reason, a user-defined constant
nmin is used: the number of examples that must reach a leaf before
H is recomputed. Contrary to [17], the entropy difference is used to determine how
many examples will be necessary until it is verified that ∆H > ε.
When two or more attributes continuously have very similar values of H,
even given a large number of examples, the Hoeffding bound will not decide
between them. To solve this problem VFDT uses a user-defined constant τ
for tie-breaking: if ∆H < ε < τ, then the leaf is transformed into a
decision node, and the split-test is based on the best attribute.
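As an illustration of the two rules above (∆H > ε, and the tie-break ε < τ), the following sketch computes the Hoeffding bound and the split decision. It is a minimal sketch with our own function names, not the actual VFDT code:

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt( R^2 * ln(1/delta) / (2n) )
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(h_best, h_second, n, n_classes, delta=1e-6, tau=0.05):
    # Decide whether a leaf that has seen n examples should become a decision node.
    R = math.log2(n_classes)            # range of the information gain
    eps = hoeffding_bound(R, delta, n)
    delta_h = h_best - h_second
    if delta_h > eps:                   # best attribute wins with probability 1 - delta
        return True
    return eps < tau                    # tie: attributes too similar, split on the best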
Another interesting characteristic of VFDT is the ability to deactivate the less
promising leaves when the maximum available memory is reached.
A leaf can always be reactivated later if it turns out to be more promising
than the leaves that are active. VFDT can also be initialized with a tree produced
by a conventional algorithm.
In batch decision tree learners, the handling of numerical attributes requires a
sort operation, which is the most time-consuming operation [4]. In this section we
provide an efficient method to deal with numerical attributes in the context of
online decision tree learning.
In VFDTc, a decision node that contains a split-test based on a continuous
attribute has two descendant branches. The split-test is a condition of the form
attr_i ≤ cut_point. The descendant branches correspond to the values TRUE
and FALSE of the split-test. The cut point is chosen from all the values of that
attribute observed so far. In order to evaluate the goodness of a split,
we need to compute the class distributions of the examples for which the attribute
value is less than or equal to, and greater than, the cut point.
The counts n_ijk are fundamental for computing all the necessary statistics. They
are kept with the following data structure: in each leaf of the decision
tree we maintain a vector with the class distribution of the examples that reach the
leaf, and, for each continuous attribute j, the system maintains a binary tree structure.
A node in the binary tree is identified by a value i (a value of the
attribute j seen in an example) and two vectors, VE and VH (of dimension k),
used to count the values per class that cross that node. The vectors VE and VH
contain the counts of values respectively ≤ i and > i for the examples labeled
with class k. When an example reaches a leaf, all the binary trees are updated.
Figure 1 presents the algorithm to insert a value into the binary tree. Insertion of
a new value in this structure is O(log n) in the best case and O(n) in the worst
case, where n represents the number of distinct values of the attribute seen so
far.
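A minimal sketch of such a node and of the insertion step is given below; the names are ours and only illustrate the structure described above (the actual insertion algorithm is the one referred to as Figure 1):

class BTreeNode:
    def __init__(self, value, n_classes):
        self.value = value            # attribute value i stored at this node
        self.VE = [0] * n_classes     # per-class counts of values <= i
        self.VH = [0] * n_classes     # per-class counts of values  > i
        self.left = None
        self.right = None

def insert(node, value, klass, n_classes):
    # Insert one observed (attribute value, class) pair; returns the (possibly new) root.
    if node is None:
        node = BTreeNode(value, n_classes)
        node.VE[klass] += 1
        return node
    if value <= node.value:
        node.VE[klass] += 1
        if value != node.value:
            node.left = insert(node.left, value, klass, n_classes)
    else:
        node.VH[klass] += 1
        node.right = insert(node.right, value, klass, n_classes)
    return node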
To obtain the information gain of a given attribute we use an exhaustive
method to evaluate the merit of all possible cut points. In our case, any value
observed in the examples so far can be used as cut point. For each possible
cut point, we compute the information of the two partitions using equation 3:

    info(Aj(i)) = P(Aj ≤ i) × iLow(Aj(i)) + P(Aj > i) × iHigh(Aj(i))        (3)

    iLow(Aj(i)) = − Σ_K P(K = k | Aj ≤ i) × log2( P(K = k | Aj ≤ i) )        (4)

    iHigh(Aj(i)) = − Σ_K P(K = k | Aj > i) × log2( P(K = k | Aj > i) )       (5)

Fig. 2. Algorithm to compute #(Aj ≤ z) for a given attribute j and class k:
Procedure LessThan(z, k, Btree)
Begin
  if (Btree == NULL) return 0.
  if (Btree.i == z) return Btree.VE[k].
  if (Btree.i < z) return Btree.VE[k] + LessThan(z, k, Btree.Right).
  if (Btree.i > z) return LessThan(z, k, Btree.Left).
End.
These statistics are easily computed using the counts n_ijk and the algorithm
presented in figure 2. For each attribute, it is possible to compute the merit
of all possible cut points by traversing the binary tree once. A split on a
numerical attribute is binary: the examples are divided into two subsets,
one representing the TRUE value of the split-test installed at the decision node and
the other the FALSE value. VFDTc only considers a possible cut point
if the number of examples in each of the two subsets is higher than pmin percent
of the total number of examples seen at the node.
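Assuming a LessThan-style query as in figure 2, the evaluation of candidate cut points (equations 3-5) can be sketched as follows; the names are illustrative, and this simple version queries the tree per candidate instead of exploiting the single-traversal computation mentioned above:

import math

def partition_info(counts_le, counts_gt):
    # Weighted entropy of the two partitions induced by a cut point (equation 3).
    def entropy(counts):
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    n_le, n_gt = sum(counts_le), sum(counts_gt)
    n = n_le + n_gt
    return (n_le / n) * entropy(counts_le) + (n_gt / n) * entropy(counts_gt)

def best_cut_point(candidates, class_totals, less_than, p_min=0.05):
    # less_than(i, k) returns #(A_j <= i and class == k) for the attribute at hand.
    n_total = sum(class_totals)
    best_value, best_info = None, float("inf")
    for i in candidates:
        counts_le = [less_than(i, k) for k in range(len(class_totals))]
        counts_gt = [class_totals[k] - counts_le[k] for k in range(len(class_totals))]
        # discard cut points leaving less than p_min of the examples on either side
        if min(sum(counts_le), sum(counts_gt)) < p_min * n_total:
            continue
        info = partition_info(counts_le, counts_gt)
        if info < best_info:            # lower partition entropy means higher gain
            best_value, best_info = i, info
    return best_value, best_info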
Functional tree leaves To classify a test example, the example traverses the
tree from the root to a leaf and is classified with the most representative class
of the training examples that fall at that leaf. One of the innovations
of our algorithm is the ability to use naive Bayes [8] classifiers at tree leaves.
That is, a test example is classified with the class that maximizes the posterior
probability given by the Bayes rule, assuming the independence of the attributes
given the class. There is a simple motivation for this option. VFDT only changes
a leaf into a decision node when there is a sufficient number of examples to sup-
port the change; usually hundreds or even thousands of examples are required.
To classify a test example, the majority-class strategy only uses the information
about the class distributions and does not look at the attribute values. It uses only
a small part of the available information, a crude approximation of the distribu-
tion of the examples. On the other hand, naive Bayes takes into account not only
the prior distribution of the classes, but also the conditional probabilities of the
attribute values given the class. In this way, there is a much better exploitation
of the available information at each leaf.

2. Where pmin is a user-defined constant.
Given the example e = (x1, . . . , xj) and applying the Bayes theorem, we obtain:
P(Ck | e) ∝ P(Ck) × Π_j P(xj | Ck).
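A minimal sketch of this classification rule at a leaf (computed in log-space to avoid numeric underflow; the probability tables are assumed to be available, and their names are ours):

import math

def naive_bayes_classify(example, class_prior, cond_prob):
    # example: dict attribute -> value; class_prior: dict class -> P(Ck);
    # cond_prob[(attribute, value, klass)]: estimate of P(xj = value | Ck).
    best_class, best_score = None, float("-inf")
    for klass, prior in class_prior.items():
        score = math.log(prior)
        for attribute, value in example.items():
            # unseen (value, class) pairs get a small smoothing probability
            score += math.log(cond_prob.get((attribute, value, klass), 1e-6))
        if score > best_score:
            best_class, best_score = klass, score
    return best_class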
Several reasons justify the choice of naive Bayes. The first is that naive Bayes
is naturally incremental. It deals with heterogeneous data and missing values.
Moreover, it has been observed [7] that naive Bayes is a very competitive algo-
rithm for small datasets.
To compute the conditional probabilities P(xj | Ck) we should distinguish be-
tween nominal and continuous attributes. In the former case the problem is trivial,
using the sufficient statistics already kept to compute the information gain. In the latter case,
there are two usual approaches: either assuming that each attribute follows a nor-
mal distribution, or discretizing the attributes. Although the normal-distribution
assumption allows the sufficient statistics to be computed on the fly, some authors [7]
have pointed out that this is a strong assumption that can increase the error rate.
As an alternative, it is possible to compute the required statistics from the binary-
tree structure stored at each leaf. This is the method implemented in VFDTc.
Any numerical attribute is discretized into min(10, nr. of different values)
intervals [7]. To count the number of examples per class that fall in each interval
we use the algorithm described in figure 3. This algorithm is computed only
once in each leaf for each discretization bin. Those counts are used to estimate
P (xj |Ck ).
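Assuming the LessThan query of figure 2, those per-interval counts can be obtained from differences of cumulative counts, as the following sketch (with illustrative names) shows:

def interval_class_counts(boundaries, n_classes, less_than):
    # boundaries: sorted upper limits of the discretization bins;
    # less_than(z, k) returns #(A_j <= z and class == k) from the Btree.
    counts, previous = [], [0] * n_classes
    for z in boundaries:
        cumulative = [less_than(z, k) for k in range(n_classes)]
        counts.append([cumulative[k] - previous[k] for k in range(n_classes)])
        previous = cumulative
    return counts   # counts[b][k] = number of class-k examples falling in bin b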
We should note that the use of naive Bayes classifiers at tree leaves does not
introduce any overhead in the training phase. In the application phase, and for
nominal attributes, the sufficient statistics constitute (directly) the naive Bayes
tables. For continuous attributes, the naive Bayes contingency tables are effi-
ciently derived from the Btrees used to store the numeric attribute values. The
overhead introduced is proportional to the depth of the Btree, which is of the order of log(n)
for a balanced tree, where n is the number of different values observed for a given attribute.
Conditions for applying naive Bayes. When a decision tree is very deep, the
number of examples that reach a leaf is small. In these cases, the use of a
more conservative strategy, like classification using the majority class, is more
appropriate. We therefore established two simple heuristics to trigger the use of naive
Bayes. First, we only use naive Bayes at a leaf when the number of examples in
the leaf is greater than 100. Second, when applying a naive Bayes classifier,
we only consider a class k if the number of examples labeled with that class is
greater than a user-defined constant.
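The two heuristics can be combined into a simple gate at classification time. The sketch below uses the 100-example threshold from the text and a per-class minimum left as a user-defined constant (a related footnote reports 2 examples as the best value found); it reuses the naive_bayes_classify sketch given earlier, and the leaf interface is a hypothetical one of our own:

def classify_at_leaf(leaf, example, min_examples=100, min_class_count=2):
    # Fall back to the majority class when the leaf has little support.
    if leaf.n_examples <= min_examples:
        return leaf.majority_class()
    # Only consider classes with at least min_class_count examples at this leaf.
    eligible = {k: p for k, p in leaf.class_prior().items()
                if leaf.class_count(k) >= min_class_count}
    if not eligible:
        return leaf.majority_class()
    return naive_bayes_classify(example, eligible, leaf.cond_prob_table())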
In VFDTc, a leaf must see 200 examples before the evaluation function is computed
for the first time. From then on, the value of nmin is computed automat-
ically. After a tie we can determine ∆H = H(xa) − H(xb) from the results of
the first computation. Let NrExs be the total number of examples in the leaf;
we can compute the contribution, C_ex, of each example to ∆H:

    C_ex = ∆H / NrExs        (6)

To transform a leaf into a decision node, we must check that ∆H > ε. If each
example contributes C_ex, then at least

    nmin = (ε − ∆H) / C_ex        (7)

examples will be needed to make ∆H > ε.
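This automatic setting of nmin follows directly from equations 6 and 7; a sketch (with our own names):

import math

def estimate_n_min(delta_h, epsilon, n_examples):
    # Estimate how many additional examples are needed before delta_h can exceed epsilon,
    # assuming each example contributes delta_h / n_examples (equations 6 and 7).
    c_ex = delta_h / n_examples                              # equation 6
    if c_ex <= 0:
        return n_examples                                    # degenerate case: re-check later
    return max(1, math.ceil((epsilon - delta_h) / c_ex))     # equation 7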
Missing Values In each leaf we store the sufficient statistics to compute the
splitting evaluation function and the naive Bayes probabilities. We use these suf-
ficient statistics to handle missing attribute values. When a leaf is expanded
and becomes a decision node, all the sufficient statistics are released, except those
needed to deal with missing values: we store the mean and the mode for
continuous and nominal attributes, respectively.
3. The best value found in some experiments for this minimum number was 2 examples.
Drift Detection based on the Affinity Coefficient (AC) The basic idea of this
method is to directly compare two distributions at a decision node i. One
distribution (Qi) is the class distribution observed before the node expansion,
that is, before the moment when the leaf became a decision node. The other
distribution (Si) is the sum of the class distributions over all the leaves of
the sub-tree rooted at that node. If the two distributions are significantly different,
this is interpreted as a change of the target concept. These distributions are
updated at each node every time an example is processed by the algorithm.
The comparison between distributions is done using the Affinity Coefficient [2]:
    ac(S, Q) = Σ_{class=1..p} √( S_class × Q_class )

where S_class and Q_class indicate the frequency of each class in the distributions
S and Q, respectively, and p is the number of classes in the data set. AC(S, Q)
takes values in the range [0, 1]. If this value is near 1, the two distributions are
similar; otherwise the two distributions are different, and a signal of drift occurs.
This comparison is only initiated after the number of examples in the decision node
exceeds a minimum number of examples (nmin_Exp). The chosen value was 300
examples, the best value found in several experiments. In addition, it is necessary to
define a threshold on the value of AC(S, Q) to decide when the distributions are
different.
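A sketch of the resulting detection test (the threshold value below is an assumption for illustration, not a value prescribed by the paper):

import math

def affinity_coefficient(s_counts, q_counts):
    # Affinity Coefficient between two class distributions given as count vectors.
    s_total, q_total = sum(s_counts), sum(q_counts)
    if s_total == 0 or q_total == 0:
        return 1.0
    return sum(math.sqrt((s / s_total) * (q / q_total))
               for s, q in zip(s_counts, q_counts))

def drift_signal(s_counts, q_counts, n_examples, n_min_exp=300, threshold=0.95):
    # Signal drift only after enough examples and when the distributions diverge.
    if n_examples < n_min_exp:
        return False
    return affinity_coefficient(s_counts, q_counts) < threshold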
Reacting to Drift After detecting a concept change, the system should update
the decision model, taking into account that the most recent information carries
the useful information about the current concept. The latest examples incorporated
into the tree are stored in its leaves. Therefore, supposing that the change of concept
was detected at node i, the reaction method pushes up all the information of the
descendant leaves to node i, namely the sufficient statistics and the class distributions.
The decision node becomes a leaf and the sub-tree rooted at it is removed. This is a
forgetting mechanism that removes the information that is no longer correct. The
construction of the sufficient statistics stored at the new leaf i uses an algorithm that
joins, for each attribute j (continuous or nominal), the corresponding binary
trees stored in all the leaves of that sub-tree into the structure of the new leaf i.
Pushing up the sufficient statistics and class distributions of the leaves to the
node allows a faster future expansion of that node, making the adaptation to the
new concept faster.
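A sketch of the push-up of the numeric-attribute statistics, reusing the BTreeNode/insert sketch given earlier (this naive version re-inserts every counted value and is not optimized; the names are ours):

def merge_btrees(target, source, n_classes):
    # Fold every (value, per-class count) pair stored in `source` into `target`.
    if source is None:
        return target
    for k in range(n_classes):
        # examples whose value equals source.value: those counted in VE at this node,
        # minus all class-k examples that were routed into the left subtree
        in_left = (source.left.VE[k] + source.left.VH[k]) if source.left else 0
        for _ in range(source.VE[k] - in_left):
            target = insert(target, source.value, k, n_classes)
    target = merge_btrees(target, source.left, n_classes)
    return merge_btrees(target, source.right, n_classes)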
Fig. 5. Pushing-up the sufficient statistics from leaves.

4 Experimental Evaluation

In this section we empirically evaluate VFDTc. We consider three dimensions
of analysis: error rate, learning time, and tree size. The goals of this experimen-
tal study are twofold: first, to provide evidence that the use of functional leaves
improves the performance of VFDT and, most importantly, improves its any-
time characteristic; second, to discuss the efficiency of the proposed drift detection
method.
We should note that VFDTc, like VFDT, was designed for fast induction of
decision models from large data streams using a single scan of the input data.
Practical applications of these systems assume that the algorithm runs
online: the data stream contains labeled examples, used to train the model,
and unlabeled examples, which are classified by the current model. Nevertheless,
in the methodology used in this work to evaluate VFDTc, we follow the standard
approach of independent training and test sets.
In the first set of experiments we analyze the behavior of VFDTc on stationary
datasets. We study the effects of two different strategies when classifying test
examples: classifying using the majority class (VFDTcMC) and classifying using
naive Bayes (VFDTcNB) at the leaves. In the second set of experiments we use
non-stationary datasets and study the efficiency of the drift detection methods.
In the last set of experiments we analyze the behavior of VFDTc on two real-world
datasets.
The experimental work has been done using the Waveform and LED datasets,
two well-known artificial datasets. We have used the two versions of the
Waveform dataset available at the UCI repository [1]. Both versions are prob-
lems with three classes. The first version is defined by 21 numerical attributes;
the second one contains 40 attributes. It is known that the optimal Bayes error
is 14%. The LED problem has 24 binary attributes (17 are irrelevant) and 10
classes; the optimal Bayes error is 26%. The choice of these datasets was mo-
tivated by the existence of dataset generators at the UCI repository: we could
generate datasets with any number of examples and produce the learning curves
needed to evaluate our claim about the any-time property of VFDTc.
For the Waveform and LED datasets, we generated training sets of varying
size, from 10k up to 1000k examples. The test set contains 250k examples.
The VFDTc algorithm was used with the parameter values δ = 5 × 10^-6, τ =
5 × 10^-3 and nmin = 200. All algorithms were run on a Pentium IV at 1 GHz with
256 MB of RAM, using Linux RedHat 7.1. Detailed results are presented in
Table 5.
(Figure: error-rate learning curves of C4.5, VFDTcMC, and VFDTcNB as a function of the number of training examples.)
Table 1. Ratio between the error rates of VFDTcNB and C4.5. The last column shows
the P-values of the Wilcoxon test.
As pointed out by the authors of VFDT [6], the system was designed to induce
accurate models from large data streams using a single scan of the input data.
The motivation for these experiments is the comparison of the relative perfor-
mances of an online learning algorithm with a standard batch learner. We would
expect a faster convergence rate of VFDTcNB in comparison to VFDTcMC.
On the LED dataset the performance of both systems is quite similar for
all sizes of the training set. On both Waveform datasets VFDTcNB out-
performs C4.5. Table 1 shows the mean of the ratios of the error rates
(VFDTcNB / C4.5) and the p-value of the Wilcoxon test for the three datasets
under study.
Tree size and Learning Times. In this work, we measure the size of tree models
as the number of decision nodes plus the number of leaves. We should note that
VFDTcNB and VFDTcMC generate exactly the same tree model.
In all experiments we have done, VFDTc generates decision models that are,
at least, one order of magnitude smaller than those generated by C4.5. Figure
7(a) plots the concept size as a function of the number of training examples.
Fig. 7. Tree size (left) in logarithm scale and learning times (right) of C4.5 and
VFDTcNB as a function of the number of training examples. For learning times, the
figure plots the learning time of VFDTc and C4.5. The other two lines refer to learning
time plus classification time of VFDTcMC and VFDTcNB, illustrating the overhead
introduced by using functional leaves.
Table 2. Experimental study comparing nmin dynamic setting with static setting.
The size of the C4.5 trees grows much faster with the number of examples, just as
one would expect.
In another dimension, we measured the time needed by each algorithm to
generate a decision model. The analysis done in a previous section, comparing
VFDTcNB versus VFDTcMC, also applies to the comparison of VFDTcNB
versus C4.5. VFDTcNB is very fast in the training phase: it scans the entire
training set once, and the time needed to process each example is negligible.
In the application phase there is an overhead due to the use of naive Bayes at
leaves. In Figure 7(b) we plot the learning plus classification time as a function
of the number of training examples. For small datasets (less than 100k examples)
the overhead introduced in the application phase is the most important factor.
Computing nmin We used three datasets in our experiments: Waveform-21, LED,
and the Forest Covertype dataset. For each dataset we constructed two decision trees,
one computing nmin dynamically and the other with nmin fixed by the user (the default
is 200 examples). We measured the number of times the evaluation function was
computed, the number of nodes in the decision tree, the execution time, and the
error rate. The number of times the evaluation function is calculated and the execution
time are both reduced with the automatic nmin, without compromising the error rate.
Sensitivity to the Order of the Examples. Several authors report that decision
models built using incremental algorithms are very dependent on the order of the
examples. We have analyzed this effect in VFDTc, conducting an experiment
with the Waveform dataset. The test set has 250k examples and the training set
750k examples. We shuffled the training set ten times and measured the error rate,
the learning time, and the attribute chosen at the root of the decision tree produced
by VFDTc. An analysis of the results shows that the variance of the error
rate is 0.05, the learning time is not affected, and the attribute chosen at the
root was either Att6 or Att14. These are the most informative attributes, with similar
values of the evaluation function.
Experimental Set-Up The training set for the Sea data had 400k examples, with a
concept change every 100k examples (four blocks), and the test set had 100k
examples. The training set for the H3 data had 200k examples, with the concept
change at example 50k, and the test set had 15k examples. The training set for
the Tree data had 50k examples, with the concept change at example 25k, and
the test set had 20k examples. All test sets have the same concept as the last
concept of the training set. The VFDTc algorithm was used with the parameter
values δ = 5 × 10^-6, τ = 5 × 10^-3 and nmin = 200.

4. This aspect deserves further research. A possible explanation is that C4.5 generates
much larger trees. Nevertheless, the relation between tree size and bias is not linear [5].

Fig. 8. H3 dataset: (a) First Concept (b) Second Concept
In the first set of experiments we use the H3 dataset, a noise-free dataset.
The dataset contains two sequences of examples, randomly generated from two
concepts (see Figure 8). We ran VFDTc with the ability to detect and
react to drift switched on and off. Both methods (EE and AC) to deal with drift were able
to detect the concept change a few examples after the drift occurs. Nevertheless, switching
on the drift detection increases the error on an independent test set generated from the
second concept (although without statistical significance). Similar experiments
with other noise-free datasets (SEA concepts) confirm this observation. A similar
phenomenon has been reported in the machine learning literature when pruning
a regular tree on noise-free data: it is known that pruning will never decrease the
error on an independent test set, since the accuracy of the larger tree is always equal
to or greater than that of a simplified tree. Nevertheless, in the presence of concept drift,
there is a justification to prune: the quality of the decision model as a representation of
the underlying concept. Figure 9 presents the projection of the decision tree over the
instance space; the left figure presents the pruned tree, the right figure the unpruned tree.
The unpruned tree is more accurate, but the pruned tree represents the underlying
concept much better.
There is a simple explanation for this fact. Consider a leaf in a tree with a
certain class distribution. After the drift, this class distribution no longer corresponds
to the actual concept. Any expansion of this tree will generate two
new leaves, and the statistics at these leaves correspond to the new concept: they
are based on the most recent examples.
Fig. 9. Covering space for the final tree created for the H3 dataset. Tree after regular-
ization (left) and tree without regularization (right).

Figure 10 (a) shows the online error curve for the methods. The online error
is measured as follows: each time an example is processed, it is classified before
updating the sufficient statistics, and the curve plots the successive errors measured
in this way. When the concept changes, the three methods show a jump in error.
However, all methods allow the accuracy to recover quickly. VFDTc-AC had
better results over the training stream, because the decision model continuously adapted
to the new concept. In Figure 10 (b), we can see that the tree keeps growing over
time, while the methods VFDTc-AC and VFDTc-EE construct less complex trees
without compromising accuracy.
Fig. 10. Learning curve of online-error (left) and tree size (right) on Sea dataset.
level of significance. Table 3 shows the average error rate using the MC
and NB classification strategies for the three datasets under study.
4.3 Discussion
We have presented two detection methods based on the same idea but using
different metrics. In the set of experiments reported here, the AC method is
more robust: it is more consistent in detecting drift, both for gradual and for abrupt
changes, although it requires more examples to detect a change. The EE method has
faster detection rates; in the case of abrupt concept changes, they were detected in
upper nodes of the tree. For gradual changes the EE method is almost insensitive,
and in that case AC is much more appropriate.
The main advantage of the proposed method is its efficiency. The method
can detect drift each time a new training example traverses the tree. This is
different from other systems [22, 28, 33, 16], where changes are evaluated only from
time to time.
In time-varying data streams, the most common method used to detect and
react to drift is based on multiple-models [22, 28, 33]. The method we propose
takes advantage of the tree structure and the dynamic windows directly derived
from the strategy used to grow the tree. The method is restricted to decision
trees.
                 KddCup 2000                      PMatE
            Error Rate        Tree size      Error Rate        Tree size
            MC      NB                       MC      NB
VFDTc       28.37   30.38     641            21.12   19.94     5507
VFDTc-AC    27.21   30.58     608            22.52   21.38     3184
VFDTc-EE    28.07   30.61     604            21.36   20.31     4871
Table 4. Error-rates and tree size of VFDTc (with and without drift detection) on
KDDCup2000 and PMatE datasets.
5 Conclusions
Hulten and Domingos [14] have presented a number of desirable criteria for
learning systems that efficiently mine high-speed data streams. These criteria
include constant time and memory to process an example, a single scan of the
database, availability of a classifier at any time, performance equivalent to a
standard batch algorithm, and adaptability to time-changing concepts.
In this paper, we analyze and extend VFDT, one of the most promising al-
gorithms for tree induction from data streams. VFDTc [12] extends the VFDT
system in three main dimensions. The first one is the ability to deal with nu-
merical attributes. The second one is the ability to apply naive Bayes classifiers
in tree leaves. While the former extends the domain of applicability of the al-
gorithm to heterogeneous data, the latter reinforces the any-time characteristic,
an important property for any learning algorithm for data streams. The third
contribution is a method capable of detecting and reacting to concept drift. We have
presented two detection methods; both are able to continuously monitor the evolution
of the data stream.

Fig. 11. Learning curve of error-rates and tree size of VFDTcNB-AC and VFDTcNB
on KddCup2000 data.
We should note that the extensions we propose are integrated. In the training
phase, only the sufficient statistics required to compute the information gain are
stored at each leaf. In the application phase and for nominal attributes, the
sufficient statistics constitute (directly) the naive Bayes tables. For continuous
attributes, naive Bayes tables are efficiently derived from the Btree’s used to
store the numeric attribute-values. Nevertheless, the application of naive Bayes
introduces an overhead with respect to the use of the majority class, because
the former requires the estimation of many more probabilities than the latter.
VFDTc maintains all the desirable properties of VFDT: it is an incremental
algorithm, new examples can be incorporated as they arrive, it works online,
it sees each example only once, and it uses a small processing time per example. The
experimental evaluation of VFDTc clearly illustrates the advantages of using
more powerful classification techniques. On the datasets under study, VFDTcNB
exhibits performance similar to C4.5 while using much less time and memory.
We evaluated VFDTc on datasets in the presence of concept change and on
real-world data; the results indicate that VFDTc does not compromise accuracy
and produces less complex, adapted trees. We also evaluated, analyzed, and discussed
the sensitivity of VFDTc to noise, the order of the examples, and missing values. Sensitivity
analysis is a fundamental aspect of understanding the behavior of the algorithm.
Acknowledgments
The authors are grateful to the reviewers, whose comments greatly improved the
text. Thanks are also due for the financial support given by FEDER, the Pluri-
anual support attributed to LIACC, project ALES II (POSI/EIA/55340/2004),
and project Retinae.
References
1. C. Blake, E. Keogh, and C. Merz. UCI repository of Machine Learning databases,
1999.
2. H. Bock and E. Diday. Analysis of symbolic data: Exploratory Methods for Ex-
tracting Statistical Information from Complex Data. Springer Verlag, 2000.
3. L. Breiman. Arcing classifiers. The Annals of Statistics, 26(3):801–849, 1998.
4. J. Catlett. Megainduction: a test flight. In Machine Learning: Proceedings of the
8th International Conference. Morgan Kaufmann, 1991.
5. P. Domingos. A unified bias-variance decomposition and its applications. In P. Lan-
gley, editor, Machine Learning, Proceedings of the 17th International Conference.
Morgan Kaufmann, 2000.
6. P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge
Discovery and Data Mining, pages 71–80, 2000.
7. P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29:103–129, 1997.
8. R. Duda, P. Hart, and D. Stork. Pattern Classification. New York, Willey and
Sons, 2001.
9. F. Esposito, D. Malerba, and G. Semeraro. A comparative analysis of methods
for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 19(5):476–491, 1997.
10. J. Gama. An analysis of functional trees. In C. Sammut, editor, Machine Learning,
Proceedings of the 19th International Conference. Morgan Kaufmann, 2002.
11. J. Gama. Functional trees. Machine Learning, 55(3):219–250, 2004.
12. J. Gama, R. Rocha, and P. Medas. Accurate decision trees for mining high-speed
data streams. In Proceedings of the Ninth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 523–528, 2003.
13. J. Gratch. Sequential inductive learning. In Proceedings of Thirteenth National
Conference on Artificial Intelligence, volume 1, pages 779–786, 1996.
14. G. Hulten and P. Domingos. Catching up with the data: research issues in min-
ing data streams. In Proc. of Workshop on Research issues in Data Mining and
Knowledge Discovery, 2001.
15. G. Hulten and P. Domingos. VFML – a toolkit for mining high-speed time-changing
data streams, 2003. http://www.cs.washington.edu/dm/vfml/.
16. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In
Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 97–106, San Francisco, CA, 2001. ACM Press.
17. D. Kalles and T. Morris. Efficient incremental induction of decision trees. Machine
Learning, 24(3):231–242, 1996.
18. R. Klinkenberg. Learning drifting concepts: Example selection vs. example weight-
ing. In Intelligent Data Analysis (IDA), Special Issue on Incremental Learning
Systems Capable of Dealing with Concept Drift, 8(3), 2004.
19. R. Klinkenberg and S. Rüping. Concept drift and the importance of examples.
Text Mining – Theoretical Aspects and Applications, pages 55–77, 2003.
20. R. Kohavi. Scaling up the accuracy of naive Bayes classifiers: a decision tree hybrid.
In Proceedings of the 2nd International Conference on Knowledge Discovery and
Data Mining. AAAI Press, 1996.
21. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000
organizers’ report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
http://www.ecn.purdue.edu/KDDCUP.
22. J. Kolter and M. Maloof. Dynamic weighted majority: A new ensemble method
for tracking concept drift. Technical Report CSTR-20030610-3, 2003.
23. M. Lazarescu, S. Venkatesh, and H. Bui. Using multiple windows to track concept
drift. Intelligent Data Analysis Journal, 8(1):29–59, 2004.
24. PmatE. Projecto matemática ensino, 2005. http://pmate.ua.pt.
25. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers,
Inc., 1993.
26. J. Schlimmer and R. Granger. Beyond incremental processing: Tracking concept
drift. In Proceedings of the Fifth National Conference on Artificial Intelligence,
pages 502–507. AAAI Press, 1986.
27. A. Seewald, J. Petrak, and G. Widmer. Hybrid decision tree learners with alternative
leaf classifiers: an empirical study. In Proceedings of the FLAIRS Conference.
AAAI, 2001.
28. W. N. Street and Y. Kim. A streaming ensemble algorithm (sea) for large-scale
classification. In Proceedings of the seventh ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, pages 377–382. ACM Press, 2001.
29. P. Utgoff. ID5: An incremental ID3. In Fifth International Conference on Machine
Learning, pages 107–120. Morgan Kaufmann Publishers, 1988.
30. P. Utgoff. Perceptron trees - a case study in hybrid concept representation. In
Proceedings of the Seventh National Conference on Artificial Intelligence. Morgan
Kaufmann, 1988.
31. P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision tree induction based on
efficient tree restructuring. Machine Learning, 29(1):5–44, 1997.
32. W. Van de Velde. Incremental induction of topologically minimal trees. In B. Porter
and R. Mooney, editors, Machine Learning, Proceedings of the 7th International
Conference. Morgan Kaufmann, 1990.
33. H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams us-
ing ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, pages 735–740. ACM Press,
2003.
34. G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden
contexts. Machine Learning, 23:69–101, 1996.
              Error Rate (%)               Learning Times (seconds)              Tree Size
                                        Training+Classification       Training
Nr. Exs    C4.5   VFDTcNB  VFDTcMC     C4.5    VFDTcNB  VFDTcMC       VFDTc      C4.5   VFDTc
Waveform dataset - 21 Attributes
10k 22.45 19.74 38.70 13.9 13.4 6.5 1.4 1059 5
20k 21.48 19.74 32.45 25.4 16.6 8.1 3.2 2061 9
30k 21.20 19.09 29.49 40.6 18.3 9.9 4.9 2849 15
40k 20.93 18.82 27.91 59.5 19.8 11.8 6.8 3773 19
50k 20.63 18.49 26.20 78.7 21.7 13.6 8.5 4269 25
75k 20.28 17.66 26.20 146.8 26.9 18.4 13.3 6283 35
100k 19.98 17.85 25.46 229.2 31.2 23.4 18.3 7901 51
150k 19.76 17.14 23.49 456.6 41.6 33.4 28.2 11815 73
200k 19.58 17.78 22.90 753.9 52.3 43.4 38.4 15071 105
300k 19.23 17.16 22.24 1552.9 73.7 64.6 59.3 21115 149
400k 18.97 16.85 21.60 2600.6 95.4 85.6 80.2 27229 191
500k 18.96 17.28 21.15 3975.1 116.5 106.9 101.5 37787 249
750k 18.65 16.92 20.57 8237.4 169.9 158.4 153.2 47179 359
1000k 18.17 17.15 20.16 14960.8 210.1 201.5 195.3 61323 495
Waveform dataset - 40 Attributes
10k 23.09 22.03 33.47 24.5 29.2 12.3 2.9 1199 7
20k 22.61 20.73 30.21 43.5 32.8 16.1 6.4 2287 11
30k 21.99 18.98 30.21 67.1 36.4 19.7 9.9 3211 15
40k 21.63 20.27 27.05 96.7 39.8 23.4 13.6 4205 23
50k 21.75 18.34 25.27 129.4 43.9 27.1 17.3 5319 25
75k 21.41 20.66 27.05 230.0 53.5 36.7 26.8 7403 43
100k 21.02 18.50 24.82 347.8 64.2 46.2 36.2 9935 51
150k 20.52 19.15 23.48 686.3 84.2 65.7 55.7 14971 79
200k 20.17 18.02 22.29 1074.9 105.2 85.0 75.0 18351 101
300k 19.82 17.57 22.04 2175.8 146.6 125.1 115.0 26823 143
400k 19.64 17.96 21.56 3605.8 186.9 164.8 154.7 33617 203
500k 19.63 17.17 21.29 5570.8 227.4 155.3 195.1 43725 243
750k 19.24 17.51 20.32 12544.5 377.8 304.9 294.6 61713 375
1000k 19.12 17.03 19.82 22402.7 431.3 408.2 397.4 80933 491
LED dataset - 24 Attributes
10k 26.76 25.94 81.90 3.2 16.4 3.9 0.2 899 3
20k 26.64 25.92 81.94 4.3 16.7 4.2 0.4 1809 7
30k 26.46 26.01 74.59 5.5 16.9 4.4 0.6 2425 11
40k 26.57 26.17 74.67 6.9 17.2 4.7 0.8 3347 17
50k 26.54 25.96 74.73 8.2 17.4 5.0 1.0 4207 19
75k 26.41 25.96 74.69 11.5 18.0 5.6 1.6 5953 31
100k 26.53 25.87 74.67 15.1 18.8 6.3 2.2 7923 39
150k 26.49 25.96 74.65 22.1 19.8 7.7 3.4 11507 61
200k 26.42 26.06 74.59 29.6 21.2 9.3 4.7 15281 81
300k 26.38 25.96 74.14 45.8 24.1 12.5 7.5 21875 123
500k 26.39 25.95 73.97 80.6 31.7 19.9 14.1 35413 207
750k 26.37 25.90 74.01 144.8 42.9 31.4 24.3 51267 301
1000k 26.35 26.05 73.98 190.5 59.9 45.3 36.8 65119 403
Table 5. Learning curves for the Waveform-21, Waveform-40, and LED datasets. The
learning times are reported in seconds. In all the experiments the test set consists of
250k examples.