3 Change Detection 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Tracking Drifting Concepts . . . . . . . . . . . . . . . . . . . 36
3.2.1 The Nature of Change. . . . . . . . . . . . . . . . . . 36
3.2.2 Characterization of Drift Detection Methods . . . . . 38
3.2.2.1 Data Management . . . . . . . . . . . . . . . 38
3.2.2.2 Detection Methods . . . . . . . . . . . . . . . 40
3.2.2.3 Adaptation Methods . . . . . . . . . . . . . . 41
3.2.2.4 Decision Model Management . . . . . . . . . 42
3.2.3 A Note on Evaluating Change Detection Methods . . 43
3.3 Monitoring the Learning Process . . . . . . . . . . . . . . . . 43
3.3.1 Drift Detection using Statistical Process Control . . . 44
3.3.2 An Illustrative Example . . . . . . . . . . . . . . . . . 46
3.4 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Bibliography 213
Index 237
A Resources 241
A.1 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
A.2 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
Preface
In spite of being a small country in terms of geographic area and population size, Portugal has a very active and respected Artificial Intelligence community, with a good number of researchers well known internationally for the high quality of their work and relevant contributions to this area.
One of these well-known researchers is João Gama from the University of Porto. João Gama is one of the leading investigators in one of the currently hottest research topics in Machine Learning and Data Mining: Data Streams.
Although other books have been published covering important aspects of Data Streams, they are either mainly concerned with database aspects of Data Streams or are collections of chapters contributed by different authors on different aspects of the subject.
This is the first book to didactically cover, in a clear, comprehensive and mathematically rigorous way, the main Machine Learning aspects of this relevant research field. The book not only presents the fundamentals needed to fully understand Data Streams, but also describes important applications. It also discusses some of the main challenges for future Data Mining research, when Stream Mining will be at the core of many applications. These challenges will have to be addressed in the design of useful and efficient Data Mining solutions able to deal with real-world problems. It is important to stress that, in spite of this book being mainly about Data Streams, most of the concepts presented are valid for other areas of Machine Learning and Data Mining. The book will therefore be an up-to-date, broad and useful source of reference for all those interested in knowledge acquisition through learning techniques.
Life is the art of drawing sufficient conclusions from insufficient premises.
Samuel Butler
1.1 Introduction
In the last three decades, machine learning research and practice have focused on batch learning, usually from small datasets. In batch learning, the whole training set is available to the algorithm, which outputs a decision model after processing the data, possibly multiple times. The rationale behind this practice is that examples are generated at random according to some stationary probability distribution. Most learners use a greedy, hill-climbing search in the space of models, and are prone to high variance and overfitting. Brain and Webb (2002) pointed out the relation between variance and sample size: when learning from small datasets the main problem is variance reduction, while learning from large datasets may be more effective with algorithms that place greater emphasis on bias management.
In the most challenging applications, learning algorithms act in dynamic environments, where data are collected over time. A desirable property of these algorithms is the ability to incorporate new data. Some supervised learning algorithms are naturally incremental, for example k-nearest neighbors and naive Bayes. Others, like decision trees, require substantial changes to support incremental induction. Moreover, if the process is not strictly stationary (as in most real-world applications), the target concept may gradually change over time. Incremental learning is thus a necessary property, but not a sufficient one: incremental learning systems must also have mechanisms to deal with concept drift, forgetting outdated data and adapting to the most recent state of nature.
What distinguishes current data sets from earlier ones is automatic data
feeds. We do not just have people who are entering information into a com-
puter. Instead, we have computers entering data into each other. Nowadays,
there are applications in which the data are better modeled not as persistent
tables but rather as transient data streams. Examples of such applications in-
clude network monitoring, web mining, sensor networks, telecommunications
data management, and financial applications. In these applications, it is not
feasible to load the arriving data into a traditional Data Base Management
4 Knowledge Discovery from Data Streams
System (DBMS), which is not traditionally designed to directly support the continuous queries required by these applications (Babcock et al., 2002).
• Cluster Analysis
• Predictive Analysis
– Predict the value measured by each sensor for different time horizons;
– Predict peaks in the demand;
• Monitoring evolution
– Change Detection
∗ Detect changes in the behavior of sensors;
∗ Detect failures and abnormal activities;
The usual approach for dealing with these tasks consists of: i) selecting a finite data sample; and ii) generating a static model. Several types of models have been used for this purpose: different clustering algorithms and structures, various neural-network based models, Kalman filters, wavelets, etc. This strategy can exhibit very good performance in the first few months but, later, performance starts degrading, requiring all decision models to be re-trained as time goes by. What is the problem? The problem is probably the use of static decision models. Traditional systems that are one-shot, memory based, and trained from fixed training sets into static models are not prepared to process highly detailed, evolving data. They are able neither to continuously maintain a predictive model consistent with the actual state of nature, nor to quickly react to changes. Moreover, with the evolution of hardware components, sensors are acquiring computational power; the challenge will be to run the predictive models in the sensors themselves.
A basic question is: How can we collect labeled examples in real-time?
Suppose that at time t our predictive model made a prediction ŷt+k , for the
But do these characteristics really change the essence of machine learning? Would simple adaptations to existing learning algorithms not suffice to cope with the new needs previously described? These new concerns might indeed appear rather abstract, with no visible direct impact on machine
1 As an alternative, we could make another prediction, using the current model, for the time t + k.
learning techniques. Quite to the contrary, however, even very basic operations
that are at the core of learning methods are challenged in the new setting. For
instance, consider the standard approach to cluster variables (columns in a
working-matrix). In a batch scenario, where all data are available and stored
in a working matrix, we can apply any clustering algorithm over the trans-
pose of the working matrix. In a scenario where data evolve over time, this is
not possible, because the transpose operator is a blocking operator (Barbará,
2002): the first output tuple is available only after processing all the input
tuples. Now, think of the computation of the entropy of a collection of data when that collection comes as a data stream that is no longer finite, where the domain (set of values) of the variables can be huge, and where the number of classes of objects is not known a priori; or think of the continuous maintenance of the k-most frequent items in a retail data warehouse with three terabytes of data, hundreds of gigabytes of new sales records updated daily, and millions of different items.
Then, what becomes of statistical computations when the learner can only afford one pass over each data piece because of time and memory constraints; when the learner has to decide on the fly what is relevant and must be further processed, and what is redundant or not representative and can be discarded? These are a few examples of a clear need for new algorithmic approaches.
                     Traditional   Stream
  Number of passes   Multiple      Single
  Processing Time    Unlimited     Restricted
  Memory Usage       Unlimited     Restricted
  Type of Result     Accurate      Approximate
  Distributed?       No            Yes

Table 2.2: Differences between traditional and stream data query processing.
the exact solution. When the size of the sliding window is greater than the
available memory, there is no exact solution. For example, suppose that the
sequence is monotonically decreasing and the aggregation function is MAX.
Whatever the window size, the first element in the window is always the max-
imum. As the sliding window moves, the exact answer requires maintaining
all the elements in memory.
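The point above can be illustrated with a small sketch (ours, not the book's code): the standard way to maintain an exact sliding-window MAX is a monotonic deque that discards elements that can never again be the maximum. On a monotonically decreasing sequence nothing can ever be discarded, so the structure degenerates to storing the whole window, exactly as the text argues.

```python
from collections import deque

def sliding_max(stream, w):
    """Exact MAX over a sliding window of size w, via a monotonic deque.

    The deque holds (index, value) pairs with strictly decreasing values;
    for a monotonically decreasing stream no pair is ever dominated, so
    the deque grows to the full window size (the worst case in the text).
    """
    dq = deque()
    out = []
    for i, x in enumerate(stream):
        while dq and dq[-1][1] <= x:  # drop candidates dominated by x
            dq.pop()
        dq.append((i, x))
        if dq[0][0] <= i - w:         # evict the element leaving the window
            dq.popleft()
        if i >= w - 1:
            out.append(dq[0][1])
    return out
```

On average the deque is much smaller than the window, but its worst-case memory is still O(w), which is why approximate schemes are needed when the window exceeds memory.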
Figure 2.1: The Count-Min Sketch: the dimensions of the array depend on the desired probability level (δ) and the admissible error (ε).
The constants ε and δ have a large influence on the space used. Typically, the space is O(1/ε² log(1/δ)).
timate is given by taking the minimum value among these cells: x̂[IP] = min_k(CM[k, h_k(IP)]). This estimate is always optimistic, that is, x[j] ≤ x̂[j], where x[j] is the true value. The interesting fact is that the estimate is upper bounded by x̂[j] ≤ x[j] + ε × ||x||, with probability 1 − δ.
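The update and point-query operations just described can be sketched as follows. This is a minimal illustrative implementation, not the book's code; the `(a·x + b) mod p` row hashes and the sizing conventions w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ are common choices assumed here.

```python
import math
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: a d x w array of counters, one hash per row.

    Estimates are always over-estimates: x[j] <= estimate(j), and with
    probability 1 - delta, estimate(j) <= x[j] + eps * ||x||_1.
    """
    def __init__(self, eps=0.01, delta=0.01, seed=42):
        self.w = math.ceil(math.e / eps)          # width: admissible error
        self.d = math.ceil(math.log(1.0 / delta)) # depth: confidence
        self.table = [[0] * self.w for _ in range(self.d)]
        rnd = random.Random(seed)
        self.p = 2**31 - 1                        # a Mersenne prime
        self.hashes = [(rnd.randrange(1, self.p), rnd.randrange(self.p))
                       for _ in range(self.d)]

    def _h(self, k, item):
        a, b = self.hashes[k]
        return ((a * hash(item) + b) % self.p) % self.w

    def update(self, item, count=1):
        for k in range(self.d):
            self.table[k][self._h(k, item)] += count

    def estimate(self, item):
        # minimum over the d rows: the cell least inflated by collisions
        return min(self.table[k][self._h(k, item)] for k in range(self.d))
```

Because every counter a key maps to is incremented on every update of that key, collisions can only inflate counts, never deflate them, which is why the minimum over rows is the right query.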
No more than 1/4 of the values are more than 2 standard deviations away
from the mean, no more than 1/9 are more than 3 standard deviations away,
no more than 1/25 are more than 5 standard deviations away, and so on.
Two results from statistical theory that are useful in most cases are:
From this theorem, we can derive the absolute error (Motwani and Raghavan, 1997):

    ε ≤ √( R² ln(2/δ) / (2n) )        (2.4)
Chernoff and Hoeffding bounds are independent of the distribution generating the examples. They are applicable in all situations where observations are independent and generated by a stationary distribution. Due to their generality they are conservative, that is, they require more observations than distribution-dependent bounds. The Chernoff bound is multiplicative and its error is expressed as a relative approximation. The Hoeffding bound is additive and its error is absolute. While the Chernoff bound uses the sum of events and requires the expected value of the sum, the Hoeffding bound uses the expected value and the number of observations.
    p_k = P(x = k) = e^{−λ} λ^k / k!

P(x = k) increases with k from 0 up to k ≤ λ and falls off beyond λ. The mean and variance are E(X) = Var(X) = λ (see Figure 2.2).
Some interesting properties of Poisson processes are:
• If the intervals (t1 , t2 ) and (t3 , t4 ) are non-overlapping, then the number
of points in these intervals are independent;
• If x1 (t) and x2 (t) represent two independent Poisson processes with
parameters λ1 t and λ2 t, their sum x1 (t) + x2 (t) is also a Poisson process
with parameter (λ1 + λ2 )t.
    x̄ᵢ = ((i − 1) × x̄ᵢ₋₁ + xᵢ) / i        (2.5)

In fact, to incrementally compute the mean of a variable, we only need to maintain in memory the number of observations (i) and the sum of the values seen so far, Σxᵢ. Some simple mathematics allow us to define an incremental version of the standard deviation. In this case, we need to store 3 quantities: i, the number of data points; Σxᵢ, the sum of the i points; and Σxᵢ², the sum of the squares of the i data points. The equation to continuously compute σ is:

    σᵢ = √( (Σxᵢ² − (Σxᵢ)²/i) / (i − 1) )        (2.6)
Another useful measure that can be recursively computed is the correlation coefficient. Given two streams x and y, we need to maintain the sum of each stream (Σxᵢ and Σyᵢ), the sum of the squared values (Σxᵢ² and Σyᵢ²), and the sum of the cross-product (Σ(xᵢ × yᵢ)). The exact correlation is:

    corr(a, b) = ( Σ(xᵢ × yᵢ) − (Σxᵢ × Σyᵢ)/n ) / ( √(Σxᵢ² − (Σxᵢ)²/n) × √(Σyᵢ² − (Σyᵢ)²/n) )        (2.7)
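Equations (2.5)–(2.7) translate into a small accumulator class; a sketch of ours, storing only the six sufficient statistics:

```python
import math

class RunningStats:
    """Incremental mean and standard deviation (Eqs. 2.5-2.6) and
    correlation (Eq. 2.7) over two streams x and y, keeping only
    n, sum(x), sum(x^2), sum(y), sum(y^2), sum(x*y)."""
    def __init__(self):
        self.n = 0
        self.sx = self.sxx = 0.0
        self.sy = self.syy = 0.0
        self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sxx += x * x
        self.sy += y
        self.syy += y * y
        self.sxy += x * y

    def mean_x(self):
        return self.sx / self.n

    def std_x(self):
        # Eq. (2.6): sqrt((sum x^2 - (sum x)^2 / n) / (n - 1))
        return math.sqrt((self.sxx - self.sx ** 2 / self.n) / (self.n - 1))

    def corr(self):
        # Eq. (2.7)
        num = self.sxy - self.sx * self.sy / self.n
        den = (math.sqrt(self.sxx - self.sx ** 2 / self.n)
               * math.sqrt(self.syy - self.sy ** 2 / self.n))
        return num / den
```

Memory is O(1) per pair of streams, independently of how many observations have been processed.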
Figure 2.4: Tilted time windows: the top figure presents a natural tilted time window; the bottom figure presents the logarithmic tilted time windows.
Whenever the 1001st value is observed, the time window moves one observation and the sufficient statistics are updated as: A = A + x₁₀₀₁ − x₉₀₁ and B = B + x²₁₀₀₁ − x²₉₀₁.
Note that, because old observations must be forgotten, we still need to keep all the observations inside the window in memory. The same problem applies to time windows whose size changes over time. In the following sections we address the problem of maintaining approximate statistics over sliding windows in a stream, without storing in memory all the elements inside the window.
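The update just described can be sketched for a generic window of size w (names are ours): the sufficient statistics A = Σx and B = Σx² are updated in O(1) per observation, but the raw values must still be buffered so the expiring one can be subtracted.

```python
from collections import deque
import math

class WindowStats:
    """Mean and standard deviation over a sliding window of size w,
    maintained from the sufficient statistics A = sum(x), B = sum(x^2).
    The raw observations are buffered only so the oldest value can be
    subtracted when the window slides, as noted in the text."""
    def __init__(self, w):
        self.w = w
        self.buf = deque()
        self.A = 0.0  # sum of the values inside the window
        self.B = 0.0  # sum of the squared values

    def update(self, x):
        self.buf.append(x)
        self.A += x
        self.B += x * x
        if len(self.buf) > self.w:
            old = self.buf.popleft()
            self.A -= old
            self.B -= old * old

    def mean(self):
        return self.A / len(self.buf)

    def std(self):
        n = len(self.buf)
        return math.sqrt((self.B - self.A ** 2 / n) / (n - 1))
```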
by using the Hoeffding bound. The test eventually boils down to whether the difference between the averages of the two subwindows is larger than a variable value ε_cut, computed as follows:

    m := 2 / (1/|W₀| + 1/|W₁|)
    ε_cut := √( (1/(2m)) · ln(4|W|/δ) )
The main technical result in Bifet and Gavaldà (2006, 2007) about the performance of ADWIN is the following theorem, which provides bounds on the rate of false positives and false negatives:
Theorem 2.2.4 With ε_cut defined as above, at every time step we have:
• As a change detector, since ADWIN shrinks its window if and only if there
has been a significant change in recent times (with high probability);
Figure 2.5: Output of algorithm ADWIN for different change rates: (a) Output
of algorithm ADWIN with abrupt change; (b) Output of algorithm ADWIN with
slow gradual changes.
iff n₁ = O((µ ln(1/δ)/α²)^{1/3}). So n₁ steps after the change the window will start shrinking, and will remain at approximately size n₁ from then on. A dependence on α of the form O(α^{−2/3}) may seem odd at first, but one can show that this window length is actually optimal in this setting, even if α is known: it minimizes the sum of the variance error (due to a short window) and the error due to out-of-date data (due to long windows in the presence of change). Thus, in this setting, ADWIN provably adjusts automatically the window setting to
2.2.6.1 Sampling
Sampling is a common practice for selecting a subset of the data to be analyzed. Instead of dealing with the entire data stream, we select instances at periodic intervals. Sampling is used to compute statistics (expected values) of the stream. While sampling methods reduce the amount of data to process and, as a consequence, the computational cost, they can also be a source of errors. The main problem is to obtain a representative sample, that is, a subset of data with approximately the same properties as the original data.
How can we obtain an unbiased sample of the data? In statistics, most techniques require knowing the length of the stream. For data streams, we need to modify these techniques. The key issues are the sample size and the sampling method.
The simplest form of sampling is random sampling, where each element has
equal probability of being selected. The reservoir sampling technique (Vitter,
1985) is the classic algorithm to maintain an online random sample. The basic
idea consists of maintaining a sample of size k, called the reservoir. As the
stream flows, every new element has a probability k/n of replacing an old
element in the reservoir.
Consider the simplest case: sample size k = 1. The probability that the i-th item is the sample from a stream of length n is:

    1/2 × 2/3 × . . . × i/(i+1) × . . . × (n−2)/(n−1) × (n−1)/n = 1/n
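Reservoir sampling as described can be sketched in a few lines (a minimal version of Vitter's Algorithm R; the function name is ours):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k from a stream of
    unknown length: the n-th element enters the reservoir with
    probability k/n, replacing a uniformly chosen slot."""
    rng = rng or random.Random()
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)          # fill the reservoir first
        elif rng.random() < k / n:
            reservoir[rng.randrange(k)] = x
    return reservoir
```

At any point during the pass, every element seen so far has the same probability k/n of being in the reservoir, which is exactly the telescoping-product argument above for k = 1.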
2.2.6.3 Wavelets
Wavelet transforms are mathematical techniques in which signals are represented as weighted sums of simpler, fixed building waveforms at different scales and positions. Wavelets express a time series in terms of translation and scaling operations over a simple function called the mother wavelet. While scaling compresses or stretches the mother wavelet, translation shifts it along the time axis.
Wavelets attempt to capture trends in numerical functions. Wavelets decompose a signal into a set of coefficients. The decomposition implies no information loss, because it is possible to reconstruct the signal from the full set of coefficients. Nevertheless, it is possible to eliminate small coefficients from the wavelet transform, introducing only small errors when reconstructing the original signal. Such reduced sets of coefficients are of great interest for streaming algorithms.
A wavelet transform decomposes a signal into several groups of coefficients.
Different coefficients contain information about characteristics of the signal at
different scales. Coefficients at coarse scale capture global trends of the signal,
while coefficients at fine scales capture details and local characteristics.
The simplest and most common transformation is the Haar wavelet (Jawerth and Sweldens, 1994). It is based on the multiresolution analysis principle, where the space is divided into a sequence of nested subspaces. Any sequence (x₀, x₁, . . . , x₂ₙ, x₂ₙ₊₁) of even length is transformed into a sequence of two-component vectors ((s₀, d₀), . . . , (sₙ, dₙ)). The process continues by separating the sequences s and d and recursively transforming the sequence s. One stage of the Fast Haar-Wavelet Transform consists of:

    [ sᵢ ]          [ 1   1 ] [ xᵢ   ]
    [ dᵢ ] = 1/2 ×  [ 1  −1 ] [ xᵢ₊₁ ]
As an illustrative example, consider the sequence f = (2, 5, 8, 9, 7, 4, −1, 1). Applying the Haar transform:
• Step 1
s1 = (2 + 5, 8 + 9, 7 + 4, −1 + 1)/2, d1 = (2 − 5, 8 − 9, 7 − 4, −1 − 1)/2
s1 = (7, 17, 11, 0)/2, d1 = {−1.5, −.5, 1.5, −1}
• Step 2
s2 = ((7 + 17)/2, (11 + 0)/2)/2, d2 = ((7 − 17)/2, (11 − 0)/2)/2
s2 = (24/2, 11/2)/2, d2 = {−2.5, 2.75}
• Step 3
s3 = ((24 + 11)/4)/2, d3 = {((24 − 11)/4)/2}
s3 = 4.375, d3 = {1.625}
The sequence {4.375, 1.625, −2.5, 2.75, −1.5, −.5, 1.5, −1} gives the coefficients of the expansion. The process consists of adding/subtracting pairs of numbers, divided by the normalization factor.
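The averaging/differencing scheme above can be coded directly; running it on the sequence f of the example recovers its Haar coefficients (coarsest smooth value first, then the detail coefficients from coarsest to finest).

```python
def haar_transform(x):
    """Haar wavelet decomposition by averaging and differencing:
    s_i = (x_{2i} + x_{2i+1}) / 2, d_i = (x_{2i} - x_{2i+1}) / 2,
    applied recursively to the smooth part s. Input length must be a
    power of two."""
    s = list(x)
    details = []
    while len(s) > 1:
        pairs = list(zip(s[0::2], s[1::2]))
        # prepend: coarser-level details go before finer-level ones
        details = [(a - b) / 2 for a, b in pairs] + details
        s = [(a + b) / 2 for a, b in pairs]
    return s + details
```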
Wavelet analysis is popular in several streaming applications, because most
signals can be represented using a small set of coefficients. Matias et al. (1998)
present an efficient algorithm based on multi-resolution wavelet decomposition
for building histograms with application to databases problems, like selectiv-
ity estimation. In the same research line, Chakrabarti et al. (2001) propose
the use of wavelet-coefficient synopses as a general method to provide approximate answers to queries. The approach builds multi-dimensional wavelet synopses from relational tables. Guha and Harb (2005) propose one-pass streaming algorithms for wavelet construction, with provable error guarantees for minimizing a variety of error measures including all weighted and relative ℓp norms.
Marascu and Masseglia (2009) present an outlier detection method using the
two most significant coefficients of Haar wavelets.
    X_F = (1/√w) Σ_{i=0}^{w−1} xᵢ e^{−j2πFi/w},   where j = √−1,

and the inverse transform:

    xᵢ = (1/√w) Σ_{F=0}^{w−1} X_F e^{j2πFi/w}.
Assume we have a retail data warehouse. The actual size of the data ware-
house is 3 TB of data, and hundreds of gigabytes of new sales records are
updated daily. The order of magnitude of the different items is millions.
The hot-list problem consists of identifying the most (say 20) popular
items. Moreover, we have restricted memory: we can have a memory of hun-
dreds of bytes only. The goal is to continuously maintain a list of the top-K
most frequent elements in a stream. Here, the goal is the rank of the items.
The absolute value of the counts is not relevant; only their relative position is. A first, trivial approach consists of maintaining a count for each element in the alphabet and, when queried, returning the first k elements in the sorted list of counts. This is an exact and efficient solution for small alphabets. For large alphabets it is very space (and time) inefficient: there will be a large number of zero counts.
Misra and Gries (1982) present a simple algorithm (Algorithm 3) that maintains partial information of interest by monitoring only a small subset m of elements. We should note that m > k but m ≪ N, where N is the cardinality of the alphabet. When a new element i arrives, if it is in the set of monitored elements, the appropriate counter is incremented; otherwise, if there is some counter with count zero, it is allocated to the new item and the counter is set to 1; otherwise all the counters are decremented.
Later, Metwally et al. (2005) presented the Space-Saving algorithm (Algorithm 4), an interesting variant of the Misra and Gries (1982) algorithm. When a new element i arrives, if it is in the set of monitored elements, the appropriate counter is incremented; otherwise the element with the fewest hits is removed, and i is included with a counter initialized to the count of the removed element plus one. Both algorithms are efficient for very large alphabets with skewed distributions. The advantage of Space-Saving shows up when the popular elements evolve over time, because it tends to give more importance to recent observations: elements that are growing more popular will gradually be pushed to the top of the list. Both algorithms continuously return a list of the top-k elements. This list might be only an approximate solution. Metwally et al. (2005) report that, even when it is not possible to guarantee the top-k elements, the algorithm can guarantee the top-k′ elements, with k′ ≈ k. If we denote by cᵢ the count associated with the most recently monitored element, any element i is guaranteed to be among the top-m elements if its guaranteed number of hits, countᵢ − cᵢ, exceeds countₘ₊₁. Nevertheless, the counts maintained by both algorithms are not reliable; only the rank of the elements is of interest.
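The Space-Saving loop can be sketched in a few lines. This is an illustrative version of ours: Metwally et al. use a linked "stream-summary" structure to find the minimum counter in O(1), which we replace here with a plain dictionary scan.

```python
def space_saving(stream, m):
    """Space-Saving (Metwally et al., 2005) with at most m counters.
    A new, unmonitored element evicts the element with the fewest hits
    and inherits its count plus one."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)  # fewest hits
            counters[item] = counters.pop(victim) + 1
    return counters
```

Truly frequent elements accumulate far more hits than any eviction can erase, so they stay monitored; rare elements churn through the low-count slots.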
• Data points - the smallest unit of time over which the system collects
data;
• Basic windows - a consecutive subsequence of time points over which
the system maintains a digest incrementally;
• Sliding window - a user-defined subsequence of basic windows over
which the user wants statistics.
Figure 2.6 shows the relation between the three levels. Let w be the size of the sliding window. Suppose w = kb, where b is the length of a basic window and k is the number of basic windows inside the sliding window. Let S[0], S[1], . . . , S[k − 1] denote the sequence of basic windows inside the sliding window. The elements of a basic window are S[i] = s[(t − w) + ib + 1 : (t − w) + (i + 1)b]. The sliding window moves over basic windows: when a new basic window S[k] is completely filled, S[0] expires.
Simple Statistics. Moving averages always involve w points. For each basic window, StatStream maintains the digest Σ(S[i]). When the new basic window S[k] is available and the sliding window moves, the sum is updated as: Σ_new(s) = Σ_old(s) + Σ S[k] − Σ S[0]. Maximum, minimum, and standard deviation are computed in a similar way.
Monitoring Correlation.
Let X_m^old be the m-th DFT coefficient of the series in the sliding window x₀, x₁, . . . , x_{w−1}, and X_m^new be that coefficient of the series x₁, x₂, . . . , x_w:

    X_m^new = e^{j2πm/w} (X_m^old + (x_w − x₀)/√w).

This can be extended to an update on the basic windows when the sliding window moves. Let X_m^old be the m-th DFT coefficient of the series in the sliding window x₀, x₁, . . . , x_{w−1}, and X_m^new be that coefficient of the series x_b, x_{b+1}, . . . , x_w, x_{w+1}, . . . , x_{w+b−1}:

    X_m^new = e^{j2πmb/w} X_m^old + (1/√w) ( Σ_{i=0}^{b−1} e^{j2πm(b−i)/w} x_{w+i} − Σ_{i=0}^{b−1} e^{j2πm(b−i)/w} xᵢ ).
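The single-step update can be checked numerically against a direct recomputation (a sketch of ours; it assumes the forward transform with a negative exponent, consistent with the update rule above):

```python
import cmath
import math

def dft_coefficient(x, m):
    """m-th DFT coefficient: X_m = (1/sqrt(w)) * sum_i x_i e^{-j 2 pi m i / w}."""
    w = len(x)
    return sum(x[i] * cmath.exp(-2j * cmath.pi * m * i / w)
               for i in range(w)) / math.sqrt(w)

def dft_update(X_old, m, w, x_out, x_in):
    """Incremental update when the window slides by one position:
    X_new = e^{j 2 pi m / w} * (X_old + (x_in - x_out) / sqrt(w))."""
    return cmath.exp(2j * cmath.pi * m / w) * (X_old + (x_in - x_out) / math.sqrt(w))
```

Each coefficient is updated in O(1) per arriving value, instead of the O(w) cost of recomputing it from scratch.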
Figure 2.7: The vector space: the gray dots (A, B, C) correspond to the sensors' measurements, and the black dot (D) to the aggregation vector. The gray region corresponds to the alarm region. The center figure illustrates a normal air condition. The right figure presents an alarm condition, even though none of the sensors is inside the alarm region.
2.4 Notes
The research on Data Stream Management Systems started in the database community, to solve problems like continuous queries over transient data. The most relevant projects include the STanford stREam datA Manager (Stanford University), with a focus on data management and query processing in the presence of multiple, continuous, rapid, time-varying data streams. At MIT, the Aurora project was developed to build a single infrastructure that can efficiently and seamlessly meet the demanding requirements of stream-based applications, focusing on real-time data processing issues such as quality-of-service (QoS)- and memory-aware operator scheduling, and semantic load shedding for dealing with transient spikes in incoming data rates. Two other well-known systems are Telegraph, developed at the University of California, Berkeley, and NiagaraCQ, developed at the University of Wisconsin. We must also refer to the Hancock project, developed at AT&T Research labs, where a C-based domain-specific language was designed to make it easy to read, write, and maintain programs that manipulate huge amounts of data.
Chapter 3
Change Detection
Most machine learning algorithms assume that examples are generated at random, according to some stationary probability distribution. In this chapter, we study the problem of learning when the distribution that generates the examples changes over time. Embedding change detection in the learning process is one of the most challenging problems when learning from data streams. We review the Machine Learning literature on learning in the presence of drift, discuss the major issues in detecting changes and adapting decision models when learning from streams with unknown dynamics, and present illustrative algorithms for detecting changes in the distribution of the training examples.
3.1 Introduction
In many applications, learning algorithms act in dynamic environments where the data flow continuously. If the process is not strictly stationary (as in most real-world applications), the target concept may change over time. Nevertheless, most work in Machine Learning assumes that training examples are generated at random according to some stationary probability distribution. Basseville and Nikiforov (1993) present several examples of real problems where change detection is relevant. These include user modeling, monitoring in bio-medicine and industrial processes, fault detection and diagnosis, safety of complex systems, etc.
The Probably Approximately Correct (PAC) learning framework (Valiant, 1984) assumes that examples are independent and randomly generated according to some probability distribution D. In this context, some model-class learning algorithms (like decision trees, neural networks, some variants of k-nearest neighbors, etc.) can generate hypotheses that converge to the Bayes error in the limit, that is, when the number of training examples grows to infinity. All that is required is that D be stationary: the distribution must not change over time.
Our environment is naturally dynamic, constantly changing in time. Huge
amounts of data are being continuously generated by various dynamic systems
or applications. Real-time surveillance systems, telecommunication systems,
sensor networks and other dynamic environments are such examples. Learning
algorithms that model the underlying processes must be able to track this
behavior and adapt the decision models accordingly.
Figure 3.1: Three illustrative examples of change: (a) change in the mean; (b) change in the variance; (c) change in the correlation.
first dimension, the causes of change, we can distinguish between changes due to modifications in the context of learning, caused by changes in hidden variables, and changes in the characteristic properties of the observed variables. Existing Machine Learning algorithms learn from observations described by a finite set of attributes, but in real-world problems there can be important properties of the domain that are not observed. There could be hidden variables that influence the behavior of nature (Harries et al., 1998). Hidden variables may change over time. As a result, concepts learned at one time can become inaccurate. On the other hand, there can be changes in the characteristic properties of nature itself.
The second dimension is related to the rate of change. The term Concept Drift is more associated with gradual changes in the target concept (for example, the rate of change of prices), while the term Concept Shift refers to abrupt changes. Usually, detection of abrupt changes is easier and requires few examples. Gradual changes are more difficult to detect: at least in the first phases of a gradual change, the perturbations in the data can be seen as noise by the detection algorithm, so, to be resilient to noise, detectors often require more examples to distinguish change from noise. In an extreme case, if the rate of change is greater than our ability to learn, we cannot learn anything. At the other extreme, slow changes can be confused with stationarity.
We can formalize concept drift as a change in the joint probability P(~x, y), which can be decomposed as P(~x, y) = P(y|~x) × P(~x). We are interested in changes in the y values given the attribute values ~x, that is, in P(y|~x).
Lazarescu et al. (2004) define concept drift in terms of consistency and persistence. Consistency refers to the change εₜ = θₜ − θₜ₋₁ that occurs between consecutive examples of the target concept from time t − 1 to t, with θₜ being the state of the target function at time t. A concept is consistent if εₜ is
Time windows correspond to abrupt forgetting (weight equal to 0): the examples are deleted from memory. We can, of course, combine both forgetting mechanisms by weighting the examples in a time window (see Klinkenberg (2004)).
Figure 3.3: Illustrative example of the Page-Hinkley test. The left figure plots the on-line error rate of a learning algorithm. The center plot is the accumulated on-line error; the slope increases after the occurrence of a change. The right plot presents the evolution of the PH statistic.
1. Blind Methods: Methods that adapt the learner at regular intervals with-
out considering whether changes have really occurred. Examples include
methods that weight the examples according to their age and methods
that use time-windows of fixed size.
2. Informed Methods: Methods that modify the decision model only after a change has been detected. They are used in conjunction with a detection model.
instance space.
2 For an infinite number of examples, the error rate will tend to the Bayes error.
Figure 3.6 illustrates the use of the SPC algorithm as a wrapper over a naive-Bayes classifier that incrementally learns the SEA concepts dataset. The vertical bars denote drift occurrences, while the dotted vertical bars indicate when drift was detected. Both figures plot the classifier's prequential error (see chapter 5). In the left figure, the classifier trains with the examples stored in the buffer after the warning level was reached.
From the practical point of view, when a drift is signaled, the method defines a dynamic time window of the most recent examples, which is used to train a new classifier. The key point here is how fast the change occurs. If the change occurs at a slow rate, the prequential error will exhibit a small positive slope, and more examples are needed to move from the warning level to the action
Figure 3.6: Illustrative example of using the SPC algorithm on the SEA
concepts dataset. All the figures plot the prequential error of a naive-Bayes
classifier on the SEA concepts dataset. In the first plot there is no drift
detection. In the second plot, the SPC algorithm was used to detect drift. The
third plot is similar to the second one, without using buffered examples when
a warning is signaled.
level and the window size increases. For abrupt changes, the increase of the
prequential error will also be abrupt and a small window (using few examples)
will be chosen. In any case, the ability to train a new accurate classifier
depends on the rate of change and the capacity of the learning algorithm
to converge to a stable model. The last aspect depends on the number of
examples in the context.
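The warning/drift mechanism described above can be sketched as a simple SPC-style detector over the on-line error rate. The class name and the 30-example warm-up are illustrative choices, not part of the original algorithm; the 2-sigma and 3-sigma thresholds follow the text.

```python
import math

# A sketch in the spirit of the SPC/DDM drift detection method: monitor the
# error rate p_i with standard deviation s_i = sqrt(p_i * (1 - p_i) / i) and
# compare against the best (minimum) p_min + s_min registered so far.
class SPCDetector:
    def __init__(self):
        self.i = 0
        self.errors = 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, error):
        """error: 1 if the classifier misclassified the example, else 0.
        Returns 'in-control', 'warning' or 'drift'."""
        self.i += 1
        self.errors += error
        p = self.errors / self.i
        s = math.sqrt(p * (1 - p) / self.i)
        if self.i < 30:                      # warm-up before monitoring
            return "in-control"
        if p + s < self.p_min + self.s_min:  # register the best state
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "in-control"
```

Between the warning and the drift levels, the examples would be stored in a buffer and used to train the new classifier once drift is signaled, as described above.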
3.5 Notes
Change detection and concept drift have attracted much attention in the
last 50 years. Most of the works deal with the fixed-sample problem and the
at-most-one-change model. A review of techniques, methods and applications of
change detection appears in Basseville and Nikiforov (1993); Ghosh and Sen (1991).
Procedures of sequential detection of changes have been studied in statistical
process control (Grant and Leavenworth, 1996). The pioneering work in this area
is Shewhart (1931), which presented the 3-sigma control chart. More efficient
techniques, the cumulative sum procedures, were developed by Page (1950).
Cumulative sums have been used in data mining in Pang and Ting (2004).
Kalman filters associated with CUSUM appear in Schön et al. (2005); Bifet
and Gavaldà (2006); Severo and Gama (2006).
In the 1990s, concept drift was studied by several researchers in the context
of finite samples. The relevant works include Schlimmer and Granger (1986),
with the system STAGGER and the famous STAGGER dataset, Widmer
and Kubat (1996), presenting the FLORA family of algorithms, and Harries
et al. (1998), who present the Splice system for identifying context changes.
Kuh et al. (1990) present bounds on the frequency of concept changes, i.e.
the rate of drift, that is acceptable to any learner. Pratt and Tschapek (2003)
describe a visualization technique that uses brushed, parallel histograms to
aid in understanding concept drift in multidimensional problem spaces.
A survey on incremental data mining model maintenance and change de-
tection under block evolution appears in Ganti et al. (2002). Remember that
in block evolution (Ramakrishnan and Gehrke, 2003), a dataset is updated pe-
riodically through insertions and deletions of blocks of records at each time.
One important application of change detection algorithms is burst detection.
Burst regions are time intervals in which some data features are unexpected.
For example, gamma-ray bursts in astronomical data might be associated with
the death of massive stars; bursts in document streams might be valid
indicators of emerging topics, strong buy-sell signals in the financial
market, etc. Burst detection in text streams was discussed in Kleinberg (2004).
Vlachos et al. (2005) discuss a similar problem in financial streams.
Chapter 4
Maintaining Histograms from Data
Streams
Histograms are one of the most used tools in exploratory data analysis. They
present a graphical representation of data, providing useful information about
the distribution of a random variable. Histograms are widely used for density
estimation. They have been used in approximate query answering over massive
datasets, to provide a quick but approximate answer with error guarantees. In
this chapter we present representative algorithms to learn histograms from
data streams and their application in data mining.
4.1 Introduction
The most used histograms are either equal width, where the range of ob-
served values is divided into k intervals of equal length (∀i, j : (bi − bi−1 ) =
(bj − bj−1 )), or equal frequency, where the range of observed values is divided
into k bins such that the counts in all bins are equal (∀i, j : (fi = fj )).
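The two histogram types just defined can be sketched for a batch of observed values as follows. Function names are illustrative; a streaming version would maintain the counts incrementally.

```python
# A sketch of the two classical histogram types: equal width (k intervals of
# equal length) and equal frequency (k bins holding the same counts).
def equal_width(values, k):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # clamp the maximum value into the last bin
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    boundaries = [lo + i * width for i in range(k + 1)]
    return boundaries, counts

def equal_frequency(values, k):
    s = sorted(values)
    n = len(s)
    # bin boundaries at the empirical quantiles i/k
    return [s[0]] + [s[min(n - 1, (i * n) // k)] for i in range(1, k)] + [s[-1]]
```

Note that the equal-frequency construction requires sorting the data, which is one of the reasons one-pass summaries (such as PiD, below) are attractive in the streaming setting.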
width: h = 3.5 s n^(-1/3), and the Freedman and Diaconis (1981) rule for the
class width: h = 2 IQR n^(-1/3), where s is the sample standard deviation and
IQR is the sample interquartile range.
merged with a neighbor bucket, and the bucket with higher counts is
divided into two.
Figure 4.1: Split & Merge and Merge & Split Operators.
4.2.2.2 Discussion
The main property of the exponential histograms is that the bucket size grows
exponentially, i.e. 2^0, 2^1, 2^2, ..., 2^h. Datar et al. (2002) show that,
using the algorithm for the basic counting problem, one can adapt many other
techniques to work for the sliding window model, with a multiplicative overhead
of O(log(N)/ε) in memory and a 1 + ε factor loss in accuracy. These include
maintaining approximate histograms, hash tables, and statistics or aggregates
such as sums and averages.
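A simplified sketch of the basic-counting building block follows: bucket sizes grow as powers of two, and a bound on the number of buckets of each size (related to the accuracy parameter ε) forces merges. The class name, the `max_same` parameter and the half-count correction of the oldest bucket are illustrative simplifications of the Datar et al. scheme, not the exact published algorithm.

```python
from collections import deque

# A simplified exponential histogram: approximate the number of 1's in the
# last `window` elements of a 0/1 stream.
class ExponentialHistogram:
    def __init__(self, window, max_same=4):
        self.window = window
        self.max_same = max_same
        self.buckets = deque()   # (timestamp of most recent 1, size), newest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        # expire the oldest bucket once its newest element leaves the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.appendleft((self.t, 1))
            self._merge()

    def _merge(self):
        # when more than max_same buckets of a size exist, merge the two
        # oldest of that size into one of double size (cascading upward)
        bl = list(self.buckets)
        i = 0
        while i < len(bl):
            size = bl[i][1]
            j = i
            while j < len(bl) and bl[j][1] == size:
                j += 1
            if j - i > self.max_same:
                merged = (bl[j - 2][0], size * 2)   # keep the newer timestamp
                bl = bl[:j - 2] + [merged] + bl[j:]
            else:
                i = j
        self.buckets = deque(bl)

    def count(self):
        sizes = [sz for _, sz in self.buckets]
        if not sizes:
            return 0
        # the oldest bucket may straddle the window: count half of it
        return sum(sizes[:-1]) + sizes[-1] // 2
```

The estimate errs by at most half the size of the oldest bucket, which is where the 1 + ε accuracy trade-off of the scheme comes from.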
In this example the time scale is compressed. The most recent data is
stored inside the window at the finest detail (granularity). The oldest
information is stored at a coarser detail, in an aggregated way. The level of
granularity depends on the application. This window model is designated as a
tilted time
window. Tilted time windows can be designed in several ways. Han and Kamber
(2006) present two possible variants: natural tilted time windows, and
logarithmic tilted time windows. Illustrative examples are presented in
Figure 2.4. In the first case, data is stored with granularity according to a
natural time taxonomy: the last hour at a granularity of fifteen minutes (4
buckets), the last day in hours (24 buckets), the last month in days (31
buckets) and the last year in months (12 buckets). A similar time scale
compression appears in the case of logarithmic tilted windows. Here, all
buckets have the same size. Time is compressed using an exponential factor:
2^0, 2^1, 2^2, . . .. Buckets aggregate and store data from time intervals of
increasing size.
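A logarithmic tilted time window behaves like a binary counter over summaries. The sketch below keeps at most two summaries per level and, on overflow, merges the two oldest into the next coarser level; the per-level capacity of two and the class name are illustrative choices.

```python
# A sketch of a logarithmic tilted time window: levels[i] holds summaries
# (here, plain sums) over time intervals of size 2^i.
class LogTiltedWindow:
    def __init__(self):
        self.levels = []

    def add(self, value):
        carry = value
        i = 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append([])
            self.levels[i].insert(0, carry)      # newest summary at the front
            if len(self.levels[i]) > 2:
                # merge the two oldest summaries into one of double span
                b = self.levels[i].pop()
                a = self.levels[i].pop()
                carry = a + b
                i += 1
            else:
                carry = None
```

After eight unit insertions the structure holds two intervals of size 1, one of size 2 and one of size 4: recent data at fine granularity, old data aggregated.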
Figure 4.2: An illustrative example of the two layers in PiD. The input for
layer1 is the raw data stream; the input for layer2 is the counts stored in
layer1 .
The two-layer architecture (see Figure 4.2) divides the histogram problem
into two phases. In the first phase, the algorithm traverses the data stream
and incrementally maintains an equal-width discretization. The second phase
constructs the final histogram using only the discretization of the first
phase. The computational cost of this phase can be ignored: it traverses the
discretization obtained in the first phase only once. We can construct several
histograms using different numbers of intervals and different strategies:
equal-width or equal-frequency. This is the main advantage of PiD in
exploratory data analysis. PiD was used as a building block for a distributed
clustering algorithm discussed in Section 12.3.2.
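The two-layer idea can be sketched as follows. This is a simplified illustration that assumes a known value range for layer1; the actual PiD algorithm additionally splits layer1 intervals that become too heavy, which is not shown here.

```python
# A simplified sketch of the two-layer PiD idea: layer1 maintains a fine
# equal-width summary of the stream; layer2 builds a k-bin equal-frequency
# histogram from layer1's counts alone, with no second pass over the data.
class PiDSketch:
    def __init__(self, lo, hi, n1=200):
        self.lo, self.hi, self.n1 = lo, hi, n1
        self.counts = [0] * n1
        self.width = (hi - lo) / n1

    def update(self, x):                      # layer1: one pass over the stream
        idx = min(max(int((x - self.lo) / self.width), 0), self.n1 - 1)
        self.counts[idx] += 1

    def equal_frequency(self, k):             # layer2: uses only layer1 counts
        total = sum(self.counts)
        per_bin, boundaries, acc = total / k, [self.lo], 0
        for i, c in enumerate(self.counts):
            acc += c
            # emit a cut each time another 1/k of the mass is accumulated
            while acc >= per_bin * len(boundaries) and len(boundaries) < k:
                boundaries.append(self.lo + (i + 1) * self.width)
        boundaries.append(self.hi)
        return boundaries
```

Because layer2 reads only the 200-odd counts of layer1, several final histograms (different numbers of intervals, equal-width or equal-frequency) can be generated cheaply from the same summary.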
the distribution observed in the most recent data. Both distributions can be
compared using the Kullback-Leibler (KL) divergence (Sebastião and Gama,
2007). The Kullback-Leibler divergence measures the distance between two
probability distributions and so it can be used to test for change. Given a
reference window with empirical probabilities pi , and a sliding window with
probabilities qi , the KL divergence is:
KL(p||q) = Σi p(i) log2(p(i)/q(i)).
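Computed over bin counts from a reference window and a sliding window, the divergence can be sketched as below. The smoothing constant `eps` is an implementation choice to avoid division by zero on empty bins, not part of the formula.

```python
import math

# A sketch of change detection by comparing a reference distribution p with
# the distribution q observed in the most recent data, via the KL divergence.
def kl_divergence(p_counts, q_counts, eps=1e-9):
    np_, nq = sum(p_counts), sum(q_counts)
    kl = 0.0
    for cp, cq in zip(p_counts, q_counts):
        p = cp / np_ + eps
        q = cq / nq + eps
        kl += p * math.log2(p / q)
    return kl
```

A divergence near zero indicates that the two windows follow the same distribution; a persistent increase signals a change.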
PiD. For layer1 , the number of intervals was set to 200. For layer2 , the number
of intervals was set to 10, in two scenarios: equal width and equal frequency.
We have done illustrative experiments using a sample of 100k values of
a random variable from different distributions: Uniform, Normal, and
Log-Normal. Results reported here are the averages of 100 experiments. We
evaluate two quality measures: the set of boundaries and the frequencies.
Boundaries are evaluated using the mean absolute deviation,
mad(P, S) = Σ|Pi − Si|/n, and the mean squared error,
mse(P, S) = sqrt(Σ(Pi − Si)²/n). To compare frequencies, we use the affinity
coefficient² (Bock and Diday, 2000): AF(P, S) = Σ sqrt(Pi × Si). Its range is
[0; 1]; values near 1 indicate that the two distributions are not different.
tions are not different. The median of the performance metrics are presented
in table 4.1. The discretization provided by layer2 is based on the summaries
provided by layer1 . Of course, the summarization of layer1 introduces an er-
ror. These results provide evidence that the impact of this error in the final
discretization is reduced. Moreover, in a set of similar experiments, we have
observed that increasing the number of intervals of layer1 decreases the error.
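The three measures used in these experiments can be sketched directly from their definitions; here `mse` is the root of the mean squared error, following the formula above.

```python
import math

# Evaluation measures for comparing a streaming discretization P against the
# batch discretization S: mean absolute deviation and root mean squared
# error over boundaries, and the affinity coefficient over frequencies.
def mad(P, S):
    return sum(abs(p - s) for p, s in zip(P, S)) / len(P)

def mse(P, S):
    return math.sqrt(sum((p - s) ** 2 for p, s in zip(P, S)) / len(P))

def affinity(P, S):
    # P and S are relative frequencies; 1 means identical distributions
    return sum(math.sqrt(p * s) for p, s in zip(P, S))
```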
The advantage of the two-layer architecture of PiD is that, after generating
layer1, the computational costs, in terms of memory and time, to
generate the final histogram (layer2) are low: they depend only on the number
of intervals of layer1. From layer1 we can generate histograms with
different numbers of intervals and using different strategies (equal-width or
equal-frequency). We should note that the standard algorithm to generate
equal-frequency histograms requires a sort operation, which can be costly for
large n. This is not the case with PiD: generation of equal-frequency
histograms from layer1 is straightforward.
2 P denotes the set of boundaries (frequency in the case of AF) defined by PiD, and S
denotes the set of boundaries (frequency) defined using all the data.
In our experiments, we generate four blocks of 100k points each. For each
block the threshold values were 7, 8, 10 and 11, respectively. The
discretization used PiD, adopting an equal-width histogram with 200 intervals
for the first layer and an equal-frequency histogram with 10 intervals for the
second layer. The exponential decay was applied every 1000 examples with α set
to 0.9. Figure 4.4 presents the evolution over time of the second-layer
distribution for each attribute. In Att1 no significant changes occur; its
distribution was the same in each block of data. Changes in the distribution
appear in both Att2 and Att3. The distribution of Att3 is almost the mirror
of Att2.
4.5 Notes
Histograms are the most basic form of density estimation. Other approaches
to density estimation are kernel density estimation methods, for example,
Parzen windows. A probability density function (pdf) of a random variable is
a function that describes the density of probability at each point. The
probability of the random variable falling within an interval is given by the
integral of its density over that interval. Clustering methods, discussed in
Chapter 6, are more sophisticated techniques that can be used for density
estimation.
Chapter 5
Evaluating Streaming Algorithms
5.1 Introduction
Most recent learning algorithms (Cormode et al., 2007; Babcock et al.,
2003; Domingos and Hulten, 2000; Hulten et al., 2001; Gama et al., 2003;
Ferrer-Troyano et al., 2004) maintain a decision model that continuously
evolves over time, taking into account that the environment is non-stationary
and computational resources are limited. Examples of publicly available
software for learning from data streams include: the VFML (Hulten and
Domingos, 2003) toolkit for mining high-speed time-changing data streams, the
MOA (Kirkby, 2008) system for learning from massive data sets, Rapid-Miner
(Mierswa et al., 2006), a data mining system with a plug-in for stream
processing, etc. Despite the increasing number of streaming learning
algorithms, the metrics and the design of experiments for assessing the
quality of learning models are still an open issue. The main difficulties are:
where R is the range of the random variable. Both bounds use the sum of inde-
pendent random variables and give a relative or absolute approximation of the
deviation of X from its expectation. They are independent of the distribution
of the random variable.
M = sign(n0,1 − n1,0) × (n0,1 − n1,0)² / (n0,1 + n1,0)
Figure 5.5: Plot of the Qi statistic over a sliding window of 250 examples.
The last figure plots the Qi statistic using a fading factor of α = 0.995.
To overcome this drawback, and since fading factors are memoryless
and prove to exhibit behavior similar to sliding windows, we compute this
statistical test using different window sizes and fading factors. Figure 5.7
illustrates a comparison of the evolution of the signed McNemar statistic
between the two algorithms, computed over sliding windows of 1000 and 100
examples (on the top panel) and computed using fading factors with α = 0.999
and α = 0.99 (on the bottom panel). It can be observed that in both cases
the statistics reject the null hypothesis almost at the same point. The use of
this statistical test to compare stream-learning algorithms proves feasible
by applying sliding-window or fading-factor techniques. Nevertheless, for
different forgetting factors we got different results about the significance of
the differences.
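A sketch of the signed McNemar statistic over a sliding window follows: n01 counts examples misclassified only by algorithm A, n10 those misclassified only by B, and |M| is compared against the chi-square critical value (6.635 at 99% confidence, one degree of freedom). The class name is illustrative.

```python
from collections import deque

# Signed McNemar statistic over a sliding window, following the formula
# M = sign(n01 - n10) * (n01 - n10)^2 / (n01 + n10).
class SlidingMcNemar:
    def __init__(self, window=1000):
        self.pairs = deque(maxlen=window)   # (a_wrong, b_wrong) per example

    def update(self, a_wrong, b_wrong):
        self.pairs.append((a_wrong, b_wrong))
        n01 = sum(1 for a, b in self.pairs if a and not b)
        n10 = sum(1 for a, b in self.pairs if b and not a)
        if n01 + n10 == 0:
            return 0.0
        sign = 1 if n01 > n10 else (-1 if n01 < n10 else 0)
        return sign * (n01 - n10) ** 2 / (n01 + n10)
```

A fading-factor version would replace the window by exponentially decayed counts, n01 ← α·n01 + [only A errs], which is the memoryless alternative discussed above.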
Figure 5.6: The evolution of the signed McNemar statistic between two
algorithms. Vertical dashed lines indicate drift in the data, and vertical
lines indicate when drift was detected. The top panel shows the evolution of
the error rate of two naive-Bayes variants: a standard one and a variant that
detects and relearns a new model whenever drift is detected. The bottom panel
shows the evolution of the signed McNemar statistic computed for these two
algorithms.
Figure 5.7: The evolution of the signed McNemar statistic between the two
algorithms. The top panel shows the statistic computed over sliding windows
of 1000 and 100 examples, and the bottom panel shows the statistic computed
using fading factors with α = 0.999 and α = 0.99, respectively. The dotted
line is the threshold for a significance level of 99%. For different
forgetting factors we got different results about the significance of the
differences.
evolution of the error rate of a naive-Bayes classifier (using again the SEA
concepts dataset). The formula used to embed fading factors in the Page-
Hinkley test is: mT = α × mT−1 + (xt − x̂T − δ). To detect increases in
the error rate (due to drifts in the data) we compute the PHT, setting the δ
and λ parameters to 10−3 and 2.5, respectively. Figure 5.9 shows the delay
time for this test: (a) without fading factors, (b) and (c) using different
fading factors. The advantage of using fading factors in the PHT can easily
be observed in this figure. The exponential forgetting results in small delay
times without compromising detection.
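The fading-factor version of the test follows the recurrence above; a minimal sketch (class and parameter values are illustrative):

```python
# Page-Hinkley test with a fading factor: old deviations are exponentially
# forgotten via m_T = alpha * m_{T-1} + (x_t - mean_T - delta), which
# shortens detection delays after a change.
class FadingPageHinkley:
    def __init__(self, alpha=0.999, delta=0.005, lam=5.0):
        self.alpha, self.delta, self.lam = alpha, delta, lam
        self.n, self.mean, self.m, self.m_min = 0, 0.0, 0.0, 0.0

    def update(self, x):
        """Feed one observation; return True when a change is signaled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.m = self.alpha * self.m + (x - self.mean - self.delta)
        self.m_min = min(self.m_min, self.m)
        return (self.m - self.m_min) > self.lam
```

With α close to one the test approaches the classical cumulative form; smaller α forgets faster, trading a small risk of false alarms for shorter delays, consistent with the experiments reported below.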
We can control the rate of forgetting using different fading factors: the
closer the α value is to one, the less data it forgets. We evaluate the PHT
using different fading factors to assess the delay time in detection. Figures
5.9 (b) and (c) show the delay time in the detection of concept drifts. We
used the PHT with different fading factors (α = 1 − 10−5 and α = 1 − 10−4,
respectively) to detect these changes. Table 5.2 presents
Figure 5.8: Experiments in SEA dataset illustrating the first drift at point
15000. The top figure shows the evolution of the naive-Bayes error rate. The
bottom figure represents the evolution of the Page-Hinkley test statistic and
the detection threshold λ.
Table 5.2: Delay times in drift scenarios using different fading factors
(1 − α). The number of false alarms is indicated in parentheses; false alarms
were observed only for 1 − α = 10−4.

               10−4       10−5    10−6    10−7    10−8    0
  1st drift    1045 (1)   1609    2039    2089    2094    2095
  2nd drift     654 (0)   2129    2464    2507    2511    2511
  3rd drift     856 (1)   1357    1609    1637    2511    1641
the delay times in the detection of concept drifts in the same dataset used in
Figure 5.9. As the fading factor increases, one can observe that the delay
time also increases, which is consistent with the design of the experiments:
the closer the α value is to one, the greater the weight of old data, which
leads to higher delay times. The feasible values for α are between 1 − 10−4
and 1 − 10−8 (with α = 1 − 10−8 the delay times are not decreased, and with
α = 1 − 10−4 the delay time decreases dramatically but with false alarms).
We may focus on the resilience of this test to false alarms and on its
ability to reveal changes without missed detections. The results obtained with
this dataset were very consistent and precise, supporting that the use of
fading factors improves the accuracy of the Page-Hinkley test.
Figure 5.9: The evolution of the error rate and the delay times in drift
detection using the Page-Hinkley test and different fading factors. The top
panel shows the delay times using the PH test without fading factors. The
middle and bottom panels show the delay times using fading factors with
α = 1 − 10−5 and α = 1 − 10−4 , respectively.
5.5 Notes
Performance assessment, design of experimental work, and model selection
are fundamental topics in science in general, and in Statistics (Bhattacharyya
and Johnson, 1977), Artificial Intelligence (Cohen, 1995), Machine Learning
(Mitchell, 1997), and Data Mining (Hastie et al., 2000) in particular. The
topic of model selection (Schaffer, 1993) is of great interest for obvious
reasons. Some general methods include Occam's razor (Domingos, 1998), minimum
description length (Grünwald, 2007), the Akaike information criterion (Akaike,
1974), risk minimization (Vapnik, 1995), etc. A theoretical comparison between
some of these methods appears in (Kearns et al., 1997). Some referential works
in the area of Machine Learning, highly critical of some usual practices,
appear in (Dietterich, 1996; Salzberg, 1997), and more recently in (Demsar,
2006). More advanced topics in evaluation and model selection include the
receiver operating characteristic, ROC curves and the AUC metric (Hand and
Till, 2001; Fürnkranz and Flach, 2005).
6.1 Introduction
Major clustering approaches in traditional cluster analysis include:
• Additivity: if A1 and A2 are disjoint sets, merging them is equal to the
sum of their parts. The additive property allows us to merge sub-clusters
incrementally.
X̄0 = LS / N
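A clustering feature of the BIRCH family, CF = (N, LS, SS), illustrates the additive property: merging two disjoint sub-clusters is a component-wise sum, and the centroid is recovered as LS/N, matching the formula above. A minimal sketch:

```python
# A sketch of an additive clustering-feature (CF) summary: the count N, the
# linear sum LS and the squared sum SS of the points, per dimension.
class CF:
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum per dimension
        self.ss = [0.0] * dim   # squared sum per dimension

    def add(self, point):
        self.n += 1
        for i, v in enumerate(point):
            self.ls[i] += v
            self.ss[i] += v * v

    def merge(self, other):
        """Additivity: the CF of A1 union A2 is the component-wise sum."""
        out = CF(len(self.ls))
        out.n = self.n + other.n
        out.ls = [a + b for a, b in zip(self.ls, other.ls)]
        out.ss = [a + b for a, b in zip(self.ss, other.ss)]
        return out

    def centroid(self):
        return [v / self.n for v in self.ls]
```

Because the triple is a sufficient statistic, quantities such as the cluster radius and diameter can also be derived from it without storing the points.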
conclude “The main positive result is that the single pass k-means algorithm,
with a buffer of size 1% of the input dataset, can produce clusters of almost
the same quality as the standard multiple pass k-means, while being several
times faster.”.
The Very Fast k-means algorithm (VFKM) (Domingos and Hulten, 2001)
uses the Hoeffding bound (Hoeffding, 1963) to determine the number of
examples needed in each step of a k-means algorithm. VFKM runs as a sequence
of k-means runs, with an increasing number of examples, until the Hoeffding
bound is satisfied.
Guha et al. (2003) present an analytical study on k-median clustering of data
streams. The proposed algorithm makes a single pass over the data stream
and uses small space. It requires O(nk) time and O(n^ε) space, where k is the
number of centers, n is the number of points and ε < 1. They proved
that any k-median algorithm that achieves a constant-factor approximation
cannot achieve a better run time than O(nk).
Figure 6.1: The Clustering Feature Tree in BIRCH. B is the maximum number
of CFs in a level of the tree.
CF-tree, where each node is a tuple (Clustering Feature) that contains the
sufficient statistics describing a set of data points, and compresses all
information of the CFs below it in the tree. BIRCH only works with continuous
attributes. It was designed for very large data sets, explicitly taking into
account time and memory constraints. For example, not all data points are used
for clustering; dense regions of data points are treated as sub-clusters.
BIRCH might scan the data twice to refine the CF-tree, although it can be
used with a single scan of the data. It proceeds in two phases. The first
phase scans the database to build an initial in-memory CF-tree, a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data (see Figure 6.1). The second phase uses an arbitrary
clustering algorithm to cluster the leaf nodes of the CF-tree.
BIRCH requires two user-defined parameters: B, the branching factor or the
maximum number of entries in each non-leaf node; and T, the maximum diameter
(or radius) of any CF in a leaf node. The maximum diameter T defines the
examples that can be absorbed by a CF: increasing T, more examples can
be absorbed by a micro-cluster and smaller CF-trees are generated. When
an example is available, it traverses down the current tree from the root
until it finds the appropriate leaf. At each non-leaf node, the example
follows the closest-CF path, with respect to the L1 or L2 norms. If the
closest CF in the leaf cannot absorb the example, a new CF entry is made. If
there is no room for a new leaf, the parent node is split. A leaf node might
be expanded due to the constraints imposed by B and T. The process consists
of taking the two farthest CFs and creating two new leaf nodes. When
traversing back up, the CFs are updated.
BIRCH tries to find the best groups with respect to the available memory,
while minimizing the amount of input and output. The CF-tree grows by
aggregation, obtaining with only one pass over the data a result of complexity
O(N). However, Sheikholeslami et al. (1998) show that it does not perform
• For any time window of length h, at least one stored snapshot can be
found within 2 × h units of the current time.
6.2.4.1 Discussion
The idea of dividing the clustering process into two layers, where the first
layer generates local models (micro-clusters) and the second layer generates
global models from the local ones, is a powerful idea that has been used
elsewhere. Three illustrative examples are:
The Clustering on Demand framework (Dai et al., 2006), a system for
clustering time series. The first phase consists of one data scan for online
statistics collection and compact multi-resolution approximations, which are
designed to address the time and space constraints in a data stream
environment. The second phase applies a clustering algorithm over the
summaries. Furthermore, with the multi-resolution approximations of data
streams, flexible clustering demands can be supported. The Clustering Using
REpresentatives system (Guha et al., 1998), a hierarchical algorithm that
generates partial clusters from samples of data in a first phase, and in a
second phase clusters the partial clusters. It uses multiple representative
points to evaluate the distance between clusters, being able to adjust to
arbitrarily shaped clusters. The On Demand Classification of Data Streams
algorithm (Aggarwal et al., 2006) uses the two-layer model for classification
problems. The first layer generates the micro-clusters as in CluStream, with
the additional information of class labels. A labeled data point can only be
added to a micro-cluster belonging to the same class.
In grid clustering, the instance space is divided into a finite and
potentially large number of cells that form a grid structure. All clustering
operations are performed on the grid structure, which is independent of the
number of data points. Grid clustering is oriented towards spatio-temporal
problems. Illustrative examples that appear in the literature are (Wang et
al., 1997; Hinneburg and Keim, 1999; Park and Lee, 2004).
The Fractal Clustering (FC) system (Barbará and Chen, 2000) is a grid-based
algorithm that defines clusters as sets of points that exhibit high
self-similarity. Note that, if the attributes of a dataset obey uniformity
and independence properties, its intrinsic dimension D equals the embedding
dimension E. On the other hand, whenever there is a correlation between two
or more attributes, the intrinsic dimension of the dataset is accordingly
lower. Thus, D is always smaller than or equal to E. A dataset exhibiting
fractal behavior is self-similar over a large range of scales (Schroeder,
1991; Sousa et al., 2007). The fractal behavior of self-similar real datasets
leads to a distribution of distances that follows a power law (Faloutsos et
al., 2000). Given a dataset S of N elements and a distance function
d(si, sj), the average number k of neighbors within a distance r is
proportional to r raised to D. Thus, the number of pairs of elements within
distance r (the pair-count PC(r)) follows a power
Figure 6.2: The box-counting plot: log-log plot n(r) versus r. D0 is the
Hausdorff fractal dimension.
If n(r) is the number of cells occupied by points in the data set, the plot of
n(r) versus r in log-log scales is called the box-counting plot. The negative
value of the slope of that plot corresponds to the Hausdorff fractal dimension.
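The box-counting estimate of D0 can be sketched directly from this description: count occupied cells at several grid resolutions and fit the slope of the log-log plot. The chosen scales and the least-squares fit are illustrative, and points are assumed to be scaled into the unit hypercube.

```python
import math

# A sketch of the box-counting estimate of the fractal dimension D0: the
# negative slope of log n(r) versus log r, with r = 1/g for a grid of g
# cells per axis.
def box_counting_dimension(points, scales=(2, 4, 8, 16, 32)):
    xs, ys = [], []
    for g in scales:
        cells = set()
        for p in points:                 # points assumed scaled into [0, 1]^d
            cells.add(tuple(min(int(c * g), g - 1) for c in p))
        xs.append(math.log(1.0 / g))     # log r
        ys.append(math.log(len(cells)))  # log n(r)
    # least-squares slope of log n(r) versus log r
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return -slope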
is less affected. The FC algorithm has two distinct phases: the
initialization phase, where the initial clusters are defined, each with
sufficient points so that the fractal dimension can be computed, and the
second phase, which incrementally adds new points to the set of initial
clusters. The initialization phase uses a traditional distance-based
algorithm, presented in Algorithm 12. After the initial clustering phase, the
incremental step (Algorithm 13) processes points in a stream. For each point,
the fractal impact on cluster Ci is computed as |Fd(Ci′) − Fd(Ci)|, where Ci′
denotes Ci with the point added. The quantity mini |Fd(Ci′) − Fd(Ci)| is the
Minimum Fractal Impact (MFI) of the point. If the MFI of the point is larger
than a threshold τ, the point is rejected as an outlier; otherwise it is
included in the cluster with minimum fractal impact.
The FC algorithm processes data points in batches. A key issue, whenever a
new batch is available, is whether the current set of clusters is appropriate
for the incoming batch. Barbará and Chen (2001) extended the FC algorithm to
track the evolution of clusters. The key idea is to count the number of
successfully clustered points to guarantee high-probability clusters.
Successfully clustered points are those with MFI smaller than τ. Using the
Chernoff bound, the number of successfully clustered points must be greater
than (3(1+ε)/ε²) × ln(2/δ). The
algorithm to track the evolution of the clusters is presented in Algorithm 14.
in Wang and Wang (2003), the factors used to compute the correlation can
be updated incrementally, achieving an exact incremental expression for the
correlation:

corr(a, b) = (P − (A·B)/n) / (sqrt(A2 − A²/n) · sqrt(B2 − B²/n))    (6.2)

The sufficient statistics needed to compute the correlation are easily updated
at each time step: A = Σai, B = Σbi, A2 = Σai², B2 = Σbi², P = Σaibi.
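Maintaining the five sufficient statistics gives the exact correlation at any moment without revisiting old data. A minimal sketch of equation 6.2 (the class name is illustrative):

```python
import math

# Incremental correlation from the sufficient statistics A, B, A2, B2, P,
# updated in O(1) per observation (equation 6.2).
class IncrementalCorrelation:
    def __init__(self):
        self.n = self.A = self.B = self.A2 = self.B2 = self.P = 0.0

    def update(self, a, b):
        self.n += 1
        self.A += a          # sum of a_i
        self.B += b          # sum of b_i
        self.A2 += a * a     # sum of a_i^2
        self.B2 += b * b     # sum of b_i^2
        self.P += a * b      # sum of a_i * b_i

    def corr(self):
        num = self.P - self.A * self.B / self.n
        den = math.sqrt(self.A2 - self.A ** 2 / self.n) * \
            math.sqrt(self.B2 - self.B ** 2 / self.n)
        return num / den
```

This is the update that allows systems such as ODAC to keep pairwise dissimilarities between thousands of time series using constant memory per pair.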
Splitting Criteria. One problem that usually arises with this sort of model
is the definition of the minimum number of observations necessary to assure
convergence. A common approach relies on a user-defined parameter: after a
leaf has received at least nmin examples, it is considered ready to be tested
for splitting. Another approach is to apply techniques based on the
Hoeffding bound (Hoeffding, 1963) to solve this problem. Remember from
Section 2.2.2 that, after n independent observations of a real-valued random
variable r with range R, and with confidence 1 − δ, the true mean of r is at
least r̄ − ε, where r̄ is the observed mean of the samples and
ε = sqrt(R² ln(1/δ) / (2n)).
As each leaf is fed with a different number of examples, each cluster ck
will possess a different value for ε, designated εk. Let d(a, b) be the
distance measure between pairs of time series, and
Dk = {(xi, xj) | xi, xj ∈ ck, i < j} be the set of pairs of variables included
in a specific leaf ck. After seeing n samples at the leaf, let
(x1, y1) ∈ {(x, y) ∈ Dk | d(x, y) ≥ d(a, b), ∀(a, b) ∈ Dk} be the pair of
variables with maximum dissimilarity within the cluster ck, and, considering
Dk′ = Dk \ {(x1, y1)} in the same way, let
(x2, y2) ∈ {(x, y) ∈ Dk′ | d(x, y) ≥ d(a, b), ∀(a, b) ∈ Dk′}. Let
d1 = d(x1, y1), d2 = d(x2, y2), and let ∆d = d1 − d2 be a new random
variable, the difference between the observed values through time. Applying
the Hoeffding bound to ∆d, if ∆d > εk, we can confidently say that, with
probability 1 − δ, the difference between d1 and d2 is larger than zero, and
select (x1, y1) as the pair of variables representing the diameter of the
cluster. That is:
d1 − d2 > εk ⇒ diam(ck) = d1    (6.4)
With this rule, the ODAC system will only split the cluster when the true
diameter of the cluster is known with the statistical confidence given by the
Hoeffding bound. This rule triggers the moment the leaf has been fed with
enough examples to support the decision. Although a time series is not a
purely random variable, ODAC models the time series' first-order differences
in order to reduce the negative effect of autocorrelation on the Hoeffding
bound. Moreover, with this approach, missing values can be easily treated
with a zero value, considering that, when unknown, the time series is
constant.
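The splitting rule above can be sketched directly from the bound: compute εk for the leaf and split on the top distance only when the gap between the two largest dissimilarities exceeds it. Function names are illustrative.

```python
import math

# A sketch of the Hoeffding-bound split check: with range R and confidence
# 1 - delta, after n observations the bound is
# epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

def should_split(d1, d2, R, delta, n):
    """d1, d2: the two largest pairwise dissimilarities within the leaf.
    Split only when their gap exceeds the Hoeffding bound epsilon_k."""
    return (d1 - d2) > hoeffding_epsilon(R, delta, n)
```

As n grows, εk shrinks, so the rule eventually fires whenever a genuine gap between the two top distances exists; the tie-breaking parameter τ discussed next handles the case where no such gap exists.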
Resolving Ties. The rule presented in Equation 6.4 raises a different
problem. There might be cases where the two top-most distances are nearly or
completely equal. To distinguish between the cases where the cluster has many
variables nearly equidistant and the cases where there are two or more highly
dissimilar variables, an adjustment must be made. Having in mind the
application of the system to data streams of high dimensionality, possibly
with hundreds or thousands of variables, we turn to a heuristic approach.
Based on techniques presented in Domingos and Hulten (2000), we introduce
a parameter to the system, τ, which determines how long we let the system
check for the real diameter before forcing the splitting and aggregation
tests. At any time, if τ > εk, the system overrules the criterion of Equation
6.4, assuming the leaf has been fed with enough examples, hence it should
consider the highest distance to be the real diameter.
Expanding the Tree. When a split point is reported, the pivots are the
variables x1 and y1, where d1 = d(x1, y1), and the system assigns each of the
remaining variables of the old cluster to the cluster with the closest pivot.
The sufficient statistics of each new cluster are initialized. The total space
required by the two new clusters is always less than that required by the
previous cluster. Algorithm 16 sketches the splitting procedure.
Figure 6.3: ODAC structure evolution in a time-changing data set. Start: First
concept is defined for the data set; 50000 exs (t): Concept drift occurs in
the data set; 53220 exs (t + 3220): ODAC detects changes in the structure;
62448 exs (t + 12448, s): ODAC collapses all structure; 71672 exs (t + 21672,
s + 9224): ODAC gathers second concept and stabilizes; End: Second concept
remains in the data set and the correct final structure of the second concept
was discovered.
supported by the confidence given by the parent's consumed data. The system
decreases the number of clusters as the previous division is no longer
supported and might not reflect the best divisive structure of the data. The
resulting leaf starts new computations and a concept drift is detected.
Figure 6.3 illustrates the evolution of a cluster structure in time-changing
data.
6.4 Notes
One of the first incremental clustering algorithms was the COBWEB system
(Fisher, 1987). It belongs to the group of hierarchical conceptual clustering
algorithms. COBWEB is an incremental system that uses a hill-climbing search.
It incorporates objects, one by one, into a classification tree, where each
node is a probabilistic concept representing a class of objects. Whenever a
new observation is available, the object traverses the tree, updating counts
of sufficient statistics while descending the nodes. At each intermediate
cluster, one of several operators is chosen: classify an object according to
an existing cluster, create a new cluster, combine two clusters, or divide
one cluster into several ones. The search is guided by the category utility
evaluation function. Using COBWEB on streams is problematic because every
instance translates into a terminal node in the hierarchy, which is
infeasible for large data sets.
Another relevant work is described in Kaski and Kohonen (1994), who
developed the concept of self-organizing maps (SOM), a projection based algo-
rithm that maps examples from a k dimensional space to a low-dimensional
(typically two dimensional) space. The map seeks to preserve the topological
properties of the input space. The model was first described as an artificial
neural network by Teuvo Kohonen, and is sometimes called a Kohonen map.
Elnekave et al. (2007) present an incremental system for clustering mobile
objects: trajectories of moving objects are clustered incrementally in order to
recognize groups of objects with similar periodic (e.g., daily) mobility patterns.
Chapter 7
Frequent Pattern Mining
Frequent itemset mining is one of the most active research topics in knowledge
discovery from databases. The pioneering work was market basket analysis, espe-
cially the task of mining transactional data describing the shopping behavior of
customers. Since then, a large number of efficient algorithms have been developed.
In this chapter we review some of the relevant algorithms and their extensions
from itemsets to item sequences.
• Given:
– a set A of items,
– a multiset T of transactions, each a subset of A,
– a minimum support threshold σmin ;
• Goal:
– the set of frequent itemsets, that is, the set {I ⊆ A | σT (I) ≥ σmin }.
Since their introduction in Agrawal, Imielinski, and Swami (1993), the fre-
quent itemset (and association rule) mining problems have received a great
deal of attention. Within the past decade, hundreds of research papers have
been published presenting new algorithms or improvements on existing algo-
rithms to solve these mining problems more efficiently.
Table 7.2: A transaction database, with 10 transactions, and the search space
to find all possible frequent itemsets using the minimum support of smin = 3
or σmin = 0.3 = 30%.
Figure 7.1: The search space using the depth-first and corresponding prefix
tree for five items.
In the example of Figure 7.1, the maximal itemsets are: {b, c}, {a, c, d}, {a, c, e}, {a, d, e}.
All frequent itemsets are a subset of at least one of these sets.
The following relationship holds between these sets: Maximal ⊆ Closed ⊆
Frequent. The maximal itemsets are a subset of the closed itemsets. From
the maximal itemsets it is possible to derive all frequent itemsets (not their
support) by computing all non-empty intersections. The set of all closed item
sets preserves the knowledge about the support values of all frequent itemsets.
1982) algorithm to solve top-k queries: find the k most popular items in a
stream.
Here, we discuss a somewhat different problem. Given a stream S of N
items t1 , . . . , tN , find those items whose frequency is greater than φ × N . The
frequency of an item i is fi = |{j | tj = i}|. The exact φ-frequent items comprise
the set {i | fi > φ × N }. Heavy hitters are, in fact, frequent singleton items.
Suppose φ = 0.5, i.e., we want to find a majority element. An algorithm to
solve this problem can be stated as follows: store the first item and a counter,
initialized to 1. For each subsequent item, if it is the same as the currently
stored item, increment the counter. If it differs, and the counter is zero, then
store the new item and set the counter to 1; else, decrement the counter. After
processing all items, the algorithm guarantees that if there is a majority vote,
then it must be the item stored by the algorithm. The correctness of this
algorithm is based on a pairing argument: if every non-majority item is paired
with a majority item, then there should still remain an excess of majority
items.
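The counter-based scheme above can be sketched in a few lines of Python (the function and variable names are ours, for illustration only):

```python
def majority_candidate(stream):
    """One-pass majority-vote sketch: store one item and a counter.

    If a majority item (frequency > n/2) exists, it must be the returned
    candidate; a second pass is needed to confirm it really is a majority.
    """
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1       # adopt the new item
        elif item == candidate:
            count += 1                       # same item: increment
        else:
            count -= 1                       # different item: pair it off
    return candidate

stream = ["a", "b", "a", "c", "a", "a", "b"]
cand = majority_candidate(stream)            # "a"
confirmed = stream.count(cand) > len(stream) / 2
```

The second pass (`stream.count`) is what turns the candidate into a guarantee: the single-pass phase only promises that no item other than the candidate can be the majority.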
The algorithm proposed in Karp et al. (2003) generalizes this idea to an
arbitrary value of φ. We first note that in any dataset there are no more than
1/φ heavy hitters. The algorithm proceeds as follows (see Algorithm 18). At
any given time, the algorithm maintains a set K of frequently occurring items
and their counts. Initially, this set is empty. As we read an element from
the sequence, we either increment its count in the set K, or insert it in the
set with a count of 1. Thus, the size of the set K can keep growing. To
bound the memory requirements, we do a special processing when |K| > 1/φ:
the algorithm decrements the count of each element in the set K and deletes
elements whose count becomes zero. The key property is that any element
which occurs at least N × φ times in the sequence is in the set K. Note,
however, that not all elements occurring in K need to have frequency greater
than N × φ. The set K is a superset of the frequent items we are interested
in. To find the precise set of frequent items, another pass can be taken on the
sequence, and the frequency of all elements in the set K can be counted. The
algorithm identifies a set K of ⌊1/φ⌋ symbols guaranteed to contain I(x, φ),
using O(1/φ) memory cells.
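A sketch of this counting scheme in Python (our own naming; the book's Algorithm 18 is the reference):

```python
def frequent_candidates(stream, phi):
    """One-pass superset of the items with frequency > phi * N,
    following the decrement scheme of Karp et al. (2003)."""
    k = int(1 / phi)
    K = {}                                  # item -> count
    for item in stream:
        K[item] = K.get(item, 0) + 1
        if len(K) > k:                      # special processing
            for key in list(K):
                K[key] -= 1                 # decrement every counter
                if K[key] == 0:
                    del K[key]              # drop exhausted counters
    return K                                # superset of the heavy hitters

stream = list("aababcabcd" * 3)             # 'a' occurs 12 times out of 30
candidates = frequent_candidates(stream, phi=1/3)
# any item with frequency >= N * phi (here 10) is guaranteed to be in K
```

As the text notes, K may also contain infrequent items; a second pass over the stream is needed to filter them out.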
Most of these algorithms identify all true heavy hitters, but not all re-
ported items are necessarily heavy hitters: they are prone to false positives.
The only way to guarantee that the non-zero counters correspond to true
heavy hitters is a second scan over the stream.
Cormode and Muthukrishnan (2003) present a method which works for
insert-only and insert-delete streams, that is, it can cope with both addition
and removal of items. Cormode and Hadjieleftheriou (2009) perform a thorough
experimental study of the properties of the most relevant heavy hitters algo-
rithms. The authors conclude: The best methods can be implemented to find
frequent items with high accuracy using only tens of kilobytes of memory, at
rates of millions of items per second on cheap modern hardware.
Cormode, Korn, Muthukrishnan, and Srivastava (2008) discuss hierarchical
heavy hitters (φ-HHH) in streaming data. Given a hierarchy and a support
φ find all nodes in the hierarchy that have a total number of descendants in
the data stream no smaller than φ × N after discounting the descendant nodes
that are also φ-HHH. This is of particular interest for network monitoring (IP
clustering, denial-of-service attack monitoring), XML summarization, etc, and
explores the internal structure of data.
Mining frequent itemsets from data streams poses many new challenges.
In addition to the one-scan constraint and the limited memory requirement,
the combinatorial explosion of itemsets exacerbates the difficulties. The most
difficult problem in mining frequent itemsets from data streams is that infrequent
itemsets in the past might become frequent, and frequent itemsets in the past
might become infrequent.
We can identify three main approaches. Approaches that do not distin-
guish recent items from older ones (using landmark windows); approaches
that give more importance to recent transactions (using sliding windows or
decay factors); and approaches for mining at different time granularities. They
are discussed in the following sections.
• All item(set)s whose true frequency exceeds s × N are output. There are
no false negatives;
• No item(set) whose true frequency is less than (s − ε) × N is output;
• Estimated frequencies are less than the true frequencies by at most ε × N .
• Update itemset: For each entry (set, f, ∆), update by counting the
occurrences of set in the current batch. Delete any entry such that f +
∆ ≤ bcurrent ;
• New itemset: If a set set has frequency f ≥ β in the current batch
and does not occur in T , create a new entry (set, f, bcurrent − β).
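The guarantees and per-batch update rules above can be sketched for the simpler single-item case (our naming; itemset mining applies the same (f, ∆) bookkeeping to sets instead of items):

```python
def lossy_counting(stream, s, eps):
    """Lossy Counting simplified to single items (after Manku and Motwani).

    Entries are (count f, max error delta); items with true frequency
    above s*N are never missed, and counts err by at most eps*N."""
    width = int(1 / eps)                    # bucket width
    D = {}                                  # item -> (f, delta)
    N = 0
    for item in stream:
        N += 1
        b_current = -(-N // width)          # ceil(N / width)
        if item in D:
            f, delta = D[item]
            D[item] = (f + 1, delta)
        else:
            D[item] = (1, b_current - 1)
        if N % width == 0:                  # bucket boundary: prune
            for key, (f, delta) in list(D.items()):
                if f + delta <= b_current:
                    del D[key]
    # report items with estimated frequency at least (s - eps) * N
    return {k for k, (f, _) in D.items() if f >= (s - eps) * N}

stream = ["x"] * 50 + ["y"] * 10 + ["z"] * 10
result = lossy_counting(stream, s=0.5, eps=0.1)   # {"x"}
```

The `delta` term records how many occurrences an entry could have missed before it was (re)inserted, which is exactly what makes the no-false-negatives guarantee hold.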
For each expiring transaction of the sliding window, the itemsets in D that are
subsets of the transaction, are traversed. For each itemset, X, being visited, if
tid(X) < tid1 , f req(X) is decreased by 1; otherwise, no change is made since
the itemset is inserted by a transaction that comes later than the expiring
transaction. Then, pruning is performed on X as described before. Finally,
for each itemset, X, in D, estWin outputs X as a frequent itemset if:
1. tid(X) < tid1 and f req(X) ≥ σ × N ;
2. tid(X) > tid1 and (f req(X) + err(X)) ≥ σ × N .
Chang and Lee (2003), the same authors of estWin, proposed a decay
method for mining the most recent frequent itemsets adaptively. The effect of
old transactions is diminished by decaying the old occurrences of each itemset
as time goes by.
– X is infrequent,
– vY is the parent of vX and Y is frequent,
– if vY has a sibling, vY 0 , such that X = Y ∪ Y 0 , then Y 0 is frequent;
– X is frequent,
– ∃Y such that Y is a frequent closed itemset, Y ⊃ X, f req(Y ) =
f req(X) and Y is before X according to the lexicographical order
of the itemsets;
– X is frequent,
– vX is the parent of vY such that f req(Y ) = f req(X),
– vX is not a UGN;
Figure 7.2: The FP-tree generated from the database of Table 7.2 with
support set to 4 (a). The FP-stream structure: the pattern-tree with a tilted-
time window embedded (b).
hours, one day is built, and so on. This model allows computing the frequent
itemsets in the last hour with a precision of a quarter of an hour, the last day's
frequent itemsets with a precision of an hour, the last month's with a precision
of a day, etc. For a period of one month we need 4 + 24 + 31 = 59 units of
time. Let t1 , . . . , tn be the tilted-time windows which group the batches seen
so far. Denote the number of transactions in ti by wi . The goal is to mine
all frequent itemsets with support larger than σ over period T = tk ∪ tk+1 ∪
. . . ∪ tk0 , where 1 ≤ k ≤ k0 ≤ n. The size of T , denoted by W , is the sum of
the sizes of all time-windows considered in T . It is not possible to store all
possible itemsets in all periods. FP-stream drops the tail sequences when
∀i, n ≥ i ≥ 1 :  fI (ti ) < σ × wi   and   Σj=n..i fI (tj ) < ε × Σj=n..i wj
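One way to read the tail-pruning condition in code, assuming the windows are stored from the most recent to the oldest (a sketch with our own naming, not FP-stream's actual implementation):

```python
def droppable_tail(freqs, widths, sigma, eps):
    """How many of the oldest tilted-time windows can be dropped for an
    itemset I. freqs[i] = f_I(t_i) and widths[i] = w_i, ordered from the
    most recent window to the oldest."""
    drop = 0
    cum_f = cum_w = 0
    # scan from the oldest window towards the most recent one
    for f, w in zip(reversed(freqs), reversed(widths)):
        cum_f += f
        cum_w += w
        # the window is infrequent on its own AND cumulatively below eps
        if f < sigma * w and cum_f < eps * cum_w:
            drop += 1
        else:
            break
    return drop

# the two oldest windows are both locally and cumulatively infrequent
n_drop = droppable_tail([10, 1, 1], [20, 20, 20], sigma=0.5, eps=0.2)
```

Both clauses matter: a window may be locally infrequent yet still contribute to a frequent count over a longer period, so only the cumulative test makes dropping safe.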
Figure 7.3: Stream model with 3 different sequences ids with their associated
transactions.
of the stream at the time of arrival of the tth point is proportional to the prob-
ability p(r, t) of the rth point belonging to the reservoir at the time of arrival
of the tth point. The use of a bias function guarantees that recent points ar-
riving over the stream have higher probabilities to be inserted in the reservoir.
However, it is still an open problem to determine if maintenance algorithms
can be implemented in one pass, but the author in Aggarwal (2006) exploits
some properties of a class of memory-less bias functions: the exponential bias
function, which is defined as follows: f (r, t) = e−λ(t−r) , with parameter λ
being the bias rate. The inclusion of such a bias function enables the use of a
simple and efficient replacement algorithm. Furthermore, this special class of
bias functions imposes an upper bound on the reservoir size which is indepen-
dent of the stream length. This is a very interesting property since it means
that the reservoir can be maintained in main-memory independently of the
stream’s length.
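The replacement algorithm for the exponential bias function can be sketched as follows (after Aggarwal, 2006; the function and parameter names are ours):

```python
import random

def exp_biased_reservoir(stream, lam, rng=None):
    """Reservoir sampling with exponential bias f(r, t) = exp(-lam*(t - r)).

    The capacity 1/lam is independent of the stream length: every arriving
    point is added, either replacing a uniformly chosen resident point or
    filling an empty slot."""
    rng = rng or random.Random(42)
    capacity = int(1 / lam)
    reservoir = []
    for point in stream:
        fill_fraction = len(reservoir) / capacity
        if reservoir and rng.random() < fill_fraction:
            # replace a uniformly chosen resident point
            reservoir[rng.randrange(len(reservoir))] = point
        else:
            reservoir.append(point)
    return reservoir

sample = exp_biased_reservoir(range(10000), lam=0.01)
# the reservoir never exceeds 1/lam = 100 points
```

Once the reservoir is full, every arrival replaces a resident point, so older points are evicted at a rate that realizes the exponential bias.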
In order to efficiently build a biased sample over data streams, the authors
started by applying the sampling techniques over a static data set scenario.
The authors introduced several theoretical results concerning the accuracy of
the sample and the mined result given a single parameter: the user-defined
error threshold ε.3 For a random sample SD generated from a sequence database
D, the authors estimate the probability that the error rate gets higher than
the user-defined threshold ε, denoted P r[e(s, SD ) > ε], by using Hoeffding
concentration inequalities (Hoeffding, 1963). Basically, the concentration in-
equalities are meant to give an accurate prediction of the actual value of a
random variable by bounding the error term (from the expected value) with
an associated probability.
The same statistical reasoning is applied next to the case of sequential
pattern mining over data streams. However, the main challenge in
this model is that the length of the stream is unknown. Therefore, there is a
need to maintain a dynamic sample of the stream. In order to do so, Raissi
and Poncelet (2007) proved that a sampling algorithm for sequential pattern
mining should respect two conditions:
1. There must be a lower bound on the size of the sample. According to
their previous results from the static database model, this is achieved
by using an (, δ)-approximation combined with an exponential biased
reservoir sampling method;
2. The insertion and replacement operations, essential for the reservoir
updating, must be done at sequence level and at transactions level. This
is necessary to control the size of the itemsets for each sequence in the
reservoir.
The proposed approach is a replacement algorithm using an exponential
bias function that regulates the sampling of customers and their transactions
over a stream. The algorithm starts with an empty reservoir of capacity 1/λ
3 Notice that a similar approach was used for itemset mining in Toivonen (1996).
(where λ is the bias rate of our exponential bias function) and each data
point arriving from the stream is deterministically added to the reservoir by
flipping a coin, either as:
7.5 Notes
Cheng, Ke, and Ng (2008) present a detailed survey on algorithms for
mining frequent itemsets over data streams. Wang and Yang (2005) discuss
several models of sequential pattern mining from large data sets.
Raissi, Poncelet, and Teisseire (2007) proposed the system FIDS (Frequent
itemsets mining on data streams). One interesting feature of this system is
that each item is associated with a unique prime number. Each transaction is
represented by the product of the corresponding prime numbers of individual
items into the transaction. As the product of the prime number is unique we
can easily check the inclusion of two itemsets (e.g. X ⊆ Y ) by performing a
modulo division on itemsets (Y MOD X). If the remainder is 0 then X ⊆ Y ,
otherwise X is not included in Y . FIDS uses a novel data structure to maintain
frequent itemsets coupled with a fast pruning strategy. At any time, users can
issue requests for frequent sequences over an arbitrary time interval.
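The prime-number trick behind FIDS can be sketched in a few lines (the item alphabet and helper names are ours, for illustration):

```python
def first_primes(n):
    """First n primes by trial division (fine for a small item alphabet)."""
    primes = []
    c = 2
    while len(primes) < n:
        if all(c % p for p in primes):
            primes.append(c)
        c += 1
    return primes

items = ["bread", "milk", "butter", "beer"]          # hypothetical items
code = dict(zip(items, first_primes(len(items))))    # item -> unique prime

def encode(itemset):
    """Represent a transaction/itemset as the product of its items' primes."""
    prod = 1
    for item in itemset:
        prod *= code[item]
    return prod

X = encode(["bread", "milk"])
Y = encode(["bread", "milk", "beer"])
included = (Y % X == 0)      # X ⊆ Y iff Y mod X == 0; True here
```

Uniqueness of prime factorization is what makes the modulo test equivalent to set inclusion.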
Li, Shan, and Lee (2008) proposed an in-memory summary data structure
SFI-forest (summary frequent itemset forest), to maintain an approximated set
of frequent itemsets. Each transaction of the stream is projected into a set of
sub-transactions, and these sub-transactions are inserted into the SFI-forest.
The set of all frequent itemsets is determined from the current SFI-forest.
Distributed algorithms for association rule learning were presented in
Schuster, Wolff, and Trock (2005).
In the context of Data Stream Management Systems, frequent pattern
mining is used to solve iceberg queries, and computing iceberg cubes (Fang,
Shivakumar, Garcia-Molina, Motwani, and Ullman, 1998).
Chapter 8
Decision Trees from Data Streams
8.1 Introduction
Formally, a decision tree is a directed acyclic graph in which each node is
either a decision node with two or more successors or a leaf node. In the
simplest model, a leaf node is labeled with a class label and a decision node
has some condition based on attribute values. Decision trees are one of the
most used methods in data mining mainly because of their high degree of
interpretability. The hypothesis space of decision trees is within the disjunctive
normal form (DNF) formalism. Classifiers generated by those systems encode
a DNF for each class. For each DNF, the conditions along a branch represent
conjuncts and the individual branches can be seen as disjuncts. Each branch
forms a rule with a conditional part and a conclusion. The conditional part
is a conjunction of conditions. Conditions are tests that involve a particular
attribute, an operator (e.g. =, ≥, ∈), and a value from the domain of that
attribute. These kinds of tests correspond, in the input space, to a hyper-plane
that is orthogonal to the axis of the tested attribute and parallel to all other
axes. The regions produced by these classifiers are all hyper-rectangles. Each
leaf corresponds to a region. The regions are mutually exclusive and exhaustive
(i.e. cover all the instance space).
Learning decision trees from data streams is one of the most challenging
problems for the data mining community. A successful example is the VFDT
system (Domingos and Hulten, 2000). The base idea comes from the observa-
tion that a small number of examples are enough to select the correct splitting
test and expand a leaf. The algorithm makes a decision, that is, installs a split-
test at a node, only when there is enough statistical evidence in favor of a split
test. This is the case of Gratch (1996); Domingos and Hulten (2000); Gama
et al. (2003). VFDT-like algorithms can manage millions of examples using few
computational resources, with a performance similar to a batch decision tree
given enough examples.
pensive. It turns out that it is not efficient to compute H(·) every time that an
example arrives. VFDT only computes the attribute evaluation function H(·)
when a minimum number of examples has been observed since the last eval-
uation. This minimum number of examples is a user-defined parameter.
When two or more attributes continuously exhibit very similar values of
H(·), even with a large number of examples, the Hoeffding bound will not
decide between them. This situation can happen even for two or more equally
informative attributes. To solve this problem VFDT uses a constant τ intro-
duced by the user for run-off. Taking into account that ε decreases when n
increases, if ∆H < ε < τ then the leaf is transformed into a decision node.
The split test is based on the best attribute.
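The Hoeffding-bound test with the τ tie-break can be sketched as follows (the parameter defaults and names are ours, for illustration, not VFDT's actual configuration):

```python
import math

def should_split(g_best, g_second, n, delta=1e-6, tau=0.05, R=1.0):
    """VFDT-style split decision at a leaf after n examples.

    R is the range of the evaluation function H (1.0 for information
    gain with two classes); tau is the user-supplied tie-break constant."""
    # Hoeffding bound: eps shrinks as n grows
    eps = math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))
    if g_best - g_second > eps:
        return True            # clear winner: split on the best attribute
    if eps < tau:
        return True            # near-tie but enough data: force the split
    return False               # wait for more examples

split_clear = should_split(0.30, 0.10, n=1000)    # True: large gain gap
split_tie   = should_split(0.30, 0.299, n=5000)   # True: eps below tau
split_wait  = should_split(0.30, 0.299, n=100)    # False: not enough data
```

Without the τ rule, two equally informative attributes would keep the leaf waiting forever, since ∆H never exceeds ε.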
Figure 8.1: Illustrative example of a decision tree and the time-window as-
sociated with each node.
x 71 69 80 83 70 65 64 72 75 68 81 85 72 75
C + - + - - + - - - - - + + -
In VFDTc (Gama et al., 2003) the cut point is chosen from all the observed
values for that attribute in the sample at a leaf. In order to evaluate the
quality of a split, we need to compute the class distribution of the examples
at which the attribute-value is less than or greater than the cut point. These
counts are the sufficient statistics for almost all splitting criteria. They are
computed with the use of the two data structures maintained in each leaf of
the decision tree. The first data structure is a vector of the classes distribution
over the attribute-values for the examples that reach the leaf.
For each continuous attribute j, the system maintains a binary tree struc-
ture. A node in the binary tree is identified with a value i (that is the value
of the attribute j seen in an example), and two vectors (of dimension k) used
to count the values that go through that node. These vectors, VE and VH,
contain the counts of values ≤ i and > i, respectively, for the examples labeled
with each of the possible class values. When an example reaches a leaf, all the
binary trees are updated. Algorithm 21 presents the procedure to insert a value
in the binary tree. Insertion of a new value in this structure is O(log n), where
n represents the number of distinct values for the attribute seen so far.
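A sketch of the structure and its update (our own naming, simplified from VFDTc):

```python
class BTreeNode:
    """Node of the per-attribute binary tree kept at a VFDTc leaf.

    VE[k] counts examples of class k with value <= i;
    VH[k] counts examples of class k with value > i."""
    def __init__(self, value, num_classes):
        self.i = value
        self.VE = [0] * num_classes
        self.VH = [0] * num_classes
        self.left = None
        self.right = None

def insert_value(node, value, klass, num_classes):
    """Insert an observed (value, class) pair; O(log n) for n distinct values."""
    if node is None:
        node = BTreeNode(value, num_classes)
    if value <= node.i:
        node.VE[klass] += 1
        if value < node.i:                  # equal values stop at this node
            node.left = insert_value(node.left, value, klass, num_classes)
    else:
        node.VH[klass] += 1
        node.right = insert_value(node.right, value, klass, num_classes)
    return node

root = None
for v, k in [(5.0, 0), (3.0, 1), (7.0, 0), (3.0, 0)]:
    root = insert_value(root, v, k, num_classes=2)
# root.VE == [2, 1]  (three values <= 5), root.VH == [1, 0]  (one value > 5)
```

Because every node accumulates the counts of values passing through it, the class distributions on either side of any candidate cut point can later be read off in a single traversal.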
To compute the information gain of a given attribute we use an exhaustive
method to evaluate the merit of all possible cut points. In our case, any value
observed in the examples so far can be used as cut point. For each possible
cut point, we compute the information of the two partitions using equation
8.1.
inf o(Aj (i)) = P (Aj ≤ i) × iLow(Aj (i)) + P (Aj > i) × iHigh(Aj (i))   (8.1)

iLow(Aj (i)) = − ΣK P (K = k|Aj ≤ i) × log2 (P (K = k|Aj ≤ i))   (8.2)

iHigh(Aj (i)) = − ΣK P (K = k|Aj > i) × log2 (P (K = k|Aj > i))   (8.3)
These statistics are easily computed using the counts in the Btree, with
the algorithm presented in Algorithm 22. For each attribute, it is possible to
compute the merit of all possible cut points traversing the binary tree only
once.
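Equations 8.1-8.3 amount to the weighted entropy of the two partitions induced by a cut point, computed from per-class counts (a minimal sketch; the function names are ours):

```python
import math

def entropy(counts):
    """Entropy of a class-count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c) if n else 0.0

def partition_info(counts_low, counts_high):
    """info(A_j(i)) of equations 8.1-8.3: the weighted entropy of the two
    partitions induced by cut point i, given per-class counts of examples
    with value <= i (counts_low) and value > i (counts_high)."""
    n_low, n_high = sum(counts_low), sum(counts_high)
    total = n_low + n_high
    return (n_low / total) * entropy(counts_low) + \
           (n_high / total) * entropy(counts_high)

pure_split  = partition_info([4, 0], [0, 6])   # 0.0: both partitions pure
mixed_split = partition_info([1, 1], [1, 1])   # 1.0: no information
```

The cut point minimizing this quantity maximizes the information gain, which is why one traversal of the binary tree suffices to score all candidates.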
When learning decision trees from streaming data, continuous attribute
processing is a must. Jin and Agrawal (2003) present a method for numerical
interval pruning.
The quadratic discriminant splits the X-axis into three intervals (−∞, d1 ),
(d1 , d2 ), (d2 , ∞) where d1 and d2 are the possible roots of the equation 8.4
where p(i) denotes the estimated probability that an example belongs to class
i (see Figure 8.3). We prefer a binary split, so we use the root closer to the
sample means of both classes. Let d be that root. The splitting test candidate
for each numeric attribute i will use the form Atti ≤ di . To choose the best
splitting test from the candidate list we use a heuristic method. We use the
information gain to choose, from all the splitting point candidates (one for
each attribute), the best splitting test. The information kept by the tree is
not sufficient to compute the exact number of examples for each entry in the
contingency table. Doing that would require maintaining information about all
the examples at each leaf. With the assumption of normality, we can compute
the probability of observing a value less than or greater than di (see Table 8.1).
From these probabilities and the distribution of examples per class at the
leaf we populate the contingency table. The splitting test with the maximum
information gain is chosen. This method only requires that we maintain the
1 To solve multi-class problems, Loh and Shih (1997) use a clustering algorithm to form
two super-classes. This strategy is not applicable in the streaming setting. Gama et al.
(2004) decompose a k-class problem into k × (k − 1)/2 two-class problems,
generating a forest of trees (described in Section 10.4.2).
mean and standard deviation for each class per attribute. Both quantities are
easily maintained incrementally.
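Populating the contingency table from the per-class mean and standard deviation can be sketched with the normal CDF (our naming; a simplification of the book's procedure):

```python
import math

def prob_leq(d, mean, std):
    """P(x <= d) under the normality assumption, via the Gaussian CDF."""
    if std == 0:
        return 1.0 if mean <= d else 0.0
    return 0.5 * (1.0 + math.erf((d - mean) / (std * math.sqrt(2.0))))

def contingency_table(d, class_stats, class_counts):
    """Estimated contingency table for the candidate test Att <= d, from
    per-class (mean, std) and the class distribution at the leaf.
    Returns, per class, the expected counts (<= d, > d)."""
    table = {}
    for k, (mean, std) in class_stats.items():
        p = prob_leq(d, mean, std)
        table[k] = (p * class_counts[k], (1.0 - p) * class_counts[k])
    return table

# a class centered exactly at the cut point splits its 10 examples 5/5
table = contingency_table(0.0,
                          {"+": (0.0, 1.0), "-": (3.0, 1.0)},
                          {"+": 10, "-": 8})
```

Only two numbers per class per attribute are stored, yet the table needed by the information-gain computation can be reconstructed on demand.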
Algorithm 23: The algorithm to compute P (xj |Ck ) for numeric at-
tribute xj and class k at a given leaf.
input : BTree: binary tree for attribute xj
nrExs: vector of the number of examples per class
Xh : the highest value of xj observed at the leaf
Xl : the lowest value of xj observed at the leaf
Nj : the number of different values of xj observed at the leaf
output: Counts: the vector of size Nintervals with the percentage of
examples per interval
begin
    if BTree == NULL then return 0;
    /* number of intervals */
    Nintervals ← min(10, |BTree|);
    /* interval range */
    inc ← (Xh − Xl )/Nintervals ;
    for i = 1 to Nintervals do
        Counts[i] ← LessThan(Xl + inc × i, k, BTree);
        if i > 1 then
            Counts[i] ← Counts[i] − Counts[i − 1];
        if xj ≤ Xl + inc × i then
            return Counts[i]/nrExs[k];
        else if i == Nintervals then
            return Counts[i]/nrExs[k];
We should note that the use of naive Bayes classifiers at tree leaves does
not introduce any overhead in the training phase. In the application phase,
for nominal attributes, the sufficient statistics contain all the information
needed for the naive Bayes tables. For continuous attributes, the naive Bayes
contingency tables are efficiently derived from the Btrees used to store the
numeric attribute-values. The overhead introduced is proportional to the depth
of the Btree, that is, at most log(n), where n is the number of different values
observed for a given attribute in a leaf.
A decision tree partitions the instance space into nested regions. The root
node covers all the instance space, and subsequent nodes in the structure
cover sub-regions of the upper nodes. Using the
tree structure we can have views of the instance space at different levels of
granularity.
This is of great interest in time-changing environments because a change, or
a concept drift, may affect only some regions of the instance space, and not
the instance space as a whole. Adaptation of global models
(like naive Bayes, discriminant functions, SVM) requires reconstruction of the
decision model. Granular decision models (like decision rules and decision
trees) can adapt parts of the decision model. They need to adapt only those
parts that cover the region of the instance space affected by drift. In decision
models that fit different functions to regions of the instance space, like Decision
Trees and Rule Learners, we can use the granularity of the model to detect
regions of the instance space where change occurs and adapt the local model,
with the advantage of fast adaptation. Nodes near the root should be able to
detect abrupt changes in the distribution of the examples, while deeper nodes
should detect localized, smooth, and gradual changes.
We should distinguish the task of detecting a change from the task
of reacting to a change. In this section we discuss methods that exploit the
tree structure to detect and react to changes.
The observation that Hoeffding trees define time-windows over the stream
leads to a straightforward change detection method: compare the distribution
of the target attribute at a node with the sum of the distributions in all
leaves of the sub-tree rooted at that node. The techniques presented in Kifer
et al. (2004) for comparing two distributions can be used for this purpose.
We should note that maintaining appropriate sufficient statistics at each node
enables the test to be performed on the fly whenever a new labeled example
is available. This strategy was used in Gama et al. (2006).
Figure 8.4 presents the path of an example traversing a tree. Error distri-
butions can be estimated in the descending path or in the ascending path. In
the descending path, the example is classified at each node, as if that node was
a leaf. In the ascending path, the class assigned to the example is propagated
upwards.
Comparing the distributions is appealing but might lead to relatively high
time delay in detection. A somewhat different strategy, used in Gama and
Medas (2005), exhibits faster detection rates. The system maintains a naive-
Bayes classifier at each node of the decision tree. The drift detection algorithm
described in Chapter 3.3 monitors the evolution of the naive-Bayes error rate.
It signals a drift whenever the performance goes to an out-of-control state.
Figure 8.5: The Hyper-plane problem: two classes described by two continuous
attributes. The dataset is composed of a sequence of points, first from concept
1 (A), followed by points from concept 2 (B). The second row presents the
projection of the final decision tree over the instance space: Figure C is the
tree without adaptation; Figure D is the pruned tree after detection of a
change. Although both have similar holdout performance, the pruned tree
represents the current concept much better.
They have other desirable properties that are specific to this algorithm:
• Convergence. Domingos and Hulten (2000) show that the trees produced
by the VFDT algorithm are asymptotically close to the ones generated by
a batch learner.
model designed to minimize the total number of predicting attributes. The un-
derlying principle of the IN-based methodology is to construct a multi-layered
network in order to maximize the Mutual Information (MI) between the input
and the target attributes. Each hidden layer is uniquely associated with a spe-
cific predictive attribute (feature) and represents an interaction between that
feature and features represented by preceding layers. Unlike popular decision-
tree algorithms such as CART and C4.5, the IFN algorithm uses a pre-pruning
strategy: a node is split if the split leads to a statistically significant decrease
in the conditional entropy of the target attribute (equal to an increase in the
conditional mutual information). If none of the remaining candidate input
attributes provides a statistically significant increase in the mutual informa-
tion, the network construction stops. The output of the IFN algorithm is a
classification network, which can be used as a decision tree to predict the
value (class) of the target attribute. For continuous target attributes, each
prediction refers to a discretization interval rather than a specific class.
The OLIN algorithm extends the IFN algorithm for mining continuous and
dynamic data streams (Cohen et al., 2008). The system repeatedly applies the
IFN algorithm to a sliding window of training examples and changes the size
of the training window (and thus the re-construction frequency) according to
the current rate of concept drift. The purpose of the system is to predict, at
any time, the correct class for the next arriving example. The architecture of
the OLIN-based system is presented in Figure 8.7.
The online learning system contains three main parts: the Learning Module
is responsible for applying the IN algorithm to the current sliding window of
examples, should conform to the same distribution. Thus, the error rates in
classifying those examples using the current model should not be significantly
different. On the other hand, a statistically significant difference may indicate
a possible concept drift. The variance of the differences between the error rates
is calculated by the following formula based on a Normal Approximation to
the Binomial distribution:
W = χ²α (NIi − 1)(NT − 1) / [2 ln 2 (H(T ) − H(Etr ) − Etr log2 (NT − 1))]
where α is the significance level used by the network construction
algorithm (default: α = 0.1%), NIi is the number of values (or discretized
intervals) of the first input attribute Ai in the info-fuzzy network, NT is
the number of target values, H(T ) is the entropy of the target, and Etr is
the training error of the current model. In addition, the number of examples
in the next validation interval is reduced by Red Add Count. Otherwise, the
concept is considered stable and both the training window and the validation
interval are increased by Add Count examples up to their maximum sizes of
M ax Add Count and M ax W in, respectively.
8.5 Notes
Decision trees are one of the most studied methods in Machine Learning.
The ability to induce decision trees incrementally appears in the Machine
Learning community under several designations: incremental learning, online
learning, sequential learning, theory revision, etc. In systems like ID4 (Van de
Velde, 1990), ITI (Utgoff et al., 1997), or ID5R (Kalles and Morris, 1996), a
tree is constructed using a greedy search. Incorporation of new information
Chapter 9
Novelty Detection in Data Streams
9.1 Introduction
Novelty detection makes it possible to recognize novel profiles (concepts)
in unlabeled data, which may indicate the appearance of a new concept, a
change in known concepts, or the presence of noise. The discovery
of new concepts has increasingly attracted the attention of the knowledge
discovery community, usually under the terms novelty (Marsland, 2003) or
anomaly detection. It is a research field somewhat related to statistical outlier
detection (Barnett and Lewis, 1995). Since, to be able to detect novelty, the ML
technique must allow the learning of a single target or normal concept, the
terms one-class (Tax, 2001) or single-class classification are also frequently
used. The absence of negative examples, which would represent the unknown
novelty in this scenario, makes it hard for the induction algorithm to establish
adequate decision boundaries and, thus, to avoid the extremes of underfitting
and overfitting. This problem has also been studied under the term learning
from positive-only examples.
Besides recognizing novelty, a learning environment that considers the time
variable also requires that the ML technique be able to identify changes that
occur in the known or normal concept, a problem addressed under the
term concept drift (see Section 3.3).
This chapter reviews the major approaches for the detection of novelty in
data streams.
1 Based on joint work with Eduardo Spinosa and Andre de Carvalho.
138 Knowledge Discovery from Data Streams
We can identify two main lines where novelty detection concepts are used:
• The decision model acts as a detector. The requirement is to detect
whether an observation is part of the data the classifier was trained
on or is in fact unknown.
• The decision model is able to learn new characteristic descriptions of
the new concepts that appear in the test examples.
In either case, the learning system initially has knowledge about a single
concept: the normal behavior of the system. New unlabeled examples may be
identified as members of that profile or not. The difficulty of this context lies
in the fact that examples of abnormal behaviors are usually scarce or not
available, since in most applications it is hard and expensive to simulate
abnormal scenarios. Besides that, it would be infeasible to simulate
all possible abnormal situations. For that reason, novelty detectors can rely
only on examples of a single profile that represents the normal behavior, hence
the terms one-class, single-class classification, or learning from positive-only
examples.
Several machine learning techniques have been adapted to work in a one-
class scenario. One-class classification techniques are learning methods that
proceed by recognizing positive instances of a concept rather than discrim-
inating between its positive and negative instances. This restriction makes
it more challenging to obtain an adequate level of generalization, since
counter-examples play an important role in that task. A general method
consists of estimating the probability density function of the normal class, for
example using Parzen windows, and setting a rejection threshold.
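This general method can be sketched as follows. The function name, the bandwidth, and the 5% rejection quantile are illustrative choices, not taken from any specific system:

```python
import numpy as np

def parzen_one_class(train, bandwidth=0.5, quantile=0.05):
    """One-class classifier: a Gaussian Parzen-window density estimate of
    the normal class, with a rejection threshold chosen so that `quantile`
    of the training examples would themselves be rejected."""
    train = np.asarray(train, dtype=float)

    def density(x):
        # Average of Gaussian kernels centered on each training point
        d = (x - train) / bandwidth
        return np.mean(np.exp(-0.5 * d * d)) / (bandwidth * np.sqrt(2 * np.pi))

    threshold = np.quantile([density(x) for x in train], quantile)
    return lambda x: density(x) >= threshold   # True = normal, False = novelty

rng = np.random.default_rng(0)
normal_data = rng.normal(size=500)       # the single "normal" concept
detector = parzen_one_class(normal_data)
```

A point near the normal concept (e.g. 0.1) is accepted, while a far-away point (e.g. 8.0) falls below the density threshold and is rejected as novelty.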
Li, Zhang, and Li (2009) present a VFDT-like algorithm for one-class clas-
sification, using these equations. Given that PosLevel is unknown, the
authors enumerate nine possible values of PosLevel, from 0.1 to 0.9, and learn
nine different trees. The best tree is chosen by estimating the classification
performance of the trees on a set of validation samples.
                 Actual
                 Pos   Neg
Predicted  Pos   TP    FP
           Neg   FN    TN
positive examples (labeled +1), while the rest are unlabeled examples, which
we label −1. If the sample size is large enough, minimizing the number of unla-
beled examples classified as positive while constraining the positive examples
to be correctly classified will give a good classifier. The following soft margin
version of the Biased-SVM (Liu et al., 2003) formulation uses two parameters
C+ and C− to weight positive errors and negative errors differently:
Minimize: \frac{1}{2} w^T w + C_+ \sum_{i=1}^{k-1} \xi_i + C_- \sum_{i=k}^{n} \xi_i
• Recall = \frac{TP}{TP + FN} (also called Sensitivity)
• Specificity = \frac{TN}{TN + FP}
F_{measure} = \frac{(w + 1) \times Recall \times Precision}{Recall + w \times Precision}

F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
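As a minimal illustration, these metrics can be computed directly from the confusion-matrix counts (the function name and example counts are ours):

```python
def stream_metrics(tp, fp, fn, tn, w=1.0):
    """Evaluation metrics from confusion-matrix counts; w is the weight of
    the general F-measure (w = 1 gives F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called sensitivity
    specificity = tn / (tn + fp)
    f = (w + 1) * recall * precision / (recall + w * precision)
    return precision, recall, specificity, f
```

For example, `stream_metrics(tp=40, fp=10, fn=10, tn=40)` gives 0.8 for all four metrics.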
can be performed using integer additions and logical shifts, making it both
faster and more cost-effective than other alternatives (Wu et al., 2004).
Figure 9.3: Overview of the Online Novelty and Drift Detection Algorithm.
normal class. This initial phase is offline. The decision model, a set of hyper-
spheres labeled as normal, is learned using a k-means algorithm.
After learning the initial model, OLINDDA enters an online phase in which the
system receives a stream of unlabeled examples. Each example is compared to
the current model. If any hypersphere covers the example, that is, if the distance
between the example and the centroid of the hypersphere is less than the
radius, the example is classified with the label of the covering hypersphere.
Otherwise, since the current decision model cannot classify the example, it is
declared unknown and stored in a short-term memory for later analysis.
From time to time, the examples stored in the short-term memory are analyzed;
this happens whenever their number exceeds a user-defined threshold.
The analysis again employs a standard clustering algorithm. OLINDDA looks
for clusters with a minimum density, and considers three different cases:
• Clusters that satisfy the density criteria and are far from known con-
cepts. They are declared novelties and added to the decision model with
new labels.
• Clusters that satisfy the density criteria and are close to known con-
cepts. They are declared extensions of existing concepts and added to
the decision model with the same label of the closest concept.
• Clusters that are sparse and do not satisfy the density criteria. They are
considered noise, thus, are not added to the decision model.
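The online phase described above can be sketched roughly as follows. This is an illustrative simplification, not the original OLINDDA implementation: for brevity it forms a single candidate cluster instead of running k-means, uses a crude density proxy, and relies on arbitrary thresholds (`buffer_size`, `min_density`, `far_factor`):

```python
import numpy as np

class NoveltyDetector:
    """Sketch of an OLINDDA-style online phase (simplified on purpose)."""

    def __init__(self, spheres, buffer_size=30, min_density=0.5, far_factor=3.0):
        self.spheres = list(spheres)   # (centroid, radius, label) from the offline phase
        self.buffer = []               # short-term memory of unknown examples
        self.buffer_size = buffer_size
        self.min_density = min_density
        self.far_factor = far_factor
        self.next_label = max(label for _, _, label in self.spheres) + 1

    def process(self, x):
        x = np.asarray(x, dtype=float)
        for centroid, radius, label in self.spheres:
            if np.linalg.norm(x - centroid) <= radius:
                return label           # covered by a known concept
        self.buffer.append(x)          # unknown: store for later analysis
        if len(self.buffer) >= self.buffer_size:
            self._analyse()
        return None                    # declared "unknown" for now

    def _analyse(self):
        data = np.stack(self.buffer)
        centroid = data.mean(axis=0)   # one candidate cluster, for brevity
        radius = np.linalg.norm(data - centroid, axis=1).max()
        density = len(data) / max(radius, 1e-12)
        if density >= self.min_density:    # dense enough to be a valid cluster
            nearest = min(self.spheres,
                          key=lambda s: np.linalg.norm(centroid - s[0]))
            if np.linalg.norm(centroid - nearest[0]) > self.far_factor * nearest[1]:
                label = self.next_label    # far from known concepts: novelty
                self.next_label += 1
            else:
                label = nearest[2]         # close: extension of a known concept
            self.spheres.append((centroid, radius, label))
        self.buffer = []                   # sparse clusters are dropped as noise
```

Starting from one "normal" hypersphere, a run of far-away unknown examples eventually produces a new hypersphere with a new label, after which such examples are classified as the novel concept.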
and its radius is the Euclidean distance from the centroid to the farthest
example of the respective cluster.
dens(c_i) = \frac{n(c_i)}{vol(c_i, m)} \quad (9.2)

where n(c_i) is the number of examples that belong to c_i and vol(c_i, m)
is the volume of the hypersphere whose radius r is the distance from
the centroid to the farthest example of c_i. The volume vol(c_i, m) in an
m-dimensional space is given by:
vol(c_i, m) = \frac{\pi^{m/2}\, r^m}{\Gamma\left(\frac{m}{2} + 1\right)} \quad (9.3)
\Gamma\left(\frac{m}{2} + 1\right) =
\begin{cases}
\left(\frac{m}{2}\right)! & \text{for even } m; \\
\sqrt{\pi}\, \dfrac{m!!}{2^{(m+1)/2}} & \text{for odd } m.
\end{cases} \quad (9.4)
This criterion may not be applicable to data sets with a large number of
attributes, since as m increases, vol(c_i, m) tends to zero (Stibor et al.,
2006).
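This dimensionality problem is easy to see numerically; `math.gamma` covers both the even and odd cases of Equation 9.4:

```python
from math import pi, gamma

def hypersphere_volume(r, m):
    """Volume of an m-dimensional hypersphere of radius r (Equation 9.3);
    math.gamma subsumes the even/odd cases of Equation 9.4."""
    return pi ** (m / 2) * r ** m / gamma(m / 2 + 1)

# The unit-sphere volume peaks around m = 5 and then collapses toward zero,
# which is why the density criterion of Equation 9.2 degrades in high dimensions.
volumes = {m: hypersphere_volume(1.0, m) for m in (2, 5, 20, 50)}
```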
• Sum of squares of distances between examples and centroid
divided by the number of examples. The sum of squares of distances
between the examples belonging to c_i and the centroid \mu_i is given by:

d(x_j, \mu_i) = \sum_{x_j \in c_i} (x_j - \mu_i)^2 \quad (9.5)

d_1(x_j, \mu_i) = \frac{d(x_j, \mu_i)}{n(c_i)} \quad (9.6)
senting the normal concept) in the initial phase, and allow OLINDDA to discover
the remaining classes as novel concepts. In that scenario, our final goal would
be to have produced a class structure as similar as possible to the real one,
and the merging of concepts helps direct the algorithm toward that.
9.6 Notes
Novelty detection is a young and active research line. Traditional ap-
proaches to classification require labeled examples to train classifiers, and
classifier predictions are restricted to the set of class labels on which they
were trained. In a data stream, when a new concept emerges, all instances
belonging to this new class will be misclassified until a human expert
recognizes the new class, manually labels examples, and trains a new classifier.
Empowering machines with the ability to change and to incorporate knowledge
is one of the greatest challenges currently faced by Machine Learning
researchers. We believe that one way to confront this challenge is by
approaching the detection of both novelty and concept drift by means of a
single strategy.
The relation and the frontiers between novelty detection and clustering
are still unclear. As in clustering, novelty detection learns new concepts from
unlabeled examples. Nevertheless, in contrast to clustering, novelty detection
systems have a supervised initial phase.
Fan et al. (2004) use active mining techniques in conjunction with a clas-
sifier equipped with a novelty detection mechanism. The examples rejected by
the novelty detection system are requested to be labeled by an expert. This
approach is useful in the context of data streams where the target label is not
immediately available but change detection needs to be done immediately, and
model reconstruction needs to be done whenever the estimated loss is higher
than a tolerable maximum.
Chapter 10
Ensembles of Classifiers
10.1 Introduction
Hansen and Salamon (1990) first introduced the hypothesis that an en-
semble of models is most useful when its member models make errors inde-
pendently of one another. They proved that when all the models have the
same error rate, that error rate is less than 0.5, and the models make errors
completely independently, then the expected ensemble error decreases
monotonically with the number of models.
Figure 10.1 presents an example that clearly illustrates how and why an
ensemble of classifiers works. Figure 10.1(a) shows the variation of the error
rate obtained by varying the number of classifiers in an ensemble. This is a
simulation study, in a two class problem. The probability of observing each
class is 50%. A varying number of classifiers, from 3 to 24, are used to classify,
by uniform vote, each example. The classifiers have the same probability of
making an error, but errors are independent of one another.
We consider three scenarios:
• When it is equal to 50% (random choice) the error rate of the ensemble
stays constant;
Figure 10.1: (a) Error rate versus number of classifiers in an ensemble. (b)
Probability that exactly n of 24 classifiers will make an error.
• When this probability is 45%, the error rate of the ensemble monotoni-
cally decreases;
• When this probability is greater than 50% (worse than random), the
error rate of the ensemble monotonically increases.
This study illustrates a necessary condition:
The error of the ensemble decreases, with respect to each individual
classifier, if and only if each individual classifier performs
better than random choice.
In Figure 10.1(b) each bar represents the probability that exactly i clas-
sifiers are in error. In this case we use an ensemble of twenty-four classifiers,
each one having an error rate of 30%. Using uniform voting, the ensemble is in
error if and only if twelve or more classifiers are in error. If the error rates of
n classifiers are all equal to p (p < 0.5) and the errors are independent, then
the probability that the majority vote is wrong can be calculated using the
area under the curve of a binomial distribution: we can estimate the probabil-
ity that more than n/2 classifiers are wrong (Dietterich, 1997). Figure 10.1(b)
shows this area for the simulation of twenty-four classifiers. The area under
the curve for more than twelve classifiers is 2%, which is much less than the
error rate of the individual classifiers (30%).
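The binomial-tail argument can be checked numerically (the function name is ours):

```python
from math import comb

def ensemble_error(n, p, k):
    """Probability that at least k of n independent classifiers, each with
    error rate p, are wrong simultaneously (a binomial tail)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

With n = 24 and p = 0.3, the tail beyond half of the ensemble is on the order of a few percent, matching the small area visible in Figure 10.1(b); with p = 0.5 the tail stays near one half, which is the "stays constant" scenario above.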
Two models are said to make a correlated error when they both classify
an example of class i as belonging to class j, i ≠ j. The degree to which
the errors of two classifiers are correlated can be quantified using the error
correlation metric. We define the error correlation between a pair of classifiers
as the conditional probability that the two classifiers make the same error
given that at least one of them makes an error. This definition of error
correlation lies in the interval [0, 1], and the correlation between one classifier
and itself is 1. The formal definition is:
Ensembles of Classifiers 159
\phi_{ij} = p\left(\hat{f}_i(x) = \hat{f}_j(x) \mid \hat{f}_i(x) \neq f(x) \vee \hat{f}_j(x) \neq f(x)\right)
The error correlation measures the diversity between the predictions of
two algorithms.
Later on, the same author presented the Weighted-Majority Algorithm (Lit-
tlestone and Warmuth, 1994) to combine predictions from a set of base classi-
fiers. The main advantage of this algorithm is that the error of the ensemble
can be bounded with respect to the best expert in the ensemble. The Weighted-
Majority Algorithm (WMA) (Algorithm 24) receives as input a set of predic-
tors and a sequence of examples. Predictors can be experts, learned models,
attributes, etc.; the only requirement is that each one makes a prediction.
Each predictor is associated with a weight, set to 1 in the initial phase.
The examples can be labeled or unlabeled. For each example in the se-
quence, the algorithm makes predictions by taking a weighted vote among the
pool of predictors. Each predictor classifies the example. The algorithm sums
the weights of the predictors that vote for the same class, and assigns the
example to the class with the largest total weight. For labeled examples, WMA
updates the weight associated with each predictor. This is the sequential learn-
ing step: the weight of each predictor that made a wrong prediction is
multiplied by a factor β (for example 1/2), so the votes of these predictors
carry less weight in the following predictions.
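A minimal sketch of WMA as just described; predictors are modeled as plain functions, and details such as tie-breaking and the data representation are our own simplifications:

```python
def weighted_majority(predictors, stream, beta=0.5):
    """Weighted-Majority sketch: predictors are functions x -> label;
    labeled examples multiply the weight of mistaken predictors by beta."""
    weights = [1.0] * len(predictors)
    predictions = []
    for x, y in stream:                  # y may be None for unlabeled examples
        votes = {}
        for w, h in zip(weights, predictors):
            pred = h(x)
            votes[pred] = votes.get(pred, 0.0) + w
        predictions.append(max(votes, key=votes.get))   # most-weighted class
        if y is not None:                # sequential learning step
            weights = [w * beta if h(x) != y else w
                       for w, h in zip(weights, predictors)]
    return predictions, weights
```

With two constant experts and a stream whose true label is always 1, the wrong expert's weight is halved on every labeled example while the correct expert keeps weight 1.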
The Weighted-Majority algorithm has an interesting property:
P(k) = \frac{\exp(-1)}{k!} \quad (10.1)
As each training example arrives, for each base model we draw
k ∼ Poisson(1) and update the base model k times. Equation 10.1 comes
from the fact that, as the number of examples tends to infinity, the binomial
distribution of k tends to a Poisson(1) distribution. This way, we remove the
dependence on the number of examples, and obtain a bagging algorithm for
open-ended data streams.
The algorithm is presented in Algorithm 25 and illustrated in Figure 10.2.
Unlabeled examples are classified in the same way as in bagging: uniform
voting over the M decision models. Online bagging provides a good approxi-
mation to batch bagging given that their sampling methods generate similar
distributions of bootstrap training sets.
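The update rule can be sketched as follows; `poisson1` samples Equation 10.1 by inversion of the CDF, and the base-model interface (`update`) is an assumption of this sketch, not prescribed by the original algorithm:

```python
import math
import random

def poisson1(rng=random):
    """Sample k ~ Poisson(1) by CDF inversion: P(k) = exp(-1) / k!."""
    u, k = rng.random(), 0
    p = cumulative = math.exp(-1)
    while u > cumulative:
        k += 1
        p /= k                 # P(k) = P(k-1) / k for Poisson(1)
        cumulative += p
    return k

def online_bagging_update(models, x, y):
    """Oza-style online bagging step: each base model is trained on the
    example k ~ Poisson(1) times, approximating bootstrap resampling."""
    for model in models:
        for _ in range(poisson1()):
            model.update(x, y)   # assumed incremental-learner interface
```

The empirical mean of `poisson1` samples is close to 1, mirroring the fact that each example appears on average once in a bootstrap replicate.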
Figure 10.3: Illustrative example of online boosting. The weights of the ex-
amples are represented by boxes, whose height denotes the increase (or
decrease) of the weight.
of decision trees. Option trees can include option nodes, which replace a single
decision by a set of decisions. An option node is like an or node in and-or
trees. Instead of selecting the 'best' attribute, all the promising attributes
are selected, and for each selected attribute a decision tree is constructed.
Note that an option tree can have three types of nodes: nodes with only one
test attribute (decision nodes), nodes with disjunctions of test attributes
(option nodes), and leaf nodes.
Classification of an instance x using an option tree is a recursive procedure:
• For a leaf node, return the class label predicted by the leaf.
• For a decision node the example follows the unique child that matches
the test outcome for instance x at the node.
• For an option node the instance follows all the subtrees linked to the
test attributes. The predictions of the disjunctive test are aggregated by
a voting schema.
Option trees are a deterministic method known to be effective for variance
reduction. Their main disadvantage is that they require much more memory
and disk space than ordinary trees. Kohavi and Kunz (1997) claim that it is
possible to achieve a significant reduction of the error rate (in comparison
with regular trees) using option trees restricted to two levels of option nodes
at the top of the tree.
In the context of streaming data, Kirkby (2008) first proposed option trees,
an extension to the VFDT algorithm, as a method to solve ties. After processing
a minimum number of examples, VFDT computes the merit of each attribute.
If the difference between the merits of the two best attributes satisfies the
Hoeffding bound, VFDT expands the decision tree by expanding the node and
generating new leaves. Otherwise, the two attributes are in a tie. VFDT reads
more examples and the process is repeated. Processing more examples implies
a decrease of ε (see Equation 2.1). Suppose that there are two equally discrim-
inative attributes. VFDT will require too many examples until ε becomes small
enough to choose one of them. The original VFDT uses a user-defined constant
τ to solve ties: the node is expanded whenever ε < τ (see Section 8.2). A
better solution is to generate an option node, containing tests on all the at-
tributes whose difference in merit with respect to the best attribute
is less than ε.
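The tie-handling logic can be sketched as follows, assuming the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) for Equation 2.1; the function names and the merit representation are ours:

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon after n examples for an evaluation function with range R."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def split_decision(merits, R, delta, n, tau):
    """Decide between a regular split, a tie-break, and an option node.
    merits maps attribute name -> merit (e.g. information gain)."""
    ranked = sorted(merits, key=merits.get, reverse=True)
    best, second = ranked[0], ranked[1]
    eps = hoeffding_bound(R, delta, n)
    if merits[best] - merits[second] > eps:
        return "split", [best]        # clear winner: expand on `best`
    if eps < tau:
        return "split", [best]        # original VFDT tie-break
    # Option-node alternative: keep every attribute within eps of the best
    return "option", [a for a in ranked if merits[best] - merits[a] < eps]
```

With a clear winner the node is expanded on a single attribute; with two near-equal attributes and ε still above τ, the option-node variant keeps both instead of waiting for more examples.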
the classification of the training example. The weights of all the experts that
misclassified the example are decreased by a multiplicative constant β. If the
overall prediction was incorrect, a new expert is added to the ensemble with
weight equal to the total weight of the ensemble. Finally, all the experts are
trained on the example.
10.7 Notes
The study of voting systems, as a subfield of political science, economics
or mathematics, began formally in the 18th century and many proposals have
been made. The seminal forecasting paper by Granger and Newbold (1976)
stimulated a flurry of articles in the economics literature of the 1970s about
combining predictions from different forecasting models. Hansen and Sala-
mon (1990) showed the variance reduction property of an ensemble system.
Bayesian model averaging has been pointed out as an optimal method (Hoet-
ing et al., 1999) for combining classifiers. It provides a coherent mechanism for
accounting for model uncertainty. Schapire's (1990) work put ensemble systems
at the center of machine learning research, as he proved that a strong classifier
in the probably approximately correct sense can be generated by combining weak
classifiers through a boosting procedure. Lee and Clyde (2004) introduce an
on-line Bayesian version of bagging. Fern and Givan (2000) show empirical
results for both boosting and bagging-style on-line ensemble of decision trees
in a branch prediction domain. In addition, they show that, given tight space
constraints, ensembles of depth-bounded trees are often better than single
deeper trees. Ensembles of semi-random decision trees for data streams appear
in (Hu et al., 2007). In a similar line, Abdulsalam et al. (2007) present
the Streaming Random Forests algorithm, an online and incremental stream
classification algorithm that extends the Random Forests algorithm (Breiman,
2001) to the streaming setting.
Chapter 11
Time Series Data Streams
11.1.1 Trend
A moving average is commonly used in time series data to smooth out
short-term fluctuations and highlight longer-term trends or cycles. These
smoothing techniques reveal more clearly the underlying trend, seasonal, and
cyclic components of a time series.
We can distinguish averaging methods, where all data points have the same
relevance, from weighted averaging methods, where data points are associated
with a weight that strengthens their relevance. Relevant statistics in the first
group are:
Figure 11.1: Plot of the electricity load demand in Portugal in January 2008.
The time series exhibits seasonality: it clearly shows weekly patterns, working
days, and weekends.
• Moving Average
The mean of the previous n data points, updated recursively:

MA_{t+1} = MA_t - \frac{x_{t-n+1}}{n} + \frac{x_{t+1}}{n}

• Exponential Moving Average

EMA_t = \alpha \times x_t + (1 - \alpha) \times EMA_{t-1}
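Both statistics admit O(1)-per-point incremental updates, which is what makes them usable on streams. In this sketch the class names are ours, and the warm-up period (window not yet full) is handled as a running sum divided by n:

```python
from collections import deque

class MovingAverage:
    """Simple moving average over the last n points, maintained with the
    recursive update above in O(1) per point."""
    def __init__(self, n):
        self.n, self.window, self.value = n, deque(), 0.0

    def update(self, x):
        self.window.append(x)
        self.value += x / self.n
        if len(self.window) > self.n:
            self.value -= self.window.popleft() / self.n
        return self.value

class ExponentialMovingAverage:
    """EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1},
    initialized with the first observation."""
    def __init__(self, alpha):
        self.alpha, self.value = alpha, None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.alpha * x + (1 - self.alpha) * self.value
        return self.value
```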
Time Series Data Streams 175
11.1.2 Seasonality
Autocorrelation and autocovariance are useful statistics to detect periodic
signals. Since autocovariance depends on the units of measurement and is
unbounded, it is more convenient to consider autocorrelation, which is indepen-
dent of the units of measurement and lies in the interval [−1, 1].
Autocorrelation is the cross-correlation of a time series with itself. It is
used as a tool for finding repeating patterns, such as the presence of a periodic
signal. Autocorrelation is measured as the correlation between x_t and x_{t+l},
where l represents the time lag, and can be computed using Equation 11.1.

r(x, l) = \frac{\sum_{i=1}^{n-l} (x_i - \bar{x})(x_{i+l} - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad (11.1)
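Equation 11.1 translates directly to code; the synthetic period-24 series below is our own illustration of the kind of behavior seen in Figure 11.2:

```python
import numpy as np

def autocorrelation(x, lag):
    """r(x, l) of Equation 11.1: correlation between the series and
    itself shifted by `lag` positions."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    num = np.sum((x[: len(x) - lag] - xbar) * (x[lag:] - xbar))
    return num / np.sum((x - xbar) ** 2)

# A signal with period 24 (e.g. hourly data with a daily cycle) is strongly
# autocorrelated at lag 24 and anti-correlated at the half period.
t = np.arange(24 * 14)                  # two "weeks" of hourly samples
series = np.sin(2 * np.pi * t / 24)
```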
11.1.3 Stationarity
We should point out that a common assumption in many time series tech-
niques is that the data are stationary. A stationary process has the property
that the mean, variance, and autocorrelation structure do not change over
time.
A usual strategy to transform a given time series into a stationary one con-
sists of differencing the data: given the series z_t, we create the new
series y_t = z_t − z_{t−1}. Nevertheless, information about change points is of
great importance for any analysis task.
Figure 11.2: A study of the autocorrelation of the electricity load demand
in Portugal. In the plot, the x-axis represents the time lag l in the horizon
[1 hour, 2 weeks], while the y-axis presents the autocorrelation between
current time t and time t − l. The plot shows high values for time lags of
1 hour and for multiples of 24 hours. Weekly horizons (168 hours) are even
more autocorrelated in the electrical network.
AR(1): z_t = \beta_0 + \beta_1 \times z_{t-1} + \epsilon_t

The simplest method to learn the parameters of the AR(1) model is to regress
Z on lagged Z. If the model successfully captures the dependence structure
in the data, then the residuals should look i.i.d.: there should be no dependence
in the residuals.
The case of β_1 = 1 deserves special attention because of its relevance
in economic data series. Many economic and business time series display a
random-walk characteristic. A random walk is an AR(1) model with β_1 = 1:

AR(1): z_t = \beta_0 + z_{t-1} + \epsilon_t
A random walker is someone who has an equal chance of taking a step forward
or a step backward. The size of the steps is random as well. In statistics (Dillon
and Goldstein, 1984) much more sophisticated techniques are used. Most of
them are out of the scope of this book.
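Fitting AR(1) by regressing z_t on z_{t−1}, as suggested above, can be sketched with an ordinary least-squares solve; the simulated series and parameter values are illustrative:

```python
import numpy as np

def fit_ar1(z):
    """Estimate beta0, beta1 of an AR(1) model by ordinary least squares,
    regressing z_t on z_{t-1}."""
    z = np.asarray(z, dtype=float)
    X = np.column_stack([np.ones(len(z) - 1), z[:-1]])
    beta, *_ = np.linalg.lstsq(X, z[1:], rcond=None)
    residuals = z[1:] - X @ beta    # should look i.i.d. if the fit is good
    return beta, residuals

# Simulated example: z_t = 2 + 0.7 z_{t-1} + eps_t
rng = np.random.default_rng(1)
z = [0.0]
for _ in range(5000):
    z.append(2.0 + 0.7 * z[-1] + rng.normal())
(beta0, beta1), res = fit_ar1(z)
```

The recovered coefficients are close to the true (2, 0.7), and the lag-1 correlation of the residuals is near zero, as the text requires of a good fit.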
x_k = A x_{k-1} + w_k \quad (11.2)

z_k = H x_k + v_k \quad (11.3)

The variables w_k and v_k represent the process and the measurement noises,
respectively. It is assumed that both are independent, normally distributed,
centered at 0, and with covariance matrices given by Q and R, respectively.
The matrix A in Equation 11.2 relates the state of the system at time k − 1
to the current state at time k; H in Equation 11.3 relates the current state
of the system to the measurement. At a given timestep t, the filter uses the
state estimate from the previous timestep to produce an estimate of the state
at the current timestep: x̂_t^-. This predicted state estimate is also known as
the a priori state estimate because it does not include observation information
from the current timestep. Later, whenever z_t is observed, the current a priori
prediction is combined with the current observation to refine the state
estimate. This improved estimate, x̂_t, is termed the a posteriori state estimate.
We have two error estimates: the a priori error, e_k^- \equiv x_k - \hat{x}_k^-, and the
a posteriori error, \hat{e}_k \equiv x_k - \hat{x}_k.
The covariance matrix of the a priori error estimate is:

P_k^- = E\left[ e_k^- (e_k^-)^T \right] \quad (11.4)

and of the a posteriori error estimate:

P_k = E\left[ \hat{e}_k \hat{e}_k^T \right] \quad (11.5)
\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right) \quad (11.6)
The difference z_k − H x̂_k^- is the innovation, or measurement residual, and
reflects the discrepancy between the predicted measurement H x̂_k^- and the
observed value z_k. The matrix K_k in Equation 11.6 is called the Kalman gain
and minimizes the covariance matrix of the a posteriori error (Kalman, 1960;
Harvey, 1990). K_k is given by:
K_k = \frac{P_k^- H^T}{H P_k^- H^T + R} \quad (11.7)
Whenever the covariance matrix R approaches 0, the gain K_k gives more
weight to the residual; whenever the a priori error covariance P_k^- approaches 0,
the gain K_k gives less weight to the residual.
The Kalman filter estimates the state of the system using a set of recursive
equations. These equations are divided into two groups: time update equations
and measurement update equations. The time update equations are respon-
sible for projecting forward (in time) the current state and error covariance
estimates to obtain the a priori estimates for the next time step.
\hat{x}_k^- = A \hat{x}_{k-1} \quad (11.8)

P_k^- = A P_{k-1} A^T + Q \quad (11.9)

The measurement update equations incorporate the new observation into the
a priori estimates:

K_k = \frac{P_k^- H^T}{H P_k^- H^T + R} \quad (11.10)

\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right) \quad (11.11)

P_k = (I - K_k H) P_k^- \quad (11.12)
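One full recursion (time update followed by measurement update, Equations 11.8 to 11.12) can be sketched as follows; the constant-signal example at the end is our own illustration:

```python
import numpy as np

def kalman_step(x_hat, P, z, A, H, Q, R):
    """One Kalman-filter recursion: time update (11.8-11.9)
    followed by measurement update (11.10-11.12)."""
    # Time update: project state and error covariance forward (a priori)
    x_prior = A @ x_hat                                        # (11.8)
    P_prior = A @ P @ A.T + Q                                  # (11.9)
    # Measurement update: correct with the observation (a posteriori)
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)   # (11.10)
    x_post = x_prior + K @ (z - H @ x_prior)                   # (11.11)
    P_post = (np.eye(len(P)) - K @ H) @ P_prior                # (11.12)
    return x_post, P_post

# Illustration: tracking a constant scalar signal observed under noise.
A = H = np.eye(1)
Q = np.eye(1) * 1e-5      # small process noise: nearly static state
R = np.eye(1) * 0.5       # measurement noise variance
x_hat, P = np.zeros(1), np.eye(1)
rng = np.random.default_rng(0)
for _ in range(200):
    z = np.array([5.0]) + rng.normal(scale=0.7, size=1)
    x_hat, P = kalman_step(x_hat, P, z, A, H, Q, R)
```

After a few hundred noisy observations the a posteriori estimate settles near the true value 5.0, and the error covariance shrinks accordingly.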
The true state is assumed to be an unobserved Markov process (see Fig-
ure 11.3), and the measurements are the observed states of a hidden Markov
model. The Markov assumption, justifies that the true state is conditionally
independent of all earlier states given the immediately previous state.
Equations 11.8 and 11.9 project the current system state, x̂_{k−1}, and the
error covariance matrix, P_{k−1}, forward to the next timestamp k, yielding
the a priori estimates of the system state, x̂_k^-, and of the error covariance
matrix, P_k^-.
Equations 11.11 and 11.12 incorporate the measurements, z_k, into the a priori
estimates, producing the a posteriori estimates of the system state, x̂_k, and of
the error covariance matrix, P_k. After computing the a posteriori estimates,
the whole process is repeated, using the a posteriori estimates to compute new
a priori estimates. This recursive nature is one of the main advantages of the
Kalman filter: it allows easy, fast, and computationally efficient
implementations (Harvey, 1990).
The performance of the Kalman filter depends on the accuracy of the a-
priori assumptions:
\Delta_i = (y - \hat{y}) \times x_i

w_i = w_{i-1} + \mu \times \Delta_i

where w_{i−1} is the weight vector at the previous timestamp, and μ, the
learning rate, is a constant.
The main idea behind the least mean squares, or delta, rule is to use
gradient descent to search the hypothesis space of possible weight vectors and
find the weights that best fit the training examples.
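The delta rule above in code form; the target function, learning rate, and stream are illustrative:

```python
import random

def lms_update(w, x, y, mu=0.01):
    """One delta-rule step: w <- w + mu * (y - y_hat) * x."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x))
    delta = y - y_hat
    return [wi + mu * delta * xi for wi, xi in zip(w, x)]

# Learning y = 2*x1 - x2 online from a noiseless, illustrative stream
random.seed(0)
w = [0.0, 0.0]
for _ in range(5000):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    w = lms_update(w, x, 2 * x[0] - x[1], mu=0.1)
# w converges toward [2, -1]
```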
However, if the generator of the data itself evolves with time, then
this [static] approach is inappropriate and it becomes necessary for
the network model to adapt to the data continuously so as to 'track'
the time variation. This requires on-line learning techniques, and
raises a number of important issues, many of which are at present
largely unresolved and lie beyond the scope of this book.
backpropagating the error through the network are very efficient and can
follow high-speed data streams. This training procedure is robust to overfit-
ting, because each example is propagated through the network and the error
backpropagated only once. Another advantage is the smooth adaptation in
dynamic data streams where the target function gradually evolves over time.
Figure 11.4: Buffered on-line predictions: 1. new real data (r) arrives at
timestamp i, replacing the previously made prediction (o); 2. define the input
vector to predict timestamp i; 3. execute the prediction (t) for timestamp i;
4. compute the error using the predicted (t) and real (r) values; 5. back-
propagate the error one single time; 6. define the input vector to predict
timestamp i plus the requested horizon; 7. execute the prediction for the
horizon (p); 8. discard the oldest real data (d).
bagging (Breiman, 1996). We can use the dual perturb and combine method
with three goals: as a method to reduce the variance exhibited by neural net-
works; as a method to estimate a confidence for predictions (users seem more
comfortable with both a prediction and a confidence estimate on the predic-
tion), which is a very relevant point in industrial applications; and as a robust
prevention against the uncertainty of the information provided by sensors in
noisy environments. For example, if a sensor reads 100, most of the time the
real value is around 100: it could be 99 or 101. Perturbing the test example
and aggregating the predictions also reduces the uncertainty associated with
the measurement sent by the sensor.
Improving Predictive Accuracy using Kalman Filters. Our target func-
tion is a continuous and differentiable function over time. For this type of time
series, one simple prediction strategy, reported elsewhere to work well, con-
sists of predicting for time t the value observed at time t − k. A study of
the autocorrelation (Figure 11.2) in the time series used to train the neural
network reveals that for next-hour forecasts, k = 1 is the most autocorre-
lated value, while for next-day and next-week forecasts the most autocorrelated
one is the corresponding value one week before (k = 168). This very simple
predictive strategy is used as a default rule and as a baseline for comparisons:
any predictive model should improve over this naive estimation.
We use this characteristic of the time series to improve neural nets fore-
casts, by coupling both using a Kalman filter (Kalman, 1960). The Kalman
filter is widely used in engineering for two main purposes: for combining mea-
surements of the same variables but from different sensors, and for combin-
ing an inexact forecast of system’s state with an inexact measurement of
the state. We use Kalman filter to combine the neural network forecast with
the observed value at time t − k, where k depends on the horizon forecast
as defined above. The one-dimensional Kalman filter works by considering:

\hat{y}_i = \hat{y}_{i-1} + K(y_i - \hat{y}_{i-1}), \qquad \sigma_i^2 = (1 - K)\,\sigma_{i-1}^2, \qquad K = \frac{\sigma_{i-1}^2}{\sigma_{i-1}^2 + \sigma_r^2}.
1. identity: D(Q, Q) = 0;
time-stamp   1    2    3    4    5    6    7    8    9   10   11   12
Query       1.0  0.8  0.8  1.4  1.2  1.0  1.5  1.9  1.5  1.5  1.5  1.6
Reference   0.9  0.8  0.8  1.3  1.4  1.2  1.7  1.8  1.6  1.5  1.5  2.0

time-stamp  13   14   15   16   17   18   19
Query       1.8  2.8  2.5
Reference   2.5  2.7  2.9  2.5  3.1  2.4  2.9
Table 11.1: The two time-series used in the example of dynamic time-
warping.
The two time-series must have the same number of elements. The Euclidean
distance is quite efficient to compute, but not adequate as a measure of
similarity: consider two identical time-series, one slightly shifted along the
time axis; the Euclidean distance will consider them to be very different from
each other.
Figure 11.6: The two time-series. The right panel plots the reference
time-series, the middle panel plots the query time-series, and the left panel
plots both.
where dist(w_{ki}, w_{kj}) is the Euclidean distance between the two data point
indexes (one from X and one from Y) in the k-th element of the warp path.
Formally, the dynamic time warping problem is stated as follows:
– max(n, p) ≤ K < n + p
– the k-th element of the warp path is w_k = (i, j), where
∗ i is an index from time-series X,
∗ j is an index from time-series Y.
Vertical sections of the warp path mean that a single point in time series
X is warped to multiple points in time series Y; horizontal sections mean
that a single point in Y is warped to multiple points in X. Since a single point
may map to multiple points in the other time series, the time series do not
need to be of equal length. If X and Y were identical time-series, the warp
path through the matrix would be a straight diagonal line.
To find the minimum-distance warp path, every cell of a cost matrix (of
size n × p) must be filled. The value of a cell in the cost matrix is: D(i, j) =
Dist(i, j) + min[D(i − 1, j), D(i, j − 1), D(i − 1, j − 1)]. In practice very good
Figure 11.7: Alignment between the two time series. The reference time-
series was pushed-up for legibility. The path between the two time series is
W={(1,1), (2,2), (3,2), (3,3), (4,4), (5,4), (6,5), (6,6), (7,7), (8,8), (9,9), (10,9),
(11,9), (11,10), (11,11), (11,12), (12,13), (13,14), (14,14), (15,14), (16,15),
(17,15), (18,15), (19,15)}.
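The cost-matrix recurrence can be implemented directly; here a one-dimensional absolute difference stands in for Dist(i, j), and the two series from Table 11.1, of different lengths, are used as input:

```python
def dtw_distance(x, y):
    """Dynamic time warping distance via the cost-matrix recurrence
    D(i,j) = dist(i,j) + min(D(i-1,j), D(i,j-1), D(i-1,j-1))."""
    n, p = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (p + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            cost = abs(x[i - 1] - y[j - 1])   # 1-D Euclidean distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][p]

query = [1.0, 0.8, 0.8, 1.4, 1.2, 1.0, 1.5, 1.9, 1.5, 1.5, 1.5, 1.6, 1.8, 2.8, 2.5]
reference = [0.9, 0.8, 0.8, 1.3, 1.4, 1.2, 1.7, 1.8, 1.6, 1.5, 1.5, 2.0,
             2.5, 2.7, 2.9, 2.5, 3.1, 2.4, 2.9]
d = dtw_distance(query, reference)   # series of different lengths are aligned
```

Unlike the Euclidean distance, DTW assigns distance zero to a pair of identical series that are merely shifted, since a warped alignment absorbs the shift.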
• Symbolic Discretization;
• Distance Measure.
where c̄_i is the i-th element of the approximated time series, and w is a
user-defined parameter representing the number of episodes (intervals) of the
transformed time-series. If we plot a time-series in a Cartesian space, the
piecewise aggregate approximation divides the x axis into a set of intervals of
the same size.
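A sketch of the piecewise aggregate approximation, assuming for simplicity that the series length is a multiple of w:

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: divide the series into w
    equal-length intervals and represent each by its mean value."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    assert n % w == 0, "this sketch assumes n is a multiple of w"
    return series.reshape(w, n // w).mean(axis=1)
```

For example, `paa([1, 3, 2, 4, 10, 12, 11, 13], w=2)` reduces eight points to the two interval means 2.5 and 11.5.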
a      3      4      5      6      7      8      9      10
β1   -0.43  -0.67  -0.84  -0.97  -1.07  -1.15  -1.22  -1.28
β2    0.43   0     -0.25  -0.43  -0.57  -0.67  -0.76  -0.84
β3           0.67   0.25   0     -0.18  -0.32  -0.43  -0.52
β4                  0.84   0.43   0.18   0     -0.14  -0.25
β5                         0.97   0.57   0.32   0.14   0
β6                                1.07   0.67   0.43   0.25
β7                                       1.15   0.76   0.52
β8                                              1.22   0.84
β9                                                     1.28
Table 11.2: A lookup table that contains the breakpoints that divide a Gaus-
sian distribution in an arbitrary number (from 3 to 10) of equiprobable regions
where dist(q, c) can be determined using a lookup table (Lin et al., 2003).
A relevant property of MINDIST is that it lower-bounds the Euclidean
distance; that is, for all Q and S, we have MINDIST(Q̂, Ŝ) ≤ D(Q, S).
11.4.1.4 Discussion
SAX provides a symbolic representation for time-series data. Three impor-
tant properties of SAX are:
Visualizing massive time series (Lin, Keogh, and Lonardi, 2004; Lin, Keogh,
Lonardi, Lankford, and Nystrom, 2004); Clustering from streams (Keogh, Lin,
and Truppel, 2003); Kolmogorov complexity analysis in data mining (Lin,
Keogh, and Lonardi, 2004); and Finding discords in time series (Keogh, Lin,
and Fu, 2005).
11.5 Notes
Time series are a well-studied topic in statistics and signal processing.
Methods for time series analysis are often divided into two classes: frequency-
domain methods and time-domain methods. The reference technique is the
ARIMA methodology developed by Box and Jenkins (1976). The analysis of
a series of data in the frequency domain includes the Fourier transform and its
inverse (Brigham, 1988). More recent, multiresolution techniques
attempt to model time dependence at multiple scales.
Chapter 12
Ubiquitous Data Mining
Over the last few years, a new world of small and heterogeneous devices (mobile phones, PDAs, GPS devices, intelligent meters, etc.) has emerged. They are equipped with limited computational and communication power, and have the ability to sense, to communicate, and to interact over some communication infrastructure. These large-scale distributed systems must in many cases interact in real time with their users. Ubiquitous Data Mining is an emerging area of research at the intersection of distributed and mobile systems and advanced knowledge discovery systems. In this chapter we discuss some introductory issues related to distributed mining (Section 12.2) and resource-constrained mining (Section 12.4).
• The periodic approach is to simply rebuild the model from time to time;
• The incremental approach is to update the model whenever the data
changes;
• The reactive approach is to monitor the change, and rebuild the model
only when it no longer suits the data.
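The three maintenance strategies can be contrasted in a small sketch; `rebuild`, `update`, and `drift_detected` are placeholder hooks for illustration, not APIs from the book:

```python
def maintain_model(model, stream, strategy, period=1000,
                   rebuild=None, update=None, drift_detected=None):
    """Keep a model in sync with a stream under one of three policies:
    'periodic'    - rebuild the model from time to time,
    'incremental' - update the model on every example,
    'reactive'    - rebuild only when change is detected."""
    window = []
    for t, example in enumerate(stream, start=1):
        window.append(example)
        if strategy == "incremental":
            model = update(model, example)
        elif strategy == "periodic" and t % period == 0:
            model, window = rebuild(window), []
        elif strategy == "reactive" and drift_detected(model, example):
            model, window = rebuild(window), []
    return model
```

The periodic policy pays a fixed rebuild cost whether or not the data changed; the reactive policy defers that cost to a change detector, which is the theme of the rest of this chapter.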
ter 8. VFDT has the ability to deactivate all less promising leaves when the maximum available memory is reached. Moreover, memory usage is also minimized by eliminating attributes that are less promising.
Figure 12.1: The left figure plots the cover of the space: the ε-circle in white, the half-spaces defined by the tangents to vectors u in gray, and the tie regions in dark. For peer pi: X is its own estimate of the global average, A is the agreement with neighbor pj, and W is the withheld knowledge w.r.t. neighbor pj (W = X − A). In the first figure all the peer estimates are inside the circle. Both the agreement and withheld knowledge are inside too. Hence, according to the theorem, the global average is inside the circle.
every other feature. Instead, only a small group of features are usually highly
correlated with each other. This results in a sparse correlation matrix. In most
stream applications, for example those previously enumerated in this chapter,
the difference in the consecutive correlation matrices generated from two sub-
sequent sets of observations is usually small, thereby making the difference
matrix a very sparse one.
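The sparsity of the difference matrix can be illustrated with a naive dense computation; this is only the quantity being monitored, not Kargupta et al.'s divide-and-conquer method for finding the significant coefficients, and the threshold is an arbitrary illustrative choice:

```python
import numpy as np

def significant_corr_changes(window_a, window_b, threshold=0.2):
    """Correlation matrices of two subsequent observation windows
    (rows = observations, columns = features); return the feature
    pairs whose entry in the difference matrix exceeds the threshold."""
    diff = (np.corrcoef(window_b, rowvar=False)
            - np.corrcoef(window_a, rowvar=False))
    i, j = np.where(np.triu(np.abs(diff) > threshold, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

In typical streams most entries of `diff` fall below any reasonable threshold, so the returned list of changed pairs is short, which is exactly the sparsity the text appeals to.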
In the following we present the FMC algorithm, developed by Kargupta
et al. (2007), to determine the significant coefficients in a matrix generated
from the difference of the correlation matrices obtained at different times. The
method is used to:
and Var[X] ≤ 2C², where E[X] and Var[X] represent the expectation and the variance of the random variable X, respectively.
This can be used to directly look for significant changes in the correlation matrix using the divide-and-conquer strategy previously described.
or less centers are found, then a valid clustering is generated. Each local site
maintains a Parallel Guessing Algorithm using its own data source. Whenever
it reaches a solution, it sends to the coordinator the k centers and the radius
Ri . Each local site only re-sends information when the centers change. The
coordinator site maintains a Furthest Point Algorithm over the centers sent
by local sites.
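The Furthest Point Algorithm maintained by the coordinator is the classic farthest-first traversal for the k-center problem; the batch sketch below illustrates the idea (the streaming variant used over the centers sent by local sites works analogously):

```python
def furthest_point_centers(points, k):
    """Farthest-first traversal: pick an arbitrary first center, then
    repeatedly add the point furthest from the centers chosen so far.
    The resulting radius is within a factor of 2 of the optimal
    k-center radius."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist(p, c) for c in centers)))
    radius = max(min(dist(p, c) for c in centers) for p in points)
    return centers, radius
```
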
is read, or si(t) = si(t − 1), no information is sent to the central site. The process of updating the first layer works online, doing a single scan over the data stream, and is therefore able to process infinite sequences of data, processing each example in constant time and (almost) constant space. The update process of the second layer works online along with the first layer. For each new example Xi(t), the system increments the counter in the second-layer cell that contains the triggered first-layer cell, defining the discretized state si(t). The grid represents an approximate histogram of the variable produced at the local site.
The central site monitors the global state of the network at time t by combining the local discretized states: s(t) = ⟨s1(t), s2(t), ..., sd(t)⟩. Each s(t) is drawn from a finite set of cell combinations E = {e1, e2, ..., e|E|}, with |E| = w1 × w2 × · · · × wd. Given the exponential size of E, the central site monitors only a subset F of the top-m most frequent elements of E, with k < |F| ≪ |E|. Relevant focus is given to size requirements, as |E| ∈ O(k^d), but |F| ∈ O(d·k^β), with small β. Finally, the central points of the top-m frequent states are used in an online adaptive partitional clustering algorithm, which defines the current k cluster centers and is afterwards continuously adapted.
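The central site's bookkeeping can be sketched as follows; a plain counter stands in for a streaming frequent-item sketch, and the function name is illustrative:

```python
from collections import Counter

def monitor_states(local_states, m):
    """Combine per-sensor discretized states into global states
    s(t) = (s1(t), ..., sd(t)) and return the top-m most frequent ones.
    An exact Counter replaces the streaming top-m sketch for clarity."""
    counts = Counter()
    for state in map(tuple, local_states):
        counts[state] += 1
    return [s for s, _ in counts.most_common(m)]
```

An exact counter needs space proportional to the number of distinct states observed, which is why the actual system uses a frequent-item sketch with O(d·k^β) space instead.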
Figure 12.3: Example of final definition for 2 sensors data, with 5 clusters.
Each coordinate shows the actual grid for each sensor, with top-m frequent
states (shaded cells), gathered (circles) and real (crosses) centers, for different
grid resolutions.
has just become guaranteed top-m, then the clusters may have changed so we
move into a non-converged state of the system, updating the cluster centers.
Another scenario where the cluster centers require adaptation is when one or more local sites transmit their new grid intervals, which are used to define the central points of each state. In this case, we also update the centers and move to a non-converged state.
Figure 12.3 presents an example of a final grid, frequent cells and cluster
centers for a specific case with d = 2, k = 5, for different values of w and m.
The flexibility of the system is evident, as different parameter values yield results at different levels of granularity. Moreover, the continuous update keeps track of the most frequent cells, keeping the gathered centers within acceptable bounds. A good characteristic of this system is its ability to adapt to resource-restricted environments: system granularity can be defined according to the resources available in the network processing sites.
The algorithm granularity is the first generic approach to address the issue of adapting the algorithm parameters, and consequently the consumption rate of computational resources, dynamically. In the following three sections we present details of this approach, including an overview, a formalization, and a generic procedure.
Granularity as a general approach for mining data streams will require some
formal definitions and notations. The following definitions will be used next:
1. Identify the set of resources that mining algorithm will adapt accordingly
(R);
2. Set the application lifetime (ALT ) and time interval/frame (T F );
3. Define AGP (ri )+ and AGP (ri )− for every ri ∈ R;
4. Run the algorithm for T F ;
5. Monitor the resource consumption for every ri ∈ R;
6. Apply AGP(ri)+ or AGP(ri)− to every ri ∈ R according to the ratio ALT′/TF : NoF(ri) and the rule given in Section 12.4.2;
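The monitor-and-adapt loop of steps 4–6 can be sketched as follows. All names here are placeholder hooks, and the adaptation rule (compare the number of time frames a resource can still sustain against the frames remaining in the application lifetime) is one plausible reading of the ratio in step 6, not a definitive implementation:

```python
def algorithm_granularity_step(resources, ALT, TF, usage, agp_plus, agp_minus):
    """One adaptation step of the generic algorithm-granularity procedure:
    after running for a time frame TF, for each resource ri apply the
    increasing (AGP+) or decreasing (AGP-) procedure, depending on whether
    consumption is on track for the application lifetime ALT."""
    frames_left = ALT / TF                       # time frames still to serve
    for ri in resources:
        used_per_frame, remaining = usage(ri)    # monitor consumption of ri
        frames_affordable = (remaining / used_per_frame
                             if used_per_frame else float("inf"))
        if frames_affordable >= frames_left:
            agp_plus(ri)                         # resources to spare: raise granularity
        else:
            agp_minus(ri)                        # on pace to exhaust ri: lower granularity
    return frames_left
```
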
1. Mining Phase.
In this step, the algorithm threshold that can control the algorithm
output rate is determined as an initial value set by the user of the system
(or preset to a default initial value).
2. Adaptation Phase.
In this phase, the threshold value is adjusted to cope with the data rate
of the incoming stream, the available memory, and time constraints to
fill the available memory with resultant knowledge structures.
3. Knowledge Integration Phase.
This phase represents the merging of produced results when the compu-
tational device is running out of memory.
AOG has been instantiated for several data mining tasks (Gaber et al.,
2004). The approach was implemented for data stream mining in:
• Clustering: the threshold is used to specify the minimum distance be-
tween the cluster center and the data stream record. Clusters that are
within short proximity might be merged.
• Classification: in addition to using the threshold to specify the distance, the class label is checked. If a stored record and the new item/record are close (within the accepted distance) and have the same class label, the weight of the stored item is increased and stored along with the weighted average of the other attributes; otherwise the weight is decreased and the new record is ignored.
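The clustering instantiation can be sketched as a single absorb-or-create step over one-dimensional records; the data structures and names are illustrative, one plausible reading of the description above rather than Gaber et al.'s implementation:

```python
def aog_cluster_step(centers, weights, record, threshold):
    """AOG-style clustering step: if the record falls within `threshold`
    of an existing center, absorb it into that cluster (weighted
    average); otherwise create a new cluster for it."""
    for i, c in enumerate(centers):
        if abs(record - c) <= threshold:
            w = weights[i]
            centers[i] = (c * w + record) / (w + 1)   # weighted average
            weights[i] = w + 1
            return centers, weights
    centers.append(record)
    weights.append(1)
    return centers, weights
```

Raising the threshold makes absorption more likely and so lowers the output rate, which is exactly the knob the adaptation phase turns.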
This integration allows the continuity of the data mining process. Otherwise the computational device would run out of memory, even after adapting the algorithm threshold to its highest possible value, which yields the lowest possible rate of generation of knowledge structures. Figure 12.6 shows a flowchart of the AOG mining process, with the sequence of its three stages.
The algorithm output granularity approach is based on the following rules:
• The algorithm output rate (AR) is a function of the data rate (DR), i.e.,
AR = f (DR). Thus, the higher the input rate, the higher the output
rate. For example the number of clusters created over a period of time
should depend on the number of data records received over this period
of time.
• The time T M needed to fill the available memory with the algorithm results (knowledge structures, namely clusters, classification models, and frequent items) is a function of AR, i.e., T M = f (AR). Thus, a higher output rate results in a shorter time to fill the memory assigned to the application.
12.5 Notes
Scalable and distributed algorithms for decision tree learning in large and distributed networks were reported in Bar-Or et al. (2005) and Bhaduri et al.
(2008). Kargupta and Park (2001) present a Fourier analysis-based technique
to analyze and aggregate decision trees in mobile resource-aware environ-
ments. Branch et al. (2006) address the problem of unsupervised outlier de-
tection in wireless sensor networks. Chen et al. (2004) present distributed
algorithms for learning Bayesian networks.
Privacy-preserving Data Mining (Agrawal and Srikant, 2000) is an emerging research topic whose goal is to identify and disallow mining patterns that can reveal sensitive information about the data holder. Privacy preservation is quite relevant when mining sensitive distributed data.
The gossip algorithms (Boyd et al., 2006) are a class of distributed asyn-
chronous algorithms for computation and information exchange in an arbitrar-
ily connected network of nodes. Nodes operate under limited computational,
communication and energy resources. These constraints naturally give rise to
gossip algorithms: schemes which distribute the computational burden and in
which a node communicates with a randomly chosen neighbor. Bhaduri and Kargupta (2008) argue that the major drawbacks of gossip algorithms are their limited scalability and their slow response to dynamic data.
Chapter 13
Final Comments
Data Mining is faced with new challenges. All of them share common issues: data flows continuously, generated by evolving distributions; the domains involved (the sets of attribute values) can be huge; and computational resources (processing power, storage, bandwidth, and battery power) are limited. In this scenario, Data Mining approaches involving fixed training sets, static models, stationary distributions, and unrestricted computational resources are almost obsolete.
Bibliography
Bar-Or, A., R. Wolff, A. Schuster, and D. Keren (2005). Decision tree induc-
tion in high dimensional, hierarchically distributed databases. In Proceed-
ings SIAM International Data Mining Conference, Newport Beach, USA,
pp. 466–470. SIAM Press.
Barnett, V. and T. Lewis (1995). Outliers in Statistical Data (3rd ed.). John
Wiley & Sons.
Bifet, A. and R. Gavaldà (2006). Kalman filters and adaptive windows for
learning in data streams. In L. Todorovski and N. Lavrac (Eds.), Proceed-
ings of the 9th Discovery Science, Volume 4265 of Lecture Notes Artificial
Intelligence, Barcelona, Spain, pp. 29–40. Springer.
Box, G. and G. Jenkins (1976). Time series analysis: forecasting and control.
Holden-Day.
Brain, D. and G. Webb (2002). The need for low bias algorithms in classification learning from large data sets. In T. Elomaa, H. Mannila, and H. Toivonen (Eds.), Principles of Data Mining and Knowledge Discovery PKDD-02, Volume 2431 of Lecture Notes in Artificial Intelligence, Helsinki, Finland, pp. 62–73. Springer.
Branch, J. W., B. K. Szymanski, C. Giannella, R. Wolff, and H. Kargupta
(2006). In-network outlier detection in wireless sensor networks. In IEEE
International Conference on Distributed Computing Systems, pp. 51–60.
Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123–140.
Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.
Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Wadsworth International Group, USA.
Brigham, E. O. (1988). The fast Fourier transform and its applications. Pren-
tice Hall.
Broder, A. Z., M. Charikar, A. M. Frieze, and M. Mitzenmacher (2000).
Min-wise independent permutations. Journal of Computer and System Sci-
ences 60 (3), 630–659.
Buntine, W. (1990). A Theory of Learning Classification Rules. Ph. D. thesis,
University of Sydney.
Calvo, B., P. Larrañaga, and J. A. Lozano (2007). Learning Bayesian classifiers
from positive and unlabeled examples. Pattern Recognition Letters 28 (16),
2375–2384.
Carpenter, G., M. Rubin, and W. Streilein (1997). ARTMAP-FD: familiarity
discrimination applied to radar target recognition. In Proceedings of the
International Conference on Neural Networks, Volume III, pp. 1459–1464.
Castano, B., M. Judd, R. C. Anderson, and T. Estlin (2003). Machine learning
challenges in Mars rover traverse science. Technical report, NASA.
Castillo, G. and J. Gama (2005). Bias management of bayesian network classi-
fiers. In A. Hoffmann, H. Motoda, and T. Scheffer (Eds.), Discovery Science,
Proceedings of 8th International Conference, Volume 3735 of Lecture Notes
in Artificial Intelligence, Singapore, pp. 70–83. Springer.
Catlett, J. (1991). Megainduction: A test flight. In L. Birnbaum and G. Collins
(Eds.), Machine Learning: Proceedings of the 8th International Conference,
Illinois, USA, pp. 596–599. Morgan Kaufmann.
Cauwenberghs, G. and T. Poggio (2000). Incremental and decremental sup-
port vector machine learning. In Proceedings of the Neural Information
Processing Systems.
Chu, F. and C. Zaniolo (2004). Fast and light boosting for adaptive mining
of data streams. In H. Dai, R. Srikant, and C. Zhang (Eds.), PAKDD,
Volume 3056 of Lecture Notes in Computer Science, Pisa, Italy, pp. 282–
292. Springer.
Denis, F., R. Gilleron, and F. Letouzey (2005). Learning from positive and
unlabeled examples. Theoretical Computer Science 348 (1), 70–83.
Dietterich, T. (1996). Approximate statistical tests for comparing supervised classification learning algorithms. Technical Report 97.331, Oregon State University, Corvallis.
Dietterich, T. (1997). Machine learning research: four current directions. AI
Magazine 18 (4), 97–136.
Dillon, W. and M. Goldstein (1984). Multivariate Analysis, Methods and
Applications. J. Wiley & Sons, Inc.
Dobra, A. and J. Gehrke (2002). SECRET: a scalable linear regression tree
algorithm. In ACM-SIGKDD Knowledge Discovery and Data Mining, Ed-
monton, Canada, pp. 481–487. ACM.
Domingos, P. (1998). Occam’s two razor: the sharp and the blunt. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, Madison, USA, pp. 37–43. AAAI Press.
Domingos, P. and G. Hulten (2000). Mining High-Speed Data Streams. In
I. Parsa, R. Ramakrishnan, and S. Stolfo (Eds.), Proceedings of the ACM
Sixth International Conference on Knowledge Discovery and Data Mining,
Boston, USA, pp. 71–80. ACM Press.
Domingos, P. and G. Hulten (2001). A general method for scaling up ma-
chine learning algorithms and its application to clustering. In C. Brodley
(Ed.), Machine Learning, Proceedings of the 18th International Conference,
Williamstown, USA, pp. 106–113. Morgan Kaufmann.
Domingos, P. and M. Pazzani (1997). On the optimality of the simple Bayesian
classifier under zero-one loss. Machine Learning 29, 103–129.
Dougherty, J., R. Kohavi, and M. Sahami (1995). Supervised and unsuper-
vised discretization of continuous features. In Proceedings 12th International
Conference on Machine Learning, Tahoe City, USA, pp. 194–202. Morgan
Kaufmann.
Elnekave, S., M. Last, and O. Maimon (2007, April). Incremental cluster-
ing of mobile objects. In International Conference on Data Engineering
Workshop, Istanbul, Turkey, pp. 585–592. IEEE Press.
Faloutsos, C., B. Seeger, A. J. M. Traina, and C. T. Jr. (2000). Spatial join
selectivity using power laws. In Proceedings ACM SIGMOD International
Conference on Management of Data, Dallas, USA, pp. 177–188.
Fan, W. (2004). Systematic data selection to mine concept-drifting data
streams. In J. Gehrke and W. DuMouchel (Eds.), Proceedings of the Tenth
International Conference on Knowledge Discovery and Data Mining, Seat-
tle, USA, pp. 128–137. ACM Press.
Gama, J., R. Fernandes, and R. Rocha (2006). Decision trees for mining data
streams. Intelligent Data Analysis 10 (1), 23–46.
Gama, J. and P. Medas (2005). Learning decision trees from dynamic data
streams. Journal of Universal Computer Science 11 (8), 1353–1366.
Gama, J., P. Medas, and R. Rocha (2004). Forest trees for on-line data. In Pro-
ceedings of the ACM Symposium on Applied Computing, Nicosia, Cyprus,
pp. 632–636. ACM Press.
Gama, J., R. Rocha, and P. Medas (2003). Accurate decision trees for mining
high-speed data streams. In Proceedings of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Washington DC,
USA, pp. 523–528. ACM Press.
Guh, R., F. Zorriassatine, and J. Tannock (1999). On-line control chart pattern detection and discrimination - a neural network approach. Artificial Intelligence in Engineering 13, 413–425.
Guha, S. and B. Harb (2005). Wavelet synopsis for data streams: minimizing non-Euclidean error. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, New York, USA, pp. 88–97. ACM Press.
Guha, S., A. Meyerson, N. Mishra, R. Motwani, and L. O’Callaghan (2003).
Clustering data streams: Theory and practice. IEEE Transactions on
Knowledge and Data Engineering 15 (3), 515–528.
Guha, S., R. Rastogi, and K. Shim (1998). CURE: an efficient clustering algorithm for large databases. In Proceedings ACM SIGMOD International Conference on Management of Data, Seattle, USA, pp. 73–84. ACM Press.
Guha, S., K. Shim, and J. Woo (2004). REHIST: Relative error histogram
construction algorithms. In Proceedings of the 30th International Confer-
ence on Very Large Data Bases, Toronto, Canada, pp. 288–299. Morgan
Kaufmann.
Han, J. and M. Kamber (2006). Data Mining Concepts and Techniques. Mor-
gan Kaufmann.
Han, J., J. Pei, Y. Yin, and R. Mao (2004). Mining frequent patterns without
candidate generation. Data Mining and Knowledge Discovery 8, 53–87.
Hand, D. J. and R. J. Till (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45, 171–186.
Hansen, L. and P. Salamon (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10), 993–1001.
Harries, M., C. Sammut, and K. Horn (1998). Extracting hidden context.
Machine Learning 32, 101–126.
Harvey, A. (1990). Forecasting, Structural Time Series Models and the Kalman
Filter. Cambridge University Press.
Hastie, T., R. Tibshirani, and J. Friedman (2000). The Elements of Statistical
Learning, Data Mining, Inference and Prediction. Springer.
Herbster, M. and M. Warmuth (1995). Tracking the best expert. In A. Priedi-
tis and S. Russel (Eds.), Machine Learning, Proceedings of the 12th Inter-
national Conference, Tahoe City, USA, pp. 286–294. Morgan Kaufmann.
Herbster, M. and M. Warmuth (1998). Tracking the best expert. Machine Learning 32, 151–178.
Hinneburg, A. and D. A. Keim (1999). Optimal grid-clustering: Towards
breaking the curse of dimensionality in high-dimensional clustering. In Pro-
ceedings of the International Conference on Very Large Data Bases, Edin-
burgh, Scotland, pp. 506–517. Morgan Kaufmann.
Hu, X.-G., P.-P. Li, X.-D. Wu, and G.-Q. Wu (2007). A semi-random multiple decision-tree algorithm for mining data streams. Journal of Computer Science and Technology 22 (5), 711–724.
Ikonomovska, E. and J. Gama (2008). Learning model trees from data streams.
In J.-F. Boulicaut, M. R. Berthold, and T. Horváth (Eds.), Discovery Sci-
ence, Volume 5255 of Lecture Notes in Computer Science, Budapest, Hun-
gary, pp. 52–63. Springer.
Jin, R. and G. Agrawal (2007). Data streams – models and algorithms. See
Aggarwal (2007), Chapter Frequent Pattern Mining in Data Streams, pp.
61–84.
Kargupta, H. and B.-H. Park (2001). Mining decision trees from data streams
in a mobile environment. In IEEE International Conference on Data Min-
ing, San Jose, USA, pp. 281–288. IEEE Computer Society.
Keogh, E. J., J. Lin, and A. W.-C. Fu (2005). HOT SAX: Efficiently finding the
most unusual time series subsequence. In Proceedings of the International
Conference on Data Mining, Houston, USA, pp. 226–233. IEEE Press.
Keogh, E. J., J. Lin, and W. Truppel (2003). Clustering of time series subsequences is meaningless: Implications for previous and future research. In Proceedings of the IEEE International Conference on Data Mining, Florida, USA, pp. 115–122. IEEE Computer Society.
Kohavi, R. and C. Kunz (1997). Option decision trees with majority votes. In
D. Fisher (Ed.), Machine Learning Proc. of 14th International Conference.
Morgan Kaufmann.
Li, C., Y. Zhang, and X. Li (2009). OcVFDT: one-class very fast decision
tree for one-class classification of data streams. In O. A. Omitaomu, A. R.
Ganguly, J. Gama, R. R. Vatsavai, N. V. Chawla, and M. M. Gaber (Eds.),
KDD Workshop on Knowledge Discovery from Sensor Data, Paris, France,
pp. 79–86. ACM.
Li, H.-F., M.-K. Shan, and S.-Y. Lee (2008). DSM-FI: an efficient algorithm
for mining frequent itemsets in data streams. Knowledge and Information
Systems 17 (1), 79–97.
Lin, J., E. Keogh, and S. Lonardi (2004). Visualizing and discovering non-trivial patterns in large time series databases. Information Visualization 4 (2), 61–82.
Liu, B., Y. Dai, X. Li, W. S. Lee, and P. S. Yu (2003). Building text classifiers
using positive and unlabeled examples. In Proceedings of the International
Conference Data Mining, Florida, USA, pp. 179–188. IEEE Computer So-
ciety.
Loh, W. and Y. Shih (1997). Split selection methods for classification trees.
Statistica Sinica 7, 815–840.
Severo, M. and J. Gama (2006). Change detection with Kalman filter and
Cusum. In Discovery Science, Volume 4265 of Lecture Notes in Computer
Science, Barcelona, Spain, pp. 243–254. Springer.
Spath, H. (1980). Cluster Analysis Algorithms for Data Reduction and Clas-
sification. Ellis Horwood.
Spinosa, E., J. Gama, and A. Carvalho (2008). Cluster-based novel concept de-
tection in data streams applied to intrusion detection in computer networks.
In Proceedings of the ACM Symposium on Applied Computing, Fortaleza,
Brasil, pp. 976–980. ACM Press.
Utgoff, P., N. Berkman, and J. Clouse (1997). Decision tree induction based
on efficient tree restructuring. Machine Learning 29, 5–44.
Sampling, 21
Load Shedding, 22
Min-Wise Sampling, 22
SAX, 184
Discords, 187
Discretization, 185
Distance, 186
Hot-SAX, 187
Motifs, 186
Piecewise Aggregate Approximation, 185
Strings, 186
Sequence Pattern Mining, 112
The Problem, 112
Spatio-Temporal Data, 6
Appendix A
Resources
A.1 Software
Examples of publicly available software for learning from data streams include:
• VFML. The VFML (Hulten and Domingos, 2003) toolkit for mining high-
speed time-changing data streams. Available at http://www.cs.washington.edu/dm/vfml/.
• MOA. The MOA (Holmes et al., 2007) system for learning from massive data sets. Available at http://www.cs.waikato.ac.nz/~abifet/MOA/.
• SAX (Lin et al., 2003) is the first symbolic representation for time series that allows for dimensionality reduction and indexing with a lower-bounding distance measure. Available at http://www.cs.ucr.edu/~jessica/sax.htm.
• KDD Cup Center, the annual Data Mining and Knowledge Discovery
competition organized by ACM Special Interest Group on Knowledge
Discovery and Data Mining.
http://www.sigkdd.org/kddcup/.
A.2 Data Sets
• Intel Lab Data contains information about data collected from 54 sensors
deployed in the Intel Berkeley Research lab.
http://db.csail.mit.edu/labdata/labdata.html
• Mining Data Streams Bibliography, maintained by Mohamed Gaber, Monash University, Australia.
http://www.csse.monash.edu.au/~mgaber/WResources.html
• Distributed Data Mining: http://www.umbc.edu/ddm/
• Time Series Data Library: http://www.robjhyndman.com/TSDL/