
Big Data Analytics

Background- Data Streams


Background
• Why real-time analytics is important
• Real-time visibility of a system helps us become more proactive
• To avoid delayed and inaccurate data
• To avoid machine downtime
• To improve machine utilization and efficiency
• To enable better preventive maintenance of assets
Understanding practical constraints using simple streaming example

• How many dots can you see?


• (Each dot is an event)
Let’s add some complexity

• How many red dots can you see?


Let’s add some temporal complexity

• How many yellow dots appear after the blue dots?


• How many red dots will there be next?
• If there were 20 red dots, put the next three blue dots in a specific category.
• If there are ten blue dots, join with alternate data and count how many blue dots there are in total.
• If there were 6 green dots, wait for an hour, then compare again.
Data Stream

• Such streams of constantly arriving data arise in many applications, including:

• aviation control data
• telecommunication connection data
• readings from sensor networks
• real-time industrial control system data
Handling of Stream of Events
What is Stream Computing?
Continuous Ingestion Continuous Analysis in Microseconds
Characteristics of Data Streams

• Data Streams
• Data streams—continuous, ordered, changing, fast, huge amount
• Traditional DBMS—data stored in finite, persistent data sets

• Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing and requires fast, real-time response
• Data streams capture many of today's data processing needs
• Random access is expensive: single-scan algorithms (you can only have one look)
• Store only a summary of the data seen thus far
• Most stream data are low-level or multi-dimensional in nature, and need multi-level and multi-dimensional processing

Batch Processing vs Stream Processing

  Batch Processing     | Stream Processing
  High-latency apps    | Low-latency apps
  Static files         | Event streams
  Process-after-store  | Sense-and-respond
  Batch processors     | Stream processors
Batch Processing vs Stream Processing

  Batch Processing                   | Stream Processing
  Persistent relations               | Transient streams
  One-time queries                   | Continuous queries
  Random access                      | Sequential access
  "Unbounded" disk store             | Bounded main memory
  No real-time services              | Real-time requirements
  Relatively low update rate         | Possibly multi-GB arrival rate
  Data at any granularity            | Data at fine granularity
  Assume precise data                | Data stale/imprecise
  Access plan determined by query    | Unpredictable/variable data
  processor and physical DB design   | arrival and characteristics
Windows in Stream Analytics

• Mechanism for extracting a finite relation from an infinite stream


• Various window proposals for restricting operator scope.
• Windows based on ordering attribute (e.g. time)
• Windows based on tuple counts
• Windows based on explicit markers (e.g. punctuations)
• Variants (e.g., partitioning tuples in a window)

(Diagram: window specifications extract finite relations from a stream; a "streamify" operation turns relations back into a stream.)
Types of Windows

• Tumbling windows assign events into non-overlapping buckets of fixed size.

(Figures: a count-based window and a time-based window.)
Types of Windows

• Sliding windows assign events into overlapping buckets of fixed size.

(Figures: a sliding window; a session window; a parallel count-based tumbling window of length 2.)
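To make the window notions above concrete, here is a minimal Python sketch (not tied to any particular streaming engine; the function names are ours) that assigns a list of events to count-based tumbling buckets and to overlapping sliding buckets:

from collections import defaultdict

def tumbling_windows(events, size):
    # Each event lands in exactly one non-overlapping bucket of `size` events.
    buckets = defaultdict(list)
    for i, event in enumerate(events):
        buckets[i // size].append(event)
    return dict(buckets)

def sliding_windows(events, size, slide):
    # Overlapping buckets of `size` events, one starting every `slide` events.
    return [events[start:start + size]
            for start in range(0, max(len(events) - size + 1, 1), slide)]

events = list(range(10))               # ten events, e.g. dot sightings
print(tumbling_windows(events, 4))     # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
print(sliding_windows(events, 4, 2))   # an event can belong to several sliding windows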
Challenges in Data Streams
(Figure: a timeline from old data to new data.)

Data characteristics may change over time.

The main goal of stream mining is to ensure that the constructed model is accurate and up-to-date.
Data Sufficiency

• Definition:
• A dataset is considered "sufficient" if adding more data items will not significantly increase the final accuracy of a trained model.
• It can be judged by experience.
• We normally do not know whether a dataset is sufficient or not.
• Sufficiency detection:
• Expensive "progressive sampling" experiment: keep adding data and stop when accuracy no longer increases significantly (see the sketch below).
• Dependent on both the dataset and the algorithm
• Difficult to make a general claim
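A minimal sketch of the progressive sampling experiment described above, assuming scikit-learn is available; the synthetic dataset, the model, and the 1% stopping threshold are illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, y_train, X_test, y_test = X[:15000], y[:15000], X[15000:], y[15000:]

prev_acc, n = 0.0, 500
while n <= len(X_train):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    acc = model.score(X_test, y_test)
    if acc - prev_acc < 0.01:          # accuracy "doesn't increase significantly"
        print(f"sufficient at n={n}: accuracy {acc:.3f}")
        break
    prev_acc, n = acc, n * 2           # geometric sampling schedule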
Possible changes of data streams

• Possible “concept drift”.


• For the same feature vector, different class labels are generated at some later time,
• or with different probabilities.
Concept Drift

(Figure: a data chunk with the previous and current hyperplanes; legend: positive instances, negative instances, and instances that fall victim to concept drift.)
Concept Drift

• Attention to handling concept drift is increasing, since predictions become less accurate over time.

• To prevent that, learning models need to adapt to changes quickly and accurately.
Desired Properties of a System to Handle Concept Drift

• Adapt to concept drift as soon as possible
• Distinguish noise from changes
• Robust to noise, but adaptive to changes
• Recognize and react to reoccurring contexts
• Adapt with limited resources (time and memory)
Learning in Presence of Concept Drift

• Criteria for evaluating an algorithm's ability to handle concept drift:
• Delay [reflects how fast the method can detect and adapt to the concept drift]
• Resistance to noise [characterizes the ability of the method to distinguish noise in the data from real concept drift]
• Cost of adaptation [defines whether the algorithm needs to recompute the model from scratch after detecting the concept drift, or whether localized re-computation is enough]
Types of Concept Drifts
Concept Drift Handling Methods

• Change detection/adaptation methods can be categorized on the basis of:
• memory requirement,
• forgetting mechanism,
• information they use for detecting/adapting to drift,
• model management (number of base learners)
How to Detect Concept Drift
Data stream classification cycle:
1. Process an example at a time, and inspect it only once (at most)
2. Use a limited amount of memory
3. Work in a limited amount of time
4. Be ready to predict at any point

Dimensions of Learning
• Space: the available memory is fixed
• Learning time: process incoming examples at the rate they arrive
• Generalization power: how effective the model is at capturing the true underlying concept
Requirements for streaming systems

• Process events online without storing them


• Support a high-level language (e.g. StreamSQL)
• Handle missing, out-of-order, delayed data
• Guarantee deterministic (on replay) and correct results (on recovery)
• Combine batch (historical) and stream processing
• Ensure availability despite failures
• Support distribution and automatic elasticity
• Offer low latency
Windows in Data Streaming
• A temporary data store over the stream
Tumbling window
• Tumbling window functions are used to segment a data stream into distinct time segments and perform a function against them. The key differentiators of a tumbling window are that it repeats, does not overlap, and an event cannot belong to more than one tumbling window.

Microsoft Azure Stream Processor


Hopping window
• Hopping window functions hop forward in time by a fixed period. It may
be easy to think of them as Tumbling windows that can overlap and be
emitted more often than the window size. Events can belong to more
than one Hopping window result set. To make a Hopping window the
same as a Tumbling window, specify the hop size to be the same as the
window size.

Microsoft Azure Stream Processor


Session Window
• Session window functions group events that arrive at similar times,
filtering out periods of time where there is no data. It has three main
parameters: timeout, maximum duration, and partitioning key
(optional).

Microsoft Azure Stream Processor


Snapshot Window
• Snapshot windows group events that have the same timestamp.

Microsoft Azure Stream Processor


Sliding Window
• Sliding windows, unlike Tumbling or Hopping windows, output events
only for points in time when the content of the window actually changes.
In other words, when an event enters or exits the window. So, every
window has at least one event. Similar to Hopping windows, events can
belong to more than one sliding window.

Microsoft Azure Stream Processor


Concept Drift

• Example: the start of the public COVID-19 lockdowns in March 2020, which abruptly changed population behaviors all over the world.
Concept Drift Problem
• Concept drift problem exists in many real-world situations.
• An example can be seen in the changes of behavior in mobile phone
usage.
• From the bars in this figure, the time-percentage distribution of the mobile phone usage pattern has changed from "Audio Call" to "Camera" and then to "Mobile Internet" over the past two decades.

(Figure: concept drift in mobile phone usage; data used in the figure are for demonstration only.)
Concept Drift Definition and the Sources

• Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary way.
• It was first described by J. C. Schlimmer and R. H. Granger, who pointed out that noisy data may turn into non-noise information at a different time.
• These changes might be caused by changes in hidden variables which cannot be measured directly.
Example of Concept Drift
• Real Concept Drift
• Consider the Harry Potter film series, where viewers rate the films as interesting or junk.
• Suppose that the viewers are adults who initially enjoyed the films for their special effects; after a long period of time, they may no longer enjoy them as the special effects become outdated.
• This is known as real concept drift, because a change has occurred in the viewers' preferences.
Example of Concept Drift
• Virtual Concept Drift
• Another situation is when the viewers are children who enjoy the films today; after a certain period of time, the children grow up and their personal preferences mature too.
• Thus, if they are still enjoying the Harry Potter films, we consider this evolution virtual concept drift, because it does not affect their preference.
Example of Concept Drift

• Class prior concept drift

• Finally, when the viewers are no longer interested in the Harry Potter films because of the emergence of a new wizard film series, this can be considered class prior concept drift.
Three Sources of Concept Drift
Concept Drift
Because data is expected to evolve over time, especially in dynamically changing environments where non-stationarity is typical, the underlying distribution can change dynamically over time.

Concept drift between time point t0 and time point t1 can be defined as:

∃ X : P_t0(X, y) ≠ P_t1(X, y)

where P_t(X, y) denotes the joint distribution of the feature vector X and the label y at time t.
How does Concept Drift Affect Classification Problems?

• The classification decision is made based on the posterior probabilities of the classes:

P(y | X) = P(y) · P(X | y) / P(X)

where the prior P(y), the class-conditional distribution P(X | y), and hence the posterior P(y | X) may each change.

• Virtual concept drift: changes in the data distribution without knowing the class labels; it does not affect the target concept, but can lead to changes in the decision boundary.

• Real concept drift: affects the predictive decision.
Concept Drift in Classification Problems: Ad Click Prediction

• Some ads may suddenly grow in high demand
• User preferences may change, affecting classification
• User profiles may change

(Legend: X = input examples, y = class labels.)
An example of concept drift types
• Sudden drift: a new concept occurs within a short time.

• Gradual drift: a new concept gradually replaces the old one over time.

• Incremental drift: an old concept incrementally changes to a new concept over a period of time.

• Reoccurring concepts: an old concept reoccurs after some time.
Concept drift characteristics

1. How long does the drift last?
2. How does the new concept replace the old one?
3. How much change does the new concept cause?
4. Is the drift recurrent?
5. Is the drift predictable?
Drift Aware Systems
Main Component
• Change detection
• Memory
• Learning
• Loss estimation: detecting bad predictions
Concept Drift Detection
Drift detection refers to the techniques and mechanisms that characterize and quantify concept drift by identifying change points or change time intervals.

A general framework for drift detection contains four stages, as shown in the figure.
Concept Drift Detection
• Stage 1 (Data Retrieval) aims to retrieve data chunks from data streams.
• Since a single data instance cannot carry enough information to infer the overall distribution, knowing how to organize data chunks to form a meaningful pattern or knowledge is important in data stream analysis tasks.
Concept Drift Detection
• Stage 2 (Data Modeling) aims to abstract the retrieved data and extract the key features containing sensitive information, that is, the features of the data that most impact a system if they drift.
• This stage is optional, because it mainly concerns dimensionality reduction, or sample size reduction, to meet storage and online speed requirements.
Concept Drift Detection
• Stage 3 (Test Statistics Calculation) is the measurement of dissimilarity, or distance estimation. It quantifies the severity of the drift and forms test statistics for the hypothesis test.
• It is considered to be the most challenging aspect of concept drift detection.
• The problem of how to define an accurate and robust dissimilarity measurement is still an open question.
Hypotheses to identify drifts
• Null hypothesis: the means of two distinct data streams are equal. When performing a statistical test, the goal is to either reject the null hypothesis or fail to reject it.

• Alternative hypothesis: there is a significant difference between the data streams, and the variation between the samples results in unequal means.
• Arriving at the alternative hypothesis during statistical analysis corresponds to rejecting the null hypothesis.
Test statistics for the hypothesis test
Concept Drift Detection
• Stage 4 (Hypothesis Test) uses a specific hypothesis test to evaluate the statistical significance of the change observed in Stage 3.
• Hypothesis tests determine drift detection accuracy by proving statistical bounds on the test statistics proposed in Stage 3.
Types of dissimilarity measures
Drift Detection approaches

  Explicit drift detection (supervised)      | Implicit drift detection (unsupervised)
  1. Sequential analysis                     | Novelty detection / clustering methods
  2. Statistical process control             | Multivariate distribution monitoring
  3. Window-based distribution monitoring    | Model-dependent monitoring
Explicit concept drift detection methodologies
• Sequential analysis methodologies:
• Continuously monitor a sequence of performance metrics, such as
• accuracy,
• F-measure,
• precision and recall,
• and signal a change in the event of a significant drop in these values.
• Methodologies under sequential analysis:
• CUSUM (Cumulative Sum approach)
• PHT (Page-Hinckley Test)
Sequential analysis methodologies

• CUSUM (Cumulative Sum approach): signals an alarm when the mean of the sequence significantly deviates from 0.
• The CUSUM test monitors a metric M at time t on an incoming sample's performance ε_t, using parameter v for the acceptable deviation and θ for the change threshold. In its standard one-sided form:

g_0 = 0,   g_t = max(0, g_(t-1) + (ε_t − v)),   drift is signalled when g_t > θ

• The max function tests for changes in the positive direction; for the reverse effect, a min function can be used.
• The test is memory-less and can be used incrementally (a sketch follows).
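A minimal Python sketch of a one-sided CUSUM detector in the standard form given above; the class name and the default values of v and θ are illustrative, not from any specific library:

class CUSUM:
    def __init__(self, v=0.005, theta=5.0):
        self.v = v            # acceptable deviation per sample
        self.theta = theta    # change-detection threshold
        self.g = 0.0          # accumulated deviation, reset after each alarm

    def update(self, eps):
        # Feed one performance value (e.g. 0/1 error); return True on drift.
        self.g = max(0.0, self.g + (eps - self.v))
        if self.g > self.theta:
            self.g = 0.0      # memory-less: restart after signalling
            return True
        return False

detector = CUSUM()
for t, err in enumerate([0] * 200 + [1] * 50):   # error rate jumps at t = 200
    if detector.update(err):
        print("drift signalled at t =", t)
        break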
Sequential analysis methodologies
• The Page-Hinckley Test (PHT) is a variant of the CUSUM approach.
• PHT monitors the metric as an accumulated difference between its current values and their running mean. In its standard form:

m_t = m_(t-1) + (ε_t − ε̄_t − v),   M_t = min(M_(t-1), m_t),   drift is signalled when m_t − M_t > θ

• Here ε_t is the sample's performance at time t, ε̄_t is the running mean of ε computed so far, v denotes the acceptable deviation from the mean, and θ is the change detection threshold.
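A matching sketch of the Page-Hinckley test, again in its standard form with illustrative parameter values:

class PageHinckley:
    def __init__(self, v=0.005, theta=5.0):
        self.v, self.theta = v, theta
        self.n, self.mean = 0, 0.0
        self.m, self.m_min = 0.0, float("inf")

    def update(self, eps):
        self.n += 1
        self.mean += (eps - self.mean) / self.n    # running mean of eps
        self.m += eps - self.mean - self.v         # accumulated deviation
        self.m_min = min(self.m_min, self.m)
        return self.m - self.m_min > self.theta    # drift if far above the minimum

detector = PageHinckley()
for t, err in enumerate([0] * 200 + [1] * 50):     # error rate jumps at t = 200
    if detector.update(err):
        print("drift signalled at t =", t)
        break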
Statistical Process Control based methodologies
• Monitor the online trace of error rates, and detect deviations based on ideas taken from control charts.
• A significantly increased error rate violates the model's assumptions, and as such is assumed to be a result of concept drift.
• Methodologies in this category:
- DDM (Drift Detection Method)
- EDDM (Early Drift Detection Method)
- STEPD (Statistical Test of Equal Proportion Distribution)
- EWMA (Exponentially Weighted Moving Average)
Statistical Process Control based methodologies

(Figure: landmark time window for drift detection. The starting point of the window is fixed, while the end point is extended after each new data instance is received.)
Statistical Process Control based methodologies
• EDDM (Early Drift Detection Method)
• An extension of DDM, made suitable for slow-moving gradual drifts, where DDM previously failed.
• EDDM monitors the number of samples between two classification errors as the metric to be tracked online for drift detection.
• The underlying assumption is that, in stationary environments, the distance (in number of samples) between two subsequent errors should increase.
• A violation of this condition is taken to be indicative of drift.
EDDM
Drift Detection Method
Pros:
• DDM shows good performance when detecting gradual changes (if they
are not very slow) and abrupt changes (incremental and sudden drifts).

Cons:
• DDM has difficulty detecting drift when the change is slow and gradual.
• Many samples may be stored for a long time before the drift level is activated, with the risk of overflowing the sample storage.
Statistical Process Control based methodologies
• STEPD (Statistical Test of Equal Proportions)
• Computes the accuracy of a chunk C of recent samples and compares it with the overall accuracy from the beginning of the stream, using a chi-squared test to check for deviation.

(Figure: two time windows for concept drift detection; the new-data window has to be defined by the user.)
Window based distribution monitoring methodologies
• Window-based approaches use a chunk-based or sliding-window approach over recent samples to detect changes.
• Deviations are computed by comparing the current chunk's distribution to a reference distribution obtained at the start of the stream from the training dataset.
• These approaches provide precise localization of the change point, and are robust to noise and transient changes.
• Extra memory is required to store the two distributions over time.
Window based distribution monitoring methodologies
• ADWIN (Adaptive Windowing)
• This algorithm uses a variable-length sliding window, whose length is computed online according to the observed changes.
• Whenever two large enough sub-windows of the current window exhibit distinct averages of the performance metric, a drift is detected.
• Hoeffding bounds are used to determine the optimal change threshold and window parameters (a simplified sketch follows).
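The sketch below is a heavily simplified illustration of ADWIN's core idea, not the actual algorithm (which maintains compressed buckets for efficiency): it keeps a variable-length window and drops the older part whenever two sub-windows differ by more than a Hoeffding-style bound.

import math

class SimpleADWIN:
    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = []

    def _cut_threshold(self, n0, n1):
        m = 1.0 / (1.0 / n0 + 1.0 / n1)            # harmonic sample size
        return math.sqrt((1.0 / (2 * m)) * math.log(4.0 / self.delta))

    def update(self, x):
        # Add one observation; return True if a change was detected.
        # O(window) per update; the real ADWIN does this in logarithmic time.
        self.window.append(x)
        n = len(self.window)
        for split in range(5, n - 5):              # skip tiny sub-windows
            w0, w1 = self.window[:split], self.window[split:]
            mean0 = sum(w0) / len(w0)
            mean1 = sum(w1) / len(w1)
            if abs(mean0 - mean1) > self._cut_threshold(len(w0), len(w1)):
                self.window = w1                   # forget the stale part
                return True
        return False

detector = SimpleADWIN()
for t, x in enumerate([0.0] * 300 + [1.0] * 100):  # mean shifts at t = 300
    if detector.update(x):
        print("change detected at t =", t)
        break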
Window based distribution monitoring methodologies
• DoD (Degree of Drift)
• Detects drifts by computing a distance map of all samples in the current chunk and their nearest neighbors from the previous chunk.
• If the distance increases by more than a parameter θ, a drift is signaled.
• Drift is managed by replacing the stable model with the reactive one and setting the circular disagreement list to all zeros.
Implicit drift detection methodologies
• Novelty detection / Clustering based methods
• Capable of identifying uncertain, suspicious samples which need further evaluation.
• An additional 'Unknown' class label indicates that these suspicious samples do not fit the existing view of the data.
• Clustering and outlier based approaches are popular for detecting novel patterns, as they summarize current data and can use dissimilarity metrics to identify new samples.

(Figure: two sliding time windows of fixed size; the historical-data window is fixed while the new-data window keeps moving.)
Novelty detection / Clustering based methods
OLINDDA(OnLIne Novelty and Drift Detection Algorithm)
• Uses K-means data clustering to continuously monitor and
adapt to emerging data distribution.
• Unknown samples are stored in a short term memory
queue, and are periodically clustered and then either
merged with existing similar cluster profiles or added as a
novel profile to the pool of clusters.
Novelty detection / Clustering based methods
• All novelty detection techniques rely on clustering to recognize new regions of the space that were previously unseen.
• Being distance-dependent, they suffer from the curse of dimensionality, as well as from problems dealing with binary data spaces.
• Additionally, they are suitable for detecting only specific types of clusterable drifts.
Multivariate distribution monitoring methods
• These approaches are primarily chunk based and store
summarized information of the training data chunk (as
histograms of binned values), as the reference distribution, to
monitor changes in the current data chunk.
• Hellinger distance and KL-divergence are commonly used to
measure differences between the two chunk distributions and
to signal drift in the event of a significant change.
Multivariate distribution monitoring methods
• Change of Concept(CoC)
- This technique considers each feature as an independent stream of data and monitors the Pearson correlation between the current chunk and the reference training chunk.
- A change in the average correlation over the features is used as a signal of change.
Multivariate distribution monitoring methods
• HDDDM(Hellinger Distance Drift Detection Methodology)-
▪ A non-parametric, chunk-based approach which uses Hellinger distance to measure change in distribution over time.
▪ An increased Hellinger distance between the current stream chunk and a training reference chunk is used to signal drift.
▪ The Hellinger distance (HD) between the reference chunk P and the current chunk Q is computed per feature over binned histograms and averaged over the features; in its standard form:

HD(P, Q) = (1/d) · Σ_{k=1..d} √( Σ_{i=1..b} ( √(P_{k,i} / Σ_i P_{k,i}) − √(Q_{k,i} / Σ_i Q_{k,i}) )² )

- Here, N is the number of samples in a chunk, d is the data dimensionality, and b is the number of bins per feature (b = √N).
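A minimal sketch of the chunk comparison used by HDDDM, assuming NumPy is available; binning with b = √N per feature follows the slide, while the example data are ours:

import numpy as np

def hellinger_distance(ref_chunk, cur_chunk):
    ref, cur = np.asarray(ref_chunk, float), np.asarray(cur_chunk, float)
    n, d = ref.shape
    b = int(np.sqrt(n))                            # bins per feature
    total = 0.0
    for k in range(d):
        lo = min(ref[:, k].min(), cur[:, k].min())
        hi = max(ref[:, k].max(), cur[:, k].max())
        p, _ = np.histogram(ref[:, k], bins=b, range=(lo, hi))
        q, _ = np.histogram(cur[:, k], bins=b, range=(lo, hi))
        p = p / p.sum()                            # normalized histograms
        q = q / q.sum()
        total += np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    return total / d                               # average over features

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, size=(400, 3))
same = rng.normal(0, 1, size=(400, 3))
drifted = rng.normal(2, 1, size=(400, 3))          # shifted distribution
print(hellinger_distance(ref, same))     # small
print(hellinger_distance(ref, drifted))  # noticeably larger -> signal drift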
Model dependent drift detection methodologies
• Model-dependent approaches directly consider the classification process by tracking the posterior probability estimates of classifiers to detect drift.
• By monitoring the posterior probability estimates, the drift detection task is reduced to monitoring a univariate stream of values, making the process computationally efficient.
• Techniques in this category include:
• A-distance: a reduced false-positive rate is obtained by tracking the 'A-distance', a measure of histogram difference obtained by binning the margin distribution of samples, between the reference and current margin samples.
Concept Drift Understanding
• Drift understanding refers to retrieving concept drift
information about
• “When” (the time at which the concept drift occurs and how long the
drift lasts),
• “How” (the severity /degree of concept drift), and
• “Where” (the drift regions of concept drift).
• This status information is the output of the drift detection
algorithms, and is used as input for drift adaptation.
Drift Adaptation Techniques
• It focuses on strategies for updating existing learning models
according to the drift.
• There are three main groups of drift adaptation methods:
• simple retraining
• ensemble retraining
• model adjusting
Training New Models for Global Drift

• A new model is trained with the latest data to replace the old model when a concept drift is detected.
Training New Models for Global Drift

• Paired Learners follows this strategy and uses two learners: the stable learner and the reactive learner.
• If the stable learner frequently misclassifies instances that the reactive learner correctly classifies, a new concept is detected and the stable learner is replaced with the reactive learner.
• This method is simple to understand and easy to implement, and can be applied at any point in the data stream (a sketch follows).
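A self-contained sketch of the Paired Learners idea, using a tiny nearest-centroid learner so no external library is needed; the window size and mistake threshold are illustrative choices, not from the original paper:

from collections import deque
import numpy as np

class Centroid:
    # Minimal incremental nearest-centroid classifier.
    def __init__(self):
        self.sums, self.counts = {}, {}
    def learn(self, x, y):
        self.sums[y] = self.sums.get(y, 0) + x
        self.counts[y] = self.counts.get(y, 0) + 1
    def predict(self, x):
        return min(self.sums,
                   key=lambda y: np.linalg.norm(x - self.sums[y] / self.counts[y]))

class PairedLearners:
    def __init__(self, window=100, threshold=10):
        self.stable = Centroid()
        self.recent = deque(maxlen=window)   # training data for the reactive learner
        self.flags = deque(maxlen=window)    # 1 = stable wrong but reactive right
        self.threshold = threshold

    def update(self, x, y):
        if self.stable.sums and len(self.recent) >= 10:
            reactive = Centroid()
            for xi, yi in self.recent:
                reactive.learn(xi, yi)
            self.flags.append(int(self.stable.predict(x) != y
                                  and reactive.predict(x) == y))
            if sum(self.flags) > self.threshold:   # new concept detected:
                self.stable = reactive              # promote the reactive learner
                self.flags.clear()
                print("stable learner replaced")
        self.stable.learn(x, y)
        self.recent.append((x, y))

pl = PairedLearners()
rng = np.random.default_rng(1)
for t in range(600):
    x = rng.normal(size=2)
    y = int(x[0] > 0) if t < 300 else int(x[0] < 0)   # the concept flips at t = 300
    pl.update(x, y)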
Model Ensemble for Recurring Drift
• In the case of recurring concept drift, preserving and reusing old
models can save significant effort to retrain a new model for
recurring concepts.
• This is the core idea of using ensemble methods to handle
concept drift
Adjusting Existing Models for Regional Drift
• An alternative to retraining an entire model is to develop a model that
adaptively learns from the changing data. Such models have the ability to
partially update themselves when the underlying data distribution changes, as
shown in Figure:

(Figure: a decision tree node is replaced with a new one as its performance deteriorates when a concept drift occurs in a subregion.)

• This approach is arguably more efficient than retraining when the drift only
occurs in local regions.
• Many methods in this category are based on the decision tree algorithm because
trees have the ability to examine and adapt to each sub-region separately.
Example: Concept Drift Detection over Social Media

Phase 1: Data Collection


• Collecting data from various online sources such as Twitter, Web sites, news articles.
• Phase 2: Data Clustering and Labeling
2:1 Dividing data streams into windows.
2:2 Applying appropriate clustering algorithm on a window to group the data into
clusters.
2:3 Labeling clusters to get knowledge about the concept hidden in that cluster.
2:4 Steps 2.2 and 2.3 are repeated for each window of the streaming data.
Example: Concept Drift Detection over Social Media
Phase 3: Detection of Concept Drift
Clusters of two adjacent windows are then analyzed for the following tasks:
(i) To identify concept drift;
(ii) to find any relationship between the concepts of two windows;
(iii) to analyze coevolving events, if any.
For analyzing clusters, a graph edge-weight technique can be implemented using content-based features of the streaming data, entities of social networking sites, and different data resources such as Web sites and news articles for making a decision.

Phase 4: Evaluation Phase


The performance of the algorithm can be evaluated by various performance metrics
such as accuracy, precision, recall and execution time, classification, or clustering
error.
Measuring Concept Drift Performance
Drift Detection in short
The main strategies for handling concept drift
• Adaptive learning strategies
(Series of figure slides.)
Types of Concept Drifts
Challenges due to concept drift
Evaluation Systems
• Several criteria:
• Time → seconds
• Memory → RAM/hour
• Generalizability of the model → % success
• Detecting concept drift → detected drifts, false positives and false negatives
Evaluation Systems
• Taxonomy of performance criteria for handling concept drift
• According to the requirements of real-world applications, the performance criteria can concern:
– Autonomy: the level of human involvement,
– Reliability: the accuracy of drift information,
– Parameter settings: the availability of a priori knowledge,
– Complexity: the time and memory consumption
Performance Evaluation Parameters of Stream Processing
Kappa statistic

• The kappa statistic measures the performance of streaming classifiers and is effective for measuring performance on imbalanced data sets, in which the number of data instances from one class significantly exceeds the number of instances from the other classes. It is commonly computed as:

κ = (A_ref − A_rand) / (1 − A_rand)

• Here, A_ref represents the accuracy of the classifier being evaluated and A_rand is the random classifier's accuracy. Kappa values lie in the range [0, 1], sometimes represented as a percentage range [0%, 100%]. A higher value implies better performance.
Temporal Kappa statistic
• This statistic measures the effectiveness of a classifier in the presence of temporal dependence in the data instances of streaming data, where the class label of the instance at time t+1 tends to belong to the same class as the instance at time t. The kappa-temporal statistic is commonly defined as:

κ_temp = (A_ref − A_pers) / (1 − A_pers)

• Here, A_pers is the persistent classifier's accuracy, which predicts for time t+1 the same class label as the instance at time t. The value of κ_temp ranges over (−∞, 1]; κ_temp = 1 if the classifier is perfectly accurate. Negative values of κ_temp indicate that the classifier performs even worse than the persistent classifier (see the sketch below).
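Both statistics reduce to a one-line computation once the accuracies are known; a minimal sketch (the helper names are ours):

def kappa(acc_ref, acc_rand):
    # Kappa statistic: improvement of the classifier over a chance classifier.
    return (acc_ref - acc_rand) / (1.0 - acc_rand)

def kappa_temporal(acc_ref, acc_persistent):
    # Kappa-temporal: improvement over the persistent (predict-previous) classifier.
    return (acc_ref - acc_persistent) / (1.0 - acc_persistent)

print(kappa(0.90, 0.50))            # 0.8  -> well above chance
print(kappa_temporal(0.90, 0.95))   # -1.0 -> worse than the persistent classifier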
Classification Parameters

  Evaluation Measure        | Major Purpose                                       | Value Significance
  Kappa statistic           | Assess performance on imbalanced data streams       | Higher value means better performance
  Temporal Kappa statistic  | Assess performance on temporally dependent streams  | Negative values mean worse performance
Completeness
• Completeness assesses whether all the data instances belonging to the same class lie in the same cluster. For example, consider a dataset D composed of data instances belonging to a single category. Let one clustering algorithm A1 generate two clusters C1 and C2, whereas another clustering algorithm A2 produces a single cluster C. Then:

• Completeness (cluster-set {C}) > Completeness (cluster-set {C1, C2})

• Values of the completeness parameter lie in [0, 1], where higher values imply better performance.
Purity

• Purity assesses whether each cluster contains instances of a single class; it is commonly computed as the fraction of instances in a cluster that belong to the cluster's majority class. A higher Pscore value indicates better performance.
SSQ

• SSQ measures the cohesiveness of clusters by computing the sum of the squared distances of each instance in a cluster from its centroid (Song and Zhang, 2013). For the j-th cluster it is calculated as:

SSQ_j = Σ_{i=1..n} (d_{i,c_j})²

• Here, n specifies the number of data instances in cluster j and d_{i,c_j} is the distance of instance i from the cluster centre c_j of the j-th cluster.
• A smaller value of SSQ implies better performance (a sketch follows).
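A minimal sketch of the SSQ computation, assuming NumPy; the cluster data are illustrative:

import numpy as np

def ssq(cluster_points, centroid):
    # Sum of squared distances of each instance from its cluster centroid.
    diffs = np.asarray(cluster_points) - np.asarray(centroid)
    return float(np.sum(np.sum(diffs ** 2, axis=1)))

points = [[1.0, 2.0], [1.5, 1.8], [0.8, 2.2]]
centroid = np.mean(points, axis=0)
print(ssq(points, centroid))   # small SSQ -> cohesive cluster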
Silhouette Coefficient
Clustering Parameters

  Measure                 | Major Purpose                                                    | Value Significance
  Completeness            | Measures whether same-class instances fall in the same cluster  | Higher value means better clustering
  Purity                  | Assesses cluster purity in terms of having same-class instances | Higher value means better clustering
  SSQ                     | Measures cluster cohesiveness                                    | Lower value means better clustering
  Silhouette Coefficient  | Assesses compactness as well as separation of clusters           | Higher value means better clustering
Important findings

• Error rate-based and data distribution-based drift detection methods still play a dominant role in concept drift detection research, while multiple hypothesis test methods have emerged in recent years.

• Regarding concept drift understanding, all drift detection methods can answer "When", but very few methods have the ability to answer "How" and "Where".
Important findings

• Adaptive models and ensemble techniques have played an increasingly important role in recent concept drift adaptation developments.
• In contrast, research on retraining models with explicit drift detection has slowed.
• Most existing drift detection and adaptation algorithms assume the ground-truth label is available after classification/prediction, or extreme verification latency.
• Very little research has been conducted to address unsupervised or semi-supervised drift detection and adaptation.
Important findings

• Some computational intelligence techniques, such as fuzzy logic and competence models, have been applied to concept drift.
• There is no comprehensive analysis of real-world data streams from the concept drift aspect, such as the drift occurrence time, the severity of drift, and the drift regions.
Recent Trends and Future Perspective
From Algorithms Development Point of View

• The development of new algorithms that address the inherent challenges in mining large-scale data streams. New algorithms must ensure:
• One-pass computation over the stream of data.
• Fast computation to respond in real time.
• Minimized memory utilization, by storing summarized or sampled data information without significantly losing the accuracy of the mining result.
From New Evaluation Measures Point of View

• Traditional evaluation measures are not sufficient to estimate the


performance of the stream mining tasks. Hence, identification of
new evaluation measures is also an important field of research in
stream data mining. These measures must consider:
• Underlying imbalances in data sets
• Non-uniform distribution of incoming data instances.
• Temporal dependence of data instances.
From Concept Change Identification Point of View

• In streaming data mining, concept change is a common phenomenon.
• It opens plenty of opportunities for research. Mining techniques must be capable of identifying these concept changes over time.
• Also, mining techniques must periodically update the model or take the appropriate steps to capture concept drift and deal with it.
Future Directions

• Drift detection research should not only focus on identifying the drift occurrence time accurately, but also needs to provide information about drift severity and regions. This information could be utilized for better concept drift adaptation.

• In real-world scenarios, acquiring true labels can be expensive; hence, unsupervised or semi-supervised drift detection and adaptation remain promising directions.
Future Directions

• A framework for selecting real-world data streams should be


established for evaluating learning algorithms handling concept
drift.
• Research on effectively integrating concept drift handling
techniques with machine learning methodologies for data-driven
applications is highly desired.
MLlib Algorithms

• Algorithms:
• Basic Statistics
• Regression
• Classification
• Recommendation System
• Clustering
• Dimensionality Reduction
• Feature Extraction
• Optimization
References

• J. Lu, A. Liu, F. Dong, F. Gu, J. Gama and G. Zhang, "Learning under Concept Drift: A Review," in IEEE Transactions on
Knowledge and Data Engineering, vol. 31, no. 12, pp. 2346-2363, 1 Dec. 2019.
• C. Aggarwal, J. Han, J. Wang, P. S. Yu, “A Framework for Clustering Data Streams”, VLDB'03
• C. C. Aggarwal, J. Han, J. Wang and P. S. Yu, “On-Demand Classification of Evolving Data Streams”, KDD'04
• C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A Framework for Projected Clustering of High Dimensional Data Streams”,
VLDB'04
• S. Babu and J. Widom, “Continuous Queries over Data Streams”, SIGMOD Record, Sept. 2001
• B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom, “Models and Issues in Data Stream Systems”, PODS'02.
• Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, “Multi-Dimensional Regression Analysis of Time-Series Data
Streams”, VLDB'02
• P. Domingos and G. Hulten, “Mining high-speed data streams”, KDD'00
• A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi, “Processing Complex Aggregate Queries over Data Streams”,
SIGMOD’02

References

• S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering Data Streams”, FOCS'00


• G. Hulten, L. Spencer and P. Domingos, “Mining time-changing data streams”, KDD’01
• S. Madden, M. Shah, J. Hellerstein, V. Raman, “Continuously Adaptive Continuous Queries over Streams”, SIGMOD’02
• G. Manku, R. Motwani, “Approximate Frequency Counts over Data Streams”, VLDB’02
• A. Metwally, D. Agrawal, and A. El Abbadi. “Efficient Computation of Frequent and Top-k Elements in Data Streams”.
ICDT'05
• S. Muthukrishnan, “Data streams: algorithms and applications”, Proc 2003 ACM-SIAM Symp. Discrete Algorithms, 2003
• R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge Univ. Press, 1995
• S. Viglas and J. Naughton, “Rate-Based Query Optimization for Streaming Information Sources”, SIGMOD’02
• Y. Zhu and D. Shasha. “StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time”, VLDB’02
• H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining Concept-Drifting Data Streams using Ensemble Classifiers”, KDD'03

Some of the good web resources

• Indyk, "Streaming etc" lecture notes: http://people.csail.mit.edu/indy...
• Feldman et al., On the Complexity of Processing Massive, Unordered, Distributed Data: http://arxiv.org/abs/cs/0611108
• Feldman et al., On Distributing Symmetric Streaming Computations: http://www.google.com/research/p...
• Sarma et al., Estimating PageRank on graph streams: http://portal.acm.org/citation.c...
• Zhang, A Survey on Streaming Algorithms for Massive Graphs: http://www.springerlink.com/cont...
• Vassilvitskii, "Dealing with Massive Data" lecture notes: http://www.cs.columbia.edu/~coms...
• McGregor's publications: http://www.cs.umass.edu/~mcgregor
• Muthukrishnan, Data Streams: Algorithms and Applications: http://www.cs.rutgers.edu/~muthu/
Thank You
Complex Event Processing
Stream Processing and CEP
Complex Event Processing (CEP) Example

• Consider a smart home which has sensors at the doors, a smart WiFi router, and room movement detectors. With CEP streaming all the data into a home server, a user could define rules like the following:
• If it's daytime and the door is closed and no phones are connected to the WiFi, set
the house to “nobody home”
• If nobody is home and the door is unlocked, then lock the door and turn on the
alarm
• If nobody is home and it's winter, lower the house temperature to 18C
• If nobody is home and it's summer, turn off the air conditioning
• If nobody is home and the door is unlocked by a family member, then turn off the alarm and set the house to "people are home"
Complex Event Processing (CEP) Example

• What are the parts of this type of project?


• Data ingest
• Defining rules on the data
• Executing the rules
• Taking action from rules when the conditions are met (see the sketch below).
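A toy sketch of those four parts for the smart-home example; the state fields and rule logic are illustrative and not taken from any CEP product:

state = {"daytime": True, "door_closed": True, "door_locked": False,
         "phones_connected": 0, "nobody_home": False, "season": "winter"}

def rule_nobody_home(s):
    # Daytime + closed door + no phones on WiFi -> nobody home.
    if s["daytime"] and s["door_closed"] and s["phones_connected"] == 0:
        s["nobody_home"] = True
        return "set house to 'nobody home'"

def rule_lock_up(s):
    if s["nobody_home"] and not s["door_locked"]:
        s["door_locked"] = True
        return "lock door and turn on alarm"

def rule_temperature(s):
    if s["nobody_home"] and s["season"] == "winter":
        return "lower house temperature to 18C"

for rule in (rule_nobody_home, rule_lock_up, rule_temperature):
    action = rule(state)        # executing the rules over ingested state
    if action:
        print("ACTION:", action)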
VLDB vs. CEP

• Very Large Database :


– Static data
– Storage : distributed on several computers
– Query & Analysis : distributed and parallel processing

Two complementary approaches

• Complex Event Processing :


– Data in motion
– Storage : none (only buffer in memory)
– Query & Analysis : processing on the fly
(and parallel)
Complex Events Processing (CEP)

Main features:
• High frequency processing
• Parallel computing
• Fault-tolerant
• Robust to imperfect and asynchronous data
• Extensible (implementation of new operators)

Notable products:
• StreamBase (Tibco)
• InfoSphere Streams (IBM)
• STORM (Open source – Twitter)
• KINESIS (API - Amazon)
• SQLstream
• Apama
• Apache Flink
Application Areas

• Finance: High frequency trading


– Find correlations between the prices of stocks within the historical data;
– Evaluate the correlations over the time;
– Give more weight to recent data.
• Banking : Detection of frauds with credit cards
– Automatically monitor a large amount of transactions;
– Detects patterns of events that indicate a likelihood of fraud;
– Stop the processing and send an alert for a human adjudication.
• Medicine: Health monitoring
– Perform automatic medical analysis to reduce workload on nurses;
– Analyze measurements of devices to detect early signs of disease;
– Help doctors to make a diagnosis in real time.
• Smart Cities & Smart grids : Dash boards & visualization
– Optimization of public transports;
– Management of the local production of electricity;
– Flattening of the peaks of consumption.
Complex Events Processing (CEP)
(Diagram: input data streams such as Twitter, RSS, stock feeds, and XML flow through a graph of operators; output data streams feed visualization tools, e-mail, and a database.)
• An operator implements a query or a more complex analysis
• An operator processes data in motion with low latency
• Several operators run at the same time, parallelized across several CPUs and/or computers
• The graph of operators is defined before the processing of the data streams
• Connectors allow interaction with external data streams and visualization tools
Complex Event Processing
Distributed Complex Event Processing

• Computation spread across many machines


Required Characteristics for Complex Event
Processing Engines

1. Perform data processing without first storing and retrieving the data
2. Leverage the SQL query paradigm
3. Store and access current or historical state information using a familiar standard such as SQL
4. Handle stream imperfections (e.g. late or delayed, missing, out-of-sequence data)
5. Process time-series records (tuples) in a consistent, deterministic manner
6. Fail over a streaming application to a back-up and keep running in the event of primary system failure
7. Split applications over multiple processors or machines for scalability, without writing low-level code
8. Satisfy rules 1-7 in-process at tens to hundreds of thousands of messages per second with low latency
CEP on Business Processes
A Use Case

• A business process is, in its simplest form, a chain of correlated events. It has a start and a
completion event. See the example depicted below:
CEP on Business Processes
A Use Case

• The start event of the example business process is ORDER_CREATED.

{
"event_type": "ORDER_CREATED",
"event_id": 1,
"occurred_at": "2017-04-18T20:00:00.000Z",
"order_number": 123
}
CEP on Business Processes
A Use Case

• The completion event is ALL_PARCELS_SHIPPED. It means that all parcels pertaining to an order have been handed over for shipment to the logistics provider.

{
"event_type": "ALL_PARCELS_SHIPPED",
"event_id": 11,
"occurred_at": "2017-04-19T08:00:00.000Z",
"order_number": 123
}
Notice that the events are correlated on order_number, and also that they
occur in order according to their occurred_at values.
CEP on Business Processes
A Use Case
• Problem Statement: A complex event is an event which is inferred from a pattern of other events.
For our example business process, we want to infer the event ALL_PARCELS_SHIPPED from a pattern of PARCEL_SHIPPED events, i.e. generate ALL_PARCELS_SHIPPED when all distinct PARCEL_SHIPPED events pertaining to an order have been received within 7 days. If the received set of PARCEL_SHIPPED events is incomplete after 7 days, we generate the alert event THRESHOLD_EXCEEDED (a sketch follows).
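A minimal sketch of this inference in Python; the event shapes follow the JSON above, while the parcel_id field and the expected_parcels lookup are our assumptions for illustration (a real engine would also use a timer to fire the timeout, rather than checking it on arrival):

from datetime import datetime, timedelta

expected_parcels = {123: {"P1", "P2"}}   # hypothetical: known parcels per order
seen = {}                                # order_number -> (start_time, parcel ids)
TIMEOUT = timedelta(days=7)

def on_event(event):
    order = event["order_number"]
    ts = datetime.fromisoformat(event["occurred_at"].replace("Z", "+00:00"))
    if event["event_type"] == "ORDER_CREATED":
        seen[order] = (ts, set())
    elif event["event_type"] == "PARCEL_SHIPPED":
        start, parcels = seen[order]
        parcels.add(event["parcel_id"])          # hypothetical correlation field
        if parcels == expected_parcels[order]:
            return {"event_type": "ALL_PARCELS_SHIPPED", "order_number": order}
        if ts - start > TIMEOUT:                 # incomplete after 7 days
            return {"event_type": "THRESHOLD_EXCEEDED", "order_number": order}

events = [
    {"event_type": "ORDER_CREATED", "order_number": 123,
     "occurred_at": "2017-04-18T20:00:00.000Z"},
    {"event_type": "PARCEL_SHIPPED", "order_number": 123, "parcel_id": "P1",
     "occurred_at": "2017-04-18T22:00:00.000Z"},
    {"event_type": "PARCEL_SHIPPED", "order_number": 123, "parcel_id": "P2",
     "occurred_at": "2017-04-19T08:00:00.000Z"},
]
for e in events:
    out = on_event(e)
    if out:
        print(out)   # -> ALL_PARCELS_SHIPPED for order 123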
More about Flink CEP

• The Pattern API
• Each pattern consists of multiple stages, or what we call states. In order to go from one state to the next, the user can specify conditions.
• These conditions can be the contiguity of events or a filter condition on an event.
• Each pattern has to start with an initial state with an IterativeCondition or a SimpleCondition.
More about Flink CEP

• The Pattern API
• Types of pattern operations: Begin, Next, FollowedBy, Where, Or, Subtype, Within, ZeroOrMore, OneOrMore, Times, etc.
• Covered topics: detecting patterns, selecting from patterns, handling timed-out partial patterns, handling lateness in event time.
What does FlinkCEP offer
Thank You
