BDA Lect Data Streams SA
• Data Streams
• Data streams: continuous, ordered, changing, fast, huge in volume
• Traditional DBMS: data stored in finite, persistent data sets
• Characteristics
• Huge volumes of continuous data, possibly infinite
• Fast changing, requiring fast, real-time responses
• The data stream model captures many of today's data processing needs nicely
• Random access is expensive: single-scan algorithms (you only get one look at the data)
• Store only a summary of the data seen so far
• Most stream data are low-level or multi-dimensional in nature, and need multi-level
and multi-dimensional processing
Batch Processing vs Stream Processing
[Figure: window specifications extract finite relations from a stream, while the reverse
operation "streamifies" finite relations back into a stream.]
Types of Windows
• Count-based window
• Time-based window
• Sliding window
• Session window
[Figure: a parallel count-based tumbling window of length 2.]
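The count-based tumbling window from the figure can be sketched in a few lines; the function name and the plain-list stream are illustrative assumptions:

```python
from typing import Iterable, Iterator, List

def count_tumbling(stream: Iterable[int], length: int = 2) -> Iterator[List[int]]:
    """Group a stream into non-overlapping count-based windows of `length` items."""
    window: List[int] = []
    for item in stream:
        window.append(item)
        if len(window) == length:   # window is full: fire it and start a new one
            yield window
            window = []

# Each window fires as soon as `length` items have arrived; a trailing
# incomplete window is simply dropped here.
print(list(count_tumbling([1, 2, 3, 4, 5], length=2)))  # [[1, 2], [3, 4]]
```

A sliding window would instead keep the last `length` items and emit on every arrival.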
Challenges in Data Streams
• Definition:
• A dataset is considered “sufficient” if adding more data items will not significantly
increase the final accuracy of a trained model.
• It can be judged by experience.
• We normally do not know whether a dataset is sufficient or not.
• Sufficiency detection:
• An expensive “progressive sampling” experiment:
• Keep adding data and stop when accuracy no longer increases significantly.
• The outcome depends on both the dataset and the algorithm,
• so it is difficult to make a general claim.
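The progressive-sampling loop described above can be sketched as follows; `train_and_score`, `step`, and `eps` are assumed, user-supplied pieces rather than anything prescribed by the slide:

```python
def progressive_sampling(data, train_and_score, step=1000, eps=0.005):
    """Progressive sampling sketch: grow the training set in increments of
    `step` and stop when accuracy no longer improves by at least `eps`.

    `train_and_score(sample)` is an assumed user-supplied function that
    trains the model on `sample` and returns held-out accuracy.
    """
    best_acc, n = 0.0, step
    while n <= len(data):
        acc = train_and_score(data[:n])
        if acc - best_acc < eps:      # no significant gain: deemed "sufficient"
            return n, acc
        best_acc, n = acc, n + step
    return len(data), best_acc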
Possible changes of data streams
[Figure: concept drift on a data chunk; the previous hyperplane separating positive and
negative instances shifts, leaving some instances victims of concept drift.]
Dimensions of Learning
• Space: the available memory is fixed
• Learning time: process incoming examples at the rate they arrive
• Generalization power: how effective the model is at capturing the true underlying concept
Requirements for streaming systems
Three Sources of Concept Drift
Concept Drift
Because data is expected to evolve over time, especially in dynamically changing
environments where non-stationarity is typical, the underlying distribution can
change dynamically over time.
Concept Drift in Classification Problems: Ad Click Prediction
[Figure: a user profile may change over time. Legend: X = input examples, y = class labels.]
An example of concept drift types
• Sudden drift: a new concept occurs within a short time.
• Gradual drift: a new concept gradually replaces the old one over time. It is
considered the most challenging type for concept drift detection.
Sequential analysis methodologies: CUSUM (Cumulative Sum approach)
• A max function is used to test changes in the positive direction; for the reverse
effect, a min function can be used.
• Memory-less, and can be used incrementally.
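A minimal one-sided CUSUM sketch of the idea above; `delta` (allowed slack per step) and `lam` (alarm threshold) are illustrative tuning parameters:

```python
class CUSUM:
    """One-sided CUSUM: flags an upward change in the mean of a monitored metric."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.g = 0.0                       # running cumulative sum (the only state)

    def update(self, x):
        # max(...) keeps the statistic non-negative: the positive-direction test.
        # Using min instead (on a mirrored statistic) tests the reverse effect.
        self.g = max(0.0, self.g + x - self.delta)
        return self.g > self.lam           # True => drift alarm
```

Because only `g` is carried forward, the detector is memory-less and works incrementally, exactly as the slide states.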
• The Page-Hinkley Test (PHT) is a variant of the CUSUM approach.
• PHT monitors the metric as an accumulated difference between its current values
and their running mean: m_T = Σ_{t=1..T} (x_t − x̄_t − δ), with
PH_T = m_T − min_{t≤T} m_t; drift is signaled when PH_T exceeds a threshold λ.
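A hedged sketch of the Page-Hinkley test as described; the parameter values are illustrative:

```python
class PageHinkley:
    """Page-Hinkley test sketch: accumulates the difference between current
    values and their running mean, and alarms on a sustained upward shift."""
    def __init__(self, delta=0.005, lam=50.0):
        self.delta, self.lam = delta, lam
        self.mean, self.n = 0.0, 0
        self.m, self.m_min = 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n        # incremental running mean
        self.m += x - self.mean - self.delta         # accumulated deviation m_T
        self.m_min = min(self.m_min, self.m)         # min over all t <= T
        return self.m - self.m_min > self.lam        # True => drift alarm
```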
Landmark time window for drift detection. The starting point of the window is
fixed, while the end point of the window will be extended after a new data
instance has been received.
Statistical Process Control based methodologies
• EDDM (Early Drift Detection Method)
• An extension of DDM, made suitable for slow-moving gradual drifts, where DDM
previously failed.
• EDDM monitors the number of samples between two classification errors as the
metric to be tracked online for drift detection.
• The model assumes that, in stationary environments, the distance (in number of
samples) between two subsequent errors increases.
• A violation of this condition is taken as indicative of drift.
DDM (Drift Detection Method)
Pros:
• DDM shows good performance when detecting abrupt changes (incremental and
sudden drifts) and gradual changes, if they are not very slow.
Cons:
• DDM has difficulty detecting drift when the change is slowly gradual.
• Many samples may be stored for a long time before the drift level is activated,
with the risk of overflowing the sample storage.
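DDM's warning/drift logic can be sketched as follows; the error-rate thresholds (2 and 3 standard deviations) follow the DDM proposal, while the 30-sample warm-up is an illustrative choice:

```python
import math

class DDM:
    """Drift Detection Method sketch: monitors a classifier's online error
    rate p and its std s = sqrt(p*(1-p)/n).  Warning when p + s >= p_min +
    2*s_min, drift when p + s >= p_min + 3*s_min."""
    def __init__(self):
        self.n, self.p = 0, 1.0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, error):              # error: 1 = misclassified, 0 = correct
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < 30:                   # warm-up: too few samples to judge
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s   # track the best point seen
        if self.p + s >= self.p_min + 3 * self.s_min:
            return "drift"
        if self.p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```

Samples arriving between "warning" and "drift" are what DDM buffers, which is exactly the storage-overflow risk noted above for very slow gradual changes.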
Statistical Process Control based methodologies
• STEPD (Statistical Test of Equal Proportions)
• Computes the accuracy over a chunk C of recent samples and compares it with the
overall accuracy from the beginning of the stream, using a chi-square test to
check for deviation.
Two time windows for concept drift detection. The new data
window has to be defined by the user.
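The test of equal proportions that STEPD relies on can be sketched as a continuity-corrected chi-square statistic; the critical value 3.84 (chi-square, 1 degree of freedom, 5% level) is a standard illustrative choice, not mandated by the slide:

```python
import math

def equal_prop_stat(correct_recent, n_recent, correct_all, n_all):
    """Test-of-equal-proportions sketch: compares accuracy over the recent
    chunk against the overall accuracy, with continuity correction.  The
    squared statistic is ~ chi-square with 1 dof under the null hypothesis
    that the two accuracies are equal."""
    p_hat = (correct_recent + correct_all) / (n_recent + n_all)
    diff = abs(correct_recent / n_recent - correct_all / n_all)
    num = max(diff - 0.5 * (1 / n_recent + 1 / n_all), 0.0)
    den = math.sqrt(p_hat * (1 - p_hat) * (1 / n_recent + 1 / n_all))
    return (num / den) ** 2 if den else 0.0

# An accuracy collapse in the recent chunk (10/30 vs 270/300 overall) is flagged:
print(equal_prop_stat(10, 30, 270, 300) > 3.84)   # True
print(equal_prop_stat(27, 30, 270, 300) > 3.84)   # False
```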
Window based distribution monitoring methodologies
• Window based approaches use a chunk based or sliding window over the recent
samples to detect changes.
• Deviations are computed by comparing the current chunk’s distribution to a
reference distribution, obtained at the start of the stream from the training
dataset.
• These approaches provide precise localization of the change point, and are robust
to noise and transient changes.
• Extra memory is required to store these two distributions over time.
Window based distribution monitoring methodologies
• ADWIN (ADaptive WINdowing)
• The algorithm uses a variable-length sliding window, whose length is computed
online according to the observed changes.
• Whenever two large enough sub-windows of the current window exhibit distinct
averages of the performance metric, a drift is detected.
• Hoeffding bounds are used to determine the optimal change threshold and window
parameters.
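A simplified ADWIN-style check, scanning all two-way splits of one window and comparing sub-window means against a Hoeffding-style bound; the `delta` confidence parameter and the exact bound are illustrative simplifications of the real algorithm (which maintains the window incrementally and compresses it):

```python
import math

def adwin_check(window, delta=0.05):
    """Scan every split of `window` into two sub-windows; if their means
    differ by more than a Hoeffding-style bound, return the cut point so
    the older part can be dropped, else return None (no drift)."""
    n, total = len(window), sum(window)
    head = 0.0
    for i in range(1, n):
        head += window[i - 1]
        n0, n1 = i, n - i
        mu0, mu1 = head / n0, (total - head) / n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)               # harmonic-mean sample size
        eps = math.sqrt(math.log(4.0 / delta) / (2.0 * m))
        if abs(mu0 - mu1) > eps:
            return i                                   # drift: cut the window here
    return None
```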
Window based distribution monitoring methodologies
• DoD (Degree of Drift)
• Detects drifts by computing a distance map between all samples in the current
chunk and their nearest neighbors in the previous chunk.
• If the distance increases by more than a parameter θ, a drift is signaled.
• Drift is handled by replacing the stable model with the reactive one and
resetting the circular disagreement list to all zeros.
Implicit drift detection methodologies
• Novelty detection / clustering based methods
• Capable of identifying uncertain, suspicious samples that need further evaluation.
• An additional ’Unknown’ class label indicates that these suspicious samples do not
fit the existing view of the data.
• Clustering and outlier based approaches are popular for detecting novel patterns,
as they summarize the current data and can use dissimilarity metrics to identify
new samples.
Training New Models for Global Drift
• A new model is trained with the latest data to replace the old model when a
concept drift is detected.
• Adjusting only parts of an existing model, in contrast, is arguably more efficient
when the drift occurs only in local regions.
• Many methods in this category are based on the decision tree algorithm, because
trees have the ability to examine and adapt to each sub-region separately.
Example: Concept Drift Detection over Social Media
Kappa statistic
• Here Aref is the accuracy of the reference classifier being evaluated and Arand is
the accuracy of a random classifier:
κ = (Aref − Arand) / (1 − Arand)
• Kappa values lie in the range [0, 1], sometimes expressed as a percentage range
[0%, 100%]; a higher value implies better performance.
Temporal Kappa statistic
• This statistic measures the effectiveness of a classifier in the presence of temporal
dependence in streaming data, where the class label of the instance at time t+1
tends to be the same as that of the instance at time t. The kappa-temporal
statistic replaces the random baseline with the no-change (persistent) classifier,
whose accuracy is Aper:
κ_temp = (Aref − Aper) / (1 − Aper)
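Both kappa variants share the same shape and differ only in the baseline; a small sketch (the function name is ours):

```python
def kappa(a_ref, a_base):
    """Kappa-style skill score: improvement of the reference classifier over
    a baseline, normalized by the baseline's remaining headroom.

    a_base = accuracy of a random classifier      -> Kappa statistic
    a_base = accuracy of the no-change classifier -> Kappa-temporal statistic
    """
    return (a_ref - a_base) / (1.0 - a_base)

# 90% accuracy looks strong against a 50% random baseline ...
print(round(kappa(0.90, 0.50), 2))   # 0.8
# ... but much weaker once a persistent (no-change) classifier scores 80%.
print(round(kappa(0.90, 0.80), 2))   # 0.5
```

This is why kappa-temporal is the more honest measure on streams with strong temporal dependence.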
Future Directions
• Drift detection research should not only focus on identifying the drift occurrence
time accurately, but should also provide information about drift severity and
regions; this information could be utilized for better concept drift adaptation.
• In real-world scenarios the cost of acquiring true labels can be high, so
unsupervised and semi-supervised drift detection and adaptation remain promising
directions for the future.
• Algorithms:
• Basic Statistics
• Regression
• Classification
• Recommendation System
• Clustering
• Dimensionality Reduction
• Feature Extraction
• Optimization
Complex Event Processing (CEP) Example
• Consider a smart home with sensors at the doors, a smart WiFi router, and room
movement detectors. With CEP streaming all the data into a home server, a user
could define rules like the following:
• If it is daytime and the door is closed and no phones are connected to the WiFi,
set the house to “nobody home”
• If nobody is home and the door is unlocked, then lock the door and turn on the
alarm
• If nobody is home and it is winter, lower the house temperature to 18 °C
• If nobody is home and it is summer, turn off the air conditioning
• If nobody is home and the door is unlocked by a family member, then turn off the
alarm and set the house to “people are home”
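The first four rules above can be sketched as a plain rule-evaluation function; the state keys are illustrative assumptions, and a real CEP engine would express these as declarative event patterns instead of imperative checks:

```python
def apply_rules(state):
    """Evaluate the smart-home rules against a snapshot of sensor state (dict)."""
    # Rule 1: daytime + door closed + no phones on WiFi => nobody home
    if state["daytime"] and state["door_closed"] and state["phones_on_wifi"] == 0:
        state["nobody_home"] = True
    if state.get("nobody_home"):
        # Rule 2: lock the door and arm the alarm
        if not state["door_locked"]:
            state["door_locked"] = True
            state["alarm_on"] = True
        # Rules 3 and 4: seasonal climate control
        if state["season"] == "winter":
            state["thermostat_c"] = 18
        elif state["season"] == "summer":
            state["ac_on"] = False
    return state
```

The fifth rule (a family member unlocking the door) would need the unlock *event* itself, not just state, which is precisely the step from simple rules to event processing.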
Main features:
• High frequency processing
• Parallel computing
• Fault-tolerant
• Robust to imperfect and asynchronous data
• Extensible (implementation of new operators)
Notable products:
• StreamBase (Tibco)
• InfoSphere Streams (IBM)
• STORM (Open source – Twitter)
• KINESIS (API - Amazon)
• SQLstream
• Apama
• Apache Flink
Application Areas
[Figure: stream operators connecting application sources such as e-mail, stocks,
databases, and XML feeds.]
1. Perform data processing without first storing and retrieving the data
3. Store and access current or historical state information using a familiar standard
such as SQL
4. Handle stream imperfections (e.g. late or delayed, missing, or out-of-sequence data)
6. Fail over a streaming application to a back-up and keep running in the event of a
primary system failure
7. Split applications over multiple processors or machines for scalability, without
writing low-level code
8. Run rules 1-7 in-process at tens to hundreds of thousands of messages per second
with low latency
CEP on Business Processes
A Use Case
• A business process is, in its simplest form, a chain of correlated events. It has a
start event and a completion event. See the example depicted below:
{
"event_type": "ORDER_CREATED",
"event_id": 1,
"occurred_at": "2017-04-18T20:00:00.000Z",
"order_number": 123
}
{
"event_type": "ALL_PARCELS_SHIPPED",
"event_id": 11,
"occurred_at": "2017-04-19T08:00:00.000Z",
"order_number": 123
}
Notice that the events are correlated on order_number, and also that they
occur in order according to their occurred_at values.
• Problem Statement: A complex event is an event which is inferred
from a pattern of other events.
For our example business process, we want to infer the event ALL_PARCELS_SHIPPED
from a pattern of PARCEL_SHIPPED events, i.e. generate ALL_PARCELS_SHIPPED when
all distinct PARCEL_SHIPPED events pertaining to an order have been received within 7
days. If the received set of PARCEL_SHIPPED events is incomplete after 7 days, we
generate the alert event THRESHOLD_EXCEEDED.
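A hedged, non-Flink sketch of this pattern; the `parcel_id` field and the up-front `expected_parcels` map are assumptions, and a real CEP engine would use a timer for the 7-day timeout rather than waiting for a late event to arrive:

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(days=7)

def correlate(events, expected_parcels):
    """Emit ALL_PARCELS_SHIPPED once every expected parcel of an order has
    shipped within 7 days of ORDER_CREATED; a PARCEL_SHIPPED arriving after
    the deadline triggers THRESHOLD_EXCEEDED instead."""
    started, shipped, out = {}, {}, []
    for e in events:
        ts = datetime.fromisoformat(e["occurred_at"].replace("Z", "+00:00"))
        order = e["order_number"]          # events are correlated on order_number
        if e["event_type"] == "ORDER_CREATED":
            started[order], shipped[order] = ts, set()
        elif e["event_type"] == "PARCEL_SHIPPED":
            if ts - started[order] > TIMEOUT:
                out.append(("THRESHOLD_EXCEEDED", order))
                continue
            shipped[order].add(e["parcel_id"])
            if len(shipped[order]) == expected_parcels[order]:
                out.append(("ALL_PARCELS_SHIPPED", order))
    return out
```

In Flink CEP the same idea would be a declarative pattern with a `within` time constraint and a timeout handler for the alert branch.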
More about Flink CEP