
Methodologies for Stream Data Processing and Stream Data Systems

The document discusses methodologies for stream data processing, highlighting the need for real-time analytics due to growing data volumes and limitations of batch processing. It covers key algorithms such as Lossy Counting, Reservoir Sampling, and Hoeffding Tree, which are essential for handling continuous data streams in various applications like fraud detection and stock market analysis. The conclusion emphasizes the importance of real-time processing in improving operational efficiency and competitive advantage across multiple sectors.

Uploaded by

chitrabhanuk
Copyright
© All Rights Reserved

METHODOLOGIES FOR STREAM DATA PROCESSING AND STREAM DATA SYSTEMS

Mahima. A (AP24122060006)
Lokesh. K (AP24122060020)
Vamsi. K (AP24122060022)
Namratha. N (AP24122040015)
Teena Shaik (AP24122040002)
Avishek Thakur (AP24122040017)
Gowthami. G (AP24122040018)
Motivation For Stream Data Processing

▪ Growing Data Volumes: Continuous data streams arrive from IoT devices, social media, financial transactions, etc.

▪ Limitations of Batch Processing: Traditional batch methods cannot provide real-time insights or handle massive data inflows efficiently.

▪ Need for Real-time Analytics: Businesses require instant data analysis for quick decision-making and anomaly detection.

▪ Operational Efficiency: Stream processing enhances automation, reduces latency, and improves responsiveness in various applications.

▪ Scalability & Reliability: Stream systems can handle large-scale, continuous data efficiently while keeping the system stable.

▪ Enhanced Customer Experience: Stream processing enables personalized, real-time interactions and recommendations.


INTRODUCTION

• It is impractical to scan an entire data stream more than once, and some streams arrive too fast to examine every element.
• A data stream is potentially unbounded, so it cannot be stored entirely in main memory or on disk.
• The challenge is not just the volume of data but the large universe of possible values. A small universe, like human ages (0–120), can be tracked with one counter per value; a large universe (for example, all IP addresses) cannot.
• Data stream processing is crucial in multiple domains:
    Network Security: Detects anomalies like DDoS attacks and traffic spikes.
    Social Media Analytics: Identifies trending topics in real time.
    Financial Markets: Tracks stock price fluctuations and enables high-frequency trading.
• As data generation keeps growing, scalable stream processing is essential.
KEYWORDS AND DEFINITIONS
• Stream Data – A continuous flow of real-time data from sources like IoT devices, stock markets, and network traffic.
• Data Stream Management System (DSMS) – A system designed to manage and process continuous data streams efficiently.
• Continuous Query – A query that is executed continuously as new data arrives in a stream.
• Sliding Window – A technique that processes only the most recent portion of a data stream for analysis.
• Frequent Pattern Analysis – Identifying patterns that frequently appear in streaming data for trend detection.
• Stream Classification – Categorizing real-time data points into predefined classes for applications like fraud detection.
• Concept Drift – A change in data patterns over time, requiring adaptive models to maintain accuracy.
• CluStream – A framework for clustering evolving data streams using micro and macro clustering techniques.
Important Algorithms in Stream Data Processing & Mining

Lossy Counting Algorithm


• A space-efficient algorithm for mining frequent patterns in data streams.
• It maintains approximate frequency counts by dividing the stream into buckets of width w = ⌈1/ε⌉, where ε is a user-chosen error bound, and pruning infrequent items at the end of each bucket.
• Each stored count undercounts the true frequency by at most εN, where N is the number of elements seen so far.
• This method is useful when exact counts are impractical due to memory constraints.

• It trades a bounded counting error for lower memory usage, making it well suited to applications in web
analytics, fraud detection, network traffic monitoring, and real-time recommendation systems.
Algorithm Steps:

Maintain a dictionary (or table) D to store item frequencies and an associated error term.
Each entry (e, f, Δ) in D consists of:
e → The item (element)
f → Estimated frequency of e
Δ → Maximum possible error in the count of e

Processing Incoming Elements:


For each new element e in the stream:
If e is already in D, increment its count f.
If e is not in D, insert it with an estimated count of 1 and set the error term Δ = current bucket ID - 1.

Prune Items Periodically:


After every w = ⌈1/ε⌉ elements (that is, at the end of each bucket):
Remove items where f + Δ ≤ current bucket ID.
This step ensures that less frequent items are forgotten, saving memory.
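The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `lossy_count` and the choice to return a plain dictionary are assumptions made for the example, and the bucket width is taken as w = ⌈1/ε⌉.

```python
import math


def lossy_count(stream, epsilon):
    """One pass of Lossy Counting; returns {item: (f, delta)}.

    Hypothetical helper for illustration. f is the estimated frequency and
    delta the maximum possible undercount of each surviving item.
    """
    width = math.ceil(1 / epsilon)  # bucket width w
    table = {}                      # item -> [f, delta]
    n = 0                           # number of elements seen so far
    for item in stream:
        n += 1
        bucket_id = math.ceil(n / width)  # ID of the current bucket
        if item in table:
            table[item][0] += 1           # known item: increment f
        else:
            table[item] = [1, bucket_id - 1]  # new item: f = 1, delta = b - 1
        if n % width == 0:
            # End of a bucket: forget items with f + delta <= bucket ID.
            stale = [k for k, (f, d) in table.items() if f + d <= bucket_id]
            for key in stale:
                del table[key]
    return {k: tuple(v) for k, v in table.items()}


counts = lossy_count("abacadae", epsilon=0.25)
print(counts)
```

With ε = 0.25 the bucket width is 4, so the one-off items b, c, d, and e are pruned at the bucket boundaries while a survives.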
Reservoir Sampling
Reservoir Sampling is an algorithm used for random sampling from a large or infinite data
stream when the total number of elements is unknown or too large to store in memory. It
ensures that each element has an equal probability of being selected in a fixed-size sample
(reservoir).
• Commonly used in real-time analytics, for example to sample web traffic and system logs.
Algorithm
• Initialization
Fill the reservoir with the first k elements from the stream.
• Iterate Through Remaining Elements
For the i-th element of the stream (starting from i = k+1):
Generate a random integer j uniformly between 1 and i (inclusive).
If j ≤ k, replace the j-th element of the reservoir with the new element; otherwise discard the new element.
• Output
The reservoir contains a random sample of k elements.
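A minimal Python sketch of the procedure follows. The function name `reservoir_sample` and the optional `rng` parameter (added only to make runs reproducible) are assumptions for illustration. Stream positions are counted from 1, so the i-th element is kept with probability k/i.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable.

    Hypothetical helper: `rng` may be a seeded random.Random for
    reproducible results.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(1, i)       # uniform integer in [1, i], both ends included
            if j <= k:
                reservoir[j - 1] = item  # replace slot j (1-based) with the new item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
print(sample)
```

Only the k reservoir slots are held in memory, regardless of how long the stream runs.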
Hoeffding Tree Algorithm
The Hoeffding Tree Algorithm (also known as VFDT - Very Fast Decision Tree) is a streaming
decision tree learning algorithm designed for large-scale, high-speed data streams. It allows
incremental learning while ensuring theoretical guarantees on accuracy.
Hoeffding Trees work by:
Processing each instance once (single pass)
Using limited memory
Making decisions incrementally
Providing probabilistic guarantees on split decisions using Hoeffding's bound

Hoeffding Bound Formula:

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}$$

ε → Margin of error for the best attribute's score.
R → Range of the splitting criterion (e.g., entropy, Gini index).
δ → Confidence parameter: the bound holds with probability 1 − δ (smaller δ = more certainty before splitting).
n → Number of instances seen at the node.
If the difference between the scores of the best and second-best attributes is greater than ε, the best attribute is
chosen for splitting.
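As a worked illustration of the split test, the sketch below evaluates the standard Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)) and compares it with the gap between the two best attribute scores. The function names and the sample scores are hypothetical values chosen for the example.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon for a criterion with range R, confidence 1 - delta, n instances."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(best_score, second_score, value_range, delta, n):
    """True when the best attribute beats the runner-up by more than epsilon."""
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)


# Hypothetical scores: best attribute 0.30, second best 0.15, R = 1, delta = 1e-7.
for n in (100, 1000):
    eps = hoeffding_bound(1.0, 1e-7, n)
    print(n, round(eps, 4), should_split(0.30, 0.15, 1.0, 1e-7, n))
```

Because ε shrinks as n grows, the same score gap of 0.15 is not enough to split after 100 instances but is enough after 1000.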

Hoeffding Tree Algorithm Steps


• Start with a single root node
    No splits exist initially, so every instance is handled at the root.
• Update statistics for each incoming data instance
    Track class distributions for attribute values.
• Check if a split is needed using the Hoeffding Bound
    Compute the scores of the best and second-best attributes based on the splitting criterion (e.g., entropy, Gini index).
    If their difference exceeds ε, split on the best attribute.
• Create child nodes and repeat the process recursively.
• Prune or discard less useful nodes to maintain efficiency.
CluStream Algorithm
The CluStream algorithm is a streaming clustering technique designed to handle evolving data
streams. It processes data in real-time (online phase) by maintaining micro-clusters and refines
them into macro-clusters during an offline phase.

CluStream Algorithm Steps


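Online Micro-Clustering
• Receive a new data point.
• Find the nearest micro-cluster (based on Euclidean distance to its centroid).
• If the point is close enough, assign it to that micro-cluster and update the micro-cluster statistics: N (number of points), LS (linear sum of the points), SS (sum of squares of the points), and T (timestamp statistics).
• If no close cluster exists:
    • Create a new micro-cluster.
    • If the micro-cluster count exceeds a threshold, merge or discard clusters.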

Offline Macro-Clustering
• Retrieve stored micro-clusters.
• Apply k-means clustering on micro-cluster centroids.
• Generate macro-clusters from micro-clusters.
• Perform historical trend analysis on clusters over time.
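The online phase can be sketched as follows. This is a simplified illustration under stated assumptions: the class `MicroCluster`, the function `online_step`, and the fixed `max_distance` threshold are hypothetical, and the merge-or-discard step for a full cluster budget is omitted. The statistics correspond to N, LS, SS, and the timestamp sums described above.

```python
import math


class MicroCluster:
    """Additive summary of the points absorbed so far (illustrative only)."""

    def __init__(self, point, timestamp):
        self.n = 1                          # N: number of points
        self.ls = list(point)               # LS: per-dimension linear sum
        self.ss = [x * x for x in point]    # SS: per-dimension sum of squares
        self.t_sum = timestamp              # T: sum of timestamps
        self.t_sq_sum = timestamp * timestamp  # T: sum of squared timestamps

    def absorb(self, point, timestamp):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x
        self.t_sum += timestamp
        self.t_sq_sum += timestamp * timestamp

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square deviation of the points from the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ss, ls in zip(self.ss, self.ls))
        return math.sqrt(max(var, 0.0))


def online_step(clusters, point, timestamp, max_distance):
    """Absorb the point into the nearest micro-cluster or start a new one."""
    if clusters:
        nearest = min(clusters, key=lambda mc: math.dist(mc.centroid(), point))
        if math.dist(nearest.centroid(), point) <= max_distance:
            nearest.absorb(point, timestamp)
            return nearest
    new_cluster = MicroCluster(point, timestamp)
    clusters.append(new_cluster)
    return new_cluster


clusters = []
for t, p in enumerate([(0.0, 0.0), (0.2, 0.0), (5.0, 5.0)], start=1):
    online_step(clusters, p, t, max_distance=1.0)
print(len(clusters), clusters[0].centroid())
```

Because every statistic is a running sum, a point is absorbed in constant time and two micro-clusters can be merged by adding their fields. The offline phase would then run k-means over the stored centroids.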
Misra-Gries Algorithm
The Misra-Gries algorithm is a streaming algorithm used to find frequent elements (also known
as heavy hitters) in a data stream with limited memory.

• Efficiently processes large data streams using fixed memory.


• Identifies approximate frequent items without storing the full dataset.
• Works in one pass (single scan) over the data.
• Used in network traffic monitoring, fraud detection, and trend analysis.

Given a data stream of elements, the goal is to find elements appearing more than n/k times, where:
    n is the total number of elements in the stream.
    k is a user-defined parameter controlling memory usage.

The algorithm maintains a fixed number of counters (k-1) instead of storing all elements.
1. Initialize k-1 empty counters, each storing:
• An element.
• A count (number of times observed).
2. Process each element from the stream:
• If the element is already in the counters, increment its count.
• Otherwise, if there is a free counter, add the element with count = 1.
• Otherwise (all counters are in use), decrease every count by 1 and free any counter that reaches zero. The new element itself is not stored in this step.
3. Extract frequent elements:
• Every element appearing more than n/k times is guaranteed to be among the stored counters, and each stored count undercounts the true count by at most n/k.
• The counters may also hold infrequent elements, so if exact counts are required and the stream can be replayed, re-check the true counts of the stored elements in a second pass.
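A compact Python sketch of the counter update follows. The function name `misra_gries` is an assumption for illustration, and the code follows the standard formulation in which an element that arrives when all k-1 counters are in use triggers a decrement of every counter and is not itself stored.

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters as {item: approximate_count}.

    Hypothetical helper: uses at most k - 1 counters, so every item that
    occurs more than n/k times is guaranteed to be among the keys.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1            # already tracked: increment
        elif len(counters) < k - 1:
            counters[item] = 1             # free counter available: start tracking
        else:
            # All counters in use: decrement each one and free those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


candidates = misra_gries("aabacadaa", k=3)
print(candidates)
```

In this nine-element stream, a occurs six times (more than n/k = 3) and is retained with an approximate count of 5. The element d is also retained although it is infrequent, which is why a verification pass is needed when exact results matter.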
EXAMPLE PROBLEMS

Frequent Item Detection in a Twitter Stream

Problem: Social media platforms like Twitter generate massive amounts of real-time data,
including hashtags and trending topics. Detecting the top 10 most frequently used hashtags
in a continuous tweet stream is essential for marketing analytics and trend analysis.

Challenges: The dataset is unbounded, making it impossible to store all tweets for analysis.
The system must efficiently track hashtags while using limited memory.

Solution: The Misra-Gries Algorithm or Lossy Counting Algorithm can be used to maintain
approximate frequency counts of hashtags in real-time. These algorithms ensure that the
most popular hashtags are identified while ignoring less frequent ones.
Real-Time Fraud Detection in Credit Card Transactions

Problem: Credit card companies need to detect fraudulent transactions in real-time to prevent
financial losses. The challenge is to analyze a continuous stream of transactions and flag
unusual activity based on past patterns.

Challenges: Fraudulent activities often change dynamically, making traditional batch
processing ineffective. A real-time detection system should adapt to evolving fraud patterns
and minimize false positives.

Solution: The VFDT (Very Fast Decision Tree) algorithm can be used to classify transactions
based on attributes like transaction amount, location, and time. The model updates
dynamically, allowing it to detect new fraud trends efficiently.
Stock Market Trend Analysis

Problem: Investors and financial analysts need to predict stock price movements based on real-
time market data streams. This involves analyzing high-speed data like stock prices, trade
volumes, and investor sentiment.

Challenges: Market trends fluctuate rapidly, requiring real-time processing rather than historical
batch analysis. Memory-efficient algorithms must be used to track price changes dynamically.

Solution: The Sliding Window Model helps analyze only the most recent stock price data,
ensuring that outdated trends do not influence predictions. It enables timely decision-making for
investors.
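One simple way to realize a sliding window is a fixed-size queue that drops the oldest value as each new one arrives. The sketch below keeps a running mean over the last `size` prices; the class name `SlidingWindowMean` and the sample prices are hypothetical, and a count-based window is assumed (a time-based window would evict by timestamp instead).

```python
from collections import deque


class SlidingWindowMean:
    """Running mean over the most recent `size` values (size must be >= 1)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest value is evicted automatically
        self.total = 0.0

    def add(self, value):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]  # subtract the value about to be evicted
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


prices = SlidingWindowMean(size=3)
for price in [10, 20, 30, 40]:
    print(prices.add(price))
```

Memory use is bounded by the window size, and each update takes constant time because the sum is maintained incrementally rather than recomputed.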

Customer Segmentation in E-commerce

Problem: Online shopping platforms need to segment customers based on their real-time
browsing and purchasing behavior. This helps businesses personalize recommendations and
improve customer engagement.

Challenges: Customer behaviors change continuously, requiring a system that can
dynamically update customer groups. Traditional clustering techniques struggle with
evolving data streams.

Solution: The CluStream Algorithm clusters users in real-time based on their activity.
Micro-clusters track short-term behavior, while macro-clusters help identify long-term trends.

Video Surveillance & Motion Detection

Problem: Security cameras generate continuous video streams, requiring efficient real-time
motion detection to identify suspicious activities.

Challenges: Storing and analyzing every frame is infeasible due to high data volume. The
system must focus on recent frames while ignoring older, irrelevant ones.

Solution: Sliding Window-based Anomaly Detection processes only the latest frames, reducing
memory usage while ensuring quick detection of unusual movement patterns.
CONCLUSION

● Real-time Necessity: Enables immediate analysis & decision-making in finance, healthcare, cybersecurity, IoT.
● Compared to Batch Processing: Stream processing handles high-velocity, continuous data that batch methods cannot analyze in real time.
● Key Algorithms: Sliding Window Models, VFDT, CluStream, Lossy Counting keep memory usage bounded by summarizing the stream instead of storing it.
● Scalability & Frameworks: Apache Flink, Kafka Streams, Spark Streaming support large-scale distributed processing.
● Future Trends: AI, deep learning, edge computing to enhance real-time analytics & decision-making.
● Business Impact: Improves operational efficiency, predictive modeling, & competitive advantage.
THANK YOU
