Methodologies for Stream Data Processing and Stream Data Systems
Mahima. A (AP24122060006)
Lokesh. K (AP24122060020)
Vamsi. K (AP24122060022)
Namratha. N (AP24122040015)
Teena Shaik (AP24122040002)
Avishek Thakur (AP24122040017)
Gowthami. G (AP24122040018)
Motivation For Stream Data Processing
▪ Growing Data Volumes: IoT devices, social media, financial transactions, and similar sources produce continuous data streams.
▪ Limitations of Batch Processing: Traditional batch methods cannot deliver real-time insights or handle massive data inflows efficiently.
▪ Need for Real-time Analytics: Businesses require instant data analysis for quick decision-making and anomaly detection.
▪ Operational Efficiency: Stream processing enhances automation, reduces latency, and improves responsiveness in various applications.
▪ Scalability & Reliability: Stream systems can handle large-scale, continuous data efficiently while keeping the system stable.
• It is impractical to scan through an entire data stream more than once. Some data streams are
too fast to examine every element.
• Gigantic data sets cannot be stored entirely in main memory or on disk.
• The challenge is not just the volume of data but the large universe of possible values. Small
universes, like human ages (0–120), can be tracked easily.
• Data stream processing is crucial in multiple domains:
  ▪ Network Security: Detects anomalies like DDoS attacks and traffic spikes.
  ▪ Social Media Analytics: Identifies trending topics in real time.
  ▪ Financial Markets: Tracks stock price fluctuations and enables high-frequency trading.
• As data generation grows exponentially, scalable stream processing is essential.
KEYWORDS AND DEFINITIONS
• Stream Data – A continuous flow of real-time data from sources like IoT devices, stock markets, and network traffic.
• Data Stream Management System (DSMS) – A system designed to manage and process continuous data streams efficiently.
• Continuous Query – A query that is executed continuously as new data arrives in a stream.
• Sliding Window – A technique that processes only the most recent portion of a data stream for analysis.
• Frequent Pattern Analysis – Identifying patterns that frequently appear in streaming data for trend detection.
• Stream Classification – Categorizing real-time data points into predefined classes for applications like fraud detection.
• Concept Drift – A change in data patterns over time, requiring adaptive models to maintain accuracy.
• CluStream – A framework for clustering evolving data streams using micro- and macro-clustering techniques.
Important Algorithms in Stream Data Processing & Mining
Lossy Counting Algorithm
• The Lossy Counting algorithm divides the stream into buckets of width ⌈1/ε⌉, where ε is the user-chosen error tolerance, and maintains item counts while pruning infrequent items at each bucket boundary.
• It strikes a balance between memory usage and accuracy, making it ideal for applications in web
analytics, fraud detection, network traffic monitoring, and real-time recommendation systems.
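Algorithm Steps:
1. Maintain a dictionary (or table) D to store item frequencies and an associated error term. Each entry (e, f, Δ) in D consists of:
e → The item (element)
f → Estimated frequency of e
Δ → Maximum possible error in the count of e
2. For each arriving item e: if e is already in D, increment f; otherwise insert (e, 1, b − 1), where b is the ID of the current bucket.
3. At each bucket boundary, prune every entry with f + Δ ≤ b.
4. To answer a query with support threshold s over N items seen so far, report the entries with f ≥ (s − ε)N.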
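The (e, f, Δ) table can be sketched in a few lines of Python. This is a minimal illustration of the standard Lossy Counting formulation, not a production implementation; the class name `LossyCounter`, its method names, and the sample stream are assumptions made for this example.

```python
import math

class LossyCounter:
    """Minimal Lossy Counting sketch (names are illustrative)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width, ceil(1/epsilon)
        self.n = 0                            # items seen so far
        self.entries = {}                     # e -> [f, delta]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)   # ID of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1            # known item: increment f
        else:
            self.entries[item] = [1, bucket - 1]  # new item: delta = b - 1
        if self.n % self.width == 0:
            # Bucket boundary: prune entries with f + delta <= b.
            self.entries = {e: fd for e, fd in self.entries.items()
                            if fd[0] + fd[1] > bucket}

    def frequent(self, support):
        """Items whose estimated count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {e: f for e, (f, _) in self.entries.items() if f >= threshold}

# Hypothetical ten-item stream; epsilon = 0.2 gives buckets of width 5.
lc = LossyCounter(epsilon=0.2)
for item in ["a", "b", "a", "c", "a", "a", "b", "a", "d", "a"]:
    lc.add(item)
frequent_items = lc.frequent(support=0.5)
```

In this run the rare items b, c, and d are pruned at the bucket boundaries, so only the entry for a remains in the table.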
CluStream: Offline Macro-Clustering
• Retrieve stored micro-clusters.
• Apply k-means clustering on micro-cluster centroids.
• Generate macro-clusters from micro-clusters.
• Perform historical trend analysis on clusters over time.
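The macro-clustering step can be sketched as a weighted k-means over micro-cluster centroids. The sketch below is illustrative only: it assumes each stored micro-cluster is summarised as a (centroid, point count) pair, uses a simple farthest-first initialisation, and the function names and sample data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def macro_cluster(micro, k, max_iter=50):
    """Weighted k-means over micro-cluster centroids.

    micro: list of (centroid, n_points) pairs (assumed summary format).
    Returns (macro_centers, assignment), where assignment[i] is the
    macro-cluster index of micro-cluster i.
    """
    # Farthest-first initialisation (a simplification; k-means++ is common).
    centers = [list(micro[0][0])]
    while len(centers) < k:
        far = max(micro, key=lambda m: min(sq_dist(m[0], c) for c in centers))
        centers.append(list(far[0]))

    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: sq_dist(c, centers[j]))
                          for c, _ in micro]
        if new_assignment == assignment:
            break                      # converged: assignments are stable
        assignment = new_assignment
        for j in range(k):
            members = [m for m, a in zip(micro, assignment) if a == j]
            if not members:
                continue               # keep the old center for an empty cluster
            total = sum(w for _, w in members)
            dims = len(centers[j])
            # Each centroid counts in proportion to the points it summarises.
            centers[j] = [sum(c[d] * w for c, w in members) / total
                          for d in range(dims)]
    return centers, assignment

# Hypothetical micro-clusters: (centroid, number of points absorbed).
micro_clusters = [((1.0, 1.0), 10), ((1.2, 0.8), 5),
                  ((8.0, 8.0), 7), ((8.2, 7.9), 3)]
macro_centers, assignment = macro_cluster(micro_clusters, k=2)
```

Weighting by point count means a micro-cluster that absorbed many points pulls the macro-cluster center more strongly than a sparse one.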
Misra-Gries Algorithm
The Misra-Gries algorithm is a streaming algorithm used to find frequent elements (also known
as heavy hitters) in a data stream with limited memory.
Given a data stream of elements, the goal is to find elements appearing more than n/k
times, where:
n is the total number of elements in the stream.
k is a user-defined parameter controlling memory usage.
The algorithm maintains a fixed number of counters (k-1) instead of storing all elements.
1. Initialize k-1 empty counters, each storing:
• An element.
• A count (number of times observed).
2. Process each element from the stream:
• If the element already has a counter, increment its count.
• Otherwise, if a counter is free, assign it to the element with count = 1.
• Otherwise (all counters are in use), decrease every count by 1 and discard the arriving element.
• Free any counter whose count reaches zero so that a later element can use it.
3. Extract frequent elements:
• The stored elements are candidates: every element that appears more than n/k times is guaranteed to be among them, but some candidates may be less frequent.
• Where a second pass over the data is possible, re-check the true counts of the candidates and keep those above n/k.
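A compact Python sketch of the procedure follows. It uses the standard Misra-Gries formulation, in which a counter that reaches zero is dropped and the arriving element is discarded. The function names and the sample stream are assumptions for illustration, and the second pass presumes the data can be replayed.

```python
from collections import Counter

def misra_gries(stream, k):
    """Return candidate heavy hitters using at most k-1 counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1               # already tracked: increment
        elif len(counters) < k - 1:
            counters[x] = 1                # free counter available
        else:
            # All counters in use: decrement every count, drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def heavy_hitters(stream, k):
    """Second pass: keep candidates whose true count exceeds n/k."""
    candidates = misra_gries(stream, k)
    n = len(stream)
    true_counts = Counter(x for x in stream if x in candidates)
    return {x: c for x, c in true_counts.items() if c > n / k}

# Hypothetical stream of 9 elements; with k = 3 the threshold is n/k = 3.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
candidates = misra_gries(stream, k=3)
result = heavy_hitters(stream, k=3)
```

The stored count for a candidate is a lower bound on its true frequency and undercounts by at most n/k, which is why the verification pass is needed to report exact counts.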
EXAMPLE PROBLEMS
Problem: Social media platforms like Twitter generate massive amounts of real-time data,
including hashtags and trending topics. Detecting the top 10 most frequently used hashtags
in a continuous tweet stream is essential for marketing analytics and trend analysis.
Challenges: The dataset is unbounded, making it impossible to store all tweets for analysis.
The system must efficiently track hashtags while using limited memory.
Solution: The Misra-Gries Algorithm or Lossy Counting Algorithm can be used to maintain
approximate frequency counts of hashtags in real-time. These algorithms ensure that the
most popular hashtags are identified while ignoring less frequent ones.
Real-Time Fraud Detection in Credit Card Transactions
Problem: Credit card companies need to detect fraudulent transactions in real-time to prevent
financial losses. The challenge is to analyze a continuous stream of transactions and flag
unusual activity based on past patterns.
Solution: The VFDT (Very Fast Decision Tree) algorithm can be used to classify transactions
based on attributes like transaction amount, location, and time. The model updates
dynamically, allowing it to detect new fraud trends efficiently.
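As background not stated on the slide: VFDT decides when to split a leaf using the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split metric, δ the allowed failure probability, and n the number of examples seen at the leaf. The sketch below shows only that split test, assuming information gain as the metric (so R = log2 of the number of classes); the function names, default δ, and the gain values are hypothetical.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(best_gain, second_gain, n, n_classes=2, delta=1e-7):
    """Split when the best attribute beats the runner-up by more than epsilon.

    Assumes information gain as the split metric, whose range is
    log2(n_classes). All parameter defaults are illustrative.
    """
    value_range = math.log2(n_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon

# Hypothetical gains for "amount" vs. "location" at one leaf.
split_after_100 = should_split(0.30, 0.10, n=100)
split_after_1000 = should_split(0.30, 0.10, n=1000)
```

With these numbers the leaf does not split after 100 transactions but does after 1000, because ε shrinks as more examples arrive and the observed gap of 0.20 becomes statistically reliable.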
Stock Market Trend Analysis
Problem: Investors and financial analysts need to predict stock price movements based on real-time market data streams. This involves analyzing high-speed data like stock prices, trade volumes, and investor sentiment.
Challenges: Market trends fluctuate rapidly, requiring real-time processing rather than historical
batch analysis. Memory-efficient algorithms must be used to track price changes dynamically.
Solution: The Sliding Window Model helps analyze only the most recent stock price data,
ensuring that outdated trends do not influence predictions. It enables timely decision-making for
investors.
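A count-based sliding window over prices can be sketched with a bounded deque. This is a minimal illustration assuming a fixed window of the last `size` ticks; the class name and the price values are made up for the example.

```python
from collections import deque

class SlidingWindowAverage:
    """Moving average over the most recent `size` prices."""

    def __init__(self, size):
        self.window = deque(maxlen=size)   # oldest price is evicted automatically
        self.total = 0.0

    def add(self, price):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # remove the price about to be evicted
        self.window.append(price)
        self.total += price
        return self.total / len(self.window)

# Hypothetical price ticks; the window keeps only the last 3.
sma = SlidingWindowAverage(size=3)
averages = [sma.add(p) for p in [100, 102, 104, 110]]
```

Keeping a running total makes each update O(1) regardless of window size, and memory stays bounded by the window length rather than by the stream length.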
Problem: Online shopping platforms need to segment customers based on their real-time
browsing and purchasing behavior. This helps businesses personalize recommendations and
improve customer engagement.
Solution: The CluStream algorithm clusters users in real time based on their activity. Micro-clusters track short-term behavior, while macro-clusters help identify long-term trends.
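The online micro-clustering phase can be sketched as follows. In CluStream, a micro-cluster is an additive summary (point count, linear and squared sums of the features and of the timestamps), so absorbing a point never requires storing it. This sketch simplifies the method considerably: it uses a fixed distance `boundary` instead of a per-cluster radius, and omits the deletion and merging of old micro-clusters. All names and the activity vectors are hypothetical.

```python
import math

class MicroCluster:
    """Additive summary of the points absorbed so far."""

    def __init__(self, point, t):
        self.n = 1
        self.ls = list(point)                 # linear sum per feature
        self.ss = [x * x for x in point]      # squared sum per feature
        self.lt = t                           # sum of timestamps
        self.st = t * t                       # sum of squared timestamps

    def absorb(self, point, t):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x
        self.lt += t
        self.st += t * t

    def centroid(self):
        return [s / self.n for s in self.ls]

def update(clusters, point, t, boundary):
    """Absorb the point into the nearest micro-cluster, or start a new one."""
    if clusters:
        nearest = min(clusters, key=lambda mc: math.dist(mc.centroid(), point))
        if math.dist(nearest.centroid(), point) <= boundary:
            nearest.absorb(point, t)
            return
    clusters.append(MicroCluster(point, t))

# Hypothetical user-activity vectors (e.g. pages viewed, items bought).
clusters = []
points = [(1.0, 1.0), (1.2, 0.9), (9.0, 9.0), (1.1, 1.0), (9.1, 8.8)]
for t, p in enumerate(points):
    update(clusters, p, t, boundary=1.0)
sizes = [mc.n for mc in clusters]
```

Because the summaries are additive, two micro-clusters can later be merged by adding their fields, which is what makes periodic snapshots and offline macro-clustering cheap.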
Problem: Security cameras generate continuous video streams, requiring efficient real-time
motion detection to identify suspicious activities.
Challenges: Storing and analyzing every frame is infeasible due to high data volume. The
system must focus on recent frames while ignoring older, irrelevant ones.
Solution: Sliding Window-based Anomaly Detection processes only the latest frames, reducing
memory usage while ensuring quick detection of unusual movement patterns.
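One way to sketch this idea is frame differencing with a sliding window of recent motion scores. The example below is an assumption-laden illustration, not a real video pipeline: frames are flat lists of grayscale pixel values, the motion score is the mean absolute pixel difference, and a frame is flagged when its score exceeds the window mean by `threshold` standard deviations (with a floor `min_sigma` so a perfectly static scene does not flag tiny noise). All names and values are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def motion_score(prev_frame, frame):
    """Mean absolute pixel difference between consecutive frames."""
    return sum(abs(a - b) for a, b in zip(prev_frame, frame)) / len(frame)

def detect_anomalies(frames, window_size=5, threshold=3.0, min_sigma=1.0):
    """Return indices of frames whose motion is unusual for the recent window."""
    scores = deque(maxlen=window_size)     # only recent scores are kept
    anomalies = []
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None:
            score = motion_score(prev, frame)
            if len(scores) == window_size:
                mu = mean(scores)
                sigma = max(pstdev(scores), min_sigma)
                if score > mu + threshold * sigma:
                    anomalies.append(i)
            scores.append(score)
        prev = frame                       # keep one frame, not the whole video
    return anomalies

# Hypothetical 4-pixel frames: a quiet scene, then a sudden change at index 6.
frames = [[10, 10, 10, 10], [11, 10, 10, 10], [10, 10, 10, 10],
          [10, 11, 10, 10], [10, 10, 10, 10], [10, 10, 11, 10],
          [200, 200, 200, 200], [200, 200, 200, 201]]
flagged = detect_anomalies(frames)
```

Only the previous frame and a handful of scores are held in memory at any time, which matches the constraint that older frames cannot be stored.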
CONCLUSION