Methodologies For Stream Data Processing and Stream Data Systems

The document discusses methodologies for stream data processing, highlighting the need for real-time analytics due to growing data volumes and limitations of batch processing. It covers key algorithms such as Lossy Counting, Reservoir Sampling, and Hoeffding Tree, which are essential for handling continuous data streams in various applications like fraud detection and stock market analysis. The conclusion emphasizes the importance of real-time processing in improving operational efficiency and competitive advantage across multiple sectors.

Uploaded by

chitrabhanuk


METHODOLOGIES FOR STREAM DATA

PROCESSING AND STREAM DATA SYSTEMS

Mahima. A (AP24122060006)
Lokesh. K (AP24122060020)
Vamsi. K (AP24122060022)
Namratha. N (AP24122040015)
Teena Shaik (AP24122040002)
Avishek Thakur (AP24122040017)
Gowthami. G (AP24122040018)
Motivation For Stream Data Processing

▪ Growing Data Volumes : Continuous data streams from IoT devices, social media, financial transactions, etc.

▪ Limitations of Batch Processing : Traditional batch methods cannot provide real-time insights or handle massive
data inflows efficiently.

▪ Need for Real-time Analytics : Businesses require instant data analysis for quick decision-making and
anomaly detection.

▪ Operational Efficiency : Enhances automation, reduces latency, and improves responsiveness in various
applications.

▪ Scalability & Reliability : Stream systems can handle large-scale, continuous data efficiently, ensuring system
stability.

▪ Enhanced Customer Experience : Enables personalized, real-time interactions and recommendations.

INTRODUCTION

• It is impractical to scan through an entire data stream more than once. Some data streams are
too fast to examine every element.
• Gigantic data sets cannot be stored entirely in main memory or on disk.
• The challenge is not just the volume of data but the large universe of possible values. Small
universes, like human ages (0–120), can be tracked easily.
• Data stream processing is crucial in multiple domains:
Network Security: Detects anomalies like DDoS attacks and traffic spikes.
Social Media Analytics: Identifies trending topics in real time.
Financial Markets: Tracks stock price fluctuations and enables high-frequency trading.
• As data generation grows exponentially, scalable stream processing is essential.
KEYWORDS AND DEFINITIONS
•Stream Data – A continuous flow of real-time data from sources like IoT devices, stock markets, and
network traffic.
•Data Stream Management System (DSMS) – A system designed to manage and process continuous data
streams efficiently.
•Continuous Query – A query that is executed continuously as new data arrives in a stream.
•Sliding Window – A technique that processes only the most recent portion of a data stream for analysis.
•Frequent Pattern Analysis – Identifying patterns that frequently appear in streaming data for trend detection.
•Stream Classification – Categorizing real-time data points into predefined classes for applications like fraud
detection.
•Concept Drift – A change in data patterns over time, requiring adaptive models to maintain accuracy.
•CluStream – A framework for clustering evolving data streams using micro and macro clustering techniques.
Important Algorithms in Stream Data Processing & Mining

Lossy Counting Algorithm

• A space-efficient algorithm for mining frequent patterns in data streams.
• It maintains approximate frequency counts by dividing data into windows and discarding
infrequent items.
• This method is useful when exact counts are impractical due to memory constraints.

• The algorithm works by dividing the stream into buckets of size 1/ε and maintaining counts of
items while periodically pruning infrequent ones.

• It strikes a balance between memory usage and accuracy, making it ideal for applications in web
analytics, fraud detection, network traffic monitoring, and real-time recommendation systems.
Algorithm Steps:

Maintain a dictionary (or table) D to store item frequencies and an associated error term.
Each entry (e, f, Δ) in D consists of:
e → The item (element)
f → Estimated frequency of e
Δ → Maximum possible error in the count of e

Processing Incoming Elements:

For each new element e in the stream:
If e is already in D, increment its count f.
If e is not in D, insert it with an estimated count of 1 and set the error term Δ = current bucket ID - 1.

Prune Items Periodically:

After every w = ⌈1/ε⌉ elements (that is, at each bucket boundary):
Remove items where f + Δ ≤ current bucket ID.
This step ensures that less frequent items are forgotten, saving memory.
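
The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `lossy_count` and `frequent_items` and the `support` parameter are assumptions introduced here, and the bucket width is taken to be ⌈1/ε⌉.

```python
import math


def lossy_count(stream, epsilon):
    """Return (table, n): table maps item -> (f, delta), n is the stream length."""
    width = math.ceil(1 / epsilon)           # bucket width w = ceil(1/epsilon)
    table = {}
    n = 0
    for item in stream:
        n += 1
        bucket_id = math.ceil(n / width)     # ID of the bucket this element falls in
        if item in table:
            f, delta = table[item]
            table[item] = (f + 1, delta)     # known item: increment its count
        else:
            table[item] = (1, bucket_id - 1) # new item: count 1, error = bucket ID - 1
        if n % width == 0:                   # bucket boundary: prune infrequent items
            table = {e: (f, d) for e, (f, d) in table.items() if f + d > bucket_id}
    return table, n


def frequent_items(table, n, support, epsilon):
    """Items whose estimated count is at least (support - epsilon) * n."""
    return sorted(e for e, (f, _) in table.items() if f >= (support - epsilon) * n)


table, n = lossy_count("aabcaadeaafg", epsilon=0.25)
print(table)                                  # {'a': (6, 0)}
print(frequent_items(table, n, 0.5, 0.25))    # ['a']
```

In this small run every item other than "a" is pruned at a bucket boundary, so only one entry stays in memory.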
Reservoir Sampling
Reservoir Sampling is an algorithm used for random sampling from a large or infinite data
stream when the total number of elements is unknown or too large to store in memory. It
ensures that each element has an equal probability of being selected in a fixed-size sample
(reservoir).
• Commonly used in real-time analytics, for example to sample web traffic and system logs.
Algorithm
• Initialization
Fill the reservoir with the first k elements from the stream.
• Iterate Through Remaining Elements
For each element at position i (starting from i = k+1):
Generate a random integer j uniformly between 1 and i (inclusive).
If j ≤ k, replace the j-th element in the reservoir with the new element.
• Output
The reservoir contains a random sample of k elements.
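
A minimal Python sketch of this procedure (often called Algorithm R) is given here. The function name `reservoir_sample` and the optional `rng` argument are assumptions made for illustration; positions are counted from 1.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):   # i is the 1-based position in the stream
        if i <= k:
            reservoir.append(item)               # fill phase: keep the first k items
        else:
            j = rng.randint(1, i)                # uniform integer in [1, i], both ends included
            if j <= k:                           # happens with probability k / i
                reservoir[j - 1] = item          # overwrite the j-th slot
    return reservoir


sample = reservoir_sample(range(1000), k=5, rng=random.Random(42))
print(len(sample))   # 5
```

Only k items are held at any time, so memory use does not depend on the length of the stream.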
Hoeffding Tree Algorithm
The Hoeffding Tree Algorithm (also known as VFDT - Very Fast Decision Tree) is a streaming
decision tree learning algorithm designed for large-scale, high-speed data streams. It allows
incremental learning while ensuring theoretical guarantees on accuracy.
Hoeffding Trees work by:
Processing each instance once (single pass)
Using limited memory
Making decisions incrementally
Providing probabilistic guarantees on split decisions using Hoeffding's bound

Hoeffding Bound Formula:

\[ \epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}} \]

ε → Margin of error for the best attribute's score.
R → Range of the splitting criterion (e.g., entropy, Gini index).
δ → Allowed probability of choosing the wrong attribute, so confidence is 1 − δ (smaller δ = more certainty before splitting).
n → Number of instances seen at the node.
If the difference between the scores of the best and second-best attributes is greater than ε, the best attribute is
chosen for splitting.
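
As a worked illustration, the bound can be evaluated numerically. The sketch below assumes the standard form ε = sqrt(R² ln(1/δ) / (2n)); the function names and the example values (R = 1, δ = 1e-7, scores 0.30 and 0.15) are chosen here only for demonstration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for range R, error probability delta, n instances."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def should_split(best, second_best, value_range, delta, n):
    """Split only when the gap between the two best scores exceeds epsilon."""
    return (best - second_best) > hoeffding_bound(value_range, delta, n)


# With R = 1 and delta = 1e-7, epsilon shrinks as more instances arrive.
print(round(hoeffding_bound(1.0, 1e-7, 200), 3))    # 0.201
print(round(hoeffding_bound(1.0, 1e-7, 1000), 3))   # 0.09
print(should_split(0.30, 0.15, 1.0, 1e-7, 200))     # False: gap 0.15 is below epsilon
print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))    # True: gap 0.15 exceeds epsilon
```

The same score gap that is inconclusive after 200 instances becomes sufficient after 1000, which is why the tree can delay a split until the evidence supports it.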
Hoeffding Tree Algorithm

Hoeffding Tree Algorithm Steps

• Start with a single root node
No splits initially, all instances are passed down from the root.
• Update statistics for each incoming data instance
Track class distributions for attribute values.
• Check if a split is needed using Hoeffding Bound
Compute the best and second-best attribute based on the splitting criterion (e.g., entropy,
Gini index).
If their difference exceeds ε, split on the best attribute.
• Create child nodes and repeat the process recursively.
• Prune or discard less useful nodes to maintain efficiency.
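
The statistics update and split check for a single leaf might look like the sketch below. It is a simplified illustration under stated assumptions: `HoeffdingLeaf` is a hypothetical class, attributes are categorical, information gain is the splitting criterion with range R = log2(number of classes), at least two attributes and one observed instance are required, and child creation, tie-breaking and pruning are omitted.

```python
import math
from collections import defaultdict


def entropy(counts):
    """Shannon entropy (base 2) of a collection of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


class HoeffdingLeaf:
    """Hypothetical leaf keeping counts[attribute][value][label] for categorical data."""

    def __init__(self, n_attributes, n_classes, delta=1e-7):
        self.n = 0
        self.delta = delta
        self.value_range = math.log2(n_classes)          # range R of information gain
        self.class_counts = defaultdict(int)
        self.stats = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attributes)]

    def update(self, x, label):
        """Update the sufficient statistics with one labelled instance."""
        self.n += 1
        self.class_counts[label] += 1
        for attr, value in enumerate(x):
            self.stats[attr][value][label] += 1

    def info_gain(self, attr):
        before = entropy(self.class_counts.values())
        after = 0.0
        for label_counts in self.stats[attr].values():
            size = sum(label_counts.values())
            after += size / self.n * entropy(label_counts.values())
        return before - after

    def split_attribute(self):
        """Return the attribute index to split on, or None if evidence is insufficient."""
        gains = sorted(((self.info_gain(a), a) for a in range(len(self.stats))), reverse=True)
        epsilon = math.sqrt(self.value_range ** 2 * math.log(1 / self.delta) / (2 * self.n))
        (best, best_attr), (second, _) = gains[0], gains[1]
        return best_attr if best - second > epsilon else None


leaf = HoeffdingLeaf(n_attributes=2, n_classes=2)
for i in range(400):
    label = i % 2
    leaf.update((label, 0), label)   # attribute 0 predicts the label, attribute 1 is constant
print(leaf.split_attribute())        # 0
```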
CluStream Algorithm
The CluStream algorithm is a streaming clustering technique designed to handle evolving data
streams. It processes data in real-time (online phase) by maintaining micro-clusters and refines
them into macro-clusters during an offline phase.

CluStream Algorithm Steps

Online Micro-Clustering
• Receive a new data point.
• Assign it to the nearest micro-cluster (based on Euclidean distance).
• If no close cluster exists:
• Create a new micro-cluster.
• If micro-cluster count exceeds a threshold, merge or discard clusters.
• Update micro-cluster statistics (N, LS, SS, T).

Offline Macro-Clustering
• Retrieve stored micro-clusters.
• Apply k-means clustering on micro-cluster centroids.
• Generate macro-clusters from micro-clusters.
• Perform historical trend analysis on clusters over time.
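
The online phase can be sketched as follows. This is a simplified illustration, not the full CluStream method: `MicroCluster` and `assign` are hypothetical names, T is assumed to mean the sum and squared sum of timestamps, a fixed `max_distance` stands in for the radius-based boundary test, and the merge/discard step and the offline k-means phase are left out.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MicroCluster:
    n: int = 0                                       # N: number of points absorbed
    ls: List[float] = field(default_factory=list)    # LS: per-dimension linear sum
    ss: List[float] = field(default_factory=list)    # SS: per-dimension squared sum
    t_sum: float = 0.0                               # T: sum of timestamps (assumed meaning)
    t_sq_sum: float = 0.0                            # T: sum of squared timestamps

    def add(self, point, timestamp):
        """Update the summary statistics additively; the point itself is not stored."""
        if self.n == 0:
            self.ls = [0.0] * len(point)
            self.ss = [0.0] * len(point)
        self.n += 1
        for d, v in enumerate(point):
            self.ls[d] += v
            self.ss[d] += v * v
        self.t_sum += timestamp
        self.t_sq_sum += timestamp * timestamp

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square deviation of the absorbed points from the centroid."""
        var = sum(self.ss[d] / self.n - (self.ls[d] / self.n) ** 2 for d in range(len(self.ls)))
        return math.sqrt(max(var, 0.0))


def assign(clusters, point, timestamp, max_distance):
    """Absorb point into the nearest micro-cluster within max_distance, else start a new one."""
    nearest = min(clusters, key=lambda c: math.dist(c.centroid(), point), default=None)
    if nearest is None or math.dist(nearest.centroid(), point) > max_distance:
        nearest = MicroCluster()
        clusters.append(nearest)
    nearest.add(point, timestamp)
    return nearest


clusters = []
for t, p in enumerate([(0.0, 0.0), (0.2, 0.0), (10.0, 10.0)], start=1):
    assign(clusters, p, t, max_distance=1.0)
print(len(clusters))   # 2
```

Because N, LS and SS are plain sums, two micro-clusters can be merged by adding their fields, which is what makes the summaries cheap to maintain online.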
Misra-Gries Algorithm
The Misra-Gries algorithm is a streaming algorithm used to find frequent elements (also known
as heavy hitters) in a data stream with limited memory.

• Efficiently processes large data streams using fixed memory.

• Identifies approximate frequent items without storing the full dataset.
• Works in one pass (single scan) over the data.
• Used in network traffic monitoring, fraud detection, and trend analysis.

Given a data stream of elements, the goal is to find elements appearing more than n/k
times, where:
n is the total number of elements in the stream.
k is a user-defined parameter controlling memory usage.
Misra-Gries Algorithm

The algorithm maintains a fixed number of counters (k-1) instead of storing all elements.
1. Initialize k-1 empty counters, each storing:
• An element.
• A count (number of times observed).
2. Process each element from the stream:
• If the element is already in the counters, increment its count.
• Otherwise, if there is a free counter, add the element with count = 1.
• Otherwise (all counters are full), decrease every count by 1 and free any counter that reaches zero. The arriving element itself is not stored.
3. Extract frequent elements:
• The stored elements are the only candidates that can appear more than n/k times; each stored count underestimates the true count by at most n/k.
• If exact counts are required and the stream can be replayed, re-check the true counts of the stored elements in a second pass.
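
A minimal Python sketch of Misra-Gries follows. It uses the standard formulation, in which an arriving element that triggers a decrement is dropped rather than stored; the function names `misra_gries` and `verify` are assumptions made for illustration, and `verify` requires a stream that can be read a second time.

```python
def misra_gries(stream, k):
    """One pass with at most k-1 counters; returns candidate heavy hitters with lower-bound counts."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1                  # already tracked: increment
        elif len(counters) < k - 1:
            counters[item] = 1                   # free counter available: start tracking
        else:
            for key in list(counters):           # all counters full: decrement every count
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]            # free counters that reach zero
    return counters


def verify(stream, candidates, k):
    """Optional second pass: keep only candidates that truly occur more than n/k times."""
    exact = dict.fromkeys(candidates, 0)
    n = 0
    for item in stream:
        n += 1
        if item in exact:
            exact[item] += 1
    return {e: c for e, c in exact.items() if c > n / k}


stream = list("abacabad")
candidates = misra_gries(stream, k=3)
print(candidates)                        # {'a': 2}
print(verify(stream, candidates, k=3))   # {'a': 4}
```

Here "a" occurs 4 times in 8 elements, which is more than n/k = 8/3, so it survives both passes; the one-pass count of 2 is a lower bound on the true count.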
EXAMPLE PROBLEMS

Frequent Item Detection in a Twitter Stream

Problem: Social media platforms like Twitter generate massive amounts of real-time data,
including hashtags and trending topics. Detecting the top 10 most frequently used hashtags
in a continuous tweet stream is essential for marketing analytics and trend analysis.

Challenges: The dataset is unbounded, making it impossible to store all tweets for analysis.
The system must efficiently track hashtags while using limited memory.

Solution: The Misra-Gries Algorithm or Lossy Counting Algorithm can be used to maintain
approximate frequency counts of hashtags in real-time. These algorithms ensure that the
most popular hashtags are identified while ignoring less frequent ones.
EXAMPLE PROBLEMS
Real-Time Fraud Detection in Credit Card Transactions

Problem: Credit card companies need to detect fraudulent transactions in real-time to prevent
financial losses. The challenge is to analyze a continuous stream of transactions and flag
unusual activity based on past patterns.

Challenges: Fraudulent activities often change dynamically, making traditional batch
processing ineffective. A real-time detection system should adapt to evolving fraud patterns
and minimize false positives.

Solution: The VFDT (Very Fast Decision Tree) algorithm can be used to classify transactions
based on attributes like transaction amount, location, and time. The model updates
dynamically, allowing it to detect new fraud trends efficiently.
EXAMPLE PROBLEMS
Stock Market Trend Analysis

Problem: Investors and financial analysts need to predict stock price movements based on real-
time market data streams. This involves analyzing high-speed data like stock prices, trade
volumes, and investor sentiment.

Challenges: Market trends fluctuate rapidly, requiring real-time processing rather than historical
batch analysis. Memory-efficient algorithms must be used to track price changes dynamically.

Solution: The Sliding Window Model helps analyze only the most recent stock price data,
ensuring that outdated trends do not influence predictions. It enables timely decision-making for
investors.
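
One simple instance of the sliding window model is a moving average over the last few prices. The sketch below is an assumption-laden illustration: the class name `SlidingWindowAverage` and the count-based window (the last `size` prices rather than a time interval) are choices made here, not details from the slides.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the most recent `size` prices, updated in O(1) per new price."""

    def __init__(self, size):
        self.window = deque(maxlen=size)   # the oldest price is evicted automatically
        self.total = 0.0

    def add(self, price):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # subtract the price about to be evicted
        self.window.append(price)
        self.total += price
        return self.total / len(self.window)


avg = SlidingWindowAverage(size=3)
print([avg.add(p) for p in [10, 20, 30, 40]])   # [10.0, 15.0, 20.0, 30.0]
```

Once the fourth price arrives, the first one no longer influences the result, which is the behaviour the window is meant to provide.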
EXAMPLE PROBLEMS

Customer Segmentation in E-commerce

Problem: Online shopping platforms need to segment customers based on their real-time
browsing and purchasing behavior. This helps businesses personalize recommendations and
improve customer engagement.

Challenges: Customer behaviors change continuously, requiring a system that can
dynamically update customer groups. Traditional clustering techniques struggle with
evolving data streams.

Solution: The CluStream Algorithm clusters users in real-time based on their activity. Micro-
clusters track short-term behavior, while macro-clusters help identify long-term trends.
EXAMPLE PROBLEMS

Video Surveillance & Motion Detection

Problem: Security cameras generate continuous video streams, requiring efficient real-time
motion detection to identify suspicious activities.

Challenges: Storing and analyzing every frame is infeasible due to high data volume. The
system must focus on recent frames while ignoring older, irrelevant ones.

Solution: Sliding Window-based Anomaly Detection processes only the latest frames, reducing
memory usage while ensuring quick detection of unusual movement patterns.
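
A possible sketch of window-based anomaly detection is shown here. It assumes each frame has already been reduced to a single motion score (for example, the mean absolute pixel difference from the previous frame); that preprocessing, the function name `detect_anomalies`, and the z-score rule with its default threshold are all assumptions made for illustration.

```python
import statistics
from collections import deque


def detect_anomalies(scores, window_size=5, threshold=3.0):
    """Return indices whose score deviates strongly from the preceding window of scores."""
    window = deque(maxlen=window_size)    # only the most recent scores are kept
    anomalies = []
    for index, score in enumerate(scores):
        if len(window) == window_size:    # wait until the window is full
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(score - mean) / std > threshold:
                anomalies.append(index)
        window.append(score)
    return anomalies


motion_scores = [1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 1.0, 1.1]
print(detect_anomalies(motion_scores))    # [5]
```

Memory use is bounded by `window_size`, regardless of how long the camera has been running.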
CONCLUSION

● Real-time Necessity: Enables immediate analysis & decision-making in finance,
healthcare, cybersecurity, IoT.
● Compared to Batch Processing: Handles high-velocity, continuous data efficiently.
● Key Algorithms: Sliding Window Models, VFDT, CluStream, Lossy Counting
optimize memory usage.
● Scalability & Frameworks: Apache Flink, Kafka Streams, Spark Streaming ensure
large-scale distributed processing.
● Future Trends: AI, deep learning, edge computing to enhance real-time analytics &
decision-making.
● Business Impact: Improves operational efficiency, predictive modeling, & competitive
advantage.
THANK YOU
Presenters:
Mahima. A (AP24122060006)
Lokesh. K (AP24122060020)
Vamsi. K (AP24122060022)
Namratha. N (AP24122040015)
Teena Shaik (AP24122040002)
Avishek Thakur (AP24122040017)
Gowthami. G (AP24122040018)
