Unit IV
Stream Analytics
Stream analytics is a specialized field within data analytics that focuses on real-time processing
and analysis of continuous data streams. It enables organizations to derive insights and make
decisions as the data arrives. This approach contrasts with traditional batch processing, which
deals with static datasets and processes them at intervals.
Definition
Stream analytics refers to the real-time processing of continuously generated data. The primary
goal is to analyze data in motion to extract actionable insights quickly. It involves ingesting,
filtering, aggregating, and analyzing data streams to generate real-time outputs.
Key Components of Stream Analytics:
1. Data Sources: The origin of the data streams, such as IoT devices, social media feeds, transaction logs, or clickstreams.
2. Data Ingestion: The process of collecting real-time data streams from the sources using
technologies like Apache Kafka, Apache Flume, or AWS Kinesis.
3. Stream Processing: The core analysis stage where the incoming data is filtered,
aggregated, and transformed using frameworks like Apache Flink, Spark Streaming, or
Google Dataflow.
4. Data Storage: Optional storage of processed or raw streams for historical analysis or
compliance purposes using time-series databases like InfluxDB or cloud-based solutions
like Amazon S3.
5. Visualization and Reporting: Generating dashboards, alerts, or real-time reports for
end-users using tools like Tableau or Power BI.
Stream Analytics Workflow:
1. Data Collection: Ingesting data streams from multiple sources like sensors, logs, or user interactions.
2. Data Preprocessing: Cleaning and normalizing the incoming data to ensure consistency.
3. Data Filtering: Selecting relevant data points based on defined criteria.
4. Data Aggregation: Summarizing data, such as calculating averages or counts.
5. Complex Event Processing (CEP): Identifying patterns or anomalies in the data
streams.
6. Output Generation: Producing actionable insights, such as alerts, visualizations, or recommendations (a toy pipeline sketch follows this list).
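As a toy, framework-free illustration of this workflow, the Python sketch below treats the stream as a generator and chains the preprocessing, filtering, and aggregation steps over simulated sensor readings; all values and names are hypothetical.

# Toy pipeline: collect -> preprocess -> filter -> aggregate.
def sensor_stream():
    # Simulated data source (readings in Celsius; None represents a bad record).
    yield from [21.5, None, 22.1, 98.4, 21.9, 22.3, None, 97.8]

def preprocess(stream):
    # Cleaning: drop malformed (None) records.
    return (r for r in stream if r is not None)

def filter_anomalies(stream, threshold=90.0):
    # Filtering: keep only readings above an alert threshold.
    return (r for r in stream if r > threshold)

# Aggregation: count anomalous readings as they arrive.
alerts = sum(1 for _ in filter_anomalies(preprocess(sensor_stream())))
print(f"{alerts} anomalous readings detected")  # prints: 2 anomalous readings detected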
Applications of Stream Analytics:
1. Fraud Detection: Financial institutions use stream analytics to monitor transactions in real time. For example, an unusually high-value transaction from a new location triggers an alert to prevent fraud (a rule-check sketch follows this list).
2. IoT Monitoring: Stream analytics processes sensor data from IoT devices, such as temperature, pressure, or energy usage, to detect anomalies or optimize performance.
3. Stock Market Analysis: Trading platforms analyze live market data to identify trends, predict stock movements, and generate trading signals.
4. E-commerce Personalization: Online retailers analyze user activity in real time to recommend products or adjust pricing dynamically.
5. Healthcare Monitoring: Wearable devices stream health metrics like heart rate or oxygen levels, enabling real-time monitoring and alerts for potential health issues.
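As a concrete illustration of the fraud-detection rule in item 1, the Python sketch below flags a transaction when its amount is unusually high and it comes from a location not previously seen for that account. The threshold, account ID, and field names are all hypothetical assumptions, not any real system's API.

# Hypothetical real-time fraud rule: high value + previously unseen location.
known_locations = {"acct42": {"Mumbai", "Pune"}}  # assumed per-account history
HIGH_VALUE = 100_000  # assumed alert threshold

def check_transaction(account, amount, location):
    seen = known_locations.setdefault(account, set())
    if amount > HIGH_VALUE and location not in seen:
        print(f"ALERT: {account} spent {amount} from new location {location}")
    seen.add(location)  # update history as the stream flows

check_transaction("acct42", 250_000, "Singapore")  # triggers an alert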
Challenges in Stream Analytics:
1. High Velocity and Volume: Managing the constant influx of data requires robust systems.
2. Data Quality: Ensuring accuracy and consistency in real-time data is challenging.
3. Latency: Maintaining low latency while processing large streams is critical.
4. Complex Event Processing: Detecting meaningful patterns from noisy data streams.
5. Integration: Combining multiple data sources seamlessly for analysis.
Stream analytics is an integral part of data analytics in today’s fast-paced world. Its ability to
provide real-time insights across industries makes it indispensable for organizations looking to
stay competitive. With the right tools and frameworks, stream analytics enables proactive
decision-making and efficient operations.
Stream Data Model
The stream data model defines how streaming data is structured, represented, and processed in a
real-time environment. Unlike traditional data models used for batch processing, the stream data
model is designed to handle continuous, unbounded, and time-sensitive data streams efficiently.
Operations on Data Streams:
1. Filtering: Extracting relevant events based on conditions. Example: Filter tweets containing the hashtag #StreamingAnalytics.
2. Aggregation: Summarizing the data, such as counting, averaging, or finding the maximum value. Example: Count the number of cars passing a toll booth every minute (a windowed-count sketch follows this list).
3. Join Operations: Combining multiple streams based on a common key or condition. Example: Join a stream of user clicks with a stream of product prices to show real-time price comparisons.
4. Pattern Detection: Identifying sequences or anomalies within the stream. Example: Detecting a sequence of failed login attempts to identify a potential security threat.
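The aggregation example in item 2 (cars per minute) corresponds to a tumbling one-minute window. The sketch below groups simulated timestamped events into such windows; the event data is made up for illustration.

# Tumbling-window aggregation: count cars per one-minute window.
from collections import Counter

# Simulated (timestamp_in_seconds, car_id) events from a toll booth.
events = [(5, "car1"), (42, "car2"), (61, "car3"), (95, "car4"), (130, "car5")]

# Window index = which minute the event falls into.
counts = Counter(ts // 60 for ts, _ in events)
for window, n in sorted(counts.items()):
    print(f"Minute {window}: {n} car(s)")
# prints: Minute 0: 2 car(s) / Minute 1: 2 car(s) / Minute 2: 1 car(s)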
Example: A ride-sharing app processes real-time data from drivers and passengers, with each driver location update and ride request represented as a timestamped event in the stream.
Advantages:
1. Low Latency: Immediate processing enables real-time decision-making.
2. Scalability: Efficiently handles large volumes of high-velocity data streams.
3. Flexibility: Supports various operations like filtering, joining, and aggregating.
4. Time-Based Analysis: Allows temporal queries using windows and timestamps.
Stream data modeling is an essential aspect of streaming analytics, providing the foundation for
real-time data processing. Its ability to handle continuous and dynamic data makes it invaluable
for modern applications across industries.
Link Analysis
Link analysis is a technique used to evaluate the structure of a network (such as the web, social
networks, etc.). The idea is to use the connections between entities (nodes) to understand their
importance or relevance. In the context of the web, this means evaluating the importance of web
pages based on the links between them.
Key Concepts:
Nodes and Edges: A link structure can be represented as a directed graph, where each
page is a node, and each hyperlink between two pages is an edge.
Importance: Pages that are linked to by many other pages are considered important. The
PageRank algorithm quantifies this importance.
PageRank Algorithm:
PageRank assigns a score to each page based on the number and quality of links pointing to it. It
operates under the assumption that a page is more important if it is linked to by other important
pages.
PageRank Formula: Given a page A, the PageRank of A is calculated as:
PR(A) = (1 - d) + d * Σ_{B ∈ In(A)} PR(B) / L(B)
Where:
PR(A) is the PageRank of page A,
d is the damping factor (typically 0.85),
In(A) is the set of pages that link to A, and
L(B) is the number of outbound links on page B.
Explanation:
The first term, (1 - d), represents a small probability that a random web surfer will jump to any page at random.
The second term represents the probability that a random surfer reaches page A by following a link from another page B. The importance of page B is proportional to its own PageRank, and the term PR(B) / L(B) accounts for the fact that page B passes its rank evenly across all of its outbound links.
Example:
Consider a network of three pages, A, B, and C, connected by hyperlinks. Starting with an initial PageRank of 1 for each page, the algorithm iteratively updates the ranks of the pages. After several iterations, the PageRank values converge to their final values, which reflect the relative importance of the pages in the network. If both A and C link to Page B, then B ends up with the highest rank, since it receives rank contributions from both of them. A runnable sketch of this iteration follows.
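The exact link structure of the three-page example is not specified above, so the Python sketch below assumes one (A → B, B → C, C → B) and runs the iterative update with damping factor d = 0.85. Under this assumed structure, B converges to the highest rank because both A and C link to it.

# Minimal iterative PageRank sketch on an assumed 3-page graph.
def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links out to.
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial PageRank of 1 per page
    for _ in range(iterations):
        # Synchronous update: each page receives rank from its in-links.
        pr = {
            p: (1 - d) + d * sum(pr[b] / len(links[b])
                                 for b in pages if p in links[b])
            for p in pages
        }
    return pr

links = {"A": ["B"], "B": ["C"], "C": ["B"]}  # assumed link structure
print(pagerank(links))  # B ends up with the highest rank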
Market Basket Model
Overview:
The Market Basket Model is a technique used in Association Rule Mining to discover patterns
in large transaction datasets. The goal is to identify associations between items that are
frequently bought together. This is essential for tasks like recommendation systems, product
bundling, and customer behavior analysis.
Key Concepts:
Itemset: A collection of one or more items that appear together in a transaction.
Support: The fraction of transactions that contain a given itemset.
Confidence: For a rule X → Y, the fraction of transactions containing X that also contain Y.
Frequent Itemset: An itemset whose support meets a user-defined minimum support threshold.
Apriori Algorithm:
The Apriori algorithm is one of the most widely used algorithms for mining frequent itemsets.
It uses a bottom-up approach, where frequent itemsets of size k are used to generate candidate itemsets of size k + 1.
Example:
Suppose that, in a small transaction dataset, all 2-itemsets (such as {Bread, Butter}) meet the minimum support threshold. We can then generate association rules from them, such as Bread → Butter.
FP-Growth Algorithm:
The FP-Growth algorithm is more efficient than Apriori because it doesn’t generate candidate
itemsets. It uses a compact structure called an FP-Tree to store the data.
FP-Growth Steps:
1. Construct FP-Tree: The dataset is scanned to construct a tree structure where frequent
itemsets are represented.
2. Mine Frequent Itemsets: The FP-Tree is recursively mined to discover frequent
itemsets.
Example: If we apply FP-Growth to the same dataset, it will create a compact FP-Tree that
efficiently finds frequent itemsets.
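To make step 1 concrete, here is a minimal, simplified FP-Tree construction sketch in Python. It performs the two scans described above (count, then insert in frequency order) but omits the header-link table that real implementations maintain for the mining phase; the sample transactions are assumptions.

# Simplified FP-Tree construction (mining phase omitted).
from collections import Counter

class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Scan 1: count item frequencies and keep only frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    root = Node(None)
    # Scan 2: insert each transaction with items in descending-frequency order.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # shared prefixes compress the dataset
    return root

transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"}, {"Bread", "Milk"}]
tree = build_fp_tree(transactions, min_support_count=2)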
Limited Pass Algorithms
In frequent itemset mining, limited pass algorithms aim to reduce the number of database passes. Instead of generating all candidate itemsets (which can be computationally expensive), these algorithms try to minimize the number of passes while still finding all frequent itemsets.
Apriori:
The Apriori algorithm requires multiple passes over the data, but it optimizes the process by pruning the search space. It starts by finding frequent 1-itemsets, then uses those to generate candidate 2-itemsets, and so on.
The key optimization here is pruning, which eliminates candidate itemsets that cannot meet the minimum support threshold.
FP-Growth:
Unlike Apriori, FP-Growth does not generate candidate itemsets. Instead, it builds a frequent pattern tree (FP-Tree), which is a compressed representation of the dataset.
FP-Growth only requires two passes over the data:
1. The first pass is used to create the FP-Tree.
2. The second pass mines the FP-Tree for frequent itemsets.
Advantages of FP-Growth:
Efficiency: FP-Growth is more efficient because it avoids the overhead of generating and testing candidate itemsets.
Fewer Passes: FP-Growth requires only two passes over the data, making it faster for large datasets.
Overview:
In frequent itemset mining, the goal is to identify sets of items that occur together in a large
dataset. However, generating all possible itemsets can be computationally expensive. Limited
pass algorithms reduce the number of database passes, making them more efficient for large
datasets.
Apriori Algorithm:
The Apriori algorithm works by generating candidate itemsets in each iteration and pruning
those that do not meet the minimum support threshold.
Example: Consider the following five transactions:
T1: {A, B, C}
T2: {A, B}
T3: {A, C}
T4: {B, C}
T5: {A, B, C}
First Pass: Find all individual items that meet the support threshold (say 60% or 3/5
transactions).
Second Pass: Generate pairs of frequent itemsets (e.g., {A, B}, {A, C}, {B, C}) and check their
support.
Third Pass: Generate candidate 3-itemsets (e.g., {A, B, C}) and check their support. A short sketch of these passes follows.
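The three passes can be checked with a short Python sketch over these five transactions. Note that the candidate-generation step here is simplified relative to the classic Apriori join-and-prune procedure, but it reproduces the passes above.

# Level-wise Apriori passes over the example transactions T1-T5.
from itertools import combinations

transactions = [
    {"A", "B", "C"},  # T1
    {"A", "B"},       # T2
    {"A", "C"},       # T3
    {"B", "C"},       # T4
    {"A", "B", "C"},  # T5
]
min_support = 0.6  # 60%, i.e. 3 of 5 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in sorted(items) if support({i}) >= min_support]
k = 2
while frequent:
    print(f"Frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Simplified candidate generation: all k-subsets of surviving items.
    survivors = {i for s in frequent for i in s}
    frequent = [frozenset(c) for c in combinations(sorted(survivors), k)
                if support(frozenset(c)) >= min_support]
    k += 1
# {A, B, C} has support 2/5 = 40%, so no frequent 3-itemset is reported.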
FP-Growth Algorithm:
FP-Growth is more efficient than Apriori because it doesn’t generate candidate itemsets. Instead,
it builds a frequent pattern tree (FP-Tree) that compresses the dataset, and then mines the tree
for frequent patterns.
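In practice, FP-Growth is usually run through a library rather than implemented by hand. The sketch below assumes the third-party mlxtend package is installed (pip install mlxtend) and applies its FP-Growth implementation to the same five transactions.

# FP-Growth via mlxtend (assumes: pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets at the same 60% support threshold as above.
print(fpgrowth(df, min_support=0.6, use_colnames=True))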