
UNIT IV - STREAMING ANALYTICS AND LINK ANALYSIS 9

Introduction to Stream analytics – Stream data model – Sampling Data – filtering streams –
Count distinct elements in a stream, Counting ones, Estimating moments – Decaying windows –
Link Analysis – Page Rank Computation – Market Basket model – Limited pass algorithms for
Frequent Item sets

Introduction to Stream Analytics

Stream analytics is a specialized field within data analytics that focuses on real-time processing
and analysis of continuous data streams. It enables organizations to derive insights and make
decisions as the data arrives. This approach contrasts with traditional batch processing, which
deals with static datasets and processes them at intervals.

Stream Analytics in Detail

Definition

Stream analytics refers to the real-time processing of continuously generated data. The primary
goal is to analyze data in motion to extract actionable insights quickly. It involves ingesting,
filtering, aggregating, and analyzing data streams to generate real-time outputs.

Key Components of Stream Analytics

1. Data Sources: The origin of the data streams, such as IoT devices, social media feeds,
transaction logs, or clickstreams.
2. Data Ingestion: The process of collecting real-time data streams from the sources using
technologies like Apache Kafka, Apache Flume, or AWS Kinesis.
3. Stream Processing: The core analysis stage where the incoming data is filtered,
aggregated, and transformed using frameworks like Apache Flink, Spark Streaming, or
Google Dataflow.
4. Data Storage: Optional storage of processed or raw streams for historical analysis or
compliance purposes using time-series databases like InfluxDB or cloud-based solutions
like Amazon S3.
5. Visualization and Reporting: Generating dashboards, alerts, or real-time reports for
end-users using tools like Tableau or Power BI.

Features of Stream Analytics

1. Real-time Processing: Processes data as it arrives, enabling instant decision-making.


2. Continuous Data Flow: Operates on data streams without a predefined start or end.
3. Scalability: Handles large volumes of high-velocity data effectively.
4. Time-sensitive Insights: Incorporates time-based analysis, such as sliding windows or
tumbling windows.
5. Low Latency: Ensures minimal delay between data ingestion and actionable insights.
Steps in Stream Analytics

1. Data Collection: Ingesting data streams from multiple sources like sensors, logs, or user
interactions.
2. Data Preprocessing: Cleaning and normalizing the incoming data to ensure consistency.
3. Data Filtering: Selecting relevant data points based on defined criteria.
4. Data Aggregation: Summarizing data, such as calculating averages or counts.
5. Complex Event Processing (CEP): Identifying patterns or anomalies in the data
streams.
6. Output Generation: Producing actionable insights, such as alerts, visualizations, or
recommendations.
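As a rough end-to-end illustration of these steps, the sketch below chains collection, preprocessing, aggregation, and output generation over a simulated in-memory stream. The sensor readings, window size, and alert threshold are all invented for the example; a production pipeline would use one of the frameworks listed above.

import random
import statistics

def sensor_stream(n=60):
    # Data collection: simulate one temperature reading per tick.
    for tick in range(n):
        yield {"tick": tick, "temp_c": random.gauss(25, 3)}

def preprocess(stream):
    # Data preprocessing: drop malformed readings and normalise precision.
    for event in stream:
        if event["temp_c"] is not None:
            event["temp_c"] = round(event["temp_c"], 2)
            yield event

def aggregate(stream, window=10):
    # Data aggregation: average temperature per tumbling window of `window` ticks.
    buffer = []
    for event in stream:
        buffer.append(event["temp_c"])
        if len(buffer) == window:
            yield statistics.mean(buffer)
            buffer = []

for avg in aggregate(preprocess(sensor_stream())):
    # Output generation: raise an alert when the windowed average crosses a threshold.
    print(f"window average = {avg:.2f}" + (" ALERT" if avg > 28 else ""))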

Applications of Stream Analytics

1. Fraud Detection:
o Financial institutions use stream analytics to monitor transactions in real time. For
example, an unusually high-value transaction from a new location triggers an alert
to prevent fraud.
2. IoT Monitoring:
o Stream analytics processes sensor data from IoT devices, such as temperature,
pressure, or energy usage, to detect anomalies or optimize performance.
3. Stock Market Analysis:
o Trading platforms analyze live market data to identify trends, predict stock
movements, and generate trading signals.
4. E-commerce Personalization:
o Online retailers analyze user activity in real time to recommend products or adjust
pricing dynamically.
5. Healthcare Monitoring:
o Wearable devices stream health metrics like heart rate or oxygen levels, enabling
real-time monitoring and alerts for potential health issues.

Advantages of Stream Analytics

1. Faster Decision-Making: Enables real-time responses to emerging trends or anomalies.


2. Proactive Insights: Detects and addresses issues before they escalate.
3. Resource Optimization: Ensures efficient use of resources through real-time
adjustments.
4. Scalability: Processes increasing data volumes without performance degradation.
5. Improved Customer Experience: Delivers timely and personalized interactions.

Challenges in Stream Analytics

1. High Velocity and Volume: Managing the constant influx of data requires robust
systems.
2. Data Quality: Ensuring accuracy and consistency in real-time data is challenging.
3. Latency: Maintaining low latency while processing large streams is critical.
4. Complex Event Processing: Detecting meaningful patterns from noisy data streams.
5. Integration: Combining multiple data sources seamlessly for analysis.

Examples of Stream Analytics

1. Stock Monitoring System:


o A system that processes live stock prices to detect trends or sudden price drops
and triggers alerts for investors.
2. Social Media Analysis:
o Analyzing tweets or posts in real time to track trends, sentiment, or brand
mentions.
3. Traffic Management:
o Using data from sensors and cameras to analyze traffic flow and optimize signal
timings.
4. Energy Management:
o Monitoring energy usage in real time to detect wastage or adjust power supply.
5. Network Security:
o Analyzing network logs in real time to detect and prevent cyberattacks.

Stream analytics is an integral part of data analytics in today’s fast-paced world. Its ability to
provide real-time insights across industries makes it indispensable for organizations looking to
stay competitive. With the right tools and frameworks, stream analytics enables proactive
decision-making and efficient operations.

2. Stream Data Model in Streaming Analytics

The stream data model defines how streaming data is structured, represented, and processed in a
real-time environment. Unlike traditional data models used for batch processing, the stream data
model is designed to handle continuous, unbounded, and time-sensitive data streams efficiently.

Characteristics of the Stream Data Model

1. Continuous Flow of Data:


o Data is generated and processed as a continuous sequence of tuples or events.
o Example: A stream of user clicks on a website, where each event is recorded as
(timestamp, user_id, clicked_item).
2. Unbounded Nature:
o Streams have no predefined beginning or end, making them unbounded.
o Example: Sensor readings from a smart factory that are constantly updated in real-
time.
3. Temporal Dependence:
o Each data tuple often includes a timestamp, allowing for time-based queries and
analysis.
o Example: A weather monitoring system records (time, temperature,
humidity) every second.
4. Dynamic and High Velocity:
o Data arrives at high speed, requiring systems to process it with minimal latency.
o Example: Credit card transactions processed at thousands per second to detect
fraudulent activities.
5. Ephemeral Data:
o Data streams may not be stored permanently. Instead, insights are derived during
the processing stage, and only summarized results are stored.
o Example: Aggregating and storing the hourly count of customer check-ins instead
of storing each check-in.

Components of the Stream Data Model

1. Stream Elements (Tuples):


o The basic unit of data in a stream, typically represented as a tuple (key, value,
timestamp).
o Example: ("user123", "item456", "2024-01-01T10:00:00Z").
2. Stream Queries:
o Continuous queries applied on the incoming data stream for real-time analytics.
o Example: "Count the number of clicks per minute on a website."
3. Windows:
o Since streams are unbounded, windows are used to divide streams into
manageable chunks for processing.
 Sliding Window: Overlapping intervals.
 Tumbling Window: Non-overlapping intervals.
 Session Window: Based on user activity.
o Example: Calculating the average sales in 10-second intervals using a tumbling
window (see the sketch after this list).
4. Metadata:
o Includes timestamps and other information necessary for processing and ordering
events.
o Example: Event generation time, event processing time.
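To make the window types from the Windows component above concrete, the following sketch counts hypothetical click events with a tumbling window and a sliding window; the event timestamps, window size, and step are illustrative values, not from the text.

from collections import defaultdict

# Hypothetical click events: (user_id, epoch_seconds)
events = [("u1", 0), ("u2", 3), ("u1", 7), ("u3", 11), ("u2", 14), ("u1", 21)]

def tumbling_counts(events, size=10):
    # Tumbling window: each event falls into exactly one non-overlapping bucket.
    counts = defaultdict(int)
    for _, ts in events:
        counts[ts // size] += 1      # bucket 0 covers [0, 10), bucket 1 covers [10, 20), ...
    return dict(counts)

def sliding_counts(events, size=10, step=5):
    # Sliding window: windows overlap, so one event can be counted in several of them.
    last_ts = max(ts for _, ts in events)
    windows = {}
    start = 0
    while start <= last_ts:
        windows[(start, start + size)] = sum(start <= ts < start + size for _, ts in events)
        start += step
    return windows

print(tumbling_counts(events))   # {0: 3, 1: 2, 2: 1}
print(sliding_counts(events))    # e.g. window (5, 15) counts the events at t = 7, 11 and 14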

Key Operations on the Stream Data Model

1. Filtering:
o Extracting relevant events based on conditions.
o Example: Filter tweets containing the hashtag #StreamingAnalytics.
2. Aggregation:
o Summarizing the data, such as counting, averaging, or finding the maximum
value.
o Example: Count the number of cars passing a toll booth every minute.
3. Join Operations:
o Combining multiple streams based on a common key or condition.
o Example: Join a stream of user clicks with a stream of product prices to show
real-time price comparisons.
4. Pattern Detection:
o Identifying sequences or anomalies within the stream.
o Example: Detecting a sequence of failed login attempts to identify a potential
security threat.
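For the pattern-detection operation, a minimal sketch (with made-up login events and an arbitrary threshold of three consecutive failures) might look like this:

from collections import defaultdict

# Hypothetical login events: (user_id, outcome)
events = [("u1", "fail"), ("u1", "fail"), ("u2", "ok"),
          ("u1", "fail"), ("u1", "ok"), ("u2", "fail")]

def detect_bursts(events, threshold=3):
    # Flag a user the moment they reach `threshold` consecutive failed logins.
    streak = defaultdict(int)
    for user, outcome in events:
        streak[user] = streak[user] + 1 if outcome == "fail" else 0
        if streak[user] == threshold:
            yield user

print(list(detect_bursts(events)))   # ['u1']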

Stream Data Model Example

Scenario: Real-Time Ride-Sharing System

A ride-sharing app processes real-time data from drivers and passengers. The stream data model
can be represented as:

1. Driver Location Stream: (driver_id, latitude, longitude, timestamp)


2. Passenger Request Stream: (passenger_id, pickup_latitude,
pickup_longitude, timestamp)

Operations:

 Match drivers to passengers based on proximity (join operation).


 Count the number of rides requested per minute (aggregation).
 Detect areas with high demand (pattern detection).
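A minimal sketch of the join operation above, matching each passenger request to the nearest driver within a radius, could look as follows; the coordinates, the 5 km radius, and the planar distance approximation are all illustrative assumptions.

import math

# Hypothetical snapshots of the two streams described above.
drivers = [("d1", 13.05, 80.25, "2024-01-01T10:00:00Z"),
           ("d2", 13.08, 80.27, "2024-01-01T10:00:01Z")]
requests = [("p1", 13.06, 80.26, "2024-01-01T10:00:02Z"),
            ("p2", 12.90, 80.10, "2024-01-01T10:00:03Z")]

def nearest_driver(request, drivers, max_km=5.0):
    # Join operation: match a passenger request to the closest driver within max_km.
    _, plat, plon, _ = request
    best = None
    for driver_id, dlat, dlon, _ in drivers:
        # Rough planar distance (1 degree of latitude is about 111 km); fine for a sketch.
        km = math.hypot((dlat - plat) * 111,
                        (dlon - plon) * 111 * math.cos(math.radians(plat)))
        if km <= max_km and (best is None or km < best[1]):
            best = (driver_id, km)
    return best

for req in requests:
    print(req[0], "->", nearest_driver(req, drivers))   # p2 has no driver within 5 km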

Advantages of the Stream Data Model

1. Low Latency:
o Immediate processing enables real-time decision-making.
2. Scalability:
o Efficiently handles large volumes of high-velocity data streams.
3. Flexibility:
o Supports various operations like filtering, joining, and aggregating.
4. Time-Based Analysis:
o Allows temporal queries using windows and timestamps.

Stream data modeling is an essential aspect of streaming analytics, providing the foundation for
real-time data processing. Its ability to handle continuous and dynamic data makes it invaluable
for modern applications across industries.
Link Analysis

Link Analysis and PageRank Computation

Link Analysis:

Link analysis is a technique used to evaluate the structure of a network (such as the web, social
networks, etc.). The idea is to use the connections between entities (nodes) to understand their
importance or relevance. In the context of the web, this means evaluating the importance of web
pages based on the links between them.

Key Concepts:

 Nodes and Edges: A link structure can be represented as a directed graph, where each
page is a node, and each hyperlink between two pages is an edge.
 Importance: Pages that are linked to by many other pages are considered important. The
PageRank algorithm quantifies this importance.

PageRank Algorithm:

PageRank assigns a score to each page based on the number and quality of links pointing to it. It
operates under the assumption that a page is more important if it is linked to by other important
pages.

PageRank Formula: Given a page A, the PageRank of A is calculated as:

PR(A) = (1 - d) + d \sum_{B \in M(A)} \frac{PR(B)}{L(B)}

Where:

 PR(A) is the PageRank of page A,
 d is the damping factor (typically 0.85),
 M(A) is the set of pages that link to A,
 L(B) is the number of outbound links from page B.

Explanation:

 The first term (1 - d) represents a small probability that a random web surfer
will jump to any page at random.
 The second term represents the probability that a random web surfer reaches page A
by following a link from another page B. The importance of page B is proportional
to its own PageRank, and the term PR(B)/L(B) accounts for the fact that page B
passes its rank evenly across all of its outbound links.

Example:
Let’s consider a network of three pages:

 Page A links to Page B,


 Page B links to Page C,
 Page C links to Page A.

Starting with an initial PageRank of 1 for each page, the algorithm iteratively updates the ranks
of the pages. After several iterations, the PageRank values converge to their final values, which
reflect the relative importance of the pages in the network.

Page    Rank (Iteration 1)    Rank (Iteration 2)    Final Rank
A       1                     1                     1.0
B       1                     1                     1.0
C       1                     1                     1.0

Because the three pages form a symmetric cycle (each page has exactly one incoming and one
outgoing link), every update gives PR = (1 − 0.85) + 0.85 × 1 = 1, so the ranks stay equal and all
three pages end up equally important. A page pulls ahead only when it receives links from more
pages, or from pages with fewer outbound links.
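This convergence can be checked with a short script that applies the formula above to the three-page cycle; the iteration count of 20 is arbitrary, any sufficiently large number works.

# A links to B, B links to C, C links to A (the cycle described above).
links = {"A": ["B"], "B": ["C"], "C": ["A"]}
d = 0.85

# Invert the graph: which pages link *to* each page (M(A) in the formula).
incoming = {p: [q for q, outs in links.items() if p in outs] for p in links}

pr = {p: 1.0 for p in links}                # initial PageRank of 1 for every page
for _ in range(20):                         # iterate until the values stabilise
    pr = {p: (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming[p])
          for p in pr}

print(pr)   # every page stays at 1.0 in this symmetric cycle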

2. Market Basket Model (Association Rule Mining)

Overview:

The Market Basket Model is a technique used in Association Rule Mining to discover patterns
in large transaction datasets. The goal is to identify associations between items that are
frequently bought together. This is essential for tasks like recommendation systems, product
bundling, and customer behavior analysis.

Key Concepts:

 Transaction: A record that contains a list of items purchased together.


 Itemset: A set of items that appear together in a transaction.
 Frequent Itemset: An itemset that appears in a transaction database with frequency
higher than a user-specified threshold (support).
 Association Rule: A rule of the form X → Y, meaning "if itemset
X is bought, itemset Y is likely to be bought."

Apriori Algorithm:

The Apriori algorithm is one of the most widely used algorithms for mining frequent itemsets.
It uses a bottom-up approach, where frequent itemsets of size k are used to generate candidate
itemsets of size k + 1.

Apriori Algorithm Steps:


1. Generate Candidate Itemsets: Start by finding all frequent 1-itemsets (single items) and
generate candidate 2-itemsets by combining frequent 1-itemsets.
2. Prune Infrequent Itemsets: After counting the support of candidate itemsets, eliminate
those that don’t meet the minimum support threshold.
3. Iterate: Repeat the process for larger itemsets (3-itemsets, 4-itemsets, etc.) until no
further frequent itemsets can be generated.
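A compact sketch of these steps is shown below, using the transaction data from the example that follows and a minimum support of 0.6; it illustrates the candidate-generation and pruning loop rather than an optimized implementation.

from itertools import combinations

def apriori(transactions, min_support):
    # Return every frequent itemset (as a frozenset) together with its support.
    n = len(transactions)
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}       # candidate 1-itemsets
    frequent, k = {}, 1
    while current:
        # Count support of the candidate k-itemsets in one pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: s / n for c, s in counts.items() if s / n >= min_support}
        frequent.update(level)
        # Generate candidate (k+1)-itemsets and prune those with an infrequent subset.
        k += 1
        current = {a | b for a in level for b in level if len(a | b) == k
                   and all(frozenset(sub) in level for sub in combinations(a | b, k - 1))}
    return frequent

transactions = [frozenset(t) for t in [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
                                       {"Bread", "Milk"}, {"Butter", "Milk"},
                                       {"Bread", "Butter", "Milk"}]]
for itemset, support in apriori(transactions, 0.6).items():
    print(sorted(itemset), round(support, 2))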

Example:

Given the following transactions:

 T1: {Bread, Butter, Milk}


 T2: {Bread, Butter}
 T3: {Bread, Milk}
 T4: {Butter, Milk}
 T5: {Bread, Butter, Milk}

Step 1: Find the frequent 1-itemsets (support ≥ 60%).

 Support for Bread: 4/5 = 0.8
 Support for Butter: 4/5 = 0.8
 Support for Milk: 4/5 = 0.8

Step 2: Generate candidate 2-itemsets.

 Candidate 2-itemsets: {Bread, Butter}, {Bread, Milk}, {Butter, Milk}.


 Support for {Bread, Butter}: 3/5 = 0.6
 Support for {Bread, Milk}: 3/5 = 0.6
 Support for {Butter, Milk}: 3/5 = 0.6

Since all 2-itemsets meet the minimum support threshold, we can generate rules, such as Bread
→ Butter.
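These supports, and the confidence of a rule such as Bread → Butter, can be verified with a few lines; the support helper below is written just for this check.

transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
                {"Bread", "Milk"}, {"Butter", "Milk"}, {"Bread", "Butter", "Milk"}]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"Milk"}))                                   # 0.8
print(support({"Bread", "Butter"}))                        # 0.6
# Confidence of Bread -> Butter = support({Bread, Butter}) / support({Bread})
print(support({"Bread", "Butter"}) / support({"Bread"}))   # 0.75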

FP-Growth Algorithm:

The FP-Growth algorithm is more efficient than Apriori because it doesn’t generate candidate
itemsets. It uses a compact structure called an FP-Tree to store the data.

FP-Growth Steps:

1. Construct FP-Tree: The dataset is scanned to construct a tree structure where frequent
itemsets are represented.
2. Mine Frequent Itemsets: The FP-Tree is recursively mined to discover frequent
itemsets.
Example: If we apply FP-Growth to the same dataset, it will create a compact FP-Tree that
efficiently finds frequent itemsets.

3. Limited Pass Algorithms for Frequent Itemsets

In frequent itemset mining, limited pass algorithms aim to reduce the number of database passes.
Instead of generating all candidate itemsets (which can be computationally expensive), these
algorithms try to minimize the number of passes while still finding all frequent itemsets.

Apriori (Limited Pass):

 The Apriori algorithm requires multiple passes over the data, but it optimizes the process
by pruning the search space. It starts by finding frequent 1-itemsets, then uses those to
generate candidate 2-itemsets, and so on.
 The key optimization here is pruning, which eliminates itemsets that do not meet the
minimum support threshold.

FP-Growth (Limited Pass):

 Unlike Apriori, FP-Growth does not generate candidate itemsets. Instead, it builds a
frequent pattern tree (FP-Tree), which is a compressed version of the dataset.
 FP-Growth only requires two passes over the data:
1. The first pass is used to create the FP-Tree.
2. The second pass mines the FP-Tree for frequent itemsets.
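A minimal sketch of these two passes, building only the FP-Tree itself (the recursive mining step and the header-table links used by the full algorithm are omitted), might look like this:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_count}
    root = Node(None, None)
    # Pass 2: insert each transaction with its items ordered by descending frequency,
    # so that common prefixes share the same branch of the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

tree = build_fp_tree([{"Bread", "Butter", "Milk"}, {"Bread", "Butter"}, {"Bread", "Milk"},
                      {"Butter", "Milk"}, {"Bread", "Butter", "Milk"}], min_count=3)
print({item: child.count for item, child in tree.children.items()})   # {'Bread': 4, 'Butter': 1}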

Advantages of FP-Growth over Apriori:

 Efficiency: FP-Growth is more efficient because it avoids the overhead of generating and
testing candidate itemsets.
 Fewer Passes: FP-Growth only requires two passes over the data, making it faster for
large datasets.

Limited Pass Algorithms for Frequent Itemsets – Worked Examples

Overview:

In frequent itemset mining, the goal is to identify sets of items that occur together in a large
dataset. However, generating all possible itemsets can be computationally expensive. Limited
pass algorithms reduce the number of database passes, making them more efficient for large
datasets.

Apriori Algorithm:
The Apriori algorithm works by generating candidate itemsets in each iteration and pruning
those that do not meet the minimum support threshold.

Example: Let’s consider the following transactions:

 T1: {A, B, C}
 T2: {A, B}
 T3: {A, C}
 T4: {B, C}
 T5: {A, B, C}

First Pass: Find all individual items that meet the support threshold (say 60% or 3/5
transactions).

 Support for A: 4/5 = 80%


 Support for B: 4/5 = 80%
 Support for C: 4/5 = 80%

These items pass the support threshold.

Second Pass: Generate pairs of frequent itemsets (e.g., {A, B}, {A, C}, {B, C}) and check their
support.

 Support for {A, B}: 3/5 = 60%


 Support for {A, C}: 3/5 = 60%
 Support for {B, C}: 3/5 = 60%

These pairs are frequent.

Third Pass: Generate candidate 3-itemsets (e.g., {A, B, C}) and check their support.

 Support for {A, B, C}: 2/5 = 40%

Since {A, B, C} falls below the 60% threshold, it is pruned and the algorithm stops; the frequent
itemsets are the three single items and the three pairs found in the earlier passes.

FP-Growth Algorithm:

FP-Growth is more efficient than Apriori because it doesn’t generate candidate itemsets. Instead,
it builds a frequent pattern tree (FP-Tree) that compresses the dataset, and then mines the tree
for frequent patterns.

Example: Given the same transactions, FP-Growth would:

1. Create a compact FP-Tree by compressing itemsets.


2. Mine the tree for frequent patterns without generating candidate itemsets explicitly. This
reduces the computational overhead, making it faster for large datasets.
