Unit IV
Stream Analytics
Stream analytics is a specialized field within data analytics that focuses on real-time processing
and analysis of continuous data streams. It enables organizations to derive insights and make
decisions as the data arrives. This approach contrasts with traditional batch processing, which
deals with static datasets and processes them at intervals.
Definition
Stream analytics refers to the real-time processing of continuously generated data. The primary
goal is to analyze data in motion to extract actionable insights quickly. It involves ingesting,
filtering, aggregating, and analyzing data streams to generate real-time outputs.
Key Components of Stream Analytics:
1. Data Sources: The origin of the data streams, such as IoT devices, social media feeds, transaction logs, or clickstreams.
2. Data Ingestion: The process of collecting real-time data streams from the sources using
technologies like Apache Kafka, Apache Flume, or AWS Kinesis.
3. Stream Processing: The core analysis stage where the incoming data is filtered,
aggregated, and transformed using frameworks like Apache Flink, Spark Streaming, or
Google Dataflow.
4. Data Storage: Optional storage of processed or raw streams for historical analysis or
compliance purposes using time-series databases like InfluxDB or cloud-based solutions
like Amazon S3.
5. Visualization and Reporting: Generating dashboards, alerts, or real-time reports for
end-users using tools like Tableau or Power BI.
Stream Analytics Workflow:
1. Data Collection: Ingesting data streams from multiple sources like sensors, logs, or user interactions.
2. Data Preprocessing: Cleaning and normalizing the incoming data to ensure consistency.
3. Data Filtering: Selecting relevant data points based on defined criteria.
4. Data Aggregation: Summarizing data, such as calculating averages or counts.
5. Complex Event Processing (CEP): Identifying patterns or anomalies in the data
streams.
6. Output Generation: Producing actionable insights, such as alerts, visualizations, or recommendations (a toy pipeline sketch follows this list).
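As a toy, framework-free illustration of this workflow, the Python sketch below treats the stream as a generator and chains the preprocessing, filtering, and aggregation steps over simulated sensor readings; all values and names are hypothetical.

# Toy pipeline: collect -> preprocess -> filter -> aggregate.
def sensor_stream():
    # Simulated data source (readings in Celsius; None represents a bad record).
    yield from [21.5, None, 22.1, 98.4, 21.9, 22.3, None, 97.8]

def preprocess(stream):
    # Cleaning: drop malformed (None) records.
    return (r for r in stream if r is not None)

def filter_anomalies(stream, threshold=90.0):
    # Filtering: keep only readings above an alert threshold.
    return (r for r in stream if r > threshold)

# Aggregation: count anomalous readings as they arrive.
alerts = sum(1 for _ in filter_anomalies(preprocess(sensor_stream())))
print(f"{alerts} anomalous readings detected")  # prints: 2 anomalous readings detected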
Applications of Stream Analytics:
1. Fraud Detection: Financial institutions use stream analytics to monitor transactions in real time. For example, an unusually high-value transaction from a new location triggers an alert to prevent fraud (a rule-check sketch follows this list).
2. IoT Monitoring: Stream analytics processes sensor data from IoT devices, such as temperature, pressure, or energy usage, to detect anomalies or optimize performance.
3. Stock Market Analysis: Trading platforms analyze live market data to identify trends, predict stock movements, and generate trading signals.
4. E-commerce Personalization: Online retailers analyze user activity in real time to recommend products or adjust pricing dynamically.
5. Healthcare Monitoring: Wearable devices stream health metrics like heart rate or oxygen levels, enabling real-time monitoring and alerts for potential health issues.
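As a concrete illustration of the fraud-detection rule in item 1, the Python sketch below flags a transaction when its amount is unusually high and it comes from a location not previously seen for that account. The threshold, account ID, and field names are all hypothetical assumptions, not any real system's API.

# Hypothetical real-time fraud rule: high value + previously unseen location.
known_locations = {"acct42": {"Mumbai", "Pune"}}  # assumed per-account history
HIGH_VALUE = 100_000  # assumed alert threshold

def check_transaction(account, amount, location):
    seen = known_locations.setdefault(account, set())
    if amount > HIGH_VALUE and location not in seen:
        print(f"ALERT: {account} spent {amount} from new location {location}")
    seen.add(location)  # update history as the stream flows

check_transaction("acct42", 250_000, "Singapore")  # triggers an alert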
Challenges in Stream Analytics:
1. High Velocity and Volume: Managing the constant influx of data requires robust systems.
2. Data Quality: Ensuring accuracy and consistency in real-time data is challenging.
3. Latency: Maintaining low latency while processing large streams is critical.
4. Complex Event Processing: Detecting meaningful patterns from noisy data streams.
5. Integration: Combining multiple data sources seamlessly for analysis.
Stream analytics is an integral part of data analytics in today’s fast-paced world. Its ability to
provide real-time insights across industries makes it indispensable for organizations looking to
stay competitive. With the right tools and frameworks, stream analytics enables proactive
decision-making and efficient operations.
Stream Data Model
The stream data model defines how streaming data is structured, represented, and processed in a
real-time environment. Unlike traditional data models used for batch processing, the stream data
model is designed to handle continuous, unbounded, and time-sensitive data streams efficiently.
Operations on Data Streams:
1. Filtering: Extracting relevant events based on conditions. Example: Filter tweets containing the hashtag #StreamingAnalytics.
2. Aggregation: Summarizing the data, such as counting, averaging, or finding the maximum value. Example: Count the number of cars passing a toll booth every minute (a windowed-count sketch follows this list).
3. Join Operations: Combining multiple streams based on a common key or condition. Example: Join a stream of user clicks with a stream of product prices to show real-time price comparisons.
4. Pattern Detection: Identifying sequences or anomalies within the stream. Example: Detecting a sequence of failed login attempts to identify a potential security threat.
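The aggregation example in item 2 (cars per minute) corresponds to a tumbling one-minute window. The sketch below groups simulated timestamped events into such windows; the event data is made up for illustration.

# Tumbling-window aggregation: count cars per one-minute window.
from collections import Counter

# Simulated (timestamp_in_seconds, car_id) events from a toll booth.
events = [(5, "car1"), (42, "car2"), (61, "car3"), (95, "car4"), (130, "car5")]

# Window index = which minute the event falls into.
counts = Counter(ts // 60 for ts, _ in events)
for window, n in sorted(counts.items()):
    print(f"Minute {window}: {n} car(s)")
# prints: Minute 0: 2 car(s) / Minute 1: 2 car(s) / Minute 2: 1 car(s)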
Example: A ride-sharing app processes real-time data from drivers and passengers, with each driver location update and ride request represented as a timestamped event in the stream.
Advantages:
1. Low Latency: Immediate processing enables real-time decision-making.
2. Scalability: Efficiently handles large volumes of high-velocity data streams.
3. Flexibility: Supports various operations like filtering, joining, and aggregating.
4. Time-Based Analysis: Allows temporal queries using windows and timestamps.
Stream data modeling is an essential aspect of streaming analytics, providing the foundation for
real-time data processing. Its ability to handle continuous and dynamic data makes it invaluable
for modern applications across industries.
Link Analysis
Link analysis is a technique used to evaluate the structure of a network (such as the web, social
networks, etc.). The idea is to use the connections between entities (nodes) to understand their
importance or relevance. In the context of the web, this means evaluating the importance of web
pages based on the links between them.
Key Concepts:
Nodes and Edges: A link structure can be represented as a directed graph, where each
page is a node, and each hyperlink between two pages is an edge.
Importance: Pages that are linked to by many other pages are considered important. The
PageRank algorithm quantifies this importance.
PageRank Algorithm:
PageRank assigns a score to each page based on the number and quality of links pointing to it. It
operates under the assumption that a page is more important if it is linked to by other important
pages.
PageRank Formula: Given a page A, the PageRank of A is calculated as:
PR(A) = (1 - d) + d * Σ_{B ∈ In(A)} PR(B) / L(B)
Where:
PR(A) is the PageRank of page A,
d is the damping factor (typically 0.85),
In(A) is the set of pages that link to A, and
L(B) is the number of outbound links on page B.
Explanation:
The first term, (1 - d), represents a small probability that a random web surfer will jump to any page at random.
The second term represents the probability that a random surfer reaches page A by following a link from another page B. The importance of page B is proportional to its own PageRank, and the term PR(B) / L(B) accounts for the fact that page B passes its rank evenly across all of its outbound links.
Example:
Consider a network of three pages, A, B, and C, connected by hyperlinks. Starting with an initial PageRank of 1 for each page, the algorithm iteratively updates the ranks of the pages. After several iterations, the PageRank values converge to their final values, which reflect the relative importance of the pages in the network. If both A and C link to Page B, then B ends up with the highest rank, since it receives rank contributions from both of them. A runnable sketch of this iteration follows.
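The exact link structure of the three-page example is not specified above, so the Python sketch below assumes one (A → B, B → C, C → B) and runs the iterative update with damping factor d = 0.85. Under this assumed structure, B converges to the highest rank because both A and C link to it.

# Minimal iterative PageRank sketch on an assumed 3-page graph.
def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links out to.
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial PageRank of 1 per page
    for _ in range(iterations):
        # Synchronous update: each page receives rank from its in-links.
        pr = {
            p: (1 - d) + d * sum(pr[b] / len(links[b])
                                 for b in pages if p in links[b])
            for p in pages
        }
    return pr

links = {"A": ["B"], "B": ["C"], "C": ["B"]}  # assumed link structure
print(pagerank(links))  # B ends up with the highest rank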
Market Basket Model
Overview:
The Market Basket Model is a technique used in Association Rule Mining to discover patterns
in large transaction datasets. The goal is to identify associations between items that are
frequently bought together. This is essential for tasks like recommendation systems, product
bundling, and customer behavior analysis.
Key Concepts:
Itemset: A collection of one or more items that appear together in a transaction.
Support: The fraction of transactions that contain a given itemset.
Confidence: For a rule X → Y, the fraction of transactions containing X that also contain Y.
Frequent Itemset: An itemset whose support meets a user-defined minimum support threshold.
Apriori Algorithm:
The Apriori algorithm is one of the most widely used algorithms for mining frequent itemsets.
It uses a bottom-up approach, where frequent itemsets of size k are used to generate candidate itemsets of size k + 1.
Example:
Suppose that, in a small transaction dataset, all 2-itemsets (such as {Bread, Butter}) meet the minimum support threshold. We can then generate association rules from them, such as Bread → Butter.
FP-Growth Algorithm:
The FP-Growth algorithm is more efficient than Apriori because it doesn’t generate candidate
itemsets. It uses a compact structure called an FP-Tree to store the data.
FP-Growth Steps:
1. Construct FP-Tree: The dataset is scanned to construct a tree structure where frequent
itemsets are represented.
2. Mine Frequent Itemsets: The FP-Tree is recursively mined to discover frequent
itemsets.
Example: If we apply FP-Growth to the same dataset, it will create a compact FP-Tree that
efficiently finds frequent itemsets.
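To make step 1 concrete, here is a minimal, simplified FP-Tree construction sketch in Python. It performs the two scans described above (count, then insert in frequency order) but omits the header-link table that real implementations maintain for the mining phase; the sample transactions are assumptions.

# Simplified FP-Tree construction (mining phase omitted).
from collections import Counter

class Node:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support_count):
    # Scan 1: count item frequencies and keep only frequent items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    root = Node(None)
    # Scan 2: insert each transaction with items in descending-frequency order.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # shared prefixes compress the dataset
    return root

transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"}, {"Bread", "Milk"}]
tree = build_fp_tree(transactions, min_support_count=2)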
Limited Pass Algorithms
In frequent itemset mining, limited pass algorithms aim to reduce the number of database passes. Instead of generating all candidate itemsets (which can be computationally expensive), these algorithms try to minimize the number of passes while still finding all frequent itemsets.
Apriori:
The Apriori algorithm requires multiple passes over the data, but it optimizes the process by pruning the search space. It starts by finding frequent 1-itemsets, then uses those to generate candidate 2-itemsets, and so on.
The key optimization here is pruning, which eliminates candidate itemsets that cannot meet the minimum support threshold.
FP-Growth:
Unlike Apriori, FP-Growth does not generate candidate itemsets. Instead, it builds a frequent pattern tree (FP-Tree), which is a compressed representation of the dataset.
FP-Growth only requires two passes over the data:
1. The first pass is used to create the FP-Tree.
2. The second pass mines the FP-Tree for frequent itemsets.
Advantages of FP-Growth:
Efficiency: FP-Growth is more efficient because it avoids the overhead of generating and testing candidate itemsets.
Fewer Passes: FP-Growth requires only two passes over the data, making it faster for large datasets.
Overview:
In frequent itemset mining, the goal is to identify sets of items that occur together in a large
dataset. However, generating all possible itemsets can be computationally expensive. Limited
pass algorithms reduce the number of database passes, making them more efficient for large
datasets.
Apriori Algorithm:
The Apriori algorithm works by generating candidate itemsets in each iteration and pruning
those that do not meet the minimum support threshold.
Example: Consider the following five transactions:
T1: {A, B, C}
T2: {A, B}
T3: {A, C}
T4: {B, C}
T5: {A, B, C}
First Pass: Find all individual items that meet the support threshold (say 60% or 3/5
transactions).
Second Pass: Generate pairs of frequent itemsets (e.g., {A, B}, {A, C}, {B, C}) and check their
support.
Third Pass: Generate candidate 3-itemsets (e.g., {A, B, C}) and check their support. A short sketch of these passes follows.
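The three passes can be checked with a short Python sketch over these five transactions. Note that the candidate-generation step here is simplified relative to the classic Apriori join-and-prune procedure, but it reproduces the passes above.

# Level-wise Apriori passes over the example transactions T1-T5.
from itertools import combinations

transactions = [
    {"A", "B", "C"},  # T1
    {"A", "B"},       # T2
    {"A", "C"},       # T3
    {"B", "C"},       # T4
    {"A", "B", "C"},  # T5
]
min_support = 0.6  # 60%, i.e. 3 of 5 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in sorted(items) if support({i}) >= min_support]
k = 2
while frequent:
    print(f"Frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Simplified candidate generation: all k-subsets of surviving items.
    survivors = {i for s in frequent for i in s}
    frequent = [frozenset(c) for c in combinations(sorted(survivors), k)
                if support(frozenset(c)) >= min_support]
    k += 1
# {A, B, C} has support 2/5 = 40%, so no frequent 3-itemset is reported.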
FP-Growth Algorithm:
FP-Growth is more efficient than Apriori because it doesn’t generate candidate itemsets. Instead,
it builds a frequent pattern tree (FP-Tree) that compresses the dataset, and then mines the tree
for frequent patterns.
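In practice, FP-Growth is usually run through a library rather than implemented by hand. The sketch below assumes the third-party mlxtend package is installed (pip install mlxtend) and applies its FP-Growth implementation to the same five transactions.

# FP-Growth via mlxtend (assumes: pip install mlxtend pandas).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets at the same 60% support threshold as above.
print(fpgrowth(df, min_support=0.6, use_colnames=True))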