Unit 4 Streaming Data

Streaming data is the continuous flow of data processed in real-time, allowing for immediate insights and actions, unlike batch processing. Key concepts include real-time processing, low latency, and fault tolerance, with common use cases in analytics, IoT, and fraud detection. Challenges include managing data volume and quality, while technologies like Apache Kafka and Amazon Kinesis facilitate streaming data workflows.

Uploaded by kannan.niran
Copyright © All Rights Reserved

Streaming data refers to the continuous flow of data generated from
various sources, such as sensors, social media, logs, or IoT devices,
and processed in real time or near real time. Unlike batch processing,
where data is collected and processed in chunks, streaming data is
handled as it arrives, enabling immediate insights, analytics, and
actions.

Key Concepts in Streaming Data:

1. Real-Time Processing: Data is processed as soon as it is
generated, allowing for instant decision-making.
2. Continuous Data Flow: Data is produced and consumed in a
continuous, unbounded stream.
3. Low Latency: Streaming systems aim to minimize the delay
between data generation and processing.
4. Scalability: Streaming systems must handle high volumes of
data and scale horizontally as needed.
5. Fault Tolerance: Systems must recover from failures without
losing data or compromising accuracy.

Common Use Cases:

• Real-Time Analytics: Monitoring and analyzing data in real time
(e.g., stock market trends, website traffic).
• IoT Applications: Processing data from sensors and devices
(e.g., smart homes, industrial IoT).
• Fraud Detection: Identifying fraudulent transactions or
activities as they occur.
• Log Monitoring: Analyzing server logs for errors, performance
issues, or security threats.
• Recommendation Systems: Providing real-time personalized
recommendations (e.g., Netflix, Amazon).

Streaming Data Technologies:

1. Apache Kafka: A distributed event streaming platform for
building real-time data pipelines.
2. Apache Flink: A stream processing framework for stateful
computations over data streams.
3. Apache Storm: A real-time computation system for processing
unbounded data streams.
4. Apache Spark Streaming: An extension of Apache Spark for
processing real-time data streams.
5. Amazon Kinesis: A cloud-based service for real-time data
streaming and processing.
6. Google Cloud Pub/Sub: A managed messaging service for streaming
data between applications.
7. Azure Stream Analytics: A real-time analytics service for
streaming data.

Challenges in Streaming Data:

• Data Volume: Handling massive amounts of data in real time.
• Data Quality: Ensuring accuracy and consistency in fast-moving
data.
• Complexity: Managing state, time, and ordering in distributed
systems.
• Resource Management: Allocating and optimizing resources for
continuous processing.
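
The ordering challenge listed above can be illustrated with a minimal Python sketch. It assumes each event carries its own event time and that disorder is bounded by a hypothetical `allowed_lateness` parameter; the function name `reorder` and the `(event_time, payload)` event shape are illustrative only and are not taken from any particular framework.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (event_time, payload).
Event = Tuple[int, str]


def reorder(events: Iterable[Event], allowed_lateness: int) -> Iterator[Event]:
    """Yield events in event-time order, tolerating bounded disorder."""
    buffer: List[Event] = []          # min-heap keyed on event_time
    max_seen = float("-inf")          # largest event time observed so far
    for event in events:
        heapq.heappush(buffer, event)
        max_seen = max(max_seen, event[0])
        # Events at or below this point are assumed complete and safe to emit.
        watermark = max_seen - allowed_lateness
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    # The input ended: flush whatever is still buffered, in order.
    while buffer:
        yield heapq.heappop(buffer)


# Events arrive slightly out of order.
arrivals = [(1, "a"), (3, "c"), (2, "b"), (5, "e"), (4, "d")]
ordered = list(reorder(arrivals, allowed_lateness=2))
print(ordered)
```

The buffer trades latency for order: a larger `allowed_lateness` tolerates more disorder but holds events back longer. An event that arrives later than the allowed lateness is still emitted in this sketch, but after events with larger timestamps.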

Example Workflow:

1. Data Ingestion: Collect data from sources like sensors, APIs, or
logs.
2. Data Processing: Use a stream processing framework (e.g., Kafka
Streams, Flink) to process and transform the data.
3. Storage: Store processed data in databases or data lakes for
further analysis.
4. Visualization/Action: Display insights on dashboards or trigger
automated actions (e.g., alerts, recommendations).
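
The four steps above can be sketched end to end in plain Python. This is only an in-memory illustration under stated assumptions: `ingest` stands in for a real source with hard-coded sample readings, a list stands in for a database or data lake, and all names (`ingest`, `process`, `run_pipeline`, `threshold_c`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Reading = Dict[str, float]


def ingest() -> Iterator[Reading]:
    """Step 1: stand-in for a sensor, API, or log source (sample data)."""
    for sensor_id, temp_c in [(1, 21.5), (2, 48.0), (1, 22.0), (2, 51.2)]:
        yield {"sensor_id": sensor_id, "temp_c": temp_c}


def process(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Step 2: transform each record as it arrives (add Fahrenheit)."""
    for reading in readings:
        yield {**reading, "temp_f": reading["temp_c"] * 9 / 5 + 32}


def run_pipeline(
    source: Iterable[Reading], threshold_c: float = 50.0
) -> Tuple[List[Reading], List[str]]:
    store: List[Reading] = []   # Step 3: stand-in for a database or data lake
    alerts: List[str] = []      # Step 4: stand-in for an alerting channel
    for record in process(source):
        store.append(record)
        if record["temp_c"] > threshold_c:
            alerts.append(
                f"sensor {record['sensor_id']} overheated: {record['temp_c']} C"
            )
    return store, alerts


store, alerts = run_pipeline(ingest())
print(len(store), alerts)
```

Because `ingest` and `process` are generators, each record flows through every stage before the next one is read, which mirrors how a streaming pipeline handles data one event at a time rather than in batches.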

Streaming data is a critical component of modern data architectures,
enabling businesses to respond quickly to changing conditions and
make data-driven decisions in real time.

Difference Between Periodic and Continuous Queries

Periodic and continuous queries are two approaches to querying data,
particularly in the context of real-time or streaming data systems. They
differ in how and when data is processed and updated. Here's a
breakdown of their differences:

1. Periodic Query

• Definition: A periodic query is executed at fixed intervals (e.g.,
every minute, hour, or day) to retrieve or process data.
• How It Works:
  o The query runs repeatedly on a schedule.
  o Each execution processes a snapshot of the data available
    at that moment.
• Use Cases:
  o Batch processing systems (e.g., daily reports).
  o Systems where real-time updates are not critical.
  o Monitoring systems that don't require instant feedback.
• Advantages:
  o Simpler to implement and manage.
  o Reduces computational overhead compared to continuous
    processing.
  o Suitable for historical or aggregated data analysis.
• Disadvantages:
  o Delayed insights due to the interval-based nature.
  o May miss real-time events or changes between intervals.
• Example:
  o A daily sales report generated at midnight.
  o Checking server logs every 5 minutes for errors.
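
A periodic query can be sketched as a function that re-runs over a snapshot at every tick. The example below simulates time with integer timestamps instead of a real scheduler; `run_periodic`, `count_errors`, and the `(arrival_time, level)` log shape are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical log line shape: (arrival_time_seconds, level).
LogLine = Tuple[int, str]


def count_errors(snapshot: List[LogLine]) -> int:
    """The query itself: count ERROR lines in the given snapshot."""
    return sum(1 for _, level in snapshot if level == "ERROR")


def run_periodic(
    log: List[LogLine], interval: int, end_time: int
) -> List[Tuple[int, int]]:
    """Run the query at every tick over the data available at that moment."""
    results = []
    for tick in range(interval, end_time + 1, interval):
        snapshot = [line for line in log if line[0] <= tick]
        results.append((tick, count_errors(snapshot)))
    return results


log = [(1, "INFO"), (2, "ERROR"), (6, "ERROR"), (7, "INFO"), (12, "ERROR")]
periodic_results = run_periodic(log, interval=5, end_time=15)
print(periodic_results)  # [(5, 1), (10, 2), (15, 3)]
```

The error that arrives at second 2 is not reported until the tick at second 5, which is the interval-based delay listed under the disadvantages.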

2. Continuous Query

• Definition: A continuous query runs persistently and processes
data as soon as it arrives, providing real-time or near-real-time
results.
• How It Works:
  o The query is registered once and remains active.
  o It processes incoming data streams incrementally and
    updates results continuously.
• Use Cases:
  o Real-time analytics (e.g., stock market monitoring).
  o Fraud detection systems.
  o IoT applications (e.g., sensor data processing).
• Advantages:
  o Provides immediate insights and updates.
  o Ideal for time-sensitive applications.
  o Handles unbounded, real-time data streams effectively.
• Disadvantages:
  o More complex to implement and maintain.
  o Requires higher computational resources.
  o May involve challenges like handling out-of-order data or
    managing state.
• Example:
  o A live dashboard showing real-time website traffic.
  o Detecting fraudulent credit card transactions as they occur.
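
A continuous query can be sketched as an object that is created once, keeps state, and updates its result for every incoming event. The class name `ContinuousErrorCount`, its `on_event` method, and the `(arrival_time, level)` event shape are hypothetical; real stream processors expose their own APIs for this.

```python
from typing import List, Tuple

# Hypothetical event shape: (arrival_time_seconds, level).
LogLine = Tuple[int, str]


class ContinuousErrorCount:
    """Registered once; updates its running result as each event arrives."""

    def __init__(self) -> None:
        self.count = 0  # query state kept between events

    def on_event(self, line: LogLine) -> int:
        """Process one event incrementally and return the updated result."""
        if line[1] == "ERROR":
            self.count += 1
        return self.count


stream: List[LogLine] = [
    (1, "INFO"), (2, "ERROR"), (6, "ERROR"), (7, "INFO"), (12, "ERROR"),
]

query = ContinuousErrorCount()
updates = [(line[0], query.on_event(line)) for line in stream]
print(updates)  # [(1, 0), (2, 1), (6, 2), (7, 2), (12, 3)]
```

Each event only touches the running count, so the result is current as of the latest arrival and no snapshot is rescanned. The cost is the state that must be kept alive, and in a real system made fault tolerant, for as long as the query runs.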

Key Differences

Aspect         | Periodic Query                              | Continuous Query
---------------+---------------------------------------------+------------------------------------------------
Execution      | Runs at fixed intervals.                    | Runs persistently, processing data as it arrives.
Latency        | Higher latency (depends on interval).       | Low latency (real-time or near-real-time).
Data Scope     | Processes a snapshot of data at each run.   | Processes incremental updates in a stream.
Complexity     | Simpler to implement.                       | More complex due to real-time requirements.
Resource Usage | Lower resource usage (runs intermittently). | Higher resource usage (continuous processing).
Use Cases      | Batch processing, scheduled reports.        | Real-time monitoring, event-driven systems.
