BDA Lec10

3rd grade

Big Data Analytics


Dr. Nesma Mahmoud
Lecture 6:
Streaming in Big
Data Analytics
Big Data Analytics (In short)
Goal: Generalizations
A model or summarization of the data.

Data/Workflow Frameworks: Spark, MapReduce, Hadoop File System
Analytics and Algorithms: Large-scale Data Mining/ML, Streaming
What will we learn in this lecture?
01. What is Streaming?

02. Basics of Streaming Data Processing

03. Streaming Processing with Spark

04. Streaming with Spark in Lab (self-study & required in Exam)


01. What is Streaming?
Data Processing
● Batch processing (data-at-rest)
○ – Data is batched and processed regularly (e.g., once a day)
○ – Effective for large amounts of data
○ – High latencies – the “traditional” approach
○ – Easier than stream processing
○ – MapReduce, Spark, Hive, Impala, …

● Stream processing (data-in-motion)
○ – Data is processed continuously (in a stream)
○ – Data can be processed per message, per time-based window, or per count-based window
○ – Either realtime or near-realtime processing
○ – Spark Streaming, Apache Storm, Kafka Streams, …
What is streaming?

Broadly:

RECORD IN → Process → RECORD GONE

Data can be continuously analyzed and transformed in memory before it is stored on a disk.
Stream Data?
● Sequence of items
● Structured (e.g., tuples)
● Ordered (implicitly or by timestamp) – data is processed in arrival order
● Arriving continuously at high volumes
● Not possible to store entirely
● Sometimes not possible to even examine all items

● Not (only) what you see on YouTube
○ Streams can have structure and semantics; they are not only audio or video
Why Streaming?
Often, data …
● … cannot be stored (too big, privacy concerns)
● … are not practical to access repeatedly (reading takes too long)
● … are rapidly arriving (need rapidly updated "results")

■ Examples: Google search queries, text messages, status updates


Does Streaming mean real-time?
● Streaming and Real-Time are Related, but Not Synonymous:
○ Streaming: Refers to the continuous flow of data from a source to a
destination.
○ Real-Time: Implies processing and reacting to data as it arrives,
with minimal delay.

● Streaming Can Be Real-Time, but Not Always:


○ Real-Time Streaming: Data is processed and delivered as it's
generated, like live video broadcasts.
○ Non-Real-Time Streaming (Batching): Data is stored and processed
later, like downloading a video file.
02. Basics of Streaming
Data Processing
Streaming Processing lifecycle
● › Streaming data lifecycle
○ 1. Data is generated (upstream) by application
○ 2. Distribution and reorganization of data (by Message
Processor)
○ 3. Data processing (by Stream Processor)
○ 4. Storing results, alerting, sending messages downstream
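The four lifecycle steps above can be mimicked as a minimal in-memory pipeline (plain Python; all names are hypothetical, and the queue merely stands in for a real message processor such as Kafka):

```python
from collections import deque

# 1. Application generates data (upstream)
def generate_events():
    return [{"sensor": "s1", "temp": t} for t in (20, 22, 25)]

# 2. Message processor distributes/reorganizes data
#    (a plain queue stands in for a broker such as Kafka)
queue = deque(generate_events())

# 3. Stream processor consumes and transforms each message
def process(message):
    return {"sensor": message["sensor"], "temp_f": message["temp"] * 9 / 5 + 32}

# 4. Results are stored / sent downstream
store = []
while queue:
    store.append(process(queue.popleft()))

print(store)
```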
Streaming Processing Components
● Major components in stream processing
○ – Application (generating stream of data)
○ – Message processor
○ – Stream processor
○ – Data storage (stores processed data, state, etc.)
Stream Data
Stream can be abstracted as an endless (unbounded) sequence of messages
● Stream can be represented by
○ – File
○ – TCP connection
○ – Database table
● Streams can be partitioned
○ – Enables parallelization
● Streams can be
○ – Read
○ – Written into
○ – Joined
○ – Filtered
○ – Enriched
○ – Transformed
How can we characterize stream processing?
1. › Realtime streaming vs Micro batches
2. › Usage of time windows
3. › Stateless or Stateful
4. › Out-of-order messages
1. Micro-batches vs (Real-time) Streaming

› Realtime Streaming (True Realtime, Continuous Processing)
– Message is processed immediately after delivery
– Messages are processed one by one
– Low latencies (usually also lower throughput)
– Output should be available in tens to hundreds of milliseconds

› Micro batches (Near Realtime)
– Message is not processed immediately after delivery
– Messages are processed together in small batches
– Latency is at least the length of the batch interval (usually leads to higher throughput)
– Output is available within seconds or tens of seconds
2. Usage of time windows
● › It is possible to define a sliding window that bounds the data processed in one batch
● › Two important attributes
○ › Length of window:
■ specifies the duration of the window
■ It determines the amount of data that will be included in each window.
■ For example, a window length of 10 seconds means that each window will contain data from the past 10 seconds.
○ › Slide interval:
■ defines the frequency with which the window slides.
■ It controls how often new windows are created and processed.
■ For example, a slide interval of 5 seconds means that a new window will be created every 5 seconds, even if the previous window hasn't fully elapsed.
● › Windows can overlap
2. Usage of time windows
● Window length vs sliding interval
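The two attributes can be sketched in plain Python (hypothetical helper name, not a Spark API): with a window length of 10 seconds and a slide interval of 5 seconds, consecutive windows overlap by 5 seconds.

```python
def sliding_windows(events, length, slide, t_end):
    """Assign (timestamp, value) events to overlapping time windows.

    A new window starts every `slide` seconds and covers `length` seconds,
    so with length=10 and slide=5 consecutive windows overlap by 5 seconds.
    """
    windows = []
    start = 0
    while start < t_end:
        windows.append([v for t, v in events if start <= t < start + length])
        start += slide
    return windows

# Events as (timestamp_in_seconds, value)
events = [(1, "a"), (4, "b"), (7, "c"), (12, "d")]
print(sliding_windows(events, length=10, slide=5, t_end=15))
# → [['a', 'b', 'c'], ['c', 'd'], ['d']]
```

Note how event "c" (at second 7) appears in both the [0, 10) and the [5, 15) window: that is the overlap the slide is describing.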
2. Usage of time windows
● Event vs Processing Time
○ › Event Time
■ – When the message was generated
■ Example: A user purchases a product at 10:00 AM on October 25, 2023.
■ Use Case: Analyzing historical sales data, calculating daily, weekly, or
monthly sales totals based on the actual purchase time.
○ › Processing Time
■ – When the message was processed
■ Example: The system receives and processes the purchase event at
10:02 AM.
■ Use Case: Monitoring real-time sales, sending immediate notifications
for low stock levels, or detecting fraudulent activity based on the time
the system receives and processes the event.
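The distinction can be shown with a toy example (plain Python, hypothetical names): the same records land in different tumbling windows depending on which timestamp the window is keyed on.

```python
def window_key(ts, length):
    # Bucket a timestamp into a tumbling window of `length` seconds
    return (ts // length) * length

# Each record carries both timestamps: when it happened (event time)
# and when the system received it (processing time)
records = [
    {"value": 1, "event_time": 8,  "processing_time": 11},
    {"value": 2, "event_time": 9,  "processing_time": 12},
    {"value": 3, "event_time": 14, "processing_time": 15},
]

def counts_by(records, field, length=10):
    out = {}
    for r in records:
        k = window_key(r[field], length)
        out[k] = out.get(k, 0) + 1
    return out

print(counts_by(records, "event_time"))       # → {0: 2, 10: 1}
print(counts_by(records, "processing_time"))  # → {10: 3}
```

By event time, the first two purchases belong to the [0, 10) window; by processing time, all three arrived in the [10, 20) window.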
3. Stateless or Stateful
● Stateless or Stateful
○ – In many use cases, we need to keep stream processing state
● Stateless Streaming:
○ Independence and Scalability
○ Stateless streaming treats each piece of data as an independent entity,
devoid of any reliance on past information. This approach ensures that each
data point is processed in isolation, without the need to remember or
reference historical information
● Stateful Streaming:
○ Retains Historical Context
○ In contrast, Stateful streaming involves the system’s ability to retain
information about past data and the current state of the streaming process.
By keeping track of historical information over time, the system can make
informed decisions and perform more sophisticated operations.
3. Stateless or Stateful
● Stateless Streaming:
○ Examples: Filtering messages based on a specific condition, Mapping
messages to a new format, Calculating the average of a numeric field within a
single window.
○ Limited to the current window (single window)
● Stateful Streaming:
○ Aggregations: Calculating running sums, averages, or other statistics over a
window of time or a fixed number of events.
○ Message Enrichment: Joining incoming messages with data from external
sources, such as a database or a cache, to add context or additional
information.
○ Spans across multiple windows/events.
○ The state can be stored outside the stream processor
■ Databases (Redis, HBase, Cassandra, …)
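The contrast can be sketched in plain Python (hypothetical names, not the Spark API): the stateless filter needs no memory of earlier messages, while the running average carries state across every message it has seen.

```python
# Stateless: each message is handled in isolation (e.g., filtering)
def stateless_filter(message):
    return message["temp"] > 25  # no memory of earlier messages

# Stateful: the processor keeps state that spans messages (running average)
class RunningAverage:
    def __init__(self):
        self.total = 0.0
        self.count = 0  # state accumulated across all messages seen so far

    def update(self, message):
        self.total += message["temp"]
        self.count += 1
        return self.total / self.count

stream = [{"temp": 20}, {"temp": 30}, {"temp": 28}]

hot = [m for m in stream if stateless_filter(m)]
avg = RunningAverage()
averages = [avg.update(m) for m in stream]

print(hot)       # → [{'temp': 30}, {'temp': 28}]
print(averages)  # → [20.0, 25.0, 26.0]
```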
Stateful Streaming Processing
● › Adds another layer of complexity
○ – Size of data (does it fit in RAM?)
○ – Complicates high-availability (HA) setup (checkpointing)
■ – If the state is too large, it slows down stream processing

● › The state can be stored outside the stream processor


○ – Databases (Redis, HBase, Cassandra, …)
4. Out-of-order messages
● Message can be received with delay (issues in network, backlog of
messages)
● How to handle messages that are received out of order?
○ – The solution depends on the use-case
○ – We can ignore them
○ – We can reprocess the data
○ – Or a custom action is executed (alert, include in a separate
pipeline)
● Some tools use watermarking
○ – A threshold specifying how long the stream processor waits for
delayed messages in the data stream.
■ – If a message arrives before the configured watermark, it is processed
■ – Otherwise, it is dropped
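The watermark rule described above can be sketched in plain Python (hypothetical names; real engines such as Spark Structured Streaming apply this per event-time window): the watermark trails the maximum event time seen so far by a fixed delay, and anything older is dropped.

```python
def process_with_watermark(messages, watermark_delay):
    """Drop messages whose event time lags too far behind the max seen.

    The watermark is max_event_time - watermark_delay: messages whose
    event time falls before it are considered too late and are dropped.
    """
    max_event_time = 0
    accepted, dropped = [], []
    for msg in messages:  # messages arrive in processing order
        max_event_time = max(max_event_time, msg["event_time"])
        if msg["event_time"] >= max_event_time - watermark_delay:
            accepted.append(msg["id"])
        else:
            dropped.append(msg["id"])  # arrived after the watermark
    return accepted, dropped

msgs = [
    {"id": "a", "event_time": 10},
    {"id": "b", "event_time": 12},
    {"id": "c", "event_time": 3},   # very late: watermark is 12 - 5 = 7
    {"id": "d", "event_time": 9},   # slightly late, but still >= 7
]
print(process_with_watermark(msgs, watermark_delay=5))
# → (['a', 'b', 'd'], ['c'])
```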
Common tools
● › Kafka (Streams API)
● › Flink (near real-time)
● › NiFi (data flow management system, not true streaming)
● › Spark (Streaming API, Structured Streaming API)
03. Streaming Processing
with Spark
Intro
● Streaming Processing with Spark can be achieved through
1. Spark Streaming
2. Spark Structured Streaming
1. Spark Streaming
● Extension of the core Spark API that enables scalable, high-throughput, fault-tolerant processing of data streams
○ – Micro batches
● › Fixed batch interval
○ We want to calculate the average temperature and humidity every 10 seconds.

● › Near real-time streaming
○ – Given by batch interval configuration and overhead
○ – Low interval (~1s) leads to large overhead
○ https://spark.apache.org/docs/latest/streaming-programming-guide.html
How does Spark Streaming work?
● › Chop up data streams into batches of a few seconds
● › Spark treats each batch of data as an RDD and processes it using RDD operations
● › Processed results are pushed out in batches
● › Spark processes queued RDDs
How does Spark Streaming work?

● Streaming engine receives data and creates micro batches based on processing time
● Spark Streaming Programming Model
○ > Discretized Stream (DStream)
■ - high-level abstraction representing a continuous stream of data.
■ - Represents a stream of data
■ - Implemented as a sequence of RDDs one for each Micro-Batch.
○ > DStreams API very similar to RDD API
■ - Functional APIs in Scala, Java for tasks such as map, reduce, filter, etc.
■ - Create input DStreams from different sources (Kafka, Flume, HDFS, etc.)
■ - Apply parallel operations
● The size of RDD is dependent on the length of batch interval (or the size of time window) and the
number of messages delivered during the interval
How does Spark Streaming work?
● StreamingContext (SSC) is the entry point to a Spark Streaming
application
○ – 1 SSC runs in 1 JVM (we can run several JVMs on a cluster for
distributed processing)
○ – Configure batch interval (e.g., 5 seconds, 10 seconds)
● › After creating the SSC
○ – Define the source of data – input DStream (Kafka, file system, socket)
○ – Define the pipeline (computation)
○ – Start accepting and processing data from the input stream
○ – Wait for processing termination (manual or on error)
● › Once processing has started, the pipeline becomes immutable
○ – It is not possible to pause or modify the processing pipeline.
○ – To make changes, you must stop the application and restart it with the
updated logic.
1. Spark Streaming Example App.
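The example application shown in the lecture is not reproduced in this copy. As a stand-in, the following is the canonical network word count from the official Spark Streaming programming guide; it requires a Spark installation and a text source on localhost:9999 (e.g. `nc -lk 9999`), so treat it as a sketch rather than a self-contained script.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Entry point: a StreamingContext with a 10-second batch interval
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 10)

# Input DStream: lines of text read from a TCP socket
lines = ssc.socketTextStream("localhost", 9999)

# Pipeline: split lines into words and count each word per micro-batch
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()             # start accepting and processing data
ssc.awaitTermination()  # wait for termination (manual or on error)
```

Each 10-second batch becomes one RDD; the map/reduce pipeline runs once per batch and `pprint()` shows that batch's counts.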
2. Spark Structured Streaming ?
● › Stream processing engine built on the Spark SQL engine
● › “Streaming computation expressed the same way as batch computation”
● › DataFrame API available in Scala, Java, Python, R
● Supports advanced operations
○ – Aggregations, event-time windows, stream joins, …
● › 2 processing models
○ – Micro-batch: Data is divided into small micro-batches and processed
at regular intervals.
■ – Latency ~ 100 ms
■ – Possibility of exactly-once semantics (ensuring each event is
processed exactly once)
○ – Continuous Processing (since Spark 2.3)
■ – Each message is processed immediately after delivery
■ – Latency ~ 1 ms
■ – At-least-once semantics (meaning events might be processed
more than once in certain failure scenarios)
Streaming DataFrame
● › Basic data structure
● › Represents an unbounded table
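To make the "unbounded table" idea concrete, the canonical structured word count from the official Structured Streaming programming guide looks like this (again a sketch: it needs a Spark installation and a socket source on localhost:9999):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StructuredNetworkWordCount").getOrCreate()

# Streaming DataFrame: an unbounded table with one 'value' column per line
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Batch-style DataFrame operations applied to the unbounded table
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously emit the updated result table to the console
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```

Note how the streaming computation is written exactly like a batch DataFrame query; the engine incrementally updates the result table as new rows arrive.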
