BDA Lec10
[Course map: MapReduce · Hadoop File System · Spark · Large-scale Data Mining/ML · Streaming]
What will we learn in this lecture?
01. What is Streaming?
Broadly:
Data can be continuously analyzed and transformed in memory before it is stored to disk.
Stream Data?
● Sequence of items
● Structured (e.g., tuples)
● Ordered (implicitly or by timestamp); items must be processed in that exact order
● Arriving continuously at high volumes
● Not possible to store entirely
● Sometimes not possible to even examine all items
● The streaming engine receives data and creates micro-batches based on processing time
● Spark Streaming Programming Model
○ Discretized Stream (DStream)
■ – High-level abstraction representing a continuous stream of data
■ – Implemented as a sequence of RDDs, one per micro-batch
○ DStream API is very similar to the RDD API
■ – Functional APIs in Scala and Java for operations such as map, reduce, filter, etc.
■ – Create input DStreams from different sources (Kafka, Flume, HDFS, etc.)
■ – Apply parallel operations
● The size of each RDD depends on the length of the batch interval (the size of the time window) and the number of messages delivered during that interval (see the sketch below)
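As a sketch of the model above in Scala (the socket host, port, and 5-second batch interval are illustrative placeholders, not values from the lecture), a DStream word count could look like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamSketch")
    // 5-second batch interval: every 5 seconds one micro-batch (one RDD) is formed,
    // so each RDD's size depends on how many messages arrived during that interval.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input DStream from a socket source (Kafka, Flume, or HDFS work similarly).
    val lines = ssc.socketTextStream("localhost", 9999)

    // RDD-like functional operations: flatMap, map, reduceByKey, ...
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

    // foreachRDD exposes the underlying RDD of each micro-batch directly.
    counts.foreachRDD(rdd => println(s"micro-batch with ${rdd.count()} distinct words"))

    ssc.start()
    ssc.awaitTermination()
  }
}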
How does Spark Streaming work?
● StreamingContext (SSC) is the entry point to a Spark Streaming application
○ – 1 SSC runs in 1 JVM (several JVMs can run on a cluster for distributed processing)
○ – Configure the batch interval (e.g., 5 seconds, 10 seconds)
● After creating the SSC:
○ – Define the source of data: an input DStream (Kafka, file system, socket)
○ – Define the pipeline (computation)
○ – Start accepting and processing data from the input stream
○ – Wait for processing to terminate (manually or on error)
● Once processing has started, the pipeline becomes immutable
○ – It is not possible to pause or modify the processing pipeline
○ – To make changes, you must stop the application and restart it with the updated logic
1. Spark Streaming Example App.
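A minimal example app following the lifecycle steps above, hedged as a sketch: the watched directory path is a placeholder, and a Kafka or socket source would be defined the same way.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the StreamingContext (SSC) and configure the batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    // 2. Define the data source: an input DStream watching a directory
    //    (placeholder path) for newly arriving text files.
    val lines = ssc.textFileStream("/tmp/streaming-input")

    // 3. Define the pipeline; once started, it cannot be modified.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    // 4. Start accepting and processing data.
    ssc.start()
    // 5. Wait for termination (manual stop or error).
    ssc.awaitTermination()
  }
}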
2. Spark Structured Streaming?
● Stream processing engine built on the Spark SQL engine
● "Streaming computation expressed the same way as batch computation"
● DataFrame API available in Scala, Java, Python, and R
● Supports advanced operations
○ – Aggregations, event-time windows, stream joins, …
● Two processing models (see the sketch below)
○ – Micro-batch: data is divided into small micro-batches and processed at regular intervals
■ – Latency ~100 ms
■ – Exactly-once semantics possible (each event is processed exactly once)
○ – Continuous processing (since Spark 2.3)
■ – Each message is processed immediately after delivery
■ – Latency ~1 ms
■ – At-least-once semantics (events might be processed more than once in certain failure scenarios)
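As a sketch, the processing model is selected per query via its trigger. The rate source and the interval values below are illustrative placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerModes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("TriggerModes").getOrCreate()

    // Unbounded input: the built-in rate source emits synthetic rows continuously.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Micro-batch model: plan and run one batch every 5 seconds.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      // Continuous model (Spark 2.3+) would instead use:
      //   .trigger(Trigger.Continuous("1 second"))  // 1 s checkpoint interval
      .start()

    query.awaitTermination()
  }
}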
Streaming DataFrame
● Basic data structure of Structured Streaming
● Represents an unbounded table
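A minimal Structured Streaming sketch of this unbounded-table view (the socket host and port are placeholders):

import org.apache.spark.sql.SparkSession

object UnboundedTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("UnboundedTable").getOrCreate()
    import spark.implicits._

    // readStream yields a streaming DataFrame: an unbounded table of lines,
    // with new rows appended as data arrives on the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Expressed the same way as a batch computation: groupBy + count.
    val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    // "complete" mode re-emits the whole updated result table on each trigger.
    wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}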