Lecture #9.1 - Apache Spark - Streaming API II

The document discusses Apache Spark's Streaming API. It covers event and processing time; window operations such as tumbling and sliding windows; handling late data with watermarking; join operations between streaming and static/streaming DataFrames; and stream deduplication to eliminate duplicate records. Window operations group and aggregate data based on time windows defined over an event-time column. Watermarking tracks a point in time before which late data is no longer expected to arrive, allowing the engine to clean up the state kept for incremental aggregations.


MODERN DATA ARCHITECTURES FOR BIG DATA II

APACHE SPARK STREAMING API
Agenda

● Event and Processing Time
● Window Operations
● Late Data and Watermarking
● Join Operations
● Stream Deduplication
1.
EVENT AND
PROCESSING TIME
Handling Processing time

● Processing-time is related to the moment Spark is processing the data.
● current_timestamp returns the current timestamp at the start of query evaluation as a TimeStamp data type column.
Handling Processing time

(Example slide: query output annotated with the processing-time column.)
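A minimal sketch of tagging a stream with processing time; the rate source and column name are our illustration choices, not from the slides:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("processing-time-demo").getOrCreate()

# Rate source: emits rows continuously, convenient for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# current_timestamp is evaluated at the start of query evaluation,
# so the new column reflects processing time, not event time.
with_proc_time = stream.withColumn("processing_time", current_timestamp())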
Handling Event time

● Event-time is embedded in the data itself, but it might be referenced differently depending on the source we’re using.
● Event-time is a column value in the row with TimeStamp data type.
● Window-based aggregations ➡ a type of grouping and aggregation on this column:
○ Each time window is a group and each row can belong to multiple windows/groups.
Handling Event time

(Diagram: events plotted by their event-time column.)
2.
WINDOW
OPERATIONS
Window Operations* on Event Time

● Time-based windowing strategies available in Structured Streaming:
○ Tumbling windows
○ Sliding windows
○ Session windows (recently added; we won’t cover them)

* Naming convention in “Streaming Data - Understanding the Real-Time Pipeline”.
Tumbling time-based window

pyspark.sql.functions.window(timeColumn, windowDuration)

● Bucketize rows into one time window given a timestamp specifying column.
● Window starts are inclusive but the window ends are exclusive ➡ for example [12:05, 12:10).
● Durations are provided as strings: ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’ and ‘microsecond’.
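A minimal sketch of a tumbling-window aggregation; the events DataFrame and its timestamp/word columns are hypothetical:

from pyspark.sql.functions import window, col

# Each row lands in exactly one non-overlapping 10-minute window.
windowed_counts = (
    events
    .groupBy(window(col("timestamp"), "10 minutes"), col("word"))
    .count()
)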
Tumbling time-based window

(Example slide. Annotations: key & value ➡ array of bytes, convert them into Strings before working with them; the event-time column drives the windowing; a window is defined by its start and end.)
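A sketch of what that example plausibly looks like, assuming a Kafka source; the server address and topic name are hypothetical:

from pyspark.sql.functions import window, col

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical address
    .option("subscribe", "events")                         # hypothetical topic
    .load()
)

# key & value arrive as binary; cast them to Strings before using them.
lines = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Each result row carries a window struct with its start and end.
counts = lines.groupBy(window(col("timestamp"), "10 minutes")).count()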
Sliding window

pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration)

● Bucketize rows into one or more time windows given a timestamp specifying column.

(Diagram: events by event time vs. processing time, falling into overlapping sliding windows.)
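A minimal sketch of a sliding-window count, again with a hypothetical events DataFrame: 10-minute windows sliding every 5 minutes, so one row can belong to two windows.

from pyspark.sql.functions import window, col

sliding_counts = (
    events
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
    .count()
)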
3.
LATE DATA AND
WATERMARKING
Events can be late

● Due to multiple factors, events can arrive late to the analytics tier, Spark in our case.
● The stream processing engine can maintain the intermediate state for partial aggregates.
● This allows the engine to update the aggregates of old windows correctly, although this is not done forever but only for a certain amount of time ➡ watermarks.
Events can be late

(Diagram slides: late-arriving events, then the same scenario with a watermark of 10 minutes.)

* More on this at Handling Late Data and Watermarking
Watermarking in Spark Streaming

DataFrame.withWatermark(eventTime, delayThreshold)

● Defines an event time watermark for a DataFrame.
● A watermark tracks a point in time before which we assume no more late data is going to arrive.
● eventTime is a string with the name of the column or the column itself.
● delayThreshold is a string with an interval: “1 minute”, “5 hours”, ...
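A minimal sketch combining a watermark with a windowed aggregation (column and DataFrame names are hypothetical):

from pyspark.sql.functions import window, col

late_tolerant_counts = (
    events
    .withWatermark("timestamp", "10 minutes")  # tolerate data up to 10 minutes late
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
    .count()
)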
Watermarking in Spark Streaming
● Conditions for watermarking to clean aggregation state:
○ Output mode* must be Append or Update, but not Complete (Complete requires all aggregate data to be preserved ➡ more resources).
○ The aggregation must involve the event-time column or a window on the event-time column.
○ withWatermark must be called on the same timestamp column as the one used in the aggregation.
○ withWatermark must be called before the aggregation.

* Append is the default value if no output mode is specified.
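A sketch of starting such a query in Update mode so the watermark can clean old state; the console sink is our choice for illustration:

query = (
    late_tolerant_counts.writeStream
    .outputMode("update")  # Append or Update is required for state cleanup
    .format("console")
    .start()
)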
4.
JOIN
OPERATIONS
Join Operations

● Spark Structured Streaming supports joining:
○ a Streaming DataFrame with a Static one
○ a Streaming DataFrame with another Streaming one

● Streaming joins ➡ incremental results, just like streaming aggregations.
Stream-Static Joins

● It supports inner joins and some types of outer joins.

(Example slide: a streaming DataFrame joined with a static one.)
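A minimal sketch of a stream-static join; the clicks stream, users_df table, file path and key column are hypothetical:

# Static side: a dimension table read once.
users_df = spark.read.parquet("/data/users")  # hypothetical path

# Streaming side joined against the static one on a key column.
enriched = clicks.join(users_df, on="user_id", how="inner")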
Stream-Stream Joins

● Main challenge:
○ During the join operation, the view of the DataFrames might be incomplete on both sides.
○ It is harder to find matches between inputs.

● Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream.
Stream-Stream Joins

● Approach followed:
○ Past input is buffered for “a while”.
○ Every future input will match with past input and accordingly generate joined results.
○ Late, out-of-order data is handled automatically.
○ “A while” is handled by using watermarks.

● We’re not going to go into detail on this type of join.

Stream-Stream Joins

● Example of Inner Join with Watermarking:
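A sketch modeled on the inner-join-with-watermarking example in the Spark documentation; the impressions/clicks streams and their columns are hypothetical:

from pyspark.sql.functions import expr

# Bound how long each side's input is buffered.
impressions_w = impressions.withWatermark("impressionTime", "2 hours")
clicks_w = clicks.withWatermark("clickTime", "3 hours")

# Inner join constrained by a time range, so old buffered state can be dropped.
joined = impressions_w.join(
    clicks_w,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """),
)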
5.
STREAM
DEDUPLICATION
Stream Deduplication

● In computing, data deduplication* is a technique for eliminating duplicate copies of repeating data.
● If part of your end-to-end solution provides an at-least-once guarantee ➡ data duplication.
● We can turn at-least-once into exactly-once by using stream deduplication.

* Data deduplication definition from Wikipedia.
Stream Deduplication

● There has to be a unique identifier in events, determined by one or multiple columns.
● The query keeps history from previous events in order to filter duplicates.
● Deduplication can be used:
○ With watermarking - bounds the size of the history the query has to maintain.
○ Without watermarking - the query stores the data from all past events.
Stream Deduplication

pyspark.sql.DataFrame.dropDuplicates(subset=None)

● Valid for both static and streaming DataFrames.
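A minimal sketch of streaming deduplication, with a watermark bounding the history the query must keep; the event_id and timestamp columns are hypothetical:

deduped = (
    events
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicates(["event_id", "timestamp"])
)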
