Lecture #9.1 - Apache Spark - Streaming API II

The document discusses Apache Spark's Streaming API. It covers event and processing time; window operations such as tumbling and sliding windows; handling late data with watermarking; join operations between streaming and static/streaming DataFrames; and stream deduplication to eliminate duplicate records. Window operations group and aggregate data based on time windows defined over an event-time column. Watermarking tracks a point in time before which late data is no longer expected to arrive, allowing the engine to clean up the state kept for incremental aggregations.


MODERN DATA ARCHITECTURES FOR BIG DATA II

APACHE SPARK STREAMING API
Agenda

● Event and Processing Time
● Window Operations
● Late Data and Watermarking
● Join Operations
● Stream Deduplication
1.
EVENT AND
PROCESSING TIME
Handling Processing time

● Processing-time is related to the moment Spark is processing the data.
● current_timestamp returns the current timestamp at the start of query evaluation as a TimeStamp data type column.
Handling Processing time

(Example slide: query output annotated with the processing-time column.)
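A minimal sketch of tagging a stream with processing time; the rate source and column name are our illustration choices, not from the slides:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("processing-time-demo").getOrCreate()

# Rate source: emits rows continuously, convenient for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# current_timestamp is evaluated at the start of query evaluation,
# so the new column reflects processing time, not event time.
with_proc_time = stream.withColumn("processing_time", current_timestamp())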
Handling Event time

● Event-time is embedded in the data itself, but it might be referenced differently depending on the source we’re using.
● Event-time is a column value in the row with TimeStamp data type.
● Window-based aggregations ➡ a type of grouping and aggregation on this column:
○ Each time window is a group and each row can belong to multiple windows/groups.
Handling Event time

(Diagram: events plotted by their event-time column.)
2.
WINDOW
OPERATIONS
Window Operations* on Event Time

● Time-based windowing strategies available in Structured Streaming:
○ Tumbling windows
○ Sliding windows
○ Session windows (recently added; we won’t cover them)

* Naming convention in “Streaming Data - Understanding the Real-Time Pipeline”.
Tumbling time-based window

pyspark.sql.functions.window(timeColumn, windowDuration)

● Bucketize rows into one time window given a timestamp specifying column.
● Window starts are inclusive but the window ends are exclusive ➡ for example [12:05, 12:10).
● Durations are provided as strings: ‘week’, ‘day’, ‘hour’, ‘minute’, ‘second’, ‘millisecond’ and ‘microsecond’.
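A minimal sketch of a tumbling-window aggregation; the events DataFrame and its timestamp/word columns are hypothetical:

from pyspark.sql.functions import window, col

# Each row lands in exactly one non-overlapping 10-minute window.
windowed_counts = (
    events
    .groupBy(window(col("timestamp"), "10 minutes"), col("word"))
    .count()
)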
Tumbling time-based window

(Example slide. Annotations: key & value ➡ array of bytes, convert them into Strings before working with them; the event-time column drives the windowing; a window is defined by its start and end.)
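A sketch of what that example plausibly looks like, assuming a Kafka source; the server address and topic name are hypothetical:

from pyspark.sql.functions import window, col

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical address
    .option("subscribe", "events")                         # hypothetical topic
    .load()
)

# key & value arrive as binary; cast them to Strings before using them.
lines = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Each result row carries a window struct with its start and end.
counts = lines.groupBy(window(col("timestamp"), "10 minutes")).count()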
Sliding window

pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration)

● Bucketize rows into one or more time windows given a timestamp specifying column.

(Diagram: events by event time vs. processing time, falling into overlapping sliding windows.)
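A minimal sketch of a sliding-window count, again with a hypothetical events DataFrame: 10-minute windows sliding every 5 minutes, so one row can belong to two windows.

from pyspark.sql.functions import window, col

sliding_counts = (
    events
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
    .count()
)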
3.
LATE DATA AND
WATERMARKING
Events can be late

● Due to multiple factors, events can arrive late to the analytics tier, Spark in our case.
● The stream processing engine can maintain the intermediate state for partial aggregates.
● This allows the engine to update the aggregates of old windows correctly, although this is not done forever but only for a certain amount of time ➡ watermarks.
Events can be late

(Diagram slides: late-arriving events, then the same scenario with a watermark of 10 minutes.)

* More on this at Handling Late Data and Watermarking
Watermarking in Spark Streaming

DataFrame.withWatermark(eventTime, delayThreshold)

● Defines an event time watermark for a DataFrame.
● A watermark tracks a point in time before which we assume no more late data is going to arrive.
● eventTime is a string with the name of the column or the column itself.
● delayThreshold is a string with an interval: “1 minute”, “5 hours”, ...
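A minimal sketch combining a watermark with a windowed aggregation (column and DataFrame names are hypothetical):

from pyspark.sql.functions import window, col

late_tolerant_counts = (
    events
    .withWatermark("timestamp", "10 minutes")  # tolerate data up to 10 minutes late
    .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
    .count()
)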
Watermarking in Spark Streaming
● Conditions for watermarking to clean aggregation state:
○ Output mode* must be Append or Update, but not Complete (Complete requires all aggregate data to be preserved ➡ more resources).
○ The aggregation must involve the event-time column or a window on the event-time column.
○ withWatermark must be called on the same timestamp column as the one used in the aggregation.
○ withWatermark must be called before the aggregation.

* Append is the default value if no output mode is specified.
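A sketch of starting such a query in Update mode so the watermark can clean old state; the console sink is our choice for illustration:

query = (
    late_tolerant_counts.writeStream
    .outputMode("update")  # Append or Update is required for state cleanup
    .format("console")
    .start()
)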
4.
JOIN
OPERATIONS
Join Operations

● Spark Structured Streaming supports joining:
○ a Streaming DataFrame with a Static one
○ a Streaming DataFrame with another Streaming one

● Streaming joins ➡ incremental results, just like streaming aggregations.
Stream-Static Joins

● It supports inner joins and some types of outer joins.

(Example slide: a streaming DataFrame joined with a static one.)
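A minimal sketch of a stream-static join; the clicks stream, users_df table, file path and key column are hypothetical:

# Static side: a dimension table read once.
users_df = spark.read.parquet("/data/users")  # hypothetical path

# Streaming side joined against the static one on a key column.
enriched = clicks.join(users_df, on="user_id", how="inner")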
Stream-Stream Joins

● Main challenge:
○ During the join operation, the view of the DataFrames might be incomplete on both sides.
○ It is harder to find matches between inputs.

● Any row received from one input stream can match with any future, yet-to-be-received row from the other input stream.
Stream-Stream Joins

● Approach followed:
○ Past input is buffered for “a while”.
○ Every future input will match with past input and accordingly generate joined results.
○ Late, out-of-order data is handled automatically.
○ “A while” is handled by using watermarks.

● We’re not going to go into detail on this type of join.

Stream-Stream Joins

● Example of Inner Join with Watermarking:
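A sketch modeled on the inner-join-with-watermarking example in the Spark documentation; the impressions/clicks streams and their columns are hypothetical:

from pyspark.sql.functions import expr

# Bound how long each side's input is buffered.
impressions_w = impressions.withWatermark("impressionTime", "2 hours")
clicks_w = clicks.withWatermark("clickTime", "3 hours")

# Inner join constrained by a time range, so old buffered state can be dropped.
joined = impressions_w.join(
    clicks_w,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
    """),
)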
5.
STREAM
DEDUPLICATION
Stream Deduplication

● In computing, data deduplication* is a technique for eliminating duplicate copies of repeating data.
● If part of your end-to-end solution provides an at-least-once guarantee ➡ data duplication.
● We can turn at-least-once into exactly-once by using stream deduplication.

* Data deduplication definition from Wikipedia.
Stream Deduplication

● There has to be a unique identifier in events, determined by one or multiple columns.
● The query keeps history from previous events in order to filter duplicates.
● Deduplication can be used:
○ With watermarking - bounds the size of the history the query has to maintain.
○ Without watermarking - the query stores the data from all past events.
Stream Deduplication

pyspark.sql.DataFrame.dropDuplicates(subset=None)

● Valid for both static and streaming DataFrames.
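A minimal sketch of streaming deduplication, with a watermark bounding the history the query must keep; the event_id and timestamp columns are hypothetical:

deduped = (
    events
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicates(["event_id", "timestamp"])
)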
