Lecture #9.1 - Apache Spark - Streaming API II
APACHE SPARK
STREAMING API
Agenda
Processing time
Handling Event time
● Event time is embedded in the data itself,
but it might be referenced differently
depending on the source we’re using.
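As a rough illustration of the point above, the sketch below pulls the same event time out of two records that reference it differently. The field names `ts_ms` and `created_at` and the two record shapes are hypothetical, chosen only for this example; real sources will use their own names and formats.

```python
from datetime import datetime, timezone

def event_time_from_epoch_ms(record):
    """Hypothetical source A: event time stored as epoch milliseconds in 'ts_ms'."""
    return datetime.fromtimestamp(record["ts_ms"] / 1000, tz=timezone.utc)

def event_time_from_iso(record):
    """Hypothetical source B: event time stored as an ISO 8601 string in 'created_at'."""
    return datetime.fromisoformat(record["created_at"])

# Two made-up records describing the same instant in different ways.
record_a = {"sensor": "s1", "ts_ms": 1704110850000}
record_b = {"sensor": "s1", "created_at": "2024-01-01T12:07:30+00:00"}

time_a = event_time_from_epoch_ms(record_a)
time_b = event_time_from_iso(record_b)
print(time_a == time_b)  # True
```

Either way, the timestamp comes from the record itself, not from the clock of the machine that processes it.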
Event time
2. WINDOW OPERATIONS
Window Operations* on Event Time
● Tumbling windows:
Options in
Structured Streaming
* Naming convention in “Streaming Data - Understanding the Real-Time Pipeline”.
Tumbling time-based window
pyspark.sql.functions.window(timeColumn, windowDuration)
Event time
A window is defined by
its start and end
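To make "start and end" concrete, here is a minimal plain-Python sketch (no Spark required) of how an event time can be mapped to its tumbling window. It assumes windows are aligned to the Unix epoch, with an inclusive start and an exclusive end; the helper name `tumbling_window` is made up for this example and is not part of PySpark.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def tumbling_window(event_time, window_duration):
    """Return the (start, end) of the tumbling window that contains event_time."""
    # Count how many whole windows fit between the epoch and the event,
    # then step forward from the epoch by that many windows.
    elapsed = event_time - EPOCH
    start = EPOCH + (elapsed // window_duration) * window_duration
    # Start is inclusive, end is exclusive.
    return start, start + window_duration

event = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
start, end = tumbling_window(event, timedelta(minutes=10))
print(start.time(), end.time())  # 12:00:00 12:10:00
```

Each event falls into exactly one tumbling window, because consecutive windows do not overlap.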
Sliding window
pyspark.sql.functions.window(timeColumn, windowDuration, slideDuration)
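With a slide shorter than the window, one event belongs to several overlapping windows. The following plain-Python sketch (no Spark required) lists them, assuming window starts are aligned to multiples of the slide duration since the Unix epoch; the helper name `sliding_windows` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def sliding_windows(event_time, window_duration, slide_duration):
    """Return every (start, end) sliding window that contains event_time."""
    elapsed = event_time - EPOCH
    # Latest window start that is not after the event.
    start = EPOCH + (elapsed // slide_duration) * slide_duration
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + window_duration > event_time:
        windows.append((start, start + window_duration))
        start -= slide_duration
    return sorted(windows)

event = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
result = sliding_windows(event, timedelta(minutes=10), timedelta(minutes=5))
for window_start, window_end in result:
    print(window_start.time(), window_end.time())
# 12:00:00 12:10:00
# 12:05:00 12:15:00
```

When the slide equals the window duration, the loop yields a single window and the result matches the tumbling case.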
Processing time
Sliding window
3. LATE DATA AND WATERMARKING
Events can be late
● Due to multiple factors, events can arrive late
to the analytics tier, Spark in our case.
Watermark of 10 minutes
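The idea behind a 10-minute watermark can be simulated in plain Python (no Spark required). This is a deliberately simplified sketch: it assumes the watermark is the maximum event time seen so far minus the delay, that it only advances between batches, and that any event older than the current watermark is dropped. Spark's actual handling is tied to window state and only guarantees that data within the threshold is kept, so treat this as an illustration of the concept. The helper name `apply_watermark` and the sample batches are made up; in PySpark the threshold itself is declared with `DataFrame.withWatermark(eventTime, delayThreshold)`.

```python
from datetime import datetime, timedelta

def apply_watermark(batches, delay):
    """Split event times into (accepted, dropped) using a simplified watermark."""
    max_event_time = None
    watermark = None
    accepted, dropped = [], []
    for batch in batches:
        for event_time in batch:
            # Events older than the current watermark are treated as too late.
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)
            else:
                accepted.append(event_time)
        # After the batch, advance the watermark to (max event time seen - delay).
        if batch:
            batch_max = max(batch)
            if max_event_time is None or batch_max > max_event_time:
                max_event_time = batch_max
            watermark = max_event_time - delay
    return accepted, dropped

def at(hour, minute):
    """Build a timestamp on an arbitrary sample day."""
    return datetime(2024, 1, 1, hour, minute)

batches = [
    [at(12, 0), at(12, 14)],             # watermark afterwards: 12:04
    [at(12, 5), at(12, 3), at(12, 20)],  # 12:03 is too late; watermark afterwards: 12:10
    [at(12, 9)],                         # 12:09 is too late
]
accepted, dropped = apply_watermark(batches, timedelta(minutes=10))
print([t.strftime("%H:%M") for t in dropped])  # ['12:03', '12:09']
```

Note that 12:05 in the second batch is late relative to 12:14 but still within the 10-minute threshold, so it is accepted.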