Spark Streaming

[Diagram: receivers collect incoming data into batches; each batch becomes an RDD for a given time increment, and the resulting sequence of RDDs is a DStream. You can access the individual RDDs if you need them.]
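A minimal sketch of setting up a DStream in PySpark, assuming a hypothetical text source on a local socket (the host, port, and app name are made up for illustration):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingSketch")
ssc = StreamingContext(sc, 1)                      # one new batch (RDD) every second
lines = ssc.socketTextStream("localhost", 9999)    # DStream: a sequence of RDDs
lines.foreachRDD(lambda rdd: print(rdd.count()))   # drop down to the underlying RDDs
ssc.start()
ssc.awaitTermination()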
Common stateless transformations on DStreams
■ map
■ flatMap
■ filter
■ reduceByKey (see the sketch below)
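As a hedged illustration, here is how these stateless operations might chain together, assuming `lines` is the socket DStream from the sketch above (the hashtag filter is made up for illustration):

words = lines.flatMap(lambda line: line.split(" "))    # one line -> many words
hashtags = words.filter(lambda w: w.startswith("#"))   # keep only hashtags
pairs = hashtags.map(lambda tag: (tag, 1))             # word -> (word, 1)
counts = pairs.reduceByKey(lambda a, b: a + b)         # sum counts per key, per batch
counts.pprint()

Each of these operates on one batch (RDD) at a time, with no state carried between batches.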
Stateful data
■ Allow you to compute results across a longer time period than your batch interval
■ Example: top-sellers from the past hour (sketched below)
– You might process data every one second (the batch interval)
– But maintain a window of one hour
■ The window “slides” as time goes on, to represent batches within the window interval
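A hedged sketch of that top-sellers example, assuming a hypothetical `orders` DStream of (itemID, quantity) pairs, a one-second batch interval, and a one-hour window (using the inverse function requires checkpointing to be enabled):

hourlySales = orders.reduceByKeyAndWindow(
    lambda a, b: a + b,    # add quantities as batches enter the window
    lambda a, b: a - b,    # subtract quantities as batches leave the window
    3600, 1)               # window interval: 1 hour, slide interval: 1 second
topSellers = hourlySales.transform(
    lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False))
topSellers.pprint()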
Batch interval vs. slide interval vs. window interval
■ The batch interval is how often data is captured into a DStream
■ The slide interval is how often a windowed transformation is computed
■ The window interval is how far back in time the windowed transformation goes
Example
ssc = StreamingContext(sc, 1)   # batch interval of one second
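Extending that line into a minimal sketch that shows all three intervals at once (the socket source is a made-up placeholder):

lines = ssc.socketTextStream("localhost", 9999)   # batch interval: 1 s, from the StreamingContext
windowed = lines.window(300, 10)                  # window interval: 300 s, slide interval: 10 s
windowed.count().pprint()                         # recomputed every 10 s over the last 300 s of batches

Note that the window and slide intervals must be multiples of the batch interval.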
■ You can use reduceByWindow() or reduceByKeyAndWindow() to aggregate data across a longer period of time!
hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
    lambda x, y: x + y, lambda x, y: x - y, 300, 1)   # window = 300 s, slide = 1 s
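Because the second lambda (the “inverse” function) lets Spark subtract data leaving the window instead of recomputing the whole window, this form of reduceByKeyAndWindow requires checkpointing to be enabled, e.g.:

ssc.checkpoint("/tmp/checkpoint")   # directory is a placeholder; needed to store windowed state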
STRUCTURED STREAMING
What is structured streaming?