Lecture #7.2 - Apache Spark - Streaming API

Spark Streaming is a framework for processing real-time data streams within Apache Spark. Structured Streaming is the newer API and is recommended over the older DStreams API. It allows users to write streaming jobs that process live data streams with the same APIs as batch jobs, such as DataFrames and SQL. Streaming jobs can integrate with other Spark components and continuously update output sinks such as files, databases, or Kafka as new data arrives.


MODERN DATA ARCHITECTURES

FOR BIG DATA II

APACHE SPARK
STREAMING API
Agenda

● Spark Streaming
● API
● Summary

Where are we?

Spark Streaming is built on top of Spark’s APIs.
1.
SPARK
STREAMING
Spark Streaming Features

Spark Streaming’s main design choices are:

● Declarative API: the application specifies what to compute instead of how to compute it.

● Event time and processing time: events can be processed based on the timestamp assigned when the event was created at the source (event time) or when it arrived at Spark (processing time).

● Micro-batch and continuous execution: until Spark 2.3 only micro-batching was possible (↑throughput vs ↓latency).
Spark Streaming Features
Spark Streaming has two streaming APIs:

● DStreams API (RDD-based): very low level and no longer recommended, as it has several limitations (micro-batching only, processing time only, direct RDD interaction, Java/Python objects, …).

● Structured Streaming (DataFrame-based): built upon Spark’s Structured APIs with multiple optimizations (event time, continuous processing, …).
1.1
STRUCTURED
STREAMING
BASICS
Structured Streaming Basics

Structured Streaming uses the Structured APIs in Spark: DataFrames, Datasets and SQL.

All the operations we’ve seen so far are supported → that’s a unified computing engine.
Structured Streaming Basics

Streaming application execution can be summarized as:

● Write the code for your processing.
● Specify a destination (file, database, Kafka, …).
● The Structured Streaming engine runs the code incrementally and continuously as new data arrives into the system, as illustrated in the sketch below.
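To make these steps concrete, here is a minimal sketch of the classic streaming word count (the socket source on localhost:9999 is a test assumption, in the spirit of the Spark programming guide):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# 1) Write the processing code with the usual DataFrame API
lines = (spark.readStream
         .format("socket")             # test source, assumed to listen on localhost:9999
         .option("host", "localhost")
         .option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# 2) Specify a destination (console, for demonstration) and 3) let the engine
#    run the query incrementally and continuously as new lines arrive
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()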
Structured Streaming Basics

The stream of data is abstracted as a table to which data is continuously appended.
Continuous Application

An end-to-end application that reacts to data in real time by combining a variety of tools:

● streaming jobs
● batch jobs
● joins between streaming and offline data
● interactive ad-hoc queries
Continuous Application

Spark’s uniqueness shows up in a scenario like the following one:

“Structured Streaming to (1) continuously update a table that (2) users query interactively with Spark SQL, (3) serve a machine learning model trained by MLlib, or (4) join streams with offline data in any of Spark’s data sources.”
1.2
CORE
CONCEPTS
Core components

The following are the core components in a Structured Streaming job:

● Transformations & Actions
● Input sources
● Output sinks
● Output modes
● Triggers
● Event-time processing
Core components

Transformations & Actions

Structured Streaming maintains the same concept of transformations and actions.

The same transformations are available, but with some restrictions.

Only one action is available: starting a stream, which will then run continuously and output results.
Core components
Input Sources

Specifies the source of the data stream:

● Kafka
● File
● Socket *
● Rate **

* used for testing
** used for benchmarking
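A hedged sketch of reading from two of these sources; the broker address, topic name, directory and schema below are illustrative assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Kafka source (assumes a broker at localhost:9092 and a topic named "events")
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load())

# File source (assumes new JSON files keep arriving in /data/incoming; a schema is required)
schema = StructType([
    StructField("user", StringType()),
    StructField("action", StringType()),
    StructField("timestamp", TimestampType()),
])
file_df = (spark.readStream
           .format("json")
           .schema(schema)
           .load("/data/incoming"))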
Core components

Output Sinks

Specifies the destination of the processing results:

● Kafka
● File
● Console *
● Memory *
● Foreach
● ForeachBatch

* used for debugging
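As an example of the more flexible sinks, here is a sketch of foreachBatch, which hands each micro-batch to a user function as a regular DataFrame (the rate source and the output/checkpoint paths are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def write_batch(batch_df, batch_id):
    # Each micro-batch is a regular (batch) DataFrame, so any batch writer can be used
    batch_df.write.mode("append").parquet("/tmp/events_output")

query = (events.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/events_checkpoint")
         .start())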
Core components

Output Modes

Specifies how we want to save the data to the sink:

● Append: only add new records to the sink.
● Update: update changed records in place.
● Complete: rewrite the full output.

* Certain sinks only support certain output modes.
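A small sketch showing where the output mode is chosen (the rate source and the bucketing are only there to have an aggregation to output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

# "complete" rewrites the whole result table on every trigger; "update" would emit only
# the changed buckets; "append" is not allowed for this aggregation without a watermark
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())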
Core components

Triggers

Specifies when the engine should check for new input data and update the result:

● Micro-batch mode (default)
● Fixed-interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval
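A sketch of where the trigger is configured (without an explicit trigger, the engine starts a new micro-batch as soon as the previous one finishes; the rate source is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Fixed-interval micro-batches: look for new data every 10 seconds
query = (stream.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         # Alternatives (only one trigger per query):
         #   .trigger(once=True)              # one-time micro-batch, then stop
         #   .trigger(continuous="1 second")  # continuous mode, checkpoint every second
         .start())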
Core components

Event-Time Processing
Structured Streaming supports event-time processing, i.e., processing data based on timestamps included in the records, which may arrive out of order.
Core components

Event-Time Processing
● Event-time data:
Event time means time fields that are embedded in the data itself.
Rather than processing data according to the time it reaches your system, you process it according to the time it was generated.
Handy when events arrive out of order (e.g., due to network delays).
Core components

Event-Time Processing

● Watermarks:
Allow specifying how late data is expected to arrive in event time (so the engine can limit how long it needs to remember old data).
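A sketch of event-time windowing with a watermark (the rate source is used only because it already provides a timestamp column; the 5-minute window and 10-minute lateness bound are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tolerate data up to 10 minutes late and count events per 5-minute event-time window
windowed_counts = (events
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(window(col("timestamp"), "5 minutes"))
                   .count())

query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())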
2.
API
Input Source

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
Input Source

DataStreamReader

The input source is configured through its methods:

● .format(source) Specifies the input data source format.
● .option(key, value) Adds an input option for the underlying data source.
● .load() Loads a data stream from the data source and returns it as a DataFrame.
● ...
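A sketch of the reader chain using the rate source mentioned earlier (the rows-per-second value is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Choose a format, set its options, then load into a streaming DataFrame
stream_df = (spark.readStream
             .format("rate")                 # built-in benchmarking source
             .option("rowsPerSecond", 100)   # source-specific option
             .load())

print(stream_df.isStreaming)  # True: this DataFrame is backed by a streaming source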
Output Sink

DataFrame.writeStream

Returns a DataStreamWriter used to save the content of the streaming DataFrame.
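A sketch of building the writer (the output and checkpoint paths are illustrative; note that nothing runs until start() is called, as the next slide shows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# writeStream returns a DataStreamWriter; this only configures the sink
writer = (stream_df.writeStream
          .format("parquet")
          .option("path", "/tmp/stream_output")
          .option("checkpointLocation", "/tmp/stream_ckpt")  # required for file sinks
          .outputMode("append"))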
Actions

DataStreamWriter.start

Starts the processing of the contents of the DataFrame.
Actions

StreamingQuery.awaitTermination(timeout)

In addition to the start action, we need to tell the driver process to keep running “forever” while the query executes in the background, by using this method.
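A short sketch of the two calls together (the console sink and the 60-second timeout are illustrative; awaitTermination with no argument blocks until the query stops or fails):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# start() is the only action: it launches the continuous query and returns a StreamingQuery
query = (stream_df.writeStream
         .format("console")
         .start())

query.awaitTermination(60)  # keep the driver alive for up to 60 seconds
query.stop()                # stop the query explicitly once we are done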
3.
SUMMARY
Summary

Structured Streaming presents a powerful way to write streaming applications.

Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful.
Summary

Structured Streaming Programming Guide

PySpark Structured Streaming
