Lecture #7.2 - Apache Spark - Streaming API

Spark Streaming is a framework for processing real-time data streams within Apache Spark. Structured Streaming is the newer API and is recommended over the older DStreams API. It allows users to write streaming jobs that process live data streams with the same APIs as batch jobs, such as DataFrames and SQL. Streaming jobs can integrate with other Spark components and continuously update output sinks such as files, databases, or Kafka as new data arrives.


MODERN DATA ARCHITECTURES

FOR BIG DATA II

APACHE SPARK
STREAMING API
Agenda

● Spark Streaming
● API
● Summary

Where are we?

Spark Streaming is built on top of Spark’s APIs.
1.
SPARK
STREAMING
Spark Streaming Features

Spark Streaming’s main design choices are:

● Declarative API: the application specifies what to compute instead of how to compute it.

● Event time and processing time: events can be processed based on the timestamp assigned when the event was created at the source (event time) or when it arrived at Spark (processing time).

● Micro-batch and continuous execution: until Spark 2.3 only micro-batching was possible (↑throughput vs ↓latency).
Spark Streaming Features
Spark Streaming has two streaming APIs:

● DStreams API (RDD-based): very low level and no longer recommended, as it has several limitations (micro-batching only, processing time only, direct RDD interaction, Java/Python objects, …).

● Structured Streaming (DataFrame-based): built upon Spark’s Structured APIs with multiple optimizations (event time, continuous processing, …).
1.1
STRUCTURED
STREAMING
BASICS
Structured Streaming Basics

Structured Streaming uses the Structured APIs in Spark: DataFrames, Datasets and SQL.

All the operations we’ve seen so far are supported → that’s a unified computing engine.
Structured Streaming Basics

Streaming application execution can be summarized as:

● Write the code for your processing.
● Specify a destination (file, database, Kafka, …).
● The Structured Streaming engine runs the code incrementally and continuously as new data arrives into the system, as illustrated in the sketch below.
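To make these steps concrete, here is a minimal sketch of the classic streaming word count (the socket source on localhost:9999 is a test assumption, in the spirit of the Spark programming guide):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# 1) Write the processing code with the usual DataFrame API
lines = (spark.readStream
         .format("socket")             # test source, assumed to listen on localhost:9999
         .option("host", "localhost")
         .option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# 2) Specify a destination (console, for demonstration) and 3) let the engine
#    run the query incrementally and continuously as new lines arrive
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()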
Structured Streaming Basics

The stream of data is abstracted as a table to which data is continuously appended.
Continuous Application

An end-to-end application that reacts to data in real time by combining a variety of tools:

● streaming jobs
● batch jobs
● joins between streaming and offline data
● interactive ad-hoc queries
Continuous Application

Spark’s uniqueness shows up in a scenario like the following one:

“Structured Streaming to (1) continuously update a table that (2) users query interactively with Spark SQL, (3) serve a machine learning model trained by MLlib, or (4) join streams with offline data in any of Spark’s data sources.”
1.2
CORE
CONCEPTS
Core components

The following are the core components in a Structured Streaming job:

● Transformations & Actions
● Input sources
● Output sinks
● Output modes
● Triggers
● Event-time processing
Core components

Transformations & Actions

Structured Streaming maintains the same concept of transformations and actions.

The same transformations are available, but with some restrictions.

Only one action is available: starting a stream, which will then run continuously and output results.
Core components
Input Sources

Specifies the source of the data stream:

● Kafka
● File
● Socket *
● Rate **

* used for testing
** used for benchmarking
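A hedged sketch of reading from two of these sources; the broker address, topic name, directory and schema below are illustrative assumptions, and the Kafka source additionally requires the spark-sql-kafka connector package:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Kafka source (assumes a broker at localhost:9092 and a topic named "events")
kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "events")
            .load())

# File source (assumes new JSON files keep arriving in /data/incoming; a schema is required)
schema = StructType([
    StructField("user", StringType()),
    StructField("action", StringType()),
    StructField("timestamp", TimestampType()),
])
file_df = (spark.readStream
           .format("json")
           .schema(schema)
           .load("/data/incoming"))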
Core components

Output Sinks

Specifies the destination of the processing results:

● Kafka
● File
● Console *
● Memory *
● Foreach
● ForeachBatch

* used for debugging
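As an example of the more flexible sinks, here is a sketch of foreachBatch, which hands each micro-batch to a user function as a regular DataFrame (the rate source and the output/checkpoint paths are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def write_batch(batch_df, batch_id):
    # Each micro-batch is a regular (batch) DataFrame, so any batch writer can be used
    batch_df.write.mode("append").parquet("/tmp/events_output")

query = (events.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/events_checkpoint")
         .start())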
Core components

Output Modes

Specifies how we want to save the data to the sink:

● Append: only add new records to the sink.
● Update: update changed records in place.
● Complete: rewrite the full output.

* Certain sinks only support certain output modes.
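A small sketch showing where the output mode is chosen (the rate source and the bucketing are only there to have an aggregation to output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
counts = stream.groupBy((stream.value % 10).alias("bucket")).count()

# "complete" rewrites the whole result table on every trigger; "update" would emit only
# the changed buckets; "append" is not allowed for this aggregation without a watermark
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())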
Core components

Triggers

Specifies when the engine should check for new input data and update the result:

● Micro-batch mode (default)
● Fixed-interval micro-batches
● One-time micro-batch
● Continuous with fixed checkpoint interval
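A sketch of where the trigger is configured (without an explicit trigger, the engine starts a new micro-batch as soon as the previous one finishes; the rate source is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Fixed-interval micro-batches: look for new data every 10 seconds
query = (stream.writeStream
         .format("console")
         .trigger(processingTime="10 seconds")
         # Alternatives (only one trigger per query):
         #   .trigger(once=True)              # one-time micro-batch, then stop
         #   .trigger(continuous="1 second")  # continuous mode, checkpoint every second
         .start())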
Core components

Event-Time Processing
Structured Streaming supports event-time processing, i.e., processing data based on timestamps included in the records, which may arrive out of order.
Core components

Event-Time Processing
● Event-time data:
Event time means time fields that are embedded in the data itself.
Rather than processing data according to the time it reaches your system, you process it according to the time it was generated.
Handy when events arrive out of order (e.g., due to network delays).
Core components

Event-Time Processing

● Watermarks:
Allow specifying how late data is expected to arrive in event time (so the engine can limit how long it needs to remember old data).
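A sketch of event-time windowing with a watermark (the rate source is used only because it already provides a timestamp column; the 5-minute window and 10-minute lateness bound are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tolerate data up to 10 minutes late and count events per 5-minute event-time window
windowed_counts = (events
                   .withWatermark("timestamp", "10 minutes")
                   .groupBy(window(col("timestamp"), "5 minutes"))
                   .count())

query = (windowed_counts.writeStream
         .outputMode("update")
         .format("console")
         .start())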
2.
API
Input Source

SparkSession.readStream

Returns a DataStreamReader that can be used to read data streams as a streaming DataFrame.
Input Source

DataStreamReader

The input source is configured through its methods:

● .format(source) Specifies the input data source format.
● .option(key, value) Adds an input option for the underlying data source.
● .load() Loads a data stream from the data source and returns it as a DataFrame.
● ...
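A sketch of the reader chain using the rate source mentioned earlier (the rows-per-second value is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Choose a format, set its options, then load into a streaming DataFrame
stream_df = (spark.readStream
             .format("rate")                 # built-in benchmarking source
             .option("rowsPerSecond", 100)   # source-specific option
             .load())

print(stream_df.isStreaming)  # True: this DataFrame is backed by a streaming source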
Output Sink

DataFrame.writeStream

Returns a DataStreamWriter used to save the content of the streaming DataFrame.
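A sketch of building the writer (the output and checkpoint paths are illustrative; note that nothing runs until start() is called, as the next slide shows):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# writeStream returns a DataStreamWriter; this only configures the sink
writer = (stream_df.writeStream
          .format("parquet")
          .option("path", "/tmp/stream_output")
          .option("checkpointLocation", "/tmp/stream_ckpt")  # required for file sinks
          .outputMode("append"))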
Actions

DataStreamWriter.start

Starts the processing of the contents of the DataFrame.
Actions

StreamingQuery.awaitTermination(timeout)

In addition to the start action, we need to tell the driver process to keep running “forever” while the query executes in the background, by using this method.
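A short sketch of the two calls together (the console sink and the 60-second timeout are illustrative; awaitTermination with no argument blocks until the query stops or fails):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# start() is the only action: it launches the continuous query and returns a StreamingQuery
query = (stream_df.writeStream
         .format("console")
         .start())

query.awaitTermination(60)  # keep the driver alive for up to 60 seconds
query.stop()                # stop the query explicitly once we are done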
3.
SUMMARY
Summary

Structured Streaming presents a powerful way to write streaming applications.

Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful.
Summary

Structured Streaming Programming Guide

PySpark Structured Streaming
