
Structured Streaming

Learning Objectives

- Process streaming data
- DataStreamReader
- DataStreamWriter

Derar Alhussein © Udemy | Databricks Certified Data Engineer Associate - Preparation


Data Stream

- Any data source that grows over time, for example:
  - New files landing in cloud storage
  - Updates to a database captured in a CDC feed
  - Events queued in a pub/sub messaging feed
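
As a quick illustration, here is a minimal sketch of reading the first kind of source above, new files landing in cloud storage, as a stream. The path, file format, and schema are assumptions for illustration, not from the slides:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumed schema of the incoming JSON files (illustrative only)
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Each new file that lands under the path becomes new rows in the stream
streamDF = (spark.readStream
                 .format("json")
                 .schema(schema)
                 .load("/path/to/landing"))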


Processing Data Stream

Two approaches:

1. Reprocess the entire source dataset each time
2. Process only the new data added since the last update

Spark Structured Streaming enables the second, incremental approach.



Spark Structured Streaming

(Diagram: Structured Streaming continuously reads from an infinite data source and writes the results to a data sink.)

Treating Infinite Data as a Table

(Diagram: the input data stream is treated as an unbounded table; each new record arriving on the stream becomes a new row appended to that table.)


Input Streaming Table

# Read the source table incrementally as a streaming DataFrame
streamDF = (spark.readStream
                 .table("Input_Table"))

# Write the stream to the target table, checkpointing progress as it goes
(streamDF.writeStream
         .trigger(processingTime="2 minutes")
         .outputMode("append")
         .option("checkpointLocation", "/path")
         .table("Output_Table"))
Trigger Intervals
(streamDF.writeStream
         .trigger(processingTime="2 minutes")
         .outputMode("append")
         .option("checkpointLocation", "/path")
         .table("Output_Table"))

Trigger                  | Method call                           | Behavior
Unspecified              | (default)                             | Process data in micro-batches at the default interval, processingTime="500ms"
Fixed interval           | .trigger(processingTime="5 minutes")  | Process data in micro-batches at the user-specified interval
Triggered batch          | .trigger(once=True)                   | Process all available data in a single batch, then stop
Triggered micro-batches  | .trigger(availableNow=True)           | Process all available data in multiple micro-batches, then stop
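
For instance, a minimal sketch contrasting two of these triggers, following the slide's .table() pattern; the table names and checkpoint paths are placeholders:

# Fixed interval: start a new micro-batch every 5 minutes
(spark.readStream.table("Input_Table")
      .writeStream
      .trigger(processingTime="5 minutes")
      .option("checkpointLocation", "/path/fixed")
      .table("Output_Fixed"))

# Triggered micro-batches: process all available data, then stop
(spark.readStream.table("Input_Table")
      .writeStream
      .trigger(availableNow=True)
      .option("checkpointLocation", "/path/available_now")
      .table("Output_Batch"))
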
Output Modes
(streamDF.writeStream
         .trigger(processingTime="2 minutes")
         .outputMode("append")
         .option("checkpointLocation", "/path")
         .table("Output_Table"))

Mode             | Method call             | Behavior
Append (default) | .outputMode("append")   | Only newly appended rows are incrementally written to the target table with each batch
Complete         | .outputMode("complete") | The target table is overwritten with each batch
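
Complete mode is typically used with streaming aggregations, where each batch recomputes the full result. A minimal sketch, assuming a source table with a category column (names are placeholders):

# The aggregation result is recomputed and the target overwritten each batch
(spark.readStream.table("Input_Table")
      .groupBy("category")
      .count()
      .writeStream
      .outputMode("complete")
      .option("checkpointLocation", "/path/agg")
      .table("Category_Counts"))
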



Checkpointing
(streamDF.writeStream
         .trigger(processingTime="2 minutes")
         .outputMode("append")
         .option("checkpointLocation", "/path")
         .table("Output_Table"))

- Store the stream's state
- Track the progress of your stream processing
- Cannot be shared between separate streams: each streaming query needs its own checkpoint location (see the sketch below)
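
A minimal sketch of that last rule, assuming two independent queries over the same streaming DataFrame; the paths and table names are placeholders:

# Each streaming query gets its own dedicated checkpoint location
(streamDF.writeStream
         .option("checkpointLocation", "/checkpoints/query_a")
         .table("Output_A"))

(streamDF.writeStream
         .option("checkpointLocation", "/checkpoints/query_b")
         .table("Output_B"))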


Guarantees

1. Fault tolerance
   - Checkpointing + write-ahead logs
   - These record the offset range of the data being processed during each trigger interval, so a failed stream can resume from where it left off.

2. Exactly-once guarantee
   - Idempotent sinks: writing the same micro-batch more than once produces the same result, so replays after a failure do not introduce duplicates (see the sketch below).
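
As one illustration, not from the slides, an idempotent sink can be built with foreachBatch and a Delta Lake MERGE; the target table name and the id key column are assumptions:

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merging on a key makes re-writing a replayed micro-batch a no-op
    target = DeltaTable.forName(micro_batch_df.sparkSession, "Output_Table")
    (target.alias("t")
           .merge(micro_batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(streamDF.writeStream
         .foreachBatch(upsert_to_delta)
         .option("checkpointLocation", "/path")
         .start())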



Unsupported Operations

- Some operations are not supported on streaming DataFrames, for example:
  - Sorting
  - Deduplication
- Advanced methods such as windowing and watermarking can be used to achieve some of these operations (see the sketch below).
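
A minimal sketch of deduplicating a stream with a watermark; the event_time and event_id column names are assumptions:

# Drop duplicate events seen within a 10-minute watermark window
deduped = (spark.readStream.table("Input_Table")
                .withWatermark("event_time", "10 minutes")
                .dropDuplicates(["event_id", "event_time"]))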

