Data Engineering 101 - Streaming in Databricks
Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Databricks Streaming
Trigger Options
streamingDF.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("append").format("console").start()
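Besides a fixed processing-time interval, `DataStreamWriter.trigger` also accepts `once`, `availableNow` (Spark 3.3 and later), and `continuous`. As a minimal sketch, the helper below maps a mode name to the keyword arguments you would pass to `.trigger(...)`. The helper `trigger_kwargs` and its mode names are hypothetical and not part of Spark.

```python
def trigger_kwargs(mode: str, interval: str = "10 seconds") -> dict:
    """Map a (hypothetical) mode name to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "micro_batch":
        # Start a micro-batch every `interval`.
        return {"processingTime": interval}
    if mode == "available_now":
        # Process all data available at start, possibly in several batches, then stop.
        return {"availableNow": True}
    if mode == "once":
        # Process all available data in a single batch, then stop.
        return {"once": True}
    if mode == "continuous":
        # Continuous processing; `interval` is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown trigger mode: {mode}")


# Usage, assuming a running Spark session and an existing streamingDF:
# streamingDF.writeStream \
#     .trigger(**trigger_kwargs("available_now")) \
#     .format("console").start()
```

`available_now` is usually the better fit for scheduled, batch-style runs of a streaming job, since it can split a large backlog into several batches.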
Output Modes
streamingDF.writeStream \
    .outputMode("append") \
    .format("console").start()
Window Operations
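A window operation buckets events by event time and aggregates within each bucket. The plain-Python sketch below illustrates tumbling (fixed, non-overlapping) windows only; it is a model of the idea, not Spark code. In PySpark the equivalent bucketing is done by `pyspark.sql.functions.window`. The function names, the sample events, and the 300-second width are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def tumbling_window(ts_seconds: int, width_seconds: int) -> Tuple[int, int]:
    """Return the [start, end) bounds of the fixed window that contains ts_seconds."""
    start = ts_seconds - (ts_seconds % width_seconds)
    return (start, start + width_seconds)


def count_per_window(events: Iterable[Tuple[str, int]], width_seconds: int) -> Counter:
    """Count events per (window, event name); events are (name, ts_seconds) pairs."""
    counts = Counter()
    for name, ts in events:
        counts[(tumbling_window(ts, width_seconds), name)] += 1
    return counts


# Illustrative events: (event name, event time in seconds)
events = [("click", 5), ("click", 290), ("view", 301), ("click", 610)]
counts = count_per_window(events, 300)
```

Here the two early clicks share the window starting at 0, while the view and the last click land in the windows starting at 300 and 600.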
Watermarking
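from pyspark.sql.functions import window

# Append mode needs the event-time column in the grouping; the 5-minute window is an illustrative choice
streamingDF \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window("timestamp", "5 minutes"), "event").count() \
    .writeStream.outputMode("append").format("console").start()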
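A watermark tells the engine how late an event may arrive before it is ignored. The plain-Python sketch below is a simplified model, not Spark code: between micro-batches the watermark advances to the maximum event time seen minus the allowed delay, and events older than the current watermark are dropped. Spark's actual rule for aggregations compares window bounds rather than raw timestamps. The function name and the sample batches are assumptions; the 600-second delay mirrors the 10 minutes used above.

```python
from typing import List, Sequence, Tuple


def apply_watermark(batches: Sequence[Sequence[int]], delay: int) -> Tuple[List[int], List[int]]:
    """Return (kept, dropped) event times after applying a watermark across micro-batches."""
    watermark = float("-inf")
    max_seen = float("-inf")
    kept, dropped = [], []
    for batch in batches:
        for ts in batch:
            # The watermark is fixed for the duration of a batch.
            (kept if ts >= watermark else dropped).append(ts)
            max_seen = max(max_seen, ts)
        # Advance between batches; the watermark never moves backwards.
        watermark = max(watermark, max_seen - delay)
    return kept, dropped


# Event times in seconds, grouped by micro-batch (illustrative data)
batches = [[100, 105], [800, 95], [120, 810]]
kept, dropped = apply_watermark(batches, delay=600)
```

The event at 95 is still accepted because the watermark only moves after the batch that contained 800 completes; the event at 120 arrives one batch later, when the watermark is already 200, and is dropped.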
Stateful Processing
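Stateful processing means the query keeps per-key state between micro-batches instead of treating each batch in isolation. The plain-Python sketch below models that idea with a running count per key; it is not Spark code, and the class `RunningCounts` is hypothetical. In Spark the state lives in a state store managed by the engine (built-in aggregations keep it for you, and PySpark 3.4 and later offers `applyInPandasWithState` for custom logic).

```python
from typing import Dict, Iterable, List, Tuple


class RunningCounts:
    """Keep a count per key across micro-batches (a stand-in for a state store)."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def process_batch(self, batch: Iterable[str]) -> List[Tuple[str, int]]:
        """Update the state and return only the keys changed by this batch."""
        touched: Dict[str, int] = {}
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
            touched[key] = self.state[key]
        return sorted(touched.items())


rc = RunningCounts()
first = rc.process_batch(["a", "b", "a"])
second = rc.process_batch(["b", "c"])
```

Returning only the changed keys resembles the `update` output mode; returning the whole of `rc.state` after every batch would resemble `complete`.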
Checkpointing
streamingDF.writeStream \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/output")
Auto Loader
Schema Evolution
autoloaderDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/path/to/data")
Incremental Processing
autoloaderDF.writeStream.format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/output")
autoloaderDF \
    .withWatermark("timestamp", "10 minutes") \
    .writeStream.outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .format("delta").start("/path/to/output")
Multi-Hop Architecture
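from pyspark.sql import functions as F

# Batch read of the silver table, aggregated per key, then written to gold
goldDF = spark.read.format("delta").load("/silver") \
    .groupBy("key").agg(F.sum("value"))
goldDF.write.format("delta").save("/gold")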
Streaming Aggregations
Streaming Joins
staticDF = spark.read.format("delta") \
    .load("/path/to/static")
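A static Delta table such as `staticDF` is typically joined to a stream to enrich each micro-batch with reference data. A minimal sketch follows, assuming both sides share a column named `key`; that column name and the helper `enrich_stream` are hypothetical. It relies only on the standard `DataFrame.join(other, on, how)` call, so it works with any streaming and static DataFrame pair.

```python
def enrich_stream(streaming_df, static_df, key: str = "key", how: str = "inner"):
    """Join a streaming DataFrame (left side) to a static DataFrame on a shared column."""
    # With the stream on the left, Spark supports inner and left outer
    # stream-static joins; the static side is not tracked as streaming state.
    return streaming_df.join(static_df, on=key, how=how)


# Usage, assuming a running Spark session, streamingDF, and staticDF:
# enrichedDF = enrich_stream(streamingDF, staticDF)
# enrichedDF.writeStream.format("console").start()
```

Stream-stream joins are a different case: both sides must be buffered as state, so they are normally paired with watermarks on each input to bound that state.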