Data Engineering 101 - Streaming in Databricks

Data Engineering 101: Streaming Data in Databricks

Shwetank Singh
GritSetGrow - GSGLearn.com
Data Engineering 101: Databricks Streaming

Spark Structured Streaming


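A scalable and fault-tolerant stream processing engine built on the Spark SQL engine. File sources such as CSV need an explicit schema for streaming reads unless schema inference is enabled; the column names and types below are assumed for illustration.

streamingDF = spark.readStream.format("csv") \
    .schema("event STRING, key STRING, value DOUBLE, timestamp TIMESTAMP") \
    .load("/path/to/dir")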


Trigger Options

Defines when the stream processes newly arrived data, for example at a fixed processing-time interval.

streamingDF.writeStream \
    .trigger(processingTime="10 seconds") \
    .outputMode("append").format("console").start()
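
To make the fixed-interval trigger concrete, the plain-Python sketch below groups record arrival times into the micro-batches that a processing-time trigger would form. It is a simplified model rather than Spark code: `micro_batches` is a hypothetical helper, times are in seconds, and it assumes every batch finishes before the next trigger fires.

```python
def micro_batches(arrival_times, interval):
    """Group arrival times (seconds) by the trigger time that picks them up.

    Simplified model: a trigger fires at every multiple of `interval`, and a
    record is handled by the first trigger that fires after it arrives.
    """
    batches = {}
    for t in arrival_times:
        fire_at = (t // interval + 1) * interval
        batches.setdefault(fire_at, []).append(t)
    return dict(sorted(batches.items()))


if __name__ == "__main__":
    # Records arriving at 1s and 4s are processed together at the 10s trigger.
    print(micro_batches([1, 4, 12, 19, 31], 10))
```

A shorter interval lowers latency but produces more, smaller batches; a longer one does the opposite.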


Output Modes

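Determines what is written to the sink on each trigger: append (only rows added since the last trigger), complete (the entire result table), or update (only rows that changed since the last trigger).

streamingDF.writeStream \
    .outputMode("append") \
    .format("console").start()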
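
The difference between complete and update is easiest to see on a running count. The sketch below is plain Python, not Spark: `emitted_per_batch` is a hypothetical helper that keeps a count per key across micro-batches and returns what a sink would receive in each mode. Append is left out because, for aggregations, it additionally needs a watermark to decide when a row is final.

```python
from collections import Counter


def emitted_per_batch(batches, mode):
    """Return what a sink would receive per micro-batch for a running count by key."""
    totals = Counter()  # state kept across batches
    emitted = []
    for batch in batches:
        changed = set()
        for key in batch:
            totals[key] += 1
            changed.add(key)
        if mode == "complete":
            # The whole result table is rewritten every trigger.
            emitted.append(dict(totals))
        elif mode == "update":
            # Only keys whose count changed in this batch are written.
            emitted.append({k: totals[k] for k in changed})
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return emitted


if __name__ == "__main__":
    batches = [["a", "b"], ["a"]]
    print(emitted_per_batch(batches, "complete"))
    print(emitted_per_batch(batches, "update"))
```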


Window Operations

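Aggregates data over time-based windows. With only a window duration, as below, the windows are tumbling (non-overlapping); passing a slide duration as a third argument to window() makes them sliding.

from pyspark.sql.functions import window, col
windowedCounts = streamingDF \
    .groupBy(window(col("timestamp"), "10 minutes")).count()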
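
As an illustration of how an event lands in a tumbling window, the plain-Python sketch below computes the window bounds for a timestamp. `tumbling_window` is a hypothetical helper, and aligning windows to the Unix epoch is an assumption chosen to mirror the usual default behaviour rather than a statement about Spark internals.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, width):
    """Return (start, end) of the tumbling window of size `width` containing `ts`.

    Windows are aligned to the Unix epoch (an assumption of this sketch).
    The start is inclusive and the end is exclusive.
    """
    offset = (ts - EPOCH) % width  # how far ts is into its window
    start = ts - offset
    return start, start + width


if __name__ == "__main__":
    event_time = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
    print(tumbling_window(event_time, timedelta(minutes=10)))
```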


Watermarking

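Handles late data by specifying a delay threshold on an event-time column: events that arrive later than the threshold may be dropped, which lets Spark discard old aggregation state. The watermark only takes effect when the aggregation groups by the event-time column or a window on it, so the count below is grouped by a time window as well as by event.

from pyspark.sql.functions import window, col
streamingDF \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(window(col("timestamp"), "10 minutes"), "event").count() \
    .writeStream.outputMode("append").format("console").start()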
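
The delay threshold can be pictured with a small plain-Python model. In this sketch the watermark is the largest event time seen so far minus the allowed delay, it advances only between micro-batches, and any event older than the current watermark is dropped. `split_late_events` is a hypothetical helper and the model is deliberately simpler than Spark's window-based rules.

```python
def split_late_events(batches, delay):
    """Split event times (seconds) into accepted and dropped under a watermark.

    The watermark is max(event time seen) - delay and is updated after each
    micro-batch, so it applies to the batches that follow.
    """
    watermark = float("-inf")
    max_seen = float("-inf")
    accepted, dropped = [], []
    for batch in batches:
        for t in batch:
            if t < watermark:
                dropped.append(t)   # too late: older than the watermark
            else:
                accepted.append(t)
            max_seen = max(max_seen, t)
        watermark = max_seen - delay  # advance once the batch is done
    return accepted, dropped


if __name__ == "__main__":
    # With a 600 s (10 minute) delay, the event at t=95 arrives after the
    # watermark has moved to 100 and is dropped.
    print(split_late_events([[100, 105], [700, 50], [95, 650]], 600))
```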


Stateful Processing

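Maintains state across streaming micro-batches for operations such as aggregations, so each new batch updates the running result instead of recomputing it from scratch.

from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum
statefulCounts = streamingDF.groupBy("key") \
    .agg(F.sum("value").alias("total"))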


Checkpointing

Saves the stream's progress and state to a durable storage location so that a restarted query can resume where it left off.

streamingDF.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start()


Auto Loader

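Efficiently processes new data files as they arrive in a directory, which makes it well suited to streaming ingestion. Unless a schema is supplied, Auto Loader needs cloudFiles.schemaLocation to store the schema it infers.

autoloaderDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/path/to/autoloader/source")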


Schema Evolution

Auto Loader infers the schema of incoming files, stores it at cloudFiles.schemaLocation, and tracks schema changes such as new columns as further files arrive.

autoloaderDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/path/to/data")


Incremental Processing

Processes only new data since the last checkpoint.

autoloaderDF.writeStream.format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/output")
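
The idea behind incremental processing can be sketched in plain Python: record which input files have been handled, and on the next run skip them. This is only an illustration of the principle, not Spark's or Auto Loader's checkpoint format; `process_new_files` and the JSON checkpoint file are hypothetical.

```python
import json
import os
import tempfile


def process_new_files(source_dir, checkpoint_path):
    """Handle files not yet listed in the checkpoint and return their names."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    new_files = sorted(name for name in os.listdir(source_dir) if name not in done)
    # Real processing of new_files would happen here, before the checkpoint
    # is updated, so a crash causes a retry rather than a silent skip.
    with open(checkpoint_path, "w") as f:
        json.dump(sorted(done | set(new_files)), f)
    return new_files


if __name__ == "__main__":
    src = tempfile.mkdtemp()
    ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    open(os.path.join(src, "a.csv"), "w").close()
    print(process_new_files(src, ckpt))  # first run sees a.csv
    open(os.path.join(src, "b.csv"), "w").close()
    print(process_new_files(src, ckpt))  # second run sees only b.csv
```

Keeping the checkpoint outside the source directory, as here, prevents it from being picked up as input.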


Handling Late Data

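A stream read with Auto Loader is an ordinary streaming DataFrame, so late-arriving data is handled with Structured Streaming's watermarking mechanism. The watermark only has an effect when a stateful operation, such as a windowed aggregation or deduplication, follows it.

autoloaderDF \
    .withWatermark("timestamp", "10 minutes") \
    .writeStream.outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/output")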


Multi-Hop Architecture

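Implements the Medallion architecture (Bronze, Silver, Gold), with Auto Loader ingesting the raw data.

from pyspark.sql import functions as F

# Bronze: ingest raw JSON files with Auto Loader.
bronzeDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/bronze")

# Silver: filter the stream and write it to a Delta table.
# start() returns a StreamingQuery handle, not a DataFrame.
silverQuery = bronzeDF.filter("value > 0") \
    .writeStream.format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/silver")

# Gold: batch-aggregate the Silver table once it contains data.
spark.read.format("delta").load("/silver") \
    .groupBy("key").agg(F.sum("value").alias("total")) \
    .write.format("delta").save("/gold")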


Streaming Aggregations

Performs real-time aggregations on streaming data.

from pyspark.sql import functions as F

aggregatedDF = streamingDF \
    .groupBy(F.window(F.col("timestamp"), "10 minutes")) \
    .agg(F.sum("value").alias("total"))


Delta Lake Integration

Provides ACID transactions and schema enforcement for streaming data with Delta Lake.

autoloaderDF.writeStream.format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .start("/path/to/delta/table")


File Notification Mode

Auto Loader can be configured to use file notifications for low-latency file discovery instead of listing the source directory.

autoloaderDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .option("cloudFiles.useNotifications", "true") \
    .load("/path/to/data")


Streaming Joins

Joins streaming data with a static dataset or with another streaming dataset.

staticDF = spark.read.format("delta") \
    .load("/path/to/static")

joinedDF = streamingDF.join(staticDF, "key")
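
Conceptually, a stream-static join applies the same lookup to every micro-batch. The plain-Python sketch below models one such batch as a list of dict rows and inner-joins it against static rows on a shared key. `join_micro_batch` and the row contents are hypothetical and only illustrate the semantics.

```python
def join_micro_batch(batch, static_rows, key="key"):
    """Inner-join one micro-batch of dict rows with static dict rows on `key`."""
    # Index the static side once; it does not change between batches.
    lookup = {}
    for row in static_rows:
        lookup.setdefault(row[key], []).append(row)
    joined = []
    for row in batch:
        # Rows without a match on the static side are dropped (inner join).
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    static_rows = [{"key": "a", "region": "eu"}]
    batch = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
    print(join_micro_batch(batch, static_rows))
```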


Automatic File Discovery

Auto Loader automatically detects and processes new files in the source directory.

autoloaderDF = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", "/path/to/schema") \
    .load("/path/to/autoloader/source")
