https://dbricks.co/tutorial-pydata-miami
Enter your cluster name
Use DBR 5.0 and Apache Spark 2.4, Scala 2.11
Choose Python 3
WiFi: CIC or CIC-A
Writing Continuous Applications with Structured Streaming in PySpark
Jules S. Damji
PyData, Miami, FL, Jan 11, 2019
I have used Apache Spark 2.x Before…
Apache Spark Community & Developer Advocate @ Databricks
Developer Advocate @ Hortonworks
Software engineering @ Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest
Program Chair, Spark + AI Summit
https://www.linkedin.com/in/dmatrix
@2twitme
Databricks Unified Analytics Platform (diagram): Databricks Workspace (notebooks, dashboards, APIs, jobs, models, end-to-end ML lifecycle), Databricks Runtime (Databricks Delta, ML frameworks), and the Databricks Cloud Service. Reliable & scalable, simple & integrated.
Agenda for Today’s Talk
• What and Why Apache Spark
• Why Streaming Applications are Difficult
• What’s Structured Streaming
• Anatomy of a Continuous Application
• Tutorials & Demo
• Q & A
How to think about data in 2019 - 2020
"Data is the new oil"
What’s Apache Spark & Why
What is Apache Spark?
• General cluster computing engine
that extends MapReduce
• Rich set of APIs and libraries
• Unified Engine
• Large community: 1000+ orgs,
clusters up to 8000 nodes
Apache Spark, Spark and Apache are trademarks of the Apache Software Foundation
Libraries: SQL, Streaming, ML, Graph, DL, …
Unique Thing about Spark
• Unification: same engine and same API for diverse use cases
• Streaming, batch, or interactive
• ETL, SQL, machine learning, or graph
Why Unification?
• MapReduce: a general engine for batch processing
Big Data Systems Yesterday (diagram): MapReduce for general batch processing, plus specialized systems for new workloads (Pregel, Dremel, Millwheel, Drill, Giraph, Impala, Storm, S4, …). Hard to manage, tune, and deploy; hard to combine into pipelines.
Big Data Systems Today (diagram): can a single unified engine replace both MapReduce (general batch processing) and the specialized systems for new workloads (Pregel, Dremel, Millwheel, Drill, Giraph, Impala, Storm, S4, …)?
Faster, Easier to Use, Unified
From the first distributed processing engine, to specialized data processing engines, to a unified data processing engine.
Benefits of Unification
1. Simpler to use and operate
2. Code reuse: e.g. write monitoring, fault tolerance (FT), etc. only once
3. New apps that span processing types: e.g. interactive
queries on a stream, online machine learning
An Analogy (image): many specialized devices vs. one unified device that enables new applications.
Why Are Streaming Applications Inherently Difficult?
Building robust stream processing apps is hard.
Complexities in stream processing
COMPLEX DATA
Diverse data formats
(json, avro, txt, csv, binary, …)
Data can be dirty,
late, out-of-order
COMPLEX SYSTEMS
Diverse storage systems
(Kafka, S3, Kinesis, RDBMS, …)
System failures
COMPLEX WORKLOADS
Combining streaming with
interactive queries
Machine learning
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
you should not have to reason about streaming
Treat Streams as Unbounded Tables
New data in the data stream = new rows appended to an unbounded input table.
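To make the table analogy concrete, here is a minimal sketch (not from the slides) using Spark's built-in rate source; the grouping column is illustrative only:

# A streaming DataFrame behaves like an unbounded table that keeps growing.
stream_df = (spark.readStream
    .format("rate")              # built-in source that emits (timestamp, value) rows
    .option("rowsPerSecond", 5)
    .load())

print(stream_df.isStreaming)     # True: same DataFrame API, unbounded input

# Query it like a table; Spark keeps the answer up to date as new rows arrive.
counts = stream_df.groupBy((stream_df.value % 10).alias("bucket")).count()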
You should write queries & Apache Spark should continuously update the answer.
DataFrames, Datasets, SQL

input = (spark.readStream
    .format("kafka")
    .option("subscribe", "topic")
    .load())

result = (input
    .select("device", "signal")
    .where("signal > 15"))

(result.writeStream
    .format("parquet")
    .start("dest-path"))

Logical plan (diagram): Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet
Apache Spark automatically streamifies!
Spark SQL converts a batch-like query into a series of incremental execution plans operating on new batches of data.
Series of incremental execution plans (diagram): Kafka source → optimized operators (codegen, off-heap, etc.) → Parquet sink, via an optimized physical plan; new data is processed at t = 1, t = 2, t = 3, …
Structured Streaming: Processing Modes
Simple Streaming ETL
Streaming word count
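The word-count slide itself is an image; a minimal sketch of the classic streaming word count, assuming a socket source on localhost:9999 (fed by e.g. nc -lk 9999), looks roughly like this:

from pyspark.sql.functions import explode, split

lines = (spark.readStream
    .format("socket")            # assumption: text lines arriving on a local socket
    .option("host", "localhost")
    .option("port", 9999)
    .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

query = (word_counts.writeStream
    .outputMode("complete")      # emit the full, updated counts table on each trigger
    .format("console")
    .start())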
Anatomy of a Streaming Query
Anatomy of a Streaming Query: Step 1
(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load())

Source
• Specify one or more locations to read data from
• Built-in support for files/Kafka/sockets, pluggable
Anatomy of a Streaming Query: Step 2
from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value")))

Transformation
• Using DataFrames, Datasets and/or SQL
• Internal processing is always exactly-once
Anatomy of a Streaming Query: Step 3
from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("complete")
    .option("checkpointLocation", "…")
    .start())

Sink
• Accepts the output of each batch
• When supported, sinks are transactional and exactly-once (e.g. files)
Anatomy of a Streaming Query: Output Modes
from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("update")
    .option("checkpointLocation", "…")
    .start())

Output mode – what's output
• Complete – output the whole answer every time
• Update – output changed rows only
• Append – output new rows only
Trigger – when to output (see the sketch below)
• Specified as a time interval; eventually may support data size
• No trigger means as fast as possible
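A rough sketch of the trigger choices in PySpark (df is a hypothetical streaming DataFrame; the intervals are placeholders):

# Default (no trigger): run micro-batches as fast as possible.
writer = df.writeStream.format("console")

# Micro-batch on a fixed interval.
writer = df.writeStream.format("console").trigger(processingTime="1 minute")

# One-shot: process all data available now, then stop.
writer = df.writeStream.format("console").trigger(once=True)

# Experimental continuous processing with a checkpoint interval (Spark 2.3+).
writer = df.writeStream.format("console").trigger(continuous="1 second")

query = writer.start()  # start whichever configuration was chosen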
Anatomy of a Streaming Query: Checkpoint
from pyspark.sql.functions import col, count

(spark.readStream
    .format("kafka")
    .option("subscribe", "input")
    .load()
    .withWatermark("timestamp", "2 minutes")
    .groupBy(col("value").cast("string").alias("key"))
    .agg(count("*").alias("value"))
    .writeStream
    .format("kafka")
    .option("topic", "output")
    .trigger(processingTime="1 minute")
    .outputMode("update")
    .option("checkpointLocation", "…")
    .start())

Checkpoint & Watermark
• Tracks the progress of a query in persistent storage
• Can be used to restart the query if there is a failure
• A continuous trigger is also available: trigger(continuous="1 second")
• Set the checkpoint location & watermark to drop very late events
Fault-tolerance with Checkpointing
Checkpointing – tracks progress
(offsets) of consuming data from
the source and intermediate state.
Offsets and metadata saved as JSON
Can resume after changing your
streaming transformations
end-to-end
exactly-once
guarantees
(diagram) New data is processed at t = 1, t = 2, t = 3, …, with progress recorded in a write-ahead log.
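A minimal sketch of how this plays out in practice (the paths and the events DataFrame are assumptions, not from the slides):

# `events` is assumed to be some streaming DataFrame.
query = (events.writeStream
    .format("parquet")
    .option("checkpointLocation", "/checkpoint/events")  # hypothetical path
    .start("/output/events"))                            # hypothetical path

# If the driver fails, rerunning the same code with the same checkpointLocation
# resumes from the offsets recorded in the write-ahead log instead of reprocessing
# or skipping data, which is what yields the end-to-end exactly-once guarantee.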
Complex Streaming ETL
Traditional ETL
• Raw, dirty, un/semi-structured data is dumped as files
• Periodic jobs run every few hours to convert raw data to structured data ready for further analytics
• Hours of delay before decisions can be taken on the latest data
• Problem: unacceptable when time is of the essence
  [intrusion, anomaly or fraud detection, monitoring IoT devices, etc.]
(diagram) The file dump lands in seconds, but the structured table is only available hours later.
Streaming ETL w/ Structured Streaming
Structured Streaming enables raw data to be available
as structured data as soon as possible
(diagram) Raw data becomes available as a structured table within seconds.
Streaming ETL w/ Structured Streaming
Example
JSON data being received in Kafka
Parse nested JSON and flatten it
Store in a structured Parquet table
Get end-to-end failure guarantees

from pyspark.sql.functions import from_json

rawData = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load())

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))  # do your ETL/transformation

query = (parsedData.writeStream
    .option("checkpointLocation", "/checkpoint")
    .partitionBy("date")
    .format("parquet")
    .trigger(processingTime="5 seconds")
    .start("/parquetTable"))
Reading from Kafka
rawData = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", ...)
    .option("subscribe", "topic")
    .load())

The rawData DataFrame has the following columns:
key      value    topic    partition offset timestamp
[binary] [binary] "topicA" 0         345    1486087873
[binary] [binary] "topicB" 3         2890   1486086721
Transforming Data
Cast the binary value to a string and name the column json
Parse the json string and expand it into nested columns, named data

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))

(diagram) The json column, e.g. { "timestamp": 1486087873, "device": "devA", …}, is parsed by from_json("json") as "data" into a nested data column with fields such as timestamp (1486087873, 1486086721) and device (devA, devX).
Transforming Data
Cast the binary value to a string and name the column json
Parse the json string and expand it into nested columns, named data
Flatten the nested columns

parsedData = (rawData
    .selectExpr("cast(value as string) as json")
    .select(from_json("json", schema).alias("data"))
    .select("data.*"))

Powerful built-in Python APIs to perform complex data transformations: from_json, to_json, explode, … hundreds of functions (see our blog post & tutorial).
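For example, a small sketch of two of these functions applied to the parsed data; the tags array column is an assumption and not part of the example schema:

from pyspark.sql.functions import explode, to_json, struct, col

# Explode an array column into one row per element (assumes a hypothetical "tags" array field).
exploded = parsedData.select("device", explode(col("tags")).alias("tag"))

# Re-serialize selected columns back into a JSON string, e.g. to write to a Kafka sink.
asJson = parsedData.select(to_json(struct("device", "timestamp")).alias("value"))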
Writing to Parquet
Save the parsed data as a Parquet table in the given path
Partition files by date so that future queries on time slices of the data are fast, e.g. a query on the last 48 hours of data

query = (parsedData.writeStream
    .option("checkpointLocation", ...)
    .partitionBy("date")
    .format("parquet")
    .start("/parquetTable"))  # path name
Tutorials
Summary
• Apache Spark is best suited for unified analytics & processing at scale
• Structured Streaming APIs enable continuous applications
• Demonstrated a continuous application
Resources
• Getting Started Guide with Apache Spark on Databricks
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Anthology of Technical Assets for Structured Streaming
• Databricks Engineering Blogs
• https://databricks.com/training/instructor-led-training
15% Discount Code: PyDataMiami
Go to databricks.com/training
Apache Spark Training from Databricks
Thank You!
jules@databricks.com
@2twitme
https://www.linkedin.com/in/dmatrix/
