Continuous Application
with
Apache® Spark™ 2.0
Jules S. Damji
Spark Community Evangelist
QCon SF, 11/10/2016
@2twitme
$ whoami
• Spark Community Evangelist @ Databricks
• Previously Developer Advocate @ Hortonworks
• In the past engineering roles at:
• Sun Microsystems, Netscape, @Home, VeriSign,
Scalix, Centrify, LoudCloud/Opsware, ProQuest
• jules@databricks.com
• https://www.linkedin.com/in/dmatrix
Introduction to Structured
Streaming
Streaming in Apache Spark
Streaming use cases demand new types of streaming capabilities…
SQL Streaming MLlib
Spark Core
GraphX
Functional, concise and expressive
Fault-tolerant state management
Unified stack with batch processing
More than 51% of users say it is the most important part of Apache Spark
Spark Streaming in production jumped to 22% from 14%
Streaming apps are
growing more complex
Streaming computations
don’t run in isolation
• Need to interact with batch data,
interactive analysis, machine learning, etc.
Use case: IoT Device Monitoring
IoT event stream from Kafka
ETL into long-term storage
- Prevent data loss
- Prevent duplicates
Status monitoring
- Handle late data
- Aggregate on windows on event time
Interactively debug issues
- Consistency
Anomaly detection
- Learn models offline
- Use online + continuous learning
Continuous Applications
Not just streaming any more
Continuous Application with Structured Streaming 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming at all
Static,
bounded table
Streaming,
unbounded table
Stream as an unbounded DataFrame
Single API!
Gist of Structured Streaming
High-level streaming API built on the Spark SQL engine
Runs the same computation as batch queries on Datasets / DataFrames
Event time, windowing, sessions, sources & sinks
Guarantees end-to-end exactly-once semantics
Unifies streaming, interactive and batch queries
Aggregate data in a stream, then serve using JDBC
Add, remove, change queries at runtime
Build and apply ML models to your stream
Advantages over DStreams
1. Processing with event time, dealing with late data
2. Exactly the same API for batch, streaming, and interactive
3. End-to-end exactly-once guarantees from the system
4. Performance through SQL optimizations
- Logical plan optimizations, Tungsten, Codegen, etc.
- Faster state management for stateful stream processing
Structured Streaming Model
Trigger: every 1 sec
[Diagram: at triggers 1, 2, 3 along the time axis, the append-only input table grows to "data up to 1/2/3", and the query runs over each snapshot]
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
- usual map/filter/reduce
- new window, session ops
Model (continued)
Trigger: every 1 sec
[Diagram: at each trigger, the result table is recomputed for "data up to 1/2/3", and in complete mode the full result table is written out each time]
Result: final operated table,
updated every trigger interval
Output: what part of the result to write
to the data sink after every trigger
Complete output: write the full result table every time
Model (continued)
Trigger: every 1 sec
[Diagram: same model, but in delta mode only the rows that changed since the previous trigger are written out]
Result: final operated table,
updated every trigger interval
Output: what part of the result to write
to the data sink after every trigger
Complete output: write the full result table every time
Delta output: write only the rows that changed
in the result since the previous batch
Append output: write only new rows
*Not all output modes are feasible with all queries
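The three output modes can be made concrete with a small pure-Python simulation; this is an illustration of the model, not Spark code, and the running word count it uses as the aggregation is purely illustrative:

```python
# Pure-Python simulation of Structured Streaming's output modes,
# using a running word count as the aggregation. Illustrative only.

def simulate_output_modes(batches):
    """batches[i] holds the new input rows (words) that arrived
    before trigger i; the input table is append-only.
    Returns one dict per trigger with all three output modes."""
    result = {}                       # result table: word -> count
    outputs = []
    for batch in batches:             # one iteration per trigger
        new_rows = {w for w in batch if w not in result}
        for word in batch:            # update the result table
            result[word] = result.get(word, 0) + 1
        outputs.append({
            "complete": dict(result),                     # full result table
            "delta": {w: result[w] for w in set(batch)},  # changed rows only
            "append": {w: result[w] for w in new_rows},   # brand-new rows only
        })
    return outputs

outs = simulate_output_modes([["cat", "dog"], ["dog"]])
# Trigger 1: all three modes emit {"cat": 1, "dog": 1}
# Trigger 2: complete = {"cat": 1, "dog": 2}, delta = {"dog": 2}, append = {}
```

The empty append output at trigger 2 hints at why not all output modes are feasible for all queries: an aggregate's rows keep changing after they first appear, so append mode would emit stale values.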
Example WordCount
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Batch ETL with DataFrame

inputDF = spark.read
  .format("json")
  .load("source-path")       // Read from JSON file

resultDF = inputDF
  .select("device", "signal")
  .where("signal > 15")      // Select some devices

resultDF.write
  .format("parquet")
  .save("dest-path")         // Write to Parquet file
Streaming ETL with DataFrame

input = spark.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

read…stream() creates a streaming DataFrame; it does not start any of the computation
write…startStream() defines where & how to output the data, and starts the processing
(This is the pre-release 2.0 API; in released Spark 2.0 these became spark.readStream…load() and result.writeStream…start().)
Streaming ETL with DataFrame

input = spark.read
  .format("json")
  .stream("source-path")

result = input
  .select("device", "signal")
  .where("signal > 15")

result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

[Diagram: the input is an append-only table; in append mode, only the new result rows ("new rows in result of 2", "new rows in result of 3") are written at each trigger]
Continuous Aggregations

input.groupBy("device-type")
  .avg("signal")

Continuously compute the average signal of each type of device

input.groupBy(
    window("event-time", "10 minutes"),
    "device-type")
  .avg("signal")

Continuously compute the average signal of each type of device in the last 10 minutes of event time
- Windowing is just a type of aggregation
- Simple API for event-time-based windowing
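To see why windowing is just a type of aggregation, here is a minimal pure-Python sketch (not Spark code): grouping by a 10-minute window is simply grouping by a bucketed event-time key alongside the device type. The event data and names are illustrative:

```python
from collections import defaultdict

def windowed_avg(events, window_minutes=10):
    """Average signal per (event-time window, device type).
    events: iterable of (event_time_in_minutes, device_type, signal).
    Grouping by window == grouping by a bucketed event-time key."""
    acc = defaultdict(lambda: [0.0, 0])      # (window_start, type) -> [sum, n]
    for t, dtype, signal in events:
        window_start = (t // window_minutes) * window_minutes
        acc[(window_start, dtype)][0] += signal
        acc[(window_start, dtype)][1] += 1
    return {key: s / n for key, (s, n) in acc.items()}

avgs = windowed_avg([(1, "thermostat", 10.0),
                     (4, "thermostat", 20.0),
                     (12, "thermostat", 30.0)])
# -> {(0, "thermostat"): 15.0, (10, "thermostat"): 30.0}
```

Because event time is a column of the data (not the arrival clock), late rows land in the correct window when they finally show up.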
Joining streams with static data

kafkaDataset = spark.read
  .kafka("iot-updates")
  .stream()

staticDataset = spark.read
  .jdbc("jdbc://", "iot-device-info")

joinedDataset = kafkaDataset.join(
  staticDataset, "device-type")

Join streaming data from Kafka with static data via JDBC to enrich the streaming data…
…without having to think that you are joining streaming data
Output Modes

Defines what is output every time there is a trigger
Different output modes make sense for different queries

input.select("device", "signal")
  .write
  .outputMode("append")
  .format("parquet")
  .startStream("dest-path")

Append mode with non-aggregation queries

input.agg(count("*"))
  .write
  .outputMode("complete")
  .format("parquet")
  .startStream("dest-path")

Complete mode with aggregation queries
Query Management

query = result.write
  .format("parquet")
  .outputMode("append")
  .startStream("dest-path")

query.stop()
query.awaitTermination()
query.exception()
query.sourceStatuses()
query.sinkStatus()

query: a handle to the running streaming computation, for managing it
- Stop it, wait for it to terminate
- Get status
- Get error, if terminated
Multiple queries can be active at the same time
Each query has a unique name for keeping track
Query Execution

Logically:
Dataset operations on a table
(i.e. as easy to understand as batch)

Physically:
Spark automatically runs the query in streaming fashion
(i.e. incrementally and continuously)

DataFrame → Logical Plan → Catalyst optimizer → Continuous, incremental execution
Batch/Streaming Execution on Spark SQL

SQL AST / DataFrame / Dataset
→ Unresolved Logical Plan
→ (Analysis, using the Catalog)
→ Logical Plan
→ (Logical Optimization)
→ Optimized Logical Plan
→ (Physical Planning)
→ Physical Plans
→ (Cost Model selects)
→ Selected Physical Plan
→ (Code Generation)
→ RDDs

A helluva lot of magic!
Continuous Incremental Execution

The Planner knows how to convert
streaming logical plans to a
continuous series of incremental
execution plans

[Diagram: DataFrame/Dataset → Logical Plan → Planner → Incremental Execution Plan 1, 2, 3, 4, …]
Structured Streaming: Recap
• High-level streaming API built on Datasets/DataFrames
• Event time, windowing, sessions, sources & sinks
• End-to-end exactly-once semantics
• Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Add, remove, change queries at runtime
• Build and apply ML models
Continuous Application with Structured Streaming 2.0
Demo & Workshop: Structured Streaming
• Import Notebook into your Spark 2.0 Cluster
• http://dbricks.co/sswksh3 (Demo)
• http://dbricks.co/sswksh4 (Workshop)
Resources
• docs.databricks.com
• Spark Programming Guide
• Structured Streaming Programming Guide
• Databricks Engineering Blogs
• sparkhub.databricks.com
• https://spark-packages.org/
Do you have any questions
for my prepared answers?