Easy, Scalable, Fault-tolerant
Stream Processing with
Structured Streaming
Spark Summit Europe 2017
25th October, Dublin
Tathagata “TD” Das
@tathadas
About Me
Started Spark Streaming project in AMPLab, UC Berkeley
Currently focused on building Structured Streaming
Member of the Apache Spark PMC
Software Engineer at Databricks
building robust
stream processing
apps is hard
Complexities in stream processing
COMPLEX DATA
Diverse data formats
(json, avro, binary, …)
Data can be dirty,
late, out-of-order
COMPLEX SYSTEMS
Diverse storage systems
(Kafka, S3, Kinesis, RDBMS, …)
System failures
COMPLEX WORKLOADS
Combining streaming with
interactive queries
Machine learning
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
you
should not have to
reason about streaming
you
should write simple queries
&
Spark
should continuously update the answer
Streaming word count
Anatomy of a Streaming Word Count
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy($"value".cast("string"))
.count()
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Source
• Specify one or more locations
to read data from
• Built-in support for
Files/Kafka/Socket;
pluggable.
• Can include multiple sources
of different types using
union()
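For reference, a fully self-contained version of the word count above might look like the sketch below. It is a sketch only: the broker address, topic names, and checkpoint path are assumptions, and it uses the explicit Trigger.ProcessingTime API where the slides use the shorthand .trigger("1 minute").

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
    import spark.implicits._

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
      .option("subscribe", "input")
      .load()
      .groupBy($"value".cast("string").as("key"))             // count per distinct message
      .count()
      .selectExpr("key", "CAST(`count` AS STRING) AS value")  // Kafka sink needs a value column
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .outputMode(OutputMode.Complete())
      .option("checkpointLocation", "/tmp/wordcount-checkpoint")  // assumed path
      .start()

    query.awaitTermination()
  }
}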
Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Transformation
• Using DataFrames,
Datasets and/or SQL.
• Catalyst figures out how to
execute the transformation
incrementally.
• Internal processing always
exactly-once.
DataFrames,
Datasets, SQL
input = spark.readStream
.format("kafka")
.option("subscribe", "topic")
.load()
result = input
.select("device", "signal")
.where("signal > 15")
result.writeStream
.format("parquet")
.start("dest-path")
[Logical plan: Read from Kafka → Project device, signal → Filter signal > 15 → Write to Parquet]
Spark automatically streamifies!
Spark SQL converts batch-like query to a series of incremental
execution plans operating on new batches of data
Series of incremental execution plans: the optimized physical plan (Kafka source → code-generated, off-heap operators → Kafka sink) is re-run on the new data available at each trigger t = 1, 2, 3.
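To make the "streamify" point concrete, here is a sketch of the same transformation written as a plain batch query; only the read and write entry points differ (the input path here is hypothetical), and Structured Streaming incrementalizes the identical logical plan when spark.readStream is used instead of spark.read.

val batchInput = spark.read.format("json").load("/data/devices")  // hypothetical path
val batchResult = batchInput
  .select("device", "signal")
  .where("signal > 15")
batchResult.write.format("parquet").save("dest-path")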
Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Sink
• Accepts the output of each
batch.
• When supported, sinks are
transactional and exactly-once
(e.g. the file sink).
• Use foreach to execute
arbitrary code.
Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Output mode – What's output
• Complete – Output the whole answer
every time
• Update – Output changed rows
• Append – Output new rows only
Trigger – When to output
• Specified as a time interval;
data-size based triggers may come later
• No trigger means as fast as possible
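As an illustration only (counts below is a hypothetical name for the aggregated streaming DataFrame built by the groupBy/agg above, and the console sink is used purely for demonstration), the output mode and trigger are both set on the writer, here with the explicit Trigger API:

import org.apache.spark.sql.streaming.Trigger

// Update mode: emit only the rows whose aggregates changed in this trigger.
counts.writeStream
  .format("console")
  .outputMode("update")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// No trigger specified: process new data as soon as the previous batch finishes.
counts.writeStream
  .format("console")
  .outputMode("complete")
  .start()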
Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Checkpoint
• Tracks the progress of a
query in persistent storage
• Can be used to restart the
query if there is a failure
Fault-tolerance with Checkpointing
Checkpointing – tracks progress
(offsets) of consuming data from
the source and intermediate state.
Offsets and metadata saved as JSON
Can resume after changing your
streaming transformations
end-to-end
exactly-once
guarantees
[Diagram: at each trigger t = 1, 2, 3, new data is processed and its offsets are recorded in a write-ahead log]
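A sketch of what recovery looks like in practice: rerunning the same writeStream with the same checkpointLocation resumes from the saved offsets and state rather than starting over (broker, topic, path, and the counts DataFrame below are assumptions carried over from the earlier sketches).

val restarted = counts.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/wordcount-checkpoint")  // same location => resume, not restart from scratch
  .start()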
4x lower cost
Structured Streaming reuses
the Spark SQL Optimizer
and Tungsten Engine.
Performance: Benchmark
[Chart: system throughput. Kafka Streams: 700K records/s, Apache Flink: 15M records/s, Structured Streaming: 65M records/s]
Read more details in our blog post
Complex Streaming ETL
Traditional ETL
Raw, dirty, un/semi-structured data is dumped as files
Periodic jobs run every few hours to convert raw data
to structured data ready for further analytics
[Diagram: a file dump of raw data lands within seconds, but the structured table is only ready hours later]
Traditional ETL
Hours of delay before taking decisions on latest data
Unacceptable when time is of essence
[intrusion detection, anomaly detection, etc.]
Streaming ETL w/ Structured Streaming
Structured Streaming enables raw data to be available
as structured data as soon as possible
[Diagram: the structured table is ready within seconds]
Streaming ETL w/ Structured Streaming
Example
Json data being received in Kafka
Parse nested json and flatten it
Store in structured Parquet table
Get end-to-end failure guarantees
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
val parsedData = rawData
.selectExpr("cast (value as string) as json"))
.select(from_json("json", schema).as("data"))
.select("data.*")
val query = parsedData.writeStream
.option("checkpointLocation", "/checkpoint")
.partitionBy("date")
.format("parquet")
.start("/parquetTable")
Reading from Kafka
Specify options to configure
How?
kafka.bootstrap.servers => broker1,broker2
What?
subscribe => topic1,topic2,topic3 // fixed list of topics
subscribePattern => topic* // dynamic list of topics
assign => {"topicA":[0,1] } // specific partitions
Where?
startingOffsets => latest(default) / earliest / {"topicA":{"0":23,"1":345} }
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
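For completeness, sketches of the other subscription styles listed above (broker addresses and topic names are hypothetical):

val byPattern = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribePattern", "topic.*")     // dynamic list of topics
  .option("startingOffsets", "earliest")     // read from the beginning of each topic
  .load()

val byPartition = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("assign", """{"topicA":[0,1]}""")  // only these partitions
  .load()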
Reading from Kafka
val rawData = spark.readStream
.format("kafka")
.option("kafka.boostrap.servers",...)
.option("subscribe", "topic")
.load()
The rawData DataFrame has the following columns:
key      | value    | topic    | partition | offset | timestamp
[binary] | [binary] | "topicA" | 0         | 345    | 1486087873
[binary] | [binary] | "topicB" | 3         | 2890   | 1486086721
Transforming Data
Cast binary value to string
Name it column json
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
from_json("json") as "data" parses each JSON string, e.g.
{ "timestamp": 1486087873, "device": "devA", …}
{ "timestamp": 1486082418, "device": "devX", …}
into a nested data column with fields timestamp, device, and so on.
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
Flatten the nested columns
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
select("data.*") flattens the nested data struct, so timestamp, device, … become top-level (not nested) columns.
Transforming Data
Cast binary value to string
Name it column json
Parse json string and expand into
nested columns, name it data
Flatten the nested columns
val parsedData = rawData
.selectExpr("cast (value as string) as json")
.select(from_json("json", schema).as("data"))
.select("data.*")
powerful built-in APIs to
perform complex data
transformations
from_json, to_json, explode, ...
100s of functions
(see our blog post)
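A small illustration of a couple of those functions applied to the parsedData stream above; the readings array column and the output names are hypothetical, and spark is the active SparkSession.

import org.apache.spark.sql.functions._
import spark.implicits._

val exploded = parsedData
  .withColumn("reading", explode($"readings"))   // hypothetical array column: one output row per element
  .select("device", "timestamp", "reading")

val backToJson = exploded
  .select(to_json(struct($"device", $"timestamp", $"reading")).as("value"))  // re-serialize rows as JSON strings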
Writing to Parquet
Save parsed data as Parquet
table in the given path
Partition files by date so that
future queries on time slices of
data are fast
e.g. query on last 48 hours of data
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.partitionBy("date")
.format("parquet")
.start("/parquetTable")
Checkpointing
Enable checkpointing by
setting the checkpoint
location to save offset logs
start actually starts a
continuously running
StreamingQuery in the
Spark cluster
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable/")
Streaming Query
query is a handle to the continuously
running StreamingQuery
Used to monitor and manage the
execution
val query = parsedData.writeStream
.option("checkpointLocation", ...)
.format("parquet")
.partitionBy("date")
.start("/parquetTable")/")
[Diagram: the returned StreamingQuery keeps processing new data at every trigger t = 1, 2, 3]
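A few of the operations available on that handle (a sketch; the println calls are just for illustration):

println(query.status)        // whether the query is actively processing or waiting for data
println(query.lastProgress)  // metrics of the most recent completed trigger, as JSON
query.explain()              // print the current incremental physical plan

// Stop the query gracefully (e.g. from a management thread) ...
query.stop()

// ... or block the current thread until it stops or fails:
// query.awaitTermination()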
Data Consistency on Ad-hoc Queries
Data available for complex, ad-hoc analytics within seconds
Parquet table is updated atomically, ensures prefix integrity
Even if distributed, ad-hoc queries will see either all updates from
streaming query or none, read more in our blog
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
[Diagram: complex, ad-hoc queries run on the latest data, available within seconds]
More Kafka Support
Write out to Kafka
DataFrame must have string or binary
fields named key and value
Direct, interactive and batch
queries on Kafka
Makes Kafka even more powerful
as a storage platform!
Added to Spark 2.2
result.writeStream
.format("kafka")
.option("topic", "output")
.start()
val df = spark
.read // not readStream
.format("kafka")
.option("subscribe", "topic")
.load()
df.registerTempTable("topicData")
spark.sql("select value from topicData")
Amazon Kinesis
Configure with options (similar to Kafka)
Available with Databricks Runtime
How?
region => us-west-2 / us-east-1 / ...
awsAccessKey (optional) => AKIA...
awsSecretKey (optional) => ...
What?
streamName => name-of-the-stream
Where?
initialPosition => latest(default) / earliest / trim_horizon
spark.readStream
.format("kinesis")
.option("streamName", "myStream")
.option("region", "us-west-2")
.option("awsAccessKey", ...)
.option("awsSecretKey", ...)
.load()
Working With Time
Event Time
Many use cases require aggregate statistics by event time
E.g. what's the number of errors in each system in each 1-hour window?
Many challenges
Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event-time processing
Event time Aggregations
Windowing is just another type of grouping in Structured Streaming
number of records every hour
Supports UDAFs!
parsedData
.groupBy(window("timestamp","1 hour"))
.count()
parsedData
.groupBy(
"device",
window("timestamp","10 mins"))
.avg("signal")
avg signal strength of each
device every 10 mins
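Windows can also slide. A sketch: a 10-minute window computed every 5 minutes, so each event falls into two overlapping windows (parsedData and spark are assumed from earlier).

import org.apache.spark.sql.functions.window
import spark.implicits._

val slidingAvg = parsedData
  .groupBy($"device", window($"timestamp", "10 minutes", "5 minutes"))
  .avg("signal")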
Stateful Processing for Aggregations
Aggregates have to be saved as
distributed state between triggers
Each trigger reads previous state and
writes updated state
State stored in memory,
backed by write ahead log in HDFS/S3
Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger t = 1, 2, 3, the query reads new data from the source, combines it with the previous state, writes results to the sink, and writes state updates to a write-ahead log for checkpointing]
Automatically handles Late Data
[Diagram: hourly window counts from 12:00 to 17:00 evolving across successive triggers. Keeping state allows late data to update the counts of old windows (entries updated with late data highlighted in red).]
But size of the state increases indefinitely
if old windows are not dropped
Watermarking
Watermark - moving threshold of
how late data is expected to be
and when to drop old state
Trails behind max event time
seen by the engine
Watermark delay = trailing gap
[Diagram: the watermark (e.g. 12:20 PM) trails the max event time seen by the engine (e.g. 12:30 PM) by a trailing gap of 10 mins; data older than the watermark is not expected]
Watermarking
Data newer than watermark may
be late, but allowed to aggregate
Data older than watermark is
"too late" and dropped
Windows older than watermark
automatically deleted to limit the
amount of intermediate state
Watermarking
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
Useful only in stateful operations
Ignored in non-stateful streaming
queries and batch queries
[Diagram: with a 10 min watermark delay, late data newer than the watermark is still aggregated; data older than the watermark is too late and dropped]
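One practical consequence, sketched below under assumed paths: with a watermark, windowed aggregates can be written in append mode (each window is emitted once, after the watermark passes its end), which is what file sinks such as Parquet require.

import org.apache.spark.sql.functions.window
import spark.implicits._

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
  .writeStream
  .format("parquet")
  .outputMode("append")                                           // emit finalized windows only
  .option("checkpointLocation", "/tmp/windowedCounts-checkpoint") // assumed path
  .start("/data/windowedCounts")                                  // assumed path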
Watermarking
[Diagram: records arriving over processing time 12:00 to 12:15 with event times between 12:04 and 12:18. The system tracks the max observed event time (12:14), so the watermark for the next trigger becomes 12:14 - 10 min = 12:04 and state for windows before 12:04 is deleted. A record with event time 12:08 arrives late but is newer than the watermark, so it is still counted; a record with event time 12:04 is too late, ignored in the counts, and its state dropped.]
More details in my blog post
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
Clean separation of concerns
parsedData
.withWatermark("timestamp", "10 minutes")
.groupBy(window("timestamp","5 minutes"))
.count()
.writeStream
.trigger("10 seconds")
.start()
Query Semantics, separated from Processing Details
Query Semantics: How to group data by time? (same for batch & streaming)
Processing Details: How late can data be? How often to emit updates?
Other Interesting Operations
Streaming Deduplication
Watermarks to limit state
Joins
Stream-batch joins supported,
stream-stream joins coming in 2.3
Arbitrary Stateful Processing
[map|flatMap]GroupsWithState
parsedData.join(batchData, "device")
parsedData.dropDuplicates("eventId")
[See my other talk at 4:20 PM today for a deep dive into stateful ops]
ds.groupByKey(_.id)
.mapGroupsWithState
(timeoutConf)
(mappingWithStateFunc)
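A hedged sketch of mapGroupsWithState: keep a running per-device event count as state, with a processing-time timeout so state for idle devices is eventually dropped. Event and DeviceCount are hypothetical case classes, and parsedData plus spark.implicits._ are assumed from earlier.

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(device: String, signal: Double)
case class DeviceCount(device: String, count: Long)

val runningCounts = parsedData.as[Event]
  .groupByKey(_.device)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout) {
    (device: String, events: Iterator[Event], state: GroupState[Long]) =>
      val newCount = state.getOption.getOrElse(0L) + events.size
      state.update(newCount)               // persisted across triggers, backed by the checkpoint
      state.setTimeoutDuration("1 hour")   // state for idle devices is dropped after 1 hour
      DeviceCount(device, newCount)
  }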
Monitoring Streaming Queries
Get last progress of the
streaming query
Current input and processing rates
Current processed offsets
Current state metrics
Get progress asynchronously
by registering your own
StreamingQueryListener
new StreamingQueryListener {
def onQueryStarted(...)
def onQueryProgress(...)
def onQueryTerminated(...)
}
streamingQuery.lastProgress()
{ ...
"inputRowsPerSecond" : 10024.225210926405,
"processedRowsPerSecond" : 10063.737001006373,
"durationMs" : { ... },
"sources" : [ ... ],
"sink" : { ... }
...
}
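A sketch of registering such a listener on the active SparkSession so progress events are pushed asynchronously; the println bodies are placeholders for real monitoring logic.

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Input rows/sec: ${event.progress.inputRowsPerSecond}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
})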
Dropwizard Metrics
Metrics into Ganglia, Graphite, etc.
Enabled using SQL configuration
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
Building Complex
Continuous Apps
Metric Processing @ Databricks
Dashboards: Analyze trends in usage as they occur
Alerts: Notify engineers of critical issues
Ad-hoc Analysis: Diagnose issues when they occur
ETL: Clean, normalize and store historical data
Events generated by user actions (logins, clicks, spark job updates)
Metric Processing @ Databricks
Dashboards
Alerts
Ad-hoc Analysis
ETL
Difficult with only streaming frameworks
Limited retention in
streaming storage
Inefficient for ad-hoc queries
Hard for novice users
(limited SQL support)
Metric Processing @ Databricks
[Architecture diagram: Metrics → Filter / ETL streaming jobs (Structured Streaming + Databricks Delta) → Dashboards, Ad-hoc Analysis, Alerts]
Read from Kafka
val rawLogs = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", ...)
.option("subscribe", "rawLogs")
.load()
val augmentedLogs = rawLogs
.withColumn("msg",
from_json($"value".cast("string"),
schema))
.select("timestamp", "msg.*")
.join(spark.table("customers"), Seq("customer_id"))
DataFrames can be
reused for multiple
streams
Can build libraries of
useful DataFrames and
share code between
applications
JSON ETL
Write to Databricks Delta
augmentedLogs
.repartition(1)
.writeStream
.format("delta")
.option("path", "/data/metrics")
.trigger("1 minute")
.start()
Store augmented stream as efficient
columnar data for later processing
Latency: ~1 minute
Buffer data and
write one large file
every minute for
efficient reads
Dashboards
logins = spark.readStream.parquet("/data/metrics")
.where("metric = 'login'")
.groupBy(window("timestamp", "1 minute"))
.count()
display(logins) // visualize in Databricks notebooks
Always up-to-date visualizations of
important business trends
Latency: ~1 minute to hours (configurable)
Filter and write to Kafka
filteredLogs = augmentedLogs
.where("eventType = 'clusterHeartbeat'")
.selectExpr("to_json(struct("*")) as value")
filteredLogs.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", ...)
.option("topic", "clusterHeartbeats")
.start()
Forward filtered and augmented
events back to Kafka
Latency: ~100 ms average
Filter
to_json() to convert
columns back into json
string, and then save as
different Kafka topic
Simple Alerts
sparkErrors
.as[ClusterHeartBeat]
.filter(_.load > 99)
.writeStream
.foreach(new PagerdutySink(credentials))
E.g. Alert when Spark cluster load > threshold
Latency: ~100 ms
Alerts
notify PagerDuty
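A hedged sketch of what a sink class like the PagerdutySink above could look like: a ForeachWriter invoked once per output row. The ClusterHeartBeat fields (id, load), the credentials parameter, and the alerting call are assumptions; the real PagerDuty API call is left as a stub.

import org.apache.spark.sql.ForeachWriter

case class ClusterHeartBeat(id: Int, load: Double)   // assumed shape of the heartbeat events

class PagerdutySink(credentials: String) extends ForeachWriter[ClusterHeartBeat] {
  def open(partitionId: Long, version: Long): Boolean = true   // no per-partition setup needed
  def process(record: ClusterHeartBeat): Unit = {
    // stub: POST an alert to PagerDuty using `credentials`
    println(s"ALERT cluster=${record.id} load=${record.load}")
  }
  def close(errorOrNull: Throwable): Unit = ()                 // nothing to release
}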
Complex Alerts
sparkErrors
.as[ClusterHeartBeat]
.groupByKey(_.id)
.flatMapGroupsWithState(Update, EventTimeTimeout) {
(id: Int, events: Iterator[ClusterHeartBeat], state: GroupState[ClusterState]) =>
... // check if cluster non-responsive for a while
}
E.g. Monitor health of Spark clusters
using custom stateful logic
Latency: ~10 seconds
Alerts
react if no heartbeat
from cluster for 1 min
Ad-hoc Analysis
SELECT *
FROM delta.`/data/metrics`
WHERE level IN ('WARN', 'ERROR')
AND customer = "…"
AND timestamp < now() - INTERVAL 1 HOUR
Troubleshoot problems as they
occur with the latest information
Latency: ~1 minute
Ad-hoc
Analysis
will read latest data
when query executed
Metric Processing @ Databricks
[Architecture diagram recap: Metrics → Filter / ETL → Dashboards, Ad-hoc Analysis, Alerts]
14+ billion records / hour
with 10 nodes
meet diverse latency requirements
as efficiently as possible
Structured Streaming @ Databricks
100s of customer streaming apps
in production on Databricks
Largest app processes 10s of trillions
of records per month
Future Direction: Continuous Processing
Continuous processing mode to run without micro-batches
<1 ms latency (same as per-record streaming systems)
No changes to user code
Proposal in SPARK-20928, expected in Spark 2.3
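For context, a hedged sketch of how this eventually surfaced (at the time of this talk it was still a proposal): in Spark 2.3+ the same query can opt in with Trigger.Continuous, where the interval is a checkpoint interval rather than a micro-batch interval. Topic, broker, and path below are assumptions, and parsedData is assumed from earlier.

import org.apache.spark.sql.streaming.Trigger

parsedData
  .selectExpr("to_json(struct(*)) AS value")            // Kafka sink expects a value column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/continuous-checkpoint")
  .trigger(Trigger.Continuous("1 second"))              // checkpoint interval, not a batch interval
  .start()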
More Info
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks blog posts for more focused discussions
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html
and more to come, stay tuned!!
UNIFIED ANALYTICS PLATFORM
Try Apache Spark in Databricks!
• Collaborative cloud environment
• Free version (community edition)
DATABRICKS RUNTIME 3.0
• Apache Spark - optimized for the cloud
• Caching and optimization layer - DBIO
• Enterprise security - DBES
Try for free today
databricks.com