Building Streaming Applications with Apache Spark™
How to Use Structured Streaming to Build Complex Continuous Applications
Special thanks to the contributions of Michael Armbrust, Bill Chambers, Tyson Condie,
Jules Damji, Tathagata Das, Kunal Khamar, Reynold Xin, Burak Yavuz, and Matei Zaharia
to this ebook.
About Databricks
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the team who started the Apache Spark™ project, Databricks
provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster time-to-value with
Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus on their data by providing a fully
managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership. Databricks, venture-backed by Andreessen Horowitz and NEA, has a global
customer base that includes Salesforce, Viacom, Amgen, Shell and HP. For more information, visit www.databricks.com.
Databricks
160 Spear Street, 13th Floor
San Francisco, CA 94105
Contact Us
© Databricks 2018. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
Table of Contents
Introduction
Conclusion
Introduction
Since its release, Apache Spark Streaming has become one of the most widely used distributed streaming engines, thanks to its high-level API and exactly-once semantics. Most streaming engines focus on performing computations on a stream; in practice, however, stream processing happens as part of a larger application, which we’ll call a continuous application.

We define a continuous application as an end-to-end application that reacts to data in real-time. Structured Streaming is a high-level API originally contributed to Apache Spark 2.0 to support continuous applications and was recently improved upon in the release of Apache Spark 2.3. Structured Streaming incorporates the idea of continuous applications to provide a number of features that no other streaming engine offers: strong guarantees about consistency with batch jobs, transactional integration with storage systems, and tight integration with the rest of Spark.

At Databricks, we’ve worked with thousands of users to understand how to simplify real-time applications. This ebook provides an overview of Structured Streaming and explores how we are using the new features of Apache Spark 2.1 and 2.2 to overcome the primary challenges of building continuous applications and building our own production pipelines. Highlights include how to use Structured Streaming to:

• Easily build an end-to-end streaming ETL pipeline;
• Solve complex data transformation challenges;
• Perform monitoring and alerting;
• Consume and transform complex data streams with Spark and Kafka;
• Easily process streaming aggregations; and
• Better manage resources for incremental processing of data.
Part 1: An Introduction to Structured Streaming
Structured Streaming In Apache Spark
A new high-level API for streaming
July 28, 2016 | by Matei Zaharia, Tathagata Das, Michael Armbrust and Reynold Xin

Try this notebook in Databricks: Scala Notebook, Python Notebook

Apache Spark 2.0 adds the first version of a new higher-level API, Structured Streaming, for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way. In this post, we explain why this is hard to do with current distributed streaming engines, and introduce Structured Streaming.

For example, suppose we want to maintain a running count of how many actions of each type happened each hour, then store the result in MySQL. If we were running this application as a batch job and had a table with all the input events, we could express it as the following SQL query:

SELECT action, WINDOW(time, "1 hour"), COUNT(*)
FROM events
GROUP BY action, WINDOW(time, "1 hour")

In a distributed streaming engine, we might set up nodes to process the data in a “map-reduce” pattern, as shown below. Each node in the first layer reads a partition of the input data (say, the stream from one set of phones), then hashes the events by (action, hour) to send them to a reducer node, which tracks that group’s count and periodically updates MySQL.
…Structured Streaming also supports APIs for filtering out overly old data if the user wants. But fundamentally, out-of-order data is not a “special case”: the query says to group by a time field, and seeing an old time is no different from seeing a repeated action.

The last benefit of Structured Streaming is that the API is very easy to use: it is simply Spark’s DataFrame and Dataset API. Users just describe the query they want to run, the input and output locations, and optionally a few more details. The system then runs their query incrementally, maintaining enough state to recover from failure, keep the results consistent in external storage, etc. For example, here is how to write our streaming monitoring application:

  .writeStream.format("jdbc")
  .save("jdbc:mysql//…")

The next sections explain the model in more detail, as well as the API.

Model Details

Conceptually, Structured Streaming treats all the data arriving as an unbounded input table. Each new item in the stream is like a row appended to the input table. We won’t actually retain all the input, but our results will be equivalent to having all of it and running a batch job.
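Only the tail of that code listing survives above. The following is a minimal PySpark sketch of the kind of monitoring query the text describes; the schema, input path, and query name are assumptions, and it writes to an in-memory sink because Spark ships no streaming JDBC sink (the original example targets MySQL conceptually).

from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StringType, TimestampType

# Hypothetical schema for the (action, time) events used in this example
event_schema = StructType().add("action", StringType()).add("time", TimestampType())

inputDF = spark.readStream.schema(event_schema).json("/data/events")   # assumed input path

# Running count of actions per (action, 1-hour window)
counts = inputDF.groupBy(inputDF.action, window(inputDF.time, "1 hour")).count()

query = (counts.writeStream
    .outputMode("complete")
    .format("memory")              # stand-in sink for the sketch
    .queryName("action_counts")
    .start())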
Each time the result table is updated, we want to write the changes to an external sink. Structured Streaming supports several output modes that control what is written on each trigger. For example:

• Update: Only the rows that were updated in the result table since the last trigger will be changed in the external storage. This mode works for output sinks that can be updated in place, such as a MySQL table.
Let’s see how we can run our mobile monitoring application in this
model. Our batch query is to compute a count of actions grouped by
(action, hour). To run this query incrementally, Spark will maintain
some state with the counts for each pair so far, and update when new
records arrive. For each record changed, it will then output data
according to its output mode. The figure below shows this execution
using the Update output mode:
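In code, requesting the Update mode is a one-line change on the writer; a minimal sketch (sink and names are placeholders, not the original example):

from pyspark.sql.functions import window

# `inputDF` is the streaming events DataFrame from the example (assumed)
counts = inputDF.groupBy("action", window("time", "1 hour")).count()

query = (counts.writeStream
    .outputMode("update")          # emit only rows that changed since the last trigger
    .format("console")             # placeholder sink that supports the update mode
    .start())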
Note that the system also automatically handles late data. In the figure above, the “open” event for phone3, which happened at 1:58 on the phone, only gets to the system at 2:02. Nonetheless, even though it’s past 2:00, we update the record for 1:00 in MySQL. However, the prefix integrity guarantee in Structured Streaming ensures that we process records from each source in the order they arrive.

3. We found that most Spark applications already use sinks and sources with these properties, because users want their jobs to be reliable.
Structured Streaming programs can use DataFrame and Dataset’s existing methods to transform data, including map, filter, select, and others. In addition, running (or infinite) aggregations, such as a count from the beginning of time, are available through the existing APIs. This is what we used in our monitoring application above.

Windowed Aggregations on Event Time

Streaming applications often need to compute data on various types of windows, including sliding windows, which overlap with each other (e.g. a 1-hour window that advances every 5 minutes), and tumbling windows, which do not (e.g. just every hour). In Structured Streaming, windowing is simply represented as a group-by. Each input event can be mapped to one or more windows, and simply results in updating one or more result table rows.

Windows can be specified using the window function in DataFrames. For example, we could change our monitoring job to count actions by sliding windows as follows:

inputDF.groupBy($"action", window($"time", "1 hour", "5 minutes"))
  .count()

Unlike in many other systems, windowing is not just a special operator for streaming computations; we can run the same code in a batch job to group data in the same way.

Windowed aggregation is one area where we will continue to expand Structured Streaming. In particular, in Spark 2.1, we plan to add watermarks, a feature for dropping overly old data when sufficient time has passed. Without this type of feature, the system might have to track state for all old windows, which would not scale as the application runs. In addition, we plan to add support for session-based windows, i.e. grouping the events from one source into variable-length sessions according to business logic.

Joining Streams with Static Data

Because Structured Streaming simply uses the DataFrame API, it is straightforward to join a stream against a static DataFrame, such as an Apache Hive table:

// Bring in data about each customer from a static "customers" table,
// then join it with a streaming DataFrame
val customersDF = spark.table("customers")
inputDF.join(customersDF, "customer_id")
  .groupBy($"customer_name", hour($"time"))
  .count()
Interactive Queries
Structured Streaming can expose results directly to interactive queries
through Spark’s JDBC server. In Spark 2.0, there is a rudimentary
“memory” output sink for this purpose that is not designed for large
data volumes. However, in future releases, this will let you write query
results to an in-memory Spark SQL table, and run queries directly
against it.
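As a small illustrative sketch of the “memory” sink mentioned above (the DataFrame and names are assumptions):

# `counts` is a streaming aggregation DataFrame (assumed)
query = (counts.writeStream
    .format("memory")              # rudimentary in-memory sink; not intended for large data volumes
    .queryName("action_counts")    # becomes the name of the in-memory Spark SQL table
    .outputMode("complete")
    .start())

spark.sql("SELECT * FROM action_counts ORDER BY `count` DESC").show()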
Read More
In addition, the following resources cover Structured Streaming:
Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
Part 1 of Scalable Data @ Databricks
January 19, 2017 | by Tathagata Das, Michael Armbrust and Tyson Condie

…and automatically mitigated to ensure highly available insights are delivered in real-time. Raw data also arrives in a wide variety of formats (CSV, JSON, Avro, etc.) that often must be restructured, transformed and augmented before being consumed. Such restructuring requires that all the traditional tools from batch processing systems are available, but without the added latencies that they typically entail.
Try this notebook in Databricks: Scala Notebook, Python Notebook
In this first post, we will focus on an ETL pipeline that converts raw AWS CloudTrail audit logs into a JIT data warehouse for faster ad-hoc queries. We will show how easy it is to take an existing batch ETL job and subsequently productize it as a real-time streaming pipeline using Structured Streaming in Databricks. Using this pipeline, we have converted 3.8 million JSON files containing 7.9 billion records into a Parquet table, which allows us to run ad-hoc queries on the updated-to-the-minute Parquet table 10x faster than on the raw JSON files.

The Need for Streaming ETL

Extract, Transform, and Load (ETL) pipelines prepare raw, unstructured data into a form that can be queried easily and efficiently. Specifically, they need to be able to do the following:

• Convert to a more efficient storage format – Text, JSON and CSV data are easy to generate and are human readable, but are very expensive to query. Converting them to a more efficient format like Parquet can greatly reduce query time.

• Partition data by important columns – By partitioning the data based on the value of one or more columns, common queries can be answered more efficiently by reading only the relevant fraction of the total dataset.

Traditionally, ETL is performed as periodic batch jobs. For example, dump the raw data in real time, and then convert it to structured form every few hours to enable efficient queries. We had initially set up our system this way, but this technique incurred a high latency; we had to wait a few hours before getting any insights. For many use cases, this delay is unacceptable. When something suspicious is happening in an account, we need to be able to ask questions immediately. Waiting minutes to hours could result in an unreasonable delay in responding to an incident.

In the rest of the post, we dive into the details of how we transform AWS CloudTrail audit logs into an efficient, partitioned, Parquet data warehouse. AWS CloudTrail allows us to track all actions performed in our AWS accounts by delivering audit logs as compressed JSON files to an S3 bucket.
However, in their original form, they are very costly to query, even with the capabilities of Apache Spark. To enable rapid insight, we run a Continuous Application that transforms the raw JSON logs files into an optimized Parquet table. Let’s dive in and look at how to write this pipeline. If you want to see the full code, here are the Scala and Python notebooks. Import them into Databricks and run them yourselves.

A good way to understand what this rawRecords DataFrame represents is to first understand the Structured Streaming programming model. The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
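The snippet that defines rawRecords does not appear above; a minimal sketch of how such a streaming DataFrame is typically created (the path, schema variable, and option value are assumptions) is:

# Stream new CloudTrail JSON files from S3 as they arrive
rawRecords = (spark.readStream
    .schema(cloudTrailSchema)              # a schema must be supplied for streaming file sources
    .option("maxFilesPerTrigger", 100)     # how many new files to pick up per trigger
    .json("s3://cloudtrail-logs/AWSLogs/"))   # hypothetical input path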
Here, we explode (split) the array of records loaded from each file into separate records. We also parse the string event time in each record to Spark’s timestamp type, and flatten out the nested columns for easier querying. Note that if cloudtrailEvents was a batch DataFrame on a fixed set of files, then we would have written the same query, and we would have written the results only once as parsed.write.parquet("/cloudtrail"). Instead, we will start a StreamingQuery that runs continuously to transform new data as it arrives.

You can read more details about the Structured Streaming model, and its advantages over other streaming engines, in our previous blog.

In the streaming query, we:

• Write the transformed data from the parsed DataFrame as a Parquet-formatted table at the path /cloudtrail.

• Partition the Parquet table by date so that we can later efficiently query time slices of the data; a key requirement in monitoring applications.

• Save checkpoint information at the path checkpoints/cloudtrail for fault-tolerance (explained later in the blog).
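The writeStream call these bullets describe is not reproduced above; a minimal sketch consistent with them (variable and column names are assumptions, paths follow the bullets) is:

streamingETLQuery = (parsed.writeStream
    .format("parquet")
    .partitionBy("date")                                      # partition the table by date
    .option("path", "/cloudtrail")                            # Parquet table location
    .option("checkpointLocation", "checkpoints/cloudtrail")   # checkpoint path for fault-tolerance
    .start())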
This checkpoint directory is per query, and while a query is active, Spark continuously writes metadata of the processed data to the checkpoint directory. Even if the entire cluster fails, the query can be restarted on a new cluster, using the same checkpoint directory, and it will resume from exactly where it left off.

…new data is appended to the Result Table, which then gets written out as Parquet files. Furthermore, while this streaming query is running, you can use Spark SQL to simultaneously query the Parquet table. The streaming query writes the Parquet data transactionally, such that concurrent readers will always see a consistent view of the data.
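As a small illustration of such a concurrent ad-hoc query (the column name is an assumption about the CloudTrail schema):

# Ad-hoc query over the continuously updated Parquet table, while the stream keeps running
(spark.read.parquet("/cloudtrail")
    .groupBy("eventName")      # hypothetical CloudTrail column
    .count()
    .show())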
Monitoring, Alerting and Upgrading

For a Continuous Application to run smoothly, it must be robust to individual machine or even whole cluster failures. In Databricks, we have developed tight integration with Structured Streaming that allows us to continuously monitor your StreamingQueries for failures and automatically restart them. All you have to do is create a new Job, and configure the Job retry policy. You can also configure the job to send emails to notify you of failures.

Combining Live Data with Historical/Batch Data

Many applications require historical/batch data to be combined with live data. For example, besides the incoming audit logs, we may already have a large backlog of logs waiting to be converted. Ideally, we would like to achieve both: interactively query the latest data as soon as possible, and also have access to historical data for future analysis. It is often complex to set up such a pipeline using most existing systems, as you would have to set up multiple processes: a batch job to convert the historical data, a streaming pipeline to convert the live data, and maybe another step to combine the results.
Working with Complex Data Formats with Structured Streaming in Apache Spark 2.1
February 23, 2017 | by Burak Yavuz, Michael Armbrust, Tathagata Das and Tyson Condie

Try this notebook in Databricks: Scala Notebook, Python Notebook, SQL Notebook

In part 1 of this series on Structured Streaming blog posts, we demonstrated how easy it is to write an end-to-end streaming ETL pipeline using Structured Streaming that converts JSON CloudTrail logs into a Parquet table. The blog highlighted that one of the major challenges in building such pipelines is to read and transform data from various sources and complex formats. In this blog post, we are going to examine this problem in further detail, and show how Apache Spark SQL’s built-in functions can be used to solve all your data transformation challenges. In particular, we will cover:

• What are the different data formats and their tradeoffs
• How to work with them easily using Spark SQL
• How to choose the right final format for your use case

Data can be expressed in XML, CSV, TSV; application metrics can be written out in raw text or JSON. Every use case has a particular data format tailored for it. In the world of Big Data, we commonly come across formats like Parquet, ORC, Avro, JSON, CSV, SQL and NoSQL data sources, and plain text files. We can broadly classify these data formats into three categories: structured, semi-structured, and unstructured data. Let’s try to understand the benefits and shortcomings of each category.
Structured data

Structured data sources define a schema on the data. With this extra bit of information about the underlying data, structured data sources provide efficient storage and performance. For example, columnar formats such as Parquet and ORC make it much easier to extract values from a subset of columns. Reading each record row by row first, then extracting the values from the specific columns of interest can read much more data than what is necessary when a query is only interested in a small fraction of the columns. A row-based storage format such as Avro efficiently serializes and stores data providing storage benefits. However, these advantages often come at the cost of flexibility. For example, because of rigidity in structure, evolving a schema can be challenging.

Semi-structured data

Semi-structured data sources are structured per record but don’t necessarily have a well-defined global schema spanning all records. As a result, each data record is augmented with its schema information. JSON and XML are popular examples. The benefits of semi-structured data formats are that they provide the most flexibility in expressing your data as each record is self-describing. These formats are very common across many applications as many lightweight parsers exist for dealing with these records, and they also have the benefit of being human readable. However, the main drawback for these formats is that they incur extra parsing overheads, and are not particularly built for ad-hoc querying.

Unstructured data

By contrast, unstructured data sources are generally free-form text or binary objects that contain no markup, or metadata (e.g., commas in CSV files), to define the organization of data. Newspaper articles, medical records, image blobs, application logs are often treated as unstructured data. These sorts of sources generally require context around the data to be parseable. That is, you need to know that the file is an image or is a newspaper article. Most sources of data are unstructured. The cost of having unstructured formats is that it becomes cumbersome to extract value out of these data sources as many transformations and feature extraction techniques are required to interpret these datasets.

Interchanging data formats with Spark SQL

In our previous blog post, we discussed how transforming Cloudtrail Logs from JSON into Parquet shortened the runtime of our ad-hoc queries by 10x. Spark SQL allows users to ingest data from these classes of data sources, both in batch and streaming queries. It natively supports reading and writing data in Parquet, ORC, JSON, CSV, and text format and a plethora of other connectors exist on Spark Packages. You may also connect to SQL databases using the JDBC DataSource.

Apache Spark can be used to interchange data formats as easily as:
events = (spark.readStream
  .format("json")          # or parquet, kafka, orc...
  .option(...)             # format-specific options
  .schema(my_schema)       # required
  .load("path/to/data"))

# write out your data
(output.writeStream
  .format("parquet")
  .start("path/to/write"))

Whether batch or streaming data, we know how to read and write to different data sources and formats, but different sources support different kinds of schema and data types. Traditional databases only support primitive data types, whereas formats like JSON allow users to nest objects within columns, have an array of values or represent a set of key-value pairs. Users will generally have to go in-between these data types to efficiently store and represent their data. Fortunately, Spark SQL makes it easy to handle both primitive and complex data types. Let’s now dive into a quick overview of how we can go from complex data types to primitive data types and vice-versa.

Transforming complex data types

It is common to have complex data types such as structs, maps, and arrays when working with semi-structured formats. For example, you may be logging API requests to your web server. This API request will contain HTTP Headers, which would be a string-string map. The request payload may contain form-data in the form of JSON, which may contain nested fields or arrays. Some sources or formats may or may not support complex data types. Some formats may provide performance benefits when storing the data in a specific data type. For example, when using Parquet, all struct columns will receive the same treatment as top-level columns. Therefore, if you have filters on a nested field, you will get the same benefits as a top-level column. However, maps are treated as two array columns, hence you wouldn’t receive efficient filtering semantics.

Let’s look at some examples on how Spark SQL allows you to shape your data ad libitum with some data transformation techniques.
Selecting from nested columns

Dots (.) can be used to access nested columns for structs and maps.

// input
{
  "a": {
    "b": 1
  }
}

Python: events.select("a.b")
Scala: events.select("a.b")
SQL: select a.b from events

// output
{
  "b": 1
}

Flattening structs

A star (*) can be used to select all of the subfields in a struct.

// input
{
  "a": {
    "b": 1,
    "c": 2
  }
}

Python: events.select("a.*")
Scala: events.select("a.*")
SQL: select a.* from events

// output
{
  "b": 1,
  "c": 2
}
Nesting columns

The struct function or just parentheses in SQL can be used to create a new struct.

// input
{
  "a": 1,
  "b": 2,
  "c": 3
}

Python: events.select(struct(col("a").alias("y")).alias("x"))
Scala: events.select(struct('a as 'y) as 'x)
SQL: select named_struct("y", a) as x from events

// output
{
  "x": {
    "y": 1
  }
}

Nesting all columns

The star (*) can also be used to include all columns in a nested struct.

// input
{
  "a": 1,
  "b": 2
}

Python: events.select(struct("*").alias("x"))
Scala: events.select(struct("*") as 'x)
SQL: select struct(*) as x from events

// output
{
  "x": {
    "a": 1,
    "b": 2
  }
}
Selecting a single array or map element

getItem() or square brackets (i.e. [ ]) can be used to select a single element out of an array or a map.

// input
{
  "a": [1, 2]
}

Python: events.select(col("a").getItem(0).alias("x"))
Scala: events.select('a.getItem(0) as 'x)
SQL: select a[0] as x from events

// output
{ "x": 1 }

// input
{
  "a": {
    "b": 1
  }
}

Python: events.select(col("a").getItem("b").alias("x"))
Scala: events.select('a.getItem("b") as 'x)
SQL: select a['b'] as x from events

// output
{ "x": 1 }

Creating a row for each array or map element

explode() can be used to create a new row for each element in an array or each key-value pair. This is similar to LATERAL VIEW EXPLODE in HiveQL.

// input
{
  "a": [1, 2]
}

Python: events.select(explode("a").alias("x"))
Scala: events.select(explode('a) as 'x)
SQL: select explode(a) as x from events

// output
[{ "x": 1 }, { "x": 2 }]

// input
{
  "a": {
    "b": 1,
    "c": 2
  }
}

Python: events.select(explode("a").alias("x", "y"))
Scala: events.select(explode('a) as Seq("x", "y"))
SQL: select explode(a) as (x, y) from events

// output
[{ "x": "b", "y": 1 }, { "x": "c", "y": 2 }]
Collecting multiple rows into an array

collect_list() and collect_set() can be used to aggregate items into an array.

// input
[{ "x": 1 }, { "x": 2 }]

Python: events.select(collect_list("x").alias("x"))
Scala: events.select(collect_list('x) as 'x)
SQL: select collect_list(x) as x from events

// output
{ "x": [1, 2] }

// input
[{ "x": 1, "y": "a" }, { "x": 2, "y": "b" }]

Python: events.groupBy("y").agg(collect_list("x").alias("x"))
Scala: events.groupBy("y").agg(collect_list('x) as 'x)
SQL: select y, collect_list(x) as x from events group by y

// output
[{ "y": "a", "x": [1]}, { "y": "b", "x": [2]}]

Selecting one field from each item in an array

When you use dot notation on an array we return a new array where that field has been selected from each array element.

// input
{
  "a": [
    {"b": 1},
    {"b": 2}
  ]
}

Python: events.select("a.b")
Scala: events.select("a.b")
SQL: select a.b from events

// output
{
  "b": [1, 2]
}
Power of to_json() and from_json()

What if you really want to preserve your column’s complex structure but you need it to be encoded as a string to store it? Are you doomed? Of course not! Spark SQL provides functions like to_json() to encode a struct as a string and from_json() to retrieve the struct as a complex type. Using JSON strings as columns is useful when reading from or writing to a streaming source like Kafka. Each Kafka key-value record will be augmented with some metadata, such as the ingestion timestamp into Kafka, the offset in Kafka, etc. If the “value” field that contains your data is in JSON, you could use from_json() to extract your data, enrich it, clean it, and then push it downstream to Kafka again or write it out to a file.

Decode json column as a struct

from_json() can be used to turn a string column with JSON data into a struct. Then you may flatten the struct as described above to have individual columns. This method is not presently available in SQL.

// input
{
  "a": "{\"b\":1}"
}

Python:
schema = StructType().add("b", IntegerType())
events.select(from_json("a", schema).alias("c"))
Scala:
val schema = new StructType().add("b", IntegerType)
events.select(from_json('a, schema) as 'c)

Encode a struct as json

to_json() can be used to turn structs into JSON strings. This method is not presently available in SQL.

// output
{
  "c": "{\"b\":1}"
}
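The input and select that produce this to_json() output are not shown above; a sketch that matches it (following the same example conventions) would be:

// input
{
  "a": {
    "b": 1
  }
}

Python: events.select(to_json("a").alias("c"))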
from_json() also accepts a partial schema, so you can parse out just the fields you need from a nested JSON string:

// input
{
  "a": "{\"b\":{\"x\":1,\"y\":{\"z\":2}}}"
}

Python:
schema = StructType() \
  .add("b", StructType() \
    .add("x", IntegerType()) \
    .add("y", StringType()))
events.select(from_json("a", schema).alias("c"))

Another example pulls a single value out of a JSON string column:

// input
{
  "a": "{\"b\":1}"
}

// output
{ "c": 1 }
That’s a lot of transformations! Let’s now look at some real life use cases to put all of these data formats and data manipulation capabilities to good use.

…10:00-10:30 for this specific service”? The speed-ups can be attributed to:

• We no longer pay the price of deserializing JSON records
“I would like to train a speech recognition algorithm on newspaper articles or sentiment analysis on product comments.”

In cases where your data may not have a fixed schema, nor a fixed pattern/structure, it may just be easier to store it as plain text files. You may also have a pipeline that performs feature extraction on this unstructured data and stores it as Avro in preparation for your Machine Learning pipeline.

In the future blog posts in this series, we’ll cover more on:

• Monitoring your streaming applications
• Integrating Structured Streaming with Apache Kafka
• Computing event time aggregations with Structured Streaming

If you want to learn more about Structured Streaming, here are a few useful links.

• Previous blog posts explaining the motivation and concepts of Structured Streaming:
  - Continuous Applications: Evolving Streaming in Apache Spark 2.0
  - Structured Streaming In Apache Spark
• Python Notebook
• Scala Notebook
• SQL Notebook
Part 4: Processing Data in Apache Kafka
Processing Data in Apache Kafka with Structured Streaming in Apache Spark 2.2

We start with a review of Kafka terminology and then present examples of Structured Streaming queries that read data from and write data to Apache Kafka. And finally, we’ll explore an end-to-end real-world use case.
Specifying What Data to Read from Kafka

A Kafka topic can be viewed as an infinite stream where data is retained for a configurable amount of time. The infinite nature of this stream means that when starting a new query, we have to first decide what data to read and where in time we are going to begin. At a high level, there are three choices:

• earliest — start reading at the beginning of the stream. This excludes data that has already been deleted from Kafka because it was older than the retention period (“aged out” data).

• latest — start now, processing only new data that arrives after the query has started.

• per-partition assignment — specify the exact offset to start from for every partition, allowing fine-grained control over where processing should begin.

As you will see below, the startingOffsets option accepts one of the three options above, and is only used when starting a query from a fresh checkpoint. If you restart a query from an existing checkpoint, then it will always resume exactly where it left off, except when the data at that offset has been aged out. If any unprocessed data was aged out, the query behavior will depend on what is set by the failOnDataLoss option, which is described in the Kafka Integration Guide.

Existing users of the KafkaConsumer will notice that Structured Streaming provides a more granular version of the configuration option, auto.offset.reset. Instead of one option, we split these concerns into two different parameters, one that says what to do when the stream is first starting (startingOffsets), and another that handles what to do if the query is not able to pick up from where it left off, because the desired data has already been aged out (failOnDataLoss).
Apache Kafka support in Structured Streaming

Structured Streaming provides a unified batch and streaming API that enables us to view data published to Kafka as a DataFrame. When processing unbounded data in a streaming fashion, we use the same API and get the same data consistency guarantees as in batch processing. The system ensures end-to-end exactly-once fault-tolerance guarantees, so that a user does not have to reason about low-level aspects of streaming.

Let’s examine and explore examples of reading from and writing to Kafka, followed by an end-to-end application.

Reading Records from Kafka topics

The first step is to specify the location of our Kafka cluster and which topic we are interested in reading from. Spark allows you to read an individual topic, a specific set of topics, a regex pattern of topics, or even a specific set of partitions belonging to a set of topics. We will only look at an example of reading from an individual topic; the other possibilities are covered in the Kafka Integration Guide.

# Construct a streaming DataFrame that reads from topic1
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .option("startingOffsets", "earliest") \
  .load()

The DataFrame above is a streaming DataFrame subscribed to “topic1”. The configuration is set by providing options to the DataStreamReader, and the minimal required parameters are the location of the kafka.bootstrap.servers (i.e. host:port) and the topic that we want to subscribe to. Here, we have also specified startingOffsets to be “earliest”, which will read all data available in the topic at the start of the query. If the startingOffsets option is not specified, the default value of “latest” is used and only data that arrives after the query starts will be processed.

df.printSchema() reveals the schema of our DataFrame.
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Both the key and the value are returned as raw bytes; how to interpret them is application specific, but Spark SQL has built-in operations for common types of serialization, as we’ll show below.

Data Stored as a UTF8 String

If the bytes of the Kafka records represent UTF8 strings, we can simply use a cast to convert the binary data into the correct type.

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

For other serialization formats, you can register a UDF that wraps an existing deserializer:

spark.udf.register("deserialize", (topic: String, bytes: Array[Byte]) =>
  MyDeserializerWrapper.deser.deserialize(topic, bytes)
)

df.selectExpr("""deserialize("topic1", value) AS message""")

Note that the DataFrame code above is analogous to specifying value.deserializer when using the standard Kafka consumer.

Data Stored as JSON

If the data is stored as JSON, declare a schema for the payload and parse it with Spark SQL’s JSON functions:

# value schema: { "a": 1, "b": "string" }
schema = StructType().add("a", IntegerType()).add("b", StringType())
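The select that applies this schema to the value column is not shown above; a minimal sketch is:

from pyspark.sql.functions import col, from_json

parsed = df.select(
    col("key").cast("string"),
    from_json(col("value").cast("string"), schema).alias("value"))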
When writing to Kafka, the DataFrame being written must have a column named “value”, and optionally a column named “key”. If a key column is not specified, then a null valued key column will be automatically added. A null valued key column may, in some cases, lead to uneven data partitioning in Kafka, and should be used with care.

The destination topic for the records of the DataFrame can either be specified statically as an option to the DataStreamWriter or on a per-record basis as a column named “topic” in the DataFrame.

# Write key-value data from a DataFrame to a Kafka topic specified in an option
query = df \
  .selectExpr("CAST(userId AS STRING) AS key",
    "to_json(struct(*)) AS value") \
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("topic", "topic1") \
  .option("checkpointLocation", "/path/to/HDFS/dir") \
  .start()

The above query takes a DataFrame containing user information and writes it out to Kafka.

The two required options for writing to Kafka are the kafka.bootstrap.servers and the checkpointLocation. As in the above example, an additional topic option can be used to set a single topic to write to, and this option will override the “topic” column if it exists in the DataFrame.

End-to-End Example with Nest Devices

In this section, we will explore an end-to-end pipeline involving Kafka along with other data sources and sinks. We will work with a data set involving a collection of Nest device logs, with a JSON format described here. We’ll specifically examine data from Nest’s cameras, which look like the following JSON:

"devices": {
  "cameras": {
    …
    "end_time": "2016-12-29T18:42:00.000Z"
    …
We’ll also be joining with a static dataset (called “device_locations”) that contains a mapping from device_id to the zip_code where the device was registered.

At a high-level, the desired workflow looks like the graph above. Given a stream of updates from Nest cameras, we want to use Spark to perform several different tasks:

• Create an efficient, queryable historical archive of all events using a columnar format like Parquet.

• Perform low-latency event-time aggregation and push the results back to Kafka for other consumers.

While these may sound like wildly different use-cases, you can perform all of them using DataFrames and Structured Streaming in a single end-to-end Spark application! In the following sections, we’ll walk through individual steps, starting from ingest to processing to storing aggregated results.

Expected Schema for JSON data

schema = StructType() \
  .add("metadata", StructType() \
    .add("access_token", StringType()) \
    .add("client_version", IntegerType())) \
  .add("devices", StructType() \
    .add("thermostats", MapType(StringType(), StructType().add(...))) \
    .add("smoke_co_alarms", MapType(StringType(), StructType().add(...))) \
    .add("cameras", MapType(StringType(), StructType().add(...))) \
    .add("companyName", StructType().add(...))) \
  .add("structures", MapType(StringType(), StructType().add(...)))
Parse the Raw JSON

jsonOptions = { "timestampFormat": nestTimestampFormat }

…to create a new row for each key-value pair, flattening the data. Lastly, we use star (*) to unnest the “value” column. The following is the result of calling camera.printSchema():

root
 |-- device_id: string (nullable = true)
 |-- software_version: string (nullable = true)
 …
 |    |-- web_url: string (nullable = true)
 |    |-- app_url: string (nullable = true)
 |    |-- image_url: string (nullable = true)
 |    |-- animated_image_url: string (nullable = true)
 |    |-- activity_zone_ids: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
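Most of the parsing code the paragraph describes is missing above; a rough sketch of how the camera DataFrame could be derived (topic name, servers, and intermediate names are assumptions; schema and jsonOptions come from the snippets above) is:

from pyspark.sql.functions import col, explode, from_json

parsed = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("subscribe", "nest-logs")                 # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema, jsonOptions).alias("parsed_value")))

# One row per camera entry in the map, then unnest the struct with "value.*"
camera = (parsed
    .select(explode("parsed_value.devices.cameras"))
    .select("value.*"))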
Aggregate and Write Back to Kafka

We will now process the sightings DataFrame by augmenting each sighting with its location. Recall that we have some location data that lets us look up the zip code of a device by its device id. We first create a DataFrame representing this location data, and then join it with the sightings DataFrame, matching on device id. What we are doing here is joining the streaming DataFrame sightings with a static DataFrame of locations:

sightingLoc \
  .groupBy("zip_code", window("start_time", "1 hour")) \
  .count() \
  .select( \
    to_json(struct("zip_code", "window")).alias("key"),
    col("count").cast("string").alias("value")) \
  .writeStream \
  .format("kafka") \
  …
camera.writeStream \
  .format("parquet") \
  .option("startingOffsets", "earliest") \
  …

Batch Read and Format the Data

report = spark \
  …

In the future blog posts in this series, we’ll cover more on:
• Monitoring your streaming applications
• Computing event-time aggregations with Structured Streaming

If you want to learn more about Structured Streaming, here are a few useful links:

• Previous blog posts explaining the motivation and concepts of Structured Streaming:
  - Continuous Applications: Evolving Streaming in Apache Spark 2.0
  - Structured Streaming In Apache Spark
  - Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1
  - Working with Complex Data Formats with Structured Streaming in Apache Spark 2.1
• Structured Streaming Programming Guide
• Talk at Spark Summit 2017 East – Making Structured Streaming Ready for Production and Future Directions

Additional Configuration

Kafka Integration Guide
Contains further examples and Spark specific configuration options for processing data in Kafka.

Kafka Consumer and Producer Configuration Docs
Kafka’s own configurations can be set via DataStreamReader.option and DataStreamWriter.option with the kafka. prefix. For possible Kafka parameters, see the Kafka consumer config docs for parameters related to reading data, and the Kafka producer config docs for parameters related to writing data.

See the Kafka Integration Guide for the list of options managed by Spark, which are consequently not configurable.
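For example (a sketch; the specific setting shown is just an illustrative pass-through consumer property):

df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1")
    .option("kafka.session.timeout.ms", "10000")   # any Kafka consumer property, passed through via the "kafka." prefix
    .option("subscribe", "topic1")
    .load())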
To try Structured Streaming in Apache Spark 2.1, try Databricks today.

1. A compacted Kafka topic is a topic where retention is enforced by compaction to ensure that the log is guaranteed to have at least the last state for each key. See Kafka Log Compaction for more information.
Part 5: Event-time Aggregations and Watermarking
Watermarking in Apache Spark’s Structured Streaming

…events arrive. You choose different output modes for writing the updated averages to external systems like file systems and databases. Furthermore, you can also implement custom aggregations using Spark’s user-defined aggregation functions (UDAFs).

avgSignalDF = eventsDF.groupBy("deviceId").avg("signal")

windowedAvgSignalDF = \
  eventsDF \
    .groupBy(window("eventTime", "5 minute")) \
    .count()

In the above query, every record is going to be assigned to a 5 minute tumbling window as illustrated below.

windowedAvgSignalDF = \
  eventsDF \
    .groupBy(window("eventTime", "10 minutes", "5 minutes")) \
    .count()

In the above query, every record will be assigned to multiple overlapping windows as illustrated below (mapping of event-time to overlapping windows of length 10 minutes and sliding interval 5 minutes). Note how the late, out-of-order record [12:04, dev2] updated an old window’s count.
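The watermarked version of this windowed count is not reproduced here; a minimal sketch, assuming a 10-minute watermark on eventTime, is:

from pyspark.sql.functions import window

watermarkedWindowedCountsDF = \
  eventsDF \
    .withWatermark("eventTime", "10 minutes") \
    .groupBy(window("eventTime", "10 minutes", "5 minutes")) \
    .count()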
Note the two events that arrive between the processing-times 12:20 and 12:25. The watermark is used to differentiate between the late and the “too-late” events and treat them accordingly.

Conclusion

In short, I covered Structured Streaming’s windowing strategy to handle key streaming aggregations: windows over event-time and late and out-of-order data. Using this windowing strategy allows the Structured Streaming engine to implement watermarking, in which late data beyond the watermark can be dropped instead of being kept as state indefinitely.

To try Structured Streaming in Apache Spark 2.0, try Databricks today.
Structured Streaming in Apache Spark provides a simple programmatic API to get information about a stream that is currently executing. There are two key commands that you can run on a currently active stream in order to get relevant information about the query: its current status and its recent progress.

  "message" : "Getting offsets from FileStreamSource[dbfs:/databricks-datasets/structured-streaming/events]",
  "isDataAvailable" : true,
  "isTriggerActive" : true
}

Let’s explore why we chose to display these metrics and why they’re important for you to understand.

Recent Progress

While the query status is certainly important, equally important is an ability to view the query’s historical progress. Progress metadata will allow us to answer questions like “At what rate am I processing tuples?” or “How fast are tuples arriving from the source?”

You get the best of both worlds. The system will attempt to self-heal while keeping employees and developers informed of the status.
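A small sketch of the two commands the paragraph refers to, assuming query is the handle returned when the stream was started:

query = streamingDF.writeStream.format("memory").queryName("events").start()   # hypothetical query handle

print(query.status)        # current status of the query (the JSON shown above)
print(query.lastProgress)  # metrics of the most recent trigger: input rate, processing rate, batch duration, ...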
Part 7: Better Cost Management through APIs
Running Streaming Jobs Once a Day For 10x Cost Savings
Part 6 of Scalable Data @ Databricks
May 22, 2017 | by Burak Yavuz and Tyson Condie

Traditionally, when people think about streaming, terms such as “real-time,” “24/7,” or “always on” come to mind. You may have cases where data only arrives at fixed intervals. That is, data appears every hour or once a day. For these use cases, it is still beneficial to perform incremental processing on this data. However, it would be wasteful to keep a cluster up and running 24/7 just to perform a short amount of processing once a day.

Fortunately, by using the new Run Once trigger feature added to Structured Streaming in Apache Spark 2.2, you will get all the benefits of the Catalyst Optimizer incrementalizing your workload, and the cost savings of not having an idle cluster lying around. In this post, we will examine how to employ triggers to accomplish both.

The default behavior of Structured Streaming is to run with the lowest latency possible, so triggers fire as soon as the previous trigger finishes. For use cases with lower latency requirements, Structured Streaming supports a ProcessingTime trigger which will fire every user-provided interval, for example every minute.

While this is great, it still requires the cluster to remain running 24/7. In contrast, a RunOnce trigger will fire only once and then will stop the query. As we’ll see below, this lets you effectively utilize an external scheduling mechanism such as Databricks Jobs.

Triggers are specified when you start your streams.

PYTHON

# Load your Streaming DataFrame
sdf = spark.readStream.load(path="/in/path", format="json", schema=my_schema)

# Perform transformations and then write…
sdf.writeStream.trigger(once=True).start(path="/out/path", format="parquet")

In addition to all these benefits over batch processing, you also get the cost savings of not having an idle 24/7 cluster up and running for an irregular streaming job. The best of both worlds for batch and streaming processing are now under your fingertips.
PYTHON

from pyspark.sql.functions import expr

withEventTime\
  .withWatermark("event_time", "5 seconds")\
  .dropDuplicates(["User", "event_time"])\
  .groupBy("User")\
  .count()\
  .writeStream\
  .queryName("pydeduplicated")\
  .format("memory")\
  .outputMode("complete")\
  .start()

SELECT * FROM pydeduplicated

+----+-----+
|User|count|
+----+-----+
|   a| 8085|
|   b| 9123|
|   c| 7715|
|   g| 9167|
|   h| 7733|
|   e| 9891|
|   f| 9206|
|   d| 8124|
|   i| 9255|
+----+-----+
To set an event-time timeout, use GroupState.setTimeoutTimestamp(...). Only for timeouts based on event time must you specify a watermark. As such, all events in the group older than the watermark will be filtered out, and the timeout will occur when the watermark has advanced beyond the set timestamp.

When timeouts occur, the function you supplied in the streaming query will be invoked with these arguments: the key by which you keep the state, an iterator of input rows, and the old state. The example with mapGroupsWithState below defines a number of functional classes that represent and update this state.

So let’s define our input, output, and state data structure definitions.

case class InputRow(user:String, timestamp:java.sql.Timestamp,
  activity:String)
case class UserState(user:String,
  var activity:String,
  var start:java.sql.Timestamp,
  var end:java.sql.Timestamp)

...
    state.start = input.timestamp
...
      new java.sql.Timestamp(6284160000000L),
...
withEventTime
  .selectExpr("User as user",
    "cast(Creation_Time/1000000000 as timestamp) as timestamp",
    "gt as activity")
  .as[InputRow]
  // group the state by user key
  .groupByKey(_.user)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateAcrossEvents)
  .writeStream
  .queryName("events_per_window")
  .format("memory")
  .outputMode("update")
  .start()

We can now query our results in the stream:

+----+--------+--------------------+--------------------+
|user|activity|               start|                 end|
+----+--------+--------------------+--------------------+
|   a|    bike|2015-02-23 13:30:...|2015-02-23 14:06:...|
|   a|    bike|2015-02-23 13:30:...|2015-02-23 14:06:...|
...
|   b|    bike|2015-02-24 14:01:...|2015-02-24 14:38:...|
|   b|    bike|2015-02-24 14:01:...|2015-02-24 14:38:...|
|   c|    bike|2015-02-23 12:40:...|2015-02-23 13:15:...|
...
|   d|    bike|2015-02-24 13:07:...|2015-02-24 13:42:...|
+----+--------+--------------------+--------------------+

What’s Next

In this blog, we expanded on two additional functionalities and APIs for advanced streaming analytics. The first allows removing duplicates bounded by a watermark. With the second, you can implement customized stateful aggregations, beyond event-time basics and windowed aggregations.

To try out Databricks for yourself, sign-up for a 14-day free trial today!
Conclusion