Hands-On Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing (1st ed.). ISBN 1484293797, 9781484293799
Apress Standard
The publisher, the authors and the editors are safe to assume that the advice and
information in this book are believed to be true and accurate at the date of
publication. Neither the publisher nor the authors or the editors give a warranty,
expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral
with regard to jurisdictional claims in published maps and institutional
affiliations.
This Apress imprint is published by the registered company APress Media, LLC,
part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004,
U.S.A.
To my beloved family
Any source code or other supplementary material referenced by the author in this
book is available to readers on GitHub (https://github.com/Apress). For more
detailed information, please visit http://www.apress.com/source-code.
Table of Contents
Part I: Apache Spark Batch Data Processing
Chapter 1: Introduction to Apache Spark for Large-Scale Data Analytics
1.1 What Is Apache Spark?
Simpler to Use and Operate
Fast
Scalable
Ease of Use
Fault Tolerance at Scale
1.2 Spark Unified Analytics Engine
1.3 How Apache Spark Works
Spark Application Model
Spark Execution Model
Spark Cluster Model
1.4 Apache Spark Ecosystem
Spark Core
Spark APIs
Spark SQL and DataFrames and Datasets
Spark Streaming
Spark GraphX
1.5 Batch vs. Streaming Data
What Is Batch Data Processing?
What Is Stream Data Processing?
Difference Between Stream Processing and Batch Processing
1.6 Summary
Chapter 2: Getting Started with Apache Spark
2.1 Downloading and Installing Apache Spark
Installation of Apache Spark on Linux
Installation of Apache Spark on Windows
2.2 Hands-On Spark Shell
Using the Spark Shell Command
Running Self-Contained Applications with the spark-submit Command
2.3 Spark Application Concepts
Spark Application and SparkSession
Access the Existing SparkSession
2.4 Transformations, Actions, Immutability, and Lazy Evaluation
Transformations
Narrow Transformations
Wide Transformations
Actions
2.5 Summary
Chapter 3: Spark Low-Level API
3.1 Resilient Distributed Datasets (RDDs)
Creating RDDs from Parallelized Collections
Creating RDDs from External Datasets
Creating RDDs from Existing RDDs
3.2 Working with Key-Value Pairs
Creating Pair RDDs
Showing the Distinct Keys of a Pair RDD
Transformations on Pair RDDs
Actions on Pair RDDs
3.3 Spark Shared Variables: Broadcasts and Accumulators
Broadcast Variables
Accumulators
3.4 When to Use RDDs
3.5 Summary
Chapter 4: The Spark High-Level APIs
4.1 Spark Dataframes
Attributes of Spark DataFrames
Methods for Creating Spark DataFrames
4.2 Use of Spark DataFrames
Select DataFrame Columns
Select Columns Based on Name Patterns
Filtering Results of a Query Based on One or Multiple Conditions
Using Different Column Name Notations
Using Logical Operators for Multi-condition Filtering
Manipulating Spark DataFrame Columns
Renaming DataFrame Columns
Dropping DataFrame Columns
Creating a New Dataframe Column Dependent on Another Column
User-Defined Functions (UDFs)
Merging DataFrames with Union and UnionByName
Joining DataFrames with Join
4.3 Spark Cache and Persist of Data
Unpersisting Cached Data
4.4 Summary
Chapter 5: Spark Dataset API and Adaptive Query Execution
5.1 What Are Spark Datasets?
5.2 Methods for Creating Spark Datasets
5.3 Adaptive Query Execution
5.4 Data-Dependent Adaptive Determination of the Shuffle Partition Number
5.5 Runtime Replanning of Join Strategies
5.6 Optimization of Unevenly Distributed Data Joins
5.7 Enabling the Adaptive Query Execution (AQE)
5.8 Summary
Chapter 6: Introduction to Apache Spark Streaming
6.1 Real-Time Analytics of Bound and Unbound Data
6.2 Challenges of Stream Processing
6.3 The Uncertainty Component of Data Streams
6.4 Apache Spark Streaming’s Execution Model
6.5 Stream Processing Architectures
The Lambda Architecture
The Kappa Architecture
6.6 Spark Streaming Architecture: Discretized Streams
6.7 Spark Streaming Sources and Receivers
Basic Input Sources
Advanced Input Sources
6.8 Spark Streaming Graceful Shutdown
6.9 Transformations on DStreams
6.10 Summary
Part II: Apache Spark Streaming
Chapter 7: Spark Structured Streaming
7.1 General Rules for Message Delivery Reliability
7.2 Structured Streaming vs. Spark Streaming
7.3 What Is Apache Spark Structured Streaming?
Spark Structured Streaming Input Table
Spark Structured Streaming Result Table
Spark Structured Streaming Output Modes
7.4 Datasets and DataFrames Streaming API
Socket Structured Streaming Sources
Running Socket Structured Streaming Applications Locally
File System Structured Streaming Sources
Running File System Streaming Applications Locally
7.5 Spark Structured Streaming Transformations
Streaming State in Spark Structured Streaming
Spark Stateless Streaming
Spark Stateful Streaming
Stateful Streaming Aggregations
7.6 Spark Checkpointing Streaming
Recovering from Failures with Checkpointing
7.7 Summary
Chapter 8: Streaming Sources and Sinks
8.1 Spark Streaming Data Sources
Reading Streaming Data from File Data Sources
Reading Streaming Data from Kafka
Reading Streaming Data from MongoDB
8.2 Spark Streaming Data Sinks
Writing Streaming Data to the Console Sink
Writing Streaming Data to the File Sink
Writing Streaming Data to the Kafka Sink
Writing Streaming Data to the ForeachBatch Sink
Writing Streaming Data to the Foreach Sink
Writing Streaming Data to Other Data Sinks
8.3 Summary
Chapter 9: Event-Time Window Operations and Watermarking
9.1 Event-Time Processing
9.2 Stream Temporal Windows in Apache Spark
What Are Temporal Windows and Why Are They Important in Streaming
9.3 Tumbling Windows
9.4 Sliding Windows
9.5 Session Windows
Session Window with Dynamic Gap
9.6 Watermarking in Spark Structured Streaming
What Is a Watermark?
9.7 Summary
Chapter 10: Future Directions for Spark Streaming
10.1 Streaming Machine Learning with Spark
What Is Logistic Regression?
Types of Logistic Regression
Use Cases of Logistic Regression
Assessing the Sensitivity and Specificity of Our Streaming ML Model
10.2 Spark 3.3.x
Spark RocksDB State Store Database
10.3 The Project Lightspeed
Predictable Low Latency
Enhanced Functionality for Processing Data/Events
New Ecosystem of Connectors
Improve Operations and Troubleshooting
10.4 Summary
Bibliography
Index
About the Author
Alfonso Antolínez García
is a senior IT manager with a long professional
career serving in several multinational
companies such as Bertelsmann SE, Lafarge,
and TUI AG. He has been working in the media
industry, the building materials industry, and the
leisure industry. Alfonso also works as a
university professor, teaching artificial
intelligence, machine learning, and data science.
In his spare time, he writes research papers on
artificial intelligence, mathematics, physics, and
the applications of information theory to other
sciences.
About the Technical Reviewer
Akshay R. Kulkarni
is an AI and machine learning evangelist and a
thought leader. He has consulted several Fortune
500 and global enterprises to drive AI- and data
science–led strategic transformations. He is a
Google Developer Expert, author, and regular
speaker at major AI and data science
conferences (including Strata, O’Reilly AI
Conf, and GIDS). He is a visiting faculty
member for some of the top graduate institutes
in India. In 2019, he was also featured as one of the top 40 under-40 data
scientists in India. In his spare time, he enjoys reading,
writing, coding, and building next-gen AI
products.
Part I
Apache Spark Batch Data Processing
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_1
Fast
On November 5, 2014, Databricks officially announced that they had won the
Daytona GraySort contest.1 In this competition, the Databricks team used a
Spark cluster of 206 EC2 nodes to sort 100 TB of data (1 trillion records) in 23
minutes. The previous world record of 72 minutes using a Hadoop MapReduce
cluster of 2100 nodes was set by Yahoo. Summarizing, Spark sorted the same
data three times faster with ten times fewer machines. Impressive, right?
But wait a bit. The same post also says, “All the sorting took place on disk
(HDFS), without using Spark’s in-memory cache.” So was it not all about
Spark’s in-memory capabilities? Apache Spark is recognized for its in-memory
performance. However, assuming Spark’s outstanding results are due to this
feature is one of the most common misconceptions about Spark’s design. From
its genesis, Spark was conceived to achieve a superior performance both in
memory and on disk. Therefore, Spark operators perform regular operations on
disk when data does not fit in memory.
Scalable
Apache Spark is an open source framework intended to provide parallelized data
processing at scale. At the same time, Spark high-level functions can be used to
carry out different data processing tasks on datasets of diverse sizes and
schemas. This is accomplished by distributing workloads from several servers to
thousands of machines, running on a cluster of computers and orchestrated by a
cluster manager like Mesos or Hadoop YARN. Therefore, hardware resources
can increase linearly with every new computer added. It is worth clarifying that
hardware addition to the cluster does not necessarily represent a linear increase
in computing performance and hence linear reduction in processing time because
internal cluster management, data transfer, network traffic, and so on also
consume resources, subtracting them from the effective Spark computing
capabilities. Despite the fact that running in cluster mode leverages Spark’s full
distributed capacity, it can also be run locally on a single computer, called local
mode.
If you have searched for information about Spark before, you probably have
read something like “Spark runs on commodity hardware.” It is important to
understand the term “commodity hardware.” In the context of big data,
commodity hardware does not denote low quality, but rather equipment based on
market standards, which is general-purpose, widely available, and hence
affordable as opposed to purpose-built computers.
Ease of Use
Spark makes the life of data engineers and data scientists operating on large
datasets easier. Spark provides a single unified engine and API for diverse use
cases such as streaming, batch, or interactive data processing. These tools allow
it to easily cope with diverse scenarios like ETL processes, machine learning, or
graphs and graph-parallel computation. Spark also provides about a hundred
operators for data transformation and the notion of dataframes for manipulating
semi-structured data.
Figure 1-2 Spark communication architecture with worker nodes and executors
This Execution Model also has some downsides. Data cannot be exchanged
between Spark applications (instances of the SparkContext) via the in-memory
computation model, without first saving the data to an external storage device.
As mentioned before, Spark can be run with a wide variety of cluster
managers. That is possible because Spark is a cluster-agnostic platform. This
means that as long as a cluster manager is able to obtain executor processes and
to provide communication among the architectural components, it is suitable for
the purpose of executing Spark. That is why communication between the driver
program and the worker nodes must be available at all times, because the driver
must accept incoming connections from its executors for as long as
applications are running on them.
Figure 1-3 Spark’s heartbeat communication between executors and the driver
Spark Core
Spark Core is the bedrock on top of which in-memory computing, fault
tolerance, and parallel computing are developed. The Core also provides data
abstraction via RDDs and, together with the cluster manager, the arrangement of
data across the different nodes of the cluster. The high-level libraries (Spark SQL,
Streaming, MLlib for machine learning, and GraphX for graph data processing)
are also running over the Core.
Spark APIs
Spark incorporates a series of application programming interfaces (APIs) for
different programming languages (SQL, Scala, Java, Python, and R), paving the
way for the adoption of Spark by a great variety of professionals with different
development, data science, and data engineering backgrounds. For example,
Spark SQL permits the interaction with RDDs as if we were submitting SQL
queries to a traditional relational database. This feature has facilitated many
transactional database administrators and developers to embrace Apache Spark.
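To give a flavor of this, the following is a minimal sketch, not taken from the book's running examples, that registers a small DataFrame as a temporary view and queries it with plain SQL. It assumes a spark-shell session where the spark object and its implicits are available, and the data and names are purely illustrative:
import spark.implicits._
// A tiny DataFrame built from a local collection
val carsDF = Seq(("USA", "Chrysler"), ("Germany", "BMW"), ("Spain", "SEAT")).toDF("country", "maker")
// Expose it to the SQL engine under a view name
carsDF.createOrReplaceTempView("cars")
// Query it exactly as you would query a relational table
spark.sql("SELECT maker FROM cars WHERE country = 'Germany'").show()
The same query could equally be expressed with the DataFrame API, which is part of what makes Spark approachable for people coming from SQL backgrounds.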
Let’s now review each of the four libraries in detail.
Spark Streaming
Spark Structured Streaming is a high-level library on top of the core Spark SQL
engine. Structured Streaming enables Spark’s fault-tolerant and real-time
processing of unbounded data streams without users having to think about how
the streaming takes place. Spark Structured Streaming provides fault-tolerant,
fast, end-to-end, exactly-once, at-scale stream processing. Spark Streaming
permits expressing streaming computations in the same fashion as static data is
computed via batch processing. This is achieved by executing the streaming
process incrementally and continuously and updating the outputs as the
incoming data is ingested.
With Spark 2.3, a new low-latency processing mode called continuous
processing was introduced, achieving end-to-end latencies of as low as 1 ms,
ensuring at-least-once2 message delivery. The at-least-once concept is depicted
in Figure 1-5. By default, Structured Streaming internally processes the
information as micro-batches, meaning data is processed as a series of tiny batch
jobs.
Figure 1-5 Depiction of the at-least-once message delivery semantic
Spark Structured Streaming also uses the same concepts of datasets and
DataFrames to represent streaming aggregations, event-time windows, stream-
to-batch joins, etc. using different programming language APIs (Scala, Java,
Python, and R). It means the same queries can be used without changing the
dataset/DataFrame operations, therefore choosing the operational mode that best
fits our application requirements without modifying the code.
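As an illustration only (this sketch is not one of the book's examples; the host and port are hypothetical, and you would feed the socket with a tool such as nc -lk 9999), a streaming word count over a socket source can be written with the same DataFrame operations used for batch data:
import org.apache.spark.sql.functions._
import spark.implicits._
// Unbounded input: each line received on the socket becomes a row
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
// The same DataFrame operations you would use on static data
val wordCounts = lines
  .select(explode(split($"value", " ")).as("word"))
  .groupBy("word")
  .count()
// Incrementally update and print the result table on every micro-batch
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
With minimal changes the same aggregation could be run as a batch job over a static file, which is exactly the point of the unified API described above.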
Spark’s machine learning (ML) library is commonly known as MLlib,
though it is not its official name. MLlib's goal is to provide out-of-the-box,
easy-to-use machine learning capabilities for big data. At a high level, it provides
capabilities such as the following:
Machine learning algorithms like classification, clustering, regression,
collaborative filtering, decision trees, random forests, and gradient-boosted
trees among others
Featurization:
Term Frequency-Inverse Document Frequency (TF-IDF): a statistical feature
vectorization method for natural language processing and information retrieval.
Word2vec: It takes a text corpus as input and produces word vectors as output.
StandardScaler: It is a very common tool for pre-processing steps and
feature standardization.
Principal component analysis (PCA): an orthogonal transformation to convert
possibly correlated variables into a set of linearly uncorrelated variables.
Etc.
ML Pipelines, to create and tune machine learning pipelines
Predictive Model Markup Language (PMML), to export models to PMML
Basic Statistics, including summary statistics, correlation between series,
stratified sampling, etc.
As of Spark 2.0, the primary Spark Machine Learning API is the DataFrame-
based API in the spark.ml package, switching from the traditional RDD-based
APIs in the spark.mllib package.
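As a minimal, self-contained sketch of that DataFrame-based API (the tiny training set and column names below are invented purely for illustration), a text classification pipeline can be assembled and fit in a few lines:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
// Toy labeled data: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "spark makes big data processing simple", 1.0),
  (1L, "slow manual batch jobs", 0.0)
)).toDF("id", "text", "label")
// Featurization stages followed by an estimator
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
// The Pipeline chains featurization and the estimator into a single model
val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)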
Spark GraphX
GraphX is a new high-level Spark library for graphs and graph-parallel
computation designed to solve graph problems. GraphX extends the Spark RDD
capabilities by introducing this new graph abstraction to support graph
computation and includes a collection of graph algorithms and builders to
optimize graph analytics.
The Apache Spark ecosystem described in this section is portrayed in Figure
1-6.
1.6 Summary
In this chapter we briefly looked at the Apache Spark architecture,
implementation, and ecosystem of applications. We also covered the two
different types of data processing Spark can deal with, batch and streaming, and
the main differences between them. In the next chapter, we are going to go
through the Spark setup process, the Spark application concept, and the two
different types of Apache Spark RDD operations: transformations and actions.
Footnotes
1 www.databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-
in-large-scale-sorting.html
2 With the at-least-once message delivery semantic, a message can be delivered more than once; however,
no message can be lost.
3 www.statista.com/statistics/871513/worldwide-data-created/
Now that you have an understanding of what Spark is and how it works, we can
get you set up to start using it. In this chapter, I’ll provide download and
installation instructions and cover Spark command-line utilities in detail. I’ll also
review Spark application concepts, as well as transformations, actions,
immutability, and lazy evaluation.
$ java -version
If Java is already installed on your system, you will see a message similar
to the following:
$ java -version
java version "18.0.2" 2022-07-19
Java(TM) SE Runtime Environment (build 18.0.2+9-61)
Java HotSpot(TM) 64-Bit Server VM (build 18.0.2+9-61,
mixed mode, sharing)
Your Java version may be different. Java 18 is the Java version in this case.
If you don't have Java installed, follow these steps:
1.
Open a browser window, and navigate to the Java download page as seen in
Figure 2-2.
2.
Click the Java file of your choice and save the file to a location (e.g.,
/home/<user>/Downloads).
Figure 2-2 Java download page
$ cd PATH/TO/spark-3.3.0-bin-hadoop3.tgz_location
and execute
$ su -
Password:
$ cd /home/<user>/Downloads/
$ mv spark-3.3.0-bin-hadoop3 /usr/local/spark
$ exit
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Use the following command for sourcing the ~/.bashrc file, updating the
environment variables:
$ source ~/.bashrc
$ $SPARK_HOME/bin/spark-shell
If Spark is installed successfully, then you will find the following output:
scala>
You can try the installation a bit further by taking advantage of the
README.md file that is present in the $SPARK_HOME directory:
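For instance, assuming the installation path used earlier (/usr/local/spark), you could load the file into an RDD and run a couple of simple operations on it as a quick smoke test; this is just a sketch:
scala> val readme = sc.textFile("/usr/local/spark/README.md")
scala> readme.count()
scala> readme.filter(line => line.contains("Spark")).count()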
The Spark context Web UI is available by typing the following URL in
your browser:
http://localhost:4040
There, you can see the jobs, stages, storage space, and executors that are
used for your small application. The result can be seen in Figure 2-3.
Figure 2-3 Apache Spark Web UI showing jobs, stages, storage, environment, and executors used for the
application running on the Spark shell
java -version
You must see the same signature you copied before; if not, something is
wrong. Try to solve it by downloading the file again.
You will be prompted with the System Properties dialog box, Figure 2-5 left:
1.
Click the Environment Variables button.
2.
The Environment Variables window appears, Figure 2-5 top-right:
a.
Click the New button.
3.
Insert the following variables:
a.
JAVA_HOME: \PATH\TO\YOUR\JAVA-DIRECTORY
b.
SPARK_HOME: \PATH\TO\YOUR\SPARK-DIRECTORY
c.
HADOOP_HOME: \PATH\TO\YOUR\HADOOP-DIRECTORY
You will have to repeat the previous step twice, to introduce the three
variables.
4.
Click the OK button to save the changes.
5.
Then, click your Edit button, Figure 2-6 left, to edit your PATH.
6.
After that, click the New button, Figure 2-6 right.
And add these new variables to the PATH:
a.
%JAVA_HOME%\bin
b.
%SPARK_HOME%\bin
c.
%HADOOP_HOME%\bin
In the log4j2.properties file, change the line
rootLogger.level = info
to
rootLogger.level = ERROR
To carry out a more complete test of the installation, let’s try the following
code:
val file = sc.textFile("C:\\PATH\\TO\\YOUR\\spark\\README.md")
This will create an RDD. You can view the file's content by using the next
instruction:
file.take(10).foreach(println)
If you have Python installed, you can run PySpark with this command:
pyspark
#For R examples:
$ $SPARK_HOME/bin/spark-submit
examples/src/main/r/dataframe.R
$SPARK_HOME/bin/spark-shell
~ % spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at
http://192.168.0.16:4040
Spark context available as 'sc' (master = local[*],
app id = local-1662487353802).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.0
/_/
scala>
scala> :type sc
org.apache.spark.SparkContext
You can access environment variables from the shell using the getenv
method as System.getenv("ENV_NAME"), for example:
scala> System.getenv("PWD")
res12: String = /Users/aantolinez
scala> :help
All commands can be abbreviated, e.g., :he instead of
:help.
:completions <string> output completions for the given
string
:edit <id>|<line> edit history
:help [command] print this summary or command-
specific help
:history [num] show the history (optional num
is commands to show)
:h? <string> search the history
...
...
:save <path> save replayable session to a
file
:settings <options> update compiler options, if
possible; see reset
:silent disable/enable automatic
printing of results
:warnings show the suppressed warnings
from the most recent line which had any
scala>
Please, do not confuse the inline help provided by :help with spark-
shell runtime options shown by the spark-shell -h option.
The -h option permits passing runtime environment options to the shell,
allowing a flexible execution of your application, depending on the cluster
configuration. Let’s see some examples in which we run Apache Spark with
Apache Hudi; set the cluster manager (YARN), the deployment mode, and the
number of cores per executor; and allocate the memory available for the driver
and executors:
$SPARK_HOME/bin/spark-shell \
--master yarn \
--deploy-mode cluster \
--driver-memory 16g \
--executor-memory 32g \
--executor-cores 4 \
--conf "spark.sql.shuffle.partitions=1000" \
--conf "spark.executor.memoryOverhead=4024" \
--conf "spark.memory.fraction=0.7" \
--conf "spark.memory.storageFraction=0.3" \
--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0 \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
--conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
In the next example, we define at runtime the database driver and version we
would like to be used:
$SPARK_HOME/bin/spark-shell \
--master yarn \
--deploy-mode cluster \
--driver-memory 16g \
--executor-memory 32g \
--executor-cores 4 \
--driver-class-path /path/to/postgresql-42.5.0.jar \
--conf "spark.sql.shuffle.partitions=1000" \
--conf "spark.executor.memoryOverhead=4024" \
--conf "spark.memory.fraction=0.7" \
--conf "spark.memory.storageFraction=0.3" \
scala> cars_df.show()
+-------+---------+-----+-------------+
| _1| _2| _3| _4|
+-------+---------+-----+-------------+
| USA| Chrysler|Dodge| Jeep|
|Germany| BMW| VW| Mercedes|
| Spain|GTA Spano| SEAT|Hispano Suiza|
+-------+---------+-----+-------------+
scala> df_postgresql.show()
+-----------+--------------+--------------------+-------+
|category_id| category_name| description|picture|
+-----------+--------------+--------------------+-------+
| 1| Beverages|Soft drinks, coff...| []|
| 2| Condiments|Sweet and savory ...| []|
| 3| Confections|Desserts, candies...| []|
| 4|Dairy Products| Cheeses| []|
| 5|Grains/Cereals|Breads, crackers,...| []|
| 6| Meat/Poultry| Prepared meats| []|
| 7| Produce|Dried fruit and b...| []|
| 8| Seafood| Seaweed and fish| []|
+-----------+--------------+--------------------+-------+
$SPARK_HOME/bin/pyspark
~ % pyspark
Python 3.6.12 (default, May 18 2021, 22:47:55)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for
more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/usr/share/aws/glue/etl/jars/glue-
assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
...
...
22/09/07 16:30:26 WARN Client: Same path resource
file:///usr/share/aws/glue/libs/pyspark.zip added
multiple times to distributed cache.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.3.0
/_/
$SPARK_HOME/bin/pyspark \
--master yarn \
--deploy-mode client \
--executor-memory 16G \
--executor-cores 8 \
--conf spark.sql.parquet.mergeSchema=true \
--conf spark.sql.parquet.filterPushdown=true \
--conf spark.sql.parquet.writeLegacyFormat=false
Hands_On_Spark3_Script.py >
./Hands_On_Spark3_Script.log 2>&1 &
Unlike spark-shell, to leave pyspark you can type either quit() or exit() in
your terminal, or press Ctrl-D.
~ % $SPARK_HOME/bin/spark-submit --help
$ $SPARK_HOME/bin/spark-submit \
--master <master-url> \
--deploy-mode <deploy-mode> \
--class <main-class> \
--conf <key>=<value> \
--driver-memory <value>g \
--executor-memory <value>g \
--executor-cores <number of cores> \
--jars <comma separated dependencies>
... # other options
<application-jar> \
[application-arguments]
Now, we are going to illustrate how spark-submit works with a practical
example:
$ $SPARK_HOME/bin/spark-submit \
--deploy-mode client \
--master local \
--class org.apache.spark.examples.SparkPi \
/$SPARK_HOME/examples/jars/spark-examples_2.12-
3.3.0.jar 80
If you have run this piece of code as is, you will have seen a large amount of
INFO log output in your console. To reduce this verbosity, look at the
configuration templates shipped in the Spark conf directory:
$ ls $SPARK_HOME/conf
fairscheduler.xml.template spark-
defaults.conf.template
log4j2.properties.template spark-env.sh.template
metrics.properties.template workers.template
$ mv $SPARK_HOME/conf/log4j2.properties.template
$SPARK_HOME/conf/log4j2.properties
$ vi $SPARK_HOME/conf/log4j2.properties
And change it to
# Set everything to be logged to the console
rootLogger.level = ERROR
Save the file, and run the Spark example application again:
$ $SPARK_HOME/bin/spark-submit \
--name "Hands-On Spark 3" \
--master local\[4] \
--deploy-mode client \
--conf spark.eventLog.enabled=false \
--conf "spark.executor.extraJavaOptions=-
XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--class org.apache.spark.examples.SparkPi \
/$SPARK_HOME/examples/jars/spark-examples_2.12-
3.3.0.jar 80
Pi is roughly 3.1410188926273617
$
$ $SPARK_HOME/bin/spark-submit --help
Option  Description
cluster: In cluster mode, the driver program will run in one of the worker machines inside a cluster. Cluster mode is used to run production jobs.
client (default option): In client mode, the driver program runs locally where the application is submitted, and the executors run in different nodes.
(base) aantolinez@MacBook-Pro ~ %
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local\[4] \
/$SPARK_HOME/examples/jars/spark-examples_2.12-
3.3.0.jar 80
Option  Description
--executor-cores
• Number of CPU cores to be used by each executor process.
• The cores property controls the number of concurrent tasks an executor can run.
• --executor-cores 5 means that each executor can run a maximum of five tasks at the same time.
--executor-memory
• Amount of RAM to use for the executor process.
• This option affects the maximum size of data Spark can cache and allocate for shuffle data structures.
• This property impacts the performance of operations like aggregations, grouping, and joins.
--num-executors (*)
• It controls the number of executors requested.
--driver-memory
• Memory to be used by the Spark driver.
--driver-cores
• The number of CPU cores given to the Spark driver.
--total-executor-cores
• The total number of cores granted to all the executors of the application.
(*) Note Starting with CDH 5.4/Spark 1.3, you can bypass setting up
this parameter with the spark.dynamicAllocation.enabled property, turning on
dynamic allocation. Dynamic allocation permits your application to solicit
available executors while there are pending tasks and release them when
unused.
sc = SparkContext(conf=conf)
The SparkContext is created only once for an application; thus, another, more
flexible approach to the problem is constructing it with an empty configuration
and supplying the actual settings when the application is submitted.
Spark submit allows you to fine-tune your cluster configuration with dozens
of parameters that can be sent to the SparkContext using the --conf/-c option or
by setting the SparkConf to create a SparkSession.
These options control application properties (Table 2-4), the runtime
environment (Table 2-5), shuffle behavior, Spark UI, compression and
serialization, memory management, execution behavior, executor metrics,
networking, scheduling, barrier execution mode, dynamic allocation (Table 2-6),
thread configurations, security, and runtime SQL configuration, among others
(Table 2-7). Next, we will explore some of the most common ones.
Application Properties
Table 2-4 Spark application properties
Property  Description
spark.app.name (default value: none): The name of your application.
spark.driver.cores (default value: 1): In cluster mode, the number of cores to use for the driver process only.
spark.driver.memory (default value: 1g): Amount of memory to use for the driver process. In client mode, it should be set via the --driver-memory command-line option or the properties file.
Runtime Environment
Table 2-5 Spark runtime environment
Property  Description
spark.driver.extraClassPath (default value: none): Extra classpath entries to prepend to the classpath of the driver. In client mode, it should be set via the --driver-class-path command-line option or in the default properties file. The option allows you to load specific JAR files, such as database connectors and others.
Dynamic Allocation
Table 2-6 Spark allocation resources
Property  Description
spark.dynamicAllocation.enabled (default value: false): Whether to use dynamic resource allocation to adjust the number of executors processing your application based on the existing workload.
spark.dynamicAllocation.executorIdleTimeout (default value: 60s): If dynamic allocation is enabled, an executor process will be killed if it has been idle for longer than this timeout.
spark.dynamicAllocation.cachedExecutorIdleTimeout (default value: infinity): If dynamic allocation is enabled, an executor process will be killed if it has cached data and has been idle for longer than this timeout.
Others
Table 2-7 Other Spark options to control application properties
Property  Description
spark.sql.shuffle.partitions (default value: 200): Number of partitions to use when shuffling data for joins or aggregations.
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.sql.shuffle.partitions=10000" \
--conf "spark.executor.memoryOverhead=8192" \
--conf "spark.memory.fraction=0.7" \
--conf "spark.memory.storageFraction=0.3" \
--conf "spark.dynamicAllocation.minExecutors=10" \
--conf "spark.dynamicAllocation.maxExecutors=2000" \
--conf "spark.dynamicAllocation.enabled=true" \
--conf "spark.executor.extraJavaOptions=-
XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
--files /path/of/config.conf,/path/to/myproperties.json \
--class org.apache.spark.examples.SparkPi \
/$SPARK_HOME/examples/jars/spark-examples_2.12-
3.3.0.jar 80
object Functions {
  def main(args: Array[String]) = {
    agregar(1, 2)
  }
  val agregar = (x: Int, y: Int) => println(x + y)
}
// SparkSession output
org.apache.spark.sql.SparkSession@7dabc2f9
You can get the active SparkSession for the current thread, returned by the
builder, using the getActiveSession() method:
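The calls producing the output echoed next are not included in this excerpt; a minimal sketch consistent with it would be:
scala> val activeSession = org.apache.spark.sql.SparkSession.getActiveSession
scala> val configMap = spark.conf.getAll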
configMap: Map[String,String] =
Map(spark.sql.warehouse.dir -> file:/Users/.../spark-
warehouse, spark.executor.extraJavaOptions -> -
XX:+IgnoreUnrecognizedVMOptions --add-
opens=java.base/java.lang=ALL-UNNAMED --add-
opens=java.base/java.lang.invoke=ALL-UNNAMED --add-
opens=java.base/java.lang.reflect=ALL-UNNAMED --add-
opens=java.base/java.io=ALL-UNNAMED --add-
opens=java.base/java.net=ALL-UNNAMED --add-
opens=java.base/java.nio=ALL-UNNAMED --add-
opens=java.base/java.util=ALL-UNNAMED --add-
opens=java.base/java.util.concurrent=ALL-UNNAMED --
add-opens=java.base/java.util.concurrent.atomic=ALL-
UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -
-add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-
opens=java.base/sun.security.action=ALL-UNNAMED --add-
opens=java.ba...
('spark.driver.extraJavaOptions', '-
XX:+IgnoreUnrecognizedVMOptions --add-
opens=java.base/java.lang=ALL-UNNAMED --add-
opens=java.base/java.lang.invoke=ALL-UNNAMED --add-
opens=java.base/java.lang.reflect=ALL-UNNAMED --add-
opens=java.base/java.io=ALL-UNNAMED --add-
opens=java.base/java.net=ALL-UNNAMED --add-
opens=java.base/java.nio=ALL-UNNAMED --add-
opens=java.base/java.util=ALL-UNNAMED --add-
opens=java.base/java.util.concurrent=ALL-UNNAMED --
add-opens=java.base/java.util.concurrent.atomic=ALL-
UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -
-add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-
opens=java.base/sun.security.action=ALL-UNNAMED --add-
opens=java.base/sun.util.calendar=ALL-UNNAMED --add-
opens=java.security.jgss/sun.security.krb5=ALL-
UNNAMED')
('spark.app.submitTime', '1662916389744')
('spark.sql.warehouse.dir', 'file:/Users/.../spark-
warehouse')
('spark.app.id', 'local-1662916391188')
('spark.executor.id', 'driver')
('spark.app.startTime', '1662916390211')
('spark.app.name', 'PySparkShell')
('spark.driver.port', '54543')
('spark.sql.catalogImplementation', 'hive')
('spark.rdd.compress', 'True')
('spark.executor.extraJavaOptions', '-
XX:+IgnoreUnrecognizedVMOptions --add-
opens=java.base/java.lang=ALL-UNNAMED --add-
opens=java.base/java.lang.invoke=ALL-UNNAMED --add-
opens=java.base/java.lang.reflect=ALL-UNNAMED --add-
opens=java.base/java.io=ALL-UNNAMED --add-
opens=java.base/java.net=ALL-UNNAMED --add-
opens=java.base/java.nio=ALL-UNNAMED --add-
opens=java.base/java.util=ALL-UNNAMED --add-
opens=java.base/java.util.concurrent=ALL-UNNAMED --
add-opens=java.base/java.util.concurrent.atomic=ALL-
UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED -
-add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-
opens=java.base/sun.security.action=ALL-UNNAMED --add-
opens=java.base/sun.util.calendar=ALL-UNNAMED --add-
opens=java.security.jgss/sun.security.krb5=ALL-
UNNAMED')
('spark.serializer.objectStreamReset', '100')
('spark.driver.host', '192.168.0.16')
('spark.master', 'local[*]')
('spark.submit.pyFiles', '')
('spark.submit.deployMode', 'client')
('spark.ui.showConsoleProgress', 'true')
In a similar way, you can set the Spark configuration parameters during
runtime:
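For example (the property value here is purely illustrative, not a setting required by the book's examples):
scala> spark.conf.set("spark.sql.shuffle.partitions", "100")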
You can also use the SparkSession to work with the catalog metadata, via the
catalog variable and spark.catalog.listDatabases and
spark.catalog.listTables methods:
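The calls behind the output below are not shown in this excerpt; one way to produce it, as a sketch, is:
scala> spark.catalog.listDatabases().show(false)
scala> spark.catalog.listTables().show(false)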
+-------+----------------+-------------------------------------
|name |description |locationUri
+-------+----------------+-------------------------------------
|default|default database|file:/Users/.../spark-warehouse
+-------+----------------+-------------------------------------
+------------+--------+-----------+---------+-----------+
|name |database|description|tableType|isTemporary|
+------------+--------+-----------+---------+-----------+
|hive_table |default |null |MANAGED |false |
|sample_table|null |null |TEMPORARY|true |
|table_1 |null |null |TEMPORARY|true |
+------------+--------+-----------+---------+-----------+
SparkSession in spark-shell
The SparkSession object is created by the Spark driver program. Remember we
mentioned in a previous section that the SparkSession object is automatically
created for you when you use the Spark shell and it is available via the spark
variable. You can use the spark variable in the Spark shell command line like
this:
scala> spark.version
Spark Version : 3.3.0
org.apache.spark.sql.SparkSession@ddfc241
The Spark Version is : 3.3.0
Transformations
Transformations are operations that take an RDD or dataframe as input and return
a new RDD or dataframe as output. Therefore, transformations preserve the
original copy of the data, and that is why Spark data structures are said to be
immutable. Another important characteristic of the transformations is that they
are not executed immediately after they are defined; on the contrary, they are
memorized, creating a transformations lineage as the one shown in Figure 2-11.
For example, operations such as map(), filter(), and others don't take
effect until an action is called.
Narrow Transformations
Narrow transformations are operations without data shuffling, that is to say, there
is no data movement between partitions. Thus, narrow transformations operate
on data residing in the same partition as can be seen in Figure 2-13.
Figure 2-13 Example of narrow transformations
Wide Transformations
Wide transformations are operations involving data shuffling; it means there is
data movement between partitions as can be seen in Figure 2-14.
Actions
Actions, on the other hand, are Spark operations returning a single value—in
other words, operations not returning another data structure, RDD, or dataframe.
When an action is called in Spark, it triggers the transformations preceding it in
the DAG.
Examples of functions triggering actions are aggregate(), collect(), count(),
fold(), first(), min(), max(), top(), etc.
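As a small illustration (the numbers below are arbitrary and chosen only for this sketch), nothing is computed until the final action is invoked:
// Transformations only build the lineage; no job is launched yet
val numbers  = sc.parallelize(1 to 10)
val doubled  = numbers.map(_ * 2)          // narrow transformation
val filtered = doubled.filter(_ > 10)      // narrow transformation
val byKey    = filtered.map(n => (n % 4, n)).reduceByKey(_ + _)  // wide: shuffles data when it runs
// The action below triggers execution of the whole DAG
byKey.collect()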
2.5 Summary
In this chapter we have covered the most essential steps to have Spark up and
running: downloading the necessary software and configuration. We have also
seen how to work with the Spark shell interface and how to execute self-
contained applications and examples using the Spark shell interface. Finally, we
went through the Spark concepts of immutability, lazy evaluation,
transformations, and actions. In the next chapter, we explain the Spark low-level
API together with the notion of Resilient Distributed Datasets (RDDs).
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_3
Next, you can see how you can implement the RDD depicted in Figure 3-2
with PySpark:
alphabetList = ['a','b','c','d','e','f','g','h','i','j','k','l']
rdd = spark.sparkContext.parallelize(alphabetList, 4)
print("Number of partitions: " + str(rdd.getNumPartitions()))
Number of partitions: 4
The following is another example, this time in Scala, of how to create a RDD
by parallelizing a collection of numbers:
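The parallelize call itself is not reproduced in this excerpt; a sketch consistent with the reduce result shown next (the integers 1 through 10 sum to 55) is:
scala> val rdd = sc.parallelize(1 to 10)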
scala> rdd.reduce(_ + _)
res7: Int = 55
If you are using a file located in your local file system, it must be available
on the same path to all the nodes. Thus, you have two options: either you copy it
to each worker node or use a network-shared file system such as HDFS. The
following is an example of how you can load a file located in a distributed file
system:
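A sketch of such a call, with a purely hypothetical HDFS URI and an explicit number of partitions, would look like this:
// Adjust the namenode host, port, and path to your own cluster
val hdfsRdd = sc.textFile("hdfs://namenode:9000/data/sample.txt", 8)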
By default, Spark will split the file's data into as many partitions as the file
has blocks, but as you have just seen, you can request Spark to divide your file
into more partitions. However, what you cannot do is request from Spark
fewer partitions than file blocks.
Apart from the textFile() method, the Spark Scala API also supports
other input data formats. For example, the wholeTextFiles() method can
be used to read multiple small UTF-8-encoded text files from HDFS, a local file
system, or any other Hadoop-compatible URI. While textFile() reads one
or more files and returns one record per line of each file processed, the
wholeTextFiles() method reads the files returning them as a key-value
pair (path of the file, file content), hence preserving the relationship between the
content and the file of origin. The latter might not happen when textFile()
processes multiple files at once, because the data is shuffled and split across
several partitions. Because the process of sequentially processing files depends
on the order they are returned by the file system, the distribution of rows within
the file is not preserved.
Since each file is loaded in memory, wholeTextFiles() is preferred for
small file processing. Additionally, wholeTextFiles() provides a second
parameter to set the minimum number of partitions.
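A sketch of its use follows (the directory is hypothetical, and the second argument is the minimum number of partitions just mentioned):
// Each element is a (filePath, fileContent) pair
val filesRdd = sc.wholeTextFiles("hdfs://namenode:9000/data/small-files/", 4)
filesRdd.keys.collect().foreach(println)   // list the file paths read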
The Apache Spark API also provides methods to handle Hadoop sequence
files. This Hadoop file format is intended to store serialized key-value pairs.
Sequence files are broadly used in MapReduce processing tasks as input and
output formats. The sequence file format offers several advantages such as
compression at the level of record and block. They can be used to wrap up a
large number of small files, thus solving the drawback of some file systems in
processing large numbers of small files.
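For reference, a minimal sketch (the output path is illustrative) of writing and reading a sequence file from a pair RDD:
// Save a pair RDD as a Hadoop sequence file and read it back
val kvRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
kvRdd.saveAsSequenceFile("/tmp/sequence-file-demo")
val restored = sc.sequenceFile[String, Int]("/tmp/sequence-file-demo")
restored.collect().foreach(println)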
Apache Spark also provides a method to save RDDs as serialized Java
objects, a format similar to the Hadoop sequence files mentioned just before.
RDD.saveAsObjectFile and SparkContext.objectFile methods
can be used to save and load RDDs. saveAsObjectFile() uses Java
serialization to store information on a file system and permits saving metadata
information about the data type when written to a file. The following is an
example of how saveAsObjectFile() and
SparkContext.objectFile() can be employed to save and recover a
RDD object:
scala> val list = sc.parallelize(List("España","México","Colombia","Perú","Ecuador"))
list: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[...] at parallelize at <console>:23
scala> list.saveAsObjectFile("/tmp/SpanishCountries")
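The load call is not reproduced in this excerpt; a sketch consistent with the newlist variable used next is:
scala> val newlist = sc.objectFile[String]("/tmp/SpanishCountries")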
scala> newlist.collect
res9: Array[String] = Array(Ecuador, España, México, Colombia,
Perú)
The use of collect() is dangerous because it collects all the RDD data
from all the workers in the driver node; thus, you can run out of memory if the
size of the whole dataset does not fit into the driver memory. It is very inefficient
as well, because all the data from the cluster has to travel through the network,
and this is much slower than writing to disk and much more inefficient than
computation in memory. If you only want to see some samples from your RDD,
it is safer to use the take() method:
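For example, reusing the list RDD created earlier, take() brings only a handful of elements back to the driver; this is a sketch rather than the book's own example:
scala> list.take(2).foreach(println)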
val currencyListRdd = spark.sparkContext.parallelize(List("USD;Euro;GBP;CHF", "CHF;JPY;CNY;KRW", "CNY;KRW;Euro;USD", "CAD;NZD;SEK;MXN"))
val currenciesRdd = currencyListRdd.flatMap(_.split(";"))
val pairRDD = currenciesRdd.map(c=>(c,1))
pairRDD.foreach(println)
(USD,1)
(Euro,1)
(GBP,1)
(CHF,1)
(CHF,1)
(JPY,1)
(CNY,1)
(KRW,1)
(CNY,1)
(KRW,1)
(Euro,1)
(USD,1)
(CAD,1)
(NZD,1)
(SEK,1)
(MXN,1)
The preceding code first creates a session of name “Hands-On Spark 3”
using the .appName() method and a local cluster specified by the parameter
local[n], where n must be greater than 0 and represents the number of cores
to be allocated, hence the number of partitions, by default, RDDs are going to be
split up into. If a SparkSession is available, it is returned by getOrCreate();
otherwise, a new one for our program is created.
Next, the same example is reproduced but using PySpark this time:
currencyList = ["USD;Euro;GBP;CHF", "CHF;JPY;CNY;KRW", "CNY;KRW;Euro;USD", "CAD;NZD;SEK;MXN"]
currencyListRdd = spark.sparkContext.parallelize(currencyList, 4)  # partition count truncated in the original; 4 assumed
currenciesRdd = currencyListRdd.flatMap(lambda x: x.split(";"))
pairRDD = currenciesRdd.map(lambda c: (c, 1))
sampleData = pairRDD.take(5)  # reconstructed: the loop below prints the five pairs shown next
for f in sampleData:
print(str("("+f[0]) +","+str(f[1])+")")
(USD,1)
(Euro,1)
(GBP,1)
(CHF,1)
(CHF,1)
If you want to show the full list, use the collect() method instead of take()
like this:
sampleData = pairRDD.collect()
But be careful. In large datasets, this could cause you overflow problems in
your driver node.
pairRDD.distinct().foreach(println)
(MXN,1)
(GBP,1)
(CHF,1)
(CNY,1)
(KRW,1)
(SEK,1)
(USD,1)
(JPY,1)
(Euro,1)
(NZD,1)
(CAD,1)
Now, here’s another code snippet in PySpark to get the same result:
(GBP,1)
(MXN,1)
(CNY,1)
(KRW,1)
(USD,1)
(Euro,1)
(CHF,1)
(JPY,1)
(CAD,1)
(NZD,1)
(SEK,1)
As you can see in the preceding example, keys are not necessarily returned
sorted. If you want to have your returned data ordered by key, you can use the
sorted() method. Here is an example of how you can do it:
sampleData = sorted(pairRDD.distinct().collect())
for f in sampleData:
print(str("("+f[0]) +","+str(f[1])+")")
(CAD,1)
(CHF,1)
(CNY,1)
(Euro,1)
(GBP,1)
(JPY,1)
(KRW,1)
(MXN,1)
(NZD,1)
(SEK,1)
(USD,1)
pairRDD.sortByKey().foreach(println)
(KRW,1)
(KRW,1)
(CNY,1)
(CNY,1)
(GBP,1)
(NZD,1)
(JPY,1)
(MXN,1)
(Euro,1)
(Euro,1)
(SEK,1)
(USD,1)
(USD,1)
(CAD,1)
(CHF,1)
(CHF,1)
pairRDD.sortByKey(true).foreach(println)
In PySpark, we can use the following code snippet to achieve the same
result:
sampleData = pairRDD.sortByKey().collect()
for f in sampleData:
print(str("("+f[0]) +","+str(f[1])+")")
(CAD,1)
(CHF,1)
(CHF,1)
(CNY,1)
(CNY,1)
(Euro,1)
(Euro,1)
(GBP,1)
(JPY,1)
(KRW,1)
(KRW,1)
(MXN,1)
(NZD,1)
(SEK,1)
(USD,1)
(USD,1)
sparkContext.textFile("hdfs://")
.flatMap(line =>
line.split("ELEMENT_SEPARATOR"))
.map(element => (element,1))
.reduceByKey((a,b)=> (a+b))
To illustrate the power of this function, we are going to use a portion of the
Don Quixote of La Mancha to have a larger dataset. You have already seen how
to load files and transform them into RDDs. So let’s start with an example in
Scala:
val DonQuixoteRdd =
spark.sparkContext.textFile("DonQuixote.txt")
DonQuixoteRdd.foreach(println)
// You would see an output like this
saddle the hack as well as handle the bill-hook. The
age of this
In a village of La Mancha, the name of which I have no
desire to call
gentleman of ours was bordering on fifty; he was of a
hardy habit,
spare, gaunt-featured, a very early riser and a great
sportsman. They
to mind, there lived not long since one of those
gentlemen that keep a
will have it his surname was Quixada or Quesada (for
here there is some
lance in the lance-rack, an old buckler, a lean hack,
and a greyhound
difference of opinion among the authors who write on
the subject),
for coursing. An olla of rather more beef than mutton,
a salad on most
although from reasonable conjectures it seems plain
that he was called
nights, scraps on Saturdays, lentils on Fridays, and a
pigeon or so
Quexana. This, however, is of but little importance to
our tale; it
extra on Sundays, made away with three-quarters of his
income. The rest
of it went in a doublet of fine cloth and velvet
breeches and shoes to
will be enough not to stray a hair's breadth from the
truth in the
telling of it.
match for holidays, while on week-days he made a brave
figure in his
best homespun. He had in his house a housekeeper past
forty, a niece
under twenty, and a lad for the field and market-
place, who used to
val wordsDonQuixoteRdd = DonQuixoteRdd.flatMap(_.split(" "))
val tupleDonQuixoteRdd = wordsDonQuixoteRdd.map(w => (w, 1))
val reduceByKeyDonQuixoteRdd = tupleDonQuixoteRdd.reduceByKey((a, b) => a + b)
// Finally, you can see the values merged by key and
added.
// The output has been truncated.
reduceByKeyDonQuixoteRdd.foreach(println)
(Quesada,1)
(went,1)
(under,1)
(call,1)
(this,1)
...
(made,2)
(it,4)
(on,7)
(he,3)
(in,5)
(for,3)
(the,9)
(a,15)
(or,2)
(was,4)
(to,6)
(breeches,1)
(more,1)
(of,13)
println("Count : "+reduceByKeyDonQuixoteRdd.count())
Count : 157
As usual, you can achieve the same results employing PySpark code. Let me
show it to you with an example. In this case most of the outputs have been
suppressed, but believe me the final result is the same:
DonQuixoteRdd = spark.sparkContext.textFile("DonQuixote.txt")
DonQuixoteRdd2 = DonQuixoteRdd.flatMap(lambda x: x.split(" "))
DonQuixoteRdd3 = DonQuixoteRdd2.map(lambda x: (x, 1))
DonQuixoteRddReduceByKey = DonQuixoteRdd3.reduceByKey(lambda x, y: x + y)
print("Count : " + str(DonQuixoteRddReduceByKey.count()))
Count : 157
reduceByKeyDonQuixoteRdd.saveAsTextFile("RDDDonQuixote")
You can also create a temporary directory to store your files, and instead of
letting your operating system decide where to make that directory, you can have
control over those parameters. Here is an example in PySpark:
import tempfile
from tempfile import NamedTemporaryFile
tempfile.tempdir = "./"
RDDDonQuixote = NamedTemporaryFile(delete=True)
RDDDonQuixote.close()
DonQuixoteRdd3.saveAsTextFile(RDDDonQuixote.name)
print(RDDDonQuixote)
print(RDDDonQuixote.name)
# Output
<tempfile._TemporaryFileWrapper object at
0x7f9ed1e65040>
/Users/aantolinez/tmp906w7eoy
If you just want to add the values by key performing a sum, both
reduceByKey and aggregateByKey will produce the same result. You can see an
example in the following:
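A sketch of that equivalence, assuming a pairs RDD consistent with the aggregateByKey example shown a bit further down:
// Assumed sample data: keys "a" and "b" with some integer values
val pairs = sc.parallelize(Seq(("a", 1), ("a", 3), ("a", 5), ("a", 7), ("b", 2), ("b", 4), ("b", 6)))
pairs.reduceByKey(_ + _).collect()               // e.g., Array((a,16), (b,12))
pairs.aggregateByKey(0)(_ + _, _ + _).collect()  // same result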
Let’s now assume you are interested in a different sort of operation, implying
the values returned are of a different kind than those of the origin. For example,
imagine your desired output is a set of values, which is a different data type than
the values themselves (integers) and the operations inside each partition (sum of
integers returns another integer).
Next, we explain this idea with an example:
import scala.collection.mutable.HashSet
val outcomeSets = pairs.aggregateByKey(new HashSet[Int])(_ + _, _ ++ _)
// _+_ adds a value to a set
// _++_ joins the two sets
outcomeSets.collect
res52: Array[(String,
scala.collection.mutable.HashSet[Int])] =
Array((a,Set(1, 5, 3, 7)), (b,Set(2, 6, 4)))
def to_list(x):
return [x]
def append(x, y):
x.append(y) # The append() method adds the y
element to the x list.
return x
def extend(x, y):
x.extend(y) # The extend() method adds the
elements of list y to the end of the x list.
return x
sorted(pairs.combineByKey(to_list, append,
extend).collect())
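The Scala code producing the output echoed next is missing from this excerpt; it can be reconstructed from the echo itself as:
val countriesTuples = Seq(("España", 1), ("Kazakhstan", 1), ("Denmark", 1), ("España", 1), ("España", 1), ("Kazakhstan", 1), ("Kazakhstan", 1))
val countriesDs = spark.sparkContext.parallelize(countriesTuples)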
// Output
countriesTuples: Seq[(String, Int)] = List((España,1),
(Kazakhstan,1), (Denmark,1), (España,1), (España,1),
(Kazakhstan,1), (Kazakhstan,1))
countriesDs: org.apache.spark.rdd.RDD[(String, Int)] =
ParallelCollectionRDD[32] at parallelize at
<console>:29
countriesDs.collect.foreach(println)
// Output
(España,1)
(Kazakhstan,1)
(Denmark,1)
(España,1)
(España,1)
(Kazakhstan,1)
(Kazakhstan,1)
Now we will group the values by key using the groupByKey() method:
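The call itself is not shown in this excerpt; consistent with the echo below, it would be:
val groupRDDByKey = countriesDs.groupByKey()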
// Output
groupRDDByKey: org.apache.spark.rdd.RDD[(String,
Iterable[Int])] = ShuffledRDD[34] at groupByKey at
<console>:26
groupRDDByKey.collect.foreach(println)
// Output
(España,CompactBuffer(1, 1, 1))
(Kazakhstan,CompactBuffer(1, 1, 1))
(Denmark,CompactBuffer(1))
As you can see in the preceding code, groupByKey() groups the data with
respect to every key, and an iterator is returned. Note that unlike
reduceByKey(), the groupByKey() function doesn’t perform any
operation on the final output; it only groups the data and returns it in the form of
an iterator. This iterator can be used to transform a key-value RDD into any kind
of collection like a List or a Set.
Now imagine you want to know the number of occurrences of every country
and then you want to convert the prior CompactBuffer format to a List:
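One way to do both, sketched from the grouped RDD created above (the first variable name matches the output that follows):
// Sum the grouped ones to count occurrences per country
val countryCountRDD = groupRDDByKey.map { case (country, ones) => (country, ones.sum) }
// Turn the CompactBuffer values into plain Scala Lists
val countryListRDD = groupRDDByKey.mapValues(_.toList)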
countryCountRDD.collect.foreach(println)
// Output
(España,3)
(Kazakhstan,3)
(Denmark,1)
val countriesDs =
spark.sparkContext.parallelize(countriesList)
// Output
countriesList: List[String] = List(España, Kazakhstan,
Denmark, España, España, Kazakhstan, Kazakhstan)
countriesDs: org.apache.spark.rdd.RDD[String] =
ParallelCollectionRDD[39] at parallelize at
<console>:28
val countryPairsRDD =
sc.parallelize(countriesList).map(country => (country,
1))
// Output
countryPairsRDD: org.apache.spark.rdd.RDD[(String,
Int)] = MapPartitionsRDD[51] at map at <console>:27
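The two counting variants compared in the output below are not reproduced in this excerpt; a sketch consistent with it is:
val countryCountsWithReduce = countryPairsRDD.reduceByKey(_ + _).collect()
val countryCountsWithGroup  = countryPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()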
countryCountsWithReduce: Array[(String, Int)] =
Array((España,3), (Kazakhstan,3), (Denmark,1))
countryCountsWithGroup: Array[(String, Int)] =
Array((España,3), (Kazakhstan,3), (Denmark,1))
On the other hand, when you use groupByKey(), the key-value pairs on
each partition are shuffled across the nodes of the cluster. When you are working
with big datasets, this behavior requires the unnecessary movement of huge
amounts of data across the network representing an important process overhead.
Another reason to avoid the use of groupByKey() on large datasets is the
possibility of out-of-memory (OutOfMemoryError) situations on the worker nodes.
Remember Spark must write data to disk whenever the data cannot fit in memory.
The out-of-memory situation can happen when a single executor machine receives
more data than can be accommodated in its memory, causing a memory overflow.
Spark saves data to disk one key at a time; thus, the
process of flushing out data to a permanent storage device seriously disrupts a
Spark operation.
Thus, the bigger the dataset, the more likely the occurrence of out-of-
memory problems. Therefore, in general, reduceByKey(), combineByKey(),
foldByKey(), and similar operations are preferable to groupByKey() for big
datasets.
The groupByKey() internal operational mode is graphically shown in
Figure 3-5.
The same results can be achieved using PySpark code as you see just in the
following:
rdd1 = spark.sparkContext.parallelize([("PySpark",10),
("Scala",15),("R",100)])
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
joinedRDD = rdd1.join(rdd2)
print(joinedRDD.collect())
# Output
[('Scala', (15, 11)), ('Scala', (15, 20)), ('PySpark',
(10, 75)), ('PySpark', (10, 35))]
In the following, you can see how to apply the leftOuterJoin() in Scala:
You will obtain the same result using PySpark code as you see in the
following:
rdd1 = spark.sparkContext.parallelize([("PySpark",10),
("Scala",15),("R",100)])
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
joinedRDD = rdd1.leftOuterJoin(rdd2)
print(joinedRDD.collect())
# Output
[('R', (100, None)), ('Scala', (15, 11)), ('Scala',
(15, 20)), ('PySpark', (10, 75)), ('PySpark', (10,
35))]
rdd1 = spark.sparkContext.parallelize([("PySpark",10),
("Scala",15),("R",100)])
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
joinedRDD = rdd1.rightOuterJoin(rdd2)
print(joinedRDD.collect())
# Output
[('Scala', (15, 11)), ('Scala', (15, 20)), ('PySpark',
(10, 75)), ('PySpark', (10, 35))]
Once again, you can get the same results by using PySpark code as you see
in the following:
rdd1 = spark.sparkContext.parallelize([("PySpark",10),
("Scala",15),("R",100)])
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
joinedRDD = rdd1.leftOuterJoin(rdd2)
print(joinedRDD.collect())
# Output
[('R', (100, None)), ('Scala', (15, 11)), ('Scala',
(15, 20)), ('PySpark', (10, 75)), ('PySpark', (10,
35))]
rdd1.sortByKey(true).foreach(println) // ascending
order (true)
// Output
(R,100)
(PySpark,10)
(Scala,15)
rdd1.sortByKey(false).foreach(println) // descending
order (false)
// Output
(PySpark,10)
(R,100)
(Scala,15)
The sortBy() function accepts three arguments. The first one is a key
function (keyfunc), which sorts an RDD based on the designated key and
returns another RDD.
The second one is a flag that specifies whether the results should be returned
in ascending or descending order. The default is ascending (true).
The third parameter (numPartitions) specifies the total number of partitions
the result is going to be divided into. numPartitions is an important optimization
parameter, because sortBy() involves the shuffling of the elements of RDDs,
and we have already seen it can involve unnecessary data movement.
Let’s now take a look at how sortBy() works with an example, taking
advantage of the RDD1 created from previous examples:
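The call is missing from this excerpt; sorting rdd1 by its values (the second element of each tuple) is a sketch consistent with the output shown next:
rdd1.sortBy(_._2).collect().foreach(println)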
// Output
(PySpark,10)
(Scala,15)
(R,100)
rdd2.countByKey()
// Output
res48: scala.collection.Map[String,Long] = Map(PySpark
-> 2, Scala -> 2)
rdd2.countByKey().foreach(println)
// Output
(PySpark,2)
(Scala,2)
You can access the elements of the elementsCount dictionary just as you
would do for an ordinary dictionary:
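In Scala, countByKey() returns a Map that can be indexed by key; a brief sketch:
val elementsCount = rdd2.countByKey()
elementsCount("Scala")    // Long = 2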
Now you are going to see the same example, but this time using PySpark:
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
rdd2.collect()
# Output
[('Scala', 11), ('Scala', 20), ('PySpark', 75),
('PySpark', 35)]
# Output
defaultdict(int, {'Scala': 2, 'PySpark': 2})
elementsCount = rdd2.countByKey()
print(elementsCount)
#Output
defaultdict(<class 'int'>, {'Scala': 2, 'PySpark': 2})
Now you can access the elements of the elementsCount dictionary just as
you would do for an ordinary dictionary:
elementsCount['Scala']
# Output
2
elementsCount['SQL']
# Output
0
println(rdd2.countByValue())
//Output:
Map((PySpark,35) -> 1, (Scala,11) -> 1, (Scala,20) ->
1, (PySpark,75) -> 1)
Continuing with our previous RDD example, now we are going to see how to
use countByValue() with PySpark:
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
sorted(rdd2.countByValue().items())
# Output
[(('PySpark', 35), 1),
(('PySpark', 75), 1),
(('Scala', 11), 1),
(('Scala', 20), 1)]
rdd1.collectAsMap()
// Output
res62: scala.collection.Map[String,Int] = Map(R ->
100, Scala -> 15, PySpark -> 10)
However, if you have duplicate keys, the last key-value pair will overwrite
the former ones. In the following, the tuple (“Scala”,11) has been overwritten by
(“Scala”,20):
rdd2.collectAsMap()
// Output
res63: scala.collection.Map[String,Int] = Map(Scala ->
20, PySpark -> 35)
Here is now the same example, but with PySpark this time:
rdd1 = spark.sparkContext.parallelize([("PySpark",10),
("Scala",15),("R",100)])
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
rdd1.collectAsMap()
# Output
{'PySpark': 10, 'Scala': 15, 'R': 100}
Remember that if you have duplicate keys, the last key-value pair will
overwrite the previous ones. In the following, the tuple (“Scala”,11) has been
overwritten by (“Scala”,20):
rdd2.collectAsMap()
# Output
{'Scala': 20, 'PySpark': 35}
rdd2.lookup("PySpark")
// Output
res66: Seq[Int] = WrappedArray(75, 35)
rdd2 = spark.sparkContext.parallelize([("Scala",11),
("Scala",20),("PySpark",75), ("PySpark",35)])
rdd2.lookup("PySpark")
# Output
[75, 35]
Broadcast Variables
Broadcast variables are read-only variables that allow maintaining a cached
variable in each cluster node instead of transporting it with each task every time
tasks are sent to the executors. Therefore, each executor will keep a local copy of
the broadcast variable; in consequence, no network I/O is needed.
Broadcast variables are transferred once from the driver to the executors and
used by tasks running there as many times as necessary, minimizing data
movement through the network as a result because that information is not
transferred to the executors every time a new task is delivered to them.
In Figure 3-9 we explain graphically the difference between using broadcast
variables and normal variables to share information with the workers.
Figure 3-9 Difference between broadcast variables and normal variables
If you look at the left side of Figure 3-9, we use a map operation to multiply
every RDD element by the external multiplier variable. In operating like this, a
copy of the multiplier variable will be distributed with every task to each
executor of the cluster.
On the other hand, if you look at the right side of Figure 3-9, a single copy of
the broadcast variable is transmitted to each node and shared among all the tasks
running on them, therefore potentially saving an important amount of memory
and reducing network traffic.
Broadcast variables have the value() method, to store the data and access
the broadcasted information.
bVariable.value
res70: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
# Python code for broadcast variables
bVariable = spark.sparkContext.broadcast([1, 2, 3, 4,
5, 6, 7, 8, 9])
bVariable.value
# Output
[1, 2, 3, 4, 5, 6, 7, 8, 9]
Accumulators
Accumulators are variables used to track and update information across a
cluster’s executors. Accumulators can be used to implement counters and sums,
and they can only be “added” to through associative and commutative
operations.
You can see in Figure 3-10 a graphical representation of the process by
which accumulators are used to collect data at the executor level and bring it to
the driver node.
// Output
(Acc value: ,15)
Next is the same code snippet, but this time written with PySpark:
# Output
Acc value: 15
3.5 Summary
In this chapter we briefly looked at the concept of the Spark low-level API, the
notion of Spark Resilient Distributed Datasets (RDDs) as Spark building blocks
to construct other Spark data structures such as DataFrames and datasets with a
higher level of technical isolation. We also covered the most essential operations
that you can perform using RDDs, and finally we also explained the so-called
Spark shared variables: broadcasts and accumulators. In the next chapter, we are
going to focus on the Spark high-level API and how to use it in the big data
world.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_4
Spark SQL, dataframes, and datasets are Spark high-level API components
intended for structured data manipulation, allowing Spark to automatically
improve storage and computation performance. Structured data is information
organized into standardized structure of schema, which makes it accessible and
analyzable without further treatment. Examples of structured data are database
tables, Excel sheets, RDBMS tables, Parquet files, and so on.
Spark’s high-level APIs allow the optimization of applications working with
a certain kind of data, like binary format files, beyond the limits permitted by
Spark’s RDD, for example. Dataframes and datasets take advantage of the Spark
SQL’s Catalyst Optimizer and Spark Project Tungsten, studied later in this
chapter, to optimize their performance.
The most important difference between the Dataset API and DataFrame API
is probably that the Dataset implements type safety at compile time. Datasets
enact compile-time type safety, whereas DataFrames do not. Spark verifies
DataFrame data types comply with those defined in its schema at runtime,
whereas dataset data types are validated at compile time. We will cover the
concept of compile-time type safety in detail later on in this chapter.
root
|-- Nombre: string (nullable = true)
|-- Primer_Apellido: string (nullable = true)
|-- Segundo_Apellido: string (nullable = true)
|-- Edad: integer (nullable = true)
|-- Sexo: string (nullable = true)
As you can see in the preceding schema, every dataframe column includes a
set of attributes such as name, data type, and a nullable flag, which represents
whether it accepts null values or not.
The Dataframe API is a component of the Spark SQL module and is
available for all programming languages such as Java, Python, SparkR, and
Scala. Unlike RDDs, dataframes provide automatic optimization, but unlike the
former, they do not provide compile-time type safety. This means that while with
RDDs and datasets the compiler knows the columns’ data types (string, integer,
StructType, etc.), when you work with dataframes, values returned by actions are
an array of rows without a defined data type. You can cast the values returned to
a specific type employing Scala´s asInstanceOf() or PySpark’s cast()
method, for example.
Let’s analyze how the implementation of type safety influences Spark
application behavior with three practical examples.
For that purpose we are going to use a small dataset populated with just three
of the most prominent Spanish writers of all times. First of all we are going to
show how type safety influences the use of a lambda expression in a filter or
map function. The following is the code snippet.
First of all we create a case class SpanishWritersDataFrame
APISpanishWriters including four personal writer’s attributes:
In the next step, we create a RDD from the preceding set of data:
val SpanishWritersRDD =
spark.sparkContext.parallelize(SpanishWritersData)
Now we are going to see the differences between the data entities when using
a lambda function to filter the data:
// Dataframe
val writersDFResult = writersDF.filter(writer =>
writer.Edad > 53)
// Output
error: value Edad is not a member of
org.apache.spark.sql.Row val writersDFResult =
writersDF.filter(writer => writer.Edad > 53)
^
//Dataset
val writersDSResult = writersDS.filter(writer =>
writer.Edad > 53)
// Output
writersDSResult:
org.apache.spark.sql.Dataset[SpanishWriters] =
[Nombre: string, Apellido: string ... 2 more fields]
Please, pay attention to the different output we get when filtering the
information in both data structures. When we apply filter to a dataframe, the
lambda function implemented is returning a Row-type object and not an integer
value as you probably were expecting, so it cannot be used to compare it with an
integer (53 in this case). Thus, using just the column name, we cannot retrieve
the value coded as a Row object. To get the Row object value, you have to
typecast the value returned to an integer. Therefore, we need to change the code
as follows:
The preceding example shows one of the reasons datasets were introduced.
The developer does not need to know the data type returned beforehand.
Another example of compile-time type safety appears when we query a
nonexisting column:
// Dataframe
val writersDFBirthday = writersDF.select("Birthday")
// Output
rg.apache.spark.sql.AnalysisException: Column
'Birthday' does not exist. Did you mean one of the
following? [Edad, Apellido, Nombre, Sexo];
// Dataset
val writersDSBirthday = writersDS.map(writer =>
writer.Birthday)
// Output
error: value Birthday is not a member of
SpanishWriters
val writersDSBirthday = writersDS.map(writer =>
writer.Birthday)
^
In the preceding example, you can see the difference between execution time
(dataframe) and compile time (dataset). The former will throw an error only at
runtime, while the latter will give you an error message at compile time.
Another case in which we are going to find a different behavior between
DataFrames and datasets is when we want to revert them to a primitive RDD. In
this case DataFrame reversion to RDD won’t preserve the data schema, while
dataset reversion will. Let’s see it again with an example:
rddFromDF.map(writer =>
writer.Nombre).foreach(println)
// Output
error: value Nombre is not a member of
org.apache.spark.sql.Row
Now, we are going to do the same operation, but this time with our dataset:
// Dataset reversion to RDD
val rddFromDS = writersDS.rdd
// Output
rddFromDS: org.apache.spark.rdd.RDD[SpanishWriters] =
MapPartitionsRDD[252] at rdd at
rddFromDS.map(writer =>
writer.Nombre).foreach(println)
// Output
Luis
Fancisco
Miguel
val carsData=Seq(("USA","Chrysler","Chrysler
300",292),("Germany","BMW","BMW 8 Series",617),
("Spain", "Spania GTA", "GTA Spano",925))
val carsRdd = spark.sparkContext.parallelize(carsData)
// Seq to RDD
val dfCars = carsRdd.toDF() // RDD to DF
dfCars.show()
// Output
+-------+----------+------------+---+
| _1| _2| _3| _4|
+-------+----------+------------+---+
| USA| Chrysler|Chrysler 300|292|
|Germany| BMW|BMW 8 Series|617|
| Spain|Spania GTA| GTA Spano|925|
+-------+----------+------------+---+
dfCars.printSchema()
// Output
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
|-- _4: integer (nullable = false)
val dfBrandedCars =
carsRdd.toDF("Country","Manufacturer","Model","Power"
)
dfBrandedCars.show()
// Output
+-------+------------+------------+-----+
|Country|Manufacturer| Model|Power|
+-------+------------+------------+-----+
| USA| Chrysler|Chrysler 300| 292|
|Germany| BMW|BMW 8 Series| 617|
| Spain| Spania GTA| GTA Spano| 925|
+-------+------------+------------+-----+
The conclusion we obtain from the preceding example is that using toDF()
we have no control over the dataframe schema. This means we have no control
over column types and nullable flags.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.
{IntegerType,StringType, StructField, StructType}
// First of all we create a schema for the carsData
dataset.
val carSchema = StructType( Array(
StructField("Country", StringType,true),
StructField("Manufacturer", StringType,true),
StructField("Model", StringType,true),
StructField("Power", IntegerType,true)
))
// Notice we are using here the carsRdd RDD shown in
the previous example
val carsRowRdd = carsRdd.map(carSpecs =>
Row(carSpecs._1, carSpecs._2, carSpecs._3,
carSpecs._4))
val dfCarsFromRDD =
spark.createDataFrame(carsRowRdd,carSchema)
dfCarsFromRDD.show()
// Output
+-------+------------+------------+-----+
|Country|Manufacturer| Model|Power|
+-------+------------+------------+-----+
| USA| Chrysler|Chrysler 300| 292|
|Germany| BMW|BMW 8 Series| 617|
| Spain| Spania GTA| GTA Spano| 925|
+-------+------------+------------+-----+
val personasDF =
spark.read.load("/Users/aantolinez/Downloads/personas.csv")
val personasDF =
spark.read.load("/Users/aantolinez/Downloads/personas.parquet")
personasDF =
spark.read.load("/Users/aantolinez/Downloads/personas.csv")
// Output
Caused by: java.lang.RuntimeException:
file:/Users/aantolinez/Downloads/personas.csv is not a
Parquet file. Expected magic number at tail, but found [48,
44, 70, 10]
personasDF =
spark.read.load("/Users/aantolinez/Downloads/personas.parquet")
personasDF.show(1)
# Output
+------+---------------+----------------+----+----+
|Nombre|Primer_Apellido|Segundo_Apellido|Edad|Sexo|
+------+---------------+----------------+----+----+
|Miguel| de Cervantes| Saavedra| 50| M|
+------+---------------+----------------+----+----+
spark.sql("SELECT * FROM
parquet.`/Users/aantolinez/Downloads/personas.parquet`").show()
// Output Scala and PySpark
+--------+-----------------+-------------------+----+----+
| Nombre| Primer_Apellido| Segundo_Apellido|Edad|Sexo|
+--------+-----------------+-------------------+----+----+
| Miguel| de Cervantes| Saavedra| 50| M|
|Fancisco| Quevedo|Santibáñez Villegas| 55| M|
| Luis| de Góngora| y Argote| 65| M|
| Teresa|Sánchez de Cepeda| y Ahumada| 70| F|
+--------+-----------------+-------------------+----+----+
Time-Based Paths
Spark provides modifiedBefore and modifiedAfter options for time
control over files that should be loaded at query time.
modifiedBefore takes a timestamp as a parameter instructing Spark to
only read files whose modification time occurred before the given time.
Similarly, modifiedAfter also takes a timestamp as a parameter but this time
commanding Spark to only load files whose modification time took place after
the given time. In both cases timestamp must have the following format: YYYY-
MM-DDTHH:mm:ss (e.g. 2022-10-29T20:30:50).
Let’s see this Spark behavior with an example in Scala and later on in
PySpark:
modifiedAfterDF.show();
We can get the same result using PySpark code as you can see in the
following:
modifiedAfterDF = spark.read.format("csv") \
.option("header", "true") \
.option("modifiedAfter", "2022-10-30T05:30:00") \
.load("/Users/aantolinez/Downloads/Hands-On-
Spark3");
modifiedAfterDF.show();
+--------+-----------------+-------------------+----+----+
| Nombre| Primer_Apellido| Segundo_Apellido|Edad|Sexo|
+--------+-----------------+-------------------+----+----+
| Miguel| de Cervantes| Saavedra| 50| M|
|Fancisco| Quevedo|Santibáñez Villegas| 55| M|
| Luis| de Góngora| y Argote| 65| M|
| Teresa|Sánchez de Cepeda| y Ahumada| 70| F|
+--------+-----------------+-------------------+----+----+
Now if you have a look at the saved data in Figure 4-2, you see something
surprising.
Figure 4-2 Spark saved data
NOTICE
Be careful when using the coalesce() and/or repartition() method
with large data volumes as you can overload the driver memory and get into
trouble facing OutOfMemory problems.
Let’s see how to get a single file with a simple example valid for Scala and
PySpark:
SpanishWritersAppendDF.coalesce(1)
.write.csv("/Users/aantolinez/Downloads/personas_coalesce.csv")
Now, when you look at the output of the preceding code (Figure 4-3), you
can see a single CSV file; however, the folder, _SUCCESS file, and .crc hidden
files are still there.
Figure 4-3 Spark saving to a single file
For further refinement, such as removing the folder and _SUCCESS and .crc
hidden files, you would have to use the Hadoop file system library to manipulate
the final output.
Spanish_Writers_by_Century.parquet:
_SUCCESS part-00000-e4385fd4-fcc0-4a5c-8632-d0080438fa82-c00
The compression codec can be set to none, uncompressed, snappy, gzip, lzo,
brotli, lz4, and zstd, overriding the
spark.sql.parquet.compression.codec. Data can be appended to a
Parquet file using the append option:
// Saving data with gzip compression codec compression option
SpanishWritersDf.write.mode("append").option("compression",
"gzip").parquet("/Users/aantolinez/Downloads/Spanish_Writers_by
Spanish_Writers_by_Century.parquet:
_SUCCESS part-00000-e4385fd4-fcc0-4a5c-8632-
d0080438fa82-c000.gz.parquet
part-00000-d070dd4d-86ca-476f-8e67-060365db7ca7-
c000.snappy.parquet
schemaWriters = StructType([
StructField("Name",StringType(),True),
StructField("Surname",StringType(),True),
StructField("Century",StringType(),True),
StructField("YearOfBirth", IntegerType(), True)
])
SpanishWritersDf = spark.read.option("header", "true") \
.schema(schemaWriters) \
.csv("/Users/aantolinez/Downloads/Spanish_Writers_by_Century.cs
val parquetDF =
spark.read.parquet("/Users/aantolinez/Downloads/Spanish_Writers
parquetDF.createOrReplaceTempView("TempTable")
val sqlDf = spark.sql("select * from TempTable where YearOfBirt
sqlDf.show()
// Output
+--------+-----------+-------+-----------+
| Name| Surname|Century|YearOfBirth|
+--------+-----------+-------+-----------+
|Calderón|de la Barca| XVII| 1600|
+--------+-----------+-------+-----------+
You can get the same result using PySpark code as you see in the following
code snippet:
parquetDF =
spark.read.parquet("/Users/aantolinez/Downloads/Spanish_Writers
parquetDF.createOrReplaceTempView("TempTable")
sqlDf = spark.sql("select * from TempTable where YearOfBirth =
sqlDf.show()
# Output
+--------+-----------+-------+-----------+
| Name| Surname|Century|YearOfBirth|
+--------+-----------+-------+-----------+
|Calderón|de la Barca| XVII| 1600|
+--------+-----------+-------+-----------+
Spark creates a folder hierarchy based on “Century” as the first partition key
and recursively a group of subfolders for “Gender”, the second partition key.
You can see the mentioned hierarchy in Figure 4-5.
The following PySpark code snippet will return to you the same output:
schemaWriters = StructType([
StructField("Name",StringType(),True),
StructField("Surname",StringType(),True),
StructField("Century",StringType(),True),
StructField("YearOfBirth", IntegerType(),True),
StructField("Gender",StringType(),True),
])
SpanishWritersDf.write.partitionBy("Century","Gender") \
.parquet("/Users/aantolinez/Downloads/Spanish_Writers_by_Gender
val partitionDF =
spark.read.parquet("/Users/aantolinez/Downloads/Spanish_Writers
partitionDF.show()
+----+---------------+-----------+------+
|Name| Surname|YearOfBirth|Gender|
+----+---------------+-----------+------+
|José|Ortega y Gasset| 1883| M|
+----+---------------+-----------+------+
partitionDF =
spark.read.parquet("/Users/aantolinez/Downloads/Spanish_Writers
partitionDF.show()
{"id":1,"first_name":"Luis","last_name":"Ortiz","email":"luis.o
16","registered":false},
{"id":2,"first_name":"Alfonso","last_name":"Antolinez","email":
03-11","registered":true},
{"id":3,"first_name":"Juan","last_name":"Dominguez","email":"jd
15","registered":true},
{"id":4,"first_name":"Santiago","last_name":"Sanchez","email":"
10-31","registered":false}
To load a single and simple JSON file, you can use the read.json() as
the example shown next:
val df =
spark.read.json("/Users/aantolinez/Downloads/Spaniards.json")
Unlike other data source formats, Spark has the capacity to infer the data
schema while reading a JSON file:
df.printSchema()
// Output
root
|-- country: string (nullable = true)
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
|-- registered: boolean (nullable = true)
|-- updated: string (nullable = true)
The same result is obtained when we use PySpark code as shown next:
jsonDf =
spark.read.json("/Users/aantolinez/Downloads/Spaniards.json")
jsonDf.printSchema()
# Output
root
|-- country: string (nullable = true)
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
|-- registered: boolean (nullable = true)
|-- updated: string (nullable = true)
However, in many real cases, you are going to find files formatted as arrays
of JSON strings. One example of this file format is shown in the following:
[{"id":1,"first_name":"Luis","last_name":"Ortiz","email":"luis.
16","registered":false},
{"id":2,"first_name":"Alfonso","last_name":"Antolinez","email":
03-11","registered":true},
{"id":3,"first_name":"Juan","last_name":"Dominguez","email":"jd
15","registered":true},
{"id":4,"first_name":"Santiago","last_name":"Sanchez","email":"
10-31","registered":false}]
These kinds of files are known as multiline JSON strings. For multiline
JSON files, you have to use .option("multiline","true") while
reading the data. Let’s see how it works with an example in Scala and PySpark:
multipleJsonsDf = spark.read.option("multiline","true") \
.json(["/Users/aantolinez/Downloads/Spaniards_array.json", \
"/Users/aantolinez/Downloads/Spaniards_array2.json"])
multipleJsonsDf.show(10,False)
# Output
+-------+-------------------+----------+---+---------+---------
-----+
|country|email |first_name|id
|last_name|registered|updated |
+-------+-------------------+----------+---+---------+---------
-----+
|Spain |[email protected] |Luis |1 |Ortiz |false
05-16|
|Spain |[email protected] |Alfonso |2 |Antolinez|true
03-11|
|Spain |[email protected] |Juan |3 |Dominguez|true
02-15|
|Spain |[email protected]|Santiago |4 |Sanchez |false
10-31|
|Spain |[email protected]|Luis |1 |Herrera |false
05-15|
|Spain |[email protected] |Marcos |2 |Abad |true
03-21|
|Spain |[email protected] |Juan |3 |Abalos |true
02-14|
|Spain |[email protected] |Santiago |4 |Amo |false
10-21|
+-------+-------------------+----------+---+---------+---------
-----+
You can get exactly the same result using PySpark code as follows:
patternJsonsDf =
spark.read.option("multiline","true").json(
"/Users/aantolinez/Downloads/Spaniards_array*.json")
patternJsonsDf.show(20, False)
In a similar way, you can use patterns to load all the JSON files from a
folder. For example, the following code snippets will allow you to read all the
JSON files from a directory and only JSON files:
Similarly, if you want to read all the files in a directory, you can use the
following code:
As usual, let’s see now how to implement it with Scala and PySpark coding:
Exactly the same outcome can be achieved using a more compressed code:
Now we are going to show how to get the same result using PySpark code:
multipleJsonsDf.write
.json("/Users/aantolinez/Downloads/Merged_Spaniards_array.json
ls Downloads/Merged_Spaniards_array.json
_SUCCESS
part-00000-69975a01-3566-4d2d-898d-cf9e543d81c3-
c000.json
part-00001-69975a01-3566-4d2d-898d-cf9e543d81c3-
c000.json
Saving Modes
As it is with other file formats, the saving modes applicable to JSON files are the
same as those shown earlier in Table 4-1. In the next code snippet, you can see
how to append data to an already existing JSON file:
multipleJsonsDf.write.mode("append").json("/Users/aantolinez/Do
ls Downloads/Merged_Spaniards_array.json
_SUCCESS
part-00000-188063e9-e5f6-4308-b6e1-7965eaa46c80-c000.json
part-00000-7453b1ad-f3b6-4e68-80eb-254fb539c04d-c000.json
part-00001-188063e9-e5f6-4308-b6e1-7965eaa46c80-c000.json
part-00001-7453b1ad-f3b6-4e68-80eb-254fb539c04d-c000.json
import org.apache.spark.sql.types.{StructType,StructField,
StringType,IntegerType,BooleanType,DateType}
val schemaSpaniardsDf =
spark.read.schema(schemaSpaniards).json("/Users/aantolinez/Down
We can see how the new DataFrame matches the data schema previously
defined:
schemaSpaniardsDf.printSchema()
// Output
root
|-- id: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- country: string (nullable = true)
|-- updated: date (nullable = true)
|-- registered: boolean (nullable = true)
Now we can see the final result after loading the JSON file based on our
customized schema:
schemaSpaniardsDf.show(false)
// Output
+---+----------+---------+-------------------+-------+---------
---+
|id
|first_name|last_name|email |country|updated |re
+---+----------+---------+-------------------+-------+---------
---+
|1 |Luis |Ortiz |[email protected] |Spain |2015-05-
16|false |
|2 |Alfonso |Antolinez|[email protected] |Spain |2015-03-
11|true |
|3 |Juan |Dominguez|[email protected] |Spain |2015-02-
15|true |
|4 |Santiago |Sanchez |[email protected]|Spain |2014-10-
31|false |
+---+----------+---------+-------------------+-------+---------
---+
As usual, you get the same result using PySpark code. Let’s repeat the
previous steps, but this time written in PySpark:
schemaSpaniards = StructType([ \
StructField("id",IntegerType(),nullable=True), \
StructField("first_name",StringType(),nullable=True), \
StructField("last_name",StringType(),nullable=True), \
StructField("email",StringType(),nullable=True), \
StructField("country",StringType(),nullable=True), \
StructField("updated",DateType(),nullable=True), \
StructField("registered",BooleanType(),nullable=True)])
schemaSpaniardsDf.printSchema()
# Output
root
|-- id: integer (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- email: string (nullable = true)
|-- country: string (nullable = true)
|-- updated: date (nullable = true)
|-- registered: boolean (nullable = true)
Finally, you can also see the same result, as the one obtained with the Scala
script:
schemaSpaniardsDf.show(4, False)
# Output
+---+----------+---------+-------------------+-------+---------
---+
|id
|first_name|last_name|email |country|updated |re
+---+----------+---------+-------------------+-------+---------
---+
|1 |Luis |Ortiz |[email protected] |Spain |2015-05-
16|false |
|2 |Alfonso |Antolinez|[email protected] |Spain |2015-03-
11|true |
|3 |Juan |Dominguez|[email protected] |Spain |2015-02-
15|true |
|4 |Santiago |Sanchez |[email protected]|Spain |2014-10-
31|false |
+---+----------+---------+-------------------+-------+---------
---+
root
|-- Book: struct (nullable = true)
| |-- Authors: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- firstname: string (nullable =
true)
| | | |-- lastname: string (nullable =
true)
| |-- DOI: string (nullable = true)
| |-- Editors: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- firstname: string (nullable =
true)
| | | |-- lastname: string (nullable =
true)
| |-- ISBN: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Hardcover ISBN: string
(nullable = true)
| | | |-- Softcover ISBN: string
(nullable = true)
| | | |-- eBook ISBN: string (nullable =
true)
| |-- Id: long (nullable = true)
| |-- Publisher: string (nullable = true)
| |-- Title: struct (nullable = true)
| | |-- Book Subtitle: string (nullable =
true)
| | |-- Book Title: string (nullable = true)
| |-- Topics: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- eBook Packages: array (nullable = true)
| | |-- element: string (containsNull = true)
And you would like to transform it into a schema like the one shown in the
following:
root
|-- Afirstname: string (nullable = true)
|-- Alastname: string (nullable = true)
|-- DOI: string (nullable = true)
|-- Efirstname: string (nullable = true)
|-- Elastname: string (nullable = true)
|-- Hardcover ISBN: string (nullable = true)
|-- Softcover ISBN: string (nullable = true)
|-- eBook ISBN: string (nullable = true)
|-- Id: long (nullable = true)
|-- Publisher: string (nullable = true)
|-- Book Subtitle: string (nullable = true)
|-- Book Title: string (nullable = true)
|-- Topics: string (nullable = true)
|-- eBook Packages: string (nullable = true)
Hence, flatten the data and get a final Spark DataFrame as the one shown in
Figure 4-6.
val PATH
="Downloads/Spanish_Writers_by_Century_II.csv"
val df0 = spark.read.csv(PATH)
df0.show(5)
// Output
+--------------------+
| _c0|
+--------------------+
|Name;Surname;Cent...|
|Gonzalo;de Berceo...|
| Juan ;Ruiz;XIV;1283|
|Fernando;de Rojas...|
|Garcilaso;de la V...|
+--------------------+
Exactly the same output would be achieved if you use PySpark code, as
follows:
To skip the first line and use it as column names, we can use
.option("header", "true"):
val GZIP_PATH =
"Downloads/Spanish_Writers_by_Century_II.csv.gz"
Other important options are nullValue, nanValue, and dateFormat. The first
option permits establishing a string representing a null value. The second option
permits the specification of a string as representation of a non-number value
(NaN by default). The last option sets the string that indicates a date format (by
default “yyyy-MM-dd”).
To save a Spark DataFrame to a CSV format, we can use the write()
function. The write() function takes a folder as a parameter. That directory
represents the output path in which the CSV file, plus a _SUCCESS file, will be
saved:
As it happens with other file formats like Parquet, several saving options,
Overwrite, Append, Ignore, and the default option ErrorIfExists, are available.
After creating the basic Hive resources, we can write our code to load the
data from a file and save it to the Hive table:
import java.io.File
import org.apache.spark.sql.{Row, SaveMode,
SparkSession}
val warehouseLocation =
"hdfs://localhost:9745/user/hive/warehouse"
import spark.implicits._
import spark.sql
val path =
"file:///tmp/Spanish_Writers_by_Century.csv"
After saving the data, we can go to our Hive server and check the data is
already there:
dfPostgresql.show()
// Output
+-----------+--------------+--------------------+-----
--+
|category_id|
category_name| description|picture|
+-----------+--------------+--------------------+-----
--+
| 1| Beverages|Soft drinks,
coff...| []|
| 2| Condiments|Sweet and savory
...| []|
| 3| Confections|Desserts,
candies...| []|
| 4|Dairy
Products| Cheeses| []|
| 5|Grains/Cereals|Breads,
crackers,...| []|
| 6| Meat/Poultry| Prepared
meats| []|
| 7| Produce|Dried fruit and
b...| []|
| 8| Seafood| Seaweed and
fish| []|
+-----------+--------------+--------------------+-----
--+
jdbcAwsMySQL.select("id","company","last_name","first_name","jo
// Output
+---+---------+----------------+----------+--------------------
| id| company| last_name|first_name| job_title
+---+---------+----------------+----------+--------------------
| 1|Company A| Bedecs| Anna| Owner
| 2|Company B|Gratacos Solsona| Antonio| Owner
| 3|Company C| Axen| Thomas|Purchasing Repres...
| 4|Company D| Lee| Christina| Purchasing Manager
| 5|Company E| O’Donnell| Martin| Owner
| 6|Company F| Pérez-Olaeta| Francisco| Purchasing Manager
| 7|Company G| Xie| Ming-Yang| Owner
| 8|Company H| Andersen| Elizabeth|Purchasing Repres...
+---+---------+----------------+----------+--------------------
You can get the same result using an embedded SQL query instead of
retrieving the whole table:
jdbcAwsMySQL.show(8)
// Output
+---+---------+----------------+----------+-------------------
-+
|
id| company| last_name|first_name| job_title|
+---+---------+----------------+----------+-------------------
-+
| 1|Company
A| Bedecs| Anna| Owner|
| 2|Company B|Gratacos
Solsona| Antonio| Owner|
| 3|Company C| Axen| Thomas|Purchasing
Repres...|
| 4|Company D| Lee| Christina| Purchasing
Manager|
| 5|Company
E| O’Donnell| Martin| Owner|
| 6|Company F| Pérez-Olaeta| Francisco| Purchasing
Manager|
| 7|Company G| Xie| Ming-
Yang| Owner|
| 8|Company H| Andersen| Elizabeth|Purchasing
Repres...|
+---+---------+----------------+----------+-------------------
-+
In a similar way, you can save a Spark DataFrame to a database using a
JDBC connection. Let’s see it with an example of how to add data to a MySQL
database table:
import spark.implicits._
val data = Seq((6, "Alfonso"))
val dataRdd = spark.sparkContext.parallelize(data)
val dfFromRDD = dataRdd.toDF("id","name")
dfFromRDD.write
.mode("append")
.format("jdbc")
.option("url",
"jdbc:mysql://dbserver_url:3306/northwind")
.option("driver", "com.mysql.jdbc.Driver")
.option("dbtable", "customers")
.option("user", "YOUR_USER_HERE")
.option("password", "YOUR_PASSWORD_HERE")
.save()
There are many other options you can use when using a JDBC connection.
The ones we show next are probably the most relevant when you want to
optimize the communication with the data source:
The show() function without parameters displays 20 rows and truncates the
text length to 20 characters by default. However, show() can take up to three
parameters: The first one is an integer corresponding to the number of rows to
display. The second parameter can be a Boolean value, indicating whether text
string should be truncated, or an integer, denoting the number of characters to
display. The third parameter is a Boolean-type value, designating whether values
should be shown vertically.
The following output displays the results of the previous example, but using
the dfWC.show(5,8) option, showing only the first five rows and just eight
characters in length:
+----+--------+--------+----------+-------+--------+-----------
----+----------+
|Year| Country| Winner|Runners-
Up| Third| Fourth|GoalsScored|QualifiedTeams|MatchesPlayed|At
+----+--------+--------+----------+-------+--------+-----------
----+----------+
|1930| Uruguay|
Uruguay| Argen...| USA|Yugos...| 70| 13|
|1934| Italy| Italy| Czech...|Germany|
Austria| 70| 16| 17| 363.000|
|1938| France| Italy| Hungary|
Brazil| Sweden| 84| 15| 18| 375
|1950| Brazil| Uruguay| Brazil|
Sweden| Spain| 88| 13| 22| 1.04
|1954|Switz...|Germa...| Hungary|Austria|
Uruguay| 140| 16| 26| 768.607|
+----+--------+--------+----------+-------+--------+-----------
----+----------+
only showing top 5 rows
dfWC.select("*").show()
On the other hand, you can fetch specific columns from your DataFrame
using column names in one of the following query ways:
dfWC.select(dfWC.columns(0),dfWC.columns(1),dfWC.columns(2),dfW
false)
// Output
+----+-----------+----------+--------------+
|Year|Country |Winner |Runners-Up |
+----+-----------+----------+--------------+
|1930|Uruguay |Uruguay |Argentina |
|1934|Italy |Italy |Czechoslovakia|
|1938|France |Italy |Hungary |
|1950|Brazil |Uruguay |Brazil |
|1954|Switzerland|Germany FR|Hungary |
+----+-----------+----------+--------------+
Sequences can also be used in several other ways, for instance, using the
sequence plus the string column names, as you see next:
val seqColumnas =
Seq("Year","Country","Winner","Runners-
Up","Third","Fourth")
val result = dfWC.select(seqColumnas.head,
seqColumnas.tail: _*).show(5, false)
Another way could be using a sequence plus the map function with a set of
column names:
In both examples, you get exactly the same result you got in previous code
snippets.
You can also use a list of columns to retrieve the desired data. Have a look at
the next example:
import org.apache.spark.sql.Column
val miColumnas: List[Column] = List(new
Column("Year"), new Column("Country"), new
Column("Winner"))
dfWC.select(miColumnas: _*).show(5,false)
// Output
+----+-----------+----------+
|Year|Country |Winner |
+----+-----------+----------+
|1930|Uruguay |Uruguay |
|1934|Italy |Italy |
|1938|France |Italy |
|1950|Brazil |Uruguay |
|1954|Switzerland|Germany FR|
+----+-----------+----------+
dfWC.select(dfWC.columns.filter(s=>s.endsWith("ner")).map(c=>co
// Output
+----------+
|Winner |
+----------+
|Uruguay |
|Italy |
|Italy |
|Uruguay |
|Germany FR|
+----------+
On the other hand, the function contains() can be used to filter rows by
columns containing a specific pattern. You can see an example in the following
in which we filter the dataset rows with letter “y” in the Winner column:
import org.apache.spark.sql.functions.col
dfWC.select("Year","Country","Winner","Runners-
Up","Third","Fourth").filter(col("Winner").contains("S")).show(
// Output
+----+------------+------+-----------+-------+-------+
|Year| Country|Winner| Runners-Up| Third| Fourth|
+----+------------+------+-----------+-------+-------+
|2010|South Africa| Spain|Netherlands|Germany|Uruguay|
+----+------------+------+-----------+-------+-------+
filter can be complemented with the function like() to achieve the same
outcome:
dfWC.select("Year","Country","Winner","Runners-
Up","Third","Fourth").filter(col("Winner").like("%S%")).show()
// Outcome
+----+------------+------+-----------+-------+-------+
|Year| Country|Winner| Runners-Up| Third| Fourth|
+----+------------+------+-----------+-------+-------+
|2010|South Africa| Spain|Netherlands|Germany|Uruguay|
+----+------------+------+-----------+-------+-------+
We can also use SQL ANSI language as a complement to filter dataset rows:
dfWC.createOrReplaceTempView("WorldCups")
spark.sql("select Year,Country,Winner,`Runners-
Up`,Third,Fourth from WorldCups where Winner like
'%S%'").show()
// Output
+----+------------+------+-----------+-------+-------+
|Year| Country|Winner| Runners-Up| Third| Fourth|
+----+------------+------+-----------+-------+-------+
|2010|South Africa| Spain|Netherlands|Germany|Uruguay|
+----+------------+------+-----------+-------+-------+
In the last piece of code, you can appreciate how Spark performs query
operations. First, Spark performs the select() transformation, and then it
applies the filter criteria to the selected data. Finally, it applies the action
show() to the results.
Several filter() functions can be cascaded to provide additional data
refinement:
dfWC.select(col("Year"),col("Country"),col("Winner"),col("Runne
Up")).filter("Year < 1938").filter("Country =
'Italy'").show(5,false)
// Output
+----+-------+------+--------------+
|Year|Country|Winner|Runners-Up |
+----+-------+------+--------------+
|1934|Italy |Italy |Czechoslovakia|
+----+-------+------+--------------+
Alternatively, you can use the whether() function to get the same
outcome, as you can see in the following example:
It seems the Dutch are real experts in being the second one!
One possible use of NOT or “!” could be something like the next one:
dfWC.printSchema()
// Output
root
|-- Year: string (nullable = true)
|-- Country: string (nullable = true)
|-- Winner: string (nullable = true)
|-- Runners-Up: string (nullable = true)
|-- Third: string (nullable = true)
|-- Fourth: string (nullable = true)
|-- GoalsScored: string (nullable = true)
|-- QualifiedTeams: string (nullable = true)
|-- MatchesPlayed: string (nullable = true)
|-- Attendance: string (nullable = true)
Do you see the problem? Yes, Spark identified all the columns as string.
Therefore, if you attempt to perform the operation at this stage, you will get the
following errors:
Thus, the first step should be converting the columns of interest to a numeric
data type. However, if you try to directly convert some figures such as 1.045.246
to numeric, you will also have problems, as they are saved in Metric System
format. Therefore, it could be preferable to remove “ . ” to avoid problems:
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
val dfWC2=dfWC.withColumn("Attendance",
regexp_replace($"Attendance", "\\.", ""))
val dfWC3=dfWC2.withColumn("GoalsScored",
col("GoalsScored").cast(IntegerType))
.withColumn("QualifiedTeams",
col("QualifiedTeams").cast(IntegerType))
.withColumn("MatchesPlayed",
col("MatchesPlayed").cast(IntegerType))
.withColumn("Attendance",
col("Attendance").cast(IntegerType))
dfWC3.printSchema()
// Output
root
|-- Year: string (nullable = true)
|-- Country: string (nullable = true)
|-- Winner: string (nullable = true)
|-- Runners-Up: string (nullable = true)
|-- Third: string (nullable = true)
|-- Fourth: string (nullable = true)
|-- GoalsScored: integer (nullable = true)
|-- QualifiedTeams: integer (nullable = true)
|-- MatchesPlayed: integer (nullable = true)
|-- Attendance: integer (nullable = true)
val dfWCExt=dfWC3.withColumn("AttendancePerMatch",
round(col("Attendance").cast(IntegerType)/col("MatchesPlayed").
3))
dfWCExt.select("Attendance","MatchesPlayed","AttendancePerMatch
// Output
+----------+-------------+------------------+
|Attendance|MatchesPlayed|AttendancePerMatch|
+----------+-------------+------------------+
| 590549| 18| 32808.278|
| 363000| 17| 21352.941|
| 375700| 18| 20872.222|
| 1045246| 22| 47511.182|
| 768607| 26| 29561.808|
+----------+-------------+------------------+
In the preceding example, you have already used a good bunch of the
withColumn() use cases.
val dfRenamed =
dfWCExt.withColumnRenamed("AttendancePerMatch","AxMatch")
dfRenamed.select("Attendance",
"MatchesPlayed","AxMatch").show(5)
// Output
+----------+-------------+---------+
|Attendance|MatchesPlayed| AxMatch|
+----------+-------------+---------+
| 590549| 18|32808.278|
| 363000| 17|21352.941|
| 375700| 18|20872.222|
| 1045246| 22|47511.182|
| 768607| 26|29561.808|
+----------+-------------+---------+
dfFropTwo = dfRenamed2.drop(*("MatchesPlayed","AttendancexMatch
dfFropTwo.show(5)
# Output
+----+-----------+----------+--------------+-------+----------+
+-------+
|Year| Country| Winner| Runners-
Up| Third| Fourth|GoalsScored|QualifiedTeams| Att|
+----+-----------+----------+--------------+-------+----------+
+-------+
|1930| Uruguay| Uruguay| Argentina| USA|Yugoslavia|
590549|
|1934| Italy| Italy|Czechoslovakia|Germany| Austria|
363000|
|1938| France| Italy| Hungary| Brazil| Sweden|
375700|
|1950| Brazil| Uruguay| Brazil|
Sweden| Spain| 88| 13|1045246|
|1954|Switzerland|Germany FR| Hungary|Austria| Uruguay|
768607|
+----+-----------+----------+--------------+-------+----------+
+-------+
You can delete all the DataFrame columns at once using the following Scala
code snippet:
allColumnsList = dfRenamed2.columns
dfFropAll = dfRenamed2.drop(*allColumnsList)
dfFropAll.show(2)
# Output
++
||
++
||
||
++
import org.apache.spark.sql.functions.{when, _}
import spark.sqlContext.implicits._
val df2 = df
.withColumn("stage", when(col("age") <
10,"Child")
.when(col("age") >= 10 && col("age") <
18,"Teenager")
.otherwise("Adult"))
df2.show(5)
// Output
+--------+-------+---+------+------+--------+
| name|surname|age|gender|salary| stage|
+--------+-------+---+------+------+--------+
| Juan | Bravo| 67| M| 65000| Adult|
| Miguel |Rosales| 40| M| 87000| Adult|
|Roberto |Morales| 7| M| 0| Child|
| Maria | Gomez| 12| F| 0|Teenager|
| Vanesa| Lopez| 25| F| 72000| Adult|
+--------+-------+---+------+------+--------+
The when() clause can also be used as part of a SQL select statement:
Now that our UDF is registered, we can use it with our personas dataframe
as a normal SQL function:
val
finalDF=df.withColumn("is_adult",isAdultUDF(col("age")))
finalDF.show()
// Output
+--------+-------+---+------+------+--------+
| name|surname|age|gender|salary|is_adult|
+--------+-------+---+------+------+--------+
| Juan | Bravo| 67| M| 65000| Adult|
| Miguel |Rosales| 40| M| 87000| Adult|
|Roberto |Morales| 7| M| 0|No adult|
| Maria | Gomez| 12| F| 0|No adult|
| Vanesa| Lopez| 25| F| 72000| Adult|
+--------+-------+---+------+------+--------+
val dfCOP3=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/Crude_Oil_Produ
dfCOP3.show(5)
// Output
+----+------+------+------+------+------+------+------+------+-
|Year| Jan| Feb| Mar| Apr| May| Jun| Jul| Aug|
+----+------+------+------+------+------+------+------+------+-
|2010|167529|155496|170976|161769|167427|161385|164234|168867|1
|2011|170393|151354|174158|166858|174363|167673|168635|175618|1
|2012|191472|181783|196138|189601|197456|188262|199368|196867|1
|2013|219601|200383|223683|221242|227139|218355|233210|233599|2
|2014|250430|228396|257225|255822|268025|262291|274273|276909|2
+----+------+------+------+------+------+------+------+------+-
val dfCOP4=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/Crude_Oil_Produ
dfCOP4.show(5)
// Output
+----+------+------+------+------+------+------+------+------+-
|Year| Jan| Feb| Mar| Apr| May| Jun| Jul| Aug|
+----+------+------+------+------+------+------+------+------+-
|2015|290891|266154|297091|289755|293711|280734|292807|291702|2
|2016|285262|262902|282132|266219|273875|260284|268526|269386|2
|2017|275117|255081|284146|273041|284727|273321|286657|286759|2
|2018|310032|287870|324467|314996|323491|319216|337814|353154|3
|2019|367924|326845|369292|364458|376763|366546|368965|387073|3
+----+------+------+------+------+------+------+------+------+-
To produce a clear DataFrame, we can use the following code snippet. You
can see in the following only one 2015 row is preserved:
val dfCOP1=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/Crude_Oil_Produ
dfCOP1.show(5)
// Output
+----+------+------+------+------+------+------+
|Year| Jan| Feb| Mar| Apr| May| Jun|
+----+------+------+------+------+------+------+
|2015|290891|266154|297091|289755|293711|280734|
|2016|285262|262902|282132|266219|273875|260284|
|2017|275117|255081|284146|273041|284727|273321|
|2018|310032|287870|324467|314996|323491|319216|
|2019|367924|326845|369292|364458|376763|366546|
+----+------+------+------+------+------+------+
val dfCOP2=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/Crude_Oil_Produ
dfCOP2.show(5)
// Output
+----+------+------+------+------+------+------+
|Year| Jan| Feb| Mar| Apr| May| Jun|
+----+------+------+------+------+------+------+
|2015|290891|266154|297091|289755|293711|280734|
|2016|285262|262902|282132|266219|273875|260284|
|2017|275117|255081|284146|273041|284727|273321|
|2018|310032|287870|324467|314996|323491|319216|
|2019|367924|326845|369292|364458|376763|366546|
+----+------+------+------+------+------+------+
The dfCOP1 and dfCOP2 DataFrames only have the Year column in
common:
// Using allowMissingColumns=true
val missingColumnsDf=dfCOP1.unionByName(dfCOP2, allowMissingCol
missingColumnsDf.show()
// Output
+----+------+------+------+------+------+------+------+------+-
|Year| Jan| Feb| Mar| Apr| May| Jun| Jul| Aug|
+----+------+------+------+------+------+------+------+------+-
|2015|290891|266154|297091|289755|293711|280734| null| null|
|2016|285262|262902|282132|266219|273875|260284| null| null|
|2017|275117|255081|284146|273041|284727|273321| null| null|
|2018|310032|287870|324467|314996|323491|319216| null| null|
|2019|367924|326845|369292|364458|376763|366546| null| null|
|2020|398420|372419|396693|357412|301105|313275| null| null|
|2015| null| null| null| null| null| null|292807|291702|2
|2016| null| null| null| null| null| null|268526|269386|2
|2017| null| null| null| null| null| null|286657|286759|2
|2018| null| null| null| null| null| null|337814|353154|3
|2019| null| null| null| null| null| null|368965|387073|3
|2020| null| null| null| null| null| null|341184|327875|3
+----+------+------+------+------+------+------+------+------+-
val dfUByL=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/User_by_languag
// Output
+---------+--------+------+--------+-----------+------+
|firstName|lastName|gender|language|ISO3166Code|salary|
+---------+--------+------+--------+-----------+------+
| Liselda| Rojas|Female| Spanish| 484| 62000|
| Leopoldo| Galán| Male| Spanish| 604| 47000|
| William| Adams| Male| English| 826| 99000|
| James| Allen| Male| English| 124| 55000|
| Andrea| López|Female| Spanish| 724| 95000|
+---------+--------+------+--------+-----------+------+
The second one includes the names of different countries together with their
ISO 3166 country codes:
val dfCCodes=spark.read.option("header",
"true").csv("file:///Users/aantolinez/Downloads/ISO_3166_countr
// Output
+------------+--------------+
| ISO3166Code| CountryName|
+------------+--------------+
| 484| Mexico|
| 826|United Kingdom|
| 250| France|
| 124| Canada|
| 724| Spain|
+------------+--------------+
Let’s now use an INNER join to join the preceding two DataFrames on the
ISO3166Code column:
Do you see something in the preceding output? Yes, the join column is
duplicated. This situation will put you into trouble if you try to work with the
final DataFrame, as duplicate columns will create ambiguity. Therefore, it is
very likely you will receive a message similar to this one:
“org.apache.spark.sql.AnalysisException: Reference
'ISO3166Code' is ambiguous, could be: ISO3166Code,
ISO3166Code.”
One way to avoid this kind of problem is using a temporary view and
selecting just the fields you would like to have. Let’s repeat the previous
example, but this time using the Spark function
createOrReplaceTempView() to create a temporary view:
On the other hand, a fullouter, outer, or full join collects all rows
from both DataFrames and adds a null value for those records that do not have a
match in both DataFrames. Once more, we are going to show you how to use
this kind of join with a practical example using the previous DataFrames. Please
notice that the next three code snippets return exactly the same outcome:
The Spark left outer join collects all the elements from the left DataFrame
and only those from the right one that have a matching on the left DataFrame. If
there is no matching element on the left DataFrame, no join takes place.
Once again, the next code snippets will give you the same result:
The Spark right outer or right join performs the left join symmetrical
operation. In this case, all the elements from the right DataFrame are collected,
and a null value is added where no matching is found on the left one. Next is an
example, and again, both lines of code produce the same outcome:
Finally, the Spark anti join returns rows from the first DataFrame not having
matches in the second one. Here is one more example:
Summarizing, in this section you have seen how to use the most typical
Spark joins. However, you have to bear something important in mind. Joins are
wide Spark transformations; therefore, they imply data shuffling across the
nodes. Hence, performance can be seriously affected if you use them without
caution.
4.3 Spark Cache and Persist of Data
We have already mentioned in this book that one of Spark’s competitive
advantages is its data partitioning capability across multiple executors. Splitting
large volumes of information across the network poses important challenges
such as bandwidth saturation and network latency.
While using Spark you might need to use a dataset many times over a period
of time; therefore, fetching the same dataset once and again to the executors
could be inefficient. To overcome this obstacle, Spark provides two API calls
called cache() and persist() to store locally in the executors as many of
the partitions as the memory permits. Therefore, cache() and persist()
are Spark methods intended for iterative and interactive application performance
improvement.
Spark cache() and persist() are equivalent. In fact when
persist() is called without arguments, it internally calls cache().
However, persist() with the StorageLevel argument offers additional
storage optimization capabilities such as whether data should be stored in
memory or on disk and/or in a serialized or unserialized way.
Now we are going to see with a practical example the impact the use of
cache() can have in Spark operation performance:
If you look attentively at the preceding example, you can see that cache() is
lazily evaluated; it means that it is not materialized when it is called, but the first
time it is invoked. Thus, the first call to dfcache.count() does not take
advantage of the cached data; only the second call can profit from it. As you can
see, the second dfcache.count() executes 24.34 times faster.
Is it worth noticing that cache() persists in memory the partitioned data in
unserialized format. The cached data is localized in the node memory processing
the corresponding partition; therefore, if that node is lost in the next invocation
to that information, it would have to be recovered from the source. As the data
will not be serialized, it would take longer and perhaps produce network
bottleneck.
To overcome the cache() restrictions, the DataFrame.persist()
method was introduced and accepts numerous types of storage levels via the
'storageLevel' [ = ] value key and value pair.
The most typical valid options for storageLevel are
NONE: With no options, persist() calls cache() under the hood.
DISK_ONLY: Data is stored on disk rather than in RAM. Since you are
persisting on disk, it is serialized in nature.
MEMORY_ONLY: Stores data in RAM as deserialized Java objects. Full data
cache is not guaranteed as it cannot be fully accommodated into memory and
has no replication.
MEMORY_ONLY_SER: Stores data as serialized Java objects. Generally
more space-efficient than deserialized objects, but more read CPU-intensive.
MEMORY_AND_DISK: Stores data as deserialized Java objects. If the whole
data does not fit in memory, store partitions not fitting in RAM to disk.
OFF_HEAP: It is an experimental storage level similar to
MEMORY_ONLY_SER, but storing the data in off-heap memory if off-heap
memory is enabled.
MEMORY_AND_DISK_SER: Option Akin to MEMORY_ONLY_SER;
however, partitions not fitting in memory are streamed to disk.
DISK_ONLY_2, DISK_ONLY_3, MEMORY_ONLY_2,
MEMORY_AND_DISK_2, MEMORY_AND_DISK_SER_2,
MEMORY_ONLY_SER_2: The same as the parent levels adding partition
replication to two cluster nodes.
Next, we are going to show you how to use persist() with a practical
example:
import org.apache.spark.storage.StorageLevel
val dfpersist = spark.range(1,
1000000000).toDF("base").withColumn("square", $"base"
* $"base")
dfpersist.persist(StorageLevel.DISK_ONLY) // Serialize
and cache data on disk
spark.time(dfpersist.count()) // Materialize the cache
Once again, you can see in the preceding example the operation performed
on persisted data is 26.8 times faster.
Additionally, tables and views derived from DataFrames can also be cached.
Let’s see again how it can be used with a practical example:
Time taken: 2 ms
+---------+
| count(1)|
+---------+
|999999999|
+---------+
Time taken: 1 ms
+---------+
| count(1)|
+---------+
|999999999|
+---------+
spark.time(dfUnpersist.count())
Time taken: 99 ms
Out[6]: res2: Long = 9999999
4.4 Summary
In this chapter we have explained what is called the Apache Spark high-level
API. We have reviewed the concept of DataFrames and the DataFrame
attributes. We have talked about the different methods available to create Spark
DataFrames. Next, we explained how DataFrames can be used to manipulate and
analyze information. Finally, we went through the options available to speed up
data processing by caching data in memory. In the next chapter, we are going to
study another Spark high-level data structure named datasets.
Footnotes
1 ETL stands for Extract, Transform, and Load data.
2 ETL (Extract, Transform, Load) is a process to extract, transform, and load data from several sources to a
consolidated data repository.
3 www.kaggle.com/datasets/abecklas/fifa-world-cup?select=WorldCups.csv
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_5
In the Spark ecosystem, DataFrames and datasets are higher-level APIs that use
Spark RDDs under the hood. Spark developers mainly use DataFrames and
datasets because these data structures are the ones more efficiently using Spark
storage and query optimizers, hence achieving the best data processing
performance. Therefore, DataFrames and datasets are the best Spark tools in
getting the best performance to handle structured data. Spark DataFrames and
datasets also allow technicians with a RDBMS and SQL background to take
advantage of Spark capabilities quicker.
To create a dataset from a sequence of case classes using again the toDS()
method, we first need a case class. Let’s see how it works with a practical
example. First, we create a Scala case class named Personas:
Then we can create a sequence of data matching the Personas case class
schema. In this case we are going to use the data of some of the most famous
Spanish writers of all times:
After that we can obtain a dataset by applying the toDS() method to the
personasSeq sequence:
val myRdd =
spark.sparkContext.parallelize(Seq(("Miguel de
Cervantes", 1547),("Lope de Vega", 1562),("Fernando de
Rojas",1470)))
val rddToDs = myRdd.toDS
.withColumnRenamed("_1","Nombre")
.withColumnRenamed("_2","Nacimiento")
rddToDs.show()
// Output
+-------------------+----------+
| Nombre|Nacimiento|
+-------------------+----------+
|Miguel de Cervantes| 1547|
| Lope de Vega| 1562|
| Fernando de Rojas| 1470|
+-------------------+----------+
The fourth way of creating a dataset is from a DataFrame. In this case we are
going to use an external file to create a DataFrame and after that transform it into
a dataset, as you can see in the following code snippet:
In real-life scenarios, you will often have to tackle the problem of non-ideal
data distributions. Severe data skewness can seriously jeopardize Spark join
performance because, for join operations, Spark has to place the records of each
key in its particular partition. Therefore, if you join two DataFrames by a
specific key or column and one of the keys has many more records than the
others, its corresponding partition becomes much bigger than the rest (skewed);
the time taken to process that partition is then comparatively longer than the
time consumed by the others, causing job bottlenecks, poor CPU utilization,
and/or out-of-memory problems.
The AQE automatically detects data skewness from shuffle statistics and
divides the bigger partitions into smaller ones that will be joined locally with
their corresponding counterparts.
For Spark to take advantage of skew join optimization, both options
“spark.sql.adaptive.enabled” and “spark.sql.adaptive.skewJoin.enabled” have to
be set to true.
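Both settings can also be set programmatically on the active session; a minimal
sketch (in Spark 3.2 and later, both default to true):
// Enable Adaptive Query Execution and its skew join optimization
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")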
5.8 Summary
Datasets are part of the so-called Spark high-level API together with
DataFrames. However, unlike the latter, they are only available in compiled
programming languages such as Java and Scala. This attribute is both an
advantage and a disadvantage, as those languages have steeper learning curves
than Python, for example, and hence are less commonly employed. Datasets also
provide type safety improvements, as they are strongly typed data structures.
After introducing the concept of datasets, we focused on Spark Adaptive Query
Execution (AQE), one of the newest and most interesting features introduced in
Spark 3.0. AQE improves Spark query performance by automatically adapting
query plans based on statistics collected at runtime. In the coming chapters, we
switch to another important Spark feature: data streaming.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_6
The real-time lane takes care of data as it arrives at the system and, like the
batch lane, it stores the results in a distributed data warehouse system.
The Lambda architecture has been shown to satisfy many business use cases and
is currently in use by major corporations such as Yahoo and Netflix.
The Lambda architecture is composed of three main layers or lanes:
Batch processing layer
Real-time or speed layer
Serving layer
In addition to these three main data processing lanes, some authors would
add a pre-layer for data intake:
Data ingestion layer
Let’s succinctly review all of them.
Serving Layer
This layer merges the results from the batch and real-time layers and is the
interface through which users interactively submit queries and receive results
online. It allows users to seamlessly work with the full data independently of
whether it was processed on the batch or the stream lane. This lane also
provides the visualization layer with up-to-the-minute information.
The use of RDDs under the hood to manipulate data facilitates the use of a
common API both for batch and streaming processing. At the same time, this
architecture permits the use of any third-party library available to process Spark
data streams.
Spark implements streaming load balancing and a faster fault recovery by
dynamically assigning processing tasks to the workers available.
Thus, a Discretized Stream (DStream) is in fact a continuous sequence of
RDDs of the same type, simulating a continuous flow of data and bringing all the
RDD advantages in terms of speed and safety to near-real-time stream data
processing. However, the DStream API does not offer the complete set of
transformations available on Apache Spark RDDs (the low-level API).
1004,Tomás,30,DEndo,01-09-2022
1005,Lorena,50,DGineco,01-09-2022
1006,Pedro,10,DCardio,01-09-2022
1007,Ester,10,DCardio,01-09-2022
1008,Marina,10,DCardio,01-09-2022
1009,Julia,20,DNeuro,01-09-2022
1010,Javier,30,DEndo,01-09-2022
1011,Laura,50,DGineco,01-09-2022
1012,Nuria,10,DCardio,01-09-2022
1013,Helena,10,DCardio,01-09-2022
1014,Nati,10,DCardio,01-09-2022
Next, we show two options to see our program up and running. The first
code (socketTextStream.scala) is shown next, and it is a Scala variant that can be
compiled and executed in Spark using the $SPARK_HOME/bin/spark-
submit command. It is out of the scope of this book to discuss how to compile
and link Scala code, but it is recommended to use sbt1 together with sbt-
assembly2 to create a so-called “fat JAR” file including all the necessary
libraries, a.k.a. dependencies:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds,
StreamingContext}
import java.io.IOException
object socketTextStream {
def main(args: Array[String]): Unit = {
val host = "localhost"
val port = 9999
try {
val spark: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("Hand-On-Spark3_socketTextStream")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
// Create the input DStream from the TCP socket defined by host and port
val lines = ssc.socketTextStream(host, port)
println("Spark is listening on port " + port + " and ready...")
lines.print()
ssc.start()
ssc.awaitTermination()
} catch {
case e: java.net.ConnectException =>
println("Error establishing connection to " + host +
":" + port)
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
} finally {
println("Finally block")
}
}
}
try{
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Hands-On_Spark3_socketTextStream")
.getOrCreate()
val sc = spark.sparkContext
// host ("localhost") and port (9999) are defined as in the compiled version above
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream(host, port)
println("Spark is listening on port " + port + " and ready...")
lines.print()
ssc.start()
ssc.awaitTermination()
}catch {
case e: java.net.ConnectException =>
println("Error establishing connection to " + host +
":" + port)
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
} finally {
println("Finally block")
}
Pay attention to the local[*] option. In this case we have used “*”; thus,
the program is going to use all the cores available. It is important to use more
than one because the application must be able to run two tasks in parallel,
listening to a TCP socket (localhost:9999) and, at the same time, processing the
data and showing it on the console.
Netcat has several [<options>]; however, we are going to use only -l, which
instructs nc to listen on a UDP or TCP <port>, and -k, which is used in listen
mode to accept multiple connections. When <host> is omitted, nc listens to all
the IP addresses bound to the <port> given.
To illustrate how the program works, we are going to take advantage of the
nc utility introduced, to establish a streaming client/server connection between
nc and our Spark application. In our case nc will act as a server (listens to a
host:port), while our application will act as a client (connects to the nc server).
Whether you have built your JAR file from the previous code or are using
the notebook version, running the application consists of a two-step process:
1.
Open a terminal in your system and set up the server side of the client/server
streaming connection by running the following code:
nc -lk 9999
2.
Depending on how you are running the application
2.1.
Using a JAR file: Open a second terminal and execute your
application as shown in the following:
$SPARK_HOME/bin/spark-submit --class
org.apress.handsOnSpark3.socketTextStream --
master "local[*]"
/PATH/TO/socketTextStream/HandsOnSpark3-
socketTextStream.jar
2.2.
Using a notebook: Just execute the code in your notebook.
As soon as you see the message Spark is listening on port
9999 and ready... on your screen, you can go back to step 1 and type
some of the CSV strings provided as examples, for instance:
1009,Julia,20,DNeuro,01-09-2022
1010,Javier,30,DEndo,01-09-2022
1011,Laura,50,DGineco,01-09-2022
1012,Nuria,10,DCardio,01-09-2022
1013,Helena,10,DCardio,01-09-2022
1014,Nati,10,DCardio,01-09-2022
1004,Tomás,30,DEndo,01-09-2022
1005,Lorena,50,DGineco,01-09-2022
1006,Pedro,10,DCardio,01-09-2022
1007,Ester,10,DCardio,01-09-2022
1008,Marina,10,DCardio,01-09-2022
With a cadence of seconds, you will see an output like the following one
coming up on your terminal or notebook:
3.
Application termination
awaitTermination() waits for a user's termination signal. Thus, going to
the terminal session started in step 1 and pressing Ctrl+C (or sending
SIGTERM), the streaming context will be stopped and your streaming application
terminated.
However, abruptly killing a streaming process this way is neither elegant nor
convenient in most real streaming applications. The notion of unbounded data
implies a continuous flow of information arriving at the system; thus, an abrupt
interruption of the streaming process will, in all likelihood, cause a loss of
information. The procedure for halting a streaming application without suddenly
killing it in the middle of RDD processing is called a "graceful shutdown," and
we are going to explain it later on in the "Spark Streaming Graceful Shutdown"
section.
lines.flatMap(_.split(",")).print()
We can also introduce the count() function to count the number of lines in
our stream. Thus, adding the count() function as you can see in the following
line
lines.flatMap(_.split(",")).count().print()
and typing text lines from our example, we get an output similar to the
following:
-------------------------------------------
Time: 1675204465000 ms
-------------------------------------------
75
-------------------------------------------
Time: 1675204470000 ms
-------------------------------------------
35
-------------------------------------------
Time: 1675204470000 ms
-------------------------------------------
260
lines.flatMap(_.split(",")).print()
lines.countByValue().print()
Running the code again and copying and pasting some of the lines provided
as examples, you could see an output similar to the following:
-------------------------------------------
Time: 1675236630000 ms
-------------------------------------------
(1007,Ester,10,DCardio,01-09-2022,1)
(1005,Lorena,50,DGineco,01-09-202,1)
(1005,Lorena,50,DGineco,01-09-2022,1)
(1014,Nati,10,DCardio,01-09-2022,1)
(1004,Tomás,30,DEndo,01-09-2022,1)
(1008,Marina,10,DCardio,01-09-2022,1)
(1006,Pedro,10,DCardio,01-09-2022,1)
-------------------------------------------
Time: 1675236640000 ms
-------------------------------------------
(1007,Ester,10,DCardio,01-09-2022,1)
(1005,Lorena,50,DGineco,01-09-2022,1)
(1008,Marina,10,DCardio,01-09-2022,1)
(1006,Pedro,10,DCardio,01-09-2022,1)
We can improve our example even more and achieve the same result by chaining
the previous code line with the flatMap() function we saw before:
lines.flatMap(_.split(",")).countByValue().print()
As you did previously, run the code again, copy and paste the example lines
in your terminal, and you will again see an outcome similar to the next one:
This time the output is more informative, as you can see the Department of
Cardiology registrations are piling up. That information could be used to, for
example, trigger an alarm when the number of appointments approaches or
crosses the threshold of maximum capacity.
We could have gotten the same result by using the reduceByKey()
function. This function works on RDDs (key/value pairs) and is used to merge
the values of each key using a provided reduce function ( _ + _ in our
example).
To do that, just replace the following line of code
lines.countByValue().print()
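with a key/value mapping followed by reduceByKey; the following sketch
reproduces the countByValue() behavior:
lines.map(line => (line, 1))   // build key/value pairs from each line
  .reduceByKey(_ + _)          // merge the counts of identical lines
  .print()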
Repeating the process of copying and pasting example lines to your terminal
will give an output similar to this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds,
StreamingContext}
import java.io.IOException
try{
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Hands-On_Spark3_socketTextStream")
.getOrCreate()
val sc = spark.sparkContext
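The rest of the snippet, sketched here under the assumption that it mirrors the
compiled version, creates the streaming context and socket source as before,
filters out the CSV header lines (those starting with the NSS column name), and
counts the records per department name (DNom, the fourth CSV field):
val host = "localhost"
val port = 9999
val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream(host, port)
lines.filter(!_.startsWith("NSS"))   // drop the CSV header lines
  .map(_.split(","))
  .map(fields => (fields(3), 1))     // key by department name (DNom)
  .reduceByKey(_ + _)
  .print()
ssc.start()
ssc.awaitTermination()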
After applying these changes, if you execute the program again and paste the
following lines to your terminal
NSS,Nom,DID,DNom,Fecha
1004,Tomás,30,DEndo,01-09-2022
1005,Lorena,50,DGineco,01-09-2022
1006,Pedro,10,DCardio,01-09-2022
1007,Ester,10,DCardio,01-09-2022
1008,Marina,10,DCardio,01-09-2022
NSS,Nom,DID,DNom,Fecha
1009,Julia,20,DNeuro,01-09-2022
1010,Javier,30,DEndo,01-09-2022
1011,Laura,50,DGineco,01-09-2022
1012,Nuria,10,DCardio,01-09-2022
1013,Helena,10,DCardio,01-09-2022
1014,Nati,10,DCardio,01-09-2022
-------------------------------------------
Time: 1675284925000 ms
-------------------------------------------
(DCardio,3)
(DGineco,1)
(DEndo,1)
-------------------------------------------
Time: 1675284930000 ms
-------------------------------------------
(DCardio,3)
(DGineco,1)
(DEndo,1)
(DNeuro,1)
As you can appreciate, header lines are removed; therefore, only the lines of
interest are considered.
Next, we are going to see the other basic source directly available in the
Spark Streaming core API, file systems compatible with HDFS (Hadoop
Distributed File System).
package org.apress.handsOnSpark3
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds,
StreamingContext}
import java.io.IOException
object textFileStream {
def main(args: Array[String]): Unit = {
val folder="/tmp/patient_streaming"
try {
val spark: SparkSession = SparkSession.builder()
.master("local[1]")
.appName("Hand-On-Spark3_textFileStream")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
// Processing sketch (based on the output shown later in this section):
// drop the CSV header lines and count records per department name (DNom)
val lines = ssc.textFileStream(folder)
lines.filter(!_.startsWith("NSS"))
  .map(_.split(","))
  .map(fields => (fields(3), 1))
  .reduceByKey(_ + _)
  .print()
println("Spark is monitoring the folder " + folder + " and ready...")
ssc.start()
ssc.awaitTermination()
} catch {
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
} finally {
println("Finally block")
}
}
}
Pay attention to the local[1] option. In this case we have used only “[1]”
because file streams do not require executing a receiver; therefore, no additional
cores are required for file intake.
The next piece of code is a version of the preceding Hospital Queue
Management System application that can be executed in Spark using a notebook
application such as Jupyter, Apache Zeppelin, etc.:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds,
StreamingContext}
import java.io.IOException
val folder="/tmp/patient_streaming"
try{
val spark = SparkSession
.builder()
.master("local[1]")
.appName("Hand-On-Spark3_textFileStream")
.getOrCreate()
val sc = spark.sparkContext
$SPARK_HOME/bin/spark-submit --class
org.apress.handsOnSpark3.textFileStream --
master "local[1]" /PATH/TO/YOUR/HandsOnSpark3-
textFileStream.jar
1.2.
If you are using a notebook, just execute the code in your notebook.
2.
Open a new terminal in your computer to copy the CSV files provided to the
monitored folder.
As soon as you see on your screen the message Spark is
monitoring the folder /tmp/patient_streaming and
ready... , you can go back to step 2 and start copying the CSV files to
the /tmp/patient_streaming folder,6 for example:
cp /PATH/TO/patient1.csv /tmp/patient_streaming
cp /PATH/TO/patient2.csv /tmp/patient_streaming
cp /PATH/TO/patient3.csv /tmp/patient_streaming
cp /PATH/TO/patient4.csv /tmp/patient_streaming
cp /PATH/TO/patient5.csv /tmp/patient_streaming
With a cadence of seconds, you will start seeing on your terminal session
or notebook an output similar to the next one:
-------------------------------------------
Time: 1675447070000 ms
-------------------------------------------
(DEndo,1)
(DNeuro,1)
-------------------------------------------
Time: 1675447075000 ms
-------------------------------------------
(DGastro,1)
(DCardio,3)
(DGineco,1)
(DNeuro,2)
3.
Application termination
Once again, awaitTermination() waits for a user's termination
signal. Thus, going to the terminal session started in step 2 and pressing
Ctrl+C (or sending SIGTERM), the streaming context will be stopped. If the
application is run in a notebook, you can stop the execution of the
application by stopping or restarting the Spark kernel.
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds,
StreamingContext}
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import spark.implicits._
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val host = "localhost"
val port = 9999
// Socket source and output operation, as in the earlier examples
val lines = ssc.socketTextStream(host, port)
lines.print()
var wasStopped = false
val timeout = 1000L // polling interval in milliseconds (illustrative value)
ssc.start()
while (!wasStopped) {
printf("\n Listening and ready... \n")
wasStopped = ssc.awaitTerminationOrTimeout(timeout)
if (wasStopped)
println("Streaming process is no longer active...")
else
println("Streaming is in progress...")
Now, before you execute the preceding code example, in a terminal set up
the socket server by typing
nc -lk 9999
After that, you can execute the preceding code snippet, and as soon as you
see the message Listening and ready... on your screen, you can start
copying and pasting the CSV example lines provided, for example:
1004,Tomás,30,DEndo,01-09-2022
1005,Lorena,50,DGineco,01-09-2022
1006,Pedro,10,DCardio,01-09-2022
1007,Ester,10,DCardio,01-09-2022
...
1010,Javier,30,DEndo,01-09-2022
1011,Laura,50,DGineco,01-09-2022
1012,Nuria,10,DCardio,01-09-2022
1013,Helena,10,DCardio,01-09-2022
1014,Nati,10,DCardio,01-09-2022
1009,Julia,20,DNeuro,01-09-2022
1010,Javier,30,DEndo,01-09-2022
In a few seconds you would see an output similar to this coming out of your
program:
-------------------------------------------
Time: 1675631800000 ms
-------------------------------------------
-------------------------------------------
Time: 1675631800000 ms
-------------------------------------------
mkdir /tmp/alt_folder
-------------------------------------------
Time: 1675631820000 ms
-------------------------------------------
Graceful Shutdown finished the job in queue and nicely stopped your
streaming process without losing any data.
If you look carefully through the preceding code snippet, you can see that
Spark keeps listening to the network socket localhost:9999 (host:port) while the
stop flag (wasStopped) is false. Thus, we need a way to send Spark the stop
streaming signal. We achieve that by creating a new folder in the defined file
system path /tmp/alt_folder.
The key lines of code are the ones that check for the existence of that folder
and, when it appears, stop the streaming context gracefully.
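A minimal sketch of that check, relying on the Hadoop FileSystem and Path
classes already imported at the top of the snippet (variable names are
illustrative):
// Before the polling loop: a handle to the file system Spark is configured with
val fs = FileSystem.get(sc.hadoopConfiguration)
// Inside the polling loop: when the marker folder appears, stop gracefully so
// that the batches already received are fully processed before shutting down
if (!wasStopped && fs.exists(new Path("/tmp/alt_folder"))) {
  println("Stop signal detected, shutting down gracefully...")
  ssc.stop(stopSparkContext = true, stopGracefully = true)
}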
6.10 Summary
In this chapter we have explained what Apache Spark Streaming is, together
with the Spark DStream (Discretized Stream) as the basic abstraction behind the
Spark Streaming concept. We mentioned DStream is a high-level abstraction for
Spark Streaming, just as the RDD is for Spark Core. We also went through the
differences between real-time analytics of bounded and unbounded data,
mentioning the challenges and uncertainties stream processing brings. Next, we
talked about the Spark Streaming execution model and stream processing
architectures. At that point, we explained the Lambda and Kappa architectures as
the main stream processing architectures available. After that, we went through
the concepts of Discretized Streams and stream sources and receivers. The last
point was quite dense, explaining and giving examples of basic and advanced
data sources. The advanced topic of graceful shutdown was described with a
practical example, and finally, a list of the most common transformations on
DStreams was provided.
Footnotes
1 www.scala-sbt.org/
2 https://github.com/sbt/sbt-assembly
3 https://jupyter.org/
4 https://zeppelin.apache.org/
5 https://netcat.sourceforge.net/
6 It is advised to copy the files progressively to better see how Spark processes them.
Part II
Apache Spark Streaming
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_7
Nowadays, in the big data world, more and more business processes and daily
used applications require the analysis of real-time or near-real-time information
at scale. Real-time data analysis is commonly associated with processes that
require decisions to be taken quickly and without delay. Therefore,
infrastructures capable of providing instant analytics, management of
continuously flowing data, and fault tolerance and handling stragglers or slow
components are necessary.
Considering the main characteristics that define data streaming, in which
Information is continuous
Information is unbounded
There is high volume and velocity of data production
Information is time-sensitive
There is heterogeneity of data sources
we can assume data faults and stragglers1 are certain to occur in this sort of
environment.
Data faults and stragglers represent a serious challenge for streaming data
processing. For instance, how can we get insights from a sequence of events
arriving at a stream processing system if we do not know what is the order in
which they took place?
Stream processing uses timestamps2 to sequence the events and includes
different notions of time regarding stream event processing:
Event time: It corresponds with the moment in time in which the event is
generated by a device.
Ingestion time: It is the time when an event arrives at the stream processing
architecture.
Processing time: It refers to the computer time when it begins treating the
event.
In Chapter 6 we saw that Spark's first attempt to keep up with the dynamic
nature of information streaming and to deal with the challenges mentioned
before was the introduction of Apache Spark Streaming (DStream API). We also
studied that DStreams or Discretized Streams are implemented on top of Spark’s
Resilient Distributed Dataset (RDD) data structures. DStream handles
continuous data flowing by dividing the information into small chunks,
processing them later on as micro-batches.
The use of the low-level RDD API offers both advantages and
disadvantages. The main disadvantage is that, as the name states, it is a low-level
framework; it therefore requires higher technical skills and poses performance
problems because of data serialization and memory
management. Serialization is critical for distributed system performance to
minimize data shuffling across the network; therefore, if not managed with
caution, it can lead to numerous issues such as memory overuse and network
bottlenecks.
Figure 7-2 The Spark Structured Streaming unbounded output table flow diagram
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode.Update
import org.apache.spark.sql.functions.{col, explode, split}
// inputStream is a streaming DataFrame created with spark.readStream (definition not shown)
inputStream.select(explode(split(col("value"), " ")).alias("word"))
  .groupBy("word")
  .count()
  .writeStream
  .format("console")
  .outputMode("complete") // Complete output mode selected
  .start()
  .awaitTermination()
Now that we have studied the basics of Spark Structured Streaming and the
main sources of data, it is time to see how streaming DataFrames work with
some examples.
package org.apress.handsOnSpark3
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType,
StructField, StructType}
import java.io.IOException
object readStreamSocket {
def main(args: Array[String]): Unit = {
try {
val spark: SparkSession = SparkSession.builder()
.master("local[*]")
.appName("Hand-On-Spark3_Socket_Data_Source")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
// Set up spark.readStream …
import spark.implicits._
selectDF.writeStream
.format("console")
.outputMode("append")
.option("truncate", false)
.option("newRows", 30)
.start()
.awaitTermination()
} catch {
case e: java.net.ConnectException => println("Error
establishing connection to " + host + ":" + port)
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data", t)
} finally {
println("Finally block")
}
}
}
The next piece of code is a version of the preceding Hospital Queue
Management System application that can be executed in Spark using a notebook
application such as Jupyter, Apache Zeppelin, etc., which can be more
convenient for learning purposes, especially if you are not familiar with Scala
code compiler tools:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType,
StringType, StructField,
StructType,DoubleType,LongType}
import org.apache.spark.sql.{DataFrame, Dataset,
Encoders, SparkSession}
import java.io.IOException
spark.sparkContext.setLogLevel("ERROR")
try {
val PatientDS = spark.readStream
.format("socket")
.option("host",host)
.option("port",port)
.load()
.select(from_json(col("value"),
PatientsSchema).as("patient"))
.selectExpr("Patient.*")
.as[Patient]
selectDF.writeStream
.format("console")
.outputMode("append")
.option("truncate",false)
.option("newRows",30)
.start()
.awaitTermination()
} catch {
case e: java.net.ConnectException =>
println("Error establishing connection to " + host +
":" + port)
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
}finally {
println("In finally block")
}
Notice how we have defined the PatientsSchema schema before ingesting the
data:
val PatientsSchema = StructType(Array(
StructField("NSS", StringType),
StructField("Nom", StringType),
StructField("DID", IntegerType),
StructField("DNom", StringType),
StructField("Fecha", StringType))
)
Netcat has several [<options>]; however, we are going to use only -l, which
instructs nc to listen on a UDP or TCP <port>, and -k, which is used in listen
mode to accept multiple connections. When <host> is omitted, nc listens to all
the IP addresses bound to the <port> given.
To illustrate how the program works, we are going to take advantage of the
nc utility introduced before, to establish a streaming client/server connection
between nc and our Spark application. In our case nc will act as a server (listens
to a host:port), while our application will act as a client (connects to the nc
server).
Whether you have built your JAR file from the previous code or are using
the notebook version, running the application consists of a two-step process:
1.
Open a terminal in your system and set up the server side of the client/server
streaming connection by running the following code:
nc -lk 9999
2.
Depending on how you are running the application
2.1.
Using a JAR file: Open a second terminal and execute your
application as shown in the following:
$SPARK_HOME/bin/spark-submit --class
org.apress.handsOnSpark3.readStreamSocket --
master "local[*]" PATH/TO/YOUR/HandsOnSpark3-
readStreamSocket.jar
2.2.
Using a notebook: Just execute the code in your notebook.
As soon as you see the message Listening and ready… on your
screen, you can go back to step 1 and type some of the JSON strings
provided, for example:
{"NSS":"1234","Nom":"María", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"2345","Nom":"Emilio", "DID":20,
"DNom":"Neuro", "Fecha":"01-09-2022"}
{"NSS":"3456","Nom":"Marta", "DID":30,
"DNom":"Endo", "Fecha":"01-09-2022"}
…
{"NSS":"4567","Nom":"Marcos", "DID":40,
"DNom":"Gastro", "Fecha":"01-09-2022"}
{"NSS":"5678","Nom":"Sonia", "DID":50,
"DNom":"Gineco", "Fecha":"01-09-2022"}
{"NSS":"6789","Nom":"Eduardo", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
With a cadence of seconds, you will see an output like the following one
coming up on your terminal:
Listening and ready...
-------------------------------------------
Batch: 1
-------------------------------------------
+----+------+---+------+----------+
|NSS |Nom |DID|DNom |Fecha |
+----+------+---+------+----------+
|1234|María |10 |Cardio|01-09-2022|
|2345|Emilio|20 |Neuro |01-09-2022|
|3456|Marta |30 |Endo |01-09-2022|
|4567|Marcos|40 |Gastro|01-09-2022|
|5678|Sonia |50 |Gineco|01-09-2022|
+----+------+---+------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+-------+---+------+----------+
|NSS |Nom |DID|DNom |Fecha |
+----+-------+---+------+----------+
|6789|Eduardo|10 |Cardio|01-09-2022|
+----+-------+---+------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+----+------+---+------+----------+
|NSS |Nom |DID|DNom |Fecha |
+----+------+---+------+----------+
|1009|Julia |20 |Neuro |01-09-2022|
|1010|Javier|30 |Endo |01-09-2022|
|1011|Laura |50 |Gineco|01-09-2022|
|1012|Nuria |10 |Cardio|01-09-2022|
|1013|Helena|10 |Cardio|01-09-2022|
+----+------+---+------+----------+
3.
Application termination
awaitTermination() waits for a user's termination signal. Thus, going to
the terminal session started in step 1 and pressing Ctrl+C (or sending
SIGTERM), the streaming context will be stopped.
package org.apress.handsOnSpark3
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType,
StructField, StructType}
import java.io.IOException
object dStreamsFiles {
def main(args: Array[String]): Unit = {
try {
val spark: SparkSession = SparkSession
.builder()
.master("local[3]")
.appName("Hand-On-Spark3_File_Data_Source")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark.readStream
.schema(PatientsSchema).json("/tmp/patient_streaming")
val groupDF = df.select("DID")
.groupBy("DID").agg(count("DID").as("Accumulated"))
.sort(desc("Accumulated"))
groupDF.writeStream
.format("console")
.outputMode("complete")
.option("truncate", false)
.option("newRows", 30)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
} finally {
println("Finally block")
}
}
}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType,
StringType, StructField, StructType}
import org.apache.spark.sql.functions.{count, desc}
import java.io.IOException
spark.sparkContext.setLogLevel("ERROR")
try{
val df = spark.readStream.schema(PatientsSchema)
.json("/tmp/patient_streaming")
groupDF.writeStream
.format("console")
.outputMode("complete")
.option("truncate",false)
.option("newRows",30)
.start()
.awaitTermination()
} catch{
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
}
$SPARK_HOME/bin/spark-submit --class
org.apress.handsOnSpark3.dStreamsFiles --master
"local[*]" target/scala-2.12/HandsOnSpark3-
dStreamsFiles-assembly-fatjar-1.0.jar
1.2.
If you are using a notebook, just execute the code in your notebook.
2.
Open a new terminal in your computer to copy the JSON files provided to
the monitored folder.
As soon as you see on your screen the message Listening and
ready... , you can go back to step 2 and start copying JSON files to the
/tmp/patient_streaming folder,5, for example:
cp /PATH/TO/patient1.json /tmp/patient_streaming
cp /PATH/TO/patient2.json /tmp/patient_streaming
cp /PATH/TO/patient3.json /tmp/patient_streaming
cp /PATH/TO/patient4.json /tmp/patient_streaming
cp /PATH/TO/patient5.json /tmp/patient_streaming
With a cadence of seconds, you will start seeing on your terminal session or
notebook an output like this:
-------------------------------------------
Batch: 1
-------------------------------------------
+---+-----------+
|DID|Accumulated|
+---+-----------+
|20 |1 |
|10 |1 |
|30 |1 |
+---+-----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+---+-----------+
|DID|Accumulated|
+---+-----------+
|10 |4 |
|20 |3 |
|40 |1 |
|50 |1 |
|30 |1 |
+---+-----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+---+-----------+
|DID|Accumulated|
+---+-----------+
|10 |7 |
|20 |3 |
|50 |2 |
|30 |2 |
|40 |1 |
+---+-----------+
The examples provided generate untyped DataFrames; this means that the
schema provided is not validated at compile time, only at runtime when the code
is executed.
So far, we have only been applying transformations to the data arriving at the
streaming process. For example, in our last program, the "Accumulated" column
counts the occurrences of the input "DID" field. This is what is called stateless
streaming. Suppose now a scenario in which you want to find out the total
occurrences of each value received by your streaming application, updating the
state of the previously processed information. This is where the concepts of
streaming state and stateful streaming, which we are going to see next, come
into play.
Time-Based Aggregations
Time-based aggregations are studied in detail in Chapter 8. Thus, we are going
to leave them for now.
No-Time-Based Aggregations
No-time-based aggregations include
Global aggregations
Those are general aggregations with no key discrimination. In the
examples you have seen so far, it could be the number of patients registering
in a hospital:
# PySpark
counts = PatientDS.groupBy().count()
// Scala
val counts = PatientDS.groupBy().count()
Grouped aggregations
These are aggregations by key or with key discrimination. Adding to the
previous Hospital Queue Management System application example, we could
be interested in seeing only the number of appointments of specific medical
departments in a hospital.
In the following you can see a modified version of our Hospital Queue
Management System application example with stateful grouped aggregations
by department id and department name:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType,
StructField, StructType,DoubleType,LongType}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders,
SparkSession}
import java.io.IOException
spark.sparkContext.setLogLevel("ERROR")
try {
val PatientDS = spark.readStream
.format("socket")
.option("host",host)
.option("port",port)
.load()
.select(from_json(col("value"), PatientsSchema).as("pa
.selectExpr("Patient.*")
.as[Patient]
// Stateful grouped aggregation by department id and name; this definition is
// assumed from the console output shown below
val counts = PatientDS.groupBy("DID", "DNom").count()
counts.writeStream
.format("console")
.outputMode("complete")
.option("truncate",false)
.option("newRows",30)
.start()
.awaitTermination()
} catch {
case e: java.net.ConnectException => println("Error
establishing connection to " + host + ":" + port)
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data", t)
}finally {
println("Finally block")
}
Running the previous code, you will see an output similar to the next one:
-------------------------------------------
Batch: 2
-------------------------------------------
+---+------+-----+
|DID|DNom |count|
+---+------+-----+
|20 |Neuro |11 |
|40 |Gastro|3 |
|50 |Gineco|9 |
|30 |Endo |8 |
|10 |Cardio|26 |
+---+------+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+---+------+-----+
|DID|DNom |count|
+---+------+-----+
|20 |Neuro |11 |
|40 |Gastro|3 |
|50 |Gineco|9 |
|30 |Endo |8 |
|10 |Cardio|27 |
+---+------+-----+
Multiple aggregations
The groupBy clause allows you to specify more than one aggregation
function to transform column information. Therefore, multiple aggregations
can be performed at once. For example, you can modify the previous code
snippet as follows
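A sketch of such a modification is shown next; the aggregation functions and
column aliases are chosen to match the console output reproduced below, and
the collectDID alias for the final collect_list column is an assumption, as that
column is truncated in the output:
import org.apache.spark.sql.functions._
val counts = PatientDS
  .groupBy("DID", "DNom")
  .agg(
    count("DID").as("countDID"),
    sum("DID").as("sumDID"),
    mean("DID").as("meanDID"),
    stddev("DID").as("stddevDID"),
    approx_count_distinct("DID").as("distinctDID"),
    collect_list("DID").as("collectDID")
  )
Running the modified query and typing a first set of JSON strings such as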
{"NSS":"4567","Nom":"Marcos", "DID":40,
"DNom":"Gastro", "Fecha":"2023-02-23T00:00:03.002Z"}
{"NSS":"5678","Nom":"Sonia", "DID":50,
"DNom":"Gineco", "Fecha":"2023-02-23T00:00:04.002Z"}
{"NSS":"6789","Nom":"Eduardo", "DID":10,
"DNom":"Cardio", "Fecha":"2023-02-23T00:00:05.002Z"}
{"NSS":"1001","Nom":"Lorena", "DID":10,
"DNom":"Cardio", "Fecha":"2023-02-23T00:00:06.002Z"}
{"NSS":"1006","Nom":"Sara", "DID":20,
"DNom":"Neuro", "Fecha":"2023-02-23T00:00:07.002Z"}
{"NSS":"1002","Nom":"Teresa", "DID":10,
"DNom":"Cardio", "Fecha":"2023-02-23T00:00:08.002Z"}
{"NSS":"1003","Nom":"Luis", "DID":20,
"DNom":"Neuro", "Fecha":"2023-02-23T00:00:09.002Z"}
and after that this second set of JSON strings
{"NSS":"1004","Nom":"Tomás", "DID":30,
"DNom":"Endo", "Fecha":"2023-02-23T00:00:10.002Z"}
{"NSS":"1005","Nom":"Lorena", "DID":50,
"DNom":"Gineco", "Fecha":"023-02-23T00:00:11.002Z"}
{"NSS":"1006","Nom":"Pedro", "DID":10,
"DNom":"Cardio", "Fecha":"023-02-23T00:00:12.002Z"}
{"NSS":"1007","Nom":"Ester", "DID":10,
"DNom":"Cardio", "Fecha":"023-02-23T00:00:13.002Z"}
{"NSS":"1008","Nom":"Marina", "DID":10,
"DNom":"Cardio", "Fecha":"023-02-23T00:00:14.002Z"}
{"NSS":"1009","Nom":"Julia", "DID":20,
"DNom":"Neuro", "Fecha":"023-02-23T00:00:15.002Z"}
{"NSS":"1010","Nom":"Javier", "DID":30,
"DNom":"Endo", "Fecha":"023-02-23T00:00:16.002Z"}
{"NSS":"1011","Nom":"Laura", "DID":50,
"DNom":"Gineco", "Fecha":"023-02-23T00:00:17.002Z"}
{"NSS":"1012","Nom":"Nuria", "DID":10,
"DNom":"Cardio", "Fecha":"023-02-23T00:00:18.002Z"}
{"NSS":"1013","Nom":"Helena", "DID":10,
"DNom":"Cardio", "Fecha":"023-02-23T00:00:19.002Z"}
you could see an output like the following one:
-------------------------------------------
Batch: 2
-------------------------------------------
+---+------+--------+------+-------+---------+-----------+---
|DID|DNom |countDID|sumDID|meanDID|stddevDID|distinctDID|col
+---+------+--------+------+-------+---------+-----------+---
|20 |Neuro |3 |60 |20.0 |0.0 |1 |[20
|40 |Gastro|1 |40 |40.0 |null |1 |[40
|50 |Gineco|3 |150 |50.0 |0.0 |1 |[50
|30 |Endo |2 |60 |30.0 |0.0 |1 |[30
|10 |Cardio|7 |70 |10.0 |0.0 |1 |[10
+---+------+--------+------+-------+---------+-----------+---
-------------------------------------------
Batch: 3
-------------------------------------------
+---+------+--------+------+-------+---------+-----------+---
|DID|DNom |countDID|sumDID|meanDID|stddevDID|distinctDID|col
+---+------+--------+------+-------+---------+-----------+---
|20 |Neuro |3 |60 |20.0 |0.0 |1 |[20
|40 |Gastro|1 |40 |40.0 |null |1 |[40
|50 |Gineco|3 |150 |50.0 |0.0 |1 |[50
|30 |Endo |2 |60 |30.0 |0.0 |1 |[30
|10 |Cardio|8 |80 |10.0 |0.0 |1 |[10
+---+------+--------+------+-------+---------+-----------+---
Note The aggregation functions shown in the previous code snippet are
included for illustration purposes only. Obviously, functions such as sum(),
mean(), stddev(), approx_count_distinct(), and collect_list() applied to a
medical department id "DID" do not make any business sense.
User-defined aggregation
Finally, Spark Structured Streaming supports user-defined aggregation
functions. Check the Spark SQL Guide for more and updated details.
import org.apache.spark.sql.streaming._
// ...
val checkpointDir = "/tmp/streaming_checkpoint"
// ...
counts.writeStream
// ...
.trigger(Trigger.ProcessingTime("5 seconds"))
.option("checkpointLocation", checkpointDir)
// ...
.start()
.awaitTermination()
As you can see, we have introduced several new features that we explain in
the following:
Trigger: Defines how often a streaming query must be triggered (run) to
process newly available streaming data—in other words, how frequently our
application has to review the data sources looking for new information and
possibly emit new data. Trigger was introduced into Spark to set the stream
batch period.
ProcessingTime is a trigger that assumes milliseconds as the minimum
unit of time. ProcessingTime(interval: String) accepts interval strings, with or
without the "interval" keyword, which are parsed into CalendarInterval
instances, for example:
With the keyword: ProcessingTime("interval 10 seconds")
Without the keyword: ProcessingTime("10 seconds")
There are four factory methods (options):
Default: If no trigger is set, the streaming query runs micro-batches one
after another, as soon as the precedent micro-batch has finished.
OneTimeTrigger: With this trigger mode set, it executes the trigger once and
stops. The streaming query will execute the data available in only one
micro-batch. A use case for this trigger mode could be to use it as a kind of
daily batch processing, saving computing resources and money. Example:
.trigger(Trigger.Once).
ProcessingTime: The user can define the ProcessingTime parameter, and the
streaming query will be triggered with the interval established, executing
new micro-batches and possibly emitting new data.
ContinuousTrigger: At the time this book was written, continuous
processing was an experimental streaming execution mode introduced in
Spark 2.3.7 It has been designed to achieve low latencies (in the order of 1
ms) providing at-least-once guarantee. To provide fault tolerance, a
checkpoint interval must be provided as a parameter. Example:
.trigger(Trigger.Continuous("1 second")). A checkpoint
interval of 1 s means that the stream engine will register the intermediate
results of the query every second. Every checkpoint is written in a micro-
batch engine-compatible structure; therefore, after a failure, the ongoing
(supported) query can be restarted by any other kind of trigger. For
example, a supported query that was started using the micro-batch mode
can be restarted in continuous mode, and vice versa. The continuous
processing mode only supports stateless queries such as select, map,
flatMap, mapPartitions, etc. and selections like where, filter, etc. All SQL
functions are supported in continuous mode, except aggregation functions and
the functions current_timestamp() and current_date().
checkpointLocation
This parameter points to the file system directory created for state storage
persistence purposes. To make the store fault-tolerant, the option
checkpointLocation must be set as part of the writeStream output
configuration.
The state storage uses the checkpoint folder to store mainly
Data checkpointing
Metadata checkpointing
In case we are using stateful operations, the structure of the Spark Streaming
checkpoint folder and the state data representation folders will look as illustrated
in Table 7-4.
Table 7-4 Spark Streaming Checkpoint and the State Data Representation Structure
7.7 Summary
In this chapter we went over the Spark Structured Streaming module. Firstly, we
studied the general semantics of message delivery reliability mechanisms.
Secondly, we compared Structured Streaming with Spark Streaming based on
DStreams. After that, we explained the technical details behind the Spark
Structured Streaming architecture, such as input and result tables as well as the
different output modes supported. In addition, we also went through the
streaming API for DataFrames and datasets and Structured Streaming stateless
and stateful transformations and aggregations, giving some interesting examples
that will help you learn how to implement these features. Finally, we studied the
concepts of streaming checkpointing and recovery, giving some practical
examples. In the next chapter, we are moving forward studying streaming
sources and sinks.
Footnotes
1 Late or out-of-order events (information).
4 https://spark.apache.org/docs/latest/structured-streaming-programming-
guide.html#input-sources
5 It is advised to copy the files progressively to better see how Spark processes them.
6
https://spark.apache.org/docs/3.3.2/api/R/reference/column_aggregate_functions.html
7 For up-to-date information, please check the Apache Spark official documentation,
https://spark.apache.org/docs/latest/structured-streaming-programming-
guide.html#continuous-processing
8 https://spark.apache.org/docs/3.3.2/structured-streaming-programming-
guide.html#recovery-semantics-after-changes-in-a-streaming-query
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_8
val df = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("/tmp/logs")
You can also specify the schema of your data, for example:
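For instance, reusing the PatientsSchema introduced in the previous chapters:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val PatientsSchema = StructType(Array(
  StructField("NSS", StringType),
  StructField("Nom", StringType),
  StructField("DID", IntegerType),
  StructField("DNom", StringType),
  StructField("Fecha", StringType))
)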
And then you can read the files based on the precedent schema:
val df = spark
.readStream
.schema(PatientsSchema)
.json("/tmp/patient_streaming")
In our preceding example, the returned df streaming DataFrame will have the
PatientsSchema. The “/tmp/patient_streaming” source directory must exist when
the stream process starts.
There are some important points to remember when using file sources:
The source directory must exist when the stream process starts, as mentioned
before.
All the files streamed to the source directory must be of the same format, that
is to say, they all must be text, JSON, Parquet, etc., and the schema must be
also the same if we want to preserve data integrity and avoid errors.
Files already present in the designated folder when the streaming job begins
are ignored. This concept is depicted in Figure 8-1.
Spark uses system tools that list files to identify the new files. Therefore, the
files appearing in the streaming directory must be complete and closed,
because Spark will process them as soon as they are discovered. Thus, any
data addition or file update could result in data loss.
When Spark processes a file, it is internally labeled as processed. Hence, it
will not be processed again even if it is updated.
In case several files should be processed, but Spark can only cope with part of
them in the next micro-batch, files with the earliest timestamps will be
processed first.
When creating a new FileStreamSource instance, two main options are
available:
schema: As we have already mentioned, it is the schema of the data, and it is
specified at instantiation time.
maxFilesPerTrigger: It specifies the maximum number of files read per
micro-batch. Therefore, it is used to control the stream read rate to the
maximum number of files per trigger.
In the following you have a code example in which we stream data from a
file source. This example includes the schema of the files used as a data source,
streams data from a directory, and outputs the results of the transformation to the
console:
package org.apress.handsOnSpark3
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType,
StringType, StructField, StructType}
import java.io.IOException
object dStreamsFiles {
def main(args: Array[String]): Unit = {
try {
val spark: SparkSession = SparkSession
.builder()
.master("local[3]")
.appName("Hand-On-Spark3_File_Data_Source")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
val df = spark
.readStream
.schema(PatientsSchema)
.json("/tmp/patient_streaming")
val groupDF = df.select("DID")
.groupBy("DID").agg(count("DID").as("Accumulated"))
.sort(desc("Accumulated"))
groupDF.writeStream
.format("console")
.outputMode("complete")
.option("truncate", false)
.option("newRows", 30)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException
occurred")
case t: Throwable => println("Error receiving
data", t)
} finally {
println("Finally block")
}
}
}
Next, we are going to jump to another built-in data source and one of the
most commonly used nowadays. First, we are going to provide an introduction
about Kafka, and after that we are going to provide a practical example.
package org.apress.handsOnSpark3.com
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType,
StringType, StructType, StructField}
object SparkKafka {
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers",
"localhost:9092")
.option("subscribe", "patient")
.option("startingOffsets", "earliest")
.load()
df.printSchema()
// Sketch: the Kafka "value" column carries the JSON payload as bytes; cast it
// to a string and parse it with the PatientsSchema used in previous chapters
val patient = df.selectExpr("CAST(value AS STRING) AS value")
  .select(from_json(col("value"), PatientsSchema).as("patient"))
  .select("patient.*")
patient.printSchema()
As soon as you have your code ready, it is time to give it a try. The first thing
we are going to do is to start the Kafka environment.
Note At the time this book was written, Kafka 3.4.0 was the latest release
and the one used in our examples. To be able to execute the code shown
before, your local environment must have Java 8+ installed.
Apache Kafka can be started using ZooKeeper or KRaft. In this book we are
using only the former.
Firstly, open a terminal session in your $KAFKA_HOME directory; all services
must be started in the correct order. Run the following command to start the
ZooKeeper service with the default configuration:
$ bin/zookeeper-server-start.sh
config/zookeeper.properties
Secondly, open another terminal session and run the following commands to
start the Kafka broker service with the default configuration as well:
$ bin/kafka-server-start.sh config/server.properties
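Before writing events, the patient topic has to exist. Assuming a single local
broker with the default settings, it can be created from a third terminal with
the kafka-topics.sh tool (one partition and a replication factor of 1 are
illustrative choices):
$ bin/kafka-topics.sh --create --topic patient --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1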
Now it is time to write some events into the patient topic just created and see
the results. To do that, we are going to create a Kafka producer using the
"bin/kafka-console-producer.sh" script, which is located in the Kafka
directory.
A Kafka producer is a client application that communicates with the Kafka
brokers to write events into topics. Once the information is received, the
brokers will save it in fault-tolerant storage for as long as we need it,
potentially forever. This is the reason our Spark application is able to
asynchronously consume the information stored in our example topic.
To see how it works, open a new terminal session and run the producer
console client, as shown in the following, to write some events into our “patient”
topic just created. In this example we are going to use the data from the JSON
files of Chapter 6. By default, every line you type will be a new event being
written to the “patient” topic:
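Assuming the default broker address, the producer console client can be started
like this:
$ bin/kafka-console-producer.sh --topic patient --bootstrap-server localhost:9092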
After pasting the content of the JSON files onto the Kafka producer console,
run your example program as follows:
$SPARK_HOME/bin/spark-submit --class
org.apress.handsOnSpark3.com.SparkKafka --master yarn
--packages org.apache.spark:spark-sql-kafka-0-
10_2.12:3.2.0 /PATH/TO/JAR/FILE/HandsOnSpark3-
Structured_Streaming_Hospital-1.0.jar
As soon as the program is running, you could see an output similar to the
next one coming out from your program:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- NSS: string (nullable = true)
|-- Nom: string (nullable = true)
|-- DID: integer (nullable = true)
|-- DNom: string (nullable = true)
|-- Fecha: string (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+---+------+----------+
| NSS| Nom|DID| DNom| Fecha|
+----+-----+---+------+----------+
|1234|María| 10|Cardio|01-09-2022|
+----+-----+---+------+----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+----+------+---+------+----------+
| NSS| Nom|DID| DNom| Fecha|
+----+------+---+------+----------+
|4567|Marcos| 40|Gastro|01-09-2022|
|5678| Sonia| 50|Gineco|01-09-2022|
+----+------+---+------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+-------+---+------+----------+
| NSS| Nom|DID| DNom| Fecha|
+----+-------+---+------+----------+
|6789|Eduardo| 10|Cardio|01-09-2022|
+----+-------+---+------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+----+------+---+------+----------+
| NSS| Nom|DID| DNom| Fecha|
+----+------+---+------+----------+
|1234| María| 10|Cardio|01-09-2022|
|2345|Emilio| 20| Neuro|01-09-2022|
|3456| Marta| 30| Endo|01-09-2022|
|4567|Marcos| 40|Gastro|01-09-2022|
+----+------+---+------+----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+----+-------+---+------+----------+
| NSS| Nom|DID| DNom| Fecha|
+----+-------+---+------+----------+
|4567| Marcos| 40|Gastro|01-09-2022|
|5678| Sonia| 50|Gineco|01-09-2022|
|6789|Eduardo| 10|Cardio|01-09-2022|
|1234| María| 10|Cardio|01-09-2022|
|4567| Marcos| 40|Gastro|01-09-2022|
|5678| Sonia| 50|Gineco|01-09-2022|
|6789|Eduardo| 10|Cardio|01-09-2022|
|1234| María| 10|Cardio|01-09-2022|
|2345| Emilio| 20| Neuro|01-09-2022|
|3456| Marta| 30| Endo|01-09-2022|
|4567| Marcos| 40|Gastro|01-09-2022|
+----+-------+---+------+----------+
To double-check the results of your streaming process, you can also read the
events from the Kafka brokers using a Kafka consumer, which is a client
application that subscribes to (reads and processes) events.
To see how that works, open another terminal session and run the consumer
console client as shown in the following, to read the patient topic we created
before:
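Assuming the same defaults, the consumer console client can be started as
follows (--from-beginning replays the topic from its first event):
$ bin/kafka-console-consumer.sh --topic patient --from-beginning --bootstrap-server localhost:9092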
You will see on your screen an output similar to the following one:
{"NSS":"1234","Nom":"María", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"2345","Nom":"Emilio", "DID":20,
"DNom":"Neuro", "Fecha":"01-09-2022"}
{"NSS":"3456","Nom":"Marta", "DID":30, "DNom":"Endo",
"Fecha":"01-09-2022"}
{"NSS":"4567","Nom":"Marcos", "DID":40,
"DNom":"Gastro", "Fecha":"01-09-2022"}
. . .
. . .
{"NSS":"4567","Nom":"Marcos", "DID":40,
"DNom":"Gastro", "Fecha":"01-09-2022"}
{"NSS":"5678","Nom":"Sonia", "DID":50,
"DNom":"Gineco", "Fecha":"01-09-2022"}
{"NSS":"6789","Nom":"Eduardo", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
To be able to compile the code examples used in this section, you have to use
the correct Kafka dependencies and Scala compiler version, which depend on the
Kafka, Spark, and Scala versions installed.
So far, we have talked about Spark's built-in streaming data sources such as
TCP/IP sockets, files, Apache Kafka, etc. Other advanced streaming sources that
can be paired with Apache Spark to create streaming pipelines include Amazon
Kinesis. In the next section, we are going to see how to create custom stream
data sources using tools primarily not intended for that purpose. In particular,
we are going to show you how to stream data from a NoSQL database such as
MongoDB.
{"NSS":"2345","Nom":"Emilio", "DID":20,
"DNom":"Neuro", "Fecha":"01-09-2022"}
{"NSS":"3456","Nom":"Marta", "DID":30, "DNom":"Endo",
"Fecha":"01-09-2022"}
{
"_id": {
"$oid": "640cba70f9972564d8c4ef2f"
},
"NSS": "2345",
"Nom": "Emilio",
"DID": 20,
"DNom": "Neuro",
"Fecha": "01-09-2022"
}
In the next code snippet, we will use the new MongoDB Spark connector to
read data from our MongoDB data collection:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.streaming.Trigger
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
df.printSchema()
import spark.implicits._
groupDF.printSchema()
groupDF.writeStream
.outputMode("append")
.option("forceDeleteTempCheckpointLocation", "true")
.format("console")
.option("checkpointLocation", "/tmp/checkpointDir")
//.trigger(Trigger.ProcessingTime("10 seconds"))
.trigger(Trigger.Continuous("30 seconds"))
.start()
.awaitTermination()
Going through the preceding code, you notice that while reading from a
MongoDB database, we do not necessarily need to define an information schema
as the schema is inferred from the MongoDB collection.
In any case, if you prefer or need to define your data schema, you can do it
and call the stream read process as follows.
First, define the schema of your data:
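A sketch of such a schema, mirroring the PatientsSchema used before and adding
the MongoDB _id field present in the collection documents:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val PatientsSchema = StructType(Array(
  StructField("_id", StringType),
  StructField("NSS", StringType),
  StructField("Nom", StringType),
  StructField("DID", IntegerType),
  StructField("DNom", StringType),
  StructField("Fecha", StringType))
)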
After that, use it in combination with your readStream method like this, to
define the schema of the incoming data:
val df = spark.readStream
.format("mongodb")
.option("spark.mongodb.connection.uri", mongoDBURI)
.option("spark.mongodb.database", "MongoDB_Data_Source")
.option("spark.mongodb.collection", "MongoDB_Data_Source")
.option("spark.mongodb.change.stream.publish.full.document.on
"true")
.option("forceDeleteTempCheckpointLocation", "true")
.schema(PatientsSchema)
.load()
We have used the property isStreaming to verify that the dataset is streaming.
It returns true if the df dataset contains one or more data sources that constantly
send data as it arrives.
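For example:
// Returns true because df is backed by a streaming (MongoDB change stream) source
println(df.isStreaming)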
Finally, when writing the streamed data to the console, we have chosen the
continuous trigger type as it is supported by the latest MongoDB Spark
connector.
In this case, we have set the trigger to “30 s” for the sake of readability, as
using a “1 s” trigger, for instance, would have been pulling data continuously to
the console and it would have been more difficult to collect it for the book:
.trigger(Trigger.Continuous("30 seconds"))
Nevertheless, you can use any of the other supported trigger types, such as
Default trigger: It runs micro-batches as soon as possible.
ProcessingTime trigger: It triggers micro-batches with a time interval
specified.
One-time trigger: It will execute only one micro-batch, process the
information available, and stop.
Available-now trigger: It is similar to the one-time trigger with the difference
that it is designed to achieve better query scalability trying to process data in
multiple micro-batches based on the configured source options (e.g.,
maxFilesPerTrigger).
In the next code example, we show how to modify the previous program to
use Trigger.ProcessingTime with a 10 s interval:
groupDF.writeStream
.outputMode("append")
.option("forceDeleteTempCheckpointLocation",
"true")
.format("console")
.option("checkpointLocation",
"/tmp/checkpointDir")
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
.awaitTermination()
Once the new document is inserted into the MongoDB database, you can see
it displayed as in Figure 8-7.
Figure 8-7 A new document is inserted into a MongoDB database and collection
You should see an outcome similar to the following one coming out from
your application:
root
|-- _id: string (nullable = true)
|-- NSS: string (nullable = true)
|-- Nom: string (nullable = true)
|-- DID: integer (nullable = true)
|-- DNom: string (nullable = true)
|-- Fecha: string (nullable = true)
-------------------------------------------
Batch: 1
-------------------------------------------
+----+------+---+------+----------+--------------------+
| NSS|   Nom|DID|  DNom|     Fecha|                 _id|
+----+------+---+------+----------+--------------------+
|3456| Marta| 30|  Endo|01-09-2022|640cbaa7f9972564d...|
|4567|Marcos| 40|Gastro|01-09-2022|640cbab3f9972564d...|
+----+------+---+------+----------+--------------------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+-------+---+------+----------+--------------------+
| NSS|    Nom|DID|  DNom|     Fecha|                 _id|
+----+-------+---+------+----------+--------------------+
|4567| Marcos| 40|Gastro|01-09-2022|640cbabdf9972564d...|
|5678|  Sonia| 50|Gineco|01-09-2022|640cbac8f9972564d...|
|6789|Eduardo| 10|Cardio|01-09-2022|640cbad3f9972564d...|
+----+-------+---+------+----------+--------------------+
-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+---+------+----------+--------------------+
| NSS| Nom|DID| DNom| Fecha| _id|
+----+-----+---+------+----------+--------------------+
|1234|María| 10|Cardio|01-09-2022|640cbadcf9972564d...|
+----+-----+---+------+----------+--------------------+
As we have used the continuous trigger type with a 30-second interval, the data is not streamed as it is registered, but every 30 seconds; otherwise, you could not see the data aggregated in different batches, unless you were able to type faster than the server is able to process the information.
Now, after we have seen several stream sources, it is time to deal with data
storage. In data streaming terminology, those stores are known as data sinks.
For instance, to write the stream to the file sink in CSV or Parquet format, you only have to adapt the format and path options, as shown in the following snippet:
PatientDF.writeStream
// You have to change this part of the code
.format("csv")
.option("path", "/tmp/streaming_output/csv")
// … for this
.format("parquet")
.option("path", "/tmp/streaming_output/parquet")
// ...
.trigger(Trigger.ProcessingTime("5 seconds"))
.option("checkpointLocation", checkpointDir)
.outputMode("append")
.option("truncate",false)
.option("newRows",30)
.start()
.awaitTermination()
Now, if you have a look at the designated output directories, you should find
an output similar to the one depicted in Figure 8-8.
Figure 8-8 Example of streaming output to the file sink in CSV and Parquet formats
For the sake of simplicity and visibility, in Figure 8-8 we have paired both
outputs together. The CSV output format is on the left, and the Parquet output
format is on the right.
Similarly, you can use Kafka as a data sink and publish the streaming results to a Kafka topic:
counts.writeStream
.format("kafka")
.option("kafka.bootstrap.servers","host1:port1,host2:port2")
// ...
.option("topic", "patient")
.option("checkpointLocation", "/tmp/kafka_checkpoint")
.start()
.awaitTermination()
The next code snippet is a small modification of our previous examples. First of all, we have defined our own writing business logic, encapsulated inside the saveToCSV() function, which adds a timestamp to each micro-batch processed.
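A minimal sketch of such a saveToCSV() function could be the following; the output path is an assumption consistent with the directory listing shown later, and it relies on the imports included in the code example that follows:
// Hypothetical foreachBatch function: appends the processing date to each
// record of the micro-batch and writes it out as CSV
def saveToCSV(df: DataFrame, batchId: Long): Unit = {
  df.withColumn("timeStamp", date_format(current_date(), "yyyyMMdd"))
    .write
    .format("csv")
    .mode("append")
    .save("/tmp/streaming_output/foreachBatch/")
}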
Here is the code example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, DoubleType, LongType}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
.select(from_json(col("value"), PatientsSchema).as("patient"))
.selectExpr("Patient.*")
.as[Patient]
PatientDF.writeStream
.trigger(Trigger.ProcessingTime("5 seconds"))
.option("checkpointLocation", checkpointDir)
.outputMode("append")
.foreachBatch(saveToCSV)
.start()
.awaitTermination()
} catch {
case e: java.net.ConnectException =>
println("Error establishing connection to " + host + ":" + port)
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
Before executing the preceding code example, open a terminal session and create a socket session as follows:
nc -lk 9999
Then, run the code example and, when you see the Listening and ready... message in your notebook, go back to the previous terminal session and paste the JSON examples we provided you in Chapter 6, for instance:
{"NSS":"1234","Nom":"María", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
. . .
{"NSS":"2345","Nom":"Emilio", "DID":20,
"DNom":"Neuro", "Fecha":"01-09-2022"}
{"NSS":"3456","Nom":"Marta", "DID":30, "DNom":"Endo",
"Fecha":"01-09-2022"}
After running the previous program and pasting the data to the terminal
console, if you have a look at the designated output directory path
/tmp/streaming_output/foreachBatch/, you should find a bunch of
files similar to the following:
/tmp/streaming_output/foreachBatch/
├── part-00000-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00000-63507ff8-a09a-4c8e-a526-28890c170d96-c000.csv
├── part-00000-9a2caabe-7d84-4799-b788-a633cfc32042-c000.csv
├── part-00000-dabb8320-0c0e-4bb5-ad19-c36a53ac8d1e-c000.csv
├── part-00000-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
├── part-00000-f924d5cc-8e4a-4d5f-91b7-965ce2ac8710-c000.csv
├── part-00000-fd07c2e4-1db1-441c-8199-a69a064efe75-c000.csv
├── part-00001-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00001-63507ff8-a09a-4c8e-a526-28890c170d96-c000.csv
├── part-00001-9a2caabe-7d84-4799-b788-a633cfc32042-c000.csv
├── part-00001-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
├── part-00001-fd07c2e4-1db1-441c-8199-a69a064efe75-c000.csv
├── part-00002-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00002-9a2caabe-7d84-4799-b788-a633cfc32042-c000.csv
├── part-00002-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
├── part-00002-fd07c2e4-1db1-441c-8199-a69a064efe75-c000.csv
├── part-00003-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00003-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
├── part-00003-fd07c2e4-1db1-441c-8199-a69a064efe75-c000.csv
├── part-00004-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00004-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
├── part-00005-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
├── part-00005-df0c4ba0-a9f0-40ed-b773-b879488b0a85-c000.csv
└── _SUCCESS
If we open one of those files, for example
vi part-00000-07c12f65-b1d6-4c7b-b50d-2d8b25d724b8-c000.csv
we see the following content, including the timestamp at the end of each record, as we expected:
1009,Julia,20,Neuro,01-09-2022,20230317
{"NSS":"2345","Nom":"Emilio", "DID":20, "DNom":"Neuro", "Fecha":"01-09-2022"}
2345,Emilio,20,Neuro,01-09-2022,20230317
{"NSS":"4567","Nom":"Marcos", "DID":40, "DNom":"Gastro", "Fecha":"01-09-2022"}
4567,Marcos,40,Gastro,01-09-2022,20230317
And so forth.
In a similar way, you could write your own function to use PostgreSQL as a data sink. Your code could look like this:
// Hypothetical foreachBatch sink function; the name and table placeholders are illustrative
def saveToPostgres(df: DataFrame, batchId: Long): Unit = {
df
.withColumn("timeStamp", date_format(current_date(), "yyyyMMdd"))
.write.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", url)
.option("dbtable", "<your_table>")
.option("user", "<your_user>")
.option("password", "<your_password>")
.mode("append")
.save()
}
The foreach sink, in turn, allows you to apply custom writing logic to every single row of the streaming output:
counts.writeStream
.foreach( /* some user logic goes here */ )
// ...
.start()
.awaitTermination()
Let's now see, with a simple example, how foreach can be implemented. For the purpose of this example, we have slightly modified our previous code snippet used for the foreachBatch sink to meet our needs:
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.format("socket")
.option("host", host)
.option("port", port)
.load()
.select(from_json(col("value"), PatientsSchema).as("patient"))
.selectExpr("Patient.*")
.as[Patient]
PatientDF.writeStream
.trigger(Trigger.ProcessingTime("5 seconds"))
.option("checkpointLocation", checkpointDir)
.outputMode("append")
.foreach(customWriterToConsole)
.start()
.awaitTermination()
} catch {
case e: java.net.ConnectException =>
println("Error establishing connection to " + host + ":" + port)
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
Before executing the preceding code, open a terminal session and create a
socket session as follows:
$ nc -lk 9999
Once the socket session has been created, it is time to run the code. As soon as you see the line Listening and ready... on your screen, go back to the terminal with the socket session open and start typing JSON lines. You can use lines like the following:
{"NSS":"1234","Nom":"María", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"2345","Nom":"Emilio", "DID":20,
"DNom":"Neuro", "Fecha":"01-09-2022"}
{"NSS":"3456","Nom":"Marta", "DID":30, "DNom":"Endo",
"Fecha":"01-09-2022"}
{"NSS":"4567","Nom":"Marcos", "DID":40,
"DNom":"Gastro", "Fecha":"01-09-2022"}
{"NSS":"5678","Nom":"Sonia", "DID":50,
"DNom":"Gineco", "Fecha":"01-09-2022"}
{"NSS":"6789","Nom":"Eduardo", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"1001","Nom":"Lorena", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"1006","Nom":"Sara", "DID":20, "DNom":"Neuro",
"Fecha":"01-09-2022"}
{"NSS":"1002","Nom":"Teresa", "DID":10,
"DNom":"Cardio", "Fecha":"01-09-2022"}
{"NSS":"1003","Nom":"Luis", "DID":20, "DNom":"Neuro",
"Fecha":"01-09-2022"}
You will see an output like this coming out of your program:
Going back to the previous code example, you can see that the only differences are the customWriterToConsole() function implementing a ForeachWriter and the foreach sink call itself, inside the writeStream method.
Notice the implementation of the three mandatory methods of a ForeachWriter: open, process, and close.
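A minimal sketch of such a customWriterToConsole writer, assuming the Patient case class used in the earlier socket examples, could look like this:
import org.apache.spark.sql.ForeachWriter

// Hypothetical row-by-row writer that prints every Patient to the console
val customWriterToConsole = new ForeachWriter[Patient] {
  // Called once per partition and epoch; return true to process its rows
  def open(partitionId: Long, epochId: Long): Boolean = {
    println(s"Open partition $partitionId, epoch $epochId")
    true
  }
  // Called once for every row of the partition
  def process(value: Patient): Unit = println(value)
  // Called when the partition has been processed or an error has occurred
  def close(errorOrNull: Throwable): Unit = println("Closing writer")
}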
And notice the foreach sink call inside the writeStream method:
PatientDF.writeStream
.trigger(Trigger.ProcessingTime("5 seconds"))
.option("checkpointLocation", checkpointDir)
.outputMode("append")
.foreach(customWriterToConsole)
.start()
.awaitTermination()
Next, we are going to see a practical code example showing how to use MongoDB as a Spark data sink. As in previous examples, the program reads JSON files from a directory as soon as they appear there and inserts the data into a MongoDB collection. The JSON files we are using are the patient examples we have been working with so far, first introduced in Chapter 6:
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
val df = spark.readStream
.schema(PatientsSchema)
.option("checkpointLocation", "/tmp/checkpoint")
.json("/tmp/stream_mongo")
df.printSchema()
// newDF is the DataFrame derived from df that is written to MongoDB
newDF.printSchema()
newDF.writeStream
.format("mongodb")
.option("checkpointLocation", "/tmp/checkpoint")
.option("forceDeleteTempCheckpointLocation", "true")
// The connection options below mirror those used when reading from MongoDB
// earlier in this chapter; adjust them to your own deployment
.option("spark.mongodb.connection.uri", mongoDBURI)
.option("spark.mongodb.database", "MongoDB_Data_Source")
.option("spark.mongodb.collection", "MongoDB_Data_Source")
.outputMode("append")
.start()
.awaitTermination()
root
|-- NSS: string (nullable = true)
|-- Nom: string (nullable = true)
|-- DID: integer (nullable = true)
|-- DNom: string (nullable = true)
|-- Fecha: string (nullable = true)
Once the schema is printed, you can start copying files to the designated streaming directory. For the purpose of this example, we use the JSON files we used in Chapter 6. Here is an example of how you can do it:
$ cp /tmp/json/patient1.json /tmp/stream_mongo
$ cp /tmp/json/patient2.json /tmp/stream_mongo
$ cp /tmp/json/patient3.json /tmp/stream_mongo
$ cp /tmp/json/patient4.json /tmp/stream_mongo
$ cp /tmp/json/patient5.json /tmp/stream_mongo
$ cp /tmp/json/patient6.json /tmp/stream_mongo
Now, if you have a look at your MongoDB database—in our case we have
used the graphical interface MongoDB Compass to do it—you could see the data
inserted from the streaming process.
Figure 8-10 shows you how to filter the already recorded data using different
data keys. In this case we have used the department ID (“DID”). Remember
MongoDB stores the information in a JSON-like format, not in tables as
traditional OLTP databases do.
Figure 8-10 MongoDB Compass filtering data by department ID (DID)
In Figure 8-11 you can see a similar filtering query, but in this case we have
filtered by the Social Security Number (SSN).
Figure 8-11 MongoDB Compass filtering data by Social Security Number (SSN)
8.3 Summary
In this chapter we went over the Spark Structured Streaming module. In particular, we studied the most common data sources and data sinks for streaming data processing. First, we studied the built-in Spark Structured Streaming data sources, paying special attention to the most typical ones: the file, socket, and Kafka sources. Kafka is one of the most important streaming frameworks nowadays; therefore, we developed a specific code example showing how to use it as a live stream source. Second, we showed how to implement a custom data source and developed another practical example of how to do it with MongoDB. After that, we moved forward and repeated the same process with data sinks. First, we went through the built-in data sinks, that is to say, the console, file, and Kafka sinks. Later on, we studied the foreachBatch and foreach sinks and analyzed how they can be used to create tailor-made data sinks. To finish, we also provided a practical example of a custom data sink, implemented once again with MongoDB. In the next chapter, we move forward to advanced streaming configurations, introducing Event-Time Window Operations and Watermarking.
Footnotes
1 More information can be found here: https://docs.databricks.com/structured-
streaming/foreach.html
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_9
After having studied the insights of Apache Spark Streaming and Structured
Streaming, in this chapter, we are going to focus on time-based stream
processing.
Data analytics is evolving from batch to stream data processing for many use cases. One of the reasons for this shift is that it is becoming more and more commonly accepted that streaming data is better suited to model the life we live. This is particularly true when we think about most of the systems we want to analyze and model—autonomous cars receiving and emitting satellite navigation coordinates, Internet of Things (IoT) devices exchanging signals, road sensors counting vehicles for traffic control, wearable devices, etc.—they all have something in common: they produce a continuous stream of events over time. In fact, streaming data sources are almost omnipresent.
Additionally, events are generated as a result of some activity, and in many
scenarios they require some immediate action to be taken. Consider, for
example, applications for fraud or anomaly detection or personalization,
marketing, and advertising in real time as some of the most common use cases of
real-time stream processing and event-driven applications.
Coherent time semantics are of paramount importance in stream processing, as many operations in event processing, such as aggregation over a time window, joins, and straggler management, depend on time.
In this chapter, we are going to go through the concept of temporal windows,
also known as time windows, for stream processing, study Spark’s built-in
window functions, and explain windowing semantics.
9.1 Event-Time Processing
As mentioned just before, many operations in real-time event stream processing depend on time. When dealing with events and time, we have several options of time marks for an event, and depending on the use case at hand, we must prioritize one variant over the others:
Event-time: It refers to the time in which the event was created, for example,
produced by a sensor.
Ingestion-time: It denotes the moment in time when the event was ingested by
the event streaming platform. It is implemented by adding a timestamp to the
event when it enters the streaming platform.
Processing-time, also called Wall-clock-time: It is the moment when the event
is effectively processed.
Next, Figure 9-1 graphically explains the previous event-time processing
concepts.
Figure 9-4 Example of temporal window to count events per window time
For the sake of simplicity, the previous figures show the same number of events per window interval; however, be advised that this is not always going to happen, and different numbers of events can fall into different temporal windows, as highlighted next in Figure 9-6.
Figure 9-6 A ten-second tumbling window with different number of events per window
The next code snippet uses a tumbling window of ten seconds to perform an aggregate count of the number of patients entering the hospital over each window interval.
// Tumbling windows
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, DoubleType, LongType}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.schema(PatientsSchema)
.json("/tmp/window")
// PatientDF is the tumbling window aggregation of PatientDS
// (see the sketch after this listing)
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
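In the preceding listing, PatientDF stands for the windowed aggregation of PatientDS. A minimal sketch of such a ten-second tumbling window count, assuming the Fecha column is cast to a timestamp and the aggregate is exposed as Suma_x_Dpt (the column name shown in the output below), could be:
val PatientDF = PatientDS
  .withColumn("Fecha", to_timestamp(col("Fecha")))
  // Nonoverlapping (tumbling) ten-second windows over the event time
  .groupBy(window(col("Fecha"), "10 seconds"))
  .agg(count("*").alias("Suma_x_Dpt"))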
To run the previous code example, first of all you have to create the necessary data source directory (in our case, /tmp/window) to copy the corresponding JSON files to.
Once you have done so, run the code and start copying the JSON files to the source directory:
$ cp json_file1.json /tmp/window
$ cp json_file2.json /tmp/window
$ cp json_file3.json /tmp/window
$ cp json_file4.json /tmp/window
$ cp json_file5.json /tmp/window
$ cp json_file6.json /tmp/window
$ cp json_file7.json /tmp/window
You will see an output similar to the following one coming out of your program:
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|3 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|6 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|10 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|10 |
|{2023-02-23 01:00:10, 2023-02-23 01:00:20}|1 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 5
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|10 |
|{2023-02-23 01:00:10, 2023-02-23 01:00:20}|1 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 6
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:00, 2023-02-23 01:00:10}|10 |
|{2023-02-23 01:00:10, 2023-02-23 01:00:20}|1 |
+------------------------------------------+----------+
Now if you introduce a small change in the previous code like this
PatientDF.printSchema()
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
You will see the schema of your window data frame is like the following:
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- Suma_x_Dpt: long (nullable = false)
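To display the start and end fields of the window struct as separate columns, as in the output below, you can flatten the struct before writing it out, for example (a sketch):
val flattenedDF = PatientDF.select(
  col("window.start").alias("start"),
  col("window.end").alias("end"),
  col("Suma_x_Dpt"))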
And you will see the window information as shown in the following:
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|3 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|6 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|10 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|10 |
|2023-02-23 01:00:10|2023-02-23 01:00:20|1 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|10 |
|2023-02-23 01:00:10|2023-02-23 01:00:20|1 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 6
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:00|2023-02-23 01:00:10|10 |
|2023-02-23 01:00:10|2023-02-23 01:00:20|1 |
+-------------------+-------------------+----------+
Note Please notice that the time intervals established here are very narrow for the sake of illustration. The same code applied to a real hospital would probably use wider time intervals.
// Sliding Windows
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType, DoubleType, LongType}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.schema(PatientsSchema)
.json("/tmp/window")
// PatientDF is the sliding window aggregation of PatientDS
// (see the sketch after this listing)
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
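As in the tumbling window example, PatientDF represents the windowed aggregation of PatientDS; for a sliding window, the window() function simply takes an additional slide duration. A minimal sketch matching the ten-second window with a five-second slide described below, again assuming Fecha is cast to a timestamp:
val PatientDF = PatientDS
  .withColumn("Fecha", to_timestamp(col("Fecha")))
  // Ten-second windows starting every five seconds (overlapping)
  .groupBy(window(col("Fecha"), "10 seconds", "5 seconds"))
  .agg(count("*").alias("Suma_x_Dpt"))
Once the code is running, copy the example JSON files to the source directory: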
$ cp json_file9.json /tmp/window
$ cp json_file8.json /tmp/window
$ cp json_file7.json /tmp/window
...
$ cp json_file1.json /tmp/window
As soon as you copy the mentioned files, and depending on the copying rate you apply, the program will create windows of ten seconds with a sliding interval of five seconds. A new ten-second window is created every five seconds, offset by five seconds from the beginning of the previous one, as shown in the next program output.
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:25, 2023-02-23 01:00:35}|10 |
|{2023-02-23 01:00:20, 2023-02-23 01:00:30}|8 |
|{2023-02-23 01:00:35, 2023-02-23 01:00:45}|4 |
|{2023-02-23 01:00:30, 2023-02-23 01:00:40}|9 |
|{2023-02-23 01:00:15, 2023-02-23 01:00:25}|3 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:25, 2023-02-23 01:00:35}|10 |
|{2023-02-23 01:00:20, 2023-02-23 01:00:30}|10 |
|{2023-02-23 01:00:35, 2023-02-23 01:00:45}|4 |
|{2023-02-23 01:00:10, 2023-02-23 01:00:20}|5 |
|{2023-02-23 01:00:30, 2023-02-23 01:00:40}|9 |
|{2023-02-23 01:00:15, 2023-02-23 01:00:25}|10 |
+------------------------------------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+------------------------------------------+----------+
|window |Suma_x_Dpt|
+------------------------------------------+----------+
|{2023-02-23 01:00:25, 2023-02-23 01:00:35}|10 |
|{2023-02-23 01:00:20, 2023-02-23 01:00:30}|11 |
|{2023-02-23 01:00:35, 2023-02-23 01:00:45}|4 |
|{2023-02-23 01:00:10, 2023-02-23 01:00:20}|10 |
|{2023-02-23 01:00:30, 2023-02-23 01:00:40}|9 |
|{2023-02-23 01:00:15, 2023-02-23 01:00:25}|16 |
+------------------------------------------+----------+
As we did with tumbling windows, you can modify the previous code snippet to flatten the window column into its start and end fields, obtaining an output like the following:
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:25|2023-02-23 01:00:35|10 |
|2023-02-23 01:00:20|2023-02-23 01:00:30|8 |
|2023-02-23 01:00:35|2023-02-23 01:00:45|4 |
|2023-02-23 01:00:30|2023-02-23 01:00:40|9 |
|2023-02-23 01:00:15|2023-02-23 01:00:25|3 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:25|2023-02-23 01:00:35|10 |
|2023-02-23 01:00:20|2023-02-23 01:00:30|10 |
|2023-02-23 01:00:35|2023-02-23 01:00:45|4 |
|2023-02-23 01:00:10|2023-02-23 01:00:20|5 |
|2023-02-23 01:00:30|2023-02-23 01:00:40|9 |
|2023-02-23 01:00:15|2023-02-23 01:00:25|10 |
+-------------------+-------------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+-------------------+----------+
|start |end |Suma_x_Dpt|
+-------------------+-------------------+----------+
|2023-02-23 01:00:25|2023-02-23 01:00:35|10 |
|2023-02-23 01:00:20|2023-02-23 01:00:30|11 |
|2023-02-23 01:00:35|2023-02-23 01:00:45|4 |
|2023-02-23 01:00:10|2023-02-23 01:00:20|10 |
|2023-02-23 01:00:30|2023-02-23 01:00:40|9 |
|2023-02-23 01:00:15|2023-02-23 01:00:25|16 |
+-------------------+-------------------+----------+
With sliding windows, we can answer questions such as: What was the number of patients visiting our hospital during the last minute, hour, etc.? Or trigger events such as "ring an alarm" whenever more than five patients for the same medical department enter the hospital in the last ten seconds.
In the next section, we are going to study session windows, which have different semantics compared to the previous two types of windows.
Figure 9-8 Ten-second session window with a gap interval of five seconds
Session windows are the right tool when facing business questions like: Which patients visited the hospital at a certain moment in time? Or what were the hospital's busiest moments during a defined period of time?
As usual, we include a practical example of session window usage. In the following code snippet, you can see how a session window can be depicted as a window that collects all upcoming events arriving within the timeout period. As you will see, all events collected inside the window time frame are added to the current session.
In the next example, we use session_window() to count incoming events over a session window with a ten-second gap on the Fecha column of our sample events.
// Session Window
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.schema(PatientsSchema)
.json("/tmp/window")
PatientDS.printSchema()
// PatientDF is the session window aggregation of PatientDS
// (see the sketch after this listing)
PatientDF.printSchema()
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
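Here, PatientDF is the session-window aggregation of PatientDS. A minimal sketch, assuming a ten-second gap on the Fecha column (cast to a timestamp) and a count keyed by DID, as in the output shown next:
val PatientDF = PatientDS
  .withColumn("Fecha", to_timestamp(col("Fecha")))
  // A session closes when no event arrives for ten seconds
  .groupBy(session_window(col("Fecha"), "10 seconds"), col("DID"))
  .count()
Note that the grouping includes both the session_window and the DID column, which also satisfies the restriction on streaming session windows discussed later in this section.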
As you did in previous examples, before running the preceding code, you first have to create the data source folder (again, /tmp/window in our example). After that, you can copy the example JSON files to the data source directory, for example, like this:
$ cp json_file11.json /tmp/window
$ cp json_file9.json /tmp/window
...
$ cp json_file7.json /tmp/window
Once the files are copied, your program should output something like this:
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:00:25.002}|20 |2 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:00:31.002}|10 |7 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:00:27.002}|50 |2 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:00:26.002}|30 |2 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:00:15.002, 2023-02-23 01:00:25.002}|20 |2 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:00:31.002}|10 |7 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:00:27.002}|50 |2 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:00:26.002}|30 |2 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:00:25.002}|20 |2 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:00:34.002, 2023-02-23 01:00:45.002}|20 |2 |
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:00:31.002}|10 |7 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:00:36.002, 2023-02-23 01:00:46.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:00:27.002}|50 |2 |
|{2023-02-23 01:00:30.002, 2023-02-23 01:00:48.002}|50 |5 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:00:26.002}|30 |2 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
|{2023-02-23 01:00:37.002, 2023-02-23 01:00:47.002}|30 |1 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
+--------------------------------------------------+---+-----+
// Session window with a dynamic gap duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.schema(PatientsSchema)
.json("/tmp/window")
PatientDS.printSchema()
// PatientDF is the session window aggregation with a dynamic gap
// (see the sketch below)
PatientDF.printSchema()
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
The novelty of the previous code resides in the block of code that implements the session window with a dynamic timeout.
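That block is sketched below. The gap duration is provided as a column expression evaluated per event; the concrete rule (ten seconds for department 20 and sixty seconds for the rest) is only an assumption to illustrate the API:
val PatientDF = PatientDS
  .withColumn("Fecha", to_timestamp(col("Fecha")))
  .groupBy(
    // Dynamic gap: the session timeout depends on the department ID
    session_window(col("Fecha"),
      when(col("DID") === 20, "10 seconds").otherwise("60 seconds")),
    col("DID"))
  .count()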
-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:00:25.002}|20 |1 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:20.002}|10 |3 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:01:17.002}|50 |1 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:01:16.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:00:45.002}|20 |2 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:21.002}|10 |7 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:01:17.002}|50 |2 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:01:16.002}|30 |2 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:01:22.002}|20 |3 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:28.002}|10 |11 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:01:29.002}|50 |4 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:01:23.002}|30 |3 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:01:35.002}|20 |5 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:36.002}|10 |12 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:01:38.002}|50 |9 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:01:37.002}|30 |4 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:03:00.002}|20 |7 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:03:05.002}|10 |1 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:36.002}|10 |12 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:03:38.002}|50 |15 |
|{2023-02-23 01:00:16.002, 2023-02-23 01:01:37.002}|30 |4 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:03:37.002}|30 |2 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 5
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:00:15.002, 2023-02-23 01:03:00.002}|20 |7 |
|{2023-02-23 01:00:18.002, 2023-02-23 01:01:36.002}|10 |12 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:03:05.002}|10 |1 |
|{2023-02-23 01:00:17.002, 2023-02-23 01:03:38.002}|50 |15 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:03:37.002}|30 |2 |
|{2023-02-23 01:00:10.002, 2023-02-23 01:01:37.002}|30 |5 |
+--------------------------------------------------+---+-----+
At the time this book was written, and as of Spark 3.3.2, some restrictions are in place when using session windows in a streaming query:
Output mode "update" is not supported.
The grouping clause should include at least two columns, the session_window and another one.
However, when used in a batch query, the grouping clause can include only the session_window column, as mentioned in the Apache Spark official documentation2.
What Is a Watermark?
Watermarking could be defined as a lateness threshold. Watermarking permits Spark Structured Streaming to tackle the problem of late-arriving events. Management of stragglers, or out-of-order events, is critical in distributed architectures for the sake of data integrity, accuracy, and fault tolerance. When dealing with this kind of complex system, it is not guaranteed that the data will arrive at the streaming platform in the order it was produced. This could happen due to network bottlenecks, latency in the communications, etc. To overcome these difficulties, the state of aggregate operations must be retained.
Spark Structured Streaming uses watermarks as a cutoff point to control how long the Spark stream processing engine will wait for late events.
Therefore, when we declare a watermark, we specify a timestamp field and a watermark time limit. For instance, consider our session window code snippet. We can modify it as shown in the following to introduce a watermark threshold.
In this example:
The Fecha column is used to define a 30-second watermark.
A count is performed for each DID observed over each ten-second session window.
State information is preserved for each count until the end of the window is 30 seconds older than the latest observed Fecha value.
After including a watermark, as new data arrives, Spark tracks the most recent timestamp in the designated column and processes only the incoming events that fall within the watermark threshold.
Here is the complete code example, including a watermark of 30 seconds.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._ // {IntegerType, StringType, StructField, StructType, DoubleType, LongType}
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import java.io.IOException
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import org.apache.spark.sql.DataFrame
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
try {
val PatientDS = spark.readStream
.schema(PatientsSchema)
.json("/tmp/window")
.withColumn("Fecha", to_timestamp(col("Fecha"), "yyyy-MM-dd'T'HH:mm:ss.SSSX"))
PatientDS.printSchema()
// PatientDF declares the watermark and the windowed aggregation
// (see the sketch after this listing)
PatientDF.printSchema()
PatientDF.writeStream
.outputMode("complete")
.format("console")
.option("truncate", false)
.start()
.awaitTermination()
} catch {
case e: IOException => println("IOException occurred")
case t: Throwable => println("Error receiving data: " + t)
} finally {
println("In finally block")
}
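As in the previous listings, PatientDF is the aggregation built on top of PatientDS; here it also declares the watermark on the Fecha column. A minimal sketch, assuming the same session window with a ten-second gap keyed by DID:
val PatientDF = PatientDS
  // Accept events arriving up to 30 seconds late
  .withWatermark("Fecha", "30 seconds")
  .groupBy(session_window(col("Fecha"), "10 seconds"), col("DID"))
  .count()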
There is another important part of the preceding code snippet you should pay attention to: the Fecha column is converted to a timestamp with to_timestamp(), so it can be used for event-time operations.
PatientDS.printSchema()
root
|-- NSS: string (nullable = true)
|-- Nom: string (nullable = true)
|-- DID: integer (nullable = true)
|-- DNom: string (nullable = true)
|-- Fecha: timestamp (nullable = true)
-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
-------------------------------------------
Batch: 5
-------------------------------------------
+--------------------------------------------------+---+-----+
|session_window |DID|count|
+--------------------------------------------------+---+-----+
|{2023-02-23 01:02:00.002, 2023-02-23 01:02:10.002}|20 |1 |
|{2023-02-23 01:01:34.002, 2023-02-23 01:01:44.002}|20 |1 |
|{2023-02-23 01:02:05.002, 2023-02-23 01:02:15.002}|10 |1 |
|{2023-02-23 01:02:20.002, 2023-02-23 01:02:30.002}|50 |1 |
|{2023-02-23 01:02:38.002, 2023-02-23 01:02:48.002}|50 |1 |
|{2023-02-23 01:01:30.002, 2023-02-23 01:01:43.002}|50 |4 |
|{2023-02-23 01:02:37.002, 2023-02-23 01:02:47.002}|30 |1 |
|{2023-02-23 01:02:10.002, 2023-02-23 01:02:20.002}|30 |1 |
+--------------------------------------------------+---+-----+
9.7 Summary
In this chapter, we covered the different Event-Time Window Operations and Watermarking with Apache Spark. First, we studied how to perform streaming aggregations with tumbling and sliding windows, the two types of fixed-size window operations. After that, we learned how to implement a session window and how to use the new Spark built-in function session_window to create a window column. Special attention was paid to the session window with dynamic gap duration, which adapts the window length as a function of the input data. Finally, we covered Watermarking in Spark Structured Streaming and how it can be used to manage late-arriving events. In the next and final chapter, we are going to explore future directions for Spark Streaming.
Footnotes
1 More information: www.databricks.com/blog/2021/10/12/native-support-of-
session-window-in-spark-structured-streaming.html
2 https://spark.apache.org/docs/latest/structured-streaming-programming-
guide.html#types-of-time-windows
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2023
A. Antolínez García, Hands-on Guide to Apache Spark 3
https://doi.org/10.1007/978-1-4842-9380-5_10
The column "output" represents the dependent variable, and as you can see, it can only take two possible values: 0 and 1. Therefore, we are dealing with a binary classification problem. For that reason, we can implement a logistic regression model, as it is suitable for predicting probabilities.
import org.apache.spark.sql.types.{StructType, LongType}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler, MinMaxScaler, StringIndexer}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
spark.sparkContext.setLogLevel("ERROR")
When you run this code, for instance, in a notebook, you will find the following output (the number of records followed by the DataFrame schema).
303
root
|-- age: long (nullable = true)
|-- sex: long (nullable = true)
|-- cp: long (nullable = true)
|-- trtbps: long (nullable = true)
|-- chol: long (nullable = true)
|-- fbs: long (nullable = true)
|-- restecg: long (nullable = true)
|-- thalachh: long (nullable = true)
|-- exng: long (nullable = true)
|-- oldpeak: long (nullable = true)
|-- slp: long (nullable = true)
|-- caa: long (nullable = true)
|-- thall: long (nullable = true)
|-- label: long (nullable = true)
In the preceding output, you can see the schema of the DataFrame and the columns' data types.
A very important step when working with data is the process of data
engineering and feature engineering. As part of the data engineering process, it is
always recommended to check the existence of NULL values in our dataset.
If inadvertently you process a dataset with NULL values, at best you will
receive an error and understand something is wrong with the data, and at worst
you will get inaccurate results.
In our dataset, if you check the "oldpeak" column by running the following line of code, you will find there are 173 NULL values:
heartdF.filter("oldpeak is null").count
res2: Long = 173
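The data is then split into training and test subsets. A minimal sketch of such a split (the trainDF and testDF names are the ones used in the rest of the chapter; the seed is arbitrary):
val Array(trainDF, testDF) = heartdF.randomSplit(Array(0.8, 0.2), seed = 42)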
The previous line of code will randomly split the data in an 80%–20% proportion. Eighty percent of the data will be used to train our PipelineModel, and the other 20% (the unseen data) will be used to test it.
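The PipelineModel mentioned above is the result of fitting an ML pipeline on the training split. A minimal sketch of such a pipeline, built with the transformers imported earlier (the exact stages and parameters are an assumption), could look like this:
// Numeric columns taken from the schema printed above
val featureCols = Array("age", "sex", "cp", "trtbps", "chol", "fbs", "restecg",
  "thalachh", "exng", "oldpeak", "slp", "caa", "thall")

val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("assembled")
  .setHandleInvalid("skip")      // skip rows containing NULL values

val scaler = new MinMaxScaler()
  .setInputCol("assembled")
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(assembler, scaler, lr))
// The fitted model; the rest of the chapter refers to it as PipelineModel
val PipelineModel = pipeline.fit(trainDF)
val trainingPred = PipelineModel.transform(trainDF)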
trainingPred.select("label","probability","prediction").show(truncate = false)
If you execute the preceding piece of code in your notebook, you will get an output pretty similar to the next one.
+-----+------------------------------------------+----------+
|label|probability |prediction|
+-----+------------------------------------------+----------+
|1 |[0.03400091691592197,0.965999083084078] |1.0 |
|1 |[0.05511659822191829,0.9448834017780817] |1.0 |
|0 |[0.5605994301074364,0.4394005698925636] |0.0 |
|1 |[0.03115074381750154,0.9688492561824985] |1.0 |
|1 |[0.004384634167846924,0.995615365832153] |1.0 |
|1 |[0.08773404036960819,0.9122659596303918] |1.0 |
|1 |[0.08773404036960819,0.9122659596303918] |1.0 |
|1 |[0.06985863429068614,0.9301413657093138] |1.0 |
|0 |[0.7286457381073151,0.27135426189268486] |0.0 |
|1 |[0.02996587703476992,0.9700341229652301] |1.0 |
|1 |[0.0016700146317826447,0.9983299853682174]|1.0 |
|0 |[0.36683434534535186,0.6331656546546481] |1.0 |
|1 |[0.04507024193962369,0.9549297580603763] |1.0 |
|1 |[0.013996165515300337,0.9860038344846996] |1.0 |
|1 |[0.016828318827434772,0.9831716811725653] |1.0 |
|1 |[0.2671331307894787,0.7328668692105214] |1.0 |
|1 |[0.32331781956753536,0.6766821804324646] |1.0 |
|1 |[0.09759145569985764,0.9024085443001424] |1.0 |
|1 |[0.032829375720753985,0.967170624279246] |1.0 |
|0 |[0.8584162531850159,0.1415837468149841] |0.0 |
+-----+------------------------------------------+----------+
only showing top 20 rows
If you pay attention to the previous outcome, you will see that the probability column holds a vector with the probabilities of class 0 and class 1: the lower its first element, the more likely the prediction is to be 1, while, on the other hand, the higher it is, the more likely the prediction is to be 0.
One line of the previous code you should pay attention to is this one:
.setHandleInvalid("skip")
If you remember, our dataset has columns with NULL values. If you do not take care of them, you will receive an error. The previous line of code skips the rows containing NULL values.
Once we have trained our model, we are going to divide our test dataset (testDF) into multiple files to simulate a streaming data flow. Then, we are going to set up a file data source and copy each individual file to the source folder, as we did in previous chapters to simulate a stream of information.
Next is the code to divide testDF into ten partitions (individual files) and write them to the /tmp/spark_ml_streaming/ directory.
testDF.repartition(10)
.write.format("csv")
.option("header", true)
.mode("overwrite")
.save("file:///tmp/spark_ml_streaming/")
After executing the previous code snippet, if you have a look at the
designated source directory, you will find something similar to this:
$ tree /tmp/spark_ml_streaming/
/tmp/spark_ml_streaming/
├── part-00000-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00001-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00002-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00003-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00004-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00005-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00006-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00007-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00008-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
├── part-00009-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv
└── _SUCCESS
Next, we have to create the streaming source to load the files from the data
source as soon as they appear in the directory.
val streamingSource=spark
.readStream
.format("csv")
.option("header",true)
.schema(schema)
.option("ignoreLeadingWhiteSpace",true)
.option("mode","dropMalformed")
.option("maxFilesPerTrigger",1)
.load("file:///tmp/HeartTest/")
.withColumnRenamed("output","label")
We have to control the quality of the data that is injected into the model; that is why we have included the following lines:
.option("ignoreLeadingWhiteSpace",true)
.option("mode","dropMalformed")
to be sure that unnecessary white spaces and malformed rows do not reach the model.
We have also added the line
.option("maxFilesPerTrigger",1)
so that at most one file is processed per micro-batch, simulating a continuous arrival of data.
val streamingHeart = PipelineModel.transform(streamingSource).select("label","probability","prediction")
streamingHeart.writeStream
.outputMode("append")
.option("truncate", false)
.format("console")
.start()
.awaitTermination()
Now, execute the preceding code snippet and copy the partitioned files to the data source directory. For example
$ cp part-00000-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv /tmp/HeartTest/
$ cp part-00001-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv /tmp/HeartTest/
...
$ cp part-00009-2c24d64a-2ecd-4674-a394-44aa5e17f131-c000.csv /tmp/HeartTest/
You will see an output similar to the next one, coming out of your program.
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|0 |[0.7464870545074516,0.25351294549254844]|0.0 |
|1 |[0.1632367041842738,0.8367632958157262] |1.0 |
+-----+----------------------------------------+----------+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+-----------------------------------------+----------+
|label|probability |prediction|
+-----+-----------------------------------------+----------+
|0 |[0.9951659487928823,0.004834051207117662]|0.0 |
|0 |[0.9929886660069713,0.007011333993028668]|0.0 |
+-----+-----------------------------------------+----------+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+-----------------------------------------+----------+
|label|probability |prediction|
+-----+-----------------------------------------+----------+
|0 |[0.6601488743972465,0.33985112560275355] |0.0 |
|0 |[0.9885105583774811,0.011489441622518859]|0.0 |
|1 |[0.004729033461790646,0.9952709665382093]|1.0 |
|1 |[0.002543643876197849,0.9974563561238021]|1.0 |
+-----+-----------------------------------------+----------+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|1 |[0.23870496408150266,0.7612950359184973]|1.0 |
|0 |[0.8285765606366566,0.17142343936334337]|0.0 |
|1 |[0.1123278992547269,0.8876721007452731] |1.0 |
+-----+----------------------------------------+----------+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----+-----------------------------------------+----------+
|label|probability |prediction|
+-----+-----------------------------------------+----------+
|1 |[0.3811392681451562,0.6188607318548438] |1.0 |
|1 |[0.016044469761318698,0.9839555302386813]|1.0 |
|1 |[0.011124987326959632,0.9888750126730403]|1.0 |
|0 |[0.009425069592366693,0.9905749304076333]|1.0 |
+-----+-----------------------------------------+----------+
-------------------------------------------
Batch: 5
-------------------------------------------
+-----+-----------------------------------------+----------+
|label|probability |prediction|
+-----+-----------------------------------------+----------+
|1 |[0.030581176663381764,0.9694188233366182]|1.0 |
|1 |[0.028952221072329157,0.9710477789276708]|1.0 |
|0 |[0.7251959061823547,0.27480409381764526] |0.0 |
+-----+-----------------------------------------+----------+
-------------------------------------------
Batch: 6
-------------------------------------------
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|1 |[0.3242653848343221,0.6757346151656779] |1.0 |
|0 |[0.9101196538221397,0.08988034617786034]|0.0 |
|1 |[0.08227291309126751,0.9177270869087325]|1.0 |
+-----+----------------------------------------+----------+
-------------------------------------------
Batch: 7
-------------------------------------------
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|1 |[0.09475287521715883,0.9052471247828412]|1.0 |
+-----+----------------------------------------+----------+
-------------------------------------------
Batch: 8
-------------------------------------------
+-----+----------------------------------------+----------+
|label|probability |prediction|
+-----+----------------------------------------+----------+
|1 |[0.8256079035149502,0.17439209648504983]|0.0 |
|0 |[0.31539711793989017,0.6846028820601098]|1.0 |
|0 |[0.9889473486170233,0.01105265138297673]|0.0 |
|1 |[0.12416982209602322,0.8758301779039768]|1.0 |
+-----+----------------------------------------+----------+
When developing a machine learning (ML) model, it is always essential to
find out whether it accurately measures what it sets out to measure.
Next, we are going to introduce a small variation in our example code to
show you how to assess the accuracy of a pipeline model through the
measurement of its sensitivity and specificity.
val streamingHeart = PipelineModel
  .transform(streamingSource)
  .select("label", "probability", "prediction")
streamingHeart.writeStream
  .outputMode("append")
  .option("truncate", false)
  .format("console")
  .start()
  .awaitTermination()
val streamingRates = PipelineModel
  .transform(streamingSource)
  .groupBy('label)
  .agg(
    (sum(when('prediction === 'label, 1)) /
      count('label)).alias("true prediction rate"),
    count('label).alias("count")
  )
streamingRates.writeStream
  .outputMode("complete")
  .option("truncate", false)
  .format("console")
  .start()
  .awaitTermination()
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.5 |2 |
|1 |0.5 |2 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 1
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.6666666666666666 |3 |
|1 |0.6666666666666666 |3 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 2
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.6666666666666666 |3 |
|1 |0.7142857142857143 |7 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.75 |4 |
|1 |0.75 |8 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 4
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.8 |5 |
|1 |0.7777777777777778 |9 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.8 |5 |
|1 |0.6363636363636364 |11 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 6
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.8 |5 |
|1 |0.7333333333333333 |15 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 7
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.7142857142857143 |7 |
|1 |0.75 |16 |
+-----+--------------------+-----+
-------------------------------------------
Batch: 8
-------------------------------------------
+-----+--------------------+-----+
|label|true prediction rate|count|
+-----+--------------------+-----+
|0 |0.7777777777777778 |9 |
|1 |0.7647058823529411 |17 |
+-----+--------------------+-----+
As you can see, the rates of true positive and true negative predictions are
continuously updated as the data comes in. The true prediction rate is nothing
to write home about, because we are using a very small dataset that, to make
things worse, contained NULL values which had to be discarded.
One of the main drawbacks of logistic regression is that it needs large
datasets to really extract reliable insights from the data.
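To make the connection with sensitivity and specificity explicit, the following is a
minimal sketch, not taken from the book's code, that computes both metrics over a static
DataFrame of predictions. It assumes a hypothetical DataFrame named predictionsDF
holding numeric label and prediction columns, for instance the output of
PipelineModel.transform() applied to a static test set:

import org.apache.spark.sql.functions.{col, sum, when}

// Build a 2x2 confusion matrix in a single aggregation.
val m = predictionsDF.agg(
  sum(when(col("label") === 1 && col("prediction") === 1, 1).otherwise(0)).cast("double").alias("tp"),
  sum(when(col("label") === 0 && col("prediction") === 0, 1).otherwise(0)).cast("double").alias("tn"),
  sum(when(col("label") === 0 && col("prediction") === 1, 1).otherwise(0)).cast("double").alias("fp"),
  sum(when(col("label") === 1 && col("prediction") === 0, 1).otherwise(0)).cast("double").alias("fn")
).first()

val (tp, tn, fp, fn) = (m.getDouble(0), m.getDouble(1), m.getDouble(2), m.getDouble(3))
println(s"sensitivity (true positive rate) = ${tp / (tp + fn)}")
println(s"specificity (true negative rate) = ${tn / (tn + fp)}")

Sensitivity corresponds to the true prediction rate reported for label 1 above, and
specificity corresponds to the rate reported for label 0.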
If you want to dig deeper into how to use Spark ML with Spark Structured
Streaming, you can find a complete streaming pipeline example by following this link.
In the next section, we are going to look at some long-awaited Spark Streaming
features that are already available.
What Is RocksDB?
RocksDB is an embeddable persistent key-value store for fast storage based on
three basic structures: memtable, sstfile, and logfile.
RocksDB includes the following main features:
It uses a log-structured database engine.
It is optimized for storing small to medium-size key-values, though keys and
values are arbitrarily sized byte streams.
It is optimized for fast, low-latency storage such as flash drives and high-
speed disk drives, delivering high read/write performance.
It works on multicore processors.
Apart from Spark Streaming, RocksDB is also used as a state backend by
other state-of-the-art streaming frameworks such as Apache Flink and Kafka
Streams, which use RocksDB to maintain local state on a computing node.
If you want to incorporate RocksDB into your Spark cluster, set the state store
provider class as follows:
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
BlockBasedTableConfig
BlockBasedTable is RocksDB's default SST file format. BlockBasedTableConfig
holds the configuration of block-based tables stored in the SST format, and
RocksDB creates a default BlockBasedTableConfig instance when it is initialized.
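Spark exposes a few of these block-based table parameters as state store options.
The following lines are a minimal sketch based on the option names documented in the
Spark Structured Streaming guide for the RocksDB state store (blockSizeKB and
blockCacheSizeMB); treat the names and the chosen values as assumptions to verify
against the Spark version you run:

// Hypothetical tuning values; verify option availability for your Spark version.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockSizeKB", "32")      // SST block size
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "64") // block cache size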
RocksDB Possible Performance Degradation
With this option enabled, RocksDB performs extra lookups on write operations in
order to track changes to the total number of rows, which introduces overhead on
massive write workloads. The extra reads are needed because, to keep the count
accurate, RocksDB must know whether each write inserts a new key-value pair,
updates an existing value, or removes a key. Thus, be advised that turning it on
can degrade the system's performance.
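In recent Spark releases, this row-tracking behavior appears to be controlled by the
trackTotalNumberOfRows state store option documented in the Spark Structured
Streaming guide; treat the exact name as an assumption for your version. A minimal
sketch disabling it for write-heavy workloads:

// Disable row-count tracking to reduce write overhead (assumed option name).
spark.conf.set(
  "spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows",
  "false")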
Wrapping up, RocksDB is able to achieve very high performance. It has a flexible,
tunable architecture with many settings that can be tweaked to adapt it to
different production environments and available hardware, including in-memory
storage, flash memory, commodity hard disks, HDFS file systems, etc.
RocksDB supports advanced database operations such as merging,
compaction filters, and SNAPSHOT isolation level. On the other hand,
RocksDB does not support some database features such as joins, query
compilation, or stored procedures.
10.4 Summary
In this chapter, we discussed the capabilities of Spark Structured Streaming when
coupled with Spark ML to perform real-time predictions. This is one of the most
relevant areas the Apache Spark community is expected to keep improving, as more
business applications require in-motion data analytics to trigger prompt reactions.
After that, we discussed the advantages of the new RocksDB state store and
finished with one of the most anticipated Spark Streaming milestones, Project
Lightspeed, which will drive Spark Structured Streaming into the real-time era.
Footnotes
1 https://spark.apache.org/news/
Bibliography
Akidau, T., Chernyak, S., & Lax, R. (2018). Streaming Systems: The What,
Where, When, and How of Large-Scale Data Processing (first edition).
O’Reilly Media.
Chambers, B., & Zaharia, M. (2018). Spark: The Definitive Guide: Big Data
Processing Made Simple (first edition). O’Reilly Media.
Chellappan, S., & Ganesan, D. (2018). Practical Apache Spark: Using the
Scala API (first edition). Apress.
Damji, J., Wenig, B., Das, T., & Lee, D. (2020). Learning Spark: Lightning-
Fast Data Analytics (second edition). O’Reilly Media.
Elahi, I. (2019). Scala Programming for Big Data Analytics: Get Started with
Big Data Analytics Using Apache Spark (first edition). Apress.
Haines, S. (2022). Modern Data Engineering with Apache Spark: A Hands-On
Guide for Building Mission-Critical Streaming Applications. Apress.
Introducing Native Support for Session Windows in Spark Structured
Streaming. (October 12, 2021). Databricks.
www.databricks.com/blog/2021/10/12/native-support-
of-session-window-in-spark-structured-streaming.html
Kakarla, R., Krishnan, S., & Alla, S. (2020). Applied Data Science Using
PySpark: Learn the End-to-End Predictive Model-Building Cycle (first
edition). Apress.
Karau, H., & Warren, R. (2017). High Performance Spark: Best Practices for
Scaling and Optimizing Apache Spark (first edition). O’Reilly Media.
Kukreja, M., & Zburivsky, D. (2021). Data Engineering with Apache Spark,
Delta Lake, and Lakehouse: Create Scalable Pipelines That Ingest, Curate,
and Aggregate Complex Data in a Timely and Secure Way (first edition).
Packt Publishing.
Lee, D., & Drabas, T. (2018). PySpark Cookbook: Over 60 Recipes for
Implementing Big Data Processing and Analytics Using Apache Spark and
Python (first edition). Packt Publishing.
Luu, H. (2021). Beginning Apache Spark 3: With DataFrame, Spark SQL,
Structured Streaming, and Spark Machine Learning Library (second edition).
Apress.
Maas, G., & Garillot, F. (2019). Stream Processing with Apache Spark:
Mastering Structured Streaming and Spark Streaming (first edition). O’Reilly
Media.
MLlib: Main Guide—Spark 3.3.2 Documentation. (n.d.). Retrieved April 5,
2023, from https://spark.apache.org/docs/latest/ml-
guide.html
Nabi, Z. (2016). Pro Spark Streaming: The Zen of Real-Time Analytics Using
Apache Spark (first edition). Apress.
Nudurupati, S. (2021). Essential PySpark for Scalable Data Analytics: A
Beginner’s Guide to Harnessing the Power and Ease of PySpark 3 (first
edition). Packt Publishing.
Overview—Spark 3.3.2 Documentation. (n.d.). Retrieved April 5, 2023, from
https://spark.apache.org/docs/latest/
Perrin, J.-G. (2020). Spark in Action: Covers Apache Spark 3 with Examples
in Java, Python, and Scala (second edition). Manning.
Project Lightspeed: Faster and Simpler Stream Processing with Apache
Spark. (June 28, 2022). Databricks.
www.databricks.com/blog/2022/06/28/project-
lightspeed-faster-and-simpler-stream-processing-
with-apache-spark.html
Psaltis, A. (2017). Streaming Data: Understanding the real-time pipeline (first
edition). Manning.
RDD Programming Guide—Spark 3.3.2 Documentation. (n.d.). Retrieved
April 5, 2023, from
https://spark.apache.org/docs/latest/rdd-
programming-guide.html
Ryza, S., Laserson, U., Owen, S., & Wills, J. (2017). Advanced Analytics with
Spark: Patterns for Learning from Data at Scale (second edition). O’Reilly
Media.
Spark SQL and DataFrames—Spark 3.3.2 Documentation. (n.d.). Retrieved
April 5, 2023, from
https://spark.apache.org/docs/latest/sql-
programming-guide.html
Spark Streaming—Spark 3.3.2 Documentation. (n.d.). Retrieved April 5, 2023,
from https://spark.apache.org/docs/latest/streaming-
programming-guide.html
Structured Streaming Programming Guide—Spark 3.3.2 Documentation.
(n.d.). Retrieved April 5, 2023, from
https://spark.apache.org/docs/latest/structured-
streaming-programming-guide.html
Tandon, A., Ryza, S., Laserson, U., Owen, S., & Wills, J. (2022). Advanced
Analytics with PySpark (first edition). O’Reilly Media.
Wampler, D. (2021). Programming Scala: Scalability = Functional
Programming + Objects (third edition). O’Reilly Media.
Index
A
Accumulators
Adaptive Query Execution (AQE)
cost-based optimization framework
features
join strategies, runtime replanning
partition number, data dependent
spark.sql.adaptive.enabled configuration
SQL plan
unevenly distributed data joins
Apache Parquet
Apache Spark
APIs
batch vs. streaming data
dataframe/datasets
definition
execution model
fault tolerance
GraphX
scalable
streaming
unified API
Apache Spark Structured Streaming
Application programming interface (API)
.appName() method
Artificial intelligence (AI)
Association for Computing Machinery (ACM)
B
Batch data processing
Broadcast variables
C
Cache()
Checkpointing streaming
example
failures
features
state data representation
stateful
state storage
collect() method
countByValue() method
count() function
createDataFrame() method
Customer Relationship Management (CRM)
customWriterToConsole() function
D
Data analytics
DataFrame API
Apache Parquet
attributes
complex nested JSON structures
CSV files
dataset reversion
data sources
data structure
direct queries, parquet files
file partitioning
Hive tables
JDBC connector
JSON file, read/write
JSON files
JSON files, customized schemas
JSON files, save
methods
direct queries, parquet files
saving/data compression, parquet file format
querying files, using SQL
reading multiple JSON files
reading parquet file partitions
read.json()
row-type object
saving/data compression, parquet file format
saving modes
SpanishWriters
Spark SQL module
time-based paths
toDF()
See also DataFrames
DataFrames
column name notations
drop()
filtering
join()
multi-condition filtering, logical operators
name patterns, select columns
select ()
SQL case statement
UDF
union()
withColumn()
withColumnRenamed()
Dataset API
creating methods
RDD features
type row
Data sinks
built-in output sinks
writing streaming data
console sink
file
ForeachBatch
Foreach sink
Kafka
MongoDB
Data sources
built-in sources
file data, reading streaming data
Kafka
MongoDB, data streams read
testing
Data streaming
characteristics
data faults/stragglers
timestamps
Directed Acyclic Graph (DAG)
Discretized Stream (DStream)
advanced sources
input sources
data analytics, streaming transformations
file system
running file system
running socket
socket connection
TCP/IP
Discretized Streams (DStreams)
Download/installation instructions, Spark
configuring environment variables
file integrity
Linux system
page
Windows
drop() function
Dynamic gap duration
E
Enterprise Resource Planning (ERP)
Event-driven architectures
Event-time window operations
real-time event
sliding windows
temporal windows
tumbling/nonoverlapping windows
Event-Time Window Operations and Watermarking
F
Fault-tolerant file systems
filter()
flatMap() function
foldByKey() aggregation
ForeachBatch sink
G
getActiveSession() method
getOrCreate()
GraphX
groupByKey() function
H
Hadoop
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
High-level APIs
dataframes
datasets
structured data
Hospital Queue Management System
I
Internet of things (IoT)
isAdult() function
J
Java Virtual Machine (JVM)
join() method
K
Kafka
kafka-topics.sh command
Kappa
Key-value pair
creating pair RDDs
definition
distinct keys, pair RDDs
Kryo
L
Lambda architecture
Apache Storm
characteristics
lanes
layers
pros/cons
real-time stream/batch data processing
Lazy evaluation
Least Recently Used (LRU) Page Replacement algorithm
leftOuterJoin()
like() function
Logistic regression
classification/predictive analytics
data source
supervised machine learning
types
use cases
create streaming source
dataset NULL values
data source
loading data
PipelineModel
run code
testDF
Low-level API/Spark Core API
See also Resilient Distributed Dataset (RDD)
M
Machine learning (ML)
--master option
Mesos/Hadoop YARN
Message delivery reliability mechanisms
Microservice processes orchestration
ML algorithms
MLlib
MongoDB
N, O
NamedTemporaryFile() function
Netcat
newSession() method
No-time-based aggregations
P, Q
Pair RDDs transformations
adding values
combining data
countByKey() counts
countByValue()
custom aggregation functions
grouping data
joins
key function, sorting
lineage
Neutral ZeroValue, merging values
returns key-value pairs, dictionary
save RDD as text file
sortByKey()
sorting data
values associated with key
parquet() method
partitionBy() method
Persist()
Project Lightspeed
connectors
next-gen Spark Streaming engine
operations/troubleshooting, improve
predictable low latency
streaming functionalities
R
Read-Eval-Print Loop (REPL) shell interface
read.json() method
readStream() method
Real-time data processing architectures
Real-time event stream processing
reduceByKey() function
Regression model
repartition() method
Resilient Distributed Datasets (RDDs)
creating new elements
external Hadoop source
fault-tolerant entities
immutable
logical partitions
parallelized collections
rightOuterJoin()
RocksDB
S
saveToCSV() function
select() function
Session windows
dynamic gap duration
gap interval duration
JSON files
session_window()
window length
Shared variables
accumulators
broadcast
definition
Sliding windows
five seconds sliding interval
fixed-sized
JSON files
sliding offset, five seconds
Social Security Number (SSN)
Spark
Spark 3.3.2 version
HDFS
RocksDB
BlockBasedTable
definition
features
performance
state store parameters
stateful operations
Spark application model
Spark cluster model
SparkContext.objectFile method
SparkContext.parallelize() function
SparkContext’s textFile method
Spark Core
Spark execution model
Spark high-level API
Spark Machine Learning (Spark ML)
dataset
logistic regression
PipelineModel
sensitivity/specificity
SparkR
SparkSession
access existing SparkSession
catalog metadata
configuration parameters
create programs
shell command line
definition
SparkSession.builder() constructor
SparkSession.readStream() method
Spark shell
command-line tools
pyspark
Scala
Scala command line
triggering actions
spark.sql.files.ignoreCorruptFiles method
spark.sql.files.ignoreMissingFiles method
Spark Structured Streaming
Spark-submit command
application properties
cluster managers
configurations
control application properties
deployment modes
dynamic allocation
example
Functions.scala
options
resources
runtime environment
Stateful streaming aggregations
functions
no-time-based
time-based
Stream data processing
Streaming
bound/unbound data
challenges
data sources/data sinks
DStream
DStreams, transformations
features
graceful shutdown
Kappa architecture
Lambda architecture
uncertainty component
Streaming sources
See Data sources
Structured streaming
DataFrames/datasets API
datasources
file systems
running file systems locally
running socket application
socket
data transformations
stateless/stateful operations
input table
output mode
result table
vs. Spark streaming
stateful stream
stateless state
Supervised machine learning models
T
take() method
textFile() method
Time-based aggregations
toDS() method
Transformations, RDD/dataframe
DAG
example
narrow transformation
wide transformation
Tumbling/nonoverlapping windows
U, V
union() method
unionByName() function
unpersist()
User-defined functions
W
Watermarking
example
JSON files
late-arrival events
stateful streaming operations
timestamp field
timestamp/window columns
whether() function
wholeTextFiles() method
withColumn() transformation function
withColumnRenamed() transformation function
writeStream method
X, Y
$SPARK_HOME/bin directory
Z
ZooKeeper/KRaft