Chapter 2
Deploying Spark
In This Chapter:
Overview of the different Spark deployment modes
How to install Spark
The contents of a Spark installation
Overview of the various methods available for deploying Spark in the
cloud
This chapter covers the basics of how Spark is deployed, how to install Spark,
and how to get Spark clusters up and running. It discusses the various
deployment modes and schedulers available for Spark clusters, as well as options
for deploying Spark in the cloud. If you complete the installation exercises in
this chapter, you will have a fully functional Spark programming and runtime
environment that you can use for the remainder of the book.
Local Mode
Local mode allows all Spark processes to run on a single machine, optionally
using any number of cores on the local system. Local mode is often the quickest
way to test a new Spark installation, and it is well suited to running Spark
routines against small datasets.
Listing 2.1 shows an example of submitting a Spark job in Local mode.
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/examples/jars/spark-examples*.jar 10
You specify the number of cores to use in Local mode by supplying the number
in brackets after the local directive. For instance, to use two cores, you specify
local[2]; to use all the cores on the system, you specify local[*].
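For example, the following variation of Listing 2.1 runs the same Pi Estimator
example using all available local cores:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[*] \
$SPARK_HOME/examples/jars/spark-examples*.jar 10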
When running Spark in Local mode, you can access any data on the local
filesystem as well as data from HDFS, S3, or other filesystems, assuming that
you have the appropriate configuration and libraries available on the local
system.
Although Local mode allows you to get up and running quickly, it is limited in
its scalability and effectiveness for production use cases.
Spark Standalone
Spark Standalone refers to the built-in, or “standalone,” scheduler. We will look
at the function of a scheduler, or cluster manager, in more detail in Chapter 3.
The term standalone can be confusing because it has nothing to do with the
cluster topology, as the name might suggest. For instance, you can have a Spark
deployment in Standalone mode on a fully distributed, multi-node cluster; in this
case, Standalone simply means that Spark does not need an external scheduler.
Multiple host processes, or services, run in a Spark Standalone cluster, and each
service plays a role in the planning, orchestration, and management of a given
Spark application running on the cluster. Figure 2.1 shows a fully distributed
Spark Standalone reference cluster topology. (Chapter 3 details the functions that
these services provide.) Listing 2.2 shows an example of submitting a Spark job
to a Spark Standalone cluster.
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://mysparkmaster:7077 \
$SPARK_HOME/examples/jars/spark-examples*.jar 10
With Spark Standalone, you can get up and running quickly with few
dependencies or environmental considerations. Each Spark release includes
everything you need to get started, including the binaries and configuration files
for any host to assume any specified role in a Spark Standalone cluster. Later in
this chapter you will deploy your first cluster in Spark Standalone mode.
Spark on YARN
As introduced in Chapter 1, “Introducing Big Data, Hadoop, and Spark,” the
most common deployment method for Spark is using the YARN resource
management framework provided with Hadoop. Recall that YARN is the
Hadoop core component that allows you to schedule and manage workloads on a
Hadoop cluster.
According to a Databricks annual survey (see
https://databricks.com/resources/type/infographic-surveys), YARN and
standalone are neck and neck, with Mesos trailing behind.
As first-class citizens in the Hadoop ecosystem, Spark applications can be easily
submitted and managed with minimal incremental effort. Spark processes such
as the Driver, Master, and Executors (covered in Chapter 3) are hosted or
facilitated by YARN processes such as the ResourceManager, NodeManager,
and ApplicationMaster.
The spark-submit, pyspark, and spark-shell programs accept
command line arguments for submitting Spark applications to YARN clusters.
Listing 2.3 provides an example of this.
Listing 2.3 Submitting a Spark Job to a YARN Cluster
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
$SPARK_HOME/examples/jars/spark-examples*.jar 10
There are two cluster deployment modes when using YARN as a scheduler:
cluster and client. We will distinguish between the two in Chapter 3 when
we look at the runtime architecture for Spark.
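As a quick point of comparison with Listing 2.3, the only change needed to
submit the same application in client mode is the --deploy-mode argument:
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
$SPARK_HOME/examples/jars/spark-examples*.jar 10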
Spark on Mesos
Apache Mesos is an open source cluster manager developed at the University of
California, Berkeley; it shares a common lineage with Spark, which originated at
the same institution.
Mesos is capable of scheduling different types of applications, offering fine-
grained resource sharing that results in more efficient cluster utilization. Listing
2.4 shows an example of a Spark application submitted to a Mesos cluster.
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master mesos://mesosdispatcher:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
$SPARK_HOME/examples/jars/spark-examples*.jar 1000
This book focuses on the more common schedulers for Spark: Spark Standalone
and YARN. However, if you are interested in Mesos, a good place to start is
http://mesos.apache.org.
Preparing to Install Spark
Spark is a cross-platform application that can be deployed on the following
operating systems:
Linux (all distributions)
Windows
Mac OS X
Getting Spark
Using a Spark release is often the easiest way to install Spark on a given system.
Spark releases are downloadable from http://spark.apache.org/downloads.html.
These releases are cross-platform: They target a JVM environment, which is
platform agnostic.
Using the build instructions provided on the official Spark website, you could
also download the source code for Spark and build it yourself for your target
platform. This method is more complicated, however.
If you download a Spark release, you should select the builds with Hadoop, as
shown in Figure 2.2. The “with Hadoop” Spark releases do not actually include
Hadoop, as the name may imply. These releases simply include libraries to
integrate with the Hadoop clusters and distributions listed. Many of the Hadoop
classes are required, regardless of whether you are using Hadoop with Spark.
2. Get Spark. Download a release of Spark using wget and the appropriate URL;
you can obtain the actual download address from the
http://spark.apache.org/downloads.html page shown in Figure 2.2.
Although there is likely to be a later release available to you by the time you
read this book, the following example shows a download of release 2.2.0:
$ wget https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
3. Unpack the Spark release. Unpack the Spark release and move it into a
shared directory, such as /opt/spark:
$ tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.2.0-bin-hadoop2.7 /opt/spark
You may wish to set these environment variables on a persistent or permanent
basis (for example, using /etc/environment on an Ubuntu instance).
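For example, assuming the /opt/spark location used in the previous step, you
could add lines such as the following to your ~/.bashrc (or, without the export
keyword, to /etc/environment):
# Example ~/.bashrc entries; for /etc/environment, omit "export"
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH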
5. Test the installation. Test the Spark installation by running the built-in Pi
Estimator example in Local mode, as follows:
$ spark-submit --class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/examples/jars/spark-examples*.jar 1000
If successful, you should see output similar to the following among a large
amount of informational log messages (which you will learn how to
minimize later in this chapter):
Pi is roughly 3.1414961114149613
You can test the interactive shells, pyspark and spark-shell, at the
terminal prompt as well.
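For instance, the following commands (assuming SPARK_HOME is set as
described previously) start each shell against the local master; press Ctrl+D to
exit a shell:
$ $SPARK_HOME/bin/pyspark --master local[*]
$ $SPARK_HOME/bin/spark-shell --master local[*]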
Congratulations! You have just successfully installed and tested Spark on Linux.
How easy was that?
If you are using Windows PowerShell, you can enter the following
equivalent command:
PS C:\>[Environment]::SetEnvironmentVariable("_JAVA_OPTIONS",
"-Djava.net.preferIPv4Stack=true", "User")
Note that you need to run these commands as a local administrator. For
simplicity, this example shows applying all configuration settings at a user
level. However, you can instead choose to apply any of the settings shown at
a machine level—for instance, if you have multiple users on a system.
Consult the documentation for Microsoft Windows for more information
about this.
7. Set the necessary environment variables. Set the HADOOP_HOME
environment variable by running the following command at the Windows
command prompt:
C:\> setx HADOOP_HOME C:\Hadoop
8. Set up the local metastore. You need to create a location and set the
appropriate permissions to a local metastore. We discuss the role of the
metastore specifically in Chapter 6, “SQL and NoSQL Programming with
Spark,” when we begin to look at Spark SQL. For now, just run the
following commands from the Windows or PowerShell command prompt:
C:\> mkdir C:\tmp\hive
C:\> C:\Hadoop\bin\winutils.exe chmod 777 /tmp/hive
9. Test the installation. Open a Windows command prompt or PowerShell
session and change directories to the bin directory of your Spark
installation, as follows:
C:\> cd C:\Spark\bin
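From here you can run the same Pi Estimator example used in the Linux
exercise. The command below is a sketch only: the relative jar path and the jar
file name (spark-examples_2.11-2.2.0.jar) are assumptions based on a Spark
2.2.0 release unpacked to C:\Spark, so adjust them to match your installation:
C:\Spark\bin> spark-submit --class org.apache.spark.examples.SparkPi --master local ..\examples\jars\spark-examples_2.11-2.2.0.jar 100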
Figure 2.3 shows an example of what you should expect to see using
Windows PowerShell.
The remainder of this book references many of the directories listed in Table 2.1.
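For reference, a directory listing of a typical Spark 2.2.0 binary release (a sketch,
assuming the /opt/spark installation location used earlier; file entries such as
README.md are omitted, and contents may vary slightly by release) looks like
this:
$ ls /opt/spark
bin  conf  data  examples  jars  licenses  python  R  sbin  yarn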
5. Start the Spark Master. On the sparkmaster host, run the following
command:
$ sudo $SPARK_HOME/sbin/start-master.sh
Test the Spark Master process by viewing the Spark Master web UI at
http://sparkmaster:8080/.
6. Start the Spark Workers. On each sparkworker node, run the following
command:
$ sudo $SPARK_HOME/sbin/start-slave.sh spark://sparkmaster:7077
You should see output similar to that from the previous exercises.
You could also enable passwordless SSH (Secure Shell) from the Spark Master to
the Spark Workers. This is required if you want to start and stop the Worker
(slave) daemons remotely from the Master using the scripts provided in
$SPARK_HOME/sbin.
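With passwordless SSH in place, you can start and stop the entire cluster from the
Master host. The following is a sketch that assumes each Worker hostname has
been added, one per line, to the conf/slaves file under $SPARK_HOME on the
sparkmaster host:
# Starts the Master locally and, via SSH, a Worker on each host in conf/slaves
$ sudo $SPARK_HOME/sbin/start-all.sh
# Stops the Workers and then the Master
$ sudo $SPARK_HOME/sbin/stop-all.sh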
Spark on EC2
You can launch Spark clusters (or Hadoop clusters capable of running Spark) on
EC2 instances in AWS. Typically this is done within a Virtual Private Cloud
(VPC), which allows you to isolate cluster nodes from public networks.
Deploying Spark clusters on EC2 usually involves configuration management or
infrastructure automation tools such as Ansible, Chef, Puppet, or AWS
CloudFormation, which can automate deployment routines using an
Infrastructure-as-Code (IaC) discipline.
In addition, there are several predeveloped Amazon Machine Images (AMIs)
available in the AWS Marketplace; these have a pre-installed and configured
release of Spark.
You can also create Spark clusters on containers by using the EC2 Container
Service (ECS). There are numerous options for doing so, including existing
projects available on GitHub and elsewhere.
Spark on EMR
Elastic MapReduce (EMR) is Amazon’s Hadoop-as-a-Service platform. EMR
clusters are essentially Hadoop clusters with a variety of configurable ecosystem
projects, such as Hive, Pig, Presto, Zeppelin, and, of course, Spark.
You can provision EMR clusters using the AWS Management Console or via the
AWS APIs. Options for creating EMR clusters include number of nodes, node
instance types, Hadoop distribution, and additional applications to install,
including Spark.
EMR clusters are designed to read data and output results directly to and from
S3. EMR clusters are intended to be provisioned on demand, run a discrete
workflow or job flow, and terminate. They do have local storage, but they are not
intended to run in perpetuity. Therefore, you should use this local storage only
for transient data.
Listing 2.5 demonstrates creating a simple three-node EMR cluster with Spark
and Zeppelin using the AWS CLI.
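The general form of such a command is sketched below; the cluster name, key
pair name, and instance type are placeholders, and you would substitute a current
EMR release label:
$ aws emr create-cluster \
  --name "MySparkCluster" \
  --release-label emr-5.8.0 \
  --applications Name=Spark Name=Zeppelin \
  --instance-type m4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=MyKeyPair \
  --use-default-roles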
Figure 2.5 shows the Zeppelin notebook interface included with the EMR
deployment, which can be used as a Spark programming environment.
Using EMR is a quick and scalable deployment method for Spark. For more
information about EMR, go to https://aws.amazon.com/elasticmapreduce/.
TensorFlow
TensorFlow is an open source software library that Google created for machine
learning, and specifically for training neural networks, the models that underpin
deep learning. Neural networks are used to discover patterns, sequences, and
relationships in data, loosely modeled on the way the human brain works.
As with AWS, you could choose to deploy Spark using Google's IaaS offering,
Compute Engine, which requires you to deploy and manage the underlying
infrastructure yourself. However, GCP also offers a managed Hadoop and Spark
platform called Cloud Dataproc, which may be an easier option.
Cloud Dataproc offers a managed software stack similar to AWS EMR, deployed
as a cluster of nodes.
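For example, a small Dataproc cluster can be provisioned from the command line
using the Cloud SDK; the following is a sketch, assuming the gcloud tool is
installed and authenticated, with a placeholder cluster name and machine types
you would adjust to suit:
$ gcloud dataproc clusters create my-spark-cluster \
  --master-machine-type n1-standard-4 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2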
Databricks
Databricks is an integrated cloud-based Spark workspace that allows you to
launch managed Spark clusters and to ingest and interact with data from S3,
relational databases, or flat-file data sources, either in the cloud or in your own
environment. The Databricks platform uses your AWS credentials to create its
required infrastructure components, so you effectively have ownership of these
assets in your AWS account. Databricks provides the deployment, management,
and user/application interface framework for a cloud-based Spark platform in
AWS.
Databricks has several pricing plans available, with different features spanning
support levels, security and access control options, GitHub integration, and
more. Pricing is subscription based, with a flat monthly fee plus nominal
utilization charges (charged per hour per node). Databricks offers a 14-day free
trial period to get started. You are responsible for the instance costs incurred in
AWS for Spark clusters deployed using the Databricks platform; however,
Databricks allows you to use discounted spot instances to minimize AWS costs.
For the latest pricing and subscription information, go to
https://databricks.com/product/pricing.
Databricks provides a simple deployment and user interface, shown in Figure
2.6, which abstracts the underlying infrastructure and security complexities
involved in setting up a secure Spark environment in AWS. The Databricks
management console allows you to create notebooks, similar to the Zeppelin
notebook deployed with AWS EMR. These notebooks are automatically
associated with your Spark cluster and provide seamless programmatic access to
Spark functions using Python, Scala, SQL, or R. Databricks also provides APIs
for deployment and management.
Figure 2.6 Databricks console.
Databricks has its own distributed filesystem called the Databricks File System
(DBFS). DBFS allows you to mount existing S3 buckets and make them
seamlessly available in your Spark workspace. You can also cache data on the
solid-state disks (SSDs) of your worker nodes to speed up access. The dbutils
library included in your Spark workspace allows you to configure and interact
with the DBFS.
The Databricks platform and management console allow you to create data
objects as tables (conceptually similar to tables in a relational database) from a
variety of sources, including AWS S3 buckets, Java Database Connectivity
(JDBC) data sources, and the DBFS, or by uploading your own files using
drag-and-drop functionality. You can also create jobs by using the
Databricks console, and you can run them non-interactively on a user-defined
schedule.
The core AMPLab team that created—and continues to be a major contributor
to—the Spark project founded the Databricks company and platform. Spark
releases and new features are typically available in the Databricks platform
before they are shipped with other distributions, such as CDH or HDP. More
information about Databricks is available at http://databricks.com.
Summary
In this chapter, you have learned how to install Spark and considered the various
prerequisites and dependencies. You have also learned about the
various deployment modes available for deploying a Spark cluster, including
Local, Spark Standalone, YARN, and Mesos. In the first exercise, you set up a
fully functional Spark Standalone cluster. In this chapter you also looked at some
of the cloud deployment options available for deploying Spark clusters, such as
AWS EC2 or EMR clusters, Google Cloud Dataproc, and Databricks. Any of the
deployments discussed or demonstrated in this chapter can be used for
programming exercises throughout the remainder of this book—and beyond.