Apache Spark Component Guide
Hortonworks Data Platform, August 31, 2017
docs.hortonworks.com
The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open
source platform for storing, processing and analyzing large volumes of data. It is designed to deal with
data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks
Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop
Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the
major contributor of code and patches to many of these projects. These projects have been integrated and
tested as part of the Hortonworks Data Platform release process and installation and configuration tools
have also been included.
Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our
code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and
completely open source. We sell only expert technical support, training and partner-enablement services.
All of our technology is, and will remain, free and open source.
Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For
more information on Hortonworks services, please visit either the Support or Training page. Feel free to
contact us directly to discuss your specific needs.
Except where otherwise noted, this document is licensed under
Creative Commons Attribution ShareAlike 4.0 License.
http://creativecommons.org/licenses/by-sa/4.0/legalcode
Table of Contents
1. Analyzing Data with Apache Spark
2. Installing Spark
  2.1. Installing Spark Using Ambari
  2.2. Installing Spark Manually
  2.3. Verifying Spark Configuration for Hive Access
  2.4. Installing the Spark Thrift Server After Deploying Spark
  2.5. Validating the Spark Installation
3. Configuring Spark
  3.1. Configuring the Spark Thrift Server
    3.1.1. Enabling Spark SQL User Impersonation for the Spark Thrift Server
    3.1.2. Customizing the Spark Thrift Server Port
  3.2. Configuring the Livy Server
    3.2.1. Configuring SSL for the Livy Server
    3.2.2. Configuring High Availability for the Livy Server
  3.3. Configuring the Spark History Server
  3.4. Configuring Dynamic Resource Allocation
    3.4.1. Customizing Dynamic Resource Allocation Settings on an Ambari-Managed Cluster
    3.4.2. Configuring Cluster Dynamic Resource Allocation Manually
    3.4.3. Configuring a Job for Dynamic Resource Allocation
    3.4.4. Dynamic Resource Allocation Properties
  3.5. Configuring Spark for Wire Encryption
    3.5.1. Configuring Spark for Wire Encryption
    3.5.2. Configuring Spark2 for Wire Encryption
  3.6. Configuring Spark for a Kerberos-Enabled Cluster
    3.6.1. Configuring the Spark History Server
    3.6.2. Configuring the Spark Thrift Server
    3.6.3. Setting Up Access for Submitting Jobs
4. Running Spark
  4.1. Specifying Which Version of Spark to Run
  4.2. Running Sample Spark 1.x Applications
    4.2.1. Spark Pi
    4.2.2. WordCount
  4.3. Running Sample Spark 2.x Applications
    4.3.1. Spark Pi
    4.3.2. WordCount
5. Submitting Spark Applications Through Livy
  5.1. Using Livy with Spark Versions 1 and 2
  5.2. Using Livy with Interactive Notebooks
  5.3. Using the Livy API to Run Spark Jobs: Overview
  5.4. Running an Interactive Session With the Livy API
    5.4.1. Livy Objects for Interactive Sessions
    5.4.2. Setting Path Variables for Python
    5.4.3. Livy API Reference for Interactive Sessions
  5.5. Submitting Batch Applications Using the Livy API
    5.5.1. Livy Batch Object
    5.5.2. Livy API Reference for Batch Jobs
6. Running PySpark in a Virtual Environment
List of Tables
1.1. Spark and Livy Feature Support by HDP Version
3.1. Dynamic Resource Allocation Properties
3.2. Optional Dynamic Resource Allocation Properties
8.1. Comparison of the Spark-HBase Connectors
1. Analyzing Data with Apache Spark
Deep integration of Spark with YARN allows Spark to operate as a cluster tenant alongside
Apache engines such as Hive, Storm, and HBase, all running simultaneously on a single
data platform. Instead of creating and managing a set of dedicated clusters for Spark
applications, you can store data in a single location, access and analyze it with multiple
processing engines, and leverage your resources.
Spark on YARN leverages YARN services for resource allocation, runs Spark executors in
YARN containers, and supports workload management and Kerberos security features. It
has two modes:
• YARN-cluster mode, optimized for long-running production jobs
• YARN-client mode, best for interactive use such as prototyping, testing, and debugging
Spark shell and the Spark Thrift server run in YARN-client mode only.
HDP 2.6 supports Spark versions 1.6 and 2.0; Livy, for local and remote access to Spark
through the Livy REST API; and Apache Zeppelin, for browser-based notebook access to
Spark. (For more information about Zeppelin, see the Zeppelin Component Guide.)
Table 1.1. Spark and Livy Feature Support by HDP Version

HDP Version(s)  | 2.6.1        | 2.6.0        | 2.5.0, 2.5.3 | 2.4.3 | 2.4.2 | 2.4.0 | 2.3.4, 2.3.4.7, 2.3.6 | 2.3.2
Spark Version   | 1.6.3, 2.1.1 | 1.6.3, 2.1.0 | 1.6.2        | 1.6.2 | 1.6.1 | 1.6.0 | 1.5.2                 | 1.4.1
HBase connector | #            | #            | #            | TP    | TP    |       |                       |
GraphX          | TP           | TP           | TP           | TP    | TP    | TP    | TP                    |
DataSet API     | TP           | TP           | TP           | TP    | TP    |       |                       |

(TP indicates technical preview.)
The following features and associated tools are not officially supported by Hortonworks:
• Spark Standalone
• Spark on Mesos
2. Installing Spark
Before installing Spark, ensure that your cluster meets the following prerequisites:
You can choose to install Spark version 1, Spark version 2, or both. (To specify which version
of Spark runs a job, see Specifying Which Version of Spark to Run.)
Additionally, note the following requirements and recommendations for optional Spark
services and features:
• Spark access through Livy requires the Livy server installed on the cluster.
• For clusters not managed by Ambari, see "Installing and Configuring Livy" in the Spark
or Spark 2 chapter of the Command Line Installation Guide, depending on the version
of Spark installed on your cluster.
• PySpark and associated libraries require Python version 2.7 or later, or Python version 3.4
or later, installed on all nodes.
• For optimal performance with MLlib, consider installing the netlib-java library.
Caution
During the installation process, Ambari creates and edits several configuration files. If you configure and manage your cluster using Ambari, do not edit these files during or after installation. Instead, use the Ambari web UI to revise configuration settings.
This starts the Add Service wizard, displaying the Choose Services page. Some of the
services are enabled by default.
3. Scroll through the alphabetic list of components on the Choose Services page, and select
"Spark", "Spark2", or both:
5. On the Assign Masters page, review the node assignment for the Spark History Server or
Spark2 History Server, depending on which Spark versions you are installing. Modify the
node assignment if desired, and click "Next":
a. Scroll to the right and choose the "Client" nodes on which you want to run Spark clients.
These are the nodes from which Spark jobs can be submitted to YARN.
b. To install the optional Livy server, for security and user impersonation features, check
the "Livy Server" box for the desired node assignment on the Assign Slaves and Clients
page, for the version(s) of Spark you are deploying.
c. To install the optional Spark Thrift server at this time, for ODBC or JDBC access, review
Spark Thrift Server node assignments on the Assign Slaves and Clients page and assign
one or two nodes to it, as needed for the version(s) of Spark you are deploying. (To
install the Thrift server later, see Installing the Spark Thrift Server after Deploying
Spark.)
Deploying the Thrift server on multiple nodes increases scalability of the Thrift server.
When specifying the number of nodes, take into consideration the cluster capacity
allocated to Spark.
8. Unless you are installing the Spark Thrift server now, use the default values displayed on
the Customize Services page. Note that there are two tabs, one for Spark settings, and
one for Spark2 settings.
9. If you are installing the Spark Thrift server at this time, complete the following steps:
a. Click the "Spark" or "Spark2" tab on the Customize Services page, depending on which
version of Spark you are installing.
c. Set the spark.yarn.queue value to the name of the YARN queue that you want to
use.
11. If Kerberos is enabled on the cluster, review principal and keytab settings on the Configure Identities page, modify settings if desired, and then click Next.
12. When the wizard displays the Review page, ensure that all HDP components correspond to HDP 2.6.0 or later. Scroll down and check the node assignments for selected services; for example:
14. When Ambari displays the Install, Start and Test page, monitor the status bar and messages for progress updates:
15. When the wizard presents a summary of results, click "Complete" to finish installing Spark.
If you previously installed Spark on a cluster not managed by Ambari, and you want to
move to Spark 2:
1. Install Spark 2 according to the Spark 2 instructions in the Command Line Installation
Guide.
3. Test your Spark jobs on Spark 2. To direct a job to Spark 2 when Spark 1 is the default
version, see Specifying Which Version of Spark to Run.
4. When finished testing, optionally remove Spark 1 from the cluster: stop all services
and then uninstall Spark. Manually check to make sure all library and configuration
directories have been removed.
<configuration>
<property>
<name>hive.metastore.uris</name>
<!-- hostname must point to the Hive metastore URI in your cluster -->
<value>thrift://hostname:9083</value>
<description>URI for client to contact metastore server</description>
</property>
</configuration>
1. On the Summary tab, click "+ Add" and choose the Spark Thrift server:
2. When Ambari prompts you to confirm the selection, click Confirm All:
3. Configuring Spark
This chapter describes how to configure the following Apache Spark services and features:
• Livy server for remote access to Spark through the Livy REST API
• Wire encryption
This subsection describes optional Spark Thrift Server features and configuration steps:
• Enabling user impersonation, so that SQL queries run under the identity of the user who
originated the query. (By default, queries run under the account associated with the
Spark Thrift server.)
For information about configuring the Thrift server on a Kerberos-enabled cluster, see
Configuring the Spark Thrift Server in "Configuring Spark for a Kerberos-Enabled Cluster."
When user impersonation is enabled, Spark Thrift server runs Spark SQL queries as the
submitting user. By running queries under the user account associated with the submitter,
the Thrift server can enforce user level permissions and access control lists. Associated data
cached in Spark is visible only to queries from the submitting user.
User impersonation enables granular access control for Spark SQL queries at the level of
files or tables.
The user impersonation feature is controlled with the doAs property. When doAs is set to
true, Spark Thrift server launches an on-demand Spark application to handle user queries.
These queries are shared only with connections from the same user. Spark Thrift server
forwards incoming queries to the appropriate Spark application for execution, making the
Spark Thrift server extremely lightweight: it merely acts as a proxy to forward requests and
responses. When all user connections for a Spark application are closed at the Spark Thrift
server, the corresponding Spark application also terminates.
Prerequisites
Spark SQL user impersonation is supported for Apache Spark 1 versions 1.6.3 and later.
To enable user impersonation for the Spark Thrift server on an Ambari-managed cluster,
complete the following steps:
2. Add DataNucleus jars to the Spark Thrift server classpath. Navigate to the “Custom
spark-thrift-sparkconf” section and set the spark.jars property as follows:
spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
3. (Optional) Disable the Spark YARN application for the Spark Thrift server master. Navigate to the
"Advanced spark-thrift-sparkconf" section and set spark.master=local. This prevents
launching a spark-client HiveThriftServer2 application master, which is not needed when
doAs=true because queries are executed by the Spark AM, launched on behalf of the
user. When spark.master is set to local, SparkContext uses only the local machine
for driver and executor tasks.
(When the Thrift server runs with doAs set to false, you should set spark.master to
yarn-client, so that query execution leverages cluster resources.)
To enable user impersonation for the Spark Thrift server on a cluster not managed by
Ambari, complete the following steps:
2. Add DataNucleus jars to the Spark Thrift server classpath. Add the following setting
to the /usr/hdp/current/spark-thriftserver/conf/spark-thrift-
sparkconf.conf file:
spark.jars=/usr/hdp/current/spark-thriftserver/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-thriftserver/lib/datanucleus-rdbms-3.2.9.jar
3. (Optional) Disable the Spark YARN application for the Spark Thrift server master. Add the
following setting to the /usr/hdp/current/spark-thriftserver/conf/spark-
thrift-sparkconf.conf file:
spark.master=local
(When the Thrift server runs with doAs set to false, you should set spark.master to
yarn-client, so that query execution leverages cluster resources.)
For more information about user impersonation for the Spark Thrift Server, see Using Spark
SQL.
For a cluster not managed by Ambari, see "Installing and Configuring Livy" in the Spark or
Spark 2 chapter of the Command Line Installation Guide, depending on the version of Spark
installed on your cluster.
livy.keystore=<keystore_file>
livy.keystore.password=<storePassword>
livy.key-password=<KeyPassword>
For background information about configuring SSL for Spark or Spark2, see Configuring
Spark for Wire Encryption.
For deployments that require high availability, Livy supports session recovery, which ensures
that a Spark cluster remains available if the Livy server fails. After a restart, the Livy server
can connect to existing sessions and roll back to the state before failing.
Livy uses several property settings for recovery behavior related to high availability. If
your cluster is managed by Ambari, Ambari manages these settings. If your cluster is not
managed by Ambari, or for a list of recovery properties, see instructions for enabling Livy
recovery in the Spark or Spark2 chapter of the Command Line Installation Guide.
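If you maintain the Livy configuration yourself, the recovery behavior is typically expressed as properties similar to the following sketch; the property names and the ZooKeeper state-store location are assumptions to verify against the Command Line Installation Guide for your Livy version:

livy.server.recovery.mode = recovery
livy.server.recovery.state-store = zookeeper
livy.server.recovery.state-store.url = <zk-host1>:2181,<zk-host2>:2181,<zk-host3>:2181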
For information about configuring optional history server properties, see the Apache
Monitoring and Instrumentation document.
Dynamic resource allocation is available for use by the Spark Thrift server and general Spark
jobs.
Note
Dynamic Resource Allocation does not work with Spark Streaming.
You can configure dynamic resource allocation at either the cluster or the job level:
• Cluster level: On an Ambari-managed cluster, dynamic resource allocation is enabled and configured for the Spark Thrift server as part of the Spark installation. (The Thrift server runs in YARN mode by default, so the Thrift server uses resources from the YARN cluster.) The associated shuffle service starts automatically, for use by the Thrift server and general Spark jobs.
• Job level: You can customize dynamic resource allocation settings on a per-job basis. Job
settings override cluster configuration settings.
You can review dynamic resource allocation for the Spark Thrift server, and enable and
configure settings for general Spark jobs, by choosing Services > Spark and then navigating
to the "Advanced spark-thrift-sparkconf" group:
The "Advanced spark-thrift-sparkconf" group lists required settings. You can specify
optional properties in the custom section. For a complete list of DRA properties, see
Dynamic Resource Allocation Properties.
Dynamic resource allocation requires an external shuffle service that runs on each worker
node as an auxiliary service of NodeManager. If you installed your cluster using Ambari,
the service is started automatically for use by the Thrift server and general Spark jobs; no
further steps are needed.
1. Add the following properties to the spark-defaults.conf file associated with your
Spark installation (typically in the $SPARK_HOME/conf directory):
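A minimal sketch of the standard settings that enable dynamic allocation and the external shuffle service (the optional properties described in step 2 can be added alongside them):

spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true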
2. (Optional) To specify a starting point and range for the number of executors, use the
following properties:
• spark.dynamicAllocation.initialExecutors
• spark.dynamicAllocation.minExecutors
• spark.dynamicAllocation.maxExecutors
For more information, see the Apache Spark Shuffle Behavior documentation.
3.4.3. Configuring a Job for Dynamic Resource Allocation
There are two ways to customize dynamic resource allocation properties for an individual job:
• Include property values in the spark-submit command, using the --conf option.
This approach loads the default spark-defaults.conf file first, and then applies property values specified in your spark-submit command. Here is an example:
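A sketch of the --conf form, modeled on the SparkPi commands elsewhere in this guide; the specific properties and values shown are illustrative:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.initialExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=10 \
    lib/spark-examples*.jar 10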
• Specify a job-specific properties file with the spark-submit --properties-file option.
This approach uses the specified properties file, without reading the default property file. Here is an example:
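A sketch of the properties-file form, where dra.properties is an illustrative file containing the dynamic resource allocation settings shown earlier:

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    --properties-file dra.properties \
    lib/spark-examples*.jar 10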
3.5. Configuring Spark for Wire Encryption
• "In transit" encryption refers to data that is encrypted when it traverses a network. The
data is encrypted between the sender and receiver process across the network. Wire
encryption is a form of "in transit" encryption.
Apache Spark supports "in transit" wire encryption of data for Apache Spark jobs. When
encryption is enabled, Spark encrypts all data that is moved across nodes in a cluster on
behalf of a job, including the following scenarios:
• Data that is moving between executors and drivers, such as during a collect()
operation.
Spark does not support encryption for connectors accessing external sources; instead,
the connectors must handle any encryption requirements. For example, the Spark HDFS
connector supports transparent encrypted data access from HDFS: when transparent
encryption is enabled in HDFS, Spark jobs can use the HDFS connector to read encrypted
data from HDFS.
Spark does not support encrypted data on local disk, such as intermediate data written to a
local disk by an executor process when the data does not fit in memory. Additionally, wire
encryption is not supported for shuffle files, cached data, and other application files. For
these scenarios you should enable local disk encryption through your operating system.
In Spark 2.0, enabling wire encryption also enables HTTPS on the History Server UI, for
browsing historical job data.
The following two subsections describe how to configure Spark and Spark2 for wire
encryption, respectively.
b. Create a certificate:
keytool -export \
-alias <host> \
-keystore <keystore_file> \
-rfc -file <cert_file> \
-storepass <StorePassword>
2. Create one truststore file that contains the public keys from all certificates.
a. Log on to one host and import the truststore file for that host:
keytool -import \
-noprompt \
-alias <hostname> \
-file <cert_file> \
-keystore <all_jks> \
-storepass <allTruststorePassword>
b. Copy the <all_jks> file to the other nodes in your cluster, and repeat the keytool
command on each node.
keytool -genkey \
-alias <host> \
-keyalg RSA \
-keysize 1024 \
-dname CN=<host>,OU=hw,O=hw,L=paloalto,ST=ca,C=us \
-keypass <KeyPassword> \
-keystore <keystore_file> \
-storepass <storePassword>
b. Create a certificate:
keytool -export \
-alias <host> \
-keystore <keystore_file> \
-rfc -file <cert_file> \
-storepass <StorePassword>
keytool -import \
-noprompt \
-alias <host> \
-file <cert_file> \
-keystore <truststore_file> \
-storepass <truststorePassword>
2. Create one truststore file that contains the public keys from all certificates.
a. Log on to one host and import the truststore file for that host:
keytool -import \
-noprompt \
-alias <hostname> \
-file <cert_file> \
-keystore <all_jks> \
-storepass <allTruststorePassword>
b. Copy the <all_jks> file to the other nodes in your cluster, and repeat the keytool
command on each node.
<property>
<name>spark.authenticate</name>
<value>true</value>
</property>
spark.authenticate true
spark.authenticate.enableSaslEncryption true
spark.ssl.enabled true
spark.ssl.enabledAlgorithms TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA
spark.ssl.keyPassword <KeyPassword>
spark.ssl.keyStore <keystore_file>
spark.ssl.keyStorePassword <storePassword>
spark.ssl.protocol TLS
spark.ssl.trustStore <all_jks>
spark.ssl.trustStorePassword <allTruststorePassword>
spark.ui.https.enabled true
Note: In Spark2, enabling wire encryption also enables HTTPS on the History Server UI,
for browsing job history data.
6. (Optional) If you want to enable optional on-disk block encryption, which applies to
both shuffle and RDD blocks on disk, complete the following steps:
spark.io.encryption.enabled true
spark.io.encryption.keySizeBits 128
spark.io.encryption.keygen.algorithm HmacSHA1
For more information, see the Shuffle Behavior section of Apache Spark Properties
documentation, and the Apache Spark Security documentation.
When you enable Kerberos for a Hadoop cluster with Ambari, Ambari configures Kerberos
for the history server and automatically creates a Kerberos account and keytab for it. For
more information, see Enabling Kerberos Authentication Using Ambari in the HDP Security
Guide.
If your cluster is not managed by Ambari, or if you plan to enable Kerberos manually for
the history server, see Creating Service Principals and Keytab Files for HDP in the HDP
Security Guide.
• The Spark Thrift server must run on the same host as HiveServer2, so that it can access
the hiveserver2 keytab.
• You must use the Hive service account to start the thriftserver process.
If you access Hive warehouse files through HiveServer2 on a deployment with fine-grained
access control, run the Spark Thrift server as user hive. This ensures that the Spark Thrift
server can access Hive keytabs, the Hive metastore, and HDFS data stored under user hive.
Important
If you read files from HDFS directly through an interface such as Hive CLI or
Spark CLI (as opposed to HiveServer2 with fine-grained access control), you
should use a different service account for the Spark Thrift server. Configure the
account so that it can access Hive keytabs and the Hive metastore. Use of an
alternate account provides a more secure configuration: when the Spark Thrift
server runs queries as user hive, all data accessible to user hive is accessible to
the user submitting the query.
For Spark jobs that are not submitted through the Thrift server, the user submitting the job
must have access to the Hive metastore in secure mode, using the kinit command.
When access is authenticated without human interaction (as happens for processes that
submit job requests), the process uses a headless keytab. Security risk is mitigated by
ensuring that only the service that should be using the headless keytab has permission to
read it.
The following example creates a headless keytab for a spark service user account that will
submit Spark jobs on node [email protected]:
3. For every node of your cluster, create a spark user and add it to the hadoop group:
5. Limit access by ensuring that user spark is the only user with access to the keytab:
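A minimal sketch of these steps; the realm, keytab path, and kadmin commands are illustrative and should be adapted to your environment:

# Create the spark user on every node and add it to the hadoop group
useradd spark
usermod -a -G hadoop spark

# Create the headless principal and keytab from kadmin on the KDC (illustrative)
#   addprinc -randkey spark/[email protected]
#   xst -k /etc/security/keytabs/spark.keytab spark/[email protected]

# Limit access so that only the spark user can read the keytab
chown spark:hadoop /etc/security/keytabs/spark.keytab
chmod 400 /etc/security/keytabs/spark.keytab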
In the following example, user spark runs the Spark Pi example in a Kerberos-enabled
environment:
su spark
kinit -kt /etc/security/keytabs/spark.keytab spark/[email protected]
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
lib/spark-examples*.jar 10
Each person who submits jobs must have a Kerberos account and their own keytab; end
users should use their own keytabs (instead of using a headless keytab) when submitting
a Spark job. This is a best practice: submitting a job under the end user keytab delivers a
higher degree of audit capability.
In the following example, end user $USERNAME has their own keytab and runs the Spark Pi
job in a Kerberos-enabled environment:
su $USERNAME
kinit [email protected]
cd /usr/hdp/current/spark-client/
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
lib/spark-examples*.jar 10
4. Running Spark
You can run Spark interactively or from a client program:
• Submit interactive statements through the Scala, Python, or R shell, or through a high-
level notebook such as Zeppelin.
• Use APIs to create a Spark application that runs interactively or in batch mode, using
Scala, Python, R, or Java.
To launch Spark applications on a cluster, you can use the spark-submit script in the
Spark bin directory. You can also use the API interactively by launching an interactive shell
for Scala (spark-shell), Python (pyspark), or SparkR. Note that each interactive shell
automatically creates SparkContext in a variable called sc. For more information about
spark-submit, see the Apache Spark document Submitting Applications.
Alternately, you can use Livy to submit and manage Spark applications on a cluster. Livy
is a Spark service that allows local and remote applications to interact with Apache Spark
over an open source REST interface. Livy offers additional multi-tenancy and security
functionality. For more information about using Livy to run Spark Applications, see
Submitting Spark Applications through Livy.
This chapter describes how to specify Spark version for a Spark application, and how to run
Spark 1 and Spark 2 sample programs.
Use the following guidelines for determining which version of Spark runs a job by default,
and for specifying an alternate version if desired.
• By default, if only one version of Spark is installed on a node, your job runs with the
installed version.
• By default, if more than one version of Spark is installed on a node, your job runs with
the default version for your HDP package. In HDP 2.6, the default is Spark version 1.6.
• If you want to run jobs on the non-default version of Spark, use one of the following
approaches:
• If you use full paths in your scripts, change spark-client to spark2-client; for
example:
change /usr/hdp/current/spark-client/bin/spark-submit
to /usr/hdp/current/spark2-client/bin/spark-submit.
• If you do not use full paths, but instead launch jobs from the path, set the
SPARK_MAJOR_VERSION environment variable to the desired version of Spark before
you launch the job.
For example, if Spark 1.6.3 and Spark 2.0 are both installed on a node and you want to
run your job with Spark 2.0, set
SPARK_MAJOR_VERSION=2.
You can set SPARK_MAJOR_VERSION in automation scripts that use Spark, or in your
manual settings after logging on to the shell.
Note: The SPARK_MAJOR_VERSION environment variable can be set by any user who
logs on to a client machine to run Spark. The scope of the environment variable is local
to the user session.
The following example submits a SparkPi job to Spark 2, using spark-submit under /
usr/bin:
cd /usr/hdp/current/spark2-client/
export SPARK_MAJOR_VERSION=2
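A sketch of the submit command that follows, consistent with the Spark 2 examples later in this chapter; the executor counts and memory settings are illustrative:

spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-client \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    examples/jars/spark-examples*.jar 10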
Note that the path to spark-examples-*.jar is different than the path used for
Spark 1.x.
To change the environment variable setting later, either remove the environment variable
or change the setting to the newly desired version.
4.2. Running Sample Spark 1.x Applications
4.2.1. Spark Pi
You can test your Spark installation by running the following compute-intensive example,
which calculates pi by “throwing darts” at a circle. The program generates points in the
unit square ((0,0) to (1,1)) and counts how many points fall within the unit circle within the
square. The result approximates pi.
1. Log on as a user with Hadoop Distributed File System (HDFS) access: for example, your
spark user, if you defined one, or hdfs.
When the job runs, the library is uploaded into HDFS, so the user running the job needs
permission to write to HDFS.
2. Navigate to a node with a Spark client and access the spark-client directory:
cd /usr/hdp/current/spark-client
su spark
3. Run the Apache Spark Pi job in yarn-client mode, using code from org.apache.spark:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
lib/spark-examples*.jar 10
Your job should produce output similar to the following. Note the value of pi in the
output.
You can also view job status in a browser by navigating to the YARN ResourceManager
Web UI and viewing job history server information. (For more information about
checking job status and history, see Tuning and Troubleshooting Spark.)
4.2.2. WordCount
WordCount is a simple program that counts how often a word occurs in a text file. The
code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.
2. Log on as a user with HDFS access: for example, your spark user (if you defined one) or
hdfs.
cd /usr/hdp/current/spark-client/
su spark
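Stage an input file and launch the Spark shell before running the commands that follow; a minimal sketch (any text file copied to /tmp/data in HDFS works as input):

# Stage a sample text file in HDFS as input for the word count
hdfs dfs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data

# Launch the Spark shell in YARN client mode
./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m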
You should see output similar to the following (with additional status messages):
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.3
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.
0_112)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala>
5. At the scala> prompt, submit the job by typing the following commands, replacing
node names, file name, and file location with your own values:
val file = sc.textFile("/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).
reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/wordcount")
4.3. Running Sample Spark 2.x Applications
4.3.1. Spark Pi
You can test your Spark installation by running the following compute-intensive example,
which calculates pi by “throwing darts” at a circle. The program generates points in the
unit square ((0,0) to (1,1)) and counts how many points fall within the unit circle within the
square. The result approximates pi.
1. Log on as a user with Hadoop Distributed File System (HDFS) access: for example, your
spark user, if you defined one, or hdfs.
When the job runs, the library is uploaded into HDFS, so the user running the job needs
permission to write to HDFS.
2. Navigate to a node with a Spark client and access the spark2-client directory:
cd /usr/hdp/current/spark2-client
su spark
3. Run the Apache Spark Pi job in yarn-client mode, using code from org.apache.spark:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
examples/jars/spark-examples*.jar 10
Your job should produce output similar to the following. Note the value of pi in the
output.
17/03/22 23:21:10 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.
scala:38, took 1.302805 s
Pi is roughly 3.1445191445191445
You can also view job status in a browser by navigating to the YARN ResourceManager
Web UI and viewing job history server information. (For more information about
checking job status and history, see Tuning and Troubleshooting Spark.)
4.3.2. WordCount
WordCount is a simple program that counts how often a word occurs in a text file. The
code builds a dataset of (String, Int) pairs called counts, and saves the dataset to a file.
2. Log on as a user with HDFS access: for example, your spark user (if you defined one) or
hdfs.
cd /usr/hdp/current/spark2-client/
su spark
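As in the Spark 1.x example, stage an input file and launch the Spark 2 shell before running the commands that follow; a minimal sketch:

hdfs dfs -copyFromLocal /etc/hadoop/conf/log4j.properties /tmp/data
./bin/spark-shell --master yarn --driver-memory 512m --executor-memory 512m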
You should see output similar to the following (with additional status messages):
Spark context Web UI available at http://172.26.236.247:4041
Spark context available as 'sc' (master = yarn, app id =
application_1490217230866_0002).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0.2.6.0.0-598
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.
0_112)
scala>
5. At the scala> prompt, submit the job by typing the following commands, replacing
node names, file name, and file location with your own values:
val file = sc.textFile("/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).
reduceByKey(_ + _)
counts.saveAsTextFile("/tmp/wordcount")
5. Submitting Spark Applications Through Livy
Livy is a Spark service that allows local and remote applications to interact with Apache Spark over a REST interface. Livy offers the following capabilities:
• Livy supports user impersonation: the Livy server submits jobs on behalf of the user who
submits the requests. Multiple users can share the same server ("user impersonation"
support). This is important for multi-tenant environments, and it avoids unnecessary
permission escalation.
• Livy supports security features such as Kerberos authentication and wire encryption.
• REST APIs are backed by SPNEGO authentication, so the requesting user must first be authenticated by Kerberos.
• RPCs between Livy Server and Remote SparkContext are encrypted with SASL.
Livy 0.3.0 supports programmatic and interactive access to Spark1 and Spark2 with Scala
2.10, and Scala 2.11:
• Develop a Scala, Java, or Python client that uses the Livy API. The Livy REST API supports
full Spark 1 and Spark 2 functionality including SparkSession, and SparkSession with Hive
enabled.
Code runs in a Spark context, either locally or in YARN; YARN cluster mode is
recommended.
To install Livy on an Ambari-managed cluster, see Installing Spark Using Ambari. To install
Livy on a cluster not managed by Ambari, see the Spark sections of the Command Line
Installation Guide. For additional configuration steps, see Configuring the Livy Server.
export SPARK_HOME=<path-to>/spark-2.1.0-bin-hadoop2.6
Scala Support
For default Scala builds, Spark 1.6 with Scala 2.10 or Spark 2.0 with Scala 2.11, Livy
automatically detects the correct Scala version and associated jar files.
If you require a different Spark-Scala combination, such as Spark 2.0 with Scala 2.10, set
livy.spark.scalaVersion to the desired version so that Livy uses the right jar files.
When you run code in a Zeppelin notebook using the %livy directive, the notebook
offloads code execution to Livy and Spark:
For more information about Zeppelin and Livy, see the Zeppelin Component Guide.
• When using the Spark API, the entry point (SparkContext) is created by the user who wrote
the code. When using the Livy API, SparkContext is offered by the framework; the user
does not need to create it.
• The client submits code to the Livy server through the REST API. The Livy server sends the
code to a specific Spark cluster for execution.
Architecturally, the client creates a remote Spark cluster, initializes it, and submits jobs
through REST APIs. The Livy server unwraps and rewraps the job, and then sends it to the
remote SparkContext through RPC. While the job runs, the client waits for the result, using the same path. The following diagram illustrates the process:
The Livy REST API supports GET, POST, and DELETE calls for interactive sessions.
The following example shows how to create an interactive session, submit a statement, and retrieve the result of the statement; the returned ID can be used for further queries.
1. Create an interactive session. The following POST request starts a new Spark cluster with
a remote Spark interpreter; the remote Spark interpreter is used to receive and execute
code snippets, and return the result.
POST /sessions

import json
import pprint
import requests

host = 'http://localhost:8998'
data = {'kind': 'spark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r.json()
2. Submit a statement. The following POST request submits a code snippet to a remote
Spark interpreter, and returns a statement ID for querying the result after execution is
finished.
POST /sessions/{sessionId}/statements

# statements_url is the statements endpoint of the session created above,
# for example: statements_url = host + '/sessions/0/statements'
data = {'code': 'sc.parallelize(1 to 10).count()'}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
r.json()
3. Get the result of a statement. The following GET request returns the result of a
statement in JSON format, which you can parse to extract elements of the result.
GET /sessions/{sessionId}/statements/{statementId}
statement_url = host + r.headers['location']
r = requests.get(statement_url, headers=headers)
pprint.pprint(r.json())
{u'id': 0,
u'output': {u'data': {u'text/plain': u'res0: Long = 10'},
u'execution_count': 0,
u'status': u'ok'},
u'state': u'available'}
The remainder of this section describes Livy objects and REST API calls for interactive
sessions.
The following values are valid for the kind property in a session object:
Value Description
spark Interactive Scala Spark session
pyspark Interactive Python 2 Spark session
pyspark3 Interactive Python 3 Spark session
sparkr Interactive R Spark session
The following values are valid for the state property in a session object:
Value Description
not_started Session has not started
starting Session is starting
idle Session is waiting for input
busy Session is executing a statement
shutting_down Session is shutting down
error Session terminated due to an error
dead Session exited
success Session successfully stopped
Statement Object
The following values are valid for the state property in a statement object:
value Description
waiting Statement is queued, execution has not started
running Statement is running
available Statement has a response ready
error Statement failed
cancelling Statement is being cancelled
cancelled Statement is cancelled
The following values are valid for the output property in a statement object:
pyspark
Livy reads the path from the PYSPARK_PYTHON environment variable (this is the same as
PySpark).
• If Livy is running in local mode, simply set the environment variable (this is the same as
PySpark).
pyspark3
POST
POST /sessions creates a new interactive Scala, Python, or R shell in the cluster.
DELETE
DELETE /sessions/{sessionId} terminates the session.
5.5. Submitting Batch Applications Using the Livy API
The following example shows a spark-submit command that submits a SparkPi job,
followed by an example that uses Livy POST requests to submit the job. The remainder of
this subsection describes Livy objects and REST API syntax. For additional examples and
information, see the readme.rst file at https://github.com/hortonworks/livy-release/
releases/tag/HDP-2.6.0.3-8-tag.
To submit the SparkPi job using Livy, complete the following steps. Note: the POST request
does not upload local jars to the cluster. You should upload required jar files to HDFS
before running the job. This is the main difference between the Livy API and spark-
submit.
3. To submit the SparkPi application to the Livy server, use a POST /batches request.
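A sketch of the POST /batches request in the same Python style as the interactive-session examples, reusing the host and headers variables defined there; the HDFS path of the uploaded jar is illustrative:

data = {
    'file': '/user/spark/spark-examples.jar',   # jar previously uploaded to HDFS
    'className': 'org.apache.spark.examples.SparkPi',
    'args': ['10']
}
r = requests.post(host + '/batches', data=json.dumps(data), headers=headers)
r.json()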
GET /batches/{batchId}/log retrieves log records for the specified batch session.
POST /batches creates a new batch environment and runs a specified application:
6. Running PySpark in a Virtual Environment
An isolated Python runtime is useful in scenarios such as the following:
• A large application needs a Python package that requires C code to be compiled before
installation.
For these situations, you can create a virtual environment as an isolated Python runtime
environment. HDP 2.6 supports VirtualEnv for PySpark in both local and distributed
environments, easing the transition from a local environment to a distributed environment.
The Oozie "Spark action" runs a Spark job as part of an Oozie workflow. The workflow
waits until the Spark job completes before continuing to the next action.
For additional information about Spark action, see the Apache Oozie Spark Action
Extension documentation. For general information about Oozie, see Using HDP for
Workflow and Scheduling with Oozie. For general information about using Workflow
Manager, see the Workflow Management Guide.
Note
In HDP 2.6, Oozie works with either Spark 1 or Spark 2 (not side-by-side
deployments), but Spark 2 support for Oozie Spark action is available as a
technical preview; it is not ready for production deployment. Configuration is
through manual steps (not Ambari).
Support for yarn-client execution mode for Oozie Spark action will be removed
in a future release. Oozie will continue to support yarn-cluster execution mode
for Oozie Spark action.
• A workflow XML file that defines workflow logic and parameters for running the Spark
job. Some of the elements in a Spark action are specific to Spark; others are common to
many types of actions.
You can configure a Spark action manually, or on an Ambari-managed cluster you can use
the Spark action editor in the Ambari Oozie Workflow Manager (WFM). The Workflow
Manager is designed to help build powerful workflows.
For two examples that use Oozie Workflow Manager--one that creates a new Spark action,
and another that imports and runs an existing Spark workflow--see the Hortonworks
Community Connection article Apache Ambari Workflow Manager View for Apache Oozie:
Part 7 ( Spark Action & PySpark).
Here is the basic structure of a workflow definition XML file for a Spark action:
<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.3">
...
<action name="[NODE-NAME]">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>[JOB-TRACKER]</job-tracker>
<name-node>[NAME-NODE]</name-node>
<prepare>
<delete path="[PATH]"/>
...
<mkdir path="[PATH]"/>
...
</prepare>
<job-xml>[SPARK SETTINGS FILE]</job-xml>
<configuration>
<property>
<name>[PROPERTY-NAME]</name>
<value>[PROPERTY-VALUE]</value>
</property>
...
</configuration>
<master>[SPARK MASTER URL]</master>
<mode>[SPARK MODE]</mode>
<name>[SPARK JOB NAME]</name>
<class>[SPARK MAIN CLASS]</class>
<jar>[SPARK DEPENDENCIES JAR / PYTHON FILE]</jar>
<spark-opts>[SPARK-OPTIONS]</spark-opts>
<arg>[ARG-VALUE]</arg>
...
<arg>[ARG-VALUE]</arg>
...
</spark>
<ok to="[NODE-NAME]"/>
<error to="[NODE-NAME]"/>
</action>
...
</workflow-app>
The following examples show a workflow definition XML file and an Oozie job
configuration file for running a SparkPi job (Spark version 1.x).
1. Create a spark2 ShareLib directory under the Oozie ShareLib directory associated with
the oozie service user:
hdfs dfs -mkdir /user/oozie/share/lib/lib_<ts>/spark2
2. Copy spark2 jar files from the spark2 jar directory to the Oozie spark2 ShareLib:
hdfs dfs -put \
/usr/hdp/<version>/spark2/jars/* \
/user/oozie/share/lib/lib_<ts>/spark2/
3. Copy the oozie-sharelib-spark jar file from the spark ShareLib directory to the
spark2 ShareLib directory:
hdfs dfs -cp \
    /user/oozie/share/lib/lib_<ts>/spark/oozie-sharelib-spark-<version>.jar \
    /user/oozie/share/lib/lib_<ts>/spark2/
4. Copy the hive-site.xml file from the current spark ShareLib to the spark2
ShareLib:
hdfs dfs -cp \
/user/oozie/share/lib/lib_<ts>/spark/hive-site.xml \
/user/oozie/share/lib/lib_<ts>/spark2/
To verify the configuration, run the Oozie shareliblist command. You should see
spark2 in the results.
oozie admin -shareliblist spark2
To run a Spark job with the spark2 ShareLib, add the oozie.action.sharelib.for.spark property to the job.properties file, and set its value to spark2:
oozie.action.sharelib.for.spark=spark2
The following examples show a workflow definition XML file, an Oozie job configuration
file, and a Python script for running a Spark2-Pi job.
<workflow-app name="Python-Spark-Pi-workflow" xmlns="uri:oozie:workflow:0.3">
<!-- The workflow-app and start elements shown here are illustrative; adjust the name and schema version for your deployment -->
<start to="spark-node" />
<action name='spark-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Python-Spark-Pi</name>
<jar>pi.py</jar>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error message
[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name='end' />
</workflow-app>
import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = SparkContext(appName="Python-Spark-Pi")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    # Sum the per-sample hits and estimate pi
    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
Depending on your use case, you can extend your use of Spark into several domains,
including the following described in this chapter:
• Spark DataFrames
• Spark SQL
• Spark Streaming
Additional resources:
• To get started with Spark, see the Apache Spark Quick Start and the Spark 1.6.3 and
Spark 2.0 overviews.
• For more information about application development, see the Apache Spark
Programming Guide.
• For more information about using Livy to submit Spark jobs, see Submitting Spark
Applications Through Livy.
1. As user spark, upload the people.txt and people.json sample files to the Hadoop
Distributed File System (HDFS):
cd /usr/hdp/current/spark-client
su spark
hdfs dfs -copyFromLocal examples/src/main/resources/people.txt people.txt
hdfs dfs -copyFromLocal examples/src/main/resources/people.json people.json
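A sketch of creating the DataFrame in the Spark shell, using the sqlContext that the shell provides; it produces output like that shown below:

// Read the JSON file from HDFS into a DataFrame and display it
val df = sqlContext.read.json("people.json")
df.show()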
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
The following examples use Scala to access DataFrame df defined in the previous
subsection:
// Import the DataFrame functions API
scala> import org.apache.spark.sql.functions._
The following example uses the DataFrame API to specify a schema for people.txt, and
then retrieves names from a temporary table associated with the schema:
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType,StructField,StringType}
peopleDataFrame.registerTempTable("people")
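The steps between the imports and the registerTempTable call, and the query that follows it, can be sketched as shown here; the simple string-only schema is an assumption for illustration:

// Load people.txt and define a schema with two string fields
val people = sc.textFile("people.txt")
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert each line to a Row and create the DataFrame
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// After registering the temporary table (as above), retrieve the names
val results = sqlContext.sql("SELECT name FROM people")
results.map(t => "Name: " + t(0)).collect().foreach(println)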
Using HiveContext, Spark SQL can also read data by interacting with the Hive MetaStore.
If you already use Hive, you should use HiveContext; it supports all Hive data formats and
user-defined functions (UDFs), and it enables you to have full access to the HiveQL parser.
HiveContext extends SQLContext, so HiveContext supports all SQLContext functionality.
• Interactive access using the Spark shell (see Accessing Spark SQL through the Spark Shell).
• From an application, operating through one of the following two APIs and the Spark
Thrift server:
• JDBC, using your own Java code or the Beeline JDBC client
For more information about JDBC and ODBC access, see Accessing Spark SQL through
JDBC: Prerequisites and Accessing Spark SQL through JDBC and ODBC.
The following diagram illustrates the access process, depending on whether you are using
the Spark shell or business intelligence (BI) application:
This subsection describes how to access Spark SQL through the Spark shell, and through
JDBC and ODBC.
To read data directly from the file system, construct a SQLContext. For an example that
uses SQLContext and the Spark DataFrame API to access a JSON file, see Using the Spark
DataFrame API.
To read data by interacting with the Hive Metastore, construct a HiveContext instance
(HiveContext extends SQLContext). For an example of the use of HiveContext (instantiated
as val sqlContext), see Accessing ORC Files from Spark.
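In the Spark 1.6 shell a sqlContext is created automatically; either entry point can also be constructed explicitly, as in this minimal sketch (sc is the SparkContext provided by the shell):

// Read directly from the file system
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Read through the Hive metastore
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)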
The following prerequisites must be met before accessing Spark SQL through JDBC or
ODBC:
• For an Ambari-managed cluster, deploy and launch the Spark Thrift server using the
Ambari web UI (see Installing and Configuring Spark Over Ambari).
• For a cluster that is not managed by Ambari, see Starting the Spark Thrift Server in the
Non-Ambari Cluster Installation Guide.
export SPARK_HOME=/usr/hdp/current/spark-client
If you want to enable user impersonation for the Spark Thrift server, so that the Thrift
server runs Spark SQL jobs as the submitting user, see Configuring the Spark Thrift server.
Before accessing Spark SQL through JDBC or ODBC, note the following caveats:
• ODBC and JDBC client configurations must match the Spark Thrift server configuration parameters. For example, if the Thrift server is configured to listen in binary mode, the client should send binary requests; if the Thrift server is configured for HTTP transport, the client should use HTTP mode.
• All client requests coming to the Spark Thrift server share a SparkContext.
To manually stop the Spark Thrift server, run the following commands:
su spark
./sbin/stop-thriftserver.sh
jdbc:hive2://<host>:<port>/<dbName>;<sessionConfs>?
<hiveConfs>#<hiveVars>
Note
The Spark Thrift server is a variant of HiveServer2, so you can use many of the
same settings. For more information about JDBC connection strings, including
transport and security settings, see Hive JDBC and ODBC Drivers in the HDP
Data Access Guide.
The following connection string accesses Spark SQL through JDBC on a Kerberos-enabled
cluster:
beeline> !connect jdbc:hive2://localhost:10002/default;httpPath=/;principal=hive/[email protected]
The following connection string accesses Spark SQL through JDBC over HTTP transport on a
Kerberos-enabled cluster:
beeline> !connect jdbc:hive2://localhost:10002/default;transportMode=http;httpPath=/;principal=hive/[email protected]
b. At the Beeline prompt, connect to the Spark SQL Thrift server with the JDBC
connection string:
beeline> !connect jdbc:hive2://localhost:10015
The host port must match the host port on which the Spark Thrift server is running.
Drivers and associated documentation are available in the "Hortonworks Data Platform
Add-Ons" section of the Hortonworks downloads page (http://hortonworks.com/
downloads/) under "Hortonworks ODBC Driver for SparkSQL." If the latest version of HDP
is newer than your version, check the Hortonworks Data Platform Archive area of the add-
ons section for the version of the driver that corresponds to your version of HDP.
When user impersonation is enabled, the Spark Thrift server runs Spark SQL queries as the submitting user. By running queries under the user account associated with the submitter, the Thrift Server can enforce user level permissions and
access control lists. This enables granular access control to Spark SQL at the level of files or
tables. Associated data cached in Spark is visible only to queries from the submitting user.
Spark SQL user impersonation is supported for Apache Spark 1 versions 1.6.3 and later. To
enable user impersonation, see Enabling User Impersonation for the Spark Thrift Server.
The following paragraphs illustrate several features of user impersonation.
A Beeline session running as user “foo” can access the data, read the drivers table, and
create a new table based on the table:
All user permissions and access control lists are enforced while accessing tables, data or
other resources. In addition, all output generated is for user “foo”.
For the table created in the preceding Beeline session, the owner is user “foo”:
The per-user Spark Application Master ("AM") caches data in memory without other users
being able to access the data--cached data and state are restricted to the Spark AM running
the query. Data and state information are not stored in the Spark Thrift server, so they are
not visible to other users. The Spark master runs as yarn-cluster, but query execution works as though it is yarn-client (essentially a yarn-cluster user program that accepts queries from the Spark Thrift server indefinitely).
When user impersonation is enabled for the Spark Thrift server, the Thrift Server is
responsible for the following features and capabilities:
• Authorizing incoming user connections (SASL authorization that validates the user
Beeline/socket connection).
• Terminating the Spark AM when all associated user connections are closed at the
Spark Thrift server.
• Ensuring that long-running Spark SQL sessions persist, by keeping the Kerberos state
valid.
• The Spark Thrift server and Spark AM, when launched on behalf of a user, can be long-
running applications in clusters with Kerberos enabled.
• The submitter's principal and keytab are not required for long-running Spark AM
processes, although the Spark Thrift server requires the Hive principal and keytab.
For an example, see the preceding Beeline example where the !connect command
specifies the connection URL for “foo_db”.
Hive variables can be used to parameterize queries. To set a Hive variable, use the set
hivevar command:
set hivevar:key=value
You can also set a Hive variable as part of the connection URL. In the following Beeline
example, plan=miles is appended to the connection URL. The variable is referenced in
the query as ${hivevar:plan}.
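A sketch of such a connection and query follows; the database, table, and principal names are illustrative, and per the connection URL format shown earlier, Hive variables appear after the # character:
beeline> !connect jdbc:hive2://localhost:10015/foo_db;principal=hive/[email protected]#plan=miles
0: jdbc:hive2://localhost:10015/foo_db> select * from plans where plan = '${hivevar:plan}';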
Every Spark AM managed by the Spark Thrift server is associated with a user and
connectionId. Connection IDs are not globally unique; they are specific to the user.
You can specify connectionId to control which Spark AM executes queries. If you do not
specify a connectionId, a default connectionId is associated with the Spark AM.
Note: Named connections allow users to specify their own Spark AM connections. They are
scoped to individual users, and do not allow a user to access the Spark AM associated with
another user.
If the Spark AM is available, the connection is associated with the existing Spark AM.
For a user, cached data is shared and available only within a single AM, not across Spark
AMs.
Different user connections on the same Spark AM can leverage previously cached data.
Each user connection has its own Hive session (which maintains the current database,
Hive variables, and so on), but shares the underlying cached data, executors, and Spark
application.
The following example shows a session for the first connection from user “foo” to named
connection “conn1”:
After caching the ‘drivers’ table, the query runs an order of magnitude faster.
A second connection to the same connectionId from user “foo” leverages the cached
table from the other active Beeline session, significantly increasing query execution speed:
If the Spark Thrift server is unable to find an existing Spark AM for a user connection,
by default the Thrift server launches a new Spark AM to service user queries. This is
applicable to named connections and unnamed connections. When a new Spark AM is to
be launched, you can override current Spark configuration settings by specifying them in
the connection URL. Specify Spark configuration settings as hiveconf variables prepended
by the sparkconf prefix:
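For example, the following sketch (with an illustrative database and principal) sets executor memory for the Spark AM launched on behalf of this connection; per the connection URL format shown earlier, hiveconf variables appear after the ? character:
beeline> !connect jdbc:hive2://localhost:10015/foo_db;principal=hive/[email protected]?sparkconf.spark.executor.memory=4g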
The Environment tab of the Spark application UI shows the resulting value.
When using a custom UDF, ensure that the .jar file for your UDF is included with your
application, or use the --jars command-line option to specify the file.
The following example uses a custom Hive UDF. This example uses the more limited
SQLContext, instead of HiveContext.
Spark Streaming receives live input data streams and divides the data into batches, which
are then processed by the Spark engine to generate the final stream of results in batches:
See the Apache Spark Streaming Programming Guide for conceptual information;
programming examples in Scala, Java, and Python; and performance tuning
recommendations.
Apache Spark 1.6 has built-in support for the Apache Kafka 0.8 API. If you want to access
a Kafka 0.10 cluster using new Kafka 0.10 APIs (such as wire encryption support) from
Spark 1.6 streaming jobs, the spark-kafka-0-10-connector package supports a Kafka
0.10 connector for Spark 1.x streaming. See the package readme file for additional
documentation.
The remainder of this subsection describes general steps for developers using Spark
Streaming with Kafka on a Kerberos-enabled cluster; it includes a sample pom.xml file for
Spark Streaming applications with Kafka. For additional examples, see the Apache GitHub
example repositories for Scala, Java, and Python.
Important
Dynamic Resource Allocation does not work with Spark Streaming.
8.4.1. Prerequisites
Before running a Spark Streaming application, Spark and Kafka must be deployed on the
cluster.
Unless you are running a job that is part of the Spark examples package installed by
Hortonworks Data Platform (HDP), you must add or retrieve the HDP spark-streaming-
kafka .jar file and associated .jar files before running your Spark job.
2. Add the Spark streaming Kafka and Spark streaming dependencies, with the Hortonworks
version number, to your pom.xml file:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.3.2.4.2.0-90</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.3.2.4.2.0-90</version>
<scope>provided</scope>
</dependency>
Note that the correct version number includes the Spark version and the HDP version.
3. (Optional) If you prefer to pack an uber .jar rather than use the default ("provided"),
add the maven-shade-plugin to your pom.xml file:
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<finalName>uber-${project.artifactId}-${project.version}</finalName>
</configuration>
</plugin>
• Instructions for submitting your job depend on whether you used an uber .jar file or not:
• If you kept the default .jar scope and you can access an external network, use --
packages to download dependencies in the runtime library:
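A sketch of such a submission follows, using the Spark streaming Kafka artifact and the Hortonworks repository listed in the sample pom.xml file later in this section; the class and application .jar placeholders are illustrative:
spark-submit --master yarn-client \
    --num-executors 1 \
    --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.3.2.4.2.0-90 \
    --repositories http://repo.hortonworks.com/content/repositories/releases/ \
    --class <user-main-class> \
    <user-application.jar> \
    <user arg lists>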
The artifact and repository locations should be the same as specified in your pom.xml
file.
• If you packed the .jar file into an uber .jar, submit the .jar file in the same way as you
would a regular Spark application:
spark-submit --master yarn-client \
--num-executors 1 \
--class <user-main-class> \
<user-uber-application.jar> \
<user arg lists>
For a sample pom.xml file, see Sample pom.xml file for Spark Streaming with Kafka.
3. Create a Java Authentication and Authorization Service (JAAS) login configuration file:
for example, key.conf.
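A minimal sketch of such a file follows, assuming the v.keytab file described below and an illustrative user principal; the JAAS entry name KafkaClient and the serviceName setting are what a Kafka client typically expects, but confirm them against your Kafka client documentation:
KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="./v.keytab"
    storeKey=true
    useTicketCache=false
    serviceName="kafka"
    principal="[email protected]";
};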
The keytab and configuration files are distributed using YARN local resources. Because
they reside in the current directory of the Spark YARN container, you should specify the
location as ./v.keytab.
5. In your spark-submit command, pass the JAAS configuration file and keytab as local
resource files using the --files option, and add the JAAS configuration file to the JVM
options for the driver and executors:
spark-submit \
--files key.conf#key.conf,v.keytab#v.keytab \
--driver-java-options "-Djava.security.auth.login.config=./key.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.
config=./key.conf" \
...
For example, the KafkaWordCount example accepts PLAINTEXTSASL as the last option
in the command line.
The following is a sample pom.xml file for Spark Streaming with Kafka:
<groupId>test</groupId>
<artifactId>spark-kafka</artifactId>
<version>1.0-SNAPSHOT</version>
<repositories>
<repository>
<id>hortonworks</id>
<name>hortonworks repo</name>
<url>http://repo.hortonworks.com/content/repositories/releases/</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.6.3.2.4.2.0-90</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.3.2.4.2.0-90</version>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<defaultGoal>package</defaultGoal>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
<resource>
<directory>src/test/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<goals>
<goal>copy-resources</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<configuration>
<recompileMode>incremental</recompileMode>
<args>
<arg>-target:jvm-1.7</arg>
</args>
<javacArgs>
<javacArg>-source</javacArg>
<javacArg>1.7</javacArg>
<javacArg>-target</javacArg>
<javacArg>1.7</javacArg>
</javacArgs>
</configuration>
<executions>
<execution>
<id>scala-compile</id>
<phase>process-resources</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<finalName>uber-${project.artifactId}-${project.version}</finalName>
</configuration>
</plugin>
</plugins>
</build>
</project>
Important
The HDP bundle includes two different connectors that extract datasets out of
HBase and stream them into Spark:
• The Hortonworks Spark-HBase Connector
• The RDD-Based Spark-HBase Connector
8.5.1. Selecting a Connector
The two connectors are designed to meet the needs of different workloads. In general,
use the Hortonworks Spark-HBase Connector for SparkSQL, DataFrame, and other
fixed schema workloads. Use the RDD-Based Spark-HBase Connector for RDDs and
other flexible schema workloads.
When using the connector developed by Hortonworks, the underlying context is a DataFrame,
with support for optimizations such as partition pruning, predicate pushdown,
and scanning. The connector is highly optimized to push down filters into the HBase level,
speeding up workloads. The tradeoff is limited flexibility, because you must specify your
schema up front. The connector leverages the standard Spark DataSource API for query
optimization.
For more information about the connector, see A Year in Review blog.
Refer to the following table for other factors that might affect your choice of connector,
source repos, and code examples.
8.6.1. Specifying Compression
To add a compression library to Spark, you can use the --jars option. For an example, see
Adding Libraries to Spark.
To save a Spark RDD to HDFS in compressed format, use code similar to the following (the
example uses the GZip algorithm):
rdd.saveAsHadoopFile("/tmp/spark_compressed",
                     "org.apache.hadoop.mapred.TextOutputFormat",
                     compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
For more information about supported compression algorithms, see Configuring HDFS
Compression in the HDFS Administration Guide.
If HADOOP_CONF_DIR is not set properly, you might see the following error:
Error from secure cluster
2016-08-22 00:27:06,046|t1.machine|INFO|1580|140672245782272|MainThread|
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.
PythonRDD.collectAndServe.
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|: org.
apache.hadoop.security.AccessControlException: SIMPLE authentication is not
enabled. Available:[TOKEN, KERBEROS]
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|MainThread|at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2016-08-22 00:27:06,047|t1.machine|INFO|1580|140672245782272|
MainThread|at sun.reflect.NativeConstructorAccessorImpl.
newInstance(NativeConstructorAccessorImpl.java:57)
2016-08-22 00:27:06,048|t1.machine|INFO|1580|140672245782272|MainThread|at
The Spark ORC data source supports ACID transactions, snapshot isolation, built-in indexes, and
complex data types (such as array, map, and struct), and provides read and write access
to ORC files. It leverages the Spark SQL Catalyst engine for common optimizations such as
column pruning, predicate push-down, and partition pruning.
This subsection has several examples of Spark ORC integration, showing how ORC
optimizations are applied to user programs.
The following example uses data structures to demonstrate working with complex types.
The Person struct data type has a name, an age, and a sequence of contacts, which are
themselves defined by names and phone numbers.
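A sketch of these data structures and sample records, in Scala (run in spark-shell so that the sqlContext implicits used in the next step are available), might look like the following:
case class Contact(name: String, phone: String)
case class Person(name: String, age: Int, contacts: Seq[Contact])

// Create 100 sample Person records, each with two contacts.
val records = (1 to 100).map { i =>
  Person(s"name_$i", i, (0 to 1).map { m => Contact(s"contact_$m", s"phone_$m") })
}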
In the physical file, these records are saved in columnar format. When accessing ORC files
through the DataFrame API, you see rows.
3. To write person records as ORC files to a directory named “people”, you can use the
following command:
sc.parallelize(records).toDF().write.format("orc").save("people")
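The saved files can then be read back into a DataFrame named people; a minimal sketch of this intermediate step, assuming the HiveContext-based sqlContext used elsewhere in this chapter:
val people = sqlContext.read.format("orc").load("people")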
5. For reuse in future operations, register the new "people" directory as temporary table
“people”:
people.registerTempTable("people")
6. After you register the temporary table “people”, you can query columns from the
underlying table:
sqlContext.sql("SELECT name FROM people WHERE age < 15").count()
In this example the physical table scan loads only columns name and age at runtime,
without reading the contacts column from the file system. This improves read performance.
You can also use Spark DataFrameReader and DataFrameWriter methods to access
ORC files.
ORC avoids this type of overhead by using predicate push-down, with three levels of built-in
indexes within each file: file level, stripe level, and row level:
• File-level and stripe-level statistics are in the file footer, making it easy to determine if the
rest of the file must be read.
• Row-level indexes include column statistics for each row group and position, for finding
the start of the row group.
ORC uses these indexes to move the filter operation to the data loading phase by reading
only data that potentially includes required rows.
This combination of predicate push-down with columnar storage reduces disk I/O
significantly, especially for larger datasets in which I/O bandwidth becomes the main
bottleneck to performance.
Here is the Scala API version of the SELECT query used in the previous section, using the
DataFrame API:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val people = sqlContext.read.format("orc").load("peoplePartitioned")
people.filter(people("age") < 15).select("name").show()
DataFrames are not limited to Scala. There is a Java API and, for data scientists, a Python
API binding:
sqlContext = HiveContext(sc)
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
people = sqlContext.read.format("orc").load("peoplePartitioned")
people.filter(people.age < 15).select("name").show()
Partition pruning is possible when data within a table is split across multiple logical
partitions. Each partition corresponds to a particular value of a partition column and
is stored as a subdirectory within the table root directory on HDFS. Where applicable,
only the required partitions (subdirectories) of a table are queried, thereby avoiding
unnecessary I/O.
Spark supports saving data in a partitioned layout seamlessly, through the partitionBy
method available during data source write operations. To partition the "people" table by
the “age” column, you can use the following command:
people.write.format("orc").partitionBy("age").save("peoplePartitioned")
As a result, records are automatically partitioned by the age field and then
saved into different directories: for example, peoplePartitioned/age=1/,
peoplePartitioned/age=2/, and so on.
After partitioning the data, subsequent queries can omit large amounts of I/O when
the partition column is referenced in predicates. For example, the following query
automatically locates and loads the file under peoplePartitioned/age=20/ and omits
all others:
val peoplePartitioned = sqlContext.read.format("orc").load("peoplePartitioned")
peoplePartitioned.registerTempTable("peoplePartitioned")
sqlContext.sql("SELECT * FROM peoplePartitioned WHERE age = 20")
8.7.5. Additional Resources
• Apache ORC website: https://orc.apache.org/
If you want to use a custom library, such as a compression library or Magellan, you can use
one of the following two spark-submit script options:
• The --jars option, which transfers associated .jar files to the cluster. Specify a list of
comma-separated .jar files.
• The --packages option, which pulls files directly from Spark packages. This approach
requires an internet connection.
For example, you can use the --jars option to add codec files. The following example
adds the LZO compression library:
spark-submit --driver-memory 1G \
--executor-memory 1G \
--master yarn-client \
--jars /usr/hdp/2.6.0.3-8/hadoop/lib/hadoop-lzo-0.6.0.2.6.0.3-8.jar \
test_read_write.py
For more information about the two options, see Advanced Dependency Management on
the Apache Spark "Submitting Applications" web page.
Note
If you launch a Spark job that references a codec library without specifying
where the codec resides, Spark returns an error similar to the following:
Caused by: java.lang.IllegalArgumentException: Compression codec
com.hadoop.compression.lzo.LzoCodec not found.
To address this issue, specify the codec file with the --jars option in your job
submit command.
SparkR provides a distributed data frame implementation that supports operations like
selection, filtering, and aggregation on large datasets. In addition, SparkR supports
distributed machine learning through MLlib.
This chapter lists prerequisites, followed by a SparkR example. Here are several links to
additional information:
• For information about SparkR architecture and the use of SparkR in a data science
workflow, see Integrate SparkR and R for Better Data Science Workflow.
• For information about how to install and use R packages with SparkR, see Using R
Packages with SparkR.
• For additional SparkR information, see the Apache SparkR documentation for your
version of Apache Spark (the link is for Spark 1, version 1.6.3).
9.1. Prerequisites
Before you run SparkR, ensure that your cluster meets the following prerequisites:
• R must be installed on all nodes. Commands for installing R are specific to the operating
system. For example, for CentOS you would log on as root and run the following
command:
yum install R
9.2. SparkR Example
The following example launches SparkR, and then uses R to create a DataFrame in Spark 1.6
from the built-in R faithful dataset. The example then lists the first few rows of the
DataFrame. (For more information about Spark DataFrames, see "Using the Spark DataFrame API".)
1. Launch SparkR:
su spark
cd /usr/hdp/2.6.0.0-598/spark/bin
./sparkR
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.3
/_/
2. From your R prompt (not the Spark shell), initialize SQLContext, create a DataFrame,
and list the first few rows:
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)
head(df)
This chapter provides an overview of approaches for assessing and tuning Spark
performance.
10.1. Provisioning Hardware
For general information about Spark memory use, including node distribution, local disk,
memory, network, and CPU core recommendations, see the Apache Spark Hardware
Provisioning document.
• To list running applications by ID from the command line, use yarn application -list.
• To check the query plan when using the DataFrame API, use DataFrame#explain().
• Spark history server web UI: view information about Spark jobs that have completed.
In a browser window, navigate to the history server web UI. The default host port is
<host>:18080.
• YARN web UI: view job history and time spent in various stages of the job:
http://<host>:8088/proxy/<app_id>/environment/
http://<host>:8088/proxy/<app_id>/stages/
• yarn logs command: list the contents of all log files from all containers associated with
the specified application.
• Hadoop Distributed File System (HDFS) shell or API: view container log files.
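For example (where <application_ID> is the ID reported by yarn application -list):
yarn logs -applicationId <application_ID>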
For more information, see "Debugging your Application" in the Apache document
Running Spark on YARN.
• Consider switching from the default serializer to the Kryo serializer to improve
performance, as shown in the example below.
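A minimal sketch of enabling Kryo at submit time:
spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --class <user-main-class> \
    <user-application.jar>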
YARN evaluates all available compute resources on each machine in a cluster and
negotiates resource requests from applications running in the cluster. YARN then provides
processing capacity to each application by allocating containers. A container is the basic
unit of processing capacity in YARN; it is an encapsulation of resource elements such as
memory (RAM) and CPU.
In a Hadoop cluster, it is important to balance the use of RAM, CPU cores, and disks so that
processing is not constrained by any one of these cluster resources.
When determining the appropriate YARN memory configurations for Spark, note the
following values on each node:
When configuring YARN memory allocation for Spark, consider the following information:
• Driver memory does not need to be large unless the job aggregates much data at the
driver (as with a collect() action).
• Large executor memory does not imply better performance, due to JVM garbage
collection. Sometimes it is better to configure a larger number of small JVMs than a small
number of large JVMs.
• Executor processes are not released if the job has not finished, even if they are no longer
in use.
In yarn-cluster mode, the Spark driver runs inside an application master process that
is managed by YARN on the cluster. The client can stop after initiating the application.
The following example shows how to start a YARN client in yarn-cluster mode, specifying
the number of executors, executor memory and cores, and driver memory. The client
starts the default Application Master, and SparkPi runs as a child thread of the Application
Master. The client periodically polls the Application Master for status updates and displays
them on the console.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar 10
In yarn-client mode, the driver runs in the client process. The application master is
used only to request resources from YARN. To launch a Spark application in yarn-client
mode, replace yarn-cluster with yarn-client. The following example launches the
Spark shell in yarn-client mode and specifies the number of executors and associated
memory:
./bin/spark-shell --num-executors 32 \
--executor-memory 24g \
--master yarn-client