HDP Developer: Enterprise Spark 1
Student Guide
Rev 0.1
Summit Summer 2016
filter() ..................................................................................................................................................... 43
distinct() ................................................................................................................................................ 44
Basic Spark Actions ................................................................................................................................. 44
collect(), first(), and take() ..................................................................................................................... 44
count() ................................................................................................................................................... 44
saveAsTextFile().................................................................................................................................... 45
Transformations vs Actions: Lazy Evaluation ..................................................................................... 45
Lazy Evaluation Visualized ................................................................................................................... 46
RDD Special Topics ................................................................................................................................. 47
Multiple RDDs: union() and intersection() ............................................................................................ 47
Named Functions ................................................................................................................................. 47
Numeric Operations ............................................................................................................................. 48
More Functions: Spark Documentation............................................................................................... 48
Knowledge Check .................................................................................................................................... 49
Questions .............................................................................................................................................. 49
Answers ................................................................................................................................................ 50
Summary .................................................................................................................................................. 51
Pair RDDs ..................................................................................................................................................... 53
Lesson Objectives .................................................................................................................................... 53
Pair RDD Introduction .............................................................................................................................. 53
Create Pair RDDs ................................................................................................................................. 53
Pair RDD Operations ................................................................................................................................ 55
mapValues() .......................................................................................................................................... 55
keys(), values(), and sortByKey() .......................................................................................................... 55
Reorder Key-Value Pairs using map() .................................................................................................. 56
lookup(), countByKey(), and collectAsMap() ....................................................................................... 56
reduceByKey() ...................................................................................................................................... 57
groupByKey() ........................................................................................................................................ 58
flatMapValues() ..................................................................................................................................... 58
subtractByKey() .................................................................................................................................... 59
Pair RDD Joins...................................................................................................................................... 59
More Functions: Spark Documentation............................................................................................... 60
Knowledge Check .................................................................................................................................... 61
Questions .............................................................................................................................................. 61
Answers ................................................................................................................................................ 62
Summary .................................................................................................................................................. 63
Spark Streaming .......................................................................................................................................... 65
Lesson Objectives .................................................................................................................................... 65
Spark Streaming Overview ...................................................................................................................... 65
What is Spark Streaming? .................................................................................................................... 65
DStreams .............................................................................................................................................. 66
DStream vs. RDD .................................................................................................................................. 66
DStream Replication ............................................................................................................................ 67
Receiver Availability ............................................................................................................................. 67
Receiver Reliability ............................................................................................................................... 68
Streaming Data Source Examples ....................................................................................................... 68
Basic Streaming ....................................................................................................................................... 69
StreamingContext................................................................................................................................. 69
Modify REPL CPU Cores ...................................................................................................................... 69
Launch StreamingContext ................................................................................................................... 69
Stream from HDFS Directories and TCP Sockets............................................................................... 70
Output to Console and to HDFS .......................................................................................................... 70
Start and Stop the Streaming Application........................................................................................... 71
Simple Streaming Program Example Using a REPL ........................................................................... 71
Basic Streaming Transformations ........................................................................................................... 72
DStream Transformations .................................................................................................................... 72
Transformation using flatMap() ............................................................................................................ 72
Combine DStreams using union() ........................................................................................................ 73
Create Key-Value Pairs ........................................................................................................................ 73
reduceByKey() ...................................................................................................................................... 73
Window Transformations ......................................................................................................................... 74
Stateful vs. Stateless Operations......................................................................................................... 74
Checkpointing....................................................................................................................................... 74
Streaming Window Functions .............................................................................................................. 74
Basic Window Transformations ........................................................................................................... 75
Sample Window Application ................................................................................................................ 75
reduceByKeyAndWindow() .................................................................................................................. 76
Knowledge Check .................................................................................................................................... 77
Questions .............................................................................................................................................. 77
Answers ................................................................................................................................................ 78
Summary .................................................................................................................................................. 79
Spark SQL .................................................................................................................................................... 81
Lesson Objectives .................................................................................................................................... 81
Spark SQL Components .......................................................................................................................... 81
DataFrames .......................................................................................................................................... 81
Hive ....................................................................................................................................................... 81
Hive Data Visually ................................................................................................................................. 82
DataFrame Visually ............................................................................................................................... 82
Spark SQL Contexts ............................................................................................................................. 83
SQLContext vs. HiveContext ............................................................................................................... 83
Catalyst Spark SQL Optimizer ............................................................................................................. 84
DataFrames, Tables and Contexts .......................................................................................................... 84
DataFrames and Tables ....................................................................................................................... 84
DataFrames and Tables Summary ...................................................................................................... 88
Create and Save DataFrames and Tables .............................................................................................. 89
Converting an RDD to a DataFrame .................................................................................................... 89
Creating DataFrames Programmatically in Python ............................................................................. 90
Creating DataFrames Programmatically in Scala ............................................................................... 90
Registering DataFrames as Temporary Tables ................................................................................... 91
Making Tables Available Across Contexts with CREATE TABLE ...................................................... 91
Creating DataFrames from Existing Hive Tables ................................................................................ 92
Saving DataFrames from HDFS ........................................................................................................... 92
Manipulate DataFrames and Tables ....................................................................................................... 95
Manipulating SQL Tables ..................................................................................................................... 96
Manipulating DataFrames using the DataFrames API ........................................................................ 96
Knowledge Check .................................................................................................................................. 103
Questions ............................................................................................................................................ 103
Answers .............................................................................................................................................. 104
Summary ................................................................................................................................................ 105
Data Visualization in Zeppelin ................................................................................................................... 107
Lesson Objectives .................................................................................................................................. 107
Data Visualization Overview .................................................................................................................. 107
Data Visualization and Spark ............................................................................................................. 107
Data Exploration in Zeppelin ................................................................................................................. 108
Visualizations on Tables - %sql default ............................................................................................ 108
Define HDP and how it fits into overall data lifecycle management strategies
Volume
Volume refers to the amount of data being generated. Think in terms of gigabytes, terabytes, and
petabytes. Many systems and applications are just not able to store, let alone ingest or process, that
much data.
Many factors contribute to the increase in data volume. This includes transaction-based data stored
for years, unstructured data streaming in from social media, and the ever increasing amounts of sensor
and machine data being produced and collected.
There are problems related to the volume of data. Storage cost is an obvious issue. Another is filtering and finding relevant, valuable information within large quantities of data that often contain little of value.
You also need a solution that analyzes data quickly enough to maximize business value today, not just next quarter or next year.
Velocity
Velocity refers to the rate at which new data is created. Think in terms of megabytes per second and
gigabytes per second.
Data is streaming in at unprecedented speed and must be dealt with in a timely manner in order to
extract maximum value from the data. Sources of this data include logs, social media, RFID tags,
sensors, and smart metering.
There are problems related to the velocity of data. These include not reacting quickly enough to
benefit from the data. For example, data could be used to create a dashboard that could warn of
imminent failure or a security breach. Failure to react in time could lead to service outages.
Another problem related to the velocity of data is that data flows tend to be highly inconsistent, with periodic peaks. Causes include daily or seasonal changes and event-triggered peak loads. For example, a change in political leadership could cause a spike in social media activity.
Variety
Variety refers to the number of types of data being generated. Varieties of data include structured,
semi-structured, and unstructured data arriving from a myriad of sources. Data can be gathered from
databases, XML or JSON files, text documents, email, video, audio, stock ticker data, and financial
transactions.
There are problems related to the variety of data. These include how to gather, link, match, cleanse, and transform data across systems. You also have to consider how to connect and correlate data relationships and hierarchies in order to extract business value from the data.
Sentiment: Understand how your customers feel about your brand and products right now
Clickstream: Capture and analyze website visitor's data trails and optimize your website
Server Logs: Research log files to diagnose and process failures and prevent security
breaches
Text: Understand patterns in text across millions of web pages, emails and documents
Sentiment
Understand how your customers feel about your brand and products right now
Sentiment data is unstructured data containing opinions, emotions, and attitudes. Sentiment data is
gathered from social media like Facebook and Twitter. It is also gathered from blogs, online product
reviews, and customer support interactions.
Enterprises use sentiment analysis to understand how the public thinks and feels about something.
They can also track how those thoughts and feelings change over time.
It is used to make targeted, real-time decisions that improve performance and improve market share.
Sentiment data may be analyzed to get feedback about products, services, competitors, and
reputation.
Clickstream
Capture and analyze website visitor's data trails and optimize your website
Clickstream data is the data trail left by a user while visiting a Web site. Clickstream data can be used
to determine how long a customer stayed on a Web site, which pages they most frequently visited,
which pages they most quickly abandoned, along with other statistical information.
This data is commonly captured in semi-structured Web logs.
Clickstream data is used, for example, for path optimization, basket analysis, next-product-to-buy
analysis, and allocation of Web site resources.
Hadoop makes it easier to analyze, visualize, and ultimately change how visitors behave on your Web
site.
Sensor/Machine
Discover patterns in data streaming automatically from remote sensors and machines
A sensor is a converter that measures a physical quantity and transforms it into a digital signal. Sensor
data is used to monitor machines, infrastructure, or natural phenomenon.
Sensors are everywhere these days. They are on the factory floor, and they are in department stores in the form of RFID tags. Hospitals use biometric sensors to monitor patients and other sensors to monitor the delivery of medicines via intravenous drip lines. In all cases, these machines stream low-cost, always-on data.
Hadoop makes it easier for you to rapidly collect, store, process, and refine this data. By processing
and refining your data you can identify meaningful patterns that provide insight to make proactive
business decisions.
Geographic
Analyze location-based data to manage operations where they occur
Geographic/geolocation data identifies the location of an object or individual at a moment in time. This
data may take the form of coordinates or an actual street address.
This data might be voluminous to collect, store, and process; just like sensor data. In fact, geolocation
data is collected by sensors.
Hadoop helps reduce data storage costs while providing value-driven intelligence from asset tracking.
For example, you might optimize truck routes to save fuel costs.
Server Log
Research log files to diagnose and process failures and prevent security breaches
Server log data captures system and network operation information. Information technology
organizations analyze server logs for many reasons. These include the need to answer questions
about security, monitor for regulatory compliance, and troubleshoot failures.
Hadoop takes server-log analysis to the next level by speeding and improving log aggregation and
data center-wide analysis. In many environments Hadoop can replace existing enterprise-wide
systems and network monitoring tools, and reduce the complexity and costs associated with deploying
and maintaining such tools.
Text
Understand patterns in text across millions of web pages, emails and documents
Text is a category often used for text-based data that doesn't fit neatly into one of the above categories, as well as for combinations of categories, in order to find patterns across different text-based sources.
HDP Introduction
Hadoop is a collection of open source software frameworks for the distributed storing and processing
of large sets of data. Hadoop development is a community effort governed under the licensing of the
Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software
bugs, or improving performance and scalability.
Hadoop clusters are scalable, ranging from as few as one machine to literally thousands of machines.
It is also fault tolerant. Hadoop services achieve fault tolerance through redundancy.
Clusters are created using commodity, enterprise-grade hardware, which not only reduces the original purchase price but can also reduce support costs.
Hadoop also uses distributed storage and processing to achieve massive scalability. Large datasets
are automatically split into smaller chunks, called blocks, and distributed across the cluster machines.
Not only that, but each machine commonly processes its local block of data. This means that
processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of
memory.
HDP is an enterprise-ready collection of frameworks (sometimes referred to as the HDP Stack) that work within Hadoop and that have been tested and are supported by Hortonworks for business clients.
Hadoop is not a monolithic piece of software. It is a collection of software frameworks. Most of the
frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks
that are part of the Hortonworks Hadoop distribution.
So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed
for a specific purpose. The functionality of some tools overlap but typically one tool is going to be
better than others when performing certain tasks.
For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis, but
Storm has more functionality and is more powerful for real-time data analysis.
HDFS is a Java-based distributed file system that provides scalable, reliable, high-throughput access
to application data stored across commodity servers. HDFS is similar to many conventional file
systems. For example, it shares many similarities to the Linux file system. HDFS supports operations
to read, write, and delete files. It supports operations to create, list, and delete directories. HDFS is
described in more detail in another lesson.
YARN is a framework for cluster resource management and job scheduling. YARN is the architectural
center of Hadoop that enables multiple data processing engines such as interactive SQL, real-time
streaming, data science, and batch processing to co-exist on a single cluster. YARN is described in
more detail in another lesson.
The four operations frameworks are Apache Ambari, Apache ZooKeeper, Cloudbreak, and Apache
Oozie.
Apache Pig is a high-level platform for extracting, transforming, or analyzing large datasets. Pig
includes a scripted, procedural-based language that excels at building data pipelines to aggregate and
add structure to data. Pig also provides data analysts with tools to analyze data.
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It was designed to enable
users with database experience to analyze data using familiar SQL-based statements. Hive includes
support for SQL:2011 analytics. Hive and its SQL-based language enable an enterprise to utilize
existing SQL skillsets to quickly derive value from a Hadoop deployment.
Apache HCatalog is a table information, schema, and metadata management system for Hive, Pig,
MapReduce, and Tez. HCatalog is actually a module in Hive that enables non-Hive tools to access Hive
metadata tables. It includes a REST API, named WebHCat, to make table information and metadata available to other vendors' tools.
Cascading is an application development framework for building data applications. Acting as an
abstraction layer, Cascading converts applications built on Cascading into MapReduce jobs that run
on top of Hadoop.
Apache HBase is a non-relational database. Sometimes a non-relational database is referred to as a
NoSQL database. HBase was created for hosting very large tables with billions of rows and millions of
columns. HBase provides random, real-time access to data. It adds some transactional capabilities to
Hadoop, allowing users to conduct table inserts, updates, scans, and deletes.
Apache Phoenix is a client-side SQL skin over HBase that provides direct, low-latency access to
HBase. Entirely written in Java, Phoenix enables querying and managing HBase tables using SQL
commands.
Apache Accumulo is a low-latency, large-table data storage and retrieval system with cell-level security. Accumulo is based on Google's Bigtable, but it runs on YARN.
Apache Storm is a distributed computation system for processing continuous streams of real-time
data. Storm augments the batch processing capabilities provided by MapReduce and Tez by adding
reliable, real-time data processing capabilities to a Hadoop cluster.
Apache Solr is a distributed search platform capable of indexing petabytes of data. Solr provides
user-friendly, interactive search to help businesses find data patterns, relationships, and correlations
across petabytes of data. Solr ensures that all employees in an organization, not just the technical
ones, can take advantage of the insights that Big Data can provide.
Apache Spark is an open source, general purpose processing engine that allows data scientists to build and run fast and sophisticated applications on Hadoop. Spark provides a set of simple, easy-to-understand programming APIs that are used to build applications at a rapid pace in Scala. The Spark engine supports a set of high-level tools that support SQL-like queries, streaming data applications, complex analytics such as machine learning, and graph algorithms.
Apache Falcon is a data governance tool. It provides a workflow orchestration framework designed
for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon
enables data stewards and Hadoop administrators to quickly onboard data and configure its
associated processing and management on Hadoop clusters.
WebHDFS uses the standard HTTP verbs GET, PUT, POST, and DELETE to access, operate, and
manage HDFS. Using WebHDFS, a user can create, list, and delete directories as well as create, read,
append, and delete files. A user can also manage file and directory ownership and permissions.
Administrators can manage HDFS.
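As a hedged illustration (the NameNode hostname, port, and paths below are placeholders, not values from this course), listing a directory and creating a new one through WebHDFS might look like this:
curl -i "http://<namenode-host>:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"
curl -i -X PUT "http://<namenode-host>:50070/webhdfs/v1/user/hadoop/newdir?op=MKDIRS"
The LISTSTATUS operation uses GET, while operations that change state, such as MKDIRS, use PUT, mirroring the HTTP verbs described above.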
The HDFS NFS Gateway allows access to HDFS as though it were part of an NFS client's local file
system. The NFS client mounts the root directory of the HDFS cluster as a volume and then uses local
command-line commands, scripts, or file explorer applications to manipulate HDFS files and
directories.
Apache Flume is a distributed, reliable, and available service that efficiently collects, aggregates, and moves streaming data. It is a distributed service because it can be deployed across many systems. The benefits of a distributed system include increased scalability and redundancy. It is reliable because its architecture and components are designed to prevent data loss. It is highly available because it uses redundancy to limit downtime.
Apache Sqoop is a collection of related tools. The primary tools are the import and export tools.
Writing your own scripts or MapReduce program to move data between Hadoop and a database or an
enterprise data warehouse is an error prone and non-trivial task. Sqoop import and export tools are
designed to reliably transfer data between Hadoop and relational databases or enterprise data
warehouse systems.
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
Kafka is often used in place of traditional message brokers like Java Messaging Service (JMS) or
Advance Message Queueing Protocol (AMQP) because of its higher throughput, reliability, and
replication.
Apache Atlas is a scalable and extensible set of core foundational governance services that enable an enterprise to meet its compliance requirements within Hadoop and enable integration with the complete enterprise data ecosystem.
Security Frameworks
HDFS also contributes security features to Hadoop. These include file and directory permissions, access control lists, and transparent data encryption. Access to data and services often depends on having the correct HDFS permissions and encryption keys.
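As a minimal sketch (the path and user name below are hypothetical, not part of the course environment), HDFS access control lists can be inspected and extended from the command line:
hdfs dfs -getfacl /data/sensitive/report.csv
hdfs dfs -setfacl -m user:webapp:r-- /data/sensitive/report.csv   # grant read-only access to one additional user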
YARN also contributes security features to Hadoop. YARN includes access control lists that control
access to cluster memory and CPU resources, along with access to YARN administrative capabilities.
Hive can be configured to control access to table columns and rows.
Falcon is a data governance tool that also includes access controls that limit who may submit
automated workflow jobs on a Hadoop cluster.
Apache Knox is a perimeter gateway protecting a Hadoop cluster. It provides a single point of
authentication into a Hadoop cluster.
Apache Ranger is a centralized security framework offering fine-grained policy controls for HDFS,
Hive, HBase, Knox, Storm, Kafka, and Solr. Using the Ranger Console, security administrators can
easily manage policies for access to files, directories, databases, tables, and columns. These policies
can be set for individual users or groups and then enforced within Hadoop.
Why does HDP need so many frameworks? Let's take a look at a simple data lifecycle example.
We start with some raw data and an HDP cluster. The first step in managing the data is to get it into the HDP cluster. We must have some mechanism to ingest that data - perhaps Sqoop, Flume, Spark Streaming, or Storm - and then another mechanism to analyze it and decide what to do with it next.
Does this data require some kind of transformation in order to be used? If so, ETL processes must be
run, and those results generated into another file. Quite often, this is not a single step, but multiple
steps configured into a data application pipeline.
The next decision comes with regards to whether to keep or discard the data. Not all data must be
kept, either because it has no value (empty files, for example), or it is not necessary to keep once it has
been processed and transformed. Thus some raw data can simply be deleted.
Data that must be kept requires additional decisions to be made. For example, where will the data be
stored, and for how long? Your Hadoop cluster might have multiple tiers of HDFS storage available,
perhaps separated via some kind of node label mechanism. In the example, we have two HDFS
storage tiers. Any data that is copied to tier 2 should be stored for 90 days. We have another, higher
tier of HDFS Storage, and any data stored here should be kept until it is manually deleted.
You may decide that some data should be archived rather than made immediately available via HDFS, and you might maintain multiple tiers of archives as well. In our example, we have three tiers of archival storage, and data is kept for one, three, or seven years depending on where it is stored.
A third location where data might end up is some kind of cloud storage, such as AWS or Microsoft Azure.
Both raw data and transformed data might be kept anywhere in this storage infrastructure as a result of having been input and processed by this HDP cluster. In addition, you may be working in a multi-cluster environment, in which case an additional decision is required: what data needs to be replicated between the clusters? If files need to be replicated to another HDP cluster, then once that cluster ingests and examines that data, the same kinds of processes and decision mechanisms need to be employed. Perhaps additional transformation is required. Perhaps some files can be examined and deleted. For files that are to be kept, their location and retention period must be decided, just as on the first cluster.
This is a relatively simple example of the kind of data lifecycle decisions that need to be made in an environment where the capabilities of HDP are being fully utilized. This can get significantly more complex with additional storage tiers, retention requirements, and geographically dispersed HDP clusters that must replicate data between each other, and perhaps a central global cluster designed to do all final processing.
HDFS Overview
The Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator) are part of
the core of Hadoop and are installed when you install the Hortonworks Data Platform. In this section
we will focus on HDFS, the distributed file system for HDP.
Developers can interact with HDFS directly via the command line using the hdfs dfs command and appropriate arguments. If a developer has previous Linux command-line experience, the hdfs dfs commands will be familiar and intuitive to use. The most common command-line uses are manual data ingestion and file manipulation. Example commands include:
-put, -get: copies files from the local file system to the HDFS and vice versa.
-ls, -rm: list and remove files/directories (adding -R makes listing/removal recursive)
-stat: statistical info for any given file (block size, number of blocks, file type, etc.)
Additional information can be obtained by running hdfs dfs at the command line with no arguments or options, or by viewing the online documentation.
The sequence of commands below creates a directory, copies a file from the local file system to the new directory in HDFS, and then lists the contents of the directory:
hdfs dfs -mkdir mydata
hdfs dfs -put numbers.txt mydata/
hdfs dfs -ls mydata
HDFS implements a permissions model for files and directories that shares much of the POSIX model:
Each file and directory is associated with an owner and a group. The file or directory has separate
permissions for the user that is the owner, for other users that are members of the group, and for all
other users.
For files, the r permission is required to read the file and the w permission is required to write or
append to the file.
For directories, the r permission is required to list the contents of the directory, the w permission is
required to create or delete files or directories, and the x permission is required to access a child of the
directory.
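Continuing the earlier mydata example, the commands below show one way to inspect and change permissions and ownership. This is a sketch only - the analyst user and analysts group are hypothetical:
hdfs dfs -ls mydata                              # shows the rwx permissions, owner, and group for each file
hdfs dfs -chmod 640 mydata/numbers.txt           # owner: read/write, group: read, others: none
hdfs dfs -chown analyst:analysts mydata/numbers.txt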
YARN Overview
Why is YARN so important to Spark? Let's take a look at a sample enterprise Hadoop deployment.
Without a central resource manager to ensure good behavior between applications, it is necessary to
create specialized, separate clusters to support multiple applications. This, in turn, means that when
you want to do something different with the data that application was using, it is necessary to copy
that data between clusters. This introduces inefficiencies in terms of network, CPU, memory, storage,
general datacenter management, and data integrity across Hadoop applications.
YARN as a resource manager mitigates this issue by allowing different types of applications to access
the same underlying resources pooled into a single data lake. Since Spark runs on YARN, it can join
other Hadoop applications on the same cluster, enabling data and resource sharing at enterprise scale.
YARN (unofficially "Yet Another Resource Negotiator") is the computing framework for Hadoop. If you
think about HDFS as the cluster file system for Hadoop, YARN would be the cluster operating system.
It is the architectural center of Hadoop.
A computer operating system, such as Windows or Linux, manages access to resources, such as CPU,
memory, and disk, for installed applications. In similar fashion, YARN provides a managed framework
that allows for multiple types of applications - batch, interactive, online, streaming, and so on - to
execute on data across your entire cluster. Just like a computer operating system manages both
resource allocation (which application gets access to CPU, memory, and disk now, and which one has
to wait if contention exists?) and security (does the current user have permission to perform the
requested action?), YARN manages resource allocation for the various types of data processing
workloads, prioritizes and schedules jobs, and enables authentication and multitenancy.
Every slave node in a cluster is comprised of resources such as CPU and memory. The abstract notion of a resource container is used to represent a discrete amount of these resources. Cluster applications run inside one or more containers.
Containers are managed and scheduled by YARN.
A container's resources are logically isolated from other containers running on the same machine. This
isolation provides strong application multi-tenancy support.
Applications are allocated different sized containers based on application-defined resource requests,
but always within the constraints configured by the Hadoop administrator.
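For example, a Spark application submitted to YARN states its container requests up front. The sketch below is illustrative only - the application file name and resource sizes are assumptions, and the administrator-configured limits still apply:
spark-submit --master yarn --deploy-mode client \
  --num-executors 4 \
  --executor-memory 2G \
  --executor-cores 2 \
  my_app.py
Each executor then runs inside a YARN container sized according to these requests.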
Knowledge Check
You can use the following questions and answers for self-assessment.
Questions
1 ) Name the three Vs of big data.
2 ) Name four of the six types of data commonly found in Hadoop.
3 ) Why is HDP comprised of so many different frameworks?
4 ) What two frameworks make up the core of HDP?
5 ) What is the base command-line interface command for manipulating files and directories in
HDFS?
6 ) YARN allocates resources to applications via _____________________.
Answers
1 ) Name the three Vs of big data.
Answer: Volume, Velocity, and Variety
2 ) Name four of the six types of data commonly found in Hadoop.
Answer: Sentiment, clickstream, sensor/machine, server log, geographic, and text
3 ) Why is HDP comprised of so many different frameworks?
Answer: To allow for end-to-end management of the data lifecycle
4 ) What two frameworks make up the core of HDP?
Answer: HDFS and YARN
5 ) What is the base command-line interface command for manipulating files and directories in HDFS?
Answer: hdfs dfs
6 ) YARN allocates resources to applications via _____________________.
Answer: Containers
Summary
The hdfs dfs command can be used to create and manipulate files and directories
YARN serves as the operating system and architectural center of HDP, allocating resources to
a wide variety of applications via containers
Zeppelin Overview
Let's begin with an overview of Zeppelin, the interface we will use to work with Spark.
Apache Zeppelin
Apache Zeppelin is a web-based notebook that enables interactive data analytics on top of Spark. It
supports a growing list of programming languages, such as Python, Scala, Hive, SparkSQL, shell, and
markdown. It allows for data visualization, report generation, and collaboration.
Zeppelin
Zeppelin has four major functions: data ingestion, discovery, analytics, and visualization. It comes with
built-in examples that demonstrate these capabilities. These examples can be reused and modified for
real-world scenarios.
Data Visualization
Zeppelin comes with several built-in ways to interactively view and visualize data including table view,
column charts, pie charts, area charts, line charts, and scatter plot charts illustrated below:
*Nearly* all of the labs in this class could be run either from the command line or Zeppelin with
identical steps, so the use of Zeppelin does not interfere with learning in any way.
Despite its tech preview status, Zeppelin is the best solution available today in HDP for key functionality such as data visualization and collaboration while supporting multiple languages (for example, both Python and Scala).
When Zeppelin comes out of tech preview - which may have already happened by the time you read this - you will already have significant hands-on experience from using it now.
Spark Overview
Let's look at an overview of Spark.
Spark Introduction
Spark is a platform that allows for large-scale, cluster-based, in-memory data processing. It enables
fast, large-scale data engineering and analytics for iterative and performance-sensitive applications. It
offers development APIs for Scala, Java, Python, and R. In addition, Spark has been extended to
support SQL-like operations, streaming, and machine learning as well.
Spark is supported by Hortonworks on HDP and is YARN compliant, meaning it can leverage datasets
that exist across many other applications in HDP.
Spark RDDs
To leverage Hadoop's horizontal scalability, Spark processes data in a Resilient Distributed Dataset,
called an RDD. An RDD is a fault-tolerant collection of data elements. An RDD is created by starting
with a file on disk, or a collection of data in memory in a driver program. Each RDD is distributed
across multiple nodes in a cluster. This enables parallel processing across the nodes.
Allowing RDDs to reside and be processed in memory dramatically increases performance especially
when a dataset needs to be manipulated through multiple stages of processing. The data in an RDD
can be transformed and analyzed very rapidly.
Spark Tools
Coming from the Spark project, Spark Core supports a set of four high-level tools that support SQL-like queries, streaming data applications, a machine learning library (MLlib), and graph algorithms (GraphX). In addition, Spark also integrates with a number of other HDP tools, such as Hive for SQL-like operations and Zeppelin for graphing and data visualization.
There are five core components of an enterprise Spark application in HDP. They are the Driver,
SparkContext, YARN ResourceManager, HDFS Storage, and Executors.
When using a REPL, the driver and SparkContext will run on a client machine. When deploying an
application as a cluster application, the driver and SparkContext can also run in a YARN container. In
both cases, Spark executors run in YARN containers on the cluster.
Spark Driver
The Spark driver contains the main() Spark program that manages the overall execution of a Spark
application. It is a JVM that creates the SparkContext which then communicates directly with Spark.
It is also responsible for writing/displaying and storing the logs that the SparkContext gathers from
executors.
The Spark shell REPLs are examples of Spark driver programs.
IMPORTANT: The Spark driver is a single point of failure for a YARN client application. If the driver fails, the application will fail. This is mitigated by deploying applications in YARN cluster mode.
SparkContext
For any application to become a Spark application, an instance of the SparkContext class must be instantiated. The SparkContext contains all the code and objects required to process the data in the cluster, and works with the YARN ResourceManager to get the requested resources for the application. It is also responsible for scheduling tasks for Spark executors, and it checks in with the executors for reports on the work being done and for log updates.
A SparkContext is automatically created and named sc when a REPL is launched. The following
code is executed at start up for pyspark:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
Spark Executors
The Spark executor is the component that performs the map and reduce tasks of a Spark application, and is sometimes referred to as a Spark worker. Once created, executors exist for the life of the application.
NOTE: In the context of Spark, the SparkContext is the "master" and executors are the "workers."
However, in the context of HDP in general, you also have "master" nodes and "worker" nodes. Both
uses of the term worker are correct - in terms of HDP, the worker (node) can run one or more Spark
workers (executors). When in doubt, make sure to verify whether the worker being described is an HDP
node or a Spark executor running on an HDP node.
Spark executors function as interchangeable work spaces for Spark application processing. If an
executor is lost while an application is running, all tasks assigned to it will be reassigned to another
executor. In addition, any data lost will be recomputed on another executor.
Executor behavior can be controlled programmatically. Correctly configuring the number of executors and the resources available to them can greatly increase the performance of an application.
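A minimal sketch of programmatic configuration in pyspark is shown below; the application name and resource values are illustrative assumptions, not recommendations:
from pyspark import SparkContext, SparkConf

conf = SparkConf() \
    .setAppName("executorTuningExample") \
    .set("spark.executor.instances", "4") \
    .set("spark.executor.memory", "2g") \
    .set("spark.executor.cores", "2")
sc = SparkContext(conf=conf)   # executors are requested from YARN using these settings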
Knowledge Check
Use the following questions as a self-assessment.
Questions
1 ) Name the tool in HDP that allows for interactive data analytics, data visualization, and
collaboration with Spark.
2 ) What programming languages does Spark currently support?
3 ) What is the primary benefit of running Spark on YARN?
4 ) Name the five components of an enterprise Spark application running in HDP.
5 ) Which component of a Spark application is responsible for application workload processing?
Answers
1 ) Name the tool in HDP that allows for interactive data analytics, data visualization, and
collaboration with Spark.
Answer: Zeppelin
2 ) What programming languages does Spark currently support?
Answer: Scala, Java, Python, and R
3 ) What is the primary benefit of running Spark on YARN?
Answer: Access to datasets shared across the cluster with other HDP applications
4 ) Name the five components of an enterprise Spark application running in HDP.
Answer: Driver, SparkContext, YARN, HDFS, and executors.
5 ) Which component of a Spark application is responsible for application workload processing?
Answer: Executor
Summary
Zeppelin is a web-based notebook that supports multiple programming languages and allows
for data engineering, analytics, visualization, and collaboration using Spark
Spark provides REPLs for rapid, interactive application development and testing
The five components of an enterprise Spark application running in HDP are the Driver, SparkContext, YARN, HDFS, and Executors
Invoke functions for multiple RDDs, create named functions, and use numeric operations
Introduction to RDDs
Let's begin by going through a brief overview of what an RDD is and a few methods that can be used
to create one.
To create an RDD that contains all .txt files that meet wildcard requirements in a given location:
rddWild = sc.textFile("fileLocation/*.txt")
Wildcards and comma-separated lists can be combined in any configuration.
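For example (the file names below are placeholders), a comma-separated list and a mix of a wildcard and an explicit path are both valid arguments to textFile():
rddTwoFiles = sc.textFile("fileLocation/file1.txt,fileLocation/file2.txt")
rddMixed = sc.textFile("fileLocation/*.csv,otherLocation/extra.txt")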
In this simple example, we have a small cluster that has been loaded with three data files. We will walk through their input into HDFS, use two of them to create a single RDD, and begin to demonstrate the power of parallel datasets.
The first file in the example is small enough to fit entirely in a single 128 MB HDFS block; thus, data file 1 is made up of only one HDFS block (labeled DF1.1), which is written to Node 1. This block would be replicated by default to two other nodes, which we assume are not shown in the image.
Data files 2 and 3 take up two HDFS blocks apiece. In our example, these four data blocks are written
to four different HDFS nodes. Data file 2 is represented in HDFS by DF2.1 and DF 2.2 (written to node 2
and node 4, respectively). Data file 3 is represented in HDFS by DF3.1 and DF3.2 (written to node 3 and
node 5, respectively). Again, it is not shown in the image, but each of these blocks would be replicated
multiple times across nodes in the cluster.
Next we write a Spark application that initially defines an RDD that is made up of a combination of the
data in data files 1 and 2. In HDFS, these two files are represented by three HDFS blocks on nodes 1,
2, and 4. When the RDD partitions are created in memory, the nodes used will be the nodes that contain the data blocks, in order to improve performance and reduce the network traffic that would result from pulling data from one node's disk to another node's memory.
The three data blocks that represent these two files that were combined by the Spark application are
then copied into memory on their respective nodes and become the partitions of the RDD. DF2.1 is
written to an RDD partition we have labeled RDD 1.1. DF2.2 is written to an RDD partition we have
labeled RDD 1.2. DF1.1 is written to an RDD partition we have labeled RDD 1.3.
Thus, in our example, one RDD was created from two files (which are split across three HDFS data
nodes), which exist in memory as three partitions which a Spark application can then continue to use.
A hypothetical cluster has two RDDs. Each RDD is composed of multiple partitions, which are distributed across the cluster.
RDD Characteristics
RDDs can contain any type of serializable element, meaning those that can be converted to and from a byte stream. Examples include int, float, bool, and sequences/iterables like arrays, lists, tuples, and strings. Element types in an RDD can be mixed as well. For example, an array or list can contain both string and int values. Furthermore, RDD types are converted implicitly when possible, meaning there is no need to explicitly specify the type during RDD creation.
NOTE: Non-serializable elements (for example, objects created with certain third-party JAR files or other external resources) cannot be made into RDDs.
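As a quick illustration of mixed element types, the following hypothetical RDD combines a string, numbers, a boolean, and a tuple in a single collection:
rddMixed = sc.parallelize(["spark", 7, 3.14, True, ("a", 1)])
rddMixed.collect()
['spark', 7, 3.14, True, ('a', 1)]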
RDD Operations
Once an RDD is created, there are two types of operations that can be performed on it: transformations and actions.
A transformation takes an existing RDD, applies a function to the elements of the RDD, and creates a new RDD comprised of the transformed elements.
An action returns the result of a function applied to the elements of an RDD in the form of screen output, a file write, and so on.
First, let's take a look at an example of non-functional programming. In this example, we define a variable, varValue, outside of our function, then pull that value into our function and modify it. Note the dependence on, and the writing to, a variable that exists external to the function itself:
varValue = 0

def unfunctionalCode():
    global varValue
    varValue = varValue + 1
Now let's take a look at the same basic example, but this time written using functional programming principles. In this example, the variable is instantiated as part of calling the function itself, and only the value within the function is modified.
def functionalCode(varValue):
    return varValue + 1
All Spark transformations are based on functional programming.
Immutable data: RDD1A can be transformed into RDD1B, but an individual element within
RDD1A cannot be independently modified.
Behavioral consistency: If you pass the same value into a function multiple times, you will
always get the same result - changing order of evaluation does not change results.
Lazy evaluation: function arguments are not evaluated / executed until required.
map()
The map transformation applies a function supplied as its argument to each element of an RDD.
In the example below, we have a list of numbers that makes up our RDD. We then apply the map function and instruct Spark to run the anonymous function z + 1 (using z as the variable name for each element), and immediately print the output to the screen:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.map(lambda z: z + 1).collect()
[6, 8, 12, 15]
Note that the second line of code in this example did not define a new RDD. If further transformations
were necessary, the second line of code would need to be rewritten as follows:
rddAnon = rddNumList.map(lambda z: z + 1)
rddAnon.collect()
[6, 8, 12, 15]
Maps can apply to strings as well. Here is an example that starts by reading a file from HDFS called
"mary.txt":
rddMary=sc.textFile("mary.txt")
RDDs created using the textFile method treat newline characters as characters that separate
elements. Thus, since the file had four lines, the file as shown in the image would have four elements in
the RDD.
rddLineSplit = rddMary.map(lambda line: line.split(" "))
A map() transformation is then called. The goal of the map transformation in this scenario is to take each element, which is a string containing multiple words, and break it up into an array that is stored in a new RDD for further processing. The split function takes a string and breaks it into an array based on the delimiter passed into split().
The result is an RDD which still only has four elements, but now those elements are arrays rather than
monolithic strings.
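Assuming mary.txt contains the four-line nursery rhyme suggested by this lesson's examples, collecting the new RDD would show four list elements (output abbreviated):
rddLineSplit.collect()
[['Mary', 'had', 'a', 'little', 'lamb'], ['its', 'fleece', 'was', 'white', 'as', 'snow'], ...]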
flatMap()
The flatMap function is similar to map, with the exception that it performs an extra step to break down (or flatten) the component parts of elements such as arrays or other sequences into individual elements after running the map function.
As we saw previously, map is a one-to-one transformation: one element comes in, one element comes out. Using map(), four line elements were converted into four array elements, but we started and ended with the same number of elements. The flatMap function, on the other hand, is a one-to-possibly-many transformation: one element goes in, but many can come out.
Let's compare using the previous map() illustration.
rddLineSplit = rddMary.map(lambda line: line.split(" "))
If we run the exact same code, only replacing map() with flatMap(), the output is returned as a single list of individual elements rather than four lists of elements separated by line breaks.
rddFlat = rddMary.flatMap(lambda line: line.split(" "))
This time, each word is treated as its own element, resulting in 22 elements instead of 4. Again, it is easiest to think of flatMap() as simply a map operation followed by a flatten operation, in a single step.
filter()
The filter function is used to remove elements from an RDD that do not pass certain criteria or, put another way, it keeps only the elements of an RDD that satisfy a predicate. If the predicate returns true (the filter criteria are met), the element is passed on to the transformed RDD.
In the example below, we have an RDD composed of four elements and want to filter out any element whose value is greater than 10 (or, in other words, keep any value of 10 or less). Note that the initial RDD is being created using the sc.parallelize API:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.filter(lambda number: number <= 10).collect()
[5, 7]
This could have been performed using any standard mathematical operation. Filter is not limited to
working with numbers. It can work with strings as well. Let's use an example RDD consisting of the
list of months below:
months = ["January", "March", "May", "July", "September"]
rddMonths = sc.parallelize(months)
We then use filter, with an anonymous function that uses the len function to count the number of
characters in each element, and then filter out any that contain five or fewer characters.
rddMonths.filter(lambda name: len(name) > 5).collect()
['January', 'September']
Again, any available function that performs evaluations on text strings or arrays could be used to filter
for a given result.
distinct()
The distinct function removes duplicate elements from an RDD. Consider the following RDD:
rddBigList = sc.parallelize([5, 7, 11, 14, 2, 4, 5, 14, 21])
rddBigList.collect()
[5, 7, 11, 14, 2, 4, 5, 14, 21]
Notice that the numbers 5 and 14 are listed twice. If we just wanted to see each element only listed one
time in our output, we could use distinct() as follows:
rddDistinct = rddBigList.distinct()
rddDistinct.collect()
[4, 5, 21, 2, 14, 11, 7]
count()
count returns the number of elements in the RDD. Here is an example:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.count()
4
In the case of a file that contains lines of text, count() would return the number of lines in the RDD, as
in the following example:
rddMary=sc.textFile("mary.txt")
rddMary.count()
4
The count function applied to rddMary returns 4, one count for each line in the file.
44
saveAsTextFile()
The saveAsTextFile function writes the contents of RDD partitions to a specified location (such as
hdfs:// for HDFS, file:// for the local file system, and so forth) and directory as a set of text files. For
example:
rddNumList.saveAsTextFile("hdfs://desiredLocation/foldername")
The contents of the RDD in the example would be written to a specific directory in HDFS.
Success can be verified using typical tools from a command line or GUI. In the case of our example,
we could use the hdfs dfs -ls command to verify it had written successfully:
$ hdfs dfs -ls desiredLocation/foldername
Using saveAsTextFile(), each RDD partition is written to a different text file by default
The output would look like the screenshot shown. The files could be copied to the local file system and
then read using a standard text editor or viewer such as nano, more, or vi.
45
As the visual indicates, when transformations are performed on an RDD, Spark just saves the recipe of
what it is supposed to do when needed.
When the action is called, the data is pushed through the transformations so that the result can be calculated
Only at the end, when an action is called, will the data be pushed through the recipe to create the
desired outcome.
46
Named Functions
Custom functions can be defined and named, then used as arguments to other functions. A custom
function should be defined when the same code will be used multiple times throughout the program, or
if a function to be used as an argument will take more than a single line of code, making it too complex
for an anonymous function.
The following example evaluates a number to determine if it is 90 or greater. If so, it returns the text
string "A" and if not it returns "Not an A".
def gradeAorNot(percentage):
    if percentage > 89:
        return "A"
    else:
        return "Not an A"
NOTE: In the REPL, the number of tabs matters. For example, in line 2, you have to tab once before
typing the line, and in line 3 you must tab twice. In line 4, you have to tab once, and in line 5, twice.
47
The custom named function gradeAorNot can then be passed as an argument to another function, for example map().
rddGrades = sc.parallelize([87, 94, 41, 90])
rddGrades.map(gradeAorNot).collect()
['Not an A', 'A', 'Not an A', 'A']
The named function could also be used as the function body in an anonymous function. The following
example results in equivalent output to the code above:
rddGrades.map(lambda grade: gradeAorNot(grade)).collect()
Numeric Operations
Numeric operations can be performed on RDDs, including mean, count, stdev, sum, max, and min, as
well as a stats function that returns several of these values with a single call. For example:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.stats()
(count: 4, mean: 9.25, stdev: 3.49, max: 14, min: 5)
The individual functions can be called as well:
rddNumList.min()
5
NOTE: To double-check the output of Spark's stdev() function in Excel, use Excel's stdevp
function rather than the stdev function. Excel's stdev function assumes the data is only a sample of a
larger, unknown population and applies a bias correction to its output. The stdevp function
assumes the entire dataset (the p stands for "population") is fully represented and does not make a
bias correction. Thus, Excel's stdevp function is the closer match to Spark's stdev function.
From there, simply click on the appropriate programming language to view the documentation.
48
Knowledge Check
You can use the following questions and answers as a self-assessment.
Questions
1 ) What does RDD stand for?
2 ) What two functions were covered in this lesson that create RDDs?
3 ) True or False: Transformations apply a function to an RDD, modifying its values
4 ) What operation does the lambda function perform?
5 ) Which transformation will take all of the words in a text object and break each of them down
into a separate element in an RDD?
6 ) True or False: The count action returns the number of lines in a text document, not the
number of words it contains.
7 ) What is it called when transformations are not actually executed until an action is performed?
8 ) True or False: The distinct function allows you to compare two RDDs and return only
those values that exist in both of them
9 ) True or False: Lazy evaluation makes it possible to run code that "performs" hundreds of
transformations without actually executing any of them
49
Answers
1 ) What does RDD stand for?
Answer: Resilient Distributed Dataset
2 ) What two functions were covered in this lesson that create RDDs?
Answer: sc.parallelize() and sc.textFile()
3 ) True or False: Transformations apply a function to an RDD, modifying its values
Answer: False. Transformations result in new RDDs being created. In Spark, data is
immutable.
4 ) What operation does the lambda function perform?
Answer: Lambda is a keyword that precedes an anonymous function. These functions can
perform whatever operation is needed as long as it is contained in a single line of code.
5 ) Which transformation will take all of the words in a text object and break each of them down
into a separate element in an RDD?
Answer: flatMap()
6 ) True or False: The count action returns the number of lines in a text document, not the
number of words it contains.
Answer: True
7 ) What is it called when transformations are not actually executed until an action is performed?
Answer: Lazy evaluation
8 ) True or False: The distinct function allows you to compare two RDDs and return only
those values that exist in both of them
Answer: False. The intersection function performs this task. The distinct function
would remove duplicate elements, so that each element is only listed once regardless of how
many times it appeared in the original RDD.
9 ) True or False: Lazy evaluation makes it possible to run code that "performs" hundreds of
transformations without actually executing any of them
Answer: True
50
Summary
Resilient Distributed Datasets (RDDs) are immutable collections of elements that can be
operated on in parallel
Once an RDD is created, there are two things that can be done to it: transformations and
actions
Spark makes heavy use of functional programming practices, including the use of anonymous
functions
51
Pair RDDs
Lesson Objectives
After completing this lesson, students should be able to:
53
The picture above visually demonstrates what happens when the map function is applied to the initial
elements, each consisting of a single word.
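The picture itself is not reproduced here; the following is a minimal sketch of that map-based Pair RDD creation, assuming an illustrative RDD of single words named rddWords:
rddWords = sc.parallelize(["Mary", "had", "a", "lamb"])
kvRdd = rddWords.map(lambda word: (word, 1))
kvRdd.collect()
[('Mary', 1), ('had', 1), ('a', 1), ('lamb', 1)]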
keyBy()
The keyBy API creates key-value pairs by applying a function on each data element. The function
result becomes the key, and the original data element becomes the value in the pair.
For example:
rddTwoNumList = sc.parallelize([(1,2,3),(7,8)])
keyByRdd = rddTwoNumList.keyBy(len)
keyByRdd.collect()
[(3,(1,2,3)),(2,(7,8))]
Additional example:
rddThreeWords = sc.parallelize(["cat","A","spoon"])
keyByRdd2 = rddThreeWords.keyBy(len)
keyByRdd2.collect()
[(3,'cat'),(1,'A'),(5,'spoon')]
zipWithIndex()
The zipWithIndex function creates key-value pairs by assigning the index, or numerical position, of
the element as the value, and the element itself as the key.
For example:
rddThreeWords = sc.parallelize(["cat","A","spoon"])
zipWIRdd = rddThreeWords.zipWithIndex()
zipWIRdd.collect()
[('cat', 0), ('A', 1), ('spoon', 2)]
54
zip()
The zip function creates key-value pairs by taking elements from one RDD as the key and elements of
another RDD as the value. It has the following syntax:
keyRDD.zip(valueRDD)
The API assumes the two RDDs have the same number of partitions and elements.
rddThreeWords = sc.parallelize(["cat", "A", "spoon"])
rddThreeNums = sc.parallelize([11, 241, 37])
zipRdd = rddThreeWords.zip(rddThreeNums)
zipRdd.collect()
[('cat', 11), ('A', 241), ('spoon', 37)]
mapValues()
The mapValues function performs a defined operation on the values in a Pair RDD while leaving the
keys unchanged. For example:
zipWIRdd = sc.parallelize([("cat", 0), ("A", 1), ("spoon", 2)])
rddMapVals = zipWIRdd.mapValues(lambda val: val + 1)
rddMapVals.collect()
[('cat', 1), ('A', 2), ('spoon', 3)]
keys() - returns a list of just the keys in the RDD without any values.
rddMapVals.keys().collect()
['cat', 'A', 'spoon']
values() - returns a list of just the values in the RDD without any keys.
rddMapVals.values().collect()
[1, 2, 3]
sortByKey() - returns a Pair RDD sorted by its keys.
rddMapVals.sortByKey().collect()
[('A', 2), ('cat', 1), ('spoon', 3)]
NOTE: These functions will not work as expected unless a Pair RDD has been created first.
55
lookup() - returns the value(s) stored for a specified key.
keyByRdd.lookup(2)
[(7, 8)]
countByKey() - returns a count of the number of times each key appears in the RDD (in our
example, there were no duplicate keys, so each is returned as 1).
keyByRdd.countByKey()
defaultdict(<type 'int'>,{2: 1, 3: 1})
collectAsMap() - collects the result as a map. If multiple values exist for the same key only
one will be returned.
keyByRdd.collectAsMap()
{2: (7, 8), 3: (1, 2, 3)}
Note that these actions did not require us to also specify collect() in order to view the results.
56
reduceByKey()
The reduceByKey function performs a reduce operation on all elements of a key/value pair RDD that
share a key. For our example here, we'll return to kvRdd that was created using the following code:
rddMary = sc.textFile("filelocation/mary.txt")
rddFlat = rddMary.flatMap(lambda line: line.split(' '))
kvRdd = rddFlat.map(lambda word: (word,1))
As an example of reduceByKey(), take a look at the following code:
kvReduced = kvRdd.reduceByKey(lambda a,b: a+b)
The reduceByKey function goes through the elements and, if it sees a key it hasn't already
encountered, it adds that key to the output and records its value as-is. If a duplicate key is found, reduceByKey
applies a function to the two values so that only one entry remains for that key. In our example, then, the
anonymous function "lambda a,b: a+b" only kicks in if a duplicate key is found. If so, the
anonymous function tells reduceByKey to take the two values (a and b) and add them
together to compute a new value for the now-reduced key. The actual function being performed is up
to the developer, but incrementally adding values is a fairly common task.
Note that the keys Mary, was, and lamb have been reduced
Visually, then, what is happening is the elements of the RDD are being recorded and passed to the new
kvReduced RDD, with the exception of two keys - 'Mary' and 'lamb' - which were both found twice. All
other values remain unchanged, but now 'Mary' and 'lamb' are each reduced to a single key with a value
of 2.
57
groupByKey()
Grouping values by key allows us to aggregate values based on a key. In order to see this grouping, the
results must be turned into a list before being collected.
For example, let's again use our kvRdd example created with the following code:
rddMary = sc.textFile("filelocation/mary.txt")
rddFlat = rddMary.flatMap(lambda line: line.split(' '))
kvRdd = rddFlat.map(lambda word: (word,1))
Next, we will use groupByKey to group all values that have the same key into an iterable object (which,
on its own, cannot be viewed directly) and then use a map function to convert these grouped elements into
a readable list:
kvGroupByKey = kvRdd.groupByKey().map(lambda x : (x[0], list(x[1])))
kvGroupByKey.collect()
[(u'a', [1]), (u'lamb', [1, 1]),(u'little', [1]),(u'Mary',[1, 1])]
If we had simply generated output using groupByKey alone, as below:
kvGroupByKey = kvRdd.groupByKey()
when we ran the code to collect the results, the output would have looked something like this:
[(u'a', <pyspark.resultiterable.ResultIterable object at 0xde8450>), (u'lamb',
<pyspark.resultiterable.ResultIterable object at 0xde8490>),(u'Mary',
<pyspark.resultiterable.ResultIterable object at 0xde8960>)]
This tells you that the results are an object which allows iteration, but does not display the individual
elements by default. Using map to list the elements performed the necessary iteration to be able to see
the desired formatted results.
NOTE: The groupByKey and reduceByKey functions have significant overlap and similar capabilities,
depending on how the called function is defined by the developer. When either is able to get the
desired output, it is better to use reduceByKey() as it is more efficient over large datasets.
flatMapValues()
Like the mapValues function, the flatMapValues function performs a function on Pair RDD values,
leaving the keys unchanged. However, in the event it encounters a key that has multiple values, it
flattens those into individual key-value pairs, meaning no key will have more than one value, but you
will end up with duplicate keys in the RDD. Let's start with the RDD we created in the groupByKey()
example:
kvGroupByKey = kvRdd.groupByKey().map(lambda x : (x[0], list(x[1])))
kvGroupByKey.collect()
[(u'a', [1]), (u'lamb', [1, 1]),(u'little', [1]),(u'Mary',[1, 1])]
Notice that both the 'lamb' and 'Mary' keys contain multiple values in a list. Next, let's create
an RDD that flattens those key-value pairs using the flatMapValues function, as sketched below:
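The original example appears as a screenshot; the following is a minimal sketch, assuming the kvGroupByKey RDD from above. The identity lambda simply emits each value in each list as its own key-value pair, so the output will resemble:
kvFlatVals = kvGroupByKey.flatMapValues(lambda values: values)
kvFlatVals.collect()
[(u'a', 1), (u'lamb', 1), (u'lamb', 1), (u'little', 1), (u'Mary', 1), (u'Mary', 1)]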
58
subtractByKey()
The subtractByKey function will return key-value pairs containing keys not found in another RDD.
This can be useful when you need to identify differences between keys in two RDDs. Here is an
example:
zipWIRdd = sc.parallelize([("cat", 0), ("A", 1), ("spoon", 2)])
rddSong = sc.parallelize([("cat", 7), ("cradle", 9), ("spoon", 4)])
rddSong.subtractByKey(zipWIRdd).collect()
[('cradle', 9)]
The key-value pair that had a key of 'cradle' was the only one returned because both RDDs
contained key-value pairs that had key values of 'cat' and 'spoon'.
Note that ('A', 1) was not returned as part of the result. This is because subtractByKey() only
evaluates the keys of the first RDD (the one that precedes the function) against those of the second
one; it does not return all keys in either RDD that are unique to that RDD. If you wanted to get a list of
all unique keys for both RDDs using subtractByKey(), you would need to run the operation twice: once as shown, and then again with the two RDDs swapped in the last line of code. For example,
this code would return the unique key values for zipWIRdd:
zipWIRdd.subtractByKey(rddSong).collect()
[('A', 1)]
If needed, you could store these outputs in two other RDDs, then use another function to combine
them into a single list as desired.
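For instance, here is a minimal sketch that uses union() (one possible combining function) to gather the keys unique to each RDD into a single output:
rddUniqueSong = rddSong.subtractByKey(zipWIRdd)
rddUniqueZip = zipWIRdd.subtractByKey(rddSong)
rddUniqueSong.union(rddUniqueZip).collect()
[('cradle', 9), ('A', 1)]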
59
Pair RDDs
From there, simply click on the appropriate programming language to view the documentation.
60
Knowledge Check
You may use these questions as a self-assessment check.
Questions
1 ) An RDD that contains elements made up of key-value pairs is sometimes referred to as a
_________________.
2 ) Name two functions that can be used to create a Pair RDD.
3 ) True or False: A key can have a value that is actually a list of many values.
4 ) Since sortByKey() only sorts by key, and there is no equivalent function to sort by values,
how could you go about getting your Pair RDD sorted alphanumerically by value?
5 ) You determine either reduceByKey() or groupByKey() could be used in your program to get
the same results. Which one should you choose?
6 ) How can you use subtractByKey() to determine *all* of the unique keys across two RDDs?
61
Answers
1 ) An RDD that contains elements made up of key-value pairs is sometimes referred to as a
_________________.
Answer: Pair RDD
2 ) Name two functions that can be used to create a Pair RDD.
Answer: Any two of map(), keyBy(), zipWithIndex(), and zip()
3 ) True or False: A key can have a value that is actually a list of many values.
Answer: True
4 ) Since sortByKey() only sorts by key, and there is no equivalent function to sort by values,
how could you go about getting your Pair RDD sorted alphanumerically by value?
Answer: First use map() to reorder the key-value pair so that the key is now the value. Then
use sortByKey() to sort. Finally, use map() again to swap the keys and values back to their
original positions.
5 ) You determine either reduceByKey() or groupByKey() could be used in your program to get
the same results. Which one should you choose?
Answer: reduceByKey(), because it is more efficient - especially on large datasets.
6 ) How can you use subtractByKey() to determine *all* of the unique keys across two RDDs?
Answer: Run it twice, switching the order of the RDDs each time.
62
Summary
Common functions used to create Pair RDDs include map(), keyBy(), zipWithIndex(), and
zip()
Common functions used with Pair RDDs include mapValues(), keys(), values(),
sortByKey(), lookup(), countByKey(), collectAsMap(), reduceByKey(),
groupByKey(), flatMapValues(), subtractByKey(), and various join types.
63
Spark Streaming
Lesson Objectives
After completing this lesson, students should be able to:
Spark Streaming
65
DStreams
A DStream is a collection of one or more specialized RDDs divided into discrete chunks based on time
interval. When a streaming source communicates with Spark Streaming, the receiver caches
information for a specified time period, after which point the data is converted into a DStream and
available for further processing. Each discrete time period (in the example pictured, every five
seconds) is a separate DStream.
DStreams
66
DStream Replication
DStreams are fault tolerant, meaning they are written to a minimum of two executors at the moment of
creation. The loss of a single executor will not result in the loss of the DStream.
Receiver Availability
By default, receivers are highly available. If the executor running the receiver goes down, the receiver
will be immediately restarted in another executor.
67
As mentioned earlier, Spark Streaming performs micro-batching rather than true bit-by-bit streaming.
Collecting and processing data in batches can be more efficient in terms of resource utilization, but
comes at a cost of latency and risk of small amounts of lost data. Spark Streaming can be configured
to process batch sizes as small as one second, which takes approximately another second to process,
for a two-second delay from the moment the data is received until a response can be generated. This
introduces a small risk of data loss, which can be mitigated by the use of reliable receivers (available in
the Scala and Java APIs only at the time of this writing) and intelligent data sources.
Receiver Reliability
By default, receivers are "unreliable." This means there is no acknowledgement sent back to the data
source when data is received, so any data buffered by a receiver that fails before processing completes can be lost.
To implement a reliable receiver, a custom receiver must be created. A reliable receiver implements a
handshake mechanism that acknowledges that data has been received and processed. Assuming the
data source is also intelligent, it will wait to discard the data on the other side until this
acknowledgement has been received, which also means it can retransmit it in the event of data loss.
Custom receivers are available in the Scala and Java Spark Streaming APIs only, and are not available
in Python. For more information on creating and implementing custom / reliable receivers, please refer
to Spark Streaming documentation.
68
Basic Streaming
Next we will examine the basics of creating a data stream using Spark Streaming.
StreamingContext
Spark Streaming extends the Spark Core architecture model by layering in a StreamingContext on top
of the SparkContext. The StreamingContext acts as the entry point for streaming applications. It is
responsible for setting up the receiver and enables real-time transformations on DStreams. It also
produces various types of output.
The StreamingContext
Launch StreamingContext
To launch the StreamingContext, you first need to import the StreamingContext API. In pyspark,
the code to perform this operation would be:
from pyspark.streaming import StreamingContext
Next, you create an instance of the StreamingContext. When doing so, you supply the name of the
SparkContext in use, as well as the time interval (in seconds) for the receiver to collect data for
micro-batch processing. When using a REPL, the SparkContext will be named sc by default.
69
For example, when creating an instance of StreamingContext named ssc, in the pyspark REPL,
with a micro-batch interval of one second, you would use the following code:
ssc = StreamingContext(sc, 1)
NOTE: This operation will return an error if the StreamingContext API has not been imported.
Both the name of the StreamingContext instance and the time interval can be modified to fit your
purposes. Here's an example of creating a StreamingContext instance with a 10-second micro-batch interval:
sscTen = StreamingContext(sc, 10)
It is important to note that while multiple instances of StreamingContext can be defined, only a
single instance can be active per JVM. Once one is running, another instance will fail to launch.
In fact, once the current instance has been stopped, it cannot be launched again in the same JVM.
Thus, while the REPL is a useful tool for learning and perhaps testing Spark Streaming applications, in
production it would be problematic because every time a developer wanted to test a slightly different
application, it would require stopping and restarting the REPL itself.
To print the contents of each DStream to the console, use:
Python: DSvariableName.pprint()
Scala/Java: DSvariableName.print()
When printing output to the console, we suggest setting the log level for the SparkContext to
"ERROR" in order to reduce screen output. Otherwise, lots of information besides what is being
streamed will appear and clutter up the screen. To do this, use the SparkContext setLogLevel
function as follows:
sc.setLogLevel("ERROR")
70
To save the output as a time-stamped text file on HDFS, use the saveAsTextFiles function. Make
sure the application has write permissions to the HDFS directory selected. The syntax for this
operation is:
DSVariable.saveAsTextFiles("HDFSlocation/prefix", "optionalSuffix")
The output will be generated on a per-DStream basis, with whatever you select as the prefix
prepending the rest of the file name, which will be "-<timestamp>." In addition, you can choose to
add a suffix to each file as well, which will appear at the end of the file. Prefixes and suffixes are useful,
particularly if multiple data streams are being written to the same directory.
It is also noteworthy that these outputs are not exclusive. You can choose to output to the console
*and* to HDFS without issue.
In a REPL, the streaming application will also stop if the terminal it is running in is closed.
71
DStream Transformations
Transformations allow modification of DStream data to create new DStreams with different output.
DStream transformations are similar in nature and scope to traditional RDD transformations. In fact,
many of the same functions in Spark Core also exist in Spark Streaming. The following functions
should look familiar:
map()
flatMap()
filter()
repartition()
union()
count()
reduceByKey()
join()
72
reduceByKey()
And again, like traditional RDDs, key-value pair DStreams can be reduced using the reduceByKey
function. Here's an example:
# pyspark --master local[2]
>>> sc.setLogLevel("ERROR")
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)
>>> hdfsInputDS = ssc.textFileStream("someHDFSdirectory")
>>> kvPairDS = hdfsInputDS.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
>>> kvReduced = kvPairDS.reduceByKey(lambda a,b: a+b)
>>> kvReduced.pprint()
>>> ssc.start()
This coding pattern is often used to write word count applications.
73
Window Transformations
Finally for Spark Streaming, we will discuss "stateful" (vs. "stateless") operations, and specifically look
at window transformations.
Checkpointing
Checkpointing is used in stateful streaming operations to maintain state in the event of system failure.
To enable checkpointing, you can simply specify an HDFS directory to write checkpoint data to using
the checkpoint function. For example:
ssc.checkpoint("someHDFSdirectory")
Trying to write a stateful application without specifying a checkpoint directory will result in an error
once the application is launched.
74
NOTE: Technically, you could also process a 15-second window in 15-second intervals, however this
is functionally equivalent to setting the StreamingContext interval to 15 seconds and not using the
window function at all.
IMPORTANT: For basic inputs, window() does not work as expected using textFileStream(). An
application will process the first file stream correctly, but then lock up when a second file is added to
the HDFS directory. Because of this, all labs and examples will use the socketTextStream function.
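The window() examples in this lesson appear as screenshots; the following is a minimal sketch, assuming a StreamingContext named ssc with a 5-second batch interval and a TCP source on port 9999 of a host named sandbox (both names are illustrative):
tcpInDS = ssc.socketTextStream("sandbox", 9999)
# keep the last 15 seconds of data, re-evaluated every 5 seconds
windowDS = tcpInDS.window(15, 5)
windowDS.pprint()
ssc.start()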
75
reduceByKeyAndWindow()
You can also work with key-value pair windows, and there are some specialized functions designed to
do just that. One such example is reduceByKeyAndWindow(), which behaves similarly to the
reduceByKey function discussed previously, but over a specified window and collection interval. For
example, take a look at the following application:
# pyspark --master local[2]
>>> sc.setLogLevel("ERROR")
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 1)
>>> ssc.checkpoint("/user/root/test/checkpoint/")
>>> tcpInDS = ssc.socketTextStream("sandbox",9999)
>>> redPrWinDS = tcpInDS.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKeyAndWindow(lambda a,b: a+b, lambda a,b: a-b, 10, 2)
>>> redPrWinDS.pprint()
>>> ssc.start()
To generate the reduced key-value pair, the DStream is transformed using flatMap(), then converted
to a key-value pair using map(). Then, the reduceByKeyAndWindow function is called.
Note that the reduceByKeyAndWindow function actually takes two functions as arguments prior to the
window size and interval arguments. The first argument is the function that should be applied to the
DStream. The second argument is the *inverse* of the first function, and is applied to the data that has
fallen out of the window. The value of each window is thus calculated incrementally as the window
slides across DStreams without having to recompute the data across all of the DStreams in the window
each time.
76
Knowledge Check
Questions
1 ) Name the two new components added to Spark Core to create Spark Streaming.
2 ) If an application will ingest three streams of data, how many CPU cores should it be allocated?
3 ) Name the three basic streaming input types supported by both Python and Scala APIs.
4 ) What two arguments does an instance of StreamingContext require?
5 ) What is the additional prerequisite for any stateful operation?
6 ) What two parameters are required to create a window?
77
Answers
1 ) Name the two new components added to Spark Core to create Spark Streaming.
Answer: Receivers and DStreams. StreamingContext is also an acceptable answer here.
2 ) If an application will ingest three streams of data, how many CPU cores should it be allocated?
Answer: Four - one for each of the three receivers (one per stream), plus at least one core to process the received data.
3 ) Name the three basic streaming input types supported by both Python and Scala APIs.
Answer: HDFS text via directory monitoring, text via TCP socket monitoring, and queues of
RDDs.
4 ) What two arguments does an instance of StreamingContext require?
Answer: The name of the SparkContext and the micro-batch interval.
5 ) What is the additional prerequisite for any stateful operation?
Answer: Checkpointing.
6 ) What two parameters are required to create a window?
Answer: Window duration and collection/sliding interval.
78
Summary
Spark Streaming is an extension of Spark Core that adds the concept of a streaming data
receiver and a specialized type of RDD called a DStream.
Window functions allow operations across multiple time slices of the same DStream, and are
thus stateful and require checkpointing to be enabled.
79
Spark SQL
Lesson Objectives
After completing this lesson, students should be able to:
DataFrames
A DataFrame is data that has been organized into one or more columns, similar in structure to a SQL
table, but that is actually constructed from underlying RDDs. DataFrames can be created directly from
RDDs, as well as from Hive tables and many other outside data sources.
There are three primary methods available to interact with DataFrames and tables in Spark SQL:
The DataFrames API, which is available for Java, Scala, Python, and R developers
The native Spark SQL API, which is composed of a subset of the SQL92 API commands
The HiveQL API. Most of the HiveQL API is supported in Spark SQL.
Hive
Most enterprises that have deployed Hadoop are familiar with Hive. It is the original data warehouse
platform developed for Hadoop. It represents unstructured data stored in HDFS as structured tables
using a metadata overlay managed by Hive's HCatalog, and can interact with those tables via
HiveQL, its SQL-like query language.
Hive is distributed with every major Hadoop distribution. Massive amounts of data are currently
managed by Hive across the globe. Thus, Spark SQL's ability to integrate with Hive and utilize HiveQL
capabilities and syntax provides massive value for the Spark developer.
81
Hive data starts as raw data that has been written to HDFS. Hive has a metadata component that
logically organizes these unstructured data files into rows and columns like a table. The metadata layer
acts as a translator, enabling SQL-like interactions to take place even though the underlying data on
HDFS remains unstructured.
DataFrame Visually
In much the same way, a DataFrame starts out as something else - perhaps an ORC or JSON file,
perhaps a list of values in an RDD, or perhaps a Hive table. Spark SQL has the ability to take these (and
other) data sources and convert them into a DataFrame. As mentioned earlier, DataFrames are actually
RDDs, but are represented logically as rows and columns. In this sense, Spark SQL behaves in a
similar fashion to Hive, only instead of representing files on a disk as tables like Hive does, Spark SQL
represents RDDs in memory as tables.
82
A Spark SQL application uses one of two entry points: the SQLContext or the HiveContext.
In both cases, the default name usually given to the context is sqlContext, but this is at the discretion
of the developer.
Spark SQL Contexts Python
The basic code difference is which context gets imported at the beginning of the program. The options
are:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
or
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
Zeppelin and Spark REPLs both default to using a HiveContext named sqlContext. For most of the
code examples in this lesson, we will follow the same pattern. For Zeppelin, this becomes active when
running code prepended by %sql.
Spark SQL Contexts Scala
To create instances of either SQLContext or HiveContext using Scala you would use:
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
or
import org.apache.spark.sql.hive._
val sqlContext = new HiveContext(sc)
83
Spark SQL uses an optimizer called Catalyst. Catalyst accelerates query performance via an
extensive, extensible catalog of built-in optimizations that goes through a logical process and builds a
series of optimized plans for executing a query. This is followed by an intelligent, cost-based modeling
and plan-selection engine, which then generates the code required to perform the operation.
This provides numerous advantages over core RDD programming. It is simpler to write an SQL
statement to perform an operation on structured data than it is to write a series of filter(),
group(), and other calls. Not only is it simpler, executing queries using Catalyst provides performance
that matches or outperforms equivalent core RDD code nearly 100% of the time. Thus, not only can Spark
SQL make managing and processing structured data easier, it provides performance improvements as
well.
84
In the code, a CSV file is converted to an RDD named eventsFile using sc.textFile. Next, a
schema named Event is created which labels each column and sets its type. Then a new RDD named
eventsRDD is generated which takes the contents of the eventsFile RDD and
transforms the elements according to the Event schema, casting each column and reformatting data as
necessary. The code then counts the rows of eventsRDD - presumably as some kind of verification
that the operation was successfully performed.
The final two steps are performed on a single line of code. First eventsRDD is converted to an
unnamed DataFrame, which is then immediately registered as a temporary table named
enrichedEvents.
IMPORTANT:
Some of the techniques shown here to format a text file for use in a DataFrame are
beyond the scope of this class, but various references exist online on how to
accomplish this.
85
This temporary table can also be converted to a permanent table in Hive. Making the table
part of Hive's managed data has the added benefit of making it available across multiple Spark
SQL contexts.
86
This Row object has an implied schema of two columns - named code and value - and there are two
records for code AA with a value of 150000 and code BB with a value of 80000. Because we do not
need to work with the RDD directly, we immediately convert this collection of Row objects to a
DataFrame using toDF(). We then visually verify that the Row objects were converted to a DataFrame
using show(), and that the schema was applied correctly using printSchema().
The DataFrame is registered as a temporary table named test4, which is then converted to a
permanent Hive table named permab. We then run SHOW TABLES to view the tables available for SQL
to manipulate.
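The code itself appears in the screenshot; the following is a minimal sketch of the same flow, assuming a HiveContext named sqlContext (the CREATE TABLE statement is one possible way to make the permanent copy):
from pyspark.sql import Row
rddRows = sc.parallelize([Row(code="AA", value=150000), Row(code="BB", value=80000)])
dfRows = rddRows.toDF()
dfRows.show()
dfRows.printSchema()
# register the temporary table, then copy it into a permanent Hive table
dfRows.registerTempTable("test4")
sqlContext.sql("CREATE TABLE permab AS SELECT * FROM test4")
sqlContext.sql("SHOW TABLES").show()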
SHOW TABLES returned two Hive tables - permab and permenriched - and the temporary table
test4. However, if we return to our other SQL context and run SHOW TABLES, we see the two Hive
tables, but we do not see test4.
In addition, this SQL context can still see the enrichedevents temporary table, which was not visible
to our other SQL context.
KEY TAKE-AWAY:
If you want to make a table available to all SQL contexts across the cluster and not just
the current SQL context, you must convert it to a permanent Hive table.
87
The code in this screenshot performs a similar operation to the IoT example, but on a smaller scale,
and using Row objects in Scala rather than Python. Some of the key differences between this and the
previous example include the definition of the DataSample schema class prior to creation of the df2
RDD, with a different syntax during creation of the RDD.
As in the previous example, this RDD is immediately converted to a DataFrame, then registered as a
temporary table which is converted to a Hive table.
NOTE:
The only temporary table visible when SHOW TABLES is executed is the temporary
table created by this instance of the SQL context. All of the Hive tables are returned, as
per previous examples.
A key concept when writing multiple applications and utilizing multiple Spark SQL contexts across a
cluster is that registering a temporary table makes it available for either DataFrame API or SQL
interactions while operating in that specific context.
88
However, those DataFrames and tables are only available within the Spark SQL context in which they
were created.
To make a table visible across Spark SQL contexts, you should store that table permanently in Hive,
which makes it available to any HiveContext instance across the cluster.
toDF():
dataframeX = rddName.toDF()
createDataFrame():
dataframeX = sqlContext.createDataFrame(rddName)
If an RDD is properly formatted but lacks a schema, createDataFrame() can also supply column names and
infer the schema on DataFrame creation:
rddName = sc.parallelize([("AA", 150000), ("BB", 80000)])
dataframeX = sqlContext.createDataFrame(rddName, ["code", "value"])
89
90
91
This creates a table named table1hive in Hive, copying all of the contents from temporary table
table1. The following screenshot demonstrates that table1hive is now registered as a permanent
table.
92
read()
write()
NOTE:
The contents are saved into an HDFS folder with the specified name, but in various
parts rather than as a single reusable file. This collection of files can be recovered
and converted back to a single file using the hdfs dfs -getmerge command.
NOTE:
in order to keep the file in one piece during the -put command, you must specify the
file name in HDFS. If you simply put the file, HDFS will once again break it into parts.
93
If you look at the screenshot provided, you may notice that the JSON format in use is not the typical
JSON format.
WARNING:
This row-based JSON formatting (one complete JSON object per line) is a requirement for working with
JSON files that you intend to convert to DataFrames. Attempting to create a DataFrame from a standard
multi-line JSON file will result in an error.
Save Modes
By default, if a write() is used and the file already exists, an error will be returned. This is because of
the default behavior of save modes. However, this default can be modified. The possible values for
save mode when writing a file are:
error - the default; the write fails if data already exists at the target location
append - new contents are appended to any data that already exists
overwrite - existing data is replaced by the new contents
ignore - the write is silently skipped if data already exists
For example, if you were using the write() command from before to save an ORC file and you
wanted the data to be overwritten / replaced if a file by the same name already existed, you would use
the following code:
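The original code appears as a screenshot; a minimal sketch, assuming a DataFrame named dataframeX and an output path of dfsamp.orc (both illustrative):
dataframeX.write.format("orc").mode("overwrite").save("dfsamp.orc")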
94
The syntax is similar to the write() command used before, only read() is used, and an appropriate
file is loaded. For example, to use the JSON file created earlier to create a DataFrame:
dataframeJSON = sqlContext.read.format("json").load("dfsamp.json")
NOTE:
If you peruse the documentation, you will note that some file formats have read()
shortcuts - for example: read.json instead of read.format("json"). We do not
demonstrate them in class because they are not consistent across all supported file
types, however if a developer works primarily with JSON files on a regular basis, using
the read.json shortcut may be beneficial.
95
96
The show() function displays the contents of a DataFrame or the output of a SQL command run within
sqlContext.sql(), and is also required for on-screen display of several other functions that will be
discussed.
The printSchema() function displays the schema for a DataFrame.
The withColumn() function returns a new DataFrame with a new column based on criteria you
provide. In the example, a new column named multiplied was created, and the numbers in that column
are the numbers in the value column multiplied by two.
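The screenshots are not reproduced here; a minimal sketch of these three functions, assuming a DataFrame named dataframeX with code and value columns as in the earlier examples:
dataframeX.show()
dataframeX.printSchema()
dataframeX.withColumn("multiplied", dataframeX["value"] * 2).show()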
withColumnRenamed() and select()
97
The filter() function returns a DataFrame containing only the rows whose column values meet
defined criteria - in the screenshot, only rows with values less than 100,000 were returned.
The limit() function returns a DataFrame with a defined number of rows - in the example, only the
first row was returned.
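A minimal sketch of these two functions, again assuming the illustrative dataframeX DataFrame with a value column:
dataframeX.filter(dataframeX["value"] < 100000).show()
dataframeX.limit(1).show()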
98
99
The drop() function returns a DataFrame without specific columns included. Think of it as the
opposite of the select() function.
The groupBy() function groups rows by matching column values, and can then perform other
functions on the combined rows such as count() or agg().
The screenshot shows two examples. In the first one, the code column is grouped and the number of
matching values are counted and displayed in a separate column. In the second one, the values
column is scanned for matching values, and then the sums of the identical values are displayed in a
separate column.
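A minimal sketch of drop() and the two groupBy() examples described above, assuming the same illustrative dataframeX:
dataframeX.drop("code").show()
dataframeX.groupBy("code").count().show()
dataframeX.groupBy("value").sum("value").show()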
count(), take(), and head()
100
The count() function returns the number of rows in the DataFrame.
The take() function returns a specified number of rows from the DataFrame as Row objects.
The head() function returns the first row in the DataFrame as a Row object.
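A minimal sketch, assuming the same illustrative dataframeX; the print() calls reflect the Zeppelin behavior described in the note that follows:
print(dataframeX.count())
print(dataframeX.take(2))
print(dataframeX.head())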
IMPORTANT:
In the screenshot provided, these functions are shown prepended with print.
However, the print command is not required for Scala, nor is it required for Python
when using the REPL.
Technically, these functions should probably not have required the print function in
order to produce output either, but via trial and error testing we discovered that they
worked when print was supplied in Zeppelin. In addition, at the time of this writing, a
handful of pyspark functions did not operate correctly *at all* when run inside Zeppelin,
even in conjunction with the print command.
Some examples include first(), collect(), and columns(). This is likely the result
of a bug in the version of Zeppelin used to write this course material and may no
longer be the case by the time you are reading this.
For additional DataFrames API functions, please refer to the online Apache Spark SQL
DataFrames API documentation. Testing these pyspark functions without the print
command will likely result in success in future implementations.
101
102
Knowledge Check
Questions
1 ) While core RDD programming is used with [structured/unstructured/both] data, Spark SQL is
used with [structured/unstructured/both] data.
2 ) True or False: Spark SQL is an extra layer of translation over RDDs. Therefore while it may be
easier to use, core RDD programs will generally see better performance.
3 ) True or False: A HiveContext can do everything that an SQLContext can do, but provides
more functionality and flexibility.
4 ) True or False: Once a DataFrame is registered as a temporary table, it is available to any
running sqlContext in the cluster.
5 ) Hive tables are stored [in memory/on disk].
6 ) Name two functions that can convert an RDD to a DataFrame.
7 ) Name two file formats that Spark SQL can use without modification to
create DataFrames.
103
Answers
1 ) While core RDD programming is used with [structured/unstructured/both] data, Spark SQL is
used with [structured/unstructured/both] data.
Answer: Both / Structured
2 ) True or False: Spark SQL is an extra layer of translation over RDDs. Therefore while it may be
easier to use, core RDD programs will generally see better performance.
Answer: False. The Catalyst optimizer means Spark SQL programs will generally outperform
core RDD programs
3 ) True or False: A HiveContext can do everything that a SQLContext can do, but provides
more functionality and flexibility.
Answer: True
4 ) True or False: Once a DataFrame is registered as a temporary table, it is available to any
running sqlContext in the cluster.
Answer: False. Temporary tables are only visible to the context that created them.
5 ) Hive tables are stored [in memory/on disk].
Answer: On Disk
6 ) Name two functions that can convert an RDD to a DataFrame.
Answer: toDF() and createDataFrame()
7 ) Name two file formats that Spark SQL can use without modification to
create DataFrames.
Answer: The ones discussed in class were ORC, JSON, and parquet files.
104
Summary
Spark SQL gives developers the ability to utilize Spark's in-memory processing capabilities on
structured data
Spark SQL integrates with Hive via the HiveContext, which broadens SQL capabilities and
allows Spark to use Hive HCatalog for table management
DataFrames are RDDs that are represented as table objects which can be used to create tables
for SQL interactions
DataFrames can be created from and saved as files such as ORC, JSON, and parquet
Because of Catalyst optimizations of SQL queries, SQL programming operations will generally
outperform core RDD programming operations
105
Data Visualizations
Add manipulation tools to those published paragraphs so users without any code knowledge
can be granted the ability to manipulate those results in real time for various purposes
107
Because of Zeppelin's direct integration with Spark, its flexibility in terms of supported languages, and
its collaboration and reporting capabilities, the rest of this lesson will show a developer how to use this
tool for greatest effect.
Keep in mind, however, that Zeppelin also supports HTML and JavaScript, and can also work with
other data visualization libraries available to Python, Java, and other languages. If Zeppelin's built-in
capabilities don't quite meet your needs, you always have the ability to expand on them.
Bar chart
Pie chart
Area chart
Line chart
Scatter chart
108
Bar Chart
Pie Chart
Area Chart
109
Line Chart
110
Visualizations on DataFrames
Zeppelin can also provide visualizations on DataFrames that have not been converted to SQL tables by
using the following command:
z.show(DataFrameName)
This command tells Zeppelin to treat the DataFrame like a table for visualization purposes. Since a
DataFrame is already formatted like a table, the command should work without issue on every
DataFrame.
111
Zeppelin then displays the content as a table, with supporting data visualizations available as below.
If the data is not formatted correctly, Zeppelin would simply return the string as a table name with no
data.
For example, in this screenshot, the SQL command displays a visualization for all columns and rows.
However, in this second screenshot, the query was updated to only include rows with a value in the
age column that exceeded 45.
112
This provides you with an ability to manipulate the chart output in a number of ways without requiring
you to modify the initial query. In our example, we see that the default chart uses age as a key, and
sums the balances for all persons of a given age as the value.
You'll note that this was done automatically, without any grouping or sum command as part of the SQL
statement itself.
113
The pivot chart feature allows you to change the action performed on the Values column selected.
Click on the box (in the screenshot, the one that says balance) and a drop-down menu of options
appear which can be used to change the default value action. Options include SUM, AVG, COUNT,
MIN, and MAX.
Pivot Chart Change Values or Keys
Any column in the table can be set as a Key or Value.
To remove a column, click the "x" to the top right of the name box and it will disappear.
114
If either the Key or Value field is blank, the output indicates that there is no data available.
Then you simply drag and drop the field you want as a value into the appropriate box and the output
refreshes to match.
In this example, we elected to use the age column for both Keys and Values, and used the COUNT
feature to count the number of individuals in each age category.
115
In this example, the marital column was defined as a grouping, and therefore every unique value in that
column (married, single, or divorced) became its own bar color in the bar chart.
116
Dynamic Forms
Dynamic Forms give you the ability to define a variable in the query or command and allow that value
to be dynamically set via a form that appears above the output chart. These can be done in various
programming languages. For SQL, you would use a WHERE clause and then specify the column name,
some mathematical operator, and then a variable indicated by a dollar sign with the form name and the
default value specified inside a pair of curly braces.
SELECT * FROM table WHERE colName [mathOp] ${LabelName=DefValue}
In the following screenshot, we select the age column, where age is greater than or equal to the
variable value, label the column Minimum Age and set the default value to 0 so that all values will
appear by default.
117
Then, in the resulting dynamic form, we set the minimum age to 45 and press enter, which results in
the chart updating to reflect a minimum age of 45 in the output.
Multiple Variables
Multiple variables can be included as dynamic forms.
In this example, the WHERE clause has been extended with an AND operator, so both a minimum age of
0 and a maximum age of 100 are set as defaults. The user then sets the minimum age value to 30 and
the maximum age to 55 and presses enter, resulting in the underlying output changing to meet those
criteria.
118
Select Lists
Dynamic forms can also include select lists (a.k.a. drop-down menus). The syntax for a select list within
a WHERE clause would be:
... WHERE colName = "${LabelName=defaultLabel,opt1|opt2|opt3}"
In the example shown, the marital column is specified, a variable created, and within the variable
definition we specify the default value for marital = married.
Then, insert a comma, and provide the complete list of options you wish to provide separated by the |
(pipe) character - in our case, married, single, and divorced.
This results in a new dynamic form, and the output will respond to changes in the drop-down menu.
119
120
Clone
Makes a copy of the note in your Zeppelin notebook. You can clone a note by clicking on the
button labeled "Clone the notebook" (when you hover over it with your mouse pointer) at the
top of the note.
Export
Downloads a copy of the note to the local file system in JSON format. You can export a note by
clicking on the button labeled "Export the notebook" at the top of the note.
Importing Notes
Exporting a note also gives you the ability to share that file with another developer, which they can
then import into their own notebook from the Zeppelin landing page by clicking on "Import note."
Note Cleanup
Often note development will be a series of trial and error approaches, comparing methods to pick the
best alternative. This can result in a notebook that contains paragraphs that you don't want to keep, or
don't want distributed to others for sake of clarity. Fortunately cleaning up a note prior to distribution is
relatively easy.
Individual paragraphs that are no longer needed can be removed/deleted from the note. In the
paragraph, click on the settings button (gear icon) and select remove to delete it.
Paragraphs can also be moved up or down in the note and new paragraphs can be inserted (for
example, to add comments in Markdown format describing the flow of the note).
121
122
Formatting Notes
Note owners can control all paragraphs at the note level, via a set of buttons at the top of the note.
These controls include:
Hide/Show all code via the button labeled "Show/hide the code,"
Hide/Show all output via the button labeled "Show/hide the output" (which changes from an
open book to a closed book icon based on the current setting), and
Clear all output via the eraser icon button labeled "Clear output."
The Simple view removes the note-level controls at the top of the note
The Report view removes all note-level controls, as well as hides all code in the note, resulting
in a series of outputs
These views can be selected by clicking the button labeled "default" at the top-right corner of the note
and then choosing the appropriate option from the resulting drop-down menu.
123
This operation can be scheduled to run on a regular basis using the scheduling feature, which is
enabled by clicking the clock icon button labeled "Run scheduler." This allows you to schedule the
note to run at regular intervals including every minute, every five minutes, every hour, and so on up to
every 24 hours via preset links that can simply be clicked to activate. If these options are not granular
enough for you, you can also schedule the note at a custom interval by supplying a Cron expression.
Paragraph Formatting
Paragraphs can also be formatted prior to distribution on an individual basis. These settings are
available in the buttons menu at the top right of each paragraph, as well as underneath the settings
menu (gear icon) button.
Formatting options that were also available at the note level include: Hide/Show paragraph code,
Hide/Show paragraph output, and Clear paragraph output (only available under settings).
124
Paragraph Enhancements
The visual appearance of paragraphs can be enhanced to support various collaboration goals. Such
enhancements include:
Width
Example:
Let's assume you want to create a dashboard within a Zeppelin note, showing multiple views of the
same data on the same line.
This can be accomplished by modifying the Width setting, found in paragraph settings. By default, the
maximum width is used per paragraph, however, this can be modified so that two or more paragraphs
will appear on the same line.
125
Show Title
Paragraphs can be given titles for added clarity when viewing output. To set a title, select Show title
under paragraph settings. The default title is "Untitled."
Click on the title to change it, type the new title, and press the Enter key to set it.
Line Numbers
Paragraphs displaying code can also be enhanced by showing the line numbers for each line of code.
126
To turn on this feature, select Show line numbers under paragraph settings.
The numbers will appear to the left of the code lines. Lines that are wrapped based on the width of the
paragraph will only be given a single number, even though on the screen they will appear as multiple
lines.
Sharing Paragraphs
Individual paragraphs can be shared by generating a link, which can be used as an iframe or otherwise
embedded in an external-to-Zeppelin report. To generate this URL, select Link this paragraph under
paragraph settings.
This will automatically open the paragraph in a new browser tab, and the URL can be copied and
pasted into whatever report or web page is needed.
127
It is important to note here that if dynamic forms have been enabled for this note, anyone who modifies
the form values will change the appearance of the paragraph output for everyone looking at the link.
This can be a valuable tool if, for example, a marketing department wants to generate multiple outputs
based on slight tweaks to the query. You can allow them to do this without giving them access to the
entire note, and without the need to modify the code on the back end.
128
Any changes to the code, as well as changes to dynamic forms input, will not change the output
presented as long as the Disable run option is selected.
129
130
Knowledge Check
Use the following questions to assess your understanding of the concepts presented in this lesson.
Questions
1 ) What is the value of data visualization?
2 ) How many chart views does Zeppelin provide by default?
3 ) How do you share a copy of your note (non-collaborative) with another developer?
4 ) How do you share your note collaboratively with another developer?
5 ) Which note view provides only paragraph outputs?
6 ) Which paragraph feature provides the ability for an outside person to see a paragraph's output
without having access to the note?
7 ) What paragraph feature allows you to give outside users the ability to modify parameters and
update the displayed output without using code?
131
Answers
1 ) What is the value of data visualization?
Answer: Enable humans to make inferences and draw conclusions about large sets of data
that would be impossible to make by looking at the data in tabular format.
2 ) How many chart views does Zeppelin provide by default?
Answer: Five
3 ) How do you share a copy of your note (non-collaborative) with another developer?
Answer: Export to JSON format, then they can import it.
4 ) How do you share your note collaboratively with another developer?
Answer: Give them the note URL
5 ) Which note view provides only paragraph outputs?
Answer: The Report view.
6 ) Which paragraph feature provides the ability for an outside person to see a paragraph's output
without having access to the note?
Answer: Link the paragraph
7 ) What paragraph feature allows you to give outside users the ability to modify parameters and
update the displayed output without using code?
Answer: Dynamic forms
132
Summary
Data visualizations are important when humans need to draw conclusions about large sets of
data
Zeppelin provides support for a number of built-in data visualizations, and these can be
extended via visualization libraries and other tools like HTML and JavaScript
Zeppelin visualizations can be used for interactive data exploration by modifying queries, as
well as the use of pivot charts and implementation of dynamic forms
Zeppelin notes can be shared via export to a JSON file or by sharing the note URL
Zeppelin provides numerous tools for controlling the appearance of notes and paragraphs
which can assist in communicating important information
Job Monitoring
Lesson Objectives
After completing this lesson, students should be able to:
Explain how Spark parallelizes execution of stages and tasks across CPU cores by default
Spark applications require a Driver, which in turn loads and monitors the SparkContext. The
SparkContext is then responsible for launching and managing Spark jobs. But what do we mean
when we say job?
When you type a line of code to use a Spark function, such as flatMap(), filter(), or map(), you
are defining a Spark task which must be performed. A task is a unit of work, or "a thing to be done."
When you put one or more tasks together with a resulting action task - such as collect() or save()
- you have defined a Spark job. A job, then, is a collection of tasks (or things to be done) culminating in
an action.
NOTE:
Not explicitly called out here, but a Spark application can consist of one or more
Spark jobs. Every action is considered to be part of a unique job, even if the action is
the only task being performed.
Job Stages
A Spark job can be made up of several types of tasks. Some tasks don't require any data to be moved
from one executor to another in order to finish processing. This is referred to as a "narrow" operation,
or one that does not require a data "shuffle" in order to execute. Transformations that do require that
data be moved between executors are called "wide" operations, and that movement of data is called a
shuffle. Both narrow and wide operations will be discussed in far greater detail in the Performance
Tuning section of this class. For now, it is only important to know that some tasks require shuffles, and
some do not.
When executed, Spark will evaluate the tasks to be performed and break up a job at any point where a
shuffle will be required. While non-shuffle operations can happen somewhat asynchronously if needed,
a task that follows a shuffle *must* wait for the shuffle to complete before executing.
This break point, where processing must complete before the next task or set of tasks is executed, is
referred to as a stage. A stage, then, can be thought of as "a logical grouping of tasks" or things to be
done. A shuffle is a task requiring that data between RDD partitions be aggregated (or combined) in
order to produce a desired result.
Tasks are a unit of work, or a thing to be done. Each transformation and action is a separate
task.
Because shuffled data must be complete before a following task can begin, jobs are divided
into stages based on these shuffle boundaries.
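To make the stage boundary concrete, here is a minimal sketch (the input path and logic are illustrative, not from the labs). The narrow operations stay in one stage; the shuffle introduced by reduceByKey() starts a new stage, and collect() triggers the job.
lines = sc.textFile("/tmp/input.txt")            # narrow: data read into partitions
words = lines.flatMap(lambda l: l.split(" "))    # narrow: remains in Stage 1
pairs = words.map(lambda w: (w, 1))              # narrow: remains in Stage 1
counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle: begins Stage 2
counts.collect()                                 # action: defines and triggers the job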
Parallel Execution
Spark jobs are automatically optimized via parallel execution at different levels.
However, not all stages are dependent on one another. For example, in this job Stage 1 has to run first,
but once it has completed there are three other stages (two, four, and seven) that can begin execution.
Operating in this fashion is known as a Directed Acyclic Graph, or DAG. A DAG is essentially a logical
ordering of operations (in the case of our discussion here, Spark stages) based on dependencies.
Since there is no reason for Stage 7 to wait for all of the previous six stages to complete, Spark will go
ahead and execute it immediately after Stage 1 completes, along with Stage 2 and Stage 4.
This parallel operation based on logical dependencies allows, in some cases, for significantly faster job
completion across a cluster compared to platforms that require stages to complete, one at a time, in
order.
The tracking and managing of these stages and their dependencies is managed by a Spark component
known as a DAG Scheduler. It is the DAG scheduler that tells Spark which stages (sets of tasks) to
execute and in what order. The DAG Scheduler ensures that dependencies are met and that any prerequisite stages have completed before the next dependent stage begins executing.
Task Steps
A task is actually a collection of three separate steps. When a task is first scheduled, it must first fetch
the data it will need - either from an outside source, or perhaps from the results of a previous task.
Once the data has been collected, the operation that the task is to do on that data can execute. Finally,
the task produces some kind of output, either as an action, or perhaps as an intermediate step for a
task to follow.
Tasks can begin execution once data has started to be collected. There is no need for the entire set of
data to be loaded prior to performing the task operation. Therefore, execution begins as soon as the
first bits of data are available, and can continue in parallel while the rest of the data is being fetched.
Furthermore, the output production step can begin as soon as the first bits of data have been
transformed, and can theoretically be happening while the operation is being executed *and* while the
rest of the data is being fetched. In this manner, all three steps of a task can be running at the same
time, with the execute phase starting shortly after the fetch begins, and the output phase starting
shortly after the execute phase begins. In terms of completion, the fetch will always complete first, but
the execute can finish shortly thereafter, with the output phase shortly after that.
Spark Application UI
Now that we've explored the anatomy of a Spark job and understand how jobs are executed on the
cluster, let's take a look at monitoring those jobs and their components via the Spark Application UI.
Spark Application UI
The Spark Application UI is a web interface generated by a SparkContext. It is therefore available for
the life of the SparkContext, but once it has been shut down, the Spark Application UI will no longer
be available.
You access the Spark Application UI by default via your Driver node at port 4040. In our lab
environment, then, you would use the address: "http://sandbox:4040".
Every SparkContext instance manages a separate Spark Application UI instance. Therefore, if
multiple SparkContext instances are running on the same system, multiple Spark Application UIs will
be available. Since they cannot share a port, when a SparkContext launches and detects an existing
Spark Application UI, it will generate its own version of the monitoring tool at the next available port
number, incremented by 1. Therefore, if you are running Zeppelin and it has created a Spark
Application UI instance at port 4040, and then you launch an instance of the PySpark REPL in a
terminal on the same machine, the REPL version of the monitoring site will exist at port 4041 instead of
4040. A third SparkContext would create the UI at port 4042, and so on.
Once a SparkContext is exited, that port number becomes available. Therefore, if you exited Zeppelin
(using port 4040) and opened another REPL, it would create its Spark Application UI at port 4040. The
two older SparkContext instances would keep the port numbers they had when they started.
The Spark UI landing page opens up to a list of all of the Spark jobs that have been run by this
SparkContext instance. You can see information about the number of jobs completed, as well as
overview information for each job in terms of ID, description, when it was submitted, how long it took
to execute, how many stages it had and how many of those were successful, and the number of tasks
for all of those stages (and how many were successful.)
Clicking on a job description link will result in a screen providing more detailed information about that
particular job.
NOTE:
The URL - which was typed as "sandbox:4040" - was redirected to port 8088. Port
8088 is the YARN ResourceManager UI, which tracks and manages all YARN jobs. This
means that in this instance, Zeppelin has been configured to run on (and be managed
resource-wise by) YARN.
On the Spark Application UI landing page, you will notice a link called Event Timeline. Clicking on this
link results in a visualization that shows executors being added and removed from the cluster, as well
as jobs being tracked and their current status. Enabling zoom allows you to see more granular detail,
which can be particularly helpful if a large number of jobs have been executed over a long period of
time for this SparkContext instance.
Job View
Clicking on a job description on the Jobs landing page takes you to the "Details for Job XX" page. Here
you can see more specific information about the job stages, including description, when they were
submitted, how long they took to run, how many tasks succeeded out of how many attempted, the size
of the input and output of each stage, and how much data shuffling occurred between stages.
Clicking on the Event timeline once again results in a visualization very similar to the one on the landing
page, but this time for the stages of that particular job instead of for all jobs.
Job DAG
In addition, a new link is available on this screen: DAG Visualization. Clicking this link results in a visual
display of the stages (red outline boxes) and the flow of tasks each contains, as well as the
dependencies between the stages.
Stage View
At the top of the window, to the right of the Jobs tab you will see a tab called Stages. Clicking on this
tab will result in a screen similar to the Jobs landing page, but instead of tracking activity at the job
level it tracks it at the stage level, providing pertinent high-level information about each stage.
Stage Detail
Stage DAG
Clicking on the DAG Visualization link results in a visual display of the operations within the stage in a
DAG formatted view.
DAG Visualization
Clicking on the Show Additional Metrics link allows you to customize the display of information
collected in the table below. Hovering over the metric will result in a brief description of the information
that metric can provide. This can be particularly useful when troubleshooting and determining the root
cause of performance problems for an application.
The Stage Detail page also provides an Event Timeline visualization, which breaks down tasks and
types of tasks performed across the executors it utilized.
Task List
At the bottom of the Stage Detail page is a textual list of the tasks performed, as well as various
information on them including the ID, status, executor and host, and duration.
Executor View
The last standard tab (visible regardless of what kind of jobs have been performed) is the Executor tab.
This shows information about the executors that have been used across all Spark jobs run by this
SparkContext instance, as well as providing links to logs and thread dumps, which can be used for
troubleshooting purposes as well.
SQL View
When you run a Spark job that uses one of the Spark modules, another tab appears at the top of the
window that provides module-specific types of information for those jobs. In this screenshot, we see
that a Spark SQL job has been executed, and that a tab labeled SQL has appeared at the top-right. The
information provided is in terms of queries rather than jobs (although the corresponding Spark job
number is part of the information provided.)
Clicking on a query description will take you to a Details for Query "X" page, which will show a DAG of
the operations performed as part of that query.
SQL Text Query Details
Below the query DAG, there is a Details link which - when clicked - provides a text-based view of the
details of the query.
Streaming Tab
When running Spark Streaming jobs, a Streaming tab will appear in the Spark Application UI. In this
view, we can see details about each job Spark runs, which equates to the collection and processing of a DStream batch.
Streaming View
Clicking on a streaming job description link results in a page that shows streaming statistics charts for
a number of different metrics. Shown on screen here are input rate and scheduling delay. Input Rate is
a link that can be clicked to expand the metrics window and show the rate per receiver, if multiple
receivers are in use. In our example, only one receiver was in use and active.
Additional Streaming Charts
Additional Spark Streaming charts include scheduling delay, processing time, and total delay.
When you need to understand the cause of slowness, the charts on this page can be particularly useful for troubleshooting performance issues with a Spark Streaming job.
Streaming Batches
Beneath the charts is a list of batches, with statistics available about each individual batch and
whether the output operation was successful.
Batch Detail
Clicking on the batch time link results in a batch details page, where additional information may be
found.
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) Spark jobs are divided into _____________, which are logical collections of _______________.
2 ) A job is defined as a set of tasks that culminates in a ________________.
3 ) What Spark component organizes stages into logical groupings that allow for parallel
execution?
4 ) What is the default port used for the Spark Application UI?
5 ) If two SparkContext instances are running, what is the port used for the Spark Application UI of
the second one?
6 ) As discussed in this lesson, what tabs in the Spark Application UI only appear if certain types
of jobs are run?
Answers
1 ) Spark jobs are divided into _____________, which are logical collections of _______________.
Answer: Stages, tasks
2 ) A job is defined as a set of tasks that culminates in a ________________.
Answer: An action
3 ) What Spark component organizes stages into logical groupings that allow for parallel execution?
Answer: The DAG Scheduler
4 ) What is the default port used for the Spark Application UI?
Answer: 4040
5 ) If two SparkContext instances are running, what is the port used for the Spark Application UI of the second one?
Answer: 4041
6 ) As discussed in this lesson, what tabs in the Spark Application UI only appear if certain types of jobs are run?
Answer: The SQL tab (Spark SQL jobs) and the Streaming tab (Spark Streaming jobs)
Summary
Spark applications consist of Spark jobs, which are collections of tasks that culminate in an
action.
Spark jobs are divided into stages, which separate lists of tasks based on shuffle boundaries
and are organized for optimized parallel execution via the DAG Scheduler.
The Spark Application UI provides a view into all jobs run or running for a given SparkContext
instance, including detailed information and statistics appropriate for the application and tasks
being performed.
Performance Tuning
Lesson Objectives
After completing this lesson, students should be able to:
Describe how to repartition RDDs and how this can improve performance
Describe how checkpointing can reduce recovery time in the event of losing an executor
IMPORTANT:
The brackets around sum(x) are required in this example because the input *and*
output of mapPartitions() must be iterable. Without the brackets to keep the
individual partition values separate, the function would attempt to return a number
rather than a list of results, and as such would fail. If the total sum was needed, you
would need to perform an additional operation on rdd2 (from the modification to line
2 above) in order to compute it, such as the operation below:
rdd1.mapPartitions(lambda x: [sum(x)]).reduce(lambda a,b: a+b)
(Or simply rdd1.reduce(lambda a,b: a+b) in the first place, if the goal were only the total and not to demonstrate mapPartitions().)
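For reference, here is a small, self-contained sketch of that pattern; the RDD contents are illustrative and the variable names simply mirror the example above:
rdd1 = sc.parallelize(range(1, 9), 4)                 # eight numbers in four partitions
rdd2 = rdd1.mapPartitions(lambda part: [sum(part)])   # one sum per partition
rdd2.collect()                                        # [3, 7, 11, 15] with an even split
rdd2.reduce(lambda a, b: a + b)                       # grand total: 36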
RDD Parallelism
The cornerstone of Spark performance is the distinction between narrow and wide operations.
How RDDs are partitioned, initially and via explicit changes, can make a significant impact on
performance.
Narrow Dependencies/Operations
Narrow operations can be executed locally and do not depend on any data outside the partition containing the current element.
Examples of narrow operations are map(), flatMap(), union(), and filter().
The picture above depicts examples of how narrow operations work. As visible in the picture above,
there are no interdependencies between partitions.
Transformations maintain the partitioning of the largest parent RDD for the operation. For single parent
RDD transformations, including filter(), flatMap(), and map(), the resulting RDD has the same number of
partitions as the parent RDD.
For combining transformations such as union(), the number of resulting partitions will be equal to the
total number of partitions from the parent RDDs.
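This behavior is easy to observe with getNumPartitions(); a quick sketch, assuming a SparkContext named sc is available:
rdd1 = sc.parallelize(range(100), 4)
rdd1.map(lambda x: x * 2).getNumPartitions()          # 4 - same as the single parent
rdd1.filter(lambda x: x % 2 == 0).getNumPartitions()  # 4 - same as the single parent
rdd2 = sc.parallelize(range(100), 3)
rdd1.union(rdd2).getNumPartitions()                   # 7 - total of both parents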
Wide Dependencies/Operations
Wide operations occur when shuffling of data is required. Examples of wide operations are
reduceByKey(), groupByKey(), repartition(), and join().
Wide Dependencies/Operations
Above is an example of a wide operation. Note that the child partitions are dependent on more than
one parent partition. This should help explain why wide operations separate stages. The child RDD
cannot exist completely unless all the data from the parent partitions have finished processing.
The example in the image shows the four RDD1 partitions reducing to a single RDD2 partition, but in
reality multiple RDD2 partitions would have been generated, each one pulling a different subset of data
from each of the RDD1 partitions. The diagram shows the logical combination rather than a physical
result.
By default, the output of a shuffle-based operation uses the number of partitions present in the parent with
the largest number of partitions. In the diagram above this would have resulted in RDD2 being spread
across four partitions. Again, it was shown as a single partition to help visualize what is happening and
to prevent an overly complicated diagram. The developer can specify the number of partitions the
transformation will use, instead of defaulting to the larger parent. This is shown by passing a
numPartitions as an optional parameter, as shown in the following two versions of the same
operation.
reduceByKey(lambda c1,c2: c1+c2, numPartitions=4)
or simply
reduceByKey(lambda c1,c2: c1+c2, 4)
Controlling Parallelism
The following RDD transformations allow for partition-number changes: distinct(), groupByKey(),
reduceByKey(), aggregateByKey(), sortByKey(), join(), cogroup(), coalesce(), and
repartition().
Generally speaking, the larger the number of partitions, the more parallelization the application can
achieve. There are two operations for explicitly changing the number of partitions without performing any
other transformation: repartition() and coalesce().
A repartition() operation shuffles the entire dataset across the network. A coalesce() only
moves the data needed to merge partitions, avoiding a full shuffle. Coalesce should only be used when reducing the
number of partitions. Examples:
Use repartition() to change the number of partitions to 500:
rdd.repartition(500)
Use coalesce() to reduce the number of partitions to 20:
rdd.coalesce(20)
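Checking the result with getNumPartitions() makes the difference visible; a sketch assuming an RDD created with 100 partitions:
rdd = sc.parallelize(range(1000), 100)
rdd.repartition(500).getNumPartitions()   # 500 - full shuffle across the network
rdd.coalesce(20).getNumPartitions()       # 20 - partitions merged, no full shuffle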
Changing Parallelism During a Transformation
The code below represents a simple application that is summing up population counts by state.
sc.textFile("statePopulations.csv").map(lambda line:
line.split(",")).map(lambda rec: (rec[4],int(rec[5]))).reduceByKey(lambda
c1,c2: c1+c2, numPartitions=2).collect()
The following series of diagrams is an example of what would be going on with the RDD partitions of
data.
The number of partitions defaults to the number of blocks the file takes up on the HDFS. Here, if the
file takes up three blocks on the HDFS it is represented by a three-partition RDD spread across three
worker nodes.
In the first map operation that is splitting the CSV record into attributes, no data needed to be
referenced from another partition to perform the map transformation.
.map(lambda line: line.split(",")) \
The same is true for the next map, which creates a PairRDD of each row's particular state and
population count.
.map(lambda rec: (rec[4],int(rec[5])))
In the reduceByKey transformation that calculates final population totals for each state, there is an
explicit reduction in the number of partitions.
The reduction of partitions is controlled by an optional numPartitions argument that
reduceByKey() takes.
.reduceByKey(lambda c1,c2 : c1+c2, numPartitions=2)
Note also, when doing a reduceByKey(), the same key may be present in multiple partitions that are
output from the map operation. When this happens, the data must be shuffled.
Finally, at the end, the collect() returns all the results from the two partitions in the reduceByKey
operation to the driver.
Changing Parallelism without a Transformation
Changing the level of parallelism is a very common performance optimization task. These diagrams
illustrate the difference between repartition() and coalesce().
Whenever reducing the number of partitions, always use coalesce(), as it minimizes the amount of
network shuffle. A repartition() is required if the developer is going to increase the number of
partitions.
Partitioning Optimization
There is no perfect formula, but the general rule of thumb is that too many partitions is better than too
few. Since each dataset and each use case can be very different from all the others, experimentation
is required to find the optimum number of partitions for each situation.
The more partitions you have, the less time it takes to process each one, but at some point the law of
diminishing returns kicks in. Spark schedules its own tasks within the executors it has available, and
this scheduling takes approximately 10-20 ms per task in most situations. Spark can efficiently run tasks
that complete in as little as 200 ms. With experimentation, you could have tasks run for even less
time than this, but tasks should take at least 100 ms to execute to prevent the system from spending
more time scheduling tasks than executing them.
A simple and novel approach to identifying the best number of partitions is to keep increasing the
number by 50% until performance stops improving. Once that occurs, find the mid-point between the
two partition sizes and execute the application in production there. If anything changes regarding the
executors (number available or sizing characteristics) then the tuning exercise should be executed
again.
It is optimal to have the number of partitions be just slightly smaller than a multiple of the number of
overall executor cores available. This is to ensure that if multiple waves of tasks must be run that all
waves are fully utilizing the available resources. The slight reduction from an actual multiple is to
account for Spark internal activities such as speculative execution (a process that looks for
performance outliers and reruns potentially slow or hung tasks). For example, if there are 10 executors
with two cores (20 cores total), make RDDs with 39, 58, or 78 partitions.
Generally speaking, this level of optimization is for programmers working directly with RDDs. Spark
SQL and its DataFrame API were created to be a higher level of abstraction that lets the developer
focus on what needs to get done as opposed to exactly how those steps should be executed. Spark
SQL's Catalyst optimizer eliminates the need for developers to focus on this level of optimization in the
code.
NOTE:
Spark SQL (as of 1.6.1) still has some outstanding unsupported Hive functionality that
developers should be aware of. These are identified at
http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality.
These shortcomings will likely be addressed in a future release.
The key difference for Spark is how it can perform a series of narrow tasks within a single stage. Since
each partition can be acted upon independently from all other partitions in an RDD or DataFrame, Spark
can perform each task on the data element(s) in such a way that it does not have to write the data back
to persistent storage (such as HDFS). This does not mean that Spark can always fit into memory
completely during these transformations, and in these situations it can fall back to local disk used by
the executors as needed. This is still far faster than having to write this information to HDFS at the
completion of each task.
This concept actually works very well when applications are written so that no previous RDD is
programmatically referenced again. For example, rddA > rddB > rddC > rddD. If a previously created
RDD is referenced later -- for example, rddB is used to create a new rddE -- a consequence arises. In
this case, Spark consults the underlying lineage (recipe), then recreates rddA and rddB so that it can
create rddE. Caching rddB prior to creating rddE could increase overall performance.
A typical example might be an application where a "clean" file, reject file, and summary file are each
created by processing the same original file. Caching the original file prior to performing the
transformations would result in performance improvements.
Caching Syntax
The functions for caching and persisting have the same names for RDDs and DataFrames.
To use caching, we must do a couple of things. The first is to import the library. Here is how to import
the library for Scala and Python:
Scala: import org.apache.spark.storage.StorageLevel._
Python: from pyspark import StorageLevel
Once the libraries are imported, we can then call the persist/cache operations. Here is an example of
persisting our RDD "rdd" with an explicit storage level:
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
The Spark SQL SQLContext object also features a cacheTable(tableName) method for any table
that it knows by name and the complementary uncacheTable(tableName) method. Additional helper
methods isCached(tableName) and clearCache() are also provided.
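A short sketch of the table-level calls, assuming an existing DataFrame named df and a SQLContext named sqlContext (the table name is illustrative):
df.registerTempTable("orders")                          # expose the DataFrame by name
sqlContext.cacheTable("orders")                         # cache the table
sqlContext.sql("SELECT COUNT(*) FROM orders").show()    # served from the cache once materialized
sqlContext.uncacheTable("orders")                       # release this table
sqlContext.clearCache()                                 # or release everything cached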
NOTE:
In Python, cached objects will always be serialized, so it does not matter whether you
choose serialization or not. When we talk about storage levels next, when using
pyspark, MEMORY_ONLY is the same as MEMORY_ONLY_SER.
For RDDs, it is recommended that the persist API be utilized as it requires the developer to be
completely aware of which "storage level" is best for a given dataset and its use case. The ultimate
decision on which storage level to use is based on a few questions centered around serialization and
disk usage.
The first question is if the cached data should live in-memory or on disk. It may not sound like disk is a
great choice, but remember that the RDD could be the result of multiple transformations that would be
costly to reproduce if not cached. Additionally, the executor can very quickly get to data on its local
disk when needed. This storage level is identified as DISK_ONLY.
If in-memory is a better choice, then the next question is around raw versus serialized caching (for
Scala; as previously mentioned, Python automatically serializes cached objects). Regardless of that answer,
the next question to answer is whether the cached data should be rolled onto local disk if it gets evicted from
memory, or simply dropped. There may still be significant value in having this data on local disk
compared to having to recompute the RDD from the beginning.
Raw in-memory caching has an additional option to store the cached data for each partition in two
different cluster nodes. This storage level allows for some additional levels of resiliency should an
executor fail.
The following list identifies the in-memory storage levels beyond DISK_ONLY: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and their replicated variants MEMORY_ONLY_2, MEMORY_AND_DISK_2, MEMORY_ONLY_SER_2, and MEMORY_AND_DISK_SER_2.
NOTE:
There is also an experimental storage level identified as OFF_HEAP which is most
similar to MEMORY_ONLY_SER except that, as the name suggests, the cache is stored
off of the JVM heap.
Again, the number of choices above indicates that some level of testing will be required to find the
optimal storage level. For RDD caching, it is recommended to use persist() and specifically identify
the best storage level possible. For Spark SQL, continue to use cache() on the DataFrame or
cacheTable(tableName) and let the Catalyst optimizer determine the best options.
Caching Example
Here is an example where an RDD is reused more than once:
from pyspark import StorageLevel
ordersRdd = sc.textFile("/orders/received/*")
ordersRdd.persist(StorageLevel.MEMORY_ONLY_SER)
ordersRdd.map().saveAsTextFile("/orders/reports/valid.txt")
ordersRdd.filter().saveAsTextFile("/orders/reports/filtered.txt")
ordersRdd.unpersist()
If the RDD fits in memory, use the default MEMORY_ONLY, as it will be the fastest option for
processing. (Again, in Python coding, serialization is automatic, therefore this setting is
identical to MEMORY_ONLY_SER.)
If RDDs don't fit in memory, try MEMORY_ONLY_SER with a fast serialization library. Doing this
uses more CPU, so use efficient serialization like Kryo (described later).
If the RDDs don't fit into memory even in serialized form, consider the time to compute this
RDD from parent RDDs vs the time to load it from disk. In some cases, re-computing an RDD
may sometimes be faster than reading it from disk. The best way to decide which one is better
is to try them both and see what happens.
Another option is replicated storage. The data is replicated on two nodes, instead of just one.
Replicated storage is good for fast fault recovery, but usually this is overkill, and not a good idea if
you're using a lot of data relative to the total memory of the system.
DataFrame objects are actually more efficient when left to their defaults due to Catalyst optimizations.
Use cache() rather than persist() when working with DataFrames.
Serialization Options
For Scala
For JVM-based languages (Java and Scala), Spark aims to strike a balance between convenience
(allowing you to work with any Java type in your operations) and performance. It provides two
serialization libraries: Java serialization (by default), and Kryo serialization.
Kryo serialization is significantly faster and more compact than the default Java serialization, often as
much as 10x. The reason Kryo wasn't set as the default is that, initially, Kryo didn't support all
serializable types, and the user would have to register the custom classes they use with the Kryo registrar.
These issues have been addressed in recent versions of Spark. Therefore, Kryo serialization should
always be used.
For Python
In Python, applications use the Pickle library for serializing, unless you are working with DataFrames or
tables, in which case the Catalyst optimizer manages serialization. The optimizer converts the code
into Java byte code, which - if left unspecified - will use the Java default serialization. Thus, Python
DataFrame applications will still need to specify the use of Kryo serialization.
Kryo Serialization
To implement Kryo Serialization in your application, include the following in your configuration:
conf = SparkConf()
conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
sc=SparkContext(conf=conf)
This works for DataFrames as well as RDDs since the SQLContext (or HiveContext) are passed the
SparkContext in their constructor method.
If using the pyspark or spark-shell REPLs, add --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer as a command-line
argument to these executables.
Checkpointing
Spark was initially built for long-running, iterative applications. Spark keeps track of an RDD's recipe or
lineage. This provides reliability and resilience, but as the number of transformations performed
increases and the lineage grows, the application can run into problems. The lineage can become too
big for the object allocated to hold everything. When the lineage gets too long, there is a possibility of
a stack overflow.
Also, when a worker node dies, any intermediate data stored on the executor has to be re-computed.
If 500 iterations were already performed, and part of the 500th iteration was lost, the application has to
re-do all 500 iterations. That can take an incredibly long time, and such failures become almost
inevitable given a long enough running application processing a large amount of data.
Spark provides a mechanism to mitigate these issues: checkpointing.
About Checkpointing
When checkpointing is enabled, it does two things: data checkpointing and metadata checkpointing.
(Checkpointing is not yet available for Spark SQL).
Data checkpointing - Saves the generated RDDs to reliable storage. As we saw with Spark
Streaming window transformations, this was a requirement for transformations that combined data
across multiple batches. To avoid unbounded increases in recovery time (proportional to the
dependency chain), intermediate RDDs of extended transformation chains can be periodically
checkpointed to reliable storage (typically HDFS) to shorten the number of dependencies in the event
of failure.
Metadata checkpointing - Saves the information defining the streaming computation to fault-tolerant
storage like HDFS. This is used to recover from failure of the node running the driver of the
streaming application.
Metadata includes:
Configuration - The configuration that was used to create the streaming application.
DStream operations - The set of DStream operations that define the streaming application.
Incomplete batches - Batches whose jobs are queued but have not completed yet.
When a checkpoint is initialized, the lineage tracker is "reset" to the point of the last checkpoint.
When enabling checkpointing, consider the following:
There is a performance expense incurred when pausing to write the checkpoint data, but this is
usually outweighed by the benefits in the event of a failure
Checkpointed data is not automatically deleted from the HDFS. The user needs to manually
clean up the directory when they're positive that data won't be required anymore.
Without checkpointing, potentially thousands of transformations must be repeated if a node is lost
With Checkpointing, only processes performed since the last checkpoint must be repeated
In this example, we have the same application with checkpointing enabled. We can see that every nth
iteration, data is being permanently stored to the HDFS. This may not seem intuitive at first as one
might ask why we should save data to the HDFS when it is not needed. In the case that a worker node
goes down, instead of trying to redo all previous transformations (which again, can number in the
thousands), the data can be retrieved from HDFS and then processing can continue from the point of
the last checkpoint.
This example shows that checkpointing can be viewed as a sort of insurance for events such as this.
Instead of simply hoping a long-running application will finish without any worker failures, the
developer makes a bit of a performance tradeoff up front in choosing to pause and write data to HDFS
from time to time.
Implementing Checkpointing
To implement checkpointing, the developer must specify a location for the checkpoint directory before
using the checkpoint function. Here is an example:
sc.setCheckpointDir("hdfs://somedir/")
rdd = sc.textFile("/path/to/file.txt")
for x in range(1000):                # placeholder for some large number of iterations
    rdd = rdd.map(some_function)     # transformation elided in the original example
    if x % 5 == 0:
        rdd.checkpoint()
rdd.saveAsTextFile("/path/to/output.txt")
This code generates a checkpoint every fifth iteration of the RDD operation.
Broadcast Variables
A broadcast variable is a read-only variable cached once in each executor that can be shared among
tasks. It cannot be modified by the executor. The ideal use case is something more substantial than a
very small list or map, but also not something that could be considered Big Data.
Broadcast variables are implemented as wrappers around collections of simple data types. They are
not intended to wrap around other distributed data structures such as RDDs and DataFrames.
The goal of broadcast variables is to increase performance by not copying a local dataset to each task
that needs it, instead leveraging a single broadcast version of it. This is not a transparent operation in the
codebase - the developer has to specifically leverage the broadcast variable name.
Spark uses concepts from P2P torrenting to efficiently distribute broadcast variables to the nodes and
minimize communication cost. Once a broadcast variable is written to a single executor, that executor
can send the broadcast variable to other executors. This concept reduces the load on the machine
running the driver and allows the executors (aka the peers in the P2P model) to share the burden of
broadcasting the data.
Broadcast variables are lazy and will not receive the broadcast data until needed. The first time a
broadcast variable is read, the node will retrieve and store the data in case it is needed again. Thus,
broadcast variables get sent to each node only once.
Without broadcast variables, reference data (such as lookup tables, lists, or other variables) gets sent
to every task on the executor, even though multiple tasks reuse the same variables. This is what an
application does normally.
Using Broadcast Variables, Spark sends Reference Data to the Node only Once
Using broadcast variables, Spark sends a copy to the node once, then the data is stored in memory.
Each task will reference the local copy of the data. These broadcast variables get stored in the
executor memory overhead portion of the executor.
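As a minimal sketch of the pattern, using a small, hypothetical lookup table of state names:
state_names = {"CA": "California", "NY": "New York"}       # small reference dataset
bc_states = sc.broadcast(state_names)                      # shipped to each executor once
sales = sc.parallelize([("CA", 39), ("NY", 20)])
sales.map(lambda kv: (bc_states.value.get(kv[0]), kv[1])).collect()
# [('California', 39), ('New York', 20)]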
Joining Strategies
While joins can happen on more than two datasets, this discussion will illustrate the use case of only
two datasets which can be extrapolated upon when thinking of more than two datasets being joined.
Additionally, the concepts discussed (unless otherwise called out) relate to RDD and DataFrame
processing, even though the illustrations will often reference RDDs, as DataFrames use their
underlying RDDs for these activities.
Spark performs joins at the partition level. It ensures that each partition from the datasets being joined
aligns with its counterpart from the other dataset.
That means that the join key will always be in the same numbered partition from each dataset being
joined.
For that to happen, both joining datasets need to have the same number of partitions and have used
the same hashing algorithm (described as "Hashed Partitions" earlier in this module) against the same
join key before it can start the actual join processing.
To help explain the intersection of joins and hash partitions, let's look at the worst case situation. We
have two datasets that have different numbers of partitions (two on the left, four on the right) and
which were not partitioned with the same key. Both of the datasets will require a shuffle to occur so
that an equal number of hashed partitions will be created on the join key prior to the join operation
being executed.
NOTE:
The number of hashed partitions created for the JOIN RDD was equal to the number
of partitions from the dataset with the largest number of partitions.
A better scenario would occur if the larger dataset was already hashed partitioned on the join key. In
this situation, only the second dataset would need to be shuffled. As before, more partitions would be
created in the newly created dataset.
Co-Partitioned Datasets
The best situation would occur when the joining datasets already have the same number of hashed
partitions using the same join key. In this situation, no additional shuffling would need to happen. This
is called a co-partition join and is classified as a narrow operation.
Most likely, the aligned partitions will not always be on the same executor, and thus one of them will
need to be moved to the executor where its counterpart is located. While there is some expense to
this, it is far less costly than requiring either of the joined datasets to perform a full shuffle operation.
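A sketch of how a developer might set up a co-partitioned join by hash partitioning both pair RDDs on the join key with the same partition count ahead of time (the data and partition count are illustrative; how much shuffling is actually avoided depends on the API and Spark version):
orders = sc.parallelize([("CA", 100), ("NY", 200)]).partitionBy(8)
states = sc.parallelize([("CA", "California"), ("NY", "New York")]).partitionBy(8)
# Both RDDs now use the same hash partitioner and the same number of partitions,
# which is the co-partitioned situation described above.
orders.join(states).collect()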
Executor Optimization
Executors are highly configurable and are the first place to start when doing optimizations.
Executor Regions
Executors are broken into memory regions. The first is the overhead of the executor, which is almost
always 384MB. The second region is reserved for creating Java objects, which makes up 40% of the
executor. The third and final region is reserved for caching data, which makes up the other 60%.
While these percentages are configurable, it is recommended that initial tuning occur in the overall
configuration of an executor as well as the multiple of how many executors are needed.
When submitting an application, we tell the context how many and what size of resources to request.
To set these at runtime, we use the three following flags:
--executor-memory This is the property that defines how much memory will be allocated to a
particular YARN container that will run a Spark executor.
--executor-cores This is the property that defines how many CPU cores will be allocated to a
particular YARN container that will run a Spark executor.
--num-executors This is the property that defines how many YARN containers are being requested
to run Spark executors within.
Configuring Executors
Deciding how many executors to request and how many resources to give each one can be difficult. Here's a good starting point.
executor-memory - If caching data, it is desirable to have at least twice the dataset size as the total executor memory.
executor-cores - At least two, but a maximum of four should be configured without performing tests to validate that the additional cores are an overall advantage considering all other properties.
num-executors - This is the most flexible, as it is the multiple of the combination of memory and cores that make up an individual executor.
Many variables come into play, including the size of the YARN cluster nodes that will be hosting the
executors. A good starting point would be 16 GB and two cores, as almost all modern Hadoop cluster
configurations would support YARN containers of this size.
If the dataset is 100 GB, it would be ideal to have 100 GB * 2 / 16 GB executors, which is 12.5. For this
application, choosing 12 or 13 executors could be ideal.
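Putting those numbers together, a submission for that hypothetical 100 GB dataset might look like the following (assuming spark-submit in yarn-client mode; the script name is a placeholder):
spark-submit --master yarn-client \
    --num-executors 13 \
    --executor-memory 16G \
    --executor-cores 2 \
    my_application.py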
This section presents the primary configuration switches available to the DevOps team. Any final
fine-tuning is best derived from direct testing results.
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) By default, parallelize() creates a number of RDD partitions based on the number of
___________________.
2 ) By default, textFile() creates a number of RDD partitions based on the number of
______________________.
3 ) Operations that require shuffles are also known as ___________ operations.
4 ) Which function should I use to reduce the number of partitions in an RDD without any data
changes?
5 ) When all identical keys are shuffled to the same partition, this is called a _______________
partition.
6 ) True or False: DataFrames are structured objects, therefore a developer must work
harder to optimize them than when working directly with RDDs.
Answers
1 ) By default, parallelize() creates a number of RDD partitions based on the number of
___________________.
Answer: Executor CPU cores available
2 ) By default, textFile() creates a number of RDD partitions based on the number of
______________________.
Answer: HDFS blocks
3 ) Operations that require shuffles are also known as ___________ operations.
Answer: Wide
4 ) Which function should I use to reduce the number of partitions in an RDD without any data
changes?
Answer: coalesce()
5 ) When all identical keys are shuffled to the same partition, this is called a _______________
partition.
Answer: Hash (or hashed)
6 ) True or False: DataFrames are structured objects, therefore a developer must work
harder to optimize them than when working directly with RDDs.
Answer: False. The Catalyst optimizer does this work for you when working with DataFrames.
Summary
mapPartitions() is similar to map() but operates at the partition instead of element level
Controlling RDD parallelism before performing complex operations can result in significant
performance improvements
Checkpointing writes data to disk every so often, resulting in faster recovery should a system
failure occur
Broadcast variables allow tasks running in an executor to share a single, centralized copy of a
data variable to reduce network traffic and improve performance
Executors are highly customizable, including number, memory, and CPU resources
To import other Spark libraries, it's the same as with any other application. Here is an example of
importing more Spark libraries that are related to DataFrame processing:
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, IntegerType
Developers should be familiar with this concept. Applications are built with dependencies all the time,
and Spark is no exception.
Creating a "main" Program
Zeppelin and the REPLs automatically set up the main program. Here is an example of how to set it up
for a standalone application:
import os
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

if __name__ == "__main__":
    # Spark program goes here
Again, this should look exactly the same as any other application. There is some main part of the
application that gets executed, and the same goes with Spark.
Creating a Spark Configuration
The next thing the developer must do is create the SparkConf configuration object. This configuration
will tell the context some very important information about the application, like the resource manager,
application name, amount of resources to request, etc. The developer can set configurations a couple
ways as described further in a later section of this module. Here is an example of creating the
configuration.
conf = SparkConf().setAppName("AppName").setMaster("yarn-client")  # or "yarn-cluster"
conf.set('spark.executor.instances', '5')
conf.set('configuration', 'value')
The conf.set(configuration,value) lets the developer set any number of configurations. It
is very common to have several of these in the application.
Creating the SparkContext
After creating the Spark Configuration, the developer must create the SparkContext. The
SparkContext will communicate with the cluster, schedule tasks, and request resources, amongst
other things. Once the SparkContext is created, the application will begin interacting with other
Hadoop resources such as YARN's ResourceManager. After creating the SparkContext, we now
have an application that is ready to process distributed data in a parallel manner.
The SparkContext has some configurations that can be set as well. One that we used in the Spark
Streaming labs a lot was setLogLevel("level"), where we set the log level to ERROR:
sc.setLogLevel("ERROR")
Check the API documentation for information about all the setters that can be used on the
SparkContext after it has been created.
At the end of the application, the developer should stop the context. Failing to do so can leave some
ghost processes running which will hold on to resources and may be very difficult to find and kill.
Spark will not throw an error if the stop() method is not utilized, but the best practice is to use it.
Below is an example of creating the SparkContext, setting a configuration, and stopping the
SparkContext. While not including the sc.stop() may have no impact on the developer, the overall
cluster experience will diminish as resources will still be allocated and administrators will start having
to manually kill these processes. This process is not trivial as identifying which processes are related
to this problem amongst all running YARN processes is difficult. Developers should be cognizant of
this multi-tenant nature of most Hadoop clusters and the fact that resources should be freed when no
longer needed. This is easy to do by simply including the necessary call to sc.stop() in all
applications.
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sc.stop()
The first, and probably the one that developers have the most experience with, is "yarn-client." In
yarn-client mode, the driver program is a JVM started on the machine the application is submitted
from. The SparkContext then lives in that JVM. This is the way the REPLs and Zeppelin start a Spark
application. They provide an interactive way to use Spark, so the context must exist where the
developer has access.
The other option, and the one that should be used for production applications, is "yarn-cluster." The
biggest difference between the two is the location of the Spark Driver.
In client mode, the Spark driver exists on the client machine. If something should happen to that client
machine, the application will fail.
In cluster mode, the application puts the Spark driver on the YARN ApplicationMaster process which is
running on a worker node somewhere in the cluster.
One big advantage of this is that even if the client machine that submitted the application to YARN
fails, the Spark application will continue to run.
Despite this, yarn-client mode remains useful for:
Initial development of an application, especially if items are being printed to the screen
Testing applications
In yarn-client mode, the driver and context are running on the client, as seen in the example.
When the application is submitted, the SparkContext reaches out to the resource manager to create
an Application Master. The Application Master is then created, and asks the Resource Manager for the
rest of the resources that were requested in the SparkConf, or from the runtime configurations. After
the Application Master gets confirmation, it contacts the Node Managers to launch the executors. The
SparkContext will then start scheduling tasks for the executors to execute.
In a yarn-cluster submission, the application starts similarly to yarn-client, except that a Spark client is
created. The Spark client is a proxy that communicates with the Resource Manager to create the
Application Master.
The Application Master then hosts the Spark driver and SparkContext. Once this handoff has
occurred, the client machine can go away with no repercussions to the application. The only job the
client had was to start the job and pass the binaries. Once the Application Master is started, it is the
same internal process as during a yarn-client submission. The Application Master talks to the Node
Managers to start the executors. Then the SparkContext, which resides in the Application Master
can start assigning tasks to the executors.
This removes the single point of failure that exists with yarn-client job submissions. The Application
Master needs to be functional, but in yarn-cluster, there is no need for the client after the application
has launched.
In addition, many applications are often submitted from the same machine. The driver program, which
holds the application, requires resources and a JVM. If there are too many applications running on the
client, then the next will have to wait until resources free up, which can create a bottleneck. A yarn-cluster submission moves that resource usage to the cluster.
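For example, a production submission in yarn-cluster mode might look like this (the file name and arguments are placeholders):
spark-submit --master yarn-cluster \
    --name MyApplication \
    my_application.py arg1 arg2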
REFERENCE:
These can be seen in the documentation at spark.apache.org
Because of this, it is best practice to set as few configurations as possible in the application code, with
the exception of some specific configurations. Pass the rest in at runtime or in a configuration file.
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) What components does the developer need to recreate when creating a Spark Application as
opposed to using Zeppelin or a REPL?
2 ) What are the two YARN submission options the developer has?
3 ) What is the difference between the two YARN submission options?
4 ) When making a configuration setting, which location has the highest priority in the event of a
conflict?
5 ) True or False: You should set your Python Spark SQL application to use Kryo serialization
Answers
1 ) What components does the developer need to recreate when creating a Spark Application as
opposed to using Zeppelin or a REPL?
Answer: The developer must import the SparkContext, SparkConf libraries, create the main
program, create a SparkConf and a SparkContext, and stop the SparkContext at the end
of the application
2 ) What are the two YARN submission options the developer has?
Answer: yarn-client and yarn-cluster are the two yarn submission options
3 ) What is the difference between the two YARN submission options?
Answer: The difference between yarn-client and yarn-cluster is where the driver and
SparkContext reside. The driver and context reside on the client in yarn-client, and in the
application master in yarn-cluster.
4 ) When making a configuration setting, which location has the highest priority in the event of a
conflict?
Answer: Settings configured inside the application
5 ) True or False: You should set your Python Spark SQL application to use Kryo serialization
Answer: True. It is used for JVM objects that will be created when using Spark SQL
Summary
A developer must reproduce some of the back-end environment creation that Zeppelin and the
REPLs handle automatically.
The main difference between a yarn-client and yarn-cluster application submission is the
location of the Spark driver and SparkContext.
Use spark-submit, with appropriate configurations, the application file, and necessary
arguments, to submit an application to YARN.
Describe the purpose of machine learning and some common algorithms used in it
Supervised Learning
Supervised learning is the most common type of machine learning. It occurs when a model is created
using one or more variables to make a prediction, and then the accuracy of that prediction can be
immediately tested.
There are two common types of predictions: Classification and Regression.
Classification attempts to answer a discrete question - is the answer yes or no? Will the application
be approved or rejected? Is this email spam or safe to send to the user? "Will the flight depart on
time?" It's either a yes or no answer - if we predict the flight departs early or on time, the answer is yes.
If we predict it will be one minute late or more, the answer is no.
Regression attempts to determine what a value will be given specific information. What will the home
sell for? What should their life insurance rate be? What time is the flight likely to depart? It's an answer
where a specific value is being placed, rather than a simple yes or no is being applied. Therefore, we
might say the flight will depart at 11:35 as our prediction.
Supervised learning starts by randomly breaking a dataset into two parts: training data and testing
data.
Training data is what a machine learning algorithm uses to create a model. It starts with this dataset,
then performs statistical analysis of the effect one or more variables has on the final result. Since the
answers (yes or no for classification, or the exact value - ex: flight departure time) are known, the
algorithm can determine with a high degree of certainty that the weight it applies to a variable is
accurate within the training data.
Once a model that is accurate for the training dataset is built, that model is then applied to the testing
dataset to see how accurate it is when the correct answers are not known ahead of time. The model
will almost never be 100% accurate for testing data, but the better the model is, the better it will be at
accurately predicting results where the answers are not known ahead of time.
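In Spark, this random split is typically a one-liner; a sketch assuming an existing RDD or DataFrame named data:
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
# build the model with train_data, then measure its accuracy against test_data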
Thousands upon Thousands of Data Points are collected and Available Every Day
This is a simple example of what a supervised learning dataset might look like. We have many columns
to choose from when selecting the variables we want to test. There would likely be thousands upon
thousands of data points collected and available, with new information streaming in on a continuous
basis, giving us massive historical data to work from.
Note that this dataset could be used either for regression or classification. Classification would
compare the Sched vs. Actual column and if the Actual value was less than or equal to Sched interpret
it as a yes. If not, it would be a no. For regression, the actual departure time is known.
Terminology
Columns selected for inclusion in the model are called "target variables".
Randomly break data into two parts for training vs. test data - in Spark, extremely large datasets can be used due to the availability of cluster resources.
Run the model against the test data and see how accurately it predicts results - then go back and alter variables, build a new model, and test again until satisfied.
The developer starts with a training dataset, labels and a target variable. The features must be
extracted and then turned into a feature matrix. Given the labels and the feature matrix, a model can
be trained. Once the model is created, as new data comes in, a feature vector must be extracted from
the new data. Then the target variable can be predicted, using the model created.
Let's take a closer look, though, and see why simply using the average value isn't the best approach.
For Model A, two observations are predicted exactly and two are off by 4. For Model B, the model never
predicts a value with 100% accuracy, but its predictions tend to be closer more often. Intuitively,
Model B seems to fit the data better. This should make sense, but how do we quantify it?
The sum of squared errors simply squares the error of each observation (the difference between the
predicted and actual values) and adds those squares together. Squaring applies an increasingly heavy
penalty to observations the further they fall from the predicted value. Thus, the sum of squared errors
for Model A = 0 + 16 + 0 + 16, for a final value of 32. The sum of squared errors for Model B =
1 + 9 + 4 + 4, for a final value of 18. Since the sum of squared errors is lower for Model B, we can
determine that it is the better model. Thus, we can both intuitively and mathematically determine that
Model B is a better predictor than Model A.
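The arithmetic is simple enough to check in plain Python. The observation and prediction values below
are invented solely to reproduce the squared errors quoted above (0, 16, 0, 16 and 1, 9, 4, 4):

# (actual, predicted) pairs for the four observations of each model.
model_a = [(10, 10), (14, 10), (10, 10), (6, 10)]
model_b = [(10, 11), (14, 11), (10, 12), (6, 8)]

def sum_squared_errors(pairs):
    return sum((actual - predicted) ** 2 for actual, predicted in pairs)

print(sum_squared_errors(model_a))   # 0 + 16 + 0 + 16 = 32
print(sum_squared_errors(model_b))   # 1 + 9 + 4 + 4  = 18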
Decision Tree Algorithm
One commonly used classification algorithm is the Decision Tree algorithm. In essence, a decision tree
uses a selected variable to determine the probability of an outcome, and then, treating that variable
and probability as known, selects another variable and does the same thing. This continues through
the dataset, with the variables and their order of selection and evaluation determined by the data
scientist. In the graphic, we see a small part of what would be a much larger decision tree, where an
airport value of ORD has been evaluated, followed by carriers at ORD, followed by weather conditions.
There are often numerous ways a decision tree might be constructed, and some will produce better
predictions than others. The same target variables can be arranged into multiple decision trees, which
can be combined into what is known as a forest. In the end, the classification (prediction) that receives
the most "votes" across the trees is selected as the prediction.
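Here is a minimal sketch of training a decision tree with the RDD-based mllib API, assuming the
labeled_data RDD of LabeledPoint objects built in the earlier sketch; the parameter values shown are
illustrative defaults, not tuned settings.

from pyspark.mllib.tree import DecisionTree

train_lp, test_lp = labeled_data.randomSplit([0.7, 0.3], seed=42)

model = DecisionTree.trainClassifier(
    train_lp,
    numClasses=2,                   # on time vs. late
    categoricalFeaturesInfo={},     # treat all features as continuous
    impurity="gini",
    maxDepth=5)

# Predict a class for each held-out observation.
predictions = model.predict(test_lp.map(lambda lp: lp.features))

mllib also provides a RandomForest class that trains many trees and lets them vote, matching the
forest idea described above.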
Classification Algorithms
When creating a classification visualization, the model draws a line where it predicts the answers will
be. This line can then be compared to the actual results in the test data. For example, in this simple
visualization, the white-filled circles represent observations of target variables where actual departure
time was less than or equal to the scheduled departure time. The red-filled circles represent
observations where actual departure time was greater than scheduled departure time. The red line
represents the predictions that the model made. Above the red line would be where the model
predicted on-time departures, and below the red line would be where the model predicted delayed
departures.
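Logistic regression is one algorithm that learns this kind of dividing line. The following is a hedged
sketch that reuses the hypothetical train_lp and test_lp RDDs from the decision tree sketch, measuring
accuracy against the known test labels; all settings are defaults.

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

lr_model = LogisticRegressionWithLBFGS.train(train_lp)

# Pair each known label with the model's prediction for that observation.
labels_and_preds = test_lp.map(
    lambda lp: (lp.label, lr_model.predict(lp.features)))

accuracy = (labels_and_preds.filter(lambda pair: pair[0] == pair[1]).count()
            / float(test_lp.count()))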
Linear Regression Algorithm
In the case of regression, the line drawn is predicting an actual value rather than a binary result. In the
first diagram, we see a regression where only a single variable was selected and weighted - thus the
result will be a straight line. As more variables are added, the regression curves, and in some cases,
can curve wildly based on the variables and the weights determined by the model. The second
diagram is what a model with two variables might look like. To determine which model is the better
predictor, we would measure how far each dot is from the prediction line and perform a sum of
squared errors calculation.
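A hedged sketch of a regression model using mllib's LinearRegressionWithSGD, again reusing the
hypothetical flights RDD: here the label is the actual departure time and the single feature is the
scheduled time. With raw values like these, the feature would normally be scaled first and the step size
tuned; the numbers below are placeholders.

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# label = actual departure time, feature = scheduled departure time
regression_data = flights.map(lambda row: LabeledPoint(row[3], [row[2]]))

reg_model = LinearRegressionWithSGD.train(
    regression_data, iterations=100, step=0.0000001)

# Sum of squared errors against the known values, as described earlier.
sse = (regression_data
       .map(lambda lp: (lp.label - reg_model.predict(lp.features)) ** 2)
       .sum())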
Unsupervised Learning
Supervised learning is a powerful tool as long as you have clean, formatted data where every column
has an accurate label. However, in some cases, what we start with is simply data, and appropriate
labels may be unknown. For example, take product reviews that people leave on social media, blogs,
and other web sites. Unlike reviews on retailers' pages, where the user explicitly gives a negative,
neutral, or positive rating as part of creating their review (for example, a star rating), the social media
and other reviews have no such rating or label applied. How then can we group them to determine
whether any given review is positive, neutral, or negative, and determine whether the general
consensus is positive or negative?
For a human evaluator, simply reading the review would be enough. However, if we are collecting
thousands of reviews every day from various sources, employing a human to read and categorize each
one would be highly inefficient. This is where unsupervised learning comes in. The goal of
unsupervised learning is to define criteria by which a dataset will be evaluated, and then find patterns
in the data that are made up of groupings with similar characteristics. The algorithm does not
determine what those groupings mean - that is up to the data scientist to fill in. All it determines is what
should be grouped, based on the supplied criteria.
For example, we might compare how often certain phrases appear together, and the algorithm might
determine that when a review contains phrase X, it usually also contains phrase Y. Therefore, a review
that contains phrase X but not phrase Y would still be grouped with the phrase Y reviews. After this
processing is complete, the data scientist looks at a few of the phrase Y reviews, determines that they
are generally positive, and thus assigns that group to the positive review category.
The most common type of unsupervised learning, and the one described in this example, is called
clustering.
In this example, we have observations from which we have picked out phrases that appear on a
defined list of phrases we are looking for. The data has been cleaned of extraneous words and
phrases, and the remaining groups of phrases are evaluated to determine how frequently they are used
within the same review. The algorithm searches for patterns so that reviews can be grouped, but it has
no idea whether any particular grouping represents positive, neutral, or negative reviews.
K-Means Algorithm
K-Means is Used to Identify Groupings that Likely Share the Same Label
Once the algorithm has grouped the results, the data scientist must determine the meaning. In the
diagram, negative reviews are coded red, positive reviews are coded green, and neutral reviews are
coded yellow. A clustering algorithm known as K-Means was applied and groupings were created. Note
that, just as in supervised learning, not every review could be grouped closely with others, and in some
cases reviews were grouped with the wrong category. The better the model is, the more accurate these
groupings will be.
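A minimal K-Means sketch with mllib, using hashed term frequencies as the clustering criteria and
reusing the SparkContext from the earlier sketches. The review text, k=3 (one hoped-for cluster each
for positive, neutral, and negative), and the feature size are all illustrative assumptions.

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans

reviews = sc.parallelize([
    "great product works perfectly",
    "terrible waste of money",
    "it is okay nothing special",
])

# Turn each review into a sparse term-frequency vector.
hashing_tf = HashingTF(numFeatures=1000)
vectors = reviews.map(lambda text: hashing_tf.transform(text.split(" ")))

kmeans_model = KMeans.train(vectors, k=3, maxIterations=10)

# The algorithm only assigns cluster ids (0, 1, 2); deciding which cluster
# means positive, neutral, or negative is left to the data scientist.
cluster_ids = kmeans_model.predict(vectors)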
Logistic Regression
Naïve Bayes
Clustering
K-Nearest Neighbors
In addition, Spark offers a range of Basic Statistics tools, such as summary statistics, correlations,
and random data generation.
mllib Modules
This is a list of the modules available in Spark's mllib package:
classification
clustering
evaluation
feature
fpm
linalg*
optimization
pmml
random
recommendation
regression
stat*
tree*
util
ml Modules
This is a list of the modules available in Spark's ml package:
attribute
classification
clustering
evaluation
feature
param
recommendation
regression
source.libsvm
tree*
tuning
util
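The module lists above translate directly into import paths. A few examples follow; the class names
shown are common examples from these modules, not an exhaustive mapping.

# RDD-based API
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.stat import Statistics

# DataFrame-based API
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator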
For example, if you wanted to view ml samples available for Python, you would browse to
/usr/hdp/current/spark-client/examples/src/main/python/ml/.
Using a text editor, you can open and examine the contents of each application. The examples are well
commented, meaning they can actually be used as teaching tools to help you learn how to employ
Spark's machine learning capabilities for your own needs. In this example, we have opened the
decision tree classification program in the Python mllib directory.
Here is another example, a logistic regression (which, as you will recall, is actually a classification
algorithm) from the Python ml directory.
More sample code from the imported machine learning note in Zeppelin
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) What are two types of machine learning?
2 ) What are two types of supervised learning?
3 ) What do you call columns that are selected as variables to build a machine learning model?
4 ) What is a row of data called in machine learning?
5 ) What is the goal of unsupervised learning?
6 ) Name the two Spark machine learning packages.
7 ) Which machine learning package is designed to take advantage of flexibility and performance
benefits of DataFrames?
8 ) Name two reasons to prefer Spark machine learning over other alternatives
Answers
1 ) What are two types of machine learning?
Answer: Supervised and unsupervised
2 ) What are two types of supervised learning?
Answer: Classification and regression
3 ) What do you call columns that are selected as variables to build a machine learning model?
Answer: Target variables
4 ) What is a row of data called in machine learning?
Answer: An observation
5 ) What is the goal of unsupervised learning?
Answer: The goal of unsupervised learning is to find groupings in unlabeled data
6 ) Name the two Spark machine learning packages.
Answer: mllib and ml
7 ) Which machine learning package is designed to take advantage of flexibility and performance
benefits of DataFrames?
Answer: ml
8 ) Name two reasons to prefer Spark machine learning over other alternatives
Answer: Cluster-level resource availability, parallel processing, in-memory processing (vs.
older Hadoop machine learning libraries)
Summary
- Spark supports machine learning algorithms running in a highly parallelized fashion, using
  cluster-level resources and performing in-memory processing
- Supervised machine learning builds a model based on known data and uses it to predict
  outcomes for unknown data
Hortonworks University courses are designed by the leaders and committers of Apache Hadoop.
We provide immersive, real-world experience in scenario-based training. Courses offer
unmatched depth and expertise, available both in the classroom and online from anywhere in the
world. We prepare you to be an expert with highly valued skills and to earn certification.