HDP Developer-Enterprise Spark 1-Student Guide-Rev 1
Enterprise Spark 1
Student Guide
Rev 1
Copyright © 2012 - 2016 Hortonworks, Inc. All rights reserved.
The contents of this course and all its lessons and related materials, including handouts to
audience members, are Copyright © 2012 - 2015 Hortonworks, Inc.
No part of this publication may be stored in a retrieval system, transmitted or reproduced in any
way, including, but not limited to, photocopy, photograph, magnetic, electronic or other record,
without the prior written permission of Hortonworks, Inc.
This instructional program, including all material provided herein, is supplied without any
guarantees from Hortonworks, Inc. Hortonworks, Inc. assumes no liability for damages or legal
action arising from the use or misuse of contents or details contained herein.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
• HDP Certified Developer: for Hadoop developers using frameworks like Pig, Hive, Sqoop and
Flume.
• HDP Certified Administrator: for Hadoop administrators who deploy and manage Hadoop
clusters.
• HDP Certified Developer: Java: for Hadoop developers who design, develop and architect
Hadoop-based solutions written in the Java programming language.
• HDP Certified Developer: Spark: for Hadoop developers who write and deploy applications for
the Spark framework.
How to Register: Visit www.examslocal.com and search for “Hortonworks” to register for an
exam. The cost of each exam is $250 USD, and you can take the exam anytime, anywhere
using your own computer. For more details, including a list of exam objectives and instructions
on how to attempt our practice exams, visit http://hortonworks.com/training/certification/
Earn Digital Badges: Hortonworks Certified Professionals receive a digital badge for each
certification earned. Display your badges proudly on your résumé, LinkedIn profile, email
signature, etc.
On Demand Learning
Hortonworks University courses are designed and developed by Hadoop experts and
provide an immersive and valuable real world experience. In our scenario-based training
courses, we offer unmatched depth and expertise. We prepare you to be an expert with
highly valued, practical skills and prepare you to successfully complete Hortonworks
Technical Certifications.
The online library accelerates time to Hadoop competency. In addition, the content is
continually expanded with new material.
Visit: http://hortonworks.com/training/class/hortonworks-university-self-paced-learning-
library/
Lesson Objectives
After completing this lesson, students should be able to:
• Describe the characteristics and types of Big Data
• Define HDP and how it fits into overall data lifecycle management strategies
• Describe and use HDFS
• Explain the purpose and function of YARN
Volume
Volume refers to the amount of data being generated. Think in terms of gigabytes, terabytes, and
petabytes. Many systems and applications are just not able to store, let alone ingest or process, that
much data.
Many factors contribute to the increase in data volume. This includes transaction-based data stored
for years, unstructured data streaming in from social media, and the ever increasing amounts of sensor
and machine data being produced and collected.
There are problems related to the volume of data. Storage cost is an obvious issue. Another problem
is filtering and finding relevant and valuable information in large quantities of data that often contains
low-value information.
You also need a solution to analyze data quickly enough in order to maximize business value today
and not just next quarter or next year.
Velocity
Velocity refers to the rate at which new data is created. Think in terms of megabytes per second and
gigabytes per second.
Data is streaming in at unprecedented speed and must be dealt with in a timely manner in order to
extract maximum value from the data. Sources of this data include logs, social media, RFID tags,
sensors, and smart metering.
There are problems related to the velocity of data. These include not reacting quickly enough to
benefit from the data. For example, data could be used to create a dashboard that could warn of
imminent failure or a security breach. Failure to react in time could lead to service outages.
Another problem related to the velocity of data is that data flows tend to be highly inconsistent with
periodic peaks. Causes include daily or seasonal changes or event-triggered peak loads. For example,
a change in political leadership could cause a peak in social media activity.
Variety
Variety refers to the number of types of data being generated. Varieties of data include structured,
semi-structured, and unstructured data arriving from a myriad of sources. Data can be gathered from
databases, XML or JSON files, text documents, email, video, audio, stock ticker data, and financial
transactions.
There are problems related to the variety of data. These include how to gather, link, match, cleanse, and
transform data across systems. You also have to consider how to connect and correlate data
relationships and hierarchies in order to extract business value from the data.
Sentiment
Understand how your customers feel about your brand and products right now
Sentiment data is unstructured data containing opinions, emotions, and attitudes. Sentiment data is
gathered from social media like Facebook and Twitter. It is also gathered from blogs, online product
reviews, and customer support interactions.
Enterprises use sentiment analysis to understand how the public thinks and feels about something.
They can also track how those thoughts and feelings change over time.
It is used to make targeted, real-time decisions that improve performance and improve market share.
Sentiment data may be analyzed to get feedback about products, services, competitors, and
reputation.
Clickstream
Capture and analyze website visitors' data trails and optimize your website
Clickstream data is the data trail left by a user while visiting a Web site. Clickstream data can be used
to determine how long a customer stayed on a Web site, which pages they most frequently visited,
which pages they most quickly abandoned, along with other statistical information.
This data is commonly captured in semi-structured Web logs.
Clickstream data is used, for example, for path optimization, basket analysis, next-product-to-buy
analysis, and allocation of Web site resources.
Hadoop makes it easier to analyze, visualize, and ultimately change how visitors behave on your Web
site.
Sensor/Machine
Discover patterns in data streaming automatically from remote sensors and machines
A sensor is a converter that measures a physical quantity and transforms it into a digital signal. Sensor
data is used to monitor machines, infrastructure, or natural phenomena.
Sensors are everywhere these days. They are on the factory floor and they are in department stores in
the form of RFID tags. Hospitals use biometric sensors to monitor patients and other sensors to
monitor the delivery of medicines via intravenous drip lines. In all cases these machines stream low-
cost, always-on data.
Hadoop makes it easier for you to rapidly collect, store, process, and refine this data. By processing
and refining your data you can identify meaningful patterns that provide insight to make proactive
business decisions.
Geographic
Analyze location-based data to manage operations where they occur
Geographic/geolocation data identifies the location of an object or individual at a moment in time. This
data may take the form of coordinates or an actual street address.
Like sensor data, this data can be voluminous to collect, store, and process. In fact, geolocation
data is collected by sensors.
Hadoop helps reduce data storage costs while providing value-driven intelligence from asset tracking.
For example, you might optimize truck routes to save fuel costs.
Server Log
Research log files to diagnose and process failures and prevent security breaches
Server log data captures system and network operation information. Information technology
organizations analyze server logs for many reasons. These include the need to answer questions
about security, monitor for regulatory compliance, and troubleshoot failures.
Hadoop takes server-log analysis to the next level by speeding and improving log aggregation and
data center-wide analysis. In many environments Hadoop can replace existing enterprise-wide
systems and network monitoring tools, and reduce the complexity and costs associated with deploying
and maintaining such tools.
Text
Understand patterns in text across millions of web pages, emails and documents
The text category covers text-based data that doesn't neatly fit into one of the above
categories, as well as combinations of categories analyzed to find patterns across different text-based
sources.
Introduction to HDP
Hadoop is a collection of open source software frameworks for the distributed storing and processing
of large sets of data. Hadoop development is a community effort governed under the licensing of the
Apache Software Foundation. Anyone can help to improve Hadoop by adding features, fixing software
bugs, or improving performance and scalability.
Hadoop clusters are scalable, ranging from a single machine to literally thousands of machines. It is
also fault tolerant. Hadoop services achieve fault tolerance through redundancy.
Clusters are created using commodity, enterprise-grade hardware, which not only reduces the original
purchase price, but potentially reduces support costs as well.
Hadoop also uses distributed storage and processing to achieve massive scalability. Large datasets
are automatically split into smaller chunks, called blocks, and distributed across the cluster machines.
Not only that, but each machine commonly processes its local block of data. This means that
processing is distributed too, potentially across hundreds of CPUs and hundreds of gigabytes of
memory.
HDP is an enterprise-ready collection of frameworks (sometimes referred to as the HDP Stack) that
work within Hadoop that have been tested and are supported by Hortonworks for business clients.
Hadoop is not a monolithic piece of software. It is a collection of software frameworks. Most of the
frameworks are part of the Apache software ecosystem. The picture illustrates the Apache frameworks
that are part of the Hortonworks Hadoop distribution.
So why does Hadoop have so many frameworks and tools? The reason is that each tool is designed
for a specific purpose. The functionality of some tools overlap but typically one tool is going to be
better than others when performing certain tasks.
For example, both Apache Storm and Apache Flume ingest data and perform real-time analysis, but
Storm has more functionality and is more powerful for real-time data analysis.
HDP Overview
The Hortonworks Data Platform (HDP) is an open enterprise version of Hadoop distributed by
Hortonworks. It includes a single installation utility that installs many of the Apache Hadoop software
frameworks. Even the installer is pure Hadoop. The primary benefit of HDP is that Hortonworks has put
it through a rigorous set of system, functional, and regression tests to ensure that versions of any
framework included in the distribution works seamlessly with other frameworks in a secure and reliable
manner.
Because HDP is an open enterprise version of Hadoop, it is imperative that it uses the best
combination of the most stable, reliable, secure, and current frameworks.
ZooKeeper is a coordination service for distributed applications and services. Coordination services
are hard to build correctly, and are especially prone to errors such as race conditions and deadlock. In
addition, a distributed system must be able to conduct coordinated operations while dealing with such
things as scalability concerns, security concerns, consistency issues, network outages, bandwidth
limitations, and synchronization issues. ZooKeeper is designed to help with these issues.
Cloudbreak is a cloud-agnostic tool for provisioning, managing, and monitoring on-demand clusters.
It automates the launching of elastic Hadoop clusters with policy-based autoscaling on the major
cloud infrastructure platforms including Microsoft Azure, Amazon Web Services, Google Cloud
Platform, OpenStack, and Docker containers.
Oozie is a server-based workflow engine used to execute Hadoop jobs. Oozie enables Hadoop users
to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache
Pig, and Apache Sqoop jobs into a single, logical unit of work. Oozie can also perform Java, Linux
shell, distcp, SSH, email, and other operations.
Apache Pig is a high-level platform for extracting, transforming, or analyzing large datasets. Pig
includes a scripted, procedural-based language that excels at building data pipelines to aggregate and
add structure to data. Pig also provides data analysts with tools to analyze data.
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It was designed to enable
users with database experience to analyze data using familiar SQL-based statements. Hive includes
support for SQL:2011 analytics. Hive and its SQL-based language enable an enterprise to utilize
existing SQL skillsets to quickly derive value from a Hadoop deployment.
Apache HCatalog is a table information, schema, and metadata management system for Hive, Pig,
MapReduce, and Tez. HCatalog is actually a module in Hive that enables non-Hive tools to access Hive
metadata tables. It includes a REST API, named WebHCat, to make table information and metadata
available to other vendors’ tools.
Cascading is an application development framework for building data applications. Acting as an
abstraction layer, Cascading converts applications built on Cascading into MapReduce jobs that run
on top of Hadoop.
Apache Falcon is a data governance tool. It provides a workflow orchestration framework designed
for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon
enables data stewards and Hadoop administrators to quickly onboard data and configure associated
processing and management on Hadoop clusters.
WebHDFS uses the standard HTTP verbs GET, PUT, POST, and DELETE to access, operate, and
manage HDFS. Using WebHDFS, a user can create, list, and delete directories as well as create, read,
append, and delete files. A user can also manage file and directory ownership and permissions.
Administrators can manage HDFS.
The HDFS NFS Gateway allows access to HDFS as though it were part of an NFS client’s local file
system. The NFS client mounts the root directory of the HDFS cluster as a volume and then uses local
command-line commands, scripts, or file explorer applications to manipulate HDFS files and
directories.
Apache Flume is a distributed, reliable, and highly-available service that efficiently collects,
aggregates, and moves streaming data. It is a distributed service because it can be deployed across
many systems. The benefits of a distributed system include increased scalability and redundancy. It is
reliable because its architecture and components are designed to prevent data loss. It is highly-
available because it uses redundancy to limit downtime.
Apache Sqoop is a collection of related tools. The primary tools are the import and export tools.
Writing your own scripts or MapReduce program to move data between Hadoop and a database or an
enterprise data warehouse is an error prone and non-trivial task. Sqoop import and export tools are
designed to reliably transfer data between Hadoop and relational databases or enterprise data
warehouse systems.
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system.
Kafka is often used in place of traditional message brokers like Java Messaging Service (JMS) or
Advance Message Queuing Protocol (AMQP) because of its higher throughput, reliability, and
replication.
Apache Atlas is a scalable and extensible set of core foundational governance services that enable
an enterprise to meet compliance requirements within Hadoop and enables integration with the
complete enterprise data ecosystem.
Security Frameworks
HDFS also contributes security features to Hadoop. HDFS includes file and directory permissions,
access control lists, and transparent data encryption. Access to data and services often depends on
having the correct HDFS permissions and encryption keys.
YARN also contributes security features to Hadoop. YARN includes access control lists that control
access to cluster memory and CPU resources, along with access to YARN administrative capabilities.
Hive can be configured to control access to table columns and rows.
Falcon is a data governance tool that also includes access controls that limit who may submit
automated workflow jobs on a Hadoop cluster.
Apache Knox is a perimeter gateway protecting a Hadoop cluster. It provides a single point of
authentication into a Hadoop cluster.
Apache Ranger is a centralized security framework offering fine-grained policy controls for HDFS,
Hive, HBase, Knox, Storm, Kafka, and Solr. Using the Ranger Console, security administrators can
easily manage policies for access to files, directories, databases, tables, and columns. These policies
can be set for individual users or groups and then enforced within Hadoop.
Why does HDP need so many frameworks? Let’s take a look at a simple data lifecycle example.
We start with some raw data and an HDP cluster. The first step in managing the data is to get it into
the HDP cluster. We must have some mechanism to ingest that data – perhaps Sqoop, Flume, Spark
Streaming, or Storm - and then another mechanism to analyze and decide what to do with it next.
Does this data require some kind of transformation in order to be used? If so, ETL processes must be
run, and those results generated into another file. Quite often, this is not a single step, but multiple
steps configured into a data application pipeline.
The next decision comes in regard to whether to keep or discard the data. Not all data must be kept,
either because it has no value (empty files, for example), or it is not necessary to keep once it has been
processed and transformed. Thus some raw data can simply be deleted.
Data that must be kept requires additional decisions. For example, where will the data be stored, and
for how long? Your Hadoop cluster might have multiple tiers of HDFS storage available, perhaps
separated via some kind of node label mechanism. In the example, we have two HDFS storage tiers.
Any data that is copied to tier 2 should be stored for 90 days. We have another, higher tier of HDFS
Storage, and any data stored here should be kept until it is manually deleted.
You may decide that some data should be archived rather than made immediately available via HDFS,
and you can have multiple tiers of archives as well. In our example we have three tiers of archival
storage, and data is kept for one, three, and seven years depending on where it is stored.
A third location where data might end up is some kind of cloud storage, such as AWS or
Microsoft Azure.
Both raw data and transformed data might be kept anywhere in this storage infrastructure as a result of
having been input and processed by this HDP cluster. In addition, you may be working in a multi-
cluster environment, in which case an additional decision is required. What data needs to be replicated
between the clusters? If files need to be replicated to another HDP cluster, then once that cluster
ingests and examines that data, the same kinds of processes and decision mechanisms need to be
employed. Perhaps additional transformation is required. Perhaps some files can be examined and
deleted. For files that are to be kept, their location and length of retention must be decided, just as on
the first cluster.
This is a relatively simple example of the kind of data lifecycle decisions that need to be made in an
environment where the capabilities of HDP are being fully utilized. This can get significantly more
complex with additional storage tiers, retention requirements, and geographically
dispersed HDP clusters which must replicate data between each other, and perhaps with a central
global cluster designed to do all final processing.
HDFS Overview
The Hadoop Distributed File System (HDFS) and YARN (Yet Another Resource Negotiator) are part of
the core of Hadoop and are installed when you install the Hortonworks Data Platform. In this section
we will focus on HDFS, the distributed file system for HDP.
Developers can interact with HDFS directly via the command line using the hdfs dfs command and
appropriate arguments. If a developer has previous Linux command line experience, the hdfs dfs
commands will be familiar and intuitive to use. The most common use for command line usage is
manual data ingestion and file manipulation.
YARN Overview
Why is YARN so important to Spark? Let's take a look at a sample enterprise Hadoop deployment.
Without a central resource manager to ensure good behavior between applications, it is necessary to
create specialized, separate clusters to support multiple applications. This, in turn, means that when
you want to do something different with the data that application was using, it is necessary to copy
that data between clusters. This introduces inefficiencies in terms of network, CPU, memory, storage,
general datacenter management, and data integrity across Hadoop applications.
YARN as a resource manager mitigates this issue by allowing different types of applications to access
the same underlying resources pooled into a single data lake. Since Spark runs on YARN, it can join
other Hadoop applications on the same cluster, enabling data and resource sharing at enterprise scale.
YARN (unofficially "Yet Another Resource Negotiator") is the computing framework for Hadoop. If you
think about HDFS as the cluster file system for Hadoop, YARN would be the cluster operating system.
It is the architectural center of Hadoop.
A computer operating system, such as Windows or Linux, manages access to resources, such as CPU,
memory, and disk, for installed applications. In similar fashion, YARN provides a managed framework
that allows for multiple types of applications – batch, interactive, online, streaming, and so on – to
execute on data across your entire cluster. Just like a computer operating system manages both
resource allocation (which application gets access to CPU, memory, and disk now, and which one has
to wait if contention exists?) and security (does the current user have permission to perform the
requested action?), YARN manages resource allocation for the various types of data processing
workloads, prioritizes and schedules jobs, and enables authentication and multitenancy.
Every slave node in a cluster provides resources such as CPU and memory. The abstract notion
of a resource container is used to represent a discrete amount of these resources. Cluster
applications run inside one or more containers. Containers are managed and scheduled by YARN.
A container’s resources are logically isolated from other containers running on the same machine. This
isolation provides strong application multi-tenancy support.
Applications are allocated different-sized containers based on application-defined resource requests,
but always within the constraints configured by the Hadoop administrator.
Knowledge Check
You can use the following questions and answers for self-assessment.
Questions
1 ) Name the three V’s of big data.
5 ) What is the base command-line interface command for manipulating files and directories in
HDFS?
Answers
1 ) Name the three V’s of big data.
Answer: Volume, Velocity, and Variety
5 ) What is the base command-line interface command for manipulating files and directories in
HDFS?
Answer: hdfs dfs
Summary
• Data is made "Big" Data by ever-increasing Volume, Velocity, and Variety
• Hadoop is often used to handle sentiment, clickstream, sensor/machine, server, geographic,
and text data
• HDP is comprised of an enterprise-ready and supported collection of open source Hadoop
frameworks designed to allow for end-to-end data lifecycle management
• The core frameworks in HDP are HDFS and YARN
• HDFS serves as the distributed file system for HDP
• The hdfs dfs command can be used to create and manipulate files and directories
• YARN serves as the operating system and architectural center of HDP, allocating resources to
a wide variety of applications via containers
Lesson Objectives
After completing this lesson, students should be able to:
• Use Apache Zeppelin to work with Spark
• Describe the purpose and benefits of Spark
• Define Spark REPLs and application architecture
Zeppelin Overview
Apache Zeppelin
Apache Zeppelin is a web-based notebook that enables interactive data analytics on top of Spark. It
supports a growing list of programming languages, such as Python, Scala, Hive, SparkSQL, shell, and
markdown. It allows for data visualization, report generation, and collaboration.
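As a sketch of how work is submitted, each notebook paragraph typically begins with an interpreter directive that selects the language for that paragraph (the %pyspark interpreter shown here assumes a standard HDP Zeppelin configuration):
%pyspark
rdd = sc.parallelize([1, 2, 3, 4])
rdd.count()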
Zeppelin
Zeppelin has four major functions: data ingestion, discovery, analytics, and visualization. It
comes with built-in examples that demonstrate these capabilities. These examples can be reused and
modified for real-world scenarios.
Data Visualization
Zeppelin comes with several built-in ways to interactively view and visualize data including table view,
column charts, pie charts, area charts, line charts, and scatter plot charts – illustrated below:
Spark Overview
Spark Introduction
Spark is a platform that allows large-scale, cluster-based, in-memory data processing. It enables fast,
large-scale data engineering and analytics for iterative and performance-sensitive applications. It
offers development APIs for Scala, Java, Python, and R. In addition, Spark has been extended to
support SQL-like operations, streaming, and machine learning as well.
Spark is supported by Hortonworks on HDP and is YARN compliant, meaning it can leverage datasets
that exist across many other applications in HDP.
Spark RDDs
Spark tools
As part of the Spark project, Spark Core supports a set of four high-level tools that provide SQL-
like queries, streaming data applications, a machine learning library (MLlib), and graph algorithms
(GraphX). In addition, Spark also integrates with a number of other HDP tools, such as Hive for SQL-
like operations and Zeppelin for graphing / data visualization.
There are five core components of an enterprise Spark application in HDP. They are the Driver,
SparkContext, YARN ResourceManager, HDFS Storage, and Executors.
When using a REPL, the driver and SparkContext will run on a client machine. When deploying an
application as a cluster application, the driver and SparkContext can also run in a YARN container. In
both cases, Spark executors run in YARN containers on the cluster.
Spark Driver
The Spark driver contains the main() Spark program that manages the overall execution of a Spark
application. It is a JVM that creates the SparkContext which then communicates directly with Spark.
It is also responsible for writing/displaying and storing the logs that the SparkContext gathers from
executors.
Spark shell REPLs are examples of Spark driver programs.
IMPORTANT: The Spark driver is a single point of failure for a YARN client application. If the driver
fails, the application will fail. This is mitigated when deploying applications using YARN cluster mode.
SparkContext
SparkContext
For an application to become a Spark application, an instance of the SparkContext class must be
instantiated. The SparkContext contains all code and objects required to process data in the cluster,
and works with the YARN ResourceManager to get requested resources for the application. It is also
responsible for scheduling tasks for Spark executors. The SparkContext checks in with the
executors to report work being done and provide log updates.
A SparkContext is automatically created and named sc when a REPL is launched. The following
code is executed at start up for pyspark:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
Spark Executors
Spark Executors
The Spark executor is the component that performs the map and reduce tasks of a Spark
application, and is sometimes referred to as a Spark “worker.” Once created, executors exist for the
life of the application.
NOTE:
In the context of Spark, the SparkContext is the "master" and executors
are the "workers." However, in the context of HDP in general, you also
have "master" nodes and "worker" nodes. Both uses of the term worker
are correct - in terms of HDP, the worker (node) can run one or more Spark
workers (executors). When in doubt, make sure to verify whether the
worker being described is an HDP node or a Spark executor running on an
HDP node.
Spark executors function as interchangeable work spaces for Spark application processing. If an
executor is lost while an application is running, all tasks assigned to it will be reassigned to another
executor. In addition, any data lost will be recomputed on another executor.
Executor behavior can be controlled programmatically. Configuring the number of executors and their
available resources can greatly increase performance of an application when done correctly.
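For instance, executor count and sizing can be requested through SparkConf properties when the SparkContext is created. The following is a minimal sketch; the property values shown are hypothetical and should be tuned to the cluster's available resources:
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("exampleApp")
        .set("spark.executor.instances", "4")
        .set("spark.executor.memory", "2g")
        .set("spark.executor.cores", "2"))
sc = SparkContext(conf=conf)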
Knowledge Check
You can use the following questions and answers for self-assessment.
Questions
1 ) Name the tool in HDP that allows for interactive data analytics, data visualization, and
collaboration with Spark.
Answers
1 ) Name the tool in HDP that allows for interactive data analytics, data visualization, and
collaboration with Spark.
Answer: Zeppelin
Answer: Access to datasets shared across the cluster with other HDP applications
Answer: Executor
Summary
• Zeppelin is a web-based notebook that supports multiple programming languages and allows
for data engineering, analytics, visualization, and collaboration using Spark
• Spark is a large-scale, cluster-based, in-memory data processing platform that supports
parallelized operations on enterprise-scale datasets
• Spark provides REPLs for rapid, interactive application development and testing
• The five components of an enterprise Spark application running on HDP are:
- Driver
- SparkContext
- YARN
- HDFS
- Executors
Lesson Objectives
After completing this lesson, students should be able to:
• Explain the purpose and function of RDDs
• Explain Spark programming basics
• Define and use basic Spark transformations
• Define and use basic Spark actions
• Invoke functions for multiple RDDs, create named functions, and use numeric operations
Introduction to RDDs
NOTE:
The brackets [ ] around the variable name prevent the variable input
from being read one character at a time. If we had used parentheses ( )
instead, the collect function would have returned:
['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', . . . , 'l',
'a', 'm', 'b']
In this simple example, we have a small cluster that has been loaded with three data files. We will
walk through their input into HDFS, then use two of them to create a single RDD, and
begin to demonstrate the power of parallel datasets.
The first file in the example is small enough to fit entirely in a single 128 MB HDFS block – so data file 1
is made up of only one HDFS block (labeled DF1.1) which is written to Node 1. This would be replicated
by default to two other nodes, which are not shown in the image.
Data files 2 and 3 take up two HDFS blocks each. In our example, these four data blocks are written to
four different HDFS nodes. Data file 2 is represented in HDFS by DF2.1 and DF 2.2 (written to node 2
and node 4, respectively). Data file 3 is represented in HDFS by DF3.1 and DF3.2 (written to node 3 and
node 5, respectively). Again, not shown in the image, each of these blocks would be replicated multiple
times across nodes in the cluster.
Next we write a Spark application that initially defines an RDD that is made up of a combination of the
data in data files 1 and 2. In HDFS, these two files are represented by three HDFS blocks on nodes 1,
2, and 4. When the RDD partitions are created in memory, the nodes that will be used will be the same
nodes that contain the data blocks. This improves performance and reduces network traffic that would
result from pulling data from one node’s disk to another node’s memory.
The three data blocks that represent these two files that were combined by the Spark application are
then copied into memory on their respective nodes and become the partitions of the RDD. DF2.1 is
written to an RDD partition we have labeled RDD 1.1. DF2.2 is written to an RDD partition we have
labeled RDD 1.2. DF1.1 is written to an RDD partition we have labeled RDD 1.3.
In our example, one RDD was created from two files (which are split across three HDFS data nodes),
which exist in memory as three partitions that a Spark application can then continue to use.
A hypothetical cluster with two RDDs. Each RDD is composed of multiple partitions, which are distributed across the cluster.
RDD Characteristics
RDDs can contain any type of serializable element, meaning those that can be converted to and from
a byte stream. Examples include: int, float, bool, and sequences/iteratives like arrays, lists,
tuples, and strings. Element types in an RDD can be mixed as well. For example, an array or list
can contain both string and int values. Furthermore, RDD types are converted implicitly when
possible, meaning there is no need to explicitly specify type during RDD creation.
NOTE:
Non-serializable elements (for example: objects created with certain third-
party JAR files or other external resource) cannot be made into RDDs.
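As a brief sketch (the values here are hypothetical), a single RDD can hold a mix of serializable element types:
rddMixed = sc.parallelize([1, 2.5, True, "text", ("a", 1), [3, 4]])
rddMixed.collect()
[1, 2.5, True, 'text', ('a', 1), [3, 4]]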
RDD Operations
Once an RDD is created, there are two operations that can be performed on it: Actions and
Transformations.
Transformations apply a function to RDD elements and create new RDD partitions based on the output
A Transformation takes an existing RDD, applies a function to the elements of the RDD, and creates a
new RDD comprised of the transformed elements.
An action returns a result of a function applied to the elements of an RDD in the form of screen output,
a file write, etc.
First let's take a look at an example of non-functional programming. In the following function, we define
a variable value outside of our function, then pull that value into our function and modify it. Note the
dependence on, and the writing to, a variable that exists external to the function itself:
varValue = 0

def unfunctionalCode():
    global varValue
    varValue = varValue + 1
Now let's take a look at the same basic example, but this time written using functional programming
principles. In this example, the variable value is instantiated as part of calling the function itself, and
only the value within the function is modified.
def functionalCode(varValue):
    return varValue + 1
All Spark transformations are based on functional programming.
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.map(lambda z: z + 1).collect()
[6, 8, 12, 15]
NOTE:
The second line of code in this example did not define a new RDD. If
further transformations were necessary, the second line of code would
need to be rewritten as follows:
rddAnon = rddNumList.map(lambda z: z + 1)
rddAnon.collect()
[6, 8, 12, 15]
Maps can apply to strings as well. Here is an example that starts by reading a file from HDFS called
"mary.txt":
rddMary=sc.textFile("mary.txt")
RDDs created using the textFile method treat newline characters as characters that separate
elements. Thus, since the file has four lines, the resulting RDD shown in the image has four
elements.
rddLineSplit=rddMary.map(lambda line: line.split(" "))
A map() transformation is then called. The goal of the map transformation in this scenario is to take
each element, which is a string containing multiple words, and break it up into an array that is
stored in a new RDD for further processing. The split function takes a string and breaks it into
arrays based on the delimiter passed into split().
The result is an RDD that still only has four elements, but now those elements are arrays rather than
monolithic strings.
flatMap()
The flatMap function is similar to map, with the exception that it performs an extra step to break
down (or flatten) the component parts of elements such as arrays or other sequences into individual
elements after running a map function.
map() is a one-to-one transformation: one element comes in, one element comes out. Using map(),
four line elements were converted into four array elements, but we still started and ended with the
same number of elements. The flatMap() function, on the other hand, is a one-to-many
transformation: one element may go in, but many can come out.
Let's compare using the previous map() illustration.
rddLineSplit = rddMary.map(lambda line: line.split(" "))
If we run the same code, replacing map() with flatMap(), the output is returned as a single list of
individual elements rather than four lists of elements that were originally separated by the line break.
rddFlat = rddMary.flatMap(lambda line: line.split(" "))
This time, each word is treated as its own element, resulting in 22 elements instead of 4. Again, it is
easiest to think about flatMap() as a map operation followed by a flatten operation, in a single step.
filter()
The filter function is used to remove elements from an RDD that don't meet certain criteria or, put
another way, filter() keeps elements in an RDD based on a predicate. If the predicate returns true
(the filter criteria are met), the record is passed on to the transformed RDD.
In the example below, we have an RDD composed of four elements and want to filter out any element
whose value is greater than ten (or, in other words, keep any value ten or less). Notice that the initial
RDD is being created using the sc.parallelize API:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.filter(lambda number: number <= 10).collect()
[5, 7]
This could have been performed using any standard mathematical operation.
Filter is not limited to working with numbers. It can work with strings as well. Let's use an
example RDD consisting of the list of months below:
months = ["January", "March", "May", "July", "September"]
rddMonths = sc.parallelize(months)
We then use filter, with an anonymous function that uses the len function to count the number of
characters in each element, and then filter out any that contain five or fewer characters.
rddMonths.filter(lambda name: len(name) > 5).collect()
['January', 'September']
Again, any available function that performs evaluations on text strings or arrays could be used to filter
for a given result.
distinct()
The distinct function removes duplicate elements from an RDD. Consider the following RDD:
rddBigList = sc.parallelize([5, 7, 11, 14, 2, 4, 5, 14, 21])
rddBigList.collect()
[5, 7, 11, 14, 2, 4, 5, 14, 21]
Notice that the numbers 5 and 14 are listed twice. If we just wanted to see each element only listed one
time in our output, we could use distinct() as follows:
rddDistinct = rddBigList.distinct()
rddDistinct.collect()
[4, 5, 21, 2, 14, 11, 7]
Notice that 5 and 14 now only appear once in the results.
count()
count() returns the number of elements in the RDD. Here is an example:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.count()
4
In the case of a file that contains lines of text, count() would return the number of lines in the RDD, as
in the following example:
rddMary=sc.textFile("mary.txt")
rddMary.count()
4
The count function applied to rddMary returns 4, one count for each line in the file
reduce()
reduce() aggregates the elements of an RDD using a function that takes two arguments and returns
one, iterating through them until only a single value remains. For example:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.reduce(lambda a, b: a + b)
37
Because of the parallelized nature of RDDs, the reduce function does not necessarily process RDD
elements in a particular order. Therefore, the function used with reduce() should be both
commutative and associative.
Commutative means "to move around" - which implies that the order things are done in should not
matter. For example, 5+7 = 7+5, or 5*7 = 7*5. A function that is not commutative can return different
results each time it is run. For example, 5 / 7 does not equal 7 / 5, so a function that employed division
would not be commutative.
Associative refers to the way numbers are grouped using parentheses. For example, 5 + (7 + 5) equals
(5 + 7) + 5, and 5 * (7 * 5) = (5 * 7) * 5. A function that is not associative can return different results each
time it is run. For example, (5 - 7) - 5 does not equal 5 - (7 - 5), so a function that employed subtraction would not be associative.
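As a quick sketch of why this matters (assuming the rddNumList used above), addition is safe to use with reduce(), while subtraction is neither commutative nor associative, so its result can vary with how elements are grouped across partitions:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.reduce(lambda a, b: a + b)
37
rddNumList.reduce(lambda a, b: a - b)
The second call may return different values depending on how the elements are partitioned, which is exactly the behavior reduce() cannot protect against.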
saveAsTextFile()
The saveAsTextFile function writes the contents of RDD partitions to a specified location (such as
hdfs:// for HDFS, or file:/ for local file system, and so forth) and directory as a set of text files. For
example:
rddNumList.saveAsTextFile("hdfs://desiredLocation/foldername")
The contents of the RDD in the example would be written to the specified directory in HDFS.
Success can be verified using typical tools from a command line or GUI. In the case of our example,
we could use the hdfs dfs -ls command to verify it had been written successfully:
$ hdfs dfs -ls desiredLocation/foldername
Using saveAsTextFile(), each RDD partition is written to a different text file by default
The output would look like the screenshot shown, with each RDD partition being written to a different
text file by default. The files could be copied to the local file system and then be read using a standard
text editor / viewer such as nano, more, vi, etc.
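To illustrate, consider a minimal sketch of a transformation chain followed by an action (this assumes the mary.txt file used earlier; the RDD names are illustrative):
rddMary = sc.textFile("mary.txt")
rddWords = rddMary.flatMap(lambda line: line.split(" "))
rddLong = rddWords.filter(lambda word: len(word) > 4)
rddLong.count()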
In the above code, two transformations were applied (flatMap and filter), but it is not until the
count action is called that the data actually runs through those transformations.
As the visual indicates, when transformations are performed on an RDD, Spark just saves the recipe of what it
is supposed to do when the result is needed.
When the action is called, the data is pushed through the transformations so that the result can be calculated
Only at the end, when an action is called, will the data be pushed through the recipe to create the
desired outcome.
NOTE:
If there was no need to use the union or intersected RDDs for future
purposes, the results above could have been obtained in each case with
the following single lines of code:
rddNumList.union(rddNumList2).collect()
…and…
rddNumList.intersection(rddNumList2).collect()
Named Functions
Custom functions can be defined and named, then used as arguments to other functions. A custom
function should be defined when the same code will be used multiple times throughout the program, or
if a function argument will take more than a single line of code, making it too complex for an
anonymous function.
The following example evaluates a number to determine if it is 90 or greater. If true, it returns the text
string "A" and if false it returns "Not an A".
def gradeAorNot(percentage):
    if percentage > 89:
        return "A"
    else:
        return "Not an A"
In the REPL, the number of tabs matters. For example, in line 2, you have to tab once before typing the
line, and in line 3 you must tab twice. In line 4, you have to tab once, and in line 5, twice.
The custom named function gradeAorNot can then be passed as an argument to another function -
for example, map().
rddGrades = sc.parallelize([87, 94, 41, 90])
rddGrades.map(gradeAorNot).collect()
['Not an A', 'A', 'Not an A', 'A']
The named function could also be used as the function body in an anonymous function. The following
example results in equivalent output to the code above:
rddGrades.map(lambda grade: gradeAorNot(grade)).collect()
Numeric Operations
Numeric operations can be performed on RDDs, including mean, count, stdev, sum, max, and min, as
well as a stats function that collects several of these values with a single function. For example:
rddNumList = sc.parallelize([5, 7, 11, 14])
rddNumList.stats()
(count: 4, mean: 9.25, stdev: 3.49…, max: 14, min: 5)
The individual functions can be called as well:
rddNumList.min()
5
IMPORTANT:
To double check the output of Spark's stdev() in Excel, use Excel's
stdevp function rather than the stdev function. Excel's stdev function
assumes the data is a sample of a larger, unknown population, and thus
applies a bias correction to its output. The stdevp function
assumes the entire dataset (the p stands for "population") is fully
represented and does not make a bias correction. Thus, Excel's stdevp
function corresponds to Spark's stdev function.
REFERENCE:
There are many other Spark APIs available at:
http://spark.apache.org/docs/<version>/api/
<version> can be an actual version number, such as "1.4.0" or
"1.6.1", or alternatively you can use "latest" to view documentation on
the newest release of Spark. For example:
http://spark.apache.org/docs/latest/api/
Knowledge Check
You can use the following questions and answers as a self-assessment.
Questions
1 ) What does RDD stand for?
2 ) What two functions were covered in this lesson that create RDDs?
4 ) Which transformation will take all of the words in a text object and break each of them down
into a separate element in an RDD?
5 ) True or False: The count action returns the number of lines in a text document, not the
number of words it contains.
6 ) What is it called when transformations are not actually executed until an action is performed?
7 ) True or False: The distinct function allows you to compare two RDDs and return only
those values that exist in both of them
8 ) True or False: Lazy evaluation makes it possible to run code that "performs" hundreds of
transformations without actually executing any of them
Answers
1 ) What does RDD stand for?
Answer: Resilient Distributed Dataset
2 ) What two functions were covered in this lesson that create RDDs?
Answer: sc.parallelize() and sc.textFile()
Answer: False. Transformations result in new RDDs being created. In Spark, data is
immutable.
4 ) Which transformation will take all of the words in a text object and break each of them down
into a separate element in an RDD?
Answer: flatMap()
5 ) True or False: The count action returns the number of lines in a text document, not the
number of words it contains.
Answer: True
6 ) What is it called when transformations are not actually executed until an action is performed?
Answer: Lazy evaluation
7 ) True or False: The distinct function allows you to compare two RDDs and return only
those values that exist in both of them
Answer: False. The intersection function performs this task. The distinct function
would remove duplicate elements, so that each element is only listed once regardless of how
many times it appeared in the original RDD.
8 ) True or False: Lazy evaluation makes it possible to run code that "performs" hundreds of
transformations without actually executing any of them
Answer: True
Summary
• Resilient Distributed Datasets (RDDs) are immutable collections of elements that can be
operated on in parallel
• Once an RDD is created, there are two things that can be done to it: transformations and
actions
• Spark makes heavy use of functional programming practices, including the use of anonymous
functions
• Common transformations include map(), flatMap(), filter(), distinct(), union(), and
intersection()
• Common actions include collect(), first(), take(), count(), saveAsTextFile(), and
certain mathematic and statistical functions
Lesson Objectives
After completing this lesson, students should be able to:
• Use Core RDD functions and create your own function
• Perform common operations on Pair RDDs
map()
The first two lines of the example below should look very familiar. We are first creating an RDD from a
file called "mary.txt" and then calling a flatMap. Once each word is now its own element, key-value
pairs can be created.
rdd = sc.textFile("filelocation/mary.txt")
rddFlat = rdd.flatMap(lambda line: line.split(' '))
A map transformation can be used to create the key-value pair. In this process, a function is passed to
map() to create a tuple. In the example below, we create an anonymous function which returns a tuple
of (word,1).
kvRdd = rddFlat.map(lambda word: (word,1))
kvRdd.collect()
The illustration visually demonstrates what happens when the map function is applied on the initial
elements, each consisting of a single word.
keyBy()
The keyBy API creates key-value pairs by applying a function on each data element. The function
result becomes the key, and the original data element becomes the value in the pair.
For example:
rddTwoNumList = sc.parallelize([(1,2,3),(7,8)])
keyByRdd = rddTwoNumList.keyBy(len)
keyByRdd.collect()
[(3,(1,2,3)),(2,(7,8))]
Additional example:
rddThreeWords = sc.parallelize(["cat","A","spoon"])
keyByRdd2 = rddThreeWords.keyBy(len)
keyByRdd2.collect()
[(3,'cat'),(1,'A'),(5,'spoon')]
zipWithIndex()
The zipWithIndex function creates key-value pairs by assigning the index, or numerical position, of
the element as the value, and the element itself as the key.
For example:
rddThreeWords = sc.parallelize(["cat","A","spoon"])
zipWIRdd = rddThreeWords.zipWithIndex()
zipWIRdd.collect()
[('cat', 0), ('A', 1), ('spoon', 2)]
zip()
The zip function creates key-value pairs by taking elements from one RDD as the key and elements of
another RDD as the value. It has the following syntax:
keyRDD.zip(valueRDD)
The API assumes the two RDDs have the same number of elements. If not, it will return an error.
rddThreeWords = sc.parallelize(["cat", "A", "spoon"])
rddThreeNums = sc.parallelize([11, 241, 37])
zipRdd = rddThreeWords.zip(rddThreeNums)
zipRdd.collect()
[('cat', 11), ('A', 241), ('spoon', 37)]
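mapValues()
The examples that follow operate on a Pair RDD named rddMapVals. Its original definition is not shown in this excerpt; a minimal sketch that produces matching output (an assumed reconstruction) uses mapValues(), which applies a function to each value while leaving the keys unchanged, on the zipWIRdd created earlier:
rddMapVals = zipWIRdd.mapValues(lambda val: val + 1)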
rddMapVals.collect()
[('cat', 1), ('A', 2), ('spoon', 3)]
• keys() - returns a list of just the keys in the RDD without any values.
rddMapVals.keys().collect()
['cat', 'A', 'spoon']
• values() - returns a list of just the values in the RDD without any keys.
rddMapVals.values().collect()
[1, 2, 3]
• sortByKey(ascending=False) - sorts the RDD alphanumerically by key. By default, sortByKey() sorts
from smallest to largest key (ascending=True). If ascending is explicitly set to False, it
orders from largest to smallest.
rddMapVals.sortByKey().collect()
[('A', 2), ('cat', 1), ('spoon', 3)]
IMPORTANT:
Without creating a PairRDD prior to using these functions, they will not
work as expected.
The following examples use the keyByRdd Pair RDD created earlier with keyBy():
keyByRdd.collect()
[(3, (1, 2, 3)), (2, (7, 8))]
• lookup(key) - returns a list containing all values for a given key.
keyByRdd.lookup(2)
[(7, 8)]
• countByKey() - returns a count of the number of times each key appears in the RDD (in our
example, there were no duplicate keys, so each is returned as 1).
keyByRdd.countByKey()
defaultdict(<type 'int'>,{2: 1, 3: 1})
• collectAsMap() - collects the result as a map. If multiple values exist for the same key only
one will be returned.
keyByRdd.collectAsMap()
{2: (7, 8), 3: (1, 2, 3)}
Note that these actions did not require us to also specify collect() in order to view the results.
reduceByKey()
The reduceByKey function performs a reduce operation on all elements of a key/value pair RDD that
share a key. For our example here, we'll return to kvRdd that was created using the following code:
rddMary = sc.textFile("filelocation/mary.txt")
rddFlat = rddMary.flatMap(lambda line: line.split(' '))
kvRdd = rddFlat.map(lambda word: (word,1))
The reduceByKey function goes through the elements and if it sees a key that it hasn't already
encountered, it adds it to the list and records the value as-is. If a duplicate key is found, reduceByKey
performs a function on the values of the duplicate keys, keeping only a single instance of the key. In our example, the
anonymous function "lambda a,b: a+b" only kicks in if a duplicate key is found. If so, the
anonymous function tells reduceByKey to take the values of the two keys (a and b) and add them
together to compute a new value for the now-reduced key. The actual function being performed is up
to the developer, but incrementally adding values would be a fairly common task.
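The call being described might look like the following sketch (the kvReduced name matches the explanation below; element ordering in the collect() output may vary):
kvReduced = kvRdd.reduceByKey(lambda a, b: a + b)
kvReduced.collect()
[(u'a', 1), (u'had', 1), (u'little', 1), … (u'Mary', 2), (u'was', 2), (u'lamb', 2)]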
Observe that the keys ‘Mary’, ‘was’, and ‘lamb’ have been reduced
Visually, then, what is happening is that the elements of the RDD are being recorded and passed to the
new kvReduced RDD, with the exception of three keys - 'Mary', 'was', and 'lamb' - which were each found
twice. All other values remain unchanged, but now 'Mary', 'was', and 'lamb' are each reduced to a single key
with a value of 2.
groupByKey()
Grouping values by key allows us to aggregate values based on a key. In order to see this grouping, the
results must be turned into a list before being collected.
For example, let's again use our kvRdd example created with the following code:
rddMary = sc.textFile("filelocation/mary.txt")
rddFlat = rddMary.flatMap(lambda line: line.split(' '))
kvRdd = rddFlat.map(lambda word: (word,1))
Next, we will use groupByKey to group all values that have the same key into an iterable object (that
is, on its own, unable to be viewed directly) and then use a map function to convert these grouped elements into
a readable list:
kvGroupByKey = kvRdd.groupByKey().map(lambda x : (x[0], list(x[1])))
kvGroupByKey.collect()
[(u'a', [1]), (u'lamb', [1, 1]),(u'little', [1]),…(u'Mary',[1, 1])]
If we had simply generated output using groupByKey alone, as below:
kvGroupByKey = kvRdd.groupByKey()
… the output would have looked something like this:
[(u'a', <pyspark.resultiterable.ResultIterable object at 0xde8450>), (u'lamb',
<pyspark.resultiterable.ResultIterable object at 0xde8490>),…(u'Mary',
<pyspark.resultiterable.ResultIterable object at 0xde8960>)]
This tells you that the results are an object that allows iteration, but does not display the individual
elements by default. Using map to list the elements performs the necessary iteration to produce the
desired formatted results.
NOTE:
The groupByKey and reduceByKey functions have significant overlap
and similar capabilities, depending on how the called function is defined
by the developer. When either is able to get the desired output, it is better
to use reduceByKey() as it is more efficient over large datasets.
flatMapValues()
Like the mapValues function, the flatMapValues function performs a function on Pair RDD values,
leaving the keys unchanged. However, in the event it encounters a key that has multiple values, it
flattens those into individual key-value pairs, meaning no key will have more than one value, but you
will end up with duplicate keys in the RDD. Let's start with the RDD we created in the groupByKey()
example:
kvGroupByKey = kvRdd.groupByKey().map(lambda x : (x[0], list(x[1])))
kvGroupByKey.collect()
[(u'a', [1]), (u'lamb', [1, 1]),(u'little', [1]),…(u'Mary',[1, 1])]
Notice that both the 'lamb' and 'Mary' keys contain multiple values in a list. Next, let's create
an RDD that flattens those key-value pairs using the flatMapValues function:
rddFlatMapVals = kvGroupByKey.flatMapValues(lambda val: val)
rddFlatMapVals.collect()
[(u'a', 1), (u'lamb', 1), (u'lamb', 1), (u'little', 1), … (u'Mary',1),
(u'Mary',1)]
Now all key-value pairs have only a single value, and both 'lamb' and 'Mary' exist as duplicated
keys in the RDD.
NOTE:
In the example above, the anonymous function simply returns the original
unedited value. However, like the mapValues function, flatMapValues()
can be configured to modify the values as it flattens out the key-value pairs.
For example, if we had defined the anonymous function as
(lambda val: [v + 1 for v in val]), then each of the values would have
been returned as 2 instead of 1.
subtractByKey()
The subtractByKey function will return key-value pairs containing keys not found in another RDD.
This can be useful when you need to identify differences between keys in two RDDs. Here is an
example:
zipWIRdd = sc.parallelize([("cat", 0), ("A", 1), ("spoon", 2)])
rddSong = sc.parallelize([("cat", 7), ("cradle", 9), ("spoon", 4)])
rddSong.subtractByKey(zipWIRdd).collect()
[('cradle', 9)]
The key-value pair with the key 'cradle' was the only one returned because both RDDs
contained key-value pairs with the keys 'cat' and 'spoon'.
Notice that ('A', 1) was not returned as part of the result. This is because subtractByKey() is
only evaluating matches in the first RDD (the one that precedes the function) as compared to the
second one. It does not return all keys in either RDD that are unique to that RDD. If you wanted to get a
list of all unique keys for both RDDs using subtractByKey(), you would need to run the operation
twice - once as shown, and then again, swapping out the two RDD values in the last line of code. For
example, this code would return the unique key values for zipWIRdd:
zipWIRdd.subtractByKey(rddSong).collect()
[('A', 1)]
If needed, you could store these outputs in two other RDDs, then use another function to combine
them into a single list as desired.
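For example, a minimal sketch (using the two RDDs above; the ordering of the collected results may
vary) that combines the two results into a single list with union():
uniqueToSong = rddSong.subtractByKey(zipWIRdd)
uniqueToZip = zipWIRdd.subtractByKey(rddSong)
uniqueToSong.union(uniqueToZip).collect()
[('cradle', 9), ('A', 1)]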
Pair RDDs also support various join operations. For example, a left outer join keeps every key from the
left RDD and attaches the matching value from the right RDD, or None where no match exists:
rddSong.leftOuterJoin(zipWIRdd).collect()
[('spoon', (4, 2)), ('cradle', (9, None)), ('cat', (7, 0))]
REFERENCE:
There are many other Spark APIs available at:
http://spark.apache.org/docs/<version>/api/
<version> can be an actual version number, such as "1.4.0" or
"1.6.1", or alternatively you can use "latest" to view documentation on
the newest release of Spark. For example:
http://spark.apache.org/docs/latest/api/
Knowledge Check
You can use the following questions and answers as a self-assessment.
Questions
1 ) An RDD that contains elements made up of key-value pairs is sometimes referred to as a
_________________.
3 ) True or False: A key can have a value that is actually a list of many values.
4 ) Since sortByKey() only sorts by key, and there is no equivalent function to sort by values,
how could you go about getting your Pair RDD sorted alphanumerically by value?
5 ) You determine either reduceByKey() or groupByKey() could be used in your program to get
the same results. Which one should you choose?
6 ) How can you use subtractByKey() to determine *all* of the unique keys across two RDDs?
Answers
1 ) An RDD that contains elements made up of key-value pairs is sometimes referred to as a
_________________.
Answer: Pair RDD
3 ) True or False: A key can have a value that is actually a list of many values.
Answer: True
4 ) Since sortByKey() only sorts by key, and there is no equivalent function to sort by values,
how could you go about getting your Pair RDD sorted alphanumerically by value?
Answer: First use map() to reorder the key-value pair so that the key is now the value. Then
use sortByKey() to sort. Finally, use map() again to swap the keys and values back to their
original positions.
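As a sketch of that pattern, using the kvRdd Pair RDD from earlier in this lesson (any Pair RDD would
work the same way):
kvRdd.map(lambda kv: (kv[1], kv[0])) \
     .sortByKey() \
     .map(lambda kv: (kv[1], kv[0])) \
     .collect()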
5 ) You determine either reduceByKey() or groupByKey() could be used in your program to get
the same results. Which one should you choose?
Answer: reduceByKey(), because it is more efficient over large datasets.
6 ) How can you use subtractByKey() to determine *all* of the unique keys across two RDDs?
Answer: Run it twice, switching the order of the RDDs each time.
Summary
• Pair RDDs contain elements made up of key-value pairs
• Common functions used to create Pair RDDs include map(), keyBy(), zipWithIndex(), and
zip()
• Common functions used with Pair RDDs include mapValues(), keys(), values(),
sortByKey(), lookup(), countByKey(), collectAsMap(), reduceByKey(),
groupByKey(), flatMapValues(), subtractByKey(), and various join types.
Lesson Objectives
After completing this lesson, students should be able to:
ü Describe Spark Streaming
ü Create and view basic data streams
ü Perform basic transformations on streaming data
ü Utilize window transformations on streaming data
Spark Streaming
DStreams
A DStream is a collection of one or more specialized RDDs divided into discrete chunks based on a
time interval. When a streaming source communicates with Spark Streaming, the receiver caches
information for a specified time period, after which the data is converted into a DStream and made
available for further processing. Each discrete time period (in the example pictured, every five
seconds) produces a separate RDD within the DStream.
DStreams
DStream Replication
DStreams are fault tolerant, meaning they are written to a minimum of two executors at the moment of
creation. The loss of a single executor will not result in the loss of the DStream.
Receiver Availability
By default, receivers are highly available. If the executor running the receiver goes down, the receiver
will be immediately restarted in another executor.
Spark Streaming performs micro-batching rather than true bit-by-bit streaming. Collecting and
processing data in batches can be more efficient in terms of resource utilization, but comes at a cost
of latency and the risk that small amounts of data could be lost. Spark Streaming can be configured to
process batch sizes as small as one second, which takes approximately another second to process,
for a two-second delay from the moment the data is received until a response can be generated. This
introduces a small risk of data loss, which can be mitigated by the use of reliable receivers (available in
the Scala and Java APIs only at the time of this writing) and intelligent data sources.
Receiver Reliability
By default, receivers are "unreliable." This means there is:
• No acknowledgment between receiver and source
• No record of whether data has been successfully written
• No ability to ask for retransmission for missed data
• Possibility for data loss if receiver is lost
To implement a reliable receiver, a custom receiver must be created. A reliable receiver implements a
handshake mechanism that acknowledges that data has been received and processed. Assuming the
data source is also intelligent, it will not discard the data on its side until this acknowledgement has
been received, which also means it can retransmit the data in the event of loss.
Custom receivers are available in the Scala and Java Spark Streaming APIs only, and are not available
in Python. For more information on creating and implementing custom / reliable receivers, please refer
to Spark Streaming documentation.
REFERENCE:
Additional data sources are available via the Scala and Java APIs. Please
refer to the Spark Streaming documentation for additional information.
Visit http://spark.apache.org/documentation.html
Basic Streaming
StreamingContext
Spark Streaming extends the Spark Core architecture model by layering in a StreamingContext on
top of the SparkContext. The StreamingContext acts as the entry point for streaming applications. It
is responsible for setting up the receiver and enables real-time transformations on DStreams. It also
produces various types of output.
The StreamingContext
Launch StreamingContext
To launch the StreamingContext, you first need to import the StreamingContext API. In pyspark,
the code to perform this operation would be:
from pyspark.streaming import StreamingContext
Next, you create an instance of the StreamingContext. When doing so, you supply the name of the
SparkContext in use, as well as the time interval (in seconds) for the receiver to collect data for
micro-batch processing. When using a REPL, the SparkContext will be named sc by default. For
example, when creating an instance of StreamingContext named ssc, in the pyspark REPL, with a
micro-batch interval of one second, you would use the following code:
ssc = StreamingContext(sc, 1)
IMPORTANT:
This operation will return an error if the StreamingContext API has not
been imported.
Both the name of the StreamingContext instance and the time interval can be modified to fit your
purposes. Here is an example of creating a StreamingContext instance with a 10-second micro-
batch interval:
sscTen = StreamingContext(sc, 10)
IMPORTANT:
While multiple instances of StreamingContext can be defined, only a
single StreamingContext can be active per JVM at a time. Once one is
running, another instance will fail to launch. In fact, once the current
instance has been stopped, it cannot be restarted in the same JVM. Thus, while the
REPL is a useful tool for learning and perhaps testing Spark Streaming
applications, in production it would be problematic because every time a
developer wanted to test a slightly different application, it would require
stopping and restarting the REPL itself.
NOTE:
If the HDFS directory exists on the cluster the application is attached to,
only the path needs to be provided. If it exists on a separate cluster,
prepend the path with "hdfs://namenode:8020/…"
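The directory-monitoring example referenced above is not reproduced in this excerpt; a minimal
sketch, assuming the ssc instance created earlier and a placeholder HDFS directory, would be:
hdfsInputDS = ssc.textFileStream("someHDFSdirectory")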
To create a stream by monitoring a TCP socket source, choose a variable name for the DStream, then
call the socketTextStream() function and supply the source hostname or IP address (whichever is
appropriate in your situation) and the port number to monitor. For example:
tcpInputDS = ssc.socketTextStream("someHostname", portNumber)
Notice that in both examples, the name of the StreamingContext instance had to be specified before
calling the function. Otherwise, the application would have tried to call this from the default
SparkContext, and since these functions do not exist outside of Spark Streaming, an error would have
been returned.
DStream Transformations
Transformations allow modification of DStream data to create new DStreams with different output.
DStream transformations are similar in nature and scope to traditional RDD transformations. In fact,
many of the same functions in Spark Core also exist in Spark Streaming. The following functions
should look familiar:
• map()
• flatMap()
• filter()
• repartition()
• union()
• count()
• reduceByKey()
• join()
reduceByKey()
Also, like traditional RDDs, key-value pair DStreams can be reduced using the reduceByKey function.
Here's an example:
# pyspark --master local[2]
>>> sc.setLogLevel("ERROR")
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)
>>> hdfsInputDS = ssc.textFileStream("someHDFSdirectory")
>>> kvPairDS = hdfsInputDS.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
>>> kvReduced = kvPairDS.reduceByKey(lambda a,b: a+b)
>>> kvReduced.pprint()
>>> ssc.start()
This coding pattern is often used to write word count applications.
Window Transformations
Checkpointing
Checkpointing is used in stateful streaming operations to maintain state in the event of system failure.
To enable checkpointing, you can simply specify an HDFS directory to write checkpoint data to using
the checkpoint function. For example:
ssc.checkpoint("someHDFSdirectory")
Trying to write a stateful application without specifying a checkpoint directory will result in an error
once the application is launched.
NOTE:
Technically, you could also process a 15-second window in 15-second
intervals, however this is functionally equivalent to setting the
StreamingContext interval to 15 seconds and not using the window
function at all.
IMPORTANT:
For basic inputs, window() does not work as expected using
textFileStream(). An application will process the first file stream
correctly, but then lock up when a second file is added to the HDFS
directory. Because of this, all course labs and examples will use the
socketTextStream function.
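For reference, here is a minimal sketch of the basic window() transformation, assuming a one-second
StreamingContext named ssc and the TCP socket source used elsewhere in this lesson. It evaluates a
15-second window every five seconds:
ssc.checkpoint("someHDFSdirectory")              # set a checkpoint directory for stateful window operations
tcpInDS = ssc.socketTextStream("sandbox", 9999)
windowedDS = tcpInDS.window(15, 5)               # window length of 15 seconds, sliding every 5 seconds
windowedDS.pprint()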
reduceByKeyAndWindow()
You can also work with key-value pair windows, and there are specialized functions designed to do
that. One such example is reduceByKeyAndWindow(), which behaves similarly to the reduceByKey
function discussed previously, but over a specified window and collection interval. For example, take a
look at the following application:
# pyspark --master local[2]
>>> sc.setLogLevel("ERROR")
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 1)
>>> ssc.checkpoint("/user/root/test/checkpoint/")
>>> tcpInDS = ssc.socketTextStream("sandbox",9999)
>>> redPrWinDS = tcpInDS.flatMap(lambda line: line.split(" ")).map(lambda word:
(word, 1)).reduceByKeyAndWindow(lambda a,b: a+b, lambda a,b: a-b, 10, 2)
>>> redPrWinDS.pprint()
>>> ssc.start()
To generate the reduced key-value pair, the DStream is transformed using flatMap(), then converted
to a key-value pair using map(). Then, the reduceByKeyAndWindow function is called.
Notice that the reduceByKeyAndWindow function actually takes two functions as arguments prior to
the window size and interval arguments. The first argument is the function that should be applied to the
DStream. The second argument is the *inverse* of the first function, and is applied to the data that has
fallen out of the window. The value of each window is calculated incrementally as the window slides
across the DStream, without having to recompute all of the batches in the window each time.
Knowledge Check
Questions
1 ) Name the two new components added to Spark Core to create Spark Streaming.
2 ) If an application will ingest three streams of data, how many CPU cores should it be allocated?
3 ) Name the three basic streaming input types supported by both Python and Scala APIs.
4 ) What feature must be enabled in order to use stateful operations such as window transformations?
Answers
1 ) Name the two new components added to Spark Core to create Spark Streaming.
Answer: The streaming data receiver and the DStream.
2 ) If an application will ingest three streams of data, how many CPU cores should it be allocated?
Answer: Four - one for each of the three receivers (one per stream), plus at least one for processing.
3 ) Name the three basic streaming input types supported by both Python and Scala APIs.
Answer: HDFS text via directory monitoring, text via TCP socket monitoring, and queues of
RDDs.
4 ) What feature must be enabled in order to use stateful operations such as window transformations?
Answer: Checkpointing.
Summary
• Spark Streaming is an extension of Spark Core that adds the concept of a streaming data
receiver and a specialized type of RDD called a DStream.
• DStreams are fault tolerant, and receivers are highly available.
• Spark Streaming utilizes a micro-batch architecture.
• Spark Streaming layers in a StreamingContext on top of the Spark Core SparkContext.
• Many DStream transformations are similar to traditional RDD transformations
• Window functions allow operations across multiple time slices of the same DStream, and are
thus stateful and require checkpointing to be enabled.
Lesson Objectives
After completing this lesson, students should be able to:
ü List various components of Spark SQL and explain their purpose
ü Describe the relationship between DataFrames, tables, and contexts
ü Use various methods to create and save DataFrames and tables
ü Manipulate DataFrames and tables
DataFrames
A DataFrame is data that has been organized into one or more columns, similar in structure to an SQL
table, but is actually constructed from underlying RDDs. DataFrames can be created directly from
RDDs, as well as from Hive tables and many other outside data sources.
There are three primary methods available to interact with DataFrames and tables in Spark SQL:
• The DataFrames API, which is available for Java, Scala, Python, and R developers
• The native Spark SQL API, which is composed of a subset of the SQL92 API commands
• The HiveQL API. Most of the HiveQL API is supported in Spark SQL.
Hive
Most enterprises that have deployed Hadoop are familiar with Hive which is the original data
warehouse platform developed for Hadoop. It represents unstructured data stored in HDFS as
structured tables using a metadata overlay managed by Hive's HCatalog, and can interact with those
tables via HiveQL, its SQL-like query language.
Hive is distributed with every major Hadoop distribution. Massive amounts of data are currently
managed by Hive across the globe. Thus, Spark SQL's ability to integrate with Hive and utilize HiveQL
capabilities and syntax provides massive value for the Spark developer.
Hive data starts as raw data that has been written to HDFS. Hive has a metadata component that
logically organizes these unstructured data files into rows and columns like a table. The metadata layer
acts as a translator, enabling SQL-like interactions to take place even though the underlying data on
HDFS remains unstructured.
DataFrame Visually
In much the same way, a DataFrame starts out as something else - perhaps an ORC or JSON file,
perhaps a list of values in an RDD, or perhaps a Hive table. Spark SQL has the ability to take these (and
other) data sources and convert them into a DataFrame. As mentioned earlier, DataFrames are actually
RDDs, but are represented logically as rows and columns. In this sense, Spark SQL behaves in a
similar fashion to Hive, only instead of representing files on a disk as tables like Hive does, Spark SQL
represents RDDs in memory as tables.
Spark SQL uses an optimizer called Catalyst. Catalyst accelerates query performance via an
extensive built-in, extensible, catalog of optimizations that goes through a logical process and builds a
series of optimized plans for executing a query. This is followed by an intelligent, cost-based modeling
and plan selection engine, which then generates the code required to perform the operation.
This provides numerous advantages over core RDD programming. It is simpler to write an SQL
statement to perform an operation on structured data than it is to write a series of filter(),
group(), and other calls. Not only is it simpler, executing queries using Catalyst provides performance
that matches or outperforms equivalent core RDD code nearly 100% of the time.
Spark SQL makes managing and processing structured data easier, and it provides performance
improvements as well.
In this code, a CSV file is converted to an RDD named eventsFile using sc.textFile. Next, a
schema named Event is created which labels each column and sets its type.
Then a new RDD named eventsRDD is generated which takes the content of the eventFile RDD and
transforms the elements according to the Event schema, casting each column and reformatting data as
necessary.
The code then counts the rows of eventsRDD - presumably as some kind of verification that the
operation was successfully performed.
The final two steps are performed on a single line of code. First eventsRDD is converted to an
unnamed DataFrame, which is then immediately registered as a temporary table named
enrichedEvents.
IMPORTANT:
Some of the techniques shown here to format a text file for use in a DataFrame are
beyond the scope of this class, but various references exist online on how to accomplish
this.
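The screenshot itself is not reproduced here; a minimal pyspark sketch of the same pattern, in which
the file name, column names, and column types are assumptions made purely for illustration, might
look like the following:
from pyspark.sql import Row

eventsFile = sc.textFile("events.csv")                # hypothetical CSV input path

def toEvent(line):
    # "Event" schema: label each column and cast it to an appropriate type
    cols = line.split(",")
    return Row(eventId=int(cols[0]), eventType=cols[1], value=float(cols[2]))

eventsRDD = eventsFile.map(toEvent)
eventsRDD.count()                                     # verify the conversion succeeded

sqlContext.createDataFrame(eventsRDD).registerTempTable("enrichedEvents")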
This temporary table can also be permanently converted to a permanent table in Hive. Making the table
part of Hive's managed data has the added benefit of making the table available across multiple Spark
SQL contexts.
This Row object has an implied schema of two columns - named code and value - and there are two
records for code AA with a value of 150000 and code BB with a value of 80000. Because we do not
need to work with the RDD directly, we immediately convert this collection of Row objects to a
DataFrame using toDF(). We then visually verify that the Row objects were converted to a DataFrame
using show(), and that the schema was applied correctly using printSchema().
The DataFrame is registered as a temporary table named test4, which is then converted to a
permanent Hive table named permab. We then run SHOW TABLES to view the tables available for SQL
to manipulate.
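A minimal pyspark sketch of the code that screenshot describes, assuming a HiveContext named
sqlContext, might be:
from pyspark.sql import Row

rowRDD = sc.parallelize([Row(code="AA", value=150000), Row(code="BB", value=80000)])
df = rowRDD.toDF()          # convert the collection of Row objects to a DataFrame
df.show()                   # visually verify the rows
df.printSchema()            # verify the implied two-column schema

df.registerTempTable("test4")                                  # temporary table
sqlContext.sql("CREATE TABLE permab AS SELECT * FROM test4")   # permanent Hive table
sqlContext.sql("SHOW TABLES").show()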
SHOW TABLES returned two Hive tables - permab and permenriched - and the temporary table
test4. However, if we return to our other SQL context and run SHOW TABLES, we see the two Hive
tables, but we do not see test4.
In addition, this SQL context can still see the enrichedevents temporary table, which was not visible
to our other SQL context.
The code in this screenshot performs an operation similar to the IoT example, but on a smaller scale,
and using Row objects in Scala rather than Python. Some of the key differences between this and the
previous example include the definition of the DataSample schema class prior to creation of the df2
RDD, with a different syntax during creation of the RDD.
As in the previous example, this RDD is immediately converted to a DataFrame, then registered as a
temporary table which is converted to a Hive table.
NOTE:
The only temporary table visible when SHOW TABLES is executed is the temporary table
created by this instance of the SQL context. All of the Hive tables are returned, as per
previous examples.
A key concept when writing multiple applications and utilizing multiple Spark SQL contexts across a
cluster is that registering a temporary table makes it available for either DataFrame API or SQL
interactions while operating in that specific context.
However, those DataFrames and tables are only available within the Spark SQL context in which
they were created.
To make a table visible across Spark SQL contexts, you should store that table permanently in Hive,
which makes it available to any HiveContext instance across the cluster.
createDataFrame():
dataframeX = sqlContext.createDataFrame("rddName")
If an RDD is properly formatted but lacks a schema, in Python createDataFrame() can be used to
infer the schema on DataFrame creation. (Scala lacks an easy, "on the fly" way to accomplish this.)
rddName = sc.parallelize([('AA', 150000), ('BB', 80000)])
dataframeX = sqlContext.createDataFrame(rddName, ['code', 'value'])
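The command referenced in the next paragraph is not shown in this excerpt; one way to accomplish it,
assuming a HiveContext named sqlContext and a registered temporary table named table1, is:
sqlContext.sql("CREATE TABLE table1hive AS SELECT * FROM table1")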
This creates a table named table1hive in Hive, copying all of the contents from temporary table
table1. The following screenshot demonstrates that table1hive is now registered as a permanent
table.
Save Modes
By default, if a write() is used and the file already exists, an error will be returned. This is because of
the default behavior of save modes. However, this default can be modified. Here are the possible
values for save mode when writing a file and their definitions:
• ErrorIfExists: Default mode, returns an error if the data already exists
• Append: Appends data to file or table if it already exists
• Overwrite: Replaces existing data if it already exists
• Ignore: Does nothing if the data already exists
For example, if you were using the write() command from before to save an ORC file and you
wanted the data to be overwritten / replaced if a file by the same name already existed, you would use
the following code:
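The code itself is not reproduced here; a minimal sketch, in which the DataFrame name and output
path are illustrative assumptions, would be:
dataframeORC.write.format("orc").mode("overwrite").save("dfsamp.orc")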
REFERENCE:
For a complete list of supported file types for direct import into DataFrames, please
refer to Spark SQL documentation. (http://spark.apache.org/documentation.html)
The syntax is similar to the write() command used before, only read() is used, and an appropriate
file is loaded. For example, to use the JSON file created earlier to create a DataFrame:
dataframeJSON = sqlContext.read.format("json").load("dfsamp.json")
NOTE:
If you peruse the documentation, you will note that some file formats have read()
shortcuts - for example: read.json instead of read.format("json"). We do not
demonstrate them in class because they are not consistent across all supported file
types, however if a developer works primarily with JSON files on a regular basis, using
the read.json shortcut may be beneficial.
The filter() function returns a DataFrame with only rows that have column values that meet a
defined criteria - in the screenshot, only rows that had values less than 100,000 were returned.
The limit() function returns a DataFrame with the first n rows of the DataFrame. In the example, only
the first row was returned.
The drop() function returns a DataFrame without specific columns included. Think of it as the
opposite of the select() function.
The groupBy() function groups rows by matching column values, and can then perform other
functions on the combined rows such as count().
The screenshot shows two examples. In the first one, the code column is grouped and the number of
matching values are counted and displayed in a separate column. In the second one, the values
column is scanned for matching values, and then the sums of the identical values are displayed in a
separate column.
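The screenshots are not reproduced here; a minimal pyspark sketch of these functions, using the
two-column (code, value) DataFrame from the earlier examples (referred to here as df), might look like:
df.filter(df.value < 100000).show()      # only rows with a value under 100,000
df.limit(1).show()                       # only the first row
df.drop("value").show()                  # every column except "value"
df.groupBy("code").count().show()        # number of rows per distinct code
df.groupBy("value").sum("value").show()  # sum of the matching values in each group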
IMPORTANT:
In the screenshot provided, these functions are shown prepended with print. However,
the print command is not required for Scala, nor is it required for Python when using the
REPL.
Technically, these functions should probably not have required the print function in
order to produce output either, but via trial and error testing we discovered that they
worked when print was supplied in Zeppelin. In addition, at the time of this writing, a
handful of pyspark functions did not operate correctly *at all* when run inside Zeppelin,
even in conjunction with the print command.
Some examples include first(), collect(), and columns(). This is likely the result
of a bug in the version of Zeppelin used to write this course material and may no longer
be the case by the time you are reading this.
For additional DataFrames API functions, please refer to the online Apache Spark SQL
DataFrames API documentation. Testing these pyspark functions without the print
command will likely result in success in future implementations.
Knowledge Check
Questions
1 ) While core RDD programming is used with [structured/unstructured/both] data, Spark SQL is
used with [structured/unstructured/both] data.
2 ) True or False: Spark SQL is an extra layer of translation over RDDs. Therefore while it may be
easier to use, core RDD programs will generally see better performance.
3 ) True or False: A HiveContext can do everything that an SQLContext can do, but provides
more functionality and flexibility.
7 ) Name two file formats that Spark SQL can use without modification to
create DataFrames.
Answers
1 ) While core RDD programming is used with [structured/unstructured/both] data, Spark SQL is
used with [structured/unstructured/both] data.
Answer: Core RDD programming is used with both structured and unstructured data, while Spark
SQL is used with structured data.
2 ) True or False: Spark SQL is an extra layer of translation over RDDs. Therefore while it may be
easier to use, core RDD programs will generally see better performance.
Answer: False. The Catalyst optimizer means Spark SQL programs will generally outperform
core RDD programs
3 ) True or False: A HiveContext can do everything that an SQLContext can do, but provides
more functionality and flexibility.
Answer: True
Answer: False. Temporary tables are only visible to the context that created them.
Answer: On Disk
7 ) Name two file formats that Spark SQL can use without modification to
create DataFrames.
Answer: The ones discussed in class were ORC, JSON, and parquet files.
Summary
• Spark SQL gives developers the ability to utilize Spark's in-memory processing capabilities on
structured data
• Spark SQL integrates with Hive via the HiveContext, which broadens SQL capabilities and
allows Spark to use Hive HCatalog for table management
• DataFrames are RDDs that are represented as table objects which can be used to create tables
for SQL interactions
• DataFrames can be created from and saved as files such as ORC, JSON, and parquet
• Because of Catalyst optimizations of SQL queries, SQL programming operations will generally
outperform core RDD programming operations
Lesson Objectives
After completing this lesson, students should be able to:
ü Explain the purpose and benefits of data visualization
ü Perform interactive data exploration using visualization in Zeppelin
ü Collaborate with other developers and stakeholders using Zeppelin
Data Visualizations
Because of Zeppelin's direct integration with Spark, its flexibility in terms of supported languages, and
its collaboration and reporting capabilities, the rest of this lesson explores how to use this tool to
greatest effect.
Keep in mind, however, that Zeppelin also supports HTML and JavaScript, and can also work with
other data visualization libraries available to Python, Java, and other languages. If Zeppelin's built-in
capabilities don't quite meet your needs, you always have the ability to expand on them.
Zeppelin's built-in chart types include the bar chart, pie chart, area chart, and line chart.
Visualizations on DataFrames
Zeppelin can also provide visualizations on DataFrames that have not been converted to SQL tables by
using the following command:
z.show(DataFrameName)
This command tells Zeppelin to treat the DataFrame like a table for visualization purposes. Since a
DataFrame is already formatted like a table, the command should work without issue on every
DataFrame.
This screenshot shows the output Zeppelin produces when %table is not part of the command.
Zeppelin then displays the content as a table, with supporting data visualizations available as below.
If the data is not formatted correctly, Zeppelin would simply return the string as a table name with no
data.
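For reference, a minimal sketch of the %table display convention in a pyspark paragraph, using made-up
column names and values: the output string starts with %table, columns are separated by tabs, and
rows are separated by newlines.
print("%table name\tbalance\nAlice\t45000\nBob\t80000")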
For example, in the previous screenshot, the SQL command displays a visualization for all columns
and rows.
However, in the second screenshot, the query was updated to only include rows with a value in the age
column that exceeded 45.
This provides you with the ability to manipulate the chart output in a number of ways without requiring
you to modify the initial query.
In the example above, we see that the default chart uses age as a key, and sums the balances for all
persons of a given age as the value.
This was done automatically, without any grouping or sum command as part of the SQL statement
itself.
The pivot chart feature allows you to change the action performed on the Values column selected.
Click on the box (in the screenshot, the one that says balance) and a drop-down menu of options
appear which can be used to change the default value action. Options include SUM, AVG, COUNT,
MIN, and MAX.
To remove a column, click the "x" to the top right of the name box and it will disappear.
If either the Key or Value field is blank, the output indicates that there is no data available.
Then you simply drag and drop the field you want as a value into the appropriate box and the output
refreshes to match.
In this example, we elected to use the age column for both Keys and Values, and used the COUNT
feature to count the number of individuals in each age category.
In this example, the marital column was defined as a grouping, and therefore every unique value in that
column (married, single, or divorced) became its own bar color in the bar chart.
Dynamic Forms
Dynamic Forms give you the ability to define a variable in the query or command and allow that value
to be dynamically set via a form that appears above the output chart. These can be done in various
programming languages. For SQL, you would use a WHERE clause and then specify the column name,
some mathematical operator, and then a variable indicated by a dollar sign with the form name and the
default value specified inside a pair of curly braces.
SELECT * FROM table WHERE colName [mathOp] ${LabelName=DefValue}
In the following screenshot, we select the age column, where age is greater than or equal to the
variable value, label the column Minimum Age and set the default value to 0 so that all values will
appear by default.
Then, in the resulting dynamic form, we set the minimum age to 45 and press enter, which results in
the chart updating to reflect a minimum age of 45 in the output.
Multiple Variables
Multiple variables can be included as dynamic forms.
In this example, the WHERE clause has been extended with an AND operator, so both a minimum age of
0 and a maximum age of 100 are set as defaults. The user then sets the minimum age value to 30 and
the maximum age to 55 and presses enter, resulting in the underlying output changing to meet those
criteria.
Select Lists
Dynamic forms can also include select lists (a.k.a. drop-down menus). The syntax for a select list within
a WHERE clause would be:
... WHERE colName = "${LabelName=defaultLabel,opt1|opt2|opt3|…}"
In the example shown, the marital column is specified, a variable created, and within the variable
definition we specify the default value for marital = married.
Then, insert a comma, and provide the complete list of options you wish to provide separated by the |
(pipe) character - in our case, married, single, and divorced.
The result is a new dynamic form, and the output will respond to changes in the drop-down menu.
• Export
Downloads a copy of the note to the local file system in JSON format. You can export a note by
clicking on the button labeled "Export the notebook" at the top of the note.
Importing Notes
Exporting a note also gives you the ability to share that file with another developer, which they can
then import into their own notebook from the Zeppelin landing page by clicking on "Import note."
Note Cleanup
Often note development will be a series of trial and error approaches, comparing methods to pick the
best alternative. This can result in a notebook that contains paragraphs that you don't want to keep, or
don't want distributed to others for sake of clarity. Fortunately cleaning up a note prior to distribution is
relatively easy.
Individual paragraphs that are no longer needed can be removed/deleted from the note. In the
paragraph, click on the settings button (gear icon) and select remove to delete it.
Paragraphs can also be moved up or down in the note and new paragraphs can be inserted (for
example, to add comments in Markdown format describing the flow of the note).
Formatting Notes
Note owners can control all paragraphs at the note level, via a set of buttons at the top of the note.
These controls include:
• Hide/Show all code via the button labeled "Show/hide the code,"
• Hide/Show all output via the button labeled "Show/hide the output" (which changes from an
open book to a closed book icon based on the current setting), and
• Clear all output via the eraser icon button labeled "Clear output."
Running all of a note's paragraphs can also be scheduled to occur on a regular basis using the
scheduling feature, which is enabled by clicking the clock icon button labeled "Run scheduler." This
allows you to schedule the
note to run at regular intervals including every minute, every five minutes, every hour, and so on up to
every 24 hours via preset links that can simply be clicked to activate. If these options are not granular
enough for you, you can also schedule the note at a custom interval by supplying a Cron expression.
Paragraph Formatting
Paragraphs can also be formatted prior to distribution on an individual basis. These settings are
available in the buttons menu at the top right of each paragraph, as well as underneath the settings
menu (gear icon) button.
Formatting options that were also available at the note level include:
• Hide/Show paragraph code,
• Hide/Show paragraph output, and
• Clear paragraph output (only available under settings).
Paragraph Enhancements
The visual appearance of paragraphs can be enhanced to support various collaboration goals. Such
enhancements include:
• Setting paragraph width
• Showing paragraph title
• Showing line numbers
Width
Example:
Let's assume you want to create a dashboard within a Zeppelin note, showing multiple views of the
same data on the same line.
This can be accomplished by modifying the Width setting, found in paragraph settings. By default, the
maximum width is used per paragraph, however, this can be modified so that two or more paragraphs
will appear on the same line.
Show Title
Paragraphs can be given titles for added clarity when viewing output. To set a title, select Show title
under paragraph settings. The default title is "Untitled."
Click on the title to change it, type the new title, and press the Enter key to set it.
Line Numbers
Paragraphs displaying code can also be enhanced by showing line numbers for each line of code.
To turn on this feature, select Show line numbers under paragraph settings.
The numbers will appear to the left of the code lines. Lines that are wrapped based on the width of the
paragraph will only be given a single number, even though on the screen they will appear as multiple
lines.
Sharing Paragraphs
Individual paragraphs can be shared by generating a link, which can be used as an iframe or otherwise
embedded in an external-to-Zeppelin report. To generate this URL, select Link this paragraph under
paragraph settings.
This will automatically open the paragraph in a new browser tab, and the URL can be copied and
pasted into whatever report or web page is needed.
If dynamic forms have been enabled for this note, anyone who modifies the form values will change the
appearance of the paragraph output for everyone looking at the link. This can be a valuable tool if, for
example, a marketing department wants to generate multiple outputs based on slight tweaks to the
query. You can allow them to do this without giving them access to the entire note, and without the
need to modify the code on the backend.
Any changes to the code, as well as changes to dynamic forms input, will not change the output
presented as long as the Disable run option is selected.
Knowledge Check
Use the following questions to assess your understanding of the concepts presented in this lesson.
Questions
1 ) What is the value of data visualization?
3 ) How do you share a copy of your note (non-collaborative) with another developer?
6 ) Which paragraph feature provides the ability for an outside person to see a paragraph's output
without having access to the note?
7 ) What paragraph feature allows you to give outside users the ability to modify parameters and
update the displayed output without using code?
Answers
1 ) What is the value of data visualization?
Answer: Enable humans to make inferences and draw conclusions about large sets of data
that would be impossible to make by looking at the data in tabular format.
Answer: Five
3 ) How do you share a copy of your note (non-collaborative) with another developer?
Answer: Export the note to a JSON file, then have the other developer import it from the Zeppelin
landing page using "Import note."
6 ) Which paragraph feature provides the ability for an outside person to see a paragraph's output
without having access to the note?
Answer: Link this paragraph, which generates a URL that can be embedded in an external report or
web page.
7 ) What paragraph feature allows you to give outside users the ability to modify parameters and
update the displayed output without using code?
Answer: Dynamic Forms.
Summary
• Data visualizations are important when humans need to draw conclusions about large sets of
data
• Zeppelin provides support for a number of built-in data visualizations, and these can be
extended via visualization libraries and other tools like HTML and JavaScript
• Zeppelin visualizations can be used for interactive data exploration by modifying queries, as
well as the use of pivot charts and implementation of dynamic forms
• Zeppelin notes can be shared via export to a JSON file or by sharing the note URL
• Zeppelin provides numerous tools for controlling the appearance of notes and paragraphs
which can assist in communicating important information
• Paragraphs can be shared via a URL link
• Paragraphs can be modified to control their appearance and assist in communicating
important information
Lesson Objectives
After completing this lesson, students should be able to:
ü Describe the components of a Spark job
ü Explain default parallel execution for stages, tasks, across CPU cores
ü Monitor Spark jobs via the Spark Application UI
Spark applications require a Driver, which in turn loads and monitors the SparkContext. The
SparkContext is then responsible for launching and managing Spark jobs. But what do we mean
when we say job? When you type a line of code to use a Spark function, such as flatMap(),
filter(), or map(), you are defining a Spark task which must be performed. A task is a unit of work,
or "a thing to be done." When you put one or more tasks together with a resulting action task - such as
collect() or save() - you have defined a Spark job. A job, then, is a collection of tasks (or things to
be done) culminating in an action.
NOTE:
Not explicitly called out here: a Spark application can consist of one or more Spark jobs.
Every action is considered part of a unique job, even if the action is the only task being
performed.
Job Stages
A Spark job can be made up of several types of tasks. Some tasks don't require that any data be moved
from one executor to another in order to finish processing. This is referred to as a "narrow" operation,
or one that does not require a data "shuffle" in order to execute.
Transformations that do require that data be moved between executors are called "wide" operations,
and that movement of data is called a shuffle.
When executed, Spark will evaluate tasks that need to be performed and break up a job at any point
where a shuffle will be required. While non-shuffle operations can happen somewhat asynchronously if
needed, a task that follows a shuffle *must* wait for the shuffle to complete before executing.
This break point, where processing must complete before the next task or set of tasks is executed, is
referred to as a stage. A stage, then, can be thought of as "a logical grouping of tasks" or things to be
done. A shuffle is a task requiring that data between RDD partitions be aggregated (or combined) in
order to produce a desired result.
Parallel Execution
Spark jobs are automatically optimized via parallel execution at different levels.
However, not all stages are dependent on one another. For example, in this job Stage 1 has to run first,
but once it has completed there are three other stages (two, four, and seven) that can begin execution.
Operating in this fashion is known as a Directed Acyclic Graph, or DAG. A DAG is essentially a
logical ordering of operations (in the case of our discussion here, Spark stages) based on
dependencies. Since there is no reason for Stage 7 to wait for all of the previous six stages to
complete, Spark will go ahead and execute it immediately after Stage 1 completes, along with Stage 2
and Stage 4.
This parallel operation based on logical dependencies allows, in some cases, for significantly faster job
completion across a cluster compared to platforms that require stages to complete, one at a time, in
order.
The tracking and managing of these stages and their dependencies is managed by a Spark component
known as a DAG Scheduler. It is the DAG scheduler that tells Spark which stages (sets of tasks) to
execute and in what order. The DAG Scheduler ensures that dependencies are met and that any
dependent stages have completed prior to the next stage beginning execution.
Task Steps
A task is actually a collection of three separate steps. When a task is scheduled, it must first fetch
the data it will need - either from an outside source, or perhaps from the results of a previous task.
Once the data has been collected, the operation that the task is to do on that data can execute. Finally,
the task produces some kind of output, either as an action, or perhaps as an intermediate step for a
task to follow.
Tasks can begin execution once data has started to be collected. There is no need for the entire set of
data to be loaded prior to performing the task operation. Therefore, execution begins as soon as the
first bits of data are available, and can continue in parallel while the rest of the data is being fetched.
Furthermore, the output production step can begin as soon as the first bits of data have been
transformed, and can theoretically be happening while the operation is being executed and while the
rest of the data is being fetched. In this manner, all three steps of a task can be running at the same
time, with the execute phase starting shortly after the fetch begins, and the output phase starting
shortly after the execute phase begins. In terms of completion, the fetch will always complete first, but
the execute can finish shortly thereafter, with the output phase shortly after that.
Spark Application UI
Now that we've explored the anatomy of a Spark job and understand how they are executed on the
cluster, let's take a look at monitoring those jobs and their components via the Spark Application UI.
Spark Application UI
The Spark Application UI is a web interface generated by a SparkContext. It is therefore available
for the life of the SparkContext. Once the SparkContext has been shut down, the Spark Application
UI will no longer be available.
You access the Spark Application UI by default via your Driver node at port 4040.
Every SparkContext instance manages a separate Spark Application UI instance. Therefore, if
multiple SparkContext instances are running on the same system, multiple Spark Application UIs will
be available. Since they cannot share a port, when a SparkContext launches and detects an existing
Spark Application UI, it will generate its own version of the monitoring tool at the next available port
number, incremented by 1. Therefore, if you are running Zeppelin and it has created a Spark
Application UI instance at port 4040, and then you launch an instance of the PySpark REPL in a
terminal on the same machine, the REPL version of the monitoring site will exist at port 4041 instead of
4040. A third SparkContext would create the UI at port 4042, and so on.
Once a SparkContext is exited, that port number becomes available. Therefore, if you exited Zeppelin
(using port 4040) and opened another REPL, it would create its Spark Application UI at port 4040. Any
other SparkContext instances still running would keep the port numbers they were assigned when they
started.
The Spark UI landing page opens up to a list of all of the Spark jobs that have been run by this
SparkContext instance. You can see information about the number of jobs completed, as well as
overview information for each job in terms of ID, description, when it was submitted, how long it took
to execute, how many stages it had and how many of those were successful, and the number of tasks
for all of those stages (and how many were successful.)
Clicking on a job description link will result in a screen providing more detailed information about that
particular job.
NOTE:
The URL - which was typed as "sandbox:4040" - was redirected to port 8088. Port
8088 is the YARN ResourceManager UI, which tracks and manages all YARN jobs. This
means that in this instance, Zeppelin has been configured to run on (and be managed
resource-wise by) YARN.
On the Spark Application UI landing page, you will notice a link called Event Timeline. Clicking on this
link results in a visualization that shows executors being added and removed from the cluster, as well
as jobs being tracked and their current status. Enabling zoom allows you to see more granular detail,
which can be particularly helpful if a large number of jobs have been executed over a long period of
time for this SparkContext instance.
Job View
Clicking on a job description on the Jobs landing page takes you to the "Details for Job nn" page (in
our example Job 11). Here you can see more specific information about the job stages, including
description, when they were submitted, how long they took to run, how many tasks succeeded out of
how many attempted, the size of the input and output of each stage, and how much data shuffling
occurred between stages.
Clicking on the Event timeline once again results in a visualization very similar to the one on the landing
page, but this time only for the stages of that particular job instead of all jobs.
Job DAG
Clicking the DAG Visualization link results in a visual display of the stages (red outline boxes) and
the flow of tasks each contains, as well as the dependencies between the stages.
Stage View
At the top of the window, to the right of the Jobs tab you will see a tab called Stages.
Clicking on the Stages tab will result in a screen similar to the Jobs landing page, but instead of
tracking activity at the job level it tracks it at the stage level, providing pertinent high-level information
about each stage.
Stage Detail
Stage DAG
Clicking on the DAG Visualization link from this page results in a visual display of the operations within
the stage in a DAG formatted view.
DAG Visualization
Clicking on the Show Additional Metrics link allows you to customize the display of information
collected in the table below. Hovering over the metric will result in a brief description of the information
that metric can provide.
This can be particularly useful when troubleshooting and determining the root cause of performance
problems for an application.
The Stage Detail page also provides an Event Timeline visualization, which breaks down tasks
and types of tasks performed across the executors it utilized.
Task List
At the bottom of the Stage Detail page is a textual list of the tasks performed, as well as various
information on them including the ID, status, executor and host, and duration.
Executor View
The last standard tab (visible regardless of what kind of jobs have been performed) is the Executor
tab. This shows information about the executors that have been used across all Spark jobs run by this
SparkContext instance, as well as providing links to logs and thread dumps, which can be used for
troubleshooting purposes as well.
SQL View
When you run a Spark job that uses one of the Spark modules, another tab appears at the top of the
window that provides module-specific types of information for those jobs. In this screenshot, we see
that a Spark SQL job has been executed, and that a tab labeled SQL has appeared at the top-right. The
information provided is in terms of queries rather than jobs (although the corresponding Spark job
number is part of the information provided.)
Clicking on a query description will take you to a Details for Query "X" page, which will show a DAG of
the operations performed as part of that query.
Below the query DAG, there is a Details link which - when clicked - provides a text-based view of the
details of the query.
Streaming Tab
When running Spark Streaming jobs, a Streaming tab will appear in the Spark Application UI. In this
view we can see details about each job Spark runs, each one equating to the collection and processing
of a DStream batch.
Streaming View
Clicking on a streaming job description link results in a page that shows streaming statistics charts for
a number of different metrics. Shown here are input rate and scheduling delay. Input Rate is a link that
can be clicked to expand the metrics window and show the rate per receiver, if multiple receivers are in
use. In our example, only one receiver was in use and active.
Additional Spark Streaming charts include scheduling delay, processing time, and total delay.
The charts on this page can be particularly useful when troubleshooting performance issues with a
Spark Streaming job and trying to understand the cause of slowness.
Streaming Batches
Beneath the charts is a list of batches, with statistics available about each individual batch and
whether the output operation was successful.
Batch Detail
Clicking on the batch time link results in a batch details page, where additional information may be
found.
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) Spark jobs are divided into _____________, which are logical collections of _______________.
2 ) Every Spark job culminates in what type of task?
3 ) What Spark component organizes stages into logical groupings that allow for parallel
execution?
4 ) What is the default port used for the Spark Application UI?
5 ) If two SparkContext instances are running, what is the port used for the Spark Application UI of
the second one?
6 ) As discussed in this lesson, what tabs in the Spark Application UI only appear if certain types
of jobs are run?
Answers
1 ) Spark jobs are divided into _____________, which are logical collections of _______________.
Answer: Stages; tasks.
2 ) Every Spark job culminates in what type of task?
Answer: Action
3 ) What Spark component organizes stages into logical groupings that allow for parallel
execution?
Answer: The DAG Scheduler.
4 ) What is the default port used for the Spark Application UI?
Answer: 4040
5 ) If two SparkContext instances are running, what is the port used for the Spark Application UI of
the second one?
Answer: 4041
6 ) As discussed in this lesson, what tabs in the Spark Application UI only appear if certain types
of jobs are run?
Answer: The SQL tab (for Spark SQL jobs) and the Streaming tab (for Spark Streaming jobs).
Summary
• Spark applications consist of Spark jobs, which are collections of tasks that culminate in an
action.
• Spark jobs are divided into stages, which separate lists of tasks based on shuffle boundaries
and are organized for optimized parallel execution via the DAG Scheduler.
• The Spark Application UI provides a view into all jobs run or running for a given SparkContext
instance, including detailed information and statistics appropriate for the application and tasks
being performed.
Lesson Objectives
After completing this lesson, students should be able to:
ü Explain why mapPartitions usually performs better than map
ü Describe how to repartition RDDs and how this can improve performance
ü Explain the different caching options available
ü Describe how checkpointing can reduce recovery time in the event of losing an executor
ü Describe situations where broadcasting increases runtime efficiencies
ü Detail the options available for configuring executors
ü Explain the purpose and function of YARN
IMPORTANT:
The brackets [ ] around sum(x) are required in this example because the input and
output of mapPartitions() must be iterable. Without the brackets to keep the
individual partition values separate, the function would attempt to return a number rather
than a list of results, and as such would fail. If the total sum was needed, you would
need to perform an additional operation on rdd2 (from the modification to line 2 above) in
order to compute it such as the operation below:
rdd1.mapPartitions(lambda x: [sum(x)]).reduce(lambda a,b: a+b)
# or simply rdd1.reduce(lambda a,b: a+b) in the first place,
# if the goal were not to demonstrate mapPartitions()
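For context, here is a minimal self-contained sketch of the pattern the note refers to; the contents of
rdd1 and its partition count are assumptions made for illustration:
rdd1 = sc.parallelize([1, 2, 3, 4, 5, 6], 2)     # two partitions: [1, 2, 3] and [4, 5, 6]
rdd2 = rdd1.mapPartitions(lambda x: [sum(x)])    # one sum per partition
rdd2.collect()                                   # [6, 15]
rdd2.reduce(lambda a,b: a+b)                     # 21, the overall total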
RDD Parallelism
The cornerstone of performance in Spark centers around the concepts of narrow and wide operations.
How RDDs are partitioned, initially and via explicit changes, can make a significant impact on
performance.
Narrow Dependencies/Operations
Narrow operations can be executed locally and do not depend on any data outside of the current element.
Examples of narrow operations are map(), flatMap(), union(), and filter().
The picture above depicts examples of how narrow operations work. As visible in the picture,
there are no interdependencies between partitions.
Transformations maintain the partitioning of the largest parent RDD for the operation. For single parent
RDD transformations, including filter(), flatMap(), and map(), the resulting RDD has the same
number of partitions as the parent RDD.
For combining transformations such as union(), the number of resulting partitions will be equal to the
total number of partitions from the parent RDDs.
Wide Dependencies/Operations
Wide operations occur when shuffling of data is required. Examples of wide operations are
reduceByKey(), groupByKey(), repartition(), and join().
Wide Dependencies/Operations
Above is an example of a wide operation. Notice that the child partitions are dependent on more than
one parent partition. This should help explain why wide operations separate stages. The child RDD
cannot exist completely unless all the data from the parent partitions have finished processing.
The above example shows the four RDD1 partitions reducing to a single RDD2 partition, but in reality
multiple RDD2 partitions would have been generated, each one pulling a different subset of data from
each of the RDD1 partitions. The diagram shows the logical combination rather than a physical result.
All shuffle-based operations’ outputs use the number of partitions that are present in the parent with
the largest number of partitions. In the previous diagram this would have resulted in RDD2 being
spread across four partitions. Again, it was shown as a single partition to help visualize what is
happening and to prevent an overly complicated diagram. The developer can specify the number of
partitions the transformation will use, instead of defaulting to the larger parent. This is shown by
passing a numPartitions as an optional parameter, as shown in the following two versions of the
same operation.
reduceByKey(lambda c1,c2: c1+c2, numPartitions=4)
or simply
reduceByKey(lambda c1,c2: c1+c2, 4)
Controlling Parallelism
The following RDD transformations allow for partition-number changes: distinct(), groupByKey(),
reduceByKey(), aggregateByKey(), sortByKey(), join(), cogroup(), coalesce(), and
repartition().
Generally speaking, the larger the number of partitions, the more parallelization the application can
achieve. There are two operations for manually changing the partitions without using a shuffle-based
transformation: repartition() and coalesce().
A repartition() operation will shuffle the entire dataset across the network. A coalesce() just
shuffles the partitions that need to be moved. Coalesce should only be used when reducing the
number of partitions. Examples:
Use repartition() to change the number of partitions to 500:
rdd.repartition(500)
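For comparison, a sketch of reducing the number of partitions with coalesce(), where the target of 50
is an arbitrary illustrative value:
rdd.coalesce(50)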
The number of partitions defaults to the number of blocks the file takes up on the HDFS. Here, if the
file takes up three blocks on HDFS it is represented by a three-partition RDD spread across three
worker nodes.
In the first map operation that is splitting the CSV record into attributes, no data needed to be
referenced from another partition to perform the map transformation.
.map(lambda line: line.split(",")) \
The same is true for the next map that is creating a PairRDD for each row’s particular state and
population count.
.map(lambda rec: (rec[4],int(rec[5])))
In the reduceByKey transformation that calculates final population totals for each state, there is an
explicit reduction in the number of partitions.
Notice also, when doing a reduceByKey(), the same key may be present in multiple partitions that are
output from the map operation. When this happens, the data is required to shuffle.
Finally, at the end, the collect() returns all the results from the two partitions in the reduceByKey
operation to the driver.
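Putting the snippets together, a sketch of the full pipeline being described, in which the file path and
the target of two partitions are assumptions based on the walkthrough:
statePopulation = sc.textFile("statePopulations.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda rec: (rec[4], int(rec[5]))) \
    .reduceByKey(lambda c1,c2: c1+c2, 2)    # explicitly reduce to two partitions
statePopulation.collect()                   # return the per-state totals to the driver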
Whenever reducing the number of partitions, always use coalesce(), as it minimizes the amount of
network shuffle. A repartition() is required if the developer is going to increase the number of
partitions.
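The union() example that follows continues from two single-partition Pair RDDs created earlier; a
sketch of inputs that would produce the output shown (the exact creation code is not part of this
excerpt) is:
rdd1 = sc.parallelize([(0, 'X'), (1, 'X'), (2, 'X')], 1)   # one partition
rdd2 = sc.parallelize([(1, 'Y'), (2, 'Y'), (3, 'Y')], 1)   # one partition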
rdd3 = rdd1.union(rdd2)
rdd3.getNumPartitions()
2
rdd3.collect()
[(0, 'X'), (1, 'X'), (2, 'X'), (1, 'Y'), (2, 'Y'), (3, 'Y')]
rdd3.glom().collect() # make an array for each partition
[[(0, 'X'), (1, 'X'), (2, 'X')], [(1, 'Y'), (2, 'Y'), (3, 'Y')]]
rdd4 = rdd3.partitionBy(2)
rdd4.glom().collect() # this one shows all of same key in same part
[[(0, 'X'), (2, 'X'), (2, 'Y')], [(1, 'X'), (1, 'Y'), (3, 'Y')]]
This concept allows many operations to skip shuffle steps because all instances of a key are known to
be in a single location, thus increasing performance. When partitioning data manually by specifying a
number of partitions, be aware of how many executors are being used and try to have at least one
partition per executor.
This can result in performance improvements, particularly when implementing joins.
Continuing the same hash partitioning demo from the previous section, the following example shows
that the two partitions from rdd4 are still observed after performing a filter operation.
rdd4.glom().collect()
[[(0, 'X'), (2, 'X'), (2, 'Y')], [(1, 'X'), (1, 'Y'), (3, 'Y')]]
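The filter step itself might look like the following sketch (the predicate is an assumption for
illustration); filter() preserves the partitioner, so glom() still shows the same two hashed partitions:
rdd5 = rdd4.filter(lambda kv: kv[0] > 0)
rdd5.glom().collect()
[[(2, 'X'), (2, 'Y')], [(1, 'X'), (1, 'Y'), (3, 'Y')]]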
Partitioning Optimization
There is no perfect formula, but the general rule of thumb is that too many partitions is better than too
few. Since each dataset and each use case can be very different from all others, experimentation is
required to find the optimum number of partitions for each situation.
The more partitions you have, the less time it takes to process each one, although eventually you reach
a point of diminishing returns. Spark schedules its own tasks within the executors it has available, and
scheduling a task takes approximately 10-20ms in most situations. Spark can efficiently run tasks that
complete in as little as 200ms. With experimentation you could run tasks that take even less time than
this, but tasks should take at least 100ms to execute to prevent the system from spending too much
time scheduling instead of executing tasks.
A simple, practical approach to identifying the best number of partitions is to keep increasing the
number by 50% until performance stops improving. Once that occurs, use the midpoint between the
last two partition counts when executing the application in production. If anything changes regarding
the executors (number available or sizing characteristics), the tuning exercise should be repeated.
It is optimal to have the number of partitions be just slightly smaller than a multiple of the number of
overall executor cores available. This is to ensure that if multiple waves of tasks must be run, all
waves fully utilize the available resources. The slight reduction from an actual multiple is to
account for Spark internal activities such as speculative execution (a process that looks for
performance outliers and reruns potentially slow or hung tasks). For example, if there are 10 executors
with two cores (20 cores total), make RDDs with 39, 58, or 78 partitions.
Generally speaking, this level of optimization is for programmers who work directly with RDDs. Spark
SQL and its DataFrame API were created to be a higher level of abstraction that lets the developer
focus on what needs to be done as opposed to exactly how those steps should be executed. Spark
SQL's Catalyst optimizer eliminates the need for developers to focus on this level of optimization in the
code.
REFERENCE:
Spark SQL (as of 1.6.1) still has some outstanding unsupported Hive functionality that
developers should be aware of. These are identified at
http://spark.apache.org/docs/latest/sql-programming-guide.html#unsupported-hive-functionality. These shortcomings will likely be
addressed in a future release.
There is a finite amount of memory allocated within an executor for caching of datasets. When an
attempt is made to store more data in cache than physical memory allows, some type of control must
be in place to decide what to keep. Spark uses a Least Recently Used (LRU) strategy: it keeps track of
when cached data was last accessed and evicts datasets from memory based on when they were last
used.
If an operation tries to use cached data that for some reason was lost, the operation will attempt to use
whatever is still in cache, but will recompute the lost data. As the data is recomputed, the data will be
re-cached assuming space is available.
Caching Syntax
The functions for caching and persisting have the same names for RDDs and DataFrames.
• persist() - The developer controls the caching storage level. Syntax: persist(StorageLevel).
Example: persist(StorageLevel.MEMORY_AND_DISK).
• cache() - Simple operation, equivalent to persist(MEMORY_ONLY).
• unpersist() - Removes data from the cache.
To use caching, we must do two things:
• The first is to import the library. In Python: from pyspark import StorageLevel. In Scala:
import org.apache.spark.storage.StorageLevel.
• The second is to call persist() on the RDD, passing the desired storage level:
rdd.persist(StorageLevel)
The Spark SQL SQLContext object also features a cacheTable(tableName) method for any table
that it knows by name and the complementary uncacheTable(tableName) method. Additional helper
methods isCached(tableName) and clearCache() are also provided.
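A minimal sketch of these table-level helpers in PySpark, assuming a table registered under the
hypothetical name "orders":
sqlContext.cacheTable("orders")      # cache the table by name
sqlContext.uncacheTable("orders")    # remove this table from the cache
sqlContext.clearCache()              # remove all cached tables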
NOTE:
In Python, cached objects will always be serialized, so it does not matter whether you
choose serialization or not. When we talk about storage levels next, when using
pyspark, MEMORY_ONLY is the same as MEMORY_ONLY_SER.
Use of the persist() API is recommended with RDDs, as it requires the developer to be
completely aware of which "storage level" is best for a given dataset and its use case. The ultimate
decision on which storage level to use is based on questions centered around serialization and disk
usage.
First: should the cached data live in memory or on disk? It may not sound like disk is a great choice,
but remember that the RDD could be the result of multiple transformations that would be costly to
reproduce if not cached. Additionally, the executor can very quickly get to data on its local disk when
needed. This storage level is identified as DISK_ONLY.
If in-memory is a better choice, then the second question is whether raw or serialized caching (for
Scala - as previously mentioned, Python automatically serializes cached objects) should be used.
Regardless of that answer, the third question to answer is: should the cached data be rolled onto local
disk if it gets evicted from memory, or should it just be dropped? There still may be significant value
from this data on local disk compared to having to recompute an RDD from the beginning.
Raw in-memory caching has an additional option to store the cached data for each partition in two
different cluster nodes. This storage level allows for some additional levels of resiliency should an
executor fail.
NOTE:
There is also an experimental storage level identified as OFF_HEAP which is most similar
to MEMORY_ONLY_SER except that, as the name suggests, the cache is stored off of the
JVM heap.
Again, the number of choices above indicates that some testing will be required to find the optimal
storage level; for RDD caching, using persist() and explicitly identifying the best possible storage
level is recommended. For Spark SQL, continue to use DataFrame.cache() or cacheTable(tableName)
and let the Catalyst optimizer determine the best options.
Caching Example
Here is an example where an RDD is reused more than once:
from pyspark import StorageLevel
ordersRdd = sc.textFile("/orders/received/*")
ordersRdd.persist(StorageLevel.MEMORY_ONLY_SER)
ordersRdd.map(…).saveAsTextFile("/orders/reports/valid.txt")
ordersRdd.filter(…).saveAsTextFile("/orders/reports/filtered.txt")
ordersRdd.unpersist()
Serialization Options
For Scala
For JVM-based languages (Java and Scala), Spark attempts to strike a balance between convenience
(allowing you to work with any Java type in your operations) and performance. It provides two
serialization libraries: Java serialization (by default), and Kryo serialization.
Kryo serialization is significantly faster and more compact than the default Java serialization, often as
much as 10 times faster. The reason Kryo was not set as the default is that initially Kryo did not
support all serializable types and required users to register custom classes. Since these issues have
been addressed in recent versions of Spark, Kryo serialization should always be used.
For Python
In Python, applications use the Pickle library for serializing, unless you are working with DataFrames or
tables, in which case the Catalyst optimizer manages serialization. The optimizer converts the code
into Java byte code, which - if left unspecified - will use the Java default serialization. Thus, Python
DataFrame applications will still need to specify the use of Kryo serialization.
Kryo Serialization
To implement Kryo Serialization in your application, include the following in your configuration:
conf = SparkConf()
conf.set('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
sc=SparkContext(conf=conf)
This works for DataFrames as well as RDDs since the SQLContext (or HiveContext) are passed the
SparkContext in their constructor method.
If using the pyspark or spark-shell REPLs, add --conf
spark.serializer=org.apache.spark.serializer.KryoSerializer as a command-line
argument to these executables.
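For example, launching the PySpark REPL with Kryo enabled might look like this:
pyspark --conf spark.serializer=org.apache.spark.serializer.KryoSerializer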
Checkpointing
Since Spark was initially built for long-running, iterative applications, it keeps track of an RDD's recipe
or lineage. This provides reliability and resilience, but as the number of transformations performed
increases and the lineage grows, an application can run into problems. The lineage can become too
big for the object allocated to hold everything. When the lineage gets too long, there is a possibility of
a stack overflow.
Also, when a worker node dies, any intermediate data stored on the executor has to be re-computed.
If 500 iterations were already performed, and part of the 500th iteration was lost, the application has to
re-do all 500 iterations. That can take an incredibly long time, and such a failure becomes increasingly
likely the longer an application runs and the more data it processes.
Spark provides a mechanism to mitigate these issues: checkpointing.
About Checkpointing
When checkpointing is enabled, it does two things: data checkpointing and metadata checkpointing.
(Checkpointing is not yet available for Spark SQL).
Data checkpointing – Saves the generated RDDs to reliable storage. As we saw with Spark
Streaming window transformations, this was a requirement for transformations that combined data
across multiple batches. To avoid unbounded increases in recovery time (proportional to the
dependency chain), intermediate RDDs of extended transformation chains can be periodically
checkpointed to reliable storage (typically HDFS) to shorten the number of dependencies in the event
of failure.
Metadata checkpointing – Saves the information defining the streaming computation to fault-
tolerant storage like HDFS. This is used to recover from failure of the node running the driver of the
streaming application.
Metadata includes:
• Configuration - The configuration that was used to create the streaming application.
• DStream operations - The set of DStream operations that define the streaming application.
• Incomplete batches - Batches whose jobs are queued but have not completed yet.
When a checkpoint is initialized, the lineage tracker is "reset" to the point of the last checkpoint.
When enabling checkpointing, consider the following:
• Checkpointing is performed at the RDD level, not the application level
• Checkpointing is not supported in DataFrames or Spark SQL
• There is a performance expense incurred when pausing to write the checkpoint data, but this is
usually overshadowed by the benefits in the event of failure
• Checkpointed data is not automatically deleted from the HDFS. The user needs to manually
clean up the directory when they’re positive that data will no longer be required.
Without Checkpointing, all processes must be repeated – potentially thousands of transformations – if a node is lost
This is a typical application that iterates and has a lineage of "n" number of RDDs. There is no
checkpointing enabled. If an in-use node fails during processing, all processing steps up to that point
must be repeated from the beginning. This might be hundreds, or even thousands of transformations,
which can result in significant time lost to reprocessing.
With Checkpointing, only processes performed since the last checkpoint must be repeated
In this example, we have the same application with checkpointing enabled. We can see that every nth
iteration, data is being permanently stored to the HDFS. This may not seem intuitive at first as one
might ask why we should save data to the HDFS when it is not needed. In the case that a worker node
goes down, instead of trying to redo all previous transformations (which again, can number in the
thousands), the data can be retrieved from HDFS and then processing can continue from the point of
the last checkpoint.
This example shows that checkpointing can be viewed as a sort of insurance for events such as this.
Instead of simply hoping a long-running application will finish without any worker failures, the
developer makes a bit a performance tradeoff up front in choosing to pause and write data to HDFS
from time to time.
Implementing Checkpointing
To implement checkpointing, the developer must specify a location for the checkpoint directory before
using the checkpoint function. Here is an example:
sc.setCheckpointDir("hdfs://somedir/")
rdd = sc.textFile("/path/to/file.txt")
for x in range(<large number>):
    rdd = rdd.map(…)
    if x % 5 == 0:
        rdd.checkpoint()
rdd.saveAsTextFile("/path/to/output.txt")
This code generates a checkpoint every fifth iteration of the RDD operation.
Broadcast Variables
A broadcast variable is a read-only variable cached once in each executor that can be shared among
tasks. It cannot be modified by the executor. The ideal use case is something more substantial than a
very small list or map, but also not something that could be considered “Big Data”.
Broadcast variables are implemented as wrappers around collections of simple data types. They are
not intended to wrap around other distributed data structures such as RDDs and DataFrames.
The goal of broadcast variables is to increase performance by not copying a local dataset to each task
that needs it and leveraging a broadcast version of it. This is not a transparent operation in the
codebase - the developer has to specifically leverage the broadcast variable name.
Spark uses concepts from P2P torrenting to efficiently distribute broadcast variables to the nodes and
minimize communication cost. Once a broadcast variable is written to a single executor, that executor
can send the broadcast variable to other executors. This concept reduces the load on the machine
running the driver and allows the executors (aka the peers in the P2P model) to share the burden of
broadcasting the data.
Broadcast variables are lazy and will not receive the broadcast data until needed. The first time a
broadcast variable is read, the node will retrieve and store the data in case it is needed again. Thus,
broadcast variables get sent to each node only once.
Without broadcast variables, reference data (such as lookup tables, lists, or other variables) is sent to
every task on the executor, even though multiple tasks reuse the same variables. This is what an
application does normally.
Using Broadcast Variables, Spark sends Reference Data to the Node only Once
Using broadcast variables, Spark sends a copy to the node once and the data is stored in memory.
Each task will reference the local copy of the data. These broadcast variables get stored in the
executor memory overhead portion of the executor.
For this example, the use case could be to break all of the words from a story into an RDD as shown by
these two lines of code.
story = sc.parallelize(["I like a house", "I live in the house"])
words = story.flatMap(lambda line: line.split(" "))
This example is just creating a small two-line story and in practice the “story” RDD would be more
likely created by using sc.textFile() to read a large HDFS dataset. The rest of the use case would
be to strip out any words that were considered irrelevant (aka “noise”) to further processing steps. In
the code statements below, we are simply building a new RDD of elements that are not in the irrelevant
word list.
noise = ["the", "an", "a", "in"]
words.filter(lambda w: w not in noise).collect()
['I', 'like', 'house', 'I', 'live', 'house']
Next, we can elevate the irrelevant word list to a broadcast variable to distribute it to all workers and
then simply reference it with the new variable name to obtain the same results.
bcastNoise = sc.broadcast(noise)
words.filter(lambda w: w not in bcastNoise.value).collect()
['I', 'like', 'house', 'I', 'live', 'house']
Notice that the results are the same. While in this small illustrative example we would not show any
performance gain, we would see improvements if the input “story” was of significant size and the
irrelevant filter data itself was much bigger. Additionally, more savings would surface if the broadcast
variable was used over again in subsequent stages.
Joining Strategies
While joins can happen on more than two datasets, this discussion will illustrate the use case of only
two datasets which can be extrapolated upon when thinking of more than two datasets being joined.
Additionally, the concepts discussed (unless otherwise called out) relate to RDD and DataFrame
processing even though the illustrations will often reference RDD as the DataFrame will use the
underlying RDD for these activities.
Spark performs joins at the partition level, ensuring that the partitions from the datasets being joined
align with each other. That means the join key will always be in the same numbered partition of each
dataset being joined.
For that to happen, both datasets need to have the same number of partitions and have used the same
hashing algorithm (described as "Hashed Partitions" earlier in this module) against the same join key
before it can start the actual join processing.
To help explain the intersection of joins and hash partitions, let's look at the worst case situation. We
have two datasets that have different numbers of partitions (two on the left, four on the right) and
which were not partitioned with the same key. Both datasets will require a shuffle so that an equal
number of hashed partitions, built on the join key, is created prior to the join operation being
executed.
NOTE:
The number of hashed partitions created for the JOIN RDD was equal to the number of
partitions from the dataset with the largest number of partitions.
A better scenario would occur if the larger dataset was already hashed partitioned on the join key. In
this situation, only the second dataset would need to be shuffled. As before, more partitions would be
created in the newly created dataset.
Co-Partitioned Datasets
The best situation would occur when the joining datasets already have the same number of hashed
partitions using the same join key. In this situation, no additional shuffling would need to happen. This
is called a co-partition join and is classified as a narrow operation.
Most likely, the aligned partitions will not always be on the same executor, and thus one of them will
need to be moved to the executor where its counterpart is located. While there is some expense to
this, it is far less costly than requiring either of the joined datasets to perform a full shuffle operation.
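A minimal sketch of a co-partitioned join, using two small pair RDDs created here purely for
illustration:
numPartitions = 2
left = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')]).partitionBy(numPartitions)
right = sc.parallelize([(1, 'x'), (2, 'y'), (4, 'z')]).partitionBy(numPartitions)
# Both sides now share the same hash partitioner and partition count,
# matching the co-partition join scenario described above.
left.join(right).collect()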
Executor Optimization
Executors are highly configurable and are the first place to start when doing optimizations.
Executor Regions
Executors are broken into memory regions. The first is the overhead of the executor, which is almost
always 384MB. The second region is reserved for creating Java objects, which makes up 40% of the
executor. The third and final region is reserved for caching data, which makes up the other 60%. While
these percentages are configurable, it is recommended that initial tuning occur in the overall
configuration of an executor as well as the multiple of how many executors are needed.
When submitting an application, we tell the context how many and what size resources to request. To
set these at runtime, we use the three following flags:
--executor-memory
This property defines how much memory will be allocated to a particular YARN container that will run a
Spark executor.
--executor-cores
This property defines how many CPU cores will be allocated to a particular YARN container that will
run a Spark executor.
--num-executors
This property defines how many YARN containers are being requested to run Spark executors within.
Configuring Executors
Deciding how many executors to request, and how much in resources each should get, can be difficult.
Here's a good starting point.
• executor-memory
- Should be between 8GB and 32GB.
- 64GB would be a strong upper limit as executors with too much memory often are
troubled with long JVM garbage collection processing times.
• executor-cores
- At least two, but a max of four should be configured without performing tests to
validate the additional cores are an overall advantage considering all other properties.
• num-executors
- This is the most flexible as it is the multiple of the combination of memory and cores
that make up an individual executor.
- If caching data, it is desirable for the total executor memory to be at least twice the
dataset size.
Many variables come into play including the size of the YARN cluster nodes that will be hosting the
executors. A good starting point would be 16GB and two cores as almost all modern Hadoop cluster
configurations would support YARN containers of this size.
If the dataset is 100GB, it would be ideal to have 100GB * 2 / 16GB = 12.5 executors. For this
application, choosing 12 or 13 executors could be ideal.
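Under those assumptions, the resource request at submission time might look like the following sketch
(the application file name is a placeholder):
spark-submit --num-executors 13 \
             --executor-memory 16G \
             --executor-cores 2 \
             myApp.py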
This section presents the primary configuration switches available. Fine-tuning final answers will be
best derived from direct testing results.
Knowledge Check
You can use the following question to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) By default, parallelize() creates a number of RDD partitions based on the number of
___________________.
4 ) Which function should I use to reduce the number of partitions in an RDD without any data
changes?
5 ) When all identical keys are shuffled to the same partition, this is called a _______________
partition.
6 ) True or False: DataFrames are structured objects, therefore a developer must work harder to
optimize them than when working directly with RDDs.
Answers
1 ) By default, parallelize() creates a number of RDD partitions based on the number of
___________________.
Answer: cores available to the application (Spark's default parallelism)
4 ) Which function should I use to reduce the number of partitions in an RDD without any data
changes?
Answer: coalesce()
5 ) When all identical keys are shuffled to the same partition, this is called a _______________
partition.
Answer: hashed
6 ) True or False: DataFrames are structured objects, therefore a developer must work harder to
optimize them than when working directly with RDDs.
Answer: False. The Catalyst optimizer does this work for you when working with DataFrames.
Summary
• mapPartitions() is similar to map() but operates at the partition instead of element level
• Controlling RDD parallelism before performing complex operations can result in significant
performance improvements
• Caching uses memory to store data that is frequently used
• Checkpointing writes data to disk every so often, resulting in faster recovery should a system
failure occur
• Broadcast variables allow tasks running in an executor to share a single, centralized copy of a
data variable to reduce network traffic and improve performance
• Join operations can be significantly enhanced by pre-shuffling and pre-filtering data
• Executors are highly customizable, including number, memory, and CPU resources
• Spark SQL makes a lot of manual optimization unnecessary due to Catalyst
Lesson Objectives
After completing this lesson, students should be able to:
ü Create an application to submit to the cluster
ü Describe client vs cluster submission with YARN
ü Submit an application to the cluster
ü List and set important configuration items
Importing Libraries
The first thing a developer must do is import the SparkContext and SparkConf libraries. In addition,
they will need to import all the other libraries they want to use in the application - for example, the
SQLContext libraries. Doing this looks like any other application. Here is an example of importing
some important libraries in Python.
import os
import sys
from pyspark import SparkContext, SparkConf
To import other Spark libraries, it's the same as with any other application. Here is an example of
importing more Spark libraries that are related to DataFrame processing:
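A minimal sketch of such imports, assuming the SQLContext and Row classes are the ones needed:
from pyspark.sql import SQLContext, Row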
Below is an example of creating the SparkContext, setting a configuration, and stopping the
SparkContext. While not including the sc.stop() may have no impact on the developer, the overall
cluster experience will diminish as resources will still be allocated and administrators will start having
to manually kill these processes. This is not trivial, as identifying which processes are related to this
problem among all the YARN processes running is difficult. Developers should be cognizant of
this multi-tenant nature of most Hadoop clusters and the fact that resources should be freed when no
longer needed. This is easy to do by simply including the necessary call to sc.stop() in all
applications.
conf = SparkConf().setAppName("someApp")   # "someApp" is a placeholder application name
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sc.stop()
The first YARN submission mode, and probably the one that developers have the most experience with, is
"yarn-client." In yarn-client mode, the driver program is a JVM started on the machine the application is submitted
from. The SparkContext then lives in that JVM. This is the way the REPLs and Zeppelin start a Spark
application. They provide an interactive way to use Spark, so the context must exist where the
developer has access.
The other option, and the one that should be used for production applications, is "yarn-cluster." The
biggest difference between the two is the location of the Spark Driver.
In client mode, the Spark driver exists on the client machine. If something should happen to that client
machine, the application will fail.
In cluster mode, the application puts the Spark driver on the YARN ApplicationMaster process which is
running on a worker node somewhere in the cluster.
One big advantage of this is that even if the client machine that submitted the application to YARN
fails, the Spark application will continue to run.
In yarn-client mode, the driver and context are running on the client, as seen in the example.
When the application is submitted, the SparkContext reaches out to the resource manager to create
an Application Master. The Application Master is then created, and asks the Resource Manager for the
rest of the resources that were requested in the SparkConf, or from the runtime configurations. After
the Application Master gets confirmation of resource availability, it contacts the Node Managers to
launch the executors. The SparkContext will then start scheduling tasks for the executors to
execute.
In a yarn-cluster submission, the application starts similarly to yarn-client, except that a Spark client is
created. The Spark client is a proxy that communicates with the Resource Manager to create the
Application Master.
The Application Master then hosts the Spark driver and SparkContext. Once this handoff has
occurred, the client machine can fail with no repercussions to the application. The only job the client
had was to start the job and pass the binaries. Once the Application Master is started, it is the same
internal process as during a yarn-client submission. The Application Master talks to the Node
Managers to start the executors. Then the SparkContext, which resides in the Application Master
can start assigning tasks to the executors.
This removes the single point of failure that exists with yarn-client job submissions. The Application
Master needs to be functional, but in yarn-cluster, there is no need for the client after the application
has launched.
In addition, many applications are often submitted from the same machine. The driver program, which
holds the application, requires resources and a JVM. If there are too many applications running on the
client, applications may have to wait until resources free up, which can create a bottleneck. A yarn-
cluster submission moves that resource usage to the cluster.
Between the spark-submit and the application file, the developer can add runtime configurations.
Here are some of the runtime configurations that can be set, with the format to submit them:
--num-executors 2
--executor-memory 1g
--master yarn-cluster
--conf spark.executor.cores=2
NOTE:
The last configuration property in this list could also have been added as:
--executor-cores=2.
TIP:
Be careful when requesting resources. Spark will hold on to all resources
it is allocated, even if they are not being used. While YARN can pre-empt
containers, it is the developer's duty to make sure they are using a
reasonable number of resources. Once allocated, Spark does not easily
give up resources.
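Such a submission might look like the following sketch (the input and output file names are
placeholders):
spark-submit --master yarn-cluster \
             --num-executors 4 \
             --executor-memory 8G \
             /user/username/sparkDemo.py /path/to/input.csv /path/to/output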
In the example above, the application is being submitted to YARN in the yarn-cluster mode. We're
requesting four executors, each with 8GB of memory. The application being submitted is
/user/username/sparkDemo.py which is being passed two arguments -- an input file and an output
file.
NOTE:
The PYSPARK_PYTHON variable is not specified here for brevity, but it would be a good
idea to include this in all submissions where there might be a conflict on Python
versioning.
REFERENCE:
These configuration properties can be seen in the documentation at spark.apache.org
Because of this, setting as few configurations as possible in the application code is best practice, with
the exception of some specific configurations. Pass the rest in at runtime or in a configuration file.
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) What components does the developer need to recreate when creating a Spark Application as
opposed to using Zeppelin or a REPL?
2 ) What are the two YARN submission options the developer has?
4 ) When making a configuration setting, which location has the highest priority in the event of a
conflict?
5 ) True or False: You should set your Python Spark SQL application to use Kryo serialization
Answers
1 ) What components does the developer need to recreate when creating a Spark Application as
opposed to using Zeppelin or a REPL?
Answer: The developer must import the SparkContext and SparkConf libraries, create the main
program, create a SparkConf and a SparkContext, and stop the SparkContext at the end
of the application
2 ) What are the two YARN submission options the developer has?
Answer: yarn-client and yarn-cluster are the two yarn submission options
Answer: The difference between yarn-client and yarn-cluster is where the driver and
SparkContext reside. The driver and context reside on the client in yarn-client, and in the
application master in yarn-cluster.
4 ) When making a configuration setting, which location has the highest priority in the event of a
conflict?
Answer: Configurations set in the application code (on the SparkConf object) take the highest
priority.
5 ) True or False: You should set your Python Spark SQL application to use Kryo serialization
Answer: True. It is used for JVM objects that will be created when using Spark SQL
Summary
• A developer must reproduce some of the back-end environment creation that Zeppelin and the
REPLs handle automatically.
• The main difference between a yarn-client and a yarn-cluster application submission is the
location of the Spark driver and SparkContext.
• Use spark-submit, with appropriate configurations, the application file, and necessary
arguments, to submit an application to YARN.
Lesson Objectives
After completing this lesson, students should be able to:
ü Describe the purpose of machine learning and some common algorithms used in it
ü Describe the machine learning packages available in Spark
ü Examine and run sample machine learning applications
DISCLAIMER:
Machine learning is an expansive topic that could easily span multiple days of training
without covering everything. Since this is a class for application developers and not
specifically data scientists, an in-depth discussion on machine learning is out of scope.
However, Spark does come with a number of powerful machine learning tools and
capabilities. Fully utilizing the packages and practices this lesson will discuss requires a
fundamental understanding that goes well beyond what will be covered here. Even so, it
is well worth a developer's time to be aware of machine learning algorithms and
generally the kinds of things they can do.
Furthermore, the lab and suggested exercises that accompany this lesson will consist of
pre-built scripts and sample applications that will demonstrate some of these topics in
practice. A student interested in learning more is encouraged to take a look at additional
and future Hortonworks University offerings that specifically focus on more advanced
programming, data science, or both.
Supervised Learning
Supervised learning is the most common type of machine learning. It occurs when a model is created
using one or more variables to make a prediction, and then the accuracy of that prediction can be
immediately tested.
There are two common types of predictions: Classification and Regression.
Classification attempts to answer a discrete question - is the answer yes or no? Will the application
be approved or rejected? Is this email spam or safe to send to the user? "Will the flight depart on
time?" It's either a yes or no answer - if we predict the flight departs early or on time, the answer is yes.
If we predict it will be one minute late or more, the answer is no.
Regression attempts to determine what a value will be given specific information. What will the home
sell for? What should their life insurance rate be? What time is the flight likely to depart? It's an answer
where a specific value is being placed, rather than a simple yes or no is being applied. Therefore, we
might say the flight will depart at 11:35 as our prediction.
Supervised learning starts by randomly breaking a dataset into two parts: training data and testing
data.
Training data is what a machine-learning algorithm uses to create a model. It starts with a dataset,
then performs statistical analysis of the effect one or more variables has on the final result. Since the
answers (yes or no for classification, or the exact value - for example, flight departure time) are known,
the algorithm can determine with a high degree of certainty that the weight it applies to a variable is
accurate within the training data.
Once a model that is accurate for the training dataset is built, that model is then applied to the testing
dataset to see how well it performs on data it has not seen before. The model
will almost never be 100% accurate for testing data, but the better the model is, the better it will be at
accurately predicting results where the answers are not known ahead of time.
Thousands upon Thousands of Data Points are collected and Available Every Day
This is a simple example of what a supervised learning dataset might look like. We have many columns
to choose from when selecting the variables we want to test. There would likely be thousands upon
thousands of data points collected and available, with new information streaming in on a continuous
basis, giving us massive historical data to work from.
Notice that this dataset could be used either for regression or classification. Classification would
compare the Sched and Actual columns: if the Actual value was less than or equal to Sched, it would be
interpreted as a yes; if not, it would be a no. For regression, the actual departure time itself is the
value being predicted.
Terminology
• Each row in the dataset is called an "observation"
• Each column in the dataset is called a "feature"
• Columns selected for inclusion in the model are called "target variables"
The sum of mean squared error simply squares the error of each observation (the difference between
its actual and predicted value) and adds those values together. This increasingly penalizes
observations as they get further away from
the predicted value. Thus, the sum of mean squared error for Model A = 0 + 16 + 0 + 16, for a final
value of 32. The sum of mean squared error for Model B = 1 + 9 + 4 + 4, for a final value of 18. Since the
sum of mean squared error is lower for Model B, we can determine that it is the better model. Thus, we
can both intuitively and mathematically determine that Model B is a better predictor than Model A.
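A minimal sketch of that arithmetic, using the per-observation errors implied by the squared values
above:
# Per-observation errors (actual minus predicted) implied by the totals in the text
model_a_errors = [0, 4, 0, 4]
model_b_errors = [1, 3, 2, 2]

def sum_of_squared_error(errors):
    # Square each error and add the results together
    return sum(e ** 2 for e in errors)

print(sum_of_squared_error(model_a_errors))   # 32
print(sum_of_squared_error(model_b_errors))   # 18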
One commonly used classification algorithm is called the Decision Tree algorithm. In essence, a
decision tree uses a selected variable to determine the probability of an outcome, and then - assuming
that variable and probability are known - selects another variable and does the same thing. This
continues through the dataset, with variables and their order of selection/evaluation determined by the
data scientist. In the graphic, we see a small part of what would be a much larger decision tree, where
an airport value of ORD has been evaluated, followed by carriers at ORD, followed by weather
conditions.
There are often numerous ways in which decision trees might be constructed, and some paths will
produce better predictions than others. The same target variables can be arranged into multiple
decision tree paths, which can be combined into what is known as a Forest. In the end, the
classification (prediction) that has the most "votes" is selected as the prediction.
Classification Algorithms
When creating a classification visualization, the model draws a line where it predicts the answers will
be. This line can then be compared to the actual results in the test data. For example, in this simple
visualization, the white-filled circles represent observations of target variables where actual departure
time was less than or equal to the scheduled departure time. The red-filled circles represent
observations where actual departure time was greater than scheduled departure time. The red line
represents the predictions that the model made. Above the red line would be where the model
predicted on-time departures, and below the red line would be where the model predicted delayed
departures.
In the case of regression, the line drawn is predicting an actual value rather than a binary result. In the
first diagram, we see a regression where only a single variable was selected and weighted - thus the
result will be a straight line. As more variables are added, the regression curves, and in some cases,
can curve wildly based on the variables and the weights determined by the model. The second
diagram is what a model with two variables might look like. To determine which model was a better
predictor, we would find out how far away each of the dots was from the prediction line and perform a
sum of mean squared error calculation.
Unsupervised Learning
Supervised learning is a powerful tool as long as you have clean, formatted data where every column
has an accurate label. However, in some cases, what we start with is simply data, and appropriate
labels may be unknown. For example, take product reviews that people leave on social media, blogs,
and other web sites. Unlike reviews on retailers’ pages, where the user explicitly gives a negative,
neutral, or positive rating as part of creating their review (for example, a star rating), the social media
and other reviews have no such rating or label applied. How then can we group them to determine
whether any given review is positive, neutral, or negative, and determine whether the general
consensus is positive or negative?
For a human evaluator, simply reading the review would be enough. However, if we are collecting
thousands of reviews every day from various sources, employing a human to read and categorize each
one would be highly inefficient. This is where unsupervised learning comes in. The goal of
unsupervised learning is to define criteria by which a dataset will be evaluated, and then find patterns
in the data that are made up of groupings with similar characteristics. The algorithm does not
determine what those groupings mean - that is up to the data scientist to fill in. All it determines is
what should be grouped, based on the supplied criteria.
For example, we might look at cases where certain phrases are compared, and the algorithm might
determine that when a review contains phrase X, it very often also contains phrase Y. Therefore, a
review that contains phrase X but not phrase Y would still be grouped with the phrase Y reviews. After
this processing is complete, the data scientist looks at a few of the phrase Y grouped reviews and
determines that they are generally positive, and thus assigns them to the positive review category.
The most common type of unsupervised learning, and the one described in this example, is called
clustering.
In this example, we have observations from which we have picked out phrases from a defined list we
are looking for. The data has been cleaned of extraneous words and phrases, and then the remaining
groups of phrases are evaluated to determine how frequently they are used within the same review.
The algorithm searches for patterns so that reviews can be grouped, but has no idea whether any
particular grouping represents positive, neutral, or negative reviews.
K-Means Algorithm
K-Means is Used to Identify Groupings that Likely Share the Same Label
Once the algorithm has grouped the results, the data scientist must determine the meaning. In the
diagram, negative reviews are coded red, positive are coded green, and neutral are coded yellow. A
clustering algorithm known as the K-Means algorithm was applied and groupings were created. Be
aware that just as in supervised learning, not all reviews could be grouped closely with some others,
and in some cases, reviews were grouped with the wrong category. The better the model is, the more
accurate these groupings will be.
mllib Modules
This is a list of the modules available in Spark's mllib package:
• classification
• clustering
• evaluation
• feature
• fpm
• linalg*
• optimization
• pmml
• random
• recommendation
• regression
• stat*
• tree*
• util
ml Modules
This is a list of the modules available in Spark's ml package:
• attribute
• classification
• clustering
• evaluation
• feature
• param
• recommendation
• regression
• source.libsvm
• tree*
• tuning
• util
For example, if you wanted to view ml samples available for Python, you would browse to
/usr/hdp/current/spark-client/examples/src/main/python/ml/.
Using a text editor, you can open and examine the contents of each application. The examples are well
commented. They can actually be used as teaching tools to help you learn how to employ Spark's
machine learning capabilities for your own needs. In this example, we have opened the decision tree
classification program in the Python mllib directory.
Here is another example, a logistic regression (which, as you will recall, is actually a classification
algorithm) from the Python ml directory.
More sample code from the imported machine learning note in Zeppelin
Knowledge Check
You can use the following questions to assess your understanding of the concepts presented in this
lesson.
Questions
1 ) What are two types of machine learning?
3 ) What do you call columns that are selected as variables to build a machine learning model?
7 ) Which machine learning package is designed to take advantage of flexibility and performance
benefits of DataFrames?
8 ) Name two reasons to prefer Spark machine learning over other alternatives
Answers
1 ) What are two types of machine learning?
Answer: Supervised learning and unsupervised learning
3 ) What do you call columns that are selected as variables to build a machine learning model?
Answer: Target variables
7 ) Which machine learning package is designed to take advantage of flexibility and performance
benefits of DataFrames?
Answer: ml
8 ) Name two reasons to prefer Spark machine learning over other alternatives
Answer: Spark runs machine learning algorithms in a highly parallelized fashion using cluster-level
resources, and it performs in-memory processing.
Summary
• Spark supports machine learning algorithms running in a highly parallelized fashion using
cluster-level resources and performing in-memory processing
• Supervised machine learning builds a model based on known data and uses it to predict
outcomes for unknown data
• Unsupervised machine learning attempts to find grouping patterns within datasets
• Spark has two machine learning packages available
- mllib operates on RDDs
- ml operates on DataFrames
• Spark installs with a collection of sample machine learning applications