Mastering Apache Spark

Table of Contents

Introduction
Overview of Spark
Transformations
Actions
Shuffling
Checkpointing
Dependencies
ParallelCollectionRDD
MapPartitionsRDD
PairRDDFunctions
CoGroupedRDD
HadoopRDD
ShuffledRDD
BlockRDD
Spark Tools
Spark Shell
Stages Tab
Stages for All Jobs
Stage Details
Pool Details
Storage Tab
Executors Tab
SQL Tab
SQLListener
JobProgressListener
spark-submit
spark-class
Spark Architecture
Driver
Master
Workers
Executors
TaskRunner
Spark Services
MemoryManager - Memory Management
UnifiedMemoryManager
DAGScheduler
Jobs
Stages
Task Scheduler
Tasks
TaskSets
Schedulable
TaskSetManager
Schedulable Pool
Schedulable Builders
FIFOSchedulableBuilder
FairSchedulableBuilder
Scheduling Mode
TaskContext
TaskMemoryManager
MemoryConsumer
TaskMetrics
Scheduler Backend
CoarseGrainedSchedulerBackend
Executor Backend
CoarseGrainedExecutorBackend
BlockManager
MemoryStore
DiskStore
BlockDataManager
ShuffleClient
BlockTransferService
BlockManagerMaster
BlockInfoManager
BlockInfo
Dynamic Allocation (of Executors)
ExecutorAllocationManager
ExecutorAllocationClient
ExecutorAllocationListener
ExecutorAllocationManagerSource
Shuffle Manager
ExternalShuffleService
ExternalClusterManager
Broadcast Manager
Data Locality
Cache Manager
OutputCommitCoordinator
Netty-based RpcEnv
ContextCleaner
MapOutputTracker
Spark on cluster
Spark on YARN
YarnShuffleService - ExternalShuffleService on YARN
ExecutorRunnable
Client
YarnRMClient
ApplicationMaster
YarnScheduler
YarnClusterScheduler
YarnSchedulerBackend
YarnClientSchedulerBackend
YarnClusterSchedulerBackend
YarnAllocator
Kerberos
YarnSparkHadoopUtil
Settings
Spark Standalone
Standalone Master
Standalone Worker
web UI
Submission Gateways
Checking Status
StandaloneSchedulerBackend
Spark on Mesos
MesosCoarseGrainedSchedulerBackend
About Mesos
Execution Model
Optimising Spark
Broadcast variables
Accumulators
Spark Security
Securing Web UI
Serialization
SQLConf
Catalog
Dataset
Encoder
Columns
Schema
DataFrame (Dataset[Row])
Row
DataFrameReader
DataFrameWriter
DataSource
DataSourceRegister
Aggregation (GroupedData)
Structured Streaming
DataStreamReader
DataStreamWriter
Source
FileStreamSource
Streaming Sinks
ConsoleSink
ForeachSink
StreamSinkProvider
StreamingQueryManager
StreamingQuery
Trigger
StreamExecution
StreamingRelation
StreamingQueryListenerBus
Joins
Hive Integration
SQL Parsers
Caching
Datasets vs RDDs
SessionState
SQLContext
Predicate Pushdown
QueryPlan
SparkPlan
LogicalPlan
QueryPlanner
QueryExecution
Project Tungsten
Settings
Spark Streaming
StreamingContext
Stream Operators
Windowed Operators
SaveAs Operators
Stateful Operators
Streaming Listeners
Checkpointing
JobScheduler
JobGenerator
DStreamGraph
Input DStreams
ReceiverInputDStreams
ConstantInputDStreams
ForEachDStreams
WindowedDStreams
MapWithStateDStreams
StateDStreams
TransformedDStream
Receivers
ReceiverTracker
ReceiverSupervisors
ReceivedBlockHandlers
KafkaRDD
RecurringTimer
Backpressure
ExecutorAllocationManager
Settings
Spark MLlib - Machine Learning in Spark
ML Pipelines (spark.ml)
Transformers
Estimators
Models
Evaluators
CrossValidator
Example - Text Classification
Example - Linear Regression
Vector
LabeledPoint
Streaming MLlib
HistoryServer
SQLHistoryListener
FsHistoryProvider
Logging
Performance Tuning
Spark Listeners
LiveListenerBus
ReplayListenerBus
EventLoggingListener - Event Logging
Building Spark
Spark Packages
TransportConf - Transport Configuration
Exercises
Courses
Books
DataStax Enterprise
Requirements
Day 1
Day 2
Introduction
Overview of Spark
Apache Spark
Apache Spark is an open-source, distributed, general-purpose cluster computing
framework with an in-memory data processing engine that can do ETL, analytics, machine
learning and graph processing on large volumes of data at rest (batch processing) or in
motion (streaming processing), with rich, concise, high-level APIs for the programming
languages Scala, Python, Java, R, and SQL.
Using Spark Application Frameworks, Spark simplifies access to machine learning and
predictive analytics at scale.
Spark is mainly written in Scala, but supports other languages, i.e. Java, Python, and R.
If you have large amounts of data that requires low latency processing that a typical
MapReduce program cannot provide, Spark is an alternative.
Access any data type across any data source.
Huge demand for storage and data processing.
The Apache Spark project is an umbrella for SQL (with DataFrames), streaming, machine
learning (pipelines) and graph processing engines built atop Spark Core. You can run them
all in a single application using a consistent API.
Spark runs locally as well as in clusters, on-premises or in cloud. It runs on top of Hadoop
YARN, Apache Mesos, standalone or in the cloud (Amazon EC2 or IBM Bluemix).
Spark can access data from many data sources.
Apache Spark's Streaming and SQL programming models with MLlib and GraphX make it
easier for developers and data scientists to build applications that exploit machine learning
and graph analytics.
At a high level, any Spark application creates RDDs out of some input, runs (lazy)
transformations of these RDDs into some other form (shape), and finally performs actions to
collect or store data. Not much, huh?
You can look at Spark from the programmer's, data engineer's and administrator's points of view.
And to be honest, all three types of people will spend quite a lot of their time with Spark to
finally reach the point where they exploit all the available features. Programmers use
language-specific APIs (and work at the level of RDDs using transformations and actions),
data engineers use higher-level abstractions like DataFrames or Pipelines APIs or external
tools (that connect to Spark), and finally it all can only be possible to run because
administrators set up Spark clusters to deploy Spark applications to.
It is Spark's goal to be a general-purpose computing platform with various specialized
applications frameworks on top of a single unified engine.
Note
When you hear "Apache Spark" it can be two thingsthe Spark engine aka
Spark Core or the Apache Spark open source project which is an "umbrella"
term for Spark Core and the accompanying Spark Application Frameworks, i.e.
Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX that sit on top of
Spark Core and the main data abstraction in Spark called RDD - Resilient
Distributed Dataset.
Why Spark
Let's list a few of the many reasons for Spark. We are doing it first, and then comes the
overview that lends a more technical helping hand.
When you think about distributed batch data processing, Hadoop naturally comes to mind
as a viable solution.
Spark draws many ideas out of Hadoop MapReduce. They work together well - Spark on
YARN and HDFS - while improving on the performance and simplicity of the distributed
computing engine.
For many, Spark is Hadoop++, i.e. MapReduce done in a better way.
And it should not come as a surprise, without Hadoop MapReduce (its advances and
deficiencies), Spark would not have been born at all.
bringing skilled people with their expertise in different programming languages together to a
Spark project.
Single Environment
Regardless of which programming language you are good at, be it Scala, Java, Python, R or
SQL, you can use the same single clustered runtime environment for prototyping, ad hoc
queries, and deploying your applications leveraging the many ingestion data points offered
by the Spark platform.
You can be as low-level as using RDD API directly or leverage higher-level APIs of Spark
SQL (Datasets), Spark MLlib (ML Pipelines), Spark GraphX (Graphs) or Spark Streaming
(DStreams).
Or use them all in a single application.
The single programming model and execution engine for different kinds of workloads
simplify development and deployment architectures.
Both input and output data sources allow programmers and data engineers to use Spark as
the platform where large amounts of data are read from or saved to for processing,
interactively (using Spark shell) or in applications.
Low-level Optimizations
Apache Spark uses a directed acyclic graph (DAG) of computation stages (aka execution
DAG). It postpones any processing until really required for actions. Spark's lazy evaluation
gives plenty of opportunities to induce low-level optimizations (so users have to know less to
do more).
Mind the proverb less is more.
Spark can cache intermediate data in memory for faster model building and training. Once
the data is loaded to memory (as an initial step), reusing it multiple times incurs no
performance slowdowns.
Also, graph algorithms can traverse graphs one connection per iteration with the partial
result in memory.
Less disk access and network traffic can make a huge difference when you need to process
lots of data, especially when it is BIG Data.
One of the many motivations to build Spark was to have a framework that is good at data
reuse.
Spark cuts it out in a way to keep as much data as possible in memory and keep it there
until a job is finished. It doesn't matter how many stages belong to a job. What does matter
is the available memory and how effective you are in using the Spark API (so that no shuffle occurs).
The less network and disk IO, the better performance, and Spark tries hard to find ways to
minimize both.
For it to work, you have to create a Spark configuration using SparkConf or use a custom
SparkContext constructor.
package pl.japila.spark

import org.apache.spark.{SparkContext, SparkConf}

object SparkMeApp {
  def main(args: Array[String]) {
    val masterURL = "local[*]"       // (1) the master URL to connect to
    val conf = new SparkConf()       // (2) the Spark configuration
      .setAppName("SparkMe Application")
      .setMaster(masterURL)
    val sc = new SparkContext(conf)  // (3) create a Spark context
    val fileName = util.Try(args(0)).getOrElse("build.sbt")
    val lines = sc.textFile(fileName).cache()  // (4) create a cached RDD of lines
    val c = lines.count()                      // (5) an action that triggers the computation
    println(s"There are $c lines in $fileName")
  }
}
Spark shell creates a Spark context and SQL context for you at startup.
Your Spark application can run locally or on a cluster, which is determined by the
cluster manager and the deploy mode ( --deploy-mode ). Refer to Deployment
Modes.
You can then create RDDs, transform them to other RDDs and ultimately execute actions.
You can also cache interim RDDs to speed up data processing.
After all the data processing is completed, the Spark application finishes by stopping the
Spark context.
Caution
FIXME
Start Spark shell with --conf spark.logConf=true to log the effective Spark
configuration as INFO when SparkContext is started.
Tip
You can query for the values of Spark properties in Spark shell as follows:
scala> sc.getConf.getOption("spark.local.dir")
res0: Option[String] = None
scala> sc.getConf.getOption("spark.app.name")
res1: Option[String] = Some(Spark shell)
scala> sc.getConf.get("spark.master")
res2: String = local[*]
Setting up Properties
There are the following ways to set up properties for Spark and user programs (in the order
of importance from the least important to the most important):
conf/spark-defaults.conf - the default
--conf or -c - the command-line option used by spark-shell and spark-submit
SparkConf
Default Configuration
The default Spark configuration is created when you execute the following code:
import org.apache.spark.SparkConf
val conf = new SparkConf
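As a quick check in the Spark shell (a minimal sketch - the property value is arbitrary), you can set a property on such a SparkConf and read it back:
scala> import org.apache.spark.SparkConf
import org.apache.spark.SparkConf
scala> val conf = new SparkConf().set("spark.app.name", "SparkMe App")
scala> conf.get("spark.app.name")
res0: String = SparkMe App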
Deploy Mode
Deploy mode specifies where the driver executes in the deployment environment.
Deploy mode can be one of the following options:
client (default) - the driver runs on the machine from which the Spark application was
launched.
cluster - the driver runs on a random node in a cluster.
Note
You can control deploy mode using spark-submit's --deploy-mode command-line option or the
--conf option with the spark.submit.deployMode setting, as in the example below.
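For example (a sketch - the master, class and jar names are placeholders), you could submit an application in cluster deploy mode as follows:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class pl.japila.spark.SparkMeApp \
  sparkme-app.jar
The same could be expressed with --conf spark.submit.deployMode=cluster instead of --deploy-mode cluster .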
Note
Client Mode
Caution
FIXME
Cluster Mode
Caution
FIXME
spark.submit.deployMode
spark.submit.deployMode (default: client ) can be client or cluster .
SparkContext
Note
deploy mode
default level of parallelism
Spark user
the time (in milliseconds) when SparkContext was created
Spark version
Setting configuration
mandatory master URL
local properties
default log level
Creating objects
RDDs
accumulators
broadcast variables
Accessing services, e.g. TaskScheduler, LiveListenerBus, BlockManager,
SchedulerBackends, ShuffleManager.
Running jobs
Setting up custom Scheduler Backend, TaskScheduler and DAGScheduler
Closure Cleaning
Submitting Jobs Asynchronously
Unpersisting RDDs, i.e. marking RDDs as non-persistent
Registering SparkListener
Programmable Dynamic Allocation
Tip
Persisted RDDs
Caution
FIXME
persistRDD
persistRDD(rdd: RDD[_])
requestExecutors
killExecutors
(private!) requestTotalExecutors
(private!) getExecutorIds
contract. It simply passes the call on to the current coarse-grained scheduler backend, i.e.
calls getExecutorIds .
Note
When called for other scheduler backends you should see the following WARN message in
the logs:
WARN Requesting executors is only supported in coarse-grained mode
Caution
requestExecutors method
requestExecutors(numAdditionalExecutors: Int): Boolean
CoarseGrainedSchedulerBackend.
Caution
FIXME
requestTotalExecutors method
requestTotalExecutors(
numExecutors: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int]): Boolean
When called for other scheduler backends you should see the following WARN message in
the logs:
WARN Requesting executors is only supported in coarse-grained mode
Creating SparkContext
You can create a SparkContext instance with or without creating a SparkConf object first.
The getOrCreate methods return the active SparkContext if one exists or create a new one.
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
// Using an explicit SparkConf object
import org.apache.spark.SparkConf
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("SparkMe App")
val sc = SparkContext.getOrCreate(conf)
The no-param getOrCreate method requires that the two mandatory Spark settings - master
and application name - are specified using spark-submit.
Constructors
SparkContext()
SparkContext(conf: SparkConf)
SparkContext(master: String, appName: String, conf: SparkConf)
SparkContext(
master: String,
appName: String,
sparkHome: String = null,
jars: Seq[String] = Nil,
environment: Map[String, String] = Map())
When a Spark context starts up you should see the following INFO in the logs (amongst the
other messages that come from the Spark services):
INFO SparkContext: Running Spark version 2.0.0-SNAPSHOT
Note
Only one SparkContext may be running in a single JVM (check out SPARK-2243 Support multiple SparkContexts in the same JVM). Sharing access to a
SparkContext in the JVM is the solution to share data within Spark (without
relying on other means of data sharing using external data stores).
Note
Changing the SparkConf object does not change the current configuration (as
the method returns a copy).
master method returns the current value of spark.master which is the deployment
environment in use.
Note
set.
Note
getPoolForName is part of the Developer's API and may change in the future.
Internally, it requests the TaskScheduler for the root pool and looks up the Schedulable by
the pool name.
It is exclusively used to show pool details in web UI (for a stage).
Note
Note
Caution
Note
sc.setLocalProperty("spark.scheduler.pool", "myPool")
The goal of the local property concept is to differentiate between or group jobs submitted
from different threads by local properties.
Note
If value is null the key is removed from the local properties:
sc.setLocalProperty("spark.scheduler.pool", null)
A common use case for the local property concept is to set a local property in a thread, say
spark.scheduler.pool, after which all jobs submitted within the thread will be grouped, say
into a pool by FAIR job scheduler.
val rdd = sc.parallelize(0 to 9)
sc.setLocalProperty("spark.scheduler.pool", "myPool")
// these two jobs (one per action) will run in the myPool pool
rdd.count
rdd.collect
sc.setLocalProperty("spark.scheduler.pool", null)
// this job will run in the default pool
rdd.count
SparkContext.makeRDD
Caution
FIXME
DAGScheduler.submitJob method).
It cleans the processPartition input function argument and returns an instance of
SimpleFutureAction that holds the JobWaiter instance (it has received from
DAGScheduler.submitJob ).
Caution
It is used in:
AsyncRDDActions methods
Spark Streaming for ReceiverTrackerEndpoint.startReceiver
Spark Configuration
Caution
FIXME
Creating RDD
SparkContext allows you to create many different RDDs from input sources like:
Caution
FIXME
register registers the acc accumulator. You can optionally give an accumulator a name .
Tip
You can create built-in accumulators for longs, doubles, and collection types
using specialized methods.
The name input parameter allows you to give a name to an accumulator and have it
displayed in Spark UI (under Stages tab for a given stage).
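A minimal sketch in the Spark shell (the accumulator name and the numbers are arbitrary):
scala> val counter = sc.longAccumulator("My Counter")
scala> sc.parallelize(1 to 100).foreach(n => counter.add(n))
scala> counter.value
res0: Long = 5050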
broadcast method creates a broadcast variable that is a shared memory with value on all
Spark executors.
Spark transfers the value to Spark executors once, and tasks can share it without incurring
repetitive network transmissions when requested multiple times.
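For example (a minimal sketch - the broadcast id in the output may differ):
scala> val b = sc.broadcast(Array(1, 2, 3))
b: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> sc.parallelize(1 to 3).map(_ + b.value.sum).collect
res0: Array[Int] = Array(7, 8, 9)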
Note
Caution
RDD actions in Spark run jobs using one of the runJob methods. runJob executes a function on one
or many partitions of an RDD to produce a collection of values per partition.
Tip
For some actions, e.g. first() and lookup() , there is no need to compute all
the partitions of the RDD in a job. And Spark knows it.
import org.apache.spark.TaskContext
scala> sc.runJob(lines, (t: TaskContext, i: Iterator[String]) => 1) (1)
res0: Array[Int] = Array(1, 1) (2)
1. Run a job using runJob on lines RDD with a function that returns 1 for every partition
(of lines RDD).
2. What can you say about the number of partitions of the lines RDD? Is your result
res0 different than mine? Why?
Tip
partition).
When executed, runJob prints out the following INFO message:
INFO Starting job: ...
You can stop a SparkContext using stop method. Stopping a Spark context stops the
Spark Runtime Environment and shuts down the entire Spark application (see Anatomy of
Spark Application).
Calling stop many times leads to the following INFO message in the logs:
INFO SparkContext: SparkContext already stopped.
scala> sc.stop
scala> sc.parallelize(0 to 5)
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Stops ContextCleaner
Stops ExecutorAllocationManager
Stops DAGScheduler
Stops LiveListenerBus
Stops EventLoggingListener
Stops HeartbeatReceiver
Stops optional ConsoleProgressBar
It clears the reference to TaskScheduler (i.e. _taskScheduler is null )
Stops SparkEnv and calls SparkEnv.set(null)
Caution
If all went fine till now you should see the following INFO message in the logs:
INFO SparkContext: Successfully stopped SparkContext
Caution
Events
When a Spark context starts, it triggers SparkListenerEnvironmentUpdate and
SparkListenerApplicationStart messages.
Refer to the section SparkContext's initialization.
setLogLevel allows you to set the root logging level in a Spark application, e.g. Spark shell.
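For example, to switch to INFO messages in the Spark shell:
scala> sc.setLogLevel("INFO")
The valid levels are the standard log4j levels: ALL , DEBUG , ERROR , FATAL , INFO , OFF , TRACE , WARN .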
SparkStatusTracker
SparkStatusTracker requires a Spark context to work. It is created as part of SparkContext's
initialization.
SparkStatusTracker is only used by ConsoleProgressBar.
ConsoleProgressBar
ConsoleProgressBar shows the progress of active stages in console (to stderr ). It polls the
status of stages from SparkStatusTracker periodically and prints out active stages with more
than one task. It keeps overwriting itself to show, in a single line, at most the 3 first
concurrent stages at a time.
[Stage 0:====> (316 + 4) / 1000][Stage 1:> (0 + 0) / 1000][Stage 2:> (0 + 0) / 1000]
The progress includes the stage's id, the number of completed, active, and total tasks.
It is useful when you ssh to workers and want to see the progress of active stages.
It is only instantiated if the value of the boolean property spark.ui.showConsoleProgress
(default: true ) is true and the log level of org.apache.spark.SparkContext logger is WARN
or higher (refer to Logging).
import org.apache.log4j._
Logger.getLogger("org.apache.spark.SparkContext").setLevel(Level.WARN)
With DEBUG logging level you should see the following messages in the logs:
+++ Cleaning closure [func] ([func.getClass.getName]) +++
+ declared fields: [declaredFields.size]
[field]
...
+++ closure [func] ([func.getClass.getName]) is now cleaned +++
Serialization is verified using a new instance of Serializer (as closure Serializer). Refer to
Serialization.
Caution
Hadoop Configuration
While a SparkContext is being created, so is a Hadoop configuration (as an instance of
org.apache.hadoop.conf.Configuration that is available as _hadoopConfiguration ).
Note
SparkHadoopUtil.get.newConfiguration is used.
of AWS_ACCESS_KEY_ID
fs.s3.awsSecretAccessKey , fs.s3n.awsSecretAccessKey , and fs.s3a.secret.key are set
listenerBus
listenerBus is a LiveListenerBus object that acts as a mechanism to announce events to
Note
scala> sc.startTime
res0: Long = 1464425605653
Note
Settings
spark.driver.allowMultipleContexts
Quoting the scaladoc of org.apache.spark.SparkContext:
Only one SparkContext may be active per JVM. You must stop() the active
SparkContext before creating a new one.
You can however control the behaviour using spark.driver.allowMultipleContexts flag.
It is disabled, i.e. false , by default.
If enabled (i.e. true ), Spark prints the following WARN message to the logs:
WARN Multiple running SparkContexts detected in the same JVM!
When creating an instance of SparkContext , Spark marks the current thread as having it
being created (very early in the instantiation process).
Caution
It's not guaranteed that Spark will work properly with two or more
SparkContexts. Consider the feature a work in progress.
Environment Variables
SPARK_EXECUTOR_MEMORY
SPARK_EXECUTOR_MEMORY sets the amount of memory to allocate to each executor. See
Executor Memory.
SPARK_USER
SPARK_USER is the user who is running SparkContext . It is available later as sparkUser.
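For example (a sketch - the user name is arbitrary):
$ SPARK_USER=jacek ./bin/spark-shell
...
scala> sc.sparkUser
res0: String = jacek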
When created, it requires a SparkContext and a Clock . Later, it uses the SparkContext to
register itself as a SparkListener and TaskScheduler (as scheduler ).
Note
created.
Enable DEBUG or TRACE logging levels for org.apache.spark.HeartbeatReceiver
to see what happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.HeartbeatReceiver=TRACE
Refer to Logging.
When called, HeartbeatReceiver cancels the checking task (that sends a blocking
ExpireDeadHosts every spark.network.timeoutInterval on eventLoopThread - Heartbeat
Receiver Event Loop Thread - see Starting (onStart method)) and shuts down
eventLoopThread and killExecutorThread executors.
Messages
ExecutorRegistered
ExecutorRegistered(executorId: String)
Note
registration.
Note
It is an internal message.
ExecutorRemoved
ExecutorRemoved(executorId: String)
Note
Note
It is an internal message.
ExpireDeadHosts
When ExpireDeadHosts arrives the following TRACE is printed out to the logs:
TRACE HeartbeatReceiver: Checking for hosts with no recent heartbeats in HeartbeatReceiver.
Each executor (in executorLastSeen registry) is checked whether the time it was last seen is
not longer than spark.network.timeout.
For any such executor, the following WARN message is printed out to the logs:
WARN HeartbeatReceiver: Removing executor [executorId] with no recent heartbeats: [time] ms exceeds timeout [timeout] ms
killExecutorThread).
The executor is removed from executorLastSeen.
Note
It is an internal message.
Heartbeat
Heartbeat(executorId: String,
accumUpdates: Array[(Long, Seq[AccumulatorV2[_, _]])],
blockManagerId: BlockManagerId)
When Heartbeat arrives and the internal scheduler is not set yet (no TaskSchedulerIsSet
earlier), the following WARN is printed out to the logs:
WARN HeartbeatReceiver: Dropping [heartbeat] because TaskScheduler is not ready yet
Heartbeats messages are the mechanism of executors to inform that they are alive.
If however the internal scheduler was set already, HeartbeatReceiver checks whether the
executor executorId is known (in executorLastSeen).
If the executor is not recognized, the following DEBUG message is printed out to the logs:
DEBUG HeartbeatReceiver: Received heartbeat from unknown executor [executorId]
TaskScheduler.executorHeartbeatReceived.
Caution
FIXME Figure
TaskSchedulerIsSet
When TaskSchedulerIsSet arrives, HeartbeatReceiver sets scheduler internal attribute
(using SparkContext.taskScheduler ).
Note
Note
It is an internal message.
Internal Registries
executorLastSeen - a registry of executor ids and the timestamps of when the last
Settings
spark.storage.blockManagerSlaveTimeoutMs
spark.storage.blockManagerSlaveTimeoutMs (default: 120s )
spark.network.timeout
Other
spark.storage.blockManagerTimeoutIntervalMs (default: 60s )
spark.network.timeoutInterval (default: spark.storage.blockManagerTimeoutIntervalMs )
Note
The example uses Spark in local mode, but the initialization with the other
cluster modes would follow similar steps.
Note
// the SparkContext code goes here
SparkContext.setActiveContext(this, allowMultipleContexts)
Tip
You can use version method to learn about the current Spark version or
org.apache.spark.SPARK_VERSION value.
Detected yarn cluster mode, but isn't running on a cluster. Deployment to YARN is not
supported directly by SparkContext. Please use spark-submit.
Caution
Note
The driver's host and port are set if missing. spark.driver.host becomes the value of
Utils.localHostName (or an exception is thrown) while spark.driver.port is set to 0 .
Note
Tip
It sets the jars and files based on spark.jars and spark.files , respectively. These are
files that are required for proper task execution on executors.
If event logging is enabled, i.e. spark.eventLog.enabled is true , the internal field
_eventLogDir is set to the value of spark.eventLog.dir setting or the default value
/tmp/spark-events . Also, if spark.eventLog.compress is true (default: false ), the short
Caution
Caution
CoarseMesosSchedulerBackend.
The value of SPARK_PREPEND_CLASSES environment variable is included in executorEnvs .
FIXME
What's _executorMemory ?
Caution
The setting spark.app.id is set to the current application id and Web UI gets notified about
it if used (using setAppId(_applicationId) ).
The BlockManager (for the driver) is initialized (with _applicationId ).
Caution
The drivers metrics (servlet handler) are attached to the web ui after the metrics system is
started.
_eventLogger is created and started if isEventLogEnabled . It uses EventLoggingListener
Caution
Note
Caution
FIXME It'd be quite useful to have all the properties with their default values
in sc.getConf.toDebugString , so when a configuration is not included but
does change Spark runtime configuration, it should be added to _conf .
LiveListenerBus with information about the Task Scheduler's scheduling mode, added jar and
file paths, and other environmental details. They are displayed in Web UI's Environment tab.
SparkListenerApplicationStart message is posted to LiveListenerBus (using the internal
postApplicationStart method).
TaskScheduler.postStartHook is called.
Note
Caution
Caution
Caution
FIXME
If there are two or more external cluster managers that could handle url , a
SparkException is thrown:
Note
Note
setupAndStartListenerBus
setupAndStartListenerBus(): Unit
When no single-SparkConf or zero-argument constructor could be found for a class name in
spark.extraListeners , a SparkException is thrown with the message:
createSparkEnv simply delegates the call to SparkEnv to create a SparkEnv for the driver.
It calculates the number of cores as 1 for the local master URL, the number of processors
available to the JVM for local[*] or the exact number in the master URL, or 0 for the cluster
master URLs.
Utils.getCurrentUserName
getCurrentUserName(): String
getCurrentUserName computes the user name who has started the SparkContext instance.
Note
Internally, it reads SPARK_USER environment variable and, if not set, reverts to Hadoop
Security APIs UserGroupInformation.getCurrentUser().getShortUserName() .
Note
It is another place where Spark relies on Hadoop API for its operation.
Utils.localHostName
localHostName computes the local host name.
Caution
stopped flag
Caution
RDD is the bread and butter of Spark, and mastering the concept is of utmost
importance to become a Spark pro. And you wanna be a Spark pro, don't you?
With RDD the creators of Spark managed to hide data partitioning and distribution, which in
turn allowed them to design a parallel computational framework with a higher-level
programming interface (API) for four mainstream programming languages.
Learning about RDD by its name:
Resilient, i.e. fault-tolerant with the help of RDD lineage graph and so able to
recompute missing or damaged partitions due to node failures.
Distributed with data residing on multiple nodes in a cluster.
Dataset is a collection of partitioned data with primitive values or values of values, e.g.
tuples or other objects (that represent records of the data you work with).
Figure 1. RDDs
From the scaladoc of org.apache.spark.rdd.RDD:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an
immutable, partitioned collection of elements that can be operated on in parallel.
From the original paper about RDD - Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing:
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that lets
programmers perform in-memory computations on large clusters in a fault-tolerant
manner.
Beside the above traits (that are directly embedded in the name of the data abstraction RDD) it has the following additional traits:
In-Memory, i.e. data inside RDD is stored in memory as much (size) and long (time) as
possible.
Immutable or Read-Only, i.e. it does not change once created and can only be
transformed using transformations to new RDDs.
Lazy evaluated, i.e. the data inside RDD is not available or transformed until an action
is executed that triggers the execution.
Cacheable, i.e. you can hold all the data in a persistent "storage" like memory (default
and the most preferred) or disk (the least preferred due to access speed).
Parallel, i.e. process data in parallel.
Typed, i.e. values in a RDD have types, e.g. RDD[Long] or RDD[(Int, String)] .
Partitioned, i.e. the data inside a RDD is partitioned (split into partitions) and then
distributed across nodes in a cluster (one partition per JVM that may or may not
correspond to a single node).
RDDs are distributed by design and to achieve even data distribution as well as leverage
data locality (in distributed systems like HDFS or Cassandra in which data is partitioned by
default), they are partitioned to a fixed number of partitions - logical chunks (parts) of data.
The logical division is for processing only and internally the data is not divided whatsoever. Each
partition comprises records.
Figure 2. RDDs
Partitions are the units of parallelism. You can control the number of partitions of a RDD
using repartition or coalesce operations. Spark tries to be as close to data as possible
without wasting time to send data across network by means of RDD shuffling, and creates
as many partitions as required to follow the storage layout and thus optimize data access. It
leads to a one-to-one mapping between (physical) data in distributed data storage, e.g.
HDFS or Cassandra, and partitions.
RDDs support two kinds of operations:
transformations - lazy operations that return another RDD.
actions - operations that trigger computation and return values.
The motivation to create RDD was (after the authors) two types of applications that current
computing frameworks handle inefficiently:
iterative algorithms in machine learning and graph computations.
interactive data mining tools as ad-hoc queries on the same dataset.
The goal is to reuse intermediate in-memory results across multiple data-intensive
workloads with no need for copying large amounts of data over the network.
An RDD is defined by five main intrinsic properties:
List of parent RDDs that is the list of the dependencies an RDD depends on for records.
An array of partitions that a dataset is divided to.
A compute function to do a computation on partitions.
An optional partitioner that defines how keys are hashed, and the pairs partitioned (for
key-value RDDs)
Optional preferred locations (aka locality info), i.e. hosts for a partition where the data
will have been loaded.
This RDD abstraction supports an expressive set of operations without having to modify
scheduler for each one.
An RDD is a named (by name) and uniquely identified (by id) entity inside a SparkContext. It
lives in a SparkContext and as a SparkContext creates a logical boundary, RDDs can't be
shared between SparkContexts (see SparkContext and RDDs).
An RDD can optionally have a friendly name accessible using name that can be changed
using the name setter:
scala> val ns = sc.parallelize(0 to 10)
ns: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24
scala> ns.id
res0: Int = 2
scala> ns.name
res1: String = null
scala> ns.name = "Friendly name"
ns.name: String = Friendly name
scala> ns.name
res2: String = Friendly name
scala> ns.toDebugString
res3: String = (8) Friendly name ParallelCollectionRDD[2] at parallelize at <console>:24 []
RDDs are a container of instructions on how to materialize big (arrays of) distributed data,
and how to split it into partitions so Spark (using executors) can hold some of them.
In general, data distribution can help executing processing in parallel so a task processes a
chunk of data that it could eventually keep in memory.
Spark does jobs in parallel, and RDDs are split into partitions to be processed and written in
parallel. Inside a partition, data is processed sequentially.
Saving partitions results in part-files instead of one single file (unless there is a single
partition).
Types of RDDs
These are some of the most interesting types of RDDs:
ParallelCollectionRDD
CoGroupedRDD
HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS
using the older MapReduce API. The most notable use case is the return RDD of
SparkContext.textFile .
Appropriate operations of a given RDD type are automatically available on a RDD of the
right type, e.g. RDD[(Int, Int)] , through implicit conversion in Scala.
Transformations
A transformation is a lazy operation on a RDD that returns another RDD, like map ,
flatMap , filter , reduceByKey , join , cogroup , etc.
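A minimal sketch showing the laziness - no job runs until the action in the last line:
scala> val rdd = sc.parallelize(1 to 10)
scala> val doubled = rdd.map(_ * 2).filter(_ > 10) // transformations only - no job yet
scala> doubled.count // the action triggers the computation
res0: Long = 5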
Tip
Actions
An action is an operation that triggers execution of RDD transformations and returns a value
(to a Spark driver - the user program).
Tip
Creating RDDs
SparkContext.parallelize
One way to create a RDD is with SparkContext.parallelize method. It accepts a collection
of elements as shown below ( sc is a SparkContext instance):
scala> val rdd = sc.parallelize(1 to 1000)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:25
Given the reason to use Spark to process more data than your own laptop could handle,
SparkContext.parallelize is mainly used to learn Spark in the Spark shell.
SparkContext.parallelize requires all the data to be available on a single machine - the Spark driver.
SparkContext.makeRDD
Caution
SparkContext.textFile
One of the easiest ways to create an RDD is to use SparkContext.textFile to read files.
You can use the local README.md file (and then flatMap over the lines inside to have an
RDD of words):
scala> val words = sc.textFile("README.md").flatMap(_.split("\\W+")).cache
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[27] at flatMap at <console>:24
Note
You cache it so the computation is not performed every time you work with
words .
Transformations
RDD transformations by definition transform an RDD into another RDD and hence are the
way to create new ones.
Refer to Transformations section to learn more.
RDDs in Web UI
It is quite informative to look at RDDs in the Web UI that is at http://localhost:4040 for Spark
shell.
Execute the following Spark application (type all the lines in spark-shell ):
val ints = sc.parallelize(1 to 100)  // (1) create an RDD of one hundred numbers
ints.setName("Hundred ints")         // (2) give the RDD a friendly name
ints.cache                           // (3) mark the RDD to be cached once computed
ints.count                           // (4) an action that triggers the computation (and the caching)
The abstract compute method computes the input split partition in the TaskContext to
produce a collection of values (of type T ).
It is implemented by any type of RDD in Spark and is called every time the records are
requested unless RDD is cached or checkpointed (and the records can be read from an
external storage, but this time closer to the compute node).
When an RDD is cached, for specified storage levels (i.e. all but NONE ) CacheManager is
requested to get or compute partitions.
compute method runs on the driver.
RDD).
getNumPartitions: Int
scala> sc.textFile("README.md").getNumPartitions
res0: Int = 2
scala> sc.textFile("README.md", 5).getNumPartitions
res1: Int = 5
Operators
Transformations
Transformations are lazy operations on a RDD that create one or many new RDDs, e.g.
map , filter , reduceByKey , join , cogroup , randomSplit .
In other words, transformations are functions that take a RDD as the input and produce one
or many RDDs as the output. They do not change the input RDD (since RDDs are
immutable and hence cannot be modified), but always produce one or more new RDDs by
applying the computations they represent.
By applying transformations you incrementally build a RDD lineage with all the parent RDDs
of the final RDD(s).
Transformations are lazy, i.e. are not executed immediately. Only after calling an action are
transformations executed.
After executing a transformation, the result RDD(s) will always be different from their parents
and can be smaller (e.g. filter , count , distinct , sample ), bigger (e.g. flatMap ,
union , cartesian ) or the same size (e.g. map ).
Caution
There are transformations that may trigger jobs, e.g. sortBy , zipWithIndex,
etc.
Narrow Transformations
Narrow transformations are the result of map , filter and similar operations where the data
comes from a single partition only, i.e. it is self-sustained.
An output RDD has partitions with records that originate from a single partition in the parent
RDD. Only a limited subset of partitions is used to calculate the result.
Spark groups narrow transformations as a stage which is called pipelining.
Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey . The data required to
compute the records in a single partition may reside in many partitions of the parent RDD.
Note
All of the tuples with the same key must end up in the same partition, processed by the
same task. To satisfy these operations, Spark must execute RDD shuffle, which transfers
data across cluster and results in a new stage with a new set of partitions.
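A small sketch - reduceByKey below introduces a ShuffledRDD into the lineage (the exact RDD numbers and the number of partitions may differ):
scala> val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
scala> pairs.reduceByKey(_ + _).toDebugString
res0: String =
(8) ShuffledRDD[1] at reduceByKey at <console>:27 []
 +-(8) ParallelCollectionRDD[0] at parallelize at <console>:24 []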
mapPartitions
Caution
FIXME
Using an external key-value store (like HBase, Redis, Cassandra) and performing
lookups/updates inside of your mappers (creating a connection within a mapPartitions code
block to avoid the connection setup/teardown overhead) might be a better solution.
If HBase is used as the external key-value store, atomicity is guaranteed.
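A rough sketch of the pattern - ExternalStore and its connect and lookup methods are hypothetical placeholders for your store's client API:
// Hypothetical client API: create the (expensive) connection once per partition,
// not once per record, and reuse it for all records in the partition.
val keys = sc.parallelize(Seq("a", "b", "c"))
val enriched = keys.mapPartitions { records =>
  val store = ExternalStore.connect("host:port") // hypothetical connection setup
  records.map(r => (r, store.lookup(r)))         // the one connection serves every record
}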
zipWithIndex
zipWithIndex(): RDD[(T, Long)]
If the number of partitions of the source RDD is greater than 1, it will submit
an additional job to calculate start indices.
val onePartition = sc.parallelize(0 to 9, 1)
scala> onePartition.partitions.length
res0: Int = 1
// no job submitted
onePartition.zipWithIndex
val eightPartitions = sc.parallelize(0 to 9, 8)
scala> eightPartitions.partitions.length
res1: Int = 8
Caution
// submits a job
eightPartitions.zipWithIndex
Actions
Actions are RDD operations that produce non-RDD values. They materialize a value in a
Spark program. In other words, a RDD operation that returns a value of any type but
RDD[T] is an action.
Note
They trigger execution of RDD transformations to return values. Simply put, an action
evaluates the RDD lineage graph.
You can think of actions as a valve and until action is fired, the data to be processed is not
even in the pipes, i.e. transformations. Only actions can materialize the entire processing
pipeline with real data.
Actions are one of two ways to send data from executors to the driver (the other being
accumulators).
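For example (a minimal sketch), reduce and collect below are actions that return values to the driver rather than new RDDs:
scala> val rdd = sc.parallelize(1 to 5)
scala> rdd.reduce(_ + _)
res0: Int = 15
scala> rdd.collect
res1: Array[Int] = Array(1, 2, 3, 4, 5)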
Actions in org.apache.spark.rdd.RDD:
aggregate
collect
count
countApprox*
countByValue*
first
fold
foreach
foreachPartition
max
min
reduce
Tip
You should cache RDDs you work with when you want to execute two or more
actions on them for better performance. Refer to RDD Caching and Persistence.
AsyncRDDActions
AsyncRDDActions class offers asynchronous actions that you can use on RDDs (thanks to
foreachPartitionAsync
FutureActions
Caution
FIXME
The following diagram uses cartesian or zip for learning purposes only. You
may use other operators to build a RDD graph.
A RDD lineage graph is hence a graph of what transformations need to be executed after an
action has been called.
You can learn about a RDD lineage graph using RDD.toDebugString method.
toDebugString
toDebugString: String
You can learn about a RDD lineage graph using toDebugString method.
The numbers in round brackets show the level of parallelism at each stage.
spark.logLineage
Enable spark.logLineage (assumed: false ) to see a RDD lineage graph using
RDD.toDebugString method every time an action on a RDD is called.
$ ./bin/spark-shell -c spark.logLineage=true
scala> sc.textFile("README.md", 4).count
...
15/10/17 14:46:42 INFO SparkContext: Starting job: count at <console>:25
15/10/17 14:46:42 INFO SparkContext: RDD's recursive dependencies:
(4) MapPartitionsRDD[1] at textFile at <console>:25 []
| README.md HadoopRDD[0] at textFile at <console>:25 []
Caution
1. How does the number of partitions map to the number of tasks? How to
verify it?
2. How does the mapping between partitions and tasks correspond to data
locality if any?
Spark manages data using partitions, which helps parallelize distributed data processing with
minimal network traffic for sending data between executors.
By default, Spark tries to read data into an RDD from the nodes that are close to it. Since
Spark usually accesses distributed partitioned data, to optimize transformation operations it
creates partitions to hold the data chunks.
There is a one-to-one correspondence between how data is laid out in data storage like
HDFS or Cassandra (it is partitioned for the same reasons).
Features:
size
number
partitioning scheme
node distribution
repartitioning
Read the following documentations to learn what experts say on the topic:
Tip
By default, a partition is created for each HDFS partition, which by default is 64MB (from
Spark's Programming Guide).
RDDs get partitioned automatically without programmer intervention. However, there are
times when you'd like to adjust the size and number of partitions or the partitioning scheme
according to the needs of your application.
You use def getPartitions: Array[Partition] method on a RDD to know the set of
partitions in this RDD.
As noted in View Task Execution Against Partitions Using the UI:
When a stage executes, you can see the number of partitions for a given stage in the
Spark UI.
Start spark-shell and see it yourself!
scala> sc.parallelize(1 to 100).count
res0: Long = 100
When you execute the Spark job, i.e. sc.parallelize(1 to 100).count , you should see the
following in Spark shell application UI.
You can request the minimum number of partitions using the second input parameter to
many transformations.
scala> sc.parallelize(1 to 100, 2).count
res1: Long = 100
Also, the number of partitions determines how many files get generated by actions that save
RDDs to files.
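For example (a sketch - the output path is arbitrary), saving an RDD with 4 partitions produces 4 part-files:
scala> sc.parallelize(1 to 100, 4).saveAsTextFile("/tmp/numbers")
$ ls /tmp/numbers
_SUCCESS part-00000 part-00001 part-00002 part-00003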
The maximum size of a partition is ultimately limited by the available memory of an executor.
In the first RDD transformation, e.g. reading from a file using sc.textFile(path, partition) ,
the partition parameter will be applied to all further transformations and actions on this
RDD.
Partitions get redistributed among nodes whenever shuffle occurs. Repartitioning may
cause shuffle to occur in some situations, but it is not guaranteed to occur in all cases.
And it usually happens during action stage.
When creating an RDD by reading a file using rdd = SparkContext().textFile("hdfs:///file.txt")
the number of partitions may be smaller. Ideally, you would get the same
number of blocks as you see in HDFS, but if the lines in your file are too long (longer than
the block size), there will be fewer partitions.
Preferred way to set up the number of partitions for an RDD is to directly pass it as the
second input parameter in the call like rdd = sc.textFile("hdfs:///file.txt", 400) , where
400 is the number of partitions. In this case, the partitioning makes for 400 splits that would
be done by Hadoop's TextInputFormat , not Spark, and it would work much faster. It's
also that the code spawns 400 concurrent tasks to try to load file.txt directly into 400
partitions.
It will only work as described for uncompressed files.
When using textFile with compressed files ( file.txt.gz not file.txt or similar), Spark
disables splitting that makes for an RDD with only 1 partition (as reads against gzipped files
cannot be parallelized). In this case, to change the number of partitions you should do
repartitioning.
Some operations, e.g. map , flatMap , filter , don't preserve partitioning.
map , flatMap , filter operations apply a function to every partition.
Repartitioning
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null) does
scala> lines.repartition(5).count
...
15/10/07 08:10:00 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 7 (MapPartitionsRDD[19] at repartition at <console>:27)
15/10/07 08:10:00 INFO TaskSchedulerImpl: Adding task set 7.0 with 5 tasks
15/10/07 08:10:00 INFO TaskSetManager: Starting task 0.0 in stage 7.0 (TID 17, localhost, partition 0,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 1.0 in stage 7.0 (TID 18, localhost, partition 1,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 2.0 in stage 7.0 (TID 19, localhost, partition 2,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 3.0 in stage 7.0 (TID 20, localhost, partition 3,NODE_LOCAL, 2089 bytes)
15/10/07 08:10:00 INFO TaskSetManager: Starting task 4.0 in stage 7.0 (TID 21, localhost, partition 4,NODE_LOCAL, 2089 bytes)
...
You can see the change after executing repartition(1) , which causes 2 tasks to be started using
PROCESS_LOCAL data locality.
scala> lines.repartition(1).count
...
15/10/07 08:14:09 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 8 (MapPartitionsRDD[20] at repartition at <console>:27)
15/10/07 08:14:09 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
15/10/07 08:14:09 INFO TaskSetManager: Starting task 0.0 in stage 8.0 (TID 22, localhost, partition 0,PROCESS_LOCAL, 2058 bytes)
15/10/07 08:14:09 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 23, localhost, partition 1,PROCESS_LOCAL, 2058 bytes)
...
Please note that Spark disables splitting for compressed files and creates RDDs with only 1
partition. In such cases, it's helpful to use sc.textFile('demo.gz') and do repartitioning
using rdd.repartition(100) as follows:
rdd = sc.textFile('demo.gz')
rdd = rdd.repartition(100)
With these lines, you end up with rdd having exactly 100 partitions of roughly equal size.
rdd.repartition(N) does a shuffle to split data to match N partitions.
If the partitioning scheme doesn't work for you, you can write your own custom
partitioner.
Tip
coalesce transformation
coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
The coalesce transformation is used to change the number of partitions. It can trigger RDD
shuffling depending on the second shuffle boolean input parameter (defaults to false ).
In the following sample, you parallelize a local 10-number sequence and coalesce it first
without and then with shuffling (note the shuffle parameter being false and true ,
respectively). You use toDebugString to check out the RDD lineage graph.
scala> val rdd = sc.parallelize(0 to 10, 8)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> rdd.partitions.size
res0: Int = 8
scala> rdd.coalesce(numPartitions=8, shuffle=false) (1)
res1: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[1] at coalesce at <console>:27
scala> res1.toDebugString
res2: String =
(8) CoalescedRDD[1] at coalesce at <console>:27 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
scala> rdd.coalesce(numPartitions=8, shuffle=true)
res3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at coalesce at <console>:27
scala> res3.toDebugString
res4: String =
(8) MapPartitionsRDD[5] at coalesce at <console>:27 []
| CoalescedRDD[4] at coalesce at <console>:27 []
| ShuffledRDD[3] at coalesce at <console>:27 []
+-(8) MapPartitionsRDD[2] at coalesce at <console>:27 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
1. shuffle is false by default and it's explicitly used here for demo purposes. Note that the
number of partitions remains the same as the number of partitions in the source RDD rdd .
Partitioner
Caution
FIXME
A partitioner captures data distribution at the output. A scheduler can optimize future
operations based on this.
val partitioner: Option[Partitioner] specifies how the RDD is partitioned.
HashPartitioner
Caution
FIXME
HashPartitioner is the default partitioner for the coalesce operation when shuffle is allowed.
Shuffling
RDD shuffling
Tip
Read the official documentation about the topic Shuffle operations. It is still better
than this page.
Shuffling is a process of redistributing data across partitions (aka repartitioning) that may or
may not cause moving data across JVM processes or even over the wire (between
executors on separate machines).
Shuffling is the process of data transfer between stages.
Tip
Avoid shuffling at all cost. Think about ways to leverage existing partitions.
Leverage partial aggregation to reduce data transfer.
By default, shuffling doesn't change the number of partitions, but their content.
Avoid groupByKey and use reduceByKey or combineByKey instead (see the sketch after this list).
groupByKey shuffles all the data, which is slow.
reduceByKey shuffles only the results of sub-aggregations in each partition of the
data.
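A small sketch of the reduceByKey tip above - a word count where partial sums are computed within each partition before the shuffle (the order of the results may differ):
scala> val words = sc.parallelize(Seq("ant", "bee", "ant", "cat", "bee", "ant"))
scala> words.map(w => (w, 1)).reduceByKey(_ + _).collect
res0: Array[(String, Int)] = Array((ant,3), (bee,2), (cat,1))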
Example - join
PairRDD offers join transformation that (quoting the official documentation):
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs
with all pairs of elements for each key.
Let's have a look at an example and see how it works under the covers:
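A minimal sketch of join on two pair RDDs (the keys and values are arbitrary and the order of the results may differ):
scala> val left = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
scala> val right = sc.parallelize(Seq((1, "x"), (2, "y"), (4, "z")))
scala> val joined = left.join(right)
scala> joined.collect
res0: Array[(Int, (String, String))] = Array((1,(a,x)), (2,(b,y)))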
It doesn't look good when there is an "angle" between "nodes" in an operation graph. It
appears before the join operation so a shuffle is expected.
Here is how the job of executing joined.count looks in Web UI.
Caution
join operation is one of the cogroup operations that uses defaultPartitioner , i.e. walks
through the RDD lineage graph (sorted by the number of partitions decreasing) and picks
the partitioner with positive number of output partitions. Otherwise, it checks
spark.default.parallelism setting and, if defined, picks HashPartitioner with the default parallelism.
Caution
Checkpointing
Introduction
Checkpointing is a process of truncating RDD lineage graph and saving it to a reliable
distributed (HDFS) or local file system.
There are two types of checkpointing:
reliable - in Spark (core), RDD checkpointing that saves the actual intermediate RDD
data to a reliable distributed file system, e.g. HDFS.
local - in Spark Streaming or GraphX - RDD checkpointing that truncates RDD lineage
graph.
It's up to a Spark application developer to decide when and how to checkpoint using
RDD.checkpoint() method.
Before checkpointing is used, a Spark developer has to set the checkpoint directory using
SparkContext.setCheckpointDir(directory: String) method.
Reliable Checkpointing
You call SparkContext.setCheckpointDir(directory: String) to set the checkpoint directory
- the directory where RDDs are checkpointed. The directory must be a HDFS path if
running on a cluster. The reason is that the driver may attempt to reconstruct the
checkpointed RDD from its own local file system, which is incorrect because the checkpoint
files are actually on the executor machines.
You mark an RDD for checkpointing by calling RDD.checkpoint() . The RDD will be saved to
a file inside the checkpoint directory and all references to its parent RDDs will be removed.
This function has to be called before any job has been executed on this RDD.
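A minimal sketch (the checkpoint directory is arbitrary):
scala> sc.setCheckpointDir("/tmp/checkpoints")
scala> val nums = sc.parallelize(0 to 9).map(_ * 2)
scala> nums.checkpoint()
scala> nums.count // the first action materializes and checkpoints the RDD
res0: Long = 10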
Note
When an action is called on a checkpointed RDD, the following INFO message is printed out
in the logs:
15/10/10 21:08:57 INFO ReliableRDDCheckpointData: Done checkpointing RDD 5 to file:/Users/jacek/dev/oss/spark/checkpoints/91514c29-d44b-4d95-ba02-480027b7c174/rdd-5, new parent is RDD 6
ReliableRDDCheckpointData
When RDD.checkpoint() operation is called, all the information related to RDD
checkpointing are in ReliableRDDCheckpointData .
spark.cleaner.referenceTracking.cleanCheckpoints (default: false ) - whether clean
ReliableCheckpointRDD
After RDD.checkpoint the RDD has ReliableCheckpointRDD as the new parent with the exact
number of partitions as the RDD.
Local Checkpointing
Beside the RDD.checkpoint() method, there is a similar one - RDD.localCheckpoint() - that
marks the RDD for local checkpointing using Spark's existing caching layer.
This RDD.localCheckpoint() method is for users who wish to truncate RDD lineage graph
while skipping the expensive step of replicating the materialized data in a reliable distributed
file system. This is useful for RDDs with long lineages that need to be truncated periodically,
e.g. GraphX.
Local checkpointing trades fault-tolerance for performance.
The checkpoint directory set through SparkContext.setCheckpointDir is not used.
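A minimal sketch of local checkpointing:
val rdd = sc.parallelize(0 to 9).map(_ + 1)
rdd.localCheckpoint() // truncates the lineage using the caching layer, no reliable storage
rdd.count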
LocalRDDCheckpointData
FIXME
LocalCheckpointRDD
FIXME
Dependencies
Dependency (represented by Dependency class) is a connection between RDDs after
applying a transformation.
You can use the RDD.dependencies method to get the collection of dependencies of an RDD
( Seq[Dependency[_]] ).
scala> val r1 = sc.parallelize(0 to 9)
r1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[20] at parallelize at <conso
le>:18
scala> val r2 = sc.parallelize(0 to 9)
r2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[21] at parallelize at <conso
le>:18
scala> val r3 = sc.parallelize(0 to 9)
r3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at parallelize at <conso
le>:18
scala> val r4 = sc.union(r1, r2, r3)
r4: org.apache.spark.rdd.RDD[Int] = UnionRDD[23] at union at <console>:24
scala> r4.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = ArrayBuffer(org.apache.spark.RangeDependen
cy@6f2ab3f6, org.apache.spark.RangeDependency@7aa0e351, org.apache.spark.RangeDependen
cy@26468)
scala> r4.toDebugString
res1: String =
(24) UnionRDD[23] at union at <console>:24 []
| ParallelCollectionRDD[20] at parallelize at <console>:18 []
| ParallelCollectionRDD[21] at parallelize at <console>:18 []
| ParallelCollectionRDD[22] at parallelize at <console>:18 []
scala> r4.collect
...
res2: Array[Int] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0
, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Kinds of Dependencies
Dependency is the base abstract class with a single def rdd: RDD[T] method.
ShuffleDependency
A ShuffleDependency represents a dependency on the output of a shuffle map stage.
scala> val r = sc.parallelize(0 to 9).groupBy(identity)
r: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[12] at groupBy at <con
sole>:18
scala> r.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@49
3b0b09)
combineByKeyWithClassTag
combineByKey
aggregateByKey
foldByKey
reduceByKey
countApproxDistinctByKey
groupByKey
partitionBy
Note
NarrowDependency
NarrowDependency is an abstract extension of Dependency with narrow (limited) number of
partitions of the parent RDD that are required to compute a partition of the child RDD.
Narrow dependencies allow for pipelined execution.
NarrowDependency extends the base with the additional method:
to get the parent partitions for a partition partitionId of the child RDD.
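The additional method is getParents, with the following signature:
def getParents(partitionId: Int): Seq[Int]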
OneToOneDependency
OneToOneDependency is a narrow dependency that represents a one-to-one dependency
between partitions of the parent and child RDDs.
PruneDependency
PruneDependency is a narrow dependency that represents a dependency between the
PartitionPruningRDD and its parent.
RangeDependency
RangeDependency is a narrow dependency that represents a one-to-one dependency
between ranges of partitions in the parent and child RDDs.
ParallelCollectionRDD
ParallelCollectionRDD is an RDD of a collection of elements with numSlices partitions and
optional locationPrefs .
ParallelCollectionRDD is the result of SparkContext.parallelize and SparkContext.makeRDD
methods.
The data collection is split into numSlices slices.
It uses ParallelCollectionPartition .
MapPartitionsRDD
MapPartitionsRDD is an RDD that applies the provided function f to every partition of the
parent RDD.
By default, it does not preserve partitioning - the last input parameter
preservesPartitioning is false . If it is true , it retains the original RDD's partitioning.
MapPartitionsRDD is the result of the following transformations:
map
flatMap
filter
glom
mapPartitions
mapPartitionsWithIndex
PairRDDFunctions.mapValues
PairRDDFunctions.flatMapValues
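A quick way to see a MapPartitionsRDD in action (a sketch in the Spark shell):
val nums = sc.parallelize(0 to 9)
val doubled = nums.map(_ * 2) // map produces a MapPartitionsRDD over the parent
doubled.toDebugString         // shows MapPartitionsRDD on top of ParallelCollectionRDD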
PairRDDFunctions
Tip
PairRDDFunctions are available in RDDs of key-value pairs via Scala's implicit conversion.
Tip
Partitioning is an advanced feature that is directly linked to (or inferred by) use
of PairRDDFunctions . Read up about it in Partitions and Partitioning.
Think of situations where kind has low cardinality or a highly skewed distribution, and using
the technique for partitioning might not be an optimal solution.
You could do as follows:
rdd.keyBy(_.kind).reduceByKey(....)
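Spelled out a bit more (Record and its kind field are made up for illustration):
case class Record(kind: String, value: Int)
val rdd = sc.parallelize(Seq(Record("a", 1), Record("b", 2), Record("a", 3)))
// key every record by its kind, then aggregate per kind
val totals = rdd.keyBy(_.kind).reduceByKey((r1, r2) => Record(r1.kind, r1.value + r2.value))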
mapValues, flatMapValues
Caution
FIXME
combineByKeyWithClassTag
combineByKeyWithClassTag has the mapSideCombine flag enabled by default. It then creates a
ShuffledRDD with the value of mapSideCombine when the input partitioner is different from
the current one in the RDD.
The function is a generic base function for combineByKey -based and
combineByKeyWithClassTag -based functions, aggregateByKey , foldByKey , reduceByKey ,
countApproxDistinctByKey , and groupByKey .
CoGroupedRDD
An RDD that cogroups its pair RDD parents. For each key k in parent RDDs, the resulting
RDD contains a tuple with the list of values for that key.
Use RDD.cogroup() to create one.
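A minimal sketch:
val left = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (1, "y")))
val grouped = left.cogroup(right) // RDD[(Int, (Iterable[String], Iterable[String]))]
grouped.collect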
HadoopRDD
HadoopRDD is an RDD that provides core functionality for reading data stored in HDFS, a
local file system (available on all nodes), or any Hadoop-supported file system URI using the
older MapReduce API (org.apache.hadoop.mapred).
HadoopRDD is created as a result of calling the following methods in SparkContext:
hadoopFile
textFile (the most often used in examples!)
sequenceFile
getPartitions
The number of partitions for HadoopRDD, i.e. the return value of getPartitions , is
calculated using InputFormat.getSplits(jobConf, minPartitions) , where minPartitions is
only a hint of how many partitions one may want at minimum. Being a hint, it does not mean
the number of partitions will be exactly the number given.
For SparkContext.textFile the input format class is
org.apache.hadoop.mapred.TextInputFormat.
The javadoc of org.apache.hadoop.mapred.FileInputFormat says:
FileInputFormat is the base class for all file-based InputFormats. This provides a
generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can
also override the isSplitable(FileSystem, Path) method to ensure input-files are not
split-up and are processed as a whole by Mappers.
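For example, the minPartitions hint can be passed to SparkContext.textFile (the path below is hypothetical):
val lines = sc.textFile("hdfs:///path/to/input.txt", minPartitions = 8)
lines.partitions.length // the actual number comes from InputFormat.getSplits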
ShuffledRDD
ShuffledRDD is an RDD of (key, value) pairs. It is a shuffle step (the result RDD) for
transformations that trigger a shuffle at execution time. Such transformations ultimately call
the coalesce transformation with the shuffle input parameter true (default: false ).
As you may have noticed, groupBy transformation adds ShuffledRDD RDD that will execute
shuffling at execution time (as depicted in the following screenshot).
when it shouldn't)
partitionBy (only when the input partitioner is different from the current one in an
RDD)
It uses Partitioner.
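For example, partitionBy with a partitioner different from the current one produces a ShuffledRDD (a sketch):
import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(0 to 9).map(n => (n, n))
val repartitioned = pairs.partitionBy(new HashPartitioner(4)) // ShuffledRDD
repartitioned.toDebugString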
BlockRDD
Caution
FIXME
Spark Tools
Spark shell
Spark shell is an interactive shell for learning about Apache Spark, running ad-hoc queries,
and developing Spark applications. It is a very convenient tool to explore the many things
available in Spark and one of the many reasons why Spark is so helpful even for very simple
tasks (see Why Spark).
There are variants of the Spark shell for different languages: spark-shell for Scala and
pyspark for Python.
Note
Set SPARK_PRINT_LAUNCH_COMMAND to see the entire command to be executed. Refer to
Command of Spark Scripts.
Spark shell boils down to executing spark-submit, and so command-line arguments of spark-submit
become the Spark shell's, e.g. --verbose .
$ ./bin/spark-shell
Spark context available as sc.
SQL context available as spark.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0-SNAPSHOT
/_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark shell gives you the sc value, which is the SparkContext for the session.
scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@2ac0cb64
To close Spark shell, you press Ctrl+D or type in :q (or any subset of :quit ).
scala> :quit
Tip
These nulls could instead be replaced with some other, more meaningful values.
You can use the web UI after the application is finished by persisting events using
EventLoggingListener.
Note
Environment Tab
SparkUI
SparkUI is... FIXME
createLiveUI
Caution
FIXME
appUIAddress
Caution
FIXME
Settings
spark.ui.enabled
spark.ui.enabled (default: true ) setting controls whether the web UI is started at all.
spark.ui.port
spark.ui.port (default: 4040 ) controls the port Web UI binds to.
If multiple SparkContexts attempt to run on the same host (it is not possible to have two or
more Spark contexts on a single JVM, though), they will bind to successive ports beginning
with spark.ui.port .
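For example, you could start a shell on a non-default port explicitly (an illustrative value):
$ ./bin/spark-shell --conf spark.ui.port=4041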
spark.ui.killEnabled
spark.ui.killEnabled (default: true ) - whether or not you can kill stages in web UI.
Stages Tab
The state sections are only displayed when there are stages in a given state.
Refer to Stages for All Jobs.
In FAIR scheduling mode you have access to the table showing the scheduler pools.
killEnabled flag
Caution
FIXME
The Stages tab shows the current state of all stages in a Spark application - active, pending, completed, and failed stages with their count.
Figure 1. Stages Tab in web UI for FAIR scheduling mode (with pools only)
In FAIR scheduling mode you have access to the table showing the scheduler pools as well
as the pool names per stage.
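A sketch of how you could end up with pools in this view, assuming FAIR mode is enabled at startup (the pool name production is made up):
$ ./bin/spark-shell --conf spark.scheduler.mode=FAIR
scala> sc.setLocalProperty("spark.scheduler.pool", "production")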
Note
Internally, AllStagesPage is a WebUIPage with access to the parent Stages tab and more
importantly the JobProgressListener to have access to current state of the entire Spark
application.
Caution
There are 4 different tables for the different states of stages - active, pending, completed,
and failed. They are displayed only when there are stages in a given state.
Figure 2. Stages Tab in web UI for FAIR scheduling mode (with pools and stages)
You could also notice "retry" for stage when it was retried.
Caution
FIXME A screenshot
Stage Details
StagePage shows the task details for a stage given its id and attempt id.
The 1st row is Duration which includes the quantiles based on executorRunTime .
The 2nd row is the optional Scheduler Delay which includes the time to ship the task from
the scheduler to executors, and the time to send the task result from the executors to the
scheduler. It is not enabled by default and you should select Scheduler Delay checkbox
under Show Additional Metrics to include it in the summary table.
Tip
The 3rd row is the optional Task Deserialization Time which includes the quantiles based
on executorDeserializeTime task metric. It is not enabled by default and you should select
Task Deserialization Time checkbox under Show Additional Metrics to include it in the
summary table.
The 4th row is GC Time which is the time that an executor spent paused for Java garbage
collection while the task was running (using jvmGCTime task metric).
The 5th row is the optional Result Serialization Time which is the time spent serializing the
task result on an executor before sending it back to the driver (using
resultSerializationTime task metric). It is not enabled by default and you should select
Result Serialization Time checkbox under Show Additional Metrics to include it in the
summary table.
The 6th row is the optional Getting Result Time which is the time that the driver spends
fetching task results from workers. It is not enabled by default and you should select Getting
Result Time checkbox under Show Additional Metrics to include it in the summary table.
Tip
If Getting Result Time is large, consider decreasing the amount of data returned
from each task.
If Tungsten is enabled (it is by default), the 7th row is the optional Peak Execution Memory
which is the sum of the peak sizes of the internal data structures created during shuffles,
aggregations and joins (using peakExecutionMemory task metric). For SQL jobs, this only
tracks all unsafe operators, broadcast joins, and external sort. It is not enabled by default
and you should select Peak Execution Memory checkbox under Show Additional Metrics
to include it in the summary table.
If the stage has an input, the 8th row is Input Size / Records which is the bytes and records
read from Hadoop or from a Spark storage (using inputMetrics.bytesRead and
inputMetrics.recordsRead task metrics).
If the stage has an output, the 9th row is Output Size / Records which is the bytes and
records written to Hadoop or to a Spark storage (using outputMetrics.bytesWritten and
outputMetrics.recordsWritten task metrics).
If the stage has shuffle read there will be three more rows in the table. The first row is
Shuffle Read Blocked Time which is the time that tasks spent blocked waiting for shuffle
data to be read from remote machines (using shuffleReadMetrics.fetchWaitTime task
metric). The other row is Shuffle Read Size / Records which is the total shuffle bytes and
records read (including both data read locally and data read from remote executors using
shuffleReadMetrics.totalBytesRead and shuffleReadMetrics.recordsRead task metrics). And
the last row is Shuffle Remote Reads which is the total shuffle bytes read from remote
executors (which is a subset of the shuffle read bytes; the remaining shuffle data is read
locally). It uses shuffleReadMetrics.remoteBytesRead task metric.
If the stage has shuffle write, the following row is Shuffle Write Size / Records (using
shuffleWriteMetrics.bytesWritten and shuffleWriteMetrics.recordsWritten task metrics).
If the stage has bytes spilled, the following two rows are Shuffle spill (memory) (using
memoryBytesSpilled task metric) and Shuffle spill (disk) (using diskBytesSpilled task
metric).
Request Parameters
id is the id of the stage (mandatory).
attempt is the attempt id of the stage (mandatory).
Note
task.page (default: 1 ) is...
task.sort (default: Index ) is...
Metrics
Scheduler Delay is... FIXME
Task Deserialization Time is... FIXME
Result Serialization Time is... FIXME
Getting Result Time is... FIXME
Peak Execution Memory is... FIXME
Shuffle Read Time is... FIXME
Executor Computing Time is... FIXME
Shuffle Write Time is... FIXME
Executor ID
Address
Task Time
Total Tasks
Failed Tasks
Killed Tasks
Succeeded Tasks
(optional) Input Size / Records (only when the stage has an input)
(optional) Output Size / Records (only when the stage has an output)
(optional) Shuffle Read Size / Records (only when the stage read bytes for a shuffle)
(optional) Shuffle Write Size / Records (only when the stage wrote bytes for a shuffle)
(optional) Shuffle Spill (Memory) (only when the stage spilled memory bytes)
(optional) Shuffle Spill (Disk) (only when the stage spilled bytes to disk)
Accumulators
Stage page displays the table with named accumulators (only if they exist). It contains the
name and value of the accumulators.
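For example, a named accumulator created as below would appear in that table (the name is arbitrary):
val acc = sc.accumulator(0L, "records processed") // the name makes it show up in web UI
sc.parallelize(1 to 100).foreach(_ => acc += 1L)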
Tasks
Settings
spark.ui.timeline.tasks.maximum
spark.ui.timeline.tasks.maximum (default: 1000 ) FIXME
spark.sql.unsafe.enabled
spark.sql.unsafe.enabled (default: true ) is... FIXME
order by default).
Summary Table
The Summary table shows the details of a Schedulable pool.
Request Parameters
poolname
poolname is the name of the scheduler pool to display on the page. It is a mandatory
request parameter.
Storage Tab
Caution
FIXME
Executors Tab
Caution
FIXME
SQL Tab
The SQL tab in web UI displays accumulator values per operator.
Caution
FIXME Intro
You can access the SQL tab under /SQL URL, e.g. http://localhost:4040/SQL/.
By default, it displays all SQL query executions. However, after a query has been selected,
the SQL tab displays the details of the SQL query execution.
AllExecutionsPage
AllExecutionsPage displays all SQL query executions in a Spark application per state sorted
ExecutionPage
ExecutionPage displays SQL query execution details for a given query execution id .
Note
ExecutionPage displays a summary with Submitted Time, Duration, the clickable identifiers
It also displays a visualization (using accumulator updates and the SparkPlanGraph for the
query) with the expandable Details section (that corresponds to
SQLExecutionUIData.physicalPlanDescription ).
Note
SharedState represents the shared state across all active SQL sessions.
SQLListener
SQLListener is a custom SparkListener that collects information about SQL query
executions for web UI (to display in SQL tab). It relies on spark.sql.execution.id key to
distinguish between queries.
Internally, it uses SQLExecutionUIData data structure exclusively to record all the necessary
data for a single SQL query execution. SQLExecutionUIData is tracked in the internal
registries, i.e. activeExecutions , failedExecutions , and completedExecutions as well as
lookup tables, i.e. _executionIdToData , _jobIdToExecutionId , and _stageIdToStageMetrics .
SQLListener starts recording a query execution by intercepting a SparkListenerJobStart event (using the onJobStart callback).
onJobStart reads the spark.sql.execution.id key, the identifiers of the job and the stages
and then updates the SQLExecutionUIData for the execution id in activeExecutions internal
registry.
Note
The job in SQLExecutionUIData is marked as running with the stages added (to stages ).
For each stage, a SQLStageMetrics is created in the internal _stageIdToStageMetrics
registry. At the end, the execution id is recorded for the job id in the internal
_jobIdToExecutionId .
onOtherEvent
In onOtherEvent , SQLListener listens to the following SparkListenerEvent events:
SparkListenerSQLExecutionStart
SparkListenerSQLExecutionEnd
SparkListenerDriverAccumUpdates
SparkListenerSQLExecutionEnd
case class SparkListenerSQLExecutionEnd(
executionId: Long,
time: Long)
extends SparkListenerEvent
If there are no other running jobs (registered in SQLExecutionUIData), the query execution
is removed from the activeExecutions internal registry and moved to either
completedExecutions or failedExecutions registry.
SparkListenerDriverAccumUpdates
case class SparkListenerDriverAccumUpdates(
executionId: Long,
accumUpdates: Seq[(Long, Long)])
extends SparkListenerEvent
onJobEnd
onJobEnd(jobEnd: SparkListenerJobEnd): Unit
When called, onJobEnd retrieves the SQLExecutionUIData for the job and records it either
successful or failed depending on the job result.
If it is the last job of the query execution (tracked as SQLExecutionUIData), the execution is
removed from activeExecutions internal registry and moved to either
If the query execution has already been marked as completed (using completionTime ) and
there are no other running jobs (registered in SQLExecutionUIData), the query execution is
removed from the activeExecutions internal registry and moved to either
completedExecutions or failedExecutions registry.
getExecutionMetrics gets the metrics (aka accumulator updates) for executionId (by which
mergeAccumulatorUpdates method
mergeAccumulatorUpdates is a private helper method for... TK
SQLExecutionUIData
SQLExecutionUIData is the data abstraction of SQLListener to describe SQL query
executions. It is a container for jobs, stages, and accumulator updates for a single query
execution.
Settings
spark.sql.ui.retainedExecutions
spark.sql.ui.retainedExecutions (default: 1000 ) is the number of SQLExecutionUIData
entries to keep in the internal registries given the end execution status. It is when
SQLListener makes sure that the number of SQLExecutionUIData entries does not exceed
spark.sql.ui.retainedExecutions and removes the excess of the old entries.
JobProgressListener
JobProgressListener is the SparkListener for web UI.
As a SparkListener it intercepts Spark events and collects information about jobs, stages,
and tasks that the web UI uses to present the status of a Spark application.
JobProgressListener is interested in the following events:
1. A job starts.
Caution
poolToActiveStages
poolToActiveStages = HashMap[PoolName, HashMap[StageId, StageInfo]]()
poolToActiveStages
Caution
FIXME
When called, onJobStart reads the optional Spark Job group id (using
SparkListenerJobStart.properties and SparkContext.SPARK_JOB_GROUP_ID key).
It then creates a JobUIData (as jobData ) based on the input jobStart . status attribute is
JobExecutionStatus.RUNNING .
The internal jobGroupToJobIds is updated with the job group and job ids.
The internal pendingStages is updated with StageInfo for the stage id (for every
StageInfo in SparkListenerJobStart.stageInfos collection).
numTasks attribute in the jobData (as JobUIData instance created above) is set to the sum
of tasks in every stage (from jobStart.stageInfos ) for which completionTime attribute is not
set.
The internal jobIdToData and activeJobs are updated with jobData for the current job.
The internal stageIdToActiveJobIds is updated with the stage id and job id (for every stage in
the input jobStart ).
The internal stageIdToInfo is updated with the stage id and StageInfo (for every StageInfo
in jobStart.stageInfos ).
A StageUIData is added to the internal stageIdToData for every StageInfo (in
jobStart.stageInfos ).
Note
stageIdToInfo Registry
stageIdToInfo = new HashMap[StageId, StageInfo]
stageIdToActiveJobIds Registry
stageIdToActiveJobIds = new HashMap[StageId, HashSet[JobId]]
jobIdToData Registry
jobIdToData = new HashMap[JobId, JobUIData]
activeJobs Registry
activeJobs = new HashMap[JobId, JobUIData]
pendingStages Registry
pendingStages = new HashMap[StageId, StageInfo]
Caution
FIXME
JobUIData
Caution
FIXME
blockManagerIds method
blockManagerIds: Seq[BlockManagerId]
Caution
FIXME
Registries
stageIdToData Registry
stageIdToData = new HashMap[(StageId, StageAttemptId), StageUIData]
stageIdToData holds StageUIData per stage (given the stage and attempt ids).
StageUIData
Caution
FIXME
schedulingMode Attribute
schedulingMode attribute is used to show the scheduling mode for the Spark application in
Spark UI.
Note
field.
Note
spark-submit script
spark-submit script allows you to manage your Spark applications. You can submit your
Spark application for execution, kill a submission, or request its status.
--driver-cores command-line option sets the number of cores for the driver in the cluster
deploy mode.
System Properties
spark-submit collects system properties for execution in the internal sysProps .
Caution
--jars is a comma-separated list of local jars to include on the driver's and executors'
classpaths.
Caution
FIXME
Caution
FIXME
Caution
FIXME
With --queue you can choose the YARN queue to submit a Spark application to. The
default queue name is default .
Caution
Note
Note
Actions
Submitting Applications for Execution (submit method)
The default action of spark-submit script is to submit a Spark application to a deployment
environment for execution.
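A typical submission could look as follows (the class, jar, and master URL are hypothetical):
$ ./bin/spark-submit \
  --class org.example.MyApp \
  --master spark://localhost:7077 \
  --deploy-mode client \
  --executor-memory 2G \
  target/my-app.jar arg1 arg2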
Tip
runMain is an internal method to build the execution environment and invoke the main method
of the Spark application.
It optionally prints out the input parameters when the verbose input flag is enabled (i.e. true ).
Note
It adds the local jars specified in childClasspath input parameter to the context classloader
(that is later responsible for loading the childMainClass main class).
Note
Tip
Read System Properties about how the process of collecting system properties
works.
Tip
with.
You should avoid using scala.App trait for main classes in Scala as reported in
SPARK-4170 Closure problems when running Scala app that "extends App"
If you use scala.App for the main class, you should see the following WARN message in
the logs:
WARN Subclasses of scala.App may not work correctly. Use a main() method instead.
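A minimal main class that avoids the scala.App issue could be sketched as follows (names are made up):
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MyApp"))
    try {
      // application logic goes here
    } finally {
      sc.stop()
    }
  }
}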
Finally, it executes the main method of the Spark application passing in the childArgs
arguments.
Any SparkUserAppException exceptions lead to System.exit while the others are simply rethrown.
Adding Local Jars to ClassLoader (addJarToClasspath method)
addJarToClasspath(localJar: String, loader: MutableURLClassLoader)
addJarToClasspath is an internal method to add file or local jars (as localJar ) to the
loader classloader.
Internally, addJarToClasspath resolves the URI of localJar . If the URI is file or local
and the file denoted by localJar exists, localJar is added to loader . Otherwise, the
following warning is printed out to the logs:
Warning: Local jar /path/to/fake.jar does not exist, skipping.
For all other URIs, the following warning is printed out to the logs:
Warning: Skip remote jar hdfs://fake.jar.
Note
Caution
FIXME What is a URI fragment? How does this change re YARN distributed
cache? See Utils#resolveURI .
Command-line Options
Execute spark-submit --help to know about the command-line options supported.
spark git:(master) ./bin/spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
--class
--conf or -c
--deploy-mode (see Deploy Mode)
--driver-class-path
--driver-cores (see Driver Cores in Cluster Deploy Mode)
--driver-java-options
--driver-library-path
--driver-memory
--executor-memory
--files
--jars
--kill for Standalone cluster mode only
--master
--name
--packages
--exclude-packages
--properties-file
--proxy-user
--py-files
--repositories
--status for Standalone cluster mode only
--total-executor-cores
YARN-only options:
--archives
--executor-cores
--keytab
--num-executors
--principal
--queue (see Specifying YARN Resource Queue (--queue switch))
It also prints out propertiesFile and the properties from the file.
FIXME
Environment Variables
The following is the list of environment variables that are considered when command-line
options are not specified:
MASTER for --master
SPARK_DRIVER_MEMORY for --driver-memory
SPARK_EXECUTOR_MEMORY (see Environment Variables in the SparkContext document)
SPARK_EXECUTOR_CORES
DEPLOY_MODE
SPARK_YARN_APP_NAME
_SPARK_CMD_USAGE
Note
Tip
When executed, spark-submit script simply passes the call to spark-class with
org.apache.spark.deploy.SparkSubmit class followed by command-line arguments.
It creates an instance of SparkSubmitArguments.
If in verbose mode, it prints out the application arguments.
It then relays the execution to action-specific internal methods (with the application
arguments):
submit (the default, when no other action is explicitly requested)
kill (when --kill switch is used)
requestStatus (when --status switch is used)
Note
The action can only have one of the three available values: SUBMIT , KILL , or
REQUEST_STATUS .
SparkSubmitArguments - spark-submit Command-Line Argument Parser
SparkSubmitArguments is a private[deploy] class to handle the command-line arguments of
spark-submit script that the actions use for their execution (possibly with the explicit env
environment).
SparkSubmitArguments(
args: Seq[String],
env: Map[String, String] = sys.env)
Note
export JAVA_HOME=/your/directory/java
export HADOOP_HOME=/usr/lib/hadoop
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1G
spark-class script
bin/spark-class shell script is the script launcher for internal Spark classes.
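For example, a standalone Master can be started directly through it:
$ ./bin/spark-class org.apache.spark.deploy.master.Master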
Note
Tip
Note
section in this document). And then spark-class searches for the so-called
Spark assembly jar ( spark-assembly.hadoop..jar ) in SPARK_HOME/lib or
SPARK_HOME/assembly/target/scala-$SPARK_SCALA_VERSION for a binary distribution
or Spark built from sources, respectively.
Set SPARK_PREPEND_CLASSES to have the Spark launcher classes (from
$SPARK_HOME/launcher/target/scala-$SPARK_SCALA_VERSION/classes ) to appear
before the Spark assembly jar. It's useful for development so your changes
don't require rebuilding Spark from the beginning.
org.apache.spark.launcher.Main
org.apache.spark.launcher.Main is the command-line launcher used in Spark scripts, like
spark-class .
the command.
SPARK_DAEMON_MEMORY (default: 1g ) for -Xms and -Xmx .
Spark Architecture
Spark uses a master/worker architecture. There is a driver that talks to a single
coordinator called master that manages workers in which executors run.
Driver
A Spark driver (aka an application's driver process) is the separate Java process
(running on its own JVM) that manages a SparkContext in a Spark application.
It can be your Spark application that executes the main method in which the SparkContext
object is created ( client deploy mode), but can also be a process in a cluster (if executed
in cluster deploy mode).
It is the cockpit of jobs and tasks execution (using DAGScheduler and Task Scheduler). It
hosts Web UI for the environment.
Note
Driver requires the additional services (beside the common ones like ShuffleManager,
MemoryManager, BlockTransferService, BroadcastManager, CacheManager):
Listener Bus
driverActorSystemName
RPC Environment (for Netty and Akka)
MapOutputTrackerMaster with the name MapOutputTracker
spark-BlockManagerMaster.adoc[BlockManagerMaster] with the name
BlockManagerMaster
HttpFileServer
MetricsSystem with the name driver
OutputCommitCoordinator with the endpoints name OutputCommitCoordinator
Caution
Settings
spark.driver.extraClassPath
spark.driver.extraClassPath is an optional setting that is used to... FIXME
spark.driver.cores
spark.driver.cores (default: 1 ) sets the number of CPU cores assigned to the driver in
cluster deploy mode.
It can be set using spark-submit's --driver-cores command-line option for Spark on cluster.
Note
When Client is created (for Spark on YARN in cluster mode only), it sets the
number of cores for the ApplicationMaster using spark.driver.cores .
spark.driver.memory
spark.driver.memory (default: 1g ) sets the driver's memory size (in MiBs).
Master
A master is a running Spark instance that connects to a cluster manager for resources.
The master acquires cluster nodes to run executors.
Caution
Workers
Workers (aka slaves) are running Spark instances where executors live to execute tasks.
They are the compute nodes in Spark.
Caution
Caution
Explain task execution in Spark and understand Spark's underlying execution model.
New vocabulary often faced in Spark UI
When you create SparkContext, each worker starts an executor. This is a separate process
(JVM), and it loads your jar, too. The executors connect back to your driver program. Now
the driver can send them commands, like flatMap , map and reduceByKey . When the
driver quits, the executors shut down.
A new process is not started for each step. A new process is started on each worker when
the SparkContext is constructed.
The executor deserializes the command (this is possible because it has loaded your jar),
and executes it on a partition.
Shortly speaking, an application in Spark is executed in three steps:
1. Create RDD graph, i.e. DAG (directed acyclic graph) of RDDs to represent entire
computation.
2. Create stage graph, i.e. a DAG of stages that is a logical execution plan based on the
RDD graph. Stages are created by breaking the RDD graph at shuffle boundaries.
3. Based on the plan, schedule and execute tasks on workers.
In the WordCount example, the RDD graph is as follows:
file -> lines -> words -> per-word count -> global word count -> output
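A WordCount sketch that produces such a graph (input and output paths are hypothetical):
val counts = sc.textFile("hdfs:///path/to/input.txt") // file -> lines
  .flatMap(_.split("\\s+"))                           // lines -> words
  .map((_, 1))                                        // words -> per-word count
  .reduceByKey(_ + _)                                 // global word count
counts.saveAsTextFile("hdfs:///path/to/output")       // output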
Based on this graph, two stages are created. The stage creation rule is based on the idea of
pipelining as many narrow transformations as possible. RDD operations with "narrow"
dependencies, like map() and filter() , are pipelined together into one set of tasks in
each stage.
In the end, every stage will only have shuffle dependencies on other stages, and may
compute multiple operations inside it.
In the WordCount example, the narrow transformation finishes at per-word count. Therefore,
you get two stages:
file -> lines -> words -> per-word count
global word count -> output
Once stages are defined, Spark will generate tasks from stages. The first stage will create a
series of ShuffleMapTask and the last stage will create ResultTasks because in the last
stage, one action operation is included to produce results.
The number of tasks to be generated depends on how your files are distributed. Suppose
that you have three different files in three different nodes, the first stage will generate 3
tasks: one task per partition.
Therefore, you should not map your steps to tasks directly. A task belongs to a stage, and is
related to a partition.
The number of tasks being generated in each stage will be equal to the number of partitions.
Cleanup
Caution
FIXME
Settings
spark.worker.cleanup.enabled (default: false ) Cleanup enabled.
Executors
Executors are distributed agents that execute tasks.
They typically (i.e. not always) run for the entire lifetime of a Spark application. Executors
send active task metrics to a driver and inform executor backends about task status updates
(task results including).
Note
Executors provide in-memory storage for RDDs that are cached in Spark applications (via
Block Manager).
When executors are started they register themselves with the driver and communicate
directly to execute tasks.
Executor offers are described by executor id and the host on which an executor runs (see
Resource Offers in this document).
Executors can run multiple tasks over their lifetime, both in parallel and sequentially. They
track running tasks (by their task ids in runningTasks internal map). Consult Launching
Tasks section.
Executors use a thread pool for launching tasks and sending metrics.
It is recommended to have as many executors as data nodes and as many cores as you can
get from the cluster.
Executors are described by their id, hostname, environment (as SparkEnv ), and
classpath (and, less importantly, and more for internal optimization, whether they run in
local or cluster mode).
Enable INFO or DEBUG logging level for org.apache.spark.executor.Executor
logger to see what happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.executor.Executor=INFO
Refer to Logging.
and whether it runs in local or non-local mode (as isLocal that is non-local by default).
Note
While an executor is being created you should see the following INFO messages in the logs:
INFO Executor: Starting executor ID [executorId] on host [executorHostname]
INFO Executor: Using REPL class URI: http://[executorHostname]:56131
It creates an RPC endpoint for sending heartbeats to the driver (using the internal
startDriverHeartbeater method).
The BlockManager is initialized (only when in non-local/cluster mode).
Note
A worker requires the following additional services (besides the common ones):
executorActorSystemName
RPC Environment (for Akka only)
MapOutputTrackerWorker
MetricsSystem with the name executor
Note
Caution
launchTask creates a TaskRunner object, registers it in the internal runningTasks map (by
taskId ), and executes it on Executor task launch worker Thread Pool.
Note
A blocking Heartbeat message that holds the executor id, all accumulator updates (per task
id), and BlockManagerId is sent to HeartbeatReceiver RPC endpoint (with
spark.executor.heartbeatInterval timeout).
Caution
If the response requests to reregister BlockManager, you should see the following INFO
message in the logs:
INFO Executor: Told to re-register on heartbeat
If there are any issues with communicating with the driver, you should see the following
WARN message in the logs:
WARN Executor: Issue communicating with driver in heartbeater
The internal heartbeatFailures is incremented and checked to be less than the acceptable
number of failures. If the number is greater, the following ERROR is printed out to the logs:
ERROR Executor: Exit as unable to send heartbeats to driver more than [HEARTBEAT_MAX_F
AILURES] times
Coarse-Grained Executors
Coarse-grained executors are executors that use CoarseGrainedExecutorBackend for task
scheduling.
FetchFailedException
Caution
FIXME
TaskRunner catches it and informs ExecutorBackend about the case (using statusUpdate
with TaskState.FAILED task state).
Caution
Resource Offers
Read resourceOffers in TaskSchedulerImpl and resourceOffer in TaskSetManager.
You can change the assigned memory per executor per node in standalone cluster using
SPARK_EXECUTOR_MEMORY environment variable.
You can find the value displayed as Memory per Node in web UI for standalone Master (as
depicted in the figure below).
Metrics
Executors use Metrics System (via ExecutorSource ) to report metrics about internal status.
Note
Metrics are only available for cluster modes, i.e. local mode turns metrics off.
Internal Registries
runningTasks is FIXME
heartbeatFailures is FIXME
Settings
spark.executor.cores
spark.executor.cores - the number of cores for an executor
spark.executor.extraClassPath
spark.executor.extraClassPath is a list of URLs representing a user's CLASSPATH.
spark.executor.extraJavaOptions
spark.executor.extraJavaOptions - extra Java options for executors.
spark.executor.extraLibraryPath
spark.executor.extraLibraryPath - a list of additional library paths separated by a system-dependent path separator.
spark.executor.userClassPathFirst
spark.executor.userClassPathFirst (default: false ) controls whether to load classes in user-defined jars before those in Spark jars.
spark.executor.heartbeatInterval
spark.executor.heartbeatInterval (default: 10s ) - the interval after which an executor
reports heartbeat and metrics for active tasks to the driver. Refer to Sending heartbeats and
partial metrics for active tasks.
spark.executor.heartbeat.maxFailures
spark.executor.heartbeat.maxFailures (default: 60 ) controls how many times an executor
will try to send heartbeats to the driver before it gives up and exits (with exit code 56 ).
Note
It was introduced in SPARK-13522 Executor should kill itself when it's unable to
heartbeat to the driver more than N times
spark.executor.id
spark.executor.id
spark.executor.instances
spark.executor.instances sets the number of executors to use.
spark.executor.memory
Others
spark.executor.logs.rolling.maxSize
spark.executor.logs.rolling.maxRetainedFiles
spark.executor.logs.rolling.strategy
spark.executor.logs.rolling.time.interval
spark.executor.port
spark.executor.uri - equivalent to SPARK_EXECUTOR_URI
spark.repl.class.uri (default: null ) used when in spark-shell to create REPL
ClassLoader to load new classes defined in the Scala REPL as a user types code.
Enable INFO logging level for org.apache.spark.executor.Executor logger to have the
value printed out to the logs:
INFO Using REPL class URI: [classUri]
spark.akka.frameSize (default: 128 MB, maximum: 2047 MB) - the configured max
frame size for Akka messages. If a task result is bigger, executors use block manager to
send results back.
spark.driver.maxResultSize (default: 1g )
Caution
TaskRunner
TaskRunner is a thread of execution that manages a single individual task. It can be run or
killed, which boils down to running or killing the task the TaskRunner object manages.
Enable INFO or DEBUG logging level for org.apache.spark.executor.Executor
logger to see what happens inside TaskRunner (since TaskRunner is an internal
class of Executor ).
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.executor.Executor=DEBUG
Refer to Logging.
Lifecycle
Caution
Caution
The target task to run is not deserialized yet, but only its environment - the files,
jars, and properties.
FIXME Describe Task.deserializeWithDependencies .
Caution
This is the moment when the proper Task object is deserialized (from taskBytes ) using the
earlier-created closure Serializer object. The local properties (as localProperties ) are
initialized to be the task's properties (from the earlier call to
Task.deserializeWithDependencies ) and the TaskMemoryManager (created earlier in the
The task's properties were part of the serialized object passed on to the current
TaskRunner object.
Note
If kill method has been called in the meantime, the execution stops by throwing a
TaskKilledException . Otherwise, TaskRunner continues executing the task.
The task runs (with taskId , attemptNumber , and the globally-configured MetricsSystem ). It
runs inside a "monitored" block (i.e. try-finally block) to clean up after the task's run
finishes regardless of the final outcome - the task's value or an exception thrown.
After the task's run finishes (and regardless of an exception thrown or not), run always
calls BlockManager.releaseAllLocksForTask (with the current task's taskId ).
run then always queries TaskMemoryManager for memory leaks. If there is any (i.e. the
Note
Note
Caution
When a task finishes successfully, it returns a value. The value is serialized (using a new
instance of Serializer from SparkEnv, i.e. serializer ).
Note
The time to serialize the task's value is tracked (using beforeSerialization and
afterSerialization ).
The task's metrics are set, i.e. executorDeserializeTime , executorRunTime , jvmGCTime , and
resultSerializationTime .
Caution
FIXME Describe the metrics in more details. And include a figure to show the
metric points.
A DirectTaskResult object with the serialized result and the latest values of accumulators is
created (as directResult ). The DirectTaskResult object is serialized (using the global
closure Serializer).
The limit of the buffer for the serialized DirectTaskResult object is calculated (as
resultSize ).
Caution
$ ./bin/spark-shell -c spark.driver.maxResultSize=1m
scala> sc.version
res0: String = 2.0.0-SNAPSHOT
scala> sc.getConf.get("spark.driver.maxResultSize")
res1: String = 1m
scala> sc.range(0, 1024 * 1024 + 10, 1).collect
WARN Executor: Finished task 4.0 in stage 0.0 (TID 4). Result is larger than maxResult
Size (1031.4 KB > 1024.0 KB), dropping it.
...
ERROR TaskSetManager: Total size of serialized results of 1 tasks (1031.4 KB) is bigge
r than spark.driver.maxResultSize (1024.0 KB)
...
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of seria
lized results of 1 tasks (1031.4 KB) is bigger than spark.driver.maxResultSize (1024.0
KB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$
failJobAndIndependentStages(DAGScheduler.scala:1448)
...
Note
The difference between the two cases is that the result is dropped or sent via
BlockManager.
When the two cases above do not hold, the following INFO message is printed out to the
logs:
INFO Executor: Finished [taskName] (TID [taskId]). [resultSize] bytes result sent to d
river
Note
The serializedResult serialized result for the task is sent to the driver using
ExecutorBackend as TaskState.FINISHED .
Caution
When the TaskRunner finishes, taskId is removed from the internal runningTasks map of
the owning Executor (that ultimately cleans up any references to the TaskRunner ).
Note
kill marks the current instance of TaskRunner as killed and passes the call to kill a task
Internally, kill enables the internal flag killed and executes its Task.kill method if a task
is available.
Note
The internal flag killed is checked in run to stop executing the task. Calling
Task.kill method allows for task interruptions later on.
Settings
spark.unsafe.exceptionOnMemoryLeak (default: false )
Spark Services
MemoryManager - Memory Management
MemoryManager is an abstract base memory manager to manage shared memory for execution and storage in Spark.
Note
MemoryManager Contract
Every MemoryManager obeys the following contract:
maxOnHeapStorageMemory
acquireStorageMemory
acquireStorageMemory
acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean
acquireStorageMemory
Caution
FIXME
maxOnHeapStorageMemory
maxOnHeapStorageMemory: Long
maxOnHeapStorageMemory is the total amount of memory available for storage, in bytes. It can
vary over time, depending on the MemoryManager implementation.
releaseAllExecutionMemoryForTask
tungstenMemoryMode
tungstenMemoryMode informs others whether Spark works in OFF_HEAP or ON_HEAP memory
mode.
It uses spark.memory.offHeap.enabled (default: false ), spark.memory.offHeap.size (default:
0 ), and org.apache.spark.unsafe.Platform.unaligned before OFF_HEAP is assumed.
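A sketch of switching to OFF_HEAP mode through SparkConf (the size is an arbitrary example, in bytes):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "268435456") // 256 MB of off-heap memory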
Caution
UnifiedMemoryManager
Caution
FIXME
Note
acquireStorageMemory method
Note
Caution
It makes sure that the requested number of bytes numBytes (for a block to store) fits the
available memory. If it is not the case, you should see the following INFO message in the
logs and the method returns false .
INFO Will not store [blockId] as the required space ([numBytes] bytes) exceeds our mem
ory limit ([maxMemory] bytes)
If the requested number of bytes numBytes is greater than memoryFree in the storage pool,
acquireStorageMemory will attempt to use the free memory from the execution pool.
Note
The storage pool can use the free memory from the execution pool.
It will take as much memory as required to fit numBytes from memoryFree in the execution
pool (up to the whole free memory in the pool).
Ultimately, acquireStorageMemory requests the storage pool for numBytes for blockId .
acquireUnrollMemory method
Note
maxOnHeapStorageMemory method
Note
Caution
Refer to Logging.
SparkEnv
SparkEnv holds all runtime objects for a running Spark instance, using
SparkEnv.createDriverEnv() for a driver and SparkEnv.createExecutorEnv() for an executor.
You can access the Spark environment using SparkEnv.get .
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkEnv.get
res0: org.apache.spark.SparkEnv = org.apache.spark.SparkEnv@2220c5f7
create is an internal helper method to create a "base" SparkEnv regardless of the target
create registers the BlockManagerMaster RPC endpoint for the driver and
looks it up for executors.
If called from the driver, you should see the following INFO message in the logs:
INFO SparkEnv: Registering [name]
It then passes the call straight on to the create helper method (with driver executor id,
isDriver enabled, and the input parameters).
Note
serializer
Caution
FIXME
closureSerializer
Caution
FIXME
Settings
spark.driver.host
spark.driver.host is the name of the machine where the driver runs. It is set when
SparkContext is created.
spark.driver.port
spark.driver.port is the port the driver listens to. It is first set to 0 in the driver when
SparkContext is initialized. It is later set to the port of RpcEnv of the driver (in
SparkEnv.create).
spark.serializer
spark.serializer (default: org.apache.spark.serializer.JavaSerializer ) - the Serializer.
spark.closure.serializer
spark.closure.serializer (default: org.apache.spark.serializer.JavaSerializer ) is the
Serializer.
spark.shuffle.manager
spark.shuffle.manager (default: sort ) - one of the three available implementations of
spark.memory.useLegacyMode (default: false ).
DAGScheduler
Note
Introduction
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented
scheduling, i.e. after an RDD action has been called it becomes a job that is then
transformed into a set of stages that are submitted as TaskSets for execution (see Execution
Model).
reads and executes sequentially. See the section Internal Event Loop - dag-scheduler-eventloop.
DAGScheduler runs stages in topological order.
Refer to Logging.
DAGScheduler needs SparkContext, Task Scheduler, LiveListenerBus, MapOutputTracker
and Block Manager to work. However, at the very minimum, DAGScheduler needs
SparkContext only (and asks SparkContext for the other services).
DAGScheduler reports metrics about its execution (refer to the section Metrics).
When DAGScheduler schedules a job as a result of executing an action on an RDD or calling
SparkContext.runJob() method directly, it spawns parallel tasks to compute (partial) results
per partition.
FIXME
Internal Registries
DAGScheduler maintains the following information in internal registries:
nextJobId for the next job id
numTotalJobs (alias of nextJobId ) for the total number of submitted jobs
nextStageId for the next stage id
jobIdToStageIds for a mapping between jobs and their stages
stageIdToStage for a mapping between stage ids to stages
shuffleToMapStage for a mapping between ids to ShuffleMapStages
jobIdToActiveJob for a mapping between job ids to ActiveJobs
waitingStages for stages with parents to be computed
runningStages for stages currently being run
failedStages for stages that failed due to fetch failures (as reported by
cacheLocs for a mapping between RDD ids and arrays indexed by partition numbers. Each
array value is the set of locations where that RDD partition is cached. See Cache Tracking.
failedEpoch is a mapping between failed executors and the epoch number when the
cleanupStateForJobAndIndependentStages
DAGScheduler.resubmitFailedStages
resubmitFailedStages() is called to go over the failedStages collection (of failed stages)
and resubmit them: cacheLocs and failedStages are cleared, and the failed stages are
submitted one by one (using submitStage).
DAGScheduler.runJob
When executed, DAGScheduler.runJob is given the following arguments:
An RDD to run the job on.
A function to run on each partition of the RDD.
A set of partitions to run on (not all partitions are always required to compute a job for
actions like first() or take() ).
A callback function resultHandler to pass results of executing the function to.
Properties to attach to a job.
It calls DAGScheduler.submitJob and then waits until a result comes using a JobWaiter
object. A job can succeed or fail.
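A minimal sketch of running a job directly (summing each partition of a made-up RDD):
val rdd = sc.parallelize(1 to 100, 4)
// one task per partition is scheduled through DAGScheduler; results come back per partition
val partialSums = sc.runJob(rdd, (it: Iterator[Int]) => it.sum)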
DAGScheduler.submitJob
DAGScheduler.submitJob is called by SparkContext.submitJob and DAGScheduler.runJob.
Figure 3. DAGScheduler.submitJob
You may see an exception thrown when the partitions in the set are outside the range:
Attempting to access a non-existent partition: [p]. Total number of partitions: [maxPa
rtitions]
A job listener is notified each time a task succeeds (by def taskSucceeded(index: Int,
result: Any) ), as well as if the whole job fails (by def jobFailed(exception: Exception) ).
JobWaiter
A JobWaiter is an extension of JobListener. It is used as the return value of
DAGScheduler.submitJob and DAGScheduler.submitMapStage . You can use a JobWaiter to
block until the job finishes executing or to cancel it.
While the methods execute, JobSubmitted and MapStageSubmitted events are posted that
reference the JobWaiter.
DAGScheduler.executorAdded
executorAdded(execId: String, host: String) method simply posts an ExecutorAdded event
to eventProcessLoop .
DAGScheduler.taskEnded
taskEnded(
task: Task[_],
reason: TaskEndReason,
result: Any,
accumUpdates: Map[Long, Any],
taskInfo: TaskInfo,
taskMetrics: TaskMetrics): Unit
taskEnded simply posts a CompletionEvent event to the DAGScheduler's internal event loop.
Note
Tip
failJobAndIndependentStages
The internal failJobAndIndependentStages method... FIXME
Note
It is called by... FIXME
DAGSchedulerEventProcessLoop is the event process loop to which Spark (by
DAGScheduler.submitJob) posts jobs to schedule their execution. Later on, TaskSetManager
talks back to DAGScheduler to inform about the status of the tasks using the same
"communication channel".
It allows Spark to release the current thread when posting happens and let the event loop
handle events on a separate thread - asynchronously.
IMAGEFIXME
Internally, DAGSchedulerEventProcessLoop uses java.util.concurrent.LinkedBlockingDeque
blocking deque that grows indefinitely (i.e. up to Integer.MAX_VALUE events).
The name of the single "logic" thread that reads events and takes decisions is dag-scheduler-event-loop.
"dag-scheduler-event-loop" #89 daemon prio=5 os_prio=31 tid=0x00007f809bc0a000 nid=0xc
903 waiting on condition [0x0000000125826000]
The following are the current types of DAGSchedulerEvent events that are handled by
DAGScheduler :
StageCancelled
JobCancelled
JobGroupCancelled
AllJobsCancelled
BeginEvent - posted when TaskSetManager reports that a task is starting.
dagScheduler.handleBeginEvent is executed in turn.
GettingResultEvent - posted when TaskSetManager reports that a task has completed
ResubmitFailedStages
FIXME
Caution
Note
FIXME
It checks failedEpoch for the executor id (using execId ) and if it is found the following INFO
message appears in the logs:
INFO Host added was in lost list earlier: [host]
Note
Figure 4. DAGScheduler.handleExecutorLost
Recurring ExecutorLost events merely lead to the following DEBUG message in the logs:
DEBUG Additional executor lost message for [execId] (epoch [currentEpoch])
If however the executor is not in the list of lost executors or the failed epoch number is
smaller than the current one, the executor is added to failedEpoch.
The following INFO message appears in the logs:
INFO Executor lost: [execId] (epoch [currentEpoch])
MapOutputTrackerMaster.registerMapOutputs(shuffleId,
stage.outputLocInMapOutputTrackerFormat(), changeEpoch = true)
When there are no ShuffleMapStages (in shuffleToMapStage ), MapOutputTrackerMaster.incrementEpoch is called.
cacheLocs is cleared.
At the end, DAGScheduler.submitWaitingStages() is called.
and for each job associated with the stage, it calls handleJobCancellation(jobId, s"because
Stage [stageId] was cancelled") .
Note
A stage knows what jobs it is part of using the internal set jobIds .
def handleJobCancellation(jobId: Int, reason: String = "") checks whether the job exists. If the job exists, the job and all the stages that are only used by it are failed (using the internal failJobAndIndependentStages method).
For each running stage associated with the job ( jobIdToStageIds ), if there is only one job
for the stage ( stageIdToStage ), TaskScheduler.cancelTasks is called,
outputCommitCoordinator.stageEnd(stage.id) , and SparkListenerStageCompleted is posted.
job on cancel.
If no stage exists for stageId , the following INFO message shows in the logs:
INFO No active jobs to kill for Stage [stageId]
Figure 6. DAGScheduler.handleJobSubmitted
handleJobSubmitted has access to the final RDD, the partitions to compute, and the
Then, the finalStage stage is given the ActiveJob instance and some housekeeping is
performed to track the job (using jobIdToActiveJob and activeJobs ).
SparkListenerJobStart message is posted to LiveListenerBus.
Caution
When DAGScheduler executes a job it first submits the final stage (using submitStage).
The task knows about the stage it belongs to (using Task.stageId ), the partition it works on
(using Task.partitionId ), and the stage attempt (using Task.stageAttemptId ).
OutputCommitCoordinator.taskCompleted is called.
If the stage the task belongs to has been cancelled, stageIdToStage should not contain it,
and the method quits.
The main processing begins now depending on TaskEndReason - the reason for task
completion (using event.reason ). The method skips processing TaskEndReasons :
TaskCommitDenied , ExceptionFailure , TaskResultLost , ExecutorLostFailure , TaskKilled ,
TaskEndReason: Success
SparkListenerTaskEnd is posted to LiveListenerBus.
The partition the task worked on is removed from pendingPartitions of the stage.
INFO Ignoring result from [task] because its job has finished
Otherwise, check whether the task is marked as running for the job (using job.finished )
and proceed. The method skips execution when the task has already been marked as
completed in the job.
Caution
FIXME When could a task that has just finished be ignored, i.e. the job has already been marked finished? Could it be for stragglers?
updateAccumulators(event) is called.
The partition is marked as finished (using job.finished ) and the number of computed partitions is increased (using job.numFinished ).
If the whole job has finished (when job.numFinished == job.numPartitions ), then:
markStageAsFinished is called
cleanupStateForJobAndIndependentStages(job)
Caution
ShuffleMapTask
For ShuffleMapTask, the stage is ShuffleMapStage.
updateAccumulators(event) is called.
event.result is MapStatus that knows the executor id where the task has finished (using
status.location.executorId ).
If failedEpoch contains the executor and the epoch of the ShuffleMapTask is not greater than
that in failedEpoch, you should see the following INFO message in the logs:
INFO Ignoring possibly bogus [task] completion from executor [executorId]
submitStage(shuffleStage) is called.
Caution
TaskEndReason: Resubmitted
For Resubmitted case, you should see the following INFO message in the logs:
The task (by task.partitionId ) is added to the collection of pending partitions of the stage
(using stage.pendingPartitions ).
Tip
A stage knows how many partitions are yet to be calculated. A task knows about
the partition id for which it was launched.
TaskEndReason: FetchFailed
FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) comes with
BlockManagerId (as bmAddress ) and the other self-explanatory values.
Note
When FetchFailed happens, stageIdToStage is used to access the failed stage (using
task.stageId and the task is available in event in handleTaskCompletion(event:
CompletionEvent) ). shuffleToMapStage is used to access the map stage (using shuffleId ).
Caution
Caution
If the failed stage is not in runningStages , the following DEBUG message shows in the logs:
DEBUG Received fetch failure from [task], but its from [failedStage] which is no longer running
Caution
If the number of fetch failed attempts for the stage exceeds the allowed number (using
Stage.failedOnFetchAndShouldAbort), the following method is called:
abortStage(failedStage, s"$failedStage (${failedStage.name}) has failed the maximum allowable number of times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. Most recent failure reason: ${failureMessage}", None)
If there are no failed stages reported (failedStages is empty), the following INFO shows in
the logs:
INFO Resubmitting [mapStage] ([mapStage.name]) and [failedStage] ([failedStage.name])
due to fetch failure
Caution
For all the cases, the failed stage and map stages are both added to failedStages set.
If mapId (in the FetchFailed object for the case) is provided, the map stage output is
cleaned up (as it is broken) using mapStage.removeOutputLoc(mapId, bmAddress) and
MapOutputTrackerMaster.unregisterMapOutput(shuffleId, mapId, bmAddress) methods.
Caution
The method clears the internal waitingStages set with stages that wait for their parent
stages to finish.
It goes over the waiting stages sorted by job ids in increasing order and calls submitStage
method.
FIXME
For a stage with ActiveJob available, the following DEBUG message shows up in the logs:
DEBUG DAGScheduler: submitStage([stage])
Only when the stage is not in waiting ( waitingStages ), running ( runningStages ) or failed
states can this stage be processed.
A list of missing parent stages of the stage is calculated (see Calculating Missing Parent
Stages) and the following DEBUG message shows up in the logs:
DEBUG DAGScheduler: missing: [missing]
When the stage has no parent stages missing, it is submitted and the INFO message shows
up in the logs:
INFO DAGScheduler: Submitting [stage] ([stage.rdd]), which has no missing parents
small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for
any lost stage(s) that compute the missing tasks.
Please note that tasks from the old attempts of a stage could still be running.
A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.
The latest StageInfo for the most recent attempt for a stage is accessible through
latestInfo .
Cache Tracking
DAGScheduler tracks which RDDs are cached to avoid recomputing them and likewise
remembers which shuffle map stages have already produced output files to avoid redoing
the map side of a shuffle.
DAGScheduler is only interested in cache location coordinates, i.e. host and executor id, per
partition of an RDD.
Caution
If the storage level of an RDD is NONE, there is no caching and hence no partition cache
locations are available. In such cases, whenever asked, DAGScheduler returns a collection
with empty-location elements for each partition. The empty-location elements are to mark
uncached partitions.
Otherwise, a collection of RDDBlockId instances for each partition is created and BlockManagerMaster is asked for locations (using BlockManagerMaster.getLocations ). The result is then mapped to a collection of TaskLocation for host and executor id.
Preferred Locations
DAGScheduler computes where to run each task in a stage based on the preferred locations
of its underlying RDDs, or the location of cached or shuffle data.
When executed, it prints the following DEBUG message out to the logs:
DEBUG DAGScheduler: submitMissingTasks([stage])
pendingPartitions internal field of the stage is cleared (it is later filled out with the partitions to compute).
Caution
The mapping between task ids and task preferred locations is computed (see
getPreferredLocs - Computing Preferred Locations for Tasks and Partitions).
A new stage attempt is created (using Stage.makeNewStageAttempt ).
SparkListenerStageSubmitted is posted.
The stage is serialized and broadcast to workers using the SparkContext.broadcast method, i.e. Serializer.serialize is used to calculate taskBinaryBytes - an array of bytes of (rdd, func) for ResultStage and (rdd, shuffleDep) for ShuffleMapStage.
Caution
When serializing the stage fails, the stage is removed from the internal runningStages set,
abortStage is called and the method stops.
Caution
If there are tasks to launch (there are missing partitions in the stage), the following INFO and
DEBUG messages are in the logs:
INFO DAGScheduler: Submitting [tasks.size] missing tasks from [stage] ([stage.rdd])
DEBUG DAGScheduler: New pending partitions: [stage.pendingPartitions]
For ResultStage:
DEBUG DAGScheduler: Stage [stage] is actually done; (partitions: [numPartitions])
Note
Note
Stopping
When a DAGScheduler stops (via stop() ), it stops the internal dag-scheduler-message
thread pool, dag-scheduler-event-loop, and TaskScheduler.
Metrics
Spark's DAGScheduler uses the Spark Metrics System (via DAGSchedulerSource ) to report metrics about its internal status.
Caution
The private updateAccumulators method merges the partial values of accumulators from a
completed task into their "source" accumulators on the driver.
Note
It is called by handleTaskCompletion.
For each AccumulableInfo in the CompletionEvent , a partial value from a task is obtained (from AccumulableInfo.update ) and added to the driver's accumulator (using the Accumulable.++= method).
For named accumulators with the update value being a non-zero value, i.e. not
Accumulable.zero :
stage.latestInfo.accumulables for the AccumulableInfo.id is set
CompletionEvent.taskInfo.accumulables has a new AccumulableInfo added.
Caution
Settings
Jobs
A job (aka action job or active job) is a top-level work item (computation) submitted to
DAGScheduler to compute the result of an action.
Note that not all partitions always have to be computed for ResultStages of actions like first() and lookup() .
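For example, in spark-shell, first() normally runs a job on just the first partition (scanning more partitions only if that one is empty), while count() computes all of them:
val rdd = sc.parallelize(1 to 100, numSlices = 10)

// Runs a job over a single partition (see the number of tasks in the web UI or logs).
rdd.first

// Runs a job over all 10 partitions.
rdd.count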
A job can be one of two logical types (that are only distinguished by an internal finalStage
field of ActiveJob ):
Map-stage job that computes the map output files for a ShuffleMapStage (for
submitMapStage ) before any downstream stages are submitted.
It is also used for adaptive query planning, to look at map output statistics before
submitting later stages.
Result job that computes a ResultStage to execute an action.
Jobs track how many partitions have already been computed (using finished array of
Boolean elements).
Stages
Introduction
A stage is a physical unit of execution. It is a step in a physical execution plan.
A stage is a set of parallel tasks, one per partition of an RDD, that compute partial results of
a function executed as part of a Spark job.
Figure 2. Submitting a job triggers execution of the stage and its parent stages
Finally, every stage has a firstJobId that is the id of the job that submitted the stage.
There are two types of stages:
ShuffleMapStage is an intermediate stage (in the execution DAG) that produces data for
other stage(s). It writes map output files for a shuffle. It can also be the final stage in a
job in adaptive query planning.
ResultStage is the final stage that executes a Spark action in a user program by running
a function on an RDD.
When a job is submitted, a new stage is created with its parent ShuffleMapStages linked; they can be created from scratch or linked to, i.e. shared, if other jobs use them already.
At some point in a stage's life, every partition of the stage gets transformed into a task - ShuffleMapTask or ResultTask for ShuffleMapStage and ResultStage, respectively.
Partitions are computed in jobs, and result stages may not always need to compute all
partitions in their target RDD, e.g. for actions like first() and lookup() .
DAGScheduler prints the following INFO message when there are tasks to submit:
FIXME Why do stages have numTasks ? Where is this used? How does this
correspond to the number of partitions in a RDD?
Stage.findMissingPartitions
Stage.findMissingPartitions() calculates the ids of the missing partitions, i.e. partitions for
which the ActiveJob knows they are not finished (and so they are missing).
A ResultStage stage knows it by querying the active job about partition ids ( numPartitions )
that are not finished (using ActiveJob.finished array of booleans).
Stage.failedOnFetchAndShouldAbort
Stage.failedOnFetchAndShouldAbort(stageAttemptId: Int): Boolean checks whether the number of fetch failures across stage attempts exceeds the allowed number and the stage should therefore be aborted.
Note
When executed, ShuffleMapStages save map output files that can later be fetched by
reduce tasks.
Caution
The number of the partitions of an RDD is exactly the number of the tasks in a
ShuffleMapStage.
The output locations ( outputLocs ) of a ShuffleMapStage are the same as used by its
ShuffleDependency. Output locations can be missing, i.e. partitions have not been cached or
are lost.
ShuffleMapStages are registered to DAGScheduler that tracks the mapping of shuffles (by
their ids from SparkContext) to corresponding ShuffleMapStages that compute them, stored
in shuffleToMapStage .
A new ShuffleMapStage is created from an input ShuffleDependency and a job's id (in DAGScheduler#newOrUsedShuffleStage ).
cleanupStateForJobAndIndependentStages
handleExecutorLost
When there is no ShuffleMapStage for a shuffle id (of a ShuffleDependency), one is created
with the ancestor shuffle dependencies of the RDD (of a ShuffleDependency) that are
registered to MapOutputTrackerMaster.
FIXME Where is ShuffleMapStage used?
newShuffleMapStage - the proper way to create shuffle map stages (with the additional
setup steps)
MapStageSubmitted
getShuffleMapStage - see Stage sharing
FIXME
Caution
ShuffleMapStage Sharing
ShuffleMapStages can be shared across multiple jobs, if these jobs reuse the same RDDs.
When a ShuffleMapStage is submitted to DAGScheduler to execute, getShuffleMapStage is
called (as part of handleMapStageSubmitted while newResultStage - note the new part - for
handleJobSubmitted).
scala> val rdd = sc.parallelize(0 to 5).map((_,1)).sortByKey() (1)
scala> rdd.count (2)
scala> rdd.count (3)
1. Shuffle at sortByKey()
2. Submits a job with two stages (both of which are executed)
3. Intentionally repeats the last action; it submits a new job with the same two stages, but one of them is shared and already computed
TaskScheduler
A TaskScheduler schedules tasks for a single Spark application according to scheduling
mode.
TaskScheduler Contract
Every TaskScheduler follows the following contract:
It can be started.
It can be stopped.
It can do post-start initialization if needed for additional post-start initialization.
It submits TaskSets for execution.
It can cancel tasks for a stage.
It can set a custom DAGScheduler.
TaskScheduler's Lifecycle
A TaskScheduler is created while SparkContext is being created (by calling
SparkContext.createTaskScheduler for a given master URL and deploy mode).
Caution
Starting TaskScheduler
start(): Unit
Stopping TaskScheduler
stop(): Unit
post-start initialization.
Note
Note
Note
Note
defaultParallelism calculates the default level of parallelism to use in a cluster that is a hint
to sizing jobs.
Note
applicationId gives the current application's id. It is in the format spark-application-[System.currentTimeMillis] by default.
Note
Note
executorLost handles events about an executor executorId being lost for a given reason .
Note
Available Implementations
Spark comes with the following task schedulers:
TaskSchedulerImpl
YarnScheduler - the TaskScheduler for Spark on YARN in client deploy mode.
Tasks
In Spark, a task (aka command) is the smallest individual unit of execution that represents a
partition in a dataset and that an executor can execute on a single machine.
A task in Spark is represented by the Task abstract class with two concrete
implementations:
ShuffleMapTask that executes a task and divides the task's output into multiple buckets (based on the task's partitioner).
ResultTask that executes a task and sends the task's output back to the driver application.
The very last stage in a job consists of multiple ResultTasks , while earlier stages are a set
of ShuffleMapTasks.
Task Attributes
A Task instance is uniquely identified by the following task attributes:
stageId - there can be many stages in a job. Every stage has its own unique stageId
If the task has been killed before it runs, it is killed right away (with the interruptThread flag disabled).
The task runs.
Caution
Note
Task status updates are sent from executors to the driver through
ExecutorBackend.
Tip
It is used in TaskRunner to send a task's final results with the latest values of accumulators used.
kill marks the task to be killed, i.e. it sets the internal _killed flag to true .
If interruptThread is enabled and the internal taskThread is available, kill interrupts it.
Caution
ShuffleMapTask
A ShuffleMapTask divides the elements of an RDD into multiple buckets (based on a
partitioner specified in ShuffleDependency).
ResultTask
Caution
FIXME
taskMemoryManager attribute
taskMemoryManager is the TaskMemoryManager that manages the memory allocated by the
task.
TaskSets
Introduction
A TaskSet is a collection of tasks that belong to a single stage and a stage attempt. It also has priority and properties attributes. Priority is used in FIFO scheduling mode (see Priority Field and FIFO Scheduling) while properties are the properties of the first job in the stage.
Caution
A TaskSet contains a fully-independent sequence of tasks that can run right away based on
the data that is already on the cluster, e.g. map output files from previous stages, though it
may fail if this data becomes unavailable.
TaskSet can be submitted (consult TaskScheduler Contract).
removeRunningTask
Caution
A TaskSet has a priority field that becomes the value of the priority field of TaskSetManager (which is a Schedulable).
The priority field is used in FIFOSchedulingAlgorithm in which equal priorities give stages
an advantage (not to say priority).
Note
Effectively, the priority field is the job id of the first job this stage was part of (for FIFO scheduling).
Schedulable
Schedulable is a contract of schedulable entities.
Schedulable is a private[spark] Scala trait. You can find the sources in
org.apache.spark.scheduler.Schedulable.
Note
Schedulable Contract
Every Schedulable follows the following contract:
It has a name .
name: String
Note
schedulableQueue is java.util.concurrent.ConcurrentLinkedQueue.
Caution
getSortedTaskSetQueue
getSortedTaskSetQueue: ArrayBuffer[TaskSetManager]
schedulableQueue
schedulableQueue: ConcurrentLinkedQueue[Schedulable]
TaskSetManager
A TaskSetManager is a Schedulable that manages execution of the tasks in a single TaskSet (after it has been handed over by TaskScheduler).
TaskSetManager is Schedulable
TaskSetManager is a Schedulable with the following implementation:
name is TaskSet_[taskSet.stageId.toString]
schedule).
weight is always 1 .
minShare is always 0 .
runningTasks is the number of running tasks in the internal runningTasksSet .
priority is the priority of the owned TaskSet (using taskSet.priority ).
stageId is the stage id of the owned TaskSet (using taskSet.stageId ).
schedulableQueue returns no queue, i.e. null .
addSchedulable and removeSchedulable do nothing.
getSchedulableByName always returns null .
getSortedTaskSetQueue returns a one-element collection with the sole element being
itself.
executorLost
checkSpeculatableTasks
Note
server is used (that could serve the shuffle outputs in case of failure).
If it is indeed for a failed ShuffleMapStage and no external shuffle server is enabled, all
successfully-completed tasks for the failed executor (using taskInfos internal registry) get
added to the collection of pending tasks and the DAGScheduler is informed about
resubmission (as Resubmitted end reason).
The internal registries - successful , copiesRunning , and tasksSuccessful - are updated.
Regardless of the above check, all currently-running tasks for the failed executor are
reported as failed (with the task state being FAILED ).
recomputeLocality is called.
Note
checkSpeculatableTasks is called by
TaskSchedulerImpl.checkSpeculatableTasks.
It then checks whether the number is equal or greater than the number of tasks completed
successfully (using tasksSuccessful ).
Having done that, it computes the median duration of all the successfully completed tasks
(using taskInfos ) and task length threshold using the median duration multiplied by
spark.speculation.multiplier that has to be equal or less than 100 .
You should see the DEBUG message in the logs:
DEBUG Task length threshold for speculation: [threshold]
For each task (using taskInfos ) that is not marked as successful yet (using successful ), has only one copy running (using copiesRunning ), takes more time than the calculated threshold, and is not in speculatableTasks yet, the task is assumed speculatable.
You should see the following INFO message in the logs:
INFO Marking task [index] in stage [taskSet.id] (on [info.host]) as speculatable becau
se it ran more than [threshold] ms
The task gets added to the internal speculatableTasks collection. The method responds
positively.
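Speculation is disabled by default and driven by the settings mentioned above. A hedged configuration sketch (the property names are the standard ones; the values are only illustrative):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable the periodic speculation checks
  .set("spark.speculation.interval", "100ms")  // how often to check for speculatable tasks
  .set("spark.speculation.multiplier", "1.5")  // threshold = median task duration * multiplier
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must have finished first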
addPendingTask
Caution
FIXME
dequeueSpeculativeTask
Caution
FIXME
dequeueTask
Caution
FIXME
TaskSetManager.executorAdded
executorAdded simply calls recomputeLocality method.
TaskSetManager.recomputeLocality
Note
task localities.
Then, the method checks pendingTasksWithNoPrefs and if it's not empty, NO_PREF becomes an element of the levels collection.
If pendingTasksForRack is not empty, and the wait time for RACK_LOCAL is defined, and there
is an executor for which TaskSchedulerImpl.hasHostAliveOnRack is true , RACK_LOCAL is
added to the levels collection.
ANY is the last and always-added element in the levels collection.
Right before the method finishes, it prints out the following DEBUG to the logs:
DEBUG Valid locality levels for [taskSet]: [levels]
TaskSetManager.resourceOffer
Caution
resourceOffer(
execId: String,
host: String,
maxLocality: TaskLocality): Option[TaskDescription]
FIXME
It dequeues a pending task from the taskset by checking pending tasks per executor (using pendingTasksForExecutor ), per host (using pendingTasksForHost ), and with no locality preferences (using pendingTasksWithNoPrefs ).
If a serialized task is bigger than 100 kB (it is not a configurable value), a WARN message
is printed out to the logs (only once per taskset):
WARN TaskSetManager: Stage [task.stageId] contains a task of very large size ([serializedTask.limit / 1024] KB). The maximum recommended task size is 100 KB.
For example:
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2054 bytes)
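A common way to run into the large-task warning above is to capture a large object in a task closure, so that every serialized task carries a copy of it. Broadcasting the object instead keeps tasks small. A hedged spark-shell sketch:
// A map of a million entries is several megabytes once serialized.
val lookup = (1 to 1000000).map(i => i -> i.toString).toMap

// Captured in the closure: every task ships the whole map and may trigger the WARN above.
sc.parallelize(1 to 10).map(i => lookup.getOrElse(i, "")).count

// Broadcast once per executor instead: the serialized tasks stay tiny.
val lookupBc = sc.broadcast(lookup)
sc.parallelize(1 to 10).map(i => lookupBc.value.getOrElse(i, "")).count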
FIXME
Caution
TaskSetManager requests the current epoch from MapOutputTracker and sets it on all tasks
in the taskset.
You should see the following DEBUG in the logs:
DEBUG Epoch for [taskSet]: [epoch]
Caution
TaskSetManager keeps track of the tasks pending execution per executor, host, rack or with
no locality preferences.
Caution
Events
When a task has finished, the TaskSetManager calls DAGScheduler.taskEnded.
Caution
FIXME
TaskSetManager.handleSuccessfulTask
handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]) method marks the task (by
tid ) as successful and notifies the DAGScheduler that the task has ended.
It is called by whenFIXME
Caution
It removes the task from runningTasksSet . It also decreases the number of running tasks in
the parent pool if it is defined (using parent and Pool.decreaseRunningTasks ).
It notifies DAGScheduler that the task ended successfully (using DAGScheduler.taskEnded
with Success as TaskEndReason ).
If the task was not marked as successful already (using successful ), tasksSuccessful is
incremented and the following INFO message appears in the logs:
INFO Finished task [info.id] in stage [taskSet.id] (TID [info.taskId]) in [info.duration] ms on [info.host] ([tasksSuccessful]/[numTasks])
Note
It also marks the task as successful (using successful ). Finally, if the number of tasks
finished successfully is exactly the number of tasks the TaskSetManager manages, the
TaskSetManager turns zombie.
Otherwise, when the task was already marked as successful, the following INFO message
appears in the logs:
INFO Ignoring task-finished event for [info.id] in stage [taskSet.id] because task [index] has already completed successfully
failedExecutors.remove(index) is called.
Caution
At the end, the method checks whether the TaskSetManager is a zombie and no task is
running (using runningTasksSet ), and if so, it calls TaskSchedulerImpl.taskSetFinished.
TaskSetManager.handleFailedTask
handleFailedTask(tid: Long, state: TaskState, reason: TaskEndReason) method is called by
TaskSchedulerImpl or executorLost.
Caution
The method first checks whether the task has already been marked as failed (using
taskInfos) and if it has, it quits.
It removes the task from runningTasksSet and informs the parent pool to decrease its
running tasks.
It marks the TaskInfo as failed and grabs its index so the number of copies running of the
task is decremented (see copiesRunning).
Caution
The method calculates the failure exception to report per TaskEndReason . See below for the
possible cases of TaskEndReason.
Caution
FetchFailed
For FetchFailed , it logs WARNING:
WARNING Lost task [id] in stage [id] (TID [id], [host]): [reason.toErrorString]
Unless it has already been marked as successful (in successful), the task becomes so and
tasksSuccessful is incremented.
The TaskSetManager becomes zombie.
No exception is returned.
ExceptionFailure
ExecutorLostFailure
For ExecutorLostFailure if not exitCausedByApp , the following INFO appears in the logs:
INFO Task [tid] failed because while it was being computed, its executor exited for a
reason unrelated to the task. Not counting this failure towards the maximum number of
failures for the task.
Other TaskFailedReasons
For the other TaskFailedReasons, the WARNING appears in the logs:
WARNING Lost task [id] in stage [id] (TID [id], [host]): [reason.toErrorString]
Other TaskEndReason
For the other TaskEndReasons, the ERROR appears in the logs:
ERROR Unknown TaskEndReason: [e]
FIXME
Up to spark.task.maxFailures attempts
Zombie state
TaskSetManager enters zombie state when all tasks in a taskset have completed
successfully (regardless of the number of task attempts), or if the task set has been aborted
(see Aborting TaskSet).
While in zombie state, TaskSetManager can launch no new tasks and responds with no
TaskDescription to resourceOffers.
TaskSetManager remains in the zombie state until all tasks have finished running, i.e. to
continue to track and account for the running tasks.
Internal Registries
copiesRunning
successful
numFailures
failedExecutors contains a mapping from the indices of failed tasks to executor ids
pendingTasksForRack
pendingTasksWithNoPrefs
allPendingTasks
speculatableTasks
taskInfos is the mapping between task ids and their TaskInfo
recentExceptions
Settings
spark.scheduler.executorTaskBlacklistTime (default: 0L ) - time interval to pass after
which a task can be re-launched on the executor where it has once failed. It is to
prevent repeated task failures due to executor failures.
spark.speculation (default: false )
spark.speculation.quantile (default: 0.75 ) - the percentage of tasks that has to be complete before speculation is enabled for a given stage.
Schedulable Pool
Pool is a Schedulable entity that represents a tree of TaskSetManagers, i.e. it contains a collection of TaskSetManagers or the Pools thereof.
taskSetSchedulingAlgorithm Attribute
Using the scheduling mode (given when a Pool object is created), Pool selects
SchedulingAlgorithm and sets taskSetSchedulingAlgorithm :
FIFOSchedulingAlgorithm for FIFO scheduling mode.
FairSchedulingAlgorithm for FAIR scheduling mode.
It throws an IllegalArgumentException when unsupported scheduling mode is passed on:
Unsupported spark.scheduler.mode: [schedulingMode]
Tip
Note
addSchedulable
Note
schedulableNameToSchedulable.
More importantly, it sets the Schedulable entity's parent to itself.
Note
used in SparkContext.getPoolForName.
SchedulingAlgorithm
SchedulingAlgorithm is the interface for a sorting algorithm to sort Schedulables.
FIFOSchedulingAlgorithm
FIFOSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their
priority first and, when equal, by their stageId .
Note
Caution
FairSchedulingAlgorithm
FairSchedulingAlgorithm is a scheduling algorithm that compares Schedulables by their
minShare , runningTasks , and weight .
Note
Figure 1. FairSchedulingAlgorithm
For each input Schedulable , minShareRatio is computed as runningTasks divided by minShare (but at least 1 ) while taskToWeightRatio is runningTasks divided by weight .
Schedulable Builders
SchedulableBuilder is a contract of schedulable builders that operate on a pool of Schedulables (the rootPool). You can find the sources in org.apache.spark.scheduler.SchedulableBuilder.
SchedulableBuilder Contract
Every SchedulableBuilder provides the following services:
It manages a root pool.
It can build pools.
It can add a Schedulable with properties.
Note
rootPool.
Note
FIFOSchedulableBuilder - SchedulableBuilder for FIFO Scheduling Mode
FIFOSchedulableBuilder is a SchedulableBuilder that is a mere wrapper around a single Pool, the rootPool.
Note
FairSchedulableBuilder - SchedulableBuilder for FAIR Scheduling Mode
FairSchedulableBuilder is a SchedulableBuilder with the pools configured in an optional allocations configuration file.
Tip
Enable INFO logging level for org.apache.spark.scheduler.FairSchedulableBuilder logger to see what happens inside. Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.scheduler.FairSchedulableBuilder=INFO
Refer to Logging.
buildPools
buildPools builds the rootPool based on the allocations configuration file (given by the optional spark.scheduler.allocation.file setting).
addTaskSetManager looks up the default pool (using Pool.getSchedulableByName).
Note
Note
Note
If the pool name is not available, it is registered with the pool name, FIFO scheduling mode,
minimum share 0 , and weight 1 .
After the new pool was registered, you should see the following INFO message in the logs:
INFO FairSchedulableBuilder: Created pool [poolName], schedulingMode: FIFO, minShare: 0, weight: 1
The manager schedulable is registered to the pool (either the one that already existed or
was created just now).
You should see the following INFO message in the logs:
INFO FairSchedulableBuilder: Added task set [manager.name] to pool [poolName]
spark.scheduler.pool Property
SparkContext.setLocalProperty allows for setting properties per thread. This mechanism is
used by FairSchedulableBuilder to watch for spark.scheduler.pool property to group jobs
from a thread and submit them to a non-default pool.
val sc: SparkContext = ???
sc.setLocalProperty("spark.scheduler.pool", "myPool")
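To return the current thread to the default pool, clear the property by setting it to null:
sc.setLocalProperty("spark.scheduler.pool", null)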
Tip
<?xml version="1.0"?>
<allocations>
<pool name="production">
<schedulingMode>FAIR</schedulingMode>
<weight>1</weight>
<minShare>2</minShare>
</pool>
<pool name="test">
<schedulingMode>FIFO</schedulingMode>
<weight>2</weight>
<minShare>3</minShare>
</pool>
</allocations>
Tip
The top-level element's name, allocations , can be anything. Spark does not insist on allocations and accepts any name.
The default pool is registered with FIFO scheduling mode, minimum share 0 , and weight 1 .
You should see the following INFO message in the logs:
INFO FairSchedulableBuilder: Created default pool default, schedulingMode: FIFO, minShare: 0, weight: 1
For each pool element, it reads its name (from the name attribute) and assumes the default pool configuration to be FIFO scheduling mode, minimum share 0 , and weight 1 (unless overridden later).
Caution
If the schedulingMode element exists and is not empty for the pool, it becomes the current pool's scheduling mode. It is case sensitive, i.e. with all uppercase letters.
If the minShare element exists and is not empty for the pool, it becomes the current pool's minShare . It must be an integer number.
If the weight element exists and is not empty for the pool, it becomes the current pool's weight . It must be an integer number.
Settings
spark.scheduler.allocation.file
spark.scheduler.allocation.file is the file path of an optional scheduler configuration file
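A hedged example of wiring the setting together with FAIR scheduling mode when building a SparkConf (the file path is hypothetical):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // hypothetical path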
Scheduling Mode (spark.scheduler.mode)
Scheduling Mode (aka order task policy or scheduling policy or scheduling order) defines a
policy to sort tasks in order for execution.
The scheduling mode schedulingMode attribute is a part of the TaskScheduler Contract.
The only implementation of the TaskScheduler contract in Spark, TaskSchedulerImpl, uses the spark.scheduler.mode setting to configure schedulingMode that is merely used to set up the rootPool attribute (with FIFO being the default). It happens when TaskSchedulerImpl is initialized.
There are three acceptable scheduling modes:
FIFO with no pools but a single top-level unnamed pool with elements being
TaskSetManager objects; lower priority gets Schedulable sooner or earlier stage wins.
FAIR with a hierarchy of Schedulable (sub)pools with the rootPool at the top.
Out of three possible SchedulingMode policies only FIFO and FAIR modes are
supported by TaskSchedulerImpl.
After the root pool is initialized, the scheduling mode is no longer relevant (since
the Schedulable that represents the root pool is fully set up).
Note
Note
The root pool is later used when TaskSchedulerImpl submits tasks (as
TaskSets ) for execution.
FIXME Describe me
TaskSchedulerImpl
TaskSchedulerImpl is the default TaskScheduler in Spark. It tracks racks per host and port and can schedule tasks for multiple types of cluster managers by means of Scheduler Backends.
Using spark.scheduler.mode setting you can select the scheduling policy.
It submits tasks using SchedulableBuilders.
When a Spark application starts (and an instance of SparkContext is created)
TaskSchedulerImpl with a SchedulerBackend and DAGScheduler are created and soon
started.
Tip
Enable INFO or DEBUG logging level for org.apache.spark.scheduler.TaskSchedulerImpl logger to see what happens inside.
schedulableBuilder Attribute
schedulableBuilder is a SchedulableBuilder for the TaskSchedulerImpl .
It is set up when a TaskSchedulerImpl is initialized and can be one of two available builders:
getRackForHost is a method to know about the racks per host and port. By default, it assumes that racks are unknown (i.e. the method returns None ).
Note
TaskSchedulerImpl.removeExecutor toFIXME
TaskSetManager.addPendingTask, TaskSetManager.dequeueTask, and
TaskSetManager.dequeueSpeculativeTask
Creating TaskSchedulerImpl
Creating a TaskSchedulerImpl object requires a SparkContext object, the acceptable
number of task failures ( maxTaskFailures ) and optional isLocal flag (disabled by default, i.e.
false ).
Note
If the affected hosts and racks are the last entries in executorsByHost and hostsByRack , respectively, they are removed from the registries.
Unless reason is LossReasonPending , the executor is removed from executorIdToHost
registry and TaskSetManagers get notified.
Note
FIXME
initialize saves the reference to the current SchedulerBackend (as backend ) and sets
rootPool to be an empty-named Pool with already-initialized schedulingMode (while
Caution
Contract that waits until a scheduler backend is ready (using the internal blocking
waitBackendReady).
Note
and YarnClusterScheduler.postStartHook.
Shuts down the internal task-scheduler-speculation thread pool executor (used for
Speculative execution of tasks).
Stops SchedulerBackend.
Stops TaskResultGetter.
Cancels starvationTimer timer.
The job with speculatable tasks should finish while speculative tasks are running, and it will
leave these tasks running - no KILL command yet.
It uses checkSpeculatableTasks method that asks rootPool to check for speculatable tasks.
If there are any, SchedulerBackend is called for reviveOffers.
Caution
FIXME How does Spark handle repeated results of speculative tasks since
there are copies launched?
submitTasks creates a TaskSetManager for the input TaskSet and adds it to the
Schedulable root pool.
Note
The root pool can be a single flat linked queue (in FIFO scheduling mode) or a
hierarchy of pools of Schedulables (in FAIR scheduling mode).
It makes sure that the requested resources, i.e. CPU and memory, are assigned to the
Spark application for a non-local environment before requesting the current
SchedulerBackend to revive offers.
Figure 4. TaskSchedulerImpl.submitTasks
Note
When submitTasks is called, you should see the following INFO message in the logs:
INFO TaskSchedulerImpl: Adding task set [taskSet.id] with [tasks.length] tasks
It creates a new TaskSetManager for the input taskSet and the acceptable number of task
failures.
Note
Note
A TaskSet knows the tasks to execute (as tasks ) and stage id (as stageId )
the tasks belong to. Read TaskSets.
If there is more than one active TaskSetManager for the stage, an IllegalStateException is thrown with the message:
more than one active taskSet for stage [stage]: [TaskSet ids]
Note
Note
Every time the starvation timer thread is executed and hasLaunchedTask flag is false , the
following WARN message is printed out to the logs:
WARN Initial job has not accepted any resources; check your cluster UI to ensure that
workers are registered and have sufficient resources
Otherwise, when the hasLaunchedTask flag is true the timer thread cancels itself.
Ultimately, submitTasks requests the SchedulerBackend to revive offers.
Tip
taskSetsByStageIdAndAttempt Registry
Caution
FIXME
LocalBackend (for local mode) with WorkerOffer resource offers that represent cores
(CPUs) available on all the active executors with one WorkerOffer per active executor.
A WorkerOffer is a 3-tuple with executor id, host, and the number of free cores available.
WorkerOffer(executorId: String, host: String, cores: Int)
For each WorkerOffer (that represents free cores on an executor) resourceOffers method
records the host per executor id (using the internal executorIdToHost ) and sets 0 as the
number of tasks running on the executor if there are no tasks on the executor (using
executorIdToTaskCount ). It also records hosts (with executors in the internal
executorsByHost registry).
Warning
For the offers with a host that has not been recorded yet (in the internal executorsByHost
registry) the following occurs:
1. The host is recorded in the internal executorsByHost registry.
2. executorAdded callback is called (with the executor id and the host from the offer).
3.
newExecAvail flag is enabled (it is later used to inform TaskSetManagers about the new
executor).
Caution
It shuffles the input offers, which is supposed to help distribute tasks evenly across the executors (that the input offers represent), and builds internal structures like tasks and availableCpus .
For every TaskSetManager in the TaskSetManager sorted queue, the following DEBUG
message is printed out to the logs:
DEBUG TaskSchedulerImpl: parentName: [taskSet.parent.name], name: [taskSet.name], runningTasks: [taskSet.runningTasks]
Note
While traversing over the sorted collection of TaskSetManagers , if a new host (with an
executor) was registered, i.e. the newExecAvail flag is enabled, TaskSetManagers are
informed about the new executor added.
Note
A TaskSetManager will be informed about one or more new executors once per
host regardless of the number of executors registered on the host.
For each TaskSetManager (in sortedTaskSets ) and for each preferred locality level
(ascending), resourceOfferSingleTaskSet is called until launchedTask flag is false .
Caution
Check whether the number of cores in an offer is greater than the number of cores needed
for a task.
When resourceOffers managed to launch a task (i.e. tasks collection is not empty), the
internal hasLaunchedTask flag becomes true (that effectively means what the name says
"There were executors and I managed to launch a task").
resourceOffers returns the tasks collection.
Note
offers.
resourceOfferSingleTaskSet method
resourceOfferSingleTaskSet(
taskSet: TaskSetManager,
maxLocality: TaskLocality,
shuffledOffers: Seq[WorkerOffer],
availableCpus: Array[Int],
tasks: Seq[ArrayBuffer[TaskDescription]]): Boolean
TaskResultGetter
TaskResultGetter is a helper class for TaskSchedulerImpl.statusUpdate. It asynchronously
fetches the task results of tasks that have finished successfully (using
enqueueSuccessfulTask) or fetches the reasons of failures for failed tasks (using
enqueueFailedTask). It then sends the "results" back to TaskSchedulerImpl .
Caution
Tip
Note
enqueueSuccessfulTask
enqueueFailedTask
The methods use the internal (daemon thread) thread pool task-result-getter (as
getTaskResultExecutor ) with spark.resultGetter.threads so they can be executed
asynchronously.
TaskResultGetter.enqueueSuccessfulTask
enqueueSuccessfulTask(taskSetManager: TaskSetManager, tid: Long, serializedData:
ByteBuffer) starts by deserializing TaskResult (from serializedData using the global
closure Serializer).
If the result is DirectTaskResult , the method checks
taskSetManager.canFetchMoreResults(serializedData.limit()) and possibly quits. If not, it
FIXME Review
taskSetManager.canFetchMoreResults(serializedData.limit()) .
Caution
TaskResultGetter.enqueueFailedTask
enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: TaskState, serializedData: ByteBuffer) checks whether serializedData contains any data and, if it does, deserializes the failure reason from it.
TaskSchedulerImpl.statusUpdate
statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) is called by
scheduler backends to inform about task state changes (see Task States in Tasks).
Caution
It is called by:
CoarseGrainedSchedulerBackend when StatusUpdate(executorId, taskId, state,
data) comes.
The method looks up the TaskSetManager for the task (using taskIdToTaskSetManager ).
When the TaskSetManager is found and the task is in finished state, the task is removed
from the internal data structures, i.e. taskIdToTaskSetManager and taskIdToExecutorId , and
the number of currently running tasks for the executor(s) is decremented (using
executorIdToTaskCount ).
TaskSchedulerImpl.handleFailedTask
TaskSchedulerImpl.handleFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState:
TaskState, reason: TaskEndReason) is called when TaskResultGetter.enqueueSuccessfulTask
TaskSchedulerImpl.taskSetFinished
taskSetFinished(manager: TaskSetManager) method is called to inform TaskSchedulerImpl that all the tasks in a TaskSetManager have completed.
Note
TaskSchedulerImpl.executorAdded
executorAdded(execId: String, host: String)
DAGScheduler.executorAdded)
Caution
Internal Registries
Caution
Settings
spark.task.maxFailures
spark.task.maxFailures (default: 4 for cluster mode and 1 for local except local-with-retries) - The number of individual task failures before giving up on the entire TaskSet and the job afterwards.
It is used in TaskSchedulerImpl to initialize a TaskSetManager.
spark.task.cpus
spark.task.cpus (default: 1 ) sets how many CPUs to request per task.
spark.scheduler.mode
spark.scheduler.mode (default: FIFO ) is a case-insensitive name of the scheduling mode
spark.speculation.interval
spark.speculation.interval (default: 100ms ) - how often to check for speculative tasks.
spark.starvation.timeout
spark.starvation.timeout (default: 15s ) - threshold above which Spark warns a user that the initial job has not accepted any resources.
spark.resultGetter.threads
spark.resultGetter.threads (default: 4 ) - the number of threads for TaskResultGetter.
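A hedged example of setting a few of the properties above on a SparkConf (the values are arbitrary and only for illustration):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "8")      // tolerate more task failures before aborting the job
  .set("spark.task.cpus", "2")             // request 2 CPUs per task
  .set("spark.resultGetter.threads", "8")  // threads used by TaskResultGetter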
TaskContext
TaskContext allows a task to access contextual information about itself as well as register
task listeners.
Using TaskContext you can access local properties that were set by the driver. You can also
access task metrics.
You can access the active TaskContext instance using TaskContext.get method.
TaskContext belongs to org.apache.spark package.
import org.apache.spark.TaskContext
Note
TaskContext is serializable.
Contextual Information
stageId is the id of the stage the task belongs to.
partitionId is the id of the partition computed by the task.
attemptNumber denotes how many times the task has been attempted (starting from 0).
taskAttemptId is the id of the attempt of the task.
isCompleted returns true when a task is completed.
isInterrupted returns true when a task was killed.
All these attributes are accessible using appropriate getters, e.g. getPartitionId for the
partition id.
addTaskCompletionListener
addTaskCompletionListener registers a TaskCompletionListener listener that will be
Note
addTaskFailureListener
addTaskFailureListener registers a TaskFailureListener listener that will only be executed
on task failure. It can be executed multiple times since a task can be re-attempted when it
fails.
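Both listeners can be registered from inside a running task; a minimal spark-shell sketch (the println output shows up in the executor logs):
import org.apache.spark.TaskContext

sc.parallelize(1 to 9, 3).foreachPartition { _ =>
  val tc = TaskContext.get
  tc.addTaskCompletionListener { (ctx: TaskContext) =>
    println(s"partition ${ctx.partitionId} completed")
  }
  tc.addTaskFailureListener { (ctx: TaskContext, error: Throwable) =>
    println(s"partition ${ctx.partitionId} failed: ${error.getMessage}")
  }
}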
You can use getLocalProperty method to access local properties that were set by the driver
using SparkContext.setLocalProperty.
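For example (a spark-shell sketch; the property key is made up), a local property set on the driver is visible inside the tasks of jobs submitted from the same thread:
import org.apache.spark.TaskContext

sc.setLocalProperty("myKey", "myValue")

sc.parallelize(1 to 3, 3).foreach { _ =>
  // Prints "myValue" in the executor logs.
  println(TaskContext.get.getLocalProperty("myKey"))
}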
Task Metrics
taskMetrics(): TaskMetrics
taskMetrics method is part of the Developer API that allows access to the TaskMetrics instance of the task.
getMetricsSources allows access to all the metrics sources with a given sourceName which are registered in the MetricsSystem.
TaskContext.get method returns TaskContext instance for the active task (as a
TaskContextImpl object). There can only be one instance and tasks can use the object to
access contextual information about themselves.
val rdd = sc.range(0, 3, numSlices = 3)
scala> rdd.partitions.size
res0: Int = 3
rdd.foreach { n =>
import org.apache.spark.TaskContext
val tc = TaskContext.get
val msg = s"""|-------------------
|partitionId: ${tc.partitionId}
|stageId: ${tc.stageId}
|attemptNum: ${tc.attemptNumber}
|taskAttemptId: ${tc.taskAttemptId}
|-------------------""".stripMargin
println(msg)
}
Note
TaskContextImpl
TaskContextImpl is the only implementation of TaskContext abstract class.
Caution
FIXME
stage
partition
task attempt
attempt number
runningLocally = false
taskMemoryManager
Caution
markInterrupted
Caution
FIXME
FIXME
TaskMemoryManager
TaskMemoryManager manages the memory allocated by an individual task.
It assumes that:
The number of bits to address pages (aka PAGE_NUMBER_BITS ) is 13
The number of bits to encode offsets in data pages (aka OFFSET_BITS ) is 51 (i.e. 64
bits - PAGE_NUMBER_BITS )
The number of entries in the page table and allocated pages (aka PAGE_TABLE_SIZE ) is
8192 (i.e. 1 << PAGE_NUMBER_BITS )
The maximum page size (aka MAXIMUM_PAGE_SIZE_BYTES ) is about 17GB (i.e. ((1L << 31) - 1) * 8L bytes)
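These constants are tied together arithmetically; a quick spark-shell (or plain Scala REPL) check with local values mirroring them:
val PAGE_NUMBER_BITS = 13
val OFFSET_BITS = 64 - PAGE_NUMBER_BITS              // 51
val PAGE_TABLE_SIZE = 1 << PAGE_NUMBER_BITS          // 8192
val MAXIMUM_PAGE_SIZE_BYTES = ((1L << 31) - 1) * 8L  // 17179869176 bytes, i.e. about 17GB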
Note
Tip
log4j.logger.org.apache.spark.memory.TaskMemoryManager=TRACE
Refer to Logging.
Caution
FIXME How to trigger the messages in the logs? What to execute to have
them printed out to the logs?
A single TaskMemoryManager manages the memory of a single task (by the task's taskAttemptId ).
Note
When called, the constructor uses the input MemoryManager to know whether it is in
Tungsten memory mode (disabled by default) and saves the MemoryManager and
taskAttemptId for later use.
Note
memory could be allocated, it calls spill on every consumer, including itself. Finally, it returns the allocated memory.
Note
Note
When the memory obtained is less than requested (by required ), it requests all consumers
to spill the remaining required memory.
Note
It requests memory from consumers that work in the same mode except the
requesting one.
You may see the following DEBUG message when spill released some memory:
DEBUG Task [taskAttemptId] released [bytes] from [consumer] for [consumer]
It does the memory acquisition until it gets enough memory or there are no more consumers
to request spill from.
You may also see the following ERROR message in the logs when there is an error while
requesting spill with OutOfMemoryError followed.
ERROR error while calling spill() on [consumer]
If the earlier spill on the consumers did not work out and there is still not enough memory
acquired, acquireExecutionMemory calls spill on the input consumer (that requested more
memory!)
If the consumer releases some memory, you should see the following DEBUG message in
the logs:
DEBUG Task [taskAttemptId] released [bytes] from itself ([consumer])
Note
Note
Caution
FIXME
FIXME
FIXME
cleanUpAllAllocatedMemory
It clears page table.
All recorded consumers are queried for the size of used memory. If the memory used is
greater than 0, the following WARN message is printed out to the logs:
WARN TaskMemoryManager: leak [bytes] memory from [consumer]
Note
It then acquires execution memory (for the input size and consumer ).
It finishes by returning null when no execution memory could be acquired.
With the execution memory acquired, it finds the smallest unallocated page index and
records the page number (using allocatedPages registry).
If the index is PAGE_TABLE_SIZE or higher, releaseExecutionMemory(acquired, consumer) is
called and then the following IllegalStateException is thrown:
Have already allocated a maximum of [PAGE_TABLE_SIZE] pages
Caution
When successful, MemoryBlock gets assigned pageNumber and it gets added to the internal
pageTable registry.
You should see the following TRACE message in the logs:
TRACE Allocate page number [pageNumber] ([acquired] bytes)
And acquiredButNotUsed gets acquired memory space with the pageNumber cleared in
allocatedPages (i.e. the index for pageNumber gets false ).
Caution
releaseExecutionMemory
Caution
FIXME
Internal Registries
pageTable
pageTable is an internal array of size PAGE_TABLE_SIZE whose elements are MemoryBlock objects.
When allocating a MemoryBlock page for Tungsten consumers, the index corresponds to
pageNumber that points to the MemoryBlock page allocated.
allocatedPages
allocatedPages is an internal collection of flags ( true or false values) of size
PAGE_TABLE_SIZE with all bits initially disabled (i.e. false ).
Tip
allocatedPages is java.util.BitSet.
When allocatePage is called, it will record the page in the registry by setting the bit at the
specified index (that corresponds to the allocated page) to true .
consumers
consumers is an internal set of MemoryConsumers.
acquiredButNotUsed
acquiredButNotUsed tracks the size of memory allocated but not used.
pageSizeBytes method
Caution
FIXME
showMemoryUsage method
Caution
FIXME
MemoryConsumer
MemoryConsumer is the contract for memory consumers of TaskMemoryManager with support
for spilling.
A MemoryConsumer basically tracks how much memory is allocated.
Creating a MemoryConsumer requires a TaskMemoryManager with optional pageSize and a
MemoryMode .
Note
MemoryConsumer Contract
Caution
spill method
abstract long spill(long size, MemoryConsumer trigger) throws IOException
Internally, it decrements used registry by the size of page and frees the page.
Internally, it allocates a page for the requested size . The size is recorded in the internal
used counter.
However, if it was not possible to allocate the size memory, it shows the current memory usage and an OutOfMemoryError is thrown.
Unable to acquire [required] bytes of memory, got [got]
acquireMemory acquires execution memory of size size. The memory is recorded in used
registry.
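As a rough illustration of the contract only (MemoryConsumer and TaskMemoryManager are Spark-internal classes that user code does not normally instantiate), a consumer could look like the following sketch; the class name and the spilling policy are entirely hypothetical:
import org.apache.spark.memory.{MemoryConsumer, MemoryMode, TaskMemoryManager}

// Hypothetical consumer that only tracks a single chunk of execution memory.
class SimpleConsumer(tmm: TaskMemoryManager)
  extends MemoryConsumer(tmm, tmm.pageSizeBytes, MemoryMode.ON_HEAP) {

  private var held = 0L

  def grab(bytes: Long): Long = {
    val got = acquireMemory(bytes)  // may force other consumers (or this one) to spill
    held += got
    got
  }

  // Called back by TaskMemoryManager when memory is needed elsewhere.
  override def spill(size: Long, trigger: MemoryConsumer): Long = {
    val toRelease = math.min(size, held)
    freeMemory(toRelease)           // decrements the internal used counter
    held -= toRelease
    toRelease
  }
}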
TaskMetrics
Caution
FIXME
incUpdatedBlockStatuses
Caution
FIXME
Scheduler Backends
Introduction
Spark comes with a pluggable backend mechanism called scheduler backend (aka backend scheduler) to support various cluster managers, e.g. Apache Mesos, Hadoop YARN or Spark's own Spark Standalone and Spark local.
These cluster managers differ by their custom task scheduling modes and resource offer mechanisms, and Spark's approach is to abstract the differences away in the SchedulerBackend Contract.
A scheduler backend is created and started as part of SparkContext's initialization (when TaskSchedulerImpl is started - see Creating Scheduler Backend and Task Scheduler).
FIXME Image how it gets created with SparkContext in play here or in
SparkContext doc.
Caution
SchedulerBackend Contract
Note
Spark.
Caution
reviveOffers
Note
There are currently three custom implementations of reviveOffers available in Spark for
different clustering options:
For local mode read Task Submission a.k.a. reviveOffers.
CoarseGrainedSchedulerBackend
MesosFineGrainedSchedulerBackend
The default level of parallelism is used by TaskScheduler as a hint for sizing jobs.
Note
It is used in TaskSchedulerImpl.defaultParallelism .
killTask
applicationAttemptId
applicationAttemptId(): Option[String] returns no application attempt id.
It is currently only supported by YARN cluster scheduler backend as the YARN cluster
manager supports multiple attempts.
getDriverLogUrls
getDriverLogUrls: Option[Map[String, String]] returns no URLs by default.
Available Implementations
Spark comes with the following scheduler backends:
LocalBackend (local mode)
CoarseGrainedSchedulerBackend
SparkDeploySchedulerBackend used in Spark Standalone (and local-cluster FIXME)
YarnSchedulerBackend
YarnClientSchedulerBackend (for client deploy mode)
YarnClusterSchedulerBackend (for cluster deploy mode).
CoarseMesosSchedulerBackend
MesosSchedulerBackend
CoarseGrainedSchedulerBackend
CoarseGrainedSchedulerBackend is a SchedulerBackend and ExecutorAllocationClient.
It is responsible for requesting resources from a cluster manager for executors to be able to
launch tasks (on coarse-grained executors).
This backend holds executors for the duration of the Spark job rather than relinquishing
executors whenever a task is done and asking the scheduler to launch a new executor for
each new task.
When being created, CoarseGrainedSchedulerBackend requires a Task Scheduler, and a RPC
Environment.
It uses LiveListenerBus.
It registers CoarseGrainedScheduler RPC Endpoint that executors use for RPC
communication.
It tracks:
the total number of cores in the cluster (using totalCoreCount )
the total number of executors that are currently registered
executors ( ExecutorData )
executors to be removed ( executorsPendingToRemove )
hosts and the number of tasks possibly running on them
lost executors with no real exit reason
tasks per slave ( taskIdsOnSlave )
Enable INFO or DEBUG logging level for
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend logger to see what
happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend=DEBUG
Refer to Logging.
When a CoarseGrainedSchedulerBackend is being created, it initializes the following registries:
totalCoreCount to 0
totalRegisteredExecutors to 0
maxRpcMessageSize to spark.rpc.message.maxSize.
_minRegisteredRatio to spark.scheduler.minRegisteredResourcesRatio (between 0
and 1 inclusive).
maxRegisteredWaitingTimeMs to
spark.scheduler.maxRegisteredResourcesWaitingTime.
createTime to the current time.
executorDataMap to an empty collection.
numPendingExecutors to 0
executorsPendingToRemove to an empty collection.
hostToLocalTaskCount to an empty collection.
localityAwareTasks to 0
currentExecutorIdCounter to 0
It accesses the current LiveListenerBus and SparkConf through the constructor's reference to TaskSchedulerImpl.
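The two registration-related settings referenced above are plain Spark properties; a hedged example of tuning them (values are illustrative only):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")        // wait until 80% of resources register
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")  // but never longer than 30 seconds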
CoarseGrainedSchedulerBackend Contract
Caution
FIXME
doRequestTotalExecutors
doRequestTotalExecutors(requestedTotal: Int): Boolean = false
Note
Internal Registries
currentExecutorIdCounter Counter
currentExecutorIdCounter is the last (highest) identifier of all allocated executors.
Note
executorDataMap Registry
executorDataMap = new HashMap[String, ExecutorData]
It uses ExecutorData that holds an executor's endpoint reference, address, host, the number of free and total CPU cores, and the URL of execution logs.
Note
numPendingExecutors
Caution
FIXME
numExistingExecutors
Caution
FIXME
executorsPendingToRemove
Caution
FIXME
localityAwareTasks
Caution
FIXME
hostToLocalTaskCount
Caution
FIXME
Note
When called, you should see the following INFO message followed by DEBUG message in
the logs:
INFO Requesting [numAdditionalExecutors] additional executor(s) from the cluster manager
DEBUG Number of pending executors is now [numPendingExecutors]
requestExecutors requests executors from a cluster manager (that reflects the current
computation needs). The "new executor total" is a sum of the internal numExistingExecutors
and numPendingExecutors decreased by the number of executors pending to be removed.
If numAdditionalExecutors is negative, an IllegalArgumentException is thrown:
Attempted to request a negative number of additional executor(s) [numAdditionalExecutors] from the cluster manager. Please specify a positive number!
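The arithmetic can be sketched as follows (a minimal, self-contained paraphrase of the behaviour described above; the names mirror the internal registries, but this is not the Spark source):
// Sketch of requestExecutors: validate the input, grow the pending count and
// ask the cluster manager for the new total.
object RequestExecutorsSketch {
  var numExistingExecutors = 4
  var numPendingExecutors = 0
  var executorsPendingToRemove = Set.empty[String]

  // Cluster-manager-specific in Spark; always acknowledges in this sketch.
  def doRequestTotalExecutors(requestedTotal: Int): Boolean = true

  def requestExecutors(numAdditionalExecutors: Int): Boolean = {
    require(numAdditionalExecutors >= 0,
      s"Attempted to request a negative number of additional executor(s) " +
        s"$numAdditionalExecutors from the cluster manager. Please specify a positive number!")
    numPendingExecutors += numAdditionalExecutors
    val newTotal = numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size
    doRequestTotalExecutors(newTotal)
  }
}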
Note
Note
Note
Note
minRegisteredRatio
minRegisteredRatio: Double
Note
Note
Note
Caution
Note
FIXME Image
killTask is part of the SchedulerBackend Contract.
If sufficient resources are available, you should see the following INFO message in the logs:
INFO SchedulerBackend is ready for scheduling beginning after
reached minRegisteredResourcesRatio: [minRegisteredRatio]
Note
In case sufficient resources are not yet available (the above requirement does not hold), it
checks whether the time since startup (as createTime ) has passed
spark.scheduler.maxRegisteredResourcesWaitingTime to give a way to submit tasks
(despite minRegisteredRatio not being reached yet).
You should see the following INFO message in the logs:
INFO SchedulerBackend is ready for scheduling beginning after
waiting maxRegisteredResourcesWaitingTime:
[maxRegisteredWaitingTimeMs](ms)
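The readiness check described above can be paraphrased in a few lines (the field names are assumptions mirroring the text; this is not the actual isReady implementation):
// Ready when enough resources have registered, or when the waiting time has elapsed.
def isReadySketch(
    sufficientResourcesRegistered: Boolean,
    createTime: Long,
    maxRegisteredWaitingTimeMs: Long,
    now: Long = System.currentTimeMillis()): Boolean =
  sufficientResourcesRegistered || (now - createTime) >= maxRegisteredWaitingTimeMs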
sufficientResourcesRegistered
sufficientResourcesRegistered always responds that sufficient resources are available.
reset resets the backend to the state it had when first initialized. When called, it:
Note
1. Sets numPendingExecutors to 0
2. Clears executorsPendingToRemove
3. Sends a blocking RemoveExecutor message to driverEndpoint for every executor (in
the internal executorDataMap ) to inform it about SlaveLost with the message:
Stale executor after cluster manager re-registered.
Note
Note
that in turn lives inside a Spark driver. That explains the name driverEndpoint
(at least partially).
RPC message.
It uses a driver-revive-thread daemon single-thread thread pool for FIXME
Caution
FIXME A potential issue with driverEndpoint.asInstanceOf[NettyRpcEndpointRef].toURI - it doubles the
spark:// prefix.
RPC Messages
KillTask(taskId, executorId, interruptThread)
RemoveExecutor
RetrieveSparkProps
ReviveOffers
ReviveOffers simply passes the call on to makeOffers.
Caution
FIXME When is an executor alive? What other states can an executor be in?
StopDriver
StopDriver message stops the RPC endpoint.
StopExecutors
StopExecutors message is receive-reply and blocking. When received, an INFO message is printed out to the logs and each registered executor is requested to stop.
RegisterExecutor
RegisterExecutor(executorId, executorRef, cores, logUrls)
Note
RegisterExecutor is sent when CoarseGrainedExecutorBackend (an RPC Endpoint) starts.
When numPendingExecutors is more than 0 , the following is printed out to the logs:
DriverEndpoint
DriverEndpoint is a ThreadSafeRpcEndpoint.
onDisconnected Callback
When called, onDisconnected removes the worker from the internal addressToExecutorId
registry (that effectively removes the worker from a cluster).
While removing, it calls removeExecutor with the reason being SlaveLost and message:
Remote RPC client disassociated. Likely due to containers
exceeding thresholds, or network issues. Check driver logs for
WARN messages.
Note
makeOffers is a private method that takes the active executors (out of the executorDataMap
internal registry) and creates WorkerOffer resource offers for each (one per executor with
the executor's id, host and free cores).
Caution
Only free cores are considered in making offers. Memory is not! Why?!
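A simplified sketch of the offer-building step (ExecutorDataSketch and WorkerOfferSketch stand in for Spark's internal types; only the fields used here are modelled):
// One offer per active executor, built from its id, host and free cores.
case class ExecutorDataSketch(executorHost: String, freeCores: Int)
case class WorkerOfferSketch(executorId: String, host: String, cores: Int)

def makeOffersSketch(executorDataMap: Map[String, ExecutorDataSketch]): Seq[WorkerOfferSketch] =
  executorDataMap.toSeq.map { case (id, data) =>
    WorkerOfferSketch(id, data.executorHost, data.freeCores) // free cores only, memory is ignored
  }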
launchTasks is a private helper method that iterates over TaskDescription objects in the
tasks input collection and FIXME
Note
resource offers.
Caution
If the serialized task's size is over the maximum RPC message size, the task's
TaskSetManager is aborted.
Caution
FIXME At that point, tasks have their executor assigned. When and how did
that happen?
If the serialized task's size is within the limit, the task's executor is looked up in the internal
executorDataMap registry to record that the task is about to be launched, and the number of
free cores of the executor is decremented by the CPUS_PER_TASK constant (i.e.
spark.task.cpus).
Caution
Note
Ultimately, launchTasks sends a LaunchTask message to the executor's RPC endpoint with
the serialized task (wrapped in SerializableBuffer ).
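The size check and core bookkeeping can be sketched like this (types and field names are simplified stand-ins, not the Spark source):
import java.nio.ByteBuffer
import scala.collection.mutable

final case class SerializedTaskSketch(taskId: Long, executorId: String, bytes: ByteBuffer)

def launchTaskSketch(
    task: SerializedTaskSketch,
    maxRpcMessageSize: Int,
    freeCoresByExecutor: mutable.Map[String, Int],
    cpusPerTask: Int)(send: SerializedTaskSketch => Unit): Unit = {
  if (task.bytes.limit() > maxRpcMessageSize) {
    // in Spark the task's TaskSetManager is aborted with a descriptive message
    sys.error(s"Serialized task ${task.taskId} exceeds the maximum RPC message size")
  } else {
    freeCoresByExecutor(task.executorId) -= cpusPerTask // reserve spark.task.cpus cores
    send(task)                                          // LaunchTask over RPC in Spark
  }
}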
Note
Scheduling in Spark relies on cores only (not memory), i.e. the number of tasks
Spark can run on an executor is constrained by the number of cores available
only. When submitting a Spark application for execution, both memory and cores can be
specified explicitly.
Known Implementations
StandaloneSchedulerBackend
MesosCoarseGrainedSchedulerBackend
Settings
spark.rpc.message.maxSize
spark.rpc.message.maxSize (default: 128 , and not greater than 2047m - 200k ) - the
largest frame size for RPC messages (serialized tasks or task results) in MB.
spark.default.parallelism
spark.default.parallelism (default: maximum of totalCoreCount and 2) - the default parallelism of the scheduler backend.
spark.scheduler.minRegisteredResourcesRatio
spark.scheduler.minRegisteredResourcesRatio (default: 0 ) - a double value between 0 and
1 (including) that controls the minimum ratio of (registered resources / total expected
resources) before submitting tasks. See isReady.
spark.scheduler.maxRegisteredResourcesWaitingTime
spark.scheduler.maxRegisteredResourcesWaitingTime (default: 30s ) - the time to wait for sufficient resources to register before scheduling begins.
Executor Backends
ExecutorBackend is a pluggable interface used by executors to send status updates about tasks to the scheduler.
It is effectively a bridge between the driver and an executor, i.e. there are two endpoints
running.
Caution
Status updates include information about tasks, i.e. id, state, and data (as ByteBuffer ).
At startup, an executor backend connects to the driver and creates an executor. It then
launches and kills tasks. It stops when the driver orders so.
There are the following types of executor backends:
LocalBackend (local mode)
CoarseGrainedExecutorBackend
MesosExecutorBackend
MesosExecutorBackend
Caution
FIXME
CoarseGrainedExecutorBackend
CoarseGrainedExecutorBackend manages a single executor object. The internal executor
object is created after a connection to the driver is established (i.e. after RegisteredExecutor
has arrived).
All task status updates are sent along to driverRef as StatusUpdate messages.
happens inside.
Tip
Note
When onStart is executed, it prints out the following INFO message to the logs:
INFO CoarseGrainedExecutorBackend: Connecting to driver: [driverUrl]
It then accesses the RpcEndpointRef for the driver (using the constructor's driverUrl) and
eventually initializes the internal driver reference to which it will send a blocking RegisterExecutor
message.
If there is an issue while registering the executor, you should see the following ERROR
message in the logs and the process exits (with exit code 1 ).
ERROR Cannot register with driver: [driverUrl]
Note
FIXME
driver RpcEndpointRef
driver is an optional RpcEndpointRef for the driver.
Tip
Driver's URL
main
CoarseGrainedExecutorBackend is a command-line application (it comes with main
method).
It accepts the following options:
--driver-url (required) - the driver's URL. See Driver's URL.
--executor-id (required) - the executor's id
--hostname (required) - the name of the host
--cores (required) - the number of cores (must be more than 0 )
--app-id (required) - the id of the application
--worker-url - the worker's URL, e.g. spark://[email protected]:64557
--user-class-path - a URL/path to a resource to be added to CLASSPATH; can be specified multiple times.
It sends a (blocking) RetrieveSparkProps message to the driver (using the value for
driverUrl command-line option). When the response (the driver's SparkConf ) arrives, it
adds spark.app.id (using the value for the --app-id command-line option) and creates a brand
new SparkConf .
If spark.yarn.credentials.file is set, FIXME
A SparkEnv is created using SparkEnv.createExecutorEnv (with isLocal being false ).
Caution
FIXME
Usage
Caution
It is used in:
SparkDeploySchedulerBackend
CoarseMesosSchedulerBackend
SparkClassCommandBuilder - ???
start
stop
requestTotalExecutors
executor internal field
executor is an Executor FIXME
Caution
FIXME
RPC Messages
RegisteredExecutor
RegisteredExecutor(hostname)
When a RegisteredExecutor message arrives, you should see the following INFO in the
logs:
INFO CoarseGrainedExecutorBackend: Successfully registered with driver
The internal executor is created using the executorId constructor parameter, the hostname
that has arrived, and others.
Note
RegisterExecutorFailed
RegisterExecutorFailed(message)
When a RegisterExecutorFailed message arrives, the following ERROR is printed out to the
logs:
ERROR CoarseGrainedExecutorBackend: Slave registration failed: [message]
LaunchTask
LaunchTask(data: SerializableBuffer)
The LaunchTask handler deserializes TaskDescription from data (using the global closure
Serializer).
Note
CoarseGrainedSchedulerBackend.launchTasks.
KillTask(taskId, _, interruptThread)
KillTask(taskId, _, interruptThread) message kills a task (calls Executor.killTask ).
If an executor has not been initialized yet (FIXME: why?), the following ERROR message is
printed out to the logs and CoarseGrainedExecutorBackend exits:
ERROR Received KillTask command but executor was null
StopExecutor
StopExecutor message handler is receive-reply and blocking. When received, the handler
Shutdown
Shutdown stops the executor, itself and RPC Environment.
BlockManager
BlockManager is a key-value store for blocks of data in Spark. BlockManager acts as a local
cache that runs on every node in a Spark cluster, i.e. the driver and executors. It provides
interfaces for uploading and fetching blocks both locally and remotely using various stores,
i.e. memory, disk, and off-heap. See Stores in this document.
A BlockManager is a BlockDataManager, i.e. it manages the storage for blocks that can
represent cached RDD partitions, intermediate shuffle outputs, broadcasts, etc. It is also a
BlockEvictionHandler that drops a block from memory and stores it on disk if applicable.
Cached blocks are blocks with non-zero sum of memory and disk sizes.
BlockManager is created as a Spark application starts.
Tip
You may want to shut off WARN messages being printed out about the current
state of blocks using the following line to cut the noise:
log4j.logger.org.apache.spark.storage.BlockManager=OFF
Refer to Logging.
registerTask
Caution
FIXME
Stores
A Store is the place where blocks are held.
There are the following possible stores:
MemoryStore for memory storage level.
DiskStore for disk storage level.
ExternalBlockStore for OFF_HEAP storage level.
putBytes puts the blockId block (of bytes bytes and level storage level) to the
BlockManager .
doPutBytes
def doPutBytes[T](
blockId: BlockId,
bytes: ChunkedByteBuffer,
level: StorageLevel,
classTag: ClassTag[T],
tellMaster: Boolean = true,
keepReadLock: Boolean = false): Boolean
doPutBytes is an internal method that calls the internal helper doPut with putBody being a function that accepts a BlockInfo and stores the block bytes.
Caution
If the initial attempt to put the block in MemoryStore does not succeed and the storage level
also includes disk, you should see the following WARN message in the logs:
WARN BlockManager: Persisting block [blockId] to disk instead.
DiskStore.putBytes is called.
Note
DiskStore is only used when MemoryStore has failed for memory and disk
storage levels.
If the driver should know about it ( tellMaster ), doPutBytes reports the current storage status of the block to the
driver. The current TaskContext metrics are updated with the updated block status.
Regardless of the block being successfully stored or not, you should see the following
DEBUG message in the logs:
DEBUG BlockManager: Put block [blockId] locally took [time] ms
For replication level greater than 1 , doPutBytes waits for the earlier asynchronous
replication to finish.
The final result of doPutBytes is whether the block was stored successfully or not (as
computed earlier).
replicate
Caution
FIXME
doPutIterator
Caution
FIXME
doPut
doPut[T](
blockId: BlockId,
level: StorageLevel,
classTag: ClassTag[_],
tellMaster: Boolean,
keepReadLock: Boolean)(putBody: BlockInfo => Option[T]): Option[T]
It releases the read lock for the block when keepReadLock flag is disabled. doPut returns
None immediately.
putBody is executed.
removeBlock removes the blockId block from the MemoryStore and DiskStore.
When executed, it prints out the following DEBUG message to the logs:
DEBUG Removing block [blockId]
It requests BlockInfoManager for a write lock for the blockId block. If it receives none, it
prints out the following WARN message to the logs and quits.
WARN Asked to remove block [blockId], which does not exist
Otherwise, with a write lock for the block, the block is removed from MemoryStore and
DiskStore (see Removing Block in MemoryStore and Removing Block in DiskStore ).
If both removals fail, it prints out the following WARN message:
WARN Block [blockId] could not be removed as it was not found in either the disk, memory, or external block store
removeRdd removes all the blocks that belong to the rddId RDD.
It then requests RDD blocks from BlockInfoManager and removes them (from memory and
disk) (without informing the driver).
The number of blocks removed is the final result.
Note
removeBroadcast removes all the blocks that belong to the broadcastId broadcast.
It then requests all BroadcastBlockId objects that belong to the broadcastId broadcast
from BlockInfoManager and removes them (from memory and disk).
The number of blocks removed is the final result.
Note
FIXME
RpcEnv
BlockManagerMaster
SerializerManager
SparkConf
MemoryManager
MapOutputTracker
ShuffleManager
BlockTransferService
SecurityManager
Note
Caution
It calculates the port used by the external shuffle service (as externalShuffleServicePort ).
Note
Caution
It creates a client to read other executors' shuffle files (as shuffleClient ). If the external
shuffle service is used, an ExternalShuffleClient is created; otherwise the input BlockTransferService
is used.
It sets the maximum number of failures before this block manager refreshes the block
locations from the driver (as maxFailuresBeforeLocationRefresh ).
It registers BlockManagerSlaveEndpoint with the input RpcEnv, itself, and
MapOutputTracker (as slaveEndpoint ).
Note
shuffleClient
Caution
FIXME
shuffleServerId
Caution
FIXME
initialize method is called to initialize the BlockManager instance on the driver and executors.
If the External Shuffle Service is used, the following INFO appears in the logs:
INFO external shuffle service port = [externalShuffleServicePort]
Note
Note
When executed, you should see the following INFO message in the logs:
INFO Registering executor with local external shuffle service.
It uses shuffleClient to register the block manager using shuffleServerId (i.e. the host, the
port and the executorId) and an ExecutorShuffleInfo .
Note
The maximum number of attempts and the sleep time in-between are hardcoded, i.e. they are not configured.
Any issues while connecting to the external shuffle service are reported as ERROR
messages in the logs:
ERROR Failed to connect to external shuffle server, will retry [#attempts] more times
after waiting 5 seconds...
When reregister is called, you should see the following INFO in the logs:
It registers itself to the driver's BlockManagerMaster (just as it was when BlockManager was
initializing). It passes the BlockManagerId, the maximum memory (as maxMemory ), and the
BlockManagerSlaveEndpoint.
Caution
reregister will then report all the local blocks to the BlockManagerMaster.
For each block metadata (in BlockInfoManager) it gets block current status and tries to send
it to the BlockManagerMaster.
If there is an issue communicating to the BlockManagerMaster, you should see the following
ERROR message in the logs:
ERROR BlockManager: Failed to report [blockId] to master; giving up.
heartbeats.
getCurrentBlockStatus returns the current BlockStatus of the BlockId block (with the
block's current StorageLevel, memory and disk sizes). It uses MemoryStore and DiskStore
for size and other information.
Note
Internally, it uses the input BlockInfo to know about the block's storage level. If the storage
level is not set (i.e. null ), the returned BlockStatus assumes the default NONE storage
level and the memory and disk sizes being 0 .
If however the storage level is set, getCurrentBlockStatus uses MemoryStore or DiskStore
to check whether the block is stored in the respective storages and requests their sizes
(using their getSize , or assumes 0 ).
Note
It is acceptable that the BlockInfo says to use memory or disk yet the block is
not in the storages (yet or anymore). The method will give current status.
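The status computation can be paraphrased as follows (simplified types; the storage level is modelled as an optional name and the sizes as optional values from MemoryStore and DiskStore):
case class BlockStatusSketch(storageLevel: String, memSize: Long, diskSize: Long)

def currentBlockStatusSketch(
    storageLevel: Option[String],
    sizeInMemory: => Option[Long],  // e.g. MemoryStore's size when the block is there
    sizeOnDisk: => Option[Long]): BlockStatusSketch =
  storageLevel match {
    case None        => BlockStatusSketch("NONE", 0L, 0L)
    case Some(level) => BlockStatusSketch(level, sizeInMemory.getOrElse(0L), sizeOnDisk.getOrElse(0L))
  }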
Note
When dropFromMemory is executed, you should see the following INFO message in the logs:
INFO BlockManager: Dropping block [blockId] from memory
Caution
WARN BlockManager: Block [blockId] could not be dropped from memory as it does not exist
It then calculates the current storage status of the block and reports it to the driver. It only
happens when info.tellMaster .
Caution
A block is considered updated when it was written to disk or removed from memory or both.
If either happened, the current TaskContext metrics are updated with the change.
Ultimately, dropFromMemory returns the current storage level of the block.
Note
reportBlockStatus is an internal method for reporting a block status to the driver and if told
Note
and removeBlock.
tryToReportBlockStatus
def tryToReportBlockStatus(
blockId: BlockId,
info: BlockInfo,
status: BlockStatus,
droppedMemorySize: Long = 0L): Boolean
BlockEvictionHandler
BlockEvictionHandler is a private[storage] Scala trait with a single method
dropFromMemory.
dropFromMemory(
blockId: BlockId,
data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel
Note
Note
A BlockManager is a BlockEvictionHandler .
dropFromMemory is called when MemoryStore evicts blocks from memory to free
space.
BlockManagerSlaveEndpoint
BlockManagerSlaveEndpoint is a thread-safe RPC endpoint for remote communication
Tip
Enable DEBUG logging level for org.apache.spark.storage.BlockManagerSlaveEndpoint logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.storage.BlockManagerSlaveEndpoint=DEBUG
Refer to Logging.
RemoveBlock Message
RemoveBlock(blockId: BlockId)
When a RemoveBlock message comes in, you should see the following DEBUG message in
the logs:
DEBUG BlockManagerSlaveEndpoint: removing block [blockId]
When the computation is successful, you should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Done removing block [blockId], response is [response]
And true response is sent back. You should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Sent response: true to [senderAddress]
In case of failure, you should see the following ERROR in the logs and the stack trace.
ERROR BlockManagerSlaveEndpoint: Error in removing block [blockId]
RemoveRdd Message
RemoveRdd(rddId: Int)
When a RemoveRdd message comes in, you should see the following DEBUG message in
the logs:
DEBUG BlockManagerSlaveEndpoint: removing RDD [rddId]
When the computation is successful, you should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Done removing RDD [rddId], response is [response]
And the number of blocks removed is sent back. You should see the following DEBUG in the
logs:
DEBUG BlockManagerSlaveEndpoint: Sent response: [#blocks] to [senderAddress]
In case of failure, you should see the following ERROR in the logs and the stack trace.
ERROR BlockManagerSlaveEndpoint: Error in removing RDD [rddId]
RemoveShuffle Message
RemoveShuffle(shuffleId: Int)
When a RemoveShuffle message comes in, you should see the following DEBUG message
in the logs:
DEBUG BlockManagerSlaveEndpoint: removing shuffle [shuffleId]
If MapOutputTracker was given (when the RPC endpoint was created), it calls
MapOutputTracker to unregister the shuffleId shuffle.
It then calls ShuffleManager to unregister the shuffleId shuffle.
Note
When the computation is successful, you should see the following DEBUG in the logs:
And the result is sent back. You should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Sent response: [response] to [senderAddress]
In case of failure, you should see the following ERROR in the logs and the stack trace.
ERROR BlockManagerSlaveEndpoint: Error in removing shuffle [shuffleId]
RemoveBroadcast Message
RemoveBroadcast(broadcastId: Long)
When a RemoveBroadcast message comes in, you should see the following DEBUG
message in the logs:
DEBUG BlockManagerSlaveEndpoint: removing broadcast [broadcastId]
When the computation is successful, you should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Done removing broadcast [broadcastId], response is [response]
And the result is sent back. You should see the following DEBUG in the logs:
DEBUG BlockManagerSlaveEndpoint: Sent response: [response] to [senderAddress]
In case of failure, you should see the following ERROR in the logs and the stack trace.
ERROR BlockManagerSlaveEndpoint: Error in removing broadcast [broadcastId]
GetBlockStatus Message
GetBlockStatus(blockId: BlockId)
When a GetBlockStatus message comes in, it responds with the result of calling
BlockManager about the status of blockId .
GetMatchingBlockIds Message
GetMatchingBlockIds(filter: BlockId => Boolean)
When a GetMatchingBlockIds message comes in, it responds with the result of calling
BlockManager for matching blocks for filter .
TriggerThreadDump Message
When a TriggerThreadDump message comes in, a thread dump is generated and sent back.
BlockManagerSlaveEndpoint uses an internal thread pool ( asyncThreadPool ) for some messages to talk to other Spark services, i.e.
BlockManager , MapOutputTracker, ShuffleManager in a non-blocking, asynchronous way.
The reason for the async thread pool is that the block-related operations might take quite
some time and to release the main RPC thread other threads are spawned to talk to the
external services and pass responses on to the clients.
Note
Broadcast Values
When a new broadcast value is created, TorrentBroadcast - the default implementation of
Broadcast - puts its blocks in the block manager. See TorrentBroadcast.
It puts the data in memory first and drops it to disk if the memory store can't hold it.
BlockManagerId
FIXME
DiskBlockManager
DiskBlockManager creates and maintains the logical mapping between logical blocks and
physical on-disk locations.
By default, one block is mapped to one file with a name given by its BlockId. It is however
possible to have a block map to only a segment of a file.
Block files are hashed among the directories listed in spark.local.dir (or in
SPARK_LOCAL_DIRS if set).
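The hashing idea can be illustrated with a small sketch (the modulo arithmetic mirrors the spirit of DiskBlockManager.getFile but is a simplification, not the exact Spark code; subDirsPerLocalDir is an assumed parameter):
// Pick a local dir and a sub-directory for a block name deterministically.
def pickDirectory(blockName: String, localDirs: Array[String], subDirsPerLocalDir: Int = 64): String = {
  val hash = blockName.hashCode & Int.MaxValue           // non-negative hash
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir
  f"${localDirs(dirId)}/$subDirId%02x/$blockName"
}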
Caution
Execution Context
block-manager-future is the execution context for FIXME
Metrics
Block Manager uses Spark Metrics System (via BlockManagerSource ) to report metrics about
internal status.
The name of the source is BlockManager.
It emits the following numbers:
memory / maxMem_MB - the maximum memory configured
memory / remainingMem_MB - the remaining memory
memory / memUsed_MB - the memory used
memory / diskSpaceUsed_MB - the disk used
Misc
The underlying abstraction for blocks in Spark is a ByteBuffer that limits the size of a block
to 2GB ( Integer.MAX_VALUE - see Why does FileChannel.map take up to
Integer.MAX_VALUE of data? and SPARK-1476 2GB limit in spark for blocks). This has
implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped
blocks are limited to 2GB, even though the API allows for long ) and ser-deser via byte array-backed output streams.
When a non-local executor starts, it initializes a BlockManager object for the spark.app.id
id.
Settings
spark.broadcast.compress (default: true ) whether to compress stored broadcast
variables.
spark.shuffle.compress (default: true ) whether to compress stored shuffle output.
spark.rdd.compress (default: false ) whether to compress RDD partitions that are
stored serialized.
spark.shuffle.spill.compress (default: true ) whether to compress data spilled during shuffles
MemoryStore
MemoryStore manages blocks (in the internal entries registry).
MemoryStore requires SparkConf, BlockInfoManager, SerializerManager , MemoryManager and a BlockEvictionHandler when created.
Caution
Note
Refer to Logging.
entries Registry
entries is Java's LinkedHashMap with the initial capacity of 32 , the load factor of 0.75
and access-order ordering mode (i.e. iteration is in the order in which its entries were last
accessed, from least-recently accessed to most-recently).
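Such a map can be constructed like this (MemoryEntrySketch stands in for Spark's internal MemoryEntry and a String key for BlockId):
import java.util.LinkedHashMap

case class MemoryEntrySketch(size: Long)
// initial capacity 32, load factor 0.75, accessOrder = true (LRU-style iteration order)
val entries = new LinkedHashMap[String, MemoryEntrySketch](32, 0.75f, true)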
Note
putBytes
putBytes[T: ClassTag](
blockId: BlockId,
size: Long,
memoryMode: MemoryMode,
_bytes: () => ChunkedByteBuffer): Boolean
putBytes requests size memory for the blockId block from the current MemoryManager .
FIXME
Removing Block
Caution
FIXME
Settings
spark.storage.unrollMemoryThreshold
spark.storage.unrollMemoryThreshold (default: 1024 * 1024 ) controls the initial memory to request before unrolling any block.
DiskStore
Caution
FIXME
putBytes
Caution
FIXME
Removing Block
Caution
FIXME
BlockDataManager
Note
Spark.
BlockDataManager Contract
Every BlockDataManager offers the following services:
getBlockData to fetch a local block data by blockId .
putBlockData to upload a block data locally by blockId . The return value says whether the operation succeeded ( true ) or not ( false ).
BlockId
BlockId identifies a block of data. It has a globally unique identifier ( name )
ManagedBuffer
ShuffleClient
ShuffleClient is an interface ( abstract class ) for reading shuffle files.
Note
ShuffleClient Contract
Every ShuffleClient can do the following:
It can be initialized ( init ) with an application id. The default implementation does nothing.
public void init(String appId)
ExternalShuffleClient
Caution
FIXME
FIXME
BlockTransferService
BlockTransferService is a contract for specialized ShuffleClient objects that can fetch and upload blocks of data.
BlockTransferService Contract
Every BlockTransferService offers the following:
init that accepts BlockDataManager for storing or fetching blocks. It is assumed that
port: Int
hostName: String
uploadBlock(
hostname: String,
port: Int,
execId: String,
blockId: BlockId,
blockData: ManagedBuffer,
level: StorageLevel,
classTag: ClassTag[_]): Future[Unit]
Synchronous (and hence blocking) fetchBlockSync to fetch one block blockId (that
corresponds to the ShuffleClient parents asynchronous fetchBlocks).
fetchBlockSync(
host: String,
port: Int,
execId: String,
blockId: String): ManagedBuffer
fetchBlockSync is a mere wrapper around fetchBlocks to fetch one blockId block and wait until the fetch finishes.
uploadBlockSync(
hostname: String,
port: Int,
execId: String,
blockId: BlockId,
blockData: ManagedBuffer,
level: StorageLevel,
classTag: ClassTag[_]): Unit
uploadBlockSync is a mere wrapper around uploadBlock that waits until the upload
finishes.
NettyBlockTransferService - Netty-Based
BlockTransferService
Caution
FIXME
BlockManagerMaster
BlockManagerMaster uses the BlockManagerMaster RPC endpoint (registered on the driver, with endpoint references on
executors) to allow executors to send block status updates to it and hence keep track of
block statuses.
An instance of BlockManagerMaster is created in SparkEnv (for the driver and
executors), and immediately used to create their BlockManagers.
Note
Refer to Logging.
If all goes fine, you should see the following INFO message in the logs:
INFO BlockManagerMaster: Removed executor [execId]
removeRdd removes all the blocks of rddId RDD, possibly in a blocking fashion.
removeShuffle removes all the blocks of the shuffleId shuffle, possibly in a blocking
fashion.
It posts a RemoveShuffle(shuffleId) message to BlockManagerMaster RPC endpoint on a
separate thread.
If there is an issue, you should see the following WARN message in the logs and the entire
exception:
WARN Failed to remove shuffle [shuffleId] - [exception]
removeBroadcast removes all the blocks of the broadcastId broadcast, possibly in a blocking
fashion.
It posts a RemoveBroadcast(broadcastId, removeFromMaster) message to
BlockManagerMaster RPC endpoint on a separate thread.
If there is an issue, you should see the following WARN message in the logs and the entire
exception:
WARN Failed to remove broadcast [broadcastId] with removeFromMaster = [removeFromMaste
r] - [exception]
If all goes fine, you should see the following INFO message in the logs:
INFO BlockManagerMaster: BlockManagerMaster stopped
When registerBlockManager runs, you should see the following INFO message in the logs:
INFO BlockManagerMaster: Trying to register BlockManager
Note
endpoint and waits for a response which becomes the return value.
BlockManagerMaster RPC endpoint and waits for a response which becomes the return
value.
getPeers
getExecutorEndpointRef
getExecutorEndpointRef(executorId: String): Option[RpcEndpointRef]
BlockManagerMaster RPC endpoint and waits for a response which becomes the return
value.
getMemoryStatus
getMemoryStatus: Map[BlockManagerId, (Long, Long)]
getStorageStatus
getStorageStatus: Array[StorageStatus]
endpoint and waits for a response which becomes the return value.
getBlockStatus
getBlockStatus(
blockId: BlockId,
askSlaves: Boolean = true): Map[BlockManagerId, BlockStatus]
BlockManagerMaster RPC endpoint and waits for a response (of type Map[BlockManagerId,
Future[Option[BlockStatus]]] ).
It then builds a sequence of future results that are BlockStatus statuses and waits for a
result for spark.rpc.askTimeout, spark.network.timeout or 120 secs.
No result leads to a SparkException with the following message:
BlockManager returned null for BlockStatus query: [blockId]
getMatchingBlockIds
getMatchingBlockIds(
filter: BlockId => Boolean,
askSlaves: Boolean): Seq[BlockId]
BlockManagerMaster RPC endpoint and waits for a response which becomes the result for
spark.rpc.askTimeout, spark.network.timeout or 120 secs.
hasCachedBlocks
hasCachedBlocks(executorId: String): Boolean
RPC endpoint and waits for a response which becomes the result.
Tip
Enable INFO logging level for org.apache.spark.storage.BlockManagerMasterEndpoint logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.storage.BlockManagerMasterEndpoint=INFO
Refer to Logging.
Internal Registries
blockLocations
blockLocations is a collection of BlockId and its locations (as BlockManagerId ).
Note
RemoveExecutor
RemoveExecutor(execId: String)
When RemoveExecutor is received, executor execId is removed and the response true
sent back.
GetLocations
GetLocations(blockId: BlockId)
When GetLocations comes in, the internal getLocations method is executed and the result
becomes the response sent back.
Note
RegisterBlockManager
RegisterBlockManager(
blockManagerId: BlockManagerId,
maxMemSize: Long,
sender: RpcEndpointRef)
register
register(id: BlockManagerId, maxMemSize: Long, slaveEndpoint: RpcEndpointRef): Unit
register records the current time and registers BlockManager by id if it has not been registered yet.
If another BlockManager has earlier been registered for the executor, you should see the
following ERROR message in the logs:
ERROR Got two different block manager registrations on same executor - will replace ol
d one [oldId] with new one [id]
Caution
Caution
When executed, removeExecutor prints the following INFO message to the logs:
INFO BlockManagerMasterEndpoint: Trying to remove executor [execId] from BlockManagerM
aster.
Note
It then goes over all the blocks for the BlockManager , and removes the executor for each
block from blockLocations registry.
SparkListenerBlockManagerRemoved(System.currentTimeMillis(), blockManagerId) is
posted to listenerBus.
You should then see the following INFO message in the logs:
INFO BlockManagerMasterEndpoint: Removing block manager [blockManagerId]
BlockInfoManager
BlockInfoManager manages memory blocks (aka memory pages). It controls concurrent
access to memory blocks by read and write locks (for existing and new ones).
Locks are the mechanism to control concurrent access to data and prevent
destructive interaction between operations that use the same resource.
Note
Note
org.apache.spark.storage package.
Refer to Logging.
lockForReading locks the blockId memory block for reading when the block was registered
earlier and is not locked for writing. When the lock is acquired, the readerCount of the block
metadata is incremented and the block is recorded in the internal readLocksByTask registry.
You should see the following TRACE message in the logs:
TRACE BlockInfoManager: Task [taskAttemptId] acquired read lock for [blockId]
For blocks with writerTask other than NO_WRITER , when blocking is enabled,
lockForReading waits (until another thread invokes the Object.notify method or the
Object.notifyAll methods for this object).
With blocking enabled, it will repeat the waiting-for-read-lock sequence until either None
or the lock is obtained.
When blocking is disabled and the lock could not be obtained, None is returned
immediately.
Note
lockForReading is a synchronized method, i.e. no two threads can execute it (or the other synchronized methods of BlockInfoManager ) at the same time.
When executed, lockForWriting prints out the following TRACE message to the logs:
TRACE Task [currentTaskAttemptId] trying to acquire write lock for [blockId]
It looks up blockId in the internal infos registry. When no BlockInfo could be found, None
is returned. Otherwise, BlockInfo is checked for writerTask to be BlockInfo.NO_WRITER with
no readers (i.e. readerCount is 0 ) and only then the lock is returned.
When the write lock can be returned, BlockInfo.writerTask is set to currentTaskAttemptId
and a new binding is added to the internal writeLocksByTask registry. You should see the
following TRACE message in the logs:
TRACE Task [currentTaskAttemptId] acquired write lock for [blockId]
If, for some reason, blockId has a writer (i.e. info.writerTask is not BlockInfo.NO_WRITER )
or the number of readers is positive (i.e. BlockInfo.readerCount is greater than 0 ), the
method will wait (based on the input blocking flag) and attempt the write lock acquisition
process until it finishes with a write lock.
Note
(deadlock possible) The method is synchronized and can block, i.e. wait that
causes the current thread to wait until another thread invokes Object.notify or
Object.notifyAll methods for this object.
lockForWriting return None for no blockId in the internal infos registry or when
blocking flag is disabled and the write lock could not be acquired.
lockNewBlockForWriting obtains a write lock for the blockId block, but only when it is this call that registers the new block.
When executed, lockNewBlockForWriting prints out the following TRACE message to the
logs:
TRACE Task [currentTaskAttemptId] trying to put [blockId]
If some other thread has already created the block, it finishes returning false . Otherwise,
when the block does not exist, newBlockInfo is recorded in the internal infos registry and
the block is locked for this client for writing. It then returns true .
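The register-or-back-off behaviour can be sketched as follows (infos stands in for the internal registry; this is not the actual Spark source and it ignores the read-lock fallback on the existing block):
import scala.collection.mutable

final class NewBlockLockSketch {
  private val infos = mutable.Map.empty[String, Long] // blockId -> writer task attempt id

  def lockNewBlockForWriting(blockId: String, taskAttemptId: Long): Boolean = synchronized {
    if (infos.contains(blockId)) {
      false                          // some other task already created the block
    } else {
      infos(blockId) = taskAttemptId // record the new block and hold the write lock
      true
    }
  }
}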
lockNewBlockForWriting executes itself in synchronized block so once the
Note
FIXME
FIXME
FIXME
assertBlockIsLockedForWriting
Caution
FIXME
Internal Registries
infos
infos is used to track BlockInfo per block (identified by BlockId).
readLocksByTask
readLocksByTask is used to track tasks (by TaskAttemptId ) and the blocks they locked for reading.
writeLocksByTask
writeLocksByTask is used to track tasks (by TaskAttemptId ) and the blocks they locked for writing.
readerCount is incremented when a read lock is acquired and decreases when the following happens:
The memory block is unlocked
All locks for the memory block obtained by a task are released.
The memory block is removed
Clearing the current state of BlockInfoManager .
test code.
writerTask is the task attempt id of the task which currently holds the write lock for this block.
The writer task is assigned in the following scenarios:
A write lock is requested for a memory block (with no writer and readers)
A memory block is unlocked
Dynamic Allocation (of Executors)
See the excellent slide deck Dynamic Allocation in Spark from Databricks.
Utils.isDynamicAllocationEnabled method
isDynamicAllocationEnabled(conf: SparkConf): Boolean
isDynamicAllocationEnabled returns true when all of the following hold:
1. spark.executor.instances is 0
2. spark.dynamicAllocation.enabled is enabled
3. Spark on cluster is used (spark.master is non- local )
Otherwise, it returns false .
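A paraphrase of the three conditions against SparkConf (the real method lives in org.apache.spark.util.Utils; the startsWith("local") test is a simplification of the "non-local master" check):
import org.apache.spark.SparkConf

def isDynamicAllocationEnabledSketch(conf: SparkConf): Boolean = {
  val master = conf.get("spark.master", "")
  conf.getInt("spark.executor.instances", 0) == 0 &&
    conf.getBoolean("spark.dynamicAllocation.enabled", false) &&
    !master.startsWith("local")
}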
Note
Note
Tip
log4j.logger.org.apache.spark.util.Utils=WARN
Refer to Logging.
validateSettings is an internal method to ensure that the settings for dynamic allocation
are correct.
It validates the following and throws a SparkException if set incorrectly.
1. spark.dynamicAllocation.minExecutors must be positive.
Settings
spark.dynamicAllocation.enabled
spark.dynamicAllocation.enabled (default: false ) controls whether dynamic allocation is enabled or not.
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.minExecutors (default: 0 ) sets the minimum number of executors for dynamic allocation.
spark.dynamicAllocation.maxExecutors
spark.dynamicAllocation.maxExecutors (default: Integer.MAX_VALUE ) sets the maximum number of executors for dynamic allocation.
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.initialExecutors sets the initial number of executors for dynamic
allocation.
spark.dynamicAllocation.schedulerBacklogTimeout
spark.dynamicAllocation.schedulerBacklogTimeout (default: 1s ) sets FIXME
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout (default:
spark.dynamicAllocation.schedulerBacklogTimeout) sets FIXME
It must be greater than 0 .
spark.dynamicAllocation.executorIdleTimeout
spark.dynamicAllocation.executorIdleTimeout (default: 60s ) sets FIXME
spark.dynamicAllocation.cachedExecutorIdleTimeout
spark.dynamicAllocation.cachedExecutorIdleTimeout (default: Integer.MAX_VALUE ) sets FIXME
spark.dynamicAllocation.testing
spark.dynamicAllocation.testing is FIXME
Future
SPARK-4922
SPARK-4751
SPARK-7955
ExecutorAllocationManager - Allocation Manager for Spark Core
ExecutorAllocationManager is responsible for dynamically allocating and removing executors based on the workload.
addExecutors
Caution
FIXME
removeExecutor
Caution
FIXME
maxNumExecutorsNeeded method
Caution
FIXME
events and makes decisions when to add and remove executors. It then immediately starts
the spark-dynamic-executor-allocation executor that is responsible for the scheduling
every 100 milliseconds.
Note
100 milliseconds for the period between successive scheduling is fixed, i.e. not
configurable.
Note
enabled).
It then goes over removeTimes to remove expired executors, i.e. executors for which the
expiration time has elapsed.
updateAndSyncNumExecutorsTarget
updateAndSyncNumExecutorsTarget(now: Long): Int
updateAndSyncNumExecutorsTarget FIXME
initializing flag
initializing flag starts enabled (i.e. true ).
initialNumExecutors attribute
Caution
FIXME
numExecutorsTarget attribute
Caution
FIXME
numExecutorsToAdd attribute
numExecutorsToAdd attribute controlsFIXME
Note
Internal Registries
executorsPendingToRemove registry
Caution
FIXME
removeTimes registry
removeTimes keeps track of executors and theirFIXME
executorIds
Caution
FIXME
It is started
It is stopped
ExecutorAllocationClient
ExecutorAllocationClient is a contract for clients to communicate with a cluster manager to
request or kill executors.
Note
It is used when SparkContext calculates the executors in use and also when
Spark Streaming manages executors.
requestTotalExecutors requests that a cluster manager set the exact number of executors
desired. It returns whether the request has been acknowledged
by the cluster manager ( true ) or not ( false ).
It is used when:
1.
only).
Note
2.
YarnSchedulerBackend stops.
requestExecutors requests additional executors from a cluster manager and returns
whether the request has been acknowledged by the cluster manager ( true ) or not
( false ).
Note
killExecutor requests that a cluster manager to kill a single executor that is no longer in
use and returns whether the request has been acknowledged by the cluster manager
( true ) or not ( false ).
Note
Note
1.
2.
killExecutors requests that a cluster manager to kill one or many executors that are no
longer in use and returns whether the request has been acknowledged by the cluster
manager ( true ) or not ( false ).
Note
ExecutorAllocationListener
Caution
FIXME
ExecutorAllocationManagerSource - Metric Source for Dynamic Allocation
ExecutorAllocationManagerSource is a metric source for dynamic allocation with name
ExecutorAllocationManager and the following gauges:
executors/numberExecutorsToAdd which exposes numExecutorsToAdd.
executors/numberExecutorsPendingToRemove which corresponds to the number of
elements in executorsPendingToRemove.
executors/numberAllExecutors which corresponds to the number of elements in
executorIds.
executors/numberTargetExecutors which is numExecutorsTarget.
executors/numberMaxNeededExecutors which simply calls maxNumExecutorsNeeded.
Note
Spark uses Metrics Java library to expose internal state of its services to
measure.
Shuffle Manager
Spark comes with a pluggable mechanism for shuffle systems.
Shuffle Manager (aka Shuffle Service) is a Spark service that tracks shuffle dependencies
for ShuffleMapStage. The driver and executors all have their own Shuffle Service.
The setting spark.shuffle.manager sets up the default shuffle manager.
The driver registers shuffles with a shuffle manager, and executors (or tasks running locally
in the driver) can ask to read and write data.
It is network-addressable, i.e. it is available on a host and port.
There can be many shuffle services running simultaneously and a driver registers with all of
them when CoarseGrainedSchedulerBackend is used.
The service is available under SparkEnv.get.shuffleManager .
When ShuffledRDD is computed it reads partitions from it.
The name appears here, twice in the build's output and others.
Review the code in network/shuffle module.
When is data eligible for shuffling?
Get the gist of "The shuffle files are not currently cleaned up when using Spark on
Mesos with the external shuffle service"
ShuffleManager Contract
Note
Available Implementations
Spark comes with two implementations of ShuffleManager contract:
org.apache.spark.shuffle.sort.SortShuffleManager (short name: sort or tungsten-sort )
org.apache.spark.shuffle.hash.HashShuffleManager (short name: hash )
Caution
SortShuffleManager
SortShuffleManager is a shuffle manager with the short name being sort .
Settings
spark.shuffle.manager
spark.shuffle.manager (default: sort ) sets the default shuffle manager by a short name or
the fully-qualified class name of a custom ShuffleManager implementation.
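For example, the setting can be given explicitly when building a SparkConf (sort is already the default, so this only illustrates where the setting goes):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-manager-demo")
  .set("spark.shuffle.manager", "sort") // or a fully-qualified ShuffleManager class name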
spark.shuffle.spill
spark.shuffle.spill (default: true ) - no longer used; when false a WARN message is printed out to the logs.
ExternalShuffleService
ExternalShuffleService is an external shuffle service that serves shuffle blocks from
outside an Executor process. It runs as a standalone application and manages shuffle output
files so they are available to executors at all times. As the shuffle output files are managed
externally to the executors it offers an uninterrupted access to the shuffle output files
regardless of executors being killed or down.
You start ExternalShuffleService using start-shuffle-service.sh shell script and enable
its use by the driver and executors using spark.shuffle.service.enabled.
Note
Tip
Enable INFO logging level for org.apache.spark.deploy.ExternalShuffleService logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.deploy.ExternalShuffleService=INFO
Refer to Logging.
$ ./sbin/start-shuffle-service.sh
starting org.apache.spark.deploy.ExternalShuffleService, logging
to ...logs/spark-jacekorg.apache.spark.deploy.ExternalShuffleService-1japila.local.out
$ tail -f ...logs/spark-jacekorg.apache.spark.deploy.ExternalShuffleService-1japila.local.out
Spark Command:
/Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java
-cp
/Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/asse
mbly/target/scala-2.11/jars/* -Xmx1g
org.apache.spark.deploy.ExternalShuffleService
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/07 08:02:02 INFO ExternalShuffleService: Started daemon
with process name: [email protected]
16/06/07 08:02:03 INFO ExternalShuffleService: Starting shuffle
service on port 7337 with useSasl = false
spark-class org.apache.spark.deploy.ExternalShuffleService
Tip
Enable DEBUG logging level for org.apache.spark.network.shuffle.ExternalShuffleBlockResolver logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockResolver=DEBUG
Refer to Logging.
You should see the following INFO message in the logs:
INFO ExternalShuffleBlockResolver: Registered executor [AppExecId] with [executorInfo]
You should also see the following messages when a SparkContext is closed:
INFO ExternalShuffleBlockResolver: Application [appId] removed, cleanupLocalDirs = [cl
eanupLocalDirs]
INFO ExternalShuffleBlockResolver: Cleaning up executor [AppExecId]'s [executor.localD
irs.length] local dirs
DEBUG ExternalShuffleBlockResolver: Successfully cleaned up directory: [localDir]
Caution
FIXME TransportContext?
When start is executed, you should see the following INFO message in the logs:
INFO ExternalShuffleService: Starting shuffle service on port [port] with useSasl = [u
seSasl]
FIXME SaslServerBootstrap?
The internal server reference (a TransportServer ) is created (which will attempt to bind to
port ).
Note
stop closes the internal server reference and clears it (i.e. sets it to null ).
ExternalShuffleBlockHandler
ExternalShuffleBlockHandler is a RpcHandler (i.e. a handler for sendRPC() messages sent
by TransportClients ).
When created, ExternalShuffleBlockHandler requires a OneForOneStreamManager and
TransportConf with a registeredExecutorFile to create a ExternalShuffleBlockResolver .
It handles two BlockTransferMessage messages: OpenBlocks and RegisterExecutor.
Tip
Enable TRACE logging level for org.apache.spark.network.shuffle.ExternalShuffleBlockHandler logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.network.shuffle.ExternalShuffleBlockHandler=TRACE
Refer to Logging.
handleMessage method
handleMessage(
BlockTransferMessage msgObj,
TransportClient client,
RpcResponseCallback callback)
OpenBlocks
RegisterExecutor
For any other BlockTransferMessage message it throws an UnsupportedOperationException :
Unexpected message: [msgObj]
OpenBlocks
OpenBlocks(String appId, String execId, String[] blockIds)
FIXME checkAuth ?
It then gets block data for each block id in blockIds (using ExternalShuffleBlockResolver).
Finally, it registers a stream and does callback.onSuccess with a serialized byte buffer (for
the streamId and the number of blocks in msg ).
Caution
FIXME callback.onSuccess ?
TRACE Registered streamId [streamId] with [length] buffers for client [clientId] from
host [remoteAddress]
RegisterExecutor
RegisterExecutor(String appId, String execId, ExecutorShuffleInfo executorInfo)
RegisterExecutor
ExternalShuffleBlockResolver
Caution
FIXME
getBlockData method
ManagedBuffer getBlockData(String appId, String execId, String blockId)
It throws an IllegalArgumentException for block ids with less than four parts:
Unexpected block id format: [blockId]
OneForOneStreamManager
Caution
FIXME
registerStream method
long registerStream(String appId, Iterator<ManagedBuffer> buffers)
Caution
FIXME
Settings
spark.shuffle.service.enabled
spark.shuffle.service.enabled flag (default: false ) controls whether the External Shuffle
Service is used or not. When enabled ( true ), the driver registers with the shuffle service.
spark.shuffle.service.enabled has to be enabled for dynamic allocation of executors.
spark.shuffle.service.port
spark.shuffle.service.port (default: 7337 )
ExternalClusterManager
ExternalClusterManager is a contract for pluggable cluster managers.
Note
Note
Note
ExternalClusterManager Contract
initialize
initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit
canCreate
canCreate(masterURL: String): Boolean
Note
It is used when finding the external cluster manager for a master URL (in
SparkContext ).
createTaskScheduler
createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler
createSchedulerBackend
createSchedulerBackend(sc: SparkContext,
masterURL: String,
scheduler: TaskScheduler): SchedulerBackend
Settings
spark.fileserver.port (default: 0 ) - the port of a file server
spark.fileserver.uri (Spark internal) - the URI of a file server
Broadcast Manager
Broadcast Manager is a Spark service to manage broadcast values in Spark jobs. It is
created for a Spark application as part of SparkContext's initialization and is a simple
wrapper around BroadcastFactory.
Broadcast Manager tracks the number of broadcast values (using the internal field
nextBroadcastId ).
The idea is to transfer values used in transformations from a driver to executors in the most
effective way so they are copied once and used many times by tasks (rather than being
copied every time a task is launched).
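A user-facing example of that idea: broadcast a small lookup table once and reuse it in tasks (assumes an existing SparkContext named sc):
val lookup = Map("a" -> 1, "b" -> 2)
val broadcastLookup = sc.broadcast(lookup)

val codes = sc.parallelize(Seq("a", "b", "a"))
val resolved = codes.map(code => broadcastLookup.value.getOrElse(code, -1))
resolved.collect() // Array(1, 2, 1)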
When BroadcastManager is initialized an instance of BroadcastFactory is created based on
spark.broadcast.factory setting.
BroadcastFactory
BroadcastFactory is a pluggable interface for broadcast implementations in Spark. It is
TorrentBroadcast
The Broadcast implementation used in Spark by default is
org.apache.spark.broadcast.TorrentBroadcast (see spark.broadcast.factory). It uses a
BitTorrent-like protocol to distribute broadcast blocks among executors.
Compression
When spark.broadcast.compress is true (default), compression is used.
There are the following compression codec implementations available:
lz4 or org.apache.spark.io.LZ4CompressionCodec
lzf or org.apache.spark.io.LZFCompressionCodec - a fallback when snappy is not
available.
snappy or org.apache.spark.io.SnappyCompressionCodec - the default implementation
Settings
spark.broadcast.factory (default:
org.apache.spark.broadcast.TorrentBroadcastFactory ) - the fully-qualified class name of the
BroadcastFactory implementation to use.
spark.broadcast.blockSize (default: 4m ) - the size of a block
Data Locality
Cache Manager
Cache Manager in Spark is responsible for passing an RDD's partition contents to the Block
Manager and making sure a node doesn't load two copies of an RDD at once.
It keeps reference to Block Manager.
Caution
FIXME
FIXME
sparkMaster is the name of Actor System for the master in Spark Standalone, i.e.
akka://sparkMaster is the Akka URL.
bytes
spark.akka.logLifecycleEvents (default: false )
spark.akka.logAkkaConfig (default: true )
spark.akka.heartbeat.pauses (default: 6000s )
spark.akka.heartbeat.interval (default: 1000s )
OutputCommitCoordinator
From the scaladoc (it's a private[spark] class so no way to find it outside the code):
Authority that decides whether tasks can commit output to HDFS. Uses a "first
committer wins" policy. OutputCommitCoordinator is instantiated in both the drivers and
executors. On executors, it is configured with a reference to the driver's
OutputCommitCoordinatorEndpoint, so requests to commit output will be forwarded to
the driver's OutputCommitCoordinator.
The most interesting piece is in
This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull
requests) for an extensive design discussion.
Caution
RPC Environment (RpcEnv)
RpcEndpointRefs can be looked up by name or uri (because different RpcEnvs may have
different naming schemes).
org.apache.spark.rpc package contains the machinery for RPC communication in Spark.
RpcEnvFactory
Spark comes with ( private[spark] trait ) RpcEnvFactory which is the factory contract to
create a RPC Environment.
An RpcEnvFactory implementation has a single method create(config: RpcEnvConfig):
RpcEnv that returns a RpcEnv for a given RpcEnvConfig.
You can choose an RPC implementation to use by spark.rpc (default: netty ). The setting
can be one of the two short names for the known RpcEnvFactories - netty or akka - or a
fully-qualified class name of your custom factory (including Netty-based and Akka-based
implementations).
$ ./bin/spark-shell --conf spark.rpc=netty
$ ./bin/spark-shell --conf spark.rpc=org.apache.spark.rpc.akka.AkkaRpcEnvFactory
RpcEndpoint
RpcEndpoint defines how to handle messages (what functions to execute given a
message). RpcEndpoints live inside RpcEnv after being registered by a name.
A RpcEndpoint can be registered to one and only one RpcEnv.
The lifecycle of a RpcEndpoint is onStart , receive and onStop in sequence.
receive can be called concurrently.
Tip
ThreadSafeRpcEndpoint
Caution
Note
RpcEndpointRef
A RpcEndpointRef is a reference for a RpcEndpoint in a RpcEnv.
It is a serializable entity so you can send it over a network or save it for later use (it can
however be deserialized using the owning RpcEnv only).
A RpcEndpointRef has an address (a Spark URL), and a name.
You can send asynchronous one-way messages to the corresponding RpcEndpoint using
send method.
You can send a semi-synchronous message, i.e. "subscribe" to be notified when a response
arrives, using ask method. You can also block the current calling thread for a response
using askWithRetry method.
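Schematically (the RPC API is private[spark], so this sketch assumes it is compiled inside Spark's own org.apache.spark package and that the signatures are those of Spark 1.6/2.0):
package org.apache.spark.rpcdemo

import scala.concurrent.Future
import org.apache.spark.rpc.RpcEndpointRef

case class Ping(payload: String)
case class Pong(payload: String)

object RpcRefSketch {
  def demo(endpointRef: RpcEndpointRef): Unit = {
    endpointRef.send(Ping("fire-and-forget"))                   // one-way, asynchronous
    val async: Future[Pong] = endpointRef.ask[Pong](Ping("hi")) // notified when a reply arrives
    val sync: Pong = endpointRef.askWithRetry[Pong](Ping("hi")) // blocks the calling thread
  }
}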
spark.rpc.numRetries (default: 3 ) - the number of times to retry connection attempts.
spark.rpc.retry.wait (default: 3s ) - the time to wait on each retry.
RpcAddress
RpcAddress is the logical address for an RPC Environment, with hostname and port.
RpcAddress is encoded as a Spark URL, i.e. spark://host:port .
RpcEndpointAddress
RpcEndpointAddress is the logical address for an endpoint registered to an RPC
Environment, with RpcAddress and name.
It is in the format of spark://[name]@[rpcAddress.host]:[rpcAddress.port].
When a remote endpoint is resolved, a local RPC environment connects to the remote one.
It is called endpoint lookup. To configure the time needed for the endpoint lookup you can
use the following settings.
It is a prioritized list of lookup timeout properties (the higher on the list, the more important):
spark.rpc.lookupTimeout
spark.network.timeout
Their value can be a number alone (seconds) or any number with time suffix, e.g. 50s ,
100ms , or 250us . See Settings.
Exceptions
When RpcEnv catches uncaught exceptions, it uses RpcCallContext.sendFailure to send
exceptions back to the sender, or logs them if there is no such sender or the exception is a
NotSerializableException .
If any error is thrown from one of RpcEndpoint methods except onError , onError will be
invoked with the cause. If onError throws an error, RpcEnv will ignore it.
RpcEnvConfig
RpcEnvConfig is a placeholder for an instance of SparkConf, the name of the RPC
Environment, host and port, a security manager, and clientMode.
RpcEnv.create
You can create a RPC Environment using the helper method RpcEnv.create .
It assumes that you have a RpcEnvFactory with an empty constructor so that it can be
created via Reflection that is available under spark.rpc setting.
Settings
spark.rpc
spark.rpc (default: netty since Spark 1.6.0-SNAPSHOT) - the RPC implementation to use.
spark.rpc.lookupTimeout
spark.rpc.lookupTimeout (default: 120s ) - the default timeout to use for RPC remote endpoint lookup.
spark.network.timeout
spark.network.timeout (default: 120s ) - the default network timeout to use for RPC remote
endpoint lookup.
It is used as a fallback value for spark.rpc.askTimeout.
Other
spark.rpc.numRetries (default: 3 ) - the number of attempts to send a message and
Others
The Worker class calls startRpcEnvAndEndpoint with the following configuration options:
host
port
webUiPort
cores
memory
masters
workDir
It starts sparkWorker[N] where N is the identifier of a worker.
Netty-based RpcEnv
Tip
Client Mode
Refer to Client Mode = is this an executor or the driver? for introduction about client mode.
This is only for Netty-based RpcEnv.
When created, a Netty-based RpcEnv starts the RPC server and registers the necessary
endpoints for non-client mode, i.e. when client mode is false .
Caution
It means that the required services for remote communication with NettyRpcEnv are only
started on the driver (not executors).
Thread Pools
shuffle-server-ID
EventLoopGroup uses a daemon thread pool called shuffle-server-ID , where ID is a unique, sequentially assigned integer.
dispatcher-event-loop-ID
NettyRpcEnvs Dispatcher uses the daemon fixed thread pool with
spark.rpc.netty.dispatcher.numThreads threads.
Thread names are formatted as dispatcher-event-loop-ID , where ID is a unique,
sequentially assigned integer.
It starts the message processing loop on all of the threads.
netty-rpc-env-timeout
NettyRpcEnv uses the daemon single-thread scheduled thread pool netty-rpc-env-timeout .
"netty-rpc-env-timeout" #87 daemon prio=5 os_prio=31 tid=0x00007f887775a000 nid=0xc503
waiting on condition [0x0000000123397000]
netty-rpc-connection-ID
NettyRpcEnv uses the daemon cached thread pool with up to spark.rpc.connect.threads
threads.
Thread names are formatted as netty-rpc-connection-ID , where ID is a unique,
sequentially assigned integer.
Settings
spark.rpc.netty.dispatcher.numThreads (default: the number of processors available to the JVM)
spark.rpc.connect.threads (default: 64 ) - used in cluster mode to communicate with a remote RPC endpoint
spark.port.maxRetries (default: 16 ) controls the maximum number of binding attempts/retries to a port before giving up.
Endpoints
endpoint-verifier ( RpcEndpointVerifier ) - a RpcEndpoint for remote RpcEnvs to
query whether an RpcEndpoint exists or not. It uses Dispatcher that keeps track of
registered endpoints and responds true / false to CheckExistence message.
endpoint-verifier is used to check out whether a given endpoint exists or not before the endpoint's reference is handed out to clients.
Message Dispatcher
A message dispatcher is responsible for routing RPC messages to the appropriate
endpoint(s).
It uses the daemon fixed thread pool dispatcher-event-loop with
spark.rpc.netty.dispatcher.numThreads threads for dispatching messages.
ContextCleaner
ContextCleaner is a Spark service that does cleanup of shuffles, RDDs and broadcasts.
Caution
It uses a daemon Spark Context Cleaner thread that cleans RDD, shuffle, and broadcast
states (using keepCleaning method).
Caution
registerRDDForCleanup
Caution
FIXME
registerAccumulatorForCleanup
Caution
FIXME
Settings
spark.cleaner.referenceTracking (default: true ) controls whether to enable or not cleaning of RDD, shuffle, and broadcast state.
spark.cleaner.referenceTracking.blocking (default: true ) controls whether the cleaning thread will block on cleanup tasks (other than shuffle, which is controlled by the spark.cleaner.referenceTracking.blocking.shuffle parameter).
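As a sketch, these can be tweaked in a SparkConf (the values below are just for illustration):
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.referenceTracking", "true")           // keep the cleaner enabled (the default)
  .set("spark.cleaner.referenceTracking.blocking", "false") // do not block on non-shuffle cleanup tasks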
MapOutputTracker
A MapOutputTracker is a Spark service that tracks the locations of the (shuffle) map outputs of
a stage. It uses an internal registry that maps a shuffle id to an array of MapStatus (one per
partition).
There are two versions of MapOutputTracker :
MapOutputTrackerMaster for a driver
MapOutputTrackerWorker for executors
MapOutputTracker is available under SparkEnv.get.mapOutputTracker . It is also available as
MapOutputTracker in the driver's RPC Environment.
It works with ShuffledRDD when it asks for preferred locations for a shuffle using
tracker.getPreferredLocationsForShuffle .
FIXME DAGScheduler.mapOutputTracker
unregisterShuffle
Caution
FIXME
MapStatus
Caution
Epoch Number
Caution
FIXME
MapOutputTrackerMaster
A MapOutputTrackerMaster is the MapOutputTracker for a driver.
A MapOutputTrackerMaster is the source of truth for the collection of MapStatus objects
(map output locations) per shuffle id (as recorded from ShuffleMapTasks).
MapOutputTrackerMaster uses Spark's org.apache.spark.util.TimeStampedHashMap for
mapStatuses .
Note
There is currently a hardcoded limit on the number of map and reduce tasks (1000 each)
above which Spark does not assign preferred locations (aka locality preferences) based on
map output sizes.
MapOutputTrackerMaster.registerShuffle
Caution
FIXME
MapOutputTrackerMaster.getStatistics
Caution
FIXME
MapOutputTrackerMaster.unregisterMapOutput
Caution
FIXME
MapOutputTrackerMaster.registerMapOutputs
Caution
FIXME
MapOutputTrackerMaster.incrementEpoch
Caution
FIXME
You should see the following DEBUG message in the logs for entries being removed:
DEBUG Removing key [entry.getKey]
MapOutputTrackerMaster.getEpoch
Caution
FIXME
Settings
spark.shuffle.reduceLocality.enabled (default: true) - whether to compute locality preferences for reduce tasks.
MapOutputTrackerWorker
A MapOutputTrackerWorker is the MapOutputTracker for executors. The internal
mapStatuses map serves as a cache and any miss triggers a fetch from the driver's
MapOutputTrackerMaster.
Note
Master URLs
Spark supports the following master URLs (see private object SparkMasterRegex):
local, local[N] and local[*] for Spark local
local[N, maxRetries] for Spark local-with-retries
local-cluster[N, cores, memory] for simulating a Spark cluster of [N, cores, memory]
locally
spark://host:port,host1:port1, for connecting to Spark Standalone cluster(s)
mesos:// or zk:// for Spark on Mesos cluster
yarn-cluster (deprecated: yarn-standalone) for Spark on YARN (cluster mode)
yarn-client for Spark on YARN cluster (client mode)
simr:// for Spark in MapReduce (SIMR) cluster
You use a master URL with spark-submit as the value of --master command-line option or
when creating SparkContext using setMaster method.
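For instance, a minimal sketch of both approaches (the application name, class, jar and the four-thread master URL are arbitrary):
import org.apache.spark.{SparkConf, SparkContext}

// programmatically, via setMaster
val conf = new SparkConf().setAppName("master-url-demo").setMaster("local[4]")
val sc = new SparkContext(conf)

// or equivalently on the command line:
//   ./bin/spark-submit --master local[4] --class demo.MasterUrlDemo demo.jar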
scala> sc.isLocal
res0: Boolean = true
Spark shell defaults to local mode with local[*] as the master URL.
scala> sc.master
res0: String = local[*]
Tasks are not re-executed on failure in local mode (unless local-with-retries master URL is
used).
The task scheduler in local mode works with LocalBackend task scheduler backend.
Master URL
You can run Spark in local mode using local , local[n] or the most general local[*] for
the master URL.
The URL says how many threads can be used in total:
local uses 1 thread only.
local[n] uses n threads.
local[*] uses as many threads as the number of processors available to the Java virtual machine.
FIXME What happens when there are fewer cores than n in the master URL?
It is a question from Twitter.
LocalBackend
LocalBackend is a scheduler backend and a executor backend for Spark local mode.
It acts as a "cluster manager" for local mode to offer resources on the single worker it
manages, i.e. it calls TaskSchedulerImpl.resourceOffers(offers) with offers being a single-element collection with WorkerOffer("driver", "localhost", freeCores) .
Caution
FIXME Review freeCores . It appears you could have many jobs running
simultaneously.
LocalEndpoint
LocalEndpoint is the communication channel between Task Scheduler and LocalBackend.
It is a (thread-safe) RPC Endpoint that hosts an executor (with id driver and hostname
localhost ) for Spark local mode.
When a LocalEndpoint starts up (as part of Spark local's initialization) it prints out the
following INFO messages to the logs:
INFO Executor: Starting executor ID driver on host localhost
INFO Executor: Using REPL class URI: http://192.168.1.4:56131
FIXME
RPC Messages
LocalEndpoint accepts the following RPC message types:
ReviveOffers (receive-only, non-blocking) - read Task Submission a.k.a. reviveOffers.
StatusUpdate (receive-only, non-blocking) that passes a task's status to the task scheduler
(using statusUpdate ) and, if the task's status is finished, revives offers (see
ReviveOffers ).
KillTask (receive-only, non-blocking) that kills the task that is currently running on the
executor.
StopExecutor (receive-reply, blocking) that stops the executor.
Settings
spark.default.parallelism (default: the number of threads as specified in master URL)
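For example, in spark-shell started in local mode on an 8-core machine you would see (the number simply reflects that machine):
scala> sc.defaultParallelism
res0: Int = 8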
Spark Clustered
Spark can be run in distributed mode on a cluster. The following (open source) cluster
managers (aka task schedulers aka resource managers) are currently supported:
Spark's own built-in Standalone cluster manager
Hadoop YARN
Apache Mesos
Here is a very brief list of pros and cons of using one cluster manager versus the other
options supported by Spark:
1. Spark Standalone is included in the official distribution of Apache Spark.
2. Hadoop YARN has a very good support for HDFS with data locality.
3. Apache Mesos makes resource offers that a framework can accept or reject. It is up to
Spark (as a Mesos framework) to decide which resources to accept. It is a push-based
resource management model.
4. Hadoop YARN responds to a YARN framework's resource requests. Spark (as a YARN
framework) requests CPU and memory from YARN. It is a pull-based resource
management model.
5. Hadoop YARN supports Kerberos for a secured HDFS.
Running Spark on a cluster requires workload and resource management on distributed
systems.
Spark driver requests resources from a cluster manager. Currently only CPU and memory
are requested resources. It is a cluster manager's responsibility to spawn Spark executors in
the cluster (on its workers).
FIXME
Spark execution in cluster - Diagram of the communication between
driver, cluster manager, workers with executors and tasks. See Cluster
Mode Overview.
Caution
Show Sparks driver with the main code in Scala in the box
Nodes with executors with tasks
Hosts drivers
Manages a cluster
The workers are in charge of communicating to the cluster manager the availability of their
resources.
Communication with a driver is through an RPC interface (at the moment Akka), except for
Mesos in fine-grained mode.
Executors remain alive after jobs are finished for future ones. This allows for better data
utilization as intermediate data is cached in memory.
Spark reuses resources in a cluster for:
efficient data sharing
fine-grained partitioning
low-latency scheduling
Reusing also means that the resources can be held onto for a long time.
Spark reuses long-running executors for speed (contrary to Hadoop MapReduce using
short-lived containers for each task).
Note
"Theres not a good reason to run more than one worker per machine." by Sean
Owen in What is the relationship between workers, worker instances, and
executors?
Caution
One executor per node may not always be ideal, esp. when your nodes have
lots of RAM. On the other hand, using fewer executors has benefits like
more efficient broadcasts.
Review core/src/main/scala/org/apache/spark/deploy/master/Master.scala
Others
A Spark application can be split into the part written in Scala, Java, or Python and the
cluster itself in which the application is going to run.
A Spark application runs on a cluster with the help of a cluster manager.
A Spark application consists of a single driver process and a set of executor processes
scattered across nodes on the cluster.
Both the driver and the executors usually run as long as the application. The concept of
dynamic resource allocation has changed it.
Caution
FIXME Figure
A node is a machine, and there's not a good reason to run more than one worker per
machine. So two worker nodes typically means two machines, each a Spark worker.
Workers hold many executors for many applications. One application has executors on
many workers.
Spark on YARN
You can submit Spark applications to a Hadoop YARN cluster using yarn master URL.
There are two deploy modes for YARN: client (default) or cluster. They differ in where the
Spark driver runs: in client mode it runs on a node outside the YARN cluster whereas in
cluster mode it runs inside the ApplicationMaster's container in a YARN cluster.
spark-submit --master yarn --deploy-mode cluster mySparkApp.jar
Note
Since Spark 2.0.0, yarn master URL is the only proper master URL and you
can use --deploy-mode to choose between client (default) or cluster
modes.
In order to deploy applications to YARN clusters, you need to use Spark with
YARN support.
Spark on YARN supports multiple application attempts and supports data locality for data in
HDFS. You can also take advantage of Hadoop's security and run Spark in a secure Hadoop
environment using Kerberos authentication (aka Kerberized clusters).
There are a few settings that are specific to YARN (see Settings). Among them, you may
particularly like the support for YARN resource queues (to divide cluster resources and
allocate shares to different teams and users based on advanced policies).
Tip
You can start spark-submit with --verbose command-line option to have some
settings displayed, including YARN-specific. See spark-submit and YARN options.
The memory in the YARN resource requests is --executor-memory + what's set for
spark.yarn.executor.memoryOverhead , which defaults to 10% of --executor-memory .
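As a back-of-the-envelope sketch (assuming the 10% default and the 384 MB minimum overhead these Spark versions use):
// --executor-memory 4g
val executorMemoryMb = 4096
// spark.yarn.executor.memoryOverhead defaults to max(10% of executor memory, 384)
val overheadMb = math.max((executorMemoryMb * 0.10).toInt, 384)
// the size of each executor container requested from YARN: 4096 + 409 = 4505 MB
val containerMb = executorMemoryMb + overheadMb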
If YARN has enough resources it will deploy the executors distributed across the cluster,
then each of them will try to process the data locally ( NODE_LOCAL in Spark Web UI), with as
many splits in parallel as you defined in spark.executor.cores.
spark-submit supports the following YARN-specific command-line options:
--archives
--executor-cores
--keytab
--num-executors
--principal
--queue
Tip
Master URL
Since Spark 2.0.0, the only proper master URL is yarn .
./bin/spark-submit --master yarn ...
Before Spark 2.0.0, you could have used yarn-client or yarn-cluster , but they are now
deprecated. When you use the deprecated master URLs, you should see the following
warning in the logs:
Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
Keytab
Caution
FIXME
Environment Variables
SPARK_DIST_CLASSPATH
SPARK_DIST_CLASSPATH is a distribution-defined CLASSPATH to add to processes.
Settings
Caution
FIXME
YarnShuffleService - ExternalShuffleService on YARN
YarnShuffleService is an external shuffle service for Spark on YARN. It is a YARN
NodeManager auxiliary service.
Note
There is the ExternalShuffleService for Spark and despite their names they don't share code.
Caution
Tip
YARN saves logs in /usr/local/Cellar/hadoop/2.7.2/libexec/logs directory on
Mac OS X with brew, e.g. /usr/local/Cellar/hadoop/2.7.2/libexec/logs/yarn-jacek-nodemanager-japila.local.log .
Advantages
The advantages of using the YARN Shuffle Service:
With dynamic allocation enabled executors can be discarded and a Spark application
could still get at the shuffle data the executors wrote out.
It allows individual executors to go into GC pause (or even crash) and still allow other
Executors to read shuffle data and make progress.
getRecoveryPath
Caution
FIXME
serviceStop
void serviceStop()
Caution
When an exception occurs, you should see the following ERROR message in the logs:
ERROR org.apache.spark.network.yarn.YarnShuffleService: Exception when stopping service
stopContainer
void stopContainer(ContainerTerminationContext context)
Caution
When called, stopContainer simply prints out the following INFO message in the logs and
exits.
initializeContainer
void initializeContainer(ContainerInitializationContext context)
whenFIXME
Caution
When called, initializeContainer simply prints out the following INFO message in the logs
and exits.
INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container [containerId]
stopApplication
void stopApplication(ApplicationTerminationContext context)
whenFIXME
Caution
When an exception occurs, you should see the following ERROR message in the logs:
ERROR org.apache.spark.network.yarn.YarnShuffleService: Exception when stopping application [appId]
initializeApplication
void initializeApplication(ApplicationInitializationContext context)
whenFIXME
Caution
authentication is enabled.
When called, initializeApplication obtains YARN's ApplicationId for the application
(using the input context ) and calls context.getApplicationDataForService for
shuffleSecret .
serviceInit
Caution
FIXME
When called, serviceInit creates a TransportConf for the shuffle module that is used to
create ExternalShuffleBlockHandler (as blockHandler ).
It checks spark.authenticate key in the configuration (defaults to false ) and if only
authentication is enabled, it sets up a SaslServerBootstrap with a ShuffleSecretManager
and adds it to a collection of TransportServerBootstraps .
It creates a TransportServer as shuffleServer to listen to spark.shuffle.service.port
(default: 7337 ). It reads spark.shuffle.service.port key in the configuration.
You should see the following INFO message in the logs:
INFO org.apache.spark.network.yarn.YarnShuffleService: Started YARN shuffle service for Spark on port [port]. Authentication is [authEnabled]. Registered executor file is [registeredExecutorFile]
Installation
YARN Shuffle Service Plugin
Add the YARN Shuffle Service plugin from the common/network-yarn module to YARN
NodeManagers CLASSPATH.
Tip
cp common/network-yarn/target/scala-2.11/spark-2.0.0-SNAPSHOT-yarn-shuffle.jar \
/usr/local/Cellar/hadoop/2.7.2/libexec/share/hadoop/yarn/lib/
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<!-- optional -->
<property>
<name>spark.shuffle.service.port</name>
<value>10000</value>
</property>
<property>
<name>spark.authenticate</name>
<value>true</value>
</property>
</configuration>
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:126)
at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:71)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
... 2 more
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:207)
at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:123)
... 4 more
ExecutorRunnable
ExecutorRunnable starts a YARN container with CoarseGrainedExecutorBackend application.
Note
Refer to Logging.
prepareEnvironment
Caution
FIXME
It creates an NMClient (using YARN's Client API), inits it with yarnConf and starts it.
It ultimately calls startContainer to launch the container.
When startContainer is executed, you should see the following INFO message in the logs:
INFO ExecutorRunnable: Setting up ContainerLaunchContext
It then creates a ContainerLaunchContext (which represents all of the information for the
NodeManager to launch a container) with the local resources being the input
localResources and environment being the input env . It also sets security tokens.
Ultimately, it sends a request to the NodeManager to start the container (as specified when
the ExecutorRunnable was created) with the ContainerLaunchContext context.
If any exception happens, a SparkException is thrown.
Exception while starting container [containerId] on host [hostname]
Note
prepareCommand(
masterAddress: String,
slaveId: String,
hostname: String,
executorMemory: Int,
executorCores: Int,
appId: String): List[String]
Caution
FIXME Client.getClusterPath ?
Caution
FIXME Client.getUserClasspath ?
Internal Registries
yarnConf
yarnConf is YARN's YarnConfiguration that is created when ExecutorRunnable is created.
Client
org.apache.spark.deploy.yarn.Client can be used as a standalone application to submit
Spark applications to a YARN cluster and launch the ApplicationMaster.
Refer to Logging.
createContainerLaunchContext
createContainerLaunchContext(newAppResponse: GetNewApplicationResponse): ContainerLaunchContext
createContainerLaunchContext creates a ContainerLaunchContext ... FIXME
Note
When called, you should see the following INFO message in the logs:
INFO Setting up container launch context for our AM
Caution
Caution
FIXME tmpDir ?
FIXME SPARK_USE_CONC_INCR_GC ?
FIXME
-Dspark.yarn.app.container.log.dir= FIXME
Caution
FIXME
prepareLocalResources method
prepareLocalResources(
destDir: Path,
pySparkArchives: Seq[String]): HashMap[String, LocalResource]
prepareLocalResources is... FIXME
Caution
FIXME
When prepareLocalResources is called, you should see the following INFO message in the
logs:
INFO Client: Preparing resources for our AM container
(only for a secure Hadoop cluster) It computes the list of Hadoops Paths to access and
requests delegation tokens for them. It includes the optional list of extra NameNode URLs
(from spark.yarn.access.namenodes) and the input destDir .
Caution
(only for a secure Hadoop cluster) It also obtains delegation tokens for Hive metastore, and
HBase (using the constructors sparkConf and hadoopConf with the internal credentials
attribute). After all the security delegation tokens are obtained, you should see the following
DEBUG message in the logs:
DEBUG Client: [token1]
DEBUG Client: [token2]
...
DEBUG Client: [tokenN]
Caution
Note
It creates the input destDir (on a HDFS-compatible file system) with 0700 permission
( rwx------ ), i.e. inaccessible to all but its owner and the superuser, so that only the owner
can read, write and execute. It uses Hadoop's Path.getFileSystem to access the Hadoop
FileSystem that owns destDir (using the constructor's hadoopConf Hadoop
Configuration).
Tip
FIXME if (loginFromKeytab)
If the location of the single archive containing Spark jars (spark.yarn.archive) is set, it is
distributed (as ARCHIVE) to spark_libs .
If neither spark.yarn.archive nor spark.yarn.jars is set, you should see the following WARN
message in the logs:
WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
It then finds the directory with jar files under SPARK_HOME (using
YarnCommandBuilderUtils.findJarsDir ).
Caution
FIXME YarnCommandBuilderUtils.findJarsDir
All the jars are then zipped to a temporary archive, e.g. spark_libs2944590295025097383.zip ,
that is distributed as ARCHIVE to spark_libs (only when they differ).
If a user jar ( --jar ) was specified on the command line, the jar is distributed as FILE to
app.jar .
It then distributes additional resources specified in SparkConf for the application, i.e. jars
(under spark.yarn.dist.jars), files (under spark.yarn.dist.files), and archives (under
spark.yarn.dist.archives).
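As a sketch (the paths are hypothetical), the same resources can be declared through configuration instead of command-line options:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.dist.jars", "/tmp/extra-1.jar,/tmp/extra-2.jar") // extra jars to distribute
  .set("spark.yarn.dist.files", "/tmp/app.conf")                    // extra files to distribute
  .set("spark.yarn.dist.archives", "/tmp/resources.zip")            // extra archives to distribute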
Note
Caution
It sets spark.yarn.secondary.jars for the jars that have localized path (non-local paths) or
their path (for local paths).
distCacheMgr.updateConfiguration(sparkConf) is executed.
Caution
FIXME distCacheMgr.updateConfiguration(sparkConf) ??
Caution
It distCacheMgr.addResource .
Caution
Note
Unless force is enabled (it is disabled by default), copyFileToRemote will only copy
srcPath when the source (of srcPath ) and target (of destDir ) file systems are different.
copyFileToRemote copies srcPath to destDir and sets 644 permissions, i.e. world-wide readable.
When the source and destination file systems are the same, copying is skipped and you should see the following INFO message in the logs:
INFO Client: Source and destination file systems are the same. Not copying [srcPath]
Ultimately, copyFileToRemote returns the destination path resolved following symlinks and
mount points.
It merely adds the following entries to the CLASSPATH key in the input env :
1. The optional extraClassPath (which is first changed to include paths on YARN cluster
machines).
Note
FIXME
6. (unless the optional spark.yarn.archive is defined) All the local jars in spark.yarn.jars
(which are first changed to be paths on YARN cluster machines).
7. All the entries from YARNs yarn.application.classpath or
YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH (if yarn.application.classpath
is not set)
Tip
You should see the result of executing populateClasspath when you enable DEBUG logging level.
getClusterPath replaces any occurrences of the value of spark.yarn.config.gatewayPath in a path with the value of spark.yarn.config.replacementPath.
addClasspathEntry is a private helper method to add the input path to the CLASSPATH key in the input env .
distribute method
distribute(
path: String,
resType: LocalResourceType = LocalResourceType.FILE,
destName: Option[String] = None,
targetDir: Option[String] = None,
appMasterOnly: Boolean = false): (Boolean, String)
Caution
FIXME
buildPath is a helper method to join all the path components using the directory separator,
i.e. org.apache.hadoop.fs.Path.SEPARATOR.
launcherBackend value
launcherBackend FIXME
SPARK_YARN_MODE flag
SPARK_YARN_MODE is a flag that says whether... FIXME.
Note
Caution
Any environment variable with the SPARK_ prefix is propagated to all (remote)
processes.
FIXME Where is SPARK_ prefix rule enforced?
SPARK_YARN_MODE is a system property (i.e. available using System.getProperty )
Note
It is enabled (i.e. true ) when SparkContext is created for Spark on YARN in client deploy
mode, when ClientFIXME and a Spark application is deployed to a YARN cluster.
Caution
accessed.
It is cleared later when Client is requested to stop.
FIXME
FIXME
Caution
FIXME
main
main method is invoked while a Spark application is being deployed to a YARN cluster.
Note
When you start the main method of the Client standalone application directly, say using
org.apache.spark.deploy.yarn.Client (i.e. not through spark-submit), you will see the following WARN message in the logs:
WARN Client: WARNING: This client is deprecated and will be removed in a future version of Spark. Use ./bin/spark-submit with "--master yarn"
stop
stop(): Unit
stop closes the internal LauncherBackend and stops the internal yarnClient. It also clears the SPARK_YARN_MODE flag.
run
run submits a Spark application to a YARN ResourceManager (RM).
Caution
Caution
monitorApplication
monitorApplication(
appId: ApplicationId,
returnOnRunning: Boolean = false,
logApplicationReport: Boolean = true): (YarnApplicationState, FinalApplicationStatus)
Unless logApplicationReport is disabled, it prints the following INFO message to the logs:
INFO Client: Application report for [appId] (state: [state])
If logApplicationReport and DEBUG log level are enabled, it prints report details every time
interval to the logs:
diagnostics: N/A
queue: default
user: jacek
For INFO log level it prints report details only when the application state changes.
When the application state changes, LauncherBackend is notified (using
LauncherBackend.setState ).
Note
For states FINISHED , FAILED or KILLED , cleanupStagingDir is called and the method
finishes by returning a pair of the current state and the final application status.
If returnOnRunning is enabled (it is disabled by default) and the application state turns
RUNNING , the method returns a pair of the current state RUNNING and the final application
status.
Note
The current state is recorded for future checks (in the loop).
cleanupStagingDir
cleanupStagingDir clears the staging directory of an application.
Note
It uses spark.yarn.stagingDir setting or falls back to a user's home directory for the staging
directory. If cleanup is enabled, it deletes the entire staging directory for the application.
You should see the following INFO message in the logs:
INFO Deleting staging directory [stagingDirPath]
reportLauncherState
reportLauncherState(state: SparkAppHandle.State): Unit
Caution
submitApplication
submitApplication submits a Spark application to a YARN cluster (i.e. to the YARN
ResourceManager). It waits until the application is running and eventually returns its unique
ApplicationId.
Note
submitApplication verifies whether the cluster has resources for the ApplicationManager
(using verifyClusterResources).
It then calls createContainerLaunchContext and createApplicationSubmissionContext.
It submits the application to YARN ResourceManager.
INFO Client: Submitting application [applicationId.getId] to ResourceManager
verifyClusterResources
INFO Client: Verifying our application has not requested more than the maximum memory
capability of the cluster (8192 MB per container)
INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
createApplicationSubmissionContext
ClientArguments
YarnRMClient
YarnRMClient is responsible for registering and unregistering a Spark application (in the form of an ApplicationMaster) with the YARN ResourceManager.
Refer to Logging.
getAmIpFilterParams
Caution
FIXME
register instantiates YARN's AMRMClient, initializes it (using the conf input parameter) and
starts it immediately. It saves the uiHistoryAddress input parameter internally for later use.
You should see the following INFO message in the logs (in stderr in YARN):
It then registers the application master using the local host, port 0 , and uiAddress input
parameter for the URL at which the master info can be seen.
The internal registered flag is enabled.
Ultimately, it creates a new YarnAllocator with the input parameters of register passed in
and the just-created YARN AMRMClient.
It basically checks that ApplicationMaster is registered and only when it is requests the
internal AMRMClient to unregister.
unregister is called when ApplicationMaster wants to unregister.
getMaxRegAttempts
getMaxRegAttempts uses the YARN and Spark settings and returns the maximum number of application attempts before ApplicationMaster
registration with YARN is considered unsuccessful (and so the Spark application).
It reads YARNs yarn.resourcemanager.am.max-attempts (available as
YarnConfiguration.RM_AM_MAX_ATTEMPTS) or falls back to
YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS (which is 2 ).
The return value is the minimum of the configuration settings of YARN and Spark.
getAttemptId(): ApplicationAttemptId
getAttemptId returns YARN's ApplicationAttemptId (of the Spark application to which the container was assigned).
ExecutorLauncher is a custom ApplicationMaster for client deploy mode only, for the purpose of easily distinguishing client and cluster deploy modes when using ps or jps .
$ jps -lm
Note
70631 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
70934 org.apache.spark.deploy.SparkSubmit --master yarn --class org.apache.spark.repl.Main
71320 sun.tools.jps.Jps -lm
70731 org.apache.hadoop.yarn.server.nodemanager.NodeManager
unregister
unregister unregisters the ApplicationMaster from the YARN ResourceManager.
Note
It first checks that the ApplicationMaster has not already been unregistered (using the
internal unregistered flag). If so, you should see the following INFO message in the logs:
INFO ApplicationMaster: Unregistering ApplicationMaster with [status]
main
ApplicationMaster is started as a standalone command-line application inside a YARN
container on a node.
Note
main runs the ApplicationMaster with a Hadoop UserGroupInformation as a thread local variable (distributed to child threads) for authenticating HDFS and YARN
calls.
Enable DEBUG logging level for org.apache.spark.deploy.SparkHadoopUtil logger
to see what happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.deploy.SparkHadoopUtil=DEBUG
Refer to Logging.
You should see the following message in the logs:
DEBUG running as user: [user]
ApplicationMasterArguments - Command-Line Parameters Handler
ApplicationMaster uses the ApplicationMasterArguments class to handle command-line parameters.
ApplicationMasterArguments is created right after main method has been executed for args
command-line parameters.
It accepts the following command-line parameters:
--jar JAR_PATH the path to the Spark application's JAR file
--class CLASS_NAME the name of the Spark application's main class
--arg ARG an argument to be passed to the Spark application's main class. There can be multiple --arg arguments (each is passed in order).
When an unsupported parameter is found the following message is printed out to standard
error output and ApplicationMaster exits with the exit code 1 .
Unknown/unsupported param [unknownParam]
Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
--jar JAR_PATH Path to your application's JAR file
--class CLASS_NAME Name of your application's main class
--primary-py-file A main Python file
--primary-r-file A main R file
--arg ARG Argument to be passed to your application's main class.
Multiple invocations are possible, each will be passed in order
.
--properties-file FILE Path to a custom Spark properties file.
registerAM(
_rpcEnv: RpcEnv,
driverRef: RpcEndpointRef,
uiAddress: String,
securityMgr: SecurityManager): Unit
Caution
It then starts the user class (with the driver) in a separate thread. You should see the
following INFO message in the logs:
INFO Starting the user application in a separate Thread
Caution
Caution
Caution
FIXME Finish
reporterThread
Caution
FIXME
launchReporterThread
Caution
FIXME
reference (to be sc ).
FIXME
run
When ApplicationMaster is started as a standalone command-line application (using main
method), ultimately it calls run . The result of calling run is the final result of the
ApplicationMaster command-line application.
run(): Int
(either calling runDriver for cluster mode or runExecutorLauncher for client mode).
When run runs you should see the following INFO in the logs:
INFO ApplicationAttemptId: [appAttemptId]
Caution
When executed in cluster deploy mode, it sets the following system properties:
spark.ui.port as 0
spark.master as yarn
spark.submit.deployMode as cluster
spark.yarn.app.id as application id
Caution
FIXME Link to the page about yarn deploy modes (not the general ones).
Caution
It finally registers ApplicationMaster for the Spark application (either calling runDriver for
cluster mode or runExecutorLauncher for client mode).
Any exceptions in run are caught and reported to the logs as ERROR message:
ERROR Uncaught exception: [exception]
And the application run attempt is finished with FAILED status and EXIT_UNCAUGHT_EXCEPTION
(10) exit code.
finish
Caution
FIXME
ExecutorLauncher
ExecutorLauncher comes with no extra functionality when compared to ApplicationMaster .
It serves as a helper class to run ApplicationMaster under another class name in client
deploy mode.
With the two different class names (pointing at the same class ApplicationMaster ) you
should be better able to distinguish between ExecutorLauncher (which is really an
ApplicationMaster ) in client deploy mode and the ApplicationMaster in cluster deploy mode (e.g. using tools such as ps or jps ).
In cluster deploy mode (when ApplicationMaster runs with web UI), it sets
spark.ui.filters system property as
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter . It also sets system properties
In client deploy mode (when ApplicationMaster runs on another JVM or even host than web
UI), it simply sends a AddWebUIFilter to ApplicationMaster (namely to AMEndpoint RPC
Endpoint).
AMEndpoint - ApplicationMaster RPC Endpoint
onStart Callback
When onStart is called, AMEndpoint communicates with the driver (the driver remote
RPC Endpoint reference) by sending a one-way RegisterClusterManager message with a
reference to itself.
After RegisterClusterManager has been sent (and received by YarnSchedulerEndpoint) the
communication between the RPC endpoints of ApplicationMaster (YARN) and
YarnSchedulerBackend (the Spark driver) is considered established.
RPC Messages
AddWebUIFilter
AddWebUIFilter(
filterName: String,
filterParams: Map[String, String],
proxyBase: String)
When AddWebUIFilter arrives, you should see the following INFO message in the logs:
INFO ApplicationMaster$AMEndpoint: Add WebUI Filter. [addWebUIFilter]
It then passes the AddWebUIFilter message on to the drivers scheduler backend (through
YarnScheduler RPC Endpoint).
RequestExecutors
RequestExecutors(
requestedTotal: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int])
If the requestedTotal number of executors is different than the current number of executors
requested earlier, resetAllocatorInterval is executed.
In case when YarnAllocator is not available yet, you should see the following WARN
message in the logs:
WARN Container allocator is not ready to request executors yet.
resetAllocatorInterval
Caution
FIXME
YarnClusterManager - ExternalClusterManager for YARN
YarnClusterManager is the only currently known ExternalClusterManager in Spark. It creates a TaskScheduler and a SchedulerBackend for YARN.
canCreate method
YarnClusterManager can handle the yarn master URL only.
createTaskScheduler method
createTaskScheduler creates a YarnClusterScheduler for cluster deploy mode and a YarnScheduler for client deploy mode.
createSchedulerBackend method
createSchedulerBackend creates a YarnClusterSchedulerBackend for cluster deploy mode and a YarnClientSchedulerBackend for client deploy mode.
initialize method
initialize simply initializes the input TaskSchedulerImpl .
YarnScheduler
It is a custom TaskSchedulerImpl with ability to compute racks per hosts, i.e. it comes with a
specialized getRackForHost.
It also sets org.apache.hadoop.yarn.util.RackResolver logger to WARN if not set already.
Note
YarnClusterScheduler
Tip
Enable INFO logging level for org.apache.spark.scheduler.cluster.YarnClusterScheduler logger to see what happens
inside YarnClusterScheduler .
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterScheduler=INFO
Refer to Logging.
postStartHook
postStartHook calls ApplicationMaster.sparkContextInitialized before the parent's
postStartHook .
YarnSchedulerBackend - Coarse-Grained Scheduler Backend for YARN
YarnSchedulerBackend is an abstract CoarseGrainedSchedulerBackend for YARN that
contains common logic for the client and cluster YARN scheduler backends, i.e.
YarnClientSchedulerBackend and YarnClusterSchedulerBackend respectively.
YarnSchedulerBackend is available in the RPC Environment as the YarnScheduler RPC Endpoint.
Resetting YarnSchedulerBackend
Note
doRequestTotalExecutors
def doRequestTotalExecutors(requestedTotal: Int): Boolean
Note
doRequestTotalExecutors is part of the CoarseGrainedSchedulerBackend Contract.
doRequestTotalExecutors sends a RequestExecutors message to the YarnScheduler RPC Endpoint with the input requestedTotal and the internal
localityAwareTasks and hostToLocalTaskCount attributes.
Caution
FIXME The internal attributes are already set. When and how?
totalExpectedExecutors
totalExpectedExecutors is a value that is 0 initially when a YarnSchedulerBackend instance
is created but later changes when Spark on YARN starts (in client mode or cluster mode).
Note
It is used in sufficientResourcesRegistered.
Caution
It sets optional appId (of type ApplicationId ), attemptId (for cluster mode only and of
type ApplicationAttemptId ).
It also creates SchedulerExtensionServices object (as services ).
Caution
sufficientResourcesRegistered
sufficientResourcesRegistered checks whether totalRegisteredExecutors is greater than or equal to totalExpectedExecutors multiplied by minRegisteredRatio.
Caution
minRegisteredRatio
minRegisteredRatio is set when YarnSchedulerBackend is created.
It is used in sufficientResourcesRegistered.
Note
java.lang.IllegalArgumentException: requirement failed: application ID unset
Caution
Caution
bindToYarn sets the internal appId and attemptId to the value of the input parameters,
appId and attemptId , respectively.
Note
Internal Registries
shouldResetOnAmRegister flag
When YarnSchedulerBackend is created, shouldResetOnAmRegister is disabled (i.e. false ).
shouldResetOnAmRegister controls whether to reset YarnSchedulerBackend when another
RegisterClusterManager RPC message arrives.
It allows resetting internal state after the initial ApplicationManager failed and a new one was
registered.
Note
Settings
spark.scheduler.minRegisteredResourcesRatio
spark.scheduler.minRegisteredResourcesRatio (default: 0.8 )
YarnClientSchedulerBackend
YarnClientSchedulerBackend is the YarnSchedulerBackend for Spark on YARN in client deploy mode.
Note
Refer to Logging.
FIXME
start
start is part of the SchedulerBackend Contract. It is executed when TaskSchedulerImpl
starts.
start(): Unit
It creates the internal client object and submits the Spark application. After the application is
deployed to YARN and running, start starts the internal monitorThread state monitor
thread. In the meantime it also calls the supertype's start .
start sets spark.driver.appUIAddress to be SparkUI.appUIAddress (if Spark's web UI is
enabled).
Note
With DEBUG log level enabled you should see the following DEBUG message in the logs:
DEBUG YarnClientSchedulerBackend: ClientArguments called with: [argsArrayBuf]
Note
FIXME Why is this part of subtypes since they both set it to the same value?
If spark.yarn.credentials.file is defined,
YarnSparkHadoopUtil.get.startExecutorDelegationTokenRenewer(conf) is called.
Caution
stop
It stops the internal helper objects, i.e. monitorThread and client , as well as "announces"
the stop to other services through Client.reportLauncherState . In the meantime it also calls
the supertype's stop .
stop makes sure that the internal client has already been created (i.e. it is not null ) before it is stopped.
waitForApplication
waitForApplication(): Unit
waitForApplication is an internal (private) method that waits until the current application is running.
You should see the following INFO message in the logs for RUNNING state:
INFO YarnClientSchedulerBackend: Application [appId] has started running.
asyncMonitorApplication
asyncMonitorApplication(): MonitorThread
MonitorThread
MonitorThread internal class is to monitor a Spark application deployed to YARN in client
mode.
When started, it calls the blocking Client.monitorApplication (with no application reports
printed out to the console, i.e. logApplicationReport is disabled).
Note
When the call to Client.monitorApplication has finished, it is assumed that the application
has exited. You should see the following ERROR message in the logs:
ERROR Yarn application has already exited with state [state]!
YarnClusterSchedulerBackend
YarnClusterSchedulerBackend is a custom YarnSchedulerBackend for Spark on YARN in cluster deploy mode.
You can find the sources in org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.
Enable DEBUG logging level for
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend logger to see what
happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend=DEBUG
Refer to Logging.
Creating YarnClusterSchedulerBackend
Creating a YarnClusterSchedulerBackend object requires a TaskSchedulerImpl and
SparkContext objects.
Note
Internally, it first queries ApplicationMaster for attemptId and records the application and
attempt ids.
It then calls the parent's start and sets the parent's totalExpectedExecutors to the initial
number of executors.
Internally, it retrieves the container id and through environment variables computes the base
URL.
You should see the following DEBUG in the logs:
DEBUG Base URL for logs: [baseUrl]
YarnSchedulerEndpoint RPC Endpoint
It uses the reference to the remote ApplicationMaster RPC Endpoint to send messages to.
Enable INFO logging level for org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint logger to see what happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint=IN
Refer to Logging.
RPC Messages
RequestExecutors
RequestExecutors(
requestedTotal: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int])
extends CoarseGrainedClusterMessage
RequestExecutors is to inform ApplicationMaster about the current requirements for the total
number of executors (as requestedTotal ), including already pending and running executors.
RemoveExecutor
KillExecutors
AddWebUIFilter
AddWebUIFilter(
filterName: String,
filterParams: Map[String, String],
proxyBase: String)
It firstly sets spark.ui.proxyBase system property to the input proxyBase (if not empty).
If it defines a filter, i.e. the input filterName and filterParams are both not empty, you
should see the following INFO message in the logs:
INFO Add WebUI Filter. [filterName], [filterParams], [proxyBase]
It then sets spark.ui.filters to be the input filterName in the internal conf SparkConf
attribute.
All the filterParams are also set as spark.[filterName].param.[key] and [value] .
The filter is added to web UI using JettyUtils.addFilters(ui.getHandlers, conf) .
Caution
RegisterClusterManager Message
RegisterClusterManager(am: RpcEndpointRef)
When RegisterClusterManager message arrives, the following INFO message is printed out
to the logs:
INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as [am]
The internal reference to the remote ApplicationMaster RPC Endpoint is set (to am ).
If the internal shouldResetOnAmRegister flag is enabled, YarnSchedulerBackend is reset.
Otherwise (it is disabled initially), shouldResetOnAmRegister is simply enabled so that a later re-registration triggers a reset.
shouldResetOnAmRegister controls whether to reset YarnSchedulerBackend when another RegisterClusterManager RPC message arrives.
Note
RetrieveLastAllocatedExecutorId
When RetrieveLastAllocatedExecutorId is received, YarnSchedulerEndpoint responds with
the current value of currentExecutorIdCounter.
Note
onDisconnected Callback
onDisconnected clears the internal reference to the remote ApplicationMaster RPC Endpoint
You should see the following WARN message in the logs if that happens:
WARN ApplicationMaster has disassociated: [remoteAddress]
onStop Callback
onStop shuts askAmThreadPool down immediately.
Note
The askAmThreadPool thread pool creates new threads as needed and reuses previously constructed threads when they are available.
YarnAllocator - Container Allocator
YarnAllocator requests containers from the YARN ResourceManager to run Spark
executors on, and releases them back to the ResourceManager when no longer needed.
Refer to Logging.
The internal targetNumExecutors attribute controls how many executors YarnAllocator asks for in allocation requests.
It may later be changed when YarnAllocator is requested for executors given locality
preferences.
requestTotalExecutorsWithPreferredLocalities(
requestedTotal: Int,
localityAwareTasks: Int,
hostToLocalTaskCount: Map[String, Int]): Boolean
If the input requestedTotal is different than the internal targetNumExecutors attribute you
should see the following INFO message in the logs:
INFO YarnAllocator: Driver requested a total number of [requestedTotal] executor(s).
It sets the internal targetNumExecutors attribute to the input requestedTotal and returns
true . Otherwise, it returns false .
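For illustration, a request made on the driver like the one below (a sketch using SparkContext's developer API) eventually reaches YarnAllocator via YarnSchedulerBackend and the ApplicationMaster and changes the desired total:
// sc is an active SparkContext of a Spark on YARN application
// ask the cluster manager for two additional executors
sc.requestExecutors(numAdditionalExecutors = 2)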
Note
It tracks the number of locality-aware tasks to be used as container placement hint when
YarnAllocator is requested for executors given locality preferences.
executors.
1. missing executors, i.e. when the number of executors allocated already or pending does
not match the needs and so there are missing executors.
2. executors to cancel, i.e. when the number of pending executor allocations is positive,
but the number of all the executors is more than Spark needs.
It then splits pending container allocation requests per locality preference of pending tasks
(in the internal hostToLocalTaskCounts registry).
Caution
FIXME Stale?
For any new container needed updateResourceRequests adds a container request (using
YARNs AMRMClient.addContainerRequest).
You should see the following INFO message in the logs:
INFO YarnAllocator: Submitted container request (host: [host], capability: [resource])
When there are executors to cancel (case 2.), you should see the following INFO message
in the logs:
INFO Canceling requests for [numToCancel] executor container(s) to have a new desired
total [targetNumExecutors] executors.
It checks whether there are pending allocation requests and removes the excess (using
YARNs AMRMClient.removeContainerRequest). If there are no pending allocation requests,
you should see the WARN message in the logs:
WARN Expected to find pending requests, but found none.
handleAllocatedContainers
handleAllocatedContainers(allocatedContainers: Seq[Container]): Unit
Caution
FIXME
processCompletedContainers
processCompletedContainers(completedContainers: Seq[ContainerStatus]): Unit
State
Exit status of a completed container.
Diagnostic message for a failed container.
FIXME The host may or may not exist in the lookup table?
Other exit statuses of the container are considered application failures and reported as a
WARN message in the logs:
WARN Container killed by YARN for exceeding memory limits. [diagnostics] Consider boos
ting spark.yarn.executor.memoryOverhead.
or
WARN Container marked as failed: [id] [host]. Exit status: [containerExitStatus]. Diag
nostics: [containerDiagnostics]
If no executor was found, the executor and the exit reason are recorded in the internal
releasedExecutorLossReasons lookup table.
In case the container was not in the internal releasedContainers registry, the internal
numUnexpectedContainerRelease counter is increased and a RemoveExecutor RPC
message is sent to the driver (as specified when YarnAllocator was created) to notify about
the failure of the executor.
FIXME
FIXME
FIXME
FIXME
FIXME
FIXME
FIXME
Caution
FIXME
allocateResources
allocateResources(): Unit
allocateResources is???
FIXME nodeLabelConstructor?
FIXME LocalityPreferredContainerPlacementStrategy?
Internal Registries
hostToLocalTaskCounts
hostToLocalTaskCounts: Map[String, Int] = Map.empty
Caution
FIXME
containerIdToExecutorId
Caution
FIXME
executorIdToContainer
Caution
FIXME
releasedExecutorLossReasons
Caution
FIXME
pendingLossReasonRequests
Caution
FIXME
failedExecutorsTimeStamps
Caution
FIXME
releasedContainers
releasedContainers contains containers of no use anymore, keyed by their globally unique identifier ( ContainerId ).
Note
YARN ResourceManager
YARN ResourceManager manages the global assignment of compute resources to
applications, e.g. memory, cpu, disk, network, etc.
Others
A host is the Hadoop term for a computer (also called a node, in YARN terminology).
A cluster is two or more hosts connected by a high-speed local network.
It can technically also be a single host used for debugging and simple testing.
Master hosts are a small number of hosts reserved to control the rest of the cluster.
Worker hosts are the non-master hosts in the cluster.
A master host is the communication point for a client program. A master host sends
the work to the rest of the cluster, which consists of worker hosts.
The YARN configuration file is an XML file that contains properties. This file is placed in
a well-known location on each host in the cluster and is used to configure the
ResourceManager and NodeManager. By default, this file is named yarn-site.xml .
Each NodeManager tracks its own local resources and communicates its resource
configuration to the ResourceManager, which keeps a running total of the clusters
available resources.
By keeping track of the total, the ResourceManager knows how to allocate
resources as they are requested.
A container in YARN holds resources on the YARN cluster.
A container hold request consists of vcore and memory.
Once a hold has been granted on a host, the NodeManager launches a process called
a task.
An application is a YARN client program that is made up of one or more tasks.
For each running application, a special piece of code called an ApplicationMaster helps
coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run
after the application starts.
An application in YARN comprises three parts:
The application client, which is how a program is run on the cluster.
An ApplicationMaster which provides YARN with the ability to perform allocation on
behalf of the application.
One or more tasks that do the actual work (runs in a process) in the container
allocated by YARN.
An application running tasks on a YARN cluster consists of the following steps:
The application starts and talks to the ResourceManager (running on the master)
for the cluster.
The ResourceManager makes a single container request on behalf of the
application.
The ApplicationMaster starts running within that container.
The ApplicationMaster requests subsequent containers from the ResourceManager
that are allocated to run tasks for the application. Those tasks do most of the status
communication with the ApplicationMaster.
Once all tasks are finished, the ApplicationMaster exits. The last container is deallocated from the cluster.
The application client exits. (The ApplicationMaster launched in a container is more
specifically called a managed AM).
The ResourceManager, NodeManager, and ApplicationMaster work together to manage
the clusters resources and ensure that the tasks, as well as the corresponding
application, finish cleanly.
Distributed Cache for application jar files.
Preemption (for high-priority applications)
Queues and nested queues
User authentication via Kerberos
Hadoop YARN
ContainerExecutors
LinuxContainerExecutor and Docker
WindowsContainerExecutor
Kerberos
Microsoft incorporated Kerberos authentication into Windows 2000
Two open source Kerberos implementations exist: the MIT reference implementation
and the Heimdal Kerberos implementation.
YARN supports user authentication via Kerberos (so do the other services: HDFS, HBase,
Hive).
FIXME
YarnSparkHadoopUtil
YarnSparkHadoopUtil is... FIXME
getApplicationAclsForYarn
Caution
FIXME
obtainTokenForHBase
obtainTokenForHBase(
sparkConf: SparkConf,
conf: Configuration,
credentials: Credentials): Unit
Caution
FIXME
obtainTokenForHiveMetastore
obtainTokenForHiveMetastore(
sparkConf: SparkConf,
conf: Configuration,
credentials: Credentials): Unit
Caution
FIXME
obtainTokensForNamenodes
obtainTokensForNamenodes(
paths: Set[Path],
conf: Configuration,
creds: Credentials,
renewer: Option[String] = None): Unit
Caution
Note
FIXME
It uses Hadoops UserGroupInformation.isSecurityEnabled() to determine
whether UserGroupInformation is working in a secure environment.
FIXME
getContainerId is a private[spark] method that gets YARN's ContainerId from the YARN environment variable CONTAINER_ID .
startExecutorDelegationTokenRenewer
Caution
FIXME
stopExecutorDelegationTokenRenewer
Caution
FIXME
addPathToEnvironment
addPathToEnvironment(env: HashMap[String, String], key: String, value: String): Unit
Caution
FIXME
Settings
The following settings (aka system properties) are specific to Spark on YARN.
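These are ordinary Spark properties, so (as a sketch; the values are arbitrary) they can be set programmatically, with --conf on spark-submit, or in spark-defaults.conf:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.queue", "default")                  // YARN resource queue to submit to
  .set("spark.yarn.maxAppAttempts", "2")               // attempts to register the ApplicationMaster
  .set("spark.yarn.submit.waitAppCompletion", "false") // do not wait for completion in cluster mode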
spark.yarn.maxAppAttempts
spark.yarn.maxAppAttempts is the maximum number of attempts to register the ApplicationMaster.
spark.yarn.app.id
Caution
FIXME
spark.yarn.am.port
Caution
FIXME
spark.yarn.user.classpath.first
Caution
FIXME
spark.yarn.archive
spark.yarn.archive is the location of an archive containing jar files with Spark classes. It cannot be a local: path.
spark.yarn.queue
Caution
spark.yarn.jars
spark.yarn.jars is the location of the Spark jars.
--conf spark.yarn.jar=hdfs://master:8020/spark/spark-assembly-2.0.0-hadoop2.7.2.jar
spark.yarn.report.interval
spark.yarn.report.interval (default: 1s ) is the interval between reports of the current application status.
spark.yarn.dist.jars
spark.yarn.dist.jars (default: empty) is a collection of additional jars to distribute.
It is used when Client distributes additional resources as specified using the --jars command-line option for spark-submit.
spark.yarn.dist.files
spark.yarn.dist.files (default: empty) is a collection of additional files to distribute.
spark.yarn.dist.archives
spark.yarn.dist.archives (default: empty) is a collection of additional archives to distribute.
spark.yarn.principal
spark.yarn.principal See the corresponding --principal command-line option for spark-submit.
spark.yarn.keytab
spark.yarn.keytab See the corresponding --keytab command-line option for spark-submit.
spark.yarn.submit.file.replication
spark.yarn.submit.file.replication is the replication factor (number) for files uploaded by
Spark to HDFS.
spark.yarn.config.gatewayPath
spark.yarn.config.gatewayPath (default: null ) is the root of configuration paths that is
present on gateway nodes, and will be replaced with the corresponding path in cluster
machines.
It is used when Client resolves a path to be YARN NodeManager-aware.
spark.yarn.config.replacementPath
spark.yarn.config.replacementPath (default: null ) is the path to use as a replacement for
spark.yarn.historyServer.address
spark.yarn.historyServer.address is the optional address of the History Server.
spark.yarn.access.namenodes
spark.yarn.access.namenodes (default: empty) is a list of extra NameNode URLs for which to
request delegation tokens. The NameNode that hosts fs.defaultFS does not need to be
listed here.
spark.yarn.cache.types
spark.yarn.cache.types is an internal setting
spark.yarn.cache.visibilities
spark.yarn.cache.visibilities is an internal setting
spark.yarn.cache.timestamps
spark.yarn.cache.filenames
spark.yarn.cache.filenames is an internal setting
spark.yarn.cache.sizes
spark.yarn.cache.sizes is an internal setting
spark.yarn.cache.confArchive
spark.yarn.cache.confArchive is an internal setting
spark.yarn.secondary.jars
spark.yarn.secondary.jars is
spark.yarn.executor.nodeLabelExpression
spark.yarn.executor.nodeLabelExpression is a node label expression for executors.
spark.yarn.launchContainers
spark.yarn.launchContainers (default: true )FIXME
spark.yarn.containerLauncherMaxThreads
spark.yarn.containerLauncherMaxThreads (default: 25 )FIXME
spark.yarn.executor.failuresValidityInterval
spark.yarn.executor.failuresValidityInterval (default: -1L ) is an interval (in milliseconds)
after which Executor failures will be considered independent and not accumulate towards
the attempt count.
spark.yarn.submit.waitAppCompletion
spark.yarn.submit.waitAppCompletion (default: true ) is a flag to control whether to wait for
the application to finish before exiting the launcher process in cluster mode.
spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead (in MiBs)
spark.yarn.am.cores
spark.yarn.am.cores (default: 1 ) sets the number of CPU cores for ApplicationMaster's
JVM.
spark.yarn.driver.memoryOverhead
spark.yarn.driver.memoryOverhead (in MiBs)
spark.yarn.am.memoryOverhead
spark.yarn.am.memoryOverhead (in MiBs)
spark.yarn.am.memory
spark.yarn.am.memory (default: 512m ) sets the memory size of ApplicationMaster's JVM (in
MiBs)
spark.yarn.stagingDir
spark.yarn.stagingDir is a staging directory used while submitting applications.
spark.yarn.preserve.staging.files
spark.yarn.preserve.staging.files (default: false ) controls whether to preserve the staged files (the Spark jar, the application jar and distributed cache files) after a job has finished rather than delete them.
spark.yarn.credentials.file
spark.yarn.credentials.file
Spark Standalone
Caution
You can deploy, i.e. spark-submit , your applications to Spark Standalone in client or
cluster deploy mode (read Deployment modes).
Deployment modes
Caution
FIXME
FIXME
It is enabled by default.
scheduleExecutorsOnWorkers
Caution
FIXME
scheduleExecutorsOnWorkers(
app: ApplicationInfo,
usableWorkers: Array[WorkerInfo],
spreadOutApps: Boolean): Array[Int]
SPARK_WORKER_INSTANCES (and
SPARK_WORKER_CORES)
There is really no need to run multiple workers per machine in Spark 1.5 (perhaps in 1.4,
too). You can run multiple executors on the same machine with one worker.
Use SPARK_WORKER_INSTANCES (default: 1 ) in spark-env.sh to define the number of worker
instances.
If you use SPARK_WORKER_INSTANCES , make sure to set SPARK_WORKER_CORES explicitly to limit
the cores per worker, or else each worker will try to use all the cores.
You can set up the number of cores as a command-line argument when you start a worker daemon using --cores .
Since the change SPARK-1706 Allow multiple executors per worker in Standalone mode in Spark 1.4 it is possible to start multiple executors in a single JVM process of a worker.
Before that change, to launch multiple executors on a machine you had to start multiple standalone workers, each with its own JVM, which introduced unnecessary overhead due to the extra JVM processes, provided that there were enough cores on that worker.
If you are running Spark in standalone mode on memory-rich nodes it can be beneficial to
have multiple worker instances on the same node as a very large heap size has two
disadvantages:
Garbage collector pauses can hurt throughput of Spark jobs.
A heap size over 32 GB cannot use CompressedOops, so a 35 GB heap can actually hold less than a 32 GB one.
Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the same physical host, so requesting smaller executors doesn't mean your application will have fewer overall resources.
SparkDeploySchedulerBackend
SparkDeploySchedulerBackend is the Scheduler Backend for Spark Standalone, i.e. it is used
AppClient
AppClient is an interface that allows Spark applications to talk to a Standalone cluster (using an RPC Environment).
AppClient uses a daemon cached thread pool ( askAndReplyThreadPool ) with named threads to send messages to the master.
When AppClient starts, AppClient.start() method is called that merely registers AppClient
RPC Endpoint.
Others
killExecutors
start
stop
An AppClient tries connecting to a standalone master 3 times every 20 seconds per master
before giving up. They are not configurable parameters.
The appclient-register-master-threadpool thread pool is used until the registration is finished, i.e. AppClient is connected to the primary standalone Master or the registration fails. It is then shut down.
RegisteredApplication RPC message
RegisteredApplication is a one-way message from the primary master to confirm successful application registration. It comes with the application id and the master's RPC endpoint reference.
The AppClientListener gets notified about the event via listener.connected(appId) with
appId being an application id.
Caution
Caution
Caution
stop the AppClient after the SparkContext has been stopped (and so should the running
application on the standalone cluster).
It stops the AppClient RPC endpoint.
RequestExecutors RPC message
RequestExecutors is a reply-response message from the SparkDeploySchedulerBackend
Settings
spark.deploy.spreadOut
spark.deploy.spreadOut (default: true ) controls whether the standalone Master should spread applications out across the workers or consolidate them onto as few workers as possible.
Standalone Master
Standalone Master (often written standalone Master) is the cluster manager for Spark
Standalone cluster. It can be started and stopped using custom management scripts for
standalone Master.
A standalone Master is pretty much the Master RPC Endpoint that you can access using the RPC port (low-level operation communication) or the Web UI.
Application ids follow the pattern app-yyyyMMddHHmmss .
Master keeps track of the following:
workers ( workers )
mapping between ids and applications ( idToApp )
waiting applications ( waitingApps )
applications ( apps )
mapping between ids and workers ( idToWorker )
mapping between RPC address and workers ( addressToWorker )
endpointToApp
addressToApp
completedApps
nextAppNumber
The following INFO shows up when the Master endpoint starts up ( Master#onStart is
called):
INFO Master: Starting Spark master at spark://japila.local:7077
INFO Master: Running Spark version 1.6.0-SNAPSHOT
Master WebUI
FIXME MasterWebUI
MasterWebUI is the Web UI server for the standalone master. Master starts the Web UI to listen on webUiPort (default: 8080 ).
States
Master can be in the following states:
STANDBY - the initial state while Master is initializing
ALIVE - start scheduling resources among applications.
RECOVERING
COMPLETING_RECOVERY
Caution
FIXME
RPC Environment
The org.apache.spark.deploy.master.Master class starts sparkMaster RPC environment.
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
The Master endpoint starts the daemon single-thread scheduler pool master-forward-message-thread . It is used for worker management, i.e. removing any timed-out workers.
Metrics
Master uses Spark Metrics System (via MasterSource ) to report metrics about internal
status.
The name of the source is master.
It emits the following metrics:
workers - the number of all workers (any state)
aliveWorkers - the number of alive workers
apps - the number of applications
waitingApps - the number of waiting applications
REST Server
The standalone Master starts the REST Server service for alternative application submission
that is supposed to work across Spark versions. It is enabled by default (see
spark.master.rest.enabled) and used by spark-submit for the standalone cluster mode, i.e. --deploy-mode is cluster .
RestSubmissionClient is the client.
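A minimal spark-submit invocation for the standalone cluster mode could look as follows (the master URL and the application jar are examples only; the jar has to be reachable from the cluster, e.g. on HDFS or present on every node):
./bin/spark-submit \
  --master spark://localhost:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /path/to/spark-examples.jar 100
spark-submit first tries the REST-based gateway and falls back to the legacy RPC gateway if the master endpoint turns out not to be a REST server (see Submission Gateways).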
Recovery Mode
A standalone Master can run with recovery mode enabled and be able to recover state
among the available swarm of masters. By default, there is no recovery, i.e. no persistence
and no election.
Only a master can schedule tasks so having one always on is important for
cases where you want to launch new tasks. Running tasks are unaffected by
the state of the master.
Note
Check out the exercise Spark Standalone - Using ZooKeeper for High-Availability
of Master.
Leader Election
Master endpoint is LeaderElectable , i.e. FIXME
Caution
FIXME
RPC Messages
Master communicates with drivers, executors and configures itself using RPC messages.
The following message types are accepted by master (see Master#receive or
Master#receiveAndReply methods):
ElectedLeader for Leader Election
CompleteRecovery
RevokedLeadership
RegisterApplication
ExecutorStateChanged
DriverStateChanged
Heartbeat
MasterChangeAcknowledged
WorkerSchedulerStateResponse
UnregisterApplication
CheckForWorkerTimeOut
RegisterWorker
RequestSubmitDriver
RequestKillDriver
RequestDriverStatus
RequestMasterState
BoundPortsRequest
RequestExecutors
KillExecutors
RegisterApplication event
A RegisterApplication event is sent by AppClient to the standalone Master. The event holds information about the application being deployed ( ApplicationDescription ) and the driver's endpoint reference.
ApplicationDescription describes an application by its name, maximum number of cores, executor memory, command, appUiUrl, and user with optional eventLogDir and eventLogCodec for Event Logs, and the number of cores per executor.
Caution
FIXME Finish
Caution
FIXME persistenceEngine.addApplication(app)
The message holds information about the id and name of the driver.
A driver can be running on a single worker while a worker can have many drivers running.
When a worker receives a LaunchDriver message, it prints out the following INFO:
INFO Asked to launch driver [driver.id]
It then creates a DriverRunner and starts it. It starts a separate JVM process.
Workers' free memory and cores are considered when assigning some to waiting drivers
(applications).
Caution
DriverRunner
Warning
Internals of org.apache.spark.deploy.master.Master
You can debug a Standalone master using the following command:
Tip
The above command suspends ( suspend=y ) the process until a JPDA debugging client, e.g. you
When Master starts, it first creates the default SparkConf configuration whose values it
then overrides using environment variables and command-line options.
A fully-configured master instance requires host , port (default: 7077 ), webUiPort
(default: 8080 ) settings defined.
Tip
It starts RPC Environment with necessary endpoints and lives until the RPC environment
terminates.
Worker Management
Settings
FIXME
Caution
terms of cores).
Standalone Worker
Standalone Worker (aka standalone slave) is the worker in Spark Standalone cluster.
You can have one or many standalone workers in a standalone cluster. They can be started
and stopped using custom management scripts for standalone workers.
Executor Summary
Executor Summary page displays information about the executors for the application id
given as the appId request parameter.
If no application for the appId could be found, Not Found page is displayed.
Submission Gateways
Caution
FIXME
From SparkSubmit.submit :
In standalone cluster mode, there are two submission gateways:
1. The traditional legacy RPC gateway using o.a.s.deploy.Client as a wrapper
2. The new REST-based gateway introduced in Spark 1.3
The latter is the default behaviour as of Spark 1.3, but Spark submit will fail over to use the
legacy gateway if the master endpoint turns out to be not a REST server.
sbin/start-master.sh
sbin/start-master.sh script starts a Spark master on the machine the script is executed on.
./sbin/start-master.sh
org.apache.spark.deploy.master.Master \
--ip japila.local --port 7077 --webui-port 8080
Note
It has support for starting Tachyon using the --with-tachyon command-line option. It assumes the tachyon/bin/tachyon command is available in Spark's home directory.
Command-line Options
You can use the following command-line options:
--host or -h the hostname to listen on; overrides SPARK_MASTER_HOST.
--ip or -i (deprecated) the IP to listen on
overrides it.
--properties-file (default: $SPARK_HOME/conf/spark-defaults.conf ) - the path to a custom Spark properties file
sbin/stop-master.sh
You can stop a Spark Standalone master using sbin/stop-master.sh script.
./sbin/stop-master.sh
Caution
slave.
SPARK_WORKER_PORT - the base port number to listen on for the first worker. If set,
subsequent workers will increment this number. If unset, Spark will pick a random port.
SPARK_WORKER_WEBUI_PORT (default: 8081 ) - the base port for the web UI of the first
worker. Subsequent workers will increment this number. If the port is used, the
successive ports are tried until a free one is found.
SPARK_WORKER_CORES - the number of cores to use by a single executor
SPARK_WORKER_MEMORY (default: 1G )- the amount of memory to use, e.g. 1000M , 2G
SPARK_WORKER_DIR (default: $SPARK_HOME/work ) - the directory to run apps in
sbin/spark-config.sh
bin/load-spark-env.sh
Command-line Options
You can use the following command-line options:
--host or -h sets the hostname to be available under.
--port or -p - command-line version of SPARK_WORKER_PORT environment
variable.
--cores or -c (default: the number of processors available to the JVM) - command-line version of the SPARK_WORKER_CORES environment variable.
--work-dir or -d - command-line version of SPARK_WORKER_DIR environment
variable.
--webui-port - command-line version of SPARK_WORKER_WEBUI_PORT
environment variable.
--properties-file (default: conf/spark-defaults.conf ) - the path to a custom Spark
properties file
--help
Spark properties
After loading the default SparkConf, if --properties-file or SPARK_WORKER_OPTS define spark.worker.ui.port , the value of the property is used as the port of the worker's web UI.
or
$ cat worker.properties
spark.worker.ui.port=33333
$ ./sbin/start-slave.sh spark://localhost:7077 --properties-file worker.properties
sbin/spark-daemon.sh
Ultimately, the script calls sbin/spark-daemon.sh start to kick off
org.apache.spark.deploy.worker.Worker with --webui-port , --port and the master URL.
Internals of org.apache.spark.deploy.worker.Worker
Upon starting, a Spark worker creates the default SparkConf.
It parses command-line arguments for the worker using WorkerArguments class.
SPARK_LOCAL_HOSTNAME - custom host name
SPARK_LOCAL_IP - custom IP to use (when SPARK_LOCAL_HOSTNAME is not set or hostname
RPC environment
The org.apache.spark.deploy.worker.Worker class starts its own sparkWorker RPC
environment with Worker endpoint.
It has support for starting Tachyon using the --with-tachyon command-line option. It assumes the tachyon/bin/tachyon command is available in Spark's home directory.
The script uses the following environment variables (and sets them when unavailable):
SPARK_PREFIX
SPARK_HOME
SPARK_CONF_DIR
SPARK_MASTER_PORT
SPARK_MASTER_IP
The following command will launch 3 worker instances on each node. Each worker instance
will use two cores.
SPARK_WORKER_INSTANCES=3 SPARK_WORKER_CORES=2 ./sbin/start-slaves.sh
If you however want to filter out the JVM processes that really belong to Spark you should pipe the command's output to OS-specific tools like grep .
$ jps -lm
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8
080
397
669 org.jetbrains.idea.maven.server.RemoteMavenServer
1198 sun.tools.jps.Jps -lm
$ jps -lm | grep -i spark
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8
080
spark-daemon.sh status
You can also check out ./sbin/spark-daemon.sh status .
When you start Spark Standalone using scripts under sbin , PIDs are stored in /tmp
directory by default. ./sbin/spark-daemon.sh status can read them and do the "boilerplate"
for you, i.e. status a PID.
$ jps -lm | grep -i spark
999 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port 8
080
$ ls /tmp/spark-*.pid
/tmp/spark-jacek-org.apache.spark.deploy.master.Master-1.pid
$ ./sbin/spark-daemon.sh status org.apache.spark.deploy.master.Master 1
org.apache.spark.deploy.master.Master is running.
Tip
You can use the Spark Standalone cluster in the following ways:
Use spark-shell with --master MASTER_URL
Important
Notes:
Read Operating Spark Standalone master
Use SPARK_CONF_DIR for the configuration directory (defaults to $SPARK_HOME/conf ).
Use spark.deploy.retainedApplications (default: 200 )
Use spark.deploy.retainedDrivers (default: 200 )
Use spark.deploy.recoveryMode (default: NONE )
Use spark.deploy.defaultCores (default: Int.MaxValue )
2. Open master's web UI at http://localhost:8080 to know the current setup - no workers
and applications.
Note
4. Check out master's web UI at http://localhost:8080 to know the current setup - one
worker.
./sbin/stop-slave.sh
6. Check out master's web UI at http://localhost:8080 to know the current setup - one
worker in DEAD state.
Note
8. Check out master's web UI at http://localhost:8080 to know the current setup - one
worker ALIVE and another DEAD.
Figure 4. Master's web UI with one worker ALIVE and one DEAD
9. Configuring cluster using conf/spark-env.sh
There's the conf/spark-env.sh.template template to start from.
We're going to use the following conf/spark-env.sh :
conf/spark-env.sh
SPARK_WORKER_CORES=2 (1)
SPARK_WORKER_INSTANCES=2 (2)
SPARK_WORKER_MEMORY=2g
$ ./sbin/start-slave.sh spark://japila.local:7077
starting org.apache.spark.deploy.worker.Worker, logging to
../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1japila.local.out
starting org.apache.spark.deploy.worker.Worker, logging to
../logs/spark-jacek-org.apache.spark.deploy.worker.Worker-2japila.local.out
11. Check out master's web UI at http://localhost:8080 to know the current setup - at least
two workers should be ALIVE.
Note
$ jps
6580 Worker
4872 Master
6874 Jps
6539 Worker
./sbin/stop-all.sh
StandaloneSchedulerBackend
Caution
FIXME
Spark on Mesos
Running Spark on Mesos
A Mesos cluster needs at least one Mesos Master to coordinate and dispatch tasks onto
Mesos Slaves.
$ mesos-master --registry=in_memory --ip=127.0.0.1
I0401 00:12:01.955883 1916461824 main.cpp:237] Build: 2016-03-17 14:20:58 by brew
I0401 00:12:01.956457 1916461824 main.cpp:239] Version: 0.28.0
I0401 00:12:01.956538 1916461824 main.cpp:260] Using 'HierarchicalDRF' allocator
I0401 00:12:01.957381 1916461824 main.cpp:471] Starting Mesos master
I0401 00:12:01.964118 1916461824 master.cpp:375] Master 9867c491-5370-48cc-8e25-e1aff1
d86542 (localhost) started on 127.0.0.1:5050
...
$ mesos-slave --master=127.0.0.1:5050
I0401 00:15:05.850455 1916461824 main.cpp:223] Build: 2016-03-17 14:20:58 by brew
I0401 00:15:05.850772 1916461824 main.cpp:225] Version: 0.28.0
I0401 00:15:05.852812 1916461824 containerizer.cpp:149] Using isolation: posix/cpu,pos
ix/mem,filesystem/posix
I0401 00:15:05.866186 1916461824 main.cpp:328] Starting Mesos slave
I0401 00:15:05.869470 218980352 slave.cpp:193] Slave started on 1)@10.1.47.199:5051
...
I0401 00:15:05.906355 218980352 slave.cpp:832] Detecting new master
I0401 00:15:06.762917 220590080 slave.cpp:971] Registered with master [email protected]
:5050; given slave ID 9867c491-5370-48cc-8e25-e1aff1d86542-S0
...
Figure 2. Mesos Management Console (Slaves tab) with one slave running
You have to export MESOS_NATIVE_JAVA_LIBRARY environment variable
before connecting to the Mesos cluster.
Important
$ export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.dylib
The preferred approach to launch Spark on Mesos and to give the location of Spark
binaries is through spark.executor.uri setting.
--conf spark.executor.uri=/Users/jacek/Downloads/spark-1.5.2-bin-hadoop2.6.tgz
Note
In Frameworks tab you should see a single active framework for spark-shell .
Figure 3. Mesos Management Console (Frameworks tab) with Spark shell active
Tip
Important
CoarseMesosSchedulerBackend
CoarseMesosSchedulerBackend is the scheduler backend for Spark on Mesos.
It requires a Task Scheduler, Spark context, mesos:// master URL, and Security Manager.
It is a specialized CoarseGrainedSchedulerBackend and implements Mesoss
org.apache.mesos.Scheduler interface.
It accepts only two failures before blacklisting a Mesos slave (it is hardcoded and not
configurable).
It tracks:
the number of tasks already submitted ( nextMesosTaskId )
the number of cores per task ( coresByTaskId )
the total number of cores acquired ( totalCoresAcquired )
slave ids with executors ( slaveIdsWithExecutors )
slave ids per host ( slaveIdToHost )
task ids per slave ( taskIdToSlaveId )
How many times tasks on each slave failed ( failuresBySlaveId )
Tip
Settings
of Parallelism.
spark.cores.max (default: Int.MaxValue ) - maximum number of cores to acquire
spark.mesos.extra.cores (default: 0 ) - extra cores per slave ( extraCoresPerSlave )
FIXME
spark.mesos.constraints (default: (empty)) - offer constraints FIXME
slaveOfferConstraints
spark.mesos.rejectOfferDurationForUnmetConstraints (default: 120s ) - the duration for which offers with unmet constraints are rejected
MesosExternalShuffleClient
FIXME
(Fine)MesosSchedulerBackend
When spark.mesos.coarse is false , Spark on Mesos uses MesosSchedulerBackend
reviveOffers
It calls mesosDriver.reviveOffers() .
Caution
FIXME
Settings
spark.mesos.coarse (default: true ) controls whether the scheduler backend for Mesos works in coarse-grained or fine-grained mode.
MesosClusterScheduler.scala
MesosExternalShuffleService
Schedulers in Mesos
Commands
The following command is how you could execute a Spark application on Mesos:
./bin/spark-submit --master mesos://iq-cluster-master:5050 --total-executor-cores 2 --executor-memory 3G --conf spark.mesos.role=dev ./examples/src/main/python/pi.py 100
Other Findings
From Four reasons to pay attention to Apache Mesos:
Spark workloads can also be sensitive to the physical characteristics of the
infrastructure, such as memory size of the node, access to fast solid state disk, or
proximity to the data source.
to run Spark workloads well you need a resource manager that not only can handle the rapid swings in load inherent in analytics processing, but one that can do so smartly.
Matching of the task to the RIGHT resources is crucial and awareness of the physical
environment is a must. Mesos is designed to manage this problem on behalf of
workloads like Spark.
MesosCoarseGrainedSchedulerBackend
Coarse-Grained Scheduler Backend for Mesos
Caution
FIXME
(executorLimitOption attribute)
executorLimitOption is an internal attribute toFIXME
About Mesos
Apache Mesos is an Apache Software Foundation open source cluster management and
scheduling framework. It abstracts CPU, memory, storage, and other compute resources
away from machines (physical or virtual).
Mesos provides API for resource management and scheduling across multiple nodes (in
datacenter and cloud environments).
Tip
Concepts
A Mesos master manages agents. It is responsible for tracking, pooling and distributing
agents' resources, managing active applications, and task delegation.
A Mesos agent is the worker with resources to execute tasks.
A Mesos framework is an application running on an Apache Mesos cluster. It runs on agents
as tasks.
The Mesos master offers resources to frameworks that can accept or reject them based on
specific constraints.
A resource offer is an offer with CPU cores, memory, ports, disk.
Frameworks: Chronos, Marathon, Spark, HDFS, YARN (Myriad), Jenkins, Cassandra.
Mesos API
Mesos is a scheduler of schedulers
Mesos assigns jobs
Mesos typically runs with an agent on every virtual machine or bare metal server under
management (https://www.joyent.com/blog/mesos-by-the-pound)
Mesos uses ZooKeeper for master election and discovery. Apache Aurora is a scheduler that runs on Mesos.
Execution Model
Caution
FIXME This is the single place for explaining jobs, stages, tasks. Move
relevant parts from the other places.
Optimising Spark
Caching and Persistence
Broadcast variables
Accumulators
Caching and Persistence
Note
Due to the very small and purely syntactic difference between caching and
persistence of RDDs the two terms are often used interchangeably and I will
follow the "pattern" here.
RDDs can also be unpersisted to remove RDD from a permanent storage like memory
and/or disk.
You can only change the storage level once or an UnsupportedOperationException is thrown:
Cannot change storage level of an RDD after it was already assigned a level
Note
You can pretend to change the storage level of an RDD with already-assigned
storage level only if the storage level is the same as it is currently assigned.
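A minimal sketch (in spark-shell ) that triggers the exception:
import org.apache.spark.storage.StorageLevel
val ints = sc.parallelize(0 to 9)
ints.persist(StorageLevel.MEMORY_ONLY)
// a second persist with a different storage level throws
// java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level
ints.persist(StorageLevel.DISK_ONLY)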
If the RDD is marked as persistent the first time, the RDD is registered to ContextCleaner (if
available) and SparkContext .
The internal storageLevel attribute is set to the input newLevel storage level.
Storage Levels
StorageLevel describes how an RDD is persisted (and addresses the following concerns):
You can check out the storage level using getStorageLevel() operation.
When called, unpersist prints the following INFO message to the logs:
INFO [RddName]: Removing RDD [id] from persistence list
Broadcast Variables
From the official documentation about Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks.
And later in the document:
Explicitly creating broadcast variables is only useful when tasks across multiple stages
need the same data or when caching the data in deserialized form is important.
Introductory Example
Let's start with an introductory example to check out how to use broadcast variables and
build your initial understanding.
You're going to use a static mapping of interesting projects with their websites, i.e.
Map[String, String] that the tasks, i.e. closures (anonymous functions) in transformations,
use.
scala> val pws = Map("Apache Spark" -> "http://spark.apache.org/", "Scala" -> "http://
www.scala-lang.org/")
pws: scala.collection.immutable.Map[String,String] = Map(Apache Spark -> http://spark.
apache.org/, Scala -> http://www.scala-lang.org/)
scala> val websites = sc.parallelize(Seq("Apache Spark", "Scala")).map(pws).collect
...
websites: Array[String] = Array(http://spark.apache.org/, http://www.scala-lang.org/)
It works, but is very inefficient as the pws map is sent over the wire to executors while it
could have been there already. If there were more tasks that need the pws map, you could
improve their performance by minimizing the number of bytes that are going to be sent over
the network for task execution.
Enter broadcast variables.
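A minimal sketch of the broadcast-based variant of the example above (assuming the same spark-shell session with pws already defined):
val pwsB = sc.broadcast(pws)
val websitesB = sc.parallelize(Seq("Apache Spark", "Scala")).map(pwsB.value).collect
The pws map is now shipped to each executor once (as the broadcast value) instead of being serialized into every task closure.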
Semantically, the two computations - with and without the broadcast value - are exactly the
same, but the broadcast-based one wins performance-wise when there are more executors
spawned to execute many tasks that use pws map.
Introduction
Broadcast is part of Spark that is responsible for broadcasting information across nodes in
a cluster.
You can use broadcast variables to implement a map-side join, i.e. a join using a map . For this,
lookup tables are distributed across nodes in a cluster using broadcast and then looked up
inside map (to do the join implicitly).
When you broadcast a value, it is copied to executors only once (while it is copied multiple
times for tasks otherwise). It means that broadcast can help to get your Spark application
faster if you have a large value to use in tasks or there are more tasks than executors.
A Spark idiom has emerged that combines broadcast with collectAsMap to create a broadcast Map : map an RDD down to a smaller dataset (column-wise, not record-wise), collectAsMap it, and broadcast it. Mapping the elements of the very big RDD against such broadcast maps is then computationally faster.
val acMap = sc.broadcast(myRDD.map { case (a, b, c, d) => (a, c) }.collectAsMap)
val otherMap = sc.broadcast(myOtherRDD.collectAsMap)
myBigRDD.map { case (a, b, c, d) =>
(acMap.value.get(a).get, otherMap.value.get(c).get)
}.collect
Whenever possible, use large broadcast HashMaps over RDDs, and leave the RDDs with a key to look up the necessary data, as demonstrated above.
Spark comes with a BitTorrent implementation.
It is not enabled by default.
SparkContext.broadcast
Read about SparkContext.broadcast method in Creating broadcast variables.
Further Reading
Map-Side Join in Spark
Accumulators
Accumulators are variables that are "added" to through an associative and commutative
"add" operation. They act as a container for accumulating partial values across multiple
tasks running on executors. They are designed to be used safely and efficiently in parallel
and distributed Spark computations and are meant for distributed counters and sums.
You can create built-in accumulators for longs, doubles, or collections or register custom
accumulators using the SparkContext.register methods. You can create accumulators with
or without a name, but only named accumulators are displayed in web UI (under Stages tab
for a given stage).
Accumulators are not thread-safe. They do not really have to be, since the
DAGScheduler.updateAccumulators method that the driver uses to update the values of
accumulators after a task completes (successfully or with a failure) is only executed on a
single thread that runs the scheduling loop. Besides that, they are write-only data structures
for workers that have their own local accumulator reference, whereas accessing the value of an
accumulator is only allowed by the driver.
Accumulators are serializable so they can safely be referenced in the code executed in executors and then safely sent over the wire for execution.
val counter = sc.longAccumulator("counter")
sc.parallelize(1 to 9).foreach(x => counter.add(x))
AccumulatorV2
abstract class AccumulatorV2[IN, OUT]
It creates an AccumulatorMetadata metadata object for the accumulator (with a new unique
identifier) and registers the accumulator with AccumulatorContext. The accumulator is then
registered with ContextCleaner for cleanup.
AccumulatorContext
AccumulatorContext is a private[spark] internal object used to track accumulators by
Spark itself using an internal originals lookup table. Spark uses the AccumulatorContext
object to register and unregister accumulators.
The originals lookup table maps accumulator identifier to the accumulator itself.
Every accumulator has its own unique accumulator id that is assigned using the internal
nextId counter.
AccumulatorContext.SQL_ACCUM_IDENTIFIER
AccumulatorContext.SQL_ACCUM_IDENTIFIER is an internal identifier for Spark SQL's internal
accumulators. The value is sql and Spark uses it to distinguish Spark SQL metrics from
others.
Named Accumulators
An accumulator can have an optional name that you can specify when creating an
accumulator.
val counter = sc.longAccumulator("counter")
AccumulableInfo
AccumulableInfo contains information about a task's local updates to an Accumulable.
id of the accumulator
Imagine you are requested to write a distributed counter. What do you think about the
following solutions? What are the pros and cons of using it?
val ints = sc.parallelize(0 to 9, 3)
var counter = 0
ints.foreach { n =>
println(s"int: $n")
counter = counter + 1
}
println(s"The number of elements is $counter")
Spark Security
Enable security via spark.authenticate property (defaults to false ).
See org.apache.spark.SecurityManager
Enable INFO for org.apache.spark.SecurityManager to see messages regarding
security in Spark.
Enable DEBUG for org.apache.spark.SecurityManager to see messages regarding SSL
in Spark, namely file server and Akka.
SecurityManager
Caution
Securing Web UI
Tip
To secure Web UI you implement a security filter and use spark.ui.filters setting to refer
to the class.
Examples of filters implementing basic authentication:
Servlet filter for HTTP basic auth
neolitec/BasicAuthenticationFilter.java
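Assuming such a filter class is on the classpath (the class name and parameters below are hypothetical), a sketch of the relevant entries in conf/spark-defaults.conf could be:
spark.ui.filters=com.example.BasicAuthenticationFilter
spark.com.example.BasicAuthenticationFilter.params=username=admin,password=changeme
Filter parameters can be passed using config entries of the form spark.<class name of filter>.params .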
FIXME What are the differences between textFile and the rest methods in
SparkContext like newAPIHadoopRDD , newAPIHadoopFile , hadoopFile ,
hadoopRDD ?
Returns HadoopRDD
When using textFile to read an HDFS folder with multiple files inside, the number of partitions is equal to the number of HDFS blocks.
What does sc.binaryFiles do?
URLs supported:
s3:// or s3n://
hdfs://
file://;
The general rule seems to be to use HDFS to read files multiple times, with S3 as storage for one-time access.
1.
parallelize uses 4 to denote the number of partitions so there are going to be 4 files
saved.
2.
S3
s3:// or s3n:// URL are supported.
configuration).
if the directory contains multiple SequenceFiles all of them will be added to RDD
SequenceFile RDD
Edit conf/log4j.properties so the line log4j.rootCategory uses appropriate log level, e.g.
log4j.rootCategory=ERROR, console
FIXME
Describe the other computing models using Spark SQL, MLlib, Spark Streaming, and
GraphX.
$ ./bin/spark-shell
...
Spark context available as sc.
...
SQL context available as spark.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT
/_/
Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sc.addFile("/Users/jacek/dev/sandbox/hello.json")
scala> import org.apache.spark.SparkFiles
import org.apache.spark.SparkFiles
scala> SparkFiles.get("/Users/jacek/dev/sandbox/hello.json")
See org.apache.spark.SparkFiles.
Caution
scala> sc.textFile("http://japila.pl").foreach(println)
java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat
.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
...
Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop
ecosystem, regardless of the choice of data processing framework, data model or
programming language.
Spark 1.5 uses Parquet 1.7.
excellent for local file storage on HDFS (instead of external databases).
writing very large datasets to disk
supports schema and schema evolution.
faster than json/gzip
Used in Spark SQL.
Serialization
Serialization systems:
Java serialization
Kryo
Avro
Thrift
Protobuf
Caution
FIXME Describe the features listed in the document and how Spark features
contributed
Caution
Spark SQL
From Spark SQL home page:
Spark SQL is Spark's module for working with structured data (rows and columns) in
Spark.
From Spark's Role in the Big Data Ecosystem - Matei Zaharia video:
Spark SQL enables loading & querying structured data in Spark.
Spark SQL is a distributed SQL framework that leverages the power of Spark's distributed
computation model (based on RDD). It becomes the new Spark core with the Catalyst query
optimizer and the Tungsten execution engine with the main abstractions being Dataset,
DataFrame and the good ol' SQL (see the comment from Reynold Xin).
The following snippet shows a batch ETL pipeline that processes JSON files and saves a subset of their fields as CSV.
spark.read
.format("json")
.load("input-json")
.select("field1", "field2")
.where("field2 > 15")
.write
.format("csv")
.save("output-csv")
With Structured Streaming however, the above static query becomes dynamic and
continuous.
spark.readStream
.format("json")
.load("input-json")
.select("field1", "field2")
.where("field2 > 15")
.writeStream
.format("console")
.start
As of Spark 2.0, the main data abstraction of Spark SQL is Dataset. It represents structured data, i.e. records with a known schema. Dataset enables a compact binary representation of this structured data using a compressed columnar format.
DataFrame
Spark SQL introduces a tabular data abstraction called DataFrame. It is designed to ease processing large amounts of structured tabular data on Spark infrastructure.
Found the following note about Apache Drill, but appears to apply to Spark SQL
perfectly:
Note
A SQL query engine for relational and NoSQL databases with direct
queries on self-describing and semi-structured data in files, e.g. JSON or
Parquet, and HBase tables without needing to specify metadata definitions
in a centralized store.
From user@spark:
If you already loaded csv data into a dataframe, why not register it as a table, and use
Spark SQL to find max/min or any other aggregates? SELECT MAX(column_name)
FROM dftable_name seems natural.
If you're more comfortable with SQL, it might be worth registering this DataFrame as a table and generating a SQL query against it (generate a string with a series of min-max calls)
Caution
You can parse data from external data sources and let the schema inferencer deduce the
schema.
Creating DataFrames
From http://stackoverflow.com/a/32514683/1305344:
val df = sc.parallelize(Seq(
Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
)).toDF("date_string")
df.registerTempTable("df")
spark.sql(
"""SELECT date_string,
from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEEE') AS dow
FROM df"""
).show
The result:
+-----------+--------+
|date_string| dow|
+-----------+--------+
| 08/11/2015| Tuesday|
| 09/11/2015| Friday|
| 09/12/2015|Saturday|
+-----------+--------+
And then
val fileRdd = sc.textFile("README.md")
val df = fileRdd.toDF
import org.apache.spark.sql.SaveMode
val outputF = "test.avro"
df.write.mode(SaveMode.Append).format("com.databricks.spark.avro").save(outputF)
val df = sc.parallelize(Seq(
(1441637160, 10.0),
(1441637170, 20.0),
(1441637180, 30.0),
(1441637210, 40.0),
(1441637220, 10.0),
(1441637230, 0.0))).toDF("timestamp", "value")
import org.apache.spark.sql.types._
val tsGroup = (floor($"timestamp" / lit(60)) * lit(60)).cast(IntegerType).alias("times
tamp")
df.groupBy(tsGroup).agg(mean($"value").alias("value")).show
More examples
Another example:
val df = Seq(1 -> 2).toDF("i", "j")
val query = df.groupBy('i)
.agg(max('j).as("aggOrdering"))
.orderBy(sum('j))
query == Row(1, 2) // should return true
The private more direct API to create a SparkSession requires a SparkContext and an
optional SharedState (that represents the shared state across SparkSession instances).
Note
Implicits (SparkSession.implicits)
The implicits object is a helper class with methods to convert objects to Datasets and
DataFrames, and also comes with many Encoders for "primitive" types as well as the
collections thereof.
Import the implicits by import spark.implicits._ as follows:
Note
It holds Encoders for Scala "primitive" types like Int , Double , String , and their products
and collections.
It offers support for creating Dataset from RDD of any type (for which an encoder exists in
scope), or case classes or tuples, and Seq .
It also offers conversions from Scala's Symbol or $ to Column .
It also offers conversions from RDD or Seq of Product types (e.g. case classes or tuples)
to DataFrame . It has direct conversions from RDD of Int , Long and String to
DataFrame with a single column name _1 .
Note
readStream
readStream: DataStreamReader
emptyDataset
emptyDataset[T: Encoder]: Dataset[T]
emptyDataset creates an empty Dataset (assuming that future records being of type T ).
createDataset methods
createDataset[T : Encoder](data: Seq[T]): Dataset[T]
createDataset[T : Encoder](data: RDD[T]): Dataset[T]
createDataset creates a Dataset from the local collection or the distributed RDD .
The LogicalPlan is LocalRelation (for the input data collection) or LogicalRDD (for the
input RDD[T] ).
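For example, in spark-shell (where the implicit encoders from spark.implicits._ are in scope):
scala> spark.createDataset(Seq(1, 2, 3)).show
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+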
range methods
range(end: Long): Dataset[java.lang.Long]
range(start: Long, end: Long): Dataset[java.lang.Long]
range(start: Long, end: Long, step: Long): Dataset[java.lang.Long]
range(start: Long, end: Long, step: Long, numPartitions: Int): Dataset[java.lang.Long]
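For example:
scala> spark.range(0, 10, 2).show
+---+
| id|
+---+
|  0|
|  2|
|  4|
|  6|
|  8|
+---+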
emptyDataFrame
emptyDataFrame: DataFrame
createDataFrame method
createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame
streams Attribute
streams: StreamingQueryManager
udf Attribute
udf: UDFRegistration
catalog Attribute
catalog attribute is an interface to the current catalog (of databases, tables, functions, table
table method
table(tableName: String): DataFrame
table creates a DataFrame from records in the tableName table (if exists).
val df = spark.table("mytable")
streamingQueryManager Attribute
streamingQueryManager is
listenerManager Attribute
listenerManager is
ExecutionListenerManager
ExecutionListenerManager is
functionRegistry Attribute
functionRegistry is
experimentalMethods Attribute
experimentalMethods is
newSession method
newSession(): SparkSession
newSession creates (starts) a new SparkSession (with the current SparkContext and
SharedState).
scala> println(sc.version)
2.0.0-SNAPSHOT
scala> val newSession = spark.newSession
newSession: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@122f
58a
sharedState Attribute
sharedState points at the current SharedState.
SharedState
SharedState represents the shared state across all active SQL sessions (i.e. SparkSession instances). It is initialized lazily, i.e. when a new session is created or when the shared services are accessed. It is created with a SparkContext.
FIXME
createDataset is an experimental API to create a Dataset from a local Scala collection, i.e.
Seq[T] or Java's List[T] , or an RDD[T] .
Note
You'd rather not use createDataset since you have the Scala implicits and the
toDS method.
read method returns a DataFrameReader that is used to read data from external storage
conf returns the current runtime configuration (as RuntimeConfig ) that wraps SQLConf.
Caution
FIXME
sessionState
sessionState is a transient lazy value that represents the current SessionState.
sessionState is a lazily-created value based on the internal
Internally, it creates a Dataset using the current SparkSession and the plan (based on the
input sqlText and parsed using ParserInterface.parsePlan available using
sessionState.sqlParser).
Caution
import org.apache.spark.sql.SparkSession
val builder = SparkSession.builder
Settings
spark.sql.catalogImplementation
spark.sql.catalogImplementation (default: in-memory ) is an internal setting with two
SQLConf
SQLConf is a key-value configuration store for parameters and hints used in Spark SQL. It
You can use clear to remove all the parameters and hints in SQLConf .
FIXME
spark.sql.streaming.fileSink.log.deletion
spark.sql.streaming.fileSink.log.deletion (default: true ) is an internal flag to control
spark.sql.streaming.fileSink.log.compactInterval
spark.sql.streaming.fileSink.log.compactInterval
spark.sql.streaming.fileSink.log.cleanupDelay
spark.sql.streaming.fileSink.log.cleanupDelay
spark.sql.streaming.schemaInference
spark.sql.streaming.schemaInference
Catalog
Catalog is the interface to work with database(s), local and external tables, functions, table
Catalog Contract
package org.apache.spark.sql.catalog
abstract class Catalog {
def currentDatabase: String
def setCurrentDatabase(dbName: String): Unit
def listDatabases(): Dataset[Database]
def listTables(): Dataset[Table]
def listTables(dbName: String): Dataset[Table]
def listFunctions(): Dataset[Function]
def listFunctions(dbName: String): Dataset[Function]
def listColumns(tableName: String): Dataset[Column]
def listColumns(dbName: String, tableName: String): Dataset[Column]
def createExternalTable(tableName: String, path: String): DataFrame
def createExternalTable(tableName: String, path: String, source: String): DataFrame
def createExternalTable(
tableName: String,
source: String,
options: Map[String, String]): DataFrame
def createExternalTable(
tableName: String,
source: String,
schema: StructType,
options: Map[String, String]): DataFrame
def dropTempView(viewName: String): Unit
def isCached(tableName: String): Boolean
def cacheTable(tableName: String): Unit
def uncacheTable(tableName: String): Unit
def clearCache(): Unit
def refreshTable(tableName: String): Unit
def refreshByPath(path: String): Unit
}
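A quick way to explore the Catalog is through SparkSession in spark-shell (a minimal sketch; the output depends on your session):
spark.catalog.currentDatabase      // e.g. "default"
spark.catalog.listDatabases.show(false)
spark.catalog.listTables.show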
CatalogImpl
CatalogImpl is the one and only Catalog that relies on a per-session SessionCatalog
Dataset
Dataset is the API for working with structured data, i.e. records with a known schema. It is
the result of a SQL query against files or databases. Dataset API comes with declarative and type-safe operators (that improve on the experience of data processing using DataFrames).
Dataset was first introduced in Apache Spark 1.6.0 as an experimental
feature, but has since turned itself into a fully supported API.
Note
Datasets offer the performance optimizations of DataFrames and the strong static type-safety of Scala. The last feature of bringing the strong type-safety to DataFrame makes Dataset so appealing. All the features together give you a more functional programming interface to work with structured data.
It is only with Datasets to have syntax and analysis checks at compile time (that is not
possible using DataFrame, regular SQL queries or even RDDs).
Using Dataset objects turns DataFrames of Row instances into DataFrames of case
classes with proper names and types (following their equivalents in the case classes).
Instead of using indices to access respective fields in a DataFrame and cast it to a type, all
this is automatically handled by Datasets and checked by the Scala compiler.
Datasets use Catalyst Query Optimizer and Tungsten to optimize their performance.
A Dataset object requires a SQLContext, a QueryExecution, and an Encoder. In some
cases, a Dataset can also be seen as a pair of LogicalPlan in a given SQLContext.
Note
You can convert a type-safe Dataset to a "untyped" DataFrame (see Type Conversions to
Dataset[T]) or access the RDD that sits underneath (see Converting Datasets into RDDs
(using rdd method)). It is supposed to give you a more pleasant experience while
transitioning from legacy RDD-based or DataFrame-based APIs.
The default storage level for Datasets is MEMORY_AND_DISK because recomputing the in-memory columnar representation of the underlying table is expensive. See Persisting
Dataset (persist method) in this document.
Spark 2.0 has introduced a new query model called Structured Streaming for continuous incremental execution of structured queries. That made it possible to treat Datasets both as static, bounded data and as streaming, unbounded data with one single API.
join
Caution
FIXME
where
Caution
FIXME
groupBy
Caution
FIXME
foreachPartition method
foreachPartition(f: Iterator[T] => Unit): Unit
Note
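A minimal sketch (in spark-shell ) that prints, on the executors, the number of records per partition:
val ds = spark.range(10)
ds.foreachPartition { (it: Iterator[java.lang.Long]) =>
  // executed once per partition
  println(s"records in this partition: ${it.size}")
}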
mapPartitions method
mapPartitions returns a new Dataset (of type U ) with the function func applied to each
partition.
Caution
FIXME Example
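A minimal sketch that computes per-partition record counts (in spark-shell ):
import spark.implicits._  // for the Encoder[Int] of the result
val ds = spark.range(10)
// the function receives the whole partition iterator and returns an iterator of results
val counts = ds.mapPartitions { it => Iterator(it.size) }
counts.show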
flatMap returns a new Dataset (of type U ) with all records (of type T ) mapped over using the function func and then flattened.
persist caches the Dataset using the default storage level MEMORY_AND_DISK or
newLevel .
Note
object of SQLContext.
Note
scala> :imports
1) import spark.implicits._ (59 terms, 38 are implicit)
2) import spark.sql (1 terms)
import spark.implicits._
case class Token(name: String, productId: Int, score: Double)
val data = Seq(
Token("aaa", 100, 0.12),
Token("aaa", 200, 0.29),
Token("bbb", 200, 0.53),
Token("bbb", 300, 0.42))
// Transform data to a Dataset[Token]
// It doesn't work with type annotation yet
// https://issues.apache.org/jira/browse/SPARK-13456
val ds: Dataset[Token] = data.toDS
// Transform data into a DataFrame with no explicit schema
val df = data.toDF
// Transform DataFrame into a Dataset
val ds = df.as[Token]
scala> ds.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> ds.printSchema
root
|-- name: string (nullable = true)
|-- productId: integer (nullable = false)
|-- score: double (nullable = false)
// In DataFrames we work with Row instances
scala> df.map(_.getClass.getName).show(false)
+--------------------------------------------------------------+
|value |
+--------------------------------------------------------------+
|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|
|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|
|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|
|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|
+--------------------------------------------------------------+
// In Datasets we work with case class instances
scala> ds.map(_.getClass.getName).show(false)
+---------------------------+
|value |
+---------------------------+
|$line40.$read$$iw$$iw$Token|
|$line40.$read$$iw$$iw$Token|
|$line40.$read$$iw$$iw$Token|
|$line40.$read$$iw$$iw$Token|
+---------------------------+
scala> ds.map(_.name).show
+-----+
|value|
+-----+
| aaa|
| aaa|
| bbb|
| bbb|
+-----+
Schema
You may also use the following methods to learn about the schema:
printSchema(): Unit
Tip
explain
Supported Types
Caution
toJSON
toJSON maps the content of Dataset to a Dataset of JSON strings.
Note
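For example:
scala> spark.range(3).toJSON.show(false)
+--------+
|value   |
+--------+
|{"id":0}|
|{"id":1}|
|{"id":2}|
+--------+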
explain
explain(): Unit
explain(extended: Boolean): Unit
explain prints the logical and physical plans to the console. You can use it for debugging.
Tip
If you are serious about query debugging you could also use the Debugging
Query Execution facility.
val ds = spark.range(10)
scala> ds.explain(extended = true)
== Parsed Logical Plan ==
Range 0, 10, 1, 8, [id#9L]
== Analyzed Logical Plan ==
id: bigint
Range 0, 10, 1, 8, [id#9L]
== Optimized Logical Plan ==
Range 0, 10, 1, 8, [id#9L]
== Physical Plan ==
WholeStageCodegen
: +- Range 0, 1, 8, 10, [id#9L]
select
select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
select[U1, U2, U3](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
select[U1, U2, U3, U4](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
select[U1, U2, U3, U4, U5](
c1: TypedColumn[T, U1],
c2: TypedColumn[T, U2],
c3: TypedColumn[T, U3],
c4: TypedColumn[T, U4],
c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]
Caution
FIXME
selectExpr
selectExpr(exprs: String*): DataFrame
val ds = spark.range(5)
scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
| random|
+-------------------+
| 0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+
Internally, it executes select with every expression in exprs mapped to Column (using
SparkSqlParser.parseExpression).
scala> ds.select(expr("rand() as random")).show
+------------------+
| random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+
Note
isStreaming
isStreaming returns true when the Dataset contains StreamingRelation or StreamingExecutionRelation streaming sources.
Note
Note
randomSplit
randomSplit(weights: Array[Double]): Array[Dataset[T]]
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
You can define seed and if you don't, a random seed will be used.
Note
val ds = spark.range(10)
scala> ds.randomSplit(Array[Double](2, 3)).foreach(_.show)
+---+
| id|
+---+
| 0|
| 1|
| 2|
+---+
+---+
| id|
+---+
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
Note
Queryable
Caution
FIXME
withNewExecutionId is a private[sql] method that executes the input body action using
Encoders allow for significantly faster serialization and deserialization (compared to the default Java or Kryo serializers).
Note
Encoder works with the type of the accompanying Dataset. You can create custom
encoders using Encoders object. Encoders for many Scala types are however available
through SparkSession.implicits object so in most cases you don't need to worry about them
whatsoever and simply import the implicits object.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
Encoders map columns (of your dataset) to fields (of your JVM object) by name. It is by
Encoders that you can bridge JVM objects to data sources (CSV, JDBC, Parquet, Avro,
JSON, Cassandra, Elasticsearch, memsql) and vice versa.
import org.apache.spark.sql.Encoders
case class Person(id: Int, name: String, speaksPolish: Boolean)
scala> val personEncoder = Encoders.product[Person]
personEncoder: org.apache.spark.sql.Encoder[Person] = class[id[0]: int, name[0]: strin
g, speaksPolish[0]: boolean]
scala> personEncoder.schema
res11: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,f
alse), StructField(name,StringType,true), StructField(speaksPolish,BooleanType,false))
scala> personEncoder.clsTag
res12: scala.reflect.ClassTag[Person] = Person
You can find methods to create encoders for Java's object types, e.g. Boolean , Integer ,
Long , Double , String , java.sql.Timestamp or Byte array, that could be composed to
create more advanced encoders for Java bean classes (using bean method).
import org.apache.spark.sql.Encoders
scala> Encoders.STRING
res2: org.apache.spark.sql.Encoder[String] = class[value[0]: string]
You can create encoders for Scala's tuples and case classes, Int , Long , Double , etc.
import org.apache.spark.sql.Encoders
scala> Encoders.tuple(Encoders.scalaLong, Encoders.STRING, Encoders.scalaBoolean)
res9: org.apache.spark.sql.Encoder[(Long, String, Boolean)] = class[_1[0]: bigint, _2[0
]: string, _3[0]: boolean]
Columns
Caution
FIXME
over function
over(window: expressions.WindowSpec): Column
over function defines a windowing column that allows for window computations to be applied to a window of rows.
cast
cast method casts a column to a data type. It makes for type-safe maps with Row objects.
It uses CatalystSqlParser to parse the data type from its canonical string representation.
cast Example
scala> val df = Seq((0f, "hello")).toDF("label", "text")
df: org.apache.spark.sql.DataFrame = [label: float, text: string]
scala> df.printSchema
root
|-- label: float (nullable = false)
|-- text: string (nullable = true)
// without cast
import org.apache.spark.sql.Row
scala> df.select("label").map { case Row(label) => label.getClass.getName }.show(false
)
+---------------+
|value |
+---------------+
|java.lang.Float|
+---------------+
// with cast
import org.apache.spark.sql.types.DoubleType
scala> df.select(col("label").cast(DoubleType)).map { case Row(label) => label.getClas
s.getName }.show(false)
+----------------+
|value |
+----------------+
|java.lang.Double|
+----------------+
Schema
Caution
See org.apache.spark.package.scala.
A DataFrame is a distributed collection of tabular data organized into rows and named columns. It is conceptually equivalent to a table in a relational database and provides operations to project ( select ), filter , intersect , join , group , sort , aggregate , or convert to an RDD (consult DataFrame API)
data.groupBy('Product_ID).sum('Score)
Spark SQL borrowed the concept of DataFrame from pandas' DataFrame and made it
immutable, parallel (one machine, perhaps with many processors and cores) and
distributed (many machines, perhaps with many processors and cores).
Note
Hey, big data consultants, time to help teams migrate the code from pandas' DataFrame into Spark's DataFrames (at least to PySpark's DataFrame) and offer services to set up large clusters!
DataFrames in Spark SQL strongly rely on the features of RDD - it's basically an RDD exposed as a structured DataFrame by appropriate operations to handle very big data from day one. So, petabytes of data should not scare you (unless you're the administrator who has to create such a clustered Spark environment - contact me when you feel alone with the task).
You can create DataFrames by loading data from structured files (JSON, Parquet, CSV),
RDDs, tables in Hive, or external databases (JDBC). You can also create DataFrames from
scratch and build upon them (as in the above example). See DataFrame API. You can read
any format given you have appropriate Spark SQL extension of DataFrameReader to format
the dataset appropriately.
Caution
Filtering
DataFrames use the Catalyst query optimizer to produce efficient queries (and so they are
supposed to be faster than corresponding RDD-based queries).
Note
Your DataFrames can also be type-safe and moreover further improve their
performance through specialized encoders that can significantly cut serialization
and deserialization times.
You can enforce types on generic rows and hence bring type safety (at compile time) by
encoding rows into type-safe Dataset object. As of Spark 2.0 it is a preferred way of
developing Spark applications.
Features of DataFrame
A DataFrame is a collection of "generic" Row instances (as RDD[Row] ) and a schema (as
StructType ).
Note
A schema describes the columns and for each column it defines the name, the type and
whether or not it accepts empty values.
StructType
Caution
FIXME
as method
as gives you a conversion from Dataset[Row] to Dataset[T] .
withColumn method returns a new DataFrame with a new column col added under the name colName .
Note
FIXME
The Apache Hive data warehouse software facilitates querying and managing large
datasets residing in distributed storage.
Using toDF
After you import spark.implicits._ (which is done for you by Spark shell) you may apply
toDF method to convert objects to DataFrames.
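For instance, the following minimal sketch (assuming spark-shell, so spark.implicits._ is already imported; the Person case class and the values are made up for illustration) converts a local collection of case class instances to a DataFrame:
case class Person(name: String, age: Int)
val people = Seq(Person("Jacek", 42), Person("Agata", 41))
// toDF comes from spark.implicits._
val df = people.toDF
df.show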
scala> val headers = lines.first
headers: String = auctionid,bid,bidtime,bidder,bidderrate,openbid,price
scala> import org.apache.spark.sql.types.{StructField, StringType}
import org.apache.spark.sql.types.{StructField, StringType}
scala> val fs = headers.split(",").map(f => StructField(f, StringType))
fs: Array[org.apache.spark.sql.types.StructField] = Array(StructField(auctionid,StringType,true), StructField(bid,StringType,true), StructField(bidtime,StringType,true), StructField(bidder,StringType,true), StructField(bidderrate,StringType,true), StructField(openbid,StringType,true), StructField(price,StringType,true))
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val schema = StructType(fs)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(auctionid,StringType,true), StructField(bid,StringType,true), StructField(bidtime,StringType,true), StructField(bidder,StringType,true), StructField(bidderrate,StringType,true), StructField(openbid,StringType,true), StructField(price,StringType,true))
scala> val noheaders = lines.filter(_ != headers)
noheaders: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at filter at <console>:33
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rows = noheaders.map(_.split(",")).map(a => Row.fromSeq(a))
rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at map at <console>:35
scala> val auctions = spark.createDataFrame(rows, schema)
auctions: org.apache.spark.sql.DataFrame = [auctionid: string, bid: string, bidtime: string, bidder: string, bidderrate: string, openbid: string, price: string]
scala> auctions.printSchema
root
|-- auctionid: string (nullable = true)
|-- bid: string (nullable = true)
|-- bidtime: string (nullable = true)
|-- bidder: string (nullable = true)
|-- bidderrate: string (nullable = true)
|-- openbid: string (nullable = true)
|-- price: string (nullable = true)
scala> auctions.dtypes
res28: Array[(String, String)] = Array((auctionid,StringType), (bid,StringType), (bidtime,StringType), (bidder,StringType), (bidderrate,StringType), (openbid,StringType), (price,StringType))
scala> auctions.show(5)
+----------+----+-----------+-----------+----------+-------+-----+
| auctionid| bid| bidtime| bidder|bidderrate|openbid|price|
+----------+----+-----------+-----------+----------+-------+-----+
|1638843936| 500|0.478368056| kona-java| 181| 500| 1625|
|1638843936| 800|0.826388889| doc213| 60| 500| 1625|
|1638843936| 600|3.761122685| zmxu| 7| 500| 1625|
|1638843936|1500|5.226377315|carloss8055| 5| 500| 1625|
|1638843936|1600| 6.570625| jdrinaz| 6| 500| 1625|
+----------+----+-----------+-----------+----------+-------+-----+
only showing top 5 rows
Support for CSV data sources is available by default in Spark 2.0.0. No need for
an external module.
Among the supported structured data (file) formats are (consult Specifying Data Format
(format method) for DataFrameReader ):
JSON
parquet
JDBC
ORC
Tables in Hive and any JDBC-compliant database
libsvm
val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@59e67a18
reader.parquet("file.parquet")
reader.json("file.json")
reader.format("libsvm").load("sample_libsvm_data.txt")
Querying DataFrame
Note
This variant (in which you use stringified column names) can only select existing
columns, i.e. you cannot create new ones using select expressions.
scala> predictions.printSchema
root
|-- id: long (nullable = false)
|-- topic: string (nullable = true)
|-- text: string (nullable = true)
|-- label: double (nullable = true)
|-- words: array (nullable = true)
| |-- element: string (containsNull = true)
|-- features: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = true)
scala> predictions.select("label", "words").show
+-----+-------------------+
|label| words|
+-----+-------------------+
| 1.0| [hello, math!]|
| 0.0| [hello, religion!]|
| 1.0|[hello, phy, ic, !]|
+-----+-------------------+
scala> auctions.groupBy("bidder").count().show(5)
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| [email protected]| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
+--------------------+-----+
only showing top 5 rows
In the following example you query for the top 5 of the most active bidders.
Note the tiny $ and desc together with the column name to sort the rows by.
scala> auctions.groupBy("bidder").count().sort($"count".desc).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> auctions.groupBy("bidder").count().sort(desc("count")).show(5)
+------------+-----+
| bidder|count|
+------------+-----+
| lass1004| 22|
| pascal1666| 19|
| freembd| 17|
|restdynamics| 17|
| happyrova| 17|
+------------+-----+
only showing top 5 rows
scala> df.select("auctionid").distinct.count
res88: Long = 97
scala> df.groupBy("bidder").count.show
+--------------------+-----+
| bidder|count|
+--------------------+-----+
| dennisthemenace1| 1|
| amskymom| 5|
| [email protected]| 4|
| millyjohn| 1|
|ykelectro@hotmail...| 2|
| [email protected]| 1|
| rrolex| 1|
| bupper99| 2|
| cheddaboy| 2|
| adcc007| 1|
| varvara_b| 1|
| yokarine| 4|
| steven1328| 1|
| anjara| 2|
| roysco| 1|
|lennonjasonmia@ne...| 2|
|northwestportland...| 4|
| bosspad| 10|
| 31strawberry| 6|
| nana-tyler| 11|
+--------------------+-----+
only showing top 20 rows
Using SQL
Register a DataFrame as a named temporary table to run SQL.
scala> df.registerTempTable("auctions") (1)
scala> val sql = spark.sql("SELECT count(*) AS count FROM auctions")
sql: org.apache.spark.sql.DataFrame = [count: bigint]
scala> sql.explain
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[
count#148L])
TungstenExchange SinglePartition
TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], outp
ut=[currentCount#156L])
TungstenProject
Scan PhysicalRDD[auctionid#49,bid#50,bidtime#51,bidder#52,bidderrate#53,openbid#54
,price#55]
scala> sql.show
+-----+
|count|
+-----+
| 1348|
+-----+
scala> val count = sql.collect()(0).getLong(0)
count: Long = 1348
Filtering
scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> df.filter($"name".like("a%")).show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
+----+---------+-----+
DataFrame.explain
When performance is the issue you should use DataFrame.explain(true) .
Caution
Example Datasets
eBay online auctions
SFPD Crime Incident Reporting system
Row
Row is a data abstraction of an ordered collection of fields that can be accessed by an ordinal / an index (aka generic access by ordinal), a name (aka native primitive access) or using Scala's pattern matching. A Row instance may or may not have a schema.
The traits of Row :
length or size - Row knows the number of elements (columns).
schema - Row knows the schema
Row belongs to the org.apache.spark.sql package.
import org.apache.spark.sql.Row
Field Access
Fields of a Row instance can be accessed by index (starting from 0 ) using apply or
get .
Note
Generic access by ordinal (using apply or get ) returns a value of type Any .
You can query for fields with their proper types using getAs with an index
val row = Row(1, "hello")
scala> row.getAs[Int](0)
res1: Int = 1
scala> row.getAs[String](1)
res2: String = hello
FIXME
Note
row.getAs[String](null)
Schema
A Row instance can have a schema defined.
Note
Unless you are instantiating Row yourself (using Row Object), a Row always has a schema.
Note
Row Object
Row companion object offers factory methods to create Row instances from a collection of elements ( apply ) or a sequence of elements ( fromSeq ).
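A minimal sketch of the two factory methods mentioned above:
import org.apache.spark.sql.Row
// varargs factory
val r1 = Row(1, "hello")
// from a Scala collection
val r2 = Row.fromSeq(Seq(1, "hello"))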
DataFrameReader
DataFrameReader is an interface to return a DataFrame from many storage formats in external storage systems. It has direct support for many file formats and an interface for new ones. It assumes parquet as the default data source format, which you can change using the spark.sql.sources.default setting.
Note
Refer to Schema.
load methods
load(): DataFrame
load(path: String): DataFrame
stream methods
stream(): DataFrame
stream(path: String): DataFrame
Caution
JSON
CSV
parquet
ORC
text
json method
json(path: String): DataFrame
json(paths: String*): DataFrame
json(jsonRDD: RDD[String]): DataFrame
csv method
csv(paths: String*): DataFrame
parquet method
parquet(paths: String*): DataFrame
none or uncompressed
snappy - the default codec in Spark 2.0.0.
gzip - the default codec in Spark before 2.0.0
lzo
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.
scala:137)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:134)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:117)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.sca
la:65)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:65)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:390)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:230)
... 48 elided
orc method
orc(path: String): DataFrame
Optimized Row Columnar (ORC) file format is a highly efficient columnar format to store
Hive data with more than 1,000 columns and improve performance. ORC format was
introduced in Hive version 0.11 to use and retain the type information from the table
definition.
Tip
Read ORC Files document to learn about the ORC file format.
text method
text method loads a text file.
Example
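A minimal sketch (the file path is made up for illustration); text returns a DataFrame with a single value column of type string:
// every line of the file becomes a row in the "value" column
val lines = spark.read.text("README.md")
lines.printSchema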
spark.sessionState.sqlParser.parseTableIdentifier(tableName) and
spark.sessionState.catalog.lookupRelation . Would be nice to learn a bit
jdbc method
Note
jdbc allows you to create a DataFrame that represents a table in a database available at url .
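A minimal sketch of reading a table over JDBC (the URL, table name, credentials and driver are made up for illustration):
import java.util.Properties

val props = new Properties()
props.put("user", "dbuser")
props.put("password", "dbpass")
props.put("driver", "org.postgresql.Driver")

// loads the "people" table as a DataFrame
val people = spark.read.jdbc("jdbc:postgresql://localhost/testdb", "people", props)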
DataFrameWriter
DataFrameWriter is used to write a DataFrame to external storage systems in batch or
streaming fashions.
Use write method on a DataFrame to access it.
import org.apache.spark.sql.{DataFrame, DataFrameWriter}
val df: DataFrame = ...
val writer: DataFrameWriter = df.write
It has direct support for many file formats and JDBC databases, and an interface for new ones. It assumes parquet as the default data source (but you can change the format using the spark.sql.sources.default setting or format method).
As of Spark 2.0.0 DataFrameWriter offers methods for Structured Streaming:
trigger to set the Trigger for a stream query.
queryName
startStream to start a continuous write.
jdbc
jdbc(url: String, table: String, connectionProperties: Properties): Unit
jdbc method saves the content of the DataFrame to an external database table via JDBC.
You can use mode to control save mode, i.e. what happens when an external table exists
when save is executed.
It is assumed that the jdbc save pipeline is not partitioned and bucketed.
All options are overridden by the input connectionProperties .
The required options are:
driver which is the class name of the JDBC driver (that is passed to Spark's own DriverRegistry.register and later used to connect(url, properties) ).
When the table exists and the Overwrite save mode is in use, DROP TABLE table is executed.
It creates the input table (using CREATE TABLE table (schema) where schema is the
schema of the DataFrame ).
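A minimal sketch of the jdbc save pipeline described above (the URL, table name, credentials and driver are made up for illustration):
import java.util.Properties

val props = new Properties()
props.put("user", "dbuser")
props.put("password", "dbpass")
props.put("driver", "org.postgresql.Driver") // the required driver option

df.write
  .mode("overwrite") // DROP TABLE + CREATE TABLE as described above
  .jdbc("jdbc:postgresql://localhost/testdb", "people", props)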
bucketBy method
Caution
FIXME
partitionBy method
partitionBy(colNames: String*): DataFrameWriter[T]
Caution
FIXME
You can control the behaviour of write using the mode method, i.e. what happens when an external file or table exists when save is executed.
SaveMode.Ignore
SaveMode.ErrorIfExists
SaveMode.Overwrite
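SaveMode.Append is also available. A minimal sketch of using mode (the output path is made up for illustration):
import org.apache.spark.sql.SaveMode

// fail if the target already exists (the default behaviour)
df.write.mode(SaveMode.ErrorIfExists).parquet("people.parquet")

// replace whatever is already there
df.write.mode(SaveMode.Overwrite).parquet("people.parquet")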
trigger
trigger(trigger: Trigger): DataFrameWriter
trigger method sets the time interval known as a trigger (as Trigger object) for stream
query.
Note
Tip
Note
Whether or not you have to specify path option depends on the DataSource in
use.
Recognized options:
queryName is the name of active streaming query.
checkpointLocation is the directory for checkpointing.
Note
Note
FIXME
FIXME
Parquet
Caution
Note
FIXME
Parquet is the default data source format.
DataSource
DataSource case class belongs to the Data Source API (along with DataFrameReader and
DataFrameWriter).
Caution
createSource
createSource(metadataPath: String): Source
Caution
FIXME
sourceSchema
Caution
FIXME
inferFileFormatSchema
inferFileFormatSchema(format: FileFormat): StructType
DataSourceRegister
DataSourceRegister is an interface to register DataSources under their (shorter) aliases. It allows users to use a data source's alias as the format type instead of its fully qualified class name.
package org.apache.spark.sql.sources
trait DataSourceRegister {
def shortName(): String
}
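A minimal sketch of a custom data source registering itself under an alias (the class name and the alias are made up for illustration; a real data source would also extend e.g. RelationProvider or FileFormat, and be listed in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister so it can be discovered):
import org.apache.spark.sql.sources.DataSourceRegister

class MyFormatDataSource extends DataSourceRegister {
  // users can then write spark.read.format("myformat")
  override def shortName(): String = "myformat"
}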
The org.apache.spark.sql.functions object offers many built-in functions to process values in columns in DataFrames.
Note
You can access the functions using the following import statement:
import org.apache.spark.sql.functions._
There are dozens of functions in the functions object. Some functions transform Column objects (or column names) into other Column objects, while others transform a DataFrame into a DataFrame .
The functions are grouped by functional areas:
Defining UDFs
String functions
split
upper (chained with reverse )
Aggregate functions
Non-aggregate functions (aka normal functions)
struct
broadcast (for DataFrame )
expr
Date time functions
and others
Tip
window
Caution
FIXME
The udf family of functions allows you to create user-defined functions (UDFs) based on a user-defined function in Scala. It accepts a function f of 0 to 10 arguments, and the input and output types are automatically inferred (from the types of the respective arguments and the result type of f ).
import org.apache.spark.sql.functions._
val _length: String => Int = _.length
val _lengthUDF = udf(_length)
// define a dataframe
val df = sc.parallelize(0 to 3).toDF("num")
// apply the user-defined function to "num" column
scala> df.withColumn("len", _lengthUDF($"num")).show
+---+---+
|num|len|
+---+---+
| 0| 1|
| 1| 1|
| 2| 1|
| 3| 1|
+---+---+
udf(f: AnyRef, dataType: DataType) allows you to use a Scala closure for the function
argument (as f ) and explicitly declaring the output data type (as dataType ).
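A minimal sketch of this variant (assuming spark-shell); since the closure is passed as AnyRef, the result type has to be declared explicitly:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

val lengthUDF = udf((s: String) => s.length, IntegerType)

val df = Seq("hello", "world!").toDF("text")
df.withColumn("len", lengthUDF($"text")).show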
String functions
split function
split(str: Column, pattern: String): Column
split function splits str column using pattern . It returns a new Column .
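A minimal sketch (assuming spark-shell); each value of the new column is an array of strings:
import org.apache.spark.sql.functions.split

val df = Seq("hello,world,and,spark").toDF("csv")
df.select(split($"csv", ",") as "words").show(false)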
Note
Note
upper function
upper(e: Column): Column
upper function converts a string column into one with all letters in upper case. It returns a new Column .
Note
The following example uses two functions that accept a Column and return
another to showcase how to chain them.
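A minimal sketch of such chaining (assuming spark-shell):
import org.apache.spark.sql.functions.{reverse, upper}

val df = Seq("hello", "world!").toDF("text")
// upper-case the column first, then reverse the result
df.select($"text", reverse(upper($"text")) as "reversed_upper").show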
Non-aggregate functions
They are also called normal functions.
struct functions
struct(cols: Column*): Column
struct(colName: String, colNames: String*): Column
struct family of functions allows you to create a new struct column based on a collection of
Column or their names.
Note
The difference between struct and another similar array function is that the
types of the columns can be different (in struct ).
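A minimal sketch (assuming spark-shell) that combines two columns of different types into a single struct column:
import org.apache.spark.sql.functions.struct

val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")
df.select(struct($"id", $"text") as "id_and_text").printSchema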
broadcast function
broadcast function creates a new DataFrame (out of the input DataFrame ) and marks it to be broadcast when used in a join operator.
expr function
expr(expr: String): Column
expr function parses the input expr SQL string to a Column it represents.
Internally, expr uses the active session's sqlParser or creates a new SparkSqlParser to call its parseExpression method.
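A minimal sketch (assuming spark-shell):
import org.apache.spark.sql.functions.expr

val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")
// the SQL expression is parsed into a Column
df.select($"text", expr("length(text)") as "text_length").show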
Aggregation (GroupedData)
Note
You can use DataFrame to compute aggregates over a collection of (grouped) rows.
DataFrame offers the following aggregate operators:
groupBy
rollup
cube
groupBy Operator
Note
The following session uses the data setup as described in Test Setup section
below.
scala> df.show
+----+---------+-----+
|name|productId|score|
+----+---------+-----+
| aaa| 100| 0.12|
| aaa| 200| 0.29|
| bbb| 200| 0.53|
| bbb| 300| 0.42|
+----+---------+-----+
scala> df.groupBy("name").count.show
+----+-----+
|name|count|
+----+-----+
| aaa| 2|
| bbb| 2|
+----+-----+
scala> df.groupBy("name").max("score").show
+----+----------+
|name|max(score)|
+----+----------+
| aaa| 0.29|
| bbb| 0.53|
+----+----------+
scala> df.groupBy("name").sum("score").show
+----+----------+
|name|sum(score)|
+----+----------+
| aaa| 0.41|
| bbb| 0.95|
+----+----------+
scala> df.groupBy("productId").sum("score").show
+---------+------------------+
|productId| sum(score)|
+---------+------------------+
| 300| 0.42|
| 100| 0.12|
| 200|0.8200000000000001|
+---------+------------------+
GroupedData
GroupedData is the result of executing the groupBy operator on a DataFrame . It offers the following operators to work on groups of rows:
count
mean
max
avg
min
sum
pivot
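A minimal sketch of pivot using the df from the Test Setup section below:
// one column per distinct productId, with the sum of score per name
df.groupBy("name").pivot("productId").sum("score").show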
Test Setup
This is a setup for learning GroupedData . Paste it into Spark Shell using :paste .
import spark.implicits._
case class Token(name: String, productId: Int, score: Double)
val data = Token("aaa", 100, 0.12) ::
Token("aaa", 200, 0.29) ::
Token("bbb", 200, 0.53) ::
Token("bbb", 300, 0.42) :: Nil
val df = data.toDF.cache (1)
1. Cache the DataFrame so the following queries won't load data over and over again.
UDFs (User-Defined Functions)
User-Defined Functions (aka UDFs) are a feature of Spark SQL to define new Column-based functions that extend the vocabulary of Spark SQL's DSL to transform Datasets.
Note
Tip
You define a new UDF by defining a Scala function as an input parameter of udf function.
You can use Scala functions of up to 10 input parameters. See the section udf Functions (in
functions object).
val df = Seq((0, "hello"), (1, "world")).toDF("id", "text")
// Define a "regular" Scala function
val upper: String => String = _.toUpperCase
import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)
scala> df.withColumn("upper", upperUDF('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+
org.apache.spark.sql.functions object comes with udf function to let you define a UDF for
a Scala function f .
val df = Seq(
(0, "hello"),
(1, "world")).toDF("id", "text")
// Define a "regular" Scala function
// It's a clone of upper UDF
val toUpper: String => String = _.toUpperCase
import org.apache.spark.sql.functions.udf
val upper = udf(toUpper)
scala> df.withColumn("upper", upper('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+
// You could have also defined the UDF this way
val upperUDF = udf { s: String => s.toUpperCase }
// or even this way
val upperUDF = udf[String, String](_.toUpperCase)
scala> df.withColumn("upper", upperUDF('text)).show
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
| 0|hello|HELLO|
| 1|world|WORLD|
+---+-----+-----+
Tip
Spark SQL supports three kinds of window aggregate functions: ranking functions, analytic functions, and aggregate functions.
A window specification defines the partitioning, ordering, and frame boundaries.
Window functions are also called over functions due to how they are applied using Column's over function.
Although similar to aggregate functions, a window function does not group rows into a single
output row and retains their separate identities. A window function can access rows that are
linked to the current row.
Tip
                      SQL             DataFrame API
Ranking functions:    RANK            rank
                      DENSE_RANK      dense_rank
                      PERCENT_RANK    percent_rank
                      NTILE           ntile
                      ROW_NUMBER      row_number
Analytic functions:   CUME_DIST       cume_dist
                      LAG             lag
                      LEAD            lead
For aggregate functions, you can use the existing aggregate functions as window functions,
e.g. sum , avg , min , max and count .
You can mark a function as a window function with the OVER clause after the function in SQL, e.g. avg(revenue) OVER (), or with the over method on a function in the Dataset API, e.g. rank().over() .
When executed, a window function computes a value for each row in a window.
Note
A window specification defines which rows are included in a window (aka a frame), i.e. set
of rows, that is associated with a given input row. It does so by partitioning an entire data
set and specifying frame boundary with ordering.
Note
import org.apache.spark.sql.expressions.Window
scala> val byHTokens = Window.partitionBy('token startsWith "h")
byHTokens: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@574985d8
Besides the two above, you can also use the following methods (that correspond to the
methods in Window object):
partitionBy
orderBy
Window object
Window object provides functions to define windows (as WindowSpec instances).
Window object lives in org.apache.spark.sql.expressions package. Import it to use Window
functions.
import org.apache.spark.sql.expressions.Window
There are two families of the functions available in Window object that create WindowSpec
instance for one or many Column instances:
partitionBy
orderBy
partitionBy
partitionBy(colName: String, colNames: String*): WindowSpec
partitionBy(cols: Column*): WindowSpec
partitionBy creates an instance of WindowSpec with partition expression(s) defined for one
or more columns.
// partition records into two groups
// * tokens starting with "h"
// * others
val byHTokens = Window.partitionBy('token startsWith "h")
// count the sum of ids in each group
val result = tokens.select('*, sum('id) over byHTokens as "sum over h tokens").orderBy('id)
scala> result.show
+---+-----+-----------------+
| id|token|sum over h tokens|
+---+-----+-----------------+
| 0|hello| 4|
| 1|henry| 4|
| 2| and| 2|
| 3|harry| 4|
+---+-----+-----------------+
orderBy
orderBy(colName: String, colNames: String*): WindowSpec
orderBy(cols: Column*): WindowSpec
Window Examples
Two samples from org.apache.spark.sql.expressions.Window scaladoc:
// PARTITION BY country ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.partitionBy('country).orderBy('date).rowsBetween(Long.MinValue, 0)
Frame
At its core, a window function calculates a return value for every input row of a table based
on a group of rows, called the frame. Every input row can have a unique frame associated
with it.
When you define a frame you have to specify three components of a frame specification: the start boundary, the end boundary, and the type.
Types of boundaries (two positions and three offsets):
UNBOUNDED PRECEDING - the first row of the partition
UNBOUNDED FOLLOWING - the last row of the partition
CURRENT ROW
<value> PRECEDING
<value> FOLLOWING
In the current implementation of WindowSpec you can use two methods to define a frame:
rowsBetween
rangeBetween
Examples
Top N per Group
Top N per Group is useful when you need to compute the first and second best-sellers in each category.
This example is borrowed from an excellent article Introducing Window
Functions in Spark SQL.
Note
product      category     revenue
Thin         cell phone   6000
Normal       tablet       1500
Mini         tablet       5500
Ultra thin   cell phone   5000
Very thin    cell phone   6000
Big          tablet       2500
Bendable     cell phone   3000
Foldable     cell phone   3000
Pro          tablet       4500
Pro2         tablet       6500
Question: What are the best-selling and the second best-selling products in every category?
The question boils down to ranking products in a category based on their revenue, and to pick the best-selling and the second best-selling products based on the ranking.
import org.apache.spark.sql.expressions.Window
val overCategory = Window.partitionBy('category).orderBy('revenue.desc)
val rank = dense_rank.over(overCategory)
val ranked = data.withColumn("rank", dense_rank.over(overCategory))
scala> ranked.show
+----------+----------+-------+----+
| product| category|revenue|rank|
+----------+----------+-------+----+
| Pro2| tablet| 6500| 1|
| Mini| tablet| 5500| 2|
| Pro| tablet| 4500| 3|
| Big| tablet| 2500| 4|
| Normal| tablet| 1500| 5|
| Thin|cell phone| 6000| 1|
| Very thin|cell phone| 6000| 1|
|Ultra thin|cell phone| 5000| 2|
| Bendable|cell phone| 3000| 3|
| Foldable|cell phone| 3000| 3|
+----------+----------+-------+----+
scala> ranked.where('rank <= 2).show
+----------+----------+-------+----+
| product| category|revenue|rank|
+----------+----------+-------+----+
| Pro2| tablet| 6500| 1|
| Mini| tablet| 5500| 2|
| Thin|cell phone| 6000| 1|
| Very thin|cell phone| 6000| 1|
|Ultra thin|cell phone| 5000| 2|
+----------+----------+-------+----+
This example is the 2nd example from an excellent article Introducing Window
Functions in Spark SQL.
import org.apache.spark.sql.expressions.Window
val reveDesc = Window.partitionBy('category).orderBy('revenue.desc)
val reveDiff = max('revenue).over(reveDesc) - 'revenue
scala> data.select('*, reveDiff as 'revenue_diff).show
+----------+----------+-------+------------+
| product| category|revenue|revenue_diff|
+----------+----------+-------+------------+
| Pro2| tablet| 6500| 0|
| Mini| tablet| 5500| 1000|
| Pro| tablet| 4500| 2000|
| Big| tablet| 2500| 4000|
| Normal| tablet| 1500| 5000|
| Thin|cell phone| 6000| 0|
| Very thin|cell phone| 6000| 0|
|Ultra thin|cell phone| 5000| 1000|
| Bendable|cell phone| 3000| 3000|
| Foldable|cell phone| 3000| 3000|
+----------+----------+-------+------------+
Difference on Column
Compute a difference between values in rows in a column.
Please note the question Why do Window functions fail with "Window function X does not take a frame specification"?
The key here is to remember that DataFrames are RDDs under the covers and hence aggregation like grouping by a key in DataFrames is RDD's groupBy (or worse, reduceByKey or aggregateByKey transformations).
Running Total
The running total is the sum of all previous lines including the current one.
val sales = Seq(
(0, 0, 0, 5),
(1, 0, 1, 3),
(2, 0, 2, 1),
(3, 1, 0, 2),
(4, 2, 0, 8),
(5, 2, 2, 8))
.toDF("id", "orderID", "prodID", "orderQty")
scala> sales.show
+---+-------+------+--------+
| id|orderID|prodID|orderQty|
+---+-------+------+--------+
| 0| 0| 0| 5|
| 1| 0| 1| 3|
| 2| 0| 2| 1|
| 3| 1| 0| 2|
| 4| 2| 0| 8|
| 5| 2| 2| 8|
+---+-------+------+--------+
val orderedByID = Window.orderBy('id)
val totalQty = sum('orderQty).over(orderedByID).as('running_total)
val salesTotalQty = sales.select('*, totalQty).orderBy('id)
scala> salesTotalQty.show
16/04/10 23:01:52 WARN Window: No Partition Defined for Window operation! Moving all d
ata to a single partition, this can cause serious performance degradation.
+---+-------+------+--------+-------------+
| id|orderID|prodID|orderQty|running_total|
+---+-------+------+--------+-------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 11|
| 4| 2| 0| 8| 19|
| 5| 2| 2| 8| 27|
+---+-------+------+--------+-------------+
val byOrderId = orderedByID.partitionBy('orderID)
val totalQtyPerOrder = sum('orderQty).over(byOrderId).as('running_total_per_order)
val salesTotalQtyPerOrder = sales.select('*, totalQtyPerOrder).orderBy('id)
scala> salesTotalQtyPerOrder.show
+---+-------+------+--------+-----------------------+
| id|orderID|prodID|orderQty|running_total_per_order|
+---+-------+------+--------+-----------------------+
| 0| 0| 0| 5| 5|
| 1| 0| 1| 3| 8|
| 2| 0| 2| 1| 9|
| 3| 1| 0| 2| 2|
| 4| 2| 0| 8| 8|
| 5| 2| 2| 8| 16|
+---+-------+------+--------+-----------------------+
Structured Streaming
Note
Tip
The feature has also been called Streaming Spark SQL Query, Streaming
DataFrames, Continuous DataFrames or Continuous Queries. There have
been lots of names before Structured Streaming was chosen.
Watch SPARK-8360 Streaming DataFrames to track progress of the feature.
Example
Below is a complete example of a streaming query, in the form of a DataFrame, that loads data from csv files of a given schema into a ConsoleSink every 5 seconds.
file:///Users/jacek/dev/oss/spark/csv-logs/people-1.csv
file:///Users/jacek/dev/oss/spark/csv-logs/people-2.csv
file:///Users/jacek/dev/oss/spark/csv-logs/people-3.csv
file:///Users/jacek/dev/oss/spark/csv-logs/people-1.csv
-------------------------------------------
Batch: 0
-------------------------------------------
+-----+--------+-------+---+-----+
| name| city|country|age|alive|
+-----+--------+-------+---+-----+
|Jacek|Warszawa| Polska| 42| true|
+-----+--------+-------+---+-----+
scala> spark.streams.active.foreach(println)
Streaming Query - consoleStream [state = ACTIVE]
scala> spark.streams.active(0).explain
== Physical Plan ==
*Scan csv [name#130,city#131,country#132,age#133,alive#134] Format: CSV, InputPaths: file:/Users/jacek/dev/oss/spark/csv-logs/people-3.csv, PushedFilters: [], ReadSchema: struct<name:string,city:string,country:string,age:int,alive:boolean>
DataStreamReader
DataStreamReader is an interface for reading streaming data in a DataFrame from data sources.
format(source: String): DataStreamReader
schema
schema(schema: StructType): DataStreamReader
option Methods
There is support for values of String , Boolean , Long , and Double types for user convenience; internally they are converted to String type.
Note
You can also set options in bulk using options method. You have to do the type
conversion yourself, though.
options
options(options: scala.collection.Map[String, String]): DataStreamReader
options method allows specifying one or many options of the streaming input data source.
Note
You can also set options one by one using option method.
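A minimal sketch (the entry point is assumed to be spark.readStream, as in Spark 2.0 final; the schema, path and option values are made up for illustration):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val people = spark.readStream
  .format("csv")
  .schema(schema)
  .options(Map("header" -> "true", "maxFilesPerTrigger" -> "1"))
  .load("csv-logs")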
load Methods
load(): DataFrame
load(path: String): DataFrame (1)
Built-in Formats
json(path: String): DataFrame
csv(path: String): DataFrame
parquet(path: String): DataFrame
text(path: String): DataFrame
DataStreamReader can load streaming data from data sources of the following formats:
json
csv
parquet
text
DataStreamWriter
Caution
FIXME
outputMode specifies the output mode of a streaming Dataset, which is what gets written to a sink.
OutputMode.Complete - the entire streaming dataset (with all the rows) will be written to a sink every time there are updates. It is supported only for streaming queries with aggregations.
queryName
queryName(queryName: String): DataStreamWriter[T]
trigger
trigger sets the interval of trigger (batch) for the streaming query.
Note
start methods
start(path: String): StreamingQuery
start(): StreamingQuery
foreach
Streaming Source
A Streaming Source represents a continuous stream of data for a streaming query. It
generates batches of DataFrame for given start and end offsets. For fault tolerance, a
source must be able to replay data given a start offset.
A streaming source should be able to replay an arbitrary sequence of past data in the stream
using a range of offsets. This means that only streaming sources like Kafka and Kinesis
(which have the concept of per-record offset) fit into this model. This is the assumption so
structured streaming can achieve end-to-end exactly-once guarantees.
Source trait has the following features:
MemoryStream
FileStreamSource
FileStreamSource is a Source that reads text files from the path directory as they appear. You can provide the schema of the data and dataFrameBuilder - the function to build a DataFrame in getBatch - at instantiation time.
Tip
Enable TRACE logging level for org.apache.spark.sql.execution.streaming.FileStreamSource logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSource=TRACE
Refer to Logging.
Options
maxFilesPerTrigger
maxFilesPerTrigger option specifies the maximum number of files per trigger (batch). It
limits the file stream source to read the maxFilesPerTrigger number of files specified at a
time and hence enables rate limiting.
It allows a static set of files to be used like a stream for testing, as the file set is processed maxFilesPerTrigger files at a time.
schema
If the schema is specified at instantiation time (using optional dataSchema constructor
parameter) it is returned.
Otherwise, fetchAllFiles internal method is called to list all the files in a directory.
When there is at least one file the schema is calculated using dataFrameBuilder constructor
parameter function. Else, an IllegalArgumentException("No schema specified") is thrown
unless it is for text provider (as providerName constructor parameter) where the default
schema with a single value column of type StringType is assumed.
Note
getOffset
The maximum offset ( getOffset ) is calculated by fetching all the files in path excluding
files that start with _ (underscore).
When computing the maximum offset using getOffset , you should see the following
DEBUG message in the logs:
DEBUG Listed ${files.size} in ${(endTime.toDouble - startTime) / 1000000}ms
When computing the maximum offset using getOffset , it also filters out the files that were
already seen (tracked in seenFiles internal registry).
You should see the following DEBUG message in the logs (depending on the status of a
file):
new file: $file
// or
old file: $file
getBatch
You should see the following INFO and DEBUG messages in the logs:
INFO Processing ${files.length} files from ${startId + 1}:$endId
DEBUG Streaming ${files.mkString(", ")}
The method to create a result batch is given at instantiation time (as dataFrameBuilder
constructor parameter).
metadataLog
metadataLog is a metadata storage using metadataPath path (which is a constructor
parameter).
Note
It extends HDFSMetadataLog[Seq[String]] .
Caution
FIXME Review HDFSMetadataLog
Streaming Sinks
A Streaming Sink represents an external storage to write streaming datasets to. It is
modeled as Sink trait that can process batches of data given as DataFrames.
The following sinks are currently available in Spark:
ConsoleSink for console format.
FileStreamSink for parquet format.
ForeachSink used in foreach operator.
MemorySink for memory format.
You can create your own streaming format implementing StreamSinkProvider.
Sink Contract
Sink Contract is described by the Sink trait. It defines the one and only addBatch method that adds data for a given batchId .
package org.apache.spark.sql.execution.streaming
trait Sink {
def addBatch(batchId: Long, data: DataFrame): Unit
}
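A minimal, illustrative Sink that only prints the size of every batch (a sketch; a real sink would be exposed to users through a StreamSinkProvider):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class CountingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    println(s"Batch $batchId contains ${data.count} rows")
  }
}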
FileStreamSink
Caution
FIXME
MemorySink
MemorySink is a memory-based Sink particularly useful for testing. It stores the results in memory.
It is available as memory format that requires a query name (by queryName method or
queryName option).
...FIXME
Note
It was introduced in the pull request for [SPARK-14288][SQL] Memory Sink for
streaming.
Note
It creates MemorySink instance based on the schema of the DataFrame it operates on.
It creates a new DataFrame using MemoryPlan with MemorySink instance created earlier
and registers it as a temporary table (using DataFrame.registerTempTable method).
Note
At this point you can query the table as if it were a regular non-streaming table
using sql method.
ConsoleSink
ConsoleSink is a streaming sink that is registered as the console format.
ConsoleSinkProvider
ConsoleSinkProvider is a StreamSinkProvider for ConsoleSink. As a DataSourceRegister, it registers the console alias (its short name).
ForeachSink
ForeachSink is a typed Sink that passes records (of the type T ) to a ForeachWriter (one record at a time per partition).
Internally, addBatch (the only method from the Sink Contract) takes records from the input
DataFrame (as data ), transforms them to expected type T (of this ForeachSink ) and
(now as a Dataset) processes each partition.
addBatch(batchId: Long, data: DataFrame): Unit
It then opens the constructors ForeachWriter (for the current partition and the input batch)
and passes the records to process (one at a time per partition).
Caution
FIXME Why does Spark track whether the writer failed or not? Why couldn't it use finally to do close ?
Caution
ForeachWriter
Caution
FIXME
StreamSinkProvider
StreamSinkProvider is an interface for objects that can create streaming sinks for a specific format.
StreamingQueryManager (Streaming Query Management)
Note
Initialization
StreamingQueryManager manages the following instances:
StateStoreCoordinatorRef (as stateStoreCoordinator )
active StreamingQuery objects (in the activeQueries internal registry)
StreamingQueryListenerBus
Caution
FIXME
startQuery
startQuery(name: String,
checkpointLocation: String,
df: DataFrame,
sink: Sink,
trigger: Trigger = ProcessingTime(0)): StreamingQuery
Note
Note
startQuery makes sure that the activeQueries internal registry does not already contain the query (by name).
postListenerEvent
postListenerEvent(event: StreamingQueryListener.Event): Unit
postListenerEvent posts the input event to the internal listenerBus .
StreamingQueryListener
Caution
FIXME
StreamingQueryListener is an interface for listening to query life cycle events, i.e. when a query is started, makes progress, and terminates.
Used in:
awaitAnyTermination
awaitAnyTermination(timeoutMs: Long)
They all wait 10 millis before doing the check of lastTerminatedQuery being non-null.
It is set in:
resetTerminated() resets lastTerminatedQuery , i.e. sets it to null .
notifyQueryTermination(terminatedQuery: StreamingQuery) sets lastTerminatedQuery to the input terminatedQuery .
StreamingQuery
StreamingQuery provides an interface for interacting with a query that executes continually
in background.
Note
StreamExecution
It can be in two states: active (started) or inactive (stopped). If inactive, it may have transitioned into the state due to a StreamingQueryException (that is available under exception ).
Trigger
Trigger is used to define how often a streaming query should be executed to produce
results.
Note
Note
file Trigger.scala.
A trigger can also be considered a batch (as in Spark Streaming).
Note
ProcessingTime
ProcessingTime is the only available implementation of the Trigger sealed trait. It assumes milliseconds as the time unit when given a plain number (and also accepts a Scala Duration), as the examples below show:
ProcessingTime(10)
ProcessingTime(10.seconds)
ProcessingTime.create(10, TimeUnit.SECONDS)
StreamExecution
StreamExecution manages execution of a streaming query for a SQLContext and a Sink. It
requires a LogicalPlan to know the Source objects from which records are periodically
pulled down.
StreamExecution is a StreamingQuery with additional attributes:
checkpointRoot
LogicalPlan
Sink
Trigger
Tip
Enable DEBUG logging level for org.apache.spark.sql.execution.streaming.StreamExecution logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.execution.streaming.StreamExecution=DEBUG
Refer to Logging.
runBatches
toDebugString
You can call toDebugString on StreamExecution to learn about the internals.
scala> out.asInstanceOf[StreamExecution].toDebugString
res3: String =
"
=== Continuous Query ===
Name: memStream
Current Offsets: {FileSource[hello]: #0}
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan:
FileSource[hello]
"
StreamingRelation
StreamingRelation is the LogicalPlan of the DataFrame being the result of executing
DataFrameReader.stream method.
val reader = spark.read
val helloDF = reader.stream("hello")
scala> helloDF.explain(true)
== Parsed Logical Plan ==
FileSource[hello]
== Analyzed Logical Plan ==
id: bigint
FileSource[hello]
== Optimized Logical Plan ==
FileSource[hello]
== Physical Plan ==
java.lang.AssertionError: assertion failed: No plan for FileSource[hello]
StreamingExecutionRelation
StreamingQueryListenerBus
Caution
FIXME
Joins
Caution
FIXME
You can use broadcast function to mark a DataFrame to be broadcast when used in a join
operator.
val left = Seq((0, "aa"), (0, "bb")).toDF("id", "token")
val right = Seq(("aa", 0.99), ("bb", 0.57)).toDF("token", "prob")
scala> left.join(broadcast(right), "token").show
+-----+---+----+
|token| id|prob|
+-----+---+----+
| aa| 0|0.99|
| bb| 0|0.57|
+-----+---+----+
According to the article Map-Side Join in Spark, broadcast join is also called a replicated
join (in the distributed system community) or a map-side join (in the Hadoop community).
Note
At long last! I have always been wondering what a map-side join is and it
appears I am close to uncover the truth!
And later in the article Map-Side Join in Spark, you can find that with the broadcast join, you
can very effectively join a large table (fact) with relatively small tables (dimensions), i.e. to
perform a star-schema join you can avoid sending all data of the large table over the
network.
CanBroadcast object matches a LogicalPlan with output small enough for broadcast join.
Note
Currently statistics are only supported for Hive Metastore tables where the
command ANALYZE TABLE [tableName] COMPUTE STATISTICS noscan has been run.
Hive Integration
Spark SQL supports Apache Hive using HiveContext . It uses the Spark SQL execution
engine to work with data stored in Hive.
From Wikipedia, the free encyclopedia:
Apache Hive supports analysis of large datasets stored in Hadoops HDFS
and compatible file systems such as Amazon S3 filesystem.
Note
Tip
Enable DEBUG logging level for org.apache.spark.sql.hive.HiveContext logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.hive.HiveContext=DEBUG
Refer to Logging.
Hive Functions
SQLContext.sql (or simply sql ) allows you to interact with Hive.
You can use show functions to learn about the Hive functions supported through the Hive
integration.
current_database function
current_database function returns the current database of Hive metadata.
Analyzing Tables
analyze(tableName: String)
analyze analyzes tableName table for query optimizations. It currently supports only Hive
tables.
scala> sql("show tables").show(false)
16/04/09 14:04:10 INFO HiveSqlParser: Parsing command: show tables
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|dafa |false |
+---------+-----------+
scala> spark.asInstanceOf[HiveContext].analyze("dafa")
16/04/09 14:02:56 INFO HiveSqlParser: Parsing command: dafa
java.lang.UnsupportedOperationException: Analyze only works for Hive tables, but dafa
is a LogicalRelation
at org.apache.spark.sql.hive.HiveContext.analyze(HiveContext.scala:304)
... 50 elided
FIXME
Settings
spark.sql.hive.metastore.version (default: 1.2.1 ) - the version of the Hive metastore.
Caution
FIXME
Tip
Read about Spark SQL CLI in Spark's official documentation in Running the Spark SQL CLI.
SQL Parsers
ParserInterface (SQL Parser Contract)
ParserInterface is the parser contract for extracting LogicalPlan, Expressions , and
TableIdentifiers from a given SQL string.
package org.apache.spark.sql.catalyst.parser
trait ParserInterface {
def parsePlan(sqlText: String): LogicalPlan
def parseExpression(sqlText: String): Expression
def parseTableIdentifier(sqlText: String): TableIdentifier
}
AbstractSqlParser
AbstractSqlParser abstract class is a ParserInterface that provides the foundation for the
SQL parsing infrastructure in Spark SQL with two concrete implementations: SparkSqlParser
and CatalystSqlParser.
AbstractSqlParser expects that subclasses provide custom AstBuilder (as astBuilder )
SparkSqlParser
SparkSqlParser is the default parser of the SQL statements supported in Spark SQL. It is available as the sqlParser of a SessionState .
Refer to Logging.
Caution
CatalystSqlParser
CatalystSqlParser is a AbstractSqlParser that comes with its own specialized astBuilder
(i.e. AstBuilder ).
CatalystSqlParser is used to parse data types (using their canonical string representation).
Tip
Enable INFO logging level for org.apache.spark.sql.catalyst.parser.CatalystSqlParser logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.catalyst.parser.CatalystSqlParser=INFO
Refer to Logging.
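A minimal sketch of parsing canonical string representations of data types:
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

CatalystSqlParser.parseDataType("int")           // IntegerType
CatalystSqlParser.parseDataType("array<string>") // ArrayType(StringType, true)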
Caching
Caution
FIXME
You can use CACHE TABLE [tableName] to cache tableName table in memory. It is an eager
operation which is executed as soon as the statement is executed.
sql("CACHE TABLE [tableName]")
Datasets vs RDDs
Many may have been asking themselves why they should be using Datasets rather than the foundation of all Spark - RDDs with case classes.
This document collects advantages of Dataset vs RDD[CaseClass] to answer the question
Dan has asked on twitter:
"In #Spark, what is the advantage of a DataSet over an RDD[CaseClass]?"
SessionState
SessionState is the default separation layer for isolating state across sessions, including
SQL configuration, tables, functions, UDFs, the SQL parser, and everything else that
depends on a SQLConf.
Caution
Note
optimizer
analyzer
catalog
streamingQueryManager
udf
newHadoopConf to create a new Hadoop Configuration .
sessionState
sqlParser
catalog Attribute
catalog: SessionCatalog
catalog attribute points at shared internal SessionCatalog for managing tables and
databases.
It is used to create the shared analyzer and optimizer.
SessionCatalog
analyzer Attribute
analyzer: Analyzer
analyzer is
optimizer Attribute
optimizer: Optimizer
optimizer is
sqlParser Attribute
sqlParser is
planner method
planner is
executePlan method
executePlan is
refreshTable method
refreshTable is
addJar method
addJar is
analyze method
analyze is
streamingQueryManager Attribute
streamingQueryManager: StreamingQueryManager
udf Attribute
udf: UDFRegistration
Note
Caution
SQLExecution.withNewExecutionId allows executing the input body query action with the execution id local property set (as executionId or auto-generated). The execution identifier is set as the spark.sql.execution.id local property (using SparkContext.setLocalProperty).
The use case is to track Spark jobs (e.g. when running in separate threads) that belong to a
single SQL query execution.
Note
Caution
It is used in Dataset.withNewExecutionId.
FIXME Where is the proxy-like method used? How important is it?
If there is another execution local property set (as spark.sql.execution.id ), it is replaced for
the course of the current action.
SQLContext
Caution
In the older Spark 1.x, SQLContext was the entry point for Spark SQL. Whatever you do in
Spark SQL it has to start from creating an instance of SQLContext.
A SQLContext object requires a SparkContext , a CacheManager , and a SQLListener. They
are all transient and do not participate in serializing a SQLContext.
You should use SQLContext for the following:
Creating Datasets
Creating Dataset[Long] (range method)
Creating DataFrames
Creating DataFrames for Table
Accessing DataFrameReader
Accessing StreamingQueryManager
Registering User-Defined Functions (UDF)
Caching DataFrames in In-Memory Cache
Setting Configuration Properties
Bringing Converter Objects into Scope
Creating External Tables
Dropping Temporary Tables
Listing Existing Tables
Managing Active SQLContext for JVM
Executing SQL Queries
SQLContext(sc: SparkContext)
SQLContext.getOrCreate(sc: SparkContext)
SQLContext.newSession() allows for creating a new instance of SQLContext with a separate SQL configuration (while sharing the underlying SparkContext ).
You can get the current value of a configuration property by key using:
getConf(key: String): String
getConf(key: String, defaultValue: String): String
getAllConfs: immutable.Map[String, String]
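A minimal sketch (assuming a sqlContext is in scope, e.g. spark.sqlContext in spark-shell; the unset key is made up for illustration):
sqlContext.getConf("spark.sql.sources.default")        // the default data source format
sqlContext.getConf("spark.some.unset.key", "fallback") // returns the given default value
sqlContext.getAllConfs                                 // all explicitly-set properties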
Note
Properties that start with spark.sql are reserved for Spark SQL.
Creating DataFrames
emptyDataFrame
emptyDataFrame: DataFrame
This variant of createDataFrame creates a DataFrame from RDD of Row and explicit
schema.
Functions registered using udf are available for Hive queries only.
Tip
// Create a DataFrame
val df = Seq("hello", "world!").zip(0 to 1).toDF("text", "id")
// Register the DataFrame as a temporary table in Hive
df.registerTempTable("texts")
scala> sql("SHOW TABLES").show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| texts| true|
+---------+-----------+
scala> sql("SELECT * FROM texts").show
+------+---+
| text| id|
+------+---+
| hello| 0|
|world!| 1|
+------+---+
// Just a Scala function
val my_upper: String => String = _.toUpperCase
// Register the function as UDF
spark.udf.register("my_upper", my_upper)
scala> sql("SELECT *, my_upper(text) AS MY_UPPER FROM texts").show
+------+---+--------+
| text| id|MY_UPPER|
+------+---+--------+
| hello| 0| HELLO|
|world!| 1| WORLD!|
+------+---+--------+
isCached method asks CacheManager whether tableName table is cached in memory or not. It simply requests CacheManager for CachedData and, when it exists, it assumes the table is cached.
cacheTable(tableName: String): Unit
Caution
uncacheTable(tableName: String)
clearCache(): Unit
Implicits (SQLContext.implicits)
The implicits object is a helper class with methods to convert objects into Datasets and
DataFrames, and also comes with many Encoders for "primitive" types as well as the
collections thereof.
Import the implicits by import spark.implicits._ as follows:
Note
It holds Encoders for Scala "primitive" types like Int , Double , String , and their
collections.
It offers support for creating Dataset from RDD of any types (for which an encoder exists in
scope), or case classes or tuples, and Seq .
It also offers conversions from Scala's Symbol or $ to Column .
It also offers conversions from RDD or Seq of Product types (e.g. case classes or tuples)
to DataFrame . It has direct conversions from RDD of Int , Long and String to
DataFrame with a single column name _1 .
Note
Creating Datasets
createDataset[T: Encoder](data: Seq[T]): Dataset[T]
createDataset[T: Encoder](data: RDD[T]): Dataset[T]
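A minimal sketch (assuming spark-shell, so the required Encoder is in scope via spark.implicits._):
val nums = spark.createDataset(Seq(1, 2, 3))
nums.show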
Note
The experimental read method returns a DataFrameReader that is used to read data from
external storage systems and load it into a DataFrame .
Caution
It assumes parquet as the default data source format that you can change using
spark.sql.sources.default setting.
Caution
The range family of methods creates a Dataset[Long] with the sole id column of
LongType for given start , end , and step .
Note
scala> spark.range(5)
res0: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> res0.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
table methods return a DataFrame that holds names of existing tables in a database.
scala> spark.tables.show
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
| t| true|
| t2| true|
+---------+-----------+
Note
tableNames(): Array[String]
tableNames(databaseName: String): Array[String]
tableNames are similar to tables with the only difference that they return Array[String]
Accessing StreamingQueryManager
streams: StreamingQueryManager
FIXME
SQLContext.getOrCreate method returns an active SQLContext object for the JVM or creates a new one using the given SparkContext .
Interestingly, there are two helper methods to set and clear the active SQLContext object - setActive and clearActive respectively.
Note
sql parses sqlText using a dialect that can be set up using spark.sql.dialect setting.
sql is imported in spark-shell so you can execute Hive statements without
spark prefix.
Note
Tip
You may also use spark-sql shell script to interact with Hive.
FIXME Review
Tip
Enable INFO logging level for the loggers that correspond to the implementations of AbstractSqlParser to see what happens inside sql .
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.sql.hive.execution.HiveSqlParser=INFO
Refer to Logging.
You can use the newSession method to create a new session without the cost of instantiating a new SQLContext from scratch.
newSession returns a new SQLContext that shares the SparkContext , CacheManager , and other services with the current one.
RowEncoder
RowEncoder is a factory object that maps a StructType to an ExpressionEncoder[Row] .
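A minimal sketch (assuming the catalyst package is on the classpath, as it is in spark-shell):
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// an ExpressionEncoder[Row] for the given schema
val encoder = RowEncoder(schema)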
Predicate Pushdown
Caution
FIXME
When you execute where operator right after loading a data (into a Dataset ), Spark SQL
will push the "where" predicate down to the source using a corresponding SQL query with
WHERE clause (or whatever is the proper language for the source).
This optimization is called predicate pushdown as it pushes the filtering down to the data source engine (rather than dealing with it after the entire dataset has been loaded into Spark's memory and filtering out records afterwards).
Given the following code:
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:...")
  .option("dbtable", "people")
  .load()
  .as[Person] // assumes a Person case class matching the table schema
  .where($"name" === "Jacek")
Caution
Query Plan
Caution
FIXME
QueryPlan abstract class has an output (that is a sequence of Attribute instances).
schema
You can find out about the schema of a QueryPlan using schema that builds StructType
from the output attributes.
Output Attributes
Attribute
Caution
FIXME
Spark Plan
SparkPlan is an abstract QueryPlan for physical operators, e.g. InMemoryTableScanExec .
Note
Physical operators have names that end with the Exec suffix.
metrics
outputPartitioning
outputOrdering
SparkPlan can be executed (using the final execute method) to compute
RDD[InternalRow] .
SparkPlan has the following final methods that prepare the environment and pass calls on to doExecute .
SQLMetric
SQLMetric is an accumulator that accumulates and produces long values.
SparkPlan Contract
The contract of SparkPlan requires that concrete implementations define the following
method:
doExecute(): RDD[InternalRow]
Caution
Logical Plan
Caution
FIXME
QueryPlanner
QueryPlanner transforms a LogicalPlan through a chain of GenericStrategy objects to
produce a PhysicalPlan , e.g. SparkPlan for SparkPlanner or the custom SparkPlanner for
HiveSessionState.
QueryPlanner contract defines three operations:
strategies that returns a collection of GenericStrategy objects.
planLater(plan: LogicalPlan): PhysicalPlan that skips the current plan.
plan(plan: LogicalPlan) that returns an Iterator[PhysicalPlan] with elements being the candidate physical plans produced by the strategies.
SparkStrategies
SparkStrategies is an abstract QueryPlanner for SparkPlan.
SparkPlanner
SparkPlanner is a concrete QueryPlanner (extending SparkStrategies).
QueryExecution
QueryExecution requires SQLContext and LogicalPlan.
Caution
across serializations.
val ds = spark.range(5)
scala> ds.queryExecution
res17: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
Range 0, 5, 1, 8, [id#39L]
== Analyzed Logical Plan ==
id: bigint
Range 0, 5, 1, 8, [id#39L]
== Optimized Logical Plan ==
Range 0, 5, 1, 8, [id#39L]
== Physical Plan ==
WholeStageCodegen
: +- Range 0, 1, 8, 5, [id#39L]
IncrementalExecution
IncrementalExecution is a custom QueryExecution with OutputMode , checkpointLocation ,
and currentBatchId .
It lives in org.apache.spark.sql.execution.streaming package.
Caution
Stateful operators in the query plan are numbered using operatorId that starts with 0 .
IncrementalExecution adds one Rule[SparkPlan] called state to preparations sequence
executedPlan SparkPlan
executedPlan lazy value is a SparkPlan ready for execution after applying the rules in
preparations.
import org.apache.spark.sql.execution.debug._
scala> spark.range(10).where('id === 4).debugCodegen
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 ==
*Filter (id#8L = 4)
+- *Range (0, 10, splits=8)
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ /**
* Codegend pipeline for
* Filter (id#8L = 4)
* +- Range (0, 10, splits=8)
*/
...
Review SPARK-12795 Whole stage codegen to learn about the work to support
it.
Use Dataset.explain method to know the physical plan of a query and find out
whether or not WholeStageCodegen is in use.
Tip
Consider using Debugging Query Execution facility to deep dive into whole stage
codegen.
Note
Before a query is executed, CollapseCodegenStages case class is used to find the plans
that support codegen and collapse them together as WholeStageCodegen . It is part of the
sequence of rules QueryExecution.preparations that will be applied in order to the physical
plan before execution.
CodegenSupport Contract
Codegen Operators
SparkPlan plans that support codegen extend CodegenSupport.
ProjectExec for as
FilterExec for where or filter
Range
Caution
BatchedDataSourceScanExec
ExpandExec
BaseLimitExec
SortExec
WholeStageCodegenExec and InputAdapter
TungstenAggregate
BroadcastHashJoinExec
SortMergeJoinExec
BroadcastHashJoinExec
BroadcastHashJoinExec variables are prefixed with bhj (see
CodegenSupport.variablePrefix ).
scala> spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
res18: String = 10485760
scala> ds.join(ds).explain(extended=true)
== Parsed Logical Plan ==
'Join Inner
:- LocalRelation [_1#21, _2#22]
+- LocalRelation [_1#21, _2#22]
== Analyzed Logical Plan ==
_1: int, _2: string, _1: int, _2: string
Join Inner
:- LocalRelation [_1#21, _2#22]
+- LocalRelation [_1#32, _2#33]
== Optimized Logical Plan ==
Join Inner
:- LocalRelation [_1#21, _2#22]
+- LocalRelation [_1#32, _2#33]
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, true
:- LocalTableScan [_1#21, _2#22]
+- BroadcastExchange IdentityBroadcastMode
+- LocalTableScan [_1#32, _2#33]
// Use broadcast function to mark the right-side Dataset
// eligible for broadcasting explicitly
scala> ds.join(broadcast(ds)).explain(extended=true)
== Parsed Logical Plan ==
'Join Inner
:- LocalRelation [_1#21, _2#22]
+- BroadcastHint
+- LocalRelation [_1#21, _2#22]
== Analyzed Logical Plan ==
_1: int, _2: string, _1: int, _2: string
Join Inner
:- LocalRelation [_1#21, _2#22]
+- BroadcastHint
+- LocalRelation [_1#43, _2#44]
== Optimized Logical Plan ==
Join Inner
:- LocalRelation [_1#21, _2#22]
+- BroadcastHint
+- LocalRelation [_1#43, _2#44]
== Physical Plan ==
BroadcastNestedLoopJoin BuildRight, Inner, true
:- LocalTableScan [_1#21, _2#22]
+- BroadcastExchange IdentityBroadcastMode
+- LocalTableScan [_1#43, _2#44]
SampleExec
scala> spark.range(10).sample(false, 0.4).explain
== Physical Plan ==
WholeStageCodegen
: +- Sample 0.0, 0.4, false, -7634498724724501829
: +- Range 0, 1, 8, 10, [id#15L]
RangeExec
scala> spark.range(10).explain
== Physical Plan ==
WholeStageCodegen
: +- Range 0, 1, 8, 10, [id#20L]
CollapseCodegenStages
CollapseCodegenStages is a Rule[SparkPlan] , i.e. a transformation of SparkPlan into
another SparkPlan .
Note
It searches for sub-plans (aka stages) that support codegen and collapse them together as a
WholeStageCodegen .
Note
The number of fields in the schema is limited (see spark.sql.codegen.maxFields ) for whole-stage codegen. It counts the fields included in complex types, i.e. StructType , MapType ,
ArrayType , UserDefinedType , and their combinations, recursively. See SPARK-14554.
It inserts InputAdapter leaf nodes in a SparkPlan recursively that is then used to generate
code that consumes an RDD iterator of InternalRow .
Note
BenchmarkWholeStageCodegen (in the Spark source) is a benchmark to measure whole-stage codegen
performance.
You can execute it using the command:
build/sbt 'sql/testOnly *BenchmarkWholeStageCodegen'
Project Tungsten
One of the main motivations of Project Tungsten is to greatly reduce the usage of Java
objects to minimum by introducing its own memory management. It uses a compact storage
format for data representation that also reduces memory footprint. With a known schema for
datasets, the proper data layout is possible immediately with the data being already
serialized (which further reduces or completely avoids serialization between the JVM object
representation and Spark's internal one).
Project Tungsten uses the sun.misc.Unsafe API for direct memory access to bypass the JVM
and avoid garbage collection.
The optimizations provided by the project Tungsten:
1. Memory Management using Binary In-Memory Data Representation aka Tungsten row
format.
2. Cache-Aware Computations with Cache-Aware Layout for high cache hit rates
3. Code Generation
Tungsten does code generation, i.e. generates JVM bytecode on the fly, to access
Tungsten-managed memory structures, which gives very fast access.
Tungsten also introduces cache-aware data structures that are aware of the physical
machine caches at different levels - L1, L2, L3.
Settings
The following is a list of the settings used to configure Spark SQL applications.
You can apply them to SQLContext using setConf method:
spark.setConf("spark.sql.codegen.wholeStage", "false")
spark.sql.catalogImplementation
spark.sql.catalogImplementation (default: in-memory ) is an internal setting to select the catalog implementation ( in-memory or hive ).
Caution
spark.sql.shuffle.partitions
spark.sql.shuffle.partitions (default: 200 ) is the default number of partitions to use when shuffling data for joins or aggregations.
spark.sql.allowMultipleContexts
spark.sql.allowMultipleContexts (default: true ) controls whether creating multiple
SQLContexts/HiveContexts is allowed.
spark.sql.autoBroadcastJoinThreshold
spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024 ) configures the maximum
size in bytes for a table that will be broadcast to all worker nodes when performing a join. If
the size of the statistics of the logical plan of a DataFrame is at most the setting, the
DataFrame is broadcast for join.
Negative values or 0 disable broadcasting.
Consult Broadcast Join for more information about the topic.
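For example (a minimal sketch in spark-shell), you can disable automatic broadcast joins for the current session:

// a negative value (or 0) disables size-based broadcasting
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)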
spark.sql.columnNameOfCorruptRecord
spark.sql.columnNameOfCorruptRecord FIXME
spark.sql.dialect
spark.sql.dialect - FIXME
spark.sql.sources.default
spark.sql.sources.default (default: parquet ) sets the default data source to use in
input/output.
It is used when reading or writing data in DataFrameWriter, DataFrameReader,
createExternalTable as well as the streaming DataStreamReader and DataStreamWriter.
spark.sql.streaming.checkpointLocation
spark.sql.streaming.checkpointLocation is the default location for storing checkpoint data for streaming queries.
spark.sql.codegen.wholeStage
spark.sql.codegen.wholeStage (default: true ) controls whether a whole stage (of multiple
operators) will be compiled into a single Java method ( true ) or not ( false ).
Spark Streaming
Spark Streaming is the incremental stream processing framework for Spark.
Spark Streaming offers the data abstraction called DStream that hides the complexity of
dealing with a continuous data stream and makes it as easy for programmers as using one
single RDD at a time.
That is why Spark Streaming is also called a micro-batching streaming framework as a
batch is one RDD at a time.
Note
I think Spark Streaming shines on performing the T stage well, i.e. the
transformation stage, while leaving the E and L stages for more specialized
tools like Apache Kafka or frameworks like Akka.
For a software developer, a DStream is similar to working with an RDD, with the DStream API
mirroring the RDD API. Interestingly, you can reuse your RDD-based code and apply it to a
DStream - a stream of RDDs - with no changes at all (through foreachRDD).
It runs streaming jobs every batch duration to pull and process data (often called records)
from one or many input streams.
Each batch computes (generates) a RDD for data in input streams for a given batch and
submits a Spark job to compute the result. It does this over and over again until the
streaming context is stopped (and the owning streaming application terminated).
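To make the flow concrete, here is a minimal word-count-style streaming application (a sketch only; the host, port and batch duration are arbitrary, and something has to be writing text to the socket):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))   // batch duration of 5 seconds

// one streaming job per batch: count the words received on the socket in that batch
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()              // start pulling and processing data
ssc.awaitTermination()   // block until the streaming context is stopped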
To avoid losing records in case of failure, Spark Streaming supports checkpointing that
writes received records to a highly-available HDFS-compatible storage and allows recovering
from temporary downtimes.
Spark Streaming allows for integration with real-time data sources ranging from such basic
ones like a HDFS-compatible file system or socket connection to more advanced ones like
Apache Kafka or Apache Flume.
Checkpointing is also the foundation of stateful and windowed operations.
About Spark Streaming from the official documentation (that pretty much nails what it offers):
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested
from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and
can be processed using complex algorithms expressed with high-level functions like
map, reduce, join and window. Finally, processed data can be pushed out to
filesystems, databases, and live dashboards. In fact, you can apply Spark's machine
learning and graph processing algorithms on data streams.
Essential concepts in Spark Streaming:
StreamingContext
Stream Operators
Batch, Batch time, and JobSet
Streaming Job
Discretized Streams (DStreams)
Receivers
Other concepts often used in Spark Streaming:
ingestion = the act of processing streaming data.
Micro Batch
Micro Batch is a collection of input records as collected by Spark Streaming that is later
represented as an RDD.
A batch is internally represented as a JobSet.
Streaming Job
A streaming Job represents a Spark computation with one or many Spark jobs.
It is identified (in the logs) as streaming job [time].[outputOpId] with outputOpId being the
position in the sequence of jobs in a JobSet.
Internal Registries
nextInputStreamId - the current InputStream id
StreamingSource
Caution
FIXME
StreamingContext
StreamingContext is the main entry point for all Spark Streaming functionality. Whatever you
do in Spark Streaming has to start from creating an instance of StreamingContext.
Creating Instance
You can create a new instance of StreamingContext using the following constructors. You
can group them by whether a StreamingContext constructor creates it from scratch or it is
recreated from checkpoint directory (follow the links for their extensive coverage).
Creating StreamingContext from scratch:
StreamingContext(conf: SparkConf, batchDuration: Duration)
StreamingContext(master: String, appName: String, batchDuration: Duration,
sparkHome: String, jars: Seq[String], environment: Map[String,String])
StreamingContext(sparkContext: SparkContext, batchDuration: Duration)
Note
Note
Creating ReceiverInputDStreams
StreamingContext offers the following methods to create ReceiverInputDStreams:
receiverStream(receiver: Receiver[T])
actorStream[T](props: Props, name: String, storageLevel: StorageLevel =
StorageLevel.MEMORY_AND_DISK_SER_2, supervisorStrategy: SupervisorStrategy =
ActorSupervisorStrategy.defaultStrategy): ReceiverInputDStream[T]
You can also use two additional methods in StreamingContext to build (or better called
compose) a custom DStream:
union[T](streams: Seq[DStream[T]]): DStream[T]
receiverStream method
receiverStream[T: ClassTag](receiver: Receiver[T]): ReceiverInputDStream[T]
You can register a custom input dstream using receiverStream method. It accepts a
Receiver.
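For illustration (a sketch only; CustomReceiver is made up, and a real receiver would start threads in onStart that call store ), registering a custom receiver boils down to:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = { /* start the threads that call store(...) */ }
  def onStop(): Unit  = { /* stop the threads started in onStart */ }
}

val stream = ssc.receiverStream(new CustomReceiver)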
Note
transform method
transform[T](dstreams: Seq[DStream[_]], transformFunc: (Seq[RDD[_]], Time) => RDD[T]):
DStream[T]
transform Example
import org.apache.spark.rdd.RDD
def union(rdds: Seq[RDD[_]], time: Time) = {
  rdds.head.context.union(rdds.map(_.asInstanceOf[RDD[Int]]))
}
ssc.transform(Seq(cis), union)
remember method
remember(duration: Duration): Unit
remember method sets the remember interval (for the graph of output dstreams). It simply
calls DStreamGraph.remember with the given duration.
FIXME figure
Checkpoint Interval
The checkpoint interval is an internal property of StreamingContext and corresponds to
batch interval or checkpoint interval of the checkpoint (when checkpoint was present).
Note
checkpoint interval is mandatory when checkpoint directory is defined (i.e. not null ).
Checkpoint Directory
A checkpoint directory is a HDFS-compatible directory where checkpoints are written to.
Note
Its initial value depends on whether the StreamingContext was (re)created from a checkpoint
or not, and is the checkpoint directory if so. Otherwise, it is not set (i.e. null ).
You can set the checkpoint directory when a StreamingContext is created or later using
checkpoint method.
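For example (a sketch; the directory is arbitrary and should point at a reliable, HDFS-compatible location in production):

ssc.checkpoint("_checkpoint")   // sets the checkpoint directory of this StreamingContext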
Internally, a checkpoint directory is tracked as checkpointDir .
Tip
Initial Checkpoint
Initial checkpoint is the checkpoint (file) this StreamingContext has been recreated from.
The initial checkpoint is specified when a StreamingContext is created.
val ssc = new StreamingContext("_checkpoint")
Based on the initial checkpoint, the components of a streaming application can make decisions about how to initialize themselves (or just be
initialized).
isCheckpointPresent checks the existence of the initial checkpoint that gave birth to the
StreamingContext.
You use checkpoint method to set directory as the current checkpoint directory.
Note
method.
Note
Note
You start stream processing by calling start() method. It acts differently per state of
StreamingContext and only INITIALIZED state makes for a proper startup.
Note
Right after StreamingContext has been instantiated, it enters the INITIALIZED state.
start first checks whether another StreamingContext instance has already been started in the JVM.
If no other StreamingContext exists, it performs setup validation and starts JobScheduler (in
a separate dedicated daemon thread called streaming-start).
You stop StreamingContext using one of the three variants of stop method:
stop(stopSparkContext: Boolean = true)
stop(stopSparkContext: Boolean, stopGracefully: Boolean)
Note
stop methods stop the execution of the streams immediately ( stopGracefully is false )
or wait for the processing of all received data to be completed ( stopGracefully is true ).
stop reacts appropriately per the state of StreamingContext , but the end state is always the
STOPPED state. ContextWaiter is notified to stop (using notifyStop() ).
5. shutdownHookRef is cleared.
At that point, you should see the following INFO message in the logs:
INFO StreamingContext: StreamingContext stopped successfully
A shutdown hook also stops StreamingContext when the JVM shuts down, e.g. when all non-daemon threads have exited, System.exit was called, or ^C was typed.
Note
Note
work.
Setup Validation
validate(): Unit
Note
It first asserts that DStreamGraph has been assigned (i.e. graph field is not null ) and
triggers validation of DStreamGraph.
Caution
If checkpointing is enabled, it ensures that checkpoint interval is set and checks whether the
current streaming runtime environment can be safely serialized by serializing a checkpoint
for fictitious batch time 0 (not zero time).
If dynamic allocation is enabled, it prints the following WARN message to the logs:
WARN StreamingContext: Dynamic Allocation is enabled for this
application. Enabling Dynamic allocation for Spark Streaming
applications can cause data loss if Write Ahead Log is not
enabled for non-replayable sources like Flume. See the
programming guide for details on how to enable the Write Ahead
Log
FIXME
FIXME
States
StreamingContext can be in three states: INITIALIZED , ACTIVE , and STOPPED .
Stream Operators
You use stream operators to apply transformations to the elements received (often called
records) from input streams and ultimately trigger computations using output operators.
Transformations are stateless, but Spark Streaming comes with an experimental support for
stateful operators (e.g. mapWithState or updateStateByKey). It also offers windowed
operators that can work across batches.
Note
You may use RDDs from other (non-streaming) data sources to build more
advanced pipelines.
slice
window
reduceByWindow
reduce
map
(output operator) foreachRDD
glom
(output operator) saveAsObjectFiles
(output operator) saveAsTextFiles
transform
transformWith
flatMap
filter
repartition
mapPartitions
count
countByValue
countByWindow
countByValueAndWindow
union
Note
Most streaming operators come with their own custom DStream to offer the service. It
very often boils down, however, to overriding the compute method and applying the
corresponding RDD operator on a generated RDD.
print Operator
print(num: Int) operator prints num first elements of each RDD in the input stream.
The parameterless print() uses print(num: Int) with num being 10 .
Internally, it calls RDD.take(num + 1) (see take action) on each RDD in the stream to print
num elements. It then prints ... if there are more elements in the RDD (that would otherwise not be shown).
foreachRDD Operators
foreachRDD Example
val clicks: InputDStream[(String, String)] = messages
// println every single data received in clicks input stream
clicks.foreachRDD(rdd => rdd.foreach(println))
glom Operator
glom(): DStream[Array[T]]
glom operator creates a new stream in which RDDs in the source stream have RDD.glom
applied, i.e. it coalesces all elements in RDDs within each partition into an array.
reduce Operator
reduce(reduceFunc: (T, T) => T): DStream[T]
reduce operator creates a new stream of RDDs of a single element that is the result of applying reduceFunc to the data received during a batch.
reduce Example
val clicks: InputDStream[(String, String)] = messages
type T = (String, String)
val reduceFunc: (T, T) => T = {
  case in @ ((k1, v1), (k2, v2)) =>
    println(s">>> input: $in")
    (k2, s"$v1 + $v2")
}
val reduceClicks: DStream[(String, String)] = clicks.reduce(reduceFunc)
reduceClicks.print
map Operator
map[U](mapFunc: T => U): DStream[U]
map operator creates a new stream with the source elements being mapped over using
mapFunc function.
It creates MappedDStream stream that, when requested to compute a RDD, uses RDD.map
operator.
map Example
val clicks: DStream[...] = ...
val mappedClicks: ... = clicks.map(...)
reduceByKey Operator
reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)]
reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
reduceByKey(reduceFunc: (V, V) => V, partitioner: Partitioner): DStream[(K, V)]
transform Operators
transform(transformFunc: RDD[T] => RDD[U]): DStream[U]
transform(transformFunc: (RDD[T], Time) => RDD[U]): DStream[U]
transform operator applies transformFunc function to the generated RDD for a batch.
It asserts that one and exactly one RDD has been generated for a batch before
calling the transformFunc .
Note
transform Example
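A minimal sketch (reusing the clicks input stream from the earlier examples; the sorting is arbitrary):

// sort every batch's RDD by key before further processing
val sortedClicks = clicks.transform { rdd => rdd.sortByKey() }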
transformWith Operators
transformWith(other: DStream[U], transformFunc: (RDD[T], RDD[U]) => RDD[V]): DStream[V]
transformWith(other: DStream[U], transformFunc: (RDD[T], RDD[U], Time) => RDD[V]): DStream[V]
transformWith operators apply the transformFunc function to the two generated RDDs for a
batch.
It creates a TransformedDStream stream.
Note
It asserts that two and exactly two RDDs have been generated for a batch
before calling the transformFunc .
Note
transformWith Example
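A minimal sketch ( purchases is a made-up second stream with the same key type; the join is arbitrary):

import org.apache.spark.rdd.RDD

// join the per-batch RDDs of the two streams
val joined = clicks.transformWith(purchases,
  (c: RDD[(String, String)], p: RDD[(String, String)]) => c.join(p))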
Windowed Operators
Go to Window Operations to read the official documentation.
Note
In short, windowed operators allow you to apply transformations over a sliding window of
data, i.e. build a stateful computation across multiple batches.
Note
By default, you apply transformations using different stream operators to a single RDD that
represents a dataset that has been built out of data received from one or many input
streams. The transformations know nothing about the past (datasets received and already
processed). The computations are hence stateless.
You can however build datasets based upon the past ones, and that is when windowed
operators enter the stage. Using them allows you to cross the boundary of a single dataset
(per batch) and have a series of datasets in your hands (as if the data they hold arrived in a
single batch interval).
slice Operators
slice(interval: Interval): Seq[RDD[T]]
slice(fromTime: Time, toTime: Time): Seq[RDD[T]]
slice operators return a collection of RDDs that were generated during the given time interval (both ends inclusive).
window Operators
window(windowDuration: Duration): DStream[T]
window(windowDuration: Duration, slideDuration: Duration): DStream[T]
window operator creates a new stream that generates RDDs containing all the elements
received within the window, e.g.:
messages.window(Seconds(10))
reduceByWindow Operator
reduceByWindow(reduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T]
reduceByWindow(reduceFunc: (T, T) => T, invReduceFunc: (T, T) => T, windowDuration: Duration, slideDuration: Duration): DStream[T]
reduceByWindow operator creates a new stream of single-element RDDs. Each element is
computed by applying reduceFunc to the data received during a batch and then again to the
reduced elements of past batches within the window of windowDuration , sliding slideDuration
forward.
Note
reduceByWindow Example
// batchDuration = Seconds(5)
val clicks: InputDStream[(String, String)] = messages
type T = (String, String)
val reduceFn: (T, T) => T = {
  case in @ ((k1, v1), (k2, v2)) =>
    println(s">>> input: $in")
    (k2, s"$v1 + $v2")
}
val windowedClicks: DStream[(String, String)] =
  clicks.reduceByWindow(reduceFn, windowDuration = Seconds(10), slideDuration = Seconds(5))
windowedClicks.print
SaveAs Operators
There are two saveAs operators in DStream:
saveAsObjectFiles
saveAsTextFiles
They are output operators that return nothing as they save each RDD in a batch to a
storage.
Their full signature is as follows:
saveAsObjectFiles(prefix: String, suffix: String = ""): Unit
saveAsTextFiles(prefix: String, suffix: String = ""): Unit
Note
saveAsObjectFiles uses RDD.saveAsObjectFile while saveAsTextFiles uses
RDD.saveAsTextFile.
The file name is based on mandatory prefix and batch time with optional suffix . It is in
the format of [prefix]-[time in milliseconds].[suffix] .
Example
val clicks: InputDStream[(String, String)] = messages
clicks.saveAsTextFiles("clicks", "txt")
Stateful operators let you maintain state across batches, e.g. for counters or other
cumulative calculations.
The motivation for the stateful operators is that by design streaming operators are stateless
and know nothing about the previous records and hence a state. If you'd like to react to new
records appropriately given the previous records you would have to resort to using persistent
storages outside Spark Streaming.
Note
mapWithState Operator
mapWithState(spec: StateSpec[K, V, ST, MT]): MapWithStateDStream[K, V, ST, MT]
You create StateSpec instances for mapWithState operator using the factory methods
StateSpec.function.
mapWithState creates a MapWithStateDStream dstream.
mapWithState Example
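A minimal sketch (a running count per key; pairs is a made-up DStream[(String, Int)] and the state is kept by Spark Streaming between batches):

import org.apache.spark.streaming.{State, StateSpec}

// update the running count of a key and emit the new total
val updateCount = (key: String, value: Option[Int], state: State[Long]) => {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val counts = pairs.mapWithState(StateSpec.function(updateCount))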
A key and its state are considered idle if the key has not received any data for at least the given
idle duration.
updateStateByKey Operator
updateStateByKey(updateFn: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)] (1)
updateStateByKey(updateFn: (Seq[V], Option[S]) => Option[S],
numPartitions: Int): DStream[(K, S)] (2)
updateStateByKey(updateFn: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner): DStream[(K, S)] (3)
updateStateByKey(updateFn: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean): DStream[(K, S)] (4)
updateStateByKey(updateFn: (Seq[V], Option[S]) => Option[S],
partitioner: Partitioner,
initialRDD: RDD[(K, S)]): DStream[(K, S)]
updateStateByKey(updateFn: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
partitioner: Partitioner,
rememberPartitioner: Boolean,
initialRDD: RDD[(K, S)]): DStream[(K, S)]
1. When not specified explicitly, the partitioner used is HashPartitioner with the number of
partitions being the default level of parallelism of a Task Scheduler.
2. You may however specify the number of partitions explicitly for HashPartitioner to use.
3. This is the "canonical" updateStateByKey that the other two variants (without a partitioner or
the number of partitions) use. It allows specifying a partitioner explicitly and then
executes the "last" updateStateByKey with rememberPartitioner enabled.
4. The "last" updateStateByKey
updateStateByKey stateful operator allows for maintaining per-key state and updating it
using updateFn . The updateFn is called for each key, and uses new data and existing state
of the key, to generate an updated state.
Tip
Note
The state update function updateFn scans every key and generates a new state for every
key given a collection of values per key in a batch and the current state for the key (if exists).
updateStateByKey Example
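A minimal sketch (a running count per key; pairs is a made-up DStream[(String, Int)] and a checkpoint directory must be set for stateful operators):

// new values for a key in the batch + the key's previous state, if any
val updateFn: (Seq[Int], Option[Int]) => Option[Int] =
  (values, state) => Some(values.sum + state.getOrElse(0))

val runningCounts = pairs.updateStateByKey(updateFn)
runningCounts.print()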
The page is made up of three sections (aka tables) - the unnamed, top-level one with basic
information about the streaming application (right below the title Streaming Statistics),
Active Batches and Completed Batches.
Note
Basic Information
Basic Information section is the top-level section in the Streaming page that offers basic
information about the streaming application.
Scheduling Delay
Scheduling Delay is the time spent from when the collection of streaming jobs for a batch
was submitted to when the first streaming job (out of possibly many streaming jobs in the
collection) was started.
Note
The values in the timeline (the first column) depict the time between the events
StreamingListenerBatchSubmitted and StreamingListenerBatchStarted (with
minor yet additional delays to deliver the events).
You may see an increase in scheduling delay in the timeline when streaming jobs are queued
up as in the following example:
// batch duration = 5 seconds
val messages: InputDStream[(String, String)] = ...
messages.foreachRDD { rdd =>
  println(">>> Taking a 15-second sleep")
  rdd.foreach(println)
  java.util.concurrent.TimeUnit.SECONDS.sleep(15)
}
Processing Time
Processing Time is the time spent to complete all the streaming jobs of a batch.
Total Delay
Total Delay is the time spent from submitting to complete all jobs of a batch.
Active Batches
Active Batches section presents waitingBatches and runningBatches together.
Completed Batches
Completed Batches section presents retained completed batches (using
completedBatchUIData ).
Note
Figure 7. Two Batches with Incoming Data inside for Kafka Direct Stream in web UI
(Streaming tab)
Figure 8. Two Jobs for Kafka Direct Stream in web UI (Jobs tab)
Streaming Listeners
Streaming listeners are listeners interested in streaming events like batch submitted,
started or completed.
Streaming listeners implement org.apache.spark.streaming.scheduler.StreamingListener
listener interface and process StreamingListenerEvent events.
The following streaming listeners are available in Spark Streaming:
StreamingJobProgressListener
RateController
StreamingListenerEvent Events
StreamingListenerBatchSubmitted is posted when streaming jobs are submitted for execution.
StreamingListenerBatchCompleted is posted when a batch
has completed, i.e. all the streaming jobs in the JobSet have stopped their execution.
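For illustration (a sketch; BatchLogger is made up and simply prints the processing delay of every completed batch), a custom listener is registered with the StreamingContext:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchLogger extends StreamingListener {
  override def onBatchCompleted(event: StreamingListenerBatchCompleted): Unit = {
    val info = event.batchInfo
    println(s"Batch ${info.batchTime} took ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new BatchLogger)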
StreamingJobProgressListener
StreamingJobProgressListener is a streaming listener that collects information for
onBatchSubmitted
For StreamingListenerBatchSubmitted(batchInfo: BatchInfo) events, it stores batchInfo
batch information in the internal waitingBatchUIData registry per batch time.
The number of entries in waitingBatchUIData registry contributes to numUnprocessedBatches
(together with runningBatchUIData ), waitingBatches , and retainedBatches . It is also used
to look up the batch data for a batch time (in getBatchUIData ).
numUnprocessedBatches and waitingBatches are used in StreamingSource.
Note
onBatchStarted
Caution
FIXME
onBatchCompleted
Caution
FIXME
Retained Batches
retainedBatches are waiting, running, and completed batches that web UI uses to display
streaming statistics.
The number of retained batches is controlled by spark.streaming.ui.retainedBatches.
Checkpointing
Checkpointing is a process of writing received records (by means of input dstreams) at
checkpoint intervals to a highly-available HDFS-compatible storage. It allows creating
fault-tolerant stream processing pipelines so that, when a failure occurs, input dstreams can restore
the before-failure streaming state and continue stream processing (as if nothing had
happened).
DStreams can checkpoint input data at specified time intervals.
You can also create a brand new StreamingContext (putting checkpoints
aside).
You must not create input dstreams using a StreamingContext that has been
recreated from checkpoint. Otherwise, you will not start the
StreamingContext at all.
Warning
When you use StreamingContext(path: String) constructor (or the variants thereof), it uses
Hadoop configuration to access path directory on a Hadoop-supported file system.
Effectively, the two variants use the StreamingContext(path: String, hadoopConf: Configuration)
constructor that reads the latest valid checkpoint file (and hence enables recovering the streaming context from it).
Note
SparkContext and batch interval are set to their corresponding values using the
checkpoint file.
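In practice, the recommended way to get a checkpoint-aware StreamingContext is StreamingContext.getOrCreate (a sketch; the checkpoint directory and the setup inside createContext are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("CheckpointedApp")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint("_checkpoint")
  // set up input dstreams and output operators here
  ssc
}

// recreate from the checkpoint if one exists, otherwise build a fresh context
val ssc = StreamingContext.getOrCreate("_checkpoint", createContext _)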
DStreamCheckpointData
Tip
Enable DEBUG logging level for org.apache.spark.streaming.dstream.DStreamCheckpointData logger to see what
happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.streaming.dstream.DStreamCheckpointData=DEBUG
Refer to Logging.
update collects batches and the directory names where the corresponding RDDs were checkpointed.
The collection of the batches and their checkpointed RDDs is recorded in an internal field for
serialization (i.e. it becomes the current value of the internal field currentCheckpointFiles
that is serialized when requested).
The collection is also added to an internal transient (non-serializable) mapping
timeToCheckpointFile and the oldest checkpoint (given batch times) is recorded in an
internal transient mapping ( timeToOldestCheckpointFileTime ).
cleanup deletes checkpoint files older than the oldest batch for the input time .
It first gets the oldest batch time for the input time (see Updating Collection of Batches and
Checkpoint Directories (update method)).
If the (batch) time has been found, all the checkpoint files older are deleted (as tracked in
the internal timeToCheckpointFile mapping).
You should see the following DEBUG message in the logs:
DEBUG Files to delete:
[comma-separated files to delete]
For each checkpoint file successfully deleted, you should see the following INFO message in
the logs:
INFO Deleted checkpoint file '[file]' for time [time]
Otherwise, when no (batch) time has been found for the given input time , you should see
the following DEBUG message in the logs:
DEBUG Nothing to delete
Note
restore restores the dstream's generatedRDDs given the persistent internal data mapping of batch times and the corresponding checkpoint files.
restore takes the current checkpoint files and restores checkpointed RDDs from each checkpoint file (using SparkContext.checkpointFile ).
Note
It is called by DStream.restoreCheckpointData().
Checkpoint
Checkpoint class requires a StreamingContext and checkpointTime time to be instantiated.
Note
Note
It is merely a collection of the settings of the current streaming runtime environment that is
supposed to recreate the environment after it goes down due to a failure or when the
streaming context is stopped immediately.
It collects the settings from the input StreamingContext (and indirectly from the
corresponding JobScheduler and SparkContext):
The master URL from SparkContext as master .
The mandatory application name from SparkContext as framework .
The jars to distribute to workers from SparkContext as jars .
The DStreamGraph as graph
The checkpoint directory as checkpointDir
The checkpoint interval as checkpointDuration
The collection of pending batches to process as pendingTimes
The Spark configuration (aka SparkConf) as sparkConfPairs
Refer to Logging.
serialize writes the input Checkpoint object (with a compression codec) and returns the result as a collection of bytes.
Caution
deserialize reads the bytes back (with the
compression codec) and once read the just-built Checkpoint object is validated and returned
back.
Note
validate validates the Checkpoint. It ensures that master , framework , graph , and
checkpointTime are defined, i.e. not null .
Note
You should see the following INFO message in the logs when the object passes the
validation:
CheckpointWriter
An instance of CheckpointWriter is created (lazily) when JobGenerator is, but only when
JobGenerator is configured for checkpointing.
It uses the internal single-thread thread pool executor to execute checkpoint writes
asynchronously and does so until it is stopped.
write method serializes the checkpoint object and passes the serialized form to
CheckpointWriteHandler to write asynchronously (i.e. on a separate thread) using the single-thread thread pool executor.
Note
It is called when JobGenerator receives DoCheckpoint event and the batch time
is eligible for checkpointing.
If the asynchronous checkpoint write fails, you should see the following ERROR in the logs:
ERROR Could not submit checkpoint task to the thread pool executor
CheckpointWriter uses the internal stopped flag to mark whether it is stopped or not.
Note
stop method checks the internal stopped flag and returns if it says it is stopped already.
If not, it orderly shuts down the internal single-thread thread pool executor and awaits
termination for 10 seconds. During that time, any asynchronous checkpoint writes can be
safely finished, but no new tasks will be accepted.
Note
The wait time before executor stops is fixed, i.e. not configurable, and is set to
10 seconds.
After 10 seconds, when the thread pool did not terminate, stop stops it forcefully.
You should see the following INFO message in the logs:
INFO CheckpointWriter: CheckpointWriter executor terminated? [terminated], waited for
[time] ms.
CheckpointWriteHandler - Asynchronous Checkpoint Writes
CheckpointWriteHandler is an (internal) thread of execution that does checkpoint writes. It is
instantiated with checkpointTime , the serialized form of the checkpoint, and whether or not
to clean checkpoint data later flag (as clearCheckpointDataLater ).
Note
It records the current checkpoint time (in latestCheckpointTime ) and calculates the name of
the checkpoint file.
Note
It uses a backup file to do atomic write, i.e. it writes to the checkpoint backup file first and
renames the result file to the final checkpoint file name.
Note
Note
not configurable.
When attempting to write, you should see the following INFO message in the logs:
INFO CheckpointWriter: Saving checkpoint for time [checkpointTime] ms to file '[checkpointFile]'
Note
It deletes any checkpoint backup files that may exist from the previous
attempts.
It then deletes checkpoint files when there are more than 10.
Note
The number of checkpoint files when the deletion happens, i.e. 10, is fixed and
not configurable.
If all went fine, you should see the following INFO message in the logs:
INFO CheckpointWriter: Checkpoint for time [checkpointTime] ms saved to file '[checkpointFile]', took [bytes] bytes and [time] ms
JobGenerator is informed that the checkpoint write completed (with checkpointTime and
clearCheckpointDataLater flag).
In case of write failures, you can see the following WARN message in the logs:
If the number of write attempts exceeded (the fixed) 10 or CheckpointWriter was stopped
before any successful checkpoint write, you should see the following WARN message in the
logs:
WARN CheckpointWriter: Could not write checkpoint for time [checkpointTime] to file '[checkpointFile]'
CheckpointReader
CheckpointReader is a private[streaming] helper class to read the latest valid checkpoint file from a checkpoint directory.
read methods read the latest valid checkpoint file from the checkpoint directory
checkpointDir . They differ in whether Spark configuration conf and Hadoop configuration
hadoopConf are given or created in place.
Note
The first read throws no SparkException when no checkpoint file could be read.
Note
It appears that no part of Spark Streaming uses the simplified version of read .
read uses Apache Hadoop's Path and Configuration to get the checkpoint files (using Checkpoint.getCheckpointFiles ).
The method reads all the checkpoints (from the youngest to the oldest) until one is
successfully loaded, i.e. deserialized.
You should see the following INFO message in the logs just before deserializing a
checkpoint file :
INFO CheckpointReader: Attempting to load checkpoint from file [file]
If the checkpoint file was loaded, you should see the following INFO messages in the logs:
INFO CheckpointReader: Checkpoint successfully loaded from file [file]
INFO CheckpointReader: Checkpoint was generated at time [checkpointTime]
In case of any issues while loading a checkpoint file, you should see the following WARN in
the logs and the corresponding exception:
WARN CheckpointReader: Error reading checkpoint from file [file]
JobScheduler
Streaming scheduler ( JobScheduler ) schedules streaming jobs to be run as Spark jobs. It
is created as part of creating a StreamingContext and starts with it.
Tip
Enable DEBUG logging level for org.apache.spark.streaming.scheduler.JobScheduler logger to see what happens
in JobScheduler.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.streaming.scheduler.JobScheduler=DEBUG
Refer to Logging.
When JobScheduler starts (i.e. when start is called), you should see the following
DEBUG message in the logs:
DEBUG JobScheduler: Starting JobScheduler
It then goes over all the dependent services and starts them one by one as depicted in the
figure.
FIXME
Note
ReceiverTracker is stopped.
Note
When stopping gracefully, JobScheduler waits for the jobExecutor thread pool to terminate for up to
1 hour (it is assumed that it is enough and is not configurable). Otherwise, it waits for 2
seconds.
jobExecutor Thread Pool is forcefully shut down (using jobExecutor.shutdownNow() ) unless it
has terminated already.
You should see the following DEBUG message in the logs:
DEBUG JobScheduler: Stopped job executor
When no streaming jobs are inside the jobSet , you should see the following INFO in the
logs:
INFO JobScheduler: No jobs added for time [jobSet.time]
Otherwise, when there is at least one streaming job inside the jobSet ,
StreamingListenerBatchSubmitted (with data statistics of every registered input stream for
which the streaming jobs were generated) is posted to StreamingListenerBus.
The JobSet is added to the internal jobSets registry.
It then goes over every streaming job in the jobSet and executes a JobHandler (on
jobExecutor Thread Pool).
At the end, you should see the following INFO message in the logs:
INFO JobScheduler: Added jobs for time [jobSet.time] ms
JobHandler
JobHandler is a thread of execution for a streaming job (that simply calls Job.run ).
Note
When started, it prepares the environment (so the streaming job can be nicely displayed in
the web UI under /streaming/batch/?id=[milliseconds] ) and posts JobStarted event to
JobSchedulerEvent event loop.
It runs the streaming job that executes the job function as defined while generating a
streaming job for an output stream.
Note
You may see similar-looking INFO messages in the logs (it depends on the operators you
use):
handleJobStart(job: Job, startTime: Long) takes a JobSet (from jobSets ) and checks whether the job is the first streaming job of the JobSet to start.
Note
handleJobCompletion looks the JobSet up (from the jobSets internal registry) and calls JobSet.handleJobCompletion .
Internal Registries
JobScheduler maintains the following information in internal registries:
jobSets - a mapping between time and JobSets. See JobSet.
JobSet
A JobSet represents a collection of streaming jobs that were created at (batch) time for
output streams (that have ultimately produced a streaming job as they may opt out).
Note
At the beginning (when JobSet is created) all streaming jobs are incomplete.
Caution
processingStartTime being the time when the first streaming job in the collection started
processing.
processingEndTime being the time when the last streaming job in the collection finished
processing.
A JobSet changes state over time. It can be in the following states:
Created after a JobSet was created. submissionTime is set.
Started after JobSet.handleJobStart was called. processingStartTime is set.
Completed after JobSet.handleJobCompletion and no more jobs are incomplete (in
incompleteJobs internal registry). processingEndTime is set.
Note
JobGenerator.generateJobs
JobScheduler.submitJobSet(jobSet: JobSet)
JobGenerator.restart
JobScheduler.handleJobStart(job: Job, startTime: Long)
JobScheduler.handleJobCompletion(job: Job, completedTime: Long)
InputInfoTracker
InputInfoTracker tracks batch times and batch statistics for input streams (per input stream
id with StreamInputInfo ). It is later used when JobGenerator submits streaming jobs for a
batch time (and propagated to interested listeners as StreamingListenerBatchSubmitted
event).
Note
Internally, InputInfoTracker maintains a mapping between
batch times and input streams (i.e. another mapping between input stream ids and
StreamInputInfo ).
It accumulates batch statistics at every batch time when input streams are computing RDDs
(and explicitly call InputInfoTracker.reportInfo method).
It is up to input streams to have these batch statistics collected (and requires
calling InputInfoTracker.reportInfo method explicitly).
The following input streams report information:
Note
DirectKafkaInputDStream
ReceiverInputDStreams - Input Streams with Receivers
FileInputDStream
Cleaning up
cleanup(batchThreshTime: Time): Unit
You should see the following INFO message when cleanup of old batch times is requested
(akin to garbage collection):
INFO InputInfoTracker: remove old batch metadata: [timesToCleanup]
Caution
JobGenerator
JobGenerator asynchronously generates streaming jobs every batch interval (using
recurring timer) that may or may not be checkpointed afterwards. It also periodically
requests clearing up metadata and checkpoint data for each input dstream.
Note
JobGenerator is completely owned and managed by JobScheduler, i.e. JobScheduler creates
an instance of JobGenerator and starts it (while being
started itself).
Enable INFO or DEBUG logging level for
org.apache.spark.streaming.scheduler.JobGenerator logger to see what happens
inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.streaming.scheduler.JobGenerator=DEBUG
Refer to Logging.
Note
Figure 1. JobGenerator Start (First Time) procedure (tip: follow the numbers)
It first checks whether or not the internal event loop has already been created which is the
way to know that the JobScheduler was started. If so, it does nothing and exits.
Note
It first requests timer for the start time and passes the start time along to
DStreamGraph.start and RecurringTimer.start.
Note
The start time has the property of being a multiple of batch interval and after the
current system time. It is in the hands of recurring timer to calculate a time with
the property given a batch interval.
Note
Note
INFO RecurringTimer: Started timer for JobGenerator at time [nextTime]
Right before the method finishes, you should see the following INFO message in the logs:
INFO JobGenerator: Started JobGenerator at [startTime] ms
Note
It first checks whether eventLoop internal event loop was ever started (through checking
null ).
Warning
When JobGenerator should stop immediately, i.e. ignoring unprocessed data and pending
streaming jobs ( processReceivedData flag is disabled), you should see the following INFO
message in the logs:
INFO JobGenerator: Stopping JobGenerator immediately
It requests the timer to stop forcefully ( interruptTimer is enabled) and stops the graph.
Otherwise, when JobGenerator should stop gracefully, i.e. processReceivedData flag is
enabled, you should see the following INFO message in the logs:
INFO JobGenerator: Stopping JobGenerator gracefully
You should immediately see the following INFO message in the logs:
INFO JobGenerator: Waiting for all received blocks to be consumed for job generation
It waits until ReceiverTracker has no blocks left to be processed or the stop timeout expires (whichever is shorter) before continuing.
Note
When a timeout occurs, you should see the WARN message in the logs:
WARN JobGenerator: Timed out while stopping the job generator (timeout = [stopTimeoutMs])
After the waiting is over, you should see the following INFO message in the logs:
INFO JobGenerator: Waited for all received blocks to be consumed for job generation
It requests timer to stop generating streaming jobs ( interruptTimer flag is disabled) and
stops the graph.
You should see the following INFO message in the logs:
INFO JobGenerator: Stopped generation timer
You should immediately see the following INFO message in the logs:
INFO JobGenerator: Waiting for jobs to be processed and checkpoints to be written
It then waits until all pending batches have been processed or the stop timeout expires (whichever is shorter) before continuing. It waits for batches to
complete using the last processed batch internal property that should eventually be exactly the
time when the timer was stopped (it returns the last time for which the streaming job was
generated).
Note
After the waiting is over, you should see the following INFO message in the logs:
INFO JobGenerator: Waited for jobs to be processed and checkpoints to be written
When restarted from a checkpoint, JobGenerator restores the
environment of the past execution that may have stopped immediately, i.e. without waiting
for all the streaming jobs to complete when checkpointing was enabled, or due to an abrupt
shutdown (an unrecoverable failure or similar).
Note
restart first calculates the batches that may have been missed while JobGenerator was
down, i.e. batch times between the current restart time and the time of initial checkpoint.
Warning
restart doesn't check whether the initial checkpoint exists or not, which may
lead to an NPE.
It then asks the initial checkpoint for pending batches, i.e. the times of streaming job sets.
Caution
FIXME What are the pending batches? Why would they ever exist?
It then computes the batches to reschedule, i.e. pending and down time batches that are
before restart time.
You should see the following INFO message in the logs:
INFO JobGenerator: Batches to reschedule ([size] batches): [timesToReschedule]
The only purpose of the lastProcessedBatch property is to allow for stopping the streaming
context gracefully, i.e. to wait until all generated streaming jobs are completed.
Note
For every JobGeneratorEvent event, you should see the following DEBUG message in the
logs:
DEBUG JobGenerator: Got event [event]
Note
If checkpointing is disabled or the current batch time is not eligible for checkpointing, the
method does nothing and exits.
Note
A current batch is eligible for checkpointing when the time interval between
current batch time and zero time is a multiple of checkpoint interval.
Caution
FIXME Who checks and when whether checkpoint interval is greater than
batch interval or not? What about checking whether a checkpoint interval is a
multiple of batch time?
Caution
Otherwise, when checkpointing should be performed, you should see the following INFO
message in the logs:
INFO JobGenerator: Checkpointing graph for time [time] ms
It requests DStreamGraph for updating checkpoint data and CheckpointWriter for writing a
new checkpoint. Both are given the current batch time .
ClearMetadata are posted after a micro-batch for a batch time has completed.
It removes old RDDs that have been generated and collected so far by output streams
(managed by DStreamGraph). It is a sort of garbage collector.
When ClearMetadata(time) arrives, it first asks DStreamGraph to clear metadata for the
given time.
If checkpointing is enabled, it posts a DoCheckpoint event (with clearCheckpointDataLater
being enabled, i.e. true ) and exits.
Otherwise, when checkpointing is disabled, it asks DStreamGraph for the maximum
remember duration across all the input streams and requests ReceiverTracker and
InputInfoTracker to do their cleanups.
Caution
Eventually, it marks the batch as fully processed, i.e. that the batch completed as well as
checkpointing or metadata cleanups, using the internal lastProcessedBatch marker.
clearCheckpointData(time: Time)
Caution
When and what for are they set? Can one of ssc.checkpointDuration and
ssc.checkpointDir be null ? Do they all have to be set and is this checked
somewhere?
Answer: See Setup Validation.
Caution
onCheckpointCompletion
Caution
FIXME
timer RecurringTimer
timer RecurringTimer (with the name being JobGenerator ) is used to post GenerateJobs events at every batch interval.
timer is created when JobGenerator is. It starts when JobGenerator starts (for
the first time only).
DStreamGraph
DStreamGraph (is a final helper class that) manages input and output dstreams. It also
holds zero time for the other components that marks the time when it was started.
DStreamGraph maintains the collections of InputDStream instances (as inputStreams ) and
output DStream instances (as outputStreams ), but, more importantly, it generates streaming
jobs for output streams for a batch (time).
DStreamGraph holds the batch interval for the other parts of a Streaming application.
Refer to Logging.
The batch interval is a central property of a Streaming application.
setBatchDuration(duration: Duration) is the method to set the batch interval.
It appears that it is the place for the value since it must be set before JobGenerator can be
instantiated.
It is set while StreamingContext is being instantiated and is validated (using validate()
method of StreamingContext and DStreamGraph ) before StreamingContext is started.
Maximum Remember Interval is the maximum remember interval across all the input
dstreams. It is calculated using getMaxInputStreamRememberDuration method.
Note
FIXME
you need to register a dstream (using the DStream.register method) which happens for FIXME
Starting DStreamGraph
start(time: Time): Unit
When DStreamGraph is started (using start method), it sets zero time and start time.
Note
Note
start method is called when JobGenerator starts for the first time (not from a
checkpoint).
You can start DStreamGraph as many times until time is not null and zero
time has been set.
(output dstreams) start then walks over the collection of output dstreams and for each
output dstream, one at a time, calls their initialize(zeroTime), remember (with the current
remember interval), and validateAtStart methods.
(input dstreams) When all the output streams are processed, it starts the input dstreams (in
parallel) using start method.
Stopping DStreamGraph
stop(): Unit
Caution
FIXME
Restarting DStreamGraph
restart(time: Time): Unit
Note
Caution
This is the only moment when zero time can be different than start time.
restart doesn't seem to be called ever.
generateJobs method generates a collection of streaming jobs for output streams for a
given batch time . It walks over each registered output stream (in the outputStreams internal
registry) and requests each stream for a streaming job.
Note
When generateJobs method executes, you should see the following DEBUG message in
the logs:
DEBUG DStreamGraph: Generating jobs for time [time] ms
generateJobs then walks over each registered output stream (in the outputStreams internal registry).
Right before the method finishes, you should see the following DEBUG message with the
number of streaming jobs generated (as jobs.length ):
DEBUG DStreamGraph: Generated [jobs.length] jobs for time [time] ms
Validation Check
validate() method checks whether batch duration and at least one output stream have been set.
Metadata Cleanup
Note
When clearMetadata(time: Time) is called, you should see the following DEBUG message
in the logs:
DEBUG DStreamGraph: Clearing metadata for time [time] ms
It merely walks over the collection of output streams and (synchronously, one by one) asks
to do its own metadata cleaning.
When finishes, you should see the following DEBUG message in the logs:
DEBUG DStreamGraph: Cleared old metadata for time [time] ms
When restoreCheckpointData() is executed, you should see the following INFO message in
the logs:
INFO DStreamGraph: Restoring checkpoint data
At the end, you should see the following INFO message in the logs:
INFO DStreamGraph: Restored checkpoint data
Note
checkpoint.
Note
When updateCheckpointData is called, you should see the following INFO message in the
logs:
INFO DStreamGraph: Updating checkpoint data for time [time] ms
It then walks over every output dstream and calls its updateCheckpointData(time).
When updateCheckpointData finishes it prints out the following INFO message to the logs:
INFO DStreamGraph: Updated checkpoint data for time [time] ms
Checkpoint Cleanup
clearCheckpointData(time: Time)
Note
When clearCheckpointData is called, you should see the following INFO message in the
logs:
INFO DStreamGraph: Clearing checkpoint data for time [time] ms
It merely walks through the collection of output streams and (synchronously, one by one)
asks to do their own checkpoint data cleaning.
When finished, you should see the following INFO message in the logs:
Remember Interval
Remember interval is the time to remember (aka cache) the RDDs that have been
generated by (output) dstreams in the context (before they are released and garbage
collected).
It can be set using remember method.
remember method
remember(duration: Duration): Unit
Note
It first checks whether or not it has been set already and if so, throws
java.lang.IllegalArgumentException as follows:
Note
Refer to Logging.
DStream Contract
A DStream is defined by the following properties (with the names of the corresponding
methods that subclasses have to implement):
dstream dependencies, i.e. a collection of DStreams that this DStream depends on.
They are often referred to as parent dstreams.
def dependencies: List[DStream[_]]
slide duration (aka slide interval), i.e. a time interval after which the stream is
requested to generate a RDD out of input data it consumes.
def slideDuration: Duration
How to compute (generate) an optional RDD for the given batch if any. validTime is a
point in time that marks the end boundary of slide duration.
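The corresponding method in the DStream contract is:
def compute(validTime: Time): Option[RDD[T]]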
Creating DStreams
You can create dstreams through the built-in input stream constructors using streaming
context or more specialized add-ons for external input data sources, e.g. Apache Kafka.
Note
Initially, when a dstream is created, the remember interval is not set (i.e. null ), but is set
when the dstream is initialized.
It can be set to a custom value using remember method.
Note
You may see the current value of remember interval when a dstream is
validated at startup and the log level is INFO.
generatedRDDs is an internal mapping between batch times and the RDDs
that were generated for the batches. It acts as a cache when a dstream is requested to
compute an RDD for a batch (i.e. generatedRDDs may already have the RDD or gets a new
RDD added).
As new RDDs are added, dstreams offer a way to clear the old metadata during which the
old RDDs are removed from generatedRDDs collection.
If checkpointing is used, generatedRDDs collection can be recreated from a storage.
initialize method sets zero time and optionally checkpoint interval (if the dstream must
checkpoint and the interval was not set already) and remember duration.
Note
started.
The zero time of a dstream can only be set once or be set again to the same zero time.
Otherwise, it throws SparkException as follows:
ZeroTime is already initialized to [zeroTime], cannot initialize it again to [time]
If mustCheckpoint is enabled and the checkpoint interval was not set, it is automatically set
to the slide interval or 10 seconds, whichever is longer. You should see the following INFO
message in the logs when the checkpoint interval was set automatically:
INFO [DStreamType]: Checkpoint interval automatically set to [checkpointDuration]
It then ensures that remember interval is at least twice the checkpoint interval (only if
defined) or the slide duration.
At the very end, it initializes the parent dstreams (available as dependencies) that
recursively initializes the entire graph of dstreams.
remember Method
remember(duration: Duration): Unit
remember sets remember interval for the current dstream and the dstreams it depends on
(see dependencies).
If the input duration is specified (i.e. not null ), remember allows setting the remember
interval (only when the current value was not set already) or extend it (when the current
value is shorter).
You should see the following INFO message in the logs when the remember interval
changes:
INFO Duration for remembering RDDs set to [rememberDuration] for [dstream]
At the end, remember always sets the current remember interval (whether it was set,
extended or did not change).
Internally, checkpoint method calls persist (that sets the default MEMORY_ONLY_SER storage
level).
If checkpoint interval is set, the checkpoint directory is mandatory. Spark validates it when
StreamingContext starts and throws an IllegalArgumentException exception if not set.
You can see the value of the checkpoint interval for a dstream in the logs when it is
validated:
INFO Checkpoint interval = [checkpointDuration]
Checkpointing
DStreams can checkpoint input data at specified time intervals.
The following settings are internal to a dstream and define how it checkpoints the input data
if any.
mustCheckpoint (default: false ) is an internal private flag that marks a dstream as
being required to checkpoint its data.
checkpointDuration is the interval at which the dstream
checkpoints data. It is often called checkpoint interval. If not set explicitly, but the
dstream is checkpointed, it is set when dstreams are initialized.
checkpointData is an instance of DStreamCheckpointData.
restoredFromCheckpointData (default: false ) is an internal flag to describe the initial
state of a dstream, i.e. whether ( true ) or not ( false ) it was started by restoring state
from checkpoint.
DStream comes with internal register method that registers a DStream as an output
stream.
The internal private foreachRDD method uses register to register output streams to
DStreamGraph. Whenever called, it creates ForEachDStream and calls register upon it.
That is how streams become output streams.
The internal generateJob method generates a streaming job for a batch time for a (output)
dstream. It may or may not generate a streaming job for the requested batch time .
Note
It computes an RDD for the batch and, if there is one, returns a streaming job for the batch
time with a job function that will run a Spark job (using the generated RDD) when the streaming job is executed.
828
The generated RDD is checkpointed if checkpointDuration is defined and the time interval
between current and zero times is a multiple of checkpointDuration.
You should see the following DEBUG message in the logs:
DEBUG Marking RDD [id] for time [time] for checkpointing
FIXME
Checkpoint Cleanup
Caution
FIXME
restoreCheckpointData
restoreCheckpointData(): Unit
Note
Metadata Cleanup
Note
clearMetadata(time: Time) is called to remove the old RDDs that have been generated so far.
Regardless of spark.streaming.unpersist flag, all the collected RDDs are removed from
generatedRDDs.
When spark.streaming.unpersist flag is set (it is by default), you should see the following
DEBUG message in the logs:
DEBUG Unpersisting old RDDs: [id1, id2, ...]
For every RDD in the list, it unpersists them (without blocking) one by one and explicitly
removes blocks for BlockRDDs. You should see the following INFO message in the logs:
INFO Removing blocks of RDD [blockRDD] of time [time]
After RDDs have been removed from generatedRDDs (and perhaps unpersisted), you
should see the following DEBUG message in the logs:
DEBUG Cleared [size] RDDs that were older than [time]: [time1, time2, ...]
updateCheckpointData
updateCheckpointData(currentTime: Time): Unit
Note
When updateCheckpointData is called, you should see the following DEBUG message in the
logs:
DEBUG Updating checkpoint data for time [currentTime] ms
When updateCheckpointData finishes, you should see the following DEBUG message in the
logs:
DEBUG Updated checkpoint data for time [currentTime]: [checkpointData]
Internal Registries
DStream implementations maintain the following internal properties:
storageLevel (default: NONE ) is the StorageLevel of the RDDs in the DStream .
restoredFromCheckpointData is a flag to inform whether it was restored from checkpoint.
graph is the reference to DStreamGraph.
Input DStreams
Input DStreams in Spark Streaming are the way to ingest data from external data sources.
They are represented as InputDStream abstract class.
InputDStream is the abstract base class for all input DStreams. It provides two abstract
methods start() and stop() to start and stop ingesting data, respectively.
When instantiated, an InputDStream registers itself as an input stream (using
DStreamGraph.addInputStream) and, while doing so, is told about its owning
DStreamGraph.
It asks for its own unique identifier using StreamingContext.getNewInputStreamId() .
Note
Name your custom InputDStream using the CamelCase notation with the suffix
InputDStream, e.g. MyCustomInputDStream.
Note
Custom implementations of InputDStream can override (and actually provide!) the optional
RateController. It is undefined by default.
package pl.japila.spark.streaming

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{ Time, StreamingContext }
import org.apache.spark.streaming.dstream.InputDStream
import scala.reflect.ClassTag

class CustomInputDStream[T: ClassTag](ssc: StreamingContext, seq: Seq[T])
    extends InputDStream[T](ssc) {

  override def compute(validTime: Time): Option[RDD[T]] = {
    Some(ssc.sparkContext.parallelize(seq))
  }

  override def start(): Unit = {}
  override def stop(): Unit = {}
}
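A possible way to use the custom input dstream above in spark-shell (the 5-second batch interval and the input sequence are arbitrary):
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import pl.japila.spark.streaming.CustomInputDStream

val ssc = new StreamingContext(sc, Seconds(5))
val custom = new CustomInputDStream(ssc, 1 to 3)
custom.print
ssc.start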
Tip
Receiver input streams run receivers as long-running tasks that occupy a core
per stream.
ReceiverInputDStream abstract class defines the getReceiver(): Receiver[T] abstract method that custom
input dstreams with receivers have to implement.
The receiver is then sent to and run on workers (when ReceiverTracker is started).
Note
spark.streaming.backpressure.enabled is enabled.
Note
If the time to generate RDDs ( validTime ) is earlier than the start time of StreamingContext,
an empty BlockRDD is generated.
Otherwise, ReceiverTracker is requested for all the blocks that have been allocated to this
stream for this batch (using ReceiverTracker.getBlocksOfBatch ).
The number of records received for the batch for the input stream (as StreamInputInfo aka
input blocks information) is registered to InputInfoTracker (using
InputInfoTracker.reportInfo ).
Back Pressure
Caution
FIXME
Back pressure for input dstreams with receivers can be configured using
spark.streaming.backpressure.enabled setting.
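A minimal sketch of turning back pressure on when building the SparkConf for a streaming application (the application name is arbitrary):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("Backpressure Demo")
  .set("spark.streaming.backpressure.enabled", "true")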
Note
ConstantInputDStreams
ConstantInputDStream is an input stream that always returns the same mandatory input RDD at every batch time.
Example
val sc = new SparkContext("local[*]", "Constant Input DStream Demo", new SparkConf())
import org.apache.spark.streaming.{ StreamingContext, Seconds }
val ssc = new StreamingContext(sc, batchDuration = Seconds(5))
// Create the RDD
val rdd = sc.parallelize(0 to 9)
// Create constant input dstream with the RDD
import org.apache.spark.streaming.dstream.ConstantInputDStream
val cis = new ConstantInputDStream(ssc, rdd)
// Sample stream computation
cis.print
ForEachDStreams
ForEachDStream is an internal DStream with a dependency on the parent stream and the exact same slide interval.
Note
Although it may seem that ForEachDStreams are by design output streams they
are not. You have to use DStreamGraph.addOutputStream to register a stream
as output.
You use stream operators that do the registration as part of their operation, like
print .
WindowedDStreams
WindowedDStream (aka windowed stream) is an internal DStream with dependency on the
parent stream.
Note
Note
The window and slide durations must be multiples of the slide duration of the parent stream.
compute method always returns an RDD, either PartitionerAwareUnionRDD or UnionRDD ,
depending on the number of distinct partitioners defined by the RDDs in the window. It uses the slice
operator on the parent stream (with the slice window of [now - windowDuration +
parent.slideDuration, now] ).
Otherwise, when there are multiple different partitioners in use, UnionRDD is created and
you should see the following DEBUG message in the logs:
DEBUG WindowedDStream: Using normal union for windowing at [time]
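For reference, WindowedDStream is what the window operator gives you; a minimal sketch with arbitrary window and slide durations on an assumed stream value:
import org.apache.spark.streaming.Seconds
// Count the records received in the last 30 seconds, recomputed every 10 seconds
val windowed = stream.window(Seconds(30), Seconds(10))
windowed.count.print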
Enable DEBUG logging level for org.apache.spark.streaming.dstream.WindowedDStream
logger to see what happens inside WindowedDStream .
Tip
MapWithStateDStream
MapWithStateDStream is the result of mapWithState stateful operator.
Note
Note
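A short, hedged sketch of the mapWithState operator over an assumed key-value dstream pairs (of type DStream[(String, Int)]) that keeps a running sum per key:
import org.apache.spark.streaming.{ State, StateSpec }

// (key, new value in the batch, state so far) => mapped record; the state is updated in place
val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

val runningSums = pairs.mapWithState(StateSpec.function(mappingFunc))
runningSums.print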
MapWithStateDStreamImpl
MapWithStateDStreamImpl is an internal DStream with dependency on the parent
dataStream key-value dstream. It uses a custom internal dstream called internalStream (of
type InternalMapWithStateDStream).
slideDuration is exactly the slide duration of the internal stream internalStream .
dependencies returns a single-element collection with the internal stream internalStream .
The compute method may or may not return an RDD[MappedType] by calling getOrCompute on the
internal stream.
Caution
FIXME
InternalMapWithStateDStream
InternalMapWithStateDStream is an internal dstream to support MapWithStateDStreamImpl
and uses dataStream (as parent of type DStream[(K, V)] ) as well as StateSpecImpl[K, V,
S, E] (as spec ).
It is a DStream[MapWithStateRDDRecord[K, S, E]] .
It uses StorageLevel.MEMORY_ONLY storage level by default.
It uses the StateSpec's partitioner or HashPartitioner (with SparkContext's
defaultParallelism).
slideDuration is the slide duration of parent .
Caution
FIXME MapWithStateRDD.createFromRDD
StateDStream
StateDStream is the specialized DStream that is the result of updateStateByKey stateful
operator. It is a wrapper around a parent key-value pair dstream to build stateful pipeline
(by means of updateStateByKey operator) and as a stateful dstream enables checkpointing
(and hence requires some additional setup).
It uses a parent key-value pair dstream, updateFunc update state function, a partitioner ,
a flag whether or not to preservePartitioning and an optional key-value pair initialRDD .
It works with MEMORY_ONLY_SER storage level enabled.
The only dependency of StateDStream is the input parent key-value pair dstream.
The slide duration is exactly the same as that in parent .
It forces checkpointing regardless of the current dstream configuration, i.e. the internal
mustCheckpoint is enabled.
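An illustrative updateStateByKey that sums values per key; the pairs dstream (of type DStream[(String, Int)]) and the checkpoint directory are assumptions:
// updateStateByKey requires checkpointing (StateDStream forces mustCheckpoint)
ssc.checkpoint("_checkpoint")

// (new values for a key in the batch, state so far) => new state
val updateFunc = (values: Seq[Int], state: Option[Int]) =>
  Option(values.sum + state.getOrElse(0))

val totals = pairs.updateStateByKey(updateFunc)
totals.print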
When requested to compute a RDD it first attempts to get the state RDD for the previous
batch (using DStream.getOrCompute). If there is one, parent stream is requested for a
RDD for the current batch (using DStream.getOrCompute). If parent has computed one,
computeUsingPreviousRDD(parentRDD, prevStateRDD) is called.
Caution
FIXME When could getOrCompute not return an RDD? How does this apply
to the StateDStream? What about the parents getOrCompute ?
If however parent has not generated an RDD for the current batch but the state RDD
existed, updateFn is called for every key of the state RDD to generate a new state per
partition (using RDD.mapPartitions).
Note
When the stream processing starts, i.e. no state RDD exists, and there is no
input data received, no computation is triggered.
Given no state RDD and with parent RDD computed, when initialRDD is NONE , the input
data batch (as parent RDD) is grouped by key (using groupByKey with partitioner ) and
then the update state function updateFunc is applied to the partitioned input data (using
RDD.mapPartitions) with None state. Otherwise, computeUsingPreviousRDD(parentRDD,
initialStateRDD) is called.
It should be read as given a collection of triples of a key, new records for the key, and the
current state for the key, generate a collection of keys and their state.
computeUsingPreviousRDD
computeUsingPreviousRDD(parentRDD: RDD[(K, V)], prevStateRDD: RDD[(K, S)]): Option[RDD[(K, S)]]
The computeUsingPreviousRDD method uses cogroup and mapPartitions to build the final
state RDD.
Note
Regardless of the return type Option[RDD[(K, S)]] that really allows no state, it
will always return some state.
Note
It is acceptable to end up with keys that have no new records per batch, but
these keys do have a state (since they were received previously when no state
might have been built yet).
The signature of cogroup is as follows and applies to key-value pair RDDs, i.e. RDD[(K, V)]
Note
It defines an internal update function finalFunc that maps over the collection of all the
keys, new records per key, and at-most-one-element state per key to build a new iterator that
ensures that:
1. a state per key exists (it is None or the state built so far)
2. the lazy iterable of new records is transformed into an eager sequence.
Caution
For every triple per key, the internal update function calls the constructor's updateFunc .
The state RDD is a cogrouped RDD (on parentRDD and prevStateRDD using the
constructor's partitioner ) with every element per partition mapped over using the internal
update function finalFunc and the constructor's preservePartitioning flag (through
mapPartitions ).
Caution
TransformedDStream
TransformedDStream is the specialized DStream that is the result of transform operator.
Note
When created, it asserts that the input collection of dstreams use the same
StreamingContext and slide interval.
It is acceptable to have more than one dependent dstream.
It may throw a SparkException when a dstream does not compute an RDD for a batch.
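For illustration, a transform call (which is what creates a TransformedDStream); pairs is an assumed key-value dstream with Int keys and lookup is a made-up static RDD:
// Join every batch of the stream with a static lookup RDD
val lookup = sc.parallelize(Seq((1, "one"), (2, "two")))
val joined = pairs.transform(rdd => rdd.join(lookup))
joined.print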
Caution
at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:48)
Receivers
Receivers run on workers to receive external data. They are created and belong to
ReceiverInputDStreams.
Note
The abstract Receiver class requires the following methods to be implemented (see
Custom Receiver):
onStart() that starts the receiver when the application starts.
onStop() that stops the receiver.
A receiver uses store methods to store received data as data blocks into Spark's memory.
Note
A receiver can be in one of the three states: Initialized , Started , and Stopped .
Custom Receiver
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver
final class MyStringReceiver extends Receiver[String](StorageLevel.NONE) {

  def onStart() = {
    println("onStart called")
  }

  def onStop() = {
    println("onStop called")
  }
}
val ssc = new StreamingContext(sc, Seconds(5))
val strings = ssc.receiverStream(new MyStringReceiver)
strings.print
ssc.start
// MyStringReceiver will print "onStart called"
ssc.stop()
// MyStringReceiver will print "onStop called"
ReceiverTracker
Introduction
ReceiverTracker manages execution of all Receivers.
It can only be started once and only when at least one input receiver has been registered.
ReceiverTracker can be in one of the following states:
Initialized - it is in the state after having been instantiated.
Started
Stopping
Stopped
You can only start ReceiverTracker once; any further attempts lead to a SparkException being thrown.
Note
A successful startup of ReceiverTracker finishes with the following INFO message in the
logs:
INFO ReceiverTracker: ReceiverTracker started
FIXME
hasUnallocatedBlocks
Caution
FIXME
FIXME
StartAllReceivers
StartAllReceivers(receivers) is a local message sent by ReceiverTracker when it starts
(using ReceiverTracker.launchReceivers() ).
It schedules receivers (using ReceiverSchedulingPolicy.scheduleReceivers(receivers,
getExecutors) ).
Caution
Caution
FIXME When the scaladoc says "along with the scheduled executors", does
it mean that the executors are already started and waiting for the receiver?!
It then starts a ReceiverSupervisor for the receiver and keeps awaiting termination, i.e. once
the task is run it does so until a termination message comes from some other external
source. The task is a long-running task for receiver .
Caution
Otherwise, it distributes the one-element collection across the nodes (and potentially even
executors) for receiver . The RDD has the name Receiver [receiverId] .
The Spark job's description is set to Streaming job running receiver [receiverId] .
Caution
Note
The method demonstrates how you could use Spark Core as the distributed
computation platform to launch any process on clusters and let Spark handle
the distribution.
Very clever indeed!
Ultimately, right before the method exits, the following INFO message appears in the logs:
INFO Receiver [receiver.streamId] started
StopAllReceivers
Caution
FIXME
AllReceiverIds
Caution
FIXME
It then sends the stop signal to all the receivers (i.e. posts StopAllReceivers to
ReceiverTracker RPC endpoint) and waits 10 seconds for all the receivers to quit gracefully
(unless graceful flag is set).
Note
You should see the following INFO messages if the graceful flag is enabled which means
that the receivers quit in a graceful manner:
INFO ReceiverTracker: Waiting for receiver job to terminate gracefully
INFO ReceiverTracker: Waited for receiver job to terminate gracefully
It then checks whether all the receivers have been deregistered or not by posting
AllReceiverIds to ReceiverTracker RPC endpoint.
You should see the following INFO message in the logs if they have:
INFO ReceiverTracker: All of the receivers have deregistered successfully
Otherwise, when there were receivers not having been deregistered properly, the following
WARN message appears in the logs:
WARN ReceiverTracker: Not all of the receivers have deregistered, [receivers]
Note
When there are no receiver input streams in use, the method does nothing.
ReceivedBlockTracker
Caution
FIXME
You should see the following INFO message in the logs when cleanupOldBatches is called:
INFO ReceivedBlockTracker: Deleting batches [timesToCleanup]
allocateBlocksToBatch Method
allocateBlocksToBatch(batchTime: Time): Unit
INFO Possibly processed batch [batchTime] needs to be processed again in WAL recovery
ReceiverSupervisors
ReceiverSupervisor is an (abstract) handler object that is responsible for supervising a
receiver (that runs on the worker). It assumes that implementations offer concrete methods
to push received data to Spark.
Note
Note
ReceiverSupervisor Contract
ReceiverSupervisor is a private[streaming] abstract class that assumes that concrete
implementations provide the methods to push received data and to handle the receiver lifecycle.
Starting Receivers
startReceiver() calls the (abstract) onReceiverStart() . When it returns true , the receiver is marked Started .
The receiver's onStart() is called and another INFO message appears in the logs:
INFO Called receiver onStart
Stopping Receivers
stop method is called with a message and an optional cause of the stop (called error ). It
calls stopReceiver method that prints the INFO message and checks the state of the
receiver to react appropriately.
When the receiver is in Started state, stopReceiver calls Receiver.onStop() , prints the
following INFO message, and onReceiverStop(message, error) .
INFO Called receiver onStop
Restarting Receivers
A ReceiverSupervisor uses spark.streaming.receiverRestartDelay to restart the receiver
with delay.
Note
It then stops the receiver, sleeps for delay milliseconds and starts the receiver (using
startReceiver() ).
Caution
Awaiting Termination
awaitTermination method blocks the current thread to wait for the receiver to be stopped.
Note
When called, you should see the following INFO message in the logs:
INFO Waiting for receiver to be stopped
If a receiver has terminated successfully, you should see the following INFO message in the
logs:
INFO Stopped receiver without error
stoppingError is the exception associated with the stopping of the receiver and is rethrown.
Note
A receiver can be stopped in the following cases:
When a receiver itself calls stop(message: String) or stop(message: String, error:
Throwable)
ReceiverSupervisorImpl
ReceiverSupervisorImpl is the implementation of ReceiverSupervisor contract.
Note
It communicates with ReceiverTracker that runs on the driver (by posting messages using
the ReceiverTracker RPC endpoint).
Enable DEBUG logging level for
org.apache.spark.streaming.receiver.ReceiverSupervisorImpl logger to see what
happens in ReceiverSupervisorImpl .
Tip
push Methods
push methods, i.e. pushArrayBuffer , pushIterator , and pushBytes solely pass calls on to
ReceiverSupervisorImpl.pushAndReportBlock.
ReceiverSupervisorImpl.onReceiverStart
ReceiverSupervisorImpl.onReceiverStart sends a blocking RegisterReceiver message to
(using getCurrentLimit ).
ReceivedBlockHandler
ReceiverSupervisorImpl decides which ReceivedBlockHandler to use.
It defaults to BlockManagerBasedBlockHandler, but could use
WriteAheadLogBasedBlockHandler instead when
spark.streaming.receiver.writeAheadLog.enable is true .
It uses ReceivedBlockHandler to storeBlock (see ReceivedBlockHandler Contract for more
coverage and ReceiverSupervisorImpl.pushAndReportBlock in this document).
ReceiverSupervisorImpl.pushAndReportBlock
ReceiverSupervisorImpl.pushAndReportBlock(receivedBlock: ReceivedBlock, metadataOption:
Option[Any], blockIdOption: Option[StreamBlockId]) stores receivedBlock using
ReceivedBlockHandler.storeBlock and reports it to the driver.
ReceiverSupervisorImpl.pushAndReportBlock is only used by the push methods
(that is how received data blocks end up being reported to the driver).
Note
When a response comes, you should see the following DEBUG message in the logs:
DEBUG Reported block [blockId]
ReceivedBlockHandlers
ReceivedBlockHandler represents how to handle the storage of blocks received by receivers.
Note
ReceivedBlockHandler Contract
ReceivedBlockHandler is a private[streaming] trait . It comes with two methods:
storeBlock(blockId: StreamBlockId, receivedBlock: ReceivedBlock):
ReceivedBlockStoreResult to store a received block as blockId .
cleanupOldBlocks(threshTime: Long) to clean up blocks older than threshTime .
Note
cleanupOldBlocks implies that there is a relation between blocks and the time
they arrived.
BlockManagerBasedBlockHandler
BlockManagerBasedBlockHandler is the default ReceivedBlockHandler in Spark Streaming.
It uses BlockManager to store a ReceivedBlock .
WriteAheadLogBasedBlockHandler
WriteAheadLogBasedBlockHandler is used when
spark.streaming.receiver.writeAheadLog.enable is true .
It uses BlockManager, a receiver's streamId and StorageLevel, SparkConf for additional
configuration settings, Hadoop Configuration, and the checkpoint directory.
Streaming mode
You create DirectKafkaInputDStream using KafkaUtils.createDirectStream .
Define the types of keys and values (and their decoders) in KafkaUtils.createDirectStream .
Note
Kafka brokers have to be up and running before you can create a direct stream.
If zookeeper.connect or group.id parameters are not set, they are added with their values
being empty strings.
In this mode, you will only see jobs submitted (in the Jobs tab in web UI) when a message
comes in.
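A hedged example of creating the direct stream, assuming the spark-streaming-kafka 0.8 connector is on the classpath, a StreamingContext ssc exists and a broker listens on localhost:9092:
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("my-topic")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)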
Note
DirectKafkaInputDStream
As an input stream, it implements the five mandatory abstract methods - three from
DStream and two from InputDStream :
dependencies: List[DStream[_]] returns an empty collection, i.e. it has no
dependencies on other streams (other than Kafka brokers to read data from).
slideDuration: Duration passes all calls on to DStreamGraph.batchDuration.
compute(validTime: Time): Option[RDD[T]] - consult Computing RDDs (using compute
Method) section.
start() does nothing.
stop() does nothing.
The name of the input stream is Kafka direct stream [id]. You can find the name in the
Streaming tab in web UI (in the details of a batch in Input Metadata section).
It uses spark.streaming.kafka.maxRetries setting while computing latestLeaderOffsets (i.e.
a mapping of kafka.common.TopicAndPartition and LeaderOffset).
Every time the method is called, latestLeaderOffsets calculates the latest offsets (as
Map[TopicAndPartition, LeaderOffset] ).
Note
Every call to compute does call Kafka brokers for the offsets.
The moving parts of generated KafkaRDD instances are offsets. Others are taken directly
from DirectKafkaInputDStream (given at the time of instantiation).
It then filters out empty offset ranges to build StreamInputInfo for
InputInfoTracker.reportInfo.
It sets the just-calculated offsets as current (using currentOffsets ) and returns a new
KafkaRDD instance.
Back Pressure
Caution
FIXME
Back pressure for Direct Kafka input dstream can be configured using
spark.streaming.backpressure.enabled setting.
Note
Kafka Concepts
broker
leader
topic
partition
offset
exactly-once semantics
Kafka high-level consumer
LeaderOffset
LeaderOffset is an internal class to represent an offset on the topic partition on the broker.
Recommended Reading
Exactly-once Spark Streaming from Apache Kafka
KafkaRDD
KafkaRDD class represents an RDD dataset from Apache Kafka. It uses KafkaRDDPartition
for partitions that know their preferred locations as the host of the topic (not the port, however!). It
then nicely maps an RDD partition to a Kafka partition.
Tip
KafkaRDD overrides methods of RDD class to base them on offsetRanges , i.e. partitions.
Computing Partitions
To compute a partition, KafkaRDD checks for validity of beginning and ending offsets (so
they range over at least one element) and returns an (internal) KafkaRDDIterator .
You should see the following INFO message in the logs:
INFO KafkaRDD: Computing topic [topic], partition [partition] offsets [fromOffset] ->
[toOffset]
FIXME Review
RecurringTimer
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
RecurringTimer starts a daemon thread prefixed RecurringTimer - [name] that, once started, executes callback in a loop
every period time (until it is stopped).
The wait time is achieved by Clock.waitTillTime (that makes testing easier).
Enable INFO or DEBUG logging level for
org.apache.spark.streaming.util.RecurringTimer logger to see what happens
inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.streaming.util.RecurringTimer=DEBUG
Refer to Logging.
When RecurringTimer triggers an action for a period , you should see the following
DEBUG message in the logs:
DEBUG RecurringTimer: Callback for [name] called at time [prevTime]
getRestartTime is similar to getStartTime but accepts an originalStartTime parameter, i.e. it calculates a time as getStartTime but shifts the result to accommodate the
time gap since originalStartTime .
Note
Starting Timer
start(startTime: Long): Long
start(): Long (1)
When start is called, it sets the internal nextTime to the given input parameter
startTime and starts the internal daemon thread. This is the moment when the clock starts
ticking.
You should see the following INFO message in the logs:
INFO RecurringTimer: Started timer for [name] at time [nextTime]
Stopping Timer
stop(interruptTimer: Boolean): Long
When called, you should see the following INFO message in the logs:
INFO RecurringTimer: Stopped timer for [name] after time [prevTime]
stop method uses the internal stopped flag to mark the stopped state and returns the last
period for which it was successfully executed (tracked as prevTime internally).
Note
Before it fully terminates, it triggers callback one more/last time, i.e. callback
is executed for a period after RecurringTimer has been (marked) stopped.
Fun Fact
You can execute org.apache.spark.streaming.util.RecurringTimer as a command-line
standalone application.
$ ./bin/spark-class org.apache.spark.streaming.util.RecurringTimer
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
INFO RecurringTimer: Started timer for Test at time 1453787444000
INFO RecurringTimer: 1453787444000: 1453787444000
DEBUG RecurringTimer: Callback for Test called at time 1453787444000
INFO RecurringTimer: 1453787445005: 1005
DEBUG RecurringTimer: Callback for Test called at time 1453787445000
INFO RecurringTimer: 1453787446004: 999
DEBUG RecurringTimer: Callback for Test called at time 1453787446000
INFO RecurringTimer: 1453787447005: 1001
DEBUG RecurringTimer: Callback for Test called at time 1453787447000
INFO RecurringTimer: 1453787448000: 995
DEBUG RecurringTimer: Callback for Test called at time 1453787448000
^C
INFO ShutdownHookManager: Shutdown hook called
INFO ShutdownHookManager: Deleting directory /private/var/folders/0w/kb0d3rqn4zb9fcc91
pxhgn8w0000gn/T/spark-71dbd43d-2db3-4527-adb8-f1174d799b0d/repl-a6b9bf12-fec2-4004-923
6-3b0ab772cc94
INFO ShutdownHookManager: Deleting directory /private/var/folders/0w/kb0d3rqn4zb9fcc91
pxhgn8w0000gn/T/spark-71dbd43d-2db3-4527-adb8-f1174d799b0d
Backpressure
Note
RateController
Tip
RateController listens to batch-completed updates for a dstream and maintains a rate limit, i.e. an estimate of the speed at
which this stream should ingest messages. With every batch-completed update event it
calculates the current processing rate and estimates the correct receiving rate.
Note
When created, it creates a daemon single-thread executor service called stream-rate-update and initializes the internal rateLimit counter which is the current message-ingestion speed.
When a batch completed update happens, a RateController grabs processingEndTime ,
processingDelay , schedulingDelay , and numRecords processed for the batch, computes a
rate limit and publishes the current value. The computed value is set as the present rate
limit, and published (using the sole abstract publish method).
Computing a rate limit happens using the RateEstimators compute method.
Caution
RateEstimator
RateEstimator computes the rate given the input time , elements , processingDelay , and
schedulingDelay .
The PID rate estimator is the only possible estimator. All other rate
estimators lead to IllegalArgumentException being thrown.
Note
Enable TRACE logging level for org.apache.spark.streaming.scheduler.rate.PIDRateEstimator logger to see what happens inside.
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.streaming.scheduler.rate.PIDRateEstimator=TRACE
Refer to Logging.
When the PID rate estimator is created you should see the following INFO message in the
logs:
INFO PIDRateEstimator: Created PIDRateEstimator with proportional = [proportional], in
tegral = [integral], derivative = [derivative], min rate = [minRate]
When the pid rate estimator computes the rate limit for the current time, you should see the
following TRACE message in the logs:
TRACE PIDRateEstimator:
time = [time], # records = [numElements], processing time = [processingDelay], schedul
ing delay = [schedulingDelay]
If the time to compute the current rate limit for is before the latest time or the number of
records is 0 or less, or processing delay is 0 or less, the rate estimation is skipped. You
should see the following TRACE message in the logs:
TRACE PIDRateEstimator: Rate estimation skipped
Once the new rate has already been computed, you should see the following TRACE
message in the logs:
TRACE PIDRateEstimator:
latestRate = [latestRate], error = [error]
latestError = [latestError], historicalError = [historicalError]
delaySinceUpdate = [delaySinceUpdate], dError = [dError]
If it was the first computation of the limit rate, you should see the following TRACE message
in the logs:
TRACE PIDRateEstimator: First run, rate estimation skipped
The motivation is to control the number of executors required to process input records when
their number increases to the point when the processing time could become longer than the
batch interval.
Configuration
spark.streaming.dynamicAllocation.enabled controls whether to enable dynamic allocation of executors in Spark Streaming.
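A minimal sketch of turning the feature on in SparkConf (illustrative only):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.streaming.dynamicAllocation.enabled", "true")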
ExecutorAllocationManager
Caution
FIXME
requestExecutors
killExecutor
Settings
The following is a list of the settings used to configure Spark Streaming applications.
Caution
spark.streaming.manualClock.jump (default: 0 ) offsets the checkpoint time, i.e. adds its value to checkpoint time, when used with the clock being a subclass of
org.apache.spark.util.ManualClock . It is used when JobGenerator is restarted from
checkpoint.
spark.streaming.unpersist (default: true ) is a flag to control whether the old RDDs generated by dstreams are unpersisted (and the blocks of BlockRDDs removed) during metadata cleanup.
Checkpointing
spark.streaming.checkpoint.directory - when set and StreamingContext is created, the value is used as the checkpoint directory.
Back Pressure
spark.streaming.backpressure.enabled (default: false ) - enables ( true ) or disables
( false ) back pressure for the input dstreams in use.
Spark MLlib
I'm new to Machine Learning as a discipline and Spark MLlib in particular, so
mistakes in this document are considered the norm (not an exception).
Caution
Note
Machine Learning uses large datasets to identify (infer) patterns and make decisions (aka
predictions). Automated decision making is what makes Machine Learning so appealing.
You can teach a system from a dataset and let the system act by itself to predict the future.
The amount of data (measured in TB or PB) is what makes Spark MLlib especially important
since a human could not possibly extract much value from the dataset in a short time.
Spark handles data distribution and makes the huge data available by means of RDDs,
DataFrames, and recently Datasets.
Use cases for Machine Learning (and hence Spark MLlib that comes with appropriate
algorithms):
Security monitoring and fraud detection
Operational optimizations
Product recommendations or (more broadly) Marketing optimization
Ad serving and optimization
Concepts
This section introduces the concepts of Machine Learning and how they are modeled in
Spark MLlib.
Observation
An observation is used to learn about or evaluate (i.e. draw conclusions about) the
observed item's target value.
Spark models observations as rows in a DataFrame .
Feature
A feature (aka dimension or variable) is an attribute of an observation. It is an independent
variable.
Spark models features as columns in a DataFrame (one per feature or a set of features).
Note
Label
A label is a variable assigned to observations that a machine learning system learns to predict.
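To make the mapping concrete, a tiny, made-up DataFrame in spark-shell with observations as rows, features as columns and a label column:
val observations = Seq(
  (5.1, 3.5, 0.0),  // one observation: two features and its label
  (6.2, 2.9, 1.0)
).toDF("feature1", "feature2", "label")
observations.show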
FP-growth Algorithm
Spark 1.5 significantly improved the frequent pattern mining capabilities with new
algorithms for association rule generation and sequential pattern mining.
Frequent Itemset Mining using the Parallel FP-growth algorithm (since Spark 1.3)
Frequent Pattern Mining in MLlib User Guide
frequent pattern mining
reveals the most frequently visited site in a particular period
finds popular routing paths that generate most traffic in a particular region
models its input as a set of transactions, e.g. a path of nodes.
A transaction is a set of items, e.g. network nodes.
the algorithm looks for common subsets of items that appear across transactions,
e.g. sub-paths of the network that are frequently traversed.
A naive solution: generate all possible itemsets and count their occurrence
A subset is considered a pattern when it appears in some minimum proportion of
all transactions - the support.
the items in a transaction are unordered
analyzing traffic patterns from network logs
the algorithm finds all frequent itemsets without generating and testing all
candidates
suffix trees (FP-trees) constructed and grown from filtered transactions
Also available in Mahout, but slower.
Distributed generation of association rules (since Spark 1.5).
in a retailer's transaction database, a rule {toothbrush, floss} => {toothpaste} with
a confidence value 0.8 would indicate that 80% of customers who buy a
toothbrush and floss also purchase toothpaste in the same transaction. The
retailer could then use this information, put both toothbrush and floss on sale, but
raise the price of toothpaste to increase overall profit.
FPGrowth model
parallel sequential pattern mining (since Spark 1.5)
PrefixSpan algorithm with modifications to parallelize the algorithm for Spark.
extract frequent sequential patterns like routing updates, activation failures, and
broadcasting timeouts that could potentially lead to customer complaints and
proactively reach out to customers when it happens.
Power Iteration Clustering (PIC), a graph algorithm
Among the first MLlib algorithms built upon GraphX.
takes an undirected graph with similarities defined on edges and outputs clustering
assignment on nodes
uses truncated power iteration to find a very low-dimensional embedding of the
nodes, and this embedding leads to effective graph clustering.
stores the normalized similarity matrix as a graph with normalized similarities
defined as edge properties
The edge properties are cached and remain static during the power iterations.
The embedding of nodes is defined as node properties on the same graph
topology.
update the embedding through power iterations, where aggregateMessages is
used to compute matrix-vector multiplications, the essential operation in a power
iteration method
ML Pipelines (spark.ml)
Both scikit-learn and GraphLab have the concept of pipelines built into their
system.
Note
Note
Note
The old RDD-based API has been developed in parallel under the spark.mllib
package. It has been proposed to switch RDD-based MLlib APIs to
maintenance mode in Spark 2.0.
The Pipeline API lives under org.apache.spark.ml package.
A machine learning component is any object that belongs to Pipeline API, e.g.
Pipeline, LinearRegressionModel, etc.
Pipelines
A ML pipeline (or a ML workflow) is a sequence of Transformers and Estimators to fit a
PipelineModel to an input dataset.
pipeline: DataFrame =[fit]=> DataFrame (using transformers and estimators)
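A hedged sketch of building and fitting such a pipeline; the regexTok and hashingTF transformers are the ones defined later in this chapter, and training is an assumed input DataFrame with text and label columns:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
// Transformers transform the dataset, estimators are fit on it, stage by stage
val pipeline = new Pipeline().setStages(Array(regexTok, hashingTF, lr))
val pipelineModel = pipeline.fit(training)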
The Pipeline object can read or load pipelines (refer to Persisting Machine Learning
Components page).
read: MLReader[Pipeline]
load(path: String): Pipeline
You can create a Pipeline with an optional uid identifier. It is of the format
pipeline_[randomUid] when unspecified.
The fit method returns a PipelineModel that holds a collection of Transformer objects
that are results of Estimator.fit method for every Estimator in the Pipeline (with a possibly-modified dataset ) or simply input Transformer objects. The input dataset DataFrame is transformed by each stage in order before being handed over to the next one.
Note
transform method is called for every Transformer calculated but the last one (that is the last stage), since its output is not needed to fit any further stage.
PipelineStage
The PipelineStage abstract class represents a single stage in a Pipeline.
PipelineStage has the following direct implementations (of which few are abstract classes,
too):
Estimators
Models
Pipeline
Predictor
Transformer
Each PipelineStage transforms schema using transformSchema family of methods:
transformSchema(schema: StructType): StructType
transformSchema(schema: StructType, logging: Boolean): StructType
Note
Tip
Transformers
A transformer is a function object that maps (aka transforms) a DataFrame into another
DataFrame (both called datasets).
Transformers prepare a dataset for a machine learning algorithm to work with. They are
also very helpful to transform DataFrames in general (even outside the machine learning
space).
Transformers are instances of org.apache.spark.ml.Transformer abstract class that offers
transform family of methods:
StopWordsRemover
StopWordsRemover is a machine learning feature transformer that takes a string array column
and outputs a string array column with all defined stop words removed. The transformer
comes with a standard set of English stop words as default (that are the same as scikit-learn
uses, i.e. from the Glasgow Information Retrieval Group).
Note
import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover
Note
null values from the input array are preserved unless adding null to
stopWords explicitly.
import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer("regexTok")
.setInputCol("text")
.setPattern("\\W+")
import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = new StopWordsRemover("stopWords")
.setInputCol(regexTok.getOutputCol)
val df = Seq("please find it done (and empty)", "About to be rich!", "empty")
.zipWithIndex
.toDF("text", "id")
scala> stopWords.transform(regexTok.transform(df)).show(false)
+-------------------------------+---+------------------------------------+----------------+
|text |id |regexTok__output |stopWords__o
utput|
+-------------------------------+---+------------------------------------+----------------+
|please find it done (and empty)|0 |[please, find, it, done, and, empty]|[]
|
|About to be rich! |1 |[about, to, be, rich] |[rich]
|
|empty |2 |[empty] |[]
|
+-------------------------------+---+------------------------------------+----------------+
Binarizer
Binarizer is a Transformer that splits the values in the input column into two groups -
"ones" for values larger than the threshold and "zeros" for the others.
It works with DataFrames with the input column of DoubleType or VectorUDT. The type of
the result output column matches the type of the input column, i.e. DoubleType or
VectorUDT .
import org.apache.spark.ml.feature.Binarizer
val bin = new Binarizer()
.setInputCol("rating")
.setOutputCol("label")
.setThreshold(3.5)
scala> println(bin.explainParams)
inputCol: input column name (current: rating)
outputCol: output column name (default: binarizer_dd9710e2a831__output, current: label
)
threshold: threshold used to binarize continuous features (default: 0.0, current: 3.5)
val doubles = Seq((0, 1d), (1, 1d), (2, 5d)).toDF("id", "rating")
scala> bin.transform(doubles).show
+---+------+-----+
| id|rating|label|
+---+------+-----+
| 0| 1.0| 0.0|
| 1| 1.0| 0.0|
| 2| 5.0| 1.0|
+---+------+-----+
import org.apache.spark.mllib.linalg.Vectors
val denseVec = Vectors.dense(Array(4.0, 0.4, 3.7, 1.5))
val vectors = Seq((0, denseVec)).toDF("id", "rating")
scala> bin.transform(vectors).show
+---+-----------------+-----------------+
| id| rating| label|
+---+-----------------+-----------------+
| 0|[4.0,0.4,3.7,1.5]|[1.0,0.0,1.0,0.0]|
+---+-----------------+-----------------+
SQLTransformer
SQLTransformer is a Transformer that does transformations by executing SELECT ... FROM
__THIS__ with __THIS__ being the underlying temporary table registered for the input dataset.
Internally, __THIS__ is replaced with a random name for a temporary table (using
registerTempTable).
Note
It requires that the SELECT query uses __THIS__ that corresponds to a temporary table and
simply executes the mandatory statement using sql method.
You have to specify the mandatory statement parameter using setStatement method.
import org.apache.spark.ml.feature.SQLTransformer
val sql = new SQLTransformer()
// dataset to work with
val df = Seq((0, s"""hello\tworld"""), (1, "two spaces inside")).toDF("label", "sente
nce")
scala> sql.setStatement("SELECT sentence FROM __THIS__ WHERE label = 0").transform(df)
.show
+-----------+
| sentence|
+-----------+
|hello world|
+-----------+
scala> println(sql.explainParams)
statement: SQL statement (current: SELECT sentence FROM __THIS__ WHERE label = 0)
VectorAssembler
VectorAssembler is a feature transformer that assembles (merges) multiple columns into a single vector column.
import org.apache.spark.ml.feature.VectorAssembler
val vecAssembler = new VectorAssembler()
scala> print(vecAssembler.explainParams)
inputCols: input column names (undefined)
outputCol: output column name (default: vecAssembler_5ac31099dbee__output)
final case class Record(id: Int, n1: Int, n2: Double, flag: Boolean)
val ds = Seq(Record(0, 4, 2.0, true)).toDS
scala> ds.printSchema
root
|-- id: integer (nullable = false)
|-- n1: integer (nullable = false)
|-- n2: double (nullable = false)
|-- flag: boolean (nullable = false)
val features = vecAssembler
.setInputCols(Array("n1", "n2", "flag"))
.setOutputCol("features")
.transform(ds)
scala> features.printSchema
root
|-- id: integer (nullable = false)
|-- n1: integer (nullable = false)
|-- n2: double (nullable = false)
|-- flag: boolean (nullable = false)
|-- features: vector (nullable = true)
scala> features.show
+---+---+---+----+-------------+
| id| n1| n2|flag| features|
+---+---+---+----+-------------+
| 0| 4|2.0|true|[4.0,2.0,1.0]|
+---+---+---+----+-------------+
UnaryTransformers
The UnaryTransformer abstract class is a specialized Transformer that applies
transformation to one input column and writes results to another (by appending a new
column).
Each UnaryTransformer defines the input and output columns using the following "chain"
methods (they return the transformer on which they were executed and so are chainable):
setInputCol(value: String)
setOutputCol(value: String)
Note
A UnaryTransformer is a PipelineStage .
When transform is called, it first calls transformSchema (with DEBUG logging enabled) and
then adds the column as a result of calling a protected abstract createTransformFunc .
Note
Internally, transform method uses Spark SQL's udf to define a function (based on
createTransformFunc function described above) that will create the new output column (with
appropriate outputDataType ). The UDF is later applied to the input column of the input
DataFrame and the result becomes the output column (using DataFrame.withColumn
method).
Note
Tokenizer
Tokenizer is a UnaryTransformer that converts the input string to lowercase and then splits
it by white spaces.
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer()
// dataset to transform
val df = Seq((1, "Hello world!"), (2, "Here is yet another sentence.")).toDF("label",
"sentence")
val tokenized = tok.setInputCol("sentence").transform(df)
scala> tokenized.show(false)
+-----+-----------------------------+-----------------------------------+
|label|sentence |tok_b66af4001c8d__output |
+-----+-----------------------------+-----------------------------------+
|1 |Hello world! |[hello, world!] |
|2 |Here is yet another sentence.|[here, is, yet, another, sentence.]|
+-----+-----------------------------+-----------------------------------+
RegexTokenizer
RegexTokenizer is a UnaryTransformer that tokenizes a String into a collection of String .
import org.apache.spark.ml.feature.RegexTokenizer
val regexTok = new RegexTokenizer()
// dataset to transform with tabs and spaces
val df = Seq((0, s"""hello\tworld"""), (1, "two spaces inside")).toDF("label", "sente
nce")
val tokenized = regexTok.setInputCol("sentence").transform(df)
scala> tokenized.show(false)
+-----+------------------+-----------------------------+
|label|sentence |regexTok_810b87af9510__output|
+-----+------------------+-----------------------------+
|0 |hello
Note
It supports minTokenLength parameter that is the minimum token length that you can change
using setMinTokenLength method. It simply filters out smaller tokens and defaults to 1 .
It has gaps parameter that indicates whether regex splits on gaps ( true ) or matches
tokens ( false ). You can set it using setGaps . It defaults to true .
When set to true (i.e. splits on gaps) it uses Regex.split, while Regex.findAllIn is used for false .
scala> rt.setInputCol("line").setGaps(false).transform(df).show
+-----+--------------------+-----------------------------+
|label| line|regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
| 1| hello world| []|
| 2|yet another sentence| [another, sentence]|
+-----+--------------------+-----------------------------+
scala> rt.setInputCol("line").setGaps(false).setPattern("\\W").transform(df).show(false
)
+-----+--------------------+-----------------------------+
|label|line |regexTok_8c74c5e8b83a__output|
+-----+--------------------+-----------------------------+
|1 |hello world |[] |
|2 |yet another sentence|[another, sentence] |
+-----+--------------------+-----------------------------+
It has pattern parameter that is the regex for tokenizing. It uses Scala's .r method to
convert the string to regex. Use setPattern to set it. It defaults to \\s+ .
It has toLowercase parameter that indicates whether to convert all characters to lowercase
before tokenizing. Use setToLowercase to change it. It defaults to true .
NGram
In this example you use org.apache.spark.ml.feature.NGram that converts the input
collection of strings into a collection of n-grams (of n words).
import org.apache.spark.ml.feature.NGram
val bigram = new NGram("bigrams")
val df = Seq((0, Seq("hello", "world"))).toDF("id", "tokens")
bigram.setInputCol("tokens").transform(df).show
+---+--------------+---------------+
| id| tokens|bigrams__output|
+---+--------------+---------------+
| 0|[hello, world]| [hello world]|
+---+--------------+---------------+
HashingTF
Another example of a transformer is org.apache.spark.ml.feature.HashingTF that works on a
Column of ArrayType .
It transforms the rows for the input column into a sparse term frequency vector.
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("features")
.setNumFeatures(5000)
// see above for regexTok transformer
val regexedDF = regexTok.transform(df)
// Use HashingTF
val hashedDF = hashingTF.transform(regexedDF)
scala> hashedDF.show(false)
+---+------------------+---------------------+-----------------------------------+
|id |text |words |features |
+---+------------------+---------------------+-----------------------------------+
|0 |hello
|
|1 |two spaces inside|[two, spaces, inside]|(5000,[276,940,2533],[1.0,1.0,1.0])|
+---+------------------+---------------------+-----------------------------------+
The name of the output column is optional, and if not specified, it becomes the identifier of a
HashingTF object with the __output suffix.
scala> hashingTF.uid
res7: String = hashingTF_fe3554836819
scala> hashingTF.transform(regexDF).show(false)
+---+------------------+---------------------+------------------------------------------+
|id |text |words |hashingTF_fe3554836819__output
|
+---+------------------+---------------------+------------------------------------------+
|0 |hello
|
|1 |two spaces inside|[two, spaces, inside]|(262144,[53244,77869,115276],[1.0,1.0,1.0
])|
+---+------------------+---------------------+------------------------------------------+
OneHotEncoder
OneHotEncoder is a Transformer that maps a numeric input column of label indices onto a column of binary vectors.
// dataset to transform
val df = Seq(
(0, "a"), (1, "b"),
(2, "c"), (3, "a"),
(4, "a"), (5, "c"))
.toDF("label", "category")
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("category").setOutputCol("cat_index").fi
t(df)
val indexed = indexer.transform(df)
import org.apache.spark.sql.types.NumericType
scala> indexed.schema("cat_index").dataType.isInstanceOf[NumericType]
res0: Boolean = true
import org.apache.spark.ml.feature.OneHotEncoder
val oneHot = new OneHotEncoder()
.setInputCol("cat_index")
.setOutputCol("cat_vec")
val oneHotted = oneHot.transform(indexed)
scala> oneHotted.show(false)
+-----+--------+---------+-------------+
|label|category|cat_index|cat_vec |
+-----+--------+---------+-------------+
|0 |a |0.0 |(2,[0],[1.0])|
|1 |b |2.0 |(2,[],[]) |
|2 |c |1.0 |(2,[1],[1.0])|
|3 |a |0.0 |(2,[0],[1.0])|
|4 |a |0.0 |(2,[0],[1.0])|
|5 |c |1.0 |(2,[1],[1.0])|
+-----+--------+---------+-------------+
scala> oneHotted.printSchema
root
|-- label: integer (nullable = false)
|-- category: string (nullable = true)
|-- cat_index: double (nullable = true)
|-- cat_vec: vector (nullable = true)
scala> oneHotted.schema("cat_vec").dataType.isInstanceOf[VectorUDT]
res1: Boolean = true
Custom UnaryTransformer
The following class is a custom UnaryTransformer that transforms words using upper letters.
package pl.japila.spark

import org.apache.spark.ml._
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types._

class UpperTransformer(override val uid: String)
    extends UnaryTransformer[String, String, UpperTransformer] {

  def this() = this(Identifiable.randomUID("upper"))

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType)
  }

  protected def createTransformFunc: String => String = {
    _.toUpperCase
  }

  protected def outputDataType: DataType = StringType
}
Estimators
An estimator is an abstraction of a learning algorithm that fits a model on a dataset.
Note
That was so machine learning to explain an estimator this way, wasn't it? The
more time I spend with the Pipeline API, the more often I use the terms and
phrases from this space. Sorry.
Technically, an Estimator produces a Model (i.e. a Transformer) for a given DataFrame and
parameters (as ParamMap ). It fits a model to the input DataFrame and ParamMap to produce
a Transformer (a Model ) that can calculate predictions for any DataFrame -based input
datasets.
It is basically a function that maps a DataFrame onto a Model through fit method, i.e. it
takes a DataFrame and produces a Transformer as a Model .
estimator: DataFrame =[fit]=> Model
fit(dataset: DataFrame): M
Note
document.
As an example you could use LinearRegression learning algorithm estimator to train a
LinearRegressionModel.
Some of the direct specialized implementations of the Estimator abstract class are as
follows:
StringIndexer
KMeans
TrainValidationSplit
Predictors
StringIndexer
org.apache.spark.ml.feature.StringIndexer is an Estimator that produces
StringIndexerModel .
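A short illustrative run in spark-shell (the dataset is made up):
import org.apache.spark.ml.feature.StringIndexer

val df = Seq((0, "a"), (1, "b"), (2, "a")).toDF("id", "category")
val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category_index")
  .fit(df)
indexerModel.transform(df).show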
KMeans
KMeans class is an implementation of the K-means clustering algorithm in machine learning.
It groups observations into clusters (with centroids), assigning each observation to the cluster
with the nearest mean. The algorithm steps are repeated until convergence or until a specified
number of steps is reached.
Note
type IntegerType .
Internally, fit method "unwraps" the feature vector in featuresCol column in the input
DataFrame and creates an RDD[Vector] . It then hands the call over to the MLlib variant of KMeans.
Refer to Logging.
KMeans Example
You can represent a text corpus (document collection) using the vector space model. In this
representation, the vectors have dimension that is the number of different words in the
corpus. It is quite natural to have vectors with a lot of zero values as not all words will be in a
document. We will use an optimized memory representation to avoid zero values using
sparse vectors.
This example shows how to use k-means to classify emails as a spam or not.
// NOTE Don't copy and paste the final case class with the other lines
// It won't work with paste mode in spark-shell
final case class Email(id: Int, text: String)
val emails = Seq(
"This is an email from your lovely wife. Your mom says...",
"SPAM SPAM spam",
"Hello, We'd like to offer you").zipWithIndex.map(_.swap).toDF("id", "text").as[Email
]
// Prepare data for k-means
// Pass emails through a "pipeline" of transformers
import org.apache.spark.ml.feature._
val tok = new RegexTokenizer()
.setInputCol("text")
.setOutputCol("tokens")
.setPattern("\\W+")
val hashTF = new HashingTF()
.setInputCol("tokens")
.setOutputCol("features")
.setNumFeatures(20)
val preprocess = (tok.transform _).andThen(hashTF.transform)
val features = preprocess(emails.toDF)
scala> features.select('text, 'features).show(false)
+--------------------------------------------------------+-----------------------------------------------------------+
|text |features
|
+--------------------------------------------------------+-----------------------------------------------------------+
|This is an email from your lovely wife. Your mom says...|(20,[0,3,6,8,10,11,17,19],[1
.0,2.0,1.0,1.0,2.0,1.0,2.0,1.0])|
|SPAM SPAM spam |(20,[13],[3.0])
|
|Hello, We'd like to offer you |(20,[0,2,7,10,11,19],[2.0,1.0
,1.0,1.0,1.0,1.0]) |
+--------------------------------------------------------+-----------------------------------------------------------+
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans
scala> val kmModel = kmeans.fit(features.toDF)
16/04/08 15:57:37 WARN KMeans: The input data is not directly cached, which may hurt p
erformance if its parent RDDs are also uncached.
16/04/08 15:57:37 INFO KMeans: Initialization with k-means|| took 0.219 seconds.
16/04/08 15:57:37 INFO KMeans: Run 0 finished in 1 iterations
16/04/08 15:57:37 INFO KMeans: Iterations took 0.030 seconds.
16/04/08 15:57:37 INFO KMeans: KMeans converged in 1 iterations.
16/04/08 15:57:37 INFO KMeans: The cost for the best run is 5.000000000000002.
16/04/08 15:57:37 WARN KMeans: The input data was not directly cached, which may hurt
performance if its parent RDDs are also uncached.
kmModel: org.apache.spark.ml.clustering.KMeansModel = kmeans_7a13a617ce0b
scala> kmModel.clusterCenters.map(_.toSparse)
res36: Array[org.apache.spark.mllib.linalg.SparseVector] = Array((20,[13],[3.0]), (20,[
0,2,3,6,7,8,10,11,17,19],[1.5,0.5,1.0,0.5,0.5,0.5,1.5,1.0,1.0,1.0]))
val email = Seq("hello mom").toDF("text")
val result = kmModel.transform(preprocess(email))
scala> result.show(false)
+---------+------------+---------------------+----------+
|text |tokens |features |prediction|
+---------+------------+---------------------+----------+
|hello mom|[hello, mom]|(20,[2,19],[1.0,1.0])|1 |
+---------+------------+---------------------+----------+
TrainValidationSplit
Caution
FIXME
Predictors
train(dataset: DataFrame): M
The train method is supposed to ease dealing with schema validation and copying
parameters to a trained PredictionModel model. It also sets the parent of the model to itself.
A Predictor is basically a function that maps a DataFrame onto a PredictionModel .
predictor: DataFrame =[train]=> PredictionModel
It implements the abstract fit(dataset: DataFrame) of the Estimator abstract class that
validates and transforms the schema of a dataset (using a custom transformSchema of
PipelineStage), and then calls the abstract train method.
Validation and transformation of a schema (using transformSchema ) makes sure that:
1.
2.
DecisionTreeClassifier
DecisionTreeClassifier is a ProbabilisticClassifier that
Caution
FIXME
LinearRegression
LinearRegression is an example of a Predictor (indirectly through the specialized Regressor
private abstract class), and hence an Estimator , that represents the linear regression
algorithm in Machine Learning.
LinearRegression belongs to org.apache.spark.ml.regression package.
Tip
LinearRegression.train
train(dataset: DataFrame): LinearRegressionModel
columns:
1.
2.
It returns LinearRegressionModel .
It first counts the number of elements in the features column (usually called features ). The column
has to be of mllib.linalg.Vector type (and can easily be prepared using HashingTF
transformer).
val spam = Seq(
(0, "Hi Jacek. Wanna more SPAM? Best!"),
RandomForestRegressor
RandomForestRegressor is a concrete Predictor for the Random Forest learning algorithm.
Caution
FIXME
import org.apache.spark.mllib.linalg.Vectors
val features = Vectors.sparse(10, Seq((2, 0.2), (4, 0.4)))
val data = (0.0 to 4.0 by 1).map(d => (d, features)).toDF("label", "features")
// data.as[LabeledPoint]
scala> data.show(false)
+-----+--------------------------+
|label|features |
+-----+--------------------------+
|0.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|1.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|2.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|3.0 |(10,[2,4,6],[0.2,0.4,0.6])|
|4.0 |(10,[2,4,6],[0.2,0.4,0.6])|
+-----+--------------------------+
import org.apache.spark.ml.regression.{ RandomForestRegressor, RandomForestRegressionM
odel }
val rfr = new RandomForestRegressor
val model: RandomForestRegressionModel = rfr.fit(data)
scala> model.trees.foreach(println)
DecisionTreeRegressionModel (uid=dtr_247e77e2f8e0) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_61f8eacb2b61) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_63fc5bde051c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_64d4e42de85f) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_693626422894) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_927f8a0bc35e) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_82da39f6e4e1) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_cb94c2e75bd1) of depth 0 with 1 nodes
DecisionTreeRegressionModel (uid=dtr_29e3362adfb2) of depth 1 with 3 nodes
DecisionTreeRegressionModel (uid=dtr_d6d896abcc75) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_aacb22a9143d) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_18d07dadb5b9) of depth 2 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_f0615c28637c) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4619362d02fc) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d39502f828f4) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_896f3a4272ad) of depth 3 with 9 nodes
DecisionTreeRegressionModel (uid=dtr_891323c29838) of depth 3 with 7 nodes
DecisionTreeRegressionModel (uid=dtr_d658fe871e99) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_d91227b13d41) of depth 2 with 5 nodes
DecisionTreeRegressionModel (uid=dtr_4a7976921f4b) of depth 2 with 5 nodes
scala> model.treeWeights
res12: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0
, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
scala> model.featureImportances
res13: org.apache.spark.mllib.linalg.Vector = (1,[0],[1.0])
Example
The following example uses LinearRegression estimator.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val data = (0.0 to 9.0 by 1)  // create a collection of Doubles
  .map(n => (n, n))           // make it pairs
  .map { case (label, features) =>
    LabeledPoint(label, Vectors.dense(features)) } // create labeled points of dense vectors
  .toDF                       // make it a DataFrame
scala> data.show
+-----+--------+
|label|features|
+-----+--------+
| 0.0| [0.0]|
| 1.0| [1.0]|
| 2.0| [2.0]|
| 3.0| [3.0]|
| 4.0| [4.0]|
| 5.0| [5.0]|
| 6.0| [6.0]|
| 7.0| [7.0]|
| 8.0| [8.0]|
| 9.0| [9.0]|
+-----+--------+
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression
val model = lr.fit(data)
scala> model.intercept
res1: Double = 0.0
scala> model.coefficients
res2: org.apache.spark.mllib.linalg.Vector = [1.0]
// make predictions
scala> val predictions = model.transform(data)
predictions: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 1 more field]
scala> predictions.show
+-----+--------+----------+
|label|features|prediction|
+-----+--------+----------+
| 0.0| [0.0]| 0.0|
| 1.0| [1.0]| 1.0|
| 2.0| [2.0]| 2.0|
Models
Model abstract class is a Transformer with the optional Estimator that has produced it.
Note
An Estimator is optional and is available only after fit (of an Estimator) has been executed, the result of which is the model.
There are two direct implementations of the Model class that are not directly related to a
concrete ML algorithm:
PipelineModel
PredictionModel
PipelineModel
Caution
Once fit, you can use the result model as any other model to transform datasets (as DataFrames).
// Transformer #1
import org.apache.spark.ml.feature.Tokenizer
val tok = new Tokenizer().setInputCol("text")
// Transformer #2
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF().setInputCol(tok.getOutputCol).setOutputCol("features")
// Fuse the Transformers in a Pipeline
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tok, hashingTF))
val dataset = Seq((0, "hello world")).toDF("id", "text")
// Since there's no fitting, any dataset works fine
val featurize = pipeline.fit(dataset)
// Use the pipelineModel as a series of Transformers
scala> featurize.transform(dataset).show(false)
+---+-----------+------------------------+--------------------------------+
|id |text |tok_8aec9bfad04a__output|features |
+---+-----------+------------------------+--------------------------------+
|0 |hello world|[hello, world] |(262144,[71890,72594],[1.0,1.0])|
+---+-----------+------------------------+--------------------------------+
PredictionModel
PredictionModel is an abstract class to represent a model for prediction algorithms like
regression and classification (that have their own specialized models - details coming up
below).
PredictionModel is basically a Transformer with a predict method to calculate predictions.
import org.apache.spark.ml.PredictionModel
The contract of PredictionModel class requires that every custom implementation defines
predict method (with FeaturesType type being the type of features ).
ClassificationModel
RandomForestRegressionModel
As a custom Transformer it comes with its own custom transform method.
Internally, transform first ensures that the type of the features column matches the type
of the model and adds the prediction column of type Double to the schema of the result
DataFrame .
It then creates the result DataFrame and adds the prediction column with a predictUDF
function applied to the values of the features column.
Caution
FIXME A diagram to show the transformation from a dataframe (on the left) and another (on the right) with an arrow to represent the transformation method.
Refer to Logging.
ClassificationModel
ClassificationModel is a PredictionModel that transforms a DataFrame with mandatory
features , label , and rawPrediction (of type Vector) columns to a DataFrame with
prediction column added.
Note
ClassificationModel comes with its own transform (as Transformer) and predict (as
PredictionModel).
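As a quick sketch (again assuming spark-shell with the implicits in scope) you can see the columns a ClassificationModel adds, using LogisticRegression that is listed below:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.0)),
  (1.0, Vectors.dense(1.0, 0.0))).toDF("label", "features")
// LogisticRegressionModel is a ProbabilisticClassificationModel (and so a ClassificationModel)
val lrModel = new LogisticRegression().fit(training)
lrModel.transform(training)
  .select("label", "rawPrediction", "probability", "prediction")
  .show(false)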
The following is a list of the known ClassificationModel custom implementations (as of March 24th):
ProbabilisticClassificationModel (the abstract parent of the following classification
models)
DecisionTreeClassificationModel ( final )
LogisticRegressionModel
NaiveBayesModel
RandomForestClassificationModel ( final )
RegressionModel
RegressionModel is a PredictionModel that transforms a DataFrame with mandatory label ,
features , and prediction columns.
It has no methods or values of its own and so is more of a marker abstract class (to combine different features of regression models under one type).
LinearRegressionModel
LinearRegressionModel represents a model produced by a LinearRegression estimator. It
Note
The coefficients Vector and intercept Double are the integral part of
LinearRegressionModel as the required input parameters of the constructor.
LinearRegressionModel Example
RandomForestRegressionModel
RandomForestRegressionModel is a PredictionModel with features column of type Vector.
KMeansModel
KMeansModel is a Model of KMeans algorithm.
// See spark-mllib-estimators.adoc#KMeans
val kmeans: KMeans = ???
val trainingDF: DataFrame = ???
val kmModel = kmeans.fit(trainingDF)
// Know the cluster centers
scala> kmModel.clusterCenters
res0: Array[org.apache.spark.mllib.linalg.Vector] = Array([0.1,0.3], [0.1,0.1])
val inputDF = Seq((0.0, Vectors.dense(0.2, 0.4))).toDF("label", "features")
scala> kmModel.transform(inputDF).show(false)
+-----+---------+----------+
|label|features |prediction|
+-----+---------+----------+
|0.0 |[0.2,0.4]|0 |
+-----+---------+----------+
Evaluators
An evaluator is a transformation that maps a DataFrame to a metric indicating how good a model is.
evaluator: DataFrame =[evaluate]=> Double
BinaryClassificationEvaluator
BinaryClassificationEvaluator is a concrete Evaluator for binary classification that
RegressionEvaluator
RegressionEvaluator is a concrete Evaluator for regression that expects datasets (of
DataFrame type) with the following two columns:
prediction of float or double values
label of float or double values
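A short sketch of how RegressionEvaluator maps a DataFrame to a single metric (reusing the predictions DataFrame from the LinearRegression example earlier in this chapter):
import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator().setMetricName("rmse")  // also: mse, r2, mae
val rmse = regEval.evaluate(predictions)  // predictions has label and prediction columns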
CrossValidator
Caution
Note
What makes CrossValidator a very useful tool for model selection is its ability to work with any Estimator instance, Pipelines included, that can preprocess datasets before passing them on. This gives you a way to work with any dataset and preprocess it before a new (possibly better) model could be fit to it.
import org.apache.spark.ml.tuning.CrossValidator
val cv = new CrossValidator
scala> println(cv.explainParams)
estimator: estimator for selection (undefined)
estimatorParamMaps: param maps for the estimator (undefined)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (undefined)
numFolds: number of folds for cross validation (>= 2) (default: 3)
seed: random seed (default: -1191137437)
MLWriter
MLWriter abstract class comes with save(path: String) method to save a machine learning component to a given path.
It comes with another (chainable) method overwrite to overwrite the output path if it
already exists.
overwrite(): this.type
The component is saved into a JSON file (see MLWriter Example section below).
Tip
Enable INFO logging level for the MLWriter implementation logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.ml.Pipeline$.PipelineWriter=INFO
Refer to Logging.
Caution
FIXME The logging doesn't work and overwriting does not print out an INFO message to the logs :(
MLWriter Example
import org.apache.spark.ml._
val pipeline = new Pipeline().setStages(Array())
pipeline.write.overwrite().save("hello-pipeline")
$ cat hello-pipeline/metadata/part-00000 | jq
{
"class": "org.apache.spark.ml.Pipeline",
"timestamp": 1457685293319,
"sparkVersion": "2.0.0-SNAPSHOT",
"uid": "pipeline_12424a3716b2",
"paramMap": {
"stageUids": []
}
}
MLReader
MLReader abstract class comes with load(path: String) method to load a machine learning component from a given path.
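A quick sketch that mirrors the MLWriter example above (and assumes the hello-pipeline directory created there exists):
import org.apache.spark.ml.Pipeline
val restored = Pipeline.read.load("hello-pipeline")
// or the shortcut
val samePipeline = Pipeline.load("hello-pipeline")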
Example: Text Classification
Note
The example was inspired by the video Building, Debugging, and Tuning Spark
Machine Learning Pipelines - Joseph Bradley (Databricks).
The example uses a case class LabeledText to have the schema described
nicely.
import spark.implicits._
sealed trait Category
case object Scientific extends Category
case object NonScientific extends Category
// FIXME: Define schema for Category
case class LabeledText(id: Long, category: Category, text: String)
val data = Seq(LabeledText(0, Scientific, "hello world"), LabeledText(1, NonScientific, "witaj swiecie")).toDF
scala> data.show
+-----+-------------+
|label| text|
+-----+-------------+
| 0| hello world|
| 1|witaj swiecie|
+-----+-------------+
It is then tokenized and transformed into another DataFrame with an additional column
called features that is a Vector of numerical values.
Note
Paste the code below into Spark Shell using :paste mode.
import spark.implicits._
case class Article(id: Long, topic: String, text: String)
val articles = Seq(Article(0, "sci.math", "Hello, Math!"),
Article(1, "alt.religion", "Hello, Religion!"),
Article(2, "sci.physics", "Hello, Physics!")).toDF
val papers = articles.as[Article]
Now comes the tokenization part that maps the input text of each document into tokens (a Seq[String] ) and then into a Vector of numerical values that only then can be understood by a machine learning algorithm (which operates on Vector instances).
scala> papers.show
+---+------------+----------------+
| id| topic| text|
+---+------------+----------------+
| 0| sci.math| Hello, Math!|
| 1|alt.religion|Hello, Religion!|
| 2| sci.physics| Hello, Physics!|
+---+------------+----------------+
// FIXME Use Dataset API (not DataFrame API)
val labelled = papers.toDF.withColumn("label", $"topic".like("sci%")).cache
val topic2Label: Boolean => Double = isSci => if (isSci) 1 else 0
val toLabel = udf(topic2Label)
val training = papers.toDF.withColumn("label", toLabel($"topic".like("sci%"))).cache
scala> training.show
+---+------------+----------------+-----+
| id| topic| text|label|
+---+------------+----------------+-----+
| 0| sci.math| Hello, Math!| 1.0|
| 1|alt.religion|Hello, Religion!| 0.0|
| 2| sci.physics| Hello, Physics!| 1.0|
+---+------------+----------------+-----+
scala> training.groupBy("label").count.show
+-----+-----+
|label|count|
+-----+-----+
| 0.0| 1|
| 1.0| 2|
+-----+-----+
The "train a model" phase uses the logistic regression machine learning algorithm to build a model and predict the label for future input text documents (and hence classify them as scientific or non-scientific).
import org.apache.spark.ml.feature.RegexTokenizer
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
import org.apache.spark.ml.feature.HashingTF
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features").setNumFeatures(5000)
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(20).setRegParam(0.01)
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
It uses two columns, namely label and the features vector, to build a logistic regression model to make predictions.
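The evaluator referenced in the cross-validation snippet below is not shown in this excerpt; a minimal definition could be a BinaryClassificationEvaluator (an assumption that fits the binary label used here):
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")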
import org.apache.spark.ml.tuning.ParamGridBuilder
val paramGrid = new ParamGridBuilder()
.addGrid(hashingTF.numFeatures, Array(1000, 10000))
.addGrid(lr.regParam, Array(0.05, 0.2))
.build
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.param._
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(2)
val cvModel = cv.fit(training)
Caution
FIXME Review
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/tuning
You can eventually save the model for later use (using DataFrame.write ).
cvModel.transform(test).select("id", "prediction")
.write
.json("/demo/predictions")
Example: Linear Regression
The DataFrame used for Linear Regression has to have features column of
org.apache.spark.mllib.linalg.VectorUDT type.
Note
You can change the name of the column using featuresCol parameter.
Caution
import org.apache.spark.ml.Pipeline
val pipeline = new Pipeline("my_pipeline")
import org.apache.spark.ml.regression._
val lr = new LinearRegression
val df = sc.parallelize(0 to 9).toDF("num")
val stages = Array(lr)
val model = pipeline.setStages(stages).fit(df)
// the above lines give:
java.lang.IllegalArgumentException: requirement failed: Column features must be of type
org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually IntegerType.
at scala.Predef$.require(Predef.scala:219)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.sc
ala:51)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:72)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:117)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:182)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:182)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:66)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:133)
... 51 elided
The information here is based almost exclusively on the blog post Topic modeling with LDA: MLlib meets GraphX.
Topic modeling is a type of model that can be very useful in identifying hidden thematic
structure in documents. Broadly speaking, it aims to find structure within an unstructured
collection of documents. Once the structure is "discovered", you may answer questions like:
What is document X about?
How similar are documents X and Y?
If I am interested in topic Z, which documents should I read first?
Spark MLlib offers out-of-the-box support for Latent Dirichlet Allocation (LDA) which is the
first MLlib algorithm built upon GraphX.
Topic models automatically infer the topics discussed in a collection of documents.
Vector
Vector sealed trait represents a numeric vector of values (of Double type) and their indices (of Int type).
Note
It is not the Vector type in Scala or Java. Train your eyes to see two types of the same name. You've been warned.
A Vector object knows its size .
A Vector object can be converted to:
Array[Double] using toArray .
Tip
import org.apache.spark.mllib.linalg.Vectors
// You can create dense vectors explicitly by giving values per index
val denseVec = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5))
val almostAllZeros = Vectors.dense(Array(0.0, 0.4, 0.3, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0))
// You can however create a sparse vector by the size and non-zero elements
val sparse = Vectors.sparse(10, Seq((1, 0.4), (2, 0.3), (3, 1.5)))
// Convert a sparse vector to a dense one
val fromSparse = sparse.toDense
scala> almostAllZeros == fromSparse
res0: Boolean = true
Note
import org.apache.spark.mllib.linalg._
// prepare elements for a sparse vector
// NOTE: It is more Scala rather than Spark
val indices = 0 to 4
val elements = indices.zip(Stream.continually(1.0))
val sv = Vectors.sparse(elements.size, elements)
// Notice how Vector is printed out
scala> sv
res4: org.apache.spark.mllib.linalg.Vector = (5,[0,1,2,3,4],[1.0,1.0,1.0,1.0,1.0])
scala> sv.size
res0: Int = 5
scala> sv.toArray
res1: Array[Double] = Array(1.0, 1.0, 1.0, 1.0, 1.0)
scala> sv == sv.copy
res2: Boolean = true
scala> sv.toJson
res3: String = {"type":0,"size":5,"indices":[0,1,2,3,4],"values":[1.0,1.0,1.0,1.0,1.0]}
LabeledPoint
Caution
FIXME
LabeledPoint is a convenient class for declaring a schema for DataFrames that are used as
Streaming MLlib
The following Machine Learning algorithms have their streaming variants in MLlib:
k-means
Linear Regression
Logistic Regression
They can train models and predict on streaming data.
Note
Streaming k-means
org.apache.spark.mllib.clustering.StreamingKMeans
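A minimal sketch of the streaming k-means API ( trainingStream and testStream are assumed DStream[Vector] s created elsewhere, e.g. from a socket or a queue):
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream
def clusterOnStreams(trainingStream: DStream[Vector], testStream: DStream[Vector]) = {
  val model = new StreamingKMeans()
    .setK(2)                    // number of clusters
    .setDecayFactor(1.0)        // how quickly older batches are forgotten
    .setRandomCenters(3, 0.0)   // dimension of the vectors and initial weight
  model.trainOn(trainingStream) // update the model on every batch
  model.predictOn(testStream)   // DStream[Int] with the cluster per vector
}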
Sources
Streaming Machine Learning in Spark- Jeremy Freeman (HHMI Janelia Research
Center)
import org.apache.spark.graphx._
Graph
Graph abstract class represents a collection of vertices and edges .
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertices: RDD[(VertexId, String)] =
sc.parallelize(Seq(
(0L, "Jacek"),
(1L, "Agata"),
(2L, "Julian")))
val edges: RDD[Edge[String]] =
sc.parallelize(Seq(
Edge(0L, 1L, "wife"),
Edge(1L, 2L, "owner")
))
scala> val graph = Graph(vertices, edges)
graph: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.Gra
phImpl@5973e4ec
Transformations
mapVertices
mapEdges
mapTriplets
reverse
subgraph
mask
groupEdges
Joins
outerJoinVertices
Computation
aggregateMessages
Note
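For instance, aggregateMessages (listed above) computes per-vertex aggregates; a small sketch that counts the in-degree of every vertex of the graph created earlier:
import org.apache.spark.graphx._
val inDegrees: VertexRDD[Int] =
  graph.aggregateMessages[Int](
    triplet => triplet.sendToDst(1),  // send 1 to every edge's destination vertex
    _ + _)                            // sum the messages per vertex
inDegrees.collect.foreach(println)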
GraphImpl
GraphImpl is the default implementation of Graph abstract class.
Graph Algorithms
GraphX comes with a set of built-in graph algorithms.
PageRank
Triangle Count
Connected Components
Identifies independent disconnected subgraphs.
Collaborative Filtering
What kinds of people like what kinds of products.
FIXME
HistoryServer
HistoryServer is a web interface for completed and running (aka incomplete) Spark
applications.
You can start a HistoryServer instance by executing the $SPARK_HOME/sbin/start-history-server.sh script. See Starting HistoryServer.
Tip
Enable INFO logging level for org.apache.spark.deploy.history.HistoryServer logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.deploy.history.HistoryServer=INFO
Refer to Logging.
Starting HistoryServer
You can start a HistoryServer instance by executing the $SPARK_HOME/sbin/start-history-server.sh script.
$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to .../spark/logs/spar
k-jacek-org.apache.spark.deploy.history.HistoryServer-1-japila.out
Note
When started, it prints out the following INFO message to the logs:
INFO HistoryServer: Started daemon with process name: [processName]
It registers signal handlers (using SignalUtils ) for TERM , HUP , INT to log their execution:
ERROR HistoryServer: RECEIVED SIGNAL [signal]
It creates a SecurityManager .
It creates an ApplicationHistoryProvider (by reading spark.history.provider).
It reads spark.history.ui.port.
It creates a HistoryServer and requests to bind.
It registers a shutdown hook to call stop on the HistoryServer instance.
FIXME
Settings
spark.history.provider (default: FsHistoryProvider) is a fully-qualified class name for an ApplicationHistoryProvider implementation.
SQLHistoryListener
SQLHistoryListener is a custom SQLListener for History Server. It attaches the SQL tab to History Server's web UI only when the first SparkListenerSQLExecutionStart arrives and shuts onExecutorMetricsUpdate off. It also handles ends of tasks in a slightly different way.
Note
Support for SQL UI in History Server was added in SPARK-11206 Support SQL
UI on the history server.
Caution
onOtherEvent
onOtherEvent(event: SparkListenerEvent): Unit
onTaskEnd
Caution
FIXME
(which is SparkHistoryListenerFactory ).
The SQLHistoryListenerFactory class is registered, when SparkUI.createHistoryUI is executed, as a Java service in META-INF/services/org.apache.spark.scheduler.SparkHistoryListenerFactory :
org.apache.spark.sql.execution.ui.SQLHistoryListenerFactory
Note
onExecutorMetricsUpdate
onExecutorMetricsUpdate does nothing.
FsHistoryProvider
FsHistoryProvider is the default application history provider for HistoryServer.
Tip
Enable DEBUG logging level for org.apache.spark.deploy.history.FsHistoryProvider logger to see what happens inside.
Add the following line to conf/log4j.properties :
log4j.logger.org.apache.spark.deploy.history.FsHistoryProvider=DEBUG
Refer to Logging.
ApplicationHistoryProvider
ApplicationHistoryProvider tracks the history of Spark applications with their Spark UIs. It
ApplicationHistoryProvider Contract
Every ApplicationHistoryProvider offers the following:
getListing to return a list of all known applications.
getListing(): Iterable[ApplicationHistoryInfo]
stop(): Unit
Logging
Spark uses log4j for logging.
Logging Levels
The valid logging levels are log4j's Levels (from most specific to least):
OFF (most specific, no logging)
FATAL (most specific, little data)
ERROR
WARN
INFO
DEBUG
TRACE (least specific, a lot of data)
ALL (least specific, all data)
conf/log4j.properties
You can set up the default logging for Spark shell in conf/log4j.properties . Use
conf/log4j.properties.template as a starting point.
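You can also adjust the logging level at runtime from spark-shell or a Spark application (a quick sketch that affects the root logger):
sc.setLogLevel("WARN")  // one of ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN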
sbt
When running a Spark application from within sbt using run task, you can use the following
build.sbt to configure logging levels:
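The build.sbt snippet itself is not included in this excerpt; a plausible sketch (sbt 0.13 syntax, assuming you fork the JVM for run ) looks as follows:
fork in run := true
javaOptions in run ++= Seq(
  "-Dlog4j.debug=true",                      // print log4j's own initialization messages
  "-Dlog4j.configuration=log4j.properties")  // pick up log4j.properties from the classpath
outputStrategy := Some(StdoutOutput)         // show the forked JVM's output in sbt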
With the above configuration, the log4j.properties file should be on the CLASSPATH, which can be in the src/main/resources directory (that is included in CLASSPATH by default).
When run starts, you should see the following output in sbt:
[spark-activator]> run
[info] Running StreamingApp
log4j: Trying to find [log4j.properties] using context classloader sun.misc.Launcher$A
ppClassLoader@1b6d3586.
log4j: Using URL [file:/Users/jacek/dev/oss/spark-activator/target/scala-2.11/classes/
log4j.properties] for automatic log4j configuration.
log4j: Reading configuration from URL file:/Users/jacek/dev/oss/spark-activator/target
/scala-2.11/classes/log4j.properties
Disabling Logging
Use the following conf/log4j.properties to disable logging completely:
log4j.logger.org=OFF
Performance Tuning
Goal: Improve Spark's performance where feasible.
From Investigating Spark's performance:
measure performance bottlenecks using new metrics, including block-time analysis
a live demo of a new performance analysis tool
CPU not I/O (network) is often a critical bottleneck
community dogma = network and disk I/O are major bottlenecks
a TPC-DS workload, of two sizes: a 20 machine cluster with 850GB of data, and a 60
machine cluster with 2.5TB of data.
network is almost irrelevant for performance of these workloads
network optimization could only reduce job completion time by, at most, 2%
10Gbps networking hardware is likely not necessary
serialized compressed data
From Making Sense of Spark Performance - Kay Ousterhout (UC Berkeley) at Spark
Summit 2015:
reduceByKey is better
Metrics System
Spark uses Metrics - a Java library to measure the behaviour of the components.
org.apache.spark.metrics.source.Source is the top-level class for the metric registries in
ExecutorSource
JvmSource
MesosClusterSchedulerSource
StreamingSource
Review MetricsServlet
Review org.apache.spark.metrics package, esp. MetricsSystem class.
Default properties
"*.sink.servlet.class", "org.apache.spark.metrics.sink.MetricsServlet"
"*.sink.servlet.path", "/metrics/json"
"master.sink.servlet.path", "/metrics/master/json"
"applications.sink.servlet.path", "/metrics/applications/json"
spark.metrics.conf (default: metrics.properties on CLASSPATH )
spark.metrics.conf. prefix in SparkConf
Executors
A non-local executor registers executor source.
FIXME See Executor class.
Master
$ http http://192.168.1.4:8080/metrics/master/json/path
HTTP/1.1 200 OK
Cache-Control: no-cache, no-store, must-revalidate
Content-Length: 207
Content-Type: text/json;charset=UTF-8
Server: Jetty(8.y.z-SNAPSHOT)
X-Frame-Options: SAMEORIGIN
{
"counters": {},
"gauges": {
"master.aliveWorkers": {
"value": 0
},
"master.apps": {
"value": 0
},
"master.waitingApps": {
"value": 0
},
"master.workers": {
"value": 0
}
},
"histograms": {},
"meters": {},
"timers": {},
"version": "3.0.0"
}
Spark Listeners
SparkListener is a developer API for custom Spark listeners. It is an abstract class that is a
Tip
SparkListenerEvents
Caution
FIXME Give a less code-centric description of the times for the events.
SparkListenerApplicationStart
SparkListenerApplicationStart(
appName: String,
appId: Option[String],
time: Long,
sparkUser: String,
appAttemptId: Option[String],
driverLogs: Option[Map[String, String]] = None)
SparkListenerJobStart
SparkListenerJobStart(
jobId: Int,
time: Long,
stageInfos: Seq[StageInfo],
properties: Properties = null)
SparkListenerStageSubmitted
SparkListenerTaskStart
SparkListenerTaskStart(stageId: Int, stageAttemptId: Int, taskInfo: TaskInfo)
SparkListenerTaskGettingResult
SparkListenerTaskGettingResult(taskInfo: TaskInfo)
SparkListenerTaskEnd
SparkListenerTaskEnd(
stageId: Int,
stageAttemptId: Int,
taskType: String,
reason: TaskEndReason,
taskInfo: TaskInfo,
// may be null if the task has failed
@Nullable taskMetrics: TaskMetrics)
SparkListenerStageCompleted
SparkListenerStageCompleted(stageInfo: StageInfo)
SparkListenerJobEnd
SparkListenerJobEnd(
jobId: Int,
time: Long,
jobResult: JobResult)
SparkListenerApplicationEnd
SparkListenerApplicationEnd(time: Long)
SparkListenerEnvironmentUpdate
SparkListenerEnvironmentUpdate(environmentDetails: Map[String, Seq[(String, String)]])
SparkListenerBlockManagerAdded
SparkListenerBlockManagerAdded(
time: Long,
blockManagerId: BlockManagerId,
maxMem: Long)
SparkListenerBlockManagerRemoved
SparkListenerBlockManagerRemoved(
time: Long,
blockManagerId: BlockManagerId)
SparkListenerBlockUpdated
SparkListenerBlockUpdated(blockUpdatedInfo: BlockUpdatedInfo)
SparkListenerUnpersistRDD
SparkListenerUnpersistRDD(rddId: Int)
SparkListenerExecutorAdded
SparkListenerExecutorAdded(
time: Long,
executorId: String,
executorInfo: ExecutorInfo)
SparkListenerExecutorRemoved
SparkListenerExecutorRemoved(
time: Long,
executorId: String,
reason: String)
Known Implementations
The following is the complete list of all known Spark listeners:
EventLoggingListener
ExecutorsListener that prepares information to be displayed on the Executors tab in
web UI.
SparkFirehoseListener that allows users to receive all SparkListenerEvent events by
HeartbeatReceiver
web UI and EventLoggingListener listeners
Caution
SparkListenerInterface
SparkListenerInterface is an internal interface for listeners of events from the Spark
scheduler.
LiveListenerBus
LiveListenerBus asynchronously passes listener events to registered Spark listeners.
Note
FIXME
Internally, it saves the input SparkContext for later use and starts listenerThread. It makes
sure that it only happens when LiveListenerBus has not been started before (i.e. started
is disabled).
If however LiveListenerBus has already been started, an IllegalStateException is thrown:
[name] already started!
post puts the input event onto the internal eventQueue queue and releases the internal
eventLock semaphore. If the event placement was not successful (and it could happen
If LiveListenerBus has been stopped, the following ERROR appears in the logs:
ERROR [name] has already stopped! Dropping event [event]
onDropEvent is called when no further events can be added to the internal eventQueue
Note
Stopping LiveListenerBus
stop(): Unit
stop releases the internal eventLock semaphore and waits until listenerThread dies. It can
only happen after all events were posted (and polling eventQueue gives nothing).
It checks that the started flag is enabled (i.e. true ) and throws an IllegalStateException otherwise.
Attempted to stop [name] that has not yet started!
Pulling events from the event queue happens only after the listener bus was started, and only one event is processed at a time.
Caution
Settings
spark.extraListeners
spark.extraListeners (default: empty) is a comma-separated list of listener class names
SparkListenerBus
SparkListenerBus is a ListenerBus that manages SparkListenerInterface listeners that
SparkListenerEvent => SparkListenerInterface's Method
SparkListenerStageSubmitted => onStageSubmitted
SparkListenerStageCompleted => onStageCompleted
SparkListenerJobStart => onJobStart
SparkListenerJobEnd => onJobEnd
SparkListenerTaskStart => onTaskStart
SparkListenerTaskGettingResult => onTaskGettingResult
SparkListenerTaskEnd => onTaskEnd
SparkListenerEnvironmentUpdate => onEnvironmentUpdate
SparkListenerBlockManagerAdded => onBlockManagerAdded
SparkListenerBlockManagerRemoved => onBlockManagerRemoved
SparkListenerUnpersistRDD => onUnpersistRDD
SparkListenerApplicationStart => onApplicationStart
SparkListenerApplicationEnd => onApplicationEnd
SparkListenerExecutorMetricsUpdate => onExecutorMetricsUpdate
SparkListenerExecutorAdded => onExecutorAdded
SparkListenerExecutorRemoved => onExecutorRemoved
SparkListenerBlockUpdated => onBlockUpdated
SparkListenerLogStart => event ignored
other events => onOtherEvent
Note
ListenerBus
ListenerBus is an event bus that posts events (of type E ) to all registered listeners (of type L ).
It manages listeners of type L , i.e. it can add to and remove listeners from an internal
listeners collection.
It can post events of type E to all registered listeners (using postToAll method). It simply
iterates over the internal listeners collection and executes the abstract doPostEvent
method.
doPostEvent(listener: L, event: E): Unit
Note
In case of an exception while posting an event to a listener, you should see the following ERROR message in the logs, together with the exception.
ERROR Listener [listener] threw an exception
Note
Tip
log4j.logger.org.apache.spark.util.ListenerBus=ERROR
Refer to Logging.
ReplayListenerBus
ReplayListenerBus is a custom SparkListenerBus that can replay JSON-encoded
SparkListenerEvent events from a stream and post them to listeners.
Note
It is a private[spark] class in the org.apache.spark.scheduler package.
replay reads JSON-encoded SparkListenerEvent events from logData (one event per line).
replay uses jackson from the json4s library to parse JSON into an AST.
When there is an exception parsing a JSON event, you may see the following WARN
message in the logs (for the last line) or a JsonParseException .
WARN Got JsonParseException from log file $sourceName at line [lineNumber], the file m
ight not have finished writing cleanly.
Any other non-IO exceptions end up with the following ERROR messages in the logs:
ERROR Exception parsing Spark event log: [sourceName]
ERROR Malformed line #[lineNumber]: [currentLine]
Note
EventLoggingListener: Event Logging
When enabled it writes events to a log file under spark.eventLog.dir directory. All Spark
events are logged.
Note
You can use History Server to view the logs using a web interface.
It is a private[spark] class in org.apache.spark.scheduler package.
Enable INFO logging level for org.apache.spark.scheduler.EventLoggingListener
logger to see what happens inside EventLoggingListener .
Add the following line to conf/log4j.properties :
Tip
log4j.logger.org.apache.spark.scheduler.EventLoggingListener=INFO
Refer to Logging.
The log file's working name is created based on appId , with or without the compression codec used, and appAttemptId , e.g. local-1461696754069 . It also uses the .inprogress extension.
If overwrite is enabled, you should see the WARN message:
WARN EventLoggingListener: Event log [path] already exists. Overwriting...
Spark attempts to delete the working .inprogress log. If it cannot be deleted, the following WARN message is printed out to the logs:
WARN EventLoggingListener: Error deleting [path]
The buffered output stream is created with metadata: Spark's version and the SparkListenerLogStart class' name as the first line.
{"Event":"SparkListenerLogStart","Spark Version":"2.0.0-SNAPSHOT"}
At this point, EventLoggingListener is ready for event logging and you should see the
following INFO message in the logs:
INFO EventLoggingListener: Logging events to [logPath]
If the target log file exists (one without .inprogress extension), it overwrites the file if
spark.eventLog.overwrite is enabled. You should see the following WARN message in the
logs:
WARN EventLoggingListener: Event log [target] already exists. Overwriting...
If the target log file exists and overwrite is disabled, an java.io.IOException is thrown with
the following message:
Target log file already exists ([logPath])
Settings
spark.eventLog.enabled
spark.eventLog.enabled (default: false ) - whether to log Spark events that encode the
information displayed in the UI to persisted storage. It is useful for reconstructing the Web UI
after a Spark application has finished.
spark.eventLog.dir
spark.eventLog.dir (default: /tmp/spark-events ) - path to the directory in which events are logged, e.g. hdfs://namenode:8021/directory . The directory must exist before Spark starts up. See Creating a SparkContext.
spark.eventLog.buffer.kb
spark.eventLog.buffer.kb (default: 100 ) - buffer size to use when writing to output streams.
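A minimal sketch of enabling event logging programmatically (the application name is arbitrary; the same settings can be passed with --conf on the command line, and the directory is assumed to exist):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("event-logging-demo")
  .set("spark.eventLog.enabled", "true")          // log Spark events
  .set("spark.eventLog.dir", "/tmp/spark-events") // where the event log files go
val sc = new SparkContext(conf)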
spark.eventLog.overwrite
spark.eventLog.overwrite (default: false ) - whether to delete or at least overwrite an existing event log file.
spark.eventLog.compress
spark.eventLog.compress (default: false ) controls whether to compress events ( true ) or
not ( false ).
See Compressing Events.
spark.eventLog.testing
spark.eventLog.testing (default: false ) - internal flag for testing purposes to add JSON
StatsReportListener: Logging Summary Statistics
org.apache.spark.scheduler.StatsReportListener (see the class' scaladoc) is a
SparkListener that logs a few summary statistics when each stage completes.
It listens to SparkListenerTaskEnd and SparkListenerStageCompleted messages.
$ ./bin/spark-shell --conf \
spark.extraListeners=org.apache.spark.scheduler.StatsReportListener
...
INFO SparkContext: Registered listener org.apache.spark.scheduler.StatsReportListener
...
scala> sc.parallelize(0 to 10).count
...
15/11/04 15:39:45 INFO StatsReportListener: Finished stage: org.apache.spark.scheduler
.StageInfo@4d3956a4
15/11/04 15:39:45 INFO StatsReportListener: task runtime:(count: 8, mean: 36.625000, s
tdev: 5.893588, max: 52.000000, min: 33.000000)
15/11/04 15:39:45 INFO StatsReportListener: [percentile distribution of task runtime at 0%, 5%, 10%, 25%, 50%, 75%, 90%, 95% and 100%; column layout elided]
15/11/04 15:39:45 INFO StatsReportListener: other time pct: (count: 8, mean: 82.339780, stdev: 1.948627, max: 86.538462, min: 80.000000)
15/11/04 15:39:45 INFO StatsReportListener: [percentile distribution of other time pct at the same percentiles; column layout elided]
Building Spark
You can download pre-packaged versions of Apache Spark from the project's web site. The packages are built for different Hadoop versions, but only for Scala 2.10.
Note
Since [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version the
default version of Scala is 2.11.
If you want a Scala 2.11 version of Apache Spark "users should download the Spark source
package and build with Scala 2.11 support" (quoted from the Note at Download Spark).
The build process for Scala 2.11 takes around 15 mins (on a decent machine) and is so simple that it's unlikely you'll resist the urge to do it yourself.
You can use sbt or Maven as the build command.
Build Profiles
Caution
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 4.186 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 4.893 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 5.066 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 11.108 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 7.051 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 7.650 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 9.905 s]
[INFO] Spark Project Core ................................. SUCCESS [02:09 min]
[INFO] Spark Project GraphX ............................... SUCCESS [ 19.317 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 42.077 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:32 min]
[INFO] Spark Project SQL .................................. SUCCESS [01:47 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 10.049 s]
[INFO] Spark Project ML Library ........................... SUCCESS [01:36 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 3.520 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 52.528 s]
[INFO] Spark Project REPL ................................. SUCCESS [ 7.243 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 7.898 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 15.380 s]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 24.876 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 2.971 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 7.377 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 10.752 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 1.695 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [ 13.013 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 31.728 s]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 3.472 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 12.297 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 3.789 s]
[INFO] Spark Project Java 8 Tests ......................... SUCCESS [ 4.267 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12:29 min
[INFO] Finished at: 2016-07-07T22:29:56+02:00
[INFO] Final Memory: 110M/913M
[INFO] ------------------------------------------------------------------------
Please note the messages that say the version of Spark (Building Spark Project Parent POM 2.0.0-SNAPSHOT), Scala version (maven-clean-plugin:2.6.1:clean (default-clean) @ spark-parent_2.11) and the Spark modules built.
The above command gives you the latest version of Apache Spark 2.0.0-SNAPSHOT built
for Scala 2.11.8 (see the configuration of scala-2.11 profile).
Tip
You can also know the version of Spark using ./bin/spark-shell --version .
Making Distribution
./make-distribution.sh is the shell script to make a distribution. It uses the same profiles as
Once finished, you will have the distribution in the current directory, i.e. spark-2.0.0-SNAPSHOT-bin-2.7.2.tgz .
FIXME
FIXME
FIXME What are the differences between the formats and how are they used
in Spark.
Introduction to Hadoop
Note
This page is the place to keep information more general about Hadoop and not related to Spark on YARN or files Using Input and Output (I/O) (HDFS). I don't really know what it could be, though. Perhaps nothing at all. Just saying.
HDFS (Hadoop Distributed File System) is a distributed file system designed to run
on commodity hardware. It is a data storage with files split across a cluster.
MapReduce - the compute engine for batch processing
YARN (Yet Another Resource Negotiator) - the resource manager
Currently, it's more about the ecosystem of solutions that all use Hadoop infrastructure for their work.
People reported to do wonders with the software with Yahoo! saying:
Yahoo has progressively invested in building and scaling Apache Hadoop clusters with
a current footprint of more than 40,000 servers and 600 petabytes of storage spread
across 19 clusters.
Beside numbers Yahoo! reported that:
Deep learning can be defined as first-class steps in Apache Oozie workflows with
Hadoop for data processing and Spark pipelines for machine learning.
You can find some preliminary information about Spark pipelines for machine learning in
the chapter ML Pipelines.
HDFS provides fast analytics, i.e. scanning over large amounts of data very quickly, but it was not built to handle updates. If data changed, it would need to be appended in bulk after a certain volume or time interval, preventing real-time visibility into this data.
HBase complements HDFS capabilities by providing fast and random reads and writes
and supporting updating data, i.e. serving small queries extremely quickly, and allowing
data to be updated in place.
From How does partitioning work for data from files on HDFS?:
When Spark reads a file from HDFS, it creates a single partition for a single input split.
Input split is set by the Hadoop InputFormat used to read this file. For instance, if you
use textFile() it would be TextInputFormat in Hadoop, which would return you a
single partition for a single block of HDFS (but the split between partitions would be
done on line split, not the exact block split), unless you have a compressed text file. In
case of compressed file you would get a single partition for a single file (as compressed
text files are not splittable).
If you have a 30GB uncompressed text file stored on HDFS, then with the default HDFS
block size setting (128MB) it would be stored in 235 blocks, which means that the RDD
you read from this file would have 235 partitions. When you call repartition(1000) your
RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000
partitions only when you will execute an action on top of this RDD (lazy execution
concept)
With HDFS you can store any data (regardless of format and size). It can easily handle
unstructured data like video or other binary files as well as semi- or fully-structured data
like CSV files or databases.
There is the concept of data lake that is a huge data repository to support analytics.
HDFS partitions files into so-called splits and distributes them across multiple nodes in a cluster to achieve fail-over and resiliency.
MapReduce happens in three phases: Map, Shuffle, and Reduce.
Further reading
Introducing Kudu: The New Hadoop Storage Engine for Fast Analytics on Fast Data
Spark Packages
Spark Packages is a community index of packages for Apache Spark.
Spark Packages is a community site hosting modules that are not part of Apache Spark. It
offers packages for reading different files formats (than those natively supported by Spark)
or from NoSQL databases like Cassandra, code testing, etc.
When you want to include a Spark package in your application, you should be using the --packages command line option.
TransportConf: Transport Configuration
TransportConf is a class for the transport-related network configuration for modules, e.g.
ExternalShuffleService or YarnShuffleService.
It exposes methods to access settings for a single module as spark.module.prefix or general
network-related settings.
spark.module.prefix Settings
The settings can be in the form of spark.[module].[prefix] with the following prefixes:
io.mode (default: NIO ) - the IO mode: nio or epoll .
io.preferDirectBufs (default: true ) - a flag to control whether Spark prefers
timeout in milliseconds.
io.backLog (default: -1 for no backlog) - the requested maximum length of the
thread pool.
io.receiveBuffer (default: -1 ) - the receive buffer size (SO_RCVBUF).
io.sendBuffer (default: -1 ) - the send buffer size (SO_SNDBUF).
sasl.timeout (default: 30s ) - the timeout (in milliseconds) for a single round trip of
exceptions (such as connection timeouts) per request. If set to 0 , Spark will not do any retries.
io.retryWait (default: 5s ) - the time (in milliseconds) that Spark will wait in order to
( true ) or not ( false ). If true , file descriptors are created only when data is going to
be transferred. This can reduce the number of open files.
should start using memory map rather than reading in through normal IO operations.
This prevents Spark from memory mapping very small blocks. In general, memory mapping
has high overhead for blocks close to or below the page size of the OS.
spark.network.sasl.maxEncryptedBlockSize
spark.network.sasl.maxEncryptedBlockSize (default: 64k ) is the maximum number of bytes
spark.network.sasl.serverAlwaysEncrypt
spark.network.sasl.serverAlwaysEncrypt (default: false ) controls whether the server
All the Spark shell scripts use org.apache.spark.launcher.Main class internally that checks SPARK_PRINT_LAUNCH_COMMAND and, when set (to any value), prints out the entire command to the standard error output, i.e. System.err :
Spark Command: [here comes the command]
========================================
994
995
996
at org.apache.spark.serializer.SerializationDebugger$.improveException(Serialization
Debugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.sc
ala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala
:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301
)
... 57 more
Further reading
Job aborted due to stage failure: Task not serializable
Add utility to help with NotSerializableException debugging
Task not serializable: java.io.NotSerializableException when calling function outside
closure only on classes not objects
Note
15/01/29 17:21:27 ERROR Shell: Failed to locate the winutils binary in the hadoop bina
ry path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop b
inaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:326)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
Note
You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window ( cmd ) run as Administrator, i.e. using the Run As Administrator option while executing cmd .
Tip
Exercises
Here I'm collecting exercises that aim at strengthening your understanding of Apache Spark.
Exercise
How would you go about solving a requirement to pair elements of the same key and create a new RDD out of the matched values?
val users = Seq((1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"), (3,"user2"), (3
,"user4"), (3,"user1"))
// Input RDD
val us = sc.parallelize(users)
// ...your code here
// Desired output
Seq("user1","user2"),("user1","user3"),("user1","user4"),("user2","user4"))
Caution
However, when you execute r1.take(2) two jobs get run, as the implementation assumes one job with one partition and, if the elements didn't total the number of elements requested in take , quadruples the number of partitions to work on in the following jobs.
Caution
Can you guess how many jobs are run for r1.take(15) ? How many tasks per job?
Caution
Answer: 3.
Note
$ cp ./sbin/start-master{,-2}.sh
$ grep "CLASS 1" ./sbin/start-master-2.sh
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 1 \
$ sed -i -e 's/CLASS 1/CLASS 2/' sbin/start-master-2.sh
$ grep "CLASS 1" ./sbin/start-master-2.sh
$ grep "CLASS 2" ./sbin/start-master-2.sh
"${SPARK_HOME}/sbin"/spark-daemon.sh start $CLASS 2 \
$ ./sbin/start-master-2.sh -h localhost -p 17077 --webui-port 18080 --properties-file
ha.conf
You can check how many instances you're currently running using the jps command as follows:
$ jps -lm
5024 sun.tools.jps.Jps -lm
4994 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port
8080 -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf
4808 org.apache.spark.deploy.master.Master --ip japila.local --port 7077 --webui-port
8080 -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf
4778 org.apache.zookeeper.server.quorum.QuorumPeerMain config/zookeeper.properties
1. Read the text file - refer to Using Input and Output (I/O).
2. Split each line into words and flatten the result.
3. Map each word into a pair and count them by word (key).
4. Save the result into text files - one per partition.
After you have executed the example, see the contents of the README.count directory:
$ ls -lt README.count
total 16
-rw-r--r-- 1 jacek staff 0 9 pa 13:36 _SUCCESS
-rw-r--r-- 1 jacek staff 1963 9 pa 13:36 part-00000
-rw-r--r-- 1 jacek staff 1663 9 pa 13:36 part-00001
The files part-0000x contain the pairs of word and the count.
$ cat README.count/part-00000
(package,1)
(this,1)
(Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hado
op-version),1)
(Because,1)
(Python,2)
(cluster.,1)
(its,1)
([run,1)
...
Further (self-)development
Please read the questions and give answers first before looking at the link given.
1. Why are there two files under the directory?
2. How could you have only one?
3. How to filter out words by name?
4. How to count words?
Please refer to the chapter Partitions to find some of the answers.
Overview
You're going to use sbt as the project build tool. It uses build.sbt for the project's description as well as the dependencies, i.e. the version of Apache Spark and others.
The application's main code is under the src/main/scala directory, in the SparkMeApp.scala file.
With the files in a directory, executing sbt package results in a package that can be
deployed onto a Spark cluster using spark-submit .
In this example, you're going to use Spark's local mode.
SparkMe Application
The application uses a single command-line parameter (as args(0) ) that is the file to
process. The file is read and the number of lines printed out.
package pl.japila.spark
import org.apache.spark.{SparkContext, SparkConf}
object SparkMeApp {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SparkMe Application")
val sc = new SparkContext(conf)
val fileName = args(0)
val lines = sc.textFile(fileName).cache
val c = lines.count
println(s"There are $c lines in $fileName")
}
}
Tip
With the file, the build is more predictable as the version of sbt doesn't depend on the sbt launcher.
Packaging Application
Execute sbt package to package the application.
sparkme-app sbt package
[info] Loading global plugins from /Users/jacek/.sbt/0.13/plugins
[info] Loading project definition from /Users/jacek/dev/sandbox/sparkme-app/project
[info] Set current project to SparkMe Project (in build file:/Users/jacek/dev/sandbox/
sparkme-app/)
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/sparkme-app/target/scala-2
.11/classes...
[info] Packaging /Users/jacek/dev/sandbox/sparkme-app/target/scala-2.11/sparkme-projec
t_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 3 s, completed Sep 23, 2015 12:47:52 AM
The application uses only classes that come with Spark, so package is enough.
spark-submit the SparkMe application and specify the file to process (as it is the only and required command-line parameter).
Note
build.sbt is sbt's build definition and is only used as an input file for sbt.
Requirements
1. Typesafe Activator
2. Access to Internet to download the Spark dependency - spark-core
Add the following line to build.sbt (the main configuration file for the sbt project) to add the dependency on Spark 1.5.1. Note the double % that selects the proper version of the dependency for Scala 2.11.7.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"
$ mkdir -p src/main/scala/pl/japila/spark
package pl.japila.spark
import org.apache.spark.scheduler.{SparkListenerStageCompleted, SparkListener, SparkListenerJobStart}
class CustomSparkListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart) {
    println(s"Job started with ${jobStart.stageInfos.size} stages: $jobStart")
  }
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    println(s"Stage ${stageCompleted.stageInfo.stageId} completed with ${stageCompleted.stageInfo.numTasks} tasks.")
  }
}
[custom-spark-listener]> package
[info] Compiling 1 Scala source to /Users/jacek/dev/sandbox/custom-spark-listener/targ
et/scala-2.11/classes...
[info] Packaging /Users/jacek/dev/sandbox/custom-spark-listener/target/scala-2.11/cust
om-spark-listener_2.11-1.0.jar ...
[info] Done packaging.
[success] Total time: 0 s, completed Nov 4, 2015 8:59:30 AM
You should have the result jar file with the custom scheduler listener ready (mine is /Users/jacek/dev/sandbox/custom-spark-listener/target/scala-2.11/custom-spark-listener_2.11-1.0.jar ).
The last line that starts with Job started: is from the custom Spark listener you've just created. Congratulations! The exercise is over.
Use sc.addSparkListener(myListener)
Questions
1. What are the pros and cons of using the command line version vs inside a Spark
application?
Caution
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileC
lassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at org.apache.spark.rpc.RpcEnv$.getRpcEnvFactory(RpcEnv.scala:38)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:49)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:257)
at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:198)
at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:272)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:441)
at org.apache.spark.repl.Main$.createSparkContext(Main.scala:79)
at $line3.$read$$iw$$iw.<init>(<console>:12)
at $line3.$read$$iw.<init>(<console>:21)
at $line3.$read.<init>(<console>:23)
at $line3.$read$.<init>(<console>:27)
at $line3.$read$.<clinit>(<console>)
at $line3.$eval$.$print$lzycompute(<console>:7)
at $line3.$eval$.$print(<console>:6)
at $line3.$eval.$print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:6
2)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp
l.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:784)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1039)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.a
pply(IMain.scala:636)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.a
pply(IMain.scala:635)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoad
er.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileC
lassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:
635)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:563)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:802)
at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:836)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:694)
at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:404)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcZ$sp(Sp
arkILoop.scala:39)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoo
p.scala:38)
at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoo
p.scala:38)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:213)
at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:38)
at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:94)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.sca
la:922)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:911)
at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClas
sLoader.scala:97)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:911)
at org.apache.spark.repl.Main$.main(Main.scala:49)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:6
2)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImp
l.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$r
unMain(SparkSubmit.scala:680)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
FIXME
Note
Execute the following command to have spark-shell download the jar into the ~/.ivy2/jars directory:
./bin/spark-shell --packages org.postgresql:postgresql:9.4.1208
Tip
Start ./bin/spark-shell with the --driver-class-path command-line option pointing at the driver jar.
SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell --driver-class-path /Users/jacek/.m2/repository/org/postgresql/postgresql/9.4.1207.jre7/postgresql-9.4.1207.jre7.jar
This gives you the proper setup for accessing PostgreSQL using the JDBC driver.
Execute the following to access the projects table in the sparkdb database.
val opts = Map(
  "url" -> "jdbc:postgresql:sparkdb",
  "dbtable" -> "projects")
val df = spark
  .read
  .format("jdbc")
  .options(opts)
  .load
scala> df.show(false)
+---+------------+-----------------------+
|id |name |website |
+---+------------+-----------------------+
|1 |Apache Spark|http://spark.apache.org|
|2 |Apache Hive |http://hive.apache.org |
|3 |Apache Kafka|http://kafka.apache.org|
|4 |Apache Flink|http://flink.apache.org|
+---+------------+-----------------------+
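Equivalently, you could use DataFrameReader.jdbc directly. The sketch below assumes the same local sparkdb database and the PostgreSQL driver on the driver's classpath as set up above; add user and password entries to the Properties if your server requires authentication.
import java.util.Properties

// connection properties -- empty here since the local database needs no credentials
val props = new Properties()
val projects = spark.read.jdbc("jdbc:postgresql:sparkdb", "projects", props)
projects.show(false)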
Troubleshooting
If things can go wrong, they sooner or later will. Here is a list of possible issues and their solutions.
PostgreSQL Setup
Note
Dropping Database
Stopping Database Server
Installation
Install PostgreSQL as described in TK.
Caution
This page serves as a cheatsheet for the author so he does not have to search the Internet to find the installation steps.
Create Database
$ createdb sparkdb
Tip
Accessing Database
Use psql sparkdb to access the database.
$ psql sparkdb
psql (9.5.2)
Type "help" for help.
sparkdb=#
Execute SELECT version() to find out the version of the database server you are connected to.
sparkdb=# SELECT version();
                                                   version
---------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.2 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.2 (clang-700.1.81), 64-bit
(1 row)
Creating Table
Create a table using CREATE TABLE command.
CREATE TABLE projects (
id SERIAL PRIMARY KEY,
name text,
website text
);
Execute select * from projects; to ensure that you have the following records in the projects table:
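The expected records are the four projects shown in the df.show output earlier. If you need to create them, the following INSERT statements would do (a sketch; the id column is filled in by the SERIAL sequence):
INSERT INTO projects (name, website) VALUES
  ('Apache Spark', 'http://spark.apache.org'),
  ('Apache Hive',  'http://hive.apache.org'),
  ('Apache Kafka', 'http://kafka.apache.org'),
  ('Apache Flink', 'http://flink.apache.org');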
Dropping Database
$ dropdb sparkdb
Tip
Recipe
Start a Spark cluster, e.g. 1-node Hadoop YARN.
start-yarn.sh
// 2-stage job -- it _appears_ that a stage can be failed only when there is a shuffle
sc.parallelize(0 to 3e3.toInt, 2).map(n => (n % 2, n)).groupByKey.count
Use at least 2 executors so you can kill one and keep the application up and running (on the remaining executor).
YARN_CONF_DIR=hadoop-conf ./bin/spark-shell --master yarn \
-c spark.shuffle.service.enabled=true \
--num-executors 2
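To actually exercise the failure, one option (a sketch, assuming the executors run as CoarseGrainedExecutorBackend JVMs on the local machine) is to find one of the executor processes and kill it while the job is running, then watch the application continue on the remaining executor.
# list running JVMs and pick one of the executor processes
jps -lm | grep CoarseGrainedExecutorBackend

# kill the chosen executor (replace <pid> with the process id reported by jps)
kill -9 <pid>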
Courses
Spark courses
Spark Fundamentals I from Big Data University.
Introduction to Big Data with Apache Spark from Databricks.
Scalable Machine Learning from Databricks.
Books
O'Reilly
Learning Spark (my review at Amazon.com)
Advanced Analytics with Spark
Data Algorithms: Recipes for Scaling Up with Hadoop and Spark
Spark Operations: Operationalizing Apache Spark at Scale (in the works)
Manning
Spark in Action (MEAP)
Streaming Data (MEAP)
Spark GraphX in Action (MEAP)
Packt
Mastering Apache Spark
Spark Cookbook
Learning Real-time Processing with Spark Streaming
Machine Learning with Spark
Fast Data Processing with Spark, 2nd Edition
Fast Data Processing with Spark
Apache Spark Graph Processing
Apress
Big Data Analytics with Spark
Guide to High Performance Distributed Computing (Case Studies with Hadoop,
Scalding and Spark)
DataStax Enterprise
Commercial Products
Spark has reached the point where companies around the world adopt it to build their own
solutions on top of it.
1. IBM Analytics for Apache Spark
2. Google Cloud Dataproc
Requirements
Day 1
Day 2
Spark Core
Don't fear the logs - Learn Spark by Logs
Everything you always wanted to know about accumulators (and task metrics)
Optimizing Spark using SchedulableBuilders
Learning Spark internals using groupBy (to cause shuffle)
Spark on Cluster
10 Lesser-Known Tidbits about Spark Standalone
Spark Streaming
Fault-tolerant stream processing using Spark Streaming
Stateful stream processing using Spark Streaming
Duration: FIXME
REST Server
Read REST Server.
spark-shell is spark-submit
Read Spark shell.
Note
You may also make it a little heavier by explaining data distribution over a cluster and going over the concepts of drivers, masters, workers, and executors.