Advanced Programming Using the Spark Core API
In This Chapter:
Introduction to shared variables (broadcast variables and accumulators) in Spark
Partitioning and repartitioning of Spark RDDs
Storage options for RDDs
Caching, distributed persistence, and checkpointing of RDDs
This chapter focuses on the additional programming tools at your disposal with
the Spark API, including broadcast variables and accumulators as shared
variables across different Workers in a Spark cluster. This chapter also dives into
the important topics of Spark partitioning and RDD storage. You will learn about
the various storage functions available for program optimization, durability, and
process restart and recovery. You will also learn how to use external programs
and scripts to process data in Spark RDDs in a Spark-managed lineage. The
information in this chapter builds on the Spark API transformations you learned
about in Chapter 4, “Learning Spark Programming Basics,” and gives you the
additional tools required to build efficient end-to-end Spark processing pipelines.
Broadcast Variables
Broadcast variables are read-only variables set by the Spark Driver program that
are made available to the Worker nodes in a Spark cluster, which means they are
available to any tasks running on Executors on the Workers. Broadcast variables
are read only after being set by the Driver. Broadcast variables are shared across
Workers using an efficient peer-to-peer sharing protocol based on BitTorrent;
this enables greater scalability than simply pushing variables directly to Executor
processes from the Spark Driver. Figure 5.1 demonstrates how broadcast
variables are initialized, disseminated among Workers, and accessed by nodes
within tasks.
Figure 5.1 Spark broadcast variables.
broadcast()
Syntax:
sc.broadcast(value)
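The broadcast() method creates an instance of a Broadcast object from the value supplied, within the current SparkContext. As a minimal sketch (the list contents here are purely illustrative):
nums = sc.broadcast([0, 1, 2, 3])
nums.value
# returns [0, 1, 2, 3]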
You can also create broadcast variables from the contents of a file on a local, network, or distributed filesystem. Consider a file named stations.csv, which contains comma-delimited data, as follows:
83,Mezes Park,37.491269,-122.236234,15,Redwood City,2/20/2014
84,Ryland Park,37.342725,-121.895617,15,San Jose,4/9/2014
stationsfile = '/opt/spark/data/stations.csv'
stationsdata = dict(map(lambda x: (x[0],x[1]), \
                    map(lambda x: x.split(','), \
                    open(stationsfile))))
stations = sc.broadcast(stationsdata)
stations.value["83"]
# returns 'Mezes Park'
Listing 5.2 shows how to create a broadcast variable from a CSV file (stations.csv); the resulting broadcast variable contains a dictionary of key/value pairs mapping each station ID to its station name. You can now access this dictionary from within any map() or filter() RDD operation.
A number of methods can be called on an initialized broadcast variable object, as described in the following sections.
value()
Syntax:
Broadcast.value()
Listing 5.2 demonstrates the use of the value() function to return the value of the broadcast variable; in that example, the value is a dict (or map), and you can access its values by their keys. The value() function can be used within a lambda function in a map() or filter() operation in a Spark program.
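As a brief sketch, you could use the stations broadcast variable from Listing 5.2 inside a map() operation; the status pairs below are illustrative (station_id, bikes_available) records:
status = sc.parallelize([('83', 10), ('84', 7)])
status.map(lambda x: (stations.value[x[0]], x[1])).collect()
# returns [('Mezes Park', 10), ('Ryland Park', 7)]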
unpersist()
Syntax:
Broadcast.unpersist(blocking=False)
The unpersist() method removes the broadcast variable's cached copies from memory on the Workers in the cluster where it was persisted; the optional blocking argument specifies whether the call should block until unpersisting has completed.
What are the advantages of broadcast variables? Why are they useful or even
required in some cases? As discussed in Chapter 4, it is often necessary to
combine two datasets to produce a resultant dataset. This can be achieved in
multiple ways.
Consider two associated datasets: stations (a relatively small lookup data
set) and status (a large eventful data source). These two datasets can join on a
natural key, station_id. You could join the two datasets as RDDs directly in
your Spark application, as shown in Listing 5.4.
status = sc.textFile('file:///opt/spark/data/bike-share/status') \
    .map(lambda x: x.split(',')) \
    .keyBy(lambda x: x[0])
stations = sc.textFile('file:///opt/spark/data/bike-share/stations') \
    .map(lambda x: x.split(',')) \
    .keyBy(lambda x: x[0])
status.join(stations) \
    .map(lambda x: (x[1][0][3],x[1][1][1],x[1][0][1],x[1][0][2])) \
    .count()
# returns 907200
Because the stations data is small enough to fit in memory on the Driver, another option is to load it into a Python dict in the Driver program and reference that variable within a map() operation, as shown in Listing 5.5.
stationsfile = '/opt/spark/data/bike-share/stations/stations.csv'
sdata = dict(map(lambda x: (x[0],x[1]), \
            map(lambda x: x.split(','), \
            open(stationsfile))))
status = sc.textFile('file:///opt/spark/data/bike-share/status') \
    .map(lambda x: x.split(',')) \
    .keyBy(lambda x: x[0])
status.map(lambda x: (x[1][3], sdata[x[0]], x[1][1], x[1][2])) \
    .count()
# returns 907200
This works and is better in most cases than the first option; however, it lacks
scalability. In this case, the variable is part of a closure within the referencing
function. This may result in unnecessary and less efficient transfer and
duplication of data on the Worker nodes.
The best option would be to initialize a broadcast variable for the smaller
stations table. This involves using peer-to-peer replication to make the
variable available to all Workers, and the single copy is usable by all tasks on all
Executors belonging to an application running on the Worker. Then you can use
the variable in your map() operations, much as in the second option. An
example of this is provided in Listing 5.6.
stationsfile = '/opt/spark/data/bike-share/stations/stations.csv'
sdata = dict(map(lambda x: (x[0],x[1]), \
            map(lambda x: x.split(','), \
            open(stationsfile))))
stations = sc.broadcast(sdata)
status = sc.textFile('file:///opt/spark/data/bike-share/status') \
    .map(lambda x: x.split(',')) \
    .keyBy(lambda x: x[0])
status.map(lambda x: (x[1][3], stations.value[x[0]], x[1][1], x[1][2])) \
    .count()
# returns 907200
As you can see in the scenario just described, using broadcast variables is an
efficient method for sharing data at runtime between processes running on
different nodes of a Spark cluster. Consider the following points about broadcast
variables:
Using them eliminates the need for a shuffle operation.
They use an efficient and scalable peer-to-peer distribution mechanism.
They replicate data once per Worker, as opposed to replicating once per task, which is important as there may be thousands of tasks in a Spark application.
Many tasks can reuse them multiple times.
They are serialized objects, so they are efficiently read.
Accumulators
Another type of shared variable in Spark is an accumulator. Unlike broadcast variables, accumulators can be updated; more specifically, they are numeric values that can be incremented.
Think of accumulators as counters that you can use in a number of ways in
Spark programming. Accumulators allow you to aggregate multiple values while
your program is running.
Accumulators are set by the Driver and updated by Executors running tasks in
the respective SparkContext. The Driver can then read back the final value from
the accumulator, typically at the end of the program.
Accumulators update only once per successfully completed task in a Spark
application. Worker nodes send the updates to the accumulator back to the
Driver, which is the only process that can read the accumulator value.
Accumulators can use integer or float values. Listing 5.7 and Figure 5.2
demonstrate how accumulators are created, updated, and read.
acc = sc.accumulator(0)
def addone(x):
    global acc
    acc += 1
    return x + 1
myrdd = sc.parallelize([1,2,3,4,5])
myrdd.map(lambda x: addone(x)).collect()
# returns [2, 3, 4, 5, 6]
print("records processed: " + str(acc.value))
# returns "records processed: 5"
Figure 5.2 Accumulators.
accumulator()
Syntax:
sc.accumulator(value, accum_param=None)
The accumulator() method creates an accumulator within the current SparkContext, initialized to the value supplied. The optional accum_param argument is used to define a custom accumulator, as discussed in the "Custom Accumulators" section that follows.
value()
Syntax:
Accumulator.value()
The value() method retrieves the accumulator’s value. This method can be
used only in the Driver program.
Custom Accumulators
Standard accumulators created in a SparkContext support primitive numeric
datatypes, including int and float. Custom accumulators can perform
aggregate operations on variables of types other than scalar numeric values.
Custom accumulators are created using the AccumulatorParam helper object. The only requirement is that the operations performed must be associative and commutative, meaning that the order of the operands and the grouping of the operations do not affect the result.
A common use of custom accumulators is to accumulate vectors as either lists or
dictionaries. Conceptually, the same principle applies in a non-mathematical
context to non-numeric operations—for instance, when you create a custom
accumulator to concatenate string values.
To use custom accumulators, you need to define a custom class that extends the AccumulatorParam class. The class needs to include two specific member functions: addInPlace(), which operates on two objects of the custom accumulator's datatype and returns a new value, and zero(), which provides a "zero value" for the type, such as an empty map for a map type.
Listing 5.8 shows an example of a custom accumulator used to sum vectors as a
Python dictionary.
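A minimal sketch of a dictionary-summing custom accumulator follows; the class name VectorAccumulatorParam and the sample data are illustrative:
from pyspark import AccumulatorParam

class VectorAccumulatorParam(AccumulatorParam):
    def zero(self, value):
        # the "zero value" for the type: an empty dict
        return {}
    def addInPlace(self, v1, v2):
        # merge v2 into v1, summing the values of matching keys
        for key, val in v2.items():
            v1[key] = v1.get(key, 0) + val
        return v1

vector_acc = sc.accumulator({}, VectorAccumulatorParam())

def add_to_vector(pair):
    global vector_acc
    vector_acc += {pair[0]: pair[1]}

sc.parallelize([('a', 1), ('b', 2), ('a', 3)]).foreach(add_to_vector)
vector_acc.value
# returns {'a': 4, 'b': 2}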
4. Initialize accumulators for the cumulative word count and cumulative total
length of all words:
word_count = sc.accumulator(0)
total_len = sc.accumulator(0.0)
Note that you have created total_len as a float because you will use it as
the numerator in a division operation later, when you want to keep the
precision in the result.
5. Create a function to accumulate word count and the total word length:
def add_values(word, word_count, total_len):
    word_count += 1
    total_len += len(word)
7. Use the foreach action to iterate through the resultant RDD and call your
add_values function:
words.foreach(lambda x: add_values(x, word_count, total_len))
This should return 966958 for the total number of words and
3.608722405730135 for the average word length.
8. Now put all the code for this exercise in a file named average_word_length.py and execute the program using spark-submit. Recall that you need to add the following to the beginning of your script:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('Broadcast Variables and Accumulators')
sc = SparkContext(conf=conf)
The complete source code for this exercise can be found in the average-word-length folder at https://github.com/sparktraining/spark_using_python.
Partitioning Data in Spark
Partitioning is integral to Spark processing in most cases. Effective partitioning
can improve application performance by orders of magnitude. Conversely,
inefficient partitioning can result in programs failing to complete, producing
problems such as Executor-out-of-memory errors for excessively large
partitions.
The following sections recap what you already know about RDD partitions and
then discuss API methods that can affect partitioning behavior or that can access
data within partitions more effectively.
Partitioning Overview
The number of partitions to create from an RDD transformation is usually
configurable. There are some default behaviors you should be aware of,
however.
Spark creates an RDD partition per block when using HDFS (typically the size
of a block in HDFS is 128MB), as in this example:
myrdd = sc.textFile("hdfs:///dir/filescontaining10blocks")
myrdd.getNumPartitions()
# returns 10
Controlling Partitions
How many partitions should an RDD have? There are issues at both ends of the
spectrum when it comes to answering this question. Having too few, very large partitions can result in out-of-memory issues on Executors. Having too many small partitions isn't optimal either, because too many tasks are spawned for trivial amounts of input data.
A mix of large and small partitions can result in speculative execution occurring
needlessly, if this is enabled. Speculative execution is a mechanism that a cluster
scheduler uses to preempt slow-running processes; if the root cause of the
slowness of one or more processes in a Spark application is inefficient
partitioning, then speculative execution won’t help.
Consider the scenario in Figure 5.3.
Figure 5.3 Skewed partitions.
The filter() operation creates a new partition for every input partition on a one-to-one basis, containing only the records that meet the filter condition. This can result
in some partitions having significantly less data than others, which can lead to
bad outcomes, such as data skewing, the potential for speculative execution, and
suboptimal performance in subsequent stages.
In such cases, you can use one of the repartitioning methods in the Spark API;
these include partitionBy(), coalesce(), repartition(), and
repartitionAndSortWithinPartitions(), all of which are explained
shortly.
These functions take a partitioned input RDD and create a new RDD with n
partitions, where n could be more or fewer than the original number of
partitions. Take the example from Figure 5.3. In Figure 5.4, a
repartition() function is applied to consolidate the four unevenly
distributed partitions to two “evenly” distributed partitions, using the default
HashPartitioner.
Figure 5.4 The repartition() function.
Repartitioning Functions
The main functions used to repartition RDDs are documented in the following
sections.
partitionBy()
Syntax:
RDD.partitionBy(numPartitions, partitionFunc=portable_hash)
The partitionBy() method returns a new RDD containing the same data as
the input RDD but with the number of partitions specified by the
numPartitions argument, using the portable_hash function
(HashPartitioner) by default. An example of partitionBy() is shown
in Listing 5.9.
kvrdd = sc.parallelize([(1,'A'),(2,'B'),(3,'C'),(4,'D')],4)
kvrdd.getNumPartitions()
# returns 4
kvrdd.partitionBy(2).getNumPartitions()
# returns 2
repartition()
Syntax:
RDD.repartition(numPartitions)
The repartition() method returns a new RDD with the same data as the
input RDD, consisting of exactly the number of partitions specified by the
numPartitions argument. The repartition() method may require a
shuffle, and, unlike partitionBy(), it has no option to change the partitioner
or partitioning function. The repartition() method also lets you create
more partitions in the target RDD than existed in the input RDD. Listing 5.10
shows an example of the repartition() function.
kvrdd = sc.parallelize([(1,'A'),(2,'B'),(3,'C'),(4,'D')],4)
kvrdd.repartition(2).getNumPartitions()
# returns 2
coalesce()
Syntax:
RDD.coalesce(numPartitions, shuffle=False)
The coalesce() method returns a new RDD reduced to the number of partitions specified by the numPartitions argument. With shuffle=False (the default), coalesce() avoids a full shuffle by combining existing partitions on their current Workers, so it can only decrease the number of partitions; set shuffle=True to allow a shuffle and increase the partition count. Listing 5.11 shows an example of the coalesce() function.
kvrdd = sc.parallelize([(1,'A'),(2,'B'),(3,'C'),(4,'D')],4)
kvrdd.coalesce(2, shuffle=False).getNumPartitions()
# returns 2
repartitionAndSortWithinPartitions()
Syntax:
RDD.repartitionAndSortWithinPartitions(numPartitions=None,
                                       partitionFunc=portable_hash,
                                       ascending=True,
                                       keyfunc=<lambda function>)
The repartitionAndSortWithinPartitions() method repartitions the input RDD into the number of partitions directed by the numPartitions argument, partitioning according to partitionFunc, and sorts the records within each resulting partition by their keys, as determined by keyfunc and the ascending argument. Listing 5.12 shows an example of this function, using glom() to inspect the resulting partitions.
kvrdd = sc.parallelize([((1,99),'A'),((1,101),'B'),((2,99),'C'),
                        ((2,101),'D')],2)
kvrdd.glom().collect()
# returns:
# [[((1, 99), 'A'), ((1, 101), 'B')], [((2, 99), 'C'), ((2, 101), 'D')]]
kvrdd2 = kvrdd.repartitionAndSortWithinPartitions( \
    numPartitions=2,
    ascending=False,
    keyfunc=lambda x: x[1])
kvrdd2.glom().collect()
# returns:
# [[((1, 101), 'B'), ((1, 99), 'A')], [((2, 101), 'D'), ((2, 99), 'C')]]
foreachPartition()
Syntax:
RDD.foreachPartition(func)
The foreachPartition() method is an action that applies the function specified in the func argument to each partition of an RDD; the function receives an iterator over the records in the partition. Listing 5.13 shows an example of the foreachPartition() action.
def f(x):
    for rec in x:
        print(rec)
kvrdd = sc.parallelize([((1,99),'A'),((1,101),'B'),((2,99),'C'),
                        ((2,101),'D')],2)
kvrdd.foreachPartition(f)
# returns:
# ((1, 99), 'A')
# ((1, 101), 'B')
# ((2, 99), 'C')
# ((2, 101), 'D')
glom()
Syntax:
RDD.glom()
The glom() method returns an RDD created by coalescing all the elements
within each partition into a list. This is useful for inspecting RDD partitions as
collated lists; you saw an example of this function in Listing 5.12.
lookup()
Syntax:
RDD.lookup(key)
The lookup() method returns the list of values in an RDD for the key
referenced by the key argument. If used against an RDD partitioned with a
known partitioner, lookup() uses the partitioner to narrow its search to only
the partitions where the key would be present.
Listing 5.14 shows an example of the lookup() method.
kvrdd = sc.parallelize([(1,'A'),(1,'B'),(2,'C'),(2,'D')],2)
kvrdd.lookup(1)
# returns ['A', 'B']
mapPartitions()
Syntax:
RDD.mapPartitions(func, preservesPartitioning=False)
The mapPartitions() method returns a new RDD by applying a function (the func argument) to each partition of the input RDD rather than to each element; the function takes an iterator over the partition's records and returns an iterable of output records. Listing 5.15 shows an example that reverses the keys and values within each partition.
kvrdd = sc.parallelize([(1,'A'),(1,'B'),(2,'C'),(2,'D')],2)
def f(iterator): yield [(b, a) for (a, b) in iterator]
kvrdd.mapPartitions(f).collect()
# returns [[('A', 1), ('B', 1)], [('C', 2), ('D', 2)]]
RDD Storage Options
You can inspect an RDD's lineage, along with its current storage information, by using the toDebugString() method, as in this example:
>>> print(longwords.toDebugString())
(1) PythonRDD[6] at collect at <stdin>:1 []
 |  MapPartitionsRDD[1] at textFile at ..[]
 |  file://lorem.txt HadoopRDD[0] at textFile at ..[]
In addition, there are replicated storage options available with each of the basic
storage levels listed in Table 5.2. These replicate each partition to more than one
cluster node. Replication of RDDs consumes more space across the cluster but
enables tasks to continue to run in the event of a failure without having to wait
for lost partitions to reprocess. Although fault tolerance is provided for all Spark
RDDs, regardless of their storage level, replicated storage levels provide much
faster fault recovery.
Storage-Level Flags
A storage level is implemented as a set of flags that control the RDD storage. There are flags that determine whether to use memory, whether to spill data to disk if it does not fit in memory, whether to store objects in serialized format, and whether to replicate the RDD partitions to multiple nodes. The flags are implemented in the StorageLevel constructor, as shown in Listing 5.17.
StorageLevel(useDisk,
             useMemory,
             useOffHeap,
             deserialized,
             replication=1)
getStorageLevel()
Syntax:
RDD.getStorageLevel()
The Spark API includes a function called getStorageLevel() that you can
use to inspect the storage level for an RDD. The getStorageLevel()
function returns the different storage option flags set for an RDD. The return
value in the case of PySpark is an instance of the class
pyspark.StorageLevel. Listing 5.18 shows how to use the
getStorageLevel() function.
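A minimal sketch follows; the file path is illustrative, and the exact representation of the returned object may vary slightly by Spark version:
lorem = sc.textFile('file:///opt/spark/data/lorem.txt')
lorem.getStorageLevel()
# returns: StorageLevel(False, False, False, False, 1)
lorem.getStorageLevel().useMemory
# returns: False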
RDD Caching
A Spark RDD, including all of its parent RDDs, is normally recomputed for each
action called in the same session or application. Caching an RDD persists the
data in memory; the same routine can then reuse it multiple times when
subsequent actions are called, without requiring reevaluation.
Caching does not trigger execution or computation; rather, it is a suggestion. If
there is not enough memory available to cache the RDD, it is reevaluated for
each lineage triggered by an action. Caching never spills to disk because it only
uses memory. The cached RDD persists using the MEMORY_ONLY storage level.
Under the appropriate circumstances, caching is a useful tool to increase
application performance. Listing 5.19 shows an example of caching with RDDs.
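A minimal sketch of caching, using the same word-count pipeline shown later in Listing 5.20:
doc = sc.textFile("file:///opt/spark/data/shakespeare.txt")
words = doc.flatMap(lambda x: x.split()) \
    .map(lambda x: (x,1)) \
    .reduceByKey(lambda x, y: x + y)
words.cache()
words.count()   # the first action evaluates the lineage and caches the result
# returns: 33505
words.take(3)   # subsequent actions reuse the cached partitions
# returns: [('Quince', 8), ('Begin', 9), ('Just', 12)]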
Persisting RDDs
Cached partitions (partitions of an RDD on which the cache() method has been called) are stored in memory on Executor JVMs on Spark Worker nodes. If one of the Worker nodes were to fail or become unavailable, Spark would need to re-create the cached partitions from their lineage.
The persist() method, introduced in Chapter 4, offers additional storage
options, including MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER,
MEMORY_AND_DISK_SER, and MEMORY_ONLY, which is the same as the
cache() method. When using persistence with one of the disk storage options,
the persisted partitions are stored as local files on the Worker nodes running
Spark Executors for the application. You can use the persisted data on disk to
reconstitute partitions lost due to Executor or memory failure.
In addition, persist() can use replication to persist the same partition on
more than one node. Replication makes reevaluation less likely because more
than one node would need to fail or be unavailable to trigger recomputation.
Persistence offers additional durability over caching, while still offering
increased performance. It is worth reiterating that Spark RDDs are fault tolerant
regardless of persistence and can always be re-created in the event of a failure.
Persistence simply expedites this process.
Persistence, like caching, is only a suggestion, and it takes place only after an
action is called to trigger evaluation of an RDD. If sufficient resources are not
available—for instance, if there is not enough memory available—persistence is
not implemented.
You can inspect the persistence state and current storage levels from any RDD at
any stage by using the getStorageLevel() method, discussed earlier in this
chapter.
The methods available for persisting and unpersisting RDDs are documented in
the following sections.
persist()
Syntax:
RDD.persist(storageLevel=StorageLevel.MEMORY_ONLY_SER)
The persist() method specifies the desired storage level and storage
attributes for an RDD. The desired storage options are implemented the first
time the RDD is evaluated. If this is not possible—for example, if there is
insufficient memory to persist the RDD in memory—Spark reverts to its normal
behavior of retaining only required partitions in memory.
The storageLevel argument is expressed as either a static constant or a set
of storage flags (see the section “RDD Storage Options,” earlier in this chapter).
For example, to set a storage level of MEMORY_AND_DISK_SER_2, you could
use either of the following:
myrdd.persist(StorageLevel.MEMORY_AND_DISK_SER_2)
myrdd.persist(StorageLevel(True, True, False, False, 2))
unpersist()
Syntax:
RDD.unpersist()
The unpersist() method “unpersists” the RDD. Use it if you no longer need
the RDD to persist. Also, if you want to change the storage options for a
persisted RDD, you must unpersist the RDD first. If you attempt to change the
storage level of an RDD marked for persistence, you get the exception “Cannot
change storage level of an RDD after it was already assigned a level.”
Listing 5.20 shows several examples of persistence.
Listing 5.20 Persisting an RDD
doc = sc.textFile("file:///opt/spark/data/shakespeare.txt")
words = doc.flatMap(lambda x: x.split()) \
    .map(lambda x: (x,1)) \
    .reduceByKey(lambda x, y: x + y)
words.persist()
words.count()
# returns: 33505
words.take(3)
# returns: [('Quince', 8), ('Begin', 9), ('Just', 12)]
print(words.toDebugString().decode("utf-8"))
# returns:
# (1) PythonRDD[46] at RDD at PythonRDD.scala:48 [Memory Serialized 1x Replicated]
#  |  CachedPartitions: 1; MemorySize: 644.8 KB; ExternalBlockStoreSize: ...
#  |  MapPartitionsRDD[45] at mapPartitions at PythonRDD.scala:427 [...]
#  |  ShuffledRDD[44] at partitionBy at NativeMethodAccessorImpl.java:0 [...]
# +-(1) PairwiseRDD[43] at reduceByKey at <stdin>:3 [Memory Serialized 1x ...]
#    |  PythonRDD[42] at reduceByKey at <stdin>:3 [Memory Serialized 1x Replicated]
#    |  file:///opt/spark/data/shakespeare.txt MapPartitionsRDD[41] at textFile ...
#    |  file:///opt/spark/data/shakespeare.txt HadoopRDD[40] at textFile at ...
Note that the unpersist() method can also be used to remove an RDD that
was cached using the cache() method.
Persisted RDDs are also viewable in the Spark application UI via the Storage
tab, as shown in Figures 5.6 and 5.7.
Figure 5.6 Viewing persisted RDDs in the Spark application UI.
Figure 5.7 Viewing details of a persisted RDD in the Spark application UI.
Checkpointing RDDs
Checkpointing involves saving data to a file. Unlike the disk-based persistence
option just discussed, which deletes the persisted RDD data when the Spark
Driver program finishes, checkpointed data persists beyond the application.
Checkpointing eliminates the need for Spark to maintain RDD lineage, which
can be problematic when the lineage gets long, such as with streaming or
iterative processing applications. Long lineage typically leads to long recovery
times and the possibility of a stack overflow.
Checkpointing data to a distributed filesystem such as HDFS provides additional
storage fault tolerance as well. Checkpointing is expensive, so implement it with
some consideration about when you should checkpoint an RDD.
As with the caching and persistence options, checkpointing happens only after
an action is called against an RDD to force computation, such as count().
Note that checkpointing must be requested before any action is requested against
an RDD.
The methods associated with checkpointing are documented in the following
sections.
setCheckpointDir()
Syntax:
sc.setCheckpointDir(dirName)
The setCheckpointDir() method sets the directory under which RDDs will be checkpointed. If you are running Spark on a cluster, this directory should be on a shared filesystem such as HDFS.
checkpoint()
Syntax:
RDD.checkpoint()
The checkpoint() method marks the RDD for checkpointing. The first time an action is executed against the RDD, it is saved to a file in the directory set using setCheckpointDir(), and all references to its parent RDDs are removed.
The checkpoint directory is valid only for the current SparkContext, so you need to execute setCheckpointDir() for each separate Spark application. In addition, the checkpoint directory cannot be shared across different Spark applications.
isCheckpointed()
Syntax:
RDD.isCheckpointed()
The isCheckpointed() method returns True if the RDD has been checkpointed.
getCheckpointFile()
Syntax:
RDD.getCheckpointFile()
The getCheckpointFile() method returns the name of the file to which the RDD was checkpointed, or None if the RDD has not been checkpointed. Listing 5.21 demonstrates the use of these checkpointing methods.
sc.setCheckpointDir('file:///opt/spark/data/checkpoint')
doc = sc.textFile("file:///opt/spark/data/shakespeare.txt")
words = doc.flatMap(lambda x: x.split()) \
    .map(lambda x: (x,1)) \
    .reduceByKey(lambda x, y: x + y)
words.checkpoint()
words.count()
# returns: 33505
words.isCheckpointed()
# returns: True
words.getCheckpointFile()
# returns:
# 'file:/opt/spark/data/checkpoint/df6370eb-7b5f-4611-99a8-bacb576c2ea1/rdd-15'
After a certain number of iterations, you should see an exception like this:
PicklingError: Could not pickle object as excessively deep recursion
required.
4. Open the looping_test.py file again with a text editor and uncomment
the following line:
#rddofints.checkpoint()
So the file should now read:
...
    print("Looped " + str(i) + " times")
    rddofints.checkpoint()
    rddofints.count()
...
pipe()
Syntax:
RDD.pipe(command, env=None, checkCode=False)
The pipe() method returns a new RDD created by piping the elements of the input RDD through the external process specified by the command argument, such as a shell command or script; each element is written to the process's stdin, and each line the process writes to stdout becomes a record in the resulting RDD. The env argument supplies environment variables for the external process, and checkCode specifies whether to check the return value of the command. Consider the Perl script named parsefixedwidth.pl shown in Listing 5.22, which parses fixed-width records into tab-delimited fields.
Listing 5.22 The parsefixedwidth.pl Script
#!/usr/bin/env perl
my $format = 'A6 A8 A8 A20 A2 A5';
while (<>) {
    chomp;
    my( $custid, $orderid, $date,
        $city, $state, $zip ) = unpack( $format, $_ );
    print "$custid\t$orderid\t$date\t$city\t$state\t$zip";
}
Listing 5.23 demonstrates the use of the pipe() command to run the
parsefixedwidth.pl script from Listing 5.22.
Listing 5.23 The pipe() Function
sc.addFile("/home/ubuntu/parsefixedwidth.pl")
fixed_width = sc.parallelize(['3840961028752220160317Hayward             CA94541'])
piped = fixed_width.pipe("parsefixedwidth.pl") \
    .map(lambda x: x.split('\t'))
piped.collect()
# returns [['384096', '10287522', '20160317', 'Hayward', 'CA', '94541']]
sample()
Syntax:
RDD.sample(withReplacement, fraction, seed=None)
The sample() transformation creates a sampled subset of the input RDD. The withReplacement argument specifies whether elements can be selected more than once, the fraction argument specifies the expected fraction of the dataset to return (the sample size is not guaranteed to be exact), and the optional seed argument seeds the random number generator. Listing 5.24 demonstrates the sample() function.
doc = sc.textFile("file:///opt/spark/data/shakespeare.txt")
doc.count()
# returns: 129107
sampled_doc = doc.sample(False, 0.1, seed=None)
sampled_doc.count()
# returns: 12879 (approximately 10% of the original RDD)
takeSample()
Syntax:
RDD.takeSample(withReplacement, num, seed=None)
The takeSample() action returns a list (not an RDD) of num elements sampled at random from the input RDD; as with sample(), the withReplacement and seed arguments control replacement and random seeding. Listing 5.25 demonstrates the takeSample() function.
dataset = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
dataset.takeSample(False, 3)
# returns [6, 7, 5] (your results may vary!)
Listing 5.26 provides some examples of settings for some common environment
variables; these could be set in your spark-env.sh file or as environment
variables in your shell prior to running an interactive Spark process such as
pyspark or spark-shell.
export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
export SPARK_LOG_DIR=${SPARK_LOG_DIR:-/var/log/spark}
export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
export STANDALONE_SPARK_MASTER_HOST=sparkmaster.local
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_DIR=${SPARK_WORKER_DIR:-/var/run/spark/work}
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_DAEMON_JAVA_OPTS="-XX:OnOutOfMemoryError='kill -9 %p'"
The following sections take a look at some of the most common Spark environment variables and their use.
Spark configuration properties such as the following are typically set in the spark-defaults.conf file:
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.executor.memory 2176M
spark.executor.cores 4
There are also several SparkConf methods for setting specific common
properties. These methods appear in Listing 5.29.
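A sketch of a few of these setter methods, chained on a SparkConf object, is shown here; the application name and property values are illustrative:
from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setMaster("yarn") \
    .setAppName("My Spark App") \
    .set("spark.executor.memory", "2g") \
    .setIfMissing("spark.executor.cores", "4") \
    .setExecutorEnv("SPARK_LOCAL_DIRS", "/tmp/spark")
sc = SparkContext(conf=conf)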
As you can see, there are several ways to pass the same configuration parameter,
including as an environment variable, as a Spark default configuration property,
or as a command line argument. Table 5.10 shows just a few of the various ways
to set the same property in Spark. Many other properties have analogous
settings.
Command line argument    spark-defaults.conf property    Environment variable
--executor-memory        spark.executor.memory           SPARK_EXECUTOR_MEMORY
--executor-cores         spark.executor.cores            SPARK_EXECUTOR_CORES
Configuration Management
Managing configuration is one of the biggest challenges involved in
administering a Spark cluster—or any other cluster, for that matter. Often,
configuration settings need to be consistent across different hosts, such as
different Worker nodes in a Spark cluster. Configuration management and
deployment tools such as Puppet and Chef can be useful for managing Spark
deployments and their configurations. If you are rolling out and managing Spark
as part of a Hadoop deployment using a commercial Hadoop distribution, you
can manage Spark configuration by using the Hadoop vendor’s management
interface, such as Cloudera Manager for Cloudera installations or Ambari for
Hortonworks installations.
In addition, there are other options for configuration management, such as
Apache Amaterasu (http://amaterasu.incubator.apache.org/), which uses
pipelines to build, run, and manage environments as code.
Optimizing Spark
The Spark runtime framework generally does its best to optimize stages and
tasks in a Spark application. However, as a developer, you can make many
optimizations for notable performance improvements. We discuss some of them
in the following sections.
One common optimization is to prefer reduceByKey() over groupByKey() followed by mapValues() when performing an associative aggregation such as a sum or a count:
# suboptimal method
rdd.map(lambda x: (x[0],1)) \
    .groupByKey() \
    .mapValues(lambda x: sum(x)) \
    .collect()

# preferred method
rdd.map(lambda x: (x[0],1)) \
    .reduceByKey(lambda x, y: x + y) \
    .collect()
Contrast what you have just seen with Figure 5.10, which shows the functionally
equivalent reduceByKey() implementation.
As you can see from the preceding figures, reduceByKey() combines records
locally by key before shuffling the data; this is often referred to as a combiner in
MapReduce terminology. Combining can result in a dramatic decrease in the
amount of data shuffled and thus a corresponding increase in application
performance.
Some other alternatives to groupByKey() are combineByKey(), which
you can use if the inputs and outputs to your reduce function are different, and
foldByKey(), which performs an associative operation providing a zero
value. Additional functions to consider include treeReduce(),
treeAggregate(), and aggregateByKey().
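As a brief sketch of two of these alternatives, using an illustrative pair RDD:
sales = sc.parallelize([('NSW', 210), ('VIC', 115), ('NSW', 35), ('VIC', 50)])

# foldByKey(): like reduceByKey(), but with an explicit zero value
sales.foldByKey(0, lambda x, y: x + y).collect()
# returns [('NSW', 245), ('VIC', 165)] (key order may vary)

# aggregateByKey(): the aggregated type can differ from the value type;
# here a (sum, count) pair per key is used to compute an average
sums_counts = sales.aggregateByKey((0, 0),
                                   lambda acc, v: (acc[0] + v, acc[1] + 1),
                                   lambda a, b: (a[0] + b[0], a[1] + b[1]))
sums_counts.mapValues(lambda t: t[0] / t[1]).collect()
# returns [('NSW', 122.5), ('VIC', 82.5)] (key order may vary)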
Another common optimization is to avoid enclosing large objects, such as a big list or lookup table, within the functions you pass to RDD operations; instead, parallelize the large object into an RDD and join it, as shown in the following example:
...
massive_list = [...]
def big_fn(x):
    # function enclosing massive_list
    ...
...
rdd.map(lambda x: big_fn(x)).saveAsTextFile...

# parallelize data which would have otherwise been enclosed
massive_list_rdd = sc.parallelize(massive_list)
rdd.join(massive_list_rdd).saveAsTextFile...
Optimizing Parallelism
A specific configuration parameter that could be beneficial to set at an
application level or using spark-defaults.conf is the
spark.default.parallelism setting. This setting specifies the default
number of RDD partitions returned by transformations such as
reduceByKey(), join(), and parallelize() where the
numPartitions argument is not supplied. You saw the effect of this
configuration parameter earlier in this chapter.
It is often recommended to make the value for this setting equal to or double the
number of cores on each Worker. As with many other settings, you may need to
experiment with different values to find the optimal setting for your
environment.
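As a sketch, you could set this property in any of the following ways; the value 8 and the script name myapp.py are purely illustrative:
# in spark-defaults.conf:
#   spark.default.parallelism  8

# as a command line argument to spark-submit:
#   spark-submit --conf spark.default.parallelism=8 myapp.py

# or programmatically, before creating the SparkContext:
from pyspark import SparkConf, SparkContext
conf = SparkConf().set("spark.default.parallelism", "8")
sc = SparkContext(conf=conf)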
Dynamic Allocation
Spark’s default runtime behavior is that the Executors requested or provisioned
for an application are retained for the life of the application. If an application is
long lived, such as a pyspark session or Spark Streaming application, this may
not be optimal, particularly if the Executors are idle for long periods of time and
other applications are unable to get the resources they require.
With dynamic allocation, Executors can be released back to the cluster resource
pool if they are idle for a specified period of time. Dynamic allocation is
typically implemented as a system setting to help maximize use of system
resources.
Listing 5.34 shows the configuration parameters used to enable dynamic
allocation.
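A sketch of the relevant spark-defaults.conf settings follows; the Executor counts and idle timeout shown are illustrative, and dynamic allocation also requires the external shuffle service to be enabled:
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s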
Moreover, large partitions can also result from a shuffle operation using a
custom partitioner, such as a month partitioner for a corpus of log data where
one month is disproportionately larger than the others. In this case, the solution
is to use repartition() or coalesce() after the reduce operation, using
a hash partitioner.
Another good practice is to repartition before a large shuffle operation as this can
provide a significant performance benefit.
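A brief sketch of both practices; the input path, key logic, and partition counts are illustrative:
raw = sc.textFile('hdfs:///data/events')
pairs = raw.map(lambda line: (line.split(',')[0], 1))

# repartition before the expensive shuffle so its input is evenly distributed
reduced = pairs.repartition(64) \
    .reduceByKey(lambda x, y: x + y)

# consolidate small or uneven partitions after the reduce operation
result = reduced.coalesce(8)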
Collection Performance
If your program has a collection stage, you can get summary and detailed
performance information from the Spark application UI. From the Details page,
you can see metrics related to the collection process, including the data size
collected, as well as the duration of collection tasks; this is shown in Figure 5.14.
Figure 5.14 Spark application UI stage detail: collection information.
Summary
This chapter completes our coverage of the Spark core (or RDD) API using
Python. This chapter introduces the different shared variables available in the
Spark API, including broadcast variables and accumulators, along with their
purpose and usage. Broadcast variables are useful for distributing reference
information, such as lookup tables, to Workers to avoid expensive “reduce side”
joins. Accumulators are useful as general-purpose counters in Spark applications
and also can be used to optimize processing. This chapter also discusses RDD
partitioning in much more detail, as well as the methods available for
repartitioning RDDs, including repartition() and coalesce(), as well
as functions designed to work on partitions atomically, such as
mapPartitions(). This chapter also looks at the behavior of partitioning
and its influence on performance as well as RDD storage options. You have
learned about the effects of checkpointing RDDs, which is especially useful for
periodic saving of state for iterative algorithms, where elongated lineage can
make recovery very expensive. In addition, you have learned about the pipe() function, which you can use to process data in Spark RDDs with external programs and scripts. Finally, you got
a look at how to sample data in Spark and explored some considerations for
optimizing Spark programs.