Spark RDD
2 / 62
2.1 Introduction
3 / 62
Philosophy
The Spark computing framework deals with many complex issues for us: fault
tolerance, slow machines, big datasets, etc.
4 / 62
API
An API allows a user to interact with the software
Scala (JVM)
Java (JVM)
Python
R
This course uses the Python API, which is easier to learn than Scala and Java
5 / 62
Architecture
When you interact with Spark through its API, you send instructions to
the Driver
Example
6 / 62
SparkContext object
You interact with the driver through the SparkContext object.
In a Jupyter notebook, create a SparkContext object using:
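A minimal sketch (the master URL "local[*]" and the application name are illustrative choices, not fixed by the course):
from pyspark import SparkContext

# Run Spark locally, using all available cores
sc = SparkContext(master="local[*]", appName="notebook05")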
7 / 62
RDDs and execution model
Spark programs are written in terms of operations on RDDs
8 / 62
Creating a RDD
From an iterable object iterator (e.g. a Python list):
lines = sc.parallelize(iterator)
From a text file:
lines = sc.textFile("/path/to/file.txt")
Remarks
9 / 62
Operations on RDD
Two families of operations can be performed on RDDs
Transformations
Operations on RDDs that return a new RDD
Lazy evaluation: nothing is computed immediately
Actions
Operations on RDDs that return some other data type
Trigger computations
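A small illustration of the difference (the RDD nums is made up for this sketch):
nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x ** 2)   # transformation: nothing is computed yet
print(squares.count())                 # action: triggers the computation -> 4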
10 / 62
2.2 Transformations
11 / 62
Transformations
The most important transformation is map
transformation description
map(f) Return a new RDD obtained by applying f to each element of this RDD
Here is an example:
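A minimal sketch (the RDD nums is made up for the illustration):
nums = sc.parallelize([1, 2, 3, 4])
print(nums.map(lambda x: x + 1).collect())   # [2, 3, 4, 5]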
12 / 62
Transformations
About passing functions to map:
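For instance, map accepts either a lambda or a regular named function (the RDD below is made up for the illustration):
rdd = sc.parallelize([1, 2, 3])

# An anonymous function...
print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6]

# ...or a regular named function
def double(x):
    return x * 2

print(rdd.map(double).collect())            # [2, 4, 6]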
[Let's go to notebook05_sparkrdd.ipynb]
13 / 62
Transformations
Then we have flatMap
transformation description
flatMap(f) Apply f to each element of the RDD, then flatten the results
Example
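A sketch contrasting flatMap and map (the RDD lines is made up for the illustration):
lines = sc.parallelize(["hello world", "hi"])
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ['hello', 'world', 'hi']
print(lines.map(lambda line: line.split(" ")).collect())
# [['hello', 'world'], ['hi']]   (map does not flatten)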
14 / 62
Transformations
filter keeps only the elements of an RDD satisfying a predicate
transformation description
filter(f) Return a new RDD containing only the elements that satisfy the predicate f
Example
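A minimal sketch (the RDD nums is made up for the illustration):
nums = sc.parallelize(range(10))
print(nums.filter(lambda x: x % 2 == 0).collect())   # [0, 2, 4, 6, 8]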
15 / 62
Transformations
About distinct and sample
transformation description
distinct() Return a new RDD containing the distinct elements of this RDD
sample(withReplacement, fraction, [seed]) Return a sampled subset of this RDD
Example
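A minimal sketch (the RDD and the sampling parameters are made up for the illustration):
rdd = sc.parallelize([1, 1, 2, 3, 3, 3])
print(rdd.distinct().collect())   # [1, 2, 3] (order not guaranteed)
print(rdd.sample(withReplacement=False, fraction=0.5, seed=42).collect())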
16 / 62
Transformations
We also have "pseudo set" operations
transformation description
union(other) Return the union of this RDD and another one
intersection(other) Return the intersection of this RDD and another one
subtract(other) Return the elements of this RDD that are not in the other one
If there are duplicates in the input RDDs, the result of union() will contain
duplicates (this can be fixed with distinct())
intersection() removes all duplicates (including duplicates from a
single RDD)
The performance of intersection() is much worse than that of union(), since it
requires a shuffle to identify the common elements
subtract() also requires a shuffle
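For instance (rdd1 and rdd2 are made up for the illustration; the order of the collected results may vary):
rdd1 = sc.parallelize([1, 2, 2, 3])
rdd2 = sc.parallelize([2, 3, 4])
print(rdd1.union(rdd2).collect())          # [1, 2, 2, 3, 2, 3, 4] (duplicates kept)
print(rdd1.intersection(rdd2).collect())   # [2, 3] (duplicates removed)
print(rdd1.subtract(rdd2).collect())       # [1]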
17 / 62
Transformations
We also have "pseudo set" operations
transformation description
>>> rdd3.distinct().collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8]
18 / 62
About shuffles
Certain operations trigger a shuffle
It is Spark’s mechanism for re-distributing data so that it’s grouped
differently across partitions
It involves copying data across executors and machines, making the
shuffle a complex and costly operation
We will discuss shuffles in detail later in the course
Performance Impact
A shuffle involves disk I/O, data serialization and network I/O.
To organize the data for the shuffle, Spark generates sets of tasks: map tasks to
organize the data, and reduce tasks to aggregate it. This nomenclature comes from
MapReduce and does not directly relate to Spark’s map and reduce operations.
19 / 62
Transformations
Another "pseudo set" operation
transformation description
cartesian(otherRdd) Return the Cartesian product of this RDD and another one
Example
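A minimal sketch (the two RDDs are made up for the illustration):
letters = sc.parallelize(["a", "b"])
nums = sc.parallelize([1, 2])
print(letters.cartesian(nums).collect())
# [('a', 1), ('a', 2), ('b', 1), ('b', 2)]   (order may vary)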
[Let's go to notebook05_sparkrdd.ipynb]
20 / 62
2.3 Actions
21 / 62
Actions
collect brings the RDD back to the driver
action description
collect() Return all the elements of the RDD as a list on the driver
Example
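A minimal sketch (the RDD is made up for the illustration):
rdd = sc.parallelize(range(5))
print(rdd.collect())   # [0, 1, 2, 3, 4]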
Remark: the whole RDD must fit in the driver's memory, so use collect() only on small RDDs
22 / 62
Actions
It's important to count!
action description
count() Return the number of elements in the RDD
countByValue() Return the count of each unique value in the RDD as a dictionary of (value, count) pairs
Example
>>> rdd.countByValue()
defaultdict(int, {1: 2, 3: 1, 2: 3})
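An RDD producing the output above could be built as follows (the exact contents are an assumption made for the illustration):
rdd = sc.parallelize([1, 1, 3, 2, 2, 2])
print(rdd.count())          # 6
print(rdd.countByValue())   # {1: 2, 3: 1, 2: 3} (a defaultdict)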
23 / 62
Actions
How to get some values in an RDD?
action description
Remarks
24 / 62
Actions
How to get some values in an RDD?
action description
Example
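A sketch with some of the usual actions for this purpose (take, first and top all exist in the PySpark API; which ones the table lists is an assumption):
rdd = sc.parallelize([5, 3, 1, 4, 2])
print(rdd.take(2))   # [5, 3]  (first elements encountered, depends on partitioning)
print(rdd.first())   # 5
print(rdd.top(2))    # [5, 4]  (largest elements, in descending order)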
25 / 62
Actions
The reduce action
action description
reduce(f) Aggregate the elements of the RDD using a binary function f, which must be commutative and associative
fold(zeroValue, op) Same as reduce() but with the provided zero value
26 / 62
Actions
The reduce action
action description
fold(zeroValue, op) Same as reduce() but with the provided zero value
Example
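A minimal sketch (the RDD nums is made up for the illustration):
nums = sc.parallelize([1, 2, 3, 4])
print(nums.reduce(lambda x, y: x + y))    # 10
print(nums.fold(0, lambda x, y: x + y))   # 10 (0 is the zero value for addition)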
27 / 62
Actions
The reduce action
action description
fold(zeroValue, op) Same as reduce() but with the provided zero value
[Let's go to notebook05_sparkrdd.ipynb]
29 / 62
Actions
The aggregate action
action description
aggregate(zeroValue, seqOp, combOp) Aggregate the elements of each partition, and then the results for all the
partitions, using the given aggregation functions and zero value
31 / 62
Actions
The aggregate action
action description
Example
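A classical sketch: computing a mean in one pass by aggregating (sum, count) pairs (the RDD is made up for the illustration):
nums = sc.parallelize([1, 2, 3, 4])
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # merge an element into the accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two accumulators
total, count = nums.aggregate((0, 0), seq_op, comb_op)
print(total / count)   # 2.5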
[Let's go to notebook05_sparkrdd.ipynb]
32 / 62
Actions
The foreach action
action description
foreach(f) Apply the function f to each element of the RDD, for its side effects; nothing is returned to the driver
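For instance (a sketch; in local mode the output appears in the notebook console, on a cluster it goes to the executors' logs):
rdd = sc.parallelize(range(3))
rdd.foreach(print)   # returns None; the printing happens where the tasks run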
33 / 62
2.4 Persistence
34 / 62
Lazy evaluation and persistence
Spark RDDs are lazily evaluated
Each time an action is called on an RDD, this RDD and all its
dependencies are recomputed (unless persistence is used)
Remarks
Lazy evaluation helps Spark reduce the number of passes over the
data by grouping operations together
No substantial benefit to writing a single complex map instead of
chaining together many simple operations
Users are free to organize their program into smaller, more
manageable operations
35 / 62
Persistence
How to use persistence?
method description
persist(storageLevel) Mark the RDD for persistence with the given storage level
cache() Persist the RDD with the default storage level (memory only)
unpersist() Remove the RDD's blocks from memory and disk
These methods must be called before the action, and do not trigger
the computation
Usage of StorageLevel:
pyspark.StorageLevel(
useDisk, useMemory, useOffHeap, deserialized, replication=1
)
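A minimal sketch of the typical usage (the RDD and the chosen storage level are illustrative):
from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x ** 2)
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # mark for caching; nothing is computed yet
rdd.count()   # first action: computes the RDD and caches it
rdd.count()   # second action: reuses the cached partitions
rdd.unpersist()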
36 / 62
Persistence
Options for persistence
argument description
useOffHeap Store data outside of the JVM heap if True. Useful if using some in-memory storage system (such as Tachyon)
37 / 62
Persistence
Options for persistence
argument description
useOffHeap Store data outside of the JVM heap if True. Useful if using some in-memory storage system (such as Tachyon)
replication Number of copies of the cached data. If you cache data that is quite slow to recompute, you can use
replication: if a machine fails, the data will not have to be recomputed
38 / 62
Persistence
Options for persistence
argument description
useOffHeap Store data outside of the JVM heap if True. Useful if using some in-memory storage system (such as Tachyon)
deserialized Store data as deserialized objects if True: faster to access, but uses more memory
Example:
rdd.persist(pyspark.StorageLevel.MEMORY_AND_DISK)
41 / 62
Persistence
What if you attempt to cache too much data to fit in memory?
Spark will automatically evict old partitions using a Least Recently Used
(LRU) cache policy: evicted partitions are recomputed the next time they are
needed (or written to disk if the storage level uses the disk)
42 / 62
Reminder: about passing functions (again)
Warning
If the function you pass is the method of an object, or references a field of an
object, then Spark sends the entire object to the worker nodes, which can be much
larger than the bit of information you actually need
This can also cause your program to fail if your class contains objects
that Python can't pickle
43 / 62
About passing functions
Passing a function with field references (don't do this!)
class SearchFunctions(object):
    def __init__(self, query):
        self.query = query

    def isMatch(self, s):
        return self.query in s

    def getMatchesFunctionReference(self, rdd):
        # Problem: references all of "self" in "self.isMatch"
        return rdd.filter(self.isMatch)

    def getMatchesMemberReference(self, rdd):
        # Problem: references all of "self" in "self.query"
        return rdd.filter(lambda x: self.query in x)

Instead, just extract the fields you need from your object into a local
variable and pass that in
44 / 62
About passing functions
Python function passing without field references
class WordFunctions(object):
    ...
    def getMatchesNoReference(self, rdd):
        # Safe: extract only the field we need into a local variable
        query = self.query
        return rdd.filter(lambda x: query in x)
45 / 62
2.5. Pair RDD: key-value pairs
46 / 62
Pair RDD: key-value pairs
It's roughly an RDD where each element is a tuple with two elements: a
key and a value
>>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]])
>>> rdd = rdd.map(lambda x: (x[0], x[1:]))
>>> rdd.collect()
[(1, ['a', 7]), (2, ['b', 13]), (2, ['c', 17])]
47 / 62
Warning
Each element must be a tuple with two elements (the key and the value); otherwise
keys() and values() silently return the first and second element of each item:
>>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]])
>>> rdd.keys().collect()
[1, 2, 2]
>>> rdd.values().collect()
['a', 'b', 'c']
>>> rdd = sc.parallelize([[1, "a", 7], [2, "b", 13], [2, "c", 17]])\
.map(lambda x: (x[0], x[1:]))
>>> rdd.keys().collect()
[1, 2, 2]
>>> rdd.values().collect()
[['a', 7], ['b', 13], ['c', 17]]
48 / 62
Transformations for a single PairRDD
transformation description
49 / 62
Transformations for a single PairRDD
transformation description
50 / 62
Transformations for a single PairRDD
transformation description
51 / 62
52 / 62
Transformations for a single PairRDD
transformation description
reduceByKey(f) Merge the values for each key using an associative reduce function f
foldByKey(zeroValue, f) Same as reduceByKey() but with the provided zero value
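For instance (the pair RDD is made up for the illustration; the order of the collected results may vary):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.reduceByKey(lambda x, y: x + y).collect())    # [('a', 4), ('b', 2)]
print(pairs.foldByKey(0, lambda x, y: x + y).collect())   # same result, with zero value 0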
53 / 62
54 / 62
Transformations for a single PairRDD
transformation description
combineByKey(createCombiner, mergeValue, mergeCombiners) Transform an RDD[(K, V)] into an
RDD[(K, C)] for a "combined" type C that can be different from V
55 / 62
Transformations for a single PairRDD
transformation description
In this example
56 / 62
Transformations for two PairRDD
transformation description
57 / 62
Transformations for two PairRDD
Join operations (join, leftOuterJoin, rightOuterJoin, cogroup, ...) also exist on pair RDDs
In practice, joins are mainly used through the high-level API (DataFrame objects and
the spark.sql API), which we will use a lot later in the course
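A minimal sketch on the RDD side (the two pair RDDs are made up for the illustration; the order of the results may vary):
left = sc.parallelize([(1, "a"), (2, "b")])
right = sc.parallelize([(2, "x"), (3, "y")])
print(left.join(right).collect())            # [(2, ('b', 'x'))]
print(left.leftOuterJoin(right).collect())   # [(1, ('a', None)), (2, ('b', 'x'))]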
[Let's go to notebook05_sparkrdd.ipynb]
58 / 62
Actions for a PairRDD
action description
lookup(key) Return all the values associated with the provided key.
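For instance (the pair RDD is made up for the illustration; countByKey() is another standard PairRDD action, whether it appears in the original table is an assumption):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.lookup("a"))     # [1, 3]
print(pairs.countByKey())    # {'a': 2, 'b': 1} (a defaultdict)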
[Let's go to notebook05_sparkrdd.ipynb]
59 / 62
Data partitioning
Some operations on PairRDDs, such as join, require scanning
the data more than once
In Spark, you can choose which keys will appear together on the same
node, but there is no explicit control over which worker node each key
goes to
60 / 62
Data partitioning
In practice, you can specify the number of partitions with
rdd.partitionBy(100)
You can also pass a custom partition function hash such that hash(key)
returns the value used to assign each key to a partition; a classical use case is
partitioning URLs by their domain with urlparse
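A sketch of such a domain-based partitioner (the function name and the number of partitions are made up; rdd is assumed to be a pair RDD whose keys are URLs; in Python 2 the import would be import urlparse):
from urllib.parse import urlparse

def domain_hash(url):
    # All the keys (URLs) with the same domain end up in the same partition
    return hash(urlparse(url).netloc)

rdd.partitionBy(20, domain_hash)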
To have finer control over partitioning, you must use the Scala API.
61 / 62