Big Data Computing Spark Basics and RDD: Ke Yi
A Brief History
Why is Map/Reduce bad?
• Programming model too restricted
Many specialized systems on top of Hadoop
What is Spark?
Fast and Expressive Cluster Computing
Engine Compatible with Apache Hadoop
Up to 10× faster on disk, 100× in memory; 2-5× less code
Efficient:
• General execution graphs
• In-memory storage
Usable:
• Rich APIs in Java, Scala, Python
• Interactive shell
Spark’s (Short) History
Spark Popularity
Use Memory Instead of Disk
Tech Trend: Cost of Memory
In-Memory Data Sharing
Spark and MapReduce: Differences
Spark Programming
Resilient Distributed Datasets (RDDs)
lines = spark.textFile("hdfs://…")
errors = lines.filter(_.startsWith("Error"))
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
Example: If a partition of errors is lost, Spark rebuilds it by applying the filter on only the corresponding partition of lines.
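For readers more familiar with the Python API, here is a minimal PySpark sketch of the same log-mining pipeline; the HDFS path is a hypothetical placeholder and the field index just mirrors the Scala example above.

# PySpark sketch of the log-mining example; the path is a hypothetical placeholder.
lines = sc.textFile("hdfs://namenode/logs.txt")
errors = lines.filter(lambda line: line.startswith("Error"))
hdfsErrors = errors.filter(lambda line: "HDFS" in line)
fields = hdfsErrors.map(lambda line: line.split('\t')[3])   # 4th tab-separated field
fields.collect()   # action: computation happens only here; results are returned to the driver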
lines = sc.textFile("...", 4)            # read the file into an RDD with 4 partitions
comments = lines.filter(isComment)
print lines.count(), comments.count()    # each action re-reads the file from scratch
Caching RDDs
lines = sc.textFile("...", 4)
lines.cache()                            # mark lines to be kept in memory once computed
comments = lines.filter(isComment)
print lines.count(), comments.count()    # the second count reuses the cached data instead of re-reading the file
RDD Persistence
>>> textFile.take(5)
RDD actions and transformations can be used for more complex computations. Let’s say we
want to find the line with the most words:
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)
22
For multiple commands, you can write them in a .py file and execute it using execfile(). But you will need to add print statements to see the output.
Lazy transformations
Notice that the map transformation returns immediately, while the count() action is what really triggers the work.
By default, each transformed RDD may be recomputed each time you run an
action on it. However, you may also persist an RDD in memory using
the persist (or cache) method, in which case Spark will keep the elements around
on the cluster for much faster access the next time you query it. There is also
support for persisting RDDs on disk, or replicated across multiple nodes.
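For example, a minimal sketch of persisting an RDD at an explicit storage level; MEMORY_AND_DISK is just one illustrative choice of level.

from pyspark import StorageLevel

lines = sc.textFile("README.md")
lineLengths = lines.map(lambda s: len(s))

# Keep the computed partitions in memory, spilling to disk if they do not fit.
lineLengths.persist(StorageLevel.MEMORY_AND_DISK)

lineLengths.count()   # first action computes and persists the partitions
lineLengths.sum()     # later actions reuse the persisted partitions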
RDD Basics
lines = sc.textFile("README.md")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
The first line defines a base RDD from an external file. This dataset is not
loaded in memory or otherwise acted on: lines is merely a pointer to the
file. The second line defines lineLengths as the result of
a map transformation. Again, lineLengths is not immediately computed,
due to laziness. Finally, we run reduce, which is an action. At this point
Spark breaks the computation into tasks to run on separate machines, and
each machine runs both its part of the map and a local reduction,
returning only its answer to the driver program.
Self-Contained Applications
Example: http://www.cse.ust.hk/msbd5003/SimpleApp.py
This program just counts the number of lines containing ‘a’ and
the number containing ‘b’ in a text file. We can run this
application using the bin/spark-submit script:
$ ./bin/spark-submit SimpleApp.py
...
Lines with a: 62, Lines with b: 30
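The linked file is the authoritative version; a sketch of what such an application might look like is below (the file name and input path here are illustrative).

# SimpleApp.py -- a sketch; the actual lab file is at the URL above.
from pyspark import SparkContext

sc = SparkContext(appName="SimpleApp")
logData = sc.textFile("README.md").cache()   # hypothetical input text file

numAs = logData.filter(lambda line: 'a' in line).count()
numBs = logData.filter(lambda line: 'b' in line).count()

print("Lines with a: %i, Lines with b: %i" % (numAs, numBs))
sc.stop()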
• To install Spark in standalone mode, you simply place a compiled version of Spark on each node of the cluster.
• You can start a standalone master server by executing:
./sbin/start-master.sh
• Start one or more slaves by executing
./sbin/start-slave.sh <master-spark-url>
• Finally, remember to shut down the master and workers using the following
scripts when they are not needed:
sbin/stop-master.sh
sbin/stop-slave.sh
• Note: Only one master/worker can run on the same machine, but a machine can
be both a master and a worker.
• Now you can submit your application to the cluster. Example:
./bin/spark-submit --master spark://hostname:7077 pi.py 20
• You can monitor the cluster’s status on the master at port 8080.
• Job status can be monitored at driver at port 4040, 4041, …
Where code runs
Most Python code runs in the driver, except for the code passed to transformations. Transformations run at the executors; actions run at the executors and the driver.
Example: Let’s say you want to combine two RDDs: a, b.
Recall that rdd.collect() returns a list, and in Python you can combine two lists with +.
A naïve implementation would be:
>>> a = RDDa.collect()
>>> b = RDDb.collect()
>>> RDDc = sc.parallelize(a+b)
Where does this code run?
In the first two lines, all the distributed data of RDDa and RDDb is sent to the driver. What if RDDa and/or RDDb is very large? The driver could run out of memory, and it also takes a long time to send the data to the driver.
In the third line, all the data is sent from the driver back to the executors.
The correct way:
>>> RDDc = RDDa.union(RDDb)
This runs completely at executors.
The Jupyter Notebook
• Problem:
– Input: an array A of n numbers (unordered), and k
– Output: the k-th smallest number (counting from 0)
• Algorithm (see the Python sketch after the steps):
1. x = A[0]
2. partition A into A[0..mid-1] < A[mid] = x < A[mid+1..n-1]
3. if mid = k then return x
4. if k < mid then A = A[0..mid-1]
   if k > mid then A = A[mid+1..n-1], k = k - mid - 1
5. go to step 1
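A plain-Python sketch of these steps (the function name is illustrative; the equal group in the partition generalizes step 4 to duplicate values):

def kth_smallest(A, k):
    """Return the k-th smallest element of A (counting from 0)."""
    while True:
        x = A[0]                              # step 1: pick a pivot
        smaller = [a for a in A if a < x]     # step 2: three-way partition
        equal   = [a for a in A if a == x]
        larger  = [a for a in A if a > x]
        mid = len(smaller)
        if mid <= k < mid + len(equal):       # step 3
            return x
        if k < mid:                           # step 4: keep only the relevant side
            A = smaller
        else:
            A = larger
            k = k - mid - len(equal)
        # step 5: loop again

print(kth_smallest([7, 2, 9, 4, 4, 1, 8], 3))   # prints 4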
Why didn’t it work?
Examples
Lab: PMI Computation
• PMI (pointwise mutual information) is a measure of association used in information theory and statistics.
• Given a list of pairs (x, y), compute PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), where
  p(x) = probability of x,
  p(y) = probability of y,
  p(x, y) = joint probability of x and y.
• Example:
  x   y   p(x, y)
  0   0   0.1
  0   1   0.7
  1   0   0.15
  1   1   0.05
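A rough PySpark sketch of one way to compute PMI from an RDD of (x, y) pairs; the toy input and variable names are illustrative, not the official lab solution.

import math

# pairs: an RDD of (x, y) tuples; a toy example here -- the lab input would come from a file
pairs = sc.parallelize([(0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (0, 1)])
n = float(pairs.count())

# empirical probabilities p(x), p(y), p(x, y) from the counts
px = pairs.map(lambda p: (p[0], 1)).reduceByKey(lambda a, b: a + b) \
          .mapValues(lambda c: c / n).collectAsMap()
py = pairs.map(lambda p: (p[1], 1)).reduceByKey(lambda a, b: a + b) \
          .mapValues(lambda c: c / n).collectAsMap()
pxy = pairs.map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b) \
           .mapValues(lambda c: c / n)

# PMI(x, y) = log( p(x, y) / (p(x) p(y)) )
pmi = pxy.map(lambda kv: (kv[0], math.log(kv[1] / (px[kv[0][0]] * py[kv[0][1]]))))
print(pmi.collect())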
Lab: k-Means Clustering
The algorithm:
1. Choose k points from the input points randomly. These
points represent initial group centroids.
2. Assign each point to the closest centroid.
3. When all points have been assigned, recalculate the
positions of the k centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move.
This produces a separation of the objects into groups
from which the metric to be minimized can be
calculated.
See example at http://shabal.in/visuals/kmeans/6.html
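A compact PySpark sketch of these steps, modelled loosely on the standard Spark k-means example; the random toy input, the value of k, and the convergence threshold are all illustrative assumptions.

import numpy as np

def closest(p, centroids):
    # index of the centroid nearest to point p (squared Euclidean distance)
    return min(range(len(centroids)), key=lambda i: float(np.sum((p - centroids[i]) ** 2)))

# Toy input: an RDD of 2-D points; the real lab would load these from a file.
points = sc.parallelize([np.array(p) for p in np.random.rand(1000, 2)]).cache()
k = 3
centroids = points.takeSample(False, k)              # step 1: k random initial centroids

while True:
    # step 2: assign each point to its closest centroid
    assigned = points.map(lambda p: (closest(p, centroids), (p, 1)))
    # step 3: new centroid = mean of the points assigned to it
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    newCentroids = sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()
    shift = sum(float(np.sum((centroids[i] - c) ** 2)) for i, c in newCentroids.items())
    centroids = [newCentroids.get(i, centroids[i]) for i in range(k)]
    if shift < 1e-9:                                  # step 4: stop when centroids no longer move
        break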
Lab: PageRank
• Algorithm:
– Initialize all PR’s to 1
– Iteratively compute PR(u) ← 0.15 × 1 + 0.85 × Σ_{v→u} PR(v) / outdegree(v)
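A PySpark sketch of this iteration, patterned after the standard Spark PageRank example; the toy graph and the fixed iteration count are illustrative assumptions.

# links: (page, list of outgoing neighbours); a toy graph for illustration
links = sc.parallelize([('a', ['b', 'c']), ('b', ['c']), ('c', ['a'])]).cache()
ranks = links.mapValues(lambda neighbours: 1.0)      # initialize all PR's to 1

for _ in range(10):                                   # a fixed number of iterations, for simplicity
    # each page v sends PR(v) / outdegree(v) to every neighbour u
    contribs = links.join(ranks).flatMap(
        lambda pd: [(u, pd[1][1] / len(pd[1][0])) for u in pd[1][0]])
    # PR(u) = 0.15 * 1 + 0.85 * (sum of contributions received by u)
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())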