Lecture 3: MapReduce & Spark
MapReduce
Typical outputs:
● inverted indices
● graph structure of web documents
● summaries of the number of pages crawled per host
● the set of most frequent queries in a day
What is MapReduce?
It is a programming model
A classroom analogy:
● I am the class president.
● An English teacher asks me to handle a big workload.
● Um... OK...
● Let's divide the workload among classmates. → map
● And let a few classmates combine the intermediate results. → reduce
● "I will collect from A~G." "I will take H~Q." "And I will take R~Z."
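To connect the analogy to the programming model, here is a minimal word-count sketch in plain Python. The single-machine "framework" driver and the helper names are purely illustrative, not any MapReduce library's API: the user supplies a map function that emits (word, 1) pairs and a reduce function that sums the counts for each word.

```python
from collections import defaultdict

# User-defined map function: emit a (word, 1) pair for every word in a document.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# User-defined reduce function: sum all counts emitted for one word.
def reduce_fn(word, counts):
    return word, sum(counts)

# Toy single-machine "framework": map every record, group by key, then reduce.
documents = ["the quick brown fox", "the lazy dog", "the fox"]

grouped = defaultdict(list)
for doc in documents:                      # map phase
    for word, count in map_fn(doc):
        grouped[word].append(count)        # shuffle: group values by key

results = [reduce_fn(w, counts) for w, counts in grouped.items()]   # reduce phase
print(sorted(results))
```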
Why did MapReduce become such a big deal? Is it because Google uses it?
Distributed Computation Before MapReduce
● Things to consider: how to parallelize the computation, distribute the data, balance the load across machines, and handle machine failures.
MapReduce has made distributed computation an easy thing to do!
Multiple Workers
● Follow whatever the master asks them to do.
Execution Overview
1. The MapReduce library in the user program first splits
the input file into M pieces.
[Diagram: the master hands out tasks for the M splits of gfs://path/input_file to worker 1, worker 2, and worker 3 over time.]
4. Map Phase (each mapper node)
1) Read in the corresponding input partition.
2) Apply the user-defined map function to each key/value pair in the partition.
3) Partition the result produced by the map function into R regions using the partitioning function.
4) Write the result to the mapper's local disk (not GFS).
5) Notify the master of the locations of the partitioned intermediate results.
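The map-task steps above can be sketched roughly as follows. The function name, the R value, and the /tmp path are illustrative assumptions, not the real MapReduce library interface.

```python
import os
from collections import defaultdict

R = 4  # number of reduce tasks / output regions (illustrative value)

def run_map_task(task_id, records, map_fn, local_dir="/tmp/mr"):
    """Sketch of one map task: map, partition into R regions, write to local disk."""
    regions = defaultdict(list)
    for key, value in records:                          # 2) user-defined map function
        for out_key, out_value in map_fn(key, value):
            r = hash(out_key) % R                       # 3) partitioning function
            regions[r].append((out_key, out_value))     #    (a real system would use a
                                                        #    deterministic hash)
    os.makedirs(local_dir, exist_ok=True)
    locations = []
    for r in range(R):                                  # 4) write to local disk, not GFS
        path = os.path.join(local_dir, f"temp_{task_id}_{r}")
        with open(path, "w") as f:
            for out_key, out_value in regions[r]:
                f.write(f"{out_key}\t{out_value}\n")
        locations.append(path)
    return locations                                    # 5) reported back to the master
```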
Map Phase
[Diagram: the master assigns the kth map task to a mapper; the mapper asks the master where its partition is, reads partition_k, applies the map function, and writes the intermediate files temp_k1, temp_k2, ..., temp_kR to its local disk.]
6. Reduce Phase (each reducer node)
1) Read in all the corresponding intermediate result partitions from the mapper nodes.
2) Sort the intermediate results by the intermediate keys.
3) Apply the user-defined reduce function to each intermediate key and its corresponding set of intermediate values.
4) Create one output file.
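Likewise, a rough sketch of one reduce task, with made-up helper names (region_files stands for the list of intermediate files this reducer fetched from the mappers):

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(region_files, reduce_fn, output_path):
    """Sketch of one reduce task: read, sort by key, group, reduce, write one file."""
    pairs = []
    for path in region_files:                       # 1) read all partitions for this region
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                pairs.append((key, value))

    pairs.sort(key=itemgetter(0))                   # 2) sort by intermediate key

    with open(output_path, "w") as out:             # 4) one output file per reduce task
        for key, group in groupby(pairs, key=itemgetter(0)):
            values = [v for _, v in group]
            result = reduce_fn(key, values)         # 3) user-defined reduce function
            out.write(f"{key}\t{result}\n")
```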
Reduce Phase
[Diagram: 1. the master assigns a reduce task to a reducer; 2. the mappers are asked to send the kth intermediate result partitions to this reducer; 3. "here are your intermediate result partitions" (temp_1k, temp_2k, ..., temp_Mk); the reducer sorts them and applies the reduce function.]
Spark
● The Resilient Distributed Datasets (RDD) paper (Zaharia et al.) was published at NSDI 2012.
[Diagram: iterative jobs (Iter. 1, Iter. 2) and interactive queries (Query 1 → Result 1, Query 2 → Result 2, Query 3 → Result 3) each re-read the input from HDFS.]
Instead of reading the input from HDFS every time you run a query, bring the input into RAM first and then run multiple queries on it.
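In PySpark this idea looks roughly like the sketch below; the HDFS path, record format, and queries are made up for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="cache-demo")

# Read the input from HDFS once and keep the parsed records in RAM.
lines = sc.textFile("hdfs:///logs/events.txt")          # hypothetical input path
events = lines.map(lambda line: line.split(",")).cache()

# Multiple queries reuse the cached data instead of re-reading HDFS each time.
error_count = events.filter(lambda e: e[0] == "ERROR").count()
warn_count  = events.filter(lambda e: e[0] == "WARN").count()
top_hosts   = (events.map(lambda e: (e[1], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .take(10))
```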
Challenge
But RAM is volatile storage…
Although the probability of any single machine failing is low, with thousands of machines some machine failing is common.
The solution is the title of the Spark paper and the core idea behind Spark!
Resilient Distributed Datasets
What are the properties of an RDD?
● You can only create an RDD from input files in stable storage or from another RDD.
● That means that if we just record how an RDD was created from its parent RDDs (its lineage), it becomes fault-tolerant: a lost partition can be recomputed by replaying the lineage.
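A small PySpark illustration of lineage (the input path is hypothetical): every transformation records how the new RDD is derived from its parent, and toDebugString() prints that recorded chain; if a partition of `counts` is lost, Spark replays only this chain to rebuild it.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lineage-demo")

base   = sc.textFile("hdfs:///data/pages.txt")                      # from stable storage
words  = base.flatMap(lambda line: line.split())                    # from a parent RDD
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# The lineage, not the data itself, is what gets recorded.
print(counts.toDebugString().decode())
```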
RDD (cont.)
But how do you code in Spark using RDDs?
● Coding in Spark is creating a lineage of RDDs in a directed acyclic graph (DAG) form.
[DAG example: several data sources feed Map, Match, Union, Group, and Reduce operators, ending in a data sink.]
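A PySpark sketch that builds a DAG along the lines of the figure (the paths and the parsing are assumptions): two sources are each mapped, combined with a union, reduced by key, and written to a sink; nothing runs until the final action.

```python
from pyspark import SparkContext

sc = SparkContext(appName="dag-demo")

# Two data sources, each with its own map step.
src_a = sc.textFile("hdfs:///data/source_a").map(lambda l: (l.split(",")[0], 1))
src_b = sc.textFile("hdfs:///data/source_b").map(lambda l: (l.split(",")[0], 1))

# Union the two branches, then group/reduce by key.
totals = src_a.union(src_b).reduceByKey(lambda a, b: a + b)

# The data sink: this action is what actually triggers execution of the whole DAG.
totals.saveAsTextFile("hdfs:///data/sink")
```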
RDD Operators
Transformations & Actions
Lazy Execution
Transformations
● Fast: they are lazy and only extend the lineage.
Wide Dependency
● The task needs a shuffle.
● Slow.
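A short PySpark sketch of the difference (the data is synthetic): the filter and map calls return immediately because they only extend the lineage, while reduceByKey introduces a wide dependency, so the shuffle cost is paid only when the action at the end runs.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")

nums = sc.parallelize(range(1_000_000))

# Transformations: nothing is computed yet, only the lineage grows.
evens = nums.filter(lambda x: x % 2 == 0)        # narrow dependency, no shuffle
pairs = evens.map(lambda x: (x % 10, x))         # narrow dependency, no shuffle
sums  = pairs.reduceByKey(lambda a, b: a + b)    # wide dependency -> shuffle

# Action: only now does Spark schedule and run the job (including the shuffle).
print(sums.collect())
```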
● In later iterations, Spark is much faster (the black bars in the performance chart).
● HadoopBM writes intermediate data to memory rather than to HDFS.
What if the number of nodes increases?
Apache Spark Ecosystem
References
● Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." NSDI 2012.
● https://databricks.com/spark/about
● http://www.slideshare.net/yongho/rdd-paper-review
● https://www.youtube.com/watch?v=dmL0N3qfSc8
● http://www.tothenew.com/blog/spark-1o3-spark-internals/
● https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/