Parallel Programming, MapReduce Model: Unit II
UNIT II
A serial program consists of a sequence of instructions, where each instruction is executed one after the other. In a parallel program, the processing is broken up into parts, each of which can be executed concurrently: sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently. A common situation is having a large amount of consistent data which must be processed.
The MASTER:
takes the array and splits it up according to the number of available WORKERS
sends each WORKER its subarray
receives the results from each WORKER

The WORKER:
receives the subarray from the MASTER
performs processing on the subarray
returns results to MASTER
Approximating pi
Inscribe a circle of radius r in a square of side 2r. The area of the square, denoted As, is (2r)^2 = 4r^2. The area of the circle, denoted Ac, is pi * r^2. So:

pi = Ac / r^2
r^2 = As / 4
pi = 4 * Ac / As

Randomly generate points inside the square and count the number of generated points that are both in the circle and in the square. Then:

r = the number of points in the circle divided by the number of points in the square
PI = 4 * r
NUMPOINTS = 100000; // some large number - the bigger, the closer the approximation
p = number of WORKERS;
numPerWorker = NUMPOINTS / p;
countCircle = 0; // one of these for each WORKER

// each WORKER does the following:
for (i = 0; i < numPerWorker; i++) {
    generate 2 random numbers that lie inside the square;
    xcoord = first random number;
    ycoord = second random number;
    if (xcoord, ycoord) lies inside the circle
        countCircle++;
}

// MASTER: receives from WORKERS their countCircle values and computes PI:
PI = 4.0 * (sum of countCircle values) / NUMPOINTS;
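A runnable single-worker sketch of the pseudocode above, assuming a unit circle (r = 1) inscribed in the square [-1, 1] x [-1, 1]; the fixed seed is only there to make the run repeatable:

```python
import random

def estimate_pi(num_points, seed=0):
    rng = random.Random(seed)
    count_circle = 0
    for _ in range(num_points):
        # Generate a random point inside the square [-1, 1] x [-1, 1].
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        # Count it if it also lies inside the circle x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            count_circle += 1
    # PI = 4 * (points in circle) / (points in square)
    return 4.0 * count_circle / num_points

print(estimate_pi(100000))  # close to 3.14159; the bigger num_points, the closer
```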
MapReduce
A Brief History
reduce() function
Combines all elements of a sequence using a binary operator
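In Python, for example, the built-in functools.reduce folds a sequence with a binary operator in exactly this way:

```python
from functools import reduce

# reduce(f, [a, b, c, d]) computes f(f(f(a, b), c), d):
# the binary operator is applied left to right across the sequence.
total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4])  # ((1 + 2) + 3) + 4
print(total)  # 10
```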
What is MapReduce?
This model derives from the map and reduce combinators of functional languages like Lisp. It is a restricted parallel programming model meant for large clusters.
Map()
Process a key/value pair to generate intermediate key/value pairs
Reduce()
Merge all intermediate values associated with the same key
Map()
Input: <filename, file text>
Parses the file and emits <word, count> pairs, e.g. <hello, 1>
Reduce()
Sums all values for the same key and emits <word, TotalCount>, e.g. <hello, (3 5 2 7)> => <hello, 17>
MapReduce Framework

[Diagram: the input "How now brown cow" is split across Map tasks (M), whose intermediate output flows to Reduce tasks (R), which produce the final output.]
map(String key, String value)
    // key: document name
    // value: document contents
    for each word w in value
        EmitIntermediate(w, "1");

reduce(String key, Iterator values)
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values
        result += ParseInt(v);
    Emit(AsString(result));
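The word-count pseudocode can be rendered sequentially in Python. A real MapReduce runtime would execute map_fn and reduce_fn on many machines and perform the grouping step itself; the names here are illustrative:

```python
from collections import defaultdict

def map_fn(filename, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield (word, "1")

def reduce_fn(word, values):
    # key: a word, values: a list of counts (as strings)
    return (word, sum(int(v) for v in values))

def run(documents):
    # Group intermediate values by key (the framework's shuffle phase).
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for word, count in map_fn(name, text):
            intermediate[word].append(count)
    return dict(reduce_fn(w, vs) for w, vs in intermediate.items())

print(run({"doc1": "how now brown cow", "doc2": "how now"}))
# {'how': 2, 'now': 2, 'brown': 1, 'cow': 1}
```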
MapReduce Examples
Distributed grep
Map function emits <word, line_number> if the word matches the search criteria
Reduce function is the identity function
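A tiny sequential sketch of distributed grep as described above: the map function emits <word, line_number> when a word matches the search criteria (here, simple substring matching), and the reduce function is the identity. All names are illustrative:

```python
def grep_map(line_number, line, pattern):
    # Emit <word, line_number> for each word matching the pattern.
    for word in line.split():
        if pattern in word:
            yield (word, line_number)

def grep_reduce(key, values):
    return (key, values)  # identity: intermediate data is the answer

lines = ["error disk full", "all ok", "fatal error"]
matches = [kv for i, ln in enumerate(lines) for kv in grep_map(i, ln, "error")]
print(matches)  # [('error', 0), ('error', 2)]
```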
More formally:
Map(k1, v1) --> list(k2, v2)
Reduce(k2, list(v2)) --> list(v2)
MapReduce Benefits
Practical
User to-do list, indicate:
Input/output files
M: number of map tasks
R: number of reduce tasks
W: number of machines
The user program, via the MapReduce library, shards the input data.

Data Distribution
Intermediate files created from map tasks are written to local disk
Output files are written to the distributed file system
The user program creates process copies distributed across a machine cluster. One copy will be the Master and the others will be workers.
Assigning Tasks

[Diagram: the Master sends a Do_map_task message to an idle worker.]
Many copies of the user program are started
One instance becomes the Master
The Master finds idle machines and assigns them tasks
It tries to exploit data locality by running map tasks on machines that already hold the data
Each map-task worker reads assigned input shard and outputs intermediate key/value pairs.
Output buffered in RAM.
[Diagram: a map worker reads Shard 0 and produces key/value pairs in memory.]
Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process.
[Diagram: the map worker writes the partitioned data to local storage and sends the disk locations to the Master.]
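The flush step above partitions intermediate pairs into R regions, one per reduce task. A minimal sketch, assuming the hash(key) mod R partitioning scheme described in the MapReduce paper (crc32 is used here just as a stable hash every worker would agree on):

```python
from collections import defaultdict
from zlib import crc32

R = 4  # number of reduce tasks

def partition(key, r=R):
    # Stable hash so every map worker routes a key to the same region.
    return crc32(key.encode()) % r

# Bucket each intermediate pair into its region before flushing to disk.
regions = defaultdict(list)
for key, value in [("how", "1"), ("now", "1"), ("brown", "1")]:
    regions[partition(key)].append((key, value))
print(dict(regions))
```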
Master process gives disk locations to an available reduce-task worker who reads all associated intermediate data.
[Diagram: the Master passes the disk locations to a reduce worker, which reads the intermediate data from the map workers' remote storage.]
Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
Master process wakes up user process when all tasks have completed. Output contained in R output files.
Observations
No reduce can begin until map is complete
Tasks are scheduled based on the location of data
If a map worker fails any time before reduce finishes, its task must be completely rerun
The Master must communicate the locations of intermediate files
The MapReduce library does most of the hard work for us!
[Diagram: map tasks read from data stores 1..n and emit (key, values...) pairs; a barrier then aggregates the intermediate values by output key, and one reduce task processes each key's intermediate values.]
Fault Tolerance
Input file blocks are stored on multiple machines
When the computation is almost done, the Master reschedules in-progress tasks, which avoids stragglers
Conclusions
Simplifies large-scale computations that fit this model
Allows the user to focus on the problem without worrying about distributed-systems details
The underlying computer architecture is largely hidden from the user
Portable model
MapReduce Applications
A relational join can be executed in parallel using MapReduce, e.g. given a sales table and a city table, compute the gross sales by city
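The join can be sketched sequentially in the same map/group/reduce style. The table layouts and all names here (sales, cities, city_id) are invented purely for illustration:

```python
from collections import defaultdict

sales = [("c1", 100.0), ("c1", 50.0), ("c2", 70.0)]  # (city_id, amount)
cities = [("c1", "Pune"), ("c2", "Mumbai")]          # (city_id, name)

def map_phase():
    # Tag each record with its source table, keyed by city_id.
    for city_id, amount in sales:
        yield (city_id, ("sale", amount))
    for city_id, name in cities:
        yield (city_id, ("city", name))

def reduce_phase(city_id, values):
    # Join: pick the city name, sum the sales sharing its city_id.
    name = next(v for tag, v in values if tag == "city")
    total = sum(v for tag, v in values if tag == "sale")
    return (name, total)

grouped = defaultdict(list)
for k, v in map_phase():
    grouped[k].append(v)
print([reduce_phase(k, vs) for k, vs in grouped.items()])
# [('Pune', 150.0), ('Mumbai', 70.0)]
```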
Enterprise context: interest in leveraging the MapReduce model for high-throughput batch processing and analysis of data
References
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
Josh Carter, http://multipartmixed.com/software/mapreduce_presentation.pdf
Ralf Lämmel, "Google's MapReduce Programming Model Revisited"
http://code.google.com/edu/parallel/mapreduce-tutorial.html