M4 06 MapReduce

MapReduce is a programming model designed for efficient distributed computing, simplifying the process of handling large datasets through a straightforward interface of Map and Reduce functions. It addresses challenges such as machine failures and scheduling, providing fault tolerance and data locality optimizations. The model is particularly effective for applications like log processing and web index building, allowing for parallel processing and improved load balancing.


MapReduce: Simplified Data Processing on Large Clusters


Motivation

• Challenges at Google
• Input data too large for a single machine -> distributed computing
• Most computations are straightforward (log processing, inverted index building) -> the hard part is not the logic
• Complexity of distributed computing
• Machine failures
• Scheduling
Solution: MapReduce

• MapReduce as the distributed programming infrastructure

• Simple Programming interface: Map + Reduce

• Distributed implementation that hides all the messy details


• Fault tolerance
• I/O scheduling
• Parallelization
MapReduce - What?
• MapReduce is a programming model for efficient distributed computing
• It works like a Unix pipeline
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
• Streaming through data, reducing seeks
• Pipelining
• A good fit for a lot of applications
• Log processing
• Web index building
Programming Model
• Inspired by the map and reduce functions in Lisp and other functional programming languages
• Lisp:

(map 'list #'length '(() (a) (a b) (a b c)))  =>  (0 1 2 3)

(reduce #'+ '(0 1 2 3))  =>  6
Programming Model
• The programmer only needs to specify two functions:

• Map function
map (in_key, in_value) -> list(out_key, intermediate_value)
• Processes an input key/value pair
• Produces a set of (output key, intermediate value) pairs

• Reduce function
reduce (out_key, list(intermediate_value)) -> list(out_value)
• Processes the intermediate key/value pairs
• Combines the intermediate values for each unique key
• Produces a set of merged output values (usually just one)
Programming Model

[input (key, value)] -> Map function -> [intermediate (key, value)]
-> Shuffle (merge sort by key) -> Reduce function -> [unique key, list of output values]
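To make this contract concrete, here is a minimal, framework-free sketch of word count in plain Java (the class and method names are illustrative; the in-memory shuffle stands in for what the framework does across machines):

import java.util.*;

public class WordCountSketch {

    // map: (in_key, in_value) -> list of (out_key, intermediate_value)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            if (!word.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // reduce: (out_key, list(intermediate_value)) -> output value
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "the small brown fox", "a fox speaks to another fox" };

        // Shuffle: group intermediate values by key, sorted (a merge sort by key).
        SortedMap<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.length; i++)
            for (Map.Entry<String, Integer> kv : map(i, lines[i]))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce each unique key and print "word<TAB>count".
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            System.out.println(g.getKey() + "\t" + reduce(g.getKey(), g.getValue()));
    }
}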
MapReduce - Features
• Fine-grained Map and Reduce tasks
• Improved load balancing
• Faster recovery from failed tasks
• Automatic re-execution on failure
• In a large cluster, some nodes are always slow or flaky
• Framework re-executes failed tasks
• Locality optimizations
• With large data, bandwidth to the data is the bottleneck
• MapReduce + HDFS is a very effective solution
• MapReduce queries HDFS for the locations of the input data
• Map tasks are scheduled close to their inputs when possible
MapReduce - Advantages
• The two biggest advantages of MapReduce are:

Parallel Processing

Data Locality
Word Count Example
• Mapper
• Input: value: a line of input text
• Output: key: word, value: 1 (one pair per word in the line)
• Reducer
• Input: key: word, value: the set of counts for that word
• Output: key: word, value: sum of the counts
• Launching program
• Defines the job
• Submits the job to the cluster
Example: WordCount

Input splits (one per mapper):
"the small brown fox" | "a fox speaks to another fox" | "brown cow cross the road"

Map output:
<the,1> <small,1> <brown,1> <fox,1>
<a,1> <fox,1> <speaks,1> <to,1> <another,1> <fox,1>
<brown,1> <cow,1> <cross,1> <the,1> <road,1>

Shuffle groups identical keys across mappers; Reduce sums each group:
a,1  another,1  brown,2  cow,1  cross,1  fox,3  road,1  small,1  speaks,1  the,2  to,1
Word Count Mapper
public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every token in the input line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums the counts for one word and emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Word Count Example

• Jobs are controlled by configuring JobConfs


• JobConfs are maps from attribute names to string values
• The framework defines attributes to control how the job is executed
• conf.set("mapred.job.name", "MyApp");
• Applications can add arbitrary values to the JobConf
• conf.set("my.string", "foo");
• conf.setInt("my.integer", 12);
• JobConf is available to all tasks
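For example, a task can read these values back through the configure callback of the old mapred API (a sketch; the attribute names are the illustrative ones above):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private String myString;
    private int myInteger;

    // Called once per task with the job's configuration.
    @Override
    public void configure(JobConf job) {
        myString = job.get("my.string", "default");
        myInteger = job.getInt("my.integer", 0);
    }

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // ... use myString / myInteger while processing records ...
    }
}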
Putting it all together

• Create a launching program for your application


• The launching program configures:
• The Mapper and Reducer to use
• The output key and value types (input types are inferred from the InputFormat)
• The locations for your input and output
• The launching program then submits the job and typically waits for it to
complete
Putting it all together
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
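Wrapped in a launching class, the whole driver might look like this (a sketch; the class name and argument handling are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // ... configuration calls as above ...
        JobClient.runJob(conf);  // submits the job and waits for completion
    }
}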
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used
• A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used
• These default to TextInputFormat and TextOutputFormat, which process line-based text data
• Another common choice is SequenceFileInputFormat and
SequenceFileOutputFormat for binary data
• These are file-based
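Continuing the JobConf from the driver above, switching a job to binary I/O is two calls (a sketch using the old mapred API):

conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);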
How many Maps and Reduces
• Maps
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
• max(min(block_size, data / #maps), min_split_size)
• Reduces
• Unless the amount of data being processed is small, a common choice is:
• 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
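These knobs map onto JobConf calls (a sketch; numNodes and tasksPerNode are illustrative variables you would fill in for your cluster):

conf.setNumMapTasks(100);  // a hint only; the actual count follows the input splits
conf.setNumReduceTasks((int) (0.95 * numNodes * tasksPerNode));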
System Implementation: Overview

• Cluster characteristics
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Storage on local IDE disks
• Infrastructure
• GFS: distributed file system that manages the data (SOSP'03)
• Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
Control Flow and Data Flow

[Figure: control and data flow] The user program forks a master and a set of workers; the scheduler allocates machines. The master assigns map and reduce tasks to idle workers. Each map worker reads its input split from GFS, writes intermediate key/value pairs to its local disk, and notifies the master of the locations of these local writes. Reduce workers remotely read the intermediate data from the map workers' disks, sort it, and write the final output files (Output File 0, Output File 1, ...) to GFS.
MapReduce Architecture: Coordination
• Master data structures


• Task status: (idle, in-progress, completed)
• Idle tasks get scheduled as workers become available
• When a map task completes, it sends the master the location and sizes of its
intermediate files, one for each reducer
• Master incrementally pushes this info to reducers
• Master pings workers periodically to detect failures
Fault Tolerance
• Map worker failure
• Both completed and in-progress map tasks are reset to idle
• Reduce worker failure
• Only in-progress reduce tasks are reset to idle
• Why? Completed map output sits on the failed machine's local disk and is lost with it, while completed reduce output is already safely stored in the global file system (GFS)
• Master failure
• The MapReduce job is aborted and the client is notified
• Reset tasks are rescheduled on another machine
Backup Tasks

• Slow workers ("stragglers") delay job completion time
• As the job nears completion, start backup copies of the remaining in-progress tasks
• Whichever copy of a task completes first is the one whose output is used
Combiner Function

• A local reducer, run at the map worker

• Can save network time by pre-aggregating at the mapper:
combine(k1, list(v1)) -> v2, applied to each map worker's output before the shuffle

• Works only if the reduce function is commutative and associative
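In Hadoop this is what conf.setCombinerClass(Reduce.class) in the driver above enables: summing is commutative and associative, so WordCount's reducer doubles as its combiner. A sketch of the distinction (the mean example is illustrative):

// Valid: addition is commutative and associative, so partial sums are safe.
conf.setCombinerClass(Reduce.class);

// Invalid: a reducer that computes a mean cannot be reused as a combiner,
// since mean(mean(a, b), c) != mean(a, b, c). Combine (sum, count) pairs instead.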
WordCount: No Combine

[Figure: the same WordCount data flow as above. Without a combiner, every pair crosses the network; the mapper that sees "fox" twice sends <fox,1> twice into the shuffle.]
WordCount: With Combine

[Figure: the same data flow, but the combiner pre-aggregates at each mapper: the second mapper's two <fox,1> pairs become a single <fox,2> before the shuffle, so less data crosses the network. The final output is identical.]
Conclusion

• Inexpensive commodity machines can be the basis of a large-scale, reliable system
• MapReduce hides all the messy details of distributed computing
• MapReduce provides a simple parallel programming interface
References
• MapReduce Architecture: http://cecs.wright.edu/~tkprasad/courses/cs707/L06MapReduce.ppt
• MapReduce Presentation (OSDI'04 slides): http://research.google.com/archive/mapreduce-osdi04-slides/
• MapReduce Presentation: http://web.eecs.umich.edu/~mozafari/fall2015/eecs584/presentations/lecture15-a.pdf
• Operating system support for warehouse-scale computing: https://www.cl.cam.ac.uk/~ms705/pub/thesis-submitted.pdf
• Apache Ecosystem Pic: http://blog.agroknow.com/?cat=1
• MapReduce (OSDI'04 paper): http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
• FlumeJava: http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
• MillWheel: http://www.vldb.org/pvldb/vol6/p1033-akidau.pdf
• Pregel: http://web.stanford.edu/class/cs347/reading/pregel.pdf
