M4 06 MapReduce
• Challenges at Google
• Input data too large -> requires distributed computing
• Most computations are straightforward (log processing, inverted
index) -> boring, repetitive work
Reduce with '+' over (0 1 2 3) -> 6
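The fold above can be sketched in plain Java (the class name is illustrative, not from the lecture):

```java
import java.util.List;

public class ReduceExample {
    public static void main(String[] args) {
        // Fold the list (0 1 2 3) with '+', starting from the identity 0.
        int sum = List.of(0, 1, 2, 3).stream().reduce(0, Integer::sum);
        System.out.println(sum); // prints 6
    }
}
```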
Programming Model
• Programmers only need to specify two functions:
• Map Function
map (in_key, in_value) -> list(out_key, intermediate_value)
• Process input key/value pair
• Produce set of output key/intermediate value pairs
• Reduce Function
reduce (out_key, intermediate_value) -> list(out_value)
• Process intermediate key/value pairs
• Combines intermediate values per unique key
• Produce a set of merged output values (usually just one)
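A minimal in-memory sketch of this model in plain Java (not Hadoop code; all names are illustrative): map emits (word, 1) pairs, the framework groups intermediate values by key, and reduce sums the values per unique key:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map: process one input (key, value) pair,
    // emit a list of (out_key, intermediate_value) pairs.
    static List<Map.Entry<String, Integer>> map(String key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Reduce: combine the intermediate values for one unique key
    // into a merged output value (here, their sum).
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("the quick fox", "the lazy dog");

        // Shuffle: group intermediate values by key, as the framework would.
        Map<String, List<Integer>> groups = new HashMap<>();
        for (String line : inputs) {
            for (Map.Entry<String, Integer> kv : map("doc", line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }

        // Reduce each unique key; "the" appears twice, everything else once.
        groups.forEach((k, vs) -> System.out.println(k + "\t" + reduce(k, vs)));
    }
}
```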
[input (key, value)] -> Map Function -> [intermediate (key, value)]
Programming Model
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
Input and Output Formats
• A Map/Reduce job may specify how its input is to be read by specifying an
InputFormat to be used
• A Map/Reduce job may specify how its output is to be written by specifying an
OutputFormat to be used
• These default to TextInputFormat and TextOutputFormat, which process
line-based text data
• Another common choice is SequenceFileInputFormat and
SequenceFileOutputFormat for binary data
• These are file-based
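TextInputFormat's line-based behavior can be sketched in plain Java (an illustrative mock, not the Hadoop class itself): each record handed to the mapper is (byte offset of the line, line text):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecords {
    // Mimic TextInputFormat: key = byte offset of the line, value = line text.
    static Map<Long, String> toRecords(String data) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : data.split("\n", -1)) {
            records.put(offset, line);
            offset += line.getBytes().length + 1; // +1 for the newline
        }
        return records;
    }

    public static void main(String[] args) {
        toRecords("hello\nworld").forEach((k, v) -> System.out.println(k + " -> " + v));
        // 0 -> hello
        // 6 -> world
    }
}
```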
How many Maps and Reduces
• Maps
• Usually as many as the number of HDFS blocks being processed; this is the default
• Otherwise the number of maps can be specified as a hint
• The number of maps can also be controlled by specifying the minimum split size
• The actual sizes of the map inputs are computed by:
• max(min(block_size,data/#maps), min_split_size)
• Reduces
• Unless the amount of data being processed is small, a good rule of thumb is:
• 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
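The two formulas above can be checked with a small calculation (the cluster sizes and byte counts here are made-up examples, not defaults from the lecture):

```java
public class TaskCounts {
    // Map input size: max(min(block_size, data/#maps), min_split_size), in bytes.
    static long splitSize(long blockSize, long totalData, long numMaps, long minSplitSize) {
        return Math.max(Math.min(blockSize, totalData / numMaps), minSplitSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Example: 1 GB of input, 64 MB blocks, a hint of 10 maps, 1 MB minimum split.
        // data/#maps is ~102 MB, so the block size (64 MB) wins the min().
        long split = splitSize(64 * mb, 1024 * mb, 10, 1 * mb);
        System.out.println(split / mb + " MB per map");

        // Reduces: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum
        // e.g. 20 nodes with 2 reduce slots each.
        int reduces = (int) (0.95 * 20 * 2);
        System.out.println(reduces + " reduces");
    }
}
```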
System Implementation: Overview
• Cluster characteristics
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory each
• Storage is on local IDE disks
• Infrastructure
• GFS: distributed file system manages data (SOSP'03)
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to
machines
Control Flow and Data Flow
[Figure: control and data flow of a MapReduce job]
• The user program submits the job to the scheduler; the master assigns map and
reduce tasks to workers
• Map workers read their input splits (Split 0, Split 1, Split 2) from GFS, write
intermediate results to local disk, and notify the master of their locations
• Reduce workers remotely read the intermediate data, sort it, and write the final
output files (Output File 0, Output File 1) to GFS
MapReduce Architecture
[Figure: master coordinates Map workers]
• Map-side combine runs locally on each map worker: combine(k1, list(v1))