MapReduce
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
The Map
A map transform is provided to transform an input data row of key and value into an output list of key/value pairs:
map(key1, value1) -> list<key2, value2>
That is, for a single input pair it returns a list containing zero or more (key, value) pairs.
The Reduce
A reduce transform is provided to take all values for a specific key and generate a new list of the reduced output:
reduce(key2, list<value2>) -> list<value3>
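To make the contract concrete, here is a minimal sketch of both transforms as plain Java methods, using word counting as a hypothetical example (the class and method names are illustrative, not part of any Hadoop API):

import java.util.*;

public class MapReduceContract {

    // map(key1, value1) -> list<(key2, value2)>
    // Here key1 is a byte offset, value1 is a line of text, and each output pair is (word, 1).
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce(key2, list<value2>) -> list<value3>
    // All the 1s emitted for a word are summed into a single count.
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Collections.singletonList(sum);
    }
}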
A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability without the need for RAID-controlled disks, it also offers multiple locations to run the mapping. If a machine with one copy of the data is busy or offline, another machine can be used.
A job scheduler (in Hadoop, the JobTracker) keeps track of which MR jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the success and failure of these individual Tasks, and works to complete the entire batch job.
The filesystem and job scheduler are both accessible to the people and programs that wish to read and write data, and to submit and monitor MR jobs.
Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs [HadoopMapReduce] jobs on servers near the data stored on that filesystem, or on any other supported filesystem, of which there is more than one.
Limitations
For maximum parallelism, you need the Maps and Reduces to be stateless, to not depend on any data generated in the same MapReduce job.
You cannot control the order in which the maps run, or the reductions.
It is very inefficient if you are repeating similar searches again and again. A database with an index will always be faster than running an MR job
over unindexed data. However, if that index needs to be regenerated whenever data is added, and data is being added continually, MR jobs may
have an edge. That inefficiency can be measured in both CPU time and power consumed.
In the Hadoop implementation Reduce operations do not take place until all the Maps are complete (or have failed and been skipped). As a result,
you do not get any data back until the entire mapping has finished.
There is a general assumption that the output of the reduce is smaller than the input to the Map. That is, you are taking a large datasource and
generating smaller final values.
It is not a silver bullet for all problems of scale, just a good technique for working on large sets of data when you can work on small pieces of that dataset in parallel.
HadoopMapReduce
How Map and Reduce operations are actually carried out
Introduction
This document describes how MapReduce operations are carried out in Hadoop. If you are not familiar with the Google MapReduce programming model
you should get acquainted with it first.
Map
Because the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it would affect seek time, it is split into several FileSplits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. A new map task is then created per FileSplit.
When an individual map task starts, it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. The InputFormat parses the input and generates key-value pairs. The InputFormat must also handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading anything other than the first FileSplit, it ignores the content up to the first newline.
It is not necessary for the InputFormat to generate both meaningful keys and values. For example, the default output from TextInputFormat consists of input lines as values and, somewhat meaninglessly, the byte offsets at which those lines start as keys; most applications only use the lines and ignore the offsets.
As key-value pairs are read from the RecordReader they are passed to the configured Mapper. The user supplied Mapper does whatever it wants with the
input pair and calls OutputCollector.collect with key-value pairs of its own choosing. The output it generates must use one key class and one value class.
This is because the Map output will be written into a SequenceFile, which has per-file type information, so all the records must have the same type (use subclassing if you want to output different data structures). The Map input and output key-value pairs are not necessarily related, either in type or in cardinality.
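For instance, with the org.apache.hadoop.mapreduce API the map output types are declared separately from the job's final output types when they differ (a sketch, assuming a Job instance named job as in the WordCount example later in this document; the concrete classes are illustrative):

// Types of the intermediate (map output / reduce input) pairs.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Types of the final pairs written by the reduce.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);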
When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the Partitioner. The default HashPartitioner uses the hashCode method of the key class (which means that this hash function must distribute keys well in order to achieve an even workload across the reduce tasks). See MapTask for details.
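If the default routing is not appropriate, a custom Partitioner can be supplied. A minimal sketch with the org.apache.hadoop.mapreduce API (the class name and the key/value types are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reduce task. This sketch mimics the default hash-based
// partitioning, but it could use any property of the key or value.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).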
N input files will generate M map tasks to be run and each map task will generate as many output files as there are reduce tasks configured in the system.
Each output file will be targeted at a specific reduce task and the map output pairs from all the map tasks will be routed so that all pairs for a given key end
up in files targeted at a specific reduce task.
Combine
When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of
this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used, the map key-value pairs are not immediately written to the output. Instead they are collected in lists, one list per key. When a certain number of key-value pairs has been collected, this buffer is flushed by passing all the values for each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they had been created by the original map operation.
For example, a word count MapReduce application whose map operation outputs (word, 1) pairs as words are encountered in the input can use a
combiner to speed up processing. A combine operation will start gathering the output in in-memory lists (instead of on disk), one list per word. Once a
certain number of pairs is output, the combine operation will be called once per unique word with the list available as an iterator. The combiner then emits (
word, count-in-this-part-of-the-input) pairs. From the viewpoint of the Reduce operation this contains the same information as the original Map output, but
there should be far fewer pairs output to disk and read from disk.
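In the WordCount example later in this document, the reducer itself can serve as the combiner, because summing partial counts and then summing those partial sums gives the same total. Wiring it in is one call (a sketch, assuming a Job named job as in the WordCount source below):

job.setCombinerClass(Reduce.class);   // run the Reduce class over buffered map output before it is written out

Note that the framework may run the combiner zero, one, or many times over any subset of the map output, so the combine function must not change the result when applied repeatedly.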
Reduce
When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in distributed mode these need to be first
copied to the local filesystem in a copy phase (see ReduceTaskRunner).
Once all the data is available locally it is appended to one file in an append phase. The file is then merge sorted so that the key-value pairs for a given key
are contiguous (sort phase). This makes the actual reduce operation simple: the file is read sequentially and the values are passed to the reduce method
with an iterator reading the input file until the next key value is encountered. See ReduceTask for details.
At the end, the output will consist of one output file per executed reduce task. The format of the files can be specified with JobConf.setOutputFormat. If SequenceFileOutputFormat is used, the output key and value classes must also be specified.
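For example, with the older org.apache.hadoop.mapred API this might look like the following sketch (conf is an org.apache.hadoop.mapred.JobConf, and the Text/IntWritable classes are illustrative):

// Write the reduce output as a SequenceFile; the key and value classes must then be declared.
conf.setOutputFormat(SequenceFileOutputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);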
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks of your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but also improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed. The other extreme is 1,000,000 maps / 1,000,000 reduces, where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, which leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default
InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input
files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input
data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the
number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number
of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
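A rough sketch of these knobs with the JobConf API mentioned above (the MyJob class and the concrete numbers are only placeholders):

JobConf conf = new JobConf(MyJob.class);                   // MyJob is a hypothetical driver class
conf.setNumMapTasks(500);                                  // a hint to the InputFormat, not a guarantee
conf.setLong("mapred.min.split.size", 64 * 1024 * 1024);   // never create a split smaller than 64 MB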
Number of Reduces
The ideal number of reducers should be the optimal value that gets them closest to: a multiple of the block size, a task time between five and fifteen minutes, and the fewest output files possible.
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of: terrible performance in the next phase of the workflow, terrible shuffle performance, a NameNode overloaded with objects that are ultimately useless, and disk and network I/O wasted for no sane reason.
Now, there are always exceptions and special cases. One particular special case is that if following this advice makes the next step in the workflow do ridiculous things, then we likely need to 'be an exception' to the above general rules of thumb.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be
fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
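Continuing the JobConf sketch above:

conf.setNumReduceTasks(32);   // unlike the map hint, this is exact: 32 reduce tasks, 32 output files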
WordCount
WordCount Example
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word
and emits a single key/value with the word and sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining the counts for each word into a single record per map output.
bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the command line above) are read and the counts of the words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS (see ImportantConcepts). If your input is not already in HDFS, but is rather in a local filesystem somewhere, you need to copy the data into HDFS using a command like this:
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
The Java source of the example is as follows:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Tokenizes each input line and emits a (word, 1) pair per word.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts collected for each word and emits (word, total).
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);   // the reducer doubles as a combiner
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
Sort
Sort Example
The Sort example simply uses the map/reduce framework to sort the input directory into the output directory. The inputs and outputs must be SequenceFiles where the keys and values are BytesWritable.
The mapper is the predefined IdentityMapper and the reducer is the predefined IdentityReducer, both of which just pass their inputs directly to the output.
bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>] <in-dir> <out-dir>
% bin/hadoop jar hadoop-*-examples.jar randomwriter rand
% bin/hadoop jar hadoop-*-examples.jar sort rand rand-sort
The first command will generate the unsorted data in the rand directory. The second command will read that data, sort it, and write it into the rand-sort directory.
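For reference, a stripped-down driver for such a sort job might look like the following sketch, using the older org.apache.hadoop.mapred API (the in-dir/out-dir paths and the Sort class name are illustrative; the bundled example also parses the -m and -r options and other details):

JobConf conf = new JobConf(Sort.class);          // Sort stands for the driver class
conf.setJobName("sort");
// Inputs and outputs are SequenceFiles of BytesWritable keys and values.
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
conf.setOutputKeyClass(BytesWritable.class);
conf.setOutputValueClass(BytesWritable.class);
// Identity mapper and reducer: the framework's shuffle and sort do all the work.
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
FileInputFormat.setInputPaths(conf, new Path("in-dir"));
FileOutputFormat.setOutputPath(conf, new Path("out-dir"));
JobClient.runJob(conf);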