MapReduce - Documentation
1. The input data from the user is passed to the Mapper as specified by
an InputFormat. The InputFormat is specified in the driver code; it
defines the location of the input data, such as a file or directory on
HDFS, and determines how to split the input data into input splits (see
the driver sketch after this list).
2. Each Mapper deals with a single input split. A RecordReader, an
object provided by the InputFormat, is used to extract (key, value)
records from the input source (the split data).
3. The Mapper processes the input (key, value) pairs and produces
output that is also in the form of (key, value) pairs. The output
from the Mapper is called the intermediate output.
4. The Mapper may use or completely ignore the input key. For
example, a standard pattern is to read a file one line at a time: the key
is the byte offset into the file at which the line starts, and the value is
the contents of the line itself. Typically the key is considered irrelevant.
If the Mapper writes anything out, the output must be in the form of
key/value pairs (see the Mapper sketch after this list).
5. The output from the Mapper (intermediate keys and their value
lists) is passed to the Reducer in sorted key order.
6. The Reducer outputs zero or more final key/value pairs, which are
written to HDFS. The Reducer usually emits a single key/value pair for
each input key (see the Reducer sketch after this list).
7. If a Mapper appears to be running more slowly than, or lagging
behind, the others, a new instance of the Mapper is started on another
machine, operating on the same data. The results of the first instance
to finish are used, and Hadoop kills the instance that is still running.
This is known as speculative execution (a configuration toggle for it
appears in the driver sketch below).
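
To make steps 1, 2, and 7 concrete, the driver below is a minimal sketch using the standard Hadoop Java API (org.apache.hadoop.mapreduce). The job name, class names, and HDFS paths (WordCountDriver, WordMapper, WordReducer, /user/data/input) are illustrative assumptions, not part of this documentation; the InputFormat, output types, and the mapreduce.map.speculative property are standard Hadoop.

    // Driver: configures the job, the InputFormat, and the input location (step 1).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Step 7: speculative execution (on by default) can be toggled here.
            conf.setBoolean("mapreduce.map.speculative", true);

            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(WordReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // The InputFormat defines where the input lives on HDFS and how it
            // is split; TextInputFormat supplies a RecordReader that extracts
            // (byte offset, line) records from each split (steps 1 and 2).
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/user/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/data/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }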
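
The Mapper from steps 2-4 might look like this sketch: it receives a (byte offset, line) record from the RecordReader, ignores the offset key, and emits an intermediate (word, 1) pair per word. The word-count logic is an assumption chosen for illustration; the Mapper class and its map() signature are the standard API.

    // Mapper: input key = byte offset (ignored), input value = one line of text.
    // Emits intermediate (word, 1) pairs (steps 3 and 4).
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // The byte-offset key is irrelevant here; only the line is used.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // intermediate (key, value) pair
                }
            }
        }
    }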
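
A matching Reducer for steps 5 and 6: it receives each intermediate key together with its list of values, in sorted key order, and emits one final (word, total) pair per input key, which the framework writes to HDFS. The summing logic is again an assumption, paired with the Mapper above.

    // Reducer: receives (word, [1, 1, ...]) in sorted key order and emits one
    // final (word, total) pair per input key; the output is written to HDFS.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable count : counts) {
                total += count.get();
            }
            context.write(word, new IntWritable(total));
        }
    }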
The number of map tasks in a MapReduce program depends on the
number of data blocks of the input file. For example, if the block size is
128 MB per block of split data and the input data is 1 GB in size, the
number of map tasks will be 8. The number of map tasks grows with
the size of the input data, and with it the degree of parallelism, which
results in faster processing of the data.
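
As a quick check of that arithmetic, the snippet below is a plain-Java sketch (not a Hadoop API call) of the split-count calculation, using the assumed 1 GB input and 128 MB block size from the example; ceiling division accounts for a partial final block.

    // Estimate the number of map tasks from input size and block size:
    // 1 GB input with 128 MB blocks -> ceil(1024 / 128) = 8 map tasks.
    public class SplitCountEstimate {
        public static void main(String[] args) {
            long inputSizeMb = 1024; // assumed input size: 1 GB
            long blockSizeMb = 128;  // assumed HDFS block size: 128 MB
            long mapTasks = (inputSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
            System.out.println("Map tasks: " + mapTasks); // prints: Map tasks: 8
        }
    }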