Unit-2 (MapReduce-II)
Unit-II
MapReduce
Anatomy of a MapReduce Job in Apache Hadoop
The Hadoop framework comprises two main components:
• Hadoop Distributed File System (HDFS) for Data Storage
• MapReduce for Data Processing.
A typical Hadoop MapReduce job is divided into a set of Map
and Reduce tasks that execute on a Hadoop cluster. The
execution flow occurs as follows:
• Input data is split into small subsets of data.
• Map tasks work on these data splits.
• The intermediate output data from the Map tasks is then passed to the Reduce task(s) after an intermediate process called ‘shuffle’.
• The Reduce task(s) work on this intermediate data to generate the result of the MapReduce job.
• Hadoop MapReduce jobs are divided into a set of map
tasks and reduce tasks
• The input to a MapReduce job is a set of files in the data store
that are spread out over the HDFS. In Hadoop, these files are
split with an input format, which defines how to separate a file into input splits. An input split can be thought of as a byte-oriented view of a chunk of the file to be loaded by a map task.
• Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map phase, called intermediate keys and values, is sent to the reducers.
• The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format. The driver sketch after this list shows where each of these pieces is configured in a job.
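For concreteness, a minimal job driver is sketched below, assuming a word-count job. The class names WordCountDriver, TokenCounterMapper and IntSumReducer are illustrative (the mapper and reducer themselves are sketched later in this unit); the Hadoop classes and Job methods are the standard ones. It shows where the input format, mapper, combiner, partitioner, reducer and output format are set.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input format: splits the input files and supplies the record reader
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Map side: mapper, combiner, partitioner
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setPartitionerClass(HashPartitioner.class);   // the default, set here only for clarity

    // Reduce side: reducer and output format
    job.setReducerClass(IntSumReducer.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Types of the final key/value pairs written by the output format
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}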
The map tasks are assigned by the Hadoop framework to those DataNodes where the actual data to be processed resides. This ensures that the data typically doesn’t have to move over the network, which saves network bandwidth; the data is computed on the local machine itself, so the map task is said to be data local.
Mapper
Record Reader:
The record reader translates an input split generated by the input format into records. The purpose of the record reader is to parse the data into records, but it doesn’t parse the records themselves. It passes the data to the mapper in the form of key/value pairs. Usually the key in this context is positional information and the value is the chunk of data that composes a record.
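As an illustration, assume the default TextInputFormat, whose record reader (LineRecordReader) produces one record per line of the split; the file contents below are hypothetical.

// key   = LongWritable byte offset of the line within the file (the positional information)
// value = Text holding the line itself (the chunk of data that composes the record)
//
// Input split contents:          Records handed to map():
//   hello hadoop\n                 (0,  "hello hadoop")
//   hadoop mapreduce\n             (13, "hadoop mapreduce")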
Map:
The map function is the heart of the mapper task; it is executed on each key/value pair from the record reader to produce zero or more key/value pairs, called intermediate pairs. What to use as the key and value depends on what the MapReduce job is accomplishing: the data is grouped on the key, and the value is the information pertinent to the analysis in the reducer.
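A minimal sketch of a word-count map function, assuming the (byte offset, line of text) input that TextInputFormat’s record reader provides; the class name TokenCounterMapper is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public class TokenCounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key = byte offset from the record reader; value = one line of the input
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit an intermediate (word, 1) pair
    }
  }
}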
Combiner:
• The combiner is not applicable to all MapReduce algorithms, but wherever it can be applied it is recommended. It takes the intermediate keys from the mapper and applies a user-provided method to aggregate values within the small scope of that one mapper, e.g. sending (hadoop, 3) requires fewer bytes than sending (hadoop, 1) three times over the network.
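As a sketch, inside the job driver shown earlier: because integer addition is associative and commutative, the word-count reducer class can double as the combiner.

// A combiner has the same interface as a reducer, so IntSumReducer (sketched later) can be reused.
job.setCombinerClass(IntSumReducer.class);
// Shuffled without a combiner: (hadoop, 1) (hadoop, 1) (hadoop, 1)  -> three records over the network
// Shuffled with a combiner:    (hadoop, 3)                          -> one record over the network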
Partitioner:
• The partitioner takes the intermediate key/value pairs from the mapper and splits them into shards, one shard per reducer. By default it distributes the keyspace roughly evenly over the reducers, while still ensuring that the same key emitted by different mappers ends up at the same reducer. The partitioned data is written to the local filesystem of each map task and waits to be pulled by its respective reducer.
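The default HashPartitioner simply hashes the key modulo the number of reduce tasks, which is why equal keys from different mappers meet at the same reducer. A custom partitioner can be sketched as below; FirstLetterPartitioner is a hypothetical example, not part of Hadoop.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The built-in HashPartitioner effectively computes:
//   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Hypothetical scheme: send words to a shard chosen by their first character
    String s = key.toString();
    char first = s.isEmpty() ? 'a' : s.charAt(0);
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}
// Registered in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);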
Reducer
Shuffle and Sort:
• The reduce task starts with the shuffle and sort step. This step takes the output files written by all of the partitioners and downloads them to the local machine on which the reducer is running. These individual data pieces are then sorted by key into one larger data list. The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task.
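Hadoop lets a job tune how this sorting and grouping behaves; a minimal sketch inside the driver, assuming Text intermediate keys:

// How intermediate keys are ordered while the pulled map outputs are merged
job.setSortComparatorClass(Text.Comparator.class);
// Which keys are treated as equivalent and therefore grouped into a single reduce() call
job.setGroupingComparatorClass(Text.Comparator.class);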
Reduce:
• The reducer takes the grouped data as input and runs a reduce function once per key grouping. The function is passed the key and an iterator over all the values associated with that key. A wide range of processing can happen in this function: the data can be aggregated, filtered, and combined in a number of ways. Once it is done, it sends zero or more key/value pairs to the final step, the output format.
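A minimal sketch of a word-count reduce function that sums the grouped values of each key; the class name IntSumReducer is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {   // iterate over all values grouped under this key
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);        // emit the final (word, count) pair to the output format
  }
}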
Output Format:
• The output format translates the final key/value pairs from the reduce function and writes them out to a file via a record writer. By default, it separates the key and value with a tab character and separates records with a newline character.
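A sketch of configuring the default TextOutputFormat inside the driver; the separator property name is the one used by Hadoop 2.x and later, and the output path shown is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

Configuration conf = new Configuration();
// By default the record writer emits "key<TAB>value" followed by a newline;
// the separator can be overridden before the job is created.
conf.set("mapreduce.output.textoutputformat.separator", ",");
Job job = Job.getInstance(conf, "word count");
job.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wordcount/output"));  // hypothetical path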
Anatomy of Hadoop MapReduce Execution: