Unit-2 (MapReduce-II)

A MapReduce job in Apache Hadoop splits input data and processes it in parallel across multiple machines. The input is split into map tasks that process the data in parallel. The outputs from the map tasks are shuffled and sorted before being processed by reduce tasks to generate the final results.

Big Data and Analytics

Unit-II
MapReduce
Anatomy of a MapReduce Job in Apache Hadoop
The Hadoop framework comprises two main components:
• Hadoop Distributed File System (HDFS) for Data Storage
• MapReduce for Data Processing
A typical Hadoop MapReduce job is divided into a set of Map
and Reduce tasks that execute on a Hadoop cluster. The
execution flow occurs as follows:
• Input data is split into small subsets (input splits).
• Map tasks work on these data splits in parallel.
• The intermediate data produced by the Map tasks is then
passed to the Reduce tasks after an intermediate step
called 'shuffle'.
• The Reduce task(s) work on this intermediate data to
generate the result of the MapReduce job.
• Hadoop MapReduce jobs are divided into a set of map tasks
and reduce tasks.
• The input to a MapReduce job is a set of files in the data store
that are spread out over HDFS. In Hadoop, these files are split
by an input format, which defines how to separate a file into
input splits. An input split can be thought of as a byte-oriented
view of a chunk of the file to be loaded by a map task.
• Each map task in Hadoop is broken into the following phases: record
reader, mapper, combiner and partitioner. The output of the map
phase, called the intermediate keys and values, is sent to the
reducers.
• The reduce tasks are broken into the following phases: shuffle, sort,
reducer and output format. A driver sketch that wires these phases
together is shown below.
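To make the phase list concrete, here is a minimal driver sketch using the org.apache.hadoop.mapreduce API. It is a sketch, not code from these slides: the class names WordCountDriver, WordCountMapper and WordCountReducer are illustrative (the mapper and reducer are sketched in the Mapper and Reducer sections that follow).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Input format: defines how the HDFS files are cut into input splits.
    job.setInputFormatClass(TextInputFormat.class);

    // Map side: mapper and combiner (the default partitioner is HashPartitioner).
    job.setMapperClass(WordCountMapper.class);     // illustrative class, sketched later
    job.setCombinerClass(WordCountReducer.class);  // reuse the reducer's summing logic

    // Reduce side: reducer and output format.
    job.setReducerClass(WordCountReducer.class);   // illustrative class, sketched later
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}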
The map tasks are assigned by the Hadoop framework to the
DataNodes where the actual data to be processed resides. This
means the data typically does not have to move over the network,
which saves network bandwidth; the computation happens on the
local machine itself, so the map task is said to be data-local.
Mapper
Record Reader:
The record reader translates an input split generated by the input format
into records. Its purpose is to parse the data into records, but not to parse
the records themselves. It passes the data to the mapper in the form of
key/value pairs. Usually the key in this context is positional information and
the value is the chunk of data that composes a record.
Map:
The map function is the heart of the mapper task. It is executed on each
key/value pair from the record reader to produce zero or more key/value
pairs, called intermediate pairs. What the key/value pairs are depends on
what the MapReduce job is accomplishing: the data is grouped on the key,
and the value is the information pertinent to the analysis in the
reducer.
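A word-count mapper sketch illustrates both points: the record reader of TextInputFormat hands map() a positional key (the byte offset of the line) and the line itself as the value, and map() emits zero or more intermediate (word, 1) pairs. The class name WordCountMapper is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The key is positional information (byte offset); the value is one record (line).
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);  // emit the intermediate pair (word, 1)
    }
  }
}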
Combiner:
• A combiner is not applicable to every MapReduce algorithm, but
wherever it can be applied, it is recommended. It takes the
intermediate keys from the mapper and applies a user-provided
method to aggregate values within the small scope of that one
mapper. For example, sending (hadoop, 3) requires fewer bytes than
sending (hadoop, 1) three times over the network.
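In the word-count sketch above, the reducer's summing logic is associative and commutative, so it can double as the combiner; enabling it is one line in the (illustrative) driver:

// Aggregate (hadoop, 1) pairs into (hadoop, 3) on the map side before the shuffle.
job.setCombinerClass(WordCountReducer.class);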
Partitioner:
• The partitioner takes the intermediate key/value pairs from the mapper
and splits them into shards, one shard per reducer. By default this
distributes the keyspace evenly over the reducers, while still ensuring
that the same key emitted by different mappers ends up at the same
reducer. The partitioned data is written to the local filesystem for each
map task and waits to be pulled by its respective reducer.
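The sketch below mirrors what Hadoop's default HashPartitioner does: hash the key modulo the number of reduce tasks, so the same key from every mapper lands on the same reducer. The class name WordPartitioner is illustrative; it would be registered with job.setPartitionerClass(WordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the shard index is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}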
Reducer
Shuffle and Sort:
• The reduce tasks start with the shuffle and sort step. This
step takes the output files written by all of the Hadoop
partitioners and downloads them to the local machine on
which the reducer is running. These individual data pieces
are then sorted by key into one larger data list. The
purpose of this sort is to group equivalent keys together
so that their values can be iterated over easily in the
reduce task.
Reduce:
• The reducer takes the grouped data as input and runs a reduce
function once per key grouping. The function is passed the key and
an iterator over all the values associated with that key. A wide range
of processing can happen in this function: the data can be
aggregated, filtered, and combined in a number of ways. Once it is
done, it sends zero or more key/value pairs to the final step, the
output format.
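For the word-count example, the reduce function simply sums the counts it is handed for each key group. This is a sketch, and WordCountReducer is an illustrative class name.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    // reduce() runs once per key group; the iterator walks every value for that key.
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    total.set(sum);
    context.write(word, total);  // final (word, total) pair goes to the output format
  }
}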
Output Format:
• The output format translates the final key/value pairs from the reduce
function and writes them out to a file using a record writer. By default, it
separates the key and value with a tab and separates records with a
newline character.
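If the default tab separator is not wanted, TextOutputFormat's separator can be overridden in the driver. The property name below is the one used by recent Hadoop releases and is given here as an assumption, not quoted from the slides.

// Write "key,value" lines instead of the default "key<TAB>value".
conf.set("mapreduce.output.textoutputformat.separator", ",");
job.setOutputFormatClass(TextOutputFormat.class);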
Anatomy of Hadoop MapReduce Execution:
Once we submit a MapReduce job, the system enters a
series of life-cycle phases:
1. Job Submission Phase
2. Job Initialization Phase
3. Task Assignment Phase
4. Task Execution Phase
5. Progress update Phase
6. Failure Recovery
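These phases are mostly internal to Hadoop, but the submission and progress-update steps are visible from the client. As a sketch (reusing the job object from the earlier driver, inside a method that declares the checked exceptions), the job can be submitted and polled instead of calling waitForCompletion(true):

job.submit();                                  // job submission phase
while (!job.isComplete()) {                    // progress update phase
  System.out.printf("map %3.0f%%  reduce %3.0f%%%n",
      job.mapProgress() * 100, job.reduceProgress() * 100);
  Thread.sleep(5000);                          // poll every few seconds
}
System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");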
To run the MapReduce program, Hadoop uses the command
'yarn jar client.jar job-class HDFS-input HDFS-output-directory', where
yarn is a utility and jar is the command; client.jar and the job class
name are written by the developer. When we execute this on the
terminal, YARN initiates a set of actions:
1. Loading configurations
2. Identifying the command
3. Setting the class path
4. Identifying the Java class corresponding to the jar command,
i.e. org.apache.hadoop.util.RunJar. It then expands the
user-provided command to
"java org.apache.hadoop.util.RunJar client.jar job-class
HDFS-input HDFS-output-directory".
