Data Science Presentation
-Bhanu
HADOOP MAPREDUCE
• Objective
• MapReduce is the core component of Hadoop that processes huge amounts of data in parallel
by dividing the work into a set of independent tasks. In MapReduce, data flows step by step
from Mapper to Reducer. In this tutorial, we are going to cover how Hadoop MapReduce
works internally.
• This section on Hadoop MapReduce data flow will give you the complete MapReduce data
flow picture in Hadoop. The tutorial covers the phases of MapReduce job execution in detail:
Input Files, InputFormat, Input Splits, RecordReader, Mapper, Combiner, Partitioner,
Shuffling and Sorting, Reducer, RecordWriter, and OutputFormat. We will also learn how
Hadoop MapReduce works with the help of all these phases.
WHAT IS MAPREDUCE?
The Mapper is the first code that interacts with the input dataset. Suppose the dataset we are
analyzing has 100 data blocks; in that case, 100 Mapper processes run in parallel on the cluster's
machines (nodes), each producing its own output, known as the intermediate output, which is stored
on local disk, not on HDFS. The output of the Mapper acts as input for the Reducer, which performs
sorting and aggregation operations on the data and produces the final output.
The Mapper mainly consists of 5 components: Input, Input Splits, RecordReader, Map, and the intermediate output on disk.
Input: Input is the records or datasets used for analysis. This input data is described with the help
of an InputFormat, which identifies the location of the input data stored in HDFS (Hadoop Distributed
File System).
Working of Mapper in MapReduce: The input data from the user is passed to the Mapper as specified by an
InputFormat. The InputFormat is specified in the driver code. It defines the location of the input data, such as a file
or directory on HDFS, and it also determines how to split the input data into input splits. Each Mapper deals with a
single input split. RecordReaders are objects, created by the InputFormat, that extract (key, value) records from the
input source (the split data).

The Mapper processes the input (key, value) pairs and produces output that is also (key, value) pairs. The output
from the Mapper is called the intermediate output. The Mapper may use or completely ignore the input key. For
example, a standard pattern is to read a file one line at a time: the key is the byte offset into the file at which the
line starts, and the value is the contents of the line itself. Typically the key is considered irrelevant. If the Mapper
writes anything out, the output must be in the form of key/value pairs.

The output from the Mapper (intermediate keys and their value lists) is passed to the Reducer in sorted key order.
The Reducer outputs zero or more final key/value pairs, which are written to HDFS. The Reducer usually emits a
single key/value pair for each input key.

If a Mapper appears to be running more slowly than the others, a new instance of the Mapper is started on another
machine, operating on the same data (speculative execution). The results of whichever Mapper finishes first are used,
and Hadoop kills the Mapper that is still running.

The number of map tasks in a MapReduce program depends on the number of data blocks of the input file. For
example, if the block size is 128 MB and the input data is 1 GB in size, there will be 8 map tasks. The number of
map tasks increases with the size of the input data; parallelism therefore increases, which results in faster
processing of the data.
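To make the line-reading pattern concrete, here is a minimal sketch of such a Mapper in Hadoop's Java API. As described above, the byte-offset key is ignored, and one (word, 1) intermediate pair is emitted per token; the class and field names are our own illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count style Mapper: the input key is the byte offset at which the
// line starts (ignored here), and the value is the contents of the line.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit one (word, 1) intermediate pair for every token in the line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}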
Hadoop – Reducer in Map-Reduce
The Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of them
to generate its output. The output of the Reducer is the final output, which is stored in HDFS. Usually,
in the Hadoop Reducer, we do aggregation or summation-style computation.
Here we cover the different phases of the Hadoop MapReduce Reducer: shuffling and sorting in Hadoop,
the Hadoop reduce phase, and the functioning of the Hadoop Reducer class. We will also discuss how many
Reducers are required in Hadoop and how to change the number of Reducers in Hadoop MapReduce.
Consider an example to understand the working of the Reducer. Suppose we have the faculty data of all departments
of a college stored in a CSV file. If we want to find the sum of faculty salaries per department, we can make the
department title the key and the salaries the values. The Reducer will then perform the summation operation on this
dataset and produce the desired output.
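A minimal sketch of such a Reducer, assuming the Mapper has already emitted (department, salary) pairs as Text and LongWritable; the class name SalarySumReducer is our own illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all salary values that arrive under one department key.
public class SalarySumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text dept, Iterable<LongWritable> salaries, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable salary : salaries) {
            sum += salary.get();
        }
        total.set(sum);
        // Emit one final (department, total salary) pair per input key.
        context.write(dept, total);
    }
}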
Increasing the number of Reducers in a Map-Reduce job also affects the following:
1. Framework overhead increases.
2. The cost of a failure decreases.
3. Load balancing improves.
The Reducer in Map-Reduce consists mainly of 3 phases:
1. Shuffle: Shuffling carries data from the Mapper to the required Reducer. Using HTTP, the
framework fetches the applicable partition of the output of every Mapper.
2. Sort: In this phase, the output of the Mapper, that is, the key-value pairs, is sorted on the
basis of the keys.
3. Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results and
performs the required computation on them. The OutputCollector.collect() method is used
for writing the output to HDFS. Keep in mind that the output of the Reducer is not re-sorted.
Note: Shuffling and sorting execute in parallel.
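The number of Reducers is set in the driver code. Below is a minimal driver sketch, assuming the standard Hadoop Java API; SalaryMapper is a hypothetical Mapper that emits (department, salary) pairs, and SalarySumReducer is the Reducer sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalarySumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "salary sum");
        job.setJarByClass(SalarySumDriver.class);

        job.setMapperClass(SalaryMapper.class);      // hypothetical Mapper emitting (dept, salary)
        job.setReducerClass(SalarySumReducer.class); // the Reducer sketched above

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // The driver is where the number of Reducers is changed.
        job.setNumReduceTasks(2);

        // The default InputFormat (TextInputFormat) reads the file line by line;
        // the paths point at files or directories on HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}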
MapReduce Combiner
-MANSi
MAPREDUCE COMBINER
When we run a MapReduce job on a large dataset, large chunks of intermediate data are
generated by the Mapper, and this intermediate data is passed on to the Reducer for
further processing, which leads to enormous network congestion. The MapReduce
framework provides a function known as the Hadoop Combiner that plays a key role in
reducing this network congestion.
HOW DOES MAPREDUCE COMBINER WORK?
• [Diagram: a MapReduce program without a Combiner vs. a MapReduce program with a Combiner
in between the Mapper and Reducer]
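When the reduce operation is associative and commutative, like summation, the Reducer class itself can often be reused as the Combiner. A one-line sketch, assuming the hypothetical driver shown earlier:

// Run a local reduce pass on each Mapper's output before the shuffle,
// so far less intermediate data crosses the network. Safe here because
// summation is associative and commutative.
job.setCombinerClass(SalarySumReducer.class);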
ADVANTAGES OF MAPREDUCE COMBINER
• The Hadoop Combiner reduces the time taken for data transfer between Mapper and Reducer.
• It decreases the amount of data that needs to be processed by the Reducer.
• The Combiner improves the overall performance of the Reducer.