Data Science Presentation

The document provides an overview of MapReduce and how it works in Hadoop. It discusses that MapReduce is the processing layer of Hadoop that is designed to process large volumes of data in parallel. It divides the work into independent tasks that are performed by mappers and reducers. Mappers process the data and generate intermediate outputs which are shuffled and sorted before being input to reducers, which perform aggregation and final output generation. Combiners can further reduce network traffic by performing partial aggregation locally.


Map Reduce Introduction

-Bhanu
HADOOP MAPREDUCE

• This Hadoop MapReduce tutorial describes the concepts of Hadoop MapReduce in detail. In this tutorial, we will understand what MapReduce is and how it works, and what the Mapper, Reducer, shuffling, and sorting are. This Hadoop MapReduce tutorial also covers the internals of MapReduce, data flow, architecture, and data locality. So let's get started with the Hadoop MapReduce tutorial.
WHAT IS MAPREDUCE?
• MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You only need to express your business logic in the way MapReduce works; everything else is taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small pieces of work (tasks) and assigned to slaves.
MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, the input is a list, and the output produced is again a list. MapReduce is the heart of Hadoop; Hadoop owes much of its power and efficiency to the parallel processing that MapReduce performs.
This is what MapReduce is in Big Data. In the next step of this MapReduce tutorial we look at the MapReduce process and data flow: how MapReduce divides the work into sub-tasks and why it is one of the best paradigms for processing data.
CONCLUSION: HADOOP MAPREDUCE
• Hence, MapReduce powers the core functionality of Hadoop. Since it works on the concept of data locality, it improves performance. In the next tutorial on MapReduce, we will learn about the shuffling and sorting phase in detail.
• This was all about the Hadoop MapReduce tutorial. I hope you are now clear on what MapReduce is and how Hadoop MapReduce works.
Map Reduce Data Flow
yadvendra singh dhalkad
HOW HADOOP MAPREDUCE WORKS

• Objective
• MapReduce is the core component of Hadoop that processes huge amounts of data in parallel by dividing the work into a set of independent tasks. In MapReduce, data flows step by step from Mapper to Reducer. In this tutorial, we are going to cover how Hadoop MapReduce works internally.
• This part of the tutorial on Hadoop MapReduce data flow will provide you with the complete MapReduce data flow chart in Hadoop. It covers the various phases of MapReduce job execution, such as Input Files, InputFormat in Hadoop, Input Splits, RecordReader, Mapper, Combiner, Partitioner, Shuffling and Sorting, Reducer, RecordWriter, and OutputFormat, in detail. We will also learn how Hadoop MapReduce works with the help of all these phases.
WHAT IS MAPREDUCE?

• MapReduce is the data processing layer of Hadoop. It is a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). It processes huge amounts of data in parallel by dividing the submitted job into a set of independent tasks (sub-jobs). This parallel processing improves the speed and reliability of the cluster. We just need to put the custom code (business logic) in the way MapReduce works, and the rest will be taken care of by the framework.
HOW HADOOP MAPREDUCE WORKS?

• In Hadoop, MapReduce works by breaking the data processing into two phases: the Map phase and the Reduce phase. Map is the first phase of processing, where we specify all the complex logic, business rules, and costly code. Reduce is the second phase of processing, where we specify lightweight processing such as aggregation or summation.
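• For example, in the classic word count job, if the input lines are "deer bear river" and "car car river", the Map phase emits (deer, 1), (bear, 1), (river, 1), (car, 1), (car, 1), (river, 1); after shuffling and sorting, the Reduce phase receives (bear, [1]), (car, [1, 1]), (deer, [1]), (river, [1, 1]) and sums each list to produce (bear, 1), (car, 2), (deer, 1), (river, 2).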
Mapper and Reducer
-Vaishnavi Jaiswal
The Mapper is a function or task which is used to process all input records from a file and generate output that works as input for the Reducer. It produces its output by emitting new key-value pairs. The input data has to be converted to key-value pairs, because the Mapper cannot process raw input records; it works only on tuples (key-value pairs). While processing the input records, the Mapper generates small blocks of intermediate data as key-value pairs.
The Mapper is a simple user-defined program that performs operations on input splits according to how it is designed. Mapper is a base class that needs to be extended by the developer or programmer according to the requirements. The input and output types need to be declared as type arguments of the Mapper class and chosen by the developer.
For Example:
class MyMapper extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

The Mapper is the first code that interacts with the input dataset. Suppose the dataset we are analyzing has 100 data blocks; in that case there will be 100 Mapper programs or processes running in parallel on machines (nodes), each producing its own output, known as intermediate output, which is stored on local disk, not on HDFS. The output of the Mappers acts as input for the Reducer, which performs sorting and aggregation operations on the data and produces the final output.
The Mapper stage mainly consists of 5 components: Input, Input Splits, RecordReader, Map, and intermediate output on disk.
Input: the records or datasets that are used for analysis. The input data is described with the help of an InputFormat, which identifies the location of the input data stored in HDFS (Hadoop Distributed File System).

Working of Mapper in MapReduce:
• The input data from the user is passed to the Mapper as specified by an InputFormat. The InputFormat is specified in the driver code; it defines the location of the input data, such as a file or directory on HDFS, and determines how to split the input data into input splits. Each Mapper deals with a single input split.
• RecordReaders are objects, part of the InputFormat, used to extract (key, value) records from the input source (the split data).
• The Mapper processes the input (key, value) pairs and provides output that also consists of (key, value) pairs. The output from the Mapper is called the intermediate output.
• The Mapper may use or completely ignore the input key. For example, a standard pattern is to read a file one line at a time: the key is the byte offset into the file at which the line starts, and the value is the contents of the line itself. Typically the key is considered irrelevant. If the Mapper writes anything out, the output must be in the form of key/value pairs.
• The output from the Mapper (intermediate keys and their value lists) is passed to the Reducer in sorted key order. The Reducer outputs zero or more final key/value pairs, which are written to HDFS. The Reducer usually emits a single key/value pair for each input key.
• If a Mapper appears to be running more slowly than or lagging behind the others, a new instance of the Mapper is started on another machine, operating on the same data. The results of whichever Mapper finishes first are used, and Hadoop eliminates the Mapper that is still running.
• The number of map tasks in a MapReduce program depends on the number of data blocks of the input file. For example, if the block size is 128 MB and the input data is 1 GB in size, then there will be 8 map tasks. The number of map tasks increases with the size of the input data, and hence parallelism increases, which results in faster processing of data.
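To make this concrete, below is a minimal sketch of a word-count Mapper written against the org.apache.hadoop.mapreduce API; the class and variable names (WordCountMapper, ONE, word) are illustrative choices, not part of the tutorial above.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count Mapper: the input key is the byte offset of the line
// (ignored here), the value is the line text; for every token it emits the
// intermediate key-value pair (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // intermediate output, spilled to local disk, not HDFS
        }
    }
}

One such Mapper instance runs per input split, so for a 100-block input the code above would run as roughly 100 parallel map tasks.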
Hadoop – Reducer in Map-Reduce
The Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of them to generate the output. The output of the Reducer is the final output, which is stored in HDFS. Usually, in the Hadoop Reducer, we do aggregation or summation-style computation.
Here we cover the different phases of the Hadoop MapReduce Reducer, shuffling and sorting in Hadoop, the Hadoop reduce phase, and the functioning of the Hadoop Reducer class. We will also discuss how many reducers are required in Hadoop and how to change the number of reducers in Hadoop MapReduce.
An example to understand the working of the Reducer: suppose we have the data of the college faculty of all departments stored in a CSV file. If we want to find the sum of salaries of faculty by department, we can make the department title the key and the salaries the values. The Reducer will perform the summation operation on this dataset and produce the desired output.
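A minimal sketch of such a Reducer is shown below, assuming the Mapper has already emitted (department, salary) pairs as Text and DoubleWritable; the class name SalarySumReducer is an illustrative choice.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative Reducer for the salary example: the key is the department title,
// the values are the salaries emitted for that department; the Reducer sums them
// and writes one (department, total) pair to the final output.
public class SalarySumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text department, Iterable<DoubleWritable> salaries, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        for (DoubleWritable salary : salaries) {
            sum += salary.get();
        }
        total.set(sum);
        context.write(department, total);  // final output, stored in HDFS
    }
}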
The number of Reducers in a MapReduce job also affects the following:
1. Framework overhead: it increases as the number of Reducers grows.
2. Cost of failure: it reduces, since each failed Reducer has less work to redo.
3. Load balancing: it improves with more Reducers.
The Reducer of MapReduce mainly consists of 3 processes/phases:
1. Shuffle: shuffling carries data from the Mappers to the appropriate Reducer. Using HTTP, the framework fetches the applicable partition of the output of every Mapper.
2. Sort: in this phase, the output of the Mappers, i.e. the key-value pairs, is sorted on the basis of the keys.
3. Reduce: once shuffling and sorting are done, the Reducer combines the obtained results and performs the required computation on them. The OutputCollector.collect() method (Context.write() in the newer API) is used for writing the output to HDFS. Keep in mind that the output of the Reducer is not sorted.
Note: shuffling and sorting execute in parallel.
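The number of reducers is set in the job driver. Below is a minimal driver sketch, assuming the SalarySumReducer above and a hypothetical SalaryMapper that emits (department, salary) pairs, showing how the number of reduce tasks can be changed with Job.setNumReduceTasks():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver for the salary-sum job.
public class SalarySumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "salary sum");
        job.setJarByClass(SalarySumDriver.class);
        job.setMapperClass(SalaryMapper.class);     // hypothetical Mapper emitting (department, salary)
        job.setReducerClass(SalarySumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        job.setNumReduceTasks(4);                   // change the number of reducers here
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}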
MapReduce Combiner
-MANSi
MAPREDUCE COMBINER
On a large dataset, when we run a MapReduce job, large chunks of intermediate data are generated by the Mapper, and this intermediate data is passed on to the Reducer for further processing, which leads to enormous network congestion. The MapReduce framework provides a function known as the Hadoop Combiner that plays a key role in reducing this network congestion.
HOW DOES MAPREDUCE COMBINER WORK?
• [Diagrams: a MapReduce program with a Combiner in between the Mapper and Reducer, compared with a MapReduce program without a Combiner.]
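A minimal sketch of how a Combiner is plugged in, assuming a word-count job with the WordCountMapper shown earlier and a hypothetical WordCountReducer that sums IntWritable counts: because summing is associative and commutative, the same Reducer class can be registered as the Combiner, so partial sums are computed locally on each Mapper node before the intermediate data crosses the network.

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);  // runs locally on each Mapper's intermediate output
job.setReducerClass(WordCountReducer.class);   // runs on the shuffled and merged data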
ADVANTAGES OF MAPREDUCE COMBINER
• The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer.
• It decreases the amount of data that needs to be processed by the reducer.
• The Combiner improves the overall performance of the reducer.

Disadvantages of MapReduce Combiner

 MapReduce jobs cannot depend on the Hadoop Combiner being executed, because there is no guarantee that it will run.
 Hadoop may store the intermediate key-value pairs in the local filesystem and run the Combiner on them later, which causes expensive disk I/O.
