Data Science Presentation
-Bhanu
HADOOP MAPREDUCE
• Objective
• MapReduce is the core component of Hadoop that processes huge amounts of data in parallel
by dividing the work into a set of independent tasks. In MapReduce, data flows step by step
from Mapper to Reducer. In this tutorial, we are going to cover how Hadoop MapReduce
works internally.
• This section on Hadoop MapReduce data flow will give you the complete MapReduce data
flow picture in Hadoop. The tutorial covers the phases of MapReduce job execution in detail:
Input Files, InputFormat, Input Splits, RecordReader, Mapper, Combiner, Partitioner,
Shuffling and Sorting, Reducer, RecordWriter, and OutputFormat. We will also learn how
Hadoop MapReduce works with the help of all these phases.
WHAT IS MAPREDUCE?
The Mapper is the first code that interacts with the input dataset. Suppose the dataset we are
analyzing has 100 data blocks; in that case, 100 Mapper processes run in parallel on the cluster's
machines (nodes), each producing its own output, known as the intermediate output, which is stored
on local disk, not on HDFS. The output of the Mapper acts as input for the Reducer, which performs
sorting and aggregation operations on the data and produces the final output.
The Mapper mainly consists of 5 components: Input, Input Splits, RecordReader, Map, and the intermediate output on disk.
Input: Input is the records or datasets used for analysis. This input data is described with the help
of an InputFormat, which identifies the location of the input data stored in HDFS (Hadoop Distributed
File System).
Working of Mapper in MapReduce: The input data from the user is passed to the Mapper as specified by an
InputFormat. The InputFormat is specified in the driver code. It defines the location of the input data, such as a file
or directory on HDFS, and it also determines how to split the input data into input splits. Each Mapper deals with a
single input split. RecordReaders are objects, created by the InputFormat, that extract (key, value) records from the
input source (the split data).

The Mapper processes the input (key, value) pairs and produces output that is also (key, value) pairs. The output
from the Mapper is called the intermediate output. The Mapper may use or completely ignore the input key. For
example, a standard pattern is to read a file one line at a time: the key is the byte offset into the file at which the
line starts, and the value is the contents of the line itself. Typically the key is considered irrelevant. If the Mapper
writes anything out, the output must be in the form of key/value pairs.

The output from the Mapper (intermediate keys and their value lists) is passed to the Reducer in sorted key order.
The Reducer outputs zero or more final key/value pairs, which are written to HDFS. The Reducer usually emits a
single key/value pair for each input key.

If a Mapper appears to be running more slowly than the others, a new instance of the Mapper is started on another
machine, operating on the same data (speculative execution). The results of whichever Mapper finishes first are used,
and Hadoop kills the Mapper that is still running.

The number of map tasks in a MapReduce program depends on the number of data blocks of the input file. For
example, if the block size is 128 MB and the input data is 1 GB in size, there will be 8 map tasks. The number of
map tasks increases with the size of the input data; parallelism therefore increases, which results in faster
processing of the data.
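To make the line-reading pattern concrete, here is a minimal sketch of such a Mapper in Hadoop's Java API. As described above, the byte-offset key is ignored, and one (word, 1) intermediate pair is emitted per token; the class and field names are our own illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count style Mapper: the input key is the byte offset at which the
// line starts (ignored here), and the value is the contents of the line.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Emit one (word, 1) intermediate pair for every token in the line.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}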
Hadoop – Reducer in Map-Reduce
The Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of them
to generate its output. The output of the Reducer is the final output, which is stored in HDFS. Usually,
in the Hadoop Reducer, we do aggregation or summation-style computation.
Here we cover the different phases of the Hadoop MapReduce Reducer: shuffling and sorting in Hadoop,
the Hadoop reduce phase, and the functioning of the Hadoop Reducer class. We will also discuss how many
Reducers are required in Hadoop and how to change the number of Reducers in Hadoop MapReduce.
Consider an example to understand the working of the Reducer. Suppose we have the faculty data of all departments
of a college stored in a CSV file. If we want to find the sum of faculty salaries per department, we can make the
department title the key and the salaries the values. The Reducer will then perform the summation operation on this
dataset and produce the desired output.
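A minimal sketch of such a Reducer, assuming the Mapper has already emitted (department, salary) pairs as Text and LongWritable; the class name SalarySumReducer is our own illustration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums all salary values that arrive under one department key.
public class SalarySumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable total = new LongWritable();

    @Override
    protected void reduce(Text dept, Iterable<LongWritable> salaries, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable salary : salaries) {
            sum += salary.get();
        }
        total.set(sum);
        // Emit one final (department, total salary) pair per input key.
        context.write(dept, total);
    }
}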
Increasing the number of Reducers in a Map-Reduce job also affects the following:
1. Framework overhead increases.
2. The cost of a failure decreases.
3. Load balancing improves.
The Reducer in Map-Reduce consists mainly of 3 phases:
1. Shuffle: Shuffling carries data from the Mapper to the required Reducer. Using HTTP, the
framework fetches the applicable partition of the output of every Mapper.
2. Sort: In this phase, the output of the Mapper, that is, the key-value pairs, is sorted on the
basis of the keys.
3. Reduce: Once shuffling and sorting are done, the Reducer combines the obtained results and
performs the required computation on them. The OutputCollector.collect() method is used
for writing the output to HDFS. Keep in mind that the output of the Reducer is not re-sorted.
Note: Shuffling and sorting execute in parallel.
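The number of Reducers is set in the driver code. Below is a minimal driver sketch, assuming the standard Hadoop Java API; SalaryMapper is a hypothetical Mapper that emits (department, salary) pairs, and SalarySumReducer is the Reducer sketched earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalarySumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "salary sum");
        job.setJarByClass(SalarySumDriver.class);

        job.setMapperClass(SalaryMapper.class);      // hypothetical Mapper emitting (dept, salary)
        job.setReducerClass(SalarySumReducer.class); // the Reducer sketched above

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // The driver is where the number of Reducers is changed.
        job.setNumReduceTasks(2);

        // The default InputFormat (TextInputFormat) reads the file line by line;
        // the paths point at files or directories on HDFS.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}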
MapReduce Combiner
-MANSi
MAPREDUCE COMBINER
When we run a MapReduce job on a large dataset, large chunks of intermediate data are
generated by the Mapper, and this intermediate data is passed on to the Reducer for
further processing, which leads to enormous network congestion. The MapReduce
framework provides a function known as the Hadoop Combiner that plays a key role in
reducing this network congestion.
HOW DOES MAPREDUCE COMBINER WORK?
• [Diagram: a MapReduce program without a Combiner vs. a MapReduce program with a Combiner
in between the Mapper and Reducer]
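When the reduce operation is associative and commutative, like summation, the Reducer class itself can often be reused as the Combiner. A one-line sketch, assuming the hypothetical driver shown earlier:

// Run a local reduce pass on each Mapper's output before the shuffle,
// so far less intermediate data crosses the network. Safe here because
// summation is associative and commutative.
job.setCombinerClass(SalarySumReducer.class);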
ADVANTAGES OF MAPREDUCE COMBINER
• The Hadoop Combiner reduces the time taken for data transfer between Mapper and Reducer.
• It decreases the amount of data that needs to be processed by the Reducer.
• The Combiner improves the overall performance of the Reducer.