Unit 2 Topic 4: MapReduce Basics
Dr. Anil Kumar Dubey
Associate Professor,
Computer Science & Engineering Department,
ABES EC, Ghaziabad
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Uttar
Pradesh, Lucknow
Basics of MapReduce
• MapReduce is a processing technique and a programming model for distributed
computing based on Java.
Resilience
• Each node periodically reports its status to the master node.
• If a slave node fails to send this notification, the master node reassigns the
currently running task of that slave node to other available nodes in the cluster.
Quick
• Data processing is quick because MapReduce uses HDFS as its storage system.
• MapReduce takes only minutes to process terabytes of unstructured data.
Parallel Processing
• In MapReduce, the job is divided among multiple nodes, and each node works
on its part of the job simultaneously.
• MapReduce is therefore based on the Divide and Conquer paradigm, which lets us
process the data on different machines.
• Because the data is processed by multiple machines in parallel rather than by a
single machine, the time taken to process the data is reduced by a
tremendous amount.
Availability
• Multiple replicas of the same data are sent to numerous nodes in
the network.
• Thus, in case of any failure, other copies are readily available for
processing without any loss.
Scalability
• Hadoop is a highly scalable platform.
• Traditional RDBMS systems do not scale with increasing data volume.
• MapReduce lets you run applications across a huge number of nodes,
processing terabytes and petabytes of data.
MapReduce Framework
• A MapReduce job usually splits the input dataset into independent
chunks, which are processed by the map tasks in a completely parallel
manner.
• The framework sorts the outputs of the maps, which are then input
to the reduce tasks.
• Typically both the input and the output of the job are stored in a file
system.
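As a concrete illustration, the following is a minimal driver sketch for the classic word count job, assuming the Hadoop Java API; the class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative assumptions (the mapper and reducer are sketched on the Mapper and Reducer slides below). Both the input and output paths point into the file system, typically HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for a word count job; names are assumptions, not from the slides.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);    // map phase
    job.setReducerClass(WordCountReducer.class);  // reduce phase

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Both input and output live in the (distributed) file system, e.g. HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}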
How MapReduce works
• MapReduce can perform distributed and parallel computations on
large datasets across a large number of nodes.
• HDFS is usually used to share the job files among the participating nodes.
Phases of the MapReduce model
• The MapReduce model has three major phases and one optional phase:
• Mapper
• Shuffle and Sort
• Reducer
• Combiner (optional)
Mapper
• It is the first phase of MapReduce programming and contains the
coding logic of the mapper function.
• This logic is applied to the ‘n’ data blocks spread across various
data nodes.
• The mapper function accepts key-value pairs (k, v) as input, where
the key represents the offset of each record within the input and the
value represents the entire record content.
• The output of the Mapper phase is also in key-value format,
written as (k’, v’).
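As a sketch (one of many possible mappers), a word count mapper using the Hadoop Java API could look like the following; the class name WordCountMapper is an illustrative assumption.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word count mapper; generic types are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key   = byte offset of this record within the input split
    // value = the entire record (one line of text)
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);   // emit (k', v') = (word, 1)
    }
  }
}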
Shuffle and Sort
• The outputs of the various mappers, the (k’, v’) pairs, then go into the
Shuffle and Sort phase.
• The pairs are sorted by key, and all values belonging to the same key are
grouped together.
• The output of the Shuffle and Sort phase is again key-value pairs, this
time as a key and an array of values (k, v[]).
• For example, in a word count job the mapper outputs ("the", 1), ("cat", 1),
("the", 1) are shuffled and sorted into ("cat", [1]) and ("the", [1, 1]).
Reducer
• The output of the Shuffle and Sort phase, (k, v[]), is the input
of the Reducer phase.
• In this phase the reducer function’s logic is executed and all the values
are aggregated against their corresponding keys.
• The Reducer consolidates the outputs of the various mappers and computes
the final job output.
• The final output is then written into a single file in an output
directory of HDFS.
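A matching sketch of a word count reducer, again using the Hadoop Java API; the class name WordCountReducer is an illustrative assumption.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word count reducer; generic types are <KEYIN, VALUEIN, KEYOUT, VALUEOUT>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // values = all counts grouped under this key by the Shuffle and Sort phase
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    total.set(sum);
    context.write(key, total);   // final (k, v) for this key, written to the job output
  }
}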
Combiner
• It is an optional phase in the MapReduce model.
• The Combiner phase is used to optimize the performance of
MapReduce jobs.
• In this phase, the outputs of the mappers are locally reduced
at the node level.
• For example, if different mapper outputs (k, v) coming from a
single node contain duplicate keys, then they get combined, i.e.
locally reduced, into a single (k, v[]) output.
• This makes the Shuffle and Sort phase work even quicker,
thereby improving the performance of MapReduce jobs.
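Continuing the hypothetical driver sketch from the MapReduce Framework slide (an assumption, not part of the original deck), the combiner is wired in with a single call; for word count, the reducer class can double as the combiner because summing counts is associative and commutative, so partial sums computed per node are safe to merge later.

// Added in the driver sketch above, before submitting the job:
// run a local reduce on each mapper node before the Shuffle and Sort phase.
job.setCombinerClass(WordCountReducer.class);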
THANK YOU