MapReduce Workflows

This document covers the MapReduce workflow, including the anatomy of a MapReduce job run, classic MapReduce, and YARN's role in resource management and job scheduling. It details the phases of job execution, potential failure cases, and the architecture of YARN, which separates resource management from processing. Additionally, it outlines the components of YARN, such as the Resource Manager, Node Manager, and Application Master, and their functions in managing applications within the Hadoop framework.


UNIT IV - MAPREDUCE APPLICATIONS

MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of a MapReduce job run – classic MapReduce – YARN – failures in classic MapReduce and YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output formats.
MapReduce workflow -
https://www.simplilearn.com/tutorials/hadoop-tutorial/mapreduce-example
Unit tests with MRUnit -
https://learnhadoopwithme.wordpress.com/2013/09/03/unit-test-mapreduce-using-mrunit/
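
For a flavour of what the MRUnit link above covers, here is a minimal mapper test sketch. It uses Hadoop's stock TokenCounterMapper, which emits (token, 1) for every whitespace-separated token, so no hypothetical user classes are needed:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenCounterMapperTest {

      @Test
      public void emitsOneCountPerToken() throws Exception {
        // MapDriver feeds one (key, value) record to the mapper in-memory,
        // with no cluster or HDFS needed.
        MapDriver<Object, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new TokenCounterMapper());
        driver.withInput(new LongWritable(0), new Text("cat cat dog"))
              .withOutput(new Text("cat"), new IntWritable(1))
              .withOutput(new Text("cat"), new IntWritable(1))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest(); // fails the test if the mapper's actual output differs
      }
    }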
ANATOMY OF A MAPREDUCE JOB RUN - CLASSIC MAPREDUCE

Once we submit a MapReduce job, the system enters a series of life-cycle phases:

1. Job Submission Phase
2. Job Initialization Phase
3. Task Assignment Phase
4. Task Execution Phase
5. Progress Update Phase
6. Failure Recovery
• In order to run the MR program, Hadoop uses the command 'yarn jar client.jar job-class HDFS-input-directory HDFS-output-directory', where yarn is a utility and jar is the command.
• client.jar and the job class name are written by the developer.
• When we execute this on the terminal, YARN initiates a set of actions.
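
For example, a hypothetical invocation (the jar name, class name, and directories below are illustrative) would look like:

    yarn jar wordcount.jar com.example.WordCount /user/hadoop/input /user/hadoop/output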
Steps Hadoop takes to run an MR job
1. CLIENT - submits the MapReduce job.
2. JOB TRACKER - a Java application that coordinates the job run; its main class is JobTracker.
3. TASK TRACKER - a Java application that runs the tasks the job has been split into; its main class is TaskTracker.
4. DISTRIBUTED FILE SYSTEM - used for sharing job files between the other entities.
JOB RUN - CLASSIC MAPREDUCE

How is a job run carried out in classic MapReduce?
• The job client connects to the job tracker and asks for a new job ID.
• The client connects to the job tracker using the address from the mapred-site.xml configuration file.
• After a new job ID is assigned, the client performs a few checks on HDFS.
• It first checks whether the output directory already exists.
• If the output already exists, the job stops there itself.
• This is an error-proofing technique applied in Hadoop to avoid losing earlier results by overwriting them.
• Then it computes the input splits.
• It also checks whether the input files exist.
• If it does not find any input file, it throws an error saying it cannot compute the splits.
• If it finds the input files, it proceeds to copy the job jar to HDFS with a very high replication factor, the default being 10.
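
On the client side, these submission-time steps surface through the standard Job API. Here is a minimal driver sketch, again using the stock TokenCounterMapper and IntSumReducer so it stays self-contained; the input and output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // this jar is shipped to HDFS
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // must exist
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist
        // Submission performs the job-ID request, output check, and split
        // computation described above, then waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }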
FAILURE CASES IN CLASSIC MAPREDUCE
Task Failure
• There can be a scenario where user code runs into an infinite loop.
• In such cases, the task tracker observes that there has been no progress on the task for a period of time.
• The observation period is controlled by mapred.task.timeout; if it is set to 0, the task tracker will never fail a long-running task.
• Runtime errors are reported and written to the user logs.
• The JVM itself may hit a bug while running the code.
• Failed tasks are reported to the job tracker, and the job tracker reschedules the execution of the failed tasks.
TaskTracker Failure
• When the job tracker stops receiving heartbeats from a task tracker, it concludes that the task tracker is dead.
• It reschedules that task tracker's tasks on another task tracker.
• If a task tracker fails a number of times and crosses the threshold, it gets blacklisted.
• The threshold is set by the property mapred.max.tracker.failures.
JobTracker Failure
• Failure of the jobtracker is the most serious failure mode.
• Hadoop has no mechanism for dealing with failure of the
jobtracker—it is a single point of failure—so in this case the
job fails.
• However, this failure mode has a low chance of occurring,
since the chance of a particular machine failing is low.
• This situation is improved in YARN, since one of its design
goals is to eliminate single points of failure in MapReduce.
JobTracker Failure
• After restarting a jobtracker, any jobs that were
running at the time it was stopped will need to be
re-submitted.
• There is a configuration option that attempts to
recover any running jobs
(mapred.jobtracker.restart.recover, turned off by
default).
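
The three knobs named in these failure cases are ordinary Hadoop configuration properties. Here is a minimal sketch of setting them programmatically; the values are illustrative, and in practice they usually live cluster-wide in mapred-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class FailureTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Progress timeout in milliseconds; 0 disables the timeout,
        // so a hung task would never be failed by the task tracker.
        conf.setLong("mapred.task.timeout", 600000L);
        // Task failures on one tracker before it is blacklisted for the job.
        conf.setInt("mapred.max.tracker.failures", 4);
        // Attempt to recover running jobs after a jobtracker restart
        // (turned off by default).
        conf.setBoolean("mapred.jobtracker.restart.recover", true);
      }
    }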
YARN
• YARN stands for Yet Another Resource Negotiator.
• YARN is the cluster management component of Hadoop 2.0.
• The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of resource management and job scheduling.
• YARN was introduced to give Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
• The YARN infrastructure is responsible for providing the computational resources, such as CPU or memory, needed for application execution.
• The YARN architecture basically separates the resource management layer from the processing layer.
• In YARN, the responsibility of the Hadoop 1.0 job tracker is split between the resource manager and the application master.
Components of YARN

HADOOP YARN:
• Resource Manager
• Node Manager
• Application Master
• Containers
YARN ARCHITECTURE
1. RESOURCE MANAGER
• Resource Manager is the master of YARN
and is responsible for resource assignment
and management among all the
applications.
• Whenever it receives a processing request,
it forwards it to the corresponding node
manager and allocates resources for the
completion of the request accordingly.
• It has two major components:
1. Scheduler
2. Application Manager
SCHEDULER
• It performs scheduling based on the
allocated application and available
resources.
• It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails.
• The YARN scheduler supports plugins such
as Capacity Scheduler and Fair Scheduler to
partition the cluster resources.
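
The plugin choice is itself configuration. Here is a minimal sketch of selecting a scheduler programmatically; on a real cluster this property normally lives in yarn-site.xml on the Resource Manager:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SchedulerChoice {
      public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Plug in the Capacity Scheduler; for the Fair Scheduler use
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
      }
    }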
APPLICATION MANAGER
• It is responsible for accepting the application and negotiating the first container from the resource manager.
• It also restarts the Application Master container if a task fails.

2. NODE MANAGER
• The Node Manager takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node.
• Its primary job is to keep up with the Resource Manager.
• It registers with the Resource Manager and sends heartbeats with the health status of the node.
• It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager.
• It is also responsible for creating the container process and starting it at the request of the Application Master.
APPLICATION MASTER
• An application is a single job submitted to a
framework.
• The application master is responsible for
negotiating resources with the resource manager,
tracking the status and monitoring progress of a
single application.
• The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything an application needs to run.
• Once the application is started, it sends a health report to the resource manager from time to time.
CONTAINER
• It is a collection of physical resources such
as RAM, CPU cores and disk on a single
node.
• Containers are invoked via the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
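
For a flavour of the API, here is a minimal sketch of building a CLC; the launch command is purely illustrative, and a real application would also populate the local resources, environment, and tokens:

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    public class ClcSketch {
      public static void main(String[] args) {
        // The command the Node Manager will run inside the container.
        List<String> commands =
            Collections.singletonList("java -Xmx512m com.example.MyTask");
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
            Collections.emptyMap(), // local resources (jars/files to localize)
            Collections.emptyMap(), // environment variables
            commands,               // launch command(s)
            null,                   // service data
            null,                   // security tokens
            null);                  // application ACLs
        System.out.println(clc.getCommands());
      }
    }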
Application workflow in Hadoop YARN

1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
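
Steps 1 and 7 of this workflow are visible from the client through the YarnClient API. Here is a minimal sketch; it omits filling in the submission context, which in a real application carries the CLC and resource requirements:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnSubmitSketch {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the Resource Manager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ApplicationId appId = ctx.getApplicationId();
        // ... set the application name, CLC, and resource requirements here ...
        yarnClient.submitApplication(ctx);

        // Step 7: poll the Resource Manager for the application's status.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("State: " + report.getYarnApplicationState());
      }
    }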
FAILURE CASES IN YARN
