MapReduce Workflows

This document covers the MapReduce workflow, including the anatomy of a MapReduce job run, classic MapReduce, and YARN's role in resource management and job scheduling. It details the phases of job execution, potential failure cases, and the architecture of YARN, which separates resource management from processing. Additionally, it outlines the components of YARN, such as the Resource Manager, Node Manager, and Application Master, and their functions in managing applications within the Hadoop framework.


UNIT IV - MAPREDUCE APPLICATIONS

MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of a MapReduce job run – classic MapReduce – YARN – failures in classic MapReduce and YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input formats – output formats.
MapReduce workflow -
https://www.simplilearn.com/tutorials/hadoop-tutorial/mapreduce-example
Unit tests with MRUnit -
https://learnhadoopwithme.wordpress.com/2013/09/03/unit-test-mapreduce-using-mrunit/
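
For a flavour of what the MRUnit link above covers, here is a minimal mapper test sketch. It uses Hadoop's stock TokenCounterMapper, which emits (token, 1) for every whitespace-separated token, so no hypothetical user classes are needed:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mrunit.mapreduce.MapDriver;
    import org.junit.Test;

    public class TokenCounterMapperTest {

      @Test
      public void emitsOneCountPerToken() throws Exception {
        // MapDriver feeds one (key, value) record to the mapper in-memory,
        // with no cluster or HDFS needed.
        MapDriver<Object, Text, Text, IntWritable> driver =
            MapDriver.newMapDriver(new TokenCounterMapper());
        driver.withInput(new LongWritable(0), new Text("cat cat dog"))
              .withOutput(new Text("cat"), new IntWritable(1))
              .withOutput(new Text("cat"), new IntWritable(1))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest(); // fails the test if the mapper's actual output differs
      }
    }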
ANATOMY OF A MAPREDUCE JOB RUN - CLASSIC MAPREDUCE

Once we submit a MapReduce job, the system enters a series of life-cycle phases:

1. Job Submission Phase
2. Job Initialization Phase
3. Task Assignment Phase
4. Task Execution Phase
5. Progress Update Phase
6. Failure Recovery
• In order to run the MR program, Hadoop uses the command 'yarn jar client.jar job-class HDFS-input-directory HDFS-output-directory', where yarn is a utility and jar is the command.
• client.jar and the job class name are written by the developer.
• When we execute this on the terminal, YARN initiates a set of actions.
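
For example, a hypothetical invocation (the jar name, class name, and directories below are illustrative) would look like:

    yarn jar wordcount.jar com.example.WordCount /user/hadoop/input /user/hadoop/output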
Steps Hadoop takes to run an MR job
1. CLIENT - submits the MapReduce job.
2. JOB TRACKER - a Java application that coordinates the job run; its main class is JobTracker.
3. TASK TRACKER - a Java application that runs the tasks the job has been split into; its main class is TaskTracker.
4. DISTRIBUTED FILE SYSTEM - used for sharing job files between the other entities.
JOB RUN - CLASSIC MAPREDUCE

How is a job run carried out in classic MapReduce?
• The job client connects to the job tracker and asks for a new job ID.
• The client connects to the job tracker using the address from the mapred-site.xml configuration file.
• After a new job ID is assigned, the client performs a few checks on HDFS.
• It first checks whether the output directory already exists.
• If the output already exists, the job stops there itself.
• This is an error-proofing technique applied in Hadoop to avoid losing earlier results by overwriting them.
• Then it computes the input splits.
• It also checks whether the input files exist.
• If it does not find any input file, it throws an error saying it cannot compute the splits.
• If it finds the input files, it proceeds to copy the job jar to HDFS with a very high replication factor, the default being 10.
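
On the client side, these submission-time steps surface through the standard Job API. Here is a minimal driver sketch, again using the stock TokenCounterMapper and IntSumReducer so it stays self-contained; the input and output paths come from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // this jar is shipped to HDFS
        job.setMapperClass(TokenCounterMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // must exist
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist
        // Submission performs the job-ID request, output check, and split
        // computation described above, then waits for completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }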
FAILURE CASES IN CLASSIC MAPREDUCE
Task Failure
• There can be a scenario where user code runs into an infinite loop.
• In such cases, the task tracker observes that there has been no progress on the task for a period of time.
• The observation period is controlled by mapred.task.timeout; if it is set to 0, the task tracker will never fail a long-running task.
• Runtime errors are reported and written to the user logs.
• The JVM itself may hit a bug while running the code.
• Failed tasks are reported to the job tracker, and the job tracker reschedules the execution of the failed tasks.
TaskTracker Failure
• When the job tracker stops receiving heartbeats from a task tracker, it concludes that the task tracker is dead.
• It reschedules that task tracker's tasks on another task tracker.
• If a task tracker fails a number of times and crosses the threshold, it gets blacklisted.
• The threshold is set by the property mapred.max.tracker.failures.
JobTracker Failure
• Failure of the jobtracker is the most serious failure mode.
• Hadoop has no mechanism for dealing with failure of the
jobtracker—it is a single point of failure—so in this case the
job fails.
• However, this failure mode has a low chance of occurring,
since the chance of a particular machine failing is low.
• This situation is improved in YARN, since one of its design
goals is to eliminate single points of failure in MapReduce.
JobTracker Failure
• After restarting a jobtracker, any jobs that were
running at the time it was stopped will need to be
re-submitted.
• There is a configuration option that attempts to
recover any running jobs
(mapred.jobtracker.restart.recover, turned off by
default).
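
The three knobs named in these failure cases are ordinary Hadoop configuration properties. Here is a minimal sketch of setting them programmatically; the values are illustrative, and in practice they usually live cluster-wide in mapred-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class FailureTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Progress timeout in milliseconds; 0 disables the timeout,
        // so a hung task would never be failed by the task tracker.
        conf.setLong("mapred.task.timeout", 600000L);
        // Task failures on one tracker before it is blacklisted for the job.
        conf.setInt("mapred.max.tracker.failures", 4);
        // Attempt to recover running jobs after a jobtracker restart
        // (turned off by default).
        conf.setBoolean("mapred.jobtracker.restart.recover", true);
      }
    }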
YARN
• YARN stands for Yet Another Resource Negotiator.
• YARN is the cluster management component of Hadoop 2.0.
• The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of resource management and job scheduling.
• YARN was introduced to give Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
• The YARN infrastructure is responsible for providing the computational resources, such as CPU or memory, needed for application execution.
• The YARN architecture basically separates the resource management layer from the processing layer.
• In YARN, the responsibility of the Hadoop 1.0 job tracker is split between the resource manager and the application master.
Components of YARN

HADOOP YARN:
• Resource Manager
• Node Manager
• Application Master
• Containers
YARN ARCHITECTURE
1. RESOURCE MANAGER
• Resource Manager is the master of YARN
and is responsible for resource assignment
and management among all the
applications.
• Whenever it receives a processing request,
it forwards it to the corresponding node
manager and allocates resources for the
completion of the request accordingly.
• It has two major components:
1. Scheduler
2. Application Manager
SCHEDULER
• It performs scheduling based on the
allocated application and available
resources.
• It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails.
• The YARN scheduler supports plugins such
as Capacity Scheduler and Fair Scheduler to
partition the cluster resources.
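
The plugin choice is itself configuration. Here is a minimal sketch of selecting a scheduler programmatically; on a real cluster this property normally lives in yarn-site.xml on the Resource Manager:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SchedulerChoice {
      public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();
        // Plug in the Capacity Scheduler; for the Fair Scheduler use
        // org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
        conf.set("yarn.resourcemanager.scheduler.class",
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");
      }
    }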
APPLICATION MANAGER
• It is responsible for accepting the application and negotiating the first container from the resource manager.
• It also restarts the Application Master container if a task fails.

2. NODE MANAGER
• The Node Manager takes care of an individual node in a Hadoop cluster and manages the applications and workflow on that particular node.
• Its primary job is to keep up with the Resource Manager.
• It registers with the Resource Manager and sends heartbeats with the health status of the node.
• It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager.
• It is also responsible for creating the container process and starting it at the request of the Application Master.
APPLICATION MASTER
• An application is a single job submitted to a
framework.
• The application master is responsible for
negotiating resources with the resource manager,
tracking the status and monitoring progress of a
single application.
• The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything an application needs to run.
• Once the application is started, it sends a health report to the resource manager from time to time.
CONTAINER
• It is a collection of physical resources such
as RAM, CPU cores and disk on a single
node.
• Containers are invoked via the Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
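
For a flavour of the API, here is a minimal sketch of building a CLC; the launch command is purely illustrative, and a real application would also populate the local resources, environment, and tokens:

    import java.util.Collections;
    import java.util.List;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    public class ClcSketch {
      public static void main(String[] args) {
        // The command the Node Manager will run inside the container.
        List<String> commands =
            Collections.singletonList("java -Xmx512m com.example.MyTask");
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
            Collections.emptyMap(), // local resources (jars/files to localize)
            Collections.emptyMap(), // environment variables
            commands,               // launch command(s)
            null,                   // service data
            null,                   // security tokens
            null);                  // application ACLs
        System.out.println(clc.getCommands());
      }
    }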
Application workflow in Hadoop YARN

1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the container.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
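
Steps 1 and 7 of this workflow are visible from the client through the YarnClient API. Here is a minimal sketch; it omits filling in the submission context, which in a real application carries the CLC and resource requirements:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnSubmitSketch {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the Resource Manager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ApplicationId appId = ctx.getApplicationId();
        // ... set the application name, CLC, and resource requirements here ...
        yarnClient.submitApplication(ctx);

        // Step 7: poll the Resource Manager for the application's status.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println("State: " + report.getYarnApplicationState());
      }
    }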
FAILURE CASES IN YARN
