YARN is a resource management framework for Hadoop that improves cluster utilization and supports a variety of applications. It introduces the concepts of a ResourceManager and per-application ApplicationMasters to separate resource management from job scheduling and monitoring. The ResourceManager allocates resources across applications while ApplicationMasters work with NodeManagers to execute and monitor tasks. This allows YARN to scale beyond MapReduce and enables multi-tenancy through queue-based scheduling policies like the Capacity Scheduler.

UNIT –II YARN

Anatomy of YARN (Yet Another Resource Negotiator):

 Introduced in Hadoop 2.
 Provides APIs for requesting and working with cluster resources.
 These APIs are typically not used directly by user code; instead, users write to higher-level APIs provided by distributed computing frameworks, and higher-level applications (e.g. Pig, Hive, ...) are built on those frameworks.

Anatomy of a YARN Application Run:

YARN has two types of long-running daemons:

 Resource manager (one per cluster)
 Node managers (on all worker nodes)
Resource Manager (RM) Characteristics:

 One per cluster.
 Processes client requests.
 Allocates and manages resources across the cluster.
 Arranges for each application's Application Master to be created and launched.
 Schedules jobs.
 Allocates resources to applications.
 Monitors the progress of jobs.

Node Manager (NM) Characteristics:

 Runs on all worker nodes.
 Launches and monitors containers.
 Updates the RM through heartbeats.
 Responsible for container life-cycle management.
 Tracks the health of its node.
 Kills containers when the RM directs it to (e.g. after a job is done).

Application Master (AM) Characteristics:

 Each application has its own (unique) Application Master.
 Coordinates the application's execution in the cluster.
 Reports application status back to the client.
 Negotiates resources with the Resource Manager.
 The AM is a JVM process that runs in a container.
 Runs the computation in its container and returns the result to the client.
 May request further containers from the Resource Manager to run a distributed computation.

Container:

 A bundle of physical resources (RAM, CPU, etc.) on a single node.
 A container may be implemented as a Unix process or a Linux cgroup.
 A single node can host multiple containers.

Fig: How YARN runs an Application

1. Client -> Resource manager: submit the application (request to run an application master)
2. Resource manager finds a node manager to launch the application master in a container
3. The application master runs the computation either
 in its own container, or
 by requesting further containers for distributed computation

Resource Requests:
 Flexible model: e.g., the amount of compute resources (memory, CPU) and locality constraints.
 When processing HDFS blocks, an application can request resources on the nodes (or racks) where the HDFS blocks are stored; a hedged code sketch of such a request follows this list.
 Requests can be made at any time:
 all up front: e.g. Spark
 dynamically, as needed: e.g. MapReduce
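A minimal sketch of such a locality-aware request, using the AMRMClient API from within an application master (the hostname, rack name, and resource sizes below are illustrative assumptions, and the client is assumed to be already initialized, started, and registered):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LocalityRequestSketch {
    // Ask YARN for one container near the node that stores the HDFS block being processed.
    static void requestContainerNearBlock(AMRMClient<ContainerRequest> amRmClient,
                                          String blockHost, String blockRack) {
        Resource capability = Resource.newInstance(1024, 1); // 1 GB memory, 1 vcore
        ContainerRequest request = new ContainerRequest(
                capability,
                new String[] { blockHost },   // preferred node, e.g. "worker-node-07" (hypothetical)
                new String[] { blockRack },   // preferred rack, e.g. "/rack-2" (hypothetical)
                Priority.newInstance(1));
        amRmClient.addContainerRequest(request); // requests can be made at any time while the AM runs
    }
}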

Application Lifespan:

 The lifespan of a YARN application can range from a few seconds to a few months.
 One application per job (e.g. classic MapReduce).
 One application per workflow or user session of jobs; in this model:
- containers can be reused between jobs, and
- intermediate data can be cached between jobs.
- Tez and Spark are examples.
 A long-running application shared by many users, which may act as a coordinator:
- a long-running master launches other applications on demand;
- Apache Impala, for example, uses a proxy application to request cluster resources, so that a new application master does not have to be started for each query, keeping overhead low.

Building YARN Applications

The role of the YARN client is to negotiate with the Resource Manager for a YARN application
instance to be created and launched.

As part of this work, you’ll need to inform the Resource Manager about the system
resource requirements of your Application Master.

Once the Application Master is up and running, the client can choose to monitor the status of the
application.
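A hedged sketch of such a client, using the org.apache.hadoop.yarn.client.api.YarnClient API (the application name, memory/vcore values, and AM launch command are placeholder assumptions, not values from these notes):

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class SimpleYarnClient {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Negotiate a new application instance with the resource manager.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");
        // Inform the resource manager of the application master's resource requirements.
        ctx.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore
        // Command the chosen node manager will run to launch the AM (placeholder command).
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(), Collections.emptyMap(),
                Collections.singletonList("java -Xmx256m my.example.AppMaster"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        ApplicationId appId = yarnClient.submitApplication(ctx);
        // Once the AM is up and running, the client can monitor the application's status.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println(appId + " is in state " + report.getYarnApplicationState());
        yarnClient.stop();
    }
}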

YARN Compared to MapReduce 1:


The distributed implementation of MapReduce in the original version of Hadoop (version 1 and
earlier) is sometimes referred to as “MapReduce 1” to distinguish it from MapReduce 2, the
implementation that uses YARN (in Hadoop 2 and later).
In MapReduce 1, there are two types of daemon that control the job execution process:
a jobtracker and one or more tasktrackers.
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the
overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different tasktracker.

In MapReduce 1, the jobtracker takes care of both job scheduling (matching tasks with
tasktrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks,
and doing task bookkeeping, such as maintaining counter totals).
By contrast, in YARN these responsibilities are handled by separate entities: the resource
manager and an application master (one for each MapReduce job).

The jobtracker is also responsible for storing job history for completed jobs.

In YARN, the equivalent role is the timeline server, which stores application history.

The YARN equivalent of a tasktracker is a node manager.

Comparison of MapReduce 1 and YARN components:

MapReduce 1      YARN
Jobtracker       Resource manager, application master, timeline server
Tasktracker      Node manager
Slot             Container

The benefits to using YARN include the following:

Scalability

YARN can run on larger clusters than MapReduce 1.

MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks.

YARN overcomes these limitations by virtue of its split resource manager/application master architecture: it is designed to scale up to 10,000 nodes and 100,000 tasks.

Availability

High availability (HA) is usually achieved by replicating the state needed for another daemon to take over the work of a failed daemon. However, the large amount of rapidly changing complex state in the jobtracker’s memory (each task status is updated every few seconds, for example) makes it very difficult to retrofit HA into the jobtracker service.

With these responsibilities split between the resource manager and application master in YARN, Hadoop 2 supports HA both for the resource manager and for the application master for MapReduce jobs.

Utilization

In MapReduce 1, each tasktracker is configured with a static allocation of fixed-size “slots,” which are divided into map slots and reduce slots at configuration time. A map slot can only run a map task and a reduce slot can only run a reduce task, so slots can sit idle even when tasks of the other type are waiting.

In YARN, a node manager manages a pool of resources, rather than a fixed number of designated slots, so an application can request as much (or as little) of each resource as it needs.

Multitenancy

In some ways, the biggest benefit of YARN is that it opens up Hadoop to other types of distributed application beyond MapReduce. MapReduce is just one YARN application among many.
It is even possible for users to run different versions of MapReduce on the same YARN cluster,
which makes the process of upgrading MapReduce more manageable.

Scheduling in YARN
In an ideal world, the requests that a YARN application makes would be granted
immediately.
In the real world, however, resources are limited, and on a busy cluster, an application will often
need to wait to have some of its requests fulfilled.
It is the job of the YARN scheduler to allocate resources to applications according to some
defined policy. Scheduling in general is a difficult problem and there is no one “best” policy,
which is why YARN provides a choice of schedulers and configurable policies.

Scheduler Options
Three schedulers are available in YARN: the FIFO, Capacity, and Fair Schedulers.

I) FIFO Scheduler (First In, First Out):

FIFO Scheduler places applications in a queue and runs them in the order of submission (first
in, first out). Requests for the first application in the queue are allocated first; once its
requests have been satisfied, the next application in the queue is served, and so on.
The FIFO Scheduler has the merit of being simple to understand and not needing any
configuration, but it’s not suitable for shared clusters. Large applications will use all the
resources in a cluster, so each application has to wait its turn.

II) Capacity Scheduler

 separate queues for small and large jobs
 small jobs don't have to wait for large jobs to finish
 overall cluster utilization may be lower, since capacity is reserved for each queue

III) Fair Scheduler

 dynamically balances resources between running applications
 there is a time lag between a job starting and it receiving its requested resources, because it needs to wait for resources used by earlier jobs to free up

Capacity Scheduler Configuration

The Capacity Scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users and organizations can share a large cluster.

If every organization runs its own private cluster, resource utilization tends to be poor: an organization may provision enough resources to meet its peak demand, but that peak demand may not occur very often, leaving resources idle the rest of the time. Sharing a cluster among organizations is therefore more cost-effective.
To configure the Resource Manager to use the Capacity Scheduler, set the following property in conf/yarn-site.xml:

<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

To set up queues in the Capacity Scheduler, make changes in the etc/hadoop/capacity-scheduler.xml configuration file.
Example:
Suppose there are two child queues under root, XYZ and ABC. XYZ is further divided into two sub-queues, technology and marketing. XYZ is given 60% of the cluster capacity and ABC is given 40%.

<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>XYZ, ABC</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.XYZ.queues</name>
<value>technology,marketing</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.XYZ.capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.ABC.capacity</name>
<value>40</value>
</property>
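Since the capacities of child queues within a parent must sum to 100, the example can be completed by also assigning capacities to XYZ's sub-queues; the 50/50 split below is only illustrative:

<property>
<name>yarn.scheduler.capacity.root.XYZ.technology.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.XYZ.marketing.capacity</name>
<value>50</value>
</property>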

Fair Scheduler Configuration


Fair scheduler in YARN allocates resources to applications in such a way that all apps get, on
average, an equal share of resources over time.
By default, the Fair Scheduler bases scheduling fairness decisions only on memory.
It can be configured to schedule with both memory and CPU, in the form (X mb, Y vcores).

To use the Fair Scheduler in YARN, first assign the appropriate scheduler class in yarn-site.xml:

<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
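Queue definitions for the Fair Scheduler normally live in an allocation file (fair-scheduler.xml by default, or the file named by yarn.scheduler.fair.allocation.file). A minimal illustrative example with two hypothetical queues, prod and dev, might look like:

<allocations>
<queue name="prod">
<weight>40</weight>
</queue>
<queue name="dev">
<weight>60</weight>
<schedulingPolicy>fifo</schedulingPolicy>
</queue>
</allocations>

Here, weight controls each queue's fair share relative to the others, and an individual queue can override the scheduling policy used among its own applications.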

Delay Scheduling
Delay scheduling is a simple technique to achieve data locality (placing tasks on
nodes that contain their input data) and fairness in cluster scheduling.
All YARN schedulers try to honor locality requests.
On a busy cluster, if an application requests a particular node, there is a good
chance that other containers are running on it at the time of the request.

The obvious course of action is to immediately loosen the locality requirement and
allocate a container on the same rack.
However, it has been observed in practice that waiting a short time (no more than a
few seconds) can dramatically increase the chances of being allocated a container
on the requested node, and therefore increase the efficiency of the cluster.
This feature is called delay scheduling, and it is supported by both the Capacity
Scheduler and the Fair Scheduler.

Every node manager in a YARN cluster periodically sends a heartbeat request to the resource manager—by default, one per second. Heartbeats carry information about the node manager's running containers and the resources available for new containers, so each heartbeat is a potential scheduling opportunity for an application to run a container.

When using delay scheduling, the scheduler doesn’t simply use the first scheduling
opportunity it receives, but waits for up to a given maximum number of scheduling
opportunities to occur before loosening the locality constraint and taking the next
scheduling opportunity.
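The delay is configurable in both schedulers. For the Capacity Scheduler, yarn.scheduler.capacity.node-locality-delay (in capacity-scheduler.xml) sets the number of scheduling opportunities to pass up before relaxing to rack locality, for example:

<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>

For the Fair Scheduler, yarn.scheduler.fair.locality.threshold.node (in yarn-site.xml) expresses the delay as a fraction of the cluster size. The value 40 above, and any threshold you pick, are illustrative rather than recommendations.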

Dominant Resource Fairness
 Proposed by researchers from U.C. Berkeley.
 Proposes a notion of fairness across jobs with multi-resource requirements.
 They showed that DRF is:
- fair for multi-tenant systems;
- strategy-proof: a tenant cannot benefit by lying about its requirements;
- envy-free: a tenant cannot envy another tenant's allocation.
 DRF is:
- usable in scheduling VMs in a cluster;
- usable in scheduling Hadoop jobs in a cluster.
 DRF is used in Mesos, an OS intended for cloud environments.
 DRF-like strategies are also used in some cloud computing companies' distributed operating systems.

Example:
In our example:
- Job 1's tasks each need 2 CPUs and 8 GB
=> Job 1's resource vector = <2 CPUs, 8 GB>
- Job 2's tasks each need 6 CPUs and 2 GB
=> Job 2's resource vector = <6 CPUs, 2 GB>
Consider a cluster with <18 CPUs, 36 GB RAM> in total.

Each Job 1 task consumes, as a fraction of total CPUs: 2/18 = 1/9
Each Job 1 task consumes, as a fraction of total RAM: 8/36 = 2/9
Since 1/9 < 2/9,
=> Job 1's dominant resource is RAM, i.e., Job 1 is more memory-intensive than it is CPU-intensive.

Each Job 2 task consumes, as a fraction of total CPUs: 6/18 = 1/3
Each Job 2 task consumes, as a fraction of total RAM: 2/36 = 1/18
Since 1/3 > 1/18,
=> Job 2's dominant resource is CPU, i.e., Job 2 is more CPU-intensive than it is memory-intensive.

DRF ensures:
For a given job, the percentage of its dominant resource type that it gets cluster-wide is the same for all jobs:
- Job 1's % of RAM = Job 2's % of CPU

Solution for our example:


- Job 1 gets 3 tasks with <2 CPUs, 8 GB>
- Job 2 gets 2 tasks with <6 CPUs, 2 GB>
Job 1’s % of RAM
= Number of tasks * RAM per task / Total cluster RAM
= 3*8/36 = 2/3
Job 2’s % of CPU
= Number of tasks * CPU per task / Total cluster CPUs
= 2*6/18 = 2/3

DRF generalizes to multiple jobs, and also to more than two resource types
- CPU, RAM, network, disk, etc.
DRF ensures that each job gets a fair share of the resource type that the job desires the most.
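The dominant-share arithmetic above can be reproduced with a small sketch (illustrative code, not part of YARN):

// Computes each job's dominant share per task for the example cluster above.
public class DrfExample {
    // The dominant share is the larger of the CPU and RAM fractions demanded per task.
    static double dominantShare(double taskCpu, double taskRam,
                                double clusterCpu, double clusterRam) {
        return Math.max(taskCpu / clusterCpu, taskRam / clusterRam);
    }

    public static void main(String[] args) {
        double clusterCpu = 18, clusterRam = 36;
        // Job 1: <2 CPUs, 8 GB> per task -> dominant resource is RAM (8/36 > 2/18)
        double job1 = dominantShare(2, 8, clusterCpu, clusterRam); // 2/9
        // Job 2: <6 CPUs, 2 GB> per task -> dominant resource is CPU (6/18 > 2/36)
        double job2 = dominantShare(6, 2, clusterCpu, clusterRam); // 1/3
        // DRF equalizes total dominant shares: 3 tasks * 2/9 == 2 tasks * 1/3 == 2/3
        System.out.println("Job 1: " + 3 * job1 + ", Job 2: " + 2 * job2);
    }
}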
