

Yarn Tutorial

Lesson 10 of 16, by Simplilearn

Last updated on Sep 18, 2021

Table of Contents

What is Yarn?

YARN - Use Case

YARN Infrastructure

YARN and its Architecture

YARN Architecture Element - Resource Manager

How Does the Resource Manager Operate?

Resource Manager in High Availability Mode

YARN Architecture Element - Application Master

YARN Architecture Element - Node Manager

Applications on YARN

Running an Application through YARN

Tools for YARN Development

YARN is the acronym for Yet Another Resource Negotiator. YARN is a resource manager created
by separating the processing engine and the management function of MapReduce. It monitors
and manages workloads, maintains a multi-tenant environment, manages the high availability
features of Hadoop, and implements security controls.


Before getting into the details of this YARN tutorial, let us understand what YARN is.

What is Yarn?

Before 2012, users could write MapReduce programs in languages such as Java, Python, and
Ruby. They could also use Pig, a language designed for transforming data. Whatever language
was used, its implementation depended on the MapReduce processing model.

In May 2012, with the release of Hadoop version 2.0, YARN was introduced. Users are no longer
limited to the MapReduce framework, as YARN supports multiple processing models in addition
to MapReduce, such as Spark. Other features of YARN include significant performance
improvements and a flexible execution engine.


Now that we have learned about YARN, let us next take a look at the Yarn use case as a part of
this Yarn tutorial.

YARN - Use Case

Yahoo was the first company to embrace Hadoop, and this made it a trendsetter within the
Hadoop ecosystem. In late 2012, Yahoo struggled to handle iterative and stream processing of
data on its Hadoop infrastructure due to MapReduce limitations.

Both iterative and stream processing were important for Yahoo in facilitating its move from batch
computing to continuous computing.

After implementing YARN in the first quarter of 2013, Yahoo installed more than 30,000
production nodes running:

Spark for iterative processing

Storm for stream processing

Hadoop for batch processing, allowing it to handle more than 100 billion events (clicks,
impressions, email content, metadata, and so on) per day

This was possible only after YARN was introduced and multiple processing frameworks were
implemented. The single-cluster approach provides a number of advantages, including:

Higher cluster utilization, where resources left unutilized by one framework can be consumed by
another

Lower operational costs because only one "do-it-all" cluster needs to be managed

Reduced data motion as there's no need to move data between Hadoop YARN and systems
running on different clusters of computers

Let us next look at the YARN infrastructure as part of this YARN tutorial.

YARN Infrastructure

The YARN infrastructure is responsible for providing the computational resources, such as CPU
and memory, needed to execute applications.

YARN infrastructure and HDFS are completely independent. The former provides resources for
running an application while the latter provides storage.

The MapReduce framework is only one of the many possible frameworks that run on YARN. The
fundamental idea of MapReduce version-2 is to split the two major functionalities of resource
management and job scheduling and monitoring into separate daemons.

In the next section, we will discuss YARN and its architecture as a part of this Yarn tutorial.

YARN and its Architecture

Let us first understand the three important elements of the YARN architecture:

Resource Manager

Application Master

Node Managers

Each of these elements is described in detail below.

Resource Manager

The ResourceManager, or RM, usually one per cluster, is the master server. The Resource
Manager knows the location of the DataNodes and how many resources they have; this
information is referred to as Rack Awareness. The RM runs several services, the most important
of which is the Resource Scheduler, which decides how to assign resources.

Application Master

The Application Master is a framework-specific process that negotiates resources for a single
application, that is, a single job or a directed acyclic graph of jobs, which runs in the first
container allocated for the purpose. Each Application Master requests resources from the
Resource Manager and then works with containers provided by Node Managers.

Node Managers

A cluster can have many Node Managers. They are the slaves of the infrastructure. When a
Node Manager starts, it announces itself to the RM and then periodically sends a heartbeat to
the RM.

Each Node Manager offers resources to the cluster. Its resource capacity is the amount of
memory and the number of v-cores (short for virtual cores). At run-time, the Resource
Scheduler decides how to use this capacity: a container is a fraction of the Node Manager's
capacity, and it is used by the client to run a program. Each Node Manager takes instructions
from the ResourceManager and reports and handles containers on a single node.
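For a concrete sense of this capacity, here is a minimal Java sketch that reads the two standard
YARN configuration keys governing a Node Manager's advertised memory and v-cores. The
property names are real YARN keys; the fallback values shown are stock Hadoop defaults, and
your cluster's yarn-site.xml may differ.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: print the resource capacity a Node Manager advertises.
    public class NodeCapacity {
        public static void main(String[] args) {
            Configuration conf = new Configuration(); // picks up yarn-site.xml if on the classpath
            int memoryMb = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
            int vcores = conf.getInt("yarn.nodemanager.resource.cpu-vcores", 8);
            System.out.println("Capacity: " + memoryMb + " MB, " + vcores + " v-cores");
        }
    }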

YARN Architecture Element - Resource Manager

The first element of YARN architecture is ResourceManager. The RM mediates the available
resources in the cluster among competing applications with the goal of maximum cluster
utilization.

It includes a pluggable scheduler called the YarnScheduler, which allows different policies for
managing constraints such as capacity, fairness, and Service Level Agreements. The Resource
Manager has two main components - Scheduler and Applications Manager. Let us understand
each of them in detail.

Resource Manager Component - Scheduler

The Scheduler is responsible for allocating resources to various running applications depending
on the common constraints of capacities, queues, and so on. The Scheduler does not monitor or
track the status of the application. Also, it does not restart the tasks in case of any application or
hardware failures. The Scheduler performs its function based on the resource requirements of
the applications. It does so based on the abstract notion of a resource container that
incorporates elements such as memory, CPU, disk, and network. The Scheduler has a policy
plug-in that is responsible for partitioning the cluster resources among the various queues and
applications. The Capacity Scheduler and the Fair Scheduler are two examples of such plug-ins.

The Capacity Scheduler supports hierarchical queues to enable a more predictable sharing of
cluster resources.
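To make hierarchical queues concrete, here is a minimal sketch of the standard Capacity
Scheduler properties, expressed as Java configuration calls rather than the usual
capacity-scheduler.xml. The property names are real Capacity Scheduler keys; the queue names
prod and dev and the 70/30 split are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch: two child queues under root with a 70/30 capacity split.
    public class QueueSetup {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.set("yarn.scheduler.capacity.root.queues", "prod,dev");
            conf.set("yarn.scheduler.capacity.root.prod.capacity", "70");
            conf.set("yarn.scheduler.capacity.root.dev.capacity", "30");
            System.out.println("prod share: "
                + conf.get("yarn.scheduler.capacity.root.prod.capacity") + "%");
        }
    }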

Resource Manager Component - Application Manager

The Application Manager is an interface that maintains a list of applications that have been
submitted, are currently running, or have completed. The Application Manager is responsible for
accepting job submissions, negotiating the first container for executing the application-specific
Application Master, and restarting the Application Master container on failure.
Let’s discuss how each component of YARN Architecture works together. First, we will
understand how the Resource Manager operates.


How Does the Resource Manager Operate?

The Resource Manager communicates with the clients through an interface called the Client
Service. A client can submit or terminate an application and gain information about the
scheduling queue or cluster statistics through the Client Service.

Administrative requests are served by a separate interface called the Admin Service through
which operators can get updated information about the cluster operation.

In parallel, the Resource Tracker Service receives node heartbeats from the Node Manager to
track new or decommissioned nodes.

The NM Liveliness Monitor and Nodes List Manager keep an updated status of which nodes are
healthy so that the Scheduler and the Resource Tracker Service can allocate work appropriately.

The Application Master Service manages Application Masters on all nodes, keeping the
Scheduler informed. The AM Liveliness Monitor keeps a list of Application Masters and their last
heartbeat times to let the Resource Manager know what applications are healthy on the cluster.

Any Application Master that does not send a heartbeat within a certain interval is marked as
dead and re-scheduled to run on a new container.

Resource Manager in High Availability Mode

Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The
High Availability (HA) feature adds redundancy in the form of an Active/Standby Resource
Manager pair to remove this single point of failure.

Resource Manager HA is realized through an Active/Standby architecture. At any point in time,
one of the RMs is the Active, and one or more RMs are in Standby mode, waiting to take over
should anything happen to the Active. The trigger to transition to Active comes either from the
admin through the Command-Line Interface or from the integrated failover controller.

The RMs have the option to embed the ZooKeeper-based ActiveStandbyElector to decide which
RM should be the Active. When the Active goes down or becomes unresponsive, another RM is
automatically elected to be the Active. Note that there is no need to run a separate ZKFC
daemon as in HDFS, because the ActiveStandbyElector embedded in the RM acts as both a
failure detector and a leader elector.
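For reference, the core yarn-site.xml properties that enable RM HA are sketched below in Java
form. The property names are standard YARN keys; the RM IDs, hostnames, and ZooKeeper
quorum are placeholders, not values from this tutorial.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch of the standard Resource Manager HA settings.
    public class RmHaSetup {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
            conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
            conf.set("yarn.resourcemanager.hostname.rm1", "master1.example.com");
            conf.set("yarn.resourcemanager.hostname.rm2", "master2.example.com");
            conf.set("yarn.resourcemanager.zk-address",
                "zk1.example.com:2181,zk2.example.com:2181");
            System.out.println("HA RMs: " + conf.get("yarn.resourcemanager.ha.rm-ids"));
        }
    }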

In the next section, let us look at the second most important YARN Architecture element,
Application Master.

YARN Architecture Element - Application Master

The second element of YARN architecture is the Application Master. The Application Master in
YARN is a framework-specific library, which negotiates resources from the RM and works with
the NodeManager or Managers to execute and monitor containers and their resource
consumption.

While an application is running, the Application Master manages the application lifecycle,
dynamic adjustments to resource consumption, execution flow, and faults, and it provides status
and metrics.

The Application Master is architected to support a specific framework and can be written in any
language. It uses extensible communication protocols with the Resource Manager and the Node
Manager.

The Application Master can be customized to extend the framework or run any other code.
Because of this, the Application Master is not considered trustworthy and is not run as a trusted
service.

In reality, every application has its own instance of an Application Master. However, it is feasible
to implement an Application Master to manage a set of applications, for example, an Application
Master for Pig or Hive to manage a set of MapReduce jobs.
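As a sketch of this negotiation, the snippet below uses the real AMRMClient API from the YARN
client library to register an Application Master with the RM and request one container. The
resource size (1 GB, 1 v-core) and priority are arbitrary assumptions, and error handling is
omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Minimal sketch: an AM registering with the RM and asking for a container.
    public class AmSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new Configuration());
            rmClient.start();
            // host, RPC port, and tracking URL of this AM (empty placeholders here)
            rmClient.registerApplicationMaster("", 0, "");
            Resource capability = Resource.newInstance(1024, 1);
            ContainerRequest req =
                new ContainerRequest(capability, null, null, Priority.newInstance(0));
            rmClient.addContainerRequest(req);
            // allocated containers arrive over subsequent allocate() heartbeats
            System.out.println("Allocated so far: "
                + rmClient.allocate(0f).getAllocatedContainers().size());
        }
    }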

YARN Architecture Element - Node Manager

The third element of YARN architecture is the Node Manager. When a container is leased to an
application, the NodeManager sets up the container environment. The environment includes the
resource constraints specified in the lease and any kind of dependencies, such as data or
executable files.

The Node Manager monitors the health of the node, reporting to the ResourceManager when a
hardware or software issue occurs so that the Scheduler can divert resource allocations to
healthy nodes until the issue is resolved. The Node Manager also offers a number of services to
containers running on the node such as a log aggregation service.

The Node Manager runs on each node and manages activities such as container lifecycle
management, container dependencies, container leases, node and container resource usage,
node health, and log management. It reports node and container status to the Resource
Manager.

Let us now look at the Node Manager component, the YARN container.

Node Manager Component: YARN Container

A YARN container is a collection of a specific set of resources to be used in certain amounts on
a specific node. A container is allocated by the ResourceManager on behalf of an application.
The Application Master presents the container to the Node Manager on the node where the
container has been allocated, thereby gaining access to the resources.

Now, let us discuss how to launch the container.

The Application Master must provide a Container Launch Context, or CLC. This includes
information such as environment variables, dependencies (data files or shared objects required
prior to launch), security tokens, and the command that creates the process to launch the
application.
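A minimal sketch of building a CLC with the real ContainerLaunchContext.newInstance factory
follows; the echo command is an illustrative assumption, and the local resources, environment,
tokens, and ACLs are left empty here, whereas a real AM would populate them as described
above.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    // Minimal sketch: a CLC whose only content is the launch command.
    public class ClcSketch {
        public static void main(String[] args) {
            ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                       // local resources (data files, jars)
                Collections.emptyMap(),                       // environment variables
                Collections.singletonList("echo hello-yarn"), // command that starts the process
                null,                                         // service data
                null,                                         // security tokens
                null);                                        // ACLs
            System.out.println(clc.getCommands());
        }
    }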

The CLC is what lets the Application Master use containers to run a wide variety of work, from
simple shell scripts to full applications to a virtual operating system.

Applications on YARN

Owing to YARN's generic approach, a Hadoop YARN cluster can run various workloads. This
means a single Hadoop cluster in your data center can run MapReduce, Storm, Spark, Impala,
and more.

Let us first understand how to run an application through YARN.

Running an Application through YARN

Broadly, there are five steps involved in running an application on YARN:

1. The client submits an application to the Resource Manager

2. The ResourceManager allocates a container

3. The Application Master contacts the related Node Manager

4. The Node Manager launches the container

5. The container executes the Application Master

Step 1 - Application submitted to the Resource Manager

Users submit applications to the Resource Manager using the hadoop jar command.

The Resource Manager maintains the list of applications on the cluster and available resources
on the Node Manager. The Resource Manager determines the next application that receives a
portion of the cluster resource. The decision is subject to many constraints such as queue
capacity, Access Control Lists, and fairness.
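The same submission can be made programmatically with the real YarnClient API, sketched
below; the application name is a placeholder, and a complete client would also set the
Application Master's launch context and resource needs on the submission context before
submitting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;

    // Minimal sketch: submitting an application to the Resource Manager.
    public class SubmitSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("demo-app"); // placeholder name
            System.out.println("Submitted: " + yarnClient.submitApplication(ctx));
        }
    }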

Step 2 - Resource Manager allocates Container

When the Resource Manager accepts a new application submission, one of the first decisions
the Scheduler makes is selecting a container in which to start the Application Master. The
Application Master is then responsible for the entire life-cycle of that particular application.
First, it sends resource requests to the ResourceManager to ask for containers to run the
application's tasks.

A resource request is simply a request for a number of containers that satisfy resource
requirements such as the following:

The amount of resources, expressed as megabytes of memory and CPU shares

The preferred location, specified by a hostname or rack name

The priority within this application, not across multiple applications

The Resource Manager allocates a container by providing a container ID and a hostname that
satisfy the requirements of the Application Master.
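Expressed with the real AMRMClient.ContainerRequest class, such a request might look like the
following minimal sketch; the host and rack names, resource sizes, and priority are illustrative
assumptions.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    // Minimal sketch: 2 GB and 2 v-cores, preferring one host and one rack,
    // at priority 1 (priorities are scoped to this application only).
    public class RequestSketch {
        public static void main(String[] args) {
            Resource capability = Resource.newInstance(2048, 2);
            ContainerRequest request = new ContainerRequest(
                capability,
                new String[] { "node1.example.com" }, // preferred hosts (hypothetical)
                new String[] { "/rack1" },            // preferred racks (hypothetical)
                Priority.newInstance(1));
            System.out.println(request.getCapability());
        }
    }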

Step 3 - Application Master contacts Node Manager

After a container is allocated, the Application Master asks the Node Manager managing the host
on which the container was allocated to use these resources to launch an application-specific
task. This task can be any process written in any framework, such as a MapReduce task.
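This hand-off can be sketched with the real NMClient API; here, container is assumed to have
been returned by an earlier AMRMClient.allocate() call, and clc is a Container Launch Context
like the one sketched earlier.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.client.api.NMClient;

    // Minimal sketch: asking the Node Manager that owns an allocated
    // container to launch a task in it.
    public class LaunchSketch {
        static void launch(Container container, ContainerLaunchContext clc) throws Exception {
            NMClient nmClient = NMClient.createNMClient();
            nmClient.init(new Configuration());
            nmClient.start();
            nmClient.startContainer(container, clc);
        }
    }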

Step 4 - Node Manager Launches Container

The NodeManager does not monitor tasks; it only monitors the resource usage in the containers.

For example, it kills a container if it consumes more memory than initially allocated.

Throughout its life, the Application Master negotiates containers to launch all of the tasks
needed to complete its application. It also monitors the progress of an application and its tasks,
restarts failed tasks in newly requested containers, and reports progress back to the client that
submitted the application.

Step 5 - Container Executes the Application Master

After the application is complete, the Application Master shuts itself down and releases its own
container. Though the ResourceManager does not monitor the tasks within an application, it
checks the health of the Application Master. If the Application Master fails, it can be restarted
by the ResourceManager in a new container. Thus, the ResourceManager looks after the
Application Master, while the Application Master looks after the tasks.

Tools for YARN Development

Hadoop includes three tools for YARN developers:

YARN Web UI

Hue Job Browser

YARN Command Line

These tools enable developers to submit, monitor, and manage jobs on the YARN cluster.

YARN Web UI

The YARN web UI, which runs on port 8088 by default (http://<resource-manager-host>:8088),
provides a more detailed view than Hue; however, you cannot control or configure jobs from the
YARN web UI.

Hue Job Browser

The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.

YARN Command Line

Most of the YARN commands are for the administrator rather than the developer.

A few useful commands for the developer are as follows:

To list all YARN commands:

    yarn -help

To print the version:

    yarn version

To view the logs of a specified application ID:

    yarn logs -applicationId <app-id>
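Two more standard yarn commands that developers often find useful:

To list the applications running on the cluster:

    yarn application -list

To kill a running application:

    yarn application -kill <app-id>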


Next Step to Success

To learn more and gain an in-depth understanding of Hadoop, you can enroll in the Big Data
Engineer Master's Program. This program, in collaboration with IBM, provides online training on
the popular skills required for a successful career in data engineering. Master the Hadoop Big
Data framework, leverage the functionality of AWS services, and use the database management
tool MongoDB to store data.

