Yarn Tutorial
Yarn Tutorial
Video Tutorials Articles Ebooks Live Webinars On-demand Webinars Free Practice Tests
Home Resources Big Data Hadoop Tutorial: Getting Started with Hadoop Yarn Tutorial
Yarn Tutorial
Lesson 10 of 16 By Simplilearn
Previous Next
Tutorial Playlist
Table of Contents
What is Yarn?
YARN Infrastructure
View More
YARN is the acronym for Yet Another Resource Negotiator YARN is a resource manager created
YARN is the acronym for Yet Another Resource Negotiator. YARN is a resource manager created
by separating the processing engine and the management function of MapReduce. It monitors
and manages workloads, maintains a multi-tenant environment, manages the high availability
features of Hadoop, and implements security controls.
Get trained in Yarn, MapReduce, Pig, Hive, HBase, and Apache Spark with the Big Data
Hadoop Certification Training Course. Enroll now!
Before beginning the details of the YARN tutorial, let us understand what is YARN.
What is Yarn?
Before 2012, users could write MapReduce programs using scripting languages such as Java,
Python, and Ruby. They could also use Pig, a language used to transform data. No matter what
language was used, its implementation depended on the MapReduce processing model.
In May 2012, during the release of Hadoop version 2.0, YARN was introduced. You are no longer
limited to working with the MapReduce framework anymore as YARN supports multiple
processing models in addition to MapReduce, such as Spark. Other features of YARN include
significant performance improvement and a flexible execution engine.
ENROLL NOW
Now that we have learned about YARN, let us next take a look at the Yarn use case as a part of
this Yarn tutorial.
Yahoo was the first company to embrace Hadoop and this became a trendsetter within the
Hadoop ecosystem. In late 2012, Yahoo struggled to handle iterative and stream processing of
data on the Hadoop infrastructure due to MapReduce limitations.
Both iterative and stream processing was important for Yahoo in facilitating its move from batch
computing to continuous computing.
After implementing YARN in the first quarter of 2013, Yahoo installed more than 30,000
production nodes on
Hadoop for batch processing, allowing it to handle more than 100 billion events such as
clicks, impressions, email content, metadata, and so on per day.
This was possible only after YARN was introduced and multiple processing frameworks were
implemented. The single-cluster approach provides a number of advantages, including:
Lower operational costs because only one "do-it-all" cluster needs to be managed
Reduced data motion as there's no need to move data between Hadoop YARN and systems
running on different clusters of computers
Let us next look at the yarn architecture as a part of this Yarn tutorial.
YARN Infrastructure
The YARN Infrastructure is responsible for providing computational resources such as CPUs or
memory needed for application executions.
YARN infrastructure and HDFS are completely independent. The former provides resources for
running an application while the latter provides storage.
The MapReduce framework is only one of the many possible frameworks that run on YARN. The
fundamental idea of MapReduce version-2 is to split the two major functionalities of resource
management and job scheduling and monitoring into separate daemons.
In the next section, we will discuss YARN and its architecture as a part of this Yarn tutorial.
Resource Manager
Application Master
Node Managers
These three Elements of YARN Architecture are shown in the given below diagram.
Resource Manager
The ResourceManager, or RM, which is usually one per cluster, is the master server. Resource
Manager knows the location of the DataNode and how many resources they have. This
information is referred to as Rack Awareness. The RM runs several services, the most important
of which is the Resource Scheduler that decides how to assign the resources.
Application Master
The Application Master is a framework-specific process that negotiates resources for a single
application, that is, a single job or a directed acyclic graph of jobs, which runs in the first
container allocated for the purpose. Each Application Master requests resources from the
Resource Manager and then works with containers provided by Node Managers.
Node Managers
The Node Managers can be many in one cluster. They are the slaves of the infrastructure. When
it starts, it announces itself to the RM and periodically sends a heartbeat to the RM.
Each Node Manager offers resources to the cluster The resource capacity is the amount of
Each Node Manager offers resources to the cluster. The resource capacity is the amount of
memory and the number of v-cores, short for the virtual core. At run-time, the Resource
Scheduler decides how to use this capacity. A container is a fraction of the NodeManager
capacity, and it is used by the client to run a program. Each Node Manager takes instructions
from the ResourceManager and reports and handles containers on a single node.
The first element of YARN architecture is ResourceManager. The RM mediates the available
resources in the cluster among competing applications with the goal of maximum cluster
utilization.
It includes a pluggable scheduler called the YarnScheduler, which allows different policies for
managing constraints such as capacity, fairness, and Service Level Agreements. The Resource
Manager has two main components - Scheduler and Applications Manager. Let us understand
each of them in detail.
The Scheduler is responsible for allocating resources to various running applications depending
on the common constraints of capacities, queues, and so on. The Scheduler does not monitor or
track the status of the application. Also, it does not restart the tasks in case of any application or
hardware failures. The Scheduler performs its function based on the resource requirements of
the applications. It does so based on the abstract notion of a resource container that
incorporates elements such as memory, CPU, disk, and network. The Scheduler has a policy
plugin which is responsible for partitioning the cluster resources among various queues and
applications. The current MapReduce schedulers such as the Capacity Scheduler and the Fair
Scheduler are some examples of the plug-in.
The Capacity Scheduler supports hierarchical queues to enable a more predictable sharing of
cluster resources.
The Application Manager is an interface which maintains a list of applications that have been
submitted, currently running, or completed. The Application Manager is responsible for accepting
job-submissions, negotiating the first container for executing the application-specific Application
Master and restarting the Application Master container on failure.
Let’s discuss how each component of YARN Architecture works together. First, we will
understand how the Resource Manager operates.
ENROLL NOW
The Resource Manager communicates with the clients through an interface called the Client
Service. A client can submit or terminate an application and gain information about the
scheduling queue or cluster statistics through the Client Service.
Administrative requests are served by a separate interface called the Admin Service through
which operators can get updated information about the cluster operation.
In parallel, the Resource Tracker Service receives node heartbeats from the Node Manager to
track new or decommissioned nodes.
The NM Liveliness Monitor and Nodes List Manager keep an updated status of which nodes are
healthy so that the Scheduler and the Resource Tracker Service can allocate work appropriately.
The Application Master Service manages Application Masters on all nodes, keeping the
Scheduler informed. The AM Liveliness Monitor keeps a list of Application Masters and their last
heartbeat times to let the Resource Manager know what applications are healthy on the cluster.
Any Application Master that does not send a heartbeat within a certain interval is marked as
dead and re-scheduled to run on a new container.
Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster. The
High Availability, or HA, the feature adds redundancy in the form of an Active/Standby Resource
Manager pair to remove this single point of failure.
Resource Manager HA is realized through the Active/Standby architecture. At any point in time,
one of the RMs is active and one or more RMs are in Standby mode waiting to take over, should
anything happen to the Active. The trigger to transition-to-active comes from either the admin
through the Command-Line Interface or through the integrated failover-controller.
The RMs have an option to invade the zookeeper base active standby Elector to decide which
RMs should be active. Only active go down or become unresponsive, another RMs is
automatically Elector to be active. Note there is no need to run a separate ZKFC Demon like in
HDFS. Because the active standby Elector embedded in RMs acts as a failure to a detector and
leads to an Elector.
In the next section, let us look at the second most important YARN Architecture element,
Application Master.
The second element of YARN architecture is the Application Master. The Application Master in
YARN is a framework-specific library, which negotiates resources from the RM and works with
the NodeManager or Managers to execute and monitor containers and their resource
consumption.
While an application is running, the Application Master manages the application lifecycle,
dynamic adjustments to resource consumption, execution flow, faults, and it provides status and
metrics.
The Application Master is architected to support a specific framework and can be written in any
language. It uses extensible communication protocols with the Resource Manager and the Node
Manager.
The Application Master can be customized to extend the framework or run any other code.
Because of this, the Application Master is not considered trustworthy and is not run as a trusted
service.
In reality, every application has its own instance of an Application Master. However, it is feasible
In reality, every application has its own instance of an Application Master. However, it is feasible
to implement an Application Master to manage a set of applications, for example, an Application
Master for Pig or Hive to manage a set of MapReduce jobs.
The third element of YARN architecture is the Node Manager. When a container is leased to an
application, the NodeManager sets up the container environment. The environment includes the
resource constraints specified in the lease and any kind of dependencies, such as data or
executable files.
The Node Manager monitors the health of the node, reporting to the ResourceManager when a
hardware or software issue occurs so that the Scheduler can divert resource allocations to
healthy nodes until the issue is resolved. The Node Manager also offers a number of services to
containers running on the node such as a log aggregation service.
The Node Manager runs on each node and manages the activities such as container lifecycle
management, container dependencies, container leases, node and container resource usage,
node health, and log management and reports node and container status to the Resource
Manager.
The Application Master must provide a Container Launch Context or CLC. This includes
information such as Environment variables, dependencies on the requirement of data files or
shared objects prior to the launch, security tokens, and the command to create the process to
launch the application.
The CLC supports the Application Master to use containers. This helps to run a variety of
different kinds of work, from simple shell scripts to applications to a virtual operating system.
Applications on YARN
Owing to YARN is the generic approach, a Hadoop YARN cluster runs various work-loads. This
means a single Hadoop cluster in your data center can run MapReduce, Storm, Spark, Impala,
and more.
Users submit applications to the Resource Manager by typing the Hadoop jar command.
The Resource Manager maintains the list of applications on the cluster and available resources
on the Node Manager. The Resource Manager determines the next application that receives a
portion of the cluster resource. The decision is subject to many constraints such as queue
capacity, Access Control Lists, and fairness.
When the Resource Manager accepts a new application submission, one of the first decisions
the Scheduler makes is selecting a container. Then, the Application Master is started and is
responsible for the entire life-cycle of that particular application.
First, it sends resource requests to the ResourceManager to ask for containers to run the
application's tasks.
A resource request is simply a request for a number of containers that satisfy resource
requirements such as the following:
Amount of resources expressed as megabytes of memory and CPU shares Preferred location,
specified by hostname or rackname, Priority within this application and not across multiple
applications.
After a container is allocated, the Application Master asks the Node Manager managing the host
on which the container was allocated to use these resources to launch an application-specific
task. This task can be any process written in any framework, such as a MapReduce task.
The NodeManager does not monitor tasks; it only monitors the resource usage in the containers.
For example, it kills a container if it consumes more memory than initially allocated.
Throughout its life, the Application Master negotiates containers to launch all of the tasks
needed to complete its application. It also monitors the progress of an application and its tasks,
restarts failed tasks in newly requested containers, and reports progress back to the client that
submitted the application.
After the application is complete, the Application Master shuts itself and releases its own
container. Though the ResourceManager does not monitor the tasks within an application, it
g g pp ,
checks the health of the ApplicationMaster. If the ApplicationMaster fails, it can be restarted by
the ResourceManager in a new container. Thus, the resource manager looks after the
ApplicationMaster, while the ApplicationMaster looks after the tasks.
YARN Web UI
These tools enable developers to submit, monitor, and manage jobs on the YARN cluster.
YARN Web UI
YARN web UI runs on 8088 port by default. It also provides a better view than Hue; however, you
cannot control or configure from YARN web UI.
The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.
Most of the YARN commands are for the administrator rather than the developer.
-yarn -help
It lists all the commands of yarn.
- yarn -version
Preparing for the CCA175 exam? Take up this Big Data and Hadoop Developer Practice Test
and assess your preparedness.
To learn more and get an in-depth understanding of Hadoop and you can enroll in the Big Data
Engineer Master’s Program. This program in collaboration with IBM provides online training on
the popular skills required for a successful career in data engineering. Master the Hadoop Big
Data framework, leverage the functionality of AWS services, and use the database management
tool MongoDB to store data.
Find our Big Data Hadoop and Spark Developer Online Classroom training classes in
top cities:
Simplilearn
Simplilearn is one of the world’s leading providers of online training for Digital Marketing, Cloud
Computing, Project Management, Data Science, IT, Software Development, and many other em…
View More
Recommended Programs
Big Data Hadoop Certification Training Course in Ahmedabad Big Data Hadoop Training
Course in Ameerpet Big Data Hadoop Certification Training Course in Bangalore Big Data
Hadoop Certification Training Course in Chennai Big Data Hadoop Certification Training
Course in Delhi Big Data Hadoop Certification Training Course in Kolkata Big Data
Hadoop Certification Training Course in Mumbai Big Data Hadoop Certification Training
Recommended Resources
Disclaimer
PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.