
Explain YARN in detail.

The fundamental idea behind the YARN (Yet Another Resource Negotiator) architecture is to split the JobTracker's responsibilities of resource management and job scheduling/monitoring into separate daemons.

The daemons that are part of the YARN architecture are:

1. Global ResourceManager: The main responsibility of the Global ResourceManager is to distribute resources among the various applications.

It has two main components:

Scheduler: The pluggable scheduler of the ResourceManager decides the allocation of resources to the various running applications. The scheduler is just that, a pure scheduler, meaning it does NOT monitor or track the status of the application.

ApplicationManager: It is responsible for:

Accepting job submissions.

Negotiating resources (a container) for executing the application-specific ApplicationMaster.

Restarting the ApplicationMaster in case of failure.

2. NodeManager: This is a per-machine slave daemon. The NodeManager's responsibility is launching the application containers for application execution.

The NodeManager monitors resource usage such as memory, CPU, disk, and network.

It then reports the usage of resources to the global ResourceManager.

3. Per-Application ApplicationMaster: The per-application ApplicationMaster is an application-specific entity. Its responsibility is to negotiate the resources required for execution from the ResourceManager.
It works along with the NodeManager to execute and monitor the component tasks, as illustrated by the sketch below.
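
The negotiation step can be made concrete with Hadoop's AMRMClient API. The following is a minimal sketch, assuming an illustrative host name, tracking URL, and resource sizes (none of which come from the original text); it registers an ApplicationMaster with the ResourceManager and requests one container:

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppMasterSketch {
        public static void main(String[] args) throws Exception {
            AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
            rmClient.init(new YarnConfiguration());
            rmClient.start();

            // Register this ApplicationMaster with the ResourceManager
            // (host name and empty tracking URL are illustrative assumptions).
            rmClient.registerApplicationMaster("am-host.example.com", 0, "");

            // Negotiate one container: 1024 MB of memory, 1 virtual core.
            Resource capability = Resource.newInstance(1024, 1);
            Priority priority = Priority.newInstance(0);
            rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));
            // ... a heartbeat loop would call rmClient.allocate(progress) to receive
            // the granted containers and launch tasks via the NodeManager ...

            // Deregister so the ApplicationMaster's own container can be reclaimed.
            rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            rmClient.stop();
        }
    }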

The basic concepts of YARN are the Application and the Container.

Application: a job submitted to the system.

Ex: a MapReduce job.

Container: the basic unit of allocation. It replaces the fixed map/reduce slots and allows fine-grained resource allocation across multiple resource types.

E.g.: Container_0: 2 GB, 1 CPU

Container_1: 1 GB, 6 CPUs
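
In the YARN Java API, a container's capability is expressed with the Resource record. A minimal sketch mirroring the two example containers above (memory is given in MB; the class name ContainerSizes is an illustrative assumption):

    import org.apache.hadoop.yarn.api.records.Resource;

    public class ContainerSizes {
        public static void main(String[] args) {
            // Container_0 from the example above: 2 GB of memory, 1 virtual core.
            Resource container0 = Resource.newInstance(2048, 1);
            // Container_1: 1 GB of memory, 6 virtual cores.
            Resource container1 = Resource.newInstance(1024, 6);
            System.out.println(container0 + " " + container1);
        }
    }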

Fig. YARN Architecture

The steps involved in the YARN architecture are:

The client program submits an application.

The ResourceManager launches the ApplicationMaster by assigning a container.

The ApplicationMaster registers with the ResourceManager.

On successful container allocations, the ApplicationMaster launches each container by providing the container launch specification to the NodeManager.

The NodeManager executes the application code.

During application execution, the client that submitted the job communicates directly with the ApplicationMaster to get status and progress updates.

Once the application has been processed completely, the ApplicationMaster deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed. A sketch of the client-side submission step follows.
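
The first step, client submission, might look like the following minimal sketch using Hadoop's YarnClient API; the application name, launch command, and resource sizes are illustrative assumptions, not prescribed values:

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitApp {
        public static void main(String[] args) throws Exception {
            // Connect to the ResourceManager configured in yarn-site.xml.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            appContext.setApplicationName("demo-app"); // illustrative name

            // Describe the container that will run the ApplicationMaster.
            ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),                  // local resources (jars, files)
                    Collections.emptyMap(),                  // environment variables
                    Collections.singletonList("echo hello"), // illustrative launch command
                    null, null, null);
            appContext.setAMContainerSpec(amContainer);
            appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vCore for the AM

            // Submit; the ResourceManager then assigns a container and
            // launches the ApplicationMaster in it.
            ApplicationId appId = yarnClient.submitApplication(appContext);
            System.out.println("Submitted " + appId);
        }
    }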

Explain Hadoop Ecosystem in detail.

The following are the components of the Hadoop ecosystem:

HDFS: the Hadoop Distributed File System. It simply stores data files in a form as close to their original form as possible.

HBase: Hadoop's distributed, column-oriented database. It supports structured data storage for large tables.

Hive: Hadoop's data warehouse. It enables analysis of large data sets using a language very similar to SQL, so one can access data stored in a Hadoop cluster by using Hive.

Pig: an easy-to-understand data-flow language. It helps with the analysis of the large data sets that are typical with Hadoop, without writing code in the MapReduce paradigm.

ZooKeeper: an open-source service that configures and synchronizes distributed systems.

Oozie: a workflow scheduler system to manage Apache Hadoop jobs.

Mahout: a scalable machine learning and data mining library.

Chukwa: a data collection system for managing large distributed systems.

Sqoop: used to transfer bulk data between Hadoop and structured data stores such as relational databases.

Ambari: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.

Explain the following

Modules of Apache Hadoop framework

There are four basic or core components:

Hadoop Common: a set of common utilities and libraries that support the other Hadoop modules. It helps ensure that hardware failures are managed by the Hadoop cluster automatically.

Hadoop YARN: it allocates resources, which in turn allows different users to execute various applications without worrying about increased workloads.

HDFS: the Hadoop Distributed File System, which stores data in the form of small blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability.

Hadoop MapReduce: it executes tasks in a parallel fashion by distributing the data as small blocks. A minimal job that exercises all four modules is sketched below.
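
The following is a minimal sketch of the classic word-count job, included to show how the four modules cooperate; the input and output paths are illustrative assumptions:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {

        // Map phase: emit (word, 1) for every word in an input split.
        public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // Hadoop Common: shared config and utilities
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // HDFS: input blocks are read from and results written to the distributed file system.
            FileInputFormat.addInputPath(job, new Path("/input"));    // illustrative path
            FileOutputFormat.setOutputPath(job, new Path("/output")); // illustrative path
            // YARN schedules the resulting map and reduce tasks across the cluster.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }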

Hadoop Modes of Installation

Standalone, or local, mode: one of the least commonly used environments, intended only for running and debugging MapReduce programs. This mode uses neither HDFS nor launches any of the Hadoop daemons.

Pseudo-distributed mode (Cluster of One): runs all daemons on a single machine. It is most commonly used in development environments.

Fully distributed mode: most commonly used in production environments. This mode runs all daemons on a cluster of machines rather than on a single one.

XML File Configurations in Hadoop

core-site.xml: This configuration file contains Hadoop core configuration settings, for example, I/O settings common to MapReduce and HDFS.

mapred-site.xml: This configuration file specifies the framework name for MapReduce by setting mapreduce.framework.name.

hdfs-site.xml: This configuration file contains the HDFS daemons' configuration settings. It also specifies the default block replication and permission checking. These settings can also be read or overridden programmatically, as sketched below.
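
A minimal sketch, assuming Hadoop's standard Configuration class and common property names (fs.defaultFS, mapreduce.framework.name, dfs.replication); the values shown are common defaults, not prescriptions:

    import org.apache.hadoop.conf.Configuration;

    public class ShowConf {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath.
            Configuration conf = new Configuration();

            // core-site.xml: the default file system URI (an I/O setting).
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));

            // mapred-site.xml: run MapReduce on YARN.
            conf.set("mapreduce.framework.name", "yarn");

            // hdfs-site.xml: the default block replication factor (3 if unset).
            System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));
        }
    }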
