
Module 5

Understanding Hadoop YARN Architecture


Advantages, Architecture, Working, YARN Schedulers
• RDBMS (Relational Database Management System): An RDBMS is an information
management system based on a data model. In an RDBMS, tables are used for
information storage: each row of a table represents a record and each column
represents an attribute of the data. The organization of data and its
manipulation differ in an RDBMS from other databases. An RDBMS ensures the
ACID (atomicity, consistency, isolation, durability) properties required for
designing a database. The purpose of an RDBMS is to store, manage, and
retrieve data as quickly and reliably as possible.
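
To make the ACID guarantee concrete, here is a minimal JDBC sketch of a transactional transfer; the connection URL and the accounts table are assumptions for illustration, and any RDBMS with transaction support would behave the same way: either both updates commit, or both roll back.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class TransferDemo {
        public static void main(String[] args) throws SQLException {
            // Hypothetical JDBC URL and credentials.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/bank", "user", "secret")) {
                conn.setAutoCommit(false);          // start an explicit transaction
                try (PreparedStatement debit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance - ? WHERE id = ?");
                     PreparedStatement credit = conn.prepareStatement(
                         "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
                    debit.setInt(1, 100); debit.setInt(2, 1); debit.executeUpdate();
                    credit.setInt(1, 100); credit.setInt(2, 2); credit.executeUpdate();
                    conn.commit();                  // atomicity: both updates or none
                } catch (SQLException e) {
                    conn.rollback();                // on failure, leave data consistent
                    throw e;
                }
            }
        }
    }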
• Hadoop: An open-source software framework used for storing data and running
applications on clusters of commodity hardware. It offers large storage
capacity and high processing power, and can manage many concurrent processes
at the same time. It is used in predictive analytics, data mining, and machine
learning, and can handle both structured and unstructured data. It is more
flexible in storing, processing, and managing data than a traditional RDBMS.
Unlike traditional systems, Hadoop enables multiple analytical processes to run
on the same data at the same time, and it scales very flexibly.
Issues with Non-Relational Databases
• Not suitable for data involving transactions.
• Not suitable when data needs to be structured.
• Not suitable for workloads with many concurrent read, write, and update operations.
• Not suitable when data integrity has to be maintained.
Polyglot Persistence
• Polyglot Persistence means that, when storing data, it is best to use
multiple data storage technologies, chosen according to the way the data is
used by individual applications or by components of a single application.
• Different kinds of data are best handled by different data stores.
• An e-commerce platform, for example, deals with many types of data
(shopping carts, inventory, completed orders, etc.).
• Instead of forcing all of this data into one database, which would require a
lot of conversion to make all the data fit one format, each kind of data is
stored in the database best suited to it, as sketched below.
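
A purely hypothetical Java sketch of the idea: every interface and class name below is a placeholder, not a real API. The point is that each kind of data is handled behind an interface whose implementation can be backed by whichever store fits it best.

    // Hypothetical polyglot persistence in an e-commerce service.
    interface CartStore     { void save(String userId, String cartJson); }   // e.g. a key-value store
    interface OrderStore    { void record(String orderId, String details); } // e.g. an RDBMS (transactions)
    interface CatalogSearch { java.util.List<String> search(String query); } // e.g. a search index

    class CheckoutService {
        private final CartStore carts;
        private final OrderStore orders;

        CheckoutService(CartStore carts, OrderStore orders) {
            this.carts = carts;
            this.orders = orders;
        }

        void placeOrder(String userId, String orderId, String details) {
            orders.record(orderId, details); // durable, transactional store for orders
            carts.save(userId, "{}");        // fast, schemaless store for session data
        }
    }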
Integrating Big Data with Traditional Data Warehouses: Challenges
• Data availability.
• Pattern study.
• Data incorporation and integration.
• Data volumes and exploration.
• Compliance and localised legal requirements.
• Storage performance.
YARN: “Yet Another Resource Negotiator”
• Hadoop version 1.0 is also referred to as MRV1 (MapReduce Version 1).
• In MRV1, MapReduce performed data processing as well as resource management
and job scheduling.
• It consisted of a single master, the JobTracker, which ran on the NameNode.
The JobTracker allocated resources, performed scheduling, and monitored the
processing jobs.
• It assigned map and reduce tasks to a number of subordinate processes called
TaskTrackers on the DataNodes. The TaskTrackers periodically reported their
progress to the JobTracker.
• This design resulted in a scalability bottleneck due to the single JobTracker.
• Apart from this limitation, the utilization of computational resources was
also inefficient in MRV1.
• To overcome these issues, YARN was introduced in Hadoop version 2.0 in the
year 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve
MapReduce by taking over the responsibility of resource management and job
scheduling.
With the introduction of YARN, the Hadoop ecosystem was completely
revolutionized.
YARN enabled the users to perform operations as per requirement by using a
variety of tools like Spark for real-time processing, Hive for SQL, HBase for
NoSQL and others.
YARN performs all your processing activities by allocating resources and scheduling
tasks.
Advantages of YARN
1. Efficient resource utilization among clusters: Since YARN supports dynamic
utilization of resources, it enables efficient resource utilization. It
provides a central resource manager which allows multiple applications to
share a common pool of cluster resources.
2. Running non-MapReduce applications: In YARN, the scheduling and resource
management capabilities are separated from the data processing component. This
allows Hadoop to run varied types of applications which do not conform to the
MapReduce programming model. Hadoop clusters are now capable of running
independent interactive queries and performing better real-time analysis.
3. Backward compatibility: YARN is a backward-compatible framework, which
means any existing MapReduce job can be executed in Hadoop 2.0.
4. JobTracker no longer exists: The two major roles of the JobTracker were
resource management and job scheduling. With the introduction of the YARN
framework, these are now segregated into two separate components, namely:
• ResourceManager (cluster-wide resource management)
• ApplicationMaster (per-application scheduling and monitoring)
YARN Architecture
The primary components of YARN architecture are:
1. Resource Manager: Has overall responsibility for controlling and managing
cluster resources.
2. Application Master (one per application): Allows a cluster to handle
multiple applications at a time. Each application in the cluster has its own
Application Master instance. An application can be a single job or multiple
jobs.
3. Node Manager: The NodeManager runs on the slave nodes. It is responsible
for monitoring the machine's resource usage (CPU, memory, disk, and network)
and reporting the same to the Resource Manager or Scheduler.
Resource Manager
The Resource Manager (RM), usually one per cluster, is the master server.
The Resource Manager knows the location of the DataNodes and how many
resources they have; this topology information is referred to as Rack
Awareness.
It has two major components:
A. Scheduler:
It performs scheduling based on the application's requirements and the
available resources.
It is a pure scheduler, meaning it does not perform other tasks such as
monitoring or tracking, and it does not guarantee a restart if a task fails.
B. Applications Manager:
It is responsible for accepting or rejecting applications and negotiating the
first container, in which the Application Master runs. It manages the running
Application Masters in the cluster, i.e., it is responsible for starting
Application Masters and for monitoring and restarting them on different nodes
in case of failures.
Node Managers
• There can be many Node Managers in one cluster. They are the slaves of the
infrastructure.
• Responsible for the execution of tasks on every single DataNode.
• When a Node Manager starts, it announces itself to the RM and periodically
sends a heartbeat to the RM.
• Each Node Manager offers resources to the cluster.
Application Master:
One Application Master runs per application.
It manages the user job's lifecycle and the resource needs of the individual
application. It works along with the Node Manager and monitors the execution
of tasks.
A request for appropriate resources is called a ‘ResourceRequest’. When the
Resource Manager approves it, a container is allocated.
Container: A collection of physical resources such as RAM, CPU cores, and disk
on a single node. Containers are launched via a Container Launch Context (CLC),
which is a record that contains information such as environment variables,
security tokens, dependencies, etc. A minimal CLC sketch follows.
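
A minimal sketch of building a CLC with the Hadoop 2.x YARN API; the shell command is a placeholder, and local resources, environment, service data, tokens, and ACLs are all left empty for brevity:

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

    public class ClcDemo {
        // Minimal CLC: just one shell command for the container to run.
        static ContainerLaunchContext buildClc() {
            return ContainerLaunchContext.newInstance(
                    Collections.emptyMap(),   // local resources (jars, files) to localize
                    Collections.emptyMap(),   // environment variables
                    Collections.singletonList("echo hello 1>>stdout 2>>stderr"),
                    null,                     // service data
                    null,                     // security tokens
                    null);                    // application ACLs
        }
    }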
Working of YARN
1. A client submits an application to the Resource Manager (RM).
2. The RM finds the necessary resources for launching an Application Master
specific to this application (container 0, on any slave node) and launches it.
3. The Application Master (AM) registers itself with the RM.
4. The AM requests resources from the RM for executing its application, using
a ResourceRequest.
5. The RM approves the request by allocating appropriate containers for this
application. A container can be on any node in the cluster, irrespective of
the node that hosts the AM's container.
6. The AM asks the Node Managers to launch the allotted container(s), passing
a Container Launch Context (CLC), so that the AM can directly communicate with
its containers.
7. The application code starts running within the containers and reports
information (progress, status, resource availability, ...) to its AM. The
client communicates directly with the AM to get status and details.
8. On completion of the application, the AM deregisters itself with the RM and
the containers are released.
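
The client side of steps 1 and 2 can be sketched with the YarnClient API (Hadoop 2.x). The application name, queue, and resource sizes below are arbitrary, and the AM container spec reuses the hypothetical buildClc() sketch from the previous section:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitDemo {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Step 1: ask the RM for a new application id.
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
            ctx.setApplicationName("yarn-demo");            // arbitrary name
            ctx.setQueue("default");
            ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM
            ctx.setAMContainerSpec(ClcDemo.buildClc());     // CLC from the sketch above

            // Step 2: the RM allocates a container and launches the AM in it.
            ApplicationId appId = ctx.getApplicationId();
            yarnClient.submitApplication(ctx);

            ApplicationReport report = yarnClient.getApplicationReport(appId);
            System.out.println("State: " + report.getYarnApplicationState());
            yarnClient.stop();
        }
    }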
YARN Schedulers
The Scheduler is only responsible for allocating resources to applications
submitted to the cluster, applying constraints such as capacities and queues.

The Scheduler does not provide any guarantee for job completion or monitoring;
it only allocates the cluster resources, governed by the nature of the job and
its resource requirements.
• FIFO Scheduler: the default with plain vanilla Hadoop, typically used for
exploratory purposes.
• Fair Scheduler: resources are allocated to all subsequent jobs in a fair
manner; the default with the Cloudera distribution.
• Capacity Scheduler: essentially a FIFO Scheduler within each queue; the
default with the Hortonworks distribution.
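
The active scheduler is selected in yarn-site.xml via the ResourceManager's scheduler class; a minimal sketch using the class names from Apache Hadoop 2.x:

    <!-- yarn-site.xml: pick one scheduler implementation -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <!-- or ...scheduler.fair.FairScheduler / ...scheduler.fifo.FifoScheduler -->
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>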
1. YARN Capacity Scheduler
The default scheduler in recent Apache Hadoop releases.
The Capacity Scheduler maintains a separate queue for small jobs in order to
start them as soon as a request arrives.
However, this comes at a cost: since cluster capacity is divided, large jobs
will take more time to complete.
It maximizes the throughput and the utilization of the cluster.
Capacity Scheduler - features
1. Hierarchical queues
The Capacity Scheduler in Hadoop works on the concept of queues. Each
organization gets its own dedicated queue with a percentage of the total
cluster capacity for its own use.
2. Capacity guarantees
Sharing a cluster among organizations is a cost-effective way to maximize
resource utilization.
3. Security
Every organization has a unique queue. Each queue has strict ACLs (Access
Control Lists) which control which users can submit applications to individual
queues.
4. Resource-based scheduling
The Capacity Scheduler supports resource-intensive applications that require
higher resource specifications than the default. An example queue setup is
sketched below.
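
As an illustration, a minimal capacity-scheduler.xml that splits the cluster between two hypothetical organizations, 'prod' and 'dev' (the queue names and percentages are assumptions; the property names are the standard Capacity Scheduler ones):

    <configuration>
      <!-- Two hypothetical queues under root -->
      <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>prod,dev</value>
      </property>
      <!-- Guaranteed shares of cluster capacity, in percent -->
      <property>
        <name>yarn.scheduler.capacity.root.prod.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.dev.capacity</name>
        <value>30</value>
      </property>
      <!-- dev may elastically grow up to 50% when prod is idle -->
      <property>
        <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
        <value>50</value>
      </property>
    </configuration>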
2. YARN Fair Scheduler
Fair scheduling is a method of assigning resources to applications such that
all apps get, on average, an equal share of resources over time.
When there is a single app running, that app uses the entire cluster. When
other apps are submitted, resources that free up are assigned to the new apps,
so that each app eventually gets roughly the same amount of resources.
This lets short apps finish in a reasonable time while not starving long-lived
apps in the queue.
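
Fair Scheduler queues and their shares are typically defined in an allocations file (fair-scheduler.xml); a minimal sketch with hypothetical queue names and weights:

    <allocations>
      <!-- 'analytics' gets twice the share of 'adhoc' when both are busy -->
      <queue name="analytics">
        <weight>2.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
      </queue>
      <queue name="adhoc">
        <weight>1.0</weight>
      </queue>
    </allocations>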
3. FIFO Scheduler
• FIFO means First In, First Out. As the name indicates, the job submitted
first gets priority to execute. FIFO is a queue-based scheduler.
• It allocates resources based on arrival time. If there is a long-running job
which takes up all the capacity, resources will not be allocated to other jobs
until that job reaches a point where its required resources are less than the
capacity of the cluster.
• For this reason, if a critical small job is submitted while a long-running
job is executing, it has to wait until the earlier job no longer requires all
the capacity.
Hive
• Hadoop provides an open-source data warehouse system called Apache Hive,
through which data stored in HDFS can be accessed.
• Works on structured/semi-structured data in Hadoop.
• Hive is like a vehicle that uses the MapReduce engine.
• Hive itself is lightweight: it has no data storage capacity of its own; the
data resides in HDFS.
• Commonly used in warehousing applications to perform batch processing.
• Hive was initially developed by Facebook; later the Apache Software
Foundation took it up and developed it further as open source under the name
Apache Hive.
• Apache Hive uses the Hive Query Language (HQL), a declarative language
similar to SQL.
• Hive translates Hive queries into MapReduce programs.
• It is used by different companies. For example, Amazon uses it in Amazon
Elastic MapReduce.
Hive is not….
• A relational database
• A design for OnLine Transaction Processing (OLTP) such as online
ticketing or bank transactions.
• A language for real-time queries and row-level updates.
Hive Architecture
1. Hive Clients:
Hive supports applications written in many languages like Java, C++, Python,
etc. using JDBC, Thrift, and ODBC drivers. Hence one can always write a Hive
client application in a language of one's choice.
• Thrift Clients: As the Hive server is based on Apache Thrift, it can serve
requests from all programming languages that support Thrift.
• JDBC Clients: Hive allows Java applications to connect to it using the JDBC
driver (a minimal connection sketch follows this list).
• ODBC Clients: The Hive ODBC Driver allows applications that support the ODBC
protocol to connect to the Hive server. (Like the JDBC driver, the ODBC driver
uses Thrift to communicate with the Hive server.)
• Thrift is an RPC framework for building cross-platform services. Using it,
you can implement a function in Java, host it on a server, and then remotely
call it from Python.
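
A minimal JDBC client sketch; HiveServer2 listens on port 10000 by default, while the host, database, credentials, and the 'orders' table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcDemo {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver, shipped with Hive as hive-jdbc*.jar
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Hypothetical table; the query is compiled to MapReduce jobs
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }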
2. Hive Services:
Apache Hive provides various services like the CLI, web interface, etc. to
perform queries.
• Hive CLI (Command Line Interface): The terminal window provided by Hive
where you can execute your Hive queries and commands directly.
• Apache Hive Web Interface: Apart from the command line interface, Hive also
provides a web-based GUI for executing Hive queries and commands.
• Hive Server: The Hive server is built on Apache Thrift and is therefore also
referred to as the Thrift Server; it allows different clients to submit
requests to Hive and retrieve the final result.
• Apache Hive Driver: It is responsible for receiving the queries submitted by
the clients. The driver works in 3 steps:
1. Compiler: The driver passes the query to the compiler, where parsing, type
checking, and semantic analysis take place with the help of the schema present
in the metastore.
2. Optimizer: In the next step, an optimized logical plan is generated in the
form of a DAG (Directed Acyclic Graph) of map-reduce tasks and HDFS tasks.
3. Executor: Finally, the execution engine executes these tasks in the order
of their dependencies, using Hadoop, as per the plan given by the compiler.
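
The plan the compiler produces can be inspected from any client with HQL's EXPLAIN statement. A fragment reusing the conn connection from the JDBC sketch earlier (the query and table are hypothetical):

    // Print the stages (the plan DAG) Hive's compiler produced for a query.
    try (Statement stmt = conn.createStatement();
         ResultSet plan = stmt.executeQuery(
             "EXPLAIN SELECT customer, SUM(amount) FROM orders GROUP BY customer")) {
        while (plan.next()) {
            System.out.println(plan.getString(1)); // one line of the plan per row
        }
    }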
• Metastore: A central repository for storing all the Hive metadata. Hive
metadata includes various types of information, such as the structure of
tables and partitions, along with column and column-type details.
• By default, the metadata is stored in Apache Derby, an embedded relational
database, used to keep metadata access low-latency.
• JAR: You can add files to Hive; some JARs are shipped with Hive.
• JAR stands for Java ARchive. It's a file format based on the popular ZIP
file format and is used for aggregating many files into one. Example: Java
applets and their requisite components (.class files, images, and sounds) can
be downloaded to a browser in a single HTTP transaction.
3. Processing and Resource Management:
Hive uses MapReduce v1 & v2 (batch processing) and Tez (interactive batch
processing) for parallel processing, and YARN for resource management.

4. Distributed Storage: Hive uses HDFS for storing data.

Working of Hive
• Step 1: executeQuery: The user interface calls the execute interface of the driver.
• Step 2: getPlan: The driver accepts the query and passes it to the compiler
for generating the execution plan.
• Step 3: getMetaData: The compiler sends a metadata request to the metastore.
• Step 4: sendMetaData: The metastore sends the metadata to the compiler.
• The compiler uses this metadata for performing type checking and semantic
analysis. The compiler then generates the execution plan (a Directed Acyclic
Graph). For MapReduce jobs, the plan contains map operator trees (operator
trees executed on the mappers) and a reduce operator tree (operator trees
executed on the reducers).
• Step 5: sendPlan: The compiler then sends the generated execution plan to the driver.
• Step 6: executePlan: After receiving the execution plan from the compiler,
the driver sends it to the execution engine for executing the plan.
• Step 7: submit job to MapReduce: The execution engine then sends these
stages of the DAG to the appropriate components.
Hive Built-in Functions
Hive ships with several categories of built-in functions, such as collection
functions, date functions, mathematical functions, conditional functions, and
string functions, used to perform mathematical, arithmetic, logical, and
relational operations on the operands of table columns.
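
For instance, the function names below are standard Hive built-ins, while the orders table and its columns are hypothetical; the query could be issued through the JDBC connection conn sketched earlier:

    // String, math, date, and conditional built-ins in one query.
    String hql = "SELECT upper(customer), "                  // string function
               + "       round(amount, 2), "                 // mathematical function
               + "       year(order_date), "                 // date function
               + "       if(amount > 100, 'big', 'small') "  // conditional function
               + "FROM orders";
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(hql)) {
        while (rs.next()) {
            System.out.printf("%s %.2f %d %s%n",
                    rs.getString(1), rs.getDouble(2), rs.getInt(3), rs.getString(4));
        }
    }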
