The document provides an overview of Hadoop YARN architecture, detailing its advantages, components, and working mechanism. It explains the transition from the original MapReduce version to YARN, which separates resource management and job scheduling, allowing for more efficient resource utilization and support for various applications. Additionally, it covers integration with traditional data warehouses, polyglot persistence, and introduces Apache Hive as a data warehousing solution for structured data in Hadoop.
Module 5
Understanding Hadoop YARN Architecture
Advantages, Architecture, Working, YARN Schedulers

RDBMS vs. Hadoop
• RDBMS (Relational Database Management System): An RDBMS is an information management system based on the relational data model. In an RDBMS, tables are used for information storage: each row of a table represents a record and each column represents an attribute of the data. The organization of data and its manipulation in an RDBMS differ from other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a reliable database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
• Hadoop: An open-source software framework used for storing data and running applications on a cluster of commodity hardware. It offers large storage capacity and high processing power, and can manage many concurrent processes at the same time. It is used in predictive analytics, data mining and machine learning. Hadoop can handle both structured and unstructured data, and is more flexible in storing, processing, and managing data than a traditional RDBMS. Unlike traditional systems, Hadoop enables multiple analytical processes to run on the same data at the same time, and it scales very flexibly.

Issues with Non-Relational Databases
• Not suitable for data involving transactions.
• Not suitable when data needs to be structured.
• Not suitable for heavy read, write and update workloads.
• Not suitable when data integrity has to be maintained.

Polyglot Persistence
• Polyglot persistence means that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application.
• Different kinds of data are best dealt with by different data stores.
• An e-commerce platform deals with many types of data (shopping cart, inventory, completed orders, etc.).
• Instead of trying to store all this data in one database, which would require a lot of data conversion to make the formats uniform, store each type of data in the database best suited for it.

Integrating Big Data with Traditional Data Warehouses – Challenges
• Data availability.
• Pattern study.
• Data incorporation and integration.
• Data volumes and exploration.
• Compliance and localised legal requirements.
• Storage performance.

YARN: “Yet Another Resource Negotiator”
• Hadoop version 1.0 is also referred to as MRV1 (MapReduce Version 1).
• In MRV1, MapReduce performed both processing and the resource-management and job-scheduling functions.
• It consisted of a Job Tracker, running on the Name node, which was the single master. The Job Tracker allocated the resources, performed scheduling and monitored the processing jobs.
• It assigned map and reduce tasks to a number of subordinate processes called Task Trackers on the Data nodes. The Task Trackers periodically reported their progress to the Job Tracker.
• This design resulted in a scalability bottleneck due to the single Job Tracker.
• Apart from this limitation, the utilization of computational resources was also inefficient in MRV1.
• To overcome these issues, YARN was introduced in Hadoop version 2.0 in 2012 by Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the responsibility of resource management and job scheduling. With the introduction of YARN, the Hadoop ecosystem was completely revolutionized: YARN enables users to work with a variety of tools as required, such as Spark for real-time processing, Hive for SQL, HBase for NoSQL and others. YARN performs all processing activities by allocating resources and scheduling tasks.

Advantages of YARN
1. Efficient resource utilization among clusters: Since YARN supports dynamic utilization of resources, it enables efficient resource utilization.
It provides a central resource manager which allows multiple applications to share a common pool of resources.
2. Running non-MapReduce applications: In YARN, the scheduling and resource-management capabilities are separated from the data-processing component. This allows Hadoop to run varied types of applications which do not conform to the MapReduce programming model. Hadoop clusters are now capable of running independent interactive queries and performing better real-time analysis.
3. Backward compatibility: YARN is a backward-compatible framework, which means any existing MapReduce job can be executed in Hadoop 2.0.
4. JobTracker no longer exists: The two major roles of the JobTracker were resource management and job scheduling. With the introduction of the YARN framework these are now segregated into two separate components, namely:
• ResourceManager
• NodeManager

YARN Architecture
The primary components of the YARN architecture are:
1. Resource Manager – has overall responsibility for controlling and managing cluster resources.
2. Application Master (one per application) – allows a cluster to handle multiple applications at a time. Each application in the cluster has its own Application Master instance. An application can be a single job or multiple jobs.
3. Node Manager – runs on the slave nodes. It is responsible for monitoring the machine's resource usage (CPU, memory, disk, network) and reporting it to the Resource Manager / Scheduler.

Resource Manager
The Resource Manager (RM), usually one per cluster, is the master server. The Resource Manager knows the location of the DataNodes and how many resources they have; this information is referred to as Rack Awareness. It has two major components:
A. Scheduler: Performs scheduling based on the application requirements and the available resources.
It is a pure scheduler: it does not perform other tasks such as monitoring or tracking, and does not guarantee a restart if a task fails.
B. Application Manager: Responsible for accepting or rejecting applications and for negotiating the first container from the Resource Manager. It manages the running Application Masters in the cluster, i.e., it is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.

Node Managers
• There can be many Node Managers in one cluster. They are the slaves of the infrastructure.
• Responsible for the execution of tasks on every single Data Node.
• When a Node Manager starts, it announces itself to the RM and periodically sends a heartbeat to the RM.
• Each Node Manager offers resources to the cluster.

Application Master: One Application Master runs per application. It manages the user job's lifecycle and the resource needs of the individual application. It works along with the Node Manager and monitors the execution of tasks. Requesting appropriate resources is done through a ‘ResourceRequest’; when the Resource Manager approves it, a container is created.

Container: A collection of physical resources such as RAM, CPU cores and disk on a single node. Containers are launched via a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.

Working of YARN
1. A client submits an application to the Resource Manager (RM).
2. The RM finds the necessary resources for launching an Application Master specific to this application (container 0, on any slave node) and launches it.
3. The Application Master (AM) registers itself with the RM.
4. The AM requests resources from the RM for executing its application, using a ResourceRequest.
5. The RM approves the request by allocating an appropriate container for this application. This container can be on any node in the cluster, irrespective of the node which hosts the AM's container.
6. The AM asks the Node Manager to launch the allotted container(s) using a Container Launch Context (CLC), so that the AM can communicate directly with its containers.
7. The application code starts running within the containers and reports information (progress, status, resource availability, …) to its AM. The client communicates directly with the AM to get status and details.
8. On completion of the application, the AM deregisters with the RM and its containers are released.

YARN Schedulers
The Scheduler is only responsible for allocating resources to applications submitted to the cluster, applying the constraints of capacities and queues. The Scheduler does not provide any guarantee of job completion or monitoring; it only allocates the cluster resources, governed by the nature of the job and its resource requirements.
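As a hedged illustration, the scheduler is pluggable and is selected in `yarn-site.xml` through the `yarn.resourcemanager.scheduler.class` property. The class names below are the standard Apache Hadoop ones; exact defaults and paths can differ between distributions and versions:

```xml
<!-- yarn-site.xml: selecting the YARN scheduler (illustrative fragment) -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- Capacity Scheduler -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <!-- To use the Fair Scheduler instead, set the value to:
       org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler -->
</property>
```

Queue capacities for the Capacity Scheduler are then configured separately (in `capacity-scheduler.xml`), while the Fair Scheduler reads an allocation file.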
• FIFO Scheduler – the original default in plain vanilla Hadoop, typically used for exploratory purposes.
• Fair Scheduler – resources are allocated to all subsequent jobs in a fair manner; the default in the Cloudera distribution.
• Capacity Scheduler – essentially a FIFO Scheduler within each queue; the default in the Hortonworks distribution.

1. YARN Capacity Scheduler
The default scheduler in recent Apache Hadoop releases. The Capacity Scheduler maintains a separate queue for small jobs in order to start them as soon as a request arrives. However, this comes at a cost: since cluster capacity is divided among queues, large jobs can take more time to complete. It maximizes the throughput and the utilization of the cluster.

Capacity Scheduler – features
1. Hierarchical queues: The Capacity Scheduler in Hadoop works on the concept of queues. Each organization gets its own dedicated queue with a percentage of the total cluster capacity for its own use.
2. Capacity guarantees: Sharing a cluster among organizations is a cost-effective way to maximize resource utilization.
3. Security: Every organization has a unique queue. Each queue has strict ACLs (Access Control Lists) which control which users can submit applications to it.
4. Resource-based scheduling: The Capacity Scheduler supports resource-intensive applications which have higher resource requirements than the default.

2. YARN Fair Scheduler
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. When a single app is running, it uses the entire cluster. When other apps are submitted, resources that free up are assigned to the new apps, so that each app eventually gets roughly the same amount of resources. This lets short apps finish in a reasonable time while not starving long-lived apps in the queue.

3. FIFO Scheduler
• FIFO means First In, First Out. As the name indicates, the job submitted first gets priority to execute. FIFO is a queue-based scheduler.
• Allocates resources based on arrival time.
• If there is a long-running job which takes up all the capacity, resources will not be allocated to other jobs until the running job reaches a point where its required resources are less than the capacity of the cluster.
• For the same reason, if a critical small job is submitted while a long-running job is executing, it has to wait until the earlier jobs no longer require all the capacity.

Hive
• Hadoop provides an open-source data warehouse system called Apache Hive, through which data stored in HDFS can be accessed.
• Works on structured/semi-structured data in Hadoop.
• Hive is like a vehicle which uses the MapReduce engine: it translates Hive queries into MapReduce programs.
• Hive is lightweight and does not have its own data storage capacity; the data itself resides in HDFS.
• Commonly used in warehousing applications to perform batch processing.
• Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
• Apache Hive uses the Hive Query Language (HQL), a declarative language similar to SQL.
• It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not…
• A relational database.
• A design for OnLine Transaction Processing (OLTP), such as online ticketing or bank transactions.
• A language for real-time queries and row-level updates.

Hive Architecture
1. Hive Clients: Hive supports applications written in many languages like Java, C++, Python, etc., using the JDBC, Thrift and ODBC drivers. Hence one can always write a Hive client application in a language of one's choice.
• Thrift Clients: As the Hive server is based on Apache Thrift, it can serve requests from all programming languages that support Thrift.
• JDBC Clients: Hive allows Java applications to connect to it using the JDBC driver.
• ODBC Clients: The Hive ODBC driver allows applications that support the ODBC protocol to connect to the Hive server. (Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.)
• Thrift is an RPC framework for building cross-platform services. Using it, you can implement a function in Java, host it on a server, and then remotely call it from Python.

2. Hive Services: Apache Hive provides various services, such as the CLI and a web interface, to perform queries.
• Hive CLI (Command Line Interface): The terminal window provided by Hive, where you can execute Hive queries and commands directly.
• Apache Hive Web Interface: Apart from the command-line interface, Hive also provides a web-based GUI for executing Hive queries and commands.
• Hive Server: The Hive server is built on Apache Thrift and is therefore also referred to as the Thrift Server; it allows different clients to submit requests to Hive and retrieve the final result.
• Apache Hive Driver: Responsible for receiving the queries. The Hive Driver works in 3 steps:
1. Compiler: The driver passes the query to the compiler, where parsing, type checking and semantic analysis take place with the help of the schema present in the metastore.
2. Optimizer: In the next step, an optimized logical plan is generated in the form of a DAG (Directed Acyclic Graph) of MapReduce tasks and HDFS tasks.
3. Executor: Finally, the execution engine executes these tasks in the order of their dependencies, using Hadoop, as per the plan given by the compiler.
• Metastore: A central repository for storing all Hive metadata. Hive metadata includes information such as the structure of tables and partitions, along with the columns and column types.
• By default the metadata is stored in Apache Derby, an embedded relational database, which keeps metadata lookups low-latency.
• JAR: You can add files to Hive, and some JARs are shipped with Hive. JAR stands for Java ARchive; it is a file format based on the popular ZIP format and is used for aggregating many files into one. For example, Java applets and their requisite components (.class files, images and sounds) can be downloaded to a browser in a single HTTP transaction.

3. Processing and Resource Management: Hive uses MapReduce v1 and v2 (batch processing) and Tez (interactive batch processing) for parallel processing, and YARN for resource management.
4. Distributed Storage: Hive uses HDFS for storing data.
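To make the architecture concrete, here is a hedged sketch of a typical Hive session. The table and column names (web_logs, ip, status, the HDFS path) are illustrative assumptions, not from the notes; the point is that only the schema lives in the metastore, the data stays in HDFS, and a declarative query is compiled into a MapReduce (or Tez) job:

```sql
-- Illustrative HQL session (table, columns and path are assumptions).
-- The table's data is stored as plain files in HDFS; only the schema
-- is recorded in the metastore.
CREATE TABLE web_logs (
  ip     STRING,
  url    STRING,
  status INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Move a file already in HDFS under the table's directory.
LOAD DATA INPATH '/data/logs/access.csv' INTO TABLE web_logs;

-- The driver compiles this declarative query into a MapReduce job.
SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
```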
Working of Hive
• Step 1: executeQuery: The user interface calls the execute interface of the driver.
• Step 2: getPlan: The driver accepts the query and passes it to the compiler for generating the execution plan.
• Step 3: getMetaData: The compiler sends a metadata request to the metastore.
• Step 4: sendMetaData: The metastore sends the metadata to the compiler. The compiler uses this metadata for type-checking and semantic analysis, and then generates the execution plan (a Directed Acyclic Graph). For MapReduce jobs, the plan contains map operator trees (operator trees executed on the mappers) and a reduce operator tree (operator trees executed on the reducers).
• Step 5: sendPlan: The compiler sends the generated execution plan to the driver.
• Step 6: executePlan: After receiving the execution plan from the compiler, the driver sends it to the execution engine for execution.
• Step 7: submit job to MapReduce: The execution engine then sends the stages of the DAG to the appropriate components.

Hive Built-in Functions
Hive provides built-in functions, such as collection functions, date functions, mathematical functions, conditional functions and string functions, to perform mathematical, arithmetic, logical and relational operations on the operands (table column names).
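The function categories above can be illustrated with a short query; the column and table names (orders, price, tags, …) are assumptions for the example, while the function names themselves are standard Hive built-ins:

```sql
-- Illustrative use of Hive built-in functions
-- (table and column names are assumptions):
SELECT
  concat(first_name, ' ', last_name) AS full_name,  -- string function
  upper(city)                        AS city_uc,    -- string function
  round(price, 2)                    AS price_2dp,  -- mathematical function
  year(order_date)                   AS order_year, -- date function
  if(qty > 10, 'bulk', 'retail')     AS order_type, -- conditional function
  size(tags)                         AS n_tags      -- collection function (array length)
FROM orders;
```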