CH 4 BDA
PECAIML601A
CHAPTER-4
1 MARK QUESTIONS
1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the
JobTracker.
• TaskTracker. A TaskTracker receives the information necessary for executing a task from the JobTracker, executes the task, and sends the results back to the JobTracker.
2. ___________ part of the MapReduce is responsible for processing one or more chunks of data and
producing the output results.
• The Map task part of MapReduce is responsible for processing one or more chunks of data and producing the output results.
3. _________ function is responsible for consolidating the results produced by each of the Map ()
functions/tasks.
• The Reduce() function is responsible for consolidating the results produced by each of the Map() functions/tasks.
5. The CapacityScheduler supports _____________ queues to allow for more predictable sharing of
cluster resources.
• Hierarchical
6. Users can bundle their Yarn code in a _________ file and execute it using jar command.
• Users can bundle their Yarn code in a JAR file and execute it using the jar command.
5 MARKS QUESTIONS
1. Draw Hadoop Yarn Architecture and also explain the components of Hadoop Yarn Architecture
A) SCHEDULER
• The scheduler is responsible for allocating resources to the running applications. It is a pure scheduler, meaning that it performs no monitoring or tracking of application status, and it offers no guarantee about restarting failed tasks, whether they fail because of an application error or a hardware failure.
B) APPLICATION MANAGER
• It manages running Application Masters in the cluster, i.e., it is responsible for starting application masters and for
monitoring and restarting them on different nodes in case of failures.
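• For illustration, here is a minimal sketch, in Java, of how a client hands an application over to the ResourceManager through the public YarnClient API; the Application Manager then starts and monitors the ApplicationMaster on the cluster. The application name, queue, resources, and launch command below are illustrative assumptions, not prescribed values.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");                // placeholder name
        ctx.setQueue("default");                           // scheduler queue
        ctx.setResource(Resource.newInstance(1024, 1));    // 1 GB, 1 vcore for the AM

        // Command that launches the ApplicationMaster (placeholder command)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.<String, LocalResource>emptyMap(),
                Collections.<String, String>emptyMap(),
                Collections.singletonList("java com.example.MyAppMaster"),
                null, null, null);
        ctx.setAMContainerSpec(amContainer);

        // From here on the scheduler allocates resources and the Application
        // Manager starts and monitors the ApplicationMaster.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}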
• The classic MapReduce model is a programming model and framework introduced by Google, which forms the
foundation for processing large-scale data in a distributed and parallel manner. It consists of two primary
components: the Map function and the Reduce function.
• The Map function takes an input dataset and applies a user-defined transformation to each element independently.
It generates a set of intermediate key-value pairs as output, where the key represents a category or identifier, and
the value is the result of the transformation. The Map function is designed to be parallelizable, allowing multiple
instances of the function to be executed in parallel on different parts of the dataset.
• After the Map function is applied to the entire dataset, the intermediate key-value pairs are shuffled and sorted
based on their keys. This process groups together all the values associated with the same key, preparing them for
the Reduce function.
• The Reduce function takes the intermediate key-value pairs as input and performs an aggregation or summarization
operation on each group of values associated with a specific key. The Reduce function produces a set of final output
key-value pairs, where the key typically represents a unique category or result, and the value represents the
aggregated or summarized result.
• The classic MapReduce model provides fault tolerance by automatically handling failures and rerunning failed tasks
on other nodes in the distributed system. It also optimizes data movement by minimizing network communication,
as the intermediate key-value pairs are shuffled and sorted locally before being passed to the Reduce function.
• The classic MapReduce model has been widely used in various big data processing frameworks, including Apache
Hadoop. It provides a scalable and efficient approach for processing large volumes of data by leveraging the parallel
processing capabilities of distributed systems. However, it is worth noting that newer frameworks and models have
emerged that build upon or enhance the classic MapReduce model, offering additional functionalities and
optimizations.
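• To make the model concrete, the following is a minimal single-machine sketch of a word-count job in Java: a map function that emits (word, 1) pairs, an in-memory shuffle that groups values by key, and a reduce function that sums each group. Real frameworks distribute these steps across a cluster; the class and method names here are illustrative only.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: for each input line, emit an intermediate (word, 1) pair per word
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: sum all values that were grouped under the same key
    static int reduce(String word, List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data needs big clusters", "data data data");

        // Shuffle and sort: group intermediate values by key (TreeMap keeps keys sorted)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce each group and print the final (word, total) pairs
        grouped.forEach((word, counts) ->
                System.out.println(word + "\t" + reduce(word, counts)));
    }
}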
15 MARKS QUESTIONS
1. Question
Answer b)
A) ClientService
o The client interface to the Resource Manager. This component handles all the RPC interfaces to the RM
from the clients including operations like application submission, application termination, obtaining queue
information, cluster statistics etc.
B) AdminService
o To make sure that admin requests don’t get starved due to the normal users’ requests and to give the
operators’ commands the higher priority, all the admin operations like refreshing node-list, the queues’
configuration etc. are served via this separate interface.
Components connecting RM to the nodes
A) ResourceTrackerService
o This is the component that obtains heartbeats from nodes in the cluster and forwards them to the YarnScheduler. It responds to RPCs from all the nodes, registers new nodes, and rejects requests from invalid or decommissioned nodes. It works closely with the NMLivelinessMonitor and the NodesListManager.
B) NMLivelinessMonitor
o Keeps track of live and dead nodes by tracking each node's last heartbeat time. Any node that doesn't send a heartbeat within a configured interval of time, by default 10 minutes, is deemed dead and is expired by the RM. All the containers currently running on an expired node are marked as dead, and no new containers are scheduled on such a node.
C) NodesListManager
o Manages valid and excluded nodes. Responsible for reading the host configuration files and seeding the
initial list of nodes based on those files. Keeps track of nodes that are decommissioned as time progresses.
a) ApplicationsManager
o Responsible for maintaining a collection of submitted applications. It also keeps a cache of completed applications so as to serve users' requests via the web UI or the command line long after the applications in question have finished.
b) ApplicationACLsManager
o The RM needs to gate user-facing APIs such as the client and admin requests so that they are accessible only to authorized users. This component maintains the ACLs per application and enforces them whenever a request such as killing an application or viewing an application's status is received.
c) ApplicationMasterLauncher
o Maintains a thread pool to launch AMs of newly submitted applications, as well as of applications whose previous AM attempts exited for some reason. It is also responsible for cleaning up the AM when an application has finished normally or has been forcefully terminated.
d) YarnScheduler
o The Yarn Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc. It performs its scheduling function based on the resource requirements of the applications, for example memory, CPU, disk, and network. Currently, only memory is supported, and support for CPU is close to completion.
e) ContainerAllocationExpirer
o This component is in charge of ensuring that all allocated containers are actually used by AMs and subsequently launched on the corresponding NMs.
• AMs run as untrusted user code and can potentially hold on to allocations without using them, and as such can cause
cluster under-utilization. To address this, ContainerAllocationExpirer maintains the list of allocated containers that
are still not used on the corresponding NMs.
• For any container, if the corresponding NM doesn’t report to the RM that the container has started running within
a configured interval of time, by default 10 minutes, then the container is deemed as dead and is expired by the RM.
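• The two expiry intervals above are ordinary YARN configuration properties. The sketch below reads them with the Java Configuration API; the property names and 10-minute defaults shown are assumptions and should be verified against the yarn-default.xml of the Hadoop version in use.

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ExpirySettings {
    public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();

        // NMLivelinessMonitor: a NodeManager that sends no heartbeat for this long
        // is declared dead by the RM (assumed default: 600000 ms = 10 minutes).
        long nmExpiryMs = conf.getLong(
                "yarn.nm.liveness-monitor.expiry-interval-ms", 600000L);

        // ContainerAllocationExpirer: an allocated container not launched on its
        // NodeManager within this interval is expired (assumed default: 600000 ms).
        long containerExpiryMs = conf.getLong(
                "yarn.resourcemanager.rm.container-allocation.expiry-interval-ms", 600000L);

        System.out.println("NM liveness expiry:          " + nmExpiryMs + " ms");
        System.out.println("Container allocation expiry: " + containerExpiryMs + " ms");
    }
}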
2. Question
a) What is MapReduce?
b) How do Map and Reduce work together?
c) What is a key-value pair in Hadoop? How is a key-value pair generated in MapReduce?
Answer a) MapReduce
• MapReduce is a programming model and framework for processing and analyzing large volumes of data in a
distributed and parallel manner. It was introduced by Google in 2004 and has since become a widely adopted
approach for big data processing.
• The MapReduce framework simplifies the task of writing distributed data processing applications by abstracting
away the complexity of parallelization, fault tolerance, and data distribution. It provides a high-level programming
model that allows developers to focus on the logic of their data transformations rather than the low-level details of
distributed computing.
• In the MapReduce paradigm, data processing is divided into two main stages: the Map stage and the Reduce stage.
o Map Stage: In this stage, a function called the "mapper" is applied to each input element in parallel. The
mapper takes an input key-value pair and produces intermediate key-value pairs as output. The
intermediate key-value pairs are not stored permanently but are passed on to the next stage.
o Shuffle and Sort: After the Map stage, the intermediate key-value pairs are sorted and grouped based on
their keys. This process is called shuffle and sort. It ensures that all intermediate values with the same key
are grouped together, allowing for efficient processing in the next stage.
o Reduce Stage: In this stage, a function called the "reducer" is applied to each group of intermediate key-
value pairs. The reducer takes a key and the corresponding set of values and produces a set of final output
key-value pairs. The reducer performs aggregation, summarization, or any other operation that requires
combining the values associated with a particular key.
• The MapReduce framework handles the parallel execution, fault tolerance, and data distribution automatically. It
divides the input data into smaller chunks and assigns them to different machines or processors in a cluster. The
mappers and reducers can run in parallel on different portions of the data, enabling efficient processing of large
datasets.
• MapReduce is designed to handle large-scale data processing tasks by leveraging the parallel processing capabilities
of a distributed system. It has been widely used in various big data processing frameworks, such as Apache Hadoop,
to perform tasks like data transformation, filtering, sorting, indexing, and more.
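• As a brief illustration of how little the application code has to do, a typical Hadoop MapReduce job is wired together by a small driver like the hedged sketch below; the framework then handles splitting, scheduling, shuffling, and fault tolerance. WordMapper and SumReducer stand in for user-defined classes (one possible pair is sketched under part (c) below), and the input/output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // User-defined map and reduce logic (assumed classes)
        job.setMapperClass(WordMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation
        job.setReducerClass(SumReducer.class);

        // Types of the final output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS (placeholders)
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // Submit the job and wait; map and reduce tasks run in parallel on the cluster
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}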
• Map and Reduce are two fundamental operations in distributed computing and parallel processing frameworks,
such as MapReduce. They work together to enable efficient processing of large volumes of data across multiple
machines or processors.
• The Map operation applies a transformation function to each element in a dataset independently, producing a set
of key-value pairs as output. This transformation function can be any operation or computation that can be applied
to individual elements of the dataset. The key-value pairs generated by the Map operation are often referred to as
intermediate key-value pairs.
• Once the Map operation has been performed on the entire dataset, the intermediate key-value pairs are grouped
based on their keys, and these groups are sent to the Reduce operation. The Reduce operation applies a specific
aggregation or summarization function to each group of intermediate key-value pairs, producing a final output for
each key. The aggregation function can be any operation that takes a set of values associated with a key and
produces a single value.
• The key idea behind the MapReduce paradigm is that the Map operation can be performed in parallel on different
portions of the dataset, with each machine or processor handling a subset of the data. This parallelization allows for
efficient processing of large datasets by distributing the workload across multiple computing resources. Once the
Map operation is completed, the intermediate key-value pairs can be shuffled and distributed to the Reduce
operations based on their keys, again allowing for parallel processing of different groups of intermediate data.
• The combination of Map and Reduce operations enables scalable and fault-tolerant processing of large-scale data.
By dividing the computation into independent map tasks and aggregating the results through the reduce tasks,
MapReduce frameworks can efficiently process data in parallel across a cluster of machines, minimizing data
movement and maximizing resource utilization. This approach has been widely adopted in big data processing
systems and has greatly contributed to the ability to handle massive datasets efficiently.
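• One concrete detail behind "distributed to the Reduce operations based on their keys": Hadoop routes each intermediate key to a reduce task by hashing it, so all pairs sharing a key land on the same reducer. The sketch below mirrors the rule used by the default HashPartitioner; the class name WordPartitioner is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which reduce task receives each intermediate (key, value) pair.
// Masking the sign bit keeps the partition index non-negative.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}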
• In MapReduce, the Map function processes a key-value pair and emits a number of intermediate key-value pairs, and the Reduce function processes the values grouped under the same key and emits another set of key-value pairs as output. The output types of the Map must match the input types of the Reduce, as shown below:
• Map: (K1, V1) -> list(K2, V2)
• Reduce: (K2, list(V2)) -> list(K3, V3)
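• To make these signatures and the generation of key-value pairs concrete, here is a hedged word-count sketch using the standard Hadoop Mapper and Reducer base classes. The framework's InputFormat/RecordReader first turns each input split into (K1, V1) = (byte offset, line) pairs; the mapper emits (K2, V2) = (word, 1) by calling context.write(); and the reducer emits (K3, V3) = (word, total). The class names are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (K1, V1) = (line offset, line text) -> list(K2, V2) = list((word, 1))
class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit an intermediate key-value pair
            }
        }
    }
}

// Reduce: (K2, list(V2)) = (word, [1, 1, ...]) -> list(K3, V3) = list((word, total))
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // emit the final key-value pair
    }
}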