Module 03 MapReduce - Distributed Off-line Batch Processing and Yarn - Resource Negotiator
www.huawei.com
Copyright © 2018 Huawei Technologies Co., Ltd. All rights reserved. Page 2
Contents
1. Introduction to MapReduce and YARN
4. Enhanced Features
MapReduce Overview
MapReduce is based on the MapReduce paper published by Google and is
used for parallel computing over massive data sets (larger than 1 TB).
It delivers the following highlights:
Easy to program: Programmers only need to describe what to do, and the
execution framework will do the job accordingly.
Outstanding scalability: Cluster capabilities can be improved by adding
nodes.
High fault tolerance: Cluster availability and fault tolerance are improved
by policies such as computation migration and data migration.
YARN Overview
Apache Hadoop YARN (Yet Another Resource Negotiator) is the
Hadoop resource manager introduced in Hadoop 2.0. It provides
unified resource management and scheduling for upper-layer
applications, which remarkably improves cluster resource
utilization and data sharing.
Position of YARN in FusionInsight
[Figure: FusionInsight architecture, showing the application service layer and the OpenAPI/SDK and REST/SNMP/Syslog interfaces.]
YARN is the resource management system of Hadoop 2.0. It is a general resource management module
that manages and schedules resources for applications.
Working Process of MapReduce (1)
Before starting MapReduce, make sure that the files to be processed are stored in
HDFS.
Commit: The MapReduce client submits a request to ResourceManager, and ResourceManager
creates a job. One application maps to one job (example job ID: job_201431281420_0001).
The job files are Job.jar, Job.split, and Job.xml.
Split: Before jobs are submitted, the files to be processed are split. By default, the
MapReduce framework regards a block as a split. Client applications can redefine
the mapping relation between blocks and splits.
After the jobs are submitted to ResourceManager, ResourceManager selects an
appropriate NodeManager in the cluster to run the ApplicationMaster based on
the workloads of NodeManagers. The ApplicationMaster initializes the job and applies
for resources from ResourceManager. ResourceManager then selects an appropriate
NodeManager to start the container for task execution.
Map: The outputs of Map are placed in a buffer in memory. When the buffer
overflows, data in the buffer must be written to local disks. Before that, the
following process must be completed:
2. Sort: The outputs of Map are sorted. For example, ('Hi','1'),('Hello','1') are
reordered as ('Hello','1'),('Hi','1').
Working Process of MapReduce (2)
3. Combine: This operation is optional. For example, ('Hi','1'),
('Hi','1'),('Hello','1'),('Hello','1') are combined into ('Hi','2'),('Hello','2').
4. Spill/Merge: After a Map task is processed, many spill files are generated. These spill
files must be merged into a single file (MOF: MapOutFile) that is partitioned and
sorted. To reduce the amount of data written to disks, MapReduce allows
MOFs to be compressed before they are written.
Copy: When the MOF output progress of Map tasks reaches 3%, the Reduce tasks are
started and obtain MOF files from each Map task. The number of Reduce tasks is
determined by clients, and the number of MOF partitions is determined by the number of
Reduce tasks. For this reason, the partitions of the MOF files output by Map tasks map to
Reduce tasks.
Sort/Merge (in memory or on disk): MOF files need to be sorted. If the amount of data
received by a Reduce task is small, the data is stored directly in the buffer. As the number
of files in the buffer increases, the MapReduce background thread merges the files into a
larger one. Many intermediate files are generated during the merge operation. The result
of the last merge is passed directly to the Reduce function defined by the user.
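The map-side steps above (sort, combine, and partitioning by the number of Reduce tasks) can be sketched in plain Python. This is a minimal illustration, not Hadoop code: all function names are hypothetical, values are integers rather than strings, and the CRC32-based partitioner is an assumption standing in for the framework's own partitioning logic.

```python
import zlib
from itertools import groupby


def sort_pairs(pairs):
    """Sort map outputs by key, as in the Sort step."""
    return sorted(pairs, key=lambda kv: kv[0])


def combine(sorted_pairs):
    """Sum the values of equal adjacent keys, as in the optional Combine step."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(sorted_pairs, key=lambda kv: kv[0])
    ]


def partition(key, num_reduce_tasks):
    """Pick a partition for a key (assumed CRC32 hash, illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def build_mof(pairs, num_reduce_tasks):
    """Return one sorted, combined list of pairs per Reduce task."""
    partitions = [[] for _ in range(num_reduce_tasks)]
    for key, value in combine(sort_pairs(pairs)):
        partitions[partition(key, num_reduce_tasks)].append((key, value))
    return partitions


map_output = [("Hi", 1), ("Hello", 1), ("Hi", 1), ("Hello", 1)]
print(sort_pairs(map_output))
print(combine(sort_pairs(map_output)))
print(build_mof(map_output, 2))
```

Because the number of partitions equals the number of Reduce tasks, each Reduce task only needs to copy its own partition from every Map task.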
Shuffle Mechanism
Example: Typical Program WordCount
[Figure: the WordCount application (App) with ResourceManager and NameNode.]
Functions of WordCount
Input:
Hello World Bye World
Hello Hadoop Bye Hadoop
Bye Hadoop Hello Hadoop
Output (after MapReduce):
Bye 3
Hadoop 4
Hello 3
World 2
Map Process of WordCount
1.“Hello World Bye World” -> Map -> <Hello,1> <World,1> <Bye,1> <World,1>
2.“Hello Hadoop Bye Hadoop” -> Map -> <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
3.“Bye Hadoop Hello Hadoop” -> Map -> <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
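The Map process of WordCount can be sketched as a small Python function. This is an illustration only: the name wordcount_map is hypothetical, and a real job would implement the Map function against the Hadoop MapReduce API instead.

```python
def wordcount_map(line):
    """Emit a (word, 1) pair for every whitespace-separated word in the line."""
    return [(word, 1) for word in line.split()]


lines = [
    "Hello World Bye World",
    "Hello Hadoop Bye Hadoop",
    "Bye Hadoop Hello Hadoop",
]
for number, line in enumerate(lines, start=1):
    print(number, wordcount_map(line))
```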
Reduce Process of WordCount
Map output -> Combine output (one line per Map task):
<Hello,1> <World,1> <Bye,1> <World,1> -> <Hello,1> <World,2> <Bye,1>
<Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1> -> <Hello,1> <Hadoop,2> <Bye,1>
<Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1> -> <Bye,1> <Hadoop,2> <Hello,1>
Shuffle -> Reduce input -> Reduce output:
<Bye,1 1 1> -> Reduce -> Bye 3
<Hadoop,2 2> -> Reduce -> Hadoop 4
<Hello,1 1 1> -> Reduce -> Hello 3
<World,2> -> Reduce -> World 2
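The combine, shuffle, and reduce stages of WordCount can be traced with a short Python sketch. The function names are hypothetical and the whole flow runs in one process, so this only illustrates the data transformations, not the distributed execution.

```python
from collections import Counter, defaultdict


def map_and_combine(line):
    """Map one input line, then combine locally: one (word, count) pair per word."""
    return sorted(Counter(line.split()).items())


def shuffle(combined_outputs):
    """Group the combined values of all Map tasks by key."""
    grouped = defaultdict(list)
    for pairs in combined_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))


def wordcount_reduce(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)


lines = [
    "Hello World Bye World",
    "Hello Hadoop Bye Hadoop",
    "Bye Hadoop Hello Hadoop",
]
reduce_input = shuffle(map_and_combine(line) for line in lines)
result = [wordcount_reduce(word, counts) for word, counts in reduce_input.items()]
print(reduce_input)
print(result)
```

Combining before the shuffle shrinks the data each Map task sends: the pair for Hadoop travels as a single count of 2 instead of two counts of 1.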
Architecture of YARN
[Figure: YARN architecture. Clients submit jobs to ResourceManager. NodeManagers host the ApplicationMaster (App Mstr) and Containers. Labeled flows: MapReduce Status, Job Submission, Node Status, Resource Request.]
Task Scheduling of MapReduce on YARN
YARN HA Solution
ResourceManager in YARN manages resources and schedules tasks in the cluster. The
YARN HA solution uses redundant ResourceManager nodes to eliminate the single point
of failure of ResourceManager.
[Figure: YARN HA with a ZooKeeper cluster.]
YARN ApplicationMaster Fault Tolerance Mechanism
[Figure: AM-1 runs in a container alongside task containers. After a failure, AM-1 is restarted in another container.]
Resource Management
YARN manages and allocates memory and CPU resources.
The memory and CPU resources available on each NodeManager can be
configured (on the YARN service configuration page):
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.vmem-pmem-ratio
yarn.nodemanager.resource.cpu-vcores
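As a rough illustration of how these parameters interact, the Python sketch below derives a container's virtual memory ceiling from its physical memory and yarn.nodemanager.vmem-pmem-ratio, and estimates how many equally sized containers fit on one NodeManager. The function names and all numeric values are assumptions for the example, not FusionInsight defaults, and the estimate assumes that both memory and vcores are taken into account by the scheduler. The vcore parameter is named yarn.nodemanager.resource.cpu-vcores in Apache Hadoop.

```python
def vmem_limit_mb(container_pmem_mb, vmem_pmem_ratio):
    """Virtual memory ceiling of a container: physical memory times the ratio."""
    return container_pmem_mb * vmem_pmem_ratio


def max_containers(node_memory_mb, node_vcores, container_mb, container_vcores):
    """Containers of one size that fit on a node, limited by the scarcer resource."""
    return min(node_memory_mb // container_mb, node_vcores // container_vcores)


# Assumed example values (hypothetical node and container sizes).
node_memory_mb = 16384   # yarn.nodemanager.resource.memory-mb
node_vcores = 8          # yarn.nodemanager.resource.cpu-vcores
ratio = 2.1              # yarn.nodemanager.vmem-pmem-ratio

print(vmem_limit_mb(2048, ratio))
print(max_containers(node_memory_mb, node_vcores, 2048, 1))
print(max_containers(node_memory_mb, node_vcores, 1024, 2))
```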
Resource Allocation Model
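1. Selects a queue.
2. Selects an application from the queue.
[Figure: the queue tree under Root, the applications App 1, App 2, …, App N in a queue, and resource requests by location (Rack B, any resources).]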
Capacity Scheduler Overview
Capacity Scheduler enables Hadoop applications to run in a shared, multi-
tenant cluster while maximizing the throughput and utilization of the cluster.
Capacity Scheduler allocates resources by queue. Users can set upper and
lower limits for the resource usage of each queue. Administrators can restrict
the resources used by a queue, user, or job. Job priorities can be set, but
resource preemption is not supported.
Highlights of Capacity Scheduler
Capacity assurance: Administrators can set upper and lower limits for the resource
usage of each queue. All applications submitted to the queue share the resources.
Flexibility: The idle resources of a queue can be used by other queues that
require resources. If a new application is then submitted to the lending queue, the other
queues release the borrowed resources and return them to that queue.
Priority: Priority queuing is supported (FIFO by default).
Multi-tenancy (multi-leasing): Multiple users can share a cluster, and multiple applications
can run concurrently. Administrators can add multiple restrictions to prevent cluster
resources from being exclusively occupied by an application, user, or queue.
Dynamic update of configuration files: Administrators can dynamically modify
configuration parameters to manage clusters online.
Task Selection by Capacity Scheduler
Queue Resource Limitation (1)
Queues are created on the Tenant page. After a tenant is created and
associated with YARN, a queue with the same name as the tenant is created.
For example, if tenants QueueA and QueueB are created, two YARN queues
QueueA and QueueB are created.
Queue Resource Limitation (2)
Queue resource capacity (percentage): Suppose there are three queues, default, QueueA, and
QueueB. Each has a [queue name].capacity configuration:
The capacity of the default queue is 20% of the total cluster resources.
The capacity of the QueueA queue is 10% of the total cluster resources.
The capacity of the QueueB queue is 10% of the total cluster resources.
The capacity of the root-default shadow queue in the background is 60% of the total cluster resources.
[Figure: Resource Allocation]
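To make the percentages concrete, the following Python sketch converts the queue capacities above into absolute memory for a cluster of an assumed size. The 100 GB cluster total and the function name are hypothetical. Only the percentages come from the example.

```python
# Capacities from the example above, as percentages of total cluster resources.
QUEUE_CAPACITY_PERCENT = {
    "default": 20,
    "QueueA": 10,
    "QueueB": 10,
    "root-default (shadow)": 60,
}


def queue_memory_mb(cluster_memory_mb, capacity_percent):
    """Guaranteed memory per queue, derived from its capacity percentage."""
    return {
        queue: cluster_memory_mb * percent // 100
        for queue, percent in capacity_percent.items()
    }


cluster_memory_mb = 100 * 1024  # assumed cluster size: 100 GB
for queue, memory in queue_memory_mb(cluster_memory_mb, QUEUE_CAPACITY_PERCENT).items():
    print(queue, memory, "MB")
```

The capacities of sibling queues add up to 100%, which is why the shadow queue holds the remaining 60%.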
Queue Resource Limitation (3)
Sharing Idle Resources
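Due to resource sharing, the resources used by a queue may
exceed its capacity (for example, QueueA.capacity). The maximum
resource usage of a queue can be limited by a parameter (in the Apache Hadoop
Capacity Scheduler, yarn.scheduler.capacity.<queue-path>.maximum-capacity).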
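The effect of such an upper limit can be sketched in a few lines of Python. This is a simplified model, not scheduler code: the function name is hypothetical, and the 40% maximum and the idle percentages are assumed values chosen for the example.

```python
def usable_percent(capacity, idle_elsewhere, maximum_capacity):
    """Cluster share a queue can reach: own capacity plus idle resources, capped."""
    return min(capacity + idle_elsewhere, maximum_capacity)


# QueueA has a capacity of 10%. Assume its maximum is set to 40%.
print(usable_percent(capacity=10, idle_elsewhere=60, maximum_capacity=40))
print(usable_percent(capacity=10, idle_elsewhere=15, maximum_capacity=40))
print(usable_percent(capacity=10, idle_elsewhere=0, maximum_capacity=40))
```

With plenty of idle resources the queue stops at its maximum. With few idle resources it is limited by what is actually free.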
User and Task Limitation
Log in to FusionInsight Manager and choose Tenant > Dynamic Resource
Plan > Queue Config to configure user and task limitation parameters.
User Limitation (1)
Minimum resource assurance (percentage) of a user:
The resources for each user in a queue are limited at any time. If tasks of multiple users are running
at the same time in a queue, the resource usage of each user fluctuates between the minimum
value and the maximum value. The maximum value is determined by the number of running tasks,
while the minimum value is determined by minimum-user-limit-percent.
For example, if yarn.scheduler.capacity.root.QueueA.minimum-user-limit-percent=25, the
queue resources are redistributed among users as the number of users who submit tasks increases.
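The adjustment can be approximated with a small Python sketch. It is a simplified model that assumes every active user demands as much as the queue allows. The function name is hypothetical, and the real scheduler also accounts for actual demand and other limits.

```python
def per_user_share_percent(active_users, minimum_user_limit_percent):
    """Upper bound on one user's share of the queue in this simplified model.

    The queue is split evenly among active users, but the share never drops
    below minimum-user-limit-percent.
    """
    if active_users < 1:
        raise ValueError("active_users must be at least 1")
    return max(100.0 / active_users, float(minimum_user_limit_percent))


for users in range(1, 6):
    share = per_user_share_percent(users, 25)
    print(users, "user(s):", round(share, 1), "% of QueueA each")
```

With a value of 25, the share falls from 100% for one user to 50%, about 33%, and then 25%, and it stays at 25% once four or more users are active.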
User Limitation (2)
Maximum resource usage of a user (multiples of queue capacity):
Task Limitation
Maximum number of active tasks:
Indicates the maximum number of active tasks in a cluster, including running and suspended
tasks. When the number of submitted tasks reaches the limit, new tasks are rejected.
The default value is 10000.
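The admission rule can be sketched as follows. This is a minimal illustration with hypothetical class and method names. It only models the counting behavior described above, not how the scheduler tracks tasks internally.

```python
class ActiveTaskLimiter:
    """Rejects new tasks once the number of active tasks reaches the limit."""

    def __init__(self, max_active_tasks=10000):
        self.max_active_tasks = max_active_tasks
        self.active = set()  # running and suspended tasks both count

    def submit(self, task_id):
        """Return True if the task is accepted, False if it is rejected."""
        if len(self.active) >= self.max_active_tasks:
            return False
        self.active.add(task_id)
        return True

    def finish(self, task_id):
        """Remove a completed task so that a new one can be accepted."""
        self.active.discard(task_id)


limiter = ActiveTaskLimiter(max_active_tasks=2)  # small limit for the demo
print(limiter.submit("task-1"))
print(limiter.submit("task-2"))
print(limiter.submit("task-3"))
limiter.finish("task-1")
print(limiter.submit("task-3"))
```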
Queue Information
Choose Services > YARN > ResourceManager (active) > Scheduler to view queue
information.
Enhanced Features - YARN Dynamic Memory Management
[Figure: flowchart. The memory usage of each Container is calculated, and a check decides whether Containers can run.]
Enhanced Features - YARN Label-based Scheduling
[Figure: applications that have common resource requirements, demanding memory requirements, or demanding I/O requirements are submitted to queues, and their tasks are scheduled to the matching NodeManagers.]
Summary
This module describes the application scenarios and architectures
of MapReduce and YARN, resource management and task
scheduling of YARN, and the enhanced features of YARN in
FusionInsight HD.
Quiz
1. What is the working principle of MapReduce?
Quiz
1. What are highlights of MapReduce? ( )
A. Easy to program
B. Outstanding scalability
C. Real-time computing
Quiz
2. What is the abstraction of Yarn resources? ( )
A. Memory
B. CPU
C. Container
D. Disk space
Quiz
3. What does MapReduce apply to? ( )
A. Iterative computing
B. Offline computing
D. Stream computing
Quiz
4. What are highlights of capacity scheduler? ( )
A. Capacity assurance
B. Flexibility
C. Multi-leasing
More Information
Training materials:
http://support.huawei.com/learning/Certificate!showCertificate?lang=en&pbiPath=term1000025450&id=Node1000011796
Exam outline:
http://support.huawei.com/learning/Certificate!toExamOutlineDetail?lang=en&nodeId=Node1000011797
Mock exam:
http://support.huawei.com/learning/Certificate!toSimExamDetail?lang=en&nodeId=Node1000011798
Authentication process:
http://support.huawei.com/learning/NavigationAction!createNavi#navi[id]=_40