
Improved Job Scheduling for Achieving Fairness on Apache Hadoop YARN

Thet Hsu Aung, Wint Thida Zaw


University of Information Technology, Yangon, Myanmar
[email protected], [email protected]

Abstract

Enormous amounts of data are gathered from social media sites, mobile devices, and other business environments. Analyzing such big data produces large workloads for distributed applications, and the resources of a single machine are insufficient for them. Hadoop YARN (Yet Another Resource Negotiator) enables running multiple applications over a Hadoop cluster so as to utilize resources efficiently, and it provides the data-parallel programming model. Hadoop YARN supports the performance of the open-source framework for distributed applications and performs job scheduling and monitoring together with storage, processing, and analysis of big data on commodity hardware. Apache Hadoop provides over 200 default parameter configuration settings for all types of clusters and applications. If the available parameters are misconfigured on one or more machines in the cluster, system performance may decrease, whereas appropriate parameter tuning can increase it. Tuning the parameter configuration is therefore a challenge for the Apache Hadoop framework in utilizing system resources efficiently. In this paper, YARN parameter tuning is performed to improve execution time and achieve efficient job scheduling.

Keywords: Apache Hadoop, Hadoop YARN, Parameter Configuration

1. Introduction

Big data techniques can analyze the structured, semi-structured, and unstructured data accumulated by organizations. Big data is often characterized by the enormous volume of data, the wide variety of data types, the velocity at which the data is created, gathered, and processed, the veracity or quality of the data, and the value of the capacity to turn data into business insight. Big data analytics helps organizations and institutions make better decisions by uncovering data that would otherwise remain hidden.

Hadoop is an open-source framework for the distributed processing of very large datasets that manages processing and storage for big data applications on extensible clusters of commodity computers [11]. Hadoop can handle various kinds of data structure, for example structured or unstructured data, and gives more flexibility for collecting, processing, analyzing, and managing data than a conventional database framework [11]. It is intended to scale from a single machine up to thousands of machines, each contributing local computation and storage. The primary components of Hadoop are MapReduce, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop Common. HDFS is designed as a distributed, scalable, and portable file system spread over the nodes in a cluster, and it provides fault tolerance. MapReduce is a programming model that works as both a processing engine and a cluster resource manager, with HDFS attached directly to it, running as batch processing. Hadoop YARN facilitates resource management and job scheduling/monitoring in Hadoop clusters. Hadoop Common denotes the collection of libraries and utilities required by the other Hadoop modules. In Hadoop clusters, Hadoop YARN lies between HDFS and the processing engines deployed by clients. Its resource manager is the master that arbitrates all the available cluster resources and serves the distributed applications, working together with the per-node NodeManagers and the per-application ApplicationMasters. It supports different job scheduling approaches and many policies that schedule jobs depending on cluster resources. The YARN framework provides default parameter configuration settings for all applications and clusters [3]. However, these defaults are not suitable for all applications, since applications have different characteristics: CPU intensive, I/O intensive, memory intensive, or a mix of these. Parameter tuning tips and tricks depend on the amount of data being moved and on the kind of Hadoop jobs being run in production. This paper presents suitable parameter configuration tuning for Hadoop jobs to achieve better performance. Hadoop exposes its parameter configuration settings in XML files, with facilities to change their values depending on the kind of dataset. Hence, parameter configuration settings depend on the application characteristics according to which the values are set [3].

This paper is organized as follows: Section 2 presents the related work. Section 3 introduces an overview of the Hadoop YARN architecture, MapReduce, and HDFS. Section 4 describes Apache Hadoop YARN parameter configuration, and the proposed system is presented in Section 5. The experiments and evaluation are explained in Section 6, and Section 7 concludes the paper.

2. Related work

Several researchers have addressed Hadoop parameter configuration. In this section, a few papers describing work related to this system are reviewed.

L. Changlong et al. [7] established that parameter configuration is a limiting factor for Hadoop MapReduce job execution: misconfigured parameters can ruin the performance of the framework. They proposed an adaptive automatic configuration tool based on a mathematical model that accurately learns the relationship between system performance and parameter configuration.

In [10], the authors addressed the Hadoop framework's parameter configuration challenge and proposed a system monitoring and performance analysis framework that deals with configuration tuning difficulties. B. Garvit et al. [2] examined the impact of various parameter configurations on Hadoop MapReduce under isolated conditions in order to achieve maximum throughput, running experiments to analyze the effect of different parameter setups and to recommend optimal values. Their paper focused on two key performance indicators, throughput and execution time, and compared FIFO, Fair, and Capacity scheduling. They observed that individual parameters, together with the different scheduling mechanisms, affect the performance metrics.

B. Ailton [1] described an ontology-based semantic approach to tuning parameters to improve Hadoop performance, showing how parameter configuration affects system performance and how these parameters are influenced by workload characteristics.

G. Sasiniveda and N. Revathi [4] presented a mapper/reducer scheme using a mean-shift-clustering-based algorithm that lets the user analyze the dataset. It achieved better job execution by using the optimal number of mappers and reducers for the size of the dataset, which significantly lowered system cost, energy usage, and management complexity while increasing the performance of the framework.

Q. Wang [8] proposed a framework that can substantially trade off performance against fairness, decreasing the makespan of MapReduce jobs by using a multi-level queue, a time factor, a job urgency factor, and a dominant resource proportion. The experiments showed that the makespan of MapReduce jobs decreased while CPU and memory utilization improved.

3. Background theory

This section describes Apache Hadoop and MapReduce and gives an overview of the Hadoop YARN architecture on which applications execute.

3.1. Apache Hadoop

Apache Hadoop is an Apache top-level project that enables the distributed processing of very large datasets using simple programming models [11]. Hadoop uses a master-slave architecture. It is designed to carry the computation to the data rather than the data to the computation, and it manages data processing and storage for big data applications running on cluster systems. Hadoop can handle various kinds of data structure and gives users more flexibility for collecting, processing, and analyzing data than a conventional database framework.

Hadoop modules are designed with the core assumption that hardware failures are common and should therefore be handled automatically in software by the framework [11]. Hadoop provides a distributed file system and a batch processing framework for running MapReduce jobs. MapReduce consists of two functions, Map and Reduce. The Map task takes the input, runs an operation over each line in the file, and splits it into key-value pairs. The Reduce task groups these key-value pairs and runs an operation to combine them, such as key -> sum.
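To make the Map and Reduce semantics concrete, the following minimal Python sketch reproduces the word-count flow described above on local data; it only mimics MapReduce's key-value model and does not use Hadoop itself.

from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the values (key -> sum)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["hadoop yarn schedules jobs", "yarn allocates containers for jobs"]
pairs = [kv for line in lines for kv in map_phase(line)]
print(reduce_phase(pairs))  # {'hadoop': 1, 'yarn': 2, 'jobs': 2, ...}

On a real cluster, the shuffle between the two phases groups pairs by key across machines; here the grouping happens inside reduce_phase.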
HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Every node in a Hadoop instance typically has a single NameNode, and a set of DataNodes forms the HDFS cluster. HDFS stores very large files across many machines. A benefit of HDFS is the data awareness between the job tracker and the task trackers: the job tracker schedules map or reduce jobs to task trackers with knowledge of the data location.

Hadoop YARN is a sub-project of Hadoop that separates the resource management and processing components [12]. The YARN-based architecture of Hadoop provides a more general processing platform. The main idea of YARN is to split the two major functionalities, resource management and job scheduling/monitoring, into separate daemons. It has a global ResourceManager (RM) and a per-application ApplicationMaster (AM). YARN strengthens a Hadoop compute cluster in terms of scalability, compatibility with MapReduce, improved cluster utilization, agility, and support for workloads other than MapReduce [12]. Moreover, Apache Hadoop YARN makes the Hadoop framework more suitable for real-time processing and for other applications that cannot wait for batch jobs to finish.

3.2. Job execution flow of Hadoop YARN

In the job execution flow, the user submits a job to the ResourceManager (RM). The RM issues a request to the ApplicationManager (AM) to communicate with a NodeManager (NM) and start the ApplicationMaster.

In the YARN architecture there is one NM per machine, and the number of ApplicationMasters running equals the number of applications submitted by users. Each NM is associated with one ApplicationMaster per application. An ApplicationMaster is by default in sleep mode; whenever the RM needs to process a job, it instructs the NM to launch the ApplicationMaster. Once launched, the ApplicationMaster requests containers from the scheduler of the ResourceManager, and the scheduler allocates resources for the containers. A container is a collection of resources in which application tasks are executed.

Once resources are allocated by the scheduler, the ApplicationMaster requests the NMs to launch the containers in which the tasks are executed. Each NM tracks and monitors the life cycle of its containers and updates their status to the RM's ApplicationManager through the ApplicationMaster, so the ApplicationMaster acts as a communication channel between the NMs and the RM. If the node running an NM and the ApplicationMaster goes down and the ApplicationMaster is destroyed, the other NMs that were reporting to it can communicate directly with the RM.

An ApplicationMaster requests containers where the data is available. An ApplicationMaster running on one node communicates with the NMs running on other machines; each NM reports the status of its node to the ApplicationMaster, which updates the status to the RM. In this way the ApplicationMaster can allocate containers on different nodes and track the status of the application through these NMs. The ApplicationManager of the RM monitors and tracks node status through the ApplicationMaster. NMs send heartbeats regularly to the ApplicationMaster, which relays them to the ApplicationManager of the RM. If the ApplicationManager stops receiving heartbeats from an NM through the ApplicationMaster, the RM assumes that the NM is down and assigns the job to another NM running in the cluster.

Figure 1. Execution flow of Hadoop YARN
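The flow just described can be summarized in a toy Python model; the classes and method names below are illustrative only and do not correspond to the real YARN APIs, and heartbeats and failure handling are omitted.

class Container:
    def __init__(self, node, resources):
        self.node, self.resources = node, resources

class NodeManager:
    def __init__(self, name):
        self.name = name

    def launch_application_master(self, job):
        # The RM picks an NM to host the per-application ApplicationMaster.
        return ApplicationMaster(job)

    def launch_task(self, job, container):
        print(f"{self.name}: running a task of '{job}' in a container")

class ApplicationMaster:
    def __init__(self, job):
        self.job = job

    def run_tasks(self, containers):
        # The AM asks each container's NM to launch a task; status updates
        # would flow back to the RM via heartbeats (omitted here).
        for c in containers:
            c.node.launch_task(self.job, c)

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes  # one NodeManager per machine

    def submit_job(self, job):
        # Launch the AM on some node, then let it negotiate containers
        # from the RM's scheduler (modeled as one container per node).
        am = self.nodes[0].launch_application_master(job)
        containers = [Container(node, resources=1) for node in self.nodes]
        am.run_tasks(containers)

rm = ResourceManager([NodeManager("nm-1"), NodeManager("nm-2")])
rm.submit_job("wordcount")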
through ApplicationMaster. ApplicationMaster acts a performance for Hadoop. The range of each parameter is
communication channel between NM and RM. If any node set based on large jobs or short jobs, which will affect the
is down where NM and ApplicationMaster is running, then performance of the system. To improve the performance,
if the ApplicationMaster is destroyed then the other NMs the user still should select the useful parameter for different
reporting to ApplicationMaster can directly communicate applications.
with RM. Hadoop provides parameter configurations setting with
An ApplicationMaster requests for containers where default value in xml configuration file [3]. It can be
data is available. An ApplicationMaster running in one specific, such as node and application. Furthermore, it
node communicates with other NM’s running in other provides methods like hadoop –conf and hadoop –D to
machine. And then NM reports the status of node to change configuration value.
ApplicationMaster update the status to RM. Here “Hadoop jar example.jar nameofexample –D
ApplicationMaster can allocate containers in different node propertyname(key)=value”
and track the status of application through these NM. AM Hadoop parameter configurations can be classified like
of RM takes care of monitoring and tracking the status of CPU, I/O and Memory and Network [3]. Some parameters
node through ApplicationMaster. NM sends heart beats directly affect the performance of the system, are shown in
consistently to ApplicationMaster and it updates the same Table 1.
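As an illustration of this XML-based configuration mechanism, the following Python sketch sets or adds a property in a Hadoop *-site.xml file. The helper function and the example file path are ours; the <configuration>/<property>/<name>/<value> layout is the standard Hadoop configuration format.

import xml.etree.ElementTree as ET

def set_hadoop_property(conf_path, name, value):
    """Set (or add) a <property> entry in a Hadoop configuration XML file."""
    tree = ET.parse(conf_path)
    root = tree.getroot()  # the <configuration> element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            prop.find("value").text = str(value)
            break
    else:  # property not present yet: append a new one
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    tree.write(conf_path)

# Example: raise the sort buffer before re-running a job.
# set_hadoop_property("etc/hadoop/mapred-site.xml",
#                     "mapreduce.task.io.sort.mb", 1024)

Hadoop parameter configurations can be classified into CPU, I/O, memory, and network parameters [3]. Some parameters that directly affect the performance of the system are shown in Table 1.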
to AM of RM. If AM is not receiving heart beats from a
NM through ApplicationMaster, then the RM assumes the Table 1. Hadoop YARN parameter configurations
NM is down and assigns the job to some other NM running
in the cluster. CPU I/O Memory Network
mapred.tasktra dfs.block.size io.sort. mapred.comp
cker.map.tasks mb ress.map.
.maximum output
mapred.tasktra dfs. io.sort. Mapred.map.
cker.reduce.tas replication factor output.compr
ks.maximum ession.code
mapred.map. mapred.comp io.sort. mapred.
tasks ress.map. spill. output.
output percent compress
mapred.reduce. mapred.comp io.sort. mapred.
tasks ress.map. factor output.
output compression.
type
mapred.map. mapred.map.o mapred. mapred.
tasks.speculati utput.compres child.java. output.compr
ve.execution sion.codec opts ession.codec
Figure 1. Execution flow of Hadoop YARN mapred.reduce. dfs.heartbeat.
tasks.speculati interval
ve.execution

5. The proposed system

The fundamental unit of scheduling in YARN is a queue. The capacity of each queue specifies the percentage of cluster resources that are available for applications submitted to that queue. The proposed system design for improving the execution time and achieving efficient job scheduling is illustrated in Figure 2.

Figure 2. Proposed system design (the HDFS client submits data and resource requests to the ResourceManager, which performs parameter tuning based on job-set utilization, negotiates resources, starts the ApplicationMaster, and launches tasks in containers on the NodeManagers)

5.1. Proposed algorithm

In order to distinguish different types of jobs, a multi-level queue model is used, and the parameters of the multi-level queue, the time factor, and the types of job sets are configured in the configuration file. In the first stage, a queue is selected as the target queue for resource allocation. Assuming the queue set is Q, the formula is as follows:

TQ = \{ q \in Q \mid \min(used / actCap) \}    (1)

Here, TQ represents the target queue, used represents the used resources, and actCap represents the actual capacity of the queue.

In the next stage, to choose the target job sets, we need to know the average running time of the jobs. The running time of job J_i is calculated by

T_i = TN_i - TS_i, \quad i = 1, 2, \ldots, n    (2)

where T_i is the running time of job J_i, TN_i is the current time, and TS_i is the submission time of job J_i. Using equation (2), the average time of all the jobs in the job set J is then

average(T) = \frac{1}{n} \sum_{i=1}^{n} T_i    (3)

The job set J can be separated into two types of job sets, SJ (short jobs) and LJ (long jobs), based on the average time average(T):

J_i \in \begin{cases} SJ, & T_i \le average(T) \\ LJ, & T_i > average(T) \end{cases}    (4)

The job sets are then sorted by the resource demand of a single map or reduce task, where D_i^m and D_i^r represent the demanded map and reduce resources.

Finally, we build the target job sets from LJ and SJ. The first target job set TJ_1 is selected from LJ such that

\sum_{LJ_i \in LJ} D(LJ_i) < \lambda \cdot availCap, \quad 0 < \lambda \le 1    (5)

Here, D(LJ_i) represents the resource demand of long job LJ_i, and the default value of \lambda is 1. Then another target job set TJ_2 is selected from SJ as follows:

\sum_{SJ_i \in SJ} D(SJ_i) < availCap - \sum_{TJ_i \in TJ_1} D(TJ_i)    (6)

Here, D(SJ_i) represents the resource demand of short job SJ_i. Finally, the target job set TJ is obtained as

TJ = TJ_1 \cup TJ_2    (7)

The algorithm for improving the execution time and achieving efficient job scheduling on Hadoop YARN is given in Figure 3.

begin
 1. get target queue TQ according to (1)
 2. get LJ and SJ according to (2), (3), (4)
 3. sort jobs in LJ by the demand resource of their map/reduce tasks
 4. sort jobs in SJ by the demand resource of their map/reduce tasks
 5. choose target jobs TJ_1 from LJ according to (5)
 6. choose target jobs TJ_2 from SJ according to (6)
 7. get target job set TJ according to (7)
 8. for each job in TJ
 9.     allocate resources to this job
10. end for
end

Figure 3. The algorithm for improving the execution time and efficient job scheduling on Hadoop YARN
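The following runnable Python sketch implements equations (1) to (7) and the steps of Figure 3. The Job and Queue structures, their field names, and the greedy selection loops are one illustrative reading of the algorithm, with resource demand modeled as a single number per job; this is not YARN code.

import time
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    used: float             # resources currently in use
    actual_capacity: float  # actCap in Eq. (1)

@dataclass
class Job:
    name: str
    submit_time: float  # TS_i in Eq. (2)
    demand: float       # D(J_i): resource demand of its map/reduce tasks

def select_target_queue(queues):
    # Eq. (1): the target queue has the minimal used/actCap ratio.
    return min(queues, key=lambda q: q.used / q.actual_capacity)

def select_target_jobs(jobs, avail_cap, lam=1.0):
    if not jobs:
        return []
    now = time.time()
    runtime = {j.name: now - j.submit_time for j in jobs}   # Eq. (2)
    avg = sum(runtime.values()) / len(jobs)                 # Eq. (3)
    sj = sorted((j for j in jobs if runtime[j.name] <= avg),
                key=lambda j: j.demand)                     # Eq. (4), short jobs
    lj = sorted((j for j in jobs if runtime[j.name] > avg),
                key=lambda j: j.demand)                     # Eq. (4), long jobs
    tj1, used = [], 0.0
    for job in lj:                                          # Eq. (5), greedy reading
        if used + job.demand < lam * avail_cap:
            tj1.append(job)
            used += job.demand
    tj2, remaining = [], avail_cap - used
    for job in sj:                                          # Eq. (6)
        if job.demand < remaining:
            tj2.append(job)
            remaining -= job.demand
    return tj1 + tj2                                        # Eq. (7): TJ = TJ_1 u TJ_2

# Example for Eq. (1): q2 wins because 5/50 < 10/40.
queues = [Queue("q1", used=10, actual_capacity=40),
          Queue("q2", used=5, actual_capacity=50)]
print(select_target_queue(queues).name)  # q2

Setting lam below 1 in select_target_jobs reserves a share of availCap for short jobs, which is one way to read the fairness intent of equations (5) and (6).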
6. Experiments

The proposed work is performed on a pseudo-distributed Hadoop cluster configured with one master and one slave. The experiments were performed under Hadoop 3.2.1. The main configuration information is shown in Table 2; the default values are kept for all other parameters.

Table 2. Main configuration of the Hadoop cluster

Config File      Key                                    Value
yarn-site.xml    yarn.scheduler.minimum-allocation-mb   3072
yarn-site.xml    yarn.scheduler.maximum-allocation-mb   51200
yarn-site.xml    yarn.nodemanager.resource.memory-mb    51200
mapred-site.xml  mapreduce.map.memory.mb                3072
mapred-site.xml  mapreduce.map.java.opts                -Xmx2048m
mapred-site.xml  mapreduce.reduce.memory.mb             3072
mapred-site.xml  mapreduce.reduce.java.opts             -Xmx2048m
mapred-site.xml  yarn.app.mapreduce.am.resource.mb      3072
mapred-site.xml  yarn.app.mapreduce.am.command-opts     -Xmx2048m
mapred-site.xml  mapreduce.task.io.sort.mb              1024
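As a quick consistency check of Table 2, the following Python sketch verifies that each requested container size lies within the scheduler's allocation bounds and that each JVM heap (-Xmx) fits inside its container, leaving headroom for non-heap JVM memory. The values are taken from Table 2; the dictionary and key names are ours.

settings = {
    "yarn.scheduler.minimum-allocation-mb": 3072,
    "yarn.scheduler.maximum-allocation-mb": 51200,
    "mapreduce.map.memory.mb": 3072,
    "mapreduce.reduce.memory.mb": 3072,
    "yarn.app.mapreduce.am.resource.mb": 3072,
    "heap_mb": 2048,  # from -Xmx2048m
}

minimum = settings["yarn.scheduler.minimum-allocation-mb"]
maximum = settings["yarn.scheduler.maximum-allocation-mb"]
for key in ("mapreduce.map.memory.mb",
            "mapreduce.reduce.memory.mb",
            "yarn.app.mapreduce.am.resource.mb"):
    container = settings[key]
    assert minimum <= container <= maximum, f"{key} outside scheduler bounds"
    assert settings["heap_mb"] < container, f"heap does not fit in {key}"
    print(f"{key}: heap/container = {settings['heap_mb'] / container:.2f}")
# Prints 0.67 for each container, i.e. about a third of each
# container is left for non-heap JVM overhead.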

The experiments are conducted on Hadoop 3.2.1 configured with the tuned parameter values described in Table 2, and also with the default parameter values. The system performance is then analyzed by running the Wordcount program on different sizes of input data. The experimental results of the Wordcount runs are shown in Figure 4; they show that tuning the parameters of the Hadoop configuration achieves better system performance.

Figure 4. Comparison of execution time (total time in seconds of the Wordcount program with default versus tuned parameters, for dataset sizes of 128 MB, 256 MB, 1 GB, 2 GB, and 3 GB)

7. Conclusion

This paper presents Hadoop YARN job parameter configuration issues and assesses the performance of job scheduling through Hadoop job parameter configuration settings. The default Hadoop parameter configuration is not appropriate for all kinds of applications, and proper parameter configuration settings can tune performance. The proposed algorithm can effectively adjust the performance of the framework. Hadoop users can get the best performance out of the framework's resources through proper parameter settings and can build a framework with a good parameter configuration. In the future, we will perform an experimental evaluation running on a fully distributed Hadoop cluster with more suitable Hadoop parameter configurations, test processor- and memory-intensive programs, and compare workload completion rate, turnaround time, and throughput.

8. References

[1] B. Ailton, M. Andre, and S. Fabiano, "Towards an Ontology-based Semantic Approach to Tuning Parameters to Improve Hadoop Application Performance", Information Technology in Industry, vol. 2, no. 2, 2014, pp. 56-61.

[2] B. Garvit, S. Manish, and B. Subhasis, "A Framework for Performance Analysis and Tuning in Hadoop Based Cluster", International Conference on Distributed Computing and Networking, Coimbatore, India, 2014.

[3] B.J. Mathiya and V.L. Desai, "Apache Hadoop YARN Parameter Configuration Challenges and Optimization", International Conference on Soft-Computing and Network Security, Coimbatore, India, February 25-27, 2015.

[4] G. Sasiniveda and N. Revathi, "Performance Tuning and Scheduling of Large Dataset Analysis in MapReduce Paradigm by Optimal Configuration using Hadoop".

[5] H. Wei, D. Luo, and L. Liang, "Optimization of YARN Hierarchical Resource Scheduling Algorithm", International Conference on Computer Science and Application Engineering, 2017.

[6] K. Hadjar and A. Jedidi, "A New Approach for Scheduling Tasks and/or Jobs in Big Data Cluster", IEEE, 2019.

[7] L. Changlong, Z. Hang, L. Kun, S. Mingming, Z. Jinhone, D. Dong, and Z. Xuehai, "An Adaptive Auto-Configuration Tool for Hadoop", 19th International Conference on Engineering of Complex Computer Systems, IEEE, 2014.

[8] Q. Wang and X. Huang, "PFT: A Performance-Fairness Scheduler on Hadoop YARN", IEEE, 2016.

[9] R. Andrew, G. Apaar, S. Vinayak, S. Dinkar, and K. Subramaniam, "Predicting Hadoop Misconfigurations using Machine Learning", Center for Cloud Computing and Big Data, Department of Computer Science and Engineering, PES University, Bangalore, India.

[10] W. Dili and G. Aniruddha, "A Self-Tuning System based on Application Profiling and Performance Analysis for Optimizing Hadoop MapReduce Cluster Configuration", 20th International Conference on High Performance Computing, Dec. 18-21, 2013, pp. 89-98.

[11] https://hadoop.apache.org/

[12] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

[13] https://data-flair.training/blogs/hadoop-yarn-tutorial/

