Improved Job Scheduling For Achieving Fairness On Apache Hadoop YARN
Authorized licensed use limited to: Auckland University of Technology. Downloaded on December 19,2020 at 18:23:14 UTC from IEEE Xplore. Restrictions apply.
2. Related work

Several researchers have studied Hadoop parameter configuration. In this section, a few papers describing work related to this system are reviewed.

L. Changlong et al. [7] established that parameter configuration is a bottleneck for Hadoop MapReduce job execution: misconfigured parameters can severely degrade the performance of the system. They proposed an adaptive automatic configuration tool based on a mathematical model that accurately learns the relationship between system performance and parameter configuration.

In [10], the authors addressed the parameter configuration challenge of the Hadoop framework and proposed a system monitoring and performance analysis framework that highlights configuration problems. B. Garvit et al. [2] examined the impact of different parameter configurations on Hadoop MapReduce under isolated conditions in order to achieve maximum throughput, analyzed the effect of each configuration, and recommended optimal values. That paper focused on two key performance indicators, throughput and execution time, and compared FIFO, Fair, and Capacity scheduling. It was observed that different parameters, together with different scheduling mechanisms, affect the performance measurements.

B. Ailton [1] proposed an ontology-based semantic approach to tuning parameters to improve Hadoop performance, showing how parameter configuration affects system performance and how the appropriate parameter values depend on workload characteristics.

B. J. Mathiya [4] presented a mean-shift-clustering-based algorithm for choosing mapper and reducer counts that allows the user to analyze the dataset and achieve better job execution by using the optimal numbers of mappers and reducers for the dataset size; it substantially lowered system cost, energy usage, and management complexity while increasing the performance of the system.

Q. Wang [8] proposed a framework that trades off performance against fairness and reduces the makespan of MapReduce jobs by using a multi-level queue, a time factor, a job urgency factor, and a dominant resource proportion. The experiments showed that the makespan of MapReduce jobs decreased while CPU and memory utilization improved.

3.1. Apache Hadoop

Apache Hadoop is an Apache top-level project that allows the distributed processing of large datasets using simple programming models [11]. Hadoop uses a master-slave architecture. It is designed to bring the computation to the data rather than the data to the computation, and it manages data processing and storage for Big Data applications running on cluster systems. Hadoop can handle many kinds of data structures and gives users more flexibility for collecting, processing, and analyzing data than conventional database systems.

Hadoop modules are designed with the central assumption that hardware failures are common and should therefore be handled automatically in software by the framework [11]. It provides a distributed file system and a batch processing framework for running MapReduce jobs. MapReduce consists of two phases, Map and Reduce. The Map task runs an operation over each line of the input file and splits it into key-value pairs. The Reduce task groups these key-value pairs and runs an operation to combine them (e.g., key -> sum).

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each node in a Hadoop instance normally has a single NameNode, and a set of DataNodes forms the HDFS cluster. HDFS stores very large files across many machines. A benefit of HDFS is data-location awareness between the job tracker and the task trackers: the job tracker schedules map or reduce tasks to task trackers with knowledge of where the data resides.

Hadoop YARN is a sub-project of Hadoop that separates the resource management and processing components [12]. The YARN-based architecture of Hadoop provides a more general processing platform. The main idea of YARN is to split the two major functionalities, resource management on the one hand and job scheduling and monitoring on the other, into separate daemons. It has a global ResourceManager (RM) and a per-application ApplicationMaster (AM). YARN enhances the power of a Hadoop compute cluster through scalability, compatibility with MapReduce, improved cluster utilization, agility, and support for workloads other than MapReduce [12]. Moreover, Apache Hadoop YARN makes the Hadoop framework more suitable for real-time processing and for applications that cannot wait for batch jobs to finish.
In the YARN architecture, there is one NodeManager (NM) per machine. The number of running ApplicationMasters equals the number of applications submitted by users, with one ApplicationMaster per application hosted on an NM. An ApplicationMaster is initially idle. Whenever the RM needs to process a job, it instructs an NM to launch the ApplicationMaster. Once launched, the ApplicationMaster requests containers from the scheduler of the ResourceManager, and the scheduler allocates resources for the containers. A container is a collection of resources in which application tasks are executed.

Once resources are allocated by the scheduler, the ApplicationMaster requests the NMs to launch the containers in which the tasks are executed. Each NM tracks and monitors the life cycle of its containers and reports their status to the RM's ApplicationsManager through the ApplicationMaster, which acts as a communication channel between the NMs and the RM. If the node running an ApplicationMaster goes down, the other NMs that were reporting to that ApplicationMaster can communicate directly with the RM.

An ApplicationMaster requests containers on the nodes where the data is available. An ApplicationMaster running on one node communicates with NMs running on other machines; each NM reports the status of its node to the ApplicationMaster, which forwards the status to the RM. In this way the ApplicationMaster can allocate containers on different nodes and track the status of the application through these NMs, while the ApplicationsManager of the RM monitors and tracks node status through the ApplicationMasters. NMs send heartbeats regularly to the ApplicationMaster, which relays them to the ApplicationsManager of the RM. If the ApplicationsManager stops receiving heartbeats from an NM, the RM assumes that the NM is down and assigns its work to another NM running in the cluster.

Figure 1. Execution flow of Hadoop YARN

4. Apache Hadoop YARN parameter configuration

Big data processing systems contain a large number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Inappropriate parameter settings can cause significant performance degradation and stability issues. Hadoop configuration parameters affect several aspects of job execution at different phases, such as concurrency, memory allocation, I/O performance, and network bandwidth usage. Currently, Hadoop has over 200 parameters, of which about 30 can have a substantial effect on job performance.

Appropriate parameter values bring high performance to Hadoop. The suitable range of each parameter depends on whether jobs are large or short, which affects the performance of the system; to improve performance, the user still has to select the relevant parameters for each application.

Hadoop provides parameter settings with default values in XML configuration files [3]. Settings can be scoped, for example per node or per application. Furthermore, Hadoop provides options such as -conf and -D to change configuration values at job submission time:

"hadoop jar example.jar nameofexample -D propertyname(key)=value"

Hadoop parameters can be classified into CPU, I/O, Memory, and Network categories [3]. Some parameters that directly affect the performance of the system are shown in Table 1.

Table 1. Hadoop YARN parameter configurations

CPU                                       | I/O                                 | Memory                 | Network
mapred.tasktracker.map.tasks.maximum      | dfs.block.size                      | io.sort.mb             | mapred.compress.map.output
mapred.tasktracker.reduce.tasks.maximum   | dfs.replication                     | io.sort.factor         | mapred.map.output.compression.codec
mapred.map.tasks                          | mapred.compress.map.output          | io.sort.spill.percent  | mapred.output.compress
mapred.reduce.tasks                       | mapred.compress.map.output          | io.sort.factor         | mapred.output.compression.type
mapred.map.tasks.speculative.execution    | mapred.map.output.compression.codec | mapred.child.java.opts | mapred.output.compression.codec
mapred.reduce.tasks.speculative.execution | dfs.heartbeat.interval              |                        |
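As an illustration, a few of the parameters in Table 1 might be set in mapred-site.xml as follows (the values here are arbitrary examples for illustration only, not recommended settings):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml: example settings for some Table 1 parameters -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>
```

The same properties can instead be overridden per job with the -D option shown above, which takes precedence over the file defaults.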
5. The proposed system

The fundamental unit of scheduling in YARN is a queue. The capacity of each queue specifies the percentage of cluster resources that are available for applications submitted to that queue. To improve the performance of job scheduling, the proposed system design for improving the execution time and achieving efficient job scheduling is illustrated in Figure 2.

Figure 2. Proposed system design (data input to HDFS; client resource request and allocation; ResourceManager with parameter tuning based on job sets; ApplicationMaster negotiating for resources and launching tasks in containers on the NodeManagers)

5.1. Proposed algorithm

In order to distinguish different types of jobs, a multi-level queue model is used, and the parameters of the multi-level queue, the time factor, and the types of job sets are configured in the configuration file. In the first stage, a queue is selected as the target queue for resource allocation. Assuming the queue set is Q, the formula is as follows:

    TQ = { q ∈ Q | min(used / actCap) }    (1)

Here, TQ represents the target queue, used represents the used resource, and actCap represents the actual capacity of the queue.

In the next stage, to choose the target job sets, we need to know the average running time of the jobs. The running time of job Ji is calculated by

    Ti = TNi - TSi,  i = 1, 2, ..., n    (2)

where Ti is the running time of job Ji, TNi is the current time, and TSi is the submission time of job Ji. According to equation (2), the average time over all the jobs in the job set J is then

    average(T) = (1/n) Σ_{i=1}^{n} Ti    (3)

The job set J can be separated into two types of job sets, SJ (short jobs) and LJ (long jobs), based on the average time average(T):

    Ji ∈ SJ if Ti ≤ average(T);  Ji ∈ LJ if Ti > average(T)    (4)

Based on the resource demand of single map or reduce tasks, D_i^r and D_i^m represent the demand resources used to sort the job sets.

Finally, we build the target job sets from LJ and SJ. The first target job set TJ1 is selected from LJ such that

    Σ_{LJi ∈ LJ} D(LJi) < λ × availCap,  0 < λ ≤ 1    (5)

Here, D(LJi) represents the resource demand of long job LJi, and the default value of λ is 1. Another target job set TJ2 is then selected from SJ as follows:

    Σ_{SJi ∈ SJ} D(SJi) < availCap - Σ_{TJi ∈ TJ1} D(TJi)    (6)

Here, D(SJi) represents the resource demand of short job SJi. Finally, the target job set TJ is easily calculated by

    TJ = TJ1 ∪ TJ2    (7)

The simple algorithm for improving the execution time and achieving efficient job scheduling on Hadoop YARN is given in Figure 3.

begin
  1. get target queue TQ according to (1)
  2. get LJ and SJ according to (2), (3), (4)
  3. sort jobs in LJ by demand resource of map/reduce tasks
  4. sort jobs in SJ by demand resource of map/reduce tasks
  5. choose target jobs TJ1 from LJ according to (5)
  6. choose target jobs TJ2 from SJ according to (6)
  7. get target job set TJ according to (7)
  8. for each job in TJ
  9.   allocate resources to this job
  10. end for
end

Figure 3. The algorithm for improving the execution time and efficient job scheduling on Hadoop YARN

6. Experiments

The proposed work is performed on a pseudo-distributed Hadoop cluster configured with one master and one slave. The experiments were performed under Hadoop 3.2.1. The main configuration information is shown in Table 2; the default values are used for the other parameters.
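As a concrete illustration (a sketch, not the authors' implementation), the job classification and target-set selection of equations (2)-(7) can be written in Python, assuming each job is represented by its submission time and a single scalar resource demand:

```python
# Sketch of the Section 5.1 selection procedure; function and variable
# names here are hypothetical, not from the original system.
def select_target_jobs(jobs, now, avail_cap, lam=1.0):
    """jobs: list of (submit_time, demand) tuples; returns the target job set TJ.

    Eq. (2)/(3): running time Ti = now - TSi, averaged over all jobs.
    Eq. (4):     split into short jobs SJ (Ti <= average) and long jobs LJ.
    Eq. (5)/(6): greedily fill TJ1 from LJ within lam * availCap, then TJ2
                 from SJ within the remaining capacity.
    Eq. (7):     TJ = TJ1 union TJ2.
    """
    times = [now - ts for ts, _ in jobs]
    avg = sum(times) / len(times)                      # average(T), eq. (3)
    sj = [j for j, t in zip(jobs, times) if t <= avg]  # short jobs, eq. (4)
    lj = [j for j, t in zip(jobs, times) if t > avg]   # long jobs, eq. (4)
    # Sort each set by resource demand (steps 3-4 of Figure 3).
    sj.sort(key=lambda j: j[1])
    lj.sort(key=lambda j: j[1])
    tj1, used = [], 0.0
    for job in lj:                                     # fill TJ1, eq. (5)
        if used + job[1] < lam * avail_cap:
            tj1.append(job)
            used += job[1]
    tj2 = []
    for job in sj:                                     # fill TJ2, eq. (6)
        if used + job[1] < avail_cap:
            tj2.append(job)
            used += job[1]
    return tj1 + tj2                                   # TJ = TJ1 u TJ2, eq. (7)
```

The greedy fill is one plausible reading of the capacity constraints (5) and (6); the paper itself only states the inequalities that the selected sets must satisfy.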
Table 2. Main configuration of the Hadoop cluster

Config File     | Key                                  | Value
yarn-site.xml   | yarn.scheduler.minimum-allocation-mb | 3072
yarn-site.xml   | yarn.scheduler.maximum-allocation-mb | 51200
yarn-site.xml   | yarn.nodemanager.resource.memory-mb  | 51200
mapred-site.xml | mapreduce.map.memory.mb              | 3072
mapred-site.xml | mapreduce.map.java.opts              | -Xmx2048m
mapred-site.xml | mapreduce.reduce.memory.mb           | 3072
mapred-site.xml | mapreduce.reduce.java.opts           | -Xmx2048m
mapred-site.xml | yarn.app.mapreduce.am.resource.mb    | 3072
mapred-site.xml | yarn.app.mapreduce.am.command-opts   | -Xmx2048m
mapred-site.xml | mapreduce.task.io.sort.mb            | 1024

The experiments are conducted on Hadoop 3.2.1 configured with the tuned parameter values described in Table 2 and are also repeated with the default parameter values. The system performance is then analyzed by running the Wordcount program on input data of different sizes. The experimental results of running the Wordcount program are shown in Figure 4, which shows that tuning the Hadoop configuration parameters achieves better system performance.

Figure 4. Total execution time (seconds) of the Wordcount program for dataset sizes of 128 MB, 256 MB, 1 GB, 2 GB, and 3 GB

7. Conclusion

This paper presented improved job scheduling through Hadoop job parameter configuration settings. The default Hadoop parameter configuration is not appropriate for all kinds of applications, and proper parameter configuration settings can tune performance. The proposed algorithm can effectively improve the performance of the system. A Hadoop user can obtain the best performance from the cluster resources through proper parameter settings and can build a system with a good parameter configuration. In the future, we will perform experimental evaluation on a fully distributed Hadoop cluster with more suitable Hadoop parameter configurations, test processor- and memory-intensive programs, and compare the workload completion rate, turnaround time, and throughput.

8. References

[1] B. Ailton, M. Andre, and S. Fabiano, "Towards an Ontology-based Semantic Approach to Tuning Parameters to Improve Hadoop Application Performance", Information Technology in Industry 2.2 (2014): pp. 56-61.

[2] B. Garvit, S. Manish, and B. Subhasis, "A Framework for Performance Analysis and Tuning in Hadoop Based Cluster", International Conference on Distributed Computing and Networking, Coimbatore, India, 2014.

[3] B. J. Mathiya and V. L. Desai, "Apache Hadoop YARN Parameter Configuration Challenges and Optimization", International Conference on Soft-Computing and Network Security, Coimbatore, India, February 25-27, 2015.

[4] G. Sasiniveda and N. Revathi, "Performance Tuning and Scheduling of Large Dataset Analysis in MapReduce Paradigm by Optimal Configuration using Hadoop".

[5] H. Wei, D. Luo, and L. Liang, "Optimization of YARN Hierarchical Resource Scheduling Algorithm", International Conference on Computer Science and Application Engineering, 2017.

[6] K. Hadjar and A. Jedidi, "A New Approach for Scheduling Tasks and/or Jobs in Big Data Cluster", 2019 IEEE.

[7] L. Changlong, Z. Hang, L. Kun, S. Mingming, Z. Jinhone, D. Dong, and Z. Xuehai, "An Adaptive Auto-Configuration Tool for Hadoop", Engineering of Complex Computer Systems, 19th International Conference on, IEEE, 2014.
[10] W. Dili and G. Aniruddha, "A Self-Tuning System based on Application Profiling and Performance Analysis for Optimizing Hadoop MapReduce Cluster Configuration", 20th International Conference on High Performance Computing, 18-21 Dec. 2013, pp. 89-98.
[11] https://hadoop.apache.org/
[12] https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[13] https://data-flair.training/blogs/hadoop-yarn-tutorial/