Improving the Performance of Heterogeneous Hadoop Clusters
Abstract— With the development of information technology, we have seen exponential growth in the amount of data that is generated and used over the last decade. The need to store this data and process it in order to extract value from skewed data has paved the way for parallel, distributed processing frameworks such as Hadoop. The current Hadoop implementation assumes that the compute capacity of all the nodes in a cluster is homogeneous, but looking at cloud infrastructure we can see that systems with different hardware configurations are commonly used together, which is only logical. Hence, it is necessary to study a data placement policy that distributes the data according to the processing power of each node. In this paper, we propose a dynamic block placement strategy for Hadoop that distributes the input data blocks to the nodes based on the computing capacity of each node. The proposed algorithm adapts and balances data dynamically, reorganizing the input data in HDFS according to the compute capability of each node in a heterogeneous Hadoop environment. The proposed method reduces data transfer time and thereby improves performance. We ran benchmark applications against the proposed block placement strategy and the original HDFS block placement strategy; the experimental results show that the dynamic data placement strategy can decrease execution time and improve the performance of a heterogeneous Hadoop cluster. Hadoop is also designed to handle big files with infrequent updates, but nowadays many applications need to handle small files, which has become a pertinent cause of performance degradation on the Hadoop platform. Our proposed methods show a good performance improvement in these cases: we observe speedups of 25.7x and 20x compared to the original Hadoop when running the WordCount and Grep benchmark applications.

Keywords: Hadoop, HDFS, MapReduce, Block Placement Strategy, Small Files

I. INTRODUCTION

Hadoop is gaining popularity in the high performance computing arena, as it is used for many data- and compute-intensive applications, and many techniques have been proposed for achieving better performance on Hadoop clusters. The ability of such applications to process petabytes of data generated by websites like Facebook, Amazon and Yahoo and by search engines like Google and Bing has led to a data revolution in which processing every minute piece of information relating to customers and users can add value and thereby improve core competency. Facebook processes roughly 15 terabytes of data each day using the Hadoop framework, and even scientific applications such as seismic simulation and natural language processing are run on Hadoop.

Apache Hadoop is an open source implementation of the Google MapReduce programming model. The Hadoop software framework has two main parts: the Hadoop Distributed File System (HDFS) and the MapReduce programming model. HDFS acts as the storage for MapReduce application data, while a MapReduce program contains the logic to process the application data stored on HDFS. Hadoop has two predominant versions available today, namely Hadoop 1.x and 2.x. The 2.x line improves the batch processing capability by introducing YARN (Yet Another Resource Negotiator). YARN comprises a Resource Manager and Node Managers: the Resource Manager is responsible for managing and deploying resources, and each Node Manager is responsible for managing its node and reporting the node's status to the Resource Manager.

MapReduce helps in analyzing large amounts of data and in gaining valuable insights that benefit many organizations. Applications of MapReduce include machine learning, graph mining, indexing and search, which are difficult to implement using traditional SQL. MapReduce lets programmers focus only on the transformations applied to the data, leaving the details of parallelization, network communication and fault tolerance to be handled by the system [1].

HDFS is a distributed file system which stores files redundantly across cluster nodes for security and availability. To store a file, HDFS splits it into blocks and replicates them according to a replication factor. The default HDFS block placement policy distributes the blocks reasonably well across the cluster nodes. However, there are cases in which this algorithm creates server hotspots that put unnecessary load onto cluster nodes and lower the overall performance of the cluster [2].

In this study, we developed a block placement mechanism for Hadoop that distributes a large number of input data blocks to all the nodes based on the computing capacity of each node, and we propose an algorithm which reorganizes the input data in HDFS accordingly. We also address the small files problem by combining small files into a large file. Since this alone does not overcome the NameNode memory management issue, we additionally tried using the sequence file format described below.
In the sequence file approach, the file name is stored as the key in the sequence file and the file contents are stored as the value. This works very well when the input file name needs to be preserved.

The rest of the paper is organized as follows. Section II briefly describes the Hadoop architecture, the Hadoop Distributed File System (HDFS), MapReduce and our motivation. Section III describes the dynamic block placement strategy and provides experimental results comparing our new compute-ratio balancer with the default Hadoop. Section IV explains our two methods for handling the small files problem, combining the input small files and using the sequence file, and provides experimental results comparing the proposed methods with the existing Hadoop. Section V concludes with a summary and directions for further improvement of this study.

II. BACKGROUND AND MOTIVATION

A. Hadoop and Hadoop Distributed File System

Hadoop [1] is an open source framework which is used to process large amounts of data (terabytes of data). Hadoop is written in Java by the Apache foundation and can run in parallel on large clusters, which can have thousands of systems. The Hadoop framework is attractive for many reasons: it processes data reliably by replicating the input data, it can be scaled to hundreds of nodes, it can handle terabytes or even a few petabytes of input data, and it is designed to run on commodity clusters. The idea behind commodity computing is that it is preferable to have more low-cost, low-performance hardware working in parallel than fewer pieces of high-cost, high-performance hardware. Although a relatively recent project, Hadoop is used for non-trivial applications.

The Hadoop Distributed File System [1] is an Apache Hadoop project; HDFS is a distributed, highly fault-tolerant file system built on low-cost commodity hardware. It provides high throughput and is suitable for large data sets, and it is designed so that it can be deployed on low-cost hardware. Each cluster consists of a single NameNode that maintains the metadata of the file system and multiple DataNodes which store the application data.

B. MapReduce Overview

MapReduce is a model for processing large amounts of data using parallel algorithms on a cluster. A MapReduce job divides the input data into independent chunks that can be computed in parallel; the results of the map tasks are fed as input into the reduce tasks, which produce the final result. Hadoop owes much of its reputation as a largely scalable framework to its MapReduce model.

Any job that is given to the MapReduce model goes through two stages: the first is the map stage and the second is the reduce stage. A MapReduce computation takes a set of input key-value pairs and generates a set of output key-value pairs. The user-provided map function takes key-value pairs as input and generates intermediate key-value pairs. These intermediate key-value pairs are collected together and sent to the reduce function. The reduce function, also written by the user, merges the intermediate key-value pairs and performs the user-specified reduce operation; a reducer may generate a single output key-value pair or a set of output pairs.
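To make this key-value flow concrete, the following is a minimal sketch of a word-count job written against the Hadoop 2.x Java MapReduce API (the class names here are illustrative and are not taken from the paper): the map function emits a (word, 1) pair for every token, and the reduce function sums the counts collected for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountExample {

  // Map stage: for every input line, emit (word, 1) intermediate pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: sum the counts collected for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountExample.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```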
C. Motivation

The current Hadoop implementation assumes that the computing nodes in a cluster are homogeneous in nature. This strategy has potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous one. Furthermore, looking at real cluster infrastructures, we can see that systems with different hardware configurations are used in the same cluster, which is only logical. However, the default HDFS block placement strategy does not consider these different hardware configurations, which results in under-utilization of resources. We therefore want a block placement mechanism in Hadoop that distributes a large number of input data blocks to all the nodes based on the computing capacity of each node; to be specific, we want to design an algorithm which reorganizes the input data in HDFS.

Hadoop also works well with a small number of large files, whereas it does not work well with a large number of small files.
For small files, one file generates one InputSplit, and one InputSplit is used by one map container, which means that the number of InputSplits is equal to the number of map containers. So the first issue is that too many map containers are created by the MapReduce job for small files. Too many containers mean too many processes, and as a large number of processes are created and closed, the overhead becomes a big issue. A similar overhead can be seen for the reduce containers. We want to overcome the small files problem by combining the small files into a large file.
III. DYNAMIC BLOCK PLACEMENT STRATEGY

In a homogeneous cluster, the data gets distributed among the nodes based on the availability of space in the cluster. Hadoop has a balancing utility called the Balancer, which balances the data before applications run in case a large amount of data has accumulated on only a few nodes. The Balancer also takes care of the replication factor, which plays an important role in data movement.

In a heterogeneous environment, it is ideal to transfer data from one node to another when the faster node can account for both the data transfer overhead and the task processing. As a result, our data placement policy explores the consequence of migrating data blocks to faster nodes in the heterogeneous cluster. The implementation details of the data placement algorithm give a better overview of this goal.

A. Design

Initially we perform profiling to compute the computation ratios of the nodes in the cluster. To accomplish this, we run a set of benchmark applications, namely WordCount and Grep, and the computation ratio of each node is calculated based on its processing power.
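The paper does not give the exact formula used to turn these benchmark measurements into computation ratios, so the following is only a sketch of one plausible scheme: each node's ratio is taken as its measured speed (the reciprocal of its benchmark runtime), normalized so that the ratios of all nodes sum to one. The node names and runtimes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of the profiling step: derive a computation ratio per node from the
 * measured running time of a probe benchmark (e.g. WordCount or Grep) on the
 * same input. The formula is an assumption, not the paper's: a node's ratio
 * is its speed (1 / runtime), normalized so that all ratios sum to 1.
 */
public class ComputeRatioProfiler {

  /** runtimeSec: benchmark runtime in seconds, keyed by node host name. */
  public static Map<String, Double> normalizedRatios(Map<String, Double> runtimeSec) {
    double totalSpeed = 0.0;
    for (double t : runtimeSec.values()) {
      totalSpeed += 1.0 / t;            // faster node => larger speed
    }
    Map<String, Double> ratios = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : runtimeSec.entrySet()) {
      ratios.put(e.getKey(), (1.0 / e.getValue()) / totalSpeed);
    }
    return ratios;
  }

  public static void main(String[] args) {
    Map<String, Double> runtime = new LinkedHashMap<>();
    runtime.put("node1", 100.0);        // hypothetical WordCount runtimes
    runtime.put("node2", 100.0);
    runtime.put("node3", 150.0);        // slower node gets a smaller ratio
    System.out.println(normalizedRatios(runtime));
  }
}
```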
The proposed design is implemented within the Hadoop Distributed File System (HDFS) and consists of two parts. In the first part, input file blocks are distributed to all the heterogeneous nodes that are part of the cluster: once the input file fragments are available, they are distributed to the corresponding compute nodes. The next step is to reorganize the input file blocks, because the data placement might get disrupted over time, and this is done by the second algorithm.

1) Initial Data Placement: The algorithm starts by evenly dividing the large input file into a number of blocks. Then, based on the performance capability of the nodes, the input file fragments are assigned to the nodes in the cluster. Nodes with high compute capability are expected to process and store more file fragments than nodes with low compute capability. The initial block placement mechanism thus distributes the data blocks to all the nodes in the heterogeneous cluster based on the performance of each node.
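As an illustration of this proportional assignment, the sketch below turns the normalized computation ratios from the profiling step into a per-node block count for a file. The rounding scheme used here (handing leftover blocks to the last node) is a simplification and an assumption; the paper only states that faster nodes receive more fragments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of the initial data placement decision: given the normalized
 * computation ratios from profiling, decide how many of the input file's
 * blocks each node should receive.
 */
public class InitialPlacementPlanner {

  public static Map<String, Integer> blocksPerNode(Map<String, Double> ratio, int totalBlocks) {
    Map<String, Integer> plan = new LinkedHashMap<>();
    int assigned = 0;
    String last = null;
    for (Map.Entry<String, Double> e : ratio.entrySet()) {
      int n = (int) Math.floor(e.getValue() * totalBlocks);   // proportional share
      plan.put(e.getKey(), n);
      assigned += n;
      last = e.getKey();
    }
    // Hand any leftover blocks (from rounding down) to the last node; a real
    // implementation would spread them, e.g. by largest fractional share.
    if (last != null) {
      plan.put(last, plan.get(last) + (totalBlocks - assigned));
    }
    return plan;
  }
}
```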
2) Data Redistribution: After the initial input file fragments have been distributed to the different nodes, the placement may drift from the initial distribution for the following reasons: (1) new data is appended to an existing file; (2) data blocks are deleted from existing files; (3) new compute nodes are added to the existing cluster. To handle this dynamic block placement problem, we came up with an algorithm that redistributes and organizes the data based on the compute ratios.

B. Implementation Details

The CRBalancer is responsible for migrating data from one node to another, and it proceeds as follows. It first takes the network topology into consideration and then calculates the under-utilized and over-utilized DataNodes. Under-utilized and over-utilized DataNodes are then matched, and blocks are moved concurrently among the nodes. The CRBalancer also takes the replication factor into consideration by not migrating a block to a node that already holds a replica of that block for the corresponding file. After balancing takes place, the benchmarks are run and the results are examined.
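The following is a simplified sketch of the balancing decision just described, not the actual CRBalancer implementation: it compares each DataNode's current block count with the target implied by its computation ratio and pairs over-utilized nodes with under-utilized ones. Topology awareness, replica checks and the actual block transfers are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the compute-ratio balancing decision: emit block-move proposals
 * from over-utilized to under-utilized DataNodes until every node is at its
 * target block count.
 */
public class CrBalancerSketch {

  public static final class Move {
    final String source, target;
    final int blocks;
    Move(String s, String t, int b) { source = s; target = t; blocks = b; }
    @Override public String toString() { return blocks + " block(s): " + source + " -> " + target; }
  }

  /** current: blocks stored per node; target: blocks each node should hold. */
  public static List<Move> plan(Map<String, Integer> current, Map<String, Integer> target) {
    List<String> over = new ArrayList<>();
    List<String> under = new ArrayList<>();
    for (String node : current.keySet()) {
      int diff = current.get(node) - target.get(node);
      if (diff > 0) over.add(node);
      else if (diff < 0) under.add(node);
    }
    List<Move> moves = new ArrayList<>();
    for (String src : over) {
      int surplus = current.get(src) - target.get(src);
      for (String dst : under) {
        if (surplus == 0) break;
        int deficit = target.get(dst) - current.get(dst);
        if (deficit <= 0) continue;
        int n = Math.min(surplus, deficit);
        moves.add(new Move(src, dst, n));
        current.put(src, current.get(src) - n);   // bookkeeping only; no real transfer
        current.put(dst, current.get(dst) + n);
        surplus -= n;
      }
    }
    return moves;
  }
}
```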
C. Performance Evaluation

We ran our experiments on 6 systems. Four of them have a 3.4 GHz quad-core Intel Core i5 CPU, 16 GB of memory and a 1 TB hard disk; the 5th and 6th nodes have a 3.1 GHz quad-core Intel Core i5 CPU, 8 GB of memory and a 1 TB hard disk. The software used in this experiment is Hadoop 2.3.0. One node is assigned the master role and the other nodes are assigned the slave role. The Compute Ratio Balancer reads block information from the NameNode and runs the distribution algorithm in order to distribute the data based on the computing capacity of the nodes.
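The interface through which the Compute Ratio Balancer obtains block information is not spelled out in the paper; as a hedged illustration, the sketch below shows how per-block host information can be read from the NameNode through the public FileSystem API (the input path is hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustration only: print the offset, length and hosting DataNodes of every
 * block of one HDFS file, using FileSystem#getFileBlockLocations.
 */
public class BlockLocationDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);              // e.g. a TeraGen output part file
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}
```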
1) Experimental Results (TeraSort): The input data generated by TeraGen is taken as input to the TeraSort program, and the output, which is the sorted data, is written into the terasort-output folder. Figure 3 shows that the execution time of the new balancer for the TeraSort benchmark is lower than the execution time of the Hadoop standard Balancer and of the default Hadoop block placement strategy. The size of the input data that we sorted is 1 GB.

Fig. 3. TeraSort Benchmark

The TeraSort benchmark does not show a very large improvement even with the new balancer. To understand this, we first look at how TeraGen works. The data is written into the cluster using a MapReduce program. As already discussed for the Hadoop architecture, when a new block is created, the first replica of the block is placed in the first location allotted for the block, and the second and third replicas are placed on different nodes [4]. Any further replicas are placed randomly on different nodes, with the condition that a maximum of two replicas can be placed in the same rack. TeraGen will

Fig. 4. TestDFSIO Benchmark

From the results it is clear that even the standard HDFS Balancer does not perform well for the TestDFSIO benchmark: the execution time is almost the same, around 7 minutes, for both the default block placement strategy and the standard HDFS Balancer. However, our new CRBalancer provides better performance by reducing the execution time of the TestDFSIO benchmark. The new balancer took 3.57 minutes with a weight of 1.5; on modifying the weight to 1.2 we were able to improve the execution time further, to 2.46 minutes. With our proposed CRBalancer we improve the time taken by the program by a factor of about 3.
IV. HANDLING THE SMALL FILES PROBLEM

The small file problem does not just affect small files. If a large number of files in a Hadoop cluster are marginally larger than a multiple of the block size, the same challenges arise as with small files. For example, if the block size is 128 MB but all of the files loaded into Hadoop are 136 MB, there will be a significant number of small 8 MB blocks. Solving the small file problem is therefore significantly more complex.

A. The Issues in Traditional MapReduce for Processing Small Files

In the Hadoop MapReduce process [5], InputSplits are generated by the Application Master.
For small files, one file generates one InputSplit, and one InputSplit is used by one map container, which means that the number of InputSplits is equal to the number of map containers. So the first issue is that too many map containers are created by the MapReduce job for small files. Too many containers mean too many processes. Creating and closing a single process costs very little; however, when a large number of processes need to be created and closed, the overhead cannot be ignored. A similar overhead is generated for the reduce containers. Another issue is the imbalance between computation ability and parallelism: besides the overhead of using processes, traditional MapReduce processing also costs a lot in disk I/O and network communication.

TABLE I
PROCESSING TIME FOR WORDCOUNT APPLICATION ON ORIGINAL HADOOP AND COMBINE INPUT FILE

No. of Files    Original System (sec)    Proposed System (sec)
1000            1349                     45
1500            2090                     52
2000            2688                     57
2500            3358                     60
3000            4029                     71
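Table I refers to the proposed combine-input-file method. The paper does not list its implementation, so the sketch below only illustrates the underlying idea under stated assumptions: every small file in an input directory is concatenated into one large HDFS file, so that a subsequent job sees a few large splits rather than thousands of tiny ones. The paths and the simple byte-level concatenation are assumptions made for illustration.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/** Minimal sketch: concatenate all small files in a directory into one HDFS file. */
public class CombineSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);     // directory holding the small files
    Path combined = new Path(args[1]);     // single large output file

    try (FSDataOutputStream out = fs.create(combined, true)) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) continue;
        try (InputStream in = fs.open(status.getPath())) {
          IOUtils.copyBytes(in, out, conf, false);  // append this file's bytes
        }
        out.write('\n');                            // keep line-oriented records separated
      }
    }
    fs.close();
  }
}
```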
To handle small files with Hadoop's sequence file format, in which the file name is stored as the key and the file contents are stored as the value, we implemented three classes. The first is the FullFileInputFormat class, which is an extension of InputFormat. The second is FullFileRecordReader, which is an extension of the RecordReader class. Finally, SmallFilesToSequenceFile is the main class. Each file is processed as one record, and the first and second classes are needed to process an entire file as a single record.
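The paper does not list the code of these classes. The sketch below shows the usual whole-file pattern they describe, with the class names suffixed with "Sketch" to make clear that this is an illustration rather than the authors' implementation: the input format refuses to split files, and the record reader returns the entire file as a single BytesWritable record.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Sketch of a whole-file input format: one small file becomes one record. */
public class FullFileInputFormatSketch
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;                       // never split a small file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new FullFileRecordReaderSketch();
  }

  /** Reads the entire file backing a split as a single BytesWritable value. */
  public static class FullFileRecordReaderSketch
      extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit split;
    private Configuration conf;
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) {
      this.split = (FileSplit) split;
      this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);   // whole file at once
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
    @Override public BytesWritable getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
  }
}
```

In a driver along the lines of SmallFilesToSequenceFile, a mapper can then emit the file name taken from the split as a Text key and this BytesWritable as the value, with SequenceFileOutputFormat writing the packed sequence file.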
2) Experimental Results (WordCount): As discussed earlier, WordCount is a CPU-intensive benchmark which reads the dataset and counts the number of words in it, splitting each line on a space delimiter. Table II shows the processing time for executing the WordCount benchmark on the original Hadoop system and on the proposed sequence file method. After running our proposed approach on WordCount we see a speedup of 25.7x compared to the original Hadoop. The experiments were run on 1000, 1500, 2000, 2500 and 3000 files, and the time taken by the original Hadoop system and by the proposed sequence file method is reported in seconds.

TABLE II
TIME TAKEN TO PROCESS THE WORDCOUNT BENCHMARK ON ORIGINAL HADOOP AND SEQUENCE FILE

No. of Files    Original System (sec)    Proposed System (sec)
1000            1349                     52
1500            2090                     64
2000            2688                     72
2500            3358                     90
3000            4029                     111

GREP: Grep is a proven CPU-intensive benchmark which reads the dataset and counts the number of words in it that match a search pattern given as input. The Grep benchmark runs as a map job followed in sequence by a reduce job: the map job counts the number of times each matching string occurred, and the reduce job sorts the matching strings by the number of times they occurred (their frequency); the output is stored in a single file.
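As a hedged sketch of the map side just described (Hadoop ships its own Grep example; this is not the benchmark's actual source), the mapper below emits every substring of a line that matches a user-supplied regular expression together with a count of one. The "grep.pattern" configuration key is an assumption.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of a grep-style mapper: emit (matching string, 1) for every hit. */
public class GrepMapperSketch extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // The pattern is passed in through the job configuration (assumed key).
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher matcher = pattern.matcher(value.toString());
    while (matcher.find()) {
      context.write(new Text(matcher.group()), ONE);   // one hit for this match
    }
  }
}
```

A summing reducer like the one in the earlier WordCount sketch then produces the per-string counts, and a small second job can sort those counts to obtain the frequency-ordered output described above.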
Fig. 8. Grep Benchmark Application

Fig. 8 shows the processing time for executing the Grep benchmark application on the original Hadoop system and on the proposed sequence file method. After running our proposed approach on the Grep application we see a speedup of 20x compared to the original Hadoop.

Even though the sequence file format handles both issues of the small files problem in Hadoop, it cannot provide quick random access to specific files, because the SequenceFile does not store the metadata of the individual files, and for such access it does not show better performance than files stored directly in HDFS.

V. CONCLUSION AND FUTURE WORK

In conclusion, our proposed dynamic block placement strategy is an improvement over the default HDFS block placement policy, and our CRBalancer yields greater performance than the HDFS standard Balancer. Future work has to come up with an efficient data placement strategy in which the data blocks are stored using a different metric that works even better, and which also deals with the network overhead and bandwidth.

We also came up with the sequence file format approach, which handles both issues of the small files problem. The sequence file method was run on the WordCount and Grep benchmark applications, and we see speedups of 25.7x and 20x respectively compared to the original Hadoop. Future work also has to come up with an optimal scheduler that manages multiple jobs with different workloads that are executed simultaneously.

ACKNOWLEDGMENT

This work is dedicated to the founder chancellor of our university, Sri Sathya Sai Baba.

REFERENCES

[1] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.
[2] Andrew Wang, "Better sorting in NetworkTopology#pseudoSortByDistance when no local node is found," https://issues.apache.org/jira/browse/HDFS-6268, 2014. [Accessed 28-April-2014].
[3] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1-10.
[4] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, "The Hadoop Distributed File System," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.
[5] Fang Zhou, "Assessment of Multiple MapReduce Strategies for Fast Analytics of Small Files," Ph.D. thesis, Auburn University, 2015.
[6] Fang Zhou, Hai Pham, Jianhui Yue, Hao Zou, and Weikuan Yu, "SFMapReduce: An optimized MapReduce framework for small files," in Networking, Architecture and Storage (NAS), 2015 IEEE International Conference on. IEEE, 2015.