
International Journal of Computer Applications (0975 – 8887)

Volume 79 – No 17, October 2013

Performance Analysis of Hadoop Map Reduce on
Eucalyptus Private Cloud

Jobby P Jacob
Dept of Computer Science & Engg. - R&D Centre
East Point College of Engineering & Technology,
Bangalore, India

Anirban Basu
Dept of Computer Science & Engg. - R&D Centre
East Point College of Engineering & Technology,
Bangalore, India

ABSTRACT
The cost effectiveness and the ease of maintenance are the reasons behind the increasing popularity of Cloud Computing. The need to reduce the execution time of programs on Cloud platforms has led to the development of Hadoop [12]. This paper analyzes the performance of the K-Means Clustering Algorithm when running on Hadoop MapReduce on the Eucalyptus [5] platform. Running Hadoop on Eucalyptus requires a lot of customization for the software to run, as discussed here. Several tools such as Ganglia, TestDFSIO.java and Linux performance measuring tools have been used to measure the performance. The paper discusses how the performance of the K-Means Clustering Algorithm scales up with the number of nodes on the Eucalyptus cloud. Results of measurements of disk, network and memory bandwidth, data throughput and average I/O are presented here.

Keywords
BigData, Hadoop, MapReduce, Eucalyptus Cloud, Ganglia.

1. INTRODUCTION
Hadoop MapReduce has grown in popularity over recent years due to its robustness, ease of use and effectiveness when processing large amounts of data. This paper evaluates the performance of the K-Means Clustering algorithm implemented using the Hadoop MapReduce framework on a Eucalyptus private cloud. For the evaluation, the work involved running the K-Means clustering algorithm on a Hadoop cluster on the cloud and extending the MapReduce job to run on multiple nodes. Measurements were carried out and the results indicate that Hadoop MapReduce can be run efficiently on a Eucalyptus private cloud for this algorithm and is easily scalable.

2. EARLIER WORK
In [2], Chen He, Derek Weitzel, David Swanson and Ying Lu discussed experiments running Hadoop MapReduce on a grid, which provides a free, elastic and dynamic MapReduce environment on the opportunistic resources of the grid. For the Hadoop evaluation, they successfully extended Hadoop to 1100 nodes on the grid. From the evaluation, Chen and his team found that the unreliability of the grid makes running Hadoop on the grid very challenging.

K-Means Clustering Using Hadoop MapReduce by Grace Nila Ramamoorthy [1] evaluated the performance of K-Means clustering using Hadoop MapReduce on a standalone server infrastructure. This model works well, but when it comes to scalability, efficient use of resources and the need to rebuild the environment for different projects, it becomes complex and very time consuming. It is also not a good model when scalability is a requirement, as individual nodes have to be set up to scale up. Grace also noted that the number of nodes available for map tasks affects performance: the more nodes, the better the performance, and MapReduce is a valuable tool for clustering larger datasets that are distributed and cannot be stored on a single node.

Clustering in the Cloud by Xuan Wang [3] chose to run clustering techniques on the Hadoop MapReduce framework on the cloud, using a public cloud for the infrastructure. Public clouds operate over the internet, which becomes a problem when large volumes of data have to be transferred. He nevertheless determined that clustering technologies can be run on public clouds.

3. APACHE HADOOP
Apache Hadoop is software written in Java which provides an open source platform that enables data-centric applications to run in parallel in a distributed environment. This framework has proved to be an effective way to run an application in parallel, especially when dealing with terabytes of data. Hadoop enables applications to work with thousands of nodes and terabytes of data, without concerning the user with too much detail on the allocation and distribution of data and computation. Hadoop is open source and distributed under the Apache license. The main components of Hadoop are MapReduce and HDFS.

3.1 MAPREDUCE
MapReduce [5] is a programming model that works on large datasets. Many organizations use the MapReduce model for computing when they have huge datasets and need to process them within a short time. MapReduce works by breaking the processing into two phases: the map phase and the reduce phase [10]. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. In MapReduce, the Map function processes the input in the form of key/value pairs and generates intermediate key/value pairs, and the Reduce function processes all intermediate values associated with the same intermediate key generated by the Map function [10]. The Reduce program has to be written in such a way that it can accept the key/value pairs produced by the map phase and then process them.

MapReduce performs a series of operations on the input keys and values. As many companies now adopt Hadoop, the required input varies from flat files to databases. The input is split into chunks; in the case of a database, one can take the example of rows from a table. A split does not actually contain the data; it is just a reference to it. After splitting, the splits are sent to the jobtracker and map tasks are scheduled to process them.

Map functions are evenly distributed between all the nodes in the cluster, and the locality of the data can be considered unknown. There are also better ways to balance the data and to schedule jobs local to the data.


The map phase receives a key/value pair that can be of a generic type and outputs zero or more key/value pairs that may be of other types; the input and output types can be the same or different. The reduce function processes the results output by the map phase for every unique key. The reducer function receives the values and outputs zero or more key/value pairs. MapReduce programs interact with the framework for processing large tasks. The framework takes care of assigning tasks to the map and reduce phases, rerunning tasks that have failed, splitting the input to feed the map phase, giving the map output to the reducer and receiving the output from the reducer.
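To make the key/value contract above concrete, a minimal word-count style Mapper and Reducer written against the Hadoop Java API (org.apache.hadoop.mapreduce) is sketched below. It is not the clustering job evaluated in this paper, and the class names are illustrative only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: (byte offset, line of text) -> (word, 1)
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key/value pair
        }
    }
}

// Reduce phase: (word, [1, 1, ...]) -> (word, total count), one call per unique key
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}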
3.2 HDFS
The Hadoop distributed file system is designed for holding very large files that are terabytes in size. It is built around the concept of write once, read many times; this assumption generally holds for high-compute applications and is the reason why HDFS adds to the overall performance of the file system and of MapReduce as a whole. In HDFS the data files are distributed and stored in chunks, and the performance of the file system can be attributed to data being local to the node, avoiding costly data movement between the nodes in the cluster.

HDFS can keep multiple copies of data, which increases reliability. One has the option to run a file system integrity check, which can report file system inconsistencies, missing blocks, etc. This feature is a plus compared with the commercial file systems available now.

A normal enterprise HDFS deployment consists of a single NameNode and multiple DataNodes. The NameNode is considered the HDFS master and all DataNodes work as slaves. In order to ensure availability of the NameNode, redundant files are stored on the DataNodes. The file system's metadata is stored in the NameNode; the metadata contains information such as file system permissions and last access times. When a user wants to read a file, the Hadoop HDFS client contacts the NameNode. The NameNode then fetches the block locations and returns them to the client, leaving the client to do the reading and merging of blocks from the DataNodes.

The NameNode also provides an interface through which users can access some useful information over HTTP, such as the file system capacity, NameNode logs, the configured capacity and how many nodes the HDFS cluster is made of.
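The read path described above can be exercised with a small client sketch using the Hadoop FileSystem Java API. The configuration is assumed to come from the core-site.xml/hdfs-site.xml files on the classpath, and the file path is taken from the command line; both are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        // The configuration points the client at the NameNode (fs.default.name).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for the block locations; the returned stream
        // then reads the blocks from the DataNodes and merges them for the caller.
        FSDataInputStream in = fs.open(new Path(args[0]));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}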
4. EUCALYPTUS CLOUD
Cloud computing [6] can be described as a system that contains different modules to integrate the fundamental components of enterprise computing so that they function, and are controlled, from a single point of management. It provides access to large pools of data storage, network bandwidth and system resources that can be distributed to a number of machines, each acting as a standalone machine.

Eucalyptus is a framework developed using open-source standards and is referred to as Infrastructure as a Service (IaaS) [10]: a system which allows IT architects to control a large IT environment using a single piece of software, irrespective of end-user demand. Apart from this, the end user gets the ability to easily build and deploy systems that can be controlled and scaled up with ease. Different operating systems can be deployed on a Eucalyptus cloud, which provides the flexibility to meet end users' needs when they deploy their applications. The Eucalyptus software is installed on top of CentOS, termed the host operating system, and the clients can be termed virtual machines. Eucalyptus is configured during the initial installation with network information, storage arrays, how IP addresses should be assigned and what kind of data store to use. Eucalyptus then contacts the different daemons running for the different components to determine the setup and layout of the systems. With this information Eucalyptus manages the dynamic creation, distribution and sharing of resources among multiple virtual machines spread across the physical hardware which is part of the Eucalyptus system. The main components of Eucalyptus are:

4.1 Walrus
Walrus [11] is the storage subsystem where data is stored in a way similar to Amazon S3 [13]. Data is logically grouped into buckets, and the same API is used to read and write data in a bucket.
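Because Walrus exposes the S3 interface, an ordinary S3 client library can be pointed at it. The sketch below uses the AWS SDK for Java against an assumed Walrus endpoint; the host, port, credentials, bucket and object names are placeholders rather than values from this installation.

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class WalrusPutExample {
    public static void main(String[] args) {
        // Eucalyptus access/secret keys of the cloud user (placeholders).
        BasicAWSCredentials credentials =
                new BasicAWSCredentials("EUCA_ACCESS_KEY", "EUCA_SECRET_KEY");

        // Point the S3 client at the Walrus endpoint instead of Amazon S3.
        AmazonS3Client s3 = new AmazonS3Client(credentials);
        s3.setEndpoint("http://192.168.10.1:8773/services/Walrus");

        // Buckets and objects are then used exactly as with Amazon S3.
        s3.createBucket("hadoop-input");
        s3.putObject("hadoop-input", "emails/part-0001.txt",
                new File("/tmp/part-0001.txt"));
    }
}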
4.2 Cloud Controller (CLC)
The Cloud Controller [11] is the component that controls virtual machine operations such as starting, stopping and assigning system resources like RAM and CPU, and it polls the components frequently to determine the status of the resources assigned to each system. It manages the user access database and the keys required for administration. It is also the entry point for end users and administrators to configure and use the system.

4.3 Storage Controller
This is the equivalent of the EBS found in Amazon [13]. SAN storage, local disk arrays, local disks and NFS can be added to the storage controller nodes. Depending on the storage assigned, the storage needs can be grouped according to the kind of performance expected from the file system. Storage can be dynamically assigned.

5. PERFORMANCE ANALYSIS OF HADOOP
Experiments were conducted by running a K-Means clustering job with three iterations to cluster a chunk of emails. The performance was measured using the monitoring tool Ganglia to capture CPU, memory, disk I/O and network transfer rate while the job was running.

K-Means clustering performance:
The K-Means clustering program was run on a chunk of emails and used to analyze how large iterative jobs perform on Hadoop when run on Eucalyptus. The overall results showed excellent performance.
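To show how an iterative job of this kind maps onto the model of Section 3.1, a simplified sketch of one K-Means iteration as a Mapper/Reducer pair is given below. This is a generic illustration rather than the exact program used in the experiments: points are assumed to be comma-separated lines, the current centroids are assumed to be passed in the job configuration under a made-up key "kmeans.centroids" (e.g. "1.0,2.0;5.0,6.0"), and the driver that re-runs the iteration three times is omitted.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: assign each input point to the nearest current centroid.
public class KMeansIterationMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context context) {
        String[] parts = context.getConfiguration().get("kmeans.centroids").split(";");
        centroids = new double[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            String[] dims = parts[i].split(",");
            centroids[i] = new double[dims.length];
            for (int d = 0; d < dims.length; d++) {
                centroids[i][d] = Double.parseDouble(dims[d]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] dims = value.toString().split(",");
        double[] point = new double[dims.length];
        for (int d = 0; d < dims.length; d++) {
            point[d] = Double.parseDouble(dims[d]);
        }
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        // Emit (centroid id, point) so the reducer sees all points of one cluster.
        context.write(new IntWritable(best), value);
    }
}

// Reduce: average all points assigned to one centroid to produce the new centroid.
class KMeansIterationReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
            throws IOException, InterruptedException {
        double[] sum = null;
        long count = 0;
        for (Text p : points) {
            String[] dims = p.toString().split(",");
            if (sum == null) sum = new double[dims.length];
            for (int d = 0; d < dims.length; d++) {
                sum[d] += Double.parseDouble(dims[d]);
            }
            count++;
        }
        StringBuilder newCentroid = new StringBuilder();
        for (int d = 0; d < sum.length; d++) {
            if (d > 0) newCentroid.append(',');
            newCentroid.append(sum[d] / count);
        }
        context.write(clusterId, new Text(newCentroid.toString()));
    }
}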

Figure 1

Figure 1 shows the variation in performance of K-Means clustering as the number of nodes is increased. From the graph it can be seen that the execution time decreases as the number of nodes increases.
Figure 2

Figure 2 shows the speed-up obtained by varying the number of nodes.
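Here the speed-up on n nodes is assumed to follow the usual definition,

    Speedup(n) = T(1) / T(n),

where T(n) is the execution time of the K-Means job when run on n nodes.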
The CPU utilization is considerably higher but not high enough to affect the node controllers of Eucalyptus. Since some CPU and memory are allocated to running the cloud environment, high CPU or memory usage by the MapReduce programs does not throttle the cloud environment.

Figure 3

Figure 3 shows the entire cluster load average when the K-Means job was run with 3 iterations. This is measured right after the job started running. The graph measures the load average of the program.

Figure 4

Figure 4 shows the entire cluster memory statistics when the K-Means job was run with 3 iterations. The graph shows how much memory, compared to the total memory, was being utilized by the running process.

Using three VMs clearly improves execution speed, and four VMs also improve the map/shuffle performance, but with a decrease in CPU utilization.

Due to the limitations of the testing environment, I/O was written to the local disk using iSCSI over a 100 Mbps network. Although there are limitations compared with SAN disks or other high-end storage devices, the local disk stood up well to the extensive reads and writes. From the TestDFSIO tool, the time processes spent waiting for I/O to complete clearly showed the bottleneck.

Figure 5

Figure 5 shows the Hadoop DataNode CPU wait time for I/O processes to complete. The graph shows the hadoop2 node's wait time for I/O after the job started running.

6. CONCLUSION
While Hadoop MapReduce is designed to be used on top of physical commodity hardware servers, testing has shown that running Hadoop MapReduce in a private cloud supplied by Eucalyptus is viable and very efficient. The results showed that the performance of the MapReduce job running on four nodes is good. It was noted that the increase in the number of nodes boosted performance significantly, and the performance gain increases as the number of nodes scales up.

The need to build individual standalone machines for different user requirements has become a thing of the past. Creating or running an OS is much faster, as pre-built images can be launched very easily and customized according to user requirements. The cloud also gives the user the ability to supply more machines when needed, as long as the physical upper limits of the underlying host machines are not reached.

The Eucalyptus private cloud should be improved to accommodate different user needs when running an image, so that the effort to bring up an instance initially does not take much time. Hadoop's design of working on the HDFS file system still needs much improvement; Hadoop can look into integrating the GFS [5] file system.

7. REFERENCES
[1] Grace Nila Ramamoorthy: K-Means Clustering Using Hadoop MapReduce. Published by UCD School of Computer Science and Informatics.

[2] Chen He, Derek Weitzel, David Swanson, Ying Lu: HOG: Distributed Hadoop MapReduce on the Grid. Published by 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.


[3] Xuan Wang: Clustering in the Cloud: Clustering Algorithms to Hadoop Map/Reduce Framework (2010). Published by Technical Reports - Computer Science, Texas State University.

[4] Apache Software Foundation: HDFS User Guide. http://hadoop.apache.org/hdfs/docs/current/hdfsuserguide

[5] MapReduce Tutorial, Apache Hadoop 1.2.1 documentation, Hadoop Wiki.

[6] Eucalyptus Systems, Inc.: Eucalyptus 3.3.1, Eucalyptus Administration Guide (2.0), 2010.

[7] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation; Communications of the ACM, Volume 51, Issue 1, January 2008.

[8] Cloudera: Cloudera's Distribution Including Apache Hadoop.

[9] A. T. Velte, T. J. Velte and R. Elsenpeter: Cloud Computing - A Practical Approach. The McGraw-Hill Companies, 2010.

[10] Tom White: Hadoop - The Definitive Guide. O'Reilly Media/Yahoo Press, 2nd edition, 2010.

[11] Huan Liu and Dan Orban: Cloud MapReduce: A MapReduce Implementation on Top of a Cloud Operating System. Published in Cluster, Cloud and Grid Computing (CCGrid) 2011, 11th IEEE/ACM International Symposium.

[12] Weizhong Zhao, Huifang Ma, Qing He: Parallel K-Means Clustering Based on MapReduce. The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences.

[13] The Apache Hadoop Ecosystem, University of Cloudera, Online Resources.

[14] Amazon: Amazon Elastic Block Storage (EBS). AWS documentation, Amazon Elastic Compute Cloud User Guide.

[15] Blaise Barney: Introduction to Parallel Computing. Published by Lawrence Livermore National Laboratory.

