Performance Analysis of Hadoop Map Reduce On
International Journal of Computer Applications (0975 – 8887)
Volume 79 – No 17, October 2013
can be a generic type and outputs zero or more key-value pairs that may also be of some other type. The input and output types can be the same or different. The reduce function processes the results output from the map phase for every unique key. The reducer function receives values and outputs zero or more key-value pairs. MapReduce programs interact with the framework for processing large tasks. The framework takes care of assigning tasks to the map and reduce phases, rerunning tasks that have failed, splitting input to feed the map phase, passing map output to the reducer, and receiving output from the reducer.

3.2 HDFS
The Hadoop Distributed File System is designed to hold very large files that are terabytes in size. It is built around the concept of write once, read many times; in high-compute applications this assumption generally holds, which is why HDFS adds to the overall performance of the file system and of MapReduce as a whole. In HDFS, data files are distributed and stored in chunks, and much of its performance can be attributed to data being local to the node, avoiding costly data movement between nodes in the cluster.

HDFS can keep multiple copies of data, which increases reliability. One also has the option to run a file system integrity check, which can correct file system inconsistencies, missing blocks, etc. This feature is a plus compared with the commercial file systems available now.

In a typical enterprise deployment, HDFS consists of a single NameNode and multiple DataNodes. The NameNode is the HDFS master and all DataNodes work as slaves. To ensure availability of the NameNode, redundant files are stored in DataNodes. The file system's metadata is stored in the NameNode; it contains information such as file system permissions and last access time. When a user wants to read a file, the Hadoop HDFS client contacts the NameNode. The NameNode fetches the block locations and returns them to the client, leaving the client to do the reading and merging of blocks from the DataNodes.

The NameNode also provides an interface, served over HTTP, through which users can access useful information such as file system capacity, NameNode logs, configured capacity, and how many nodes the HDFS cluster is made of.

4. EUCALYPTUS CLOUD
Cloud computing [6] can be described as a system that integrates the fundamental components of enterprise computing under a single point of management. It provides access to large pools of data storage, network bandwidth and system resources that can be distributed to a number of machines, each acting as a standalone machine.

Eucalyptus is a framework developed using open-source standards and is referred to as Infrastructure as a Service (IaaS) [10]: a system that lets IT architects control a large IT environment through a single piece of software, regardless of end-user demand. End users, in turn, gain the ability to easily build and deploy systems that can be controlled and scaled up with ease. Different operating systems can be deployed on a Eucalyptus cloud, which provides the flexibility to meet end users' needs when they deploy their applications. The Eucalyptus software is installed on top of CentOS, termed the host operating system, and the clients can be termed virtual machines. Eucalyptus is configured during the initial installation with network information, storage arrays, how IP addresses should be assigned, and what kind of data store to use. Eucalyptus then contacts the daemons running for the different components to determine the setup and layout of the systems. With this information, Eucalyptus manages the dynamic creation, distribution and sharing of resources among the multiple virtual machines spread across the physical hardware that is part of the Eucalyptus system. The main components of Eucalyptus are:

4.1 Walrus
Walrus [11] is the storage subsystem, where data is stored in a way similar to Amazon S3 [13]. Data is logically grouped into buckets, with the same API for reading and writing data in a bucket.

4.2 Cloud Controller (CLC)
The Cloud Controller [11] is the component that controls virtual machine operations such as starting, stopping and assigning system resources like RAM and CPU, and that frequently polls the components to determine the status of each assigned resource. It manages the user access database and the keys required for administration. It is also the entry point for end users and administrators to configure and use the system.

4.3 Storage Controller
This is the equivalent of EBS in Amazon [13]. SAN storage, local disk arrays, local disks and NFS can all be added to the storage controller nodes. Depending on the storage assigned, storage needs can be grouped by the kind of performance expected from the file system. Storage can be assigned dynamically.

5. PERFORMANCE ANALYSIS OF HADOOP
Experiments were conducted by running a K-Means clustering job with three iterations to cluster a chunk of emails. Performance was measured using the monitoring tool Ganglia, capturing CPU, memory, disk I/O and network transfer rate during the time the job was run.

K-Means clustering performance:
The K-Means clustering program was run with a chunk of emails and used to analyze how large iterative jobs perform on Hadoop when run on Eucalyptus. The overall results showed excellent performance.

Figure 1

Figure 1 shows how the performance of K-Means clustering varies as the number of nodes increases. From the graph it can be seen that the execution time decreases as the number of nodes increases.
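To make the measured job concrete, the sketch below shows how one K-Means iteration fits the map/reduce contract described in Section 3.1: the map phase assigns each point to its nearest centroid, the shuffle groups points by centroid, and the reduce phase recomputes each centroid. This is a minimal, framework-free Python illustration; the function names are not the Hadoop API, and a real job such as the implementations described in [1] and [12] distributes these phases across nodes.

```python
def kmeans_map(point, centroids):
    """Map phase: emit (nearest-centroid-index, point) for one input record."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return (distances.index(min(distances)), point)

def kmeans_reduce(key, points):
    """Reduce phase: average all points assigned to one centroid."""
    n, dims = len(points), len(points[0])
    return (key, tuple(sum(p[d] for p in points) / n for d in range(dims)))

def run_iteration(points, centroids):
    """Shuffle map output by key, then reduce each group -- the framework's job."""
    groups = {}
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups.setdefault(key, []).append(value)
    return [kmeans_reduce(k, vs) for k, vs in sorted(groups.items())]
```

For example, `run_iteration([(0, 0), (1, 1), (10, 10), (11, 11)], [(0, 0), (10, 10)])` returns the updated centroids `[(0, (0.5, 0.5)), (1, (10.5, 10.5))]`. Each extra iteration repeats this map/shuffle/reduce cycle, which is why iterative jobs like K-Means stress the framework's task-scheduling and I/O paths.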
Using three VMs clearly improves execution speed, and four VMs further improves map/shuffle performance, though with a decrease in CPU utilization.

Due to the limitations of the testing environment, I/O was written to the local disk using iSCSI over a 100 Mbps network. Although there are limitations compared with SAN disks or other high-end storage devices, the local disk handled the extensive reads and writes well. The TestDFSIO benchmark clearly showed the bottleneck: processes spent much of their time waiting for I/O to complete.
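A quick back-of-the-envelope calculation shows why the 100 Mbps network, rather than the disk, is the likely ceiling in this setup. The local-disk figure below is an assumed typical sequential rate, not a measurement from the experiment.

```python
def mbps_to_mb_per_s(megabits_per_second):
    """Convert a link rate in megabits/s to megabytes/s (8 bits per byte)."""
    return megabits_per_second / 8

network_mb_s = mbps_to_mb_per_s(100)   # iSCSI path over 100 Mbps Ethernet
local_disk_mb_s = 100                  # assumed typical sequential disk rate

print(network_mb_s)                    # 12.5 -- MB/s ceiling on the iSCSI path
print(local_disk_mb_s / network_mb_s)  # 8.0  -- headroom lost to the network
```

At roughly 12.5 MB/s, the network link saturates long before a local disk would, which is consistent with processes queuing on I/O completion during the heavy read/write phases.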
Figure 2

Figure 3

Figure 3 shows the entire-cluster load average when the K-Means job was run with three iterations, measured right after the job started running. The graph measures the load average of the program.

Figure 4

Figure 4 shows the entire-cluster memory statistics when the K-Means job was run with three iterations. The graph shows how much memory, compared with the total memory, was being utilized by the running process.

6. CONCLUSION
While Hadoop MapReduce is designed to be used on top of physical commodity hardware, testing has shown that running Hadoop MapReduce in a private cloud supplied by Eucalyptus is viable and very efficient. The results showed that the performance of a MapReduce job running on four nodes is good. It was noted that increasing the number of nodes boosted performance significantly, and the performance gain grows as the number of nodes scales up.

The need to build individual standalone machines for different user requirements has become a thing of the past. Creating or running an OS is much faster, as pre-built images can be launched very easily and customized according to user requirements. The cloud also gives the user the ability to supply more machines when needed, as long as the physical upper limits of the underlying host machines are not reached.

The Eucalyptus private cloud should be improved to accommodate different user needs when running an image, so that the initial effort to bring up an instance does not take much time. Hadoop's design of working on the HDFS file system also still needs much improvement; Hadoop could look into integrating the GFS [5] file system.

7. REFERENCES
[1] Grace Nila Ramamoorthy, K-Means Clustering Using Hadoop MapReduce. UCD School of Computer Science and Informatics.

[2] Chen He, Derek Weitzel, David Swanson, Ying Lu, HOG: Distributed Hadoop MapReduce on the Grid. 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[3] Xuan Wang, Clustering in the Cloud: Clustering Algorithms to Hadoop Map/Reduce Framework (2010). Technical Reports, Computer Science, Texas State University.

[4] Apache Software Foundation, HDFS User Guide. http://hadoop.apache.org/hdfs/docs/current/hdfsuserguide

[5] MapReduce Tutorial, Apache Hadoop 1.2.1 documentation, Hadoop wiki.

[6] Eucalyptus Systems, Inc., Eucalyptus 3.3.1, Eucalyptus Administration Guide (2.0), 2010.

[7] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation; Communications of the ACM, Volume 51, Issue 1, January 2008.

[8] Cloudera, Cloudera's Distribution Including Apache Hadoop.

[9] A. T. Velte, T. J. Velte, and R. Elsenpeter, Cloud Computing: A Practical Approach. The McGraw-Hill Companies, 2010.

[10] Tom White, Hadoop: The Definitive Guide. O'Reilly Media/Yahoo Press, 2nd edition, 2010.

[11] Huan Liu and Dan Orban, Cloud MapReduce: A MapReduce Implementation on Top of a Cloud Operating System. Cluster, Cloud and Grid Computing (CCGrid) 2011, 11th IEEE/ACM International Symposium.

[12] Weizhong Zhao, Huifang Ma, Qing He, Parallel K-Means Clustering Based on MapReduce. The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences.

[13] The Apache Hadoop Ecosystem, Cloudera University, online resources.

[14] Amazon, Amazon Elastic Block Store (EBS). AWS documentation, Amazon Elastic Compute Cloud User Guide.

[15] Blaise Barney, Introduction to Parallel Computing. Lawrence Livermore National Laboratory.
IJCATM : www.ijcaonline.org