ABSTRACT
The computer industry is being challenged to develop methods and techniques for affordable data
processing on large datasets at optimum response times. The technical challenges in dealing with the
increasing demand to handle vast quantities of data are daunting and on the rise. One of the recent
processing models offering a more efficient and intuitive solution for rapidly processing large amounts of
data in parallel is MapReduce. It is a framework that defines a template approach to programming for
performing large-scale data computation on clusters of machines in a cloud computing environment.
MapReduce provides automatic parallelization and distribution of computation across many processors
and hides the complexity of writing parallel and distributed programming code. This paper provides a
comprehensive systematic review and analysis of large-scale dataset processing and dataset handling
challenges and requirements in a cloud computing environment using the MapReduce framework and its
open-source implementation, Hadoop. We define requirements for MapReduce systems to perform
large-scale data processing, propose a MapReduce framework together with an implementation of this
framework on Amazon Web Services, and present an experiment of running a MapReduce system in a
cloud environment. The paper shows that MapReduce is one of the best techniques for processing large
datasets and that it helps developers perform parallel and distributed computation in a cloud environment.
KEYWORDS
MapReduce, Hadoop, cloud computing, parallel and distributed processing
1. INTRODUCTION
Today, the need to process large amounts of data has grown in engineering, science, commerce, and
economics worldwide. The ability to process huge volumes of data from multiple data sources remains a
critical challenge.
Many organizations face difficulties when dealing with large amounts of data. They are unable to
manage, manipulate, process, share, and retrieve such data with traditional software tools, because those
tools are costly and time-consuming for data processing. The term large-scale processing refers to
handling applications with massive datasets. Such applications devote the largest fraction of their
execution time to moving data from storage to the computing nodes in a computing environment [1].
The main challenges behind such applications
are data storage capacity and processor computing power constraints. Developers need hundreds
or thousands of processing nodes and a large volume of storage devices to process complex
applications with large datasets; such applications process multi-terabyte to petabyte-sized
datasets [2], and traditional data processing methods such as sequential and centralized processing
are not effective for these kinds of problems.
The question is how to process large amounts of distributed data quickly, with good response
times and replication, at minimum cost. One of the best ways to process huge amounts of data is to
perform parallel and distributed computing in a cloud computing environment. Cloud computing,
as a distributed computing paradigm, allows large datasets to be processed on the available
computing nodes, for example by using a MapReduce framework. MapReduce is a software
framework introduced by Google in 2004; it runs on a large cluster of machines and is highly
scalable [3]. It is a high-performance processing technique for solving large-scale dataset
problems. A MapReduce computation can process terabytes to petabytes of data on thousands of
processors. Google uses MapReduce for indexing web pages. Its main aim is to process, in parallel,
large amounts of data stored on a distributed cluster of computers. This study presents a way to
solve large-scale dataset processing problems in a parallel and distributed manner on a large
cluster of machines using the MapReduce framework. It is a basis for taking advantage of the
cloud computing paradigm as a new, realistic industry standard for computation.
The first contribution of this work is to propose a framework for running a MapReduce system in a
cloud environment based on the captured requirements and to present its implementation on
Amazon Web Services. The second contribution is to present an experiment of running the
MapReduce system in a cloud environment to validate the proposed framework and to evaluate the
experiment against criteria such as speed of processing, data-storage usage, response time, and
cost efficiency.
The rest of the paper is organized as follows. Section II provides background information and a
definition of MapReduce and Hadoop. Section III describes the MapReduce workflow, gives a
general introduction to the Map and Reduce functions, and describes Hadoop, an implementation
of the MapReduce framework. Section IV presents MapReduce in cloud computing. Section V
presents related MapReduce systems. Section VI captures a set of requirements for developing the
framework. Section VII shows the proposed framework and its implementation on Amazon Web
Services for running a MapReduce system in a cloud environment; it also presents an experiment
of running a MapReduce system in a cloud environment to validate the proposed framework,
together with the resulting outcomes and evaluation criteria. Section VIII concludes the paper.
2. BACKGROUND
A unique feature of MapReduce is that it can interpret and analyse both structured and
unstructured data across many nodes by using a distributed shared-nothing architecture.
A shared-nothing architecture is a distributed computing architecture consisting of multiple nodes.
Each node is independent and has its own disks, memory, and I/O devices in the network. In this
type of architecture, each node is self-sufficient and shares nothing over the network; therefore,
there are no points of contention across the system.
A MapReduce programming model consists of three fundamental phases:
1. Map phase: the input is partitioned into M Map tasks (Mappers); each Mapper runs in parallel.
The outputs of the Map phase are intermediate key and value pairs.
2. Shuffle and sort phase: the output of each Mapper is partitioned by hashing the output key.
In this phase, the number of partitions equals the number of Reducers; all key and value
pairs that share the same key belong to the same partition. After the Map output is
partitioned, each partition is sorted by key to merge all values for that key.
3. Reduce phase: the work is partitioned into R Reduce tasks (Reducers); each Reducer also runs
in parallel and processes a different set of intermediate keys.
MapReduce libraries have been written in several programming languages, including LISP, Java,
C++, Python, Ruby, and C. In a typical MapReduce workflow, a dataset is divided into several
units of data, each unit of data is processed in the Map phase, and the results are finally combined
in the Reduce phase to produce the final output.
The Map function takes input pairs, produces a set of intermediate key and value pairs, and passes
them to the Reduce function, which combines all the values associated with the same key. The
Reduce function accepts an intermediate key and a set of values for that key; it merges these
values together to form a possibly smaller set of values that is written to the output file [5].
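To make the two functions concrete, the following is a minimal sketch of a word-count Mapper and Reducer written against the Hadoop MapReduce API; the class names are illustrative and not taken from this paper.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits an intermediate (word, 1) pair for every word in its input split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // intermediate key/value pair
        }
    }
}

// Reducer: receives all values for one key and merges them into a single count.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // final (word, count) pair
    }
}
```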
3. MAPREDUCE FRAMEWORK
3.1. MapReduce Workflow Overview
When the user runs a MapReduce program, the following operations occur, as shown in Figure 1.
The numbered labels in the above figure relate to the numbers in the list below:
1. Execution begins by running the MapReduce program. The user submits a job to the master
node. The master node divides the input dataset into many pieces of data, typically 16 MB to
64 MB per piece, in HDFS and then creates several copies of the user's MapReduce
program over the working machines in the cluster. Each machine in the cluster has a separate
instance of the MapReduce program. HDFS makes multiple copies of each data block for
reliability and places them over the working machines within the cluster. After that, MapReduce
automatically starts to process the blocks of data where they are located.
2. One copy of the program placed on a machine is special, and that machine is called the
master node. The remaining copies of the program are assigned to worker (slave) nodes by the
master node. The master node partitions a job into Map and Reduce tasks. The Map tasks and
Reduce tasks are performed, in order, by Mapper and Reducer functions. The master node then
chooses idle Mappers or Reducers and allocates a Map or Reduce task to each of them.
3. A Mapper receives a Map task, reads the content of a chunk of data, and invokes the
user-defined Map function. The Map function then produces the intermediate key and value
pairs for each chunk of the input data. The outputs of the Map function (intermediate key and
value pairs) are buffered in memory.
4. The output data in the temporary buffers are stored on the local disk, and the locations of
these buffered data on the local disk are sent to the master node. The master finds idle
workers and forwards the locations of these buffered data to them to perform Reduce
tasks.
5. A Reducer is informed by the master node about these locations; it uses remote procedure
calls to read the buffered data from the Mappers' local disks. When a Reducer has read all
the intermediate key and value pairs, it sorts them by the intermediate keys so that all the
data for the same intermediate key are grouped together.
6. The Reducer passes each unique key and its corresponding set of intermediate values to the
Reduce function. The final output is available in the Reducer and is then stored in the
distributed file system (HDFS). (A job-driver sketch illustrating how such a job is submitted
follows this list.)
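As an illustration only, and under the assumption that the Hadoop Job API is used together with the Mapper and Reducer sketched earlier, a minimal job driver that triggers the workflow above might look as follows; the HDFS input and output paths are placeholders, and API details may vary across Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver: submits a word-count job to the cluster (step 1 of the workflow).
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);      // Map phase
        job.setCombinerClass(WordCountReducer.class);   // optional local aggregation
        job.setReducerClass(WordCountReducer.class);    // Reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; the framework splits the input into blocks
        // and schedules Map tasks close to the data.
        FileInputFormat.addInputPath(job, new Path("/input/dataset.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```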
The namenode is a single master server in a cluster environment that manages the file system
metadata for the whole Hadoop distributed file system and controls read/write access to data files
by users. There are typically many datanodes (one datanode per compute node in the
cluster), which are used to store and retrieve the data, and they serve the read and
write requests from file system clients. Datanodes are also responsible for performing
replication tasks and, more importantly, for storing the file system data [4]. The namenode and
datanodes play fundamental roles in a Hadoop file system. The jobtracker accepts and controls
submitted jobs in a cluster environment and is responsible for scheduling and monitoring all
running tasks on the cluster. It divides the input data into many pieces to be processed by
the Map and Reduce tasks. Tasktrackers execute Map tasks and Reduce tasks according to
instructions received from the jobtracker. They are responsible for partitioning Map outputs and
for sorting/grouping Reducer inputs. The jobtracker and tasktrackers form the backbone of
running a MapReduce program.
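As an illustrative sketch only (not code from this paper; the HDFS path and file content are hypothetical), a client interacts with the namenode and datanodes through Hadoop's FileSystem API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: the client asks the namenode for metadata, then streams blocks
// to and from the datanodes; replication is handled transparently by HDFS.
public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/sample.txt");   // hypothetical HDFS path

        // Write: blocks are replicated across datanodes according to dfs.replication.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read: the client obtains block locations from the namenode,
        // then reads the blocks directly from the datanodes.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```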
5. MAPREDUCE REQUIREMENTS
5.1. Fundamental and Specific Requirements
The MapReduce platform is a distributed runtime engine for managing, scheduling, and running
MapReduce systems in a cluster of servers across an open distributed file system. To develop a
MapReduce system based on the proposed framework, fundamental and specific requirements
have been captured. Table 1 summarizes the fundamental requirements for a MapReduce system,
and Table 2 summarizes the specific requirements. Both sets of requirements must essentially be
met to make a MapReduce system for large-scale processing more efficient.
Table 1. Fundamental Requirements.
1. Scalability
2. Parallelism
3. Distributed Data
4. Cost Efficiency

Table 2. Specific Requirements.
1. Availability
2. Reliability
3. Flexibility
4. Security
5. Usability
6. Locality
7. Data Consistency
8. Trust
6. PROPOSED FRAMEWORK
A framework is proposed based on the MapReduce requirements; it is shown in Figure 3. This
framework is designed to operate in a cloud computing environment.
Figure 3. MapReduce framework and its components in cloud computing at the platform level
Slaves: process the smaller tasks as directed by the master on a determined part of the dataset
and pass the results back to the master node.
HDFS: as a distributed, scalable file system, it virtually stores the input and output files. The
user of the MapReduce system can access the final result from this virtual storage.
Legends:
(1) Connect to the AWS console and upload the MapReduce program and the input dataset to the
Amazon S3 service (see the upload sketch after this list).
(2) The input file is uploaded into HDFS.
(3) MapReduce program is located at the master node.
(4) Master node gets the input file from HDFS.
(5) Master node distributes data blocks to slaves for processing.
(6) Master node checks and updates the state of the running jobs.
(7) Results from the slaves are passed back to the master node and stored in HDFS.
(8) Final output file is transmitted from HDFS to Amazon S3.
(9) Final output file is transmitted from Amazon S3 to the local machine.
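As an illustration of step (1) only, the upload to S3 can be performed with the AWS SDK for Java; the bucket name, object keys, and local file names below are assumptions, not values from this paper.

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch: upload the MapReduce JAR and the input dataset to Amazon S3.
public class UploadToS3 {
    public static void main(String[] args) {
        // Uses AWS credentials from the environment or configuration files.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Hypothetical bucket and key names.
        s3.putObject("my-mapreduce-bucket", "program/wordcount.jar", new File("wordcount.jar"));
        s3.putObject("my-mapreduce-bucket", "input/dataset.txt", new File("dataset.txt"));
    }
}
```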
Figure 5 shows a working model of implementing the proposed framework on Amazon Web
Services for massive data processing. It shows how Amazon Web Services uses cloud
infrastructure together with MapReduce to run a MapReduce program for large-scale data
processing. The model includes a number of Amazon Web Services, listed below:
1. AWS management console: manages and provides access to a collection of remote services
offered over the internet by amazon.com.
2. Amazon Elastic MapReduce: offers the Hadoop framework as a service.
3. Amazon S3: the Simple Storage Service (S3) stores the input and output datasets.
4. Amazon EC2 cluster: the Elastic Compute Cloud (EC2) is used to run large distributed
processing jobs in parallel.
5. Amazon SimpleDB: tracks the state of a given component or process.
To run the system from the AWS management console, a user who has developed a MapReduce
program signs up for the AWS console to obtain AWS services, then uploads the MapReduce
program and the input dataset to the Amazon S3 service. The input dataset is transmitted into
HDFS to be used by the EC2 instances. A job flow is started on an Amazon EC2 cluster. In this
service, one machine works as the master node and the others work as slave nodes to distribute the
MapReduce processing. While the processes are running, SimpleDB shows the status of the
running job and all information about the EC2 instances. All machines are terminated once the
MapReduce tasks complete. The final result is stored in an output file, and the output can be
retrieved from Amazon S3.
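As an illustrative sketch only (not the exact mechanism used in this paper), a job flow like the one described above could be started programmatically with the AWS SDK for Java. The JAR location, S3 paths, instance type, and instance count below are assumptions, and details such as IAM roles and AMI/release versions are omitted.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Sketch: start a job flow that runs a word-count JAR stored in S3 on an EC2 cluster.
public class StartJobFlow {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The JAR and its arguments (input/output locations in S3) are placeholders.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://my-mapreduce-bucket/program/wordcount.jar")
                .withArgs("s3://my-mapreduce-bucket/input/", "s3://my-mapreduce-bucket/output/");

        StepConfig step = new StepConfig("word-count", jarStep)
                .withActionOnFailure("TERMINATE_CLUSTER");

        JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
                .withInstanceCount(8)                    // one master plus slave nodes
                .withMasterInstanceType("m1.medium")
                .withSlaveInstanceType("m1.medium")
                .withKeepJobFlowAliveWhenNoSteps(false); // terminate when the steps finish

        RunJobFlowRequest request = new RunJobFlowRequest("wordcount-flow", instances)
                .withSteps(step)
                .withLogUri("s3://my-mapreduce-bucket/logs/");

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started job flow: " + result.getJobFlowId());
    }
}
```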
7. EXPERIMENTATION
For our experiment, a MapReduce program was written in Java to process a large dataset. The
program determines the number of occurrences of each word in the given dataset. We executed
the program based on our proposed framework in a cloud computing environment hosted by
Amazon Web Services. The proposed framework, which is one contribution of this work, was
tested by running the MapReduce program on an Amazon Elastic Compute Cloud (EC2) cluster. In
the experiment, the dataset, a 1 GB text file containing 200,000,000 words, was stored on Amazon
Simple Storage Service (S3) for processing.
The dataset was transmitted from Amazon S3 into Hadoop HDFS to be processed on a cluster of
machines from Amazon EC2. In Hadoop HDFS, the dataset was initially divided into 16 MB to
64 MB chunks of data, with each chunk residing on one machine in the cluster. One copy of the
MapReduce program was created on the master node of the Amazon EC2 cluster, and the master
node then distributed this copy of the program, together with one chunk of data, to all working
machines in the cluster as a MapReduce job. Each working machine in the Amazon EC2 cluster
executed a MapReduce job based on the instructions of the MapReduce program to process one
chunk of data. All actions were performed in the cloud environment offered by Amazon Web
Services at the price of $0.16 per hour for a medium working machine. The output was stored in
Amazon S3 to be accessible by the user. It lists each word and the number of times it appears in
the input dataset, with a short response time and minimum cost. In the discussion, the experiment
is evaluated based on criteria such as data-storage usage, speed of processing, response time,
and cost efficiency.
Processing details of the 1 GB dataset, which contains 200,000,000 words, are presented in
Figure 6. The figure shows the completion percentage of Map tasks and Reduce tasks in cloud
computing mode with eight instances.
Figure 6. Completion percentage of Map and Reduce tasks over time (seconds).
As shown in Figure 6, the Map tasks complete at 3 seconds, after which the Reduce tasks start and
finish at 5.2 seconds. The Map tasks take 3 seconds to reach 100% completion, and the Reduce
tasks take 2.2 seconds.
(Experimental results by number of instances: 2, 4, 6, and 8; recorded values 65.3, 44.8, 26.6, and 5.2.)
As discussed above, MapReduce performs data processing with the available working machines
in the Amazon EC2 cloud on an extremely low budget. The combination of the MapReduce
framework and cloud computing is an attractive proposition for large-scale data processing, since
MapReduce systems can take advantage of massive parallelization in a computer cluster. Our
analysis shows that using cloud computing and MapReduce together improves the speed of
processing and decreases the cost incurred.
8. CONCLUSION
In conclusion, this paper conducted a comprehensive review and analysis of the MapReduce
framework as a new programming model for massive data analytics and of its open-source
implementation, Hadoop. MapReduce is an easy, effective, and flexible tool for large-scale,
fault-tolerant data analysis. It has proven to be a useful abstraction that allows programmers to
easily develop high-performance systems that run on cloud platforms and to distribute the
processing over as many processors as possible. This paper described Hadoop, an open-source
implementation of the MapReduce framework.
REFERENCES
[1]