MapReduce Article Review
College of Informatics
Department of Information Science
Data Science and Analytics Postgraduate Program
Data science workshop
Literature review on MapReduce
Submitted to Demeke A. (PhD)
Contents
Introduction
Statement of problem
Objectives
    General Objectives
    Specific objectives
Significance
Methodology
Critique
    Strengths
    Weaknesses
Proposed solution
Conclusion
References
Introduction
Nowadays, data on the web is growing by tens to hundreds of terabytes, volumes that cannot be mined or processed on a single server. Yahoo and Apache developed Hadoop, which replicates the same data three times and distributes the pieces to several systems connected in a network. If one system goes down, the other two replicas remain available, making the approach cheap, robust, and fault tolerant. CloudBLAST, a distributed implementation of NCBI BLAST using Hadoop, has been investigated as an efficient approach to executing bioinformatics applications [1].
MapReduce has emerged as a popular way to harness the power of large clusters of computers. MapReduce allows programmers to think in a data-centric fashion: they focus on applying transformations to sets of data records and allow the details of distributed execution, network communication, and fault tolerance to be handled by the MapReduce framework. MapReduce is typically applied to large batch-oriented computations that are concerned primarily with time to job completion [2].
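This data-centric style can be sketched in plain Python: the programmer supplies only a map function and a reduce function, while a small driver plays the role of the framework. This is a minimal word-count sketch under that assumption, not the Hadoop API; all function names here are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    # Emit an intermediate (word, 1) pair for every word in one input record.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Aggregate all values emitted for one intermediate key.
    return (word, sum(counts))

def run_job(documents):
    # The "framework" part: shuffle intermediate pairs into groups by key,
    # then apply the reduce function to each group.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())
```

For example, `run_job(["the cat sat", "the dog"])` counts each word across both records; only `map_phase` and `reduce_phase` express the computation, while `run_job` stands in for the distributed runtime.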
Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents and web request logs, to compute various kinds of derived data, such as inverted indices, representations of the graph structure of web documents, summaries of the number of pages crawled per host, and the set of most frequent queries in a given day. Most such computations are conceptually straightforward. However, the input data is usually large, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time [3].
Statement of problem
A major challenge in many fields is making full use of large-scale data to support decision making. Data mining is the technique that discovers new patterns from large data sets. It has been studied for many years across all kinds of application areas, and many data mining methods have been developed and applied in practice. In recent years, however, there has been a tremendous increase in the amount of data and in the computation and analysis it requires. In this situation, most classical data mining methods have become impractical for handling such big data. Efficient parallel algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed by such large-scale data mining analyses.
Traditional enterprise systems normally have a centralized server to store and process data; the illustration below depicts a schematic view of such a system. This traditional model is not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates a bottleneck when processing multiple files simultaneously. Google solved this bottleneck using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers; the results are then collected in one place and integrated to form the final dataset.
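The divide-assign-integrate flow just described can be illustrated with a small sketch. Here threads stand in for the many computers, and the task is a simple sum; the function names are illustrative, not Google's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, n_workers):
    # Divide the input into roughly equal parts, one per worker.
    size = max(1, len(data) // n_workers)
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker_sum(chunk):
    # Each worker processes its assigned part independently.
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_sum, chunks))
    # Collect the partial results in one place and integrate them.
    return sum(partial_results)
```

The same three steps — split, process in parallel, integrate — are what MapReduce performs at cluster scale, with network communication and failures handled by the framework rather than by a thread pool.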
[Figure: schematic view of a traditional, centralized enterprise system]
Objectives
General Objectives
The main objective of the reviewed research is to understand MapReduce performance by handling its limitations.
Specific objectives
In order to achieve the general objective, the researchers prepared the following specific objectives:
• To develop examples using the Hadoop implementation that show the use of the MapReduce model on data processing problems
• To learn how to define experiments
• To define a catalogue of problem families that can be implemented using the MapReduce model
• To study variants of the MapReduce model, for example with shared memory, in order to program challenging data processing problems such as join implementations
• To develop a solution for a challenging problem using the MapReduce model
Significance
According to the research papers I have reviewed, MapReduce has several significant advantages. Data handling is faster because of its parallel processing, and both structured and unstructured data can be processed quickly irrespective of the data source and type, giving a flexibility that was not available with traditional relational database systems.
❑ It supports data locality by collocating the data with the compute node, which reduces network communication cost.
❑ It supports scalability. The runtime scheduling strategy enables MapReduce to offer elastic scalability, that is, the ability to dynamically adjust resources during job execution.
❑ It has a fault tolerance strategy that transparently handles failures. It detects the map and reduce tasks of failed nodes and reassigns them to other nodes in the cluster.
❑ It has the ability to handle data from heterogeneous systems: since MapReduce is storage independent, it can analyze data stored in different storage systems.
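The fault tolerance strategy above — detect a failed node and reassign its tasks — can be sketched as a toy scheduler loop. The node names, the `nodes` mapping of callables, and the use of `RuntimeError` to simulate a node failure are all illustrative assumptions, not the actual MapReduce master logic.

```python
def run_with_reassignment(tasks, nodes):
    """Toy scheduler: run each task, reassigning it when a node fails.

    `nodes` maps an illustrative node name to a callable that either
    returns the task's result or raises RuntimeError to simulate failure.
    """
    results = {}
    for task in tasks:
        for node_name in nodes:
            try:
                results[task] = nodes[node_name](task)
                break  # the task succeeded on this node
            except RuntimeError:
                continue  # node failed: reassign the task to the next node
        else:
            raise RuntimeError(f"task {task!r} failed on every node")
    return results
```

The real framework additionally re-executes completed map tasks of failed nodes, since their intermediate output lives on the failed machine's local disk; the sketch only captures the detect-and-reassign idea.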
Methodology
The reviewed papers use different methods; the methodology varies from paper to paper.
Critique
Strengths
➢ Fault tolerance and fair allocation of multiple resources are the strong points of the studies.
➢ Security is an important element, so they process big data in a secured way.
➢ All papers explain MapReduce in detail.
➢ They solved the problems they set out to address.
➢ They can be used as stepping stones by MapReduce researchers.
➢ They used consistent citations.
Weaknesses
➢ Some of the papers do not mention the methods they used.
➢ Results are not comparatively presented in some of the papers.
➢ The proposed solutions are not clear.
➢ Some of the papers do not mention the procedures they followed.
Proposed solution
Reviewing the different papers above, I observed different problems, and the papers try to describe the MapReduce algorithm. Traditional MapReduce implementations provide a poor interface for interactive data analysis, because they do not emit any output until the job has been executed to completion. In many cases, an interactive user would prefer a "quick and dirty" approximation over a correct answer that takes much longer to compute. In the database literature, online aggregation has been proposed to address this problem.
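The idea of online aggregation can be illustrated with a small sketch: instead of waiting for the whole job to finish, the aggregate emits a running estimate after each record. This is a minimal illustration of the concept, not the online-aggregation system proposed in the literature.

```python
def online_mean(stream):
    # Yield a running estimate of the mean after every record, so an
    # interactive user can stop early with an approximate answer.
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count
```

For example, `list(online_mean([2, 4, 6]))` produces the successive estimates `[2.0, 3.0, 4.0]`; a user satisfied with an early estimate need not consume the rest of the stream.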
Conclusion
A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system. The major advantage of MapReduce is that it easily scales data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial.
MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment. The mapper processes the input and emits intermediate key-value pairs; the reducer then aggregates those intermediate pairs into a smaller set of key-value pairs, which is the final output.
References
[1] W.-H. Lee, H.-G. Jun, and H.-J. Kim, "Hadoop MapReduce Performance Enhancement Using In-Node Combiners," International Journal of Computer Science & Information Technology (IJCSIT), vol. 7, October 2015.
[2] S. Maitrey and C. K. Jha, "MapReduce: Simplified Data Analysis of Big Data," Elsevier B.V., pp. 563–571, 2015.
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Google, Inc., 2004.