Map Reduce Summary

This document summarizes a research paper on MapReduce and how it provides a simplified model for processing large datasets in parallel across clusters of computers. It describes MapReduce's map and reduce functions and provides results on sorting and searching large datasets. The summary concludes that MapReduce hides parallelization complexity and has been widely adopted at Google for various applications.

Uploaded by

karim ben hassen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
27 views4 pages

Map Reduce Summary

This document summarizes a research paper on MapReduce and how it provides a simplified model for processing large datasets in parallel across clusters of computers. It describes MapReduce's map and reduce functions and provides results on sorting and searching large datasets. The summary concludes that MapReduce hides parallelization complexity and has been widely adopted at Google for various applications.

Uploaded by

karim ben hassen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Research paper summary

Ahmad Afiq Bin Che Johari

Rachid Chelouah

Initiation to Research

1 Apr 2014

MapReduce: Simplified Data Processing on Large Clusters (Summary)

Introduction

The authors, Jeffrey Dean and Sanjay Ghemawat, are both employees at Google. At Google it is very common to analyse large input data in order to perform some computation: for example, computing the set of most frequent queries on a given day, analysing web request logs, or analysing the graph structure of web documents.

Since the input data is large, the computation has to be distributed across many machines so that the answer can be obtained faster.

Most of these computations are conceptually straightforward; the main challenges are parallelising the computation, distributing the data, and handling failures on the distributed machines.

The authors therefore seek a programming model that hides the complexity of parallelisation on distributed machines from the programmer while still performing the computations.
"
Method

The paper gives a schematic representation of the authors' MapReduce model, summarised in the following paragraph.
The input data is split into a number of partitions, and each partition is passed through a map function independently of the others. The map function outputs intermediate key-value pairs. The reduce function then receives the intermediate key-value pairs, grouped by key, and outputs a smaller set of values, possibly a single value per key. Finally, the results from the reduce function on each machine are aggregated to produce the final result.
Results

To give a sense of MapReduce's performance, the authors benchmark their implementation on two computations: the first searches for a pattern through roughly one terabyte of data, and the second sorts roughly one terabyte of data. These two examples are representative of real uses of MapReduce on large datasets: the sort represents a typical program that needs to shuffle a large amount of data, and the search represents a program that needs to extract information from a large dataset. The following is the result for the search implementation.
" The Grep program(search)"

The figure above shows the result of the distributed grep program implemented on the MapReduce model in the authors' paper: the progress of the search computation over time.

The input is 10^10 100-byte records, and grep searches for a relatively rare three-character pattern (92,337 matching records out of 10^10). The input is split into 64 MB pieces, roughly 15,000 of them, and the resulting output is stored in a single file.

The rate gradually increases at the beginning as more machines are assigned during the Map phase; it peaks at over 30 GB/s when around 1,764 workers have been assigned. As the Map phase terminates, the rate decreases quite dramatically.
Perspectives

The first version of the MapReduce library was implemented in February 2003, and significant enhancements were made in August 2003, dealing mainly with the more technical parallelisation problems such as locality optimisation and dynamic load balancing of task execution across worker machines.

The authors were also surprised by how broadly the MapReduce paradigm could be applied at Google. Notable examples are:
• large-scale machine learning problems
• clustering problems for the Google News and Froogle products
• large-scale graph computations

"
" There has been a significant growth of the number of different instances of MapReduce "
" programs since its introduction at Google. "
" "
" The main factor that contributes to its success is that a program that could be written in "
" MapReduce model is simple and highly scalable, meaning it can be executed on different
" machines while hiding away the complexity of parallelisation. Therefore, a programmer "
" who has no experience with distributed or parallel systems could easily exploit the re"
" sources of the distributed machines."
"
" Below is some interesting statistics related to MapReduce instances at provided by the "
" authors from Google’s code management system."
Conclusion

The MapReduce programming model has been widely adopted at Google for many different purposes.

The proposed model hides the parallelisation details, giving programmers a higher level of abstraction on a given problem: they can focus on the problem itself rather than on the details of parallelisation.

Though not all, many problems are expressible as MapReduce computations.
"
Bibliography

The oldest reference provided by the authors dates back to 1989, but most of the references are from the late nineties (1996, 1997) and the early 2000s (2001, 2003). Most of the references are related to parallel computation.
