Map Reduce Summary
Rachid Chelouah
Initiation to Research
1 Apr 2014
Introduction

The authors, Jeffrey Dean and Sanjay Ghemawat, are both employees at Google. At Google, it is very common to analyse large input data in order to perform computations.

Examples include producing the set of most frequent queries on a given day, or analysing web request logs and the graph structure of web documents. Since the input data is large, the computations need to be distributed across several machines so that the answer is obtained faster.

Most of the time the computations themselves are straightforward. The main challenges are parallelising the computation, distributing the data, and handling failures on the distributed machines.

We therefore seek a programming model that hides the complexity of parallelisation on distributed machines from programmers while the computations are performed.

Method

Below is the schematic representation of the authors' MapReduce model.
Research paper summary Page 2

The data is split into a number of partitions, and each partition is passed through a map function independently of the others. The map function outputs intermediate key-value pairs. The reduce function then receives the intermediate key-value pairs and outputs a smaller set of key-value pairs, possibly a single one. Finally, the results from the reduce functions on the independent machines are aggregated to produce the final result.
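
As an illustration of the flow described above, here is a minimal Python sketch of the model, using word counting as the computation. The names map_fn, reduce_fn and map_reduce are hypothetical, and everything runs sequentially in one process, so the sketch only shows the data flow and is not the authors' distributed implementation.

```python
from collections import defaultdict
from typing import Iterable, Iterator

KeyValue = tuple[str, int]


def map_fn(partition: str) -> Iterator[KeyValue]:
    """Emit an intermediate (word, 1) pair for every word in one partition."""
    for word in partition.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: list[int]) -> KeyValue:
    """Merge all intermediate values of one key into a single pair."""
    return key, sum(values)


def map_reduce(partitions: Iterable[str]) -> dict[str, int]:
    # Map phase: each partition is processed independently of the others.
    intermediate: dict[str, list[int]] = defaultdict(list)
    for partition in partitions:
        for key, value in map_fn(partition):
            # Group the intermediate values by key before reducing.
            intermediate[key].append(value)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(key, values) for key, values in intermediate.items())


# Assumption: each string stands for one partition of the input data.
counts = map_reduce(["the quick fox", "the lazy dog", "The end"])
print(counts)
```

In a real deployment the map calls would run on different machines. Here the loop over partitions stands in for that distribution.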
"
"
Results

To give a sense of MapReduce's performance, the authors measure their implementation on two computations. The first searches for a pattern through roughly one terabyte of data, and the second sorts roughly one terabyte of data. These two examples are meant to be representative of using MapReduce against large datasets: the sort represents a typical program that needs to shuffle a large amount of data, and the search represents a program that needs to extract information from a large amount of data. The following is the result for the search implementation of MapReduce.

The Grep program (search)
The figure above is the result of the distributed grep program implementing the MapReduce model in the authors' paper. It shows the progress of the search computation over time.

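
To make the search computation more concrete, here is a small Python sketch of grep expressed as a map function and a reduce function. The function names are hypothetical. Plain substring matching is assumed instead of a full pattern language, and the splits are processed one after another instead of on separate workers.

```python
from typing import Iterable, Iterator


def grep_map(lines: Iterable[str], pattern: str) -> Iterator[tuple[str, str]]:
    """Emit each matching line as the key; the value is unused."""
    for line in lines:
        if pattern in line:
            yield line, ""


def grep_reduce(key: str, values: list[str]) -> str:
    """Identity reduce: copy the intermediate key to the output."""
    return key


def distributed_grep(splits: Iterable[list[str]], pattern: str) -> list[str]:
    matches: list[str] = []
    # Assumption: each split could be handled by a different worker;
    # this sketch simply visits them in order.
    for split in splits:
        for key, value in grep_map(split, pattern):
            matches.append(grep_reduce(key, [value]))
    return matches


splits = [["abc xyz", "no match"], ["xyz again", "nothing"]]
result = distributed_grep(splits, "xyz")
print(result)
```

Because the reduce step only copies its input, almost all of the work happens in the map phase. This fits the observation that the rate drops once the mapping is done.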
"
The input consists of 10^10 100-byte records, and grep searches for a three-character pattern that is relatively rare (it occurs in 92,337 of the 10^10 records). The input is split into pieces of approximately 64MB, giving about 15,000 pieces, and the resulting output is stored in one file.

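
The figures in this paragraph can be checked with a short calculation. The sketch below assumes that "64MB" means 64 * 2^20 bytes. With 64 * 10^6 bytes the count would be somewhat higher, and either way it is close to the 15,000 pieces quoted.

```python
RECORDS = 10**10          # number of input records, from the text
RECORD_BYTES = 100        # size of each record in bytes, from the text
MATCHES = 92_337          # records containing the pattern, from the text
SPLIT_BYTES = 64 * 2**20  # assumption: "64MB" means 64 * 2^20 bytes

total_bytes = RECORDS * RECORD_BYTES         # 10^12 bytes, roughly one terabyte
num_splits = -(-total_bytes // SPLIT_BYTES)  # ceiling division
match_fraction = MATCHES / RECORDS           # share of records that match

print(total_bytes, num_splits, match_fraction)
```

Fewer than one record in 100,000 matches, which is why the text calls the pattern relatively rare.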
"
The rate gradually increases at the beginning as more machines are assigned to the computation during the map phase. It peaks at over 30GB/s when 1764 workers have been assigned. As the map phase terminates, the rate drops quite dramatically.

Perspectives

In fact, the first version of the MapReduce library was implemented in February 2003, and enhancements were made in August 2003, dealing mainly with more technical parallelisation problems such as locality optimisation and dynamic load balancing of task execution across worker machines.

The authors were also surprised by how broadly the MapReduce paradigm could be applied at Google. Notable examples are:
• large-scale machine learning problems
• clustering problems for the Google News and Froogle products
• large-scale graph computations

There has been significant growth in the number of different MapReduce programs since the model's introduction at Google.

The main factor that contributes to its success is that a program written in the MapReduce model is simple and highly scalable: it can be executed across many machines while the complexity of parallelisation stays hidden. Therefore, a programmer who has no experience with distributed or parallel systems can easily exploit the resources of the distributed machines.

Below are some interesting statistics related to MapReduce instances, provided by the authors from Google’s code management system.

Conclusion

The MapReduce programming model has been widely adopted at Google for many different purposes.

The proposed model hides the parallelisation details, allowing programmers to work at a higher level of abstraction on a given problem. We can therefore focus on the problem rather than on the parallelisation details.

Many problems, though not all, can be expressed as MapReduce computations.

Bibliography

The oldest reference provided by the authors dates back to 1989, but most of the references are from the late nineties (e.g. 1996 and 1997) and the early 2000s (e.g. 2001 and 2003). Most of the references are related to parallel computation.