
[Figure 2.2 diagram: input key-value pairs (A α, B β, C γ, D δ, E ε, F ζ) are processed in parallel by mappers, which emit intermediate pairs such as (a 1), (b 2), (c 3); the shuffle-and-sort stage aggregates values by key, e.g., a → [1, 5], b → [2, 7], c → [2, 9, 8]; reducers then produce the final output pairs (X 5), (Y 7), (Z 9).]

Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

Algorithm 2.1 Word count

The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.

1: class Mapper
2:   method Map(docid a, doc d)
3:     for all term t ∈ doc d do
4:       Emit(term t, count 1)

1: class Reducer
2:   method Reduce(term t, counts [c1, c2, . . .])
3:     sum ← 0
4:     for all count c ∈ counts [c1, c2, . . .] do
5:       sum ← sum + c
6:     Emit(term t, count sum)
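The pseudocode in Algorithm 2.1 can be sketched as a single-process Python simulation. The toy driver below (`run_mapreduce`, the sample documents, and the function names are illustrative, not part of any Hadoop API) makes the three stages of Figure 2.2 concrete: apply mappers, group intermediate values by key, then apply reducers.

```python
from collections import defaultdict

def map_word_count(docid, doc):
    """Mapper: emit an intermediate (term, 1) pair for each word."""
    for term in doc.split():
        yield (term, 1)

def reduce_word_count(term, counts):
    """Reducer: sum all partial counts associated with a term."""
    yield (term, sum(counts))

def run_mapreduce(documents, mapper, reducer):
    """Toy single-process driver mimicking the MapReduce stages."""
    intermediate = defaultdict(list)
    for docid, doc in documents.items():          # map phase
        for key, value in mapper(docid, doc):
            intermediate[key].append(value)       # shuffle: group values by key
    output = {}
    for key in sorted(intermediate):              # sort keys, as the barrier does
        for k, v in reducer(key, intermediate[key]):  # reduce phase
            output[k] = v
    return output

docs = {"d1": "a b c c", "d2": "a c b c"}
print(run_mapreduce(docs, map_word_count, reduce_word_count))
# {'a': 2, 'b': 2, 'c': 4}
```

Note that all the parallelism, partitioning, and fault tolerance of a real execution framework is elided here; the point is only the dataflow between mappers, the shuffle-and-sort barrier, and reducers.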
number of map tasks to run, but the execution framework (see next section)
makes the final determination based on the physical layout of the data (more
details in Section 2.5 and Section 2.6). The situation is similar for the reduce
phase: a reducer object is initialized for each reduce task, and the Reduce
method is called once per intermediate key. In contrast with the number of
map tasks, the programmer can precisely specify the number of reduce tasks.
We will return to discuss the details of Hadoop job execution in Section 2.6,
which is dependent on an understanding of the distributed file system (covered
in Section 2.5). To reiterate: although the presentation of algorithms in this
book closely mirrors the way they would be implemented in Hadoop, our focus is on algorithm design and conceptual understanding, not actual Hadoop
programming. For that, we would recommend Tom White’s book [154].
What are the restrictions on mappers and reducers? Mappers and reduc-
ers can express arbitrary computations over their inputs. However, one must
generally be careful about use of external resources since multiple mappers or
reducers may be contending for those resources. For example, it may be unwise
for a mapper to query an external SQL database, since that would introduce a
scalability bottleneck on the number of map tasks that could be run in parallel
(since they might all be simultaneously querying the database).10 In general,
mappers can emit an arbitrary number of intermediate key-value pairs, and
they need not be of the same type as the input key-value pairs. Similarly,
reducers can emit an arbitrary number of final key-value pairs, and they can
differ in type from the intermediate key-value pairs. Although not permitted
in functional programming, mappers and reducers can have side effects. This
is a powerful and useful feature: for example, preserving state across multiple
inputs is central to the design of many MapReduce algorithms (see Chapter 3).
Such algorithms can be understood as having side effects that only change
state that is internal to the mapper or reducer. While the correctness of such
algorithms may be more difficult to guarantee (since the function’s behavior
depends not only on the current input but on previous inputs), most potential
synchronization problems are avoided since internal state is private to individual mappers and reducers. In other cases (see Section 4.4 and Section 7.5),
it may be useful for mappers or reducers to have external side effects, such as
writing files to the distributed file system. Since many mappers and reducers
are run in parallel, and the distributed file system is a shared global resource,
special care must be taken to ensure that such operations avoid synchroniza-
tion conflicts. One strategy is to write a temporary file that is renamed upon
successful completion of the mapper or reducer [45].
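The write-then-rename strategy can be sketched as follows. This is a minimal local-filesystem illustration (the function name and paths are hypothetical, and real Hadoop tasks rely on per-attempt output directories committed by the framework rather than hand-rolled renames): because the rename happens only after the write completes, a failed or duplicated task never leaves a partial file at the final path.

```python
import os
import tempfile

def write_output_atomically(final_path, lines):
    """Write lines to a temporary file, then rename it into place on success.

    A task that crashes mid-write leaves only an orphaned temporary file,
    never a truncated file at final_path, so concurrent readers and retried
    tasks observe either no output or complete output.
    """
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # temp file on same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
    except BaseException:
        os.remove(tmp_path)  # clean up the partial file before re-raising
        raise
```

The temporary file must live on the same filesystem as the destination, since a rename across filesystems degrades to a non-atomic copy.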
In addition to the “canonical” MapReduce processing flow, other variations
are also possible. MapReduce programs can contain no reducers, in which case
mapper output is directly written to disk (one file per mapper). For embarrassingly parallel problems, e.g., parsing a large text collection or independently analyzing a large number of images, this is a common pattern. The
converse—a MapReduce program with no mappers—is not possible, although
10 Unless, of course, the database itself is highly scalable.
