What Is A Mapreduce?: Michael Kleber
Google, Inc.
Jan. 15, 2008
Michael Kleber
with many slides shamelessly stolen from Jeff Dean and Yonatan Zunger
Map output
Reduce output
(the, 1)
(the, 1048576)
(the, www.bar.org/index.html)
(purple, cow)
Sorting
In addition to map and reduce functions, you may specify
Partition: (k', number of reducers) -> choice of reducer for k'
Default partition is (hash(k') mod #reducers), for load balancing, but:
Output file for k' reduction determined by its partition
Guarantee: each reducer sees the keys in its partition in sorted order
(implemented by the invisible shuffle-and-sort stage)
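The partition step and the per-reducer ordering guarantee above can be sketched in a few lines of Python. This is only an illustration, not the MapReduce library's API: the names `default_partition` and `shuffle_and_sort` are hypothetical, and `zlib.crc32` stands in for whatever hash the real framework uses.

```python
import zlib


def default_partition(key: str, num_reducers: int) -> int:
    """Hash partitioner: hash(k') mod #reducers, for load balancing."""
    # crc32 is an assumed stand-in hash; it is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers, partition=default_partition):
    """Route each (key, value) pair to a reducer, then sort each reducer's input by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Each reducer sees the keys in its partition in sorted order.
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in buckets]
```

Passing a different `partition` function is how a job would override the default, which is what the sorting example relies on.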
Map: produces (sort key, record)
Partition: send consecutive blocks of sort keys to same reducer,
e.g. by using most significant bits of sort keys
Reduce: identity function
MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes
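A single-process sketch of this sort recipe, under stated assumptions: the sort key is taken to be the first 10 bytes of each record (a hypothetical layout, not something the slides specify), records are non-empty, and the function names are made up for illustration. It is not the MR_Sort implementation.

```python
def map_record(record: bytes):
    """Map: emit (sort key, record). Assumes the key is the first 10 bytes."""
    return record[:10], record


def range_partition(key: bytes, num_reducers: int) -> int:
    """Use the most significant byte so each reducer gets a consecutive key range."""
    return key[0] * num_reducers // 256


def mr_sort(records, num_reducers=4):
    buckets = [[] for _ in range(num_reducers)]
    for record in records:
        key, value = map_record(record)
        buckets[range_partition(key, num_reducers)].append((key, value))
    # The shuffle-and-sort stage orders each partition by key; the reducer is
    # the identity function, so it simply emits the records it receives.
    outputs = [[value for _, value in sorted(bucket)] for bucket in buckets]
    # Reducer i holds smaller keys than reducer i+1, so concatenating the
    # output files in partition order yields a globally sorted result.
    return [record for output in outputs for record in output]
```

All the real work happens in the partition function and the framework's sort guarantee; the map and reduce functions themselves are trivial.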
Except as otherwise noted, this presentation is released
under the Creative Commons Attribution 2.5 License.
Indexing
Map: (url, merged contents) -> <(term, occurrence info)>
Partition: hash(url) % number of output shards
Reduce: (term, occurrences in partition of docs) -> (term, compressed posting list)
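The indexing job above can be sketched as follows. This is a toy, in-memory illustration with assumed details: "occurrence info" is modeled as a (url, word position) pair, tokenization is a plain whitespace split, posting lists are left uncompressed, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def map_document(url: str, contents: str):
    """Map: emit (term, occurrence info) for every token in the document."""
    for position, term in enumerate(contents.lower().split()):
        yield term, (url, position)


def doc_shard(url: str, num_shards: int) -> int:
    """Partition by document, so each output shard indexes a subset of the docs."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


def reduce_term(term, occurrences):
    """Reduce: turn a term's occurrences into a sorted (uncompressed) posting list."""
    return term, sorted(occurrences)


def build_index(docs, num_shards=2):
    shards = [defaultdict(list) for _ in range(num_shards)]
    for url, contents in docs.items():
        shard = shards[doc_shard(url, num_shards)]
        for term, occurrence in map_document(url, contents):
            shard[term].append(occurrence)
    return [dict(reduce_term(t, occ) for t, occ in sorted(s.items())) for s in shards]
```

Because the partition is on the URL rather than the term, a query for one term has to consult every shard, but each shard is a complete index for its own documents.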
PageRank
Indexing pipeline can answer the question
PageRank
Suppose you browsed the web randomly, i.e. by clicking on random links.
What fraction of your time would you spend on each page?
T(X) = time spent on page X
L(X) = # links from page X to other pages
Then the Ts and Ls would satisfy equations like
[Figure: pages A, B, and C each link to page X]
PageRank
T(X) = T(A)/L(A) + T(B)/L(B) + T(C)/L(C)
Problems:
Pages with no outbound links = black holes
How to calculate? Is there even a unique solution?
Solution to both: damping factor (d = .85)
PageRank
[Figure: public domain image by Wikipedia user 345Kai; appears on http://en.wikipedia.org/wiki/PageRank]
PageRank
PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )
      = (1-d) + d * \sum_{Y \to X} PR(Y)/L(Y)
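One way to compute values satisfying the damped formula is to iterate it until the numbers stop changing. The sketch below does that in plain Python, with comments marking which half of each iteration would be the map and which the reduce. The function name, the starting value of 1.0 per page, and the fixed iteration count are assumptions for illustration; this is not a description of the production pipeline.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(X) = (1-d) + d * sum of PR(Y)/L(Y) over pages Y linking to X.

    links maps each page to the list of distinct pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting guess
    for _ in range(iterations):
        # "Map": each page Y sends PR(Y)/L(Y) to every page it links to.
        received = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                received[target] += pr[source] / len(targets)
        # "Reduce": each page X sums what it received and applies damping.
        pr = {page: (1 - d) + d * received[page] for page in pages}
    return pr
```

For example, `pagerank({"A": ["X"], "B": ["X"], "C": ["X"], "X": ["A"]})` models the three-page picture with one extra link from X back to A. Pages with no outbound links contribute nothing in the map step here, yet still receive the (1-d) term, so they no longer stall the computation.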