What Is A Mapreduce?: Michael Kleber

This document gives an overview of MapReduce and shows how it can be used to solve a range of problems. MapReduce enables distributed processing of large datasets across clusters of computers by breaking the work into independent chunks that are processed in parallel. The examples include word counting, sorting, analyzing the link structure of the web, and computing PageRank to find the most important pages. The deck describes the PageRank algorithm and explains how it can be computed iteratively as a sequence of MapReduce jobs.


What is a MapReduce?

Google, Inc.
Jan. 15, 2008

Michael Kleber
with many slides shamelessly stolen from Jeff Dean and Yonatan Zunger

Except as otherwise noted, this presentation is released under the Creative Commons Attribution 2.5 License.

Word Count and friends


Input: (URL, contents)

Map output → Reduce output
(the, 1) → (the, 1048576)
(the, www.bar.org/index.html) → (the, list of URLs)
(hostname, doc term frequencies) → (hostname, site term frequencies)
(purple, cow) → (purple, probability distribution of next word in English)

Mappers can take command-line arguments, e.g. a regular expression.

(line number, matching line) → (line number, matching line)

With 1800 machines, MR_Grep scanned 1 terabyte in 100 seconds.
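The first two rows of the table above (word count and the word-to-URL list) can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Google's MapReduce API: `run_mapreduce`, the mapper and reducer names, the tokenizing regular expression, and the sample pages are all assumptions made for this example.

```python
import re
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Tiny in-memory stand-in for a MapReduce run (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reducers see keys in sorted order, mimicking the shuffle-and-sort stage.
    return [pair for k in sorted(groups) for pair in reducer(k, groups[k])]

def tokenize(contents):
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", contents.lower())

def word_count_map(url, contents):
    for word in tokenize(contents):
        yield word, 1                      # e.g. (the, 1)

def word_count_reduce(word, counts):
    yield word, sum(counts)                # e.g. (the, total count)

def inverted_index_map(url, contents):
    for word in set(tokenize(contents)):
        yield word, url                    # e.g. (the, www.bar.org/index.html)

def inverted_index_reduce(word, urls):
    yield word, sorted(urls)               # e.g. (the, list of URLs)

# Made-up sample input of (URL, contents) pairs.
pages = [
    ("www.foo.org/a.html", "the purple cow"),
    ("www.bar.org/index.html", "the cow jumped over the moon"),
]
counts = dict(run_mapreduce(pages, word_count_map, word_count_reduce))
index = dict(run_mapreduce(pages, inverted_index_map, inverted_index_reduce))
```

Only the map and reduce functions differ between the two jobs; the grouping by key in the middle is shared, which is the part a real MapReduce framework distributes across machines.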



Query Frequency Over Time


[Figure: six plots of query frequency over time, for queries containing "eclipse", "world series", "full moon", "summer olympics", "Opteron", and "watermelon".]


Sorting
In addition to map and reduce functions, you may specify:
Partition: (k', number of reducers) → choice of reducer for k'
Default partition is (hash(k') mod #reducers), for load balancing, but:
Output file for the reduction of k' is determined by its partition
Guarantee: each reducer sees the keys in its partition in sorted order
(implemented by the invisible shuffle-and-sort stage)
Map: produces (sort key, record)
Partition: send consecutive blocks of sort keys to the same reducer,
e.g. by using the most significant bits of the sort keys
Reduce: identity function
MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes
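The recipe above (partition on the most significant bits, identity reduce) can be sketched in memory as follows. This is an illustration only, assuming 16-bit integer sort keys and a power-of-two number of reducers; `KEY_BITS`, `NUM_REDUCERS`, `range_partition`, and `mr_sort` are hypothetical names, not part of MR_Sort.

```python
from collections import defaultdict

KEY_BITS = 16        # assumed width of the integer sort keys
NUM_REDUCERS = 4     # assumed to be a power of two

def hash_partition(key, num_reducers):
    # Default partition: balances load but scatters neighbouring keys.
    return hash(key) % num_reducers

def range_partition(key, num_reducers):
    # Most significant bits pick the reducer, so reducer r receives a
    # consecutive block of the key space.
    shift = KEY_BITS - (num_reducers.bit_length() - 1)
    return key >> shift

def mr_sort(records, sort_key, partition=range_partition):
    shards = defaultdict(list)
    for record in records:
        k = sort_key(record)                       # Map: emit (sort key, record)
        shards[partition(k, NUM_REDUCERS)].append((k, record))
    output_files = []
    for r in range(NUM_REDUCERS):
        # Shuffle-and-sort: each reducer sees its keys in sorted order.
        in_order = sorted(shards[r], key=lambda pair: pair[0])
        # Reduce is the identity, so the records are written out unchanged.
        output_files.append([record for _, record in in_order])
    return output_files

records = [(40000, "d"), (7, "a"), (65535, "e"), (20000, "b"), (33000, "c")]
files = mr_sort(records, sort_key=lambda record: record[0])
merged = [record for output_file in files for record in output_file]
```

Because every key in output file r is smaller than every key in file r+1, concatenating the files in order gives the globally sorted result. With `hash_partition` each file would still be sorted internally, but the concatenation would not be.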

Structure of the Web


Input is (URL, contents) again.
Scan through the document's contents looking for links to other URLs.
Map outputs (URL, linked-to URL) → you get a simple representation of the WWW link graph
Map outputs (linked-to URL, URL) → you get the reverse link graph: "what web pages link to me?"
Map outputs (linked-to URL, anchor text) → you get "how do other web pages characterize me?"

Google uses this anchor text propagation in indexing.
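The three map outputs on this slide differ only in which pair the mapper emits for each link. A minimal sketch using Python's standard `html.parser` is shown below; the `LinkParser` class, the `group` helper standing in for the shuffle, and the example pages are assumptions for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects (href, anchor text) for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(contents):
    parser = LinkParser()
    parser.feed(contents)
    parser.close()
    return parser.links

def map_forward(url, contents):      # (URL, linked-to URL)
    for target, _ in extract_links(contents):
        yield url, target

def map_reverse(url, contents):      # (linked-to URL, URL)
    for target, _ in extract_links(contents):
        yield target, url

def map_anchor(url, contents):       # (linked-to URL, anchor text)
    for target, text in extract_links(contents):
        yield target, text

def group(pairs):
    """Stand-in for the shuffle: collect all values emitted per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def run(mapper, pages):
    return group(pair for url, contents in pages for pair in mapper(url, contents))

# Made-up pages on placeholder hosts.
pages = [
    ("a.example/", '<a href="c.example/">great cows</a> <a href="b.example/">b</a>'),
    ("b.example/", '<p>see <a href="c.example/">purple cow facts</a></p>'),
]
forward = run(map_forward, pages)
reverse = run(map_reverse, pages)
anchors = run(map_anchor, pages)
```

Here `anchors["c.example/"]` gathers what other pages say about `c.example/`, which is the information the slide describes as anchor text propagation.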



Example: A Simple Indexing Pipeline


Input: (url, document) pairs

Output: Inverted index including anchor text



Example: A Simple Indexing Pipeline


Duplicate Elimination
Want: a sorted table of (original URL, canonical URL),
giving for each URL its canonical URL



Example: A Simple Indexing Pipeline


Duplicate Elimination
Map: (url, contents) → <(contentfp, url + aux info)>
Reduce: (contentfp, <url + info>) → <(contentfp, url + best url)>
Reorganize that:
Map: (contentfp, url + canonical url) → (url, canonical url) (or nothing if equal)
Reduce: Identity
Partition: hash(url) % number of output shards
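The two-stage duplicate elimination above can be sketched in memory as follows. Several details are assumptions made for this example and are not given on the slide: the content fingerprint is an exact SHA-1 hash, there is no auxiliary info, and the "best" URL is simply the shortest one. The helper names (`shuffle`, `fp_map`, `fp_reduce`, `reorg_map`, `canonical_table`) are hypothetical.

```python
import hashlib
from collections import defaultdict

def shuffle(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Stage 1: group URLs by content fingerprint and pick a canonical URL.
def fp_map(url, contents):
    contentfp = hashlib.sha1(contents.encode("utf-8")).hexdigest()
    yield contentfp, url

def fp_reduce(contentfp, urls):
    # Assumed rule for "best url": the shortest URL, ties broken alphabetically.
    canonical = min(urls, key=lambda u: (len(u), u))
    for url in urls:
        yield contentfp, (url, canonical)

# Stage 2: re-key by URL; emit nothing when a URL is already canonical.
def reorg_map(contentfp, url_and_canonical):
    url, canonical = url_and_canonical
    if url != canonical:
        yield url, canonical

def canonical_table(pages):
    stage1_in = [pair for url, contents in pages for pair in fp_map(url, contents)]
    stage1_out = [pair for fp, urls in shuffle(stage1_in) for pair in fp_reduce(fp, urls)]
    stage2_in = [pair for fp, value in stage1_out for pair in reorg_map(fp, value)]
    # Identity reduce: the sorted shuffle output is already the table we want.
    return [(url, canonical)
            for url, canonicals in shuffle(stage2_in)
            for canonical in canonicals]

# Made-up pages on a placeholder host.
pages = [
    ("example.com/index.html", "hello"),
    ("example.com/", "hello"),
    ("example.com/about", "about us"),
]
table = canonical_table(pages)
```

The second stage exists only to change the key from fingerprint to URL, so that the final table is sorted and sharded by the URL that later stages will look up.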


Example: A Simple Indexing Pipeline


Link Extraction
Map: (url, contents) →
(a) Parse contents, forward each link using the (url, canonical url) map;
output (target url, anchor info)
(b) Try to forward the doc URL using the map; if it
doesn't forward, output (url, contents)
Reduce: (forwarded url, <content or anchor>) →
(url, contents + anchors)


Example: A Simple Indexing Pipeline

Indexing
Map: (url, merged contents) → <(term, occurrence info)>
Partition: hash(url) % number of output shards
Reduce: (term, occurrences in partition of docs) → (term, compressed posting list)


Example: A Simple Indexing Pipeline

Final product: a map from each term to its compressed posting list.
Each partition file contains docs with fixed (hash(url) % num files).

Real indexing pipelines are a bit messier


PageRank
The indexing pipeline can answer the question "What are all the pages that match this query?"

But most people don't want to look at all million hits. We need to answer "What are the best pages that match this query?"

PageRank: use the link structure of the web to find the best pages.


PageRank
Suppose you browsed the web randomly, i.e. by clicking on random links.
What fraction of your time would you spend on each page?
T(X) = time spent on page X
L(X) = # links from page X to other pages
Then the Ts and Ls would satisfy equations like

T(X) = T(A)/L(A) + T(B)/L(B) + T(C)/L(C)


if A, B, C are all the pages linking to X


PageRank
T(X) = T(A)/L(A) + T(B)/L(B) + T(C)/L(C)

Problems:
Pages with no outbound links = black holes
How to calculate? Is there even a unique solution?
Solution to both: damping factor (d = .85)

PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )


Random browsing, except with probability 1-d you jump to a random page.
Brin & Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)


PageRank

[Figure: PageRank illustration. Public domain image by Wikipedia user 345Kai, appears on http://en.wikipedia.org/wiki/PageRank]

PageRank
PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )

We can calculate PageRank iteratively.


PR_i(X) = probability of being on page X after i random steps
Initial state:
PR_0(X) = 1 for all pages X.
Update:
$$PR_{i+1}(X) = (1-d) + d \sum_{Y \to X} \frac{PR_i(Y)}{L(Y)}$$
where the sum runs over all pages Y that link to X.

PR_i(X) converges to PR(X) as i → ∞. Keep iterating.
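The iteration can be written directly as a short Python loop. This is a single-machine sketch under stated assumptions: the whole link graph fits in a dictionary, every page has at least one outbound link, every link target is itself a key of the dictionary, and iteration stops when no value changes by more than `tol`. The function name, the stopping rule, and the three-page graph are illustrative choices, not taken from the slides.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iters=1000):
    """links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in links}                 # PR_0(X) = 1 for all X
    for _ in range(max_iters):
        new_pr = {page: 1.0 - d for page in links}     # the (1-d) term
        for y, targets in links.items():
            for x in targets:
                # Page Y links to X, so it contributes d * PR_i(Y) / L(Y).
                new_pr[x] += d * pr[y] / len(targets)
        if max(abs(new_pr[p] - pr[p]) for p in links) < tol:
            return new_pr
        pr = new_pr
    return pr

# Made-up graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this small graph C collects rank from both A and B, so it ends up with the highest value, while B, which only receives half of A's rank, ends up lowest.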



PageRank: MapReduce Implementation


PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )
Perform one MR to produce the link graph and set up the initial state:
(URL, contents) → (URL, [1.0, list of linked-to URLs])
Each MapReduce performs one step of the iteration.
The mapper and reducer keys are both the set of all URLs.
Map: (Y, [PR(Y), Z1...Zn]) → (Zi, PR(Y)/n) for i=1,...,n
and (Y, [Z1...Zn]) to remember the link graph
Reduce: (Y, <S0, S1,...,Sm, [Z1...Zn]>) → (Y, [PR(Y), Z1...Zn]),
where the new PR(Y) = (1-d) + d*(S0 + S1 + ... + Sm)
An outside controller runs each MR iteration and decides when to stop.
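One iteration expressed as a map and a reduce, plus a small controller loop, might look like the sketch below. It runs in memory and is not the production implementation: `pr_map`, `pr_reduce`, `run_step`, and `controller` are hypothetical names, rank shares are told apart from the link-graph record by their Python type, and the stopping rule (largest change below `tol`) is an assumption.

```python
from collections import defaultdict

D = 0.85  # damping factor d

def pr_map(y, value):
    pr_y, targets = value
    for z in targets:
        yield z, pr_y / len(targets)      # (Zi, PR(Y)/n)
    yield y, list(targets)                # (Y, [Z1...Zn]): remember the link graph

def pr_reduce(y, values):
    targets, total = [], 0.0
    for v in values:
        if isinstance(v, list):           # the link-graph record
            targets = v
        else:                             # a rank share S_j from a page linking to Y
            total += v
    yield y, (1 - D + D * total, targets)

def run_step(state):
    """One MapReduce pass: map every page, group by key, reduce each group."""
    groups = defaultdict(list)
    for y, value in state.items():
        for key, v in pr_map(y, value):
            groups[key].append(v)
    return {k: v for key in sorted(groups) for k, v in pr_reduce(key, groups[key])}

def controller(links, tol=1e-9, max_steps=1000):
    """Outside controller: runs iterations and decides when to stop."""
    state = {url: (1.0, targets) for url, targets in links.items()}
    for _ in range(max_steps):
        new_state = run_step(state)
        delta = max(abs(new_state[u][0] - state[u][0]) for u in state)
        state = new_state
        if delta < tol:
            break
    return {u: pr for u, (pr, _) in state.items()}

# Made-up graph: A links to B and C, B links to C, C links to A.
ranks = controller({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Passing the link list through the mapper unchanged is what lets each iteration's output serve directly as the next iteration's input.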