
This document provides an overview of MapReduce and describes how it can be used to solve various problems. It explains that MapReduce allows distributed processing of large datasets across clusters of computers by breaking the work into independent chunks processed in parallel. Examples shown include word counting, sorting data, analyzing the structure of the web through links, and calculating PageRank to find the most important pages. The PageRank algorithm is described and it is explained how it can be iteratively computed through MapReduce jobs.


What is a MapReduce?

Google, Inc.
Jan. 15, 2008

Michael Kleber
with many slides shamelessly stolen from Jeff Dean and Yonatan Zunger

Except as otherwise noted, this presentation is released
under the Creative Commons Attribution 2.5 License.

Word Count and friends

Input: (URL, contents)

Map output                          Reduce output
(the, 1)                            (the, 1048576)
(the, www.bar.org/index.html)       (the, list of URLs)
(hostname, doc term frequencies)    (hostname, site term frequencies)
(purple, cow)                       (purple, probability distribution of next word in English)

Mappers can take command-line arguments, e.g. a regular expression:
(line number, matching line)

With 1800 machines, MR_Grep scanned 1 terabyte in 100 seconds.
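
The first two rows of the table can be sketched in a few lines of Python. This is only a single-process toy, not the Google MapReduce library: `run_mapreduce`, the mapper and reducer names, and the two sample pages are assumptions made up for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group values by key, reduce in key order."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

def count_map(url, contents):
    # Emit (word, 1) for every word occurrence.
    for word in contents.lower().split():
        yield word, 1

def count_reduce(word, counts):
    yield word, sum(counts)

def urls_map(url, contents):
    # Emit (word, url) once per distinct word in the document.
    for word in set(contents.lower().split()):
        yield word, url

def urls_reduce(word, urls):
    yield word, sorted(urls)

# Hypothetical sample input: (URL, contents) pairs.
pages = [
    ("www.foo.org/a.html", "the purple cow"),
    ("www.bar.org/index.html", "the cow jumped over the moon"),
]
word_counts = dict(run_mapreduce(pages, count_map, count_reduce))
word_urls = dict(run_mapreduce(pages, urls_map, urls_reduce))
```

Only the mapper and reducer change between the two jobs; the grouping by key stays the same.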


Query Frequency Over Time

[Figure: query frequency over time for queries containing "eclipse", "world series", "full moon", "summer olympics", "Opteron", and "watermelon"]


Sorting
In addition to map and reduce functions, you may specify
Partition: (k', number of reducers) → choice of reducer for k'
Default partition is (hash(k') mod #reducers), for load balancing, but:
Output file for k' reduction determined by its partition
Guarantee: each reducer sees the keys in its partition in sorted order
(implemented by the invisible shuffle-and-sort stage)
Map: produces (sort key, record)
Partition: send consecutive blocks of sort keys to same reducer,
e.g. by using most significant bits of sort keys
Reduce: identity function
MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes
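
A minimal sketch of the sorting recipe above, assuming small integer sort keys of `key_bits` bits. The function names and sample records are hypothetical, and Python's `sorted` stands in for the shuffle-and-sort stage.

```python
def range_partition(key, num_reducers, key_bits=8):
    """Send consecutive blocks of keys to the same reducer.

    For a power-of-two num_reducers this just reads the most
    significant bits of the key.
    """
    return (key * num_reducers) >> key_bits

def mr_sort(records, num_reducers=4, key_bits=8):
    # Map: produce (sort key, record); the input here is already in that form.
    partitions = [[] for _ in range(num_reducers)]
    for key, record in records:
        partitions[range_partition(key, num_reducers, key_bits)].append((key, record))
    output = []
    for part in partitions:
        # Shuffle-and-sort: each reducer sees its keys in sorted order.
        # Reduce: identity, so the sorted pairs are written out unchanged.
        output.extend(sorted(part, key=lambda pair: pair[0]))
    return output

# Hypothetical sample records with 8-bit sort keys.
records = [(200, "d"), (3, "a"), (130, "c"), (64, "b"), (255, "e")]
sorted_records = mr_sort(records)
```

Because the partition function is monotonic in the key, concatenating the reducers' output files in partition order gives a globally sorted result. The default hash partition would balance load but lose that property.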

Structure of the Web

Input is (URL, contents) again
Scan through the document's contents looking for links to other URLs
Map outputs (URL, linked-to URL)
  → you get a simple representation of the WWW link graph
Map outputs (linked-to URL, URL)
  → you get the reverse link graph: "what web pages link to me?"
Map outputs (linked-to URL, anchor text)
  → you get "how do other web pages characterize me?"

Google uses this anchor text propagation in indexing.
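
The three map outputs differ only in which field is used as the key. A rough sketch, assuming links appear as simple `<a href="...">text</a>` tags; the regular expression, function names, and sample pages are illustrative assumptions, and a real crawler would use a proper HTML parser.

```python
import re
from collections import defaultdict

# Simplified link pattern; real HTML needs a real parser.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def map_links(url, contents):
    """Scan a document and yield (URL, linked-to URL, anchor text) triples."""
    for target, anchor_text in LINK_RE.findall(contents):
        yield url, target, anchor_text

def build_graphs(pages):
    forward = defaultdict(list)  # URL -> linked-to URLs (link graph)
    reverse = defaultdict(list)  # linked-to URL -> URLs (reverse link graph)
    anchors = defaultdict(list)  # linked-to URL -> anchor texts
    for url, contents in pages:
        for source, target, anchor_text in map_links(url, contents):
            forward[source].append(target)
            reverse[target].append(source)
            anchors[target].append(anchor_text)
    return forward, reverse, anchors

# Hypothetical sample pages.
pages = [
    ("a.example/", '<a href="b.example/">B site</a> and <a href="c.example/">see C</a>'),
    ("b.example/", '<a href="c.example/">the C page</a>'),
]
forward, reverse, anchors = build_graphs(pages)
```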


Example: A Simple Indexing Pipeline

Input: (url, document) pairs

Output: Inverted index including anchor text


Example: A Simple Indexing Pipeline

Duplicate Elimination
Want: a sorted table of (original URL, canonical URL),
listing for each URL its canonical URL


Example: A Simple Indexing Pipeline

Duplicate Elimination
Map: (url, contents) → <(contentfp, url + aux info)>
Reduce: (contentfp, <url + info>) → <(contentfp, url + best url)>
Reorganize that:
Map: (contentfp, url + canonical url) → (url, canonical url) (or nothing if equal)
Reduce: Identity
Partition: hash(url) % number of output shards
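
The two duplicate-elimination passes can be sketched as follows. Several choices here are assumptions, not taken from the slides: SHA-1 as the content fingerprint, "shortest URL wins" as the rule for the best URL, and the sample pages. The final partitioning into output shards is left out.

```python
import hashlib
from collections import defaultdict

def content_fp(contents):
    # Assumed fingerprint: SHA-1 of the page contents.
    return hashlib.sha1(contents.encode("utf-8")).hexdigest()

def canonical_url_table(pages):
    # Pass 1, map: (url, contents) -> (contentfp, url).
    urls_by_fp = defaultdict(list)
    for url, contents in pages:
        urls_by_fp[content_fp(contents)].append(url)

    table = []
    for fingerprint, urls in urls_by_fp.items():
        # Pass 1, reduce: pick the best URL for this fingerprint.
        # Assumed rule: shortest URL, ties broken alphabetically.
        best = min(urls, key=lambda u: (len(u), u))
        # Pass 2, map: emit (url, canonical url), or nothing if equal.
        table.extend((url, best) for url in urls if url != best)
    # Pass 2, reduce: identity; the framework delivers keys in sorted order.
    return sorted(table)

# Hypothetical sample pages: two URLs serve identical contents.
pages = [
    ("foo.org/", "hello"),
    ("foo.org/index.html", "hello"),
    ("bar.org/", "other"),
]
canonical = canonical_url_table(pages)
```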


Example: A Simple Indexing Pipeline

Link Extraction
Map: (url, contents) →
  (a) Parse contents, forward each link using the (url, canonical url) map;
      output (target url, anchor info)
  (b) Try to forward doc URL using the map; if it
      doesn't forward, output (url, contents)
Reduce: (forwarded url, <content or anchor>) → (url, contents + anchors)
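
One way to picture this step, as a sketch only: each input page is assumed to arrive with its links already parsed into (target url, anchor text) pairs, and `canonical` is a plain dictionary standing in for the table from duplicate elimination. Dropping link targets that have no contents is also an assumption made here.

```python
from collections import defaultdict

def link_extraction(pages, canonical):
    grouped = defaultdict(list)
    for url, contents, links in pages:
        # (a) Forward each link target and output (target url, anchor info).
        for target, anchor_text in links:
            grouped[canonical.get(target, target)].append(("anchor", anchor_text))
        # (b) If the document's own URL does not forward, output its contents.
        if canonical.get(url, url) == url:
            grouped[url].append(("content", contents))

    # Reduce: merge the contents with all anchors pointing at the URL.
    merged = {}
    for url, values in grouped.items():
        contents = [value for kind, value in values if kind == "content"]
        anchors = sorted(value for kind, value in values if kind == "anchor")
        if contents:  # assumption: skip targets whose contents we never saw
            merged[url] = (contents[0], anchors)
    return merged

# Hypothetical sample data.
canonical = {"foo.org/index.html": "foo.org/"}
pages = [
    ("foo.org/", "hello", [("bar.org/", "bar site")]),
    ("foo.org/index.html", "hello", [("bar.org/", "bar again")]),
    ("bar.org/", "other", [("foo.org/index.html", "foo home")]),
]
merged = link_extraction(pages, canonical)
```

The link to `foo.org/index.html` ends up attached to `foo.org/`, so anchor text is credited to the canonical copy of a page.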


Example: A Simple Indexing Pipeline

Indexing
Map: (url, merged contents) → <(term, occurrence info)>
Partition: hash(url) % number of output shards
Reduce: (term, occurrences in partition of docs) → (term, compressed posting list)
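
A small sketch of document-partitioned indexing. The details are assumptions for illustration: `zlib.crc32` stands in for the hash function, occurrence info is just (url, word position), and posting lists are left uncompressed.

```python
import zlib
from collections import defaultdict

def shard_of(url, num_shards):
    # Deterministic stand-in for hash(url) % number of output shards.
    return zlib.crc32(url.encode("utf-8")) % num_shards

def build_index(docs, num_shards=2):
    shards = [defaultdict(list) for _ in range(num_shards)]
    for url, text in docs:
        # Partition by document URL, not by term.
        shard = shards[shard_of(url, num_shards)]
        # Map: emit (term, occurrence info) for every word position.
        for position, term in enumerate(text.lower().split()):
            shard[term].append((url, position))
    # Reduce: one sorted posting list per term within each shard.
    return [
        {term: sorted(postings) for term, postings in shard.items()}
        for shard in shards
    ]

def lookup(index_shards, term):
    """Every shard may hold postings for the term, so query all of them."""
    hits = []
    for shard in index_shards:
        hits.extend(shard.get(term, []))
    return sorted(hits)

# Hypothetical sample documents.
docs = [("foo.org/", "purple cow"), ("bar.org/", "the purple moon")]
index_shards = build_index(docs)
purple_hits = lookup(index_shards, "purple")
```

Partitioning by URL means each output file indexes a fixed subset of the documents, at the cost of consulting every shard for each query term.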


Example: A Simple Indexing Pipeline

Final product: a map from each term to its compressed posting list
Each partition file contains docs with fixed (hash(url) % num files)

Real indexing pipelines are a bit messier


PageRank
The indexing pipeline can answer the question
  "What are all the pages that match this query?"
But most people don't want to look at all million hits. We need to answer
  "What are the best pages that match this query?"
PageRank: use the link structure of the web to find the best pages


PageRank
Suppose you browsed the web randomly, i.e. by clicking on random links.
What fraction of your time would you spend on each page?
T(X) = time spent on page X
L(X) = # links from page X to other pages
Then the Ts and Ls would satisfy equations like

T(X) = T(A)/L(A) + T(B)/L(B) + T(C)/L(C)

if A, B, C are all the pages linking to X

PageRank
T(X) = T(A)/L(A) + T(B)/L(B) + T(C)/L(C)

Problems:
Pages with no outbound links = black holes
How to calculate? Is there even a unique solution?
Solution to both: damping factor (d = .85)

PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )

Random browsing, except with probability 1-d you jump to a random page.
Brin & Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine (1998)


PageRank

[Figure: public domain image by Wikipedia user 345Kai, appears on http://en.wikipedia.org/wiki/PageRank]

PageRank
PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )

We can calculate PageRank iteratively.

PR_i(X) = probability of being on page X after i random steps
Initial state:
PR_0(X) = 1 for all pages X.
Update:
\[ PR_{i+1}(X) = (1-d) + d \sum_{Y \to X} \frac{PR_i(Y)}{L(Y)} \]
where the sum runs over all pages Y that link to X.

PR_i(X) converges to PR(X) as i → ∞. Keep iterating.
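
The iteration is short enough to sketch directly, assuming the link graph fits in memory as a dictionary. The function name, the fixed iteration count, and the three-page graph are illustrative choices, not part of the talk.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR_{i+1}(X) = (1-d) + d * sum over Y linking to X of PR_i(Y)/L(Y)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # initial state: PR_0(X) = 1
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no outbound links passes on nothing
            share = pr[source] / len(targets)  # PR_i(Y) / L(Y)
            for target in targets:
                new_pr[target] += d * share
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this small graph C collects links from both other pages and ends up with the highest rank, while B, linked only from A, ends up lowest.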


PageRank: MapReduce Implementation

PR(X) = (1-d) + d*(PR(A)/L(A) + PR(B)/L(B) + PR(C)/L(C) )
Perform one MR to produce the link graph and set up initial state
(URL, contents) → (URL, [1.0, list of linked-to URLs])
Each MapReduce performs one step of the iteration
The mapper and reducer keys are both the set of all URLs
Map: (Y, [PR(Y), Z1...Zn]) → (Zi, PR(Y)/n) for i=1,...,n
                              (Y, [Z1...Zn]) to remember the link graph
Reduce: (Y, <S0, S1,...,Sm, [Z1...Zn]>) → (Y, [PR(Y), Z1...Zn])
        where the new PR(Y) = (1-d) + d*(S0 + S1 + ... + Sm)
An outside controller runs each MR iteration and decides when to stop.
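
A single-process sketch of the scheme above. The mapper and reducer follow the slide, but everything around them is assumed: the in-memory grouping that stands in for the shuffle, the function names, and in particular the stopping rule (largest rank change below a tolerance), since the slide leaves that decision to the controller.

```python
from collections import defaultdict

D = 0.85  # damping factor

def pr_map(url, state):
    rank, targets = state
    for target in targets:
        yield target, rank / len(targets)  # (Zi, PR(Y)/n)
    yield url, list(targets)               # remember the link graph

def pr_reduce(url, values):
    targets, total = [], 0.0
    for value in values:
        if isinstance(value, list):
            targets = value   # the link list passed through by the mapper
        else:
            total += value    # one of the contributions S0..Sm
    yield url, ((1 - D) + D * total, targets)

def pr_iteration(state):
    """One MapReduce job: map every page, group by URL, reduce."""
    groups = defaultdict(list)
    for url, value in state.items():
        for key, emitted in pr_map(url, value):
            groups[key].append(emitted)
    new_state = {}
    for url, values in groups.items():
        for key, out in pr_reduce(url, values):
            new_state[key] = out
    return new_state

def run_pagerank(links, tolerance=1e-6, max_iterations=100):
    """Outside controller; the stopping rule here is an assumption."""
    state = {url: (1.0, targets) for url, targets in links.items()}
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        new_state = pr_iteration(state)
        delta = max(abs(new_state[u][0] - state[u][0]) for u in state)
        state = new_state
        if delta < tolerance:
            break
    return state, iterations

# Hypothetical three-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
final_state, iterations_used = run_pagerank(links)
```

The reducer output has the same shape as the mapper input, which is what lets the controller feed one job's output straight into the next.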