The Mapreduce Paradigm: Michael Kleber

The document discusses the MapReduce paradigm for distributed computing. It explains that MapReduce hides the complexities of parallelization, load balancing, and fault tolerance by implementing these functions in a runtime library. Programmers simply specify map and reduce functions, while the library automatically parallelizes tasks across many machines and re-executes any failed tasks. A common example problem is calculating word frequencies by mapping each word to a key-value pair and reducing by summing the counts for each unique word.

The MapReduce Paradigm

Michael Kleber
with most slides shamelessly stolen from Jeff Dean and Yonatan Zunger

Google, Inc.
Jan. 14, 2008

Except as otherwise noted, this presentation is released under the Creative Commons Attribution 2.5 License.
Do We Need It?

If distributed computing is so hard, do we really need to do it?

Yes: Otherwise some problems are too big.

Example: 20+ billion web pages x 20KB = 400+ terabytes

• One computer can read 30-35 MB/sec from disk: ~four months to read the web
• ~1,000 hard drives just to store the web
• Even more to do something with the data
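
The figures above can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal units, a sustained read rate of 35 MB/sec, and perfect division of work across machines. The 400 GB drive size is an assumption picked to match the "~1,000 hard drives" figure; it is not stated in the slides.

```python
# Back-of-the-envelope check of the "size of the web" numbers.
# Assumptions: decimal units, sustained sequential reads, no overhead.
PAGES = 20e9          # 20 billion web pages
PAGE_BYTES = 20e3     # 20 KB per page
READ_RATE = 35e6      # 35 MB/sec from one disk (upper end of 30-35)
DRIVE_BYTES = 400e9   # assumed 400 GB per drive (not from the slides)

total_bytes = PAGES * PAGE_BYTES
seconds_one_machine = total_bytes / READ_RATE
days_one_machine = seconds_one_machine / 86_400
hours_1000_machines = seconds_one_machine / 1000 / 3600
drives_needed = total_bytes / DRIVE_BYTES

print(f"total data:    {total_bytes / 1e12:.0f} TB")
print(f"one machine:   {days_one_machine:.0f} days")
print(f"1000 machines: {hours_1000_machines:.1f} hours")
print(f"drives:        {drives_needed:.0f}")
```

Under these assumptions a single machine needs a little over four months just to read the data, while 1,000 machines reading in parallel need a few hours.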

Yes, We Do

Good news: same problem with 1000 machines, < 3 hours

Bad news: programming work
• communication and coordination
• recovering from machine failure (all the time!)
• status reporting
• debugging
• optimization
• locality
Bad news II: repeat for every problem you want to solve

How can we make this easier?

MapReduce

A simple programming model that applies to many large-scale computing problems

Hide messy details in MapReduce runtime library:

• automatic parallelization
• load balancing
• network and disk transfer optimization
• handling of machine failures
• robustness
• improvements to core library benefit all users of library!

Typical problem solved by MapReduce

Read a lot of data

Map: extract something you care about from each record
Shuffle and Sort
Reduce: aggregate, summarize, filter, or transform
Write the results

Outline stays the same; Map and Reduce change to fit the problem
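
The fixed outline can be sketched as a single-process driver that takes the Map and Reduce steps as parameters. This is only an illustration of the control flow, not the real library: `run_mapreduce`, `parity_map`, and `sum_reduce` are hypothetical names, and everything runs in memory on one machine.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for read -> map -> shuffle/sort -> reduce -> write."""
    intermediate = defaultdict(list)
    # Read + Map: each input record may emit any number of intermediate pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: group intermediate values by their key.
            intermediate[out_key].append(out_value)
    results = []
    # Sort: visit intermediate keys in sorted order.
    for out_key in sorted(intermediate):
        # Reduce: aggregate all values that share a key.
        for result in reduce_fn(out_key, intermediate[out_key]):
            results.append((out_key, result))
    # Write: here we simply return the output pairs.
    return results

# Toy problem (hypothetical): sum a list of numbers by parity.
def parity_map(index, number):
    yield ("even" if number % 2 == 0 else "odd", number)

def sum_reduce(key, values):
    yield sum(values)

records = list(enumerate([3, 4, 5, 6, 10]))
totals = run_mapreduce(records, parity_map, sum_reduce)
print(totals)
```

Swapping in a different `map_fn` and `reduce_fn` changes the problem being solved while the driver stays untouched.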

MapReduce Paradigm

Basic data type: the key-value pair (k,v).

For example, key = URL, value = HTML of the web page.

Programmer specifies two primary methods:

• Map: (k, v) ↦ <(k'_1, v'_1), (k'_2, v'_2), …, (k'_n, v'_n)>

• Reduce: (k', <v'_1, v'_2, …, v'_m>) ↦ <(k', v''_1), (k', v''_2), …, (k', v''_p)>

All v' with same k' are reduced together.

(Remember the invisible “Shuffle and Sort” step.)
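
As a sketch, the two signatures can be written down as Python type aliases. The example pair below is hypothetical and is not from the slides: it takes key = URL and value = HTML, as in the example above, and totals page sizes per host. The names `MapFn`, `ReduceFn`, `chars_per_host_map`, and `chars_per_host_reduce` are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple
from urllib.parse import urlparse

# Map: one input pair (k, v) -> a list of intermediate pairs.
MapFn = Callable[[str, str], Iterable[Tuple[str, int]]]
# Reduce: one intermediate key plus all of its values -> a list of output pairs.
ReduceFn = Callable[[str, List[int]], Iterable[Tuple[str, int]]]

def chars_per_host_map(url: str, html: str) -> Iterable[Tuple[str, int]]:
    # Emit the host name as the intermediate key and the page length as the value.
    return [(urlparse(url).netloc, len(html))]

def chars_per_host_reduce(host: str, sizes: List[int]) -> Iterable[Tuple[str, int]]:
    # All sizes for the same host arrive together; emit a single total.
    return [(host, sum(sizes))]

map_fn: MapFn = chars_per_host_map
reduce_fn: ReduceFn = chars_per_host_reduce

pages = [
    ("http://a.example/1", "<p>hi</p>"),
    ("http://a.example/2", "<p>yo</p>"),
    ("http://b.example/", "<b>x</b>"),
]

# Stand-in for the shuffle/sort step: group intermediate values by key.
grouped: Dict[str, List[int]] = {}
for url, html in pages:
    for host, size in map_fn(url, html):
        grouped.setdefault(host, []).append(size)

output = [pair for host in sorted(grouped) for pair in reduce_fn(host, grouped[host])]
print(output)
```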

Example: Word Frequencies in Web Pages

A typical exercise for a new Google engineer in his or her first week
Input: files with one document per record
Specify a map function that takes a key/value pair
• key = document URL
• value = document contents
Output of map function is (potentially many) key/value pairs.
• In our case, output (word, “1”) once per word in the document

“document1”, “to be or not to be”

“to”, “1”
“be”, “1”
“or”, “1”
…
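
A minimal sketch of such a map function follows. It assumes words are separated by whitespace and does no case folding or punctuation handling; `word_count_map` is a hypothetical name.

```python
from typing import Iterator, Tuple

def word_count_map(url: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit (word, "1") once per word occurrence in the document."""
    for word in contents.split():
        yield (word, "1")

pairs = list(word_count_map("document1", "to be or not to be"))
for pair in pairs:
    print(pair)
```

Repeated words are emitted repeatedly; the map function does no counting of its own.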
Example: Word Frequencies in Web Pages

MapReduce library gathers together all pairs with the same key (shuffle/sort)
Specify a reduce function that combines the values for a key
• In our case, compute the sum

key      values        sum
“be”     “1”, “1”      “2”
“not”    “1”           “1”
“or”     “1”           “1”
“to”     “1”, “1”      “2”

Output of reduce (usually 0 or 1 value) paired with key and saved

“be”, “2”
“not”, “1”
“or”, “1”
“to”, “2”
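
The reduce side of the same example can be sketched as follows. The grouping loop stands in for the library's shuffle/sort step, and `word_count_reduce` is a hypothetical name. Counts stay as strings to match the slides.

```python
from collections import defaultdict
from typing import Iterator, List

def word_count_reduce(word: str, counts: List[str]) -> Iterator[str]:
    """Sum the string counts for one word and emit the total as a string."""
    yield str(sum(int(count) for count in counts))

# Intermediate pairs as emitted by the map step for "to be or not to be".
map_output = [("to", "1"), ("be", "1"), ("or", "1"),
              ("not", "1"), ("to", "1"), ("be", "1")]

# Stand-in for shuffle/sort: gather all values that share a key.
grouped = defaultdict(list)
for word, count in map_output:
    grouped[word].append(count)

result = [(word, total)
          for word in sorted(grouped)
          for total in word_count_reduce(word, grouped[word])]
for pair in result:
    print(pair)
```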
Under the hood: Scheduling

One master, many workers

• Input data split into M map tasks (typically 64 MB in size)
• Reduce phase partitioned into R reduce tasks (= # of output files)
• Tasks are assigned to workers dynamically
• Reasonable numbers inside Google: M=200,000; R=4,000; workers=2,000
Master assigns each map task to a free worker
• Considers locality of data to worker when assigning task
• Worker reads task input (often from local disk!)
• Worker produces R local files containing intermediate (k,v) pairs
Master assigns each reduce task to a free worker
• Worker reads intermediate (k,v) pairs from map workers
• Worker sorts & applies user’s Reduce op to produce the output
• User may specify a Partition function: which intermediate keys go to which reducers
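
The partitioning step can be sketched as below. A typical default is a hash of the key modulo R; this sketch uses `zlib.crc32` as a stable hash, which is an assumption and not necessarily what the real library uses. `default_partition`, `first_letter_partition`, and `bucket_map_output` are hypothetical names, and the in-memory buckets stand in for the R local files a map worker produces.

```python
import zlib
from collections import defaultdict

def default_partition(key: str, num_reducers: int) -> int:
    # Use a stable hash: Python's built-in hash() of a str varies between processes.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def first_letter_partition(key: str, num_reducers: int) -> int:
    # Hypothetical user-specified Partition: route non-empty keys by first letter.
    return (ord(key[0].lower()) - ord("a")) % num_reducers

def bucket_map_output(pairs, num_reducers, partition=default_partition):
    """Split one map task's output into R buckets, one per reduce task."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("to", "1"), ("be", "1"), ("or", "1"),
         ("not", "1"), ("to", "1"), ("be", "1")]

by_hash = bucket_map_output(pairs, num_reducers=4)
by_letter = bucket_map_output(pairs, num_reducers=2, partition=first_letter_partition)
print(dict(by_letter))
```

Whatever function is used, it must send every occurrence of a key to the same reducer; otherwise the values for that key would not be reduced together.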

MapReduce

[Diagram: input data feeds several Map tasks; their output passes through Shuffle to several Reduce tasks, which write partitioned output. The Master coordinates the tasks.]

MapReduce: Granularity

Fine granularity tasks: many more map tasks than machines

• Minimizes time for fault recovery
• Can pipeline shuffling with map execution
• Better dynamic load balancing
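
The load-balancing point can be illustrated with a toy simulation. It assumes a made-up workload in which four of sixteen input units are five times slower than the rest, and a scheduler that hands the next task to whichever worker frees up first; `makespan` is a hypothetical helper.

```python
import heapq

def makespan(task_durations, num_workers):
    """Finish time when each task goes to whichever worker frees up first."""
    free_at = [0] * num_workers
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)          # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)

workers = 4
unit_costs = [1] * 12 + [5] * 4   # 16 input units; the last four are slow

# Coarse granularity: one task per machine, four consecutive units each.
coarse_tasks = [sum(unit_costs[i:i + 4]) for i in range(0, len(unit_costs), 4)]
# Fine granularity: one task per unit, many more tasks than machines.
fine_tasks = unit_costs

coarse = makespan(coarse_tasks, workers)
fine = makespan(fine_tasks, workers)
print(coarse, fine)
```

With one task per machine, the worker that draws the slow chunk finishes at time 20 while the others sit idle after time 4. With sixteen small tasks the same work finishes at time 8. Small tasks also mean less work is lost when a worker fails.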

MapReduce: Fault Tolerance via Re-Execution

Worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks
• Re-execute in-progress reduce tasks
• Task completion committed through master

Master failure:
• State is checkpointed to replicated file system
• New master recovers & continues

Very Robust: lost 1600 of 1800 machines once, but finished fine
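
The worker-failure rules can be sketched as bookkeeping on the master. The data model (`Task`, `find_dead_workers`, `handle_worker_failure`) is hypothetical. The sketch assumes completed map output lives only on the failed worker's local disk, which is why it must be redone, and that completed reduce output has already been written somewhere that survives the failure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in_progress", or "completed"
    worker: Optional[str] = None

def find_dead_workers(last_heartbeat: Dict[str, float],
                      now: float, timeout: float) -> List[str]:
    """Workers whose most recent heartbeat is older than the timeout."""
    return sorted(w for w, seen in last_heartbeat.items() if now - seen > timeout)

def handle_worker_failure(tasks: List[Task], dead_worker: str) -> None:
    """Mark the failed worker's lost tasks as idle so they can be rescheduled."""
    for task in tasks:
        if task.worker != dead_worker:
            continue
        lost_map = task.kind == "map" and task.state in ("in_progress", "completed")
        lost_reduce = task.kind == "reduce" and task.state == "in_progress"
        if lost_map or lost_reduce:
            task.state = "idle"
            task.worker = None

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in_progress", "w1"),
    Task("reduce", "in_progress", "w1"),
    Task("reduce", "completed", "w1"),
    Task("map", "completed", "w2"),
]
dead = find_dead_workers({"w1": 0.0, "w2": 95.0}, now=100.0, timeout=10.0)
for worker in dead:
    handle_worker_failure(tasks, worker)
states = [task.state for task in tasks]
print(dead, states)
```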

MapReduce: A Leaky Abstraction

MR insulates you from many concerns, but not all of them.

• Don't overload one reducer
• Don't leak memory, even a little!
• Static and global variables probably don't do what you expect
(They can sometimes be useful, though!)
• Mappers might get rerun -- maybe on different data!
Careful with side-effects: must be atomic, idempotent.
Different reducers might see different versions!
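
One common way to make a file-writing side effect atomic and idempotent is to write to a temporary file and rename it into place. The sketch below is an illustration of that general technique, not the library's own mechanism; `write_atomically` and the file name `part-00000` are hypothetical.

```python
import os
import tempfile

def write_atomically(path: str, data: str) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the target directory so the rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)   # replaces any existing file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as out_dir:
    target = os.path.join(out_dir, "part-00000")
    write_atomically(target, "be\t2\n")
    write_atomically(target, "be\t2\n")   # a re-executed task repeats the write
    with open(target, encoding="utf-8") as handle:
        final = handle.read()
    leftover = os.listdir(out_dir)
print(repr(final), leftover)
```

Because the write depends only on the task's input, running it twice leaves the same file behind, and a task killed midway leaves no partial output at the target path.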
