Distributed Computing Seminar

MapReduce Theory and Implementation

Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet
Summer 2007

Except as otherwise noted, the content of this presentation is © 2007 Google Inc. and licensed under the Creative Commons Attribution 3.0 License.

Outline
 Lisp/ML review
   functional programming
   map, fold
 MapReduce overview

Functional Programming Review

 Functional operations do not modify data structures: they always create new ones
 Original data still exists in unmodified form
 Data flows are implicit in program design
 Order of operations does not matter

Functional Programming Review

fun foo(l: int list) =
  sum(l) + mul(l) + length(l)

 The order in which sum(), mul(), etc. are evaluated does not matter – they do not modify l

Functional Updates Do Not Modify Structures

fun append(x, lst) =
  let val lst' = rev lst in
    rev (x :: lst')
  end

 The append() function above reverses the list, puts x on the front of the reversed list, and reverses the result – which appends x to the end of lst
 But it never modifies lst!

Functions Can Be Used As Arguments

fun DoDouble(f, x) = f (f x)

 It does not matter what f does to its argument; DoDouble() will do it twice.

What is the type of this function?
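
(One answer, as a sketch: the result of f is fed back into f, so its argument and result types must match; thus f : 'a -> 'a and DoDouble : ('a -> 'a) * 'a -> 'a.)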


Map

map f lst: ('a->'b) -> ('a list) -> ('b list)

 Creates a new list by applying f to each element of the input list; returns output in order.

[Diagram: f applied independently to each element of the input list, producing the output list in order.]

Fold

fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b

 Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.

[Diagram: f applied element by element; the initial accumulator x0 feeds the first application, and the result of the last application is returned.]

fold left vs. fold right
 Order of list elements can be significant
 Fold left moves left-to-right across the list
 Fold right moves from right-to-left
SML Implementation:

fun foldl f a [] = a
  | foldl f a (x::xs) = foldl f (f(x, a)) xs

fun foldr f a [] = a
  | foldr f a (x::xs) = f(x, (foldr f a xs))
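
To see the difference, here is a small check (assuming the definitions above). With cons as f, foldl reverses the list while foldr rebuilds it unchanged:

foldl (fn (x, a) => x :: a) [] [1, 2, 3]  (* = [3, 2, 1] *)
foldr (fn (x, a) => x :: a) [] [1, 2, 3]  (* = [1, 2, 3] *)
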
Example

fun foo(l: int list) =
  sum(l) + mul(l) + length(l)

How can we implement this?

Example (Solved)

fun foo(l: int list) =
  sum(l) + mul(l) + length(l)

fun sum(lst) = foldl (fn (x,a)=>x+a) 0 lst
fun mul(lst) = foldl (fn (x,a)=>x*a) 1 lst
fun length(lst) = foldl (fn (x,a)=>1+a) 0 lst
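
(A quick check of the solution: for l = [1, 2, 3] we get sum = 6, mul = 6, and length = 3, so foo(l) = 15.)
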
A More Complicated Fold Problem

 Given a list of numbers, how can we generate a list of partial sums?

e.g.: [1, 4, 8, 3, 7, 9] -> [0, 1, 5, 13, 16, 23, 32]
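
One possible solution (a sketch; partialSums is our own name): thread the running sum and the output list, built in reverse, through a single foldl pass:

fun partialSums lst =
  let
    val (_, out) = foldl (fn (x, (s, out)) => (s + x, (s + x) :: out))
                         (0, [0]) lst
  in
    rev out
  end

(* partialSums [1, 4, 8, 3, 7, 9] = [0, 1, 5, 13, 16, 23, 32] *)
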
A More Complicated Map Problem

 Given a list of words, can we reverse the letters in each word and reverse the whole list, so it all comes out backwards?

["my", "happy", "cat"] -> ["tac", "yppah", "ym"]
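
One possible solution (a sketch; allBackwards is our own name): explode each word into characters, reverse and re-implode it via map, then reverse the whole list:

fun allBackwards words =
  rev (map (implode o rev o explode) words)

(* allBackwards ["my", "happy", "cat"] = ["tac", "yppah", "ym"] *)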


map Implementation

fun map f [] = []
  | map f (x::xs) = (f x) :: (map f xs)

 This implementation moves left-to-right across the list, mapping elements one at a time
 … But does it need to?

Implicit Parallelism In map
 In a purely functional setting, elements of a list
being computed by map cannot see the effects
of the computations on other elements
 If order of application of f to elements in list is
commutative, we can reorder or parallelize
execution
 This is the “secret” that MapReduce exploits
MapReduce

Motivation: Large Scale Data Processing

 Want to process lots of data ( > 1 TB)
 Want to parallelize across hundreds/thousands of CPUs
 … Want to make this easy

MapReduce
 Automatic parallelization & distribution
 Fault-tolerant
 Provides status and monitoring tools
 Clean abstraction for programmers
Programming Model

 Borrows from functional programming
 Users implement interface of two functions:

map (in_key, in_value) ->
  (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) ->
  out_value list

map

 Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line).
 map() produces one or more intermediate values along with an output key from the input.

reduce

 After the map phase is over, all the intermediate values for a given output key are combined together into a list
 reduce() combines those intermediate values into one or more final values for that same output key
 (in practice, usually only one final value per key)

[Diagram: input key*value pairs from data stores 1 through n flow into parallel map tasks, each emitting intermediate (key, values...) pairs. == Barrier ==: aggregates intermediate values by output key. Each output key's intermediate values then go to its own reduce task, which produces the final values for that key.]

Parallelism
 map() functions run in parallel, creating
different intermediate values from different
input data sets
 reduce() functions also run in parallel,
each working on a different output key
 All values are processed independently
 Bottleneck: reduce phase can’t start until
map phase is completely finished.
Example: Count word occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
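
For comparison with the first half of the talk, here is a small sequential sketch of the same computation in SML (our own toy model, not the real library): a map phase, a group-by-key barrier, and a reduce phase, all written as pure functions.

(* Map phase: one (word, 1) pair per word in the document.
   The document name is ignored in this toy version. *)
fun mapPhase (docName, contents) =
  map (fn w => (w, 1)) (String.tokens Char.isSpace contents)

(* Barrier: group intermediate (key, value) pairs by key. *)
fun groupByKey [] = []
  | groupByKey ((k, v) :: rest) =
      let
        val (same, others) = List.partition (fn (k', _) => k' = k) rest
      in
        (k, v :: map #2 same) :: groupByKey others
      end

(* Reduce phase: sum the counts for one key. *)
fun reducePhase (word, counts) = (word, foldl op+ 0 counts)

fun wordCount docs =
  map reducePhase (groupByKey (List.concat (map mapPhase docs)))

(* wordCount [("doc1", "the cat saw the dog")]
   = [("the", 2), ("cat", 1), ("saw", 1), ("dog", 1)] *)
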
Example vs. Actual Source Code

 Example is written in pseudo-code
 Actual implementation is in C++, using a MapReduce library
 Bindings for Python and Java exist via interfaces
 True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)

Locality
 Master program divvies up tasks based on
location of data: tries to have map() tasks
on same machine as physical file data, or
at least same rack
 map() task inputs are divided into 64 MB
blocks: same size as Google File System
chunks
Fault Tolerance
 Master detects worker failures
 Re-executes completed & in-progress map()
tasks
 Re-executes in-progress reduce() tasks
 Master notices particular input key/values
cause crashes in map(), and skips those
values on re-execution.
 Effect: Can work around bugs in third-party
libraries!
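
(Why re-execute completed map tasks too? Their intermediate output lives on the failed worker's local disk, and so is lost with the machine; completed reduce output is written to the global file system, so it survives.)
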
Optimizations

 No reduce can start until map is complete: a single slow disk controller can rate-limit the whole process
 Master redundantly executes “slow-moving” map tasks; uses results of first copy to finish

Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
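
(One answer: map is a pure, deterministic function of its input, so every copy of a task produces identical output; the master simply takes whichever copy finishes first and ignores the rest.)
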
Optimizations

 “Combiner” functions can run on same machine as a mapper
 Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth

Under what conditions is it sound to use a combiner?
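
(One standard answer: when the reduce function is associative and commutative – as with the sum in word count – combining values early on the map side cannot change the final result.)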


MapReduce Conclusions
 MapReduce has proven to be a useful
abstraction
 Greatly simplifies large-scale computations at
Google
 Functional programming paradigm can be
applied to large-scale applications
 Fun to use: focus on problem, let library deal w/
messy details
