MapReduce Design Patterns

Barry Brumitt
[email protected]
Software Engineer
About your speaker
Ph.D. Robotics, Carnegie Mellon, 91 - 97
Path Planning for Multiple Mobile Robots
Researcher, Microsoft Research, 98 - 02
Ubiquitous Computing
Software Eng., Microsoft Games, 03 - 05
AI for Forza Motorsport
Software Engineer, Google, 05 - now
Maps: Pathfinder
Systems: Infrastructure
Indexing Large Datasets
All web pages -> Data Center -> Index Files
Indexing Large Datasets
Geographic Data -> Data Center -> Index Files
...not so useful for user-facing applications
Pointer Following (or) Joining
Input: Feature List
1: <type=Road>, <intersections=(3)>, <geom>,
2: <type=Road>, <intersections=(3)>, <geom>,
3: <type=Intersection>, stop_type, POI?
4: <type=Road>, <intersections=(6)>, <geom>,
5: <type=Road>, <intersections=(3,6)>, <geom>,
6: <type=Intersection>, stop_type, POI?,
7: <type=Road>, <intersections=(6)>, <geom>,
8: <type=Town>, <name>, <geom>,
...
Output: Intersection List
3: <type=Intersection>, stop_type, <roads=(
   1: <type=Road>, <geom>, <name>,
   2: <type=Road>, <geom>, <name>,
   5: <type=Road>, <geom>, <name>, )>,
6: <type=Intersection>, stop_type, <roads=(
   4: <type=Road>, <geom>, <name>,
   5: <type=Road>, <geom>, <name>,
   7: <type=Road>, <geom>, <name>, )>,
...
Inner Join Pattern
Input: Feature list
1: Road
2: Road
3: Intersection
4: Road
5: Road
6: Intersection
7: Road
Map: Apply map() to each; Key = intersection id, Value = feature
(3, 1: Road)
(3, 2: Road)
(3, 3: Intxn)
(6, 4: Road)
(3, 5: Road)
(6, 5: Road)
(6, 6: Intxn)
(6, 7: Road)
Shuffle: Sort by key
Key 3: (3, 1: Road), (3, 2: Road), (3, 3: Intxn), (3, 5: Road)
Key 6: (6, 4: Road), (6, 5: Road), (6, 6: Intxn), (6, 7: Road)
Reduce: Apply reduce() to list of pairs with same key, gather into a feature
Output: Feature list, aggregated
3: Intersection
   1: Road, 2: Road, 5: Road
6: Intersection
   4: Road, 5: Road, 7: Road
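The map, shuffle, and reduce stages of this join can be simulated in memory. The following is only a minimal sketch: the Feature and AssembledIntersection structs and the function names are hypothetical stand-ins and not part of any MapReduce library, and a std::map plays the role of the shuffle.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for one record of the feature list.
struct Feature {
  int id;
  std::string type;                // "Road" or "Intersection"
  std::vector<int> intersections;  // for roads: ids of the intersections touched
};

// Hypothetical aggregated output record, one per intersection id.
struct AssembledIntersection {
  int id = 0;
  bool has_intersection = false;  // true once the Intersection record was seen
  std::vector<int> roads;         // ids of the roads gathered under this key
};

// Map: key = intersection id, value = feature.
std::vector<std::pair<int, Feature>> MapFeature(const Feature& f) {
  std::vector<std::pair<int, Feature>> out;
  if (f.type == "Intersection") {
    out.emplace_back(f.id, f);
  } else if (f.type == "Road") {
    for (int intersection_id : f.intersections) out.emplace_back(intersection_id, f);
  }
  return out;
}

// Reduce: gather every value that shares a key into one aggregated feature.
AssembledIntersection ReduceGroup(int key, const std::vector<Feature>& values) {
  AssembledIntersection result;
  result.id = key;
  for (const Feature& f : values) {
    if (f.type == "Intersection") {
      result.has_intersection = true;
    } else {
      result.roads.push_back(f.id);
    }
  }
  return result;
}

// Driver: std::map keeps keys sorted, standing in for the shuffle phase.
std::map<int, AssembledIntersection> InnerJoin(const std::vector<Feature>& features) {
  std::map<int, std::vector<Feature>> shuffled;
  for (const Feature& f : features) {
    for (const auto& kv : MapFeature(f)) shuffled[kv.first].push_back(kv.second);
  }
  std::map<int, AssembledIntersection> output;
  for (const auto& group : shuffled) {
    output[group.first] = ReduceGroup(group.first, group.second);
  }
  return output;
}

// The seven features from the slide (feature 8, the town, is omitted).
std::vector<Feature> SlideFeatures() {
  return {{1, "Road", {3}},    {2, "Road", {3}}, {3, "Intersection", {}},
          {4, "Road", {6}},    {5, "Road", {3, 6}},
          {6, "Intersection", {}}, {7, "Road", {6}}};
}
```

Calling InnerJoin(SlideFeatures()) groups roads 1, 2, 5 under intersection 3 and roads 4, 5, 7 under intersection 6, matching the aggregated feature list above.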
Inner Join Pattern in SQL
roads            ints
R  I  D          I  D
1  3  a          3  x
2  3  b          6  y
4  6  c
5  6  d
7  6  e

Cross Join (every roads row paired with every ints row)
r.R  r.D  r.I  i.I  i.D
1    a    3    3    x
2    b    3    3    x
4    c    6    3    x
5    d    6    3    x
7    e    6    3    x
1    a    3    6    y
2    b    3    6    y
4    c    6    6    y
5    d    6    6    y
7    e    6    6    y

SELECT roads.R, roads.D, ints.D
FROM roads INNER JOIN ints
ON roads.I = ints.I

Result (cross-join rows where r.I = i.I)
r.R  r.D  i.D
1    a    x
2    b    x
4    c    y
5    d    y
7    e    y
Inner Join Pattern in SQL
roads            ints
R  I  D          I  D
1  3  a          3  x
2  3  b          6  y
4  6  c
5  6  d
7  6  e

SELECT roads.R, roads.D, ints.D
FROM roads, ints
WHERE roads.I = ints.I
(aka an Equi Join)

SELECT roads.R, roads.D, ints.D
FROM roads INNER JOIN ints
ON roads.I = ints.I

Result (same for both queries)
r.R  r.D  i.D
1    a    x
2    b    x
4    c    y
5    d    y
7    e    y
Tables vs. Flat File?
Tables: one table per feature type (Roads, Intersections, Towns)
Flat File: one Features file with the types interleaved (Road, Intersection, Town, Road, Intersection, Town, ...)
message GeoFeature {
  enum Type {
    ROAD = 1;
    INTERSECTION = 2;
    TOWN = 3;
  }
  required Type type = 1;
  optional Road road = 2;
  optional Intersection intersection = 3;
  optional Town town = 4;
}
Protocol Buffer
References vs. Duplication?
References: Common primary key; easy restructuring
Duplication: Avoids additional MR passes;
denormalizes data
an engineering space / time / complexity tradeoff
References
1: <type=Road>, <intersections=(3)>, <geom>,
2: <type=Road>, <intersections=(3)>, <geom>,
3: <type=Intersection>, <roads=(1,2,5)>,
4: <type=Road>, <intersections=(6)>, <geom>,
5: <type=Road>, <intersections=(3,6)>, <geom>,
6: <type=Intersection>, <roads=(4,5,7)>,
7: <type=Road>, <intersections=(6)>, <geom>,
8: <type=Town>, <name>, <geom>,
.
.
.
Duplication
3: <type=Intersection>, <roads=(
1: <type=Road>, <geom>, <name>,
2: <type=Road>, <geom>, <name>,
5: <type=Road>, <geom>, <name>, )>,
6: <type=Intersection>, <roads=(
4: <type=Road>, <geom>, <name>,
5: <type=Road>, <geom>, <name>,
7: <type=Road>, <geom>, <name>, )>,
.
.
.
Code Example
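// Mapper: keys every feature by the intersection it belongs to.
class IntersectionAssemblerMapper : public Mapper {
 public:
  virtual void Map(MapInput* input) {
    GeoFeature feature;
    feature.FromMapInput(input);
    if (feature.type() == INTERSECTION) {
      // An intersection is keyed by its own id.
      Emit(feature.id(), input);
    } else if (feature.type() == ROAD) {
      // A road is emitted once per endpoint; as written, this assumes
      // every road record carries two intersection ids.
      Emit(feature.intersection_id(0), input);
      Emit(feature.intersection_id(1), input);
    }
    // Any other feature type (e.g. TOWN) is dropped.
  }
};
REGISTER_MAPPER(IntersectionAssemblerMapper);

// Reducer: gathers the intersection and its roads into one output record.
class IntersectionAssemblerReducer : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    GeoFeature feature;
    GraphIntersection intersection;
    intersection.id = input->key();
    // Walk all values that share this intersection id.
    while (!input->done()) {
      feature.FromMapInput(input->value());
      if (feature.type() == INTERSECTION) {
        intersection.SetIntersection(feature);
      } else {
        intersection.AddRoadFeature(feature);
      }
      input->next();
    }
    Emit(intersection);
  }
};
REGISTER_REDUCER(IntersectionAssemblerReducer);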
Join, but no pointers or keys?
Input: List of items
1: Road
2: Road
3: Town
4: Road
5: Road
6: Town
7: Road
Map: Apply map() to each; emit (key,val) pairs (key = ?)
Shuffle: Sort by key
Reduce: Apply reduce() to list of pairs with same key
Output: New list of items
3: 1,2,5
6: 4,5,7
Bucketing (or) Grace Hash Join
Input: Feature List
1: Road
2: Road
3: Town
4: Road
5: Road
6: Town
7: Road
Buckets (grid cells of the geometric hash):
A B
C D
Map: Emit (key, item) pair; Key = geometric hash, Secondary key = Type
(A-Road, 1)
(C-Road, 2)
(A-Town, 3)
(D-Road, 4)
(C-Road, 5)
(B-Town, 6)
(B-Road, 7)
(C-Road, 1)
(B-Town, 3)
(C-Town, 3)
(D-Road, 5)
(D-Town, 6)
(D-Road, 7)
Shuffle: Sort by keys
Reduce: Intersect all towns with all roads; emit intersecting pairs
Output: (town, road) pair
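The bucketing steps can be sketched in memory as follows. This is an illustration under stated assumptions, not the talk's implementation: axis-aligned bounding boxes stand in for real geometry, the geometric hash is a uniform grid, and every name (Box, GeoItem, CellsFor, ReduceBucket, BucketJoin) is hypothetical. The secondary key on type is not modeled; the reducer simply filters by type.

```cpp
#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Axis-aligned bounding box; an assumed simplification of <geom>.
struct Box { double xmin, ymin, xmax, ymax; };

struct GeoItem {
  int id;
  bool is_town;  // false means the item is a road
  Box box;
};

using Cell = std::pair<int, int>;      // grid cell = bucket key
using TownRoad = std::pair<int, int>;  // (town id, road id)

bool Overlaps(const Box& a, const Box& b) {
  return a.xmin <= b.xmax && b.xmin <= a.xmax &&
         a.ymin <= b.ymax && b.ymin <= a.ymax;
}

// Map side: the geometric hash. An item is emitted once per cell it touches.
std::vector<Cell> CellsFor(const Box& b, double cell_size) {
  std::vector<Cell> cells;
  const int x0 = static_cast<int>(std::floor(b.xmin / cell_size));
  const int x1 = static_cast<int>(std::floor(b.xmax / cell_size));
  const int y0 = static_cast<int>(std::floor(b.ymin / cell_size));
  const int y1 = static_cast<int>(std::floor(b.ymax / cell_size));
  for (int x = x0; x <= x1; ++x) {
    for (int y = y0; y <= y1; ++y) cells.emplace_back(x, y);
  }
  return cells;
}

// Reduce side: within one bucket, intersect all towns with all roads.
std::vector<TownRoad> ReduceBucket(const std::vector<GeoItem>& items) {
  std::vector<TownRoad> pairs;
  for (const GeoItem& town : items) {
    if (!town.is_town) continue;
    for (const GeoItem& road : items) {
      if (road.is_town) continue;
      if (Overlaps(town.box, road.box)) pairs.emplace_back(town.id, road.id);
    }
  }
  return pairs;
}

// Driver: std::map groups items by cell, standing in for the shuffle.
std::vector<TownRoad> BucketJoin(const std::vector<GeoItem>& items,
                                 double cell_size) {
  std::map<Cell, std::vector<GeoItem>> buckets;
  for (const GeoItem& item : items) {
    for (const Cell& cell : CellsFor(item.box, cell_size)) {
      buckets[cell].push_back(item);
    }
  }
  std::vector<TownRoad> out;
  for (const auto& bucket : buckets) {
    const std::vector<TownRoad> pairs = ReduceBucket(bucket.second);
    out.insert(out.end(), pairs.begin(), pairs.end());
  }
  return out;
}
```

Because a town and a road that share more than one cell are compared in each of those cells, the same (town, road) pair can be emitted more than once, so the output of this phase still needs a de-duplication step.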
Reduce on Key A
Reducer input (key A):
(A-Road, 1)
(A-Town, 3)
Output:
(3, 1)
Reduce on Key B
Reducer input (key B):
(B-Town, 6)
(B-Road, 7)
(B-Town, 3)
Output:
(6, 7)
Reduce on Key C
Reducer input (key C):
(C-Road, 2)
(C-Road, 5)
(C-Road, 1)
(C-Town, 3)
Output:
(3, 2)
(3, 5)
(3, 1)
Reduce on Key D
Reducer input (key D):
(D-Road, 4)
(D-Road, 5)
(D-Town, 6)
(D-Road, 7)
Output:
(6, 4)
(6, 5)
(6, 7)
Output not quite...
Combined output of the four reducers: (town, road) pairs, with duplicates and not yet grouped by town
(6, 7)
(3, 1)
(3, 2)
(3, 5)
(6, 4)
(6, 5)
(3, 1)
(6, 7)
Recall Earlier Join Pattern
Input: List of items
1: Road
2: Road
3: Intersection
4: Road
5: Road
6: Intersection
7: Road
Map: Apply map() to each; emit (key,val) pairs
Shuffle: Sort by key
Reduce: Apply reduce() to list of pairs with same key
Output: New list of items
3: Intersection
   1: Road, 2: Road, 5: Road
6: Intersection
   4: Road, 5: Road, 7: Road
Recursive Key Join Pattern
Input: Output from previous phase
(6, 7)
(3, 1)
(3, 2)
(3, 5)
(6, 4)
(6, 5)
(3, 1)
(6, 7)
Map: Identity Mapper, key = town
Shuffle: Sort by key
Reduce: Reducer sorts, gathers, removes duplicates; similar to join
Output: Index of roads in each town
(3: 1, 2, 5)
(6: 4, 5, 7)
Could use secondary keys to avoid the reduce sort(), e.g. (6-7, 7)
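The secondary-key idea can be sketched in memory like this. It is a minimal illustration, assuming the shuffle delivers pairs ordered by (town, road); the name BuildRoadIndex is hypothetical. With that ordering, the reducer never sorts: it only compares each road with the previous one for the same town.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using TownRoad = std::pair<int, int>;  // (town id, road id)

// Builds the index of roads in each town from (town, road) pairs.
std::map<int, std::vector<int>> BuildRoadIndex(std::vector<TownRoad> pairs) {
  // Shuffle with a secondary key: std::pair compares by town first and
  // by road second, so one sort orders the values within each key.
  std::sort(pairs.begin(), pairs.end());

  // Reduce: values arrive sorted, so duplicates are adjacent and can be
  // dropped by looking at the last road already gathered for the town.
  std::map<int, std::vector<int>> index;
  for (const TownRoad& p : pairs) {
    std::vector<int>& roads = index[p.first];
    if (roads.empty() || roads.back() != p.second) roads.push_back(p.second);
  }
  return index;
}
```

Fed the eight pairs from the previous phase, this yields roads 1, 2, 5 for town 3 and roads 4, 5, 7 for town 6.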
Chained MapReduces Pattern
Phase 1 (bucketing)
Input: Feature List
Map: Emit (key, item) pair; Key = geometric hash, Secondary key = Type
Shuffle: Sort by keys
Reduce: Intersect all towns with all roads; emit intersecting pairs
Output: (town, road) pair
Phase 2 (gather)
Input: (town, road) pair
Map: Identity Mapper, key = town
Shuffle: Sort by key
Reduce: Reducer sorts, gathers, removes duplicates; similar to join
Output: Index of roads in each town
Distributing Costly Computation:
e.g. Rendering Map Tiles
Input: Geographic feature list
I-5
Lake Washington
WA-520
I-90
Map (bucket pattern): Emit each to all overlapping latitude-longitude rectangles
(N, I-5)
(N, Lake Wash.)
(N, WA-520)
(S, I-90)
(S, I-5)
(S, Lake Wash.)
Shuffle: Sort by key (key = Rect. Id)
N: (N, I-5), (N, Lake Wash.), (N, WA-520)
S: (S, I-90), (S, I-5), (S, Lake Wash.)
Reduce (parallel rendering): Render tile using data for all enclosed features
Output: Rendered tiles (N, S)


Finding Nearest Points Of Interest (POIs)
Input: Feature List
1, Type, Road, Intersection,
2, Type, Road, Intersection,
3, Type, Road, Intersection,
. . .
Output: Nearest POI within 5mi of Intersection
(1, 1)
(2, 1)
(3, 1)
(4, 1)
(5, 7)
(6, 7)
(7, 7)
(8, 7)
(9, 7)
Finding Nearest POI on a Graph
Input: Graph
Map: Perform Dijkstra from each POI node. Emit POI & dist. at each node in search.
Reduce: For each node, emit closest POI
Output: Nodes with nearest POIs
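The map and reduce steps above can be sketched in memory as follows. This is an assumption-laden illustration rather than the speaker's code: the graph is an adjacency list keyed by node id, the search is cut off at a maximum distance, and all names (Edge, Graph, Candidate, AddRoad, SearchFromPoi, ClosestPoi, NearestPoi) are hypothetical.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; double length; };
using Graph = std::map<int, std::vector<Edge>>;  // node id -> outgoing edges
struct Candidate { int poi; double dist; };      // one (POI, distance) offer

// Helper: adds an undirected road segment between nodes a and b.
void AddRoad(Graph& g, int a, int b, double length) {
  g[a].push_back({b, length});
  g[b].push_back({a, length});
}

// Map: Dijkstra from one POI; emits (node, candidate) for each node reached.
std::vector<std::pair<int, Candidate>> SearchFromPoi(const Graph& g, int poi,
                                                     double max_dist) {
  using QItem = std::pair<double, int>;  // (distance so far, node)
  std::priority_queue<QItem, std::vector<QItem>, std::greater<QItem>> frontier;
  std::map<int, double> best;
  best[poi] = 0.0;
  frontier.push({0.0, poi});
  while (!frontier.empty()) {
    const QItem top = frontier.top();
    frontier.pop();
    if (top.first > best[top.second]) continue;  // stale queue entry
    const auto it = g.find(top.second);
    if (it == g.end()) continue;
    for (const Edge& e : it->second) {
      const double next = top.first + e.length;
      if (next > max_dist) continue;  // outside the search radius
      const auto known = best.find(e.to);
      if (known == best.end() || next < known->second) {
        best[e.to] = next;
        frontier.push({next, e.to});
      }
    }
  }
  std::vector<std::pair<int, Candidate>> out;
  for (const auto& kv : best) out.push_back({kv.first, Candidate{poi, kv.second}});
  return out;
}

// Reduce: for one node, keep the closest POI. Expects a non-empty list.
Candidate ClosestPoi(const std::vector<Candidate>& candidates) {
  Candidate best = candidates.front();
  for (const Candidate& c : candidates) {
    if (c.dist < best.dist) best = c;
  }
  return best;
}

// Driver: std::map groups candidates by node, standing in for the shuffle.
std::map<int, Candidate> NearestPoi(const Graph& g, const std::vector<int>& pois,
                                    double max_dist) {
  std::map<int, std::vector<Candidate>> shuffled;
  for (int poi : pois) {
    for (const auto& kv : SearchFromPoi(g, poi, max_dist)) {
      shuffled[kv.first].push_back(kv.second);
    }
  }
  std::map<int, Candidate> result;
  for (const auto& group : shuffled) {
    result.emplace(group.first, ClosestPoi(group.second));
  }
  return result;
}
```

Each call to SearchFromPoi is independent of the others, which is what lets the searches run as separate map tasks; only the final per-node minimum needs the candidates brought together.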
Putting it all together: Nearest POI
Phase 1: Use key-join pattern to create nodes, edges out of intersections, roads
Input: Feature List
Output: Nodes with edges
Phase 2: Use bucketing pattern to create appropriate (overlapping, large-enough) subgraphs
Input: Nodes with edges
Output: Subgraphs
Phase 3: Nearest POI search
Input: Subgraphs
Map: Perform Dijkstra from each POI node. Emit POI & dist. at each node in search.
Reduce: For each node, emit closest POI
Output: Nodes with nearest POI & dist
Phase 4: Use identity mapper & gather pattern to sort and clean up node, POI pairs
Input: Nodes with nearest POI & dist
Output: Sorted nodes with nearest POI
Hard Problems for MapReduce
Following multiple pointer hops
Iterative algorithms
Algorithms with global state
Operations on graphs without good embeddings
[insert your favorite challenge here]
Summary
MapReduce eases:
Machine coordination
Network communication
Fault tolerance
Scaling
Productivity
MapReduce patterns:
Flat data structures
Foreign / Recursive Key Joins (aka pointer following)
Hash Joins (aka bucketing)
Distribute costly ($$) computation
Chain MapReduce phases
Simplify Reduce() by using secondary keys
[ insert your pattern here ]
Questions?
MapReduce: Simplified Data Processing on Large Clusters,
Jeffrey Dean and Sanjay Ghemawat
OSDI'04: Sixth Symposium on Operating System Design and
Implementation
Contact: [email protected]
