BDA Unit I Lecture 8
Definitions: Replication Rate & Reducer Size
◻ Replication rate: average # of key-value pairs generated by the Map tasks per input
Determines the communication cost between Map and Reduce
Denoted as r
◻ Reducer size: upper bound on the size of the value list corresponding to a single key
Denoted as q
Choose q small enough such that:
1. there are many reducers, giving a high level of parallelism
2. the data for a single reducer fits into the main memory of a node
◻ Typically q and r are inversely proportional
Tradeoff between communication cost and parallelism/memory requirements
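A minimal Python sketch (not from the lecture) of how r and q can be measured for a single map/shuffle round; the measure helper, the toy map function, and the sample input are all made up for illustration.

```python
from collections import defaultdict

def measure(map_fn, inputs):
    """Run one map/shuffle round and report replication rate r and reducer size q."""
    groups = defaultdict(list)                  # key -> value list (the shuffle)
    emitted = 0
    for x in inputs:
        for key, value in map_fn(x):            # the Map phase
            groups[key].append(value)
            emitted += 1
    r = emitted / len(inputs)                   # avg key-value pairs per input
    q = max(len(vs) for vs in groups.values())  # longest value list seen by a reducer
    return r, q

# Toy map function (made up): emit each word of a line under its first letter.
lines = ["big data analytics", "map reduce basics", "data models"]
r, q = measure(lambda line: [(w[0], w) for w in line.split()], lines)
print(f"replication rate r = {r:.2f}, reducer size q = {q}")
```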
Example: Join with MapReduce
◻ Map:
For each input tuple R(a, b):
Generate <key = b, value = (‘R’, a)>
For each input tuple S(b, c):
Generate <key = b, value = (‘S’, c)>
◻ Reduce:
Input: <b, value_list>
In the value_list:
■ Pair each entry of the form (‘R’, a) with each entry (‘S’, c), and output: <a, b, c>
◻ Analysis:
Replication rate: r = 1
Reducer size (worst case): q = |R| + |S|
Communication cost: 2(|R| + |S|)
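A runnable Python sketch of the join above, simulating the shuffle with an in-memory dictionary; the sample relations R and S are made-up data, and the dictionary stands in for MapReduce's grouping by key.

```python
from collections import defaultdict

# Made-up sample relations R(a, b) and S(b, c).
R = [(1, 'x'), (2, 'y'), (3, 'x')]
S = [('x', 10), ('y', 20), ('x', 30)]

# Map: tag each tuple with its relation name, keyed by the join attribute b.
shuffle = defaultdict(list)
for a, b in R:
    shuffle[b].append(('R', a))          # <key = b, value = ('R', a)>
for b, c in S:
    shuffle[b].append(('S', c))          # <key = b, value = ('S', c)>

# Reduce: for each b, pair every ('R', a) with every ('S', c).
for b, values in shuffle.items():
    r_side = [v for tag, v in values if tag == 'R']
    s_side = [v for tag, v in values if tag == 'S']
    for a in r_side:
        for c in s_side:
            print((a, b, c))             # output <a, b, c>
```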
Example: Single-Step Matrix-Matrix Multiplication
◻ Reduce(key, value_list):
sum ← 0
for each pair (M, j, mij) and (N, j, njk) in value_list:
sum += mij · njk
output (key, sum)
◻ Reducer size: q = 2n
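A runnable Python sketch of the full one-step algorithm. The slide shows only the Reduce side; the Map side below follows the standard one-step formulation (each mij is replicated to the keys (i, k) for all k, each njk to the keys (i, k) for all i), which is consistent with the reducer size q = 2n. The tiny 2×2 matrices are made up.

```python
from collections import defaultdict

# Tiny made-up matrices (n = 2), stored as dense lists of lists.
n = 2
M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]

# Map: mij is sent to the keys (i, k) for all k; njk is sent to the keys (i, k) for all i.
shuffle = defaultdict(list)
for i in range(n):
    for j in range(n):
        for k in range(n):
            shuffle[(i, k)].append(('M', j, M[i][j]))
for j in range(n):
    for k in range(n):
        for i in range(n):
            shuffle[(i, k)].append(('N', j, N[j][k]))

# Reduce(key=(i, k), value_list): pair ('M', j, mij) with ('N', j, njk) on j and sum.
P = [[0] * n for _ in range(n)]
for (i, k), values in shuffle.items():
    m_vals = {j: v for tag, j, v in values if tag == 'M'}
    n_vals = {j: v for tag, j, v in values if tag == 'N'}
    P[i][k] = sum(m_vals[j] * n_vals[j] for j in m_vals)

print(P)   # [[19, 22], [43, 50]]
```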
A Graph Model for MapReduce Algorithms
Example: Single-Step Matrix-Matrix Multiplication
Application: Naïve Similarity Join
Naïve Similarity Join
◻ Objective: Given a large set of elements X and a similarity measure s(x1, x2), output the pairs whose similarity is above a given threshold.
Locality-sensitive hashing is not used, for the sake of this example.
◻ Example:
Each element is an image of 1 MB
There are 1M images in the set
About 5×10¹¹ (500 billion) image comparisons to make
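A quick Python check of the comparison count (the 1M-image figure is taken from the slide).

```python
from math import comb

n_images = 10**6
print(comb(n_images, 2))   # 499999500000 comparisons, i.e. about 5 * 10**11
```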
Similarity Join with MapReduce (First Try)
◻ Map:
for each picture Pi do:
for each j = 1 to n (except i):
generate <key = (i, j), value = Pi>   (the keys (i, j) and (j, i) are treated as the same key)
◻ Analysis:
Replication rate: r = n - 1
Reducer size: q = 2
# of reducers: n(n-1)/2
◻ Communication cost: n + n(n-1)
n(n-1) pictures communicated from Map to Reduce
Total # of bytes transferred: n(n-1) ≈ 10¹² pictures × 10⁶ bytes each ≈ 10¹⁸ bytes
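A runnable Python sketch of this first-try algorithm. The similarity function, the threshold, and the tiny string "pictures" are placeholders; keys are normalized to unordered pairs (min, max) so that reducer (i, j) receives exactly the two pictures Pi and Pj, matching q = 2 and n(n-1)/2 reducers.

```python
from collections import defaultdict

def similarity(p1, p2):
    """Placeholder similarity measure (assumption): fraction of matching characters."""
    return sum(a == b for a, b in zip(p1, p2)) / len(p1)

THRESHOLD = 0.5                              # assumed threshold
pictures = ["abcd", "abce", "wxyz"]          # tiny stand-ins for 1 MB images
n = len(pictures)

# Map: send picture Pi to the reducer for every unordered pair {i, j}, j != i.
shuffle = defaultdict(list)
for i, p in enumerate(pictures):
    for j in range(n):
        if j != i:
            shuffle[(min(i, j), max(i, j))].append((i, p))

# Reduce: each reducer holds exactly two pictures (q = 2) and compares them.
for (i, j), values in shuffle.items():
    (_, p1), (_, p2) = sorted(values)
    if similarity(p1, p2) >= THRESHOLD:
        print((i, j), similarity(p1, p2))
```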
Graph Model
Graph Model: Multiple Outputs per Reducer
Grouping Outputs
Similarity Join with Grouping
Example
[Figure: the n × n grid of picture pairs. There will be a reducer for each key (u, v), where u ≤ v.]
Example
[Figure: the same grid with g = 4 groups, highlighting a picture Pi in group 2. Pi is sent to the reducers (1, 2), (2, 2), (2, 3), (2, 4).]
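A runnable Python sketch of the grouped similarity join shown in the figures above: each picture is sent to the g reducers whose key contains its group, and reducer (u, v) compares pictures across groups u and v (or within group u when u = v). The group-assignment function (i % g), the similarity function, and the sample data are assumptions made for illustration; groups are numbered 0..g-1 here, while the slides use 1..g.

```python
from collections import defaultdict
from itertools import combinations

def similarity(p1, p2):
    """Placeholder similarity measure (assumption)."""
    return sum(a == b for a, b in zip(p1, p2)) / len(p1)

THRESHOLD = 0.5
g = 2                                        # number of groups (the slides use g = 4)
pictures = ["abcd", "abce", "abcf", "wxyz"]  # made-up sample data

def group(i):
    return i % g                             # assumed group assignment

# Map: picture Pi in group u goes to every reducer (min(u, v), max(u, v)), v = 0..g-1,
# i.e. to exactly g reducers, so the replication rate is r = g.
shuffle = defaultdict(list)
for i, p in enumerate(pictures):
    u = group(i)
    for v in range(g):
        shuffle[(min(u, v), max(u, v))].append((i, p))

# Reduce(key = (u, v)): compare pictures across groups u and v; pairs that fall inside
# one group are compared only at the "diagonal" reducer (u, u) to avoid duplicates.
for (u, v), values in shuffle.items():
    for (i, p1), (j, p2) in combinations(values, 2):
        same_group = group(i) == group(j)
        if (u == v and same_group) or (u != v and not same_group):
            if similarity(p1, p2) >= THRESHOLD:
                print((i, j), similarity(p1, p2))
```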
Complexity Analysis
◻ Replication rate: r = g
◻ Reducer size: q = 2n/g
◻ Communication cost: n + ng
◻ # of reducers: g(g+1)/2
Example: 1M pictures of 1 MB each
◻ Let g = 1000
Replication rate: r = g = 1000
Reducer size: q = 2n/g = 2×10⁶/10³ = 2000 pictures = 2 GB per reducer
Communication cost: n + ng = 10⁶ + 10⁹ ≈ 10⁹ pictures ≈ 10¹⁵ bytes
# of reducers: g(g+1)/2 ≈ 500,000
Tradeoff Between Replication Rate and Reducer Size
Replication rate r = g and reducer size q = 2n/g, so qr = 2n (i.e., q = 2n/r).
◻ Reducing the replication rate reduces communication, but increases the reducer size.
Extreme case: r = 1 and q = 2n. There is a single reducer doing all the comparisons.
Extreme case: r = n and q = 2. There is a reducer for each pair of inputs.
◻ Choose r as small as possible, subject to q = 2n/r remaining small enough that a reducer’s data fits into local DRAM and there is enough parallelism.
Application: Matrix-Matrix Multiplication with 1D Decomposition
Reminder: Matrix-Matrix Multiplication without Grouping
[Figure: M × N = P; element mij of M is combined with element njk of N to contribute to element pik of P.]
Replication rate: r = n
Multiple Outputs per Reducer
[Figure: M is divided into g row stripes (I) and N into g column stripes (K); reducer (I, K) produces the corresponding block of P from all of stripe I of M and all of stripe K of N.]
1D Matrix Decomposition
[Figure: the 1D decomposition: g row stripes of M × g column stripes of N = the g × g blocks of P.]
MapReduce Formulation
◻ Map:
for each element mij from matrix M:
for K = 1 to g:
generate <key = (I, K), value = (‘M’, i, j, mij)>   (I is the stripe containing row i)
for each element njk from matrix N:
for I = 1 to g:
generate <key = (I, K), value = (‘N’, j, k, njk)>   (K is the stripe containing column k)
◻ Reduce(key = (I, K), value_list):
for each i ∈ I and for each k ∈ K:
pik = 0
for j = 1 to n:
pik += mij · njk
output <key = (i, k), value = pik>
◻ Analysis:
Replication rate: r = g
Reducer size: q = 2n²/g
Communication cost: 2n² + 2gn²
# of reducers: g²
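A runnable Python sketch of this 1D formulation, simulating both phases in memory. The mapping of a row or column index to its stripe (index // (n/g)) is an assumption, the toy matrices are made up, and the result is checked against a plain triple-loop product.

```python
from collections import defaultdict

# Made-up n x n matrices, with n divisible by the number of stripes g.
n, g = 4, 2
M = [[i * n + j + 1 for j in range(n)] for i in range(n)]
N = [[(i + j) % 3 for j in range(n)] for i in range(n)]

def stripe(idx):
    return idx // (n // g)      # which of the g stripes an index falls in (assumption)

# Map: mij goes to the keys (I, K) for all K, with I = stripe(i);
#      njk goes to the keys (I, K) for all I, with K = stripe(k).
shuffle = defaultdict(list)
for i in range(n):
    for j in range(n):
        for K in range(g):
            shuffle[(stripe(i), K)].append(('M', i, j, M[i][j]))
for j in range(n):
    for k in range(n):
        for I in range(g):
            shuffle[(I, stripe(k))].append(('N', j, k, N[j][k]))

# Reduce(key = (I, K)): compute pik for every row i in stripe I and column k in stripe K.
P = [[0] * n for _ in range(n)]
for (I, K), values in shuffle.items():
    m_vals = {(i, j): v for tag, i, j, v in values if tag == 'M'}
    n_vals = {(j, k): v for tag, j, k, v in values if tag == 'N'}
    for (i, j), m in m_vals.items():
        for (jj, k), nv in n_vals.items():
            if j == jj:
                P[i][k] += m * nv

# Sanity check against a plain triple-loop product.
expected = [[sum(M[i][j] * N[j][k] for j in range(n)) for k in range(n)] for i in range(n)]
assert P == expected
```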
Communication Cost vs. Reducer Size
◻ Communication cost vs. reducer size:
cost = 2n² + 2gn² = 2n² + 4n⁴/q
Inverse relation between communication cost and reducer size.
◻ Reminder: the q value chosen should be small enough such that:
local memory is sufficient
there’s enough parallelism
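The substitution behind the cost expression, written out using the reducer-size formula from the previous slide:

```latex
q = \frac{2n^2}{g}
\;\Longrightarrow\; g = \frac{2n^2}{q}
\;\Longrightarrow\;
\text{cost} = 2n^2 + 2gn^2
            = 2n^2 + 2\cdot\frac{2n^2}{q}\cdot n^2
            = 2n^2 + \frac{4n^4}{q}
```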
Application: Matrix-Matrix Multiplication with 2D Decomposition
Two-Stage MapReduce Algorithm
◻ Use a similar idea, but for sub-blocks of the matrices instead of individual elements
2D Matrix Decomposition
[Figure: the 2D decomposition: M, N, and P are each divided into a g × g grid of blocks.]
Computing the Product at Stripe (I, K)
[Figure: block PIK is obtained by combining the blocks MIJ (J = 1..g) of block-row I with the blocks NJK (J = 1..g) of block-column K.]
First MapReduce Step
[Figure: MIJ (from M) × NJK (from N) = the Jth partial sum of block PIK.]
MapReduce Step 1: Map
[Figure: each block MIJ of M is sent to the reducers (I, J, K) for K = 1..g; each block NJK of N is sent to the reducers (I, J, K) for I = 1..g.]
Reminder: Reducer (I, J, K) is responsible for computing the Jth partial sum for block PIK
MapReduce Step 1: Reduce
Reducer (I, J, K) will receive the blocks MIJ and NJK and will compute the Jth partial sum for block PIK
MapReduce Step 1: Reducer Output
For each pik ∈ PIK, there are g reducers (those with keys (I, J, K), J = 1..g) that each compute a partial sum.
The reduce outputs corresponding to pik: <key = (i, k), value = xJik>, where xJik denotes the Jth partial sum of pik.
MapReduce Step 2
◻ Map (identity):
for each input <key = (i, k), value = xJik>:
generate <key = (i, k), value = xJik>
Complexity Analysis: Step 1
Replication rate: r1 = g
Communication cost: 2n² + 2gn²
Reducer size: q1 = 2n²/g²
# of reducers: g³
Complexity Analysis: MapReduce Step 2
◻ Map:
for each input <key = (i, k), value = xJik>:
generate <key = (i, k), value = xJik>
◻ Reduce(key = (i, k), value_list):
pik = 0
for each xJik in value_list:
pik += xJik
output <key = (i, k), value = pik>
◻ Analysis:
Replication rate: r2 = 1
Communication cost: gn²
Reducer size: q2 = g
# of reducers: n²
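A runnable Python sketch of the full two-step (2D) algorithm, combining the Step 1 and Step 2 pseudocode above. As before, the index-to-stripe mapping (index // (n/g)) and the toy matrices are assumptions, and the final result is checked against a plain triple-loop product.

```python
from collections import defaultdict

# Made-up n x n matrices, with n divisible by g (each stripe holds n/g rows or columns).
n, g = 4, 2
M = [[i + 2 * j for j in range(n)] for i in range(n)]
N = [[(i * j) % 5 for j in range(n)] for i in range(n)]

def stripe(idx):
    return idx // (n // g)      # index-to-stripe mapping (assumption)

# Step 1 Map: block MIJ is sent to the keys (I, J, K) for all K,
#             block NJK is sent to the keys (I, J, K) for all I.
step1 = defaultdict(list)
for i in range(n):
    for j in range(n):
        for K in range(g):
            step1[(stripe(i), stripe(j), K)].append(('M', i, j, M[i][j]))
for j in range(n):
    for k in range(n):
        for I in range(g):
            step1[(I, stripe(j), stripe(k))].append(('N', j, k, N[j][k]))

# Step 1 Reduce(key = (I, J, K)): multiply blocks MIJ and NJK and emit the Jth
# partial sum xJik for every pik in block PIK.
step2 = defaultdict(list)
for (I, J, K), values in step1.items():
    m_vals = {(i, j): v for tag, i, j, v in values if tag == 'M'}
    n_vals = {(j, k): v for tag, j, k, v in values if tag == 'N'}
    partial = defaultdict(int)
    for (i, j), m in m_vals.items():
        for (jj, k), nv in n_vals.items():
            if j == jj:
                partial[(i, k)] += m * nv
    for (i, k), x in partial.items():
        step2[(i, k)].append(x)            # <key = (i, k), value = xJik>

# Step 2 Map is the identity; Step 2 Reduce(key = (i, k)) sums the g partial sums.
P = [[0] * n for _ in range(n)]
for (i, k), partial_sums in step2.items():
    P[i][k] = sum(partial_sums)

expected = [[sum(M[i][j] * N[j][k] for j in range(n)) for k in range(n)] for i in range(n)]
assert P == expected
```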
Complexity Analysis
◻ Step 1:
Replication rate: r1 = g
# of reducers: g³
◻ Step 2:
Replication rate: r2 = 1
# of reducers: n²
Tradeoff Between Communication Cost and Reducer Size
1D Decomposition vs. 2D Decomposition
Comparison: Reducer Size
1D Decomposition: q = 2n²/g
2D Decomposition: q1 = 2n²/g² (Step 1), q2 = g (Step 2)
Comparison: Communication Costs
1D Decomposition: 2n² + 2gn²
2D Decomposition: (2n² + 2gn²) + gn² = 2n² + 3gn² (Steps 1 and 2 combined)
Comparison: Communication Costs (when reducer sizes are equal)
1D Decomposition: for reducer size q, g = 2n²/q, so the cost is 2n² + 4n⁴/q
2D Decomposition: for Step-1 reducer size q, g = n√(2/q), so the cost is about 2n² + 3n³√(2/q)
For the same reducer size, the two-step (2D) algorithm therefore communicates far less whenever q ≪ n².
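A small numeric check of the comparison above, using the cost and reducer-size formulas from the earlier slides; the values of n and q are made up, and the total 2D cost is taken as the sum of the Step 1 and Step 2 costs.

```python
from math import sqrt

n = 10**3          # matrix dimension (made up)
q = 2 * 10**4      # target reducer size, the same for both algorithms (made up)

# 1D (one step): q = 2n^2/g  =>  g = 2n^2/q, cost = 2n^2 + 2gn^2
g_1d = 2 * n**2 / q
cost_1d = 2 * n**2 + 2 * g_1d * n**2

# 2D (two steps): q1 = 2n^2/g^2  =>  g = n*sqrt(2/q),
# total cost = (2n^2 + 2gn^2) + gn^2 = 2n^2 + 3gn^2
g_2d = n * sqrt(2 / q)
cost_2d = 2 * n**2 + 3 * g_2d * n**2

print(f"1D: g = {g_1d:.0f}, cost ≈ {cost_1d:.2e} key-value pairs")
print(f"2D: g = {g_2d:.0f}, cost ≈ {cost_2d:.2e} key-value pairs")
```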
Conclusions
◻ The replication rate r and the reducer size q trade off against each other: lowering r reduces the communication cost but raises q.
◻ q must stay small enough that a reducer’s data fits into the main memory of a node and there is enough parallelism.
◻ Grouping inputs (similarity join) and decomposing matrices into stripes or blocks (1D/2D multiplication) are ways to tune this tradeoff.