Bda - Unit I - Lecture 6, 7
Most of the slides are from the Mining of Massive Datasets book.
These slides have been modified for the Big Data Analytics course. The original slides can be accessed at: www.mmds.org
Example Application: Join
Join Operation
◼ Compute the natural join R(A,B) ⋈ S(B,C)
◼ R and S are each stored in files
◼ Tuples are pairs (a,b) or (b,c)
R          S          R ⋈ S
A   B      B   C      A   B   C
a1  b1     b2  c1     a3  b2  c1
a2  b1     b2  c2     a3  b2  c2
a3  b2     b3  c3     a4  b3  c3
a4  b3
Join with MapReduce: Map
◻ Map:
For each input tuple R(a, b):
Generate <key = b, value = (‘R’, a)>
For each input tuple S(b, c):
Generate <key = b, value = (‘S’, c)>
Think of ‘R’ and ‘S’ as tags (for example, Boolean flags) that indicate which relation the pair came from
R          S          Key-value pairs
A   B      B   C      from R          from S
a1  b1     b2  c1     <b1, (R, a1)>   <b2, (S, c1)>
a2  b1     b2  c2     <b1, (R, a2)>   <b2, (S, c2)>
a3  b2     b3  c3     <b2, (R, a3)>   <b3, (S, c3)>
a4  b3                <b3, (R, a4)>
Join with MapReduce: Shuffle & Sort
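Output of Map: the key-value pairs, in the order they were generated
Input of Reduce: the same pairs, grouped by key
<b1, [ (R, a1); (R, a2) ] >
<b2, [ (R, a3); (S, c1); (S, c2) ] >
<b3, [ (R, a4); (S, c3) ] >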
Join with MapReduce: Reduce
◻ Reduce:
Input: <b, value_list>
In the value list:
■ Pair each entry of the form (‘R’, a) with each entry (‘S’, c), and output:
<a, b, c>
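The Map, Shuffle & Sort, and Reduce steps above can be sketched in plain Python. This is a minimal single-process illustration, not a distributed implementation: the function names `map_join`, `shuffle`, and `reduce_join` are hypothetical, and the in-memory `shuffle` only imitates the grouping that a MapReduce framework would perform.

```python
from collections import defaultdict

def map_join(relation_name, tuples):
    """Emit <key = b, value = (tag, other attribute)> for each input tuple."""
    for t in tuples:
        if relation_name == "R":      # R(a, b): the join attribute b is second
            a, b = t
            yield b, ("R", a)
        else:                         # S(b, c): the join attribute b is first
            b, c = t
            yield b, ("S", c)

def shuffle(pairs):
    """Group values by key (stand-in for the framework's shuffle and sort)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(b, values):
    """Pair every ('R', a) with every ('S', c) that shares the key b."""
    r_side = [a for tag, a in values if tag == "R"]
    s_side = [c for tag, c in values if tag == "S"]
    for a in r_side:
        for c in s_side:
            yield a, b, c

# The relations from the Join Operation slide
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

pairs = list(map_join("R", R)) + list(map_join("S", S))
groups = shuffle(pairs)
result = sorted(t for b, vals in groups.items() for t in reduce_join(b, vals))
print(result)
```

Key b1 receives only R entries, so its reducer outputs nothing, which is why a1 and a2 do not appear in the join result.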
Example Application: Matrix-Vector Multiplication
◻ X = M × V, with entries $x_i = \sum_{j=1}^{n} m_{ij} v_j$
Simple Case: Vector fits in memory
◻ For simplicity, assume that n is not too large and V fits into
main memory of each node.
◻ First, read V into an array accessible from Map tasks
◻ Map:
For each mij, generate <key = i, value = mij * vj>
◻ Reduce:
Input: <key = i, values = [mi1 * v1 ; mi2 * v2 ; … ; min * vn]>
Sum up all values, and output <key = i, value = sum>
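A small Python sketch of this simple case follows. It assumes M is stored as sparse (i, j, mij) triples with 0-based indices and that V is an ordinary list held in memory; the names `map_mv` and `reduce_mv` and the numeric values are made up for illustration.

```python
from collections import defaultdict

def map_mv(entries, v):
    """For each stored m_ij, emit <key = i, value = m_ij * v_j>."""
    for i, j, m_ij in entries:
        yield i, m_ij * v[j]

def reduce_mv(i, values):
    """Sum the products for row i to obtain x_i."""
    return i, sum(values)

# Hypothetical 3 x 3 matrix as (i, j, m_ij) triples; zero entries are omitted
M = [(0, 0, 1.0), (0, 2, 2.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]
V = [1.0, 2.0, 3.0]   # small enough to be visible to every Map task

# Group the Map output by key, as the shuffle phase would
groups = defaultdict(list)
for i, product in map_mv(M, V):
    groups[i].append(product)

x = dict(reduce_mv(i, vals) for i, vals in groups.items())
print(x)
```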
Example Application: Matrix-Matrix Multiplication (2 Map-Reduce Steps)
◻ P = M × N, with entries $p_{ik} = \sum_{j} m_{ij} n_{jk}$
Two-Step MapReduce
Step 1: Join the entries of M and N that share the same j value, and compute the products mij.njk
Step 2: Accumulate all results and compute the output matrix entries
First MapReduce Step
◻ Map:
For each mij value of matrix M
Generate <key = j, value = (“M”, i, mij) >
For each njk value of matrix N
Generate <key = j, value = (“N”, k, njk) >
Example: Map Output
M (rows i, columns j)        N (rows j, columns k)
 0    0   m13   0             0   n12   0   n14
m21   0    0   m24            0    0    0    0
 0    0    0    0             0   n32   0   n34
m41   0   m43   0             0   n42   0    0
Map output:
<1, (M, 2, m21)> <1, (N, 2, n12)>
<1, (M, 4, m41)> <1, (N, 4, n14)>
<3, (M, 1, m13)> <3, (N, 2, n32)>
<3, (M, 4, m43)> <3, (N, 4, n34)>
<4, (M, 2, m24)> <4, (N, 2, n42)>
Intuition 1: Joining entries with same j values
◻ mij and njk are multiplied only when they share the same j, so Map keys both on j and each reducer receives exactly the entries it has to pair.
◻ Example: key 3 collects m13 and m43 from M, and n32 and n34 from N.
Intuition 2: Partial sums
◻ Each product mij.njk is one term (a partial sum) of pik; the terms of a single pik are spread over the reducers for different j values.
◻ Example: p42 = m41.n12 + m43.n32, where the first term comes from key j = 1 and the second from key j = 3.
First MapReduce Step: Reduce
Reduce input: <1, [ (M, 2, m21); (M, 4, m41); (N, 2, n12); (N, 4, n14) ] >
<3, [ (M, 1, m13); (M, 4, m43); (N, 2, n32); (N, 4, n34) ] >
<4, [ (M, 2, m24); (N, 2, n42) ] >
Reduce(key, value_list):
    Put all entries in value_list of form (M, i, mi,key) into LM
    Put all entries in value_list of form (N, k, nkey,k) into LN
    for each entry (M, i, mi,key) in LM
        for each entry (N, k, nkey,k) in LN
            output <key = (i, k); value = mi,key . nkey,k >
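The first MapReduce step can be sketched in Python as below. The sparsity pattern follows the example matrices, but the numeric values (m13 = 2, m21 = 3, and so on) are invented for illustration, as are the function names `map_step1` and `reduce_step1`. Indices are 1-based to match the slides.

```python
from collections import defaultdict

def map_step1(m_entries, n_entries):
    """Key every entry on j: column index for M, row index for N."""
    for i, j, m_ij in m_entries:
        yield j, ("M", i, m_ij)
    for j, k, n_jk in n_entries:
        yield j, ("N", k, n_jk)

def reduce_step1(j, values):
    """Multiply each M entry with each N entry that shares the key j."""
    lm = [(i, m) for tag, i, m in values if tag == "M"]
    ln = [(k, n) for tag, k, n in values if tag == "N"]
    for i, m in lm:
        for k, n in ln:
            yield (i, k), m * n

# (i, j, m_ij) triples: m13, m21, m24, m41, m43 (hypothetical values)
M = [(1, 3, 2), (2, 1, 3), (2, 4, 5), (4, 1, 7), (4, 3, 11)]
# (j, k, n_jk) triples: n12, n14, n32, n34, n42 (hypothetical values)
N = [(1, 2, 1), (1, 4, 4), (3, 2, 6), (3, 4, 8), (4, 2, 9)]

# Group the Map output by key, as the shuffle phase would
groups = defaultdict(list)
for j, value in map_step1(M, N):
    groups[j].append(value)

products = sorted(p for j, vals in groups.items() for p in reduce_step1(j, vals))
print(products)
```

Several output pairs share a key such as (2, 2) or (4, 2); those are the partial sums that the second step adds up.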
Example: Reduce Output
Reduce input:
<1, [ (M, 2, m21); (M, 4, m41); (N, 2, n12); (N, 4, n14) ] >
<3, [ (M, 1, m13); (M, 4, m43); (N, 2, n32); (N, 4, n34) ] >
<4, [ (M, 2, m24); (N, 2, n42) ] >
Reduce output:
< (2, 2), (m21.n12) > < (1, 2), (m13.n32) > < (2, 2), (m24.n42) >
< (2, 4), (m21.n14) > < (1, 4), (m13.n34) >
< (4, 2), (m41.n12) > < (4, 2), (m43.n32) >
< (4, 4), (m41.n14) > < (4, 4), (m43.n34) >
Second MapReduce Step: Map
Map input:
< (2, 2), (m21.n12) > < (1, 2), (m13.n32) > < (2, 2), (m24.n42) >
< (2, 4), (m21.n14) > < (1, 4), (m13.n34) >
< (4, 2), (m41.n12) > < (4, 2), (m43.n32) >
< (4, 4), (m41.n14) > < (4, 4), (m43.n34) >
◻ Map:
for each (key, value) pair in the input
generate (key, value)
◻ Identity function
◻ The system will most likely schedule each map task on the same node as
the reduce task that produced its input, so this step adds little or no communication cost.
Second MapReduce Step: Reduce
Reduce input:
< (2, 2), (m21.n12) > < (1, 2), (m13.n32) > < (2, 2), (m24.n42) >
< (2, 4), (m21.n14) > < (1, 4), (m13.n34) >
< (4, 2), (m41.n12) > < (4, 2), (m43.n32) >
< (4, 4), (m41.n14) > < (4, 4), (m43.n34) >
◻ Reduce(key, value_list):
    sum = 0
    foreach v in value_list
        sum += v
    output <key, sum>
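The second step is short enough to sketch directly in Python. The input list `step1_output` holds hypothetical numeric products with the same key pattern as the Reduce input above; the names `map_step2` and `reduce_step2` are illustrative.

```python
from collections import defaultdict

def map_step2(pairs):
    """Identity map: pass each (key, value) pair through unchanged."""
    for key, value in pairs:
        yield key, value

def reduce_step2(key, values):
    """Sum all partial products that belong to output entry p_ik."""
    total = 0
    for v in values:
        total += v
    return key, total

# Hypothetical products keyed by (i, k); several keys occur more than once
step1_output = [
    ((2, 2), 3), ((2, 4), 12), ((4, 2), 7), ((4, 4), 28),     # from j = 1
    ((1, 2), 12), ((1, 4), 16), ((4, 2), 66), ((4, 4), 88),   # from j = 3
    ((2, 2), 45),                                             # from j = 4
]

# Group the Map output by key, as the shuffle phase would
groups = defaultdict(list)
for key, value in map_step2(step1_output):
    groups[key].append(value)

P = dict(reduce_step2(key, vals) for key, vals in groups.items())
print(P)
```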
Example: MapReduce Step 2 - Reduce
◻ Summing the values that share a key gives the non-zero entries of P.
Reduce output:
< (1, 2), m13.n32 > < (1, 4), m13.n34 >
< (2, 2), m21.n12 + m24.n42 > < (2, 4), m21.n14 >
< (4, 2), m41.n12 + m43.n32 > < (4, 4), m41.n14 + m43.n34 >
Single-Step MapReduce
◻ Each output entry is $p_{ik} = \sum_{j} m_{ij} n_{jk}$. In other words:
mij entry is needed to compute pik values for all k.
njk entry is needed to compute pik values for all i.
◻ Intuition: Send each input matrix entry to all reducers that need it.
An Entry of Matrix M
[Figure: M × N, highlighting entry mij in row i, column j of M.]
An Entry of Matrix N
[Figure: M × N, highlighting entry njk in row j, column k of N.]
Map Operation
◻ Reminder:
mij entry is needed to compute pik values for all k.
njk entry is needed to compute pik values for all i.
◻ Map:
    for each mij entry from matrix M:
        for k=1 to n
            generate <key = (i, k), value = (‘M’, j, mij) >
    for each njk entry from matrix N:
        for i=1 to n
            generate <key = (i, k), value = (‘N’, j, njk) >
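As a rough Python sketch of this Map operation, the generator below replicates each entry to every key (i, k) that needs it. The name `map_single_step`, the triple layout, and the numeric values are assumptions made for illustration; n is taken to be the common matrix dimension, with 1-based indices as on the slides.

```python
def map_single_step(m_entries, n_entries, n):
    """Send each matrix entry to every reducer key (i, k) that needs it."""
    for i, j, m_ij in m_entries:
        for k in range(1, n + 1):          # m_ij contributes to p_ik for all k
            yield (i, k), ("M", j, m_ij)
    for j, k, n_jk in n_entries:
        for i in range(1, n + 1):          # n_jk contributes to p_ik for all i
            yield (i, k), ("N", j, n_jk)

M = [(1, 3, 2)]   # a single entry m13 = 2 (hypothetical value)
N = [(1, 2, 1)]   # a single entry n12 = 1 (hypothetical value)

pairs = list(map_single_step(M, N, n=4))
for pair in pairs:
    print(pair)
```

Each input entry produces n key-value pairs, so the Map output here is n times larger than its input.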
Example: Map Output for Matrix M Entries
Map output:
<(1,1), (M, 3, m13) > <(2,1), (M, 1, m21) > <(2,1), (M, 4, m24) >
<(1,2), (M, 3, m13) > <(2,2), (M, 1, m21) > <(2,2), (M, 4, m24) > …
<(1,3), (M, 3, m13) > <(2,3), (M, 1, m21) > <(2,3), (M, 4, m24) >
<(1,4), (M, 3, m13) > <(2,4), (M, 1, m21) > <(2,4), (M, 4, m24) >
Example: Map Output for Matrix N Entries
Map output:
<(1,2), (N, 1, n12) > <(1,4), (N, 1, n14) > <(1,2), (N, 3, n32) >
<(2,2), (N, 1, n12) > <(2,4), (N, 1, n14) > <(2,2), (N, 3, n32) > …
<(3,2), (N, 1, n12) > <(3,4), (N, 1, n14) > <(3,2), (N, 3, n32) >
<(4,2), (N, 1, n12) > <(4,4), (N, 1, n14) > <(4,2), (N, 3, n32) >
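Summary: Map Operation
[Figure: each mij is sent to the reducers with keys (i, k) for all k; each njk is sent to the reducers with keys (i, k) for all i.]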
Reduce Operation
◻ Input:
key = (i,k), value_list = [ … (M, j, mij); … (N, j, njk) … ]
an entry exists for any non-zero mij or njk
◻ Objective: Multiply mij and njk values with matching j values, and sum
up all products to compute pik
◻ Reduce(key, value_list)
    put all entries of form (M, j, mij) into LM
    sort entries in LM based on j values
    put all entries of form (N, j, njk) into LN
    sort entries in LN based on j values
    sum ← 0
    for each pair (M, j, mij) in LM and (N, j, njk) in LN with the same j
        sum += mij . njk
    output (key, sum)
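One way to read the "sort, then pair" steps above is as a merge of two lists sorted by j. The Python sketch below takes that reading; it assumes each j appears at most once in LM and at most once in LN (true when each matrix entry is emitted once per key), and the name `reduce_single_step` and the numeric values are illustrative.

```python
def reduce_single_step(key, value_list):
    """Sort LM and LN by j, then merge them to multiply entries with equal j."""
    lm = sorted((j, m) for tag, j, m in value_list if tag == "M")
    ln = sorted((j, n) for tag, j, n in value_list if tag == "N")
    total = 0
    a = b = 0
    while a < len(lm) and b < len(ln):
        j_m, m = lm[a]
        j_n, n = ln[b]
        if j_m == j_n:        # matching j: this product is a term of p_ik
            total += m * n
            a += 1
            b += 1
        elif j_m < j_n:       # no N entry for this j: skip the M entry
            a += 1
        else:                 # no M entry for this j: skip the N entry
            b += 1
    return key, total

# Hypothetical values for m41, m43, n12, n32, n42 at key (4, 2)
values = [("M", 1, 7), ("M", 3, 11), ("N", 1, 1), ("N", 3, 6), ("N", 4, 9)]
result = reduce_single_step((4, 2), values)
print(result)
```

The N entry with j = 4 finds no partner in LM and is skipped, so only the j = 1 and j = 3 products reach the sum.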
Example: Reduce
Reduce input: key = (4, 2), value_list = [ (M, 1, m41); (M, 3, m43);
(N, 1, n12); (N, 3, n32); (N, 4, n42) ]
Reduce output: key = (4, 2), value = m41.n12 + m43.n32
◻ (N, 4, n42) has no matching M entry because m44 = 0, so it contributes nothing to the sum.
Summary: Single-Step MapReduce Algorithm
◻ Map(input):
    for each mij entry from matrix M:
        for k=1 to n
            generate <key = (i, k), value = (‘M’, j, mij) >
    for each njk entry from matrix N:
        for i=1 to n
            generate <key = (i, k), value = (‘N’, j, njk) >
◻ Reduce(key, value_list)
    sum ← 0
    for each pair (M, j, mij) and (N, j, njk) in value_list with the same j
        sum += mij . njk
    output (key, sum)
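Putting Map and Reduce together, the whole single-step algorithm fits in one small Python function. This sketch runs in a single process and matches entries on j with a dictionary rather than by sorting, which is a substitution made here for brevity. The function name `matmul_single_step` and the numeric values are hypothetical; the sparsity pattern follows the example matrices.

```python
from collections import defaultdict

def matmul_single_step(m_entries, n_entries, n):
    """Single-step MapReduce matrix multiplication, simulated in memory."""
    # Map + shuffle: replicate each entry to every key (i, k) that needs it
    groups = defaultdict(list)
    for i, j, m_ij in m_entries:
        for k in range(1, n + 1):
            groups[(i, k)].append(("M", j, m_ij))
    for j, k, n_jk in n_entries:
        for i in range(1, n + 1):
            groups[(i, k)].append(("N", j, n_jk))

    # Reduce: multiply entries with matching j and sum the products
    result = {}
    for key, values in groups.items():
        m_by_j = {j: v for tag, j, v in values if tag == "M"}
        total = sum(m_by_j[j] * v
                    for tag, j, v in values
                    if tag == "N" and j in m_by_j)
        if total != 0:        # keep the output sparse: drop zero entries
            result[key] = total
    return result

# (i, j, m_ij) triples: m13, m21, m24, m41, m43 (hypothetical values)
M = [(1, 3, 2), (2, 1, 3), (2, 4, 5), (4, 1, 7), (4, 3, 11)]
# (j, k, n_jk) triples: n12, n14, n32, n34, n42 (hypothetical values)
N = [(1, 2, 1), (1, 4, 4), (3, 2, 6), (3, 4, 8), (4, 2, 9)]

P = matmul_single_step(M, N, n=4)
print(P)
```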