
BIG DATA ANALYTICS

Complexity Analysis of MapReduce Algorithms


Communication Cost Model

◻ The model we will use:
Communication cost = sum of the input sizes to each stage
◻ Output sizes are ignored
If the output is large, it is likely to be the input to another stage
The real outputs are typically small, e.g. some summary statistics
◻ Reading from disk is part of the communication cost
e.g. the input to a map stage can come from the disk of a reduce task at a different node
◻ Analysis is independent of scheduling decisions
e.g. map and reduce tasks may or may not be assigned to the same node

[Pipeline: Input → Map → Reduce → Map → Reduce → Output]
Definitions: Replication Rate & Reducer Size

◻ Replication rate: the average number of key-value pairs generated by the Map tasks per input
Determines the communication cost between Map and Reduce
Denoted as r
◻ Reducer size: an upper bound on the size of the value list associated with a single key
Denoted as q
Choose q small enough such that:
1. there are many reducers, allowing a high level of parallelism
2. the data for one reducer fits into the main memory of a node
◻ Typically q and r are inversely proportional
Tradeoff between communication cost and parallelism/memory requirements
Example: Join with MapReduce

◻ Map:
  for each input tuple R(a, b):
    generate <key = b, value = ('R', a)>
  for each input tuple S(b, c):
    generate <key = b, value = ('S', c)>

◻ Reduce:
  Input: <b, value_list>
  In the value_list:
    pair each entry of the form ('R', a) with each entry of the form ('S', c),
    and output <a, b, c>

Replication rate: r = 1
Communication cost: 2(|R| + |S|)
Reducer size (worst case): q = |R| + |S|
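For concreteness, here is a minimal Python sketch (not part of the original slides) that simulates this join in memory; the relations R and S are made-up toy data.

from collections import defaultdict

# Hypothetical example relations R(a, b) and S(b, c)
R = [(1, 'x'), (2, 'y'), (3, 'x')]
S = [('x', 10), ('y', 20), ('x', 30)]

def map_phase():
    """Emit one key-value pair per input tuple (replication rate r = 1)."""
    for a, b in R:
        yield b, ('R', a)
    for b, c in S:
        yield b, ('S', c)

def reduce_phase(pairs):
    """Group by key b, then pair every ('R', a) with every ('S', c)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for b, values in groups.items():
        r_side = [a for tag, a in values if tag == 'R']
        s_side = [c for tag, c in values if tag == 'S']
        for a in r_side:
            for c in s_side:
                yield (a, b, c)

print(list(reduce_phase(map_phase())))
# [(1, 'x', 10), (1, 'x', 30), (3, 'x', 10), (3, 'x', 30), (2, 'y', 20)]
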
Example: Single-Step Matrix-Matrix Multiplication

Assume both M and N have size n×n.

◻ Map(input):
  for each mij entry from matrix M:
    for k = 1 to n:
      generate <key = (i, k), value = ('M', j, mij)>
  for each njk entry from matrix N:
    for i = 1 to n:
      generate <key = (i, k), value = ('N', j, njk)>

◻ Reduce(key, value_list):
  sum ← 0
  for each pair ('M', j, mij) and ('N', j, njk) in value_list:
    sum += mij · njk
  output (key, sum)

Replication rate: r = n
Communication cost: 2n² + 2n³
Reducer size: q = 2n
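A minimal Python sketch of this single-step algorithm (not from the original slides), simulating the shuffle with a dictionary; the 3×3 matrices are made-up toy data.

from collections import defaultdict

n = 3
M = [[1, 2, 0], [0, 1, 3], [4, 0, 1]]
N = [[2, 0, 1], [1, 1, 0], [0, 2, 2]]

def map_phase():
    """Each mij is replicated to n reducers (i, k); likewise each njk (r = n)."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield (i, k), ('M', j, M[i][j])
    for j in range(n):
        for k in range(n):
            for i in range(n):
                yield (i, k), ('N', j, N[j][k])

def reduce_phase(pairs):
    """Reducer (i, k) holds row i of M and column k of N (q = 2n values)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for (i, k), values in groups.items():
        m_row = {j: v for tag, j, v in values if tag == 'M'}
        n_col = {j: v for tag, j, v in values if tag == 'N'}
        result[(i, k)] = sum(m_row[j] * n_col[j] for j in range(n))
    return result

P = reduce_phase(map_phase())
assert all(P[(i, k)] == sum(M[i][j] * N[j][k] for j in range(n))
           for i in range(n) for k in range(n))
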
A Graph Model for MapReduce Algorithms

[Figure: bipartite graph with input vertices on the left and output vertices on the right]

◻ Define a vertex for each input and output
◻ Define edges reflecting which inputs each output needs
◻ Every MapReduce algorithm has a schema that assigns outputs to reducers
◻ Assume that the max reducer size is q
◻ Assignment requirements:
1. No reducer can be assigned more than q inputs.
2. Each output is assigned to at least one reducer that receives all inputs needed for that output.
Example: Single-Step Matrix-Matrix Multiplication

We have assigned each output to a single reducer.
The replication rate is r = n.
The reducer size is q = 2n.
Application: Naïve Similarity Join
◻ Objective: Given a large set of elements X and a similarity measure s(x1, x2), output the pairs whose similarity is above a given threshold.
Locality-sensitive hashing is not used, for the sake of this example.

◻ Example:
Each element is an image of ~1 MB
There are 1M images in the set
About 5×10¹¹ (500 billion) image comparisons to make
Similarity Join with MapReduce (First Try)

◻ Let n be the number of pictures in the set.

◻ Map:
  for each picture Pi:
    for each j = 1 to n (except i):
      generate <key = (i, j), value = Pi>

◻ Reduce(key, value_list):
  compute sim(Pi, Pj)
  output (i, j) if the similarity is above the threshold

Replication rate: r = n-1
Reducer size: q = 2
Communication cost: n + n(n-1)
# of reducers: n(n-1)/2
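A minimal Python sketch of this first attempt (not from the original slides); the similarity function is a placeholder, and the key is normalized to an unordered pair so that the reducer count matches the n(n-1)/2 figure above.

from collections import defaultdict

def sim(p1, p2):
    """Placeholder similarity measure; a real one would compare image contents."""
    return 1.0 if p1 == p2 else 0.0

def similarity_join(pictures, threshold=0.5):
    """pictures: hypothetical dict {i: image_bytes}. Returns similar pairs."""
    # Map: each picture Pi is sent to the n-1 reducers that compare it (r = n-1).
    reducer_inputs = defaultdict(dict)
    for i, pi in pictures.items():
        for j in pictures:
            if j != i:
                key = (min(i, j), max(i, j))   # one reducer per unordered pair
                reducer_inputs[key][i] = pi
    # Reduce: each reducer holds exactly two pictures (q = 2) and compares them.
    out = []
    for (i, j), pair in reducer_inputs.items():
        if sim(pair[i], pair[j]) > threshold:
            out.append((i, j))
    return out

print(similarity_join({1: b'aa', 2: b'bb', 3: b'aa'}))  # -> [(1, 3)]
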
Example: 1M pictures with 1MByte size each

◻ Communication cost:
n(n-1) pictures communicated from Map to Reduce
total # of bytes transferred ≈ 10¹⁸

◻ Assume gigabit Ethernet:
time to transfer 10¹⁸ bytes ≈ 10¹⁰ seconds (~300 years)

◻ Replication rate r = n-1
◻ Reducer size q = 2
◻ Communication cost = n + n(n-1)
◻ # of reducers = n(n-1)/2
Graph Model

Our MapReduce algorithm:
One reducer per output.
Pi must be sent to the reducer of every output that involves it.
Replication rate r = n-1
Reducer size q = 2

What if a reducer covers multiple outputs?
Graph Model: Multiple Outputs per Reducer

Replication rate and communication cost are reduced.

How should we do the grouping?
Grouping Outputs

◻ Define g intervals between 1 and n.
◻ Reducer (u, v) will be responsible for comparing all inputs in interval u with all inputs in interval v.

Example: Reducer (2, 3) will compare all entries in interval 2 with all entries in interval 3.
[Figure: an n×n grid of element pairs, with the block at row interval 2 and column interval 3 highlighted]
Similarity Join with Grouping

◻ Let n be the number of inputs, and g be the number of groups.

◻ Map:
  for each Pi in the input:
    let u be the group to which i belongs
    for v = 1 to g:
      generate <key = (u, v), value = (i, Pi)>

◻ Reduce(key = (u, v), value_list):
  for each i that belongs to group u in value_list:
    for each j that belongs to group v in value_list:
      compute sim(Pi, Pj), and output (i, j) if it is above the threshold

Problem: Pi will be sent to reducer (gi, gj), while Pj will be sent to reducer (gj, gi); these are different keys, so no single reducer receives both pictures of the pair.
Similarity Join with Grouping

◻ Let n be the number of inputs, and g be the number of groups.

◻ Map:
  for each Pi in the input:
    let u be the group to which i belongs
    for v = 1 to g:
      generate <key = [min(u, v), max(u, v)], value = (i, Pi)>

A single key is generated for (u, v) and (v, u), so both pictures of a pair reach the same reducer.

◻ Reduce(key = (u, v), value_list):
  for each i that belongs to group u in value_list:
    for each j that belongs to group v in value_list:
      compute sim(Pi, Pj), and output (i, j) if it is above the threshold
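A minimal Python sketch of the grouped algorithm with the min/max key fix (not from the original slides); indices are 0-based here, and group_of is a hypothetical helper that assigns element i to one of g equal-width groups.

from collections import defaultdict

def group_of(i, n, g):
    """Map element index i (0-based) to one of g equal-width groups."""
    return i * g // n

def grouped_similarity_join(pictures, g, sim, threshold):
    """pictures: hypothetical list of images indexed 0..n-1."""
    n = len(pictures)
    # Map: Pi goes to the g reducers [min(u, v), max(u, v)] for v = 1..g (r = g).
    reducer_inputs = defaultdict(list)
    for i, pi in enumerate(pictures):
        u = group_of(i, n, g)
        for v in range(g):
            key = (min(u, v), max(u, v))
            reducer_inputs[key].append((i, pi))
    # Reduce: reducer (u, v) compares every element of group u with every element
    # of group v; within-group pairs are handled only by the diagonal reducer (u, u).
    out = []
    for (u, v), values in reducer_inputs.items():
        for a, (i, pi) in enumerate(values):
            for j, pj in values[a + 1:]:
                if group_of(i, n, g) != group_of(j, n, g) or u == v:
                    if sim(pi, pj) > threshold:
                        out.append((i, j))
    return out

similar = grouped_similarity_join([b'aa', b'ab', b'aa', b'bb'], g=2,
                                  sim=lambda x, y: 1.0 if x == y else 0.0,
                                  threshold=0.5)
print(similar)  # -> [(0, 2)]
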
Example

Example: If g = 4, the highlighted comparisons will be performed.
There will be a reducer for each key (u, v), where u ≤ v.
[Figure: the n×n grid partitioned into 4×4 blocks, with the upper-triangular blocks (u ≤ v) highlighted]
Example

Which reducers will receive and use Pi when i is in group 2?
Reducers: (1, 2), (2, 2), (2, 3), (2, 4)
[Figure: the n×n grid with the row and column blocks corresponding to group 2 highlighted]
Complexity Analysis

◻ Replication rate:
r = g (each picture is sent to one reducer for each of the g groups)
◻ Reducer size:
q = 2n/g (a reducer holds the pictures of two intervals, each of size n/g)
◻ Communication cost:
n + ng (map input plus reduce input)
◻ # of reducers:
g(g+1)/2
Example: 1M pictures with 1MByte size each

◻ Let g = 1000

◻ Reducer size q = 2n/g
memory needed at one node: ~2 GB (reasonable)
◻ Communication cost = n + ng
total # of bytes transferred ≈ 10¹⁵ (still a lot, but 1000× less than before)
◻ # of reducers = g(g+1)/2
there are ~500K reducers (enough parallelism for 1000s of nodes)
◻ What if g = 100? (See the sketch below.)
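A quick back-of-the-envelope check of these numbers, and of the g = 100 question, assuming ~1 MB per picture as stated above:

n = 1_000_000              # number of pictures
picture_bytes = 1_000_000  # ~1 MB per picture (assumption from the slides)

for g in (1000, 100):
    q = 2 * n // g                            # pictures per reducer
    reducer_mem = q * picture_bytes           # memory needed at one node
    comm_bytes = (n + n * g) * picture_bytes  # communication cost in bytes
    num_reducers = g * (g + 1) // 2
    print(f"g={g}: q={q} pictures (~{reducer_mem/1e9:.0f} GB/node), "
          f"comm ~{comm_bytes:.1e} bytes, {num_reducers} reducers")

# g=1000: q=2000 pictures (~2 GB/node), comm ~1.0e+15 bytes, 500500 reducers
# g=100:  q=20000 pictures (~20 GB/node), comm ~1.0e+14 bytes, 5050 reducers

So with g = 100 the communication drops by another factor of 10, but each reducer now needs roughly 20 GB of memory, which may exceed what a single node can hold.
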
Tradeoff Between Replication Rate and Reducer Size

Replication rate: r = g
Reducer size: q = 2n/g, so q = 2n/r, i.e. qr = 2n

◻ Replication rate and reducer size are inversely proportional.

◻ Reducing the replication rate will reduce communication, but will increase the reducer size.
Extreme case: r = 1 and q = 2n. There is a single reducer doing all the comparisons.
Extreme case: r = n and q = 2. There is a reducer for each pair of inputs.

◻ Need to choose q small enough that the data fits into local DRAM and there is enough parallelism.
Application: Matrix-Matrix Multiplication
with 1D Decomposition
Reminder: Matrix-Matrix Multiplication without Grouping

[Figure: M (entry mij) × N (entry njk) = P (entry pik)]

Each mij needs to be sent to reducer (i, k) for every k.
Reminder: Matrix-Matrix Multiplication without Grouping

[Figure: M (entry mij) × N (entry njk) = P (entry pik)]

Each njk needs to be sent to reducer (i, k) for every i.

Replication rate r = n
Multiple Outputs per Reducer

[Figure: M and N partitioned into g row/column stripes; P partitioned into g×g blocks]

Notation:
j: row/column index of an individual matrix entry
J: set of indices that belong to the Jth interval

Let reducer (I, K) be responsible for computing all pik where i ∈ I and k ∈ K.
Multiple Outputs per Reducer

[Figure: row stripe I of M highlighted]

Which reducers need mij?
Reducers (I, K) for all 1 ≤ K ≤ g, where i ∈ I.

Replication rate r = g
Multiple Outputs per Reducer

[Figure: column stripe K of N highlighted]

Which reducers need njk?
Reducers (I, K) for all 1 ≤ I ≤ g, where k ∈ K.

Replication rate r = g
1D Matrix Decomposition

[Figure: row stripe I of M and column stripe K of N combine to produce block (I, K) of P]

Which matrix elements will reducer (I, K) receive?
The Ith row stripe of M and the Kth column stripe of N.
MapReduce Formulation

◻ Map:
  for each element mij from matrix M:
    for K = 1 to g:
      generate <key = (I, K), value = ('M', i, j, mij)>
  for each element njk from matrix N:
    for I = 1 to g:
      generate <key = (I, K), value = ('N', j, k, njk)>

◻ Reduce(key = (I, K), value_list):
  for each i ∈ I and for each k ∈ K:
    pik = 0
    for j = 1 to n:
      pik += mij · njk
    output <key = (i, k), value = pik>

Replication rate: r = g
Communication cost: 2n² + 2gn²
Reducer size: q = 2n²/g
# of reducers: g²
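A minimal Python sketch of this 1D (stripe) formulation (not from the original slides); it assumes g divides n and uses small toy matrices.

from collections import defaultdict

def mm_1d(M, N, g):
    """Single-step matrix multiply with stripe reducers (1D decomposition).

    M and N are n x n nested lists; g is assumed to divide n. Reducer (I, K)
    receives the Ith row stripe of M and the Kth column stripe of N.
    """
    n = len(M)
    s = n // g
    stripe = lambda x: x // s

    # Map: each mij goes to (I, K) for all K; each njk goes to (I, K) for all I (r = g).
    inputs = defaultdict(list)
    for i in range(n):
        for j in range(n):
            for K in range(g):
                inputs[(stripe(i), K)].append(('M', i, j, M[i][j]))
    for j in range(n):
        for k in range(n):
            for I in range(g):
                inputs[(I, stripe(k))].append(('N', j, k, N[j][k]))

    # Reduce: reducer (I, K) computes every pik with i in stripe I and k in stripe K.
    P = [[0] * n for _ in range(n)]
    for (I, K), values in inputs.items():
        m_vals = {(i, j): v for tag, i, j, v in values if tag == 'M'}
        n_vals = {(j, k): v for tag, j, k, v in values if tag == 'N'}
        for i in range(I * s, (I + 1) * s):
            for k in range(K * s, (K + 1) * s):
                P[i][k] = sum(m_vals[(i, j)] * n_vals[(j, k)] for j in range(n))
    return P

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
assert mm_1d(M, N, g=2) == [[19, 22], [43, 50]]
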
Communication Cost vs. Reducer Size

Replication rate vs. reducer size:
q = 2n²/g ⟹ q = 2n²/r ⟹ qr = 2n²

Communication cost vs. reducer size:
cost = 2n² + 2gn² = 2n² + 4n⁴/q

There is an inverse relation between communication cost and reducer size.
Reminder: the q value chosen should be small enough that:
local memory is sufficient
there is enough parallelism

Replication rate: r = g
Communication cost: 2n² + 2gn²
Reducer size: q = 2n²/g
# of reducers: g²
Application: Matrix-Matrix Multiplication
with 2D Decomposition
Two Stage MapReduce Algorithm

◻ What are we trying to achieve?
A better tradeoff between replication rate r and reducer size q
The previous algorithm: qr = 2n²
We will show that we can achieve qr² = 2n²
For the same reducer size, the replication rate will be smaller

◻ Reminder: two-stage MapReduce without grouping:
Stage 1: "join" the matrix entries that need to be multiplied together
Stage 2: sum up the products to compute the final results

◻ Use a similar idea, but for sub-blocks of the matrices instead of individual elements
2D Matrix Decomposition

[Figure: M partitioned into g×g blocks MIJ, N into blocks NJK, P into blocks PIK]

Assume that M and N are each partitioned into g horizontal and g vertical stripes, i.e. into g×g blocks.
Computing the Product at Stripe (I, K)

[Figure: block PIK is computed as the sum over J of the block products MIJ × NJK]

Note: MIJ × NJK is a multiplication of two sub-matrices.
How to Define Reducers?

[Figure: stripe J of M and stripe J of N contribute to every block of P]

What if we define a reducer for each (I, K)?
It would be identical to the 1D decomposition.
What if we define a reducer for each J?
Exercise: derive the communication cost as a function of n and q.
How to Define Reducers?

[Figure: blocks MIJ and NJK combine into the Jth partial sum for block PIK]

What if we define a reducer for each (I, J, K)?
Smaller reducer size.
Reducer (I, J, K) will be responsible for computing the Jth partial sum for block PIK.
First MapReduce Step

[Figure: block MIJ (from M) times block NJK (from N) yields the Jth partial sum of block PIK]
MapReduce Step 1: Map

[Figure: row of blocks MIJ in M highlighted]

Block MIJ will be sent to the reducers (I, J, K) for all K.

Reminder: Reducer (I, J, K) is responsible for computing the Jth partial sum for block PIK.
MapReduce Step 1: Map

[Figure: column of blocks NJK in N highlighted]

Block NJK will be sent to the reducers (I, J, K) for all I.

Reminder: Reducer (I, J, K) is responsible for computing the Jth partial sum for block PIK.
MapReduce Step 1: Reduce

[Figure: blocks MIJ and NJK routed to reducer (I, J, K)]

Reducer (I, J, K) will receive the MIJ and NJK blocks and will compute the Jth partial sum for block PIK.
MapReduce Step 1: Reducer Output

[Figure: block PIK of P highlighted]

For each pik ∈ PIK, there are g reducers that compute a partial sum (one for each key (I, J, K)).

The reduce output corresponding to pik is <key = (i, k), value = xJik>, where xJik denotes the Jth partial sum of pik.
MapReduce Step 2

◻ Map:
  for each input <key = (i, k), value = xJik>:
    generate <key = (i, k), value = xJik>   (identity map)

◻ Reduce(key = (i, k), value_list):
  pik = 0
  for each xJik in value_list:
    pik += xJik
  output <key = (i, k), value = pik>
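A minimal Python sketch of the two-stage algorithm with (I, J, K) block reducers (not from the original slides); both MapReduce steps are simulated in memory, g is assumed to divide n, and the matrices are toy data.

from collections import defaultdict

def mm_2d(M, N, g):
    """Two-stage multiply with block reducers (I, J, K) (2D decomposition).

    Step-1 reducer (I, J, K) multiplies block MIJ by block NJK; step 2 sums
    the g partial results for every output entry pik.
    """
    n = len(M)
    s = n // g
    blk = lambda x: x // s

    # Step 1 map: MIJ -> reducers (I, J, K) for all K; NJK -> (I, J, K) for all I.
    step1 = defaultdict(list)
    for i in range(n):
        for j in range(n):
            for K in range(g):
                step1[(blk(i), blk(j), K)].append(('M', i, j, M[i][j]))
    for j in range(n):
        for k in range(n):
            for I in range(g):
                step1[(I, blk(j), blk(k))].append(('N', j, k, N[j][k]))

    # Step 1 reduce: emit the Jth partial sum for each pik in block PIK.
    step2 = defaultdict(list)
    for (I, J, K), values in step1.items():
        m_vals = {(i, j): v for tag, i, j, v in values if tag == 'M'}
        n_vals = {(j, k): v for tag, j, k, v in values if tag == 'N'}
        for i in range(I * s, (I + 1) * s):
            for k in range(K * s, (K + 1) * s):
                partial = sum(m_vals[(i, j)] * n_vals[(j, k)]
                              for j in range(J * s, (J + 1) * s))
                step2[(i, k)].append(partial)   # step 2 map is the identity

    # Step 2 reduce: accumulate the g partial sums for each (i, k).
    P = [[0] * n for _ in range(n)]
    for (i, k), partials in step2.items():
        P[i][k] = sum(partials)
    return P

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
assert mm_2d(M, N, g=2) == [[19, 22], [43, 50]]
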
Complexity Analysis: Step 1

Replication rate: r1 = g
Communication cost: 2n² + 2gn²
Reducer size: q1 = 2n²/g²
# of reducers: g³
Complexity Analysis: MapReduce Step 2

◻ Map:
  for each input <key = (i, k), value = xJik>:
    generate <key = (i, k), value = xJik>

◻ Reduce(key = (i, k), value_list):
  pik = 0
  for each xJik in value_list:
    pik += xJik
  output <key = (i, k), value = pik>

Replication rate: r2 = 1
Communication cost: gn²
Reducer size: q2 = g
# of reducers: n²
Complexity Analysis

◻ Step 1
Replication rate: r1 = g
Communication cost: 2n² + 2gn²
Reducer size: q1 = 2n²/g²
# of reducers: g³

◻ Step 2
Replication rate: r2 = 1
Communication cost: gn²
Reducer size: q2 = g
# of reducers: n²
Tradeoff Between Communication Cost and Reducer Size

◻ To decrease the communication cost:
Choose g small enough
◻ To decrease the reducer size:
Choose g large enough to reduce q1
The size of q2 is less of a concern. Why?
The reduce operation in step 2 simply accumulates the values;
each value is used only once, so
the value_list does not have to fit into local memory.

◻ Conclusion: Use the communication cost formula as a function of q1 to determine the right tradeoff.
Matrix-Matrix Multiplication
1D Decomposition vs. 2D Decomposition
Comparison: Parallelism

[Table comparing the number of reducers of the 1D and 2D decompositions]

For the same number of groups, the 2D decomposition has better parallelism.
Comparison: Reducer Size

[Table comparing the reducer sizes of the 1D and 2D decompositions]
Comparison: Communication Costs

[Table comparing the communication costs of the 1D and 2D decompositions]

Note: We have control over how we choose the g values for the 1D and 2D decompositions. However, the max q value is limited by the available local memory size. So, it makes more sense to use the same q value for the 1D and 2D decompositions.
Comparison: Communication Costs (when reducer sizes are equal)

[Table comparing the communication costs of the two decompositions for equal reducer sizes; see the sketch below]
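The comparison tables on these slides did not survive extraction. As a rough substitute, here is a small Python calculation, derived from the cost formulas stated earlier in the deck, comparing the two decompositions when the reducer size q is held fixed; the example values of n and q are made up.

import math

def costs_for_equal_q(n, q):
    """Compare 1D and 2D communication costs for the same reducer size q.

    Uses the formulas from earlier slides: 1D has q = 2n^2/g and cost
    2n^2 + 2gn^2; 2D (step 1) has q = 2n^2/g^2 and cost 2n^2 + 2gn^2,
    plus gn^2 for step 2.
    """
    g1 = 2 * n * n / q              # groups needed by the 1D decomposition
    g2 = math.sqrt(2 * n * n / q)   # groups needed by the 2D decomposition
    cost_1d = 2 * n * n + 2 * g1 * n * n
    cost_2d = 2 * n * n + 2 * g2 * n * n + g2 * n * n
    return cost_1d, cost_2d

# Example: n = 10,000 and q = 2,000,000 matrix entries per reducer.
c1, c2 = costs_for_equal_q(n=10_000, q=2_000_000)
print(f"1D: ~{c1:.1e} values moved, 2D: ~{c2:.1e} values moved")
# Here 1D needs g = 100 stripes (~2.0e+10 values), while 2D needs g = 10
# stripes (~3.2e+09 values), so for equal reducer size the 2D decomposition
# communicates far less.
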
Conclusions
References

◻ Jure Leskovec, Anand Rajaraman, Jeff Ullman, Mining of Massive Datasets, Cambridge University Press, Second Edition, 2014.
◻ http://mmds.org/
◻ http://www.cs.bilkent.edu.tr/~mustafa.ozdal/cs425/
