
Introduction to Parallel Computing (CMSC416 / CMSC616)

Parallel Algorithms
Abhinav Bhatele, Alan Sussman
Matrix multiplication

for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    for (k=0; k<L; k++)
      C[i][j] += A[i][k]*B[k][j];

https://en.wikipedia.org/wiki/Matrix_multiplication

Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616) 2


Any performance issues for large arrays?


Blocking to improve cache performance
• Create smaller blocks that fit in cache: leads to cache reuse

• C12 = A10 * B02 + A11 * B12 + A12 * B22 + A13 * B32

A (i × k blocks)     B (k × j blocks)     C (i × j blocks)

A00 A01 A02 A03    B00 B01 B02 B03    C00 C01 C02 C03
A10 A11 A12 A13    B10 B11 B12 B13    C10 C11 C12 C13
A20 A21 A22 A23    B20 B21 B22 B23    C20 C21 C22 C23
A30 A31 A32 A33    B30 B31 B32 B33    C30 C31 C32 C33

https://link.springer.com/chapter/10.1007/978-3-319-67630-2_36

Blocked (tiled) matrix multiply
for (ii = 0; ii < n; ii+=B) {
  for (jj = 0; jj < n; jj+=B) {
    for (kk = 0; kk < n; kk+=B) {
      for (i = ii; i < ii+B; i++) {
        for (j = jj; j < jj+B; j++) {
          for (k = kk; k < kk+B; k++) {
            C[i][j] += A[i][k]*B[k][j];
          }
        }
      }
    }
  }
}

Original code:

for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    for (k=0; k<L; k++)
      C[i][j] += A[i][k]*B[k][j];
Parallel matrix multiply

• Store A and B in a distributed manner

• Communication between processes to get the right sub-matrices to each process

• Each process computes a portion of C



Cannon’s 2D matrix multiply

• Arrange processes in a 2D virtual grid

• Assign sub-blocks of A and B to each process

• Each process responsible for computing a sub-block of C

• Requires other processes in its row and column to send A and B blocks so it can
compute the final values of its sub-block

Cannon’s 2D matrix multiply
• C12 = A10 * B02 + A11 * B12 + A12 * B22 + A13 * B32

 0  1  2  3      A00 A01 A02 A03     B00 B01 B02 B03
 4  5  6  7      A10 A11 A12 A13     B10 B11 B12 B13
 8  9 10 11      A20 A21 A22 A23     B20 B21 B22 B23
12 13 14 15      A30 A31 A32 A33     B30 B31 B32 B33

2D process grid



Cannon’s 2D matrix multiply
• C12 = A10 * B02 + A11 * B12 + A12 * B22 + A13 * B32

• Initial skew: A: displace blocks in row i by i; B: displace blocks in column j by j

[Figure: the same 4×4 process grid and block layouts, annotated with the initial skew in rows (A) and the initial skew in columns (B)]



Cannon’s 2D matrix multiply
• C12 = A10 * B02 + A11 * B12 + A12 * B22 + A13 * B32

[Figure: the skew applied step by step. Row i of A is rotated left by i positions and column j of B is rotated up by j positions; after the skew, the process at grid position (i, j) holds blocks A(i, (i+j) mod 4) and B((i+j) mod 4, j).]

Cannon’s 2D matrix multiply
• C12 = A10 * B02 + A11 * B12 + A12 * B22 + A13 * B32

• After the initial skew, each process multiplies its local A and B blocks; A blocks then shift by 1 within rows and B blocks shift by 1 within columns, and the multiply repeats, for 4 multiply-and-shift steps in all on a 4×4 grid

[Figure: after the skew, row 1 of the process grid holds A11 A12 A13 A10 and column 2 holds B22 B32 B02 B12; each shift-by-1 step delivers the next matching A and B blocks to process 6, which accumulates the four terms of C12]


Announcements

• Assignment 2 is due on October 10

• Assignment 3 will be posted on October 10


• Due on October 18, 11:59 pm ET



Agarwal’s 3D matrix multiply

• Arrange processes in a 3D virtual grid

• Assign sub-blocks of A and B to each process


• In this algorithm, there are multiple copies of A and B (one in each plane)

• Each process computes a partial sub-block of C

• Data movement is done only once before computation and once after computation



Agarwal’s 3D matrix multiply
• Copy A to all i-k planes and B to all j-k planes

[Figure: a 3×3×3 virtual process grid with axes i, j, k; each i-k plane receives a full copy of the 3×3 blocks of A, and each j-k plane a full copy of the blocks of B]


Agarwal’s 3D matrix multiply
• Perform a single matrix multiply to calculate partial C

• Allreduce along i-j planes to calculate the final result

[Figure: along one k-line of the grid, processes compute the partial products A10 × B01, A11 × B11, and A12 × B21; the Allreduce sums such partials into the final C blocks]

Communication algorithms

• Reduction

• All-to-all



Types of reduction

• Scalar reduction: every process contributes one number


• Perform some commutative and associative operation

• Vector reduction: every process contributes an array of numbers



Parallelizing reduction

• Naive algorithm: every process sends to the root

• Spanning tree: organize processes in a k-ary tree

• Start at leaves and send to parents

• Intermediate nodes wait to receive data from all their children

• Number of phases: log_k p

MPI Reduction Algorithms: https://hcl.ucd.ie/system/files/TJS-Hasanov-2016.pdf


All-to-all collective call
• Each process sends a distinct message to every other process

• Naive algorithm: every process sends its data pairwise to all other processes

https://www.codeproject.com/Articles/896437/A-Gentle-Introduction-to-the-Message-Passing-Inter



Virtual topology: 2D mesh

• Alternative algorithm: send messages along the rows and columns of a 2D mesh

• Phase 1: every process sends to its row neighbors

• Barrier: wait for phase 1 to complete

• Phase 2: every process sends to column neighbors



Virtual topology: hypercube

• Hypercube is an n-dimensional analog of a square (n=2) and cube (n=3)

• Special case of a k-ary d-dimensional mesh (with k = 2)

https://en.wikipedia.org/wiki/Hypercube

