
Lecture 6: Parallel Matrix Algorithms (part 3)

A Simple Parallel Matrix-Matrix Multiplication

Let A = [a_ij] and B = [b_ij] be n × n matrices. Compute C = AB.
• Computational complexity of the sequential algorithm: O(n³).
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid (a minimal MPI sketch follows below). Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
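
A minimal sketch of the grid setup, assuming MPI (which the later slides use) and that p is a perfect square; the names setup_grid, grid_comm and coords are illustrative, not part of the lecture:

#include <mpi.h>
#include <math.h>

/* Sketch: build a √p × √p Cartesian process grid.  Assumes p is a perfect square. */
void setup_grid(MPI_Comm *grid_comm, int coords[2])
{
    int p, my_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int q = (int)(sqrt((double)p) + 0.5);   /* q = √p processes per row and per column */
    int dims[2]    = {q, q};
    int periods[2] = {1, 1};                /* wraparound links, needed for the circular shifts later */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, grid_comm);
    MPI_Comm_rank(*grid_comm, &my_rank);
    MPI_Cart_coords(*grid_comm, my_rank, 2, coords);  /* coords = (i, j) of process P_{i,j} */
}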

• Algorithm:
  – Perform an all-to-all broadcast of the blocks of A in each row of processes.
  – Perform an all-to-all broadcast of the blocks of B in each column of processes.
  – Each process P_{i,j} then computes C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j} (one possible MPI realization of the broadcasts is sketched below).
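
One possible realization of the two broadcasts, assuming the grid communicator from the sketch above and a block dimension nb = n/√p; MPI_Cart_sub splits the grid into row and column communicators, and MPI_Allgather performs the all-to-all broadcast inside each. The buffer names are illustrative:

#include <mpi.h>

/* Sketch: all-to-all broadcast of the local A block along the process row and of the
   local B block along the process column.  allA and allB must each hold √p blocks. */
void broadcast_blocks(MPI_Comm grid_comm, int nb,
                      const double *Ablock, const double *Bblock,
                      double *allA, double *allB)
{
    MPI_Comm row_comm, col_comm;
    int keep_row[2] = {0, 1};   /* drop dimension 0: the processes of one row form a communicator */
    int keep_col[2] = {1, 0};   /* drop dimension 1: the processes of one column form a communicator */

    MPI_Cart_sub(grid_comm, keep_row, &row_comm);
    MPI_Cart_sub(grid_comm, keep_col, &col_comm);

    /* After these calls every process of a row holds all A blocks of that row and
       every process of a column holds all B blocks of that column. */
    MPI_Allgather(Ablock, nb * nb, MPI_DOUBLE, allA, nb * nb, MPI_DOUBLE, row_comm);
    MPI_Allgather(Bblock, nb * nb, MPI_DOUBLE, allB, nb * nb, MPI_DOUBLE, col_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}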

Performance Analysis
• √p rows of all-to-all broadcasts, each among a group of √p processes. The message size is n²/p, so the communication time is t_s log √p + t_w (√p − 1) n²/p.
• √p columns of all-to-all broadcasts, communication time: t_s log √p + t_w (√p − 1) n²/p.
• Computation time: √p × (n/√p)³ = n³/p.
• Parallel time: T_p = n³/p + 2(t_s log √p + t_w (√p − 1) n²/p).

Memory Efficiency of the Simple Parallel Algorithm

• Not memory efficient:
  – Each process P_{i,j} holds 2√p blocks (the A_{i,k} and the B_{k,j}).
  – Each process therefore needs Θ(n²/√p) memory.
  – Total memory over all the processes is Θ(n² × √p), i.e., √p times the memory of the sequential algorithm.

Cannon’s Algorithm for Matrix-Matrix Multiplication

Goal: to improve the memory efficiency.

Let A = [a_ij] and B = [b_ij] be n × n matrices. Compute C = AB.
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• The contention-free formula:
  C_{i,j} = Σ_{k=0}^{√p−1} A_{i, (i+j+k) mod √p} B_{(i+j+k) mod √p, j}
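
For example, with √p = 3 the formula gives, for process P_{0,1} (i = 0, j = 1):
  C_{0,1} = A_{0,1} B_{1,1} + A_{0,2} B_{2,1} + A_{0,0} B_{0,1}    (k = 0, 1, 2).
At k = 0 the processes of row 0 use A_{0,0}, A_{0,1} and A_{0,2} (one block each), and the processes of column 1 use B_{1,1}, B_{2,1} and B_{0,1}, so no two processes need the same block at the same step; this is what makes the schedule contention-free.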

Cannon’s Algorithm

// make the initial alignment
for i, j := 0 to √p − 1 do
    Send block A_{i,j} to process (i, (j − i + √p) mod √p) and block B_{i,j} to process ((i − j + √p) mod √p, j);
endfor;
Process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};

// compute-and-shift: a sequence of single-step shifts pairs up A_{i,k} and B_{k,j}
// on process P_{i,j}, which accumulates C_{i,j} = C_{i,j} + A_{i,k} B_{k,j}
for step := 1 to √p − 1 do
    Shift A_{i,j} one step left (with wraparound) and B_{i,j} one step up (with wraparound);
    Process P_{i,j} multiplies the received submatrices together and adds the result to C_{i,j};
endfor;

Remark: In the initial alignment, the send operation shifts A_{i,j} to the left (with wraparound) by i steps and shifts B_{i,j} up (with wraparound) by j steps. An MPI sketch of the compute-and-shift phase follows.
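
The pseudocode maps directly onto MPI. Below is a compact sketch of the compute-and-shift phase, assuming the periodic √p × √p Cartesian communicator from the earlier sketch, row-major blocks of dimension nb = n/√p, and that the initial alignment has already been done; matmul_add and all variable names are illustrative:

#include <mpi.h>

/* Illustrative helper: C += A * B for nb × nb row-major blocks. */
static void matmul_add(int nb, const double *A, const double *B, double *C)
{
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += A[i * nb + k] * B[k * nb + j];
}

/* Sketch of Cannon's compute-and-shift phase (q = √p).  Assumes Ablock and Bblock
   have already been through the initial alignment. */
void cannon_compute_shift(MPI_Comm grid_comm, int q, int nb,
                          double *Ablock, double *Bblock, double *Cblock)
{
    int left, right, up, down;
    MPI_Status status;

    /* Neighbours along the row (dimension 1) and the column (dimension 0);
       with displacement -1 the source is the right/lower neighbour and the
       destination is the left/upper neighbour. */
    MPI_Cart_shift(grid_comm, 1, -1, &right, &left);
    MPI_Cart_shift(grid_comm, 0, -1, &down, &up);

    matmul_add(nb, Ablock, Bblock, Cblock);   /* multiply the aligned blocks */

    for (int step = 1; step < q; step++) {
        /* Shift A one step left and B one step up (with wraparound), then multiply. */
        MPI_Sendrecv_replace(Ablock, nb * nb, MPI_DOUBLE, left, 1, right, 1,
                             grid_comm, &status);
        MPI_Sendrecv_replace(Bblock, nb * nb, MPI_DOUBLE, up, 2, down, 2,
                             grid_comm, &status);
        matmul_add(nb, Ablock, Bblock, Cblock);
    }
}

The initial alignment can reuse the same pattern: calling MPI_Cart_shift with a displacement of −i along the row (for A) and −j along the column (for B) yields the source and destination ranks for the i-step and j-step circular shifts, again combined with MPI_Sendrecv_replace.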
Cannon’s Algorithm for 3 × 3 Matrices

[Figure: A and B initially, after the initial alignment, after shift step 1, and after shift step 2.]
Performance Analysis
• In the initial alignment step, the maximum distance over which a block shifts is √p − 1.
  – The circular shift operations in the row and column directions take time t_comm = 2(t_s + t_w n²/p).
• Each of the √p single-step shifts in the compute-and-shift phase takes time t_s + t_w n²/p.
• Multiplying √p submatrices of size (n/√p) × (n/√p) takes time n³/p.
• Parallel time: T_p = n³/p + 2√p(t_s + t_w n²/p) + 2(t_s + t_w n²/p).
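
A rough, purely illustrative comparison (all numbers below are assumptions, not from the lecture): take n = 1024 and p = 16, so √p = 4, n²/p = 65,536 and n³/p = 67,108,864, and let t_s = 100 and t_w = 1 in units of one multiply-add, with log taken base 2.
  – Simple algorithm communication: 2(t_s log √p + t_w(√p − 1) n²/p) = 2(200 + 196,608) = 393,616.
  – Cannon’s algorithm communication: 2√p(t_s + t_w n²/p) + 2(t_s + t_w n²/p) = 10 × 65,636 = 656,360.
Both terms are around one percent of the n³/p computation term, so the two algorithms run in essentially the same time in this setting; the gain from Cannon’s algorithm is the factor-√p reduction in total memory noted earlier.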

int MPI_Sendrecv_replace( void *buf, int count,
    MPI_Datatype datatype, int dest, int sendtag, int source,
    int recvtag, MPI_Comm comm, MPI_Status *status );
• Executes a blocking send and a blocking receive. The same buffer is used for both the send and the receive, so the message sent is replaced by the message received.
• buf [in/out]: initial address of the send and receive buffer.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])


{
int myid, numprocs, left, right;
int buffer[10];
MPI_Request request;
MPI_Status status;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);

right = (myid + 1) % numprocs;


left = myid - 1;
if (left < 0)
left = numprocs - 1;

MPI_Sendrecv_replace(buffer, 10, MPI_INT, left, 123, right, 123, MPI_COMM_WORLD,


&status);

MPI_Finalize();
return 0; 11
}
