Cannon's Algorithm
Algorithms (part 3)
A Simple Parallel Matrix-Matrix Multiplication
Let A = [a_ij]_{n×n} and B = [b_ij]_{n×n} be n × n matrices. Compute C = AB.
• Computational complexity of the sequential algorithm: O(n³)
• Partition A and B into p square blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p), each of size (n/√p) × (n/√p).
• Use a Cartesian topology to set up the process grid. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix.
• Remark: Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p.
• Algorithm:
    – Perform all-to-all broadcast of blocks of A in each row of processes
    – Perform all-to-all broadcast of blocks of B in each column of processes
    – Each process P_{i,j} performs C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j}
Performance Analysis
• √p rows of all-to-all broadcasts, each among a group of √p processes. The message size is n²/p, so the communication time is t_s log √p + t_w (n²/p)(√p − 1).
• √p columns of all-to-all broadcasts, with the same communication time: t_s log √p + t_w (n²/p)(√p − 1).
• Computation time: √p × (n/√p)³ = n³/p
• Parallel time: T_p = n³/p + 2(t_s log √p + t_w (n²/p)(√p − 1))
Memory Efficiency of the Simple Parallel Algorithm
• After the all-to-all broadcasts, every process stores a full row of blocks of A and a full column of blocks of B: √p blocks of each. Total memory across all processes is therefore Θ(n²√p), a factor of √p more than the input size, so the simple algorithm is not memory-efficient.
Cannon's Algorithm for Matrix-Matrix Multiplication
Cannon's Algorithm
// make initial alignment
for i, j := 0 to √p − 1 do
    Send block A_{i,j} to process (i, (j − i + √p) mod √p) and
    block B_{i,j} to process ((i − j + √p) mod √p, j);
endfor;
// √p compute-and-shift steps
for step := 0 to √p − 1 do
    Each process P_{i,j} multiplies the submatrices it currently holds and adds the result to C_{i,j};
    Shift each block of A one step left and each block of B one step up, both with wraparound;
endfor;
Remark: The initial alignment shifts A_{i,j} to the left (with wraparound) by i steps and shifts B_{i,j} up (with wraparound) by j steps.
Cannon’s Algorithm for 3 × 3 Matrices
Performance Analysis
• In the initial alignment step, the maximum distance over which a block shifts is √p − 1.
    – The circular shift operations in the row and column directions take time: t_comm = 2(t_s + t_w n²/p)
• Each of the √p single-step shifts in the compute-and-shift phase takes time: t_s + t_w n²/p.
• Multiplying √p submatrices of size (n/√p) × (n/√p) takes time: n³/p.
• Parallel time: T_p = n³/p + 2√p (t_s + t_w n²/p) + 2(t_s + t_w n²/p)
int MPI_Sendrecv_replace( void *buf, int count,
MPI_Datatype datatype, int dest, int sendtag, int source,
int recvtag, MPI_Comm comm, MPI_Status *status );
• Executes a blocking send and a blocking receive. The same buffer is used both for the send and for the receive, so the message sent is replaced by the message received.
• buf [in/out]: initial address of the send and receive buffer
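As a sketch of how MPI_Sendrecv_replace fits Cannon's algorithm, the fragment below performs one single-step left shift of a stand-in A block on a periodic Cartesian grid. The grid setup and block contents are assumptions for illustration; run it with a square process count (e.g. mpirun -np 4), which this sketch does not itself enforce.

```c
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int dims[2] = {0, 0}, periods[2] = {1, 1};  /* periodic: wraparound shifts */
    int rank, nprocs, src, dst;
    double ablock[4];                 /* stand-in for the local A block */
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    ablock[0] = rank;                 /* tag the block with its owner */

    /* Left shift along the row dimension (dimension 1): send to column j-1,
     * receive from column j+1; disp = -1 makes dst the left neighbor. */
    MPI_Cart_shift(grid, 1, -1, &src, &dst);
    MPI_Sendrecv_replace(ablock, 4, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    printf("rank %d now holds the block from rank %d\n", rank, (int)ablock[0]);

    MPI_Finalize();
    return 0;
}
```

Because the buffer is replaced in place, each process needs no second buffer for the incoming block, which is exactly what Cannon's constant-memory shifts require.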
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numprocs, myid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    /* ... program body ... */
    MPI_Finalize();
    return 0;
}