Unit 4 HPC Part8
Prof. B. J. Dange
Assistant Professor
E-mail : [email protected]
Contact No: 91301 91301 Ext: 145, 9604146122
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n × n dense, square matrices A and B to yield
the product matrix C = A × B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (such as Strassen's method), although these
can be used as serial kernels in the parallel algorithms.
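To make the baseline concrete, here is a minimal Python sketch of the serial algorithm (the function name matmul is ours, not from the slides): three nested loops, one multiply-add per (i, j, k) triple, hence the O(n³) cost.

```python
# Serial matrix multiplication: n^3 multiply-add operations in total.
def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```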
• A useful concept in this case is called block operations. In this view, an n × n matrix A
can be regarded as a q × q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) × (n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each involving (n/q) × (n/q) matrices.
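A minimal sketch of the block view, assuming q divides n (the helper names block and block_matmul are illustrative): it performs exactly q³ multiplications of (n/q) × (n/q) submatrices and produces the same result as the element-wise algorithm.

```python
# Blocked matrix multiplication: view an n x n matrix as a q x q array
# of (n/q) x (n/q) blocks and perform q^3 block multiplications.
def block(M, i, j, b):
    # Extract the (i, j) block of size b x b from matrix M.
    return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

def block_matmul(A, B, q):
    n = len(A)
    b = n // q                       # block size n/q; assumes q divides n
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):       # q^3 block multiplications in total
                Aik, Bkj = block(A, i, k, b), block(B, k, j, b)
                for r in range(b):   # C_{i,j} += A_{i,k} * B_{k,j}
                    for c in range(b):
                        C[i * b + r][j * b + c] += sum(
                            Aik[r][t] * Bkj[t][c] for t in range(b))
    return C
```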
• Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p)
of size (n/√p) × (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw
and concurrency.
• The major drawback of the algorithm is that it is not memory optimal: each process ends up
storing √p blocks of A and √p blocks of B, as the sketch below makes visible.
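The following sequential sketch (simple_parallel is an illustrative name; the gathers stand in for the all-to-all broadcasts, with no real message passing) mimics the algorithm on a √p × √p logical process grid and shows why it is not memory optimal: each Pi,j ends up holding an entire block-row of A and block-column of B.

```python
# Simulate the simple parallel algorithm on an s x s grid, s = sqrt(p).
# A_blocks[i][k] and B_blocks[k][j] are the (n/s) x (n/s) blocks.
def simple_parallel(A_blocks, B_blocks, s):
    C_blocks = {}
    for i in range(s):
        for j in range(s):
            # P_{i,j} needs every A_{i,k} (its block-row) and every
            # B_{k,j} (its block-column): sqrt(p) blocks of each matrix.
            row = [A_blocks[i][k] for k in range(s)]
            col = [B_blocks[k][j] for k in range(s)]
            b = len(row[0])
            Cij = [[0] * b for _ in range(b)]
            for k in range(s):       # C_{i,j} = sum_k A_{i,k} * B_{k,j}
                for r in range(b):
                    for c in range(b):
                        Cij[r][c] += sum(row[k][r][t] * col[k][t][c]
                                         for t in range(b))
            C_blocks[(i, j)] = Cij
    return C_blocks
```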
Matrix-Matrix Multiplication: Cannon's Algorithm
• In Cannon's algorithm, we schedule the computations of the processes of the ith row
such that, at any given time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the processes after every submatrix
multiplication so that every process gets a fresh Ai,k after each rotation.
• Align the blocks of A and B in such a way that each process multiplies its local
submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i
steps and all submatrices Bi,j up (with wraparound) by j steps.
• Each block of A moves one step left and each block of B moves one step up (again with
wraparound).
• Perform the next block multiplication, add it to the partial result, and repeat until all √p
blocks have been multiplied; a sequential sketch of these steps follows.
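Here is a sequential sketch of Cannon's algorithm (the function name cannon is ours; the alignment and rotations are simulated with index arithmetic rather than real interprocess shifts) on a q × q grid of blocks, q = √p. Unlike the previous algorithm, each simulated process holds only one block of A and one block of B at a time, which is what makes Cannon's algorithm memory optimal.

```python
# Cannon's algorithm on a q x q grid of (n/q) x (n/q) blocks, q = sqrt(p).
def cannon(A_blocks, B_blocks, q):
    b = len(A_blocks[0][0])
    # Alignment: shift row i of A left by i, column j of B up by j.
    A = [[A_blocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B_blocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Each process multiplies its current local pair of blocks.
        for i in range(q):
            for j in range(q):
                for r in range(b):
                    for c in range(b):
                        C[i][j][r][c] += sum(A[i][j][r][t] * B[i][j][t][c]
                                             for t in range(b))
        # Rotate: every A block one step left, every B block one step up.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C  # C[i][j] is block C_{i,j} of the product
```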
Matrix-Matrix Multiplication: Cannon's Algorithm
• In the alignment step, since the maximum distance over which a block shifts is √p − 1,
the two shift operations require a total of 2(ts + tw·n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes
ts + tw·n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) × (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p·ts + 2tw·n²/√p
• The cost-optimality and isoefficiency of the algorithm are identical to those of the first
algorithm, except that this algorithm is memory optimal; the parallel-time expression is
evaluated in the sketch below.
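As a rough illustration, the parallel-time expression can be evaluated numerically. The startup cost ts and per-word cost tw below are assumed values (in units of one multiply-add), not figures from the slides.

```python
# Evaluate T_P = n^3/p + 2*sqrt(p)*ts + 2*tw*n^2/sqrt(p) for Cannon's
# algorithm under assumed communication costs ts and tw.
from math import sqrt

def parallel_time(n, p, ts=100.0, tw=1.0):
    return n**3 / p + 2 * sqrt(p) * ts + 2 * tw * n**2 / sqrt(p)

n = 1024
for p in (16, 64, 256):
    Tp = parallel_time(n, p)
    speedup = n**3 / Tp              # serial time is n^3 multiply-adds
    print(f"p={p:4d}  T_P={Tp:12.0f}  speedup={speedup:6.1f}  "
          f"efficiency={speedup / p:.2f}")
```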
Matrix-Matrix Multiplication: DNS Algorithm
• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two
orthogonal faces and the result C comes out on another orthogonal face.
• Each internal node in the cube represents a single add-multiply operation (and thus the
O(n³) complexity).
• Since each add-multiply takes constant time and accumulation and broadcast take log n
time, the total runtime is log n with n³ processes.
• This is not cost optimal. It can be made cost optimal by using n / log n processors along the
direction of accumulation. A sketch of the cube view follows.
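A minimal sketch of the cube view (dns is an illustrative name; the accumulation that a hypercube performs in log n steps is written as a plain sum here): node (i, j, k) does the single multiplication A[i][k] · B[k][j], and the results are summed along the k direction.

```python
# DNS cube view: one scalar multiply per internal node (i, j, k),
# then accumulation along k to form C[i][j].
def dns(A, B):
    n = len(A)
    node = {(i, j, k): A[i][k] * B[k][j]
            for i in range(n) for j in range(n) for k in range(n)}
    # On n^3 processes the sum below is a log n-step reduction along k.
    return [[sum(node[(i, j, k)] for k in range(n)) for j in range(n)]
            for i in range(n)]

print(dns([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```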
Figure: the communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes.
DNS Algorithm Using Fewer Than n³ Processes
• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) × (n/q).
• The algorithm follows from the previous one, except that in this case we operate on blocks
rather than on individual elements, as in the sketch below.
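A sketch of the blocked variant under the same assumptions (dns_blocked is an illustrative name): cube node (i, j, k) now multiplies the (n/q) × (n/q) blocks Ai,k and Bk,j instead of single elements, and the block products are accumulated along k.

```python
# Blocked DNS: p = q^3 "processes", each cube node multiplies one pair
# of (n/q) x (n/q) blocks; block products are summed along k.
def dns_blocked(A_blocks, B_blocks, q):
    b = len(A_blocks[0][0])
    def mul(X, Y):   # dense block product computed at one cube node
        return [[sum(X[r][t] * Y[t][c] for t in range(b))
                 for c in range(b)] for r in range(b)]
    def add(X, Y):   # block addition used in the accumulation step
        return [[X[r][c] + Y[r][c] for c in range(b)] for r in range(b)]
    C = {}
    for i in range(q):
        for j in range(q):
            acc = [[0] * b for _ in range(b)]
            for k in range(q):       # accumulation along the k direction
                acc = add(acc, mul(A_blocks[i][k], B_blocks[k][j]))
            C[(i, j)] = acc
    return C  # C[(i, j)] is block C_{i,j} of the product
```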
Reference
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to
Parallel Computing", 2nd Edition, Addison-Wesley, 2003, ISBN 0-201-64865-2.