
Sanjivani Rural Education Society’s

Sanjivani College of Engineering, Kopargaon-423 603


(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC ‘A’ Grade Accredited, ISO 9001:2015 Certified

Department of Computer Engineering


(NBA Accredited)

Course - High Performance Computing (410241)


Unit 4- Analytical Models of Parallel Programs

Prof. B. J. Dange
Assistant Professor
E-mail : [email protected]
Contact No: 91301 91301 Ext :145, 9604146122
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense, square matrices A and B to yield
the product matrix C =A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (such as Strassen's method), although these can be
used as serial kernels in the parallel algorithms.
• A useful concept in this case is that of block operations. In this view, an n x n matrix A
can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) x (n/q) submatrix.
• In this view, we perform q³ block matrix multiplications, each involving (n/q) x (n/q) matrices.
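The blocked view can be sketched serially as follows (a minimal illustration; the matrix size n, the block count q, and the random inputs are hypothetical choices, not from the slides):

```python
import numpy as np

n, q = 8, 2          # hypothetical sizes: 8 x 8 matrices viewed as 2 x 2 blocks
b = n // q           # each block is (n/q) x (n/q)

rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))
C = np.zeros((n, n))

# q^3 block multiplications, each involving (n/q) x (n/q) submatrices
for i in range(q):
    for j in range(q):
        for k in range(q):
            C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b]
            )
```

The blocked result agrees with the ordinary product A x B; the parallel algorithms below distribute exactly these block operations across processes.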

DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon 2


Matrix-Matrix Multiplication

• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p)
of size (n/√p) x (n/√p) each.

• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.

• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.

• All-to-all broadcast blocks of A along rows and B along columns.

• Perform local submatrix multiplication.
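The effect of the two all-to-all broadcasts is that process Pi,j ends up holding all of block-row i of A and block-column j of B. That outcome can be mimicked serially (a minimal sketch; n, the process-grid size, and the random inputs are hypothetical):

```python
import numpy as np

n, sqrt_p = 6, 3     # hypothetical: p = 9 processes in a 3 x 3 grid
b = n // sqrt_p      # block size (n/sqrt(p))

rng = np.random.default_rng(1)
A = rng.random((n, n))
B = rng.random((n, n))

def block(M, i, j):
    """Return block (i, j) of matrix M."""
    return M[i*b:(i+1)*b, j*b:(j+1)*b]

C = np.zeros((n, n))
for i in range(sqrt_p):
    for j in range(sqrt_p):
        # after the broadcasts, P(i,j) holds every A(i,k) and B(k,j);
        # the local submatrix multiplication is then a plain block sum
        C[i*b:(i+1)*b, j*b:(j+1)*b] = sum(
            block(A, i, k) @ block(B, k, j) for k in range(sqrt_p)
        )
```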



Matrix-Matrix Multiplication
• The two all-to-all broadcasts (blocks of A along the rows and blocks of B along the columns,
each among √p processes with messages of size n²/p) take time 2(ts log √p + tw(n²/p)(√p − 1)).

• The computation requires √p multiplications of (n/√p) x (n/√p) submatrices, for a total of
√p · (n/√p)³ = n³/p time.

• The parallel run time is approximately
TP = n³/p + ts log p + 2tw n²/√p.

• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw
and concurrency.
• Major drawback of the algorithm is that it is not memory optimal.



Matrix-Matrix Multiplication: Cannon's Algorithm

• In this algorithm, we schedule the computations of the processes of the ith row
such that, at any given time, each process is using a different block Ai,k.

• These blocks can be systematically rotated among the processes after every submatrix
multiplication so that every process gets a fresh Ai,k after each rotation.



Communication steps in Cannon's algorithm on 16 processes.



Matrix-Matrix Multiplication: Cannon's Algorithm

• Align the blocks of A and B so that each process can multiply its local submatrices. This is
done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all
submatrices Bi,j up (with wraparound) by j steps.

• Perform local block multiplication.

• Each block of A moves one step left and each block of B moves one step up (again with
wraparound).

• Perform next block multiplication, add to partial result, repeat until all blocks have
been multiplied.
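The alignment and the compute-and-shift loop can be sketched serially with NumPy, using circular shifts over a q x q array of blocks (a minimal simulation; n, q, and the random matrices are hypothetical choices):

```python
import numpy as np

n, q = 6, 3          # hypothetical: p = 9 processes, q = sqrt(p)
b = n // q           # block size

rng = np.random.default_rng(2)
A = rng.random((n, n))
B = rng.random((n, n))

# view the matrices as q x q arrays of b x b blocks
Ab = A.reshape(q, b, q, b).transpose(0, 2, 1, 3).copy()
Bb = B.reshape(q, b, q, b).transpose(0, 2, 1, 3).copy()
Cb = np.zeros_like(Ab)

# alignment: shift row i of A left by i, column j of B up by j (with wraparound)
for i in range(q):
    Ab[i] = np.roll(Ab[i], -i, axis=0)
for j in range(q):
    Bb[:, j] = np.roll(Bb[:, j], -j, axis=0)

# q compute-and-shift steps
for _ in range(q):
    for i in range(q):
        for j in range(q):
            Cb[i, j] += Ab[i, j] @ Bb[i, j]   # local block multiplication
    Ab = np.roll(Ab, -1, axis=1)   # every A block moves one step left
    Bb = np.roll(Bb, -1, axis=0)   # every B block moves one step up

# reassemble C from its blocks
C = Cb.transpose(0, 2, 1, 3).reshape(n, n)
```

After the alignment, process (i, j) holds Ai,(i+j) mod q and B(i+j) mod q,j, so at every step each process multiplies a matching pair of blocks and no block of A is needed by two processes in the same row at once.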
Matrix-Matrix Multiplication: Cannon's Algorithm
• In the alignment step, since the maximum distance over which a block shifts is √p − 1,
the two shift operations require a total of 2(ts + tw n²/p) time (using the wraparound links).
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes
ts + tw n²/p time.
• The computation time for multiplying √p pairs of (n/√p) x (n/√p) submatrices
is √p · (n/√p)³ = n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2tw n²/√p.

• The cost-optimality and isoefficiency of the algorithm are identical to those of the first
algorithm, except that this algorithm is memory optimal.



Matrix-Matrix Multiplication: DNS Algorithm

• Uses a 3-D partitioning.

• Visualize the matrix multiplication algorithm as an n x n x n cube: matrices A and B come in
on two orthogonal faces and the result C comes out of another orthogonal face.

• Each internal node in the cube represents a single add-multiply operation (and thus the
algorithm's O(n³) complexity).

• DNS algorithm partitions this cube using a 3-D block scheme.



Matrix-Matrix Multiplication: DNS Algorithm

• Assume an n x n x n mesh of processors.

• Move the columns of A and rows of B and perform broadcast.

• Each processor computes a single add-multiply.

• This is followed by an accumulation along the C dimension.

• Since each add-multiply takes constant time and the accumulation and broadcasts take O(log n)
time, the total runtime is O(log n).

• This is not cost optimal. It can be made cost optimal by using n/log n processors along the
direction of accumulation.
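The 3-D view can be sketched serially: after the initial moves and broadcasts, the process at position (i, j, k) holds A[i, k] and B[k, j], performs one multiplication, and the results are summed along the k dimension (a minimal simulation; n and the random inputs are hypothetical):

```python
import numpy as np

n = 4                # hypothetical: n^3 = 64 "processes", one per cube position (i, j, k)
rng = np.random.default_rng(3)
A = rng.random((n, n))
B = rng.random((n, n))

# after the moves and broadcasts, process (i, j, k) holds A[i, k] and B[k, j];
# each performs a single multiplication
partial = np.empty((n, n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            partial[i, j, k] = A[i, k] * B[k, j]

# accumulation along the k (C) dimension yields the product
C = partial.sum(axis=2)
```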
The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.



Matrix-Matrix Multiplication: DNS Algorithm
Using fewer than n³ processors.

• Assume that the number of processes p is equal to q3 for some q < n.

• The two matrices are partitioned into blocks of size (n/q) x (n/q).

• Each matrix can thus be regarded as a q x q two-dimensional square array of blocks.

• The algorithm follows from the previous one, except, in this case, we operate on blocks
rather than on individual elements.
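The block version is the same simulation with each scalar add-multiply replaced by a block multiplication and the accumulation replaced by a block reduction (a minimal sketch; n, q, and the random inputs are hypothetical):

```python
import numpy as np

n, q = 8, 2          # hypothetical: p = q^3 = 8 processes, blocks of (n/q) x (n/q)
b = n // q
rng = np.random.default_rng(4)
A = rng.random((n, n))
B = rng.random((n, n))

# process (i, j, k) holds blocks A(i,k) and B(k,j) and multiplies them
partial = np.zeros((q, q, q, b, b))
for i in range(q):
    for j in range(q):
        for k in range(q):
            partial[i, j, k] = (
                A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b]
            )

# reduction along the k dimension, then reassemble C from its blocks
Cb = partial.sum(axis=2)
C = np.block([[Cb[i, j] for j in range(q)] for i in range(q)])
```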



Matrix-Matrix Multiplication: DNS Algorithm
Using fewer than n³ processors.
• The first one-to-one communication step is performed for both A and B, and takes
ts + tw(n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw(n/q)²) log q time for each matrix.
• The reduction takes (ts + tw(n/q)²) log q time as well.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP = n³/p + ts log p + tw (n²/p^(2/3)) log p.

• The isoefficiency function is O(p log³ p).



Summary

• Matrix-Matrix Multiplication

• Matrix-Matrix Multiplication: Cannon's Algorithm

• Matrix-Matrix Multiplication: DNS Algorithm



References

• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to
Parallel Computing", 2nd edition, Addison-Wesley, 2003, ISBN: 0-201-64865-2.



Thank You.

