Unit 4 HPC Part8
Prof. B. J. Dange
Assistant Professor
E-mail : [email protected]
Contact No: 91301 91301 Ext: 145, 9604146122
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n × n dense, square matrices A and B to yield
the product matrix C = A × B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (such as Strassen's method), although these
can be used as serial kernels in the parallel algorithms.
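To make the baseline concrete, here is a minimal Python sketch of the serial algorithm (the function name matmul is ours, not from the slides): three nested loops, one multiply-add per (i, j, k) triple, hence the O(n³) cost.

```python
# Serial matrix multiplication: n^3 multiply-add operations in total.
def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```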
• A useful concept in this case is called block operations. In this view, an n × n matrix A
can be regarded as a q × q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) × (n/q) submatrix.
• In this view, we perform q³ matrix multiplications, each involving (n/q) × (n/q) matrices.
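A minimal sketch of the block view, assuming q divides n (the helper names block and block_matmul are illustrative): it performs exactly q³ multiplications of (n/q) × (n/q) submatrices and produces the same result as the element-wise algorithm.

```python
# Blocked matrix multiplication: view an n x n matrix as a q x q array
# of (n/q) x (n/q) blocks and perform q^3 block multiplications.
def block(M, i, j, b):
    # Extract the (i, j) block of size b x b from matrix M.
    return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

def block_matmul(A, B, q):
    n = len(A)
    b = n // q                       # block size n/q; assumes q divides n
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):       # q^3 block multiplications in total
                Aik, Bkj = block(A, i, k, b), block(B, k, j, b)
                for r in range(b):   # C_{i,j} += A_{i,k} * B_{k,j}
                    for c in range(b):
                        C[i * b + r][j * b + c] += sum(
                            Aik[r][t] * Bkj[t][c] for t in range(b))
    return C
```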
• Consider two n × n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p)
of size (n/√p) × (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• The algorithm is cost optimal and the isoefficiency is O(p^1.5) due to the bandwidth term tw
and concurrency.
• The major drawback of the algorithm is that it is not memory optimal: each process ends up
storing √p blocks of A and √p blocks of B, as the sketch below makes visible.
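The following sequential sketch (simple_parallel is an illustrative name; the gathers stand in for the all-to-all broadcasts, with no real message passing) mimics the algorithm on a √p × √p logical process grid and shows why it is not memory optimal: each Pi,j ends up holding an entire block-row of A and block-column of B.

```python
# Simulate the simple parallel algorithm on an s x s grid, s = sqrt(p).
# A_blocks[i][k] and B_blocks[k][j] are the (n/s) x (n/s) blocks.
def simple_parallel(A_blocks, B_blocks, s):
    C_blocks = {}
    for i in range(s):
        for j in range(s):
            # P_{i,j} needs every A_{i,k} (its block-row) and every
            # B_{k,j} (its block-column): sqrt(p) blocks of each matrix.
            row = [A_blocks[i][k] for k in range(s)]
            col = [B_blocks[k][j] for k in range(s)]
            b = len(row[0])
            Cij = [[0] * b for _ in range(b)]
            for k in range(s):       # C_{i,j} = sum_k A_{i,k} * B_{k,j}
                for r in range(b):
                    for c in range(b):
                        Cij[r][c] += sum(row[k][r][t] * col[k][t][c]
                                         for t in range(b))
            C_blocks[(i, j)] = Cij
    return C_blocks
```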
Matrix-Matrix Multiplication: Cannon's Algorithm
• In Cannon's algorithm, we schedule the computations of the processes of the ith row
such that, at any given time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the processes after every submatrix
multiplication so that every process gets a fresh Ai,k after each rotation.
• Align the blocks of A and B in such a way that each process multiplies its local
submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i
steps and all submatrices Bi,j up (with wraparound) by j steps.
• Each block of A moves one step left and each block of B moves one step up (again with
wraparound).
• Perform the next block multiplication, add it to the partial result, and repeat until all √p
blocks have been multiplied; a sequential sketch of these steps follows.
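Here is a sequential sketch of Cannon's algorithm (the function name cannon is ours; the alignment and rotations are simulated with index arithmetic rather than real interprocess shifts) on a q × q grid of blocks, q = √p. Unlike the previous algorithm, each simulated process holds only one block of A and one block of B at a time, which is what makes Cannon's algorithm memory optimal.

```python
# Cannon's algorithm on a q x q grid of (n/q) x (n/q) blocks, q = sqrt(p).
def cannon(A_blocks, B_blocks, q):
    b = len(A_blocks[0][0])
    # Alignment: shift row i of A left by i, column j of B up by j.
    A = [[A_blocks[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B_blocks[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Each process multiplies its current local pair of blocks.
        for i in range(q):
            for j in range(q):
                for r in range(b):
                    for c in range(b):
                        C[i][j][r][c] += sum(A[i][j][r][t] * B[i][j][t][c]
                                             for t in range(b))
        # Rotate: every A block one step left, every B block one step up.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C  # C[i][j] is block C_{i,j} of the product
```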
Matrix-Matrix Multiplication: Cannon's Algorithm
• In the alignment step, since the maximum distance over which a block shifts is √p − 1,
the two shift operations require a total of 2(ts + tw·n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes
ts + tw·n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) × (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p·ts + 2tw·n²/√p
• The cost-optimality and isoefficiency of the algorithm are identical to those of the first
algorithm, except that this algorithm is memory optimal; the parallel-time expression is
evaluated in the sketch below.
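As a rough illustration, the parallel-time expression can be evaluated numerically. The startup cost ts and per-word cost tw below are assumed values (in units of one multiply-add), not figures from the slides.

```python
# Evaluate T_P = n^3/p + 2*sqrt(p)*ts + 2*tw*n^2/sqrt(p) for Cannon's
# algorithm under assumed communication costs ts and tw.
from math import sqrt

def parallel_time(n, p, ts=100.0, tw=1.0):
    return n**3 / p + 2 * sqrt(p) * ts + 2 * tw * n**2 / sqrt(p)

n = 1024
for p in (16, 64, 256):
    Tp = parallel_time(n, p)
    speedup = n**3 / Tp              # serial time is n^3 multiply-adds
    print(f"p={p:4d}  T_P={Tp:12.0f}  speedup={speedup:6.1f}  "
          f"efficiency={speedup / p:.2f}")
```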
Matrix-Matrix Multiplication: DNS Algorithm
• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two
orthogonal faces and the result C comes out on another orthogonal face.
• Each internal node in the cube represents a single add-multiply operation (and thus the
O(n³) complexity).
• Since each add-multiply takes constant time and accumulation and broadcast take log n
time, the total runtime is log n with n³ processes.
• This is not cost optimal. It can be made cost optimal by using n / log n processors along the
direction of accumulation. A sketch of the cube view follows.
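A minimal sketch of the cube view (dns is an illustrative name; the accumulation that a hypercube performs in log n steps is written as a plain sum here): node (i, j, k) does the single multiplication A[i][k] · B[k][j], and the results are summed along the k direction.

```python
# DNS cube view: one scalar multiply per internal node (i, j, k),
# then accumulation along k to form C[i][j].
def dns(A, B):
    n = len(A)
    node = {(i, j, k): A[i][k] * B[k][j]
            for i in range(n) for j in range(n) for k in range(n)}
    # On n^3 processes the sum below is a log n-step reduction along k.
    return [[sum(node[(i, j, k)] for k in range(n)) for j in range(n)]
            for i in range(n)]

print(dns([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```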
Figure: the communication steps in the DNS algorithm while multiplying 4 × 4 matrices A and B on 64 processes.
DNS Algorithm Using Fewer Than n³ Processes
• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) × (n/q).
• The algorithm follows from the previous one, except that in this case we operate on blocks
rather than on individual elements, as in the sketch below.
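A sketch of the blocked variant under the same assumptions (dns_blocked is an illustrative name): cube node (i, j, k) now multiplies the (n/q) × (n/q) blocks Ai,k and Bk,j instead of single elements, and the block products are accumulated along k.

```python
# Blocked DNS: p = q^3 "processes", each cube node multiplies one pair
# of (n/q) x (n/q) blocks; block products are summed along k.
def dns_blocked(A_blocks, B_blocks, q):
    b = len(A_blocks[0][0])
    def mul(X, Y):   # dense block product computed at one cube node
        return [[sum(X[r][t] * Y[t][c] for t in range(b))
                 for c in range(b)] for r in range(b)]
    def add(X, Y):   # block addition used in the accumulation step
        return [[X[r][c] + Y[r][c] for c in range(b)] for r in range(b)]
    C = {}
    for i in range(q):
        for j in range(q):
            acc = [[0] * b for _ in range(b)]
            for k in range(q):       # accumulation along the k direction
                acc = add(acc, mul(A_blocks[i][k], B_blocks[k][j]))
            C[(i, j)] = acc
    return C  # C[(i, j)] is block C_{i,j} of the product
```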
Reference
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to
Parallel Computing", 2nd Edition, Addison-Wesley, 2003, ISBN 0-201-64865-2.