High Performance Computing: Matrix Multiplication
Lecture 11:
Matrix Algorithms
Reference:
“Introduction to Parallel Computing” – Chapter 8.
Topic Overview
• Performance Analysis
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitions for parallel processing.
• The run-time performance of such algorithms depends on
the overhead incurred relative to the computation workload.
• As a rule of thumb, good speedup can be achieved if the
computation granularity is large enough to outweigh
overheads such as communication cost, consolidation cost
(algorithm penalty), data packaging, etc.
Matrix-Vector Multiplication
[Figure: multiplication of an n × n matrix A by an n × 1 vector x.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n × n matrix A is partitioned among n processors,
with each processor storing a complete row of the
matrix.
• The n × 1 vector x is distributed such that each process
owns one of its elements.
[Figure: matrix A, vector x, and result y, with one row of A and one element of x per process.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p = n)
• An all-to-all broadcast of the vector elements gives every
process all of x; each process then computes one element of
y = Ax by a dot product with its row.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p < n)
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
• With m = n/p, the all-to-all broadcast of the vector
elements takes time t_s log p + t_w (n/p)(p − 1)
≈ t_s log p + t_w n.
• Thus the parallel run time of matrix-vector multiplication
based on rowwise 1-D partitioning (p < n) is
T_P = n²/p + t_s log p + t_w n.
• This is cost-optimal.
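A minimal MPI sketch of this scheme (an illustrative assumption, not code from the reference text): each rank holds n/p rows of A and n/p elements of x, MPI_Allgather plays the role of the all-to-all broadcast, and the function name mat_vec_1d plus the requirement that p divides n are choices made here.

#include <mpi.h>
#include <stdlib.h>

/* y = A*x with rowwise 1-D partitioning (p < n), n divisible by p.
   A_local holds n/p rows of A (row-major); x_local and y_local hold
   n/p elements each. */
void mat_vec_1d(int n, const double *A_local, double *x_local,
                double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* All-to-all broadcast: every rank assembles the full vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Local computation: an (n/p) x n submatrix times the full vector. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }
    free(x);
}

The MPI_Allgather call corresponds to the t_s log p + t_w n term above; the double loop is the n²/p local computation.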
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• The n × n matrix is partitioned among n² processors such
that each processor owns a single element.
• The n × 1 vector x is distributed only in the last column of
n processors.
[Figure: 2-D partitioning of the n × n matrix among p = n² processors, with x stored in the last processor column.]
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• We must first align the vector
with the matrix appropriately.
• The first communication step
for the 2-D partitioning aligns
the vector x along the
principal diagonal of the
matrix.
• The second step copies the
vector elements from each
diagonal process to all the
processes in the
corresponding column using
n simultaneous broadcasts
among all processors in the
column.
• Finally, the result vector is
computed by performing an
all-to-one reduction along the
columns.
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• Three basic communication operations are used in this algorithm: one-to-one
communication to align the vector along the main diagonal, one-to-all broadcast
of each vector element among the n processes of each column, and all-to-one
reduction in each row.
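Putting these three operations together gives the run time for p = n² (a sketch of the standard analysis, stated here for completeness rather than taken from the slides): the alignment is a single one-to-one message, and the broadcast and reduction each run among groups of n processes, so
T_P = Θ(1) + Θ(log n) + Θ(log n) + Θ(1) = Θ(log n),
and the process-time product is Θ(n² log n), which is not cost-optimal.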
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• When using fewer than n² processors, each process owns an
(n/√p) × (n/√p) block of the matrix; p (i.e., √p × √p)
processors are used.
• The vector is distributed in portions of n/√p elements in the
last process-column only.
• In this case, the message sizes for the alignment, broadcast,
and reduction are all n/√p.
• The computation is a product of an (n/√p) × (n/√p) submatrix
with a vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• The first alignment step takes time t_s + t_w n/√p.
• The broadcast and reduction each take time
(t_s + t_w n/√p) log √p.
• The total parallel run time is
T_P = n²/p + t_s log p + t_w (n/√p) log p.
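A minimal MPI sketch of the three phases (alignment, columnwise broadcast, rowwise reduction), assuming p is a perfect square, √p divides n, and ranks are laid out row-major on a √p × √p grid; the function name mat_vec_2d and the use of MPI_Comm_split are illustrative choices, not from the reference text.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* 2-D partitioned y = A*x (p < n^2).  A_local is the (n/√p) x (n/√p)
   block owned by this rank; x_part/y_part are length-n/√p pieces,
   meaningful in the last process column. */
void mat_vec_2d(int n, const double *A_local, double *x_part,
                double *y_part, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);     /* grid is q x q */
    int row = rank / q, col = rank % q, nloc = n / q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm); /* same-row ranks    */
    MPI_Comm_split(comm, col, row, &col_comm); /* same-column ranks */

    /* Phase 1: align x with the diagonal -- the last column sends its
       piece to the diagonal process of the same row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(x_part, nloc, MPI_DOUBLE, row, 0, row_comm);
    if (row == col && col != q - 1)
        MPI_Recv(x_part, nloc, MPI_DOUBLE, q - 1, 0, row_comm,
                 MPI_STATUS_IGNORE);

    /* Phase 2: broadcast each diagonal piece down its column
       (the diagonal process of column c sits in row c). */
    MPI_Bcast(x_part, nloc, MPI_DOUBLE, col, col_comm);

    /* Local computation: (n/√p) x (n/√p) block times its piece of x. */
    double *partial = calloc(nloc, sizeof(double));
    for (int i = 0; i < nloc; i++)
        for (int j = 0; j < nloc; j++)
            partial[i] += A_local[i * nloc + j] * x_part[j];

    /* Phase 3: all-to-one reduction along each row into the last column. */
    MPI_Reduce(partial, y_part, nloc, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}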
Matrix-Matrix Multiplication
[Figure: C = A × B for n × n matrices A, B, and C.]
Matrix-Matrix Multiplication
• A useful concept in this case is block operations. In this
view, an n × n matrix A can be regarded as a q × q array of
blocks A_{i,j} (0 ≤ i, j < q) such that each block is an
(n/q) × (n/q) submatrix.
• We then perform q³ block multiplications, each involving
(n/q) × (n/q) matrices, as illustrated in the sketch below.
[Figure: an n × n matrix viewed as a q × q array of (n/q) × (n/q) blocks.]
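As a concrete serial illustration of the block view (the function name block_matmul is assumed here, C is expected to be zero-initialized, and q is assumed to divide n):

/* Serial block matrix multiplication: the three outer loops perform
   the q^3 block multiplications, each on (n/q) x (n/q) submatrices.
   A, B, C are n x n in row-major order. */
void block_matmul(int n, int q, const double *A, const double *B,
                  double *C)
{
    int b = n / q;                       /* block size n/q */
    for (int bi = 0; bi < q; bi++)
      for (int bj = 0; bj < q; bj++)
        for (int bk = 0; bk < q; bk++)
          /* C_{bi,bj} += A_{bi,bk} * B_{bk,bj} */
          for (int i = bi * b; i < (bi + 1) * b; i++)
            for (int j = bj * b; j < (bj + 1) * b; j++) {
              double sum = 0.0;
              for (int k = bk * b; k < (bk + 1) * b; k++)
                sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] += sum;
            }
}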
Matrix-Matrix Multiplication
• Consider two n × n matrices A and B partitioned into p
blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size
(n/√p) × (n/√p) each.
• Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes
block C_{i,j} of the result matrix.
• Computing submatrix C_{i,j} requires all submatrices A_{i,k}
and B_{k,j} for 0 ≤ k < √p.
• All-to-all broadcasts of the blocks of A along rows and of B
along columns are needed.
• Each process then performs the local submatrix
multiplications.
Matrix-Matrix Multiplication
• The two all-to-all broadcasts take time
2(t_s log √p + t_w (n²/p)(√p − 1)) ≈ t_s log p + 2 t_w n²/√p.
• The parallel run time is therefore
T_P = n³/p + t_s log p + 2 t_w n²/√p.
• This is cost-optimal, but after the broadcasts each process
stores √p blocks of A and of B (Θ(n²√p) memory in total),
which motivates Cannon's algorithm.
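A hedged MPI sketch of this algorithm, under the same grid assumptions as before (perfect-square p, √p dividing n, row-major rank layout); simple_matmul and block_mul_add are names chosen here, and C_loc must be zero-initialized:

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* C_loc += A_blk * B_blk for b x b blocks in row-major order. */
void block_mul_add(int b, const double *A_blk, const double *B_blk,
                   double *C_loc)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++) {
            double sum = 0.0;
            for (int k = 0; k < b; k++)
                sum += A_blk[i * b + k] * B_blk[k * b + j];
            C_loc[i * b + j] += sum;
        }
}

void simple_matmul(int n, double *A_loc, double *B_loc, double *C_loc,
                   MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q, bs = b * b;
    int row = rank / q, col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);
    MPI_Comm_split(comm, col, row, &col_comm);

    /* All-to-all broadcasts: gather the whole block-row of A and the
       whole block-column of B at every process. */
    double *A_row = malloc((size_t)q * bs * sizeof(double));
    double *B_col = malloc((size_t)q * bs * sizeof(double));
    MPI_Allgather(A_loc, bs, MPI_DOUBLE, A_row, bs, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_loc, bs, MPI_DOUBLE, B_col, bs, MPI_DOUBLE, col_comm);

    /* C_{i,j} = sum_k A_{i,k} * B_{k,j} over the √p gathered pairs. */
    for (int k = 0; k < q; k++)
        block_mul_add(b, A_row + (size_t)k * bs, B_col + (size_t)k * bs,
                      C_loc);

    free(A_row); free(B_col);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}

The two gathered buffers of √p blocks each are exactly the memory overhead noted above.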
Matrix-Matrix
Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p
processes of the i-th row such that, at any given time, each
process is using a different block A_{i,k}.
• These blocks can be systematically rotated among the
processes after every submatrix multiplication so that every
process gets a fresh A_{i,k} after each rotation.
Matrix-Matrix
Multiplication:
Cannon's Algorithm
• Align the blocks of A and B so that each process can multiply
its local submatrices. This is done by shifting all submatrices
A_{i,j} to the left (with wraparound) by i steps and all
submatrices B_{i,j} up (with wraparound) by j steps.
• Perform a local block multiplication.
• Shift each block of A one step left and each block of B one
step up (again with wraparound).
• Perform the next block multiplication, add it to the partial
result, and repeat until all √p blocks have been multiplied
(see the MPI sketch after this list).
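A minimal MPI sketch of Cannon's algorithm under the same assumptions (perfect-square p, √p dividing n, zero-initialized C_loc); it reuses block_mul_add from the previous sketch and relies on a periodic Cartesian communicator for the wraparound shifts:

#include <mpi.h>
#include <math.h>

void block_mul_add(int b, const double *A_blk, const double *B_blk,
                   double *C_loc);     /* from the previous sketch */

void cannon(int n, double *A_loc, double *B_loc, double *C_loc,
            MPI_Comm comm)
{
    int p, rank, coords[2];
    int dims[2], periods[2] = {1, 1};
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q, bs = b * b;
    dims[0] = dims[1] = q;

    /* Periodic q x q grid so that all shifts wrap around. */
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int row = coords[0], col = coords[1];
    int left, right, up, down;

    /* Initial alignment: A_{i,j} left by i steps, B_{i,j} up by j steps. */
    MPI_Cart_shift(grid, 1, -row, &right, &left);
    MPI_Sendrecv_replace(A_loc, bs, MPI_DOUBLE, left, 0, right, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -col, &down, &up);
    MPI_Sendrecv_replace(B_loc, bs, MPI_DOUBLE, up, 0, down, 0,
                         grid, MPI_STATUS_IGNORE);

    /* Compute-and-shift: √p rounds of multiply plus single-step shifts. */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < q; step++) {
        block_mul_add(b, A_loc, B_loc, C_loc); /* C_loc += A_loc*B_loc */
        MPI_Sendrecv_replace(A_loc, bs, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_loc, bs, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}

Each MPI_Sendrecv_replace round corresponds to one t_s + t_w n²/p shift in the analysis that follows.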
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which
a block shifts is √p − 1, the two shift operations require a
total of 2(t_s + t_w n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift
phase of the algorithm takes t_s + t_w n²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) × (n/√p) is n³/p (i.e., √p × (n/√p)³).
• The parallel run time is therefore
T_P = n³/p + 2√p t_s + 2 t_w n²/√p,
which is cost-optimal and, unlike the previous algorithm,
needs only a constant number of blocks per process.
Matrix-Matrix Multiplication:
DNS (Dekel, Nassimi, and Sahni) Algorithm
• Assume an n × n × n mesh of processors.
• Move the columns of A and the rows of B into position and
perform one-to-all broadcasts.
• Each processor P_{i,j,k} computes a single multiplication,
A[i,k] × B[k,j], a partial contribution to C[i,j].
• This is followed by an accumulation along the k dimension.
• Since each add-multiply takes constant time and the
accumulation and broadcasts take log n time, the total run
time is Θ(log n).
• This is not cost-optimal. It can be made cost-optimal by
using n/log n processors along the direction of accumulation.
[Figure: the n × n × n process mesh drawn as planes k = 0 through k = 3.]
Matrix-Matrix Multiplication:
DNS Algorithm
[Figures: the communication steps of the DNS algorithm on the
n × n × n mesh. A[i,j] is moved from P_{i,j,0} to P_{i,j,j} and
B[i,j] from P_{i,j,0} to P_{i,j,i}; B[i,j] is then broadcast from
P_{i,j,i} among P_{0,j,i}, ..., P_{n-1,j,i}, and A[i,j] analogously
along the j axis.]
Matrix-Matrix Multiplication: DNS Algorithm
After these communication steps, A[i,k] and B[k,j] are multiplied at P_{i,j,k}. Now each element C[i,j]
of the product matrix is obtained by an all-to-one reduction along the k axis. During this step,
process P_{i,j,0} accumulates the results of the multiplication from processes P_{i,j,1}, ..., P_{i,j,n-1}.
The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows
of B to their respective planes, (2) performing one-to-all broadcast along the j axis for A and along
the i axis for B, and (3) all-to-one reduction along the k axis. All these operations are performed
within groups of n processes and take time O(log n). Thus, the parallel run time for
multiplying two n × n matrices using the DNS algorithm on n³ processes is O(log n).
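To make the data flow concrete, here is a serial C simulation of the n³ virtual processes (illustrative only: dns_simulate and the fixed size N are assumptions, and no real communication occurs; partial[i][j][k] stands for the value held at P_{i,j,k} after the broadcasts):

#define N 4

/* After the communication phases, P_{i,j,k} holds A[i][k] and B[k][j];
   the final k-loop models the all-to-one reduction into P_{i,j,0}. */
void dns_simulate(const double A[N][N], const double B[N][N],
                  double C[N][N])
{
    double partial[N][N][N];

    /* Each virtual process performs its single multiplication. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                partial[i][j][k] = A[i][k] * B[k][j];

    /* All-to-one reduction along the k axis. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += partial[i][j][k];
        }
}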
Matrix-Matrix Multiplication:
DNS Algorithm (Using fewer than n³ processors)
• The same scheme applies with p = q³ < n³ processes arranged
in a q × q × q mesh, each process owning an (n/q) × (n/q)
block of A and B; the moves, broadcasts, and reductions then
operate on blocks rather than single elements.