
CZ4102 – High Performance Computing

Lecture 11:

Matrix Algorithms

- Dr Tay Seng Chuan

Reference:
“Introduction to Parallel Computing” – Chapter 8.

Topic Overview

• Parallel Algorithms for
- Matrix-Vector Multiplication
- Matrix-Matrix Multiplication

• Performance Analysis

Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitions for parallel processing.
• The run-time performance of such algorithms depends on the
overheads incurred relative to the computation workload.
• As a rule of thumb, good speedup can be achieved if the
computation granularity outweighs the overheads, such as the
communication cost and the consolidation cost (algorithm penalty,
data packaging, etc.).

Matrix-Vector Multiplication

• We aim to multiply a dense n × n matrix A with an n × 1 vector x
to yield the n × 1 result vector y.
• The serial algorithm requires n² multiplications and n² additions.

[Figure: the n × n matrix A times the n × 1 vector x gives the n × 1 vector y.]

• The total workload is therefore W = Θ(n²).
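For reference, a minimal serial implementation (our own illustration, not from the slides) makes the Θ(n²) operation count explicit: one multiply-add per matrix entry.

```c
#include <stddef.h>

/* Serial matrix-vector multiply: y = A * x.
 * A is n x n in row-major order; the doubly nested loop performs
 * exactly n * n multiply-adds, hence the Theta(n^2) cost. */
void matvec_serial(size_t n, const double *A, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}
```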

Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n × n matrix A is partitioned among n processors, with each
processor storing a complete row of the matrix.
• The n × 1 vector x is distributed such that each process owns one
of its elements.
[Figure: A, x, and y partitioned by rows; each process holds one row of A and one element of x.]

Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p = n)

• Since each process starts with only one element of vector x, an
all-to-all broadcast is required to distribute all the elements x[j]
to all the processes.
• Process Pi now computes
y[i] = Σj=0..n−1 A[i, j] · x[j].
• The all-to-all broadcast and the computation of y[i] both take
time Θ(n). Therefore, the parallel time is Θ(n).

Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p < n)
• Consider now the case when p < n and we use a block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a
portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves
messages of size n/p.
• This is followed by local dot products over the n/p rows
(computation time = (n/p) × n = n²/p).

Recap: All-to-All Broadcast Based on the Doubling-up Approach

• An all-to-all broadcast of messages of size m among p processes takes
T = ts log p + tw m (p − 1).
• With m = n/p, the all-to-all broadcast of the vector elements takes
T = ts log p + tw (n/p)(p − 1) ≈ ts log p + tw n.
• Thus, the parallel run time of matrix-vector multiplication based on
rowwise 1-D partitioning (p < n) is
TP = n²/p + ts log p + tw n.
• This is cost-optimal.
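The rowwise algorithm maps directly onto MPI. Below is a minimal sketch under our own assumptions (p divides n; each rank already holds its n/p rows of A and its n/p entries of x; function and variable names are ours): the all-to-all broadcast is MPI_Allgather, followed by the local dot products.

```c
#include <mpi.h>
#include <stdlib.h>

/* Rowwise 1-D parallel matrix-vector multiply (sketch). Assumptions
 * (ours): p divides n; each rank holds its n/p rows of A in A_loc
 * (row-major, n columns) and its n/p entries of x in x_loc; y_loc
 * receives the corresponding n/p entries of the result. */
void matvec_rowwise(int n, const double *A_loc, const double *x_loc,
                    double *y_loc, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;              /* rows and vector entries per rank */

    /* All-to-all broadcast of the vector: every rank gathers all of x.
       Cost ~ ts log p + tw (n/p)(p - 1), as in the recap above. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_loc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Local dot products: n/p rows of length n => n^2/p multiply-adds. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_loc[i * n + j] * x[j];
        y_loc[i] = sum;
    }
    free(x);
}
```

The two phases of the code correspond one-to-one to the two terms of the run time above: the Allgather contributes ts log p + tw n, the loop contributes n²/p.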

Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• The n × n matrix is partitioned among n² processors such that each
processor owns a single element.
• The n × 1 vector x is distributed only in the last column of n
processors.
[Figure: an n × n grid of p = n² processors; the vector x resides in the last processor column.]

Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• We must first align the vector
with the matrix appropriately.
• The first communication step
for the 2-D partitioning aligns
the vector x along the
principal diagonal of the
matrix.
• The second step copies the
vector elements from each
diagonal process to all the
processes in the
corresponding column using
n simultaneous broadcasts
among all processors in the
column.
• Finally, the result vector is
computed by performing an
all-to-one reduction along the
columns.

Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• Three basic communication operations are used in this algorithm: one-to-one
communication to align the vector along the main diagonal, one-to-all broadcast
of each vector element among the n processes of each column, and all-to-one
reduction in each row.

• These communications each take Θ(log n) time, and the computation
time is Θ(1). The parallel time of this algorithm is therefore
Θ(log n) + Θ(1) = Θ(log n).
• The cost (process-time product) is n² × Θ(log n) = Θ(n² log n),
which exceeds the Θ(n²) serial work; hence, the algorithm is not
cost-optimal.

Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• When using fewer than n² processors, each process owns an
(n/√p) × (n/√p) block of the matrix; a √p × √p grid of processors
(i.e., p in total) is used.
• The vector is distributed in portions of n/√p elements in the last
process-column only.
• In this case, the message sizes for the alignment, broadcast, and
reduction are all n/√p.
• The computation is a product of an (n/√p) × (n/√p) submatrix with a
vector of length n/√p.

Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• The first alignment step takes time ts + tw n/√p.
• The broadcast and the reduction each take time (ts + tw n/√p) log(√p).
• The local matrix-vector products take time n²/p.
• The total time is therefore
TP ≈ n²/p + ts log p + tw (n/√p) log p.
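Putting the four steps together, here is a minimal MPI sketch (our own illustration, under the assumptions stated in the comments; all names are ours):

```c
#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* 2-D partitioned matrix-vector multiply (sketch). Assumptions (ours):
 * p is a perfect square, q = sqrt(p) divides n; rank r sits at grid
 * position (row, col) = (r/q, r%q) and holds its (n/q) x (n/q) block
 * A_blk; ranks in the last column hold the vector pieces in x_blk,
 * which on every rank is a buffer of length n/q; y_blk (length n/q)
 * is valid on the last column after the call. */
void matvec_2d(int n, const double *A_blk, double *x_blk, double *y_blk,
               MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q;
    int nb = n / q;                      /* block edge = message size */

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);   /* my grid row    */
    MPI_Comm_split(comm, col, row, &col_comm);   /* my grid column */

    /* Step 1: one-to-one alignment -- (row, q-1) sends its vector
       piece to the diagonal process (row, row). */
    if (col == q - 1 && row != q - 1)
        MPI_Send(x_blk, nb, MPI_DOUBLE, row * q + row, 0, comm);
    else if (row == col && col != q - 1)
        MPI_Recv(x_blk, nb, MPI_DOUBLE, row * q + (q - 1), 0, comm,
                 MPI_STATUS_IGNORE);

    /* Step 2: each diagonal process broadcasts its piece down its
       column (its rank inside col_comm equals its row index = col). */
    MPI_Bcast(x_blk, nb, MPI_DOUBLE, col, col_comm);

    /* Step 3: local product of the block with the vector piece:
       (n/q)^2 = n^2/p multiply-adds. */
    double *z = calloc(nb, sizeof(double));
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            z[i] += A_blk[i * nb + j] * x_blk[j];

    /* Step 4: all-to-one sum-reduction along each row into the last
       column, which ends up holding y just as it held x. */
    MPI_Reduce(z, y_blk, nb, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(z);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```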

Matrix-Matrix Multiplication

• Consider the problem of multiplying two n × n dense, square
matrices A and B to yield the product matrix C = A × B.
• The serial complexity is O(n³).

[Figure: the n × n matrices A and B multiply to give the n × n matrix C.]

Matrix-Matrix Multiplication
• A useful concept in this case is called block operations. In this
view, an n × n matrix A can be regarded as a q × q array of blocks
Ai,j (0 ≤ i, j < q) such that each block is an (n/q) × (n/q) submatrix.
• We perform q³ block matrix multiplications, each involving
(n/q) × (n/q) matrices.

[Figure: an n × n matrix viewed as a q × q array of (n/q) × (n/q) blocks.]
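A serial illustration of the block view (our own sketch, assuming the block edge nb divides n, so q = n/nb): the three outer loops enumerate exactly q³ block products.

```c
#include <stddef.h>

/* Blocked (tiled) serial multiply, C += A * B, all n x n row-major;
 * nb is the block edge and must divide n, so q = n/nb. The caller
 * zero-initializes C. The three outer loops enumerate q^3 block
 * products, each an (n/q) x (n/q) matrix multiplication. */
void matmul_blocked(size_t n, size_t nb,
                    const double *A, const double *B, double *C)
{
    size_t q = n / nb;
    for (size_t bi = 0; bi < q; bi++)
        for (size_t bj = 0; bj < q; bj++)
            for (size_t bk = 0; bk < q; bk++)      /* q^3 block products */
                for (size_t i = bi * nb; i < (bi + 1) * nb; i++)
                    for (size_t k = bk * nb; k < (bk + 1) * nb; k++) {
                        double a = A[i * n + k];
                        for (size_t j = bj * nb; j < (bj + 1) * nb; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The total operation count is unchanged: q³ block products of (n/q)³ multiply-adds each gives n³, the same as the plain triple loop.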

Matrix-Matrix Multiplication
• Consider two n × n matrices A and B partitioned into p blocks Ai,j
and Bi,j (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j
of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j
for 0 ≤ k < √p.
• All-to-all broadcasts of the blocks of A along rows and of the
blocks of B along columns are needed (see the sketch below).
• Each process then performs the local submatrix multiplications.
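A minimal MPI sketch of this simple algorithm, under our own assumptions (p a perfect square, √p divides n, blocks pre-distributed as above; all names are ours). The two all-to-all broadcasts map onto MPI_Allgather over row and column sub-communicators, and the gathered copies make the memory overhead plain:

```c
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Simple (memory-hungry) parallel matrix multiply: each of the p ranks,
 * laid out as a q x q grid (q = sqrt(p)), starts with blocks A_blk and
 * B_blk of edge nb = n/q and computes C_blk. After the two Allgathers
 * every rank stores q blocks of A and q blocks of B, which is why the
 * algorithm is not memory optimal. */
void matmul_simple(int n, const double *A_blk, const double *B_blk,
                   double *C_blk, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int nb = n / q, bsz = nb * nb;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, rank / q, rank % q, &row_comm);
    MPI_Comm_split(comm, rank % q, rank / q, &col_comm);

    /* All-to-all broadcast of A blocks along rows, B blocks along columns. */
    double *Arow = malloc((size_t)q * bsz * sizeof(double));
    double *Bcol = malloc((size_t)q * bsz * sizeof(double));
    MPI_Allgather(A_blk, bsz, MPI_DOUBLE, Arow, bsz, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_blk, bsz, MPI_DOUBLE, Bcol, bsz, MPI_DOUBLE, col_comm);

    /* C_blk = sum over k of A_{i,k} * B_{k,j}: sqrt(p) local block products. */
    memset(C_blk, 0, (size_t)bsz * sizeof(double));
    for (int k = 0; k < q; k++) {
        const double *Ak = Arow + (size_t)k * bsz;  /* A_{i,k} */
        const double *Bk = Bcol + (size_t)k * bsz;  /* B_{k,j} */
        for (int i = 0; i < nb; i++)
            for (int t = 0; t < nb; t++) {
                double a = Ak[i * nb + t];
                for (int j = 0; j < nb; j++)
                    C_blk[i * nb + j] += a * Bk[t * nb + j];
            }
    }
    free(Arow); free(Bcol);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```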

Matrix-Matrix Multiplication
• The two all-to-all broadcasts take time
2 (ts log(√p) + tw (n²/p)(√p − 1)).
• The computation requires √p multiplications of (n/√p) × (n/√p)
sized submatrices, taking √p × (n/√p)³ = n³/p time.
• The parallel run time is approximately
TP = n³/p + ts log p + 2 tw n²/√p.
• A major drawback of the algorithm is that it is not memory optimal:
after the broadcasts each process stores 2√p blocks, so the total
memory use grows as Θ(n²√p) rather than Θ(n²).

Matrix-Matrix
Multiplication:
Cannon's Algorithm
• In this algorithm, we
schedule the computations
of the processes of the
ith row such that, at any
given time, each process is
using a different block Ai,k.
• These blocks can be
systematically rotated
among the processes after
every submatrix
multiplication so that every
process gets a fresh Ai,k
after each rotation.

Matrix-Matrix
Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a
way that each process multiplies its
local submatrices. This is done by
shifting all submatrices Ai,j to the left
(with wraparound) by i steps and all
submatrices Bi,j up (with
wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left
and each block of B moves one step
up (again with wraparound).
• Perform the next block multiplication, add to the partial result,
and repeat until all blocks have been multiplied. A minimal MPI
sketch of the whole procedure is given below.
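The sketch below is our own illustration, not code from the slides. It assumes p is a perfect square, q = √p divides n, and rank (i, j) of the grid starts with blocks Ai,j and Bi,j; all names are ours. A periodic Cartesian topology makes each wraparound shift a single MPI_Cart_shift plus MPI_Sendrecv_replace.

```c
#include <mpi.h>
#include <math.h>
#include <string.h>

/* Cannon's algorithm (sketch). A_blk and B_blk are overwritten by the
 * shifts; C_blk receives Ci,j on rank (i, j). */
void matmul_cannon(int n, double *A_blk, double *B_blk, double *C_blk,
                   MPI_Comm comm)
{
    int p, rank, coords[2], src, dst;
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int nb = n / q, bsz = nb * nb;

    int dims[2] = { q, q }, periods[2] = { 1, 1 };      /* torus */
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);             /* (i, j) */

    /* Alignment: shift Ai,j left by i and Bi,j up by j (wraparound). */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                         grid, MPI_STATUS_IGNORE);

    memset(C_blk, 0, (size_t)bsz * sizeof(double));
    for (int step = 0; step < q; step++) {
        /* Local block multiply-accumulate: C_blk += A_blk * B_blk. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++) {
                double a = A_blk[i * nb + k];
                for (int j = 0; j < nb; j++)
                    C_blk[i * nb + j] += a * B_blk[k * nb + j];
            }
        /* Single-step rotation: A one step left, B one step up. */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B_blk, bsz, MPI_DOUBLE, dst, 0, src, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}
```

Because every process holds exactly one block of each matrix at any moment, this formulation avoids the Θ(n²√p) memory growth of the all-to-all broadcast version.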

Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which a
block shifts is √p − 1, the two shift operations require a total of
2 (ts + tw n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of
the algorithm takes ts + tw n²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) × (n/√p) is √p × (n/√p)³ = n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2 tw n²/√p.

Matrix-Matrix Multiplication:
DNS (Dekel, Nassimi, and Sahni) Algorithm

• Uses a 3-D partitioning.


• Visualize the matrix multiplication algorithm as an n × n × n cube:
matrices A and B come in on two orthogonal faces and the result C
comes out on another orthogonal face.
• Each internal node in the cube represents a single add-multiply
operation (and thus the Θ(n³) complexity).
• The DNS algorithm partitions this cube using a 3-D block scheme.

[Figure: the computation cube with axes i, j, and k.]

Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n × n × n mesh of processors.
• Move the columns of A and the rows of B and perform broadcasts.
• Each processor Pi,j,k computes a single product A[i, k] × B[k, j],
a contribution to C[i, j].
• This is followed by an accumulation along the k dimension.
• Since each add-multiply takes constant time and the accumulation
and broadcasts take log n time, the total runtime is Θ(log n).
• This is not cost-optimal. It can be made cost-optimal by using
n / log n processors along the direction of accumulation.

[Figure: the four planes k = 0, 1, 2, 3 of a 4 × 4 × 4 process mesh.]
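The cost-optimality claims can be verified with a short calculation (our own worked check of the slide's figures):

```latex
% Cost (process-time product) with p = n^3 processes:
p \, T_P = n^3 \cdot \Theta(\log n) = \Theta(n^3 \log n)
         > \Theta(n^3) = W
\quad\Rightarrow\quad \text{not cost-optimal.}

% With n/\log n processes along the accumulation (k) direction:
p = n^2 \cdot \frac{n}{\log n} = \frac{n^3}{\log n},
\qquad
p \, T_P = \frac{n^3}{\log n} \cdot \Theta(\log n) = \Theta(n^3) = W
\quad\Rightarrow\quad \text{cost-optimal.}
```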

Matrix-Matrix Multiplication:
DNS Algorithm

The vertical column of processes Pi,j,* computes the dot product of
row A[i, *] and column B[*, j]. Therefore, rows of A and columns of B
need to be moved appropriately so that each vertical column of
processes Pi,j,* has row A[i, *] and column B[*, j]. More precisely,
process Pi,j,k should have A[i, k] and B[k, j].

First, each column of A moves to a different plane such that the j-th
column occupies the same position in the plane corresponding to k = j
as it initially did in the plane corresponding to k = 0.

[Figure: a 4 × 4 process plane labelled with (i, j) coordinates;
the columns of A move from the plane k = 0 to the planes k = j.]

Matrix-Matrix Multiplication:
DNS Algorithm

Now all the columns of A are replicated n times in their respective
planes by a parallel one-to-all broadcast along the j axis: processes
Pi,0,j, Pi,1,j, ..., Pi,n-1,j receive a copy of A[i, j] from Pi,j,j.

At this point, each vertical column of processes Pi,j,* has row
A[i, *]. More precisely, process Pi,j,k has A[i, k].

Matrix-Matrix Multiplication:
DNS Algorithm
For matrix B, the communication steps are similar, but the roles of i
and j in the process subscripts are switched. In the first one-to-one
communication step, B[i, j] is moved from Pi,j,0 to Pi,j,i.

[Figure: the rows of B, labelled with (i, j) coordinates, moving from
the plane k = 0 to the planes k = i.]

Matrix-Matrix Multiplication:
DNS Algorithm
Then it is broadcast from Pi,j,i among P0,j,i, P1,j,i, ..., Pn-1,j,i.

At this point, each vertical column of processes Pi,j,* has column
B[*, j]: process Pi,j,k now has B[k, j], in addition to A[i, k].

[Figure: each B element replicated along the i axis within its plane.]

Matrix-Matrix Multiplication: DNS Algorithm

After these communication steps, A[i, k] and B[k, j] are multiplied at Pi,j,k. Each element C[i, j]
of the product matrix is then obtained by an all-to-one reduction along the k axis: during this step,
process Pi,j,0 accumulates the results of the multiplications from processes Pi,j,1, ..., Pi,j,n-1.
The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows
of B to their respective planes, (2) performing one-to-all broadcasts along the j axis for A and along
the i axis for B, and (3) the all-to-one reduction along the k axis. All these operations are performed
within groups of n processes and take O(log n) time. Thus, the parallel run time for
multiplying two n × n matrices using the DNS algorithm on n³ processes is O(log n).

Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)

• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) × (n/q).
• Each matrix can thus be regarded as a q × q two-dimensional square
array of blocks.
• The algorithm follows from the previous one, except that, in this
case, we operate on blocks rather than on individual elements.

Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)

• The first one-to-one communication step is performed for both A
and B, and takes ts + tw (n/q)² time for each matrix.
• The one-to-all broadcasts take (ts + tw (n/q)²) log q time for
each matrix.
• The reduction takes (ts + tw (n/q)²) log q time.
• Multiplication of (n/q) × (n/q) submatrices takes (n/q)³ = n³/p time.
• With q = p^(1/3), the parallel time is approximated by:
TP ≈ n³/p + (ts + tw n²/p^(2/3)) log p.
A block-level MPI sketch of the whole procedure follows.
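The sketch below is our own illustration of the block DNS procedure, not code from the slides. It assumes p = q³ ranks with q dividing n, that rank (i, j, 0) initially holds Ai,j and Bi,j, and that A_blk/B_blk are buffers on every rank; communicator and variable names are ours.

```c
#include <mpi.h>
#include <stdlib.h>

/* DNS with p = q^3 processes, block version (sketch). Ci,j is
 * returned in C_blk on the plane k = 0. */
void matmul_dns(int n, int q, double *A_blk, double *B_blk, double *C_blk,
                MPI_Comm comm)
{
    int nb = n / q, bsz = nb * nb;
    int dims[3] = { q, q, q }, periods[3] = { 0, 0, 0 };
    MPI_Comm grid, jcomm, icomm, kcomm;
    MPI_Cart_create(comm, 3, dims, periods, 0, &grid);

    int rank, c[3];
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 3, c);
    int i = c[0], j = c[1], k = c[2];

    int keep_j[3] = {0, 1, 0}, keep_i[3] = {1, 0, 0}, keep_k[3] = {0, 0, 1};
    MPI_Cart_sub(grid, keep_j, &jcomm);   /* fixed (i, k), varying j */
    MPI_Cart_sub(grid, keep_i, &icomm);   /* fixed (j, k), varying i */
    MPI_Cart_sub(grid, keep_k, &kcomm);   /* fixed (i, j), varying k */

    int peer, pc[3];
    /* Step 1a: one-to-one move of Ai,j from plane k = 0 to plane k = j. */
    if (k == 0 && j != 0) {
        pc[0] = i; pc[1] = j; pc[2] = j;
        MPI_Cart_rank(grid, pc, &peer);
        MPI_Send(A_blk, bsz, MPI_DOUBLE, peer, 0, grid);
    } else if (k == j && j != 0) {
        pc[0] = i; pc[1] = j; pc[2] = 0;
        MPI_Cart_rank(grid, pc, &peer);
        MPI_Recv(A_blk, bsz, MPI_DOUBLE, peer, 0, grid, MPI_STATUS_IGNORE);
    }
    /* Step 1b: likewise Bi,j moves to plane k = i. */
    if (k == 0 && i != 0) {
        pc[0] = i; pc[1] = j; pc[2] = i;
        MPI_Cart_rank(grid, pc, &peer);
        MPI_Send(B_blk, bsz, MPI_DOUBLE, peer, 1, grid);
    } else if (k == i && i != 0) {
        pc[0] = i; pc[1] = j; pc[2] = 0;
        MPI_Cart_rank(grid, pc, &peer);
        MPI_Recv(B_blk, bsz, MPI_DOUBLE, peer, 1, grid, MPI_STATUS_IGNORE);
    }

    /* Step 2: one-to-all broadcasts. Afterwards P(i,j,k) holds Ai,k
       (root of jcomm: the member with j == k) and Bk,j (root of
       icomm: the member with i == k). */
    MPI_Bcast(A_blk, bsz, MPI_DOUBLE, k, jcomm);
    MPI_Bcast(B_blk, bsz, MPI_DOUBLE, k, icomm);

    /* Step 3: one local block product per process: Ai,k * Bk,j. */
    double *tmp = calloc(bsz, sizeof(double));
    for (int r = 0; r < nb; r++)
        for (int t = 0; t < nb; t++) {
            double a = A_blk[r * nb + t];
            for (int s = 0; s < nb; s++)
                tmp[r * nb + s] += a * B_blk[t * nb + s];
        }

    /* Step 4: all-to-one sum-reduction along k back to plane k = 0. */
    MPI_Reduce(tmp, C_blk, bsz, MPI_DOUBLE, MPI_SUM, 0, kcomm);

    free(tmp);
    MPI_Comm_free(&jcomm); MPI_Comm_free(&icomm); MPI_Comm_free(&kcomm);
    MPI_Comm_free(&grid);
}
```

Setting q = n (nb = 1) recovers the element-wise version analyzed on the previous slides.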
