High Performance Computing: Matrix Multiplication
Lecture 11:
Matrix Algorithms
Reference:
“Introduction to Parallel Computing” – Chapter 8.
Topic Overview
• Performance Analysis
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitions for parallel processing.
• The run-time performance of such algorithms depends on
the overhead incurred relative to the computation workload.
• As a rule of thumb, good speedup can be achieved if the
computation granularity is large enough to outweigh
overheads such as communication cost, consolidation cost
(algorithm penalty), data packaging, etc.
Matrix-Vector Multiplication
[Figure: multiplication of an n × n matrix A by an n × 1 vector x.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n × n matrix A is partitioned among n processors,
with each processor storing a complete row of the
matrix.
• The n × 1 vector x is distributed such that each process
owns one of its elements.
[Figure: matrix A, vector x, and result y, with one row of A and one element of x per process.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p = n)
• An all-to-all broadcast of the vector elements gives every
process all of x; each process then computes one element of
y = Ax by a dot product with its row.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p < n)
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
• With m = n/p, the all-to-all broadcast of the vector
elements takes time t_s log p + t_w (n/p)(p − 1)
≈ t_s log p + t_w n.
• Thus the parallel run time of matrix-vector multiplication
based on rowwise 1-D partitioning (p < n) is
T_P = n²/p + t_s log p + t_w n.
• This is cost-optimal.
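A minimal MPI sketch of this scheme (an illustrative assumption, not code from the reference text): each rank holds n/p rows of A and n/p elements of x, MPI_Allgather plays the role of the all-to-all broadcast, and the function name mat_vec_1d plus the requirement that p divides n are choices made here.

#include <mpi.h>
#include <stdlib.h>

/* y = A*x with rowwise 1-D partitioning (p < n), n divisible by p.
   A_local holds n/p rows of A (row-major); x_local and y_local hold
   n/p elements each. */
void mat_vec_1d(int n, const double *A_local, double *x_local,
                double *y_local, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* All-to-all broadcast: every rank assembles the full vector x. */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(x_local, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Local computation: an (n/p) x n submatrix times the full vector. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * x[j];
        y_local[i] = sum;
    }
    free(x);
}

The MPI_Allgather call corresponds to the t_s log p + t_w n term above; the double loop is the n²/p local computation.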
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• The n × n matrix is partitioned among n² processors such
that each processor owns a single element.
• The n × 1 vector x is distributed only in the last column of
n processors.
[Figure: 2-D partitioning of the n × n matrix among p = n² processors, with x stored in the last processor column.]
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• We must first align the vector
with the matrix appropriately.
• The first communication step
for the 2-D partitioning aligns
the vector x along the
principal diagonal of the
matrix.
• The second step copies the
vector elements from each
diagonal process to all the
processes in the
corresponding column using
n simultaneous broadcasts
among all processors in the
column.
• Finally, the result vector is
computed by performing an
all-to-one reduction along the
columns.
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• Three basic communication operations are used in this algorithm: one-to-one
communication to align the vector along the main diagonal, one-to-all broadcast
of each vector element among the n processes of each column, and all-to-one
reduction in each row.
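Putting these three operations together gives the run time for p = n² (a sketch of the standard analysis, stated here for completeness rather than taken from the slides): the alignment is a single one-to-one message, and the broadcast and reduction each run among groups of n processes, so
T_P = Θ(1) + Θ(log n) + Θ(log n) + Θ(1) = Θ(log n),
and the process-time product is Θ(n² log n), which is not cost-optimal.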
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• When using fewer than n² processors, each process owns an
(n/√p) × (n/√p) block of the matrix; p (i.e., √p × √p)
processors are used.
• The vector is distributed in portions of n/√p elements in the
last process-column only.
• In this case, the message sizes for the alignment, broadcast,
and reduction are all n/√p.
• The computation is a product of an (n/√p) × (n/√p) submatrix
with a vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• The first alignment step takes time t_s + t_w n/√p.
• The broadcast and reduction each take time
(t_s + t_w n/√p) log √p.
• The total parallel run time is
T_P = n²/p + t_s log p + t_w (n/√p) log p.
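A minimal MPI sketch of the three phases (alignment, columnwise broadcast, rowwise reduction), assuming p is a perfect square, √p divides n, and ranks are laid out row-major on a √p × √p grid; the function name mat_vec_2d and the use of MPI_Comm_split are illustrative choices, not from the reference text.

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* 2-D partitioned y = A*x (p < n^2).  A_local is the (n/√p) x (n/√p)
   block owned by this rank; x_part/y_part are length-n/√p pieces,
   meaningful in the last process column. */
void mat_vec_2d(int n, const double *A_local, double *x_part,
                double *y_part, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);     /* grid is q x q */
    int row = rank / q, col = rank % q, nloc = n / q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm); /* same-row ranks    */
    MPI_Comm_split(comm, col, row, &col_comm); /* same-column ranks */

    /* Phase 1: align x with the diagonal -- the last column sends its
       piece to the diagonal process of the same row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(x_part, nloc, MPI_DOUBLE, row, 0, row_comm);
    if (row == col && col != q - 1)
        MPI_Recv(x_part, nloc, MPI_DOUBLE, q - 1, 0, row_comm,
                 MPI_STATUS_IGNORE);

    /* Phase 2: broadcast each diagonal piece down its column
       (the diagonal process of column c sits in row c). */
    MPI_Bcast(x_part, nloc, MPI_DOUBLE, col, col_comm);

    /* Local computation: (n/√p) x (n/√p) block times its piece of x. */
    double *partial = calloc(nloc, sizeof(double));
    for (int i = 0; i < nloc; i++)
        for (int j = 0; j < nloc; j++)
            partial[i] += A_local[i * nloc + j] * x_part[j];

    /* Phase 3: all-to-one reduction along each row into the last column. */
    MPI_Reduce(partial, y_part, nloc, MPI_DOUBLE, MPI_SUM, q - 1, row_comm);

    free(partial);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}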
Matrix-Matrix Multiplication
[Figure: C = A × B for n × n matrices A, B, and C.]
Matrix-Matrix Multiplication
• A useful concept in this case is block operations. In this
view, an n × n matrix A can be regarded as a q × q array of
blocks A_{i,j} (0 ≤ i, j < q) such that each block is an
(n/q) × (n/q) submatrix.
• We then perform q³ block multiplications, each involving
(n/q) × (n/q) matrices, as illustrated in the sketch below.
[Figure: an n × n matrix viewed as a q × q array of (n/q) × (n/q) blocks.]
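As a concrete serial illustration of the block view (the function name block_matmul is assumed here, C is expected to be zero-initialized, and q is assumed to divide n):

/* Serial block matrix multiplication: the three outer loops perform
   the q^3 block multiplications, each on (n/q) x (n/q) submatrices.
   A, B, C are n x n in row-major order. */
void block_matmul(int n, int q, const double *A, const double *B,
                  double *C)
{
    int b = n / q;                       /* block size n/q */
    for (int bi = 0; bi < q; bi++)
      for (int bj = 0; bj < q; bj++)
        for (int bk = 0; bk < q; bk++)
          /* C_{bi,bj} += A_{bi,bk} * B_{bk,bj} */
          for (int i = bi * b; i < (bi + 1) * b; i++)
            for (int j = bj * b; j < (bj + 1) * b; j++) {
              double sum = 0.0;
              for (int k = bk * b; k < (bk + 1) * b; k++)
                sum += A[i * n + k] * B[k * n + j];
              C[i * n + j] += sum;
            }
}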
Matrix-Matrix Multiplication
• Consider two n × n matrices A and B partitioned into p
blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size
(n/√p) × (n/√p) each.
• Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes
block C_{i,j} of the result matrix.
• Computing submatrix C_{i,j} requires all submatrices A_{i,k}
and B_{k,j} for 0 ≤ k < √p.
• All-to-all broadcasts of the blocks of A along rows and of B
along columns are needed.
• Each process then performs the local submatrix
multiplications.
Matrix-Matrix Multiplication
• The two all-to-all broadcasts take time
2(t_s log √p + t_w (n²/p)(√p − 1)) ≈ t_s log p + 2 t_w n²/√p.
• The parallel run time is therefore
T_P = n³/p + t_s log p + 2 t_w n²/√p.
• This is cost-optimal, but after the broadcasts each process
stores √p blocks of A and of B (Θ(n²√p) memory in total),
which motivates Cannon's algorithm.
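A hedged MPI sketch of this algorithm, under the same grid assumptions as before (perfect-square p, √p dividing n, row-major rank layout); simple_matmul and block_mul_add are names chosen here, and C_loc must be zero-initialized:

#include <mpi.h>
#include <math.h>
#include <stdlib.h>

/* C_loc += A_blk * B_blk for b x b blocks in row-major order. */
void block_mul_add(int b, const double *A_blk, const double *B_blk,
                   double *C_loc)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++) {
            double sum = 0.0;
            for (int k = 0; k < b; k++)
                sum += A_blk[i * b + k] * B_blk[k * b + j];
            C_loc[i * b + j] += sum;
        }
}

void simple_matmul(int n, double *A_loc, double *B_loc, double *C_loc,
                   MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q, bs = b * b;
    int row = rank / q, col = rank % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);
    MPI_Comm_split(comm, col, row, &col_comm);

    /* All-to-all broadcasts: gather the whole block-row of A and the
       whole block-column of B at every process. */
    double *A_row = malloc((size_t)q * bs * sizeof(double));
    double *B_col = malloc((size_t)q * bs * sizeof(double));
    MPI_Allgather(A_loc, bs, MPI_DOUBLE, A_row, bs, MPI_DOUBLE, row_comm);
    MPI_Allgather(B_loc, bs, MPI_DOUBLE, B_col, bs, MPI_DOUBLE, col_comm);

    /* C_{i,j} = sum_k A_{i,k} * B_{k,j} over the √p gathered pairs. */
    for (int k = 0; k < q; k++)
        block_mul_add(b, A_row + (size_t)k * bs, B_col + (size_t)k * bs,
                      C_loc);

    free(A_row); free(B_col);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}

The two gathered buffers of √p blocks each are exactly the memory overhead noted above.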
Matrix-Matrix
Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the √p
processes of the i-th row such that, at any given time, each
process is using a different block A_{i,k}.
• These blocks can be systematically rotated among the
processes after every submatrix multiplication so that every
process gets a fresh A_{i,k} after each rotation.
Matrix-Matrix
Multiplication:
Cannon's Algorithm
• Align the blocks of A and B so that each process can multiply
its local submatrices. This is done by shifting all submatrices
A_{i,j} to the left (with wraparound) by i steps and all
submatrices B_{i,j} up (with wraparound) by j steps.
• Perform a local block multiplication.
• Shift each block of A one step left and each block of B one
step up (again with wraparound).
• Perform the next block multiplication, add it to the partial
result, and repeat until all √p blocks have been multiplied
(see the MPI sketch after this list).
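A minimal MPI sketch of Cannon's algorithm under the same assumptions (perfect-square p, √p dividing n, zero-initialized C_loc); it reuses block_mul_add from the previous sketch and relies on a periodic Cartesian communicator for the wraparound shifts:

#include <mpi.h>
#include <math.h>

void block_mul_add(int b, const double *A_blk, const double *B_blk,
                   double *C_loc);     /* from the previous sketch */

void cannon(int n, double *A_loc, double *B_loc, double *C_loc,
            MPI_Comm comm)
{
    int p, rank, coords[2];
    int dims[2], periods[2] = {1, 1};
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);
    int b = n / q, bs = b * b;
    dims[0] = dims[1] = q;

    /* Periodic q x q grid so that all shifts wrap around. */
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int row = coords[0], col = coords[1];
    int left, right, up, down;

    /* Initial alignment: A_{i,j} left by i steps, B_{i,j} up by j steps. */
    MPI_Cart_shift(grid, 1, -row, &right, &left);
    MPI_Sendrecv_replace(A_loc, bs, MPI_DOUBLE, left, 0, right, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -col, &down, &up);
    MPI_Sendrecv_replace(B_loc, bs, MPI_DOUBLE, up, 0, down, 0,
                         grid, MPI_STATUS_IGNORE);

    /* Compute-and-shift: √p rounds of multiply plus single-step shifts. */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < q; step++) {
        block_mul_add(b, A_loc, B_loc, C_loc); /* C_loc += A_loc*B_loc */
        MPI_Sendrecv_replace(A_loc, bs, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_loc, bs, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}

Each MPI_Sendrecv_replace round corresponds to one t_s + t_w n²/p shift in the analysis that follows.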
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over which
a block shifts is √p − 1, the two shift operations require a
total of 2(t_s + t_w n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift
phase of the algorithm takes t_s + t_w n²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) × (n/√p) is n³/p (i.e., √p × (n/√p)³).
• The parallel run time is therefore
T_P = n³/p + 2√p t_s + 2 t_w n²/√p,
which is cost-optimal and, unlike the previous algorithm,
needs only a constant number of blocks per process.
Matrix-Matrix Multiplication:
DNS (Dekel, Nassimi, and Sahni) Algorithm
• Assume an n × n × n mesh of processors.
• Move the columns of A and the rows of B into position and
perform one-to-all broadcasts.
• Each processor P_{i,j,k} computes a single multiplication,
A[i,k] × B[k,j], a partial contribution to C[i,j].
• This is followed by an accumulation along the k dimension.
• Since each add-multiply takes constant time and the
accumulation and broadcasts take log n time, the total run
time is Θ(log n).
• This is not cost-optimal. It can be made cost-optimal by
using n/log n processors along the direction of accumulation.
[Figure: the n × n × n process mesh drawn as planes k = 0 through k = 3.]
Matrix-Matrix Multiplication:
DNS Algorithm
[Figures: the communication steps of the DNS algorithm on the
n × n × n mesh. A[i,j] is moved from P_{i,j,0} to P_{i,j,j} and
B[i,j] from P_{i,j,0} to P_{i,j,i}; B[i,j] is then broadcast from
P_{i,j,i} among P_{0,j,i}, ..., P_{n-1,j,i}, and A[i,j] analogously
along the j axis.]
Matrix-Matrix Multiplication: DNS Algorithm
After these communication steps, A[i,k] and B[k,j] are multiplied at P_{i,j,k}. Now each element C[i,j]
of the product matrix is obtained by an all-to-one reduction along the k axis. During this step,
process P_{i,j,0} accumulates the results of the multiplication from processes P_{i,j,1}, ..., P_{i,j,n-1}.
The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows
of B to their respective planes, (2) performing one-to-all broadcast along the j axis for A and along
the i axis for B, and (3) all-to-one reduction along the k axis. All these operations are performed
within groups of n processes and take time O(log n). Thus, the parallel run time for
multiplying two n × n matrices using the DNS algorithm on n³ processes is O(log n).
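To make the data flow concrete, here is a serial C simulation of the n³ virtual processes (illustrative only: dns_simulate and the fixed size N are assumptions, and no real communication occurs; partial[i][j][k] stands for the value held at P_{i,j,k} after the broadcasts):

#define N 4

/* After the communication phases, P_{i,j,k} holds A[i][k] and B[k][j];
   the final k-loop models the all-to-one reduction into P_{i,j,0}. */
void dns_simulate(const double A[N][N], const double B[N][N],
                  double C[N][N])
{
    double partial[N][N][N];

    /* Each virtual process performs its single multiplication. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                partial[i][j][k] = A[i][k] * B[k][j];

    /* All-to-one reduction along the k axis. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += partial[i][j][k];
        }
}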
Matrix-Matrix Multiplication:
DNS Algorithm (Using fewer than n³ processors)
• The same scheme applies with p = q³ < n³ processes arranged
in a q × q × q mesh, each process owning an (n/q) × (n/q)
block of A and B; the moves, broadcasts, and reductions then
operate on blocks rather than single elements.