Class 18 - Linear Algebra II Handout
High Performance Linear Algebra II
M. D. Jones, Ph.D.
Center for Computational Research
University at Buffalo
State University of New York
High Performance Computing I, 2012
Introduction
Dense & Parallel
In this topic we will still discuss dense (as opposed to sparse) matrices, but now focus on parallel execution.
Best performance (and scalability) most often comes down to making the best use of the (local) Level 3 BLAS
Optimized BLAS are generally crucial for achieving performance gains
Introduction
BLAS Recapitulated
BLAS are critical:
Foundation for all (computational) linear algebra
Parallel versions still call serial versions on each processor
Performance goes up with an increasing ratio of floating-point operations to memory references
Well-tuned BLAS take maximum advantage of the memory hierarchy (a single-call DGEMM sketch follows the table below)
BLAS Level   Calculation   Memory Refs.   Flop Count   Ratio (Flop/Mem)
1            DDOT          3N             2N           2/3
2            DGEMV         N^2            2N^2         2
3            DGEMM         4N^2           2N^3         N/2
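The Level 3 row of this table is exactly the operation a single optimized DGEMM call performs. A hedged illustration follows: the wrapper name is mine, and the CBLAS interface is just one of several ways to reach DGEMM (ATLAS, GotoBLAS/OpenBLAS, MKL, ACML, etc. all provide it).

#include <cblas.h>

/* C = A*B for N x N row-major matrices via one Level 3 BLAS call;
   alpha = 1 and beta = 0 give the plain product. */
void matmul_blas(int N, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,          /* M, N, K       */
                1.0, A, N,        /* alpha, A, lda */
                B, N,             /* B, ldb        */
                0.0, C, N);       /* beta, C, ldc  */
}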
Introduction
Simple Memory Model
Let us assume just two levels of memory, slow and fast. Let
m = number of memory elements (words) moved between fast and slow memory,
τ_m = time per slow memory operation,
f_op = number of floating-point operations,
τ_f = time per floating-point operation,
and q = f_op/m. The minimum time is just f_op τ_f, but more realistically:

time ≈ f_op τ_f + m τ_m = f_op τ_f [ 1 + τ_m/(q τ_f) ] .

Key observation: we need optimal reuse of data, i.e. larger q (certainly q ≥ τ_m/τ_f).
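As a quick worked check of this model against the BLAS table above (my arithmetic, not the handout's): for DGEMM (Level 3) the table gives q = 2N^3/(4N^2) = N/2, so

time ≈ 2N^3 τ_f [ 1 + 2 τ_m/(N τ_f) ] ,

and the slow-memory penalty shrinks as N grows; for DDOT (Level 1), q = 2/3 is fixed, so the τ_m/(q τ_f) term never becomes small no matter how large the vectors are.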
Matrix Multiplication Sequential
Simple Matrix Multiplication
Everyone has probably written this out at least once ...
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    C[i][j] = 0.0;
    for (k=0; k<N; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
in order to multiply two matrices, A and B, and store the result in a
third, C.
Can use a temporary scalar to avoid dereferencing in the innermost loop (see the sketch below)
O(N^3) multiply-add operations
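A sketch of the temporary-scalar variant mentioned above (same free variables as the fragment above): the running sum is kept in a scalar, likely register-resident, instead of repeatedly dereferencing C[i][j] in the innermost loop.

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    double tmp = 0.0;              /* accumulate in a scalar            */
    for (k=0; k<N; k++) {
      tmp += A[i][k] * B[k][j];    /* no store to C inside the k loop   */
    }
    C[i][j] = tmp;                 /* single store per (i,j)            */
  }
}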
Matrix Multiplication Partitioning
Partitioning
In this case, partitioning is also known as block matrix
multiplication:
Divide into s^2 submatrices, with N/s × N/s elements in each submatrix
Let m = N/s, and
for (p=0; p<s; p++) {
  for (q=0; q<s; q++) {
    C_{p,q} = 0.0;
    for (r=0; r<s; r++) {
      C_{p,q} += A_{p,r} * B_{r,q};
    }
  }
}
where the A_{p,r}, etc., are themselves matrices (a concrete C sketch follows below).
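A minimal C sketch of the same partitioned product for flat row-major arrays, assuming N is divisible by the block size BS (the function name and BS are my choices, not the handout's). The three outer loops run over blocks exactly as in the pseudocode above; the inner triple loop is where a call to an optimized DGEMM would normally go.

void matmul_blocked(int N, int BS, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N*N; i++) C[i] = 0.0;
    for (int p = 0; p < N; p += BS)            /* block row of C    */
        for (int q = 0; q < N; q += BS)        /* block column of C */
            for (int r = 0; r < N; r += BS)    /* inner block index */
                /* multiply one BS x BS block pair and accumulate */
                for (int i = p; i < p+BS; i++)
                    for (int k = r; k < r+BS; k++) {
                        double a = A[i*N + k];
                        for (int j = q; j < q+BS; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}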
Matrix Multiplication Divide & Conquer
Recursive/Divide & Conquer
Let N be a power of 2 (not strictly necessary) and divide A and B into
4 submatrices, delimited by:
A = | A_pp  A_pq |
    | A_qp  A_qq | .
Then the solution requires 8 pairs of submatrix multiplications, which
can be done recursively ...
Matrix Multiplication Divide & Conquer
matmultR( A, B, s ) {
  if (s == 1) {
    C = A*B;                          /* termination condition */
  }
  else {
    s = s/2;
    P0 = matmultR( A_{pp}, B_{pp}, s );
    P1 = matmultR( A_{pq}, B_{qp}, s );
    P2 = matmultR( A_{pp}, B_{pq}, s );
    P3 = matmultR( A_{pq}, B_{qq}, s );
    P4 = matmultR( A_{qp}, B_{pp}, s );
    P5 = matmultR( A_{qq}, B_{qp}, s );
    P6 = matmultR( A_{qp}, B_{pq}, s );
    P7 = matmultR( A_{qq}, B_{qq}, s );
    C_{pp} = P0 + P1;
    C_{pq} = P2 + P3;
    C_{qp} = P4 + P5;
    C_{qq} = P6 + P7;
  }
  return(C);
}
Matrix Multiplication Divide & Conquer
Well suited for an SMP with a cache hierarchy
Size of the data is continually reduced and localized
Can perform very well when making maximum reuse of data within the cache memory hierarchy
More generally we want to address cases for which the matrix will not fit within the memory of a single machine, so a message passing/distributed model is required
Matrix Multiplication Cannon's Algorithm
Cannon's Algorithm
Cannon's algorithm:
Uses a mesh of processors with a torus topology (periodic boundary conditions)
Elements are shifted to an aligned position: row i of A is shifted i places to the left, while column j of B is shifted j spots upward. This puts A_{i,j+i} and B_{i+j,j} in processor P_{i,j} (namely, we have the appropriate submatrices/elements to multiply in P_{i,j})
Usually submatrices are used, but we can use elements to simplify the notation a bit
Matrix Multiplication Cannon's Algorithm
After alignment, Cannon's algorithm proceeds as follows:
Each process P_{i,j} multiplies its elements
Row i of A is shifted one place left, and column j of B is shifted one place up (this brings together the adjacent elements of A and B needed for summing results)
Each process P_{i,j} multiplies its elements and adds to the accumulating sum
The preceding two steps are repeated until the results are obtained (N - 1 shifts); an MPI sketch follows below
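A hedged MPI sketch of the whole procedure (alignment followed by the multiply-and-shift steps) at block granularity. The function and helper names are my own; it assumes the caller built a square s × s periodic Cartesian communicator (e.g. with MPI_Cart_create and periods = {1,1}), and that each rank owns one m × m row-major block of A and B plus a zero-initialized block of C.

#include <mpi.h>

/* Multiply-accumulate of two m x m row-major blocks: C += A*B */
static void local_multiply_add(int m, const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++) {
            double a = A[i*m + k];
            for (int j = 0; j < m; j++)
                C[i*m + j] += a * B[k*m + j];
        }
}

/* Cannon's algorithm on an s x s periodic process grid ("grid" is assumed
   to come from MPI_Cart_create with dims = {s,s}, periods = {1,1}). */
void cannon(int m, double *A, double *B, double *C, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2], src, dst, left, right, up, down;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int s = dims[0];

    /* initial alignment: row i of A shifts i places left,
                          column j of B shifts j places up */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, m*m, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

    /* nearest neighbours for the per-step shifts */
    MPI_Cart_shift(grid, 1, -1, &right, &left);   /* A moves one place left */
    MPI_Cart_shift(grid, 0, -1, &down,  &up);     /* B moves one place up   */

    for (int step = 0; step < s; step++) {
        local_multiply_add(m, A, B, C);           /* multiply local blocks, accumulate */
        if (step < s - 1) {                       /* s-1 shifts in total */
            MPI_Sendrecv_replace(A, m*m, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, up,   1, down,  1, grid, MPI_STATUS_IGNORE);
        }
    }
}

MPI_Sendrecv_replace keeps each shift in place without extra buffers, which is one reasonable way to realize the shift pattern costed on the following slides.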
Matrix Multiplication Cannon's Algorithm
Cannon Illustrated
[Figure: Cannon's algorithm, showing the flow of data for process P_{ij}: blocks of A shift left along row i and blocks of B shift up along column j.]
Matrix Multiplication Cannon's Algorithm
For s submatrices (m = N/s) the communication time for Cannon's algorithm is given by

τ_comm = 4(s - 1)(τ_lat + m^2 τ_dat) ,

where τ_lat is the latency and τ_dat is the time required to send one value (word). The computation time in Cannon's algorithm:

τ_comp = 2 s m^3 = 2 m^2 N ,

or O(m^2 N).

Cannon's algorithm is also known as the ScaLAPACK outer product algorithm (can you see another numerical library coming?) ...
Matrix Multiplication 2D Pipeline
2D Pipeline
Another useful algorithm is the so-called 2D pipeline, in which the data flows into a rectangular array of processors. Labeling the process grid using (0,0) as the top left corner, the data flows from the left and from the top, with process P_{i,j}:
1. RECV(A from P_{i,j-1})   { A data from the left }
2. RECV(B from P_{i-1,j})   { B data from above }
3. Accumulate C_{i,j}
4. SEND(A to P_{i,j+1})   { A data to the right }
5. SEND(B to P_{i+1,j})   { B data downward }
(illustrated in the sketch and figure below)
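A hedged sketch, at element granularity, of the five steps above for one interior process (the function name is mine). The neighbour ranks are assumed to come from a 2D Cartesian communicator; edge processes would inject the staggered A and B streams (or discard outgoing values) instead of calling MPI_Recv/MPI_Send here.

#include <mpi.h>

double pipeline_element(int N, int left, int right, int up, int down, MPI_Comm comm)
{
    double c = 0.0, a, b;
    for (int step = 0; step < N; step++) {
        MPI_Recv(&a, 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);  /* 1: A from the left */
        MPI_Recv(&b, 1, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);  /* 2: B from above    */
        c += a * b;                                                     /* 3: accumulate C_ij */
        MPI_Send(&a, 1, MPI_DOUBLE, right, 0, comm);                    /* 4: A to the right  */
        MPI_Send(&b, 1, MPI_DOUBLE, down,  1, comm);                    /* 5: B downward      */
    }
    return c;
}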
Matrix Multiplication 2D Pipeline
2D Pipeline Illustrated
[Figure: 2D pipeline (or systolic array) algorithm showing the flow of data for a 4 × 4 example: row i of A streams in from the left and column j of B streams in from the top, with a 1-cycle delay between successive rows/columns, meeting at process P_{ij}.]
Matrix Multiplication Strassen's Method
Strassen's Method
The techniques discussed thus far are fundamentally O(N^3), but there is a clever method due to Strassen^1 which is O(N^2.81):

A = | A_11  A_12 |    B = | B_11  B_12 |    C = | C_11  C_12 |
    | A_21  A_22 |        | B_21  B_22 |        | C_21  C_22 |

Q_1 = (A_11 + A_22)(B_11 + B_22),    C_11 = Q_1 + Q_4 - Q_5 + Q_7
Q_2 = (A_21 + A_22) B_11,            C_12 = Q_3 + Q_5
Q_3 = A_11 (B_12 - B_22),            C_21 = Q_2 + Q_4
Q_4 = A_22 (B_21 - B_11),            C_22 = Q_1 + Q_3 - Q_2 + Q_6
Q_5 = (A_11 + A_12) B_22,
Q_6 = (A_21 - A_11)(B_11 + B_12),
Q_7 = (A_12 - A_22)(B_21 + B_22)

Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds.

^1 V. Strassen, Gaussian Elimination Is Not Optimal, Numerische Mathematik, 13, 353-356 (1969).
Matrix Multiplication Strassen's Method
If N is not a power of 2, one can pad with zeros
Reiterate until the submatrices are reduced to numbers, or to the optimal matrix-multiply size for a particular processor (Strassen switches to standard matrix multiplication for small enough submatrices)
Eliminates one matrix multiply per level of recursion: instead of O(N^{log_2 8}) = O(N^3) we get O(N^{log_2 7}) = O(N^2.81)
Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds; Strassen takes 7 N^{log_2 7} - 6 N^2
Requires additional storage for intermediate matrices, and can be less stable numerically (a serial sketch follows below) ...
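A minimal serial sketch under those assumptions (square matrices, N a power of two, row-major storage with leading dimension ld); CUTOFF, the helper names, and the temporary-buffer layout are illustrative choices, not from the handout, and the seven recursive products follow the Q_1 ... Q_7 identities given earlier.

#include <stdlib.h>

#define CUTOFF 64   /* arbitrary switch-over to the standard triple loop */

static void matmul_std(int n, const double *A, int lda,
                       const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

/* Z = X + sign*Y on n x n blocks (Z may alias X) */
static void addb(int n, const double *X, int ldx, const double *Y, int ldy,
                 double *Z, int ldz, double sign)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i*ldz + j] = X[i*ldx + j] + sign * Y[i*ldy + j];
}

void strassen(int n, const double *A, int lda,
              const double *B, int ldb, double *C, int ldc)
{
    if (n <= CUTOFF) { matmul_std(n, A, lda, B, ldb, C, ldc); return; }

    int h = n / 2;
    const double *A11 = A,         *A12 = A + h,
                 *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B,         *B12 = B + h,
                 *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C,         *C12 = C + h,
           *C21 = C + h*ldc, *C22 = C + h*ldc + h;

    double *T1 = malloc((size_t)h*h*sizeof *T1);
    double *T2 = malloc((size_t)h*h*sizeof *T2);
    double *Q[7];
    for (int k = 0; k < 7; k++) Q[k] = malloc((size_t)h*h*sizeof *Q[k]);

    addb(h, A11, lda, A22, lda, T1, h, +1.0);
    addb(h, B11, ldb, B22, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[0], h);           /* Q1 = (A11+A22)(B11+B22) */
    addb(h, A21, lda, A22, lda, T1, h, +1.0);
    strassen(h, T1, h, B11, ldb, Q[1], h);        /* Q2 = (A21+A22) B11      */
    addb(h, B12, ldb, B22, ldb, T2, h, -1.0);
    strassen(h, A11, lda, T2, h, Q[2], h);        /* Q3 = A11 (B12-B22)      */
    addb(h, B21, ldb, B11, ldb, T2, h, -1.0);
    strassen(h, A22, lda, T2, h, Q[3], h);        /* Q4 = A22 (B21-B11)      */
    addb(h, A11, lda, A12, lda, T1, h, +1.0);
    strassen(h, T1, h, B22, ldb, Q[4], h);        /* Q5 = (A11+A12) B22      */
    addb(h, A21, lda, A11, lda, T1, h, -1.0);
    addb(h, B11, ldb, B12, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[5], h);           /* Q6 = (A21-A11)(B11+B12) */
    addb(h, A12, lda, A22, lda, T1, h, -1.0);
    addb(h, B21, ldb, B22, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[6], h);           /* Q7 = (A12-A22)(B21+B22) */

    addb(h, Q[0], h, Q[3], h, C11, ldc, +1.0);    /* C11 = Q1+Q4-Q5+Q7 */
    addb(h, C11, ldc, Q[4], h, C11, ldc, -1.0);
    addb(h, C11, ldc, Q[6], h, C11, ldc, +1.0);
    addb(h, Q[2], h, Q[4], h, C12, ldc, +1.0);    /* C12 = Q3+Q5       */
    addb(h, Q[1], h, Q[3], h, C21, ldc, +1.0);    /* C21 = Q2+Q4       */
    addb(h, Q[0], h, Q[1], h, C22, ldc, -1.0);    /* C22 = Q1-Q2+Q3+Q6 */
    addb(h, C22, ldc, Q[2], h, C22, ldc, +1.0);
    addb(h, C22, ldc, Q[5], h, C22, ldc, +1.0);

    free(T1); free(T2);
    for (int k = 0; k < 7; k++) free(Q[k]);
}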
LU Decomposition in Parallel
Sequential LU/Gaussian Elimination Code
LU code fragment:
for (k=1; k<N; k++) {
  for (i=k+1; i<=N; i++) {
    L(i,k) = A(i,k) / A(k,k)
  }
  for (j=k+1; j<=N; j++) {
    for (i=k+1; i<=N; i++) {
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    }
  }
}
Note that the inner loops (over i ) have no dependencies, i.e. they can
more easily be executed in parallel.
LU Decomposition in Parallel
Parallel LU/Gaussian Elimination
One strategy for parallelizing the Gaussian elimination part of LU decomposition is to do so over the middle loop (j) in the algorithm. Decomposing the columns, and using a message passing implementation, we might have something like the following:
LU Decomposition in Parallel
do k=1,N-1
  if ( column k is mine ) then            ! column-wise decomposition
    do i=k+1,N
      L(i,k) = A(i,k) / A(k,k)
    end do
    BCAST( L(k+1:N,k), root = owner of k )
  else
    RECV( L(k+1:N,k) )
  end if
  do j=k+1,N ( modulo I own column j )    ! column-wise decomposition
    do i=k+1,N
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    end do
  end do
end do
LU Decomposition in Parallel
Parallel LU Efficiency
If we do a (not overly crude) analysis of the preceding parallel algorithm, for each column k we have:
Broadcast N - k values, let's say taking time c_b per value
Compute (N - k)^2 multiply-adds, taking time c_fma each, spread over the N_p processes

τ_k ≈ c_b (N - k) + c_fma (N - k)^2 / N_p ,

and summing over k we find:

τ(N_p) ≈ (c_b/2) N^2 + (c_fma/3) N^3 / N_p .    (1)
LU Decomposition in Parallel
Parallel LU Speedup
Now, looking at the parallel speedup factor, we have:

S(N_p) = τ(1)/τ(N_p)
       ≈ (c_fma N^3/3) / [ c_b N^2/2 + c_fma N^3/(3 N_p) ]
       = [ (3/(2N)) (c_b/c_fma) + 1/N_p ]^{-1} .

See the ratio of communication cost to computation? It never goes away.
Minimizing communication costs is again the key to improving the efficiency ...
LU Decomposition in Parallel
Parallel LU Efficiency
The efficiency is given by:

E(N_p) = S(N_p)/N_p
       ≈ [ (3 N_p/(2N)) (c_b/c_fma) + 1 ]^{-1}
       ≈ 1 - (3 N_p/(2N)) (c_b/c_fma) .

Note that this algorithm is scalable (in time) by our previous definition, for a given fixed ratio of N_p/N
Note that it is not scalable in terms of memory, since the memory requirements grow steadily as O(N^2)
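A purely illustrative set of numbers (mine, not the handout's): if broadcasting one value costs c_b = 10 c_fma, then for N = 10^4 and N_p = 100 the model gives E ≈ 1 - (3·100/(2·10^4))·10 = 0.85, while doubling N_p to 200 at the same N drops E to ≈ 0.70; holding the ratio N_p/N fixed as both grow leaves E unchanged, which is the time-scalability statement above.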
LU Decomposition in Parallel
Improving Parallel LU
The most obvious way to improve the efficiency of our parallel Gaussian elimination is to overlap computation with the necessary communication. Consider the revised algorithm ...
LU Decomposition in Parallel
! 1st processor computes L(2:N,1) and BCAST( L(2:N,1) )
do k=1,N-1
  if ( column k is not mine ) then        ! post RECV for multipliers
    RECV( L(k+1:N,k) from BCAST( next_L ) )
  end if
  if ( column k+1 is mine ) then          ! compute next set of multipliers
    do i=k+1,N                            ! and eliminate for column k+1
      A(i,k+1) = A(i,k+1) - L(i,k)*A(k,k+1)
      next_L(i,k+1) = A(i,k+1) / A(k+1,k+1)
    end do
    BCAST( next_L(k+2:N,k+1), root = owner of k+1 )
  end if
  do j=k+2,N ( modulo I own column j )    ! perform eliminations
    do i=k+1,N
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    end do
  end do
end do
ScaLAPACK
ScaLAPACK
Scalable LAPACK library (LAPACK is the general purpose Linear Algebra PACKage), contains LAPACK-style driver routines formulated for distributed memory parallel processing.
An exceptional example of an MPI library:
1. Portable - based on optimized low-level routines
2. User friendly - parallel versions of the common LAPACK routines have similar syntax, with names just prefixed with a "P"
3. The user of the library need never write parallel processing code - it's all under the covers
4. Efficient
Source code and documentation available from NETLIB:
http://www.netlib.org/scalapack
ScaLAPACK
Under the ScaLAPACK Hood
Uses a block-cyclic decomposition of matrices into a P_row × P_col 2D representation. If we now consider an NB × NB sub-block (starting at index ib, ending at index end):
do ib=1,N,NB
  end = MIN(N, ib+NB-1)
  do i=ib,end
    (a) Find pivot row, k, BCAST pivot column
    (b) Swap rows k and i in block column, BCAST row k
    (c) A(i+1:N,i) = A(i+1:N,i) / A(i,i)
    (d) A(i+1:N,i+1:end) = A(i+1:N,i+1:end) - A(i+1:N,i)*A(i,i+1:end)
  end do
  (e) BCAST all swap information to right and left
  (f) Apply all row swaps to other columns
  (g) BCAST row k to right
  (h) A(ib:end,end+1:N) = LL^(-1) A(ib:end,end+1:N)
  (i) BCAST A(ib:end,end+1:N) down
  (j) BCAST A(end+1:N,ib:end) right
  (k) Eliminate A(end+1:N,end+1:N)
end do
this is our old friend LU decomposition, available as the ScaLAPACK
PGETRF routine.
ScaLAPACK ScaLAPACK Schematic - PBLAS & BLACS
ScaLAPACK Schematic
[Figure: general schematic of the ScaLAPACK library. ScaLAPACK and the PBLAS form the global (distributed) layer, LAPACK and the BLAS form the local layer, and the BLACS provide the communication layer on top of MPI/PVM.]
ScaLAPACK ScaLAPACK Analysis
ScaLAPACK Efficiency

Variable       Description
C_f N^3        Total number of FP operations
C_v N^2/√N_p   Total number of data items communicated
C_m N/NB       Total number of messages
τ_f            Time per FP operation
τ_v            Time per data item communicated
τ_m            Time per message

With these quantities, the time for the ScaLAPACK drivers is given by:

τ(N, N_p) = C_f (N^3/N_p) τ_f + C_v (N^2/√N_p) τ_v + C_m (N/NB) τ_m ,
ScaLAPACK ScaLAPACK Analysis
and the scaling:

E(N, N_p) ≈ [ 1 + (1/NB) (C_m τ_m)/(C_f τ_f) (N_p/N^2) + (C_v τ_v)/(C_f τ_f) (√N_p/N) ]^{-1} .

Note that:
Values of C_f, C_v, and C_m depend on the driver routine (and the underlying algorithm)
Scalable for constant N^2/N_p ...
Small problems: dominated by τ_m/τ_f, the ratio of latency to time per FP operation (see why latency matters?)
Medium problems: significantly impacted by the ratio of network bandwidth to FP operation rate, τ_f/τ_v
Large problems: node Flop/s (1/τ_f) dominates
ScaLAPACK ScaLAPACK Analysis
Block-Cyclic Decomposition
ScaLAPACK uses a block-cyclic decomposition:
[Figure: on the left, a 4-processor column-wise decomposition; on the right, the block-cyclic version using P_r = P_c = 2, where each square is an NB × NB block (2 × 2 in this example).]
The advantage of block-cyclic is that it allows level 2 and 3 BLAS operations on subvectors and submatrices within each processor (a small index-mapping sketch follows).
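A small sketch (my own, though similar in spirit to ScaLAPACK's INDXG2P/INDXG2L helpers) of the 1D block-cyclic map that is applied independently along each dimension of the process grid: global index g, block size NB, P processes in that dimension.

#include <stdio.h>

void block_cyclic_map(int g, int NB, int P, int *proc, int *local)
{
    int block = g / NB;             /* which global block index g lives in */
    *proc  = block % P;             /* blocks are dealt out cyclically     */
    *local = (block / P) * NB       /* full local blocks already stored    */
           + g % NB;                /*  ... plus the offset inside a block */
}

int main(void)
{
    /* e.g. NB = 2 and P = 2 processes in this dimension */
    for (int g = 0; g < 8; g++) {
        int p, l;
        block_cyclic_map(g, 2, 2, &p, &l);
        printf("global %d -> process %d, local %d\n", g, p, l);
    }
    return 0;
}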
ScaLAPACK Simple Example Code
Sample ScaLAPACK Program
"Simplest" program to solve A x = b using PDGESV (LU), available at
www.netlib.org/scalapack/examples/example1.f
      PROGRAM EXAMPLE1
*
*     Example Program solving Ax=b via ScaLAPACK routine PDGESV
*
*     .. Parameters ..
      INTEGER            DLEN_, IA, JA, IB, JB, M, N, MB, NB, RSRC,
     $                   CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT,
     $                   MXLOCR, MXLOCC, MXRHSC
      PARAMETER          ( DLEN_ = 9, IA = 1, JA = 1, IB = 1, JB = 1,
     $                   M = 9, N = 9, MB = 2, NB = 2, RSRC = 0,
     $                   CSRC = 0, MXLLDA = 5, MXLLDB = 5, NRHS = 1,
     $                   NBRHS = 1, NOUT = 6, MXLOCR = 5, MXLOCC = 4,
     $                   MXRHSC = 1 )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
*     ..
*     .. Local Scalars ..
      INTEGER            ICTXT, INFO, MYCOL, MYROW, NPCOL, NPROW
      DOUBLE PRECISION   ANORM, BNORM, EPS, RESID, XNORM
ScaLAPACK Simple Example Code
*     ..
*     .. Local Arrays ..
      INTEGER            DESCA( DLEN_ ), DESCB( DLEN_ ),
     $                   IPIV( MXLOCR+NB )
      DOUBLE PRECISION   A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ),
     $                   B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ),
     $                   WORK( MXLOCR )
*     ..
*     .. External Functions ..
      DOUBLE PRECISION   PDLAMCH, PDLANGE
      EXTERNAL           PDLAMCH, PDLANGE
*     ..
*     .. External Subroutines ..
      EXTERNAL           BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO,
     $                   DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY,
     $                   SL_INIT
*     ..
*     .. Intrinsic Functions ..
      INTRINSIC          DBLE
*     ..
*     .. Data statements ..
      DATA               NPROW / 2 / , NPCOL / 3 /
*     ..
*     .. Executable Statements ..
*
*     INITIALIZE THE PROCESS GRID
*
      CALL SL_INIT( ICTXT, NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
ScaLAPACK Simple Example Code
*
*     If I'm not in the process grid, go to the end of the program
*
      IF( MYROW.EQ.-1 )
     $   GO TO 10
*
*     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
*     Initialize the array descriptors for the matrices A and B
*
      CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
     $               INFO )
      CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
     $               MXLLDB, INFO )
*
*     Generate matrices A and B and distribute to the process grid
*
      CALL MATINIT( A, DESCA, B, DESCB )
*
*     Make a copy of A and B for checking purposes
*
      CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
      CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
*
*     CALL THE SCALAPACK ROUTINE
*     Solve the linear system A * X = B
*
      CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
     $             INFO )
Iterative Methods Jacobi Iteration
Jacobi Iteration
The Jacobi update for A x = b is

x_i^{(k)} = (1/A_{i,i}) [ b_i - Σ_{j≠i} A_{i,j} x_j^{(k-1)} ] ,

Particularly useful in the solution of (O|P)DEs using finite differences
Advantage of small memory requirements (especially for sparse systems)
Disadvantage of convergence problems (converging slowly, or not at all)
Initial guess - often take x^{(0)} = b
Iterative Methods Jacobi Iteration
Measuring convergence for Jacobi iteration can be tricky, particularly in parallel. At first glance, you might be tempted by

| x_i^{(k+1)} - x_i^{(k)} | < ε ,  ∀ i ,

but that says little about solution accuracy.

| Σ_j A_{i,j} x_j^{(k)} - b_i | < ε

is a better form, and can reuse values already computed in the previous iteration (see the sketch below).
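A sketch of that residual-based test in C (the function name and calling convention are mine). In a real Jacobi sweep the row sums Σ_j A_{i,j} x_j have essentially just been computed and can be reused; they are recomputed here for clarity.

#include <math.h>

/* Converged when |sum_j A(i,j) x_j - b_i| < eps for every row i */
int jacobi_converged(int N, const double *A, const double *x, const double *b, double eps)
{
    for (int i = 0; i < N; i++) {
        double r = -b[i];
        for (int j = 0; j < N; j++)
            r += A[i*N + j] * x[j];         /* residual component (A x - b)_i */
        if (fabs(r) >= eps) return 0;       /* at least one row not yet converged */
    }
    return 1;
}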
Iterative Methods Jacobi Iteration
Sequential
do i=1,N
  x(i) = b(i)
end do
do iter=1,MAXITER
  do i=1,N
    sum = 0.0
    do j=1,N
      if ( i /= j ) sum = sum + A(i,j)*x(j)
    end do
    new_x(i) = ( b(i) - sum ) / A(i,i)
  end do
  do i=1,N
    x(i) = new_x(i)
  end do
end do
Iterative Methods Jacobi Iteration
Block Parallel
Have each process accountable for solving a particular block of unknowns, and simply use an Allgather or Allgatherv (for an unequal distribution of work across the processes); see the sketch below. The resulting computation time is

τ_comp = (N/N_p) (2N + 4) N_iter ,

with a time for communication

τ_comm = (N_p τ_lat + N τ_dat) N_iter ,
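A hedged MPI sketch of that scheme for an even distribution (so MPI_Allgather suffices; MPI_Allgatherv would handle uneven blocks). The function and variable names are my own; each rank stores its n_local = N/N_p rows of A and the corresponding entries of b, with offset = rank * n_local.

#include <mpi.h>
#include <stdlib.h>

void jacobi_block(int N, int n_local, int offset,
                  const double *A_local, const double *b_local,
                  double *x, int maxiter, MPI_Comm comm)
{
    double *x_new = malloc((size_t)n_local * sizeof *x_new);
    for (int iter = 0; iter < maxiter; iter++) {
        for (int il = 0; il < n_local; il++) {
            int i = offset + il;               /* global row index */
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                if (j != i) sum += A_local[il*N + j] * x[j];
            x_new[il] = (b_local[il] - sum) / A_local[il*N + i];
        }
        /* every process contributes its block; x is rebuilt everywhere */
        MPI_Allgather(x_new, n_local, MPI_DOUBLE, x, n_local, MPI_DOUBLE, comm);
    }
    free(x_new);
}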
Iterative Methods Jacobi Iteration
and the speedup factor:

S(N_p) = N(2N + 4) / [ N(2N + 4)/N_p + N_p τ_lat + N τ_dat ]
       = N_p / (1 + τ_comm/τ_comp) .

Note that this is scalable, but only if the ratio N/N_p is maintained with increasing N_p.
Iterative Methods Gauss-Seidel
Gauss-Seidel Relaxation
A technique for convergence acceleration that usually converges faster than Jacobi:

x_i^{(k)} = (1/A_{i,i}) [ b_i - Σ_{j=1}^{i-1} A_{i,j} x_j^{(k)} - Σ_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ] ,

making use of the just-updated solution values (a sketch follows below). It is more difficult to parallelize, but this can be done through particular patterning (say using a checkerboard allocation) in which regions can be computed simultaneously.
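For contrast with the Jacobi sweep, a short serial sketch (names mine, row-major flat storage): the only change is that x is updated in place, so rows with larger i immediately see the new values.

/* One Gauss-Seidel sweep over x for A x = b */
void gauss_seidel_sweep(int N, const double *A, const double *b, double *x)
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            if (j != i) sum += A[i*N + j] * x[j];   /* x[j] is already updated for j < i */
        x[i] = (b[i] - sum) / A[i*N + i];           /* in-place update */
    }
}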
Iterative Methods Successive Overrelaxation
Overrelaxation
Another improved-convergence technique, in which the factor (1 - ω) x_i is added:

x_i^{(k)} = (ω/A_{i,i}) [ b_i - Σ_{j≠i} A_{i,j} x_j^{(k-1)} ] + (1 - ω) x_i^{(k-1)} ,

where 0 < ω < 1. For the Gauss-Seidel method, this is slightly modified:

x_i^{(k)} = (ω/A_{i,i}) [ b_i - Σ_{j=1}^{i-1} A_{i,j} x_j^{(k)} - Σ_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ] + (1 - ω) x_i^{(k-1)} ,

with 0 < ω < 2.
Iterative Methods Multigrid
Multigrid Method
Makes use of successively finer grids to improve the solution:
A coarse starting grid quickly takes distant effects into account
Initial values on the finer grids are available through interpolation from the existing grid values
Of course, we can even have regions of variable grid density using so-called adaptive grid methods