Class 18 - Linear Algebra II Handout
High Performance Linear Algebra II
M. D. Jones, Ph.D.
Center for Computational Research
University at Buffalo
State University of New York
High Performance Computing I, 2012
Introduction
Dense & Parallel
In this topic we will still discuss dense (as opposed to sparse) matrices, but now focus on parallel execution.
Best performance (and scalability) most often comes down to making the best use of the (local) Level 3 BLAS
Optimized BLAS are generally crucial for achieving performance gains
Introduction
BLAS Recapitulated
BLAS are critical:
Foundation for all (computational) linear algebra
Parallel versions still call serial versions on each processor
Performance goes up with an increasing ratio of floating-point operations to memory references
Well-tuned BLAS take maximum advantage of the memory hierarchy (a single-call DGEMM sketch follows the table below)
BLAS Level   Calculation   Memory Refs.   Flop Count   Ratio (Flop/Mem)
1            DDOT          3N             2N           2/3
2            DGEMV         N^2            2N^2         2
3            DGEMM         4N^2           2N^3         N/2
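The Level 3 row of this table is exactly the operation a single optimized DGEMM call performs. A hedged illustration follows: the wrapper name is mine, and the CBLAS interface is just one of several ways to reach DGEMM (ATLAS, GotoBLAS/OpenBLAS, MKL, ACML, etc. all provide it).

#include <cblas.h>

/* C = A*B for N x N row-major matrices via one Level 3 BLAS call;
   alpha = 1 and beta = 0 give the plain product. */
void matmul_blas(int N, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,          /* M, N, K       */
                1.0, A, N,        /* alpha, A, lda */
                B, N,             /* B, ldb        */
                0.0, C, N);       /* beta, C, ldc  */
}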
Introduction
Simple Memory Model
Let us assume just two levels of memory, slow and fast. Let
m = number of memory elements (words) moved between fast and slow memory,
τ_m = time per slow memory operation,
f_op = number of floating-point operations,
τ_f = time per floating-point operation,
and q = f_op/m. The minimum time is just f_op τ_f, but more realistically:

time ≈ f_op τ_f + m τ_m = f_op τ_f [ 1 + τ_m/(q τ_f) ] .

Key observation: we need optimal reuse of data, i.e. larger q (certainly q ≥ τ_m/τ_f).
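As a quick worked check of this model against the BLAS table above (my arithmetic, not the handout's): for DGEMM (Level 3) the table gives q = 2N^3/(4N^2) = N/2, so

time ≈ 2N^3 τ_f [ 1 + 2 τ_m/(N τ_f) ] ,

and the slow-memory penalty shrinks as N grows; for DDOT (Level 1), q = 2/3 is fixed, so the τ_m/(q τ_f) term never becomes small no matter how large the vectors are.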
Matrix Multiplication Sequential
Simple Matrix Multiplication
Everyone has probably written this out at least once ...
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    C[i][j] = 0.0;
    for (k=0; k<N; k++) {
      C[i][j] += A[i][k] * B[k][j];
    }
  }
}
in order to multiply two matrices, A and B, and store the result in a
third, C.
Can use a temporary scalar to avoid dereferencing in the innermost loop (see the sketch below)
O(N^3) multiply-add operations
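A sketch of the temporary-scalar variant mentioned above (same free variables as the fragment above): the running sum is kept in a scalar, likely register-resident, instead of repeatedly dereferencing C[i][j] in the innermost loop.

for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    double tmp = 0.0;              /* accumulate in a scalar            */
    for (k=0; k<N; k++) {
      tmp += A[i][k] * B[k][j];    /* no store to C inside the k loop   */
    }
    C[i][j] = tmp;                 /* single store per (i,j)            */
  }
}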
Matrix Multiplication Partitioning
Partitioning
In this case, partitioning is also known as block matrix
multiplication:
Divide into s^2 submatrices, with N/s × N/s elements in each submatrix
Let m = N/s, and
for (p=0; p<s; p++) {
  for (q=0; q<s; q++) {
    C_{p,q} = 0.0;
    for (r=0; r<s; r++) {
      C_{p,q} += A_{p,r} * B_{r,q};
    }
  }
}
where the A_{p,r}, etc., are themselves matrices (a concrete C sketch follows below).
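A minimal C sketch of the same partitioned product for flat row-major arrays, assuming N is divisible by the block size BS (the function name and BS are my choices, not the handout's). The three outer loops run over blocks exactly as in the pseudocode above; the inner triple loop is where a call to an optimized DGEMM would normally go.

void matmul_blocked(int N, int BS, const double *A, const double *B, double *C)
{
    for (int i = 0; i < N*N; i++) C[i] = 0.0;
    for (int p = 0; p < N; p += BS)            /* block row of C    */
        for (int q = 0; q < N; q += BS)        /* block column of C */
            for (int r = 0; r < N; r += BS)    /* inner block index */
                /* multiply one BS x BS block pair and accumulate */
                for (int i = p; i < p+BS; i++)
                    for (int k = r; k < r+BS; k++) {
                        double a = A[i*N + k];
                        for (int j = q; j < q+BS; j++)
                            C[i*N + j] += a * B[k*N + j];
                    }
}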
Matrix Multiplication Divide & Conquer
Recursive/Divide & Conquer
Let N be a power of 2 (not strictly necessary) and divide A and B into
4 submatrices, delimited by:
A = | A_pp  A_pq |
    | A_qp  A_qq | .
Then the solution requires 8 pairs of submatrix multiplications, which
can be done recursively ...
Matrix Multiplication Divide & Conquer
matmultR( A, B, s ) {
  if (s == 1) {
    C = A*B;                          /* termination condition */
  }
  else {
    s = s/2;
    P0 = matmultR( A_{pp}, B_{pp}, s );
    P1 = matmultR( A_{pq}, B_{qp}, s );
    P2 = matmultR( A_{pp}, B_{pq}, s );
    P3 = matmultR( A_{pq}, B_{qq}, s );
    P4 = matmultR( A_{qp}, B_{pp}, s );
    P5 = matmultR( A_{qq}, B_{qp}, s );
    P6 = matmultR( A_{qp}, B_{pq}, s );
    P7 = matmultR( A_{qq}, B_{qq}, s );
    C_{pp} = P0 + P1;
    C_{pq} = P2 + P3;
    C_{qp} = P4 + P5;
    C_{qq} = P6 + P7;
  }
  return(C);
}
Matrix Multiplication Divide & Conquer
Well suited for an SMP with a cache hierarchy
Size of the data is continually reduced and localized
Can perform very well when making maximum reuse of data within the cache memory hierarchy
More generally we want to address cases for which the matrix will not fit within the memory of a single machine, so a message passing/distributed model is required
Matrix Multiplication Cannon's Algorithm
Cannon's Algorithm
Cannon's algorithm:
Uses a mesh of processors with a torus topology (periodic boundary conditions)
Elements are shifted to an aligned position: row i of A is shifted i places to the left, while column j of B is shifted j spots upward. This puts A_{i,j+i} and B_{i+j,j} in processor P_{i,j} (namely, we have the appropriate submatrices/elements to multiply in P_{i,j})
Usually submatrices are used, but we can use elements to simplify the notation a bit
Matrix Multiplication Cannon's Algorithm
After alignment, Cannon's algorithm proceeds as follows:
Each process P_{i,j} multiplies its elements
Row i of A is shifted one place left, and column j of B is shifted one place up (this brings together the adjacent elements of A and B needed for summing results)
Each process P_{i,j} multiplies its elements and adds to the accumulating sum
The preceding two steps are repeated until the results are obtained (N - 1 shifts); an MPI sketch follows below
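A hedged MPI sketch of the whole procedure (alignment followed by the multiply-and-shift steps) at block granularity. The function and helper names are my own; it assumes the caller built a square s × s periodic Cartesian communicator (e.g. with MPI_Cart_create and periods = {1,1}), and that each rank owns one m × m row-major block of A and B plus a zero-initialized block of C.

#include <mpi.h>

/* Multiply-accumulate of two m x m row-major blocks: C += A*B */
static void local_multiply_add(int m, const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int k = 0; k < m; k++) {
            double a = A[i*m + k];
            for (int j = 0; j < m; j++)
                C[i*m + j] += a * B[k*m + j];
        }
}

/* Cannon's algorithm on an s x s periodic process grid ("grid" is assumed
   to come from MPI_Cart_create with dims = {s,s}, periods = {1,1}). */
void cannon(int m, double *A, double *B, double *C, MPI_Comm grid)
{
    int dims[2], periods[2], coords[2], src, dst, left, right, up, down;
    MPI_Cart_get(grid, 2, dims, periods, coords);
    int s = dims[0];

    /* initial alignment: row i of A shifts i places left,
                          column j of B shifts j places up */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, m*m, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, dst, 1, src, 1, grid, MPI_STATUS_IGNORE);

    /* nearest neighbours for the per-step shifts */
    MPI_Cart_shift(grid, 1, -1, &right, &left);   /* A moves one place left */
    MPI_Cart_shift(grid, 0, -1, &down,  &up);     /* B moves one place up   */

    for (int step = 0; step < s; step++) {
        local_multiply_add(m, A, B, C);           /* multiply local blocks, accumulate */
        if (step < s - 1) {                       /* s-1 shifts in total */
            MPI_Sendrecv_replace(A, m*m, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, m*m, MPI_DOUBLE, up,   1, down,  1, grid, MPI_STATUS_IGNORE);
        }
    }
}

MPI_Sendrecv_replace keeps each shift in place without extra buffers, which is one reasonable way to realize the shift pattern costed on the following slides.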
Matrix Multiplication Cannon's Algorithm
Cannon Illustrated
[Figure: Cannon's algorithm, showing the flow of data for process P_{ij}: blocks of A shift left along row i and blocks of B shift up along column j.]
Matrix Multiplication Cannon's Algorithm
For s submatrices (m = N/s) the communication time for Cannon's algorithm is given by

τ_comm = 4(s - 1)(τ_lat + m^2 τ_dat) ,

where τ_lat is the latency and τ_dat is the time required to send one value (word). The computation time in Cannon's algorithm:

τ_comp = 2 s m^3 = 2 m^2 N ,

or O(m^2 N).

Cannon's algorithm is also known as the ScaLAPACK outer product algorithm (can you see another numerical library coming?) ...
Matrix Multiplication 2D Pipeline
2D Pipeline
Another useful algorithm is the so-called 2D pipeline, in which the data flows into a rectangular array of processors. Labeling the process grid using (0,0) as the top left corner, the data flows from the left and from the top, with process P_{i,j}:
1. RECV(A from P_{i,j-1})   { A data from the left }
2. RECV(B from P_{i-1,j})   { B data from above }
3. Accumulate C_{i,j}
4. SEND(A to P_{i,j+1})   { A data to the right }
5. SEND(B to P_{i+1,j})   { B data downward }
(illustrated in the sketch and figure below)
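A hedged sketch, at element granularity, of the five steps above for one interior process (the function name is mine). The neighbour ranks are assumed to come from a 2D Cartesian communicator; edge processes would inject the staggered A and B streams (or discard outgoing values) instead of calling MPI_Recv/MPI_Send here.

#include <mpi.h>

double pipeline_element(int N, int left, int right, int up, int down, MPI_Comm comm)
{
    double c = 0.0, a, b;
    for (int step = 0; step < N; step++) {
        MPI_Recv(&a, 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);  /* 1: A from the left */
        MPI_Recv(&b, 1, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);  /* 2: B from above    */
        c += a * b;                                                     /* 3: accumulate C_ij */
        MPI_Send(&a, 1, MPI_DOUBLE, right, 0, comm);                    /* 4: A to the right  */
        MPI_Send(&b, 1, MPI_DOUBLE, down,  1, comm);                    /* 5: B downward      */
    }
    return c;
}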
Matrix Multiplication 2D Pipeline
2D Pipeline Illustrated
[Figure: 2D pipeline (or systolic array) algorithm showing the flow of data for a 4 × 4 example: row i of A streams in from the left and column j of B streams in from the top, with a 1-cycle delay between successive rows/columns, meeting at process P_{ij}.]
Matrix Multiplication Strassen's Method
Strassen's Method
The techniques discussed thus far are fundamentally O(N^3), but there is a clever method due to Strassen^1 which is O(N^2.81):

A = | A_11  A_12 |    B = | B_11  B_12 |    C = | C_11  C_12 |
    | A_21  A_22 |        | B_21  B_22 |        | C_21  C_22 |

Q_1 = (A_11 + A_22)(B_11 + B_22),    C_11 = Q_1 + Q_4 - Q_5 + Q_7
Q_2 = (A_21 + A_22) B_11,            C_12 = Q_3 + Q_5
Q_3 = A_11 (B_12 - B_22),            C_21 = Q_2 + Q_4
Q_4 = A_22 (B_21 - B_11),            C_22 = Q_1 + Q_3 - Q_2 + Q_6
Q_5 = (A_11 + A_12) B_22,
Q_6 = (A_21 - A_11)(B_11 + B_12),
Q_7 = (A_12 - A_22)(B_21 + B_22)

Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds.

^1 V. Strassen, Gaussian Elimination Is Not Optimal, Numerische Mathematik, 13, 353-356 (1969).
Matrix Multiplication Strassen's Method
If N is not a power of 2, one can pad with zeros
Reiterate until the submatrices are reduced to numbers, or to the optimal matrix-multiply size for a particular processor (Strassen switches to standard matrix multiplication for small enough submatrices)
Eliminates one matrix multiply per level of recursion: instead of O(N^{log_2 8}) = O(N^3) we get O(N^{log_2 7}) = O(N^2.81)
Conventional matrix multiplication takes N^3 multiplies and N^3 - N^2 adds; Strassen takes 7 N^{log_2 7} - 6 N^2
Requires additional storage for intermediate matrices, and can be less stable numerically (a serial sketch follows below) ...
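A minimal serial sketch under those assumptions (square matrices, N a power of two, row-major storage with leading dimension ld); CUTOFF, the helper names, and the temporary-buffer layout are illustrative choices, not from the handout, and the seven recursive products follow the Q_1 ... Q_7 identities given earlier.

#include <stdlib.h>

#define CUTOFF 64   /* arbitrary switch-over to the standard triple loop */

static void matmul_std(int n, const double *A, int lda,
                       const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = s;
        }
}

/* Z = X + sign*Y on n x n blocks (Z may alias X) */
static void addb(int n, const double *X, int ldx, const double *Y, int ldy,
                 double *Z, int ldz, double sign)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i*ldz + j] = X[i*ldx + j] + sign * Y[i*ldy + j];
}

void strassen(int n, const double *A, int lda,
              const double *B, int ldb, double *C, int ldc)
{
    if (n <= CUTOFF) { matmul_std(n, A, lda, B, ldb, C, ldc); return; }

    int h = n / 2;
    const double *A11 = A,         *A12 = A + h,
                 *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B,         *B12 = B + h,
                 *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C,         *C12 = C + h,
           *C21 = C + h*ldc, *C22 = C + h*ldc + h;

    double *T1 = malloc((size_t)h*h*sizeof *T1);
    double *T2 = malloc((size_t)h*h*sizeof *T2);
    double *Q[7];
    for (int k = 0; k < 7; k++) Q[k] = malloc((size_t)h*h*sizeof *Q[k]);

    addb(h, A11, lda, A22, lda, T1, h, +1.0);
    addb(h, B11, ldb, B22, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[0], h);           /* Q1 = (A11+A22)(B11+B22) */
    addb(h, A21, lda, A22, lda, T1, h, +1.0);
    strassen(h, T1, h, B11, ldb, Q[1], h);        /* Q2 = (A21+A22) B11      */
    addb(h, B12, ldb, B22, ldb, T2, h, -1.0);
    strassen(h, A11, lda, T2, h, Q[2], h);        /* Q3 = A11 (B12-B22)      */
    addb(h, B21, ldb, B11, ldb, T2, h, -1.0);
    strassen(h, A22, lda, T2, h, Q[3], h);        /* Q4 = A22 (B21-B11)      */
    addb(h, A11, lda, A12, lda, T1, h, +1.0);
    strassen(h, T1, h, B22, ldb, Q[4], h);        /* Q5 = (A11+A12) B22      */
    addb(h, A21, lda, A11, lda, T1, h, -1.0);
    addb(h, B11, ldb, B12, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[5], h);           /* Q6 = (A21-A11)(B11+B12) */
    addb(h, A12, lda, A22, lda, T1, h, -1.0);
    addb(h, B21, ldb, B22, ldb, T2, h, +1.0);
    strassen(h, T1, h, T2, h, Q[6], h);           /* Q7 = (A12-A22)(B21+B22) */

    addb(h, Q[0], h, Q[3], h, C11, ldc, +1.0);    /* C11 = Q1+Q4-Q5+Q7 */
    addb(h, C11, ldc, Q[4], h, C11, ldc, -1.0);
    addb(h, C11, ldc, Q[6], h, C11, ldc, +1.0);
    addb(h, Q[2], h, Q[4], h, C12, ldc, +1.0);    /* C12 = Q3+Q5       */
    addb(h, Q[1], h, Q[3], h, C21, ldc, +1.0);    /* C21 = Q2+Q4       */
    addb(h, Q[0], h, Q[1], h, C22, ldc, -1.0);    /* C22 = Q1-Q2+Q3+Q6 */
    addb(h, C22, ldc, Q[2], h, C22, ldc, +1.0);
    addb(h, C22, ldc, Q[5], h, C22, ldc, +1.0);

    free(T1); free(T2);
    for (int k = 0; k < 7; k++) free(Q[k]);
}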
LU Decomposition in Parallel
Sequential LU/Gaussian Elimination Code
LU code fragment:
for (k=1; k<N; k++) {
  for (i=k+1; i<=N; i++) {
    L(i,k) = A(i,k) / A(k,k)
  }
  for (j=k+1; j<=N; j++) {
    for (i=k+1; i<=N; i++) {
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    }
  }
}
Note that the inner loops (over i ) have no dependencies, i.e. they can
more easily be executed in parallel.
LU Decomposition in Parallel
Parallel LU/Gaussian Elimination
One strategy for parallelizing the Gaussian elimination part of LU decomposition is to do so over the middle loop (j) in the algorithm. Decomposing the columns, and using a message passing implementation, we might have something like the following:
LU Decomposition in Parallel
do k=1,N-1
  if ( column k is mine ) then            ! column-wise decomposition
    do i=k+1,N
      L(i,k) = A(i,k) / A(k,k)
    end do
    BCAST( L(k+1:N,k), root = owner of k )
  else
    RECV( L(k+1:N,k) )
  end if
  do j=k+1,N ( modulo I own column j )    ! column-wise decomposition
    do i=k+1,N
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    end do
  end do
end do
LU Decomposition in Parallel
Parallel LU Efficiency
If we do a (not overly crude) analysis of the preceding parallel algorithm, for each column k we have:
Broadcast N - k values, let's say taking time c_b per value
Compute (N - k)^2 multiply-adds, taking time c_fma each, spread over the N_p processes

τ_k ≈ c_b (N - k) + c_fma (N - k)^2 / N_p ,

and summing over k we find:

τ(N_p) ≈ (c_b/2) N^2 + (c_fma/3) N^3 / N_p .    (1)
LU Decomposition in Parallel
Parallel LU Speedup
Now, looking at the parallel speedup factor, we have:

S(N_p) = τ(1)/τ(N_p)
       ≈ (c_fma N^3/3) / [ c_b N^2/2 + c_fma N^3/(3 N_p) ]
       = [ (3/(2N)) (c_b/c_fma) + 1/N_p ]^{-1} .

See the ratio of communication cost to computation? It never goes away.
Minimizing communication costs is again the key to improving the efficiency ...
LU Decomposition in Parallel
Parallel LU Efficiency
The efficiency is given by:

E(N_p) = S(N_p)/N_p
       ≈ [ (3 N_p/(2N)) (c_b/c_fma) + 1 ]^{-1}
       ≈ 1 - (3 N_p/(2N)) (c_b/c_fma) .

Note that this algorithm is scalable (in time) by our previous definition, for a given fixed ratio of N_p/N
Note that it is not scalable in terms of memory, since the memory requirements grow steadily as O(N^2)
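A purely illustrative set of numbers (mine, not the handout's): if broadcasting one value costs c_b = 10 c_fma, then for N = 10^4 and N_p = 100 the model gives E ≈ 1 - (3·100/(2·10^4))·10 = 0.85, while doubling N_p to 200 at the same N drops E to ≈ 0.70; holding the ratio N_p/N fixed as both grow leaves E unchanged, which is the time-scalability statement above.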
LU Decomposition in Parallel
Improving Parallel LU
The most obvious way to improve the efficiency of our parallel Gaussian elimination is to overlap computation with the necessary communication. Consider the revised algorithm ...
LU Decomposition in Parallel
! 1st processor computes L(2:N,1) and BCAST( L(2:N,1) )
do k=1,N-1
  if ( column k is not mine ) then        ! post RECV for multipliers
    RECV( L(k+1:N,k) from BCAST( next_L ) )
  end if
  if ( column k+1 is mine ) then          ! compute next set of multipliers
    do i=k+1,N                            ! and eliminate for column k+1
      A(i,k+1) = A(i,k+1) - L(i,k)*A(k,k+1)
      next_L(i,k+1) = A(i,k+1) / A(k+1,k+1)
    end do
    BCAST( next_L(k+2:N,k+1), root = owner of k+1 )
  end if
  do j=k+2,N ( modulo I own column j )    ! perform eliminations
    do i=k+1,N
      A(i,j) = A(i,j) - L(i,k)*A(k,j)
    end do
  end do
end do
ScaLAPACK
ScaLAPACK
Scalable LAPACK library (LAPACK is the general purpose Linear Algebra PACKage), contains LAPACK-style driver routines formulated for distributed memory parallel processing.
An exceptional example of an MPI library:
1. Portable - based on optimized low-level routines
2. User friendly - parallel versions of the common LAPACK routines have similar syntax, with names just prefixed with a "P"
3. The user of the library need never write parallel processing code - it's all under the covers
4. Efficient
Source code and documentation available from NETLIB:
http://www.netlib.org/scalapack
ScaLAPACK
Under the ScaLAPACK Hood
Uses a block-cyclic decomposition of matrices into a P_row × P_col 2D representation. If we now consider an NB × NB sub-block (starting at index ib, ending at index end):
do ib=1,N,NB
  end = MIN(N, ib+NB-1)
  do i=ib,end
    (a) Find pivot row, k, BCAST pivot column
    (b) Swap rows k and i in block column, BCAST row k
    (c) A(i+1:N,i) = A(i+1:N,i) / A(i,i)
    (d) A(i+1:N,i+1:end) = A(i+1:N,i+1:end) - A(i+1:N,i)*A(i,i+1:end)
  end do
  (e) BCAST all swap information to right and left
  (f) Apply all row swaps to other columns
  (g) BCAST row k to right
  (h) A(ib:end,end+1:N) = LL^(-1) A(ib:end,end+1:N)
  (i) BCAST A(ib:end,end+1:N) down
  (j) BCAST A(end+1:N,ib:end) right
  (k) Eliminate A(end+1:N,end+1:N)
end do
this is our old friend LU decomposition, available as the ScaLAPACK
PGETRF routine.
ScaLAPACK ScaLAPACK Schematic - PBLAS & BLACS
ScaLAPACK Schematic
[Figure: general schematic of the ScaLAPACK library. ScaLAPACK and the PBLAS form the global (distributed) layer, LAPACK and the BLAS form the local layer, and the BLACS provide the communication layer on top of MPI/PVM.]
ScaLAPACK ScaLAPACK Analysis
ScaLAPACK Efficiency

Variable       Description
C_f N^3        Total number of FP operations
C_v N^2/√N_p   Total number of data items communicated
C_m N/NB       Total number of messages
τ_f            Time per FP operation
τ_v            Time per data item communicated
τ_m            Time per message

With these quantities, the time for the ScaLAPACK drivers is given by:

τ(N, N_p) = C_f (N^3/N_p) τ_f + C_v (N^2/√N_p) τ_v + C_m (N/NB) τ_m ,
ScaLAPACK ScaLAPACK Analysis
and the scaling:

E(N, N_p) ≈ [ 1 + (1/NB) (C_m τ_m)/(C_f τ_f) (N_p/N^2) + (C_v τ_v)/(C_f τ_f) (√N_p/N) ]^{-1} .

Note that:
Values of C_f, C_v, and C_m depend on the driver routine (and the underlying algorithm)
Scalable for constant N^2/N_p ...
Small problems: dominated by τ_m/τ_f, the ratio of latency to time per FP operation (see why latency matters?)
Medium problems: significantly impacted by the ratio of network bandwidth to FP operation rate, τ_f/τ_v
Large problems: node Flop/s (1/τ_f) dominates
ScaLAPACK ScaLAPACK Analysis
Block-Cyclic Decomposition
ScaLAPACK uses a block-cyclic decomposition:
[Figure: on the left, a 4-processor column-wise decomposition; on the right, the block-cyclic version using P_r = P_c = 2, where each square is an NB × NB block (2 × 2 in this example).]
The advantage of block-cyclic is that it allows level 2 and 3 BLAS operations on subvectors and submatrices within each processor (a small index-mapping sketch follows).
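A small sketch (my own, though similar in spirit to ScaLAPACK's INDXG2P/INDXG2L helpers) of the 1D block-cyclic map that is applied independently along each dimension of the process grid: global index g, block size NB, P processes in that dimension.

#include <stdio.h>

void block_cyclic_map(int g, int NB, int P, int *proc, int *local)
{
    int block = g / NB;             /* which global block index g lives in */
    *proc  = block % P;             /* blocks are dealt out cyclically     */
    *local = (block / P) * NB       /* full local blocks already stored    */
           + g % NB;                /*  ... plus the offset inside a block */
}

int main(void)
{
    /* e.g. NB = 2 and P = 2 processes in this dimension */
    for (int g = 0; g < 8; g++) {
        int p, l;
        block_cyclic_map(g, 2, 2, &p, &l);
        printf("global %d -> process %d, local %d\n", g, p, l);
    }
    return 0;
}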
ScaLAPACK Simple Example Code
Sample ScaLAPACK Program
"Simplest" program to solve A x = b using PDGESV (LU), available at
www.netlib.org/scalapack/examples/example1.f
      PROGRAM EXAMPLE1
*
*     Example Program solving Ax=b via ScaLAPACK routine PDGESV
*
*     .. Parameters ..
      INTEGER            DLEN_, IA, JA, IB, JB, M, N, MB, NB, RSRC,
     $                   CSRC, MXLLDA, MXLLDB, NRHS, NBRHS, NOUT,
     $                   MXLOCR, MXLOCC, MXRHSC
      PARAMETER          ( DLEN_ = 9, IA = 1, JA = 1, IB = 1, JB = 1,
     $                   M = 9, N = 9, MB = 2, NB = 2, RSRC = 0,
     $                   CSRC = 0, MXLLDA = 5, MXLLDB = 5, NRHS = 1,
     $                   NBRHS = 1, NOUT = 6, MXLOCR = 5, MXLOCC = 4,
     $                   MXRHSC = 1 )
      DOUBLE PRECISION   ONE
      PARAMETER          ( ONE = 1.0D+0 )
*     ..
*     .. Local Scalars ..
      INTEGER            ICTXT, INFO, MYCOL, MYROW, NPCOL, NPROW
      DOUBLE PRECISION   ANORM, BNORM, EPS, RESID, XNORM
ScaLAPACK Simple Example Code
*     ..
*     .. Local Arrays ..
      INTEGER            DESCA( DLEN_ ), DESCB( DLEN_ ),
     $                   IPIV( MXLOCR+NB )
      DOUBLE PRECISION   A( MXLLDA, MXLOCC ), A0( MXLLDA, MXLOCC ),
     $                   B( MXLLDB, MXRHSC ), B0( MXLLDB, MXRHSC ),
     $                   WORK( MXLOCR )
*     ..
*     .. External Functions ..
      DOUBLE PRECISION   PDLAMCH, PDLANGE
      EXTERNAL           PDLAMCH, PDLANGE
*     ..
*     .. External Subroutines ..
      EXTERNAL           BLACS_EXIT, BLACS_GRIDEXIT, BLACS_GRIDINFO,
     $                   DESCINIT, MATINIT, PDGEMM, PDGESV, PDLACPY,
     $                   SL_INIT
*     ..
*     .. Intrinsic Functions ..
      INTRINSIC          DBLE
*     ..
*     .. Data statements ..
      DATA               NPROW / 2 / , NPCOL / 3 /
*     ..
*     .. Executable Statements ..
*
*     INITIALIZE THE PROCESS GRID
*
      CALL SL_INIT( ICTXT, NPROW, NPCOL )
      CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
ScaLAPACK Simple Example Code
*
*     If I'm not in the process grid, go to the end of the program
*
      IF( MYROW.EQ.-1 )
     $   GO TO 10
*
*     DISTRIBUTE THE MATRIX ON THE PROCESS GRID
*     Initialize the array descriptors for the matrices A and B
*
      CALL DESCINIT( DESCA, M, N, MB, NB, RSRC, CSRC, ICTXT, MXLLDA,
     $               INFO )
      CALL DESCINIT( DESCB, N, NRHS, NB, NBRHS, RSRC, CSRC, ICTXT,
     $               MXLLDB, INFO )
*
*     Generate matrices A and B and distribute to the process grid
*
      CALL MATINIT( A, DESCA, B, DESCB )
*
*     Make a copy of A and B for checking purposes
*
      CALL PDLACPY( 'All', N, N, A, 1, 1, DESCA, A0, 1, 1, DESCA )
      CALL PDLACPY( 'All', N, NRHS, B, 1, 1, DESCB, B0, 1, 1, DESCB )
*
*     CALL THE SCALAPACK ROUTINE
*     Solve the linear system A * X = B
*
      CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB,
     $             INFO )
Iterative Methods Jacobi Iteration
Jacobi Iteration
The Jacobi update for A x = b is

x_i^{(k)} = (1/A_{i,i}) [ b_i - Σ_{j≠i} A_{i,j} x_j^{(k-1)} ] ,

Particularly useful in the solution of (O|P)DEs using finite differences
Advantage of small memory requirements (especially for sparse systems)
Disadvantage of convergence problems (converging slowly, or not at all)
Initial guess - often take x^{(0)} = b
Iterative Methods Jacobi Iteration
Measuring convergence for Jacobi iteration can be tricky, particularly in parallel. At first glance, you might be tempted by

| x_i^{(k+1)} - x_i^{(k)} | < ε ,  ∀ i ,

but that says little about solution accuracy.

| Σ_j A_{i,j} x_j^{(k)} - b_i | < ε

is a better form, and can reuse values already computed in the previous iteration (see the sketch below).
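A sketch of that residual-based test in C (the function name and calling convention are mine). In a real Jacobi sweep the row sums Σ_j A_{i,j} x_j have essentially just been computed and can be reused; they are recomputed here for clarity.

#include <math.h>

/* Converged when |sum_j A(i,j) x_j - b_i| < eps for every row i */
int jacobi_converged(int N, const double *A, const double *x, const double *b, double eps)
{
    for (int i = 0; i < N; i++) {
        double r = -b[i];
        for (int j = 0; j < N; j++)
            r += A[i*N + j] * x[j];         /* residual component (A x - b)_i */
        if (fabs(r) >= eps) return 0;       /* at least one row not yet converged */
    }
    return 1;
}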
Iterative Methods Jacobi Iteration
Sequential
do i=1,N
  x(i) = b(i)
end do
do iter=1,MAXITER
  do i=1,N
    sum = 0.0
    do j=1,N
      if ( i /= j ) sum = sum + A(i,j)*x(j)
    end do
    new_x(i) = ( b(i) - sum ) / A(i,i)
  end do
  do i=1,N
    x(i) = new_x(i)
  end do
end do
Iterative Methods Jacobi Iteration
Block Parallel
Have each process accountable for solving a particular block of unknowns, and simply use an Allgather or Allgatherv (for an unequal distribution of work across the processes); see the sketch below. The resulting computation time is

τ_comp = (N/N_p) (2N + 4) N_iter ,

with a time for communication

τ_comm = (N_p τ_lat + N τ_dat) N_iter ,
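A hedged MPI sketch of that scheme for an even distribution (so MPI_Allgather suffices; MPI_Allgatherv would handle uneven blocks). The function and variable names are my own; each rank stores its n_local = N/N_p rows of A and the corresponding entries of b, with offset = rank * n_local.

#include <mpi.h>
#include <stdlib.h>

void jacobi_block(int N, int n_local, int offset,
                  const double *A_local, const double *b_local,
                  double *x, int maxiter, MPI_Comm comm)
{
    double *x_new = malloc((size_t)n_local * sizeof *x_new);
    for (int iter = 0; iter < maxiter; iter++) {
        for (int il = 0; il < n_local; il++) {
            int i = offset + il;               /* global row index */
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                if (j != i) sum += A_local[il*N + j] * x[j];
            x_new[il] = (b_local[il] - sum) / A_local[il*N + i];
        }
        /* every process contributes its block; x is rebuilt everywhere */
        MPI_Allgather(x_new, n_local, MPI_DOUBLE, x, n_local, MPI_DOUBLE, comm);
    }
    free(x_new);
}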
Iterative Methods Jacobi Iteration
and the speedup factor:

S(N_p) = N(2N + 4) / [ N(2N + 4)/N_p + N_p τ_lat + N τ_dat ]
       = N_p / (1 + τ_comm/τ_comp) .

Note that this is scalable, but only if the ratio N/N_p is maintained with increasing N_p.
Iterative Methods Gauss-Seidel
Gauss-Seidel Relaxation
A technique for convergence acceleration that usually converges faster than Jacobi:

x_i^{(k)} = (1/A_{i,i}) [ b_i - Σ_{j=1}^{i-1} A_{i,j} x_j^{(k)} - Σ_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ] ,

making use of the just-updated solution values (a sketch follows below). It is more difficult to parallelize, but this can be done through particular patterning (say using a checkerboard allocation) in which regions can be computed simultaneously.
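For contrast with the Jacobi sweep, a short serial sketch (names mine, row-major flat storage): the only change is that x is updated in place, so rows with larger i immediately see the new values.

/* One Gauss-Seidel sweep over x for A x = b */
void gauss_seidel_sweep(int N, const double *A, const double *b, double *x)
{
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            if (j != i) sum += A[i*N + j] * x[j];   /* x[j] is already updated for j < i */
        x[i] = (b[i] - sum) / A[i*N + i];           /* in-place update */
    }
}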
Iterative Methods Successive Overrelaxation
Overrelaxation
Another improved-convergence technique, in which the factor (1 - ω) x_i is added:

x_i^{(k)} = (ω/A_{i,i}) [ b_i - Σ_{j≠i} A_{i,j} x_j^{(k-1)} ] + (1 - ω) x_i^{(k-1)} ,

where 0 < ω < 1. For the Gauss-Seidel method, this is slightly modified:

x_i^{(k)} = (ω/A_{i,i}) [ b_i - Σ_{j=1}^{i-1} A_{i,j} x_j^{(k)} - Σ_{j=i+1}^{N} A_{i,j} x_j^{(k-1)} ] + (1 - ω) x_i^{(k-1)} ,

with 0 < ω < 2.
Iterative Methods Multigrid
Multigrid Method
Makes use of successively finer grids to improve the solution:
A coarse starting grid quickly takes distant effects into account
Initial values on the finer grids are available through interpolation from the existing grid values
Of course, we can even have regions of variable grid density using so-called adaptive grid methods