Scientific Libraries User Guide

004–2151–002
© 1996, 1999 Silicon Graphics, Inc. All Rights Reserved. This manual or parts thereof may not be reproduced in any form unless
permitted by contract or by written permission of Silicon Graphics, Inc.
Use, duplication, or disclosure by the Government is subject to restrictions as set forth in the Rights in Data clause at FAR
52.227-14 and/or in similar or successor clauses in the FAR, or in the DOD, DOE or NASA FAR Supplements. Unpublished rights
reserved under the Copyright Laws of the United States. Contractor/manufacturer is Silicon Graphics, Inc., 1600 Amphitheatre
Pkwy., Mountain View, CA 94043-1351.
Autotasking, CF77, Cray, Cray Ada, CraySoft, Cray Y-MP, Cray-1, CRInform, CRI/TurboKiva, HSX, LibSci, MPP Apprentice, SSD,
SUPERCLUSTER, UNICOS, X-MP EA, and UNICOS/mk are federally registered trademarks and Because no workstation is an
island, CCI, CCMT, CF90, CFT, CFT2, CFT77, ConCurrent Maintenance Tools, COS, Cray Animation Theater, Cray APP, Cray C90,
Cray C90D, Cray C++ Compiling System, CrayDoc, Cray EL, Cray J90, Cray J90se, CrayLink, Cray NQS, Cray/REELlibrarian,
Cray S-MP, Cray SSD-T90, Cray SV1, Cray T90, Cray T3D, Cray T3E, CrayTutor, Cray X-MP, Cray XMS, Cray-2, CSIM, CVT,
Delivering the power . . ., DGauss, Docview, EMDS, GigaRing, HEXAR, IOS, ND Series Network Disk Array,
Network Queuing Environment, Network Queuing Tools, OLNET, RQS, SEGLDR, SMARTE, SUPERLINK,
System Maintenance and Remote Testing Environment, Trusted UNICOS, and UNICOS MAX are trademarks of Cray Research,
L.L.C., a wholly owned subsidiary of Silicon Graphics, Inc.
SGI is a trademark of Silicon Graphics, Inc. Silicon Graphics, the Silicon Graphics logo, and IRIS are registered trademarks, and
CASEVision, IRIS 4D, IRIS Power Series, IRIX, Origin2000, and POWER CHALLENGE are trademarks of Silicon Graphics, Inc.
MIPS, R4000, R4400, and R8000 are registered trademarks and MIPSpro and R10000 are trademarks of MIPS Technologies, Inc.
UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open Company, Ltd.
VMS and VAX are trademarks of Digital Equipment Corporation.
Portions of this product and document are derived from material copyrighted by Kuck and Associates, Inc.
DynaText and DynaWeb are registered trademarks of Inso Corporation. Silicon Graphics and the Silicon Graphics logo are
registered trademarks of Silicon Graphics, Inc. UNIX is a registered trademark in the United States and other countries, licensed
exclusively through X/Open Company Limited. X/Open is a trademark of X/Open Company Ltd. The X device is a trademark
of the Open Group.
The UNICOS operating system is derived from UNIX® System V. The UNICOS operating system is also based in part on the
Fourth Berkeley Software Distribution (BSD) under license from The Regents of the University of California.
St. Peter’s Basilica image courtesy of ENEL SpA and InfoByte SpA. Disk Thrower image courtesy of Xavier Berenguer, Animatica.
New Features
This guide has been expanded to include appendixes that describe the
implementation of version 2 of the Math library (libm) used on UNICOS
systems, and the algorithms used in that library.
Record of Revision
Version Description
Contents

Introduction [1]

LAPACK [3]
    LAPACK in the Scientific Library
    Types of Problems Solved by LAPACK
    Solving Linear Systems
        Factoring a Matrix
            Example 1: LU factorization
            Example 2: Symmetric indefinite matrix factorization
            Error Codes
            Example 3: Error conditions
    Solving from the Factored Form
        Condition Estimation
            Example 4: Roundoff errors
            Use in Error Bounds
        Equilibration
        Iterative Refinement
            Example 5: Hilbert matrix
            Error Bounds
        Inverting a Matrix
    Solving Least Squares Problems
        Orthogonal Factorizations
            Example 6: Orthogonal factorization
        Multiplying by the Orthogonal Matrix
        Generating the Orthogonal Matrix
    Comparing Answers

Using Sparse Linear Solvers [4]
    Reuse of Values
    Save/restart
    SITRSOL Tuning Issues
    Direct Solver Tuning Issues
    SITRSOL Quick Reference
    Usage Examples
        Example 7: General symmetric positive definite
        Example 8: General unsymmetric
        Example 9: Reuse of structure
        Example 10: Multiple right-hand sides
        Example 11: Save/restart

Glossary

Index

Figures
    Figure 1. Pipelining in add operation
    Figure 2. Pipelining and chaining
    Figure 3. Cost/robustness: general symmetric sparse solvers
    Figure 4. Cost/robustness: general unsymmetric sparse solvers
    Figure 5. In-memory to virtual matrix copy
    Figure 6. Layered software design

Tables
    Table 1. Relative cost for SGER
    Table 2. Factorization forms
    Table 3. Solve times: LAPACK and solver routines
    Table 4. Verification tests for LAPACK (all should be O(1))
    Table 5. Summary of tridiagonal solvers
    Table 6. SITRSOL argument summary
    Table 7. iparam summary
    Table 8. rparam summary
    Table 9. Summary of out-of-core routines for linear algebra
About This Guide
Related publications
The following publications provide information related to the Scientific Library:
• UNICOS User Commands Reference Manual
• UNICOS System Libraries Reference Manual
• Scientific Library Reference Manual
• Optimizing Application Code on UNICOS Systems
• CF90 Commands and Directives Reference Manual
• Fortran Language Reference Manual, Volume 1
• Fortran Language Reference Manual, Volume 2
• Fortran Language Reference Manual, Volume 3
• LINPACK User’s Guide
• LAPACK Users' Guide
The following publications provide detailed information about the topics
discussed in this manual. In many cases, these documents are referenced
specifically in this manual.
• Anderson, E., Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A.
Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen.
LAPACK Users' Guide. Philadelphia: SIAM, 1992.
See George and Liu, Duff and Erisman, and Reid for classical references that
give a thorough and in-depth treatment of sparse direct solvers. Another
common reference is Ashcraft.
The original conjugate gradient algorithm was presented in Hestenes and
Stiefel; however, Reid presented the first practical application. A classical text in
iterative methods is that of Hageman and Young. You can find good
discussions of the biconjugate gradient and biconjugate gradient squared
methods in Sonneveld. GMRES is presented by Saad and Schultz.
You can find some references on data structures in SPARSKIT and in the
proposals for sparse BLAS.
Three articles that deal directly with the Cray Research libsci sparse solvers are
Yang on direct solvers, Heroux (1991), and Heroux, Vu, and Yang on SITRSOL.
Conventions
The following conventions are used throughout this documentation:
command      This fixed-space font denotes literal items, such as
             pathnames, man page names, commands, and programming
             language structures.

variable     Italic typeface denotes variable entries and words or
             concepts being defined.

[ ]          Brackets enclose optional portions of a command line.
In addition to these formatting conventions, several naming conventions are
used throughout the documentation. "Cray PVP systems" denotes all
configurations of Cray parallel vector processing (PVP) systems that run the
UNICOS operating system. "Cray MPP systems" denotes all configurations of
the Cray T3E series, which runs the UNICOS/mk operating system. "IRIX
systems" denotes SGI platforms that run the IRIX operating system.
The default shell in the UNICOS and UNICOS/mk operating systems, referred
to as the standard shell, is a version of the Korn shell that conforms to the
following standards:
• Institute of Electrical and Electronics Engineers (IEEE) Portable Operating
System Interface (POSIX) Standard 1003.2–1992
• X/Open Portability Guide, Issue 4 (XPG4)
The UNICOS and UNICOS/mk operating systems also support the optional use
of the C shell.
Cray UNICOS Version 10.0 is an X/Open Base 95 branded product.
Obtaining Publications
The User Publications Catalog describes the availability and content of all Cray
Research hardware and software documents that are available to customers.
Customers who subscribe to the Cray Inform (CRInform) program can access
this information on the CRInform system.
To order a document, call +1 651 683 5907. SGI employees may send electronic
mail to [email protected] (UNIX system users).
Customers who subscribe to the CRInform program can order software release
packages electronically by using the Order Cray Software option.
Customers outside of the United States and Canada should contact their local
service organization for ordering and documentation information.
Reader Comments
If you have comments about the technical accuracy, content, or organization of
this document, please tell us. Be sure to include the title and part number of
the document with your comments.
You can contact us in any of the following ways:
• Send e-mail to the following address:
[email protected]
• Send a fax to the attention of “Technical Publications” at: +1 650 932 0801.
• Use the Feedback option on the Technical Publications Library World Wide
Web page:
http://techpubs.sgi.com
Introduction [1]
This manual describes the Scientific Libraries that run on UNICOS systems.
The information in this manual supplements the man pages provided with the
Scientific Library and provides details about the implementation and usage of
these library routines on UNICOS systems.
This manual includes the following sections:
• Chapter 2, page 3, discusses parallel processing environments, ways to
measure parallel processing performance, and the implementation strategies
used in the Scientific Library routines.
• Chapter 3, page 25, discusses dense linear algebra problems, including
systems of linear equations, linear least squares problems, eigenvalue
problems, and singular value problems.
• Chapter 4, page 53, discusses sparse matrices and solution techniques for
sparse linear systems.
• Chapter 5, page 101, discusses the out-of-core routines, virtual matrices, and
subroutines used with out-of-core routines.
• Appendix A, page 123, discusses libm Version 2, the default UNICOS math
library.
• Appendix B, page 129, discusses the algorithms used in libm.
Parallel Processing Issues [2]
[Figure 1. Pipelining in add operation: operands enter the functional unit and
results emerge on successive clock periods.]

[Figure 2. Pipelining and chaining. For illustration only: functional units are
not physically arranged as shown.]
Multitasking can reduce wall-clock time, but at the cost of extra CPU time,
which increases because more machine resources are used.
This subsection discusses these benefits and some of the costs of using parallel
processing.
The speedup ratio, s, is defined as follows:

    s = T1 / Tp                                                      (2.1)

where T1 is the wall-clock time on one processor and Tp is the wall-clock time
with p processors.
With p CPUs, a speedup ratio as close as possible to p is desired. If the program
were completely parallel and no overhead existed, the speedup would be equal
to p (for details about the overhead associated with multitasking, see Section
2.2.2, page 6).
For example, suppose a job takes 8.7 seconds to run on a single processor.
When the job is rerun using four processors, the execution time decreases to 2.5
seconds; the speedup is the following:
    s = 8.7 / 2.5 = 3.48                                             (2.2)
The speedup ratio of multitasked code has two limitations: Amdahl’s Law
(representing the speedup related to the sequential portion of the code), and
multitasking overhead (discussed in Section 2.2.2, page 6).
Amdahl's Law gives the maximum expected speedup as follows:

    s = 1 / ((1 − f) + f/p)                                          (2.3)
The following components appear in this equation:
• s: Maximum expected speedup from multitasking
• p: Number of processors available for parallel execution
• f: Fraction of a program that can execute in parallel (the parallel fraction)
For example, suppose that code is 98% parallel, implying that 2% of the code
runs in serial mode. Suppose that 64 processors are available. According to
Amdahl’s Law, the maximum speedup that you can expect is:
004–2151–002 9
Scientific Libraries User’s Guide
    s = 1 / ((1 − 0.98) + 0.98/64) = 28.32                           (2.4)
The speedup from multitasking, s, is in terms of wall-clock time, not CPU time.
Speedups equal to the physical number of processors require that the executing
program use all processors effectively 100% of the time with no overhead.
Because this is not possible, performance is dominated by the fraction of the
time spent executing serial code.
Maximum speedup depends on your serial code. If the number of processors
were infinite, the parallel term would be 0, but the serial term would remain,
giving a maximum possible speedup of the following:
    s = 1 / 0.02 = 50                                                (2.5)
This is sometimes interpreted to mean that only 50 processors can be used on a
98% parallel problem. You can use any number of processors, but because the
maximum possible speedup is a constant, the efficiency decreases as the
number of processors increases.
The amlaw command displays the maximum theoretical speedup when you
provide the number of CPUs and the percent parallelism. See the amlaw(1)
man page for information about using the command.
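The same bound can also be computed directly from equation (2.3); the
following few lines of Fortran are an illustrative stand-in, not the amlaw
command itself:

      PROGRAM AMD
      REAL F, S
      INTEGER P
C     Parallel fraction and processor count from the example above
      F = 0.98
      P = 64
C     Amdahl's Law, equation (2.3)
      S = 1.0 / ((1.0 - F) + F / REAL(P))
      PRINT *, 'Maximum expected speedup:', S
      END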
Efficiency, e, is defined as follows:

    e = s / p                                                        (2.6)
If the program were completely parallel, and no overhead existed, the efficiency
would be 1.0 (100%). As an example, suppose that the measured speedup
running on four processors, compared to one processor, is s = 3.48. The
efficiency is then defined as follows:
    e = 3.48 / 4 = 0.87 = 87%                                        (2.7)
Solving Amdahl's Law for f in terms of a measured speedup s on p processors
gives the equivalent parallel fraction:

    f = (1 − 1/s) / (1 − 1/p)                                        (2.8)
For example, suppose the measured speedup running on four processors
compared to one processor is s = 3.48. The equivalent parallel fraction is the
following:
    f = (1 − 1/3.48) / (1 − 1/4) = 0.95                              (2.9)
This program thus achieves 95% parallelism, assuming the idealized model of
Amdahl’s Law.
information about the different environment variables you can set to help
tune your system's performance.
Note: The settings for the different environment variables can significantly
affect results and timings of code. See your system administrator for
information about your site’s use of these environment variables.
In the multiuser environment, you cannot control the number of processors that
are attached to a job except to specify a maximum number by using the NCPUS
environment variable. The actual number of processors that will be used cannot
be known in advance and may change as the job runs. When working in a
lightly loaded, multiuser environment, set NCPUS equal to a small number of
the available physical processors.
The following assumptions are made about the multiuser environment:
• Users do not know how many processors will be available to a job during
run time, except that the number will be less than or equal to NCPUS.
• It is probable that fewer processors than the number specified in NCPUS will
be attached to a job; therefore, you should use a partitioning strategy that
gives the best average performance.
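For example, to limit a job to at most four processors under the standard shell:

NCPUS=4
export NCPUS
./a.out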
#!/bin/sh
maxcpus='4 8' # max NCPUS
xgf1=sgemv_1.xgf # results: xgraph input file
xgf2=sgemv_2.xgf # results: xgraph input file
#
cat > mmstr.f << EOF
PROGRAM MMSTR
PARAMETER (N =1024)
REAL*8 A(N,N),B(N,N),C(N,N)
REAL OPS, MFLOPS
REAL CYCLET, GET_CTICKS
INTEGER LOOPCNT
INTEGER T1, T2, IRTC
EXTERNAL SECOND, IRTC, GET_CTICKS
C.....Clock period in picoseconds (assignment assumed; CYCLET is
C     needed by the rate computation below)
CYCLET = GET_CTICKS()
NSTART = N
NEND = 512
NINK = -128
999 CONTINUE
C
DO 1000 NIND=NSTART,NEND,NINK
IF (NINK.LT.-8) THEN
NST1 = MIN(NIND+8,N)
NEN1 = MAX(NIND-8,8)
ELSE
NST1 = NIND
NEN1 = NIND
ENDIF
DO 1000 NPART=NST1,NEN1,-8
LDA = NPART
LDB = NPART
LDC = NPART
CALL EX312(A,LDA,NPART)
CALL EX312(B,LDB,NPART)
C
CALL CINIT(C,LDC,NPART)
LOOPCNT = NPART
C
T1 = IRTC()
S1 = SECOND ()
DO I=1,LOOPCNT
CALL SGEMV('N',NPART,NPART,1.,A,LDA,B(1,I),1,0.,C(1,I),1)
END DO
S2 = SECOND ()
T2 = (IRTC() - T1)/NPART
C
OPS = 2*(NPART * NPART)
WRITE(66,’(I5,E12.4)’) NPART,S2-S1
WRITE(77,’(I5,E12.4)’) NPART, OPS/(T2 *CYCLET)
C
1000 CONTINUE
C
IF (NINK.EQ.-128) THEN
NINK = -64
NSTART = NEND + NINK
NEND = 256
GOTO 999
ELSEIF (NINK.EQ.-64) THEN
NINK = -32
NSTART = NEND + NINK
NEND = 128
GOTO 999
ELSEIF (NINK.EQ.-32) THEN
NINK = -8
NSTART = NEND + 2 * NINK
NEND = 8
GOTO 999
ENDIF
C
64 FORMAT(3X,I4,2X,'I',5(1X,E12.4,3X,'I'))
2000 CONTINUE
END
C
SUBROUTINE EX312(A,LDA,NPART)
INTEGER NPART,I,J,NP1I,LDA
REAL*8 A(LDA,NPART)
C
DO 10 J=1,NPART
DO 10 I=J,NPART
NP1I=NPART+1-I
A(I,J) = DBLE(NP1I)
A(J,I) = DBLE(NP1I)
10 CONTINUE
RETURN
END
C
SUBROUTINE CINIT(C,LDC,NPART)
INTEGER NPART,I,J,LDC
REAL*8 C(LDC,NPART)
C
DO 10 I=1,NPART
DO 10 J=1,NPART
C(I,J) = 0.0
10 CONTINUE
RETURN
C
END
EOF
#
# Get clock ticks to compute wall clock seconds
#
cat >get_cticks.c << EOF
#include <sys/target.h>
float GET_CTICKS()
/* Return the clock period in picoseconds */
{
struct target data;
target(MC_GET_TARGET, &data);
return data.mc_clk;
}
EOF
#
cc -c -g get_cticks.c
# compile and link the Fortran driver (compiler invocation assumed)
cf77 mmstr.f get_cticks.o
NCPUS=1 ./a.out
mv fort.66 x.1
#
# Eliminate residual CPU usage by setting
# HOLDTIME to 0
#
MP_HOLDTIME=0
export MP_HOLDTIME
for n in $maxcpus
do
ja
NCPUS=$n ./a.out
ja -ct
echo ’\n" ’$n CPUs’"’ >> $xgf1
echo ’\n" ’$n CPUs’"’ >> $xgf2
paste x.1 fort.66 |
awk '{ ov = ($4 - $2)*100/$2; print $1, ov }' >> $xgf1
cat fort.77 >> $xgf2
done
#
rm -f x.1 fort.66
• Vector problem (VP) – Problems large enough for vector processing, but too
  small to benefit from parallel processing.
• Small parallel/vector problem (SPVP) – Problems large enough for vector and
  parallel processing, but for which parallel processing degrades vector
  performance.
• Medium parallel/vector problem (MPVP) – Problems large enough for
optimal vector and parallel processing, but for which load balancing can be
a significant problem if a processor is lost.
• Large parallel/vector problem (LPVP) – Problems large enough for optimal
vector and parallel processing for which load balancing is not a significant
problem. In this case, enough work exists to partition the problem into
many subproblems so that the effect of losing a processor is minimized.
The boundary between the VP and SPVP classes is independent of the number
of processors. However, the other boundaries between classes depend on the
number of processors you try to use. A problem size of class LPVP on 2
processors could be of class MPVP on 4 processors and class SPVP on 8 or 16
processors.
Despite this unpredictability, some strategies are available. First, the strategies
for VP and LPVP problems are the same as in the dedicated environment. You
should execute VP problems on one processor regardless of the number of
processors requested, because using one processor is the fastest way to solve
the problem.
Secondly, by definition, LPVP problems have optimal vector performance
regardless of the number of processors actually attached so that CPU time
efficiency eu should always be close to optimum. LPVP problems are also
partitioned into kp subproblems, k = 2,3,..., where p is the requested number of
processors. If p′ processors (p′ < p) are attached, the ratio kp/p′ is usually large
enough that the remainder kp mod p′ does not create significant load imbalance.
It can be difficult to develop strategies for SPVP and MPVP problems. As in a
dedicated environment, you must partition the problem into exactly p
subproblems and execute it on p processors. Because the number of processors
attached to a job is unknown, however, the correct value of p is not
immediately obvious and, unlike the dedicated environment, you should not
always set p to NCPUS. Choosing p incorrectly could lead to a partitioning
strategy that is too aggressive (increasing user and elapsed times) or too
conservative (reducing parallelism).
To determine the "best" value of p ≤ NCPUS (p specifies the number of
subproblems and the number of processors to use) for SPVP and MPVP
problems, first consider SPVP problems. Let m = n = 100 and consider the
performance statistics for SGER(3S) in Table 1. The Scientific Library subroutine
SGER is part of Level 2 BLAS (Basic Linear Algebra Subprograms). SGER
performs the rank-one update A ← A + αxy^T of a general matrix, where A,
x, and y are real valued and of dimension m × n, m × 1, and n × 1, respectively,
and α is a scalar.
To process this equation in parallel, partition A horizontally and/or vertically
into submatrices A_ij with dimensions m_i × n_j, i = 1, ..., n_h, j = 1, ..., n_v,
where n_h and n_v are the number of horizontal and vertical partitions,
respectively. x and y are partitioned appropriately. This is an SPVP problem
because any partition of the original problem produces subproblems that have
poorer vector performance.
[Table 1. Relative cost for SGER (fragment): 1 processor: 72.72 (1.00);
2 processors: 84.03 (1.33), 49.79 (0.93)]
LAPACK [3]
This section supplements the information on the LAPACK man pages and in
the LAPACK Users’ Guide. It discusses in more detail how the LAPACK
computational routines are used, and uses examples to illustrate the results.
Online man pages are available for individual LAPACK subroutines. For
example, to view a description of the calling sequence for the subroutine to
perform the LU factorization of a real matrix, see the SGETRF(3S) man page.
LAPACK routines that operate on 64-bit real and complex data are included in
the Scientific Library. The subroutine names in the Scientific Library are the
names of the single-precision routines in the standard interfaces; the first letter
of the routine name is S for real data routines and C for complex data routines.
When porting applications from other systems that call LAPACK routines in
double precision, change the names of the calls (for example, change CALL
DGETRF() to CALL SGETRF()). You also can use a compiler option to disable
double-precision constructs in the code and a loader directive to map
double-precision names to single-precision names.
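For example, a port changes only the routine name; the declarations below are
illustrative, and A is assumed to have been filled elsewhere:

      INTEGER N, LDA, INFO
      PARAMETER (N = 100, LDA = 100)
      REAL A(LDA,N)
      INTEGER IPIV(N)
C     On the original system:  CALL DGETRF(N, N, A, LDA, IPIV, INFO)
C     On UNICOS systems, where 64-bit REAL is the default:
      CALL SGETRF(N, N, A, LDA, IPIV, INFO)
      END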
Several enhancements improve the performance of the LAPACK routines on
UNICOS systems. For example, the solver routines are redesigned for better
performance for one or a small number of right-hand sides and to use
parallelism when the number of right-hand sides is large.
Tuning parameters for the block algorithms provided in the Scientific Library
are set within the ILAENV LAPACK routine. ILAENV is an integer function
subprogram that accepts information about the problem type and problem
dimensions and returns a single integer parameter such as the optimal block
size, the minimum block size for which a block algorithm should be used, or the
crossover point (the problem size at which it becomes more efficient to switch
to an unblocked algorithm). Setting tuning parameters occurs without user
intervention, but users can call ILAENV directly to check the values to be used.
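The following sketch shows such a direct query through the standard LAPACK
ILAENV interface (the problem size is illustrative):

      PROGRAM BLKSZ
      INTEGER ILAENV, NB
      EXTERNAL ILAENV
C     ISPEC = 1 requests the optimal block size; unused problem
C     dimensions are passed as -1
      NB = ILAENV(1, 'SGETRF', ' ', 1000, 1000, -1, -1)
      PRINT *, 'Block size for SGETRF:', NB
      END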
The orthogonal factorizations discussed in Section 3.5, page 45, also have
application in eigenvalue and singular value computations.
There are two classes of LAPACK routines: LAPACK driver routines solve a
complete problem; LAPACK computational routines perform one step of the
computation. The driver routines generally call the computational routines to
do their work, and offer a more convenient interface; therefore, LAPACK users
should use the LAPACK driver routines for solving systems of linear equations.
The INTRO_LAPACK(3S) man page and the man pages for the individual
subroutines describe the functions performed by each of the driver routines.
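For example, the standard LAPACK simple driver SGESV performs the
factorization and the solution in one call; a minimal sketch, using the matrix
that appears later in Example 1:

      PROGRAM SOLV
      INTEGER N, NRHS, LDA, LDB, INFO, I
      PARAMETER (N = 3, NRHS = 1, LDA = N, LDB = N)
      INTEGER IPIV(N)
      REAL A(LDA,N), B(LDB,NRHS)
C     The matrix of equation (3.9), stored by columns, and b all ones
      DATA A / 4., 3., 8., 9., 5., 1., 2., 7., 6. /
      DATA B / 3*1.0 /
C     SGESV factors A = P*L*U and overwrites B with the solution X
      CALL SGESV(N, NRHS, A, LDA, IPIV, B, LDB, INFO)
      IF (INFO .NE. 0) THEN
         PRINT *, 'SGESV failed, INFO =', INFO
      ELSE
         PRINT *, 'Solution:', (B(I,1), I = 1, N)
      ENDIF
      END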
To see what these routines do, consider the following system of two equations
in two unknowns:

    [  2  1 ] [ x ]   [ 4 ]
    [ -1  1 ] [ y ] = [ 1 ]                                          (3.1)
The solution to this system of equations is the set of vectors [x, y]^T that satisfy
both equations. The physical interpretation of this system of equations is that it
represents two lines in the (x,y) plane, which may intersect in one point, no
points (if they are parallel), or an infinite number of points (if the two equations
are multiples of each other).
To solve the system, eliminate variables from each successive equation until the
system is simple enough to solve directly. To form the simpler system, add 1/2
times the first equation to the second equation, as follows:
    [ 2  1.0 ] [ x ]   [ 4 ]
    [ 0  1.5 ] [ y ] = [ 3 ]                                         (3.2)
The second equation can then be solved to get y = 2, and this result is
substituted into the first equation to get x = 1. This process amounts to a
factorization of the coefficient matrix:
    [  2  1 ]   [  1.0  0 ] [ 2  1.0 ]
    [ -1  1 ] = [ -0.5  1 ] [ 0  1.5 ]                               (3.3)
followed by two triangular system solutions:
    [  1.0  0 ] [ z1 ]   [ 4 ]
    [ -0.5  1 ] [ z2 ] = [ 1 ]                                       (3.4)

whose solution is z1 = 4, z2 = 3, and
    [ 2  1.0 ] [ x ]   [ 4 ]
    [ 0  1.5 ] [ y ] = [ 3 ]                                         (3.5)
whose solution is x = 1, y = 2. Now consider what happens if the two
equations represent parallel lines, with no points in common, as in this example:
    [ 2  1 ] [ x ]   [ 4 ]
    [ 2  1 ] [ y ] = [ 0 ]                                           (3.6)
The factorization step takes the following form:
    [ 2  1 ]   [ 1  0 ] [ 2  1 ]
    [ 2  1 ] = [ 1  1 ] [ 0  0 ]                                     (3.7)
The second factor, which should be upper triangular, has a 0 on the diagonal.
This indicates that the triangular system that involves the second factor cannot
be solved by back-substitution, and the system does not have a unique solution.
The factorization routines in LAPACK detect zeros on the diagonals of the
triangular factors and return this information in an error code.
The LAPACK routines for solving linear systems assume the system is already
in a matrix form. The data type (real or complex), characteristics of the
coefficient matrix (general or symmetric, and positive definite or indefinite if
symmetric), and the storage format of the matrix (dense, band, or packed),
determine the routines that should be used.
    A = (P1 L1 P2 L2 ... Pn Ln) D (P1 L1 P2 L2 ... Pn Ln)^T          (3.8)

where each Pi is a rank-1 permutation, each Li is a rank-1 or rank-2
modification of the identity, and D is a diagonal matrix with 1-by-1 and
2-by-2 diagonal blocks.
Generally, users do not have to know the details of how the factorization is
stored, because other LAPACK routines manipulate the factored form.
Regardless of the form of the factorization, it reduces the solution phase to one
that requires only permutations and the solution of triangular systems. For
example, the LU factorization of a general matrix, A = PLU , is used to solve for
X in the system of equations AX = B by successively applying the inverses of
P, L, and U to the right-hand side:
1. X ← P^-1 B

2. X ← L^-1 X

3. X ← U^-1 X
In the last two steps, the inverse of the triangular factors is not computed, but
triangular systems of the form LY = Z and UX = Y are solved instead.
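A sketch of this factor-then-solve sequence with the computational routines
(the data is the Example 1 matrix with a right-hand side of all ones):

      PROGRAM FACSOL
      INTEGER N, NRHS, LDA, LDB, INFO, I
      PARAMETER (N = 3, NRHS = 1, LDA = N, LDB = N)
      INTEGER IPIV(N)
      REAL A(LDA,N), B(LDB,NRHS)
      DATA A / 4., 3., 8., 9., 5., 1., 2., 7., 6. /
      DATA B / 3*1.0 /
C     Factor A = P*L*U once...
      CALL SGETRF(N, N, A, LDA, IPIV, INFO)
C     ...then apply the permutation and the two triangular solves.
C     'N' selects A*X = B rather than the transposed system.
      IF (INFO .EQ. 0) THEN
         CALL SGETRS('N', N, NRHS, A, LDA, IPIV, B, LDB, INFO)
         PRINT *, 'Solution:', (B(I,1), I = 1, N)
      ENDIF
      END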
The following table lists the factorization forms for each of the factorization
routines for real matrices. The factorization forms differ for SGETRF and
Example 1: LU factorization
The SGETRF subroutine performs an LU factorization with partial pivoting
(A = P LU ) as the first step in solving a general system of linear equations
AX = B . If SGETRF is called with the following:
    A = [ 4.  9.  2. ]
        [ 3.  5.  7. ]
        [ 8.  1.  6. ]                                               (3.9)
details of the factorization are returned, as follows:
    A = [ 8.     1.      6.    ]
        [ 0.5    8.5    -1.0   ]
        [ 0.375  0.5441  5.294 ]

    IPIV = [3, 3, 3]                                                 (3.10)
Matrices L and U are given explicitly in the lower and upper triangles,
respectively, of A:
    L = [ 1.                 ]       U = [ 8.  1.    6.    ]
        [ 0.5    1.          ]           [     8.5  -1.0   ]
        [ 0.375  0.5441  1.  ]           [           5.294 ]         (3.11)
The IPIV vector specifies the row interchanges that were performed. IPIV(1)=
3 implies that the first and third rows were interchanged when factoring the first
column; IPIV(2)= 3 implies that the second and third rows were interchanged
when factoring the second column. In this case, IPIV(3) must be 3 because
there are only three rows. Thus, the permutation matrix is the following:
    P = [ 0  0  1 ] [ 1  0  0 ]   [ 0  1  0 ]
        [ 0  1  0 ] [ 0  0  1 ] = [ 0  0  1 ]
        [ 1  0  0 ] [ 0  1  0 ]   [ 1  0  0 ]                        (3.12)
Generally, the pivot information is used directly from IPIV without
constructing matrix P.
Example 2: Symmetric indefinite matrix factorization

If SSYTRF is called with the following:

    A = [  9.                 ]
        [  5.    7.           ]
        [ -16.  12.    3.     ]
        [  4.   -2.   10.  8. ]                                      (3.13)
Only the lower triangle of A is specified because the matrix is symmetric, but
you could have specified the upper triangle instead. The output from SSYTRF
is the following:
    A = [  9.                              ]
        [ -16.      3.                     ]
        [ -0.9039  -0.8210  21.37          ]
        [ -0.7511  -0.6725   0.4597  13.21 ]

    IPIV = [-3, -3, 3, 4]                                            (3.14)
The signs of the indices in the IPIV vector indicate that a 2-by-2 pivot block
was used for the first two columns, and 1-by-1 pivots were used for the third
and fourth columns. Therefore, D must be the following:
    D = [   9.  -16.               ]
        [ -16.    3.               ]
        [              21.37       ]
        [                    13.21 ]                                 (3.15)
Matrix L is supplied in factored form as L = P1 L1 P2 L2, where the parts of
each Li that differ from the identity are stored in A below their corresponding
blocks Di:

    P1 = [ 1  0  0  0 ]     L1 = [  1.                       ]
         [ 0  0  1  0 ]          [  0.       1.              ]
         [ 0  1  0  0 ]          [ -0.9039  -0.8210  1.      ]
         [ 0  0  0  1 ]          [ -0.7511  -0.6725  0.   1. ]       (3.16)
32 004–2151–002
LAPACK [3]
    P2 = I ;   L2 = [ 1.                  ]
                    [ 0.  1.              ]
                    [ 0.  0.  1.0         ]
                    [ 0.  0.  0.4597  1.  ]                          (3.17)
All other errors in the LAPACK routines are described by error codes returned
in info, the last argument. The values returned in info are routine-specific,
except for info = 0, which always means that the requested operation completed
successfully.
For example, an error code of info > 0 from SGETRF means that one of the
diagonal elements of the factor U from the factorization A = P LU is exactly 0.
This indicates that one of the rows of A is a linear combination of the other
rows, and the linear system does not have a unique solution.
Example 3: Error conditions

If SGETRF is called with the following:

    A = [ 1.  4.  7. ]
        [ 2.  5.  8. ]
        [ 3.  6.  9. ]                                               (3.18)
it returns
    A = [ 3.      6.   9. ]
        [ 0.3333  2.   4. ]
        [ 0.6667  0.5  0. ]

    IPIV = [3, 3, 3]                                                 (3.19)
which corresponds to the factorization
    P = [ 0  1  0 ]   L = [ 1.0             ]   U = [ 3.  6.  9. ]
        [ 0  0  1 ]       [ 0.3333  1.0     ]       [     2.  4. ]
        [ 1  0  0 ]       [ 0.6667  0.5  1. ]       [         0. ]   (3.20)
On exit from SGETRF, info = 3, indicating that U(3,3) is exactly 0. This is not an
error condition for the factorization because the factors that were computed
satisfy A = P LU , but the factorization cannot be used to solve the system.
As shown, you should always check the return code from the factorization to
see whether it completed successfully and did not produce any singular factors.
To obtain further information about proceeding with the solve, estimate the
condition number (see Section 3.4.1, page 36 for details).
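A sketch of that check, using the standard LAPACK routines SLANGE and
SGECON to estimate the reciprocal condition number after the factorization
(the singular matrix of equation (3.18) is used for illustration):

      PROGRAM CHKINF
      INTEGER N, LDA, INFO
      PARAMETER (N = 3, LDA = N)
      INTEGER IPIV(N), IWORK(N)
      REAL A(LDA,N), WORK(4*N), ANORM, RCOND, SLANGE
      EXTERNAL SLANGE
C     Singular matrix of equation (3.18), stored by columns
      DATA A / 1., 2., 3., 4., 5., 6., 7., 8., 9. /
C     The 1-norm of A must be computed before A is overwritten
      ANORM = SLANGE('1', N, N, A, LDA, WORK)
      CALL SGETRF(N, N, A, LDA, IPIV, INFO)
      IF (INFO .GT. 0) THEN
         PRINT *, 'U(', INFO, ',', INFO, ') is exactly zero'
      ELSE
         CALL SGECON('1', N, A, LDA, ANORM, RCOND, WORK, IWORK, INFO)
         PRINT *, 'Reciprocal condition estimate:', RCOND
      ENDIF
      END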
Because most of the LAPACK driver routines do their work in the LAPACK
computational routines, a call to a driver routine gives the same performance as
separate calls to the computational routines. The exceptions are the simple
driver routines used for solving tridiagonal systems: SGTSV, SPTSV, CGTSV,
and CPTSV. These routines compute the solution while performing the
factorization for certain numbers of right-hand sides. Because the amount of
work in each loop is small, some reloading of constants and loop overhead is
saved by combining the factorization with part of the solve.
Table 3, page 35 shows the times (in microseconds) on one processor of a
Cray C90 system for solving a tridiagonal system with one right-hand side,
using the LAPACK factor and solve routines separately, compared to the times
for the simple driver routines. For comparison, times also are shown for the
Scientific Library versions of the equivalent LINPACK routines SGTSL, SPTSL,
CGTSL, and CPTSL. The LAPACK driver routines are typically about 20% faster
than the separate computational routines for one right-hand side, and are faster
than LINPACK in all cases except SPTSL, which outperforms SPTSV by not
checking for zeros on the diagonal during the factorization.
This table also shows times for the assembler-coded Scientific Library routine
SDTSOL(3S) for real general tridiagonal matrices, which does not do pivoting
and, like LINPACK, accepts only one right-hand side. This subroutine is much
faster than LAPACK on those problems for which it can be used.
[Table 3. Solve times: LAPACK and solver routines (microseconds), for
n = 25, 50, 100, and 200]
LAPACK [3]
machine epsilon, SGESVX returns info = N+1, indicating that the matrix A is
singular to working precision, and it does not compute the solution.
    A = [ 1.  2.  3. ]
        [ 4.  5.  6. ]
        [ 7.  8.  9. ]                                               (3.21)
is singular in exact arithmetic, but on UNICOS systems, SGETRF returns

    A = [ 7.      8.       9.            ]
        [ 0.1429  0.8571   1.714         ]
        [ 0.5714  0.5     -2.478×10^-14  ]                           (3.22)

where IPIV = [3, 3, 3] and info = 0. In exact arithmetic, A(3,3) would have been
0, but roundoff error has made this entry -2.478×10^-14 instead. The reciprocal
condition number computed by SGECON is 3.45×10^-16, which is less than the
machine epsilon of 1.42×10^-14. Therefore, SGESVX returns info = 4 and does
not try to solve any systems with this A.
    x − x̂ = A^-1 r                                                   (3.23)

and

    ‖x − x̂‖ ≤ ‖A^-1‖ ‖r‖ ≤ ‖A‖ ‖A^-1‖ ‖x‖ (‖r‖ / ‖b‖)                (3.24)

because ‖b‖ ≤ ‖A‖ ‖x‖. This gives the bound

    ‖x − x̂‖ / ‖x‖ ≤ ‖A‖ ‖A^-1‖ (‖r‖ / ‖b‖)                           (3.25)

Now consider a perturbed system

    (A + ΔA)(x + Δx) = b + Δb                                        (3.26)

Subtracting Ax = b gives

    A Δx = Δb − ΔA (x + Δx)                                          (3.27)

Assuming A is nonsingular,

    ‖Δx‖ ≤ ‖A^-1‖ (‖Δb‖ + ‖ΔA‖ ‖x‖ + ‖ΔA‖ ‖Δx‖)                      (3.28)

    ‖Δx‖ / ‖x‖ ≤ ‖A^-1‖ (‖Δb‖ / ‖x‖ + ‖ΔA‖ + ‖ΔA‖ (‖Δx‖ / ‖x‖))      (3.29)

Using the inequality ‖b‖ ≤ ‖A‖ ‖x‖,

    ‖Δx‖/‖x‖ ≤ ‖A‖ ‖A^-1‖ (‖Δb‖/‖b‖ + ‖ΔA‖/‖A‖ + (‖ΔA‖/‖A‖)(‖Δx‖/‖x‖))   (3.30)

and substituting κ(A) = ‖A‖ ‖A^-1‖,

    ‖Δx‖/‖x‖ ≤ κ(A) (‖Δb‖/‖b‖ + ‖ΔA‖/‖A‖) / (1 − κ(A) ‖ΔA‖/‖A‖)      (3.31)

provided κ(A) ‖ΔA‖/‖A‖ < 1. In terms of the relative backward error ω,

    ‖Δx‖/‖x‖ ≤ 2ω κ(A) / (1 − ω κ(A))                                (3.32)
In Section 3.4.4.1, page 43, the backward error is defined slightly differently to
obtain a component-wise error bound.
3.4.3 Equilibration
The condition number defined in the last section is sensitive to the scaling of A.
For example, the matrix
" #
A=
1: 1
0 1: 2 1016
(3.33)
" #
A01 =
1: 0 1 2 10016
0 1: 2 10016
(3.34)
and so RCOND = 1/(‖A‖ ‖A^-1‖) = 1.×10^-16. Because this value of RCOND is
less than the machine epsilon on UNICOS systems, SGESVX (with FACT = 'N')
does
not try to solve a system with this matrix. However, A has elements that vary
widely in magnitude, so the bounds on the relative error in the solution may be
pessimistic. For example, if the right-hand side in Ax = b is the following:
    b = [ 2.0       ]
        [ 1.0×10^16 ]                                                (3.35)
SGETRF followed by SGETRS produces the exact answer, x = [1., 1.]^T, on
UNICOS systems.
You can improve the condition of this example by a simple row scaling. Scaling
a problem before computing its solution is known as equilibration, and it is an
option to some of the expert driver routines (those for general or positive
definite matrices). Enabling equilibration does not necessarily mean it will be
done; the driver routine will choose to do row scaling, column scaling, both
row and column scaling, or no scaling, depending on the input data. The usage
of this option is as follows:
CALL SGESVX('E', 'N', N, NRHS, A, LDA, AF,
$ LDAF, IPIV, EQUED, R, C, B, LDB, X, LDX,
$ RCOND, FERR, BERR, WORK, IWORK, INFO)
The 'E' in the first argument enables equilibration. For this example, EQUED =
'R' on return, indicating that only row scaling was done, and the vector R
contains the scaling constants:
    R = [ 1.0        ]
        [ 1.0×10^-16 ]                                               (3.36)
The form of the equilibrated problem is:

    (diag(R) A diag(C)) (diag(C)^-1 X) = diag(R) B                   (3.37)

or A_E X_E = B_E, where A_E is returned in A:
    A_E = [ 1.  0.         ] [ 1.  1.        ]   [ 1.  1. ]
          [ 0.  1.0×10^-16 ] [ 0.  1.0×10^16 ] = [ 0.  1. ]          (3.38)

and B_E is returned in B:

    B_E = [ 1.  0.         ] [ 2.        ]   [ 2. ]
          [ 0.  1.0×10^-16 ] [ 1.0×10^16 ] = [ 1. ]                  (3.39)
The factored form, AF, returns the same matrix as A in this example, because A
is upper triangular, and RCOND = 0.3, which is the estimated reciprocal
condition number for the equilibrated matrix. The only output quantity that
pertains to the original problem before equilibration is the solution matrix X. In
this example, X is also the solution to the equilibrated problem because no
column scaling was done, but if EQUED had returned 'C' or 'B' and the
solution to the equilibrated system were desired, it could be computed from
X_E = diag(C)^-1 X.
Example 5: Hilbert matrix

Consider the Hilbert matrix of order 5,

    H5 = [ 1    1/2  1/3  1/4  1/5 ]
         [ 1/2  1/3  1/4  1/5  1/6 ]
         [ 1/3  1/4  1/5  1/6  1/7 ]
         [ 1/4  1/5  1/6  1/7  1/8 ]
         [ 1/5  1/6  1/7  1/8  1/9 ]                                 (3.40)
which has a condition number of 4.77×10^5. A rule of thumb (Section 3.4.1,
page 36) suggests almost 8 digits of accuracy in the solution are possible on
UNICOS systems, because ε = 1.4×10^-14 and κ(A)ε ≈ 0.5×10^-8. If the matrix
is factored using SGETRF and SGETRS is used to solve Ax = b, where
b = [1, 0, 0, 0, 0]^T, the following answers are obtained:

[Computed solutions on Cray C90 and Cray T3D systems]

Two additional digits are shown for the Cray T3D results, which are closer to
the exact solution of x = [25, −300, 1050, −1400, 630]^T. The component-wise
backward error bounds computed on each system are about 9.1×10^-16 on
Cray C90 systems and about 3.2×10^-17 on Cray T3D systems.
Iterative refinement does not change either solution on the system that
computed it.
But if the Cray C90 factorization and solution are input to SGERFS on a
Cray T3D system, iterative refinement produces the new solution:
x(1) 25.0000000000147
x(2) -300.000000000274
x(3) 1050.000000120
x(4) -1400.000000183
x(5) 630.0000000904
The forward error bound f bounds the relative error in the computed solution x̂:

    ‖x − x̂‖ / ‖x‖ ≤ f                                                (3.41)

and the backward error bound ω bounds the relative errors in each component
of A and b in the perturbed equation (3.26) from Section 3.4.2, page 37:

    |ΔA_ij| ≤ ω |A_ij| ,   |Δb_i| ≤ ω |b_i|                          (3.42)
In Example 5, page 41, the first column of the inverse of the Hilbert matrix of
order 5 is computed by solving H5 x = e1, and SGERFS computed error bounds
of f ≈ 1.4×10^-8 and ω ≈ 9.1×10^-16 on UNICOS systems. This provides direct
information about the solution; its relative error is at most O(10^-8); therefore,
the largest components of the solution should exhibit about 8 digits of accuracy,
and the system for which this solution is an exact solution is within a factor of
epsilon (1.4×10^-14) of the system whose solution was attempted, so the
solution is as good as the data warrants.
The component-wise relative backward error bound (equation (3.42)) is more
restrictive than the classical backward error bound ‖ΔA‖ ≤ ω ‖A‖, because it
requires ΔA to have the same sparsity structure as A: if A_ij is 0, so must be
ΔA_ij. The backward error for the solution x̂ is computed from the equation

    ω = max_i ( |r_i| / (|A| |x̂| + |b|)_i )                          (3.43)

where r = b − Ax̂, and the forward error bound is computed from the equation
In the linear least squares problem, you choose X to

    minimize ‖B − AX‖_2                                              (3.45)

If A is underdetermined, that is, A is m × n with m < n, generally many
solutions exist, and you may want to find the solution X with minimum
2-norm. Solving these problems requires that you first obtain a basis for the
range of A, and several orthogonal factorization routines are provided in
LAPACK for this purpose.
An orthogonal factorization decomposes a general m × n matrix A into a
product of an orthogonal matrix Q and a triangular or trapezoidal matrix. A
real matrix Q is orthogonal if Q^T Q = I, and a complex matrix Q is unitary if
Q^H Q = I. The key property of orthogonal matrices for least squares problems
is that multiplying a vector by an orthogonal matrix does not change its
2-norm, because

    ‖Qx‖_2 = sqrt(x^T Q^T Q x) = sqrt(x^T x) = ‖x‖_2                 (3.46)
The QR factorization of an m × n matrix A (m ≥ n) is given by

    A = Q [ R ]
          [ 0 ]                                                      (3.47)

where Q is an m-by-m orthogonal matrix, and R is an n-by-n upper triangular
matrix. If m > n, it is convenient to write the factorization as

    A = [ Q^(1)  Q^(2) ] [ R ]
                         [ 0 ]                                       (3.48)

or simply

    A = Q^(1) R                                                      (3.49)

where Q^(1) consists of the first n columns of Q, and Q^(2) consists of the
remaining m−n columns. The LAPACK routine SGEQRF computes this
factorization. See the LAPACK Users' Guide for details.
Example 6: Orthogonal factorization

The output returned by SGEQRF for the matrix

    A = [  1.   2.   3. ]
        [ -3.   2.   1. ]
        [  2.   0.  -1. ]
        [  3.  -1.   2. ]                                            (3.50)

is a compact representation of Q and R consisting of

    A = [ -4.796    1.460   -0.8341 ]
        [ -0.5176  -2.621   -2.754  ]
        [  0.3451  -0.03805  2.593  ]
        [  0.5176  -0.2611  -0.3223 ]                                (3.51)
The matrix R appears explicitly in the upper triangle of A:

    R = [ -4.796   1.460  -0.8341 ]
        [  0.     -2.621  -2.754  ]
        [  0.      0.      2.593  ]                                  (3.52)
while the matrix Q is stored in factored form Q = Q1 Q2 Q3, where each Qi is
an elementary Householder transformation of the form Qi = I − τi vi vi^T.
Each vector vi has vi(1:i−1) = 0 and vi(i) = 1, and vi(i+1:m) is stored below
the diagonal in the ith column of A. Therefore,

    Q1 = I − 1.2085 [  1.     ] [ 1.  −0.5176  0.3451  0.5176 ]
                    [ −0.5176 ]
                    [  0.3451 ]
                    [  0.5176 ]                                      (3.53)

    Q2 = I − 1.8698 [  0.      ] [ 0.  1.  −0.03805  −0.2611 ]
                    [  1.      ]
                    [ −0.03805 ]
                    [ −0.2611  ]                                     (3.54)

    Q3 = I − 1.8118 [  0.     ] [ 0.  0.  1.  −0.3223 ]
                    [  0.     ]
                    [  1.     ]
                    [ −0.3223 ]                                      (3.55)
Each Qi is orthogonal, because

    (I − τvv^T)^T (I − τvv^T) = I − 2τvv^T + τ²(v^T v)vv^T = I       (3.56)

when τ = 2/(v^T v).
To solve a least squares problem from this factored form, transform the
right-hand side and solve the triangular system:

1.  [ X ] ← Q^T B
    [ Y ]

2.  X ← R^-1 X                                                       (3.57)
The LAPACK routine SORMQR is used in step 1 to multiply the right-hand side
matrix by the transpose of the orthogonal matrix Q, using Q in its factored
form. The triangular system solution in step 2 can be done using the LAPACK
routine STRTRS.
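In outline, the whole solution looks like the following runnable sketch, which
reproduces the numbers of this example (error checking is omitted for brevity):

      PROGRAM LSQR
      INTEGER M, N, NRHS, LDA, LDB, LWORK, INFO, I
      PARAMETER (M = 4, N = 3, NRHS = 1, LDA = M, LDB = M, LWORK = 64)
      REAL A(LDA,N), B(LDB,NRHS), TAU(N), WORK(LWORK)
C     Matrix of equation (3.50) and right-hand side b = (1,2,3,4)**T
      DATA A / 1., -3., 2., 3., 2., 2., 0., -1., 3., 1., -1., 2. /
      DATA B / 1., 2., 3., 4. /
C     Factor A = QR
      CALL SGEQRF(M, N, A, LDA, TAU, WORK, LWORK, INFO)
C     Step 1: overwrite B with Q**T b
      CALL SORMQR('L', 'T', M, NRHS, N, A, LDA, TAU, B, LDB,
     &            WORK, LWORK, INFO)
C     Step 2: solve R*x = (Q**T b)(1:N) with the upper triangle of A
      CALL STRTRS('U', 'N', 'N', N, NRHS, A, LDA, B, LDB, INFO)
      PRINT *, 'Least squares solution:', (B(I,1), I = 1, N)
      END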
Continuing the example of Section 3.5.1, page 45, suppose the right-hand side
vector is b = [1., 2., 3., 4.]^T. Multiplying b by Q^T using the LAPACK routine
SORMQR, you get

    [ x ]           [ -2.711  ]
    [ y ] = Q^T b = [ -2.273  ]
                    [  0.5712 ]
                    [  4.143  ]                                      (3.58)
and after solving the triangular system with the 3-by-3 matrix R,
    x = [ 0.7203 ]
        [ 0.6356 ]
        [ 0.2203 ]                                                   (3.59)
The last m−n elements of Q^T b can be used to compute the 2-norm of the
residual, because ‖r‖_2 = ‖y‖_2. Here, ‖r‖_2 = 4.143.
The full m-by-m orthogonal matrix can be generated explicitly with a call of
the form CALL SORGQR(M, M, N, Q, LDQ, TAU, WORK, LWORK, INFO),
where the data from the QR factorization of A has been copied into a separate
matrix Q because SORGQR overwrites its input data with the expanded matrix.
The orthogonal matrix is returned in Q:
    Q = [ -0.2085  -0.8792   0.1562  -0.3989 ]
        [  0.6255  -0.4147   0.1465   0.6444 ]
        [ -0.4170  -0.2322  -0.7665   0.4296 ]
        [ -0.6255   0.03318  0.6054   0.4910 ]                       (3.60)
The matrix Q^(1), consisting of only the first n columns of Q, could be generated
by specifying N, rather than M, as the second argument in the call to SORGQR.
In testing LAPACK, the test ratios in the following table were used to verify the
correctness of different operations. All of these ratios give a measure of relative
error. Residual tests are scaled by 1/ε, the reciprocal of the machine precision,
to make the test ratios O(1), and results that are sensitive to conditioning are
scaled by 1/κ, where κ = ‖A‖ ‖A^-1‖ is the condition number of the matrix A,
as computed from the norms of A and its computed inverse A^-1. If a given
result has a test ratio less than 30, it is judged to be as accurate as the machine
precision and the conditioning of the problem will allow. See the Installation
Guide for LAPACK for further details on the testing strategy used for LAPACK.
Using Sparse Linear Solvers [4]
Many techniques exist for solving sparse linear systems. The appropriate
technique depends on many factors, including the mathematical and structural
properties of matrix A, the dimension of A, and the number of right-hand sides
b. This section describes some of the properties that are useful in determining a
good solution technique, with some common sources of matrices with these
properties. Section 4.4, page 62 also describes some techniques you can use to
choose the correct solver for a given problem.
• Diagonally dominant matrix: A matrix A is (strictly) diagonally dominant if
  |a_ii| > Σ_{j≠i} |a_ij| for all i. If A is diagonally dominant, operations that
  involve A are often numerically stable. Common sources of
  diagonally dominant matrices are simple reservoir simulation models and
  the velocity equations of a segregated fluid dynamics solver.
• Structurally symmetric matrix: If the nonzero pattern of A is symmetric, a
  matrix A is structurally symmetric; that is, a_ij ≠ 0 if and only if a_ji ≠ 0. The
integer complexity of a solver for a structurally symmetric matrix is greatly
reduced compared to a more general solver. If A is diagonally dominant,
many of the optimal solution techniques for SPD matrices can be used.
Common sources of these matrices are the same as those for diagonally
dominant matrices.
• Banded matrix: If a_ij = 0 for |i − j| > k, matrix A is banded. If k is small in
relation to the problem dimension n, special techniques exist for solving the
related linear systems. Systems of this form usually occur in special
domains with a particular ordering of the grid or node points.
• Tridiagonal matrix: If a_ij = 0 for |i − j| > 1, matrix A is tridiagonal.
Tridiagonal matrices occur frequently in fluid dynamics and reservoir
simulation.
During a simulation, an application often generates many sparse linear systems
that must be solved. These are usually related to some linear approximation of
a nonlinear function or some time-marching scheme for a time-dependent
problem. In these cases, the linear systems are often related and information
from the previous solution can be used to solve the next linear system.
For example, consider Newton’s method for a nonlinear PDE. In this case, the
linear system An is generated by evaluating the Jacobian at a certain point. A
subsequent matrix, An+1 , is generated using the Jacobian at a nearby point.
Thus, the two matrices An and An+1 are close to each other, and this fact is
used in the solver technique. The structure of An and An+1 is usually identical
and all structural preprocessing can be done once for all related matrices.
Solution phase:

5. Given b, compute y = L^-1 b, z = D^-1 y, x = L^-T z.

In this case, A is SPD, and L and D can be found such that A = LDL^T, where L
is a unit lower triangular matrix and D is a diagonal matrix.
4.  r_0 = b − Ax_0
Do k = 0, ...
5.  z_k = M r_k
6.  ρ_k = (r_k, z_k)
7.  β_k = ρ_k / ρ_{k−1} ;  β_0 = 0
8.  p_k = z_k + β_k p_{k−1}
9.  q_k = A p_k
10. γ_k = (q_k, p_k)
11. α_k = ρ_k / γ_k
12. x_{k+1} = x_k + α_k p_k
13. r_{k+1} = r_k − α_k q_k
End Do
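The following Fortran subroutine is a minimal sketch of steps 4 through 13,
using implicit diagonal (Jacobi) preconditioning and a dense matrix-vector
product purely for clarity; SITRSOL carries out the same iteration on sparse
data structures:

      SUBROUTINE PCGDIA(N, A, LDA, B, X, R, Z, P, Q, MAXIT, TOL)
C     Preconditioned conjugate gradient with M = inverse of diag(A).
C     R, Z, P, Q are caller-supplied work vectors of length N.
      INTEGER N, LDA, MAXIT, K, I
      REAL A(LDA,N), B(N), X(N), R(N), Z(N), P(N), Q(N), TOL
      REAL RHO, RHOOLD, BETA, GAMMA, ALPHA, SDOT
      EXTERNAL SDOT
C     r0 = b - A*x0
      CALL SCOPY(N, B, 1, R, 1)
      CALL SGEMV('N', N, N, -1.0, A, LDA, X, 1, 1.0, R, 1)
      RHOOLD = 1.0
      DO 10 I = 1, N
         P(I) = 0.0
   10 CONTINUE
      DO 40 K = 0, MAXIT - 1
C        z = M*r (implicit diagonal preconditioner)
         DO 20 I = 1, N
            Z(I) = R(I) / A(I,I)
   20    CONTINUE
         RHO = SDOT(N, R, 1, Z, 1)
         BETA = RHO / RHOOLD
         IF (K .EQ. 0) BETA = 0.0
C        p = z + beta*p
         DO 30 I = 1, N
            P(I) = Z(I) + BETA * P(I)
   30    CONTINUE
C        q = A*p
         CALL SGEMV('N', N, N, 1.0, A, LDA, P, 1, 0.0, Q, 1)
         GAMMA = SDOT(N, Q, 1, P, 1)
         ALPHA = RHO / GAMMA
         CALL SAXPY(N,  ALPHA, P, 1, X, 1)
         CALL SAXPY(N, -ALPHA, Q, 1, R, 1)
C        Stop when the 2-norm of the residual is small enough
         IF (SDOT(N, R, 1, R, 1) .LE. TOL*TOL) RETURN
         RHOOLD = RHO
   40 CONTINUE
      RETURN
      END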
Iterative methods are very flexible. Like direct solvers, if two matrices A1 and
A2 have the same structure, the structural preprocessing needs to be done only
once. If there are multiple right-hand sides, Steps 1 through 3 can be skipped
after the first right-hand side.
For example, consider the following sparse matrix:

    A = [ 11   0   0  14   0 ]
        [  0  22  23  24  25 ]
        [  0  32  33   0   0 ]
        [ 41  42   0  44   0 ]
        [  0  52   0   0  55 ]                                       (4.1)
A would be stored in CSC format, as follows:
AMAT = ( 11 41 22 32 42 52 23 33 14 24 44 25 55 )
ROWIND = ( 1 4 2 3 4 5 2 3 1 2 4 2 5 )
COLSTR = ( 1 3 7 9 12 14 )
If only the structure of the lower triangle of A is stored (as for the symmetric
solvers), the index arrays become:

ROWIND = ( 1 4 2 3 4 5 3 4 5 )
COLSTR = ( 1 3 7 8 9 10 )
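For example, the product y = Ax can be formed directly from the three arrays
of the general CSC layout; the subroutine name below is illustrative:

      SUBROUTINE CSCMV(N, AMAT, ROWIND, COLSTR, X, Y)
C     y = A*x for an N-by-N matrix stored in CSC format
      INTEGER N, ROWIND(*), COLSTR(N+1), I, J, K
      REAL AMAT(*), X(N), Y(N)
      DO 10 I = 1, N
         Y(I) = 0.0
   10 CONTINUE
C     Column J contributes X(J) times each of its stored entries
      DO 30 J = 1, N
         DO 20 K = COLSTR(J), COLSTR(J+1) - 1
            Y(ROWIND(K)) = Y(ROWIND(K)) + AMAT(K) * X(J)
   20    CONTINUE
   30 CONTINUE
      RETURN
      END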
to apply and usually improves the convergence rate. However, when a large
variation exists in the magnitude of matrix coefficients, you should avoid
explicit scaling. Six types of scaling are available, as follows:
• Symmetric diagonal
• Symmetric row-sum
• Symmetric column-sum
• Left diagonal
• Left row-sum
• Right column-sum
The second type of preconditioning is implicit preconditioning, in which the
preconditioned linear system is never explicitly formed but the preconditioner
is applied at each iteration. Five types of implicit preconditioning are available,
as follows:
• Implicit diagonal scaling
• Incomplete Cholesky factorization
• Incomplete LU factorization
• Neumann polynomial preconditioning
• Least-squares polynomial preconditioning
You can use these tridiagonal solvers on periodic tridiagonal matrices by using
the Sherman-Morrison formulation. See Scientific Computing: An Introduction to
Parallel Computing, by Golub and Ortega, for details.
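For reference, a periodic tridiagonal matrix is a tridiagonal matrix plus a
rank-one correction uv^T, and the Sherman-Morrison formula reduces a solve
with the updated matrix to two solves with the tridiagonal matrix:

    (A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}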
[Figure 3. Cost/robustness: general symmetric sparse solvers. Axes: cost (low
to high) versus robustness (low to high). Methods shown: Diag/PCG,
IC(0)/PCG, IC(0)/GMR, IC(1)/PCG, SSPOTRF/S, SSGETRF/S.]
[Figure 4. Cost/robustness: general unsymmetric sparse solvers. Axes: cost
(low to high) versus robustness (low to high). Methods shown: Diag/CGS,
Diag/BCG, Diag/GMR(10), Diag/GMR(50), ILU(0)/CGS, ILU(0)/GMR(10),
ILU(1)/CGS, ILU(1)/GMR(10), ILU(1)/GMR(50), SSTSTRF/S, SSGETRF/S.]
4.4.4.2 Preconditioning
Preconditioned conjugate gradient (PCG) and related iterative methods are
usually not effective unless they are coupled with some preconditioning. The goal of
preconditioning is to improve the distribution of eigenvalues of the matrix A
(or, in the case of the normal equations, improve the distribution of singular
values) and thus improve the convergence rate of the iterative method. The
following preconditioners are listed in order of least expensive per iteration
(which also means least robust) to most expensive (and most robust):
• Scaling: Scaling is the simplest and cheapest preconditioner. Explicit scaling
often improves the performance of the iterative routine and can be used in
some form, even if some other implicit preconditioning is done. If nothing is
known about the matrix, symmetric diagonal scaling can be used if the
matrix is symmetric; left diagonal scaling can be used if the matrix is
non-symmetric (unless the matrix has a zero diagonal element). Row or
column sum scaling can be more effective in certain cases when there is a
large variation in the magnitude of values across the rows or columns of the
matrix.
• Implicit diagonal preconditioning: It is equivalent to explicit diagonal
scaling except in the case of CGN in which explicit diagonal scaling uses the
diagonal of A and implicit diagonal scaling uses the diagonal of AAT . This
preconditioning is available if you do not want to scale the matrix explicitly.
• Polynomial preconditioning: SITRSOL has the following types of
polynomial preconditioning:
– A truncated Neumann series expansion applies a truncated Neumann
  series approximation of the inverse of A to the linear system (see the
  expansion following this list).
– The second type uses a polynomial s, in which s is chosen so that s(A)A
is as close in norm to the identity as possible in a least-squares sense.
These preconditioners are well-suited to vector and parallel architectures
and are similar in performance to diagonal preconditioning. However, like
diagonal preconditioning, they are usually not very robust, especially if A is
not SPD. The least-squares approach is usually the best of the two
polynomial preconditioners. See the SIAM Journal of Scientific Statistical
Computing, by Saad, for details.
• Incomplete Factorization: The most robust forms of preconditioners are the
Incomplete Cholesky (IC) and Incomplete LU (ILU) factorizations. These
methods try to approximate the inverse of A by computing only certain
terms of the Cholesky or LU factorizations of A. Although these methods are
usually very robust and are often the only techniques to work on difficult
problems, they are also very expensive per iteration and do not perform
well on vector architectures. They also do not perform very well on parallel
architectures, except for machines that have a small number of processors.
For nonsymmetric problems, ILU is essential for many problems because
other types of preconditioners are not well understood in this case. IC is
66 004–2151–002
Using Sparse Linear Solvers [4]
often useful, especially for ill-conditioned matrices that come from structural
analysis.
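For reference, the truncated Neumann series preconditioner mentioned in the
list above is based on the following expansion, which converges when the
spectral radius of I − A is less than 1 (scaling is used to help arrange this):

    A^{-1} = \sum_{k=0}^{\infty} (I - A)^k \approx \sum_{k=0}^{m} (I - A)^k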
The NCPUS environment variable, the number of physical processors, and the
number of users on the system controls the number of processors that are
available to an application. The best number of processors to use depends on
several issues.
4.5.2.4 Save/restart
During a long computation, it can be useful to break the computation into
several phases. This allows reuse of early phases, minimizes the loss of
information in case of system failure, and lets you tune system resources for
each phase. Save and restart capabilities are provided to let you take a
"snapshot" of the computation between algorithmic phases. All necessary
information is stored in a user-defined binary file, which can be used at a later
time to resume computation.
All of the direct solvers (SSPOTRF, SSPOTRS, SSTSTRF, SSTSTRS, SSGETRF,
and SSGETRS) let you save a binary image after any of the three structural
preprocessing steps or after the numerical factorization step. To control this
process, use iparam(6) and iparam(7). SITRSOL lets you save a binary image
after the construction of the preconditioner or after the iterative phase; the
ipath, iparam(19), and iparam(20) arguments control the process. See Example
11, page 93, which shows the process for both direct and iterative solvers.
• Diagonal shifting for incomplete Cholesky: For ill-conditioned SPD, the only
preconditioner that works is incomplete Cholesky with diagonal shifting,
coupled with PCG. In this case, SITRSOL tries to compute the incomplete
Cholesky factors and, if it detects a negative diagonal, uses a simple iterative
technique to shift the diagonal values. After each shift, it restarts the
factorization. If shifting is required for a class of related problems, you can
usually eliminate restarting by experimenting with rparam(15):gammin and
setting it to a small positive value to allow the factorization to complete the
first time.
To increase the size of supernodes, allow some zero fill in the supernodes.
iparam(8) indicates the number of zero elements allowed in a supernode.
Although the default value is 0, some zero fill is usually desirable. A value
of iparam(8) = 256 is usually reasonable; however, if A is very sparse, a value
of 512 or 1024 may be appropriate.
The iparam(9) argument indicates the maximum percent of zero-fill in a
supernode and is set to 0 by default. A value of 100 is reasonable, which
allows the supernode to double in size with zero-fill. If A is very sparse, a
value of 200-500 may be appropriate.
• Frontal matrix grouping and parallel execution: When executing on multiple
processors, it can be useful to use two types of parallelism in the direct
factorization: parallel elimination of supernodes and parallel execution with
a frontal matrix. The first type comes from computing with independent
supernodes concurrently. In the beginning of the factorization, a large
number of supernodes usually can be processed in parallel. As the
elimination tree is traversed, however, the number of independent
supernodes decreases. At the same time, the size of the frontal matrices
grows. Thus, at some point, it is appropriate to switch and allow all
processors to work on one frontal matrix.
The iparam(10) and iparam(11) arguments from SSPOTRF and SSTSTRF
(SSGETRF does not have these arguments) let you control when the switch
occurs. Set both parameters to the same value. When the switch should
occur depends on the data. For large problems, however, a value of 10,000 is
reasonable. For smaller problems, a smaller value is better. See the
Proceedings of the Fifth SIAM Conference on Parallel Processing for Scientific
Computing, by Yang, for a full discussion of this topic.
• Threshold pivoting: SSGETRF can perform standard partial pivoting.
However, it is often possible to relax the pivoting requirements to improve
performance and still obtain good accuracy in the solution. SSGETRF uses
threshold pivoting to improve performance by letting you specify a
parameter θ, 0.0 ≤ θ ≤ 1.0, called a threshold value.
Let a*_j denote the maximum, max_{i=j,...,n} |a_ij|, of the possible pivots at
the jth step of the factorization. The pivot is then taken to be the first
value a_ij, j ≤ i ≤ n, such that |a_ij| ≥ θ·a*_j (that is, the first element
in the jth column with absolute value greater than or equal to the threshold
value times the maximum absolute value in the jth column). If θ = 0.0, no
pivoting is done; if θ = 1.0, standard pivoting is done.
As with other tuning parameters, a good value for θ depends on the actual
problems; the examples in this chapter use a threshold of 0.01.
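The following fragment is a minimal sketch of setting these tuning parameters
before calling the direct solvers. The specific values are illustrative, taken
from the guidelines above; the calling sequences are those used in the examples
later in this chapter, and IPARAM(1) = 1 is assumed to signal that defaults are
being overridden, as in program EX5B.

C.....Illustrative tuning of the supernode and frontal-matrix
C     parameters before a Cholesky factorization
IPARAM(1) = 1 ! override default values
IPARAM(8) = 256 ! zero elements allowed in a supernode
IPARAM(9) = 100 ! maximum percentage of zero fill
IPARAM(10) = 10000 ! frontal-matrix switch point
IPARAM(11) = 10000 ! set both parameters to the same value
IDO = 14 ! do all 4 phases of factorization
CALL SSPOTRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK, IPARAM, IERR )
C
C.....For the nonsymmetric solver, the pivoting threshold is a
C     separate argument
THRESH = 0.01
CALL SSGETRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK, IPARAM, THRESH, IERR )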
The following tables, Table 6 through Table 8, page 74, provide a summary of
the arguments and argument types used with SITRSOL. See the SITRSOL(3S)
man page for more details.
SUBROUTINE MATGEN ( NMAX, NEQNS, B, COLSTR, NZAMAX, NZA,
& ROWIND, AMAT )
C
C.....Arguments
INTEGER NMAX, NEQNS, NZAMAX, NZA
INTEGER COLSTR(NMAX+1), ROWIND(NZAMAX)
REAL B(NMAX), AMAT(NZAMAX)
C
C.....Local variables
INTEGER NEQNSL, NZAL
PARAMETER (NEQNSL = 5, NZAL = 9 )
INTEGER COLSTRL(NEQNSL+1), ROWINDL(NZAL)
REAL AMATL(NZAL)
C
C.....Define matrix via data statements
C
DATA (AMATL(I), I=1, NZAL) / 4.0, -1.0, -1.0, 4.0, -1.0, 4.0,
& 4.0, -1.0, 4.0 /
C
DATA (ROWINDL(I), I=1, NZAL ) / 1, 2, 4, 2, 3, 3, 4, 5, 5 /
C
DATA (COLSTRL(I), I=1, NEQNSL + 1) / 1, 4, 6, 7, 9, 10 /
C
C.....Define problem size
NEQNS = NEQNSL
NZA = NZAL
C
C.....Check if enough space
IF (NEQNS .GT. NMAX .OR. NZA .GT. NZAMAX) THEN
WRITE(*,*)’Not enough space.’
STOP
ENDIF
C
C.....Define matrix
DO 10 I = 1, NZA
AMAT(I) = AMATL(I)
ROWIND(I) = ROWINDL(I)
10 CONTINUE
C
DO 20 I = 1, NEQNS + 1
COLSTR(I) = COLSTRL(I)
20 CONTINUE
C
C.....Define b to be all 1’s
DO 30 I = 1, NEQNS
B(I) = 1.0
30 CONTINUE
C
C.....ALL DONE
RETURN
END
C ---------------------------
C Solve problem using SITRSOL
C ---------------------------
C
C.....Let the initial guess for x be random numbers between 0 and 1
DO 20 I = 1, NEQNS
X(I) = RANF()
20 CONTINUE
C
C.....Set default parameter values
CALL DFAULTS ( IPARAM, RPARAM )
C
C.....Select nonsymmetric, no explicit scaling, left implicit scaling
C preconditioning
IPARAM(1) = 0
IPARAM(7) = 0
IPARAM(9) = 1
IPARAM(10) = 1
C
C.....Call SITRSOL to solve the problem using CGS
IPATH = 2
METHOD = ’CGS’
CALL SITRSOL ( METHOD, IPATH, NEQNS, NEQNS, X, B, COLSTR, ROWIND,
& AMAT, LIWORK, IWORK, LWORK, WORK, IPARAM, RPARAM, IERR )
C
C ----------------------------------------
C Solve same problem using SSGETRF/SSGETRS
C ----------------------------------------
C
C.....use all default values
IPARAM(1) = 0
C
C.....do all 4 phases of factorization
IDO = 14
C
C.....threshold for pivoting
THRESH = 0.01
C
C.....compute factorization using SSGETRF
CALL SSGETRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK, IPARAM, THRESH, IERR )
C
C.....compute solution using SSGETRS
C
C.....solve standard way
IDO = 1
C.....solve for 1 RHS with leading dim = neqns
NRHS = 1
LDB = NEQNS
C
CALL SSGETRS ( IDO, LWORK, WORK, NRHS, B, LDB,
& IPARAM, IERR )
C
C
C ----------------------------------------
C Solve same problem using SSTSTRF/SSTSTRS
C ----------------------------------------
C
C.....use all default values
IPARAM(1) = 0
C
C.....do all 4 phases of factorization
IDO = 14
C
C.....compute factorization using SSTSTRF
CALL SSTSTRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK, IPARAM, IERR )
C
C.....compute solution using SSTSTRS
C
C.....solve standard way
IDO = 1
C.....solve for 1 RHS with leading dim = neqns
NRHS = 1
LDB = NEQNS
C
CALL SSTSTRS ( IDO, LWORK, WORK, NRHS, B1, LDB,
& IPARAM, IERR )
C
C -----------------
C Compare solutions
C -----------------
C
C.....Compute 2-norm of the difference between SITRSOL (array X),
C SSGETRF/S (in array B) and SSTSTRF/S solution (in array B1).
C
C.....compute differences
CALL SAXPY ( NEQNS, -1., X, 1, B, 1 )
SUBROUTINE MATGEN ( NMAX, NEQNS, B, COLSTR, NZAMAX, NZA,
& ROWIND, AMAT )
C
C.....Arguments
INTEGER NMAX, NEQNS, NZAMAX, NZA
INTEGER COLSTR(NMAX+1), ROWIND(NZAMAX)
REAL B(NMAX), AMAT(NZAMAX)
C
C.....Local variables
INTEGER NEQNSL, NZAL
PARAMETER (NEQNSL = 5, NZAL = 13 )
INTEGER COLSTRL(NEQNSL+1), ROWINDL(NZAL)
REAL AMATL(NZAL)
C
C.....Define matrix via data statements
C
DATA (AMATL(I), I=1, NZAL)/4.0, -2.0, -1.0, 0.0, 4.0, -2.0,
& 0.0, 4.0, -1.0, 4.0, -1.0, -1.0,
& 4.0 /
C
DATA (ROWINDL(I), I=1, NZAL )/1, 2, 4, 1, 2, 3, 2, 3,
& 1, 4, 5, 4, 5 /
C
DATA (COLSTRL(I), I=1, NEQNSL + 1) / 1, 4, 7, 9, 12, 14/
C
C.....Define problem size
NEQNS = NEQNSL
NZA = NZAL
C
C.....Check if enough space
IF (NEQNS .GT. NMAX .OR. NZA .GT. NZAMAX) THEN
WRITE(*,*)’Not enough space.’
STOP
ENDIF
C
C.....Define matrix
DO 10 I = 1, NZA
AMAT(I) = AMATL(I)
ROWIND(I) = ROWINDL(I)
10 CONTINUE
C
DO 20 I = 1, NEQNS + 1
COLSTR(I) = COLSTRL(I)
20 CONTINUE
C
C.....Define b to be all 1’s
DO 30 I = 1, NEQNS
B(I) = 1.0
30 CONTINUE
C
C.....ALL DONE
RETURN
END
100 CONTINUE
C
C ---------------------
C Define matrix and RHS
C ---------------------
CALL MATGEN ( JOB, NMAX, NEQNS, B, COLSTR, NZAMAX, NZA,
& ROWIND, AMAT )
C
C
C ---------------------------
C Solve problem using SITRSOL
C ---------------------------
C
C.....Let the initial guess for x be random numbers between 0 & 1
DO 20 I = 1, NEQNS
X(I) = RANF()
20 CONTINUE
C
C.....Set default parameter values
CALL DFAULTS ( IPARAM, RPARAM )
C
C.....Select no scaling and left least-squares preconditioning
IPARAM(7) = 0
IPARAM(9) = 1
IPARAM(10) = 5
C
C.....Call SITRSOL to solve the problem using PCG
METHOD = ’PCG’
C
C.....Accumulate time spent in solvers
TSTART = SECOND()
CALL SITRSOL(METHOD, IPATH, NEQNS, NEQNS, X, B, COLSTR, ROWIND,
& AMAT, LIWORK, IWORK, LWORK, WORK, IPARAM, RPARAM, IERR)
TTOTAL = SECOND() - TSTART
C
C ----------------------------------------
C Solve same problem using SSPOTRF/SSPOTRS
C ----------------------------------------
C
C.....use all default values
IPARAM(1) = 0
C
C.....compute factorization using SSPOTRF
TSTART = SECOND()
CALL SSPOTRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK1, IPARAM, IERR )
TTOTAL = TTOTAL + SECOND() - TSTART
C
C.....compute solution using SSPOTRS
C
C.....solve standard way
IDO = 1
C.....solve for 1 RHS with leading dim = neqns
NRHS = 1
LDB = NEQNS
C
TSTART = SECOND()
CALL SSPOTRS ( IDO, LWORK, WORK1, NRHS, B, LDB,
& IPARAM, IERR )
TTOTAL = TTOTAL + SECOND() - TSTART
C
C -----------------
C Compare solutions
C -----------------
C
C.....Compute two-norm of the difference between SITRSOL in array X
C and SSPOTRF/S solution in array B.
C
C.....compute differences
CALL SAXPY ( NEQNS, -1., B, 1, X, 1 )
C
C.....compute norms
ERR = SNRM2( NEQNS, X, 1 )
C
C.....print results
WRITE(6,12)ERR, TTOTAL
12 FORMAT (' Difference between SITRSOL and SSPOTRF/S = ',E15.8,/
& ' Total time to compute solutions = ',E15.8)
C
C.....Check if JOB = 1. If so, reset to 2 and recall solvers with
C new values and same structure.
IF ( JOB .EQ. 1) THEN
C
C.......Define variables for second call to solvers
JOB = 2
40 CONTINUE
C
C.....ALL DONE
RETURN
END
C
C
C ---------------------------
C Solve problem using SITRSOL
C ---------------------------
C
C.....Let the initial guess for x be random numbers between 0 and 1
DO 10 IRHS = 1, NRHS
DO 10 I = 1, NEQNS
X(I,IRHS) = RANF()
10 CONTINUE
C
C.....Set default parameter values
CALL DFAULTS ( IPARAM, RPARAM )
C
C.....Select no scaling and left least-squares preconditioning
IPARAM(7) = 0
IPARAM(9) = 1
IPARAM(10) = 5
C
C.....Call SITRSOL to solve the problem using PCG
IPATH = 2
METHOD = ’PCG’
CALL SITRSOL(METHOD, IPATH, NEQNS, NEQNS, X, B, COLSTR, ROWIND,
& AMAT, LIWORK, IWORK, LWORK, WORK, IPARAM, RPARAM, IERR)
C
C.....Solve for subsequent RHS
IPATH = 4
DO 20 IRHS = 2, NRHS
CALL SITRSOL(METHOD, IPATH, NEQNS, NEQNS, X(1,IRHS), B(1,IRHS),
& COLSTR, ROWIND, AMAT, LIWORK, IWORK, LWORK,
& WORK, IPARAM, RPARAM, IERR )
20 CONTINUE
C
C ----------------------------------------
C Solve same problem using SSPOTRF/SSPOTRS
C ----------------------------------------
C
C.....use all default values
IPARAM(1) = 0
C.....do all 4 phases of factorization
IDO = 14
C
C.....compute factorization using SSPOTRF
CALL SSPOTRF ( IDO, NEQNS, COLSTR, ROWIND, AMAT, LWORK,
& WORK, IPARAM, IERR )
C
C.....compute solution using SSPOTRS
C
C.....solve standard way
IDO = 1
C.....solve for all RHS with leading dim = neqns
LDB = NEQNS
C
CALL SSPOTRS ( IDO, LWORK, WORK, NRHS, B, LDB,
& IPARAM, IERR )
C
C -----------------
C Compare solutions
C -----------------
WRITE(6,11)
11 FORMAT (’***** Output from program: EX4 *****’)
C
C.....Compute two-norm of the difference between SITRSOL (array X)
C and SSPOTRF/S solution (in array B) for each RHS.
DO 30 IRHS = 1, NRHS
C
C.......compute differences
CALL SAXPY ( NEQNS, -1., B(1,IRHS), 1, X(1,IRHS), 1 )
C
C.......compute norms
ERR = SNRM2( NEQNS, X(1,IRHS), 1 )
C
C.......print results
WRITE(6,12)IRHS,ERR
12 FORMAT ('Difference between SITRSOL & SSPOTRF/S for sol #',
& i2, ' = ',E15.8)
30 CONTINUE
C.....all done
END
C
SUBROUTINE MATGEN ( NMAX, NEQNS, NRHS, B, COLSTR, NZAMAX, NZA,
& ROWIND, AMAT )
* The following routine MATGEN defines, in sparse column format,
* the matrix
*
* | 4 -1 0 -1 0 |
* | -1 4 -1 0 0 |
* A = | 0 -1 4 0 0 |
* | -1 0 0 4 -1 |
* | 0 0 0 -1 4 |
DO 10 I = 1, NZA
AMAT(I) = AMATL(I)
ROWIND(I) = ROWINDL(I)
10 CONTINUE
C
DO 20 I = 1, NEQNS + 1
COLSTR(I) = COLSTRL(I)
20 CONTINUE
C
C.....Define b to be all 1’s, 2’s, 3’s, ...
DO 30 IRHS = 1, NRHS
DO 30 I = 1, NEQNS
B(I,IRHS) = FLOAT(IRHS)
30 CONTINUE
C
C.....ALL DONE
RETURN
END
PROGRAM EX5A
C
C Purpose:
C Illustrates the use of SITRSOL and SSPOTRF/S to solve a simple
C sparse symmetric linear system with SAVE/RESTART files. This
C program calls SITRSOL to construct the preconditioner and SSPOTRF to
C compute the factorization. Both are saved to binary files for later
C use by the program EX5B.
C
PARAMETER (NMAX = 5, NZAMAX = 9)
PARAMETER (LIWORK = 350, LWORK = LIWORK )
INTEGER NEQNS, NZA, IPATH, IERR,
& ROWIND(NZAMAX), COLSTR(NMAX+1), IWORK(LIWORK), IPARAM(40)
REAL AMAT(NZAMAX), RPARAM(30), X(NMAX), B(NMAX),
& SOLN(NMAX), WORK(LWORK)
CHARACTER*3 METHOD
C
C ---------------------
C Define matrix and RHS
C ---------------------
CALL MATGEN (NMAX, NEQNS, B, COLSTR, NZAMAX, NZA, ROWIND, AMAT)
C
C
C --------------------------------
C Preprocess problem using SITRSOL
C --------------------------------
C
C.....Let the initial guess for x be random numbers between 0 and 1
DO 10 I = 1, NEQNS
X(I) = RANF()
10 CONTINUE
C
C.....Set default parameter values
CALL DFAULTS ( IPARAM, RPARAM )
C
C.....Select no scaling and left IC preconditioning
IPARAM(7) = 0
IPARAM(9) = 1
IPARAM(10) = 2
C
C.....Save preconditioner setup for later use
OPEN(1, FILE=’SITRSOL.SAV’, FORM=’UNFORMATTED’,STATUS=’NEW’)
IPARAM(19) = 1 ! Save after preconditioner setup
C
C.....Signal completion
WRITE(6,11)
11 FORMAT (’***** Output from program: EX5A *****’, /
& ’ Preconditioner and Factorization saved’)
C
END
C
SUBROUTINE MATGEN ( NMAX, NEQNS, B, COLSTR, NZAMAX, NZA,
& ROWIND, AMAT )
* The following routine MATGEN defines, in sparse column format,
* the matrix
*
* | 4 -1 0 -1 0 |
* | -1 4 -1 0 0 |
* A = | 0 -1 4 0 0 |
* | -1 0 0 4 -1 |
* | 0 0 0 -1 4 |
* where A is generated by using a five-point difference scheme for
* Poisson’s Equation with Dirichlet boundary conditions. The
* domain is a unit square with the upper right quarter removed and
* a grid spacing of 0.25. MATGEN also defines b, the right-hand-side.
C
C.....Arguments
INTEGER NMAX, NEQNS, NZAMAX, NZA
INTEGER COLSTR(NMAX+1), ROWIND(NZAMAX)
REAL B(NMAX), AMAT(NZAMAX)
C
C.....Local variables
INTEGER NEQNSL, NZAL
PARAMETER (NEQNSL = 5, NZAL = 9 )
INTEGER COLSTRL(NEQNSL+1), ROWINDL(NZAL)
REAL AMATL(NZAL)
C
C.....Define matrix via data statements
C
DATA (AMATL(I), I=1, NZAL) / 4.0, -1.0, -1.0, 4.0, -1.0, 4.0,
& 4.0, -1.0, 4.0 /
C
DATA (ROWINDL(I), I=1, NZAL ) / 1, 2, 4, 2, 3, 3, 4, 5, 5 /
C
DATA (COLSTRL(I), I=1, NEQNSL + 1) / 1, 4, 6, 7, 9, 10 /
C
PROGRAM EX5B
C
C Purpose:
C Illustrates the use of SITRSOL and SSPOTRF/S to solve a simple
C sparse symmetric linear system with SAVE/RESTART files. This
C program calls SITRSOL to solve a problem using the preconditioner
C computed in EX5A and calls SSPOTRS to solve the problem using
C the factorization computed in EX5A. Note that the original
C sparse matrix is not needed in this program, that is, AMAT,
C ROWIND and COLSTR are not defined.
C
PARAMETER (NMAX = 5, NZAMAX = 9)
PARAMETER (LIWORK = 350, LWORK = LIWORK )
INTEGER NEQNS, NZA, IPATH, IERR, IWORK(LIWORK), IPARAM(40)
REAL RPARAM(30), X(NMAX), B(NMAX), WORK(LWORK)
CHARACTER*3 METHOD
C
C ----------------------------------------
C Solve problem using SITRSOL with Restart
C ----------------------------------------
C
C.....Define problem dimension
NEQNS = NMAX
C
C.....Let the initial guess for x be random numbers between 0 and 1
C Define B to be all 1’s
DO 10 I = 1, NEQNS
X(I) = RANF()
B(I) = 1.0
10 CONTINUE
C
C.....Set default parameter values
CALL DFAULTS ( IPARAM, RPARAM )
C
C.....Select no scaling and left IC preconditioning
IPARAM(7) = 0
IPARAM(9) = 1
IPARAM(10) = 2
C
C.....Restore preconditioner setup from earlier call
OPEN(1, FILE=’SITRSOL.SAV’, FORM=’UNFORMATTED’,STATUS=’OLD’)
IPARAM(19) = 1 ! Start after preconditioner setup
IPARAM(20) = 1 ! Read from unit 1
C
C.....Call SITRSOL to solve the problem using PCG and restart data
IPATH = 5
METHOD = ’PCG’
CALL SITRSOL ( METHOD, IPATH, NEQNS, NEQNS, X, B, COLSTR, ROWIND,
& AMAT, LIWORK, IWORK, LWORK, WORK, IPARAM, RPARAM, IERR )
C
C.....Close file
CLOSE(1)
C
C ---------------------------------------------
C Solve same problem using SSPOTRS with Restart
C ---------------------------------------------
C
C.....Restore factorization from earlier computation
OPEN(1, FILE=’SSPOTRF.SAV’, FORM=’UNFORMATTED’,STATUS=’OLD’)
C
C.....Override default values
IPARAM( 1) = 1
IPARAM( 2) = 6 ! Output unit for messages
IPARAM( 3) = 0 ! Report only fatal errors
IPARAM( 4) = 0 ! Do not save adjacency structure
IPARAM( 5) = 1 ! This is a fresh start
IPARAM( 7) = 1 ! Read file from unit 1
C
C.....compute solution using SSPOTRS
C
C.....solve standard way
IDO = 1
C.....solve for 1 RHS with leading dim = neqns
NRHS = 1
LDB = NEQNS
C
CALL SSPOTRS ( IDO, LWORK, WORK, NRHS, B, LDB,
& IPARAM, IERR )
C
C -----------------
C Compare solutions
C -----------------
C
C.....Compute two-norm of the difference between SITRSOL
C (in array X) and SSPOTRF/S solution (in array B).
C
C.....compute differences
CALL SAXPY ( NEQNS, -1., B, 1, X, 1 )
C
C.....compute norms
ERR = SNRM2( NEQNS, X, 1 )
C
C.....print results
WRITE(6,11)ERR
11 FORMAT (’***** Output from program: EX5B *****’,/
& ’Difference between SITRSOL and SSPOTRF/S = ’,E15.8, )
C.....all done
END
Out-of-core Linear Algebra Software [5]
This section explains the basic use of the Scientific Library routines for
out-of-core computations in linear algebra. It gives an overview of the routines,
discusses the concept of virtual matrices, describes the types of subroutines
used with out-of-core routines, and also provides examples.
• They use virtual matrices, which are easy to create and use.
• They contain built-in detailed performance measurement capabilities which
can print automatically to give you complete information about software
and hardware performance.
• They contain tuning parameters that you can easily change to optimize the
software for specific problems and for various computing resources.
stdout. Also, do not use any unit number that your program is using for
another purpose.
To associate a particular file with a particular unit number, use the
assign-by-unit option of the assign(1) command. For example, suppose you
want to store a virtual matrix in a file named mydata in the directory
/tmp/xxx, using Fortran unit number 3 for the file. You could issue the
following command prior to executing your program:
assign -a /tmp/xxx/mydata u:3
Within the out-of-core subroutines, you would use the number 3 as the value of
the argument for the virtual matrix name.
You could use the following command to assign the file on unit 1 to SDS
(secondary data storage in the SSD solid-state storage device), with a subcode
of "scratch" (meaning discard the file at end of processing):
assign -F SDS.SCR u:1
See the assign(1) man page for general information on Fortran unit numbers.
matrix (using the virtual copy routines). If you want to use a virtual matrix as
input to some other program, write a program that uses the virtual copy
routines to get data from the virtual matrix, then write it out using the usual
Fortran I/O facilities. If only the virtual linear algebra routines use the data, it
is most convenient to work just with the virtual matrix files themselves, using
the subroutines provided.
When you define the value of a virtual matrix element, you are implicitly
creating file space for all elements up to the one you define. For example, if you
declare that a virtual matrix has a leading dimension of 5000, and you define a
value for element (1, 1000), the software creates a virtual matrix file large
enough to contain elements (i, j) for
1 ≤ i ≤ 5000
1 ≤ j ≤ 1000
which is 5 Mwords, or 40 Mbytes, of file space.
If you are working with a symmetric or a triangular virtual matrix, you can cut
the file size roughly in half by using packed storage mode.
To define this storage mode, call the VSTORAGE routine, which has the calling
sequence:
CALL VSTORAGE (unit, mode)
The unit argument is an integer that gives the unit number of the virtual
matrix, and mode is a character string that specifies the storage mode.
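For example, the following call declares the virtual matrix on unit 3 to be in
packed storage mode. The mode string shown here is illustrative; see the
VSTORAGE(3S) man page for the strings actually accepted.

CALL VSTORAGE ( 3, 'PACKED' )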
In the virtual copy routine names listed in this table, the letter S indicates single
precision (real), just as it does in the names for the BLAS routines. The numeral
2 indicates two-dimensional, because these routines copy matrices, as opposed
to vectors. The codes RV and VR indicate real-to-virtual or virtual-to-real (that
is, in-memory to virtual memory, or vice versa). The word real is used with two
different meanings: not complex and in-memory.
[Figure: the SCOPY2RV routine copies an in-memory matrix A into a virtual
matrix V with leading dimension LDV, defining elements V(i,j).]
The calling sequences of the virtual LAPACK and VBLAS routines are similar to
those of the corresponding LAPACK and BLAS routines, but where the
in-memory routines require an array argument, the virtual routines require an
argument that specifies a virtual matrix.
[Figure: the software layers. The virtual linear algebra routines call the
AQIO queuing routines and the BLAS, which run at the level of the UNICOS
operating system.]
Together, these out-of-core routines implement a paged virtual memory system
at the library level. It is highly efficient because the particular structure
of problems in linear algebra permits an intelligent paging strategy.
For more details about the I/O routines, see the Application Programmer’s I/O
Guide.
6. For each column of the solution matrix, call the SCOPY2VR(3S) routine to
fetch the solution vector and process it.
7. Call the VEND(3S) routine to terminate the virtual matrix routines.
8. Close the files.
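The following fragment is a minimal sketch of this workflow. The argument
orders shown for VBEGIN and SCOPY2VR are assumptions for illustration only; see
the VBEGIN(3S), SCOPY2VR(3S), and VEND(3S) man pages for the exact calling
sequences.

C.....Sketch of an out-of-core solve (argument orders are assumed)
INTEGER NWORK, NEQNS, NRHS, J
PARAMETER ( NWORK = 1000000, NEQNS = 5000, NRHS = 4 )
REAL WORK(NWORK), COL(NEQNS)
C.....initialize the virtual matrix routines with NWORK words of
C     page-buffer space
CALL VBEGIN ( WORK, NWORK )
C.....factor and solve here using the virtual LAPACK routines
C.....fetch each column of the solution from the virtual matrix on
C     unit 3 and process it
DO 10 J = 1, NRHS
CALL SCOPY2VR ( NEQNS, 1, 3, NEQNS, COL, NEQNS )
10 CONTINUE
C.....a nonzero argument to VEND prints the performance statistics
CALL VEND ( 1 )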
Variable Definition
NCPUS Specifies the number of physical processors that
are either on your system or which are available.
The default value for NCPUS is usually the
number of physical processors on the system. If
you specify NCPUS to be greater than the number
of physical processors available, you can create
unnecessary overhead. If you specify NCPUS to
be less than the number of available processors,
your tasks will not execute as efficiently.
MP_DEDICATED Determines the type of machine environment. If
set to 1, it indicates you are running an
application in a dedicated machine environment.
Slave processors wait in user space rather than
return to the operating system. If MP_DEDICATED
is set to 0 or is not set at all, slave processors
return to the operating system after waiting in
user space. When MP_DEDICATED is set to a
value other than 1 or 0, the behavior is
undefined. If you set MP_DEDICATED to 1 in a
nondedicated machine environment, you can
degrade system throughput.
VBLAS_PAGESIZE The dimension, in real words, of a page. If np
represents this value, the total number of
words on a page is np × np. For example, if np
equals 256 (the default value), a page is
256 × 256, or 65,536 words.
5.5.1 Multitasking
Like most routines in the Scientific Library, the virtual linear algebra routines
perform multitasking automatically. You can control the use of multitasking by
setting the value of the NCPUS environment variable to an integer that indicates
the number of processors you want to use. For example, the following C shell
command sets NCPUS equal to 1, indicating single-CPU execution (which
effectively inhibits multitasking):
setenv NCPUS 1
Likewise, the following command means that the software will try to use four
CPUs:
setenv NCPUS 4
Example:
*** Error in routine: VBEGIN
*** Insufficient memory was given;
minimum required (decimal words) = 198144
If a lower-level system or library routine diagnosed the error, the diagnostic will
include the error code. Usually, you can use the explain(1) command to obtain
more information about the error by typing one of the following commands:
explain sys-xxx
or
explain lib-xxx
The xxx argument represents the error code listed in the diagnostic. Use
explain sys for error status codes numbered less than 100, and explain
lib for higher-numbered codes.
For example, suppose that unit 1 was assigned to file /tmp/xxx/yyy/zzz, by
using the following command:
assign -a /tmp/xxx/yyy/zzz u:1
But suppose that the /tmp/xxx/yyy directory was not created. When the
VBLAS routine tries to create the file, it cannot, and it aborts after printing the
message:
*** Error in routine: page_request
*** Error status on AQOPEN for unit number: 1
*** Error status on AQOPEN = -2
Because AQIO routines are used internally for input and output, AQOPEN(3F),
AQREAD(3F), or AQWRITE(3F) usually detects the error. In this case, it was
AQOPEN. Of more concern, however, is the specific error status. The message
indicates that the error occurred on unit number 1, and that the error status
code was -2. You can type the explain sys-2 command, which prints a
further description that explains that one of the directories in a path name does
not exist. See the explain(1) man page for further information.
To print these statistics when the VEND(3S) routine is called, provide a
nonzero argument to the VEND routine or set the VBLAS_STATISTICS environment
variable.
The statistics reported include the following:
• Total elapsed time
• Total CPU time
• Total I/O wait time
• Total workspace used
• Number of words read and written
• A distribution of wait times
You can use this feature in addition to the usual UNICOS performance tools.
See Section 5.8, page 119, for a sample output listing of the statistics report.
The most important tuning parameter for the VBLAS routines is the value of
nwork, the amount of page-buffer space. This value is set either as an explicit
argument to VBEGIN(3S) or by setting the VBLAS_WORKSPACE environment
variable, prior to calling VBEGIN. (See Section 5.5, page 115, for a summary of
the UNICOS environment variables that are relevant to performance tuning of
the out-of-core routines.)
Note: The nwork argument does not usually affect CPU time; only I/O wait
time and total elapsed time (wall-clock time) are affected.
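For example, the following C shell command sets the page-buffer workspace
before the program calls VBEGIN (the value shown is illustrative):

setenv VBLAS_WORKSPACE 2000000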
• If the virtual matrix is on disk, using more buffer space increases I/O
performance (at least up to a point).
• If running in a dedicated environment, use as much available memory as
possible.
• If running in a production environment, use less memory so that you can
schedule and run the job at the same time that your other jobs are running.
The turnaround time of a smaller job might be much less than for a large
job, even if the I/O wait time for the smaller job is larger.
• A general guideline for optimal performance is to use enough buffer space
for one column of pages (that is, n × np words; np is the number of columns
per page, and n is the leading dimension of the matrix, rounded up to a
multiple of np). If you use as much as twice this memory, performance will
improve.
• If the virtual matrix is SSD resident, you need much less buffer space to
obtain good performance.
• The use of Strassen’s algorithm (see Section 5.3.6.1, page 111) usually speeds
the computation for a small increase in memory. The amount of memory
that Strassen’s algorithm uses is reported in the VBLAS statistics (see Section
5.8, page 119).
• Use packed storage mode (see Section 5.2.6, page 105) when appropriate
because it saves disk space with no penalty in CPU time.
Appendix A: libm Version 2 [A]
This appendix describes libm version 2, the default UNICOS Math Library.
[Figure: the machine numbers X0 through X5 on the number line from −∞ to
+∞. The exact result lies between X2 and X3, within 0.5 ULP of X2; adjacent
machine numbers are 1 ULP apart.]
Here the exact mathematical result, indicated by the vertical arrow, lies between
two machine numbers, X2 and X3. If a function algorithm for computing X has
an error of ≤ 1/2 ULP, X2 is returned. If the function has the more relaxed
tolerance of < 1 ULP, either X2 or X3 can be returned.
The 1 ULP error tolerance has some useful properties, as shown in the
following figure. Here the exact result, again shown by the vertical arrow,
happens to coincide with X2, and the two nearest machine numbers, X1 and X3,
are a distance of 1 ULP away.
[Figure: the exact result coincides with the machine number X2; the nearest
machine numbers, X1 and X3, are each 1 ULP away.]
In this case, given an error of < 1 ULP, only X2 can be returned. That is,
whenever the argument and the result of a function happen to be exactly
machine representable, it is the exact result that is returned. Also, if almost all
results have an error of ≤ 1/2 ULP, the results are unbiased (that is, they
do not contribute to an accumulation of roundoff errors that would eventually
skew the results either high or low).
Here are some examples of the type of exact results that can be expected from a
library that has an error < 1 ULP:
√100.00 = 10.0 (exactly; not 9.9999999…)
100.00^0.5 = 10.0
10.0^2.00 = 100.0
log10(100.0) = 2.0
e^0 = 1.0
x^1.0 = x (exactly, for any x)
(A.1)
Hart and Cheney. The second involved the use of a table lookup followed by a
correction, as done in the IBM library for their 370 and 3090 series machines.
Although both approaches produced accurate results for most functions, the
table lookup method was faster and simpler. Both methods required the use of
machine-dependent tricks in order to compute intermediate quantities to more
than single precision. Only the table lookup method is discussed here because
it was used most often.
[Figure: the table lookup method. The argument (sign, exponent, and
mantissa) is reduced and split: the upper bits form an index used as an
address into the lookup table, and the lower bits form a residual r used to
evaluate the correction polynomial P(r) = a_0 + a_1·r + a_2·r^2 + a_3·r^3 + ….]
This figure diagrams the table lookup method in its simplest form. The function
argument is reduced to a manageable range by using very careful argument
reduction techniques. The reduced argument is then split into two pieces, with
126 004–2151–002
Appendix A: libm Version 2 [A]
the upper (most significant) bits becoming an index into a lookup table, and the
lower bits being used to compute a correction factor.
All of this assumes that a relation of the form
F(x) = F(x_0 + dx) = T(x_0) + P(r) can be found, where r = r(x_0, dx).
[Figure, continued: a final summation combines the table value and the
polynomial correction into a result (sign, exponent, and mantissa) with 48
correct bits.]
The values in the lookup table can be computed in double precision, packed to
contain more than 48 bits of precision, and stored in a common block internal to
the library. This computation costs nothing at run time, because the values are
simply fetched from memory.
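The following self-contained sketch illustrates the technique with ln(x) on
[1, 2): the upper 7 bits of the fraction index a 128-entry table, and a short
series supplies the correction. The table size, the splitting, and the use of
the exact series (rather than packed extra-precision table values and minimax
coefficients) are illustrative simplifications, not the library's actual
implementation.

C.....Sketch of the table lookup method for ln(x) on [1,2)
PROGRAM TLOOK
REAL T(0:127), X, X0, R, Z, APPROX
INTEGER I
C.....build the table of ln(1 + i/128)
DO 10 I = 0, 127
T(I) = ALOG( 1.0 + REAL(I)/128.0 )
10 CONTINUE
X = 1.7321
C.....upper bits of the reduced argument index the table
I = INT( (X - 1.0)*128.0 )
X0 = 1.0 + REAL(I)/128.0
C.....lower bits form the residual and the series correction
R = X - X0
Z = R/X0
APPROX = T(I) + Z - Z*Z/2.0 + Z*Z*Z/3.0
WRITE(*,*) 'table lookup:', APPROX, '  ALOG:', ALOG(X)
END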
Appendix B: Math Algorithms [B]
This appendix describes the algorithms used by the functions in the Math
Library (libm). The “procedures” in this section detail the various steps used
in the algorithms.
Let x = 2^m × g, where m is the unbiased exponent and g is the mantissa. Then:

ln(x) = ln(2^m) + ln(g)
      = m·ln(2) + ln(g)
      = L_u + L_l
(B.1)
ln(g) = ln(g_0 + dg)
      = ln(g_0) + ln(1 + dg/g_0)
      = ln(g_0) + ln(1 + z)
(B.2)

where

z ≡ dg/g_0
(B.3)

and where the resolution of the lookup table was chosen such that 1/g_0 can
be computed to sufficient accuracy using the hardware reciprocal iteration.
Then, using the series expansion for ln(1 + z): if the entry is ln(x),

ln(1 + z) = z + p_2·z^2 + p_3·z^3 + p_4·z^4 + p_5·z^5 + p_6·z^6
(B.4)

If the entry is log(x),

log10(1 + z) = q_1·z + q_2·z^2 + q_3·z^3 + q_4·z^4 + q_5·z^5 + q_6·z^6
(B.5)
The coefficients p_i and q_i are computed from a minimax method especially
for this algorithm. Thus, Equation B.2 becomes

ln(g) = T_u + T_l + [z + z^2·P(z)]
      = S_u + S_l
(B.6)

where T_u + T_l = ln(g_0) to more than single precision, and it is taken from
the lookup table. For the entry log(x), the table value is T_u + T_l = log10(g_0).
The extra bits from T_l are packed into an unused portion of the exponent field
of T_u so that each element of the lookup table fits into one Cray 64-bit
machine word. The terms are combined carefully to avoid loss of precision
due to truncation and roundoff errors. The result, S_u + S_l, equals ln(g) to more
than single precision. The values of S_u and S_l are now substituted for ln(g)
in Equation B.1 to obtain the following:

ln(x) = m·ln(2) + ln(g) = m·ln(2) + ln(g_0 + dg)
      = m·ln(2) + S_u + S_l
      = L_u + L_l
(B.7)

The product m·ln(2) (or m·log10(2), if the entry is log(x)) is computed to more
than single precision. The result equals ln(x) to within 1 ULP (unit in last
place), that is, almost full single precision.
If the original value of x is close to 1, that is, |x − 1| ≤ 2^(−8), then the
preceding method is not used. Instead, z = x − 1 is set, and Equation B.4 and
Equation B.5 are used to obtain the following:

ln(x) ≈ ln(1 + z)
      = z + p_2·z^2 + p_3·z^3 + p_4·z^4 + p_5·z^5 + p_6·z^6   for ln
      = q_1·z + q_2·z^2 + q_3·z^3 + q_4·z^4 + q_5·z^5 + q_6·z^6   for log
      = L_u + L_l
(B.8)

The coefficients p_i and q_i are the same as in Equation B.4 and Equation B.5,
respectively. The limits on |x − 1| were chosen such that, for either method,
ln(x) is accurate to within 1 ULP, to obtain almost full single-precision
accuracy. In all cases, the final addition is done using software rounding to
obtain a correct single-precision result.
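In outline, the near-1 path is just a direct series summation. The following
sketch uses the exact Taylor coefficients of ln(1 + z); the library's p_2
through p_6 are minimax values, not these.

C.....Sketch of the near-1 evaluation of ln(x) (Equation B.8)
Z = X - 1.0
ALNX = Z + Z*Z*( -0.5 + Z*( 1.0/3.0 + Z*( -0.25 + Z*0.2 ) ) )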
B.1.1 Accuracy
Extensive testing with various sets of 10^5 random arguments shows that, in the
range 0 < x ≤ 2^5, the function ln(x) is 99.9% exact with a maximum error of
0.66 ULP. On the range 2^5 < x < 10^2466, the function is (apparently) 100% exact
with a maximum error of 0.50 ULP. For the function log(x), on the range
0 < x < 10^2466, the result is 99.9% exact with a maximum error of 0.63 ULP.
ρ ≡ 2f/(2 + f) = f/(1 + f/2)
(B.9)

Then the result is:

ln(x) ≈ ln(1 + f) = ln[(1 + ρ/2)/(1 − ρ/2)]
      = ρ + p_0·ρ^3 + p_1·ρ^5 + p_2·ρ^7
      = S_u + S_l
(B.10)

The coefficients p_i are chosen by a minimax approximation. The result is
S_u + S_l = ln(x) to more than working precision.
4. If the argument x is outside the range e^(−1/32) < x < e^(1/32), then a
table lookup method is used. Let the argument x = 2^m × g, where m is the
unbiased exponent, g is the mantissa, and 1 ≤ g < 2. Split g further as
g = F + f, where F ≡ 1 + i·2^(−7), i = 0, 1, …, 2^7, and F is taken from the
leading 7 bits of g. Then compute the following:
ln(x) = ln(2^m) + ln(F + f)
      = m·ln(2) + ln(F) + ln(1 + f/F)
(B.11)
5. The evaluation of m·ln(2) in Equation B.11 is done to more than working
precision by storing the constant ln(2) in two words, ln2_u + ln2_l = ln(2).
The upper constant ln2_u has enough trailing zeros that the product
m·ln2_u can be computed exactly in working precision for all values of m.

6. The evaluation of ln(F) in Equation B.11 is done using a table lookup
method:

ln(F) = ln(1 + i·2^(−7)) = T_u + T_l
(B.12)

where T_u + T_l = ln(F) to more than working precision. Each upper word of
the table, T_u, has enough trailing zeroes in its mantissa that the sum
m·ln2_u + T_u can be computed exactly in working precision for all values
of i and m.
7. The final term in Equation B.11, ln(1 + f/F), is computed by the same
power series method as in Procedure 2, step 3, page 132. Define

ρ ≡ 2f/(2F + f)
(B.13)

Then the term ln(1 + f/F) can be evaluated as:

ln(1 + f/F) ≈ ρ + p_0·ρ^3 + p_1·ρ^5 + p_2·ρ^7
            = P_u + P_l
(B.14)

The coefficients p_i are the same as in Procedure 2, step 3, page 132. Note
that in this case it is not necessary to split the leading term ρ as was done
in Equation B.9.
8. Define:

L_u ≡ m·ln2_u + T_u   (exact)
L_l ≡ m·ln2_l + T_l
(B.15)

Combining Equation B.11 with Equation B.12, Equation B.14, and Equation
B.15 gives the following:

ln(x) = m·ln(2) + ln(F) + ln(1 + f/F)
      = L_u + [P_u + (P_l + L_l)]
(B.16)
      = S_u + S_l
(B.17)

where S_u + S_l = ln(x) to more than working precision.
9. For the entry ALOG10(x), the result is found by computing

log(x) = ln(x)/ln(10)
       = (S_u + S_l) × (ln10inv_u + ln10inv_l)
(B.18)

using the results S_u and S_l from either Equation B.10 or Equation B.17. The
multiplication is done by splitting each working-precision number into two
pieces, a "head" and a "tail," and carefully forming the product.
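Schematically, if S = S_u + S_l is the computed ln(x) and R = R_u + R_l is the
two-word constant 1/ln(10), the product is assembled so that the high-order
product dominates and the small cross terms carry the correction. The variable
names below are illustrative:

C.....Schematic head/tail product for Equation B.18
PU = SU*RU ! product of the high-order pieces
PL = SU*RL + SL*RU + SL*RL ! lower-order cross terms
P = PU + PL ! carefully summed result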
B.2.1 Accuracy
Extensive testing with various sets of 10^5 random arguments shows that, in the
range 0 < x ≤ 1, the function ln(x) is around 99% exact with a maximum error
of 0.7 ULP. For the function log(x) on the range 0 < x ≤ 1, the result is
around 99% exact with a maximum error of 0.8 ULP. This algorithm is a
variation of the method described in P. T. P. Tang, "Table-driven implementation
of the logarithm function in IEEE floating-point arithmetic," ACM Transactions
on Mathematical Software, 1990.
asin(f) = f + f·P(t)
(B.20)

and the coefficients p_1 … p_12 are derived especially for this algorithm.

3. On the range f = [1/2, 1], use the following identity:

asin(f) = π/2 − 2·asin(√((1 − f)/2))
(B.22)

Perform argument reduction by letting t = (1 − f)/2, which is evaluated
carefully to avoid roundoff problems near f = 1 as:
t = [2 − (f + 1)]/2
(B.23)

Even though the value of t is exact, the square root must be done in
extended precision as:

Y = 2·√t = Y_u + Y_l
(B.24)

where Y_u + Y_l represents 2·√t to more than single precision. Likewise, the
constant π/2 is needed to more than single precision as:

π/2 = (π/2)_u + (π/2)_l
(B.25)

Equation B.22 for asin(f) can then be evaluated using the same
polynomial as in Procedure 3, step 2, page 135:

asin(f) = π/2 − 2·asin(√t)
        = (π/2)_u − Y_u − Y_u·P(t) − Y_l + (π/2)_l
(B.26)

where the terms are summed carefully to obtain full precision. The term
Y_l·P(t) can be neglected.
4. The sign of the result is the sign of the argument x. The principal value
of asin(x) satisfies −π/2 ≤ asin(x) ≤ +π/2 on the range −1 ≤ x ≤ +1.

Procedure 4: ACOS(x)

1. The arccosine is reduced to the computation of the arcsine by the identity:
acos(x) = π/2 − asin(x)
(B.28)

The argument range x = [−1, 1] is divided into four regions in order to
achieve the required 1 ULP accuracy.

2. On the range x = [0, 1/2], let f = |x| as in Procedure 3, step 1, page 135:

acos(x) = π/2 − asin(f)
(B.29)

3. On the range x = [−1/2, 0], let f = |x|:

acos(x) = π/2 + asin(f)
(B.30)

4. On the range x = [1/2, 1], use the following identity:

acos(x) = 2·asin(√((1 − x)/2))
(B.31)

with the argument reduction and square root done as in Procedure 3, step
3, page 135.

5. On the range x = [−1, −1/2], use the following identity:

acos(x) = π − 2·asin(√((1 + x)/2))
(B.32)

with the argument reduction and square root done as in Procedure 3, step
3, page 135. The constant π is needed to extended precision as π = π_u + π_l.
6. The sign of the result is always positive. The principal value of acos(x)
satisfies 0 ≤ acos(x) ≤ π on the argument range −1 ≤ x ≤ +1.

B.3.1 Accuracy

Extensive testing shows that the algorithms for ASIN(x) and ACOS(x) obtain the
correct result approximately 99.2% of the time, with the largest error less than
0.63 ULP. Testing was done using various sets of up to 250,000 random numbers
distributed linearly over the entire range of legal arguments, −1 ≤ x ≤ 1.
P(t) = p_1·t + p_2·t^2 + … + p_5·t^5
(B.33)

3. If 1/16 ≤ f < 16, compute atan(f) using a table lookup method, as follows.
Let f_0 be obtained from the exponent and uppermost 5 bits of the mantissa of
f. Note that atan(f_0) can be obtained from a lookup table:

T = atan(f_0) = T_u + T_l
(B.34)

where T_u and T_l represent atan(f_0) to more than single precision. Now use
the standard trigonometric identity

atan(f) = atan(f_0) + atan(δf)
(B.35)

where
δf = (f − f_0)/(1 + f_0·f)
(B.36)

and where δf and atan(δf) need not be computed to full precision. The size
of the lookup table was chosen so that the hardware division has sufficient
accuracy for Equation B.36.
atan(δf) is computed using the first three terms of the same power series as
in Procedure 5, step 2, page 138, namely atan(δf) = δf + δf·Q(t), where
t = (δf)^2 and Q(t) = p_1·t + p_2·t^2 + p_3·t^3, using the same p_1, p_2, and
p_3 as in Equation B.33. Equation B.35 then becomes

atan(f) = T_u + [T_l + atan(δf)]
(B.37)
B.4.1 Accuracy
This routine was tested on various sets of 250,000 random arguments in all three
ranges. It gives correctly rounded results approximately 99.9% of the time, with
no error greater than 1 ULP. The largest observed error was about 0.55 ULP.
[Figure: a point with coordinates x and y in the plane; θ is the angle from
the positive x-axis to the ray through the point, so that tan(θ) = y/x and
θ = atan2(y, x).]
P(t) = p_1·t + p_2·t^2 + … + p_5·t^5
(B.40)

T = atan(f_0) = T_u + T_l   (x ≥ 0)
(B.41)
(B.42)

Now use the standard trigonometric identity

atan(f) = atan(f_0) + atan(δf)
(B.43)

where

δf = (f − f_0 + f_l)/(1 + f_0·f)
(B.44)

with |δf| < 1/32, and where δf and atan(δf) need not be computed to full
precision. The size of the lookup table was chosen so that the hardware
division has sufficient accuracy for Equation B.44.
atan(δf) is computed using the first three terms of the same power series as
in Procedure 6, step 2, page 140, namely atan(δf) = δf + δf·Q(t), where
t = (δf)^2 and Q(t) = p_1·t + p_2·t^2 + p_3·t^3, using the same p_1, p_2, and
p_3 as in Equation B.40. Equation B.43 then becomes

atan(f) = T_u + [T_l + atan(δf)]
(B.45)
If f ≥ 16, the identity

atan2(y, x) = π/2 − atan(1/f)
(B.46)

is used, where the inverse 1/f can again be done to sufficient accuracy by
using the hardware reciprocal approximation unit. The atan(1/f) term is
computed by replacing f with 1/f and using the power series in Procedure 6,
step 2, page 140.
If x ≥ 0, then as y → ∞, the result returned is the truncated value of π/2, so
that the result stays in the first quadrant. If x < 0, then as y → ∞, the
result is the rounded value of π/2, so that the answer stays in the second
quadrant. Likewise, as y → 0 with x < 0, the result approaches the truncated
value of π, and it stays in the second quadrant.
5. A software rounded addition is done for the final addition in all the
preceding steps to produce a correctly rounded single-precision result. The
sign of the result is the same as the sign of the argument y.
B.5.1 Accuracy
This routine was tested on various sets of 250,000 random arguments in all
three ranges. It gives correctly rounded results approximately 99.6% of the time,
with no error greater than 1 ULP. The largest observed error was approximately
0.54 ULP.
cbrt(−x) = −cbrt(x).

Let k = int(n/3) and i = n − 3k, so that i = −2, −1, 0, 1, or 2. The result can
be written as:

cbrt(f) = cbrt(m) × 2^(i/3) × 2^k
(B.47)
(B.48)

The values of 2^(i/3) can be generated as precomputed constants, and the final
multiplication by 2^k can be done by adjusting the exponent.
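A sketch of this exponent bookkeeping, with TWOI3 holding the five precomputed
constants 2^(i/3) (the variable names are illustrative):

C.....Exponent handling for CBRT: N is the unbiased exponent of x
INTEGER N, K, I
REAL CBRTM, TWOI3(-2:2), Y
K = N/3 ! integer division truncates toward zero
I = N - 3*K ! i = -2, -1, 0, 1, or 2
Y = CBRTM*TWOI3(I)*2.0**K ! scale the mantissa cube root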
3. To find cbrt(m) in the range 0.5 ≤ m < 1, obtain an initial approximation by
the power series cbrt(m) ≈ p_0 + p_1·m + p_2·m^2 + … + p_5·m^5, where the
coefficients p_0 through p_5 are found by a minimax method. This gives an
approximation good to about six digits.
Now set y_0 = cbrt(f_0) using Equation B.48. Next, perform one Newton
iteration in single precision using:

y_1 = y_0 + (1/3)·(f/(y_0·y_0) − y_0)
(B.49)

4. Then perform an additional Newton iteration in pseudo-extended precision
by rewriting Equation B.49 slightly as:

y_2 = y_1 + (1/3)·del
(B.50)

where

del = (f − y_1·y_1·y_1)/(y_1·y_1)
(B.51)

y_2 = y_1 + (1/3)·del + ROUND
(B.52)
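The iterations themselves are only a few operations; a sketch, with Y0 the
polynomial estimate and F the reduced argument:

C.....Newton steps of Equations B.49 through B.51 for cbrt(F)
Y1 = Y0 + ( F/(Y0*Y0) - Y0 )/3.0
DEL = ( F - Y1*Y1*Y1 )/( Y1*Y1 )
Y2 = Y1 + DEL/3.0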
B.6.1 Accuracy
This algorithm gives the correctly rounded result 100% of the time for all
numbers tested, throughout the entire range of floating-point numbers.
follows.

2. Let the argument be decomposed as

x_n = x/ln(2)
(B.53)

r = x − [n + (i + 1/2)/128]·ln(2)
(B.54)

where |r| < ln(2)/256. Then the result will be

e^x = 2^n·T·e^r = 2^n·[T + T·(e^r − 1)]
(B.55)

where T = 2^((i+1/2)/128).
3. The reduced argument is computed as

r = (x − x_n·c_1) − x_n·c_2
(B.56)

where c_1 + c_2 = ln(2) to more than single precision. The constant c_1 is
chosen so that the product x_n·c_1 can be computed exactly (with no hardware
rounding) in a single word. The term x_n·c_1 is subtracted in two steps to
avoid loss of a bit when x is slightly less than a power of 2. The lower
term x_n·c_2 makes up for bits lost to cancellation in the first subtraction.
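A sketch of this reduction for x > 0, with J encoding both the power of two N
and the table index I; C1 and C2 are the two-word split of ln(2), and the
variable names are illustrative:

C.....Argument reduction of Equations B.53 through B.56
J = INT( 128.0*X/ALOG(2.0) )
XN = ( REAL(J) + 0.5 )/128.0
R = ( X - XN*C1 ) - XN*C2 ! two-step subtraction
N = J/128
I = J - 128*N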
4. In Equation B.55, the factor 2^((i+1/2)/128) can be stored in a 128-word
lookup table, indexed by i:

T = 2^((i+1/2)/128) = T_u + T_l
(B.57)

W = T_u + [T_l + T_u·(e^r − 1)]
(B.58)

e^x = 2^n·W
(B.59)
B.7.1 Accuracy
Extensive testing shows that this algorithm obtains the correct result
approximately 99.8% of the time, with the largest error less than 0.51 ULP.
Testing was done using various sets of 10^4 random numbers distributed linearly
and logarithmically over the entire range of legal arguments.
x^y function aborts, even if the remaining elements of the x and y vectors are
legal.
3. Let x = 2^m × g, where m is the unbiased exponent, g is the mantissa, and
1/2 ≤ g < 1. Then:

ln(x) = ln(2^m) + ln(g)
      = m·ln(2) + ln(g)
      = L_u + L_l
(B.60)

ln(g) = ln(g_0 + dg)
      = ln(g_0) + ln(1 + dg/g_0)
      = ln(g_0) + ln(1 + z)
(B.61)

where

z ≡ dg/g_0 = dg × (1/g_0)_u + dg × (1/g_0)_l = z_u + z_l
(B.62)

and where the reciprocal 1/g_0 is found from a lookup table computed to 15
bits more than single precision. Then, using the series expansion for
ln(1 + z):
ln(1 + z) = z_u + z_l + [p_2·z^2 + p_3·z^3 + p_4·z^4 + p_5·z^5]
(B.63)

Equation B.61 then becomes:

ln(g) = T_u + T_l + [z_u + z_l + z^2·P(z)]
      = S_u + S_l
(B.64)

where T_u + T_l = ln(g_0) to more than single precision, and it is taken from
the lookup table. The terms are combined carefully to avoid loss of precision
due to truncation and roundoff errors. The result, S_u + S_l, equals ln(g) to
more than single precision. The values of S_u and S_l are now substituted for
ln(g) in Equation B.60 to obtain the following:

ln(x) = m·ln(2) + ln(g)
      = m·ln(2) + ln(g_0 + dg)
      = m·ln(2) + S_u + S_l
      = L_u + L_l
(B.65)

The product m·ln(2) is computed to more than single precision. The result
equals ln(x) to 13 bits more than single precision.
If the original value of x is close to 1, that is:

|x − 1| ≤ 0.75 × 2^(−14)
(B.66)

then z = x − 1 is set and the series expansion is used directly:

ln(x) ≈ ln(1 + z)
      = z + [p_2·z^2 + p_3·z^3 + p_4·z^4 + p_5·z^5]
      = L_u + L_l
(B.67)

The limits on |x − 1| were chosen such that, for either method, ln(x) is
accurate to 13 bits more than single precision.
4. Now split y (the power in x^y) into two pieces, y = y_u + y_l, and compute
P = y × ln(x) in pseudo-double precision as

P = (y_u + y_l) × (L_u + L_l)
  = P_u + P_l
(B.68)

Then

x^y = e^(y·ln(x)) = e^P = e^(P_u + P_l)
(B.69)

The function e^(P_u + P_l) is computed by a method similar to that of EXP, and
it can make use of the same lookup table as the EXP library function.
The final expression for e^(P_u + P_l) becomes the following:

e^(P_u + P_l) = T_u + [T_l + T_u × f(r)]
(B.70)

where r is the reduced argument obtained within the EXP function, T is the
EXP table lookup value, and f(r) ≈ e^r − 1. The final addition is done using
software rounding to obtain a correct single-precision result.
B.8.1 Accuracy
Extensive testing with combinations of arguments x and y, varied such that
z = x^y returns values throughout the full range of floating-point numbers
[10^(−2466), 10^(+2466)], shows that this function returns the correctly
rounded result about 99% of the time, with the remainder of the results being
less than 1 ULP in error.
N = INT(|x|/(π/4))
(B.71)

which is the number of multiples of π/4 in |x|. Find the octant of the circle
in which the argument lies: let m = N (mod 8) = d·2^2 + c·2^1 + b, where:
b = low-order bit of m (2^0)
c = middle bit of m (2^1)
d = high-order bit of m (2^2)
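A sketch of this octant computation (the variable names are illustrative):

C.....Octant computation of Equation B.71
PI = 3.141592653589793
N = INT( ABS(X)/(PI/4.0) )
M = MOD( N, 8 )
IB = MOD( M, 2 ) ! bit b (2**0)
IC = MOD( M/2, 2 ) ! bit c (2**1)
ID = M/4 ! bit d (2**2)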
4. Add 1 to N if N is odd; that is, let N = N + b, and let x_n = FLOAT(N) to
convert N to a real. The reduced argument is f = |x| − x_n·(π/4), which is
evaluated in extended precision as

f = f_1 − x_n·c_3
(B.73)
where f_1 = (|x| − x_n·c_1) − x_n·c_2, and where c_1 + c_2 + c_3 = π/4 to more
than single precision; c_1 and c_2 are chosen such that x_n·c_1 and x_n·c_2 can
be computed exactly within a single 64-bit word (that is, without any rounding
by the multiply hardware).
5. Now compute Δ = dble(f) − sngl(f). This is the residual error in f: Δ is
the difference between a full double-precision evaluation of |x| − x_n·(π/4)
and the value of f from Equation B.73. Δ is adequately estimated using only
single precision as Δ = (f_1 − f) − x_n·c_3, and it is the part of x_n·c_3 that
did not contribute to f in Equation B.73.
6. If f < 0, then

f = −f = |f|
Δ = −Δ
(B.74)

Let tflag = a\b\c, where a, b, and c are the bit flags defined in Procedure 10,
step 1, page 151, and Procedure 10, step 3, page 151, and \ denotes the
Boolean XOR operation. If tflag = 0, compute sin(f) using either Method a or
Method b in Procedure 10, step 7, page 152. If tflag = 1, compute cos(f)
using either Method a or Method b in Procedure 10, step 7, page 152. If the
entry point was COSS, then compute sin(f) and cos(f) in parallel.
7. Compute sin(f) and cos(f) by one of two methods, depending on the
magnitude of f.
a. If f ≤ 1/16, compute sin(f) and cos(f) by minimax polynomials:

sin(f) = f + f·P(t)
(B.75)

cos(f) = 1 + Q(t)
(B.76)

where t = f^2 and
P(t) = p_1·t + p_2·t^2 + …
(B.77)

Q(t) = q_1·t + q_2·t^2 + …
(B.78)
Equation B.75 is evaluated by first replacing f with (f + Δ) and keeping
the first-order term, giving the result as sin(f) = f + [f·P(t) + Δ],
where the summation is done carefully to preserve full accuracy. The Δ
term is not needed for computing cos(f).
b. If 1/16 < f ≤ π/4, compute sin(f) or cos(f) using a table lookup method.
First split f into an upper and a lower part by letting f + Δ = f_0 + δf,
or δf = (f − f_0) + Δ, where f_0 is taken from the uppermost 6 bits of f.
Using the leading bits of f_0 as an index, look up the values of
sin(f_0) = S_u + S_l and cos(f_0) = C_u + C_l, where S_u and S_l represent
sin(f_0) to more than single precision, and likewise for C_u and C_l.
Compute a correction to the table value using standard trigonometric
identities:

ΔS = sin(f_0 + δf) − sin(f_0)
   = sin(f_0)·[cos(δf) − 1] + cos(f_0)·sin(δf)
(B.79)

ΔC = cos(f_0 + δf) − cos(f_0)
   = cos(f_0)·[cos(δf) − 1] − sin(f_0)·sin(δf)
(B.80)
In both expressions for ΔS and ΔC, the quantities sin(δf) and
[cos(δf) − 1] are approximated by the same power series as in Equation
B.75 and Equation B.76, but with f replaced by δf:

S' = sin(δf) = δf + δf·P((δf)^2)
(B.81)

C' = cos(δf) − 1 = Q((δf)^2)
(B.82)
and the polynomials are the same as in Equation B.77 and Equation
B.78. Then Equation B.79 and Equation B.80 become

ΔS = S_u·C' + C_u·S'
(B.83)

ΔC = C_u·C' − S_u·S'
(B.84)

To compute sin(f), evaluate sin(f) = S_u + [ΔS + S_l]. To compute cos(f),
evaluate cos(f) = C_u + [ΔC + C_l]. The final summations are done
carefully in order to preserve accuracy, and the final addition is done
using a software rounded add.
8. The sign of the result is SIGN=[#a&(d\e)]![a&(d\c)] where # & ! and
\ are Boolean NOT, AND, OR, and XOR operations respectively. The values a,
c, d, and e are the bit flags as defined in Procedure 10, step 1, page 151 and
Procedure 10, step 3, page 151. When SIGN=1, the result is negative. When
SIGN=0, the result is 0 or positive.
The result returned to the caller is
B.9.1 Accuracy
For various sets of 250,000 random arguments in the range [−π, π],
approximately 99.8% were correct. The largest error was around 0.75 ULP.
Similar accuracy holds throughout the legal range of arguments |x| < 2^25,
with no errors greater than 1 ULP found.
cosh(−f) = cosh(f)
(B.86)

sinh(−f) = −sinh(f)
(B.87)

2. On the range f = [0, 113/128]: if a COSH entry, evaluate
cosh(f) = 1 + Q(t); if a SINH entry, evaluate sinh(f) = f + f·P(t), where
t = f^2 and Q(t) and P(t) are minimax polynomials of the following form:

Q(t) = q_1·t + q_2·t^2 + q_3·t^3 + q_4·t^4 + q_5·t^5 + q_6·t^6 + q_7·t^7
(B.88)

P(t) = p_1·t + p_2·t^2 + p_3·t^3 + p_4·t^4 + p_5·t^5 + p_6·t^6 + p_7·t^7
(B.89)

where the coefficients q_i and p_i are derived especially for Cray PVP
floating-point hardware.
3. On the argument range f = [113/128, 5677.5686], evaluate cosh(f) and
sinh(f) using the e^f function. If a COSH entry, let

cosh(f) = (1/2)·(e^f + 1/e^f)
        = (1/2)·(E_u + E_l + 1/E_u)
(B.90)

If a SINH entry, let

sinh(f) = (1/2)·(e^f − 1/e^f)
        = (1/2)·(E_u + E_l − 1/E_u)
(B.91)

where E_u + E_l represents e^f to more than single precision.
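In outline, once E_u + E_l = e^f is available from the EXP routine, both results
are a few operations (a sketch; the variable names are illustrative):

C.....Equations B.90 and B.91, given EU + EL = exp(f)
COSHF = 0.5*( EU + EL + 1.0/EU )
SINHF = 0.5*( EU + EL - 1.0/EU )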
B.10.1 Accuracy
The SINH(x) and COSH(x) routines were tested on various sets of 250,000
random arguments over the entire range of legal arguments. They give
correctly rounded results approximately 97.0% of the time, with no error greater
than 1 ULP. The largest observed error was about 0.78 ULP.
√x = √(2^(2n) · 2^b · g)
(B.92)
   = 2^n·√g   (b = 0)
(B.93)
   = 2^n·√2·√g   (b = 1)
(B.94)
2. Find an initial estimate f_0 of √g by using the King and Phillips
approximation formula (see King and Phillips, The Logarithmic Error and
Newton's Method for the Square Root), a low-order rational approximation to
√g:

f_0 ≈ √g
(B.95)

If b = 1, multiply the f_0 from Equation B.95 by √2. Then find the second
estimate, f_1, by a Newton-Raphson iteration:

f_1 = (1/2)·(f_0 + f/f_0)
(B.96)

where only the hardware half-precision reciprocal is needed. Find the third
estimate, f_2, by another Newton-Raphson iteration:

f_2 = (1/2)·(f_1 + f/f_1)
(B.97)

A final correction step completes the result:

f_3 = f_2 + (f − f_2^2)/(2·f_2)
(B.98)
B.11.1 Accuracy
This routine was tested on various sets of 250,000 random arguments over the
entire range of floating-point numbers. Although testing was not exhaustive,
the algorithm appears to give correctly rounded results 100% of the time, with
no error greater than 0.50 ULP.
N = INT(|x|/(π/4))
(B.99)
f = f_1 − x_n·c_3
(B.101)

where f_1 = (|x| − x_n·c_1) − x_n·c_2, and where c_1 + c_2 + c_3 = π/4 to more
than single precision; c_1 and c_2 are chosen such that x_n·c_1 and x_n·c_2 can
be computed exactly within a single 64-bit word (that is, without any rounding
by the multiply hardware).

5. Now compute Δ = dble(f) − sngl(f). This is the residual error in f: Δ is
the difference between a full double-precision evaluation of |x| − x_n·(π/4)
and the value of f from Equation B.101. Δ is adequately estimated using only
single precision as Δ = (f_1 − f) − x_n·c_3, and it is the part of x_n·c_3 that
did not contribute to f in Equation B.101.
6. If f < 0, then

f = −f = |f|
Δ = −Δ
(B.102)

Let tflag = a\b\c, where a, b, and c are the bit flags defined in Procedure 13,
step 1, page 158, and Procedure 13, step 3, page 158. The \ symbol denotes
the Boolean XOR operation. If tflag = 0, compute tan(f) using either Method A
or Method B in Procedure 13, step 7, page 159. If tflag = 1, compute cot(f)
using either Method A or Method B in Procedure 13, step 7, page 159.
7. Compute tan(f) and cot(f) by one of two methods, depending on the
magnitude of f.
a. If f ≤ 1/16, compute tan(f) and cot(f) by minimax polynomials:

tan(f) = f + f·P(t)
(B.103)

cot(f) = 1/f + f·Q(t)
(B.104)

where t = f^2 and
P(t) = p_1·t + p_2·t^2 + …
(B.105)

Q(t) = q_1·t + q_2·t^2 + …
(B.106)
Equation B.103 is evaluated by first replacing f with (f + Δ) and keeping
the first-order term, giving the result as tan(f) = f + [f·P(t) + Δ],
where the summation is done carefully to preserve full accuracy.
When computing cot(f) for Equation B.104, replace f with (f + Δ) and
keep terms to first order in Δ:

1/f → 1/(f + Δ) ≈ (1/f)·(1 − Δ/f)
    = (1/f)_u + (1/f)_l − Δ/f^2
(B.107)

cot(f) = (1/f)_u + f·Q(t) + (1/f)_l − Δ/f^2
(B.108)
ΔT = tan(f_0 + δf) − tan(f_0)
   = [1 + T_u^2]·tan(δf) / [1 − T_u·tan(δf)]
(B.109)

where terms in C_l have been neglected. The size of the lookup tables
was chosen such that, in Equation B.109 and Equation B.110, the
hardware division has sufficient accuracy. In both expressions for ΔT
and ΔC, the quantity tan(δf) is approximated by a shorter version of
the same power series as in Equation B.103, but with f replaced by δf:

T' = tan(δf) = δf + p_1·(δf)^3
(B.111)
B.12.1 Accuracy
For various sets of 250,000 random arguments in the range [0, π],
approximately 99.4% were correct. The largest error was around 0.75 ULP.
Similar accuracy holds throughout the legal range of arguments |x| < 2^25,
with no errors greater than 1 ULP found.
tanh(−f) = −tanh(f)
(B.112)

2. On the range f = [0, 1/16], evaluate tanh(f) by a minimax polynomial:

tanh(f) = f + f·P(t),   t = f^2
(B.113)

where the coefficients p_i are derived especially for Cray PVP floating-point
hardware.

3. On the argument range f = [1/16, 17.5], use a table lookup method based on
the following mathematical identity:

tanh(a + b) = [tanh(a) + tanh(b)] / [1 + tanh(a)·tanh(b)]
(B.114)

            = tanh(a) + [1 − tanh^2(a)]·tanh(b) / [1 + tanh(a)·tanh(b)]
(B.115)

Let f_0 be obtained from the exponent and uppermost 9 bits of f, and define
the difference δf ≡ f − f_0. Substituting f_0 and δf for a and b in Equation
B.115 results in the following:
tanh(f) = tanh(f₀ + Δf)    (B.116)

= tanh(f₀) + [1 − tanh²(f₀)] tanh(Δf) / [1 + tanh(f₀) tanh(Δf)]    (B.117)

= T_u + ( T_l + [1 − T_u²] tanh(Δf) / [1 + T_u tanh(Δf)] )    (B.118)

where T_u and T_l are the upper and lower parts of the tabulated value of
tanh(f₀).
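A minimal C sketch of this branch follows, with the split (T_u, T_l) table
collapsed to a single double and tanh() standing in for both the stored table
and the short series for tanh(Δf). The spacing TBL_STEP is an assumption,
since the library actually indexes on the exponent and top 9 bits of f.

#include <math.h>

/* Sketch of the tanh table lookup branch, Equations B.115-B.118,
 * for 1/16 <= f <= 17.5.  Tl is treated as zero in this
 * double-only sketch. */
static double tanh_lookup(double f)
{
    const double TBL_STEP = 1.0 / 32.0;     /* hypothetical spacing    */
    double f0 = TBL_STEP * floor(f / TBL_STEP);
    double Tu = tanh(f0);                   /* stands in for the table */
    double df = f - f0;
    double td = tanh(df);         /* the library uses a short series   */

    /* Equation B.118 with Tl = 0. */
    return Tu + (1.0 - Tu * Tu) * td / (1.0 + Tu * td);
}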
B.13.1 Accuracy
For various sets of 250,000 random arguments in the range [−17.5, 17.5],
approximately 98.5% were correct. Similar accuracy holds throughout the legal
range of arguments |x| < ∞, with the largest error around 0.97 ULP.
Glossary
BCG
See Bi-Conjugate Gradient Method.
BLAS
See Basic Linear Algebra Subprogram.
CGN
See Conjugate Gradient Method.
CGS
See Bi-Conjugate Gradient Squared Method.
computational routines
Term for the LAPACK routines that each perform a distinct computational task.
dedicated environment
A parallel processing environment in which the NCPUS environment variable is
equal to the number of available processors. This ensures that the number of
processors specified by NCPUS is available at all times.
driver routines
Term for the LAPACK routines used for solving standard types of problems.
equilibration
The process of scaling a problem before computing its solution.
Fourier analysis
The mathematical process of resolving a given function, f(x), into its
frequency components; that is, finding the sequence of constant amplitudes
that, substituted into a Fourier series, reconstruct the original function.
GCR
See Orthomin/Generalized Conjugate Residual Method.
GMR/GMRES
See Generalized Minimum Residual Method.
Hermitian matrix
A complex matrix that is equal to its conjugate transpose; either the lower
or the upper triangle is stored.
LAPACK
A public domain library of subroutines for solving dense linear algebra
problems, including systems of linear equations, linear least squares problems,
eigenvalue problems, and singular value problems. It has been designed for
efficiency on high-performance computers.
linear system
A set of simultaneous linear algebraic equations.
load balancing
The process of dividing the work among the available processors so that each
processor does an approximately equal amount.
LPVP
See Large Parallel/Vector Problem.
multiuser environment
A parallel processing environment in which users do not know how many
processors will be available to a job during run time, except that the number
will be less than or equal to NCPUS.
OMN
See Orthomin/Generalized Conjugate Residual Method.
out-of-core technique
A term that refers to algorithms that combine input and output with
computation to solve problems in which the data resides on disk or some other
secondary random-access storage device.
packed storage
A storage scheme in which half of a matrix (triangular or symmetric) is
stored on disk or SSD.
PCG
See Preconditioned Conjugate Gradient Method.
pipelining
A method of execution that allows each step of an operation to pass its
result to the next step after only one clock period.
sparse matrix
A linear system can be described as Ax = b, where A is an n-by-n matrix, and
x and b are n-dimensional vectors. A system of this kind is considered sparse
if the matrix A has a small percentage of nonzero terms (less than 10%, often
less than 1%).
SPD
See Symmetric Positive Definite Matrix.
SPVP
See Small Parallel/Vector Problem.
Strassen’s algorithm
A recursive algorithm that is slightly faster than the ordinary inner product
algorithm. Strassen’s algorithm performs the floating-point operations for
matrix multiplication in a different order than the vector method does; this
can cause round-off problems.
supernodes
A collection of columns that have the same nonzero pattern.
time slicing
A method of execution in which the system works on several jobs or processes
concurrently, giving each a slice of processor time in turn.
vector problem
A class of problem size in which problems are large enough for vector
processing, but too small for parallel processing.
vectorization
A form of parallel processing that uses instruction segmenting and vector
registers.
virtual matrices
A virtual matrix is similar to a Fortran array, but it cannot be accessed
directly from a program; it can be accessed only through calls to specific
subroutines. Users do not do any explicit input or output to read from or
write to a virtual matrix.
VP
See Vector Problem.
well-conditioned matrix
The condition number of a matrix is defined as κ(A) = ‖A‖ · ‖A⁻¹‖. A
well-conditioned matrix is one for which κ(A) is small. Although "small" is
relative, if κ(A) < 10³, A can be considered well-conditioned.
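For example, for the diagonal matrix A = diag(1, 10⁻⁴), ‖A‖ = 1 and
‖A⁻¹‖ = 10⁴ in the 2-norm, so κ(A) = 10⁴ > 10³, and A would not be considered
well-conditioned by this rule of thumb.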
Index
1 ULP
    examples, 125

A
acos(x), 135
    accuracy, 138
aggressive optimization
    with CF90 compiler, 5
alog(x), 132
    accuracy, 134
Amdahl's Law for multitasking, 9
AQIO routines, 103
asin(x), 135
    accuracy, 138
asynchronous queued I/O routines, 103
atan(x), 138
    accuracy, 139
atan(y,x), 139
    accuracy, 143
Autotasking, 5

B
banded matrix, 54
Basic Linear Algebra Subprograms, 25
BCG, 59
Bi-Conjugate Gradient Method, 59
Bi-Conjugate Gradient Squared Method, 59
binary unblocked file, 103
BLAS, 25

C
cbrt(x), 143
    accuracy, 145
CGN, 59
CGS, 59
chaining, 3
computational routines, 27
computing a simple bound, 37
condition estimation, 36
condition number, 36
Conjugate Gradient Method, 59
cos(x), 151
    accuracy, 154
cosh(x), 155
    accuracy, 156
cot(x), 158
    accuracy, 161

D
data structures
    and sparse matrices, 57
dedicated environment
    parallel processing strategies, 21
dedicated parallel processing environment
    characteristics, 13
dedicated work environment, 13
diagonally dominant matrix, 54
direct general purpose sparse solvers, 67
direct solver tuning issues, 70
direct solver tuning parameters
    frontal matrix grouping, 71
    supernode augmentation, 70
    threshold pivoting, 71
direct solvers, 58
driver routines, 27, 34

E
e^x, 145
    accuracy, 147
EISPACK, 25
environment variables, 12
    MP_DEDICATED, 12
    MP_HOLDTIME, 12
    NCPUS, 12
    suggested settings, 12
    tuning, 12
    UNICOS, 115
environments, 11
equilibration, 39
error bounds, 37
error bounds computations, 43
error codes, 33
error conditions, 33
error reporting, 116
examples
    creating a virtual matrix, 113
    error conditions, 33
    LU factorization, 30
    multiplying a virtual matrix, 113
    orthogonal factorization, 46
    out-of-core technique, 101
    protocol usage, 114
    roundoff errors, 37
    single/multitasked performance, 15
    sparse solvers
        general symmetric positive definite, 75
        general unsymmetric, 79
        multiple right-hand side, 89
        reuse of structure, 84
        save/restart, 93
    symmetric indefinite matrix factorization, 31
explicit form, 29

F
factorization forms, 29

G
GCR, 59
general patterned sparse systems, 63
Generalized Minimum Residual Method, 59
GMR, 59
GMRES, 59
guidelines
    choosing a solver
        based on problem type, 64
    general patterned sparse linear systems, 63
    iterative methods, 65
    preconditioning, 65
    tridiagonal systems, 62

H
Hermitian matrix, 105
highly off-diagonally-dominant 3D problems, 62
Hilbert matrix, 41
Householder transformation, 47

I
I/O subsystems, 3
ILAENV, 26
ill-conditioned problems, 64
implicit diagonal preconditioning, 66
inverse of dense matrix, 44
iterative methods, 59
iterative refinement, 41
iterative solvers, 59

L

M
matrix characteristics
    Non-Symmetric Definite, 65
    Non-Symmetric Indefinite, 65
    Symmetric Indefinite, 65
    Symmetric Positive Definite (SPD), 65
matrix inversion, 44
medium parallel/vector problem (MPVP)
    problem size, 20
memory usage guidelines, 118
microtasking, 5
MP_DEDICATED environment variable, 12
MP_HOLDTIME environment variable, 12
MPVP
    definition, 20
    in multiuser environments, 22
multiprocessing, 4
multiprogramming, 4
multitasking, 5
    Amdahl's Law, 9
    and vectorization, 7
    efficiency, 10
    overview, 3
    speedup ratio, 7
    user code, 5
    when used, 15
multitasking variables, 116
multiuser environment, 14
    parallel processing strategies, 21
multiuser environments
    and SPVP and MPVP problems, 22
multiuser parallel processing environment
    characteristics, 14

N
NCPUS environment variable, 7, 12
Non-Symmetric Definite matrix, 65
Non-Symmetric Indefinite matrix, 65
numerical methods, 125

O
OMN, 59
orthogonal factorizations, 45
orthogonal matrix
    generating, 49
    multiplying by, 48
Orthomin/Generalized Conjugate Residual Method, 59
out-of-core routines, 101
    features, 101
    subroutines, 106
        complex routines, 106
        initialization and termination, 109
        lower-level routines, 112
        routine summary, 107
        virtual BLAS routines, 111
        virtual copy, 109
        virtual LAPACK routines, 110
out-of-core technique, 101
overdetermined linear system, 26

P
packed storage mode, 105
    full type, 105
    lower type, 105
    upper type, 105
page size, 106
page-buffer space, 118
parallel instruction execution, 3
parallel processing
    and vectorization, 7
    benefits, 6
    calculating program speedup, 8
    costs/benefits discussion, 5
    efficiency, 10
    overhead, 6
    overview, 3
    speedup ratio, 7
    user code, 5