Exploiting Zeros on the Diagonal in the Direct Solution of Indefinite Sparse Symmetric Linear Systems
We describe the design of a new code for the solution of sparse indefinite symmetric linear
systems of equations. The principal difference between this new code and earlier work lies in
the exploitation of the additional sparsity available when the matrix has a significant number
of zero diagonal entries. Other new features have been included to enhance the execution
speed, particularly on vector and parallel machines.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Alge-
bra—linear systems (direct methods); sparse and very large systems
General Terms: Algorithms, Performance
Additional Key Words and Phrases: sparse, 2 × 2 pivots, augmented systems, BLAS, Gaussian
elimination, indefinite symmetric matrices, zero diagonal entries
1. INTRODUCTION
This article describes the design of a collection of Fortran subroutines for
the direct solution of sparse symmetric sets of n linear equations
$$Ax = b. \qquad (1.1)$$

One important source of systems with zero diagonal entries is the linear least-squares problem

$$\min_x \; \|Bx - c\|_2 \qquad (1.2)$$

subject to

$$Cx = d. \qquad (1.3)$$
ACM Transactions on Mathematical Software, Vol. 22, No. 2, June 1996, Pages 227–257.
228 • I. S. Duff and J. K. Reid
This problem may be solved through the augmented system

$$\begin{pmatrix} I & 0 & B \\ 0 & 0 & C \\ B^T & C^T & 0 \end{pmatrix}
\begin{pmatrix} r \\ \lambda \\ x \end{pmatrix}
= \begin{pmatrix} c \\ d \\ 0 \end{pmatrix}, \qquad (1.4)$$

where $r = c - Bx$ is the residual and $\lambda$ holds the Lagrange multipliers for the constraints.
Another source is the quadratic programming problem

$$\min_x \; \tfrac{1}{2}\, x^T H x + c^T x \qquad (1.5)$$

subject to (1.3), which leads to the system

$$\begin{pmatrix} H & C^T \\ C & 0 \end{pmatrix}
\begin{pmatrix} x \\ \lambda \end{pmatrix}
= \begin{pmatrix} -c \\ d \end{pmatrix}. \qquad (1.6)$$
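As a consistency check on (1.6) (a standard derivation, added here for clarity, not part of the original text): the Lagrangian of (1.5) subject to (1.3) is stationary exactly when the saddle-point system holds:

```latex
\mathcal{L}(x,\lambda) = \tfrac{1}{2}\,x^T H x + c^T x + \lambda^T (C x - d),
\qquad
\nabla_{x}\mathcal{L} = H x + C^T \lambda + c = 0,
\qquad
\nabla_{\lambda}\mathcal{L} = C x - d = 0,
```

i.e., $Hx + C^T\lambda = -c$ and $Cx = d$, which are precisely the two block rows of (1.6).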
Our earlier Harwell Subroutine Library code MA27 [Duff and Reid 1982;
1983] uses a multifrontal solution technique and is unusual in being able to
handle indefinite matrices. It has a preliminary analysis phase that
chooses a tentative pivot sequence from the sparsity pattern alone, assum-
ing that the matrix is definite so that all the diagonal entries are nonzero
and suitable as 1 × 1 pivots. For the indefinite case, this tentative pivot
sequence is modified in the factorization phase to maintain stability by
delaying the use of a pivot if it is too small or by replacing two pivots by a
2 × 2 block pivot [Bunch and Parlett 1971].
The assumption that all the diagonal entries are nonzero is clearly
violated in the preceding examples. For such problems, the fill-in during
the factorization phase of MA27 can be significantly greater than predicted
by the analysis phase. Duff et al. [1991] found that the use of 2 × 2 pivots
with zeros on the diagonal alleviated this problem and assisted the preser-
vation of sparsity during the analysis phase. Our new code, MA47, is based
upon this work and, like MA27, uses a multifrontal method. It will work for
the definite case, but there are many opportunities for simplifications and
efficiency improvements, so we plan to provide a separate code for this
special case.
The factorization used has the form
$$A = PLDL^T P^T \qquad (1.7)$$
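The 2 × 2 block pivots that make a factorization of form (1.7) possible for indefinite matrices can be illustrated on a tiny dense example. The sketch below (our own illustration in Python/NumPy, not the MA47 Fortran) factorizes a symmetric matrix whose leading diagonal entries are zero by taking the leading 2 × 2 oxo block as the pivot:

```python
import numpy as np

# Hypothetical 3x3 example: the (1,1) entry is zero, so no 1x1 pivot is
# usable there, but the oxo block [[0,1],[1,0]] serves as a 2x2 pivot.
A = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 3., 1.]])
P2 = A[:2, :2]                        # the 2x2 block pivot
l = A[2, :2] @ np.linalg.inv(P2)      # multiplier row
s = A[2, 2] - l @ A[:2, 2]            # Schur complement (a scalar here)
L = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [l[0], l[1], 1.]])
D = np.zeros((3, 3))
D[:2, :2] = P2                        # 2x2 block on the diagonal of D
D[2, 2] = s
assert np.allclose(L @ D @ L.T, A)    # A = L D L^T with a block-diagonal D
```

Here no permutation is needed; in general P reorders rows and columns so that acceptable 1 × 1 and 2 × 2 pivots appear on the diagonal.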
2. THE ALGORITHM
Our algorithm is based on the work of Duff et al. [1991], which uses 2 × 2
pivots with zeros on the diagonal to assist in the preservation of sparsity.
In addition, it is advantageous to perform eliminations with several pivots
simultaneously, which we do by accumulating them into a block pivot.
We use block pivots that may be

(i) of the form

$$\begin{pmatrix} 0 & A_1 \\ A_1^T & 0 \end{pmatrix} \qquad (2.1)$$

(an oxo pivot), or

(ii) of the form

$$\begin{pmatrix} A_2 & A_1 \\ A_1^T & 0 \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} 0 & A_1 \\ A_1^T & A_2 \end{pmatrix} \qquad (2.2)$$

(tile pivots).
$$\begin{pmatrix} 0 & B_2 & B_3 \\ B_2^T & B_1 & B_4 \\ B_3^T & B_4^T & 0 \end{pmatrix}, \qquad (2.3)$$
where the blocks on the diagonal are square. This is the form of the
submatrix altered by a pivotal step with an oxo pivot and is illustrated in
Figure 1. The blocks B1, B2, B3, and B4 are usually full, and we always
store them as full matrices. A tile pivot produces the special case of this
form where the first or last block row and column are null. It is illustrated
in Figure 2. For a full pivot, the generated element is held as a full matrix,
which is the special case where the first and last block row and column are
both null.
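The zero corner blocks of (2.3) can be checked directly. In this sketch (illustrative Python with a hypothetical pattern, not MA47 code), an oxo pivot $\begin{pmatrix}0&a\\a&0\end{pmatrix}$ has its two pivot rows overlapping in only some columns; wherever only one of the two rows has entries, the Schur update vanishes:

```python
import numpy as np

# Oxo pivot [[0, a], [a, 0]]; pivot row 1 has entries in columns {0, 1, 2}
# of the non-pivot part, pivot row 2 in columns {2, 3, 4}.
a = 2.0
r1 = np.array([1., 2., 3., 0., 0.])       # pattern of the first pivot row
r2 = np.array([0., 0., 4., 5., 6.])       # pattern of the second pivot row
R = np.vstack([r1, r2])
Pinv = np.array([[0., 1./a], [1./a, 0.]])  # inverse of the oxo pivot
update = -R.T @ Pinv @ R                   # Schur update (generated element)

# Corner blocks are zero, as in (2.3): entry (i, j) of the update is
# -(r1[i]*r2[j] + r2[i]*r1[j]) / a, which vanishes when only one of the
# two pivot rows touches both i and j.
assert np.allclose(update[:2, :2], 0) and np.allclose(update[3:, 3:], 0)
```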
We have chosen the multifrontal technique [Duff and Reid 1983] for the
sake of efficiency during the analyze phase and to permit extensive use of
full-matrix code and the BLAS (Basic Linear Algebra Subprograms [Don-
garra et al. 1988; 1990; Lawson et al. 1979]) during factorization. We use
the notation $B^{(l)}$ for the generated-element matrix from the $l$th (block) pivotal step.
Fig. 1. An oxo pivot, its pivot rows and columns, and fill-in pattern (generated element).
Fig. 2. A tile pivot, its pivot rows and columns, and fill-in pattern (generated element).
$$A_k + \sum_{l \in I_k} B_k^{(l)}, \qquad (2.4)$$
where $I_k$ is the set of indices of element matrices that are active then. If
$B_{k-1}^{(l)}$ has entries only in the pivotal rows and columns, $B_k^{(l)}$ will be zero,
and $l$ is omitted from the index set $I_k$. Other $B_k^{(l)}$ may have entries that lie
entirely within the pattern of the newly generated element $B^{(k)}$; for
efficiency, such a $B_k^{(l)}$ is added into $B^{(k)}$, and $l$ is omitted from $I_k$. We say
that $B_k^{(l)}$ is absorbed into $B^{(k)}$.

Such absorption certainly takes place if the pivot is full and overlaps one
or more of the diagonal entries of $B_k^{(l)}$ because in this case the pivot row has
an entry for every index of $B_k^{(l)}$. If all the pivots are full, all generated
elements are full, and therefore any generated element that is involved in a
pivotal step is absorbed. This is the situation for a definite matrix.
In the definite case, the whole process may be represented by a tree,
known as the assembly tree, which has a node for each block pivotal step.
The sons of a node correspond to the element matrices that contribute to
the pivotal row(s) and are absorbed in the generated element. Here it is
efficient to add all the generated elements from the sons and the pivot rows
from the original matrix into a temporary matrix known as the frontal
matrix, which can be held in a square array whose order is the number of rows and
columns with entries involved. The rows and columns are known as the
front. For a fuller description of this case, see Duff et al. [1986, Sections
10.5 to 10.9].
Given an assembly tree, there is considerable freedom in the ordering of
the block pivotal steps during an actual matrix factorization. The opera-
tions are the same for any ordering such that the pivotal operations at a
node follow those at a descendant of the node (apart from roundoff effects
caused by performing additions in a different order). Subject to this
requirement, the order may be chosen for organizational convenience. For a
uniprocessor implementation, it is usual to base it on postordering follow-
ing a depth-first search of the tree, which allows a stack to be used to store
the generated elements awaiting assembly. We follow this practice.
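The postordering-plus-stack organization can be sketched as follows (illustrative Python; the tree and node numbering are hypothetical). A postordering guarantees that when a father is processed, its sons' generated elements are on top of the stack:

```python
# Sons of each node in a small assembly tree; node 0 is the root.
tree = {0: [1, 4], 1: [2, 3], 2: [], 3: [], 4: []}

def postorder(node, tree, out):
    """Depth-first search, emitting each node after all of its sons."""
    for son in tree[node]:
        postorder(son, tree, out)
    out.append(node)

order = []
postorder(0, tree, order)

# Simulate the stack of generated elements awaiting assembly.
stack = []
for node in order:
    for _ in tree[node]:
        stack.pop()        # assemble (absorb) each son's generated element
    stack.append(node)     # push this node's generated element
# At the end only the root's (null) element remains on the stack.
```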
When there are some structured pivots, we employ the same assembly
tree, but a generated element is not necessarily absorbed at its father node.
Instead, it may persist for several generations, making contributions to
several pivotal rows, until it is eventually absorbed. As an illustration of
absorption not occurring, a simple 1 × 1 pivot might overlap the leading
(zero) block of Eq. (2.3). In such a case, B1 and B4 are absorbed, but the
nonpivotal rows of B2 and B3 are inactive during this step (unless made
active by entries from other generated-element matrices). Absorption of
$B_k^{(l)}$ occurs for a structured pivot if an off-diagonal entry of $B_k^{(l)}$ overlaps the
off-diagonal block $A_1$ of the pivot. This is seen by regarding the structured
pivot as a sequence of 1 × 1 pivots, starting with the off-diagonal entry and
its symmetrically placed partner. To handle the structured case efficiently,
we sum only the contributions to the pivot rows, form the Schur update,
and then add into it any generated elements from the sons that can be
absorbed. The frontal matrix is thus more complicated, but we still refer to
the set of rows and columns involved as the front.
Similarly the stack is more complicated. Access will be needed to gener-
ated elements corresponding to descendants (not just children), but these
will still be nearer the top of the stack than any generated elements that do
not correspond to descendants. When a generated element is absorbed, it
may leave a hole in the stack. These holes are tolerated until the stack
becomes too large for the available memory, at which point we perform a
simple data compression. To aid both access to data in the stack and stack
compression, we merge adjacent holes as soon as they appear.
$$\begin{pmatrix} B_1 & B_2 \\ B_2^T & 0 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} 0 & B_3 \\ B_3^T & 0 \end{pmatrix}. \qquad (2.5)$$

$$\begin{pmatrix} 0 & B_2 & B_3 \\ B_2^T & B_1 & B_4 \\ B_3^T & B_4^T & 0 \end{pmatrix}
= \begin{pmatrix} 0 & B_2 & 0 \\ B_2^T & B_1 & B_4 \\ 0 & B_4^T & 0 \end{pmatrix}
+ \begin{pmatrix} 0 & 0 & B_3 \\ 0 & 0 & 0 \\ B_3^T & 0 & 0 \end{pmatrix}. \qquad (2.6)$$
We have decided to use the form (2.3) because, in the case of an oxo-generated element,
(i) the duplication of the index lists of the first and third blocks is
avoided;
(ii) for a row of the first or third block, one rather than two elements
involve it and need to be included in a list of elements associated with
the row (such lists are needed during the analyze phase); and
(iii) with the two-part form (2.5), a link would need to be maintained
between the two parts of an oxo-generated element in order to
recognize that both parts of the element can be absorbed when an
off-diagonal entry overlaps the off-diagonal block $A_1$ of a structured pivot.
$$A_{11} = LDL^T, \qquad (2.7)$$
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{12}^T & A_{22} \end{pmatrix}
= \begin{pmatrix} L & 0 \\ M^T & I \end{pmatrix}
\begin{pmatrix} D & 0 \\ 0 & A_{22} - S \end{pmatrix}
\begin{pmatrix} L^T & M \\ 0 & I \end{pmatrix}, \qquad (2.8)$$
$$A_{11} = \begin{pmatrix} 0 & A_1 \\ A_1^T & A_2 \end{pmatrix}
= \begin{pmatrix} L_1 & 0 \\ U_2^T & U_1^T \end{pmatrix}
\begin{pmatrix} 0 & D_1 \\ D_1 & D_2 \end{pmatrix}
\begin{pmatrix} L_1^T & U_2 \\ 0 & U_1 \end{pmatrix}, \qquad (2.11)$$
$$A_1 = L_1 D_1 U_1 \qquad (2.12)$$
$$\begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1r} \\
t_{21} & t_{22} & \cdots & t_{2r} \\
\vdots & \vdots & & \vdots \\
t_{r1} & t_{r2} & \cdots & t_{rr}
\end{pmatrix}, \qquad (2.13)$$

$$\begin{pmatrix}
I & & & \\
l_{21} & I & & \\
\vdots & \vdots & \ddots & \\
l_{r1} & l_{r2} & \cdots & I
\end{pmatrix}
\begin{pmatrix}
d_{11} & & & \\
& d_{22} & & \\
& & \ddots & \\
& & & d_{rr}
\end{pmatrix}
\begin{pmatrix}
I & l_{21}^T & \cdots & l_{r1}^T \\
& I & \cdots & l_{r2}^T \\
& & \ddots & \vdots \\
& & & I
\end{pmatrix}, \qquad (2.14)$$
where each $l_{ij}$ has a zero in position (2,1) and where each $d_{ii}$ is a symmetric
tile. If we now apply the inverse permutation, we get the form (2.11) with
the lower triangular entries of $L_1$, $U_2^T$, $U_1^T$, respectively, being the (1,1),
(2,1), (2,2) entries of $l_{ij}$, and the diagonal entries of $D_1$ and $D_2$, respectively,
being the (2,1) and (2,2) entries of $d_{ii}$. An alternative derivation of Eq.
(2.11) is by application of a sequence of elementary row and column
operations that reduce
$$\begin{pmatrix} 0 & A_1 \\ A_1^T & A_2 \end{pmatrix}$$

to the form

$$\begin{pmatrix} 0 & D_1 \\ D_1 & D_2 \end{pmatrix}$$
in column 1, row 1, column r+1, row r+1, column 2, row 2, column r+2, row
r+2, . . . . Note that, apart from the effects of reordering additions and
subtractions, the forms (2.11) and (2.14) yield the same numerical values.
We use both forms in our code. We compute the factors with form (2.14). We
apply the factors in solutions and in Schur updates using form (2.11), which
permits use of block operations.
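The form (2.11) can be checked numerically. The sketch below (illustrative Python, with randomly chosen factors) builds the three factors of (2.11) and confirms that their product has the tile structure with a zero leading block, is symmetric, and has off-diagonal block $A_1 = L_1 D_1 U_1$ as in (2.12):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
# L1 unit lower triangular, U1 unit upper triangular, U2 arbitrary,
# D1 and D2 diagonal (a hypothetical small instance).
L1 = np.tril(rng.standard_normal((r, r)), -1) + np.eye(r)
U1 = np.triu(rng.standard_normal((r, r)), 1) + np.eye(r)
U2 = rng.standard_normal((r, r))
D1 = np.diag(rng.uniform(1.0, 2.0, r))
D2 = np.diag(rng.standard_normal(r))
Z = np.zeros((r, r))

Lfac = np.block([[L1, Z], [U2.T, U1.T]])   # left factor of (2.11)
Dfac = np.block([[Z, D1], [D1, D2]])       # middle factor
Rfac = np.block([[L1.T, U2], [Z, U1]])     # right factor (= Lfac.T)
A11 = Lfac @ Dfac @ Rfac

assert np.allclose(A11[:r, :r], 0)              # leading zero block
assert np.allclose(A11, A11.T)                  # symmetry
assert np.allclose(A11[:r, r:], L1 @ D1 @ U1)   # A1 = L1 D1 U1, Eq. (2.12)
```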
The diagonal entries of

$$\begin{pmatrix} 0 & D_1 \\ D_1 & D_2 \end{pmatrix}$$

can be changed to make it positive definite as easily as those of D in Eq.
(2.7). Thus we may regard Eq. (2.7) as applicable to the tile case with

$$L = \begin{pmatrix} L_1 & 0 \\ U_2^T & U_1^T \end{pmatrix}
\quad \text{and} \quad
D = \begin{pmatrix} 0 & D_1 \\ D_1 & D_2 \end{pmatrix}. \qquad (2.15)$$
$$M = \begin{pmatrix} M_5 & M_6 & 0 & 0 \\ 0 & M_7 & M_8 & 0 \end{pmatrix}, \qquad (2.16)$$
and only the blocks M5, M6, M7, and M8 are stored. For a tile pivot, the first
block column is absent. These forms are discussed further in Section 2.3.
for a diagonal entry $a_{ii}$, with row count $r_i$ (the number of entries in the row), is
extended to

$$(r_i - 1)(r_i + r_j - 3) \qquad (2.18)$$

and

$$(r_i - 1)(r_j - 1). \qquad (2.19)$$
        end if
        if the Markowitz cost is the smallest so far found, store
          it as such
      end do
    end do
    if there is a stored 2 × 2 pivot with Markowitz cost ≤ r² then
      accept the 2 × 2 pivot and exit main_loop
    end if
  end do
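For illustration, the cost formulas can be evaluated from the row counts of a pattern. The sketch below is our own (the pattern and function names are hypothetical); the 1 × 1 cost $(r_i - 1)^2$ is the classical one of Markowitz [1957]:

```python
import numpy as np

# Hypothetical symmetric sparsity pattern.
pattern = np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 1, 1],
                    [0, 1, 1, 1]], dtype=bool)
r = pattern.sum(axis=1)                 # row counts r_i

def cost_1x1(i):                        # classical Markowitz cost
    return (r[i] - 1) ** 2

def cost_218(i, j):                     # extended cost, formula (2.18)
    return (r[i] - 1) * (r[i] + r[j] - 3)

def cost_219(i, j):                     # extended cost, formula (2.19)
    return (r[i] - 1) * (r[j] - 1)
```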
for each variable and the variables of each supervariable in a chain with
the principal variable at its head.
This forest is gradually modified until it becomes the elimination tree. At
an intermediate stage, each tree of the forest has a root that represents the
principal variable of either (i) a supervariable that has not yet been pivotal
or (ii) a supervariable that has been pivotal and whose generated-element
matrix is zero or has not yet contributed to a pivotal row. The node of a
principal variable that has been eliminated may be regarded as also
representing the element generated when it was eliminated. When a pivot
block is chosen, the node of its principal variable is given as sons the roots
of all the trees that contain one or more generated elements that contribute
to the pivot rows. If the pivot is a structured pivot, it is also given as a son
the principal variable of the second part. Some amalgamation of supervari-
ables may follow the pivotal step, because the fill-ins may cause some rows
to have identical structures in the reduced matrix.
We have noted that a generated-element matrix need not be absorbed at
its father node. Indeed, it may persist for many generations of the tree,
contributing to pivotal rows at each stage. This may mean that all that is
left is a zero matrix, which is all or part of one of the zero blocks of the
matrix (2.3). It might be thought that such a zero matrix could be
discarded, but to do so risks the loss of a property of an assembly tree that
we wish to exploit: the block pivotal steps may be reordered provided those
at a node follow those at all descendants of the node. We must therefore
retain such a zero element matrix and whenever one of its variables
becomes pivotal, make the root of its tree a son of the pivot node.
In the irreducible case, this eventually yields the elimination tree, whose
root is the only node with a zero generated-element matrix. The node
represents the final block pivot, which obviously leaves a null Schur
complement. If the original matrix is reducible, it must be block diagonal
because it is symmetric. In this case, the problem consists of several
independent subproblems, and there will be a tree for each. It is not
difficult to allow for this, and our code does so. For simplicity of description,
however, we assume that the matrix is irreducible.
Fig. 5. A father-son pair of oxo pivots for which amalgamation is possible if the variables of
the father are interchanged.
2.3 Factorization
The factorization is controlled by the assembly tree created at the end of
the analyze phase. For stability, all the pivots are tested numerically. If the
updated entries of the fully summed rows in the front are $f_{ij}$, the test for a
1 × 1 pivot is

$$|f_{kk}| \ge u \max_{j \ne k} |f_{kj}|, \qquad (2.20)$$
and the corresponding test for a 2 × 2 pivot is

$$\left| \begin{pmatrix} f_{k,k} & f_{k,k+1} \\ f_{k+1,k} & f_{k+1,k+1} \end{pmatrix}^{-1} \right|
\begin{pmatrix} \max_{j \ne k,k+1} |f_{kj}| \\ \max_{j \ne k,k+1} |f_{k+1,j}| \end{pmatrix}
\le \begin{pmatrix} u^{-1} \\ u^{-1} \end{pmatrix}. \qquad (2.21)$$
For a tile pivot, it is possible for this test to fail, and yet taking its two
diagonal entries in turn as 1 × 1 pivots would lead to two 1 × 1 pivots that
satisfy inequality (2.20). Using this pair is mathematically equivalent. We
therefore accept the tile pivot in this case, also. For the second 1 × 1 pivot,
we test the inequality
$$\left| f_{k+1,k+1} - \frac{f_{k+1,k}^2}{f_{k,k}} \right| \ge
u \max_{j \ne k,k+1} \left| f_{k+1,j} - \frac{f_{k+1,k}}{f_{k,k}}\, f_{kj} \right|. \qquad (2.22)$$
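The tests (2.20) and (2.21) can be sketched directly (illustrative Python; the front F and threshold u are hypothetical). Note how a zero diagonal entry fails the 1 × 1 test while the corresponding oxo block passes the 2 × 2 test:

```python
import numpy as np

u = 0.1  # threshold parameter

def ok_1x1(F, k):
    # test (2.20): |f_kk| >= u * max_{j != k} |f_kj|
    row = np.abs(F[k]).copy()
    row[k] = 0.0
    return abs(F[k, k]) >= u * row.max()

def ok_2x2(F, k):
    # test (2.21) on the pivot (f_kk, f_k,k+1; f_k+1,k, f_k+1,k+1);
    # absolute values and the inequality are taken entrywise
    P = F[k:k+2, k:k+2]
    mask = np.ones(F.shape[1], dtype=bool)
    mask[[k, k + 1]] = False
    m = np.array([np.abs(F[k, mask]).max(), np.abs(F[k + 1, mask]).max()])
    return bool(np.all(np.abs(np.linalg.inv(P)) @ m <= 1.0 / u))

F = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 3.0]])
# The zero diagonal entry fails (2.20), but the 2x2 block passes (2.21).
assert not ok_1x1(F, 0) and ok_2x2(F, 0)
```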
These numerical tests may mean that some rows that we expected to
eliminate during a block pivotal step remain uneliminated at the end of the
step. These rows are stored alongside the generated element for treatment
at the father node. We call these rows fully summed because we know that
there are no contributions to be added from elsewhere, unlike the rows of
the generated element. At the father node, the possible pivot rows consist
of old fully summed rows coming from this son and perhaps other sons, too,
and the new fully summed rows that were recommended as pivots by the
analyze phase.
If a full block pivot was recommended, we choose a simple 1 × 1 or 2 × 2
pivot and perform the corresponding elimination operations on the pivot
rows before choosing the next simple 1 × 1 or 2 × 2 pivot. Both old and new
fully summed rows are candidates. We know of no other strategy for
ensuring that the block pivot as a whole is satisfactory. Note, however, that
the calculations for the Schur update can be delayed and performed as a
block operation once all pivots are chosen.
If a structured block was recommended, the analyze phase expects that
the new fully summed rows have the form
$$\begin{pmatrix} A_2 & A_1 & A_5 & A_6 & 0 & 0 \\ A_1^T & 0 & 0 & A_7 & A_8 & 0 \end{pmatrix}
\quad \text{or} \quad
\begin{pmatrix} 0 & A_1 & A_5 & A_6 & 0 & 0 \\ A_1^T & 0 & 0 & A_7 & A_8 & 0 \end{pmatrix} \qquad (2.23)$$
$$S = M^T D M, \qquad (2.24)$$
order 1 and 2 (see Eq. (2.10)). Within the frontal matrix, we hold a compact
representation of M that excludes zero columns. Therefore, our real concern
is the efficient formation of the product (2.24) for a full rectangular matrix
M. Unfortunately, there are no BLAS routines for forming a symmetric
matrix as the product of two rectangular matrices, so we cannot form DM
and use a BLAS routine for calculating MT(DM) without doing about twice
as much work as necessary. We therefore choose a block size b (with default
size 5 set by the initialization routine) and divide the rows of M and DM
into strips that start in rows 1, b+1, 2b+1, . . . . This allows us to compute
the block upper triangular part of each corresponding strip of S in turn,
using SGEMM for each strip except the last. For the last (undersize) strip,
we use simple Fortran code and take full advantage of the symmetry. Note
that if b > n, simple Fortran code will be used all the time.
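One way to realize this strip scheme is sketched below (our own illustrative organization in Python; the dense products stand in for the SGEMM call per strip). Only the block upper triangle of each strip is needed, and the final strip is undersize:

```python
import numpy as np

def strip_symmetric_product(M, D, b):
    """Form the upper block-triangular part of S = M^T D M by strips of b rows."""
    p, n = M.shape
    DM = D @ M
    S = np.zeros((n, n))
    for start in range(0, n, b):
        stop = min(start + b, n)
        # rows start..stop of S, columns start..n: one GEMM-like product;
        # the lower triangle of the strip's diagonal block is the extra work
        S[start:stop, start:] = M[:, start:stop].T @ DM[:, start:]
    return S

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 7))
D = np.diag(rng.standard_normal(3))
S = strip_symmetric_product(M, D, b=2)   # last strip (1 row) is undersize
full = M.T @ (D @ M)
assert np.allclose(np.triu(S), np.triu(full))
```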
For an oxo pivot, the matrix DM has the form
$$DM = \begin{pmatrix} L_1 & 0 \\ 0 & U_1^T \end{pmatrix}^{-1}
\begin{pmatrix} A_5 & A_6 & 0 & 0 \\ 0 & A_7 & A_8 & 0 \end{pmatrix}
= \begin{pmatrix} M_5 & M_6 & 0 & 0 \\ 0 & M_7 & M_8 & 0 \end{pmatrix}, \qquad (2.25)$$
$$M = \begin{pmatrix} 0 & D_1^{-1} \\ D_1^{-1} & 0 \end{pmatrix}
\begin{pmatrix} M_5 & M_6 & 0 & 0 \\ 0 & M_7 & M_8 & 0 \end{pmatrix}
= \begin{pmatrix} 0 & M_9 & M_{10} & 0 \\ M_{11} & M_{12} & 0 & 0 \end{pmatrix}. \qquad (2.26)$$
$$\begin{pmatrix}
0 & B_2 & B_3 & 0 \\
B_2^T & B_1 & B_4 & 0 \\
B_3^T & B_4^T & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}
= \begin{pmatrix}
0 & M_{11}^T \\
M_9^T & M_{12}^T \\
M_{10}^T & 0 \\
0 & 0
\end{pmatrix}
\begin{pmatrix} M_5 & M_6 & 0 & 0 \\ 0 & M_7 & M_8 & 0 \end{pmatrix} \qquad (2.27)$$
by the calculations

$$(B_2 \;\; B_3) = M_{11}^T\,(M_7 \;\; M_8), \qquad
B_4 = M_{12}^T M_8, \qquad
B_1 = (M_9^T \;\; M_{12}^T)\begin{pmatrix} M_6 \\ M_7 \end{pmatrix}. \qquad (2.28)$$
We may use the BLAS 3 routine SGEMM directly for the first two calcula-
tions. For the symmetric matrix B1, we subdivide the computation into
strips, as for the full-pivot case.
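The chain (2.25)-(2.28) can be verified numerically on random blocks. The sketch below (illustrative Python) builds DM and M as in (2.25)-(2.26), forms the product (2.27), and checks the block formulas (2.28) together with the zero corner blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
D1 = np.diag(rng.uniform(1.0, 2.0, 2))
D1inv = np.linalg.inv(D1)
M5, M6, M7, M8 = (rng.standard_normal((2, 2)) for _ in range(4))
Z = np.zeros((2, 2))

DM = np.block([[M5, M6, Z, Z], [Z, M7, M8, Z]])      # form (2.25)
M = np.block([[Z, D1inv], [D1inv, Z]]) @ DM          # form (2.26)
S = M.T @ DM                                         # the product (2.27)

M9, M10 = D1inv @ M7, D1inv @ M8
M11, M12 = D1inv @ M5, D1inv @ M6
assert np.allclose(S[0:2, 2:4], M11.T @ M7)               # B2, Eq. (2.28)
assert np.allclose(S[0:2, 4:6], M11.T @ M8)               # B3
assert np.allclose(S[2:4, 4:6], M12.T @ M8)               # B4
assert np.allclose(S[2:4, 2:4], M9.T @ M6 + M12.T @ M7)   # B1
assert np.allclose(S[0:2, 0:2], 0) and np.allclose(S[4:6, 4:6], 0)
```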
For a tile pivot, the sparsity in the first block column of the preceding
form is lost:
$$DM = \begin{pmatrix} L_1 & 0 \\ U_2^T & U_1^T \end{pmatrix}^{-1}
\begin{pmatrix} A_5 & A_6 & 0 & 0 \\ 0 & A_7 & A_8 & 0 \end{pmatrix}. \qquad (2.29)$$
$$DM = \begin{pmatrix} M_6 & 0 & 0 \\ M_7 & M_8 & 0 \end{pmatrix} \qquad (2.30)$$
$$M = \begin{pmatrix} -D_1^{-1} D_2 D_1^{-1} & D_1^{-1} \\ D_1^{-1} & 0 \end{pmatrix}
\begin{pmatrix} M_6 & 0 & 0 \\ M_7 & M_8 & 0 \end{pmatrix}
= \begin{pmatrix} M_9 & M_{10} & 0 \\ M_{12} & 0 & 0 \end{pmatrix}. \qquad (2.31)$$
The Schur update calculation is therefore as in the oxo case except that the
first block row and column are not present, and we have only the calculations
for B1 and B4.
2.4 Solution
The solution is conveniently performed in two stages. The first, forward
substitution, consists of solving
$$(PLP^T)\, y = b, \qquad (2.32)$$

and the second, back-substitution, consists of solving

$$(PDL^T P^T)\, x = y. \qquad (2.33)$$
$$P^{-1} b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \qquad (2.34)$$
$$\begin{pmatrix} L & 0 & 0 \\ M^T & I & 0 \\ 0 & 0 & I \end{pmatrix}
\begin{pmatrix} b_1' \\ b_2' \\ b_3' \end{pmatrix}
= \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \qquad (2.35)$$
$$L\, b_1' = b_1 \qquad (2.36)$$
In the case of a full pivot, we can employ the Level 2 BLAS routine STPSV
in Eq. (2.36) (we pack the triangular array L to save storage) and the Level
2 BLAS routine SGEMV in Eq. (2.37). For a structured pivot, it is slightly
more complicated: L has the form

$$\begin{pmatrix} L_1 & 0 \\ U_2^T & U_1^T \end{pmatrix},$$

so solving Eq. (2.36) requires two applications of STRSV (here, we pack the
arrays $L_1$, $D_1$, and $U_1$ together in a square array) and one application of
STPMV. Also, M has the form
$$\begin{pmatrix} M_5 & M_6 & 0 & 0 \\ 0 & M_7 & M_8 & 0 \end{pmatrix},$$
Using this data structure, the details of our algorithm are as follows:
for j := 1 step 1 until n do
  for each entry i in column j do
    s := svar(i)           ! s is i's old supervariable
    if flag(s) < j then    ! first occurrence of s for column j
      flag(s) := j
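A complete, runnable version of this supervariable-detection loop is sketched below (our own Python transcription; the arrays `svar` and `flag` mirror those above, and `split` holds the supervariable newly created for the current column). Variables with identical column patterns end in the same supervariable:

```python
def supervariables(columns, n):
    """columns[j] lists the row indices of the entries in column j."""
    svar = [0] * n          # svar[i]: supervariable of variable i
    flag = [-1]             # flag[s]: last column j for which s was split
    split = [None]          # split[s]: supervariable for members of s seen in j
    for j, col in enumerate(columns):
        for i in col:
            s = svar[i]
            if flag[s] < j:              # first occurrence of s for column j
                flag[s] = j
                flag.append(j)           # fresh supervariable, flagged for j
                split.append(None)
                split[s] = len(flag) - 1
            svar[i] = split[s]
        # members of s not in column j keep supervariable s
    return svar

# Variables 0 and 1 have identical patterns {0, 1}; 2 and 3 differ.
svar = supervariables([[0, 1, 2], [0, 1], [2, 3]], 4)
assert svar[0] == svar[1] and svar[2] != svar[3]
```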
3. NUMERICAL EXPERIENCE
3.1 Introduction
In this section, we examine the performance of the MA47 code on a range of
test problems on a SUN SPARCstation 10 and a CRAY Y-MP. We study the
influence of various parameter settings on the performance of MA47 and
determine the values for the defaults. We also compare the MA47 code with
the code MA27.
As always, the selection of test problems is a compromise between
choosing sufficient problems to obtain meaningful statistics while keeping
run times and this section of the report manageable. In Tables I and II, we
list the test problems used for the numerical experiments in this section. In
choosing this set, we performed many runs on other matrices, so we
feel that this selection is broadly representative of a far larger set. Our set
of 22 matrices can be divided into three distinct sets. The first (matrices 1
to 10), in Table I, are obtained from the CUTE collection of nonlinear
optimization problems [Bongartz et al. 1995]. The problems in that set are
parameterized, and we have chosen the parameters to obtain a linear
problem of order between about 1000 and 20,000. In each case, we solve a
linear system whose coefficient matrix is the Kuhn-Tucker matrix of the
form
$$\begin{pmatrix} H & A \\ A^T & 0 \end{pmatrix},$$
for us from CUTE. The second and third sets are obtained by forming
augmented systems of the form
$$\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} D & A \\ A^T & 0 \end{pmatrix},
\quad \text{with} \quad
D = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix},$$
Table III. Results with Search Limit of 2 Rows Divided by Those with Limit of 4
Table IV. Results with Search Limit of 4 Rows Divided by Those with Limit of 10
Table V. Results with Search Limit of 10 Divided by Those with No Search Limit
Table VI. Results with Threshold u Set to 0.5 Divided by Those with u 5 0.1
Table VII. Results with Threshold u Set to 0.1 Divided by Those with u 5 0.01
Table VIII. Results with Threshold u Set to 0.01 Divided by Those with u 5 0.001
Table IX. Results with Threshold u Set to 0.001 Divided by Those with u 5 0.0001
Table X. Results with Threshold u Set to 0.0001 Divided by Those with u 5 0.00001
that time. However, the more complicated data structures in MA47 and the
greater penalties for not being able to follow the pivotal sequence recom-
mended by the analyze phase penalize higher values of u to a greater
extent than in the earlier code. We investigate this in Tables VI to X.
We also, of course, monitored the numerical performance for all these
runs. Although the results from using a threshold value of 0.5 were better
Table XII. Results with Amalgamation Parameter Set to 5 Divided by Those with
Parameter 10
Table XIII. Results with Amalgamation Parameter Set to 10 Divided by Those with
Parameter 20
than 0.1 for a couple of test problems, notably the NNC1374 example, it
was substantially more expensive to use such a high value for the thresh-
old. For lower values of the threshold, the scaled residual was remarkably
flat for all the test problems until values of the threshold less than $10^{-6}$,
when poorer results were obtained on three of the examples.
From the performance figures in Tables VI to X, the execution times and
storage decline almost monotonically, but have begun to level off by about
0.0001 (although there are a couple of outliers). However, we are anxious
not to compromise stability and recognize that our numerical experience is
necessarily limited. We have therefore decided to choose as default a
Table XIV. Times with Block Size Parameter b Set to 1 Divided by Those with b 5 5
threshold value of 0.001 because for many problems, much of the sparsity
benefit has been realized at this value.
3.5 Effect of Change of Block Size for Level 3 BLAS During Factorization
A major benefit of multifrontal methods is that the floating-point arith-
metic is performed on dense submatrices. In particular, if we perform
several pivot steps on a particular frontal matrix, Level 3 BLAS can be
used. However, in the present case, we also wish to maintain symmetry,
and the current Level 3 BLAS suite does not have an appropriate kernel.
We thus, as discussed in Section 2.3, need to split the frontal matrix into
strips, starting at rows 1, b+1, 2b+1, . . . , so that we can use Level 3 BLAS
without doubling the arithmetic count. In fact, in a block of size b on the
diagonal, the extra work is bp(b − 1) floating-point operations. Clearly this
means that, while we would like to increase b for Level 3 BLAS efficiency,
by doing so we increase the amount of arithmetic. In this section, we
examine the tradeoff between these competing trends.
Table XV. Times with Block Size Parameter b Set to 5 Divided by Those with b 5 10
Table XVI. Times with Block Size Parameter b Set to 10 Divided by Those with b 5 20
Table XVII. Times with Block Size Parameter b Set to 5 Divided by Those with b > n
3.6 Effect of Change of Block Size for Level 2 BLAS During Solution
In a single block pivot stage of the solution phase, one can use indirect
addressing for every operation. Alternatively, one can load the appropriate
entries of the right-hand-side vector into a small full vector corresponding
to the rows in the current front, update this vector with Level 2 BLAS
operations, and finally scatter it back to the full vector.
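The gather/update/scatter pattern can be sketched as follows (illustrative Python with hypothetical data; the dense solve stands in for the STPSV/STRSV/SGEMV calls of the Fortran code):

```python
import numpy as np

def front_update(b, front, L):
    """Gather the front's entries of b, apply a dense update, scatter back."""
    w = b[front]                 # gather into a small dense vector
    w = np.linalg.solve(L, w)    # dense triangular solve (stand-in for STRSV)
    b[front] = w                 # scatter back to the full-length vector
    return b

b = np.array([1.0, 2.0, 3.0, 4.0])
front = np.array([0, 2])                   # rows of the current front
L = np.array([[2.0, 0.0], [1.0, 1.0]])     # lower triangular factor block
front_update(b, front, L)
```

With indirect addressing, every arithmetic operation would instead index through `front`; the gather/scatter pays off once the number of pivots at the step is large enough.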
We have experimented with the parameter that determines whether to
use indirect or direct addressing in the solution phase. Direct addressing is
used (and Level 2 BLAS called) if the number of pivots at a step is more
than this parameter. Thus, for high values of the parameter, there will be
Table XVIII. Ratios of Solve Times with Different Values for the Parameter for Indirect
Addressing
times for the problems of Tables I and II are shown in Tables XIX and XX.
It was our original intention that this new MA47 code would replace MA27
in the Harwell Subroutine Library. However, the added complexity of the
new code will penalize it if it is unable to take advantage of the structure.
We thus might expect that sometimes MA47 would be better, and some-
times MA27. We illustrate this by showing the comparison ratios for the
two codes in Tables XXI and XXII.
We show the full results in these tables as well as the medians and
quartiles because the performance can vary very widely. On the SUN, the
new code factorizes matrix 18 (PILOT) over 60 times faster than MA27, but
is nearly 7 times slower than MA27 on matrix 1 (BRITGAS). With the
exception of matrix 18, the analyze phase times are always greater for the
new code, once by over a factor of 80 (matrix 5, BRATU2D). The variation is
only slightly less dramatic on the CRAY.
We have also run the codes on a set of ten Harwell-Boeing matrices with
nonzero diagonal entries (BCSSTK14/15/16/17/18/26/27/28 and BCSSTM26/
27), and the results are summarized in Table XXIII. On the SUN, MA47
just outperforms MA27 for the factorize and solve phases, but otherwise is
inferior. We therefore recommend that where the matrix has nonzeros on
the diagonal, MA27 should continue to be used. We plan to make a new
version of MA27 that incorporates the BLAS and some other minor improvements.
Table XXI. Ratio of Performance of MA47 Code to MA27 Code on the SUN SPARC-10
Table XXII. Ratio of Performance of MA47 Code to MA27 Code on the CRAY Y-MP
Table XXIII. MA47 to MA27 Ratios on 10 Matrices with Nonzero Entries on the Diagonal
ACKNOWLEDGMENTS
We would like to thank our colleagues Nick Gould and Jennifer Scott and
particularly the referee John Lewis for reading drafts of the manuscript
and for making many helpful suggestions for the presentation.
Table XXIV. MA47 to MA27 Ratios Using Default Threshold Value for MA27
REFERENCES
BONGARTZ, I., CONN, A. R., GOULD, N. I. M., AND TOINT, P. L. 1995. CUTE: Constrained and
unconstrained testing environment. ACM Trans. Math. Softw. 21, 1 (Mar.), 123–160.
BUNCH, J. R. AND PARLETT, B. N. 1971. Direct methods for solving symmetric indefinite
systems of linear equations. SIAM J. Numer. Anal. 8, 639–655.
DONGARRA, J. J., DU CROZ, J., DUFF, I. S., AND HAMMARLING, S. 1990. A set of level 3 basic
linear algebra subroutines. ACM Trans. Math. Softw. 16, 1 (Mar.), 1–17.
DONGARRA, J. J., DU CROZ, J., HAMMARLING, S., AND HANSON, R. J. 1988. An extended set of
Fortran basic linear algebra subprograms. ACM Trans. Math. Softw. 14, 1 (Mar.), 1–17. See
also the companion algorithm, pages 18–32.
DUFF, I. S. AND REID, J. K. 1982. MA27—A set of Fortran subroutines for solving sparse
symmetric sets of linear equations. Rep. AERE R10533, HMSO, London.
DUFF, I. S. AND REID, J. K. 1983. The multifrontal solution of indefinite sparse symmetric
linear systems. ACM Trans. Math. Softw. 9, 302–325.
DUFF, I. S. AND REID, J. K. 1995. MA47, a Fortran code for direct solution of indefinite
symmetric linear systems of equations. Rep. RAL 95-001, Rutherford Appleton Laboratory,
Oxfordshire, England.
DUFF, I. S., ERISMAN, A. M., AND REID, J. K. 1986. Direct methods for sparse matrices.
Oxford University Press, London.
DUFF, I. S., GOULD, N. I. M., REID, J. K., SCOTT, J. A., AND TURNER, K. 1991. The
factorization of sparse symmetric indefinite matrices. IMA J. Numer. Anal. 11, 181–204.
DUFF, I. S., GRIMES, R. G., AND LEWIS, J. G. 1992. Users’ guide for the Harwell-Boeing
sparse matrix collection. Rep. RAL 92-086, Rutherford Appleton Laboratory, Oxfordshire,
England.
GAY, D. M. 1985. Electronic mail distribution of linear programming test problems. Math.
Program. Soc. COAL Newsl.
HARWELL 1993. Harwell subroutine library catalogue (release 11). Theoretical Studies
Department, AEA Technology, Harwell, England.
LAWSON, C. L., HANSON, R. J., KINCAID, D. R., AND KROGH, F. T. 1979. Basic linear algebra
subprograms for Fortran use. ACM Trans. Math. Softw. 5, 308–325.
LIU, J. W. H. 1990. The role of elimination trees in sparse factorization. SIAM J. Matrix
Anal. Appl. 11, 134–172.
MARKOWITZ, H. M. 1957. The elimination form of the inverse and its application to linear
programming. Manage. Sci. 3, 255–269.