Multifrontal Solution of Indefinite Sparse Symmetric Linear Systems
I. S. DUFF and J. K. REID
AERE Harwell
We extend the frontal method for solving linear systems of equations by permitting more than one
front to occur at the same time. This enables us to develop code for general symmetric systems. We
discuss the organization and implementation of a multifrontal code which uses the minimum-degree
ordering and indicate how we can solve indefinite systems in a stable manner. We illustrate the
performance of our code both on the IBM 3033 and on the CRAY-1.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra--linear
systems (direct methods), sparse and very large systems; G.4 [Mathematics of Computing]:
Mathematical Software--algorithm analysis, efficiency, reliability and robustness
General Terms: Algorithms, Experimentation, Performance
Additional Key Words and Phrases: Sparse matrices, indefinite symmetric matrices, linear equations,
frontal methods, vector processing, minimum-degree algorithm, generalized elements
1. INTRODUCTION
We consider the direct solution of sparse symmetric sets of n simultaneous linear
equations

Ax = b.    (1.1)

If A is positive definite, then any choice of pivots from the diagonal is numerically
stable, so they can be chosen on sparsity grounds alone. Choosing pivots from the
diagonal preserves symmetry, and conventional practice is to choose the diagonal
pivots by symbolic processing on the sparsity pattern. The numerical factorization
of A and the solution of Eq. (1.1) can then be performed within a known sparsity
pattern that accommodates the fill-ins. Prominent examples of codes that implement
this approach are the Yale Sparse Matrix Package (YSMP) [9], [10] and
SPARSPAK [18]. An excellent description of a variety of pivotal strategies and
storage schemes is provided in the book of George and Liu [17].

This approach may fail if A is indefinite; for instance, the very first pivot may
be zero. Duff et al. [7] overcame this difficulty by following Bunch and Parlett
[1] in using a mixture of 1 × 1 and 2 × 2 pivots chosen during the numerical
factorization of A, and demonstrated that code based on this approach is
competitive with the Harwell code MA17A [22] both in speed and storage. However,
an additional subroutine, MA17E, has since been added to the MA17 package to
choose pivots symbolically and is significantly faster. Even this is not competitive
with the symbolic phases of the YSMP and SPARSPAK codes, which store the
fill-ins implicitly rather than explicitly.
The aim of this work is to use the ideas of frontal elimination [20, 19] to permit
stable numerical factorization to be performed based on a symbolically chosen
pivotal sequence. An outline of our approach is given in Section 2; we describe
the symbolic phase of our algorithm in Sections 3-5, the numerical factorization
in Section 6, and the numerical solution in Section 7. We refer to these phases as
ANALYZE, FACTOR, and SOLVE, and they are represented in our code by
three separate entries. Finally, in Section 8 we give the results of runs on both a
scalar machine (IBM 3033) and a vector machine (CRAY-1, actually a pipeline
machine able to process vectors efficiently). A side advantage of our approach is
that inner loops of the numerical phases are vectorizable, which gives a speed
advantage, particularly on large problems.
When performing the numerical experiments for this paper, we used sets of
test matrices covering a wide range of sparsity patterns and applications. Among
the test matrices used (the order of the matrices in the tables together with an
acknowledgment of the source follows each in parentheses) were a bandlike
matrix from eigenvalue calculations in atomic physics (147, Johansson, Lund), a
matrix from electrical circuit analysis (1176, Erisman, Boeing), normal equations
matrices from surveying (85, 292, Ashkenazi, Nottingham), a Jacobian from a stiff
system of ordinary differential equations in laser research (130, Curtis, Harwell),
a parameterized set of problems from the triangulation of an L-shaped region
(406, 1009, 1561, George and Liu [15]), matrices from the analysis of ship
structures (503, 1005, Everstine [14]), a power network matrix (274, Lewis,
Boeing), and the matrix from the 9-point discretization of the Laplacian on a 30
x 30 grid (900). We used many more matrices in our experiments, but the above
sample is representative of these and is sufficient to illustrate our points.
2. OVERALL STRATEGY
It is convenient to discuss our overall strategy in terms of finite-element problems,
although it is applicable to general systems. Finite-element problems occur
naturally with matrices of the form
A = Σ_l B^(l),    (2.1)
where each B^(l) is the contribution from a single finite element and is zero except
in a small number of rows and columns. Each B^(l) may conveniently be stored as
a small full matrix together with a vector of indices to label where each row and
column of this packed matrix fits into the overall matrix A. In the frontal method,
advantage is taken of the fact that elimination steps
a_ij := a_ij - a_ik a_kk^(-1) a_kj    (2.2)

do not have to wait for all the assembly steps

a_ij := a_ij + b_ij^(l)    (2.3)
from (2.1) to be complete. The operation (2.2) can be performed as soon as the
pivot row and column are fully summed, that is, as soon as all operations (2.3) have
been completed for them, for if any of a_ik, a_kj, and a_kk is not fully summed, then
the wrong quantity will be subtracted in (2.2).

[Fig. 1. A simple assembly tree: eight leaf elements B^(1), ..., B^(8) merged pairwise up to a single root.]
Irons' frontal method [20] treats the symmetric positive-definite case. An
assembly order is chosen a priori, and each variable is eliminated as soon as it is
fully summed, that is, after its last occurrence in a B^(l) (l = 1, 2, ...). Hood [19]
extended this idea to unsymmetric matrices with a symmetric sparsity pattern,
and Duff [4], [5] has provided an implementation that treats unsymmetric
patterns. Pivots are chosen on numerical grounds from the submatrix of fully-
summed rows and columns. Some pivots are now off the main diagonal, and all
the fully-summed variables are not necessarily eliminated at once. In both cases
all operations are performed in a full matrix (called the frontal matrix) that
corresponds to the submatrix of rows and columns belonging to variables that
have appeared in one or more of the B^(l) so far assembled, but have not yet been
eliminated. After each elimination the pivot row is stored ready for use in back
substitution. In the unsymmetric case, if numerically satisfactory pivots cannot
be found for all the fully-summed variables, then the frontal matrix passed on to
the next step is a little larger. In practice the extra storage and work requirements
are slight.
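To make the frontal mechanism concrete, the following sketch (in Python rather than the Fortran of the codes discussed here, with illustrative names only) assembles a set of element contributions into a single frontal matrix, eliminates the variables that are fully summed, and returns the stored pivot rows together with the generated element. The fully summed block is assumed nonsingular, so no numerical pivoting appears in this sketch.

import numpy as np

def frontal_step(elements, fully_summed):
    """Assemble element contributions, each given as (index_list, dense_matrix),
    into one frontal matrix, eliminate the fully summed variables, and return the
    stored pivot block together with the generated element.  The fully summed block
    is assumed nonsingular here, so no numerical pivoting is shown."""
    front_vars = sorted({v for idx, _ in elements for v in idx})
    pos = {v: i for i, v in enumerate(front_vars)}
    F = np.zeros((len(front_vars), len(front_vars)))
    for idx, B in elements:                          # assembly steps (2.3)
        for a, va in enumerate(idx):
            for b, vb in enumerate(idx):
                F[pos[va], pos[vb]] += B[a, b]
    elim = [v for v in front_vars if v in fully_summed]
    keep = [v for v in front_vars if v not in fully_summed]
    e = [pos[v] for v in elim]
    r = [pos[v] for v in keep]
    Fee, Fer, Fre, Frr = F[np.ix_(e, e)], F[np.ix_(e, r)], F[np.ix_(r, e)], F[np.ix_(r, r)]
    pivot_rows = np.linalg.solve(Fee, Fer)           # kept for back-substitution
    generated = Frr - Fre @ pivot_rows               # elimination steps (2.2)
    return (elim, Fee, pivot_rows), (keep, generated)

The generated element returned here plays exactly the role of a B^(l): it carries its own index list and is assembled later, just like an original element.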
The frontal method can be generalized to correspond to any bracketing of the
summation (2.1). It is convenient to represent such a bracketing by an assembly
tree. A simple example is shown in Figure 1 and corresponds to the summation
[[B^(1) + B^(2)] + [B^(3) + B^(4)]] + [[B^(5) + B^(6)] + [B^(7) + B^(8)]].
The summation is now performed from inner brackets outward, and again we
may eliminate any variable immediately it is fully summed. In general, we now
have several frontal matrices, corresponding to inner brackets that we have
processed and have had to set aside temporarily. Notice that these frontal
matrices are just like the original matrices B^(l) and they are often called
"generalized element matrices," though here we call them "generated element
matrices" because they are generated by pivotal steps. The advantage of this
approach is that the extra generality permits choices which are more economical
in arithmetic operations. A prominent example of such a grouping is nested
dissection (also known as nested substructuring). Here the overall problem is
divided into parts and the parts are further subdivided, etc.; each node of the tree
or bracket in the bracketed sum corresponds to a substructure.
Algorithm 2.1
Flag all nonzeros a_ij as unused;
for each row i do
  while row i contains a flagged nonzero do
  begin add an artificial element containing just variable i;
    unflag a_ii;
    for each flagged nonzero a_ij in row i do
      if a_kj ≠ 0 for all variables k of the current element
      then add variable j to the element and unflag both a_kj and a_jk for all variables k now
           in the element
  end;
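A rough transcription of Algorithm 2.1 follows (a Python sketch with an assumed input format, not part of any of the codes discussed here): the symmetric pattern is supplied as a mapping from each variable to the set of variables it is coupled to, including itself.

def artificial_elements(pattern):
    """A sketch of Algorithm 2.1.  The symmetric sparsity pattern is assumed to be
    given as a dictionary mapping each variable i to the set of variables j with
    a_ij nonzero (with i itself included).  Returns a list of index lists, one per
    artificial element; together they cover every nonzero of the matrix."""
    unused = {(i, j) for i in pattern for j in pattern[i]}    # flag all nonzeros
    elements = []
    for i in pattern:
        while any((i, j) in unused for j in pattern[i]):
            element = [i]                                     # element containing just i
            unused.discard((i, i))
            for j in pattern[i]:
                if (i, j) in unused and all(j in pattern[k] for k in element):
                    element.append(j)
                    for k in element:                         # unflag a_kj and a_jk
                        unused.discard((k, j))
                        unused.discard((j, k))
            elements.append(element)
    return elements

Each pass of the while loop unflags at least one entry of row i, so the construction terminates with every nonzero of the matrix covered by some artificial element.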
3. MINIMUM-DEGREE ORDERING
In the last section we sketched how the pivotal strategy of minimum degree may
be implemented for a general matrix A with the help of an assembly tree. Here
we explore the algorithmic details needed for its efficient execution. We have
found it best to assume that the diagonal is entirely nonzero and to start with
lists of column indices for all the off-diagonal nonzeros in each row. Each column
index may be regarded as pointing to a finite element of order 2 (the other
variable corresponds to the diagonal).
For speed it is important to avoid an expensive search for the pivot at each
stage of the elimination. We have followed the procedure used in Harwell code
MA28 (see [3] and [6]) of holding doubly-linked chains of variables having the
same degree. Any variable that is active in a pivotal step can then be removed
from its chain without any searching, and once its new degree has been computed
it can be placed at the head of the corresponding chain. This makes selecting
each pivot particularly easy. It is always at the front of the current minimum-
degree chain.
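The chains can be pictured as follows (an illustrative Python stand-in for the doubly-linked index lists actually used in the Fortran; class and method names are hypothetical): each degree owns a chain, a variable is unlinked in constant time when its degree changes, and the next pivot is taken from the head of the lowest nonempty chain.

class DegreeChains:
    """Buckets of variables keyed by current degree, supporting O(1) removal and
    reinsertion at the head of a chain (a sketch, not MA27's data structure)."""
    def __init__(self):
        self.head = {}          # degree -> first variable in that chain
        self.next = {}          # variable -> next variable in its chain (or None)
        self.prev = {}          # variable -> previous variable in its chain (or None)
        self.degree = {}        # variable -> its current degree

    def insert(self, v, deg):
        self.degree[v] = deg
        first = self.head.get(deg)
        self.next[v], self.prev[v] = first, None
        if first is not None:
            self.prev[first] = v
        self.head[deg] = v      # new entries go to the front of the chain

    def remove(self, v):
        deg, nxt, prv = self.degree[v], self.next[v], self.prev[v]
        if prv is None:
            self.head[deg] = nxt
        else:
            self.next[prv] = nxt
        if nxt is not None:
            self.prev[nxt] = prv

    def pop_minimum(self):
        # a production code would track the current minimum rather than scanning
        deg = min(d for d, h in self.head.items() if h is not None)
        v = self.head[deg]
        self.remove(v)
        return v, deg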
Once the pivot has been chosen we run through what remains of the original
matrix row and all the index lists of matrices B^(l) that involve the pivot. By
merging these lists, omitting the pivot itself, we obtain the index list of the new
generated element. These merges are best done with the aid of a full vector of
flags so that we can avoid including any variable twice. The degrees of all these
variables (but no others) may change, so we remove them from their chains,
recalculate their degrees, and insert them in the chains appropriate for their new
degrees. This degree computation is the most expensive part of the whole
calculation and we have found it well worthwhile to recognize identical rows (for
example, compare cases (a) and (c) in Table II). The degree itself provides a
convenient hashing function, particularly since all those of the same degree are
chained together. Variables corresponding to identical rows can be amalgamated
into a "supervariable," a similar concept to the indistinguishable nodes of George
and Liu [17] or the prototype vertices of YSMP. It is convenient to name the
supervariable after one of its components and treat the rest as dummies. The
rows of the dummies can be omitted from the data structure, and by suitably
flagging them we can allow them to be ignored (and removed) whenever they are
later encountered in an index list.
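One way to realize the identical-row test is sketched below (an illustration only, not the MA27 implementation, which works within the degree chains themselves): two variables are indistinguishable when their adjacency sets, each taken together with the variable itself, coincide, and the degree serves as a cheap first filter.

def amalgamate_supervariables(candidates, adjacency):
    """Group variables whose rows are identical.  'adjacency' maps each variable to
    the set of variables it is currently coupled to; v and w are indistinguishable
    when Adj(v) | {v} == Adj(w) | {w}.  Returns a dictionary master -> dummies."""
    buckets = {}
    for v in candidates:
        signature = frozenset(adjacency[v]) | {v}
        buckets.setdefault((len(signature), signature), []).append(v)   # degree acts as the hash
    merged = {}
    for group in buckets.values():
        if len(group) > 1:
            merged[group[0]] = group[1:]        # dummies are flagged and later ignored
    return merged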
To calculate the degree of a variable we need to run through the index lists of
its original row and all the generated elements it is in. Again a full array of flags
is needed to avoid counting any variable twice. Of course, all the variables whose
degrees we are calculating are in the new generated element, so we begin with the
flags set for the variables of this element and the integer in which the degree is
accumulated set to the number of variables in the new element. The flags permit
us to check that each of the other elements has at least one variable distinct from
those of the new element. Any that do not can be added into the new element
without extra fill-in. They may be treated just like the elements involving the
pivot, that is, removed from the structure after updating the tree. We found that
including such a check adds little overhead and occasionally leads to very
worthwhile savings (for example, compare cases (a) and (b) in Table II). The
most dramatic change is on the test example of order 130. This may be permuted
to the form

    ( D   E )
    ( Eᵀ  F )

where D is diagonal and F is nearly full but of quite low order. The rows of E
have only a few different sparsity patterns, so there are only a few different
patterns in the elements generated by pivoting in D. The device we have just
described permits duplicate elements to be dropped quickly from the data
structure, and it is this feature which leads to much better execution time than
that of YSMP or SPARSPAK for this problem (see Table X, case of order 130).

Table II. The Effect on ANALYZE Times of Algorithmic Changes in Minimum-Degree Ordering

Order:                                147    1176   292    85     130    406    503
Number of nonzeros in upper
triangular part of A:                 1298   9864   1250   304    713    1561   3265

ANALYZE times (IBM 3033 seconds)
a. Final code                         0.068  0.53   0.13   0.034  0.14   0.20   0.36
b. Without element amalgamations      0.071  0.60   0.14   0.034  0.42   0.21   0.38
c. Without supervariable
   amalgamations                      0.189  0.89   0.21   0.039  0.14   0.45   1.12
d. Without elements at front
   of lists                           0.073  0.56   0.14   0.036  0.18   0.20   0.38
These degree calculations and the construction of the index list for the pivot
row both need pointers from the variables to the elements they are in. A very
convenient way to provide them is to use (as in YSMP) the same numerical name
for the generated element as for the variable that was eliminated when it was
created. Where this variable name occurs in a row index list, it is now interpreted
as an element name, which is exactly what is needed.

Further savings on the degree calculations (for example, compare cases (a) and
(d) in Table II) are possible by permuting the element numbers ahead of the
variable numbers in each row list. This ensures that the element lists are searched
first. Once we reach the list of variables we may remove any that have already
been included in the degree calculation, since they represent off-diagonal nonzeros
of the original matrix that have been overlaid by nonzeros of the generated
elements.
To illustrate the advantages of the devices we have just discussed we compared
runs of our final code with versions each with one of the devices removed. The
results varied from making little difference to changing the computing time by a
factor of 3 or more. Results illustrating all the cases we ran are shown in Table II.
The times are not consistent with those of Section 8 because we ran with the
Harwell statement profiler OE02 in use.
An early version of our code was so dominated by the degree calculation that
we decided to terminate it abruptly at a threshold. Once the minimum degree
reached this threshold, we increased it and recalculated the degree of all the
outstanding variables, terminating the calculations at the new threshold. However,
the incorporation of supervariable amalgamation, element absorption, and
[Fig. 3. The sparsity pattern of the 9 × 9 symmetric example matrix (× denotes a nonzero).]
[Fig. 4. The assembly tree produced for the matrix of Figure 3, with the number of variables in each supervariable shown as a subscript to the node number.]
deletion of indices that are no longer needed so improved the calculation that
using this threshold was not justifiable.
A single tree (forest in the reducible case) with n nodes suffices to record all
the supervariable amalgamations and element absorptions. As each pivot is
chosen we add son-father pointers for all the old elements absorbed into the new
generated element. When variables are included in supervariables, we record son-
to-father pointers for all the dummies pointing to the master variable. We need
also to accumulate the number of variables in each supervariable in order to
calculate degrees correctly, and the value zero provides a convenient flag to
indicate dummies. The assembly tree (or forest in the reducible case) may be
obtained from all the son-father pointers not flagged as dummies. For example,
the matrix of Figure 3 might give rise to the tree of Figure 4, with numbers of
variables in supervariables given as subscripts to the node numbers. This tree
might be constructed by application of the minimum-degree ordering thus:
(i) Eliminate variable 1 of degree 4. The fills in rows 3 and 4 make them
identical. Generated element 1 has supervariables (4, 3), (5).
(ii) Eliminate variable 2 of degree 4. Element 2 has supervariables (4, 3), (5) and
can absorb element 1.
(iii) Eliminate supervariable (4, 3) of degree 3, after assembling element 2.
Element 4 has supervariable (5) only.
(iv) Eliminate variable 7 of degree 4. The fills in rows 6 and 9 make rows 6, 8, 9
identical. We suppose 9 was found identical to 8 and then (8, 9) to 6.
Generated element 7 has supervariable (6, 8, 9) only.
(v) Eliminate supervariable (6, 8, 9) of degree 4 after assembling element 7.
Generated element 6 contains variable 5 only and absorbs element 4.
(vi) Eliminate variable 5 after assembling element 6.
4. MANAGEMENT OF THE TREE SEARCH
The depth-first tree search cannot be performed until the tree is complete and
local ordering strategies, such as minimum degree, construct the tree from the
terminal nodes toward the root. Since each node has at most one father, it is
convenient to record the tree in the form of son-father pointers only. Of course,
we must have a means of finding all the sons of each node if we are to perform a
depth-first search efficiently. However efficiently it is done, it is sensible to do it
just once, rather than repeatedly for each set of new numerical values. We
therefore discuss in this section how the search may be managed and how the
result may be recorded for convenient subsequent numerical processing.
The depth-first search determines a tentative pivotal sequence, which may be
modified for numerical stability. This sequence divides naturally into blocks, each
corresponding to a node of the assembly tree. During numerical factorization we
hold the generated elements temporarily on a stack and each (block) pivotal step
involves merging finite elements with one or two variables corresponding to
nonzeros on and off the diagonal in the pivotal rows of the original matrix with
a certain number (sometimes none) of generated elements from the top of the
stack. It seems natural therefore to hold the pivotal sequence and, for each block
pivotal step, the number of variables eliminated and the number of stack elements
assembled. For instance, depth-first search of the tree of Figure 4 yields the
following steps:
(1) Eliminate variable 1 after assembling elements from row 1; stack resulting
generated element 1.
(2) Eliminate variable 2 after assembling elements from row 2 and stacked
element 1; stack the resulting element 2.
(3) Eliminate variables 4, 3 after assembling elements from rows 4, 3 and stacked
element 2; stack the resulting element 4.
(4) Eliminate variable 7 after assembling elements from row 7; stack the resulting
element 7.
(5) Eliminate variables 6, 8, 9 after assembling elements from rows 6, 8, 9 and
stacked elements 4 and 7; stack the resulting element 6.
(6) Eliminate variable 5 after assembling remaining element in row 5 with stacked
element 6.
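The tree search just illustrated can be sketched as follows (illustrative Python; the code itself works with index arrays and also applies the node amalgamations discussed below): the son-father pointers are inverted into lists of sons, an explicit stack drives a depth-first traversal, and each node is emitted after all of its sons, which is exactly the condition for its sons' generated elements to be on top of the element stack.

def depth_first_blocks(father, variables_at):
    """father maps every node of the assembly tree to its father (None for a root);
    variables_at maps each node to the variables eliminated there.  Returns one
    (variables, number_of_sons) pair per block pivotal step, in an order in which
    every son is processed, and its generated element stacked, before its father."""
    children = {n: [] for n in father}
    roots = []
    for n, f in father.items():
        (roots if f is None else children[f]).append(n)
    steps = []
    for root in roots:
        stack = [(root, False)]
        while stack:
            node, sons_done = stack.pop()
            if sons_done:
                # the generated elements of this node's sons now sit on top of the stack
                steps.append((variables_at[node], len(children[node])))
            else:
                stack.append((node, True))
                stack.extend((son, False) for son in children[node])
    return steps

With the father pointers implied by Figure 4 (1→2, 2→4, 4→6, 7→6, 6→5), this reproduces block steps equivalent to those listed above, although siblings may be visited in a different order.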
Table III. Times and Space Forecasts for Two Modes of Running the Space-Evaluation Routine

Order:                      147    1176   199    292    130    406    503
Nonzeros:                   1298   9864   536    1250   713    1561   3265

Time (IBM 3033 seconds)
Exact                       0.005  0.054  0.007  0.010  0.006  0.012  0.019
Estimate                    0.002  0.025  0.006  0.007  0.004  0.008  0.009

Forecast of real space required (×1000)
Exact                       2.6    14     1.3    2.8    1.0    6.8    13
Estimate                    3.2    20     1.5    3.0    1.6    7.0    14

Forecast of integer space required (×1000)
Exact                       1.5    11     1.3    2.2    1.0    3.4    5.2
Estimate                    2.3    15     1.4    2.6    1.6    3.8    6.8
Here we have a full matrix of degree 4 at node 6. The recording of degrees also
permits storage and operation counts to be calculated easily.
Greater efficiency during numerical processing is possible if the pivotal block
sizes are large. It may even be preferable to tolerate more operations and fill-in
for the sake of bigger blocks. We have therefore built into our tree-search code
the option of combining adjacent steps when the first stacks a generated element
that is immediately used by the second, provided the numbers of eliminations
performed in both steps lie below a given threshold. The effect of using this
threshold is discussed in Section 7 (see Table IX).
An important feature of our approach, which is also shared by SPARSPAK,
is that we can predict the storage and arithmetic requirements of the
subsequent numerical factorization of a definite matrix. This is particularly
important since such a prediction may well determine whether it is computation-
ally feasible or attractive to continue with the direct solution of the system. It is
a relatively simple matter to postprocess the output from the tree search to
obtain this information, although the original input must be preserved if exact
values are required. Of course, the forecast may be optimistic for systems where
numerical pivoting is required, but we show in Section 6 (Table VII) that this
inaccuracy is slight. Since we are keen to allow the ANALYZE phase to require
only a minimal amount of storage, we have designed our space-evaluation routine
so that, when the user has allowed the input data to be overwritten, an upper
bound for the space necessary for the factorization of definite systems will be
calculated. In this case, the space evaluation will be quicker since a scan through
the original nonzeros is not performed. We present the times and space estimates
provided by these two modes of operation in Table III. Since the time spent in
the space evaluation routine is usually less than 10 percent of the ANALYZE
time, and since the estimates can be over 50 percent too high, we choose to
calculate the exact quantities whenever possible.
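As a simplified illustration of how such forecasts can be derived (this is not the MA27 routine, and the operation count is only a rough approximation), the block steps produced by the tree search already determine how many factor entries each step stores and roughly how many multiplications it performs when no eliminations are delayed.

def forecast(block_steps):
    """block_steps: list of (p, f) pairs giving, for each block pivotal step in order,
    the number p of variables eliminated and the order f of the frontal matrix at that
    step.  Returns (factor_entries, multiplies); both assume that no elimination is
    delayed for stability, so they are attainable for definite systems, and the
    multiplication count is only an estimate."""
    entries = 0
    multiplies = 0
    for p, f in block_steps:
        for i in range(p):
            r = f - i                        # length of the i-th pivot row within the front
            entries += r                     # this pivot row is stored for the SOLVE phase
            multiplies += r * (r + 1) // 2   # symmetric update of what remains of the front
    return entries, multiplies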
[Fig. 5. The assembly tree obtained for the matrix of Figure 3 when the pivotal sequence of Section 3 is supplied as a known order (one node per variable).]
[Fig. 6. The condensed counterpart of the tree of Figure 5.]
the pattern of row i of A with the patterns of those earlier rows of U that have
their first off-diagonal nonzero in column i. Expressed in the terms we have been
using in this paper, this is because with variables reordered according to a given
pivotal sequence, we know that each generated element will be required when
the first of its variables is pivotal.
This makes the construction of the assembly tree very straightforward. We
may now start with the pattern of the rows of the upper triangular part of A,
rather than of both upper and lower triangular parts. The generated elements
may be chained together according to their leading variable, in preparation for
use when that variable is pivotal. Since each row of A is searched only once, and
the list of each generated element is searched only once, it is not worthwhile to
look for identical rows or elements that can be absorbed into other elements.
This means that we obtain an assembly tree with n nodes, almost certainly much
greater than is really necessary. However, the tree search techniques described in
Section 4 yield node amalgamations where this is appropriate so there is no loss
of efficiency. For instance, in Figure 5 we show the tree for the matrix of Figure
3 with the pivotal sequence generated in Section 3, and its condensed counterpart
in Figure 6. Note that the pivotal sequence eventually used is not necessarily the
same as that given, though the numerical operations performed are the same as
they would have been (it is just that some independent operations are performed
in a different order). Of course, the extra simplicity of the code for the case with
a known pivotal sequence leads to better execution times, as illustrated in Table
IV, though such is the success of variable amalgamation and element absorption
Table IV. ANALYZE Times (IBM 3033 seconds) for Minimum-Degree Ordering and for Using the
Same Sequence as a Known Pivotal Order

Order:        147    1176   292    130    406    1009   1561   1005   900
Nonzeros:     1298   9864   1250   713    1561   3937   6121   4813   4322

Pivotal sequence
Unknown       0.039  0.30   0.077  0.073  0.115  0.28   0.43   0.40   0.30
Known         0.034  0.23   0.042  0.019  0.063  0.17   0.29   0.20   0.17
Table V. A Comparison of Storage Requirements (in Thousands of Words) for the Two Different
Frontal Matrix Organizations When No Garbage Collections Are Allowed

Order:                          147    1176   292    406    1009   1561
Nonzeros:                       1298   9864   1250   1561   3937   6121

Real storage required
Fixed frontal matrix            4.1    23.5   4.0    8.4    24.0   42.3
Dynamic frontal matrix          3.8    20.4   4.0    8.3    23.8   42.1

Integer storage required
Fixed frontal matrix            2.1    15.9   3.6    5.5    13.5   21.5
Dynamic frontal matrix          2.1    13.5   3.1    4.5    11.4   18.0

Integer storage for factors
Fixed frontal matrix            0.8    5.9    2.3    3.9    9.4    15.1
Dynamic frontal matrix          0.8    3.6    1.8    2.9    7.3    11.7

Real storage for factors
(same in both cases)            2.4    10.4   2.6    6.0    18.3   33.0
We consider the assembly of stack elements and original rows before discussing
the numerical pivoting. Since the output from the tree search of Section 4 can be
used to calculate the maximum size of the frontal matrix, we could allocate a
fixed amount of storage for it and maintain mapping vectors identifying variables
with their position in the front to facilitate assembly. We originally designed our
factorization code using this scheme (which we hereafter call the fixed-front
approach), but were unhappy with committing a fixed amount of storage to the
front which may for most of the time be much smaller than its maximum size.
With this fixed-front approach the number of permutations within the frontal
matrix required to enable elimination to be performed efficiently was very high.
Because of these misgivings, we tried an approach using a dynamic allocation
for the frontal matrix, generating it afresh at every stage. We only need storage
for the current frontal matrix and can ensure that pivot candidates are in the first
rows and columns, thereby reducing the number of permutations required. That
is, the sort is done during the assembly rather than after it. However, in order to
maintain an efficient assembly, we found that we would stack a generated element
even if it were going to be assembled at the very next stage, which our experience
has shown to be a common occurrence.
It is possible to construct small examples to favor either approach, and since it
was not clear to us which would be superior, both methods were coded and runs
were performed on our sets of test examples. We found the times for both
factorization and solution to be very similar (within 5 percent), with the dynamic
front approach having a slight edge. We are also concerned about the total
storage required by the two approaches both during factorization and for the
integer arrays passed to the solution routines. The dynamic-front approach will
never require less space for the stack, but does not have the overhead of requiring
more space for the frontal matrix than is actually needed. We compare these
storage requirements in Table V, where it is clear that the dynamic-front approach
is on the whole better, particularly with respect to real storage during factorization
and integer information passed to the solution routines. We have therefore chosen
Fig. 7. Storage scheme for numerical factorization:
| Factors | Free space | Stack | Remainder of sorted input |
where the arrows indicate the directions in which the boundaries move.
to use a dynamically allocated front approach and now discuss the implementa-
tion of this further.
Since we follow our normal procedure (for example, used also in Harwell code
MA28, [3]) of allowing the user to input the nonzeros in any order, we must first
sort the matrix to avoid gross inefficiencies in the factorization. We have chosen
to sort by rows in the pivot order provided by the tree search since this enables
a simple reuse of the storage occupied by the sorted matrix (see Figure 7). Our
principal choice lay between performing an in-place sort or a slightly simpler
direct sort. The in-place sort requires only n extra real storage locations above
that for the input matrix, whereas the direct sort requires twice the storage of the
input matrix. The real storage needed in the direct sort can be greater than that
required by the factorization itself. For example, in the case of order 1176 only
13,773 words of real storage are required for the factorization, whereas the direct
sort needs 19,728 words. Since the in-place sort is less than 10 percent slower
than the direct sort, we have chosen to implement this strategy in our code. As
in the case of the ANALYZE entry, we give the user the option of allowing the
integer information for the input matrix to be overwritten. This saving of integer
storage would not be possible if we had implemented a direct sort.
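For comparison, the simpler direct sort can be sketched as a two-pass counting sort (an illustration only; the in-place variant actually adopted permutes entries within the input arrays instead of copying them, at the cost of a more intricate permutation).

def sort_by_pivot_rows(rows, cols, vals, pivot_order):
    """Direct (two-pass counting) sort of matrix entries into contiguous row lists
    following pivot_order.  It builds new arrays and therefore needs a second copy
    of the reals, which is the storage penalty discussed above."""
    rank = {v: r for r, v in enumerate(pivot_order)}     # row index -> pivotal position
    count = [0] * (len(pivot_order) + 1)
    for r in rows:
        count[rank[r] + 1] += 1
    for i in range(1, len(count)):                       # prefix sums give row starts
        count[i] += count[i - 1]
    start = count[:-1]
    pos = list(start)
    out_cols = [None] * len(vals)
    out_vals = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        p = pos[rank[r]]
        out_cols[p], out_vals[p] = c, v
        pos[rank[r]] += 1
    return start, out_cols, out_vals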
After sorting the user's input data, we perform the factorization using the
information passed from the earlier routines. The storage organization is illus-
trated in Figure 7 where the arrows indicate the directions in which the boundaries
move. We have chosen this scheme because we wish to use a single array for the
numerical processing and we also want to keep garbage collections infrequent
and simple. We know in advance the space required for the numerical factorization
when no pivoting is necessary, and additionally we know how much less space is
required if we allow the stack to overwrite the already processed input when the
free space is exhausted. This is the only form of garbage collection which we have
and it is particularly simple. Table VI shows the real and integer storage needed
to avoid garbage collections, the minimum storage needed and the number of
collections when minimum storage is used. The timings are not shown because
the differences were less than the uncertainty of the IBM timer, thanks to the
few collections needed and their simplicity.
Our other refinement on the scheme of Figure 7 is that when a stack element
is being assembled we allow the new generated element to overlay the top element
of the stack. This permits a slight reduction in the storage required during
factorization.
We now discuss our strategy for numerical pivoting within the frontal matrices.
These matrices are symmetric and we only store the upper triangular part. Since
there are many cases when it is known a priori that the decomposition is
numerically stable for any sequence of diagonal pivots (for example, when the
|| [ a_kk       a_k,k+1
     a_k+1,k    a_k+1,k+1 ]^(-1) ||_∞ · max_{j≠k,k+1} [ max( |a_kj|, |a_k+1,j| ) ] ≤ u^(-1),    (6.2)
with u now in the range [0, ½), in the solution of sparse systems (the limit ½ is
needed to be sure that a pivot is available). This analysis also extends to our
multifrontal scheme, although there are m a n y different ways of implementing
this pivoting strategy.
One possibility is to scan the entire fully-summed block searching for 1 × 1
pivots before looking for any block pivots of order 2. This could be done very
efficiently, but we have rejected this approach because any failure to use a 1 × 1
pivot requires a second pass to find 2 × 2 pivots, which will r e p e a t m u c h of the
work of the first. Therefore, when we scan a potential pivot row to test its
Table VII. Time (IBM 3033 seconds) and Storage Requirements with and without Pivoting for
Stability

Order:        147    1176   292    406    1009   1561   274
Nonzeros:     1298   9864   1250   1561   3937   6121   943
diagonal entry for stability, we also find a potential 2 × 2 pivot, defined by the
largest off-diagonal in the fully summed part of the row. If the diagonal element
fails the 1 × 1 stability test, we immediately test this 2 × 2 pivot, first against the
largest entry in the current row and then, if it passes that test, against the largest
entry in the other row of the block pivot. If the test (6.2) is not satisfied by this
2 × 2 block, we continue by searching the next fully-summed row. We also tried
the strategy of testing the other diagonal entry of the 2 × 2 block for suitability
as a 1 × 1 pivot, both before and after testing the 2 × 2 pivot. This gave a bias
toward 1 × 1 pivots, but was not any more efficient, particularly because it
increased greatly the number of interchanges within the frontal matrices. We also
tried the strategy of only passing once through the frontal matrix in an attempt
to find pivots, and found a slight increase in storage requirements with no
compensating decrease in factorization time. We therefore, at each stage, search
the fully-summed part of the front exhaustively for pivots.
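The tests themselves can be sketched as follows (an illustration, not the MA27 code; the 1 × 1 criterion shown is the usual threshold test and is assumed here, while the 2 × 2 criterion follows the reconstruction of (6.2) above, with names chosen for readability).

import numpy as np

def acceptable_1x1(F, k, u):
    """Threshold test for a 1 x 1 pivot on the diagonal entry F[k, k] of the frontal
    matrix.  The criterion |a_kk| >= u * max_j |a_kj| is the standard one and is an
    assumption of this sketch rather than a quotation of the code."""
    if F.shape[0] == 1:
        return F[k, k] != 0.0
    row_max = np.max(np.abs(np.delete(F[k, :], k)))
    return abs(F[k, k]) >= u * row_max

def acceptable_2x2(F, k, m, u):
    """Test in the spirit of (6.2) for a 2 x 2 candidate on rows/columns k and m:
    the infinity norm of the inverse of the 2 x 2 block, times the largest entry
    elsewhere in those two rows, must not exceed 1/u (u in (0, 1/2))."""
    a, b, c, d = F[k, k], F[k, m], F[m, k], F[m, m]
    det = a * d - b * c
    if det == 0.0:
        return False
    adjugate = np.array([[d, -b], [-c, a]])
    inv_norm = np.max(np.abs(adjugate).sum(axis=1)) / abs(det)
    others = [j for j in range(F.shape[0]) if j not in (k, m)]
    off_max = max((max(abs(F[k, j]), abs(F[m, j])) for j in others), default=0.0)
    return inv_norm * off_max <= 1.0 / u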
The results in Table VII indicate that there is only slightly more storage
required when numerical stability considerations delay pivoting and increase the
front size. Thus the estimates from the analysis and tree search are a good guide
to that actually required. The real storage is likely to increase by no more than
about 2 percent, while, because of the possibility of several delayed eliminations
occurring together, the integer storage might actually decrease. The only case we
have seen where the estimate was further adrift was when the diagonal entries
were very small (Ericsson and Ruhe, private communication).
7. NUMERICAL SOLUTION
In this section, we consider the efficient solution of sets of linear equations using
the matrix factors produced by the numerical factorization of Section 6. Since we
have chosen to hold original row and column indices with the block pivot row, we
can perform all operations on the input right-hand-side vector to produce the
solution vector without requiring any auxiliary vectors. We call this single vector
the right-hand-side vector, although it will be altered during the computation
and finally holds the solution. We can either work with the right-hand-side vector
using indirect addressing in the innermost loop or can load the appropriate
components into a full vector of length the front size, perform inner-loop opera-
tions using direct addressing, and then unload back into the right-hand-side
vector. The first option is similar to that used in many general sparse matrix
codes, may double the inner-loop time over that required for the basic numerical
operations, and will not vectorize. However, the second option will carry a high
overhead if only a few eliminations are done at each stage. We have found
empirically on the IBM 3033 and on the CRAY-1, by actual runs on versions of
our code as well as times on simple loops, that for a given number of pivots,
direct addressing is better when the order of the frontal matrix is greater than a
threshold value. We hold these values in an array and include code for both forms
of addressing, switching between the two depending on the number of pivots at
each stage. In Table VIII we illustrate the advantage of such a hybrid approach
by comparing it with the use of solely direct or indirect addressing on the IBM
3033.
On other machines the switchover point between indirect and direct addressing
will be different, particularly on a machine which vectorizes the direct-addressing
inner loop. The only change we need make is to alter the values in our switch
array to match the characteristics of the machine.
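The two addressing modes can be illustrated by the following sketch (Python with NumPy standing in for the Fortran inner loops; both routines apply one dense lower-triangular block of the factors to the right-hand-side vector during forward substitution, and the names are illustrative).

import numpy as np

def apply_block_indirect(L_block, front_indices, rhs):
    """Forward substitution with one dense lower-triangular block of the factors,
    using indirect addressing: every access to the right-hand side rhs goes through
    the original variable indices held with the block."""
    n = len(front_indices)
    for col in range(n):
        bj = rhs[front_indices[col]]
        for row in range(col + 1, n):
            rhs[front_indices[row]] -= L_block[row, col] * bj

def apply_block_direct(L_block, front_indices, rhs):
    """The same operation with direct addressing: gather the relevant components of
    rhs into a contiguous work vector of length the front size, update it with
    vectorizable loops, and scatter it back."""
    work = rhs[front_indices]                       # gather
    for col in range(len(front_indices) - 1):
        work[col + 1:] -= L_block[col + 1:, col] * work[col]
    rhs[front_indices] = work                       # scatter

The gather and scatter of the second form are the overhead referred to above; its inner operations run on contiguous data and therefore vectorize.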
In YSMP (Eisenstat et al. [9]), integer storage for the factors is greatly reduced
by observing that column indices of successive rows of the factors have consid-
erable overlap and if a second pointer array (of length the order of the system) is
employed, much of this information can be reused. We illustrate this on the
example in Figure 8. However, this is exactly the type of saving which we get
when our block pivot is of order greater than 1. Additionally, during our discussion
of the tree search, we remarked that it is easy to increase the order of the block
pivots at the cost of a little more arithmetic and storage. In Table IX we show
the storage requirements for the solution routine for various levels of node
amalgamation. We see that the integer storage for the factors is reduced dramat-
ically, although there is a more than compensatory increase in real storage.
Notice, however, that the total storage for the factors remains about constant,
although the storage required during factorization rises slowly. We do not show
times in Table IX, since there is little variation over the range considered. They
[Fig. 8. Illustration of the YSMP compressed storage scheme: a small matrix together with the uncompressed and compressed column indices of its upper triangle.]
decrease slightly (about 2-5 percent) from the times in Table X until a node
amalgamation level of about 10, and increase gradually thereafter. In view of
these results we make node amalgamation available to the user of our code only
by special request. We also show in Table IX the storage required by YSMP
using the same pivotal sequence and the one it chooses. The storage shown for
YSMP includes 2n for the pointers and row lengths (included also in the figures
for our code). We have, however, not included storage for the permutation vector
required by YSMP but not needed by our code because we hold original indices.
A complete count of the storage required by us and YSMP is given in Table XI.
We see that our block pivot scheme is more economic in storage than the
compressed scheme, particularly if nodes are amalgamated. We were also sur-
prised to find that giving our pivot sequence to YSMP saved it some storage,
presumably because the order produced by tree searching leads to more repeated
nonzero patterns in adjacent rows of the factorized matrix.
Table X. Times (IBM 3033 seconds) for Three Phases of Codes on Definite Systems

Order:              147    1176   292    130    406    1009   1561   1005   900
Nonzeros:           1298   9864   1250   713    1561   3937   6121   4813   4322

ANALYZE
MA27                0.038  0.295  0.075  0.072  0.112  0.275  0.428  0.382  0.300
YSMP                0.053  0.449  0.082  0.286  0.107  0.280  0.439  0.512  0.279
SPARSPAK            0.083  0.647  0.140  0.386  0.193  0.492  0.812  0.827  0.568
SPARSPAK + input    0.151  1.088  0.203  0.441  0.260  0.599  0.956  0.955  0.683

FACTOR
MA27                0.061  0.341  0.070  0.024  0.170  0.541  1.10   0.897  0.575
YSMP                0.056  0.378  0.046  0.015  0.138  0.548  1.04   1.059  0.578
SPARSPAK            0.056  0.353  0.051  0.012  0.144  0.632  1.27   0.781  0.576
SPARSPAK + input    0.096  0.720  0.087  0.032  0.191  0.748  1.45   0.945  0.701

SOLVE
MA27                0.009  0.043  0.012  0.004  0.024  0.067  0.118  0.078  0.066
YSMP                0.007  0.037  0.009  0.003  0.020  0.063  0.109  0.078  0.063
SPARSPAK            0.009  0.049  0.012  0.004  0.024  0.074  0.128  0.078  0.069
8. GENERAL REMARKS
In the design of our code, called MA27 in the Harwell Subroutine Library, we
had three main goals, namely,
(i) to develop a general-purpose sparse solver for s y m m e t r i c positive-definite
systems competitive with other routines available;
(ii) to solve indefinite symmetric systems stably with little overhead above the
definite case;
(iii) to develop a general-purpose sparse code which vectorizes well.
In this section, we examine how well these goals have been satisfied.
It is difficult to compare the performance of different codes, partly because
each divides the solution to the problem in a different way. We have chosen to
measure separately the three phases of nonnumeric preprocessing (ANALYZE),
numerical input and factorization (FACTOR), and solution using the factors
(SOLVE) since these correspond to the operations required once for a given
structure, once for given matrix values, and once for each right-hand side,
respectively. We present the times in double precision on the IBM 3033 for these
three phases in Table X and the storage required in Table XI.

The MA27 ANALYZE times in Table X reflect the total time for the nonnumeric
phase, including matrix reordering. However, because input to SPARSPAK
requires a subroutine call for each row of the matrix, we give both the time
including input and for the ordering alone. Indeed, the form of input to each
package is different. SPARSPAK permits input an entry at a time, a row at a
time, or by finite elements (full submatrices of reals together with identifying
indices). MA27 requires arrays giving the row and column indices of each entry.
These entries can be in any order and duplicates are allowed. For entry to YSMP
the matrix must be ordered by rows with column indices for each entry and we
do not include times needed for this ordering. The effect of these different input
Table XI. Storage Required (in Thousands of Words), Including All Overheads, on the IBM and the CDC

Order:                147    1176   292    130    406    1009   1561   1005   900
Nonzeros:             1298   9864   1250   713    1561   3937   6121   4813   4322

ANALYZE
IBM   MA27ᵃ           2.0    16     2.7    1.4    3.6    9      14     10     9
      YSMP            7.1    54     7.4    4.1    9.4    24     37     28     25
      SPARSPAK        2.9    22     3.2    1.7    4.5    11     18     13     11
CDC   MA27ᵃ           3.8    29     4.8    2.5    6.4    16     25     18     16
      YSMP            7.1    54     7.4    4.1    9.4    24     37     28     25
      SPARSPAK        5.6    43     5.9    3.2    7.8    20     31     22     20

FACTOR
IBM   MA27ᵃ           7.0    40     7.6    3.0    16.1   46     82     60     47
      YSMP            10.6   63     12.9   5.2    22.6   65     107    77     64
      SPARSPAK        6.3    32     8.5    2.8    16.1   50     85     52     46
CDC   MA27ᵃ           5.8    38     6.5    3.0    11.5   32     55     40     32
      YSMP            6.8    41     8.7    3.6    14.6   41     67     47     40
      SPARSPAK        4.2    21     6.4    2.2    11.2   33     55     35     31

SOLVE
IBM   MA27            5.5    26     6.9    2.2    14.6   43     76     50     42
      YSMP            6.5    32     8.8    2.9    17.5   52     88     62     50
      SPARSPAK        6.3    32     8.5    2.8    16.1   50     85     52     46
CDC   MA27            3.4    16     4.9    1.7    9.5    27     47     32     26
      YSMP            3.9    19     5.6    2.0    10.7   31     52     36     30
      SPARSPAK        4.2    21     6.4    2.2    11.2   33     55     35     31

ᵃ Assumes overwriting of input data.
forms is significant on the FACTOR entry so, in order to avoid penalizing the
more flexible interfaces, we give in Table X FACTOR times for the numerical
factorization only. To illustrate the overhead we give the times for SPARSPAK,
including the input/sort.
In Table XI we have included all the storage required, including overhead and
permutations and have given values in words for both the IBM, where reals
occupy two words and some integers (for example, row and column indices)
occupy only half a word, and the CDC, where reals and integers both occupy one
word. Both versions are available in MA27 and SPARSPAK, but YSMP does not
offer a half-word integer version, although comments are included to allow rapid
conversion to double precision. Additionally, the data structure used by parts of
the YSMP code involves pointers which require full integers (unless the number
of nonzeros in the factors is severely limited), so full-word integer storage has
been assumed throughout. We have, however, not included storage for the matrix
reals in the YSMP ANALYZE because minor changes to some statements in a
sort routine yield a version that does not need the real arrays. SPARSPAK does
not recover storage between numerical factorization and solution and so the
storage for these is the same.
In Table X we see that our ordering times are generally superior to those of
YSMP and SPARSPAK, largely because of the refinements to our code which
we discussed in Section 3. This is particularly noticeable in the laser problem of
order 130. The overheads of loading into full matrices or vectors before performing
Table XII. A Comparison of MA27 with MA28 and an Older 2 × 2 Code of Munksgaard

Order:              147    1176   292    199    130    406    900
Nonzeros:           1298   9864   1250   536    713    2716   4322

Time (IBM 3033 seconds)
ANALYZE/FACTOR
MA27                0.12   0.7    0.17   0.09   0.11   0.3    0.9
MA28                0.60   4.6    0.58   0.20   0.16   3.4    7.4
Munksgaard          0.81   6.8    0.61   0.31   0.16   1.9    13.6

FACTOR
MA27                0.08   0.5    0.09   0.045  0.040  0.2    0.6
MA28                0.17   1.3    0.14   0.052  0.048  0.5    1.0
Munksgaard          0.46   3.4    0.32   --     0.068  1.0    5.5

SOLVE
MA27                0.008  0.043  0.012  0.006  0.005  0.022  0.067
MA28                0.011  0.053  0.014  0.007  0.005  0.029  0.055
Munksgaard          0.010  0.055  0.013  --     0.005  0.026  0.079
the arithmetic operations are reflected in the slower FACTOR times on the
smaller problems, although these times are more competitive on larger problems.
The SOLVE times are comparable with SPARSPAK but are slightly worse than
YSMP because our use of block pivots makes the next to innermost loops more
complicated.
In Table XI we see that our code is extremely competitive with respect to the
amount of storage required. It is comparable in its requirements for the FACTOR
phase and is consistently the best for both IBM and CDC storage modes in the
ANALYZE and SOLVE phases.
We should point out that the storage figures are a little flattering to MA27,
which has a particular provision for overwriting the input matrix. If this must be
preserved, then its storage must be added to the MA27 figures and to the YSMP
figures for ANALYZE.
The only other code which we are aware of that solves sparse symmetric
indefinite systems efficiently and stably is that of Munksgaard. His work is based
on Duff et al. [7], has a similar storage scheme to Harwell Subroutine MA17E,
and performs a numerical factorization at the same time as the analysis. Of
course, a code for unsymmetric systems, for example, MA28 [3], could also be
used to solve indefinite systems, although the symmetry of the original problem
would be destroyed. In Table XII we compare our new code with that of
Munksgaard and with MA28.
Table XIII. Times (CRAY-1 seconds) for MA27 and YSMP on the CRAY-1

Order:              147     1176    292    406    1561   1005   900
Nonzeros:           1298    9864    1250   1561   6121   4813   4322

Factorization times
MA27                0.019   0.096   0.024  0.050  0.255  0.193  0.144
YSMP                0.026   0.174   0.023  0.060  0.420  0.410  0.237

Solution times
MA27                0.0029  0.016   0.005  0.008  0.035  0.024  0.021
YSMP                0.0033  0.016   0.004  0.008  0.044  0.031  0.025
ACKNOWLEDGMENT
We would like to thank the referee for his careful reading of the paper and
constructive comments.
REFERENCES
1. BUNCH, J.R., AND PARLETT, B.N. Direct methods for solving symmetric indefinite systems of linear equations. SIAM J. Numer. Anal. 8 (1971), 639-655.
2. BUNCH, J.R., AND ROSE, D.J. (Eds.) Sparse Matrix Computations. Academic Press, New York, 1976.
3. DUFF, I.S. MA28--A set of Fortran subroutines for sparse unsymmetric linear equations. Harwell Rep. AERE R.8730, HMSO, London, 1977.
4. DUFF, I.S. Design features of a code for solving sparse unsymmetric linear systems out-of-core. SIAM J. Sci. Stat. Comput. 4 (1983).
5. DUFF, I.S. MA32--A package for solving sparse unsymmetric systems using the frontal method. Harwell Rep. AERE R.10079, HMSO, London, 1981.
6. DUFF, I.S., AND REID, J.K. Some design features of a sparse matrix code. ACM Trans. Math. Softw. 5, 1 (Mar. 1979), 18-35.
7. DUFF, I.S., REID, J.K., MUNKSGAARD, N., AND NIELSEN, H.B. Direct solution of sets of linear equations whose matrix is sparse, symmetric and indefinite. J. Inst. Maths. Appl. 23 (1979), 235-250.
8. DUFF, I.S., AND STEWART, G.W. (Eds.) Sparse Matrix Proceedings 1978. SIAM Press, Philadelphia, Pa., 1979.
9. EISENSTAT, S.C., GURSKY, M.C., SCHULTZ, M.H., AND SHERMAN, A.H. The Yale sparse matrix package, I. The symmetric codes, and, II. The non-symmetric codes. Reps. 112 and 114, Dept. Computer Science, Yale Univ., New Haven, Conn., 1977.
10. EISENSTAT, S.C., GURSKY, M.C., SCHULTZ, M.H., AND SHERMAN, A.H. Yale sparse matrix package I. The symmetric codes. Int. J. Numer. Meth. Eng. 18 (1982), 1145-1151.
11. EISENSTAT, S.C., SCHULTZ, M.H., AND SHERMAN, A.H. Applications of an element model for Gaussian elimination. In Sparse Matrix Computations, J.R. Bunch and D.J. Rose (Eds.), Academic Press, New York, 1976, pp. 85-96.
12. EISENSTAT, S.C., SCHULTZ, M.H., AND SHERMAN, A.H. Software for sparse Gaussian elimination with limited core storage. In Sparse Matrix Proceedings 1978, I.S. Duff and G.W. Stewart (Eds.), SIAM Press, Philadelphia, Pa., 1979.
13. EISENSTAT, S.C., SCHULTZ, M.H., AND SHERMAN, A.H. Algorithms and data structures for sparse symmetric Gaussian elimination. SIAM J. Sci. Stat. Comput. 2 (1981), 225-237.
14. EVERSTINE, G.C. A comparison of three resequencing algorithms for the reduction of matrix profile and wavefront. Int. J. Numer. Meth. Eng. 14 (1979), 837-853.
15. GEORGE, J.A., AND LIU, J.W.H. An automatic nested dissection algorithm for irregular finite element problems. SIAM J. Numer. Anal. 15 (1978), 1053-1069.
16. GEORGE, J.A., AND LIU, J.W.H. A minimal storage implementation of the minimum degree algorithm. SIAM J. Numer. Anal. 17 (1980), 282-299.
17. GEORGE, A., AND LIU, J.W. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, N.J., 1981.
18. GEORGE, A., LIU, J.W., AND NG, E. User guide for SPARSPAK: Waterloo sparse linear equations package. Res. Rep. CS-78-30 (Rev. Jan. 1980), Dept. Computer Science, Univ. of Waterloo, Waterloo, Ont., Canada, 1980.
19. HOOD, P. Frontal solution program for unsymmetric matrices. Int. J. Numer. Meth. Eng. 10 (1976), 379-400.
20. IRONS, B.M. A frontal solution program for finite element analysis. Int. J. Numer. Meth. Eng. 2 (1970), 5-32.
21. PETERS, F.J. Sparse matrices and substructures. Mathematical Centre Tracts 119, Mathematisch Centrum, Amsterdam, The Netherlands, 1980.
22. REID, J.K. Two Fortran subroutines for direct solution of linear equations whose matrix is sparse, symmetric and positive definite. Harwell Rep. AERE R.7119, HMSO, London, 1972.
23. SHERMAN, A.H. On the efficient solution of sparse systems of linear and nonlinear equations. Res. Rep. 46, Dept. Computer Science, Yale Univ., New Haven, Conn., 1975.
24. SPEELPENNING, B. The generalized element method. Private communication, 1973; also issued as Rep. UIUCDCS-R-78-946, Dept. Computer Science, Univ. of Illinois at Urbana-Champaign, 1978.