BHH 93
Susanne M. Balle
Per Christian Hansen
UNI•C
Danish Computing Center for Research and Education
Building 305, Technical University of Denmark
DK-2800 Lyngby, Denmark
[email protected]
[email protected]
Nicholas J. Higham
Department of Mathematics
University of Manchester
Manchester M13 9PL
United Kingdom
[email protected]
In addition to the support from the EEC ESPRIT Basic Research Action
Programme, Project 6634 (APPARC), S. M. Balle and P. C. Hansen were
supported by NATO Collaborative Research Grant 5-2-05/RG900098, and
N. J. Higham was supported by Science and Engineering Research Council
grant GR/H52139.
Contents
1 Introduction
  1.1 The Need for Alternative Algorithms
  1.2 Performance Versus Implementation Time
  1.3 Why Consider Strassen's Algorithm?
  1.4 Some Related Issues
  1.5 Organization of the Report
5 Performance Results
  5.1 Performance of Strassen's Algorithm
  5.2 Performance of Block LU Factorization
is done by a partitioned block algorithm (in the terminology of [14]) with
a block-cyclic data layout; see [19] for more details. The algorithm on the
MP–1 is a point algorithm with a cyclic layout.
Although such specialized software often provides very efficient implementations, it is difficult to maintain, and difficult to port to new hardware, new software releases, and related architectures. Hence,
although solution of linear systems of equations is one of the classical topics
in numerical analysis, there is still a need to develop new algorithms that
are well-suited to today’s high-performance computers.
[Figure: diagram.eps]
2. Use a simple and reliable technique for detecting when the algorithm
fails.
its complexity is $O(n^{\log_2 7})$ operations when the recursion is continued down to the 1 × 1 level, where $n$ is the order of $A$. In practice one does not
normally recur down to the 1 × 1 level, since a lower operation count and
lower computational overheads are obtained by terminating the recursion
early. An important feature of Strassen’s inversion algorithm is that it is
based on operations on block matrices, and these may provide enough data
parallelism to make the algorithm suited for massively parallel computers.
We remark that iterative improvement of $\tilde{x} = \tilde{A}^{-1}b$, once $\tilde{A}^{-1}$ has been computed, is only an $O(n^2)$ process.
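To make the cost claim concrete, here is a minimal sketch of such a refinement step in Python/NumPy; the function name and the fixed number of steps are our own illustration, not part of the report's algorithm.

```python
import numpy as np

def refine(A, Ainv, b, steps=2):
    """Iteratively improve x = Ainv @ b using the approximate inverse.
    Each pass costs two matrix-vector products, i.e., O(n^2) work."""
    x = Ainv @ b
    for _ in range(steps):
        r = b - A @ x      # residual of the current solution
        x = x + Ainv @ r   # correct with the approximate inverse
    return x
```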
Our use of Strassen’s matrix inversion algorithm is motivated by the
success of Strassen’s matrix multiplication method [21], which has turned
out to be useful on high-performance computers [2, 4, 10]. It is found that
although the rounding error properties of this algorithm are less favorable
than for conventional matrix multiplication, the algorithm is still useful as
a Level 3 BLAS routine in many circumstances [13, 17].
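For reference, a sketch of Strassen's multiplication in Python/NumPy follows. It assumes the order of the matrices is the cutoff $n_0$ times a power of two; $n_0$ is a tuning parameter, and a production code would pad or peel odd dimensions.

```python
import numpy as np

def strassen_mm(A, B, n0=64):
    """Strassen's matrix multiplication with 7 recursive products."""
    n = A.shape[0]
    if n <= n0:
        return A @ B  # conventional product at the bottom of the recursion
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    P1 = strassen_mm(A11 + A22, B11 + B22, n0)
    P2 = strassen_mm(A21 + A22, B11, n0)
    P3 = strassen_mm(A11, B12 - B22, n0)
    P4 = strassen_mm(A22, B21 - B11, n0)
    P5 = strassen_mm(A11 + A12, B22, n0)
    P6 = strassen_mm(A21 - A11, B11 + B12, n0)
    P7 = strassen_mm(A12 - A22, B21 + B22, n0)
    return np.block([[P1 + P4 - P5 + P7, P3 + P5],
                     [P2 + P4, P1 - P2 + P3 + P6]])
```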
As we shall see in §3.1, Strassen’s matrix inversion algorithm is funda-
mentally unstable. The main reason is that it involves the inversion of sub-
matrices on each recursion level. The key idea in the algorithm we develop
is that if one of these submatrices is ill-conditioned then the computation is
stabilized by perturbing this submatrix to make it better conditioned. As
long as the perturbation is small enough, the computed approximate inverse $\tilde{A}^{-1}$ can still be used for refining $\tilde{x} = \tilde{A}^{-1}b$ as outlined above. It is this
stabilization technique, in which the computed results are re-used instead
of switching to a different algorithm, that puts this algorithm in the second
class of algorithms falling under the three-step paradigm outlined above.
A crucial point in the above algorithm is the decision of when to add stabilization to a submatrix, and how much stabilization to add. There is obviously a tradeoff between adding too much stabilization, so that the approximate inverse $\tilde{A}^{-1}$ is too far from the exact $A^{-1}$, and adding too little, so that the computed $\tilde{A}^{-1}$ is too contaminated by the effects of rounding errors (which are proportional to the condition number of the particular ill-conditioned submatrix or submatrices). One important goal of our analysis here is to guide this decision.
In this paper, we focus on the computation of the approximate inverse $\tilde{A}^{-1}$, since this is the crucial step in solving linear systems $Ax = b$ via Strassen's method. We make the important assumption that the matrix $A$ is well conditioned; otherwise, other methods (based on, e.g., regularization) should be used.
1.4 Some Related Issues
We want to mention at this point that our interest in this algorithm is also
motivated by one of the so-called “superfast” algorithms for inversion of
a Toeplitz matrix, described by Bitmead & Anderson [9]. This algorithm
is mathematically equivalent to Strassen’s inversion algorithm; however, the
operations are performed by means of FFTs, using the low displacement rank
structure of the involved submatrices. In this way, an $O(n \log^2 n)$ algorithm
is obtained. Since the Bitmead-Anderson algorithm is based on Strassen’s
inversion algorithm, our stabilization technique can easily be applied to the
Bitmead-Anderson algorithm without changing its complexity. We shall not
pursue this aspect further here, but leave it to a separate investigation.
Two quite different methods for stabilizing Strassen’s matrix inversion
algorithm are proposed by Bailey & Ferguson in [3]. Their first method is
based on the observation that if one submatrix is ill-conditioned, then one
can instead invert another submatrix which cannot be ill-conditioned as long
as A is well conditioned. However, it is difficult to quantify when this form
of “pivoting” should be used, and there is no way to improve the accuracy of
the method for submatrices with medium-size condition numbers. On the
other hand, our stabilization technique can be applied to any submatrix,
even (in principle) well-conditioned ones.
Bailey & Ferguson also suggest the use of Newton iteration to improve
the accuracy of the computed inverses of the sub-blocks at each level of
recursion and, optionally, of the overall computed inverse. At this point, it
is not clear which approach is more efficient: to refine $\tilde{A}^{-1}$ before computing the solution, or to compute $\tilde{x} = \tilde{A}^{-1}b$ and refine this vector.
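The Newton iteration referred to here is $X_{k+1} = X_k(2I - AX_k)$, which roughly squares the residual $I - AX_k$ at every step. A brief sketch, assuming the starting guess is accurate enough for convergence:

```python
import numpy as np

def newton_refine_inverse(A, X, steps=1):
    """Refine an approximate inverse X of A via X <- X(2I - AX);
    the residual I - AX is squared at each step."""
    n = A.shape[0]
    for _ in range(steps):
        X = X @ (2.0 * np.eye(n) - A @ X)
    return X
```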
for stabilization of Strassen’s algorithm, and in Section 5 we present some
results obtained on a CM–200. Finally, in Section 6 we mention some open
problems and we point to future research.
2 Strassen's Matrix Inversion Algorithm and Data Layout
In this section we summarize Strassen’s asymptotically fast algorithm for
matrix inversion, and we discuss some issues related to data layout.
The evaluation of $C$ via (2.2) and (2.3) can be expressed as follows, using intermediate matrices $R_1, \ldots, R_6$:

Algorithm MatInv:
1. $R_1 = A_{11}^{-1}$
2. $R_2 = A_{21} R_1$
3. $R_3 = R_1 A_{12}$
4. $R_4 = A_{21} R_3$
5. $S = A_{22} - R_4$
6. $R_5 = S^{-1}$
7. $C_{22} = R_5$
8. $C_{12} = -R_3 R_5$
9. $C_{21} = -R_5 R_2$
10. $R_6 = C_{12} R_2$
11. $C_{11} = R_1 - R_6$
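As an illustration, the following Python/NumPy sketch carries out Algorithm MatInv recursively. It assumes the order of $A$ is the cutoff $n_0$ times a power of two, and it deliberately omits the stabilization developed later in this report.

```python
import numpy as np

def matinv(A, n0=2):
    """Strassen's recursive inversion (Algorithm MatInv), unstabilized."""
    n = A.shape[0]
    if n <= n0:
        return np.linalg.inv(A)  # bottom of the recursion
    m = n // 2
    A11, A12 = A[:m, :m], A[:m, m:]
    A21, A22 = A[m:, :m], A[m:, m:]
    R1 = matinv(A11, n0)   # step 1
    R2 = A21 @ R1          # step 2
    R3 = R1 @ A12          # step 3
    R4 = A21 @ R3          # step 4
    S = A22 - R4           # step 5: Schur complement
    R5 = matinv(S, n0)     # step 6
    C22 = R5               # step 7
    C12 = -R3 @ R5         # step 8
    C21 = -R5 @ R2         # step 9
    R6 = C12 @ R2          # step 10
    C11 = R1 - R6          # step 11
    return np.block([[C11, C12], [C21, C22]])
```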
[Figure: the block layout. A 4 × 4 matrix distributed over a 2 × 2 node grid; Node (i,j) holds the contiguous block $A_{ij}$, so Node (1,1) holds elements 11, 12, 21, 22, and so on.]
[Figure: the cyclic layout. The same 4 × 4 matrix distributed so that Node (i,j) holds the elements $a_{kl}$ with $k \equiv i$ and $l \equiv j \pmod{2}$; Node (1,1) holds elements 11, 13, 31, 33, and so on.]
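The two layouts differ only in which node owns a given element. A small sketch of the index maps, assuming 0-based indices and a $P \times P$ node grid (the figures above use 1-based labels):

```python
def block_owner(i, j, n, P):
    """Block layout: node (i // b, j // b) owns a_ij, block size b = n/P."""
    b = n // P
    return (i // b, j // b)

def cyclic_owner(i, j, P):
    """Cyclic layout: node (i % P, j % P) owns a_ij."""
    return (i % P, j % P)
```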
3 Forward Error Analysis
In this section we present a forward error analysis of Algorithm MatInv
from the previous section. Because of the recursive nature of the algorithm,
forward error analysis seems to be the only way to analyze this algorithm, yet, as we shall see below, some open problems remain.
Strassen's formulae are obtained by multiplying out the product. This gives
the first insight into why Strassen’s method is unstable: the block LU fac-
torization on which it is based is not backward stable in general [14], so we
would expect Strassen’s method also to be unstable, even if only one level
of recursion and conventional multiplication is used.
Next, suppose $A$ is upper triangular. Then Strassen's formulae reduce to
$$A^{-1} = \begin{pmatrix} A_{11}^{-1} & -A_{11}^{-1} A_{12} A_{22}^{-1} \\ 0 & A_{22}^{-1} \end{pmatrix}.$$
This is essentially Method 2B in [16], which is found not to give a small residual. The cure used in [16] is to convert one of the multiplications into a multiple right-hand side triangular solve, so that we form, say, $A_{11}^{-1} A_{12} / A_{22}$ (where $/$ denotes back substitution). However, this changes Strassen's method, and there does not seem to be an analogue of this modification for the full matrix case.
Finally, we note that Strassen’s inversion method is not guaranteed to be
stable even for symmetric positive definite matrices, as numerical examples
in [18] show.
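The instability is easy to provoke numerically with the matinv sketch from Section 2. The construction below is our own illustrative test, not one of the experiments in [18]: the leading block is given condition number about $10^7$ while the full matrix typically remains well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 32
A = rng.standard_normal((n, n))

# give A11 singular values from 1 down to 1e-7
U, _, Vt = np.linalg.svd(A[:m, :m])
A[:m, :m] = U @ np.diag(np.logspace(0, -7, m)) @ Vt

C = matinv(A, n0=m)  # one level of recursion
print("Strassen residual:", np.linalg.norm(np.eye(n) - C @ A))
print("LU-based residual:", np.linalg.norm(np.eye(n) - np.linalg.inv(A) @ A))
```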
3.2 Theoretical Error Bounds
We use the standard notation in which $fl(\cdot)$ denotes the computed version of the particular quantity. Moreover, we let $E_i$, $F_i$ and $G_i$ denote the error matrices corresponding to matrix inversion, matrix multiplication, and matrix addition, respectively, in the various steps of the algorithm. Thus, for Algorithm MatInv we obtain the following results.
$$fl(A_{11}^{-1}) = A_{11}^{-1} + E_1,$$
where $\Delta_{S^{-1}} = E_2 - S^{-1} \Delta_S S^{-1}$, and
$$fl(C_{12}) = -\bigl((A_{11}^{-1} + E_1)A_{12} + F_1\bigr)\bigl(S^{-1} + \Delta_{S^{-1}}\bigr) + F_3 = C_{12} + \Delta_{C_{12}},$$
where $\Delta_{C_{11}} = (I - C_{12}A_{21} - \Delta_{C_{12}}A_{21} - F_4)E_1 - (F_4 + \Delta_{C_{12}}A_{21})A_{11}^{-1} - F_5 + G_2 \approx (I - C_{12}A_{21})E_1 - (F_4 + \Delta_{C_{12}}A_{21})A_{11}^{-1} - F_5 + G_2$.
Next, we wish to derive upper bounds for the error matrices. To simplify
the analysis we only consider one level of recursion, assuming that the lower-
level inverses are computed via LU factorization. This suffices to illustrate
the difficulties associated with this algorithm.
Let $\|\cdot\|$ denote an appropriate matrix norm, for example the Frobenius norm or the 2-norm (the exact choice of norm is not important). For matrix addition, we use the error model
$$\|X + Y - fl(X + Y)\| \le \epsilon\, \|X + Y\|,$$
The models hold for conventional multiplication with $\epsilon$ approximately $n$ times the machine precision (for $n \times n$ matrices), whereas for Strassen's multiplication method $\epsilon$ has to be approximately $(n/n_0)^{\log_2 12}\, n_0^2$ times the machine precision, where $n_0$ is the order of the matrix at which the recursion is terminated [17]. Finally, for matrix inversion we use the following error model, which is certainly reasonable for LU factorization-based methods [16]:
$$\|X^{-1} - fl(X^{-1})\| \le \epsilon\, \kappa(X)\, \|X^{-1}\|,$$
Inserting these bounds into the expressions for the error matrices $\Delta_S$, $\Delta_{S^{-1}}$, $\Delta_{C_{12}}$ and $\Delta_{C_{11}}$, and neglecting second-order terms, we obtain the following bounds:
• The matrix $A$ is well conditioned.
• The norms of the submatrices of $A$ satisfy $\|A_{11}\| \approx \|A_{12}\| \approx \|A_{21}\| \approx \|A\|$. (The norm $\|A_{22}\|$ is not important.)
• The norms of the submatrices of $C = A^{-1}$ satisfy $\|C_{22}\| = \|S^{-1}\| \approx \|C_{12}\| \approx \|C\|$.
• The condition number $\kappa(A_{11})$ is much greater than one.
Recall that $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$. Since $A_{11}$ is ill conditioned, its inverse must have large elements, and thus it is likely that the matrix $A_{21}A_{11}^{-1}A_{12}$ completely dominates $S$ (in theory, cancellation could take place and make the elements of $A_{21}A_{11}^{-1}A_{12}$ of the same order as $A_{22}$, but this is highly unlikely and we ignore this possibility). Hence, we will assume that $\|S\| \approx \|A_{21}A_{11}^{-1}A_{12}\| \le \|A_{21}\|\, \|A_{11}^{-1}\|\, \|A_{12}\|$. Then, since $S^{-1}$ is the (2,2) block of $A^{-1}$,
$$\kappa(S) = \|S\|\, \|S^{-1}\| \lesssim \|A_{21}\|\, \|A_{11}^{-1}\|\, \|A_{12}\|\, \|A^{-1}\| \approx \|A_{11}\|\, \|A_{11}^{-1}\|\, \|A\|\, \|A^{-1}\| = \kappa(A_{11})\, \kappa(A).$$
By means of these and the above bounds, and by neglecting small factors, we then obtain the following approximate bounds for the matrix norms:
$\|\Delta_S\| \lesssim \epsilon\, \kappa(A_{11})^2\, \|A\|$ (3.5)
$\|\Delta_{S^{-1}}\| \lesssim \epsilon\, \kappa(A_{11})^2\, \kappa(A)\, \|C\|$ (3.6)
$\|\Delta_{C_{12}}\| \lesssim \epsilon\, \kappa(A_{11})^3\, \kappa(A)\, \|C\|$ (3.7)
$\|\Delta_{C_{11}}\| \lesssim \|\Delta_{C_{12}}\|\, \kappa(A_{11}) \lesssim \epsilon\, \kappa(A_{11})^4\, \kappa(A)\, \|C\|$ (3.8)
[Figure: bounds.ps]
[Figure: residual.ps]

Figure 3.2: The size of the submatrices in the residual matrix $R = I - \tilde{A}^{-1}A$. The top, middle and bottom figures show the norms $\|R_{11}\|$, $\|R_{12}\|$ and $\|R_{22}\|$, respectively, for the same 10 experiments as before. Within each figure, the results are computed for $\kappa(A_{11}) = 10^3$ (bottom), $\kappa(A_{11}) = 10^5$ (middle), and $\kappa(A_{11}) = 10^7$ (top).
from which we conclude that both $R_{11}$ and $R_{12}$ are dominated by $\Delta_{C_{11}}$, while $R_{22}$ must be of the same size as $\Delta_{C_{12}}$ and $\Delta_{C_{22}}$.
4 Stabilization of the Algorithm
In this section we describe our approach to stabilization of the recursive
inversion algorithm. As we have seen in the previous section, instability is
associated with inversion of ill-conditioned submatrices A11 and S at any
level of the algorithm. In our algorithm, an ill-conditioned submatrix is very
easy to detect, since the inverse is explicitly computed, and we can measure
the ill-conditioning simply by computing an appropriate norm of the inverse
matrix (for example, the element of largest absolute value, which can be
found in logarithmic time on a parallel computer).
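A sketch of this detect-and-perturb step follows; the threshold tau is an illustrative value of our own, not one prescribed in the report, and the choice of the perturbation size $\delta$ is the subject of the rest of this section. In Algorithm MatInv, such a routine would presumably replace the plain inversions in steps 1 and 6.

```python
import numpy as np

def stabilized_inv(X, delta, tau=1e8):
    """Invert X, perturbing it to X + delta*I first if the explicitly
    computed inverse reveals ill conditioning (illustrative threshold tau)."""
    Xinv = np.linalg.inv(X)
    # the largest absolute element of the inverse is a cheap norm that a
    # parallel machine can reduce in logarithmic time
    if np.linalg.norm(X, np.inf) * np.max(np.abs(Xinv)) > tau:
        Xinv = np.linalg.inv(X + delta * np.eye(X.shape[0]))
    return Xinv
```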
[Figure: Gig1.ps]

Figure 4.1: The relative error $\|\tilde{A}^{-1} - A^{-1}\|/\|A^{-1}\|$ in the computed inverse as a function of $\delta$, for 10 test matrices. The submatrix $A_{11}$ is singular, and it is perturbed by $\delta I$. The condition number of all test matrices is of the order 10.
Next, assume that the perturbation is so large that the perturbation error from adding $\delta I$ dominates. Obviously, a perturbation $\delta I$ of $A_{11}$ corresponds to a perturbation of $A$ of the form
$$\begin{pmatrix} \delta I & 0 \\ 0 & 0 \end{pmatrix}.$$
Then it follows from the perturbation bound
$$\|X^{-1} - (X + \Delta_X)^{-1}\| \lesssim \|\Delta_X\|\, \|X^{-1}\|^2$$
that for large $\delta$ we have
$\|C - fl(C)\| \lesssim \delta\, \|C\|^2$. (4.2)
Figure 4.1 shows the relative error in the computed inverse as a function of $\delta$, for 10 test matrices. The submatrix $A_{11}$ is singular, and it is perturbed by $\delta I$. Only one level of recursion was used. There is clearly a minimum at about $10^{-5}$, which is approximately equal to the cube root of the machine precision.
In a practical application, the norm $\|A\|$ (which should really be interpreted as the norm of the appropriate submatrix at any given level) is easy to estimate. But estimating the condition number of $A$ is a nontrivial process when no factorization of $A$ is available. The best we can do in connection with this algorithm is probably either to assume some average condition number, say, 1000, or to require the user to provide a rough condition number estimate (taking the cube root means that a very rough guess is enough).
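One possible rule along these lines is sketched below. The exact scaling is our own guess: it takes the cube root of the machine precision times the rough condition estimate, scaled by $\|A\|$, so that a crude value of kappa_est changes $\delta$ only mildly.

```python
import numpy as np

def choose_delta(normA, kappa_est=1000.0):
    """Hypothetical choice of the stabilization parameter delta."""
    eps = np.finfo(float).eps  # double-precision machine epsilon
    return (eps * kappa_est) ** (1.0 / 3.0) * normA
```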
5 Performance Results
In this section we present a few numerical results produced on the 8K Connection Machine CM–200 located at UNI•C. The programs were all coded in CM Fortran, Version CMF 1.2.

[Figure: Perf.ps]
For comparison, Figure 5.1 also shows the execution time of the point
algorithm from LAPACK (that was used at the bottom level of Strassen’s
algorithm) when applied to the full matrix.
We make the following remarks about the results:
• Strassen's method is able to significantly speed up the underlying point LU factorization algorithm.
• The performance of Strassen’s method depends greatly on the number
of recursion levels.
• The optimal number of recursions depends on the matrix order n.
The dependence of the optimal recursion level on the matrix size is not
surprising, as the overhead increases for each recursion level. The optimal
number of recursion levels is very difficult to predict a priori, since per-
formance of the algorithm depends on several factors, such as the number
of matrix-matrix multiplications and additions, and the nearest-neighbor
communication.
Regarding the first issue, we note that Strassen’s matrix-matrix multi-
plication algorithm seems to perform better on the MasPar MP–1, where
multiplication is slower than addition by a factor of three. On the Connec-
tion Machines CM–200 and CM–5, however, additions and multiplications
are performed at the same speed.
Regarding nearest-neighbor communication, the analysis in [10] shows that a successful implementation of Strassen's algorithm relies on fast nearest-neighbor communication relative to the floating-point operations. If we define $\gamma_X$ as the time of a nearest-neighbor communication divided by the time of an average flop on computer $X$, then $\gamma_{\mathrm{MP-1}} = 0.2$ while $\gamma_{\mathrm{CM-200}} = 0.4$
and $\gamma_{\mathrm{CM-5}} \approx \gamma_{\mathrm{CM-200}}$. Again, this shows that Strassen's algorithm may not be optimal on the Connection Machines.
It is interesting to note that the amount of communication can actually be reduced by using the CMSSL matrix-matrix multiplication routine throughout, not just at the bottom level. However, the data must then be reorganized before and after each call to this routine, and the performance loss associated with this reshuffling of data is greater than the gain in speed from using the high-speed CMSSL multiplication routine.
The accuracy of the solution computed by Strassen’s method with it-
erative refinement was always of the same magnitude as that of the so-
lution computed by the CMSSL routines, and the Strassen-based solution
was never less accurate. The condition numbers of the random matrices ranged from $10^3$ to $10^7$, and the relative error in the Strassen-based solution, $\|x_{\mathrm{exact}} - x\|_\infty / \|x_{\mathrm{exact}}\|_\infty$, was never larger than $10^{-11}$. See Table 5.1 for details.

n     κ∞(A)  method    levels  rel. error
128   4E+3   CMSSL             E-14
             Strassen  1       E-14
256   2E+4   CMSSL             E-14
             Strassen  1       E-14
             Strassen  2       E-14
512   2E+7   CMSSL             E-11
             Strassen  1       E-11
             Strassen  2       E-11
             Strassen  3       E-11
1024  1E+5   CMSSL             E-13
             Strassen  1       E-13
             Strassen  2       E-13
             Strassen  3       E-13
             Strassen  4       E-13
2048  3E+5   CMSSL             E-12
             Strassen  1       E-12
             Strassen  2       E-13
             Strassen  3       E-13

Table 5.1: The relative error $\|x_{\mathrm{exact}} - x\|_\infty / \|x_{\mathrm{exact}}\|_\infty$ in the computed solution $x$ as a function of problem size $n$ for the CMSSL routine and for Strassen's method with $p$ levels of recursion and iterative refinement. For each value of $n$ we also give the $\infty$-norm condition number of the random matrix.
Although Strassen's algorithm is always slower than the CMSSL routine, our results are quite pleasing: with an implementation effort of about one month, we are able to obtain performance within a factor of 10 of the "ultimate" performance of the CMSSL routine, which required three man-years to implement. With one more month of "tuning" of the code for Strassen's algorithm, we may improve further on this performance ratio.
Another important issue is that our implementation is “portable”, in
the sense that it relies only on one particular CMSSL routine, namely, the
matrix-matrix multiplication used at the bottom recursion level. Hence, it
is fairly easy to port the implementation to the CM–5, and even to other
MPP computers.
6 Final Comments and Future Research
Although we are not able to perform a satisfactory error analysis of the al-
gorithm, our experiments have given us some understanding of the influence
of rounding errors. In particular, the experimental results indicate that the rounding errors are not as harmful as might at first be expected, and that with suitable stabilization we are indeed able to produce an approximate inverse which is good enough for iterative refinement or some CG-type method.
The performance of Strassen’s algorithm on the Connection Machines is
not as good as the carefully “hand-tuned” partitioned block LU factorization
algorithm with block cyclic layout from the CMSSL Library. The ratio is
about a factor of 10. On the other hand, our implementation time was about
one month, which should be compared with the three man-years it took to
implement the CMSSL routine.
In a continuation of this project it would be of interest to address some
of the following questions:
Based upon these investigations, we would like to be able to say more about
the trade-off on massively parallel computers between performance and im-
plementation time, as discussed in Section 1.
Acknowledgement
The authors would like to thank Tony F. Chan for valuable discussions
during the work described in this report.
References
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov & D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, 1992.
[6] S. M. Balle & P. C. Hansen, Block algorithms for the Connection Ma-
chines, Report UNIC-93-10, UNI•C, October 1993.
[12] J. W. Demmel, M. T. Heath & H. A. Van der Vorst, Parallel numerical
linear algebra, Acta Numerica (1993), 111–197.