Updating An LU Factorization With Pivoting
ENRIQUE S. QUINTANA-ORTÍ
Universidad Jaime I
and
ROBERT A. VAN DE GEIJN
The University of Texas at Austin
We show how to compute an LU factorization of a matrix when the factors of a leading principal submatrix are already known. The approach incorporates pivoting akin to partial pivoting,
a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra
Methods Environment (FLAME) application programming interface (API) is described. Exper-
imental results demonstrate practical numerical stability and high performance on an Intel
Itanium2 processor-based server.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra;
G.4 [Mathematics of Computing]: Mathematical Software—Efficiency
General Terms: Algorithms, Performance
Additional Key Words and Phrases: LU factorization, linear systems, updating, pivoting
ACM Reference Format:
Quintana-Ortı́, E. S. and van de Geijn, R. A. 2008. Updating an LU factorization with pivoting.
ACM Trans. Math. Softw. 35, 2, Article 11 (July 2008), 16 pages. DOI = 10.1145/1377612.1377615. http://doi.acm.org/10.1145/1377612.1377615.
This research was partially sponsored by NSF grants ACI-0305163, CCF-0342369 and CCF-
0540926, and an equipment donation from Hewlett-Packard. Primary support for this work came
from the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational
Engineering and Sciences (ICES) at UT-Austin.
Authors’ addresses: E. S. Quintana-Ortı́, Departamento de Ingenierı́a y Ciencia de Computadores,
Universidad Jaime I, 12.071 – Castellón, Spain; email: [email protected]. R. van de Geijn, De-
partment of Computer Sciences, The University of Texas at Austin, Austin, TX 78712; email:
[email protected].
1. INTRODUCTION
In this article we consider the LU factorization of a nonsymmetric matrix A,
partitioned as
$$A \rightarrow \begin{pmatrix} B & C \\ D & E \end{pmatrix} \qquad (1)$$
when a factorization of B is to be reused as the other parts of the matrix
change. This is known as the updating of an LU factorization.
Applications arising in boundary element methods (BEMs) often lead to very
large, dense linear systems [Cwik et al. 1994; Geng et al. 1996]. For many of
these applications the goal is to optimize a feature of an object. For example,
BEMs may be used to model the radar signature of an airplane. In an effort
to minimize this signature, it may be necessary to optimize the shape of a cer-
tain component of the airplane. If the degrees of freedom associated with this
component are ordered last among all degrees of freedom, the matrix presents
the structure given in Eq. (1). Now, as the shape of the component is modified,
it is only the matrices C, D, and E that change together with the right-hand
side vector of the corresponding linear system. Since the dimension of B is
frequently much larger than those of the remaining three matrices, it is desir-
able to factorize B only once and to update the factorization as C, D, and E
change. A standard LU factorization with partial pivoting does not provide a
convenient solution to this problem, since the rows to be swapped during the
application of the permutations may not lie only within B.
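To make the obstacle concrete, the following sketch (ours, not part of the article) uses SciPy's LU factorization with partial pivoting on a small randomly generated matrix partitioned as in Eq. (1) and reports which pivots reach past the leading block; any such pivot swaps a row of D into B, so a previously computed factorization of B alone cannot be reused. The sizes and names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import lu_factor

rng = np.random.default_rng(0)
nB, nE = 6, 3                                   # illustrative block sizes
A = rng.uniform(0.0, 1.0, (nB + nE, nB + nE))   # random matrix partitioned as in Eq. (1)

# LU with partial pivoting of the whole matrix; piv[i] is the row that was
# interchanged with row i at elimination step i (0-based, LAPACK convention).
lu, piv = lu_factor(A)

# Pivot steps inside the leading block that pull in a row of D.
crossing = [(i, int(p)) for i, p in enumerate(piv[:nB]) if p >= nB]
print("pivots that reach past B into D:", crossing)
```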
Little literature exists on this important topic. We have been made aware
that an unblocked out-of-core (OOC) algorithm similar to our algorithm was re-
ported in Yip [1979], but we have not been able to locate a copy of that report.
The proposed addition of this functionality to LAPACK is discussed in Demmel
and Dongarra [2005]. We already discussed preliminary results regarding the
algorithm proposed in the current article in a conference paper [Joffrain et al.
2005], in which its application to OOC LU factorization with pivoting is the
main focus.1 Gunter and van de Geijn [2005] report the updating of a QR factorization via techniques closely related to those proposed for the LU factorization in the current article.
The article is organized as follows: In Section 2 we review algorithms for
computing the LU factorization with partial pivoting. In Section 3, we discuss
how to update an LU factorization by considering the factorization of a 2 × 2
blocked matrix. The key insight of the work is found in this section: High-
performance blocked algorithms can be synthesized by combining the pivot-
ing strategies of LINPACK and LAPACK. Numerical stability is discussed in
Section 4 and performance is reported in Section 5. Concluding remarks are
given in the final section.
1 More practical
approaches to OOC LU factorization with partial pivoting exist [Toledo 1999; 1997;
Toledo and Gustavson 1996; Klimkowski and van de Geijn 1995]. Therefore, OOC application of
the approach is not further mentioned so as not to distract from the central message of this
article.
$$\begin{pmatrix} I_{n-1} & 0 \\ 0 & P(\pi_{n-1}) \end{pmatrix} \cdots \begin{pmatrix} 1 & 0 \\ 0 & P(\pi_1) \end{pmatrix} P(\pi_0).$$
Fig. 2. LINPACK and LAPACK blocked algorithms for the LU factorization built upon an
LAPACK unblocked factorization.
3. UPDATING AN LU FACTORIZATION
In this section we discuss how to compute the LU factorization of the matrix
in (1) in such a way that the LU factorization with partial pivoting of B can
be reused if C, D, and E change. We consider A in Eq. (1) to be of dimension
$n \times n$, with square B and E of orders $n_B$ and $n_E$, respectively. For reference, factoring the matrix in (1) using standard LU factorization with partial pivoting costs $\frac{2}{3}n^3$ flops (floating-point arithmetic operations). In this expression (and future computational cost estimates) we neglect insignificant terms of lower-order complexity, including the cost of pivoting the rows.
Step 1: Factor B. Compute the LU factorization with partial pivoting of B, overwriting B with the factors $\{L\backslash U\}$ and producing the permutation vector p. This step is skipped if B has already been factored. If the factors are to be used for future updates to C, D, and E, then a copy of U is needed, since it is overwritten by subsequent steps.
Step 2: Update C. This update is consistent with the factorization of B:
$$C := \mathrm{FS}^{\mathrm{LAP}}(B,\ p,\ C).$$
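The following NumPy/SciPy fragment is a minimal model of steps 1 and 2 (a sketch under our own naming and sizing assumptions; SciPy's lu_factor and a unit lower triangular solve stand in for the article's LU^LAP and FS^LAP routines):

```python
import numpy as np
from scipy.linalg import lu_factor, solve_triangular

rng = np.random.default_rng(1)
nB, nE = 200, 50
B = rng.standard_normal((nB, nB))
C = rng.standard_normal((nB, nE))

# Step 1: LU factorization with partial pivoting of B.
# lu holds {L\U}; piv encodes the permutation P(p) as row interchanges.
lu, piv = lu_factor(B)

# Step 2: update C consistently with the factorization of B,
#   C_hat := L^{-1} P(p) C.
C_hat = C.copy()
for i, p in enumerate(piv):                     # apply the row interchanges P(p)
    C_hat[[i, p]] = C_hat[[p, i]]
C_hat = solve_triangular(lu, C_hat, lower=True, unit_diagonal=True)
```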
Step 3: Factor $\begin{pmatrix} U \\ D \end{pmatrix}$. Compute the LU factorization with partial pivoting
$$\left[\begin{pmatrix} \{\bar L\backslash \bar U\} \\ \check L \end{pmatrix},\ \bar L,\ r\right] := \mathrm{LU}^{\mathrm{LIN}}_{\mathrm{BLK}}\begin{pmatrix} U \\ D \end{pmatrix}.$$
Here, Ū overwrites the upper triangular part of B (where U was stored before
this operation). The lower triangular matrix L̄ that results needs to be stored
separately, since both L, computed in step 1 and used at step 2, and L̄ are
needed during the forward-substitution stage when solving a linear system.
Step 4: Update $\begin{pmatrix} C \\ E \end{pmatrix}$. This update is consistent with the factorization of $\begin{pmatrix} U \\ D \end{pmatrix}$:
$$\begin{pmatrix} C \\ E \end{pmatrix} := \mathrm{FS}^{\mathrm{LIN}}\left(\begin{pmatrix} \bar L \\ D \end{pmatrix},\ r,\ \begin{pmatrix} C \\ E \end{pmatrix}\right).$$
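A structure-oblivious sketch of steps 3 and 4 (ours): it stacks U on top of D and calls a standard partial-pivoting factorization rather than the SA-LINPACK algorithm described later, so it does not exploit the upper triangular structure of U, but it produces factors of the same shape. All names and sizes are assumptions for the sketch.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(2)
nB, nE = 200, 50
U = np.triu(rng.standard_normal((nB, nB)))      # upper triangular factor from step 1
D = rng.standard_normal((nE, nB))
C_hat = rng.standard_normal((nB, nE))           # C after the step-2 update
E = rng.standard_normal((nE, nE))

# Step 3: factor the stacked (nB+nE) x nB matrix [U; D] with partial pivoting:
#   [U; D] = P_r @ L2 @ U_bar, with L2 = [L_bar; L_check] unit lower trapezoidal.
P_r, L2, U_bar = lu(np.vstack([U, D]))
L_bar, L_check = L2[:nB], L2[nB:]

# Step 4: update [C_hat; E] consistently with that factorization, i.e. apply
#   [L_bar 0; L_check I]^{-1} P(r) to [C_hat; E].
CE = P_r.T @ np.vstack([C_hat, E])              # apply the permutation P(r)
C_check = solve_triangular(L_bar, CE[:nB], lower=True, unit_diagonal=True)
E_check = CE[nB:] - L_check @ C_check
```

The structure-aware algorithm of Figures 3 through 5 produces factors of the same form while preserving most of the zeroes in U, which is where its reduction in computation comes from.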
Step 5: Factor Ě. Compute the LU factorization with partial pivoting of the block Ě produced by the previous update, yielding the factors $\{\tilde L\backslash\tilde U\}$ and the permutation vector s.
Overall, the five steps of the procedure apply Gauss transforms and permutations to reduce A to an upper triangular matrix, as
$$\begin{pmatrix} I & 0 \\ 0 & \tilde L^{-1} P(s) \end{pmatrix}
\begin{pmatrix} \bar L & 0 \\ \check L & I \end{pmatrix}^{-1} P(r)\,
\underbrace{\begin{pmatrix} L^{-1} P(p) & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} B & C \\ D & E \end{pmatrix}}_{\text{steps 1 and 2}}
=
\begin{pmatrix} I & 0 \\ 0 & \tilde L^{-1} P(s) \end{pmatrix}
\underbrace{\begin{pmatrix} \bar L & 0 \\ \check L & I \end{pmatrix}^{-1} P(r)
\begin{pmatrix} U & \hat C \\ D & E \end{pmatrix}}_{\text{steps 3 and 4}}
=
\underbrace{\begin{pmatrix} I & 0 \\ 0 & \tilde L^{-1} P(s) \end{pmatrix}
\begin{pmatrix} \bar U & \check C \\ 0 & \check E \end{pmatrix}}_{\text{step 5}}
=
\begin{pmatrix} \bar U & \check C \\ 0 & \tilde U \end{pmatrix},$$
where $\{L\backslash U\}$, $\left\{\begin{pmatrix} \bar L & 0 \\ \check L & I \end{pmatrix} \Big\backslash \begin{pmatrix} \bar U \\ 0 \end{pmatrix}\right\}$, and $\{\tilde L\backslash\tilde U\}$ are the triangular factors computed, respectively, in the LU factorizations in steps 1, 3, and 5; $p$, $r$, and $s$ are the corresponding permutation vectors; $\hat C$ is the matrix that results from overwriting C with $L^{-1} P(p)\, C$; and $\begin{pmatrix} \check C \\ \check E \end{pmatrix}$ are the blocks that result from
$$\begin{pmatrix} \bar L & 0 \\ \check L & I \end{pmatrix}^{-1} P(r) \begin{pmatrix} \hat C \\ E \end{pmatrix}.$$
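To check the overall relation numerically, the sketch below (ours; plain SciPy routines stand in for the blocked FLAME kernels, and all sizes are assumptions) runs the five steps on a random block-partitioned matrix and then solves $Ax = b$ by applying the same Gauss transforms and permutations to b and back-substituting against the final upper triangular matrix $\begin{pmatrix}\bar U & \check C\\ 0 & \tilde U\end{pmatrix}$:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(3)
nB, nE = 200, 50
B = rng.standard_normal((nB, nB)); C = rng.standard_normal((nB, nE))
D = rng.standard_normal((nE, nB)); E = rng.standard_normal((nE, nE))
A = np.block([[B, C], [D, E]])
b = rng.standard_normal(nB + nE)

# Step 1: B = P1 @ L1 @ U1 (LU with partial pivoting).
P1, L1, U1 = lu(B)
# Step 2: C_hat := L1^{-1} P1^T C.
C_hat = solve_triangular(L1, P1.T @ C, lower=True, unit_diagonal=True)
# Step 3: [U1; D] = P2 @ [L_bar; L_check] @ U_bar.
P2, L2, U_bar = lu(np.vstack([U1, D]))
L_bar, L_check = L2[:nB], L2[nB:]
# Step 4: [C_check; E_check] := [L_bar 0; L_check I]^{-1} P2^T [C_hat; E].
CE = P2.T @ np.vstack([C_hat, E])
C_check = solve_triangular(L_bar, CE[:nB], lower=True, unit_diagonal=True)
E_check = CE[nB:] - L_check @ C_check
# Step 5: E_check = P3 @ L3 @ U_tilde.
P3, L3, U_tilde = lu(E_check)

# Solve A x = b: apply the same transforms to b, then back-substitute against
# the upper triangular matrix [U_bar C_check; 0 U_tilde].
y1 = solve_triangular(L1, P1.T @ b[:nB], lower=True, unit_diagonal=True)
z = P2.T @ np.concatenate([y1, b[nB:]])
z1 = solve_triangular(L_bar, z[:nB], lower=True, unit_diagonal=True)
z2 = z[nB:] - L_check @ z1
w2 = solve_triangular(L3, P3.T @ z2, lower=True, unit_diagonal=True)
x2 = solve_triangular(U_tilde, w2)                      # upper triangular solve
x1 = solve_triangular(U_bar, z1 - C_check @ x2)
x = np.concatenate([x1, x2])
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```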
We describe next how to reduce both the computational and storage require-
ments by exploiting the upper triangular structure of U during steps 3 and 4.
Fig. 3. SA-LINPACK blocked algorithm for the LU factorization of $[U^T, D^T]^T$ built upon an LAPACK blocked factorization.
Fig. 4. Illustration of an iteration of the SA-LINPACK blocked algorithm used in step 3 and how it preserves most of the zeroes in U. The zeroes below the diagonal are preserved, except within the b × b diagonal blocks, where pivoting will fill in below the diagonal. The shaded areas are the ones updated as part of the current iteration. The fact that $U_{22}$ is not updated demonstrates how computation can be reduced. If the SA-LAPACK blocked algorithm were used, then nonzeroes would appear during this iteration in the block marked as 0⋆, due to pivoting; as a result, upon completion, zeroes would be lost in the full strictly lower triangular part of U.
Fig. 5. SA-LINPACK blocked algorithm for the update of $[C^T, E^T]^T$, consistent with the SA-LINPACK blocked LU factorization of $[U^T, D^T]^T$.
Fig. 6. Overhead cost of the different approaches to compute the LU factorization in Eq. (1) with
respect to the cost of the LU factorization with partial pivoting.
which is basically determined by the problem size and the pivoting strategy. For example, the growth factors of complete, partial, and pairwise [Wilkinson 1965, p. 236] pivoting have been shown to be bounded by $\rho_c \le n^{1/2}(2\cdot 3^{1/2}\cdots n^{1/(n-1)})$, $\rho_p \le 2^{n-1}$, and $\rho_w \le 4^{n-1}$, respectively [Sorensen 1985; Stewart 1998]. Statistical models and extensive experimentation in Trefethen and Schreiber [1990] showed that, on average, $\rho_c \approx n^{1/2}$, $\rho_p \approx n^{2/3}$, and $\rho_w \approx n$, suggesting that in practice both partial and pairwise pivoting are numerically stable, and that pairwise pivoting can be expected to behave only slightly worse numerically than partial pivoting.
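To give a sense of how differently these bounds scale, the short computation below (ours) evaluates the three worst-case bounds for a few values of n:

```python
import numpy as np

def bound_complete(n):
    # rho_c <= n^{1/2} * (2 * 3^{1/2} * 4^{1/3} * ... * n^{1/(n-1)})
    ks = np.arange(2, n + 1)
    return np.sqrt(n) * np.prod(ks ** (1.0 / (ks - 1)))

for n in (10, 50, 100):
    print(f"n={n:4d}  complete: {bound_complete(n):10.3e}  "
          f"partial: {2.0 ** (n - 1):10.3e}  pairwise: {4.0 ** (n - 1):10.3e}")
```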
The new algorithm applies partial pivoting during the factorization of B and then again in the factorization of $\begin{pmatrix} U \\ D \end{pmatrix}$. This can be considered a blocked variant of pairwise pivoting. Thus, we can expect the element growth of the algorithm to lie between those of partial and pairwise pivoting. Next we describe an experiment that provides evidence in support of this observation.
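Before turning to the actual experiment, a simplified sketch (ours, not the article's code) of this kind of comparison in NumPy/SciPy is given below; it uses the ratio of the largest entry of the computed upper triangular factor to the largest entry of A as a crude proxy for the growth factor, which is an assumption of the sketch rather than the article's exact measure:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def growth_partial(A):
    """Proxy growth factor max|U| / max|A| for LU with partial pivoting."""
    _, _, U = lu(A)
    return np.abs(U).max() / np.abs(A).max()

def growth_incremental(A, nB):
    """Same proxy for the two-stage factorization of A = [B C; D E]."""
    B, C, D, E = A[:nB, :nB], A[:nB, nB:], A[nB:, :nB], A[nB:, nB:]
    P1, L1, U1 = lu(B)                                             # step 1
    C_hat = solve_triangular(L1, P1.T @ C,
                             lower=True, unit_diagonal=True)       # step 2
    P2, L2, U_bar = lu(np.vstack([U1, D]))                         # step 3
    CE = P2.T @ np.vstack([C_hat, E])
    C_check = solve_triangular(L2[:nB], CE[:nB],
                               lower=True, unit_diagonal=True)     # step 4
    E_check = CE[nB:] - L2[nB:] @ C_check
    _, _, U_tilde = lu(E_check)                                    # step 5
    top = max(np.abs(U_bar).max(), np.abs(C_check).max(), np.abs(U_tilde).max())
    return top / np.abs(A).max()

rng = np.random.default_rng(4)
nB, nE, trials = 100, 50, 20
gp = gi = 0.0
for _ in range(trials):
    A = rng.uniform(0.0, 1.0, (nB + nE, nB + nE))                  # entries uniform in (0, 1)
    gp += growth_partial(A)
    gi += growth_incremental(A, nB)
print(f"average proxy growth  partial: {gp / trials:.2f}  incremental: {gi / trials:.2f}")
```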
In Figure 7 we report the element growths observed during computation of
the LU factorization of matrices as in Eq. (1), with $n_B$ = 100 and dimensions for E ranging from $n_E$ = 5 to 100, using partial, incremental, and pairwise pivoting.
The entries of the matrices are generated randomly, chosen from a uniform dis-
tribution in the interval (0.0, 1.0). The experiment was carried out on an Intel
Xeon processor using Matlab 7.0.4 (IEEE double-precision arithmetic). The
results report the average element growth for 100 different matrices for each
matrix dimension. The figure shows the growth factor of incremental pivoting
to be smaller than that of pairwise pivoting and to approximate that of par-
tial pivoting. A similar behavior was obtained for other matrix types: uniform
distribution in (−1.0, 1.0), normal distribution with mean 0.0 and deviation
1.0 (N[0.0, 1.0]), symmetric matrices with elements in N[0.0, 1.0], and Toeplitz
matrices with elements in N[0.0, 1.0]. Only for orthogonal matrices with
Haar distribution [Trefethen and Schreiber 1990] did we obtain significantly
5. PERFORMANCE
In this section we report results for a high-performance implementation of the
SA LINPACK procedure.
5.1 Implementation
The FLAME library (version 0.9) was used to implement a high-performance
LU factorization with partial pivoting and the SA LINPACK procedure. The
benefit of this API is that the code closely resembles the algorithms as they
are presented in Figures 1 through 3 and 5. The performance of the FLAME
LU factorization with partial pivoting is highly competitive with LAPACK and
vendor implementations of this operation.
The implementations can be examined by visiting http://www.cs.utexas.
edu/users/flame/Publications/. For further information on FLAME, visit
www.cs.utexas.edu/users/flame.
5.2 Platform
Performance experiments were carried out in double-precision arithmetic on an Intel Itanium2 (1.5 GHz) processor-based workstation capable of attaining 6 GFLOPS ($10^9$ flops per second).
Fig. 8. Top: speedup attained when B is not refactored, over LU factorization with partial pivoting
of the entire matrix; bottom: slowdown for the first factorization (when B must also be factored).
For reference, the FLAME LU factorization with partial pivoting delivered 4.8 GFLOPS for a 2000 × 2000
matrix. A block size b = 128 was employed in this procedure for all experi-
ments reported next. The implementation was linked to the GotoBLAS R1.6
basic linear algebra subprograms (BLAS) library [Goto 2004]. The BLAS rou-
tine DGEMM, which is used to compute $C := C - AB$ ($C \in \mathbb{R}^{m\times n}$, $A \in \mathbb{R}^{m\times k}$, and $B \in \mathbb{R}^{k\times n}$), attains its best performance when the common dimension of A and B, namely k, is equal to 128. Notice that most computation in the SA LINPACK procedure is cast in terms of this operation, with k = b.
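As an illustration, the rank-b trailing update below (ours, with arbitrary example sizes) has exactly the shape of the dominant operation; in NumPy the product dispatches to the GEMM routine of whatever BLAS library the installation is linked against:

```python
import numpy as np

m, n, b = 1024, 512, 128                 # the block size b plays the role of k in DGEMM
rng = np.random.default_rng(5)
Cblk = rng.standard_normal((m, n))
Ablk = rng.standard_normal((m, b))
Bblk = rng.standard_normal((b, n))

# Trailing-matrix update C := C - A B, cast as a single matrix-matrix product.
Cblk -= Ablk @ Bblk
```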
The performance benefits reported on this platform are representative of
those that can be expected on other current architectures.
5.3 Results
In Figure 8 (top) we show the speedup attained when an existing factoriza-
tion of B is reused, by reporting the time required to factor Eq. (1) with
high-performance LU factorization with partial pivoting divided by the time
required to update an existing factorization of B via the SA LINPACK proce-
dure (steps 2 through 5). In that figure, $n_B$ = 1000 and $n_E$ is varied from 0 to 1000. Results are reported for several choices of the block size b. The
6. CONCLUSIONS
We have proposed blocked algorithms for updating an LU factorization. They
have been shown to attain high performance and to greatly reduce the cost
of an update to a matrix for which a partial factorization already exists. The
key insight is the synthesis of LINPACK- and LAPACK-style pivoting. While
some additional computation is required, this is more than offset by the im-
provement in performance that comes from casting computation in terms of
matrix-matrix multiplication.
We acknowledge that the question of the numerical stability of the new
algorithm relative to that of LU factorization with partial pivoting remains
open. Strictly speaking, LU factorization with partial pivoting is itself not
numerically stable, but practical experience has shown it to be effective.
Theoretical results that rigorously bound the additional element growth are in
order, but are beyond the scope of the present article.
REFERENCES
Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Softw. 31, 1 (Mar.), 1–26.
Bientinesi, P. and van de Geijn, R. 2006. Representing dense linear algebra algorithms: A farewell to indices. Tech. Rep. FLAME Working Note 17, CS-TR-2006-10, Department of Computer Sciences, The University of Texas at Austin.
Cwik, T., van de Geijn, R., and Patterson, J. 1994. The application of parallel computation to integral equation models of electromagnetic scattering. J. Optic. Soc. Amer. A 11, 4 (Apr.), 1538–1545.
Demmel, J. and Dongarra, J. 2005. LAPACK 2005 prospectus: Reliable and scalable software for linear algebra computations on high end computers. LAPACK Working Note 164, UT-CS-05-546, University of Tennessee. February.
Geng, P., Oden, J. T., and van de Geijn, R. 1996. Massively parallel computation for acoustical scattering problems using boundary element methods. J. Sound Vibra. 191, 1, 145–165.
Goto, K. 2004. TACC software and tools. http://www.tacc.utexas.edu/resources/software/.