
Updating an LU Factorization with Pivoting

ENRIQUE S. QUINTANA-ORTÍ
Universidad Jaime I
and
ROBERT A. VAN DE GEIJN
The University of Texas at Austin
We show how to compute an LU factorization of a matrix when the factors of a leading
principal submatrix are already known. The approach incorporates pivoting akin to partial pivoting,
a strategy we call incremental pivoting. An implementation using the Formal Linear Algebra
Methods Environment (FLAME) application programming interface (API) is described. Exper-
imental results demonstrate practical numerical stability and high performance on an Intel
Itanium2 processor-based server.
Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra;
G.4 [Mathematics of Computing]: Mathematical Software—Efficiency
General Terms: Algorithms, Performance
Additional Key Words and Phrases: LU factorization, linear systems, updating, pivoting
ACM Reference Format:
Quintana-Ortí, E. S. and van de Geijn, R. A. 2008. Updating an LU factorization with pivoting.
ACM Trans. Math. Softw. 35, 2, Article 11 (July 2008), 16 pages. DOI = 10.1145/1377612.1377615.
http://doi.acm.org/10.1145/1377612.1377615.

This research was partially sponsored by NSF grants ACI-0305163, CCF-0342369 and CCF-
0540926, and an equipment donation from Hewlett-Packard. Primary support for this work came
from the J. Tinsley Oden Faculty Fellowship Research Program of the Institute for Computational
Engineering and Sciences (ICES) at UT-Austin.
Authors’ addresses: E. S. Quintana-Ortı́, Departamento de Ingenierı́a y Ciencia de Computadores,
Universidad Jaime I, 12.071 – Castellón, Spain; email: [email protected]. R. van de Geijn, De-
partment of Computer Sciences, The University of Texas at Austin, Austin, TX 78712; email:
[email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use
is granted without fee provided that copies are not made or distributed for profit or direct com-
mercial advantage and that copies show this notice on the first page or initial screen of a display
along with the full citation. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credits is permitted. To copy otherwise, to republish, to post
on servers, to redistribute to lists, or to use any component of this work in other works requires
prior specific permission and/or a fee. Permissions may be requested from the Publications Dept.,
ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or
[email protected].
© 2008 ACM 0098-3500/2008/07-ART11 $5.00 DOI: 10.1145/1377612.1377615. http://doi.acm.org/10.1145/1377612.1377615.
ACM Transactions on Mathematical Software, Vol. 35, No. 2, Article 11, Pub. date: July 2008.
11: 2 · E. S. Quintana-Ortı́ and R. A. van de Geijn

1. INTRODUCTION
In this article we consider the LU factorization of a nonsymmetric matrix A,
partitioned as
\[
A \rightarrow \begin{pmatrix} B & C \\ D & E \end{pmatrix} \qquad (1)
\]
when a factorization of B is to be reused as the other parts of the matrix
change. This is known as the updating of an LU factorization.
Applications arising in boundary element methods (BEMs) often lead to very
large, dense linear systems [Cwik et al. 1994; Geng et al. 1996]. For many of
these applications the goal is to optimize a feature of an object. For example,
BEMs may be used to model the radar signature of an airplane. In an effort
to minimize this signature, it may be necessary to optimize the shape of a cer-
tain component of the airplane. If the degrees of freedom associated with this
component are ordered last among all degrees of freedom, the matrix presents
the structure given in Eq. (1). Now, as the shape of the component is modified,
it is only the matrices C, D, and E that change together with the right-hand
side vector of the corresponding linear system. Since the dimension of B is
frequently much larger than those of the remaining three matrices, it is desir-
able to factorize B only once and to update the factorization as C, D, and E
change. A standard LU factorization with partial pivoting does not provide a
convenient solution to this problem, since the rows to be swapped during the
application of the permutations may not lie only within B.
Little literature exists on this important topic. We have been made aware
that an unblocked out-of-core (OOC) algorithm similar to our algorithm was re-
ported in Yip [1979], but we have not been able to locate a copy of that report.
The proposed addition of this functionality to LAPACK is discussed in Demmel
and Dongarra [2005]. We already discussed preliminary results regarding the
algorithm proposed in the current article in a conference paper [Joffrain et al.
2005], in which its application to OOC LU factorization with pivoting is the
main focus.1 In Gunter and van de Geijn [2005], the updating of a QR factorization
via techniques closely related to those proposed for the LU factorization
in the current article is reported.
The article is organized as follows: In Section 2 we review algorithms for
computing the LU factorization with partial pivoting. In Section 3, we discuss
how to update an LU factorization by considering the factorization of a 2 × 2
blocked matrix. The key insight of the work is found in this section: High-
performance blocked algorithms can be synthesized by combining the pivot-
ing strategies of LINPACK and LAPACK. Numerical stability is discussed in
Section 4 and performance is reported in Section 5. Concluding remarks are
given in the final section.

1 More practical
approaches to OOC LU factorization with partial pivoting exist [Toledo 1999; 1997;
Toledo and Gustavson 1996; Klimkowski and van de Geijn 1995]. Therefore, OOC application of
the approach is not further mentioned so as not to distract from the central message of this
article.

We hereafter assume that the reader is familiar with Gauss transforms,
their properties, and how they are used to factor a matrix. We start indexing
elements of vectors and matrices at 0. Capital letters, lower-case letters, and
lower-case Greek letters will be used to denote matrices, vectors, and scalars,
respectively. The identity matrix of order n is denoted by I_n.

2. THE LU FACTORIZATION WITH PARTIAL PIVOTING


Given an n × n matrix A, its LU factorization with partial pivoting is given by
PA = LU. Here P is a permutation matrix of order n, L is n × n unit lower
triangular, and U is n × n upper triangular. We will denote the computation of
P, L, and U by
\[
[A, p] := [\{L\backslash U\}, p] = \mathrm{LU}(A), \qquad (2)
\]
where {L\U} is the matrix whose strictly lower triangular part equals L and
whose upper triangular part equals U. Matrix L has ones on the diagonal,
which need not be stored, and the factors L and U overwrite the original
contents of A. The permutation matrix is generally stored in a vector p of n
integers.
Solving the linear system A x = b now becomes a matter of solving Ly = Pb
followed by Ux = y. These two stages are referred to as forward substitution
and backward substitution, respectively.
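The two stages can be illustrated with SciPy (a minimal sketch, not the paper's implementation; note that SciPy's `lu` returns A = P L U, so P^T plays the role of the paper's P):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

P, L, U = lu(A)  # SciPy convention: A = P @ L @ U, i.e. P^T A = L U

# forward substitution: solve L y = P^T b
y = solve_triangular(L, P.T @ b, lower=True, unit_diagonal=True)
# backward substitution: solve U x = y
x = solve_triangular(U, y)

assert np.allclose(A @ x, b)
```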

2.1 Unblocked Right-Looking LU Factorization


Two unblocked algorithms for computing the LU factorization with partial piv-
oting are given in Figure 1. There, n(·) stands for the number of columns of a
matrix; the thick lines in the matrices/vectors denote how far computation
has progressed; PIVOT(x) determines the element in x with largest magnitude,
swaps this element with the top element, and returns the index of the element
that was swapped; and P(π1) is the permutation matrix constructed by
interchanging row 0 and row π1 of the identity matrix. The dimension of a
permutation matrix will not be specified since it is obvious from the context in
which it is used. We believe the rest of the notation to be intuitive [Bientinesi
and van de Geijn 2006; Bientinesi et al. 2005]. Both algorithms correspond to
what is usually known as the right-looking variant. Upon completion, matri-
ces L and U overwrite A. These algorithms also yield the LU factorization of
a matrix with more rows than columns.
The LINPACK variant, LU^LIN_UNB hereafter, computes the LU factorization as
a sequence of Gauss transforms interleaved with permutation matrices:
\[
L_{n-1}\begin{pmatrix} I_{n-1} & 0 \\ 0 & P(\pi_{n-1}) \end{pmatrix}
\cdots
L_1\begin{pmatrix} 1 & 0 \\ 0 & P(\pi_1) \end{pmatrix}
L_0\, P(\pi_0)\, A = U.
\]

For the LAPACK variant, LU^LAP_UNB, it is recognized that, by swapping those rows
of matrix L that were already computed and stored to the left of the column
that is currently being eliminated, the order of the Gauss transforms and

Fig. 1. LINPACK and LAPACK unblocked algorithms for the LU factorization.

permutation matrices can be rearranged so that P(p) A = LU. Here P(p),
with p = (\pi_0, \ldots, \pi_{n-1})^T, denotes the n × n permutation matrix
\[
\begin{pmatrix} I_{n-1} & 0 \\ 0 & P(\pi_{n-1}) \end{pmatrix}
\cdots
\begin{pmatrix} 1 & 0 \\ 0 & P(\pi_1) \end{pmatrix}
P(\pi_0).
\]

Both algorithms will execute to completion, even if an exact zero is encountered
on the diagonal of U. This is important since it is possible that matrix B in (1)
is singular even if A is not.
The difference between the two algorithms becomes most obvious when
forward substitution is performed. For the LINPACK variant, forward sub-
stitution requires the application of permutations and Gauss transforms to
be interleaved. For the LAPACK algorithm, the permutations are applied
first to the right-hand side vector, after which a clean lower triangular solve
yields the desired (intermediate) result: Ly = P(p)b. Depending on whether
the LINPACK or LAPACK variant was used for the LU factorization, we denote
the forward-substitution stage, respectively, by y := FS^LIN(A, p, b) or
y := FS^LAP(A, p, b), where A and p are assumed to contain the outputs of the
corresponding factorization.
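The contrast between the two variants can be sketched in NumPy. This is an illustrative unblocked model, not the LINPACK/LAPACK source, and the routine names (`lu_lin_unb`, `fs_lin`, and so on) are ours, chosen to echo the paper's notation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_lin_unb(A):
    """LINPACK-style unblocked LU: row swaps touch only the active columns,
    so the stored multipliers are NOT permuted."""
    A = A.copy()
    n = A.shape[0]
    piv = np.zeros(n, dtype=int)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))
        piv[k] = p
        A[[k, p], k:] = A[[p, k], k:]   # swap rows k and p from column k onward only
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A, piv

def fs_lin(A, piv, b):
    """Forward substitution with permutations and Gauss transforms interleaved."""
    y = b.copy()
    for k in range(len(y)):
        y[[k, piv[k]]] = y[[piv[k], k]]
        y[k+1:] -= A[k+1:, k] * y[k]
    return y

def lu_lap_unb(A):
    """LAPACK-style unblocked LU: entire rows are swapped, so P(p) A = L U."""
    A = A.copy()
    n = A.shape[0]
    piv = np.zeros(n, dtype=int)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))
        piv[k] = p
        A[[k, p], :] = A[[p, k], :]     # swap full rows, including stored multipliers
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A, piv

def fs_lap(A, piv, b):
    """Apply all permutations first, then one clean unit lower triangular solve."""
    y = b.copy()
    for k in range(len(y)):
        y[[k, piv[k]]] = y[[piv[k], k]]
    return solve_triangular(A, y, lower=True, unit_diagonal=True)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
x_ref = np.linalg.solve(A, b)

F, p = lu_lin_unb(A)
x_lin = solve_triangular(np.triu(F), fs_lin(F, p, b))
F, p = lu_lap_unb(A)
x_lap = solve_triangular(np.triu(F), fs_lap(F, p, b))
assert np.allclose(x_lin, x_ref) and np.allclose(x_lap, x_ref)
```

Both variants deliver the same solution; the difference is purely in whether the stored multipliers travel with the row swaps, which determines the shape of the forward-substitution stage.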

Fig. 2. LINPACK and LAPACK blocked algorithms for the LU factorization built upon an
LAPACK unblocked factorization.

2.2 Blocked Right-Looking LU Factorization


It is well known that high performance can be achieved in a portable fash-
ion by casting algorithms in terms of matrix-matrix multiplication [Kågström
et al. 1998; 1995; Gustavson et al. 1998; Gunnels et al. 2001]. In Figure 2
we show LINPACK(-like) and LAPACK blocked algorithms, LU^LIN_BLK and LU^LAP_BLK,
respectively, both built upon an LAPACK unblocked algorithm. The former
algorithm really combines the LAPACK style of pivoting, within the factor-
ization of a panel of width b , with the LINPACK style of pivoting. The two
algorithms attain high performance on modern architectures with (multiple
levels of) cache memory by casting the bulk of the computation in terms of the
matrix-matrix multiplication A 22 := A 22 − L 21 U12 , also called a rank-k update,
which is known to achieve high performance [Goto and van de Geijn 2008].
The algorithms also apply to matrices with more rows than columns.
As both LINPACK and LAPACK blocked algorithms are based on the
LAPACK unblocked algorithm (which completes even if the current panel is
singular), both will complete even for a singular matrix. If matrix A in Eq. (1)
is nonsingular, then the upper triangular factor will also be nonsingular; this
is what we need in order to use the factored matrix to solve a linear system.
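The structure of such a blocked right-looking factorization can be sketched compactly in NumPy. This is a simplified model of a LAPACK-style blocked algorithm (the helper name `lu_blk_lap` and the parameter `nb`, which plays the role of the block size b, are ours):

```python
import numpy as np
from scipy.linalg import solve_triangular

def lu_blk_lap(A, nb=3):
    """LAPACK-style blocked right-looking LU with partial pivoting.
    Returns the overwritten matrix {L\\U} and the row permutation."""
    A = A.copy()
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # unblocked factorization of the current panel, with full-row swaps
        for j in range(k, k + kb):
            p = j + np.argmax(np.abs(A[j:, j]))
            A[[j, p], :] = A[[p, j], :]
            perm[[j, p]] = perm[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # triangular solve for U12, then the rank-kb update of A22
            L11 = A[k:k+kb, k:k+kb]
            A[k:k+kb, k+kb:] = solve_triangular(L11, A[k:k+kb, k+kb:],
                                                lower=True, unit_diagonal=True)
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, perm

rng = np.random.default_rng(0)
A0 = rng.standard_normal((7, 7))
F, perm = lu_blk_lap(A0)
L = np.tril(F, -1) + np.eye(7)
U = np.triu(F)
assert np.allclose(A0[perm], L @ U)   # P(p) A = L U
```

The rank-kb update on the trailing submatrix is the matrix-matrix multiplication in which the bulk of the flops is cast, which is what makes the blocked variant fast in practice.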

3. UPDATING AN LU FACTORIZATION
In this section we discuss how to compute the LU factorization of the matrix
in (1) in such a way that the LU factorization with partial pivoting of B can
be reused if C, D, and E change. We consider A in Eq. (1) to be of dimension
n× n, with square B and E of orders nB and nE , respectively. For reference, fac-
toring the matrix in (1) using standard LU factorization with partial pivoting
costs (2/3)n^3 flops (floating-point arithmetic operations). In this expression (and
future computational cost estimates) we neglect insignificant terms of lower-
order complexity, including the cost of pivoting the rows.

3.1 Basic Procedure


We propose employing the following procedure, consisting of five steps, which
computes an LU factorization with incremental pivoting of the matrix in
Eq. (1).
Step 1: Factor B. Compute the LU factorization with partial pivoting
\[
[B, p] := [\{L\backslash U\}, p] = \mathrm{LU}^{\mathrm{LAP}}_{\mathrm{BLK}}(B).
\]

This step is skipped if B has already been factored. If the factors are to be
used for future updates to C, D, and E, then a copy of U is needed since it is
overwritten by subsequent steps.
Step 2: Update C. Update C consistently with the factorization of B:
\[
C := \mathrm{FS}^{\mathrm{LAP}}(B, p, C).
\]

 
Step 3: Factor (U; D). Compute the LU factorization with partial pivoting
\[
\left[ \begin{pmatrix} U \\ D \end{pmatrix}, \bar{L}, r \right] :=
\left[ \begin{pmatrix} \{\bar{L}\backslash\bar{U}\} \\ \check{L} \end{pmatrix}, r \right] =
\mathrm{LU}^{\mathrm{LIN}}_{\mathrm{BLK}}\begin{pmatrix} U \\ D \end{pmatrix}.
\]

Here, Ū overwrites the upper triangular part of B (where U was stored before
this operation). The lower triangular matrix L̄ that results needs to be stored
separately, since both L, computed in step 1 and used at step 2, and L̄ are
needed during the forward-substitution stage when solving a linear system.
   
Step 4: Update (C; E). This is consistent with the factorization of (U; D):
\[
\begin{pmatrix} C \\ E \end{pmatrix} :=
\mathrm{FS}^{\mathrm{LIN}}\left( \begin{pmatrix} \bar{L} \\ D \end{pmatrix},\, r,\,
\begin{pmatrix} C \\ E \end{pmatrix} \right).
\]

Step 5: Factor E. Finally, compute the LU factorization with partial pivoting
\[
[E, s] := [\{\tilde{L}\backslash\tilde{U}\}, s] = \mathrm{LU}^{\mathrm{LAP}}_{\mathrm{BLK}}(E).
\]

Overall, the five steps of the procedure apply Gauss transforms and permutations
to reduce A to an upper triangular matrix, as
\[
\begin{pmatrix} I & 0 \\ 0 & \tilde{L}^{-1} P(s) \end{pmatrix}
\begin{pmatrix} \bar{L} & 0 \\ \check{L} & I \end{pmatrix}^{-1} P(r)\,
\underbrace{\begin{pmatrix} L^{-1} P(p) & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} B & C \\ D & E \end{pmatrix}}_{\text{steps 1 and 2}}
=
\begin{pmatrix} I & 0 \\ 0 & \tilde{L}^{-1} P(s) \end{pmatrix}
\underbrace{\begin{pmatrix} \bar{L} & 0 \\ \check{L} & I \end{pmatrix}^{-1} P(r)
\begin{pmatrix} U & \hat{C} \\ D & E \end{pmatrix}}_{\text{steps 3 and 4}}
=
\underbrace{\begin{pmatrix} I & 0 \\ 0 & \tilde{L}^{-1} P(s) \end{pmatrix}
\begin{pmatrix} \bar{U} & \check{C} \\ 0 & \check{E} \end{pmatrix}}_{\text{step 5}}
=
\begin{pmatrix} \bar{U} & \check{C} \\ 0 & \tilde{U} \end{pmatrix},
\]
where {L\U}, the factor pair consisting of the unit lower triangular matrix
(L̄; Ľ) and the upper triangular matrix Ū, and {L̃\Ũ} are the triangular factors
computed, respectively, in the LU factorizations in steps 1, 3, and 5; p, r, and
s are the corresponding permutation vectors; Ĉ is the matrix that results
from overwriting C with L^{-1}P(p)C; and Č and Ě are the blocks that result from
applying (L̄ 0; Ľ I)^{-1}P(r) to (Ĉ; E).
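The five steps can be sketched with SciPy (variable names are ours; SciPy's `lu` returns A = P L U, so L^{-1}P^T plays the role of the paper's L^{-1}P(p)). For brevity, steps 3 through 5 are merged here into one partially pivoted factorization of the remaining 2 × 2 block matrix; the paper keeps them separate precisely so that the structure of U can be exploited:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
nB, nE = 6, 3
B = rng.standard_normal((nB, nB))
C = rng.standard_normal((nB, nE))
D = rng.standard_normal((nE, nB))
E = rng.standard_normal((nE, nE))
A = np.block([[B, C], [D, E]])
bvec = rng.standard_normal(nB + nE)

# Step 1: factor B (reusable even as C, D, E change)
P1, L1, U1 = lu(B)

# Step 2: update C consistently: Chat = L1^{-1} P1^T C
Chat = solve_triangular(L1, P1.T @ C, lower=True, unit_diagonal=True)

# Steps 3-5 (merged): LU with partial pivoting of the remaining block matrix
M = np.block([[U1, Chat], [D, E]])
P2, L2, U2 = lu(M)

# Solve A x = b reusing B's factors: step-1/2 transform of the top of b,
# then forward and backward substitution with the step-3/5 factors
y1 = solve_triangular(L1, P1.T @ bvec[:nB], lower=True, unit_diagonal=True)
z = solve_triangular(L2, P2.T @ np.concatenate([y1, bvec[nB:]]),
                     lower=True, unit_diagonal=True)
x = solve_triangular(U2, z)

assert np.allclose(A @ x, bvec)
```

When C, D, or E change, only steps 2 onward need to be repeated; P1, L1, U1 are reused as long as a copy of U1 is kept.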

3.2 Analysis of the Basic Procedure


For now, the factorization in step 3 does not take advantage of any zeroes below
the diagonal of U: after matrix B is factored and C is updated, the matrix
\begin{pmatrix} U & C \\ D & E \end{pmatrix}
is factored as if it were a matrix without special structure. Its cost is
stated in the column labeled "Basic procedure" in Table I. There we only report
significant terms: we assume that b ≪ nE, nB and report only those costs that
are at least O(b nE nB), O(b nE^2), or O(b nB^2). If nE is small (i.e., nB ≈ n), the
procedure clearly does not benefit from the existence of an already factored B.
Also, the procedure requires additional storage for the nB × nB lower triangular
matrix L̄ computed in step 3.

Table I. Computational Cost (in flops) of Different Approaches to Compute the LU
Factorization of the Matrix in Eq. (1)

    Operation          | Basic procedure        | SA LAPACK procedure    | SA LINPACK procedure
    -------------------|------------------------|------------------------|------------------------
    1: Factor B        | 2/3 nB^3               | 2/3 nB^3               | 2/3 nB^3
    2: Update C        | nB^2 nE                | nB^2 nE                | nB^2 nE
    3: Factor (U; D)   | nB^2 nE + 2/3 nB^3     | nB^2 nE + 1/2 b nB^2   | nB^2 nE + 1/2 b nB^2
    4: Update (C; E)   | 2 nB nE^2 + nB^2 nE    | 2 nB nE^2 + nB^2 nE    | 2 nB nE^2 + b nB nE
    5: Factor E        | 2/3 nE^3               | 2/3 nE^3               | 2/3 nE^3
    Total              | 2/3 n^3 + 2/3 nB^3     | 2/3 n^3                | 2/3 n^3
                       |   + nB^2 nE            |   + nB^2 (1/2 b + nE)  |   + b nB (1/2 nB + nE)

The terms added to 2/3 n^3 in the totals are the costs incurred in excess of the cost of a
standard LU factorization.

We describe next how to reduce both the computational and storage require-
ments by exploiting the upper triangular structure of U during steps 3 and 4.

3.3 Exploiting the Structure in Step 3


A blocked algorithm that exploits the upper triangular structure of U is given
in Figure 3 and illustrated in Figure 4. We name this algorithm LU^{SA-LIN}_BLK to
reflect that it computes a "structure-aware" (SA) LU factorization. At each
iteration of the algorithm, the panel of b columns consisting of (U11; D1) is
factored using the LAPACK unblocked algorithm LU^LAP_UNB. (In our implementation
this algorithm is modified to also take advantage of the zeroes below the
diagonal of U11.) As part of the factorization, U11 is overwritten by {L̄1\Ū11}.
However, in order to preserve the strictly lower triangular part of U11 (where
part of the matrix L that was computed in step 1 is stored), we employ the
b × b submatrix L̄1 of the nB × b array L̄ (see Figure 3). As in the LINPACK
blocked algorithm in Figure 2, the LAPACK and LINPACK styles of pivoting
are combined: the columns of the current panel are pivoted using the
LAPACK approach, but the permutations from this factorization are only
applied to (U12; D2).
The cost of this approach is given in step 3 of the column labeled "Structure-Aware
LINPACK procedure" in Table I. The cost difference comes from the
updates of U12 shown in Figure 3 and, provided b ≪ nB, is insignificant
compared to (2/3)n^3.
An SA LAPACK blocked algorithm for step 3 only differs from that in
Figure 3 in that, at a certain iteration, after the LU factorization of the current
panel is computed, these permutations have to be applied to (U10; D0) as
well. As indicated in step 3 of the column labeled "Structure-Aware LAPACK
procedure," this does not incur extra cost for this step. However, it does require
an nB × nB array for storing L̄ (see Figure 4) and, as we will see next, makes
step 4 more expensive. On the other hand, the SA LINPACK algorithm only
requires an nB × b additional workspace for storing the factors, as indicated
in Figure 4.

Fig. 3. SA-LINPACK blocked algorithm for the LU factorization of (U^T, D^T)^T, built upon an
LAPACK blocked factorization.
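The panel loop of Figure 3 can be sketched in NumPy. This is a simplified model (variable names are ours): each panel stacks the b × b diagonal block of U on top of the corresponding columns of D, factors it with partial pivoting (here via SciPy's `lu` rather than the paper's modified LU^LAP_UNB, so the zeroes inside U11 are not exploited), and applies the resulting transform only to (U12; D2):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
nB, nE, b = 8, 3, 2
U = np.triu(rng.standard_normal((nB, nB)) + 4.0 * np.eye(nB))  # upper triangular input
D = rng.standard_normal((nE, nB))

Ubar = U.copy()       # overwritten with Ū panel by panel
Dwork = D.copy()      # trailing columns evolve into the Schur complement
transforms = []       # per panel: (P_k, L11_k, L21_k), i.e. the L̄1 and Ľ1 blocks

for k in range(0, nB, b):
    stacked = np.vstack([Ubar[k:k+b, k:k+b], Dwork[:, k:k+b]])  # only b + nE rows
    P, L, u = lu(stacked)          # partial pivoting over the stacked rows
    L11, L21 = L[:b], L[b:]
    Ubar[k:k+b, k:k+b] = u
    if k + b < nB:
        # update (U12; D2) with inv([L11 0; L21 I]) P^T
        t = P.T @ np.vstack([Ubar[k:k+b, k+b:], Dwork[:, k+b:]])
        top = solve_triangular(L11, t[:b], lower=True, unit_diagonal=True)
        Ubar[k:k+b, k+b:] = top
        Dwork[:, k+b:] = t[b:] - L21 @ top
    transforms.append((P, L11, L21))

# Check: undoing the per-panel transforms on (Ū; 0) reproduces the original (U; D)
Z = np.vstack([Ubar, np.zeros((nE, nB))])
for i, (P, L11, L21) in reversed(list(enumerate(transforms))):
    k = i * b
    wt, wb = Z[k:k+b], Z[nB:]
    v = P @ np.vstack([L11 @ wt, L21 @ wt + wb])
    Z[k:k+b], Z[nB:] = v[:b], v[b:]
assert np.allclose(Z, np.vstack([U, D]))
```

Note that each transform touches only b + nE rows, which is how the zeroes below the diagonal of U outside the b × b diagonal blocks are preserved.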

3.4 Revisiting the Update in Step 4


The same optimizations made in step 3 must now be carried over to the update
of (C; E). The algorithm for this is given in Figure 5. Computation corresponding
to zeroes is avoided, so that the cost of performing the update is
2 nB nE^2 + b nB nE flops, as indicated in step 4 of Table I.
Applying the SA LAPACK blocked algorithm in step 3 destroys the structure
of the lower triangular matrix, which cannot be recovered during the forward
substitution stage in step 4. This explains the additional cost reported for this
variant in Table I.

Fig. 4. Illustration of an iteration of the SA LINPACK blocked algorithm used in step 3 and how
it preserves most of the zeroes in U. The zeroes below the diagonal are preserved, except within
the b × b diagonal blocks, where pivoting will fill below the diagonal. The shaded areas are the
ones updated as part of the current iteration. The fact that U22 is not updated demonstrates
how computation can be reduced. If the SA LAPACK blocked algorithm were used, then nonzeroes
would appear during this iteration in the block marked 0⋆, due to pivoting; as a result, upon
completion, the zeroes in the full strictly lower triangular part of U would be lost.

3.5 Key Contribution


The difference in cost of the three approaches analyzed in Table I is
illustrated in Figure 6, which reports the ratio of the cost of each procedure
to that of the LU factorization with partial pivoting, for a matrix
with nB = 1000, different values of nE, and b = 32. The analysis shows
that the overhead of the SA LINPACK procedure is consistently low. On the
other hand, as nE /n → 1 the cost of the basic procedure, which is initially twice
as expensive as that of the LU factorization with partial pivoting, is decreased.
The SA LAPACK procedure only presents a negligible overhead when nE → 0,
that is, when the dimension of the update is very small.
The key insight of the proposed approach is the recognition that combining
LINPACK- and LAPACK-style pivoting allows one to use a blocked algorithm
while avoiding filling most of the zeroes in the lower triangular part of U.
This, in turn, makes the extra cost of step 4 acceptable. In other words, for
the SA LINPACK procedure, the benefit of higher performance of the blocked
algorithm comes at the expense of a lower-order amount of extra computation.

Fig. 5. SA-LINPACK blocked algorithm for the update of (C^T, E^T)^T, consistent with the SA-LINPACK
blocked LU factorization of (U^T, D^T)^T.

The extra memory for the SA LINPACK procedure consists of an nB × nB upper
triangular matrix and an nB × b array.

4. REMARKS ON NUMERICAL STABILITY


The algorithm for the LU factorization with incremental pivoting carries out
a sequence of row permutations (corresponding to the application of permu-
tations) which are different from those that would be performed in an LU
factorization with partial pivoting. Therefore, the numerical stability of this
algorithm is also different. In this section we provide some remarks on the
stability of the new algorithm. We note that all three procedures described in
the previous section (basic, SA LINPACK, and SA LAPACK) perform the same
sequence of row permutations.
The numerical (backward) stability of an algorithm that computes the LU
factorization of a matrix A depends on the growth factor [Stewart 1998]
\[
\rho = \frac{\|L\|\,\|U\|}{\|A\|}, \qquad (3)
\]

Fig. 6. Overhead cost of the different approaches to compute the LU factorization in Eq. (1) with
respect to the cost of the LU factorization with partial pivoting.

which is basically determined by problem size and pivoting strategy. For example,
the growth factors of complete, partial, and pairwise [Wilkinson 1965,
p. 236] pivoting have been demonstrated to be bounded as ρ_c ≤ n^{1/2}(2 · 3^{1/2} ⋯ n^{1/(n-1)}),
ρ_p ≤ 2^{n-1}, and ρ_w ≤ 4^{n-1}, respectively [Sorensen 1985; Stewart 1998]. Statistical
models and extensive experiments in Trefethen and Schreiber [1990]
showed that, on average, ρ_c ≈ n^{1/2}, ρ_p ≈ n^{2/3}, and ρ_w ≈ n, suggesting that in
practice both partial and pairwise pivoting are numerically stable, and that pairwise
pivoting can be expected to behave only slightly worse numerically than
partial pivoting.
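The growth factor of Eq. (3) is easy to measure experimentally. A small sketch (Eq. (3) leaves the norm unspecified; the infinity norm used here is one common choice, and the factorization shown is plain partial pivoting via SciPy):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
n = 200
A = rng.uniform(0.0, 1.0, (n, n))   # entries uniform in (0.0, 1.0), as in Section 4

P, L, U = lu(A)                     # partial pivoting: A = P @ L @ U
rho = (np.linalg.norm(L, np.inf) * np.linalg.norm(U, np.inf)
       / np.linalg.norm(A, np.inf))
print(rho)                          # modest for partial pivoting on random matrices
```

Since ‖A‖ = ‖PLU‖ ≤ ‖L‖‖U‖ for the infinity norm, ρ ≥ 1 always holds; what matters in practice is how far above 1 it grows with n.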
The new algorithm applies partial pivoting during the factorization of B and
then again in the factorization of (U; D). This can be considered as a blocked
variant of pairwise pivoting. Thus, we can expect an element growth for the
algorithm that is between those of partial and pairwise pivoting. Next we
describe an experiment that provides evidence in support of this observation.
In Figure 7 we report the element growths observed during computation of
the LU factorization of matrices as in Eq. (1), with nB = 100 and dimensions for
E ranging from nE = 5 to 100 using partial, incremental, and pairwise pivoting.
The entries of the matrices are generated randomly, chosen from a uniform dis-
tribution in the interval (0.0, 1.0). The experiment was carried out on an Intel
Xeon processor using Matlab® 7.0.4 (IEEE double-precision arithmetic). The
results report the average element growth for 100 different matrices for each
matrix dimension. The figure shows the growth factor of incremental pivoting
to be smaller than that of pairwise pivoting and to approximate that of par-
tial pivoting. A similar behavior was obtained for other matrix types: uniform
distribution in (−1.0, 1.0), normal distribution with mean 0.0 and deviation
1.0 (N[0.0, 1.0]), symmetric matrices with elements in N[0.0, 1.0], and Toeplitz
matrices with elements in N[0.0, 1.0]. Only for orthogonal matrices with
Haar distribution [Trefethen and Schreiber 1990] did we obtain significantly

Fig. 7. Element growth in LU factorization using different pivoting techniques.

different results. In that case, incremental pivoting attained a smaller element
growth than pairwise pivoting, and both outperformed the element growth of
partial pivoting. Explaining the behavior of this case is beyond the scope of
this work.
For those not sufficiently satisfied with the element growth of incremental
pivoting, we propose performing a few iterative refinement steps on the
solution of A x = b, at a cost of O(n^2) flops per step, as this guarantees stability
at a low computational cost [Higham 2002].

5. PERFORMANCE
In this section we report results for a high-performance implementation of the
SA LINPACK procedure.

5.1 Implementation
The FLAME library (version 0.9) was used to implement a high-performance
LU factorization with partial pivoting and the SA LINPACK procedure. The
benefit of this API is that the code closely resembles the algorithms as they
are presented in Figures 1 through 3 and 5. The performance of the FLAME
LU factorization with partial pivoting is highly competitive with LAPACK and
vendor implementations of this operation.
The implementations can be examined by visiting http://www.cs.utexas.edu/users/flame/Publications/.
For further information on FLAME, visit
www.cs.utexas.edu/users/flame.

5.2 Platform
Performance experiments were performed in double-precision arithmetic on
an Intel Itanium2 (1.5 GHz) processor-based workstation capable of attaining
6 GFLOPS (10^9 flops per second). For reference, the algorithm for the FLAME

Fig. 8. Top: speedup attained when B is not refactored, over LU factorization with partial pivoting
of the entire matrix; bottom: slowdown for the first factorization (when B must also be factored).

LU factorization with partial pivoting delivered 4.8 GFLOPS for a 2000 × 2000
matrix. A block size b = 128 was employed in this procedure for all experi-
ments reported next. The implementation was linked to the GotoBLAS R1.6
basic linear algebra subprograms (BLAS) library [Goto 2004]. The BLAS rou-
tine DGEMM, which is used to compute C := C − A B (C ∈ R^{m×n}, A ∈ R^{m×k},
and B ∈ R^{k×n}), attains the best performance when the common dimension of
A and B, namely k, is equal to 128. Notice that most computation in the SA
LINPACK procedure is cast in terms of this operation, with k = b.
The performance benefits reported on this platform are representative of
those that can be expected on other current architectures.

5.3 Results
In Figure 8 (top) we show the speedup attained when an existing factoriza-
tion of B is reused, by reporting the time required to factor Eq. (1) with
high-performance LU factorization with partial pivoting divided by the time
required to update an existing factorization of B via the SA LINPACK proce-
dure (steps 2 through 5). In that figure, nB = 1000 and nE is varied from 0 to
1000. The results are reported for different choices of the block size b. The

DGEMM operation, in terms of which most computation is cast, attains the
best performance when b = 128 is chosen. However, this generates enough
additional flops that the speedup is higher when b is chosen smaller. When nE
is very small, b = 8 (for steps 2 through 5) yields the best performance. As nE
increases, performance improves by choosing b = 32 (for steps 2 through 5).
The effect of the overhead of the extra computations is demonstrated in
Figure 8 (bottom). There, we report the ratio of the time required by steps 1
through 5 of the SA LINPACK procedure divided by the time required by LU
factorization with partial pivoting of Eq. (1). The results in the figure may be
somewhat disturbing: The algorithm that views the matrix as four quadrants
attains as good or even better performance than the algorithm that views the
matrix as a single unit and performs less computation. The likely explanation
is that the standard LU factorization would also benefit from a variable block
size as the problem size changes, rather than fixing it at b = 128. We did not
further investigate this issue, since we did not want to make raw performance
the primary focus of the article.

6. CONCLUSIONS
We have proposed blocked algorithms for updating an LU factorization. They
have been shown to attain high performance and to greatly reduce the cost
of an update to a matrix for which a partial factorization already exists. The
key insight is the synthesis of LINPACK- and LAPACK-style pivoting. While
some additional computation is required, this is more than offset by the im-
provement in performance that comes from casting computation in terms of
matrix-matrix multiplication.
We acknowledge that the question of the numerical stability of the new
algorithm relative to that of LU factorization with partial pivoting remains
open. Strictly speaking, LU factorization with partial pivoting is itself not
numerically stable, but experience has shown it to be effective in practice.
Theoretical results that rigorously bound the additional element growth are in
order, but are beyond the scope of the present article.

REFERENCES
Bientinesi, P., Gunnels, J. A., Myers, M. E., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. The science of deriving dense linear algebra algorithms. ACM Trans. Math. Softw. 31, 1 (Mar.), 1–26.
Bientinesi, P. and van de Geijn, R. 2006. Representing dense linear algebra algorithms: A farewell to indices. Tech. Rep. FLAME Working Note 17, CS-TR-2006-10, Department of Computer Sciences, The University of Texas at Austin.
Cwik, T., van de Geijn, R., and Patterson, J. 1994. The application of parallel computation to integral equation models of electromagnetic scattering. J. Optic. Soc. Amer. A 11, 4 (Apr.), 1538–1545.
Demmel, J. and Dongarra, J. 2005. LAPACK 2005 prospectus: Reliable and scalable software for linear algebra computations on high end computers. LAPACK Working Note 164 UT-CS-05-546, University of Tennessee. February.
Geng, P., Oden, J. T., and van de Geijn, R. 1996. Massively parallel computation for acoustical scattering problems using boundary element methods. J. Sound Vibra. 191, 1, 145–165.
Goto, K. 2004. TACC software and tools. http://www.tacc.utexas.edu/resources/software/.
Goto, K. and van de Geijn, R. A. 2008. Anatomy of a high-performance matrix multiplication. ACM Trans. Math. Softw. 34, 3 (to appear).
Gunnels, J. A., Henry, G. M., and van de Geijn, R. A. 2001. A family of high-performance matrix multiplication algorithms. In Proceedings of the International Conference on Computational Science (ICCS), Part I, V. N. Alexandrov et al., eds. Lecture Notes in Computer Science, vol. 2073. Springer, 51–60.
Gunter, B. and van de Geijn, R. 2005. Parallel out-of-core computation and updating of the QR factorization. ACM Trans. Math. Softw. 31, 1 (Mar.), 60–78.
Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B., and Ling, P. 1998. Superscalar GEMM-based level 3 BLAS – The on-going evolution of a portable and high-performance library. In Proceedings of the Workshop on Applied Parallel Computing (PARA), Large Scale Scientific and Industrial Problems, B. K. et al., eds. Lecture Notes in Computer Science, vol. 1541. Springer, 207–215.
Higham, N. J. 2002. Accuracy and Stability of Numerical Algorithms, 2nd ed. Society for Industrial and Applied Mathematics, Philadelphia, PA.
Joffrain, T., Quintana-Ortí, E. S., and van de Geijn, R. A. 2005. Rapid development of high-performance out-of-core solvers. In Proceedings of the Workshop on Applied Parallel Computing (PARA 2004), J. Dongarra et al., eds. Lecture Notes in Computer Science, vol. 3732. Springer, 413–422.
Kågström, B., Ling, P., and Loan, C. V. 1995. GEMM-based level 3 BLAS: High-performance model, implementations and performance evaluation benchmark. LAPACK Working Note no. 107 CS-95-315, University of Tennessee. November.
Kågström, B., Ling, P., and Loan, C. V. 1998. GEMM-based level 3 BLAS: High performance model implementations and performance evaluation benchmark. ACM Trans. Math. Softw. 24, 3, 268–302.
Klimkowski, K. and van de Geijn, R. 1995. Anatomy of an out-of-core dense linear solver. In Proceedings of the International Conference on Parallel Processing, vol. III – Algorithms and Applications, 29–33.
Sorensen, D. C. 1985. Analysis of pairwise pivoting in Gaussian elimination. IEEE Trans. Comput. C-34, 3, 274–278.
Stewart, G. W. 1998. Matrix Algorithms. Volume I: Basic Decompositions. SIAM, Philadelphia, PA.
Toledo, S. 1997. Locality of reference in LU decomposition with partial pivoting. SIAM J. Matrix Anal. Appl. 18, 4, 1065–1081.
Toledo, S. 1999. A survey of out-of-core algorithms in numerical linear algebra. In External Memory Algorithms and Visualization, J. Abello and J. S. Vitter, eds. American Mathematical Society Press, Providence, RI, 161–180.
Toledo, S. and Gustavson, F. 1996. The design and implementation of SOLAR, a portable library for scalable out-of-core linear algebra computations. In Proceedings of the 4th Workshop on I/O in Parallel and Distributed Systems, 28–40.
Trefethen, L. N. and Schreiber, R. S. 1990. Average-case stability of Gaussian elimination. SIAM J. Matrix Anal. Appl. 11, 3, 335–360.
Wilkinson, J. H. 1965. The Algebraic Eigenvalue Problem. Oxford University Press, London.
Yip, E. L. 1979. Fortran subroutines for out-of-core solutions of large complex linear systems. Tech. Rep. CR-159142, NASA.

Received August 2006; revised December 2007; accepted December 2007
