
DIRECT METHODS

YUWEN LI

The textbook of this course is “Applied Numerical Linear Algebra” by James Demmel [2]. Interested readers may also consult [3, 4], which are well-known references on numerical linear algebra and matrix analysis.
The central topic is the numerical solution of linear systems of equations
by direct methods (Gaussian elimination) and iterative methods (Gauss-
Seidel methods, Jacobi methods, Krylov space methods). Related topics
include least-squares problems, matrix factorization (LU, Cholesky, QR fac-
torization), numerical eigenvalue problems, fast solvers (Fast Fourier Trans-
form and multigrid) etc.

1. Vector and matrix norms


To quantify the error in numerical linear algebra, we need to introduce
norms for vectors and matrices.

1.1. Vector Norms.


Definition 1.1. We say ∥ • ∥ : Rn → R is a norm on Rn if it satisfies the
following three conditions:
(1) ∥u∥ ≥ 0 and ∥u∥ = 0 =⇒ u = 0;
(2) ∥αu∥ = |α|∥u∥, ∀α ∈ R;
(3) ∥u + v∥ ≤ ∥u∥ + ∥v∥.
For example, the Euclidean norm is defined to be
  ∥u∥ = (u · u)^{1/2} = (u_1^2 + ⋯ + u_n^2)^{1/2},   u = (u_1, . . . , u_n) ∈ R^n.
In analysis, the following ℓ^p norm is also very useful:
  ∥u∥_p = (∑_i |u_i|^p)^{1/p} = (|u_1|^p + ⋯ + |u_n|^p)^{1/p},   u ∈ R^n,
which is a discretization of the function norm (∫ |•|^p dx)^{1/p}.
Theorem 1.1. For 1 ≤ p ≤ ∞, ∥ • ∥_p is a norm on R^n.
Proof. It suffices to prove
  ∥u + v∥_p ≤ ∥u∥_p + ∥v∥_p,

Address: [email protected]; School of Mathematical Sciences, Zhejiang University.



the so-called Minkowski inequality.



  ∥u + v∥_p^p = ∑_i |u_i + v_i| |u_i + v_i|^{p−1}
              ≤ ∑_i |u_i| |u_i + v_i|^{p−1} + ∑_i |v_i| |u_i + v_i|^{p−1}
              = |u| · |u + v|^{p−1} + |v| · |u + v|^{p−1},
where |u| denotes the vector (|u_1|, . . . , |u_n|) and |u + v|^{p−1} the vector with entries |u_i + v_i|^{p−1}. To proceed, we need to recall the Hölder inequality
  |u · v| ≤ ∥u∥_p ∥v∥_q,   1/p + 1/q = 1.
It then follows that
  |u| · |u + v|^{p−1} ≤ ∥u∥_p ∥ |u + v|^{p−1} ∥_q = ∥u∥_p (∑_i |u_i + v_i|^p)^{1/q} = ∥u∥_p ∥u + v∥_p^{p/q},
  |v| · |u + v|^{p−1} ≤ ∥v∥_p ∥ |u + v|^{p−1} ∥_q = ∥v∥_p (∑_i |u_i + v_i|^p)^{1/q} = ∥v∥_p ∥u + v∥_p^{p/q}.
Combining the previous results and dividing both sides by ∥u + v∥_p^{p/q} = ∥u + v∥_p^{p−1} (the cases ∥u + v∥_p = 0 and p = 1, ∞ being trivial) completes the proof. □
The normed space (R^n, ∥ • ∥_p) is denoted by ℓ^p. A special example of the ℓ^p norm is the ℓ^∞ norm
  ∥u∥_∞ = max_{1≤i≤n} |u_i|.
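In MATLAB, these vector ℓ^p norms can be evaluated with the built-in norm function; the following lines are a quick sanity check (an illustration, not part of the theory):

u = [3; -4; 12];
norm(u,1)      % |3| + |4| + |12| = 19
norm(u,2)      % sqrt(9 + 16 + 144) = 13
norm(u,4)      % (3^4 + 4^4 + 12^4)^(1/4)
norm(u,Inf)    % max |u_i| = 12, the limit of norm(u,p) as p grows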

Exercise: Prove the Jensen inequality: for a convex function f (namely f″ ≥ 0) and 0 ≤ λ_i ≤ 1 with ∑_i λ_i = 1, it holds that
  f(∑_i λ_i x_i) ≤ ∑_i λ_i f(x_i).

Exercise: Use the Jensen inequality to prove the Hölder inequality.


Exercise: Prove that ∥ • ∥_p is not a norm when 0 < p < 1.
Exercise: Prove that lim_{p→∞} ∥u∥_p = ∥u∥_∞ and that ∥u∥_p is a non-increasing function of p.
On Rn , we say two different norms ∥ • ∥X and ∥ • ∥Y are equivalent if
there exist positive constants C1 and C2 such that
C1 ∥u∥Y ≤ ∥u∥X ≤ C2 ∥u∥Y , ∀u ∈ Rn ,
and in this case we use the notation
∥u∥X ≂ ∥u∥Y .
For example, we have the following comparison:
  ∥u∥_2 ≤ ∥u∥_1 ≤ √n ∥u∥_2.

In fact, all norms on finite-dimensional linear spaces are equivalent.


Theorem 1.2 (Equivalence of vector norms). Any two norms ∥ • ∥_X and ∥ • ∥_Y on R^n are equivalent: there exist positive constants C_1 and C_2 such that
  C_1 ∥u∥_Y ≤ ∥u∥_X ≤ C_2 ∥u∥_Y,   ∀u ∈ R^n.
Proof. It suffices to prove that for any norm ∥ • ∥_X there exist C_1, C_2 > 0 such that
  C_1 ∥u∥_2 ≤ ∥u∥_X ≤ C_2 ∥u∥_2,   ∀u ∈ R^n.
This inequality is equivalent to
  C_1 ≤ ∥u∥_X ≤ C_2,   ∀u ∈ R^n with ∥u∥_2 = 1,
which is guaranteed by the continuity of ∥ • ∥_X : R^n → R as a function and the compactness of the unit sphere {u ∈ R^n : ∥u∥_2 = 1}.
We show the continuity in 2d (n = 2). Let e_1 = (1, 0), e_2 = (0, 1). Then
  |∥u∥_X − ∥v∥_X| ≤ ∥u − v∥_X
                  = ∥(u_1 − v_1)e_1 + (u_2 − v_2)e_2∥_X
                  ≤ |u_1 − v_1| ∥e_1∥_X + |u_2 − v_2| ∥e_2∥_X
                  ≤ (|u_1 − v_1|^2 + |u_2 − v_2|^2)^{1/2} (∥e_1∥_X + ∥e_2∥_X).
The (uniform) continuity of ∥ • ∥_X then follows from the ϵ-δ definition. □
1.2. Matrix norms. Similarly, we can define the norm of m × n matrices.
Definition 1.2. We say ∥ • ∥ : R^{m×n} → R is a norm on R^{m×n} if it satisfies the following three conditions:
(1) ∥A∥ ≥ 0 and ∥A∥ = 0 =⇒ A = O;
(2) ∥αA∥ = |α| ∥A∥, ∀α ∈ R;
(3) ∥A + B∥ ≤ ∥A∥ + ∥B∥.
A common matrix norm is the Frobenius norm
  ∥A∥_F = (∑_{i,j} a_{ij}^2)^{1/2}.

Exercise: Prove that ∥A∥_F^2 = tr(A^⊤A).
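A quick numerical check of this identity in MATLAB (a sanity check, not a proof):

A = randn(4,3);                                 % any rectangular matrix
[norm(A,'fro')^2, trace(A'*A), sum(A(:).^2)]    % the three values agree up to round-off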


Given that R^n and R^m are equipped with norms ∥ • ∥_{R^n} and ∥ • ∥_{R^m}, this pair of norms naturally induces a matrix norm on R^{m×n} by
  ∥A∥_{R^n→R^m} := sup_{0≠u∈R^n} ∥Au∥_{R^m}/∥u∥_{R^n} = sup_{u∈R^n, ∥u∥_{R^n}=1} ∥Au∥_{R^m},
which is also called the operator norm or the induced norm. In particular,
  ∥A∥_p := sup_{0≠u∈R^n} ∥Au∥_p/∥u∥_p = sup_{u∈R^n, ∥u∥_p=1} ∥Au∥_p
is the operator norm induced by the vector ℓ^p norm.



Question: There is a hidden issue in the definition of operator norms: it is not clear a priori whether the supremum used for defining ∥A∥_{R^n→R^m} is finite. The next theorem shows that
  sup_{u∈R^n, ∥u∥_{R^n}=1} ∥Au∥_{R^m}
is indeed a finite number.
Theorem 1.3. The operator norm is well-defined, namely,
  sup_{u∈R^n, ∥u∥_{R^n}=1} ∥Au∥_{R^m} ≤ C
for some constant C.

Proof. This theorem follows from the fact that ∥A • ∥_{R^m} : R^n → R is a continuous function and that the unit sphere {u ∈ R^n : ∥u∥_{R^n} = 1} is compact. □

Examples: For a matrix A, ∥A∥_2 is called the ℓ^2 operator norm or the spectral norm of A. For p = 1, ∞, 2 we have
  ∥A∥_1 = max_{1≤j≤n} ∑_{i=1}^{m} |a_{ij}|,
  ∥A∥_∞ = max_{1≤i≤m} ∑_{j=1}^{n} |a_{ij}|,
  ∥A∥_2 = √(λ_max(A^⊤A)).

The last equality can be proved using the following important theorem.
Theorem 1.4. For a symmetric and positive semi-definite matrix A, we have
  λ_max(A) = sup_{x≠0} (x^⊤Ax)/(x^⊤x),
  λ_min(A) = inf_{x≠0} (x^⊤Ax)/(x^⊤x).

Proposition 1.1. An operator norm ∥ • ∥ of matrices satisfies
(1) ∥I∥ = 1;
(2) ∥Au∥ ≤ ∥A∥∥u∥;
(3) ∥AB∥ ≤ ∥A∥∥B∥ (sub-multiplicativity).
As a consequence, ∥ • ∥_F is not an operator norm (for n ≥ 2, since ∥I∥_F = √n ≠ 1).
Exercise: Prove that ∥A∥2 = 1 if A is an orthogonal matrix.
Exercise: For 1 ≤ p ≤ ∞ and 1/p + 1/q = 1, prove that ∥A∥_p = ∥A^⊤∥_q.
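The formulas above (and the last exercise) can be checked numerically in MATLAB, where norm(A,p) returns the induced ℓ^p operator norm for p = 1, 2, ∞; a small illustration:

A = [1 -2 3; -4 5 -6];                  % any 2-by-3 example
[norm(A,1),   max(sum(abs(A),1))]       % largest absolute column sum
[norm(A,Inf), max(sum(abs(A),2))]       % largest absolute row sum
[norm(A,2),   sqrt(max(eig(A'*A)))]     % spectral norm
[norm(A,1),   norm(A',Inf)]             % consistent with ||A||_p = ||A^T||_q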

2. Condition number
The main topic of this chapter is to solve the linear system of equations Ax = b by numerical methods on a computer. However, this process can never be exact due to machine error. The IEEE 754 format of floating point numbers is shown in Figure 1; each digit in Figure 1 is either 0 or 1. Figure 1a uses 32 bits (single precision) to represent the real number
  (−1)^s (1.b_1 b_2 ⋯ b_23)_2 × 2^{(a_1 a_2 ⋯ a_8)_2 − 127}.
Figure 1b uses 64 bits (double precision) to represent the real number
  (−1)^s (1.b_1 b_2 ⋯ b_52)_2 × 2^{(a_1 a_2 ⋯ a_11)_2 − 1023}.
In addition, when a_1 = a_2 = ⋯ = a_8 = 1 (a_1 = a_2 = ⋯ = a_11 = 1 in double precision), the floating point number is interpreted as ∞ or NaN (Not a Number); when a_1 = a_2 = ⋯ = a_8 = 0 (a_1 = a_2 = ⋯ = a_11 = 0 in double precision), it is interpreted as 0 or a subnormal (denormalized) number. Therefore a normalized/standard nonzero floating point number cannot have exponent 11111111_2 (11111111111_2 in double precision) or 00000000_2 (00000000000_2 in double precision).

[Figure 1. IEEE 754 floating point numbers. (a) Single precision: sign bit s, exponent bits a_1 … a_8, fraction bits b_1 … b_23. (b) Double precision: sign bit s, exponent bits a_1 … a_11, fraction bits b_1 … b_52.]

Exercise: Rewrite the decimal number (10.375)_10 as a binary number. Then transform it into an IEEE 754 single-precision floating point number.
For example, 0.4 cannot be exactly stored in a computer. Mathematically speaking, we may actually solve some perturbed equation
  Âx̂ = b̂,   i.e.,   (A + δA)(x + δx) = b + δb,
where δA, δb are very small perturbations. We want to bound the difference δx in terms of δA and δb.
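A minimal MATLAB illustration of this round-off effect (IEEE 754 double precision is assumed here, so the relevant relative machine error is 2^{−53} rather than the single-precision value):

fprintf('%.20f\n', 0.4)   % prints 0.40000000000000002220...: only the nearest double is stored
eps                       % 2^-52, the spacing between 1 and the next larger double;
                          % the relative machine error (unit roundoff) is eps/2 = 2^-53
0.1 + 0.2 == 0.3          % returns false, a familiar consequence of round-off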

2.1. Perturbation Analysis I. We want to bound the relative error ∥δx∥/∥x̂∥ or ∥δx∥/∥x∥ in terms of the relative perturbation ∥δA∥/∥A∥ (and ∥δb∥/∥b∥). Subtracting Ax = b from the perturbed equation gives
  A δx + (δA)x + (δA)(δx) = δb,   i.e.,   δx = A^{−1}(−(δA)x̂ + δb).
As a result,
  ∥δx∥ ≤ ∥A^{−1}∥ (∥δA∥ ∥x̂∥ + ∥δb∥),
  ∥δx∥/∥x̂∥ ≤ ∥A∥ ∥A^{−1}∥ (∥δA∥/∥A∥ + ∥δb∥/(∥A∥ ∥x̂∥))
            ≤ ∥A∥ ∥A^{−1}∥ (∥δA∥/∥A∥ + ∥δb∥/∥b̂∥).
The factor ∥A∥ ∥A^{−1}∥ measures the sensitivity of the solution with respect to small relative perturbations of the coefficient matrix A. It is called the condition number
  κ(A) := ∥A∥ ∥A^{−1}∥.

2.2. Perturbation Analysis II. The original and perturbed equations can be manipulated in a different way:
  (A + δA)δx = −(δA)x + δb,
  δx = (A + δA)^{−1}(−(δA)x + δb) = (I + A^{−1}δA)^{−1}A^{−1}(−(δA)x + δb).
Then the relation between the relative error and the relative perturbations of A, b is as follows:
  ∥δx∥/∥x∥ ≤ ∥(I + A^{−1}δA)^{−1}∥ ∥A^{−1}∥ (∥δA∥ + ∥δb∥/∥x∥)
           ≤ (1/(1 − ∥A^{−1}∥∥δA∥)) ∥A∥ ∥A^{−1}∥ (∥δA∥/∥A∥ + ∥δb∥/(∥x∥∥A∥))
           ≤ (κ(A)/(1 − κ(A)∥δA∥/∥A∥)) (∥δA∥/∥A∥ + ∥δb∥/∥b∥).
Here we used the inequality (∥ • ∥ should be an operator norm)
  ∥(I − B)^{−1}∥ ≤ 1/(1 − ∥B∥),   ∥B∥ < 1.
When κ(A)∥δA∥/∥A∥ ≪ 1 (in particular when ∥δA∥/∥A∥ ≪ 1/κ(A)), we have
  ∥δx∥/∥x∥ ≲ κ(A) (∥δA∥/∥A∥ + ∥δb∥/∥b∥).
Here B_1 ≲ B_2 means B_1 ≤ CB_2 with a positive constant C = O(1) of mild size.
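The bound can be observed in a small experiment. The following MATLAB sketch (an illustration under the stated assumption κ(A)∥δA∥/∥A∥ ≪ 1, using a moderately ill-conditioned test matrix) compares the actual relative error with κ(A)(∥δA∥/∥A∥ + ∥δb∥/∥b∥) in the 2-norm:

n  = 6;
A  = hilb(n);                      % moderately ill-conditioned test matrix
x  = ones(n,1);
b  = A*x;
dA = 1e-10*randn(n);               % small perturbations of A and b
db = 1e-10*randn(n,1);
xh = (A + dA)\(b + db);            % solution of the perturbed system
relerr = norm(x - xh)/norm(x);
bound  = cond(A)*(norm(dA)/norm(A) + norm(db)/norm(b));
fprintf('relative error %.2e <= bound %.2e\n', relerr, bound)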

Exercise: Prove the above inequality
  ∥(I − B)^{−1}∥ ≤ 1/(1 − ∥B∥),   ∥B∥ < 1.

2.3. Perturbation analysis III. The following result says that 1/κ(A) is the smallest relative perturbation of A (under the norm ∥ • ∥_2) that makes the perturbed matrix singular.
Theorem 2.1. We have
  min_{δA} { ∥δA∥_2/∥A∥_2 : A + δA is singular } = 1/(∥A∥_2 ∥A^{−1}∥_2) = 1/κ(A).
Proof. Throughout this proof, ∥ • ∥ = ∥ • ∥_2. We need to show that
  min_{δA} { ∥δA∥ : A + δA is singular } = ∥A^{−1}∥^{−1}.
If ∥δA∥ < ∥A^{−1}∥^{−1}, then ∥A^{−1}δA∥ ≤ ∥A^{−1}∥∥δA∥ < 1, which implies that I + A^{−1}δA is invertible and hence A + δA = A(I + A^{−1}δA) is invertible. Therefore, we have
  min_{δA} { ∥δA∥ : A + δA is singular } ≥ ∥A^{−1}∥^{−1}.
On the other hand, there exists a vector u with ∥u∥ = 1 such that ∥A^{−1}u∥ = ∥A^{−1}∥. Let
  v = A^{−1}u/∥A^{−1}u∥ = A^{−1}u/∥A^{−1}∥
and δA = −uv^⊤/∥A^{−1}∥. Direct calculation shows that
  (A + δA)v = u/∥A^{−1}∥ − u∥v∥^2/∥A^{−1}∥ = 0,
so A + δA is singular, and
  ∥δA∥ = √(λ_max(δA^⊤δA)) = ∥A^{−1}∥^{−1} √(λ_max(vu^⊤uv^⊤)) = ∥A^{−1}∥^{−1}∥u∥∥v∥ = ∥A^{−1}∥^{−1}.
The second part of the proof confirms
  min_{δA} { ∥δA∥ : A + δA is singular } ≤ ∥A^{−1}∥^{−1},
which together with the first part completes the proof. □

2.4. Examples of ill-conditioned matrices. The condition number under the ℓ^2 norm (called the spectral condition number) of a symmetric and positive-definite matrix A can be computed by the formula
  κ(A) = λ_max(A)/λ_min(A).
A simple ill-conditioned matrix is
  A = [ 1 0; 0 ϵ ],   ϵ ≪ 1.
Its condition number is κ(A) = 1/ϵ ≫ 1.

Exercise: Compute the ℓ^1, ℓ^2, ℓ^∞ condition numbers of the following matrix
  A = [ 1 ϵ; 0 1 ].

Another famous ill-conditioned matrix is the Hilbert matrix H = (h_{ij}) ∈ R^{n×n} with
  h_{ij} = 1/(i + j − 1).
The Hilbert matrix is of the following form:
  H = [ 1    1/2  1/3  1/4  ⋯
        1/2  1/3  1/4  1/5  ⋯
        1/3  1/4  1/5  1/6  ⋯
        1/4  1/5  1/6  1/7  ⋯
        ⋮    ⋮    ⋮    ⋮    ⋱ ].
We note that h_{ij} is constant whenever i + j is fixed, e.g.,
  h_12 = h_21,   h_13 = h_22 = h_31,   h_14 = h_23 = h_32 = h_41,   . . .
Therefore, H is a special Hankel matrix.
Exercise: Prove that H is a positive-definite matrix.
It can be proved that the spectral condition number of H satisfies
  κ(H) = O((1 + √2)^{4n}/√n).
Due to this ill-conditioning, the linear system
  Hx = b
is extremely hard to solve accurately!
Theorem 2.2. The relative machine error ϵ is 2^{−24} ≈ 6 × 10^{−8} under single precision and 2^{−53} ≈ 1.1 × 10^{−16} under double precision.
Proof. Assume a single-precision number x = (−1)^s(1 + f) × 2^{e−127} with f = (0.b_1 b_2 ⋯ b_23)_2 ∈ [0, 1). Rounding a real number to the nearest value of this form changes the significand 1 + f by at most half of the last bit, i.e., by at most 2^{−24}, while 1 + f ≥ 1. Hence the relative rounding error is at most 2^{−24}. The double-precision case is analogous with 52 fraction bits, which gives 2^{−53}. □
Therefore, when inputting H and the right-hand side b in double precision, the relative errors ∥δH∥/∥H∥ and ∥δb∥/∥b∥ caused by the inexact representation of real numbers are around 10^{−16}. In this case
  ∥δx∥/∥x∥ ≲ κ(H) (∥δH∥/∥H∥ + ∥δb∥/∥b∥) ≲ κ(H) × 10^{−16}.
(1) When n = 6, κ(H) ≈ 1.5 × 10^7 and ∥δx∥/∥x∥ ≲ 1.5 × 10^{−9}, within acceptable accuracy;
(2) When n = 11, κ(H) ≈ 5.2 × 10^{14} and ∥δx∥/∥x∥ ≲ 5.2 × 10^{−2}, not so good;
(3) When n = 12, κ(H) ≈ 1.6 × 10^{16} and ∥δx∥/∥x∥ ≲ 1.6, which is a pretty pessimistic and unacceptable upper bound!

In MATLAB, hilb is a built-in function to generate Hilbert matrices. Try the following code in the command window of MATLAB.

H = hilb(12);      % generate a Hilbert matrix
x = rand(12,1);    % generate the exact solution
b = H*x;           % generate the right hand side
xhat = H \ b;      % calculate the numerical solution
x - xhat           % compare the exact and numerical solutions

Listing 1. An example of a Hilbert matrix

2.5. Condition numbers of general problems. In numerical analysis, the condition number of a function measures how much the output value of the function can change for a small change in the input argument. This is used to measure how sensitive a function is to changes or errors in the input, and how much error in the output results from an error in the input.
Suppose we need to evaluate
  y = f(x),
where f = (f_1, . . . , f_n) : R^n → R^n is a differentiable function. Recall that
  δf(x) = f(x̂) − f(x) ≈ J(x)(x̂ − x),
where J(x) = (∂f_i/∂x_j)_{1≤i,j≤n} is the Jacobian matrix of f at x. As a result, ∥f(x̂) − f(x)∥ ≈ ∥J(x)∥ ∥x̂ − x∥ in the worst case, and the absolute condition number at x is
  lim_{ϵ→0} sup_{∥δx∥≤ϵ} ∥δf(x)∥/∥δx∥ = ∥J(x)∥.

The relative condition number at x is
  κ = κ(x) = lim_{ϵ→0} sup_{∥δx∥≤ϵ} (∥δf(x)∥/∥f(x)∥)/(∥δx∥/∥x∥) = ∥J(x)∥ ∥x∥/∥f(x)∥.

For example, when y = A^{−1}x we have J(x) = A^{−1} and
  κ = ∥A^{−1}∥ ∥x∥/∥A^{−1}x∥ ≤ ∥A^{−1}∥ ∥A∥ = κ(A).
Therefore, the matrix condition number is an upper bound of the general condition number.
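A quick numerical check of this comparison (2-norm; inv(A) is formed explicitly only because the matrix is tiny):

A = hilb(5);
x = randn(5,1);
kappa_x = norm(inv(A))*norm(x)/norm(A\x);   % ||A^{-1}|| ||x|| / ||A^{-1}x||
kappa_A = cond(A);                          % ||A|| ||A^{-1}||
fprintf('kappa(x) = %.3e <= kappa(A) = %.3e\n', kappa_x, kappa_A)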

3. Direct methods for solving linear systems


In linear algebra, we have already learned Gaussian elimination for solving the linear system of equations Ax = b. This process is essentially equivalent to an LU decomposition of A combined with a reordering of the rows of A.

3.1. LU decomposition. We say P is a permutation matrix if it is obtained by reordering the rows of the identity matrix. Each entry of P is either 0 or 1, and each row and each column of P contains exactly one 1.
Theorem 3.1 (Gauss Elimination with Partial Pivoting, GEPP). Let A be a
nonsingular matrix. There exist a permutation matrix P , a lower triangular
matrix L with unit diagonal entries, and an upper triangular matrix U such
that P A = LU .
Proof. From basic linear algebra, we know that there exist elementary ma-
trices P1 , . . . , PJ (for row swapping or just identity) and lower triangular
L1 , . . . , LJ for adding some multiple of one row with smaller index to an-
other row with bigger index, such that
LJ PJ · · · L2 P2 L1 P1 A = U
where U is a nonsingular upper triangular matrix. Let
L̃j = PJ · · · Pj+1 Lj Pj+1 · · · PJ .
We can rewrite the equation as
L̃J · · · L̃1 PJ · · · P1 A = U.
Setting P = P_J ⋯ P_1 and
  L = (L̃_J ⋯ L̃_1)^{−1} = L̃_1^{−1} ⋯ L̃_J^{−1}
completes the proof. □
The following theorem is a column analogue of Theorem 3.1; its proof is similar and omitted.
Theorem 3.2. Let A be a nonsingular matrix. There exist a permutation
matrix Q, a lower triangular matrix L with unit diagonal entries, and an
upper triangular matrix U such that AQ = LU .
Once the LU decomposition P A = LU is available, one can solve Ax = b
in the following way:
(1) Solve Ly = P b for y;
(2) Solve U x = y for x.
Here the cost of solving the n × n triangular linear system is n2 .
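In MATLAB, this two-step solve can be written as follows; here [L,U,P] = lu(A) returns the GEPP factorization with PA = LU, the matrix A is the one from the example below, and b is an arbitrary right-hand side chosen for illustration:

A = [0 5 22/3; 4 2 1; 2 7 9];
b = [1; 2; 3];
[L, U, P] = lu(A);
y = L\(P*b);        % step (1): forward substitution for L y = P b
x = U\y;            % step (2): backward substitution for U x = y
norm(A*x - b)       % residual at the level of machine precision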
Example: Consider the following matrix
  A = [ 0 5 22/3; 4 2 1; 2 7 9 ].
Swapping the 1st and 2nd rows and then adding −1/2 of the new 1st row to the new 3rd row yields (4 is the pivot, the entry with largest absolute value in the 1st column)
  A^(0) = [ 4 2 1; 0 5 22/3; 2 7 9 ],   A^(1) = [ 4 2 1; 0 5 22/3; 0 6 8.5 ].

These two operations correspond to
  P_1 = [ 0 1 0; 1 0 0; 0 0 1 ],   L_1 = [ 1 0 0; 0 1 0; −1/2 0 1 ].

Swapping the 2nd and 3rd rows of A^(1) and then adding −5/6 of the new 2nd row to the new 3rd row yields (6 is the pivot in A^(1), the entry with largest absolute value in the 1st column of the lower right 2 × 2 block)
  A^(1) = [ 4 2 1; 0 6 8.5; 0 5 22/3 ],   A^(2) = U = [ 4 2 1; 0 6 8.5; 0 0 0.25 ].

These two operations correspond to
  P_2 = [ 1 0 0; 0 0 1; 0 1 0 ],   L_2 = [ 1 0 0; 0 1 0; 0 −5/6 1 ].
Therefore, L̃_2 = L_2,
  P = P_2P_1 = [ 0 1 0; 0 0 1; 1 0 0 ],   L̃_1 = P_2L_1P_2 = [ 1 0 0; −1/2 1 0; 0 0 1 ].

Easy calculation shows that
  L = L̃_1^{−1} L̃_2^{−1} = [ 1 0 0; 1/2 1 0; 0 0 1 ] [ 1 0 0; 0 1 0; 0 5/6 1 ] = [ 1 0 0; 1/2 1 0; 0 5/6 1 ].

An algorithmic description of the above procedure is as follows:
  P = [ 1 0 0; 0 1 0; 0 0 1 ] →(R1↔R2) [ 0 1 0; 1 0 0; 0 0 1 ] →(R2↔R3) [ 0 1 0; 0 0 1; 1 0 0 ],
  L = [ 1 0 0; 0 1 0; 0 0 1 ] → [ 1 0 0; 0 1 0; 1/2 0 1 ] → [ 1 0 0; 1/2 1 0; 0 0 1 ] → [ 1 0 0; 1/2 1 0; 0 5/6 1 ].

The pseudo-code of GEPP is given in Algorithm 1 below. The total number of floating point operations of GEPP is
  ∑_{i=1}^{n−1} ( ∑_{j=i+1}^{n} 1 + ∑_{j=i+1}^{n} ∑_{k=i+1}^{n} 2 ) = (2/3) n^3 + O(n^2).

Algorithm 1 GEPP
for i = 1 : n − 1 do
permute rows so that |aii | is the largest in |A(i : n, i)|;
permute L(i : n, 1 : i − 1) accordingly;
for j = i + 1 : n do
lji = aji /aii ;
end for
for j = i : n do
uij = aij ;
end for
for j = i + 1 : n do
for k = i + 1 : n do
ajk = ajk − lji uik ;
end for
end for
end for
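The following is a straightforward (unoptimized) MATLAB transcription of Algorithm 1, returning a permutation vector p with A(p, :) = LU; it is only an illustrative sketch, not production code:

function [L,U,p] = gepp(A)
% Gaussian elimination with partial pivoting: A(p,:) = L*U.
n = size(A,1);
p = (1:n)';
L = eye(n);
for i = 1:n-1
    [~,m] = max(abs(A(i:n,i)));  m = m + i - 1;           % pivot row
    A([i m],:)     = A([m i],:);                          % swap rows of A
    L([i m],1:i-1) = L([m i],1:i-1);                      % swap computed part of L
    p([i m])       = p([m i]);                            % record the permutation
    L(i+1:n,i)     = A(i+1:n,i)/A(i,i);                   % multipliers
    A(i+1:n,i:n)   = A(i+1:n,i:n) - L(i+1:n,i)*A(i,i:n);  % update the active block
end
U = triu(A);
end

Saving this as gepp.m and calling [L,U,p] = gepp([0 5 22/3; 4 2 1; 2 7 9]) reproduces the factors L and U of the worked example above, with A(p,:) − L*U at the level of round-off.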

3.1.1. Effect of pivoting. To show the importance of pivoting, we consider the linear system Ax = b, where
  A = [ −ϵ 1; 1 1 ],   b = [ 1; 1 ],   0 < ϵ ≪ 1.
The exact solution is x = (0, 1)^⊤. Without machine round-off error, a standard Gaussian elimination (without pivoting) implies
  A = LU = [ 1 0; −1/ϵ 1 ] [ −ϵ 1; 0 1 + 1/ϵ ].
Let fl be the round-off process in a computer under single-precision floating point arithmetic, and let ϵ = 2^{−24} ≈ 6 × 10^{−8} be the relative machine precision. Then
  L̂ = [ 1 0; fl(−1/ϵ) 1 ] = [ 1 0; −1/ϵ 1 ],
  Û = [ fl(−ϵ) 1; 0 fl(1 + 1/ϵ) ] = [ −ϵ 1; 0 1/ϵ ],
where fl(1 + 1/ϵ) = 1/ϵ. As a result,
  L̂Û = Â = [ −ϵ 1; 1 0 ] ≠ A = [ −ϵ 1; 1 1 ].
Using this decomposition, we obtain the numerical solution x̂ = (1, 1 + ϵ)^⊤. The relative error is unacceptable:
  ∥x − x̂∥_2/∥x∥_2 = √(1 + ϵ^2) > 100%.
In contrast, GEPP yields
  PA = LU,   P = [ 0 1; 1 0 ],   L = [ 1 0; −ϵ 1 ],   U = [ 1 1; 0 1 + ϵ ],
and the rounding process leads to
  L̂ = [ 1 0; fl(−ϵ) 1 ] = [ 1 0; −ϵ 1 ],   Û = [ 1 1; 0 fl(1 + ϵ) ] = [ 1 1; 0 1 ],
  Â = L̂Û = [ 1 1; −ϵ 1 − ϵ ] ≈ PA = [ 1 1; −ϵ 1 ].
The numerical solution is x̂ = (−ϵ, 1 + ϵ)^⊤, which is very accurate:
  ∥x − x̂∥_2/∥x∥_2 = √2 ϵ ≪ 1.
3.1.2. Characterization of LU factorization. In the end, we give a theorem
characterizing matrices that can be factorized into LU without permutation.
Theorem 3.3. Let A be a square matrix. Then A = LU for some unit
lower triangular matrix L and nonsingular upper triangular matrix U if and
only if all leading principal minors of A are nonzero.
Proof. We only prove the direction ⇐= by induction; the converse direction is easier. Assume the statement is true for all (n − 1) × (n − 1) matrices. When A is n × n and all leading principal minors of A are nonzero, we partition it as
  A = [ A_11 α; β^⊤ a_nn ].
By the induction hypothesis, we have A_11 = L_11 U_11 for some unit lower triangular matrix L_11 and nonsingular upper triangular U_11. We seek a factorization of the form
  A = [ A_11 α; β^⊤ a_nn ] = [ L_11 0; l^⊤ 1 ] [ U_11 u; 0 u_nn ].
The equation is true whenever
  u = L_11^{−1} α,   l = U_11^{−⊤} β,   u_nn = a_nn − l^⊤ u.
Here u_nn = a_nn − β^⊤ A_11^{−1} α ≠ 0 because det(A) = det(A_11) u_nn ≠ 0. □
3.1.3. Matrix inverse and determinant. Computing A^{−1} is equivalent to solving the following n linear systems
  Ax_1 = e_1, . . . , Ax_n = e_n.
Given the factorization PA = LU, the cost of solving each system, e.g., Ax_1 = e_1, is about 2n^2 (two triangular solves). Therefore, the total computational cost of forming A^{−1} is about 2n^3. The determinant also follows from the factorization: since det(P) = ±1 and det(L) = 1, we have det(A) = ±u_11 u_22 ⋯ u_nn, with the sign determined by the parity of the permutation P.
3.1.4. Gaussian Elimination with Complete Pivoting. In a few cases, even GEPP is not numerically stable, and it is better to permute both rows and columns of the active submatrix to select a pivot. This strategy is called Gaussian Elimination with Complete Pivoting (GECP), with pseudo-code given in Algorithm 2.
GECP permutes both rows and columns of a matrix A so that the absolute value of the pivot used in Gaussian elimination is the largest among the entries of the active block. The output is PAQ = LU, where P and Q are permutation matrices.

Algorithm 2 GECP
for i = 1 : n − 1 do
permute rows and cols so that |aii | is the largest in |A(i : n, i : n)|;
permute L(i : n, 1 : i − 1) accordingly;
for j = i + 1 : n do
lji = aji /aii ;
end for
for j = i : n do
uij = aij ;
end for
for j = i + 1 : n do
for k = i + 1 : n do
ajk = ajk − lji uik ;
end for
end for
end for

The drawback of GECP is that each step of the for-loop requires an O(n^2) search for the pivot, so the total cost of complete pivoting is O(n^3), already of the same order as the number of floating point operations in Gaussian elimination.

3.2. Positive-definite systems. For symmetric and positive-definite (SPD) matrices, there is a special decomposition method that is cheaper than the LU decomposition.
Theorem 3.4 (Cholesky decomposition). Let A be a symmetric and positive-
definite matrix. There exists a lower triangular matrix L with positive diag-
onal entries such that A = LL⊤ .
Proof. We prove it by induction. Assume the statement is true for (n − 1) × (n − 1) SPD matrices. When A is n × n, we partition it as
  A = [ A_11 α; α^⊤ a_nn ].
By the induction hypothesis, we have A_11 = L_11 L_11^⊤ for some lower triangular matrix L_11 with positive diagonal entries. We seek a factorization of the form
  A = [ A_11 α; α^⊤ a_nn ] = [ L_11 0; β^⊤ l_nn ] [ L_11^⊤ β; 0 l_nn ].
The equation is true whenever β = L_11^{−1} α and l_nn = (a_nn − β^⊤ β)^{1/2}. Here a_nn − β^⊤ β = a_nn − α^⊤ A_11^{−1} α > 0 because det(A_11) > 0 and det(A) = det(A_11)(a_nn − α^⊤ A_11^{−1} α) > 0. □

Algorithm 3 Cholesky decomposition
for j = 1 : n do
    l_jj = (a_jj − ∑_{k=1}^{j−1} l_jk^2)^{1/2};
    for i = j + 1 : n do
        l_ij = (a_ij − ∑_{k=1}^{j−1} l_ik l_jk)/l_jj;
    end for
end for
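The following is a direct MATLAB transcription of Algorithm 3 (an illustrative sketch assuming A is SPD; in practice one calls the built-in chol):

function L = cholesky_lower(A)
% Cholesky factor of an SPD matrix A, computed column by column: A = L*L'.
n = size(A,1);
L = zeros(n);
for j = 1:n
    L(j,j) = sqrt(A(j,j) - L(j,1:j-1)*L(j,1:j-1)');
    for i = j+1:n
        L(i,j) = (A(i,j) - L(i,1:j-1)*L(j,1:j-1)')/L(j,j);
    end
end
end

For example, with A = hilb(5) the output L agrees with chol(A,'lower') and A − L*L' is at the level of round-off.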

The algorithmic procedure is illustrated in Figure 2. The count of floating point operations of the Cholesky decomposition is
  ∑_{j=1}^{n} ( 2j + ∑_{i=j+1}^{n} 2j ) = (1/3) n^3 + O(n^2).

[Figure 2. Cholesky decomposition algorithm: the factor L is filled in column by column (Step 1, Step 2, Step 3, . . .).]

Corollary 3.1 (LDL decomposition). Let A be a symmetric and positive-definite matrix. There exist a unit lower triangular matrix L and a positive diagonal matrix D such that A = LDL^⊤.
To analyze the numerical error of the Cholesky decomposition, we need to take a close look at the round-off error of floating point operations. It is true that
  fl(a ⊚ b) = (a ⊚ b)(1 + δ),   |δ| ≤ ϵ,
where ⊚ is one of +, −, ∗, /, and ϵ is the relative machine precision. It is also easy to prove that
  fl(∑_{i=1}^{d} a_i b_i) = ∑_{i=1}^{d} a_i b_i (1 + δ_i),   |δ_i| ≤ dϵ + O(ϵ^2).

We say a numerical method for solving Ax = b is forward stable if
  ∥x − x̂∥/∥x∥ ≲ ϵ.

We say a numerical method is backward stable if the numerical solution x̂ is the exact solution of the modified problem (A + δA)x̂ = b + δb, where
  ∥δA∥/∥A∥ + ∥δb∥/∥b∥ ≲ ϵ.
It turns out that Cholesky decomposition without pivoting is already nu-
merically stable.
Theorem 3.5. Solving a SPD system Ax = b by Cholesky decomposition
is backward stable. In particular, the Cholesky numerical solution x̂ is the
exact solution of the perturbed problem
(A + δA)x̂ = b,
where the relative perturbation is small: ∥δA∥∞ /∥A∥∞ ≲ ϵ.
Proof. For i > j we have
  l̂_ij = (1 + δ′)(1 + δ″) ( a_ij − ∑_{k=1}^{j−1} (1 + δ_k) l̂_ik l̂_jk ) / l̂_jj.
Reformulating the above equation leads to
  a_ij = (1/((1 + δ′)(1 + δ″))) l̂_ij l̂_jj + ∑_{k=1}^{j−1} (1 + δ_k) l̂_ik l̂_jk
       = ∑_{k=1}^{j} l̂_ik l̂_jk + ∑_{k=1}^{j} l̂_ik l̂_jk δ_k,   δ_j := 1/((1 + δ′)(1 + δ″)) − 1.
Here |δ_1|, . . . , |δ_{j−1}| ≤ (j − 1)ϵ + O(ϵ^2) and |δ_j| ≤ 2ϵ + O(ϵ^2). Meanwhile, for each j recall that
  l̂_jj = (1 + δ′)(1 + δ)^{1/2} ( a_jj − ∑_{k=1}^{j−1} l̂_jk^2 (1 + δ_k) )^{1/2},   |δ_k| ≤ jϵ + O(ϵ^2),  |δ|, |δ′| ≤ ϵ.
Rewriting it leads to
  a_jj = (1/((1 + δ′)^2(1 + δ))) l̂_jj^2 + ∑_{k=1}^{j−1} l̂_jk^2 (1 + δ_k).
Because the δ_k are very small, we further have
  ∑_{k=1}^{j} l̂_jk^2 ≲ a_jj.
In matrix notation, we have
  A = L̂L̂^⊤ + E,
where the perturbation matrix is bounded as
  |E_ij| ≤ nϵ ∑_k |l̂_ik| |l̂_jk| ≤ nϵ ( ∑_k l̂_ik^2 )^{1/2} ( ∑_k l̂_jk^2 )^{1/2} ≲ nϵ √(a_ii) √(a_jj).
As a result, ∥E∥_∞ ≲ n^2 ϵ ∥A∥_∞, and the numerical solution of Ax = b is the exact solution of the perturbed system
  (A − E)x̂ = b.
The proof is complete. □
Combining the backward stability with the perturbation property of Ax = b, we deduce that the relative error is about
  ∥x − x̂∥_∞/∥x∥_∞ ≲ κ(A) ∥δA∥_∞/∥A∥_∞ ≲ n^2 ϵ κ(A).
3.3. A posteriori error estimation. The aforementioned backward stability analysis and perturbation analysis are a priori error analyses. In practice, these error bounds are often pessimistic (the multiplicative constant is not sharp at all). On the other hand, the residual r(x̂) = b − Ax̂ is a very useful quantity for measuring the numerical error. For example,
  ∥x − x̂∥ = ∥A^{−1}r(x̂)∥ ≤ ∥A^{−1}∥ ∥r(x̂)∥.
Once an efficient upper bound ∥A^{−1}∥ ≤ C(A) is available, we obtain the a posteriori error estimate
  ∥x − x̂∥ ≤ C(A) ∥r(x̂)∥.
The term “a posteriori” means that this bound depends on the computed numerical solution. When ∥ • ∥ = ∥ • ∥_1, ∥A^{−1}∥_1 is obtained by solving the ℓ^1-constrained convex maximization problem
  ∥A^{−1}∥_1 = max_{∥y∥_1=1} ∥A^{−1}y∥_1,
which can be solved approximately by gradient ascent.
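MATLAB's condest estimates the ℓ^1 condition number ∥A∥_1∥A^{−1}∥_1 by (essentially) this kind of maximization without forming A^{−1}. A sketch of the resulting a posteriori check follows; note that it uses an estimate of ∥A^{−1}∥_1, so the reported bound is itself an estimate rather than a guaranteed bound:

A  = hilb(8);  xs = ones(8,1);  b = A*xs;
xh = A\b;                          % computed solution
r  = b - A*xh;                     % residual
C  = condest(A)/norm(A,1);         % estimate of ||A^{-1}||_1
fprintf('a posteriori estimate %.2e, true error %.2e\n', C*norm(r,1), norm(xs - xh,1))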


Exercise: Given a matrix B, compute the gradient of f (x) = ∥Bx∥1 .

3.4. Sparse linear systems. When A ∈ R^{n×n} is a sparse matrix, it is desirable to factorize it as PAQ = LU (or A = LL^⊤), where L and U are also as sparse as possible. We say a matrix is sparse if most entries of A are zeros and the number of nonzeros in A is O(n). We often pursue algorithms for sparse linear systems Ax = b with computational cost O(n) or O(n log n).
For sparse matrices, Gaussian elimination (as well as Cholesky decomposition) must be designed very carefully to preserve the sparsity structure. In fact, a careless elimination process would create more and more nonzeros (known as fill-in) and eventually lead to dense factors L and U. In this case, solving Ax = b costs at least O(n^2) operations; see Figures 3-5.

[Figure 3. Sparse pattern of a 49 × 49 SPD matrix A; nz = 217.]

[Figure 4. Sparse pattern of the standard Cholesky factor L of A (without permutation, A = LL^⊤); nz = 397.]

It turns out that the number of fill-ins created during the Gaussian elim-
ination (as well as Cholesky decomposition) heavily depends on the order of
unknowns, i.e., appropriate permutation of rows and columns of the sparse
matrix, see Figures 4 and 5 for example. A popular technique for determin-
ing the order (also used in MATLAB) is called the Approximate Minimum
Degree (AMD) algorithm [1]. Once such an order represented by the per-
mutation matrix P is available, one can solve P AP ⊤ y = P b (with x = P ⊤ y)
by Cholesky decomposition. Sophisticated high-performance algorithms for
solving sparse problems are coded in the package UMFPACK (Unsymmetric
Multi-frontal Package).
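The effect of reordering can be reproduced in MATLAB. The sketch below uses the standard 5-point Laplacian test matrix delsq(numgrid('S',9)), a 49 × 49 sparse SPD matrix of the same kind as the one in Figure 3 (this particular choice is our assumption, not taken from the text), and compares the fill-in of the Cholesky factor with and without an AMD reordering:

A  = delsq(numgrid('S',9));     % 49-by-49 sparse SPD matrix
L1 = chol(A,'lower');           % Cholesky factor without reordering
p  = amd(A);                    % approximate minimum degree ordering
L2 = chol(A(p,p),'lower');      % Cholesky factor after permutation
fprintf('nnz(A) = %d, nnz(L) = %d, nnz(L permuted) = %d\n', nnz(A), nnz(L1), nnz(L2))
spy(L2)                         % visualize the sparsity pattern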

[Figure 5. Sparse pattern of the Cholesky factor L of A (with permutation P such that P^⊤AP = LL^⊤); nz = 255.]

As an example, consider the 9 × 9 sparse and symmetric matrix
  A_9 = [  4   0   0  −1   0  −1   0  −1  −1
           0   4   0  −1   0  −1   0   0   0
           0   0   4   0   0   0   0  −1  −1
          −1  −1   0   4  −1   0   0   0   0
           0   0   0  −1   4   0   0  −1   0
          −1  −1   0   0   0   4  −1   0   0
           0   0   0   0   0  −1   4   0  −1
          −1   0  −1   0  −1   0   0   4   0
          −1   0  −1   0   0   0  −1   0   4 ].
Its adjacency graph is given in Figure 6.

[Figure 6. Adjacency graph G(A_9) of A_9, with vertices 1, . . . , 9.]

Each vertex of G(A_9) corresponds to one row of A_9. Vertices v_i and v_j in G(A_9) are connected by an edge if and only if the entry a_ij of A_9 is nonzero.
Eliminating one row and column of A_9 (say the i-th) corresponds to eliminating one vertex (v_i) of G(A_9) and yields a smaller graph G_8. In G_8, the previous neighbors of v_i in G(A_9) are pairwise connected by edges. The Minimum Degree algorithm chooses to eliminate a vertex of G(A_9) having the smallest degree, obtains the smaller graph G_8, and recursively applies this minimum-degree procedure to G_8, and so on. Such an order of vertices corresponds to a permutation matrix P. In practice, it is often the case that the Cholesky factor of P^⊤AP is quite sparse.
3.4.1. MATLAB backslash. In MATLAB, the ‘\’ command for solving x = A\b invokes an algorithm which depends upon the structure of the matrix A and includes (low-overhead) checks on properties of A. In particular, the MATLAB backslash \ works as follows.
(1) If A is an upper or lower triangular matrix, employ a backward or forward substitution algorithm.
(2) If A is symmetric and has real positive diagonal elements, attempt a Cholesky factorization. If A is sparse, employ a reordering first to minimize fill-in.
(3) If A is banded, employ a banded solver.
(4) If none of the criteria above is fulfilled, do a general triangular factorization using Gaussian elimination with partial pivoting.
(5) If A is sparse, then employ the UMFPACK library.
(6) If A is not square, employ algorithms based on QR factorization for under- or overdetermined systems.
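As a rule of thumb, one should call the backslash solver rather than forming inv(A) explicitly; a small timing and accuracy comparison (illustrative only, the timings depend on the machine):

n = 2000;
A = randn(n) + n*eye(n);            % a well-conditioned dense test matrix
b = randn(n,1);
tic; x1 = A\b;      t1 = toc;       % factorization-based solve
tic; x2 = inv(A)*b; t2 = toc;       % explicit inverse: slower and less accurate
fprintf('backslash %.3fs, inv %.3fs, relative difference %.1e\n', t1, t2, norm(x1-x2)/norm(x1))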

References
[1] Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. An approximate minimum
degree ordering algorithm. SIAM J. Matrix Anal. Appl., 17(4):886–905, 1996.
[2] James W. Demmel. Applied numerical linear algebra. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia, PA, 1997.
[3] Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins Studies
in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, fourth
edition, 2013.
[4] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge University Press,
Cambridge, second edition, 2013.
