Math 6610 - Analysis of Numerical Methods I
1 Introduction 5
1.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Orthogonal Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Norms and Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Cauchy-Schwarz and Holder Inequalities . . . . . . . . . . . . . . . . . . 12
1.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Systems of Equations 67
4.1 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Cholesky Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Eigenvalue Problems 91
6.1 Eigenvalue-Revealing Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Geometric and Algebraic Multiplicity . . . . . . . . . . . . . . . . . . . . 91
6.1.2 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.3 Unitary Diagonalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.4 Schur Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.5 Localising Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Eigenvalue Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Shortcomings of Obvious Algorithms . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Rayleigh Quotient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Power iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.4 Inverse Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.5 Rayleigh Quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 101
Abstract: These notes are largely based on the course Math 6610: Analysis of Numerical Methods I, taught by Yekaterina Epshteyn in Fall 2016 at the University of Utah. Additional examples, remarks, or results from other sources are added as I see fit, mainly to facilitate my understanding. These notes are by no means guaranteed to be accurate or complete, and any mistakes here are of course my own. Please report any typographical errors or mathematical fallacies to me by email at [email protected]
Chapter 1
Introduction
We review some basic facts about linear algebra, in particular viewing matrix-vector multiplication as a linear combination of the columns of the matrix; this plays an important role in understanding the key ideas behind many algorithms of numerical linear algebra. We review orthogonality, upon which many of the best algorithms are based. Finally, we discuss vector norms and matrix norms, as these provide a way of measuring approximation error and the convergence of numerical algorithms.
It is not too difficult to see that the matrix-vector product can also be viewed as a linear combination of the columns {a1, . . . , an} of A, i.e.

b = Ax = \sum_{j=1}^{n} xj aj.    (1.1.1)
This easily generalises to the matrix-matrix product B = AC, in which each column of B is a linear combination of the columns of A. More precisely, if A ∈ Cm×l and C ∈ Cl×n, then B ∈ Cm×n with

bij = \sum_{k=1}^{l} aik ckj   for each i = 1, . . . , m, j = 1, . . . , n,

or equivalently, writing bk for the kth column of B and aj for the jth column of A,

bk = \sum_{j=1}^{l} cjk aj   for each k = 1, . . . , n.
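Equation (1.1.1) and the column interpretation of B = AC are easy to check numerically. The following small sketch uses Python with NumPy (not part of the original notes; the matrix sizes and random data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# b = Ax as a linear combination of the columns of A, cf. (1.1.1)
b_direct = A @ x
b_columns = sum(x[j] * A[:, j] for j in range(A.shape[1]))
print(np.allclose(b_direct, b_columns))          # True

# Each column of B = AC is A times the corresponding column of C.
C = rng.standard_normal((3, 5))
B = A @ C
print(np.allclose(B[:, 2], A @ C[:, 2]))         # True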
Example 1.1.1. The outer product is the product of a column vector u ∈ Cm and a row vector v∗ (with v ∈ Cn), which gives a rank-one matrix A = uv∗ ∈ Cm×n. Symbolically,

A = uv∗ = [v1 u | v2 u | · · · | vn u],

i.e. every column of A is a multiple of u.
(b) The range R(A) of A is the set of vectors y ∈ Cm such that y = Ax for some x ∈ Cn . It
is clear from (1.1.1) that R(A) is the vector space spanned by columns of A:
R(A) = span{a1 , . . . , an }.
(c) The column rank of A is the dimension of its column space. The row rank of A is the
dimension of its row space.
It can be shown that the column rank is always equal to the row rank of a matrix. Thus,
the rank of a matrix is well-defined. A matrix A ∈ Cm×n of full rank is one that has the
maximal possible rank min{m, n}. This means that a matrix of full rank with m ≥ n must
have n linearly independent columns.
Theorem 1.1.3. A matrix A ∈ Cm×n with m ≥ n has full rank if and only if it maps no two
distinct vectors to the same vector.
Proof. Suppose A is of full rank; then its columns {a1, . . . , an} form a linearly independent set of vectors in Cm. Suppose Ax = Ay; we need to show that x = y, but this is true since

A(x − y) = 0 =⇒ \sum_{j=1}^{n} (xj − yj) aj = 0 =⇒ xj − yj = 0 for each j = 1, . . . , n.
Conversely, suppose A maps no two distinct vectors to the same vector. To show that A is of full rank, it suffices to prove that its columns {a1, . . . , an} are linearly independent in Cm. Suppose

\sum_{j=1}^{n} xj aj = 0.
This is equivalent to Ax = 0 with x = (x1 , . . . , xn )∗ ∈ Cn , and we see that x must be the zero
vector. Otherwise there exists a nonzero vector y ∈ Cn such that Ay = 0 = A(0) and this
contradicts the assumption.
(b) rank(A) = m.
(c) R(A) = Cm .
(g) det(A) 6= 0.
When writing the product x = A−1 b, we should understand x as the unique vector that
satisfies the equation Ax = b. This means that x is the vector of coefficients of the unique
linear expansion of b in the basis of columns of A. Multiplication by A−1 is a change of basis
operation. More precisely, if we view b as coefficients of the expansion of b in {e1 , . . . , em },
then multiplication of A−1 results in coefficients of the expansion of b in {a1 , . . . , am }.
Definition 1.2.1.
kxk2 = √(x∗x) = ( \sum_{j=1}^{m} |xj|^2 )^{1/2}.
(c) The cosine of the angle α between x and y can be expressed in terms of the inner product:

cos(α) = x∗y / (kxk2 kyk2).
Remark 1.2.2. Over C, the inner product is sesquilinear: with (x, y) = x∗y, the map y 7→ (z, y) is linear while x 7→ (x, z) is conjugate linear. Over R, the inner product is bilinear.
Definition 1.2.3. A set of nonzero vectors S is said to be orthogonal if its elements are
pairwise orthogonal, that is,
x, y ∈ S, x 6= y =⇒ (x, y) = x∗ y = 0.
Theorem 1.2.4. The vectors in an orthogonal set S are linearly independent. Consequently,
if an orthogonal set S ⊂ Cm contains m vectors, then it is a basis for Cm .
Proof. Suppose, by contradiction, that the set of orthogonal vectors S is not linearly independent. This means that at least one of the vectors vk ∈ S can be written as a non-trivial linear combination of the remaining vectors in S, i.e.

vk = \sum_{j≠k} αj vj.

Taking the inner product of vk against vk and using the orthogonality of the set S gives

(vk, vk) = \sum_{j≠k} (αj vj, vk) = 0,

which contradicts the assumption that the vectors in S are nonzero.
We see that r is the part of v orthogonal to {q1 , q2 , . . . , qn } and for every j = 1, 2, . . . , n, (qj , v)qj
is the part of v in the direction of qj .
If {qj } is a basis for Cm , then n must be equal to m and r must be the zero vector, so v is
completely decomposed into m orthogonal components in the direction of the qj . In (1.2.1), we
see that we have two different expressions. In the first case, we view v as a sum of coefficients
qj∗ v times vectors qj . In the second case, we view v as a sum of orthogonal projections of v
onto the various directions qj . The jth projection operation is achieved by the very special
rank-one matrix qj qj∗ .
A square matrix Q ∈ Cm×m is unitary (or orthogonal in the real case) if Q∗ = Q−1 ,
that is, Q∗ Q = QQ∗ = Im . In terms of the columns of Q, we have the relation qi∗ qj = δij .
This means that columns of a unitary matrix Q form an orthonormal basis for Cm . In the real
case, multiplication by an orthogonal matrix Q corresponds to a rigid rotation if det(Q) = 1
or reflection if det(Q) = −1.
Lemma 1.2.5. The inner product is invariant under unitary transformation, i.e. for any
unitary matrix Q ∈ Cm×m , (Qx, Qy) = (x, y) for any x, y ∈ Cm . Such invariance means that
angles between vectors and their lengths are preserved under unitary transformation.
Remark 1.2.6. Note that the lemma is still true for any matrix with orthonormal columns.
kxk2 = ( \sum_{j=1}^{n} |xj|^2 )^{1/2}    (l2 norm)

kxkp = ( \sum_{j=1}^{n} |xj|^p )^{1/p},   p ≥ 1.    (lp norm)

For any nonsingular matrix W ∈ Cn×n, we can define the weighted p-norms, given by

kxkW = kW xkp = ( \sum_{i=1}^{n} | \sum_{j=1}^{n} wij xj |^p )^{1/p}.
Definition 1.3.2. Given A ∈ Cm×n, the induced matrix norm kAk is defined as

kAk = sup_{x∈Cn, x≠0} kAxk / kxk = sup_{x∈Cn, kxk=1} kAxk.
To find the kDk2 geometrically, observe that the image of the 2-norm unit sphere under D
is an m-dimensional ellipse whose semiaxis lengths are given by the numbers |dj |. The unit
vectors amplified most by D are those that are mapped to the longest semiaxis of the ellipse,
of length max{|dj|}. Thus, we have that

kDk2 = max_{1≤j≤m} |dj|.
Taking the pth root of each side, and then the supremum over all x ∈ Cm with kxkp = 1, yields the upper bound

kDkp ≤ max_{1≤j≤m} |dj|.

To obtain kDkp ≥ max_{1≤j≤m} |dj|, we choose the standard basis vector x = ek, where k is such that |dk| is the largest diagonal entry in absolute value. Note that kekkp = 1 and

kDkp ≥ kDekkp / kekkp = kDekkp = kdk ekkp = |dk| = max_{1≤j≤m} |dj|.
Lemma 1.3.4. For any A ∈ Cm×n, the induced matrix 1-norm and ∞-norm are equal to the “maximum column sum” and “maximum row sum” of A respectively, i.e.

kAk1 = max_{1≤j≤n} kajk1   and   kAk∞ = max_{1≤i≤m} \sum_{j=1}^{n} |aij|,

where aj denotes the jth column of A.

Proof. For any x with kxk1 = 1, we have kAxk1 = k\sum_j xj ajk1 ≤ \sum_j |xj| kajk1 ≤ max_{1≤j≤n} kajk1, so kAk1 ≤ max_{1≤j≤n} kajk1; the bound kAk∞ ≤ max_{1≤i≤m} \sum_j |aij| is obtained similarly.

To obtain kAk1 ≥ max_{1≤j≤n} kajk1, we choose the standard basis vector x = ek, where k is such that kakk1 is maximal. Note that kekk1 = 1 and

kAk1 ≥ kAekk1 / kekk1 = kAekk1 = kakk1 = max_{1≤j≤n} kajk1.

To obtain kAk∞ ≥ max_{1≤i≤m} \sum_{j=1}^{n} |aij|, choose x = (1, . . . , 1)∗ ∈ Cn. Note that kxk∞ = 1 and

kAk∞ ≥ kAxk∞ / kxk∞ = kAxk∞ = max_{1≤i≤m} \sum_{j=1}^{n} |aij|.
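As a quick numerical sanity check of Lemma 1.3.4 (a sketch in Python/NumPy, not taken from the notes), the maximum column sum and maximum row sum agree with NumPy's built-in induced norms:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))

max_col_sum = max(np.sum(np.abs(A[:, j])) for j in range(A.shape[1]))
max_row_sum = max(np.sum(np.abs(A[i, :])) for i in range(A.shape[0]))

print(np.isclose(max_col_sum, np.linalg.norm(A, 1)))       # induced 1-norm
print(np.isclose(max_row_sum, np.linalg.norm(A, np.inf)))  # induced infinity-norm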
Theorem 1.3.5 (Young’s inequality). Let p, q > 1 be such that 1/p + 1/q = 1. For any two nonnegative real numbers a, b, we have

ab ≤ a^p/p + b^q/q.    (Young)
Proof. Observe that the inequality is trivial if either a or b is zero, so suppose both a and b are positive real numbers. Fix p, q > 1 such that 1/p + 1/q = 1; the constraint on p and q suggests a possible convexity argument. Indeed, using the fact that the exponential function is convex, we have that

ab = exp((1/p) ln a^p + (1/q) ln b^q) ≤ (1/p) exp(ln a^p) + (1/q) exp(ln b^q) = a^p/p + b^q/q.
Proof. Observe that the inequality is trivial if either u or v is the zero vector, so suppose u, v ≠ 0. Choose p, q > 1 such that 1/p + 1/q = 1. Young’s inequality (Young) yields

|aj∗ bj| = |aj||bj| ≤ |aj|^p/p + |bj|^q/q.

Summing over j = 1, . . . , n and using the normalisation (1.3.1), we have

\sum_{j=1}^{n} |aj∗ bj| ≤ 1/p + 1/q = 1.    (1.3.2)

Now, for any nonzero u = (u1, . . . , un)∗, v = (v1, . . . , vn)∗ ∈ Cn, define vectors a = (ã1, . . . , ãn)∗, b = (b̃1, . . . , b̃n)∗ by

ãj = uj / kukp,   b̃j = vj / kvkq   for all j = 1, . . . , n.

By construction, both a and b satisfy (1.3.1), and substituting ãj, b̃j into (1.3.2) yields

(1 / (kukp kvkq)) \sum_{j=1}^{n} |uj∗ vj| ≤ 1 =⇒ | \sum_{j=1}^{n} uj∗ vj | ≤ \sum_{j=1}^{n} |uj vj| ≤ kukp kvkq.

Since u, v were arbitrary nonzero vectors in Cn, this proves Hölder’s inequality.
Taking the supremum over all x ∈ Cn of norm 1, we get kAk2 ≤ kak2. To obtain kAk2 ≥ kak2, choose the particular x = a. Then

kAk2 ≥ kAak2 / kak2 = kak2^2 / kak2 = kak2.
Taking the supremum over all x ∈ Cn of norm 1, we get kAk2 ≤ kuk2 kvk2. To obtain kAk2 ≥ kuk2 kvk2, choose the particular x = v. Then

kAk2 ≥ kAvk2 / kvk2 = ku(v∗v)k2 / kvk2 = kuk2 kvk2^2 / kvk2 = kuk2 kvk2.
Lemma 1.3.9. Let A ∈ Cm×l , B ∈ Cl×n . The induced matrix norm of AB satisfies the
inequality
kABk ≤ kAkkBk
Consequently, the induced matrix norm of A satisfies
Then kABk = 2 but kAkkBk = 1. An important matrix norm which is not induced by any
vector norm is the Hilbert-Schmidt or Frobenius norm, defined by
kAkF = ( \sum_{i=1}^{m} \sum_{j=1}^{n} |aij|^2 )^{1/2},

or equivalently, in terms of the columns aj of A,

kAkF = ( \sum_{j=1}^{n} kajk2^2 )^{1/2}.
Viewing the matrix A ∈ Cm×n as a vector in Cmn , the Frobenius norm can be seen as the usual
l2 norm. Replacing l2 norm with lp norm gives rise to the Schatten p-norm.
Lemma 1.3.10. For any A ∈ Cm×l, B ∈ Cl×n, the Frobenius norm of AB satisfies

kABkF ≤ kAkF kBkF.

Proof. Let C = AB = (cij), where the entries of C are given by cij = ai∗ bj, with ai∗, bj the ith row of A and the jth column of B respectively. The Cauchy-Schwarz inequality gives |cij|^2 ≤ kaik2^2 kbjk2^2, and summing over i and j yields

kABkF^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |cij|^2 ≤ ( \sum_{i=1}^{m} kaik2^2 ) ( \sum_{j=1}^{n} kbjk2^2 ) = kAkF^2 kBkF^2.
Theorem 1.3.11. For any A ∈ Cm×n and unitary matrices Q ∈ Cm×m, V ∈ Cn×n, we have

kQAV k2 = kAk2   and   kQAV kF = kAkF.

Proof. For any x ∈ Cn, let y = V x ∈ Cn. Then x = V ∗y and kxk2 = kV ∗yk2 = kyk2, since unitary transformations preserve k · k2. Consequently, kQAV xk2 = kAV xk2 = kAyk2, and taking the supremum over all x with kxk2 = 1 (equivalently, over all y with kyk2 = 1) gives kQAV k2 = kAk2. The Frobenius norm identity follows from applying the column-wise characterisation of k · kF to QA and the row-wise characterisation to AV.
1.4 Problems
1. Show that if a matrix A is both triangular and unitary, then it is diagonal.
Solution: The statement is trivial if A ∈ Cm×m is both upper and lower triangular, so
suppose A is upper-triangular. This implies that A∗ = A−1 is lower-triangular. The
result then follows if we show that A−1 is also upper-triangular. Since A−1 A = Im×m ,
we have that
[b1 | · · · | bm][a1 | · · · | am] = [e1 | · · · | em],

where aj, bj are the columns of A and A−1 respectively and ej are the standard basis vectors in Cm. Interpreting ej as a linear combination of the columns bi together with the assumption that A is upper-triangular, we obtain the relation

ej = \sum_{i=1}^{m} aij bi = \sum_{i=1}^{j} aij bi   for any j = 1, . . . , m,

that is,

e1 = a11 b1
e2 = a12 b1 + a22 b2
⋮
em = a1m b1 + a2m b2 + . . . + amm bm.

This implies that bij = 0 for all i > j, j = 1, . . . , m, i.e. A−1 is upper-triangular.
2. (a) Prove that the eigenvalues of a Hermitian matrix A ∈ Cm×m are real.

Solution: Let Ax = λx with x ≠ 0. Then

(λx)∗x = (Ax)∗x
λ̄(x∗x) = x∗A∗x = x∗Ax = λ(x∗x),   [A is Hermitian]

and since x∗x > 0, we conclude λ̄ = λ, i.e. λ ∈ R.
(b) Prove that if x and y are eigenvectors corresponding to distinct eigenvalues, then x
and y are orthogonal.
Ax = λx and Ay = µy.
Hence, the eigenvalues of a unitary matrix must lie on the unit circle in C.
Ax = (I + uv ∗ )x = 0 =⇒ uv ∗ x = −x. (1.4.1)
For any nonzero scalars β ∈ C, let x = βu. Substituting this into (1.4.1) yields
5. Let k · k denote any norm on Cm and also the induced matrix norm on Cm×m . Show that
ρ(A) ≤ kAk, where ρ(A) is the spectral radius of A, i.e., the largest absolute value |λ| of
an eigenvalue λ of A.
where we use the assumption that kAk is an induced matrix norm for the inequality.
Dividing each side of the inequality by kxk 6= 0 yields |λ| ≤ kAk. The desired
inequality follows from taking the supremum over all eigenvalues of A.
6. (a) Let N (x) := k · k be any vector norm on Cn (or Rn ). Show that N (x) is a continuous
function of the components x1 , x2 , . . . , xn of x.
and

| kxk − kyk | ≤ kx − yk ≤ \sum_{j=1}^{n} |xj − yj| kejk ≤ kx − yk∞ ( \sum_{j=1}^{n} kejk ).
(b) Prove that if W ∈ Cm×m is an arbitrary nonsingular matrix, and k · k is any norm
on Cm , then kxkW = kW xk is a norm on Cm .
kIk = sup_{x∈Cm, x≠0} kIxk / kxk = sup_{x∈Cm, x≠0} kxk / kxk = 1.
kIn×nkF = ( \sum_{i=1}^{n} \sum_{j=1}^{n} |aij|^2 )^{1/2} = ( \sum_{j=1}^{n} |1|^2 )^{1/2} = √n.
(c) Show that Frobenius norm is not induced by any vector norm.
Chapter 2

Matrix Decomposition and Least Squares Problems

Matrix decomposition has been of fundamental importance in the modern sciences. In the context of numerical linear algebra, matrix decomposition serves the purpose of rephrasing a task that may be relatively difficult to solve in its original form, for instance solving linear systems, as a series of easier subproblems. In the context of applied statistics, matrix decomposition offers a way of obtaining a low-rank approximation to a large “data” matrix containing numerical observations; this is crucial in understanding the structure of the matrix, in particular exploring and identifying relationships within the data. In this chapter, we will study the singular value decomposition (SVD) and the QR factorisation, and demonstrate how to solve linear least squares problems using these decompositions.
Σ̂ = diag(σ1, σ2, . . . , σn) ∈ Rn×n,
V = [v1 | v2 | · · · | vn] ∈ Cn×n.
(a) {u1, u2, . . . , un} and {v1, v2, . . . , vn} are the left and right singular vectors of A; the columns of Û are orthonormal, V is unitary, and Û∗Û = V ∗V = In;

(b) {σj}_{j=1}^{n} are the singular values of A, with σ1 ≥ σ2 ≥ . . . ≥ σn ≥ 0. These are the lengths of the n principal semiaxes of the hyperellipse in the case of a real matrix A.

(c) These singular vectors and singular values satisfy the relation

Avj = σj uj,   j = 1, . . . , n.    (2.1.1)
Example 2.1.1. Consider any matrix A ∈ C2×2 . It is clear that H = A∗ A ∈ C2×2 is Hermitian.
Moreover, for any x ∈ C2 we have
Hence, AV = U Σ =⇒ A = U ΣV ∗ .
In terms of (v1, v2) coordinates, the vector (cos θ, sin θ) gets mapped to (z1, z2) = (σ1 cos θ, σ2 sin θ) in (u1, u2) coordinates. Moreover,

(z1/σ1)^2 + (z2/σ2)^2 = cos^2 θ + sin^2 θ = 1,
i.e. S is transformed into an ellipse. We claim that kAk2 = σ1. On one hand, using the orthonormality of {u1, u2}, for any unit vector x = cos θ v1 + sin θ v2 we obtain

kAxk2^2 = kσ1 cos θ u1 + σ2 sin θ u2k2^2 = σ1^2 cos^2 θ + σ2^2 sin^2 θ ≤ σ1^2.

On the other hand, choosing x = v1 gives kAv1k2^2 = kσ1 u1k2^2 = σ1^2.
We see that the image of unit circle under A is an ellipse in the 2-dimensional subspace of
Rm defined by span{u1 , u2 }. If A ∈ Rm×n is of full rank with m ≥ n, then the image of the
unit sphere in Rn under A is a hyperellipsoid in Rm .
Σ = [ Σ̂ stacked on top of (m − n) zero rows 0∗ ] ∈ Rm×n,

V = [v1 | v2 | · · · | vn] ∈ Cn×n.
Note that in full SVD form, Σ has the same size as A, and U, V are unitary matrices.
Proof. The statement is trivial if A is the zero matrix, so assume A ≠ 0. Let σ1 = kAk2 > 0. There exists v1 ∈ Cn such that kv1k2 = 1 and kAv1k2 = kAk2 = σ1; such a v1 exists since the induced matrix norm is by definition a maximisation of a continuous functional (in this case the norm) over a compact nonempty subset of Cn, namely the unit sphere. Define u1 = Av1/σ1 ∈ Cm; clearly u1 ≠ 0 and ku1k2 = 1 by construction.
Extend u1 and v1 to unitary matrices U1 = [u1, Û1] ∈ Cm×m and V1 = [v1, V̂1] ∈ Cn×n. We then have (writing 2 × 2 block matrices row by row, rows separated by semicolons):

A1 := U1∗AV1 = [u1∗; Û1∗] A [v1, V̂1] = [u1∗Av1, u1∗AV̂1; Û1∗Av1, Û1∗AV̂1] = [σ1 u1∗u1, w∗; σ1 Û1∗u1, Â] = [σ1, w∗; 0, Â],

where w∗ := u1∗AV̂1 and Â := Û1∗AV̂1, using u1∗u1 = 1 and Û1∗u1 = 0. One can show that w = 0, and by the induction hypothesis Â = U2Σ2V2∗ has an SVD, so that

U1∗AV1 = [σ1, 0∗; 0, Â] = [σ1, 0∗; 0, U2Σ2V2∗] = [1, 0∗; 0, U2] [σ1, 0∗; 0, Σ2] [1, 0∗; 0, V2∗].
Setting

U = U1 [1, 0∗; 0, U2] = [u1, Û1] [1, 0∗; 0, U2] = [u1, Û1U2] ∈ Cm×m,

V = V1 [1, 0∗; 0, V2] = [v1, V̂1] [1, 0∗; 0, V2] = [v1, V̂1V2] ∈ Cn×n.

Since a product of unitary matrices is unitary, we only need to show that the vector u1 is orthogonal to each column u2, . . . , um of the matrix Û1U2, but this must be true since {u1, u2, . . . , um} is an orthonormal basis by construction. A similar argument shows that V is also unitary.
Remark 2.1.3. In the case m ≤ n, we simply consider the SVD of the conjugate transpose A∗. If A is singular with rank r < min{m, n}, the full SVD is still appropriate. What changes is that only r of the left singular vectors uj (rather than n) are determined by the geometry of the hyperellipse. To construct the unitary matrices U and V, we introduce an additional (m − r) and (n − r) arbitrary orthonormal columns respectively.
It is well known that a nondefective square matrix can be expressed as a diagonal matrix Λ of eigenvalues, if the range and domain are represented in a basis of eigenvectors. The SVD generalises this fact to any matrix A ∈ Cm×n: A reduces to the diagonal matrix Σ when the range is expressed in the basis of columns of U and the domain
is expressed in the basis of columns of V . More precisely, any b ∈ Cm can be expanded in the
basis of columns {u1 , . . . , um } of U and any x ∈ Cn can be expanded in the basis of columns
{v1 , . . . , vn } of V . The coordinate vectors for these expansions are
b = Ub′ ⇐⇒ b′ = U∗b   and   x = Vx′ ⇐⇒ x′ = V∗x.

Hence,

b = Ax ⇐⇒ U∗b = U∗Ax = U∗UΣV∗x = ΣV∗x ⇐⇒ b′ = Σx′.
There are fundamental differences between the SVD and the eigenvalue decomposition.
(a) SVD uses two different bases (the sets of left and right singular vectors), whereas the
eigenvalue decomposition uses just one (the eigenvectors).
(b) SVD uses orthonormal bases, whereas the eigenvalue decomposition uses a basis that
generally is not orthogonal.
(c) Not all matrices (even square ones) have an eigenvalue decomposition, but all matrices
(even rectangular ones) have a SVD.
(d) In practice, eigenvalues tend to be relevant to problems involving the behaviour of iter-
ated forms of A, such as matrix powers An or matrix exponentials etA , whereas singular
vectors tend to be relevant to problems involving the behaviour of A itself, or its inverse.
Theorem 2.1.4. The rank of A is r, the number of nonzero singular values. Moreover,

R(A) = span{u1, . . . , ur}   and   N(A) = span{vr+1, . . . , vn}.

Proof. Since U, V are unitary, they have full rank. Thus rank(A) = rank(Σ), which is the number of nonzero diagonal entries of Σ. For any x ∈ Cn, we have Ax = UΣV∗x = UΣy, where y = V∗x ranges over all of Cn as x does. The claim about R(A) then follows from the fact that R(Σ) = span{e1, . . . , er}. To find the nullspace of A, expanding Az = 0 yields

Az = UΣV∗z = 0 =⇒ ΣV∗z = 0   since U is of full rank,

from which we deduce that N(A) = span{vr+1, . . . , vn}, since N(Σ) = span{er+1, . . . , en}.
Theorem 2.1.5. kAk2 = σ1 and kAkF = √(σ1^2 + σ2^2 + . . . + σr^2).

Proof. Since k · k2 and k · kF are both invariant under unitary transformations, we have that

kAk2 = kUΣV∗k2 = kΣk2 = max_{1≤j≤p} |σj| = σ1,

and

kAkF = kUΣV∗kF = kΣkF = √(σ1^2 + . . . + σr^2).
Theorem 2.1.6. The nonzero singular values of A are the square roots of the nonzero eigen-
values of A∗ A or AA∗ . (These matrices have the same nonzero eigenvalues.)
Proof. Observe that A∗ A ∈ Cn×n is similar to Σ∗ Σ since
A∗ A = (U ΣV ∗ )∗ (U ΣV ∗ ) = V ΣT U ∗ U ΣV ∗ = V (ΣT Σ)V ∗ ,
and hence has the same n eigenvalues. Σ∗ Σ is a diagonal matrix with p eigenvalues σ12 , . . . , σp2
and n−p additional zero eigenvalues if n > p. A similar calculation applies to the m eigenvalues
of AA∗ .
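Theorems 2.1.5 and 2.1.6 can be verified numerically; a minimal sketch in Python/NumPy (random test matrix, not from the notes):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

sigma = np.linalg.svd(A, compute_uv=False)    # singular values, nonincreasing
eigs = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues of A*A, nonincreasing

print(np.allclose(sigma, np.sqrt(np.maximum(eigs, 0.0))))                # Theorem 2.1.6
print(np.isclose(np.linalg.norm(A, 2), sigma[0]))                        # kAk2 = sigma1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2))))   # Theorem 2.1.5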
Theorem 2.1.7. If A = A∗ , then the singular values of A are the absolute values of the
eigenvalues of A.
Proof. Since A is Hermitian, it has an eigendecomposition of the form A = QΛQ∗ for some
unitary matrix Q and real diagonal matrix Λ consisting of eigenvalues λj of A. We rewrite it
as
A = QΛQ∗ = Q|Λ|sign(Λ)Q∗ ,
where |Λ| and sign(Λ) denote the diagonal matrices whose entries are |λj | and sign(λj ) respec-
tively. Since sign(Λ)Q∗ is unitary whenever Q is unitary, the expression above is an SVD of A,
with the singular values equal to the diagonal entries of |Λ|. These can be put into nonincreas-
ing order by inserting suitable permutation matrices as factors in Q and sign(Λ)Q∗ if required.
Theorem 2.1.8. For A ∈ Cm×m, |det(A)| = \prod_{j=1}^{m} σj.

Proof. Using the fact that unitary matrices have determinant of absolute value 1, we obtain

|det(A)| = |det(U)| |det(Σ)| |det(V∗)| = |det(Σ)| = \prod_{j=1}^{m} σj.
Σj = diag(0, . . . , 0, σj , 0, . . . , 0).
There are many ways to decompose A into rank-one matrices, but (2.1.2) has a deeper property: its νth partial sum captures as much of the energy of A as possible, in the sense of either the 2-norm or the Frobenius norm.
and since the sum of the dimensions of W and Z exceeds n, there must be a nonzero vector in
W ∩ Z and we arrive at a contradiction.
The MATLAB command for computing the reduced and full SVD is [U,S,V] = svd(A,0)
and [U,S,V] = svd(A) respectively.
2.2 Projectors
Projection is an important concept in designing algorithms for certain linear algebra problems.
Geometrically, projection is a generalisation of graphical projection. In functional analysis,
a projection P is a bounded linear operator such that P 2 = P ; in finite-dimensional vector
space, P is a square matrix in Cn×n and it is said to be idempotent. Observe that if y ∈ R(P ),
then y = P x for some x ∈ Cn and
P y = P P x = P 2 x = P x = y.
P (P y − y) = P 2 y − P y = P y − P y = 0.
[Figure: an oblique projection, showing a vector y, its image P y in R(P ), and the difference P y − y.]
v ∗ (P x − x) = v ∗ (uv ∗ x) − v ∗ x = v ∗ x(v ∗ u − 1) = 0.
(b) P Q = P (I − P ) = 0.
Q2 = (I − P )2 = I 2 − 2P + P 2 = I − 2P + P = I − P = Q.
P x = 0 =⇒ Qx = x − P x = x ∈ R(Q) =⇒ N (P ) ⊂ R(Q).
Suppose y ∈ R(Q),
y = Qy = y − P y =⇒ P y = 0 =⇒ y ∈ N (P ) =⇒ R(Q) ⊂ N (P ).
Combining these two set inequalities show the first equation in (c). The second equation in (c)
now follows from applying the previous result to I − P :
Theorem 2.2.2 actually shows that a projector decomposes Cn into subspaces R(P ) and
N (P ) such that Cn = R(P ) ⊕ N (P ). Such a pair are said to be complementary subspaces.
Indeed, suppose x = P x + z, then
v = v − P v = (I − P )v = 0,
x∗1 x2 = (P x)∗ (I − P )y = x∗ P ∗ (I − P )y 6= 0.
[Figure: an orthogonal projection, where P y − y is orthogonal to R(P ).]
Proof. If P = P ∗ , then
P ∗ (I − P ) = P (I − P ) = P − P 2 = 0,
and it follows from the algebraic definition that P is orthogonal. Conversely, suppose P is an
orthogonal projector, then
P ∗ (I − P ) = 0, or P ∗ = P ∗ P.
Consider the minimal rank SVD of P = Ur Σr Vr∗ , where r ≤ n is the rank of P , Ur∗ Ur = Ir =
Vr∗ Vr and Σr is nonsingular. Substituting the SVD of P into P ∗ = P ∗ P yields
we obtain

PQ = [q1, . . . , qr, 0, . . . , 0] =⇒ Q∗PQ = [q1∗; . . . ; qn∗] [q1, . . . , qr, 0, . . . , 0] = [Ir, 0; 0, 0_{n−r}] = Σ.
Consequently, the singular values of an orthogonal projector consist of 1’s and 0’s. Because some singular values are zero, it is advantageous to drop the columns {qr+1, . . . , qn} of Q, which leads to

P = Q̂Q̂∗,   where Q̂ = [q1, . . . , qr] ∈ Cn×r.
Remark 2.2.4. Orthogonal projectors need not be given in the form Q̂Q̂∗. We will show in Section 2.2.4 that P = A(A∗A)−1A∗ is an orthogonal projector onto R(A) for any full-rank A ∈ Cm×n.

For any matrix Q̂ = [q1, . . . , qr] with orthonormal columns, the matrix P = Q̂Q̂∗ is an orthogonal projector onto R(Q̂), regardless of how {q1, . . . , qr} was obtained. Note that its complementary projector I − Q̂Q̂∗ is an orthogonal projector onto R(Q̂)⊥.
In the case r = 1, we have the rank-one orthogonal projector that isolates the component in a single direction. More precisely, for any given nonzero q ∈ Cn, the matrix

Pq = qq∗ / (q∗q)

is a rank-one orthogonal projector onto span{q}, and

P⊥q = I − qq∗ / (q∗q)

is the complementary orthogonal projector onto the space orthogonal to q.
A := [a1, . . . , ar] ∈ Cn×r,

the projection Pv of a vector v onto R(A) satisfies Pv = Ax, with Ax − v orthogonal to every column of A, or

A∗(Ax − v) = 0 =⇒ A∗Ax = A∗v.

Since A is of full rank, A∗A is also of full rank (hence invertible), and x is uniquely given by x = (A∗A)−1A∗v, so that P = A(A∗A)−1A∗. Note that this is a generalisation of the rank-one orthogonal projector: if A has orthonormal columns, then we recover P = AA∗ as before.
2.3 QR Factorisation
We now study the second matrix factorisation in the course: QR factorisation. Assume for
now that A ∈ Cm×n , m ≥ n is of full rank, but we will see later that this is not necessary.
The idea of QR factorisation is to construct a sequence of orthonormal vectors {q1, q2, . . .} that spans the nested successive spaces span{a1, a2, . . .}, i.e.

span{q1, . . . , qj} = span{a1, . . . , aj},   j = 1, 2, . . . .

In order for this to hold, the vector aj must be a linear combination of the vectors {q1, . . . , qj}. Writing this out,

a1 = r11 q1    (2.3.1a)
a2 = r12 q1 + r22 q2    (2.3.1b)
⋮    (2.3.1c)
an = r1n q1 + r2n q2 + . . . + rnn qn.    (2.3.1d)
In matrix form,

A = [a1 | a2 | · · · | an] = [q1 | q2 | · · · | qn] R̂ = Q̂R̂,

where Q̂ ∈ Cm×n has orthonormal columns and R̂ = (rij) ∈ Cn×n is upper-triangular (rij = 0 for i > j). Such a factorisation is called a reduced QR factorisation of A.
One can define a full QR factorisation in a similar fashion to how we defined the full SVD, by adding m − n orthonormal columns (chosen arbitrarily) to Q̂ so that it becomes a unitary matrix Q ∈ Cm×m; in doing so, m − n rows of zeros need to be added to R̂, and it becomes an upper-triangular matrix R ∈ Cm×n. In the full QR factorisation, the columns {qn+1, . . . , qm} are orthogonal to R(A) by construction, and they constitute an orthonormal basis for R(A)⊥ = N(A∗) if A is of full rank n.
The sign of rjj is not determined and if desired we may choose rjj > 0 so that R̂ has positive
diagonal entries. Gram-Schmidt iteration is numerically unstable due to rounding errors on
a computer. To emphasise the instability, we refer to this algorithm as the classical Gram-
Schmidt iteration.
Theorem 2.3.1. Every matrix A ∈ Cm×n , m ≥ n has a full QR factorisation, hence also a
reduced QR factorisation.
Proof. The case where A has full rank follows easily from the Gram-Schmidt orthogonalisation,
so suppose A does not have full rank. At one or more steps j, it will happen that vj = 0; at this
point, simply pick qj arbitrarily to be any unit vector orthogonal to {q1 , . . . , qj−1 }, and then
continue the Gram-Schmidt orthogonalisation process. Previous step gives us a reduced QR
factorisation of A. One can construct a full QR factorisation by introducing arbitrary m − n
orthonormal vectors in the same style as in Gram-Schmidt process.
Theorem 2.3.2. Every matrix A ∈ Cm×n, m ≥ n, of full rank has a unique reduced QR factorisation A = Q̂R̂ with rjj > 0 for each j = 1, . . . , n.
Proof. The Gram-Schmidt orthogonalisation determines rij and qj fully, except for the sign of
rjj , but this is now fixed by the condition rjj > 0.
for j = 1 to n
vj = aj
for i = 1 to j − 1
rij = qi∗ aj
vj = vj − rij qi
end
rjj = kvj k2
qj = vj /rjj
end
If A ∈ Cm×m is nonsingular with QR factorisation A = QR, then Ax = b is equivalent to

QRx = b,   or   Rx = Q∗b.

The linear system Rx = Q∗b can be solved easily using backward substitution, since R is upper-triangular. This suggests the following method for solving Ax = b:

1. Compute a QR factorisation A = QR.
2. Compute y = Q∗b.
3. Solve Rx = y for x ∈ Cm.
In the classical Gram-Schmidt iteration, each vj is obtained from aj by a single orthogonal projection Pj:

vj = Pj aj,   j = 1, . . . , n.

In contrast, the modified Gram-Schmidt iteration computes the same result by a sequence of (j − 1) projections of rank (m − 1). Let P⊥q = I − qq∗ be the rank (m − 1) orthogonal projector onto the space orthogonal to the nonzero unit vector q ∈ Cm. It can be shown that

Pj = P⊥q_{j−1} · · · P⊥q2 P⊥q1.

The operations are equivalent, but we decompose the projection to obtain numerical stability. The modified Gram-Schmidt algorithm computes vj as follows (in order):

vj^(1) = P1 aj = aj,
vj^(2) = P⊥q1 vj^(1) = (I − q1q1∗) vj^(1),
⋮
vj^(j) = P⊥q_{j−1} vj^(j−1) = (I − q_{j−1}q_{j−1}∗) vj^(j−1).
for j = 1 to n
vj = aj
for i = 1 to j − 1
rij = qi∗ vj (Step by step projection)
vj = vj − rij qi
end
rjj = kvj k2
qj = vj /rjj
end
for i = 1 to n
vi = ai
end
for i = 1 to n
rii = kvi k2
qi = vi /rii
for j = i + 1 to n
rij = qi∗ vj (Compute Pqi as soon as qi is found
vj = vj − rij qi and then apply to all vi+1 , . . . , vn )
end
end
Consider the matrix with columns a1 = (1, ε, 0, 0)^T, a2 = (1, 0, ε, 0)^T, a3 = (1, 0, 0, ε)^T, and make the approximation ε^2 ≈ 0 for 0 < ε ≪ 1, which accounts for rounding error. Applying the classical Gram-Schmidt gives

v1 = a1,
r11 = ka1k2 = √(1 + ε^2) ≈ 1,
q1 = v1/r11 ≈ (1, ε, 0, 0)^T,

v2 = a2,
r12 = q1^T a2 = 1,
v2 = v2 − r12 q1 = (0, −ε, ε, 0)^T,
r22 = kv2k2 = √2 ε,
q2 = v2/r22 = (1/√2)(0, −1, 1, 0)^T,

v3 = a3,
r13 = q1^T a3 = 1,
v3 = v3 − r13 q1 = (0, −ε, 0, ε)^T,
r23 = q2^T a3 = 0,
v3 = v3 − r23 q2 = (0, −ε, 0, ε)^T,
r33 = kv3k2 = √2 ε,
q3 = v3/r33 = (1/√2)(0, −1, 0, 1)^T.
However, q2^T q3 = 1/2 ≠ 0. We see that a small perturbation results in instability, in the sense that we lose orthogonality due to round-off errors. On the other hand, applying the modified Gram-Schmidt iteration, it is not difficult to see that q1, q2 remain unchanged and q3 is obtained as
v3 = a3,
r13 = q1^T v3 = 1,
v3 = v3 − r13 q1 = (0, −ε, 0, ε)^T,
r23 = q2^T v3 = ε/√2,
v3 = v3 − r23 q2 = (0, −ε/2, −ε/2, ε)^T,
r33 = kv3k2 = (√6/2) ε,
q3 = v3/r33 = (1/√6)(0, −1, −1, 2)^T.
We recover q2T q3 = 0 in this case.
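The loss of orthogonality above is reproducible in double precision. The sketch below (Python/NumPy, an illustration rather than the notes' own code) implements both iterations following the pseudocode of the previous pages and compares kQ̂∗Q̂ − Ik on the matrix of this example with ε = 10^−8:

import numpy as np

def classical_gs(A):
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]        # project against the original column a_j
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

def modified_gs(A):
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    V = A.astype(float).copy()
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ V[:, j]        # project against the updated v_j
            V[:, j] -= R[i, j] * Q[:, i]
    return Q, R

eps = 1e-8
A = np.array([[1.0, 1.0, 1.0], [eps, 0.0, 0.0], [0.0, eps, 0.0], [0.0, 0.0, eps]])
for name, gs in [("classical", classical_gs), ("modified", modified_gs)]:
    Q, _ = gs(A)
    print(name, np.linalg.norm(Q.T @ Q - np.eye(3)))   # classical loses orthogonality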
Given A ∈ Cm×n , m ≥ n, b ∈ Cm ,
find x ∈ Cn such that kAx − bk2 is minimised. (2.4.1)
This is called the general (linear) least squares problem. The 2-norm is chosen due to
certain geometric and statistical reasons, but the more important reason is it leads to simple
algorithms since the derivative of a quadratic function, which must be set to zero for minimi-
sation, is linear. Geometrically, (2.4.1) means that we want to find a vector x ∈ Cn such that
the vector Ax ∈ Cm is the closest point in R(A) to b ∈ Cm .
Example 2.4.1. For a curve fitting problem, given a set of data (y1, b1), . . . , (ym, bm), we want to find a polynomial p(y) such that p(yj) = bj for every j = 1, . . . , m. If the points {y1, . . . , ym} ⊂ C are distinct, it can be shown that there exists a unique polynomial interpolant to these data of degree at most m − 1. However, the fit is often bad, in the sense that it tends to get worse rather than better as more data are utilised. Even if the fit is good, the interpolation process may be sensitive to perturbations of the data. One way to avoid such complications is to choose a nonuniform set of interpolation points, but in applications this will not always be possible.
Surprisingly, one can do better by reducing the degree of the polynomial. For some n < m, consider a degree n − 1 polynomial of the form

p(y) = c0 + c1 y + . . . + c_{n−1} y^{n−1}.

Such a polynomial is a least squares fit to the data if it minimises the residual vector in the 2-norm, that is, it solves

min over polynomials p of degree n−1 of ( \sum_{i=1}^{m} |p(yi) − bi|^2 )^{1/2}.
Theorem 2.4.2. Let A ∈ Cm×n (m ≥ n) and b ∈ Cm. A vector x ∈ Cn minimises kAx − bk2 if and only if the residual r = b − Ax satisfies A∗r = 0, or equivalently,

A∗Ax = A∗b,    (2.4.2)

or again equivalently,

P b = Ax,

where P ∈ Cm×m is the orthogonal projector onto R(A). The n × n system of equations (2.4.2), known as the normal equations, is nonsingular if and only if A has full rank. Consequently, the solution x ∈ Cn is unique if and only if A has full rank.
Proof. The equivalence of A∗r = 0 and (2.4.2) follows from the definition of r. The equivalence of A∗r = 0 and P b = Ax follows from the properties of orthogonal projectors; see Subsection 2.2.4. To prove that y = P b is the unique point in R(A) that minimises kb − yk2, suppose z ≠ y is another point in R(A). Since z − y ⊥ b − y, the Pythagorean theorem gives

kb − zk2^2 = kb − yk2^2 + ky − zk2^2 > kb − yk2^2.

Finally, suppose A∗A is nonsingular. Then

Ax = 0 =⇒ (A∗A)x = A∗0 = 0 =⇒ x = 0,

and so A has full rank. Conversely, suppose A has full rank and A∗Ax = 0 for some x ∈ Cn. Then

x∗A∗Ax = x∗0 = 0 =⇒ (Ax)∗Ax = kAxk2^2 = 0 =⇒ Ax = 0,

and since A has full rank this forces x = 0, i.e. A∗A is nonsingular.
If A is of full rank, it follows from Theorem 2.4.2 that the unique solution to the least squares problem is given by

x = (A∗A)−1A∗b = A+b,

where the matrix A+ = (A∗A)−1A∗ ∈ Cn×m is called the pseudoinverse of A. The full-rank linear least squares problem (2.4.1) can then be solved by computing one or both of the vectors

x = A+b,   y = P b.
The standard method of solving such a system is by Cholesky factorisation, which constructs
a factorisation A∗ A = R∗ R, where R ∈ Cn×n is upper-triangular. Consequently, (A∗ A)x = A∗ b
becomes R∗ Rx = A∗ b.
The steps that dominate the work for this computation are the first two. Exploiting the
symmetry of the problem, the computation of A∗ A and the Cholesky factorisation require only
mn2 flops and n3 /3 flops respectively. Thus the total operation count is ∼ mn2 + n3 /3 flops.
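A minimal sketch of this approach (Python with NumPy/SciPy, which the notes do not prescribe; the test data are arbitrary):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

G = A.T @ A                                        # normal equations matrix A*A
R = cholesky(G, lower=False)                       # upper-triangular R with R^T R = G
y = solve_triangular(R.T, A.T @ b, lower=True)     # forward substitution:  R^T y = A^T b
x = solve_triangular(R, y, lower=False)            # backward substitution: R x = y

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True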
2.4.3 QR Factorisation
Given a reduced QR factorisation A = Q̂R̂, the orthogonal projector P ∈ Cm×m onto R(A) can be written as P = Q̂Q̂∗. Since P b ∈ R(A), the system Ax = P b has an exact solution, and

Q̂R̂x = Q̂Q̂∗b =⇒ R̂x = Q̂∗b.

This suggests the following algorithm:

1. Compute the reduced QR factorisation A = Q̂R̂.
2. Form the vector Q̂∗b ∈ Cn.
3. Solve the upper-triangular system R̂x = Q̂∗b for x ∈ Cn.

Note that the same reduction can also be derived from the normal equations (2.4.2):

A∗Ax = A∗b =⇒ (R̂∗Q̂∗)(Q̂R̂)x = R̂∗Q̂∗b =⇒ R̂x = Q̂∗b.
The operation count for this computation is dominated by the cost of the QR factorisation,
which is ∼ 2mn2 − 2n3 /3 flops if Householder reflections are used.
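The corresponding sketch with a reduced QR factorisation (again Python/NumPy, for illustration only):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

Q_hat, R_hat = np.linalg.qr(A, mode='reduced')          # A = Q_hat R_hat
x = solve_triangular(R_hat, Q_hat.T @ b, lower=False)   # solve R_hat x = Q_hat* b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True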
2.4.4 SVD

Given a reduced SVD A = ÛΣ̂V∗, it follows from Theorem 2.1.4 that the orthogonal projector P ∈ Cm×m onto R(A) can be written as P = ÛÛ∗. The system Ax = P b reduces to

ÛΣ̂V∗x = ÛÛ∗b =⇒ Σ̂V∗x = Û∗b.

This suggests the following algorithm:

1. Compute the reduced SVD A = ÛΣ̂V∗.
2. Form the vector Û∗b ∈ Cn.
3. Solve the diagonal system Σ̂w = Û∗b for w ∈ Cn.
4. Set x = V w ∈ Cn.

Note that the same reduction can also be derived from the normal equations (2.4.2):

(V Σ̂Û∗)(ÛΣ̂V∗)x = V Σ̂Û∗b =⇒ Σ̂V∗x = Û∗b.
The operation count for this computation is dominated by the cost of the SVD. For m ≫ n this cost is approximately the same as for QR factorisation, but for m ≈ n the SVD is more expensive. A typical estimate is ∼ 2mn^2 + 11n^3 flops.

Algorithm 2.4 may be the best if we only care about computational speed. However, solving the normal equations is not always numerically stable, and so Algorithm 2.5 is the “modern standard” method for least squares problems. If A is close to rank-deficient, it turns out that Algorithm 2.5 has less-than-ideal stability properties and Algorithm 2.6 is chosen instead.
2.5 Problems
1. Two matrices A, B ∈ Cm×m are unitary equivalent if A = QBQ∗ for some unitary
Q ∈ Cm×m . Is it true or false that A and B are unitarily equivalent if and only if they
have the same singular values?
Solution: Observe that for a square matrix, the reduced SVD and full SVD has the
same structure. The “only if” statement is true. Suppose A = QBQ∗ for some
unitary matrix Q ∈ Cm×m and let B = UB ΣB VB∗ be the SVD of B. Then
with singular value σ1A = σ2A = 1. Since BB ∗ = I2 , B is unitary and it has a SVD of
the form
0 1 1 0 1 0
B = BI2 I2 = ,
1 0 0 1 0 1
with singular values σ1B = σ2B = 1. Suppose A and B are unitary equivalent, then
A = Q∗ AQ = Q∗ (QBQ∗ Q) = B,
2. Using the SVD, prove that any matrix in C m×n is the limit of a sequence of matrices of
full rank. In other words, prove that the set of full-rank matrices is a dense subset of
Cm×n . Use the 2-norm for your proof. (The norm doesn’t matter, since all norms on a
finite-dimensional space are equivalent.)
Solution: We may assume WLOG that m ≥ n. We want to show that for any
matrix A ∈ Cm×n , there exists a sequence of full rank matrices (Ak ) ∈ Cm×n such
that
kAk − Ak2 −→ 0 as k −→ ∞.
The result is trivial if A has full rank, since we may choose Ak = A for each k ≥ 1, so
suppose A is rank-deficient. Let r < min{m, n} = n be the rank of A, which is also the number of nonzero singular values of A. Consider the reduced SVD A = ÛΣ̂V∗, where V ∈ Cn×n is unitary, Û ∈ Cm×n has orthonormal columns and

Σ̂ = diag(σ1, . . . , σr, 0, . . . , 0) ∈ Rn×n.

The fact that the 2-norm is invariant under unitary transformations suggests perturbing Σ̂ in such a way that it has full rank. More precisely, consider Ak = ÛΣ̂kV∗, where

Σ̂k = Σ̂ + (1/k) In.
Ak has full rank by construction, since it has n nonzero singular values, and

kAk − Ak2 = kÛ(Σ̂k − Σ̂)V∗k2 = k(1/k) Ink2 = 1/k −→ 0   as k −→ ∞.
(a) Determine, on paper, a real SVD of A in the form A = U ΣV T . The SVD is not
unique, so find the one that has the minimal number of minus signs in U and V .
Solution: Since A is nonsingular, Theorem 2.1.6 says that the singular values of A are the square roots of the eigenvalues of A^T A. Computing A^T A gives

A^T A = [−2, −10; 11, 5] [−2, 11; −10, 5] = [104, −72; −72, 146],

whose eigenvalues are λ1 = 200 and λ2 = 50. Denote U = [u1 | u2] ∈ R2×2 and V = [v1 | v2] ∈ R2×2, where u1, u2 and v1, v2 are the column vectors of U and V respectively in the SVD A = UΣV^T. Observe that v1, v2 are normalised eigenvectors of A^T A corresponding to the eigenvalues λ1, λ2 respectively, since A^T A = V Σ^2 V^T. It can be shown that

V = [v1 | v2] = [−3/5, 4/5; 4/5, 3/5].
(b) List the singular values, left singular vectors, and right singular vectors of A. Draw
a careful, labeled picture of the unit ball in R2 and its image under A, together with
the singular vectors, with the coordinates of their vertices marked.
Solution: The singular values of A are σ1 = 10√2, σ2 = 5√2. The right and left singular vectors of A are

v1 = (−3/5, 4/5)^T,   v2 = (4/5, 3/5)^T,   u1 = (1/√2, 1/√2)^T,   u2 = (1/√2, −1/√2)^T.
(c) What are the 1−, 2−, ∞-, and Frobenius norms of A?
yields

λ = (3 ± √(9 − 4(100)))/2 = 3/2 ± (√391/2) i.
Solution:

λ1 λ2 = (3/2 + (√391/2) i)(3/2 − (√391/2) i) = 9/4 + 391/4 = 400/4 = 100 = det(A),

σ1 σ2 = (10√2)(5√2) = 50 · 2 = 100 = |det(A)|.
(g) What is the area of the ellipsoid onto which A maps the unit ball of R2 ?
Solution: The ellipse onto which A maps the unit ball of R2 has major radius
a = σ1 and minor radius b = σ2 . Thus, its area is πab = πσ1 σ2 = 100π.
4. Let P ∈ Cm×m be a nonzero projector. Show that kP k2 ≥ 1, with equality if and only if
P is an orthogonal projector.
kP k2 ≤ kP k22 =⇒ kP k2 ≥ 1 since kP k2 6= 0.
R(P ) is not orthogonal to N (P ) = R(I − P ).
5. Let A = [1, 0, −1; 1, 2, 1; 1, 1, −3; 0, 1, 1] ∈ R4×3 and b = (1, 1, 1, 1)^T.
(a) Determine the reduced QR factorisation of A.
(b) Use the QR factors from part (a) to determine the least square solution to Ax =
b.
x3 = 0,
x2 = 1/3 − x3 = 1/3,
x1 = 1 − x2 + x3 = 1 − 1/3 = 2/3.

Hence, x = (x1, x2, x3)∗ = (2/3, 1/3, 0)∗.
Conversely, suppose all the diagonal entries of R̂ are nonzero, and suppose that

β1 a1 + . . . + βn an = 0.    (2.5.2)

Using aj = \sum_{k=1}^{j} rkj qk, equation (2.5.2) can be rewritten as

γ1 q1 + . . . + γn qn = 0,

where

γj = \sum_{k=j}^{n} βk rjk,   j = 1, . . . , n.    (2.5.3)
Next,
(b) Suppose R̂ has k nonzero diagonal entries for some k with 0 ≤ k < n. What does
this imply about the rank of A? Exactly k? At least k? At most k? Give a precise
answer, and prove it.
Solution: Suppose R̂ has k nonzero diagonal entries for some k with 0 ≤ k < n,
i.e. R̂ has at least one zero diagonal entry. Let aj be the jth column of A, and
Aj ∈ Cm×j be the matrix defined by Aj = [a1 |a2 | . . . |aj ].
• First, rank(A1) = 1 if r11 ≠ 0, and rank(A1) = 0 if r11 = 0.
• For j = 2, . . . , n, regardless of the value of rjj, either

aj ∉ span{a1, . . . , aj−1} =⇒ rank(Aj) = rank(Aj−1) + 1,    (2.5.4)

or

aj ∈ span{a1, . . . , aj−1} =⇒ rank(Aj) = rank(Aj−1).    (2.5.5)
This means that the rank of A cannot be at most k.
• For any j = 2, . . . , n, if rjj 6= 0, then (2.5.1) implies that (2.5.4) must hold.
However, if rjj = 0, then either (2.5.4) or (2.5.5) holds. We illustrate this
a1 = 0, a2 = r12 q1 , a3 = r13 q1
7. Let A be an m × m matrix, and let aj be its jth column. Give an algebraic proof of
Hadamard’s inequality:
|det A| ≤ \prod_{j=1}^{m} kajk2.
Also give a geometric interpretation of this result, making use of the fact that the deter-
minant equals the volume of a parallelepiped.
where vj = aj − \sum_{i=1}^{j−1} (qi∗aj) qi, with the convention that q0 = 0 ∈ Cm. For any j = 1, . . . , m, since {vj, q1, . . . , qj−1} are mutually orthogonal, the Pythagorean theorem gives

kajk2^2 = k vj + \sum_{i=1}^{j−1} (qi∗aj) qi k2^2 = kvjk2^2 + \sum_{i=1}^{j−1} k(qi∗aj) qik2^2 ≥ kvjk2^2.
Since |det(A)| is the volume of the parallelepiped with sides given by the vectors {a1, a2, . . . , am}, Hadamard’s inequality asserts that this is bounded above by the volume of a rectangular parallelepiped with sides of length ka1k2, ka2k2, . . . , kamk2.
8. Consider the inner product space of real-valued continuous functions defined on [−1, 1],
where the inner product is defined by

f · g = \int_{−1}^{1} f(x) g(x) dx.
Let M be the subspace that is spanned by the three linearly independent polynomial
p0 = 1, p1 = x, p2 = x2 .
(a) Use the Gram-Schmidt process to determine an orthonormal set of polynomials
(Legendre polynomials) q0 , q1 , q2 that spans M .
Solution: It is clear that q0 satisfies the given ODE for n = 0 since q00 = q000 = 0
and n(n + 1)|n=0 = 0. Because differentiation is a linear operation, it suffices to
show that v1 , v2 (from part (a)) satisfies the given ODE for n = 1, 2 respectively.
For n = 1, with v1 = x,

(1 − x^2) v1'' − 2x v1' + 1(1 + 1) v1 = 0 − 2x(1) + 2x = 0.

For n = 2, with v2 = x^2 − 1/3,

(1 − x^2) v2'' − 2x v2' + 2(2 + 1) v2 = (1 − x^2)(2) − 2x(2x) + 6(x^2 − 1/3) = 2 − 2x^2 − 4x^2 + 6x^2 − 2 = 0.
9. Let A ∈ Rm×n with m < n and of full rank. Then min kAx − bk2 is called an Underde-
termined Least-Squares Problem. Show that the solution is an n − m dimensional set.
Show how to compute the unique mininum norm solution using QR decomposition and
SVD approach.
Solution: Let A ∈ Rm×n with m < n and of full rank. Since m < n, Ax = b is an
underdetermined system and kAx − bk2 attains its mininum 0 in this case, where the
solution set, S is given by
S = {xp − z ∈ Rn : z ∈ N (A)},
where xp is the particular solution to Ax = b and N (A) denotes the null space of
A. Note that S is not a vector subspace of Rn (unless b = 0 ∈ Rm). Invoking the Rank-Nullity theorem gives dim N(A) = n − rank(A) = n − m, so the solution set is an (n − m)-dimensional (affine) set. The unique minimum norm solution is

x = A^T (AA^T)^{−1} b,

where (AA^T)^{−1} exists since A having full row rank implies that AA^T ∈ Rm×m is nonsingular.
(AAT )−1 = (R̂T Q̂T Q̂R̂)−1 = (R̂T R̂)−1 = (R̂)−1 (R̂T )−1 .
Here, the assumption that A is full rank is crucial, in that it ensures the existence
of (R̂T )−1 and (Σ̂)−1 . Indeed, Q1(b)(i) says that R̂ has all nonzero diagonal entries,
which implies that R̂ (and also R̂T ) is nonsingular since R̂ is upper-triangular; Theo-
rem 5.1, page 33, tells us that all singular values of A, which are the diagonal entries of Σ̂, are nonzero, which implies that Σ̂ is nonsingular since Σ̂ is diagonal.
Chapter 3

Conditioning and Stability

3.1 Conditioning and Condition Numbers

Definition 3.1.1. The absolute condition number κ̂ = κ̂(x) of the problem f at x is defined as

κ̂ = κ̂(x) = lim_{δ→0} sup_{kδxk≤δ} kδfk / kδxk.
• It can be interpreted as a supremum over all infinitesimal perturbations δx; thus it can be written as

κ̂ = sup_{δx} kδfk / kδxk.
Definition 3.1.2. The relative condition number κ = κ(x) of the problem f at x is defined as

κ = κ(x) = lim_{δ→0} sup_{kδxk≤δ} ( kδfk / kf(x)k ) / ( kδxk / kxk ),

or, assuming δx, δf are infinitesimal,

κ = κ(x) = sup_{δx} ( kδfk / kf(x)k ) / ( kδxk / kxk ).
If f is differentiable with Jacobian J(x), then κ̂ = kJ(x)k and

κ = kJ(x)k / ( kf(x)k / kxk ).

Example 3.1.3. Consider f(x) = αx for a fixed scalar α. Then J(x) = α and

κ̂ = kJ(x)k = |α|,   but   κ = kJ(x)k / ( kf(x)k / kxk ) = |α| / ( |αx| / |x| ) = 1.
Example 3.1.4. Consider f(x) = √x, x > 0. Then J(x) = f′(x) = 1/(2√x) and

κ̂ = kJ(x)k = 1/(2√x),   but   κ = kJ(x)k / ( kf(x)k / kxk ) = ( 1/(2√x) ) · ( x / √x ) = 1/2.
Example 3.1.5. Consider f(x) = x1 − x2, x = (x1, x2)∗ ∈ (C2, k · k∞). Then J(x) = (1, −1) and

κ̂ = kJ(x)k∞ = 2,   but   κ = kJ(x)k∞ / ( kf(x)k / kxk∞ ) = 2 / ( |x1 − x2| / max{|x1|, |x2|} ).

The relative condition number blows up if |x1 − x2| ≈ 0. Thus, this problem is severely ill-conditioned when x1 ≈ x2, an issue which κ̂ would not reveal.
• For k · k2, this bound is actually attained, since kAk2 = σ1 and kA−1k2 = 1/σm, where σm > 0 since A is non-singular. Indeed, choosing x to be the mth right singular vector vm of A yields

kxk2 / kAxk2 = kvmk2 / kAvmk2 = kvmk2 / (σm kumk2) = 1/σm.
(a) Consider f(x) = Ax = b. The problem of computing b, given x, has condition number

κ(x) = kAk kxk / kAxk = kAk kxk / kbk ≤ kAk kA−1k.

(b) Consider f(b) = A−1b = x. The problem of computing x, given b, has condition number

κ(b) = kA−1k kbk / kxk ≤ kA−1k kAk.

• Consider the problem f(A) = A−1b = x, where now A has some perturbation δA instead of b. Then, to leading order, δx ≈ −A−1(δA)x, and

κ(A) = sup_{δA} ( kδxk / kxk ) / ( kδAk / kAk ) ≤ sup_{δA} ( kA−1k kδAk kxk / kxk ) · ( kAk / kδAk ) = kA−1k kAk.
It can be shown that such perturbations δA exists for any given A ∈ Cm×m , b ∈ Cm and
any chosen norm k · k.
The product kAkkA−1 k appears so often that we decided to call it the condition number
of A (relative to the norm k · k), denoted by κ(A). A is said to be well-conditioned if κ(A) is
small and ill-conditioned if κ(A) is large. In the case where A is singular, we write κ(A) = ∞.
For a rectangular matrix A ∈ Cm×n of full rank, m ≥ n, the condition number is defined in terms of the pseudoinverse, i.e. κ(A) = kAk kA+k.
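Concretely, for the 2-norm one has κ2(A) = σ1/σm, which is easy to compute; a brief sketch in Python/NumPy (random test matrix, not from the notes):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

sigma = np.linalg.svd(A, compute_uv=False)
kappa_svd = sigma[0] / sigma[-1]        # sigma_1 / sigma_m
kappa_np = np.linalg.cond(A, 2)         # kAk2 kA^{-1}k2

print(np.isclose(kappa_svd, kappa_np))  # True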
3.2 Floating Point Arithmetic

How exactly does one go from decimal (base 10) to binary (base 2)? A nonzero real number x is written in normalised floating point form as

x = σ · x̄ · β^e,   σ = ±1,

where

β = the chosen base,
e = the exponent,
x̄ = the mantissa of x, with (0.1)β ≤ x̄ < 1.

Observe that (0.1)10 = 0.1 for decimal while (0.1)2 = 0.5 for binary.
The exponent is stored as is if it is within the given range; otherwise the number overflows if e is too large, or underflows if e is too small.
1. An example of an operation prone to overflow is √(x^2 + y^2) when x or y is large. To avoid this, we rewrite it as

√(x^2 + y^2) = |x| (1 + (y/x)^2)^{1/2}   if |x| ≥ |y|,
√(x^2 + y^2) = |y| (1 + (x/y)^2)^{1/2}   if |x| < |y|.

2. An example of an operation prone to cancellation/underflow is √(x + 1) − √x. Observe that the quantity is approximately 0 if x is large. To avoid this, we rationalise the function:

√(x + 1) − √x = 1 / (√(x + 1) + √x).
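These rewritings are easy to try; for instance in Python (a small illustration, not from the notes), the naive formulas overflow or lose all accuracy while the rewritten ones behave well:

import math

x, y = 1e200, 1e199
naive = math.sqrt(x * x + y * y)                   # x*x overflows to inf
scaled = abs(x) * math.sqrt(1.0 + (y / x) ** 2)    # stays within range
print(naive, scaled, math.hypot(x, y))             # inf vs. the correct value (twice)

z = 1e16
# sqrt(z+1) - sqrt(z) suffers cancellation and evaluates to 0.0 here,
# while the rationalised form retains full accuracy.
print(math.sqrt(z + 1) - math.sqrt(z), 1.0 / (math.sqrt(z + 1) + math.sqrt(z)))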
x = σ · (0.a1 a2 . . . an an+1 . . .) · 2e ,
but its floating point representation fl(x) can only include n digits for the mantissa. There are
two ways to truncate x when stored:
(a) Chopping, which amounts to truncating remaining digits after an ,
(b) Rounding, based on the digit an+1:

fl(x) = σ · (0.a1 a2 . . . an) · 2^e   if an+1 = 0,
fl(x) = σ · [ (0.a1 a2 . . . an) + 2^{−n} ] · 2^e   if an+1 = 1.
• One can view the floating point representation fl(x) as a perturbation of x, i.e. there exists an ε = ε(x) such that

fl(x) = x(1 + ε),   or equivalently   (fl(x) − x)/x = ε.
It can be shown that ε has a certain range depending on the truncation method:

0 ≤ |x − fl(x)| ≤ 2^{−n} · 2^e   (chopping),    (3.2.1)
0 ≤ |x − fl(x)| ≤ (1/2) · 2^{−n} · 2^e = 2^{−n+e−1}   (rounding).    (3.2.2)

Thus, using (0.a1 a2 . . .)2 ≥ (0.1)2 = 2^{−1}, the relative error satisfies

0 ≤ | (x − fl(x)) / x | ≤ 2^{−n+1}   (chopping),   0 ≤ | (x − fl(x)) / x | ≤ 2^{−n+e−1} / (2^{−1} · 2^e) = 2^{−n}   (rounding).
• The worst possible error for chopping is twice as large as when rounding is used. It
can be seen from (3.2.1), (3.2.2) that x − fl(x) has the same sign as x for chopping but
possibly different sign for rounding. This means that there might be cancellation of error
if rounding is used!
Definition 3.2.2. The machine epsilon, denoted by εmachine is the difference between 1 and
the next larger floating point number. In a relative sense, the machine epsilon is as large as the
gaps between floating point number get. For a double-precision computer, εmachine = 2−52 ≈
O(10−16 ).
Equivalently, for all x ∈ R, there exists an ε with |ε| ≤ εmachine such that fl(x) =
x(1 + ε). That is, the difference between a real number and its (closest) floating point
approximation is always smaller than εmachine in relative terms.
2. Basic floating point operations consists of ⊕, , ⊗, ÷. Denote the floating point operation
by ~. For any floating points x, y, there exists an ε with |ε| ≤ εmachine such that
x ~ y = fl(x ∗ y) = (x ∗ y)(1 + ε).
That is, every operation of floating point arithmetic is exact up to a relative error of size
at most εmachine .
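Both axioms can be observed directly in double precision; a short illustration in Python (not part of the notes):

import numpy as np
from fractions import Fraction

eps = np.finfo(np.float64).eps                # 2**(-52), gap between 1 and the next float
print(eps == 2.0**-52)                        # True

# |fl(x) - x| / |x| <= eps_machine for x = 0.1 (Fraction(0.1) is the exact stored value):
rel_err = abs(Fraction(0.1) - Fraction(1, 10)) / Fraction(1, 10)
print(float(rel_err) <= eps)                  # True

# Each floating point operation is exact up to a relative error of size O(eps_machine):
a, b = 1.0, 3.0
print(abs((a / b) * b - a) <= 2 * eps)        # True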
3.3 Stability

Definition 3.3.1. An algorithm f̃ for a problem f is accurate if for each x ∈ X,

kf̃(x) − f(x)k / kf(x)k = O(εmachine).

In other words, there exists a constant C > 0 such that for all sufficiently small εmachine we have

kf̃(x) − f(x)k / kf(x)k ≤ C εmachine.

• In practice, C can be large. For ill-conditioned problems, the definition of accuracy can be too restrictive.
Definition 3.3.2.

1. An algorithm f̃ for a problem f is stable if for each x ∈ X,

kf̃(x) − f(x̃)k / kf(x̃)k = O(εmachine)

for some x̃ with

kx̃ − xk / kxk = O(εmachine).

In words, a stable algorithm gives nearly the right answer to nearly the right question.

2. An algorithm f̃ for a problem f is backward stable if for each x ∈ X,

f̃(x) = f(x̃)   for some x̃ with   kx̃ − xk / kxk = O(εmachine).

In words, a backward stable algorithm gives exactly the right answer to nearly the right question.
Theorem 3.3.3. For problems f and algorithms f˜ defined on finite-dimensional spaces X and
Y , the properties of accuracy, stability and backward stability all hold or fail to hold indepen-
dently of the choice of norms in X and Y .
Proof. We will only prove this in the case of subtraction. Consider the subtraction f(x1, x2) = x1 − x2, computed in floating point as

f̃(x1, x2) = fl(x1) ⊖ fl(x2).

By the axioms of floating point arithmetic, there exist ε1, ε2, ε3 with |ε1|, |ε2|, |ε3| ≤ εmachine such that

fl(x1) = x1(1 + ε1),   fl(x2) = x2(1 + ε2),
fl(x1) ⊖ fl(x2) = (fl(x1) − fl(x2))(1 + ε3).

Thus,

f̃(x1, x2) = (x1(1 + ε1) − x2(1 + ε2))(1 + ε3) = x1(1 + ε4) − x2(1 + ε5) = x̃1 − x̃2 = f(x̃1, x̃2)

for some |ε4|, |ε5| ≤ 2εmachine + O(ε^2machine). Backward stability follows directly, since

|x̃1 − x1| / |x1| = O(εmachine),   |x̃2 − x2| / |x2| = O(εmachine).
Theorem 3.4.1. Suppose a backward stable algorithm f̃ is applied to solve a problem f with relative condition number κ. Then the relative errors satisfy

kf̃(x) − f(x)k / kf(x)k = O(κ(x) εmachine).

Proof. By backward stability, f̃(x) = f(x̃) for some x̃ with

kx̃ − xk / kxk = O(εmachine).

By the definition of the relative condition number,

kf(x̃) − f(x)k / kf(x)k ≤ (κ(x) + o(1)) kx̃ − xk / kxk,

where o(1) denotes a quantity that converges to 0 as εmachine −→ 0. The desired inequality follows from combining these bounds.
1. Forward substitution solves the lower-triangular system Lx = b, L = (lij) ∈ Cm×m:

x1 = b1 / l11,
xi = (1/lii) ( bi − \sum_{j=1}^{i−1} lij xj ),   i = 2, . . . , m.

2. Backward substitution solves the upper-triangular system Ux = b, U = (uij) ∈ Cm×m:

xm = bm / umm,
xi = (1/uii) ( bi − \sum_{j=i+1}^{m} uij xj ),   i = m − 1, . . . , 1.
3. The operation count for both forward and backward substitution is ∼ m^2 flops, since

additions and subtractions: ∼ m(m − 1)/2 flops,
multiplications and divisions: ∼ m(m + 1)/2 flops.
3.6 Problems
1. Assume that the matrix norm k · k satisfies the submultiplicative property kABk ≤ kAk kBk. Show that if kXk < 1, then I − X is invertible,

(I − X)−1 = \sum_{j=0}^{∞} X^j,

and k(I − X)−1k ≤ 1/(1 − kXk).
Solution: This is a classical result about Neumann series, which is the infinite series \sum_{j=0}^{∞} X^j. Assuming (I − X) is invertible with its inverse (I − X)−1 given by the Neumann series, using the submultiplicative property and the triangle inequality for norms we have that

k(I − X)−1k = k \sum_{j=0}^{∞} X^j k ≤ \sum_{j=0}^{∞} kXk^j = 1 / (1 − kXk),    (3.6.1)

where the second infinite series, which is a geometric series, converges since kXk < 1 by assumption. This proves the desired inequality, and moreover it shows that the Neumann series \sum_{j=0}^{∞} X^j converges absolutely in the matrix norm, and thus converges in the matrix norm too. To conclude the proof, we need to show that I − X is in fact invertible, with its inverse given by the Neumann series. A direct computation shows that

(I − X) ( \sum_{j=0}^{n} X^j ) = (I − X)(I + X + . . . + X^n) = I − X^{n+1}.    (3.6.2)

Since kX^{n+1}k ≤ kXk^{n+1} −→ 0 as n −→ ∞, letting n −→ ∞ in (3.6.2) gives (I − X) ( \sum_{j=0}^{∞} X^j ) = I. A symmetric argument also shows that ( \sum_{j=0}^{∞} X^j ) (I − X) = I. Hence, we have that

(I − X)−1 = \sum_{j=0}^{∞} X^j,

i.e. I − X is invertible.
Combining (3.6.3), (3.6.4) and rearranging yields the left inequality. Next, since
Ax = b, we have
kbk = kAxk ≤ kAkkxk. (3.6.5)
On the other hand, since Ax − b + b − Ax̃ = r, we have
kek = kA−1 Aek = kA−1 (Ae − b + b)k = kA−1 rk ≤ kA−1 kkrk. (3.6.6)
We know that κ(A) = kAkkA−1 k is by definition the condition number of the matrix
A; moreover κ(A) ≥ kAA−1 k = 1. The terms kek/kxk, krk/kbk can be interpreted as
the relative solution error and the relative residual eror respectively. Thus, the right
inequality
kx − x̃k kek krk k(b + r) − bk
= ≤ κ(A) = κ(A)
kxk kxk kbk kbk
tells us that the ratio between the relative solution error and the relative residual
error is controlled by the condition number of A. In other words, suppose x1 ∈ Rn is
such that Ax1 = b1 and suppose we perturb b1 by some ε > 0. Then the correspond-
ing solution can only differ from x1 at most κ(A)ε/kbk in relative terms.
This estimate also shows that if κ(A) is not large then the residual r gives a good
representation of the error e. However, if κ(A) is large then the residual r is not a
good estimate of the error e.
where we assume that kA−1δAk ≤ kA−1k kδAk < 1, so that (A + δA) is nonsingular (Why?). Show that

kx̃ − xk / kxk ≤ ( κ(A) / (1 − κ(A) kδAk/kAk) ) ( kδAk/kAk + kδbk/kbk ).
Since k−A−1δAk = kA−1δAk < 1, Problem 1 together with the assumption that A is invertible shows that A + δA is invertible, i.e. A + δA is nonsingular. Since Ax̃ − Ax = δb − δAx̃, we have that

kx̃ − xk / kxk = kA−1(Ax̃ − Ax)k / kxk = kA−1(δb − δAx̃)k / kxk    (3.6.7a)
             ≤ kA−1k kδbk / kxk + kA−1k kδAk kx̃k / kxk.    (3.6.7b)

Using κ(A) = kA−1k kAk and kbk ≤ kAk kxk yields the bound
The desired inequality follows from substituting (3.6.9) into the above inequality.
4. Show that for Gaussian elimination with partial pivoting (permutation by rows) applied to a matrix A ∈ Rn×n, the growth factor ρ = max_{ij} |uij| / max_{ij} |aij| satisfies the estimate ρ ≤ 2^{n−1}.
Solution:
5. Show that if all the principal minors of a matrix A ∈ Rn×n are nonzero, then there exist a diagonal matrix D, a unit lower triangular matrix L and a unit upper triangular matrix U such that A = LDU, and that this factorisation is unique.

Solution: Since such an LU decomposition and the way we factored out the pivots of U are both
unique, we conclude that there exists a unique LDU factorisation of A with all the
desired properties for L, D, U .
If A is symmetric, then an LDU decomposition of the required form might not exist, and might not be unique even if it does exist. Consider the symmetric matrix

A = (aij) = [0, 0, 0; 0, 0, 1; 0, 1, 0],

and suppose we want to decompose A into the form

A = [1, 0, 0; a, 1, 0; b, c, 1] [A, 0, 0; 0, B, 0; 0, 0, C] [1, d, e; 0, 1, f; 0, 0, 1]
  = [A, Ad, Ae; aA, aAd + B, aAe + Bf; bA, bAd + cB, bAe + cBf + C].

Comparing the first two diagonal entries a11, a22 gives A = B = 0, but then aAe + Bf = 0 ≠ a23 = 1.
If A is symmetric and admits such a (unique) factorisation, then

LDU = A = A^T = U^T D^T L^T = U^T D L^T,

so by uniqueness L = U^T, and hence

A = LDU = U^T D U = L D L^T.
Chapter 4
Systems of Equations
4.1 Gaussian Elimination
3. Under the assumption that aii^(i) ≠ 0 for i = 1, . . . , k − 1, we will have A^(k) x = b^(k), k = 2, . . . , n, where the first k − 1 columns of A^(k) are zero below the diagonal (rows 1, . . . , k − 1 contain the pivot rows with entries aij^(1), . . . , aij^(k−1), and the remaining rows contain the updated entries aij^(k)), and b^(k) = (b1^(1), b2^(2), . . . , bk^(k), . . . , bn^(k))^T. The entries are updated according to

m_{i(k−1)} = a_{i(k−1)}^(k−1) / a_{(k−1)(k−1)}^(k−1),   i = k, . . . , n,

aij^(k) = aij^(k−1) − m_{i(k−1)} a_{(k−1)j}^(k−1),   i, j = k, . . . , n,

bi^(k) = bi^(k−1) − m_{i(k−1)} b_{k−1}^(k−1),   i = k, . . . , n.
General Idea

Define mk = [0, . . . , 0, m_{(k+1)k}, . . . , m_{nk}]^T ∈ Rn and consider the kth Gaussian transformation matrix Mk defined by

Mk = In − mk ek^T,

which equals the identity matrix except for the entries −m_{(k+1)k}, . . . , −m_{nk} below the diagonal in the kth column.

• The inverse of Mk is Mk^{−1} = In + mk ek^T. Indeed, ek^T mk = 0, since mk has nonzero entries only from position k + 1 onwards, k = 1, . . . , n − 1. Thus,

Mk^{−1} M_{k+1}^{−1} = (In + mk ek^T)(In + m_{k+1} e_{k+1}^T) = In + mk ek^T + m_{k+1} e_{k+1}^T,

and more generally

\prod_{j=1}^{n−1} Mj^{−1} = In + \sum_{j=1}^{n−1} mj ej^T,

which is the unit lower-triangular matrix whose (i, j) entry below the diagonal is the multiplier m_{ij}.
The total operation count of Gaussian elimination is

2(n − 1)n(n + 1)/3 + n(n − 1) ∼ (2/3) n^3 flops.
• The ith order leading principal submatrix Ai is constructed from the first i rows and i columns of A. Its determinant is called the ith leading principal minor.

• If Ai is singular for some i, then an LU factorisation (with lii = 1) may not exist, or will not be unique. We demonstrate this with the following examples:

C = [0, 1; 1, 0] has no LU factorisation of this form, while D = [0, 1; 0, 2] = [1, 0; β, 1] [0, 1; 0, 2 − β] for any β, so the LU factorisation of D exists but is not unique.
Proof. We begin by proving the “if” direction. By induction, we want to show that if
det(Ai ) ≠ 0, i = 1, . . . , n − 1, then the LU factorisation of Ai (as defined above) exists and is
unique. The case i = 1 is trivial since a11 ≠ 0. Suppose the case (i − 1) is true, i.e. there exists a
unique LU decomposition of Ai−1 such that
\[
A_{i-1} = L^{(i-1)} U^{(i-1)}, \qquad \text{with } l_{kk}^{(i-1)} = 1, \; k = 1, \ldots, i-1.
\]
Write
\[
A_i = \begin{pmatrix} A_{i-1} & c \\ d^T & a_{ii} \end{pmatrix}
    = \begin{pmatrix} L^{(i-1)} & 0 \\ l^T & 1 \end{pmatrix}
      \begin{pmatrix} U^{(i-1)} & u \\ 0^T & u_{ii} \end{pmatrix},
\]
where 0, l, u, c, d ∈ Ri−1 . Note that u_{ii} ≠ 0 since det(Ai ) ≠ 0. Comparing terms in the
factorisation yields
\[
L^{(i-1)} u = c, \qquad l^T U^{(i-1)} = d^T, \qquad l^T u + u_{ii} = a_{ii}. \tag{4.1.1}
\]
Conversely, assume there exists a unique LU factorisation A = LU , with lii = 1, i = 1, . . . , n.
There are two separate cases to consider:
1. A is non-singular. Recall that for every i = 1, . . . , n, Ai has an LU factorisation of the form
\[
A_i = L^{(i)} U^{(i)} = \begin{pmatrix} L^{(i-1)} & 0 \\ l^T & 1 \end{pmatrix}
                        \begin{pmatrix} U^{(i-1)} & u \\ 0^T & u_{ii} \end{pmatrix}.
\]
Thus, det(Ai ) = det(L(i) ) det(U (i) ) = u11 u22 . . . uii , i = 1, . . . , n. In particular, det(An ) =
u11 . . . unn ; but since A is non-singular, uii ≠ 0 for all i = 1, . . . , n. Hence, we must have
det(Ai ) ≠ 0 for every i = 1, . . . , n.
2. A is singular. The analysis above shows that U must have at least one zero entry on the main
diagonal. Let ukk be the first zero entry of U on the main diagonal. The LU factorisation
process then breaks down at the (k + 1)th step, because then lT will not be unique due to
Uk being singular (refer to (4.1.1)). In other words, if ukk = 0 for some k ≤ n − 1, then
we lose existence and uniqueness of the LU factorisation at the (k + 1)th step. Hence, in order
to have a unique LU factorisation of A, we must have ujj ≠ 0 for every j = 1, . . . , n − 1
and unn = 0.
We provide a simple algorithm for the Gaussian elimination method without pivoting. This
pseudocode is not optimal, in the sense that U and the multipliers mjk can be stored in the same
array as A.
U = A, L = I
for k = 1 to n − 1
    for j = k + 1 to n
        mjk = ujk /ukk
        ljk = mjk
        uj,k:n = uj,k:n − mjk uk,k:n
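A direct NumPy translation of this pseudocode, followed by forward and back substitution, might look as follows (a sketch only: it assumes all pivots are nonzero and makes no claim of efficiency; the test system is an arbitrary example with solution (1, 1, 2)):

import numpy as np

def gaussian_elimination(A, b):
    """LU factorisation without pivoting, then forward/back substitution."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):
        for j in range(k + 1, n):
            m = U[j, k] / U[k, k]          # multiplier m_jk; assumes U[k, k] != 0
            L[j, k] = m
            U[j, k:] = U[j, k:] - m * U[k, k:]
    # Solve Ly = b (forward substitution), then Ux = y (back substitution).
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return L, U, x

A = np.array([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])
L, U, x = gaussian_elimination(A, b)
print(np.allclose(L @ U, A), x)            # expect True and x = [1, 1, 2]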
4.2 Pivoting
We begin by exploring the unfortunate fact that the Gaussian elimination method without pivoting is
neither stable nor backward stable, mainly due to its sensitivity to rounding errors. Fortunately,
this instability can be rectified by permuting the order of the rows of the matrix in a certain
way! This operation is called pivoting.
Consider the system
\[
\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix},
\]
for which Gaussian elimination fails at the first step since a11 = 0. The obvious solution is to
interchange rows. Now suppose instead that we perturb a11 by some small number ε > 0, so that
\[
\begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.
\]
Performing GEM (in floating-point arithmetic) yields
\[
\begin{pmatrix} \varepsilon & 1 \\ 0 & 1 - 1/\varepsilon \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 - 1/\varepsilon \end{pmatrix}
\implies x_2 = \frac{2 - 1/\varepsilon}{1 - 1/\varepsilon} \approx 1, \qquad x_1 = \frac{1 - x_2}{\varepsilon} \approx 0.
\]
However, the actual solution is given by
\[
x_2 = \frac{1 - 2\varepsilon}{1 - \varepsilon} \approx 1, \qquad x_1 = \frac{1}{1 - \varepsilon} \approx 1 \ne 0.
\]
If we interchange rows, we have
\[
\begin{pmatrix} 1 & 1 \\ \varepsilon & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}
\implies
\begin{pmatrix} 1 & 1 \\ 0 & 1 - \varepsilon \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 - 2\varepsilon \end{pmatrix}.
\]
The solution is given by x2 = (1 − 2ε)/(1 − ε) ≈ 1 and x1 = 2 − x2 ≈ 1.
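The same effect is easy to reproduce in double-precision arithmetic; in the sketch below (illustrative values only) the tiny pivot ε = 10^{−20} destroys the computed x1 unless the rows are interchanged first:

import numpy as np

eps = 1e-20                                   # far below machine precision relative to 1
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])

# GEM without pivoting: eliminate with the tiny pivot eps.
m = A[1, 0] / A[0, 0]
x2 = (b[1] - m * b[0]) / (A[1, 1] - m * A[0, 1])
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print("no pivoting:", x1, x2)                 # x1 computes as 0.0 instead of approximately 1

# GEM with the rows interchanged (pivoting).
Ap = A[[1, 0], :]
bp = b[[1, 0]]
m = Ap[1, 0] / Ap[0, 0]
y2 = (bp[1] - m * bp[0]) / (Ap[1, 1] - m * Ap[0, 1])
y1 = (bp[0] - Ap[0, 1] * y2) / Ap[0, 0]
print("with pivoting:", y1, y2)               # both components approximately 1, as expected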
Main Idea
We demonstrate the main idea with a simple example. Consider
\[
A = A^{(1)} = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}.
\]
\[
\tilde{A}^{(1)} = P_1 A^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}.
\]
\[
A^{(2)} = M_1 \tilde{A}^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -7 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & -1 \\ 0 & -6 & -12 \end{pmatrix}.
\]
\[
\tilde{A}^{(2)} = P_2 A^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & -1 \\ 0 & -6 & -12 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}.
\]
\[
A^{(3)} = M_2 \tilde{A}^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}.
\]
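For comparison, SciPy's LU routine applies partial pivoting by rows (so its permutation differs from the P1 , P2 chosen above, which only swap when a pivot vanishes); the sketch below merely verifies a factorisation of the form A = P LU :

import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [7.0, 8.0, 9.0]])

P, L, U = lu(A)          # SciPy returns A = P @ L @ U
print(P)                 # permutation chosen by partial pivoting (largest pivot in magnitude)
print(L)                 # unit lower triangular
print(U)                 # upper triangular
print(np.allclose(P @ L @ U, A))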
Chapter 5
Iterative Methods for Linear Systems
5.1 Consistent Iterative Methods and Convergence
Definition 5.1.1. Given some initial guess x(0) ∈ Rn , consider iterative methods of the form
\[
x^{(k+1)} = B x^{(k)} + f, \qquad k \ge 0, \tag{5.1.1}
\]
where B is an n × n iteration matrix and f is an n-vector obtained from b.
An iterative method of the form (5.1.1) is said to be consistent with the linear system Ax = b
if f and B are such that x = Bx + f .
Example 5.1.2. Observe that consistency of (5.1.1) does not imply its convergence. Consider
the linear system 2Ix = b. It is clear that the iterative method defined below is consistent:
x(k+1) = −x(k) + b.
However, this method is not convergent for every choice of initial guess x(0) . Indeed, choosing
x(0) = 0 gives
x(2k) = 0, x(2k+1) = b , k ≥ 0.
On the other hand, the proposed iterative method converges to the true solution if x(0) = b/2.
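This behaviour is easy to observe numerically; the sketch below (with an arbitrary right-hand side b) runs the iteration x(k+1) = −x(k) + b from the two initial guesses discussed above:

import numpy as np

b = np.array([1.0, 2.0, 3.0])
x_exact = b / 2.0                     # solution of 2Ix = b

def iterate(x0, num_steps=6):
    x = x0.copy()
    history = [x.copy()]
    for _ in range(num_steps):
        x = -x + b                    # consistent with 2Ix = b: x = -x + b holds at x = b/2
        history.append(x.copy())
    return history

print(iterate(np.zeros(3)))           # oscillates between 0 and b: no convergence
print(iterate(b / 2.0))               # stays at the exact solution b/2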
Let e(k) = x − x(k) . Subtracting the consistency equation x = Bx + f from the iterative
method (5.1.1) yields the recurrence relation for the error,
\[
e^{(k+1)} = B e^{(k)}, \qquad \text{so that} \qquad e^{(k)} = B^k e^{(0)}, \quad k \ge 0.
\]
Proof. The result is trivial if A = 0, so suppose not. Suppose lim_{m→∞} Am = 0. Choose any
λ ∈ σ(A) with corresponding eigenvector x ≠ 0. Since Am x = λm x,
\[
\lim_{m\to\infty} \lambda^m x = \lim_{m\to\infty} A^m x = 0,
\]
and since x ≠ 0 this forces |λ| < 1. This proves the only if statement, since λ ∈ σ(A) was arbitrary.
Conversely, suppose ρ(A) < 1. By continuity of the norm and the fact that all norms on a
finite-dimensional vector space are equivalent, it suffices to prove that
\[
\lim_{m\to\infty} \|A^m\| = 0.
\]
\[
\|A^m\|_2 \le \|D\|_2^m \sum_{k=0}^{N-1} \frac{m^k}{k!} \left( \frac{\|U\|_2}{\|D\|_2} \right)^k
\le m^{N-1} \|D\|_2^m \left( \sum_{k=0}^{N-1} \left( \frac{\|U\|_2}{\|D\|_2} \right)^k \right)
= C\, m^{N-1} \rho(A)^m ,
\]
and the sequence a_m := C m^{N-1} \rho(A)^m converges to 0 as m → ∞ by the Ratio Test for
sequences, since ρ(A) < 1. Consequently,
Remark 5.1.6. A sufficient but not necessary condition for convergence of a consistent iterative
method is kBk < 1 for any consistent matrix norm, since ρ(B) ≤ kBk. The rate of convergence
depends on how much less than 1 the spectral radius is: the smaller it is, the faster the
convergence.
where B = P −1 N and f = P −1 b.
Splitting A = D + R, where
\[
D = \operatorname{diag}(A) = \begin{pmatrix} a_{11} & & & \\ & a_{22} & & \\ & & \ddots & \\ & & & a_{nn} \end{pmatrix}
\quad\text{and}\quad
R = A - D = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ a_{21} & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & 0 \end{pmatrix},
\]
the system Ax = b becomes Dx = −Rx + b, which suggests the Jacobi iteration
\[
x^{(k+1)} = \underbrace{-D^{-1}R}_{B}\, x^{(k)} + \underbrace{D^{-1}b}_{f} .
\]
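A minimal NumPy sketch of the resulting Jacobi iteration (assuming nonzero diagonal entries; the strictly diagonally dominant test matrix is an arbitrary choice, so the method converges):

import numpy as np

def jacobi(A, b, x0, num_iters=50):
    D = np.diag(np.diag(A))
    R = A - D
    D_inv = np.diag(1.0 / np.diag(A))        # requires a_ii != 0
    B = -D_inv @ R                           # iteration matrix
    f = D_inv @ b
    x = x0.copy()
    for _ in range(num_iters):
        x = B @ x + f
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
print(jacobi(A, b, np.zeros(3)))
print(np.linalg.solve(A, b))                 # the two should agree closely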
In the matrix form of Gauss-Seidel, we claim that (D + L)−1 exists. This is true because
(D + L) is strictly diagonally dominant by rows whenever A is, and hence non-singular.
since A is SDD by rows from assumption. Since λ ∈ σ(B) was arbitrary, this gives ρ(B) < 1
and the Jacobi method is convergent.
A similar argument with the iteration matrix in the Gauss-Seidel method B = −(D+L)−1 U
gives
since A is SDD by rows from assumption. Since λ ∈ σ(B) was arbitrary, this gives ρ(B) < 1
and the Gauss-Seidel method is convergent.
(D + L + U )x = b
Dx = b − Lx − U x
x = D−1 (b − Lx − U x)
ωx = ωD−1 (b − Lx − U x)
x = ωD−1 (b − Lx − U x) + (1 − ω)x.
\[
x_i^{(k+1)} = \frac{\omega}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right) + (1-\omega)\, x_i^{(k)}, \qquad i = 1, \ldots, n. \tag{SOR}
\]
In matrix form, this reads
\[
x^{(k+1)} = (D + \omega L)^{-1}\big[ (1-\omega)D - \omega U \big]\, x^{(k)} + \omega (D + \omega L)^{-1} b =: B_\omega x^{(k)} + f_\omega .
\]
For ω = 1, we recover the Gauss-Seidel method. For ω ∈ (0, 1), the method is called
under-relaxation; for ω > 1, the method is called over-relaxation. Clearly there exists an optimal
parameter ω0 that produces the smallest spectral radius.
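A componentwise NumPy sketch of (SOR); with ω = 1 it reduces to Gauss-Seidel, and the SPD test matrix and the value ω = 1.25 are arbitrary illustrative choices:

import numpy as np

def sor(A, b, x0, omega, num_iters=100):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        for i in range(n):
            # x[:i] already holds the updated components x_j^(k+1) for j < i.
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = omega * (b[i] - sigma) / A[i, i] + (1.0 - omega) * x[i]
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
print(sor(A, b, np.zeros(3), omega=1.0))     # Gauss-Seidel
print(sor(A, b, np.zeros(3), omega=1.25))    # over-relaxation
print(np.linalg.solve(A, b))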
Theorem 5.2.4.
(a) If A is symmetric positive definite (SPD), then the SOR method is convergent if and only
if 0 < ω < 2.
We can extend the idea of a relaxation parameter to general consistent iterative methods
(5.1.1). This results in a consistent iterative method for any γ ≠ 0
Proof. Recall that q(x) = \frac{1}{2}\langle x, Ax\rangle - \langle x, b\rangle. Suppose x is a minimiser of q; then the first variation of q at x must vanish. More
precisely, for any v ∈ Rn we must have
\[
q(x + \varepsilon v) = \frac{1}{2}\langle x + \varepsilon v, A(x + \varepsilon v)\rangle - \langle x + \varepsilon v, b\rangle
= \frac{1}{2}\langle x, Ax\rangle + \frac{\varepsilon}{2}\langle x, Av\rangle + \frac{\varepsilon}{2}\langle v, Ax\rangle + \frac{\varepsilon^2}{2}\langle v, Av\rangle - \langle x, b\rangle - \varepsilon\langle v, b\rangle
= q(x) + \varepsilon\big[ \langle v, Ax\rangle - \langle v, b\rangle \big] + \frac{\varepsilon^2}{2}\langle v, Av\rangle,
\]
using A = AT ; requiring the first-order term to vanish for every v ∈ Rn gives Ax = b.
Conversely, suppose Ax = b. For any v ∈ Rn , write v = x + w. Then
\[
q(v) = q(x + w) = \frac{1}{2}\langle x + w, A(x + w)\rangle - \langle x + w, b\rangle
= \frac{1}{2}\langle x, Ax\rangle + \frac{1}{2}\langle x, Aw\rangle + \frac{1}{2}\langle w, Ax\rangle + \frac{1}{2}\langle w, Aw\rangle - \langle x, b\rangle - \langle w, b\rangle
= q(x) + \langle w, Ax\rangle - \langle w, b\rangle + \frac{1}{2}\langle w, Aw\rangle
= q(x) + \frac{1}{2}\langle w, Aw\rangle \ge q(x),
\]
where we use A = AT , Ax = b and the assumption that A is positive-definite. Since v ∈ Rn was arbitrary, it follows
that x is a minimiser of q(x).
If A is SPD, then the minimiser is unique: if x, y ∈ Rn were two distinct minimisers of q,
they would satisfy Ax = b = Ay, i.e. A(x − y) = 0, which implies that
x − y = 0 since A is non-singular, a contradiction. In practice, q(x) usually represents a significant quantity
such as the energy of a system. In this case the solution to Ax = b represents a state of minimal
energy.
This together with the expression for αk from Lemma 5.3.2 yields
Remark 5.3.4. In the steepest descent method, the residual is updated via r(k+1) = r(k) − αk Ar(k) .
This update is numerically more stable than (5.3.1b) under rounding error, since b can be very
close to Ax(k+1) for k large enough.
\[
\alpha_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle r^{(k)}, Ar^{(k)}\rangle}, \qquad
x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)}, \qquad
r^{(k+1)} = r^{(k)} - \alpha_k A r^{(k)} .
\]
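A NumPy sketch of these three update formulas (the SPD test system is an arbitrary example, and the stopping rule on the residual norm is mine):

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iters=10_000):
    x = x0.astype(float).copy()
    r = b - A @ x                          # initial residual
    for _ in range(max_iters):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)         # alpha_k = <r, r> / <r, A r>
        x = x + alpha * r
        r = r - alpha * Ar                 # residual update, cf. Remark 5.3.4
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
b = np.array([1.0, 1.0])
print(steepest_descent(A, b, np.zeros(2)), np.linalg.solve(A, b))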
Theorem 5.3.5. Let A ∈ Rn×n be symmetric positive definite. The steepest descent method is
convergent for any initial condition x(0) ∈ Rn and we have the following error estimate:
\[
\|e^{(k+1)}\|_A \le \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1}\, \|e^{(k)}\|_A ,
\]
where e(k) = x(k) − xexact and κ2 (A) = kAk2 kA−1 k2 = σ1 /σn is the condition number of A with
respect to k · k2 .
Although the steepest descent method is convergent, the error may decrease very slowly. As
such, the steepest descent method can be time consuming: the steepest descent directions r(k)
can oscillate. Indeed, Lemma 5.3.3 tells us that r(k+2) can be almost parallel to ±r(k) with the
same magnitude.
Consider instead search directions p(k) with the updates
\[
x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}, \qquad r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}, \tag{Residual}
\]
\[
p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}, \tag{Search}
\]
where αk , βk , p(0) are again chosen so that E(x(k+1) ) is minimised. We recover the steepest
descent method if βk = 0.
Lemma 5.3.6. Given x(k) , E(x(k+1) ) is minimised if p(0) = r(0) and αk , βk are chosen as in the
algorithm summary below; in that case
\[
E(x^{(k+1)}) = E(x^{(k)}) - \frac{1}{2}\,\frac{\langle r^{(k)}, r^{(k)}\rangle^2}{\langle p^{(k)}, A p^{(k)}\rangle}. \tag{5.3.4}
\]
For k = 0, E(x(1) ) < E(x(0) ) if we choose p(0) = r(0) , since A is positive-definite. To find βk ,
we want to maximise the second term in (5.3.4), i.e. minimise hp(k) , Ap(k) i. We write this
expression in terms of βk using (Search) and A = AT and get
Since the last expression is a quadratic function of βk−1 , E(x(k+1) ) is minimised if βk−1 satisfies
H ′ (βk−1 ) = 0, and solving this gives
\[
\beta_k = -\frac{\langle r^{(k+1)}, A p^{(k)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle}, \qquad k \ge 1. \tag{5.3.5}
\]
Observe that using (Search) gives an orthogonal relation for successive p(k) with respect to
h·, A(·)i:
\[
\langle p^{(k+1)}, A p^{(k)}\rangle = \langle r^{(k+1)}, A p^{(k)}\rangle - \frac{\langle r^{(k+1)}, A p^{(k)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle}\,\langle p^{(k)}, A p^{(k)}\rangle = 0,
\]
We also obtain an orthogonal relation for successive r(k) with respect to h·, ·i, using (Residual)
and A = AT to get
Finally,
\[
\beta_k = \frac{\langle p^{(k)}, A p^{(k)}\rangle}{\langle r^{(k)}, r^{(k)}\rangle}\cdot\frac{\langle r^{(k+1)}, r^{(k+1)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle} = \frac{\|r^{(k+1)}\|_2^2}{\|r^{(k)}\|_2^2} .
\]
Lemma 5.3.7. For the conjugate gradient method, the residuals and search directions satisfy
the orthogonality:
hr(j) , r(k) i = hp(j) , Ap(k) i = 0 for all j ≠ k.
The following partial result was shown in the proof of Lemma 5.3.6:
\[
\langle r^{(k+1)}, r^{(k)}\rangle = \langle p^{(k+1)}, A p^{(k)}\rangle = 0 \qquad \text{for all } k \ge 0.
\]
We argue by induction. Suppose
\[
\langle r^{(j)}, r^{(k)}\rangle = \langle p^{(j)}, A p^{(k)}\rangle = 0 \qquad \text{for all } 0 \le k < j \le N.
\]
We need to show that the same relation holds for all 0 ≤ k < j ≤ N + 1. This is true from the
partial result if j = N + 1 and k = N , so suppose j = N + 1 and k < N . Then
\[
\begin{aligned}
\langle r^{(N+1)}, r^{(k)}\rangle &= \langle r^{(N)} - \alpha_N A p^{(N)}, r^{(k)}\rangle && [\text{From (Residual)}] \\
&= -\alpha_N \langle A p^{(N)}, r^{(k)}\rangle && [\text{induction hypothesis}] \\
&= -\alpha_N \langle A p^{(N)}, p^{(k)} - \beta_{k-1} p^{(k-1)}\rangle && [\text{From (Search)}] \\
&= 0.
\end{aligned}
\]
\[
\begin{aligned}
\langle p^{(N+1)}, A p^{(k)}\rangle &= \langle r^{(N+1)} + \beta_N p^{(N)}, A p^{(k)}\rangle && [\text{From (Search)}] \\
&= \langle r^{(N+1)}, A p^{(k)}\rangle && [\text{induction hypothesis}] \\
&= \left\langle r^{(N+1)}, \frac{r^{(k)} - r^{(k+1)}}{\alpha_k} \right\rangle && [\text{From (Residual)}] \\
&= 0,
\end{aligned}
\]
\[
\alpha_k = \frac{\|r^{(k)}\|_2^2}{\langle p^{(k)}, A p^{(k)}\rangle}, \qquad
x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}, \qquad
r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)},
\]
\[
\beta_k = \frac{\|r^{(k+1)}\|_2^2}{\|r^{(k)}\|_2^2}, \qquad
p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)} .
\]
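A NumPy sketch of this algorithm (the SPD test system is an arbitrary example; in exact arithmetic the loop would terminate in at most n steps, cf. Theorem 5.3.8 below):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()                               # p^(0) = r^(0)
    for _ in range(len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)             # alpha_k = ||r^(k)||^2 / <p^(k), A p^(k)>
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)       # beta_k = ||r^(k+1)||^2 / ||r^(k)||^2
        p = r_new + beta * p
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])
print(conjugate_gradient(A, b, np.zeros(3)), np.linalg.solve(A, b))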
Theorem 5.3.8. If A ∈ Rn×n is symmetric positive definite, then the conjugate gradient
method converges (pointwise) in at most n steps to the solution of Ax = b. Moreover, the error
e(k) = x(k) − xexact satisfies e(k) ⊥ p(j) for j = 0, 1, . . . , k − 1, k < n, and
\[
\|e^{(k)}\|_A \le \frac{2C^k}{1 + C^{2k}}\, \|e^{(0)}\|_A , \qquad \text{where } C = \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}.
\]
\[
\sum_{j=0}^{n-1} \delta_j r^{(j)} = 0. \tag{5.3.6}
\]
Either r(k) = 0 for some k ≤ n − 1, which means the iteration process stops at the kth step, or
δk = 0 for every k = 0, 1, . . . , n − 1, which means the set of residual vectors {r(0) , r(1) , . . . , r(n−1) }
forms a basis of Rn and r(n) ≡ 0. In both cases, we see that the conjugate gradient method
converges in at most n steps.
5.4 Problems
1. Let A be a square matrix and let k · k be a consistent matrix norm (we say that k · k is
compatible or consistent with a vector norm k · k if kAxk ≤ kAkkxk). Show that
\[
\rho(A) = \lim_{m\to\infty} \|A^m\|_2^{1/m} .
\]
Solution: Let λ ∈ σ(A) with corresponding eigenvector x ≠ 0. Then Am x = λm x, so
|λ|m kxk2 = kAm xk2 ≤ kAm k2 kxk2 , i.e.
\[
|\lambda|^m \le \|A^m\|_2 .
\]
Taking the mth root of each side, and then the maximum over all λ ∈ σ(A), yields
\[
\max_{\lambda \in \sigma(A)} |\lambda| = \rho(A) \le \|A^m\|_2^{1/m} \quad \Longrightarrow \quad \rho(A) \le \lim_{m\to\infty} \|A^m\|_2^{1/m} . \tag{5.4.2}
\]
For the reverse inequality, arguing as in the proof of the convergence theorem above, we obtain
\[
\|A^m\|_2 \le C\, m^{N-1} \rho(A)^m ,
\]
where C > 0 is independent of m. Taking the mth root and then the limit as m → ∞
yields
\[
\lim_{m\to\infty} \|A^m\|_2^{1/m} \le \lim_{m\to\infty} C^{1/m}\, m^{(N-1)/m}\, \rho(A) = \rho(A), \tag{5.4.3}
\]
where we use the fact that lim_{m→∞} C^{1/m} = 1 = lim_{m→∞} m^{1/m} for any positive real
number C. The result follows from combining (5.4.2) and (5.4.3).
2. Consider the 3 × 3 linear system of the form Aj x = bj , where bj is always taken in such a
way that the solution of the system is the vector x = (1, 1, 1)T , and the matrices Aj are
\[
A_1 = \begin{pmatrix} 3 & 0 & 4 \\ 7 & 4 & 2 \\ -1 & 1 & 2 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} -3 & 3 & -6 \\ -4 & 7 & -8 \\ 5 & 7 & -9 \end{pmatrix}, \quad
A_3 = \begin{pmatrix} 4 & 1 & 1 \\ 2 & -9 & 0 \\ 0 & -8 & -6 \end{pmatrix}, \quad
A_4 = \begin{pmatrix} 7 & 6 & 9 \\ 4 & 5 & -4 \\ -7 & -3 & 8 \end{pmatrix}.
\]
where A = L + D + U with D the diagonal part, L the strictly lower triangular part and U the
strictly upper triangular part of A. The number of iterations is N = 200, and we test the
algorithm for three different random initial guesses
\[
x_0^1 = \begin{pmatrix} 0.1214 \\ 0.1815 \\ 1.6112 \end{pmatrix}, \qquad
x_0^2 = \begin{pmatrix} 1.0940 \\ 1.7902 \\ 0.3737 \end{pmatrix}, \qquad
x_0^3 = \begin{pmatrix} 1.8443 \\ 1.3112 \\ 1.2673 \end{pmatrix}.
\]
We choose to stop the iteration process if the Euclidean norm of the residual vector
kb − Ax(k) k2 is less than the chosen tolerance 10−12 . We explain the numerical
result using the spectral radius of the iteration matrix.
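The observed behaviour can be predicted without running the iterations, by computing the spectral radii of the Jacobi and Gauss-Seidel iteration matrices; a NumPy sketch (the helper function is mine) is:

import numpy as np

def iteration_spectral_radii(A):
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    B_jacobi = -np.linalg.inv(D) @ (L + U)
    B_gs = -np.linalg.inv(D + L) @ U
    rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    return rho(B_jacobi), rho(B_gs)

A1 = np.array([[3.0, 0.0, 4.0], [7.0, 4.0, 2.0], [-1.0, 1.0, 2.0]])
A2 = np.array([[-3.0, 3.0, -6.0], [-4.0, 7.0, -8.0], [5.0, 7.0, -9.0]])
A3 = np.array([[4.0, 1.0, 1.0], [2.0, -9.0, 0.0], [0.0, -8.0, -6.0]])
A4 = np.array([[7.0, 6.0, 9.0], [4.0, 5.0, -4.0], [-7.0, -3.0, 8.0]])

for name, A in [("A1", A1), ("A2", A2), ("A3", A3), ("A4", A4)]:
    rj, rg = iteration_spectral_radii(A)
    # The consistent method converges for every initial guess iff the spectral radius is < 1.
    print(name, "Jacobi rho =", round(rj, 4), "Gauss-Seidel rho =", round(rg, 4))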
Chapter 6
Eigenvalue Problems
Eigenvalues and eigenvectors of square matrices appear in the analysis of linear transformations
and have a wide range of applications, such as facial recognition, image compression, spectral
clustering, dimensionality reduction and ranking algorithms. These matrices may be sparse or
dense and may have greatly varying order and structure. What is to be calculated affects the
choice of method to be used, as well as the structure of the given matrix. We first discuss
three matrix factorisations, where the eigenvalues are explicitly displayed. We then review
three classical eigenvalue algorithms: power iteration, inverse iteration and Rayleigh quotient
iteration.
6.1 Eigenvalue-Revealing Factorisation
Consequently, eigenvalues of A are roots of the characteristic polynomial pA and vice versa and
we may write pA as
\[
p_A(z) = \prod_{j=1}^{m} (z - \lambda_j) = (z - \lambda_1)(z - \lambda_2)\cdots(z - \lambda_m),
\]
where λj ∈ C are eigenvalues of A and they might be repeated. With this in mind, we define
the algebraic multiplicity of λ ∈ σ(A) as the multiplicity of λ as a root of pA (z); an eigenvalue
is simple if its algebraic multiplicity is 1.
Theorem 6.1.2. A matrix A ∈ Cm×m has m eigenvalues, counted with algebraic multiplicity.
In particular, A has m distinct eigenvalues if the roots of pA are simple.
Theorem 6.1.3. Given a matrix A ∈ Cm×m , the following relation holds where eigenvalues
are counted with algebraic multiplicity:
\[
\det(A) = \prod_{j=1}^{m} \lambda_j, \qquad \operatorname{tr}(A) = \sum_{j=1}^{m} \lambda_j .
\]
The second formula follows from equating the coefficient of z^{m-1} in det(zI − A) and \prod_{j=1}^{m} (z - \lambda_j).
Theorem 6.1.4. If X is nonsingular, then A and X −1 AX have the same characteristic poly-
nomial, eigenvalues and algebraic and geometric multiplicities.
Consequently, A and B = X −1 AX have the same characteristic polynomial and also the
eigenvalues and algebraic multiplicities. Finally, suppose Eλ is an eigenspace for A. For any
x ∈ Eλ , we have that
y ′ = X −1 x′ = X −1 Ax = (X −1 AX)X −1 x = Λy.
The system is now decoupled and it can be solved separately. The solutions are
Definition 6.1.8. An eigenvalue whose algebraic multiplicity exceeds its geometric multiplicity
is a defective eigenvalue. A matrix that has one or more defective eigenvalues is a defective
matrix.
Proof. Suppose A has an eigenvalue decomposition. Since A is similar to Λ, it follows from The-
orem 6.1.4 that they share the same eigenvalues and the same multiplicities. Consequently, A is
nondefective since diagonal matrices are nondefective. Conversely, suppose A is nondefective.
Then A must have m linearly independent eigenvectors since one can show that eigenvectors
with different eigenvalues must be linearly independent, and each eigenvalue can contribute as
many linearly independent eigenvectors as its multiplicity. Defining X as the matrix whose
columns are these m linearly independent eigenvectors, we see that AX = XΛ, or A = XΛX −1 .
Theorem 6.1.10.
(a) A hermitian matrix is unitarily diagonalisable, and its eigenvalues are real.
Proof. The proof is similar to the existence of SVD. The case m = 1 is trivial, so suppose
m ≥ 2. Let q1 be any eigenvector of A, with corresponding eigenvalue λ. WLOG, we may
assume kq1 k2 = 1. Consider any extension of q1 to an orthonormal basis {q1 , . . . , qm } ⊂ Cm
and construct the unitary matrix
\[
Q_1 = \begin{bmatrix} q_1 & \widehat{Q}_1 \end{bmatrix} \in \mathbb{C}^{m\times m}, \qquad
\widehat{Q}_1 = \begin{bmatrix} q_2 & \cdots & q_m \end{bmatrix}.
\]
We have that
\[
T_1 := Q_1^* A Q_1 = \begin{bmatrix} q_1^* \\ \widehat{Q}_1^* \end{bmatrix} A \begin{bmatrix} q_1 & \widehat{Q}_1 \end{bmatrix}
= \begin{bmatrix} q_1^* A q_1 & q_1^* A \widehat{Q}_1 \\ \widehat{Q}_1^* A q_1 & \widehat{Q}_1^* A \widehat{Q}_1 \end{bmatrix}
= \begin{bmatrix} \lambda & b^* \\ 0 & \widehat{A} \end{bmatrix}.
\]
By the induction hypothesis, Â has a Schur factorisation Â = Q2 T2 Q2∗ . Then
\[
Q_1^* A Q_1 = \begin{bmatrix} \lambda & b^* \\ 0 & Q_2 T_2 Q_2^* \end{bmatrix}
= \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix}
  \begin{bmatrix} \lambda & b^* Q_2 \\ 0 & T_2 \end{bmatrix}
  \begin{bmatrix} 1 & 0^* \\ 0 & Q_2^* \end{bmatrix},
\]
and
\[
A = Q_1 \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix}
  \begin{bmatrix} \lambda & b^* Q_2 \\ 0 & T_2 \end{bmatrix}
  \begin{bmatrix} 1 & 0^* \\ 0 & Q_2^* \end{bmatrix} Q_1^* = Q T Q^* .
\]
To finish the proof, we need to show that Q is unitary, but this must be true since
\[
Q = Q_1 \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix} = \begin{bmatrix} q_1 & \widehat{Q}_1 Q_2 \end{bmatrix}
\]
is the product of two unitary matrices.
Theorem 6.1.13 (Gershgorin Circle Theorem). The spectrum σ(A) is contained in the union
of the following m disks Di , i = 1, . . . , m in C, where
\[
D_i = \left\{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j \ne i} |a_{ij}| \right\}, \qquad i = 1, \ldots, m.
\]
and (normalising the eigenvector x so that |xi | = kxk∞ = 1)
\[
|\lambda - a_{ii}| = |(\lambda - a_{ii}) x_i| = \Big| \sum_{j \ne i} a_{ij} x_j \Big| \le \sum_{j \ne i} |a_{ij} x_j| \le \sum_{j \ne i} |a_{ij}| .
\]
Example 6.1.14. Consider the matrix
\[
A = \begin{pmatrix} -1+i & 0 & 1/4 \\ 1/4 & 1 & 1/4 \\ 1 & 1 & 3 \end{pmatrix}.
\]
Applying the Gershgorin Circle Theorem gives the following three disks in C:
\[
|\lambda - (-1+i)| \le 0 + \tfrac{1}{4} = \tfrac{1}{4}, \qquad
|\lambda - 1| \le \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \qquad
|\lambda - 3| \le 1 + 1 = 2.
\]
Upon sketching these disks in C, we see that 1/2 ≤ |λ| ≤ 5. [Draw the solution set in C.]
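A small NumPy check that every eigenvalue of this matrix indeed lies in at least one of the disks:

import numpy as np

A = np.array([[-1 + 1j, 0.0, 0.25],
              [0.25, 1.0, 0.25],
              [1.0, 1.0, 3.0]], dtype=complex)

centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)     # off-diagonal row sums
eigvals = np.linalg.eigvals(A)

for lam in eigvals:
    # Each eigenvalue must lie in at least one Gershgorin disk.
    in_some_disk = np.any(np.abs(lam - centres) <= radii + 1e-12)
    print(lam, in_some_disk)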
It follows that the roots of pm (z) are equal to the eigenvalues of the companion matrix
\[
A = \begin{pmatrix}
0 & & & & & -a_0 \\
1 & 0 & & & & -a_1 \\
 & 1 & 0 & & & -a_2 \\
 & & \ddots & \ddots & & \vdots \\
 & & & 1 & 0 & -a_{m-2} \\
 & & & & 1 & -a_{m-1}
\end{pmatrix}.
\]
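This is essentially how numpy.roots works: it forms the companion matrix of the polynomial and computes its eigenvalues. A sketch for an arbitrary cubic with known roots 1, 2, 3:

import numpy as np

# p(z) = z^3 + a2 z^2 + a1 z + a0, monic, with roots 1, 2, 3.
a0, a1, a2 = -6.0, 11.0, -6.0

C = np.array([[0.0, 0.0, -a0],
              [1.0, 0.0, -a1],
              [0.0, 1.0, -a2]])              # companion matrix as displayed above

print(np.sort(np.linalg.eigvals(C)))         # eigenvalues of the companion matrix
print(np.sort(np.roots([1.0, a2, a1, a0])))  # the same roots via numpy.roots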
Theorem 6.2.1. For any m ≥ 5, there exists a polynomial p(z) of degree m with rational
coefficients that has a real root r, with the property that r cannot be written using any expression
involving rational numbers, addition, subtraction, multiplication, division and kth roots.
This theorem says that no computer program can produce the exact roots of an arbitrary
polynomial of degree ≥ 5 in a finite number of steps even in exact arithmetic, and it is because
of this that any eigenvalue solver must be iterative.
\[
A^T A\,\alpha = A^T b, \qquad (x^T x)\,\alpha = x^T A x, \qquad \alpha = \frac{x^T A x}{x^T x} = R(x).
\]
It is helpful to view R(·) as a function from Rm to R. We investigate the local behavior of
R(x) when x is near an eigenvector. Computing the partial derivatives of R(x) with respect
to the coordinates xj yields
\[
\frac{\partial R(x)}{\partial x_j}
= \frac{1}{x^T x}\frac{\partial}{\partial x_j}(x^T A x) - \frac{x^T A x}{(x^T x)^2}\frac{\partial}{\partial x_j}(x^T x)
= \frac{2(Ax)_j}{x^T x} - \frac{(x^T A x)\,2x_j}{(x^T x)^2}
= \frac{2}{x^T x}\big[(Ax)_j - R(x)x_j\big]
= \frac{2}{x^T x}\big(Ax - R(x)x\big)_j .
\]
Consequently, the gradient of R(x) is
\[
\nabla R(x) = \frac{2}{x^T x}\,\big(Ax - R(x)x\big)^T .
\]
We deduce the following properties of R(x) from the formula of ∇R(x):
We give another proof of the asymptotic relation (6.2.1). We express x as a linear combination
of the eigenvectors {q1 , . . . , qm }:
\[
x = \sum_{j=1}^{m} a_j q_j = \sum_{j=1}^{m} \langle x, q_j\rangle q_j, \qquad \text{since } A \text{ is symmetric.}
\]
Assuming x ≈ qJ and |aj /aJ | ≤ ε for all j ≠ J, it suffices to show that R(x) − R(qJ ) = O(ε2 ),
since from the Pythagorean theorem we have that
\[
\|x - q_J\|_2^2 = \sum_{j \ne J} |a_j|^2 + |a_J - 1|^2
= |a_J|^2 \left( \sum_{j \ne J} \Big| \frac{a_j}{a_J} \Big|^2 + \Big| \frac{a_J - 1}{a_J} \Big|^2 \right) \approx C \varepsilon^2 .
\]
Theorem 6.2.2. Assume |λ1 | > |λ2 | ≥ · · · ≥ |λm | ≥ 0 and q1T v (0) ≠ 0. Then the iterates of
the power iteration algorithm satisfy
\[
\|v^{(k)} - (\pm q_1)\|_2 = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big|^{k} \right), \qquad
|\lambda^{(k)} - \lambda_1| = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big|^{2k} \right) \qquad \text{as } k \to \infty.
\]
The ± sign means that at each step k, one or the other choice of sign is to be taken, and then
the indicated bound holds.
\[
v^{(0)} = a_1 q_1 + a_2 q_2 + \ldots + a_m q_m .
\]
\[
v^{(k)} \longrightarrow \frac{\lambda_1^k\, a_1 q_1}{|\lambda_1|^k\, |a_1|\, \|q_1\|_2} = \pm q_1 \qquad \text{as } k \to \infty,
\]
depending on the sign of λ1 and the initial guess v (0) , and the first equation follows. The second
equation follows from the asymptotic relation (6.2.1) of the Rayleigh quotient.
If λ1 > 0, then the sign in ±q1 is controlled by the initial guess v (0) , and so the signs are all + or all
−. If λ1 < 0, then the signs of ±q1 alternate and kv (k) k22 −→ kq1 k22 as k −→ ∞. One can show
that the iterates of the power iteration algorithm satisfy
\[
\frac{\|v^{(k+1)} - (\pm q_1)\|_2}{\|v^{(k)} - (\pm q_1)\|_2} = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big| \right) \qquad \text{as } k \to \infty.
\]
Consequently, the rate of convergence for the power iteration is linear. Except for special
matrices, the power iteration is very slow!
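A NumPy sketch of the power iteration, using the Rayleigh quotient as the eigenvalue estimate λ(k) (the symmetric test matrix and random initial vector are arbitrary choices):

import numpy as np

def power_iteration(A, v0, num_iters=100):
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)               # normalise at every step
        lam = v @ A @ v                         # Rayleigh quotient estimate of lambda_1
    return lam, v

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])                 # symmetric
rng = np.random.default_rng(1)
lam, v = power_iteration(A, rng.standard_normal(3))
print(lam, np.max(np.linalg.eigvalsh(A)))       # should match the dominant eigenvalue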
The upshot is that if we choose µ sufficiently close to λJ , then |λJ − µ|−1 may be much larger than
|λj − µ|−1 for all j ≠ J. Consequently, applying the power iteration to the matrix (A − µI)−1
gives rapid convergence to qJ , and this is precisely the idea of inverse iteration.
Note that each step of the algorithm involves solving a linear system, and this raises an
immediate question: what if A − µI is so ill-conditioned that an accurate
solution of the linear system is not possible? This however is not a problem at all and we shall
not pursue this issue any further; the interested reader may refer to Exercise 27.5 in [TBI97, p. 210].
The following theorem is essentially a corollary of Theorem 6.2.2.
Theorem 6.2.3. Suppose λJ is the closest eigenvalue to µ and λm is the second closest, that
is,
\[
|\mu - \lambda_J| < |\mu - \lambda_m| \le |\mu - \lambda_j| \qquad \text{for each } j \ne J.
\]
Suppose qJT v (0) ≠ 0. Then the iterates of the inverse iteration algorithm satisfy
\[
\|v^{(k)} - (\pm q_J)\|_2 = O\!\left( \Big| \frac{\mu - \lambda_J}{\mu - \lambda_m} \Big|^{k} \right), \qquad
|\lambda^{(k)} - \lambda_J| = O\!\left( \Big| \frac{\mu - \lambda_J}{\mu - \lambda_m} \Big|^{2k} \right) \qquad \text{as } k \to \infty.
\]
The ± sign means that at each step k, one or the other choice of sign is to be taken, and the
indicated bound holds.
In practice, the inverse iteration is used when a good approximation µ of the desired eigenvalue
is known; in general, the inverse iteration converges to the eigenvector of the matrix A
corresponding to the eigenvalue closest to µ. As opposed to the power iteration, we can control
the rate of linear convergence, since it depends on the choice of µ.
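A NumPy sketch of the inverse iteration with a fixed shift µ (the symmetric test matrix and the shift µ = 2.5 are arbitrary illustrative choices; note that a linear system with A − µI is solved at every step rather than forming the inverse):

import numpy as np

def inverse_iteration(A, mu, v0, num_iters=50):
    n = A.shape[0]
    v = v0 / np.linalg.norm(v0)
    shifted = A - mu * np.eye(n)
    for _ in range(num_iters):
        w = np.linalg.solve(shifted, v)     # solve (A - mu I) w = v instead of inverting
        v = w / np.linalg.norm(w)
        lam = v @ A @ v                     # Rayleigh quotient estimate of lambda_J
    return lam, v

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu = 2.5                                    # converges to the eigenvalue of A closest to mu
rng = np.random.default_rng(2)
lam, v = inverse_iteration(A, mu, rng.standard_normal(3))
print(lam, np.linalg.eigvalsh(A))           # compare with the full spectrum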
[TBI97] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. Vol. 50. Other Titles in
Applied Mathematics. SIAM, 1997.