Math 6610 - Analysis of Numerical Methods I
1 Introduction 5
1.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Orthogonal Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Norms and Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Cauchy-Schwarz and Holder Inequalities . . . . . . . . . . . . . . . . . . 12
1.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Systems of Equations 67
4.1 Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Stability of Gaussian Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Cholesky Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Eigenvalue Problems 91
6.1 Eigenvalue-Revealing Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Geometric and Algebraic Multiplicity . . . . . . . . . . . . . . . . . . . . 91
6.1.2 Eigenvalue Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.1.3 Unitary Diagonalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.4 Schur Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.5 Localising Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2 Eigenvalue Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Shortcomings of Obvious Algorithms . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Rayleigh Quotient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2.3 Power iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.4 Inverse Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.5 Rayleigh Quotient Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 101
Abstract: These notes are largely based on the course Math 6610: Analysis of Numerical Methods I, taught by Yekaterina Epshteyn in Fall 2016 at the University of Utah. Additional examples, remarks, or results from other sources are added as I see fit, mainly to facilitate my understanding. These notes are by no means guaranteed to be accurate or complete, and any mistakes here are of course my own. Please report any typographical errors or mathematical fallacies to me by email at [email protected]
Chapter 1
Introduction
We review some basic facts about linear algebra, in particular viewing matrix-vector multiplication as a linear combination of the columns of the matrix; this plays an important role in understanding the key ideas behind many algorithms of numerical linear algebra. We review orthogonality, upon which many of the best algorithms are based. Finally, we discuss vector norms and matrix norms, as these provide a way of measuring approximation error and the convergence of numerical algorithms.
It is not too difficult to see that the matrix-vector product can also be viewed as a linear combination of the columns {a1, . . . , an} of A, i.e.

b = Ax = \sum_{j=1}^{n} xj aj.    (1.1.1)
This easily generalises to the matrix-matrix product B = AC, in which each column of B is a linear combination of the columns of A. More precisely, if A ∈ Cm×l and C ∈ Cl×n, then B ∈ Cm×n with

bij = \sum_{k=1}^{l} aik ckj   for each i = 1, . . . , m, j = 1, . . . , n,

or equivalently, writing bk for the kth column of B and aj for the jth column of A,

bk = \sum_{j=1}^{l} cjk aj   for each k = 1, . . . , n.
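Equation (1.1.1) and the column interpretation of B = AC are easy to check numerically. The following small sketch uses Python with NumPy (not part of the original notes; the matrix sizes and random data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# b = Ax as a linear combination of the columns of A, cf. (1.1.1)
b_direct = A @ x
b_columns = sum(x[j] * A[:, j] for j in range(A.shape[1]))
print(np.allclose(b_direct, b_columns))          # True

# Each column of B = AC is A times the corresponding column of C.
C = rng.standard_normal((3, 5))
B = A @ C
print(np.allclose(B[:, 2], A @ C[:, 2]))         # True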
Example 1.1.1. The outer product is the product of a column vector u ∈ Cm and a row vector v∗ (with v ∈ Cn), which gives a rank-one matrix A = uv∗ ∈ Cm×n. Symbolically,

A = uv∗ = [v1 u | v2 u | · · · | vn u],

i.e. every column of A is a multiple of u.
(b) The range R(A) of A is the set of vectors y ∈ Cm such that y = Ax for some x ∈ Cn . It
is clear from (1.1.1) that R(A) is the vector space spanned by columns of A:
R(A) = span{a1 , . . . , an }.
(c) The column rank of A is the dimension of its column space. The row rank of A is the
dimension of its row space.
It can be shown that the column rank is always equal to the row rank of a matrix. Thus,
the rank of a matrix is well-defined. A matrix A ∈ Cm×n of full rank is one that has the
maximal possible rank min{m, n}. This means that a matrix of full rank with m ≥ n must
have n linearly independent columns.
Theorem 1.1.3. A matrix A ∈ Cm×n with m ≥ n has full rank if and only if it maps no two
distinct vectors to the same vector.
Proof. Suppose A is of full rank; then its columns {a1, . . . , an} form a linearly independent set of vectors in Cm. Suppose Ax = Ay; we need to show that x = y, but this is true since

A(x − y) = 0 =⇒ \sum_{j=1}^{n} (xj − yj) aj = 0 =⇒ xj − yj = 0 for each j = 1, . . . , n.
Conversely, suppose A maps no two distinct vectors to the same vector. To show that A is of full rank, it suffices to prove that its columns {a1, . . . , an} are linearly independent in Cm. Suppose

\sum_{j=1}^{n} xj aj = 0.
This is equivalent to Ax = 0 with x = (x1 , . . . , xn )∗ ∈ Cn , and we see that x must be the zero
vector. Otherwise there exists a nonzero vector y ∈ Cn such that Ay = 0 = A(0) and this
contradicts the assumption.
(b) rank(A) = m.
(c) R(A) = Cm .
(g) det(A) 6= 0.
When writing the product x = A−1 b, we should understand x as the unique vector that
satisfies the equation Ax = b. This means that x is the vector of coefficients of the unique
linear expansion of b in the basis of columns of A. Multiplication by A−1 is a change of basis
operation. More precisely, if we view b as coefficients of the expansion of b in {e1 , . . . , em },
then multiplication of A−1 results in coefficients of the expansion of b in {a1 , . . . , am }.
Definition 1.2.1.
kxk2 = √(x∗x) = ( \sum_{j=1}^{m} |xj|^2 )^{1/2}.
(c) The cosine of the angle α between x and y can be expressed in terms of the inner product:

cos(α) = x∗y / (kxk2 kyk2).
Remark 1.2.2. Over C, the inner product is sesquilinear: with (x, y) = x∗y, the map y 7→ (z, y) is linear while x 7→ (x, z) is conjugate linear. Over R, the inner product is bilinear.
Definition 1.2.3. A set of nonzero vectors S is said to be orthogonal if its elements are
pairwise orthogonal, that is,
x, y ∈ S, x 6= y =⇒ (x, y) = x∗ y = 0.
Theorem 1.2.4. The vectors in an orthogonal set S are linearly independent. Consequently,
if an orthogonal set S ⊂ Cm contains m vectors, then it is a basis for Cm .
Proof. Suppose, by contradiction, that the set of orthogonal vectors S is not linearly independent. This means that at least one of the vectors vk ∈ S can be written as a non-trivial linear combination of the remaining vectors in S, i.e.

vk = \sum_{j≠k} αj vj.

Taking the inner product of vk against vk and using the orthogonality of the set S gives

(vk, vk) = \sum_{j≠k} (αj vj, vk) = 0,

which contradicts the assumption that the vectors in S are nonzero.
We see that r is the part of v orthogonal to {q1 , q2 , . . . , qn } and for every j = 1, 2, . . . , n, (qj , v)qj
is the part of v in the direction of qj .
If {qj } is a basis for Cm , then n must be equal to m and r must be the zero vector, so v is
completely decomposed into m orthogonal components in the direction of the qj . In (1.2.1), we
see that we have two different expressions. In the first case, we view v as a sum of coefficients
qj∗ v times vectors qj . In the second case, we view v as a sum of orthogonal projections of v
onto the various directions qj . The jth projection operation is achieved by the very special
rank-one matrix qj qj∗ .
A square matrix Q ∈ Cm×m is unitary (or orthogonal in the real case) if Q∗ = Q−1 ,
that is, Q∗ Q = QQ∗ = Im . In terms of the columns of Q, we have the relation qi∗ qj = δij .
This means that columns of a unitary matrix Q form an orthonormal basis for Cm . In the real
case, multiplication by an orthogonal matrix Q corresponds to a rigid rotation if det(Q) = 1
or reflection if det(Q) = −1.
Lemma 1.2.5. The inner product is invariant under unitary transformation, i.e. for any
unitary matrix Q ∈ Cm×m , (Qx, Qy) = (x, y) for any x, y ∈ Cm . Such invariance means that
angles between vectors and their lengths are preserved under unitary transformation.
Remark 1.2.6. Note that the lemma is still true for any matrix with orthonormal columns.
kxk2 = ( \sum_{j=1}^{n} |xj|^2 )^{1/2}    (l2 norm)

kxkp = ( \sum_{j=1}^{n} |xj|^p )^{1/p},   p ≥ 1.    (lp norm)

For any nonsingular matrix W ∈ Cn×n, we can define the weighted p-norms, given by

kxkW = kW xkp = ( \sum_{i=1}^{n} | \sum_{j=1}^{n} wij xj |^p )^{1/p}.
Definition 1.3.2. Given A ∈ Cm×n, the induced matrix norm kAk is defined as

kAk = sup_{x∈Cn, x≠0} kAxk / kxk = sup_{x∈Cn, kxk=1} kAxk.
To find the kDk2 geometrically, observe that the image of the 2-norm unit sphere under D
is an m-dimensional ellipse whose semiaxis lengths are given by the numbers |dj |. The unit
vectors amplified most by D are those that are mapped to the longest semiaxis of the ellipse,
of length max{|dj|}. Thus, we have that

kDk2 = max_{1≤j≤m} |dj|.
Taking the pth root of each side, and then the supremum over all x ∈ Cm with kxkp = 1, yields the upper bound

kDkp ≤ max_{1≤j≤m} |dj|.

To obtain kDkp ≥ max_{1≤j≤m} |dj|, we choose the standard basis vector x = ek, where k is such that |dk| is the largest diagonal entry in absolute value. Note that kekkp = 1 and

kDkp ≥ kDekkp / kekkp = kDekkp = kdk ekkp = |dk| = max_{1≤j≤m} |dj|.
Lemma 1.3.4. For any A ∈ Cm×n, the induced matrix 1-norm and ∞-norm are equal to the “maximum column sum” and “maximum row sum” of A respectively, i.e.

kAk1 = max_{1≤j≤n} kajk1   and   kAk∞ = max_{1≤i≤m} \sum_{j=1}^{n} |aij|,

where aj denotes the jth column of A.

Proof. For any x with kxk1 = 1, we have kAxk1 = k\sum_j xj ajk1 ≤ \sum_j |xj| kajk1 ≤ max_{1≤j≤n} kajk1, so kAk1 ≤ max_{1≤j≤n} kajk1; the bound kAk∞ ≤ max_{1≤i≤m} \sum_j |aij| is obtained similarly.

To obtain kAk1 ≥ max_{1≤j≤n} kajk1, we choose the standard basis vector x = ek, where k is such that kakk1 is maximal. Note that kekk1 = 1 and

kAk1 ≥ kAekk1 / kekk1 = kAekk1 = kakk1 = max_{1≤j≤n} kajk1.

To obtain kAk∞ ≥ max_{1≤i≤m} \sum_{j=1}^{n} |aij|, choose x = (1, . . . , 1)∗ ∈ Cn. Note that kxk∞ = 1 and

kAk∞ ≥ kAxk∞ / kxk∞ = kAxk∞ = max_{1≤i≤m} \sum_{j=1}^{n} |aij|.
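As a quick numerical sanity check of Lemma 1.3.4 (a sketch in Python/NumPy, not taken from the notes), the maximum column sum and maximum row sum agree with NumPy's built-in induced norms:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))

max_col_sum = max(np.sum(np.abs(A[:, j])) for j in range(A.shape[1]))
max_row_sum = max(np.sum(np.abs(A[i, :])) for i in range(A.shape[0]))

print(np.isclose(max_col_sum, np.linalg.norm(A, 1)))       # induced 1-norm
print(np.isclose(max_row_sum, np.linalg.norm(A, np.inf)))  # induced infinity-norm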
Theorem 1.3.5 (Young’s inequality). Let p, q > 1 be such that 1/p + 1/q = 1. For any two nonnegative real numbers a, b, we have

ab ≤ a^p/p + b^q/q.    (Young)
Proof. Observe that the inequality is trivial if either a or b is zero, so suppose both a and b are positive real numbers. Fix p, q > 1 such that 1/p + 1/q = 1; the constraint on p and q suggests a possible convexity argument. Indeed, using the fact that the exponential function is convex, we have that

ab = exp((1/p) ln a^p + (1/q) ln b^q) ≤ (1/p) exp(ln a^p) + (1/q) exp(ln b^q) = a^p/p + b^q/q.
Proof. Observe that the inequality is trivial if either u or v is the zero vector, so suppose u, v ≠ 0. Choose p, q > 1 such that 1/p + 1/q = 1. Young’s inequality (Young) yields

|aj∗ bj| = |aj||bj| ≤ |aj|^p/p + |bj|^q/q.

Summing over j = 1, . . . , n and using the normalisation (1.3.1), we have

\sum_{j=1}^{n} |aj∗ bj| ≤ 1/p + 1/q = 1.    (1.3.2)

Now, for any nonzero u = (u1, . . . , un)∗, v = (v1, . . . , vn)∗ ∈ Cn, define vectors a = (ã1, . . . , ãn)∗, b = (b̃1, . . . , b̃n)∗ by

ãj = uj / kukp,   b̃j = vj / kvkq   for all j = 1, . . . , n.

By construction, both a and b satisfy (1.3.1), and substituting ãj, b̃j into (1.3.2) yields

(1 / (kukp kvkq)) \sum_{j=1}^{n} |uj∗ vj| ≤ 1 =⇒ | \sum_{j=1}^{n} uj∗ vj | ≤ \sum_{j=1}^{n} |uj vj| ≤ kukp kvkq.

Since u, v were arbitrary nonzero vectors in Cn, this proves Hölder’s inequality.
Taking the supremum over all x ∈ Cn of norm 1, we get kAk2 ≤ kak2. To obtain kAk2 ≥ kak2, choose the particular x = a. Then

kAk2 ≥ kAak2 / kak2 = kak2^2 / kak2 = kak2.
Taking the supremum over all x ∈ Cn of norm 1, we get kAk2 ≤ kuk2 kvk2. To obtain kAk2 ≥ kuk2 kvk2, choose the particular x = v. Then

kAk2 ≥ kAvk2 / kvk2 = ku(v∗v)k2 / kvk2 = kuk2 kvk2^2 / kvk2 = kuk2 kvk2.
Lemma 1.3.9. Let A ∈ Cm×l , B ∈ Cl×n . The induced matrix norm of AB satisfies the
inequality
kABk ≤ kAkkBk
Consequently, the induced matrix norm of A satisfies
Then kABk = 2 but kAkkBk = 1. An important matrix norm which is not induced by any
vector norm is the Hilbert-Schmidt or Frobenius norm, defined by
kAkF = ( \sum_{i=1}^{m} \sum_{j=1}^{n} |aij|^2 )^{1/2},

or equivalently, in terms of the columns aj of A,

kAkF = ( \sum_{j=1}^{n} kajk2^2 )^{1/2}.
Viewing the matrix A ∈ Cm×n as a vector in Cmn , the Frobenius norm can be seen as the usual
l2 norm. Replacing l2 norm with lp norm gives rise to the Schatten p-norm.
Lemma 1.3.10. For any A ∈ Cm×l, B ∈ Cl×n, the Frobenius norm of AB satisfies

kABkF ≤ kAkF kBkF.

Proof. Let C = AB = (cij), where the entries of C are given by cij = ai∗ bj, with ai∗, bj the ith row of A and the jth column of B respectively. The Cauchy-Schwarz inequality gives |cij|^2 ≤ kaik2^2 kbjk2^2, and summing over i and j yields

kABkF^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} |cij|^2 ≤ ( \sum_{i=1}^{m} kaik2^2 ) ( \sum_{j=1}^{n} kbjk2^2 ) = kAkF^2 kBkF^2.
Theorem 1.3.11. For any A ∈ Cm×n and unitary matrices Q ∈ Cm×m, V ∈ Cn×n, we have

kQAV k2 = kAk2   and   kQAV kF = kAkF.

Proof. For any x ∈ Cn, let y = V x ∈ Cn. Then x = V ∗y and kxk2 = kV ∗yk2 = kyk2, since unitary transformations preserve k · k2. Consequently, kQAV xk2 = kAV xk2 = kAyk2, and taking the supremum over all x with kxk2 = 1 (equivalently, over all y with kyk2 = 1) gives kQAV k2 = kAk2. The Frobenius norm identity follows from applying the column-wise characterisation of k · kF to QA and the row-wise characterisation to AV.
1.4 Problems
1. Show that if a matrix A is both triangular and unitary, then it is diagonal.
Solution: The statement is trivial if A ∈ Cm×m is both upper and lower triangular, so
suppose A is upper-triangular. This implies that A∗ = A−1 is lower-triangular. The
result then follows if we show that A−1 is also upper-triangular. Since A−1 A = Im×m ,
we have that
[b1 | · · · | bm][a1 | · · · | am] = [e1 | · · · | em],

where aj, bj are the columns of A and A−1 respectively and ej are the standard basis vectors in Cm. Interpreting ej as a linear combination of the columns bi together with the assumption that A is upper-triangular, we obtain the relation

ej = \sum_{i=1}^{m} aij bi = \sum_{i=1}^{j} aij bi   for any j = 1, . . . , m,

that is,

e1 = a11 b1
e2 = a12 b1 + a22 b2
⋮
em = a1m b1 + a2m b2 + . . . + amm bm.

This implies that bij = 0 for all i > j, j = 1, . . . , m, i.e. A−1 is upper-triangular.
2. (a) Prove that the eigenvalues of a Hermitian matrix A ∈ Cm×m are real.

Solution: Let Ax = λx with x ≠ 0. Then

(λx)∗x = (Ax)∗x
λ̄(x∗x) = x∗A∗x = x∗Ax = λ(x∗x),   [A is Hermitian]

and since x∗x > 0, we conclude λ̄ = λ, i.e. λ ∈ R.
(b) Prove that if x and y are eigenvectors corresponding to distinct eigenvalues, then x
and y are orthogonal.
Ax = λx and Ay = µy.
Hence, the eigenvalues of a unitary matrix must lie on the unit circle in C.
Ax = (I + uv ∗ )x = 0 =⇒ uv ∗ x = −x. (1.4.1)
For any nonzero scalars β ∈ C, let x = βu. Substituting this into (1.4.1) yields
5. Let k · k denote any norm on Cm and also the induced matrix norm on Cm×m . Show that
ρ(A) ≤ kAk, where ρ(A) is the spectral radius of A, i.e., the largest absolute value |λ| of
an eigenvalue λ of A.
where we use the assumption that kAk is an induced matrix norm for the inequality.
Dividing each side of the inequality by kxk 6= 0 yields |λ| ≤ kAk. The desired
inequality follows from taking the supremum over all eigenvalues of A.
6. (a) Let N (x) := k · k be any vector norm on Cn (or Rn ). Show that N (x) is a continuous
function of the components x1 , x2 , . . . , xn of x.
and

| kxk − kyk | ≤ kx − yk ≤ \sum_{j=1}^{n} |xj − yj| kejk ≤ kx − yk∞ ( \sum_{j=1}^{n} kejk ).
(b) Prove that if W ∈ Cm×m is an arbitrary nonsingular matrix, and k · k is any norm
on Cm , then kxkW = kW xk is a norm on Cm .
kIk = sup_{x∈Cm, x≠0} kIxk / kxk = sup_{x∈Cm, x≠0} kxk / kxk = 1.
kIn×nkF = ( \sum_{i=1}^{n} \sum_{j=1}^{n} |aij|^2 )^{1/2} = ( \sum_{j=1}^{n} |1|^2 )^{1/2} = √n.
(c) Show that Frobenius norm is not induced by any vector norm.
Chapter 2

Matrix Decomposition and Least Squares Problems

Matrix decomposition has been of fundamental importance in the modern sciences. In the context of numerical linear algebra, matrix decomposition serves the purpose of rephrasing a task that may be relatively difficult to solve in its original form, for instance solving linear systems, as a series of easier subproblems. In the context of applied statistics, matrix decomposition offers a way of obtaining a low-rank approximation to a large “data” matrix containing numerical observations; this is crucial in understanding the structure of the matrix, in particular exploring and identifying relationships within the data. In this chapter, we will study the singular value decomposition (SVD) and the QR factorisation, and demonstrate how to solve linear least squares problems using these decompositions.
Σ̂ = diag(σ1, σ2, . . . , σn) ∈ Rn×n,
V = [v1 | v2 | · · · | vn] ∈ Cn×n.
(a) {u1, u2, . . . , un} and {v1, v2, . . . , vn} are the left and right singular vectors of A; the columns of Û are orthonormal, V is unitary, and Û∗Û = V ∗V = In;

(b) {σj}_{j=1}^{n} are the singular values of A, with σ1 ≥ σ2 ≥ . . . ≥ σn ≥ 0. These are the lengths of the n principal semiaxes of the hyperellipse in the case of a real matrix A.

(c) These singular vectors and singular values satisfy the relation

Avj = σj uj,   j = 1, . . . , n.    (2.1.1)
Example 2.1.1. Consider any matrix A ∈ C2×2 . It is clear that H = A∗ A ∈ C2×2 is Hermitian.
Moreover, for any x ∈ C2 we have
Hence, AV = U Σ =⇒ A = U ΣV ∗ .
In terms of (v1, v2) coordinates, the vector (cos θ, sin θ) gets mapped to (z1, z2) = (σ1 cos θ, σ2 sin θ) in (u1, u2) coordinates. Moreover,

(z1/σ1)^2 + (z2/σ2)^2 = cos^2 θ + sin^2 θ = 1,
i.e. S is transformed into an ellipse. We claim that kAk2 = σ1. On one hand, using the orthonormality of {u1, u2}, for any unit vector x = cos θ v1 + sin θ v2 we obtain

kAxk2^2 = kσ1 cos θ u1 + σ2 sin θ u2k2^2 = σ1^2 cos^2 θ + σ2^2 sin^2 θ ≤ σ1^2.

On the other hand, choosing x = v1 gives kAv1k2^2 = kσ1 u1k2^2 = σ1^2.
We see that the image of unit circle under A is an ellipse in the 2-dimensional subspace of
Rm defined by span{u1 , u2 }. If A ∈ Rm×n is of full rank with m ≥ n, then the image of the
unit sphere in Rn under A is a hyperellipsoid in Rm .
Σ = [ Σ̂ stacked on top of (m − n) zero rows 0∗ ] ∈ Rm×n,

V = [v1 | v2 | · · · | vn] ∈ Cn×n.
Note that in full SVD form, Σ has the same size as A, and U, V are unitary matrices.
Proof. The statement is trivial if A is the zero matrix, so assume A ≠ 0. Let σ1 = kAk2 > 0. There exists v1 ∈ Cn such that kv1k2 = 1 and kAv1k2 = kAk2 = σ1; such a v1 exists since the induced matrix norm is by definition a maximisation of a continuous functional (in this case the norm) over a compact nonempty subset of Cn, namely the unit sphere. Define u1 = Av1/σ1 ∈ Cm; clearly u1 ≠ 0 and ku1k2 = 1 by construction.
Extend u1 and v1 to unitary matrices U1 = [u1, Û1] ∈ Cm×m and V1 = [v1, V̂1] ∈ Cn×n. We then have (writing 2 × 2 block matrices row by row, rows separated by semicolons):

A1 := U1∗AV1 = [u1∗; Û1∗] A [v1, V̂1] = [u1∗Av1, u1∗AV̂1; Û1∗Av1, Û1∗AV̂1] = [σ1 u1∗u1, w∗; σ1 Û1∗u1, Â] = [σ1, w∗; 0, Â],

where w∗ := u1∗AV̂1 and Â := Û1∗AV̂1, using u1∗u1 = 1 and Û1∗u1 = 0. One can show that w = 0, and by the induction hypothesis Â = U2Σ2V2∗ has an SVD, so that

U1∗AV1 = [σ1, 0∗; 0, Â] = [σ1, 0∗; 0, U2Σ2V2∗] = [1, 0∗; 0, U2] [σ1, 0∗; 0, Σ2] [1, 0∗; 0, V2∗].
Setting

U = U1 [1, 0∗; 0, U2] = [u1, Û1] [1, 0∗; 0, U2] = [u1, Û1U2] ∈ Cm×m,

V = V1 [1, 0∗; 0, V2] = [v1, V̂1] [1, 0∗; 0, V2] = [v1, V̂1V2] ∈ Cn×n.

Since a product of unitary matrices is unitary, we only need to show that the vector u1 is orthogonal to each column u2, . . . , um of the matrix Û1U2, but this must be true since {u1, u2, . . . , um} is an orthonormal basis by construction. A similar argument shows that V is also unitary.
Remark 2.1.3. In the case m ≤ n, we simply consider the SVD of the conjugate transpose A∗. If A is singular with rank r < min{m, n}, the full SVD is still appropriate. What changes is that only r of the left singular vectors uj (rather than n) are determined by the geometry of the hyperellipse. To construct the unitary matrices U and V, we introduce an additional (m − r) and (n − r) arbitrary orthonormal columns respectively.
It is well known that a nondefective square matrix can be expressed as a diagonal matrix Λ of eigenvalues, if the range and domain are represented in a basis of eigenvectors. The SVD generalises this fact to any matrix A ∈ Cm×n: A reduces to the diagonal matrix Σ when the range is expressed in the basis of columns of U and the domain
is expressed in the basis of columns of V . More precisely, any b ∈ Cm can be expanded in the
basis of columns {u1 , . . . , um } of U and any x ∈ Cn can be expanded in the basis of columns
{v1 , . . . , vn } of V . The coordinate vectors for these expansions are
b = Ub′ ⇐⇒ b′ = U∗b   and   x = Vx′ ⇐⇒ x′ = V∗x.

Hence,

b = Ax ⇐⇒ U∗b = U∗Ax = U∗UΣV∗x = ΣV∗x ⇐⇒ b′ = Σx′.
There are fundamental differences between the SVD and the eigenvalue decomposition.
(a) SVD uses two different bases (the sets of left and right singular vectors), whereas the
eigenvalue decomposition uses just one (the eigenvectors).
(b) SVD uses orthonormal bases, whereas the eigenvalue decomposition uses a basis that
generally is not orthogonal.
(c) Not all matrices (even square ones) have an eigenvalue decomposition, but all matrices
(even rectangular ones) have a SVD.
(d) In practice, eigenvalues tend to be relevant to problems involving the behaviour of iter-
ated forms of A, such as matrix powers An or matrix exponentials etA , whereas singular
vectors tend to be relevant to problems involving the behaviour of A itself, or its inverse.
Theorem 2.1.4. The rank of A is r, the number of nonzero singular values. Moreover,

R(A) = span{u1, . . . , ur}   and   N(A) = span{vr+1, . . . , vn}.

Proof. Since U, V are unitary, they have full rank. Thus rank(A) = rank(Σ), which is the number of nonzero diagonal entries of Σ. For any x ∈ Cn, we have Ax = UΣV∗x = UΣy, where y = V∗x ranges over all of Cn as x does. The claim about R(A) then follows from the fact that R(Σ) = span{e1, . . . , er}. To find the nullspace of A, expanding Az = 0 yields

Az = UΣV∗z = 0 =⇒ ΣV∗z = 0   since U is of full rank,

from which we deduce that N(A) = span{vr+1, . . . , vn}, since N(Σ) = span{er+1, . . . , en}.
Theorem 2.1.5. kAk2 = σ1 and kAkF = √(σ1^2 + σ2^2 + . . . + σr^2).

Proof. Since k · k2 and k · kF are both invariant under unitary transformations, we have that

kAk2 = kUΣV∗k2 = kΣk2 = max_{1≤j≤p} |σj| = σ1,

and

kAkF = kUΣV∗kF = kΣkF = √(σ1^2 + . . . + σr^2).
Theorem 2.1.6. The nonzero singular values of A are the square roots of the nonzero eigen-
values of A∗ A or AA∗ . (These matrices have the same nonzero eigenvalues.)
Proof. Observe that A∗ A ∈ Cn×n is similar to Σ∗ Σ since
A∗ A = (U ΣV ∗ )∗ (U ΣV ∗ ) = V ΣT U ∗ U ΣV ∗ = V (ΣT Σ)V ∗ ,
and hence has the same n eigenvalues. Σ∗ Σ is a diagonal matrix with p eigenvalues σ12 , . . . , σp2
and n−p additional zero eigenvalues if n > p. A similar calculation applies to the m eigenvalues
of AA∗ .
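Theorems 2.1.5 and 2.1.6 can be verified numerically; a minimal sketch in Python/NumPy (random test matrix, not from the notes):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

sigma = np.linalg.svd(A, compute_uv=False)    # singular values, nonincreasing
eigs = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues of A*A, nonincreasing

print(np.allclose(sigma, np.sqrt(np.maximum(eigs, 0.0))))                # Theorem 2.1.6
print(np.isclose(np.linalg.norm(A, 2), sigma[0]))                        # kAk2 = sigma1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2))))   # Theorem 2.1.5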
Theorem 2.1.7. If A = A∗ , then the singular values of A are the absolute values of the
eigenvalues of A.
Proof. Since A is Hermitian, it has an eigendecomposition of the form A = QΛQ∗ for some
unitary matrix Q and real diagonal matrix Λ consisting of eigenvalues λj of A. We rewrite it
as
A = QΛQ∗ = Q|Λ|sign(Λ)Q∗ ,
where |Λ| and sign(Λ) denote the diagonal matrices whose entries are |λj | and sign(λj ) respec-
tively. Since sign(Λ)Q∗ is unitary whenever Q is unitary, the expression above is an SVD of A,
with the singular values equal to the diagonal entries of |Λ|. These can be put into nonincreas-
ing order by inserting suitable permutation matrices as factors in Q and sign(Λ)Q∗ if required.
Theorem 2.1.8. For A ∈ Cm×m, |det(A)| = \prod_{j=1}^{m} σj.

Proof. Using the fact that unitary matrices have determinant of absolute value 1, we obtain

|det(A)| = |det(U)| |det(Σ)| |det(V∗)| = |det(Σ)| = \prod_{j=1}^{m} σj.
Σj = diag(0, . . . , 0, σj , 0, . . . , 0).
There are many ways to decompose A into rank-one matrices, but (2.1.2) has a deeper property: its νth partial sum captures as much of the energy of A as possible, in the sense of either the 2-norm or the Frobenius norm.
and since the sum of the dimensions of W and Z exceeds n, there must be a nonzero vector in
W ∩ Z and we arrive at a contradiction.
The MATLAB command for computing the reduced and full SVD is [U,S,V] = svd(A,0)
and [U,S,V] = svd(A) respectively.
2.2 Projectors
Projection is an important concept in designing algorithms for certain linear algebra problems.
Geometrically, projection is a generalisation of graphical projection. In functional analysis,
a projection P is a bounded linear operator such that P 2 = P ; in finite-dimensional vector
space, P is a square matrix in Cn×n and it is said to be idempotent. Observe that if y ∈ R(P ),
then y = P x for some x ∈ Cn and
P y = P P x = P 2 x = P x = y.
P (P y − y) = P 2 y − P y = P y − P y = 0.
[Figure: an oblique projection, showing a vector y, its image P y in R(P ), and the difference P y − y.]
v ∗ (P x − x) = v ∗ (uv ∗ x) − v ∗ x = v ∗ x(v ∗ u − 1) = 0.
(b) P Q = P (I − P ) = 0.
Q2 = (I − P )2 = I 2 − 2P + P 2 = I − 2P + P = I − P = Q.
P x = 0 =⇒ Qx = x − P x = x ∈ R(Q) =⇒ N (P ) ⊂ R(Q).
Suppose y ∈ R(Q),
y = Qy = y − P y =⇒ P y = 0 =⇒ y ∈ N (P ) =⇒ R(Q) ⊂ N (P ).
Combining these two set inequalities show the first equation in (c). The second equation in (c)
now follows from applying the previous result to I − P :
Theorem 2.2.2 actually shows that a projector decomposes Cn into subspaces R(P ) and
N (P ) such that Cn = R(P ) ⊕ N (P ). Such a pair are said to be complementary subspaces.
Indeed, suppose x = P x + z, then
v = v − P v = (I − P )v = 0,
x∗1 x2 = (P x)∗ (I − P )y = x∗ P ∗ (I − P )y 6= 0.
[Figure: an orthogonal projection, where P y − y is orthogonal to R(P ).]
Proof. If P = P ∗ , then
P ∗ (I − P ) = P (I − P ) = P − P 2 = 0,
and it follows from the algebraic definition that P is orthogonal. Conversely, suppose P is an
orthogonal projector, then
P ∗ (I − P ) = 0, or P ∗ = P ∗ P.
Consider the minimal rank SVD of P = Ur Σr Vr∗ , where r ≤ n is the rank of P , Ur∗ Ur = Ir =
Vr∗ Vr and Σr is nonsingular. Substituting the SVD of P into P ∗ = P ∗ P yields
we obtain

PQ = [q1, . . . , qr, 0, . . . , 0] =⇒ Q∗PQ = [q1∗; . . . ; qn∗] [q1, . . . , qr, 0, . . . , 0] = [Ir, 0; 0, 0_{n−r}] = Σ.
Consequently, the singular values of an orthogonal projector consist of 1’s and 0’s. Because some singular values are zero, it is advantageous to drop the columns {qr+1, . . . , qn} of Q, which leads to

P = Q̂Q̂∗,   where Q̂ = [q1, . . . , qr] ∈ Cn×r.
Remark 2.2.4. Orthogonal projectors need not be given in the form Q̂Q̂∗. We will show in Section 2.2.4 that P = A(A∗A)−1A∗ is an orthogonal projector onto R(A) for any full-rank A ∈ Cm×n.

For any matrix Q̂ = [q1, . . . , qr] with orthonormal columns, the matrix P = Q̂Q̂∗ is an orthogonal projector onto R(Q̂), regardless of how {q1, . . . , qr} was obtained. Note that its complementary projector I − Q̂Q̂∗ is an orthogonal projector onto R(Q̂)⊥.
In the case r = 1, we have the rank-one orthogonal projector that isolates the component in a single direction. More precisely, for any given nonzero q ∈ Cn, the matrix

Pq = qq∗ / (q∗q)

is a rank-one orthogonal projector onto span{q}, and

P⊥q = I − qq∗ / (q∗q)

is the complementary orthogonal projector onto the space orthogonal to q.
A := [a1, . . . , ar] ∈ Cn×r,

the projection Pv of a vector v onto R(A) satisfies Pv = Ax, with Ax − v orthogonal to every column of A, or

A∗(Ax − v) = 0 =⇒ A∗Ax = A∗v.

Since A is of full rank, A∗A is also of full rank (hence invertible), and x is uniquely given by x = (A∗A)−1A∗v, so that P = A(A∗A)−1A∗. Note that this is a generalisation of the rank-one orthogonal projector: if A has orthonormal columns, then we recover P = AA∗ as before.
2.3 QR Factorisation
We now study the second matrix factorisation in the course: QR factorisation. Assume for
now that A ∈ Cm×n , m ≥ n is of full rank, but we will see later that this is not necessary.
The idea of QR factorisation is to construct a sequence of orthonormal vectors {q1, q2, . . .} that spans the nested successive spaces span{a1, a2, . . .}, i.e.

span{q1, . . . , qj} = span{a1, . . . , aj},   j = 1, 2, . . . .

In order for this to hold, the vector aj must be a linear combination of the vectors {q1, . . . , qj}. Writing this out,

a1 = r11 q1    (2.3.1a)
a2 = r12 q1 + r22 q2    (2.3.1b)
⋮    (2.3.1c)
an = r1n q1 + r2n q2 + . . . + rnn qn.    (2.3.1d)
In matrix form,

A = [a1 | a2 | · · · | an] = [q1 | q2 | · · · | qn] R̂ = Q̂R̂,

where Q̂ ∈ Cm×n has orthonormal columns and R̂ = (rij) ∈ Cn×n is upper-triangular (rij = 0 for i > j). Such a factorisation is called a reduced QR factorisation of A.
One can define a full QR factorisation in a similar fashion to how we defined the full SVD, by adding m − n orthonormal columns (chosen arbitrarily) to Q̂ so that it becomes a unitary matrix Q ∈ Cm×m; in doing so, m − n rows of zeros need to be added to R̂, and it becomes an upper-triangular matrix R ∈ Cm×n. In the full QR factorisation, the columns {qn+1, . . . , qm} are orthogonal to R(A) by construction, and they constitute an orthonormal basis for R(A)⊥ = N(A∗) if A is of full rank n.
The sign of rjj is not determined and if desired we may choose rjj > 0 so that R̂ has positive
diagonal entries. Gram-Schmidt iteration is numerically unstable due to rounding errors on
a computer. To emphasise the instability, we refer to this algorithm as the classical Gram-
Schmidt iteration.
Theorem 2.3.1. Every matrix A ∈ Cm×n , m ≥ n has a full QR factorisation, hence also a
reduced QR factorisation.
Proof. The case where A has full rank follows easily from the Gram-Schmidt orthogonalisation,
so suppose A does not have full rank. At one or more steps j, it will happen that vj = 0; at this
point, simply pick qj arbitrarily to be any unit vector orthogonal to {q1 , . . . , qj−1 }, and then
continue the Gram-Schmidt orthogonalisation process. Previous step gives us a reduced QR
factorisation of A. One can construct a full QR factorisation by introducing arbitrary m − n
orthonormal vectors in the same style as in Gram-Schmidt process.
Theorem 2.3.2. Every matrix A ∈ Cm×n, m ≥ n, of full rank has a unique reduced QR factorisation A = Q̂R̂ with rjj > 0 for each j = 1, . . . , n.
Proof. The Gram-Schmidt orthogonalisation determines rij and qj fully, except for the sign of
rjj , but this is now fixed by the condition rjj > 0.
for j = 1 to n
vj = aj
for i = 1 to j − 1
rij = qi∗ aj
vj = vj − rij qi
end
rjj = kvj k2
qj = vj /rjj
end
If A ∈ Cm×m is nonsingular with QR factorisation A = QR, then Ax = b is equivalent to

QRx = b,   or   Rx = Q∗b.

The linear system Rx = Q∗b can be solved easily using backward substitution, since R is upper-triangular. This suggests the following method for solving Ax = b:

1. Compute a QR factorisation A = QR.
2. Compute y = Q∗b.
3. Solve Rx = y for x ∈ Cm.
In the classical Gram-Schmidt iteration, each vj is obtained from aj by a single orthogonal projection Pj:

vj = Pj aj,   j = 1, . . . , n.

In contrast, the modified Gram-Schmidt iteration computes the same result by a sequence of (j − 1) projections of rank (m − 1). Let P⊥q = I − qq∗ be the rank (m − 1) orthogonal projector onto the space orthogonal to the nonzero unit vector q ∈ Cm. It can be shown that

Pj = P⊥q_{j−1} · · · P⊥q2 P⊥q1.

The operations are equivalent, but we decompose the projection to obtain numerical stability. The modified Gram-Schmidt algorithm computes vj as follows (in order):

vj^(1) = P1 aj = aj,
vj^(2) = P⊥q1 vj^(1) = (I − q1q1∗) vj^(1),
⋮
vj^(j) = P⊥q_{j−1} vj^(j−1) = (I − q_{j−1}q_{j−1}∗) vj^(j−1).
for j = 1 to n
vj = aj
for i = 1 to j − 1
rij = qi∗ vj (Step by step projection)
vj = vj − rij qi
end
rjj = kvj k2
qj = vj /rjj
end
for i = 1 to n
vi = ai
end
for i = 1 to n
rii = kvi k2
qi = vi /rii
for j = i + 1 to n
rij = qi∗ vj (Compute Pqi as soon as qi is found
vj = vj − rij qi and then apply to all vi+1 , . . . , vn )
end
end
Consider the matrix with columns a1 = (1, ε, 0, 0)^T, a2 = (1, 0, ε, 0)^T, a3 = (1, 0, 0, ε)^T, and make the approximation ε^2 ≈ 0 for 0 < ε ≪ 1, which accounts for rounding error. Applying the classical Gram-Schmidt gives

v1 = a1,
r11 = ka1k2 = √(1 + ε^2) ≈ 1,
q1 = v1/r11 ≈ (1, ε, 0, 0)^T,

v2 = a2,
r12 = q1^T a2 = 1,
v2 = v2 − r12 q1 = (0, −ε, ε, 0)^T,
r22 = kv2k2 = √2 ε,
q2 = v2/r22 = (1/√2)(0, −1, 1, 0)^T,

v3 = a3,
r13 = q1^T a3 = 1,
v3 = v3 − r13 q1 = (0, −ε, 0, ε)^T,
r23 = q2^T a3 = 0,
v3 = v3 − r23 q2 = (0, −ε, 0, ε)^T,
r33 = kv3k2 = √2 ε,
q3 = v3/r33 = (1/√2)(0, −1, 0, 1)^T.
However, q2^T q3 = 1/2 ≠ 0. We see that a small perturbation results in instability, in the sense that we lose orthogonality due to round-off errors. On the other hand, applying the modified Gram-Schmidt iteration, it is not difficult to see that q1, q2 remain unchanged and q3 is obtained as
v3 = a3,
r13 = q1^T v3 = 1,
v3 = v3 − r13 q1 = (0, −ε, 0, ε)^T,
r23 = q2^T v3 = ε/√2,
v3 = v3 − r23 q2 = (0, −ε/2, −ε/2, ε)^T,
r33 = kv3k2 = (√6/2) ε,
q3 = v3/r33 = (1/√6)(0, −1, −1, 2)^T.
We recover q2T q3 = 0 in this case.
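The loss of orthogonality above is reproducible in double precision. The sketch below (Python/NumPy, an illustration rather than the notes' own code) implements both iterations following the pseudocode of the previous pages and compares kQ̂∗Q̂ − Ik on the matrix of this example with ε = 10^−8:

import numpy as np

def classical_gs(A):
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]        # project against the original column a_j
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

def modified_gs(A):
    m, n = A.shape
    Q, R = np.zeros((m, n)), np.zeros((n, n))
    V = A.astype(float).copy()
    for i in range(n):
        R[i, i] = np.linalg.norm(V[:, i])
        Q[:, i] = V[:, i] / R[i, i]
        for j in range(i + 1, n):
            R[i, j] = Q[:, i] @ V[:, j]        # project against the updated v_j
            V[:, j] -= R[i, j] * Q[:, i]
    return Q, R

eps = 1e-8
A = np.array([[1.0, 1.0, 1.0], [eps, 0.0, 0.0], [0.0, eps, 0.0], [0.0, 0.0, eps]])
for name, gs in [("classical", classical_gs), ("modified", modified_gs)]:
    Q, _ = gs(A)
    print(name, np.linalg.norm(Q.T @ Q - np.eye(3)))   # classical loses orthogonality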
Given A ∈ Cm×n , m ≥ n, b ∈ Cm ,
find x ∈ Cn such that kAx − bk2 is minimised. (2.4.1)
This is called the general (linear) least squares problem. The 2-norm is chosen due to
certain geometric and statistical reasons, but the more important reason is it leads to simple
algorithms since the derivative of a quadratic function, which must be set to zero for minimi-
sation, is linear. Geometrically, (2.4.1) means that we want to find a vector x ∈ Cn such that
the vector Ax ∈ Cm is the closest point in R(A) to b ∈ Cm .
Example 2.4.1. For a curve fitting problem, given a set of data (y1, b1), . . . , (ym, bm), we want to find a polynomial p(y) such that p(yj) = bj for every j = 1, . . . , m. If the points {y1, . . . , ym} ⊂ C are distinct, it can be shown that there exists a unique polynomial interpolant to these data of degree at most m − 1. However, the fit is often bad, in the sense that it tends to get worse rather than better as more data are utilised. Even if the fit is good, the interpolation process may be sensitive to perturbations of the data. One way to avoid such complications is to choose a nonuniform set of interpolation points, but in applications this will not always be possible.
Surprisingly, one can do better by reducing the degree of the polynomial. For some n < m, consider a degree n − 1 polynomial of the form

p(y) = c0 + c1 y + . . . + c_{n−1} y^{n−1}.

Such a polynomial is a least squares fit to the data if it minimises the residual vector in the 2-norm, that is, it solves

min over polynomials p of degree n−1 of ( \sum_{i=1}^{m} |p(yi) − bi|^2 )^{1/2}.
Theorem 2.4.2. Let A ∈ Cm×n (m ≥ n) and b ∈ Cm. A vector x ∈ Cn minimises kAx − bk2 if and only if the residual r = b − Ax satisfies A∗r = 0, or equivalently,

A∗Ax = A∗b,    (2.4.2)

or again equivalently,

P b = Ax,

where P ∈ Cm×m is the orthogonal projector onto R(A). The n × n system of equations (2.4.2), known as the normal equations, is nonsingular if and only if A has full rank. Consequently, the solution x ∈ Cn is unique if and only if A has full rank.
Proof. The equivalence of A∗r = 0 and (2.4.2) follows from the definition of r. The equivalence of A∗r = 0 and P b = Ax follows from the properties of orthogonal projectors; see Subsection 2.2.4. To prove that y = P b is the unique point in R(A) that minimises kb − yk2, suppose z ≠ y is another point in R(A). Since z − y ⊥ b − y, the Pythagorean theorem gives

kb − zk2^2 = kb − yk2^2 + ky − zk2^2 > kb − yk2^2.

Finally, suppose A∗A is nonsingular. Then

Ax = 0 =⇒ (A∗A)x = A∗0 = 0 =⇒ x = 0,

and so A has full rank. Conversely, suppose A has full rank and A∗Ax = 0 for some x ∈ Cn. Then

x∗A∗Ax = x∗0 = 0 =⇒ (Ax)∗Ax = kAxk2^2 = 0 =⇒ Ax = 0,

and since A has full rank this forces x = 0, i.e. A∗A is nonsingular.
If A is of full rank, it follows from Theorem 2.4.2 that the unique solution to the least squares problem is given by

x = (A∗A)−1A∗b = A+b,

where the matrix A+ = (A∗A)−1A∗ ∈ Cn×m is called the pseudoinverse of A. The full-rank linear least squares problem (2.4.1) can then be solved by computing one or both of the vectors

x = A+b,   y = P b.
The standard method of solving such a system is by Cholesky factorisation, which constructs
a factorisation A∗ A = R∗ R, where R ∈ Cn×n is upper-triangular. Consequently, (A∗ A)x = A∗ b
becomes R∗ Rx = A∗ b.
The steps that dominate the work for this computation are the first two. Exploiting the
symmetry of the problem, the computation of A∗ A and the Cholesky factorisation require only
mn2 flops and n3 /3 flops respectively. Thus the total operation count is ∼ mn2 + n3 /3 flops.
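A minimal sketch of this approach (Python with NumPy/SciPy, which the notes do not prescribe; the test data are arbitrary):

import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

G = A.T @ A                                        # normal equations matrix A*A
R = cholesky(G, lower=False)                       # upper-triangular R with R^T R = G
y = solve_triangular(R.T, A.T @ b, lower=True)     # forward substitution:  R^T y = A^T b
x = solve_triangular(R, y, lower=False)            # backward substitution: R x = y

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True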
2.4.3 QR Factorisation
Given a reduced QR factorisation A = Q̂R̂, the orthogonal projector P ∈ Cm×m onto R(A) can be written as P = Q̂Q̂∗. Since P b ∈ R(A), the system Ax = P b has an exact solution, and

Q̂R̂x = Q̂Q̂∗b =⇒ R̂x = Q̂∗b.

This suggests the following algorithm:

1. Compute the reduced QR factorisation A = Q̂R̂.
2. Form the vector Q̂∗b ∈ Cn.
3. Solve the upper-triangular system R̂x = Q̂∗b for x ∈ Cn.

Note that the same reduction can also be derived from the normal equations (2.4.2):

A∗Ax = A∗b =⇒ (R̂∗Q̂∗)(Q̂R̂)x = R̂∗Q̂∗b =⇒ R̂x = Q̂∗b.
The operation count for this computation is dominated by the cost of the QR factorisation,
which is ∼ 2mn2 − 2n3 /3 flops if Householder reflections are used.
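The corresponding sketch with a reduced QR factorisation (again Python/NumPy, for illustration only):

import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

Q_hat, R_hat = np.linalg.qr(A, mode='reduced')          # A = Q_hat R_hat
x = solve_triangular(R_hat, Q_hat.T @ b, lower=False)   # solve R_hat x = Q_hat* b

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True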
2.4.4 SVD

Given a reduced SVD A = ÛΣ̂V∗, it follows from Theorem 2.1.4 that the orthogonal projector P ∈ Cm×m onto R(A) can be written as P = ÛÛ∗. The system Ax = P b reduces to

ÛΣ̂V∗x = ÛÛ∗b =⇒ Σ̂V∗x = Û∗b.

This suggests the following algorithm:

1. Compute the reduced SVD A = ÛΣ̂V∗.
2. Form the vector Û∗b ∈ Cn.
3. Solve the diagonal system Σ̂w = Û∗b for w ∈ Cn.
4. Set x = V w ∈ Cn.

Note that the same reduction can also be derived from the normal equations (2.4.2):

(V Σ̂Û∗)(ÛΣ̂V∗)x = V Σ̂Û∗b =⇒ Σ̂V∗x = Û∗b.
The operation count for this computation is dominated by the cost of the SVD. For m ≫ n this cost is approximately the same as for QR factorisation, but for m ≈ n the SVD is more expensive. A typical estimate is ∼ 2mn^2 + 11n^3 flops.

Algorithm 2.4 may be the best if we only care about computational speed. However, solving the normal equations is not always numerically stable, and so Algorithm 2.5 is the “modern standard” method for least squares problems. If A is close to rank-deficient, it turns out that Algorithm 2.5 has less-than-ideal stability properties and Algorithm 2.6 is chosen instead.
2.5 Problems
1. Two matrices A, B ∈ Cm×m are unitary equivalent if A = QBQ∗ for some unitary
Q ∈ Cm×m . Is it true or false that A and B are unitarily equivalent if and only if they
have the same singular values?
Solution: Observe that for a square matrix, the reduced SVD and full SVD has the
same structure. The “only if” statement is true. Suppose A = QBQ∗ for some
unitary matrix Q ∈ Cm×m and let B = UB ΣB VB∗ be the SVD of B. Then
with singular value σ1A = σ2A = 1. Since BB ∗ = I2 , B is unitary and it has a SVD of
the form
0 1 1 0 1 0
B = BI2 I2 = ,
1 0 0 1 0 1
with singular values σ1B = σ2B = 1. Suppose A and B are unitary equivalent, then
A = Q∗ AQ = Q∗ (QBQ∗ Q) = B,
2. Using the SVD, prove that any matrix in C m×n is the limit of a sequence of matrices of
full rank. In other words, prove that the set of full-rank matrices is a dense subset of
Cm×n . Use the 2-norm for your proof. (The norm doesn’t matter, since all norms on a
finite-dimensional space are equivalent.)
Solution: We may assume WLOG that m ≥ n. We want to show that for any
matrix A ∈ Cm×n , there exists a sequence of full rank matrices (Ak ) ∈ Cm×n such
that
kAk − Ak2 −→ 0 as k −→ ∞.
The result is trivial if A has full rank, since we may choose Ak = A for each k ≥ 1, so
suppose A is rank-deficient. Let r < min{m, n} = n be the rank of A, which is also the number of nonzero singular values of A. Consider the reduced SVD A = ÛΣ̂V∗, where V ∈ Cn×n is unitary, Û ∈ Cm×n has orthonormal columns and

Σ̂ = diag(σ1, . . . , σr, 0, . . . , 0) ∈ Rn×n.

The fact that the 2-norm is invariant under unitary transformations suggests perturbing Σ̂ in such a way that it has full rank. More precisely, consider Ak = ÛΣ̂kV∗, where

Σ̂k = Σ̂ + (1/k) In.
Ak has full rank by construction, since it has n nonzero singular values, and

kAk − Ak2 = kÛ(Σ̂k − Σ̂)V∗k2 = k(1/k) Ink2 = 1/k −→ 0   as k −→ ∞.
(a) Determine, on paper, a real SVD of A in the form A = U ΣV T . The SVD is not
unique, so find the one that has the minimal number of minus signs in U and V .
Solution: Since A is nonsingular, Theorem 2.1.6 says that the singular values of A are the square roots of the eigenvalues of A^T A. Computing A^T A gives

A^T A = [−2, −10; 11, 5] [−2, 11; −10, 5] = [104, −72; −72, 146],

whose eigenvalues are λ1 = 200 and λ2 = 50. Denote U = [u1 | u2] ∈ R2×2 and V = [v1 | v2] ∈ R2×2, where u1, u2 and v1, v2 are the column vectors of U and V respectively in the SVD A = UΣV^T. Observe that v1, v2 are normalised eigenvectors of A^T A corresponding to the eigenvalues λ1, λ2 respectively, since A^T A = V Σ^2 V^T. It can be shown that

V = [v1 | v2] = [−3/5, 4/5; 4/5, 3/5].
(b) List the singular values, left singular vectors, and right singular vectors of A. Draw
a careful, labeled picture of the unit ball in R2 and its image under A, together with
the singular vectors, with the coordinates of their vertices marked.
Solution: The singular values of A are σ1 = 10√2, σ2 = 5√2. The right and left singular vectors of A are

v1 = (−3/5, 4/5)^T,   v2 = (4/5, 3/5)^T,   u1 = (1/√2, 1/√2)^T,   u2 = (1/√2, −1/√2)^T.
(c) What are the 1−, 2−, ∞-, and Frobenius norms of A?
yields

λ = (3 ± √(9 − 4(100)))/2 = 3/2 ± (√391/2) i.
Solution:

λ1 λ2 = (3/2 + (√391/2) i)(3/2 − (√391/2) i) = 9/4 + 391/4 = 400/4 = 100 = det(A),

σ1 σ2 = (10√2)(5√2) = 50 · 2 = 100 = |det(A)|.
(g) What is the area of the ellipsoid onto which A maps the unit ball of R2 ?
Solution: The ellipse onto which A maps the unit ball of R2 has major radius
a = σ1 and minor radius b = σ2 . Thus, its area is πab = πσ1 σ2 = 100π.
4. Let P ∈ Cm×m be a nonzero projector. Show that kP k2 ≥ 1, with equality if and only if
P is an orthogonal projector.
kP k2 ≤ kP k22 =⇒ kP k2 ≥ 1 since kP k2 6= 0.
R(P ) is not orthogonal to N (P ) = R(I − P ).
5. Let A = [1, 0, −1; 1, 2, 1; 1, 1, −3; 0, 1, 1] ∈ R4×3 and b = (1, 1, 1, 1)^T.
(a) Determine the reduced QR factorisation of A.
(b) Use the QR factors from part (a) to determine the least square solution to Ax =
b.
x3 = 0,
x2 = 1/3 − x3 = 1/3,
x1 = 1 − x2 + x3 = 1 − 1/3 = 2/3.

Hence, x = (x1, x2, x3)∗ = (2/3, 1/3, 0)∗.
Conversely, suppose all the diagonal entries of R̂ are nonzero, and suppose that

β1 a1 + . . . + βn an = 0.    (2.5.2)

Using aj = \sum_{k=1}^{j} rkj qk, equation (2.5.2) can be rewritten as

γ1 q1 + . . . + γn qn = 0,

where

γj = \sum_{k=j}^{n} βk rjk,   j = 1, . . . , n.    (2.5.3)
Next,
(b) Suppose R̂ has k nonzero diagonal entries for some k with 0 ≤ k < n. What does
this imply about the rank of A? Exactly k? At least k? At most k? Give a precise
answer, and prove it.
Solution: Suppose R̂ has k nonzero diagonal entries for some k with 0 ≤ k < n,
i.e. R̂ has at least one zero diagonal entry. Let aj be the jth column of A, and
Aj ∈ Cm×j be the matrix defined by Aj = [a1 |a2 | . . . |aj ].
• First, rank(A1) = 1 if r11 ≠ 0, and rank(A1) = 0 if r11 = 0.
• For j = 2, . . . , n, regardless of the value of rjj, either

aj ∉ span{a1, . . . , aj−1} =⇒ rank(Aj) = rank(Aj−1) + 1,    (2.5.4)

or

aj ∈ span{a1, . . . , aj−1} =⇒ rank(Aj) = rank(Aj−1).    (2.5.5)
This means that the rank of A cannot be at most k.
• For any j = 2, . . . , n, if rjj 6= 0, then (2.5.1) implies that (2.5.4) must hold.
However, if rjj = 0, then either (2.5.4) or (2.5.5) holds. We illustrate this
a1 = 0, a2 = r12 q1 , a3 = r13 q1
7. Let A be an m × m matrix, and let aj be its jth column. Give an algebraic proof of
Hadamard’s inequality:
|det A| ≤ \prod_{j=1}^{m} kajk2.
Also give a geometric interpretation of this result, making use of the fact that the deter-
minant equals the volume of a parallelepiped.
where vj = aj − \sum_{i=1}^{j−1} (qi∗aj) qi, with the convention that q0 = 0 ∈ Cm. For any j = 1, . . . , m, since {vj, q1, . . . , qj−1} are mutually orthogonal, the Pythagorean theorem gives

kajk2^2 = k vj + \sum_{i=1}^{j−1} (qi∗aj) qi k2^2 = kvjk2^2 + \sum_{i=1}^{j−1} k(qi∗aj) qik2^2 ≥ kvjk2^2.
Since |det(A)| is the volume of the parallelepiped with sides given by the vectors {a1, a2, . . . , am}, Hadamard’s inequality asserts that this is bounded above by the volume of a rectangular parallelepiped with sides of length ka1k2, ka2k2, . . . , kamk2.
8. Consider the inner product space of real-valued continuous functions defined on [−1, 1],
where the inner product is defined by

f · g = \int_{−1}^{1} f(x) g(x) dx.
Let M be the subspace that is spanned by the three linearly independent polynomial
p0 = 1, p1 = x, p2 = x2 .
(a) Use the Gram-Schmidt process to determine an orthonormal set of polynomials
(Legendre polynomials) q0 , q1 , q2 that spans M .
Solution: It is clear that q0 satisfies the given ODE for n = 0 since q00 = q000 = 0
and n(n + 1)|n=0 = 0. Because differentiation is a linear operation, it suffices to
show that v1 , v2 (from part (a)) satisfies the given ODE for n = 1, 2 respectively.
For n = 1, with v1 = x,

(1 − x^2) v1'' − 2x v1' + 1(1 + 1) v1 = 0 − 2x(1) + 2x = 0.

For n = 2, with v2 = x^2 − 1/3,

(1 − x^2) v2'' − 2x v2' + 2(2 + 1) v2 = (1 − x^2)(2) − 2x(2x) + 6(x^2 − 1/3) = 2 − 2x^2 − 4x^2 + 6x^2 − 2 = 0.
9. Let A ∈ Rm×n with m < n and of full rank. Then min kAx − bk2 is called an Underde-
termined Least-Squares Problem. Show that the solution is an n − m dimensional set.
Show how to compute the unique mininum norm solution using QR decomposition and
SVD approach.
Solution: Let A ∈ Rm×n with m < n and of full rank. Since m < n, Ax = b is an
underdetermined system and kAx − bk2 attains its mininum 0 in this case, where the
solution set, S is given by
S = {xp − z ∈ Rn : z ∈ N (A)},
where xp is the particular solution to Ax = b and N (A) denotes the null space of
A. Note that S is not a vector subspace of Rn (unless b = 0 ∈ Rm). Invoking the Rank-Nullity theorem gives dim N(A) = n − rank(A) = n − m, so the solution set is an (n − m)-dimensional (affine) set. The unique minimum norm solution is

x = A^T (AA^T)^{−1} b,

where (AA^T)^{−1} exists since A having full row rank implies that AA^T ∈ Rm×m is nonsingular.
(AAT )−1 = (R̂T Q̂T Q̂R̂)−1 = (R̂T R̂)−1 = (R̂)−1 (R̂T )−1 .
Here, the assumption that A is full rank is crucial, in that it ensures the existence
of (R̂T )−1 and (Σ̂)−1 . Indeed, Q1(b)(i) says that R̂ has all nonzero diagonal entries,
which implies that R̂ (and also R̂T ) is nonsingular since R̂ is upper-triangular; Theo-
rem 5.1, page 33, tells us that all singular values of A, which are the diagonal entries of Σ̂, are nonzero, which implies that Σ̂ is nonsingular since Σ̂ is diagonal.
Chapter 3

Conditioning and Stability

3.1 Conditioning and Condition Numbers

Definition 3.1.1. The absolute condition number κ̂ = κ̂(x) of the problem f at x is defined as

κ̂ = κ̂(x) = lim_{δ→0} sup_{kδxk≤δ} kδfk / kδxk.
• It can be interpreted as a supremum over all infinitesimal perturbations δx; thus it can be written as

κ̂ = sup_{δx} kδfk / kδxk.
Definition 3.1.2. The relative condition number κ = κ(x) of the problem f at x is defined as

κ = κ(x) = lim_{δ→0} sup_{kδxk≤δ} ( kδfk / kf(x)k ) / ( kδxk / kxk ),

or, assuming δx, δf are infinitesimal,

κ = κ(x) = sup_{δx} ( kδfk / kf(x)k ) / ( kδxk / kxk ).
If f is differentiable with Jacobian J(x), then κ̂ = kJ(x)k and

κ = kJ(x)k / ( kf(x)k / kxk ).

Example 3.1.3. Consider f(x) = αx for a fixed scalar α. Then J(x) = α and

κ̂ = kJ(x)k = |α|,   but   κ = kJ(x)k / ( kf(x)k / kxk ) = |α| / ( |αx| / |x| ) = 1.
Example 3.1.4. Consider f(x) = √x, x > 0. Then J(x) = f′(x) = 1/(2√x) and

κ̂ = kJ(x)k = 1/(2√x),   but   κ = kJ(x)k / ( kf(x)k / kxk ) = ( 1/(2√x) ) · ( x / √x ) = 1/2.
Example 3.1.5. Consider f(x) = x1 − x2, x = (x1, x2)∗ ∈ (C2, k · k∞). Then J(x) = (1, −1) and

κ̂ = kJ(x)k∞ = 2,   but   κ = kJ(x)k∞ / ( kf(x)k / kxk∞ ) = 2 / ( |x1 − x2| / max{|x1|, |x2|} ).

The relative condition number blows up if |x1 − x2| ≈ 0. Thus, this problem is severely ill-conditioned when x1 ≈ x2, an issue which κ̂ would not reveal.
• For k · k2, this bound is actually attained, since kAk2 = σ1 and kA−1k2 = 1/σm, where σm > 0 since A is non-singular. Indeed, choosing x to be the mth right singular vector vm of A yields

kxk2 / kAxk2 = kvmk2 / kAvmk2 = kvmk2 / (σm kumk2) = 1/σm.
(a) Consider f(x) = Ax = b. The problem of computing b, given x, has condition number

κ(x) = kAk kxk / kAxk = kAk kxk / kbk ≤ kAk kA−1k.

(b) Consider f(b) = A−1b = x. The problem of computing x, given b, has condition number

κ(b) = kA−1k kbk / kxk ≤ kA−1k kAk.

• Consider the problem f(A) = A−1b = x, where now A has some perturbation δA instead of b. Then, to leading order, δx ≈ −A−1(δA)x, and

κ(A) = sup_{δA} ( kδxk / kxk ) / ( kδAk / kAk ) ≤ sup_{δA} ( kA−1k kδAk kxk / kxk ) · ( kAk / kδAk ) = kA−1k kAk.
It can be shown that such perturbations δA exists for any given A ∈ Cm×m , b ∈ Cm and
any chosen norm k · k.
The product kAkkA−1 k appears so often that we decided to call it the condition number
of A (relative to the norm k · k), denoted by κ(A). A is said to be well-conditioned if κ(A) is
small and ill-conditioned if κ(A) is large. In the case where A is singular, we write κ(A) = ∞.
For a rectangular matrix A ∈ Cm×n of full rank, m ≥ n, the condition number is defined in terms of the pseudoinverse, i.e. κ(A) = kAk kA+k.
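Concretely, for the 2-norm one has κ2(A) = σ1/σm, which is easy to compute; a brief sketch in Python/NumPy (random test matrix, not from the notes):

import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 5))

sigma = np.linalg.svd(A, compute_uv=False)
kappa_svd = sigma[0] / sigma[-1]        # sigma_1 / sigma_m
kappa_np = np.linalg.cond(A, 2)         # kAk2 kA^{-1}k2

print(np.isclose(kappa_svd, kappa_np))  # True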
3.2 Floating Point Arithmetic

How exactly does one go from decimal (base 10) to binary (base 2)? A nonzero real number x is written in normalised floating point form as

x = σ · x̄ · β^e,   σ = ±1,

where

β = the chosen base,
e = the exponent,
x̄ = the mantissa of x, with (0.1)β ≤ x̄ < 1.

Observe that (0.1)10 = 0.1 for decimal while (0.1)2 = 0.5 for binary.
The exponent is stored as is if it is within the given range; otherwise the number overflows if e is too large, or underflows if e is too small.
1. An example of an operation prone to overflow is √(x^2 + y^2) when x or y is large. To avoid this, we rewrite it as

√(x^2 + y^2) = |x| (1 + (y/x)^2)^{1/2}   if |x| ≥ |y|,
√(x^2 + y^2) = |y| (1 + (x/y)^2)^{1/2}   if |x| < |y|.

2. An example of an operation prone to cancellation/underflow is √(x + 1) − √x. Observe that the quantity is approximately 0 if x is large. To avoid this, we rationalise the function:

√(x + 1) − √x = 1 / (√(x + 1) + √x).
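These rewritings are easy to try; for instance in Python (a small illustration, not from the notes), the naive formulas overflow or lose all accuracy while the rewritten ones behave well:

import math

x, y = 1e200, 1e199
naive = math.sqrt(x * x + y * y)                   # x*x overflows to inf
scaled = abs(x) * math.sqrt(1.0 + (y / x) ** 2)    # stays within range
print(naive, scaled, math.hypot(x, y))             # inf vs. the correct value (twice)

z = 1e16
# sqrt(z+1) - sqrt(z) suffers cancellation and evaluates to 0.0 here,
# while the rationalised form retains full accuracy.
print(math.sqrt(z + 1) - math.sqrt(z), 1.0 / (math.sqrt(z + 1) + math.sqrt(z)))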
x = σ · (0.a1 a2 . . . an an+1 . . .) · 2e ,
but its floating point representation fl(x) can only include n digits for the mantissa. There are
two ways to truncate x when stored:
(a) Chopping, which amounts to truncating remaining digits after an ,
(b) Rounding, based on the digit an+1:

fl(x) = σ · (0.a1 a2 . . . an) · 2^e   if an+1 = 0,
fl(x) = σ · [ (0.a1 a2 . . . an) + 2^{−n} ] · 2^e   if an+1 = 1.
• One can view the floating point representation fl(x) as a perturbation of x, i.e. there exists an ε = ε(x) such that

fl(x) = x(1 + ε),   or equivalently   (fl(x) − x)/x = ε.
It can be shown that ε has a certain range depending on the truncation method:

0 ≤ |x − fl(x)| ≤ 2^{−n} · 2^e   (chopping),    (3.2.1)
0 ≤ |x − fl(x)| ≤ (1/2) · 2^{−n} · 2^e = 2^{−n+e−1}   (rounding).    (3.2.2)

Thus, using (0.a1 a2 . . .)2 ≥ (0.1)2 = 2^{−1}, the relative error satisfies

0 ≤ | (x − fl(x)) / x | ≤ 2^{−n+1}   (chopping),   0 ≤ | (x − fl(x)) / x | ≤ 2^{−n+e−1} / (2^{−1} · 2^e) = 2^{−n}   (rounding).
• The worst possible error for chopping is twice as large as when rounding is used. It
can be seen from (3.2.1), (3.2.2) that x − fl(x) has the same sign as x for chopping but
possibly different sign for rounding. This means that there might be cancellation of error
if rounding is used!
Definition 3.2.2. The machine epsilon, denoted by εmachine is the difference between 1 and
the next larger floating point number. In a relative sense, the machine epsilon is as large as the
gaps between floating point number get. For a double-precision computer, εmachine = 2−52 ≈
O(10−16 ).
Equivalently, for all x ∈ R, there exists an ε with |ε| ≤ εmachine such that fl(x) =
x(1 + ε). That is, the difference between a real number and its (closest) floating point
approximation is always smaller than εmachine in relative terms.
2. Basic floating point operations consists of ⊕, , ⊗, ÷. Denote the floating point operation
by ~. For any floating points x, y, there exists an ε with |ε| ≤ εmachine such that
x ~ y = fl(x ∗ y) = (x ∗ y)(1 + ε).
That is, every operation of floating point arithmetic is exact up to a relative error of size
at most εmachine .
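Both axioms can be observed directly in double precision; a short illustration in Python (not part of the notes):

import numpy as np
from fractions import Fraction

eps = np.finfo(np.float64).eps                # 2**(-52), gap between 1 and the next float
print(eps == 2.0**-52)                        # True

# |fl(x) - x| / |x| <= eps_machine for x = 0.1 (Fraction(0.1) is the exact stored value):
rel_err = abs(Fraction(0.1) - Fraction(1, 10)) / Fraction(1, 10)
print(float(rel_err) <= eps)                  # True

# Each floating point operation is exact up to a relative error of size O(eps_machine):
a, b = 1.0, 3.0
print(abs((a / b) * b - a) <= 2 * eps)        # True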
3.3 Stability

Definition 3.3.1. An algorithm f̃ for a problem f is accurate if for each x ∈ X,

kf̃(x) − f(x)k / kf(x)k = O(εmachine).

In other words, there exists a constant C > 0 such that for all sufficiently small εmachine we have

kf̃(x) − f(x)k / kf(x)k ≤ C εmachine.

• In practice, C can be large. For ill-conditioned problems, the definition of accuracy can be too restrictive.
Definition 3.3.2.

1. An algorithm f̃ for a problem f is stable if for each x ∈ X,

kf̃(x) − f(x̃)k / kf(x̃)k = O(εmachine)

for some x̃ with

kx̃ − xk / kxk = O(εmachine).

In words, a stable algorithm gives nearly the right answer to nearly the right question.

2. An algorithm f̃ for a problem f is backward stable if for each x ∈ X,

f̃(x) = f(x̃)   for some x̃ with   kx̃ − xk / kxk = O(εmachine).

In words, a backward stable algorithm gives exactly the right answer to nearly the right question.
Theorem 3.3.3. For problems f and algorithms f˜ defined on finite-dimensional spaces X and
Y , the properties of accuracy, stability and backward stability all hold or fail to hold indepen-
dently of the choice of norms in X and Y .
Proof. We will only prove this in the case of subtraction. Consider the subtraction f(x1, x2) = x1 − x2, computed in floating point as

f̃(x1, x2) = fl(x1) ⊖ fl(x2).

By the axioms of floating point arithmetic, there exist ε1, ε2, ε3 with |ε1|, |ε2|, |ε3| ≤ εmachine such that

fl(x1) = x1(1 + ε1),   fl(x2) = x2(1 + ε2),
fl(x1) ⊖ fl(x2) = (fl(x1) − fl(x2))(1 + ε3).

Thus,

f̃(x1, x2) = (x1(1 + ε1) − x2(1 + ε2))(1 + ε3) = x1(1 + ε4) − x2(1 + ε5) = x̃1 − x̃2 = f(x̃1, x̃2)

for some |ε4|, |ε5| ≤ 2εmachine + O(ε^2machine). Backward stability follows directly, since

|x̃1 − x1| / |x1| = O(εmachine),   |x̃2 − x2| / |x2| = O(εmachine).
Theorem 3.4.1. Suppose a backward stable algorithm f̃ is applied to solve a problem f with relative condition number κ. Then the relative errors satisfy

kf̃(x) − f(x)k / kf(x)k = O(κ(x) εmachine).

Proof. By backward stability, f̃(x) = f(x̃) for some x̃ with

kx̃ − xk / kxk = O(εmachine).

By the definition of the relative condition number,

kf(x̃) − f(x)k / kf(x)k ≤ (κ(x) + o(1)) kx̃ − xk / kxk,

where o(1) denotes a quantity that converges to 0 as εmachine −→ 0. The desired inequality follows from combining these bounds.
1. Forward substitution solves the lower-triangular system Lx = b, L = (lij) ∈ Cm×m:

x1 = b1 / l11,
xi = (1/lii) ( bi − \sum_{j=1}^{i−1} lij xj ),   i = 2, . . . , m.

2. Backward substitution solves the upper-triangular system Ux = b, U = (uij) ∈ Cm×m:

xm = bm / umm,
xi = (1/uii) ( bi − \sum_{j=i+1}^{m} uij xj ),   i = m − 1, . . . , 1.
3. The operation count for both forward and backward substitution is ∼ m^2 flops, since

additions and subtractions: ∼ m(m − 1)/2 flops,
multiplications and divisions: ∼ m(m + 1)/2 flops.
3.6 Problems
1. Assume that the matrix norm k · k satisfies the submultiplicative property kABk ≤ kAk kBk. Show that if kXk < 1, then I − X is invertible,

(I − X)−1 = \sum_{j=0}^{∞} X^j,

and k(I − X)−1k ≤ 1/(1 − kXk).
Solution: This is a classical result about Neumann series, which is the infinite series \sum_{j=0}^{∞} X^j. Assuming (I − X) is invertible with its inverse (I − X)−1 given by the Neumann series, using the submultiplicative property and the triangle inequality for norms we have that

k(I − X)−1k = k \sum_{j=0}^{∞} X^j k ≤ \sum_{j=0}^{∞} kXk^j = 1 / (1 − kXk),    (3.6.1)

where the second infinite series, which is a geometric series, converges since kXk < 1 by assumption. This proves the desired inequality, and moreover it shows that the Neumann series \sum_{j=0}^{∞} X^j converges absolutely in the matrix norm, and thus converges in the matrix norm too. To conclude the proof, we need to show that I − X is in fact invertible, with its inverse given by the Neumann series. A direct computation shows that

(I − X) ( \sum_{j=0}^{n} X^j ) = (I − X)(I + X + . . . + X^n) = I − X^{n+1}.    (3.6.2)

Since kX^{n+1}k ≤ kXk^{n+1} −→ 0 as n −→ ∞, letting n −→ ∞ in (3.6.2) gives (I − X) ( \sum_{j=0}^{∞} X^j ) = I. A symmetric argument also shows that ( \sum_{j=0}^{∞} X^j ) (I − X) = I. Hence, we have that

(I − X)−1 = \sum_{j=0}^{∞} X^j,

i.e. I − X is invertible.
Combining (3.6.3), (3.6.4) and rearranging yields the left inequality. Next, since
Ax = b, we have
kbk = kAxk ≤ kAkkxk. (3.6.5)
On the other hand, since Ax − b + b − Ax̃ = r, we have
kek = kA−1 Aek = kA−1 (Ae − b + b)k = kA−1 rk ≤ kA−1 kkrk. (3.6.6)
We know that κ(A) = kAkkA−1 k is by definition the condition number of the matrix
A; moreover κ(A) ≥ kAA−1 k = 1. The terms kek/kxk, krk/kbk can be interpreted as
the relative solution error and the relative residual eror respectively. Thus, the right
inequality
kx − x̃k kek krk k(b + r) − bk
= ≤ κ(A) = κ(A)
kxk kxk kbk kbk
tells us that the ratio between the relative solution error and the relative residual
error is controlled by the condition number of A. In other words, suppose x1 ∈ Rn is
such that Ax1 = b1 and suppose we perturb b1 by some ε > 0. Then the correspond-
ing solution can only differ from x1 at most κ(A)ε/kbk in relative terms.
This estimate also shows that if κ(A) is not large then the residual r gives a good
representation of the error e. However, if κ(A) is large then the residual r is not a
good estimate of the error e.
where we assume that kA−1δAk ≤ kA−1k kδAk < 1, so that (A + δA) is nonsingular (Why?). Show that

kx̃ − xk / kxk ≤ ( κ(A) / (1 − κ(A) kδAk/kAk) ) ( kδAk/kAk + kδbk/kbk ).
Since k−A−1δAk = kA−1δAk < 1, Problem 1 together with the assumption that A is invertible shows that A + δA is invertible, i.e. A + δA is nonsingular. Since Ax̃ − Ax = δb − δAx̃, we have that

kx̃ − xk / kxk = kA−1(Ax̃ − Ax)k / kxk = kA−1(δb − δAx̃)k / kxk    (3.6.7a)
             ≤ kA−1k kδbk / kxk + kA−1k kδAk kx̃k / kxk.    (3.6.7b)

Using κ(A) = kA−1k kAk and kbk ≤ kAk kxk yields the bound
The desired inequality follows from substituting (3.6.9) into the above inequality.
4. Show that for Gaussian elimination with partial pivoting (permutation by rows) applied to a matrix A ∈ Rn×n, the growth factor ρ = max_{ij} |uij| / max_{ij} |aij| satisfies the estimate ρ ≤ 2^{n−1}.
Solution:
5. Show that if all the principal minors of a matrix A ∈ Rn×n are nonzero, then there exist a diagonal matrix D, a unit lower triangular matrix L and a unit upper triangular matrix U such that A = LDU, and that this factorisation is unique.

Solution: Since such an LU decomposition and the way we factored out the pivots of U are both
unique, we conclude that there exists a unique LDU factorisation of A with all the
desired properties for L, D, U .
If A is symmetric, then an LDU decomposition of the required form might not exist, and might not be unique even if it does exist. Consider the symmetric matrix

A = (aij) = [0, 0, 0; 0, 0, 1; 0, 1, 0],

and suppose we want to decompose A into the form

A = [1, 0, 0; a, 1, 0; b, c, 1] [A, 0, 0; 0, B, 0; 0, 0, C] [1, d, e; 0, 1, f; 0, 0, 1]
  = [A, Ad, Ae; aA, aAd + B, aAe + Bf; bA, bAd + cB, bAe + cBf + C].

Comparing the first two diagonal entries a11, a22 gives A = B = 0, but then aAe + Bf = 0 ≠ a23 = 1.
If A is symmetric and admits such a (unique) factorisation, then

LDU = A = A^T = U^T D^T L^T = U^T D L^T,

so by uniqueness L = U^T, and hence

A = LDU = U^T D U = L D L^T.
Chapter 4
Systems of Equations
4.1 Gaussian Elimination
3. Under the assumption that aii^(i) ≠ 0 for i = 1, . . . , k − 1, we will have A^(k) x = b^(k), k = 2, . . . , n, where the first k − 1 columns of A^(k) are zero below the diagonal (rows 1, . . . , k − 1 contain the pivot rows with entries aij^(1), . . . , aij^(k−1), and the remaining rows contain the updated entries aij^(k)), and b^(k) = (b1^(1), b2^(2), . . . , bk^(k), . . . , bn^(k))^T. The entries are updated according to

m_{i(k−1)} = a_{i(k−1)}^(k−1) / a_{(k−1)(k−1)}^(k−1),   i = k, . . . , n,

aij^(k) = aij^(k−1) − m_{i(k−1)} a_{(k−1)j}^(k−1),   i, j = k, . . . , n,

bi^(k) = bi^(k−1) − m_{i(k−1)} b_{k−1}^(k−1),   i = k, . . . , n.
General Idea

Define mk = [0, . . . , 0, m_{(k+1)k}, . . . , m_{nk}]^T ∈ Rn and consider the kth Gaussian transformation matrix Mk defined by

Mk = In − mk ek^T,

which equals the identity matrix except for the entries −m_{(k+1)k}, . . . , −m_{nk} below the diagonal in the kth column.

• The inverse of Mk is Mk^{−1} = In + mk ek^T. Indeed, ek^T mk = 0, since mk has nonzero entries only from position k + 1 onwards, k = 1, . . . , n − 1. Thus,

Mk^{−1} M_{k+1}^{−1} = (In + mk ek^T)(In + m_{k+1} e_{k+1}^T) = In + mk ek^T + m_{k+1} e_{k+1}^T,

and more generally

\prod_{j=1}^{n−1} Mj^{−1} = In + \sum_{j=1}^{n−1} mj ej^T,

which is the unit lower-triangular matrix whose (i, j) entry below the diagonal is the multiplier m_{ij}.
The total operation count of Gaussian elimination is

2(n − 1)n(n + 1)/3 + n(n − 1) ∼ (2/3) n^3 flops.
• The ith order leading principal submatrix Ai is constructed from the first i rows and i columns of A. Its determinant is called the ith leading principal minor.

• If Ai is singular for some i, then an LU factorisation (with lii = 1) may not exist, or will not be unique. We demonstrate this with the following examples:

C = [0, 1; 1, 0] has no LU factorisation of this form, while D = [0, 1; 0, 2] = [1, 0; β, 1] [0, 1; 0, 2 − β] for any β, so the LU factorisation of D exists but is not unique.
Proof. We begin by proving the “if” direction. By induction, we want to show that if
det(Ai ) ≠ 0, i = 1, . . . , n − 1, then the LU factorisation of Ai (as defined above) exists and is
unique. The case i = 1 is trivial since a11 ≠ 0. Suppose the case (i − 1) is true, i.e. there exists a
unique LU decomposition of Ai−1 such that
\[
A_{i-1} = L^{(i-1)} U^{(i-1)}, \qquad \text{with } l_{kk}^{(i-1)} = 1, \; k = 1, \ldots, i-1.
\]
Write
\[
A_i = \begin{pmatrix} A_{i-1} & c \\ d^T & a_{ii} \end{pmatrix}
    = \begin{pmatrix} L^{(i-1)} & 0 \\ l^T & 1 \end{pmatrix}
      \begin{pmatrix} U^{(i-1)} & u \\ 0^T & u_{ii} \end{pmatrix},
\]
where 0, l, u, c, d ∈ Ri−1 . Note that u_{ii} ≠ 0 since det(Ai ) ≠ 0. Comparing terms in the
factorisation yields
\[
L^{(i-1)} u = c, \qquad l^T U^{(i-1)} = d^T, \qquad l^T u + u_{ii} = a_{ii}. \tag{4.1.1}
\]
Conversely, assume there exists a unique LU factorisation A = LU , with lii = 1, i = 1, . . . , n.
There are two separate cases to consider:
1. A is non-singular. Recall that for every i = 1, . . . , n, Ai has an LU factorisation of the form
\[
A_i = L^{(i)} U^{(i)} = \begin{pmatrix} L^{(i-1)} & 0 \\ l^T & 1 \end{pmatrix}
                        \begin{pmatrix} U^{(i-1)} & u \\ 0^T & u_{ii} \end{pmatrix}.
\]
Thus, det(Ai ) = det(L(i) ) det(U (i) ) = u11 u22 . . . uii , i = 1, . . . , n. In particular, det(An ) =
u11 . . . unn ; but since A is non-singular, uii ≠ 0 for all i = 1, . . . , n. Hence, we must have
det(Ai ) ≠ 0 for every i = 1, . . . , n.
2. A is singular. The analysis above shows that U must have at least one zero entry on the main
diagonal. Let ukk be the first zero entry of U on the main diagonal. The LU factorisation
process then breaks down at the (k + 1)th step, because then lT will not be unique due to
Uk being singular (refer to (4.1.1)). In other words, if ukk = 0 for some k ≤ n − 1, then
we lose existence and uniqueness of the LU factorisation at the (k + 1)th step. Hence, in order
to have a unique LU factorisation of A, we must have ujj ≠ 0 for every j = 1, . . . , n − 1
and unn = 0.
We provide a simple algorithm for the Gaussian elimination method without pivoting. This
pseudocode is not optimal, in the sense that U and the multipliers mjk can be stored in the same
array as A.
U = A, L = I
for k = 1 to n − 1
    for j = k + 1 to n
        mjk = ujk /ukk
        ljk = mjk
        uj,k:n = uj,k:n − mjk uk,k:n
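A direct NumPy translation of this pseudocode, followed by forward and back substitution, might look as follows (a sketch only: it assumes all pivots are nonzero and makes no claim of efficiency; the test system is an arbitrary example with solution (1, 1, 2)):

import numpy as np

def gaussian_elimination(A, b):
    """LU factorisation without pivoting, then forward/back substitution."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    for k in range(n - 1):
        for j in range(k + 1, n):
            m = U[j, k] / U[k, k]          # multiplier m_jk; assumes U[k, k] != 0
            L[j, k] = m
            U[j, k:] = U[j, k:] - m * U[k, k:]
    # Solve Ly = b (forward substitution), then Ux = y (back substitution).
    y = np.zeros(n)
    for i in range(n):
        y[i] = b[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return L, U, x

A = np.array([[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])
L, U, x = gaussian_elimination(A, b)
print(np.allclose(L @ U, A), x)            # expect True and x = [1, 1, 2]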
4.2 Pivoting
We begin by exploring the unfortunate fact that the Gaussian elimination method without pivoting is
neither stable nor backward stable, mainly due to its sensitivity to rounding errors. Fortunately,
this instability can be rectified by permuting the order of the rows of the matrix in a certain
way! This operation is called pivoting.
Consider the system
\[
\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix},
\]
for which Gaussian elimination fails at the first step since a11 = 0. The obvious solution is to
interchange rows. Now suppose instead that we perturb a11 by some small number ε > 0, so that
\[
\begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.
\]
Performing GEM (in floating-point arithmetic) yields
\[
\begin{pmatrix} \varepsilon & 1 \\ 0 & 1 - 1/\varepsilon \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 - 1/\varepsilon \end{pmatrix}
\implies x_2 = \frac{2 - 1/\varepsilon}{1 - 1/\varepsilon} \approx 1, \qquad x_1 = \frac{1 - x_2}{\varepsilon} \approx 0.
\]
However, the actual solution is given by
\[
x_2 = \frac{1 - 2\varepsilon}{1 - \varepsilon} \approx 1, \qquad x_1 = \frac{1}{1 - \varepsilon} \approx 1 \ne 0.
\]
If we interchange rows, we have
\[
\begin{pmatrix} 1 & 1 \\ \varepsilon & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}
\implies
\begin{pmatrix} 1 & 1 \\ 0 & 1 - \varepsilon \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 - 2\varepsilon \end{pmatrix}.
\]
The solution is given by x2 = (1 − 2ε)/(1 − ε) ≈ 1 and x1 = 2 − x2 ≈ 1.
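The same effect is easy to reproduce in double-precision arithmetic; in the sketch below (illustrative values only) the tiny pivot ε = 10^{−20} destroys the computed x1 unless the rows are interchanged first:

import numpy as np

eps = 1e-20                                   # far below machine precision relative to 1
A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])

# GEM without pivoting: eliminate with the tiny pivot eps.
m = A[1, 0] / A[0, 0]
x2 = (b[1] - m * b[0]) / (A[1, 1] - m * A[0, 1])
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]
print("no pivoting:", x1, x2)                 # x1 computes as 0.0 instead of approximately 1

# GEM with the rows interchanged (pivoting).
Ap = A[[1, 0], :]
bp = b[[1, 0]]
m = Ap[1, 0] / Ap[0, 0]
y2 = (bp[1] - m * bp[0]) / (Ap[1, 1] - m * Ap[0, 1])
y1 = (bp[0] - Ap[0, 1] * y2) / Ap[0, 0]
print("with pivoting:", y1, y2)               # both components approximately 1, as expected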
Main Idea
We demonstrate the main idea with a simple example. Consider
\[
A = A^{(1)} = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}.
\]
\[
\tilde{A}^{(1)} = P_1 A^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}.
\]
\[
A^{(2)} = M_1 \tilde{A}^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -7 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & -1 \\ 0 & -6 & -12 \end{pmatrix}.
\]
\[
\tilde{A}^{(2)} = P_2 A^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & 0 & -1 \\ 0 & -6 & -12 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}.
\]
\[
A^{(3)} = M_2 \tilde{A}^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}
= \begin{pmatrix} 1 & 2 & 3 \\ 0 & -6 & -12 \\ 0 & 0 & -1 \end{pmatrix}.
\]
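For comparison, SciPy's LU routine applies partial pivoting by rows (so its permutation differs from the P1 , P2 chosen above, which only swap when a pivot vanishes); the sketch below merely verifies a factorisation of the form A = P LU :

import numpy as np
from scipy.linalg import lu

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 5.0],
              [7.0, 8.0, 9.0]])

P, L, U = lu(A)          # SciPy returns A = P @ L @ U
print(P)                 # permutation chosen by partial pivoting (largest pivot in magnitude)
print(L)                 # unit lower triangular
print(U)                 # upper triangular
print(np.allclose(P @ L @ U, A))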
Chapter 5
Iterative Methods for Linear Systems
5.1 Consistent Iterative Methods and Convergence
Definition 5.1.1. Given some initial guess x(0) ∈ Rn , consider iterative methods of the form
\[
x^{(k+1)} = B x^{(k)} + f, \qquad k \ge 0, \tag{5.1.1}
\]
where B is an n × n iteration matrix and f is an n-vector obtained from b.
An iterative method of the form (5.1.1) is said to be consistent with the linear system Ax = b
if f and B are such that x = Bx + f .
Example 5.1.2. Observe that consistency of (5.1.1) does not imply its convergence. Consider
the linear system 2Ix = b. It is clear that the iterative method defined below is consistent:
x(k+1) = −x(k) + b.
However, this method is not convergent for every choice of initial guess x(0) . Indeed, choosing
x(0) = 0 gives
x(2k) = 0, x(2k+1) = b , k ≥ 0.
On the other hand, the proposed iterative method converges to the true solution if x(0) = b/2.
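This behaviour is easy to observe numerically; the sketch below (with an arbitrary right-hand side b) runs the iteration x(k+1) = −x(k) + b from the two initial guesses discussed above:

import numpy as np

b = np.array([1.0, 2.0, 3.0])
x_exact = b / 2.0                     # solution of 2Ix = b

def iterate(x0, num_steps=6):
    x = x0.copy()
    history = [x.copy()]
    for _ in range(num_steps):
        x = -x + b                    # consistent with 2Ix = b: x = -x + b holds at x = b/2
        history.append(x.copy())
    return history

print(iterate(np.zeros(3)))           # oscillates between 0 and b: no convergence
print(iterate(b / 2.0))               # stays at the exact solution b/2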
Let e(k) = x − x(k) . Subtracting the consistency equation x = Bx + f from the iterative
method (5.1.1) yields the recurrence relation for the error,
\[
e^{(k+1)} = B e^{(k)}, \qquad \text{so that} \qquad e^{(k)} = B^k e^{(0)}, \quad k \ge 0.
\]
Proof. The result is trivial if A = 0, so suppose not. Suppose lim_{m→∞} Am = 0. Choose any
λ ∈ σ(A) with corresponding eigenvector x ≠ 0. Since Am x = λm x,
\[
\lim_{m\to\infty} \lambda^m x = \lim_{m\to\infty} A^m x = 0,
\]
and since x ≠ 0 this forces |λ| < 1. This proves the only if statement, since λ ∈ σ(A) was arbitrary.
Conversely, suppose ρ(A) < 1. By continuity of the norm and the fact that all norms on a
finite-dimensional vector space are equivalent, it suffices to prove that
\[
\lim_{m\to\infty} \|A^m\| = 0.
\]
\[
\|A^m\|_2 \le \|D\|_2^m \sum_{k=0}^{N-1} \frac{m^k}{k!} \left( \frac{\|U\|_2}{\|D\|_2} \right)^k
\le m^{N-1} \|D\|_2^m \left( \sum_{k=0}^{N-1} \left( \frac{\|U\|_2}{\|D\|_2} \right)^k \right)
= C\, m^{N-1} \rho(A)^m ,
\]
and the sequence a_m := C m^{N-1} \rho(A)^m converges to 0 as m → ∞ by the Ratio Test for
sequences, since ρ(A) < 1. Consequently,
Remark 5.1.6. A sufficient but not necessary condition for convergence of a consistent iterative
method is kBk < 1 for any consistent matrix norm, since ρ(B) ≤ kBk. The rate of convergence
depends on how much less than 1 the spectral radius is: the smaller it is, the faster the
convergence.
where B = P −1 N and f = P −1 b.
Splitting A = D + R, where
\[
D = \operatorname{diag}(A) = \begin{pmatrix} a_{11} & & & \\ & a_{22} & & \\ & & \ddots & \\ & & & a_{nn} \end{pmatrix}
\quad\text{and}\quad
R = A - D = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ a_{21} & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & 0 \end{pmatrix},
\]
the system Ax = b becomes Dx = −Rx + b, which suggests the Jacobi iteration
\[
x^{(k+1)} = \underbrace{-D^{-1}R}_{B}\, x^{(k)} + \underbrace{D^{-1}b}_{f} .
\]
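A minimal NumPy sketch of the resulting Jacobi iteration (assuming nonzero diagonal entries; the strictly diagonally dominant test matrix is an arbitrary choice, so the method converges):

import numpy as np

def jacobi(A, b, x0, num_iters=50):
    D = np.diag(np.diag(A))
    R = A - D
    D_inv = np.diag(1.0 / np.diag(A))        # requires a_ii != 0
    B = -D_inv @ R                           # iteration matrix
    f = D_inv @ b
    x = x0.copy()
    for _ in range(num_iters):
        x = B @ x + f
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
print(jacobi(A, b, np.zeros(3)))
print(np.linalg.solve(A, b))                 # the two should agree closely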
In the matrix form of Gauss-Seidel, we claim that (D + L)−1 exists. This is true because
(D + L) is strictly diagonally dominant by rows whenever A is, and hence non-singular.
since A is SDD by rows from assumption. Since λ ∈ σ(B) was arbitrary, this gives ρ(B) < 1
and the Jacobi method is convergent.
A similar argument with the iteration matrix in the Gauss-Seidel method B = −(D+L)−1 U
gives
since A is SDD by rows from assumption. Since λ ∈ σ(B) was arbitrary, this gives ρ(B) < 1
and the Gauss-Seidel method is convergent.
(D + L + U )x = b
Dx = b − Lx − U x
x = D−1 (b − Lx − U x)
ωx = ωD−1 (b − Lx − U x)
x = ωD−1 (b − Lx − U x) + (1 − ω)x.
\[
x_i^{(k+1)} = \frac{\omega}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right) + (1-\omega)\, x_i^{(k)}, \qquad i = 1, \ldots, n. \tag{SOR}
\]
In matrix form, this reads
\[
x^{(k+1)} = (D + \omega L)^{-1}\big[ (1-\omega)D - \omega U \big]\, x^{(k)} + \omega (D + \omega L)^{-1} b =: B_\omega x^{(k)} + f_\omega .
\]
For ω = 1, we recover the Gauss-Seidel method. For ω ∈ (0, 1), the method is called
under-relaxation; for ω > 1, the method is called over-relaxation. Clearly there exists an optimal
parameter ω0 that produces the smallest spectral radius.
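A componentwise NumPy sketch of (SOR); with ω = 1 it reduces to Gauss-Seidel, and the SPD test matrix and the value ω = 1.25 are arbitrary illustrative choices:

import numpy as np

def sor(A, b, x0, omega, num_iters=100):
    n = len(b)
    x = x0.astype(float).copy()
    for _ in range(num_iters):
        for i in range(n):
            # x[:i] already holds the updated components x_j^(k+1) for j < i.
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = omega * (b[i] - sigma) / A[i, i] + (1.0 - omega) * x[i]
    return x

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
print(sor(A, b, np.zeros(3), omega=1.0))     # Gauss-Seidel
print(sor(A, b, np.zeros(3), omega=1.25))    # over-relaxation
print(np.linalg.solve(A, b))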
Theorem 5.2.4.
(a) If A is symmetric positive definite (SPD), then the SOR method is convergent if and only
if 0 < ω < 2.
We can extend the idea of a relaxation parameter to general consistent iterative methods
(5.1.1). This results in a consistent iterative method for any γ ≠ 0
Proof. Recall that q(x) = \frac{1}{2}\langle x, Ax\rangle - \langle x, b\rangle. Suppose x is a minimiser of q; then the first variation of q at x must vanish. More
precisely, for any v ∈ Rn we must have
\[
q(x + \varepsilon v) = \frac{1}{2}\langle x + \varepsilon v, A(x + \varepsilon v)\rangle - \langle x + \varepsilon v, b\rangle
= \frac{1}{2}\langle x, Ax\rangle + \frac{\varepsilon}{2}\langle x, Av\rangle + \frac{\varepsilon}{2}\langle v, Ax\rangle + \frac{\varepsilon^2}{2}\langle v, Av\rangle - \langle x, b\rangle - \varepsilon\langle v, b\rangle
= q(x) + \varepsilon\big[ \langle v, Ax\rangle - \langle v, b\rangle \big] + \frac{\varepsilon^2}{2}\langle v, Av\rangle,
\]
using A = AT ; requiring the first-order term to vanish for every v ∈ Rn gives Ax = b.
Conversely, suppose Ax = b. For any v ∈ Rn , write v = x + w. Then
\[
q(v) = q(x + w) = \frac{1}{2}\langle x + w, A(x + w)\rangle - \langle x + w, b\rangle
= \frac{1}{2}\langle x, Ax\rangle + \frac{1}{2}\langle x, Aw\rangle + \frac{1}{2}\langle w, Ax\rangle + \frac{1}{2}\langle w, Aw\rangle - \langle x, b\rangle - \langle w, b\rangle
= q(x) + \langle w, Ax\rangle - \langle w, b\rangle + \frac{1}{2}\langle w, Aw\rangle
= q(x) + \frac{1}{2}\langle w, Aw\rangle \ge q(x),
\]
where we use A = AT , Ax = b and the assumption that A is positive-definite. Since v ∈ Rn was arbitrary, it follows
that x is a minimiser of q(x).
If A is SPD, then the minimiser is unique: if x, y ∈ Rn were two distinct minimisers of q,
they would satisfy Ax = b = Ay, i.e. A(x − y) = 0, which implies that
x − y = 0 since A is non-singular, a contradiction. In practice, q(x) usually represents a significant quantity
such as the energy of a system. In this case the solution to Ax = b represents a state of minimal
energy.
This together with the expression for αk from Lemma 5.3.2 yields
Remark 5.3.4. In the steepest descent method, the residual is updated via r(k+1) = r(k) − αk Ar(k) .
This update is numerically more stable than (5.3.1b) under rounding error, since b can be very
close to Ax(k+1) for k large enough.
\[
\alpha_k = \frac{\langle r^{(k)}, r^{(k)}\rangle}{\langle r^{(k)}, Ar^{(k)}\rangle}, \qquad
x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)}, \qquad
r^{(k+1)} = r^{(k)} - \alpha_k A r^{(k)} .
\]
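A NumPy sketch of these three update formulas (the SPD test system is an arbitrary example, and the stopping rule on the residual norm is mine):

import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iters=10_000):
    x = x0.astype(float).copy()
    r = b - A @ x                          # initial residual
    for _ in range(max_iters):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)         # alpha_k = <r, r> / <r, A r>
        x = x + alpha * r
        r = r - alpha * Ar                 # residual update, cf. Remark 5.3.4
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite
b = np.array([1.0, 1.0])
print(steepest_descent(A, b, np.zeros(2)), np.linalg.solve(A, b))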
Theorem 5.3.5. Let A ∈ Rn×n be symmetric positive definite. The steepest descent method is
convergent for any initial condition x(0) ∈ Rn and we have the following error estimate:
\[
\|e^{(k+1)}\|_A \le \frac{\kappa_2(A) - 1}{\kappa_2(A) + 1}\, \|e^{(k)}\|_A ,
\]
where e(k) = x(k) − xexact and κ2 (A) = kAk2 kA−1 k2 = σ1 /σn is the condition number of A with
respect to k · k2 .
Although the steepest descent method is convergent, the error may decrease very slowly. As
such, the steepest descent method can be time consuming: the steepest descent directions r(k)
can oscillate. Indeed, Lemma 5.3.3 tells us that r(k+2) can be almost parallel to ±r(k) with the
same magnitude.
Consider instead search directions p(k) with the updates
\[
x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}, \qquad r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)}, \tag{Residual}
\]
\[
p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)}, \tag{Search}
\]
where αk , βk , p(0) are again chosen so that E(x(k+1) ) is minimised. We recover the steepest
descent method if βk = 0.
Lemma 5.3.6. Given x(k) , E(x(k+1) ) is minimised if p(0) = r(0) and αk , βk are chosen as in the
algorithm summary below; in that case
\[
E(x^{(k+1)}) = E(x^{(k)}) - \frac{1}{2}\,\frac{\langle r^{(k)}, r^{(k)}\rangle^2}{\langle p^{(k)}, A p^{(k)}\rangle}. \tag{5.3.4}
\]
For k = 0, E(x(1) ) < E(x(0) ) if we choose p(0) = r(0) , since A is positive-definite. To find βk ,
we want to maximise the second term in (5.3.4), i.e. minimise hp(k) , Ap(k) i. We write this
expression in terms of βk using (Search) and A = AT and get
Since the last expression is a quadratic function of βk−1 , E(x(k+1) ) is minimised if βk−1 satisfies
H ′ (βk−1 ) = 0, and solving this gives
\[
\beta_k = -\frac{\langle r^{(k+1)}, A p^{(k)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle}, \qquad k \ge 1. \tag{5.3.5}
\]
Observe that using (Search) gives an orthogonal relation for successive p(k) with respect to
h·, A(·)i:
\[
\langle p^{(k+1)}, A p^{(k)}\rangle = \langle r^{(k+1)}, A p^{(k)}\rangle - \frac{\langle r^{(k+1)}, A p^{(k)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle}\,\langle p^{(k)}, A p^{(k)}\rangle = 0,
\]
We also obtain an orthogonal relation for successive r(k) with respect to h·, ·i, using (Residual)
and A = AT to get
Finally,
\[
\beta_k = \frac{\langle p^{(k)}, A p^{(k)}\rangle}{\langle r^{(k)}, r^{(k)}\rangle}\cdot\frac{\langle r^{(k+1)}, r^{(k+1)}\rangle}{\langle p^{(k)}, A p^{(k)}\rangle} = \frac{\|r^{(k+1)}\|_2^2}{\|r^{(k)}\|_2^2} .
\]
Lemma 5.3.7. For the conjugate gradient method, the residuals and search directions satisfy
the orthogonality:
hr(j) , r(k) i = hp(j) , Ap(k) i = 0 for all j ≠ k.
The following partial result was shown in the proof of Lemma 5.3.6:
\[
\langle r^{(k+1)}, r^{(k)}\rangle = \langle p^{(k+1)}, A p^{(k)}\rangle = 0 \qquad \text{for all } k \ge 0.
\]
We argue by induction. Suppose
\[
\langle r^{(j)}, r^{(k)}\rangle = \langle p^{(j)}, A p^{(k)}\rangle = 0 \qquad \text{for all } 0 \le k < j \le N.
\]
We need to show that the same relation holds for all 0 ≤ k < j ≤ N + 1. This is true from the
partial result if j = N + 1 and k = N , so suppose j = N + 1 and k < N . Then
\[
\begin{aligned}
\langle r^{(N+1)}, r^{(k)}\rangle &= \langle r^{(N)} - \alpha_N A p^{(N)}, r^{(k)}\rangle && [\text{From (Residual)}] \\
&= -\alpha_N \langle A p^{(N)}, r^{(k)}\rangle && [\text{induction hypothesis}] \\
&= -\alpha_N \langle A p^{(N)}, p^{(k)} - \beta_{k-1} p^{(k-1)}\rangle && [\text{From (Search)}] \\
&= 0.
\end{aligned}
\]
\[
\begin{aligned}
\langle p^{(N+1)}, A p^{(k)}\rangle &= \langle r^{(N+1)} + \beta_N p^{(N)}, A p^{(k)}\rangle && [\text{From (Search)}] \\
&= \langle r^{(N+1)}, A p^{(k)}\rangle && [\text{induction hypothesis}] \\
&= \left\langle r^{(N+1)}, \frac{r^{(k)} - r^{(k+1)}}{\alpha_k} \right\rangle && [\text{From (Residual)}] \\
&= 0,
\end{aligned}
\]
\[
\alpha_k = \frac{\|r^{(k)}\|_2^2}{\langle p^{(k)}, A p^{(k)}\rangle}, \qquad
x^{(k+1)} = x^{(k)} + \alpha_k p^{(k)}, \qquad
r^{(k+1)} = r^{(k)} - \alpha_k A p^{(k)},
\]
\[
\beta_k = \frac{\|r^{(k+1)}\|_2^2}{\|r^{(k)}\|_2^2}, \qquad
p^{(k+1)} = r^{(k+1)} + \beta_k p^{(k)} .
\]
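A NumPy sketch of this algorithm (the SPD test system is an arbitrary example; in exact arithmetic the loop would terminate in at most n steps, cf. Theorem 5.3.8 below):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()                               # p^(0) = r^(0)
    for _ in range(len(b)):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)             # alpha_k = ||r^(k)||^2 / <p^(k), A p^(k)>
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)       # beta_k = ||r^(k+1)||^2 / ||r^(k)||^2
        p = r_new + beta * p
        r = r_new
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])                # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])
print(conjugate_gradient(A, b, np.zeros(3)), np.linalg.solve(A, b))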
Theorem 5.3.8. If A ∈ Rn×n is symmetric positive definite, then the conjugate gradient
method converges (pointwise) in at most n steps to the solution of Ax = b. Moreover, the error
e(k) = x(k) − xexact satisfies e(k) ⊥ p(j) for j = 0, 1, . . . , k − 1, k < n, and
\[
\|e^{(k)}\|_A \le \frac{2C^k}{1 + C^{2k}}\, \|e^{(0)}\|_A , \qquad \text{where } C = \frac{\sqrt{\kappa_2(A)} - 1}{\sqrt{\kappa_2(A)} + 1}.
\]
\[
\sum_{j=0}^{n-1} \delta_j r^{(j)} = 0. \tag{5.3.6}
\]
Either r(k) = 0 for some k ≤ n − 1, which means the iteration process stops at the kth step, or
δk = 0 for every k = 0, 1, . . . , n − 1, which means the set of residual vectors {r(0) , r(1) , . . . , r(n−1) }
forms a basis of Rn and r(n) ≡ 0. In both cases, we see that the conjugate gradient method
converges in at most n steps.
5.4 Problems
1. Let A be a square matrix and let k · k be a consistent matrix norm (we say that k · k is
compatible or consistent with a vector norm k · k if kAxk ≤ kAkkxk). Show that
\[
\rho(A) = \lim_{m\to\infty} \|A^m\|_2^{1/m} .
\]
Solution: Let λ ∈ σ(A) with corresponding eigenvector x ≠ 0. Then Am x = λm x, so
|λ|m kxk2 = kAm xk2 ≤ kAm k2 kxk2 , i.e.
\[
|\lambda|^m \le \|A^m\|_2 .
\]
Taking the mth root of each side, and then the maximum over all λ ∈ σ(A), yields
\[
\max_{\lambda \in \sigma(A)} |\lambda| = \rho(A) \le \|A^m\|_2^{1/m} \quad \Longrightarrow \quad \rho(A) \le \lim_{m\to\infty} \|A^m\|_2^{1/m} . \tag{5.4.2}
\]
For the reverse inequality, arguing as in the proof of the convergence theorem above, we obtain
\[
\|A^m\|_2 \le C\, m^{N-1} \rho(A)^m ,
\]
where C > 0 is independent of m. Taking the mth root and then the limit as m → ∞
yields
\[
\lim_{m\to\infty} \|A^m\|_2^{1/m} \le \lim_{m\to\infty} C^{1/m}\, m^{(N-1)/m}\, \rho(A) = \rho(A), \tag{5.4.3}
\]
where we use the fact that lim_{m→∞} C^{1/m} = 1 = lim_{m→∞} m^{1/m} for any positive real
number C. The result follows from combining (5.4.2) and (5.4.3).
2. Consider the 3 × 3 linear system of the form Aj x = bj , where bj is always taken in such a
way that the solution of the system is the vector x = (1, 1, 1)T , and the matrices Aj are
\[
A_1 = \begin{pmatrix} 3 & 0 & 4 \\ 7 & 4 & 2 \\ -1 & 1 & 2 \end{pmatrix}, \quad
A_2 = \begin{pmatrix} -3 & 3 & -6 \\ -4 & 7 & -8 \\ 5 & 7 & -9 \end{pmatrix}, \quad
A_3 = \begin{pmatrix} 4 & 1 & 1 \\ 2 & -9 & 0 \\ 0 & -8 & -6 \end{pmatrix}, \quad
A_4 = \begin{pmatrix} 7 & 6 & 9 \\ 4 & 5 & -4 \\ -7 & -3 & 8 \end{pmatrix}.
\]
where A = L + D + U with D the diagonal part, L the strictly lower triangular part and U the
strictly upper triangular part of A. The number of iterations is N = 200, and we test the
algorithm for three different random initial guesses
\[
x_0^1 = \begin{pmatrix} 0.1214 \\ 0.1815 \\ 1.6112 \end{pmatrix}, \qquad
x_0^2 = \begin{pmatrix} 1.0940 \\ 1.7902 \\ 0.3737 \end{pmatrix}, \qquad
x_0^3 = \begin{pmatrix} 1.8443 \\ 1.3112 \\ 1.2673 \end{pmatrix}.
\]
We choose to stop the iteration process if the Euclidean norm of the residual vector
kb − Ax(k) k2 is less than the chosen tolerance 10−12 . We explain the numerical
result using the spectral radius of the iteration matrix.
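The observed behaviour can be predicted without running the iterations, by computing the spectral radii of the Jacobi and Gauss-Seidel iteration matrices; a NumPy sketch (the helper function is mine) is:

import numpy as np

def iteration_spectral_radii(A):
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    B_jacobi = -np.linalg.inv(D) @ (L + U)
    B_gs = -np.linalg.inv(D + L) @ U
    rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
    return rho(B_jacobi), rho(B_gs)

A1 = np.array([[3.0, 0.0, 4.0], [7.0, 4.0, 2.0], [-1.0, 1.0, 2.0]])
A2 = np.array([[-3.0, 3.0, -6.0], [-4.0, 7.0, -8.0], [5.0, 7.0, -9.0]])
A3 = np.array([[4.0, 1.0, 1.0], [2.0, -9.0, 0.0], [0.0, -8.0, -6.0]])
A4 = np.array([[7.0, 6.0, 9.0], [4.0, 5.0, -4.0], [-7.0, -3.0, 8.0]])

for name, A in [("A1", A1), ("A2", A2), ("A3", A3), ("A4", A4)]:
    rj, rg = iteration_spectral_radii(A)
    # The consistent method converges for every initial guess iff the spectral radius is < 1.
    print(name, "Jacobi rho =", round(rj, 4), "Gauss-Seidel rho =", round(rg, 4))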
Chapter 6
Eigenvalue Problems
Eigenvalues and eigenvectors of square matrices appear in the analysis of linear transformations
and have a wide range of applications, such as facial recognition, image compression, spectral
clustering, dimensionality reduction and ranking algorithms. These matrices may be sparse or
dense and may have greatly varying order and structure. What is to be calculated affects the
choice of method to be used, as well as the structure of the given matrix. We first discuss
three matrix factorisations, where the eigenvalues are explicitly displayed. We then review
three classical eigenvalue algorithms: power iteration, inverse iteration and Rayleigh quotient
iteration.
6.1 Eigenvalue-Revealing Factorisation
Consequently, eigenvalues of A are roots of the characteristic polynomial pA and vice versa and
we may write pA as
\[
p_A(z) = \prod_{j=1}^{m} (z - \lambda_j) = (z - \lambda_1)(z - \lambda_2)\cdots(z - \lambda_m),
\]
where λj ∈ C are eigenvalues of A and they might be repeated. With this in mind, we define
the algebraic multiplicity of λ ∈ σ(A) as the multiplicity of λ as a root of pA (z); an eigenvalue
is simple if its algebraic multiplicity is 1.
Theorem 6.1.2. A matrix A ∈ Cm×m has m eigenvalues, counted with algebraic multiplicity.
In particular, A has m distinct eigenvalues if the roots of pA are simple.
Theorem 6.1.3. Given a matrix A ∈ Cm×m , the following relation holds where eigenvalues
are counted with algebraic multiplicity:
\[
\det(A) = \prod_{j=1}^{m} \lambda_j, \qquad \operatorname{tr}(A) = \sum_{j=1}^{m} \lambda_j .
\]
The second formula follows from equating the coefficient of z^{m-1} in det(zI − A) and \prod_{j=1}^{m} (z - \lambda_j).
Theorem 6.1.4. If X is nonsingular, then A and X −1 AX have the same characteristic poly-
nomial, eigenvalues and algebraic and geometric multiplicities.
Consequently, A and B = X −1 AX have the same characteristic polynomial and also the
eigenvalues and algebraic multiplicities. Finally, suppose Eλ is an eigenspace for A. For any
x ∈ Eλ , we have that
y ′ = X −1 x′ = X −1 Ax = (X −1 AX)X −1 x = Λy.
The system is now decoupled and it can be solved separately. The solutions are
Definition 6.1.8. An eigenvalue whose algebraic multiplicity exceeds its geometric multiplicity
is a defective eigenvalue. A matrix that has one or more defective eigenvalues is a defective
matrix.
Proof. Suppose A has an eigenvalue decomposition. Since A is similar to Λ, it follows from The-
orem 6.1.4 that they share the same eigenvalues and the same multiplicities. Consequently, A is
nondefective since diagonal matrices are nondefective. Conversely, suppose A is nondefective.
Then A must have m linearly independent eigenvectors since one can show that eigenvectors
with different eigenvalues must be linearly independent, and each eigenvalue can contribute as
many linearly independent eigenvectors as its multiplicity. Defining X as the matrix whose
columns are these m linearly independent eigenvectors, we see that AX = XΛ, or A = XΛX −1 .
Theorem 6.1.10.
(a) A hermitian matrix is unitarily diagonalisable, and its eigenvalues are real.
Proof. The proof is similar to the existence of SVD. The case m = 1 is trivial, so suppose
m ≥ 2. Let q1 be any eigenvector of A, with corresponding eigenvalue λ. WLOG, we may
assume kq1 k2 = 1. Consider any extension of q1 to an orthonormal basis {q1 , . . . , qm } ⊂ Cm
and construct the unitary matrix
\[
Q_1 = \begin{bmatrix} q_1 & \widehat{Q}_1 \end{bmatrix} \in \mathbb{C}^{m\times m}, \qquad
\widehat{Q}_1 = \begin{bmatrix} q_2 & \cdots & q_m \end{bmatrix}.
\]
We have that
\[
T_1 := Q_1^* A Q_1 = \begin{bmatrix} q_1^* \\ \widehat{Q}_1^* \end{bmatrix} A \begin{bmatrix} q_1 & \widehat{Q}_1 \end{bmatrix}
= \begin{bmatrix} q_1^* A q_1 & q_1^* A \widehat{Q}_1 \\ \widehat{Q}_1^* A q_1 & \widehat{Q}_1^* A \widehat{Q}_1 \end{bmatrix}
= \begin{bmatrix} \lambda & b^* \\ 0 & \widehat{A} \end{bmatrix}.
\]
By the induction hypothesis, Â has a Schur factorisation Â = Q2 T2 Q2∗ . Then
\[
Q_1^* A Q_1 = \begin{bmatrix} \lambda & b^* \\ 0 & Q_2 T_2 Q_2^* \end{bmatrix}
= \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix}
  \begin{bmatrix} \lambda & b^* Q_2 \\ 0 & T_2 \end{bmatrix}
  \begin{bmatrix} 1 & 0^* \\ 0 & Q_2^* \end{bmatrix},
\]
and
\[
A = Q_1 \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix}
  \begin{bmatrix} \lambda & b^* Q_2 \\ 0 & T_2 \end{bmatrix}
  \begin{bmatrix} 1 & 0^* \\ 0 & Q_2^* \end{bmatrix} Q_1^* = Q T Q^* .
\]
To finish the proof, we need to show that Q is unitary, but this must be true since
\[
Q = Q_1 \begin{bmatrix} 1 & 0^* \\ 0 & Q_2 \end{bmatrix} = \begin{bmatrix} q_1 & \widehat{Q}_1 Q_2 \end{bmatrix}
\]
is the product of two unitary matrices.
Theorem 6.1.13 (Gershgorin Circle Theorem). The spectrum σ(A) is contained in the union
of the following m disks Di , i = 1, . . . , m in C, where
\[
D_i = \left\{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j \ne i} |a_{ij}| \right\}, \qquad i = 1, \ldots, m.
\]
and (normalising the eigenvector x so that |xi | = kxk∞ = 1)
\[
|\lambda - a_{ii}| = |(\lambda - a_{ii}) x_i| = \Big| \sum_{j \ne i} a_{ij} x_j \Big| \le \sum_{j \ne i} |a_{ij} x_j| \le \sum_{j \ne i} |a_{ij}| .
\]
Example 6.1.14. Consider the matrix
\[
A = \begin{pmatrix} -1+i & 0 & 1/4 \\ 1/4 & 1 & 1/4 \\ 1 & 1 & 3 \end{pmatrix}.
\]
Applying the Gershgorin Circle Theorem gives the following three disks in C:
\[
|\lambda - (-1+i)| \le 0 + \tfrac{1}{4} = \tfrac{1}{4}, \qquad
|\lambda - 1| \le \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \qquad
|\lambda - 3| \le 1 + 1 = 2.
\]
Upon sketching these disks in C, we see that 1/2 ≤ |λ| ≤ 5. [Draw the solution set in C.]
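A small NumPy check that every eigenvalue of this matrix indeed lies in at least one of the disks:

import numpy as np

A = np.array([[-1 + 1j, 0.0, 0.25],
              [0.25, 1.0, 0.25],
              [1.0, 1.0, 3.0]], dtype=complex)

centres = np.diag(A)
radii = np.sum(np.abs(A), axis=1) - np.abs(centres)     # off-diagonal row sums
eigvals = np.linalg.eigvals(A)

for lam in eigvals:
    # Each eigenvalue must lie in at least one Gershgorin disk.
    in_some_disk = np.any(np.abs(lam - centres) <= radii + 1e-12)
    print(lam, in_some_disk)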
It follows that the roots of pm (z) are equal to the eigenvalues of the companion matrix
\[
A = \begin{pmatrix}
0 & & & & & -a_0 \\
1 & 0 & & & & -a_1 \\
 & 1 & 0 & & & -a_2 \\
 & & \ddots & \ddots & & \vdots \\
 & & & 1 & 0 & -a_{m-2} \\
 & & & & 1 & -a_{m-1}
\end{pmatrix}.
\]
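This is essentially how numpy.roots works: it forms the companion matrix of the polynomial and computes its eigenvalues. A sketch for an arbitrary cubic with known roots 1, 2, 3:

import numpy as np

# p(z) = z^3 + a2 z^2 + a1 z + a0, monic, with roots 1, 2, 3.
a0, a1, a2 = -6.0, 11.0, -6.0

C = np.array([[0.0, 0.0, -a0],
              [1.0, 0.0, -a1],
              [0.0, 1.0, -a2]])              # companion matrix as displayed above

print(np.sort(np.linalg.eigvals(C)))         # eigenvalues of the companion matrix
print(np.sort(np.roots([1.0, a2, a1, a0])))  # the same roots via numpy.roots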
Theorem 6.2.1. For any m ≥ 5, there exists a polynomial p(z) of degree m with rational
coefficients that has a real root r, with the property that r cannot be written using any expression
involving rational numbers, addition, subtraction, multiplication, division and kth roots.
This theorem says that no computer program can produce the exact roots of an arbitrary
polynomial of degree ≥ 5 in a finite number of steps even in exact arithmetic, and it is because
of this that any eigenvalue solver must be iterative.
\[
A^T A\,\alpha = A^T b, \qquad (x^T x)\,\alpha = x^T A x, \qquad \alpha = \frac{x^T A x}{x^T x} = R(x).
\]
It is helpful to view R(·) as a function from Rm to R. We investigate the local behavior of
R(x) when x is near an eigenvector. Computing the partial derivatives of R(x) with respect
to the coordinates xj yields
\[
\frac{\partial R(x)}{\partial x_j}
= \frac{1}{x^T x}\frac{\partial}{\partial x_j}(x^T A x) - \frac{x^T A x}{(x^T x)^2}\frac{\partial}{\partial x_j}(x^T x)
= \frac{2(Ax)_j}{x^T x} - \frac{(x^T A x)\,2x_j}{(x^T x)^2}
= \frac{2}{x^T x}\big[(Ax)_j - R(x)x_j\big]
= \frac{2}{x^T x}\big(Ax - R(x)x\big)_j .
\]
Consequently, the gradient of R(x) is
\[
\nabla R(x) = \frac{2}{x^T x}\,\big(Ax - R(x)x\big)^T .
\]
We deduce the following properties of R(x) from the formula of ∇R(x):
We give another proof of the asymptotic relation (6.2.1). We express x as a linear combination
of the eigenvectors {q1 , . . . , qm }:
\[
x = \sum_{j=1}^{m} a_j q_j = \sum_{j=1}^{m} \langle x, q_j\rangle q_j, \qquad \text{since } A \text{ is symmetric.}
\]
Assuming x ≈ qJ and |aj /aJ | ≤ ε for all j ≠ J, it suffices to show that R(x) − R(qJ ) = O(ε2 ),
since from the Pythagorean theorem we have that
\[
\|x - q_J\|_2^2 = \sum_{j \ne J} |a_j|^2 + |a_J - 1|^2
= |a_J|^2 \left( \sum_{j \ne J} \Big| \frac{a_j}{a_J} \Big|^2 + \Big| \frac{a_J - 1}{a_J} \Big|^2 \right) \approx C \varepsilon^2 .
\]
Theorem 6.2.2. Assume |λ1 | > |λ2 | ≥ · · · ≥ |λm | ≥ 0 and q1T v (0) ≠ 0. Then the iterates of
the power iteration algorithm satisfy
\[
\|v^{(k)} - (\pm q_1)\|_2 = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big|^{k} \right), \qquad
|\lambda^{(k)} - \lambda_1| = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big|^{2k} \right) \qquad \text{as } k \to \infty.
\]
The ± sign means that at each step k, one or the other choice of sign is to be taken, and then
the indicated bound holds.
\[
v^{(0)} = a_1 q_1 + a_2 q_2 + \ldots + a_m q_m .
\]
\[
v^{(k)} \longrightarrow \frac{\lambda_1^k\, a_1 q_1}{|\lambda_1|^k\, |a_1|\, \|q_1\|_2} = \pm q_1 \qquad \text{as } k \to \infty,
\]
depending on the sign of λ1 and the initial guess v (0) , and the first equation follows. The second
equation follows from the asymptotic relation (6.2.1) of the Rayleigh quotient.
If λ1 > 0, then the sign in ±q1 is controlled by the initial guess v (0) , and so the signs are all + or all
−. If λ1 < 0, then the signs of ±q1 alternate and kv (k) k22 −→ kq1 k22 as k −→ ∞. One can show
that the iterates of the power iteration algorithm satisfy
\[
\frac{\|v^{(k+1)} - (\pm q_1)\|_2}{\|v^{(k)} - (\pm q_1)\|_2} = O\!\left( \Big| \frac{\lambda_2}{\lambda_1} \Big| \right) \qquad \text{as } k \to \infty.
\]
Consequently, the rate of convergence for the power iteration is linear. Except for special
matrices, the power iteration is very slow!
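A NumPy sketch of the power iteration, using the Rayleigh quotient as the eigenvalue estimate λ(k) (the symmetric test matrix and random initial vector are arbitrary choices):

import numpy as np

def power_iteration(A, v0, num_iters=100):
    v = v0 / np.linalg.norm(v0)
    for _ in range(num_iters):
        w = A @ v
        v = w / np.linalg.norm(w)               # normalise at every step
        lam = v @ A @ v                         # Rayleigh quotient estimate of lambda_1
    return lam, v

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])                 # symmetric
rng = np.random.default_rng(1)
lam, v = power_iteration(A, rng.standard_normal(3))
print(lam, np.max(np.linalg.eigvalsh(A)))       # should match the dominant eigenvalue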
The upshot is that if we choose µ sufficiently close to λJ , then |λJ − µ|−1 may be much larger than
|λj − µ|−1 for all j ≠ J. Consequently, applying the power iteration to the matrix (A − µI)−1
gives rapid convergence to qJ , and this is precisely the idea of inverse iteration.
Note that each step of the algorithm involves solving a linear system, and this raises an
immediate question: what if A − µI is so ill-conditioned that an accurate
solution of the linear system is not possible? This however is not a problem at all and we shall
not pursue this issue any further; the interested reader may refer to Exercise 27.5 in [TBI97, p. 210].
The following theorem is essentially a corollary of Theorem 6.2.2.
Theorem 6.2.3. Suppose λJ is the closest eigenvalue to µ and λm is the second closest, that
is,
\[
|\mu - \lambda_J| < |\mu - \lambda_m| \le |\mu - \lambda_j| \qquad \text{for each } j \ne J.
\]
Suppose qJT v (0) ≠ 0. Then the iterates of the inverse iteration algorithm satisfy
\[
\|v^{(k)} - (\pm q_J)\|_2 = O\!\left( \Big| \frac{\mu - \lambda_J}{\mu - \lambda_m} \Big|^{k} \right), \qquad
|\lambda^{(k)} - \lambda_J| = O\!\left( \Big| \frac{\mu - \lambda_J}{\mu - \lambda_m} \Big|^{2k} \right) \qquad \text{as } k \to \infty.
\]
The ± sign means that at each step k, one or the other choice of sign is to be taken, and the
indicated bound holds.
In practice, the inverse iteration is used when a good approximation µ of the desired eigenvalue
is known; in general, the inverse iteration converges to the eigenvector of the matrix A
corresponding to the eigenvalue closest to µ. As opposed to the power iteration, we can control
the rate of linear convergence, since it depends on the choice of µ.
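A NumPy sketch of the inverse iteration with a fixed shift µ (the symmetric test matrix and the shift µ = 2.5 are arbitrary illustrative choices; note that a linear system with A − µI is solved at every step rather than forming the inverse):

import numpy as np

def inverse_iteration(A, mu, v0, num_iters=50):
    n = A.shape[0]
    v = v0 / np.linalg.norm(v0)
    shifted = A - mu * np.eye(n)
    for _ in range(num_iters):
        w = np.linalg.solve(shifted, v)     # solve (A - mu I) w = v instead of inverting
        v = w / np.linalg.norm(w)
        lam = v @ A @ v                     # Rayleigh quotient estimate of lambda_J
    return lam, v

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu = 2.5                                    # converges to the eigenvalue of A closest to mu
rng = np.random.default_rng(2)
lam, v = inverse_iteration(A, mu, rng.standard_normal(3))
print(lam, np.linalg.eigvalsh(A))           # compare with the full spectrum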
[TBI97] L. N. Trefethen and D. Bau III. Numerical Linear Algebra. Vol. 50. Other Titles in
Applied Mathematics. SIAM, 1997.