NLA lecture notes
Welcome to numerical linear algebra (NLA)! NLA is a beautiful subject that combines
mathematical rigor, amazing algorithms, and an extremely rich variety of applications.
What is NLA? In a sentence, it is a subject that deals with the numerical solution (i.e.,
using a computer) of linear systems Ax = b (given A ∈ Rn×n (i.e., a real n × n matrix) and
b ∈ Rn (real n-vector), find x ∈ Rn ) and eigenvalue problems Ax = λx (given A ∈ Rn×n ,
find λ ∈ C and x ∈ Cn ), for problems that are too large to solve by hand (n ≥ 4 is already
large; we aim for n in the thousands or even millions). This can rightfully sound dull, and
some mathematicians (those purely oriented?) tend to get turned off after hearing this —
how could such a course be interesting compared with other courses offered by the Oxford
Mathematical Institute? I hope and firmly believe that at the end of the course you will all
agree that there is more to the subject than you imagined. The rapid rise of data science and
machine learning has only meant that the importance of NLA keeps growing, with a vast
number of problems in these fields requiring NLA techniques and algorithms. It is perhaps
worth noting also that these fields have had an enormous impact on the direction of NLA;
in particular, the recent and very active field of randomized algorithms was born out of
needs arising from these extremely active areas.
In fact NLA is a truly exciting field that utilises a huge number of ideas from different
branches of mathematics (e.g. matrix analysis, approximation theory, and probability) to
solve problems that actually matter in real-world applications. Having said that, the number
of prerequisites for taking the course is the bare minimum; essentially a basic understanding
of the fundamentals of linear algebra would suffice (and the first lecture will briefly review
the basic facts). If you’ve taken the Part A Numerical Analysis course you will find it helpful,
but again, this is not necessary.
The field of NLA has been blessed with many excellent books on the subject. These notes
try to be self-contained, but the following references will definitely help. There is a lot to
learn—literally as much as you want to.
– classic, encyclopedic
Horn and Johnson (12) [20]: Matrix Analysis (& Topics in Matrix Analysis (86) [19]) – impressive content
H. C. Elman, D. J. Silvester, A. J. Wathen (14) [11]: Finite Elements and Fast Iterative Solvers
This course covers the fundamentals of NLA. We first discuss the singular value decom-
position (SVD), which is a fundamental matrix decomposition whose importance is only
growing. We then turn to linear systems and eigenvalue problems. Broadly, we will cover
these topics in that order. Lectures 1–4 cover the fundamentals of matrix theory, in particular
the SVD, its properties and applications.
This document consists of 16 sections. Very roughly speaking, one section corresponds
to one lecture (though this will not be followed strictly at all).
Contents
0 Introduction: why Ax = b and Ax = λx?

1 Basic LA review
  1.1 Warmup exercise
  1.2 Structured matrices
  1.3 Matrix eigenvalues: basics
  1.4 Computational complexity (operation counts) of matrix algorithms
  1.5 Vector norms
  1.6 Matrix norms
  1.7 Subspaces and orthonormal matrices

2 SVD: the most important matrix decomposition
  2.1 (Some of the many) applications and consequences of the SVD: rank, column/row space, etc.
  2.2 SVD and symmetric eigenvalue decomposition
  2.3 Uniqueness etc.

5 Linear systems Ax = b
  5.1 Solving Ax = b via LU
  5.2 Pivoting
  5.3 Cholesky factorisation for A ≻ 0

7 Numerical stability
  7.1 Floating-point arithmetic
  7.2 Conditioning and stability
  7.3 Numerical stability; backward stability
  7.4 Matrix condition number
    7.4.1 Backward stable + well conditioned = accurate solution
  7.5 Stability of triangular systems
    7.5.1 Backward stability of triangular systems
    7.5.2 (In)stability of Ax = b via LU with pivots
    7.5.3 Backward stability of Cholesky for A ≻ 0
  7.6 Matrix multiplication is not backward stable
  7.7 Stability of Householder QR
    7.7.1 (In)stability of Gram-Schmidt

8 Eigenvalue problems
  8.1 Schur decomposition
  8.2 The power method for finding the dominant eigenpair Ax = λx
    8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank
    8.2.2 Shifted inverse power method

9 The QR algorithm
  9.1 QR algorithm for Ax = λx
  9.2 QR algorithm preprocessing: reduction to Hessenberg form
    9.2.1 The (shifted) QR algorithm in action
    9.2.2 (Optional) QR algorithm: other improvement techniques

10 QR algorithm continued
  10.1 QR algorithm for symmetric A
  10.2 Computing the SVD: Golub-Kahan's bidiagonalisation algorithm
  10.3 (Optional but important) QZ algorithm for generalised eigenvalue problems
  10.4 (Optional) Tractable eigenvalue problems

14 Randomized algorithms in NLA
  14.1 Randomized SVD by Halko-Martinsson-Tropp
  14.2 Pseudoinverse and projectors
  14.3 HMT approximant: analysis (down from 70 pages!)
  14.4 Tool from RMT: Rectangular random matrices are well conditioned
  14.5 Precise analysis for HMT (nonexaminable)
  14.6 Generalised Nyström
  14.7 MATLAB code
Notation. For convenience below we list the notation that we use throughout the course.
λ(A): the set of eigenvalues of A. If a natural ordering exists (e.g. A is symmetric so
λ is real), λi (A) is the ith (largest) eigenvalue.
σ(A): the set of singular values of A. σi (A) always denotes the ith largest singular
value. We often just write σi .
diag(A): the vector of diagonal entries of A.
We use capital letters for matrices, lower-case for vectors and scalars. Unless otherwise
specified, A is a given matrix, b is a given vector, and x is an unknown vector.
‖·‖ denotes a norm for a vector or matrix. ‖·‖_2 denotes the spectral (or 2-) norm,
‖·‖_F the Frobenius norm. For vectors, to simplify notation we sometimes use ‖·‖ for
the spectral norm (which for vectors is the familiar Euclidean norm).
Span(A) denotes the span or range of the column space of A. This is the subspace
consisting of vectors of the form Ax.
We reserve Q for an orthonormal (or orthogonal) matrix. L (U) often denotes a lower
(upper) triangular matrix.
I always denotes the identity matrix. In is the n × n identity when the size needs to
be specified.
A^T is the transpose of A: (A^T)_{ij} = A_{ji}. A^* is the (complex) conjugate
transpose: (A^*)_{ij} = Ā_{ji}.
We sometimes use the following shorthand: alg for algorithm, eigval for eigenvalue, eigvec
for eigenvector, singval for singular value, singvec for singular vector, and iff for "if and only
if".
1. Linear system
Ax = b.
Given a (often square m = n but we will discuss m > n extensively, and m < n briefly
at the end) matrix A ∈ Rm×n and vector b ∈ Rm , find x ∈ Rn such that Ax = b.
2. Eigenvalue problem
Ax = λx.
Given a (always!) square matrix A ∈ R^{n×n}, find λ ∈ C (eigenvalues, eigvals) and
x ∈ C^n (eigenvectors, eigvecs).
We’ll see many variants of these problems; one worthy of particular mention is the SVD,
which is related to eigenvalue problems but given its ubiquity has a life of its own. (So if
there’s a third problem we solve in NLA, it would definitely be the SVD.)
It is worth discussing why we care about linear systems and eigenvalue problems.
The primary reason is that many (in fact most) problems in scientific computing (and
even machine learning) boil down to linear problems:
Because that's often the only way to deal with the scale of problems we face today
(and in the future)!
For linear problems, so much is understood and reliable algorithms are available.
Footnote 1: There are exciting recent developments involving eigenvalue problems for rectangular matrices, but these
are outside the scope of this course.
Footnote 2: A pertinent quote is Richard Feynman's "Linear systems are important because we can solve them".
Because we can solve them, we do all sorts of tricks to reduce difficult problems to linear systems!
A related important question is where and how these problems arise in real-world prob-
lems.
Let us mention a specific context that is relevant in data science: optimisation. Suppose
one is interested in minimising a high-dimensional real-valued function f(x) : R^n → R where
n ≫ 1.
A successful approach is to try and find critical points, that is, points x∗ where ∇f (x∗ ) =
0. Mathematically, this is a non-linear high-dimensional root-finding problem of finding
x ∈ Rn such that ∇f (x) =: F (x) = 0 (the vector 0 ∈ Rn ) where F : Rn → Rn . One of
the most commonly employed methods for this task is Newton’s method (which some of you
have seen in Prelims Constructive Mathematics). This boils down to
Newton's method for F(x) = 0, F : R^n → R^n nonlinear: starting from an initial guess x^(0), iterate
x^(i+1) = x^(i) − J(x^(i))^{−1} F(x^(i)),   i = 0, 1, 2, . . . ,
where J(x^(i)) ∈ R^{n×n} is the Jacobian of F at x^(i).
Note that the main computational task is to find the vector y = J^{−1}F(x^(i)), i.e., to solve
the linear system Jy = F(x^(i)) for the vector y.
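To make this concrete, here is a minimal MATLAB sketch (not from the notes; the nonlinear function F and its Jacobian J below are made-up examples) showing that each Newton step amounts to one linear solve:

% Newton's method for F(x) = 0: each step solves the linear system J*y = F(x).
% Hypothetical example: F(x) = A*x + 0.1*sin(x) - b, so J(x) = A + 0.1*diag(cos(x)).
n = 100;
A = eye(n) + randn(n)/(2*sqrt(n));
b = randn(n, 1);
F = @(x) A*x + 0.1*sin(x) - b;
J = @(x) A + 0.1*diag(cos(x));
x = zeros(n, 1);
for i = 1:10
    y = J(x) \ F(x);          % the linear system: the main cost per step
    x = x - y;                % Newton update
end
norm(F(x))                    % tiny if Newton has converged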
What about eigenvalue problems Ax = λx? Google's PageRank is a famous application
(we will cover this if we have time). Another example is the Schrödinger equation of physics
and chemistry. Sometimes a nonconvex optimisation problem can be solved via an eigenvalue
problem.
Equally important is principal component analysis (PCA), which can be used for data
compression. This is more tightly connected to the SVD.
Other sources of linear algebra problems include differential equations, optimisation,
regression, data analysis, ...
1 Basic LA review
We start with a review of key LA facts that will be used in the course. Some will be trivial
to you while others may not. You might also notice that some facts that you have learned in
a core LA course will not be used in this course. For example we will never deal with finite
fields, and determinants only play a passing role.
Here is a list: the following are equivalent.
1. A is nonsingular.
6. rank(A) = n.
13. det(A) ≠ 0.
14. An n × n matrix A^{−1} exists such that A^{−1}A = I_n. (This, by the way, implies (iff) AA^{−1} = I_n,
a nontrivial fact.)
Orthogonal: AA^T = A^TA = I (Unitary: AA^* = A^*A = I). Note that for square
matrices, A^TA = I implies AA^T = I.
Normal: A^TA = AA^T. (Here it's better to discuss the complex case A^*A = AA^*: this
is a necessary and sufficient condition for diagonalisability under a unitary transformation,
i.e., A = UΛU^* where Λ is diagonal and U is unitary.)
(Upper) Hessenberg: A_{ij} = 0 if i > j + 1. (We will see this structure often.)
Sparse: most elements are zero. nnz(A) denotes the number of nonzero elements of A.
Matrices that are not sparse are called dense.
Other structures: Hankel, Toeplitz, circulant, symplectic, ... (we won't use these in this
course)
The eigenvalues are the roots of the characteristic polynomial det(λI − A) = 0:
det(λI − A) = ∏_{i=1}^n (λ − λ_i).
By Galois theory, eigenvalues cannot in general be computed exactly (in closed form,
in finitely many operations) for matrices with n ≥ 5. But we still want to compute
them! In this course we will (among other things) explain how this is done in practice
by the QR algorithm, one of the greatest hits of the field.
Norms
We will need a tool (or metric) to measure how big a vector or matrix is. Norms give us a
means to achieve this. Surely you have already seen some norms (e.g. the vector Euclidean
norm). We will discuss a number of norms for vectors and matrices that we will use in the
upcoming lectures.
Of particular importance are the three cases p = 1, 2, ∞ (the vector p-norm is
‖x‖_p = (∑_i |x_i|^p)^{1/p}, with ‖x‖_∞ = max_i |x_i|). In this course we will see p = 2
the most often.
A norm needs to satisfy the following axioms: ‖x‖ ≥ 0, with ‖x‖ = 0 iff x = 0;
‖αx‖ = |α|‖x‖ for any scalar α; and the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖.
The vector p-norms are related by
(1/√n)‖x‖_2 ≤ ‖x‖_∞ ≤ ‖x‖_2,
(1/√n)‖x‖_1 ≤ ‖x‖_2 ≤ ‖x‖_1,
(1/n)‖x‖_1 ≤ ‖x‖_∞ ≤ ‖x‖_1.
Note that with the 2-norm, ‖Ux‖_2 = ‖x‖_2 for any unitary U and any x ∈ C^n. Norms
with this property are called unitarily invariant.
The 2-norm is also induced by the inner product: ‖x‖_2 = √(x^Tx). An important property
of inner products is the Cauchy-Schwarz inequality |x^Ty| ≤ ‖x‖_2‖y‖_2 (which can be directly
proved, but is perhaps best proved in a general setting; see the footnote below). When we just
say ‖x‖ for a vector we mean the 2-norm.
(The Frobenius norm, introduced below, can be seen as the 2-norm of the vectorisation of a matrix.)
4
Just in case, here’s a proof: for any scalar c, kx − cyk2 = kxk2 − 2cxT y + c2 kyk2 . This is minimised wrt
xT y 2 (xT y)2
c at c = kyk 2 with minimiser kxk − kyk2 . Since this must be ≥ 0, the CS inequality follows.
Footnote 4: Just in case, here's a proof: for any scalar c, ‖x − cy‖² = ‖x‖² − 2c x^Ty + c²‖y‖². This is minimised
w.r.t. c at c = x^Ty/‖y‖², with minimum value ‖x‖² − (x^Ty)²/‖y‖². Since this must be ≥ 0, the CS inequality follows.

Trace norm = nuclear norm: ‖A‖_* = ∑_{i=1}^{min(m,n)} σ_i(A). (This is the maximum of trace(Q^TA)
over orthonormal Q, hence the name.)
Norm axioms hold for each of these. Useful inequalities include the following, for A ∈ C^{m×n}
(exercise; it is instructive to study the cases where each of these equalities holds):
(1/√n)‖A‖_∞ ≤ ‖A‖_2 ≤ √m‖A‖_∞,
(1/√m)‖A‖_1 ≤ ‖A‖_2 ≤ √n‖A‖_1,
‖A‖_2 ≤ ‖A‖_F ≤ √(min(m, n))‖A‖_2.
A useful property of p-norms is that they are subordinate, i.e., ‖AB‖_p ≤ ‖A‖_p‖B‖_p
(problem sheet). Note that not all norms satisfy this: e.g. with the max norm
‖A‖_max = max_{i,j}|A_{ij}|, taking A = [1, 1] and B = [1, 1]^T one has ‖AB‖_max = 2 but
‖A‖_max = ‖B‖_max = 1.
Lemma 1.1 Let V_1 ∈ R^{n×d_1} and V_2 ∈ R^{n×d_2} each have linearly independent column vectors.
If d_1 + d_2 > n, then there is a nonzero intersection between the two subspaces S_1 = span(V_1) and
S_2 = span(V_2); that is, there is a nonzero vector x ∈ R^n such that x = V_1c_1 = V_2c_2 for some
vectors c_1, c_2.

Proof: Consider the matrix M := [V_1, V_2], which is of size n × (d_1 + d_2). Since d_1 + d_2 > n
by assumption, this matrix has a right null vector c ≠ 0 such that Mc = 0. Splitting
c = [c_1; −c_2], we have the required result.
Let us conclude this review with a list of useful results that will be helpful. Proofs (or
counterexamples) should be straightforward.
(AB)^T = B^TA^T.
(I − X)^{−1} = I + X + X² + X³ + · · · (when the series converges, e.g. if ‖X‖ < 1).
For a square n × n matrix A, the trace is Trace(A) = ∑_{i=1}^n A_{ii} (sum of the diagonal
entries). For any X, Y such that XY is square, Trace(XY) = Trace(YX) (quite useful). For
B ∈ R^{m×n}, we have ‖B‖_F² = ∑_i ∑_j |B_{ij}|² = Trace(B^TB).
Symmetric eigenvalue decomposition: Any symmetric matrix A ∈ Rn×n has the
decomposition
A = V ΛV T (1)
where V is orthogonal, V T V = In = V V T , and Λ = diag(λ1 , . . . , λn ) is a diagonal
matrix of eigenvalues.
λi are the eigenvalues, and V is the matrix of eigenvectors (its columns are the eigenvec-
tors).
The decomposition (1) makes two remarkable claims: the eigenvectors can be taken to
be orthogonal (which is true more generally of normal matrices s.t. A∗ A = AA∗ ), and the
eigenvalues are real.
It is worth reminding you that eigenvectors are not uniquely determined: (assuming
they are normalised s.t. the 2-norm is 1) their signs can always be flipped. (And actually
more, when there are eigenvalues that are multiple λi = λj ; in this case the eigenvectors
span a subspace whose dimension matches the multiplicity. For example, any vector is an
eigenvector of the identity matrix I).
Now here is the protagonist of this course: Be sure to spend dozens (if not hundreds) of
hours thinking about it!
Theorem 2.1 (Singular Value Decomposition (SVD)) Any matrix A ∈ R^{m×n} has the
decomposition
A = UΣV^T. (2)
Here U^TU = V^TV = I_n (assuming m ≥ n for definiteness), Σ = diag(σ_1, . . . , σ_n), σ_1 ≥ σ_2 ≥
· · · ≥ σ_n ≥ 0.
The σ_i (always nonnegative) are called the singular values of A. The rank of A is the number
of positive singular values. The columns of U are called the left singular vectors, and the
columns of V the right singular vectors.
Proof: For the SVD (m ≥ n, and assume full rank σ_n > 0 for simplicity): take the Gram matrix
A^TA (symmetric) and its eigendecomposition A^TA = VΛV^T with V orthogonal. Λ is non-
negative, and (AV)^T(AV) = Λ =: Σ² is diagonal, so AVΣ^{−1} =: U is orthonormal. Right-multiplying
U = AVΣ^{−1} by ΣV^T gives UΣV^T = AVV^T = A.
Footnote 6: I like to think this is the shortest proof out there.
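As a quick numerical check (my own sketch, not part of the notes), the construction in the proof can be carried out directly in MATLAB; in practice one would simply call svd(A):

% Construct an SVD of A via the Gram-matrix argument in the proof.
m = 8; n = 5;
A = randn(m, n);                 % random full-rank test matrix
[V, Lam] = eig(A'*A);            % A'*A = V*Lam*V', V orthogonal
[lam, idx] = sort(diag(Lam), 'descend');
V = V(:, idx);
Sigma = diag(sqrt(lam));         % singular values = square roots of the eigenvalues
U = A*V/Sigma;                   % U = A*V*inv(Sigma), orthonormal columns
norm(A - U*Sigma*V', 'fro')      % ~1e-15: A = U*Sigma*V'
norm(U'*U - eye(n), 'fro')       % ~1e-15: U has orthonormal columns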
It is also worth mentioning the "full" SVD: A = [U, U_⊥] [Σ; 0] V^T, where [U, U_⊥] ∈ R^{m×m} is
square and orthogonal. Essentially this follows from the (thin) SVD (2) by filling in the U in (2)
with its orthogonal complement U_⊥ (whose construction can be done via the Householder
QR factorisation of Section 6).
If rank(A) = r, then A = A_r := U_rΣ_rV_r^T.
To see this, note from the SVD that A = ∑_{i=1}^n σ_iu_iv_i^T = ∑_{i=1}^r σ_iu_iv_i^T (since
σ_{r+1} = · · · = σ_n = 0), and so A = U_rΣ_rV_r^T where U_r = [u_1, . . . , u_r], V_r = [v_1, . . . , v_r],
and Σ_r is the leading r × r submatrix of Σ.
Column space (Span(A), the linear subspace spanned by vectors of the form Ax): the span of
U = [u_1, . . . , u_r], often denoted Span(U).
Null space Null(A): the span of v_{r+1}, . . . , v_n, as Av = 0 for these vectors. Null(A) is empty
(or rather just the 0 vector) if m ≥ n and r = n. (When m < n the full SVD is needed to
describe Null(A).)
Aside from these and other applications, the SVD is also a versatile theoretical tool. Very
often, a good place to start in proving a fact about matrices is to first consider its SVD; you
will see this many times in this course. For example, the SVD can give solutions immediately
for linear systems and least-squares problems, though there are more efficient ways to solve
these problems.
2.2 SVD and symmetric eigenvalue decomposition
As mentioned above, the SVD and the eigenvalue decomposition for symmetric matrices are
closely connected. Here are some results that highlight the connections between A = U ΣV T
and symmetric eigenvalue decomposition. (We assume m ≥ n for definiteness)
U is an eigenvector matrix (for the nonzero eigenvalues) of AA^T (be careful with sign flips; see
below).
σ_i(A) = √(λ_i(A^TA)) for i = 1, . . . , n.
If A is symmetric, its singular values σ_i(A) are the absolute values of its eigenvalues
λ_i(A), i.e., σ_i(A) = |λ_i(A)|.
Exercise: What if A is unitary, skew-symmetric, normal, or triangular? (problem
sheet)
To discuss this topic we need some preparation. We will make heavy use of the spectral
norm of matrices. We start with an important characterisation of kAk2 in terms of the
singular value(s), which we previously stated but did not prove.
Proposition 3.1
‖A‖_2 = max_{x≠0} ‖Ax‖_2/‖x‖_2 = max_{‖x‖_2=1} ‖Ax‖_2 = σ_1(A).

Proof: Use the SVD: for any x with unit norm ‖x‖_2 = 1,
‖Ax‖_2 = ‖UΣV^Tx‖_2
       = ‖ΣV^Tx‖_2   (by unitary invariance)
       = ‖Σy‖_2   with ‖y‖_2 = 1
       = √(∑_{i=1}^n σ_i² y_i²)
       ≤ √(∑_{i=1}^n σ_1² y_i²) = σ_1‖y‖_2 = σ_1,
with equality attained at x = v_1 (so that y = e_1).
A ≈ A_r = U_rΣ_rV_r^T
We immediately see that a low-rank approximation (when possible) is beneficial in terms of
the storage cost when r ≪ m, n. Instead of storing mn entries for A, we can store the entries of
U_r, Σ_r, V_r ((m + n + 1)r entries) to keep the low-rank factorisation without losing information.
Low-rank approximation can also bring computational benefits. For example, in order to
compute Ax for a vector x, by noting that A_rx = U_r(Σ_r(V_r^Tx)), one needs only O((m + n)r)
operations instead of O(mn). The utility and prevalence of low-rank matrices in data science
is remarkable.
Here is the solution to (3) (the best rank-r approximation problem): the truncated SVD, defined via
A_r = ∑_{i=1}^r σ_iu_iv_i^T (= U_rΣ_rV_r^T).
This is the matrix obtained by truncating (removing) the trailing terms in the expansion
A = ∑_{i=1}^n σ_iu_iv_i^T. Pictorially, A is the sum of the rank-1 matrices
σ_1u_1v_1^T + σ_2u_2v_2^T + · · · + σ_nu_nv_n^T, and A_r keeps only the first r terms
σ_1u_1v_1^T + · · · + σ_ru_rv_r^T.
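Here is a minimal MATLAB sketch (an illustration, not from the notes) forming A_r from the SVD; by Theorem 3.1 below, the 2-norm error equals σ_{r+1}:

% Truncated SVD A_r and its approximation error (= sigma_{r+1}, Theorem 3.1).
m = 60; n = 40; r = 5;
A = randn(m, r)*randn(r, n) + 1e-3*randn(m, n);   % nearly rank-r test matrix
[U, S, V] = svd(A);
Ar = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';               % truncated SVD, rank r
norm(A - Ar)            % spectral-norm error
S(r+1, r+1)             % equals sigma_{r+1}(A)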
In particular, we have

Theorem 3.1 For any A ∈ R^{m×n} with singular values σ_1(A) ≥ σ_2(A) ≥ · · · ≥ σ_n(A) ≥ 0,
and any nonnegative integer r < min(m, n), the truncated SVD A_r solves (3):
min_{rank(B)≤r} ‖A − B‖_2 = ‖A − A_r‖_2 = σ_{r+1}(A).

Optimality holds for any unitarily invariant norm: that is, the norms in (3) can be
replaced by e.g. the Frobenius norm. This is surprising, as the low-rank approximation
problem min_{rank(B)≤r} ‖A − B‖ does depend on the choice of the norm (and for many
problems, including the least-squares problem in Section 6.5, the norm choice has a
significant effect on the solution). The proof of this fact is nonexaminable, but if
curious see [20] for a complete proof.
A prominent application of low-rank approximation is PCA (principal component anal-
ysis) in statistics and data science.
Many matrices have explicit or hidden low-rank structure (nonexaminable, but see
e.g. [34]).
Footnote 7: I presume you've seen the big-Oh notation before—f(n) = O(n^k) means there exists a constant C
s.t. f(n)/n^k ≤ C for all sufficiently large n. In NLA it is a convenient way to roughly measure the operation
count.
Footnote 8: If r ≥ min(m, n) then we can simply take A_r = A.
Proof of Theorem 3.1:
2. It follows that B_2^T (and hence B) has a null space of dimension at least n − r. That
is, there exists an orthonormal matrix W ∈ C^{n×(n−r)} s.t. BW = 0. Then ‖A − B‖_2 ≥
‖(A − B)W‖_2 = ‖AW‖_2 = ‖UΣ(V^TW)‖_2. (Why does the first inequality hold?)
[Figure: the original image and its rank-1, rank-5 (and higher-rank) truncated-SVD approximations.]
We see that as the rank is increased the image becomes finer and finer. At rank 50 it is
fair to say the image looks almost identical to the original. The original matrix is 500 × 500,
so we still achieve a significant amount of data compression in the matrix with r = 50.
Theorem 4.1 The ith largest eigenvalue λ_i of a symmetric matrix A ∈ R^{n×n} satisfies (below
x ≠ 0)
λ_i(A) = max_{dim S=i} min_{x∈S} (x^TAx)/(x^Tx) = min_{dim S=n−i+1} max_{x∈S} (x^TAx)/(x^Tx). (5)

It may take some time to get the hang of what these statements mean. One helpful way to look
at it is perhaps to note that inside the maximum in (6) the expression is min_{x∈S,‖x‖_2=1} ‖Ax‖_2 =
min_{Q^TQ=I_i,‖y‖_2=1} ‖AQy‖_2 = σ_min(AQ) = σ_i(AQ), where span(Q) = S. The Courant–Fischer theorem says
σ_i(A) is equal to the maximum possible value of this over all subspaces S of dimension i.
Proof: We will prove (6). A proof for (5) is analogous and a recommended exercise.
For the singular values of any matrix A, a Weyl-type bound σ_i(A + E) ∈ σ_i(A) + [−‖E‖_2, ‖E‖_2]
holds, so singular values are insensitive to perturbations. Eigenvalues of nonsymmetric matrices
can behave very differently: for the n × n Jordan block J with λ(J) = 1 (n copies), an
ε-perturbation E can give |λ(J + E) − 1| ≈ ε^{1/n}. For example when n = 100, a
10^{−100} perturbation in J would result in a 0.1 perturbation in all the eigenvalues!
(nonexaminable) This is pretty much the worst-case situation. In the generic case where
the matrix is diagonalizable, A = XΛX^{−1}, with an ε-perturbation the eigenvalues get per-
turbed by O(cε), where the constant c depends on the so-called condition number κ_2(X) of
the eigenvector matrix X (see Section 7.2, and [8]).
For the singular values of a block matrix [A_1, A_2],
σ_i([A_1, A_2]) ≥ max(σ_i(A_1), σ_i(A_2)).

Proof: The left-hand side is max_{dim S=i} min_{[x_1;x_2]∈S, ‖[x_1;x_2]‖_2=1} ‖[A_1, A_2][x_1; x_2]‖_2, while
σ_i(A_1) is the same expression with the maximum taken only over subspaces S of dimension i
contained in range([I; 0]) (i.e., with x_2 = 0). Since
the latter imposes restrictions on the subspaces S over which the maximum is taken, the former is at least
as big.
SVD: A = UΣV^T.
Eigenvalue decomposition (when A is diagonalisable): A = XΛX^{−1}.
 – Normal: X unitary, X^*X = I.
 – Symmetric: X unitary and Λ real.
Jordan decomposition: A = XJX^{−1}, where J is block diagonal with Jordan blocks having λ_i
on the diagonal and 1 on the superdiagonal.
5 Linear systems Ax = b
We now (finally) start our discussion of direct algorithms in NLA. The fundamental idea is
to factorise a matrix into a product of two (or more) simpler matrices. This foundational
idea has been named one of the top 10 algorithms of the 20th century [9].
The sophistication of the state-of-the-art implementations of direct methods is simply
astonishing. For instance, a 100 × 100 dense linear system or eigenvalue problem can be
solved in less than a millisecond on a standard laptop. Imagine solving it by hand!
We start with solving linear systems, unquestionably the most important problem in
NLA (for applications). In some sense we needn’t spend too much time here, as you must
have seen much of the material (e.g. Gaussian elimination) before. However, the description
of the LU factorisation given below will likely be different from the one that you have seen
before. We have chosen a nonstandard description as it reveals its connection to low-rank
approximation.
Let A ∈ R^{n×n}. Suppose we can decompose (or factorise) A into the product
A = LU
of a lower triangular matrix L and an upper triangular matrix U (here and below, ∗
denotes entries that are possibly nonzero).
Here we've assumed a ≠ 0; we'll discuss the case a = 0 later. Repeating the process gives
A = L_1U_1 + L_2U_2 + L_3U_3 + L_4U_4 + L_5U_5 = [L_1, L_2, . . . , L_5] [U_1; U_2; . . . ; U_5] = LU,
where each L_i is a column vector whose first i − 1 entries are zero, and each U_i is a row vector
whose first i − 1 entries are zero. Collecting the L_i as the columns of L and the U_i as the
rows of U, L is lower triangular and U is upper triangular:
an LU factorisation as required.
Note the above expression for A; clearly the Li Ui factors are rank-1 matrices; the LU
factorisation can be thought of as writing A as a sum of (structured) rank-1 matrices.
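The following MATLAB sketch (illustrative only, assuming nonzero pivots; real codes use pivoting, and lu(A) in practice) carries out exactly this rank-1 elimination process:

% LU via rank-1 (outer-product) eliminations, no pivoting.
n = 5;
A = randn(n) + n*eye(n);        % test matrix chosen so pivots are unlikely to vanish
L = zeros(n); U = zeros(n); R = A;
for k = 1:n
    L(:,k) = R(:,k) / R(k,k);   % k-th column of L (unit diagonal)
    U(k,:) = R(k,:);            % k-th row of U
    R = R - L(:,k)*U(k,:);      % subtract the rank-1 piece L_k*U_k
end
norm(A - L*U, 'fro')            % ~1e-15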
Triangular solves are always backward stable: e.g. (L + ∆L)ŷ = b (see Higham's book).
The computational cost is:
For LU: (2/3)n³ flops (floating-point operations).
5.2 Pivoting
Above we've assumed the diagonal element (pivot) a ≠ 0—when a = 0 we are in trouble! In
fact not every matrix has an LU factorisation. For example, there is no LU factorisation of
[0 1; 1 0]. We need a remedy. In practice, a remedy is needed whenever a is small.
The idea is to permute the rows, so that the largest element of the (first active) column is
brought to the pivot position. This process is called pivoting (sometimes partial pivoting, to emphasise
the difference from complete pivoting, wherein both rows and columns are permuted). This
results in P A = LU, where P is a permutation matrix: an orthogonal matrix with only 1s and
0s (every row/column has exactly one 1); applying P reorders the rows (P A) or
columns (AP).
Thus solving Ax_i = b_i for i = 1, . . . , k requires (2/3)n³ + O(kn²) operations instead of O(kn³).
When a = 0, remedy: pivot, i.e., permute rows such that the largest entry of the first (active) column
is at the top. ⇒ P A = LU, P: permutation matrix.
For Ax = b, solve P Ax = P b ⇔ LUx = P b.
In fact, one can show that any nonsingular matrix A has a pivoted LU factorisation
(proof: exercise). This means that any linear system that is computationally feasible can be
solved by pivoted LU factorisation.
Footnote 13: While we won't discuss complete pivoting further, you might be interested to know that when LU with
complete pivoting is applied to a low-rank matrix (with rapidly decaying singular values), one tends to find
a good low-rank approximation, almost as good as the truncated SVD.
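A minimal MATLAB sketch using the built-in lu (the test matrix and sizes are arbitrary), showing how one factorisation is reused for several right-hand sides:

% Solve A*x = b via pivoted LU, reusing the factors for several b's.
n = 1000; k = 5;
A = randn(n); B = randn(n, k);           % k right-hand sides
[L, U, P] = lu(A);                       % P*A = L*U: one O(n^3) factorisation
X = U \ (L \ (P*B));                     % two triangular solves per b: O(n^2) each
norm(A*X - B, 'fro') / norm(B, 'fro')    % small residual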
Even with pivoting, unstable examples exist (Section 7), but almost always stable in
practice and used everywhere.
A = R_1^T R_1 + [0 0; 0 A_2], (7)
where R_1 is a 1 × n row vector (so the first term R_1^T R_1 is rank one, matching A in its first
row and column) and the second term, which is nonzero only in its trailing (n − 1) × (n − 1)
block, is also positive definite on that block.
Notes:
(1/3)n³ flops, half as many as LU.
The diagonal of R is no longer all 1s (clearly).
Indefinite case: when A = A^* but A is not PSD (i.e., negative eigenvalues are present),
there exists a factorisation A = LDL^*, where D is diagonal (when A ∈ R^{n×n}, D can have
2 × 2 diagonal blocks) and L has 1s on the diagonal. This is often called the "LDL^T factorisation".
Footnote 14: Positive definite matrices are such an important class of matrices, and so full of interesting mathematical
properties, that a nice book has been written about them [4].
It’s not easy at this point to see why the second (red) term above is also PD; one way
to see this is via the uniqueness of the Cholesky factorisation; see next section.
Therefore, roughly speaking, symmetric linear systems can be solved with half the effort of
a general non-symmetric system.
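A minimal MATLAB sketch using the built-in chol on a made-up SPD matrix:

% Solving a symmetric positive definite system via Cholesky.
n = 500;
B = randn(n); A = B'*B + n*eye(n);   % construct an SPD test matrix
b = randn(n, 1);
R = chol(A);                         % A = R'*R, R upper triangular, ~(1/3)n^3 flops
x = R \ (R' \ b);                    % two triangular solves
norm(A*x - b) / norm(b)              % small residual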
A = QR
More precisely, the algorithm performs the following: q_1 = a_1/‖a_1‖, then q̃_2 = a_2 − q_1q_1^Ta_2,
q_2 = q̃_2/‖q̃_2‖ (orthogonalise and normalise), and we
repeat for j = 3, . . . , n: q̃_j = a_j − ∑_{i=1}^{j−1} q_iq_i^Ta_j, q_j = q̃_j/‖q̃_j‖.
This gives a QR factorisation! To see this, let r_{ij} = q_i^Ta_j (i ≠ j) and r_{jj} = ‖a_j − ∑_{i=1}^{j−1} r_{ij}q_i‖,
so that
q_1 = a_1/r_{11},
q_2 = (a_2 − r_{12}q_1)/r_{22},
q_j = (a_j − ∑_{i=1}^{j−1} r_{ij}q_i)/r_{jj},
or equivalently
a_1 = r_{11}q_1,
a_2 = r_{12}q_1 + r_{22}q_2,
a_j = r_{1j}q_1 + r_{2j}q_2 + · · · + r_{jj}q_j.
But this isn’t the recommended way to compute the QR factorisation, as it’s numeri-
cally unstable; see Section 7.7.1 and [17, Ch. 19,20].
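Here is a short MATLAB sketch of the (classical) Gram-Schmidt process above (for illustration only—see Section 7.7.1 on its instability; in practice use qr):

% Classical Gram-Schmidt QR of A = [a_1,...,a_n].
m = 8; n = 5;
A = randn(m, n);
Q = zeros(m, n); R = zeros(n, n);
for j = 1:n
    v = A(:,j);
    for i = 1:j-1
        R(i,j) = Q(:,i)' * A(:,j);   % r_ij = q_i' * a_j
        v = v - R(i,j) * Q(:,i);     % orthogonalise against q_i
    end
    R(j,j) = norm(v);
    Q(:,j) = v / R(j,j);             % normalise
end
norm(A - Q*R, 'fro'), norm(Q'*Q - eye(n), 'fro')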
A Householder reflector is a matrix of the form H = I − 2vv^T with ‖v‖_2 = 1. H is orthogonal
and symmetric: H^TH = H² = I. Its eigenvalues are 1 (n − 1 copies) and −1 (1 copy).
For any given u, w ∈ R^n s.t. ‖u‖ = ‖w‖ and u ≠ w, taking v = (w − u)/‖w − u‖ in
H = I − 2vv^T gives Hu = w (⇔ u = Hw, thus 'reflector': H reflects across the hyperplane
(u − w)^Tx = 0).
It follows that by choosing the vector v appropriately, one can perform a variety of
operations on a given vector x. A primary example is the particular Householder reflector
H = I − 2vv^T,  v = (x − ‖x‖e)/‖x − ‖x‖e‖,  e = [1, 0, . . . , 0]^T,
for which Hx = ‖x‖e, i.e., H maps x to a multiple of the first unit vector.
Applying such reflectors repeatedly triangularises a matrix: H_1A = (I − 2v_1v_1^T)A has zeros
below the (1,1) entry in its first column; H_2H_1A = (I − 2v_2v_2^T)H_1A additionally has zeros
below the (2,2) entry in its second column; H_3H_2H_1A has its first three columns upper
triangular; and after n steps H_n · · · H_2H_1A = R is upper triangular.
The algorithm gives a constructive proof for the existence of a full QR A = QR. It
also gives, for example, a proof of the existence of the orthogonal complement of the
column space of an orthonormal matrix U .
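A minimal MATLAB sketch of this Householder triangularisation (illustrative; in practice one calls qr(A)):

% Householder QR: apply reflectors to triangularise A (m >= n).
m = 7; n = 4;
A = randn(m, n); R = A; Q = eye(m);
for k = 1:n
    x = R(k:m, k);
    s = sign(x(1)); if s == 0, s = 1; end
    v = x; v(1) = v(1) + s*norm(x);            % sign chosen to avoid cancellation
    v = v / norm(v);
    R(k:m, k:n) = R(k:m, k:n) - 2*v*(v'*R(k:m, k:n));  % apply H_k = I - 2vv' on the left
    Q(:, k:m)   = Q(:, k:m)   - 2*(Q(:, k:m)*v)*v';    % accumulate Q = H_1*H_2*...*H_k
end
norm(A - Q*R, 'fro'), norm(tril(R,-1), 'fro')  % A = QR, R upper triangular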
6.4 Givens rotations
Householder QR is an excellent method for computing the QR factorisation of a general
matrix, and it is widely used in practice. However, each Householder reflector acts globally–
it affects all the entries of the (active part of) the matrix. For structured matrices—such as
sparse matrices—sometimes there is a better tool to reduce the matrix to triangular form
(and other forms) by working more locally. Givens rotations give a convenient tool for this.
They are matrices of the form
c s
G= , c2 + s2 = 1.
−s c
Designed to ’zero’ one element at a time. For example to compute the QR factorisation for
an upper Hessenberg matrix, one can perform
∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
A= ∗ ∗ ∗ ∗ , G1 A = ∗ ∗ ∗ ∗ , G2 G1 A = ∗ ∗ ∗ ,
∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗
G3 G2 G1 A = ∗ ∗ ∗ , G4 G3 G2 G1 A = ∗ ∗ ∗ =: R.
∗ ∗ ∗ ∗
∗ ∗ ∗
This means A = GT1 GT2 GT3 GT4 R is the QR factorisation. (note that Givens rotations are
orthogonal but not symmetric—so its inverse is GT , not G).
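A short MATLAB sketch of Givens-based QR for a Hessenberg matrix (illustrative; matrix size and data are arbitrary):

% QR of an upper Hessenberg matrix with Givens rotations, O(n^2) total.
n = 6;
A = triu(randn(n), -1);            % upper Hessenberg test matrix
R = A; Qt = eye(n);                % Qt will hold Q' = G_{n-1}*...*G_1
for k = 1:n-1
    a = R(k,k); b = R(k+1,k);
    r = hypot(a, b); c = a/r; s = b/r;
    G = [c s; -s c];               % rotation acting on rows k, k+1
    R(k:k+1, k:n) = G * R(k:k+1, k:n);    % zeros out R(k+1,k)
    Qt(k:k+1, :)  = G * Qt(k:k+1, :);
end
norm(A - Qt'*R, 'fro'), norm(tril(R,-1), 'fro')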
When there are more equations than unknowns, a least-squares formulation typically gives a
robust solution (for instance, in the presence of noise in measurements). For example, when
there is massive data and we would like to fit the data with a simple model, we will have
many more equations than degrees of freedom. This leads to the so-called least-squares
problem which we discuss here.
Given A ∈ R^{m×n}, m ≥ n, and b ∈ R^m, a least-squares problem seeks to find x ∈ R^n such
that the residual is minimised:
min_x ‖Ax − b‖_2. (8)
Thus the goal is to try to minimise the residual ‖Ax − b‖; usually ‖Ax − b‖_2, but sometimes
e.g. ‖Ax − b‖_1 is of interest. Here we focus on ‖Ax − b‖_2.
Throughout we assume the full-rank condition rank(A) = n; this makes the solution unique
and is generically satisfied. If not, the problem has infinitely many minimisers
(and a standard practice is to look for the minimum-norm solution).
Theorem 6.1 Let A ∈ R^{m×n}, m > n, and b ∈ R^m, with rank(A) = n. The least-squares
problem min_x ‖Ax − b‖_2 has solution given by x = R^{−1}Q^Tb, where A = QR is the (thin) QR
factorisation.

Proof: Let A = [Q, Q_⊥][R; 0] = Q_F[R; 0] be the 'full' QR factorisation. Then
‖Ax − b‖_2 = ‖Q_F^T(Ax − b)‖_2 = ‖[R; 0]x − [Q^Tb; Q_⊥^Tb]‖_2,
so x = R^{−1}Q^Tb is a solution (it zeros out the top block Rx − Q^Tb, while the bottom block
Q_⊥^Tb does not depend on x).
This also gives an algorithm (which is essentially the workhorse algorithm used in practice):
compute the thin QR factorisation A = QR, then solve Rx = Q^Tb.
This process is backward stable. That is, the computed x̂ is the exact solution of
min_x ‖(A + ∆A)x − (b + ∆b)‖_2 for some small ∆A, ∆b (see Higham's book, Ch. 20).
Unlike for square systems Ax = b, here one really needs QR: LU won't do the job at all.
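A minimal MATLAB sketch of the QR-based least-squares solver (the data are random placeholders):

% Solving min ||A*x - b||_2 via the thin QR factorisation.
m = 100; n = 11;
A = randn(m, n); b = randn(m, 1);
[Q, R] = qr(A, 0);        % thin QR: A = Q*R with Q of size m-by-n
x = R \ (Q'*b);           % x = R^{-1} Q^T b
norm(A'*(A*x - b))        % ~1e-13: residual is orthogonal to range(A)
% (x = A\b uses a QR-based solver for rectangular A and gives the same result)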
One might wonder why we chose the 2-norm in the least-squares formulation (8). Unlike
for low-rank approximation (where the truncated SVD is a solution for any unitarily invariant
norm), the choice of the norm does matter and affects the properties of the solution x
significantly. For example, an increasingly popular choice of norm is the 1-norm, which
tends to promote sparsity in the quantity being minimised. In particular, if we simply
replace the 2-norm with the 1-norm, the solution tends to give a residual that is sparse.
(nonexaminable)
The least-squares solution is also characterised by the normal equations (A^TA)x = A^Tb.
In fact, more generally, given a linear least-squares approximation problem min_{p∈P} ‖f − p‖
in an inner-product space (of which (8) is a particular example; other examples include
polynomial approximation of a function with the inner product e.g. ⟨f, g⟩ = ∫_{−1}^{1} f(x)g(x)dx),
the solution is characterised by the property that the residual is orthogonal to the subspace
from which a solution is sought, that is, ⟨q, f − p⟩ = 0 for all q ∈ P. To see this, consider the
problem of approximating f with p in a subspace P. Let p_* ∈ P be such that the residual
f − p_* is orthogonal to every element of P. Then for any q ∈ P, we have ‖f − (p_* + q)‖² =
‖f − p_*‖² − 2⟨f − p_*, q⟩ + ‖q‖² = ‖f − p_*‖² + ‖q‖² ≥ ‖f − p_*‖², proving p_* is a minimiser (it
is actually unique).
Footnote 16: Just to be clear, if one uses a norm that is not unitarily invariant (e.g. 1-norm), the truncated SVD may
cease to be a solution for the low-rank approximation problem.
Since we mentioned Cholesky, let us now revisit (7) and show why the second term there
must be PSD. A PD matrix has an eigenvalue decomposition A = VD²V^T = (VDV^T)² =
(VDV^T)^T(VDV^T). Now let VDV^T = QR be the QR factorisation. Then (VDV^T)^T(VDV^T) =
R^TQ^TQR = R^TR (this establishes the existence of the Cholesky factorisation). But now the
0-structure in (7) means the first term must be rr^T, where r^T is the first row of R, and hence
the second term must be R_2^TR_2, which is PSD. Here R^T = [r, R_2^T].
Illustration We illustrate this with an example where we approximate the function f (x) =
1 + sin(10x) exp(x) (which we suppose we don’t know but we can sample it).
[Figure: least-squares polynomial fits to f on [−1, 1]; left: m = n = 11 (degree-10 polynomial, interpolation), right: m = 100, n = 11.]
We observe that with 11 (equispaced) sample points, the degree-10 polynomial is devi-
ating from the ’true function’ quite a bit. With many more sample points the situation
significantly improves. This is not a cherry-picked example but a phenomenon that can be
mathematically proved; look for “Lebesgue constant” if interested (nonexaminable).
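A sketch reproducing this kind of experiment in MATLAB (the sample points and evaluation grid are my choices, not necessarily those used for the figure):

% Degree-10 polynomial least-squares fit with m = 11 vs m = 100 samples.
f = @(x) 1 + sin(10*x).*exp(x);
n = 11;                                  % 11 coefficients: degree-10 polynomial
for m = [11 100]
    x = linspace(-1, 1, m)';             % equispaced sample points
    A = x.^(0:n-1);                      % m-by-n Vandermonde matrix
    c = A \ f(x);                        % least-squares fit (QR-based backslash)
    xx = linspace(-1, 1, 1000)';
    err = norm(xx.^(0:n-1)*c - f(xx), inf);
    fprintf('m = %3d: max error on [-1,1] = %.2e\n', m, err);
end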
7 Numerical stability
An important aspect that is very often overlooked in numerical computing is numerical sta-
bility. Very roughly, it concerns the quality of a solution obtained by a numerical algorithm,
given that computation on computers is done not exactly but with rounding errors. So far
we have mentioned stability here and there in passing, but in this section it will be our focus.
Let us first look at an example where roundoff errors play a visible role to affect the
computed solution of a linear system.
The situation is complicated. For example, let A = UΣV^T, where
U = (1/√2)[1 1; 1 −1],  Σ = [1 0; 0 10^{−15}],  V = I,
and let b = A[1; 1]. That is, we are solving a linear system whose exact solution is x = [1, 1]^T.
If we solve this in MATLAB using x = A\b, the output is [1.0000, 0.94206]^T. Quite different from
the exact solution! Did something go wrong? Did MATLAB or the algorithm fail? The
answer is NO, MATLAB and the algorithm (LU) performed just fine. This is a ramification
of ill-conditioning, not instability. Make sure that after covering this section, you will be
able to explain what happened in the above example.
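The example can be reproduced with a few lines of MATLAB (the exact digits of the output may differ slightly depending on rounding and version):

% An ill-conditioned 2x2 system solved by backslash.
U = [1 1; 1 -1]/sqrt(2);
S = diag([1 1e-15]);
V = eye(2);
A = U*S*V';
b = A*[1; 1];          % exact solution is x = [1; 1]
x = A\b                % computed solution differs noticeably from [1; 1]
cond(A)                % ~1e15: kappa_2(A) is huge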
Numbers not exactly representable with finitely many bits in base 2, including irrational num-
bers, are stored inexactly (rounded), e.g. 1/3 ≈ 0.333... The unit with which rounding
takes place is the machine precision, often denoted by ε (or u for unit roundoff). In
the most standard setting of IEEE double-precision arithmetic, u ≈ 10^{−16}.
Thus the accuracy of the final result is nontrivial to predict; in pathological cases, it is rubbish!
To get an idea of how things can go wrong and how serious the final error can be, here
are two examples with MATLAB:
For matrices, there are much more nontrivial and surprising phenomena than these. An
important (but not main) part of numerical analysis/NLA is to study the effect of rounding
errors. This topic can easily span a whole course. By far the best reference on this topic is
Higham’s book [17].
In this section we denote by f l(X) a computed version of X (fl stands for floating point).
For basic operations such as addition and multiplication, one has f l(x + y) = x + y + c
where |c| ≤ max(|x|, |y|) and f l(xy) = xy + c where |c| ≤ max(|xy|).
(backward) stable". If a problem is ill-conditioned, κ ≫ 1, and the computed solution is
not very accurate, then one should blame the problem, not the algorithm. In such cases, a
backward stable solution (see below) is usually still considered a good solution.
Notation/convention in this section: x̂ denotes a computed approximation to x (e.g. of
x = A^{−1}b). ε denotes a small term O(u), on the order of the unit roundoff/working precision; so
we write e.g. u, 10u, (m + n)u, mnu all as ε. (In other words, here we assume m, n ≪ u^{−1}.)
Consequently (in this lecture/discussion) the norm choice does not matter for the
discussion.
Ideally the error would satisfy ‖Y − Ŷ‖/‖Y‖ = ε: but this is seldom true, and often impossible!
(u: unit roundoff, ≈ 10^{−16} in standard double precision)
A good algorithm has backward stability: Ŷ = f(X + ∆X) with ‖∆X‖/‖X‖ = ε, i.e., Ŷ is the
"exact solution of a slightly wrong input".
7.4 Matrix condition number
The best way to illustrate conditioning is to look at the conditioning of linear systems. In
fact it leads to the following definition, which arises so frequently in NLA that it merits its
own name: the condition number of a matrix.
κ_2(A) = σ_max(A)/σ_min(A)  (≥ 1).
That is, κ_2(A) = σ_1(A)/σ_n(A) for A ∈ R^{m×n}, m ≥ n.

Theorem 7.1 Consider a backward stable solution for Ax = b, s.t. (A + ∆A)x̂ = b with
‖∆A‖ ≤ ε‖A‖ and κ_2(A) ≪ ε^{−1} (so ‖A^{−1}∆A‖ ≪ 1). Then we have
‖x̂ − x‖/‖x‖ ≤ εκ_2(A) + O(ε²).

In other words, even with a backward stable solution, one would only have O(εκ_2(A))
relative accuracy in the solution. If κ_2(A) ≳ ε^{−1}, the solution may be rubbish! But the
NLA view is that that's not the fault of the algorithm; the blame is on the problem being so
ill-conditioned.
Then (this is the absolute version; a relative version is possible)
‖Ŷ − Y‖ ≲ κε.
'Proof': ‖Ŷ − Y‖ = ‖f(X + ∆X) − f(X)‖ ≲ κ‖∆X‖ = κε.
Here is how to interpret the result: If the problem is well-conditioned κ = O(1), this imme-
diately implies good accuracy of the solution! But otherwise the solution might have poor
accuracy—but it is still the exact solution of a nearby problem. This is often as good as one
can possibly hope for.
The reason this is only a rule of thumb and not exactly rigorous is that conditioning only
examines the asymptotic behavior, where the perturbation is infinitesimally small. Nonethe-
less it often gives an indicative estimate for the error and sometimes we can get rigorous
bounds if we know more about the problem. Important examples include the following:
Well-conditioned linear systems Ax = b, κ_2(A) ≈ 1.
Eigenvalues of symmetric matrices (via Weyl's bound λ_i(A + E) ∈ λ_i(A) + [−‖E‖_2, ‖E‖_2]).
The backward error can be bounded componentwise.
Using the previous rule of thumb, this means ‖x̂ − x‖/‖x‖ ≲ εκ_2(R).
 – (unavoidably) a poor worst-case (and attainable) bound when ill-conditioned
 – often better with triangular systems; the behaviour of triangular linear systems
   keeps surprising experts!
7.6 Matrix multiplication is not backward stable
Here is perhaps a shock—matrix matrix multiplication, one of the most basic operations, is
in general not backward stable.
Let’s start with the basics.
But it is not true to say matrix-matrix multiplication is backward stable, which would
require fl(AB) to be equal to (A + ∆A)(B + ∆B) with ‖∆A‖ ≤ ε‖A‖, ‖∆B‖ ≤ ε‖B‖. This
may not be satisfied!
In general, what we can prove is a bound for the forward error, ‖fl(AB) − AB‖ ≤
ε‖A‖‖B‖, so ‖fl(AB) − AB‖/‖AB‖ ≤ ε min(κ_2(A), κ_2(B)) (proof: problem sheet).
Orthogonality matters for stability. A happy and important exception is when one of the
factors is orthogonal (or more generally square and well-conditioned): if Q is orthogonal, then
‖fl(QA) − QA‖/‖QA‖ ≤ ε,  ‖fl(AQ) − AQ‖/‖AQ‖ ≤ ε.
They are also backward stable:
Hence algorithms involving ill-conditioned matrices are unstable (e.g. eigenvalue decomposi-
tion of non-normal matrices, Jordan form, etc), whereas those based on orthogonal matrices
are stable. These include
Section 8 onwards treats our second big topic, eigenvalue problems. This includes discussing
algorithms shown above in boldface.
fl(H_n · · · H_1) =: Q̂^T = H_n · · · H_1 + E,  ‖E‖ = ε.
Notes:
This doesn't mean ‖Q̂ − Q‖, ‖R̂ − R‖ are small at all! Indeed Q, R are as ill-conditioned
as A [17, Ch. 20].
7.7.1 (In)stability of Gram-Schmidt
(Nonexaminable) A somewhat surprising fact is that the Gram-Schmidt algorithm, when
used for computing the QR factorisation, is not backward stable. Namely, orthogonality of
the computed Q̂ matrix is not guaranteed.
Gram-Schmidt is subtle:
8 Eigenvalue problems
First of all, recall that Ax = λx has no explicit solution (neither λ nor x); a huge difference
from Ax = b, for which x = A^{−1}b.
From a mathematical viewpoint this marks an enormous point of departure: something
that is explicitly written down vs. something that has no closed-form solution.
From a practical viewpoint the gulf is much smaller, because we have an extremely
reliable algorithm for eigenvalue problems; so robust that it is essentially bulletproof, and provably
backward stable.
Before we start describing the QR algorithm let’s discuss a few interesting properties of
eigenvalues.
For any polynomial p, there exist (infinitely many) matrices whose eigvals are roots of
p.
So no finite-step algorithm exists for Ax = λx.
Eigenvalue algorithms are necessarily iterative and approximate.
The same holds for the SVD, as σ_i(A) = √(λ_i(A^TA)).
(This marks the end of “the first half” of the course (i.e., up until the basic facts about
eigenvalues, but not its computation). This information is relevant to MMSC students.)
Algorithm 8.1 The power method for computing the dominant eigenvalue and eigenvector
of A ∈ Rn×n .
1: Let x ∈ R^n be a random initial vector (unless specified).
2: Repeat: x = Ax, x = x/‖x‖_2.
Convergence analysis: suppose A is diagonalisable (a generic assumption). We can write
x_0 = ∑_{i=1}^n c_iv_i, with Av_i = λ_iv_i and |λ_1| > |λ_2| > · · · . Then after k iterations,
x = C ∑_{i=1}^n c_i (λ_i/λ_1)^k v_i → Cc_1v_1  as k → ∞, for some scalar C.
Thus (λ, x) → (λ_1, x_1) geometrically, with linear rate |λ_2|/|λ_1|.
Notes:
Google pagerank & Markov chain are linked to the power method.
As we’ll see, the power method is basis for refined algorithms (QR algorithm, Krylov
methods (Lanczos, Arnoldi,...))
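A minimal MATLAB sketch of Algorithm 8.1 on a made-up symmetric matrix:

% Power method for the dominant eigenpair.
n = 200;
A = randn(n); A = A + A';            % symmetric test matrix (real eigenvalues)
x = randn(n, 1); x = x/norm(x);
for k = 1:300
    x = A*x; x = x/norm(x);          % one power iteration
end
lam = x'*A*x;                        % Rayleigh quotient estimate of lambda_1
norm(A*x - lam*x)                    % small if |lambda_1| > |lambda_2|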
8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank
Let us briefly digress and talk about a famous application that at least used to solve an
eigenvalue problem in order to achieve a familiar task: Google web search.
Once Google receives a user’s inquiry with keywords, it needs to rank the relevant web-
pages, to output the most important pages first.
Here, the 'importance' of websites is determined by the dominant eigenvector of the column-
stochastic matrix (i.e., column sums are all 1)
A = αP + ((1 − α)/n) 11^T,
where 11^T is the n × n matrix of all ones, P is the scaled adjacency matrix (P_{ij} = 1 if pages i, j are
connected by an edge, 0 otherwise, and then scaled s.t. column sums are 1), and α ∈ (0, 1).
[Figure: a small web graph; image from Wikipedia.]
To solve this approximately (note that high accuracy isn’t crucial here—getting the
ordering wrong isn’t the end of the world), Google does (at least in the old days) a few steps
of power method: with initial guess x0 , k = 0, 1, . . .
1. xk+1 = Axk
Algorithm 8.2 Shifted inverse power method for computing the eigenvalue and eigenvector
of A ∈ R^{n×n} closest to a prescribed value µ ∈ C.
1: Let x ∈ R^n be a random initial vector, unless specified.
2: Repeat: x := (A − µI)^{−1}x, x = x/‖x‖.
3: λ̂ = x^TAx gives an estimate for the eigenvalue.
Note that the eigenvalues of the matrix (A − µI)−1 are (λ(A) − µ)−1 .
By the same analysis as above, the shifted inverse power method converges to the eigenpair
closest to µ, with the improved linear rate |λ_{σ(1)} − µ|/|λ_{σ(2)} − µ|. Here σ denotes a permutation of
{1, . . . , n} ordering the eigenvalues by distance to µ, so that λ_{σ(1)} is the closest to µ.
µ can change adaptively with the iterations. The choice µ := x^TAx gives the Rayleigh
quotient iteration, with quadratic convergence ‖Ax^(k+1) − λ^(k+1)x^(k+1)‖ = O(‖Ax^(k) −
λ^(k)x^(k)‖²); this is further improved to cubic convergence if A is symmetric.
It is worth emphasising that the improved convergence comes at a cost: one step of
shifted inverse power method requires a solution of a linear system, which is clearly more
expensive than power method.
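A minimal MATLAB sketch of Algorithm 8.2 (with a made-up symmetric matrix and target µ), factorising A − µI once and reusing it:

% Shifted inverse power iteration: find the eigenpair of A closest to mu.
n = 300;
A = randn(n); A = A + A';            % symmetric test matrix
mu = 1.5;                            % target shift
[L, U, P] = lu(A - mu*eye(n));       % factor once, reuse every iteration
x = randn(n, 1); x = x/norm(x);
for k = 1:50
    x = U \ (L \ (P*x));             % x := (A - mu*I)^{-1} x
    x = x/norm(x);
end
lam = x'*A*x                         % estimate of the eigenvalue closest to mu
norm(A*x - lam*x)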
9 The QR algorithm
We’ll now describe an algorithm called the QR algorithm that is used universally for solving
eigenvalue problems of moderate size, e.g. by MATLAB’s eig. Given A ∈ Rn×n , the
algorithm
Is backward stable.
The algorithm itself is simple: basically, take the QR factorisation, swap the order of the
factors, take the QR factorisation of that, and repeat the process. Namely,
Algorithm 9.1 The QR algorithm for finding the Schur decomposition of a square matrix
A.
1: Set A1 = A.
2: Repeat: A1 = Q1 R1 , A2 = R1 Q1 , A2 = Q2 R2 , A3 = R2 Q2 , . . .
Notes:
The A_k are all similar to each other: A_{k+1} = Q_k^TA_kQ_k.
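A minimal MATLAB sketch of Algorithm 9.1 (unshifted, on a made-up symmetric matrix, so convergence is slow but visible):

% The (unshifted) QR algorithm: A_{k+1} = R_k*Q_k.
n = 6;
A = randn(n); A = A + A';                % symmetric, so the limit is diagonal
Ak = A;
for k = 1:200
    [Q, R] = qr(Ak);
    Ak = R*Q;                            % similar to A: Ak_new = Q'*Ak*Q
end
sort(diag(Ak))', sort(eig(A))'           % diagonal of Ak approaches the eigenvalues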
QR algorithm and power method We can deduce that the QR algorithm is closely
related to the power method. Let's try to explain the connection. By Theorem 9.1, Q^(k)R^(k) is
the QR factorisation of A^k; as we saw in the analysis of the power method, the columns of
A^k are 'dominated by the leading eigenvector' x_1, where Ax_1 = λ_1x_1.
In particular, consider A^k[1, 0, . . . , 0]^T = A^ke_1:
A^ke_1 = Q^(k)R^(k)e_1 = R^(k)(1, 1)Q^(k)(:, 1), which is parallel to the first column of Q^(k).
This tells us why the QR algorithm would eventually compute the eigenvalues—at least
the dominant ones. One can even go further and argue that once the dominant eigenvalue
has converged, we can expect the next dominant eigenvalue to start converging too—as due
to the nature of the QR algorithm based on orthogonal transformations, we are then working
in the orthogonal complement; and so on and so forth until the matrix A becomes upper
triangular (and in the normal case this becomes diagonal), completing the solution of the
eigenvalue problem. But there is much better news.
QR algorithm and inverse power method We have seen that the QR algorithm is
related to the power method. We have also seen that the power method can be improved by
a shift-and-invert technique. A natural question arises: can we introduce a similar technique
in the QR algorithm? This question has an elegant and remarkable answer: not only is this
possible but it is possible without ever inverting a matrix or solving a linear system (isn’t
that incredible!?).
Let’s try and explain this. We start with the same expression for the QR iterates:
This means the final column of Q^(k) converges to the left eigenvector x_n corresponding to the
smallest (in modulus) eigenvalue λ_n, with rate |λ_n|/|λ_{n−1}|, hence A_k(n, :) → [0, . . . , 0, λ_n].
QR algorithm with shifts and shifted inverse power method We are now ready
to reveal the connection between the shift-and-invert power method and the QR algorithm
with shifts.
First, here is the QR algorithm with shifts.
Algorithm 9.2 The QR algorithm with shifts for finding the Schur decomposition of a
square matrix A.
1: Set A1 = A, k = 1.
2: Ak − sk I = Qk Rk (QR factorisation)
3: Ak+1 = Rk Qk + sk I, k ← k + 1, repeat.
Roughly, if s_k ≈ λ_n, then by the argument just made the last row of A_{k+1} becomes
≈ [0, . . . , 0, λ_n], while the rest of the matrix remains full.
Theorem 9.2
∏_{i=1}^k (A − s_iI) = Q^(k)R^(k)  (= (Q_1 · · · Q_k)(R_k · · · R_1)).

Proof: Suppose this is true for k − 1. Then the QR algorithm computes (Q^(k−1))^T(A − s_kI)Q^(k−1) = Q_kR_k,
so (A − s_kI)Q^(k−1) = Q^(k−1)Q_kR_k, hence
∏_{i=1}^k (A − s_iI) = (A − s_kI)Q^(k−1)R^(k−1) = Q^(k−1)Q_kR_kR^(k−1) = Q^(k)R^(k).
Taking the inverse conjugate transpose: ∏_{i=1}^k (A − s_iI)^{−∗} = Q^(k)(R^(k))^{−∗}.
This means the algorithm (implicitly) finds the QR factorisation of a matrix with
eigenvalues r(λ_j) = ∏_{i=1}^k 1/(λ̄_j − s_i).
This reveals the intimate connection between shifted QR and the shifted inverse power
method, and hence with rational approximation.
9.2 QR algorithm preprocessing: reduction to Hessenberg form
We've seen that the QR iteration drives the below-diagonal entries of A to 0 (especially those
in the last row).
But each iteration of the QR algorithm costs O(n³) for the QR factorisation, so the overall cost
would be O(n⁴).
The remedy is to first reduce A to upper Hessenberg form (by orthogonal similarity transformations).
Using Givens rotations, each QR iteration on a Hessenberg matrix then costs O(n²) (not O(n³)),
and the Hessenberg form is preserved throughout the iteration.
The remaining task (done by shifted QR): drive subdiagonal ∗ to 0. A is thus reduced
to a triangular form, revealing the Schur decomposition of A (if one traces back all the
orthogonal transformations employed).
Once the bottom-right subdiagonal entry satisfies |∗| < ε, we set it to zero and deflate: the last
diagonal entry is accepted as an eigenvalue, and we continue with the leading (n − 1) × (n − 1)
block.
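A rough MATLAB sketch combining Hessenberg reduction, shifted QR steps and deflation (a simplified single-shift variant on a symmetric test matrix—not the production algorithm, which exploits the Hessenberg/tridiagonal structure within each step):

% Hessenberg reduction + shifted QR with deflation (simplified sketch).
n = 8;
A = randn(n); A = A + A';             % symmetric test matrix
H = hess(A);                          % orthogonal reduction (tridiagonal here)
m = n; lams = [];
for iter = 1:500                      % iteration cap for safety
    if m == 1, lams(end+1) = H(1,1); break, end
    s = H(m, m);                      % shift = bottom-right entry
    [Q, R] = qr(H(1:m,1:m) - s*eye(m));
    H(1:m,1:m) = R*Q + s*eye(m);      % one shifted QR step on the active block
    if abs(H(m, m-1)) < 1e-12*norm(A) % deflate when subdiagonal entry is negligible
        lams(end+1) = H(m, m);
        m = m - 1;
    end
end
sort(lams)', sort(eig(A))'            % computed vs true eigenvalues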
[Figure: 'No shift (plain QR)' vs. 'QR with shifts': convergence on a log scale (down to about 10^{−15}) against iterations—several hundred iterations without shifts vs. about a dozen with shifts. Below: plots of log|p(λ)| for the two variants, with the eigenvalues marked as red dots.]
In light of the connection to rational functions as discussed above, here we plot the
underlying functions (red dots: eigvals). The idea here is that we want the functions to
take large values at the target eigenvalue (at the current iterate) in order to accelerate
convergence.
Finally, let us mention a cautionary tale:
10 QR algorithm continued
10.1 QR algorithm for symmetric A
So far we have not assumed anything about the matrix besides that it is square. This is
for good reason—the QR algorithm is applicable to any eigenvalue problem.
However, in many situations the eigenvalue problem is structured. In particular, the case
where the matrix is symmetric arises very frequently in practice and comes with significant
simplifications of the QR algorithm, so it deserves a special mention.
Most importantly, symmetry immediately implies that the initial reduction to Hessenberg
form reduces A to tridiagonal form: applying the Householder reflectors Q_1, Q_2, . . . from both
sides, A → Q_1^TAQ_1 → Q_2^TQ_1^TAQ_1Q_2 → · · ·, each step zeros out a column below the
subdiagonal and (by symmetry) the corresponding row, leaving a symmetric tridiagonal matrix.
QR steps for tridiagonal matrices require O(n) flops per step instead of O(n²).
Cost: (4/3)n³ flops for the eigenvalues, ≈ 10n³ to also compute the eigenvectors (store the
Givens rotations to compute the eigenvectors).
10.2 Computing the SVD: Golub-Kahan’s bidiagonalisation algo-
rithm
The key ideas of the QR algorithm turn out to be applicable to the computation of the
SVD. Clearly this is a major step—given the importance of the SVD. In particular the
SVD algorithm has strong connections to the symmetric QR algorithm. This is perhaps not
surprising given the strong connection between the SVD and symmetric eigenvalue problems
as we discussed previously.
A noteworthy difference is that in the SVD A = U ΣV T the two matrices U and V are
allowed to be different. The algorithm respects this and instead of initially reducing the
matrix to tridiagonal form, it reduces it to a so-called bidiagonal form.
Here is how it works. Apply Householder reflectors from the left and right (different ones)
to bidiagonalise A:
A → B = H_{L,n} · · · H_{L,1} A H_{R,1}H_{R,2} · · · H_{R,n−2}.
Each left reflector H_{L,k} zeros out the kth column below the diagonal, and each right reflector
H_{R,k} zeros out the kth row to the right of the superdiagonal, so that B is upper bidiagonal
(nonzero only on the diagonal and the first superdiagonal).
Since the transformations are all orthogonal multiplications, singular values are preserved:
σ_i(A) = σ_i(B).
Cost: ≈ 4mn² flops for the singular values Σ, ≈ 20mn² flops to also compute the singular
vectors U, V.
Ax = λBx, A, B ∈ Cn×n
The matrices A, B are given. The goal is to find the eigenvalues λ and their corre-
sponding eigenvectors x.
There are usually (incl. when B is nonsingular) n eigenvalues, which are the roots of
det(A − λB).
When B is invertible, one can reduce the problem to B −1 Ax = λx.
Important case: A, B symmetric, B positive definite: in this case λ are all real.
QZ algorithm: look for unitary Q, Z s.t. QAZ, QBZ both upper triangular.
Then diag(QAZ)/diag(QBZ) are the eigenvalues.
Cost: ≈ 50n3 .
SVD A = UΣV^T for A ∈ R^{m×n}: ((8/3)mn² flops for singular values, +20mn² for singular vectors).
Further speedup is often possible when structure is present (e.g. sparse, low-rank).
11 Iterative methods: introduction
This section marks a point of departure from previous sections. So far we’ve been discussing
direct methods. Namely, we’ve covered the LU-based linear system solution and the QR
algorithm for eigenvalue problems.
Direct methods are
A ’big’ matrix problem is one for which direct methods aren’t feasible. Historically, as
computers became increasingly faster, roughly
1950: n ≥ 20
1965: n ≥ 200
1980: n ≥ 2000
1995: n ≥ 20000
2010: n ≥ 100000
was considered ’too large for direct methods’. While it’s clearly good news that our ability
to solve problems with direct methods has been improving, the scale of problems we face in
data science has been growing at the same (or even faster) pace! For such problems, we need
to turn to alternative algorithms: we’ll cover iterative and randomized methods. We first
discuss iterative methods, with a focus on Krylov subspace methods.
Direct vs. iterative methods Broadly speaking, the idea of iterative methods is to:
Each iteration should be (a lot) cheaper than direct methods, usually O(n2 ) or less.
Iterative methods can be (but not always) much faster than direct methods.
Often, after O(n3 ) work it still gets the exact solution (ignoring roundoff errors). But
one would hope to get an (acceptably) good solution long before that!
Figure 2: Rough comparison of direct vs. iterative methods, image from [Trefethen-Bau [33]]
The common framework is to compute an approximate solution of the form
x̂ = p_{k−1}(A)v,
where p_{k−1} is a polynomial of degree (at most) k − 1, that is, p_{k−1}(z) = ∑_{i=0}^{k−1} c_iz^i for some
coefficients c_i ∈ C, and v ∈ R^n is the initial vector (usually v = b for linear systems; for
eigenproblems v is usually a random vector). That is, p_{k−1}(A) = ∑_{i=0}^{k−1} c_iA^i.
Natural questions:
Why would this be a good idea?
11.2 Orthonormal basis for Kk (A, b)
Goal: Find approximate solution x̂ = pk−1 (A)b, i.e. in Krylov subspace
You would want to convince yourself that any vector in the Krylov subspace can be written
as a polynomial of A times the vector b.
An important and non-trivial step towards finding a good solution is to form an orthonor-
mal basis for the Krylov subspace.
First step: form an orthonormal basis Q, s.t. solution x ∈ Kk (A, b) can be written as
x = Qy
Naive idea: Form matrix [b, Ab, A2 b, . . . , Ak−1 b], then compute its QR factorisation.
Algorithm 11.1 The Arnoldi iteration for finding an orthonormal basis of the Krylov subspace
K_k(A, b).
1: Set q_1 = b/‖b‖_2
2: For k = 1, 2, . . . ,
3:   set v = Aq_k
4:   for j = 1, 2, . . . , k
5:     h_{jk} = q_j^Tv, v = v − h_{jk}q_j  % orthogonalise against q_j via (modified) Gram-Schmidt
6:   end for
7:   h_{k+1,k} = ‖v‖_2, q_{k+1} = v/h_{k+1,k}
8: End for
After k steps, AQ_k = Q_{k+1}H̃_k = Q_kH_k + q_{k+1}[0, . . . , 0, h_{k+1,k}], with Q_k = [q_1, q_2, . . . , q_k],
Q_{k+1} = [Q_k, q_{k+1}], span(Q_k) = span([b, Ab, . . . , A^{k−1}b]), and Q_{k+1}^TQ_{k+1} = I_{k+1}.
Here H̃_k ∈ R^{(k+1)×k} is upper Hessenberg, with entries h_{ij} for i ≤ j + 1 (and H_k is its
leading k × k block).
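A minimal MATLAB sketch of the Arnoldi iteration, verifying the relation above (random test data):

% Arnoldi iteration and the relation A*Q_k = Q_{k+1}*Htilde_k.
n = 200; k = 20;
A = randn(n); b = randn(n, 1);
Q = zeros(n, k+1); H = zeros(k+1, k);
Q(:,1) = b/norm(b);
for j = 1:k
    v = A*Q(:,j);
    for i = 1:j                          % modified Gram-Schmidt against q_1,...,q_j
        H(i,j) = Q(:,i)'*v;
        v = v - H(i,j)*Q(:,i);
    end
    H(j+1,j) = norm(v);
    Q(:,j+1) = v/H(j+1,j);
end
norm(A*Q(:,1:k) - Q*H, 'fro')            % ~1e-14
norm(Q'*Q - eye(k+1), 'fro')             % orthonormal basis of K_{k+1}(A,b)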
This is a projection method (similar alg is available for SVD).
[Figure: left, convergence to the dominant eigenvalue, Lanczos vs. the power method, against iterations; right, convergence of the Ritz values (approximate eigenvalues) to the eigenvalues against Lanczos iterations.]
The same principles of projecting the matrix onto a Krylov subspace applies to nonsym-
metric eigenvalue problems. Essentially it boils down to finding the eigenvalues of the upper
Hessenberg matrix H arising in the Arnoldi iteration, rather than the tridiagonal matrix as
in Lanczos.
Proof: Recall that x_k ∈ K_k(A, b) ⇒ x_k = p_{k−1}(A)b, where p_{k−1} is a polynomial of degree
at most k − 1. Hence the GMRES solution satisfies

min_{x_k ∈ K_k(A,b)} ‖Ax_k − b‖_2 = min_{p_{k−1} ∈ P_{k−1}} ‖Ap_{k−1}(A)b − b‖_2 = min_{p ∈ P_k, p(0)=1} ‖p(A)b‖_2.

If A is diagonalizable, A = XΛX^{−1}, then

‖p(A)‖_2 = ‖Xp(Λ)X^{−1}‖_2 ≤ ‖X‖_2 ‖X^{−1}‖_2 ‖p(Λ)‖_2 = κ_2(X) max_{z ∈ λ(A)} |p(z)|.
[Figure: eigenvalues in the complex plane for two example matrices, together with the corresponding GMRES convergence histories; convergence is much faster when the eigenvalues are clustered away from the origin.]
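This dependence on the spectrum is easy to observe numerically. Here is a small MATLAB experiment of my own (the matrices, sizes and tolerances are arbitrary choices), using the built-in gmres:

n = 500; rng(1);
[Q,~] = qr(randn(n));                        % random orthogonal matrix
A1 = Q*diag(1 + rand(n,1))*Q';               % eigenvalues in [1,2]: clustered away from 0
A2 = Q*diag(2*rand(n,1) - 1)*Q';             % eigenvalues in [-1,1]: surround the origin
b = randn(n,1);
[~,~,~,~,res1] = gmres(A1, b, [], 1e-10, n); % full (unrestarted) GMRES
[~,~,~,~,res2] = gmres(A2, b, [], 1e-10, n);
semilogy(res1/norm(b)); hold on; semilogy(res2/norm(b));
legend('eigenvalues in [1,2]','eigenvalues in [-1,1]'); xlabel('GMRES iteration');

With the spectrum in [1, 2] the residual drops geometrically in a handful of iterations; with eigenvalues surrounding the origin the decay is far slower.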
Initial vector. Sometimes a good initial guess x_0 for x_* is available. In this case we take
the initial residual r_0 = b − Ax_0 and work in the affine space x_k ∈ x_0 + K_k(A, r_0). All the
analysis above can be modified readily to allow for this situation, with essentially the same
conclusion (clearly, if one has a good x_0, by all means it should be used).
Preconditioners: examples
ILU (Incomplete LU) preconditioner: A ≈ LU with L, U 'as sparse as A', M = (LU)^{−1} = U^{−1}L^{−1} ⇒ MA ≈ I (hopefully; eigenvalues 'cluster away from 0').

For Ã = \begin{bmatrix} A & B \\ C & 0 \end{bmatrix}, set M = \begin{bmatrix} A & \\ & CA^{−1}B \end{bmatrix}^{−1}. Then if M is nonsingular, MÃ has
eigvals ∈ {1, ½(1 ± √5)} ⇒ 3-step convergence. [Murphy-Golub-Wathen 2000 [23]]
1. Stop GMRES after k_max (prescribed) steps to get an approximate solution x̂_1.
2. Solve Ax̃ = b − Ax̂_1 via GMRES (this is a linear system with a different right-hand side), so that x̂_1 + x̃ approximates the solution.
1. Compute Q^T AQ.
As in Lanczos, Q = Q_k is an orthonormal basis for K_k(A, b), so simply Q_k^T AQ_k = H_k (a Hessenberg
eigenproblem; note that this is ideal for the QR algorithm, as the Hessenberg-reduction preprocessing step can be skipped).
To find other (e.g. interior) eigvals, one can use shift-invert Arnoldi: Q an orthonormal basis for K_k((A − sI)^{−1}, b).
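For example, using the arnoldi sketch given earlier (a hypothetical helper, not part of the notes), the Ritz values come from the small Hessenberg matrix:

n = 200; rng(0);
A = randn(n)/sqrt(n) + 2*eye(n);   % an arbitrary nonsymmetric test matrix
b = randn(n,1); k = 30;
[Q, H] = arnoldi(A, b, k);
ritz = eig(H(1:k,1:k));            % Ritz values: approximations to (some) eigenvalues of A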
The CG algorithm minimises the A-norm of the error over the Krylov subspace, x_k = argmin_{x ∈ K_k(A,b)} ‖x − x_*‖_A (where Ax_* = b): writing x_k = Q_k y reduces this to a small problem for y ∈ R^k.
13.1 CG algorithm for Ax = b, A ≻ 0
We've described the CG algorithm conceptually. To derive the practical algorithm some
clever manipulations are necessary. We won't go over them in detail but here is the outcome:
Set x_0 = 0, r_0 = b, p_0 = r_0 and do for k = 0, 1, 2, . . .
  α_k = ⟨r_k, r_k⟩/⟨p_k, Ap_k⟩
  x_{k+1} = x_k + α_k p_k
  r_{k+1} = r_k − α_k Ap_k
  β_k = ⟨r_{k+1}, r_{k+1}⟩/⟨r_k, r_k⟩
  p_{k+1} = r_{k+1} + β_k p_k
where r_k = b − Ax_k (residual) and p_k (search direction). x_k is the CG solution after k
iterations.
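A minimal MATLAB sketch of these recurrences (my own; it assumes A is symmetric positive definite and runs a fixed number of steps with no convergence test):

function x = cg(A, b, kmax)
% Conjugate gradient for Ax = b with A symmetric positive definite.
x = zeros(size(b)); r = b; p = r;       % x_0 = 0, r_0 = b, p_0 = r_0
for k = 1:kmax
    Ap    = A*p;
    alpha = (r'*r)/(p'*Ap);             % alpha_k
    x     = x + alpha*p;                % x_{k+1}
    rnew  = r - alpha*Ap;               % r_{k+1} = r_k - alpha_k*A*p_k
    beta  = (rnew'*rnew)/(r'*r);        % beta_k
    p     = rnew + beta*p;              % p_{k+1}
    r     = rnew;
end
end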
One can show, among other things (exercise/sheet):
K_k(A, b) = span(r_0, r_1, . . . , r_{k−1}) = span(x_1, x_2, . . . , x_k) (also equal to span(p_0, p_1, . . . , p_{k−1}))
r_j^T r_k = 0, j = 0, 1, 2, . . . , k − 1
Thus x_k is the kth CG solution, satisfying the Galerkin orthogonality Q_k^T(Ax_k − b) = 0: the residual is
orthogonal to the (Krylov) subspace.
13.2 CG convergence
Let’s examine the convergence of the CG iterates.
Theorem 13.1 Let A ≻ 0 be an n × n positive definite matrix and b ∈ R^n. Let e_k := x_* − x_k
be the error after the kth CG iteration (x_* is the exact solution, Ax_* = b). Then

‖e_k‖_A / ‖e_0‖_A ≤ 2 ((√κ_2(A) − 1) / (√κ_2(A) + 1))^k.
Now, writing A = VΛV^T for the eigendecomposition, ‖p(A)e_0‖_A^2 = Σ_i λ_i p(λ_i)^2 (V^T e_0)_i^2 ≤ max_j p(λ_j)^2 Σ_i λ_i (V^T e_0)_i^2 = max_j p(λ_j)^2 ‖e_0‖_A^2.
We've shown

‖e_k‖_A / ‖e_0‖_A ≤ min_{p ∈ P_k, p(0)=1} max_j |p(λ_j)| ≤ min_{p ∈ P_k, p(0)=1} max_{x ∈ [λ_min(A), λ_max(A)]} |p(x)|.

Note that κ_2(A) = σ_max(A)/σ_min(A) = λ_max(A)/λ_min(A) (=: b/a).
The above bound is obtained using Chebyshev polynomials on [λ_min(A), λ_max(A)]. This
is a class of polynomials that arises in a variety of contexts in computational maths.
Let's next look at their properties.
[Figure: the first few Chebyshev polynomials T_k on [−1, 1].]
These polynomials grow very fast outside the interval (here the ’standard’ [−1, 1]). For
example, plots on [−2, 1] look like
[Figure: Chebyshev polynomials T_k plotted on [−2, 1]; they grow rapidly outside [−1, 1].]
Outside [−1, 1], |T_k(x)| ≫ 1 grows rapidly with |x| and k (fastest growth among P_k).
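This growth is easy to check numerically using the standard formulas T_k(x) = cos(k arccos x) on [−1, 1] and T_k(x) = cosh(k arccosh x) for x > 1; a small MATLAB sketch (mine):

% |T_k(x)| <= 1 on [-1,1], while T_k grows rapidly just outside the interval:
for k = [5 10 20]
    fprintf('k = %2d:  T_k(0.9) = %7.4f,  T_k(1.1) = %10.3e\n', ...
            k, cos(k*acos(0.9)), cosh(k*acosh(1.1)));
end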
Using T_k(½(z + z^{−1})) = ½(z^k + z^{−k}), and setting ½(z + z^{−1}) = (b+a)/(b−a), i.e.
z = (√(b/a) + 1)/(√(b/a) − 1) = (√κ_2(A) + 1)/(√κ_2(A) − 1), we get (writing κ = κ_2(A))

|p(x)| ≤ 1/T_k((b+a)/(b−a)) ≤ 2((√κ − 1)/(√κ + 1))^k.
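One can compare this bound with the actual behaviour of CG. Here is a quick MATLAB check of my own, using the built-in pcg (note that the bound is for the A-norm of the error, while pcg records residual norms, so this is only a rough comparison):

n = 500; rng(0); kappa = 100;
[Q,~] = qr(randn(n));
A = Q*diag(linspace(1, kappa, n))*Q';       % SPD matrix with condition number kappa
b = randn(n,1);
[~,~,~,~,resvec] = pcg(A, b, 1e-14, n);     % CG residual history
k = (0:numel(resvec)-1)';
bound = 2*((sqrt(kappa)-1)/(sqrt(kappa)+1)).^k;
semilogy(k, resvec/norm(b), k, bound);
legend('CG relative residual','2((sqrt(kappa)-1)/(sqrt(kappa)+1))^k');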
For MINRES (A symmetric, possibly indefinite), which minimises the residual over the Krylov subspace, the same reasoning gives

min_{x ∈ K_k(A,b)} ‖Ax − b‖_2 = min_{p_{k−1} ∈ P_{k−1}} ‖Ap_{k−1}(A)b − b‖_2 = min_{p̃ ∈ P_k, p̃(0)=0} ‖(p̃(A) − I)b‖_2 = min_{p ∈ P_k, p(0)=1} ‖p(A)b‖_2.
Interpretation: (again) find a polynomial s.t. p(0) = 1 and |p(λ_i)| is small:

‖Ax_k − b‖_2 / ‖b‖_2 ≤ min_{p ∈ P_k, p(0)=1} max_i |p(λ_i)|.

The minimisation is needed on both the positive and negative sides of the spectrum, hence slower convergence when A
is indefinite.
CG and MINRES, optimal polynomials Here are some optimal polynomials for CG
and MINRES. Note that κ_2(A) = 10 in both cases; observe how much faster CG is than
MINRES.
[Figure: the optimal polynomials at iteration k = 50 for CG (spectrum on a positive interval) and MINRES (indefinite spectrum).]
Ax = b with A = A^T (≻ 0)
may not converge rapidly with CG or MINRES if the eigenvalues are not distributed in a
favorable manner (i.e., clustered away from 0).
In this case, as with GMRES, a workaround is to find a good preconditioner M such that
"M^T M ≈ A^{−1}" and solve

M^T AMy = M^T b,   My = x.

As before, the desiderata for M are (a concrete example is sketched after this list):
M^T AM is easy to multiply to a vector.
M^T AM has clustered eigenvalues away from 0.
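As an illustration (my own example, not from the notes), an incomplete Cholesky factor often provides such a preconditioner when A is sparse and positive definite; in MATLAB's pcg, supplying the factors L and L' plays the role of M above:

A = gallery('poisson', 100);                  % sparse SPD 2-D Laplacian, n = 100^2
b = ones(size(A,1), 1);
tol = 1e-10; maxit = 500;
[~,~,~,it0] = pcg(A, b, tol, maxit);          % plain CG
L = ichol(A);                                  % incomplete Cholesky, A ~ L*L', L as sparse as A
[~,~,~,it1] = pcg(A, b, tol, maxit, L, L');   % preconditioned CG
fprintf('CG iterations: %d (plain) vs %d (with ichol)\n', it0, it1);

The preconditioned iteration typically needs far fewer iterations, because the spectrum of the preconditioned matrix is much better clustered.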
Randomized algorithms
Gaussian matrices In randomized algorithms we typically introduce a random matrix.
For the analysis, Gaussian matrices G ∈ Rm×n are the most convenient (not always for
computation). These are matrices whose entries are drawn independently from the standard
normal (Gaussian) distribution Gij ∼ N (0, 1). We cannot do justice to random matrix
theory (there is a course in Hilary term!), but here is a summary of the properties of Gaussian
matrices that we’ll be using.
A useful fact about a Gaussian random matrix G is that its distribution is invariant
under orthogonal transformations. That is, if G is Gaussian, so are QG and GQ, where
Q is any orthogonal matrix independent of G. To see this (nonexaminable): note that
a sum of Gaussian (scalar) random variables is Gaussian, and by independence the
variance is simply the sum of the variances; hence each entry of QG is Gaussian. Now let g_i denote the ith column of G.
Then E[(Qg_i)(Qg_i)^T] = Q E[g_i g_i^T] Q^T = QQ^T = I, so each Qg_i is multivariate Gaussian with the
same distribution as g_i. Independence of Qg_i, Qg_j is immediate.
Algorithm 14.1 Randomised SVD (HMT): given A ∈ R^{m×n} and rank r, find a rank-r
approximation Â ≈ A.
1: Form a random matrix X ∈ R^{n×r}, usually r ≪ n.
2: Compute AX.
3: Compute the QR factorisation AX = QR.
4: Â = Q(Q^T A) is a rank-r approximation (with the SVD Q^T A = U_0Σ_0V_0^T, it is Â = (QU_0)Σ_0V_0^T).
Here, X is a random matrix taking independent and identically distributed (iid) entries.
A convenient choice (for the theory, not necessarily for computation) is a Gaussian matrix,
with iid entries Xij ∼ N (0, 1).
Here are some properties of the HMT algorithm:
O(mnr) cost for dense A.
Near-optimal approximation guarantee: for any r̂ < r,

E‖A − Â‖_F ≤ (1 + r/(r − r̂ − 1)) ‖A − A_r̂‖_F,

where A_r̂ is the rank-r̂ truncated SVD (expectation w.r.t. the random matrix X).
This is a remarkable result; make sure to pause and think about what it says! The
approximant Â has error ‖A − Â‖_F that is within a factor 1 + r/(r − r̂ − 1) of the optimal
truncated SVD, for a slightly lower rank r̂ < r (say, r̂ = 0.9r).
For M ∈ R^{m×n} of rank r with (economy) SVD M = U_rΣ_rV_r^T, the pseudoinverse is

M† = V_rΣ_r^{−1}U_r^T ∈ R^{n×m}.

M† satisfies MM†M = M, M†MM† = M†, MM† = (MM†)^T, M†M = (M†M)^T
(these are often taken to be the definition of the pseudoinverse; the above definition
is much simpler IMO). A quick numerical check of these conditions is given below.
M† = M^{−1} if M is nonsingular.
A projection P (P^2 = P) is always diagonalisable and all its eigenvalues are 1 or 0. (Think about why this is.)
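Here is the check, using MATLAB's pinv (my own snippet):

M  = randn(8,5)*randn(5,6);           % an 8-by-6 matrix of rank 5
Mp = pinv(M);                          % pseudoinverse, computed via the SVD
norm(M*Mp*M - M)                       % all four quantities are ~ 0 (up to roundoff)
norm(Mp*M*Mp - Mp)
norm(M*Mp - (M*Mp)')                   % M*Mp is symmetric: an orthogonal projection onto range(M)
norm(Mp*M - (Mp*M)')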
14.3 HMT approximant: analysis (down from 70 pages!)
We are now in a position to explain why the HMT algorithm works. The original HMT paper
is over 70 pages with long analysis. Here we attempt to condense the arguments to the essence
(and with a different proof). Recall that our low-rank approximation is Â = QQ^TA, where
AX = QR. Goal: ‖A − Â‖ = ‖(I_m − QQ^T)A‖ = O(‖A − A_r̂‖).
2. Set M^T = (V^TX)†V^T, where V = [v_1, . . . , v_r̂] ∈ R^{n×r̂} contains the top r̂ right singular vectors
of A (r̂ ≤ r).
Recall that r̂ is any integer bounded by r.
4. Taking norms yields ‖A − Â‖_2 = ‖(I_m − QQ^T)A(I_n − VV^T)(I_n − XM^T)‖_2 = ‖(I_m − QQ^T)U_2Σ_2V_2^T(I_n − XM^T)‖_2, where [V, V_2] is orthogonal (and A = [U, U_2] diag(Σ, Σ_2) [V, V_2]^T is the SVD), so ‖A − Â‖_2 ≤ ‖Σ_2‖_2 ‖I_n − XM^T‖_2.
It remains to prove ‖XM^T‖_2 = O(1). To see why this should hold with high probability, we
need a result from random matrix theory.
14.4 Tool from RMT: Rectangular random matrices are well conditioned
A final piece of information required to complete the puzzle is the Marchenko-Pastur law, a
classical result in random matrix theory, which we will not be able to prove here (and hence
is clearly nonexaminable). We refer those interested to the part C course on Random Matrix
Theory. However, understanding the statement and the ability to use this fact is indeed
examinable.
The key message is easy to state: a rectangular random matrix is well conditioned
with extremely high probability. This fact is enormously important and useful in a variety
of contexts in computational mathematics.
Here is a more precise statement:
Theorem 14.1 (Marchenko-Pastur) The singular values of a random matrix X ∈ R^{m×n}
(m ≥ n) with iid entries X_ij (mean 0, variance 1) follow the Marchenko-Pastur (M-P) distribution
(proof nonexaminable), with density ∼ (1/x)√((√m + √n − x)(x − (√m − √n))), and support
[√m − √n, √m + √n].
[Figure: histograms of the singular values of random Gaussian matrices with varying aspect ratio m/n (aspect = 1, 2, 5, 10).]
σ_max(X) ≈ √m + √n, σ_min(X) ≈ √m − √n, hence κ_2(X) ≈ (√m + √n)/(√m − √n) = (1 + √(n/m))/(1 − √(n/m)) = O(1).
Proof: omitted. (Strictly speaking, the theorem concerns the limit m, n → ∞; but the
result holds nonasymptotically, for finite m and n, with enormous probability [6].)
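A quick numerical sanity check of the extreme singular values (my own snippet; the sizes are arbitrary):

m = 4000; n = 1000; rng(2);
s = svd(randn(m, n));                                 % singular values of a Gaussian matrix
[sqrt(m)+sqrt(n), max(s); sqrt(m)-sqrt(n), min(s)]    % predicted vs observed extremes
max(s)/min(s)                                         % ~ (sqrt(m)+sqrt(n))/(sqrt(m)-sqrt(n)) = 3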
As stated above, this is a key fact in many breakthroughs in computational maths!
Examples include
(nonexaminable:) Compressed sensing (RIP) [Donoho 06, Candes-Tao 06], Matrix con-
centration inequalities [Tropp 11], Function approx. by least-squares [Cohen-Davenport-
Leviatan 13]
‖XM^T‖_2 = O(1) Let's get back to the HMT analysis. Recall that we've shown, for M^T =
(V^TX)†V^T where X ∈ R^{n×r} is random, that

‖A − Â‖_2 ≤ ‖Σ_2‖_2 ‖(I_n − XM^T)‖_2 = ‖Σ_2‖_2 ‖XM^T‖_2,

where ‖Σ_2‖_2 is the optimal rank-r̂ error.
Now ‖XM^T‖_2 = ‖X(V^TX)†V^T‖_2 = ‖X(V^TX)†‖_2 ≤ ‖X‖_2 ‖(V^TX)†‖_2.
Now let's analyse the (standard) case where X is random Gaussian, X_ij ∼ N(0, 1). Then
V^TX is another Gaussian matrix (an important fact about Gaussian matrices is that
orthogonal × Gaussian = Gaussian (in distribution); this is nonexaminable but a nice
exercise), hence ‖(V^TX)†‖_2 = 1/σ_min(V^TX) ≲ 1/(√r − √r̂) by M-P.
‖X‖_2 ≲ √n + √r by M-P.
Together we get ‖XM^T‖_2 ≲ (√n + √r)/(√r − √r̂) = "O(1)".
Remark:
Theorem 14.2 (Reproduces HMT 2011 Thm. 10.5) If X is Gaussian, for any r̂ < r,
E‖E_HMT‖_F ≤ √(E‖E_HMT‖_F^2) = √(1 + r/(r − r̂ − 1)) ‖A − A_r̂‖_F.
Now if X is Gaussian then V_⊥^TX ∈ R^{(n−r̂)×r} and V^TX ∈ R^{r̂×r} are independent Gaussian.
Hence by [HMT Prop. 10.1], E‖Σ_2(V_⊥^TX)(V^TX)†‖_F^2 = (r/(r − r̂ − 1))‖Σ_2‖_F^2, so

E‖E_HMT‖_F^2 = (1 + r/(r − r̂ − 1)) ‖Σ_2‖_F^2.

Note how remarkable the theorem is: the 'lazily' computed approximant is nearly optimal,
up to a factor √(1 + r/(r − r̂ − 1)), for a nearby rank r̂ (one can take, e.g. r̂ = 0.9r).
Then Â is another rank-r approximation to A, and A − Â = (I − P_{AX,Y})A = (I − P_{AX,Y})A(I −
XM^T); choose M s.t. XM^T = X(V^TX)†V^T = P_{X,V}. Then P_{AX,Y}, P_{X,V} are (nonorthogonal)
projections.
Note that the ‖A(I − VV^T)(I − P_{X,V})‖ term is exactly the same as in the HMT error.
[Figure: approximation error (left) and runtime in seconds (right) of HMT, Generalized Nyström (GN), and the SVD, as a function of the rank.]
We see that randomised algorithms can outperform the standard SVD significantly.
n = 1000; % size
A = gallery('randsvd',n,1e100); % geometrically decaying singvals
r = 200; % rank
Then HMT proceeds as follows:
X = randn(n,r);
AX = A*X;
[Q,R] = qr(AX,0); % thin QR fact.
At = Q*(Q'*A);
which with high probability gives an excellent approximation
norm(At-A,'fro')/norm(A,'fro')
ans = 1.2832e-15
And for Generalized Nyström:
X = randn(n,r); Y = randn(n,1.5*r);
AX = A*X; YA = Y'*A; YAX = YA*X;
[Q,R] = qr(YAX,0); % stable pseudo-inverse via the (thin) QR factorisation
At = (AX/R)*(Q'*YA); % At = AX*pinv(Y'*A*X)*(Y'*A), computed stably
norm(At-A,'fro')/norm(A,'fro')
ans = 2.8138e-15
Both algorithms give an excellent low-rank approximation to A.
Traditional method: the normal equations x = (A^TA)^{−1}A^Tb, or A = QR and x = R^{−1}(Q^Tb); both
require O(mn^2) cost.
Blendenpik idea [1]: sketch A with a random matrix G and compute GA = Q̂R̂ (QR factorisation), then solve min_y ‖(AR̂^{−1})y − b‖_2's normal eqn via Krylov.
– O(mn log m + n^3) cost using fast FFT-type transforms22 for G.
– Crucially, AR̂^{−1} is well-conditioned. Why? Marchenko-Pastur (next)
Thus we can solve (11) via CG (or LSQR [27], a more stable variant in this context;
nonexaminable):
– exponential convergence, O(1) iterations! (or O(log(1/ε)) iterations for ε accuracy)
– each iteration requires w ← Bw and w ← B^Tw, consisting of w ← R̂^{−1}w (an n × n
triangular solve) and w ← Aw (an m × n matrix-vector multiplication); O(mn) cost
overall
A MATLAB sketch of this idea is given after the footnote below.
22 The FFT (fast Fourier transform) is one of the important topics that we can't treat properly; for now
just think of it as a matrix-vector multiplication that can be performed in O(n log n) flops rather than
O(n^2).
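Here is a minimal MATLAB sketch of the sketch-and-precondition idea (my own; a Gaussian sketch is used instead of a fast FFT-type transform, and the sizes, tolerances and the sketch factor 4 are arbitrary choices):

m = 20000; n = 100; rng(3);
A = randn(m, n)*diag(logspace(0, 10, n));  % very ill-conditioned least-squares matrix
b = randn(m, 1);
s = 4*n;                                    % sketch size: a small multiple of n
G = randn(s, m)/sqrt(s);                    % Gaussian sketch
[~, R] = qr(G*A, 0);                        % R factor of the sketched matrix
B = A/R;                                    % B = A*inv(R) is well-conditioned by Marchenko-Pastur
[y, flag, relres, iter] = lsqr(B, b, 1e-10, 200);   % converges in O(1) iterations
x = R\y;                                    % approximate solution of min ||Ax - b||_2

For large problems one would not form A/R explicitly; instead lsqr is given a function handle that applies A and R^{-1} (and their transposes) to vectors, so each iteration costs O(mn).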
15.3 Blendenpik experiments
Let’s illustrate our findings. Since Blendenpik finds a preconditioner such that AR−1 is
well-conditioned (regardless of κ2 (A)), we expect the convergence of CG to be independent
of κ2 (A). This is indeed what we see here:
[Figure: CG convergence (relative residual vs. CG iterations) with the Blendenpik preconditioner; the convergence is essentially independent of κ_2(A).]
For the least-squares problem min_x ‖Ax − b‖_2 (12), where A ∈ C^{n×r} with n ≫ r, one can sketch and solve the problem [35]: draw a random
matrix G ∈ C^{r̃×n} where r̃ = O(r) ≪ n, and solve the sketched problem

min_x ‖G(Ax − b)‖_2.   (13)

In some cases, the solution of this problem is already a good enough approximation to the
original problem.
We've taken G to be Gaussian above. For a variety of choices of G, with high probability
the solutions of (13) and (12) can be shown to be similar in that they have a comparable
residual ‖Ax − b‖_2.
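A minimal sketch-and-solve example in MATLAB (my own illustration; here the tall matrix is m-by-n with m ≫ n, and the sketch size 4n is an arbitrary choice):

m = 20000; n = 100; rng(4);
A = randn(m, n); xtrue = randn(n, 1);
b = A*xtrue + 1e-3*randn(m, 1);               % least-squares problem with a small residual
G = randn(4*n, m);                             % Gaussian sketch
xs = (G*A)\(G*b);                              % solution of the sketched problem (13)
x  = A\b;                                      % exact least-squares solution of (12)
[norm(A*xs - b), norm(A*x - b)]/norm(b)        % the two residuals are comparable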
Let's understand why the solution of (13) should be good, again using Marchenko-Pastur.
Let [A b] = QR ∈ C^{m×(n+1)} be a thin QR factorization, and suppose that the sketch G ∈ C^{r̃×m}
is Gaussian. Then GQ is rectangular Gaussian (again using the rotational invariance of
Gaussian random matrices), so well-conditioned. Suppose without loss of generality that G
is scaled so that 1 − δ ≤ σ_i(GQ) ≤ 1 + δ for some δ < 1 (note that here δ isn't an O(u)
quantity; δ = 0.5, say, is a typical number). Therefore, since [A b]ṽ = Qv with v = Rṽ, for
any ṽ ∈ C^{n+1} we have (1 − δ)‖[A b]ṽ‖_2 ≤ ‖G[A b]ṽ‖_2 ≤ (1 + δ)‖[A b]ṽ‖_2. Taking ṽ = [x; −1],
it follows that for any vector x ∈ C^n we have

(1 − δ)‖Ax − b‖_2 ≤ ‖G(Ax − b)‖_2 ≤ (1 + δ)‖Ax − b‖_2.

Consequently, the minimizer x_s of ‖G(Ax − b)‖_2 for (13) also minimizes ‖Ax − b‖_2 for (12)
up to a factor (1 + δ)/(1 − δ). If the residual of the original problem can be made small, say
‖Ax − b‖_2/‖b‖_2 = 10^{−10}, then ‖Ax_s − b‖_2/‖b‖_2 ≤ ((1 + δ)/(1 − δ)) × 10^{−10}, which with a modest and typical
value δ = 1/2 gives 3 × 10^{−10}, giving an excellent least-squares fit. If A is well-conditioned,
this also implies that the solution x_s is close to the exact solution.
FFT (values↔coefficients map for polynomials) [e.g. Golub and Van Loan 2012 [14]]
multigrid [e.g. Elman-Silvester-Wathen 2014 [11]]
Direct methods (LU) for linear systems and least-squares problems (QR)
Stability of algorithms
2nd half
Krylov subspace methods for linear systems (GMRES, CG) and eigenvalue problems
(Arnoldi, Lanczos)
C6.4 Finite Element Method for PDEs: NLA arising in solutions of PDEs
C6.2 Continuous Optimisation: NLA in optimisation problems
and many more: differential equations, data science, optimisation, machine learning,... NLA
is everywhere in computational mathematics.
References
[1] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK’s
least-squares solver. SIAM J. Sci. Comp., 32(3):1217–1236, 2010.
[5] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm. Part II: Aggressive
early deflation. SIAM J. Matrix Anal. Appl., 23:948–973, 2002.
[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach
spaces. Handbook of the Geometry of Banach Spaces, 1:317–366, 2001.
[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM
J. Numer. Anal., 7(1):1–46, 1970.
[8] J. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, USA, 1997.
[9] J. Dongarra and F. Sullivan. Guest editors’ introduction: The top 10 algorithms. IEEE
Computer Architecture Letters, 2(01):22–23, 2000.
[10] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford
University Press, 2017.
[11] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite elements and fast iterative
solvers: with applications in incompressible fluid dynamics. Oxford University Press,
USA, 2014.
[13] D. F. Gleich. PageRank beyond the web. SIAM Rev., 57(3):321–363, 2015.
[14] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, 4th edition, 2012.
[15] S. Güttel and F. Tisseur. The nonlinear eigenvalue problem. Acta Numer., 26:1–94,
2017.
[16] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM
Rev., 53(2):217–288, 2011.
[19] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press,
1991.
[20] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, second
edition, 2012.
[21] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev.,
51(3):455–500, 2009.
[22] P.-G. Martinsson and J. A. Tropp. Randomized numerical linear algebra: Foundations
and algorithms. Acta Numer., pages 403–572, 2020.
[25] Y. Nakatsukasa and J. A. Tropp. Fast & accurate randomized algorithms for linear
systems and eigenvalue problems. arXiv:2111.00113.
[26] C. C. Paige. Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric
matrix. IMA J. Appl. Math., 18(3):341–349, 1976.
[27] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and
sparse least squares. ACM Trans. Math. Soft., 8(1):43–71, 1982.
[29] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7(3):856–869, 1986.
[30] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory (Computer Science and
Scientific Computing). Academic Press, 1990.
[31] V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.
[32] D. B. Szyld. The many proofs of an identity on the norm of oblique projections. Nu-
merical Algorithms, 42(3-4):309–323, 2006.
[33] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[34] M. Udell and A. Townsend. Why are big data matrices approximately low rank? SIAM
Journal on Mathematics of Data Science, 1(1):144–160, 2019.
[35] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and
Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.