
C6.1 Numerical Linear Algebra

Yuji Nakatsukasa*

Welcome to numerical linear algebra (NLA)! NLA is a beautiful subject that combines
mathematical rigor, amazing algorithms, and an extremely rich variety of applications.
What is NLA? In a sentence, it is a subject that deals with the numerical solution (i.e.,
using a computer) of linear systems Ax = b (given A ∈ Rn×n (i.e., a real n × n matrix) and
b ∈ Rn (real n-vector), find x ∈ Rn ) and eigenvalue problems Ax = λx (given A ∈ Rn×n ,
find λ ∈ C and x ∈ Cn ), for problems that are too large to solve by hand (n ≥ 4 is already
large; we aim for n in the thousands or even millions). This may well sound dull, and some
mathematicians (the purely oriented ones?) tend to be turned off after hearing it — how
could such a course be interesting compared with other courses offered by the Oxford
Mathematical Institute? I hope and firmly believe that by the end of the course you will all
agree that there is more to the subject than you imagined. The rapid rise of data science and
machine learning has only meant that the importance of NLA keeps growing, with a vast
number of problems in these fields requiring NLA techniques and algorithms. It is also worth
noting that these fields have had an enormous impact on the direction of NLA itself; in
particular, the recent and very active area of randomized algorithms was born out of needs
arising from them.
In fact NLA is a truly exciting field that utilises a huge number of ideas from different
branches of mathematics (e.g. matrix analysis, approximation theory, and probability) to
solve problems that actually matter in real-world applications. Having said that, the number
of prerequisites for taking the course is the bare minimum; essentially a basic understanding
of the fundamentals of linear algebra would suffice (and the first lecture will briefly review
the basic facts). If you’ve taken the Part A Numerical Analysis course you will find it helpful,
but again, this is not necessary.
The field of NLA has been blessed with many excellent books on the subject. These notes
try to be self-contained, but these references will definitely help. There is a lot to learn;
literally as much as you want to.

• Trefethen-Bau (97) [33]: Numerical Linear Algebra
  – covers essentials, beautiful exposition

• Golub-Van Loan (12) [14]: Matrix Computations
  – classic, encyclopedic

• Horn and Johnson (12) [20]: Matrix Analysis (& Topics in Matrix Analysis (86) [19])
  – excellent theoretical treatise, little numerical treatment

• J. Demmel (97) [8]: Applied Numerical Linear Algebra
  – impressive content

• N. J. Higham (02) [17]: Accuracy and Stability of Numerical Algorithms
  – bible for stability, conditioning

• H. C. Elman, D. J. Silvester, A. J. Wathen (14) [11]: Finite Elements and Fast Iterative Solvers
  – PDE applications of linear systems, Krylov methods and preconditioning

* Last update: December 1, 2021. Please report any corrections or comments on these lecture notes to [email protected]

This course covers the fundamentals of NLA. We first discuss the singular value decom-
position (SVD), which is a fundamental matrix decomposition whose importance is only
growing. We then turn to linear systems and eigenvalue problems. Broadly, we will cover

• Direct methods (n ≲ 10,000): Sections 5–10 (except 8)

• Iterative methods (n ≲ 1,000,000, sometimes larger): Sections 11–13

• Randomized methods (n ≳ 1,000,000): Sections 14–16

in this order. Lectures 1–4 cover the fundamentals of matrix theory, in particular the SVD,
its properties and applications.
This document consists of 16 sections. Very roughly speaking, one section corresponds
to one lecture (though this will not be followed strictly at all).

Contents

0 Introduction, why Ax = b and Ax = λx?

1 Basic LA review
  1.1 Warmup exercise
  1.2 Structured matrices
  1.3 Matrix eigenvalues: basics
  1.4 Computational complexity (operation counts) of matrix algorithms
  1.5 Vector norms
  1.6 Matrix norms
  1.7 Subspaces and orthonormal matrices

2 SVD: the most important matrix decomposition
  2.1 (Some of the many) applications and consequences of the SVD: rank, column/row space, etc
  2.2 SVD and symmetric eigenvalue decomposition
  2.3 Uniqueness etc

3 Low-rank approximation via truncated SVD
  3.1 Low-rank approximation: image compression

4 Courant-Fischer minmax theorem
  4.1 Weyl's inequality
    4.1.1 Eigenvalues of nonsymmetric matrices are sensitive to perturbation
  4.2 More applications of C-F
  4.3 (Taking stock) Matrix decompositions you should know

5 Linear systems Ax = b
  5.1 Solving Ax = b via LU
  5.2 Pivoting
  5.3 Cholesky factorisation for A ≻ 0

6 QR factorisation and least-squares problems
  6.1 QR via Gram-Schmidt
  6.2 Towards a stable QR factorisation: Householder reflectors
  6.3 Householder QR factorisation
  6.4 Givens rotations
  6.5 Least-squares problems via QR
  6.6 QR-based algorithm for linear systems
  6.7 Solution of least-squares via normal equation
  6.8 Application of least-squares: regression/function approximation

7 Numerical stability
  7.1 Floating-point arithmetic
  7.2 Conditioning and stability
  7.3 Numerical stability; backward stability
  7.4 Matrix condition number
    7.4.1 Backward stable + well conditioned = accurate solution
  7.5 Stability of triangular systems
    7.5.1 Backward stability of triangular systems
    7.5.2 (In)stability of Ax = b via LU with pivots
    7.5.3 Backward stability of Cholesky for A ≻ 0
  7.6 Matrix multiplication is not backward stable
  7.7 Stability of Householder QR
    7.7.1 (In)stability of Gram-Schmidt

8 Eigenvalue problems
  8.1 Schur decomposition
  8.2 The power method for finding the dominant eigenpair Ax = λx
    8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank
    8.2.2 Shifted inverse power method

9 The QR algorithm
  9.1 QR algorithm for Ax = λx
  9.2 QR algorithm preprocessing: reduction to Hessenberg form
    9.2.1 The (shifted) QR algorithm in action
    9.2.2 (Optional) QR algorithm: other improvement techniques

10 QR algorithm continued
  10.1 QR algorithm for symmetric A
  10.2 Computing the SVD: Golub-Kahan's bidiagonalisation algorithm
  10.3 (Optional but important) QZ algorithm for generalised eigenvalue problems
  10.4 (Optional) Tractable eigenvalue problems

11 Iterative methods: introduction
  11.1 Polynomial approximation: basic idea of Krylov
  11.2 Orthonormal basis for Kk(A, b)
  11.3 Arnoldi iteration
  11.4 Lanczos iteration
  11.5 The Lanczos algorithm for symmetric eigenproblem

12 Arnoldi and GMRES for Ax = b
  12.1 GMRES convergence: polynomial approximation
  12.2 When does GMRES converge fast?
  12.3 Preconditioning for GMRES
  12.4 Restarted GMRES
  12.5 Arnoldi for nonsymmetric eigenvalue problems

13 Lanczos and Conjugate Gradient method for Ax = b, A ≻ 0
  13.1 CG algorithm for Ax = b, A ≻ 0
  13.2 CG convergence
    13.2.1 Chebyshev polynomials
    13.2.2 Properties of Chebyshev polynomials
  13.3 MINRES: symmetric (indefinite) version of GMRES
    13.3.1 MINRES convergence
  13.4 Preconditioned CG/MINRES

14 Randomized algorithms in NLA
  14.1 Randomized SVD by Halko-Martinsson-Tropp
  14.2 Pseudoinverse and projectors
  14.3 HMT approximant: analysis (down from 70 pages!)
  14.4 Tool from RMT: Rectangular random matrices are well conditioned
  14.5 Precise analysis for HMT (nonexaminable)
  14.6 Generalised Nyström
  14.7 MATLAB code

15 Randomized least-squares: Blendenpik
  15.1 Explaining Blendenpik via Marchenko-Pastur
  15.2 Blendenpik: solving minx kAx − bk2 using R̂
  15.3 Blendenpik experiments
  15.4 Sketch and solve for minx kAx − bk2
  15.5 Randomized algorithm for Ax = b, Ax = λx?

16 Conclusion and discussion
  16.1 Important (N)LA topics not treated
  16.2 Course summary
  16.3 Related courses you can take

Notation. For convenience below we list the notation that we use throughout the course.
ˆ λ(A): the set of eigenvalues of A. If a natural ordering exists (e.g. A is symmetric so
λ is real), λi (A) is the ith (largest) eigenvalue.
ˆ σ(A): the set of singular values of A. σi (A) always denotes the ith largest singular
value. We often just write σi .
ˆ diag(A): the vector of diagonal entries of A.

ˆ We use capital letters for matrices, lower-case for vectors and scalars. Unless otherwise
specified, A is a given matrix, b is a given vector, and x is an unknown vector.
ˆ k · k denotes a norm for a vector or matrix. k · k2 denotes the spectral (or 2-) norm,
k · kF the Frobenius norm. For vectors, to simplify notation we sometimes use k · k for
the spectral norm (which for vectors is the familiar Euclidean norm).
ˆ Span(A) denotes the span or range of the column space of A. This is the subspace
consisting of vectors of the form Ax.
ˆ We reserve Q for an orthonormal (or orthogonal) matrix. L, (U ) are often lower (upper)
triangular.
ˆ I always denotes the identity matrix. In is the n × n identity when the size needs to
be specified.

ˆ AT is the transpose of the matrix; (AT )ij = Aji . A∗ is the (complex) conjugate
transpose (A∗ )ij = A¯ji .

• ≻, ⪰ denote the positive (semi)definite ordering. That is, A ≻ 0 (A ⪰ 0) means A is
positive (semi)definite (abbreviated as PD, PSD), i.e., symmetric and with positive
(nonnegative) eigenvalues. A ≻ B means A − B ≻ 0.

We sometimes use the following shorthand: alg for algorithm, eigval for eigenvalue, eigvec
for eigenvector, singval for singular value, singvec for singular vector, and iff for "if and only
if".

0 Introduction, why Ax = b and Ax = λx?


As already stated, NLA is the study of numerical algorithms for problems involving matrices,
and there are only two main problems(!):

1. Linear system
Ax = b.
Given a (often square m = n but we will discuss m > n extensively, and m < n briefly
at the end) matrix A ∈ Rm×n and vector b ∈ Rm , find x ∈ Rn such that Ax = b.

2. Eigenvalue problem
Ax = λx.
Given a (always!1 ) square matrix A ∈ Rn×n find λ: eigenvalues (eigval), and x ∈ Rn :
eigenvectors (eigvec).

We’ll see many variants of these problems; one worthy of particular mention is the SVD,
which is related to eigenvalue problems but given its ubiquity has a life of its own. (So if
there’s a third problem we solve in NLA, it would definitely be the SVD.)
It is worth discussing why we care about linear systems and eigenvalue problems.
The primary reason is that many (in fact most) problems in scientific computing (and
even machine learning) boil down to linear problems:

ˆ Because that’s often the only way to deal with the scale of problems we face today!
(and in future)

ˆ For linear problems, so much is understood and reliable algorithms are available2 .
1
There are exciting recent developments involving eigenvalue problems for rectangular matrices, but these
are outside the scope of this course.
2
A pertinent quote is Richard Feynman’s “Linear systems are important because we can solve them”.
Because we can solve them, we do all sorts of tricks to reduce difficult problems to linear systems!

A related important question is where and how these problems arise in real-world prob-
lems.
Let us mention a specific context that is relevant in data science: optimisation. Suppose
one is interested in minimising a high-dimensional real-valued function f (x) : Rn → R where
n ≫ 1.
A successful approach is to try and find critical points, that is, points x∗ where ∇f (x∗ ) =
0. Mathematically, this is a non-linear high-dimensional root-finding problem of finding
x ∈ Rn such that ∇f (x) =: F (x) = 0 (the vector 0 ∈ Rn ) where F : Rn → Rn . One of
the most commonly employed methods for this task is Newton’s method (which some of you
have seen in Prelims Constructive Mathematics). This boils down to
• Newton's method for F (x) = 0, F : Rn → Rn nonlinear:

  1. Start with an initial guess x(0) ∈ Rn, set i = 0.

  2. Form the Jacobian matrix J ∈ Rn×n, Jij = ∂Fi(x)/∂xj evaluated at x = x(i).

  3. Update x(i+1) := x(i) − J−1 F(x(i)), set i ← i + 1, go to step 2 and repeat.

Note that the main computational task is to find the vector y = J−1 F(x(i)), which is
a linear system Jy = F(x(i)) (which we solve for the vector y).
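Here is a minimal sketch of this in NumPy (the test function F and its Jacobian below are made-up examples, not from the notes); the point is that each Newton step is exactly one linear solve:

import numpy as np

# Newton's method for F(x) = 0: each iteration solves the linear system J y = F(x).
def newton(F, J, x0, tol=1e-12, maxit=50):
    x = np.array(x0, dtype=float)
    for _ in range(maxit):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        y = np.linalg.solve(J(x), Fx)   # the main computational task per step
        x = x - y                       # x^(i+1) = x^(i) - J^{-1} F(x^(i))
    return x

# Illustrative example: critical point of f(x1, x2) = x1^4 + x2^4 + x1*x2,
# so F = grad f and J = Hessian of f.
F = lambda x: np.array([4 * x[0]**3 + x[1], 4 * x[1]**3 + x[0]])
J = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 12 * x[1]**2]])
print(newton(F, J, [1.0, -2.0]))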
What about eigenvalue problems Ax = λx? Google's PageRank is a famous application
(we will cover this if we have time). Another example is the Schrödinger equation of physics
and chemistry. Sometimes a nonconvex optimisation problem can be solved via an eigenvalue
problem.
Equally important is principal component analysis (PCA), which can be used for data
compression. This is more tightly connected to the SVD.
Other sources of linear algebra problems include differential equations, optimisation,
regression, data analysis, ...

1 Basic LA review
We start with a review of key LA facts that will be used in the course. Some will be trivial
to you while others may not. You might also notice that some facts that you have learned in
a core LA course will not be used in this course. For example we will never deal with finite
fields, and determinants only play a passing role.

1.1 Warmup exercise


Let A ∈ Rn×n (an n × n square matrix; or Cn×n — the difference hardly matters in most of
this course3). Try to think of statements that are equivalent to A being nonsingular. Try to
come up with as many conditions as possible before turning the page.
3 While there are a small number of cases where the distinction between real and complex matrices matters,
in the majority of cases it does not, and the argument carries over to complex matrices by replacing ·T with
·∗. Therefore, for the most part, we lose no generality in assuming the matrix is real (which slightly simplifies
our mindset). Whenever necessary, we will highlight the subtleties that arise from the difference between
real and complex. (For the curious, these are the Schur form/decomposition, LDLT factorisation and
eigenvalue decomposition for (real) matrices with complex eigenvalues.)
Here is a list: The following are equivalent.

1. A is nonsingular.

2. A is invertible: A−1 exists.

3. The map A : Rn → Rn is a bijection.

4. All n eigenvalues of A are nonzero.

5. All n singular values of A are positive.

6. rank(A) = n.

7. The rows of A are linearly independent.

8. The columns of A are linearly independent.

9. Ax = b has a solution for every b ∈ Cn .

10. A has no nonzero null vector.

11. AT has no nonzero null vector.

12. A∗ A is positive definite (not just semidefinite).

13. det(A) ≠ 0.

14. There exists an n × n matrix A−1 such that A−1 A = In. (This, by the way, is equivalent to
    AA−1 = In — a nontrivial fact.)

15. . . . (what did I miss?)

1.2 Structured matrices


We will be discussing lots of structured matrices. For square matrices,

ˆ Symmetric: Aij = Aji (Hermitian: Aij = A¯ji )

– The most important property of symmetric matrices is the symmetric eigenvalue


decomposition A = V ΛV T ; V is orthogonal V T V = V V T = In , and Λ is a
diagonal matrix of eigenvalues Λ = diag(λ1 , . . . , λn ).
– symmetric positive (semi)definite A ≻ 0 (A ⪰ 0): symmetric and all positive (nonnegative)
eigenvalues.
ˆ Orthogonal: AAT = AT A = I (Unitary: AA∗ = A∗ A = I). Note that for square
matrices, AT A = I implies AAT = I.

ˆ Skew-symmetric: Aij = −Aji (skew-Hermitian: Aij = −A¯ji ).

ˆ Normal: AT A = AAT . (Here it’s better to discuss the complex case A∗ A = AA∗ : this
is a necessary and sufficient condition for diagonalisability under a unitary transfor-
mation, i.e., A = U ΛU ∗ where Λ is diagonal and U is unitary.)

ˆ Tridiagonal: Aij = 0 if |i − j| > 1.

ˆ Upper triangular: Aij = 0 if i > j.

ˆ Lower triangular: Aij = 0 if i < j.

For (possibly nonsquare) matrices A ∈ Cm×n (usually m ≥ n):

ˆ (upper) Hessenberg: Aij = 0 if i > j + 1. (we will see this structure often.)

ˆ “orthonormal”: AT A = In , and A is (tall) rectangular. (This isn’t an established


name—we could call it “matrix with orthonormal columns” every time it appears—
but we use these matrices all the time in this course, so we need a consistent shorthand
name for it.)

ˆ sparse: most elements are zero. nnz(A) denotes the number of nonzero elements in A.
Matrices that are not sparse are called dense.

Other structures: Hankel, Toeplitz, circulant, symplectic,... (we won’t use these in this
course)

1.3 Matrix eigenvalues: basics


Ax = λx,  A ∈ Rn×n,  (0 ≠) x ∈ Rn

• Example: with

      [2 1 1]        [1]
  A = [1 2 1],   x = [1],   we have Ax = 4x.
      [1 1 2]        [1]

  This matrix has an eigenvalue λ = 4, with corresponding eigenvector x = [1, 1, 1]T (together
  they are an eigenpair).
  An n × n matrix always has n eigenvalues (but not always n linearly independent eigenvectors);
  in the example above, (λ, x) = (1, [1, −1, 0]T) and (1, [0, 1, −1]T) are also eigenpairs.

• The eigenvalues are the roots of the characteristic polynomial det(λI − A) = 0:
  det(λI − A) = (λ − λ1)(λ − λ2) · · · (λ − λn).

• By Galois theory, eigenvalues cannot be computed exactly (in finitely many arithmetic
  operations and radicals) for general matrices with n ≥ 5. But we still want to compute them!
  In this course we will (among other things) explain how this is done in practice by the QR
  algorithm, one of the greatest hits of the field.
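As a quick sanity check of the example above, one can verify the eigenpairs numerically (a NumPy sketch; the notes' own code, cf. Section 14.7, is in MATLAB):

import numpy as np

# Verify the eigenpairs of the 3x3 example: the eigenvalues should be {4, 1, 1}.
A = np.array([[2.0, 1, 1],
              [1, 2.0, 1],
              [1, 1, 2.0]])
lam, V = np.linalg.eig(A)
print(np.sort(lam))                 # approximately [1, 1, 4]
x = np.array([1.0, 1.0, 1.0])
print(np.allclose(A @ x, 4 * x))    # True: (4, [1,1,1]^T) is an eigenpair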

1.4 Computational complexity (operation counts) of matrix algorithms
Since NLA is a field that aspires to develop practical algorithms for solving matrix problems,
it is important to be aware of the computational cost (often referred to as complexity) of
the algorithms. We will discuss these as the algorithms are developed, but for now let’s
examine the costs for basic matrix-matrix multiplication. The cost is measured in terms
of flops (floating-point operations), which counts the number of additions, subtractions,
multiplications, and divisions (all treated equally) performed.
In NLA the constant in front of the leading term in the cost is (clearly) important. It
is customary (for good reason) to only track the leading term of the cost. For example,
n3 + 10n2 is abbreviated to n3 .

ˆ Multiplying two n × n matrices AB costs 2n3 flops. More generally, if A is m × n and


B is n × k, then computing AB costs 2mnk flops.

• Multiplying an m × n matrix A by a vector costs 2mn flops.

Norms
We will need a tool (or metric) to measure how big a vector or matrix is. Norms give us a
means to achieve this. Surely you have already seen some norms (e.g. the vector Euclidean
norm). We will discuss a number of norms for vectors and matrices that we will use in the
upcoming lectures.

1.5 Vector norms


For vectors x = [x1 , . . . , xn ]T ∈ Cn

ˆ p-norm kxkp = (|x1 |p + |x2 |p + · · · + |xn |p )1/p (1 ≤ p ≤ ∞)


  – Euclidean norm (= 2-norm): kxk2 = √(|x1|² + |x2|² + · · · + |xn|²)
  – 1-norm: kxk1 = |x1| + |x2| + · · · + |xn|
  – ∞-norm: kxk∞ = maxi |xi|
Of particular importance are the three cases p = 1, 2, ∞. In this course, we will see p = 2
the most often.
A norm needs to satisfy the following axioms:

ˆ kαxk = |α|kxk for any α ∈ C (homogeneity),

ˆ kxk ≥ 0 and kxk = 0 ⇔ x = 0 (nonnegativity),

ˆ kx + yk ≤ kxk + kyk (triangle inequality).

The vector p-norm satisfies all these, for any p.


Here are some useful inequalities for vector norms. A proof is left as an exercise and
is highly recommended. (Try to think about when each equality is attained.) For x ∈ Cn,

• (1/√n) kxk2 ≤ kxk∞ ≤ kxk2

• (1/√n) kxk1 ≤ kxk2 ≤ kxk1

• (1/n) kxk1 ≤ kxk∞ ≤ kxk1

Note that with the 2-norm, kUxk2 = kxk2 for any unitary U and any x ∈ Cn. Norms
with this property are called unitarily invariant.
The 2-norm is also induced by the inner product: kxk2 = √(xT x). An important property
of inner products is the Cauchy-Schwarz inequality |xT y| ≤ kxk2 kyk2 (which can be proved
directly, but is perhaps best proved in a more general setting)4. When we just write kxk for a
vector we mean the 2-norm.

4 Just in case, here's a proof: for any scalar c, kx − cyk2² = kxk2² − 2c xT y + c² kyk2². This is minimised
with respect to c at c = xT y/kyk2², with minimum value kxk2² − (xT y)²/kyk2². Since this must be ≥ 0, the
Cauchy-Schwarz inequality follows.

1.6 Matrix norms


We now turn to norms of matrices. As you will see, many (but not the Frobenius and trace
norms) are defined via the vector norms (these are called induced norms).
• p-norm: kAkp = maxx≠0 kAxkp / kxkp

  – 2-norm = spectral norm (= Euclidean norm): kAk2 = σmax(A) (largest singular value;
    see Section 2)
  – 1-norm: kAk1 = maxi Σ_{j=1}^{m} |Aji|
  – ∞-norm: kAk∞ = maxi Σ_{j=1}^{n} |Aij|

• Frobenius norm: kAkF = √(Σi Σj |Aij|²) (the 2-norm of the vectorisation)

• trace norm = nuclear norm: kAk∗ = Σ_{i=1}^{min(m,n)} σi(A). (This is the maximum of
  trace(QT A) over orthonormal Q, hence the name.)

The trace norm, Frobenius norm and 2-norm are unitarily invariant: kAk∗ = kUAVk∗,
kAkF = kUAVkF, kAk2 = kUAVk2 for any unitary/orthogonal U, V.

Norm axioms hold for each of these. Useful inequalities include the following (exercise;
it is instructive to study the cases where each of these equalities holds). For A ∈ Cm×n,

• (1/√n) kAk∞ ≤ kAk2 ≤ √m kAk∞

• (1/√m) kAk1 ≤ kAk2 ≤ √n kAk1

• kAk2 ≤ kAkF ≤ √(min(m, n)) kAk2

A useful property of p-norms is that they are subordinate, i.e., kABkp ≤ kAkp kBkp
(problem sheet). Note that not all norms satisfy this: e.g. with the max norm
kAkmax = maxi,j |Aij|, taking A = [1, 1] and B = [1, 1]T one has kABkmax = 2 but
kAkmax = kBkmax = 1.
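These definitions are easy to explore numerically; the following NumPy sketch (not from the notes) checks the induced norms against their characterisations and reproduces the max-norm example:

import numpy as np

# Matrix norms on a random matrix, and the max-norm counterexample above.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

print(np.isclose(np.linalg.norm(A, 2), np.linalg.svd(A, compute_uv=False)[0]))  # sigma_max
print(np.isclose(np.linalg.norm(A, 1), np.abs(A).sum(axis=0).max()))            # max column sum
print(np.isclose(np.linalg.norm(A, np.inf), np.abs(A).sum(axis=1).max()))       # max row sum
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt((A**2).sum())))

B = np.array([[1.0, 1.0]])       # the A = [1, 1] of the example
C = np.array([[1.0], [1.0]])     # the B = [1, 1]^T of the example
print(np.abs(B @ C).max(), np.abs(B).max() * np.abs(C).max())   # 2.0 vs 1.0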

1.7 Subspaces and orthonormal matrices


A key notion that we will keep using throughout is a subspace S. In this course we
will almost exclusively confine ourselves to subspaces of Rn, even though they generalize
to more abstract vector spaces. A subspace is the set of vectors that can be written as a
linear combination of basis vectors v1, . . . , vd, which are assumed to be linearly independent
(otherwise there is a basis with fewer vectors). That is, x ∈ S iff x = c1 v1 + · · · + cd vd (where
the ci are scalars, in R or C). The integer d is called the dimension of the subspace. We also
say the subspace is spanned by the vectors v1, . . . , vd, or that v1, . . . , vd span the subspace.
How does one represent a subspace? An obvious answer is to use the basis vectors
v1 , . . . , vd . This sometimes becomes cumbersome, and a common and convenient way to
represent the subspace is to use a (tall-skinny) rectangular matrix V ∈ Rn×d = [v1 , v2 , . . . , vd ],
as S = span(V ) (or sometimes just “subspace V ”) which means the subspace of vectors that
can be written as V c, where c is a ’coefficient’ vector c ∈ Rd .
It will be (not necessary but) convenient to represent subspaces using an orthonormal
matrix Q ∈ Rn×d . (once we cover the QR factorisation, you’ll see that there is no loss of
generality in doing so).
An important fact about subspaces of Rn is the following:

Lemma 1.1 Let V1 ∈ Rn×d1 and V2 ∈ Rn×d2 each have linearly independent column vectors.
If d1 + d2 > n, then there is a nonzero intersection between two subspaces S1 = span(V1 ) and
S2 = span(V2 ), that is, there is a nonzero vector x ∈ Rn such that x = V1 c1 = V2 c2 for some
vectors c1 , c2 .

This is straightforward but important enough to warrant a proof.

Proof: Consider the matrix M := [V1, V2], which is of size n × (d1 + d2). Since d1 + d2 > n
by assumption, this matrix has a right null vector5 c ≠ 0 such that Mc = 0. Splitting
c = [c1; −c2] we have the required result. □
Let us conclude this review with a list of useful results that will be helpful. Proofs (or
counterexamples) should be straightforward.

ˆ (AB)T = B T AT

ˆ If A, B invertible, (AB)−1 = B −1 A−1

ˆ If A, B square and AB = I, then BA = I


• [Im X; 0 In]−1 = [Im −X; 0 In]

ˆ Neumann series: if kXk < 1 in any norm,

(I − X)−1 = I + X + X 2 + X 3 + · · ·

• For a square n × n matrix A, the trace is Trace(A) = A11 + A22 + · · · + Ann (the sum of the
  diagonal entries). For any X, Y such that XY is square, Trace(XY) = Trace(YX) (quite useful).
  For B ∈ Rm×n, we have kBkF² = Σi Σj |Bij|² = Trace(BT B).

ˆ Triangular structure (upper or lower) is invariant under addition, multiplication, and


inversion. That is, triangular matrices form a ring (in abstract algebra; don’t worry if
this is foreign to you).

ˆ Symmetry is invariant under addition and inversion, but not multiplication; AB is


usually not symmetric even if A, B are.

2 SVD: the most important matrix decomposition


We now start the discussion of the most important topic of the course: the singular value
decomposition (SVD). The SVD exists for any matrix, square or rectangular, real or complex.
We will prove its existence and discuss its properties and applications in particular in
low-rank approximation, which can immediately be used for compressing the matrix, and
therefore data.
The SVD has many intimate connections to symmetric eigenvalue problems. Let’s start
with a review.
5
If this argument isn’t convincing to you now, probably the easiest way to see this is via the SVD; so stay
tuned and we’ll resolve this in footnote 9, once you’ve seen the SVD!

ˆ Symmetric eigenvalue decomposition: Any symmetric matrix A ∈ Rn×n has the
decomposition
A = V ΛV T (1)
where V is orthogonal, V T V = In = V V T , and Λ = diag(λ1 , . . . , λn ) is a diagonal
matrix of eigenvalues.
λi are the eigenvalues, and V is the matrix of eigenvectors (its columns are the eigenvec-
tors).
The decomposition (1) makes two remarkable claims: the eigenvectors can be taken to
be orthogonal (which is true more generally of normal matrices s.t. A∗ A = AA∗ ), and the
eigenvalues are real.
It is worth reminding you that eigenvectors are not uniquely determined: (assuming
they are normalised s.t. the 2-norm is 1) their signs can always be flipped. (And actually
more, when there are eigenvalues that are multiple λi = λj ; in this case the eigenvectors
span a subspace whose dimension matches the multiplicity. For example, any vector is an
eigenvector of the identity matrix I).
Now here is the protagonist of this course: Be sure to spend dozens (if not hundreds) of
hours thinking about it!
Theorem 2.1 (Singular Value Decomposition (SVD)) Any matrix A ∈ Rm×n has the
decomposition :
A = U ΣV T . (2)
Here U T U = V T V = In (assuming m ≥ n for definiteness), Σ = diag(σ1 , . . . , σn ), σ1 ≥ σ2 ≥
· · · ≥ σn ≥ 0.
σi (always nonnegative) are called the singular values of A. The rank of A is the number
of positive singular values. The columns of U are called the left singular vectors, and the
columns of V are the right singular vectors.

Writing U = [u1, . . . , un] and V = [v1, . . . , vn], we have an important alternative expression
for A: A = σ1 u1 v1T + σ2 u2 v2T + · · · + σn un vnT. We will use this expression repeatedly in what
follows. Also note that we always order the singular values in nonincreasing order, so σ1 is
always the largest singular value.
The SVD tells us that any (tall) matrix can be written as orthonormal × diagonal × orthogonal.
Roughly, orthonormal/orthogonal matrices can be thought of as rotations or reflections, so the
SVD says the action of a matrix can be thought of as a rotation/reflection followed by
magnification (or shrinkage), followed by another rotation/reflection.

Proof: For the SVD6 (m ≥ n and assume full rank σn > 0 for simplicity): Take the Gram matrix
AT A (symmetric) and its eigendecomposition AT A = V ΛV T with V orthogonal. Λ is
nonnegative, and (AV)T(AV) =: Σ² is diagonal, so AVΣ−1 =: U is orthonormal. Right-multiply
by ΣV T to get A = UΣV T. □
6
I like to think this is the shortest proof out there.
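The construction in this proof can be carried out directly in code. Below is a NumPy sketch (illustration only — squaring A is numerically inadvisable for ill-conditioned matrices, and in practice one calls a library SVD; see Section 10.2):

import numpy as np

# SVD via the Gram-matrix construction used in the proof (m >= n, full rank).
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

lam, V = np.linalg.eigh(A.T @ A)          # A^T A = V Lambda V^T, Lambda >= 0
idx = np.argsort(lam)[::-1]               # order eigenvalues nonincreasingly
lam, V = lam[idx], V[:, idx]
Sigma = np.sqrt(lam)                      # singular values
U = (A @ V) / Sigma                       # U = A V Sigma^{-1}, orthonormal columns

print(np.allclose(U * Sigma @ V.T, A))    # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))    # U^T U = I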

It is also worth mentioning the "full" SVD: A = [U, U⊥] [Σ; 0] V T, where [U, U⊥] ∈ Rm×m is
square and orthogonal. Essentially this follows from the (thin) SVD (2) by completing U in (2)
with its orthogonal complement U⊥ (whose construction can be done via the Householder
QR factorisation of Section 6).

2.1 (Some of the many) applications and consequences of the SVD: rank, column/row space, etc
From the SVD one can immediately read off a number of important properties of the matrix.
For example:
• The rank r of A ∈ Rm×n, often denoted rank(A): this is the number of nonzero (positive)
  singular values σi(A). rank(A) is also equal to the number of linearly independent rows, or
  the number of linearly independent columns, as you probably learnt in your first course on
  linear algebra (exercise).

  – We can always write A = σ1 u1 v1T + · · · + σr ur vrT with r = rank(A).

  Important: An m × n matrix A of rank r can be written as an (outer) product of m × r
  and r × n matrices:

  A = Ur Σr VrT.

  To see this, note from the SVD that A = σ1 u1 v1T + · · · + σn un vnT = σ1 u1 v1T + · · · + σr ur vrT
  (since σr+1 = · · · = σn = 0), and so A = Ur Σr VrT where Ur = [u1, . . . , ur], Vr = [v1, . . . , vr],
  and Σr is the leading r × r submatrix of Σ.

• Column space (Span(A), the linear subspace spanned by vectors of the form Ax): the span of
  Ur = [u1, . . . , ur], often denoted Span(Ur).

• Row space (Span(AT)): the span of Vr = [v1, . . . , vr], i.e., Span(Vr).

• Null space Null(A): the span of vr+1, . . . , vn, as Av = 0 for these vectors. Null(A) is trivial
  (just the 0 vector) if m ≥ n and r = n. (When m < n the full SVD is needed to describe
  Null(A).)
Aside from these and other applications, the SVD is also a versatile theoretical tool. Very
often, a good place to start in proving a fact about matrices is to first consider its SVD; you
will see this many times in this course. For example, the SVD can give solutions immediately
for linear systems and least-squares problems, though there are more efficient ways to solve
these problems.

2.2 SVD and symmetric eigenvalue decomposition
As mentioned above, the SVD and the eigenvalue decomposition for symmetric matrices are
closely connected. Here are some results that highlight the connections between A = U ΣV T
and symmetric eigenvalue decomposition. (We assume m ≥ n for definiteness)

ˆ V is an eigvector matrix of AT A. (To verify, see proof of SVD)

ˆ U is an eigvector matrix (for nonzero eigvals) of AAT (be careful with sign flips; see
below)

• σi(A) = √(λi(AT A)) for i = 1, . . . , n

ˆ If A is symmetric, its singular values σi (A) are the absolute values of its eigenvalues
λi (A), i.e., σi (A) = |λi (A)|.
Exercise: what are the singular values when A is unitary, skew-symmetric, normal, or triangular?
(problem sheet)

• Jordan-Wielandt matrix [0 A; AT 0]: This matrix has eigenvalues ±σi(A), and m − n copies
  of eigenvalues at 0. Its eigenvector matrix is [U U U0; V −V 0], where AT U0 = 0 (U0 is empty
  when m = n). This matrix, along with the Gram matrix AT A, is a very useful tool when one
  tries to extend a result on symmetric eigenvalues to an analogue in terms of the SVD (or vice
  versa).

2.3 Uniqueness etc


We have established the existence of the SVD A = U ΣV T . A natural question is: is it
unique? In other words, are the factors U, Σ, V uniquely determined by A?
It is straightforward to see that the singular vectors are not uniquely determined. Most
obviously, the singular vectors can be flipped in sign, just like eigenvectors. However, note
that the signs of ui, vi are not entirely arbitrary: if we replace ui by −ui, the same needs to
be done for vi in order to satisfy A = σ1 u1 v1T + · · · + σn un vnT. Essentially, once the sign (or
rotation) of ui (or vi) is fixed, vi (or ui) is determined uniquely.
More generally, in the presence of multiple singular values (i.e., σi = σi+1 for some
i), there is a higher degree of freedom in the SVD. Again this is very much analogous to
eigenvalues (recall the discussion on eigenvectors of the identity). Here think of what the
SVD is for an orthogonal matrix: there is an enormous amount of degrees of freedom in the
choice of U and V . The singular values σi , on the other hand, are always unique, as they
are the eigenvalues of the Gram matrix.

3 Low-rank approximation via truncated SVD


While the SVD has a huge number of applications, undoubtedly the biggest reason that makes
it so important in computational mathematics is its optimality for low-rank approximation.

To discuss this topic we need some preparation. We will make heavy use of the spectral
norm of matrices. We start with an important characterisation of kAk2 in terms of the
singular value(s), which we previously stated but did not prove.

Proposition 3.1

kAk2 = maxx≠0 kAxk2/kxk2 = maxkxk2=1 kAxk2 = σ1(A).

Proof: Use the SVD: for any x with unit norm kxk2 = 1,

kAxk2 = kUΣV T xk2
      = kΣV T xk2                              (by unitary invariance)
      = kΣyk2                                  (with y = V T x, kyk2 = 1)
      = √(σ1² y1² + · · · + σn² yn²)
      ≤ √(σ1² (y1² + · · · + yn²)) = σ1 kyk2 = σ1.

Finally, note that taking x = v1 (the leading right singular vector), we have kAv1k2 = σ1. □

Similarly, the Frobenius norm can be expressed as kAkF = √(Σi Σj |Aij|²) = √(Σ_{i=1}^{n} σi(A)²),
and the trace norm is (by definition) kAk∗ = Σ_{i=1}^{min(m,n)} σi(A) (exercise). In general, norms
that are unitarily invariant can be characterized by the singular values [20].
Now to the main problem: Given A ∈ Rm×n , consider the problem of finding a rank-r
matrix (remember; these are matrices with r nonzero singular values) Ar ∈ Rm×n that best
approximates A. That is, find the minimiser Ar for

argminrank(Ar )≤r kA − Ar k2 . (3)

It is definitely worth visualizing the situation:

A ≈ Ar = Ur Σr VrT

We immediately see that a low-rank approximation (when possible) is beneficial in terms of
storage cost when r ≪ m, n. Instead of storing mn entries for A, we can store the entries of
Ur, Σr, Vr ((m + n + 1)r entries) to keep the low-rank factorisation without losing information.
Low-rank approximation can also bring computational benefits. For example, in order to
compute Ar x for a vector x, by noting that Ar x = Ur(Σr(VrT x)), one needs only O((m + n)r)
operations7 instead of O(mn). The utility and prevalence of low-rank matrices in data science
is remarkable.
Here is the solution for (3): the truncated SVD, defined via Ar = σ1 u1 v1T + · · · + σr ur vrT
(= Ur Σr VrT). This is the matrix obtained by truncating (removing) the trailing terms in the
expression A = σ1 u1 v1T + · · · + σn un vnT. Pictorially, A is a sum of n rank-1 (outer product)
terms σi ui viT, and Ar keeps only the leading r of them:

A  = σ1 u1 v1T + σ2 u2 v2T + · · · + σn un vnT,
Ar = σ1 u1 v1T + σ2 u2 v2T + · · · + σr ur vrT.

In particular, we have
Theorem 3.1 For any A ∈ Rm×n with singular values σ1(A) ≥ σ2(A) ≥ · · · ≥ σn(A) ≥ 0,
and any nonnegative integer8 r < min(m, n),

kA − Ark2 = σr+1(A) = min_{rank(B)≤r} kA − Bk2.    (4)

Before proving this result, let us make some observations.
• A good approximation A ≈ Ar is obtained iff σr+1 ≪ σ1.

ˆ Optimality holds for any unitarily invariant norm: that is, the norms in (3) can be
replaced by e.g. the Frobenius norm. This is surprising, as the low-rank approximation
problem minrank(B)≤r kA − Bk does depend on the choice of the norm (and for many
problems, including the least-squares problem in Section 6.5, the norm choice has a
significant effect on the solution). The proof for this fact is nonexaminable, but if
curious see [20] for a complete proof.
ˆ A prominent application of low-rank approximation is PCA (principal component anal-
ysis) in statistics and data science.
ˆ Many matrices have explicit or hidden low-rank structure (nonexaminable, but see
e.g. [34]).
7
I presume you’ve seen the big-Oh notation before—f (n) = O(nk ) means there exists a constant C
s.t. f (n)/nk ≤ C for all sufficiently large n. In NLA it is a convenient way to roughly measure the operation
count.
8
If r ≥ min(m, n) then we can simply take Ar = A.

Proof of Theorem 3.1:

1. Since rank(B) ≤ r, we can write B = B1 B2T where B1, B2 have r columns.

2. It follows that B2T (and hence B) has a null space of dimension at least n − r. That
   is, there exists an orthonormal matrix W ∈ Cn×(n−r) s.t. BW = 0. Then kA − Bk2 ≥
   k(A − B)Wk2 = kAWk2 = kUΣ(V T W)k2. (Why does the first inequality hold?)

3. Now since span(W) is (n − r)-dimensional, by Lemma 1.1 there is a nonzero intersection
   between span(W) and span([v1, . . . , vr+1]), the (r + 1)-dimensional subspace spanned by the
   leading r + 1 right singular vectors.
   We will use this type of argument again, so to be more precise: the matrix [W, v1, . . . , vr+1]
   is "fat" rectangular, so it must have a null vector. That is, [W, v1, . . . , vr+1][x1; x2] = 0 has
   a nonzero solution x1, x2; then Wx1 is such a vector9. We scale it so that it has unit
   norm kWx1k2 = k[v1, . . . , vr+1]x2k2 = 1, that is, kx1k2 = kx2k2 = 1.

4. Note that Wx1 = −[v1, . . . , vr+1]x2 and V T[v1, . . . , vr+1] = [Ir+1; 0], so
   kUΣV T Wx1k2 = kU [Σr+1; 0] x2k2, where Σr+1 is the leading (top-left) (r + 1) × (r + 1)
   part of Σ. As U is orthogonal this is equal to kΣr+1 x2k2, and kΣr+1 x2k2 ≥ σr+1 can be
   verified by direct calculation. Combining with step 2, kA − Bk2 ≥ σr+1.

Finally, for the reverse direction, take B = Ar. □
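A direct numerical check of (4), as a NumPy sketch (not part of the notes):

import numpy as np

# Best rank-r approximation via the truncated SVD; the error equals sigma_{r+1}.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
r = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ar = U[:, :r] * s[:r] @ Vt[:r, :]          # A_r = U_r Sigma_r V_r^T

print(np.linalg.norm(A - Ar, 2))           # equals...
print(s[r])                                # ...sigma_{r+1}(A), as in Theorem 3.1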

3.1 Low-rank approximation: image compression


A classical and visually pleasing illustration of low-rank approximation by the truncated SVD
is image compression. (Admittedly this is slightly outdated—it is not the state-of-the-art
way of compressing images.)
The idea is that greyscale images can be represented by a matrix where each entry
indicates the intensity of a pixel. The matrix can then be compressed by finding a low-rank
approximation10 , resulting in image compression. (Of course there are generalisations for
color images.)
Below in Figure 1 we take the Oxford logo, represent it as a matrix and find its rank-
r approximation, for varying r. (The matrix being a mere 500 × 500, its SVD is easy to
compute in a fraction of a second; see Section 10.2 for how this is done.) We then reconstruct
the image from the low-rank approximations to visualise them.
9 Let us now resolve the cliffhanger in footnote 5. The claim is that any "fat" m × n (m < n) matrix M
has a right null vector y ≠ 0 such that My = 0. To prove this, use the full SVD M = UΣV T to see that
Mvn = 0.
10
It is somewhat surprising that images are approximable by low-rank matrices. See
https://www.youtube.com/watch?v=9BYsNpTCZGg for a nice explanation.

Figure 1: Image compression by low-rank approximation via the truncated SVD. (The panels
show the original image and its rank 1, 5, 10, 20 and 50 approximations.)
We see that as the rank is increased the image becomes finer and finer. At rank 50 it is
fair to say the image looks almost identical to the original. The original matrix is 500 × 500,
so we still achieve a significant amount of data compression in the matrix with r = 50.
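Such an experiment is a few lines of code. Here is a sketch in NumPy/Matplotlib (the file name 'logo.png' and the greyscale conversion are placeholders/assumptions; the notes' own experiment uses MATLAB):

import numpy as np
import matplotlib.pyplot as plt

# Image compression by truncated SVD (illustrative sketch).
img = plt.imread('logo.png')             # placeholder file name
if img.ndim == 3:
    img = img[..., :3].mean(axis=2)      # crude greyscale conversion

U, s, Vt = np.linalg.svd(img, full_matrices=False)

for r in [1, 5, 10, 20, 50]:
    img_r = U[:, :r] * s[:r] @ Vt[:r, :]             # rank-r approximation
    print(r, np.linalg.norm(img - img_r, 2) / s[0])  # relative error sigma_{r+1}/sigma_1
    # plt.imshow(img_r, cmap='gray'); plt.show()     # uncomment to display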

4 Courant-Fischer minmax theorem


Continuing on SVD-related topics, we now discuss a very important and useful result with
far-reaching ramifications: the Courant-Fischer (C-F) minimax characterisation.

Theorem 4.1 The ith largest11 eigenvalue λi of a symmetric matrix A ∈ Rn×n is (below,
x ≠ 0)

λi(A) = max_{dim S=i} min_{x∈S} (xT Ax)/(xT x) = min_{dim S=n−i+1} max_{x∈S} (xT Ax)/(xT x).    (5)

Analogously, for any rectangular A ∈ Cm×n (m ≥ n), we have

σi(A) = max_{dim S=i} min_{x∈S} kAxk2/kxk2 = min_{dim S=n−i+1} max_{x∈S} kAxk2/kxk2.    (6)

It may take some time to get the hang of what the statements mean. One helpful way to look
at it is perhaps to note that, inside the maximum in (6), the expression is
min_{x∈S, kxk2=1} kAxk2 = min_{QT Q=Ii, kyk2=1} kAQyk2 = σmin(AQ) = σi(AQ), where span(Q) = S.
The C-F theorem says σi(A) is equal to the maximum possible value of this over all subspaces S
of dimension i.

Proof: We will prove (6). A proof for (5) is analogous and a recommended exercise.

1. Fix S and let Vi = [vi, . . . , vn]. We have dim(S) + dim(span(Vi)) = i + (n − i + 1) = n + 1,
   so there exists an intersection w ∈ S ∩ span(Vi), kwk2 = 1.

2. For this w, we have kAwk2 = kdiag(σi, . . . , σn)(ViT w)k2 ≤ σi; thus σi ≥ min_{x∈S} kAxk2/kxk2.

3. For the reverse inequality, take S = span([v1, . . . , vi]), for which the minimum is attained
   at w = vi, giving the value σi.

4.1 Weyl’s inequality


As an example of the many significant ramifications of the C-F theorem, we present Weyl’s
theorem 12 (or Weyl’s inequality), an important perturbation result for singular values and
eigenvalues of symmetric matrices.

Theorem 4.2 (Weyl's inequality)

• For the singular values of any matrix A,

  – σi(A + E) ∈ σi(A) + [−kEk2, kEk2] for all i.
  – Special case: kAk2 − kEk2 ≤ kA + Ek2 ≤ kAk2 + kEk2.

• For the eigenvalues of a symmetric matrix A (with a symmetric perturbation E),
  λi(A + E) ∈ λi(A) + [−kEk2, kEk2] for all i.

11 Exact analogues hold for the ith smallest eigenvalue and singular value.
12 Hermann Weyl was one of the prominent mathematicians of the 20th century.

(Proof: exercise; almost a direct consequence of C-F.)


The upshot is that singular values and eigenvalues of symmetric matrices are insensitive
to perturbation; a property known as being well conditioned.
This is important because this means a backward stable algorithm (see Section 7) com-
putes these quantities with essentially full precision.

4.1.1 Eigenvalues of nonsymmetric matrices are sensitive to perturbation


It is worth remarking that eigenvalues of nonsymmetric matrices can be far from well
conditioned! Consider for example the Jordan block [1 1; 0 1]. By perturbing this to
[1 1; ε 1] one gets eigenvalues 1 ± √ε, i.e., perturbed by √ε — a magnification factor of
1/√ε ≫ 1. More generally, consider the eigenvalues of an n × n Jordan block J ∈ Rn×n
(1's on the diagonal and superdiagonal, zeros elsewhere) and its perturbation J + E, in which
an ε is added in the bottom-left corner.
λ(J) = 1 (n copies), but we have |λ(J + E) − 1| ≈ ε^(1/n). For example when n = 100, a
10−100 perturbation in J would result in a 0.1 perturbation in all the eigenvalues!
(nonexaminable) This is pretty much the worst-case situation. In the generic case where
the matrix is diagonalizable, A = XΛX−1, with an ε-perturbation the eigenvalues get
perturbed by O(cε), where the constant c depends on the so-called condition number κ2(X)
of the eigenvector matrix X (see Section 7.2, and [8]).
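This phenomenon is easy to observe numerically; here is a small NumPy sketch (with n = 10 and ε = 10−10, so that ε^(1/n) = 0.1 is clearly visible above rounding error):

import numpy as np

# Eigenvalue sensitivity of a Jordan block under a tiny corner perturbation.
n, eps = 10, 1e-10
J = np.eye(n) + np.diag(np.ones(n - 1), 1)   # Jordan block, eigenvalue 1 (n copies)
E = np.zeros((n, n)); E[-1, 0] = eps         # perturbation in the bottom-left corner

lam = np.linalg.eigvals(J + E)
print(np.max(np.abs(lam - 1)))               # about 0.1, not 1e-10
print(eps ** (1.0 / n))                      # = 0.1, matching eps^(1/n)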

4.2 More applications of C-F


(Somewhat optional) Let's explore more applications of C-F. A lot more can be proved;
see [19] for many more results and examples along these lines.

Example 4.1

• σi([A1; A2]) ≥ max(σi(A1), σi(A2)).

  Proof (sketch): LHS = max_{dim S=i} min_{x∈S, kxk2=1} k[A1; A2] xk2, and for any x,
  k[A1; A2] xk2 ≥ max(kA1 xk2, kA2 xk2).

• σi([A1, A2]) ≥ max(σi(A1), σi(A2)).

  Proof: LHS = max_{dim S=i} min_{[x1;x2]∈S, k[x1;x2]k2=1} k[A1, A2][x1; x2]k2, while
  σi(A1) = max_{dim S=i, S⊆range([In; 0])} min_{[x1;x2]∈S, k[x1;x2]k2=1} k[A1, A2][x1; x2]k2. Since
  the latter imposes restrictions on the subspaces S over which the maximum is taken, the
  former is at least as big.

4.3 (Taking stock) Matrix decompositions you should know


Let us now take stock to review the matrix decompositions that we have covered, along with
those that we will discuss next.

• SVD: A = UΣV T

• Eigenvalue decomposition: A = XΛX−1

  – Normal: X unitary, X∗X = I
  – Symmetric: X unitary and Λ real

• Jordan decomposition: A = XJX−1, where J is block diagonal with Jordan blocks
  (λi on the diagonal and 1's on the superdiagonal of each block)

• Schur decomposition: A = QTQ∗, T upper triangular

• QR: Q orthonormal, R upper triangular

• LU: L lower triangular, U upper triangular

The decompositions built from orthogonal/unitary factors are the ones for which stable
computation is available.


The Jordan decomposition is mathematically the ultimate form that any matrix can
be reduced to by a similarity transformation, but numerically it is not very useful—one of
the problems is that Jordan decompositions are very difficult to compute, as an arbitrarily
small perturbation can change the eigenvalues and block sizes by a large amount (recall the
discussion in Section 4.1.1).

5 Linear systems Ax = b
We now (finally) start our discussion of direct algorithms in NLA. The fundamental idea is
to factorise a matrix into a product of two (or more) simpler matrices. This foundational
idea has been named one of the top 10 algorithms of the 20th century [9].

The sophistication of the state-of-the-art implementations of direct methods is simply
astonishing. For instance, a 100 × 100 dense linear system or eigenvalue problem can be
solved in less than a millisecond on a standard laptop. Imagine solving it by hand!
We start with solving linear systems, unquestionably the most important problem in
NLA (for applications). In some sense we needn’t spend too much time here, as you must
have seen much of the material (e.g. Gaussian elimination) before. However, the description
of the LU factorisation given below will likely be different from the one that you have seen
before. We have chosen a nonstandard description as it reveals its connection to low-rank
approximation.
Let A ∈ Rn×n. Suppose we can decompose (or factorise) A into

A = LU,

where L is lower triangular and U is upper triangular (here and below, ∗ denotes entries that
are possibly nonzero). How can we find L, U?


To get started, consider rewriting A as a sum of a rank-1 matrix and a remainder. The first
step finds a column vector L1 and a row vector U1 such that A = L1U1 + (remainder), where
L1U1 reproduces the first row and first column of A, so the remainder has zeros in its first
row and column. Specifying the elements (taking a = A11):

[A11 A12 A13 A14 A15]   [  1  ]                              [0 0 0 0 0]
[A21  ∗   ∗   ∗   ∗ ]   [A21/a]                              [0 ∗ ∗ ∗ ∗]
[A31  ∗   ∗   ∗   ∗ ] = [A31/a] [A11 A12 A13 A14 A15]   +    [0 ∗ ∗ ∗ ∗]
[A41  ∗   ∗   ∗   ∗ ]   [A41/a]                              [0 ∗ ∗ ∗ ∗]
[A51  ∗   ∗   ∗   ∗ ]   [A51/a]                              [0 ∗ ∗ ∗ ∗]

Here we've assumed a ≠ 0; we'll discuss the case a = 0 later. Repeating the process on the
trailing (n − 1) × (n − 1) block of the remainder gives

A = L1U1 + L2U2 + · · · + LnUn,

where Lk is a column vector whose first k − 1 entries are zero (with a 1 in position k), and Uk
is a row vector whose first k − 1 entries are zero. Collecting these pieces,

A = [L1, L2, . . . , Ln] [U1; U2; . . . ; Un] = LU,

with L = [L1, L2, . . . , Ln] lower triangular and U (whose rows are U1, . . . , Un) upper triangular —
an LU factorisation as required.
Note the above expression for A: clearly the LkUk factors are rank-1 matrices, so the LU
factorisation can be thought of as writing A as a sum of (structured) rank-1 matrices.
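This rank-1-update viewpoint translates directly into code. Here is a NumPy sketch (no pivoting, so it assumes every pivot a is nonzero; see Section 5.2):

import numpy as np

# LU factorisation built as a sum of rank-1 matrices A = L1 U1 + L2 U2 + ... (no pivoting).
def lu_rank1(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.zeros((n, n)), np.zeros((n, n))
    for k in range(n):
        a = A[k, k]                                 # pivot, assumed nonzero
        L[k:, k] = A[k:, k] / a                     # L_k: zeros above position k, L[k,k] = 1
        U[k, k:] = A[k, k:]                         # U_k: zeros before position k
        A[k:, k:] -= np.outer(L[k:, k], U[k, k:])   # subtract the rank-1 term L_k U_k
    return L, U

A = np.array([[4.0, 3, 2], [6, 3, 1], [8, 5, 7]])
L, U = lu_rank1(A)
print(np.allclose(L @ U, A))   # True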

5.1 Solving Ax = b via LU


Having found an LU factorisation A = LU, one can efficiently solve an n × n linear system
Ax = b: first solve Ly = b, then Ux = y. Then b = Ly = LUx = Ax.

• These are triangular linear systems, which are easy to solve and can be done in O(n²) flops.

• Triangular solves are always backward stable: e.g. the computed ŷ satisfies (L + ∆L)ŷ = b
  for some small ∆L (see Higham's book [17]).

The computational cost is

• For LU: (2/3)n³ flops (floating-point operations).

• Each triangular solve is O(n²).

Note that once we have an LU factorisation we can solve another linear system with
the same matrix with only O(n²) additional operations.
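In code, the two-triangular-solve workflow looks like this (a SciPy sketch; the factorisation is done once and then reused for several right-hand sides):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Solve Ax = b via (pivoted) LU, reusing the O(n^3) factorisation for each new b.
rng = np.random.default_rng(3)
n = 100
A = rng.standard_normal((n, n))
lu, piv = lu_factor(A)              # O(n^3), done once

for _ in range(3):                  # each subsequent solve costs only O(n^2)
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)      # triangular solves Ly = Pb, Ux = y
    print(np.linalg.norm(A @ x - b))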

5.2 Pivoting
Above we've assumed the diagonal element (pivot) a ≠ 0 — when a = 0 we are in trouble! In
fact not every matrix has an LU factorisation. For example, there is no LU factorisation of
[0 1; 1 0]. We need a remedy. In practice, a remedy is needed whenever a is small.
The idea is to permute the rows, so that the largest element of the (first active) column is
brought to the pivot position. This process is called pivoting (sometimes partial pivoting, to
emphasize the difference from complete pivoting, wherein both rows and columns are
permuted13).
13
While we won’t discuss complete pivoting further, you might be interested to know that when LU with
complete pivoting is applied to a low-rank matrix (with rapidly decaying singular values), one tends to find
a good low-rank approximation, almost as good as truncated SVD.

25
results in P A = LU , where P is a permutation matrix : orthogonal matrices with only 1 and
0s (every row/column has exactly one 1); applying P would reorder the rows (with P A) or
columns (AP )).
Thus solving Axi = bi for i = 1, . . . , k requires 23 n3 +O(kn2 ) operations instead of O(kn3 ).

Recall the first elimination step (with a = A11):

[A11 A12 A13 A14 A15]   [  1  ]                              [0 0 0 0 0]
[A21  ∗   ∗   ∗   ∗ ]   [A21/a]                              [0 ∗ ∗ ∗ ∗]
[A31  ∗   ∗   ∗   ∗ ] = [A31/a] [A11 A12 A13 A14 A15]   +    [0 ∗ ∗ ∗ ∗]
[A41  ∗   ∗   ∗   ∗ ]   [A41/a]                              [0 ∗ ∗ ∗ ∗]
[A51  ∗   ∗   ∗   ∗ ]   [A51/a]                              [0 ∗ ∗ ∗ ∗]

When a = 0, the remedy is to pivot: permute the rows so that the largest entry of the first
(active) column is at the top. ⇒ PA = LU, with P a permutation matrix.

• for Ax = b, solve PAx = Pb ⇔ LUx = Pb

• cost still (2/3)n³ + O(n²)

In fact, one can show that any nonsingular matrix A has a pivoted LU factorisation.
(proof: exercise). This means that any linear system that is computationally feasible can be
solved by pivoted LU factorisation.

ˆ Even with pivoting, unstable examples exist (Section 7), but almost always stable in
practice and used everywhere.

ˆ Stability here means L̂Û = P A + ∆A with small k∆Ak; see Section 7.
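A quick numerical illustration of pivoting (a SciPy sketch, not from the notes): the 2 × 2 example above has no LU factorisation, but its pivoted LU is immediate. Note that SciPy's convention is A = PLU, i.e. its P is the transpose of the P in PA = LU.

import numpy as np
from scipy.linalg import lu

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # no LU factorisation without pivoting
P, L, U = lu(A)                     # SciPy convention: A = P L U
print(L)                            # identity
print(U)                            # the row-permuted A, already upper triangular
print(np.allclose(P @ L @ U, A))    # True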

5.3 Cholesky factorisation for A ≻ 0

If A ≻ 0 (symmetric positive definite14; (S)PD ⇔ λi(A) > 0 for all i), two simplifications
happen in LU:

• We can take Ui = LiT =: Ri by symmetry

• No pivoting is needed as long as A is PD

The first step then reads

A = R1T R1 + [0 0; 0 Ã],    (7)

where R1 = A(1, :)/√A11 is a row vector (the first row of the eventual Cholesky factor R) and
the trailing block Ã is again positive definite, so the process can be repeated, giving A = RT R
with R upper triangular.

Notes:

• (1/3)n³ flops, half as many as LU

• diag(R) no longer consists of 1's (clearly)

• A can be written as A = RT R for some R ∈ Rn×n iff A ⪰ 0 (λi(A) ≥ 0)

• Indefinite case: when A = A∗ but A is not PSD (i.e., negative eigenvalues are present), there
  exists A = LDL∗ where D is diagonal (when A ∈ Rn×n, D can have 2 × 2 diagonal blocks)
  and L has 1's on its diagonal. This is often called the "LDLT factorisation".

• It's not easy at this point to see why the trailing block Ã in (7) is also PD; one way to see
  this is via the uniqueness of the Cholesky factorisation; see the next section.

14 Positive definite matrices are such an important class of matrices, and are so full of interesting
mathematical properties, that a nice book has been written about them [4].

Therefore, roughly speaking, symmetric linear systems can be solved with half the effort of
a general non-symmetric system.
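A small numerical illustration (NumPy/SciPy sketch): the Cholesky factor of an SPD matrix, and a symmetric solve that exploits it.

import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

# Cholesky factorisation A = R^T R of an SPD matrix, and a solve using it.
rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = B.T @ B + 5 * np.eye(5)            # SPD by construction

R = cholesky(A)                        # upper triangular R with A = R^T R
print(np.allclose(R.T @ R, A))         # True

b = rng.standard_normal(5)
x = cho_solve(cho_factor(A), b)        # two triangular solves
print(np.linalg.norm(A @ x - b))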

6 QR factorisation and least-squares problems


We've seen that the LU factorisation is a key step towards solving Ax = b. For an overdetermined
problem (a least-squares problem, the subject of Section 6.5), we will need the QR factorisation.
For any A ∈ Rm×n (m ≥ n), there exists a factorisation

A = QR,

where Q ∈ Rm×n is orthonormal (QT Q = In) and R ∈ Rn×n is upper triangular.

ˆ Many algorithms available: Gram-Schmidt, Householder QR, CholeskyQR, ...

ˆ various applications: least-squares, orthogonalisation, computing SVD, manifold re-


traction...

• With Householder QR, column pivoting (computing AP = QR for a permutation P) is not
  needed for numerical stability.

  – but pivoting gives a rank-revealing QR (nonexaminable).

6.1 QR via Gram-Schmidt


No doubt you have seen the Gram-Schmidt (G-S) process. What you might not know is that
when applied to the columns of a matrix A, it gives you a QR factorisation A = QR.
Gram-Schmidt: Given A = [a1 , a2 , . . . , an ] ∈ Rm×n (assume full-rank rank(A) = n),
find orthonormal [q1 , . . . , qn ] s.t. span(q1 , . . . , qn ) = span(a1 , . . . , an )

More precisely, the algorithm performs the following: q1 = a1/ka1k; then q̃2 = a2 − q1 q1T a2,
q2 = q̃2/kq̃2k (orthogonalise and normalise); and, repeating for j = 3, . . . , n:
q̃j = aj − Σ_{i=1}^{j−1} qi qiT aj,   qj = q̃j/kq̃jk.

This gives a QR factorisation! To see this, let rij = qiT aj (for i < j) and
rjj = kaj − Σ_{i=1}^{j−1} rij qik. Then

q1 = a1/r11,
q2 = (a2 − r12 q1)/r22,
qj = (aj − Σ_{i=1}^{j−1} rij qi)/rjj,

which can be written equivalently as

a1 = r11 q1,
a2 = r12 q1 + r22 q2,
aj = r1j q1 + r2j q2 + · · · + rjj qj.

This in turn is A = QR, where QT Q = In and R is upper triangular.

ˆ But this isn’t the recommended way to compute the QR factorisation, as it’s numeri-
cally unstable; see Section 7.7.1 and [17, Ch. 19,20].
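For concreteness, a direct NumPy transcription of this (classical) Gram-Schmidt procedure is given below — fine as an illustration, but, as just noted, not the recommended way to compute a QR factorisation:

import numpy as np

# Classical Gram-Schmidt QR (illustration only; numerically unstable, see Section 7.7.1).
def gram_schmidt_qr(A):
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        R[:j, j] = Q[:, :j].T @ A[:, j]        # r_ij = q_i^T a_j for i < j
        q = A[:, j] - Q[:, :j] @ R[:j, j]      # orthogonalise against q_1, ..., q_{j-1}
        R[j, j] = np.linalg.norm(q)
        Q[:, j] = q / R[j, j]                  # normalise
    return Q, R

A = np.random.default_rng(5).standard_normal((6, 4))
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(4)))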

6.2 Towards a stable QR factorisation: Householder reflectors


There is a beautiful alternative algorithm for computing the QR factorisation: Householder
QR factorisation. In order to describe it, let us first introduce Householder reflectors. These
are the class of matrices H that are symmetric, orthogonal and can be written as a rank one
update of the identity
H = I − 2vv T , kvk = 1

• H is orthogonal and symmetric: HT H = H² = I. Its eigenvalues are 1 (n − 1 copies)
  and −1 (1 copy).

• For any given u, w ∈ Rn s.t. kuk = kwk and u ≠ w, the choice H = I − 2vvT with
  v = (w − u)/kw − uk gives Hu = w (⇔ u = Hw, hence 'reflector').

[Figure: H reflects u to w (and w to u) about the hyperplane (u − w)T x = 0.]
It follows that by choosing the vector v appropriately, one can perform a variety of
operations on a given vector x. A primary example is the particular Householder reflector

H = I − 2vvT,   v = (x − kxke)/kx − kxkek,   e = [1, 0, . . . , 0]T,

which satisfies Hx = [kxk, 0, . . . , 0]T. That is, an arbitrary vector x is mapped by H to a
multiple of e = [1, 0, . . . , 0]T.
(I hope the picture above is helpful — the reflector reflects vectors about the hyperplane
described by vT x = 0.)
In summary, we have the useful result
Lemma 6.1 For any x ∈ Rn not in the form [±kxk, 0, . . . , 0]T , there exists a Householder
reflector H = I − 2vv T where kvk = 1 such that Hx = [kxk, 0, . . . , 0]T .
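As a quick sanity check of Lemma 6.1, here is a small NumPy illustration (ours, not part of the notes) that the reflector built from x maps x to [‖x‖, 0, . . . , 0]T:

import numpy as np

x = np.array([3.0, 1.0, -2.0, 0.5])
e = np.zeros_like(x); e[0] = 1.0
v = x - np.linalg.norm(x) * e                   # v proportional to x - ||x|| e
v /= np.linalg.norm(v)
H = np.eye(len(x)) - 2.0 * np.outer(v, v)       # Householder reflector
print(H @ x)                                    # ~ [norm(x), 0, 0, 0]
print(np.linalg.norm(x), np.linalg.norm(H @ x)) # norms agree since H is orthogonal

(In practice one picks the sign in x ∓ ‖x‖e to avoid cancellation; see the Householder QR sketch at the end of the next section.)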

6.3 Householder QR factorisation


Now we describe how to use the Householder reflectors in order to compute a QR factorisation
of a given matrix A ∈ Rm×n. The first step to obtain QR is to find H1 s.t. H1 a1 = [‖a1‖, 0, . . . , 0]T, and repeat to get Hn · · · H2 H1 A = R upper triangular, then A = (H1 · · · Hn−1 Hn) R = QR.
Here is a pictorial illustration: start with a 5 × 4 matrix

A =
[∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗]

We apply a sequence of Householder reflectors:

H1 A = (I − 2v1 v1T) A =
[∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 ∗ ∗ ∗]

H2 H1 A = (I − 2v2 v2T) H1 A =
[∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 0 ∗ ∗]
[0 0 ∗ ∗]
[0 0 ∗ ∗]

H3 H2 H1 A =
[∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 0 ∗ ∗]
[0 0 0 ∗]
[0 0 0 ∗]

Hn · · · H3 H2 H1 A =
[∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗]
[0 0 ∗ ∗]
[0 0 0 ∗]
[0 0 0 0]

Note the zero pattern vk = [0, 0, . . . , 0, ∗, ∗, . . . , ∗]T with k − 1 leading 0's. We have

Hn · · · H2 H1 A = [R; 0],

where R ∈ Rn×n is upper triangular and 0 is an (m − n) × n zero block.
To obtain a QR factorisation of A we simply invert the Householder reflectors, noting that they are both orthogonal and symmetric; therefore each is its own inverse. This yields

A = (H1T · · · Hn−1T HnT) [R; 0] =: QF [R; 0],

which is a full QR factorisation, wherein QF is square orthogonal.
Moreover, writing QF = [Q Q⊥] where Q ∈ Rm×n is orthonormal, we also have A = QR ('thin' QR or just QR); this is more economical, especially when A is tall-skinny. In a majority of cases in computational mathematics, this is the object that we wish to compute.

We note some properties of Householder QR.

ˆ Cost: (4/3)n3 flops with Householder QR (twice that of LU).

ˆ Unconditionally backward stable: the computed versions satisfy Q̂R̂ = A + ∆A with ‖∆A‖ = O(ε‖A‖), and ‖Q̂T Q̂ − I‖2 = O(ε) (Section 7).

ˆ The algorithm gives a constructive proof for the existence of a full QR A = QR. It
also gives, for example, a proof of the existence of the orthogonal complement of the
column space of an orthonormal matrix U .

ˆ To solve Ax = b, solve Rx = QT b via a triangular solve.

→ Excellent method, but twice as expensive as LU (so it is rarely used).

ˆ The process is aptly called orthogonal triangularisation. (By contrast, Gram-Schmidt


and CholeskyQR15 are triangular orthogonalisation).
15
This algorithm does the following: AT A = RT R (Cholesky), then Q = AR−1 . As stated it’s a very fast
but unstable algorithm.
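For concreteness, here is a hedged NumPy sketch of Householder QR (orthogonal triangularisation); it is an illustration of the idea above, not production code, and the function name householder_qr is ours:

import numpy as np

def householder_qr(A):
    # Full QR factorisation A = QR of an m x n matrix (m >= n) by reflectors.
    A = np.array(A, dtype=float)
    m, n = A.shape
    R = A.copy()
    Q = np.eye(m)
    for k in range(n):
        x = R[k:, k].copy()
        e1 = np.zeros_like(x); e1[0] = 1.0
        s = np.sign(x[0]) if x[0] != 0 else 1.0   # sign choice avoids cancellation
        v = x + s * np.linalg.norm(x) * e1
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])   # apply H_k from the left
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)     # accumulate Q = H_1 ... H_n
    return Q, R

A = np.random.randn(6, 4)
Q, R = householder_qr(A)
print(np.linalg.norm(A - Q @ R))                # ~ 1e-15
print(np.linalg.norm(Q.T @ Q - np.eye(6)))      # ~ 1e-15: orthogonality regardless of kappa_2(A)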

6.4 Givens rotations
Householder QR is an excellent method for computing the QR factorisation of a general
matrix, and it is widely used in practice. However, each Householder reflector acts globally–
it affects all the entries of the (active part of) the matrix. For structured matrices—such as
sparse matrices—sometimes there is a better tool to reduce the matrix to triangular form
(and other forms) by working more locally. Givens rotations give a convenient tool for this.
They are matrices of the form
 
c s
G= , c2 + s2 = 1.
−s c

Designed to 'zero' one element at a time. For example, to compute the QR factorisation of an upper Hessenberg matrix, one can perform

A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[  ∗ ∗ ∗ ∗]
[    ∗ ∗ ∗]
[      ∗ ∗]

G1 A =
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[  ∗ ∗ ∗ ∗]
[    ∗ ∗ ∗]
[      ∗ ∗]

G2 G1 A =
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[  0 ∗ ∗ ∗]
[    ∗ ∗ ∗]
[      ∗ ∗]

G3 G2 G1 A =
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[  0 ∗ ∗ ∗]
[    0 ∗ ∗]
[      ∗ ∗]

G4 G3 G2 G1 A =
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[  0 ∗ ∗ ∗]
[    0 ∗ ∗]
[      0 ∗]  =: R.

This means A = G1T G2T G3T G4T R is the QR factorisation. (Note that Givens rotations are orthogonal but not symmetric—so the inverse is GT, not G.)

ˆ G acts locally on two rows (when left-multiplied; two columns if right-multiplied)


ˆ Non-neighboring rows/cols allowed. For example, a rotation acting on the i, jth
columns would have c, s values in the (i, i), (i, j), (j, i), (j, j) entries. Visually,
 
Gi,j =
[1                                      ]
[   ⋱                                   ]
[      1                                ]
[        cos(θ)   · · ·   sin(θ)        ]   ← row i
[          ⋮       1        ⋮           ]
[       −sin(θ)   · · ·   cos(θ)        ]   ← row j
[                               1       ]
[                                  ⋱    ]
[                                     1 ]
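Below is a short NumPy sketch (illustrative; the helper givens is ours) of a single Givens rotation: choose c, s from two numbers, then apply the rotation to two rows to zero one subdiagonal entry of an upper Hessenberg matrix:

import numpy as np

def givens(a, b):
    # Return c, s such that [[c, s], [-s, c]] @ [a, b] = [r, 0].
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

A = np.triu(np.random.randn(4, 4), -1)    # upper Hessenberg
c, s = givens(A[0, 0], A[1, 0])
G = np.array([[c, s], [-s, c]])
A[[0, 1], :] = G @ A[[0, 1], :]           # acts only on rows 1 and 2
print(A[1, 0])                            # ~ 0: the (2,1) entry has been zeroed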

6.5 Least-squares problems via QR


So far we have discussed linear systems wherein the coefficient matrix is square. However, in
many situations in data science and beyond, there is a good reason to over-sample to obtain a

robust solution (for instance, in the presence of noise in measurements). For example, when
there is massive data and we would like to fit the data with a simple model, we will have
many more equations than the degrees of freedom. This leads to the so-called least-squares
problem which we will discuss here.
Given A ∈ Rm×n , m ≥ n and b ∈ Rm , a least-squares problem seeks to find x ∈ Rn such
that the residual is minimised:

minx ‖Ax − b‖    (8)

ˆ ’Overdetermined’ linear system; attaining equality Ax = b is usually impossible

ˆ Thus the goal is to try to minimise the residual ‖Ax − b‖; usually ‖Ax − b‖2, but sometimes e.g. ‖Ax − b‖1 is of interest. Here we focus on ‖Ax − b‖2.

ˆ Throughout we assume full rank condition rank(A) = n; this makes the solution unique
and is generically satisfied. If not, the problem will have infinitely many minimisers
(and a standard practice is to look for the minimum-norm solution).

Here is how we solve the least-squares problem (8).

Theorem 6.1 Let A ∈ Rm×n , m > n and b ∈ Rm , with rank(A) = n. The least-squares
problem minx kAx − bk2 has solution given by x = R−1 QT b, where A = QR is the (thin) QR
factorisation.
   
Proof: Let A = [Q Q⊥] [R; 0] = QF [R; 0] be a 'full' QR factorisation. Then

‖Ax − b‖2 = ‖QFT (Ax − b)‖2 = ‖ [R; 0] x − [QT b; Q⊥T b] ‖2,

so x = R−1 QT b is a solution: it zeroes the first block of the residual, and the second block Q⊥T b does not depend on x. □
This also gives an algorithm (which is essentially the workhorse algorithm used in prac-
tice):

1. Compute thin QR factorisation A = QR (using Householder QR)

2. Solve linear system Rx = QT b.

ˆ This process is backward stable. That is, the computed x̂ is the exact solution of minx ‖(A + ∆A)x − (b + ∆b)‖2 for some small ∆A, ∆b (see Higham's book Ch. 20).

ˆ Unlike square system Ax = b, one really needs QR: LU won’t do the job at all.
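As an illustration of this two-step procedure, here is a hedged NumPy sketch (using NumPy's built-in Householder-based QR; the sizes are arbitrary):

import numpy as np

m, n = 100, 5
A = np.random.randn(m, n)
b = np.random.randn(m)

Q, R = np.linalg.qr(A, mode='reduced')    # thin QR, Q is m x n
x = np.linalg.solve(R, Q.T @ b)           # solve R x = Q^T b (a triangular solver would be used in practice)

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x - x_ref))          # agrees with the reference least-squares solver
print(np.linalg.norm(A.T @ (A @ x - b)))  # ~ 0: the residual is orthogonal to range(A)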

One might wonder why we chose the 2-norm in the least-squares formulation (8). Unlike
for low-rank approximation (where the truncated SVD is a solution for any unitarily invariant
norm16 ) the choice of the norm does matter and affects the properties of the solution x
significantly. For example, an increasingly popular choice of norm is the 1-norm, which
tends to promote sparsity in the quantity to be minimised. In particular, if we simply replace the 2-norm with the 1-norm, the solution tends to give a residual that is sparse.
(nonexaminable)

6.6 QR-based algorithm for linear systems


It is straightforward to see that the exact same algorithm can be applied for solving square
linear systems. Is this algorithm good? Absolutely! It turns out that it is even better than
the LU-based method in that backward stability can be guaranteed (which isn’t the case
with pivoted LU). However, it is unfortunately twice as expensive, which is the reason LU is used in the vast majority of cases for solving linear systems.
Another very stable algorithm is to compute the SVD A = U ΣV T and take x = V Σ−1 U T b. This is even more expensive than via QR (by a factor of ≈ 10).

6.7 Solution of least-squares via normal equation


There is another way to solve the least-squares problem, by the so-called normal equation.
We've seen that for minx ‖Ax − b‖2 with A ∈ Rm×n, m ≥ n, x = R−1 QT b is the solution ⇔ x is the solution of the n × n normal equation

(AT A) x = AT b

ˆ AT A ⪰ 0 (always) and AT A ≻ 0 if rank(A) = n; then it is a PD linear system; use Cholesky to solve.

ˆ This is fast! But NOT backward stable; κ2(AT A) = (κ2(A))2, where κ2(A) = σmax(A)/σmin(A) is the condition number (next topic)
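The loss of accuracy caused by squaring the condition number is easy to see numerically; the following NumPy sketch (with an artificially ill-conditioned A of our own choosing) compares the QR-based solution with the normal-equations solution on a consistent system:

import numpy as np

m, n = 50, 10
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = np.logspace(0, -8, n)                 # singular values, so kappa_2(A) = 1e8
A = (U * s) @ V.T
x_true = np.random.randn(n)
b = A @ x_true                            # consistent system with known solution

Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)  # normal equations: condition number squared

print(np.linalg.norm(x_qr - x_true) / np.linalg.norm(x_true))  # roughly kappa_2(A)*eps ~ 1e-8
print(np.linalg.norm(x_ne - x_true) / np.linalg.norm(x_true))  # typically far larger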

In fact, more generally, given a linear least-squares approximation problem minp∈P ‖f − p‖ in an inner-product space (of which (8) is a particular example; other examples include polynomial approximation of a function with the inner product e.g. ⟨f, g⟩ = ∫_{−1}^{1} f(x)g(x) dx), the solution is characterised by the property that the residual is orthogonal to the subspace from which the solution is sought, that is, ⟨q, f − p⟩ = 0 for all q ∈ P. To see this, consider the problem of approximating f with p in a subspace P. Let p∗ ∈ P be such that the residual f − p∗ is orthogonal to every element of P. Then for any q ∈ P, we have ‖f − (p∗ + q)‖2 = ‖f − p∗‖2 − 2⟨f − p∗, q⟩ + ‖q‖2 = ‖f − p∗‖2 + ‖q‖2 ≥ ‖f − p∗‖2, proving p∗ is a minimiser (it is actually unique).

16 Just to be clear, if one uses a norm that is not unitarily invariant (e.g. 1-norm), the truncated SVD may cease to be a solution for the low-rank approximation problem.
Since we mentioned Cholesky, let us now revisit (7) and show why the second term there
must be PSD. A PD matrix has an eigenvalue decomposition A = V D2 V T = (V DV T )2 =
(V DV T )T (V DV T ). Now let V DV T = QR be the QR factorisation. Then (V DV T )T (V DV T ) =
RT R (this establishes the existence of Cholesky). But now the 0-structure in (7) means the
first term must be rrT where rT is the first row of R, and hence the second term must be
R2T R2 , which is PSD. Here RT = [r R2T ].

6.8 Application of least-squares: regression/function approximation
To illustrate the usefulness of least-squares problems as compared with linear systems here
let’s consider a function approximation problem.
Given a function f : [−1, 1] → R, consider approximating it via a polynomial f(x) ≈ p(x) = Σ_{i=0}^{n} ci x^i.
A very common technique is regression, a widely applicable problem in statistics and data science:

1. Sample f at points {zi}_{i=1}^{m}, and

2. Find coefficients c defined by the Vandermonde system Ac ≈ f,


[1  z1  · · ·  z1^n] [c0]     [f(z1)]
[1  z2  · · ·  z2^n] [c1]  ≈  [f(z2)]
[⋮   ⋮          ⋮ ] [ ⋮]     [  ⋮  ]
[1  zm  · · ·  zm^n] [cn]     [f(zm)]
ˆ Numerous applications, e.g. in statistics, numerical analysis, approximation theory,
data analysis!

Illustration We illustrate this with an example where we approximate the function f (x) =
1 + sin(10x) exp(x) (which we suppose we don’t know but we can sample it).
[Figure: degree-10 polynomial fits to f on [−1, 1]. Left panel: m = n = 11 (interpolation at 11 equispaced points). Right panel: m = 100, n = 11 (least-squares fit).]
We observe that with 11 (equispaced) sample points, the degree-10 polynomial is devi-
ating from the ’true function’ quite a bit. With many more sample points the situation
significantly improves. This is not a cherry-picked example but a phenomenon that can be
mathematically proved; look for “Lebesgue constant” if interested (nonexaminable).
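A sketch reproducing the spirit of this experiment (the exact sample points and omission of plotting are our assumptions) using NumPy's Vandermonde and least-squares routines:

import numpy as np

f = lambda x: 1 + np.sin(10 * x) * np.exp(x)
n = 10                                          # polynomial degree

for m in (11, 100):                             # number of equispaced samples
    z = np.linspace(-1, 1, m)
    A = np.vander(z, n + 1, increasing=True)    # columns 1, z, ..., z^n
    c, *_ = np.linalg.lstsq(A, f(z), rcond=None)
    xx = np.linspace(-1, 1, 1000)
    p = np.vander(xx, n + 1, increasing=True) @ c
    print(m, np.max(np.abs(p - f(xx))))         # deviation from f; smaller for m = 100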

7 Numerical stability
An important aspect that is very often overlooked in numerical computing is numerical sta-
bility. Very roughly, it concerns the quality of a solution obtained by a numerical algorithm,
given that computation on computers is done not exactly but with rounding errors. So far
we have mentioned stability here and there in passing, but in this section it will be our focus.
Let us first look at an example where roundoff errors play a visible role in affecting the computed solution of a linear system.
The situation is complicated. For example, let A = UΣV T, where

U = (1/√2) [1 1; 1 −1],    Σ = [1 0; 0 10^−15],    V = I,

and let b = A [1; 1]. That is, we are solving a linear system whose solution is x = [1; 1].
If we solve this in MATLAB using x = A\b, the output is [1.0000; 0.94206]. Quite different from
the exact solution! Did something go wrong? Did MATLAB or the algorithm fail? The
answer is NO, MATLAB and the algorithm (LU) performed just fine. This is a ramification
of ill-conditioning, not instability. Make sure that after covering this section, you will be
able to explain what happened in the above example.
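The example is easy to reproduce; here is a NumPy version (the output digits will vary slightly from the MATLAB run quoted above):

import numpy as np

U = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
S = np.diag([1.0, 1e-15])
A = U @ S                                 # V = I, so A = U*Sigma*V^T
b = A @ np.array([1.0, 1.0])              # exact solution x = [1, 1]

x = np.linalg.solve(A, b)
print(x)                                  # the second component deviates visibly from 1
print(np.linalg.norm(A @ x - b))          # yet the residual is tiny: backward stable
print(np.linalg.cond(A))                  # kappa_2(A) = 1e15, which explains the gap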

7.1 Floating-point arithmetic


The IEEE (double precision) floating point arithmetic is by far the most commonly used
model of computation adopted in computation, and we will assume its use here. We will not
get into its details as it becomes too computer scientific, rather than mathematical. Just to
give a very sketchy introduction

ˆ Computers store number in base 2 with finite/fixed memory (bits)

ˆ Numbers not exactly representable with finite bits in base 2, including irrational num-
bers, are stored inexactly (rounded), e.g. 1/3 ≈ 0.333... The unit with which rounding
takes place is the machine precision, often denoted by ε (or u for unit roundoff). In
the most standard setting of IEEE double-precision arithmetic, u ≈ 10−16 .

ˆ Whenever calculations (addition, subtraction, multiplication, division) are performed,


the result is rounded to the nearest floating-point number (rounding error); this is
where numerical errors really creep in

ˆ Thus the accuracy of the final error is nontrivial; in pathological cases, it is rubbish!

To get an idea of how things can go wrong and how serious the final error can be, here
are two examples with MATLAB:

ˆ ((sqrt(2))^2 − 2) ∗ 1e15 = 0.4441 (should be 0..)

ˆ Σ_{n=1}^{∞} 1/n ≈ 30 (should be ∞..)
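A couple of quick Python/NumPy checks (ours, for illustration) of rounding in double precision:

import numpy as np

print((np.sqrt(2.0) ** 2 - 2.0) * 1e15)   # not 0: the roundoff in sqrt(2) is magnified
print(0.1 + 0.2 == 0.3)                   # False: neither side is exactly representable in base 2
print(np.finfo(float).eps)                # the rounding unit, about 2.2e-16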

For matrices, there are much more nontrivial and surprising phenomena than these. An
important (but not main) part of numerical analysis/NLA is to study the effect of rounding
errors. This topic can easily span a whole course. By far the best reference on this topic is
Higham’s book [17].
In this section we denote by f l(X) a computed version of X (fl stands for floating point).
For basic operations such as addition and multiplication, one has f l(x + y) = x + y + c where |c| ≤ ε max(|x|, |y|), and f l(xy) = xy + c where |c| ≤ ε|xy|.

7.2 Conditioning and stability


It is important to solidify the definition of stability (which is a property of an algorithm)
and conditioning (which concerns the sensitivity of a problem and has nothing to do with
the algorithm used to solve it).

ˆ Conditioning is the sensitivity of a problem (e.g. of finding y = f (x) given x) to


perturbation in inputs, i.e., how large κ := supδx kf (x + δx) − f (x)k/kδxk is in the
limit δx → 0. (Very informally, one can think of conditioning as the largest directional
derivative).
(this is the absolute condition number; equally important is the relative condition number κr := supδx (‖f(x + δx) − f(x)‖/‖f(x)‖) / (‖δx‖/‖x‖))

ˆ (Backward) Stability is a property of an algorithm, which describes if the computed


solution ŷ is a ’good’ solution, in that it is an exact solution of a nearby input, that
is, ŷ = f (x + ∆x) for a small ∆x: if k∆xk can be shown to be small, ŷ is a backward
stable solution. If an algorithm is guaranteed to output a backward stable solution, that algorithm is called backward stable. Throughout this section, ∆ denotes a quantity that is small relative to the quantity that follows: ‖∆X‖/‖X‖ = O(ε), where ε is the machine precision.

To repeat, conditioning is intrinsic in the problem. Stability is a property of an algo-


rithm. Thus we will never say “this problem is backward stable” or “this algorithm is ill-
conditioned”. We can say “this problem is ill/well-conditioned”, or “this algorithm is/isn’t

(backward) stable”. If a problem is ill-conditioned (κ ≫ 1) and the computed solution is not very accurate, then one should blame the problem, not the algorithm. In such cases, a backward stable solution (see below) is usually still considered a good solution.
Notation/convention in this section: x̂ denotes a computed approximation to x (e.g. of
x = A−1 b). ε denotes a small term O(u), on the order of unit roundoff/working precision; so we write e.g. u, 10u, (m + n)u, mnu all as ε. (In other words, here we assume m, n ≪ u−1.)

ˆ Consequently (in this lecture/discussion) the norm choice does not matter for the
discussion.

7.3 Numerical stability; backward stability


Let us dwell more on (backward) stability, because it is really at the heart of the discussion
on numerical stability. The word backward is key here and it probably differs from the
natural notion of stability that you might first think of.
For a computational task Y = f (X) (given input X, compute Y ) and computed approx-
imant Ŷ ,

ˆ Ideally, error ‖Y − Ŷ‖/‖Y‖ = ε: but this is seldom true, and often impossible!
(u: unit roundoff, ≈ 10−16 in standard double precision)

ˆ A good algorithm has backward stability: Ŷ = f(X + ∆X), ‖∆X‖/‖X‖ = ε — “exact solution of a slightly wrong input”.

ˆ Justification: The input (matrix) is usually inexact anyway, as storing it on a computer


as a floating-point object already incurs rounding errors! Consequently, f(X + ∆X) is just as good as f(X) at approximating f(X∗), where ‖∆X‖ = O(‖X − X∗‖).
We shall 'settle for' such a solution, though it may not mean Ŷ − Y is small.

ˆ Forward stability17 kY − Ŷ k/kY k = O(κ(f )u) “error is as small as backward stable


alg”.

ˆ Another important notion: mixed forward-backward stability: “The computed output


is a slightly perturbed solution for a slightly perturbed problem”.
17
The definition here follows Higham’s book [17]. The phrase is sometimes used to mean small error;
However, as hopefully you will be convinced after this section, it is very often impossible to get a solution of
full accuracy if the original problem was ill-conditioned. The notion of backward stability that we employ
here (and in much of numerical analysis) is therefore much more realistic.

7.4 Matrix condition number
The best way to illustrate conditioning is to look at the conditioning of linear systems. In
fact it leads to the following definition, which arises so frequently in NLA that it merits its
own name: the condition number of a matrix.

Definition 7.1 The matrix condition number is defined by

κ2(A) = σmax(A)/σmin(A)  (≥ 1).

That is, κ2(A) = σ1(A)/σn(A) for A ∈ Rm×n, m ≥ n.

Let’s see how this arises:

Theorem 7.1 Consider a backward stable solution for Ax = b, s.t. (A + ∆A)x̂ = b with ‖∆A‖ ≤ ε‖A‖ and κ2(A) ≪ ε−1 (so ‖A−1∆A‖ ≪ 1). Then we have

‖x̂ − x‖/‖x‖ ≤ κ2(A)ε + O(ε2).

Proof: By Neumann series

(A + ∆A)−1 = (A(I + A−1 ∆A))−1 = (I − A−1 ∆A + O(kA−1 ∆Ak2 ))A−1

So x̂ = (A+∆A)−1 b = A−1 b−A−1 ∆AA−1 b+O(kA−1 ∆Ak2 ) = x−A−1 ∆Ax+O(kA−1 ∆Ak2 ),


Hence

‖x − x̂‖ ≲ ‖A−1∆Ax‖ ≤ ‖A−1‖‖∆A‖‖x‖ ≤ ε‖A‖‖A−1‖‖x‖ = κ2(A)ε‖x‖.


In other words, even with a backward stable solution, one would only have O(κ2(A)ε) relative accuracy in the solution. If κ2(A) > ε−1, the solution may be rubbish! But the NLA view is that that is not the fault of the algorithm; the blame is on the problem being so ill-conditioned.

7.4.1 Backward stable+well conditioned=accurate solution


We’ve seen that backward stability does not necessarily imply the solution is accurate. There
is a happier side of this argument and useful rule of thumb that can be used to estimate the
accuracy of a computed solution using backward stability and conditioning.
Suppose

ˆ Y = f(X) is computed backward stably, i.e., Ŷ = f(X + ∆X), ‖∆X‖ = ε.

ˆ Conditioning: ‖f(X) − f(X + ∆X)‖ ≲ κ‖∆X‖.

Then (this is the absolute version; a relative version is possible)

‖Ŷ − Y‖ ≲ κε.

'proof':
‖Ŷ − Y‖ = ‖f(X + ∆X) − f(X)‖ ≲ κ‖∆X‖ = κε.
Here is how to interpret the result: If the problem is well-conditioned κ = O(1), this imme-
diately implies good accuracy of the solution! But otherwise the solution might have poor
accuracy—but it is still the exact solution of a nearby problem. This is often as good as one
can possibly hope for.
The reason this is only a rule of thumb and not exactly rigorous is that conditioning only
examines the asymptotic behavior, where the perturbation is infinitesimally small. Nonethe-
less it often gives an indicative estimate for the error and sometimes we can get rigorous
bounds if we know more about the problem. Important examples include the following:
ˆ Well-conditioned linear system Ax = b, κ2 (A) ≈ 1.

ˆ Eigenvalues of symmetric matrices (via Weyl’s bound λi (A+E) ∈ λi (A)+[−kEk2 , kEk2 ]).

ˆ Singular values of any matrix σi (A + E) ∈ σi (A) + [−kEk2 , kEk2 ].


Indeed, these problems are well-conditioned, so they can be solved with extremely high accuracy, essentially to working precision O(u) (times a small factor, usually bounded by something like √n).
Note: eigvecs/singvecs can be highly ill-conditioned even for the simplest problems.
Again, think of the identity. (Those curious are invited to look up the Davis-Kahan sin θ
theorem [7].)

7.5 Stability of triangular systems


7.5.1 Backward stability of triangular systems
While we will not be able to discuss in detail which NLA algorithm is backward stable etc,
we will pick a few important examples.
One fact is worth knowing (a proof is omitted; see [33] or [17]): triangular linear systems
can be solved in a backward stable manner. This fact is important as these arise naturally
in the solution of linear systems: Ax = b is solved via Ly = b, U x = y (triangular systems),
as we’ve seen in Section 5.
Let R denote a matrix that is (upper or lower) triangular. The computed solution x̂
for a (upper/lower) triangular linear system Rx = b solved via back/forward substitution is
backward stable, i.e., it satisfies

(R + ∆R)x̂ = b,    ‖∆R‖ = O(ε‖R‖).

Proof: Trefethen-Bau or Higham (nonexaminable but interesting).
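For reference, back substitution itself takes only a few lines; here is a NumPy sketch (ours, illustrative) for an upper triangular system Rx = b:

import numpy as np

def back_substitution(R, b):
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                       # last unknown first
        x[i] = (b[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
    return x

R = np.triu(np.random.randn(5, 5)) + 5 * np.eye(5)       # a well-conditioned example
b = np.random.randn(5)
x = back_substitution(R, b)
print(np.linalg.norm(R @ x - b))                         # residual at roundoff level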


Notes:

ˆ The backward error can be bounded componentwise.
ˆ Using the previous rule of thumb, this means ‖x̂ − x‖/‖x‖ ≲ κ2(R)ε.
– (unavoidably) poor worst-case (and attainable) bound when ill-conditioned
– often better with triangular systems; the behaviour of triangular linear systems keeps surprising experts!

7.5.2 (In)stability of Ax = b via LU with pivots


We have discussed how to solve a linear system using the LU factorisation. An obvious
question given the context is: is it backward stable? This question has a fascinating answer,
and transpires to touch on one of the biggest open problems in the field.
Fact (proof nonexaminable): the computed L̂, Û satisfy ‖L̂Û − A‖/(‖L‖‖U‖) = ε.
(Note: not ‖L̂Û − A‖/‖A‖ = ε.)

ˆ By stability of triangular systems, (L + ∆L)(U + ∆U)x̂ = b. Now if ‖L‖‖U‖ = O(‖A‖), then it follows that x̂ is a backward stable solution (exercise).

Question: Does LU = A + ∆A, or LU = PA + ∆A, with ‖∆A‖ = ε‖A‖ (i.e., ‖L‖‖U‖ = O(‖A‖)) hold?

Without pivots (P = I): ‖L‖‖U‖ ≫ ‖A‖ can hold, unboundedly (e.g. A = [ε 1; 1 1]); unstable.
With pivots:

ˆ Worst case: ‖L‖‖U‖/‖A‖ grows exponentially with n; unstable.
– growth governed by that of ‖L‖‖U‖/‖A‖ ⇒ ‖U‖/‖A‖.
ˆ In practice (average case): perfectly stable.
– Hence this is how Ax = b is solved in practice, even though alternatives with guaranteed stability exist (but are slower; e.g. via SVD, or QR (next)).
Resolution/explanation: among biggest open problems in numerical linear algebra!

7.5.3 Backward stability of Cholesky for A ≻ 0


We’ve seen that symmetric positive definite matrices can be solved with half the effort
using the Cholesky factorisation. As a bonus, in this case it turns out that stability can be
guaranteed.
The key fact is that the Cholesky factorisation A = RT R for A ≻ 0

ˆ succeeds without pivot (the active matrix is always positive definite).

ˆ R never contains entries > √‖A‖.

It follows that computing the Cholesky factorisation (assuming A ≻ 0) is backward stable!


Hence positive definite linear system Ax = b can always be solved in a backward stable
manner via Cholesky.

7.6 Matrix multiplication is not backward stable
Here is perhaps a shock—matrix matrix multiplication, one of the most basic operations, is
in general not backward stable.
Let’s start with the basics.

ˆ Vector-vector multiplication is backward stable: fl(yT x) = (y + ∆y)T (x + ∆x); in fact fl(yT x) = (y + ∆y)T x. (A direct proof is possible.)

ˆ It immediately follows that matrix-vector multiplication is also backward stable: f l(Ax) =


(A + ∆A)x.

ˆ But it is not true to say matrix-matrix multiplication is backward stable; which would
require f l(AB) to be equal to (A + ∆A)(B + ∆B). This may not be satisfied!

In general, what we can prove is a bound for the forward error ‖fl(AB) − AB‖ ≤ ε‖A‖‖B‖, so ‖fl(AB) − AB‖/‖AB‖ ≤ ε min(κ2(A), κ2(B)) (proof: problem sheet).
This is great news when A or B is orthogonal (or more generally square and well-
conditioned): say if A = Q is orthogonal, then we have

‖fl(QB) − QB‖ ≤ ε‖B‖,

so it follows that fl(QB) = QB + E with ‖E‖ ≤ ε‖B‖; hence, defining ∆B = QT E, we have fl(QB) = Q(B + ∆B). That is, orthogonal multiplication is backward stable (this argument proves this for left multiplication; orthogonal right-multiplication is entirely analogous).
One of the reasons backward stability of matrix-matrix multiplication fails to hold is that
there are not enough parameters in the backward error to account for the computational error
incurred in the matrix multiplication. Each matrix-vector product is backward stable; but
we cannot concentrate the backward errors into a single term without potentially increasing
the backward error’s norm.
On the other hand, we have seen that a linear system can be solved in a backward
stable fashion (by QR if necessary; in practice LU with pivoting suffices). This means the
inverse can be applied to a vector in a backward stable fashion: the computed x̂ satisfies (A + ∆A)x̂ = b, i.e., x̂ = (A + ∆A)−1 b.
One might wonder, can we not solve n linear systems in order to get a backward stable inverse? However, if one solves n linear systems with b = e1, e2, . . . , en, the solutions will satisfy (A + ∆iA)x̂i = ei, where ∆iA depends on i. There is no uniform ∆A that is O(ε) such that (A + ∆A)[x̂1, . . . , x̂n] = I.

Orthogonality matters for stability A happy and important exception is with orthog-
onal matrices Q (or more generally with well-conditioned matrices):
‖fl(QA) − QA‖/‖QA‖ ≤ ε,    ‖fl(AQ) − AQ‖/‖AQ‖ ≤ ε.

They are also backward stable:

fl(QA) = QA + ε‖A‖ ⇔ fl(QA) = Q(A + ∆A),
fl(AQ) = AQ + ε‖A‖ ⇔ fl(AQ) = (A + ∆A)Q.

Hence algorithms involving ill-conditioned matrices are unstable (e.g. eigenvalue decomposi-
tion of non-normal matrices, Jordan form, etc), whereas those based on orthogonal matrices
are stable. These include

ˆ Householder QR factorisation, QR-based linear system (next subsection).

ˆ QR algorithm for Ax = λx.

ˆ Golub-Kahan algorithm for A = U ΣV T .

ˆ QZ algorithm for Ax = λBx.

Section 8 onwards treats our second big topic, eigenvalue problems. This includes discussing
algorithms shown above in boldface.

7.7 Stability of Householder QR


Householder QR has excellent numerical stability, basically because it’s based on orthogonal
transformations. With Householder QR, the computed Q̂, R̂ satisfy

‖Q̂T Q̂ − I‖ = O(ε),    ‖A − Q̂R̂‖ = O(ε‖A‖),

and (of course) R is upper triangular.


Rough proof: Essentially the key idea is that multiplying by an orthogonal matrix is very
stable and Householder QR is based on a sequence of multiplications by orthogonal matrices.
To give a bit more detail,

ˆ Each reflector satisfies fl(HiA) = HiA + εi‖A‖.

ˆ Hence (R̂ =) fl(Hn · · · H1A) = Hn · · · H1A + ε‖A‖.

ˆ fl(Hn · · · H1) =: Q̂T = Hn · · · H1 + ε.

ˆ Thus Q̂R̂ = A + ε‖A‖.

Notes:

ˆ This doesn’t mean kQ̂−Qk, kR̂−Rk are small at all! Indeed Q, R are as ill-conditioned
as A [17, Ch. 20].

ˆ Solving Ax = b and least-squares problems via the QR factorisation is stable.

7.7.1 (In)stability of Gram-Schmidt
(Nonexaminable) A somewhat surprising fact is that the Gram-Schmidt algorithm, when
used for computing the QR factorisation, is not backward stable. Namely, orthogonality of
the computed Q̂ matrix is not guaranteed.

ˆ Gram-Schmidt is subtle:

– plain (classical) version: ‖Q̂T Q̂ − I‖ ≤ ε(κ2(A))2.

– modified Gram-Schmidt (orthogonalise 'one vector at a time'): ‖Q̂T Q̂ − I‖ ≤ εκ2(A).

– Gram-Schmidt twice (G-S again on the computed Q̂) is excellent: ‖Q̂T Q̂ − I‖ ≤ ε.

8 Eigenvalue problems
First of all, recall that Ax = λx has no explicit solution (neither λ nor x); huge difference
from Ax = b for which x = A−1 b.
From a mathematical viewpoint this marks an enormous point of departure: something that is explicitly written down vs. something that has no closed-form solution.
From a practical viewpoint the gulf is much smaller, because we have an extremely reliable algorithm for eigenvalue problems; so robust that it is essentially bulletproof, and provably backward stable.
Before we start describing the QR algorithm let’s discuss a few interesting properties of
eigenvalues.

ˆ Eigenvalues are the roots of the characteristic polynomial; by Abel's theorem, roots of polynomials of degree ≥ 5 have no general closed-form expression.

ˆ For any polynomial p, there exist (infinitely many) matrices whose eigvals are roots of
p.

ˆ Here is a nice and useful fact18: Let p(x) = xn + an−1 xn−1 + · · · + a1 x + a0, ai ∈ C. Then p(λ) = 0 ⇔ λ is an eigenvalue of the companion matrix

C =
[−an−1  −an−2  · · ·  −a1  −a0]
[  1                          ]
[        1                    ]
[             ⋱               ]
[                  1        0 ]  ∈ Cn×n.
18
A great application of this is univariate optimisation: to minimise a polynomial p(x), one can find the
critical points by solving for p0 (x) = 0, via the companion eigenvalues.
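A small NumPy illustration of the companion-matrix fact (the cubic below is our own example): the roots of p(x) = x3 − 6x2 + 11x − 6 = (x − 1)(x − 2)(x − 3) appear as the eigenvalues of C.

import numpy as np

a = [-6.0, 11.0, -6.0]                    # a_2, a_1, a_0 of the monic cubic
C = np.zeros((3, 3))
C[0, :] = [-a[0], -a[1], -a[2]]           # first row: -a_{n-1}, ..., -a_0
C[1, 0] = C[2, 1] = 1.0                   # ones on the subdiagonal
print(np.sort(np.linalg.eigvals(C)))      # ~ [1, 2, 3]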

ˆ So no finite-step algorithm exists for Ax = λx.
Eigenvalue algorithms are necessarily iterative and approximate.
ˆ Same for SVD, as σi(A) = √λi(AT A).

ˆ But this doesn’t mean they’re inaccurate!


Usual goal: compute the Schur decomposition A = U T U ∗ : U unitary, T upper triangular.

8.1 Schur decomposition


Theorem 8.1 Let A ∈ Cn×n (arbitrary square matrix). Then there exists a unitary matrix
U ∈ Cn×n s.t.

A = U T U ∗ ,    (9)
with T upper triangular. The decomposition (9) is called the Schur decomposition.
Note that
ˆ eig(A) = eig(T ) = diag(T ).
ˆ T diagonal iff A is normal, i.e., A∗ A = AA∗
∗ ∗ ∗ ∗ ∗

∗ ∗ ∗ ∗
Proof: Let Av = λ1 v and find U1 = [v1 , V⊥ ] unitary. Then AU1 = U1  ∗ ∗ ∗ ∗ ⇔
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗ ∗

∗ ∗ ∗ ∗
U1∗ AU1 =  ∗ ∗ ∗ ∗. Repeat on the lower-right (n − 1) × (n − 1) part to get
∗ ∗ ∗ ∗
∗ ∗ ∗ ∗
∗ ∗
Un−1 Un−2 . . . U1∗ AU1 U2 . . . Un−1 = T . 
The reason we usually take the Schur form to be the goal is that its computation can be
done in a backward stable manner. Essentially it boils down to the fact that the decompo-
sition is based on orthogonal transformations.
ˆ For normal matrices A∗ A = AA∗ , T must be diagonal and Schur form is automatically
diagonalised.
ˆ For nonnormal A, if diagonalisation A = XΛX −1 really necessary, done via starting
with the Schur decomposition and further reducing T by solving Sylvester equations;
but this process involves nonorthogonal transformations and is not backward stable
(nonexaminable).
ˆ The Schur decomposition is among the few examples in NLA where the difference
between C and R matters a bit. When working only in R one cannot always get a
triangular T ; it will have 2 × 2 blocks in the diagonal (these blocks have complex
eigenvalues). This is essentially because real matrices can have complex eigenvalues.
This is still not a major issue as eigenvalues of 2 × 2 matrices can be computed easily.

(This marks the end of “the first half” of the course (i.e., up until the basic facts about
eigenvalues, but not its computation). This information is relevant to MMSC students.)

8.2 The power method for finding the dominant eigenpair Ax = λx


We now start describing the algorithms for solving eigenvalue problems. The first algorithm
that we introduce is surprisingly simple and is based on the idea of keep multiplying the
matrix A to an arbitrary vector.
This algorithm by construction is designed to compute only a single eigenvalue and its
corresponding eigenvector, namely the dominant one. It is not able, at least as presented,
to compute all the eigenvalues.
Despite its limitation and simplicity it is an extremely important algorithm and the
underlying idea is in fact a basis for the ultimate QR algorithm that we use in order to
compute all eigenvalues of a matrix. So here is the algorithm, called the power method.

Algorithm 8.1 The power method for computing the dominant eigenvalue and eigenvector
of A ∈ Rn×n .
1: Let x ∈ Rn := random initial vector (unless specified)
2: Repeat: x = Ax, x = x/‖x‖2.

3: λ̂ = xT Ax gives an estimate for the eigenvalue.

ˆ Convergence analysis: suppose A is diagonalisable (a generic assumption). We can write x0 = Σ_{i=1}^{n} ci vi, where Avi = λi vi with |λ1| > |λ2| > · · ·. Then after k iterations,

x = C Σ_{i=1}^{n} ci (λi/λ1)^k vi → C c1 v1   as k → ∞, for some scalar C.

ˆ Converges geometrically: (λ̂, x) → (λ1, x1) with linear rate |λ2|/|λ1|.

ˆ What does this imply about A^k = QR as k → ∞? The first column of Q → v1 under mild assumptions.

Notes:

ˆ Google pagerank & Markov chain are linked to the power method.

ˆ As we’ll see, the power method is basis for refined algorithms (QR algorithm, Krylov
methods (Lanczos, Arnoldi,...))
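Here is a short NumPy sketch of Algorithm 8.1 (the test matrix is our own, constructed so that |λ2/λ1| = 2/3 and convergence is quick):

import numpy as np

rng = np.random.default_rng(0)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
d = np.linspace(1, 2, n); d[-1] = 3.0     # dominant eigenvalue 3, the rest in [1, 2]
A = (Q * d) @ Q.T                         # symmetric with known eigenvalues

x = rng.standard_normal(n)
for _ in range(100):                      # error shrinks like (2/3)^k
    x = A @ x
    x /= np.linalg.norm(x)
print(x @ A @ x)                          # Rayleigh quotient estimate -> 3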

8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank
Let us briefly digress and talk about a famous application that at least used to solve an
eigenvalue problem in order to achieve a familiar task: Google web search.
Once Google receives a user’s inquiry with keywords, it needs to rank the relevant web-
pages, to output the most important pages first.
Here, the 'importance' of websites is determined by the dominant eigenvector of the column-stochastic matrix (i.e., column sums are all 1)

A = αP + (1 − α)/n ·
[1 · · · 1]
[⋮  ⋱   ⋮]
[1 · · · 1]

where P is the scaled adjacency matrix (Pij = 1 if i, j are connected by an edge, 0 otherwise, and then scaled s.t. column sums are 1), and α ∈ (0, 1).

[Image: illustration of a small web graph and its PageRank scores, from Wikipedia.]

To solve this approximately (note that high accuracy isn’t crucial here—getting the
ordering wrong isn’t the end of the world), Google does (at least in the old days) a few steps
of power method: with initial guess x0 , k = 0, 1, . . .

1. xk+1 = Axk

2. xk+1 = xk+1 /kxk+1 k2 , k ← k + 1, repeat.

ˆ xk → PageRank vector v1 : Av1 = λ1 v1 . As A is a nonnegative matrix Aij ≥ 0,


the dominant eigenvector v1 can be taken to be nonnegative (by the Perron-Frobenius
theorem; nonexaminable).

For more on pagerank etc, see Gleich [13] if interested (nonexaminable).

8.2.2 Shifted inverse power method


We saw that the convergence of the power method is governed by |λ1 /λ2 |, the ratio between
the absolute value of dominant and the next dominant eigenvalue.
If this ratio is close to 1 the power method would converge very slowly. A natural question
arises: can one speed up the convergence in such cases?
This leads to the inverse power method, or more generally shifted inverse power method,
which can compute eigenpairs with much improved speed provided that the parameters
(shifts) are chosen appropriately.

Algorithm 8.2 Shifted inverse power method for computing the eigenvalue and eigenvector
of A ∈ Rn×n closest to a prescribed value µ ∈ C.
1: Let x ∈ Rn :=random initial vector, unless specified.
2: Repeat: x := (A − µI)−1 x, x = x/kxk.
3: λ̂ = xT Ax gives an estimate for the eigenvalue.

Note that the eigenvalues of the matrix (A − µI)−1 are (λ(A) − µ)−1 .

ˆ By the same analysis as above, the shifted inverse power method converges with improved linear rate |λσ(1) − µ|/|λσ(2) − µ| to the eigenpair closest to µ. Here σ denotes a permutation of {1, . . . , n} such that λσ(1) minimises |λi − µ| over i (and λσ(2) is the second closest).

ˆ µ can change adaptively with the iterations. The choice µ := xT Ax gives the Rayleigh
quotient iteration, with quadratic convergence kAx(k+1) − λ(k+1) x(k+1) k = O(kAx(k) −
λ(k) x(k) k2 ); this is further improved to cubic convergence if A is symmetric.

It is worth emphasising that the improved convergence comes at a cost: one step of
shifted inverse power method requires a solution of a linear system, which is clearly more
expensive than power method.
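A NumPy sketch of the adaptive-shift variant (Rayleigh quotient iteration), illustrating the trade-off: each step needs a linear solve, but only a handful of steps are required (the matrix and iteration count below are our own choices):

import numpy as np

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n)); A = (A + A.T) / 2

x = rng.standard_normal(n); x /= np.linalg.norm(x)
mu = x @ A @ x                                       # Rayleigh quotient shift
for _ in range(6):
    x = np.linalg.solve(A - mu * np.eye(n), x)       # one linear solve per iteration
    x /= np.linalg.norm(x)
    mu = x @ A @ x
print(np.linalg.norm(A @ x - mu * x))                # residual: tiny after a few steps
print(np.min(np.abs(np.linalg.eigvalsh(A) - mu)))    # mu matches an eigenvalue of A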

9 The QR algorithm
We’ll now describe an algorithm called the QR algorithm that is used universally for solving
eigenvalue problems of moderate size, e.g. by MATLAB’s eig. Given A ∈ Rn×n , the
algorithm

ˆ Finds all eigenvalues (approximately but reliably) in O(n3 ) flops,

ˆ Is backward stable.

Sister problem: Given A ∈ Rm×n compute SVD A = U ΣV T

ˆ ’ok’ algorithm: eig(AT A) to find V , then normalise AV

ˆ there’s a better (but still closely related) algorithm: Golub-Kahan bidiagonalisation,


discussed later in Section 10.2.

9.1 QR algorithm for Ax = λx


As the name suggests, the QR algorithm is based on the QR factorisation of a matrix.
Another key fact is that the eigenvalues of a product of two matrices remain the same when
the order of the product is reversed (eig(AB)=eig(BA), problem sheet). The QR algorithm
essentially is a simple combination of these two ideas, and the vanilla version is deceptively

simple: basically take the QR factorisation, swap the order, take the QR and repeat the
process. Namely,

Algorithm 9.1 The QR algorithm for finding the Schur decomposition of a square matrix
A.
1: Set A1 = A.
2: Repeat: A1 = Q1 R1 , A2 = R1 Q1 , A2 = Q2 R2 , A3 = R2 Q2 , . . .
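A bare-bones NumPy sketch of these iterations on a small symmetric matrix (our own example; unshifted, so convergence is slow but visible):

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6)); A = (A + A.T) / 2

Ak = A.copy()
for _ in range(300):
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q                                 # similar to A: A_{k+1} = Q^T A_k Q

d = np.sort(np.diag(Ak))
print(np.max(np.abs(d - np.linalg.eigvalsh(A))))   # small, and shrinks further with more
                                                   # iterations (shifts would fix this)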

Notes:
ˆ Ak are all similar: Ak+1 = QTk Ak Qk

ˆ We shall 'show' that Ak → (upper) triangular (diagonal if A is normal), under weak assumptions.

ˆ Basically: QR(factorise)→ RQ(swap)→ QR → RQ → · · ·

ˆ Fundamental work by Francis (61,62) and Kublanovskaya (63)

ˆ Truly Magical algorithm!

– The algorithm is backward stable, as based on orthogonal transforms: essentially,


RQ = f l(QT (QR)Q) = QT (QR + ∆(QR))Q.
– always converges (with shifts) in practice, but a global proof is currently unavail-
able(!)
– uses ’shifted inverse power method’ (rational functions) without inversions

Again, the QR algorithm performs: Ak = Qk Rk , Ak+1 = Rk Qk , repeat.


Let’s look at its properties.
Theorem 9.1 For k ≥ 1,

Ak+1 = (Q(k))T A Q(k),    A^k = (Q1 · · · Qk)(Rk · · · R1) =: Q(k) R(k).

Proof: The first claim follows by recalling Ak+1 = QkT Ak Qk and repeating. The second is by induction: k = 1 is trivial. Suppose A^{k−1} = Q(k−1) R(k−1). We have Ak = (Q(k−1))T A Q(k−1) = Qk Rk. Then A Q(k−1) = Q(k−1) Qk Rk, and so

A^k = A · A^{k−1} = A Q(k−1) R(k−1) = Q(k−1) Qk Rk R(k−1) = Q(k) R(k). □

QR algorithm and power method We can deduce that the QR algorithm is closely related to the power method. Let's try to explain the connection. By Theorem 9.1, Q(k) R(k) is the QR factorisation of A^k: as we saw in the analysis of the power method, the columns of A^k are 'dominated by the leading eigenvector' x1, where Ax1 = λ1 x1.
In particular, consider A^k [1, 0, . . . , 0]T = A^k e1:

ˆ A^k e1 = R(k)(1, 1) Q(k)(:, 1), which is parallel to the first column of Q(k).

ˆ By the analysis of the power method, this implies Q(k)(:, 1) → x1.

ˆ Hence by Ak+1 = (Q(k))T A Q(k), Ak(:, 1) → [λ1, 0, . . . , 0]T.

This tells us why the QR algorithm would eventually compute the eigenvalues—at least
the dominant ones. One can even go further and argue that once the dominant eigenvalue
has converged, we can expect the next dominant eigenvalue to start converging too—as due
to the nature of the QR algorithm based on orthogonal transformations, we are then working
in the orthogonal complement; and so on and so forth until the matrix A becomes upper
triangular (and in the normal case this becomes diagonal), completing the solution of the
eigenvalue problem. But there is much better news.

QR algorithm and inverse power method We have seen that the QR algorithm is
related to the power method. We have also seen that the power method can be improved by
a shift-and-invert technique. A natural question arises: can we introduce a similar technique
in the QR algorithm? This question has an elegant and remarkable answer: not only is this
possible but it is possible without ever inverting a matrix or solving a linear system (isn’t
that incredible!?).
Let's try and explain this. We start with the same expression for the QR iterates:

A^k = (Q1 · · · Qk)(Rk · · · R1) =: Q(k) R(k),    Ak+1 = (Q(k))T A Q(k).

Now take the inverse: A^{−k} = (R(k))−1 (Q(k))T.
Conjugate transpose: (A^{−k})∗ = Q(k) (R(k))−∗
⇒ this is a QR factorisation of the matrix (A^{−k})∗, whose eigenvalues are r(λi) = λi^{−k}
⇒ connection also with the (unshifted) inverse power method. Note that no matrix inverse is performed in the algorithm.

ˆ This means the final column of Q(k) converges to the left eigenvector xn of the smallest (in modulus) eigenvalue, with rate |λn|/|λn−1|; hence Ak(n, :) → [0, . . . , 0, λn].

ˆ (Very) fast convergence if |λn| ≪ |λn−1|.

ˆ Can we achieve this situation? Yes by shifts.

QR algorithm with shifts and shifted inverse power method We are now ready
to reveal the connection between the shift-and-invert power method and the QR algorithm
with shifts.
First, here is the QR algorithm with shifts.

Algorithm 9.2 The QR algorithm with shifts for finding the Schur decomposition of a
square matrix A.
1: Set A1 = A, k = 1.
2: Ak − sk I = Qk Rk (QR factorisation)
3: Ak+1 = Rk Qk + sk I, k ← k + 1, repeat.

Roughly, if sk ≈ λn, then by the argument just made

Ak+1 ≈
[∗ ∗ ∗ ∗ ∗ ]
[∗ ∗ ∗ ∗ ∗ ]
[∗ ∗ ∗ ∗ ∗ ]
[∗ ∗ ∗ ∗ ∗ ]
[0 0 0 0 λn].
Theorem 9.2

∏_{i=1}^{k} (A − si I) = Q(k) R(k)  (= (Q1 · · · Qk)(Rk · · · R1)).

Proof: Suppose the claim is true for k − 1. Then the QR algorithm computes (Q(k−1))T (A − sk I) Q(k−1) = Qk Rk, so (A − sk I) Q(k−1) = Q(k−1) Qk Rk, hence

∏_{i=1}^{k} (A − si I) = (A − sk I) Q(k−1) R(k−1) = Q(k−1) Qk Rk R(k−1) = Q(k) R(k). □

Inverse conjugate transpose: ∏_{i=1}^{k} (A − si I)−∗ = Q(k) (R(k))−∗.
ˆ This means the algorithm (implicitly) finds the QR factorisation of a matrix with eigvals r(λj) = ∏_{i=1}^{k} 1/(λ̄j − si).

ˆ This reveals the intimate connection between shifted QR and shifted inverse power
method, hence rational approximation .

ˆ Ideally, we would like to choose sk ≈ λn to accelerate convergence. This is done by


choosing sk to be the bottom-right corner entry19 of Ak ; which is sensible given that
with the QR algorithm, it tends to converge to an eigenvalue rapidly (with a few steps
of the unshifted QR, it tends to converge to the smallest eigenvalue).
19
In production code a so-called Wilkinson shift is often used, which computes the eigenvalue of the 2 × 2
bottom-right submatrix of Ak , and takes sk to be the one closer to the bottom-right entry of Ak .

9.2 QR algorithm preprocessing: reduction to Hessenberg form
We've seen that the QR iterations drive the below-diagonal entries to 0 (especially the subdiagonal entries):

A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]

ˆ Hence An,n → λn , so choosing sk = An,n is sensible.

ˆ This reduces #QR iterations to O(n) (empirical but reliable estimate).

ˆ But each iteration of the QR algorithm is O(n3 ) for QR, overall O(n4 ).

ˆ We next discuss a preprocessing technique to reduce the overall cost to O(n3 ).

The idea is to initially apply a series of deterministic orthogonal transformations that


reduces the matrix to a form that is closer to upper triangular. This is done before starting
the QR iterates, hence called a preprocessing step.
More precisely, to improve the cost of QR factorisation, we first reduce the matrix via
orthogonal Householder transformations as follows:
     
Take H1 = I − 2v1 v1T with v1 = [0, ∗, ∗, ∗, ∗]T chosen to zero out the first column below its second entry:

A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]

H1 A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]

Then, since H1 does not touch the first row/column when applied from the right, the zeros are preserved:

H1 A H1 =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]

Repeat with H2 = I − 2v2 v2T, v2 = [0, 0, ∗, ∗, ∗]T, ...:

H2 H1 A H1 H2 =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]

H3 H2 H1 A H1 H2 H3 =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]
[0 0 0 ∗ ∗]

In summary,

A → H1 A H1 → H2 H1 A H1 H2 → · · · → Hn−2 · · · H1 A H1 · · · Hn−2,

which is upper Hessenberg.
We have thus transformed the matrix A into upper Hessenberg form.

ˆ This helps QR iterations preserve structure: if A1 = QR Hessenberg, then so is A2 =


RQ

ˆ Using Givens rotations, each QR iter is O(n2 ) (not O(n3 )).

The remaining task (done by shifted QR): drive subdiagonal ∗ to 0. A is thus reduced
to a triangular form, revealing the Schur decomposition of A (if one traces back all the
orthogonal transformations employed).
Once the bottom-right subdiagonal entry satisfies |∗| < ε,

[∗ ∗ ∗ ∗ ∗]     [∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]     [∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]  ≈  [0 ∗ ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]     [0 0 ∗ ∗ ∗]
[0 0 0 ∗ ∗]     [0 0 0 0 ∗]

and we continue with shifted QR on the leading (n − 1) × (n − 1) block, and repeat.

ˆ Empirically, the shifted QR algorithm needs 2–4 iterations per eigenvalue. Overall, the cost is O(n3), ≈ 25n3 flops.

9.2.1 The (shifted) QR algorithm in action


Let’s see how the QR algorithm works with a small example. The plots show the convergence
of the subdiagonal entries |Ai+1,i | (note that their convergence signifies the convergence of
the QR algorithm as we initially reduce the matrix to Hessenberg form).

[Figure: convergence of the subdiagonal entries |Ai+1,i| (log scale). Left: no shift (plain QR), with iteration counts in the hundreds; right: QR with shifts, converging within about a dozen iterations. The bottom panels plot log|p(λ)| for the underlying polynomial/rational functions, with red dots marking the eigenvalues.]
In light of the connection to rational functions as discussed above, here we plot the
underlying functions (red dots: eigvals). The idea here is that we want the functions to
take large values at the target eigenvalue (at the current iterate) in order to accelerate
convergence.

9.2.2 (Optional) QR algorithm: other improvement techniques


We have completed the description of the main ingredients of the QR algorithm. Nonetheless,
there are a few more bells and whistles that go into a production code. We will not get
into too much detail but here is a list of the key players.

ˆ Double-shift strategy for A ∈ Rn×n

– (A − sI)(A − s̄I) = QR using only real arithmetic


ˆ Aggressive early deflation [Braman-Byers-Mathias 2002 [5]]

– Examine lower-right (say 100 × 100) block instead of (n, n − 1) element


– dramatic speedup (≈ ×10)

ˆ Balancing A ← DAD−1 , D: diagonal

– Aims at reducing kDAD−1 k: often yields better-conditioned eigenvalues.

Finally, let us mention a cautionary tale:

ˆ For nonsymmetric A, global convergence is NOT established(!)

– Of course it always converges in practice.. proving convergence is another big


open problem in numerical linear algebra.
– For symmetric A, global convergence analysis is available, see [28, Ch. 8].

10 QR algorithm continued
10.1 QR algorithm for symmetric A
So far we have not assumed anything about the matrix besides that it is square. This is
for good reason—because the QR algorithm is applicable to solving any eigenvalue problem.
However, in many situations the eigenvalue problem is structured. In particular the case
where the matrix is symmetric arises very frequently in practice and it comes with significant
simplification of the QR algorithm, so it deserves a special mention.

ˆ Most importantly, symmetry immediately implies that the initial reduction to Hessen-
berg form reduces A to tridiagonal form:

A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]

→ (Q1, applied from both sides)
[∗ ∗ 0 0 0]
[∗ ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]
[0 ∗ ∗ ∗ ∗]

→ (Q2)
[∗ ∗ 0 0 0]
[∗ ∗ ∗ 0 0]
[0 ∗ ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]
[0 0 ∗ ∗ ∗]

→ (Q3)
[∗ ∗ 0 0 0]
[∗ ∗ ∗ 0 0]
[0 ∗ ∗ ∗ 0]
[0 0 ∗ ∗ ∗]
[0 0 0 ∗ ∗]

ˆ QR steps for tridiagonal: requires O(n) flops instead of O(n2 ) per step.

ˆ Powerful alternatives are available for tridiagonal eigenvalue problem (divide-conquer


[Gu-Eisenstat 95], HODLR [Kressner-Susnjara 19],...)

ˆ Cost: (4/3)n3 flops for eigvals, ≈ 10n3 for eigvecs (store the Givens rotations to compute eigvecs).

ˆ Another approach (nonexaminable): spectral divide-and-conquer (Nakatsukasa-Freund,


Higham); which is all about using a carefully chosen rational approximation to find
orthogonal matrices Vi such that
       
A =
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]
[∗ ∗ ∗ ∗ ∗]

→ (V1)
[∗ ∗ ∗ 0 0]
[∗ ∗ ∗ 0 0]
[∗ ∗ ∗ 0 0]
[0 0 0 ∗ ∗]
[0 0 0 ∗ ∗]

→ (V2)
[∗ ∗ 0 0 0]
[∗ ∗ 0 0 0]
[0 0 ∗ 0 0]
[0 0 0 ∗ ∗]
[0 0 0 ∗ ∗]

→ (V3)
[∗ 0 0 0 0]
[0 ∗ 0 0 0]
[0 0 ∗ 0 0]
[0 0 0 ∗ 0]
[0 0 0 0 ∗]  = Λ.

10.2 Computing the SVD: Golub-Kahan's bidiagonalisation algorithm
The key ideas of the QR algorithm turns out to be applicable for the computation of the
SVD. Clearly this is a major step—given the importance of the SVD. In particular the
SVD algorithm has strong connections to the symmetric QR algorithm. This is perhaps not
surprising given the strong connection between the SVD and symmetric eigenvalue problems
as we discussed previously.
A noteworthy difference is that in the SVD A = U ΣV T the two matrices U and V are
allowed to be different. The algorithm respects this and instead of initially reducing the
matrix to tridiagonal form, it reduces it to a so-called bidiagonal form.
Here is how it works. Apply Householder reflectors from left and right (different ones)
to bidiagonalise
A → B = HL,n · · · HL,1 AHR,1 HR,2 · · · HR,n−2
           
A =
[? ? ? ?]
[? ? ? ?]
[? ? ? ?]
[? ? ? ?]
[? ? ? ?]

→ (HL,1)
[? ? ? ?]
[0 ? ? ?]
[0 ? ? ?]
[0 ? ? ?]
[0 ? ? ?]

→ (HR,1)
[? ? 0 0]
[0 ? ? ?]
[0 ? ? ?]
[0 ? ? ?]
[0 ? ? ?]

→ (HL,2)
[? ? 0 0]
[0 ? ? ?]
[0 0 ? ?]
[0 0 ? ?]
[0 0 ? ?]

→ (HR,2)
[? ? 0 0]
[0 ? ? 0]
[0 0 ? ?]
[0 0 ? ?]
[0 0 ? ?]

→ (HL,3)
[? ? 0 0]
[0 ? ? 0]
[0 0 ? ?]
[0 0 0 ?]
[0 0 0 ?]

→ (HL,4)
[? ? 0 0]
[0 ? ? 0]
[0 0 ? ?]
[0 0 0 ?]
[0 0 0 0]  = B,

ˆ Since the transformations are all orthogonal multiplications, singular values are pre-
served σi (A) = σi (B).

ˆ Once bidiagonalised, one can complete the SVD as follows:

– Mathematically, do QR alg on B T B (symmetric tridiagonal)


– More elegant: divide-and-conquer [Gu-Eisenstat 1995] or dqds algorithm [Fernando-
Parlett 1994] (nonexaminable)

ˆ Cost: ≈ 4mn2 flops for singular values Σ, ≈ 20mn2 flops to also compute the singular
vectors U, V .

10.3 (Optional but important) QZ algorithm for generalised eigenvalue problems
An increasingly important class of eigenvalue problems is the so-called generalised eigenvalue
problems involving two matrices. You have probably not seen them before, but the number of
applications that boil down to a generalised eigenvalue problem has been increasing rapidly.
A generalised eigenvalue problem is of the form

Ax = λBx, A, B ∈ Cn×n

ˆ The matrices A, B are given. The goal is to find the eigenvalues λ and their corre-
sponding eigenvectors x.

ˆ There are usually (incl. when B is nonsingular) n eigenvalues, which are the roots of
det(A − λB).
ˆ When B is invertible, one can reduce the problem to B −1 Ax = λx.

ˆ Important case: A, B symmetric, B positive definite: in this case λ are all real.

QZ algorithm: look for unitary Q, Z s.t. QAZ, QBZ both upper triangular.
ˆ Then diag(QAZ)/diag(QBZ) are the eigenvalues.

ˆ Algorithm: first reduce A, B to Hessenberg-triangular form.

ˆ Then implicitly do QR to B −1 A (without inverting B).

ˆ Cost: ≈ 50n3 .

ˆ See [14] for details.

10.4 (Optional) Tractable eigenvalue problems


Beyond generalised eigenvalue problems there are more exotic generalisations of eigenvalue
problems and reliable algorithms have been proposed for solving them.
Thanks to the remarkable developments in NLA, the following problems are ’tractable’
in that reliable algorithms exist for solving them, at least when the matrix size n is modest
(say in the thousands).
ˆ Standard eigenvalue problems Ax = λx

– symmetric (4/3n3 flops for eigvals, +9n3 for eigvecs)


– nonsymmetric (10n3 flops for eigvals, +15n3 for eigvecs)

ˆ SVD A = U ΣV T for A ∈ Rm×n: ((8/3)mn2 flops for singvals, +20mn2 for singvecs)

ˆ Generalised eigenvalue problems Ax = λBx , A, B ∈ Cn×n

ˆ Polynomial eigenvalue problems, e.g. (degree k = 2) P (λ)x = (λ2 A + λB + C)x = 0 ,


A, B, C ∈ Cn×n :≈ 20(nk)3

ˆ Nonlinear problems, e.g. N (λ)x = (A exp(λ) + B)x = 0

– often solved via approximating by polynomial N (λ) ≈ P (λ)


– more difficult: A(x)x = λx: eigenvector nonlinearity

Further speedup is often possible when structure present (e.g. sparse, low-rank)

11 Iterative methods: introduction
This section marks a point of departure from previous sections. So far we’ve been discussing
direct methods. Namely, we’ve covered the LU-based linear system solution and the QR
algorithm for eigenvalue problems20 .
Direct methods are

ˆ Incredibly reliable, and backward stable

ˆ Works like magic if n ≲ 10000

ˆ But not if n is larger!

A ’big’ matrix problem is one for which direct methods aren’t feasible. Historically, as
computers become increasingly faster, roughly

ˆ 1950: n ≥ 20

ˆ 1965: n ≥ 200

ˆ 1980: n ≥ 2000

ˆ 1995: n ≥ 20000

ˆ 2010: n ≥ 100000

ˆ 2020: n ≥ 500000 (n ≥ 50000 on a standard desktop)

was considered ’too large for direct methods’. While it’s clearly good news that our ability
to solve problems with direct methods has been improving, the scale of problems we face in
data science has been growing at the same (or even faster) pace! For such problems, we need
to turn to alternative algorithms: we’ll cover iterative and randomized methods. We first
discuss iterative methods, with a focus on Krylov subspace methods.

Direct vs. iterative methods Broadly speaking, the idea of iterative methods is to:

ˆ Gradually refine the solution iteratively.

ˆ Each iteration should be (a lot) cheaper than direct methods, usually O(n2 ) or less.

ˆ Iterative methods can be (but not always) much faster than direct methods.

ˆ Tends to be (slightly) less robust, nontrivial/problem-dependent analysis.

ˆ Often, after O(n3 ) work it still gets the exact solution (ignoring roundoff errors). But
one would hope to get a (acceptably) good solution long before that!

Figure 2: Rough comparison of direct vs. iterative methods, image from [Trefethen-Bau [33]]

Each iteration of most iterative methods is based on multiplying A to a vector; clearly


cheaper than an O(n3 ) direct algorithm. We’ll focus on Krylov subspace methods. (Other
iterative methods we won’t get into details include the Gauss-Seidel, SOR and Chebyshev
semi-iterative methods.)

11.1 Polynomial approximation: basic idea of Krylov


The big idea behind Krylov subspace methods is to approximate the solution in terms of a
polynomial of the matrix times a vector. Namely, in Krylov subspace methods, we look for
an (approximate) solution x̂ (for Ax = b or Ax = λx) of the form (after kth iteration)

x̂ = pk−1(A) v,

where pk−1 is a polynomial of degree (at most) k − 1, that is, pk−1(z) = Σ_{i=0}^{k−1} ci z^i for some coefficients ci ∈ C, and v ∈ Rn is the initial vector (usually v = b for linear systems; for eigenproblems v is usually a random vector). That is, pk−1(A) = Σ_{i=0}^{k−1} ci A^i.

Natural questions:
ˆ Why would this be a good idea?

– Clearly, ’easy’ to compute.


– One example: recall power method x̂ = Ak−1 v = pk−1 (A)v
Krylov finds a “better/optimal” polynomial pk−1 (A). When the eigenvalues of A
are well-behaved, we’ll see that O(1) iterations suffices for convergence.
– We’ll see more cases where Krylov is powerful.

ˆ How to turn this idea into an algorithm?

– Find an orthonormal basis: Arnoldi (next), Lanczos.


20
Note that the QR algorithm is iterative so it is not exactly correct to call it direct; however the nature
of its consistency and robustness together with the fact that the cost is more less predictable has rightfully
earned it the classification as a direct method.

11.2 Orthonormal basis for Kk (A, b)
Goal: Find approximate solution x̂ = pk−1 (A)b, i.e. in Krylov subspace

Kk (A, b) := span([b, Ab, A2 b, . . . , Ak−1 b])

You would want to convince yourself that any vector in the Krylov subspace can be written
as a polynomial of A times the vector b.
An important and non-trivial step towards finding a good solution is to form an orthonor-
mal basis for the Krylov subspace.
First step: form an orthonormal basis Q, s.t. solution x ∈ Kk (A, b) can be written as
x = Qy

ˆ Naive idea: Form matrix [b, Ab, A2 b, . . . , Ak−1 b], then compute its QR factorisation.

– [b, Ab, A2 b, . . . , Ak−1 b] is usually terribly conditioned! Dominated by leading eigvec


– computing Q is therefore an extremely ill-conditioned task, and hence Q is inaccurately computed

ˆ Much better solution: Arnoldi iteration

– Multiply A once at a time to the latest orthonormal vector qi


– Then orthogonalise Aqi against previous qj ’s (j = 1, . . . , i − 1) (as in Gram-
Schmidt).

11.3 Arnoldi iteration


Here is a pseudocode of the Arnold iteration. Essentially, what it does is multiply the matrix
A, orthogonalise against the previous vectors, and repeat.

Algorithm 11.1 The Arnoldi iteration for finding an orthonormal basis for Krylov subspace
Kk (A, b).
1: Set q1 = b/kbk2
2: For k = 1, 2, . . . ,
3: set v = Aqk
4: for j = 1, 2, . . . , k
5: hjk = qjT v, v = v − hjk qj % orthogonalise against qj via (modified) Gram-Schmidt

6: end for
7: hk+1,k = kvk2 , qk+1 = v/hk+1,k
8: End for
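A compact NumPy sketch of the Arnoldi iteration (the function name arnoldi is ours; lucky breakdown is not handled), returning Qk+1 and the (k + 1) × k upper Hessenberg matrix H̃k:

import numpy as np

def arnoldi(A, b, k):
    n = len(b)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        v = A @ Q[:, j]
        for i in range(j + 1):                 # orthogonalise via modified Gram-Schmidt
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        Q[:, j + 1] = v / H[j + 1, j]
    return Q, H

A = np.random.randn(100, 100)
b = np.random.randn(100)
Q, H = arnoldi(A, b, 10)
print(np.linalg.norm(A @ Q[:, :10] - Q @ H))   # ~ 1e-14: A Q_k = Q_{k+1} H~_k holds
print(np.linalg.norm(Q.T @ Q - np.eye(11)))    # the columns are orthonormal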

ˆ After k steps, A Qk = Qk+1 H̃k = Qk Hk + qk+1 [0, . . . , 0, hk+1,k], with Qk = [q1, q2, . . . , qk], Qk+1 = [Qk, qk+1], span(Qk) = span([b, Ab, . . . , A^{k−1}b]), and QTk+1 Qk+1 = Ik+1, where H̃k ∈ R(k+1)×k is upper Hessenberg:

H̃k =
[h1,1   h1,2   · · ·    h1,k  ]
[h2,1   h2,2   · · ·    h2,k  ]
[         ⋱      ⋱       ⋮   ]
[              hk,k−1   hk,k  ]
[                      hk+1,k ]

ˆ Cost is k A-multiplications+O(k 2 ) inner products (O(nk 2 ) flops)

11.4 Lanczos iteration


When A is symmetric, Arnoldi simplifies to
AQk = Qk Tk + qk+1 [0, . . . , 0, hk+1,k ],
where Tk is symmetric tridiagonal (proof: just note Hk = QTk AQk in Arnoldi)
 
A Qk = Qk+1 T̃k,    QTk+1 Qk+1 = Ik+1,

where T̃k ∈ R(k+1)×k is tridiagonal:

T̃k =
[t1,1   t1,2                   ]
[t2,1   t2,2    ⋱              ]
[         ⋱     ⋱    tk−1,k    ]
[             tk,k−1   tk,k    ]
[                     tk+1,k   ]
ˆ The vectors qk satisfy a 3-term recurrence tk+1,k qk+1 = (A − tk,k I)qk − tk−1,k qk−1. Orthogonalisation is necessary only against the last two vectors qk, qk−1.
ˆ Significant speedup over Arnoldi; cost is k A-multiplication plus O(k) inner products
(O(nk)).
ˆ In floating-point arithmetic, sometimes the computed Qk loses orthogonality and re-
orthogonalisation may be necessary (nonexaminable, see e.g. Demmel [8])

11.5 The Lanczos algorithm for symmetric eigenproblem


We are now ready to describe one of the most successful algorithms for large-scale symmetric
eigenvalue problems: the Lanczos algorithm. In simple words, it finds an eigenvalue and
eigenvector in a Krylov subspace by a projection method, called the Rayleigh-Ritz process.

Algorithm 11.2 Rayleigh-Ritz: given symmetric A and orthonormal Q, find approximate eigenpairs.
1: Compute QT AQ.
2: Eigenvalue decomposition QT AQ = V Λ̂V T .
3: Approximate eigenvalues diag(Λ̂) (Ritz values) and eigenvectors QV (Ritz vectors).

This is a projection method (a similar algorithm is available for the SVD).

Now we can describe the Lanczos algorithm as follows:


Lanczos algorithm=Lanczos iteration+Rayleigh-Ritz

Algorithm 11.3 Lanczos algorithm: given symmetric A ∈ Rn×n , find (extremal) eigenpairs.
1: Perform Lanczos iterations to obtain AQk = Qk Tk + qk+1 [0, . . . , 0, hk+1,k ].
2: Compute the eigenvalue decomposition of Tk = Vk Λ̂VkT (Rayleigh-Ritz with subspace
Qk ).
3: diag(Λ̂) are the approximate eigenvalues (Ritz values), and the columns of Qk Vk are the
approximate eigenvectors (Ritz vectors).

ˆ In this case Q = Qk , so simply QTk AQk = Tk (tridiagonal eigenproblem)


ˆ Very good convergence is usually observed to extremal eigenpairs. To see this:
– Recall from Courant-Fischer that λmax (A) = maxx (xT Ax)/(xT x).
– Hence
$$ \lambda_{\max}(A) \;\ge\; \underbrace{\max_{x\in\mathcal K_k(A,b)} \frac{x^TAx}{x^Tx}}_{\text{Lanczos output}} \;\ge\; \underbrace{\frac{v^TAv}{v^Tv}}_{\text{power method}}, \qquad v = A^{k-1}b. $$
– The same holds for λmin , and similarly for e.g. λ2 .
This is admittedly a very crude estimate for the convergence of Lanczos. A thorough analysis
turns out to be very complicated; if interested see for example [26].
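As a concrete illustration, here is a minimal MATLAB sketch of the Lanczos algorithm (Lanczos iteration plus Rayleigh-Ritz). The symmetric test matrix and k = 50 are illustrative choices, and no reorthogonalisation is performed, so in floating point the Ritz values can be affected by the loss of orthogonality mentioned above:

n = 1000; A = randn(n); A = (A+A')/2;     % symmetric test matrix (illustrative)
b = randn(n,1); k = 50;
Q = zeros(n,k+1); T = zeros(k+1,k);
Q(:,1) = b/norm(b);
for j = 1:k
    v = A*Q(:,j);
    for i = max(1,j-1):j                  % 3-term recurrence: only q_{j-1}, q_j needed
        T(i,j) = Q(:,i)'*v; v = v - T(i,j)*Q(:,i);
    end
    T(j+1,j) = norm(v); Q(:,j+1) = v/T(j+1,j);
end
Tk = (T(1:k,1:k)+T(1:k,1:k)')/2;          % symmetric tridiagonal T_k
[V,D] = eig(Tk);                          % Rayleigh-Ritz on span(Q_k)
ritzvals = diag(D);                       % approximate (extremal) eigenvalues
ritzvecs = Q(:,1:k)*V;                    % approximate eigenvectors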

Experiments with Lanczos Symmetric A ∈ Rn×n , n = 100; Lanczos and the power method with a random initial vector b.
[Figure: left, convergence to the dominant eigenvalue for Lanczos vs. the power method (error against iterations); right, convergence of the Ritz values (approximate eigenvalues) over the Lanczos iterations.]

The same principle of projecting the matrix onto a Krylov subspace applies to nonsymmetric eigenvalue problems. Essentially it boils down to finding the eigenvalues of the upper
Hessenberg matrix H arising in the Arnoldi iteration, rather than the tridiagonal matrix as
in Lanczos.

12 Arnoldi and GMRES for Ax = b


This is an exciting section as we will describe the GMRES algorithm. This algorithm has
been so successful that in the 90s the paper that introduced it was the most cited paper in
all of applied mathematics.
Just as the Lanczos method for eigenvalue problems is a simple projection with the matrix
onto the Krylov subspace, GMRES attempts to find an approximate solution in the Krylov
subspace that is in some sense the best possible.
Idea (very simple!): minimise the residual in the Krylov subspace [Saad-Schultz 86 [29]]:

x = argminx∈Kk (A,b) kAx − bk2 .


In order to solve this, the algorithm cleverly takes advantage of the structure afforded by Arnoldi.
Algorithm: Given AQk = Qk+1 H̃k and writing x = Qk y, rewrite as
$$ \min_y \|AQ_k y - b\|_2 = \min_y \|Q_{k+1}\tilde H_k y - b\|_2 = \min_y \left\| \begin{bmatrix}\tilde H_k\\ 0\end{bmatrix} y - \begin{bmatrix} Q_{k+1}^T \\ Q_{k+1,\perp}^T\end{bmatrix} b\right\|_2 = \min_y \left\| \begin{bmatrix}\tilde H_k\\ 0\end{bmatrix} y - \|b\|_2 e_1 \right\|_2, \quad e_1 = [1,0,\dots,0]^T\in\mathbb R^n $$
(where [Qk+1 , Qk+1,⊥ ] is orthogonal; we’re using the same trick as in least-squares).
ˆ Minimised when kH̃k y − QTk+1 bk2 is minimised; this is a Hessenberg least-squares prob-
lem.
ˆ Solve via the QR-based approach described in Section 6.5. Here it’s even simpler as the matrix is Hessenberg: k Givens rotations yield the QR factorisation, then a triangular solve completes the solution, so O(k 2 ) work in addition to Arnoldi; recall Section 6.4. (A code sketch of the whole procedure follows below.)
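To make this concrete, here is a minimal MATLAB sketch of GMRES built on the Arnoldi iteration above; for simplicity the Hessenberg least-squares problem is solved with backslash rather than with Givens rotations, and the function name gmres_sketch is just an illustrative label:

function x = gmres_sketch(A,b,k)
% k Arnoldi steps, then solve min_y || Htilde*y - ||b||*e1 ||_2 and set x = Q_k*y.
n = length(b);
Q = zeros(n,k+1); H = zeros(k+1,k);
Q(:,1) = b/norm(b);
for j = 1:k
    v = A*Q(:,j);
    for i = 1:j
        H(i,j) = Q(:,i)'*v; v = v - H(i,j)*Q(:,i);
    end
    H(j+1,j) = norm(v); Q(:,j+1) = v/H(j+1,j);
end
e1 = zeros(k+1,1); e1(1) = norm(b);
y = H\e1;              % Hessenberg least-squares problem (backslash uses QR)
x = Q(:,1:k)*y;        % GMRES iterate x_k
end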

12.1 GMRES convergence: polynomial approximation


We now study the convergence of GMRES. As with any Krylov subspace method, the analysis
is based on the theory of polynomial approximation.
Theorem 12.1 (GMRES convergence) Assume that A is diagonalisable, A = XΛX −1 .
Then the kth GMRES iterate xk satisfies
$$ \|Ax_k - b\|_2 \le \kappa_2(X)\ \min_{p\in\mathcal P_k,\ p(0)=1}\ \max_{z\in\lambda(A)} |p(z)|\ \|b\|_2. $$

Proof: Recall that xk ∈ Kk (A, b) ⇒ xk = pk−1 (A)b, where pk−1 is a polynomial of degree at most k − 1. Hence the GMRES solution satisfies
$$ \min_{x_k\in\mathcal K_k(A,b)} \|Ax_k - b\|_2 = \min_{p_{k-1}\in\mathcal P_{k-1}} \|Ap_{k-1}(A)b - b\|_2 = \min_{\tilde p\in\mathcal P_k,\ \tilde p(0)=0} \|(\tilde p(A)-I)b\|_2 = \min_{p\in\mathcal P_k,\ p(0)=1} \|p(A)b\|_2. $$
If A is diagonalisable, A = XΛX −1 , then
$$ \|p(A)\|_2 = \|Xp(\Lambda)X^{-1}\|_2 \le \|X\|_2\,\|X^{-1}\|_2\,\|p(\Lambda)\|_2 = \kappa_2(X)\max_{z\in\lambda(A)}|p(z)| $$
(recall that λ(A) is the set of eigenvalues of A). □


Noting that kp(A)bk2 ≤ kp(A)k2 kbk2 , here is the interpretation of this analysis: We
would like to find a polynomial s.t. p(0) = 1 and |p| is small at the eigenvalues of A, that is,
|p(λi )| is small for all i.
The question becomes, what type of eigenvalues are ’nice’ for this to hold?

GMRES example Let G be a Gaussian random matrix (Gij ∼ N (0, 1), i.i.d.) and n = 1000; the eigenvalues of G/√n lie (approximately) in the unit disk. For A = 2I + G/√n, the polynomial p(z) = 2−k (z − 2)k has p(0) = 1 and is small on λ(A), and GMRES converges rapidly; for A = G/√n, the eigenvalues surround the origin and convergence is very slow.
[Figure: eigenvalues of the two matrices in the complex plane, and the corresponding GMRES convergence histories.]
Initial vector. Sometimes a good initial guess x0 for x∗ is available. In this case we take
the initial residual r0 = b − Ax0 , and work in the affine space xk ∈ x0 + Kk (A, r0 ). All the
analysis above can be modified readily to allow for this situation with essentially the same
conclusion (clearly if one has a good x0 by all means that should be used).

12.2 When does GMRES converge fast?


Recall that the GMRES solution satisfies (assuming A is diagonalisable and nonsingular)
$$ \min_{x\in\mathcal K_k(A,b)} \|Ax-b\|_2 = \min_{p\in\mathcal P_k,\ p(0)=1} \|p(A)b\|_2 \le \kappa_2(X)\ \min_{p\in\mathcal P_k,\ p(0)=1}\max_{z\in\lambda(A)} |p(z)|\,\|b\|_2. $$

maxz∈λ(A) |p(z)| is small when


ˆ λ(A) are clustered away from 0
– a good p can be found quite easily
– e.g. example above

ˆ When λ(A) takes only k (≪ n) distinct values


– Then convergence in k GMRES iterations (why?)

12.3 Preconditioning for GMRES


We’ve seen that GMRES is great if the eigenvalues are clustered away from 0. If this is not
true, GMRES can really require close to (or equal to!) the full n iterations for convergence.
This is undesirable—at this point we’ve spent more computation than a direct method! We
need a workaround.
The idea of preconditioning is, instead of solving
Ax = b,
to find a “preconditioner” M ∈ Rn×n and solve
M Ax = M b.
Of course, M cannot be an arbitrary matrix. It has to be chosen very carefully. Desiderata
of M are
ˆ M is simple enough that applying M to a vector is easy (note that each GMRES iteration requires a multiplication by M A), and one of
1. M A has clustered eigenvalues away from 0.
2. M A has a small number of distinct nonzero eigenvalues.
3. M A is well-conditioned κ2 (M A) = O(1); then solve the normal equation (M A)T M Ax =
(M A)T M b.

Preconditioners: examples

ˆ ILU (Incomplete LU) preconditioner: A ≈ LU, M = (LU )−1 = U −1 L−1 , L, U ’as sparse
as A’ ⇒ M A ≈ I (hopefully; ’cluster away from 0’).
ˆ For $\tilde A = \begin{bmatrix} A & B\\ C & 0\end{bmatrix}$, set $M = \begin{bmatrix} A^{-1} & \\ & (CA^{-1}B)^{-1}\end{bmatrix}$. Then if M is nonsingular, M Ã has eigvals ∈ {1, ½(1 ± √5)} ⇒ 3-step convergence. [Murphy-Golub-Wathen 2000 [23]]

ˆ Multigrid-based, operator preconditioning, ...

ˆ A “perfect” preconditioner is M = A−1 , since then preconditioned GMRES converges in one step. Obviously this M isn’t easy to apply (if it were, we’d be done!). Preconditioning, therefore, can be regarded as an act of efficiently approximating the inverse.

Finding effective preconditioners is a never-ending research topic.


Prof. Andy Wathen is our Oxford expert!
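As a small illustration of the ILU idea above, here is a minimal sketch using MATLAB's built-in gmres and ilu; the Poisson test matrix, tolerance and iteration limits are illustrative choices:

A = gallery('poisson',50);                        % sparse 2500 x 2500 test matrix
b = ones(size(A,1),1);
[x1,fl1,rr1,it1] = gmres(A,b,[],1e-10,200);       % no preconditioner
[L,U] = ilu(A);                                   % incomplete LU, with L and U as sparse as A
[x2,fl2,rr2,it2] = gmres(A,b,[],1e-10,200,L,U);   % effectively solves (LU)^{-1}Ax = (LU)^{-1}b

Comparing it1 and it2 typically shows a substantial drop in the iteration count.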

12.4 Restarted GMRES


Another practical GMRES technique is restarting. For k iterations, GMRES costs k matrix
multiplications+O(nk 2 ) for orthogonalisation → Arnoldi eventually becomes expensive.

Practical solution: restart, in the manner of ‘iterative refinement’:

1. Stop GMRES after kmax (prescribed) steps to get approx. solution x̂1 .

2. Solve Ax̃ = b − Ax̂1 via GMRES. (This is a linear system with a different right-hand
side).

3. Obtain solution x̂1 + x̃.

Sometimes multiple restarts are needed.
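A minimal MATLAB sketch of this restarting loop, using the built-in gmres for the inner solves (the matrix, kmax, tolerance and number of restarts are illustrative choices):

A = gallery('poisson',50); b = ones(size(A,1),1);
kmax = 30; tol = 1e-10; x = zeros(size(b));
for cycle = 1:10
    r = b - A*x;                          % current residual
    dx = gmres(A,r,[],tol,kmax);          % kmax (unrestarted) GMRES steps on A*dx = r
    x = x + dx;                           % update the approximate solution
    if norm(b - A*x) <= tol*norm(b), break; end
end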

12.5 Arnoldi for nonsymmetric eigenvalue problems


Arnoldi for eigenvalue problems: Arnoldi iteration+Rayleigh-Ritz (just like Lanczos alg)

1. Compute QT AQ.

2. Eigenvalue decomposition QT AQ = X Λ̂X −1 .

3. Approximate eigenvalues diag(Λ̂). (Ritz values) and eigenvectors QX (Ritz vectors).

As in Lanczos, Q = Qk with span(Qk ) = Kk (A, b), so simply QTk AQk = Hk (a Hessenberg eigenproblem; note that this is ideal for the QR algorithm, as the preprocessing step can be skipped).

Which eigenvalues are found by Arnoldi? We give a rather qualitative answer:


ˆ First note that Krylov subspace is invariant under shift: Kk (A, b) = Kk (A − sI, b).

ˆ Thus any eigenvector that power method applied to A − sI converges to should be


contained in Kk (A, b).

ˆ To find other (e.g. interior) eigvals, one can use shift-invert Arnoldi: Q = Kk ((A −
sI)−1 , b).

13 Lanczos and Conjugate Gradient method for Ax = b, A ≻ 0
Here we introduce the conjugate gradient (CG) method. CG is not only important in practice but also has historical significance: it was the very first Krylov algorithm to be introduced (it initially generated a great deal of excitement, but took a while to be recognized as a competitive method).
First recall that when A is symmetric, Lanczos gives Qk , Tk such that AQk = Qk Tk + qk+1 [0, . . . , 0, hk+1,k ], with Tk tridiagonal.
The idea of CG is as follows: when A ≻ 0 is positive definite, solve QTk (AQk y − b) = Tk y − QTk b = 0, and set x = Qk y.
This is known as “Galerkin orthogonality”: it imposes that the residual Ax − b is orthogonal to span(Qk ).
ˆ Tk y = QTk b is a tridiagonal linear system, so it requires only O(k) operations to solve.

ˆ The three-term recurrence means only two previous vectors need to be orthogonalised against per step, so the cost is k A-multiplications plus O(nk) for orthogonalisation, making the orthogonalisation cost almost negligible.

ˆ The CG algorithm minimises the A-norm of the error in the Krylov subspace: xk = argminx∈span(Qk ) kx − x∗ kA (where Ax∗ = b). Writing xk = Qk y, we have
$$ (x_k - x_*)^T A (x_k - x_*) = (Q_k y - x_*)^T A (Q_k y - x_*) = y^T (Q_k^T A Q_k) y - 2 b^T Q_k y + b^T x_*, $$
whose minimiser is y = (QTk AQk )−1 QTk b, so QTk (AQk y − b) = 0.
– Note that kxkA = √(xT Ax) defines a norm (exercise)
– More generally, for an inner-product norm kzkM = √⟨z, z⟩M , minx=Qy kx∗ − xkM is attained when ⟨qi , x∗ − x⟩M = 0, ∀qi (cf. Part A Numerical Analysis).

13.1 CG algorithm for Ax = b, A ≻ 0
We’ve described the CG algorithm conceptually. To derive the practical algorithm some clever manipulations are necessary. We won’t go over them in detail, but here is the outcome: set x0 = 0, r0 = b, p0 = r0 and do for k = 0, 1, 2, . . .
αk = ⟨rk , rk ⟩/⟨pk , Apk ⟩
xk+1 = xk + αk pk
rk+1 = rk − αk Apk
βk = ⟨rk+1 , rk+1 ⟩/⟨rk , rk ⟩
pk+1 = rk+1 + βk pk
where rk = b − Axk is the residual and pk the search direction. xk is the CG solution after k iterations.
One can show, among other things (exercise/sheet):
ˆ Kk (A, b) = span(r0 , r1 , . . . , rk−1 ) = span(x1 , x2 , . . . , xk ) (also equal to span(p0 , p1 , . . . , pk−1 ))

ˆ rjT rk = 0, j = 0, 1, 2, . . . , k − 1
Thus xk is the kth CG solution, satisfying the Galerkin orthogonality QTk (Axk − b) = 0: the residual is orthogonal to the (Krylov) subspace.

13.2 CG convergence
Let’s examine the convergence of the CG iterates.
Theorem 13.1 Let A ≻ 0 be an n × n positive definite matrix and b ∈ Rn . Let ek := x∗ − xk be the error after the kth CG iteration (x∗ is the exact solution, Ax∗ = b). Then
$$ \frac{\|e_k\|_A}{\|e_0\|_A} \le 2\left(\frac{\sqrt{\kappa_2(A)}-1}{\sqrt{\kappa_2(A)}+1}\right)^{k}. $$

Proof: We have e0 = x∗ (since x0 = 0), and
$$ \frac{\|e_k\|_A}{\|e_0\|_A} = \min_{x\in\mathcal K_k(A,b)} \|x - x_*\|_A/\|e_0\|_A = \min_{p_{k-1}\in\mathcal P_{k-1}} \|p_{k-1}(A)b - A^{-1}b\|_A/\|e_0\|_A = \min_{p_{k-1}\in\mathcal P_{k-1}} \|(p_{k-1}(A)A - I)e_0\|_A/\|e_0\|_A $$
$$ = \min_{p\in\mathcal P_k,\ p(0)=1} \|p(A)e_0\|_A/\|e_0\|_A = \min_{p\in\mathcal P_k,\ p(0)=1} \Big\| V\,\mathrm{diag}\big(p(\lambda_1),\dots,p(\lambda_n)\big)\,V^T e_0 \Big\|_A/\|e_0\|_A, $$
where A = V ΛV T is the eigendecomposition.
Now kV diag(p(λi ))V T e0 k2A = Σi λi p(λi )2 (V T e0 )2i ≤ maxj p(λj )2 Σi λi (V T e0 )2i = maxj p(λj )2 ke0 k2A .
We’ve shown
$$ \frac{\|e_k\|_A}{\|e_0\|_A} \le \min_{p\in\mathcal P_k,\ p(0)=1}\ \max_j |p(\lambda_j)| \le \min_{p\in\mathcal P_k,\ p(0)=1}\ \max_{x\in[\lambda_{\min}(A),\lambda_{\max}(A)]} |p(x)|. $$

To complete the proof, in the next subsection we will show that
$$ \min_{p\in\mathcal P_k,\ p(0)=1}\ \max_{x\in[\lambda_{\min}(A),\lambda_{\max}(A)]} |p(x)| \le 2\left(\frac{\sqrt{\kappa_2(A)}-1}{\sqrt{\kappa_2(A)}+1}\right)^k. \qquad (10) $$


ˆ Note that κ2 (A) = σmax (A)/σmin (A) = λmax (A)/λmin (A) (=: b/a).
ˆ The above bound is obtained using Chebyshev polynomials on [λmin (A), λmax (A)] = [a, b]. This is a class of polynomials that arises in a variety of contexts in computational maths. Let’s next look at their properties.

13.2.1 Chebyshev polynomials


For z = exp(iθ), x = ½(z + z −1 ) = cos θ ∈ [−1, 1], θ = acos(x), define Tk (x) = ½(z k + z −k ) = cos(kθ). Then Tk (x) is a polynomial in x:
$$ \tfrac12(z + z^{-1})(z^k + z^{-k}) = \tfrac12(z^{k+1} + z^{-(k+1)}) + \tfrac12(z^{k-1} + z^{-(k-1)}) \;\Leftrightarrow\; \underbrace{2xT_k(x) = T_{k+1}(x) + T_{k-1}(x)}_{\text{3-term recurrence}}. $$
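A minimal MATLAB sketch evaluating T0 , . . . , Tkmax on a grid via this 3-term recurrence (the grid and kmax are illustrative choices):

x = linspace(-1,1,200); kmax = 6;
T = zeros(kmax+1,length(x));
T(1,:) = 1; T(2,:) = x;                  % T_0 = 1, T_1 = x
for k = 2:kmax
    T(k+1,:) = 2*x.*T(k,:) - T(k-1,:);   % T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)
end
plot(x,T)                                % reproduces plots like the ones below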
[Figure: plots of the first several Chebyshev polynomials Tk on [−1, 1].]
These polynomials grow very fast outside the interval (here the ’standard’ [−1, 1]). For
example, plots on [−2, 1] look like

[Figure: the same Chebyshev polynomials plotted on [−2, 1], showing rapid growth outside [−1, 1].]

Here’s a nice plot of several Chebyshev polynomials:

[Figure: several Chebyshev polynomials Tk (x), plotted against x and the degree k.]

13.2.2 Properties of Chebyshev polynomials


For z = exp(iθ), x = ½(z + z −1 ) = cos θ ∈ [−1, 1], θ = acos(x), Tk (x) = ½(z k + z −k ) = cos(kθ).

ˆ Inside [−1, 1], |Tk (x)| ≤ 1

ˆ Outside [−1, 1], |Tk (x)| ≫ 1 grows rapidly with |x| and k (fastest growth among Pk )

Shift and scale so that $p(x) = c_k T_k\!\left(\frac{2x-b-a}{b-a}\right)$ where $c_k = 1/T_k\!\left(\frac{-(b+a)}{b-a}\right)$, so that p(0) = 1. Then
ˆ $|p(x)| \le 1/\left|T_k\!\left(\frac{b+a}{b-a}\right)\right|$ on x ∈ [a, b]

ˆ Tk ((b+a)/(b−a)) = ½(z k + z −k ) where ½(z + z −1 ) = (b+a)/(b−a), i.e. z = (√(b/a)+1)/(√(b/a)−1) = (√κ2 (A)+1)/(√κ2 (A)−1), so
$$ |p(x)| \le 1/T_k\!\left(\frac{b+a}{b-a}\right) \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k. $$
This establishes (10), completing the proof of Theorem 13.1. □


For much more about Tk , see C6.3 Approximation of Functions

13.3 MINRES: symmetric (indefinite) version of GMRES


When the matrix is symmetric but not positive definite, GMRES can still be simplified, although some care is needed as the matrix A no longer defines a norm, unlike in the positive definite case.
Symmetric analogue of GMRES: MINRES (minimum-residual method) for A = AT (but not necessarily A ≻ 0):
x = argminx∈Kk (A,b) kAx − bk2 .
Algorithm: Given AQk = Qk+1 T̃k and writing x = Qk y, rewrite as
$$ \min_y \|AQ_k y - b\|_2 = \min_y \|Q_{k+1}\tilde T_k y - b\|_2 = \min_y \left\| \begin{bmatrix}\tilde T_k\\ 0\end{bmatrix} y - \begin{bmatrix} Q_{k+1}^T \\ Q_{k+1,\perp}^T\end{bmatrix} b\right\|_2 = \min_y \left\| \begin{bmatrix}\tilde T_k\\ 0\end{bmatrix} y - \|b\|_2 e_1 \right\|_2, \quad e_1 = [1,0,\dots,0]^T\in\mathbb R^n $$
(where [Qk+1 , Qk+1,⊥ ] is orthogonal; same trick as in least-squares).

ˆ Minimised when kT̃k y − QTk+1 bk2 is minimised; a tridiagonal least-squares problem

ˆ Solve via QR (k Givens rotations)+ tridiagonal solve, O(k) in addition to Lanczos

13.3.1 MINRES convergence


As in GMRES, we can examine the MINRES residual in terms of minimising the values of
a polynomial at the eigenvalues of A.

$$ \min_{x\in\mathcal K_k(A,b)}\|Ax-b\|_2 = \min_{p_{k-1}\in\mathcal P_{k-1}}\|Ap_{k-1}(A)b - b\|_2 = \min_{\tilde p\in\mathcal P_k,\ \tilde p(0)=0}\|(\tilde p(A)-I)b\|_2 = \min_{p\in\mathcal P_k,\ p(0)=1}\|p(A)b\|_2. $$

Since A = AT , A is diagonalisable as A = QΛQT with Q orthogonal, so
$$ \|p(A)\|_2 = \|Qp(\Lambda)Q^T\|_2 \le \|Q\|_2\,\|Q^T\|_2\,\|p(\Lambda)\|_2 = \max_{z\in\lambda(A)}|p(z)|. $$

Interpretation: (again) find a polynomial s.t. p(0) = 1 and |p(λi )| is small:
$$ \frac{\|Ax-b\|_2}{\|b\|_2} \le \min_{p\in\mathcal P_k,\ p(0)=1}\max_i |p(\lambda_i)|. $$

One can prove (nonexaminable)
$$ \min_{p\in\mathcal P_k,\ p(0)=1}\max_i |p(\lambda_i)| \le 2\left(\frac{\kappa_2(A)-1}{\kappa_2(A)+1}\right)^{k/2}. $$
ˆ obtained by Chebyshev+Möbius change of variables [Greenbaum’s book 97]

ˆ minimisation needed on positive and negative sides, hence slower convergence when A
is indefinite

CG and MINRES, optimal polynomials Here are some optimal polynomials for CG and MINRES. Note that κ2 (A) = 10 in both cases; observe how much faster CG is than MINRES.
[Figure: the optimal polynomials at iteration k = 50 for CG and for MINRES.]

13.4 Preconditioned CG/MINRES


The preceding analysis suggests that the linear system
Ax = b,   A = AT (≻ 0)
may not be solved rapidly by CG or MINRES if the eigenvalues are not distributed in a favorable manner (i.e., clustered away from 0).
In this case, as with GMRES, a workaround is to find a good preconditioner M such that
“M T M ≈ A−1 ” and solve
M T AM y = M T b, M y = x
As before, desiderata of M :
ˆ M T AM is easy to multiply to a vector.

ˆ M T AM has clustered eigenvalues away from 0.

Note that reducing κ2 (M T AM ) directly implies rapid convergence.

ˆ It is possible to implement preconditioned CG with just M T M (no need to find M ); a small example using MATLAB’s built-in pcg is sketched below.
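For instance, MATLAB's built-in pcg accepts the preconditioner in factored form; a minimal sketch using an incomplete Cholesky factor (an illustrative choice, with LLT ≈ A playing the role of the approximation to A) is:

A = gallery('poisson',50); b = ones(size(A,1),1);   % SPD test matrix
[x0,fl0,rr0,it0] = pcg(A,b,1e-10,500);              % unpreconditioned CG
L = ichol(A);                                       % incomplete Cholesky: A ~ L*L'
[x1,fl1,rr1,it1] = pcg(A,b,1e-10,500,L,L');         % preconditioned CG; it1 is typically much smaller than it0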

14 Randomized algorithms in NLA


In this final part of the lecture series we will talk about randomised algorithms. This takes us to the forefront of research in the field: a major idea in numerical linear algebra since 2005 or so has been the use of randomisation, wherein a matrix sketch is used to extract information about the matrix, for example in order to construct a low-rank approximation.
So far, all algorithms have been deterministic (always same output).

ˆ Direct methods (LU for Ax = b, QR alg for Ax = λx or A = U ΣV T ) are

– Incredibly reliable, backward stable.


– Work like magic if n ≲ 10,000.
– But not beyond; cubic complexity O(n3 ) or O(mn2 ).

ˆ Iterative methods (GMRES, CG, Arnoldi, Lanczos) are

– Very fast when they work (nice spectrum etc.).


– Otherwise, not so much; need for preconditioning.

ˆ Randomized algorithms

– Output differs at every run.


– Ideally succeed with enormous probability, e.g. 1 − exp(−cn).
– Often by far the fastest&only feasible approach.
– Not for all problems—active field of research.

Why do we need randomisation? Randomised algorithms sometimes carry a negative connotation: that the algorithm is unreliable and always produces different results. Well, this is a valid concern, and we hope to dispel it in what follows. The reason
randomisation is necessary is the sheer size of datasets that we face today, with the rise of
data science. As you will see, one core idea of randomisation is that by allowing for a very
small error (which can easily be in the order of machine precision) instead of getting an exact
solution, one can often dramatically speed up the algorithm.
We’ll cover two NLA topics where randomisation has been very successful: low-rank
approximation (randomized SVD), and overdetermined least-squares problems

Gaussian matrices In randomized algorithms we typically introduce a random matrix.
For the analysis, Gaussian matrices G ∈ Rm×n are the most convenient (not always for
computation). These are matrices whose entries are drawn independently from the standard
normal (Gaussian) distribution Gij ∼ N (0, 1). We cannot do justice to random matrix
theory (there is a course in Hilary term!), but here is a summary of the properties of Gaussian
matrices that we’ll be using.

ˆ A useful fact about Gaussian random matrices G is that its distribution is invariant
under orthogonal transformations. That is, if G is Gaussian, so is QG and GQ, where
Q is any orthogonal matrix independent of G. To see this (nonexaminable): note that
a sum of Gaussian (scalar) random variables is Gaussian, and by independence the
variance is simply the sum of the variances. Now let gi denote the ith column of G.
Then E[(Qgi )(Qgi )T ] = QE[gi giT ]QT = I, so each Qgi is multivariate Gaussian with the
same distribution as gi . Independence of Qgi , Qgj is immediate.

ˆ Another very useful fact is that when the matrix is rectangular, m > n (or m < n), the singular values are known to lie in the interval [√m − √n, √m + √n]. This is a consequence of the Marchenko-Pastur rule, which we discuss a bit more later.

14.1 Randomized SVD by Halko-Martinsson-Tropp


We start with what has been arguably the most successful usage of randomisation in NLA,
low-rank approximation. Probably the best reference is the paper by Halko, Martinsson
and Tropp [16]. This paper has been enormously successful, with over 3000 Google Scholar
citations as of 2021. See also the recent survey by the same authors [22].
The algorithm itself is astonishingly simple:

Algorithm 14.1 Randomised SVD (HMT): given A ∈ Rm×n and rank r, find a rank-r
approximation  ≈ A.
1: Form a random matrix X ∈ Rn×r , usually r  n.
2: Compute AX.
3: Compute the QR factorisation AX = QR.

4: A ≈ Â := Q(QT A) (= (QU0 )Σ0 V0T , where QT A = U0 Σ0 V0T is an SVD) is the rank-r approximation.

Here, X is a random matrix taking independent and identically distributed (iid) entries.
A convenient choice (for the theory, not necessarily for computation) is a Gaussian matrix,
with iid entries Xij ∼ N (0, 1).
Here are some properties of the HMT algorithm:
ˆ O(mnr) cost for dense A.

ˆ Near-optimal approximation guarantee: for any r̂ < r,
$$ \mathbb{E}\|A - \hat A\|_F \le \left(1 + \frac{r}{r-\hat r-1}\right)\|A - A_{\hat r}\|_F, $$
where Ar̂ is the rank-r̂ truncated SVD (the expectation is with respect to the random matrix X).
This is a remarkable result; make sure to pause and think about what it says! The approximant  has error kA − ÂkF that is within a factor 1 + r/(r − r̂ − 1) of the optimal truncated SVD, for a slightly lower rank r̂ < r (say, r̂ = 0.9r).

Goal: understand this, or at least why EkA − Âk = O(1)kA − Ar̂ k.

14.2 Pseudoinverse and projectors


To understand why the HMT algorithm works, we need to introduce two notions: the pseudo
inverse and (orthogonal and oblique) projectors.
Given M ∈ Rm×n with economical SVD M = Ur Σr VrT (Ur ∈ Rm×r , Σr ∈ Rr×r , Vr ∈ Rn×r , where r = rank(M ) so that Σr ≻ 0), the pseudoinverse M † is
$$ M^\dagger = V_r \Sigma_r^{-1} U_r^T \in \mathbb{R}^{n\times m}. $$

ˆ M † satisfies M M † M = M , M † M M † = M † , M M † = (M M † )T , M † M = (M † M )T
(these are often taken to be the definition of the pseudoinverse—the above definition
is much simpler IMO).

ˆ M † = M −1 if M nonsingular.

ˆ M † M = In if m ≥ n (and M M † = Im if m ≤ n), provided M has full rank.

A square matrix P ∈ Rn×n is called a projector if P 2 = P .

ˆ P is always diagonalisable and all eigenvalues are 1 or 0. (think why this is?)

ˆ kP k2 ≥ 1 and kP k2 = 1 iff P = P T ; in this case P is called an orthogonal projector,


and P can be written as P = QQT where Q is orthonormal.

ˆ One can easily show that I − P is another projector, as (I − P )2 = I − P , and unless P = 0 or P = I, we have kI − P k2 = kP k2 : via the Schur form, $QPQ^* = \begin{bmatrix} I & B\\ 0 & 0\end{bmatrix}$ and $Q(I-P)Q^* = \begin{bmatrix} 0 & -B\\ 0 & I\end{bmatrix}$. See Szyld 2006 [32] for much more about projections. (A quick numerical check is sketched below.)
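A quick numerical sanity check of these facts (a minimal sketch; pinv is MATLAB's built-in pseudoinverse, and the oblique projector below is an illustrative construction):

m = 8; n = 5; M = randn(m,n);
Mp = pinv(M);
norm(M*Mp*M - M), norm(Mp*M - eye(n))     % M*Mp*M = M, and Mp*M = I_n (full column rank)
U = randn(m,3); W = randn(m,3);
P = U*((W'*U)\W');                        % oblique projector onto span(U): P^2 = P
norm(P*P - P), [norm(P), norm(eye(m)-P)]  % the last two norms agree (and exceed 1)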

14.3 HMT approximant: analysis (down from 70 pages!)
We are now in a position to explain why the HMT algorithm works. The original HMT paper
is over 70 pages with long analysis. Here we attempt to condense the arguments to the essence
(and with a different proof). Recall that our low-rank approximation is  = QQT A, where
AX = QR. Goal: kA − Âk = k(Im − QQT )Ak = O(kA − Ar̂ k).

1. QQT AX = AX (QQT is orthogonal projector onto span(AX)). Hence (Im −


QQT )AX = 0, so A − Â = (Im − QQT )A(In − XM T ) for any M ∈ Rn×r .
The idea then is to choose M cleverly such that the expression (Im −QQT )A(In −XM T )
can be shown to have small norm.

2. Set M T = (V T X)† V T , where V = [v1 , . . . , vr̂ ] ∈ Rn×r̂ contains the r̂ leading right singular vectors of A (r̂ ≤ r). Recall that r̂ is any integer bounded by r.

3. V V T (I − XM T ) = V V T (I − X(V T X)† V T ) = 0 if V T X full row-rank (this is a generic


assumption), so A − Â = (Im − QQT )A(I − V V T )(In − XM T ).

4. Taking norms yields kA − Âk2 = k(Im − QQT )A(I − V V T )(In − XM T )k2 = k(Im − QQT )U2 Σ2 V2T (In − XM T )k2 , where [V, V2 ] is orthogonal and $A = [U, U_2]\begin{bmatrix}\Sigma & \\ & \Sigma_2\end{bmatrix}[V, V_2]^T$ is the SVD, so
$$ \|A - \hat A\|_2 \le \underbrace{\|\Sigma_2\|_2}_{\text{optimal rank-}\hat r}\,\|I_n - XM^T\|_2 = \|\Sigma_2\|_2\,\|XM^T\|_2, $$
where the last equality uses the projector norm identity from Section 14.2 (XM T is a projector).

It remains to prove kXM T k2 = O(1). To see why this should hold with high probability, we
need a result from random matrix theory.

14.4 Tool from RMT: Rectangular random matrices are well conditioned
A final piece of information required to complete the puzzle is the Marchenko-Pastur law, a
classical result in random matrix theory, which we will not be able to prove here (and hence
is clearly nonexaminable). We refer those interested to the part C course on Random Matrix
Theory. However, understanding the statement and the ability to use this fact is indeed
examinable.
The key message is easy to state: a rectangular random matrix is well conditioned
with extremely high probability. This fact is enormously important and useful in a variety
of contexts in computational mathematics.
Here is a more precise statement:

Theorem 14.1 (Marchenko-Pastur) The singular values of a random matrix X ∈ Rm×n (m ≥ n) with iid Xij (mean 0, variance 1) follow the Marchenko-Pastur (M-P) distribution (proof nonexaminable), with density ∼ (1/x)√(((√m + √n) − x)(x − (√m − √n))) and support [√m − √n, √m + √n].
[Figure: histograms of the singular values of random Gaussian matrices with varying aspect ratio m/n (aspect = 1, 2, 5, 10).]

In particular σmax (X) ≈ √m + √n, σmin (X) ≈ √m − √n, hence κ2 (X) ≈ (1 + √(n/m))/(1 − √(n/m)) = O(1).
Proof: omitted. (Strictly speaking, the theorem concerns the limit m, n → ∞, but the result holds in the nonasymptotic regime with enormous probability [6].)
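A quick MATLAB experiment illustrating the theorem (the sizes are illustrative choices):

m = 4000; n = 1000; X = randn(m,n);
s = svd(X);
[max(s), sqrt(m)+sqrt(n)]        % largest singular value  ~ sqrt(m)+sqrt(n)
[min(s), sqrt(m)-sqrt(n)]        % smallest singular value ~ sqrt(m)-sqrt(n)
histogram(s)                     % compare with the histograms in the figure above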
As stated above, this is a key fact in many breakthroughs in computational maths!
Examples include

ˆ Randomized SVD, Blendenpik (randomized least-squares)

ˆ (nonexaminable:) Compressed sensing (RIP) [Donoho 06, Candes-Tao 06], Matrix con-
centration inequalities [Tropp 11], Function approx. by least-squares [Cohen-Davenport-
Leviatan 13]

ˆ (nonexaminable:) You might have heard of the Johnson-Lindenstrauss (JL) Lemma.


A host (thousands) of papers have appeared that use JL to prove interesting results. It turns out that many of these can equally well be proven (and IMO more easily/naturally understood) using M-P.

kXM T k2 = O(1) Let’s get back to the HMT analysis. Recall that we’ve shown, for M T = (V T X)† V T with X ∈ Rn×r random, that
$$ \|A - \hat A\|_2 \le \underbrace{\|\Sigma_2\|_2}_{\text{optimal rank-}\hat r}\,\|I_n - XM^T\|_2 = \|\Sigma_2\|_2\,\|XM^T\|_2. $$
Now kXM T k2 = kX(V T X)† V T k2 = kX(V T X)† k2 ≤ kXk2 k(V T X)† k2 .
Now let’s analyse the (standard) case where X is random Gaussian, Xij ∼ N (0, 1). Then
ˆ V T X is another Gaussian matrix (an important fact about Gaussian matrices is that orthogonal × Gaussian = Gaussian in distribution; this is nonexaminable but a nice exercise), hence k(V T X)† k = 1/σmin (V T X) ≲ 1/(√r − √r̂) by M-P.

ˆ kXk2 ≲ √n + √r by M-P.21
Together we get kXM T k2 ≲ (√n + √r)/(√r − √r̂) = ”O(1)”.

Remark:

ˆ When X is a non-Gaussian random matrix, the performance is similar, but is harder


to analyze. A popular choice is the so-called SRFT matrices, which use the FFT
(fast Fourier transform) and can be applied to A with O(mn log m) cost rather than
O(mnr).

14.5 Precise analysis for HMT (nonexaminable)


A slightly more elaborate analysis again using random matrix theory will give us a very sharp
bound on the expected value of the error EHMT := A − Â (this again is nonexaminable).

Theorem 14.2 (Reproduces HMT 2011 Thm. 10.5) If X is Gaussian, for any r̂ < r,
$$ \mathbb E\|E_{\mathrm{HMT}}\|_F \le \sqrt{\mathbb E\|E_{\mathrm{HMT}}\|_F^2} = \sqrt{1 + \frac{r}{r-\hat r-1}}\ \|A - A_{\hat r}\|_F. $$

Proof: The first inequality is Cauchy–Schwarz. kEHMT k2F is
$$ \|A(I - VV^T)(I - P_{X,V})\|_F^2 = \|A(I-VV^T)\|_F^2 + \|A(I-VV^T)P_{X,V}\|_F^2 = \|\Sigma_2\|_F^2 + \|\Sigma_2 P_{X,V}\|_F^2 = \|\Sigma_2\|_F^2 + \|\Sigma_2 (V_\perp^T X)(V^TX)^\dagger V^T\|_F^2. $$
Now if X is Gaussian then V⊥T X ∈ R(n−r̂)×r and V T X ∈ Rr̂×r are independent Gaussian. Hence by [HMT Prop. 10.1], $\mathbb E\|\Sigma_2 (V_\perp^TX)(V^TX)^\dagger\|_F^2 = \frac{r}{r-\hat r-1}\|\Sigma_2\|_F^2$, so
$$ \mathbb E\|E_{\mathrm{HMT}}\|_F^2 = \left(1 + \frac{r}{r-\hat r-1}\right)\|\Sigma_2\|_F^2. \qquad\square $$

Note how remarkable the theorem is—the ’lazily’ computed approximant is nearly optimal, up to a factor $\sqrt{1 + \frac{r}{r-\hat r-1}}$, for a nearby rank r̂ (one can take, e.g., r̂ = 0.9r).

14.6 Generalised Nyström


We wish to briefly mention an algorithm for low-rank approximation that is even faster than HMT, especially when r ≫ 1.
Let X ∈ Rn×r be a random matrix as before, and let Y ∈ Rm×(r+ℓ) be another random matrix; then set [Nakatsukasa arXiv 2020 [24]]
$$ \hat A = (AX(Y^TAX)^\dagger Y^T)A = P_{AX,Y}\,A. $$


21 This and the next line have been corrected from kXk2 ≲ √m + √r and kXM T k2 ≲ (√m + √r)/(√r − √r̂) = ”O(1)” on 24 March 2022.

Then  is another rank-r approximation to A, and A −  = (I − PAX,Y )A = (I − PAX,Y )A(I − XM T ); choose M s.t. XM T = X(V T X)† V T = PX,V . Then PAX,Y , PX,V are (nonorthogonal) projections, and
$$ \|A - \hat A\| = \|(I - P_{AX,Y})A(I - P_{X,V})\| \le \|(I - P_{AX,Y})A(I - VV^T)(I - P_{X,V})\| $$
$$ \le \|A(I - VV^T)(I - P_{X,V})\| + \|P_{AX,Y}A(I - VV^T)(I - P_{X,V})\|. $$

ˆ Note that the kA(I − V V T )(I − PX,V )k term is the exact same as in the HMT error.

ˆ Extra term kPAX,Y k2 = O(1) as before if c > 1 in Y ∈ Rm×cr (again, by Marchenko-


Pastur).
√ √
ˆ Overall, about (1 + kPAX,Y k2 ) ≈ (1 + √n+ r+`
√ ) times bigger expected error than HMT,
r+`− r
still near-optimal and much faster O(mn log n + r3 ).

Here are some experiments with HMT and GN (generalized Nyström).


Let A = U ΣV T be a dense 30, 000 × 30, 000 matrix with geometrically decaying singular
values σi , and choose U, V to be random (orthogonal factors of a random Gaussian matrix).
We use HMT and GN to find low-rank approximations to A varying the rank r. We also
compare with MATLAB’s SVD, which gives the optimal truncated SVD (up to numerical
errors, which are O(10−15 ) so invisible).
[Figure: left (“Convergence”), approximation error vs. rank for SVD, HMT and GN; right (“Speed”), computation time in seconds vs. rank for the three methods.]
We see that randomised algorithms can outperform the standard SVD significantly.

14.7 MATLAB code


Implementing the HMT and GN algorithms (at least with Gaussian matrices, which don’t
always give optimal speed performance) is very easy!
Here’s some sample MATLAB code. We can set up a matrix with exponentially decaying singular values (the matrix is constructed by computing A = U ΣV T , where U, V are orthonormal and the diagonal of Σ is a geometric series from 10^{-100} to 1).

n = 1000; % size
A = gallery('randsvd',n,1e100); % geometrically decaying singvals
r = 200; % rank
Then HMT is done as follows:
X = randn(n,r);
AX = A*X;
[Q,R] = qr(AX,0); % QR fact.
At = Q*(Q'*A);
which with high probability gives an excellent approximation
norm(At-A,'fro')/norm(A,'fro')
ans = 1.2832e-15
And for Generalized Nyström:
X = randn(n,r); Y = randn(n,1.5*r);
AX = A*X; YA = Y'*A; YAX = YA*X;
[Q,R] = qr(YAX,0); % stable pseudo-inverse via the QR factorisation
At = (AX/R)*(Q'*YA);

norm(At-A,'fro')/norm(A,'fro')
ans = 2.8138e-15
Both algorithms give an excellent low-rank approximation to A.

15 Randomized least-squares: Blendenpik


Our final topic is a randomised algorithm for least-squares problems that are highly overdetermined.
[Avron-Maymounkov-Toledo 2010 [1]]
$$ \min_x \|Ax - b\|_2, \qquad A \in \mathbb{R}^{m\times n},\ m \gg n. $$

ˆ Traditional method: normal eqn x = (AT A)−1 AT b or A = QR, x = R−1 (QT b), both
require O(mn2 ) cost.

ˆ Randomized: generate a random G ∈ R4n×m , compute GA = Q̂R̂ (QR factorisation), then solve miny k(AR̂−1 )y − bk2 via its normal equation with a Krylov method.

– O(mn log m + n3 ) cost using fast FFT-type transforms22 for G.
– Crucially, AR̂−1 is well-conditioned. Why? Marchenko-Pastur (next)

15.1 Explaining Blendenpik via Marchenko-Pastur


Let us prove that κ2 (AR̂−1 ) = O(1) with high probability. A key result, once again, is M-P.

Claim: AR̂−1 is well-conditioned, where GA = Q̂R̂ is the QR factorisation.

Let’s prove this for G ∈ R4n×m Gaussian:

Proof: Let A = QR. Then GA = (GQ)R =: G̃R

ˆ G̃ is 4n × n rectangular Gaussian, hence well-conditioned.

ˆ So by M-P, κ2 (R̃−1 ) = O(1) where G̃ = Q̃R̃ is the QR factorisation.


ˆ Thus G̃R = (Q̃R̃)R = Q̃(R̃R) = Q̃R̂, so R̂−1 = R−1 R̃−1 .
ˆ Hence AR̂−1 = QR̃−1 , so κ2 (AR̂−1 ) = κ2 (R̃−1 ) = O(1).

15.2 Blendenpik: solving minx kAx − bk2 using R̂


We have κ2 (AR̂−1 ) =: κ2 (B) = O(1); defining R̂x = y, minx kAx − bk2 = miny k(AR̂−1 )y −
bk2 = miny kBy − bk2 .
ˆ B is well-conditioned ⇒ in the normal equation
B T By = B T b, (11)
B T B is also well-conditioned, κ2 (B T B) = O(1); so it is positive definite and well-conditioned.

ˆ Thus we can solve (11) via CG (or LSQR [27], a more stable variant in this context; nonexaminable)
– exponential convergence, O(1) iterations! (or O(log(1/ε)) iterations for ε accuracy)
– each iteration requires w ← Bw and w ← B T w, consisting of w ← R̂−1 w (an n × n triangular solve) and w ← Aw (an m × n matrix-vector multiplication); O(mn) cost overall
22
The FFT (fast Fourier transform) is one of the important topics that we can’t treat properly—for now, just think of it as a matrix-vector multiplication that can be performed in O(n log n) flops rather than O(n2 ).
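Returning to the algorithm: here is a minimal MATLAB sketch of this approach, with a Gaussian sketch in place of an FFT-type transform and LSQR for the preconditioned solve (the sizes and the ill-conditioned test matrix are illustrative choices):

m = 10000; n = 100;
A = randn(m,n)*diag(logspace(0,8,n));      % ill-conditioned test matrix
b = randn(m,1);
G = randn(4*n,m);                          % Gaussian sketch (FFT-type transforms are faster)
[~,Rhat] = qr(G*A,0);                      % R-factor of the sketched matrix
Bfun = @(w,tflag) Bapply(w,tflag,A,Rhat);  % B = A*inv(Rhat), well-conditioned
y = lsqr(Bfun,b,1e-10,100);                % Krylov solve of min_y ||B*y - b||_2
x = Rhat\y;                                % recover x from y = Rhat*x
norm(A*x-b)/norm(b)

function w = Bapply(w,tflag,A,Rhat)
% apply B = A*inv(Rhat) or its transpose, as required by lsqr
if strcmp(tflag,'notransp'), w = A*(Rhat\w);
else, w = Rhat'\(A'*w); end
end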

15.3 Blendenpik experiments
Let’s illustrate our findings. Since Blendenpik finds a preconditioner such that AR−1 is
well-conditioned (regardless of κ2 (A)), we expect the convergence of CG to be independent
of κ2 (A). This is indeed what we see here:

[Figure: relative residual vs. CG iterations; see the caption below.]

Figure 3: CG for AT Ax = AT b vs. Blendenpik (AR−1 )T (AR−1 )x = (AR−1 )T b, with m = 10000, n = 100.

In practice, Blendenpik gets a ≈ 5× speedup over the classical (Householder-QR based) method when m ≫ n.

15.4 Sketch and solve for minx kAx − bk2


In closing, let us describe a related, simpler algorithm. To solve a least-squares problem
$$ \operatorname{minimize}_x\ \|Ax - b\|_2, \qquad (12) $$
where A ∈ Cn×r , n ≫ r, one can sketch and solve the problem [35]: draw a random matrix G ∈ Cr̃×n where r̃ = O(r) ≪ n, and solve the sketched problem
$$ \operatorname{minimize}_x\ \|G(Ax - b)\|_2. \qquad (13) $$

In some cases, the solution of this problem is already a good enough approximation to the
original problem.
We’ve taken G to be Gaussian above. For a variety of choices of G, with high probability
the solutions for (13) and (12) can be shown to be similar in that they have a comparable
residual kAx − bk2 .
Let’s understand why the solution of (13) should be good, again using Marchenko-Pastur. Let [A b] = QR ∈ Cn×(r+1) be a thin QR factorization, and suppose that the sketch G ∈ Cr̃×n is Gaussian. Then GQ is rectangular Gaussian (again using the rotational invariance of Gaussian random matrices), so well-conditioned. Suppose without loss of generality that G is scaled so that 1 − δ ≤ σi (GQ) ≤ 1 + δ for some δ < 1 (note that here δ isn’t an O(u) quantity; δ = 0.5, say, is a typical value). Therefore, since [A b]ṽ = Qv for some v, for any ṽ ∈ Cr+1 we have (1 − δ)k[A b]ṽk2 ≤ kG[A b]ṽk2 ≤ (1 + δ)k[A b]ṽk2 . Taking ṽ = [x; −1], it follows that for any vector x ∈ Cr we have

(1 − δ)kAx − bk2 ≤ kG(Ax − b)k2 ≤ (1 + δ)kAx − bk2 .

Consequently, the minimizer xs of kG(Ax − b)k2 for (13) also minimizes kAx − bk2 for (12)
1+δ
up to a factor 1−δ . If the residual of the original problem can be made small, say kAx −
−10
bk2 /kbk2 = 10 , then kAxG − bk2 /kbk2 ≤ 1+δ 1−δ
× 10−10 , which with a modest and typical
1 −10
value δ = 2 gives 3 × 10 , giving an excellent least-squares fit. If A is well-conditioned,
this also impliees the solution x is close to the exact solution.
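A minimal MATLAB sketch-and-solve example with a Gaussian G (the sizes are illustrative choices; note that with a Gaussian sketch forming GA is not actually cheaper than solving the original problem—FFT-type sketches are used to make this step fast):

n = 20000; r = 50;
A = randn(n,r); b = randn(n,1);
x  = A\b;                         % exact least-squares solution, O(nr^2)
G  = randn(4*r,n);                % Gaussian sketch with rtilde = 4r rows
xs = (G*A)\(G*b);                 % sketched problem: (4r) x r least-squares
[norm(A*x-b), norm(A*xs-b)]       % the two residuals are comparable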

15.5 Randomized algorithm for Ax = b, Ax = λx?


We have seen that randomization can be very powerful for low-rank approximation and
least-squares problems. What about the two core problems in NLA, linear systems Ax = b
and eigenvalue problems Ax = λx? A recent preprint [25] proposes the use of sketching (as
described above) for GMRES (for Ax = b) and Rayleigh-Ritz (for eigenproblems), combined with generating a basis for the Krylov subspace that is not orthonormal. This is certainly not the end of the story.
Randomized algorithms promise to be employed in more and more applications and problems. We will almost surely see more breakthroughs in randomized NLA. Who will the inventors be? Would the young and fresh brains like to take a shot?

16 Conclusion and discussion


We have tried to give a good overview of the field of numerical linear algebra in this course.
However a number of topics have been omitted completely due to lack of time. Here is an
incomplete list of other topics.

16.1 Important (N)LA topics not treated


These are clearly nonexaminable but definitely worth knowing if you want to get seriously into the field. They will be discussed in lectures if and only if time permits, and in any case only superficially.
ˆ tensors [Kolda-Bader 2009 [21]]

ˆ FFT (values↔coefficients map for polynomials) [e.g. Golub and Van Loan 2012 [14]]

ˆ sparse direct solvers [Duff, Erisman, Reid 2017 [10]]

ˆ multigrid [e.g. Elman-Silvester-Wathen 2014 [11]]

ˆ fast (Strassen-type) matrix multiplication etc [Strassen 1969 [31]+many follow-ups]

ˆ functions of matrices [Higham 2008 [18]]

ˆ generalised, polynomial/nonlinear eigenvalue problems [Guttel-Tisseur 2017 [15]]

ˆ perturbation theory (Davis-Kahan [7] etc) [Stewart-Sun 1990 [30]]

ˆ compressed sensing (this deals with Ax = b where A is very ’fat’) [Foucart-Rauhut


2013 [12]]

ˆ model order reduction [Benner-Gugercin-Willcox 2015 [3]]

ˆ communication-avoiding algorithms [e.g. Ballard-Demmel-Holtz-Schwartz 2011 [2]]

16.2 Course summary


This information is provided for MSc students who take two separate exams based on the 1st and 2nd halves.
1st half

ˆ SVD and its properties (Courant-Fischer etc), applications (low-rank)

ˆ Direct methods (LU) for linear systems and least-squares problems (QR)

ˆ Stability of algorithms

2nd half

ˆ Direct method (QR algorithm) for eigenvalue problems, SVD

ˆ Krylov subspace methods for linear systems (GMRES, CG) and eigenvalue problems
(Arnoldi, Lanczos)

ˆ Randomized algorithms for SVD and least-squares

16.3 Related courses you can take


Courses with significant intersection with NLA include

ˆ C6.3 Approximation of Functions: Chebyshev polynomials/approximation theory

ˆ C7.7 Random Matrix Theory: for theoretical underpinnings of Randomized NLA

ˆ C6.4 Finite Element Method for PDEs: NLA arising in solutions of PDEs

ˆ C6.2 Continuous Optimisation: NLA in optimisation problems

and many more: differential equations, data science, optimisation, machine learning,... NLA
is everywhere in computational mathematics.

Thank you for your interest in NLA!

References
[1] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK’s
least-squares solver. SIAM J. Sci. Comp., 32(3):1217–1236, 2010.

[2] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. Minimizing communication in


numerical linear algebra. SIAM J. Matrix Anal. Appl., 32(3):866–901, 2011.

[3] P. Benner, S. Gugercin, and K. Willcox. A survey of projection-based model reduction


methods for parametric dynamical systems. SIAM Rev., 57(4):483–531, 2015.

[4] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2009.

[5] K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm. Part II: Aggressive
early deflation. SIAM J. Matrix Anal. Appl., 23:948–973, 2002.

[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces, 1:317–366, 2001.

[7] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM
J. Numer. Anal., 7(1):1–46, 1970.

[8] J. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, USA, 1997.

[9] J. Dongarra and F. Sullivan. Guest editors’ introduction: The top 10 algorithms. IEEE
Computer Architecture Letters, 2(01):22–23, 2000.

[10] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Oxford
University Press, 2017.

[11] H. C. Elman, D. J. Silvester, and A. J. Wathen. Finite elements and fast iterative
solvers: with applications in incompressible fluid dynamics. Oxford University Press,
USA, 2014.

[12] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing.


Springer, 2013.

[13] D. F. Gleich. Pagerank beyond the web. SIAM Rev., 57(3):321–363, 2015.

[14] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University
Press, 4th edition, 2012.

[15] S. Güttel and F. Tisseur. The nonlinear eigenvalue problem. Acta Numer., 26:1–94,
2017.

[16] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decompositions. SIAM
Rev., 53(2):217–288, 2011.

[17] N. J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM, Philadelphia,


PA, USA, second edition, 2002.

[18] N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, Philadelphia,


PA, USA, 2008.

[19] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge University Press,
1991.

[20] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, second
edition, 2012.

[21] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev.,
51(3):455–500, 2009.

[22] P.-G. Martinsson and J. A. Tropp. Randomized numerical linear algebra: Foundations
and algorithms. Acta Numer., pages 403–572, 2020.

[23] M. F. Murphy, G. H. Golub, and A. J. Wathen. A note on preconditioning for indefinite


linear systems. SIAM J. Sci. Comp., 21(6):1969–1972, 2000.

[24] Y. Nakatsukasa. Fast and stable randomized low-rank matrix approximation.


arXiv:2009.11392.

[25] Y. Nakatsukasa and J. A. Tropp. Fast & accurate randomized algorithms for linear
systems and eigenvalue problems. arXiv 2111.00113.

[26] C. C. Paige. Error analysis of the Lanczos algorithm for tridiagonalizing a symmetric
matrix. IMA J. Appl. Math., 18(3):341–349, 1976.

[27] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse linear equations and
sparse least squares. ACM Trans. Math. Soft., 8(1):43–71, 1982.

[28] B. N. Parlett. The Symmetric Eigenvalue Problem. SIAM, Philadelphia, 1998.

[29] Y. Saad and M. H. Schultz. GMRES - A generalized minimal residual algorithm for
solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7(3):856–869, 1986.

[30] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory (Computer Science and
Scientific Computing). Academic Press, 1990.

[31] V. Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.

[32] D. B. Szyld. The many proofs of an identity on the norm of oblique projections. Nu-
merical Algorithms, 42(3-4):309–323, 2006.

[33] L. N. Trefethen and D. Bau. Numerical Linear Algebra. SIAM, Philadelphia, 1997.

[34] M. Udell and A. Townsend. Why are big data matrices approximately low rank? SIAM
Journal on Mathematics of Data Science, 1(1):144–160, 2019.

[35] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.

