1 - Types of Matrices
Welcome to numerical linear algebra (NLA)! NLA is a beautiful subject that combines
mathematical rigor, amazing algorithms, and an extremely rich variety of applications.
What is NLA? In a sentence, it is a subject that deals with the numerical solution (i.e.,
using a computer) of linear systems Ax = b (given A ∈ ℝ^{n×n} (a real n × n matrix) and
b ∈ ℝ^n (a real n-vector), find x ∈ ℝ^n) and eigenvalue problems Ax = λx (given A ∈ ℝ^{n×n},
find λ ∈ ℂ and x ∈ ℂ^n), for problems that are too large to solve by hand (n ≥ 4 is already
large; we aim for n in the thousands or even millions). This can rightfully sound dull, and
some mathematicians (those purely oriented?) tend to get turned off after hearing this:
how could such a course be interesting compared with other courses offered by the Oxford
Mathematical Institute? I hope and firmly believe that by the end of the course you will all
agree that there is more to the subject than you imagined. The rapid rise of data science and
machine learning has only meant that the importance of NLA keeps growing, with a vast
number of problems in these fields requiring NLA techniques and algorithms. It is perhaps
worth noting also that these fields have had an enormous impact on the direction of NLA;
in particular, the recent and very active field of randomised algorithms was born in light of
needs arising from these extremely active fields.
In fact NLA is a truly exciting field that utilises a huge number of ideas from different
branches of mathematics (e.g. matrix analysis, approximation theory, and probability) to
solve problems that actually matter in real-world applications. Having said that, the
prerequisites for taking the course are minimal; essentially, a basic understanding
of the fundamentals of linear algebra suffices (and the first lecture will briefly review
the basic facts). If you've taken the Part A Numerical Analysis course you will find it helpful,
but again, this is not necessary.
The field of NLA has been blessed with many excellent books on the subject. These notes
will try to be self-contained, but these references will definitely help. There is a lot to learn;
literally as much as you want to.
• Horn and Johnson ('12) [20]: Matrix Analysis (and Topics in Matrix Analysis ('86) [19])
  – impressive content
• H. C. Elman, D. J. Silvester, A. J. Wathen ('14) [11]: Finite Elements and Fast Iterative
  Solvers
This course covers the fundamentals of NLA. We first discuss the singular value decomposition
(SVD), which is a fundamental matrix decomposition whose importance is only
growing. We then turn to linear systems and eigenvalue problems. Broadly, we will cover
the material in this order: Lectures 1–4 cover the fundamentals of matrix theory, in particular
the SVD, its properties and applications.
This document consists of 16 sections. Very roughly speaking, one section corresponds
to one lecture (though this will not be followed strictly at all).
Contents
0 Introduction, why Ax = b and Ax = λx?
1 Basic LA review
  1.1 Warmup exercise
  1.2 Structured matrices
  1.3 Matrix eigenvalues: basics
  1.4 Computational complexity (operation counts) of matrix algorithms
  1.5 Vector norms
  1.6 Matrix norms
  1.7 Subspaces and orthonormal matrices
2 SVD: the most important matrix decomposition
  2.1 (Some of the many) applications and consequences of the SVD: rank, column/row space, etc
  2.2 SVD and symmetric eigenvalue decomposition
  2.3 Uniqueness etc
5 Linear systems Ax = b
  5.1 Solving Ax = b via LU
  5.2 Pivoting
  5.3 Cholesky factorisation for A ≻ 0
7 Numerical stability
  7.1 Floating-point arithmetic
  7.2 Conditioning and stability
  7.3 Numerical stability; backward stability
  7.4 Matrix condition number
    7.4.1 Backward stable + well conditioned = accurate solution
  7.5 Stability of triangular systems
    7.5.1 Backward stability of triangular systems
    7.5.2 (In)stability of Ax = b via LU with pivots
    7.5.3 Backward stability of Cholesky for A ≻ 0
  7.6 Matrix multiplication is not backward stable
  7.7 Stability of Householder QR
    7.7.1 (In)stability of Gram-Schmidt
8 Eigenvalue problems
  8.1 Schur decomposition
  8.2 The power method for finding the dominant eigenpair Ax = λx
    8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank
    8.2.2 Shifted inverse power method
9 The QR algorithm
  9.1 QR algorithm for Ax = λx
  9.2 QR algorithm preprocessing: reduction to Hessenberg form
    9.2.1 The (shifted) QR algorithm in action
    9.2.2 (Optional) QR algorithm: other improvement techniques
10 QR algorithm continued
  10.1 QR algorithm for symmetric A
  10.2 Computing the SVD: Golub-Kahan's bidiagonalisation algorithm
  10.3 (Optional but important) QZ algorithm for generalised eigenvalue problems
  10.4 (Optional) Tractable eigenvalue problems
14 Randomised algorithms in NLA
  14.1 Gaussian matrices
    14.1.1 Orthogonal invariance
    14.1.2 Marchenko-Pastur: Rectangular random matrices are well conditioned
  14.2 Randomised least-squares
  14.3 "Fast" algorithm: row subset selection
  14.4 Sketch-and-solve for least-squares problems
  14.5 Sketch-to-precondition: Blendenpik
    14.5.1 Explaining κ_2(AR̂^{-1}) = O(1) via Marchenko-Pastur
    14.5.2 Blendenpik: solving min_x ‖Ax − b‖_2 using R̂
    14.5.3 Blendenpik experiments
Notation. For convenience below we list the notation that we use throughout the course.
• σ(A): the set of singular values of A. σ_i(A) always denotes the ith largest singular
value. We often just write σ_i.
• We use capital letters for matrices and lower-case letters for vectors and scalars. Unless
otherwise specified, A is a given matrix, b is a given vector, and x is an unknown vector.
• ‖·‖ denotes a norm of a vector or matrix. ‖·‖_2 denotes the spectral (or 2-) norm and
‖·‖_F the Frobenius norm. For vectors, to simplify notation we sometimes use ‖·‖ for
the 2-norm (which for vectors is the familiar Euclidean norm).
• Span(A) denotes the span (range) of the columns of A, that is, the subspace consisting
of vectors of the form Ax.
• We reserve Q for an orthonormal (or orthogonal) matrix. L (U) is often a lower (upper)
triangular matrix.
• I always denotes the identity matrix; I_n is the n × n identity when the size needs to
be specified.
• A^T is the transpose of A: (A^T)_{ij} = A_{ji}. A^* is the (complex) conjugate transpose:
(A^*)_{ij} is the complex conjugate of A_{ji}.
• We sometimes use the following shorthand: alg for algorithm, eigval for eigenvalue, eigvec
for eigenvector, singval for singular value, singvec for singular vector, and iff for "if and
only if".
1. Linear system
Ax = b.
Given a matrix A ∈ ℝ^{m×n} (often square, m = n, but we will discuss m > n extensively
and m < n briefly at the end) and a vector b ∈ ℝ^m, find x ∈ ℝ^n such that Ax = b.
2. Eigenvalue problem
Ax = λx.
Given a (always!¹) square matrix A ∈ ℝ^{n×n}, find the eigenvalues (eigvals) λ ∈ ℂ and
eigenvectors (eigvecs) x ∈ ℂ^n.
We’ll see many variants of these problems; one worthy of particular mention is the SVD,
which is related to eigenvalue problems but given its ubiquity has a life of its own. (So if
there’s a third problem we solve in NLA, it would definitely be the SVD.)
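As a concrete illustration, here is a minimal Python/NumPy sketch of the two problems (the random A and b are just placeholders, not anything prescribed by the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Linear system: find x such that Ax = b.
x = np.linalg.solve(A, b)
print(np.linalg.norm(A @ x - b))        # residual is at roundoff level

# Eigenvalue problem: find lambda and x such that Ax = lambda*x.
lam, V = np.linalg.eig(A)               # lam may be complex even though A is real
print(np.linalg.norm(A @ V[:, 0] - lam[0] * V[:, 0]))
```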
It is worth discussing why we care about linear systems and eigenvalue problems.
The primary reason is that many (in fact most) problems in scientific computing (and
even machine learning) boil down to linear problems:
• Because that's often the only way to deal with the scale of problems we face today
(and in the future)!
¹There are exciting recent developments involving eigenvalue problems for rectangular matrices, but these
are outside the scope of this course.
• For linear problems, so much is understood and reliable algorithms are available².
A related and important question is where and how these problems arise in real-world
applications.
Let us mention a specific context that is relevant in data science: optimisation. Suppose
one is interested in minimising a high-dimensional real-valued function f: ℝ^n → ℝ, where
n ≫ 1.
A successful approach is to try to find critical points, that is, points x_* where ∇f(x_*) =
0. Mathematically, this is a nonlinear, high-dimensional root-finding problem: find
x ∈ ℝ^n such that ∇f(x) =: F(x) = 0 (the zero vector in ℝ^n), where F: ℝ^n → ℝ^n. One of
the most commonly employed methods for this task is Newton's method (which some of you
have seen in Prelims Constructive Mathematics). This boils down to repeatedly solving
linear systems: at each iterate x_k one solves ∇F(x_k) d_k = −F(x_k) (note that ∇F = ∇²f is the
Jacobian of F, i.e., the Hessian of f) and updates x_{k+1} = x_k + d_k, so every Newton step
is a linear system Ax = b.
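To make that concrete, here is a minimal NumPy sketch of Newton's method in which each iteration is exactly one linear solve (the toy objective is chosen purely for illustration and is not from the course):

```python
import numpy as np

def newton(F, J, x0, tol=1e-10, maxit=50):
    """Newton's method for F(x) = 0: each step solves the linear system J(x_k) d = -F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        d = np.linalg.solve(J(x), -F(x))   # the Ax = b solve at the heart of each step
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x

# Toy objective: f(x) = sum(x_i^4) + ||x||^2/2, so
# F(x) = grad f(x) = 4x^3 + x and J(x) = Hess f(x) = diag(12x^2 + 1).
F = lambda x: 4 * x**3 + x
J = lambda x: np.diag(12 * x**2 + 1)
print(newton(F, J, np.ones(4)))            # converges to the critical point x = 0
```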
1 Basic LA review
We start with a review of key LA facts that will be used in the course. Some will be trivial
to you, while others may not be. You might also notice that some facts you learned in
a core LA course will not be used here; for example, we will never deal with finite
fields, and determinants play only a passing role.
²A pertinent quote is Richard Feynman's "Linear systems are important because we can solve them".
Because we can solve them, we do all sorts of tricks to reduce difficult problems to linear systems!
1.1 Warmup exercise
Let A ∈ ℝ^{n×n} (an n × n square matrix; or ℂ^{n×n}, the difference hardly matters in most of
this course³). Try to think of statements that are equivalent to A being nonsingular. Try to
come up with as many conditions as possible before turning the page.
³While there are a small number of cases where the distinction between real and complex matrices matters,
in the majority of cases it does not, and the argument carries over to complex matrices by replacing ·^T with
·^*. Therefore, for the most part we lose no generality in assuming the matrix is real (which slightly simplifies
our mindset). Whenever necessary, we will highlight the subtleties that arise from the difference
between real and complex. (For the curious, these are the Schur form/decomposition, the LDL^T factorisation,
and the eigenvalue decomposition for (real) matrices with complex eigenvalues.)
Here is a list: the following are equivalent.
1. A is nonsingular.
2. A is invertible: A^{-1} exists.
6. rank(A) = n.
13. det(A) ≠ 0.
14. An n × n matrix A^{-1} exists such that A^{-1} A = I_n. (This, by the way, implies (iff) AA^{-1} = I_n,
a nontrivial fact.)
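As a quick numerical sanity check of several of these conditions, here is a small NumPy sketch on a random (and hence, with probability 1, nonsingular) matrix. Keep in mind that in floating-point arithmetic the determinant and rank are not reliable detectors of near-singularity; the condition number, discussed later in the course, is the right tool for that.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
A = rng.standard_normal((n, n))          # a random matrix is nonsingular with probability 1

print(np.linalg.matrix_rank(A) == n)     # rank(A) = n
print(np.linalg.det(A) != 0)             # det(A) != 0 (in exact arithmetic)
Ainv = np.linalg.inv(A)                  # A^{-1} exists
print(np.allclose(Ainv @ A, np.eye(n)))  # A^{-1} A = I_n ...
print(np.allclose(A @ Ainv, np.eye(n)))  # ... and hence A A^{-1} = I_n
```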
• Normal: A^T A = AA^T. (Here it's better to discuss the complex case A^*A = AA^*: this
is a necessary and sufficient condition for diagonalisability under a unitary transformation,
i.e., A = UΛU^* where Λ is diagonal and U is unitary.)
• Tridiagonal: A_{ij} = 0 if |i − j| > 1.
• Upper triangular: A_{ij} = 0 if i > j.
• Lower triangular: A_{ij} = 0 if i < j.
For (possibly nonsquare) matrices A ∈ ℂ^{m×n} (usually m ≥ n):
• (Upper) Hessenberg: A_{ij} = 0 if i > j + 1. (We will see this structure often.)
• "Orthonormal": A^T A = I_n, and A is (tall) rectangular. (This isn't an established
name; we could say "matrix with orthonormal columns" every time it appears,
but we use these matrices all the time in this course, so we need a consistent shorthand
name for them.)
• Sparse: most elements are zero; nnz(A) denotes the number of nonzero elements of A.
Matrices that are not sparse are called dense.
• Other structures: Hankel, Toeplitz, circulant, symplectic, ... (we won't use these in this
course)
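For concreteness, here is a small NumPy sketch (illustrative only) that builds matrices with some of these structures and checks the "orthonormal" property:

```python
import numpy as np

A = np.arange(1.0, 26.0).reshape(5, 5)

U = np.triu(A)                     # upper triangular: A_ij = 0 for i > j
L = np.tril(A)                     # lower triangular: A_ij = 0 for i < j
T = np.triu(np.tril(A, 1), -1)     # tridiagonal: A_ij = 0 for |i - j| > 1
H = np.triu(A, -1)                 # upper Hessenberg: A_ij = 0 for i > j + 1

# A tall "orthonormal" matrix (orthonormal columns), e.g. the Q factor of a thin QR
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((6, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))   # Q^T Q = I_3
```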
According to Galois theory, eigenvalues cannot be computed exactly for matrices with
n ≥ 5: the eigenvalues are the roots of the characteristic polynomial, and no general
closed-form expression in radicals exists for the roots of polynomials of degree five or
higher. But we still want to compute them! In this course we will (among other
things) explain how this is done in practice by the QR algorithm, one of the greatest
hits of the field.
1.4 Computational complexity (operation counts) of matrix algorithms
Since NLA is a field that aspires to develop practical algorithms for solving matrix problems,
it is important to be aware of the computational cost (often referred to as complexity) of
the algorithms. We will discuss these as the algorithms are developed, but for now let’s
examine the costs for basic matrix-matrix multiplication. The cost is measured in terms
of flops (floating-point operations), which counts the number of additions, subtractions,
multiplications, and divisions (all treated equally) performed.
In NLA the constant in front of the leading term in the cost is (clearly) important. It
is customary (for good reason) to track only the leading term of the cost. For example,
n^3 + 10n^2 is abbreviated to n^3.
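As a standard concrete example (a sketch, not an excerpt from the course): multiplying an m × n matrix by an n × p matrix with the naive triple loop costs about 2mnp flops, since each of the mp output entries needs n multiplications and n − 1 additions; for square n × n matrices this is about 2n^3 flops.

```python
import numpy as np

def matmul_flops(m, n, p):
    """Exact flop count of the naive triple-loop product of (m x n) and (n x p):
    each of the m*p entries needs n multiplications and n - 1 additions, ~2mnp in total."""
    return m * p * (2 * n - 1)

def naive_matmul(A, B):
    m, n = A.shape
    n2, p = B.shape
    assert n == n2
    C = np.zeros((m, p))
    for i in range(m):          # the triple loop behind the ~2mnp flop count
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A, B = np.ones((3, 4)), np.ones((4, 5))
print(np.allclose(naive_matmul(A, B), A @ B), matmul_flops(3, 4, 5))
```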
Norms
We will need a tool (or metric) to measure how big a vector or matrix is. Norms give us a
means to achieve this. Surely you have already seen some norms (e.g. the vector Euclidean
norm). We will discuss a number of norms for vectors and matrices that we will use in the
upcoming lectures.
Of particular importance are the three cases p = 1, 2, ∞ of the vector p-norm
‖x‖_p = (∑_{i=1}^n |x_i|^p)^{1/p}, with ‖x‖_∞ = max_i |x_i|. In this course, we will see p = 2
the most often.
A norm needs to satisfy the following axioms: ‖x‖ ≥ 0, with ‖x‖ = 0 if and only if x = 0
(positivity); ‖αx‖ = |α| ‖x‖ for any scalar α (homogeneity); and ‖x + y‖ ≤ ‖x‖ + ‖y‖ (the
triangle inequality).
The vector p-norm satisfies all these, for any p ≥ 1.
Here are some useful inequalities for vector norms. The proofs are left as a (highly
recommended) exercise; try to think about when each inequality holds with equality. For x ∈ ℂ^n,
(1/√n) ‖x‖_2 ≤ ‖x‖_∞ ≤ ‖x‖_2,
(1/n) ‖x‖_1 ≤ ‖x‖_∞ ≤ ‖x‖_1.
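A quick numerical check of these inequalities (a sketch only; one random vector of course proves nothing):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.standard_normal(n)
one, two, inf = (np.linalg.norm(x, p) for p in (1, 2, np.inf))

print(two / np.sqrt(n) <= inf <= two)   # (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2
print(one / n <= inf <= one)            # (1/n)||x||_1 <= ||x||_inf <= ||x||_1
```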
Note that with the 2-norm, ‖Ux‖_2 = ‖x‖_2 for any unitary U and any x ∈ ℂ^n. Norms
with this property are called unitarily invariant.
The 2-norm is also induced by the inner product: ‖x‖_2 = √(x^T x). An important property
of inner products is the Cauchy-Schwarz inequality |x^T y| ≤ ‖x‖_2 ‖y‖_2 (which can be proved
directly, but is perhaps best proved in a general setting)⁴. When we just write ‖x‖ for a vector
we mean the 2-norm.
Norm axioms hold for each of these. Useful inequalities include the following (exercise;
it is instructive to study when each of these holds with equality): for A ∈ ℂ^{m×n},
(1/√n) ‖A‖_∞ ≤ ‖A‖_2 ≤ √m ‖A‖_∞,
⁴Just in case, here's a proof: for any scalar c, ‖x − cy‖² = ‖x‖² − 2c x^T y + c²‖y‖². This is minimised with
respect to c at c = x^T y/‖y‖², where the minimum value is ‖x‖² − (x^T y)²/‖y‖². Since this must be ≥ 0, the
Cauchy-Schwarz inequality follows.
(1/√m) ‖A‖_1 ≤ ‖A‖_2 ≤ √n ‖A‖_1,
‖A‖_2 ≤ ‖A‖_F ≤ √(min(m, n)) ‖A‖_2.
A useful property of p-norms is that they are subordinate, i.e., ‖AB‖_p ≤ ‖A‖_p ‖B‖_p
(problem sheet). Note that not all norms satisfy this: e.g. for the max norm ‖A‖_max =
max_{i,j} |A_{ij}|, taking A = [1, 1] and B = [1, 1]^T one has ‖AB‖_max = 2 but ‖A‖_max = ‖B‖_max = 1.
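Again, a quick numerical check (illustrative only) of the matrix norm inequalities above, together with the max-norm counterexample:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 7, 4
A = rng.standard_normal((m, n))
one, two, inf, fro = (np.linalg.norm(A, p) for p in (1, 2, np.inf, 'fro'))

print(inf / np.sqrt(n) <= two <= np.sqrt(m) * inf)   # (1/sqrt(n))||A||_inf <= ||A||_2 <= sqrt(m)||A||_inf
print(one / np.sqrt(m) <= two <= np.sqrt(n) * one)   # (1/sqrt(m))||A||_1  <= ||A||_2 <= sqrt(n)||A||_1
print(two <= fro <= np.sqrt(min(m, n)) * two)        # ||A||_2 <= ||A||_F <= sqrt(min(m,n))||A||_2

# The max norm is not submultiplicative: with A = [1, 1] and B = [1, 1]^T,
# ||AB||_max = 2 while ||A||_max * ||B||_max = 1.
A1, B1 = np.array([[1.0, 1.0]]), np.array([[1.0], [1.0]])
print(np.abs(A1 @ B1).max(), np.abs(A1).max() * np.abs(B1).max())
```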
Lemma 1.1 Let V_1 ∈ ℝ^{n×d_1} and V_2 ∈ ℝ^{n×d_2} each have linearly independent column vectors.
If d_1 + d_2 > n, then there is a nonzero intersection between the two subspaces S_1 = span(V_1) and
S_2 = span(V_2), that is, there is a nonzero vector x ∈ ℝ^n such that x = V_1 c_1 = V_2 c_2 for some
vectors c_1, c_2.
Proof: Consider the matrix M := [V_1, V_2], which is of size n × (d_1 + d_2). Since d_1 + d_2 > n
by assumption, this matrix has a right null vector⁵ c ≠ 0 such that Mc = 0. Splitting
c = [c_1^T, −c_2^T]^T, we have V_1 c_1 = V_2 c_2 =: x. Moreover x ≠ 0: if x = 0 then the linear
independence of the columns of V_1 and of V_2 forces c_1 = 0 and c_2 = 0, contradicting c ≠ 0. □
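A small numerical illustration of the lemma (a sketch using SciPy's null_space; the random V1, V2 are placeholders):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, d1, d2 = 5, 3, 3                       # d1 + d2 = 6 > n = 5
V1 = rng.standard_normal((n, d1))
V2 = rng.standard_normal((n, d2))

M = np.hstack([V1, V2])                   # n x (d1 + d2): must have a nontrivial null space
c = null_space(M)[:, 0]                   # c != 0 with M c = 0
c1, c2 = c[:d1], -c[d1:]                  # V1 c1 + V2 (-c2) = 0, i.e. V1 c1 = V2 c2
x = V1 @ c1
print(np.linalg.norm(x - V2 @ c2))        # ~1e-15: x lies in both span(V1) and span(V2)
print(np.linalg.norm(x))                  # and x is nonzero
```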
Let us conclude this review with a list of useful facts. Proofs (or counterexamples) should
be straightforward.
⁵If this argument isn't convincing to you now, probably the easiest way to see this is via the SVD; so stay
tuned and we'll resolve this in footnote 9, once you've seen the SVD!
• (AB)^T = B^T A^T.
• If A, B are invertible, (AB)^{-1} = B^{-1} A^{-1}.
For a symmetric matrix A ∈ ℝ^{n×n} we have the symmetric eigenvalue decomposition
A = V Λ V^T,    (1)
where Λ = diag(λ_1, ..., λ_n) and V is orthogonal: the λ_i are the eigenvalues, and V is the
matrix of eigenvectors (its columns are the eigenvectors).
The decomposition (1) makes two remarkable claims: the eigenvectors can be taken to
be orthogonal (which is true more generally of normal matrices, i.e., those with A^*A = AA^*),
and the eigenvalues are real.
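A quick numerical illustration of both claims (a sketch; numpy.linalg.eigh is the standard routine for symmetric/Hermitian eigenproblems):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                               # a symmetric test matrix

lam, V = np.linalg.eigh(A)                      # eigh is for symmetric/Hermitian matrices
print(lam.dtype)                                # real eigenvalues (float, not complex)
print(np.allclose(V.T @ V, np.eye(5)))          # orthonormal eigenvectors: V^T V = I
print(np.allclose(V @ np.diag(lam) @ V.T, A))   # A = V Lambda V^T, i.e. (1)
```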