
C6.1 Numerical Linear Algebra

Yuji Nakatsukasa

Last update: May 3, 2023 (minor typos fixed). Please report any corrections or comments on these
lecture notes to [email protected]

Welcome to numerical linear algebra (NLA)! NLA is a beautiful subject that combines
mathematical rigor, amazing algorithms, and an extremely rich variety of applications.
What is NLA? In a sentence, it is a subject that deals with the numerical solution (i.e.,
using a computer) of linear systems $Ax = b$ (given $A \in \mathbb{R}^{n\times n}$ (i.e., a real $n\times n$ matrix) and
$b \in \mathbb{R}^n$ (real $n$-vector), find $x \in \mathbb{R}^n$) and eigenvalue problems $Ax = \lambda x$ (given $A \in \mathbb{R}^{n\times n}$,
find $\lambda \in \mathbb{C}$ and $x \in \mathbb{C}^n$), for problems that are too large to solve by hand ($n \geq 4$ is already
large; we aim for $n$ in the thousands or even millions). This can rightfully sound dull, and
some mathematicians (those purely oriented?) tend to get turned off after hearing this —
how could such a course be interesting compared with other courses offered by the Oxford
Mathematical Institute? I hope and firmly believe that at the end of the course you will all
agree that there is more to the subject than you imagined. The rapid rise of data science and
machine learning has only meant that the importance of NLA is still growing, with a vast
number of problems in these fields requiring NLA techniques and algorithms. It is perhaps
worth noting also that these fields have had an enormous impact on the direction of NLA; in
particular, the recent and very active field of randomised algorithms was born in light of needs
arising from these extremely active fields.
In fact NLA is a truly exciting field that utilises a huge number of ideas from different
branches of mathematics (e.g. matrix analysis, approximation theory, and probability) to
solve problems that actually matter in real-world applications. Having said that, the number
of prerequisites for taking the course is the bare minimum; essentially a basic understanding
of the fundamentals of linear algebra would suffice (and the first lecture will briefly review
the basic facts). If you've taken the Part A Numerical Analysis course you will find it helpful,
but again, this is not necessary.
The field of NLA has been blessed with many excellent books on the subject. These notes
will try to be self-contained, but these references will definitely help. There is a lot to learn;
literally as much as you want to.

Trefethen-Bau (97) [33]: Numerical Linear Algebra

– covers essentials, beautiful exposition

Golub-Van Loan (12) [14]: Matrix Computations


– classic, encyclopedic

Horn and Johnson (12) [20]: Matrix Analysis (& Topics in Matrix Analysis (86) [19])

– excellent theoretical treatise, little numerical treatment

J. Demmel (97) [8]: Applied Numerical Linear Algebra

– impressive content

N. J. Higham (02) [17]: Accuracy and Stability of Numerical Algorithms

– bible for stability, conditioning

H. C. Elman, D. J. Silvester, A. J. Wathen (14) [11]: Finite Elements and Fast Iterative
Solvers

– PDE applications of linear systems, Krylov methods and preconditioning

This course covers the fundamentals of NLA. We first discuss the singular value decom-
position (SVD), which is a fundamental matrix decomposition whose importance is only
growing. We then turn to linear systems and eigenvalue problems. Broadly, we will cover

Direct methods ($n \lesssim 10{,}000$): Sections 5–10 (except 8)

Iterative methods ($n \lesssim 1{,}000{,}000$, sometimes larger): Sections 11–13

Randomised methods ($n \gtrsim 1{,}000{,}000$): Sections 14–16

in this order. Lectures 1–4 cover the fundamentals of matrix theory, in particular the SVD,
its properties and applications.
This document consists of 16 sections. Very roughly speaking, one section corresponds
to one lecture (though this will not be followed strictly at all).

Contents
0 Introduction, why Ax = b and $Ax = \lambda x$? 6

1 Basic LA review 7
1.1 Warmup exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Structured matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Matrix eigenvalues: basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Computational complexity (operation counts) of matrix algorithms . . . . . 11
1.5 Vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Matrix norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7 Subspaces and orthonormal matrices . . . . . . . . . . . . . . . . . . . . . . 13

2 SVD: the most important matrix decomposition 14
2.1 (Some of the many) applications and consequences of the SVD: rank, col-
umn/row space, etc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 SVD and symmetric eigenvalue decomposition . . . . . . . . . . . . . . . . . 17
2.3 Uniqueness etc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Low-rank approximation via truncated SVD 18


3.1 Low-rank approximation: image compression . . . . . . . . . . . . . . . . . . 21

4 Courant-Fischer minmax theorem 22


4.1 Weyl’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.1 Eigenvalues of nonsymmetric matrices are sensitive to perturbation . 23
4.2 More applications of C-F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 (Taking stock) Matrix decompositions you should know . . . . . . . . . . . . 24

5 Linear systems Ax = b 24
5.1 Solving Ax = b via LU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.3 Cholesky factorisation for $A \succ 0$ . . . . . . . . . . . . . . . . . . . . . . . . . 27

6 QR factorisation and least-squares problems 28


6.1 QR via Gram-Schmidt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6.2 Towards a stable QR factorisation: Householder reflectors . . . . . . . . . . . 29
6.3 Householder QR factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6.4 Givens rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.5 Least-squares problems via QR . . . . . . . . . . . . . . . . . . . . . . . . . 32
6.6 QR-based algorithm for linear systems . . . . . . . . . . . . . . . . . . . . . 34
6.7 Solution of least-squares via normal equation . . . . . . . . . . . . . . . . . . 34
6.8 Application of least-squares: regression/function approximation . . . . . . . 35

7 Numerical stability 36
7.1 Floating-point arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
7.2 Conditioning and stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3 Numerical stability; backward stability . . . . . . . . . . . . . . . . . . . . . 38
7.4 Matrix condition number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.4.1 Backward stable+well conditioned=accurate solution . . . . . . . . . 39
7.5 Stability of triangular systems . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7.5.1 Backward stability of triangular systems . . . . . . . . . . . . . . . . 40
7.5.2 (In)stability of Ax = b via LU with pivots . . . . . . . . . . . . . . . 41
7.5.3 Backward stability of Cholesky for $A \succ 0$ . . . . . . . . . . . . . . . . 41
7.6 Matrix multiplication is not backward stable . . . . . . . . . . . . . . . . . . 42
7.7 Stability of Householder QR . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
7.7.1 (In)stability of Gram-Schmidt . . . . . . . . . . . . . . . . . . . . . . 44

8 Eigenvalue problems 44
8.1 Schur decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
8.2 The power method for finding the dominant eigenpair $Ax = \lambda x$ . . . . . . . 46
8.2.1 Digression (optional): Why compute eigenvalues? Google PageRank . 47
8.2.2 Shifted inverse power method . . . . . . . . . . . . . . . . . . . . . . 47

9 The QR algorithm 48
9.1 QR algorithm for $Ax = \lambda x$ . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
9.2 QR algorithm preprocessing: reduction to Hessenberg form . . . . . . . . . . 52
9.2.1 The (shifted) QR algorithm in action . . . . . . . . . . . . . . . . . . 53
9.2.2 (Optional) QR algorithm: other improvement techniques . . . . . . . 54

10 QR algorithm continued 55
10.1 QR algorithm for symmetric A . . . . . . . . . . . . . . . . . . . . . . . . . 55
10.2 Computing the SVD: Golub-Kahan’s bidiagonalisation algorithm . . . . . . . 56
10.3 (Optional but important) QZ algorithm for generalised eigenvalue problems . 56
10.4 (Optional) Tractable eigenvalue problems . . . . . . . . . . . . . . . . . . . . 57

11 Iterative methods: introduction 58


11.1 Polynomial approximation: basic idea of Krylov . . . . . . . . . . . . . . . . 59
11.2 Orthonormal basis for $K_k(A, b)$ . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.3 Arnoldi iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

12 Arnoldi and GMRES for Ax = b 61


12.1 GMRES convergence: polynomial approximation . . . . . . . . . . . . . . . . 62
12.2 When does GMRES converge fast? . . . . . . . . . . . . . . . . . . . . . . . 64
12.3 Preconditioning for GMRES . . . . . . . . . . . . . . . . . . . . . . . . . . 65
12.4 Restarted GMRES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

13 Symmetric case: Lanczos and Conjugate Gradient method for Ax = b,
   $A \succ 0$ 66
13.1 Lanczos iteration and Lanczos decomposition . . . . . . . . . . . . . . . . . 66
13.2 CG algorithm for Ax = b, $A \succ 0$ . . . . . . . . . . . . . . . . . . . . . . . . 67
13.3 CG convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
13.3.1 Chebyshev polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . 69
13.3.2 Properties of Chebyshev polynomials . . . . . . . . . . . . . . . . . . 71
13.4 MINRES: symmetric (indefinite) version of GMRES (nonexaminable) . . . . 71
13.4.1 MINRES convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
13.5 Preconditioned CG/MINRES . . . . . . . . . . . . . . . . . . . . . . . . . . 73
13.6 The Lanczos algorithm for symmetric eigenproblem (nonexaminable) . . . . 73
13.7 Arnoldi for nonsymmetric eigenvalue problems (nonexaminable) . . . . . . . 75

14 Randomised algorithms in NLA 75
14.1 Gaussian matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
14.1.1 Orthogonal invariance . . . . . . . . . . . . . . . . . . . . . . . . . . 76
14.1.2 Marchenko-Pastur: Rectangular random matrices are well conditioned 77
14.2 Randomised least-squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
14.3 “Fast” algorithm: row subset selection . . . . . . . . . . . . . . . . . . . . . 78
14.4 Sketch-and-solve for least-squares problems . . . . . . . . . . . . . . . . . . . 79
14.5 Sketch-to-precondition: Blendenpik . . . . . . . . . . . . . . . . . . . . . . . 81
14.5.1 Explaining $\kappa_2(A\hat{R}^{-1}) = O(1)$ via Marchenko-Pastur . . . . . . . . . . 81
14.5.2 Blendenpik: solving $\min_x \|Ax - b\|_2$ using $\hat{R}$ . . . . . . . . . . . . . . 82
14.5.3 Blendenpik experiments . . . . . . . . . . . . . . . . . . . . . . . . . 82

15 Randomised algorithms for low-rank approximation 83


15.1 Randomised SVD by Halko-Martinsson-Tropp . . . . . . . . . . . . . . . . . 83
15.2 HMT approximant: analysis (down from 70 pages!) . . . . . . . . . . . . . . 85
15.3 Precise analysis for HMT (nonexaminable) . . . . . . . . . . . . . . . . . . . 86
15.4 Generalised Nyström (nonexaminable) . . . . . . . . . . . . . . . . . . . . . 87
15.5 MATLAB code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
15.6 Randomised algorithm for Ax = b, $Ax = \lambda x$? . . . . . . . . . . . . . . . . . 89

16 Conclusion and discussion 89


16.1 Important (N)LA topics not treated . . . . . . . . . . . . . . . . . . . . . . . 89
16.2 Course summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
16.3 Related courses you can take . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Notation. For convenience below we list the notation that we use throughout the course.

$\lambda(A)$: the set of eigenvalues of A. If a natural ordering exists (e.g. A is symmetric so
$\lambda(A)$ is real), $\lambda_i(A)$ is the $i$th (largest) eigenvalue.

$\sigma(A)$: the set of singular values of A. $\sigma_i(A)$ always denotes the $i$th largest singular
value. We often just write $\sigma_i$.

diag(A): the vector of diagonal entries of A.

We use capital letters for matrices, lower-case for vectors and scalars. Unless otherwise
specified, A is a given matrix, b is a given vector, and x is an unknown vector.

$\|\cdot\|$ denotes a norm for a vector or matrix. $\|\cdot\|_2$ denotes the spectral (or 2-) norm,
$\|\cdot\|_F$ the Frobenius norm. For vectors, to simplify notation we sometimes use $\|\cdot\|$ for
the spectral norm (which for vectors is the familiar Euclidean norm).

Span(A) denotes the span, or range, of the columns of A (the column space). This is the
subspace consisting of vectors of the form Ax.

We reserve Q for an orthonormal (or orthogonal) matrix. L (U) is often lower (upper)
triangular.

I always denotes the identity matrix. $I_n$ is the $n\times n$ identity when the size needs to
be specified.

$A^T$ is the transpose of the matrix: $(A^T)_{ij} = A_{ji}$. $A^*$ is the (complex) conjugate
transpose: $(A^*)_{ij} = \bar{A}_{ji}$.

$\succ$, $\succeq$ denote the positive (semi)definite ordering. That is, $A \succ (\succeq)\, 0$ means A is
positive (semi)definite (abbreviated as PD, PSD), i.e., symmetric and with positive
(nonnegative) eigenvalues. $A \succ B$ means $A - B \succ 0$.

We sometimes use the following shorthand: alg for algorithm, eigval for eigenvalue, eigvec
for eigenvector, singval for singular value, singvec for singular vector, and iff for "if and only
if".

0 Introduction, why Ax = b and $Ax = \lambda x$?


As already stated, NLA is the study of numerical algorithms for problems involving matrices,
and there are only two main problems(!):

1. Linear system
$Ax = b$.
Given a (often square, m = n, but we will discuss m > n extensively, and m < n briefly
at the end) matrix $A \in \mathbb{R}^{m\times n}$ and vector $b \in \mathbb{R}^m$, find $x \in \mathbb{R}^n$ such that $Ax = b$.

2. Eigenvalue problem
$Ax = \lambda x$.
Given a (always!¹) square matrix $A \in \mathbb{R}^{n\times n}$, find $\lambda$: eigenvalues (eigval), and $x \in \mathbb{R}^n$:
eigenvectors (eigvec).

We’ll see many variants of these problems; one worthy of particular mention is the SVD,
which is related to eigenvalue problems but given its ubiquity has a life of its own. (So if
there’s a third problem we solve in NLA, it would definitely be the SVD.)
It is worth discussing why we care about linear systems and eigenvalue problems.
The primary reason is that many (in fact most) problems in scientific computing (and
even machine learning) boil down to linear problems:

Because that’s often the only way to deal with the scale of problems we face today!
(and in future)
¹ There are exciting recent developments involving eigenvalue problems for rectangular matrices, but these
are outside the scope of this course.

For linear problems, so much is understood and reliable algorithms are available².

A related important question is where and how these problems arise in real-world prob-
lems.
Let us mention a specific context that is relevant in data science: optimisation. Suppose
one is interested in minimising a high-dimensional real-valued function $f(x): \mathbb{R}^n \to \mathbb{R}$ where
$n \gg 1$.
A successful approach is to try and find critical points, that is, points $x_*$ where $\nabla f(x_*) = 0$.
Mathematically, this is a non-linear high-dimensional root-finding problem of finding
$x \in \mathbb{R}^n$ such that $\nabla f(x) =: F(x) = 0$ (the vector $0 \in \mathbb{R}^n$), where $F: \mathbb{R}^n \to \mathbb{R}^n$. One of
the most commonly employed methods for this task is Newton's method (which some of you
have seen in Prelims Constructive Mathematics). This boils down to the following.

Newton's method for $F(x) = 0$, $F: \mathbb{R}^n \to \mathbb{R}^n$ nonlinear:

1. Start with an initial guess $x^{(0)} \in \mathbb{R}^n$, and set $i = 0$.

2. Find the Jacobian matrix $J \in \mathbb{R}^{n\times n}$, $J_{ij} = \frac{\partial F_i(x)}{\partial x_j}\Big|_{x = x^{(i)}}$.

3. Update $x^{(i+1)} := x^{(i)} - J^{-1}F(x^{(i)})$, set $i \leftarrow i + 1$, go to step 2 and repeat.

Note that the main computational task is to find the vector $y = J^{-1}F(x^{(i)})$, which is
a linear system $Jy = F(x^{(i)})$ (which we solve for the vector y).
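As a concrete illustration, here is a minimal MATLAB sketch of this iteration for a small made-up nonlinear system (the function F, its Jacobian, and the starting point are invented for the example; note that the linear system is solved with backslash rather than by forming $J^{-1}$ explicitly):

% Newton's method sketch: each step solves the linear system J*y = F(x).
% Made-up example F(x) = 0 with a known solution x = [1; 1]:
F = @(x) [x(1)^2 + x(2)^2 - 2; x(1) - x(2)];
J = @(x) [2*x(1), 2*x(2); 1, -1];          % Jacobian of F

x = [2; 0];                                 % initial guess
for i = 1:20
    y = J(x) \ F(x);                        % linear solve, NOT inv(J)*F
    x = x - y;
    if norm(F(x)) < 1e-12, break; end       % stop when the residual is tiny
end
disp(x)                                     % converges to [1; 1]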

What about eigenvalue problems $Ax = \lambda x$? Google's PageRank is a famous application
(we will cover this if we have time). Another example is the Schrödinger equation of physics
and chemistry. Sometimes a nonconvex optimisation problem can be solved via an eigenvalue
problem.
Equally important is principal component analysis (PCA), which can be used for data
compression. This is more tightly connected to the SVD.
Other sources of linear algebra problems include differential equations, optimisation,
regression, data analysis, ...

1 Basic LA review
We start with a review of key LA facts that will be used in the course. Some will be trivial
to you while others may not be. You might also notice that some facts that you have learned in
a core LA course will not be used in this course. For example, we will never deal with finite
fields, and determinants only play a passing role.
² A pertinent quote is Richard Feynman's "Linear systems are important because we can solve them".
Because we can solve them, we do all sorts of tricks to reduce difficult problems to linear systems!

1.1 Warmup exercise
Let $A \in \mathbb{R}^{n\times n}$ (an $n\times n$ square matrix), or $\mathbb{C}^{n\times n}$; the difference hardly matters in most of
this course³. Try to think of statements that are equivalent to A being nonsingular. Try to
come up with as many conditions as possible, before turning the page.

³ While there are a small number of cases where the distinction between real and complex matrices matters,
in the majority of cases it does not, and the argument carries over to complex matrices by replacing $\cdot^T$ with
$\cdot^*$. Therefore for the most part, we lose no generality in assuming the matrix is real (which slightly simplifies
our mindset). Whenever necessary, we will highlight the subtleties that arise resulting from the difference
between real and complex. (For the curious, these are the Schur form/decomposition, $LDL^T$ factorisation
and eigenvalue decomposition for (real) matrices with complex eigenvalues.)

Here is a list: The following are equivalent.

1. A is nonsingular.

2. A is invertible: $A^{-1}$ exists.

3. The map $A: \mathbb{R}^n \to \mathbb{R}^n$ is a bijection.

4. All n eigenvalues of A are nonzero.

5. All n singular values of A are positive.

6. rank(A) = n.

7. The rows of A are linearly independent.

8. The columns of A are linearly independent.

9. Ax = b has a solution for every $b \in \mathbb{C}^n$.

10. A has no nonzero null vector.

11. AT has no nonzero null vector.

12. $A^*A$ is positive definite (not just semidefinite).

13. $\det(A) \neq 0$.
1
14. An n ⇥ n matrix A exists such that A 1 A = In . (this, btw, implies (i↵) AA 1
= In ,
a nontrivial fact)

15. . . . (what did I miss?)
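Several of these conditions are easy to check numerically; a minimal MATLAB sanity check on a random matrix (rank, eig, svd, and det are built-in functions, and a random Gaussian matrix is nonsingular with probability 1):

n = 5;
A = randn(n);                 % nonsingular with probability 1
fprintf('rank(A) = %d (expect %d)\n', rank(A), n);
fprintf('min |eigenvalue|   = %.2e (nonzero)\n', min(abs(eig(A))));
fprintf('min singular value = %.2e (positive)\n', min(svd(A)));
fprintf('det(A)             = %.2e (nonzero)\n', det(A));
x = A \ randn(n,1);           % Ax = b is solvable for any right-hand side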

1.2 Structured matrices


We will be discussing lots of structured matrices. For square matrices:

Symmetric: $A_{ij} = A_{ji}$ (Hermitian: $A_{ij} = \bar{A}_{ji}$)

– The most important property of symmetric matrices is the symmetric eigenvalue
decomposition $A = V\Lambda V^T$; V is orthogonal, $V^TV = VV^T = I_n$, and $\Lambda$ is a
diagonal matrix of eigenvalues, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
– symmetric positive (semi)definite $A \succ (\succeq)\, 0$: symmetric and all positive (nonnegative)
eigenvalues.

Orthogonal: $AA^T = A^TA = I$ (Unitary: $AA^* = A^*A = I$). Note that for square
matrices, $A^TA = I$ implies $AA^T = I$.

Skew-symmetric: $A_{ij} = -A_{ji}$ (skew-Hermitian: $A_{ij} = -\bar{A}_{ji}$).

Normal: $A^TA = AA^T$. (Here it's better to discuss the complex case $A^*A = AA^*$: this
is a necessary and sufficient condition for diagonalisability under a unitary transformation,
i.e., $A = U\Lambda U^*$ where $\Lambda$ is diagonal and U is unitary.)

Tridiagonal: $A_{ij} = 0$ if $|i - j| > 1$.

Upper triangular: $A_{ij} = 0$ if $i > j$.

Lower triangular: $A_{ij} = 0$ if $i < j$.

For (possibly nonsquare) matrices $A \in \mathbb{C}^{m\times n}$ (usually $m \geq n$):

(upper) Hessenberg: $A_{ij} = 0$ if $i > j + 1$. (We will see this structure often.)

"orthonormal": $A^TA = I_n$, and A is (tall) rectangular. (This isn't an established
name—we could call it "matrix with orthonormal columns" every time it appears—
but we use these matrices all the time in this course, so we need a consistent shorthand
name for it.)

sparse: most elements are zero. nnz(A) denotes the number of nonzero elements in A.
Matrices that are not sparse are called dense.
Other structures: Hankel, Toeplitz, circulant, symplectic,... (we won’t use these in this
course)
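A few of these structures can be constructed and checked directly in MATLAB; a small illustrative sketch (the matrices here are arbitrary examples):

n = 6;  B = randn(n);
S  = (B + B')/2;                  % symmetric: S equals S'
K  = (B - B')/2;                  % skew-symmetric: K equals -K'
[Q, R] = qr(B);                   % Q orthogonal, R upper triangular
T  = triu(tril(B, 1), -1);        % tridiagonal: zero outside |i - j| <= 1
H  = triu(B, -1);                 % upper Hessenberg: zero below the first subdiagonal
Sp = sprandn(n, n, 0.2);          % sparse, roughly 20% nonzeros
fprintf('symmetric: %d, skew: %d, Q''Q = I: %d, nnz(Sp) = %d\n', ...
    isequal(S, S'), isequal(K, -K'), norm(Q'*Q - eye(n)) < 1e-14, nnz(Sp));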

1.3 Matrix eigenvalues: basics


$Ax = \lambda x$, $A \in \mathbb{R}^{n\times n}$, $(0 \neq)\, x \in \mathbb{R}^n$

Example:
$\begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix}
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
= 4 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$

This matrix has an eigenvalue $\lambda = 4$, with corresponding eigenvector $x = [1, 1, 1]^T$ (together
they are an eigenpair).

An $n\times n$ matrix always has $n$ eigenvalues (but not always $n$ linearly independent eigenvectors);
in the example above, $(\lambda, x) = (1, [1, -1, 0]^T)$ and $(1, [0, 1, -1]^T)$ are also eigenpairs.

The eigenvalues are the roots of the characteristic polynomial $\det(\lambda I - A) = 0$:
$\det(\lambda I - A) = \prod_{i=1}^{n} (\lambda - \lambda_i)$.

According to Galois theory, eigenvalues cannot be computed exactly (in finitely many
arithmetic operations and radicals) for matrices with $n \geq 5$. But we still want to compute
them! In this course we will (among other things) explain how this is done in practice by the
QR algorithm, one of the greatest hits of the field.
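In practice one simply calls a library routine; for example, a quick MATLAB check of the 3-by-3 example above using the built-in eig:

A = [2 1 1; 1 2 1; 1 1 2];
[V, D] = eig(A);          % columns of V are eigenvectors, D is diagonal
diag(D)                   % eigenvalues 1, 1, 4 (in ascending order)
norm(A*V - V*D)           % ~ 1e-15: A*V = V*D up to rounding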

1.4 Computational complexity (operation counts) of matrix algorithms
Since NLA is a field that aspires to develop practical algorithms for solving matrix problems,
it is important to be aware of the computational cost (often referred to as complexity) of
the algorithms. We will discuss these as the algorithms are developed, but for now let’s
examine the costs for basic matrix-matrix multiplication. The cost is measured in terms
of flops (floating-point operations), which counts the number of additions, subtractions,
multiplications, and divisions (all treated equally) performed.
In NLA the constant in front of the leading term in the cost is (clearly) important. It
is customary (for good reason) to only track the leading term of the cost. For example,
$n^3 + 10n^2$ is abbreviated to $n^3$.

Multiplying two $n\times n$ matrices, AB, costs $2n^3$ flops. More generally, if A is $m\times n$ and
B is $n\times k$, then computing AB costs $2mnk$ flops.

Multiplying an $m\times n$ matrix A by a vector costs $2mn$ flops.
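As a rough illustration of these counts, one can compare the nominal $2n^3$ flops of a matrix-matrix multiplication with measured time to estimate the flop rate achieved (a small MATLAB sketch; the timing will of course depend on the machine):

n = 2000;
A = randn(n);  B = randn(n);
t = timeit(@() A*B);                            % time one n-by-n matrix multiplication
fprintf('n = %d: %.2f s, about %.1f Gflop/s\n', n, t, 2*n^3 / t / 1e9);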

Norms
We will need a tool (or metric) to measure how big a vector or matrix is. Norms give us a
means to achieve this. Surely you have already seen some norms (e.g. the vector Euclidean
norm). We will discuss a number of norms for vectors and matrices that we will use in the
upcoming lectures.

1.5 Vector norms


For vectors $x = [x_1, \dots, x_n]^T \in \mathbb{C}^n$:

p-norm: $\|x\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p}$ $(1 \leq p \leq \infty)$

– Euclidean norm = 2-norm: $\|x\|_2 = \sqrt{|x_1|^2 + |x_2|^2 + \cdots + |x_n|^2}$
– 1-norm: $\|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$
– $\infty$-norm: $\|x\|_\infty = \max_i |x_i|$

Of particular importance are the three cases $p = 1, 2, \infty$. In this course, we will see $p = 2$
the most often.
A norm needs to satisfy the following axioms:

$\|\alpha x\| = |\alpha|\,\|x\|$ for any $\alpha \in \mathbb{C}$ (homogeneity),

$\|x\| \geq 0$ and $\|x\| = 0 \Leftrightarrow x = 0$ (nonnegativity),

$\|x + y\| \leq \|x\| + \|y\|$ (triangle inequality).

The vector p-norm satisfies all these axioms, for any p.
Here are some useful inequalities for vector norms. A proof is left as an exercise and
is highly recommended. (Try to think when each equality is satisfied.) For $x \in \mathbb{C}^n$,

$\frac{1}{\sqrt{n}}\|x\|_2 \leq \|x\|_\infty \leq \|x\|_2$

$\frac{1}{\sqrt{n}}\|x\|_1 \leq \|x\|_2 \leq \|x\|_1$

$\frac{1}{n}\|x\|_1 \leq \|x\|_\infty \leq \|x\|_1$
Note that with the 2-norm, $\|Ux\|_2 = \|x\|_2$ for any unitary U and any $x \in \mathbb{C}^n$. Norms
with this property are called unitarily invariant.
The 2-norm is also induced by the inner product: $\|x\|_2 = \sqrt{x^Tx}$. An important property
of inner products is the Cauchy-Schwarz inequality $|x^Ty| \leq \|x\|_2\|y\|_2$ (which can be directly
proved but is perhaps best to prove in a general setting)⁴. When we just say $\|x\|$ for a vector
we mean the 2-norm.
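These facts are easy to sanity-check numerically; a small MATLAB sketch on a random complex vector (norm(x,p) computes the vector p-norm, and the Q factor of a random complex matrix serves as a random unitary matrix):

n = 10;  x = randn(n,1) + 1i*randn(n,1);
n1 = norm(x,1);  n2 = norm(x,2);  ninf = norm(x,inf);
fprintf('%.3f <= %.3f <= %.3f\n', n2/sqrt(n), ninf, n2);   % (1/sqrt(n))||x||_2 <= ||x||_inf <= ||x||_2
fprintf('%.3f <= %.3f <= %.3f\n', n1/sqrt(n), n2, n1);     % (1/sqrt(n))||x||_1 <= ||x||_2 <= ||x||_1
[U, ~] = qr(randn(n) + 1i*randn(n));                       % a random unitary matrix
fprintf('%.1e\n', norm(U*x) - norm(x));                    % ~ 1e-16: unitary invariance of the 2-norm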

1.6 Matrix norms


We now turn to norms of matrices. As you will see, many (but not the Frobenius and trace
norms) are defined via the vector norms (these are called induced norms).

p-norm: $\|A\|_p = \max_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}$

– 2-norm = spectral norm (= Euclidean norm): $\|A\|_2 = \sigma_{\max}(A)$ (largest singular value;
see Section 2)
– 1-norm: $\|A\|_1 = \max_i \sum_{j=1}^{m} |A_{ji}|$
– $\infty$-norm: $\|A\|_\infty = \max_i \sum_{j=1}^{n} |A_{ij}|$

Frobenius norm: $\|A\|_F = \sqrt{\sum_i \sum_j |A_{ij}|^2}$ (2-norm of the vectorisation)

trace norm = nuclear norm: $\|A\|_* = \sum_{i=1}^{\min(m,n)} \sigma_i(A)$ (this is the maximum of
$\mathrm{Trace}(Q^TA)$ over orthonormal Q, hence the name)

The 2-norm, Frobenius norm, and trace norm are unitarily invariant: $\|A\|_* = \|UAV\|_*$,
$\|A\|_F = \|UAV\|_F$, $\|A\|_2 = \|UAV\|_2$ for any unitary/orthogonal U, V.

Norm axioms hold for each of these. Useful inequalities include the following (exercise;
it is instructive to study the cases where each of these equalities holds). For $A \in \mathbb{C}^{m\times n}$,

$\frac{1}{\sqrt{n}}\|A\|_\infty \leq \|A\|_2 \leq \sqrt{m}\,\|A\|_\infty$

⁴ Just in case, here's a proof: for any scalar c, $\|x - cy\|^2 = \|x\|^2 - 2c\,x^Ty + c^2\|y\|^2$. This is minimised with
respect to c at $c = \frac{x^Ty}{\|y\|^2}$, with minimum value $\|x\|^2 - \frac{(x^Ty)^2}{\|y\|^2}$. Since this must be $\geq 0$, the CS inequality follows.

$\frac{1}{\sqrt{m}}\|A\|_1 \leq \|A\|_2 \leq \sqrt{n}\,\|A\|_1$

$\|A\|_2 \leq \|A\|_F \leq \sqrt{\min(m,n)}\,\|A\|_2$

A useful property of p-norms is that they are subordinate, i.e., $\|AB\|_p \leq \|A\|_p\|B\|_p$
(problem sheet). Note that not all norms satisfy this, e.g. with the max norm
$\|A\|_{\max} = \max_{i,j}|A_{ij}|$: with $A = [1, 1]$ and $B = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$ one has
$\|AB\|_{\max} = 2$ but $\|A\|_{\max} = \|B\|_{\max} = 1$.
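The matrix norms and inequalities above can likewise be checked in MATLAB (norm(A,1), norm(A,2), norm(A,inf), norm(A,'fro') and svd are built in; the trace norm is the sum of the singular values):

m = 8;  n = 5;
A = randn(m, n);  s = svd(A);
fprintf('2-norm    %.4f = sigma_max %.4f\n', norm(A), s(1));
fprintf('1-norm    %.4f = max column sum %.4f\n', norm(A,1), max(sum(abs(A),1)));
fprintf('inf-norm  %.4f = max row sum %.4f\n', norm(A,inf), max(sum(abs(A),2)));
fprintf('Frobenius %.4f <= sqrt(min(m,n))*2-norm = %.4f\n', norm(A,'fro'), sqrt(min(m,n))*norm(A));
fprintf('trace nrm %.4f = sum of singular values\n', sum(s));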

1.7 Subspaces and orthonormal matrices


A key notion that we will keep using throughout is a subspace S. In this course we
will almost exclusively confine ourselves to subspaces of $\mathbb{R}^n$, even though they generalise
to more abstract vector spaces. A subspace is the set of vectors that can be written as a
linear combination of basis vectors $v_1, \dots, v_d$, which are assumed to be linearly independent
(otherwise there is a basis with fewer vectors). That is, $x \in S$ iff $x = \sum_{i=1}^d c_i v_i$ (where the $c_i$ are
scalars, in $\mathbb{R}$ or $\mathbb{C}$). The integer d is called the dimension of the subspace. We also say the
subspace is spanned by the vectors $v_1, \dots, v_d$, or that $v_1, \dots, v_d$ span the subspace.
How does one represent a subspace? An obvious answer is to use the basis vectors
$v_1, \dots, v_d$. This sometimes becomes cumbersome, and a common and convenient way to
represent the subspace is to use a (tall-skinny) rectangular matrix $V = [v_1, v_2, \dots, v_d] \in \mathbb{R}^{n\times d}$,
as $S = \mathrm{span}(V)$ (or sometimes just "subspace V"), which means the subspace of vectors that
can be written as Vc, where c is a 'coefficient' vector $c \in \mathbb{R}^d$.
It will be (not necessary but) convenient to represent subspaces using an orthonormal
matrix $Q \in \mathbb{R}^{n\times d}$. (Once we cover the QR factorisation, you'll see that there is no loss of
generality in doing so.)
An important fact about subspaces of Rn is the following:

Lemma 1.1 Let $V_1 \in \mathbb{R}^{n\times d_1}$ and $V_2 \in \mathbb{R}^{n\times d_2}$ each have linearly independent column vectors.
If $d_1 + d_2 > n$, then there is a nonzero intersection between the two subspaces $S_1 = \mathrm{span}(V_1)$ and
$S_2 = \mathrm{span}(V_2)$, that is, there is a nonzero vector $x \in \mathbb{R}^n$ such that $x = V_1c_1 = V_2c_2$ for some
vectors $c_1, c_2$.

This is straightforward but important enough to warrant a proof.

Proof: Consider the matrix $M := [V_1, V_2]$, which is of size $n \times (d_1 + d_2)$. Since $d_1 + d_2 > n$
by assumption, this matrix has a right null vector⁵ $c \neq 0$ such that $Mc = 0$. Splitting
$c = \begin{bmatrix} c_1 \\ -c_2 \end{bmatrix}$, we have $V_1c_1 - V_2c_2 = Mc = 0$, i.e., $V_1c_1 = V_2c_2$, the required result. □

⁵ If this argument isn't convincing to you now, probably the easiest way to see this is via the SVD; so stay
tuned and we'll resolve this in footnote 9, once you've seen the SVD!
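The proof translates directly into a numerical illustration in MATLAB (null returns an orthonormal basis for the null space; the dimensions below are arbitrary):

n = 5;  d1 = 3;  d2 = 4;                 % d1 + d2 > n, so the subspaces must intersect
V1 = randn(n, d1);  V2 = randn(n, d2);
c = null([V1, V2]);  c = c(:, 1);        % a right null vector of M = [V1, V2]
c1 = c(1:d1);  c2 = -c(d1+1:end);
x = V1*c1;                               % x = V1*c1 = V2*c2 lies in both subspaces
fprintf('%.1e\n', norm(V1*c1 - V2*c2))   % ~ 1e-15: the two representations agree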
Let us conclude this review with a list of useful results that will be helpful. Proofs (or
counterexamples) should be straightforward.

$(AB)^T = B^TA^T$.

If A, B are invertible, $(AB)^{-1} = B^{-1}A^{-1}$.

If A, B are square and AB = I, then BA = I.

$\begin{bmatrix} I_m & X \\ 0 & I_n \end{bmatrix}^{-1} = \begin{bmatrix} I_m & -X \\ 0 & I_n \end{bmatrix}$

Neumann series: if $\|X\| < 1$ in any norm,
$(I - X)^{-1} = I + X + X^2 + X^3 + \cdots$

For a square $n\times n$ matrix A, the trace is $\mathrm{Trace}(A) = \sum_{i=1}^{n} A_{ii}$ (the sum of the diagonal entries).
For any X, Y such that XY is square, $\mathrm{Trace}(XY) = \mathrm{Trace}(YX)$ (quite useful). For
$B \in \mathbb{R}^{m\times n}$, we have $\|B\|_F^2 = \sum_i \sum_j |B_{ij}|^2 = \mathrm{Trace}(B^TB)$.

Triangular structure (upper or lower) is invariant under addition, multiplication, and


inversion. That is, triangular matrices form a ring (in abstract algebra; don’t worry if
this is foreign to you).

Symmetry is invariant under addition and inversion, but not multiplication; AB is


usually not symmetric even if A, B are.
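Two of these facts, the Neumann series and the trace identity, checked numerically in a short MATLAB sketch (the matrices are arbitrary):

n = 4;
X = randn(n);  X = 0.5 * X / norm(X);            % rescale so that ||X||_2 = 0.5 < 1
S = eye(n);  P = eye(n);
for k = 1:100
    P = P * X;  S = S + P;                       % partial sum I + X + X^2 + ... + X^k
end
fprintf('%.1e\n', norm(S - inv(eye(n) - X)))     % ~ 1e-16: Neumann series sums to (I - X)^{-1}
A = randn(3, 5);  B = randn(5, 3);
fprintf('%.1e\n', abs(trace(A*B) - trace(B*A)))  % ~ 1e-15: Trace(XY) = Trace(YX)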

2 SVD: the most important matrix decomposition


We now start the discussion of the most important topic of the course: the singular value
decomposition (SVD). The SVD exists for any matrix, square or rectangular, real or complex.
We will prove its existence and discuss its properties and applications, in particular in
low-rank approximation, which can immediately be used for compressing the matrix, and
therefore data.
The SVD has many intimate connections to symmetric eigenvalue problems. Let’s start
with a review.
Symmetric eigenvalue decomposition: Any symmetric matrix $A \in \mathbb{R}^{n\times n}$ has the
decomposition
$A = V \Lambda V^T$     (1)
where V is orthogonal, $V^TV = I_n = VV^T$, and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is a diagonal
matrix of eigenvalues.

The $\lambda_i$ are the eigenvalues, and V is the matrix of eigenvectors (its columns are the
eigenvectors).
The decomposition (1) makes two remarkable claims: the eigenvectors can be taken to
be orthogonal (which is true more generally of normal matrices, s.t. $A^*A = AA^*$), and the
eigenvalues are real.
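A minimal MATLAB illustration of (1): for a symmetric matrix, eig returns real eigenvalues and, here, an orthogonal eigenvector matrix.

n = 6;
B = randn(n);  A = (B + B')/2;        % a random symmetric matrix
[V, Lambda] = eig(A);                 % A = V*Lambda*V'
norm(A - V*Lambda*V')                 % ~ 1e-15: the decomposition holds
norm(V'*V - eye(n))                   % ~ 1e-15: V is orthogonal
isreal(diag(Lambda))                  % true (1): the eigenvalues are real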

