ALAFF
ALAFF
Foundations to Frontiers
Advanced Linear Algebra
Foundations to Frontiers
Margaret Myers
The University of Texas at Austin
We would like to thank the people who created PreTeXt, the authoring system used to
typeset these materials. We applaud you!
vii
Preface
viii
Contents
Acknowledgements vii
Preface viii
0 Getting Started 1
I Orthogonality
1 Norms 12
ix
CONTENTS x
B Notation 622
References 633
Index 637
Week 0
Getting Started
1
WEEK 0. GETTING STARTED 2
In this week (Week 0), we walk you through some of the basic course information and
help you set up for learning. The week itself is structured like future weeks, so that you
become familiar with that structure.
• 0.3 Enrichments
• 0.4 Wrap Up
• As a PDF at http://www.cs.utexas.edu/users/flame/laff/alaff/ALAFF.pdf.
If you download this PDF and place it in just the right folder of the materials you will
clone from GitHub (see next unit), the links in the PDF to the downloaded material
will work.
We will be updating the materals frequently as people report typos and we receive
feedback from learners. Please consider the environment before you print a copy...
• Eventually, if we perceive there is demand, we may offer a printed copy of these notes
from Lulu.com, a self-publishing service. This will not happen until Summer 2020, at
the earliest.
• Visit https://github.com/ULAFF/ALAFF
• Click on
• On the computer where you intend to work, in a terminal session on the command line
in the directory where you would like to place the materials, execute
• Sometimes we will update some of the files from the repository. When this happens
you will want to execute, in the cloned directory,
git pu
which restores local changes you made. This last step may require you to "merge" files
that were changed in the repository that conflict with local changes.
Upon completion of the cloning, you will have a directory structure similar to that given in
Figure 0.2.2.1.
Figure 0.2.2.1 Directory structure for your ALAFF materials. In this example, we cloned
the repository in Robert’s home directory, rvdg.
0.2.3 MATLAB
We will use Matlab to translate algorithms into code and to experiment with linear algebra.
There are a number of ways in which you can use Matlab:
• Via MATLAB that is installed on the same computer as you will execute your perfor-
mance experiments. This is usually called a "desktop installation of Matlab."
• Via MATLAB Online. You will have to transfer files from the computer where you are
performing your experiments to MATLAB Online. You could try to set up MATLAB
Drive, which allows you to share files easily between computers and with MATLAB
Online. Be warned that there may be a delay in when files show up, and as a result
you may be using old data to plot if you aren’t careful!
If you are using these materials as part of an offering of the Massive Open Online Course
(MOOC) titled "Advanced Linear Algebra: Foundations to Frontiers," you will be given a
temporary license to Matlab, courtesy of MathWorks. In this case, there will be additional
WEEK 0. GETTING STARTED 6
instructions on how to set up MATLAB Online, in the Unit on edX that corresponds to this
section.
You need relatively little familiarity with MATLAB in order to learn what we want you
to learn in this course. So, you could just skip these tutorials altogether, and come back to
them if you find you want to know more about MATLAB and its programming language
(M-script).
Below you find a few short videos that introduce you to MATLAB. For a more compre-
hensive tutorial, you may want to visit MATLAB Tutorials at MathWorks and click "Launch
Tutorial".
What is MATLAB?
https://www.youtube.com/watch?v=2sB-NMD9Qhk
https://www.youtube.com/watch?v=4shp284pGc8
MATLAB Variables
https://www.youtube.com/watch?v=gPIsIzHJA9I
MATLAB as a Calculator
https://www.youtube.com/watch?v=K9xy5kQHDBo
https://www.youtube.com/watch?v=mqYwMnM-x5Q
Remark 0.2.3.1 Some of you may choose to use MATLAB on your personal computer while
others may choose to use MATLAB Online. Those who use MATLAB Online will need to
transfer some of the downloaded materials to that platform.
WEEK 0. GETTING STARTED 7
• Click on
(to make sure you get the address right, you will want to paste the address you copied
in the last step.)
cd b is
• Indicate a specific version of BLIS so that we all are using the same release:
The -p ~/blis installs the library in the subdirectory ~/blis of your home directory,
which is where the various exercises in the course expect it to reside.
• If you run into a problem while installing BLIS, you may want to consult https:
//github.com/f ame/b is/b ob/master/docs/Bui dSystem.md.
On Mac OS-X
• You may need to install Homebrew, a program that helps you install various software
on you mac. Warning: you may need "root" privileges to do so.
Keep an eye on the output to see if the “Command Line Tools” get installed. This
may not be installed if you already have Xcode Command line tools installed. If this
happens, post in the "Discussion" for this unit, and see if someone can help you out.
$ which gcc
The output should be /usr/local/bin. If it isn’t, you may want to add /usr/local/bin
to "the path." I did so by inserting
into the file .bash_profile in my home directory. (Notice the "period" before "bash_profile"
• Now you can go back to the beginning of this unit, and follow the instructions to install
BLIS.
by executing
• https://github.com/flame/libflame/blob/master/INSTALL.
0.3 Enrichments
In each week, we include "enrichments" that allow the participant to go beyond.
1. 1946: John von Neumann, Stan Ulam, and Nick Metropolis, all at the Los Alamos
Scientific Laboratory, cook up the Metropolis algorithm, also known as the Monte Carlo
method.
2. 1947: George Dantzig, at the RAND Corporation, creates the simplex method for
linear programming.
3. 1950: Magnus Hestenes, Eduard Stiefel, and Cornelius Lanczos, all from the Institute
for Numerical Analysisat the National Bureau of Standards, initiate the development
of Krylov subspace iteration methods.
4. 1951: Alston Householder of Oak Ridge National Laboratory formalizes the decompo-
sitional approach to matrix computations.
GETTING STARTED 10
5. 1957: John Backus leads a team at IBM in developing the Fortran optimizing compiler.
6. 1959–61: J.G.F. Francis of Ferranti Ltd., London, finds a stable method for computing
eigenvalues, known as the QR algorithm.
8. 1965: James Cooley of the IBM T.J. Watson Research Center and John Tukey of
PrincetonUniversity and AT&T Bell Laboratories unveil the fast Fourier transform.
9. 1977: Helaman Ferguson and Rodney Forcade of Brigham Young University advance
an integer relation detection algorithm.
10. 1987: Leslie Greengard and Vladimir Rokhlin of Yale University invent the fast mul-
tipole algorithm.
Of these, we will explicitly cover three: the decomposition method to matrix computations,
Krylov subspace methods, and the QR algorithm. Although not explicitly covered, your
understanding of numerical linear algebra will also be a first step towards understanding
some of the other numerical algorithms listed.
0.4 Wrap Up
0.4.1 Additional Homework
For a typical week, additional assignments may be given in this unit.
0.4.2 Summary
In a typical week, we provide a quick summary of the highlights in this unit.
Part I
Orthogonality
11
Week 1
Norms
1.1 Opening
1.1.1 Why norms?
YouTube: https://www.youtube.com/watch?v=DKX3TdQWQ90
The following exercises expose some of the issues that we encounter when computing.
We start by computing b = U x, where U is upper triangular.
Homework 1.1.1.1 Compute
Q RQ R
1 ≠2 1 ≠1
c
a 0 ≠1 ≠1 d c d
ba 2 b =
0 0 2 1
Solution. Q RQ R Q R
1 ≠2 1 ≠1 ≠4
c
a 0 ≠1 ≠1 b a 2 b = a ≠3 d
dc d c
b
0 0 2 1 2
Next, let’s examine the slightly more difficult problem of finding a vector x that satisfies
U x = b.
12
WEEK 1. NORMS 13
Solution. We can recognize the relation between this problem and Homework 1.1.1.1 and
hence deduce the answer without computation:
Q R Q R
‰0 ≠1
c d c d
‰
a 1 b = a 2 b
‰2 1
The point of these two homework exercises is that if one creates a (nonsingular) n ◊ n
matrix U and vector x of size n, then computing b = U x followed by solving U x‚ = b should
leave us with a vector x‚ such that x = x‚.
Remark 1.1.1.1 We don’t "teach" Matlab in this course. Instead, we think that Matlab is
intuitive enough that we can figure out what the various commands mean. We can always
investigate them by typing
he p <command>
in the command window. For example, for this unit you may want to execute
he p format
he p rng
he p rand
he p triu
he p *
he p \
he p diag
he p abs
he p min
he p max
If you want to learn more about Matlab, you may want to take some of the tutorials offered
by Mathworks at https://www.mathworks.com/support/ earn-with-mat ab-tutoria s.htm .
Let us see if Matlab can compute the solution of a triangular matrix correctly.
Homework 1.1.1.3 In Matlab’s command window, create a random upper triangular matrix
U:
format ong Report results in long format. Seed the ran-
dom number generator so that we all create
rng( 0 ); the same random matrix U and vector x.
n = 3
U = triu( rand( n,n ) )
x = rand( n,1 )
WEEK 1. NORMS 14
• x‚ ≠ x does not equal zero. This is due to the fact that the computer stores floating
point numbers and computes with floating point arithmetic, and as a result roundoff
error happens.
• The difference is small (notice the 1.0e-15 before the vector, which shows that each
entry in x‚ ≠ x is around 10≠15 .
• Repeating this with a much larger n make things cumbersome since very long vectors
are then printed.
To be able to compare more easily, we will compute the Euclidean length of x‚ ≠ x instead
using the Matlab command norm( xhat - x ). By adding a semicolon at the end of Matlab
commands, we suppress output.
Homework 1.1.1.4 Execute
format ong Report results in long format.
Seed the random number generator so that we
rng( 0 ); all create the same random matrix U and vec-
n = 100; tor x.
U = triu( rand( n,n ) );
x = rand( n,1 );
b = U * x; Compute right-hand side b from known solu-
tion x.
xhat = U \ b; Solve U x‚ = b
norm( xhat - x ) Report the Euclidean length of the difference
between x‚ and x.
What do we notice?
Next, check how close U x‚ is to b = U x, again using the Euclidean length:
norm( b - U * xhat )
WEEK 1. NORMS 15
What do we notice?
Solution. A script with the described commands can be found in Assignments/Week01/
matlab/Test_Upper_triangular_solve_100.m.
Some things we observe:
• norm(x‚ ≠ x), the Euclidean length of x‚ ≠ x, is huge. Matlab computed the wrong
answer!
• However, the computed x‚ solves a problem that corresponds to a slightly different right-
hand side. Thus, x‚ appears to be the solution to an only slightly changed problem.
The next exercise helps us gain insight into what is going on.
Homework 1.1.1.5 Continuing with the U, x, b, and xhat from Homework 1.1.1.4, consider
• When is an upper triangular matrix singular?
• How large is the smallest element on the diagonal of the U from Homework 1.1.1.4?
(min( abs( diag( U ) ) ) returns it!)
• If U were singular, how many solutions to U x‚ = b would there be? How can we
characterize them?
• How large is the smallest element on the diagonal of the U from Homework 1.1.1.4?
(min( abs( diag( U ) ) ) returns it!)
Answer:
It is small in magnitude. This is not surprising, since it is a random number and hence
as the matrix size increases, the chance of placing a small entry (in magnitude) on the
diagonal increases.
• If U were singular, how many solutions to U x‚ = b would there be? How can we
characterize them?
Answer:
An infinite number. Any vector in the null space can be added to a specific solution
to create another solution.
WEEK 1. NORMS 16
What have we learned? The :"wrong" answer that Matlab computed was due to the fact
that matrix U was almost singular.
To mathematically qualify and quantify all this, we need to be able to talk about "small"
and "large" vectors, and "small" and "large" matrices. For that, we need to generalize the
notion of length. By the end of this week, this will give us some of the tools to more fully
understand what we have observed.
YouTube: https://www.youtube.com/watch?v=2ZEtcnaynnM
1.1.2 Overview
• 1.1 Opening
YouTube: https://www.youtube.com/watch?v=V5ZQmR4zTeU
Recall that | · | : C æ R is the function that returns the absolute value of the input.
In other words, if – = –r + –c i, where –r and –c are the real and imaginary parts of –,
respectively, then Ò
|–| = –r2 + –c2 .
The absolute value (magnitude) of a complex number can also be thought of as the (Eu-
clidean) distance from the point in the complex plane to the origin of that plane, as illustrated
below for the number 3 + 2i.
WEEK 1. NORMS 19
– = –r + –c i = –r ≠ –c i.
4. (1 ≠ i)(2 ≠ i) =
5. (2 ≠ i)(1 ≠ i) =
6. (1 ≠ i)(2 ≠ i) =
Solution.
1. (1 + i)(2 ≠ i) = 2 + 2i ≠ i ≠ i2 = 2 + i + 1 = 3 + i
2. (2 ≠ i)(1 + i) = 2 ≠ i + 2i ≠ i2 = 2 + i + 1 = 3 + i
3. (1 ≠ i)(2 ≠ i) = (1 + i)(2 ≠ i) = 2 ≠ i + 2i ≠ i2 = 3 + i
WEEK 1. NORMS 20
4. (1 ≠ i)(2 ≠ i) = (1 + i)(2 ≠ i) = 2 ≠ i + 2i ≠ i2 = 2 + i + 1 = 3 + i = 3 ≠ i
5. (2 ≠ i)(1 ≠ i) = (2 + i)(1 ≠ i) = 2 ≠ 2i + i ≠ i2 = 2 ≠ i + 1 = 3 ≠ i
6. (1 ≠ i)(2 ≠ i) = (1 ≠ i)(2 + i) = 2 + i ≠ 2i ≠ i2 = 2 ≠ i + 1 = 3 ≠ i
Homework 1.2.1.2 Let –, — œ C.
1. ALWAYS/SOMETIMES/NEVER: –— = —–.
2. ALWAYS/SOMETIMES/NEVER: –— = —–.
1. ALWAYS: –— = —–.
2. SOMETIMES: –— = —–.
Solution.
1. ALWAYS: –— = —–.
Proof:
–—
= < substitute >
(–r + –c i)(—r + —c i)
= < multiply out >
–r —r + –r —c i + –c —r i ≠ –c —c
= < commutativity of real multiplication >
—r –r + —r –c i + —c –r i ≠ —c –c
= < factor >
(—r + —c i)(–r + –c i)
= < substitute >
—–.
2. SOMETIMES: –— = —–.
An example where it is true: – = — = 0.
An example where it is false: – = 1 and — = i. Then –— = 1◊i = i and —– = ≠i◊1 =
≠i.
Homework 1.2.1.3 Let –, — œ C.
ALWAYS/SOMETIMES/NEVER: –— = —–.
Hint. Let – = –r + –c i and — = —r + —c i, where –r , –c , —r , —c œ R.
Answer. ALWAYS
Now prove it!
WEEK 1. NORMS 21
Solution 1.
–—
= < – = –r + –c i; — = —r + —c i >
(–r + –c i)(—r + —c i)
= < conjugate – >
(–r ≠ –c i)(—r + —c i)
= < multiply out >
(–r —r ≠ –c —r i + –r —c i + –c —c )
= < conjugate >
–r —r + –c —r i ≠ –r —c i + –c —c
= < rearrange >
—r –r + —r –c i ≠ —c –r i + —c –c
= < factor >
(—r ≠ —c i)(–r + –c i)
= < definition of conjugation >
(—r + —c i)(–r + –c i)
= < – = –r + –c i; — = —r + —c i >
—–
Solution 2. Proofs in mathematical textbooks seem to always be wonderfully smooth
arguments that lead from the left-hand side of an equivalence to the right-hand side. In
practice, you may want to start on the left-hand side, and apply a few rules:
–—
= < – = –r + –c i; — = —r + —c i >
(–r + –c i)(—r + —c i)
= < conjugate – >
(–r ≠ –c i)(—r + —c i)
= < multiply out >
(–r —r ≠ –c —r i + –r —c i + –c —c )
= < conjugate >
–r —r + –c —r i ≠ –r —c i + –c —c
and then move on to the right-hand side, applying a few rules:
—–
= < – = –r + –c i; — = —r + —c i >
(—r + —c i)(–r + –c i)
= < conjugate — >
(—r ≠ —c i)(–r + –c i)
= < multiply out >
—r –r + —r –c i ≠ —c –r i + —c –c .
At that point, you recognize that
–r —r + –c —r i ≠ –r —c i + –c —c = —r –r + —r –c i ≠ —c –r i + —c –c
WEEK 1. NORMS 22
since the second is a rearrangement of the terms of the first. Optionally, you then go back
and presents these insights as a smooth argument that leads from the expression on the
left-hand side to the one on the right-hand side:
–—
= < – = –r + –c i; — = —r + —c i >
(–r + –c i)(—r + —c i)
= < conjugate – >
(–r ≠ –c i)(—r + —c i)
= < multiply out >
(–r —r ≠ –c —r i + –r —c i + –c —c )
= < conjugate >
–r —r + –c —r i ≠ –r —c i + –c —c
= < rearrange >
—r –r + —r –c i ≠ —c –r i + —c –c
= < factor >
(—r ≠ —c i)(–r + –c i)
= < definition of conjugation >
(—r + —c i)(–r + –c i)
= < – = –r + –c i; — = —r + —c i >
—–.
Solution 3. Yet another way of presenting the proof uses an "equivalence style proof." The
idea is to start with the equivalence you wish to prove correct:
–— = —–
and through a sequence of equivalent statements argue that this evaluates to TRUE:
–— = —–
… < – = –r + –c i; — = —r + —c i >
(–r + –c i)(—r + —c i) = (—r + —c i)(–r + –c i)
… < conjugate ◊ 2 >
(–r ≠ –c i)(—r + —c i) = (—r ≠ —c i)(–r + –c i)
… < multiply out ◊ 2 >
–r —r + –r —c i ≠ –c —r i + –c —c = —r –r + —r –c i ≠ —c –r i + —c –c
… < conjugate >
–r —r ≠ –r —c i + –c —r i + –c —c = —r –r + —r –c i ≠ —c –r i + —c –c
… < subtract equivalent terms from left-hand side and right-hand side >
0=0
… < algebra >
TRUE.
––
= < instantiate >
(–r + –c i)(–r + –c i)
= < conjugate >
(–r ≠ –c i)(–r + –c i)
= < multiply out >
–r2 + –c2 ,
Solution. Let – = –r + –c i.
|–|
= < instantiate >
|–r + –c i|
= < conjugate >
|–r ≠ –c i|
Ò
= < definition of | · | >
–r + –c
2 2
YouTube: https://www.youtube.com/watch?v=CTrUVfLGcNM
A vector norm extends the notion of an absolute value to vectors. It allows us to measure
the magnitude (or length) of a vector. In different situations, a different measure may be
more appropriate.
Definition 1.2.2.1 Vector norm. Let ‹ : Cm æ R. Then ‹ is a (vector) norm if for all
x, y œ Cm and all – œ C
• x ”= 0 ∆ ‹(x) > 0 (‹ is positive definite),
⌃
Homework 1.2.2.1 TRUE/FALSE: If ‹ : C æ R is a norm, then ‹(0) = 0.
m
Hint. From context, you should be able to tell which of these 0’s denotes the zero vector
of a given size and which is the scalar 0.
0x = 0 (multiplying any vector x by the scalar 0 results in a vector of zeroes).
Answer. TRUE.
Now prove it.
WEEK 1. NORMS 25
Solution. Let x œ Cm and, just for clarity this first time, ˛0 be the zero vector of size m so
that 0 is the scalar zero. Then
‹(˛0)
= < 0 · x = ˛0 >
‹(0 · x)
= < ‹(· · · ) is homogeneous >
0‹(x)
= < algebra >
0
Remark 1.2.2.2 We typically use Î · Î instead of ‹(·) for a function that is a norm.
YouTube: https://www.youtube.com/watch?v=bxDDpUZEqBs
The length of a vector is most commonly measured by the "square root of the sum of the
squares of the elements," also known as the Euclidean norm. It is called the 2-norm because
it is a member of a class of norms known as p-norms, discussed in the next unit.
Definition 1.2.3.1 Vector 2-norm. The vector 2-norm Î · Î2 : Cm æ R is defined for
x œ Cm by
Ò ı̂m≠1
ıÿ
ÎxÎ2 = |‰0 + · · · + |‰m≠1 =
|2 |2 Ù |‰ |2 .
i
i=0
⌃
Remark 1.2.3.2 The notation xH requires a bit of explanation. If
Q R
‰0
c . d
x = a .. d
c
b
‰m
WEEK 1. NORMS 26
Î(y T x)yÎ2
Ò
= < definition >
(y T x)T y T (y T x)y
Ò
= < z– = –z >
(x y) y T y
T 2
Thus |xT y| Æ ÎxÎ2 (since a component should be shorter than the whole). If y is not of unit
length (but a nonzero vector), then |xT ÎyÎ
y
2
| Æ ÎxÎ2 or, equivalently, |xT y| Æ ÎxÎ2 ÎyÎ2 .
We now state this result as a theorem, generalized to complex valued vectors:
Theorem 1.2.3.3 Cauchy-Schwarz inequality. Let x, y œ Cm . Then |xH y| Æ ÎxÎ2 ÎyÎ2 .
Proof. Assume that x ”= 0 and y ”= 0, since otherwise the inequality is trivially true. We
can then choose x‚ = x/ÎxÎ2 and y‚ = y/ÎyÎ2 . This leaves us to prove that |x‚H y‚| Æ 1 since
Îx‚Î2 = Îy‚Î2 = 1.
Pick I
1 if xH y = 0
–=
y‚ x‚/|x‚ y‚| otherwise.
H H
so that |–| = 1 and –x‚H y‚ is real and nonnegative. Note that since it is real we also know
WEEK 1. NORMS 27
that
–x‚H y‚
= < — = — if — is real >
–x‚H y‚ .
= < property of complex conjugation >
H‚
–y x
‚
Now,
0
Æ < Î · Î2 is nonnegative definite >
Îx‚ ≠ –y‚Î22
= < ÎzÎ22 = z H z >
(x‚ ≠ –y‚) (x‚ ≠ –y‚)
H
ΖxÎ2
Ò
= < scaling a vector scales its components; definition >
|–‰0 | + · · · + |–‰m≠1 |2
2
Ò
= < algebra >
|–| |‰0 |2 + · · · + |–|2 |‰m≠1 |2
2
Ò
= < algebra >
|–| (|‰0 |2 + · · · + |‰m≠1 |2 )
2
=
Ò
< algebra >
|–| |‰0 |2 + · · · + |‰m≠1 |2
= < definition >
|–|ÎxÎ2 .
Îx + yÎ22
= < ÎzÎ22 = z H z >
(x + y) (x + y)
H
YouTube: https://www.youtube.com/watch?v=WGBMnmgJek8
A vector norm is a measure of the magnitude of a vector. The Euclidean norm (length) is
merely the best known such measure. There are others. A simple alternative is the 1-norm.
Definition 1.2.4.1 Vector 1-norm. The vector 1-norm, Î · Î1 : Cm æ R, is defined for
x œ Cm by
m≠1
ÿ
ÎxÎ1 = |‰0 | + |‰1 | + · · · + |‰m≠1 | = |‰i |.
i=0
⌃
WEEK 1. NORMS 30
Îx + yÎ1
= < vector addition; definition of 1-norm >
|‰0 + Â0 | + |‰1 + Â1 | + · · · + |‰m≠1 + Âm≠1 |
Æ < algebra >
|‰0 | + |Â0 | + |‰1 | + |Â1 | + · · · + |‰m≠1 | + |Âm≠1 |
= < commutivity >
|‰0 | + |‰1 | + · · · + |‰m≠1 | + |Â0 | + |Â1 | + · · · + |Âm≠1 |
= < associativity; definition >
ÎxÎ1 + ÎyÎ1 .
The vector 1-norm is sometimes referred to as the "taxi-cab norm". It is the distance
that a taxi travels, from one point on a street to another such point, along the streets of a
city that has square city blocks.
Another alternative is the infinity norm.
Definition 1.2.4.2 Vector Œ-norm. The vector Œ-norm, Î · ÎŒ : Cm æ R, is defined
for x œ Cm by
ÎxÎŒ = max(|‰0 |, . . . , |‰m≠1 ) = max |‰i |.
m≠1
i=0
⌃
The infinity norm simply measures how large the vector is by the magnitude of its largest
entry.
WEEK 1. NORMS 31
ΖxÎŒ = maxm≠1
i=0 |–‰i |
= maxm≠1
i=0 |–||‰i |
= |–| maxm≠1
i=0 |‰i |
= |–|ÎxÎŒ .
Îx + yÎŒ = maxm≠1
i=0 |‰i + Âi |
Æ maxm≠1
i=0 (|‰i | + |Âi |)
Æ maxm≠1
i=0 |‰i | + maxi=0 |Âi |
m≠1
= ÎxÎŒ + ÎyÎŒ .
In this course, we will primarily use the vector 1-norm, 2-norm, and Œ-norms. For
completeness, we briefly discuss their generalization: the vector p-norm.
Definition 1.2.4.3 Vector p-norm. Given p Ø 1, the vector p-norm Î · Îp : Cm æ R is
defined for x œ Cm by
Ò Am≠1 B1/p
ÿ
ÎxÎp = p
|‰0 + · · · + |‰m≠1 =
|p |p p
|‰i | .
i=0
⌃
Theorem 1.2.4.4 The vector p-norm is a norm.
The proof of this result is very similar to the proof of the fact that the 2-norm is a
norm. It depends on Hölder’s inequality, which is a generalization of the Cauchy-Schwarz
inequality:
Theorem 1.2.4.5 Hölder’s inequality. Let 1 Æ p, q Æ Œ with p1 + 1q = 1. If x, y œ Cm
then |xH y| Æ ÎxÎp ÎyÎq .
We skip the proof of Hölder’s inequality and Theorem 1.2.4.4. You can easily find proofs
for these results, should you be interested.
WEEK 1. NORMS 32
Remark 1.2.4.6 The vector 1-norm and 2-norm are obviously special cases of the vector
p-norm. It can be easily shown that the vector Œ-norm is also related:
Ponder This 1.2.4.3 Consider Homework 1.2.3.3. Try to elegantly formulate this question
in the most general way you can think of. How do you prove the result?
Ponder This 1.2.4.4 Consider the vector norm Î · Î : Cm æ R, the matrix A œ Cm◊n and
the function f : Cn æ R defined by f (x) = ÎAxÎ. For what matrices A is the function f a
norm?
YouTube: https://www.youtube.com/watch?v=aJgrpp7uscw
In 3-dimensional space, the notion of the unit ball is intuitive: the set of all points that
are a (Euclidean) distance of one from the origin. Vectors have no position and can have
more than three components. Still the unit ball for the 2-norm is a straight forward extension
to the set of all vectors with length (2-norm) one. More generally, the unit ball for any norm
can be defined:
Definition 1.2.5.1 Unit ball. Given norm Î · Î : Cm æ R, the unit ball with respect to
Î · Î is the set {x | ÎxÎ = 1} (the set of all vectors with norm equal to one). We will use
ÎxÎ = 1 as shorthand for {x | ÎxÎ = 1}. ⌃
Homework 1.2.5.1 Although vectors have no position, it is convenient to visualize a vector
x œ R2 by the point
A in the plane to which it extends when rooted at the origin. For example,
B
2
the vector x = can be so visualized with the point (2, 1). With this in mind, match
1
the pictures on the right corresponding to the sets on the left:
(a) ÎxÎ2 = 1. (1)
WEEK 1. NORMS 33
Solution.
(a) ÎxÎ2 = 1. (3)
YouTube: https://www.youtube.com/watch?v=Ov77sE90P58
YouTube: https://www.youtube.com/watch?v=qjZyKHvL13E
Homework 1.2.6.1 Fill out the following table:
Q
x R ÎxÎ1 ÎxÎŒ ÎxÎ2
1
c d
a 0 b
Q
0 R
1
c d
a 1 b
Q
1 R
1
c d
a ≠2 b
≠1
Solution.
Q
x R ÎxÎ1 ÎxÎŒ ÎxÎ2
1
c d
a 0 b 1 1 1
Q
0 R
1 Ô
c d
a 1 b 3 1 3
Q
1 R
1 Ò Ô
c d
a ≠2 b 4 2 12 + (≠2)2 + (≠1)2 = 6
≠1
WEEK 1. NORMS 35
In this course, norms are going to be used to reason that vectors are "small" or "large".
It would be unfortunate if a vector were small in one norm yet large in another norm.
Fortunately, the following theorem excludes this possibility:
Theorem 1.2.6.1 Equivalence of vector norms. Let ηΠ: Cm æ R and |||·||| : Cm æ R
both be vector norms. Then there exist positive scalars ‡ and · such that for all x œ Cm
|||x||| Æ · ÎxÎ,
|||x|||
= < algebra >
|||x|||
ÎxÎ
ÎxÎ
1 Æ < 2algebra >
supz”=0 ÎzÎ ÎxÎ
|||z|||
1 = < change
2 of variables: y = z/ÎzÎ >
supÎyÎ=1 |||y||| ÎxÎ
1 = < the 2set ÎyÎ = 1 is compact >
maxÎyÎ=1 |||y||| ÎxÎ
YouTube: https://www.youtube.com/watch?v=I1W6ErdEyoc
WEEK 1. NORMS 36
‡ÎxÎ Æ |||x|||.
From the first part of the proof of Theorem 1.2.6.1, we know that there exists a fl > 0 such
that
ÎxÎ Æ fl|||x|||
and hence
1
ÎxÎ Æ |||x|||.
fl
We conclude that
‡ÎxÎ Æ |||x|||
where ‡ = 1/fl.
Example 1.2.6.2
• Let x œ R2 . Use the picture
to determine the constant C such that ÎxÎ1 Æ CÎxÎŒ . Give a vector x for which
ÎxÎ1 = CÎxÎŒ .
• For x œ R2 and the C you determined in the first part of this problem, prove that
ÎxÎ1 Æ CÎxÎŒ .
• Let x œ Cm . Extrapolate from the last part the constant C such that ÎxÎ1 Æ CÎxÎŒ
and then prove the inequality. Give a vector x for which ÎxÎ1 = CÎxÎŒ .
Solution.
¶ The red square represents all vectors such that ÎxÎŒ = 1 and the white square
represents all vectors such that ÎxÎ1 = 2.
¶ All points on or outside the red square represent vectors y such that ÎyÎŒ Ø 1.
Hence if ÎyÎ1 = 2 then ÎyÎŒ Ø 1.
¶ Now, pick any z ”= 0. Then Î2z/ÎzÎ1 Î1 = 2). Hence
Î2z/ÎzÎ1 ÎŒ Ø 1
ÎxÎ1
= < definition >
|‰0 | + |‰1 |
Æ < algebra >
max(|‰0 |, |‰1 |) + max(|‰0 |, |‰1 |)
= < algebra >
2 max(|‰0 |, |‰1 |)
= < definition >
2ÎxÎŒ .
ÎxÎ1
= < definition >
qm≠1
i=0 |‰ i |
Æ < algebra2>
qm≠1 1
i=0 maxm≠1
j=0 |‰j |
= < algebra >
m maxm≠1
j=0 |‰j |
= < definition >
mÎxÎŒ .
Q R
1
c
c 1 d
d
Equality holds (i.e., ÎxÎ1 = mÎxÎŒ ) for x = c
c .. d.
d
a . b
1
Some will be able to go straight for the general result, while others will want to seek
inspiration from the picture and/or the specialized case where x œ R2 . ⇤
Homework 1.2.6.3 Let x œ Cm . The following table organizes the various bounds:
For each, determine the constant Cx,y and prove the inequality, including that it is a tight
inequality.
Hint: look at the hint!
Ô
Hint. ÎxÎ1 Æ mÎxÎ2 :
This is the hardest
Q one to prove.
R Do it last and use the following hint:
‰0 /|‰0 |
Consider y = c
c .. d
d and employ the Cauchy-Schwarz inequality.
a . b
‰m≠1 /|‰m≠1 |
Ô
Solution 1 (ÎxÎ1QÆ C1,2 ÎxÎ2 ). ÎxÎ
R 1
Æ mÎxÎ2 :
‰0 /|‰0 |
Consider y = a
c .. d
d . Then
c
. b
‰m≠1 /|‰m≠1 |
- - - - - -
-m≠1 - -m≠1 - -m≠1 -
-ÿ - -ÿ 2 - -ÿ -
|x y| =
H
- ‰i ‰i /|‰i |- = - |‰i | /|‰i |- = - |‰i |- = ÎxÎ1 .
- - - - - -
i=0 i=0 i=0
Ô
We also notice that ÎyÎ2 = m.
From the Cauchy-Swartz inequality we know that
Ô
ÎxÎ1 = |xH y| Æ ÎxÎ2 ÎyÎ2 = mÎxÎ2 .
WEEK 1. NORMS 39
If we now choose Q R
1
c . d
x = a .. d
c
b
1
Ô Ô
then ÎxÎ1 = m and ÎxÎ2 = m so that ÎxÎ1 = mÎxÎ2 .
Solution 2 (ÎxÎ1 Æ C1,Œ ÎxÎŒ ). ÎxÎ1 Æ mÎxÎŒ :
See Example 1.2.6.2.
Solution 3 (ÎxÎ2 Æ C2,1 ÎxÎ1 :). ÎxÎ2 Æ ÎxÎ1 :
ÎxÎ22
= < definition >
qm≠1 2
i=0 |‰i |
Æ < algebra >
1q 22
m≠1
i=0 |‰i |
= < definition >
ÎxÎ21 .
Taking the square root of both sides yields ÎxÎ2 Æ ÎxÎ1 .
If we now choose Q R
0
c . d
c .. d
c d
c d
c
c
0 d
d
x= c
c 1 d
d
c
c 0 d
d
c
c .. d
d
a . b
0
then ÎxÎ2 = ÎxÎ1 .
Ô
Solution 4 (ÎxÎ2 Æ C2,Œ ÎxÎŒ ). ÎxÎ2 Æ mÎxÎŒ :
ÎxÎ22
= < definition >
qm≠1 2
i=0 |‰i |
Æ < algebra >
qm≠1 1 22
i=0 maxm≠1
j=0 |‰j |
= < definition >
qm≠1 2
i=0 ÎxÎ Œ
= < algebra >
mÎxÎ2Œ .
Ô
Taking the square root of both sides yields ÎxÎ2 Æ mÎxÎŒ .
WEEK 1. NORMS 40
Consider Q R
1
c . d
x = a .. d
c
b
1
Ô Ô
then ÎxÎ2 = m and ÎxÎŒ = 1 so that ÎxÎ2 = mÎxÎŒ .
Solution 5 (ÎxÎŒ Æ CŒ,1 ÎxÎ1 :). ÎxÎŒ Æ ÎxÎ1 :
ÎxÎŒ
= < definition >
maxm≠1
i=0 |‰ i|
Æ < algebra >
qm≠1
i=0 |‰ i |
= < definition >
ÎxÎ1 .
Consider Q R
0
c . d
c .. d
c d
c d
c
c
0 d
d
x= c
c 1 d.
d
c
c 0 d
d
c
c .. d
d
a . b
0
Then ÎxÎŒ = 1 = ÎxÎ1 .
Solution 6 (ÎxÎŒ Æ CŒ,2 ÎxÎ2 ). ÎxÎŒ Æ ÎxÎ2 :
ÎxÎ2Œ
= < definition >
1 22
maxm≠1 i=0 |‰i |
= < algebra >
maxi=0 |‰i |2
m≠1
Consider Q R
0
c . d
c .. d
c d
c d
c
c
0 d
d
x= c
c 1 d.
d
c
c 0 d
d
c
c .. d
d
a . b
0
Then ÎxÎŒ = 1 = ÎxÎ2 .
Solution 7 (Table of constants).
Ô
ÎxÎ1 Æ mÎxÎ2 ÎxÎ1 Æ mÎxÎŒ
Ô
ÎxÎ2 Æ ÎxÎ1 ÎxÎ2 Æ mÎxÎŒ
ÎxÎŒ Æ ÎxÎ1 ÎxÎŒ Æ ÎxÎ2
Remark 1.2.6.3 The bottom line is that, modulo a constant factor, if a vector is "small"
in one norm, it is "small" in all other norms. If it is "large" in one norm, it is "large" in all
other norms.
• L(x + y) = L(x) + L(y). That is, adding first and then transforming yields the same
result as transforming first and then adding.
⌃
WEEK 1. NORMS 42
The importance of linear transformations comes in part from the fact that many problems
in science boil down to, given a function F : Cn æ Cm and vector y œ Cm , find x such that
F (x) = y. This is known as an inverse problem. Under mild conditions, F can be locally
approximated with a linear transformation L and then, as part of a solution method, one
would want to solve Lx = y.
The following theorem provides the link between linear transformations and matrices:
Theorem 1.3.1.2 Let L : Cn æ Cm be a linear transformation, v0 , v1 , · · · , vk≠1 œ Cn , and
x œ Ck . Then
where Q R
‰0
c . d
x = a .. d
c
b.
‰k≠1
Proof. A simple inductive proof yields the result. For details, see Week 2 of Linear Algebra:
Foundations to Frontiers (LAFF) [26]. ⌅
The following set of vectors ends up playing a crucial role throughout this course:
Definition 1.3.1.3 Standard basis vector. In this course, we will use ej œ Cm to denote
the standard basis vector with a "1" in the position indexed with j. So,
Q R
0
c . d
c .. d
c d
c d
c
c
0 d
d
ej = c
c 1 d
d Ω≠ j
c
c 0 d
d
c
c .. d
d
a . b
0
⌃
Key is the fact that any vector x œ Cn can be written as a linear combination of the
standard basis vectors of Cn :
Q R Q R Q R Q R
‰0 1 0 0
c 0 d 1 0
c d c d c d c d
c ‰1 d c d c d
x = c c .. dd = ‰ 0 c .. d
c d + ‰ c
1c .. d + · · · + ‰n≠1 c
d c .. d
d
a . b a . b a . b a . b
‰n≠1 0 0 1
= ‰0 e0 + ‰1 e1 + · · · + ‰n≠1 en≠1 .
Hence, if L is a linear transformation,
L(x) = L(‰0 e0 + ‰1 e1 + · · · + ‰n≠1 en≠1 )
= ‰0 L(e0 ) + ‰1 L(e1 ) + · · · + ‰n≠1 L(en≠1 ) .
¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝
a0 a1 an≠1
WEEK 1. NORMS 43
If we now let aj = L(ej ) (the vector aj is the transformation of the standard basis vector ej
and collect these vectors into a two-dimensional array of numbers:
1 2
A= a0 a1 · · · an≠1 (1.3.1)
then we notice that information for evaluating L(x) can be found in this array, since L can
then alternatively be computed by
The array A in (1.3.1) we call a matrix and the operation Ax = ‰0 a0 +‰1 a1 +· · ·+‰n≠1 an≠1
we call matrix-vector multiplication. Clearly
Ax = L(x).
Remark 1.3.1.4 Notation. In these notes, as a rule,
• Roman upper case letters are used to denote matrices.
Corresponding letters from these three sets are used to refer to a matrix, the row or columns
of that matrix, and the elements of that matrix. If A œ Cm◊n then
A
= < partition A by columns and rows >
Q R
aÂT0
1 2 c a T d
c Â1 d
a0 a1 · · · an≠1 = c
c .. d
d
a . b
aÂTm≠1
Q = < expose the elements ofRA >
–0,0 –0,1 · · · –0,n≠1
c d
c
c –1,0 –1,1 · · · –1,n≠1 d d
c .. .. .. d
c
c . . . d
d
c .. d
c
a . d
b
–m≠1,0 –m≠1,1 · · · –m≠1,n≠1
We now notice that the standard basis vector ej œ Cm equals the column of the m ◊ m
identity matrix indexed with j:
Q R Q R
1 0 ··· 0 eÂT0
c
c 0 1 ··· 0 d
d 1 2 c
c eÂT1 d
d
I= c .... . . .. d = e0 e1 · · · em≠1 = c .. d.
c
a . . . .
d
b
c
a .
d
b
0 0 ··· 1 eÂTm≠1
WEEK 1. NORMS 44
Remark 1.3.1.5 The important thing to note is that a matrix is a convenient represen-
tation of a linear transformation and matrix-vector multiplication is an alternative way for
evaluating that linear transformation.
YouTube: https://www.youtube.com/watch?v=cCFAnQmwwIw
Let’s investigate matrix-matrix multiplication and its relationship to linear transforma-
tions. Consider two linear transformations
LA : Ck æ Cm represented by matrix A
LB : Cn æ Ck represented by matrix B
and define
LC (x) = LA (LB (x)),
as the composition of LA and LB . Then it can be easily shown that LC is also a linear
transformation. Let m ◊ n matrix C represent LC . How are A, B, and C related? If we let
cj equal the column of C indexed with j, then because of the link between matrices, linear
transformations, and standard basis vectors
where bj equals the column of B indexed with j. Now, we say that C = AB is the product
of A and B defined by
1 2 1 2 1 2
c0 c1 · · · cn≠1 =A b0 b1 · · · bn≠1 = Ab0 Ab1 · · · Abn≠1
C := AB,
which you will want to pronounce "C becomes A times B" to distinguish assignment from
equality. If you think carefully how individual elements of C are computed, you will realize
that they equal the usual "dot product of rows of A with columns of B."
WEEK 1. NORMS 45
YouTube: https://www.youtube.com/watch?v=g_9RbA5EOIc
As already mentioned, throughout this course, it will be important that you can think
about matrices in terms of their columns and rows, and matrix-matrix multiplication (and
other operations with matrices and vectors) in terms of columns and rows. It is also impor-
tant to be able to think about matrix-matrix multiplication in three different ways. If we
partition each matrix by rows and by columns:
Q R Q R
1 2
cÂT0 1 2
aÂT0
C=
c
=c .. d c .. d
b , A = a0 · · · ak≠1 = a
. d .
c0 · · · cn≠1 c d,
a b
cÂTm≠1 aÂTm≠1
and Q R
Â
bT
1 2 0
B= =c
c .. d
b0 · · · bn≠1 a . d,
b
Â
bT k≠1
2. By rows: Q R Q R Q R
cÂT0 aÂT0 aÂT0 B
c .. d c
d := c .. d B = c
d c .. d
c
a . b a . b a . d.
b
cÂTm≠1 T
aÂm≠1 T
aÂm≠1 B
In other words, cÂTi = aÂTi B for all rows of C.
and Q R
B0,0 ··· B0,N ≠1
c .. .. d
c
a . . d,
b
BK≠1,0 · · · BK≠1,N ≠1
where the partitionings are "conformal", then
K≠1
ÿ
Ci,j = Ai,p Bp,j .
p=0
YouTube: https://www.youtube.com/watch?v=6DsBTz1eU7E
A matrix norm extends the notions of an absolute value and vector norm to matrices:
Definition 1.3.2.1 Matrix norm. Let ‹ : Cm◊n æ R. Then ‹ is a (matrix) norm if for
all A, B œ Cm◊n and all – œ C
• A ”= 0 ∆ ‹(A) > 0 (‹ is positive definite),
⌃
Homework 1.3.2.1 Let ‹ : C æ R be a matrix norm.
m◊n
ALWAYS/SOMETIMES/NEVER: ‹(0) = 0.
Hint. Review the proof on Homework 1.2.2.1.
Answer. ALWAYS.
Now prove it.
WEEK 1. NORMS 47
YouTube: https://www.youtube.com/watch?v=0ZHnGgrJXa4
Definition 1.3.3.1 The Frobenius norm. The Frobenius norm Î · ÎF : Cm◊n æ R is
defined for A œ Cm◊n by
ı̂
ı̂m≠1 n≠1 ı
ı |–0,0 |2 + ··· + |–0,n≠1 |2 +
ÎAÎF =
ıÿ ÿ
2 =
ı .. .. .. .. ..
Ù |–i,j | ı
Ù . . . . .
i=0 j=0
|–m≠1,0 |2 + · · · + |–m≠1,n≠1 |2 .
⌃
One can think of the Frobenius norm as taking the columns of the matrix, stacking them
on top of each other to create a vector of size m ◊ n, and then taking the vector 2-norm of
the result.
Homework 1.3.3.1 Partition m ◊ n matrix A by columns:
1 2
A= a0 · · · an≠1 .
Show that
n≠1
ÿ
ÎAÎ2F = Îaj Î22 .
j=0
WEEK 1. NORMS 48
Solution.
ÎAÎF
Ò
= < definition >
qm≠1 qn≠1 2
i=0 j=0 |–i,j |
=
Òq
< commutativity of addition >
n≠1 qm≠1 2
j=0 i=0 |–i,j |
=
Òq
< definition of vector 2-norm >
n≠1 2
j=0 Îaj Î2
Òq
= < commutativity of addition >
n≠1 qm≠1 2
j=0 i=0 |–i,j |
Òq
= < definition of vector 2-norm >
n≠1 2
j=0 Îaj Î2
= < definition of vector 2-norm >
ı̂.Q R.2
ı. a0 .
ı. .
ı.c d.
ı.c
ı.c
a1 d.
ı.c .. d. .
d.
ı.a
Ù. . b.
.
. an≠1 .2
In other words, it equals the vector 2-norm of the vector that is created by stacking the
columns of A on top of each other. One can then exploit the fact that the vector 2-norm
obeys the triangle inequality.
Homework 1.3.3.3 Partition m ◊ n matrix A by rows:
Q R
aÂT0
c
A=c .. d
a . d
b.
T
am≠1
Â
Show that
m≠1
ÿ
ÎAÎ2F = ÎaÂi Î22 ,
i=0
Solution.
ÎAÎF
Ò
= < definition >
qm≠1 qn≠1 2
i=0 j=0 |–i,j |
=
Òq
< definition of vector 2-norm >
m≠1 Â 2
i=0 Îai Î2 .
Let us review the definition of the transpose of a matrix (which we have already used
when defining the dot product of two real-valued vectors and when identifying a row in a
matrix):
Definition 1.3.3.2 Transpose. If A œ Cm◊n and
Q R
–0,0 –0,1 · · · –0,n≠1
c d
c
c –1,0 –1,1 · · · –1,n≠1 d
d
c .. .. .. d
A= c
c . . . d
d
c .. d
c
a . d
b
–m≠1,0 –m≠1,1 · · · –m≠1,n≠1
⌃
For complex-valued matrices, it is important to also define the Hermitian transpose
of a matrix:
Definition 1.3.3.3 Hermitian transpose. If A œ Cm◊n and
Q R
–0,0 –0,1 · · · –0,n≠1
c d
c
c –1,0 –1,1 · · · –1,n≠1 d
d
c .. .. .. d
A= c
c . . . d
d
c .. d
c
a . d
b
–m≠1,0 –m≠1,1 · · · –m≠1,n≠1
WEEK 1. NORMS 50
where A denotes the conjugate of a matrix, in which each element of the matrix is
conjugated. ⌃
We note that
• A = AT .
T
• If A œ Rm◊n , then AH = AT .
• If x œ Cm , then xH is defined consistent with how we have used it before.
• If – œ C, then –H = –.
(If you view the scalar as a matrix and then Hermitian transpose it, you get the matrix
with as only element –.)
Don’t Panic!. While working with complex-valued scalars, vectors, and matrices may appear
a bit scary at first, you will soon notice that it is not really much more complicated than
working with their real-valued counterparts.
Homework 1.3.3.4 Let A œ Cm◊k and B œ Ck◊n . Using what you once learned about
matrix transposition and matrix-matrix multiplication, reason that (AB)H = B H AH .
Solution.
(AB)H
= < XH = XT >
(AB) T
= < XT = X >
T
B H AH
Definition 1.3.3.4 Hermitian. A matrix A œ Cm◊m is Hermitian if and only if A = AH .
⌃
Obviously, if A œ Rm◊m , then A is a Hermitian matrix if and only if A is a symmetric
matrix.
Homework 1.3.3.5 Let A œ Cm◊n .
ALWAYS/SOMETIMES/NEVER: ÎAH ÎF = ÎAÎF .
WEEK 1. NORMS 51
Answer. ALWAYS
Solution.
ÎAÎF
Ò
= < definition >
qm≠1 qn≠1 2
i=0 j=0 |–i,j |
Òq
= < commutativity of addition >
n≠1 qm≠1 2
j=0 i=0 |–i,j |
Òq
= < change of variables >
n≠1 qm≠1 2
i=0 j=0 |–j,i |
Òq
= < algebra >
n≠1 qm≠1 2
i=0 j=0 |–j,i |
= < definition >
H
ÎA ÎF
Similarly, other matrix norms can be created from vector norms by viewing the matrix
as a vector. It turns out that, other than the Frobenius norm, these aren’t particularly
interesting in practice. An example can be found in Homework 1.6.1.6.
Remark 1.3.3.5 The Frobenius norm of a m ◊ n matrix is easy to compute (requiring
O(mn) computations). The functions f (A) = ÎAÎF and f (A) = ÎAÎ2F are also differentiable.
However, you’d be hard-pressed to find a meaningful way of linking the definition of the
Frobenius norm to a measure of an underlying linear transformation (other than by first
transforming that linear transformation into a matrix).
YouTube: https://www.youtube.com/watch?v=M6ZVBRFnYcU
Recall from Subsection 1.3.1 that a matrix, A œ Cm◊n , is a 2-dimensional array of
numbers that represents a linear transformation, L : Cn æ Cm , such that for all x œ Cn the
matrix-vector multiplication Ax yields the same result as does L(x).
The question "What is the norm of matrix A?" or, equivalently, "How ’large’ is A?" is the
same as asking the question "How ’large’ is L?" What does this mean? It suggests that what
we really want is a measure of how much linear transformation L or, equivalently, matrix A
"stretches" (magnifies) the "length" of a vector. This observation motivates a class of matrix
norms known as induced matrix norms.
WEEK 1. NORMS 52
ÎAxε
ÎAε,‹ = sup .
x œ Cn Îx΋
x ”= 0
⌃
Matrix norms that are defined in this way are said to be induced matrix norms.
Remark 1.3.4.2 In context, it is obvious (from the column size of the matrix) what the
size of vector x is. For this reason, we will write
ÎAxε ÎAxε
ÎAε,‹ = sup as ÎAε,‹ = sup .
x œ Cn Îx΋ x”=0 Îx΋
x ”= 0
Let us start by interpreting this. How "large" A is, as measured by ÎAε,‹ , is defined as
the most that A magnifies the length of nonzero vectors, where the length of the vector, x,
is measured with norm Î · ΋ and the length of the transformed vector, Ax, is measured with
norm Î · ε .
Two comments are in order. First,
ÎAxε
sup = sup ÎAxε .
x”=0 Îx΋ Îx΋ =1
supx”=0 ÎAxÎ
Îx΋
µ
Second, the "sup" (which stands for supremum) is used because we can’t claim yet that
there is a nonzero vector x for which
ÎAxε
sup
x”=0 Îx΋
sup ÎAxε
Îx΋ =1
WEEK 1. NORMS 53
is attained. In words, it is not immediately obvious that there is a vector for which the
supremum is attained. The fact is that there is always such a vector x. The proof again
depends on a result from real analysis, also employed in Proof 1.2.6.1, that states that
supxœS f (x) is attained for some vector x œ S as long as f is continuous and S is a compact
set. For any norm, ÎxÎ = 1 is a compact set. Thus, we can replace sup by max from here
on in our discussion.
We conclude that the following two definitions are equivalent definitions to the one we
already gave:
Definition 1.3.4.3 Let Î · ε : Cm æ R and Î · ΋ : Cn æ R be vector norms. Define
Î · ε,‹ : Cm◊n æ R by
ÎAxε
ÎAε,‹ = max .
x”=0 Îx΋
or, equivalently,
ÎAε,‹ = max ÎAxε .
Îx΋ =1
⌃
Remark 1.3.4.4 In this course, we will often encounter proofs involving norms. Such
proofs are much cleaner if one starts by strategically picking the most convenient of these
two definitions. Until you gain the intuition needed to pick which one is better, you may
have to start your proof using one of them and then switch to the other one if the proof
becomes unwieldy.
Theorem 1.3.4.5 Î · ε,‹ : Cm◊n æ R is a norm.
Proof. To prove this, we merely check whether the three conditions are met:
Let A, B œ Cm◊n and – œ C be arbitrarily chosen. Then
ÎAε,‹
= < definition >
maxx”=0 Îx΋
ÎAxε
ΖAε,‹
= < definition >
maxx”=0 ΖAxÎ
Îx΋
µ
ÎA + Bε,‹
= < definition >
maxx”=0 Îx΋
Î(A+B)xε
⌅
When Î · ε and Î · ΋ are the same norm (but possibly for different sizes of vectors), the
induced norm becomes
Definition 1.3.4.6 Define Î · ε : Cm◊n æ R by
ÎAxε
ÎAε = max
x”=0 Îxε
or, equivalently,
ÎAε = max ÎAxε .
Îxε =1
⌃
Homework 1.3.4.1 Consider the vector p-norm Î · Îp : Cn æ R and let us denote the
induced matrix norm by ||| · ||| : Cm◊n æ R for this exercise: |||A||| = maxx”=0 ÎAxÎ
ÎxÎp
p
.
ALWAYS/SOMETIMES/NEVER: |||y||| = ÎyÎp for y œ C . m
WEEK 1. NORMS 55
Answer. ALWAYS
Solution.
|||y|||
= < definition >
maxx”=0 ÎxÎp
ÎyxÎp
Ò
= < x is a scalar since y is a matrix with one column. Then ÎxÎp = Î(‰0 )Îp = p
|‰0 |p = |‰0 | >
max‰0 ”=0 |‰0 | ÎyÎ p
|‰0 |
= < algebra >
max‰0 ”=0 ÎyÎp
= < algebra >
ÎyÎp
This last exercise is important. One can view a vector x œ Cm as an m ◊ 1 matrix. What
this last exercise tells us is that regardless of whether we view x as a matrix or a vector,
ÎxÎp is the same.
We already encountered the vector p-norms as an important class of vector norms. The
matrix p-norm is induced by the corresponding vector norm, as defined by
Definition 1.3.4.7 Matrix p-norm. For any vector p-norm, define the corresponding
matrix p-norm Î · Îp : Cm◊n æ R by
ÎAxÎp
ÎAÎp = max or, equivalently, ÎAÎp = max ÎAxÎp .
x”=0 ÎxÎp ÎxÎp =1
⌃
Remark 1.3.4.8 The matrix p-norms with p œ {1, 2, Œ} will play an important role in our
course, as will the Frobenius norm. As the course unfolds, we will realize that in practice the
matrix 2-norm is of great theoretical importance but difficult to evaluate, except for special
matrices. The 1-norm, Œ-norm, and Frobenius norms are straightforward and relatively
cheap to compute (for an m ◊ n matrix, computing these costs O(mn) computation).
ÎAxÎ2
ÎAÎ2 = max = max ÎAxÎ2 .
x”=0 ÎxÎ2 ÎxÎ2 =1
⌃
Remark 1.3.5.2 The problem with the matrix 2-norm is that it is hard to compute. At
some point later in this course, you will find out that if A is a Hermitian matrix (A = AH ),
then ÎAÎ2 = |⁄0 |, where ⁄0 equals the eigenvalue of A that is largest in magnitude. You
may recall from your prior linear algebra experience that computing eigenvalues involves
computing the roots of polynomials, and for polynomials of degree three or greater, this is
a nontrivial task. We will see that the matrix 2-norm plays an important role in the theory
of linear algebra, but less so in practical computation.
Example 1.3.5.3 Show that
.A B.
.
. ”0 0 .
.
. . = max(|”0 |, |”1 |).
. 0 ”1 .
2
Solution.
YouTube: https://www.youtube.com/watch?v=B2rz0i5BB3A
[slides (PDF)] [LaTeX source] ⇤
Remark 1.3.5.4 The proof of the last example builds on a general principle: Showing that
maxxœD f (x) = – for some function f : D æ R can be broken down into showing that both
max f (x) Æ –
xœD
and
max f (x) Ø –.
xœD
In turn, showing that maxxœD f (x) Ø – can often be accomplished by showing that there
exists a vector y œ D such that f (y) = – since then
Homework 1.3.5.1 Let D œ Cm◊m be a diagonal matrix with diagonal entries ”0 , . . . , ”m≠1 .
Show that
ÎDÎ2 = max |”j |.
m≠1
j=0
ÎDÎ22
= < definition >
maxÎxÎ2 =1 ÎDxÎ22
= < diagonal vector multiplication >
.Q R.2
. ”0 ‰0 .
. .
.c
maxÎxÎ2 =1 ..c .. d.
a . d.
b.
. .
.
”m≠1 ‰m≠1 .2
= < definition >
q
maxÎxÎ2 =1 m≠1
i=0 |”i ‰i |
2
Next, we show that there is a vector y with ÎyÎ2 = 1 such that ÎDyÎ2 = maxm≠1
i=0 |”i |:
Let j be such that |”j | = maxm≠1
i=0 |”i | and choose y = ej . Then
ÎDyÎ2
= < y = ej >
ÎDej Î2
= < D = diag(”0 , . . . , ”m≠1 ) >
Δj ej Î2
= < homogeneity >
|”j |Îej Î2
= < Îej |2 = 1 >
|”j |
= < choice of j >
maxm≠1i=0 |”i|
ÎyxH Î2
= < definition >
maxÎzÎ2 =1 ÎyxH zÎ2
= < Î · Î2 is homogenius >
maxÎzÎ2 =1 |xH z|ÎyÎ2
Æ < Cauchy-Schwarz inequality >
maxÎzÎ2 =1 ÎxÎ2 ÎzÎ2 ÎyÎ2
= < ÎzÎ2 = 1 >
ÎxÎ2 ÎyÎ2 .
But also
ÎyxH Î2
= < definition >
maxz”=0 ÎyxH zÎ2 /ÎzÎ2
Ø < specific z >
H
Îyx xÎ2 /ÎxÎ2
= < xH x = ÎxÎ22 ; homogeneity >
2
ÎxÎ2 ÎyÎ2 /ÎxÎ2
= < algebra >
ÎyÎ2 ÎxÎ2 .
Hence
ÎyxH Î2 = ÎyÎ2 ÎxÎ2 .
Homework 1.3.5.3 Let A œ Cm◊n and aj its column indexed with j. ALWAYS/SOMETIMES/
NEVER: Îaj Î2 Æ ÎAÎ2 .
Hint. What vector has the property that aj = Ax?
Answer. ALWAYS.
Now prove it!
WEEK 1. NORMS 59
Solution.
Îaj Î2
=
ÎAej Î2
Æ
maxÎxÎ2 =1 ÎAxÎ2
=
ÎAÎ2 .
Homework 1.3.5.4 Let A œ Cm◊n . Prove that
• ÎAÎ2 = maxÎxÎ2 =ÎyÎ2 =1 |y H Ax|.
• ÎAH Î2 = ÎAÎ2 .
Hint. Proving ÎAÎ2 = maxÎxÎ2 =ÎyÎ2 =1 |y H Ax| requires you to invoke the Cauchy-Schwarz
inequality from Theorem 1.2.3.3.
Solution.
|y H Ax|
- = - instantiate >
<
- (Ax)H (Ax) -
- ÎAxÎ -
2
-
= - < z H z = ÎzÎ22 >
- ÎAxÎ2 -
- ÎAxÎ2 -
2
= < algebra >
ÎAxÎ2
= < x was chosen so that ÎAxÎ2 = ÎAÎ2 >
ÎAÎ2
Hence the bound is attained. We conclude that ÎAÎ2 = maxÎxÎ2 =ÎyÎ2 =1 |y H Ax|.
WEEK 1. NORMS 60
• ÎAH Î2 = ÎAÎ2 :
ÎAH Î2
= < first part of homework >
maxÎxÎ2 =ÎyÎ2 =1 |y H AH x|
= < |–| = |–| >
maxÎxÎ2 =ÎyÎ2 =1 |xH Ay|
= < first part of homework >
ÎAÎ2 .
ÎAH AÎ2
= < first part of homework >
maxÎxÎ2 =ÎyÎ2 =1 |y H AH Ax|
Ø < restricts choices of y >
maxÎxÎ2 =1 |xH AH Ax|
= < z H z = ÎzÎ22 >
maxÎxÎ2 =1 ÎAxÎ22
= < algebra >
1 22
maxÎxÎ2 =1 ÎAxÎ2
= < definition >
ÎAÎ22 .
So, ÎAH AÎ2 Ø ÎAÎ22 .
Now, let’s show that ÎAH AÎ2 Æ ÎAÎ22 . This would be trivial if we had already discussed
the fact that Î · · · Î2 is a submultiplicative norm (which we will in a future unit). But
let’s do it from scratch. First, we show that ÎAxÎ2 Æ ÎAÎ2 ÎxÎ2 for all (appropriately
sized) matrices A and x:
ÎAxÎ2
= < norms are homogeneus >
x
ÎA ÎxÎ 2
Î 2 ÎxÎ2
Æ < algebra >
maxÎyÎ2 =1 ÎAyÎ2 ÎxÎ2
= < definition of 2-norm
ÎAÎ2 ÎxÎ2 .
WEEK 1. NORMS 61
ÎAH AÎ2
= < definition of 2-norm >
maxÎxÎ2 =1 ÎAH AxÎ2
Æ < ÎAzÎ2 Æ ÎAÎ2 ÎzÎ2 >
maxÎxÎ2 =1 (ÎAH Î2 ÎAxÎ2 )
= < algebra >
ÎAH Î2 maxÎxÎ2 =1 ÎAxÎ2 )
= < definition of 2-norm >
ÎAH Î2 ÎAÎ2
= < ÎAH Î2 = ÎAÎ >
2
ÎAÎ2
Alternatively, as suggested by one of the learners in the course, we can use the Cauchy-
Schwarz inequality:
ÎAH AÎ2
= < part (a) of this homework >
maxÎxÎ2 =ÎyÎ2 =1 |xH AH Ay|
= < simple manipulation >
maxÎxÎ2 =ÎyÎ2 =1 |(Ax)H Ay|
Æ < Cauchy-Schwarz inequality >
maxÎxÎ2 =ÎyÎ2 =1 ÎAxÎ2 ÎAyÎ2
= < algebra >
maxÎxÎ2 =1 ÎAxÎ2 maxÎyÎ2 =1 ÎAyÎ2
= < definition >
ÎAÎ2 ÎAÎ2
= < algebra >
ÎAÎ22
Q R
A0,0 ··· A0,N ≠1
Homework 1.3.5.5 Partition A = c
c .. .. d
a . . d.
b
AM ≠1,0 · · · AM ≠1,N ≠1
ALWAYS/SOMETIMES/NEVER: ÎAi,j Î2 Æ ÎAÎ2 .
Hint. Using Homework 1.3.5.4 choose vj and wi such that ÎAi,j Î2 = |wiH Ai,j vj |.
Solution. Choose vj and wi such that ÎAi,j Î2 = |wiH Ai,j vj |. Next, choose v and w such
WEEK 1. NORMS 62
that Q R Q R
0 0
c . d c . d
c . d c . d
c . d c . d
c d c d
c
c 0 d
d
c
c 0 d
d
c d c d
v= c vj d, w= c wi d.
c d c d
c
c 0 d
d
c
c 0 d
d
c .. d c .. d
c
a . d
b
c
a . d
b
0 0
You can check (using partitioned multiplication and the last homework) that wH Av =
wiH Ai,j vj . Then, by Homework 1.3.5.4
ÎAÎ2
= < last homework >
maxÎxÎ2 =ÎyÎ2 =1 |y H Ax|
Ø < w and v are specific vectors >
|wH Av|
= < partitioned multiplication >
|wiH Ai,j vj |
= < how wi and vj were chosen >
ÎAi,j Î2 .
YouTube: https://www.youtube.com/watch?v=QTKZdGQ2C6w
The matrix 1-norm and matrix Œ-norm are of great importance because, unlike the
matrix 2-norm, they are easy and relatively cheap to compute.. The following exercises
show how to practically compute the matrix 1-norm and Œ-norm.
1 2
Homework 1.3.6.1 Let A œ Cm◊n and partition A = a0 a1 · · · an≠1 . ALWAYS/
SOMETIMES/NEVER: ÎAÎ1 = max0Æj<n Îaj Î1 .
Hint. Prove it for the real valued case first.
Answer. ALWAYS
WEEK 1. NORMS 63
ÎAÎ1
= < definition >
maxÎxÎ1 =1 ÎAxÎ1
= <. expose the columns of A and elements of x >
Q R.
. ‰ 0
.
. .
.1 2c d.
. c ‰1 d.
maxÎxÎ1 =1 .. a0 a1 · · · an≠1 cc .. d .
d.
.
.
a . b. .
. ‰n≠1 .1
= < definition of matrix-vector multiplication >
maxÎxÎ1 =1 Ή0 a0 + ‰1 a1 + · · · + ‰n≠1 an≠1 Î1
Æ < triangle inequality >
maxÎxÎ1 =1 (Ή0 a0 Î1 + Ή1 a1 Î1 + · · · + Ήn≠1 an≠1 Î1 )
= < homogeneity >
maxÎxÎ1 =1 (|‰0 |Îa0 Î1 + |‰1 |Îa1 Î1 + · · · + |‰n≠1 |Îan≠1 Î1 )
Æ < choice of aJ >
maxÎxÎ1 =1 (|‰0 |ÎaJ Î1 + |‰1 |ÎaJ Î1 + · · · + |‰n≠1 |ÎaJ Î1 )
= < factor out ÎaJ Î1 >
maxÎxÎ1 =1 (|‰0 | + |‰1 | + · · · + |‰n≠1 |) ÎaJ Î1
= < algebra >
ÎaJ Î1 .
Also,
ÎaJ Î1
= < eJ picks out column J >
ÎAeJ Î1
Æ < eJ is a specific choice of x >
maxÎxÎ1 =1 ÎAxÎ1 .
Hence
ÎaJ Î1 Æ max ÎAxÎ1 Æ ÎaJ Î1
ÎxÎ1 =1
Notice that in this exercise aÂi is really (aÂTi )T since aÂTi is the label for the ith row of matrix
WEEK 1. NORMS 64
A.
Hint. Prove it for the real valued case first.
Answer. ALWAYS Q R
aÂT0
Solution. Partition A = c
c .. d
a b. Then
. d
T
aÂm≠1
ÎAÎŒ
= < definition >
maxÎxÎŒ =1 ÎAxÎŒ
= < .Q expose rows R .
>
. a T .
. 0 .
.c
maxÎxÎŒ =1 ..c .. d .
a . b .
d x.
. .
. a  Tm≠1 .
Œ
= < .Q matrix-vectorR.
multiplication >
. aÂT0 x .
. .
.c
maxÎxÎŒ =1 ..c .. d.
a . d.
b.
. .
. a  Tm≠1 x .
Œ
= < 1 definition of Î · 2· · ÎŒ >
maxÎxÎŒ =1 max0Æi<m |aÂTi x|
= < expose aÂTi x >
q
maxÎxÎŒ =1 max0Æi<m | n≠1 p=0 –i,p ‰p |
Æ < triangle inequality >
q
maxÎxÎŒ =1 max0Æi<m n≠1 p=0 |–i,p ‰p |
= < algebra >
q
maxÎxÎŒ =1 max0Æi<m n≠1 p=0 (|–i,p ||‰p |)
Æ < algebra >
q
maxÎxÎŒ =1 max0Æi<m n≠1 p=0 (|–i,p |(maxk |‰k |))
= < definition of Î · ÎŒ >
q
maxÎxÎŒ =1 max0Æi<m n≠1 p=0 (|–i,p |ÎxÎŒ )
= < ÎxÎŒ = 1 >
q
max0Æi<m n≠1 p=0 |–i,p |
= < definition of Î · Î1 >
max0Æi<m ÎaÂi Î1
ÎAÎŒ
= < definition >
maxÎxÎ1 =1 ÎAxÎŒ
= <.Qexpose rows R .
>
. a T .
. 0 .
maxÎxÎ1 =1 ..c
.c .. d .
a . b .
d x.
. .
. a  Tm≠1 .
Œ
.Q
Ø <
R .
y is a specific x>
. aÂ0T .
. .
.c .. d .
.c
.a . b . d y.
. .
. a Tm≠1 .
Œ
.Q
= <Rmatrix-vector
.
multiplication >
. a T
y .
. 0 .
.c .. d.
.c
.a . d.
b.
. .
. aÂTm≠1 y .Œ
Ø < algebra >
|aÂTk y|
= < choice of y >
Î ak Î 1 .
Â
= < choice of k >
max0Æi<m ÎaÂi Î1
Remark 1.3.6.1 The last homework provides a hint as to how to remember how to compute
the matrix 1-norm and Œ-norm: Since ÎxÎ1 must result in the same value whether x is
considered as a vector or as a matrix, we can remember that the matrix 1-norm equals the
maximum of the 1-norms of the columns of the matrix: Similarly, considering ÎxÎŒ as a
vector norm or as matrix norm reminds us that the matrix Œ-norm equals the maximum of
the 1-norms of vectors that become the rows of the matrix.
YouTube: https://www.youtube.com/watch?v=Csqd4AnH7ws
WEEK 1. NORMS 66
Q
A R
ÎAÎ1 ÎAÎŒ ÎAÎF ÎAÎ2
1 0 0
c d
a 0 1 0 b
0 0 1
Q R
1 1 1
c
c 1 1 1 d
d
c d
a 1 1 1 b
Q
1 1 1 R
0 1 0
c d
a 0 1 0 b
0 1 0
Hint. For the second and third, you may want to use Homework 1.3.5.2 when computing
the 2-norm.
Solution.
Q
A R
ÎAÎ1 ÎAÎŒ ÎAÎF ÎAÎ2
1 0 0 Ô
c d
a 0 1 0 b 1 1 3 1
0 0 1
Q R
1 1 1
c 1 1 1 d Ô Ô
c d
c d 4 3 2 3 2 3
a 1 1 1 b
Q
1 1 1 R
0 1 0 Ô Ô
c d
a 0 1 0 b 3 1 3 3
0 1 0
To compute the 2-norm of I, notice that
norm, it is "small" in all other norms, and if it is "large" in one norm, it is "large" in all other
norms. The same is true for matrix norms.
Theorem 1.3.7.1 Equivalence of matrix norms. Let Î · Î : Cm◊n æ R and ||| · ||| :
Cm◊n æ R both be matrix norms. Then there exist positive scalars ‡ and · such that for all
A œ Cm◊n
‡ÎAÎ Æ |||A||| Æ · ÎAÎ.
Proof. The proof again builds on the fact that the supremum over a compact set is achieved
and can be replaced by the maximum.
We will prove that there exists a · such that for all A œ Cm◊n
|||A||| Æ · ÎAÎ
For each, prove the inequality, including that it is a tight inequality for some nonzero A.
(Skip ÎAÎF Æ?ÎAÎ2 : we will revisit it in Week 2.)
Solution.
Ô
• ÎAÎ1 Æ mÎAÎ2 :
ÎAÎ1
= < definition >
maxx”=0 ÎAxÎ
ÎxÎ1
1
Ô
Æ Ô< ÎzÎ1 Æ mÎzÎ2 >
maxx”=0 mÎAxÎ
ÎxÎ1
2
• ÎAÎ1 Æ mÎAÎŒ :
ÎAÎ1
= < definition >
maxx”=0 ÎxÎ1
ÎAxÎ1
ÎAÎ1
Ô Æ < last part >
mÎAÎ2
Ô Æ < some other part:ÎAÎ2 Æ ÎAÎF >
mÎAÎF .
Q R
1
c
c 1 d
d
Equality is attained for A = c
c .. d.
d
a . b
1
Ô
• ÎAÎ2 Æ nÎAÎ1 :
ÎAÎ2
= < definition >
maxx”=0 ÎAxÎ
ÎxÎ2
2
ÎAÎ2
= < definition >
maxx”=0 ÎAxÎ
ÎxÎ2
2
Ô
Æ Ô< ÎzÎ2 Æ mÎzÎŒ >
maxx”=0 mÎAxÎ
ÎxÎ2
Œ
• ÎAÎ2 Æ ÎAÎF :
(See Homework 1.3.7.2, which requires the SVD, as mentioned...)
YouTube: https://www.youtube.com/watch?v=TvthvYGt9x8
There are a number of properties that we would like for a matrix norm to have (but not
all norms do have). Recalling that we would like for a matrix norm to measure by how much
a vector is "stretched," it would be good if for a given matrix norm, Î · · · Î : Cm◊n æ R,
there are vector norms Î · ε : Cm æ R and Î · ΋ : Cn æ R such that, for arbitrary nonzero
x œ Cn , the matrix norm bounds by how much the vector is stretched:
ÎAxε
Æ ÎAÎ
Îx΋
or, equivalently,
ÎAxε Æ ÎAÎÎx΋
where this second formulation has the benefit that it also holds if x = 0. When this rela-
tionship between the involved norms holds, the matrix norm is said to be subordinate to the
WEEK 1. NORMS 71
vector norms:
Definition 1.3.8.1 Subordinate matrix norm. A matrix norm Î · Î : Cm◊n æ R is said
to be subordinate to vector norms Î · ε : Cm æ R and Î · ΋ : Cn æ R if, for all x œ Cn ,
ÎAxε Æ ÎAÎÎx΋ .
If Î · ε and Î · ΋ are the same norm (but perhaps for different m and n), then Î · Î is said
to be subordinate to the given vector norm. ⌃
Fortunately, all the norms that we will employ in this course are subordinate matrix
norms.
Homework 1.3.8.1 ALWAYS/SOMETIMES/NEVER: The Frobenius norm is subordinate
to the vector 2-norm.
Answer. TRUE
Now prove it.
Solution. W.l.o.g., assume x ”= 0.
ÎAxÎ2 ÎAyÎ2
ÎAxÎ2 = ÎxÎ2 Æ max ÎxÎ2 = max ÎAyÎ2 ÎxÎ2 = ÎAÎ2 ÎxÎ2 .
ÎxÎ2 y”=0 ÎyÎ2 ÎyÎ2 =1
So, it suffices to show that ÎAÎ2 Æ ÎAÎF . But we showed that in Homework 1.3.7.2.
Theorem 1.3.8.2 Induced matrix norms, Î · ε,‹ : Cm◊n æ R, are subordinate to the norms,
Î · ε and Î · ΋ , that induce them.
Proof. W.l.o.g. assume x ”= 0. Then
ÎAxε ÎAyε
ÎAxε = Îx΋ Æ max Îx΋ = ÎAε,‹ Îx΋ .
Îx΋ y” = 0 Îy΋
⌅
Corollary 1.3.8.3 Any matrix p-norm is subordinate to the corresponding vector p-norm.
Another desirable property that not all norms have is that
ÎABÎ Æ ÎAÎÎBÎ.
This requires the given norm to be defined for all matrix sizes..
Definition 1.3.8.4 Consistent matrix norm. A matrix norm Î · Î : Cm◊n æ R is said
to be a consistent matrix norm if it is defined for all m and n, using the same formula for
all m and n. ⌃
Obviously, this definition is a bit vague. Fortunately, it is pretty clear that all the matrix
norms we will use in this course, the Frobenius norm and the p-norms, are all consistently
defined for all matrix sizes.
Definition 1.3.8.5 Submultiplicative matrix norm. A consistent matrix norm Î · Î :
WEEK 1. NORMS 72
ÎABÎ Æ ÎAÎÎBÎ.
⌃
Theorem 1.3.8.6 Let Î · Î : Cn æ R be a vector norm defined for all n. Define the
corresponding induced matrix norm as
ÎAxÎ
ÎAÎ = max = max ÎAxÎ.
x”=0 ÎxÎ ÎxÎ=1
Then for any A œ Cm◊k and B œ Ck◊n the inequality ÎABÎ Æ ÎAÎÎBÎ holds.
In other words, induced matrix norms are submultiplicative. To prove this theorem, it
helps to first prove a simpler result:
Lemma 1.3.8.7 Let ηΠ: Cn æ R be a vector norm defined for all n and let ηΠ: Cm◊n æ R
be the matrix norm it induces. Then ÎAxÎ Æ ÎAÎÎxÎ..
Proof. If x = 0, the result obviously holds since then ÎAxÎ = 0 and ÎxÎ = 0. Let x ”= 0.
Then
ÎAxÎ ÎAxÎ
ÎAÎ = max Ø .
x”=0 ÎxÎ ÎxÎ
Rearranging this yields ÎAxÎ Æ ÎAÎÎxÎ. ⌅
We can now prove the theorem:
Proof.

      ‖AB‖
    =    < definition of induced matrix norm >
      max_{‖x‖=1} ‖ABx‖
    =    < associativity >
      max_{‖x‖=1} ‖A(Bx)‖
    ≤    < lemma >
      max_{‖x‖=1} ( ‖A‖ ‖Bx‖ )
    ≤    < lemma >
      max_{‖x‖=1} ( ‖A‖ ‖B‖ ‖x‖ )
    =    < ‖x‖ = 1 >
      ‖A‖ ‖B‖.

■
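The following MATLAB fragment is a quick numerical sanity check (not a proof, and not part of the original notes) of submultiplicativity for the induced 1-, 2-, and ∞-norms; the matrix sizes are arbitrary.

    % Sanity check of || A B || <= || A || || B || for induced p-norms.
    A = randn( 4, 3 );
    B = randn( 3, 5 );
    for p = [ 1 2 inf ]
      [ norm( A * B, p ), norm( A, p ) * norm( B, p ) ]   % first entry <= second entry
    end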
Homework 1.3.8.2 Show that ‖Ax‖_μ ≤ ‖A‖_{μ,ν} ‖x‖_ν.
Solution. W.l.o.g. assume that x ≠ 0. Then

    ‖A‖_{μ,ν} = max_{y≠0} ‖Ay‖_μ / ‖y‖_ν ≥ ‖Ax‖_μ / ‖x‖_ν.

Hence ‖AB‖_F^2 ≤ ‖A‖_F^2 ‖B‖_F^2. Taking the square root of both sides leaves us with ‖AB‖_F ≤
‖A‖_F ‖B‖_F.
This proof brings to the forefront that the notation ã_i^T leads to some possible confusion.
In this particular situation, it is best to think of ã_i as a vector that, when transposed,
becomes the row of A indexed with i, so that (ã_i^T)^H = conj(ã_i) (where, recall,
conj(x) equals the vector x with all its entries conjugated). Perhaps it is best to just work through
this problem for the case where A and B are real-valued, and not worry too much about the
details related to the complex-valued case...
Homework 1.3.8.5 For A ∈ C^{m×n} define
Answer.
1. TRUE
2. TRUE
Solution.
1. This is a norm. You can prove this by checking the three conditions.
2. Consider the vector

       x = ( |α_{k,0}| / α_{k,0} ; … ; |α_{k,n−1}| / α_{k,n−1} ).
1.3.9 Summary
YouTube: https://www.youtube.com/watch?v=DyoT2tJhxIs
YouTube: https://www.youtube.com/watch?v=QwFQNAPKIwk
A question we will run into later in the course asks how accurate we can expect the
solution of a linear system to be if the right-hand side of the system has error in it.
Formally, this can be stated as follows: We wish to solve Ax = b, where A œ Cm◊m but
the right-hand side has been perturbed by a small vector so that it becomes b + δb.
Remark 1.4.1.1 Notice how the δ touches the b. This is meant to convey that δb is a
symbol that represents a vector rather than the vector b that is multiplied by a scalar δ.
The question now is how a relative error in b is amplified into a relative error in the
solution x.
We would like to determine a formula, κ(A, b, δb), that gives us a bound on how much a
relative error in b is potentially amplified into a relative error in the solution x:

    ‖δx‖ / ‖x‖ ≤ κ(A, b, δb) ‖δb‖ / ‖b‖.

We assume that A has an inverse since otherwise there may be no solution or there may be
an infinite number of solutions. To find an expression for κ(A, b, δb), we notice that subtracting

    A(x + δx) = b + δb
    Ax        = b
    ------------------
    A δx      =     δb

yields δx = A^{-1} δb, so that ‖δx‖ ≤ ‖A^{-1}‖ ‖δb‖. Also, b = Ax implies ‖b‖ ≤ ‖A‖ ‖x‖ and hence
1/‖x‖ ≤ ‖A‖ / ‖b‖. Combining these observations yields

    ‖δx‖ / ‖x‖ ≤ ‖A‖ ‖A^{-1}‖ ‖δb‖ / ‖b‖,    where κ(A) = ‖A‖ ‖A^{-1}‖.
namely the x̂ for which the maximum is attained. This is the direction of maximal
magnification. Pick b̂ = A x̂.
• There is a δb̂ for which

      ‖A^{-1}‖ = max_{x≠0} ‖A^{-1} x‖ / ‖x‖ = ‖A^{-1} δb̂‖ / ‖δb̂‖.

For this choice of b̂ and δb̂, consider

      A(x + δx) = b̂ + δb̂.
Homework 1.4.1.2 Let ‖·‖ be a vector norm and corresponding induced matrix norm,
and A a nonsingular matrix.
TRUE/FALSE: κ(A) = ‖A‖ ‖A^{-1}‖ ≥ 1.
Answer. TRUE
Solution.

      1
    =    < last homework >
      ‖I‖
    =    < A is invertible >
      ‖A A^{-1}‖
    ≤    < ‖·‖ is submultiplicative >
      ‖A‖ ‖A^{-1}‖.

Remark 1.4.1.3 This last exercise shows that there will always be choices for b and δb
for which the relative error in b is, at best, directly translated into an equal relative error in the
solution (which happens only if κ(A) = 1).
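To make the bound concrete, here is a small MATLAB experiment (a sketch, not part of the original notes) that perturbs the right-hand side of a linear system with an ill-conditioned matrix and compares the observed amplification with κ(A); hilb, MATLAB's Hilbert matrix, is used here only as a convenient ill-conditioned example.

    % Observe how a relative error in b can be amplified by kappa( A ).
    n  = 8;
    A  = hilb( n );                       % a notoriously ill-conditioned matrix
    x  = rand( n, 1 );
    b  = A * x;
    db = 1e-10 * rand( n, 1 );            % small perturbation of the right-hand side
    dx = A \ ( b + db ) - x;              % resulting change in the solution
    amplification = ( norm( dx ) / norm( x ) ) / ( norm( db ) / norm( b ) )
    kappa         = cond( A )             % amplification is bounded by kappa (2-norm)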
• |α| =
• |α − α̂| =
• |α − α̂| / |α| =
• log_10( |α − α̂| / |α| ) =

• |α| = 14.24123
• |α − α̂| = 0.006
• |α − α̂| / |α| ≈ 0.00042
• log_10( |α − α̂| / |α| ) ≈ −3.4
YouTube: https://www.youtube.com/watch?v=LGBFyjhjt6U
We now revisit the material from the launch for the semester. We understand that when
solving Lx = b, even a small relative change to the right-hand side b can amplify into a large
relative change in the solution x̂ if the condition number of the matrix is large.
Homework 1.4.3.1 Change the script Assignments/Week01/matlab/Test_Upper_triangular_solve_100.m
to also compute the condition number of matrix U, κ(U). Investigate what
happens to the condition number as you change the problem size n.
Since in the example the upper triangular matrix is generated to have random values as
its entries, chances are that at least one element on its diagonal is small. If that element
were zero, then the triangular matrix would be singular. Even if it is not exactly zero, the
condition number of U becomes very large if the element is small.
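As a starting point for this homework, the following MATLAB fragment (a sketch; the actual script may build U differently) generates a random upper triangular matrix and reports its condition number:

    % Sketch: condition number of a random upper triangular matrix.
    n = 100;
    U = triu( rand( n, n ) );      % random upper triangular matrix
    kappa_U = cond( U, inf )       % kappa_inf( U ) = ||U||_inf ||inv(U)||_inf
    % Try increasing n and observe how kappa_U (typically) grows rapidly.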
1.5 Enrichments
1.5.1 Condition number estimation
It has been observed that high-quality numerical software should not only provide routines
for solving a given problem, but, when possible, should also (optionally) provide the user
with feedback on the conditioning (sensitivity to changes in the input) of the problem. In
this enrichment, we relate this to what you have learned this week.
Given a vector norm Î · Î and induced matrix norm Î · Î, the condition number of matrix
A using that norm is given by Ÿ(A) = ÎAÎÎA≠1 Î. When trying to practically compute the
condition number, this leads to two issues:
• Which norm should we use? A case has been made in this week that the 1-norm and
∞-norm are candidates since they are easy and cheap to compute.
• It appears that A^{-1} needs to be computed. We will see in future weeks that this
is costly: O(m^3) computation when A is m × m. This is generally considered to be
expensive.
This leads to the question "Can a reliable estimate of the condition number be cheaply
computed?" In this unit, we give a glimpse of how this can be achieved and then point the
interested learner to related papers.
Partition the m × m matrix A by rows:

    A = ( ã_0^T ; … ; ã_{m−1}^T ).

We recall that
• The ∞-norm is defined by

      ‖A‖_∞ = max_{‖x‖_∞=1} ‖Ax‖_∞.

• From Homework 1.3.6.2, we know that the ∞-norm can be practically computed as

      ‖A‖_∞ = max_{0≤i<m} ‖ã_i‖_1,

  where ã_i = (ã_i^T)^T. This means that ‖A‖_∞ can be computed in O(m^2) operations.
• From the solution to Homework 1.3.6.2, we know that there is a vector x with |χ_j| = 1
  for 0 ≤ j < m such that ‖A‖_∞ = ‖Ax‖_∞. This x satisfies ‖x‖_∞ = 1.
More precisely: ‖A‖_∞ = ‖ã_k‖_1 for some k. For simplicity, assume A is real valued.
Then

    ‖A‖_∞ = |α_{k,0}| + · · · + |α_{k,m−1}|
          = α_{k,0} χ_0 + · · · + α_{k,m−1} χ_{m−1},

where each χ_j = ±1 is chosen so that χ_j α_{k,j} = |α_{k,j}|. That vector x then has the
property that ‖A‖_∞ = ‖ã_k‖_1 = ‖Ax‖_∞.
From this we conclude that

    ‖A‖_∞ = max_{x∈S} ‖Ax‖_∞,

where S is the set of all vectors x with elements χ_i ∈ {−1, 1}. Applying the same reasoning to
U^{-1}, where U is the triangular matrix whose condition number we wish to estimate,

    ‖U^{-1}‖_∞ = max_{x∈S} ‖U^{-1} x‖_∞,

which is equivalent to

    ‖U^{-1}‖_∞ = max_{z∈T} ‖z‖_∞,

where T is the set of all vectors z that satisfy U z = y for some y with elements ψ_i ∈ {−1, 1}.
So, we could solve U z = y for all vectors y ∈ S, compute the ∞-norm for all those vectors
z, and pick the maximum of those values. But that is not practical.
One simple solution is to try to construct a vector y that results in a large amplification
(in the ∞-norm) when solving U z = y, and to then use that amplification as an estimate
for ‖U^{-1}‖_∞. So how do we do this? Consider

    [ ⋱   ⋮              ⋮           ] [    ⋮    ]   [    ⋮    ]
    [ 0  ⋯  υ_{m−2,m−2}  υ_{m−2,m−1} ] [ ζ_{m−2} ] = [ ψ_{m−2} ],
    [ 0  ⋯  0            υ_{m−1,m−1} ] [ ζ_{m−1} ]   [ ψ_{m−1} ]

where the matrix is U, the unknown vector is z, and the right-hand side is y.
Here is a heuristic for picking y ∈ S:
• We want to pick ψ_{m−1} ∈ {−1, 1} in order to construct a vector y ∈ S. We can
  pick ψ_{m−1} = 1, since picking it equal to −1 will simply carry through negation in the
  appropriate way in the scheme we are describing.
  From this ψ_{m−1} we can compute ζ_{m−1}.
• Now,

      υ_{m−2,m−2} ζ_{m−2} + υ_{m−2,m−1} ζ_{m−1} = ψ_{m−2},

  where ζ_{m−1} is known and ψ_{m−2} can be strategically chosen. We want z to have a large
  ∞-norm and hence a heuristic is to now pick ψ_{m−2} ∈ {−1, 1} in such a way that ζ_{m−2}
  is as large as possible in magnitude.
  With this ψ_{m−2} we can compute ζ_{m−2}.
• And so forth!
When done, the magnification equals ‖z‖_∞ = |ζ_k|, where ζ_k is the element of z with largest
magnitude. This approach provides an estimate for ‖U^{-1}‖_∞ with O(m^2) operations. A
MATLAB sketch of this heuristic is given below.
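The following is a minimal MATLAB sketch of the heuristic just described (an illustration only, not the actual LINPACK estimator); it assumes U is a nonsingular upper triangular matrix, and the function name condest_inf_sketch is made up for this illustration.

    function est = condest_inf_sketch( U )
      % Estimate || inv( U ) ||_inf for nonsingular upper triangular U by
      % greedily choosing y with entries in { -1, 1 } while solving U z = y
      % via back substitution, so that each | z( i ) | is as large as possible.
      m = size( U, 1 );
      z = zeros( m, 1 );
      for i = m:-1:1
        s = U( i, i+1:m ) * z( i+1:m );     % contribution of already-computed entries
        if abs( 1 - s ) >= abs( -1 - s )    % pick psi_i = +1 or -1 ...
          psi = 1;
        else
          psi = -1;                         % ... whichever makes | z( i ) | larger
        end
        z( i ) = ( psi - s ) / U( i, i );
      end
      est = norm( z, inf );                 % estimate of || inv( U ) ||_inf
    end

The condition number estimate is then norm( U, inf ) * condest_inf_sketch( U ), which can be compared against cond( U, inf ).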
The described method underlies the condition number estimator for LINPACK, developed
in the 1970s [16], as described in [11]:
• A.K. Cline, C.B. Moler, G.W. Stewart, and J.H. Wilkinson, An estimate for the con-
dition number of a matrix, SIAM J. Numer. Anal., 16 (1979).
The method discussed in that paper yields a lower bound on ÎA≠1 ÎŒ and with that on
ŸŒ (A).
Remark 1.5.1.1 Alan Cline has his office on our floor at UT-Austin. G.W. (Pete) Stewart
was Robert’s Ph.D. advisor. Cleve Moler is the inventor of Matlab. John Wilkinson received
the Turing Award for his contributions to numerical linear algebra.
More sophisticated methods are discussed in [21]:
• N. Higham, A Survey of Condition Number Estimates for Triangular Matrices, SIAM
Review, 1987.
His methods underlie the LAPACK [1] condition number estimator and are remarkably
accurate: most of the time they provide an almost exact estimate of the actual condition
number.
1.6 Wrap Up
1.6.1 Additional homework
Homework 1.6.1.1 For e_j ∈ R^n (a standard basis vector), compute
• ‖e_j‖_2 =
• ‖e_j‖_1 =
• ‖e_j‖_∞ =
• ‖e_j‖_p =
Homework 1.6.1.2 For I ∈ R^{n×n} (the identity matrix), compute
• ‖I‖_1 =
• ‖I‖_∞ =
• ‖I‖_2 =
• ‖I‖_p =
• ‖I‖_F =
Homework 1.6.1.3 Let D = diag( δ_0, δ_1, …, δ_{n−1} ) (a diagonal matrix). Compute
• ‖D‖_1 =
• ‖D‖_∞ =
• ‖D‖_p =
• ‖D‖_F =
Homework 1.6.1.4 Let x = ( x_0 ; x_1 ; … ; x_{N−1} ), partitioned into subvectors x_i, and let 1 ≤ p < ∞ or p = ∞.
ALWAYS/SOMETIMES/NEVER: ‖x_i‖_p ≤ ‖x‖_p.
Homework 1.6.1.5 For

    A = [  1  2  −1 ]
        [ −1  1   0 ],

compute
• ‖A‖_1 =
• ‖A‖_∞ =
• ‖A‖_F =
Homework 1.6.1.6 For A ∈ C^{m×n} define

    ‖A‖ = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} |α_{i,j}|.

Prove that
• ‖A‖_F = ‖A^T‖_F.
• ‖A‖_F = sqrt( ‖a_0‖_2^2 + ‖a_1‖_2^2 + · · · + ‖a_{n−1}‖_2^2 ).
• ‖A‖_F = sqrt( ‖ã_0‖_2^2 + ‖ã_1‖_2^2 + · · · + ‖ã_{m−1}‖_2^2 ).
Homework 1.6.1.9 Prove that if ‖x‖_ν ≤ β ‖x‖_μ is true for all x, then ‖A‖_ν ≤ β ‖A‖_{μ,ν}.
1.6.2 Summary
If α, β ∈ C with α = α_r + α_c i and β = β_r + β_c i, where α_r, α_c, β_r, β_c ∈ R, then
• Conjugate: ᾱ = α_r − α_c i.
• Product: αβ = (α_r β_r − α_c β_c) + (α_r β_c + α_c β_r) i.
• Absolute value: |α| = sqrt( α_r^2 + α_c^2 ) = sqrt( ᾱ α ).
Let x, y ∈ C^m with x = ( χ_0 ; … ; χ_{m−1} ) and y = ( ψ_0 ; … ; ψ_{m−1} ). Then
• Conjugate: x̄ = ( χ̄_0 ; … ; χ̄_{m−1} ).
• Transpose of vector: x^T = ( χ_0  · · ·  χ_{m−1} ).
• 2-norm (Euclidean length): ‖x‖_2 = sqrt( x^H x ) = sqrt( |χ_0|^2 + · · · + |χ_{m−1}|^2 ) = sqrt( χ̄_0 χ_0 + · · · + χ̄_{m−1} χ_{m−1} ) = sqrt( Σ_{i=0}^{m−1} |χ_i|^2 ).
• p-norm: ‖x‖_p = ( |χ_0|^p + · · · + |χ_{m−1}|^p )^{1/p} = ( Σ_{i=0}^{m−1} |χ_i|^p )^{1/p}.
• 1-norm: ‖x‖_1 = |χ_0| + · · · + |χ_{m−1}| = Σ_{i=0}^{m−1} |χ_i|.
• ∞-norm: ‖x‖_∞ = max( |χ_0|, …, |χ_{m−1}| ) = max_{i=0}^{m−1} |χ_i| = lim_{p→∞} ‖x‖_p.
• Unit ball: Set of all vectors with norm equal to one. Notation: ‖x‖ = 1.
All norms on C^m are equivalent: for norms ‖·‖ and |||·||| there exist constants σ, τ > 0 such that

    σ ‖x‖ ≤ |||x||| ≤ τ ‖x‖.

In particular:

    ‖x‖_1 ≤ √m ‖x‖_2      ‖x‖_1 ≤ m ‖x‖_∞
    ‖x‖_2 ≤ ‖x‖_1         ‖x‖_2 ≤ √m ‖x‖_∞
    ‖x‖_∞ ≤ ‖x‖_1         ‖x‖_∞ ≤ ‖x‖_2
Definition 1.6.2.3 Linear transformations and matrices. Let L : C^n → C^m. Then L
is said to be a linear transformation if for all α ∈ C and x, y ∈ C^n
• L(αx) = αL(x). That is, scaling first and then transforming yields the same result as
  transforming first and then scaling.
• L(x + y) = L(x) + L(y). That is, adding first and then transforming yields the same
  result as transforming first and then adding.
□
Definition 1.6.2.4 Standard basis vector. In this course, we will use e_j ∈ C^m to denote
the standard basis vector with a "1" in the position indexed with j. So,

    e_j = ( 0 ; … ; 0 ; 1 ; 0 ; … ; 0 )    ←  the "1" appears in the position indexed with j.

□
If L is a linear transformation and we let a_j = L(e_j), then

    A = ( a_0  a_1  · · ·  a_{n−1} )

is the matrix that represents L.
Letting C := AB, the matrix-matrix product can be computed
1. By columns:

       ( c_0  · · ·  c_{n−1} ) := A ( b_0  · · ·  b_{n−1} ) = ( Ab_0  · · ·  Ab_{n−1} ).

2. By rows:

       ( c̃_0^T ; … ; c̃_{m−1}^T ) := ( ã_0^T ; … ; ã_{m−1}^T ) B = ( ã_0^T B ; … ; ã_{m−1}^T B ).

   In other words, c̃_i^T = ã_i^T B for all rows of C.
3. By partitioning C and A conformally with

       B = [ B_{0,0}    · · ·  B_{0,N−1}   ]
           [   ⋮                  ⋮       ]
           [ B_{K−1,0}  · · ·  B_{K−1,N−1} ],

   where the partitionings are "conformal." Then

       C_{i,j} = Σ_{p=0}^{K−1} A_{i,p} B_{p,j}.

□
Let A ∈ C^{m×n} with entries α_{i,j}. Then
• Conjugate of matrix:

      Ā = [ ᾱ_{0,0}    · · ·  ᾱ_{0,n−1}   ]
          [   ⋮                 ⋮        ]
          [ ᾱ_{m−1,0}  · · ·  ᾱ_{m−1,n−1} ].

• Transpose of matrix:

      A^T = [ α_{0,0}    · · ·  α_{m−1,0}   ]
            [   ⋮                 ⋮        ]
            [ α_{0,n−1}  · · ·  α_{m−1,n−1} ].
A matrix norm ‖·‖ : C^{m×n} → R is subordinate to vector norms ‖·‖_μ and ‖·‖_ν if, for all x,

    ‖Ax‖_μ ≤ ‖A‖ ‖x‖_ν.

If ‖·‖_μ and ‖·‖_ν are the same norm (but perhaps for different m and n), then ‖·‖ is said
to be subordinate to the given vector norm. □
Definition 1.6.2.8 Consistent matrix norm. A matrix norm ‖·‖ : C^{m×n} → R is said
to be a consistent matrix norm if it is defined for all m and n, using the same formula for
all m and n. □
Definition 1.6.2.9 Submultiplicative matrix norm. A consistent matrix norm ‖·‖ :
C^{m×n} → R is said to be submultiplicative if it satisfies

    ‖AB‖ ≤ ‖A‖ ‖B‖.

□
Let A ∈ C^{m×m} and x, δx, b, δb ∈ C^m, with A nonsingular, Ax = b, and A(x + δx) = b + δb,
and let ‖·‖ be a vector norm and corresponding subordinate matrix norm. Then

    ‖δx‖ / ‖x‖ ≤ ‖A‖ ‖A^{-1}‖ ‖δb‖ / ‖b‖,    where κ(A) = ‖A‖ ‖A^{-1}‖.
YouTube: https://www.youtube.com/watch?v=12K5aydB9cQ
Consider this picture of the Gates Dell Complex that houses our Department of Computer
Science:
where we recognize that we can view the picture as a matrix. What if we want to store
this picture with fewer than m ◊ n data? In other words, what if we want to compress the
picture? To do so, we might identify a few of the columns in the picture to be the "chosen
ones" that are representative of the other columns in the following sense: All columns in the
picture are approximately linear combinations of these chosen columns.
Let’s let linear algebra do the heavy lifting: what if we choose k roughly equally spaced
columns in the picture:
    a_0 = b_0
    a_1 = b_{n/k−1}
     ⋮        ⋮
    a_{k−1} = b_{(k−1)n/k−1},
where for illustration purposes we assume that n is an integer multiple of k. (We could
instead choose them randomly or via some other method. This detail is not important as
we try to gain initial insight.) We could then approximate each column of the picture, bj , as
a linear combination of a_0, …, a_{k−1}:

    b_j ≈ χ_{0,j} a_0 + χ_{1,j} a_1 + · · · + χ_{k−1,j} a_{k−1} = ( a_0  · · ·  a_{k−1} ) ( χ_{0,j} ; … ; χ_{k−1,j} ).

We can write this more concisely by viewing these chosen columns as the columns of matrix
A so that

    b_j ≈ A x_j,    where A = ( a_0  · · ·  a_{k−1} )  and  x_j = ( χ_{0,j} ; … ; χ_{k−1,j} ).

If A has linearly independent columns, the best such approximation (in the linear least
squares sense) is obtained by choosing

    x_j = (A^T A)^{-1} A^T b_j,

where you may recognize (A^T A)^{-1} A^T as the (left) pseudo-inverse of A, leaving us with

    b_j ≈ A (A^T A)^{-1} A^T b_j.

This approximates b_j with the orthogonal projection of b_j onto the column space of A. Doing
this for every column b_j leaves us with the following approximation to the picture:

    B ≈ ( A (A^T A)^{-1} A^T b_0   · · ·   A (A^T A)^{-1} A^T b_{n−1} ),

which is equivalent to

    B ≈ A (A^T A)^{-1} A^T ( b_0  · · ·  b_{n−1} ) = A (A^T A)^{-1} A^T B = A X,    where X = ( x_0  · · ·  x_{n−1} ).

Importantly, instead of requiring m × n data to store B, we now need only store A and X.
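The following MATLAB fragment is a minimal sketch of the column-sampling idea just described; it assumes the picture is already stored in an m × n matrix B of doubles.

    % Approximate B by A*X, where A holds k roughly equally spaced columns of B.
    k = 10;                                 % number of chosen columns
    n = size( B, 2 );
    cols = round( linspace( 1, n, k ) );    % roughly equally spaced column indices
    A = B( :, cols );                       % the "chosen" columns
    X = ( A' * A ) \ ( A' * B );            % X( :, j ) = (A'A)^{-1} A' b_j
    Bapprox = A * X;                        % rank (at most) k approximation of B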
Homework 2.1.1.1 If B is m × n and A is m × k, how many entries are there in A and X?
Solution.
• A is m × k, and hence has m k entries.
• X is k × n, and hence has k n entries.
Storing A and X thus requires k(m + n) entries, which is (much) less than m n when k is small.
[ Figure: the approximations A X to the picture for k = 2, 10, 25, and 50 equally spaced columns. ]
Now, there is no reason to believe that picking equally spaced columns (or restricting
ourselves to columns in B) will yield the best rank-k approximation for the picture. It yields
a pretty good result here in part because there is quite a bit of repetition in the picture,
from column to column. So, the question can be asked: How do we find the best rank-k
approximation for a picture or, more generally, a matrix? This would allow us to get the
most from the data that needs to be stored. It is the Singular Value Decomposition (SVD),
possibly the most important result in linear algebra, that provides the answer.
Remark 2.1.1.1 Those who need a refresher on this material may want to review Week 11
of Linear Algebra: Foundations to Frontiers [26]. We will discuss solving linear least squares
problems further in Week 4.
2.1.2 Overview
• 2.1 Opening Remarks
• 2.4 Enrichments
• 2.5 Wrap Up
• Prove that multiplying with unitary matrices does not amplify relative error.
• Link the Reduced Singular Value Decomposition to the rank of the matrix and deter-
mine the best rank-k approximation to a matrix.
YouTube: https://www.youtube.com/watch?v=3zpdTfwZSEo
At some point in your education you were told that vectors are orthogonal (perpendicular)
if and only if their dot product (inner product) equals zero. Let’s review where this comes
from. Given two vectors u, v œ Rm , those two vectors, and their sum all exist in the same
two dimensional (2D) subspace. So, they can be visualized as
where the page on which they are drawn is that 2D subspace. Now, if they are, as drawn,
perpendicular and we consider the lengths of the sides of the triangle that they define
then we can employ the first theorem you were probably ever exposed to, the Pythagorean
Theorem, to find that
    ‖u‖_2^2 + ‖v‖_2^2 = ‖u + v‖_2^2.

Using what we know about the relation between the 2-norm and the dot product, we find
that

      u^T u + v^T v = (u + v)^T (u + v)
    ⇔    < multiply out >
      u^T u + v^T v = u^T u + u^T v + v^T u + v^T v.

While we have used the dot product to define the length of a vector, ‖x‖_2 = sqrt( x^H x ), we have not formally defined the inner product (dot
product) for complex-valued vectors.
The notation x^H is short for x̄^T, where x̄ equals the vector x with all its entries conjugated.
So,

      x^H y
    =    < expose the elements of the vectors >
      ( χ_0 ; … ; χ_{m−1} )^H ( ψ_0 ; … ; ψ_{m−1} )
    =    < x^H = x̄^T >
      ( χ_0 ; … ; χ_{m−1} )‾ ^T ( ψ_0 ; … ; ψ_{m−1} )
    =    < conjugate the elements of x >
      ( χ̄_0 ; … ; χ̄_{m−1} )^T ( ψ_0 ; … ; ψ_{m−1} )
    =    < view x̄ as an m × 1 matrix and transpose >
      ( χ̄_0  · · ·  χ̄_{m−1} ) ( ψ_0 ; … ; ψ_{m−1} )
    =    < view x^H as a matrix and perform matrix-vector multiplication >
      Σ_{i=0}^{m−1} χ̄_i ψ_i.
ALWAYS/SOMETIMES/NEVER: x^H x is real-valued.
Answer. ALWAYS
Now prove it!
Solution. By the last homework,

    conj( x^H x ) = x^H x.

A complex number is equal to its conjugate only if it is real-valued.
The following defines orthogonality of two vectors with complex-valued elements:
Definition 2.2.1.2 Orthogonal vectors. Let x, y œ Cm . These vectors are said to be
orthogonal (perpendicular) iff xH y = 0. ⌃
YouTube: https://www.youtube.com/watch?v=CqcJ6Nh1QWg
In a previous linear algebra course, you may have learned that if a, b ∈ R^m then

    b̂ = ( a^T b / a^T a ) a = ( a a^T / a^T a ) b

equals the component of b in the direction of a and

    b^⊥ = b − b̂ = ( I − a a^T / a^T a ) b

equals the component of b orthogonal to a.
Homework 2.2.2.1 Let a ∈ C^m.
ALWAYS/SOMETIMES/NEVER:

    ( a a^H / a^H a )( a a^H / a^H a ) = a a^H / a^H a.

Interpret what this means about a matrix that projects onto a vector.
Answer. ALWAYS.
Now prove it.
Solution.

      ( a a^H / a^H a )( a a^H / a^H a )
    =    < multiply numerators and denominators >
      ( a a^H a a^H ) / ( (a^H a)(a^H a) )
    =    < associativity >
      ( a (a^H a) a^H ) / ( (a^H a)(a^H a) )
    =    < a^H a is a scalar and hence commutes to the front >
      ( (a^H a) a a^H ) / ( (a^H a)(a^H a) )
    =    < scalar division >
      a a^H / a^H a.

Interpretation: orthogonally projecting the orthogonal projection of a vector yields the
orthogonal projection of the vector.
Homework 2.2.2.2 Let a ∈ C^m.
ALWAYS/SOMETIMES/NEVER:

    ( a a^H / a^H a )( I − a a^H / a^H a ) = 0.

Answer. ALWAYS.
Now prove it.
Solution.

      ( a a^H / a^H a )( I − a a^H / a^H a )
    =    < distribute >
      a a^H / a^H a − ( a a^H / a^H a )( a a^H / a^H a )
    =    < last homework >
      a a^H / a^H a − a a^H / a^H a
    =
      0.

Interpretation: first orthogonally projecting onto the space orthogonal to vector a and
then orthogonally projecting the resulting vector onto that a leaves you with the zero vector.
Homework 2.2.2.3 Let a, b ∈ C^n, b̂ = ( a a^H / a^H a ) b, and b^⊥ = b − b̂.
ALWAYS/SOMETIMES/NEVER: b̂^H b^⊥ = 0.
Answer. ALWAYS.
Now prove it.
Solution.

      b̂^H b^⊥
    =    < substitute b̂ and b^⊥ >
      ( ( a a^H / a^H a ) b )^H ( b − b̂ )
    =    < (Ax)^H = x^H A^H; substitute b − b̂ >
      b^H ( a a^H / a^H a )^H ( I − a a^H / a^H a ) b
    =    < ( (x y^H)/α )^H = y x^H / α if α is real >
      b^H ( a a^H / a^H a )( I − a a^H / a^H a ) b
    =    < last homework >
      b^H 0 b
    =    < 0 x = 0; y^H 0 = 0 >
      0.
YouTube: https://www.youtube.com/watch?v=GFfvDpj5dzw
A lot of the formulae in the last unit become simpler if the length of the vector equals
one: if u ∈ C^m with ‖u‖_2 = 1, then
• the matrix that projects a vector onto the space spanned by u is given by

      u u^H / u^H u = u u^H,  and

• the matrix that projects a vector onto the space orthogonal to u is given by

      I − u u^H / u^H u = I − u u^H.
Homework 2.2.3.1 Let u ≠ 0 ∈ C^m.
ALWAYS/SOMETIMES/NEVER: u/‖u‖_2 has unit length.
Answer. ALWAYS.
Now prove it.
Solution.

      ‖ u / ‖u‖_2 ‖_2
    =    < homogeneity of norms >
      ‖u‖_2 / ‖u‖_2
    =    < algebra >
      1.

This last exercise shows that any nonzero vector can be scaled (normalized) to have unit
length.
Definition 2.2.3.1 Orthonormal vectors. Let u_0, u_1, …, u_{n−1} ∈ C^m. These vectors are
said to be mutually orthonormal if for all 0 ≤ i, j < n

    u_i^H u_j = { 1 if i = j,
                { 0 otherwise.

□
The definition implies that ‖u_i‖_2 = sqrt( u_i^H u_i ) = 1 and hence each of the vectors is of unit
length in addition to being orthogonal to each other.
The standard basis vectors (Definition 1.3.1.3)

    { e_j }_{j=0}^{m−1} ⊂ C^m,
where e_j is the vector with a "1" in the entry indexed with j and zeroes everywhere else,
are mutually orthonormal since, clearly,

    e_i^H e_j = { 1 if i = j,
                { 0 otherwise.

Naturally, any subset of the standard basis vectors is a set of mutually orthonormal vectors.
Remark 2.2.3.2 For n vectors of size m to be mutually orthonormal, n must be less than
or equal to m. This is because n mutually orthonormal vectors are linearly independent and
there can be at most m linearly independent vectors of size m.
A very concise way of indicating that a set of vectors are mutually orthonormal is to view
them as the columns of a matrix, which then has a very special property:
Definition 2.2.3.3 Orthonormal matrix. Let Q ∈ C^{m×n} (with n ≤ m). Then Q is said
to be an orthonormal matrix iff Q^H Q = I. □
The subsequent exercise makes the connection between mutually orthonormal vectors
and an orthonormal matrix.
Homework 2.2.3.2 Let Q ∈ C^{m×n} (with n ≤ m). Partition Q = ( q_0  q_1  · · ·  q_{n−1} ).
TRUE/FALSE: Q is an orthonormal matrix if and only if q_0, q_1, …, q_{n−1} are mutually
orthonormal.
Answer. TRUE
Now prove it!
Solution. Let Q ∈ C^{m×n} (with n ≤ m). Partition Q = ( q_0  q_1  · · ·  q_{n−1} ). Then

    Q^H Q = ( q_0  q_1  · · ·  q_{n−1} )^H ( q_0  q_1  · · ·  q_{n−1} )

          = [ q_0^H     ]
            [ q_1^H     ] ( q_0  q_1  · · ·  q_{n−1} )
            [   ⋮       ]
            [ q_{n−1}^H ]

          = [ q_0^H q_0      q_0^H q_1      · · ·  q_0^H q_{n−1}     ]
            [ q_1^H q_0      q_1^H q_1      · · ·  q_1^H q_{n−1}     ]
            [    ⋮              ⋮                      ⋮            ]
            [ q_{n−1}^H q_0  q_{n−1}^H q_1  · · ·  q_{n−1}^H q_{n−1} ].
• If Q is not square, then Q^H Q = I means m > n. Hence Q has rank equal to n, which in
  turn means Q Q^H is a matrix with rank at most equal to n. (Actually, its rank equals
  n.) Since I has rank equal to m (it is an m × m matrix with linearly independent
  columns), Q Q^H cannot equal I.
  More concretely: let m > 1 and n = 1. Choose Q = ( e_0 ). Then Q^H Q = e_0^H e_0 = 1 = I.
  But

      Q Q^H = e_0 e_0^H = [ 1  0  · · · ]
                          [ 0  0  · · · ]
                          [ ⋮  ⋮        ].
YouTube: https://www.youtube.com/watch?v=izONEmO9uqw
Homework 2.2.4.1 Let Q œ Cm◊n be an orthonormal matrix.
ALWAYS/SOMETIMES/NEVER: Q≠1 = QH and QQH = I.
Answer. SOMETIMES
Now explain it!
(When you see a proof that involves "· · ·", it would be more rigorous to use a proof by
induction.)
Remark 2.2.4.3 Many algorithms that we encounter in the future will involve the applica-
tion of a sequence of unitary matrices, which is why the result in this last exercise is of great
importance.
Perhaps the most important property of a unitary matrix is that it preserves length.
Homework 2.2.4.6 Let U ∈ C^{m×m} be a unitary matrix and x ∈ C^m. Prove that ‖Ux‖_2 =
‖x‖_2.
Solution.

      ‖Ux‖_2^2
    =    < alternative definition >
      (Ux)^H Ux
    =    < (Az)^H = z^H A^H >
      x^H U^H U x
    =    < U is unitary >
      x^H x
    =    < alternative definition >
      ‖x‖_2^2.
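A quick numerical illustration in MATLAB (a sketch; the QR factorization of a random complex matrix is used only as a convenient way to generate a unitary matrix):

    % A (numerically) unitary U preserves the 2-norm.
    m = 5;
    [ U, ~ ] = qr( randn( m ) + 1i * randn( m ) );   % U is unitary up to roundoff
    x = randn( m, 1 ) + 1i * randn( m, 1 );
    norm( U * x ) - norm( x )                        % roughly zero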
The converse is true as well:
Theorem 2.2.4.4 If A œ Cm◊m preserves length (ÎAxÎ2 = ÎxÎ2 for all x œ Cm ), then A is
unitary.
Proof. We first prove that (Ax)H (Ay) = xH y for all x, y by considering Îx ≠ yÎ22 = ÎA(x ≠
y)Î22 . We then use that to evaluate eH
i A Aej .
H
Let x, y œ Cm . Then
…
1 1 2 = ÎxÎ2 2and ÎAyÎ2 = ÎyÎ2 ; – + – = 2Re(–) >
2 < ÎAxÎ
Re xH y = Re (Ax)H Ay
1 2 1 2
One can similarly show that Im xH y = Im (Ax)H Ay by considering A(ix ≠ y).
Conclude that (Ax)H (Ay) = xH y.
We now use this to show that AH A = I by using the fact that the standard basis vectors
have the property that I
1 if i = j
ei ej =
H
0 otherwise
and that the i, j entry in AH A equals eH
i A Aej .
H
Note: I think the above can be made much more elegant by choosing – such that –xH y
is real and then looking at Îx + –yÎ2 = ÎA(x + –y)Î2 instead, much like we did in the proof
of the Cauchy-Schwartz inequality. Try and see if you can work out the details. ⌅
Homework 2.2.4.7 Prove that if U is unitary then ‖U‖_2 = 1.
Solution.

      ‖U‖_2
    =    < definition >
      max_{‖x‖_2=1} ‖Ux‖_2
    =    < unitary matrices preserve length >
      max_{‖x‖_2=1} ‖x‖_2
    =    < algebra >
      1.

(The above can be really easily proven with the SVD. Let's point that out later.)
Homework 2.2.4.8 Prove that if U is unitary then κ_2(U) = 1.
Solution.

      κ_2(U)
    =    < definition >
      ‖U‖_2 ‖U^{-1}‖_2
    =    < both U and U^{-1} are unitary; last homework >
      1 × 1
    =    < arithmetic >
      1.
The preservation of length extends to the preservation of norms that have a relation to
the 2-norm:
Homework 2.2.4.9 Let U œ Cm◊m and V œ Cn◊n be unitary and A œ Cm◊n . Show that
• ÎU H AÎ2 = ÎAÎ2 .
• ÎAV Î2 = ÎAÎ2 .
• ÎU H AV Î2 = ÎAÎ2 .
Solution.
•
ÎU H AÎ2
= < definition of 2-norm >
maxÎxÎ2 =1 ÎU H AxÎ2
= < U is unitary and unitary matrices preserve length >
maxÎxÎ2 =1 ÎAxÎ2
= < definition of 2-norm >
ÎAÎ2 .
•
ÎAV Î2
= < definition of 2-norm >
maxÎxÎ2 =1 ÎAV xÎ2
= < V H is unitary and unitary matrices preserve length >
maxÎV xÎ2 =1 ÎA(V x)Î2
= < substitute y = V x >
maxÎyÎ2 =1 ÎAyÎ2
= < definition of 2-norm >
ÎAÎ2 .
• The last part follows immediately from the previous two:
ÎU H AV Î2 = ÎU H (AV )Î2 = ÎAV Î2 = ÎAÎ2 .
Homework 2.2.4.10 Let U œ Cm◊m and V œ Cn◊n be unitary and A œ Cm◊n . Show that
• ÎU H AÎF = ÎAÎF .
• ÎAV ÎF = ÎAÎF .
• ÎU H AV ÎF = ÎAÎF .
• Partition 1 2
A= a0 · · · an≠1 .
qn≠1
Then we saw in Subsection 1.3.3 that ÎAÎ2F = j=0 ÎaÎ22 .
Now,
ÎU H AÎ2F
= 1 < partition 2A by columns >
ÎU H a0 · · · an≠1 Î2F
1= < property of matrix-vector
2 multiplication >
H H 2
Î U a0 · · · U an≠1 ÎF
= < exercise in Chapter 1 >
qn≠1
j=0 ÎU H
aj Î22
= < unitary matrices preserve length >
qn≠1 2
j=0 Îa Î
j 2
= < exercise in Chapter 1 >
ÎAÎ2F .
• The last part follows immediately from the first two parts.
In the last two exercises we consider U H AV rather than U AV because it sets us up better
for future discussion.
2.2.5.1 Rotations
• If you scale a vector first and then rotate it, you get the same result as if you rotate it
first and then scale it.
• If you add two vectors first and then rotate, you get the same result as if you rotate
them first and then add them.
Thus, a rotation is a linear transformation. Also, the above picture captures that a rotation
preserves the length of the vector to which it is applied. We conclude that the matrix that
represents a rotation should be a unitary matrix.
Let us compute the matrix that represents the rotation through an angle ◊. Recall that
if L : Cn æ Cm is a linear transformation and A is the matrix that represents it, then the
jth column of A, aj , equals L(ej ). The pictures
and

illustrate that

    R_θ(e_0) = ( cos(θ) ; sin(θ) )    and    R_θ(e_1) = ( −sin(θ) ; cos(θ) ).

Thus,

    R_θ(x) = [ cos(θ)  −sin(θ) ] [ χ_0 ]
             [ sin(θ)   cos(θ) ] [ χ_1 ].
Homework 2.2.5.1 Show that

    [ cos(θ)  −sin(θ) ]
    [ sin(θ)   cos(θ) ]

is a unitary matrix. (Since it is real valued, it is usually called an orthogonal matrix instead.)
Hint. Use c for cos(θ) and s for sin(θ) to save yourself a lot of writing!
Solution.

      [ cos(θ)  −sin(θ) ]^H [ cos(θ)  −sin(θ) ]
      [ sin(θ)   cos(θ) ]   [ sin(θ)   cos(θ) ]
    =    < the matrix is real valued >
      [ cos(θ)  −sin(θ) ]^T [ cos(θ)  −sin(θ) ]
      [ sin(θ)   cos(θ) ]   [ sin(θ)   cos(θ) ]
    =    < transpose >
      [  cos(θ)  sin(θ) ] [ cos(θ)  −sin(θ) ]
      [ −sin(θ)  cos(θ) ] [ sin(θ)   cos(θ) ]
    =    < multiply >
      [ cos^2(θ) + sin^2(θ)              −cos(θ)sin(θ) + sin(θ)cos(θ) ]
      [ −sin(θ)cos(θ) + cos(θ)sin(θ)      sin^2(θ) + cos^2(θ)         ]
    =    < geometry; algebra >
      [ 1  0 ]
      [ 0  1 ].
Homework 2.2.5.2 Prove, without relying on geometry but using what you just discovered,
that cos(−θ) = cos(θ) and sin(−θ) = −sin(θ).
Solution. Undoing a rotation by an angle θ means rotating in the opposite direction
through angle θ or, equivalently, rotating through angle −θ. Thus, the inverse of R_θ is
R_{−θ}. The matrix that represents R_θ is given by

    [ cos(θ)  −sin(θ) ]
    [ sin(θ)   cos(θ) ],

and we just saw that its inverse equals its transpose. Hence

    [ cos(−θ)  −sin(−θ) ]  =  [  cos(θ)  sin(θ) ]
    [ sin(−θ)   cos(−θ) ]     [ −sin(θ)  cos(θ) ],

from which we conclude that cos(−θ) = cos(θ) and sin(−θ) = −sin(θ).
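A quick MATLAB check of the two rotation facts above (a sketch, for an arbitrary angle):

    theta  = 0.3;
    R      = [ cos( theta ) -sin( theta ); sin( theta ) cos( theta ) ];
    Rminus = [ cos( -theta ) -sin( -theta ); sin( -theta ) cos( -theta ) ];
    norm( R' * R - eye( 2 ) )        % R is orthogonal: roughly zero
    norm( R * Rminus - eye( 2 ) )    % R_{-theta} undoes R_theta: roughly zero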
2.2.5.2 Reflections
YouTube: https://www.youtube.com/watch?v=r8S04qqcc-o
The transformation described above preserves the length of the vector to which it is
applied.
Homework 2.2.5.3 (Verbally) describe why reflecting a vector as described above is a linear
transformation.
Solution.
• If you scale a vector first and then reflect it, you get the same result as if you reflect it
first and then scale it.
• If you add two vectors first and then reflect, you get the same result as if you reflect
them first and then add them.
Homework 2.2.5.4 Show that the matrix that represents M : R3 æ R3 in the above
example is given by I ≠ 2uuT .
Hint. Rearrange x ≠ 2(uT x)u.
Solution. We notice that
x ≠ 2(uT x)u
= < –x = x– >
x ≠ 2u(uT x)
= < associativity >
Ix ≠ 2uuT x
= < distributivity >
(I ≠ 2uuT )x.
Hence M (x) = (I ≠ 2uuT )x and the matrix that represents M is given by I ≠ 2uuT .
Homework 2.2.5.5 (Verbally) describe why (I≠2uuT )≠1 = I≠2uuT if u œ R3 and ÎuÎ2 = 1.
Solution. If you take a vector, x, and reflect it with respect to the mirror defined by u,
and you then reflect the result with respect to the same mirror, you should get the original
vector x back. Hence, the matrix that represents the reflection should be its own inverse.
Homework 2.2.5.6 Let M : R3 æ R3 be defined by M (x) = (I ≠ 2uuT )x, where ÎuÎ2 = 1.
Show that the matrix that represents it is unitary (or, rather, orthogonal since it is in R3◊3 ).
Solution. Pushing through the math we find that

      (I − 2uu^T)^T (I − 2uu^T)
    =    < (A + B)^T = A^T + B^T >
      (I − (2uu^T)^T)(I − 2uu^T)
    =    < (uu^T)^T = uu^T >
      (I − 2uu^T)(I − 2uu^T)
    =    < multiply out >
      I − 4uu^T + 4uu^T uu^T
    =    < u^T u = 1 >
      I − 4uu^T + 4uu^T
    =    < A − A = 0 >
      I.
Remark 2.2.5.1 Unitary matrices in general, and rotations and reflections in particular,
will play a key role in many of the practical algorithms we will develop in this course.
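The following MATLAB fragment is a quick numerical check (not a proof, and not part of the original notes) of the reflector properties discussed above, for a random unit-length u ∈ R^3:

    % H = I - 2*u*u' is symmetric, orthogonal, and its own inverse.
    u = randn( 3, 1 );  u = u / norm( u );
    H = eye( 3 ) - 2 * ( u * u' );
    norm( H - H' )             % symmetric: roughly zero
    norm( H' * H - eye( 3 ) )  % orthogonal: roughly zero
    norm( H * H - eye( 3 ) )   % its own inverse: roughly zero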
YouTube: https://www.youtube.com/watch?v=DwTVkdQKJK4
Homework 2.2.6.1 Consider the vector x = ( −2 ; 1 ) and the following picture that depicts
a rotated basis with basis vectors u_0 and u_1.
What are the coordinates of the vector x in this rotated system? In other words, find
x̂ = ( χ̂_0 ; χ̂_1 ) such that χ̂_0 u_0 + χ̂_1 u_1 = x.
Solution. There are a number of approaches to this. One way is to try to remember the
formula you may have learned in a pre-calculus course about change of coordinates. Let's
instead start by recognizing (from geometry or by applying the Pythagorean Theorem) that

    u_0 = ( √2/2 ; √2/2 ) = (√2/2) ( 1 ; 1 )    and    u_1 = ( −√2/2 ; √2/2 ) = (√2/2) ( −1 ; 1 ).

Here are two ways in which you can employ what you have discovered in this course:
• Since u_0 and u_1 are orthonormal,

      x = (u_0^T x) u_0 + (u_1^T x) u_1,

  the sum of the component in the direction of u_0 and the component in the direction of u_1.
  Instantiating u_0 and u_1 and evaluating yields

      x = −(√2/2) u_0 + (3√2/2) u_1.

• An alternative way to arrive at the same answer that provides more insight. Let
  U = ( u_0  u_1 ). Then, since U is unitary (or orthogonal, since it is real valued),

      x = U U^T x = ( u_0  u_1 ) ( u_0^T x ; u_1^T x ) = ( u_0  u_1 ) ( −√2/2 ; 3√2/2 ).

  In other words, x̂ = U^T x = ( −√2/2 ; 3√2/2 ) = (√2/2) ( −1 ; 3 ).
Below we compare how to describe a vector x using the standard basis vectors
e_0, …, e_{m−1} and using orthonormal vectors u_0, …, u_{m−1}:
• The vector x = ( χ_0 ; … ; χ_{m−1} ) describes the vector x in terms of the standard basis
  vectors e_0, …, e_{m−1}:

      x = I I^T x = ( e_0  · · ·  e_{m−1} ) ( e_0^T x ; … ; e_{m−1}^T x )
        = ( e_0  · · ·  e_{m−1} ) ( χ_0 ; … ; χ_{m−1} )
        = χ_0 e_0 + χ_1 e_1 + · · · + χ_{m−1} e_{m−1}.

• The vector x̂ = ( u_0^H x ; … ; u_{m−1}^H x ) describes the vector x in terms of the orthonormal basis
  u_0, …, u_{m−1}:

      x = U U^H x = ( u_0  · · ·  u_{m−1} ) ( u_0^H x ; … ; u_{m−1}^H x )
        = (u_0^H x) u_0 + (u_1^H x) u_1 + · · · + (u_{m−1}^H x) u_{m−1}.

Another way of looking at this is that if u_0, u_1, …, u_{m−1} is an orthonormal basis for C^m,
then any x ∈ C^m can be written as a linear combination of these vectors:

    x = α_0 u_0 + α_1 u_1 + · · · + α_{m−1} u_{m−1}.

Now, since the u_j are mutually orthonormal,

    u_i^H x = u_i^H ( α_0 u_0 + α_1 u_1 + · · · + α_{m−1} u_{m−1} ) = α_i.

Thus u_i^H x = α_i, the coefficient that multiplies u_i.
Remark 2.2.6.1 The point is that given vector x and unitary matrix U , U H x computes
the coefficients for the orthonormal basis consisting of the columns of matrix U . Unitary
matrices allow one to elegantly change between orthonormal bases.
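As a small illustration of this remark, here is a MATLAB sketch using the vectors from Homework 2.2.6.1:

    % The coefficients of x in the rotated basis are simply U' * x.
    u0 = [ 1; 1 ] / sqrt( 2 );
    u1 = [ -1; 1 ] / sqrt( 2 );
    U  = [ u0 u1 ];
    x  = [ -2; 1 ];
    xhat = U' * x          % equals [ -sqrt(2)/2; 3*sqrt(2)/2 ]
    U * xhat - x           % reconstructs x (roughly zero)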
YouTube: https://www.youtube.com/watch?v=d8-AeC3Q8Cw
In Subsection 1.4.1, we looked at how sensitive solving

    Ax = b    versus    A(x + δx) = b + δb

is when an induced matrix norm is used. Let's look instead at how sensitive matrix-vector
multiplication is.
Homework 2.2.7.1 Let A ∈ C^{n×n} be nonsingular and x ∈ C^n a nonzero vector. Consider

    y = Ax    and    y + δy = A(x + δx).

Show that

    ‖δy‖ / ‖y‖ ≤ ‖A‖ ‖A^{-1}‖ ‖δx‖ / ‖x‖,    where κ(A) = ‖A‖ ‖A^{-1}‖

and ‖·‖ is an induced matrix norm.
Solution. Since x = A^{-1} y we know that ‖x‖ ≤ ‖A^{-1}‖ ‖y‖ and hence

    1 / ‖y‖ ≤ ‖A^{-1}‖ ( 1 / ‖x‖ ).    (2.2.1)

Subtracting y = Ax from y + δy = A(x + δx) yields

    δy = A δx

and hence

    ‖δy‖ ≤ ‖A‖ ‖δx‖.    (2.2.2)

Combining (2.2.1) and (2.2.2) yields the desired result.
There are choices of x and δx for which the bound is tight.
What does this mean? It means that if as part of an algorithm we use matrix-vector
or matrix-matrix multiplication, we risk amplifying relative error by the condition number
of the matrix by which we multiply. Now, we saw in Section 1.4 that 1 ≤ κ(A). So, if
there are algorithms that only use matrices for which κ(A) = 1, then those algorithms don't
amplify relative error.
Remark 2.2.7.1 We conclude that unitary matrices, which do not amplify the 2-norm of a
vector or matrix, should be our tool of choice, whenever practical.
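The following MATLAB fragment (a sketch, not part of the original notes) shows how the bound can be attained for a simple diagonal matrix by choosing x along the smallest singular direction and the perturbation δx along the largest:

    % The relative error in y = A x can be amplified by up to kappa_2( A ).
    A  = diag( [ 1e4, 1 ] );                 % kappa_2( A ) = 1e4
    x  = [ 0; 1 ];      dx = [ 1e-6; 0 ];
    y  = A * x;         dy = A * ( x + dx ) - y;
    amplification = ( norm( dy ) / norm( y ) ) / ( norm( dx ) / norm( x ) )
    cond( A )                                % equals the observed amplification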
YouTube: https://www.youtube.com/watch?v=uBo3XAGt24Q
The following is probably the most important result in linear algebra:
Theorem 2.3.1.1 Singular Value Decomposition. Given A ∈ C^{m×n}, there exist unitary
U ∈ C^{m×m}, unitary V ∈ C^{n×n}, and Σ ∈ R^{m×n} such that A = U Σ V^H, where

    Σ = [ Σ_TL  0 ]    with Σ_TL = diag( σ_0, σ_1, …, σ_{r−1} )        (2.3.1)
        [ 0     0 ]

and σ_0 ≥ σ_1 ≥ · · · ≥ σ_{r−1} > 0. The values σ_0, …, σ_{r−1} are called the singular values of
matrix A. The columns of U and V are called the left and right singular vectors, respectively.
Recall that in our notation a 0 indicates a matrix or vector "of appropriate size" and that
in this setting the zero matrices in (2.3.1) may be 0 × 0, (m − r) × 0, and/or 0 × (n − r).
Before proving this theorem, we are going to put some intermediate results in place.
Remark 2.3.1.2 As the course progresses, we will notice that there is a conflict between
the notation that explicitly exposes indices, e.g.,
1 2
U= u0 u1 · · · un≠1
and the notation we use to hide such explicit indexing, which we call the FLAME notation,
e.g., 1 2
U = U0 u1 U2 .
The two are linked by

    (  u_0  · · ·  u_{k−1}   u_k   u_{k+1}  · · ·  u_{n−1}  )
       \________________/   \___/  \____________________/
              U_0            u_1            U_2
In algorithms that use explicit indexing, k often is the loop index that identifies where in the
matrix or vector the algorithm currently has reached. In the FLAME notation, the index 1
identifies that place. This creates a conflict for the two distinct items that are both indexed
with 1, e.g., u1 in our example here. It is our experience that learners quickly adapt to this
and hence have not tried to introduce even more notation that avoids this conflict. In other
words: you will almost always be able to tell from context what is meant. The following
lemma and its proof illustrate this further.
Lemma 2.3.1.3 Given A ∈ C^{m×n}, with 1 ≤ n ≤ m and A ≠ 0 (the zero matrix), there exist
unitary matrices Ũ ∈ C^{m×m} and Ṽ ∈ C^{n×n} such that

    A = Ũ [ σ_1  0 ] Ṽ^H,    where σ_1 = ‖A‖_2.
          [ 0    B ]

Proof. In the proof below, it is really important to keep track of when a line is part of the
partitioning of a matrix or vector, and when it denotes scalar division.
Choose σ_1 and ṽ_1 ∈ C^n such that
• ‖ṽ_1‖_2 = 1; and
• σ_1 = ‖A ṽ_1‖_2 = ‖A‖_2.
UÂ H AVÂ
= < instantiate >
1 2H 1 2
uÂ1 UÂ 2 A vÂ1 VÂ2
A
= < multiplyBout >
uÂH A v uÂH Â
1 Â 1 1 AV2
UÂ2H AvÂ1 UÂ2H AVÂ2
A
= < AvÂ1 = ‡1 uB Â1 >
H Â
‡1 uÂH u
Â
1 1 u 1 A V 2
‡1 UÂ2H uÂ1 UÂ2H AVÂ2
A = < BuÂH1 uÂ1 = 1; U 2 Â1 = 0; pick w = V2 A u
 Hu  H H  and B = U
1
 H AV >
2 2
H
‡1 w
,
0 B
‡12
= < assumption >
ÎAÎ22
= < 2-norm is invariant under multiplication by unitary matrix >
 H  2
ÎU AV Î2
= < definition of Î · Î2 >
ÂH AVÂ xÎ2
maxx”=0 ÎxÎ2 2
ÎU
2
= .< see above .>
A B 2
. ‡ H .
. 1 w .
. x.
. 0 B .
maxx”=0 ÎxÎ22
2
.A
Ø < x replaced by specific vector >
BA B.2
. ‡ H
‡1 ..
. 1 w
. .
. 0 B w .2
.A B.2
.
‡1 ..
.
..
.
w .2
.A
= < multiply out numerator >
B.
. ‡ 2 + w H w .2
. 1 .
. .
. Bw .
.A 2 B.2
. ‡1 .
. .
. .
. w .
.A2 B.2 .A B.2
Â1 ... . ‡1 .
. . .
Ø . = ÎÂ1 Î22 + Îy2 Î22 Ø ÎÂ1 Î22 ; .
<. . = ‡12 + wH w >
y2 .2. . w .
2
(‡1 + w w) /(‡1 + w w)
2 H 2 2 H
A B
‡1 0
Thus ‡12 Ø ‡12 + w w which means that w = 0 (the zero vector) and
H
UÂ H AVÂ =
0 B
A B
‡1 0
so that A = UÂ VÂ H . ⌅
0 B
Hopefully you can see where this is going: If one can recursively find that B = UB B VB ,
H
then A B
‡1 0
A = UÂ VÂ H
A
0 B B
‡ 1 0
= UÂ VÂ H
A
0 UBB BAVBH BA B
1 0 ‡1 0 1 0 .
= UÂ VÂ H
0 UB 0 B 0 VBH
A B A B A A BBH
1 0 ‡1 0 1 0
= UÂ Â
V .
0 UB 0 B 0 VB
¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝
U VH
The next exercise provides the insight that the values on the diagonal of will be ordered
from largest to smallest.
A B
‡1 0
Homework 2.3.1.1 Let A œ C with A =
m◊n
and assume that ÎAÎ2 = ‡1 .
0 B
ALWAYS/SOMETIMES/NEVER: ÎBÎ2 Æ ‡1 .
Solution. We will employ a proof by contradiction. Assume that ÎBÎ2 > ‡1 . Then there
exists a vector z with ÎzÎ2 = 1 such that ÎBÎ2 = ÎBzÎ2 = maxÎxÎ2 =1 ÎBxÎ2 . But then
ÎAÎ2
= < definition >
maxÎxÎ2 =1 ÎAxÎ2
. Ø A < pick a specific vector with 2 ≠ norm equal to one >
B.
.
. 0 .
.
.A .
. z .2
.A = <Binstantiate
A B. A>
. ‡
. 1 0 0 .
.
. .
. 0 B z .2
.A = B. < partitioned matrix-vector multiplication >
.
. 0 ..
. .
. Bz .
2 .A B.2
. y .
. 0 .
= < .. . = Îy0 Î22 + Îy1 Î22 >
y1 .2
ÎBzÎ2
= < assumption about z >
ÎBÎ2
> < assumption >
‡1 .
which is a contradiction.
Hence ÎBÎ2 Æ ‡1 .
We are now ready to prove the Singular Value Decomposition Theorem.
Proof of Singular Value Decomposition Theorem for n Æ m. We will prove this for m Ø n,
leaving the case where m Æ n as an exercise.
Proof by induction: Since m Ø n, we select m to be arbritary and induct on n.
• Base case: n = 1.
1 2
In this case A = a1 where a1 œ Cm is its only column.
Case 1: a1 = 0 (the zero vector).
Then A B
1 2
A= 0 = Im◊m I1◊1
¸ ˚˙ ˝ 0 ¸ ˚˙ ˝
U VH
so that U = Im◊m , V = I1◊1 , and TL is an empty matrix.
Case 2: a1 ”= 0.
Then 1 2 1 2
A= a1 = u1 (Îa1 Î2 )
1 2
where u1 = a1 /Îa1 Î2 . Choose U2 œ Cm◊(m≠1) so that U = u1 U2 is unitary. Then
1 2
A = a1
1 2
= (Îa1 Î2 )
u1
A B
1 2 Îa Î 1 2H
1 2
= u1 U2 1
0
= U V H,
where
1 2
¶ U= u0 U1 ,
A B
1 2
¶ = TL
with = ‡1 and ‡1 = Îa1 Î2 = ÎAÎ2
0 TL
1 2
¶ V = 1 .
• Inductive step:
Assume the result is true for matrices with 1 Æ k columns. Show that it is true for
matrices with k + 1 columns.
Let A œ Cm◊(k+1) with 1 Æ k < n.
Case 1: A = 0 (the zero matrix)
Then A B
A = Im◊m I(k+1)◊(k+1)
0m◊(k+1)
so that U = Im◊m , V = I(k+1)◊(k+1) , and TL is an empty matrix.
Case 2: A ”= 0.
Then ÎAÎ2 ”= 0. By Lemma 2.3.1.3, A we know
B that there exist unitary U œ C
 m◊m
and
‡1 0
V œ C(k+1)◊(k+1) such that A = U V with ‡1 = ÎAÎ2 .
0 B
By the inductive hypothesis, there exist unitary ǓB œ C(m≠1)◊(m≠1)
A , unitary
B V̌B œ
ˇ 0
Ck◊k , and ˇ B œ R(m≠1)◊k such that B = ǓB ˇ B V̌BH where ˇ B = TL
, ˇTL =
0 0
diag(‡2 , · · · , ‡r≠1 ), and ‡2 Ø · · · Ø ‡r≠1 > 0.
Now, let
A B A B A B
1 0 1 0 ‡1 0
U= UÂ ,V = VÂ , and = ˇB .
0 ǓB 0 V̌B 0
(There are some really tough to see "checks" in the definition of U , V , and !!) Then
A = U V H where U , V , and have the desired properties. Key here is that ‡1 =
ÎAÎ2 Ø ÎBÎ2 which means that ‡1 Ø ‡2 .
• By the Principle of Mathematical Induction the result holds for all matrices A œ Cm◊n
with m Ø n.
⌅
Homework 2.3.1.2 Let Σ = diag( σ_0, …, σ_{n−1} ). ALWAYS/SOMETIMES/NEVER: ‖Σ‖_2 =
max_{i=0}^{n−1} |σ_i|.
Answer. ALWAYS
Now prove it.
Solution. Yes, you have seen this before, in Homework 1.3.5.1. We repeat it here because
of its importance to this topic.
The argument shows that

    max_{i=0}^{n−1} |σ_i| ≤ ‖Σ‖_2 ≤ max_{i=0}^{n−1} |σ_i|,

which implies that ‖Σ‖_2 = max_{i=0}^{n−1} |σ_i|.
Homework 2.3.1.3 Assume that U ∈ C^{m×m} and V ∈ C^{n×n} are unitary matrices. Let
A, B ∈ C^{m×n} with B = U A V^H. Show that the singular values of A equal the singular values
of B.
Solution. Let A = U_A Σ_A V_A^H be the SVD of A. Then B = U U_A Σ_A V_A^H V^H = (U U_A) Σ_A (V V_A)^H,
where both U U_A and V V_A are unitary. This gives us the SVD for B and it shows that the
singular values of B equal the singular values of A.
Homework 2.3.1.4 Let A ∈ C^{m×n} with n ≤ m and A = U Σ V^H be its SVD.
ALWAYS/SOMETIMES/NEVER: A^H = V Σ^T U^H.
Answer. ALWAYS
Solution.

    A^H = (U Σ V^H)^H = (V^H)^H Σ^T U^H = V Σ^T U^H

since Σ is real valued. Notice that Σ is only "sort of diagonal" (it is possibly rectangular),
which is why Σ^T ≠ Σ.
Homework 2.3.1.5 Prove the Singular Value Decomposition Theorem for m ≤ n.
Hint. Consider the SVD of B = A^H.
Solution. Let B = A^H. Since it is n × m with n ≥ m its SVD exists: B = U_B Σ_B V_B^H.
YouTube: https://www.youtube.com/watch?v=ZYzqTC5LeLs
YouTube: https://www.youtube.com/watch?v=XKhCTtX1z6A
We will now illustrate what the SVD Theorem tells us about matrix-vector multiplication
(linear transformations) by examining the case where A œ R2◊2 . Let A = U V T be its SVD.
(Notice that all matrices are now real valued, and hence V H = V T .) Partition
A B
1 2 ‡0 0 1 2T
A= u0 u1 v0 v1 .
0 ‡1
Since U and V are unitary matrices, {u0 , u1 } and {v0 , v1 } form orthonormal bases for the
range and domain of A, respectively:
R2 : Domain of A: R2 : Range (codomain) of A:
WEEK 2. THE SINGULAR VALUE DECOMPOSITION 126
Figure 2.3.2.1 Illustration of how orthonormal vectors v0 and v1 are transformed by matrix
A=U V.
Next, Alet usBlook at how A transforms any vector with (Euclidean) unit length. Notice
‰0
that x = means that
‰1
x = ‰0 e0 + ‰1 e1 ,
where e0 and e1 are the unit basis vectors. Thus, ‰0 and ‰1 are the coefficients when x is
expressed using e0 and e1 as basis. However, we can also express x in the basis given by v0
and v1 : A B
1 21 2T 1 2 vT x
0
x = V V ˝ x = v0 v1
T
v0 v1 x = v0 v1
¸ ˚˙ v1T x
I A B
1 2 –
0
= v0 x v0 + v1 x v1 = –0 v0 + –0 v1 = v0 v1
T T
.
¸˚˙˝ ¸˚˙˝ –1
–0 –1
Thus, in the basis formed by v0 and v1 , its coefficients are –0 and –1 . Now,
1 21 2T
Ax = ‡0 u0 ‡1 u1 v0 v1 x
A B
1 21 2T 1 2 –0
= ‡0 u0 ‡1 u1 v0 v1 v0 v1
A B
–1
1 2 –0
= ‡0 u0 ‡1 u1 = –0 ‡0 u0 + –1 ‡1 u1 .
–1
This is illustrated by the following picture, which also captures the fact that the unit ball
is mapped to an oval with major axis equal to ‡0 = ÎAÎ2 and minor axis equal to ‡1 , as
illustrated in Figure 2.3.2.1 (bottom).
Finally, we show the same insights for general vector x (not necessarily of unit length):
R2 : Domain of A: R2 : Range (codomain) of A:
Another observation is that if one picks the right basis for the domain and codomain,
then the computation Ax simplifies to a matrix multiplication with a diagonal matrix. Let
Now, if we chose to express y using u0 and u1 as the basis and express x using v0 and v1 as
the basis, then
U U T˝ y = U U T y = (uT0 y)u0 + (uT1 y)u1
¸ ˚˙ ¸ ˚˙ ˝
I y‚ A B A B
1 2 uT y ‚0
0
= u0 u1 =U
uT1 y ‚1
¸ ˚˙ ˝
y‚
V V T˝ x = V V
¸ ˚˙ ¸ ˚˙x˝ = (v0 x)v0 + (v1 x)v1
T T T
I x‚ A B A B
1 2 vT x ‰‚0
0
= v0 v1 =V .
v1T x ‰‚1 .
¸ ˚˙ ˝
x‚
If y = Ax then
U UT y = U V T x˝ = U x‚
¸ ˚˙
¸ ˚˙ ˝
y‚ Ax
so that
y‚ = x‚
and A B A B
‚0 ‡0 ‰‚0
= .
‚1 . ‡1 ‰‚1 .
Remark 2.3.2.2 The above discussion shows that if one transforms the input vector x
and output vector y into the right bases, then the computation y := Ax can be computed
with a diagonal matrix instead: y‚ := x‚. Also, solving Ax = y for x can be computed by
multiplying with the inverse of the diagonal matrix: x‚ := ≠1 y‚.
These observations generalize to A œ Cm◊n : If
y = Ax
then
U H y = U H A ¸V ˚˙
V H˝ x
I
so that
UHy = V H
¸ ˚˙ x˝
¸ ˚˙ ˝
y‚ x‚
WEEK 2. THE SINGULAR VALUE DECOMPOSITION 130
YouTube: https://www.youtube.com/watch?v=1LpK0dbFX1g
This means that in the current step we need to update the contents of ABR with
A B
‡11 0
UÂ H AVÂ =
0 BÂ
and update A B
1 2 1 2
Ik◊k 0
UL UR := UL UR
A 0 UÂ B
1 2 1 2 I 0
:=
k◊k
VL VR VL VR ,
0 VÂ
which simplify to
UBR := UBR UÂ and VBR := VBR VÂ .
At that point, AT L is expanded by one row and column, and the left-most columns of UR and
VR are moved to UL and VL , respectively. If ABR ever contains a zero matrix, the process
completes with A overwritten with = U H V‚ . These observations, with all details, are
captured in Figure 2.3.3.1. In that figure, the boxes in yellow are assertions that capture the
current contents of the variables. Those familiar with proving loops correct will recognize
the first and last such box as the precondition and postcondition for the operation and
A B A B
AT L AT R 1 2H A‚T L A‚T R 1 2
= UL UR VL VR
ABL ABR A B A‚BL A‚BR
0
= TL
0 B
as the loop-invariant that can be used to prove the correctness of the loop via a proof by
induction.
Figure 2.3.3.1 Algorithm for computing the SVD of A, overwriting A with Σ. In the yellow
boxes are assertions regarding the contents of the various matrices.
The reason this algorithm is not practical is that many of the steps are easy to state
mathematically, but difficult (computationally expensive) to compute in practice. In partic-
ular:
• Given a vector, determining a unitary matrix with that vector as its first column is
computationally expensive.
• Assuming for simplicity that m = n, even if all other computations were free, comput-
ing the product A22 := UÂ2H ABR VÂ2 requires O((m ≠ k)3 ) operations. This means that
the entire algorithm requires O(m4 ) computations, which is prohibitively expensive
when n gets large. (We will see that most practical algorithms discussed in this course
cost O(m3 ) operations or less.)
Later in this course, we will discuss an algorithm that has an effective cost of O(m3 ) (when
m = n).
Ponder This 2.3.3.1 An implementation of the "algorithm" in Figure 2.3.3.1, using our
FLAME API for Matlab (FLAME@lab) [5] that allows the code to closely resemble the
algorithm as we present it, is given in mySVD.m (Assignments/Week02/matlab/mySVD.m).
This implemention depends on routines in subdirectory Assignments/flameatlab being in the
path. Examine this code. What do you notice? Execute it with
m = 5;
n = 4;
A = rand( m, n ); % create m x n random matrix
[ U, Sigma, V ] = mySVD( A )
norm( A - U * Sigma * V' )
YouTube: https://www.youtube.com/watch?v=HAAh4IsIdsY
Corollary 2.3.4.1 Reduced Singular Value Decomposition. Let A ∈ C^{m×n} and
r = rank(A). There exist orthonormal matrix U_L ∈ C^{m×r}, orthonormal matrix V_L ∈ C^{n×r},
and matrix Σ_TL ∈ R^{r×r} with Σ_TL = diag( σ_0, …, σ_{r−1} ) and σ_0 ≥ σ_1 ≥ · · · ≥ σ_{r−1} > 0 such
that A = U_L Σ_TL V_L^H.
Homework 2.3.4.1 Prove the above corollary.
Solution. Let

    A = U Σ V^H = ( U_L  U_R ) [ Σ_TL  0 ] ( V_L  V_R )^H
                               [ 0     0 ]

be the SVD of A, where U_L ∈ C^{m×r}, V_L ∈ C^{n×r} and Σ_TL ∈ R^{r×r} with Σ_TL = diag( σ_0, σ_1, · · ·, σ_{r−1} )
and σ_0 ≥ σ_1 ≥ · · · ≥ σ_{r−1} > 0. Then

      A
    =    < SVD of A >
      U Σ V^H
    =    < partitioning >
      ( U_L  U_R ) [ Σ_TL  0 ] ( V_L  V_R )^H
                   [ 0     0 ]
    =    < partitioned matrix-matrix multiplication >
      U_L Σ_TL V_L^H.
1 2
Corollary 2.3.4.2 Let A = UL H
T L VL be the Reduced SVD with UL = u0 · · · ur≠1 ,
Q R
1 2
‡0
VL = v0 · · · vr≠1 , and =c
c .. d
d. Then
TL a . b
‡r≠1
V ≠1
U H with
≠1
= <>
Q R≠1
‡0 0 · · · 0
c
c 0 ‡1 · · · 0 d
d
c
c .. .. . . .. d
d
a . . . . b
0 0 ··· ‡m≠1
= <>
Q R
1/‡0 0 ··· 0
c
c 0 1/‡1 ··· 0 d
d
c
c .. .. .. .. d.
d
a . . . . b
0 0 · · · 1/‡m≠1
However, the SVD requires the diagonal elements to be positive and ordered from largest to
smallest.
So, only if ‡0 = ‡1 = · · · = ‡m≠1 is it the case that V ≠1 U H is the SVD of A≠1 . In other
words, when = ‡0 I.
Homework 2.3.5.4 Let A œ Cm◊m be nonsingular and
A = U VH Q R
1
‡ ···
2c 0
0
d1 2H
= c .. ... ..
a . .
u0 · · · um≠1 d v0 · · · vm≠1
b
0 · · · ‡m≠1
be its SVD.
The SVD of A≠1 is given by (indicate all correct answers):
1. V ≠1
UH.
Q R
1 2c
1/‡0 · · · 0
. . .. d1 2H
2. c . ..
a . .
v0 · · · vm≠1 d u0 · · · um≠1
b
0 · · · 1/‡m≠1
Q R
1 2c
1/‡m≠1 · · · 0 1 2H
3. .. .. .. d
vm≠1 · · · v0 c
a . . . d
b um≠1 · · · u0 .
0 · · · 1/‡0
Q R
0 ··· 0 1
c
c 0 ··· 1 0 d
d
4. (V P )(PH ≠1
P )(U P )
H H H
where P = c
c .. .. .. d
d
a . . . b
1 ··· 0 0
Answer. 3. and 4.
Explain it!
Solution. This question is a bit tricky.
2. This is just Answer 1. but with the columns of U and V , and the elements of ,
exposed.
3. This answer corrects the problems with the previous two answers: it reorders colums
of U and V so that the diagonal elements of end up ordered from largest to smallest.
This captures what the condition number Ÿ2 (A) = ‡0 /‡n≠1 captures: how elongated the
oval that equals the image of the unit ball is. The more elongated, the greater the ratio
‡0 /‡n≠1 , and the worse the condition number of the matrix. In the limit, when ‡n≠1 = 0,
the unit ball is mapped to a lower dimensional set, meaning that the transformation cannot
be "undone."
Ponder This 2.3.5.6 For the 2D problem discussed in this unit, what would the image of
the unit ball look like as Ÿ2 (A) æ Œ? When is Ÿ2 (A) = Œ?
YouTube: https://www.youtube.com/watch?v=sN0DKG8vPhQ
We are now ready to answer the question "How do we find the best rank-k approximation
for a picture (or, more generally, a matrix)? " posed in Subsection 2.1.1.
Theorem 2.3.6.1 Given A ∈ C^{m×n}, let A = U Σ V^H be its SVD. Assume the entries on the
main diagonal of Σ are σ_0, · · ·, σ_{min(m,n)−1} with σ_0 ≥ · · · ≥ σ_{min(m,n)−1} ≥ 0. Given k such
that 0 ≤ k ≤ min(m, n), partition

    U = ( U_L  U_R ),  V = ( V_L  V_R ),  and  Σ = [ Σ_TL  0    ]
                                                   [ 0     Σ_BR ],

where U_L and V_L have k columns and Σ_TL is k × k. Then

    B = U_L Σ_TL V_L^H

is the matrix with rank at most k that minimizes ‖A − C‖_2 over all C ∈ C^{m×n} with rank(C) ≤ k.
In other words, B is the matrix with rank at most k that is closest to A as measured by the
2-norm. Also, for this B,

    ‖A − B‖_2 = { σ_k  if k < min(m, n),
                { 0    otherwise.
The proof of this theorem builds on the following insight:
Homework 2.3.6.1 Given A ∈ C^{m×n}, let A = U Σ V^H be its SVD. Show that

    A = σ_0 u_0 v_0^H + σ_1 u_1 v_1^H + · · · + σ_{r−1} u_{r−1} v_{r−1}^H,

where u_j and v_j equal the columns of U and V indexed by j, and σ_j equals the diagonal
element of Σ indexed with j.
Solution. W.l.o.g. assume n ≤ m. Rewrite A = U Σ V^H as AV = U Σ. Then
ÎA ≠ BÎ2
= < multiplication with unitary matrices preserves 2-norm >
ÎU H (A ≠ B)V Î2
= < distribute >
ÎU H AV ≠ U H BV Î2
. = < use SVD of A and partition
1 2H .
1
>
2.
.
. ≠ UL UR B VL VR ..
.
2
.A = < howBB A was chosenB.>
.
. TL 0 TL 0
.
.
. ≠ .
. 0 BR 0 0 .2
.A = partitioned subtraction >
< B.
. 0 0 .
. .
. .
. 0 BR .
2
= <>
Î BR Î2
= < TL is k ◊ k >
‡k
(Obviously, this needs to be tidied up for the case where k > rank(A).)
Next, assume that C has rank r Æ k and ÎA ≠ CÎ2 < ÎA ≠ BÎ2 . We will show that this
leads to a contradiction.
• If x œ N (C) then
ÎAyÎ22
= < y = –0 v0 + · · · + –k vk >
ÎA(–0 v0 + · · · + –k vk )Î22
= < distributivity >
Ζ0 Av0 + · · · + –k Avk Î22
= < Avj = ‡j uj >
Ζ0 ‡0 u0 + · · · + –k ‡k uk Î22
= < this works because the uj are orthonormal >
Ζ0 ‡0 u0 Î22 + · · · + Ζk ‡k uk Î22
= < norms are homogeneous and Îuj Î2 = 1 >
|–0 | ‡0 + · · · + |–k |2 ‡k2
2 2
Ø < ‡0 Ø ‡1 Ø · · · Ø ‡k Ø 0 >
(|–0 |2 + · · · + |–k |2 )‡k2
= < ÎyÎ22 = |–0 |2 + · · · + |–k |2 >
2 2
‡k ÎyÎ2 .
so that ÎAyÎ2 Ø ‡k ÎyÎ2 . In other words, vectors in the subspace of all linear combina-
tions of {v0 , . . . , vk } satisfy ÎAxÎ2 Ø ‡k ÎxÎ2 . The dimension of this subspace is k + 1
(since {v0 , · · · , vk } form an orthonormal basis).
• Both these subspaces are subspaces of Cn . One has dimension k + 1 and the other
n ≠ k. This means that if you take a basis for one (which consists of n ≠ k linearly
independent vectors) and add it to a basis for the other (which has k + 1 linearly
independent vectors), you end up with n + 1 vectors. Since these cannot all be linearly
independent in Cn , there must be at least one nonzero vector z that satisfies both
ÎAzÎ2 < ‡k ÎzÎ2 and ÎAzÎ2 Ø ‡k ÎzÎ2 , which is a contradiction.
⌅
Theorem 2.3.6.1 tells us how to pick the best approximation to a given matrix of a given
desired rank. In Section Subsection 2.1.1 we discussed how a low rank matrix can be used
to compress data. The SVD thus gives the best such rank-k approximation. Let us revisit
this.
Let A œ Rm◊n be a matrix that, for example, stores a picture. In this case, the i, j entry
in A is, for example, a number that represents the grayscale value of pixel (i, j).
Homework 2.3.6.2 In Assignments/Week02/matlab execute
IMG = imread( 'Frida.jpg' );
A = double( IMG( :,:,1 ) );
imshow( uint8( A ) )
size( A )
Although the picture is black and white, it was read as if it is a color image, which means
a m ◊ n ◊ 3 array of pixel information is stored. Setting A = IMG( :,:,1 ) extracts a single
matrix of pixel information. (If you start with a color picture, you will want to approximate
IMG( :,:,1), IMG( :,:,2), and IMG( :,:,3) separately.)
Next, compute the SVD of matrix A
[ U, Sigma, V ] = svd( A );
and approximate the picture with a rank-k update, starting with k = 1:
k = 1
B = uint8( U( :, 1:k ) * Sigma( 1:k, 1:k ) * V( :, 1:k )' );
imshow( B );
figure
r = min( size( A ) );
plot( [ 1:r ], diag( Sigma ), 'x' );
Since the singular values span a broad range, we may want to plot them with a log-log plot.
For this particular matrix (picture), there is no dramatic drop in the singular values that
makes it obvious what k is a natural choice.
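One possible way to produce such a log-log plot (a sketch; it assumes Sigma and r from the commands above are still in the workspace):

    figure
    loglog( [ 1:r ], diag( Sigma ), 'x' );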
Solution.
[ Figure: the rank-k approximations of the picture for k = 1, 2, 5, 10, and 25, next to the original picture. ]
2.4 Enrichments
2.4.1 Principle Component Analysis (PCA)
Principle Component Analysis (PCA) is a standard technique in data science related to the
SVD. You may enjoy the article
In that article, PCA is cast as an eigenvalue problem rather than a singular value problem.
Later in the course, in Week 11, we will link these.
2.5 Wrap Up
2.5.1 Additional homework
Homework 2.5.1.1 U œ Cm◊m is unitary if and only if (U x)H (U y) = xH y for all x, y œ Cm .
Hint. Revisit the proof of Homework 2.2.4.6.
Homework 2.5.1.2 Let A, B œ Cm◊n . Furthermore, let U œ Cm◊m and V œ Cn◊n be
unitary.
TRUE/FALSE: U AV H = B iff U H BV = A.
Answer. TRUE
Now prove it!
Homework 2.5.1.3 Prove that nonsingular A œ Cn◊n has condition number Ÿ2 (A) = 1 if
and only if A = ‡Q where Q is unitary and ‡ œ R is positive.
Hint. Use the SVD of A.
Homework 2.5.1.4 Let U œ Cm◊m and V œ Cn◊nAbe unitary.
B
U 0
ALWAYS/SOMETIMES/NEVER: The matrix is unitary.
0 V
Answer. ALWAYS
Now prove it!
Homework 2.5.1.5 Matrix A œ Rm◊m is a stochastic matrix if and only if it is nonnegative
q
(all its entries are nonnegative) and the entries in its columns sum to one: 0Æi<m –i,j = 1.
Such matrices are at the core of Markov processes. Show that a matrix A is both unitary
matrix and a stochastic matrix if and only if it is a permutation matrix.
Homework 2.5.1.6 Show that if ‖·‖ is a norm and A is nonsingular, then ‖·‖_{A^{-1}},
defined by ‖x‖_{A^{-1}} = ‖A^{-1} x‖, is a norm.
Interpret this result in terms of the change of basis of a vector.
2. κ_2(A) = σ_0 / σ_{m−1}.
3. κ_2(A) = u_0^H A v_0 / u_{m−1}^H A v_{m−1}.
2.5.2 Summary
Given x, y ∈ C^m, the component of y in the direction of x is given by

    ( x^H y / x^H x ) x = ( x x^H / x^H x ) y.

The matrix that projects a vector onto the space spanned by x is given by

    x x^H / x^H x.

Thus, the matrix that projects a vector onto the space orthogonal to x is given by

    I − x x^H / x^H x.

Given unit-length u ∈ C^m and v ∈ C^m, (u^H v) u = u u^H v, and
• the matrix that projects a vector onto the space spanned by u is given by u u^H;
• the matrix that projects a vector onto the space that is orthogonal to u is given by I − u u^H.
Let u0 , u1 , . . . , un≠1 œ Cm . These vectors are said to be mutually orthonormal if for all
0 Æ i, j < n I
1 if i = j
ui u j =
H
.
0 otherwise
Let $Q \in \mathbb{C}^{m\times n}$ (with $n \leq m$). Then $Q$ is said to be
• an orthonormal matrix iff $Q^HQ = I$.
If $U \in \mathbb{C}^{m\times m}$ is unitary, then
• $UU^H = I$.
• $U^{-1} = U^H$.
• $U^H$ is unitary.
• $UV$ is unitary if $V$ is also unitary.
If $U \in \mathbb{C}^{m\times m}$ and $V \in \mathbb{C}^{n\times n}$ are unitary, $x \in \mathbb{C}^m$, and $A \in \mathbb{C}^{m\times n}$, then
• $\|Ux\|_2 = \|x\|_2$.
• $\|U\|_2 = 1$.
• $\kappa_2(U) = 1$.
Examples of unitary matrices:
• Rotation in 2D: $\begin{pmatrix} c & -s \\ s & c \end{pmatrix}$.
Then
$$\frac{\|\delta y\|}{\|y\|} \leq \underbrace{\|A\|\,\|A^{-1}\|}_{\kappa(A)}\,\frac{\|\delta x\|}{\|x\|},$$
where $\|\cdot\|$ is an induced matrix norm.
The values $\sigma_0, \ldots, \sigma_{r-1}$ are called the singular values of matrix $A$. The columns of $U$ and $V$ are called the left and right singular vectors, respectively.
Let $A \in \mathbb{C}^{m\times n}$ and $A = U\Sigma V^H$ its SVD with
$$U = \begin{pmatrix} U_L & U_R \end{pmatrix} = \begin{pmatrix} u_0 & \cdots & u_{m-1} \end{pmatrix}, \qquad V = \begin{pmatrix} V_L & V_R \end{pmatrix} = \begin{pmatrix} v_0 & \cdots & v_{n-1} \end{pmatrix},$$
and
$$\Sigma = \begin{pmatrix} \Sigma_{TL} & 0 \\ 0 & 0 \end{pmatrix}, \quad \text{where} \quad \Sigma_{TL} = \begin{pmatrix} \sigma_0 & 0 & \cdots & 0 \\ 0 & \sigma_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{r-1} \end{pmatrix} \quad \text{and} \quad \sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{r-1} > 0.$$
Here $U_L \in \mathbb{C}^{m\times r}$, $V_L \in \mathbb{C}^{n\times r}$ and $\Sigma_{TL} \in \mathbb{R}^{r\times r}$. Then
• $\|A\|_2 = \sigma_0$. (The 2-norm of a matrix equals the largest singular value.)
• $\text{rank}(A) = r$.
• $\mathcal{C}(A) = \mathcal{C}(U_L)$.
• $\mathcal{N}(A) = \mathcal{C}(V_R)$.
• $\mathcal{R}(A) = \mathcal{C}(V_L)$.
• Left null space of $A$ = $\mathcal{C}(U_R)$.
• SVD of the Hermitian transpose: $A^H = V\Sigma^TU^H$.
• Reduced SVD: $A = U_L\Sigma_{TL}V_L^H$.
• $A = \sigma_0u_0v_0^H + \sigma_1u_1v_1^H + \cdots + \sigma_{r-1}u_{r-1}v_{r-1}^H$.
• (Left) pseudo inverse: if $A$ has linearly independent columns, then $A^\dagger = (A^HA)^{-1}A^H = V\Sigma_{TL}^{-1}U_L^H$.
• Given $k$, let $B = U_L\Sigma_{TL}V_L^H$, where now $U_L$, $\Sigma_{TL}$, and $V_L$ keep only the first $k$ singular values and vectors. In other words, $B$ is the matrix with rank at most $k$ that is closest to $A$ as measured by the 2-norm. Also, for this $B$,
$$\|A - B\|_2 = \begin{cases} \sigma_k & \text{if } k < \min(m,n) \\ 0 & \text{otherwise.} \end{cases}$$
Week 3
The QR Decomposition
3.1 Opening
3.1.1 Choosing the right basis
which can be solved using the techniques for linear least-squares in Week 4. The matrix in
the above equation is known as a Vandermonde matrix.
Homework 3.1.1.1 Choose $\chi_0, \chi_1, \ldots, \chi_{m-1}$ to be equally spaced in the interval $[0,1]$: for $i = 0, \ldots, m-1$, $\chi_i = ih$, where $h = 1/(m-1)$. Write Matlab code to create the matrix
$$X = \begin{pmatrix} 1 & \chi_0 & \cdots & \chi_0^{n-1} \\ 1 & \chi_1 & \cdots & \chi_1^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & \chi_{m-1} & \cdots & \chi_{m-1}^{n-1} \end{pmatrix}.$$
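One possible sketch of such Matlab code is given below; the variable names are ours, and m and n are assumed to have been set.
h   = 1 / ( m-1 );
chi = ( 0:m-1 )' * h;          % equally spaced points in [ 0, 1 ]
X   = zeros( m, n );
for j = 0:n-1
    X( :, j+1 ) = chi.^j;      % column j+1 holds chi_i^j
end
cond( X )                      % observe how rapidly the condition number grows with n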
Notice that the curves for $x^j$ and $x^{j+1}$ quickly start to look very similar, which explains why the columns of the Vandermonde matrix quickly become approximately linearly dependent.
YouTube: https://www.youtube.com/watch?v=cBFt2dmXbu4
An alternative set of polynomials that can be used are known as Legendre polynomials. A shifted version (appropriate for the interval $[0,1]$) can be inductively defined by
$$\begin{aligned} P_0(\chi) &= 1 \\ P_1(\chi) &= 2\chi - 1 \\ &\;\;\vdots \\ P_{n+1}(\chi) &= \left( (2n+1)(2\chi-1)P_n(\chi) - nP_{n-1}(\chi) \right)/(n+1). \end{aligned}$$
The polynomials have the property that
$$\int_0^1 P_s(\chi)P_t(\chi)\,d\chi = \begin{cases} C_s & \text{if } s = t\text{, for some nonzero constant } C_s \\ 0 & \text{otherwise,} \end{cases}$$
which is an orthogonality condition on the polynomials.
The function $f : \mathbb{R} \rightarrow \mathbb{R}$ can now instead be approximated by
$$f(\chi) \approx \gamma_0P_0(\chi) + \gamma_1P_1(\chi) + \cdots + \gamma_{n-1}P_{n-1}(\chi),$$
and hence, given points
$$(\chi_0, \phi_0), \ldots, (\chi_{m-1}, \phi_{m-1}),$$
we can determine the polynomial from
$$\begin{array}{ccccccc} \gamma_0P_0(\chi_0) &+& \gamma_1P_1(\chi_0) &+ \cdots +& \gamma_{n-1}P_{n-1}(\chi_0) &=& \phi_0 \\ \vdots & & \vdots & & \vdots & & \vdots \\ \gamma_0P_0(\chi_{m-1}) &+& \gamma_1P_1(\chi_{m-1}) &+ \cdots +& \gamma_{n-1}P_{n-1}(\chi_{m-1}) &=& \phi_{m-1}. \end{array}$$
This can be reformulated as the approximate linear system
$$\begin{pmatrix} 1 & P_1(\chi_0) & \cdots & P_{n-1}(\chi_0) \\ 1 & P_1(\chi_1) & \cdots & P_{n-1}(\chi_1) \\ \vdots & \vdots & & \vdots \\ 1 & P_1(\chi_{m-1}) & \cdots & P_{n-1}(\chi_{m-1}) \end{pmatrix}\begin{pmatrix} \gamma_0 \\ \gamma_1 \\ \vdots \\ \gamma_{n-1} \end{pmatrix} \approx \begin{pmatrix} \phi_0 \\ \phi_1 \\ \vdots \\ \phi_{m-1} \end{pmatrix},$$
which can also be solved using the techniques for linear least-squares in Week 4. Notice that now the columns of the matrix are (approximately) orthogonal: if we "sample" $x$ as $\chi_0, \ldots, \chi_{n-1}$, then
$$\int_0^1 P_s(\chi)P_t(\chi)\,d\chi \approx \sum_{i=0}^{n-1} P_s(\chi_i)P_t(\chi_i),$$
which equals the dot product of the columns indexed with s and t.
Homework 3.1.1.2 Choose $\chi_0, \chi_1, \ldots, \chi_{m-1}$ to be equally spaced in the interval $[0,1]$: for $i = 0, \ldots, m-1$, $\chi_i = ih$, where $h = 1/(m-1)$. Write Matlab code to create the matrix
$$X = \begin{pmatrix} 1 & P_1(\chi_0) & \cdots & P_{n-1}(\chi_0) \\ 1 & P_1(\chi_1) & \cdots & P_{n-1}(\chi_1) \\ \vdots & \vdots & & \vdots \\ 1 & P_1(\chi_{m-1}) & \cdots & P_{n-1}(\chi_{m-1}) \end{pmatrix}.$$
We notice that the matrices created from shifted Legendre polynomials have very good condition numbers.
ans =
5000 0 1 0 1
0 1667 0 1 0
1 0 1001 0 1
0 1 0 715 0
1 0 1 0 556
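A hedged sketch of Matlab code that builds this matrix from the recurrence might look as follows; again the variable names are ours and m and n are assumed to be set.
h   = 1 / ( m-1 );
chi = ( 0:m-1 )' * h;
X   = zeros( m, n );
X( :, 1 ) = 1;                          % P_0( chi ) = 1
if n > 1
    X( :, 2 ) = 2*chi - 1;              % P_1( chi ) = 2 chi - 1
end
for j = 2:n-1                           % column j+1 holds P_j( chi )
    X( :, j+1 ) = ( (2*j-1)*(2*chi-1).*X( :, j ) - (j-1)*X( :, j-1 ) ) / j;
end
cond( X )                               % compare with the Vandermonde matrix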
YouTube: https://www.youtube.com/watch?v=syq-jOKWqTQ
Remark 3.1.1.1 The point is that one ideally formulates a problem in a way that already
captures orthogonality, so that when the problem is discretized ("sampled"), the matrices
that arise will likely inherit that orthogonality, which we will see again and again is a good
thing. In this chapter, we discuss how orthogonality can be exposed if it is not already part of the formulation of the problem.
• 3.4 Enrichments
• 3.5 Wrap Up
• Show that Classical Gram-Schmidt and Modified Gram-Schmidt yield the same result
(in exact arithmetic).
• Compare and contrast the Classical Gram-Schmidt and Modified Gram-Schmidt meth-
ods with regard to cost and robustness in the presence of roundoff error.
• Explain why Householder QR factorization yields a matrix Q with high quality or-
thonormal columns, even in the presence of roundoff error.
YouTube: https://www.youtube.com/watch?v=CWhBZB-3kg4
Given a set of linearly independent vectors {a0 , . . . , an≠1 } µ Cm , the Gram-Schmidt
process computes an orthonormal basis {q0 , . . . , qn≠1 } that spans the same subspace as the
original vectors, i.e.
◦ $\rho_{0,0} = \|a_0\|_2$
Computes the length of vector $a_0$.
◦ $q_0 = a_0/\rho_{0,0}$
Sets $q_0$ to a unit vector in the direction of $a_0$.
Notice that $a_0 = q_0\rho_{0,0}$.
• Compute vector $q_1$ of unit length so that $\text{Span}(\{a_0, a_1\}) = \text{Span}(\{q_0, q_1\})$:
◦ $\rho_{0,1} = q_0^Ha_1$
Computes $\rho_{0,1}$ so that $\rho_{0,1}q_0 = (q_0^Ha_1)q_0$ equals the component of $a_1$ in the direction of $q_0$.
◦ $a_1^\perp = a_1 - \rho_{0,1}q_0$
Computes the component of $a_1$ that is orthogonal to $q_0$.
◦ $\rho_{1,1} = \|a_1^\perp\|_2$
Computes the length of vector $a_1^\perp$.
◦ $q_1 = a_1^\perp/\rho_{1,1}$
Sets $q_1$ to a unit vector in the direction of $a_1^\perp$.
Notice that
$$\begin{pmatrix} a_0 & a_1 \end{pmatrix} = \begin{pmatrix} q_0 & q_1 \end{pmatrix}\begin{pmatrix} \rho_{0,0} & \rho_{0,1} \\ 0 & \rho_{1,1} \end{pmatrix}.$$
Notice that
$$\begin{pmatrix} a_0 & a_1 & a_2 \end{pmatrix} = \begin{pmatrix} q_0 & q_1 & q_2 \end{pmatrix}\begin{pmatrix} \rho_{0,0} & \rho_{0,1} & \rho_{0,2} \\ 0 & \rho_{1,1} & \rho_{1,2} \\ 0 & 0 & \rho_{2,2} \end{pmatrix}.$$
• And so forth.
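To make the first two steps concrete, here is a minimal Matlab sketch (our own variable names) that orthonormalizes two linearly independent column vectors a0 and a1 exactly as described above:
rho00 = norm( a0 );
q0    = a0 / rho00;            % unit vector in the direction of a0
rho01 = q0' * a1;              % component of a1 in the direction of q0
a1p   = a1 - rho01 * q0;       % component of a1 orthogonal to q0
rho11 = norm( a1p );
q1    = a1p / rho11;
% sanity check: [ a0 a1 ] - [ q0 q1 ] * [ rho00 rho01; 0 rho11 ] should be (near) zero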
YouTube: https://www.youtube.com/watch?v=AvXe0MfK _0
Yet another way of looking at this problem is as follows.
and 1 2
Q= q0 · · · qk≠1 qk qk+1 · · · qn≠1
We observe that
• Span({a0 }) = Span({q0 })
Hence a0 = fl0,0 q0 for some scalar fl0,0 .
• Span({a0 , a1 }) = Span({q0 , q1 })
Hence
a1 = fl0,1 q0 + fl1,1 q1
for some scalars fl0,1 , fl1,1 .
Notice that
qkH ak = qkH (fl0,k q0 + · · · + flk≠1,k qk≠1 + flk,k qk )
= fl0,k qkH q0 + · · · + flk≠1,k qkH qk≠1 + flk,k qkH qk
¸ ˚˙ ˝ ¸ ˚˙ ˝ ¸ ˚˙ ˝
0 0 1
so that
fli,k = qiH ak ,
for i = 0, . . . , k ≠ 1. Next, we can compute
flk,k = Îa‹
k Î2
and
q k = a‹
k /flk,k
Remark 3.2.1.1 For a review of Gram-Schmidt orthogonalization and exercises orthogonal-
izing real-valued vectors, you may want to look at Linear Algebra: Foundations to Frontiers
(LAFF) [26] Week 11.
YouTube: https://www.youtube.com/watch?v=tHj20PSBCek
The discussion in the last unit motivates the following theorem:
Theorem 3.2.2.1 QR Decomposition Theorem. Let $A \in \mathbb{C}^{m\times n}$ have linearly independent columns. Then there exists an orthonormal matrix $Q$ and upper triangular matrix $R$ such that $A = QR$, its QR decomposition. If the diagonal elements of $R$ are taken to be real and positive, then the decomposition is unique.
In order to prove this theorem elegantly, we will first present the Gram-Schmidt orthog-
onalization algorithm using FLAME notation, in the next unit.
Ponder This 3.2.2.1 What happens in the Gram-Schmidt algorithm if the columns of A are
NOT linearly independent? How might one fix this? How can the Gram-Schmidt algorithm
be used to identify which columns of A are linearly independent?
Solution. If $a_j$ is the first column such that $\{a_0, \ldots, a_j\}$ are linearly dependent, then $a_j^\perp$ will equal the zero vector and the process breaks down.
When a vector with $a_j^\perp$ equal to the zero vector is encountered, the columns can be rearranged (permuted) so that that column (or those columns) come last.
Again, if $a_j^\perp = 0$ for some $j$, then the columns are linearly dependent, since then $a_j$ can be written as a linear combination of the previous columns.
YouTube: https://www.youtube.com/watch?v=YEEEJYp8snQ
Remark 3.2.3.1 If the FLAME notation used in this unit is not intuitively obvious, you may want to review some of the materials in Weeks 3-5 of Linear Algebra: Foundations to Frontiers (http://www.ulaff.net).
An alternative for motivating that algorithm is as follows:
• Consider A = QR.
Also, $Q_0^HQ_0 = I$ and $Q_0^Hq_1 = 0$, since the columns of $Q$ are mutually orthonormal.
• Hence
$$Q_0^Ha_1 = Q_0^HQ_0r_{01} + Q_0^Hq_1\rho_{11} = r_{01}.$$
• This shows how $r_{01}$ can be computed from $Q_0$ and $a_1$, which are already known:
$$r_{01} := Q_0^Ha_1.$$
• Next,
$$a_1^\perp := a_1 - Q_0r_{01}$$
is computed from (3.2.1). This is the component of $a_1$ that is perpendicular (orthogonal) to the columns of $Q_0$. We know it is nonzero since the columns of $A$ are linearly independent.
• Since $\rho_{11}q_1 = a_1^\perp$ and we know that $q_1$ has unit length, we now compute
$$\rho_{11} := \|a_1^\perp\|_2$$
and
$$q_1 := a_1^\perp/\rho_{11}.$$
Figure 3.2.3.2 (Classical) Gram-Schmidt (CGS) algorithm for computing the QR factor-
ization of a matrix A.
Having presented the algorithm in FLAME notation, we can provide a formal proof of
Theorem 3.2.2.1.
Proof of Theorem 3.2.2.1. Informal proof: The process described earlier in this unit constructs the QR decomposition. The computation of $\rho_{j,j}$ is unique if it is restricted to be a real and positive number. This then prescribes all other results along the way.
Formal proof:
(By induction). Note that n Æ m since A has linearly independent columns.
1 2
• Base case: n = 1. In this case A = A0 a1 , where A0 has no columns. Since A has
• Inductive step: Assume that the result is true for all A0 with k linearly independent
columns. We will show it is true for A with k + 1 linearly independent columns.
Let $A \in \mathbb{C}^{m\times(k+1)}$. Partition $A \rightarrow \begin{pmatrix} A_0 & a_1 \end{pmatrix}$.
By the induction hypothesis, there exist $Q_0$ and $R_{00}$ such that $Q_0^HQ_0 = I$, $R_{00}$ is upper triangular with nonzero diagonal entries and $A_0 = Q_0R_{00}$. Also, by the induction hypothesis, if the elements on the diagonal of $R_{00}$ are chosen to be positive, then the factorization $A_0 = Q_0R_{00}$ is unique.
We are looking for $\tilde{Q}_0$, $q_1$, $\tilde{R}_{00}$, $r_{01}$, and $\rho_{11}$ so that
$$\begin{pmatrix} A_0 & a_1 \end{pmatrix} = \begin{pmatrix} \tilde{Q}_0 & q_1 \end{pmatrix}\begin{pmatrix} \tilde{R}_{00} & r_{01} \\ 0 & \rho_{11} \end{pmatrix}.$$
This means that
◦ $A_0 = \tilde{Q}_0\tilde{R}_{00}$.
We choose $\tilde{Q}_0 = Q_0$ and $\tilde{R}_{00} = R_{00}$. If we insist that the elements on the diagonal be positive, this choice is unique. Otherwise, it is a choice that allows us to prove existence.
◦ $a_1 = Q_0r_{01} + \rho_{11}q_1$.
Multiplying both sides by $Q_0^H$ we find that $r_{01}$ must equal $Q_0^Ha_1$ (and is uniquely determined by this if we insist on positive elements on the diagonal).
◦ Letting $a_1^\perp = a_1 - Q_0r_{01}$ (which equals the component of $a_1$ orthogonal to $\mathcal{C}(Q_0)$), we find that $\rho_{11}q_1 = a_1^\perp$. Since $q_1$ has unit length, we can choose $\rho_{11} = \|a_1^\perp\|_2$. If we insist that $\rho_{11}$ be real and positive, this choice is unique.
• By the Principle of Mathematical Induction the result holds for all matrices $A \in \mathbb{C}^{m\times n}$ with $m \geq n$.
⌅
Homework 3.2.3.1 Implement the algorithm given in Figure 3.2.3.2 as
function [ Q, R ] = CGS_QR( A )
YouTube: https://www.youtube.com/watch?v=pOBJHhV3TKY
In the video, we reasoned that the following two algorithms compute the same values,
except that the columns of Q overwrite the corresponding columns of A:
(a) MGS algorithm that computes $Q$ and $R$ from $A$:
for j = 0, ..., n-1
   $a_j^\perp := a_j$
   for k = 0, ..., j-1
      $\rho_{k,j} := q_k^Ha_j^\perp$
      $a_j^\perp := a_j^\perp - \rho_{k,j}q_k$
   end
   $\rho_{j,j} := \|a_j^\perp\|_2$
   $q_j := a_j^\perp/\rho_{j,j}$
end
(b) MGS algorithm that computes $Q$ and $R$ from $A$, overwriting $A$ with $Q$:
for j = 0, ..., n-1
   for k = 0, ..., j-1
      $\rho_{k,j} := a_k^Ha_j$
      $a_j := a_j - \rho_{k,j}a_k$
   end
   $\rho_{j,j} := \|a_j\|_2$
   $a_j := a_j/\rho_{j,j}$
end
Homework 3.2.4.1 Assume that $q_0, \ldots, q_{k-1}$ are mutually orthonormal. Let $\rho_{j,k} = q_j^Hy$ for $j = 0, \ldots, i-1$. Show that
$$q_i^H(y - \rho_{0,k}q_0 - \cdots - \rho_{i-1,k}q_{i-1}) = q_i^Hy$$
for $i = 0, \ldots, k-1$.
Solution.
$$q_i^H(y - \rho_{0,k}q_0 - \cdots - \rho_{i-1,k}q_{i-1}) = q_i^Hy - \rho_{0,k}\underbrace{q_i^Hq_0}_{0} - \cdots - \rho_{i-1,k}\underbrace{q_i^Hq_{i-1}}_{0} = q_i^Hy.$$
YouTube: https://www.youtube.com/watch?v=0ooNPondq5M
This homework illustrates how, given a vector $y \in \mathbb{C}^m$ and a matrix $Q \in \mathbb{C}^{m\times k}$, the component orthogonal to the column space of $Q$, given by $(I - QQ^H)y$, can be computed by either of the two algorithms given in Figure 3.2.4.1. The one on the left, Proj⊥Q_CGS(Q, y), projects $y$ onto the space perpendicular to the columns of $Q$ as did the Gram-Schmidt algorithm with which we started. The one on the right successfully subtracts out the component in the direction of $q_i$ using a vector that has been updated in previous iterations (and hence is already orthogonal to $q_0, \ldots, q_{i-1}$). The algorithm on the right is one variant of the Modified Gram-Schmidt (MGS) algorithm.
$[y^\perp, r]$ = Proj⊥Q_CGS$(Q, y)$ (used by CGS):
   $y^\perp = y$
   for i = 0, ..., k-1
      $\rho_i := q_i^Hy$
      $y^\perp := y^\perp - \rho_iq_i$
   endfor
$[y^\perp, r]$ = Proj⊥Q_MGS$(Q, y)$ (used by MGS):
   $y^\perp = y$
   for i = 0, ..., k-1
      $\rho_i := q_i^Hy^\perp$
      $y^\perp := y^\perp - \rho_iq_i$
   endfor
YouTube: https://www.youtube.com/watch?v=3XzHFWzV5iE
The video discusses how MGS can be rearranged so that every time a new vector qk
is computed (overwriting ak ), the remaining vectors, {ak+1 , . . . , an≠1 }, can be updated by
subtracting out the component in the direction of qk . This is also illustrated through the
next sequence of equivalent algorithms.
(c) MGS algorithm that normalizes the jth column to have unit length to compute $q_j$ (overwriting $a_j$ with the result) and then subtracts the component in the direction of $q_j$ off the rest of the columns ($a_{j+1}, \ldots, a_{n-1}$):
for j = 0, ..., n-1
   $\rho_{j,j} := \|a_j\|_2$
   $a_j := a_j/\rho_{j,j}$
   for k = j+1, ..., n-1
      $\rho_{j,k} := a_j^Ha_k$
      $a_k := a_k - \rho_{j,k}a_j$
   end
end
(d) Slight modification of the algorithm in (c) that computes $\rho_{j,k}$ in a separate loop:
for j = 0, ..., n-1
   $\rho_{j,j} := \|a_j\|_2$
   $a_j := a_j/\rho_{j,j}$
   for k = j+1, ..., n-1
      $\rho_{j,k} := a_j^Ha_k$
   end
   for k = j+1, ..., n-1
      $a_k := a_k - \rho_{j,k}a_j$
   end
end
(e), (f) Algorithm in (d) rewritten to expose only the outer loop, with the two inner loops expressed as the row-vector-times-matrix multiplication
$$\begin{pmatrix} \rho_{j,j+1} & \cdots & \rho_{j,n-1} \end{pmatrix} := a_j^H\begin{pmatrix} a_{j+1} & \cdots & a_{n-1} \end{pmatrix}$$
and the rank-1 update
$$\begin{pmatrix} a_{j+1} & \cdots & a_{n-1} \end{pmatrix} := \begin{pmatrix} a_{j+1} & \cdots & a_{n-1} \end{pmatrix} - a_j\begin{pmatrix} \rho_{j,j+1} & \cdots & \rho_{j,n-1} \end{pmatrix}.$$
Figure 3.2.4.4 Alternative Modified Gram-Schmidt algorithm for computing the QR fac-
torization of a matrix A.
As you prove the following insights, relate each to the algorithm in Figure 3.2.4.4. In
particular, at the top of the loop of a typical iteration, how have the different parts of A and
R been updated?
1. $A_L = Q_LR_{TL}$.
($Q_LR_{TL}$ equals the QR factorization of $A_L$.)
2. $\mathcal{C}(A_L) = \mathcal{C}(Q_L)$.
(The first $k$ columns of $Q$ form an orthonormal basis for the space spanned by the first $k$ columns of $A$.)
3. $R_{TR} = Q_L^HA_R$.
4. $(A_R - Q_LR_{TR})^HQ_L = 0$.
(Each column in $A_R - Q_LR_{TR}$ equals the component of the corresponding column of $A_R$ that is orthogonal to $\text{Span}(Q_L)$.)
5. $\mathcal{C}(A_R - Q_LR_{TR}) = \mathcal{C}(Q_R)$.
6. $A_R - Q_LR_{TR} = Q_RR_{BR}$.
(The columns of $Q_R$ form an orthonormal basis for the column space of $A_R - Q_LR_{TR}$.)
Solution. Consider the fact that $A = QR$. Then, multiplying the partitioned matrices,
$$\begin{pmatrix} A_L & A_R \end{pmatrix} = \begin{pmatrix} Q_L & Q_R \end{pmatrix}\begin{pmatrix} R_{TL} & R_{TR} \\ 0 & R_{BR} \end{pmatrix} = \begin{pmatrix} Q_LR_{TL} & Q_LR_{TR} + Q_RR_{BR} \end{pmatrix}.$$
Hence
$$A_L = Q_LR_{TL} \quad \text{and} \quad A_R = Q_LR_{TR} + Q_RR_{BR}, \qquad (3.2.2)$$
which answers 1.
2. $\mathcal{C}(A_L) = \mathcal{C}(Q_L)$ can be shown by noting that $R$ is upper triangular and nonsingular and hence $R_{TL}$ is upper triangular and nonsingular, and using this to show that $\mathcal{C}(A_L) \subset \mathcal{C}(Q_L)$ and $\mathcal{C}(Q_L) \subset \mathcal{C}(A_L)$:
• $\mathcal{C}(A_L) \subset \mathcal{C}(Q_L)$: Let $y \in \mathcal{C}(A_L)$. Then there exists $x$ such that $A_Lx = y$. But then $Q_LR_{TL}x = y$ and hence $Q_L(R_{TL}x) = y$, which means that $y \in \mathcal{C}(Q_L)$.
• $\mathcal{C}(Q_L) \subset \mathcal{C}(A_L)$: Let $y \in \mathcal{C}(Q_L)$. Then there exists $x$ such that $Q_Lx = y$. But then $A_LR_{TL}^{-1}x = y$ and hence $A_L(R_{TL}^{-1}x) = y$, which means that $y \in \mathcal{C}(A_L)$.
This answers 2.
3. Rearranging the right equality in (3.2.2) gives $A_R - Q_LR_{TR} = Q_RR_{BR}$. Multiplying both sides by $Q_L^H$,
$$Q_L^H(A_R - Q_LR_{TR}) = Q_L^HQ_RR_{BR}$$
is equivalent to
$$Q_L^HA_R - \underbrace{Q_L^HQ_L}_{I}R_{TR} = \underbrace{Q_L^HQ_R}_{0}R_{BR} = 0.$$
Rearranging yields 3.
4. Similarly,
$$(A_R - Q_LR_{TR})^HQ_L = R_{BR}^H\underbrace{Q_R^HQ_L}_{0} = 0,$$
which answers 4.
5. and 6. Rearranging the right equality in (3.2.2) yields $A_R - Q_LR_{TR} = Q_RR_{BR}$, which answers 6.; since $R_{BR}$ is upper triangular and nonsingular, the same column space argument as in 2. then gives $\mathcal{C}(A_R - Q_LR_{TR}) = \mathcal{C}(Q_R)$, which answers 5.
YouTube: https://www.youtube.com/watch?v=7ArZnHE0PIw
In theory, all Gram-Schmidt algorithms discussed in the previous sections are equivalent
in the sense that they compute the exact same QR factorizations when exact arithmetic is
employed. In practice, in the presence of round-off error, the orthonormal columns of $Q$ computed by MGS are often "more orthonormal" than those computed by CGS. We will analyze how round-off error affects linear algebra computations in the second part of ALAFF. For now you will investigate it with a classic example.
When storing real (or complex) valued numbers in a computer, a limited accuracy can be
maintained, leading to round-off error when a number is stored and/or when computation
with numbers is performed. Informally, the machine epsilon (also called the unit roundoff
error) is defined as the largest positive number, ‘mach , such that the stored value of 1 + ‘mach
is rounded back to 1.
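In Matlab, eps returns the distance from 1 to the next larger double precision number, which is closely related to (roughly twice) the unit roundoff in the sense used here. A quick, hedged way to observe the rounding behavior:
emach = eps;               % about 2.2e-16 for double precision
( 1 + emach/2 ) == 1       % true: 1 + emach/2 is rounded back to 1
( 1 + emach ) == 1         % false: this perturbation is large enough to survive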
Now, let us consider a computer where the only error that is ever incurred is when $1 + \epsilon_{\text{mach}}$ is computed, in which case the result is rounded to $1$. Consider the matrix
$$A = \begin{pmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{pmatrix},$$
where $\epsilon$ is chosen so that $1 + \epsilon^2$ is rounded to $1$ (for instance, $\epsilon = \sqrt{\epsilon_{\text{mach}}}$).
By hand, apply both the CGS and the MGS algorithms with this matrix, rounding 1 + ‘mach
to 1 whenever encountered in the calculation.
Upon completion, check whether the columns of Q that are computed are (approximately)
orthonormal.
Solution. The complete calculation is given by
• For CGS, the computed matrix is
$$Q = \begin{pmatrix} 1 & 0 & 0 \\ \epsilon & -\frac{\sqrt{2}}{2} & -\frac{\sqrt{2}}{2} \\ 0 & \frac{\sqrt{2}}{2} & 0 \\ 0 & 0 & \frac{\sqrt{2}}{2} \end{pmatrix},$$
so that
$$Q^HQ = \begin{pmatrix} 1 + \epsilon_{\text{mach}} & -\frac{\sqrt{2}}{2}\epsilon & -\frac{\sqrt{2}}{2}\epsilon \\ -\frac{\sqrt{2}}{2}\epsilon & 1 & \frac{1}{2} \\ -\frac{\sqrt{2}}{2}\epsilon & \frac{1}{2} & 1 \end{pmatrix}.$$
Clearly, the computed second and third columns of $Q$ are not mutually orthonormal. What is going on? The answer lies with how $a_2^\perp$ is computed in the last step, $a_2^\perp := a_2 - (q_0^Ha_2)q_0 - (q_1^Ha_2)q_1$. Now, $q_0$ has a relatively small error in it and hence $(q_0^Ha_2)q_0$ has a relatively small error in it. It is likely that a part of that error is in the direction of $q_1$. Relative to $(q_0^Ha_2)q_0$, that error in the direction of $q_1$ is small, but relative to $a_2 - (q_0^Ha_2)q_0$ it is not. The point is that $a_2 - (q_0^Ha_2)q_0$ then has a relatively large error in it in the direction of $q_1$. Subtracting $(q_1^Ha_2)q_1$ does not fix this, and since in the end $a_2^\perp$ is small, it has a relatively large error in the direction of $q_1$. This error is amplified when $q_2$ is computed by normalizing $a_2^\perp$.
• For MGS, the computed matrix is
$$Q = \begin{pmatrix} 1 & 0 & 0 \\ \epsilon & -\frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \\ 0 & \frac{\sqrt{2}}{2} & -\frac{\sqrt{6}}{6} \\ 0 & 0 & \frac{2\sqrt{6}}{6} \end{pmatrix},$$
so that
$$Q^HQ = \begin{pmatrix} 1 + \epsilon_{\text{mach}} & -\frac{\sqrt{2}}{2}\epsilon & -\frac{\sqrt{6}}{6}\epsilon \\ -\frac{\sqrt{2}}{2}\epsilon & 1 & 0 \\ -\frac{\sqrt{6}}{6}\epsilon & 0 & 1 \end{pmatrix}.$$
YouTube: https://www.youtube.com/watch?v=OT4Yd-eVMSo
We have argued via an example that MGS is more accurate than CGS. A more thorough
analysis is needed to explain why this is generally so.
YouTube: https://www.youtube.com/watch?v=NAdMU_1ZANk
A fundamental problem to avoid in numerical codes is the situation where one starts
with large values and one ends up with small values with large relative errors in them. This
is known as catastrophic cancellation. The Gram-Schmidt algorithms can inherently fall
victim to this: column aj is successively reduced in length as components in the directions of
{q0 , . . . , qj≠1 } are subtracted, leaving a small vector if aj was almost in the span of the first
j columns of A. Application of a unitary transformation to a matrix or vector inherently
preserves length. Thus, it would be beneficial if the QR factorization could be implemented
as the successive application of unitary transformations. The Householder QR factorization
accomplishes this.
The first fundamental insight is that the product of unitary matrices is itself unitary. If,
given $A \in \mathbb{C}^{m\times n}$ (with $m \geq n$), one could find a sequence of unitary matrices $\{H_0, \ldots, H_{n-1}\}$ such that
$$H_{n-1}\cdots H_0A = \begin{pmatrix} R \\ 0 \end{pmatrix},$$
where $R \in \mathbb{C}^{n\times n}$ is upper triangular, then
$$A = \underbrace{H_0^H\cdots H_{n-1}^H}_{Q}\begin{pmatrix} R \\ 0 \end{pmatrix}.$$
YouTube: https://www.youtube.com/watch?v=6TIVIw4B5VA
What we have discovered in this first video is how to construct a Householder trans-
formation, also referred to as a reflector, since it acts like a mirroring with respect to the
subspace orthogonal to the vector u, as illustrated in Figure 3.3.2.1.
Definition 3.3.2.2 Let $u \in \mathbb{C}^n$ be a vector of unit length ($\|u\|_2 = 1$). Then $H = I - 2uu^H$ is said to be a Householder transformation or (Householder) reflector. ⌃
We observe:
• If $z$ is perpendicular to $u$, then $(I - 2uu^H)z = z - 2u(u^Hz) = z$, since $u^Hz = 0$.
• $H = H^H$.
Solution:
$$(I - 2uu^H)^H = I - 2(u^H)^Hu^H = I - 2uu^H.$$
• $H^HH = HH^H = I$ (a reflector is unitary).
Solution:
$$H^HH = HH = (I - 2uu^H)(I - 2uu^H) = I - 4uu^H + 4u\underbrace{(u^Hu)}_{1}u^H = I.$$
YouTube: https://www.youtube.com/watch?v=wmjUHak9yHU
We wish to find a reflector that maps a given vector $x$ to a vector $y$ of the same length:
$$(I - 2uu^H)x = y.$$
From our discussion above, we need to find a vector $u$ that is perpendicular to the space with respect to which we will reflect. From Figure 3.3.2.3 we notice that the vector from $y$ to $x$, $v = x - y$, is perpendicular to the desired space. Thus, $u$ must equal a unit vector in the direction of $v$: $u = v/\|v\|_2$.
Remark 3.3.2.4 In subsequent discussion we will prefer to give Householder transformations as $I - uu^H/\tau$, where $\tau = u^Hu/2$, so that $u$ no longer needs to be a unit vector, just a direction. The reason for this will become obvious later.
When employing Householder transformations as part of a QR factorization algorithm,
we need to introduce zeroes below the diagonal of our matrix. This requires a very special
case of Householder transformation.
YouTube: https://www.youtube.com/watch?v=iMrgPGCWZ_o
Remark 3.3.2.6 This is a good place to clarify how we index in this course. Here we
label the first element of the vector x as ‰1 , despite the fact that we have advocated in
favor of indexing starting with zero. In our algorithms that leverage the FLAME notation
(partitioning/repartitioning), you may have noticed that a vector or scalar indexed with
1 refers to the "current column/row" or "current element". In preparation of using the
computation of the vectors v and u in the setting of such an algorithm, we use ‰1 here for
the first element from which these vectors will be computed, which tends to be an element
that is indexed with 1. So, there is reasoning behind the apparent insanity.
Ponder This 3.3.2.2 Consider x œ R2 as drawn below:
YouTube: https://www.youtube.com/watch?v=UX_QBt90jf8
where $\nu_1 = \chi_1 \mp \|x\|_2$ equals the first element of $v$. (Note that if $\nu_1 = 0$ then $u_2$ can be set to $0$.)
Here $\pm$ denotes a complex scalar on the complex unit circle. By the same argument as before
$$v = \begin{pmatrix} \chi_1 - \pm\|x\|_2 \\ x_2 \end{pmatrix}.$$
We now wish to normalize this vector so its first entry equals "1":
$$u = \frac{v}{\nu_1} = \frac{1}{\chi_1 - \pm\|x\|_2}\begin{pmatrix} \chi_1 - \pm\|x\|_2 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ x_2/\nu_1 \end{pmatrix}.$$
represents the computation of the above mentioned vector $u_2$, and scalars $\rho$ and $\tau$, from vector $x$. We will use the notation $H(x)$ for the transformation $I - \frac{1}{\tau}uu^H$ where $u$ and $\tau$ are computed by Housev($x$).
Algorithm: $\left[\begin{pmatrix} \rho \\ u_2 \end{pmatrix}, \tau\right] = \text{Housev}\left(\begin{pmatrix} \chi_1 \\ x_2 \end{pmatrix}\right)$
$\chi_2 := \|x_2\|_2$
$\alpha := \left\|\begin{pmatrix} \chi_1 \\ \chi_2 \end{pmatrix}\right\|_2 \;\; (= \|x\|_2)$
Simple formulation: $\rho = -\text{sign}(\chi_1)\|x\|_2$; $\;\nu_1 = \chi_1 + \text{sign}(\chi_1)\|x\|_2$; $\;u_2 = x_2/\nu_1$; $\;\tau = (1 + u_2^Hu_2)/2$.
Efficient computation: $\rho := -\text{sign}(\chi_1)\alpha$; $\;\nu_1 := \chi_1 - \rho$; $\;u_2 := x_2/\nu_1$; $\;\chi_2 := \chi_2/|\nu_1| \;(= \|u_2\|_2)$; $\;\tau := (1 + \chi_2^2)/2$.
Figure 3.3.3.1 Computing the Householder transformation. Left: simple formulation. Right: efficient computation. Note: I have not completely double-checked these formulas for the complex case. They work for the real case.
Remark 3.3.3.2 The function
function [ rho, ...
u2, tau ] = Housev( chi1, ...
x2 )
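For the real-valued case, a minimal sketch of such a routine, following the right column of Figure 3.3.3.1, might look as follows. Treat it as illustrative only: the function name is ours, the case chi1 = 0 is not handled, and it has not been checked for complex inputs.
function [ rho, u2, tau ] = Housev_sketch( chi1, x2 )
% Compute u2 and tau so that ( I - [ 1; u2 ] * [ 1; u2 ]' / tau ) * [ chi1; x2 ] = [ rho; 0 ].
    chi2  = norm( x2 );
    alpha = norm( [ chi1; chi2 ] );      % = || x ||_2
    rho   = -sign( chi1 ) * alpha;
    nu1   = chi1 - rho;                  % = chi1 + sign( chi1 ) * || x ||_2 (no cancellation)
    u2    = x2 / nu1;
    chi2  = chi2 / abs( nu1 );           % = || u2 ||_2
    tau   = ( 1 + chi2^2 ) / 2;
end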
YouTube: https://www.youtube.com/watch?v=5MeeuSoFBdY
Let $A$ be an $m\times n$ matrix with $m \geq n$. We will now show how to compute $A \rightarrow QR$, the QR factorization, as a sequence of Householder transformations applied to $A$, which eventually zeroes out all elements of that matrix below the diagonal. The process is illustrated in Figure 3.3.4.1.
[Figure 3.3.4.1 illustrates, for a sequence of iterations, how the original matrix is transformed: at each step $\left[\begin{pmatrix}\rho_{11}\\u_{21}\end{pmatrix},\tau_1\right] := \text{Housev}\begin{pmatrix}\alpha_{11}\\a_{21}\end{pmatrix}$ is computed, the current block is updated to $\begin{pmatrix}\rho_{11} & a_{12}^T - w_{12}^T\\ 0 & A_{22} - u_{21}w_{12}^T\end{pmatrix}$, and the computation "moves forward"; the $\times$ patterns in the figure show how the entries below the diagonal are zeroed one column at a time.]
Figure 3.3.4.1 Illustration of Householder QR factorization.
In the first iteration, we partition
$$A \rightarrow \begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix}.$$
Let
$$\left[\begin{pmatrix} \rho_{11} \\ u_{21} \end{pmatrix}, \tau_1\right] = \text{Housev}\left(\begin{pmatrix} \alpha_{11} \\ a_{21} \end{pmatrix}\right)$$
be the Householder transform computed from the first column of $A$. Then applying this Householder transform to $A$ yields
$$\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} := \left( I - \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^H \right)\begin{pmatrix} \alpha_{11} & a_{12}^T \\ a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} \rho_{11} & a_{12}^T - w_{12}^T \\ 0 & A_{22} - u_{21}w_{12}^T \end{pmatrix},$$
where $w_{12}^T = (a_{12}^T + u_{21}^HA_{22})/\tau_1$. Computation of a full QR factorization of $A$ will now proceed with the updated matrix $A_{22}$.
YouTube: https://www.youtube.com/watch?v=WWe8yVccZy0
Homework 3.3.4.1 Show that
$$\begin{pmatrix} I & 0 \\ 0 & I - \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^H \end{pmatrix} = I - \frac{1}{\tau_1}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}^H.$$
Solution.
$$\begin{pmatrix} I & 0 \\ 0 & I - \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^H \end{pmatrix} = I - \begin{pmatrix} 0 & 0 \\ 0 & \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^H \end{pmatrix} = I - \frac{1}{\tau_1}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & u_{21}^H \\ 0 & u_{21} & u_{21}u_{21}^H \end{pmatrix} = I - \frac{1}{\tau_1}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}^H.$$
More generally, let us assume that after $k$ iterations of the algorithm matrix $A$ contains
$$A \rightarrow \begin{pmatrix} R_{TL} & R_{TR} \\ 0 & A_{BR} \end{pmatrix} = \begin{pmatrix} R_{00} & r_{01} & R_{02} \\ 0 & \alpha_{11} & a_{12}^T \\ 0 & a_{21} & A_{22} \end{pmatrix},$$
where $R_{TL}$ is an upper triangular $k\times k$ matrix, and update
$$A := \begin{pmatrix} I & 0 \\ 0 & I - \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 1 \\ u_{21} \end{pmatrix}^H \end{pmatrix}\begin{pmatrix} R_{00} & r_{01} & R_{02} \\ 0 & \alpha_{11} & a_{12}^T \\ 0 & a_{21} & A_{22} \end{pmatrix} = \left( I - \frac{1}{\tau_1}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}\begin{pmatrix} 0 \\ 1 \\ u_{21} \end{pmatrix}^H \right)\begin{pmatrix} R_{00} & r_{01} & R_{02} \\ 0 & \alpha_{11} & a_{12}^T \\ 0 & a_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} R_{00} & r_{01} & R_{02} \\ 0 & \rho_{11} & a_{12}^T - w_{12}^T \\ 0 & 0 & A_{22} - u_{21}w_{12}^T \end{pmatrix}.$$
Upon completion, $A$ contains $R$, and if $\hat{A}$ denotes the original contents of $A$, rearranging these transformations we find that
$$\hat{A} = H_0H_1\cdots H_{n-1}R,$$
which shows that if $Q = H_0H_1\cdots H_{n-1}$ then $\hat{A} = QR$.
Typically, the algorithm overwrites the original matrix $A$ with the upper triangular matrix, and at each step $u_{21}$ is stored over the elements that become zero, thus overwriting $a_{21}$. (It is for this reason that the first element of $u$ was normalized to equal "1".) In this case $Q$ is usually not explicitly formed, as it can be stored as the separate Householder vectors below the diagonal of the overwritten matrix. The algorithm that overwrites $A$ in this manner is given in Figure 3.3.4.2.
[A, t] A
= HQR_unb_var1(A) B A B
AT L AT R tT
Aæ and t æ
ABL ABR tB
AT L is 0 ◊ 0 and tT has 0 elements
while n(ABR ) > 0 Q R Q R
A B A00 a01 A02 A B t0
AT L AT R c d tT c d
æ a aT10 –11 aT12 b and æ a ·1 b
ABL ABR tB
CA B D
A a
CA 20 B 21 D 22
A A B
t2
–11 fl11 –11
, · := , ·1 = Housev
a21 A 1 B u
A 21 A B
a21 B A B
aT12 1 1 1 2 aT12
Update := I ≠ ·1 1 u21H
A22 u21 A22
via the steps
A12 := (a B 12 +Aa21 A22 )/·1
T T H
w B
aT12 aT12 ≠ w12
T
:= T
A22 A22 ≠ a21 w12
Q R Q R
A B A00 a01 A02 A B t0
AT L AT R c T T d tT c d
Ω a a10 –11 a12 b and Ω a ·1 b
ABL ABR tB
A20 a21 A22 t2
endwhile
Figure 3.3.4.2 Unblocked Householder transformation based QR factorization.
In that figure,
[A, t] = HQR_unb_var1(A)
denotes the operation that computes the QR factorization of an $m\times n$ matrix $A$, with $m \geq n$, via Householder transformations. It returns the Householder vectors and matrix $R$ in the first argument and the vector of scalars $\tau_i$ that are computed as part of the Householder transformations in $t$.
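To see what this compact storage buys us: once a column has been factored, the corresponding reflector can be applied to a vector without ever forming $H$ explicitly. A hedged Matlab fragment (our own names: a21 holds the stored Householder vector, tau1 the corresponding scalar, and [ psi1; y2 ] the vector being transformed):
omega1 = ( psi1 + a21' * y2 ) / tau1;   % omega1 = ( psi1 + u21^H y2 ) / tau1
psi1   = psi1 - omega1;
y2     = y2 - omega1 * a21;             % [ psi1; y2 ] := H [ psi1; y2 ]
This is exactly the update that reappears when we apply $Q^H$ to a vector in Subsection 3.3.6.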
Homework 3.3.4.2 Given $A \in \mathbb{R}^{m\times n}$, show that the cost of the algorithm in Figure 3.3.4.2 is given by
$$C_{\text{HQR}}(m,n) \approx 2mn^2 - \frac{2}{3}n^3 \text{ flops}.$$
Solution. The bulk of the computation is in
$$w_{12}^T = (a_{12}^T + u_{21}^HA_{22})/\tau_1 \quad \text{and} \quad A_{22} - u_{21}w_{12}^T.$$
During the $k$th iteration (when $R_{TL}$ is $k\times k$), this means a matrix-vector multiplication ($u_{21}^HA_{22}$) and a rank-1 update with matrix $A_{22}$, which is of size approximately $(m-k)\times(n-k)$, for a cost of $4(m-k)(n-k)$ flops. Thus the total cost is approximately
$$\sum_{k=0}^{n-1}4(m-k)(n-k) = 4\sum_{j=0}^{n-1}(m-n+j)j = 4(m-n)\sum_{j=0}^{n-1}j + 4\sum_{j=0}^{n-1}j^2 \approx 2(m-n)n^2 + 4\int_0^nx^2\,dx = 2mn^2 - 2n^3 + \frac{4}{3}n^3 = 2mn^2 - \frac{2}{3}n^3.$$
Homework 3.3.4.3 Implement the algorithm given in Figure 3.3.4.2 as
function [ A_out, t ] = HQR( A )
3.3.5 Forming Q
YouTube: https://www.youtube.com/watch?v=cFWMsVNBzDY
Given A œ Cm◊n , let [A, t] = HQR_unb_var1(A) yield the matrix A with the House-
holder vectors stored below the diagonal, R stored on and above the diagonal, and the
scalars ·i , 0 Æ i < n, stored in vector t. We now discuss how to form the first n columns of
Q = H0 H1 · · · Hn≠1 . The computation is illustrated in Figure 3.3.5.1.
[Figure 3.3.5.1 illustrates the computation of $Q$: starting from an identity-like matrix and working from the bottom-right corner toward the top-left, each step overwrites $\begin{pmatrix}\alpha_{11} & a_{12}^T\\ a_{21} & A_{22}\end{pmatrix}$ with $\begin{pmatrix}1 - 1/\tau_1 & -(u_{21}^HA_{22})/\tau_1\\ -u_{21}/\tau_1 & A_{22} + u_{21}a_{12}^T\end{pmatrix}$, as indicated by the growing blocks of $\times$ entries in the figure.]
Figure 3.3.5.1 Illustration of the computation of Q.
Notice that to pick out the first $n$ columns we must form
$$Q\begin{pmatrix} I_{n\times n} \\ 0 \end{pmatrix} = H_0\cdots H_{n-1}\begin{pmatrix} I_{n\times n} \\ 0 \end{pmatrix} = H_0\cdots H_{k-1}\underbrace{H_k\cdots H_{n-1}\begin{pmatrix} I_{n\times n} \\ 0 \end{pmatrix}}_{B_k}$$
so that the desired matrix equals $B_0$, where $B_k = H_k\cdots H_{n-1}\begin{pmatrix} I_{n\times n} \\ 0 \end{pmatrix}$.
Homework 3.3.5.1 ALWAYS/SOMETIMES/NEVER:
$$B_k = H_k\cdots H_{n-1}\begin{pmatrix} I_{n\times n} \\ 0 \end{pmatrix} = \begin{pmatrix} I_{k\times k} & 0 \\ 0 & \tilde{B}_k \end{pmatrix}$$
for some $(m-k)\times(n-k)$ matrix $\tilde{B}_k$.
Answer. ALWAYS
Solution. The proof of this is by induction on k:
A B
In◊n
• Base case: k = n. Then Bn = , which has the desired form.
0
• Inductive step: Assume the result is true for Bk . We show it is true for Bk≠1 :
Bk≠1
= A B
In◊n
Hk≠1 Hk · · · Hn≠1
0
=
Hk≠1 Bk
=A B
Ik◊k 0
Hk≠1
0 BÂ k
=
Q RQ R
I(k≠1)◊(k≠1) A 0B I(k≠1)◊(k≠1) 0 0
c
a 1 1 1 2 d
ba
c
0 1 0 d b
0 I≠ 1 uH
·k uk k
0 0 BÂ k
=
Q R
I(k≠1)◊(k≠1) A A B 0 BA B d
c
a 1 1
1
2 1 0 b
0 I≠ 1 uH
·k
uk k
0 BÂ k
= < choose yk = uk Bk /·k >
T H Â
Q R
I(k≠1)◊(k≠1) A B A 0 B
c
a 1 0 1 1 2 d
b
0 ≠ 1/· y T
0 BÂ k
k k
uk
=
Q R
I(k≠1)◊(k≠1) A 0 B d
c
a 1 ≠ 1/·k ≠ykT b
0 Â T
≠uk /·k Bk ≠ uk yk
=
Q R
I(k≠1)◊(k≠1) 0 0
c
a 0 1 ≠ 1/·k ≠ykT d
b
0 ≠uk /·k BÂ k ≠ uk ykT
A = B
I(k≠1)◊(k≠1) 0
.
0 BÂ k≠1
by completing the code in Assignments/Week03/matlab/FormQ.m. Input is the $m\times n$ matrix $A$ and vector $t$ that resulted from [ A, t ] = HQR( A ). Output is the matrix $Q$ for the QR factorization. You may want to use Assignments/Week03/matlab/test_FormQ.m to check your implementation.
Solution. See Assignments/Week03/answers/FormQ.m
Homework 3.3.5.3 Given $A \in \mathbb{C}^{m\times n}$, show that the cost of the algorithm in Figure 3.3.5.2 is given by
$$C_{\text{FormQ}}(m,n) \approx 2mn^2 - \frac{2}{3}n^3 \text{ flops}.$$
Hint. Modify the answer for Homework 3.3.4.2.
Solution. When computing the Householder QR factorization, the bulk of the cost is in the computations
$$w_{12}^T := (a_{12}^T + u_{21}^HA_{22})/\tau_1 \quad \text{and} \quad A_{22} - u_{21}w_{12}^T.$$
When forming $Q$, the cost is in computing
$$a_{12}^T := -(a_{21}^HA_{22})/\tau_1 \quad \text{and} \quad A_{22} := A_{22} + a_{21}a_{12}^T.$$
During the iteration when $A_{TL}$ is $k\times k$, these represent essentially identical costs: a matrix-vector multiplication ($a_{21}^HA_{22}$) and a rank-1 update with matrix $A_{22}$, which is of size approximately $(m-k)\times(n-k)$, for a cost of $4(m-k)(n-k)$ flops. Thus the total cost is approximately
$$\sum_{k=n-1}^{0}4(m-k)(n-k) = \sum_{k=0}^{n-1}4(m-k)(n-k) = 4\sum_{j=1}^{n}(m-n+j)j = 4(m-n)\sum_{j=1}^{n}j + 4\sum_{j=1}^{n}j^2 \approx 2(m-n)n^2 + 4\int_0^nx^2\,dx = 2mn^2 - 2n^3 + \frac{4}{3}n^3 = 2mn^2 - \frac{2}{3}n^3.$$
Ponder This 3.3.5.4 If m = n then Q could be accumulated by the sequence
Give a high-level reason why this would be (much) more expensive than the algorithm in
Figure 3.3.5.2
3.3.6 Applying QH
YouTube: https://www.youtube.com/watch?v=BfK3DVgfxIM
In a future chapter, we will see that the QR factorization is used to solve the linear least-
squares problem. To do so, we need to be able to compute ŷ = QH y where QH = Hn≠1 · · · H0 .
Let us start by computing $H_0y$:
$$\left( I - \frac{1}{\tau_1}\begin{pmatrix} 1 \\ u_2 \end{pmatrix}\begin{pmatrix} 1 \\ u_2 \end{pmatrix}^H \right)\begin{pmatrix} \psi_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \psi_1 \\ y_2 \end{pmatrix} - \begin{pmatrix} 1 \\ u_2 \end{pmatrix}\underbrace{\begin{pmatrix} 1 \\ u_2 \end{pmatrix}^H\begin{pmatrix} \psi_1 \\ y_2 \end{pmatrix}/\tau_1}_{\omega_1} = \begin{pmatrix} \psi_1 - \omega_1 \\ y_2 - \omega_1u_2 \end{pmatrix},$$
where $\omega_1 = (\psi_1 + u_2^Hy_2)/\tau_1$. This motivates the algorithm in Figure 3.3.6.1 for computing $y := H_{n-1}\cdots H_0y$ given the output matrix $A$ and vector $t$ from routine HQR_unb_var1.
[y] = A
Apply_QH(A,Bt, y) A B A B
AT L AT R tT yT
Aæ ,t æ ,y æ
ABL ABR tB yB
AT L is 0 ◊ 0 and tT , yT have 0 elements
while n(ABR ) < 0 Q R
A B A00 a01 A02
AT L AT R c d
æ a aT10 –11 aT12 b ,
ABL ABR
Q R
A20 a21 Q A22 R
A B t0 A B y0
tT c d yT c d
æ a ·1 b , æ a Â1 b
tB yB
A
t2B A A
y2B BA B
Â1 1 1 1 2 Â1
Update := I ≠ ·1 1 u21
H
y2 u21 y2
via the steps
AÊ1 :=B(Â1 + A a21 y2 )/·1 B
H
Â1 Â1 ≠ Ê1
:=
y2 y2 ≠ Ê1 u21
Q R
A B A00 a01 A02
AT L AT R c d
Ω a aT10 –11 aT12 b ,
ABL ABR
Q R
A20 a21 Q A22 R
A B t0 A B y0
tT c d yT c d
Ω a ·1 b , Ω a Â1 b
tB yB
t2 y2
endwhile
Figure 3.3.6.1 Algorithm for computing y := QH y(= Hn≠1 · · · H0 y) given the output from
the algorithm HQR_unb_var1.
Homework 3.3.6.1 What is the approximate cost of the algorithm in Figure 3.3.6.1 if $Q$ (stored as Householder vectors in $A$) is $m\times n$?
Solution. The cost of this algorithm can be analyzed as follows: when $y_T$ is of length $k$, the bulk of the computation is in a dot product with vectors of length $m-k-1$ (to compute $\omega_1$) and an axpy operation with vectors of length $m-k-1$ to subsequently update $\psi_1$ and $y_2$. Thus, the cost is approximately given by
$$\sum_{k=0}^{n-1}4(m-k-1) \approx 4mn - 2n^2.$$
Notice that this is much cheaper than forming Q and then multiplying QH y.
Homework 3.3.7.1 Previous programming assignments have the following routines for
computing the QR factorization of a given matrix A:
• Classical Gram-Schmidt (CGS) Homework 3.2.3.1:
[ A_out, R_out ] = CGS_QR( A ).
• Modified Gram-Schmidt (MGS) Homework 3.2.4.3:
[ A_out, R_out ] = MGS_QR( A ).
• Householder QR factorization (HQR) Homework 3.3.4.3:
[ A_out, t_out ] = HQR( A ).
• Form Q from Householder QR factorization Homework 3.3.5.2:
Q = FormQ( A, t ).
Use these to examine the orthogonality of the computed Q by writing the Matlab script
Assignments/Week03/matlab/test_orthogonality.m for the matrix
$$\begin{pmatrix} 1 & 1 & 1 \\ \epsilon & 0 & 0 \\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon \end{pmatrix}.$$
Ponder This 3.3.7.2 In the last homework, we examined the orthogonality of the computed
matrix Q for a very specific kind of matrix. The problem with that matrix is that the columns
are nearly linearly dependent (the smaller ‘ is).
How can you quantify how close to being linearly dependent the columns of a matrix are?
How could you create a matrix of arbitrary size in such a way that you can control how
close to being linearly dependent the columns are?
Homework 3.3.7.3 (Optional). Program up your solution to Ponder This 3.3.7.2 and
use it to compare how mutually orthonormal the columns of the computed matrices Q are.
3.4 Enrichments
3.4.1 Blocked Householder QR factorization
3.4.1.1 Casting computation in terms of matrix-matrix multiplication
Modern processors have very fast processors with very fast floating point units (which per-
form the multiply/adds that are the bread and butter of our computations), but very slow
memory. Without getting into details, the reason is that modern memories are large and
hence are physically far from the processor, with limited bandwidth between the two. To
overcome this, smaller "cache" memories are closer to the CPU of the processor. In order to
achieve high performance (efficient use of the fast processor), the strategy is to bring data
into such a cache and perform a lot of computations with this data before writing a result
out to memory.
Operations like a dot product of vectors or an "axpy" (y := –x+y) perform O(m) compu-
tation with O(m) data and hence don’t present much opportunity for reuse of data. Similarly,
matrix-vector multiplication and rank-1 update operations perform O(m2 ) computation with
O(m2 ) data, again limiting the opportunity for reuse. In contrast, matrix-matrix multipli-
cation performs O(m3 ) computation with O(m2 ) data, and hence there is an opportunity to
reuse data.
The goal becomes to rearrange computation so that most computation is cast in terms of
matrix-matrix multiplication-like operations. Algorithms that achieve this are called blocked
algorithms.
It is probably best to return to this enrichment after you have encountered simpler
algorithms and their blocked variants later in the course, since Householder QR factorization
is one of the more difficult operations to cast in terms of matrix-matrix multiplication.
$$H_0H_1\cdots H_{k-1} = I - UT^{-1}U^H,$$
where T is an upper triangular matrix. If U stores the Householder vectors that define
H0 , . . . , Hk≠1 (with "1"s explicitly on its diagonal) and t holds the scalars ·0 , . . . , ·k≠1 , then
T := FormT( U, t )
computes the desired matrix $T$. Now, applying this UT transformation to a matrix $B$ yields
$$(I - UT^{-1}U^H)B = B - U(T^{-1}(U^HB)),$$
which casts the application of the accumulated Householder transformations largely in terms of matrix-matrix operations. The matrix $T$ itself can be computed as
$$T = \text{triu}(U^HU)$$
(the upper triangular part of $U^HU$) followed by either dividing the diagonal elements by two or setting them to $\tau_0, \ldots, \tau_{k-1}$ (in order). In that paper, we point out similar published results [8] [35] [45] [32].
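Putting these observations together, a hedged Matlab sketch of forming T and then applying the accumulated transformations to a matrix B might read as follows; U is assumed to hold the Householder vectors (with ones on its diagonal), t the vector of scalars tau_i, and the variable names are ours.
T = triu( U' * U );                 % upper triangular part of U^H U
T( 1:size(T,1)+1:end ) = t;         % overwrite the diagonal with tau_0, ..., tau_{k-1}
B = B - U * ( T \ ( U' * B ) );     % B := ( I - U T^{-1} U^H ) B, mostly matrix-matrix work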
• Partition A B
A11 A12
Aæ
A21 A22
where A11 is b ◊ b.
A B
A11
• We can use the unblocked algorithm in Subsection 3.3.4 to factor the panel
A21
A B A B
A11 A11
[ , t1 ] := HouseQR_unb_var1( ),
A21 A21
A B
U11
overwriting the entries below the diagonal with the Householder vectors (with
U21
the ones on the diagonal implicitly stored) and the upper triangular part with R11 .
• Form T11 from the Householder vectors using the procedure described earlier in this
unit: A B
A11
T11 := FormT( )
A21
• Now we need to also apply the Householder transformations to the rest of the columns:
A B
A12
A22
=
Q A B A BH RH A B
aI U11 ≠1 U11 b A12
≠ T11
U21 U21 A22
A = B A B
A12 U11
≠ W12
A22 U21
A = B
A12 ≠ U11 W12
,
A22 ≠ U21 W12
where
W12 = T11
≠H
(U11
H
A12 + U21
H
A22 ).
I ≠ —vv H ,
I ≠ U SU H
where upper triangular matrix S relates to the matrix T in the UT transform via S = T ≠1 .
Obviously, T can be computed first and then inverted via the insights in the next exercise.
Alternatively, inversion of matrix T can be incorporated into the algorithm that computes
T (which is what is done in the implementation in LAPACK [1]).
The precondition is $A = \hat{A}$ (the matrix contains its original contents), and the postcondition is
$$A = Q \;\wedge\; \hat{A} = QR \;\wedge\; Q^HQ = I. \qquad (3.4.1)$$
Now, we know that we march through the matrices in a consistent way. At some point
in the algorithm we will have divided them as follows:
A B
1 2 1 2 RT L RT R
Aæ AL AR ,Q æ QL QR ,R æ ,
RBL RBR
where these partitionings are "conformal" (they have to fit in context). To come up with
algorithms, we now ask the question "What are the contents of A and R at a typical stage
of the loop?" To answer this, we instead first ask the question "In terms of the parts of the
matrices are that naturally exposed by the loop, what is the final goal?" To answer that
question, we take the partitioned matrices, and enter them in the postcondition (3.4.1):
1 2 1 2
AL AR = QL QR
¸ ˚˙ ˝ ¸ ˚˙ ˝
A Q A B
1 2 1 2 RT L RT R
· A‚L A‚R = QL QR
¸ ˚˙ ˝ ¸ ˚˙ ˝ 0 RBR
¸ ˚˙ ˝
A‚ R B
Q
A
1 2H 1 2 I 0
· QL QR QL QR = .
¸ ˚˙ ˝ ¸ ˚˙ ˝ 0 I
¸ ˚˙ ˝
QH Q I
(Notice that RBL becomes a zero matrix since R is upper triangular.) Applying the rules of
linear algebra (multiplying out the various expressions) yields
1 2 1 2
AL AR = QL QR
1 2 1 2
· A‚ A‚R = QL RT L QL RT R + QR RBR (3.4.2)
A L B A B
QH T
L QL QL QR I 0
· .= .
QH H
R QL QR QR 0 I
We call this the Partitioned Matrix Expression (PME). It is a recursive definition of the
operation to be performed.
The different algorithms differ in what is in the matrices A and R as the loop iterates.
Can we systematically come up with an expression for their contents at a typical point in
the iteration? The observation is that when the loop has not finished, only part of the final
result has been computed. So, we should be able to take the PME in (3.4.2) and remove
terms to come up with partial results towards the final result. There are some dependencies
(some parts of matrices must be computed before others). Taking this into account gives us
two loop invariants:
• Loop invariant 1:
1 2 1 2
AL AR = QL A‚R
· A‚L = QL RT L (3.4.3)
· QHL QL = I
• Loop invariant 2:
1 2 1 2
AL AR =QL A‚R ≠ QL RT R
1 2 1 2
· A‚L A‚R = QL RT L QL RT R + QR RBR
· QHL QL = I
We note that our knowledge of linear algebra allows us to manipulate this into
1 2 1 2
AL AR = QL A‚R ≠ QL RT R
(3.4.4)
· A‚L = QL RT L · QH
L AL = RT R · QL QL = I.
‚ H
The idea now is that we derive the loop that computes the QR factorization by systematically
deriving the algorithm that maintains the state of the variables described by a chosen loop
invariant. If you use (3.4.3), then you end up with CGS. If you use (3.4.4), then you end up
with MGS.
Interested in details? We have a MOOC for that: LAFF-On Programming for Correct-
ness.
3.5 Wrap Up
3.5.1 Additional homework
Homework 3.5.1.1 Consider the matrix $\begin{pmatrix} A \\ B \end{pmatrix}$ where $A$ has linearly independent columns. Let
• $A = Q_AR_A$ be the QR factorization of $A$.
• $\begin{pmatrix} R_A \\ B \end{pmatrix} = Q_BR_B$ be the QR factorization of $\begin{pmatrix} R_A \\ B \end{pmatrix}$.
• $\begin{pmatrix} A \\ B \end{pmatrix} = QR$ be the QR factorization of $\begin{pmatrix} A \\ B \end{pmatrix}$.
Assume that the diagonal entries of $R_A$, $R_B$, and $R$ are all positive. Show that $R = R_B$.
Solution.
$$\begin{pmatrix} A \\ B \end{pmatrix} = \begin{pmatrix} Q_A & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} R_A \\ B \end{pmatrix} = \begin{pmatrix} Q_A & 0 \\ 0 & I \end{pmatrix}Q_BR_B.$$
Also, $\begin{pmatrix} A \\ B \end{pmatrix} = QR$. By the uniqueness of the QR factorization (when the diagonal elements of the triangular matrix are restricted to be positive), $Q = \begin{pmatrix} Q_A & 0 \\ 0 & I \end{pmatrix}Q_B$ and $R = R_B$.
Remark 3.5.1.1 This last exercise gives a key insight that is explored in the paper
• [20] Brian C. Gunter, Robert A. van de Geijn, Parallel out-of-core computation and up-
dating of the QR factorization, ACM Transactions on Mathematical Software (TOMS),
2005.
3.5.2 Summary
Classical Gram-Schmidt orthogonalization: Given a set of linearly independent vectors
{a0 , . . . , an≠1 } µ Cm , the Gram-Schmidt process computes an orthonormal basis {q0 , . . . , qn≠1 }
that spans the same subspace as the original vectors, i.e.
¶ fl0,0 = Îa0 Î2
Computes the length of vector a0 .
¶ q0 = a0 /fl0,0
Sets q0 to a unit vector in the direction of a0 .
¶ fl0,1 = q0H a1
Computes fl0,1 so that fl0,1 q0 = q0H a1 q0 equals the component of a1 in the direction
of q0 .
1 = a1 ≠ fl0,1 q0
¶ a‹
Computes the component of a1 that is orthogonal to q0 .
¶ fl1,1 = Îa‹
1 Î2
Computes the length of vector a‹
1.
¶ q 1 = a‹
1 /fl1,1
Sets q1 to a unit vector in the direction of a‹
1.
Notice that A B
1 2 1 2 fl0,0 fl0,1
a0 a1 = q0 q1 .
0 fl1,1
¶ q 2 = a‹
2 /fl2,2
Sets q2 to a unit vector in the direction of a‹
2.
Notice that Q R
1 2 1 2 fl0,0 fl0,1 fl0,2
a 0
c
a0 a1 a2 = q0 q1 q2 fl1,1 fl1,2 d
b.
0 0 fl2,2
• And so forth.
Theorem 3.5.2.1 QR Decomposition Theorem. Let A œ Cm◊n have linearly indepen-
dent columns. Then there exists an orthonormal matrix Q and upper triangular matrix R
such that A = QR, its QR decomposition. If the diagonal elements of R are taken to be real
and positive, then the decomposition is unique.
Projection a vector y onto the orthonormal columns of Q œ Cm◊n :
Classic example that shows that the columns of Q, computed by MGS, are "more orthog-
onal" than those computed by CGS:
Q R
1 1 1
c
c ‘ 0 0 d
d
1 2
A=c d = a0 a1 a2 .
a 0 ‘ 0 b
0 0 ‘
• H = HH.
• H H H = HH = I.
• H ≠1 = H H = H.
Computing a Householder transformation I ≠ 2uuH :
• Real case:
¶ v = x û ÎxÎ2 e0 .
v = x + sign(‰1 )ÎxÎ2 e0 avoids catastrophic cancellation.
¶ u = v/ÎvÎ2
• Complex case:
¶ v = x û ± ÎxÎ2 e0 .
(Picking ± carefully avoids catastrophic cancellation.)
¶ u = v/ÎvÎ2
Practical computation of u and · so that I ≠ uuH /tau is a Householder transformation
(reflector): CA B D AA BB
fl ‰1
Algorithm : , · = Housev
u2 x2
‰2 :=.Îx
A 2 Î2 B.
. ‰ .
. 1 .
– := . . (= ÎxÎ2 )
. ‰2 .
2
fl = ≠sign(‰1 )ÎxÎ2 fl := ≠sign(‰1 )–
‹1 = ‰1 + sign(‰1 )ÎxÎ2 ‹1 := ‰1 ≠ fl
u2 = x2 /‹1 u2 := x2 /‹1
‰2 = ‰2 /|‹1 |(= Îu2 Î2 )
· = (1 + uH u
2 2 )/2 · = (1 + ‰22 )/2
Householder QR factorization algorithm:
[A, t] A
= HQR_unb_var1(A)B A B
AT L AT R tT
Aæ and t æ
ABL ABR tB
AT L is 0 ◊ 0 and tT has 0 elements
while n(ABR ) > 0 Q R Q R
A B A00 a01 A02 A B t0
AT L AT R c d t c d
æ a aT10 –11 aT12 b and T
æ a ·1 b
ABL ABR tB
CA B D
A a
CA 20 B 21 D 22
A A B
t2
–11 fl11 –11
, · := , ·1 = Housev
a21 A 1 B u
A 21 A B
a21 B A B
T
a12 1 1 1 2 aT12
Update := I ≠ ·1 1 u21H
A22 u21 A22
via the steps
A12 := (aB 12 +Aa21 A22 )/·1
T T H
w B
T
a12 aT12 ≠ w12
T
:= T
A22 A22 ≠ a21 w12
Q R Q R
A B A00 a01 A02 A B t0
AT L AT R c d tT c d
Ω a aT10 –11 aT12 b and Ω a ·1 b
ABL ABR tB
A20 a21 A22 t2
endwhile
[A] =AFormQ(A, t) B A B
AT L AT R tT
Aæ ,t æ
ABL ABR tB
AT L is n(A) ◊ n(A) and tT has n(A) elements
while n(AT L ) > 0 Q R Q R
A B A00 a01 A02 A B t0
AT L AT R c d tT c d
æ a aT10 –11 aT12 b , æ a ·1 b
ABL ABR tB
A
A
B 20
a21 A22 t2
T
–11 a12
Update :=
aA21 A22 A B BA B
1 1 1 2 1 0
I ≠ ·1 1 u21
H
u21 0 A22
via the steps
–11 := 1 ≠ 1/·1
aT12 := ≠(aH 21 A22 )/·1
A22 := A22 + a21 aT12
a21 := ≠a21 /·1
Q R Q R
A B A00 a01 A02 A B t0
AT L AT R tT
Ωc a aT10 –11 aT12 b ,
d
Ωc a ·1
d
b
ABL ABR tB
A20 a21 A22 t2
endwhile
[y] = A
Apply_QH(A,Bt, y) A B A B
AT L AT R tT yT
Aæ ,t æ ,y æ
ABL ABR tB yB
AT L is 0 ◊ 0 and tT , yT have 0 elements
while n(ABR ) < 0 Q R
A B A00 a01 A02
AT L AT R c d
æ a aT10 –11 aT12 b ,
ABL ABR
Q R
A20 a21 Q A22 R
A B t0 A B y0
tT c d yT c d
æ a ·1 b , æ a Â1 b
tB yB
A
t2B A A
y2B BA B
Â1 1 1 1 2 Â1
Update := I ≠ ·1 1 u21
H
y2 u21 y2
via the steps
AÊ1 :=B(Â1 + A a21 y2 )/·1 B
H
Â1 Â1 ≠ Ê1
:=
y2 y2 ≠ Ê1 u2
Q R
A B A00 a01 A02
AT L AT R c d
Ω a aT10 –11 aT12 b ,
ABL ABR
Q R
A20 a21 Q A22 R
A B t0 A B y0
tT c d yT c d
Ω a ·1 b , Ω a Â1 b
tB yB
t2 y2
endwhile
4.1 Opening
4.1.1 Fitting the best line
YouTube: https://www.youtube.com/watch?v=LPfdOYoQQU0
A classic problem is to fit the "best" line through a given set of points: Given
$$\{(\chi_i, \psi_i)\}_{i=0}^{m-1},$$
we wish to fit the line $f(\chi) = \gamma_0 + \gamma_1\chi$ to these points, meaning that the coefficients $\gamma_0$ and $\gamma_1$ are to be determined. Now, in the end we want to formulate this as approximately solving $Ax = b$ and for that reason we change the labels we use: Starting with points
$$\{(\alpha_i, \beta_i)\}_{i=0}^{m-1},$$
we instead wish to determine $\chi_0$ and $\chi_1$ such that
$$\begin{array}{rcl} \chi_0 + \chi_1\alpha_0 &\approx& \beta_0 \\ \chi_0 + \chi_1\alpha_1 &\approx& \beta_1 \\ &\vdots& \\ \chi_0 + \chi_1\alpha_{m-1} &\approx& \beta_{m-1}, \end{array}$$
which can be written as $Ax \approx b$,
where
$$A = \begin{pmatrix} 1 & \alpha_0 \\ 1 & \alpha_1 \\ \vdots & \vdots \\ 1 & \alpha_{m-1} \end{pmatrix}, \quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \end{pmatrix}, \quad \text{and} \quad b = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{m-1} \end{pmatrix}.$$
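As an illustration (not part of the homework below), given column vectors alpha and beta holding the data, the matrix and right-hand side can be formed in Matlab as follows; the use of backslash is merely a preview of the least-squares solvers developed later this week.
m = length( alpha );
A = [ ones( m, 1 ), alpha ];    % first column all ones, second column the alpha_i
b = beta;
x = A \ b;                      % backslash returns the linear least-squares solution
chi0 = x( 1 );                  % intercept
chi1 = x( 2 );                  % slope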
Homework 4.1.1.1 Use the script in Assignments/Week04/matlab/LineFittingExercise.m to fit a line to the given data by guessing the coefficients $\chi_0$ and $\chi_1$.
Ponder This 4.1.1.2 Rewrite the script for Homework 4.1.1.1 to be a bit more engaging...
4.1.2 Overview
• 4.1 Opening
• 4.5 Enrichments
• 4.6 Wrap Up
• Relate the solution of the linear least squares problem to the four fundamental spaces.
• Describe the four fundamental spaces of a matrix using its singular value decomposi-
tion.
• Solve the linear least squares problem via the Normal Equations, the Singular Value Decomposition, and the QR decomposition.
• Compare and contrast the accuracy and cost of the different approaches for solving the
linear least squares problem.
YouTube: https://www.youtube.com/watch?v=9mdDqC1SChg
We assume that the reader remembers theory related to (vector) subspaces. If a review
is in order, we suggest Weeks 9 and 10 of Linear Algebra: Foundations to Frontiers (LAFF)
[26].
At some point in your linear algebra education, you should also have learned about the
four fundamental spaces of a matrix A œ Cm◊n (although perhaps only for the real-valued
case):
• The column space, $\mathcal{C}(A)$, which is equal to the set of all vectors that are linear combinations of the columns of $A$:
$$\{y \;|\; y = Ax\}.$$
• The null space, $\mathcal{N}(A)$, which is equal to the set of all vectors that are mapped to the zero vector by $A$:
$$\{x \;|\; Ax = 0\}.$$
• The row space, $\mathcal{R}(A)$, which is equal to the set
$$\{y \;|\; y^H = x^HA\}.$$
• The left null space, which is equal to the set of all vectors
$$\{x \;|\; x^HA = 0\}.$$
xH y
= < x œ R(A) iff there exists z s.t. x = AH z >
(A z) y
H H
Homework 4.2.1.2 Let A œ Cm◊n . Show that its column space, C(A), and left null space,
N (AH ), are orthogonal.
xH y
= < x œ C(A) iff there exists z s.t. x = Az >
(Az) y
H
Homework 4.2.1.3 Let {s0 , · · · , sr≠1 } be a basis for subspace S µ Cn and {t0 , · · · , tk≠1 }
be a basis for subspace T µ Cn . Show that the following are equivalent statements:
1. Subspaces S, T are orthogonal.
2. The vectors in {s0 , · · · , sr≠1 } are orthogonal to the vectors in {t0 , · · · , tk≠1 }.
3. sH
i tj = 0 for all 0 Æ i < r and 0 Æ j < k.
1 2H 1 2
4. s0 · · · sr≠1 t0 · · · tk≠1 = 0, the zero matrix of appropriate size.
Solution. We are going to prove the equivalence of all the statements by showing that 1.
implies 2., 2. implies 3., 3. implies 4., and 4. implies 1.
• 1. implies 2.
Subspaces S and T are orthogonal if any vectors x œ S and y œ T are orthogonal.
Obviously, this means that si is orthogonal to tj for 0 Æ i < r and 0 Æ j < k.
• 2. implies 3.
This is true by definition of what it means for two sets of vectors to be orthogonoal.
• 3. implies 4.
Q R
1 2H 1 2
sH H
0 t0 s0 t1 · · ·
c H d
s0 · · · sr≠1 t0 · · · tk≠1 =c s t sH
a 1 0 1 t1 · · · d
.. .. b
. .
• 4. implies 1.
We need to show that if x œ S and y œ T then xH y = 0.
Notice that
Q R Q R
1 2c
‰‚0 1 2c
‚0
x= .. d .. d
sr≠1 c b and y = t0 · · · tk≠1 a
. d . d
s0 · · · c
a b
‰r≠1
‚ ‚
Âk≠1
x = –0 w0 + · · · + –n≠1 wn≠1
= < split the summation >
–0 w0 + · · · + –r≠1 wr≠1 + –r wr + · · · + –n≠1 wn≠1 .
¸ ˚˙ ˝ ¸ ˚˙ ˝
xr xn
Figure 4.2.1.2 Illustration of the four fundamental spaces and the mapping of a vector
x œ Cn by matrix A œ Cm◊n .
That figure also captures that if r is the rank of matrix, then
• dim(R(A)) = dim(C(A)) = r;
• dim(N (A)) = n ≠ r;
• dim(N (AH )) = m ≠ r.
Proving this is a bit cumbersome given the knowledge we have so far, but becomes very easy
once we relate the various spaces to the SVD, in Subsection 4.3.1. So, we just state it for
now.
YouTube: https://www.youtube.com/watch?v=oT4KIOxx-f4
Consider again the LLS problem: Given A œ Cm◊n and b œ Cm find x̂ œ Cn such that
We list a sequence of observations that you should have been exposed to in previous study
of linear algebra:
• b̂ equals the member of the column space of A that is closest to b, making it the
orthogonal projection of b onto the column space of A.
• From Figure 4.2.1.2 we deduce that b ≠ b̂ = b ≠ Ax̂ is in N (AH ), the left null space of
A.
AH Ax̂ = AH b.
With this, we have discovered what is known as the Method of Normal Equations. These
steps are summarized in Figure 4.2.2.1
Figure 4.2.2.1 Solving LLS via the Method of Normal Equations when A has linearly
independent columns (and hence the row space of A equals Cn ).
Definition 4.2.2.2 (Left) pseudo inverse. Let $A \in \mathbb{C}^{m\times n}$ have linearly independent columns. Then
$$A^\dagger = (A^HA)^{-1}A^H$$
is its (left) pseudo inverse. ⌃
Homework 4.2.2.1 Let $A \in \mathbb{C}^{m\times m}$ be nonsingular. Then $A^{-1} = A^\dagger$.
Solution.
$$AA^\dagger = A(A^HA)^{-1}A^H = AA^{-1}A^{-H}A^H = II = I.$$
Homework 4.2.2.2 Let A œ Cm◊n have linearly independent columns. ALWAYS/SOMETIMES/
NEVER: AA† = I. 1 2
Hint. Consider A = e0 .
Answer. SOMETIMES
Solution. An example where AA† = I is the case where m = n and hence A is nonsingular.
AA†
= < instantiate >
Q R≠1
Q R c
c Q RH Q R d
d Q RH
1 c 1 1 d 1
c d c c d c d d c d
c 0 d c c 0 d c 0 d d
c
d c 0 d
a
.. b c a
. b a .. b d
.
a
.. b
. c
c . . d
d .
a ¸ ˚˙ ˝ b
1
¸ ˚˙ ˝
1
Q
=R < simplify >
1 1 2
c d
c 0 d 1 0 ···
a
.. b
.
Q
= < multiply
R
out >
1 0 ···
c d
c 0 0 ··· d
a
.. .. b
. .
= <m>1>
=
” I.
Ponder This 4.2.2.3 The last exercise suggests there is also a right pseudo inverse. How
would you define it?
In Section 5.4, you will find out that since B is HPD, there exists a lower triangular matrix
L such that B = LLH . This is known as the Cholesky factorization of B. The steps for
solving the normal equations then become
• Compute B = AH A.
Notice that since B is Hermitian symmetric, only the lower or upper triangular part
needs to be computed. This is known as a Hermitian rank-k update (where in this
case k = n). The cost is, approximately, mn2 flops. (See Subsection C.0.1.)
• Compute y = AH b.
The cost of this matrix-vector multiplication is, approximately, 2mn flops. (See Sub-
section C.0.1.)
• Solve
Lz = y
(solve with a lower triangular matrix) followed by
LH x̂ = z
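A hedged Matlab sketch of these steps (using chol, which returns an upper triangular R with B = R' * R, so that R' plays the role of L; variable names are ours):
B = A' * A;            % n x n, Hermitian positive definite since A has linearly independent columns
y = A' * b;
R = chol( B );         % B = R' * R
z    = R' \ y;         % solve L z = y        with L = R'
xhat = R  \ z;         % solve L^H xhat = z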
YouTube: https://www.youtube.com/watch?v=etx_1VZ4VFk
Given A œ Cm◊n with linearly independent columns and b œ Cm , consider the linear least
squares (LLS) problem
Îb ≠ Ax̂Î2 = min Îb ≠ AxÎ2 (4.2.1)
x
and the perturbed problem
Î(b + ”b) ≠ A(x̂ + ”x̂)Î2 = min Î(b + ”b) ≠ A(x + ”x)Î2 (4.2.2)
x
The question we want to examine is by how much the relative error in b is amplified into a rel-
ative error in x̂. We will restrict our discussion to the case where A has linearly independent
columns.
Now, we discovered that b̂, the projection of b onto the column space of A, satisfies
b̂ = Ax̂ (4.2.3)
and the projection of b + ”b satisfies
b̂ + ”b̂ = A(x̂ + ”x̂) (4.2.4)
”b̂ = A”x̂
or, equivalently,
A”x̂ = ”b̂
which is solved by
”x̂ = A† ”b̂
= A† A(AH A)≠1 AH ”b
= (AH A)≠1 AH A(AH A)≠1 AH ”b
= A† ”b,
where A† = (AH A)≠1 AH is the pseudo inverse of A and we recall that ”b̂ = A(AH A)≠1 AH ”b.
Hence
Δx̂Î2 Æ ÎA† Î2 ΔbÎ2 . (4.2.6)
Homework 4.2.4.1 Let A œ Cm◊n have linearly independent columns. Show that
Î(AH A)≠1 AH Î2
=
Î((UL T L V H )H UL T L V H )≠1 (UL T L V H )H Î2
=
Î(V T L ULH UL T L V H )≠1 V T L ULH Î2
=
Î(V ≠1 T L T L V )V T L UL Î2
≠1 H H
=
ÎV ≠1 H
T L UL Î2
=
Î ≠1 H
T L UL Î2
=
1/‡n≠1 .
This last step needs some more explanation: Clearly Î T L ULH Î2 Æ Î T L Î2 ÎULH Î2 =
‡0 ÎULH Î2 Æ ‡0 . We need to show that there exists a vector x with ÎxÎ2 = 1 such that
Î T L ULH xÎ2 = Î T L ULH Î2 . If we pick x = u0 (the first column of UL ), then Î T L ULH xÎ2 =
Î T L ULH u0 Î2 = Î T L e0 Î2 = ·0 e0 Î2 = ‡0 .
Combining (4.2.5), (4.2.6), and the result in this last homework yields
  ‖δx̂‖₂/‖x̂‖₂ ≤ (1/cos(θ)) (σ₀/σ_{n−1}) (‖δb‖₂/‖b‖₂).    (4.2.7)
Notice the effect of the factor 1/cos(θ). If b is almost perpendicular to C(A), then its projection b̂ is small and cos θ is small. Hence a small relative change in b can be greatly amplified. This makes sense: if b is almost perpendicular to C(A), then x̂ ≈ 0, and any small δb ∈ C(A) can yield a relatively large change δx.
Definition 4.2.4.1 Condition number of matrix with linearly independent columns. Let A ∈ C^{m×n} have linearly independent columns (and hence n ≤ m). Then its condition number (with respect to the 2-norm) is defined by
  κ₂(A) = ‖A‖₂ ‖A†‖₂ = σ₀/σ_{n−1}.
  ‖δb‖₂ / ‖b + b_r‖₂.
The factor 1/cos(θ) ensures that this does not magically reduce the relative error in x̂:
  ‖δx̂‖₂/‖x̂‖₂ ≤ (‖b + b_r‖₂/‖b̂‖₂) (σ₀/σ_{n−1}) (‖δb‖₂/‖b + b_r‖₂).
YouTube: https://www.youtube.com/watch?v=W-HnQDsZsOw
Homework 4.2.5.1 Show that κ₂(A^H A) = (κ₂(A))².
Hint. Use the SVD of A.
Solution. Let A = U Σ V^H be the reduced SVD of A. Then
  κ₂(A^H A) = ‖A^H A‖₂ ‖(A^H A)^{−1}‖₂
            = ‖(U Σ V^H)^H (U Σ V^H)‖₂ ‖((U Σ V^H)^H (U Σ V^H))^{−1}‖₂
            = ‖V Σ² V^H‖₂ ‖V (Σ^{−1})² V^H‖₂
            = ‖Σ²‖₂ ‖(Σ^{−1})²‖₂
            = σ₀² (1/σ_{n−1}²)
            = (σ₀/σ_{n−1})²
            = κ₂(A)².
Let A ∈ C^{m×n} have linearly independent columns. If one uses the Method of Normal Equations to solve the linear least squares problem min_x ‖b − Ax‖₂ via the steps
• Compute B = A^H A.
• Compute y = A^H b.
• Solve B x̂ = y.
then the condition number of B equals the square of the condition number of A. So, while the sensitivity of the LLS problem is captured by
  ‖δx̂‖₂/‖x̂‖₂ ≤ (1/cos(θ)) κ₂(A) (‖δb‖₂/‖b‖₂),
the sensitivity of computing x̂ from B x̂ = y is captured by
  ‖δx̂‖₂/‖x̂‖₂ ≤ κ₂(A)² (‖δy‖₂/‖y‖₂).
If κ₂(A) is relatively small (meaning that A is not close to a matrix with linearly dependent columns), then this may not be a problem. But if the columns of A are nearly linearly dependent, or high accuracy is desired, alternatives to the Method of Normal Equations should be employed.
Remark 4.2.5.1 It is important to realize that this squaring of the condition number is an artifact of the chosen algorithm rather than an inherent sensitivity to change of the problem.
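A small MATLAB experiment (illustrative only; the matrix is constructed from a made-up set of singular values) makes the squaring of the condition number concrete:

  % Observe that kappa_2(A' * A) is (approximately) kappa_2(A)^2.
  m = 50; n = 4;
  [ U, ~ ] = qr( randn( m, n ), 0 );      % orthonormal columns
  [ V, ~ ] = qr( randn( n ) );            % unitary matrix
  Sigma = diag( logspace( 0, -4, n ) );   % singular values from 1 down to 1e-4
  A = U * Sigma * V';                     % kappa_2(A) = 1e4 by construction

  kappa_A  = cond( A )                    % approximately 1e4
  kappa_AA = cond( A' * A )               % approximately 1e8 = kappa_A^2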
YouTube: https://www.youtube.com/watch?v=Zj72oRSSsH8
Theorem 4.3.1.1 Given A ∈ C^{m×n}, let A = U_L Σ_{TL} V_L^H equal its Reduced SVD and
  A = ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H
its SVD. Then
• C(A) = C(U_L),
• N(A) = C(V_R),
• N(A^H) = C(U_R).
Proof. We prove that C(A) = C(U_L), leaving the other parts as exercises.
Let A = U_L Σ_{TL} V_L^H be the Reduced SVD of A. Then
• Σ_{TL} is nonsingular because it is diagonal and its diagonal elements are all nonzero.
We will show that C(A) = C(U_L) by showing that C(A) ⊂ C(U_L) and C(U_L) ⊂ C(A).
• C(A) ⊂ C(U_L): Let z ∈ C(A). Then there exists a vector x ∈ Cⁿ such that z = Ax. But then z = Ax = U_L Σ_{TL} V_L^H x = U_L (Σ_{TL} V_L^H x) = U_L x̃, where x̃ = Σ_{TL} V_L^H x. Hence z ∈ C(U_L).
• C(U_L) ⊂ C(A): Let z ∈ C(U_L). Then there exists a vector x ∈ C^r such that z = U_L x. But then z = U_L x = U_L Σ_{TL} V_L^H V_L Σ_{TL}^{−1} x = A (V_L Σ_{TL}^{−1} x) = A x̃. Hence z ∈ C(A).
We leave the other parts as exercises for the learner. ∎
Homework 4.3.1.1 For the last theorem, prove that R(A) = C(A^H) = C(V_L).
Solution. R(A) = C(V_L):
The slickest way to do this is to recognize that if A = U_L Σ_{TL} V_L^H is the Reduced SVD of A, then A^H = V_L Σ_{TL} U_L^H is the Reduced SVD of A^H. One can then invoke the fact that C(A) = C(U_L), where in this case A is replaced by A^H and U_L by V_L.
Ponder This 4.3.1.2 For the last theorem, prove that N(A^H) = C(U_R).
Homework 4.3.1.3 Given A ∈ C^{m×n}, let A = U_L Σ_{TL} V_L^H equal its Reduced SVD,
  A = ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H
its SVD, and r = rank(A).
• ALWAYS/SOMETIMES/NEVER: r = rank(A) = dim(C(A)) = dim(C(U_L)).
Answer. ALWAYS.
Solution.
  x = I x = V V^H x
    = ( V_L V_R ) ( V_L V_R )^H x
    = ( V_L V_R ) [ V_L^H x ; V_R^H x ]
    = V_L V_L^H x + V_R V_R^H x,
where x_r = V_L V_L^H x lies in the row space and x_n = V_R V_R^H x lies in the null space.
Figure 4.3.1.2 Illustration of relationship between the SVD of matrix A and the four
fundamental spaces.
YouTube: https://www.youtube.com/watch?v=wLCN0yOLFkM
Let us start by discussing how to use the SVD to find x̂ that satisfies
  ‖b − Ax̂‖₂ = min_x ‖b − Ax‖₂
for the case where A ∈ C^{m×n} has linearly independent columns (in other words, rank(A) = n).
Let A = U_L Σ_{TL} V^H be its Reduced SVD. (Notice that V_L = V since A has linearly independent columns and hence V_L is n × n and equals V.)
Here is a way to find the solution based on what we encountered before: Since A has linearly independent columns, the solution is given by x̂ = (A^H A)^{−1} A^H b (the solution to the normal equations). Now,
  x̂
  = < solution to the normal equations >
  (A^H A)^{−1} A^H b
  = < A = U_L Σ_{TL} V^H >
  [ (U_L Σ_{TL} V^H)^H (U_L Σ_{TL} V^H) ]^{−1} (U_L Σ_{TL} V^H)^H b
  = < (BCD)^H = D^H C^H B^H and Σ_{TL}^H = Σ_{TL} >
  [ (V Σ_{TL} U_L^H)(U_L Σ_{TL} V^H) ]^{−1} (V Σ_{TL} U_L^H) b,
which, since U_L^H U_L = I and V is unitary, simplifies to
  x̂ = V Σ_{TL}^{−1} U_L^H b.
Figure 4.3.2.1 Solving LLS via the SVD when A has linearly independent columns (and hence the row space of A equals C^n).
Alternatively, we can come to the same conclusion without depending on the Method of
Normal Equations, in preparation for the more general case discussed in the next subsection.
The derivation is captured in Figure 4.3.2.1.
  min_{x∈Cⁿ} ‖b − Ax‖₂²
  = < substitute the SVD A = U Σ V^H >
  min_{x∈Cⁿ} ‖b − U Σ V^H x‖₂²
  = < substitute I = U U^H and factor out U >
  min_{x∈Cⁿ} ‖U (U^H b − Σ V^H x)‖₂²
  = < multiplication by a unitary matrix preserves the two-norm >
  min_{x∈Cⁿ} ‖U^H b − Σ V^H x‖₂²
  = < partition, partitioned matrix-matrix multiplication >
  min_{x∈Cⁿ} ‖ [ U_L^H b ; U_R^H b ] − [ Σ_{TL} ; 0 ] V^H x ‖₂²
  = < partitioned matrix-matrix multiplication and addition >
  min_{x∈Cⁿ} ‖ [ U_L^H b − Σ_{TL} V^H x ; U_R^H b ] ‖₂²
  = < ‖ [ v_T ; v_B ] ‖₂² = ‖v_T‖₂² + ‖v_B‖₂² >
  min_{x∈Cⁿ} ( ‖U_L^H b − Σ_{TL} V^H x‖₂² + ‖U_R^H b‖₂² ).
The first term can be driven to zero by choosing x̂ = V Σ_{TL}^{−1} U_L^H b, which is possible since Σ_{TL} is a diagonal matrix with only nonzeroes on its diagonal and V is unitary.
Here is yet another way of looking at this: we wish to compute x̂ that satisfies
  ‖b − Ax̂‖₂ = min_x ‖b − Ax‖₂
for the case where A ∈ C^{m×n} has linearly independent columns. We know that A = U_L Σ_{TL} V^H, its Reduced SVD. To find the x that minimizes, we first project b onto the column space of A. Since the column space of A is identical to the column space of U_L, we can project onto the column space of U_L instead:
  b̂ = U_L U_L^H b.
(Notice that this is not because U_L is unitary, since it isn't. It is because the matrix U_L U_L^H projects onto the column space of U_L since U_L is orthonormal.) Now, we wish to find x̂ that exactly solves Ax̂ = b̂. Substituting in the Reduced SVD, this means that
  U_L Σ_{TL} V^H x̂ = U_L U_L^H b.
Multiplying both sides by U_L^H yields
  Σ_{TL} V^H x̂ = U_L^H b
and hence
  x̂ = V Σ_{TL}^{−1} U_L^H b.
We believe this last explanation probably leverages the Reduced SVD in a way that provides the most insight, and it nicely motivates how to find solutions to the LLS problem when rank(A) < n.
The steps for solving the linear least squares problem via the SVD, when A ∈ C^{m×n} has linearly independent columns, and the costs of those steps are given by
• Compute the Reduced SVD A = U_L Σ_{TL} V^H.
  We will not discuss practical algorithms for computing the SVD until much later. We will see that the cost is O(mn²) with a large constant.
• Compute x̂ = V Σ_{TL}^{−1} U_L^H b.
A minimal MATLAB sketch of these steps follows.
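The sketch below is illustrative only; MATLAB's built-in svd plays the role of the Reduced SVD, and the data are made up:

  A = randn( 100, 5 );  b = randn( 100, 1 );

  [ UL, SigmaTL, V ] = svd( A, 'econ' );   % Reduced SVD: A = UL * SigmaTL * V'
  xhat = V * ( SigmaTL \ ( UL' * b ) );    % xhat = V * SigmaTL^{-1} * UL' * b

  norm( xhat - A \ b )                     % compare with MATLAB's own LLS solver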
YouTube: https://www.youtube.com/watch?v=qhsPHQk1id8
Now we show how to use the SVD to find x̂ that satisfies
  ‖b − Ax̂‖₂ = min_x ‖b − Ax‖₂,
where rank(A) = r, with no assumptions about the relative size of m and n. In our discussion, we let A = U_L Σ_{TL} V_L^H equal its Reduced SVD and
  A = ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H
its SVD.
The first observation is, once more, that an x̂ that minimizes ‖b − Ax‖₂ satisfies
  Ax̂ = b̂,
where b̂ = U_L U_L^H b, the orthogonal projection of b onto the column space of A. Notice our use of "an x̂" since the solution won't be unique if r < n and hence the null space of A is not trivial. Substituting in the SVD this means that
  ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H x̂ = U_L U_L^H b
or, equivalently,
  Σ_{TL} V_L^H x̂ = U_L^H b.    (4.3.1)
Any solution to this can be written as the sum of a vector in the row space of A with a vector in the null space of A:
  x̂ = V z = ( V_L V_R ) [ z_T ; z_B ] = V_L z_T + V_R z_B,
where x_r = V_L z_T and x_n = V_R z_B. Substituting this into (4.3.1) yields
  Σ_{TL} V_L^H ( V_L z_T + V_R z_B ) = U_L^H b,
so that Σ_{TL} z_T = U_L^H b and hence
  x_r = V_L z_T = V_L Σ_{TL}^{−1} U_L^H b.
The general solution is thus
  x̂ = V_L Σ_{TL}^{−1} U_L^H b + V_R z_B,
where z_B ∈ C^{n−r} is arbitrary.
Figure 4.3.3.1 Solving LLS via the SVD of A.
Homework 4.3.3.1 Reason that
  x̂ = V_L Σ_{TL}^{−1} U_L^H b
is the solution to the LLS problem with minimal length (2-norm). In other words, if x★ satisfies
  ‖b − Ax★‖₂ = min_x ‖b − Ax‖₂,
then ‖x̂‖₂ ≤ ‖x★‖₂.
Solution. Recall from the discussion above that any solution x★ can be written as
  x★ = V_L Σ_{TL}^{−1} U_L^H b + V_R z_B = x̂ + V_R z_B,
and that V_L Σ_{TL}^{−1} U_L^H b and V_R z_B are orthogonal to each other (since V_L^H V_R = 0). If u^H v = 0 then ‖u + v‖₂² = ‖u‖₂² + ‖v‖₂². Hence
  ‖x★‖₂² = ‖x̂ + V_R z_B‖₂² = ‖x̂‖₂² + ‖V_R z_B‖₂² ≥ ‖x̂‖₂²
and hence ‖x̂‖₂ ≤ ‖x★‖₂.
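A minimal MATLAB sketch of computing this minimum-norm solution from the SVD (illustrative only; the rank-deficient A is made up, and the numerical rank is determined with an ad hoc tolerance):

  m = 8; n = 6;
  A = randn( m, 3 ) * randn( 3, n );        % a matrix with rank(A) = 3
  b = randn( m, 1 );

  [ U, S, V ] = svd( A );
  tol = max( size( A ) ) * eps( S(1,1) );   % tolerance for the numerical rank
  r   = sum( diag( S ) > tol );
  UL  = U( :, 1:r );  VL = V( :, 1:r );  SigmaTL = S( 1:r, 1:r );

  xhat = VL * ( SigmaTL \ ( UL' * b ) );    % xhat = VL * SigmaTL^{-1} * UL' * b
  norm( xhat - pinv( A ) * b )              % agrees with the pseudo inverse solution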
YouTube: https://www.youtube.com/watch?v=mKAZjYX656Y
Theorem 4.4.1.1 Assume A ∈ C^{m×n} has linearly independent columns and let A = QR be its QR factorization with orthonormal matrix Q ∈ C^{m×n} and upper triangular matrix R ∈ C^{n×n}. Then the LLS problem
  ‖b − Ax̂‖₂ = min_x ‖b − Ax‖₂
is solved by the unique x̂ that satisfies R x̂ = Q^H b.
Proof. Substituting A = QR into ‖b − Ax‖₂ yields
  ‖b − Q (Rx)‖₂,
where we set z = Rx. We can first find the z that minimizes
  ‖b − Qz‖₂,
after which we can solve Rx = z for x. But from the Method of Normal Equations we know that this z is given by
  z = Q^H b.
Alternatively, partition A = ( Q_L Q_R ) [ R_{TL} ; 0 ], its (full) QR factorization, so that A = Q_L R_{TL}. Then
  min_{x∈Cⁿ} ‖b − Ax‖₂²
  = < substitute A = Q_L R_{TL} >
  min_{x∈Cⁿ} ‖b − Q_L R_{TL} x‖₂²
  = < multiplication by ( Q_L Q_R )^H preserves the two-norm; partition >
  min_{x∈Cⁿ} ( ‖Q_L^H b − R_{TL} x‖₂² + ‖Q_R^H b‖₂² )
  = < Q_R^H b is independent of x; the first term is minimized (driven to zero) by x̂ that satisfies R_{TL} x̂ = Q_L^H b >
  ‖Q_R^H b‖₂².
Thus, the desired x̂ that minimizes the linear least squares problem solves R_{TL} x̂ = Q_L^H b. The solution is unique because R_{TL} is nonsingular (because A has linearly independent columns). ∎
Homework 4.4.1.1 Yet another alternative proof for Theorem 4.4.1.1 starts with the observation that the solution is given by x̂ = (A^H A)^{−1} A^H b and then substitutes in A = QR. Give a proof that builds on this insight.
Solution. Recall that we saw in Subsection 4.2.2 that, if A has linearly independent columns, the LLS solution is given by x̂ = (A^H A)^{−1} A^H b (the solution to the normal equations). Also, if A has linearly independent columns and A = QR is its QR factorization, then the upper triangular matrix R is nonsingular (and hence has no zeroes on its diagonal). Now,
  x̂
  = < solution to the normal equations >
  (A^H A)^{−1} A^H b
  = < A = QR >
  [ (QR)^H (QR) ]^{−1} (QR)^H b
  = < (BC)^H = C^H B^H >
  [ R^H Q^H Q R ]^{−1} R^H Q^H b
  = < Q^H Q = I >
  [ R^H R ]^{−1} R^H Q^H b
  = < (BC)^{−1} = C^{−1} B^{−1} >
  R^{−1} R^{−H} R^H Q^H b
  = < R^{−H} R^H = I >
  R^{−1} Q^H b.
Thus, the x̂ that solves R x̂ = Q^H b solves the LLS problem.
Ponder This 4.4.1.2 Create a picture similar to Figure 4.3.2.1 that uses the QR factoriza-
tion rather than the SVD.
YouTube: https://www.youtube.com/watch?v=Mk-Y_15aGGc
Given A ∈ C^{m×n} with linearly independent columns, the Householder QR factorization yields n Householder transformations, H₀, …, H_{n−1}, so that
  H_{n−1} ⋯ H₀ A = [ R_{TL} ; 0 ],  where Q^H = H_{n−1} ⋯ H₀.
[A, t] = HouseQR_unb_var1(A) overwrites A with the Householder vectors that define H₀, …, H_{n−1} below the diagonal and R_{TL} in the upper triangular part.
Rather than explicitly computing Q and then computing y := Q^H y, we can instead apply the Householder transformations:
  y := H_{n−1} ⋯ H₀ y,
overwriting y with ŷ. After this, the vector y is partitioned as y = [ y_T ; y_B ] and the triangular system R_{TL} x̂ = y_T yields the desired solution.
The steps and their costs of this approach are
• From Subsection 3.3.4, factoring A = QR via the Householder QR factorization costs, approximately, 2mn² − (2/3)n³ flops.
• From Homework 3.3.6.1, applying Q^H as a sequence of Householder transformations costs, approximately, 4mn − 2n² flops.
• Solve R_{TL} x̂ = y_T: n² flops.
Total: 2mn² − (2/3)n³ + 4mn − n² ≈ 2mn² − (2/3)n³ flops. A minimal MATLAB sketch of this approach follows.
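The sketch below is illustrative only: MATLAB's built-in qr uses Householder transformations internally, so it stands in here for HouseQR_unb_var1 and the explicit application of H₀, …, H_{n−1}.

  A = randn( 100, 5 );  b = randn( 100, 1 );

  [ Q, R ] = qr( A, 0 );        % economy-size QR: A = Q * R with Q of size m x n
  xhat = R \ ( Q' * b );        % solve R * xhat = Q' * b (upper triangular solve)

  norm( xhat - A \ b )          % compare with MATLAB's backslash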
Let P(p) be the matrix that, when applied to a vector x ∈ Cⁿ, permutes the entries of x according to the vector p:
  P(p) x = [ e_{π₀}^T ; e_{π₁}^T ; ⋮ ; e_{π_{n−1}}^T ] x = [ e_{π₀}^T x ; e_{π₁}^T x ; ⋮ ; e_{π_{n−1}}^T x ] = [ χ_{π₀} ; χ_{π₁} ; ⋮ ; χ_{π_{n−1}} ],
where e_j equals the column of I ∈ R^{n×n} indexed with j (and hence the standard basis vector indexed with j).
If we apply P(p)^T to A ∈ C^{m×n} from the right, we get
  A P(p)^T
  = < definition of P(p) >
  A [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ]^T
  = < transpose >
  A ( e_{π₀} ⋯ e_{π_{n−1}} )
  = < matrix-matrix multiplication by columns >
  ( a_{π₀} ⋯ a_{π_{n−1}} ).
In other words, applying the transpose of the permutation matrix to A from the right permutes its columns as indicated by the permutation vector p.
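A small MATLAB illustration of this (made-up data; note that MATLAB indexing is 1-based, while the text uses 0-based indices):

  A = reshape( 1:12, 3, 4 );
  p = [ 3 1 4 2 ];               % permutation vector
  I = eye( 4 );
  P = I( p, : );                 % P(p): the rows of the identity, reordered by p

  A * P'                         % columns of A permuted according to p ...
  A( :, p )                      % ... which MATLAB indexing achieves directly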
The discussion about permutation matrices gives us the ability to rearrange the columns
of A so that the first r = rank(A) columns are linearly independent.
Theorem 4.4.4.1 Assume A ∈ C^{m×n} and that r = rank(A). Then there exist a permutation vector p with n entries, an orthonormal matrix Q_L ∈ C^{m×r}, an upper triangular matrix R_{TL} ∈ C^{r×r}, and R_{TR} ∈ C^{r×(n−r)} such that
  A P(p)^T = Q_L ( R_{TL} R_{TR} ).
Proof. Let p be the permutation vector such that the first r columns of A_P = A P(p)^T are linearly independent. Partition
  A_P = A P(p)^T = ( A_{PL} A_{PR} ),
where A_{PL} ∈ C^{m×r}. Since A_{PL} has linearly independent columns, its QR factorization, A_{PL} = Q_L R_{TL}, exists. Since all the linearly independent columns of matrix A were permuted to the left, the remaining columns, now part of A_{PR}, are in the column space of A_{PL} and hence in the column space of Q_L. Hence A_{PR} = Q_L R_{TR} for some matrix R_{TR}, which then must satisfy
  A_P = A P(p)^T = ( A_{PL} A_{PR} ) = Q_L ( R_{TL} R_{TR} ). ∎
Let us examine how this last theorem can help us solve the LLS problem
  Find x̂ ∈ Cⁿ such that ‖b − Ax̂‖₂ = min_{x∈Cⁿ} ‖b − Ax‖₂
when r = rank(A) ≤ n:
  min_{x∈Cⁿ} ‖b − Ax‖₂
  = < P(p)^T P(p) = I >
  min_{x∈Cⁿ} ‖b − A P(p)^T P(p) x‖₂
  = < A P(p)^T = Q_L ( R_{TL} R_{TR} ) >
  min_{x∈Cⁿ} ‖b − Q_L ( R_{TL} R_{TR} ) P(p) x‖₂
  = < substitute w = ( R_{TL} R_{TR} ) P(p) x >
  min_{w∈C^r} ‖b − Q_L w‖₂,
which is minimized when w = Q_L^H b. Thus, we are looking for a vector x̂ such that
  ( R_{TL} R_{TR} ) P(p) x̂ = Q_L^H b.
Substituting
  z = [ z_T ; z_B ]
for P(p) x̂ we find that
  ( R_{TL} R_{TR} ) [ z_T ; z_B ] = Q_L^H b.
Now, we can pick z_B ∈ C^{n−r} to be an arbitrary vector, and determine a corresponding z_T by solving
  R_{TL} z_T = Q_L^H b − R_{TR} z_B.
The desired solution is then recovered as x̂ = P(p)^T z. A minimal MATLAB sketch of this approach follows.
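The sketch below is illustrative only; it uses MATLAB's built-in column-pivoted QR factorization, an ad hoc tolerance for the numerical rank, and the simplest choice z_B = 0 (a basic, not minimum-norm, solution):

  m = 8; n = 6;
  A = randn( m, 3 ) * randn( 3, n );  b = randn( m, 1 );   % rank(A) = 3

  [ Q, R, p ] = qr( A, 'vector' );              % column pivoting: A(:,p) = Q * R
  tol = max( size( A ) ) * eps( abs( R(1,1) ) );
  r   = sum( abs( diag( R ) ) > tol );          % numerical rank

  QL = Q( :, 1:r );  RTL = R( 1:r, 1:r );
  zT = RTL \ ( QL' * b );                       % solve RTL * zT = QL' * b, zB = 0
  xhat = zeros( n, 1 );
  xhat( p ) = [ zT; zeros( n - r, 1 ) ];        % undo the column permutation

  norm( A' * ( b - A * xhat ) )                 % normal equations residual; tiny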
4.5 Enrichments
4.5.1 Rank-Revealing QR (RRQR) via MGS
The discussion in Subsection 4.4.4 falls short of being a practical algorithm for at least two reasons:
• One needs to be able to determine in advance what columns of A are linearly independent; and
• Due to roundoff error or error in the data from which the matrix was created, a column may be linearly independent of other columns when for practical purposes it should be considered dependent.
We now discuss how the MGS algorithm can be modified so that appropriate linearly independent columns can be determined "on the fly," as well as the de facto rank of the matrix. The result is known as the Rank-Revealing QR factorization (RRQR). It is also known as the QR factorization with column pivoting. We are going to give a modification of the MGS algorithm for computing the RRQR.
For our discussion, we introduce an elementary pivot matrix, P̃(j) ∈ C^{n×n}, that swaps the first element of the vector to which it is applied with the element indexed with j:
  P̃(j) x = [ e_j^T ; e_1^T ; ⋮ ; e_{j−1}^T ; e_0^T ; e_{j+1}^T ; ⋮ ; e_{n−1}^T ] x = [ χ_j ; χ_1 ; ⋮ ; χ_{j−1} ; χ_0 ; χ_{j+1} ; ⋮ ; χ_{n−1} ].
Another way of stating this is that
  P̃(j) = [ 0 0 1 0 ; 0 I_{(j−1)×(j−1)} 0 0 ; 1 0 0 0 ; 0 0 0 I_{(n−j−1)×(n−j−1)} ],
where I_{k×k} equals the k × k identity matrix. When applying P̃(j) from the right to a matrix, it swaps the first column and the column indexed with j. Notice that P̃(j)^T = P̃(j) and P̃(j) = P̃(j)^{−1}.
Remark 4.5.1.1 For a more detailed discussion of permutation matrices, you may want to
consult Week 7 of "Linear Algebra: Foundations to Frontiers" (LAFF) [26]. We also revisit
this in Section 5.3 when discussing LU factorization with partial pivoting.
Here is an outline of the algorithm:
• Determine the index π₁ such that the column of A indexed with π₁ has the largest 2-norm (is the longest).
• Permute A := A P̃(π₁), swapping the first column with the column that is longest.
• Partition
  A → ( a₁ A₂ ),  Q → ( q₁ Q₂ ),  R → [ ρ₁₁ r₁₂^T ; 0 R₂₂ ],  p → [ π₁ ; p₂ ].
• ρ₁₁ := ‖a₁‖₂ and q₁ := a₁/ρ₁₁.
• Compute r₁₂^T := q₁^T A₂.
• Update A₂ := A₂ − q₁ r₁₂^T.
  This subtracts from each column its component in the direction of q₁.
• Continue the process with the updated A₂.
The complete algorithm, which overwrites A with Q, is given in Figure 4.5.1.2. Observe that the elements on the diagonal of R will be positive and in non-increasing order because updating A₂ := A₂ − q₁ r₁₂^T inherently does not increase the length of the columns of A₂. After all, the component in the direction of q₁ is being subtracted from each column of A₂, leaving the component orthogonal to q₁.
[A, R, p] := RRQR_MGS_simple(A, R, p)
  A → ( A_L A_R ),  R → [ R_{TL} R_{TR} ; R_{BL} R_{BR} ],  p → [ p_T ; p_B ]
  A_L has 0 columns, R_{TL} is 0 × 0, p_T has 0 rows
  while n(A_L) < n(A)
    ( A_L A_R ) → ( A₀ a₁ A₂ ),
    [ R_{TL} R_{TR} ; R_{BL} R_{BR} ] → [ R₀₀ r₀₁ R₀₂ ; r₁₀^T ρ₁₁ r₁₂^T ; R₂₀ r₂₁ R₂₂ ],
    [ p_T ; p_B ] → [ p₀ ; π₁ ; p₂ ]
    π₁ = DetermineColumnIndex( ( a₁ A₂ ) )
    ( a₁ A₂ ) := ( a₁ A₂ ) P̃(π₁)
    ρ₁₁ := ‖a₁‖₂
    a₁ := a₁/ρ₁₁
    r₁₂^T := a₁^T A₂
    A₂ := A₂ − a₁ r₁₂^T
    ( A_L A_R ) ← ( A₀ a₁ A₂ ),
    [ R_{TL} R_{TR} ; R_{BL} R_{BR} ] ← [ R₀₀ r₀₁ R₀₂ ; r₁₀^T ρ₁₁ r₁₂^T ; R₂₀ r₂₁ R₂₂ ],
    [ p_T ; p_B ] ← [ p₀ ; π₁ ; p₂ ]
  endwhile
Figure 4.5.1.2 Simple implementation of RRQR via MGS. Incorporating a stopping criterion that checks whether ρ₁₁ is small would allow the algorithm to determine the effective rank of the input matrix.
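The following MATLAB function is a minimal sketch of the algorithm in Figure 4.5.1.2 (illustrative only, not the course's official implementation). It additionally swaps the corresponding columns of the already-computed rows of R, so that on return A(:, perm) = Q R holds; the function name and the bookkeeping vector perm are our own additions.

  function [ Q, R, perm ] = RRQR_MGS_simple_sketch( A )
  % Overwrite A with Q via MGS with column pivoting; also return R and the
  % accumulated column permutation perm so that A(:,perm) = Q * R.
    [ ~, n ] = size( A );
    R = zeros( n, n );
    perm = 1:n;
    for j = 1:n
      % index of the longest remaining column (largest 2-norm)
      [ ~, k ] = max( sum( abs( A( :, j:n ) ).^2, 1 ) );
      pi1 = j + k - 1;
      A( :, [ j pi1 ] )     = A( :, [ pi1 j ] );      % swap columns of A
      R( 1:j-1, [ j pi1 ] ) = R( 1:j-1, [ pi1 j ] );  % keep computed R consistent
      perm( [ j pi1 ] )     = perm( [ pi1 j ] );      % record the swap
      R( j, j )     = norm( A( :, j ) );              % rho_11 := || a_1 ||_2
      A( :, j )     = A( :, j ) / R( j, j );          % a_1 := a_1 / rho_11
      R( j, j+1:n ) = A( :, j )' * A( :, j+1:n );     % r_12^T := a_1^T A_2
      A( :, j+1:n ) = A( :, j+1:n ) - A( :, j ) * R( j, j+1:n );  % update A_2
    end
  end

A quick check: for [ Q, R, perm ] = RRQR_MGS_simple_sketch( A ), the quantity norm( A(:,perm) - Q*R ) should be of the order of machine epsilon, and abs( diag( R ) ) should be non-increasing.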
The problem with the algorithm in Figure 4.5.1.2 is that determining the index π₁ requires the 2-norm of all columns in A_R to be computed, which costs O(m(n − j)) flops when A_L has j columns (and hence A_R has n − j columns). The following insight reduces this cost:
Let A = ( a₀ a₁ ⋯ a_{n−1} ), let
  v = [ ν₀ ; ν₁ ; ⋮ ; ν_{n−1} ] = [ ‖a₀‖₂² ; ‖a₁‖₂² ; ⋮ ; ‖a_{n−1}‖₂² ],
let q^T q = 1 (here q is of the same size as the columns of A), and let
  r = A^T q = [ ρ₀ ; ρ₁ ; ⋮ ; ρ_{n−1} ].
Compute B := A − q r^T with B = ( b₀ b₁ ⋯ b_{n−1} ). Then
  [ ‖b₀‖₂² ; ‖b₁‖₂² ; ⋮ ; ‖b_{n−1}‖₂² ] = [ ν₀ − ρ₀² ; ν₁ − ρ₁² ; ⋮ ; ν_{n−1} − ρ_{n−1}² ].
To see this, note that
  ‖a_i‖₂² = ‖(a_i − a_i^T q q) + a_i^T q q‖₂² = ‖a_i − a_i^T q q‖₂² + ‖a_i^T q q‖₂² = ‖a_i − ρ_i q‖₂² + ‖ρ_i q‖₂² = ‖b_i‖₂² + ρ_i²,
so that
  ‖b_i‖₂² = ‖a_i‖₂² − ρ_i² = ν_i − ρ_i².
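A quick MATLAB check of this downdating identity (made-up data, real arithmetic):

  A = randn( 7, 5 );
  q = randn( 7, 1 );  q = q / norm( q );        % q' * q = 1
  r = A' * q;                                   % entries rho_i = a_i' * q
  B = A - q * r';                               % B := A - q r^T

  [ sum( B.^2, 1 )', sum( A.^2, 1 )' - r.^2 ]   % the two columns agree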
Building on this insight, we make an important observation that greatly reduces the cost of determining the column that is longest. Let us start by computing v as the vector such that the ith entry in v equals the square of the length of the ith column of A. In other words, the ith entry of v equals the dot product of the ith column of A with itself. In the above outline for the MGS with column pivoting, we can then also partition
  v → [ ν₁ ; v₂ ].
[A, R, p] := RRQR_MGS(A, R, p)
  v := ComputeWeights(A)
  A → ( A_L A_R ),  R → [ R_{TL} R_{TR} ; R_{BL} R_{BR} ],  p → [ p_T ; p_B ],  v → [ v_T ; v_B ]
  A_L has 0 columns, R_{TL} is 0 × 0, p_T has 0 rows, v_T has 0 rows
  while n(A_L) < n(A)
    ( A_L A_R ) → ( A₀ a₁ A₂ ),
    [ R_{TL} R_{TR} ; R_{BL} R_{BR} ] → [ R₀₀ r₀₁ R₀₂ ; r₁₀^T ρ₁₁ r₁₂^T ; R₂₀ r₂₁ R₂₂ ],
    [ p_T ; p_B ] → [ p₀ ; π₁ ; p₂ ],  [ v_T ; v_B ] → [ v₀ ; ν₁ ; v₂ ]
    [ [ ν₁ ; v₂ ], π₁ ] = DeterminePivot( [ ν₁ ; v₂ ] )
    ( a₁ A₂ ) := ( a₁ A₂ ) P̃(π₁)^T
    ρ₁₁ := ‖a₁‖₂
    a₁ := a₁/ρ₁₁
    r₁₂^T := a₁^T A₂
    A₂ := A₂ − a₁ r₁₂^T
    v₂ := UpdateWeights(v₂, r₁₂)
    ( A_L A_R ) ← ( A₀ a₁ A₂ ),
    [ R_{TL} R_{TR} ; R_{BL} R_{BR} ] ← [ R₀₀ r₀₁ R₀₂ ; r₁₀^T ρ₁₁ r₁₂^T ; R₂₀ r₂₁ R₂₂ ],
    [ p_T ; p_B ] ← [ p₀ ; π₁ ; p₂ ],  [ v_T ; v_B ] ← [ v₀ ; ν₁ ; v₂ ]
  endwhile
Figure 4.5.1.3 RRQR via MGS, with the weight-update optimization. Incorporating a stopping criterion that checks whether ρ₁₁ is small would allow the algorithm to determine the effective rank of the input matrix.
Let us revisit the fact that the diagonal elements of R are positive and in non-increasing order. This upper triangular matrix is singular if a diagonal element equals zero (and hence all subsequent diagonal elements equal zero). Hence, if ρ₁₁ becomes small relative to prior diagonal elements, the remaining columns of the (updated) A_R are essentially zero vectors, and the original matrix can be approximated with
  A P(p)^T ≈ Q_L ( R_{TL} R_{TR} ).
A complication remains in computing the parts of R that are needed to update the weights, and that stands in the way of a true blocked algorithm (one that casts most computation in terms of matrix-matrix multiplication). The following papers are related to this:
• [33] Gregorio Quintana-Orti, Xiaobai Sun, and Christof H. Bischof, A BLAS-3 version of the QR factorization with column pivoting, SIAM Journal on Scientific Computing, 19, 1998,
  which discusses how to cast approximately half the computation in terms of matrix-matrix multiplication.
• [25] Per-Gunnar Martinsson, Gregorio Quintana-Orti, Nathan Heavner, Robert van de Geijn, Householder QR Factorization With Randomization for Column Pivoting (HQRRP), SIAM Journal on Scientific Computing, Vol. 39, Issue 2, 2017,
  which shows how a randomized algorithm can be used to cast most computation in terms of matrix-matrix multiplication.
4.6 Wrap Up
4.6.1 Additional homework
We start with some concrete problems from our undergraduate course titled "Linear Algebra:
Foundations to Frontiers" [26]. If you have trouble with these, we suggest you look at Chapter
11 of that course.
Homework 4.6.1.1 Consider
  A = [ 1 0 ; 0 1 ; 1 1 ]  and  b = [ 1 ; 1 ; 0 ].
• Compute an orthonormal basis for C(A).
• Use the method of normal equations to compute the vector x̂ that minimizes min_x ‖b − Ax‖₂.
• Use the QR factorization of matrix A to compute the vector x̂ that minimizes min_x ‖b − Ax‖₂.
Homework 4.6.1.2 The vectors
  q₀ = (√2/2) [ 1 ; 1 ] = [ √2/2 ; √2/2 ],  q₁ = (√2/2) [ −1 ; 1 ] = [ −√2/2 ; √2/2 ].
4.6.2 Summary
The LLS problem can be stated as: Given A ∈ C^{m×n} and b ∈ C^m, find x̂ ∈ Cⁿ such that
  ‖b − Ax̂‖₂ = min_{x∈Cⁿ} ‖b − Ax‖₂.
Given A ∈ C^{m×n},
• The column space, C(A), which is equal to the set of all vectors that are linear combinations of the columns of A:
  { y | y = Ax }.
• The null space, N(A), which is equal to the set of all vectors that are mapped to the zero vector by A:
  { x | Ax = 0 }.
• The row space, R(A), which is equal to the set of all vectors
  { y | y^H = x^H A }.
• The left null space, which is equal to the set of all vectors
  { x | x^H A = 0 }.
• If Ax = b, then there exist x_r ∈ R(A) and x_n ∈ N(A) such that x = x_r + x_n.
These insights are summarized in the following picture, which also captures the orthogonality of the spaces.
If A has linearly independent columns, then the solution of LLS, x̂, equals the solution of the normal equations
  (A^H A) x̂ = A^H b,
as summarized in the accompanying figure. The (left) pseudo inverse of A is given by A† = (A^H A)^{−1} A^H, so that the solution of LLS is given by x̂ = A† b.
Definition 4.6.2.1 Condition number of matrix with linearly independent columns. Let A ∈ C^{m×n} have linearly independent columns (and hence n ≤ m). Then its condition number (with respect to the 2-norm) is defined by
  κ₂(A) = ‖A‖₂ ‖A†‖₂ = σ₀/σ_{n−1}.
Assuming A has linearly independent columns, let b̂ = Ax̂, where b̂ is the projection of b onto the column space of A (in other words, x̂ solves the LLS problem), cos(θ) = ‖b̂‖₂/‖b‖₂, and b̂ + δb̂ = A(x̂ + δx̂), where δb̂ equals the projection of δb onto the column space of A. Then
  ‖δx̂‖₂/‖x̂‖₂ ≤ (1/cos(θ)) (σ₀/σ_{n−1}) (‖δb‖₂/‖b‖₂)
captures the sensitivity of the LLS problem to changes in the right-hand side.
Theorem 4.6.2.2 Given A ∈ C^{m×n}, let A = U_L Σ_{TL} V_L^H equal its Reduced SVD and
  A = ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H
its SVD. Then
• C(A) = C(U_L),
• N(A) = C(V_R),
• N(A^H) = C(U_R).
If A ∈ C^{m×n} has linearly independent columns, then
  x̂ = V_L Σ_{TL}^{−1} U_L^H b
solves LLS. Given A ∈ C^{m×n}, let A = U_L Σ_{TL} V_L^H equal its Reduced SVD and
  A = ( U_L U_R ) [ Σ_{TL} 0 ; 0 0 ] ( V_L V_R )^H
its SVD. Then
  x̂ = V_L Σ_{TL}^{−1} U_L^H b + V_R z_B
is the general solution to LLS, where z_B is any vector in C^{n−r}.
Theorem 4.6.2.3 Assume A ∈ C^{m×n} has linearly independent columns and let A = QR be its QR factorization with orthonormal matrix Q ∈ C^{m×n} and upper triangular matrix R ∈ C^{n×n}. Then the LLS problem is solved by the x̂ that satisfies R x̂ = Q^H b.
Week 5
The LU and Cholesky Factorizations
5.1 Opening
5.1.1 Of Gaussian elimination and LU factorization
YouTube: https://www.youtube.com/watch?v=fszE2KNxTmo
Homework 5.1.1.1 Reduce the appended system
  [  2 −1 1 |  1 ]
  [ −2  2 1 | −1 ]
  [  4 −4 1 |  5 ]
to upper triangular form, overwriting the zeroes that are introduced with the multipliers.
Solution.
  [  2 −1 1 | 1 ]
  [ −1  1 2 | 0 ]
  [  2 −2 3 | 3 ]
YouTube: https://www.youtube.com/watch?v=Tt0OQikd-nI
A = LU(A)
  A → [ A_{TL} A_{TR} ; A_{BL} A_{BR} ]
  A_{TL} is 0 × 0
  while n(A_{TL}) < n(A)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] → [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
    a₂₁ := a₂₁/α₁₁
    A₂₂ := A₂₂ − a₂₁ a₁₂^T
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] ← [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
  endwhile
Figure 5.1.1.1 Algorithm that overwrites A with its LU factorization.
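A minimal MATLAB sketch of this algorithm (illustrative only; the function name is ours):

  function A = LU_right_looking_sketch( A )
  % Overwrite the square matrix A with its LU factorization: U in the upper
  % triangle, the multipliers of the unit lower triangular L strictly below
  % the diagonal (the unit diagonal of L is implicit).
    n = size( A, 1 );
    for k = 1:n-1
      A( k+1:n, k )     = A( k+1:n, k ) / A( k, k );        % a21 := a21 / alpha11
      A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) ...
                          - A( k+1:n, k ) * A( k, k+1:n );  % A22 := A22 - a21 * a12^T
    end
  end

Given A_out = LU_right_looking_sketch( A ), forming L = tril( A_out, -1 ) + eye( n ) and U = triu( A_out ) should recover L * U = A, as Homework 5.1.1.2 illustrates.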
Homework 5.1.1.2 The execution of the LU factorization algorithm with
  A = [ 2 −1 1 ; −2 2 1 ; 4 −4 1 ]
in the video overwrites A with
  [ 2 −1 1 ; −1 1 2 ; 2 −2 3 ].
Multiply the L and U stored in that matrix and compare the result with the original matrix, let's call it Â.
Solution.
  L = [ 1 0 0 ; −1 1 0 ; 2 −2 1 ]  and  U = [ 2 −1 1 ; 0 1 2 ; 0 0 3 ].
  L U = [ 1 0 0 ; −1 1 0 ; 2 −2 1 ] [ 2 −1 1 ; 0 1 2 ; 0 0 3 ] = [ 2 −1 1 ; −2 2 1 ; 4 −4 1 ] = Â.
5.1.2 Overview
• 5.1 Opening
• 5.5 Enrichments
• 5.6 Wrap Up
• State and prove necessary conditions for the existence of the LU factorization.
• Extend the ideas behind Gaussian elimination and LU factorization to include pivoting.
• Derive different algorithms for LU factorization and for solving the resulting triangular
systems.
• State and prove conditions related to the existence of the Cholesky factorization.
• Analyze the cost of the different factorization algorithms and related algorithms for
solving triangular systems.
YouTube: https://www.youtube.com/watch?v=UdN0W8Czj8c
The exercise in Homework 5.2.1.1 motivates the following algorithm, which reduces the linear system Ax = b, stored in n × n matrix A and right-hand side vector b of size n, to an upper triangular system.
  for j := 0, …, n−1
    for i := j+1, …, n−1
      λ_{i,j} := α_{i,j} / α_{j,j}
      α_{i,j} := 0
      for k := j+1, …, n−1
        α_{i,k} := α_{i,k} − λ_{i,j} α_{j,k}      (subtract λ_{i,j} times row j from row i)
      endfor
      β_i := β_i − λ_{i,j} β_j
    endfor
  endfor
Ignoring the updating of the right-hand side (a process known as forward substitution), for each iteration we can first compute the multipliers and then update the matrix:
  for j := 0, …, n−1
    for i := j+1, …, n−1
      λ_{i,j} := α_{i,j} / α_{j,j}      (compute multipliers)
      α_{i,j} := 0
    endfor
    for i := j+1, …, n−1
      for k := j+1, …, n−1
        α_{i,k} := α_{i,k} − λ_{i,j} α_{j,k}      (subtract λ_{i,j} times row j from row i)
      endfor
    endfor
  endfor
Since we know that α_{i,j} is set to zero, we can use its location to store the multiplier:
  for j := 0, …, n−1
    for i := j+1, …, n−1
      α_{i,j} := λ_{i,j} = α_{i,j} / α_{j,j}      (compute all multipliers)
    endfor
    for i := j+1, …, n−1
      for k := j+1, …, n−1
        α_{i,k} := α_{i,k} − α_{i,j} α_{j,k}      (subtract λ_{i,j} times row j from row i)
      endfor
    endfor
  endfor
Finally, we can cast the computation in terms of operations with vectors and submatrices:
A = GE(A)
  A → [ A_{TL} A_{TR} ; A_{BL} A_{BR} ]
  A_{TL} is 0 × 0
  while n(A_{TL}) < n(A)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] → [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
    a₂₁ := l₂₁ = a₂₁/α₁₁
    A₂₂ := A₂₂ − a₂₁ a₁₂^T
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] ← [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
  endwhile
Figure 5.2.1.1 Gaussian elimination algorithm that reduces a matrix A to upper triangular form, storing the multipliers below the diagonal.
Homework 5.2.1.2 Apply the algorithm in Figure 5.2.1.1 to the matrix
  [ 2 −1 1 ; −4 0 1 ; 4 0 −2 ]
and report the resulting matrix. Compare the contents of that matrix to the upper triangular matrix computed in the solution of Homework 5.2.1.1.
Answer.
  [ 2 −1 1 ; −2 −2 3 ; 2 −1 −1 ]
Solution. Partition:
  [ 2 −1 1 ; −4 0 1 ; 4 0 −2 ]
• First iteration: …
The last exercise yielded
  [ 2 −1 1 ; −2 −2 3 ; 2 −1 −1 ].
This can be thought of as an array that stores the unit lower triangular matrix L below the diagonal (with implicit ones on its diagonal) and the upper triangular matrix U on and above its diagonal:
  L = [ 1 0 0 ; −2 1 0 ; 2 −1 1 ]  and  U = [ 2 −1 1 ; 0 −2 3 ; 0 0 −1 ].
Compute B = LU and compare it to A.
Answer. Magic! B = A!
Solution.
  B = LU = [ 1 0 0 ; −2 1 0 ; 2 −1 1 ] [ 2 −1 1 ; 0 −2 3 ; 0 0 −1 ] = [ 2 −1 1 ; −4 0 1 ; 4 0 −2 ] = A.
YouTube: https://www.youtube.com/watch?v=GfpB_RU8pIo
In the launch of this week, we mentioned an algorithm that computes the LU factorization
of a given matrix A so that
A = LU,
where L is a unit lower triangular matrix and U is an upper triangular matrix. We now
derive that algorithm, which is often called the right-looking algorithm for computing the
LU factorization.
Homework 5.2.2.1 Give the approximate cost incurred by the algorithm in Figure 5.2.2.1 when applied to an n × n matrix.
Answer. Approximately (2/3)n³ flops.
Solution. Consider the iteration where A_{TL} is (initially) k × k. Then
• a₂₁ is of size n − k − 1. Thus a₂₁ := a₂₁/α₁₁ is typically computed by first computing 1/α₁₁ and then a₂₁ := (1/α₁₁) a₂₁, which requires (n − k − 1) flops. (The cost of computing 1/α₁₁ is inconsequential when n is large, so it is usually ignored.)
• A₂₂ is of size (n − k − 1) × (n − k − 1) and hence the rank-1 update A₂₂ := A₂₂ − a₂₁ a₁₂^T requires 2(n − k − 1)(n − k − 1) flops.
Now, the cost of updating a₂₁ is small relative to that of the update of A₂₂ and hence will be ignored. Thus, the total cost is given by, approximately,
  Σ_{k=0}^{n−1} 2(n − k − 1)² flops,
as evaluated in the short derivation that follows.
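For completeness, here is the standard evaluation of that sum (our own short derivation, consistent with the stated answer), written in LaTeX:

  \sum_{k=0}^{n-1} 2(n-k-1)^2
    \;=\; 2 \sum_{j=0}^{n-1} j^2            % substitute j = n - k - 1
    \;\approx\; 2 \int_0^n x^2 \, dx
    \;=\; \tfrac{2}{3} n^3 \text{ flops.}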
A similar analysis applies when the algorithm is applied to an m × n matrix with m ≥ n:
• a₂₁ is of size m − k − 1. Thus a₂₁ := a₂₁/α₁₁ is typically computed by first computing 1/α₁₁ and then a₂₁ := (1/α₁₁) a₂₁, which requires (m − k − 1) flops. (The cost of computing 1/α₁₁ is inconsequential when m is large.)
• A₂₂ is of size (m − k − 1) × (n − k − 1) and hence the rank-1 update A₂₂ := A₂₂ − a₂₁ a₁₂^T requires 2(m − k − 1)(n − k − 1) flops.
Now, the cost of updating a₂₁ is small relative to that of the update of A₂₂ and hence will be ignored. Thus, the total cost is given by, approximately,
  Σ_{k=0}^{n−1} 2(m − k − 1)(n − k − 1) flops.
YouTube: https://www.youtube.com/watch?v=Aaa9n97N1qc
Now that we have an algorithm for computing the LU factorization, it is time to talk
about when this LU factorization exists (in other words: when we can guarantee that the
algorithm completes).
We would like to talk about the existence of the LU factorization for the more general case where A is an m × n matrix, with m ≥ n. What does this mean?
Definition 5.2.3.1 Given a matrix A ∈ C^{m×n} with m ≥ n, its LU factorization is given by A = LU, where L ∈ C^{m×n} is unit lower trapezoidal and U ∈ C^{n×n} is upper triangular with nonzeroes on its diagonal.
The first question we will ask is when the LU factorization exists. For this, we need another definition.
Definition 5.2.3.2 Principal leading submatrix. For k ≤ n, the k × k principal leading submatrix of a matrix A is defined to be the square matrix A_{TL} ∈ C^{k×k} such that
  A = [ A_{TL} A_{TR} ; A_{BL} A_{BR} ].
This definition allows us to state necessary and sufficient conditions for when a matrix with n linearly independent columns has an LU factorization:
Lemma 5.2.3.3 Let L ∈ C^{n×n} be a unit lower triangular matrix and U ∈ C^{n×n} be an upper triangular matrix. Then A = LU is nonsingular if and only if U has no zeroes on its diagonal.
Homework 5.2.3.1 Prove Lemma 5.2.3.3.
Hint. You may use the fact that a triangular matrix has an inverse if and only if it has no zeroes on its diagonal.
Solution. The proof hinges on the fact that a triangular matrix is nonsingular if and only if it doesn't have any zeroes on its diagonal. Hence we can instead prove that A = LU is nonsingular if and only if U is nonsingular (since L is unit lower triangular and hence has no zeroes on its diagonal).
• (⇐): Assume A = LU and U has no zeroes on its diagonal. We then know that both L^{−1} and U^{−1} exist. Again, we can either explicitly verify a known inverse of A:
  (LU)(U^{−1} L^{−1}) = L (U U^{−1}) L^{−1} = L L^{−1} = I,
or we can recall that the product of two nonsingular matrices, namely U^{−1} and L^{−1}, is nonsingular.
◦ Inductive Step: Assume the result is true for all matrices with n = k. Show that it is true for matrices with n = k + 1.
Let A of size n = k + 1 have nonsingular principal leading submatrices. Now, if an LU factorization of A exists, A = LU, then it would have the form
  [ A₀₀ a₀₁ ; a₁₀^T α₁₁ ; A₂₀ a₂₁ ] = [ L₀₀ 0 ; l₁₀^T 1 ; L₂₀ l₂₁ ] [ U₀₀ u₀₁ ; 0 υ₁₁ ].    (5.2.1)
If we can show that the different parts of L and U exist, are unique, and υ₁₁ ≠ 0, we are done (since then U is nonsingular). (5.2.1) can be rewritten as
  [ A₀₀ ; a₁₀^T ; A₂₀ ] = [ L₀₀ ; l₁₀^T ; L₂₀ ] U₀₀  and  [ a₀₁ ; α₁₁ ; a₂₁ ] = [ L₀₀ u₀₁ ; l₁₀^T u₀₁ + υ₁₁ ; L₂₀ u₀₁ + l₂₁ υ₁₁ ],
or, equivalently,
  L₀₀ u₀₁ = a₀₁,
  υ₁₁ = α₁₁ − l₁₀^T u₀₁,
  l₂₁ = (a₂₁ − L₂₀ u₀₁)/υ₁₁.
Now, by the Inductive Hypothesis L₀₀, l₁₀^T, and L₂₀ exist and are unique. So the question is whether u₀₁, υ₁₁, and l₂₁ exist and are unique:
  ▪ u₀₁ exists and is unique. Since L₀₀ is nonsingular (it has ones on its diagonal), L₀₀ u₀₁ = a₀₁ has a solution that is unique.
  ▪ υ₁₁ exists, is unique, and is nonzero. Since l₁₀^T and u₀₁ exist and are unique, υ₁₁ = α₁₁ − l₁₀^T u₀₁ exists and is unique. It is also nonzero since the principal leading submatrix [ A₀₀ a₀₁ ; a₁₀^T α₁₁ ] is nonsingular.
  ▪ l₂₁ exists and is unique. Since υ₁₁ exists, is unique, and is nonzero, l₂₁ = (a₂₁ − L₂₀ u₀₁)/υ₁₁ exists and is unique.
∎
The formulas in the inductive step of the proof of Theorem 5.2.3.4 suggest an alternative (left-looking, or "bordered") algorithm for computing the LU factorization, in which the current column of L and U is computed from the parts already computed. In the iteration where A₀₀ is k × k, the cost of its updates is approximately as follows:
• Solving L₀₀ u₀₁ = a₀₁ (a unit lower triangular solve, overwriting a₀₁) requires approximately k² flops.
• Updating α₁₁ := α₁₁ − a₁₀^T a₀₁ requires approximately 2k flops, which we will ignore.
• Updating a₂₁ := a₂₁ − A₂₀ a₀₁ requires approximately 2(m − k − 1)k flops.
• Updating a₂₁ := a₂₁/α₁₁ requires approximately (m − k − 1) flops, which we will ignore.
Thus, the total cost is given by, approximately,
  Σ_{k=0}^{n−1} ( k² + 2(m − k − 1)k ) flops.
Evaluating this sum shows the cost to be approximately (m − 1)n² − (1/3)n³ flops. Had we not ignored the cost of α₁₁ := α₁₁ − a₁₀^T a₀₁, which is approximately 2k flops per iteration, then the result would have been approximately
  mn² − (1/3)n³
instead of (m − 1)n² − (1/3)n³; the former is identical to the cost of the right-looking algorithm in Figure 5.2.2.1. This makes sense, since the two algorithms perform the same operations in a different order. Of course, regardless,
  (m − 1)n² − (1/3)n³ ≈ mn² − (1/3)n³
if m is large.
Remark 5.2.3.6 A careful analysis would show that the left- and right-looking algorithms
perform the exact same operations with the same elements of A, except in a different order.
Thus, it is no surprise that the costs of these algorithms are the same.
Ponder This 5.2.3.3 If A is m × m (square!), then yet another algorithm can be derived by partitioning A, L, and U so that
  A = [ A₀₀ a₀₁ ; a₁₀^T α₁₁ ],  L = [ L₀₀ 0 ; l₁₀^T 1 ],  U = [ U₀₀ u₀₁ ; 0 υ₁₁ ].
Assume that L₀₀ and U₀₀ have already been computed in previous iterations, and determine how to compute the remaining parts in the current iteration, completing the skeleton below.
A = LU-bordered(A)
  A → [ A_{TL} A_{TR} ; A_{BL} A_{BR} ]
  A_{TL} is 0 × 0
  while n(A_{TL}) < n(A)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] → [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
    (fill in the updates)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] ← [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
  endwhile
YouTube: https://www.youtube.com/watch?v=YDtynD4iAVM
Definition 5.2.4.1 A matrix L_k of the form
  L_k = [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ],
where I_k is the k × k identity matrix and I is an identity matrix "of appropriate size," is called a Gauss transform.
Gauss transforms, when applied to a matrix, take multiples of the row indexed with k and add these multiples to other rows. In our use of Gauss transforms to explain the LU factorization, we subtract instead:
Example 5.2.4.2 Evaluate
  [ 1 0 0 0 ; 0 1 0 0 ; 0 −λ₂₁ 1 0 ; 0 −λ₃₁ 0 1 ] [ ã₀^T ; ã₁^T ; ã₂^T ; ã₃^T ] =
Solution.
  [ 1 0 0 0 ; 0 1 0 0 ; 0 −λ₂₁ 1 0 ; 0 −λ₃₁ 0 1 ] [ ã₀^T ; ã₁^T ; ã₂^T ; ã₃^T ] = [ ã₀^T ; ã₁^T ; ã₂^T − λ₂₁ ã₁^T ; ã₃^T − λ₃₁ ã₁^T ].
Notice the similarity with what one does in Gaussian elimination: take multiples of one row and subtract these from other rows.
Homework 5.2.4.1 Evaluate
  [ I_k 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ].
Solution.
  [ I_k 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ]
  = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 (−l₂₁ α₁₁ + a₂₁) (−l₂₁ a₁₂^T + A₂₂) ]
  = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 (a₂₁ − α₁₁ l₂₁) (A₂₂ − l₂₁ a₁₂^T) ].
If l₂₁ = a₂₁/α₁₁, then ã₂₁ = a₂₁ − α₁₁ a₂₁/α₁₁ = 0.
Hopefully you notice the parallels between the computation in the last homework and the algorithm in Figure 5.2.1.1.
Now, assume that the right-looking LU factorization has proceeded to where A contains
  [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ],
where A₀₀ is upper triangular (recall: it is being overwritten by U!). What we would like to do is eliminate the elements in a₂₁ by taking multiples of the "current row" ( α₁₁ a₁₂^T ) and subtracting these from the remaining rows ( a₂₁ A₂₂ ) in order to introduce zeroes below α₁₁. The vehicle is an appropriately chosen Gauss transform, inspired by Homework 5.2.4.1. We must determine l₂₁ so that
  [ I 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 0 (A₂₂ − l₂₁ a₁₂^T) ].
As we saw in Homework 5.2.4.1, this means we must pick l₂₁ = a₂₁/α₁₁. The resulting algorithm is summarized in Figure 5.2.4.3. Notice that this algorithm is, once again, identical to the algorithm in Figure 5.2.1.1 (except that it does not overwrite the lower triangular matrix).
A = GE-via-Gauss-transforms(A)
  A → [ A_{TL} A_{TR} ; A_{BL} A_{BR} ]
  A_{TL} is 0 × 0
  while n(A_{TL}) < n(A)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] → [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
    l₂₁ := a₂₁/α₁₁
    [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] := [ I 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ]
                                              = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 0 (A₂₂ − l₂₁ a₁₂^T) ]
      (that is, a₂₁ := 0 and A₂₂ := A₂₂ − l₂₁ a₁₂^T)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] ← [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
  endwhile
Figure 5.2.4.3 Gaussian elimination, formulated as a sequence of applications of Gauss transforms.
Homework 5.2.4.2 Show that
  [ I_k 0 0 ; 0 1 0 ; 0 −l₂₁ I ]^{−1} = [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ],
where I_k denotes the k × k identity matrix.
Hint. To show that B = A^{−1}, it suffices to show that BA = I (if A and B are square).
Solution.
  [ I_k 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ]
  = [ I_k 0 0 ; 0 1 0 ; 0 (−l₂₁ + l₂₁) I ]
  = [ I_k 0 0 ; 0 1 0 ; 0 0 I ]
  = I.
Thus,
  L_k^{−1} = [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ].
It is easy to show that the product of unit lower triangular matrices is itself unit lower triangular. Hence
  L = L₀^{−1} L₁^{−1} ⋯ L_{n−2}^{−1} L_{n−1}^{−1}
is unit lower triangular. However, it turns out that this L is particularly easy to compute, as the following homework suggests.
Homework 5.2.4.3 Let
  L̃_{k−1} = L₀^{−1} L₁^{−1} ⋯ L_{k−1}^{−1} = [ L₀₀ 0 0 ; l₁₀^T 1 0 ; L₂₀ 0 I ]  and  L_k^{−1} = [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ].
Evaluate L̃_k = L̃_{k−1} L_k^{−1}.
Solution.
  L̃_k = L₀^{−1} L₁^{−1} ⋯ L_{k−1}^{−1} L_k^{−1} = L̃_{k−1} L_k^{−1}
      = [ L₀₀ 0 0 ; l₁₀^T 1 0 ; L₂₀ 0 I ] [ I_k 0 0 ; 0 1 0 ; 0 l₂₁ I ]
      = [ L₀₀ 0 0 ; l₁₀^T 1 0 ; L₂₀ l₂₁ I ].
What this exercise shows is that L = L₀^{−1} L₁^{−1} ⋯ L_{n−2}^{−1} L_{n−1}^{−1} is the unit lower triangular matrix that is created by simply placing the computed vectors l₂₁ below the diagonal of a unit lower triangular matrix. This insight explains the "magic" observed in Homework 5.2.1.3. We conclude that the algorithm in Figure 5.2.1.1 overwrites the n × n matrix A with a unit lower triangular matrix L and an upper triangular matrix U such that A = LU. This is known as the LU factorization or LU decomposition of A.
YouTube: https://www.youtube.com/watch?v=t6cK75IE6d8
Homework 5.3.1.1 Perform Gaussian elimination as explained in Subsection 5.2.1 to solve
  [ 0 1 ; 1 0 ] [ χ₀ ; χ₁ ] = [ 2 ; 1 ].
In the first step, the multiplier is computed as λ₁,₀ = 1/0 and the algorithm fails. Yet, it is clear that the (unique) solution is
  [ χ₀ ; χ₁ ] = [ 1 ; 2 ].
The point of the exercise: Gaussian elimination and, equivalently, LU factorization as we have discussed them so far can fail if a "divide by zero" is encountered. The element on the diagonal used to compute the multipliers in the current iteration of the outer-most loop is called the pivot (element). Thus, if a zero pivot is encountered, the algorithms fail. Even if the pivot is merely small (in magnitude), as we will discuss in a future week, roundoff error encountered when performing floating point operations will likely make the computation "numerically unstable," which is the topic of next week's material.
The simple observation is that the rows of the matrix (and the corresponding right-hand side elements) correspond to linear equations that must be simultaneously solved. Reordering these does not change the solution. Reordering in advance so that no zero pivot is encountered is problematic, since pivots are generally updated by prior computation. However, when a zero pivot is encountered, the row in which it appears can simply be swapped with another row so that the pivot is replaced with a nonzero element (which then becomes the pivot). In exact arithmetic, it suffices to ensure that the pivot is nonzero after swapping. As mentioned, in the presence of roundoff error, any element that is small in magnitude can create problems. For this reason, we will swap rows so that the element with the largest magnitude (among the elements in the "current" column below the diagonal) becomes the pivot. This is known as partial pivoting or row pivoting.
Homework 5.3.1.2 When performing Gaussian elimination as explained in Subsection 5.2.1 to solve
  [ 10^{−k} 1 ; 1 0 ] [ χ₀ ; χ₁ ] = [ 1 ; 1 ],
set
  1 − 10^k
to
  −10^k
(since we will assume k to be large and hence 1 is very small relative to 10^k). With this modification (which simulates roundoff error that may be encountered when performing floating point computation), what is the answer?
Next, solve
  [ 1 0 ; 10^{−k} 1 ] [ χ₀ ; χ₁ ] = [ 1 ; 1 ].
What do you observe?
Solution. The appended system is given by
  [ 10^{−k} 1 | 1 ; 1 0 | 1 ].
In the first step, the multiplier is computed as λ₁,₀ = 10^k and the updated appended system becomes
  [ 10^{−k} 1 | 1 ; 0 −10^k | 1 − 10^k ],
which is rounded to
  [ 10^{−k} 1 | 1 ; 0 −10^k | −10^k ].
We then compute
  χ₁ = (−10^k)/(−10^k) = 1
and
  χ₀ = (1 − χ₁)/10^{−k} = (1 − 1)/10^{−k} = 0.
If we instead start with the equivalent system
  [ 1 0 | 1 ; 10^{−k} 1 | 1 ],
then the multiplier is 10^{−k}, the update creates no large intermediate values, and we compute χ₀ = 1 and χ₁ = 1 − 10^{−k} ≈ 1, which is (very nearly) the correct answer.
Homework 5.3.2.1 Let
  p = [ π₀ ; ⋮ ; π_{n−1} ]  and  x = [ χ₀ ; ⋮ ; χ_{n−1} ].
Evaluate P(p)x.
Solution.
  P(p)x
  = < definition >
  [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ] x
  = < matrix-vector multiplication by rows >
  [ e_{π₀}^T x ; ⋮ ; e_{π_{n−1}}^T x ]
  = < e_j^T x = χ_j >
  [ χ_{π₀} ; ⋮ ; χ_{π_{n−1}} ].
The last homework shows that applying P(p) to a vector x rearranges the elements of that vector according to the permutation indicated by the vector p.
Homework 5.3.2.2 Let
  p = [ π₀ ; ⋮ ; π_{n−1} ]  and  A = [ ã₀^T ; ⋮ ; ã_{n−1}^T ].
Evaluate P(p)A.
Solution.
  P(p)A
  = < definition >
  [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ] A
  = < matrix-matrix multiplication by rows >
  [ e_{π₀}^T A ; ⋮ ; e_{π_{n−1}}^T A ]
  = < e_j^T A = ã_j^T >
  [ ã_{π₀}^T ; ⋮ ; ã_{π_{n−1}}^T ].
The last homework shows that applying P(p) to a matrix A rearranges the rows of that matrix according to the permutation indicated by the vector p.
Homework 5.3.2.3 Let
  p = [ π₀ ; ⋮ ; π_{n−1} ]  and  A = ( a₀ ⋯ a_{n−1} ).
Evaluate A P(p)^T.
Solution.
  A P(p)^T
  = < definition >
  A [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ]^T
  = < transpose P(p) >
  A ( e_{π₀} ⋯ e_{π_{n−1}} )
  = < matrix-matrix multiplication by columns >
  ( A e_{π₀} ⋯ A e_{π_{n−1}} )
  = < A e_j = a_j >
  ( a_{π₀} ⋯ a_{π_{n−1}} ).
The last homework shows that applying P(p)^T from the right to a matrix A rearranges the columns of that matrix according to the permutation indicated by the vector p.
Homework 5.3.2.4 Evaluate P(p)P(p)^T.
Answer. P(p)P(p)^T = I.
Solution.
  P(p)P(p)^T
  = < definition >
  [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ] [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ]^T
  = < transpose P(p) >
  [ e_{π₀}^T ; ⋮ ; e_{π_{n−1}}^T ] ( e_{π₀} ⋯ e_{π_{n−1}} )
  = < evaluate >
  [ e_{π₀}^T e_{π₀} ⋯ e_{π₀}^T e_{π_{n−1}} ; e_{π₁}^T e_{π₀} ⋯ e_{π₁}^T e_{π_{n−1}} ; ⋮ ; e_{π_{n−1}}^T e_{π₀} ⋯ e_{π_{n−1}}^T e_{π_{n−1}} ]
  = < e_{π_i}^T e_{π_j} equals 1 if i = j and 0 otherwise >
  I.
(Here P̃(·) is always an elementary pivot matrix "of appropriate size.") What this exactly does is best illustrated through an example:
Example 5.3.2.3 Let
  p = [ 2 ; 1 ; 1 ]  and  A = [ 0.0 0.1 0.2 ; 1.0 1.1 1.2 ; 2.0 2.1 2.2 ; 3.0 3.1 3.2 ].
Evaluate P̃(p)A.
Solution.
  P̃(p)A
  = < instantiate >
  P̃( [ 2 ; 1 ; 1 ] ) [ 0.0 0.1 0.2 ; 1.0 1.1 1.2 ; 2.0 2.1 2.2 ; 3.0 3.1 3.2 ]
  = < definition of P̃(·) >
  [ 1 0 ; 0 P̃( [ 1 ; 1 ] ) ] P̃(2) [ 0.0 0.1 0.2 ; 1.0 1.1 1.2 ; 2.0 2.1 2.2 ; 3.0 3.1 3.2 ]
  = < swap the first row with the row indexed with 2 >
  [ 1 0 ; 0 P̃( [ 1 ; 1 ] ) ] [ 2.0 2.1 2.2 ; 1.0 1.1 1.2 ; 0.0 0.1 0.2 ; 3.0 3.1 3.2 ]
  = < partitioned matrix-matrix multiplication >
  [ 2.0 2.1 2.2 ; P̃( [ 1 ; 1 ] ) [ 1.0 1.1 1.2 ; 0.0 0.1 0.2 ; 3.0 3.1 3.2 ] ]
  = < swap the current first row with the row indexed with 1, relative to that row >
  [ 2.0 2.1 2.2 ; 0.0 0.1 0.2 ; P̃(1) [ 1.0 1.1 1.2 ; 3.0 3.1 3.2 ] ]
  = < swap the current first row with the row indexed with 1, relative to that row >
  [ 2.0 2.1 2.2 ; 0.0 0.1 0.2 ; 3.0 3.1 3.2 ; 1.0 1.1 1.2 ].
YouTube: https://www.youtube.com/watch?v=QSnoqrsQNag
Having introduced our notation for permutation matrices, we can now define the LU factorization with partial pivoting: Given an m × n matrix A, we wish to compute
• a vector p of n integers that indicates how rows are pivoted as the algorithm proceeds,
• a unit lower trapezoidal matrix L, and
• an upper triangular matrix U
so that P̃(p)A = LU. We represent this operation by
  [A, p] := LUpiv(A),
where upon completion A has been overwritten by {L\U}, which indicates that U overwrites the upper triangular part of A and L is stored in the strictly lower triangular part of A.
Let us start with revisiting the derivation of the right-looking LU factorization in Subsection 5.2.2. The first step is to find a first permutation matrix P̃(π₁) such that the element on the diagonal in the first column is maximal in value. (Mathematically, any nonzero value works. We will see that ensuring that the multiplier is less than one in magnitude reduces the potential for accumulation of error.) For this, we will introduce the function
  maxi(x)
which, given a vector x, returns the index of the element in x with maximal magnitude (absolute value). The algorithm then proceeds as follows:
• Partition A and L as follows:
  A → [ α₁₁ a₁₂^T ; a₂₁ A₂₂ ]  and  L → [ 1 0 ; l₂₁ L₂₂ ].
• Compute π₁ = maxi( [ α₁₁ ; a₂₁ ] ).
• Permute the rows:
  [ α₁₁ a₁₂^T ; a₂₁ A₂₂ ] := P̃(π₁) [ α₁₁ a₁₂^T ; a₂₁ A₂₂ ].
• Compute l₂₁ := a₂₁/α₁₁ and update A₂₂ := A₂₂ − l₂₁ a₁₂^T.
This completes the introduction of zeroes below the diagonal of the first column.
Now, more generally, assume that the computation has proceeded to the point where matrix A has been overwritten by
  [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ],
where A₀₀ is upper triangular. If no pivoting were added, one would compute l₂₁ := a₂₁/α₁₁ followed by the update
  [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] := [ I 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 0 (A₂₂ − l₂₁ a₁₂^T) ].
With pivoting, the steps of the current iteration become:
• Compute
  π₁ := maxi( [ α₁₁ ; a₂₁ ] ).
• Permute the (entire) rows:
  [ a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ] := P̃(π₁) [ a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ].
• Update
  l₂₁ := a₂₁/α₁₁.
• Update
  [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] := [ I 0 0 ; 0 1 0 ; 0 −l₂₁ I ] [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 a₂₁ A₂₂ ] = [ A₀₀ a₀₁ A₀₂ ; 0 α₁₁ a₁₂^T ; 0 0 (A₂₂ − l₂₁ a₁₂^T) ].
This algorithm is summarized in Figure 5.3.3.1. In that algorithm, the lower triangular matrix L is accumulated below the diagonal.
[A, p] := LUpiv-right-looking(A)
  A → [ A_{TL} A_{TR} ; A_{BL} A_{BR} ],  p → [ p_T ; p_B ]
  A_{TL} is 0 × 0, p_T has 0 elements
  while n(A_{TL}) < n(A)
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] → [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ],  [ p_T ; p_B ] → [ p₀ ; π₁ ; p₂ ]
    π₁ := maxi( [ α₁₁ ; a₂₁ ] )
    [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ] := [ I 0 ; 0 P̃(π₁) ] [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ]
    a₂₁ := a₂₁/α₁₁
    A₂₂ := A₂₂ − a₂₁ a₁₂^T
    [ A_{TL} A_{TR} ; A_{BL} A_{BR} ] ← [ A₀₀ a₀₁ A₀₂ ; a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ],  [ p_T ; p_B ] ← [ p₀ ; π₁ ; p₂ ]
  endwhile
Figure 5.3.3.1 Right-looking LU factorization algorithm with partial pivoting.
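A minimal MATLAB sketch of the algorithm in Figure 5.3.3.1 (illustrative only; it is deliberately not named after the routine requested in Homework 5.3.3.1, and here the pivot vector p simply records, for each step k, the row that was swapped into position k):

  function [ A, p ] = LUpiv_rl_sketch( A )
  % Overwrite square A with {L\U} of the row-permuted matrix and record
  % the pivot rows in p (1-based indices).
    n = size( A, 1 );
    p = zeros( n, 1 );
    for k = 1:n
      [ ~, j ] = max( abs( A( k:n, k ) ) );   % maxi: largest magnitude in column k
      pi1 = k + j - 1;
      p( k ) = pi1;
      A( [ k pi1 ], : ) = A( [ pi1 k ], : );  % swap entire rows, so previously
                                              % computed multipliers move as well
      if k < n
        A( k+1:n, k )     = A( k+1:n, k ) / A( k, k );
        A( k+1:n, k+1:n ) = A( k+1:n, k+1:n ) - A( k+1:n, k ) * A( k, k+1:n );
      end
    end
  end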
The algorithm computes Gauss transforms L₀, …, L_{n−1} and permutations P₀, …, P_{n−1} such that
  L_{n−1} P_{n−1} ⋯ L₀ P₀ A = U
or, equivalently,
  A = P₀^T L₀^{−1} ⋯ P_{n−1}^T L_{n−1}^{−1} U.
Actually, since P_k = [ I_{k×k} 0 ; 0 P̃(π) ] for some π, we know that P_k^T = P_k and hence
  A = P₀ L₀^{−1} ⋯ P_{n−1} L_{n−1}^{−1} U.
What we will finally show is that there are Gauss transforms L₀^⋆, …, L_{n−1}^⋆ such that
  A = P₀ ⋯ P_{n−1} L₀^⋆ ⋯ L_{n−1}^⋆ U
or, equivalently,
  P̃(p) A = P_{n−1} ⋯ P₀ A = L₀^⋆ ⋯ L_{n−1}^⋆ U,
where L = L₀^⋆ ⋯ L_{n−1}^⋆ is unit lower triangular, which is what we set out to compute.
Here is the insight. If only we knew how to order the rows of A and right-hand side b correctly, then we would not have to pivot. But we only know how to pivot as the computation unfolds. Recall that the multipliers can overwrite the elements they zero in Gaussian elimination, and they do so when we formulate it as an LU factorization. By not only pivoting the elements of
  [ α₁₁ a₁₂^T ; a₂₁ A₂₂ ]
but also all of
  [ a₁₀^T α₁₁ a₁₂^T ; A₂₀ a₂₁ A₂₂ ],
we are moving the computed multipliers with the rows that are being swapped. It is for this reason that we end up computing the LU factorization of the permuted matrix P̃(p)A.
Homework 5.3.3.1 Implement the algorithm given in Figure 5.3.3.1 as
  function [ A_out ] = LUpiv_right_looking( A )
• Assignments/Week05/matlab/maxi.m
• Assignments/Week05/matlab/Swap.m
YouTube: https://www.youtube.com/watch?v=kqj3n1EUCkw
Given a nonsingular matrix A ∈ C^{n×n}, the above discussions have yielded an algorithm for computing a permutation matrix P, a unit lower triangular matrix L, and an upper triangular matrix U such that PA = LU. We now discuss how these can be used to solve the system of linear equations Ax = b.
Starting with
  Ax = b,
where the nonsingular matrix A is n × n (and hence square), we compute the factorization
  [A, p] := LUpiv(A),
apply the permutation to the right-hand side, y := P̃(p)b, and then solve
  Lz = y
followed by
  Ux = z.
A minimal MATLAB sketch of these steps is given below.
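The sketch is illustrative only and uses MATLAB's built-in LU factorization with partial pivoting in place of LUpiv:

  n = 6;
  A = randn( n );  b = randn( n, 1 );

  [ L, U, p ] = lu( A, 'vector' );   % A(p,:) = L * U, with p a permutation vector
  y = b( p );                        % apply the permutation to the right-hand side
  z = L \ y;                         % forward substitution:  L z = y
  x = U \ z;                         % back substitution:     U x = z

  norm( A * x - b )                  % should be of the order of machine epsilon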
YouTube: https://www.youtube.com/watch?v=qc_4NsNp3q0
Consider Lz = y, where L is unit lower triangular. Partition
  L → [ 1 0 ; l₂₁ L₂₂ ],  z → [ ζ₁ ; z₂ ],  and  y → [ ψ₁ ; y₂ ].
Then
  [ 1 0 ; l₂₁ L₂₂ ] [ ζ₁ ; z₂ ] = [ ψ₁ ; y₂ ].
Multiplying out the left-hand side yields
  [ ζ₁ ; ζ₁ l₂₁ + L₂₂ z₂ ] = [ ψ₁ ; y₂ ],
so that ζ₁ = ψ₁ and L₂₂ z₂ = y₂ − ζ₁ l₂₁. This suggests the following update, which overwrites y with z:
• y₂ := y₂ − ψ₁ l₂₁,
after which the process continues with L₂₂ z₂ = y₂. This yields the algorithm in Figure 5.3.5.1.
Solve Lz = y, overwriting y with z (Variant 1)
  L → [ L_{TL} L_{TR} ; L_{BL} L_{BR} ],  y → [ y_T ; y_B ]
  L_{TL} is 0 × 0 and y_T has 0 elements
  while n(L_{TL}) < n(L)
    [ L_{TL} L_{TR} ; L_{BL} L_{BR} ] → [ L₀₀ l₀₁ L₀₂ ; l₁₀^T λ₁₁ l₁₂^T ; L₂₀ l₂₁ L₂₂ ],  [ y_T ; y_B ] → [ y₀ ; ψ₁ ; y₂ ]
    y₂ := y₂ − ψ₁ l₂₁
    [ L_{TL} L_{TR} ; L_{BL} L_{BR} ] ← [ L₀₀ l₀₁ L₀₂ ; l₁₀^T λ₁₁ l₁₂^T ; L₂₀ l₂₁ L₂₂ ],  [ y_T ; y_B ] ← [ y₀ ; ψ₁ ; y₂ ]
  endwhile
Figure 5.3.5.1 Lower triangular solve (with unit lower triangular matrix), Variant 1
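A minimal MATLAB sketch of Variant 1 (illustrative only; the function name is ours, and the diagonal of L is assumed to be unit and is ignored):

  function y = Ltrsv_unit_var1_sketch( L, y )
  % Overwrite y with the solution z of L z = y, L unit lower triangular.
    n = size( L, 1 );
    for j = 1:n
      y( j+1:n ) = y( j+1:n ) - y( j ) * L( j+1:n, j );   % y2 := y2 - psi1 * l21
    end
  end

Each iteration performs an axpy with (part of) a column of L, which is the point made in the discussion of the two variants later in this unit.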
Homework 5.3.5.1 Derive a similar algorithm for solving Ux = z. Update the skeleton algorithm below with the result. (Don't forget to put in the lines that indicate how you "partition and repartition" through the matrix.)
Solve Ux = z, overwriting z with x (Variant 1)
  U → [ U_{TL} U_{TR} ; U_{BL} U_{BR} ],  z → [ z_T ; z_B ]
  U_{BR} is 0 × 0 and z_B has 0 elements
  while n(U_{BR}) < n(U)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] → [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] → [ z₀ ; ζ₁ ; z₂ ]
    (fill in the updates)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] ← [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] ← [ z₀ ; ζ₁ ; z₂ ]
  endwhile
Hint. Partition
  [ U₀₀ u₀₁ ; 0 υ₁₁ ] [ x₀ ; χ₁ ] = [ z₀ ; ζ₁ ].
Solution. Multiplying this out yields
  [ U₀₀ x₀ + u₀₁ χ₁ ; υ₁₁ χ₁ ] = [ z₀ ; ζ₁ ].
So, χ₁ = ζ₁/υ₁₁, after which x₀ can be computed by solving U₀₀ x₀ = z₀ − χ₁ u₀₁. The resulting algorithm is then given by
Solve Ux = z, overwriting z with x (Variant 1)
  U → [ U_{TL} U_{TR} ; U_{BL} U_{BR} ],  z → [ z_T ; z_B ]
  U_{BR} is 0 × 0 and z_B has 0 elements
  while n(U_{BR}) < n(U)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] → [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] → [ z₀ ; ζ₁ ; z₂ ]
    ζ₁ := ζ₁/υ₁₁
    z₀ := z₀ − ζ₁ u₀₁
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] ← [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] ← [ z₀ ; ζ₁ ; z₂ ]
  endwhile
YouTube: https://www.youtube.com/watch?v=2tvfYnD9NrQ
An alternative algorithm can be derived as follows: Partition
  L → [ L₀₀ 0 ; l₁₀^T 1 ],  z → [ z₀ ; ζ₁ ],  and  y → [ y₀ ; ψ₁ ].
Then
  [ L₀₀ 0 ; l₁₀^T 1 ] [ z₀ ; ζ₁ ] = [ y₀ ; ψ₁ ].
Multiplying out the left-hand side yields
  [ L₀₀ z₀ ; l₁₀^T z₀ + ζ₁ ] = [ y₀ ; ψ₁ ]
and the equalities
  L₀₀ z₀ = y₀,
  l₁₀^T z₀ + ζ₁ = ψ₁.
The idea now is as follows: Assume that the elements of z₀ were computed in previous iterations of the algorithm in Figure 5.3.5.2, overwriting y₀. Then in the current iteration we must compute ζ₁ := ψ₁ − l₁₀^T z₀, overwriting ψ₁.
Solve Lz = y, overwriting y with z (Variant 2)
  L → [ L_{TL} L_{TR} ; L_{BL} L_{BR} ],  y → [ y_T ; y_B ]
  L_{TL} is 0 × 0 and y_T has 0 elements
  while n(L_{TL}) < n(L)
    [ L_{TL} L_{TR} ; L_{BL} L_{BR} ] → [ L₀₀ l₀₁ L₀₂ ; l₁₀^T λ₁₁ l₁₂^T ; L₂₀ l₂₁ L₂₂ ],  [ y_T ; y_B ] → [ y₀ ; ψ₁ ; y₂ ]
    ψ₁ := ψ₁ − l₁₀^T y₀
    [ L_{TL} L_{TR} ; L_{BL} L_{BR} ] ← [ L₀₀ l₀₁ L₀₂ ; l₁₀^T λ₁₁ l₁₂^T ; L₂₀ l₂₁ L₂₂ ],  [ y_T ; y_B ] ← [ y₀ ; ψ₁ ; y₂ ]
  endwhile
Figure 5.3.5.2 Lower triangular solve (with unit lower triangular matrix), Variant 2
Homework 5.3.5.2 Derive a similar algorithm for solving Ux = z. Update the skeleton algorithm below with the result. (Don't forget to put in the lines that indicate how you "partition and repartition" through the matrix.)
Solve Ux = z, overwriting z with x (Variant 2)
  U → [ U_{TL} U_{TR} ; U_{BL} U_{BR} ],  z → [ z_T ; z_B ]
  U_{BR} is 0 × 0 and z_B has 0 elements
  while n(U_{BR}) < n(U)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] → [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] → [ z₀ ; ζ₁ ; z₂ ]
    (fill in the updates)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] ← [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] ← [ z₀ ; ζ₁ ; z₂ ]
  endwhile
Hint. Partition
  U → [ υ₁₁ u₁₂^T ; 0 U₂₂ ].
Solution. Partition
  [ υ₁₁ u₁₂^T ; 0 U₂₂ ] [ χ₁ ; x₂ ] = [ ζ₁ ; z₂ ].
Multiplying this out yields
  [ υ₁₁ χ₁ + u₁₂^T x₂ ; U₂₂ x₂ ] = [ ζ₁ ; z₂ ].
So, if we assume that x₂ has already been computed and has overwritten z₂, then χ₁ can be computed as
  χ₁ = (ζ₁ − u₁₂^T x₂)/υ₁₁,
which can then overwrite ζ₁. The resulting algorithm is given by
Solve Ux = z, overwriting z with x (Variant 2)
  U → [ U_{TL} U_{TR} ; U_{BL} U_{BR} ],  z → [ z_T ; z_B ]
  U_{BR} is 0 × 0 and z_B has 0 elements
  while n(U_{BR}) < n(U)
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] → [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] → [ z₀ ; ζ₁ ; z₂ ]
    ζ₁ := ζ₁ − u₁₂^T z₂
    ζ₁ := ζ₁/υ₁₁
    [ U_{TL} U_{TR} ; U_{BL} U_{BR} ] ← [ U₀₀ u₀₁ U₀₂ ; u₁₀^T υ₁₁ u₁₂^T ; U₂₀ u₂₁ U₂₂ ],  [ z_T ; z_B ] ← [ z₀ ; ζ₁ ; z₂ ]
  endwhile
Homework 5.3.5.3 Let L be an m × m unit lower triangular matrix. If a multiply and an add each require one flop, what is the approximate cost of solving Lx = y?
Solution. Let us analyze Variant 1.
Let L₀₀ be k × k in a typical iteration. Then y₂ is of size m − k − 1 and y₂ := y₂ − ψ₁ l₂₁ requires 2(m − k − 1) flops. Summing this over all iterations requires
  Σ_{k=0}^{m−1} 2(m − k − 1) flops,
which is approximately m² flops.
5.3.5.3 Discussion
Computation tends to be more efficient when matrices are accessed by columns, since scientific computing applications tend to store matrices by columns (in column-major order). This dates back to the days when Fortran ruled supreme. Accessing memory consecutively improves performance, so computing with columns tends to be more efficient than computing with rows.
Variant 1 for each of the algorithms casts computation in terms of columns of the matrix that is involved:
y2 := y2 − ψ1 l21
and
z0 := z0 − ζ1 u01.
These are called axpy operations, short for "alpha times x plus y":
y := αx + y.
In contrast, Variant 2 casts computation in terms of rows of the matrix that is involved:
ψ1 := ψ1 − l10^T y0
and
ζ1 := ζ1 − u12^T z2
perform dot products.
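To make the contrast concrete, here is a minimal MATLAB sketch (not part of the original notes) of the two lower triangular solve variants for a unit lower triangular L, overwriting y with the solution z. The function names are illustrative only; Variant 1 accesses L by columns (axpy), Variant 2 by rows (dot product).

function y = Ltrsv_axpy( y, L )
  % Variant 1: once psi_1 is known, update the rest of y with an axpy
  % involving the current column of L.
  n = size( L, 1 );
  for j = 1:n
    y( j+1:n ) = y( j+1:n ) - y( j ) * L( j+1:n, j );
  end
end

function y = Ltrsv_dot( y, L )
  % Variant 2: compute each psi_1 via a dot product with the part of the
  % current row of L to its left.
  n = size( L, 1 );
  for i = 1:n
    y( i ) = y( i ) - L( i, 1:i-1 ) * y( 1:i-1 );
  end
end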
Now, instead of finding the largest element in magnitude in the first column, find the largest element in magnitude in the entire matrix. Let's say it is element (π1, ρ1). Then, one permutes
( α11 a12^T ; a21 A22 ) := P(π1) ( α11 a12^T ; a21 A22 ) P(ρ1)^T,
making α11 the largest element in magnitude. We will later see that the magnitude of α11 impacts element growth in the remaining matrix (A22) and that in turn impacts the numerical stability (accuracy) of the algorithm. By choosing α11 to be as large as possible in magnitude, the magnitude of the multipliers is reduced, as is element growth.
The problem is that complete pivoting requires O(n²) comparisons per iteration. Thus, the number of comparisons is of the same order as the number of floating point operations. Worse, it completely destroys the ability to cast most computation in terms of matrix-matrix multiplication, thus impacting the ability to attain much greater performance.
In practice, LU factorization with complete pivoting is not used.
YouTube: https://www.youtube.com/watch?v=nxGR8NgXYxg
Hermitian positive definite (HPD) matrices are a special class of matrices that are frequently encountered in practice.
Definition 5.4.1.1 Hermitian positive definite matrix. A matrix A ∈ C^{n×n} is Hermitian positive definite (HPD) if and only if it is Hermitian (A^H = A) and for all nonzero vectors x ∈ C^n it is the case that x^H A x > 0. If in addition A ∈ R^{n×n} then A is said to be symmetric positive definite (SPD). ⌃
If A is HPD and partitioned as A = ( α11 a21^H ; a21 A22 ), then A22 is also HPD: choosing x = ( 0 ; x2 ) with x2 ≠ 0,
0 < ⟨ A is HPD ⟩ x^H A x
  = ⟨ partition ⟩ ( 0 ; x2 )^H ( α11 a21^H ; a21 A22 ) ( 0 ; x2 )
  = ⟨ multiply out ⟩ x2^H A22 x2.
YouTube: https://www.youtube.com/watch?v=w8a9xVHVmAI
We will prove the following theorem in Subsection 5.4.4.
Theorem 5.4.2.1 Cholesky Factorization Theorem. Given an HPD matrix A there
exists a lower triangular matrix L such that A = LLH . If the diagonal elements of L are
restricted to be positive, L is unique.
Obviously, there similarly exists an upper triangular matrix U such that A = U H U since
we can choose U H = L.
The lower triangular matrix L is known as the Cholesky factor and LLH is known as the
Cholesky factorization of A. It is unique if the diagonal elements of L are restricted to be
positive. Typically, only the lower (or upper) triangular part of A is stored, and it is that
part that is then overwritten with the result. In our discussions, we will assume that the
lower triangular part of A is stored and overwritten.
YouTube: https://www.youtube.com/watch?v=x4grvf-MfTk
The most common algorithm for computing the Cholesky factorization of a given HPD matrix A is derived as follows:
or, equivalently,
λ11 = ±√α11,  l21 = a21/λ̄11,  L22 = Chol( A22 − l21 l21^H ).
• These equalities motivate the following algorithm for overwriting the lower triangular part of A with the Cholesky factor of A:
  ◦ Partition A → ( α11 ⋆ ; a21 A22 ).
  ◦ Overwrite α11 := λ11 = √α11. (Picking λ11 = √α11 makes it positive and real, and ensures uniqueness.)
  ◦ Overwrite a21 := l21 = a21/λ11.
  ◦ Overwrite A22 := A22 − l21 l21^H (updating only the lower triangular part of A22). This operation is called a symmetric rank-1 update.
  ◦ Continue by computing the Cholesky factor of A22.
The resulting algorithm is often called the "right-looking" variant and is summarized in Figure 5.4.3.1.
A = Chol-right-looking(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    α11 := λ11 = √α11
    a21 := l21 = a21/α11
    A22 := A22 − a21 a21^H   (syr: update only the lower triangular part)
    Continue with the repartitioned blocks
  endwhile
Figure 5.4.3.1 Right-looking (unblocked) Cholesky factorization algorithm.
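Homework 5.4.3.2 below asks for a MATLAB implementation of this algorithm. One possible translation, written with plain indexing rather than the FLAME notation and offered only as a sketch (the function name is illustrative), is

function A = Chol_right_looking_sketch( A )
  % Unblocked right-looking Cholesky: overwrite the lower triangular
  % part of the HPD matrix A with its Cholesky factor L.
  n = size( A, 1 );
  for j = 1:n
    A( j, j ) = sqrt( A( j, j ) );               % alpha11 := sqrt( alpha11 )
    A( j+1:n, j ) = A( j+1:n, j ) / A( j, j );   % a21 := a21 / alpha11
    A( j+1:n, j+1:n ) = A( j+1:n, j+1:n ) ...    % A22 := A22 - a21 * a21^H
        - tril( A( j+1:n, j ) * A( j+1:n, j )' );%   (only the lower part)
  end
end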
YouTube: https://www.youtube.com/watch?v=6twDI6QhqCY
The cost of the Cholesky factorization of A ∈ C^{n×n} can be analyzed as follows: In Figure 5.4.3.1 during the kth iteration (starting k at zero) A00 is k × k. Thus, the operations in that iteration cost
• α11 := √α11: this cost is negligible when k is large.
• a21 := a21/α11: approximately (n − k − 1) flops. This operation is typically implemented as (1/α11) a21.
• A22 := A22 − a21 a21^H (updating only the lower triangular part of A22): approximately (n − k − 1)² flops.
Thus,
C_Chol(n)
≈ ⟨ sum over all iterations ⟩ Σ_{k=0}^{n−1} (n − k − 1)²  +  Σ_{k=0}^{n−1} (n − k − 1)   (due to the update of A22 and of a21, respectively)
= ⟨ change of variables j = n − k − 1 ⟩ Σ_{j=0}^{n−1} j² + Σ_{j=0}^{n−1} j
≈ ⟨ Σ_{j=0}^{n−1} j² ≈ n³/3; Σ_{j=0}^{n−1} j ≈ n²/2 ⟩ (1/3) n³ + (1/2) n²
≈ ⟨ remove lower order term ⟩ (1/3) n³.
Remark 5.4.3.2 Comparing the cost of the Cholesky factorization to that of the LU fac-
torization in Homework 5.2.2.1, we see that taking advantage of symmetry cuts the cost
approximately in half.
Homework 5.4.3.2 Implement the algorithm given in Figure 5.4.3.1 as
function [ A_out ] = Chol_right_looking( A )
YouTube: https://www.youtube.com/watch?v=unpQfRgIHOg
Partition, once again,
A → ( α11 a21^H ; a21 A22 ).
The following lemmas are key to the proof of the Cholesky Factorization Theorem:
Lemma 5.4.4.1 Let A ∈ C^{n×n} be HPD. Then α11 is real and positive.
Lemma 5.4.4.2 Let A ∈ C^{n×n} be HPD. Then A22 − a21 a21^H/α11 is HPD.
The key step in the proof of Lemma 5.4.4.2 chooses, for arbitrary nonzero x2, the vector x = ( χ1 ; x2 ) with χ1 = −a21^H x2/α11, so that
0 < ⟨ A is HPD ⟩ x^H A x
  = ⟨ partition ⟩ ( χ1 ; x2 )^H ( α11 a21^H ; a21 A22 ) ( χ1 ; x2 )
  = ⟨ partitioned multiplication ⟩ ( χ1 ; x2 )^H ( α11 χ1 + a21^H x2 ; a21 χ1 + A22 x2 )
  = ⟨ partitioned multiplication ⟩ α11 χ̄1 χ1 + χ̄1 a21^H x2 + x2^H a21 χ1 + x2^H A22 x2
  = ⟨ since x2^H a21 and a21^H x2 are scalars and hence can move around; α11/α11 = 1 ⟩ x2^H a21 a21^H x2/α11 − x2^H a21 a21^H x2/α11 − x2^H a21 a21^H x2/α11 + x2^H A22 x2
  = x2^H ( A22 − a21 a21^H/α11 ) x2.
2. Inductive step: Assume the result is true for n = k. We will show that it holds for n = k + 1.
Let A ∈ C^{(k+1)×(k+1)} be HPD. Partition
A = ( α11 a21^H ; a21 A22 )  and  L = ( λ11 0 ; l21 L22 ).
Let
• λ11 = √α11 (which is well-defined by Lemma 5.4.4.1),
• l21 = a21/λ11,
• A22 − l21 l21^H = L22 L22^H (which exists as a consequence of the Inductive Hypothesis and Lemma 5.4.4.2).
YouTube: https://www.youtube.com/watch?v=C7LEuhS4H94
Recall from Section 4.2 that the solution x̂ ∈ C^n to the linear least-squares (LLS) problem
‖b − Ax̂‖2 = min_{x ∈ C^n} ‖b − Ax‖2    (5.4.2)
equals the solution to the normal equations
A^H A x̂ = A^H b,
where we write B = A^H A and y = A^H b.
Ponder This 5.4.5.1 Consider A œ Cm◊n with linearly independent columns. Recall that
A has a QR factorization, A = QR where Q has orthonormal columns and R is an upper
triangular matrix with positive diagonal elements. How are the Cholesky factorization of
AH A and the QR factorization of A related?
which updates the vector y, stored in memory starting at address y with increment incy between entries and of size n, with αx + y, where x is stored at address x with increment incx and α is stored in alpha. With these, the implementation becomes

for j := 0, . . . , n − 1
  αj,j := √αj,j
  αj+1:n−1,j := αj+1:n−1,j / αj,j
  for k := j + 1, . . . , n − 1
    αk:n−1,k := αk:n−1,k − αk,j αk:n−1,j
  endfor
endfor

do j=1, n
  A(j,j) = sqrt( A(j,j) )
  call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
  do k=j+1,n
    call daxpy( n-k+1, -A(k,j), A(k,j), 1, A(k,k), 1 )
  enddo
enddo

Here αj+1:n−1,j = ( αj+1,j ; ... ; αn−1,j ).
The entire update A22 := A22 − a21 a21^T can be cast in terms of a matrix-vector operation (level-2 BLAS call) to
call dsyr( uplo, n, alpha, x, incx, A, ldA )
which updates the matrix A, stored in memory starting at address A with leading dimension ldA and of size n by n, with αxx^T + A, where x is stored at address x with increment incx and α is stored in alpha. Since both A and αxx^T + A are symmetric, only the triangular part indicated by uplo is updated. This is captured by the below algorithm and implementation.

for j := 0, . . . , n − 1
  αj,j := √αj,j
  αj+1:n−1,j := αj+1:n−1,j / αj,j
  αj+1:n−1,j+1:n−1 := αj+1:n−1,j+1:n−1 − tril( αj+1:n−1,j αj+1:n−1,j^T )
endfor

do j=1, n
  A(j,j) = sqrt( A(j,j) )
  call dscal( n-j, 1.0d00 / A(j,j), A(j+1,j), 1 )
  call dsyr( 'Lower triangular', n-j, -1.0d00, A(j+1,j), 1, A(j+1,j+1), ldA )
enddo

Notice how the code that casts computation in terms of the BLAS uses a higher level of abstraction, through routines that implement the linear algebra operations that are encountered in the algorithms.
A = Chol-blocked-right-looking(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, A01, A02, A10, A11, A12, A20, A21, A22
    A11 := L11 = Chol( A11 )
    Solve L21 L11^H = A21, overwriting A21 with L21
    A22 := A22 − A21 A21^H   (syrk: update only the lower triangular part)
    Continue with the repartitioned blocks
  endwhile
Figure 5.4.6.1 Blocked Cholesky factorization Variant 3 (right-looking) algorithm. The operation "syrk" refers to "symmetric rank-k update", which performs a rank-k update (matrix-matrix multiplication with a small "k" size), updating only the lower triangular part of the matrix in this algorithm.
Finally, a blocked right-looking Cholesky factorization algorithm, which casts most computation in terms of a matrix-matrix multiplication operation referred to as a "symmetric rank-k update," is given in Figure 5.4.6.1. There, we use FLAME notation to present the algorithm. It translates into Fortran code that exploits the BLAS given below.

do j=1, n, nb
  jb = min( nb, n-j+1 )
  call Chol( jb, A( j, j ) )
  ...
enddo

• The call to dtrsm implements A21 := L21, where L21 L11^T = A21.
• The call to dsyrk implements A22 := −A21 A21^T + A22, updating only the lower triangular part of the matrix.
The bulk of the computation is now cast in terms of matrix-matrix operations, which can achieve high performance.
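A hedged MATLAB sketch of the blocked right-looking algorithm in Figure 5.4.6.1 is given below. The block size nb and the helper Chol_right_looking_sketch (the unblocked sketch given earlier) are illustrative choices, not part of the original notes.

function A = Chol_blocked_sketch( A, nb )
  % Blocked right-looking Cholesky; only the lower triangular part of A
  % is referenced and overwritten.
  n = size( A, 1 );
  for j = 1:nb:n
    b = min( nb, n-j+1 );                  % current block size
    J = j:j+b-1;  R = j+b:n;               % index ranges for A11 and A21/A22
    A( J, J ) = Chol_right_looking_sketch( A( J, J ) );     % A11 := L11
    A( R, J ) = A( R, J ) / tril( A( J, J ) )';             % A21 := A21 * L11^{-H} (trsm)
    A( R, R ) = A( R, R ) - tril( A( R, J ) * A( R, J )' ); % A22 := A22 - L21*L21^H (syrk)
  end
end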
5.5 Enrichments
5.5.1 Other LU factorization algorithms
There are actually five different (unblocked) algorithms for computing the LU factorization
that were discovered over the course of the centuries. Here we show how to systematically
derive all five. For details, we suggest Week 6 of our Massive Open Online Course titled
"LAFF-On Programming for Correctness" [28].
Remark 5.5.1.1 To put yourself in the right frame of mind, we highly recommend you
spend about an hour reading the paper
• [31] Devangi N. Parikh, Margaret E. Myers, Richard Vuduc, Robert A. van de Geijn,
A Simple Methodology for Computing Families of Algorithms, FLAME Working Note
#87, The University of Texas at Austin, Department of Computer Science, Technical
Report TR-18-06. arXiv:1808.07832.
A = LU-varX(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    ⋮   (updates to be filled in for each variant)
    Continue with the repartitioned blocks
  endwhile
Figure 5.5.1.2 LU factorization algorithm skeleton.
Finding the algorithms starts with the following observations.
• Our algorithms will overwrite the matrix A, and hence we introduce Â to denote the original contents of A. We will say that the precondition for the algorithm is that A = Â.
• All the algorithms will march through the matrices from top-left to bottom-right, giving us the code skeleton in Figure 5.5.1.2. Since the computed L and U overwrite A, throughout they are partitioned conformal to (in the same way as) A.
• Thus, before and after each iteration of the loop the matrices are viewed as quadrants:
A → ( ATL ATR ; ABL ABR ),  L → ( LTL 0 ; LBL LBR ),  and  U → ( UTL UTR ; 0 UBR ).
• In terms of these exposed quadrants, in the end we wish for matrix A to contain
( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; LBL L\UBR )
∧ ( LTL 0 ; LBL LBR ) ( UTL UTR ; 0 UBR ) = ( ÂTL ÂTR ; ÂBL ÂBR ).
• Manipulating this yields what we call the Partitioned Matrix Expression (PME), which can be viewed as a recursive definition of the LU factorization:
( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; LBL L\UBR )
∧ LTL UTL = ÂTL,  LTL UTR = ÂTR,  LBL UTL = ÂBL,  LBR UBR = ÂBR − LBL UTR.
• Now, consider the code skeleton for the LU factorization in Figure 5.5.1.2. At the top of the loop (right after the while), we want to maintain certain contents in matrix A. Since we are in a loop, we haven't yet overwritten A with the final result. Instead, some progress toward this final result has been made. The way we can find what the state of A is that we would like to maintain is to take the PME and delete subexpressions. For example, consider the following condition on the contents of A:
( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; LBL ÂBR − LBL UTR )
∧ LTL UTL = ÂTL,  LTL UTR = ÂTR,  LBL UTL = ÂBL.
What we are saying is that ATL, ATR, and ABL have been completely updated with the corresponding parts of L and U, and ABR has been partially updated. This is exactly the state that the right-looking algorithm that we discussed in Subsection 5.2.2 maintains! What is left is to factor ABR, since it contains ÂBR − LBL UTR, and ÂBR − LBL UTR = LBR UBR.
• By carefully analyzing the order in which computation must occur (in compiler lingo: by performing a dependence analysis), we can identify five states that can be maintained at the top of the loop, by deleting subexpressions from the PME. These are called loop invariants. There are five for LU factorization:
Invariant 1: ( ATL ATR ; ABL ABR ) = ( L\UTL ÂTR ; ÂBL ÂBR )
Invariant 2: ( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; ÂBL ÂBR )
Invariant 3: ( ATL ATR ; ABL ABR ) = ( L\UTL ÂTR ; LBL ÂBR )
Invariant 4: ( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; LBL ÂBR )
Invariant 5: ( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; LBL ÂBR − LBL UTR )
• Key to figuring out what updates must occur in the loop for each of the variants is to look at how the matrices are repartitioned at the top and bottom of the loop body.
For each of the five algorithms for LU factorization, we will derive the loop invariant, and then derive the algorithm from the loop invariant.
Consider first Invariant 1,
( ATL ATR ; ABL ABR ) = ( L\UTL ÂTR ; ÂBL ÂBR ) ∧ LTL UTL = ÂTL,
meaning that the leading principal submatrix ATL has been overwritten with its LU factorization, and the remainder of the matrix has not yet been touched.
At the top of the loop, after repartitioning, A then contains
( A00 a01 A02 ; a10^T α11 a12^T ; A20 a21 A22 ) = ( L\U00 â01 Â02 ; â10^T α̂11 â12^T ; Â20 â21 Â22 ) ∧ L00 U00 = Â00,
and at the bottom of the loop the corresponding state (with the thick lines moved) must hold for the loop invariant to again hold after the iteration. Here the entries in red are known (in addition to the ones marked with a "hat") and the entries in blue are to be computed. With this, we can compute the desired parts of L and U:
• Solve L00 u01 = a01, overwriting a01 with the result. (Notice that a01 = â01 before this update.)
• Solve l10^T U00 = a10^T (or, equivalently, U00^T (l10^T)^T = (a10^T)^T for (l10^T)^T), overwriting a10^T with the result. (Notice that a10^T = â10^T before this update.)
• Update α11 := υ11 = α11 − l10^T u01.
Figure 5.5.1.3 Variant 1 (bordered) LU factorization algorithm. Here A00 stores L\U00.
The approximate costs of these updates in the kth iteration are
• Solve L00 u01 = a01: approximately k² flops.
• Solve l10^T U00 = a10^T (or, equivalently, U00^T (l10^T)^T = (a10^T)^T): approximately k² flops.
• Update α11 := α11 − l10^T u01: approximately 2k flops.
Thus the total cost is approximately
Σ_{k=0}^{n−1} ( k² + k² + 2k ) ≈ 2 Σ_{k=0}^{n−1} k² ≈ 2 n³/3 = (2/3) n³ flops.
Next, consider Invariant 2,
( ATL ATR ; ABL ABR ) = ( L\UTL UTR ; ÂBL ÂBR ) ∧ LTL UTL = ÂTL, LTL UTR = ÂTR,
meaning that the leading principal submatrix ATL has been overwritten with its LU factorization and UTR has overwritten ATR.
At the top of the loop, after repartitioning, A then contains
( A00 a01 A02 ; a10^T α11 a12^T ; A20 a21 A22 ) = ( L\U00 u01 U02 ; â10^T α̂11 â12^T ; Â20 â21 Â22 )
∧ L00 U00 = Â00,  L00 ( u01 U02 ) = ( â01 Â02 ),
where the last condition can be split as L00 u01 = â01 and L00 U02 = Â02. At the bottom of the loop, the corresponding state (with the thick lines moved) must hold for the loop invariant to again hold after the iteration. Here, again, the entries in red are known (in addition to the ones marked with a "hat") and the entries in blue are to be computed. With this, we can compute the desired parts of L and U:
• Solve l10^T U00 = a10^T, overwriting a10^T with the result.
A = LU-var2(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    Solve l10^T U00 = a10^T, overwriting a10^T with the result
    α11 := υ11 = α11 − a10^T a01
    a12^T := u12^T = a12^T − l10^T U02
    Continue with the repartitioned blocks
  endwhile
Figure 5.5.1.4 Variant 2 (up-looking) LU factorization algorithm. Here A00 stores L\U00.
for the loop invariant to again hold after the iteration. With this, we can compute the desired parts of L and U:
• Update a21 := l21 = ( a21 − L20 u01 )/υ11 = ( a21 − A20 a01 )/α11.
A = LU-var3(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    Solve L00 u01 = a01, overwriting a01 with the result
    α11 := υ11 = α11 − a10^T a01
    a21 := l21 = ( a21 − A20 a01 )/α11
    Continue with the repartitioned blocks
  endwhile
Figure 5.5.1.5 Variant 3 (left-looking) LU factorization algorithm. Here A00 stores L\U00.
• Update a21 := l21 = ( a21 − L20 u01 )/υ11 = ( a21 − A20 a01 )/α11. Approximate cost: 2(n − k − 1) flops.
for the loop invariant to again hold after the iteration. With this, we can compute the desired parts of L and U:
• Update a21 := l21 = ( a21 − L20 u01 )/υ11 = ( a21 − A20 a01 )/α11.
A = LU-var4(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    α11 := υ11 = α11 − a10^T a01
    a12^T := u12^T = a12^T − a10^T A02
    a21 := l21 = ( a21 − A20 a01 )/α11
    Continue with the repartitioned blocks
  endwhile
• Update a21 := l21 = ( a21 − L20 u01 )/υ11 = ( a21 − A20 a01 )/α11. Approximate cost: 2k(n − k − 1) + (n − k − 1) flops.
Thus, ignoring the 2k flops for the dot product and the n − k − 1 flops for multiplying with 1/α11 in each iteration, the total cost is approximately given by
Σ_{k=0}^{n−1} 4k(n − k − 1)
≈ ⟨ remove lower order term ⟩ Σ_{k=0}^{n−1} 4k(n − k)
= ⟨ algebra ⟩ 4n Σ_{k=0}^{n−1} k − 4 Σ_{k=0}^{n−1} k²
≈ ⟨ Σ_{k=0}^{n−1} k ≈ n²/2; Σ_{k=0}^{n−1} k² ≈ n³/3 ⟩ 2n³ − (4/3) n³ = (2/3) n³.
for the loop invariant to again hold after the iteration. With this, we can compute the desired parts of L and U:
• α11 := υ11 = α̂11 − l10^T u01 = α11 (no-op). (α11 already equals α̂11 − l10^T u01.)
• Update A22 := Â22 − L20 U02 − l21 u12^T = A22 − a21 a12^T. (A22 already equals Â22 − L20 U02.)
The approximate cost in the kth iteration:
• Update A22 := A22 − l21 u12^T = A22 − a21 a12^T. Approximate cost: 2(n − k − 1)(n − k − 1) flops.
Thus, ignoring the n − k − 1 flops for multiplying with 1/α11 in each iteration, the total cost is approximately given by
Σ_{k=0}^{n−1} 2(n − k − 1)²
= ⟨ change of variable j = n − k − 1 ⟩ 2 Σ_{j=0}^{n−1} j²
≈ ⟨ Σ_{j=0}^{n−1} j² ≈ n³/3 ⟩ (2/3) n³.
5.5.1.6 Discussion
Remark 5.5.1.8 For a discussion of the different LU factorization algorithms that also gives
a historic perspective, we recommend "Matrix Algorithms Volume 1" by G.W. Stewart [37].
• Compute the LU factorization of A11 (e.g., using any of the "unblocked" algorithms from Subsection 5.5.1),
A11 = L11 U11,
overwriting A11 with the factors.
• Solve
L11 U12 = A12
for U12, overwriting A12 with the result. This is known as a "triangular solve with multiple right-hand sides." The name comes from the fact that solving
LX = B,
where X and B have multiple columns, exposes that for each pair of corresponding columns we must solve the unit lower triangular system Lxj = bj.
• Solve
L21 U11 = A21
for L21, overwriting A21 with the result. This is also a "triangular solve with multiple right-hand sides" since we can instead view it as solving the lower triangular system with multiple right-hand sides
U11^T L21^T = A21^T.
(In practice, the matrices are not transposed.)
• Update
A22 := A22 − L21 U12.
These steps are sketched in MATLAB below.
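A minimal MATLAB sketch of the blocked (no pivoting) procedure just described follows; the routine name, the block size nb, and the use of an inlined unblocked sweep for A11 are illustrative assumptions, not part of the original notes.

function A = LU_blocked_sketch( A, nb )
  % Blocked right-looking LU factorization without pivoting (a sketch).
  % On return A contains the factors in the L\U format: the unit lower
  % triangular L below the diagonal and U on and above the diagonal.
  n = size( A, 1 );
  for j = 1:nb:n
    b = min( nb, n-j+1 );
    J = j:j+b-1;  R = j+b:n;
    for k = j:j+b-1                              % unblocked LU of A11
      A( k+1:j+b-1, k ) = A( k+1:j+b-1, k ) / A( k, k );
      A( k+1:j+b-1, k+1:j+b-1 ) = A( k+1:j+b-1, k+1:j+b-1 ) ...
          - A( k+1:j+b-1, k ) * A( k, k+1:j+b-1 );
    end
    L11 = tril( A( J, J ), -1 ) + eye( b );
    A( J, R ) = L11 \ A( J, R );                    % U12 := L11^{-1} A12
    A( R, J ) = A( R, J ) / triu( A( J, J ) );      % L21 := A21 U11^{-1}
    A( R, R ) = A( R, R ) - A( R, J ) * A( J, R );  % A22 := A22 - L21 U12
  end
end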
5.6 Wrap Up
5.6.1 Additional homework
In this chapter, we discussed how the LU factorization (with pivoting) can be used to solve Ax = y. Why don't we instead discuss how to compute the inverse of the matrix A and compute x = A^{-1} y? Through a sequence of exercises, we illustrate why one should (almost) never compute the inverse of a matrix.
Homework 5.6.1.1 Let A ∈ C^{m×m} be nonsingular and B its inverse. We know that AB = I and hence
A ( b0 ··· bm−1 ) = ( e0 ··· em−1 ),
where ej can be thought of as the standard basis vector indexed with j or the column of I indexed with j.
1. Justify the following algorithm for computing B:
for j = 0, . . . , m − 1
  Compute the LU factorization with pivoting: P(p)A = LU
  Solve Lz = P(p)ej
  Solve U bj = z
endfor
3. How can we reduce the cost in the most obvious way and what is the cost of this better
algorithm?
4. If we want to solve Ax = y we can now instead compute x = By. What is the cost of
this multiplication and how does this cost compare with the cost of computing it via
the LU factorization, once the LU factorization has already been computed:
Solve Lz = P (p)y
Solve U x = z
1. Show that, for a unit lower triangular matrix partitioned as L = ( 1 0 ; l21 L22 ),
L^{-1} = ( 1 0 ; −L22^{-1} l21  L22^{-1} ).
2. Use the insight from the last part to complete the following algorithm for computing the inverse of a unit lower triangular matrix:
[L] = Linv(L)
  Partition L → ( LTL LTR ; LBL LBR ), where LTL is 0 × 0
  while n(LTL) < n(L)
    Repartition to expose L00, l01, L02, l10^T, λ11, l12^T, L20, l21, L22
    l21 :=
    Continue with the repartitioned blocks
  endwhile
3. The correct algorithm in the last part will avoid inverting matrices and will require, approximately, (1/3) m³ flops. Analyze the cost of your algorithm.
Homework 5.6.1.3 LINPACK, the first software package for computing various operations related to solving (dense) linear systems, includes routines for inverting a matrix. When a survey was conducted to see what routines were most frequently used in practice, to the dismay of the developers, it was discovered that the routine for inverting matrices was among them. To solve Ax = y, users were inverting A and then computing x = A^{-1} y. For this reason, the successor to LINPACK, LAPACK, does not even include a routine for inverting a matrix. Instead, if a user wants to compute the inverse, the user must go through the steps
Compute the LU factorization with pivoting: P(p)A = LU
Invert L, overwriting L with the result
Solve U X = L for X
Compute A^{-1} := X P(p) (permuting the columns of X)
3. Analyze the cost of the algorithm in the last part of this question. If you did it right, it should require, approximately, m³ operations.
5.6.2 Summary
The process known as Gaussian elimination is equivalent to computing the LU factorization of the matrix A ∈ C^{m×m}:
A = LU,
where L is a unit lower triangular matrix and U is an upper triangular matrix.
Definition 5.6.2.1 Given a matrix A ∈ C^{m×n} with m ≥ n, its LU factorization is given by A = LU where L ∈ C^{m×n} is unit lower trapezoidal and U ∈ C^{n×n} is upper triangular with nonzeroes on its diagonal. ⌃
Definition 5.6.2.2 Principal leading submatrix. For k ≤ n, the k × k principal leading submatrix of a matrix A is defined to be the square matrix ATL ∈ C^{k×k} such that A = ( ATL ATR ; ABL ABR ). ⌃
Lemma 5.6.2.3 Let L ∈ C^{n×n} be a unit lower triangular matrix and U ∈ C^{n×n} be an upper triangular matrix. Then A = LU is nonsingular if and only if U has no zeroes on its diagonal.
Theorem 5.6.2.4 Existence of the LU factorization. Let A ∈ C^{m×n} with m ≥ n have linearly independent columns. Then A has a (unique) LU factorization if and only if all its principal leading submatrices are nonsingular.
A = LU-right-looking(A)
  Partition A → ( ATL ATR ; ABL ABR ), where ATL is 0 × 0
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22
    a21 := a21/α11
    A22 := A22 − a21 a12^T
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.5 Right-looking LU factorization algorithm.
A matrix of the form
Lk = ( Ik 0 0 ; 0 1 0 ; 0 l21 I ),
where Ik is the k × k identity matrix and I is an identity matrix "of appropriate size," is called a Gauss transform. ⌃
Its inverse is given by
Lk^{-1} = ( Ik 0 0 ; 0 1 0 ; 0 l21 I )^{-1} = ( Ik 0 0 ; 0 1 0 ; 0 −l21 I ).
Definition 5.6.2.8 Given
p = ( π0 ; ... ; πn−1 ),
where {π0, π1, . . . , πn−1} is a permutation (rearrangement) of the integers {0, 1, . . . , n − 1}, we define the permutation matrix P(p) by
P(p) = ( e_{π0}^T ; ... ; e_{πn−1}^T ). ⌃
If P is a permutation matrix then P^{-1} = P^T.
[A, p] = LUpiv-right-looking(A)
  Partition A → ( ATL ATR ; ABL ABR ), p → ( pT ; pB ), where ATL is 0 × 0 and pT has 0 elements
  while n(ATL) < n(A)
    Repartition to expose A00, a01, A02, a10^T, α11, a12^T, A20, a21, A22 and p0, π1, p2
    π1 := maxi( α11 ; a21 )
    ( A00 a01 A02 ; a10^T α11 a12^T ; A20 a21 A22 ) := ( I 0 ; 0 P(π1) ) ( A00 a01 A02 ; a10^T α11 a12^T ; A20 a21 A22 )
    a21 := a21/α11
    A22 := A22 − a21 a12^T
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.10 Right-looking LU factorization algorithm with partial pivoting.
Solving Ax = b via LU factorization with row pivoting (a MATLAB sketch of these steps follows the list):
• Compute the LU factorization with pivoting: P A = LU.
• Apply the row exchanges to the right-hand side: y = P b.
• Solve Lz = y.
• Solve U x = z.
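In MATLAB, these steps can be performed with the built-in lu function; a minimal sketch (with p returned as a permutation vector) is

% Solve A x = b via LU factorization with partial pivoting.
[ L, U, p ] = lu( A, 'vector' );   % P A = L U; p encodes the row exchanges
y = b( p );                        % apply the row exchanges to b
z = L \ y;                         % solve L z = y (unit lower triangular)
x = U \ z;                         % solve U x = z (upper triangular)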
Solve Lz = y, overwriting y with z (Variant 1)
  Partition L → ( LTL LTR ; LBL LBR ), y → ( yT ; yB ), where LTL is 0 × 0 and yT has 0 elements
  while n(LTL) < n(L)
    Repartition to expose L00, l01, L02, l10^T, λ11, l12^T, L20, l21, L22 and y0, ψ1, y2
    y2 := y2 − ψ1 l21
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.11 Lower triangular solve (with unit lower triangular matrix), Variant 1
Solve Lz = y, overwriting y with z (Variant 2)
  Partition L → ( LTL LTR ; LBL LBR ), y → ( yT ; yB ), where LTL is 0 × 0 and yT has 0 elements
  while n(LTL) < n(L)
    Repartition to expose L00, l01, L02, l10^T, λ11, l12^T, L20, l21, L22 and y0, ψ1, y2
    ψ1 := ψ1 − l10^T y0
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.12 Lower triangular solve (with unit lower triangular matrix), Variant 2
Solve Ux = z, overwriting z with x (Variant 1)
  Partition U → ( UTL UTR ; UBL UBR ), z → ( zT ; zB ), where UBR is 0 × 0 and zB has 0 elements
  while n(UBR) < n(U)
    Repartition to expose U00, u01, U02, u10^T, υ11, u12^T, U20, u21, U22 and z0, ζ1, z2
    ζ1 := ζ1/υ11
    z0 := z0 − ζ1 u01
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.13 Upper triangular solve Variant 1
Solve Ux = z, overwriting z with x (Variant 2)
  Partition U → ( UTL UTR ; UBL UBR ), z → ( zT ; zB ), where UBR is 0 × 0 and zB has 0 elements
  while n(UBR) < n(U)
    Repartition to expose U00, u01, U02, u10^T, υ11, u12^T, U20, u21, U22 and z0, ζ1, z2
    ζ1 := ζ1 − u12^T z2
    ζ1 := ζ1/υ11
    Continue with the repartitioned blocks
  endwhile
Figure 5.6.2.14 Upper triangular solve Variant 2
Cost of triangular solve: Starting with an n × n (upper or lower) triangular matrix T, solving T x = b requires approximately n² flops.
Provided the solution of Ax = b yields some accuracy in the solution, that accuracy can be improved through a process known as iterative refinement:
• Let x̂ be an approximate solution to Ax = b.
• Let δ̂x be an approximate solution to A δx = b − Ax̂.
• Then x̂ + δ̂x is an improved approximation.
• This process can be repeated until the accuracy in the computed solution is as good as warranted by the conditioning of A and the accuracy in b.
A MATLAB sketch of this process is given below.
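The following minimal MATLAB sketch reuses the already computed LU factors for each correction; the number of sweeps (here 3) is an arbitrary choice, not part of the original notes.

% Iterative refinement, reusing the factors P A = L U.
[ L, U, p ] = lu( A, 'vector' );
x = U \ ( L \ b( p ) );                 % initial solve
for sweep = 1:3
    r  = b - A * x;                     % residual b - A xhat
    dx = U \ ( L \ r( p ) );            % solve A dx = r with the same factors
    x  = x + dx;                        % improved approximation
end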
Definition 5.6.2.15 Hermitian positive definite matrix. A matrix A ∈ C^{n×n} is Hermitian positive definite (HPD) if and only if it is Hermitian (A^H = A) and for all nonzero vectors x ∈ C^n it is the case that x^H Ax > 0. If in addition A ∈ R^{n×n} then A is said to be symmetric positive definite (SPD). ⌃
Some insights regarding HPD matrices:
• B has linearly independent columns if and only if A = B^H B is HPD.
• A diagonal matrix has only positive values on its diagonal if and only if it is HPD.
• If A is HPD, then its diagonal elements are all real-valued and positive.
• If A = ( ATL ATR ; ABL ABR ), where ATL is square, is HPD, then ATL and ABR are HPD.
Theorem 5.6.2.16 Cholesky Factorization Theorem. Given an HPD matrix A there exists a lower triangular matrix L such that A = LL^H. If the diagonal elements of L are restricted to be positive, L is unique.
The solution x̂ to the linear least-squares problem min_x ‖b − Ax‖2, where A has linearly independent columns, equals the solution to the normal equations
A^H A x̂ = A^H b,
with B = A^H A and y = A^H b.
Week 6
Numerical Stability
• If
‖b − Ax̂‖ / ‖b‖
is small, then we can conclude that x̂ solves a nearby problem, provided we trust whatever routine computes Ax̂. After all, it solves
Ax̂ = b̂,
where
‖b − b̂‖ / ‖b‖
is small.
So, ‖b − Ax̂‖/‖b‖ being small is a necessary condition, but not a sufficient condition. If ‖b − Ax̂‖/‖b‖ is small, then x̂ is as good an answer as the problem warrants, since a small error in the right-hand side is to be expected either because data inherently has error in it or because in storing the right-hand side the input was inherently rounded.
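In MATLAB, this check is a one-liner (a sketch; xhat denotes the computed solution):

relative_residual = norm( b - A * xhat ) / norm( b );  % small => xhat solves a nearby problem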
In the presence of roundoff error, it is hard to determine whether an implementation is correct. Let's examine a few scenarios.
Homework 6.1.1.2 You use some linear system solver and it gives the wrong answer. In other words, you solve Ax = b on a computer, computing x̂, and somehow you determine that
‖x − x̂‖
is large. Which of the following is a possible cause (identify all):
• There is a bug in the code. In other words, the algorithm that is used is sound (gives the right answer in exact arithmetic) but its implementation has an error in it.
• The linear system is ill-conditioned. A small relative error in the right-hand side can amplify into a large relative error in the solution.
• All is well: ‖x̂ − x‖ is large but the relative error ‖x̂ − x‖/‖x‖ is small.
Solution. All are possible causes. This week, we will delve into this.
6.1.2 Overview
• 6.1 Opening Remarks
• 6.5 Enrichments
• 6.6 Wrap Up
• Employ strategies for avoiding unnecessary overflow and underflow that can occur in
intermediate computations.
• Compute the machine epsilon (also called the unit roundoff) for a given floating point
representation.
• Quantify errors in storing real numbers as floating point numbers and bound the in-
curred relative error in terms of the machine epsilon.
• Analyze error incurred in floating point computation using the Standard Computation
Model (SCM) and the Alternative Computation Model (ACM) to determine their
forward and backward results.
• Argue how backward error can affect the relative error in the solution of a linear system.
YouTube: https://www.youtube.com/watch?v=sWcdwmCdVOU
Only a finite number of (binary) digits can be used to store a real number in the memory of a computer. For so-called single-precision and double-precision floating point numbers, 32 bits and 64 bits are typically employed, respectively.
Recall that any real number can be written as μ × β^e, where β is the base (an integer greater than one), μ ∈ [−1, 1] is the mantissa, and e is the exponent (an integer). For our discussion, we will define the set of floating point numbers, F, as the set of all numbers χ = μ × β^e such that
• β = 2,
• μ = ±.δ0δ1···δt−1, so that the mantissa has only t (binary) digits δj ∈ {0, 1},
• δ0 = 1 if μ ≠ 0 (the mantissa is normalized), and
• e is an integer in a finite range.
For example,
μ × β^e = .101 × 2¹ = ( 1 × 2^{-1} + 0 × 2^{-2} + 1 × 2^{-3} ) × 2¹ = ( 1/2 + 0/4 + 1/8 ) × 2 = 1.25. ⇤
Observe that
• There is a largest number (in absolute value) that can be stored. Any number with
larger magnitude "overflows". Typically, this causes a value that denotes a NaN (Not-
a-Number) to be stored.
• There is a smallest number (in absolute value) that can be stored. Any number that
is smaller in magnitude "underflows". Typically, this causes a zero to be stored.
In practice, one needs to be careful to consider overflow and underflow. The following
example illustrates the importance of paying attention to this.
Example 6.2.1.2 Computing the (Euclidean) length of a vector is an operation we will frequently employ. Careful attention must be paid to overflow and underflow when computing it.
Given x ∈ R^n, consider computing
‖x‖2 = sqrt( Σ_{i=0}^{n−1} χi² ).    (6.2.1)
Notice that
‖x‖2 ≤ √n max_{i=0}^{n−1} |χi|
and hence, unless some χi is close to overflowing, the result will not overflow. The problem is that if some element χi has the property that χi² overflows, intermediate results in the computation in (6.2.1) will overflow. The solution is to determine k such that
|χk| = max_{i=0}^{n−1} |χi|
and to instead compute
‖x‖2 = |χk| sqrt( Σ_{i=0}^{n−1} ( χi/χk )² ),
so that no intermediate square can overflow.
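A minimal MATLAB sketch of this scaling strategy follows (MATLAB's built-in norm already guards against this; the point here is the technique):

function nrm = robust_norm2( x )
  % Compute || x ||_2 while avoiding overflow/underflow of intermediate
  % squares by first scaling by the largest entry in magnitude.
  chi_max = max( abs( x ) );
  if chi_max == 0
    nrm = 0;
  else
    nrm = chi_max * sqrt( sum( ( x / chi_max ).^2 ) );
  end
end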
YouTube: https://www.youtube.com/watch?v=G2jawQW5WPc
Remark 6.2.2.1 We consider the case where a real number is truncated to become the stored floating point number. This makes the discussion a bit simpler.
Let positive χ be represented by
χ = .δ0δ1··· × 2^e,
where the δi are binary digits and δ0 = 1 (the mantissa is normalized). If t binary digits are stored by our floating point system, then
χ̌ = .δ0δ1···δt−1 × 2^e
is stored (if truncation is employed). If we let δχ = χ − χ̌, then
δχ = .δ0δ1···δt−1δt··· × 2^e − .δ0δ1···δt−1 × 2^e = .0···00δt··· × 2^e < .0···01 × 2^e = 2^{−t} 2^e,
where the displayed strings have t leading digits after the binary point. Since χ is positive and δ0 = 1,
χ = .δ0δ1··· × 2^e ≥ (1/2) × 2^e.
Thus,
δχ/χ ≤ 2^{−t} 2^e / ( (1/2) 2^e ) = 2^{−(t−1)},
which can also be written as
δχ ≤ 2^{−(t−1)} χ.
A careful analysis of what happens when χ equals zero or is negative yields
|δχ| ≤ 2^{−(t−1)} |χ|.
1.3333···
= 1 + 0/2 + 1/4 + 0/8 + 1/16 + ···
= ⟨ convert to binary representation ⟩ 1.0101··· × 2⁰
= ⟨ normalize ⟩ .10101··· × 2¹,
which, with t = 4 and truncation, is stored as .1010 × 2¹, and
.1010 × 2¹ = ( 1/2 + 0/4 + 1/8 + 0/16 ) × 2¹ = 0.625 × 2 = ⟨ convert to decimal ⟩ 1.25.
We can abstract away from the details of the base that is chosen and whether rounding or truncation is used by stating that storing χ as the floating point number χ̌ obeys
χ̌ = χ(1 + ε),  |ε| ≤ εmach,
where εmach is known as the machine epsilon or unit roundoff. When single precision floating point numbers are used εmach ≈ 10^{-8}, yielding roughly eight decimal digits of accuracy in the stored value. When double precision floating point numbers are used εmach ≈ 10^{-16}, yielding roughly sixteen decimal digits of accuracy in the stored value.
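A common MATLAB sketch for estimating the machine epsilon of double precision arithmetic (compare the result with the built-in constant eps):

% Find the smallest power of two that still changes 1 when added to it.
eps_mach = 1;
while 1 + eps_mach / 2 > 1
    eps_mach = eps_mach / 2;
end
% eps_mach is now approximately 2.2e-16, which matches eps in MATLAB.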
Example 6.2.2.3 The number 4/3 = 1.3333··· can be written as
1.3333···
= 1 + 0/2 + 1/4 + 0/8 + 1/16 + ···
= ⟨ convert to binary representation ⟩ 1.0101··· × 2⁰
= ⟨ normalize ⟩ .10101··· × 2¹,
which, with t = 4 and rounding, is stored as .1011 × 2¹, and
.1011 × 2¹ = ( 1/2 + 0/4 + 1/8 + 1/16 ) × 2¹ = 0.6875 × 2 = ⟨ convert to decimal ⟩ 1.375.
Solution.
Answer:
.10···0 × 2¹   (t digits).
6.2.3.1 Notation
In our discussions, we will distinguish between exact and computed quantities. The function fl(expression) returns the result of the evaluation of expression, where every operation is executed in floating point arithmetic. For example, given χ, ψ, ζ, ω ∈ F and assuming that the expressions are evaluated from left to right and order of operations is obeyed,
fl( χ + ψ + ζ/ω )
is equivalent to
fl( fl( χ + ψ ) + fl( ζ/ω ) ).
Equality between the quantities lhs and rhs is denoted by lhs = rhs. Assignment of rhs to lhs is denoted by lhs := rhs (lhs becomes rhs). In the context of a program, the statements lhs := rhs and lhs := fl(rhs) are equivalent. Given an assignment
κ := expression,
we use the notation κ̌ (pronounced "check kappa") to denote the quantity resulting from fl(expression), which is actually stored in the variable κ:
κ̌ = fl(expression).
Remark 6.2.3.1 In future discussion, we will use the notation [·] as shorthand for fl(·).
YouTube: https://www.youtube.com/watch?v=RIsLyjFbonU
The Standard Computational Model (SCM) assumes that, for any two floating point numbers χ and ψ, the basic arithmetic operations satisfy the equality
fl( χ op ψ ) = ( χ op ψ )(1 + ε),  |ε| ≤ εmach,  op ∈ {+, −, ∗, /}.
For example, consider
κ = 4/3,
where we notice that both 4 and 3 can be exactly represented in our floating point system with β = 2 and t = 4. Recall that the real number 4/3 = 1.3333··· is stored as .1010 × 2¹ if t = 4 and truncation is employed. This equals 1.25 in decimal representation. The relative error was 0.0625. Now
κ̌ = fl(4/3) = 1.25 = 1.333··· + (−0.0833···) = 1.333··· × ( 1 + (−0.0833···/1.333···) ) = 4/3 × ( 1 + (−0.0625) ) = κ(1 + ε/),
where
|ε/| = 0.0625 ≤ 0.125 = εmach = 2^{−(t−1)}.
YouTube: https://www.youtube.com/watch?v=6jBxznXcivg
For certain problems it is convenient to use the Alternative Computational Model (ACM) [21], which also assumes for the basic arithmetic operations that
fl( χ op ψ ) = ( χ op ψ )/(1 + ε),  |ε| ≤ εmach,  op ∈ {+, −, ∗, /}.
As for the standard computational model, the quantity ε is a function of χ, ψ and op. Note that the ε's produced using the standard and alternative models are generally not equal.
WEEK 6. NUMERICAL STABILITY 342
The condition
f(x̌) = f̌(x)
means that the error in the computation can be attributed to a small change in the input. If this is true, then f̌ is said to be a (numerically) stable implementation of f for input x.
The following defines a property that captures correctness in the presence of the kinds of errors that are introduced by computer arithmetic:
Definition 6.2.4.2 Backward stable implementation. Given the mapping f : D → R, where D ⊂ R^n is the domain and R ⊂ R^m is the range (codomain), let f̌ : D → R be a computer implementation of this function. We will call f̌ a backward stable (also called "numerically stable") implementation of f on domain D if for all x ∈ D there exists a x̌ "close" to x such that f̌(x) = f(x̌). ⌃
In other words, f̌ is a stable implementation if the error that is introduced is similar to that introduced when f is evaluated with a slightly changed input. This is illustrated in Figure 6.2.4.1 for a specific input x. If an implementation is not stable, it is numerically unstable.
The algorithm is said to be forward stable on domain D if for all x ∈ D it is the case that f̌(x) ≈ f(x). In other words, the computed result equals a slight perturbation of the exact result.
Example 6.2.4.3 Under the SCM from the last unit, floating point addition, κ := χ + ψ, is a backward stable operation.
Solution.
κ̌
= ⟨ computed value for κ ⟩ [ χ + ψ ]
= ⟨ SCM ⟩ ( χ + ψ )(1 + ε+)
= ⟨ distribute ⟩ χ(1 + ε+) + ψ(1 + ε+)
= ( χ + δχ ) + ( ψ + δψ ),
where
• |ε+| ≤ εmach,
• δχ = χ ε+,
• δψ = ψ ε+.
• ALWAYS/SOMETIMES/NEVER: Under the SCM from the last unit, floating point multiplication, κ := χ × ψ, is a backward stable operation.
• ALWAYS/SOMETIMES/NEVER: Under the SCM from the last unit, floating point division, κ := χ/ψ, is a backward stable operation.
Answer.
• ALWAYS: Under the SCM from the last unit, floating point subtraction, κ := χ − ψ, is a backward stable operation.
• ALWAYS: Under the SCM from the last unit, floating point multiplication, κ := χ × ψ, is a backward stable operation.
• ALWAYS: Under the SCM from the last unit, floating point division, κ := χ/ψ, is a backward stable operation.
Solution.
• ALWAYS: Under the SCM from the last unit, floating point subtraction, κ := χ − ψ, is a backward stable operation.
κ̌
= ⟨ computed value for κ ⟩ [ χ − ψ ]
= ⟨ SCM ⟩ ( χ − ψ )(1 + ε−)
= ⟨ distribute ⟩ χ(1 + ε−) − ψ(1 + ε−)
= ( χ + δχ ) − ( ψ + δψ ),
where |ε−| ≤ εmach, δχ = χ ε−, δψ = ψ ε−.
• ALWAYS: Under the SCM from the last unit, floating point multiplication, κ := χ × ψ, is a backward stable operation.
κ̌
= ⟨ computed value for κ ⟩ [ χ × ψ ]
= ⟨ SCM ⟩ ( χ × ψ )(1 + ε×)
= ⟨ associative property ⟩ χ × ψ(1 + ε×)
= χ( ψ + δψ ),
where |ε×| ≤ εmach, δψ = ψ ε×.
• ALWAYS: Under the SCM from the last unit, floating point division, κ := χ/ψ, is a backward stable operation.
κ̌
= ⟨ computed value for κ ⟩ [ χ/ψ ]
= ⟨ SCM ⟩ ( χ/ψ )(1 + ε/)
= ⟨ commutative property ⟩ χ(1 + ε/)/ψ
= ( χ + δχ )/ψ,
where |ε/| ≤ εmach, δχ = χ ε/.
YouTube: https://www.youtube.com/watch?v=e29Yk4XCyLs
It is important to keep conditioning versus stability straight:
• Conditioning is a property of the problem you are trying to solve. A problem is well-conditioned if a small change in the input is guaranteed to only result in a small change in the output. A problem is ill-conditioned if a small change in the input can result in a large change in the output.
In other words, in the presence of roundoff error, computing a wrong answer may be due to the problem (if it is ill-conditioned), the implementation (if it is numerically unstable), or a programming bug (if the implementation is sloppy). Obviously, it can be due to some combination of these.
Yet another way to look at this: A numerically stable implementation will yield an answer that is as accurate as the conditioning of the problem warrants.
Definition 6.2.6.2 Let ◁ ∈ {<, ≤, =, ≥, >} and x, y ∈ R^n. Then |x| ◁ |y| if and only if |χi| ◁ |ψi| for all i = 0, . . . , n − 1; the analogous elementwise comparison is used for matrices. ⌃
If |A| ≤ |B| then
‖A‖F² = Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} |αi,j|² ≤ Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} |βi,j|² = ‖B‖F².
Also,
‖A‖1
= ⟨ alternate way of computing the 1-norm ⟩ max_{0≤j<n} ‖aj‖1
= ⟨ expose individual entries of aj ⟩ max_{0≤j<n} ( Σ_{i=0}^{m−1} |αi,j| )
= ⟨ choose k to be the index that maximizes ⟩ Σ_{i=0}^{m−1} |αi,k|
≤ Σ_{i=0}^{m−1} |βi,k| ≤ max_{0≤j<n} ( Σ_{i=0}^{m−1} |βi,j| ) = ‖B‖1.
Hence
‖A‖∞ = ‖A^T‖1 ≤ ‖B^T‖1 = ‖B‖∞.
YouTube: https://www.youtube.com/watch?v=OHqdJ3hjHFY
Before giving a general result, let us focus on the case where the vectors x and y have only a few elements.
Example 6.3.1.1 Let x, y ∈ R² and consider computing κ := χ0ψ0 + χ1ψ1.
Solution.
κ̌
= ⟨ κ̌ = [x^T y] ⟩ [ ( χ0 ; χ1 )^T ( ψ0 ; ψ1 ) ]
= ⟨ definition of x^T y ⟩ [ χ0ψ0 + χ1ψ1 ]
= ⟨ each suboperation is performed in floating point arithmetic ⟩ [ [χ0ψ0] + [χ1ψ1] ]
= ⟨ apply SCM multiple times ⟩ [ χ0ψ0(1 + ε*^(0)) + χ1ψ1(1 + ε*^(1)) ]
= ⟨ apply SCM ⟩ ( χ0ψ0(1 + ε*^(0)) + χ1ψ1(1 + ε*^(1)) )(1 + ε+^(1))
= ⟨ distribute ⟩ χ0ψ0(1 + ε*^(0))(1 + ε+^(1)) + χ1ψ1(1 + ε*^(1))(1 + ε+^(1))
= ⟨ commute ⟩ χ0(1 + ε*^(0))(1 + ε+^(1)) ψ0 + χ1(1 + ε*^(1))(1 + ε+^(1)) ψ1
= ⟨ (perhaps too) slick way of expressing the final result ⟩
( χ0 ; χ1 )^T ( (1 + ε*^(0))(1 + ε+^(1))  0 ; 0  (1 + ε*^(1))(1 + ε+^(1)) ) ( ψ0 ; ψ1 ),    (6.3.1)
where |ε*^(0)|, |ε*^(1)|, |ε+^(1)| ≤ εmach. ⇤
An important insight from this example is that the result in (6.3.1) can be manipulated to associate the accumulated error with vector x as in
κ̌ = ( χ0(1 + ε*^(0))(1 + ε+^(1)) ; χ1(1 + ε*^(1))(1 + ε+^(1)) )^T ( ψ0 ; ψ1 )
or with vector y
κ̌ = ( χ0 ; χ1 )^T ( ψ0(1 + ε*^(0))(1 + ε+^(1)) ; ψ1(1 + ε*^(1))(1 + ε+^(1)) ).
This will play a role when we later analyze algorithms that use the dot product.
Homework 6.3.1.1 Consider
x = ( χ0 ; χ1 ; χ2 )  and  y = ( ψ0 ; ψ1 ; ψ2 )
and the computation
κ := ( χ0ψ0 + χ1ψ1 ) + χ2ψ2.
Employ the SCM given in Subsubsection 6.2.3.2 to derive a result similar to that given in (6.3.1).
Answer.
κ̌ = ( χ0 ; χ1 ; χ2 )^T diag( (1 + ε*^(0))(1 + ε+^(1))(1 + ε+^(2)), (1 + ε*^(1))(1 + ε+^(1))(1 + ε+^(2)), (1 + ε*^(2))(1 + ε+^(2)) ) ( ψ0 ; ψ1 ; ψ2 ),
where |ε*^(0)|, |ε*^(1)|, |ε+^(1)|, |ε*^(2)|, |ε+^(2)| ≤ εmach.
Solution.
κ̌
= ⟨ κ̌ = [x^T y] ⟩ [ ( χ0ψ0 + χ1ψ1 ) + χ2ψ2 ]
= ⟨ each suboperation is performed in floating point arithmetic ⟩ [ [ χ0ψ0 + χ1ψ1 ] + [ χ2ψ2 ] ]
= ⟨ reformulate so we can use the result from Example 6.3.1.1 ⟩ [ [ ( χ0 ; χ1 )^T ( ψ0 ; ψ1 ) ] + [ χ2ψ2 ] ]
= ⟨ use Example 6.3.1.1; twice SCM ⟩
( ( χ0 ; χ1 )^T ( (1 + ε*^(0))(1 + ε+^(1))  0 ; 0  (1 + ε*^(1))(1 + ε+^(1)) ) ( ψ0 ; ψ1 ) + χ2ψ2(1 + ε*^(2)) )(1 + ε+^(2))
= ⟨ distribute, commute ⟩
( χ0 ; χ1 )^T ( (1 + ε*^(0))(1 + ε+^(1))  0 ; 0  (1 + ε*^(1))(1 + ε+^(1)) ) ( ψ0 ; ψ1 ) (1 + ε+^(2)) + χ2(1 + ε*^(2))(1 + ε+^(2)) ψ2
= ⟨ (perhaps too) slick way of expressing the final result ⟩
( χ0 ; χ1 ; χ2 )^T diag( (1 + ε*^(0))(1 + ε+^(1))(1 + ε+^(2)), (1 + ε*^(1))(1 + ε+^(1))(1 + ε+^(2)), (1 + ε*^(2))(1 + ε+^(2)) ) ( ψ0 ; ψ1 ; ψ2 ),
where |ε*^(0)|, |ε*^(1)|, |ε+^(1)|, |ε*^(2)|, |ε+^(2)| ≤ εmach.
YouTube: https://www.youtube.com/watch?v=PmFUqJXogm8
Consider now
κ := x^T y = ( ··· ( ( χ0ψ0 + χ1ψ1 ) + χ2ψ2 ) + ··· ) + χn−1ψn−1,
where x, y ∈ R^n. Under the computational model given in Subsection 6.2.3 the computed result, κ̌, satisfies
κ̌ = ( ··· ( ( χ0ψ0(1 + ε*^(0)) + χ1ψ1(1 + ε*^(1)) )(1 + ε+^(1)) + ··· )(1 + ε+^(n−2)) + χn−1ψn−1(1 + ε*^(n−1)) )(1 + ε+^(n−1))
  = χ0ψ0(1 + ε*^(0))(1 + ε+^(1))(1 + ε+^(2)) ··· (1 + ε+^(n−1))
  + χ1ψ1(1 + ε*^(1))(1 + ε+^(1))(1 + ε+^(2)) ··· (1 + ε+^(n−1))
  + ···
  + χn−1ψn−1(1 + ε*^(n−1))(1 + ε+^(n−1)),
so that
κ̌ = Σ_{i=0}^{n−1} ( χiψi (1 + ε*^(i)) Π_{j=i}^{n−1} (1 + ε+^(j)) ),    (6.3.2)
where ε+^(0) = 0 and |ε*^(i)|, |ε+^(j)| ≤ εmach.
YouTube: https://www.youtube.com/watch?v=6qnYXaw4Bms
Lemma 6.3.2.1 Let εi ∈ R, 0 ≤ i ≤ n − 1, nεmach < 1, and |εi| ≤ εmach. Then ∃ θn ∈ R such that
Π_{i=0}^{n−1} (1 + εi)^{±1} = 1 + θn,  with |θn| ≤ nεmach/(1 − nεmach).
Here each factor may appear to the power +1 or −1; for example, for n = 2 the product may equal
(1 + ε0)(1 + ε1)  or  (1 + ε0)/(1 + ε1)  or  (1 + ε1)/(1 + ε0)  or  1/( (1 + ε1)(1 + ε0) ),
so that this lemma can accommodate an analysis that involves a mixture of the Standard and Alternative Computational Models (SCM and ACM).
Proof. By Mathematical Induction.
• Base case. n = 1. Trivial.
• Inductive Step. The Inductive Hypothesis (I.H.) tells us that for all εi ∈ R, 0 ≤ i ≤ n − 1, nεmach < 1, and |εi| ≤ εmach, there exists a θn ∈ R such that
Π_{i=0}^{n−1} (1 + εi)^{±1} = 1 + θn,  with |θn| ≤ nεmach/(1 − nεmach).
We will show that if εi ∈ R, 0 ≤ i ≤ n, (n + 1)εmach < 1, and |εi| ≤ εmach, then there exists a θn+1 ∈ R such that
Π_{i=0}^{n} (1 + εi)^{±1} = 1 + θn+1,  with |θn+1| ≤ (n + 1)εmach/(1 − (n + 1)εmach).
  ◦ Case 1: The last term comes from the application of the SCM. Then Π_{i=0}^{n} (1 + εi)^{±1} = ( Π_{i=0}^{n−1} (1 + εi)^{±1} )(1 + εn). See Ponder This 6.3.2.1.
  ◦ Case 2: The last term comes from the application of the ACM. Then Π_{i=0}^{n} (1 + εi)^{±1} = ( Π_{i=0}^{n−1} (1 + εi)^{±1} )/(1 + εn). By the I.H. there exists a θn such that
( Π_{i=0}^{n−1} (1 + εi)^{±1} )/(1 + εn) = (1 + θn)/(1 + εn) = 1 + (θn − εn)/(1 + εn),
so that θn+1 = (θn − εn)/(1 + εn). Then
|θn+1|
= ⟨ definition of θn+1 ⟩ |(θn − εn)/(1 + εn)|
≤ ⟨ |θn − εn| ≤ |θn| + |εn| ≤ |θn| + εmach ⟩ ( |θn| + εmach )/|1 + εn|
≤ ⟨ |1 + εn| ≥ 1 − |εn| ≥ 1 − εmach ⟩ ( |θn| + εmach )/(1 − εmach)
≤ ⟨ bound |θn| using the I.H. ⟩ ( nεmach/(1 − nεmach) + εmach )/(1 − εmach)
= ⟨ algebra ⟩ ( nεmach + (1 − nεmach)εmach )/( (1 − nεmach)(1 − εmach) )
≤ ⟨ algebra ⟩ (n + 1)εmach/( 1 − (n + 1)εmach ).
⌅
Ponder This 6.3.2.1 Complete the proof of Lemma 6.3.2.1.
Remark 6.3.2.2 The quantity θn will be used throughout these notes. It is not intended to be a specific number. Instead, it is an order of magnitude identified by the subscript n, which indicates the number of error factors of the form (1 + εi) and/or (1 + εi)^{−1} that are grouped together to form (1 + θn).
Since we will often encounter the bound on |θn| that appears in Lemma 6.3.2.1 we assign it a symbol as follows:
Definition 6.3.2.3 For all n ≥ 1 and nεmach < 1, define
γn := nεmach/(1 − nεmach). ⌃
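For a feel for its size, γn is easily tabulated; a one-line MATLAB sketch (using the built-in eps as the machine epsilon of double precision):

gamma = @( n ) n * eps / ( 1 - n * eps );   % gamma_n for double precision
% For example, gamma( 1000 ) is about 2.2e-13: even long accumulations stay tiny.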
Applying Lemma 6.3.2.1 to (6.3.2) yields
κ̌
= χ0ψ0(1 + θn) + χ1ψ1(1 + θn) + χ2ψ2(1 + θn−1) + ··· + χn−1ψn−1(1 + θ2)
= ( χ0 ; χ1 ; χ2 ; ... ; χn−1 )^T diag( (1 + θn), (1 + θn), (1 + θn−1), ..., (1 + θ2) ) ( ψ0 ; ψ1 ; ψ2 ; ... ; ψn−1 )    (6.3.3)
= ( χ0 ; ... ; χn−1 )^T ( I + diag( θn, θn, θn−1, ..., θ2 ) ) ( ψ0 ; ... ; ψn−1 )
= x^T ( I + Σ^(n) ) y,
where the diagonal perturbation is denoted I + Σ^(n) and |θj| ≤ γj, j = 2, . . . , n.
Remark 6.3.2.4 Two instances of the symbol ◊n , appearing even in the same expression,
typically do not represent the same number. For example, in (6.3.3) a (1 + ◊n ) multiplies
each of the terms ‰0 Â0 and ‰1 Â1 , but these two instances of ◊n , as a rule, do not denote the
same quantity. In particular, one should be careful when factoring out such quantities.
YouTube: https://www.youtube.com/watch?v=Uc6NuDZMakE
As part of the analyses, the following bounds will be useful to bound error that accumulates:
Lemma 6.3.2.5 If n, b ≥ 1 then γn ≤ γn+b and γn + γb + γnγb ≤ γn+b.
This lemma will be invoked when, for example, we want to bound |ε| such that 1 + ε = (1 + ε1)(1 + ε2) = 1 + (ε1 + ε2 + ε1ε2), knowing that |ε1| ≤ γn and |ε2| ≤ γb.
Proof. First,
γn = nεmach/(1 − nεmach) ≤ (n + b)εmach/(1 − (n + b)εmach) = γn+b,
and
γn + γb + γnγb
= ⟨ definition ⟩ nεmach/(1 − nεmach) + bεmach/(1 − bεmach) + ( nεmach/(1 − nεmach) )( bεmach/(1 − bεmach) )
= ⟨ algebra ⟩ ( nεmach(1 − bεmach) + (1 − nεmach)bεmach + bnεmach² ) / ( (1 − nεmach)(1 − bεmach) )
= ⟨ algebra ⟩ ( nεmach − bnεmach² + bεmach − bnεmach² + bnεmach² ) / ( 1 − (n + b)εmach + bnεmach² )
= ⟨ algebra ⟩ ( (n + b)εmach − bnεmach² ) / ( 1 − (n + b)εmach + bnεmach² )
≤ ⟨ bnεmach² > 0 ⟩ (n + b)εmach / ( 1 − (n + b)εmach + bnεmach² )
≤ ⟨ bnεmach² > 0 ⟩ (n + b)εmach / ( 1 − (n + b)εmach )
= ⟨ definition ⟩ γn+b.
YouTube: https://www.youtube.com/watch?v=QxUCV4k8Gu8
It is of interest to accumulate the roundoff error encountered during computation as a perturbation of input and/or output parameters:
• κ̌ = (x + δx)^T y;  (κ̌ is the exact output for a slightly perturbed x)
• κ̌ = x^T (y + δy);  (κ̌ is the exact output for a slightly perturbed y)
• κ̌ = x^T y + δκ.  (κ̌ equals the exact result plus an error)
The first two are backward error results (error is accumulated onto input parameters, showing that the algorithm is numerically stable since it yields the exact output for a slightly perturbed input), while the last one is a forward error result (error is accumulated onto the answer). We will see that in different situations, a different error result may be needed by analyses of operations that require a dot product.
Let us focus on the second result. Ideally one would show that each of the entries of y is slightly perturbed relative to that entry:
δy = ( σ0 ψ0 ; ... ; σn−1 ψn−1 ) = diag( σ0, ..., σn−1 ) ( ψ0 ; ... ; ψn−1 ) = Σ y,
where each σi is "small" and Σ = diag( σ0, ..., σn−1 ). The following special structure of Σ, inspired by (6.3.3), will be used in the remainder of this note:
Σ^(n) = { a 0 × 0 matrix if n = 0;  θ1 if n = 1;  diag( θn, θn, θn−1, ..., θ2 ) otherwise }.
We now verify that
( I + Σ^(k)  0 ; 0  (1 + ε1) ) (1 + ε2) = I + Σ^(k+1),
where |ε1|, |ε2| ≤ εmach (with ε1 = 0 when k = 0), by considering three cases.
Case: k = 0. Then
( I + Σ^(k)  0 ; 0  (1 + ε1) ) (1 + ε2)
= ⟨ k = 0 means ( I + Σ^(k) ) is 0 × 0 and (1 + ε1) = (1 + 0) ⟩ (1 + 0)(1 + ε2)
= (1 + ε2)
= (1 + θ1)
= ( I + Σ^(1) ).
Case: k = 1. Then
( I + Σ^(k)  0 ; 0  (1 + ε1) ) (1 + ε2)
= ( 1 + θ1  0 ; 0  (1 + ε1) ) (1 + ε2)
= ( (1 + θ1)(1 + ε2)  0 ; 0  (1 + ε1)(1 + ε2) )
= ( (1 + θ2)  0 ; 0  (1 + θ2) )
= ( I + Σ^(2) ).
Case: k > 1. Notice that
( I + Σ^(k) )(1 + ε2) = diag( 1 + θk, 1 + θk, 1 + θk−1, ..., 1 + θ2 )(1 + ε2) = diag( 1 + θk+1, 1 + θk+1, 1 + θk, ..., 1 + θ3 ).
Then
( I + Σ^(k)  0 ; 0  (1 + ε1) ) (1 + ε2)
= ( ( I + Σ^(k) )(1 + ε2)  0 ; 0  (1 + ε1)(1 + ε2) )
= ( diag( 1 + θk+1, 1 + θk+1, 1 + θk, ..., 1 + θ3 )  0 ; 0  (1 + θ2) )
= ( I + Σ^(k+1) ).
We state a theorem that captures how error is accumulated by the algorithm.
Theorem 6.3.3.1 Let x, y ∈ R^n and let κ := x^T y be computed in the order indicated by
κ := ( ··· ( ( χ0ψ0 + χ1ψ1 ) + χ2ψ2 ) + ··· ) + χn−1ψn−1.
Then
κ̌ = [ x^T y ] = x^T ( I + Σ^(n) ) y.
Proof. Proof by Mathematical Induction on n, the size of vectors x and y.
• Base case. m(x) = m(y) = 0. Trivial!
• Inductive Step. Inductive Hypothesis (I.H.): Assume that if xT, yT ∈ R^k, k > 0, then
fl( xT^T yT ) = xT^T ( I + ΣT ) yT,  where ΣT = Σ^(k).
We will show that when xT, yT ∈ R^{k+1}, the equality fl( xT^T yT ) = xT^T ( I + ΣT ) yT holds true again, now with ΣT = Σ^(k+1). Assume that xT, yT ∈ R^{k+1}, and partition xT → ( x0 ; χ1 ) and yT → ( y0 ; ψ1 ). Then
fl( ( x0 ; χ1 )^T ( y0 ; ψ1 ) )
= ⟨ definition ⟩ fl( fl( x0^T y0 ) + fl( χ1 ψ1 ) )
= ⟨ I.H. with xT = x0, yT = y0, and Σ0 = Σ^(k) ⟩ fl( x0^T ( I + Σ0 ) y0 + fl( χ1 ψ1 ) )
= ⟨ SCM, twice ⟩ ( x0^T ( I + Σ0 ) y0 + χ1 ψ1 (1 + ε*) )(1 + ε+)
= ⟨ rearrange ⟩ ( x0 ; χ1 )^T ( ( I + Σ0  0 ; 0  (1 + ε*) )(1 + ε+) ) ( y0 ; ψ1 )
= ⟨ the result established above ⟩ ( x0 ; χ1 )^T ( I + Σ^(k+1) ) ( y0 ; ψ1 ),
so that ΣT = Σ^(k+1). ⌅
A number of useful consequences of Theorem 6.3.3.1 follow. These will be used later as an inventory (library) of error results from which to draw when analyzing operations and algorithms that utilize a dot product.
Corollary 6.3.3.2 Under the assumptions of Theorem 6.3.3.1 the following relations hold:
R-1B: κ̌ = (x + δx)^T y, where |δx| ≤ γn|x|.
Homework 6.3.3.2 Prove Corollary 6.3.3.2 R-1B.
Solution. From Theorem 6.3.3.1 we know that
κ̌ = x^T ( I + Σ^(n) ) y = ( x + Σ^(n) x )^T y,
where δx = Σ^(n) x. Then
|δx| = | Σ^(n) x |
= | ( θn χ0 ; θn χ1 ; θn−1 χ2 ; ... ; θ2 χn−1 ) |
= ( |θn χ0| ; |θn χ1| ; |θn−1 χ2| ; ... ; |θ2 χn−1| )
= ( |θn||χ0| ; |θn||χ1| ; |θn−1||χ2| ; ... ; |θ2||χn−1| )
≤ ( |θn||χ0| ; |θn||χ1| ; |θn||χ2| ; ... ; |θn||χn−1| )
≤ ( γn|χ0| ; γn|χ1| ; γn|χ2| ; ... ; γn|χn−1| )
= γn |x|.
(Note: strictly speaking, one should probably treat the case n = 1 separately.)
YouTube: https://www.youtube.com/watch?v=q7rACPOu4ZQ
Ponder This 6.3.4.1 In the above theorem, could one instead prove the result
y̌ = A(x + δx),
where δx is "small"?
Solution. The answer is "sort of". The reason is that for each individual element of y̌ such a perturbed x can be found. However, the δx for each entry ψ̌i is different, meaning that we cannot factor out x + δx to find that y̌ = A(x + δx).
However, one could argue that we know that y̌ = Ax + δy where |δy| ≤ γn|A||x|. Hence if Aδx = δy then A(x + δx) = y̌. This would mean that δy is in the column space of A (for example, if A is nonsingular). However, that is not quite what we are going for here.
YouTube: https://www.youtube.com/watch?v=pvBMuIzIob8
The idea behind backward error analysis is that the computed result is the exact result when computing with changed inputs. Let's consider matrix-matrix multiplication:
C := AB.
What we would like to be able to show is that there exist ∆A and ∆B such that the computed result, Č, satisfies
Č := (A + ∆A)(B + ∆B).
Let's think about this...
Ponder This 6.3.5.1 Can one find matrices ∆A and ∆B such that
Č = (A + ∆A)(B + ∆B)?
YouTube: https://www.youtube.com/watch?v=3d6kQ6rnRhA
For matrix-matrix multiplication, it is possible to "throw" the error onto the result, as summarized by the following theorem:
Theorem 6.3.5.1 Forward error for matrix-matrix multiplication. Let C ∈ R^{m×n}, A ∈ R^{m×k}, and B ∈ R^{k×n} and consider the assignment C := AB implemented via matrix-vector multiplication. Then there exists ∆C ∈ R^{m×n} such that
Č = AB + ∆C,  where |∆C| ≤ γk |A||B|.
Homework 6.3.5.2 Prove Theorem 6.3.5.1.
Solution. Partition
C = ( c0 c1 ··· cn−1 )  and  B = ( b0 b1 ··· bn−1 ).
Then
( c0 c1 ··· cn−1 ) := ( Ab0 Ab1 ··· Abn−1 ).
From R-1F 6.3.4.1 regarding matrix-vector multiplication we know that
( č0 č1 ··· čn−1 ) = ( Ab0 + δc0  Ab1 + δc1 ··· Abn−1 + δcn−1 )
= ( Ab0 Ab1 ··· Abn−1 ) + ( δc0 δc1 ··· δcn−1 )
= AB + ∆C.
YouTube: https://www.youtube.com/watch?v=rxKba-pnquQ
Remark 6.3.5.2 In practice, matrix-matrix multiplication is often the parameterized operation C := αAB + βC. A consequence of Theorem 6.3.5.1 is that for β ≠ 0, the error can be attributed to a change in parameter C, which means the error has been "thrown back" onto an input parameter.
YouTube: https://www.youtube.com/watch?v=ayj_rNkSMig
We now use the error results for the dot product to derive a backward error result
for solving Lx = y, where L is an n ◊ n lower triangular matrix, via the algorithm in
Figure 6.4.1.1, a variation on the algorithm in Figure 5.3.5.1 that stores the result in vector
x and does not assume that L is unit lower triangular.
Solve Lx = y

Partition L → ( L_{TL} L_{TR} ; L_{BL} L_{BR} ), x → ( x_T ; x_B ), y → ( y_T ; y_B )
  where L_{TL} is 0 × 0 and x_T, y_T have 0 elements
while n(L_{TL}) < n(L)
  Repartition
    ( L_{TL} L_{TR} ; L_{BL} L_{BR} ) → ( L_{00} l_{01} L_{02} ; l_{10}^T λ_{11} l_{12}^T ; L_{20} l_{21} L_{22} ),
    ( x_T ; x_B ) → ( x_0 ; χ_1 ; x_2 ),  ( y_T ; y_B ) → ( y_0 ; ψ_1 ; y_2 )
  χ_1 := ( ψ_1 − l_{10}^T x_0 ) / λ_{11}
  Continue with
    ( L_{TL} L_{TR} ; L_{BL} L_{BR} ) ← ( L_{00} l_{01} L_{02} ; l_{10}^T λ_{11} l_{12}^T ; L_{20} l_{21} L_{22} ),
    ( x_T ; x_B ) ← ( x_0 ; χ_1 ; x_2 ),  ( y_T ; y_B ) ← ( y_0 ; ψ_1 ; y_2 )
endwhile
Figure 6.4.1.1 Dot product based lower triangular solve algorithm.
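For those who want to experiment, a direct MATLAB translation of the algorithm in Figure 6.4.1.1 might look as follows; this is a sketch, and the function name Ltrsv is ours.

function x = Ltrsv( L, y )
% Solve L x = y with the dot product based lower triangular solve of
% Figure 6.4.1.1.  L is assumed nonsingular and lower triangular (not unit).
  n = size( L, 1 );
  x = zeros( n, 1 );
  for i = 1:n
    % chi_1 := ( psi_1 - l10' * x0 ) / lambda_11
    x( i ) = ( y( i ) - L( i, 1:i-1 ) * x( 1:i-1 ) ) / L( i, i );
  end
end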
To establish the backward error result for this algorithm, we need to understand the error incurred in the key computation
χ_1 := ( ψ_1 − l_{10}^T x_0 ) / λ_{11}.
The following lemma gives the required (forward error) result, abstracted away from the specifics of how it occurs in the lower triangular solve algorithm.
Lemma 6.4.1.2 Let n ≥ 1, λ, α, ν ∈ R and x, y ∈ R^n. Assume λ ≠ 0 and consider the computation
ν := ( α − x^T y ) / λ.
Then
( λ + δλ ) ν̌ = α − ( x + δx )^T y, where |δx| ≤ γ_n |x| and |δλ| ≤ γ_2 |λ|.
Homework 6.4.1.1 Prove Lemma 6.4.1.2.
Hint. Use the Alternative Computational Model (Subsubsection 6.2.3.3) appropriately.
Solution. We know that
ν̌ = (α − β̌)/λ · 1/((1 + ε_−)(1 + ε_/)),
where β̌ is the computed dot product, so that β̌ = (x + δx)^T y with |δx| ≤ γ_n |x|. Hence
ν̌ = (α − (x + δx)^T y)/λ · 1/((1 + ε_−)(1 + ε_/)),
or, equivalently,
λ(1 + ε_−)(1 + ε_/) ν̌ = α − (x + δx)^T y,
or,
λ(1 + θ_2) ν̌ = α − (x + δx)^T y,
where |θ_2| ≤ γ_2, which can also be written as
(λ + δλ) ν̌ = α − (x + δx)^T y,
where δλ = θ_2 λ so that |δλ| ≤ γ_2 |λ|.
χ_1 = ψ_1 / λ_{11}
and
χ̌_1 = (ψ_1 / λ_{11}) · 1/(1 + ε_/).
Rearranging gives us
λ_{11} χ̌_1 (1 + ε_/) = ψ_1
or
(λ_{11} + δλ_{11}) χ̌_1 = ψ_1,
where δλ_{11} = ε_/ λ_{11} and hence |δλ_{11}| ≤ ε_mach |λ_{11}|.
• Case 1: n = 1.
See Homework 6.4.1.2.
• Case 2: n = 2.
See Homework 6.4.1.3.
• Case 3: n > 2.
The system now looks like
\[
\begin{pmatrix} L_{00} & 0 \\ l_{10}^T & λ_{11} \end{pmatrix}
\begin{pmatrix} x_0 \\ χ_1 \end{pmatrix}
=
\begin{pmatrix} y_0 \\ ψ_1 \end{pmatrix},   (6.4.3)
\]
where
\[
\begin{pmatrix} |δL_{00}| & 0 \\ |δl_{10}^T| & |δλ_{11}| \end{pmatrix}
≤
\begin{pmatrix} \max(γ_2, γ_{n-2})|L_{00}| & 0 \\ γ_{n-1}|l_{10}^T| & γ_2|λ_{11}| \end{pmatrix}
\]
and hence
\[
\left| \begin{pmatrix} δL_{00} & 0 \\ δl_{10}^T & δλ_{11} \end{pmatrix} \right|
≤ \max(γ_2, γ_{n-1}) \left| \begin{pmatrix} L_{00} & 0 \\ l_{10}^T & λ_{11} \end{pmatrix} \right|.
\]
YouTube: https://www.youtube.com/watch?v=GB7wj7_dhCE
A careful examination of the solution to Homework 6.4.1.2, together with the fact that γ_{n−1} ≤ γ_n, allows us to state a slightly looser, but cleaner, version of Theorem 6.4.1.3:
Corollary 6.4.1.4 Let L ∈ R^{n×n} be a nonsingular lower triangular matrix and let x̌ be the computed result when executing the algorithm in Figure 6.4.1.1 to solve Lx = y under the computation model from Subsection 6.2.3. Then there exists a matrix ΔL such that
(L + ΔL) x̌ = y, where |ΔL| ≤ γ_n |L|.
YouTube: https://www.youtube.com/watch?v=fds-FeL28ok
The numerical stability of various LU factorization algorithms as well as the triangular
solve algorithms can be found in standard graduate level numerical linear algebra texts [19]
[21]. Of particular interest may be the analysis of the Crout variant of LU factorization 5.5.1.4
in
• [6] Paolo Bientinesi, Robert A. van de Geijn, Goal-Oriented and Modular Stability
Analysis, SIAM Journal on Matrix Analysis and Applications , Volume 32 Issue 1,
February 2011.
• [7] Paolo Bientinesi, Robert A. van de Geijn, The Science of Deriving Stability Anal-
yses, FLAME Working Note #33. Aachen Institute for Computational Engineering
Sciences, RWTH Aachen. TR AICES-2008-2. November 2008. (Technical report ver-
sion with exercises.)
since these papers use the same notation as we use in our notes. Here is the pertinent result
from those papers:
Theorem 6.4.2.1 Backward error of Crout variant for LU factorization. Let A ∈ R^{n×n} and let the LU factorization of A be computed via the Crout variant, yielding approximate factors Ľ and Ǔ. Then
A + ΔA = Ľ Ǔ, where |ΔA| ≤ γ_n |Ľ||Ǔ|.
YouTube: https://www.youtube.com/watch?v=c1NsTSCpe1k
Let us now combine the results from Subsection 6.4.1 and Subsection 6.4.2 into a backward error result for solving Ax = y via LU factorization and two triangular solves.
Theorem 6.4.3.1 Let A ∈ R^{n×n} and x, y ∈ R^n with Ax = y. Let x̌ be the approximate solution computed via the following steps:
• Compute the LU factorization, yielding approximate factors Ľ and Ǔ.
• Solve Ľz = y, yielding the approximate solution ž.
• Solve Ǔx = ž, yielding the approximate solution x̌.
Then
(A + ΔA) x̌ = y with |ΔA| ≤ (3γ_n + γ_n²) |Ľ||Ǔ|.
We refer the interested learner to the proof in the previously mentioned papers [6] [7].
Homework 6.4.3.1 The question left is how a change in a nonsingular matrix affects the accuracy of the solution of a linear system that involves that matrix. We saw in Subsection 1.4.1 that if
Ax = y and A(x + δx) = y + δy
then
‖δx‖/‖x‖ ≤ κ(A) ‖δy‖/‖y‖
when ‖·‖ is a subordinate norm. But what we want to know is how a change in A affects the solution: if
Ax = y and (A + ΔA)(x + δx) = y
then
‖δx‖/‖x‖ ≤ ( κ(A) ‖ΔA‖/‖A‖ ) / ( 1 − κ(A) ‖ΔA‖/‖A‖ ).
Prove this!
Solution.
Ax = y and (A + ΔA)(x + δx) = y
implies that
(A + ΔA)(x + δx) = Ax
or, equivalently,
ΔA x + A δx + ΔA δx = 0.
We can rewrite this as
δx = A^{−1}( −ΔA x − ΔA δx )
so that
‖δx‖ = ‖A^{−1}( −ΔA x − ΔA δx )‖ ≤ ‖A^{−1}‖‖ΔA‖‖x‖ + ‖A^{−1}‖‖ΔA‖‖δx‖.
This can be rewritten as
( 1 − ‖A^{−1}‖‖ΔA‖ ) ‖δx‖ ≤ ‖A^{−1}‖‖ΔA‖‖x‖,
so that
‖δx‖/‖x‖ ≤ ‖A^{−1}‖‖ΔA‖ / ( 1 − ‖A^{−1}‖‖ΔA‖ )
and finally
‖δx‖/‖x‖ ≤ ( ‖A‖‖A^{−1}‖ ‖ΔA‖/‖A‖ ) / ( 1 − ‖A‖‖A^{−1}‖ ‖ΔA‖/‖A‖ )
= ( κ(A) ‖ΔA‖/‖A‖ ) / ( 1 − κ(A) ‖ΔA‖/‖A‖ ).
The last homework brings up a good question: If A is nonsingular, how small does ΔA need to be for A + ΔA to be guaranteed to be nonsingular?
Theorem 6.4.3.2 Let A be nonsingular, ‖·‖ be a subordinate norm, and
‖ΔA‖/‖A‖ < 1/κ(A).
Then A + ΔA is nonsingular.
Proof. Proof by contradiction.
Assume that A is nonsingular,
‖ΔA‖/‖A‖ < 1/κ(A),
and A + ΔA is singular. We will show this leads to a contradiction.
Since A + ΔA is singular, there exists x ≠ 0 such that (A + ΔA)x = 0. We can rewrite this as
x = −A^{−1} ΔA x
and hence
‖x‖ = ‖A^{−1} ΔA x‖ ≤ ‖A^{−1}‖‖ΔA‖‖x‖.
Dividing both sides by ‖x‖ yields
1 ≤ ‖A^{−1}‖‖ΔA‖
and hence 1/‖A^{−1}‖ ≤ ‖ΔA‖ and finally
1/( ‖A‖‖A^{−1}‖ ) ≤ ‖ΔA‖/‖A‖,
which is a contradiction. ⌅
YouTube: https://www.youtube.com/watch?v=n95C8qjMBcI
The analysis of LU factorization without partial pivoting is related to that of LU factor-
ization with partial pivoting as follows:
• We have shown that LU factorization with partial pivoting is equivalent to the LU
factorization without partial pivoting on a pre-permuted matrix: P A = LU , where P
is a permutation matrix.
• The permutation (exchanging of rows) doesn’t involve any floating point operations
and therefore does not generate error.
It can therefore be argued that, as a result, the error that is accumulated is equivalent with or without partial pivoting.
More slowly, what if we took the following approach to LU factorization with partial
pivoting:
• Compute the LU factorization with partial pivoting yielding the pivot matrix P , the
unit lower triangular matrix L, and the upper triangular matrix U . In exact arithmetic
this would mean these matrices are related by P A = LU.
• In practice, no error exists in P (except that a wrong index of a row with which to
pivot may result from roundoff error in the intermediate results in matrix A) and
approximate factors Ľ and Ǔ are computed.
• If we now took the pivot matrix P and formed B = P A (without incurring error since
rows are merely swapped) and then computed the LU factorization of B, then the
computed L and U would equal exactly the Ľ and Ǔ that resulted from computing
the LU factorization with row pivoting with A in floating point arithmetic. Why?
Because the exact same computations are performed although possibly with data that
is temporarily in a different place in the matrix at the time of that computation.
We conclude that
P A + ΔB = Ľ Ǔ, where |ΔB| ≤ γ_n |Ľ||Ǔ|,
or, equivalently,
P( A + ΔA ) = Ľ Ǔ, where P|ΔA| ≤ γ_n |Ľ||Ǔ|,
where ΔB = P ΔA and we note that P|ΔA| = |P ΔA| (taking the absolute value of a matrix and then swapping rows yields the same matrix as when one first swaps the rows and then takes the absolute value).
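This backward error result can be observed in practice. The sketch below computes an LU factorization with partial pivoting in single precision and measures |ĽǓ − PA| against γ_n|Ľ||Ǔ|; the matrix size, the random data, and the use of MATLAB's built-in lu are our own illustrative choices.

n  = 100;
As = single( randn( n ) );
Ad = double( As );
[ L, U, P ] = lu( As );                       % LU with partial pivoting, single precision
L = double( L );  U = double( U );  P = double( P );
DeltaB  = L * U - P * Ad;                     % P A + DeltaB = L U
u       = double( eps( 'single' ) ) / 2;
gamma_n = n * u / ( 1 - n * u );
fprintf( 'max |DeltaB| ./ ( gamma_n |L||U| ) = %e (should be at most 1)\n', ...
         max( max( abs( DeltaB ) ./ ( gamma_n * abs( L ) * abs( U ) ) ) ) );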
YouTube: https://www.youtube.com/watch?v=TdLM41LCma4
The last unit gives a backward error result regarding LU factorization (and, by extension, LU factorization with pivoting):
A + ΔA = Ľ Ǔ with |ΔA| ≤ γ_n |Ľ||Ǔ|.
The question now is: does this mean that LU factorization with partial pivoting is stable? In other words, is ΔA, which we bounded with |ΔA| ≤ γ_n |Ľ||Ǔ|, always small relative to the entries of |A|? The following exercise gives some insight:
Homework 6.4.5.1 Apply LU with partial pivoting to
\[
A = \begin{pmatrix} 1 & 0 & 1 \\ -1 & 1 & 1 \\ -1 & -1 & 1 \end{pmatrix}.
\]
Homework 6.4.5.2 Generalize the insights from the last homework to an n × n matrix. What is the maximal element growth that is observed?
Solution. Consider
\[
A = \begin{pmatrix}
 1 & 0 & 0 & ⋯ & 0 & 1 \\
-1 & 1 & 0 & ⋯ & 0 & 1 \\
-1 & -1 & 1 & ⋯ & 0 & 1 \\
 ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\
-1 & -1 & ⋯ & & 1 & 1 \\
-1 & -1 & ⋯ & & -1 & 1
\end{pmatrix}.
\]
Notice that no pivoting is necessary when LU factorization with pivoting is performed. Eliminating the entries below the diagonal in the first column yields:
\[
\begin{pmatrix}
 1 & 0 & 0 & ⋯ & 0 & 1 \\
 0 & 1 & 0 & ⋯ & 0 & 2 \\
 0 & -1 & 1 & ⋯ & 0 & 2 \\
 ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\
 0 & -1 & ⋯ & & 1 & 2 \\
 0 & -1 & ⋯ & & -1 & 2
\end{pmatrix}.
\]
Eliminating the entries below the diagonal in the second column again does not require pivoting and yields:
\[
\begin{pmatrix}
 1 & 0 & 0 & ⋯ & 0 & 1 \\
 0 & 1 & 0 & ⋯ & 0 & 2 \\
 0 & 0 & 1 & ⋯ & 0 & 4 \\
 ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\
 0 & 0 & ⋯ & & 1 & 4 \\
 0 & 0 & ⋯ & & -1 & 4
\end{pmatrix}.
\]
Continuing like this for the remaining columns, eliminating the entries below the diagonal leaves us with the upper triangular matrix
\[
\begin{pmatrix}
 1 & 0 & 0 & ⋯ & 0 & 1 \\
 0 & 1 & 0 & ⋯ & 0 & 2 \\
 0 & 0 & 1 & ⋯ & 0 & 4 \\
 ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\
 0 & 0 & ⋯ & & 1 & 2^{n-2} \\
 0 & 0 & ⋯ & & 0 & 2^{n-1}
\end{pmatrix}.
\]
From these exercises we conclude that even LU factorization with partial pivoting can
yield large (exponential) element growth in U .
In practice, this does not seem to happen and LU factorization is considered to be stable.
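The element growth can be observed directly. The following MATLAB sketch builds the matrix from Homework 6.4.5.2 and reports the largest magnitude entry of U computed by LU with partial pivoting; the size n and the use of the built-in lu are our own choices.

n = 30;
A = eye( n ) - tril( ones( n ), -1 );   % ones on the diagonal, -1 below it
A( :, n ) = 1;                          % last column equals one
[ L, U, P ] = lu( A );                  % LU with partial pivoting
fprintf( 'max |A| = %d, max |U| = %d (expect 2^(n-1) = %d)\n', ...
         max( max( abs( A ) ) ), max( max( abs( U ) ) ), 2^( n-1 ) );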
6.5 Enrichments
6.5.1 Systematic derivation of backward error analyses
Throughout the course, we have pointed out that the FLAME notation facilitates the sys-
tematic derivation of linear algebra algorithms. The papers
• [6] Paolo Bientinesi, Robert A. van de Geijn, Goal-Oriented and Modular Stability
Analysis, SIAM Journal on Matrix Analysis and Applications , Volume 32 Issue 1,
February 2011.
• [7] Paolo Bientinesi, Robert A. van de Geijn, The Science of Deriving Stability Anal-
yses, FLAME Working Note #33. Aachen Institute for Computational Engineering
Sciences, RWTH Aachen. TR AICES-2008-2. November 2008. (Technical report ver-
sion of the SIAM paper, but with exercises.)
extend this to the systematic derivation of the backward error analysis of algorithms. Other
publications and texts present error analyses on a case-by-case basis (much like we do in
these materials) rather than as a systematic and comprehensive approach.
6.6 Wrap Up
6.6.1 Additional homework
Homework 6.6.1.1 In Units 6.3.1-3 we analyzed how error accumulates when computing a dot product of x and y of size m in the order indicated by
• For m = 2:
κ = χ_0 ψ_0 + χ_1 ψ_1
• For m = 4:
κ = ( χ_0 ψ_0 + χ_1 ψ_1 ) + ( χ_2 ψ_2 + χ_3 ψ_3 )
• For m = 8:
κ = ( ( χ_0 ψ_0 + χ_1 ψ_1 ) + ( χ_2 ψ_2 + χ_3 ψ_3 ) ) + ( ( χ_4 ψ_4 + χ_5 ψ_5 ) + ( χ_6 ψ_6 + χ_7 ψ_7 ) )
and so forth. Analyze how under the SCM error accumulates and state backward stability results. You may assume that m is a power of two.
6.6.2 Summary
In our discussions, the set of floating point numbers, F, is the set of all numbers χ = μ × β^e such that
• β = 2,
• μ = ±.δ_0 δ_1 ⋯ δ_{t−1} (μ has only t (binary) digits), where δ_j ∈ {0, 1},
• δ_0 = 0 iff μ = 0 (the mantissa is normalized), and
• −L ≤ e ≤ U.
Definition 6.6.2.1 Machine epsilon (unit roundoff). The machine epsilon (unit roundoff), ε_mach, is defined as the smallest positive floating point number χ such that the floating point number that represents 1 + χ is greater than one. ⌃
fl(expression) = [expression]
equals the result when computing expression using floating point computation (rounding or truncating as every intermediate result is stored). If
κ = expression
in exact arithmetic, then we denote the associated floating point result by
κ̌ = [expression].
The Standard Computational Model (SCM) assumes that, for any two floating point numbers χ and ψ, the basic arithmetic operations satisfy the equality
fl(χ op ψ) = (χ op ψ)(1 + ε), |ε| ≤ ε_mach, and op ∈ {+, −, ∗, /}.
The Alternative Computational Model (ACM) assumes for the basic arithmetic operations that
fl(χ op ψ) = (χ op ψ)/(1 + ε), |ε| ≤ ε_mach, and op ∈ {+, −, ∗, /}.
• Conditioning is a property of the problem you are trying to solve. A problem is well-
conditioned if a small change in the input is guaranteed to only result in a small change
in the output. A problem is ill-conditioned if a small change in the input can result in
a large change in the output.
• Stability is a property of an implementation. If the implementation, when executed
with an input always yields an output that can be attributed to slightly changed input,
then the implementation is backward stable.
Definition 6.6.2.3 Absolute value of vector and matrix. Given x ∈ R^n and A ∈ R^{m×n},
\[
|x| = \begin{pmatrix} |χ_0| \\ |χ_1| \\ ⋮ \\ |χ_{n-1}| \end{pmatrix}
\quad\text{and}\quad
|A| = \begin{pmatrix} |α_{0,0}| & |α_{0,1}| & ⋯ & |α_{0,n-1}| \\ |α_{1,0}| & |α_{1,1}| & ⋯ & |α_{1,n-1}| \\ ⋮ & ⋮ & ⋱ & ⋮ \\ |α_{m-1,0}| & |α_{m-1,1}| & ⋯ & |α_{m-1,n-1}| \end{pmatrix}.
\]
⌃
Definition 6.6.2.4 Let M ∈ {<, ≤, =, ≥, >} and x, y ∈ R^n. Then |x| M |y| iff |χ_i| M |ψ_i| for i = 0, ..., n − 1. Similarly, for A, B ∈ R^{m×n}, |A| M |B| iff |α_{i,j}| M |β_{i,j}|, with i = 0, ..., m − 1 and j = 0, ..., n − 1. ⌃
Theorem 6.6.2.5 Let A, B ∈ R^{m×n}. If |A| ≤ |B| then ‖A‖_1 ≤ ‖B‖_1, ‖A‖_∞ ≤ ‖B‖_∞, and ‖A‖_F ≤ ‖B‖_F.
Consider
\[
κ := x^T y =
\begin{pmatrix} χ_0 \\ χ_1 \\ ⋮ \\ χ_{n-2} \\ χ_{n-1} \end{pmatrix}^T
\begin{pmatrix} ψ_0 \\ ψ_1 \\ ⋮ \\ ψ_{n-2} \\ ψ_{n-1} \end{pmatrix}
= \Big( \big( (χ_0 ψ_0 + χ_1 ψ_1) + ⋯ \big) + χ_{n-2} ψ_{n-2} \Big) + χ_{n-1} ψ_{n-1}.
\]
Under the computational model given in Subsection 6.2.3 the computed result, κ̌, satisfies
\[
κ̌ = \sum_{i=0}^{n-1} \left( χ_i ψ_i (1 + ε_∗^{(i)}) \prod_{j=i}^{n-1} (1 + ε_+^{(j)}) \right),
\]
(1 + ε_0)(1 + ε_1)  or  (1 + ε_0)/(1 + ε_1)  or  (1 + ε_1)/(1 + ε_0)  or  1/((1 + ε_1)(1 + ε_0)),
so that this lemma can accommodate an analysis that involves a mixture of the Standard and Alternative Computational Models (SCM and ACM).
Definition 6.6.2.7 For all n ≥ 1 and nε_mach < 1, define γ_n := nε_mach / (1 − nε_mach). ⌃
simplifies to
\[
κ̌ = χ_0 ψ_0 (1 + θ_n) + χ_1 ψ_1 (1 + θ_n) + ⋯ + χ_{n-1} ψ_{n-1} (1 + θ_2)
\]
\[
= \begin{pmatrix} χ_0 \\ χ_1 \\ χ_2 \\ ⋮ \\ χ_{n-1} \end{pmatrix}^T
\begin{pmatrix} (1 + θ_n) & 0 & 0 & ⋯ & 0 \\ 0 & (1 + θ_n) & 0 & ⋯ & 0 \\ 0 & 0 & (1 + θ_{n-1}) & ⋯ & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & ⋯ & (1 + θ_2) \end{pmatrix}
\begin{pmatrix} ψ_0 \\ ψ_1 \\ ψ_2 \\ ⋮ \\ ψ_{n-1} \end{pmatrix}
\]
\[
= \begin{pmatrix} χ_0 \\ χ_1 \\ χ_2 \\ ⋮ \\ χ_{n-1} \end{pmatrix}^T
\left( I + \begin{pmatrix} θ_n & 0 & 0 & ⋯ & 0 \\ 0 & θ_n & 0 & ⋯ & 0 \\ 0 & 0 & θ_{n-1} & ⋯ & 0 \\ ⋮ & ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & 0 & ⋯ & θ_2 \end{pmatrix} \right)
\begin{pmatrix} ψ_0 \\ ψ_1 \\ ψ_2 \\ ⋮ \\ ψ_{n-1} \end{pmatrix},
\]
where |θ_j| ≤ γ_j, j = 2, ..., n.
Lemma 6.6.2.8 If n, b ≥ 1 then γ_n ≤ γ_{n+b} and γ_n + γ_b + γ_n γ_b ≤ γ_{n+b}.
Then
κ̌ = [ x^T y ] = x^T ( I + Σ^(n) ) y.
Corollary 6.6.2.10 Under the assumptions of Theorem 6.6.2.9 the following relations hold:
R-1B: κ̌ = ( x + δx )^T y, where |δx| ≤ γ_n |x|.
Then
(A + ΔA) x̌ = y with |ΔA| ≤ (3γ_n + γ_n²) |Ľ||Ǔ|.
Theorem 6.6.2.18 Let A and A + ΔA be nonsingular, Ax = y, and (A + ΔA)(x + δx) = y. Then
‖δx‖/‖x‖ ≤ ( κ(A) ‖ΔA‖/‖A‖ ) / ( 1 − κ(A) ‖ΔA‖/‖A‖ ).
Let A be nonsingular, ‖·‖ be a subordinate norm, and
‖ΔA‖/‖A‖ < 1/κ(A).
Then A + ΔA is nonsingular.
An important example that demonstrates how LU with partial pivoting can incur "element growth":
\[
A = \begin{pmatrix}
 1 & 0 & 0 & ⋯ & 0 & 1 \\
-1 & 1 & 0 & ⋯ & 0 & 1 \\
-1 & -1 & 1 & ⋯ & 0 & 1 \\
 ⋮ & ⋮ & ⋮ & ⋱ & ⋮ & ⋮ \\
-1 & -1 & ⋯ & & 1 & 1 \\
-1 & -1 & ⋯ & & -1 & 1
\end{pmatrix}.
\]
Week 7
Solving Sparse Linear Systems
YouTube: https://www.youtube.com/watch?v=Qq_cQbVQA5Y
Many computational engineering and science applications start with some law of physics that applies to some physical problem. This is mathematically expressed as a Partial Differential Equation (PDE). We here will use one of the simplest of PDEs, Poisson's equation on the domain Ω in two dimensions:
−Δu = f.
In two dimensions this is alternatively expressed as
−∂²u/∂x² − ∂²u/∂y² = f(x, y)   (7.1.1)
with Dirichlet boundary condition u = 0 on ∂Ω (meaning that u(x, y) = 0 on the boundary of domain Ω). For example, the domain may be the square 0 ≤ x, y ≤ 1, ∂Ω its boundary, and the problem may model a membrane with f being some load from, for example, a sound wave.
Since this course does not require a background in the mathematics of PDEs, let's explain the gist of all this in layman's terms.
Since this course does not require a background in the mathematics of PDEs, let’s explain
the gist of all this in layman’s terms.
• We want to find the function u that satisfies the conditions specified by (7.1.1). It is
assumed that u is appropriately differentiable.
• For simplicity, let's assume the domain is the square with 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1 so that the boundary is the boundary of this square. We assume that on the boundary the function equals zero.
• It is usually difficult to analytically determine the continuous function u that solves
such a "boundary value problem" (except for very simple examples).
• To solve the problem computationally, the problem is "discretized". What this means
for our example is that a mesh is laid over the domain, values for the function u at the
mesh points are approximated, and the operator is approximated. In other words, the
continuous domain is viewed as a mesh instead, as illustrated in Figure 7.1.1.1 (Left).
We will assume an N × N mesh of equally spaced points, where the distance between two adjacent points is h = 1/(N + 1). This means the mesh consists of points {(χ_i, ψ_j)} with χ_i = (i + 1)h for i = 0, 1, ..., N − 1 and ψ_j = (j + 1)h for j = 0, 1, ..., N − 1.
• If you do the math, details of which can be found in Subsection 7.4.1, you find that the problem in (7.1.1) can be approximated with a linear equation at each mesh point:
\[
\frac{ -u(χ_i, ψ_{j-1}) - u(χ_{i-1}, ψ_j) + 4u(χ_i, ψ_j) - u(χ_{i+1}, ψ_j) - u(χ_i, ψ_{j+1}) }{ h^2 } = f(χ_i, ψ_j).
\]
The values in this equation come from the "five point stencil" illustrated in Figure 7.1.1.1 (Right).
YouTube: https://www.youtube.com/watch?v=GvdBA5emnSs
• If we number the values at the grid points, u(χ_i, ψ_j), in what is called the "natural ordering" as illustrated in Figure 7.1.1.1 (Middle), then we can write all these insights, together with the boundary condition, as
−υ_{i−N} − υ_{i−1} + 4υ_i − υ_{i+1} − υ_{i+N} = h² φ_i
or, equivalently,
υ_i = ( h² φ_i + υ_{i−N} + υ_{i−1} + υ_{i+1} + υ_{i+N} ) / 4,
with appropriate modifications for the case where i places the point that yielded the equation on the bottom, left, right, and/or top of the mesh.
YouTube: https://www.youtube.com/watch?v=VYMbSJAqaUM
All these insights can be put together into a system of linear equations:
4υ_0 − υ_1 − υ_4 = h² φ_0
−υ_0 + 4υ_1 − υ_2 − υ_5 = h² φ_1
−υ_1 + 4υ_2 − υ_3 − υ_6 = h² φ_2
−υ_2 + 4υ_3 − υ_7 = h² φ_3
−υ_0 + 4υ_4 − υ_5 − υ_8 = h² φ_4
  ⋮
where φ_i = f(χ_i, ψ_j) if (χ_i, ψ_j) is the point associated with value υ_i. In matrix notation this becomes
\[
\begin{pmatrix}
 4 & -1 &    &    & -1 &    &    &    &    &        \\
-1 &  4 & -1 &    &    & -1 &    &    &    &        \\
   & -1 &  4 & -1 &    &    & -1 &    &    &        \\
   &    & -1 &  4 &    &    &    & -1 &    &        \\
-1 &    &    &    &  4 & -1 &    &    & -1 &        \\
   & -1 &    &    & -1 &  4 & -1 &    &    & \ddots \\
   &    & -1 &    &    & -1 &  4 & -1 &    &        \\
   &    &    & -1 &    &    & -1 &  4 &    &        \\
   &    &    &    & -1 &    &    &    &  4 & \ddots \\
   &    &    &    &    & \ddots &  &    & \ddots & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0 \\ υ_1 \\ υ_2 \\ υ_3 \\ υ_4 \\ υ_5 \\ υ_6 \\ υ_7 \\ υ_8 \\ ⋮ \end{pmatrix}
=
\begin{pmatrix} h²φ_0 \\ h²φ_1 \\ h²φ_2 \\ h²φ_3 \\ h²φ_4 \\ h²φ_5 \\ h²φ_6 \\ h²φ_7 \\ h²φ_8 \\ ⋮ \end{pmatrix}.
\]
This demonstrates how solving the discretized Poisson’s equation boils down to the solution
of a linear system Au = h2 f , where A has a distinct sparsity pattern (pattern of nonzeroes).
Homework 7.1.1.1 The observations in this unit suggest the following way of solving
(7.1.1):
• Discretize the domain 0 ≤ χ, ψ ≤ 1 by creating an (N + 2) × (N + 2) mesh of points.
• An (N + 2) × (N + 2) array U holds the values u(χ_i, ψ_j) plus the boundary around it.
• Create an (N + 2) × (N + 2) array F that holds the values f(χ_i, ψ_j) (plus, for convenience, extra values that correspond to the boundary).
• Set all values in U to zero. This initializes the first and last rows and columns to zero, which captures the boundary condition, and initializes the rest of the values at the mesh points to zero.
• Repeatedly sweep over the interior mesh points, updating
U(i, j) = ( h² F(i, j) + U(i, j − 1) + U(i − 1, j) + U(i + 1, j) + U(i, j + 1) ) / 4,
where the terms correspond to f(χ_i, ψ_j), u(χ_i, ψ_{j−1}), u(χ_{i−1}, ψ_j), u(χ_{i+1}, ψ_j), and u(χ_i, ψ_{j+1}), respectively.
• Bingo! You have written your first iterative solver for a sparse linear system. (A sketch in MATLAB is given below.)
• Test your solver with the problem where f(χ, ψ) = (α² + β²)π² sin(απχ) sin(βπψ).
• Hint: if x and y are arrays with the vectors x and y (with entries χ_i and ψ_j), then mesh( x, y, U ) plots the values in U.
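A minimal MATLAB sketch of the solver described in this homework is given below. The parameter values (N, α, β, the number of sweeps) and variable names are our own choices; the update is exactly the one displayed above, using only values from the previous sweep.

N = 50;  h = 1 / ( N+1 );
alpha = 2;  beta = 3;
[ x, y ] = meshgrid( (0:N+1)*h, (0:N+1)*h );
F = ( alpha^2 + beta^2 ) * pi^2 * sin( alpha*pi*x ) .* sin( beta*pi*y );
U = zeros( N+2, N+2 );                  % boundary values (and initial guess) are zero
for sweep = 1:500
  Uold = U;
  for i = 2:N+1
    for j = 2:N+1
      U( i, j ) = ( h^2 * F( i, j ) + Uold( i, j-1 ) + Uold( i-1, j ) ...
                    + Uold( i+1, j ) + Uold( i, j+1 ) ) / 4;
    end
  end
end
mesh( x, y, U );   % compare with the exact solution sin(alpha*pi*x).*sin(beta*pi*y)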
YouTube: https://www.youtube.com/watch?v=j-ELcqx3bRo
Remark 7.1.1.3 The point of this launch is that many problems that arise in computational
science require the solution to a system of linear equations Ax = b where A is a (very) sparse
matrix. Often, the matrix does not even need to be explicitly formed and stored.
Remark 7.1.1.4 Wilkinson defined a sparse matrix as any matrix with enough zeros that
it pays to take advantage of them.
7.1.2 Overview
• 7.1 Opening
• 7.4 Enrichments
¶ 7.4.1 Details!
¶ 7.4.2 Parallelism in splitting methods
¶ 7.4.3 Dr. SOR
• 7.5 Wrap Up
• Exploit sparsity when computing the Cholesky factorization and related triangular
solves of a banded matrix.
• Derive the cost for a Cholesky factorization and related triangular solves of a banded
matrix.
• Utilize nested dissection to reduce fill-in when computing the Cholesky factorization
and related triangular solves of a sparse matrix.
• Connect sparsity patterns in a matrix to the graph that describes that sparsity pattern.
YouTube: https://www.youtube.com/watch?v=UX6Z6q1_prs
It is tempting to simply use a dense linear solver to compute the solution to Ax = b via,
for example, LU or Cholesky factorization, even when A is sparse. This would require O(n3 )
operations, where n equals the size of matrix A. What we see in this unit is that we can
take advantage of a "banded" structure in the matrix to greatly reduce the computational
cost.
Homework 7.2.1.1 The 1D equivalent of the example from Subsection 7.1.1 is given by the tridiagonal matrix in (7.2.1). Consider the corresponding homogeneous problem (all right-hand side values equal to zero). If we introduce two extra variables χ_{−1} = 0 and χ_n = 0, the problem for all χ_i, 0 ≤ i < n, becomes
−χ_{i−1} + 2χ_i − χ_{i+1} = 0,
or, equivalently,
χ_i = ( χ_{i−1} + χ_{i+1} ) / 2.
Reason through what would happen if any χ_i is not equal to zero.
Solution. Building on the hint: Let's say that χ_i ≠ 0 while χ_{−1}, ..., χ_{i−1} all equal zero; without loss of generality assume χ_i > 0. Then
χ_i = ( χ_{i−1} + χ_{i+1} ) / 2 = χ_{i+1} / 2
and hence
χ_{i+1} = 2χ_i > 0.
Next,
χ_{i+1} = ( χ_i + χ_{i+2} ) / 2 = 2χ_i
and hence
χ_{i+2} = 4χ_i − χ_i = 3χ_i > 0.
Continuing this argument, the solution to the recurrence relation is χ_n = (n − i + 1)χ_i, and you find that χ_n > 0, which is a contradiction (since χ_n = 0).
This course covers topics in a "circular" way, where sometimes we introduce and use
results that we won’t formally cover until later in the course. Here is one such situation. In
a later week you will prove these relevant results involving eigenvalues:
• A symmetric matrix is symmetric positive definite (SPD) if and only if its eigenvalues
are positive.
• The Gershgorin Disk Theorem tells us that the matrix in (7.2.1) has nonnegative
eigenvalues.
These insights, together with Homework 7.2.1.1, tell us that the matrix in (7.2.1) is SPD.
Homework 7.2.1.2 Compute the Cholesky factor of
\[
A = \begin{pmatrix} 4 & -2 & 0 & 0 \\ -2 & 5 & -2 & 0 \\ 0 & -2 & 10 & 6 \\ 0 & 0 & 6 & 5 \end{pmatrix}.
\]
Answer.
\[
L = \begin{pmatrix} 2 & 0 & 0 & 0 \\ -1 & 2 & 0 & 0 \\ 0 & -1 & 3 & 0 \\ 0 & 0 & 2 & 1 \end{pmatrix}.
\]
Homework 7.2.1.3 Let A ∈ R^{n×n} be tridiagonal and SPD so that
\[
A = \begin{pmatrix}
α_{0,0} & α_{1,0} & & & \\
α_{1,0} & α_{1,1} & α_{2,1} & & \\
 & ⋱ & ⋱ & ⋱ & \\
 & & α_{n-2,n-3} & α_{n-2,n-2} & α_{n-1,n-2} \\
 & & & α_{n-1,n-2} & α_{n-1,n-1}
\end{pmatrix}.   (7.2.2)
\]
• Propose a Cholesky factorization algorithm that exploits the structure of this matrix.
• What is the cost? (Count square roots, divides, multiplies, and subtractions.)
• What would have been the (approximate) cost if we had not taken advantage of the
tridiagonal structure?
Solution.
• If you play with a few smaller examples, you can conjecture that the Cholesky factor of (7.2.2) is a bidiagonal matrix (the main diagonal plus the first subdiagonal). Thus, A = LL^T translates to
\[
\begin{pmatrix}
α_{0,0} & α_{1,0} & & \\
α_{1,0} & α_{1,1} & α_{2,1} & \\
 & ⋱ & ⋱ & ⋱ \\
 & & α_{n-1,n-2} & α_{n-1,n-1}
\end{pmatrix}
=
\begin{pmatrix}
λ_{0,0} & & & \\
λ_{1,0} & λ_{1,1} & & \\
 & ⋱ & ⋱ & \\
 & & λ_{n-1,n-2} & λ_{n-1,n-1}
\end{pmatrix}
\begin{pmatrix}
λ_{0,0} & λ_{1,0} & & \\
 & λ_{1,1} & λ_{2,1} & \\
 & & ⋱ & ⋱ \\
 & & & λ_{n-1,n-1}
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
λ_{0,0} λ_{0,0} & λ_{0,0} λ_{1,0} & & \\
λ_{1,0} λ_{0,0} & λ_{1,0} λ_{1,0} + λ_{1,1} λ_{1,1} & λ_{1,1} λ_{2,1} & \\
 & ⋱ & ⋱ & ⋱ \\
 & & λ_{n-1,n-2} λ_{n-2,n-2} & λ_{n-1,n-2} λ_{n-1,n-2} + λ_{n-1,n-1} λ_{n-1,n-1}
\end{pmatrix}.
\]
With this insight, the algorithm that overwrites A with its Cholesky factor is given by
for i = 0, ..., n − 2
  α_{i,i} := √α_{i,i}
  α_{i+1,i} := α_{i+1,i} / α_{i,i}
  α_{i+1,i+1} := α_{i+1,i+1} − α_{i+1,i} α_{i+1,i}
endfor
α_{n−1,n−1} := √α_{n−1,n−1}
• A cost analysis shows that this requires n square roots, n − 1 divides, n − 1 multiplies, and n − 1 subtracts.
• The cost, had we not taken advantage of the special structure, would have been (approximately) (1/3)n³ flops.
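A MATLAB sketch of this tridiagonal Cholesky factorization is given below; the function name is ours, and the routine assumes A is tridiagonal and SPD.

function A = TriDiag_Chol( A )
% Overwrite the lower triangular part of tridiagonal SPD matrix A with its
% (bidiagonal) Cholesky factor; only the diagonal and first subdiagonal
% are referenced and updated.
  n = size( A, 1 );
  for i = 1:n-1
    A( i, i )     = sqrt( A( i, i ) );
    A( i+1, i )   = A( i+1, i ) / A( i, i );
    A( i+1, i+1 ) = A( i+1, i+1 ) - A( i+1, i ) * A( i+1, i );
  end
  A( n, n ) = sqrt( A( n, n ) );
end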
Homework 7.2.1.4 Propose an algorithm for overwriting y with the solution to Ax = y for
the SPD matrix in Homework 7.2.1.3.
Solution.
• Use the algorithm from Homework 7.2.1.3 to overwrite A with its Cholesky factor L (stored in the lower triangular part of A).
• Overwrite y with the solution z of Lz = y (forward substitution):
for i = 0, ..., n − 2
  ψ_i := ψ_i / α_{i,i}
  ψ_{i+1} := ψ_{i+1} − α_{i+1,i} ψ_i
endfor
ψ_{n−1} := ψ_{n−1} / α_{n−1,n−1}
• And so forth: a similar loop, traversed in reverse order, overwrites y with the solution of L^T x = z.
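The corresponding forward and back substitutions can be sketched in MATLAB as follows; again, the function name is ours, and A is assumed to have been overwritten by TriDiag_Chol above.

function y = TriDiag_Solve( A, y )
% Given A overwritten with its bidiagonal Cholesky factor L, overwrite y
% with the solution of L * L' * x = y.
  n = size( A, 1 );
  for i = 1:n-1                       % forward substitution with L: z = L \ y
    y( i )   = y( i ) / A( i, i );
    y( i+1 ) = y( i+1 ) - A( i+1, i ) * y( i );
  end
  y( n ) = y( n ) / A( n, n );        % completes the forward substitution
  y( n ) = y( n ) / A( n, n );        % back substitution with L': x = L' \ z
  for i = n-1:-1:1
    y( i ) = ( y( i ) - A( i+1, i ) * y( i+1 ) ) / A( i, i );
  end
end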
Let’s see how to take advantage of the zeroes in a matrix with bandwidth b, focusing on
SPD matrices.
Definition 7.2.1.1 The half-band width of a symmetric matrix equals the number of sub-
diagonals beyond which all the matrix contains only zeroes. For example, a diagonal matrix
has half-band width of zero and a tridiagonal matrix has a half-band width of one. ⌃
Homework 7.2.1.5 Assume the SPD matrix A ∈ R^{m×m} has a bandwidth of b. Propose a modification of the following right-looking Cholesky factorization algorithm
A = Chol-right-looking(A)

Partition A → ( A_{TL} A_{TR} ; A_{BL} A_{BR} )
  where A_{TL} is 0 × 0
while n(A_{TL}) < n(A)
  Repartition
    ( A_{TL} A_{TR} ; A_{BL} A_{BR} ) → ( A_{00} a_{01} A_{02} ; a_{10}^T α_{11} a_{12}^T ; A_{20} a_{21} A_{22} )
  α_{11} := √α_{11}
  a_{21} := a_{21} / α_{11}
  A_{22} := A_{22} − a_{21} a_{21}^T   (updating only the lower triangular part)
  Continue with
    ( A_{TL} A_{TR} ; A_{BL} A_{BR} ) ← ( A_{00} a_{01} A_{02} ; a_{10}^T α_{11} a_{12}^T ; A_{20} a_{21} A_{22} )
endwhile
that takes advantage of the zeroes in the matrix. (You will want to draw yourself a picture.)
What is its approximate cost in flops (when m is large)?
Solution. See the below video.
YouTube: https://www.youtube.com/watch?v=r1P4Ze7Yqe0
The purpose of the game is to limit fill-in, which happens when zeroes turn into nonze-
roes. With an example that would result from, for example, Poisson’s equation, we will
illustrate the basic techniques, which are known as "nested dissection."
If you consider the mesh that results from the discretization of, for example, a square
domain, the numbering of the mesh points does not need to be according to the "natural
ordering" we chose to use before. As we number the mesh points, we reorder (permute) both
the columns of the matrix (which correspond to the elements ‚i to be computed) and the
equations that tell one how ‚i is computed from its neighbors. If we choose a separator, the
points highlighted in red in Figure 7.2.2.1 (Top-Left), and order the mesh points to its left
first, then the ones to its right, and finally the points in the separator, we create a pattern
of zeroes, as illustrated in Figure 7.2.2.1 (Top-Right).
• What special structure does the Cholesky factor of this matrix have?
• How can the different parts of the Cholesky factor be computed in a way that takes
advantage of the zero blocks?
• How do you take advantage of the zero pattern when solving with the Cholesky factors?
Solution.
• The Cholesky factor of this matrix has the structure
\[
L = \begin{pmatrix} L_{00} & 0 & 0 \\ 0 & L_{11} & 0 \\ L_{20} & L_{21} & L_{22} \end{pmatrix},
\]
where the ⋆s in the partitioned matrix indicate "symmetric parts" that don't play a role. We deduce that the following steps will yield the Cholesky factor:
  ◦ Compute the Cholesky factor L_{00} of A_{00}.
  ◦ Compute the Cholesky factor L_{11} of A_{11}.
  ◦ Solve X L_{00}^T = A_{20} for X, overwriting A_{20} with the result L_{20}.
  ◦ Solve X L_{11}^T = A_{21} for X, overwriting A_{21} with the result L_{21}. (This is a triangular solve with multiple right-hand sides in disguise.)
  ◦ Update the lower triangular part of A_{22} with A_{22} − L_{20} L_{20}^T − L_{21} L_{21}^T and compute its Cholesky factor L_{22}.
• Solving Lz = y can be accomplished via the steps
  ◦ Solve L_{00} z_0 = y_0.
  ◦ Solve L_{11} z_1 = y_1.
  ◦ Solve L_{22} z_2 = y_2 − L_{20} z_0 − L_{21} z_1.
Similarly,
\[
\begin{pmatrix} L_{00}^T & 0 & L_{20}^T \\ 0 & L_{11}^T & L_{21}^T \\ 0 & 0 & L_{22}^T \end{pmatrix}
\begin{pmatrix} x_0 \\ x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \end{pmatrix}
\]
can be solved via the steps
  ◦ Solve L_{22}^T x_2 = z_2.
  ◦ Solve L_{11}^T x_1 = z_1 − L_{21}^T x_2.
  ◦ Solve L_{00}^T x_0 = z_0 − L_{20}^T x_2.
YouTube: https://www.youtube.com/watch?v=mwX0wPRdw7U
Each of the three subdomains that were created in Figure 7.2.2.1 can themselves be
reordered by identifying separators. In Figure 7.2.2.2 we illustrate this only for the left and
right subdomains. This creates a recursive structure in the matrix. Hence, the name nested
dissection for this approach.
7.2.3 Observations
Through an example, we have illustrated the following insights regarding the direct solution
of sparse linear systems:
• There is a one-to-one correspondence between links in the graph that shows how mesh
points are influenced by other mesh points (connectivity) and nonzeroes in the matrix.
If the graph is undirected, then the sparsity in the matrix is symmetric (provided the
unknowns are ordered in the same order as the equations that relate the unknowns to
their neighbors). If the graph is directed, then the matrix has a nonsymmetric sparsity
pattern.
These observations relate the problem of reducing fill-in to the problem of partitioning the graph by identifying a separator. The smaller the number of mesh points in the separator (the interface), the smaller the submatrix that corresponds to it and the less fill-in will occur related to this dissection.
Remark 7.2.3.1 Importantly: one can start with a mesh and manipulate it into a matrix
or one can start with a matrix and have its sparsity pattern prescribe the graph.
YouTube: https://www.youtube.com/watch?v=OMbxk1ihIFo
Let's review what we saw in Subsection 7.1.1. The linear system Au = f leads, at each mesh point, to an update of the form
υ_i = ( h² φ_i + υ_{i−N} + υ_{i−1} + υ_{i+1} + υ_{i+N} ) / 4,
modified appropriately for points adjacent to the boundary. Let's label the value of υ_i during the kth iteration with υ_i^{(k)} and state the algorithm more explicitly as
for k = 0, ..., until convergence
  for i = 0, ..., N × N − 1
    υ_i^{(k+1)} = ( h² φ_i + υ_{i−N}^{(k)} + υ_{i−1}^{(k)} + υ_{i+1}^{(k)} + υ_{i+N}^{(k)} ) / 4
  endfor
endfor
again, modified appropriately for points adjacent to the boundary. The superscripts are there
to emphasize the iteration during which a value is updated. In practice, only the values for
iteration k and k + 1 need to be stored. We can also capture the algorithm with a vector
and matrix as
4υ_0^{(k+1)} = υ_1^{(k)} + υ_4^{(k)} + h²φ_0
4υ_1^{(k+1)} = υ_0^{(k)} + υ_2^{(k)} + υ_5^{(k)} + h²φ_1
4υ_2^{(k+1)} = υ_1^{(k)} + υ_3^{(k)} + υ_6^{(k)} + h²φ_2
4υ_3^{(k+1)} = υ_2^{(k)} + υ_7^{(k)} + h²φ_3
4υ_4^{(k+1)} = υ_0^{(k)} + υ_5^{(k)} + υ_8^{(k)} + h²φ_4
  ⋮
which can be written in matrix form as
\[
\begin{pmatrix} 4 & & & \\ & 4 & & \\ & & 4 & \\ & & & ⋱ \end{pmatrix}
\begin{pmatrix} υ_0^{(k+1)} \\ υ_1^{(k+1)} \\ υ_2^{(k+1)} \\ υ_3^{(k+1)} \\ υ_4^{(k+1)} \\ υ_5^{(k+1)} \\ υ_6^{(k+1)} \\ υ_7^{(k+1)} \\ υ_8^{(k+1)} \\ ⋮ \end{pmatrix}
=
\begin{pmatrix}
 0 & 1 &   &   & 1 &   &   &   &   &        \\
 1 & 0 & 1 &   &   & 1 &   &   &   &        \\
   & 1 & 0 & 1 &   &   & 1 &   &   &        \\
   &   & 1 & 0 &   &   &   & 1 &   &        \\
 1 &   &   &   & 0 & 1 &   &   & 1 &        \\
   & 1 &   &   & 1 & 0 & 1 &   &   & \ddots \\
   &   & 1 &   &   & 1 & 0 & 1 &   &        \\
   &   &   & 1 &   &   & 1 & 0 &   &        \\
   &   &   &   & 1 &   &   &   & 0 & \ddots \\
   &   &   &   &   & \ddots &  &   & \ddots & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0^{(k)} \\ υ_1^{(k)} \\ υ_2^{(k)} \\ υ_3^{(k)} \\ υ_4^{(k)} \\ υ_5^{(k)} \\ υ_6^{(k)} \\ υ_7^{(k)} \\ υ_8^{(k)} \\ ⋮ \end{pmatrix}
+
\begin{pmatrix} h²φ_0 \\ h²φ_1 \\ h²φ_2 \\ h²φ_3 \\ h²φ_4 \\ h²φ_5 \\ h²φ_6 \\ h²φ_7 \\ h²φ_8 \\ ⋮ \end{pmatrix}.   (7.3.1)
\]
YouTube: https://www.youtube.com/watch?v=7rDvET9_nek
How can we capture this more generally?
• We wish to solve Ax = y. We write A as the difference of its diagonal, M = D, and the negative of its off-diagonal part, N = D − A, so that
A = D − (D − A) = M − N.
Then Ax = y is equivalent to Mx = Nx + y, or
x = M^{−1}( N x + y ).
• If we now let x^{(k)} hold the values of our vector x in the current step, then the values after all elements have been updated are given by the vector
x^{(k+1)} = M^{−1}( N x^{(k)} + y ).
• All we now need is an initial guess for the solution, x^{(0)}, and we are ready to iteratively solve the linear system by computing x^{(1)}, x^{(2)}, etc., until we (approximately) reach a fixed point where x^{(k+1)} = M^{−1}( N x^{(k)} + y ) ≈ x^{(k)}.
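A minimal MATLAB sketch of such a splitting method (here with the Jacobi choice M = D) is given below; the test matrix, right-hand side, and iteration count are our own choices for illustration.

n = 20;
A = gallery( 'tridiag', n, -1, 2, -1 );     % SPD tridiagonal test matrix
y = A * ones( n, 1 );                       % so that the exact solution is all ones
M = diag( diag( A ) );                      % Jacobi splitting: M = D
N = M - A;                                  % N = D - A
x = zeros( n, 1 );                          % initial guess
for k = 1:500
  x = M \ ( N * x + y );                    % x^(k+1) = M^{-1}( N x^(k) + y )
end
fprintf( 'Jacobi: || x - exact ||_2 = %e after 500 iterations\n', ...
         norm( x - ones( n, 1 ) ) );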
YouTube: https://www.youtube.com/watch?v=ufMUhO1vDew
A variation on the Jacobi iteration is the Gauss-Seidel iteration. It recognizes that since
values at points are updated in some order, if a neighboring value has already been updated
earlier in the current step, then you might as well use that updated value. For our example
from Subsection 7.1.1 this is captured by the algorithm
for k = 0, ..., until convergence
  for i = 0, ..., N × N − 1
    υ_i^{(k+1)} = ( h² φ_i + υ_{i−N}^{(k+1)} + υ_{i−1}^{(k+1)} + υ_{i+1}^{(k)} + υ_{i+N}^{(k)} ) / 4
  endfor
endfor
modified appropriately for points adjacent to the boundary. This algorithm exploits the fact that υ_{i−N}^{(k+1)} and υ_{i−1}^{(k+1)} have already been computed by the time υ_i^{(k+1)} is updated. Once again, the superscripts are there to emphasize the iteration during which a value is updated. In practice, the superscripts can be dropped because of the order in which the computation happens.
Homework 7.3.2.1 Modify the code for Homework 7.1.1.1 ( what you now know as the
Jacobi iteration) to implement the Gauss-Seidel iteration.
Solution. Assignments/Week07/answers/Poisson_GS_iteration.m.
When you execute the script, in the COMMAND WINDOW enter "RETURN" to advance
to the next iteration.
You may also want to observe the Jacobi and Gauss-Seidel iterations in action side-by-side
in Assignments/Week07/answers/Poisson_Jacobi_vs_GS_iteration.m.
Homework 7.3.2.2 Here we repeat (7.3.1) for Jacobi's iteration applied to the example in Subsection 7.1.1:
\[
\begin{pmatrix} 4 & & & \\ & 4 & & \\ & & 4 & \\ & & & ⋱ \end{pmatrix}
\begin{pmatrix} υ_0^{(k+1)} \\ υ_1^{(k+1)} \\ υ_2^{(k+1)} \\ υ_3^{(k+1)} \\ υ_4^{(k+1)} \\ υ_5^{(k+1)} \\ υ_6^{(k+1)} \\ υ_7^{(k+1)} \\ υ_8^{(k+1)} \\ ⋮ \end{pmatrix}
=
\begin{pmatrix}
 0 & 1 &   &   & 1 &   &   &   &   &        \\
 1 & 0 & 1 &   &   & 1 &   &   &   &        \\
   & 1 & 0 & 1 &   &   & 1 &   &   &        \\
   &   & 1 & 0 &   &   &   & 1 &   &        \\
 1 &   &   &   & 0 & 1 &   &   & 1 &        \\
   & 1 &   &   & 1 & 0 & 1 &   &   & \ddots \\
   &   & 1 &   &   & 1 & 0 & 1 &   &        \\
   &   &   & 1 &   &   & 1 & 0 &   &        \\
   &   &   &   & 1 &   &   &   & 0 & \ddots \\
   &   &   &   &   & \ddots &  &   & \ddots & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0^{(k)} \\ υ_1^{(k)} \\ υ_2^{(k)} \\ υ_3^{(k)} \\ υ_4^{(k)} \\ υ_5^{(k)} \\ υ_6^{(k)} \\ υ_7^{(k)} \\ υ_8^{(k)} \\ ⋮ \end{pmatrix}
+
\begin{pmatrix} h²φ_0 \\ h²φ_1 \\ h²φ_2 \\ h²φ_3 \\ h²φ_4 \\ h²φ_5 \\ h²φ_6 \\ h²φ_7 \\ h²φ_8 \\ ⋮ \end{pmatrix}.   (7.3.2)
\]
Modify this equation so that it instead reflects one iteration of the Gauss-Seidel method.
Solution.
\[
\begin{pmatrix}
 4 &    &    &    &    &    &    &    &    &        \\
-1 &  4 &    &    &    &    &    &    &    &        \\
   & -1 &  4 &    &    &    &    &    &    &        \\
   &    & -1 &  4 &    &    &    &    &    &        \\
-1 &    &    &    &  4 &    &    &    &    &        \\
   & -1 &    &    & -1 &  4 &    &    &    &        \\
   &    & -1 &    &    & -1 &  4 &    &    &        \\
   &    &    & -1 &    &    & -1 &  4 &    &        \\
   &    &    &    & -1 &    &    &    &  4 &        \\
   &    &    &    &    & \ddots &  &   & \ddots & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0^{(k+1)} \\ υ_1^{(k+1)} \\ υ_2^{(k+1)} \\ υ_3^{(k+1)} \\ υ_4^{(k+1)} \\ υ_5^{(k+1)} \\ υ_6^{(k+1)} \\ υ_7^{(k+1)} \\ υ_8^{(k+1)} \\ ⋮ \end{pmatrix}
:=
\begin{pmatrix}
 0 & 1 &   &   & 1 &   &   &   &   &        \\
   & 0 & 1 &   &   & 1 &   &   &   &        \\
   &   & 0 & 1 &   &   & 1 &   &   &        \\
   &   &   & 0 &   &   &   & 1 &   &        \\
   &   &   &   & 0 & 1 &   &   & 1 &        \\
   &   &   &   &   & 0 & 1 &   &   & \ddots \\
   &   &   &   &   &   & 0 & 1 &   &        \\
   &   &   &   &   &   &   & 0 &   &        \\
   &   &   &   &   &   &   &   & 0 & \ddots \\
   &   &   &   &   &   &   &   &   & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0^{(k)} \\ υ_1^{(k)} \\ υ_2^{(k)} \\ υ_3^{(k)} \\ υ_4^{(k)} \\ υ_5^{(k)} \\ υ_6^{(k)} \\ υ_7^{(k)} \\ υ_8^{(k)} \\ ⋮ \end{pmatrix}
+
\begin{pmatrix} h²φ_0 \\ h²φ_1 \\ h²φ_2 \\ h²φ_3 \\ h²φ_4 \\ h²φ_5 \\ h²φ_6 \\ h²φ_7 \\ h²φ_8 \\ ⋮ \end{pmatrix}.
\]
This homework suggests the following:
• We wish to solve Ax = y. We write symmetric A as
A = (D − L) − L^T,
where −L equals the strictly lower triangular part of A and D is its diagonal, so that M = D − L and N = L^T.
• If we now let x^{(k)} hold the values of our vector x in the current step, then the values after all elements have been updated are given by the vector
x^{(k+1)} = (D − L)^{−1}( L^T x^{(k)} + y ).
Homework 7.3.2.3 When the Gauss-Seidel iteration is used to solve Ax = y, where A ∈ R^{n×n}, it computes entries of x^{(k+1)} in the forward order χ_0^{(k+1)}, χ_1^{(k+1)}, .... If A = D − L − L^T, this is captured by
(D − L) x^{(k+1)} = L^T x^{(k)} + y.   (7.3.3)
Modify (7.3.3) to yield a "reverse" Gauss-Seidel method that computes the entries of vector x^{(k+1)} in the order χ_{n−1}^{(k+1)}, χ_{n−2}^{(k+1)}, ....
Solution. The reverse order is given by χ_{n−1}^{(k+1)}, χ_{n−2}^{(k+1)}, .... This corresponds to the splitting M = D − L^T and N = L so that
(D − L^T) x^{(k+1)} = L x^{(k)} + y.
Homework 7.3.2.4 A "symmetric" Gauss-Seidel iteration to solve symmetric Ax = y, where A ∈ R^{n×n}, alternates between computing entries in forward and reverse order. In other words, if A = M_F − N_F for the forward Gauss-Seidel method and A = M_R − N_R for the reverse Gauss-Seidel method, then
M_F x^{(k+1/2)} = N_F x^{(k)} + y
M_R x^{(k+1)} = N_R x^{(k+1/2)} + y
constitutes one iteration of this symmetric Gauss-Seidel iteration. Determine M and N such that
M x^{(k+1)} = N x^{(k)} + y
equals one iteration of the symmetric Gauss-Seidel iteration.
(You may want to follow the hint...)
Hint.
• From this unit and the last homework, we know that M_F = (D − L), N_F = L^T, M_R = (D − L^T), and N_R = L.
• Show that
Solution.
• From this unit and the last homework, we know that M_F = (D − L), N_F = L^T, M_R = (D − L^T), and N_R = L.
• Show that
Multiplying out the right-hand side and factoring out y yields the desired result. Along the way, the identity
I + L(D − L)^{−1}
= (D − L)(D − L)^{−1} + L(D − L)^{−1}
= (D − L + L)(D − L)^{−1}
= D(D − L)^{−1}
is useful.
YouTube: https://www.youtube.com/watch?v=L6PZhc-G7cE
The Jacobi and Gauss-Seidel iterations can be generalized as follows. Split matrix A = M − N where M is nonsingular. Now,
(M − N) x = y
is equivalent to
M x = N x + y
and
x = M^{−1}( N x + y ).
This is an example of a fixed-point equation: Plug x into M^{−1}(Nx + y) and the result is again x. The iteration is then created by viewing the vector on the left as the next approximation to the solution given the current approximation x on the right:
x^{(k+1)} = M^{−1}( N x^{(k)} + y ).
Let A = D − L − U, where −L, D, and −U are the strictly lower triangular, diagonal, and strictly upper triangular parts of A. In practice, the iteration is implemented as
M x^{(k+1)} = N x^{(k)} + y,
with which we emphasize that we solve with M rather than inverting it.
Homework 7.3.3.1 Why are the choices of M and N used by the Jacobi iteration and Gauss-Seidel iteration convenient?
Solution. Both methods have two advantages:
• Solving with M is cheap: M is diagonal for the Jacobi iteration and lower triangular for the Gauss-Seidel iteration.
• The multiplication N x^{(k)} can exploit sparsity in the original matrix A.
Solution.
x^{(k)} + M^{−1} r^{(k)}
= x^{(k)} + M^{−1}( y − A x^{(k)} )
= x^{(k)} + M^{−1}( y − (M − N) x^{(k)} )
= x^{(k)} + M^{−1} y − M^{−1}(M − N) x^{(k)}
= x^{(k)} + M^{−1} y − ( I − M^{−1} N ) x^{(k)}
= M^{−1}( N x^{(k)} + y ).
This last exercise provides an important link between iterative refinement, discussed in Subsection 5.3.7, and splitting methods. Let us revisit this, using the notation from this section.
If Ax = y and x^{(k)} is a (current) approximation to x, then
r^{(k)} = y − A x^{(k)}
is its residual. If we could solve
A δx^{(k)} = r^{(k)}
exactly, then x^{(k)} + δx^{(k)} would equal x. If we instead solve M δx^{(k)} = r^{(k)}, where M ≈ A, then
x^{(k+1)} = x^{(k)} + δx^{(k)}
is merely a (hopefully better) approximation to x. So, the better M approximates A, the faster we can expect x^{(k)} to converge to x.
With this in mind, we notice that if A = D − L − U, where D, −L, and −U equal its diagonal, strictly lower triangular, and strictly upper triangular parts, and we split A = M − N, then M = D − L is a better approximation to matrix A than is M = D.
Ponder This 7.3.3.3 Given these insights, why might the symmetric Gauss-Seidel method
discussed in Homework 7.3.2.4 have benefits over the regular Gauss-Seidel method?
A sequence of scalars, χ^{(k)}, converges to the scalar χ if
lim_{k→∞} χ^{(k)} = χ.
A sequence of vectors, x^{(k)}, converges to the vector x if for some norm ‖·‖
lim_{k→∞} ‖x^{(k)} − x‖ = 0.
Because of the equivalence of norms, if the sequence converges in one norm, it converges in all norms. In particular, it means it converges in the ∞-norm, which means that max_i |χ_i^{(k)} − χ_i| converges to zero, and hence for all entries |χ_i^{(k)} − χ_i| eventually becomes arbitrarily small.
Finally, a sequence of matrices, A^{(k)}, converges to the matrix A if for some norm ‖·‖
lim_{k→∞} ‖A^{(k)} − A‖ = 0.
Again, if it converges for one norm, it converges for all norms and the individual elements of A^{(k)} converge to the corresponding elements of A.
Let's now look at the convergence of splitting methods. If x solves Ax = y and x^{(k)} is the sequence of vectors generated starting with x^{(0)}, then
M x = N x + y
M x^{(k+1)} = N x^{(k)} + y
so that
M ( x^{(k+1)} − x ) = N ( x^{(k)} − x )
or, equivalently,
x^{(k+1)} − x = ( M^{−1} N )( x^{(k)} − x ).
This, in turn, means that
x^{(k+1)} − x = ( M^{−1} N )^{k+1} ( x^{(0)} − x ),
so that, for any vector norm ‖·‖ and its induced matrix norm,
‖x^{(k+1)} − x‖ ≤ ‖M^{−1} N‖^{k+1} ‖x^{(0)} − x‖.
Hence, if ‖M^{−1}N‖ < 1 in that norm, then lim_{i→∞} ‖M^{−1}N‖^i = 0 and hence x^{(k)} converges to x. We summarize this in the following theorem:
Theorem 7.3.3.1 Let A ∈ R^{n×n} be nonsingular and x, y ∈ R^n so that Ax = y. Let A = M − N be a splitting of A, x^{(0)} be given (an initial guess), and x^{(k+1)} = M^{−1}( N x^{(k)} + y ). If ‖M^{−1}N‖ < 1 for some matrix norm induced by the ‖·‖ vector norm, then x^{(k)} will converge to the solution x.
Because of the equivalence of matrix norms, if we can find any matrix norm ||| · ||| such that |||M^{−1}N||| < 1, the sequence of vectors converges.
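One can check the hypothesis of Theorem 7.3.3.1 numerically for a given matrix and splitting. The sketch below does so for the Jacobi and Gauss-Seidel splittings of a small test matrix; the matrix and the choice of the 2-norm are our own, and the spectral radius printed alongside anticipates the discussion that follows.

n = 20;
A = full( gallery( 'tridiag', n, -1, 2, -1 ) );
D = diag( diag( A ) );  L = -tril( A, -1 );  U = -triu( A, 1 );
B_jacobi = D \ ( L + U );              % M^{-1} N for the Jacobi splitting
B_gs     = ( D - L ) \ U;              % M^{-1} N for the Gauss-Seidel splitting
fprintf( 'Jacobi:       ||M^{-1}N||_2 = %6.4f, rho = %6.4f\n', ...
         norm( B_jacobi ), max( abs( eig( B_jacobi ) ) ) );
fprintf( 'Gauss-Seidel: ||M^{-1}N||_2 = %6.4f, rho = %6.4f\n', ...
         norm( B_gs ), max( abs( eig( B_gs ) ) ) );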
Ponder This 7.3.3.4 Contemplate the finer points of the last argument about the convergence of (M^{−1}N)^i.
YouTube: https://www.youtube.com/watch?v=uv8cMeR9u_U
Understanding the following observation will have to wait until after we cover eigenvalues and eigenvectors, later in the course. For splitting methods, it is the spectral radius of a matrix (the magnitude of the eigenvalue with largest magnitude), ρ(B), that often gives us insight into whether the method converges. This, once again, requires us to use a result from a future week in this course: It can be shown that for all B ∈ R^{m×m} and ε > 0 there exists a norm ||| · |||_{B,ε} such that |||B|||_{B,ε} ≤ ρ(B) + ε. What this means is that if we can show that ρ(M^{−1}N) < 1, then the splitting method converges for the given matrix A.
Homework 7.3.3.5 Given nonsingular A ∈ R^{n×n}, what splitting A = M − N will give the fastest convergence to the solution of Ax = y?
Solution. M = A and N = 0. Then, regardless of the initial vector x^{(0)},
x^{(1)} = M^{−1}( N x^{(0)} + y ) = A^{−1} y = x,
so that convergence occurs in one iteration.
where any term involving a zero is skipped. We label this ‰i with ‰i in our
subsequent discussion.
What if we pick our next value a bit further:
(k+1) GS (k+1) (k)
‰i = ʉi + (1 ≠ Ê)‰i ,
where Ê Ø 1. This is known as over-relaxation. Then
GS (k+1) 1 (k+1) 1 ≠ Ê (k)
‰i = ‰i ≠ ‰i
Ê Ê
and 5 6
i≠1
ÿ (k+1) 1 (k+1) 1 ≠ Ê (k) n≠1
ÿ (k)
–i,j ‰j + –i,i ‰i ≠ ‰i = ≠ –i,j ‰j + Âi
j=0 Ê Ê j=i+1
or, equivalently,
i≠1
ÿ (k+1) 1 (k+1) 1≠Ê (k)
n≠1
ÿ (k)
–i,j ‰j + –i,i ‰i = –i,i ‰i ≠ –i,j ‰j + Âi .
j=0 Ê Ê j=i+1
where the R stands for "Reverse." The symmetric successive over-relaxation (SSOR) iteration
combines the "forward" SOR with a "reverse" SOR, much like the symmetric Gauss-Seidel
does: 1
x(k+ 2 ) = MF≠1 (NF x(k) + y)
1
x(k+1) = MR≠1 (NR x(k+ 2 ) + y).
This can be expressed as splitting A = M ≠ N . The details are a bit messy, and we will skip
them.
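A minimal MATLAB sketch of the (forward) SOR iteration, using the splitting M_F = (1/ω)D − L and N_F = ((1−ω)/ω)D + U derived above, is given below; the test matrix, the value of ω, and the iteration count are our own choices.

n = 20;  omega = 1.5;
A = full( gallery( 'tridiag', n, -1, 2, -1 ) );
y = A * ones( n, 1 );                       % exact solution is all ones
D = diag( diag( A ) );  L = -tril( A, -1 );  U = -triu( A, 1 );
M = ( 1/omega ) * D - L;                    % note: M - N = D - L - U = A
N = ( ( 1 - omega ) / omega ) * D + U;
x = zeros( n, 1 );
for k = 1:100
  x = M \ ( N * x + y );
end
fprintf( 'SOR(omega=%3.1f): || x - exact ||_2 = %e\n', omega, norm( x - ones( n, 1 ) ) );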
7.4 Enrichments
7.4.1 Details!
To solve the problem computationally the problem is again discretized. Relating back to the
problem of the membrane on the unit square in the previous section, this means that the
continuous domain is viewed as a mesh instead, as illustrated in Figure 7.4.1.1.
Approximating the second derivatives in (7.1.1) with central differences, the equation becomes
\[
\frac{ -u(x − h, y) + 2u(x, y) − u(x + h, y) }{ h^2 } + \frac{ -u(x, y − h) + 2u(x, y) − u(x, y + h) }{ h^2 } = f(x, y)
\]
or, equivalently,
\[
\frac{ -u(x − h, y) − u(x, y − h) + 4u(x, y) − u(x + h, y) − u(x, y + h) }{ h^2 } = f(x, y).
\]
If (x, y) corresponds to the point i in a mesh where the interior points form an N × N grid, this translates to the system of linear equations
−υ_{i−N} − υ_{i−1} + 4υ_i − υ_{i+1} − υ_{i+N} = h² φ_i.
This can be rewritten as
υ_i = ( h² φ_i + υ_{i−N} + υ_{i−1} + υ_{i+1} + υ_{i+N} ) / 4
or
4υ_0 − υ_1 − υ_4 = h² φ_0
−υ_0 + 4υ_1 − υ_2 − υ_5 = h² φ_1
−υ_1 + 4υ_2 − υ_3 − υ_6 = h² φ_2
−υ_2 + 4υ_3 − υ_7 = h² φ_3
−υ_0 + 4υ_4 − υ_5 − υ_8 = h² φ_4
  ⋮
In matrix notation this becomes
\[
\begin{pmatrix}
 4 & -1 &    &    & -1 &    &    &    &    &        \\
-1 &  4 & -1 &    &    & -1 &    &    &    &        \\
   & -1 &  4 & -1 &    &    & -1 &    &    &        \\
   &    & -1 &  4 &    &    &    & -1 &    &        \\
-1 &    &    &    &  4 & -1 &    &    & -1 &        \\
   & -1 &    &    & -1 &  4 & -1 &    &    & \ddots \\
   &    & -1 &    &    & -1 &  4 & -1 &    &        \\
   &    &    & -1 &    &    & -1 &  4 &    &        \\
   &    &    &    & -1 &    &    &    &  4 & \ddots \\
   &    &    &    &    & \ddots &  &    & \ddots & \ddots
\end{pmatrix}
\begin{pmatrix} υ_0 \\ υ_1 \\ υ_2 \\ υ_3 \\ υ_4 \\ υ_5 \\ υ_6 \\ υ_7 \\ υ_8 \\ ⋮ \end{pmatrix}
=
\begin{pmatrix} h²φ_0 \\ h²φ_1 \\ h²φ_2 \\ h²φ_3 \\ h²φ_4 \\ h²φ_5 \\ h²φ_6 \\ h²φ_7 \\ h²φ_8 \\ ⋮ \end{pmatrix}.   (7.4.1)
\]
This demonstrates how solving the discretized Poisson’s equation boils down to the solution
of a linear system Au = h2 f , where A has a distinct sparsity pattern (pattern of nonzeros).
The iteration then proceeds by alternating between (simultaneously) updating all values
at the red points and (simultaneously) updating all values at the black points, always using
the most updated values.
YouTube: https://www.youtube.com/watch?v=WDsF7gaj4E4
SOR was first proposed in 1950 by David M. Young and Stanley P. Frankel. David
Young (1923-2008) was a colleague of ours at UT-Austin. His vanity license plate read "Dr.
SOR."
7.5 Wrap Up
7.5.1 Additional homework
Homework 7.5.1.1 In Subsection 7.3.4 we discussed SOR and SSOR. Research how to choose the relaxation parameter ω and then modify your implementation of Gauss-Seidel from Homework 7.3.2.1 to investigate the benefits.
7.5.2 Summary
Let A ∈ R^{n×n} be tridiagonal and SPD so that
\[
A = \begin{pmatrix}
α_{0,0} & α_{1,0} & & & \\
α_{1,0} & α_{1,1} & α_{2,1} & & \\
 & ⋱ & ⋱ & ⋱ & \\
 & & α_{n-2,n-3} & α_{n-2,n-2} & α_{n-1,n-2} \\
 & & & α_{n-1,n-2} & α_{n-1,n-1}
\end{pmatrix}.
\]
Then its Cholesky factor is given by
\[
\begin{pmatrix}
λ_{0,0} & & & & \\
λ_{1,0} & λ_{1,1} & & & \\
 & ⋱ & ⋱ & & \\
 & & λ_{n-2,n-3} & λ_{n-2,n-2} & \\
 & & & λ_{n-1,n-2} & λ_{n-1,n-1}
\end{pmatrix}.
\]
for i = 0, ..., n − 2
  ψ_i := ψ_i / α_{i,i}
  ψ_{i+1} := ψ_{i+1} − α_{i+1,i} ψ_i
endfor
ψ_{n−1} := ψ_{n−1} / α_{n−1,n−1}
A splitting method for solving Ax = b writes A = M − N, with M nonsingular, and iterates
M x^{(k+1)} = N x^{(k)} + b.
It converges if for some induced matrix norm
‖M^{−1} N‖ < 1.
Given A = D − L − U, where −L, D, and −U equal the strictly lower triangular, diagonal, and strictly upper triangular parts of A, commonly used splitting methods are
• Jacobi iteration: M = D, N = L + U.
• Gauss-Seidel iteration: M = D − L, N = U.
• Successive over-relaxation (SOR): M = (1/ω)D − L, N = ((1 − ω)/ω)D + U.
Week 8
Descent Methods
8.1 Opening
8.1.1 Solving linear systems by solving a minimization problem
YouTube: https://www.youtube.com/watch?v=--WEfBpj1Ts
Consider the quadratic polynomial
f(χ) = (1/2) α χ² − β χ.
Finding the value χ̂ that minimizes this polynomial can be accomplished via the steps:
• Compute the derivative and set it to zero:
f′(χ̂) = α χ̂ − β = 0.
We notice that computing χ̂ is equivalent to solving the linear system (of one equation)
α χ̂ = β.
This course does not have multivariate calculus as a prerequisite, so we will walk you through the basic results we will employ. We will focus on finding a solution to Ax = b where A is symmetric positive definite (SPD). (In our discussions we will just focus on real-valued problems.) Now, if
f(x) = (1/2) x^T A x − x^T b,
then its gradient equals
∇f(x) = A x − b.
The function f(x) is minimized (when A is SPD) when its gradient equals zero, which allows us to compute the vector for which the function achieves its minimum. The basic insight is that in order to solve Ax̂ = b we can instead find the vector x̂ that minimizes the function f(x) = (1/2) x^T A x − x^T b.
YouTube: https://www.youtube.com/watch?v=rh9GhwU1fuU
Theorem 8.1.1.1 Let A be SPD and assume that Ax̂ = b. Then the vector x̂ minimizes the function f(x) = (1/2) x^T A x − x^T b.
Proof. This proof does not employ multivariate calculus!
Let Ax̂ = b. Then
f(x)
=   < definition of f(x) >
(1/2) x^T A x − x^T b
=   < Ax̂ = b >
(1/2) x^T A x − x^T A x̂
=   < algebra >
(1/2) x^T A x − x^T A x̂ + (1/2) x̂^T A x̂ − (1/2) x̂^T A x̂
=   < factor out >
(1/2) ( x − x̂ )^T A ( x − x̂ ) − (1/2) x̂^T A x̂.
Since x̂^T A x̂ is independent of x, and A is SPD, this is clearly minimized when x = x̂. ⌅
8.1.2 Overview
• 8.1 Opening
¶ 8.1.2 Overview
¶ 8.1.3 What you will learn
• 8.4 Enrichments
• 8.5 Wrap Up
• Recognize that while in exact arithmetic the Conjugate Gradient Method solves Ax = b
in a finite number of iterations, in practice it is an iterative method due to error
introduced by floating point arithmetic.
• Accelerate the Method of Steepest Descent and Conjugate Gradient Method by applying a preconditioner, which implicitly defines a new problem with the same solution and a better condition number.
YouTube: https://www.youtube.com/watch?v=V7Cvihzs-n4
Remark 8.2.1.1 In the video, the quadratic polynomial pictured takes on the value −x̂^T A x̂ at x̂ and that minimum is below the x-axis. This does not change the conclusions that are drawn in the video.
The basic idea behind a descent method is that at the kth iteration one has an approximation to x, x^{(k)}, and one would like to create a better approximation, x^{(k+1)}. To do so, the method picks a search direction, p^{(k)}, and chooses the next approximation by taking a step from the current approximate solution in the direction of p^{(k)}:
x^{(k+1)} := x^{(k)} + α_k p^{(k)}.
In other words, one searches for a minimum along a line defined by the current iterate, x^{(k)}, and the search direction, p^{(k)}. One then picks α_k so that, preferably, f(x^{(k+1)}) ≤ f(x^{(k)}). This is summarized in Figure 8.2.1.2.
Given: A, b, x^{(0)}
r^{(0)} := b − A x^{(0)}
k := 0
while r^{(k)} ≠ 0
  p^{(k)} := next direction
  x^{(k+1)} := x^{(k)} + α_k p^{(k)} for some scalar α_k
  r^{(k+1)} := b − A x^{(k+1)}
  k := k + 1
endwhile
Figure 8.2.1.2 Outline for a descent method.
To this goal, typically, an exact descent method picks α_k to exactly minimize the function along the line from the current approximate solution in the direction of p^{(k)}:
f(x^{(k+1)})
=   < x^{(k+1)} = x^{(k)} + α_k p^{(k)} >
f( x^{(k)} + α_k p^{(k)} )
=   < evaluate >
(1/2) ( x^{(k)} + α_k p^{(k)} )^T A ( x^{(k)} + α_k p^{(k)} ) − ( x^{(k)} + α_k p^{(k)} )^T b
=   < multiply out >
(1/2) x^{(k)T} A x^{(k)} + α_k p^{(k)T} A x^{(k)} + (1/2) α_k² p^{(k)T} A p^{(k)} − x^{(k)T} b − α_k p^{(k)T} b
=   < rearrange >
(1/2) x^{(k)T} A x^{(k)} − x^{(k)T} b + (1/2) α_k² p^{(k)T} A p^{(k)} + α_k p^{(k)T} A x^{(k)} − α_k p^{(k)T} b
=   < substitute f(x^{(k)}) and factor out common terms >
f(x^{(k)}) + (1/2) α_k² p^{(k)T} A p^{(k)} + α_k p^{(k)T} ( A x^{(k)} − b )
= f(x^{(k)}) + (1/2) α_k² p^{(k)T} A p^{(k)} − α_k p^{(k)T} r^{(k)},
where r^{(k)} = b − A x^{(k)} is the residual. This is a quadratic polynomial in the scalar α_k (since this is the only free variable).
YouTube: https://www.youtube.com/watch?v=SA_VrhP7EZg
Minimizing
f(x^{(k+1)}) = (1/2) p^{(k)T} A p^{(k)} α_k² − p^{(k)T} r^{(k)} α_k + f(x^{(k)})
exactly requires the derivative with respect to α_k to be zero:
0 = d f( x^{(k)} + α_k p^{(k)} ) / d α_k = p^{(k)T} A p^{(k)} α_k − p^{(k)T} r^{(k)}.
Hence, for a given choice of p^{(k)},
α_k = p^{(k)T} r^{(k)} / ( p^{(k)T} A p^{(k)} )  and  x^{(k+1)} = x^{(k)} + α_k p^{(k)}
provides the next approximation to the solution. This leaves us with the question of how to pick the search directions {p^{(0)}, p^{(1)}, ...}.
A basic descent method based on these ideas is given in Figure 8.2.1.3.
Given: A, b, x^{(0)}
r^{(0)} := b − A x^{(0)}
k := 0
while r^{(k)} ≠ 0
  p^{(k)} := next direction
  α_k := p^{(k)T} r^{(k)} / ( p^{(k)T} A p^{(k)} )
  x^{(k+1)} := x^{(k)} + α_k p^{(k)}
  r^{(k+1)} := b − A x^{(k+1)}
  k := k + 1
endwhile
YouTube: https://www.youtube.com/watch?v=aBTI_EEQNKE
Even though matrices are often highly sparse, a major part of the cost of solving Ax = b
via descent methods is in the matrix-vector multiplication (a cost that is proportional to the
number of nonzeroes in the matrix). For this reason, reducing the number of these is an
important part of the design of the algorithm.
Homework 8.2.2.1 Let
x^{(k+1)} = x^{(k)} + α_k p^{(k)}
r^{(k)} = b − A x^{(k)}
r^{(k+1)} = b − A x^{(k+1)}.
Show that
r^{(k+1)} = r^{(k)} − α_k A p^{(k)}.
Solution.
r^{(k+1)} = b − A x^{(k+1)}
=   < r^{(k)} = b − A x^{(k)} >
r^{(k+1)} = r^{(k)} + A x^{(k)} − A x^{(k+1)}
=   < rearrange, factor >
r^{(k+1)} = r^{(k)} − A ( x^{(k+1)} − x^{(k)} )
=   < x^{(k+1)} = x^{(k)} + α_k p^{(k)} >
r^{(k+1)} = r^{(k)} − α_k A p^{(k)}.
Alternatively:
r^{(k+1)} = b − A x^{(k+1)}
=   < x^{(k+1)} = x^{(k)} + α_k p^{(k)} >
r^{(k+1)} = b − A ( x^{(k)} + α_k p^{(k)} )
=   < distribute >
r^{(k+1)} = b − A x^{(k)} − α_k A p^{(k)}
=   < definition of r^{(k)} >
r^{(k+1)} = r^{(k)} − α_k A p^{(k)}.
YouTube: https://www.youtube.com/watch?v=j00GS9mTgd8
With the insights from this last homework, we can reformulate our basic descent method
into one with only one matrix-vector multiplication, as illustrated in Figure 8.2.2.1.
Homework 8.2.2.2 For the loop in the algorithm in Figure 8.2.2.1 (Right), count the number of matrix-vector multiplications, dot products, and "axpy" operations (not counting the cost of determining the next descent direction).
Solution.
q^{(k)} := A p^{(k)}   1 mvmult
α_k := p^{(k)T} r^{(k)} / ( p^{(k)T} q^{(k)} )   2 dot products
x^{(k+1)} := x^{(k)} + α_k p^{(k)}   1 axpy
r^{(k+1)} := r^{(k)} − α_k q^{(k)}   1 axpy
Total: 1 mvmult, 2 dot products, 2 axpys
YouTube: https://www.youtube.com/watch?v=OGqV_hfaxJA
We finish our discussion regarding basic descent methods by observing that we don't need to keep the history of vectors x^{(k)}, p^{(k)}, r^{(k)}, q^{(k)}, and scalars α_k that were computed, as long as they are not needed to compute the next search direction, leaving us with the algorithm
Given: A, b, x
r := b − A x
while r ≠ 0
  p := next direction
  q := A p
  α := p^T r / ( p^T q )
  x := x + α p
  r := r − α q
endwhile
Figure 8.2.2.2 The algorithm from Figure 8.2.2.1 (Right) storing only the most current
vectors and scalar.
A natural choice of search directions for the algorithm from Homework 8.2.2.2 is p^{(k)} = e_{k mod n}, which cycles through the standard basis vectors.
Homework 8.2.3.1 For the right-most algorithm in Homework 8.2.2.2, show that if p^{(0)} = e_0, then
\[
χ_0^{(1)} = χ_0^{(0)} + \frac{1}{α_{0,0}} \left( β_0 − \sum_{j=0}^{n-1} α_{0,j} χ_j^{(0)} \right) = \frac{1}{α_{0,0}} \left( β_0 − \sum_{j=1}^{n-1} α_{0,j} χ_j^{(0)} \right).
\]
Solution.
• p^{(0)} = e_0.
• p^{(0)T} A p^{(0)} = e_0^T A e_0 = α_{0,0} (the (0,0) element in A, not to be mistaken for α_0).
• r^{(0)} = b − A x^{(0)}.
• p^{(0)T} r^{(0)} = e_0^T ( b − A x^{(0)} ) = e_0^T b − e_0^T A x^{(0)} = β_0 − ã_0^T x^{(0)}, where ã_k^T denotes the kth row of A.
• x^{(1)} = x^{(0)} + α_0 p^{(0)} = x^{(0)} + ( p^{(0)T} r^{(0)} / ( p^{(0)T} A p^{(0)} ) ) e_0 = x^{(0)} + ( ( β_0 − ã_0^T x^{(0)} ) / α_{0,0} ) e_0. This means that only the first element of x^{(0)} changes, and it changes to
\[
χ_0^{(1)} = χ_0^{(0)} + \frac{1}{α_{0,0}} \left( β_0 − \sum_{j=0}^{n-1} α_{0,j} χ_j^{(0)} \right) = \frac{1}{α_{0,0}} \left( β_0 − \sum_{j=1}^{n-1} α_{0,j} χ_j^{(0)} \right).
\]
YouTube: https://www.youtube.com/watch?v=karx3stbVdE
Careful contemplation of the last homework reveals that this is exactly how the first element in vector x, χ_0, is changed in the Gauss-Seidel method!
Ponder This 8.2.3.2 Continue the above argument to show that this choice of descent
directions yields the Gauss-Seidel iteration.
YouTube: https://www.youtube.com/watch?v=tOqAd1OhIwc
For a function f : R^n → R that we are trying to minimize, for a given x, the direction in which the function most rapidly increases in value at x is given by its gradient,
∇f(x).
Thus, the direction in which it most rapidly decreases is
−∇f(x).
For our function f(x) = (1/2) x^T A x − x^T b this direction of steepest descent equals −∇f(x) = b − A x, which we recognize as the residual. Thus, recalling that r^{(k)} = b − A x^{(k)}, the direction of steepest descent at x^{(k)} is given by p^{(k)} = r^{(k)} = b − A x^{(k)}. These insights motivate the algorithms in Figure 8.2.4.1.
Given: A, b, x^{(0)}                                   Given: A, b, x
r^{(0)} := b − A x^{(0)}                               r := b − A x
k := 0                                                 k := 0
while r^{(k)} ≠ 0                                      while r ≠ 0
  p^{(k)} := r^{(k)}                                     p := r
  q^{(k)} := A p^{(k)}                                   q := A p
  α_k := p^{(k)T} r^{(k)} / ( p^{(k)T} q^{(k)} )         α := p^T r / ( p^T q )
  x^{(k+1)} := x^{(k)} + α_k p^{(k)}                     x := x + α p
  r^{(k+1)} := r^{(k)} − α_k q^{(k)}                     r := r − α q
  k := k + 1                                             k := k + 1
endwhile                                               endwhile
Figure 8.2.4.1 Steepest descent algorithm, with indices and without indices.
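A MATLAB sketch of the algorithm on the left of Figure 8.2.4.1 follows; the function name, the tolerance-based stopping criterion, and the iteration cap are our own additions to make the sketch practical.

function [ x, niters ] = SteepestDescent( A, b, x, tol, maxiters )
% Method of steepest descent for SPD A, starting from the given x.
  r = b - A * x;
  for niters = 1:maxiters
    if norm( r ) <= tol * norm( b )
      return;                      % residual small enough: stop
    end
    p = r;                         % steepest descent direction
    q = A * p;
    alpha = ( p' * r ) / ( p' * q );
    x = x + alpha * p;
    r = r - alpha * q;
  end
end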
8.2.5 Preconditioning
YouTube: https://www.youtube.com/watch?v=i-83HdtrI1M
For a general (appropriately differentiable) nonlinear function f(x), using the direction of steepest descent as the search direction is often a reasonable choice. For our problem, especially if A is relatively ill-conditioned, we can do better.
Here is the idea: Let A = Q Σ Q^T be the SVD of SPD matrix A (or, equivalently for SPD matrices, its spectral decomposition, which we will discuss in Subsection 9.2.4). Then
f(x) = (1/2) x^T A x − x^T b = (1/2) x^T Q Σ Q^T x − x^T Q Q^T b.
Using the change of basis y = Q^T x and b̂ = Q^T b, then
g(y) = (1/2) y^T Σ y − y^T b̂.
How this relates to the convergence of the Method of Steepest Descent is discussed (informally) in the video. The key insight is that if κ(A) = σ_0/σ_{n−1} (the ratio between the largest and smallest eigenvalues or, equivalently, the ratio between the largest and smallest singular values) is large, then convergence can take many iterations.
What would happen if instead σ_0 = ⋯ = σ_{n−1}? Then A = Q Σ Q^T is the SVD/spectral decomposition of A and A = Q(σ_0 I)Q^T. If we then perform the Method of Steepest Descent with y (the transformed vector x) and b̂ (the transformed right-hand side), then
y^{(1)}
= y^{(0)} + ( r^{(0)T} r^{(0)} / ( r^{(0)T} σ_0 I r^{(0)} ) ) r^{(0)}
= y^{(0)} + (1/σ_0) r^{(0)}
= y^{(0)} + (1/σ_0)( b̂ − σ_0 y^{(0)} )
= (1/σ_0) b̂,
which is the solution to σ_0 I y = b̂. Thus, the iteration converges in one step. The point we are trying to (informally) make is that if A is well-conditioned, then the Method of Steepest Descent converges faster.
Let M be an SPD preconditioner with Cholesky factorization M = L_M L_M^T. Then Ax = b can be transformed into
( L_M^{−1} A L_M^{−T} ) ( L_M^T x ) = L_M^{−1} b,
which we write as à x̃ = b̃ with à = L_M^{−1} A L_M^{−T}, x̃ = L_M^T x, and b̃ = L_M^{−1} b.
We note that à is SPD and hence one can apply the Method of Steepest Descent to à x̃ = b̃. Once the method converges to the solution x̃, one can transform that solution back to the solution of the original problem by solving L_M^T x = x̃. If M is chosen carefully, κ( L_M^{−1} A L_M^{−T} ) can be greatly improved. The best choice would be M = A, of course, but that is not realistic. The point is that, in our case where A is SPD, ideally the preconditioner should be SPD as well.
Some careful rearrangement takes the Method of Steepest Descent on the transformed problem to the much simpler preconditioned algorithm on the right in Figure 8.2.5.1.
Solution 1 (Via an inductive argument). We show that, for all k,
• x̃^(k) = L^T x^(k),
• r̃^(k) = L^{−1} r^(k),
• p̃^(k) = L^T p^(k), and
• α̃_k = α_k.
• Base case: k = 0.
   ∘ r̃^(0)
      = < algorithm on left >
     b̃ − Ã x̃^(0)
      = < initialization of b̃ and x̃^(0) >
     L^{−1} b − Ã L^T x^(0)
      = < initialization of Ã >
     L^{−1} b − L^{−1} A x^(0)
      = < factor out and initialization of r^(0) >
     L^{−1} r^(0).
   ∘ p̃^(0)
      = < initialization in algorithm >
     r̃^(0)
      = < r̃^(0) = L^{−1} r^(0) >
     L^{−1} r^(0)
      = < from right algorithm: r^(k) = M p^(k) and M = L L^T >
     L^{−1} L L^T p^(0)
      = < L^{−1} L = I >
     L^T p^(0).
   ∘ α̃_0
      = < middle algorithm >
     ( p̃^(0)T r̃^(0) ) / ( p̃^(0)T Ã p̃^(0) )
      = < p̃^(0) = L^T p^(0) etc. >
     ( (L^T p^(0))^T L^{−1} r^(0) ) / ( (L^T p^(0))^T L^{−1} A L^{−T} L^T p^(0) )
      = < transpose and cancel >
     ( p^(0)T r^(0) ) / ( p^(0)T A p^(0) )
      = < right algorithm >
     α_0.
• Inductive Step: Assume that x̃^(k) = L^T x^(k), r̃^(k) = L^{−1} r^(k), p̃^(k) = L^T p^(k), and α̃_k = α_k.
  Show that x̃^(k+1) = L^T x^(k+1), r̃^(k+1) = L^{−1} r^(k+1), p̃^(k+1) = L^T p^(k+1), and α̃_{k+1} = α_{k+1}.
   ∘ x̃^(k+1)
      = < middle algorithm >
     x̃^(k) + α̃_k p̃^(k)
      = < I.H. >
     L^T x^(k) + α_k L^T p^(k)
      = < factor out; right algorithm >
     L^T x^(k+1).
   ∘ r̃^(k+1)
      = < middle algorithm >
     r̃^(k) − α̃_k Ã p̃^(k)
      = < I.H. >
     L^{−1} r^(k) − α_k L^{−1} A L^{−T} L^T p^(k)
      = < factor out; right algorithm >
     L^{−1} r^(k+1).
Solution 2 (Constructive solution). Let's start with the algorithm in the middle:

Given: A, b, x^(0), M = L L^T
Ã = L^{−1} A L^{−T}
b̃ = L^{−1} b
x̃^(0) = L^T x^(0)
r̃^(0) := b̃ − Ã x̃^(0)
k := 0
while r̃^(k) ≠ 0
   p̃^(k) := r̃^(k)
   q̃^(k) := Ã p̃^(k)
   α̃_k := ( p̃^(k)T r̃^(k) ) / ( p̃^(k)T q̃^(k) )
   x̃^(k+1) := x̃^(k) + α̃_k p̃^(k)
   r̃^(k+1) := r̃^(k) − α̃_k q̃^(k)
   x^(k+1) = L^{−T} x̃^(k+1)
   k := k + 1
endwhile
We now notice that Ã = L^{−1} A L^{−T} and we can substitute this into the algorithm:

Given: A, b, x^(0), M = L L^T
b̃ = L^{−1} b
x̃^(0) = L^T x^(0)
r̃^(0) := b̃ − L^{−1} A L^{−T} x̃^(0)
k := 0
while r̃^(k) ≠ 0
   p̃^(k) := r̃^(k)
   q̃^(k) := L^{−1} A L^{−T} p̃^(k)
   α̃_k := ( p̃^(k)T r̃^(k) ) / ( p̃^(k)T q̃^(k) )
   x̃^(k+1) := x̃^(k) + α̃_k p̃^(k)
   r̃^(k+1) := r̃^(k) − α̃_k q̃^(k)
   x^(k+1) = L^{−T} x̃^(k+1)
   k := k + 1
endwhile
Next, we substitute x̃^(k) = L^T x^(k). We also exploit that b̃ = L^{−1} b and that r̃^(k) equals the
residual b̃ − Ã x̃^(k) = L^{−1} b − L^{−1} A L^{−T} L^T x^(k) = L^{−1} ( b − A x^(k) ) = L^{−1} r^(k).
Substituting these insights in gives us

Given: A, b, x^(0), M = L L^T
r^(0) := b − A x^(0)
k := 0
while r^(k) ≠ 0
   p^(k) := L^{−T} L^{−1} r^(k)
   q̃^(k) := L^{−1} A p^(k)
   α̃_k := ( (L^T p^(k))^T L^{−1} r^(k) ) / ( (L^T p^(k))^T q̃^(k) )
   x^(k+1) := x^(k) + α̃_k L^{−T} L^T p^(k)
   r^(k+1) := r^(k) − α̃_k L q̃^(k)
   k := k + 1
endwhile

which simplifies to

Given: A, b, x^(0), M = L L^T
r^(0) := b − A x^(0)
k := 0
while r^(k) ≠ 0
   p^(k) := M^{−1} r^(k)
   q̃^(k) := L^{−1} A p^(k)
   α̃_k := ( p^(k)T r^(k) ) / ( p^(k)T L q̃^(k) )
   x^(k+1) := x^(k) + α̃_k p^(k)
   r^(k+1) := r^(k) − α̃_k L q̃^(k)
   k := k + 1
endwhile
Finally, noting that L q̃^(k) = A p^(k) and renaming α̃_k as α_k, we arrive at the preconditioned
steepest descent algorithm:

Given: A, b, x^(0), M = L L^T
r^(0) := b − A x^(0)
k := 0
while r^(k) ≠ 0
   p^(k) := M^{−1} r^(k)
   q^(k) := A p^(k)
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T q^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k q^(k)
   k := k + 1
endwhile
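The preconditioned algorithm translates into MATLAB almost line for line. In this sketch we make
the simplifying, purely illustrative assumption that the preconditioner is passed as a lower
triangular Cholesky factor L with M = L L^T, so that applying M^{−1} amounts to two triangular
solves; the function name, tolerance, and iteration cap are ours, not part of the text.

function [ x, k ] = PrecondSteepestDescent( A, b, x, L, tol, maxit )
% Sketch of preconditioned steepest descent with M = L * L', L lower triangular.
  r = b - A * x;
  for k = 0:maxit-1
    if norm( r ) <= tol
      return
    end
    p = L' \ ( L \ r );          % p = M^{-1} r via two triangular solves
    q = A * p;
    alpha = ( p' * r ) / ( p' * q );
    x = x + alpha * p;
    r = r - alpha * q;
  end
end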
YouTube: https://www.youtube.com/watch?v=9-SyyJv0XuU
Let's start our generic descent method algorithm with x^(0) = 0. Here we do not use the
temporary vector q^(k) = A p^(k), so that later we can emphasize how to cast the Conjugate
Gradient Method in terms of as few matrix-vector multiplications as possible (one, to be
exact).
Given: A, b
x^(0) := 0
r^(0) := b − A x^(0) (= b)
k := 0
while r^(k) ≠ 0
   p^(k) := next direction
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile

Given: A, b
x := 0
r := b
k := 0
while r ≠ 0
   p := next direction
   α := ( p^T r ) / ( p^T A p )
   x := x + α p
   r := r − α A p
   k := k + 1
endwhile

Figure 8.3.1.1 Generic descent algorithm started with x^(0) = 0, with indices and without indices.
Now, since x^(0) = 0, clearly
   x^(k) = α_0 p^(0) + α_1 p^(1) + ··· + α_{k−1} p^(k−1).      (8.3.1)
Suppose that, in addition, x^(k) minimized f(x) over all of Span(p^(0), . . . , p^(k−1)) and the
search directions were linearly independent. Then the resulting descent method, in
exact arithmetic, is guaranteed to complete in at most n iterations. This is because then
   Span(p^(0), . . . , p^(n−1)) = R^n
so that
   f(x^(n)) = min_{x ∈ Span(p^(0),...,p^(n−1))} f(x) = min_{x ∈ R^n} f(x).
YouTube: https://www.youtube.com/watch?v=j8uNP7zjdv8
We can write (8.3.1) more concisely: Let
   P^(k−1) = ( p^(0)  p^(1)  ···  p^(k−1) )
be the matrix that holds the history of all search directions so far (as its columns). Then, letting
   a^(k−1) = ( α_0, α_1, . . . , α_{k−1} )^T,
we notice that
   x^(k) = ( p^(0)  ···  p^(k−1) ) ( α_0, . . . , α_{k−1} )^T = P^(k−1) a^(k−1).      (8.3.2)
Homework 8.3.1.1 Let p^(k) be a new search direction that is linearly independent of the
columns of P^(k−1), which themselves are linearly independent. Show that
   min_{x ∈ Span(p^(0),...,p^(k−1),p^(k))} f(x) = min_y f( ( P^(k−1)  p^(k) ) y ),
where y = ( y_0^T  ψ_1 )^T ∈ R^{k+1} with y_0 ∈ R^k and ψ_1 ∈ R.
Hint.
   x ∈ Span(p^(0), . . . , p^(k−1), p^(k))
if and only if there exists
   y = ( y_0^T  ψ_1 )^T ∈ R^{k+1}  such that  x = ( P^(k−1)  p^(k) ) y.
Solution.
   min_{x ∈ Span(p^(0),...,p^(k−1),p^(k))} f(x)
 = < equivalent formulation >
   min_y f( ( P^(k−1)  p^(k) ) y )
 = < partition y = ( y_0^T  ψ_1 )^T >
   min_y f( P^(k−1) y_0 + ψ_1 p^(k) )
 = < instantiate f >
   min_y [ (1/2) ( P^(k−1) y_0 + ψ_1 p^(k) )^T A ( P^(k−1) y_0 + ψ_1 p^(k) ) − ( P^(k−1) y_0 + ψ_1 p^(k) )^T b ]
 = < multiply out >
   min_y [ (1/2) ( y_0^T P^(k−1)T + ψ_1 p^(k)T ) A ( P^(k−1) y_0 + ψ_1 p^(k) ) − y_0^T P^(k−1)T b − ψ_1 p^(k)T b ]
 = < multiply out some more >
   min_y [ (1/2) y_0^T P^(k−1)T A P^(k−1) y_0 + ψ_1 y_0^T P^(k−1)T A p^(k)
           + (1/2) ψ_1^2 p^(k)T A p^(k) − y_0^T P^(k−1)T b − ψ_1 p^(k)T b ].
YouTube: https://www.youtube.com/watch?v=5eNmr776GJY
Now, if
P (k≠1) T Ap(k) = 0
then
   min_{x ∈ Span(p^(0),...,p^(k−1),p^(k))} f(x)
 = < from before >
   min_y [ (1/2) y_0^T P^(k−1)T A P^(k−1) y_0 − y_0^T P^(k−1)T b
           + ψ_1 y_0^T P^(k−1)T A p^(k)    (this term equals 0)
           + (1/2) ψ_1^2 p^(k)T A p^(k) − ψ_1 p^(k)T b ]
 = < remove zero term >
   min_y [ (1/2) y_0^T P^(k−1)T A P^(k−1) y_0 − y_0^T P^(k−1)T b + (1/2) ψ_1^2 p^(k)T A p^(k) − ψ_1 p^(k)T b ]
 = < split into two terms that can be minimized separately >
   min_{y_0} [ (1/2) y_0^T P^(k−1)T A P^(k−1) y_0 − y_0^T P^(k−1)T b ]
   + min_{ψ_1} [ (1/2) ψ_1^2 p^(k)T A p^(k) − ψ_1 p^(k)T b ].
YouTube: https://www.youtube.com/watch?v=70t6zgeMHs8
Homework 8.3.1.2 Let A ∈ R^{n×n} be SPD.
ALWAYS/SOMETIMES/NEVER: The columns of P ∈ R^{n×k} are A-conjugate if and
only if P^T A P = D where D is diagonal and has positive values on its diagonal.
Answer. ALWAYS
Now prove it.
Solution.
   P^T A P
 = < partition P by columns >
   ( p_0  ···  p_{k−1} )^T A ( p_0  ···  p_{k−1} )
 = < multiply out >
   ( p_0  ···  p_{k−1} )^T ( A p_0  ···  A p_{k−1} )
 = < multiply out >
   the k × k matrix with (i, j) entry p_i^T A p_j:

      ( p_0^T A p_0       p_0^T A p_1       ···  p_0^T A p_{k−1}
        p_1^T A p_0       p_1^T A p_1       ···  p_1^T A p_{k−1}
           ⋮                 ⋮                       ⋮
        p_{k−1}^T A p_0   p_{k−1}^T A p_1   ···  p_{k−1}^T A p_{k−1} ).

The columns of P are A-conjugate exactly when p_i^T A p_j = 0 for i ≠ j, which is exactly when
this matrix equals the diagonal matrix
      diag( p_0^T A p_0, p_1^T A p_1, . . . , p_{k−1}^T A p_{k−1} ),
and its diagonal elements are positive since A is SPD (and the columns p_i are nonzero).
Homework 8.3.1.3 Let A ∈ R^{n×n} be SPD and the columns of P ∈ R^{n×k} be A-conjugate.
ALWAYS/SOMETIMES/NEVER: The columns of P are linearly independent.
Answer. ALWAYS
Now prove it!
Solution. We employ a proof by contradiction. Suppose the columns of P are not linearly
independent. Then there exists y ≠ 0 such that P y = 0. Let D = P^T A P. From the last
homework we know that D is diagonal and has positive diagonal elements. But then
   0 = < P y = 0 > ( P y )^T A ( P y ) = y^T P^T A P y = y^T D y > 0,
which is a contradiction. Hence the columns of P are linearly independent.
The above observations leave us with a descent method that picks the search directions
to be A-conjugate, given in Figure 8.3.1.3.
Given: A, b
x^(0) := 0
r^(0) = b
k := 0
while r^(k) ≠ 0
   Choose p^(k) such that p^(k)T A P^(k−1) = 0 and p^(k)T r^(k) ≠ 0
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile
Figure 8.3.1.3 Descent method that chooses the search directions to be A-conjugate.
Remark 8.3.1.4 The important observation is that if p^(0), . . . , p^(k) are chosen to be A-
conjugate, then x^(k+1) minimizes not only
   f( x^(k) + α p^(k) )
but also
   min_{x ∈ Span(p^(0),...,p^(k))} f(x).
YouTube: https://www.youtube.com/watch?v=yXfR71mJ64w
The big question left dangling at the end of the last unit was whether there exists a
direction p(k) that is A-orthogonal to all previous search directions and that is not orthogonal
to r(k) . Let us examine this:
• Assume that all prior search directions p(0) , . . . , p(k≠1) were A-conjugate.
• Consider all vectors p œ Rn that are A-conjugate to p(0) , . . . , p(k≠1) . A vector p has this
property if and only if p ‹ Span(Ap(0) , . . . , Ap(k≠1) ).
• If all vectors p that are A-conjugate to p^(0), . . . , p^(k−1) are orthogonal to the current
  residual, p^T r^(k) = 0 for all p with P^(k−1)T A p = 0, then the same holds for
  b = r^(k) + A x^(k), since A x^(k) ∈ Span(A p^(0), . . . , A p^(k−1)).
  Let's think about this: b is orthogonal to all vectors that are orthogonal to Span(A p^(0), . . . , A p^(k−1)).
  This means that
     b ∈ Span(A p^(0), . . . , A p^(k−1)).
• We conclude that our method must already have found the solution since x(k) minimizes
f (x) over all vectors in Span(p(0) , . . . , p(k≠1) ). Thus Ax(k) = b and r(k) = 0.
We conclude that there exist descent methods that leverage A-conjugate search directions
as described in Figure 8.3.1.3. The question now is how to find a new A-conjugate search
direction at every step.
YouTube: https://www.youtube.com/watch?v=OWnTq1PIFnQ
The idea behind the Conjugate Gradient Method is that in the current iteration we have
an approximation, x^(k), to the solution of A x = b. By construction, since x^(0) = 0, x^(k)
minimizes f(x) over all vectors in Span(p^(0), . . . , p^(k−1)).
• We like the direction of steepest descent, r^(k) = b − A x^(k), because it is the direction
  in which f(x) decreases most quickly.
• Let us choose p^(k) to be the vector that is A-conjugate to p^(0), . . . , p^(k−1) and closest to
  the direction of steepest descent, r^(k):
     p^(k) minimizes  min_{p ⊥ Span(A p^(0),...,A p^(k−1))} ‖ r^(k) − p ‖_2.
YouTube: https://www.youtube.com/watch?v=i5MoVhNsXYU
Let's look more carefully at the p^(k) that satisfies
   min_{p ⊥ Span(A p^(0),...,A p^(k−1))} ‖ r^(k) − p ‖_2.
Notice that
   r^(k) = v + p^(k),
where v is the orthogonal projection of r^(k) onto Span(A p^(0), . . . , A p^(k−1)), so that
v = A P^(k−1) z^(k−1), where z^(k−1) minimizes ‖ r^(k) − A P^(k−1) z ‖_2 over z ∈ R^k.
This can be recognized as a standard linear least squares problem. This allows us to make
a few important observations:
YouTube: https://www.youtube.com/watch?v=ye1FuJixbHQ
Theorem 8.3.4.1 In Figure 8.3.3.1,
• P (k≠1) T r(k) = 0.
   f( P^(k−1) y )
 = (1/2) ( P^(k−1) y )^T A ( P^(k−1) y ) − ( P^(k−1) y )^T b
 = (1/2) y^T ( P^(k−1)T A P^(k−1) ) y − y^T P^(k−1)T b
• Show that Span(p^(0), . . . , p^(k−1)) = Span(r^(0), . . . , r^(k−1)) = Span(b, A b, . . . , A^{k−1} b).
  Proof by induction on k.
  ∘ Base case: k = 1.
    The result clearly holds since p^(0) = r^(0) = b.
  ∘ Inductive Hypothesis: Assume the result holds for n ≤ k.
    Show that the result holds for k = n + 1.
    ⌅ If k = n + 1 then r^(k−1) = r^(n) = r^(n−1) − α_{n−1} A p^(n−1). By the I.H.,
         r^(n−1) ∈ Span(b, A b, . . . , A^{n−1} b)
      and
         p^(n−1) ∈ Span(b, A b, . . . , A^{n−1} b).
      But then
         A p^(n−1) ∈ Span(A b, A^2 b, . . . , A^n b)
      and hence
         r^(n) ∈ Span(b, A b, A^2 b, . . . , A^n b).
    ⌅ Similarly, p^(n) = r^(n) − A P^(n−1) z^(n−1) ∈ Span(b, A b, A^2 b, . . . , A^n b), since
         r^(n) ∈ Span(b, A b, A^2 b, . . . , A^n b)
      and
         A P^(n−1) y_0 ∈ Span(A b, A^2 b, . . . , A^n b) for any y_0.
  ∘ We complete the inductive step by noting that all three subspaces have the same
    dimension and hence must be the same subspace.
  ∘ By the Principle of Mathematical Induction, the result holds.
⌅
Definition 8.3.4.2 Krylov subspace. The subspace
   K_k(A, b) = Span(b, A b, . . . , A^{k−1} b)
is known as the order-k Krylov subspace. ⌃
and hence
   Span(r^(0), . . . , r^(j−1)) ⊂ Span(p^(0), . . . , p^(k−1))
for j < k. Hence r^(j) = P^(k−1) t^(j) for some vector t^(j) ∈ R^k. Then
   r^(j)T r^(k) = t^(j)T P^(k−1)T r^(k) = 0.
Since this holds for all k and j < k, the desired result is established. ⌅
Next comes the most important result. We established that
   p^(k) = r^(k) − A P^(k−1) z^(k−1),      (8.3.3)
where z^(k−1) solves
   min_{z ∈ R^k} ‖ r^(k) − A P^(k−1) z ‖_2.
What we are going to show is that, in fact, the next search direction equals a linear combination
of the current residual and the previous search direction.
Theorem 8.3.4.4 For k ≥ 1, the search directions generated by the Conjugate Gradient
Method satisfy
   p^(k) = r^(k) + γ_k p^(k−1)
for some constant γ_k.
Proof. This proof has a lot of very technical details. No harm done if you only pay cursory
attention to those details.
Partition z^(k−1) = ( z_0^T  ζ_1 )^T and recall that r^(k) = r^(k−1) − α_{k−1} A p^(k−1), so that
A p^(k−1) = ( r^(k−1) − r^(k) ) / α_{k−1}. Then
   p^(k)
 = < (8.3.3) >
   r^(k) − A P^(k−1) z^(k−1)
 = < partition z^(k−1) >
   r^(k) − A P^(k−2) z_0 − ζ_1 A p^(k−1)
 = < substitute A p^(k−1) >
   r^(k) − A P^(k−2) z_0 + ( ζ_1 / α_{k−1} ) ( r^(k) − r^(k−1) )
 = < rearrange >
   ( 1 + ζ_1/α_{k−1} ) r^(k) + [ −( ζ_1/α_{k−1} ) r^(k−1) − A P^(k−2) z_0 ]
 =
   ( 1 + ζ_1/α_{k−1} ) r^(k) + s^(k),
where s^(k) denotes the term in square brackets, and minimizing (the length of) p^(k) means
minimizing the two separate parts. Since r^(k) is fixed, this means minimizing ‖ s^(k) ‖_2^2.
An examination of s^(k) exposes that
   s^(k) = −( ζ_1/α_{k−1} ) r^(k−1) − A P^(k−2) z_0 = −( ζ_1/α_{k−1} ) ( r^(k−1) − A P^(k−2) w_0 ),
where w_0 = −( α_{k−1}/ζ_1 ) z_0. We recall that, applying (8.3.3) in the previous iteration,
   p^(k−1) = r^(k−1) − A P^(k−2) z^(k−2),  where z^(k−2) minimizes ‖ r^(k−1) − A P^(k−2) z ‖_2,
and hence we conclude that s^(k) is a vector in the direction of p^(k−1). Since we are only interested
in the direction of p^(k), the scalar ζ_1/α_{k−1} is not relevant. The upshot of this lengthy analysis is that
   p^(k) = r^(k) + γ_k p^(k−1)
for some constant γ_k.
⌅
This implies that while the Conjugate Gradient Method is an A-conjugate method and
hence leverages a "memory" of all previous search directions,
only the last search direction is needed to compute the current one. This reduces the cost of
computing the current search direction and means we don’t have to store all previous ones.
YouTube: https://www.youtube.com/watch?v=jHBK1OQE01s
Remark 8.3.4.5 This is a very, very, very big deal...
YouTube: https://www.youtube.com/watch?v=FVWgZKJQjz0
We have noted that p^(k) = r^(k) + γ_k p^(k−1). Since p^(k) is A-conjugate to p^(k−1), we find that
   p^(k−1)T A ( r^(k) + γ_k p^(k−1) ) = 0,
so that
   γ_k = −( p^(k−1)T A r^(k) ) / ( p^(k−1)T A p^(k−1) ).
This yields the first practical instantiation of the Conjugate Gradient method, given in
Figure 8.3.5.1.

Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      γ_k := −( p^(k−1)T A r^(k) ) / ( p^(k−1)T A p^(k−1) )
      p^(k) := r^(k) + γ_k p^(k−1)
   endif
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile
Figure 8.3.5.1 First practical instantiation of the Conjugate Gradient Method.

Homework 8.3.5.1 In the algorithm in Figure 8.3.5.1,
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) ).
Show that an alternative formula for α_k is given by
   α_k := ( r^(k)T r^(k) ) / ( p^(k)T A p^(k) ).
Hint. Use the fact that p^(k) = r^(k) + γ_k p^(k−1) and the fact that r^(k) is orthogonal to all
previous search directions to show that p^(k)T r^(k) = r^(k)T r^(k).
Solution. We need to show that p^(k)T r^(k) = r^(k)T r^(k):
   p^(k)T r^(k)
 = < p^(k) = r^(k) + γ_k p^(k−1) >
   ( r^(k) + γ_k p^(k−1) )^T r^(k)
 = < multiply out; p^(k−1)T r^(k) = 0 >
   r^(k)T r^(k).
(Left)
Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      γ_k := −( p^(k−1)T A r^(k) ) / ( p^(k−1)T A p^(k−1) )
      p^(k) := r^(k) + γ_k p^(k−1)
   endif
   α_k := ( r^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile

(Right)
Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      γ_k := ( r^(k)T r^(k) ) / ( r^(k−1)T r^(k−1) )
      p^(k) := r^(k) + γ_k p^(k−1)
   endif
   α_k := ( r^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile

Figure 8.3.5.2 Alternative Conjugate Gradient Method algorithms.
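As a concrete companion to Figure 8.3.5.2 (right), here is a minimal MATLAB sketch. The
function name, the tolerance-based stopping test, and the iteration cap are our illustrative
additions; the text's algorithm stops only when the residual is exactly zero.

function [ x, k ] = ConjGrad( A, b, tol, maxit )
% Sketch of the Conjugate Gradient Method in Figure 8.3.5.2 (right).
% A is assumed SPD; the iteration is started with x = 0 so that r = b.
  x = zeros( size( b ) );
  r = b;
  rtr_old = r' * r;
  p = r;                             % first search direction
  for k = 0:maxit-1
    if sqrt( rtr_old ) <= tol
      return
    end
    if k > 0
      gamma = rtr_old / rtr_older;   % gamma_k = (r^T r)/(r_old^T r_old)
      p = r + gamma * p;
    end
    Ap = A * p;                      % the one matrix-vector multiply per iteration
    alpha = rtr_old / ( p' * Ap );
    x = x + alpha * p;
    r = r - alpha * Ap;
    rtr_older = rtr_old;             % save r^T r for the next gamma
    rtr_old = r' * r;
  end
end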
Homework 8.3.5.2 For the Conjugate Gradient Method discussed so far,
• Show that
   r^(k)T r^(k) = −α_{k−1} r^(k)T A p^(k−1).
• Show that
   p^(k−1)T A p^(k−1) = r^(k−1)T r^(k−1) / α_{k−1}.
Solution.
   r^(k)T r^(k) = r^(k)T r^(k−1) − α_{k−1} r^(k)T A p^(k−1) = −α_{k−1} r^(k)T A p^(k−1),
since r^(k)T r^(k−1) = 0. Also,
   p^(k−1)T A p^(k−1)
 = < p^(k−1) = r^(k−1) + γ_{k−1} p^(k−2) >
   ( r^(k−1) + γ_{k−1} p^(k−2) )^T A p^(k−1)
 = < A-conjugacy of the search directions >
   r^(k−1)T A p^(k−1)
 = < A p^(k−1) = ( r^(k−1) − r^(k) ) / α_{k−1} >
   r^(k−1)T ( r^(k−1) − r^(k) ) / α_{k−1}
 = < r^(k−1)T r^(k) = 0 >
   r^(k−1)T r^(k−1) / α_{k−1}.
From the last homework we conclude that
   γ_k = −( p^(k−1)T A r^(k) ) / ( p^(k−1)T A p^(k−1) ) = ( r^(k)T r^(k) ) / ( r^(k−1)T r^(k−1) ),
which explains the formula for γ_k used in the algorithm on the right in Figure 8.3.5.2.
YouTube: https://www.youtube.com/watch?v=f3rLky6mIA4
We finish our discussion of the Conjugate Gradient Method by revisiting the stopping
criteria and preconditioning.
8.3.6.2 Preconditioning
In Subsection 8.2.5 we noted that the Method of Steepest Descent can be greatly accelerated
by employing a preconditioner. The Conjugate Gradient Method can be greatly accelerated as well.
While in theory the method requires at most n iterations when A is n × n, in practice a
preconditioned Conjugate Gradient Method requires very few iterations.
Homework 8.3.6.1 Add preconditioning to the algorithm in Figure 8.3.5.2 (right).
Solution. Just as in Subsection 8.2.5, rather than solving
   A x = b
we pick an SPD preconditioner M = L̃ L̃^T and instead solve the equivalent problem
   L̃^{−1} A L̃^{−T} L̃^T x = L̃^{−1} b,
where we label Ã = L̃^{−1} A L̃^{−T}, x̃ = L̃^T x, and b̃ = L̃^{−1} b.
Applying the algorithm in Figure 8.3.5.2 (right) to Ã x̃ = b̃ gives

Given: A, b, M = L̃ L̃^T
x̃^(0) := 0
Ã = L̃^{−1} A L̃^{−T}
r̃^(0) := L̃^{−1} b
k := 0
while r̃^(k) ≠ 0
   if k = 0
      p̃^(k) = r̃^(0)
   else
      γ̃_k := ( r̃^(k)T r̃^(k) ) / ( r̃^(k−1)T r̃^(k−1) )
      p̃^(k) := r̃^(k) + γ̃_k p̃^(k−1)
   endif
   α̃_k := ( r̃^(k)T r̃^(k) ) / ( p̃^(k)T Ã p̃^(k) )
   x̃^(k+1) := x̃^(k) + α̃_k p̃^(k)
   r̃^(k+1) := r̃^(k) − α̃_k Ã p̃^(k)
   k := k + 1
endwhile

Now, much like we did in the constructive solution to Homework 8.2.5.1, we morph
this into an algorithm that more directly computes x^(k+1). We start by substituting
   Ã = L̃^{−1} A L̃^{−T},   x̃^(k) = L̃^T x^(k),   r̃^(k) = L̃^{−1} r^(k),   p̃^(k) = L̃^T p^(k),
which yields

Given: A, b, M = L̃ L̃^T
L̃^T x^(0) := 0
L̃^{−1} r^(0) := L̃^{−1} b
k := 0
while L̃^{−1} r^(k) ≠ 0
   if k = 0
      L̃^T p^(k) = L̃^{−1} r^(0)
   else
      γ̃_k := ( (L̃^{−1} r^(k))^T L̃^{−1} r^(k) ) / ( (L̃^{−1} r^(k−1))^T L̃^{−1} r^(k−1) )
      L̃^T p^(k) := L̃^{−1} r^(k) + γ̃_k L̃^T p^(k−1)
   endif
   α̃_k := ( (L̃^{−1} r^(k))^T L̃^{−1} r^(k) ) / ( (L̃^T p^(k))^T L̃^{−1} A L̃^{−T} L̃^T p^(k) )
   L̃^T x^(k+1) := L̃^T x^(k) + α̃_k L̃^T p^(k)
   L̃^{−1} r^(k+1) := L̃^{−1} r^(k) − α̃_k L̃^{−1} A L̃^{−T} L̃^T p^(k)
   k := k + 1
endwhile
If we now simplify and manipulate various parts of this algorithm, we get

Given: A, b, M = L̃ L̃^T
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = M^{−1} r^(0)
   else
      γ̃_k := ( r^(k)T M^{−1} r^(k) ) / ( r^(k−1)T M^{−1} r^(k−1) )
      p^(k) := M^{−1} r^(k) + γ̃_k p^(k−1)
   endif
   α̃_k := ( r^(k)T M^{−1} r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α̃_k p^(k)
   r^(k+1) := r^(k) − α̃_k A p^(k)
   k := k + 1
endwhile
Finally, we avoid the recomputing of M^{−1} r^(k) and A p^(k) by introducing z^(k) and q^(k):

Given: A, b, M = L̃ L̃^T
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   z^(k) := M^{−1} r^(k)
   if k = 0
      p^(k) = z^(0)
   else
      γ̃_k := ( r^(k)T z^(k) ) / ( r^(k−1)T z^(k−1) )
      p^(k) := z^(k) + γ̃_k p^(k−1)
   endif
   q^(k) := A p^(k)
   α̃_k := ( r^(k)T z^(k) ) / ( p^(k)T q^(k) )
   x^(k+1) := x^(k) + α̃_k p^(k)
   r^(k+1) := r^(k) − α̃_k q^(k)
   k := k + 1
endwhile
(Obviously, there are a few other things that can be done to avoid unnecessary recomputa-
tions of r(k) T z (k) .)
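A minimal MATLAB sketch of this final preconditioned Conjugate Gradient algorithm follows.
As before, the function name, the tolerance, and the iteration cap are illustrative additions,
and we assume the preconditioner is supplied as a function handle applyMinv that returns
M^{−1} times a vector (for example, two triangular solves with a Cholesky factor of M).

function [ x, k ] = PrecondConjGrad( A, b, applyMinv, tol, maxit )
% Sketch of the preconditioned Conjugate Gradient Method derived above.
% applyMinv( r ) should return M \ r for an SPD preconditioner M.
  x = zeros( size( b ) );
  r = b;
  for k = 0:maxit-1
    if norm( r ) <= tol
      return
    end
    z = applyMinv( r );            % z^(k) := M^{-1} r^(k)
    rtz = r' * z;
    if k == 0
      p = z;
    else
      gamma = rtz / rtz_old;       % gamma_k = (r^T z)/(r_old^T z_old)
      p = z + gamma * p;
    end
    q = A * p;
    alpha = rtz / ( p' * q );
    x = x + alpha * p;
    r = r - alpha * q;
    rtz_old = rtz;                 % save r^T z for the next iteration
  end
end

With L = ichol( sparse( A ) ) and applyMinv = @(r) L' \ ( L \ r ), one obtains a simple
incomplete-Cholesky preconditioned CG, similar in spirit to MATLAB's built-in pcg.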
8.4 Enrichments
8.4.1 Conjugate Gradient Method: Variations on a theme
Many variations on the Conjugate Gradient Method exist, which are employed in different
situations. A concise summary of these, including suggestions as to which one to use when,
can be found in
• [2] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June M. Donato,
Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der
Vorst, Templates for the Solution of Linear Systems: Building Blocks for Iterative
Methods, SIAM Press, 1993. [ PDF ]
8.5 Wrap Up
8.5.1 Additional homework
Homework 8.5.1.1 When using iterative methods, the matrices are typically very sparse.
The question then is how to store a sparse matrix and how to perform a matrix-vector
multiplication with it. One popular way is known as compressed row storage that involves
three arrays:
• 1D array nzA (nonzero A) which stores the nonzero elements of matrix A. In this array,
first all nonzero elements of the first row are stored, then the second row, etc. It has
size nnzeroes (number of nonzeroes).
• 1D array ir which is an integer array of size n + 1 such that ir( 1 ) equals the
index in array nzA where the first element of the first row is stored. ir( 2 ) then
gives the index where the first element of the second row is stored, and so forth. ir(
n+1 ) equals nnzeroes + 1. Having this entry is convenient when you implement a
matrix-vector multiplication with array nzA.
• 1D array ic of size nnzeroes which holds the column indices of the corresponding
elements in array nzA.
1. Write a function
2. Write a function
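To make the compressed row storage scheme concrete, here is a sketch of how a matrix-vector
multiplication y := A x might traverse these three arrays. The function name and the 1-based
indexing convention (matching the description of ir above) are our own assumptions; the two
functions the homework asks you to write may have different interfaces.

function y = SparseMvMult( nzA, ir, ic, x )
% Sketch: y = A * x for A stored in compressed row storage.
%   nzA : nonzero values, stored row by row
%   ir  : ir( i ) indexes the first nonzero of row i in nzA; ir( n+1 ) = nnzeroes + 1
%   ic  : column index of each entry of nzA
  n = length( ir ) - 1;
  y = zeros( n, 1 );
  for i = 1:n
    for idx = ir( i ):ir( i+1 )-1        % all nonzeroes of row i
      y( i ) = y( i ) + nzA( idx ) * x( ic( idx ) );
    end
  end
end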
8.5.2 Summary
Given a function f : R^n → R, its gradient is given by
   ∇f(x) = ( ∂f/∂χ_0 (x), ∂f/∂χ_1 (x), . . . , ∂f/∂χ_{n−1} (x) )^T.
∇f(x) equals the direction in which the function f increases most rapidly at the point x,
and −∇f(x) equals the direction of steepest descent (the direction in which the function f
decreases most rapidly at the point x).
In this summary, we will assume that A ∈ R^{n×n} is symmetric positive definite (SPD) and
   f(x) = (1/2) x^T A x − x^T b.
The gradient of this function equals
   ∇f(x) = A x − b,
so that its minimizer, x̂, satisfies
   A x̂ = b.
Here p^(k) is the "current" search direction and in each iteration we create the next approxi-
mation to x̂, x^(k+1), along the line x^(k) + α p^(k).
If x^(k+1) minimizes along that line, the method is an exact descent method and
   α_k = ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) ),
so that a prototypical exact descent method is given by

Given: A, b, x^(0)
r^(0) := b − A x^(0)
k := 0
while r^(k) ≠ 0
   p^(k) := next direction
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := b − A x^(k+1)
   k := k + 1
endwhile

Once α_k is determined,
   r^(k+1) = r^(k) − α_k A p^(k),
which saves a matrix-vector multiplication when incorporated into the prototypical exact
descent method:
Given : A, b, x(0)
r(0) := b ≠ Ax(0)
k := 0
while r(k) ”= 0
p(k) := next direction
q (k) := Ap(k)
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T q^(k) )
x(k+1) := x(k) + –k p(k)
r(k+1) := r(k) ≠ –k q (k)
k := k + 1
endwhile
The steepest descent algorithm chooses p(k) = ≠Òf (x(k) ) = b ≠ Ax(k) = r(k) :
Given : A, b, x(0)
r(0) := b ≠ Ax(0)
k := 0
while r(k) ”= 0
p(k) := r(k)
q (k) := Ap(k)
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T q^(k) )
x(k+1) := x(k) + –k p(k)
r(k+1) := r(k) ≠ –k q (k)
k := k + 1
endwhile
Convergence can be greatly accelerated by incorporating a preconditioner, M , where,
ideally, M ¥ A is SPD and solving M z = y is easy (cheap).
Given : A, b, x(0) , M
r(0) := b ≠ Ax(0)
k := 0
while r(k) ”= 0
p(k) := M ≠1 r(k)
q (k) := Ap(k)
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T q^(k) )
x(k+1) := x(k) + –k p(k)
r(k+1) := r(k) ≠ –k q (k)
k := k + 1
endwhile
A descent method that chooses the search directions to be A-conjugate is given by

Given: A, b
x^(0) := 0
r^(0) = b
k := 0
while r^(k) ≠ 0
   Choose p^(k) such that p^(k)T A P^(k−1) = 0 and p^(k)T r^(k) ≠ 0
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile
The Conjugate Gradient Method chooses the search direction to equal the vector p(k)
that is A-conjugate to all previous search directions and is closest to the direction of steepest
descent:
Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      p^(k) minimizes  min_{p ⊥ Span(A p^(0),...,A p^(k−1))} ‖ r^(k) − p ‖_2
   endif
   α_k := ( p^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile
For this method:
• P^(k−1)T r^(k) = 0.
• For k ≥ 1,
     p^(k) = r^(k) + γ_k p^(k−1)
  for some constant γ_k.
Definition 8.5.2.2 Krylov subspace. The subspace
   K_k(A, b) = Span(b, A b, . . . , A^{k−1} b)
is known as the order-k Krylov subspace. ⌃
The two practical Conjugate Gradient Method algorithms from Figure 8.3.5.2 are

(Left)
Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      γ_k := −( p^(k−1)T A r^(k) ) / ( p^(k−1)T A p^(k−1) )
      p^(k) := r^(k) + γ_k p^(k−1)
   endif
   α_k := ( r^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile

(Right)
Given: A, b
x^(0) := 0
r^(0) := b
k := 0
while r^(k) ≠ 0
   if k = 0
      p^(k) = r^(0)
   else
      γ_k := ( r^(k)T r^(k) ) / ( r^(k−1)T r^(k−1) )
      p^(k) := r^(k) + γ_k p^(k−1)
   endif
   α_k := ( r^(k)T r^(k) ) / ( p^(k)T A p^(k) )
   x^(k+1) := x^(k) + α_k p^(k)
   r^(k+1) := r^(k) − α_k A p^(k)
   k := k + 1
endwhile
Week 9
9.1 Opening
9.1.1 Relating diagonalization to eigenvalues and eigenvectors
You may want to start your exploration of eigenvalues and eigenvectors by watching the
video
• Eigenvectors and eigenvalues | Essence of linear algebra, chapter 14 from the 3Blue1Brown
series. (We don't embed the video because we are not quite sure what the rules about
doing so are.)
Here are the insights from that video in the terminology of this week.
YouTube: https://www.youtube.com/watch?v=S_OgLYAh2Jk
Homework 9.1.1.1 Eigenvalues and eigenvectors are all about finding scalars, ⁄, and
nonzero vectors, x, such that
Ax = ⁄x.
To help you visualize how a 2 × 2 real-valued matrix transforms a vector on the unit
circle in general, and eigenvectors of unit length in particular, we have created the function
Assignments/Week09/matlab/showeig.m (inspired by such a function that used to be part of
Matlab). You may now want to do a "git pull" to update your local copy of the Assignments
directory.
Once you have uploaded this function to Matlab, in the command window, first create a
2 × 2 matrix in array A and then execute showeig( A ).
A = [ 2 0
0 -0.5 ]
A = [ 2 1
0 -0.5 ]
A = [ 2 1
1 -0.5 ]
theta = pi/4;
A = [ cos( theta) -sin( theta )
sin( theta ) cos( theta ) ]
A = [ 2 1
0 2 ]
A = [ 2 -1
-1 0.5 ]
A = [ 2 1.5
1 -0.5 ]
A = [ 2 -1
1 0.5 ]
A = [ 2 0
0 -0.5 ]
A = [ 2 1
0 -0.5 ]
A = [ 2 1
1 -0.5 ]
If you try a few different symmetric matrices, you will notice that the eigenvectors are always
mutually orthogonal.
theta = pi/4;
A = [ cos( theta) -sin( theta )
sin( theta ) cos( theta ) ]
In the end, no vectors are displayed. This is because, for real-valued vectors, there are no
vectors such that the rotated vector is in the same direction as the original vector. The
eigenvalues and eigenvectors of a real-valued rotation are complex-valued, unless θ is an
integer multiple of π.
A = [ 2 1
0 2 ]
We will see later that this is an example of a Jordan block. There is only one linearly
independent eigenvector associated with the eigenvalue 2. Notice that the two eigenvectors
that are displayed are not linearly independent (they point in opposite directions).
A = [ 2 -1
-1 0.5 ]
This matrix has linearly dependent columns (it has a nonzero vector in the null space and
hence 0 is an eigenvalue).
A = [ 2 1.5
1 -0.5 ]
If you look carefully, you notice that the eigenvectors are not mutually orthogonal.
A = [ 2 -1
1 0.5 ]
9.1.2 Overview
• 9.1 Opening
• 9.2 Basics
9.2 Basics
9.2.1 Singular matrices and the eigenvalue problem
YouTube: https://www.youtube.com/watch?v=j85zII8u2-I
Definition 9.2.1.1 Eigenvalue, eigenvector, and eigenpair. Let A œ Cm◊m . Then
⁄ œ C and nonzero x œ Cm are said to be an eigenvalue and corresponding eigenvector if
Ax = ⁄x. The tuple (⁄, x) is said to be an eigenpair. ⌃
Ax = ⁄x means that the action of A on an eigenvector x is as if it were multiplied by a
scalar. In other words, the direction does not change and only its length is scaled. "Scaling"
and "direction" should be taken loosely here: an eigenvalue can be negative (in which case
the vector ends up pointing in the opposite direction) or even complex-valued.
As part of an introductory course on linear algebra, you learned that the following state-
ments regarding an m ◊ m matrix A are all equivalent:
• A is nonsingular.
• dim(N (A)) = 0.
• det(A) ”= 0.
Since Ax = ⁄x can be rewritten as (⁄I ≠ A)x = 0, we note that the following statements are
equivalent for a given m ◊ m matrix A:
• (⁄I ≠ A) is singular.
• det(⁄I ≠ A) = 0.
It will become important in our discussions to pick the right equivalent statement in a given
situation.
YouTube: https://www.youtube.com/watch?v=K-yDVqijSYw
We will often talk about "the set of all eigenvalues." This set is called the spectrum of
a matrix.
Definition 9.2.1.2 Spectrum of a matrix. The set of all eigenvalues of A is denoted
by Λ(A) and is called the spectrum of A. ⌃
The magnitude of the eigenvalue that is largest in magnitude is known as the spectral
radius. The reason is that all eigenvalues lie in the circle in the complex plane, centered at
the origin, with that radius.
Definition 9.2.1.3 Spectral radius. The spectral radius of A, ρ(A), equals the absolute
value of the eigenvalue with largest magnitude:
   ρ(A) = max_{λ ∈ Λ(A)} |λ|.
⌃
In Subsection 7.3.3 we used the spectral radius to argue that the matrix that comes up
when finding the solution to Poisson’s equation is nonsingular. Key in that argument is a
result known as the Gershgorin Disk Theorem.
Theorem 9.2.1.4 Gershgorin Disk Theorem. Let A ∈ C^{m×m}, let ρ_i(A) = Σ_{j≠i} |α_{i,j}|, and
define the disks R_i(A) = { x ∈ C such that |x − α_{i,i}| ≤ ρ_i(A) }. Then every eigenvalue of A
lies in at least one of these disks: Λ(A) ⊂ ∪_i R_i(A).
Proof. Let λ ∈ Λ(A). Then (λI − A)x = 0 for some nonzero vector x. W.l.o.g. assume that
index i has the property that 1 = χ_i ≥ |χ_j| for j ≠ i. Then λ χ_i = Σ_j α_{i,j} χ_j or, equivalently,
   λ − α_{i,i} = α_{i,0} χ_0 + ··· + α_{i,i−1} χ_{i−1} + α_{i,i+1} χ_{i+1} + ··· + α_{i,m−1} χ_{m−1}.
Hence
   |λ − α_{i,i}|
 = |α_{i,0} χ_0 + ··· + α_{i,i−1} χ_{i−1} + α_{i,i+1} χ_{i+1} + ··· + α_{i,m−1} χ_{m−1}|
 ≤ |α_{i,0}| |χ_0| + ··· + |α_{i,i−1}| |χ_{i−1}| + |α_{i,i+1}| |χ_{i+1}| + ··· + |α_{i,m−1}| |χ_{m−1}|
 ≤ |α_{i,0}| + ··· + |α_{i,i−1}| + |α_{i,i+1}| + ··· + |α_{i,m−1}|
 = ρ_i(A).
⌅
YouTube: https://www.youtube.com/watch?v=19FXch2X7sQ
It is important to note that it is not necessarily the case that each such disk has exactly
one eigenvalue in it. There is, however, a slightly stronger result than Theorem 9.2.1.4.
Corollary 9.2.1.5 Let A and Ri (A) be as defined in Theorem 9.2.1.4. Let K and K C be
disjoint subsets of {0, . . . , m ≠ 1} such that K fi K C = {0, . . . , m ≠ 1}. In other words, let
K and K C partition {0, . . . , m ≠ 1}. If
   ( ∪_{k ∈ K} R_k(A) ) ∩ ( ∪_{j ∈ K^C} R_j(A) ) = ∅
then fikœK Rk (A) contains exactly |K| eigenvalues of A (multiplicity counted). In other words,
if fikœK Rk (A) does not intersect with any of the other disks, then it contains as many eigen-
values of A (multiplicity counted) as there are elements of K.
Proof. The proof splits A = D + (A ≠ D) where D equals the diagonal of A and considers
AÊ = D + Ê(A ≠ D), which varies continuously with Ê. One can argue that the disks Ri (A0 )
start with only one eigenvalue each and only when they start intersecting can an eigenvalue
"escape" the disk in which it started. We skip the details since we won’t need this result in
this course. ⌅
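The theorem is easy to explore numerically. The short MATLAB sketch below computes the
Gershgorin disk centers and radii of a matrix and checks that every eigenvalue lies in at least
one disk; the function name and the use of eig for the comparison are our own choices for
illustration.

function GershgorinCheck( A )
% Sketch: verify that every eigenvalue of A lies in some Gershgorin disk.
  centers = diag( A );                            % alpha_{i,i}
  radii   = sum( abs( A ), 2 ) - abs( centers );  % rho_i(A) = sum_{j ~= i} |alpha_{i,j}|
  lambda  = eig( A );
  for k = 1:length( lambda )
    in_some_disk = any( abs( lambda( k ) - centers ) <= radii + eps( 1 ) );
    fprintf( 'eigenvalue %10.4e %+10.4ei lies in a disk: %d\n', ...
             real( lambda( k ) ), imag( lambda( k ) ), in_some_disk );
  end
end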
Through a few homeworks, let’s review basic facts about eigenvalues and eigenvectors.
Homework 9.2.1.1 Let A œ Cm◊m .
TRUE/FALSE: 0 œ (A) if and only A is singular.
Answer. TRUE
Now prove it!
Solution.
Ax = ⁄x
and hence
xH Ax = ⁄xH x.
If we now conjugate both sides we find that
xH Ax = ⁄xH x
which is equivalent to
(xH Ax)H = (⁄xH x)H
which is equivalent to
xH Ax = ⁄xH x
since A is Hermitian. We conclude that
xH Ax
⁄= =⁄
xH x
(since xH x ”= 0).
Ax = ⁄x
and hence
xH Ax = ⁄xH x
and finally (since x ”= 0)
xH Ax
⁄= .
xH x
Since A is HPD, both xH Ax and xH x are positive, which means ⁄ is positive.
The converse is also always true, but we are not ready to prove that yet.
Homework 9.2.1.4 Let A œ Cm◊m be Hermitian, (⁄, x) and (µ, y) be eigenpairs associated
with A, and ⁄ ”= µ.
ALWAYS/SOMETIMES/NEVER: xH y = 0
Answer. ALWAYS
Now prove it!
Solution. Since
Ax = ⁄x and Ay = µy
we know that
y H Ax = ⁄y H x and xH Ay = µxH y
and hence (remembering that the eigenxvalues are real-valued)
⁄y H x = y H Ax = xH Ay = µxH y = µy H x.
and hence
“⁄x = µ“x.
Rewriting this we get that
(⁄ ≠ µ)“x = 0.
Since ⁄ ”= µ and “ ”= 0 this means that x = 0 which contradicts that x is an eigenvector.
We conclude that x and y are linearly independent.
We now generalize this insight.
Homework 9.2.1.6 Let A œ Cm◊m , k Æ m, and (⁄i , xi ) for 1 Æ i < k be eigenpairs of this
matrix. Prove that if ⁄i ”= ⁄j when i ”= j then the eigenvectors xi are linearly independent.
In other words, given a set of distinct eigenvalues, a set of vectors created by taking one
eigenvector per eigenvalue is linearly independent.
Hint. Prove by induction.
Solution. Proof by induction on k.
xk = “0 x0 + · · · + “k≠1 xk≠1
or, equivalently,
“0 (⁄0 ≠ ⁄k )x0 + · · · + “k≠1 (⁄i ≠ ⁄k )xk≠1 = 0.
Since at least one “i ”= 0 and ⁄i ”= ⁄k for 0 Æ i < k, we conclude that x0 , . . . , xk≠1 are
linearly dependent, which is a contradiction.
Hence, x0 , . . . , xk are linearly independent.
Consider the tridiagonal matrix from Homework 7.2.1.1, with 2 on its diagonal and −1 on its
sub- and superdiagonal, and the matrix from Week 7 that arises from discretizing Poisson's
equation on a two-dimensional mesh, which has 4 on its diagonal and at most four off-diagonal
entries equal to −1 in each row (in the positions corresponding to the neighboring mesh points).
ALWAYS/SOMETIMES/NEVER: All eigenvalues of these matrices are nonnegative.
ALWAYS/SOMETIMES/NEVER: All eigenvalues of the first matrix are positive.
Answer. ALWAYS: All eigenvalues of these matrices are nonnegative.
ALWAYS: All eigenvalues of the first matrix are positive. (So are all the eigenvalues of
the second matrix, but proving that is a bit trickier.)
Now prove it!
Solution. For the first matrix, we can use the Gershgorin disk theorem to conclude that
all eigenvalues of the matrix lie in the set { x such that |x − 2| ≤ 2 }. We also notice that
the matrix is symmetric, which means that its eigenvalues are real-valued. Hence the
eigenvalues are nonnegative. A similar argument can be used for the second matrix.
Now, in Homework 7.2.1.1 we showed that the first matrix is nonsingular. Hence, it
cannot have an eigenvalue equal to zero. We conclude that its eigenvalues are all positive.
It can be shown that the second matrix is also nonsingular, and hence has positive
eigenvalues. However, that is a bit nasty to prove...
YouTube: https://www.youtube.com/watch?v=NUvfjg-JUjg
We start by discussing how to further characterize eigenvalues of a given matrix A. We
say "characterize" because none of the discussed insights lead to practical algorithms for
computing them, at least for matrices larger than 4 ◊ 4.
Homework 9.2.2.1 Let
   A = ( α_{0,0}  α_{0,1} ;  α_{1,0}  α_{1,1} )
be a nonsingular matrix. Show that
   A^{−1} = 1/( α_{0,0} α_{1,1} − α_{1,0} α_{0,1} ) ( α_{1,1}  −α_{0,1} ;  −α_{1,0}  α_{0,0} ).
YouTube: https://www.youtube.com/watch?v=WvwcrDM_K3k
What we notice from the last exercises is that α_{0,0} α_{1,1} − α_{1,0} α_{0,1} characterizes whether
A is nonsingular. This quantity is the determinant of the 2 × 2 matrix: the determinant of
   A = ( α_{0,0}  α_{0,1} ;  α_{1,0}  α_{1,1} )
is given by
   det(A) = α_{0,0} α_{1,1} − α_{1,0} α_{0,1}.
⌃
Now, λ is an eigenvalue of A if and only if λI − A is singular. For our 2 × 2 matrix,
   λI − A = ( λ − α_{0,0}   −α_{0,1} ;  −α_{1,0}   λ − α_{1,1} )
is singular if and only if
   ( λ − α_{0,0} )( λ − α_{1,1} ) − ( −α_{1,0} )( −α_{0,1} ) = 0.
In other words, λ is an eigenvalue of this matrix if and only if λ is a root of
   p_A(λ) = ( λ − α_{0,0} )( λ − α_{1,1} ) − ( −α_{1,0} )( −α_{0,1} )
          = λ^2 − ( α_{0,0} + α_{1,1} ) λ + ( α_{0,0} α_{1,1} − α_{1,0} α_{0,1} ),
which is a polynomial of degree two. A polynomial of degree two has two roots (counting
multiplicity). This polynomial is known as the characteristic polynomial of the 2 × 2 matrix.
We now have a means for computing eigenvalues and eigenvectors of a 2 × 2 matrix:
• Form the characteristic polynomial p_A(λ).
• Solve for its roots, λ_0 and λ_1.
• Find nonzero vectors x_0 and x_1 in the null spaces of λ_0 I − A and λ_1 I − A, respectively.
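The three steps above can be mirrored in a few lines of MATLAB for a 2 × 2 example. The
particular matrix below and the use of roots are our own illustrative choices; this is only meant
to make the recipe concrete, since (as discussed next) it does not extend to a general algorithm
for larger matrices.

% Sketch: eigenvalues of a 2 x 2 matrix via its characteristic polynomial.
A = [ 2  1
      1 -0.5 ];                            % any 2 x 2 example
% p_A( lambda ) = lambda^2 - trace( A ) * lambda + det( A )
lambda = roots( [ 1, -trace( A ), det( A ) ] );
% For a 2 x 2 matrix, a vector in the null space of lambda * I - A can be
% written down directly (assuming alpha_{0,1} is nonzero):
x0 = [ A( 1, 2 ); lambda( 1 ) - A( 1, 1 ) ];
x1 = [ A( 1, 2 ); lambda( 2 ) - A( 1, 1 ) ];
disp( norm( A * x0 - lambda( 1 ) * x0 ) )  % should be (near) zero
disp( norm( A * x1 - lambda( 2 ) * x1 ) )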
YouTube: https://www.youtube.com/watch?v=FjoULa2dMC8
The notion of a determinant of a matrix, det(A), generalizes to m◊m matrices as does the
fact that A is nonsingular if and only if det(A) ”= 0. Similarly, the notion of a characteristic
polynomial is then generalized to m ◊ m matrices:
the characteristic polynomial of A ∈ C^{m×m} is p_m(λ) = det(λI − A), and λ is an eigenvalue of A
if and only if it is a root of p_m(λ). Recall that any polynomial of degree m,
   p_m(χ) = χ^m + ··· + γ_1 χ + γ_0,
can be factored as
   p_m(χ) = ( χ − χ_0 )^{m_0} ··· ( χ − χ_{k−1} )^{m_{k−1}},
where the χ_i are distinct roots, m_i equals the multiplicity of the root, and m_0 + ··· + m_{k−1} =
m. The concept of (algebraic) multiplicity carries over to eigenvalues.
Definition 9.2.2.5 Algebraic multiplicity of an eigenvalue. Let A œ Cm◊m and pm (⁄)
its characteristic polynomial. Then the (algebraic) multiplicity of eigenvalue ⁄i equals the
multiplicity of the corresponding root of the polynomial. ⌃
Often we will list the eigenvalues of A œ Cm◊m as m eigenvalues ⁄0 , . . . , ⁄m≠1 even when
some are equal (have algebraic multiplicity greater than one). In this case we say that
we are counting multiplicity. In other words, we are counting each eigenvalue (root of the
characteristic polynomial) separately, even if they are equal.
An immediate consequence is that A has m eigenvalues (multiplicity counted), since a
polynomial of degree m has m roots (multiplicity counted), which is captured in the following
lemma.
Lemma 9.2.2.6 If A œ Cm◊m then A has m eigenvalues (multiplicity counted).
The relation between eigenvalues and the roots of the characteristic polynomial yields a
disconcerting insight: A general formula for the eigenvalues of an arbitrary m ◊ m matrix
with m > 4 does not exist. The reason is that "Galois theory" tells us that there is no general
formula for the roots of a polynomial of degree m > 4 (details go beyond the scope of this
course). In particular, if
   p_m(λ) = λ^m + α_{m−1} λ^{m−1} + ··· + α_1 λ + α_0
and
   A = ( −α_{m−1}  −α_{m−2}  −α_{m−3}  ···  −α_1  −α_0
             1         0         0     ···    0     0
             0         1         0     ···    0     0
             0         0         1     ···    0     0
             ⋮         ⋮         ⋮      ⋱     ⋮     ⋮
             0         0         0     ···    1     0 ),
then
   p_m(λ) = det(λI − A).
(Since we don't discuss how to compute the determinant of a general matrix, you will have
to take our word for this fact.) Hence, we conclude that no general formula can be found for
the eigenvalues of m × m matrices when m > 4. What we will see is that we will instead
create algorithms that converge to the eigenvalues and/or eigenvectors of matrices.
Corollary 9.2.2.7 If A œ Rm◊m is real-valued then some or all of its eigenvalues may be
complex-valued. If eigenvalue ⁄ is complex-valued, then its conjugate, ⁄̄, is also an eigenvalue.
Indeed, the complex eigenvalues of a real-valued matrix come in complex pairs.
Proof. It can be shown that if A is real-valued, then the coefficients of its characteristic
polynomial are all real -valued. Complex roots of a polynomial with real coefficients come
in conjugate pairs. ⌅
The last corollary implies that if m is odd, then at least one eigenvalue of a real-valued
m ◊ m matrix must be real-valued.
Corollary 9.2.2.8 If A œ Rm◊m is real-valued and m is odd, then at least one of the
eigenvalues of A is real-valued.
YouTube: https://www.youtube.com/watch?v=BVqdIKTK1SI
It would seem that the natural progression for computing eigenvalues and eigenvectors
would be
• Compute the roots of the characteristic polynomial, which give the eigenvalues λ_i.
• Find eigenvectors associated with the eigenvalues by finding bases for the null spaces
  of λ_i I − A.
However, as mentioned, finding the roots of a polynomial is a problem. Moreover, finding
vectors in the null space is also problematic in the presence of roundoff error. For this
reason, the strategy for computing eigenvalues and eigenvectors is going to be to compute
approximations of eigenvectors hand in hand with the eigenvalues.
be the set of all eigenvectors of A associated with ⁄ plus the zero vector (which is not
considered an eigenvector). Show that E⁄ (A) is a subspace.
Solution. A set S µ Cm is a subspace if and only if for all – œ C and x, y œ Cm two
conditions hold:
• x œ S implies that –x œ S.
• x, y œ S implies that x + y œ S.
Then
   λI − D = diag( λ − δ_0, λ − δ_1, . . . , λ − δ_{m−1} )
is singular if and only if λ = δ_i for some i ∈ {0, . . . , m − 1}. Hence Λ(D) = {δ_0, δ_1, . . . , δ_{m−1}}.
Now,
   D e_j = ( the column of D indexed with j ) = δ_j e_j
and hence e_j is an eigenvector associated with δ_j.
Homework 9.2.3.3 Compute the eigenvalues and corresponding eigenvectors of
   A = ( −2  3  −7 ;  0  1  1 ;  0  0  2 ).
Solution. Since the matrix is upper triangular, its eigenvalues are its diagonal elements:
−2, 1, and 2.
• λ = −2: form −2 I − A and look for a vector in the null space of this matrix. By examination,
     ( 1, 0, 0 )^T
  is such a vector, and hence an eigenvector associated with −2.
• λ = 1: form 1 I − A and look for a vector in the null space of this matrix. Given where the
  zero appears on the diagonal, we notice that a vector of the form
     ( χ_0, 1, 0 )^T
  is in the null space if χ_0 is chosen appropriately. This means that
     3 χ_0 − 3(1) = 0,
  so χ_0 = 1 and ( 1, 1, 0 )^T is an eigenvector associated with 1.
• λ = 2: form 2 I − A and look for a vector in the null space of this matrix. Given where the
  zero appears on the diagonal, we notice that a vector of the form
     ( χ_0, χ_1, 1 )^T
  is in the null space if χ_0 and χ_1 are chosen appropriately. This means that
     χ_1 − 1(1) = 0,
  so χ_1 = 1, and then 4 χ_0 − 3(1) + 7(1) = 0, so χ_0 = −1. Hence ( −1, 1, 1 )^T is an
  eigenvector associated with 2.
Then Q R
⁄ ≠ ‚0,0 ≠‚0,1 ··· ≠‚0,m≠1
c
c 0 ⁄ ≠ ‚1,1 ··· ≠‚1,m≠1 d
d
⁄I ≠ U = c .. .. .. .. d.
c
a . . . .
d
b
0 0 · · · ⁄ ≠ ‚m≠1,m≠1
is singular if and only if ⁄ = ‚i,i for some i œ {0, . . . , m≠1}. Hence (U ) = {‚0,0 , ‚1,1 , . . . , ‚m≠1,m≠1 }.
Let ⁄ be an eigenvalue of U . Things get a little tricky if ⁄ has multiplicity greater than
one. Partition Q R
U00 u01 U02
c d
U = a 0 ‚11 uT12 b
0 0 U22
where ‚11 = ⁄. We are looking for x ”= 0 such that (⁄I ≠ U )x = 0 or, partitioning x,
Q RQ R Q R
‚11 I ≠ U00 ≠u01 ≠U02 x0 0
c dc d c d
a 0 0 ≠u12T
b a ‰1 b = a 0 b .
0 0 ‚11 I ≠ U22 x2 0
must be such that ‚11 is the FIRST diagonal element that equals ⁄.
In the next week, we will see that practical algorithms for computing the eigenvalues and
eigenvectors of a square matrix morph that matrix into an upper triangular matrix via a
sequence of transforms that preserve eigenvalues. The eigenvectors of that triangular matrix
can then be computed using techniques similar to those in the solution to the last homework.
Once those have been computed, they can be "back transformed" into the eigenvectors of
the original matrix.
YouTube: https://www.youtube.com/watch?v=2AsK3KEtsso
Practical methods for computing eigenvalues and eigenvectors transform a given matrix
into a simpler matrix (diagonal or tridiagonal) via a sequence of transformations that preserve
eigenvalues known as similarity transformations.
Definition 9.2.4.1 Given a nonsingular matrix Y, the transformation Y^{−1} A Y is called a
similarity transformation (applied to matrix A). ⌃
Definition 9.2.4.2 Matrices A and B are said to be similar if there exists a nonsingular
matrix Y such that B = Y^{−1} A Y. ⌃
Homework 9.2.4.1 Let A, B, Y œ Cm◊m , where Y is nonsingular, and (⁄, x) an eigenpair
of A.
Which of the follow is an eigenpair of B = Y ≠1 AY :
• (⁄, x).
• (⁄, Y ≠1 x).
• (⁄, Y x).
• (1/⁄, Y ≠1 x).
Y ≠1 AY Y ≠1 x = ⁄Y ≠1 x.
It is not hard to expand the last proof to show that if A is similar to B and ⁄ œ (A)
has algebraic multiplicity k then ⁄ œ (B) has algebraic multiplicity k.
YouTube: https://www.youtube.com/watch?v=n02VjGJX5CQ
In Subsection 2.2.7, we argued that the application of unitary matrices is desirable, since
they preserve length and hence don’t amplify error. For this reason, unitary similarity trans-
formations are our weapon of choice when designing algorithms for computing eigenvalues
and eigenvectors.
Definition 9.2.4.4 Given a unitary matrix Q, the transformation Q^H A Q is called a
unitary similarity transformation (applied to matrix A). ⌃
YouTube: https://www.youtube.com/watch?v=mJNM4EYB-9s
The following is a fundamental theorem for the algebraic eigenvalue problem that is key
to practical algorithms for finding eigenvalues and eigenvectors.
Theorem 9.2.4.5 Schur Decomposition Theorem. Let A œ Cm◊m . Then there exist a
unitary matrix Q and upper triangular matrix U such that A = QU QH . This decomposition
is called the Schur decomposition of matrix A.
Proof. We will outline how to construct Q so that QH AQ = U , an upper triangular matrix.
Since a polynomial of degree m has at least one root, matrix A has at least one eigenvalue,
⁄1 , and corresponding eigenvector q1 , where we normalize this eigenvector to have length one.
Thus A q_1 = λ_1 q_1. Choose Q_2 so that Q = ( q_1  Q_2 ) is unitary. Then
   Q^H A Q
 = ( q_1  Q_2 )^H A ( q_1  Q_2 )
 = ( q_1^H A q_1   q_1^H A Q_2 ;  Q_2^H A q_1   Q_2^H A Q_2 )
 = ( λ_1   q_1^H A Q_2 ;  λ_1 Q_2^H q_1   Q_2^H A Q_2 )
 = ( λ_1   w^T ;  0   B ),
where w^T = q_1^H A Q_2 and B = Q_2^H A Q_2 (here we used A q_1 = λ_1 q_1 and Q_2^H q_1 = 0).
This insight can be used to construct an inductive proof. ⌅
In other words: Given matrix A, there exists a unitary matrix Q such that applying the
unitary similarity transformation QH AQ yields an upper triangular matrix U . Since then
(A) = (U ), the eigenvalues of A can be found on the diagonal of U . The eigenvectors of
U can be computed and from those the eigenvectors of A can be recovered.
One should not mistake the above theorem and its proof for a constructive way to compute
the Schur decomposition: finding an eigenvalue, ⁄1 and/or 1 the eigenvector
2 associated with
it, q1 , is difficult. Also, completing the unitary matrix q1 Q2 is expensive (requiring
the equivalent of a QR factorization).
Homework 9.2.4.2 Let A œ Cm◊m , A = QU QH be its Schur decomposition, and X ≠1 U X =
, where is a diagonal matrix and X is nonsingular.
• How are the elements of related to the elements of U ?
Solution.
A = QU QH = QX X ≠1 QH = (QX) (QX)≠1 .
Hence the columns of QX equal eigenvectors of A.
YouTube: https://www.youtube.com/watch?v=uV5-0O_LBkA
If the matrix is Hermitian, then the Schur decomposition has the added property that U
is diagonal. The resulting decomposition is known as the Spectral decomposition.
Theorem 9.2.4.6 Spectral Decomposition Theorem. Let A œ Cm◊m be Hermitian.
Then there exist a unitary matrix Q and diagonal matrix D œ Rm◊m such that A = QDQH .
This decomposition is called the spectral decomposition of matrix A.
Proof. Let A = QU QH be the Schur decomposition of A. Then U = QH AQ. Since A is
Hermitian, so is U, since U^H = (Q^H A Q)^H = Q^H A^H Q = Q^H A Q = U. A triangular matrix
that is Hermitian is diagonal. Any Hermitian matrix has real values on its diagonal and hence
D has real values on its diagonal. ⌅
In practical algorithms, it will often occur that an intermediate result can be partitioned
into smaller subproblems. This is known as deflating the problem and builds on the fol-
lowing insights.
Theorem 9.2.4.7 Let A ∈ C^{m×m} be of the form
   A = ( A_{TL}  A_{TR} ;  0  A_{BR} ),
where A_{TL} and A_{BR} are square submatrices. Then Λ(A) = Λ(A_{TL}) ∪ Λ(A_{BR}).
The proof of the above theorem follows from the next homework regarding how the Schur
decomposition of A can be computed from the Schur decompositions of AT L and ABR .
Homework 9.2.4.3 Let A ∈ C^{m×m} be of the form
   A = ( A_{TL}  A_{TR} ;  0  A_{BR} ),
and let
   A_{TL} = Q_{TL} U_{TL} Q_{TL}^H   and   A_{BR} = Q_{BR} U_{BR} Q_{BR}^H
be Schur decompositions of its diagonal blocks. Give a Schur decomposition of A.
Solution.
   A = ( A_{TL}  A_{TR} ;  0  A_{BR} )
     = ( Q_{TL} U_{TL} Q_{TL}^H   A_{TR} ;  0   Q_{BR} U_{BR} Q_{BR}^H )
     = ( Q_{TL}  0 ;  0  Q_{BR} ) ( U_{TL}   Q_{TL}^H A_{TR} Q_{BR} ;  0   U_{BR} ) ( Q_{TL}  0 ;  0  Q_{BR} )^H,
where the three factors are the unitary matrix Q, the upper triangular matrix U, and Q^H of a
Schur decomposition A = Q U Q^H.
Homework 9.2.4.4 Generalize the result in the last homework to block upper triangular
matrices:
   A = ( A_{0,0}  A_{0,1}  ···  A_{0,N−1}
         0        A_{1,1}  ···  A_{1,N−1}
         ⋮                 ⋱    ⋮
         0        0        ···  A_{N−1,N−1} ).
Solution. Let A_{i,i} = Q_i U_{i,i} Q_i^H be a Schur decomposition of the (i, i) block. Then
   A = diag( Q_0, Q_1, . . . , Q_{N−1} )
       ( U_{0,0}   Q_0^H A_{0,1} Q_1   ···   Q_0^H A_{0,N−1} Q_{N−1}
         0         U_{1,1}            ···   Q_1^H A_{1,N−1} Q_{N−1}
         ⋮                             ⋱     ⋮
         0         0                  ···   U_{N−1,N−1} )
       diag( Q_0, Q_1, . . . , Q_{N−1} )^H,
which is a Schur decomposition of A.
YouTube: https://www.youtube.com/watch?v=dLaFK2TJ7y8
The algebraic eigenvalue problem or, more simply, the computation of eigenvalues and
eigenvectors is often presented as the problem of diagonalizing a matrix. We make that link
in this unit.
Let X ∈ C^{m×m} be nonsingular and suppose that X^{−1} A X = D, where D is diagonal. Any
vector w ∈ C^m can be written as
   w = X ( X^{−1} w ) = X w̃,   where w̃ = X^{−1} w.
In other words, X^{−1} w is the vector of coefficients when w is written in terms of the basis
that consists of the columns of X. Similarly, we can write v as a linear combination of the
columns of X:
   v = X ( X^{−1} v ) = X ṽ,   where ṽ = X^{−1} v.
Now, since X is nonsingular, w = A v is equivalent to X^{−1} w = X^{−1} A X X^{−1} v, and hence
   X^{−1} w = D ( X^{−1} v ).
Remark 9.2.5.2 We conclude that if we view the matrices in the right basis (namely the
basis that consists of the columns of X), then the transformation w := A v simplifies to
w̃ := D ṽ. This is a really big deal.
How is diagonalizing a matrix related to eigenvalues and eigenvectors? Let's assume that
X^{−1} A X = D. We can rewrite this as
   A X = X D
and partition X by columns and D by its diagonal elements:
   A ( x_0  x_1  ···  x_{m−1} ) = ( x_0  x_1  ···  x_{m−1} ) diag( δ_0, δ_1, . . . , δ_{m−1} ).
We conclude that
   A x_j = δ_j x_j,
which means that the entries on the diagonal of D are the eigenvalues of A and the corre-
sponding eigenvectors are found as columns of X.
Homework 9.2.5.1 In Homework 9.2.3.3, we computed the eigenvalues and corresponding
eigenvectors of
   A = ( −2  3  −7 ;  0  1  1 ;  0  0  2 ).
Use the answer to that question to give a matrix X such that X^{−1} A X = Λ. Check that
A X = X Λ.
Solution. The eigenpairs computed for Homework 9.2.3.3 were
   ( −2, ( 1, 0, 0 )^T ),  ( 1, ( 1, 1, 0 )^T ),  and  ( 2, ( −1, 1, 1 )^T ).
Hence, with
   X = ( 1  1  −1 ;  0  1  1 ;  0  0  1 )   and   Λ = diag( −2, 1, 2 ),
we have X^{−1} A X = Λ. We can check this by verifying that A X = X Λ: both products equal
   ( −2  1  −2 ;  0  1  2 ;  0  0  2 ).
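This check is also easy to confirm in MATLAB. The snippet below is just an illustrative
verification of the computation above (nothing here is required by the homework):

% Sketch: verify the diagonalization X^{-1} A X = Lambda for the example above.
A      = [ -2  3 -7
            0  1  1
            0  0  2 ];
X      = [  1  1 -1
            0  1  1
            0  0  1 ];
Lambda = diag( [ -2  1  2 ] );
disp( A * X - X * Lambda )        % should be the zero matrix
disp( X \ ( A * X ) )             % should (numerically) equal Lambda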
Now assume that the eigenvalues of A ∈ C^{m×m} are given by {λ_0, λ_1, . . . , λ_{m−1}}, where
eigenvalues are repeated according to their algebraic multiplicity. Assume that there are m
linearly independent vectors {x_0, x_1, . . . , x_{m−1}} such that A x_j = λ_j x_j. Then
   A ( x_0  x_1  ···  x_{m−1} ) = ( x_0  x_1  ···  x_{m−1} ) diag( λ_0, λ_1, . . . , λ_{m−1} ).
Hence, if X = ( x_0  x_1  ···  x_{m−1} ) and D = diag( λ_0, λ_1, . . . , λ_{m−1} ), then X^{−1} A X = D. In
other words, if A has m linearly independent eigenvectors, then A is diagonalizable.
These insights are summarized in the following theorem:
Theorem 9.2.5.3 A matrix A ∈ C^{m×m} is diagonalizable if and only if it has m linearly
independent eigenvectors.
Examples of classes of matrices that can be diagonalized:
• Diagonal matrices.
If A is diagonal, then choosing X = I and A = D yields X ≠1 AX = D.
• Hermitian matrices.
If A is Hermitian, then the spectral decomposition theorem tells us that there exists
unitary matrix Q and diagonal matrix D such that A = QDQH . Choosing X = Q
yields X ≠1 AX = D.
X ≠1 QH AQX = X ≠1 U X = D
YouTube: https://www.youtube.com/watch?v=amD2FOSXf s
Homework 9.2.6.1 Compute the eigenvalues of the k × k matrix
   J_k(µ) = ( µ  1  0  ···  0  0
              0  µ  1  ···  0  0
              ⋮  ⋮  ⋱  ⋱   ⋮  ⋮
              0  0  0  ···  µ  1
              0  0  0  ···  0  µ ),      (9.2.1)
where k > 1. For each eigenvalue, compute a basis for the subspace of its eigenvectors
(including the zero vector to make it a subspace).
Hint.
• What does this say about the dimension of the null space N( λI − J_k(µ) )?
Solution. Since the matrix is upper triangular and all entries on its diagonal equal µ, its
only eigenvalue is µ. Now,
   µI − J_k(µ) = ( 0  −1   0  ···   0   0
                   0   0  −1  ···   0   0
                   ⋮   ⋮   ⋱   ⋱    ⋮   ⋮
                   0   0   0  ···   0  −1
                   0   0   0  ···   0   0 )
has k − 1 linearly independent columns and hence its null space is one dimensional:
dim( N( µI − J_k(µ) ) ) = 1. So, we are looking for one vector in the basis of N( µI − J_k(µ) ).
By examination, J_k(µ) e_0 = µ e_0 and hence e_0 is an eigenvector associated with the only
eigenvalue µ.
YouTube: https://www.youtube.com/watch?v=QEunPSiFZF0
The matrix in (9.2.1) is known as a Jordan block.
The point of the last exercise is to show that if A has an eigenvalue of algebraic multiplicity
k, then it does not necessarily have k linearly independent eigenvectors. That, in turn, means
there are matrices that do not have a full set of eigenvectors. We conclude that there are
matrices that are not diagonalizable. We call such matrices defective.
Definition 9.2.6.1 Defective matrix. A matrix A œ Cm◊m that does not have m linearly
independent eigenvectors is said to be defective. ⌃
Corollary 9.2.6.2 Matrix A œ Cm◊m is diagonalizable if and only if it is not defective.
Proof. This is an immediate consequence of Theorem 9.2.5.3. ⌅
Definition 9.2.6.3 Geometric multiplicity. Let λ ∈ Λ(A). Then the geometric multi-
plicity of λ is defined to be the dimension of
   E_λ(A) = { x ∈ C^m such that A x = λ x }.
In other words, the geometric multiplicity of λ equals the number of linearly independent
eigenvectors that are associated with λ. ⌃
Homework 9.2.6.2 Let A œ Cm◊m have the form
A B
A00 0
A=
0 A11
YouTube: https://www.youtube.com/watch?v=RYg4xLKehDQ
Theorem 9.2.6.4 Jordan Canonical Form Theorem. Let the eigenvalues of A œ Cm◊m
be given by ⁄0 , ⁄1 , · · · , ⁄k≠1 , where an eigenvalue is listed as many times as its geometric
multiplicity. There exists a nonsingular matrix X such that
   X^{−1} A X = diag( J_{m_0}(λ_0), J_{m_1}(λ_1), . . . , J_{m_{k−1}}(λ_{k−1}) ).
For our discussion, the sizes of the Jordan blocks Jmi (⁄i ) are not particularly important.
Indeed, this decomposition, known as the Jordan Canonical Form of matrix A, is not par-
ticularly interesting in practice. It is extremely sensitive to perturbation: even the smallest
random change to a matrix will make it diagonalizable. As a result, there is no practical
mathematical software library or tool that computes it. For this reason, we don’t give its
proof and don’t discuss it further.
so that
Axi = ⁄i xi for i = 0, . . . , m ≠ 1.
YouTube: https://www.youtube.com/watch?v=sX9pxaH7Wvs
Let v^(0) ∈ C^m be an initial guess. Our (first attempt at the) Power Method iterates
as follows:
   Pick v^(0)
   for k = 0, . . .
      v^(k+1) = A v^(k)
   endfor
Clearly v^(k) = A^k v^(0).
Let
   v^(0) = ψ_0 x_0 + ψ_1 x_1 + ··· + ψ_{m−1} x_{m−1}.
Then
   v^(1) = A v^(0) = A ( ψ_0 x_0 + ψ_1 x_1 + ··· + ψ_{m−1} x_{m−1} )
         = ψ_0 A x_0 + ψ_1 A x_1 + ··· + ψ_{m−1} A x_{m−1}
         = ψ_0 λ_0 x_0 + ψ_1 λ_1 x_1 + ··· + ψ_{m−1} λ_{m−1} x_{m−1},
   v^(2) = A v^(1) = ψ_0 λ_0^2 x_0 + ψ_1 λ_1^2 x_1 + ··· + ψ_{m−1} λ_{m−1}^2 x_{m−1},
     ⋮
   v^(k) = A v^(k−1) = ψ_0 λ_0^k x_0 + ψ_1 λ_1^k x_1 + ··· + ψ_{m−1} λ_{m−1}^k x_{m−1}.
Now, as long as ψ_0 ≠ 0, clearly ψ_0 λ_0^k x_0 will eventually dominate, since |λ_0| > |λ_j| for j > 0.
This means that v^(k) will start pointing in the direction of x_0. In other words, it will start
pointing in the direction of an eigenvector corresponding to λ_0. The problem is that it will
become infinitely long if |λ_0| > 1 or infinitesimally short if |λ_0| < 1. All is good if |λ_0| = 1.
YouTube: https://www.youtube.com/watch?v=LTON9qw8B0Y
An alternative way of looking at this is to exploit the fact that the eigenvectors, x_i, equal
the columns of X. Then
   y = ( ψ_0, ψ_1, . . . , ψ_{m−1} )^T = X^{−1} v^(0)
and
   v^(0) = A^0 v^(0) = X y
   v^(1) = A v^(0) = A X y = X Λ y
   v^(2) = A v^(1) = A X Λ y = X Λ^2 y
     ⋮
   v^(k) = A v^(k−1) = A X Λ^{k−1} y = X Λ^k y.
Thus
   v^(k) = ( x_0  x_1  ···  x_{m−1} ) diag( λ_0^k, λ_1^k, . . . , λ_{m−1}^k ) ( ψ_0, ψ_1, . . . , ψ_{m−1} )^T
         = ψ_0 λ_0^k x_0 + ψ_1 λ_1^k x_1 + ··· + ψ_{m−1} λ_{m−1}^k x_{m−1}.
Notice how looking at v (k) in the right basis (the eigenvectors) simplified the explanation.
YouTube: https://www.youtube.com/watch?v=MGTGo_TGpTM
Given an initial v (0) œ Cm , a second attempt at the Power Method iterates as follows:
Pick v (0)
for k = 0, . . .
v (k+1) = Av (k) /⁄0
endfor
Now,
   A^k = ( A A ··· A ) = ( X Λ X^{−1} )( X Λ X^{−1} ) ··· ( X Λ X^{−1} ) = X Λ^k X^{−1},
so that
   v^(k) = A^k v^(0) / λ_0^k
         = A^k X y / λ_0^k
         = X Λ^k X^{−1} X y / λ_0^k
         = X Λ^k y / λ_0^k
         = X diag( 1, (λ_1/λ_0)^k, . . . , (λ_{m−1}/λ_0)^k ) y.
Now, since |λ_j / λ_0| < 1 for j > 0, we can argue that
   lim_{k→∞} v^(k)
 = lim_{k→∞} X diag( 1, (λ_1/λ_0)^k, . . . , (λ_{m−1}/λ_0)^k ) y
 = X diag( 1, 0, . . . , 0 ) y
 = X ψ_0 e_0
 = ψ_0 X e_0 = ψ_0 x_0.
Thus, as long as ψ_0 ≠ 0 (which means v^(0) must have a component in the direction of x_0), this
method will eventually yield a vector in the direction of x_0. However, this time the problem
is that we don't know λ_0 when we start. A practical alternative is to scale v^(k) to have unit
length in each iteration and, once it starts approximating an eigenvector, to estimate the
corresponding eigenvalue with the Rayleigh quotient
   x^H A x / ( x^H x ).
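Here is a minimal MATLAB sketch along those lines: the iterate is normalized each step so
that it neither blows up nor vanishes, and the Rayleigh quotient supplies an eigenvalue
estimate. The function name, the fixed iteration count, and the random starting vector are
our own illustrative choices, not something prescribed by the text.

function [ lambda, v ] = PowerMethodSketch( A, numit )
% Sketch of a scaled Power Method with a Rayleigh quotient eigenvalue estimate.
  m = size( A, 1 );
  v = rand( m, 1 );              % random start: almost surely psi_0 ~= 0
  v = v / norm( v );
  for k = 1:numit
    v = A * v;                   % v^(k+1) = A v^(k)
    v = v / norm( v );           % scale to unit length instead of dividing by lambda_0
  end
  lambda = v' * A * v;           % Rayleigh quotient (note v' * v = 1)
end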
YouTube: https://www.youtube.com/watch?v=P-U4dfwHMwM
Before we make the algorithm practical, let us examine how fast the iteration converges.
This requires a few definitions regarding rates of convergence.
Definition 9.3.2.1 Convergence of a sequence of scalars. Let –0 , –1 , –2 , . . . œ R be
an infinite sequence of scalars. Then –k is said to converge to – if
lim |–k ≠ –| = 0.
kæŒ
⌃
Definition 9.3.2.2 Convergence of a sequence of vectors. Let x0 , x1 , x2 , . . . œ Cm be
an infinite sequence of vectors. Then xk is said to converge to x if for any norm Î · Î
lim Îxk ≠ xÎ = 0.
kæŒ
⌃
Because of the equivalence of norms, discussed in Subsection 1.2.6, if a sequence of vectors
converges in one norm, then it converges in all norms.
Definition 9.3.2.3 Rate of convergence. Let α_0, α_1, α_2, . . . ∈ R be an infinite sequence
of scalars that converges to α. Then
• α_k is said to converge linearly to α if, for sufficiently large k,
     |α_{k+1} − α| ≤ C |α_k − α|
  for some constant C < 1, i.e.,
     lim_{k→∞} |α_{k+1} − α| / |α_k − α| = C < 1.
• α_k is said to converge superlinearly to α if, for sufficiently large k,
     |α_{k+1} − α| ≤ C_k |α_k − α|
  with C_k → 0, i.e.,
     lim_{k→∞} |α_{k+1} − α| / |α_k − α| = 0.
• α_k is said to converge quadratically to α if
     lim_{k→∞} |α_{k+1} − α| / |α_k − α|^2 = C
  for some constant C.
• α_k is said to converge cubically to α if
     lim_{k→∞} |α_{k+1} − α| / |α_k − α|^3 = C
  for some constant C.
⌃
Linear convergence can be slow. Let's say that for k ≥ K we observe that
   |α_{k+1} − α| ≤ C |α_k − α|.
Then, clearly, |α_{k+n} − α| ≤ C^n |α_k − α|. If C = 0.99, progress may be very, very slow. If
|α_k − α| = 1, then
   |α_{k+1} − α| ≤ 0.99000
   |α_{k+2} − α| ≤ 0.98010
   |α_{k+3} − α| ≤ 0.97030
   |α_{k+4} − α| ≤ 0.96060
   |α_{k+5} − α| ≤ 0.95099
   |α_{k+6} − α| ≤ 0.94148
   |α_{k+7} − α| ≤ 0.93206
   |α_{k+8} − α| ≤ 0.92274
   |α_{k+9} − α| ≤ 0.91351
Quadratic convergence is fast. Now
   |α_{k+1} − α| ≤ C |α_k − α|^2
   |α_{k+2} − α| ≤ C |α_{k+1} − α|^2 ≤ C ( C |α_k − α|^2 )^2 = C^3 |α_k − α|^4
   |α_{k+3} − α| ≤ C |α_{k+2} − α|^2 ≤ C ( C^3 |α_k − α|^4 )^2 = C^7 |α_k − α|^8
     ⋮
   |α_{k+n} − α| ≤ C^{2^n − 1} |α_k − α|^{2^n},
and hence (roughly, when C ≈ 1)
   −log_10 |α_{k+1} − α| ≥ 2 ( −log_10 |α_k − α| ),
where −log_10 |α_{k+1} − α| and −log_10 |α_k − α| are (roughly) the number of correct digits in
α_{k+1} and in α_k, respectively. In other words, the number of correct digits eventually doubles
from one iteration to the next.
Cubic convergence is dizzyingly fast: Eventually the number of correct digits triples from
one iteration to the next.
For our analysis for the convergence of the Power Method, we define a convenient norm.
Homework 9.3.2.1 Let X œ Cm◊m be nonsingular. Define Î · ÎX ≠1 : Cm æ R by ÎyÎX ≠1 =
ÎX ≠1 yÎ for some given norm Î · Î : Cm æ R.
ALWAYS/SOMETIMES/NEVER: Î · ÎX ≠1 is a norm.
Solution. We need to show that ‖ · ‖_{X^{-1}} satisfies the properties of a norm. Here we verify the triangle inequality:
\[
\begin{array}{rcl}
\| x + y \|_{X^{-1}} &=& \| X^{-1} ( x + y ) \| \\
&=& \| X^{-1} x + X^{-1} y \| \\
&\leq& \| X^{-1} x \| + \| X^{-1} y \| \\
&=& \| x \|_{X^{-1}} + \| y \|_{X^{-1}} .
\end{array}
\]
(Positivity and homogeneity follow similarly from the corresponding properties of ‖ · ‖ and the nonsingularity of X.)
What do we learn from this exercise? Recall that a vector z can alternatively be written as X(X^{-1}z), so that the vector ẑ = X^{-1}z tells you how to represent the vector z in the basis given by the columns of X. What the exercise tells us is that measuring a vector by applying a known norm to its representation in a new basis also yields a norm.
Hence
\[
X^{-1} ( v^{(k)} - \psi_0 x_0 ) = \left( \begin{array}{cccc}
0 & 0 & \cdots & 0 \\
0 & \left( \frac{\lambda_1}{\lambda_0} \right)^k & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \left( \frac{\lambda_{m-1}}{\lambda_0} \right)^k
\end{array} \right) y
\]
and
\[
X^{-1} ( v^{(k+1)} - \psi_0 x_0 ) = \left( \begin{array}{cccc}
0 & 0 & \cdots & 0 \\
0 & \frac{\lambda_1}{\lambda_0} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_{m-1}}{\lambda_0}
\end{array} \right) X^{-1} ( v^{(k)} - \psi_0 x_0 ) .
\]
Now, let ‖ · ‖ be a p-norm and ‖ · ‖_{X^{-1}} as defined in Homework 9.3.2.1. Then
\[
\begin{array}{rcl}
\| v^{(k+1)} - \psi_0 x_0 \|_{X^{-1}} &=& \| X^{-1} ( v^{(k+1)} - \psi_0 x_0 ) \| \\
&=& \left\| \left( \begin{array}{cccc}
0 & 0 & \cdots & 0 \\
0 & \frac{\lambda_1}{\lambda_0} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{\lambda_{m-1}}{\lambda_0}
\end{array} \right) X^{-1} ( v^{(k)} - \psi_0 x_0 ) \right\| \\
&\leq& \left| \frac{\lambda_1}{\lambda_0} \right| \| X^{-1} ( v^{(k)} - \psi_0 x_0 ) \| = \left| \frac{\lambda_1}{\lambda_0} \right| \| v^{(k)} - \psi_0 x_0 \|_{X^{-1}} .
\end{array}
\]
This shows that, in this norm, the convergence of v^{(k)} to ψ_0 x_0 is linear: the difference between the current approximation, v^{(k)}, and the eventual vector in the direction of the desired eigenvector, ψ_0 x_0, is reduced by at least a constant factor in each iteration.
Now, what if
\[ | \lambda_0 | = \cdots = | \lambda_{n-1} | > | \lambda_n | \geq \cdots \geq | \lambda_{m-1} | ? \]
By extending the above analysis one can easily show that v^{(k)} will converge to a vector in the subspace spanned by the eigenvectors associated with λ_0, . . . , λ_{n−1}.
An important special case is when n = 2: if A is real-valued, then λ_0 may be complex-valued, in which case its conjugate, λ̄_0, is also an eigenvalue and hence has the same magnitude as λ_0. We deduce that v^{(k)} will always be in the space spanned by the eigenvectors corresponding to λ_0 and λ̄_0.
Let A be nonsingular and (λ, x) an eigenpair of A. Which of the following is an eigenpair of A^{-1}?
• (λ, x).
• (1/λ, x).
YouTube: https://www.youtube.com/watch?v=pKPMuiCNC2s
The Power Method homes in on an eigenvector associated with the largest (in magnitude)
eigenvalue. The Inverse Power Method homes in on an eigenvector associated with the
smallest eigenvalue (in magnitude).
Once again, we assume that a given matrix A ∈ ℂ^{m×m} is diagonalizable so that there exist a nonsingular matrix X and diagonal matrix Λ such that A = XΛX^{-1}. We further assume that Λ = diag(λ_0, · · · , λ_{m−1}) and
\[ | \lambda_0 | \geq | \lambda_1 | \geq \cdots \geq | \lambda_{m-2} | > | \lambda_{m-1} | > 0 . \]
Notice that this means that A is nonsingular.
Clearly, if
\[ | \lambda_0 | \geq | \lambda_1 | \geq \cdots \geq | \lambda_{m-2} | > | \lambda_{m-1} | > 0 , \]
then
\[ \left| \frac{1}{\lambda_{m-1}} \right| > \left| \frac{1}{\lambda_{m-2}} \right| \geq \left| \frac{1}{\lambda_{m-3}} \right| \geq \cdots \geq \left| \frac{1}{\lambda_0} \right| . \]
Thus, an eigenvector associated with the smallest (in magnitude) eigenvalue of A is an eigenvector associated with the largest (in magnitude) eigenvalue of A^{-1}.
YouTube: https://www.youtube.com/watch?v=D6KF28ycRB0
This suggests the following naive iteration (which mirrors the second attempt for the Power Method in Subsubsection 9.3.1.2, but iterating with A^{-1}):
for k = 0, . . .
    v^(k+1) = A^{-1} v^(k)
    v^(k+1) = λ_{m−1} v^(k+1)
endfor
From the analysis of the convergence of the Power Method algorithm in Subsection 9.3.2, we conclude that now
\[ \| v^{(k+1)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} \leq \left| \frac{\lambda_{m-1}}{\lambda_{m-2}} \right| \| v^{(k)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} . \]
We would probably want to factor P A = LU (LU factorization with partial pivoting) once
and solve L(U v (k+1) ) = P v (k) rather than multiplying with A≠1 .
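A minimal MATLAB sketch of this idea follows (our illustration; normalization by the 2-norm replaces the scaling by the unknown λ_{m−1}, and the test matrix is arbitrary):

    % Sketch: Inverse Power Method with a one-time LU factorization.
    A = gallery( 'lehmer', 5 );               % any nonsingular test matrix
    maxits = 50;
    [ L, U, P ] = lu( A );                    % P A = L U, computed once, O(m^3)
    v = randn( size( A, 1 ), 1 );  v = v / norm( v );
    for k = 1:maxits
        v = U \ ( L \ ( P * v ) );            % solve A w = v, O(m^2) per iteration
        v = v / norm( v );                    % normalize instead of scaling by lambda_{m-1}
    end
    lambda_min = v' * A * v                   % Rayleigh quotient: smallest eigenvalue estimate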
YouTube: https://www.youtube.com/watch?v=7OOJcvYxbxM
A basic idea that allows one to accelerate the convergence of the inverse iteration is
captured by the following exercise:
Homework 9.3.4.1 Let A ∈ ℂ^{m×m}, ρ ∈ ℂ, and (λ, x) an eigenpair of A.
Which of the following is an eigenpair of the shifted matrix A − ρI:
• (λ, x).
• (λ − ρ, x).
Answer. (λ − ρ, x).
Now justify your answer.
Solution. Let Ax = λx. Then
\[ ( A - \rho I ) x = A x - \rho x = \lambda x - \rho x = ( \lambda - \rho ) x . \]
YouTube: https://www.youtube.com/watch?v=btFWxkXkXZ8
The matrix A − ρI is referred to as the matrix A that has been "shifted" by ρ. What the next lemma captures is that shifting A by ρ shifts the spectrum of A by ρ:
Lemma 9.3.4.1 Let A ∈ ℂ^{m×m}, A = XΛX^{-1}, and ρ ∈ ℂ. Then A − ρI = X(Λ − ρI)X^{-1}.
Homework 9.3.4.2 Prove Lemma 9.3.4.1.
Solution.
\[ A - \rho I = X \Lambda X^{-1} - \rho X X^{-1} = X ( \Lambda - \rho I ) X^{-1} . \]
This suggests the following (naive) iteration: Pick a value ρ close to λ_{m−1} and iterate
for k = 0, . . .
    v^(k+1) = (A − ρI)^{-1} v^(k)
endfor
Of course, one would solve (A − ρI)v^(k+1) = v^(k) rather than computing and applying the inverse of A − ρI.
If we index the eigenvalues so that
\[ | \lambda_0 - \rho | \geq | \lambda_1 - \rho | \geq \cdots \geq | \lambda_{m-2} - \rho | > | \lambda_{m-1} - \rho | , \]
then
\[ \| v^{(k+1)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} \leq \left| \frac{\lambda_{m-1} - \rho}{\lambda_{m-2} - \rho} \right| \| v^{(k)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} . \]
The closer the shift ρ is chosen to λ_{m−1}, the more favorable the ratio (constant) that dictates the linear convergence of this modified Inverse Power Method.
YouTube: https://www.youtube.com/watch?v=fCDYbunugKk
where instead of multiplying by the inverse one would solve the linear system (A − ρI)v^(k+1) = v^(k).
The question now becomes how to choose ρ so that it is a good guess for λ_{m−1}. Often an application inherently supplies a reasonable approximation for the smallest eigenvalue or an eigenvalue of particular interest. Alternatively, we know that eventually v^(k) becomes a good approximation for x_{m−1}, and therefore the Rayleigh quotient gives us a way to find a good approximation for λ_{m−1}. This suggests the (naive) Rayleigh-quotient iteration:
with
\[ \lim_{k \rightarrow \infty} ( \lambda_{m-1} - \rho_k ) = 0 , \]
which means superlinear convergence is observed. In fact, it can be shown that once k is large enough
\[ \| v^{(k+1)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} \leq C \| v^{(k)} - \psi_{m-1} x_{m-1} \|_{X^{-1}}^2 , \]
thus achieving quadratic convergence. Roughly speaking, this means that every iteration doubles the number of correct digits in the current approximation. To prove this, one shows that |λ_{m−1} − ρ_k| ≤ K‖v^{(k)} − ψ_{m−1}x_{m−1}‖_{X^{-1}} for some constant K. Details go beyond this discussion.
Better yet, it can be shown that if A is Hermitian, then (once k is large enough)
\[ \| v^{(k+1)} - \psi_{m-1} x_{m-1} \|_{X^{-1}} \leq C \| v^{(k)} - \psi_{m-1} x_{m-1} \|_{X^{-1}}^3 \]
for some constant C, and hence the naive Rayleigh Quotient Iteration achieves cubic convergence for Hermitian matrices. Here our norm ‖ · ‖_{X^{-1}} becomes any p-norm, since the Spectral Decomposition Theorem tells us that for Hermitian matrices X can be taken to equal a unitary matrix. Roughly speaking, this means that every iteration triples the number of correct digits in the current approximation. This is mind-bogglingly fast convergence!
Pick v^(0) of unit length
for k = 0, . . .
    ρ_k = v^(k)H A v^(k)
    v^(k+1) = (A − ρ_k I)^{-1} v^(k)
    v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
Remark 9.3.4.2 A concern with the (Shifted) Inverse Power Method and Rayleigh Quotient
Iteration is that the matrix with which one solves is likely nearly singular. It turns out that
this actually helps: the error that is amplified most is in the direction of the eigenvector
associated with the smallest eigenvalue (after shifting, if appropriate).
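The following MATLAB sketch (ours, with an ad hoc convergence test and test matrix) implements the practical Rayleigh Quotient Iteration described above:

    % Sketch: practical Rayleigh Quotient Iteration.
    A = randn( 6 );  A = ( A + A' ) / 2;      % a Hermitian test matrix
    v = randn( 6, 1 );  v = v / norm( v );
    for k = 1:20
        rho = v' * A * v;                     % Rayleigh quotient as shift
        w = ( A - rho * eye( 6 ) ) \ v;       % solve rather than invert
        v = w / norm( w );
        if norm( A * v - rho * v ) <= 1e-14 * norm( A, 1 )
            break;                            % crude convergence test
        end
    end
    rho = v' * A * v                          % converged eigenvalue estimate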
9.3.5 Discussion
To summarize this section:
• The Power Method finds the eigenvector associated with the largest eigenvalue (in
magnitude). It requires a matrix-vector multiplication for each iteration, thus costing approximately 2m² flops per iteration if A is a dense m × m matrix. The convergence is linear.
• The Inverse Power Method finds the eigenvector associated with the smallest eigenvalue (in magnitude). It requires the solution of a linear system for each iteration. By performing an LU factorization with partial pivoting up front, an initial O(m³) investment reduces the cost per iteration to approximately 2m² flops if A is a dense m × m matrix. The convergence is linear.
• The Rayleigh Quotient Iteration finds an eigenvector, but with which eigenvalue it is associated is not clear from the start. It requires the solution of a linear system for each iteration. If computed via an LU factorization with partial pivoting, the cost is O(m³) per iteration, if A is a dense m × m matrix. The convergence is quadratic if A is not Hermitian, and cubic if it is.
The cost of these methods is greatly reduced if the matrix is sparse, in which case each iteration may require as little as O(m) flops.
9.4 Enrichments
9.4.1 How to compute the eigenvalues of a 2 × 2 matrix
We have noted that finding the eigenvalues of a 2 × 2 matrix requires finding the roots of its characteristic polynomial. In particular, if a 2 × 2 matrix A is real-valued and
\[ A = \left( \begin{array}{cc} \alpha_{00} & \alpha_{01} \\ \alpha_{10} & \alpha_{11} \end{array} \right) , \]
then
\[ \det( \lambda I - A ) = ( \lambda - \alpha_{00} ) ( \lambda - \alpha_{11} ) - \alpha_{10} \alpha_{01} = \lambda^2 + \underbrace{( - ( \alpha_{00} + \alpha_{11} ) )}_{\beta} \lambda + \underbrace{( \alpha_{00} \alpha_{11} - \alpha_{10} \alpha_{01} )}_{\gamma} . \]
It is then tempting to use the quadratic formula to find the roots:
\[ \lambda_0 = \frac{ - \beta + \sqrt{ \beta^2 - 4 \gamma } }{2} \quad \mbox{and} \quad \lambda_1 = \frac{ - \beta - \sqrt{ \beta^2 - 4 \gamma } }{2} . \]
However, as discussed in Subsection C.0.2, one of these formulae may cause catastrophic cancellation if γ is small. When is γ small? When α_{00}α_{11} − α_{10}α_{01} is small. In other words, when the determinant of A is small or, equivalently, when A has a small eigenvalue.
In the next week, we will discuss the QR algorithm for computing the Spectral Decomposition of a Hermitian matrix. We do not discuss the QR algorithm for computing the Schur Decomposition of an m × m non-Hermitian matrix, which uses the eigenvalues of
\[ \left( \begin{array}{cc} \alpha_{m-2,m-2} & \alpha_{m-2,m-1} \\ \alpha_{m-1,m-2} & \alpha_{m-1,m-1} \end{array} \right) \]
to "shift" the matrix. (What this means will become clear next week.) This happened to come up in Robert's dissertation work. Making the "rookie mistake" of not avoiding catastrophic cancellation when computing the roots of a quadratic polynomial cost him three weeks of his life (debugging his code), since the algorithm that resulted did not converge correctly...
Don't repeat his mistakes!
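A hedged MATLAB sketch of the standard remedy (our illustration, not the text's own code): compute the root that avoids cancellation directly from the quadratic formula, and obtain the other root from the product of the roots, γ = λ_0 λ_1.

    % Sketch: roots of lambda^2 + beta*lambda + gamma = 0 without catastrophic cancellation.
    % Assumes real roots (beta^2 >= 4*gamma) and lambda0 ~= 0.
    function [ lambda0, lambda1 ] = stable_quadratic_roots( beta, gamma )
        d = sqrt( beta^2 - 4 * gamma );
        if beta >= 0
            lambda0 = ( -beta - d ) / 2;   % same signs: no cancellation
        else
            lambda0 = ( -beta + d ) / 2;
        end
        lambda1 = gamma / lambda0;         % product of the roots equals gamma
    end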
9.5 Wrap Up
9.5.1 Additional homework
Homework 9.5.1.1 Let ‖ · ‖ be the matrix norm induced by a vector norm ‖ · ‖. Prove that for any A ∈ ℂ^{m×m}, the spectral radius ρ(A) satisfies ρ(A) ≤ ‖A‖.
Some results in linear algebra depend on there existing a consistent matrix norm ‖ · ‖ such that ‖A‖ < 1. The following exercise implies that one can alternatively show that the spectral radius is bounded by one: ρ(A) < 1.
Homework 9.5.1.2 Given a matrix A ∈ ℂ^{m×m} and ε > 0, there exists a consistent matrix norm ‖ · ‖ such that ‖A‖ ≤ ρ(A) + ε.
9.5.2 Summary
Definition 9.5.2.1 Eigenvalue, eigenvector, and eigenpair. Let A ∈ ℂ^{m×m}. Then λ ∈ ℂ and nonzero x ∈ ℂ^m are said to be an eigenvalue and corresponding eigenvector if Ax = λx. The tuple (λ, x) is said to be an eigenpair.
For A ∈ ℂ^{m×m}, the following are equivalent statements:
• A is nonsingular.
• dim(N(A)) = 0.
• det(A) ≠ 0.
For A ∈ ℂ^{m×m} and λ ∈ ℂ, the following are equivalent statements:
• λ is an eigenvalue of A.
• (λI − A) is singular.
• det(λI − A) = 0.
Definition 9.5.2.2 Spectrum of a matrix. The set of all eigenvalues of A is denoted by Λ(A) and is called the spectrum of A.
Definition 9.5.2.3 Spectral radius. The spectral radius of A, ρ(A), equals the magnitude of the largest eigenvalue, in magnitude:
\[ \rho( A ) = \max_{\lambda \in \Lambda( A )} | \lambda | . \]
⌃
Theorem 9.5.2.4 Gershgorin Disk Theorem. Let A ∈ ℂ^{m×m},
\[ A = \left( \begin{array}{cccc}
\alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,m-1} \\
\alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,m-1} \\
\vdots & \vdots & & \vdots \\
\alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,m-1}
\end{array} \right) , \]
\[ \rho_i( A ) = \sum_{j \neq i} | \alpha_{i,j} | , \]
and
\[ R_i( A ) = \{ x \mbox{ s.t. } | x - \alpha_{i,i} | \leq \rho_i( A ) \} . \]
In other words, ρ_i(A) equals the sum of the absolute values of the off-diagonal elements of A in row i, and R_i(A) equals the set of all points in the complex plane that are within a distance ρ_i(A) of the diagonal element α_{i,i}. Then
\[ \Lambda( A ) \subset \cup_i R_i( A ) . \]
In other words, every eigenvalue lies in one of the disks of radius ρ_i(A) around diagonal element α_{i,i}.
Corollary 9.5.2.5 Let A and R_i(A) be as defined in Theorem 9.5.2.4. Let K and K^C be disjoint subsets of {0, . . . , m − 1} such that K ∪ K^C = {0, . . . , m − 1}. In other words, let K be a subset of {0, . . . , m − 1} and K^C its complement. If
\[ \left( \cup_{k \in K} R_k( A ) \right) \cap \left( \cup_{j \in K^C} R_j( A ) \right) = \emptyset , \]
then ∪_{k∈K} R_k(A) contains exactly |K| eigenvalues of A (multiplicity counted). In other words, if ∪_{k∈K} R_k(A) does not intersect with any of the other disks, then it contains as many eigenvalues of A (multiplicity counted) as there are elements of K.
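As a quick numerical illustration (ours, not part of the text), the following MATLAB fragment computes the Gershgorin disk centers and radii for a small test matrix and compares them with the eigenvalues:

    % Sketch: Gershgorin disks for a small test matrix.
    A = [ 4 1 0; 0.5 3 0.5; 0 1 -2 ];
    centers = diag( A );
    radii   = sum( abs( A ), 2 ) - abs( centers );  % row sums of off-diagonal magnitudes
    lambda  = eig( A );
    % each eigenvalue lies within distance radii(i) of centers(i) for some row i
    disp( [ centers radii ] );  disp( lambda );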
Some useful facts for A ∈ ℂ^{m×m}:
• 0 ∈ Λ(A) if and only if A is singular.
• Eigenvectors corresponding to distinct eigenvalues are linearly independent.
• Let A be nonsingular. Then (λ, x) is an eigenpair of A if and only if (1/λ, x) is an eigenpair of A^{-1}.
• (λ, x) is an eigenpair of A if and only if (λ − ρ, x) is an eigenpair of A − ρI.
Some useful facts for Hermitian A ∈ ℂ^{m×m}:
• All eigenvalues are real-valued.
• A is HPD if and only if all its eigenvalues are positive.
• If (λ, x) and (μ, y) are both eigenpairs of Hermitian A with λ ≠ μ, then x and y are orthogonal.
Definition 9.5.2.6 The determinant of
\[ A = \left( \begin{array}{cc} \alpha_{00} & \alpha_{01} \\ \alpha_{10} & \alpha_{11} \end{array} \right) \]
is given by
\[ \det( A ) = \alpha_{00} \alpha_{11} - \alpha_{10} \alpha_{01} . \]
The characteristic polynomial of
\[ A = \left( \begin{array}{cc} \alpha_{00} & \alpha_{01} \\ \alpha_{10} & \alpha_{11} \end{array} \right) \]
is given by
\[ \det( \lambda I - A ) = ( \lambda - \alpha_{00} ) ( \lambda - \alpha_{11} ) - \alpha_{10} \alpha_{01} . \]
This is a second degree polynomial in λ and has two roots (multiplicity counted). The eigenvalues of A equal the roots of this characteristic polynomial.
The characteristic polynomial of A ∈ ℂ^{m×m} is given by
\[ \det( \lambda I - A ) . \]
This is a polynomial in λ of degree m and has m roots (multiplicity counted). The eigenvalues of A equal the roots of this characteristic polynomial. Hence, A has m eigenvalues (multiplicity counted).
Definition 9.5.2.7 Algebraic multiplicity of an eigenvalue. Let A ∈ ℂ^{m×m} and p_m(λ) its characteristic polynomial. Then the (algebraic) multiplicity of eigenvalue λ_i equals the multiplicity of the corresponding root of the polynomial.
If
\[ p_m( \chi ) = \alpha_0 + \alpha_1 \chi + \cdots + \alpha_{m-1} \chi^{m-1} + \chi^m \]
and
\[ A = \left( \begin{array}{cccccc}
- \alpha_{m-1} & - \alpha_{m-2} & - \alpha_{m-3} & \cdots & - \alpha_1 & - \alpha_0 \\
1 & 0 & 0 & \cdots & 0 & 0 \\
0 & 1 & 0 & \cdots & 0 & 0 \\
0 & 0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & 0
\end{array} \right) , \]
then
\[ p_m( \lambda ) = \det( \lambda I - A ) . \]
Corollary 9.5.2.8 If A ∈ ℝ^{m×m} is real-valued, then some or all of its eigenvalues may be complex-valued. If eigenvalue λ is complex-valued, then its conjugate, λ̄, is also an eigenvalue. Indeed, the complex eigenvalues of a real-valued matrix come in conjugate pairs.
Corollary 9.5.2.9 If A ∈ ℝ^{m×m} is real-valued and m is odd, then at least one of the eigenvalues of A is real-valued.
Let λ be an eigenvalue of A ∈ ℂ^{m×m} and
\[ E_\lambda( A ) = \{ x \in \mathbb{C}^m \mid A x = \lambda x \} \]
be the set of all eigenvectors of A associated with λ plus the zero vector (which is not considered an eigenvector). This set is a subspace.
The elements on the diagonal of a diagonal matrix are its eigenvalues. The elements on
the diagonal of a triangular matrix are its eigenvalues.
Definition 9.5.2.10 Given a nonsingular matrix Y, the transformation Y^{-1}AY is called a similarity transformation (applied to matrix A).
Let A, B, Y ∈ ℂ^{m×m}, where Y is nonsingular, B = Y^{-1}AY, and (λ, x) an eigenpair of A. Then (λ, Y^{-1}x) is an eigenpair of B.
In other words, the geometric multiplicity of ⁄ equals the number of linearly independent
eigenvectors that are associated with ⁄. ⌃
Definition 9.5.2.21 Jordan Block. Define the k × k Jordan block with eigenvalue μ as
\[ J_k( \mu ) = \left( \begin{array}{ccccc}
\mu & 1 & 0 & \cdots & 0 \\
0 & \mu & 1 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 \\
0 & \cdots & 0 & \mu & 1 \\
0 & \cdots & 0 & 0 & \mu
\end{array} \right) . \]
The quantity
\[ \frac{ x^H A x }{ x^H x } \]
is known as the Rayleigh quotient.
If x is an eigenvector of A, then
\[ \frac{ x^H A x }{ x^H x } \]
is the associated eigenvalue.
Definition 9.5.2.24 Convergence of a sequence of scalars. Let α_0, α_1, α_2, . . . ∈ ℝ be an infinite sequence of scalars. Then α_k is said to converge to α if
\[ \lim_{k \rightarrow \infty} | \alpha_k - \alpha | = 0 . \]
Definition 9.5.2.25 Convergence of a sequence of vectors. Let x_0, x_1, x_2, . . . ∈ ℂ^m be an infinite sequence of vectors. Then x_k is said to converge to x if for any norm ‖ · ‖
\[ \lim_{k \rightarrow \infty} \| x_k - x \| = 0 . \]
⌃
Definition 9.5.2.26 Rate of convergence. Let α_0, α_1, α_2, . . . ∈ ℝ be an infinite sequence of scalars that converges to α. Then
• α_k is said to converge linearly to α if for sufficiently large k
\[ | \alpha_{k+1} - \alpha | \leq C | \alpha_k - \alpha | \]
for some constant C < 1.
• α_k is said to converge superlinearly to α if for sufficiently large k
\[ | \alpha_{k+1} - \alpha | \leq C_k | \alpha_k - \alpha | \]
with C_k → 0.
• α_k is said to converge quadratically to α if for sufficiently large k
\[ | \alpha_{k+1} - \alpha | \leq C | \alpha_k - \alpha |^2 \]
for some constant C.
• α_k is said to converge cubically to α if for sufficiently large k
\[ | \alpha_{k+1} - \alpha | \leq C | \alpha_k - \alpha |^3 \]
for some constant C.
⌃
The convergence of the Power Method is linear.
A practical Inverse Power Method for finding the eigenvector associated with the smallest eigenvalue (in magnitude):
Pick v^(0) of unit length
for k = 0, . . .
    v^(k+1) = A^{-1} v^(k)
    v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
The convergence of the Inverse Power Method is linear.
A practical Rayleigh Quotient Iteration for finding an eigenvector of A:
Pick v^(0) of unit length
for k = 0, . . .
    ρ_k = v^(k)H A v^(k)
    v^(k+1) = (A − ρ_k I)^{-1} v^(k)
    v^(k+1) = v^(k+1) / ‖v^(k+1)‖
endfor
The convergence of the Rayleigh Quotient Iteration is quadratic (eventually, the number of correct digits doubles in each iteration). If A is Hermitian, the convergence is cubic (eventually, the number of correct digits triples in each iteration).
Week 10
Practical Solution of the Hermitian Eigenvalue Problem
10.1 Opening
10.1.1 Subspace iteration with a Hermitian matrix
YouTube: https://www.youtube.com/watch?v=kwJ6HMSLv1U
The idea behind subspace iteration is to perform the Power Method with more than one
vector in order to converge to (a subspace spanned by) the eigenvectors associated with a
set of eigenvalues.
We continue our discussion by restricting ourselves to the case where A œ Cm◊m is Her-
mitian. Why? Because the eigenvectors associated with distinct eigenvalues of a Hermitian
matrix are mutually orthogonal (and can be chosen to be orthonormal), which will simplify
our discussion. Here we repeat the Power Method:
v_0 := random vector
v_0^(0) := v_0 / ‖v_0‖_2          normalize to have length one
for k := 0, . . .
    v_0 := A v_0^(k)
    v_0^(k+1) := v_0 / ‖v_0‖_2    normalize to have length one
endfor
In previous discussion, we used v^(k) for the current approximation to the eigenvector. We now add the subscript to it, v_0^(k), because we will shortly start iterating with multiple vectors.
Homework 10.1.1.1 You may want to start by executing git pu to update your
directory Assignments.
Examine Assignments/Week10/mat ab/PowerMethod.m which implements
This routine implements the Power Method, starting with a vector x for a maximum number
of iterations maxiters or until convergence, whichever comes first. To test it, execute the
script in Assignments/Week10/mat ab/test_SubspaceIteration.m which uses the Power Method
to compute the largest eigenvalue (in magnitude) and corresponding eigenvector for an m◊m
Hermitian matrix A with eigenvalues 1, . . . , m.
Be sure to click on "Figure 1" to see the graph that is created.
Solution. Watch the video regarding this problem on YouTube: https://youtu.be/8Bgf1tJeMmg.
(embedding a video in a solution seems to cause PreTeXt trouble...)
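We do not reproduce the actual routine here, but a minimal MATLAB sketch of a Power Method with normalization and a Rayleigh-quotient eigenvalue estimate might look as follows (the function name, arguments, and convergence test are our own choices, not necessarily those of PowerMethod.m):

    % Sketch: Power Method for the eigenvalue largest in magnitude.
    function [ lambda, v ] = power_method_sketch( A, x, maxiters, tol )
        v = x / norm( x );
        lambda = v' * A * v;
        for k = 1:maxiters
            v = A * v;  v = v / norm( v );     % iterate and normalize
            lambda_new = v' * A * v;           % Rayleigh quotient estimate
            if abs( lambda_new - lambda ) <= tol * abs( lambda_new )
                lambda = lambda_new;  return;  % converged
            end
            lambda = lambda_new;
        end
    end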
YouTube: https://www.youtube.com/watch?v=wmuWfjwtgcI
Recall that when we analyzed the convergence of the Power Method, we commented on
the fact that the method converges to an eigenvector associated with the largest eigenvalue
(in magnitude) if two conditions are met:
The new function should subtract this vector from the initial random vector as in the above
algorithm.
Modify the appropriate line in Assignments/Week10/mat ab/test_SubspaceIteration.m, chang-
ing (0) to (1), and use it to examine the convergence of the method.
What do you observe?
Solution.
• Assignments/Week10/answers/PowerMethodLambda1.m
YouTube: https://www.youtube.com/watch?v=OvBFQ84jTMw
Because we should be concerned about the introduction of a component in the direction of x_0 due to roundoff error, we may want to reorthogonalize with respect to x_0 in each iteration:
x_0 := x_0 / ‖x_0‖_2                normalize known eigenvector x_0 to have length one
v_1 := random vector
v_1 := v_1 − (x_0^H v_1) x_0        make sure the vector is orthogonal to x_0
v_1^(0) := v_1 / ‖v_1‖_2            normalize to have length one
for k := 0, . . .
    v_1 := A v_1^(k)
    v_1 := v_1 − (x_0^H v_1) x_0    make sure the vector is orthogonal to x_0
    v_1^(k+1) := v_1 / ‖v_1‖_2      normalize to have length one
endfor
• Assignments/Week10/answers/PowerMethodLambda1Reorth.m
YouTube: https://www.youtube.com/watch?v=751FAyKch1s
We now observe that the steps that normalize x_0 to have unit length and then subtract out the component of v_1 in the direction of x_0, normalizing the result, are exactly those performed by the Gram-Schmidt process. More generally, it is equivalent to computing the QR factorization of the matrix ( x_0  v_1 ). This suggests the algorithm
v_0 := random vector
v_1 := random vector
( ( v_0^(0)  v_1^(0) ), R ) := QR( ( v_0  v_1 ) )
for k := 0, . . .
    ( ( v_0^(k+1)  v_1^(k+1) ), R ) := QR( A ( v_0^(k)  v_1^(k) ) )
endfor
We observe:
• If |λ_0| > |λ_1|, the vectors v_0^(k) will converge linearly to a vector in the direction of x_0 at a rate dictated by the ratio |λ_1|/|λ_0|.
• If |λ_0| > |λ_1| > |λ_2|, the vectors v_1^(k) will converge linearly to a vector in the direction of x_1 at a rate dictated by the ratio |λ_2|/|λ_1|.
• If |λ_0| ≥ |λ_1| > |λ_2|, then Span({v_0^(k), v_1^(k)}) will eventually start approximating the subspace Span({x_0, x_1}).
Alternatively,
\[ A^{(k)} = \left( \begin{array}{cc} v_0^{(k)} & v_1^{(k)} \end{array} \right)^H A \left( \begin{array}{cc} v_0^{(k)} & v_1^{(k)} \end{array} \right) \quad \mbox{converges to} \quad \left( \begin{array}{cc} \lambda_0 & 0 \\ 0 & \lambda_1 \end{array} \right) \]
if A is Hermitian, |λ_1| > |λ_2|, and v^(0) and v^(1) have components in the directions of x_0 and x_1, respectively.
for k := 0, . . .
    ( V̂^(k+1), R ) := QR( A V̂^(k) )
    A^(k+1) = V̂^(k+1)H A V̂^(k+1)
endfor
• Assignments/Week10/answers/SubspaceIteration.m
10.1.2 Overview
• 10.1 Opening
• 10.4 Enrichments
¶ 10.4.1 QR algorithm among the most important algorithms of the 20th century
¶ 10.4.2 Who was John Francis
¶ 10.4.3 Casting the reduction to tridiagonal form in terms of matrix-matrix mul-
tiplication
¶ 10.4.4 Optimizing the tridiagonal QR algorithm
¶ 10.4.5 The Method of Multiple Relatively Robust Representations (MRRR)
• 10.5 Wrap Up
• Cast all computation for computing the eigenvalues and eigenvectors of a Hermitian
matrix in terms of unitary similarity transformations, yielding the Francis Implicit QR
Step.
V̂ := I                              V := I
for k := 0, . . .                    for k := 0, . . .
    ( V̂, R̂ ) := QR( A V̂ )                ( Q, R ) := QR( A )
    Â := V̂^H A V̂                         A := R Q
                                         V := V Q
endfor                               endfor
Figure 10.2.1.1 Left: Subspace iteration started with V̂ = I. Right: Simple QR algorithm.
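A hedged MATLAB sketch of the simple (unshifted) QR algorithm on the right (our own illustration, with a fixed iteration count rather than a convergence test):

    % Sketch: simple QR algorithm.
    function [ Ak, V ] = simple_qr_sketch( A, maxits )
        Ak = A;  V = eye( size( A, 1 ) );
        for k = 1:maxits
            [ Q, R ] = qr( Ak );   % Ak = Q R
            Ak = R * Q;            % reversed product: a unitary similarity transformation
            V  = V * Q;            % accumulate eigenvector estimates
        end
    end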
The magic lies in the fact that the matrices computed by the QR algorithm are identical
to those computed by the subspace iteration: Upon completion V‚ = V and the matrix A‚ on
the left equals the (updated) matrix A on the right. To be able to prove this, we annotate
the algorithm so we can reason about the contents of the matrices for each iteration.
Â^(0) := A                                        A^(0) := A
V̂^(0) := I                                        V^(0) := I
R̂^(0) := I                                        R^(0) := I
for k := 0, . . .                                 for k := 0, . . .
    ( V̂^(k+1), R̂^(k+1) ) := QR( A V̂^(k) )             ( Q^(k+1), R^(k+1) ) := QR( A^(k) )
    Â^(k+1) := V̂^(k+1)H A V̂^(k+1)                      A^(k+1) := R^(k+1) Q^(k+1)
endfor                                                V^(k+1) := V^(k) Q^(k+1)
                                                  endfor
Let's start by showing how the QR algorithm applies unitary similarity transformations to the matrices A^(k).
Homework 10.2.1.1 Show that for the algorithm on the right A^(k+1) = Q^(k+1)H A^(k) Q^(k+1).
Solution. From ( Q^(k+1), R^(k+1) ) := QR( A^(k) ) we know that A^(k) = Q^(k+1) R^(k+1), so that R^(k+1) = Q^(k+1)H A^(k), after which
A^(k+1) := R^(k+1) Q^(k+1).
Hence
A^(k+1) = R^(k+1) Q^(k+1) = Q^(k+1)H A^(k) Q^(k+1).
This last homework shows that A(k+1) is derived from A(k) via a unitary similarity trans-
formation and hence has the same eigenvalues as does A(k) . This means it also is derived from
A via a (sequence of) unitary similarity transformation and hence has the same eigenvalues
as does A.
We now prove these algorithms mathematically equivalent.
Homework 10.2.1.2 In the above algorithms, for all k,
• Â^(k) = A^(k).
• R̂^(k) = R^(k).
• V̂^(k) = V^(k).
Hint. The QR factorization is unique, provided the diagonal elements of R are taken to be positive.
Solution. We will employ a proof by induction.
• Base case: k = 0.
This is trivially true:
    Â^(0) = A = A^(0).
    R̂^(0) = I = R^(0).
    V̂^(0) = I = V^(0).
• Inductive step: Assume that Â^(k) = A^(k), R̂^(k) = R^(k), and V̂^(k) = V^(k). Show that Â^(k+1) = A^(k+1), R̂^(k+1) = R^(k+1), and V̂^(k+1) = V^(k+1).
Note that
A^(k)
    =    < I.H. >
Â^(k)
    =    < algorithm on left >
V̂^(k)H A V̂^(k)
    =    < algorithm on left >
V̂^(k)H V̂^(k+1) R̂^(k+1)
    =    < I.H. >
V^(k)H V̂^(k+1) R̂^(k+1).          (10.2.1)
But from the algorithm on the right, we know that
A^(k) = Q^(k+1) R^(k+1).          (10.2.2)
Both (10.2.1) and (10.2.2) are QR factorizations of A^(k) and hence, by the uniqueness of the QR factorization,
R̂^(k+1) = R^(k+1)
and
Q^(k+1) = V^(k)H V̂^(k+1),
or, equivalently and from the algorithm on the right,
V̂^(k+1) = V^(k) Q^(k+1) = V^(k+1).
Also,
Â^(k+1)
    =    < algorithm on left >
V̂^(k+1)H A V̂^(k+1)
    =    < V̂^(k+1) = V^(k+1) >
V^(k+1)H A V^(k+1)
    =    < algorithm on right >
Q^(k+1)H V^(k)H A V^(k) Q^(k+1)
    =    < I.H. >
Q^(k+1)H V̂^(k)H A V̂^(k) Q^(k+1)
    =    < algorithm on left >
Q^(k+1)H Â^(k) Q^(k+1)
    =    < I.H. >
Q^(k+1)H A^(k) Q^(k+1)
    =    < last homework >
A^(k+1).
• V^(k) = Q^(0) Q^(1) · · · Q^(k).
• A^k = V^(k) R^(k) · · · R^(1) R^(0). (Note: A^k here denotes A raised to the kth power.)
• Base case: k = 0.
\[ \underbrace{A^0}_{I} = \underbrace{V^{(0)}}_{I} \underbrace{R^{(0)}}_{I} . \]
• Inductive step: Assume that V^(k) = Q^(0) · · · Q^(k) and A^k = V^(k) R^(k) · · · R^(0). Show that V^(k+1) = Q^(0) · · · Q^(k+1) and A^{k+1} = V^(k+1) R^(k+1) · · · R^(0).
Also,
A^{k+1}
    =    < definition >
A A^k
    =    < inductive hypothesis >
A V^(k) R^(k) · · · R^(0)
    =    < Homework 10.2.1.2 >
A V̂^(k) R^(k) · · · R^(0)
    =    < algorithm on left >
V̂^(k+1) R̂^(k+1) R^(k) · · · R^(0)
    =    < Homework 10.2.1.2 >
V^(k+1) R^(k+1) R^(k) · · · R^(0).
In particular, applying this to the first standard basis vector, e_0,
\[ A^k e_0 = V^{(k)} R^{(k)} \cdots R^{(0)} e_0 = \tilde\rho_{0,0}^{(k)} v_0^{(k)} , \]
where ρ̃_{0,0}^(k) is the (0, 0) entry in the upper triangular matrix R^(k) · · · R^(0). Thus, the first column of V^(k) equals a vector that would result from k iterations of the Power Method. Similarly, the second column of V^(k) equals a vector that would result from k iterations of the Power Method, but orthogonal to v_0^(k).
YouTube: https://www.youtube.com/watch?v=t51YqvNWa0Q
We make some final observations:
• A(k+1) = Q(k) H A(k) Q(k) . This means we can think of A(k+1) as the matrix A(k) but
viewed in a new basis (namely the basis that consists of the column of Q(k) ).
• A(k+1) = (Q(0) · · · Q(k) )H AQ(0) · · · Q(k) = V (k) H AV (k) . This means we can think of
A(k+1) as the matrix A but viewed in a new basis (namely the basis that consists of
the column of V (k) ).
• In each step, we compute
(Q(k+1) , R(k+1) ) = QR(A(k) )
which we can think of as
(Q(k+1) , R(k+1) ) = QR(A(k) ◊ I).
This suggests that in each iteration we perform one step of subspace iteration, but
with matrix A(k) and V = I:
(Q(k+1) , R(k+1) ) = QR(A(k) V ).
• The insight is that the QR algorithm is identical to subspace iteration, except that at
each step we reorient the problem (express it in a new basis) and we restart it with
V = I.
Homework 10.2.1.4 Examine Assignments/Week10/matlab/SubspaceIterationAllVectors.m, which implements the subspace iteration in Figure 10.2.1.1 (left). Examine it by executing the script in Assignments/Week10/matlab/test_simple_QR_algorithm.m.
Solution. Discuss what you observe online with others!
Homework 10.2.1.5 Copy Assignments/Week10/matlab/SubspaceIterationAllVectors.m into SimpleQRAlg.m and modify it to implement the algorithm in Figure 10.2.1.1 (right) as
function [ Ak, V ] = SimpleQRAlg( A, maxits, illustrate, delay )
YouTube: https://www.youtube.com/watch?v=HIxSCrFX1Ls
The equivalence of the subspace iteration and the QR algorithm tells us a lot about convergence. Under mild conditions (|λ_0| ≥ · · · ≥ |λ_{n−1}| > |λ_n| > · · · > |λ_{m−1}|),
• The first n columns of V (k) converge to a basis for the subspace spanned by the eigen-
vectors associated with ⁄0 , . . . , ⁄n≠1 .
• The last m − n columns of V^(k) converge to a basis for the subspace orthogonal to the subspace spanned by the eigenvectors associated with λ_0, . . . , λ_{n−1}.
• If A is Hermitian, then the eigenvectors associated with ⁄0 , . . . , ⁄n≠1 , are orthogonal to
those associated with ⁄n , . . . , ⁄m≠1 . Hence, the subspace spanned by the eigenvectors
associated with ⁄0 , . . . , ⁄n≠1 is orthogonal to the subspace spanned by the eigenvectors
associated with ⁄n , . . . , ⁄m≠1 .
• The rate of convergence with which these subspaces become orthogonal to each other
is linear with a constant |⁄n |/|⁄n≠1 |.
What if in this situation we focus on n = m − 1? Then
• The last column of V^(k) converges to a vector pointing in the direction of the eigenvector associated with λ_{m−1}, the smallest in magnitude.
• The rate of convergence of that vector is linear with a constant |λ_{m−1}|/|λ_{m−2}|.
In other words, the subspace iteration acts upon the last column of V (k) in the same way as
would an inverse iteration. This observation suggests that that convergence can be greatly
accelerated by shifting the matrix by an estimate of the eigenvalue that is smallest in mag-
nitude.
Homework 10.2.2.1 Copy SimpleQRAlg.m into SimpleShiftedQRAlgConstantShift.m and modify it to implement an algorithm that executes the QR algorithm with a shifted matrix A − ρI:
function [ Ak, V ] = SimpleShiftedQRAlgConstantShift( A, rho, maxits, illustrate, delay )
V := I
for k := 0, . . .
    ( Q, R ) := QR( A − α_{m−1,m−1} I )
    A := R Q + α_{m−1,m−1} I
    V := V Q
endfor
Figure 10.2.2.1 Simple shifted QR algorithm.
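A minimal MATLAB sketch of the algorithm in Figure 10.2.2.1 (ours; the shift is re-read from the bottom-right entry of the updated matrix in each iteration):

    % Sketch: simple shifted QR algorithm.
    function [ Ak, V ] = simple_shifted_qr_sketch( A, maxits )
        m = size( A, 1 );  Ak = A;  V = eye( m );
        for k = 1:maxits
            mu = Ak( m, m );                   % shift: last diagonal element
            [ Q, R ] = qr( Ak - mu * eye( m ) );
            Ak = R * Q + mu * eye( m );        % shift back in
            V  = V * Q;
        end
    end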
YouTube: https://www.youtube.com/watch?v=Fhk0e5JF sU
To more carefully examine this algorithm, let us annotate it as we did for the simple QR algorithm in the last unit.
A^(0) = A
V^(0) = I
R^(0) = I
for k := 0, . . .
    μ^(k) = α_{m−1,m−1}^(k)
    ( Q^(k+1), R^(k+1) ) := QR( A^(k) − μ^(k) I )
    A^(k+1) = R^(k+1) Q^(k+1) + μ^(k) I
    V^(k+1) = V^(k) Q^(k+1)
endfor
From ( Q^(k+1), R^(k+1) ) := QR( A^(k) − μ^(k) I ) we know that R^(k+1) = Q^(k+1)H ( A^(k) − μ^(k) I ), after which
A^(k+1) := R^(k+1) Q^(k+1) + μ^(k) I.
Hence
A^(k+1)
    = R^(k+1) Q^(k+1) + μ^(k) I
    = Q^(k+1)H ( A^(k) − μ^(k) I ) Q^(k+1) + μ^(k) I
    = Q^(k+1)H A^(k) Q^(k+1) − μ^(k) Q^(k+1)H Q^(k+1) + μ^(k) I
    = Q^(k+1)H A^(k) Q^(k+1).
This last exercise confirms that the eigenvalues of A(k) equal the eigenvalues of A.
Homework 10.2.2.3 For the above algorithm, show that
\[ ( A - \mu_{k-1} I ) \cdots ( A - \mu_1 I ) ( A - \mu_0 I ) = \underbrace{Q^{(0)} Q^{(1)} \cdots Q^{(k)}}_{V^{(k)}} \underbrace{R^{(k)} \cdots R^{(1)} R^{(0)}}_{\mbox{upper triangular}} . \]
Solution. In this problem, we need to assume that Q^(0) = I. Also, it helps to recognize that V^(k) = Q^(0) · · · Q^(k), which can be shown via a simple inductive proof.
This requires a proof by induction.
• Base case: k = 1.
A − μ_0 I
    =    < A^(0) = A >
A^(0) − μ_0 I
    =    < algorithm >
Q^(1) R^(1)
    =    < Q^(0) = R^(0) = I >
Q^(0) Q^(1) R^(1) R^(0).
• Inductive step: Assume that ( A − μ_{k−1} I ) · · · ( A − μ_0 I ) = Q^(0) · · · Q^(k) R^(k) · · · R^(0).
Show that
\[ ( A - \mu_k I ) ( A - \mu_{k-1} I ) \cdots ( A - \mu_1 I ) ( A - \mu_0 I ) = \underbrace{Q^{(0)} Q^{(1)} \cdots Q^{(k+1)}}_{V^{(k+1)}} \underbrace{R^{(k+1)} \cdots R^{(1)} R^{(0)}}_{\mbox{upper triangular}} . \]
Notice that
( A − μ_k I )( A − μ_{k−1} I ) · · · ( A − μ_0 I )
    =    < last homework >
( V^(k+1) A^(k+1) V^(k+1)H − μ_k I )( A − μ_{k−1} I ) · · · ( A − μ_0 I )
    =    < I = V^(k+1) V^(k+1)H >
V^(k+1) ( A^(k+1) − μ_k I ) V^(k+1)H ( A − μ_{k−1} I ) · · · ( A − μ_0 I )
    =    < I.H. >
V^(k+1) ( A^(k+1) − μ_k I ) V^(k+1)H V^(k) R^(k) · · · R^(0)
    =    < V^(k+1)H = Q^(k+1)H V^(k)H >
V^(k+1) ( A^(k+1) − μ_k I ) Q^(k+1)H R^(k) · · · R^(0)
    =    < algorithm >
V^(k+1) R^(k+1) Q^(k+1) Q^(k+1)H R^(k) · · · R^(0)
    =    < Q^(k+1) Q^(k+1)H = I >
V^(k+1) R^(k+1) R^(k) · · · R^(0)
If A is block diagonal with A_{00} x = λ x and A_{11} y = μ y, then
\[ \left( \begin{array}{cc} A_{00} & 0 \\ 0 & A_{11} \end{array} \right) \left( \begin{array}{c} x \\ 0 \end{array} \right) = \lambda \left( \begin{array}{c} x \\ 0 \end{array} \right) \quad \mbox{and} \quad \left( \begin{array}{cc} A_{00} & 0 \\ 0 & A_{11} \end{array} \right) \left( \begin{array}{c} 0 \\ y \end{array} \right) = \mu \left( \begin{array}{c} 0 \\ y \end{array} \right) . \]
In other words, Λ(A) = Λ(A_{00}) ∪ Λ(A_{11}) and eigenvectors of A can be easily constructed from eigenvectors of A_{00} and A_{11}.
This insight allows us to deflate a matrix when strategically placed zeroes (or, rather, acceptably small entries) appear as part of the QR algorithm. Let us continue to focus on the Hermitian eigenvalue problem.
Homework 10.2.3.1 Let A ∈ ℂ^{m×m} be a Hermitian matrix and V ∈ ℂ^{m×m} be a unitary matrix such that
\[ V^H A V = \left( \begin{array}{cc} A_{00} & 0 \\ 0 & A_{11} \end{array} \right) . \]
If V_{00} and V_{11} are unitary matrices such that V_{00}^H A_{00} V_{00} = Λ_0 and V_{11}^H A_{11} V_{11} = Λ_1 are both diagonal, show that
\[ \left( V \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \right)^H A \left( V \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \right) = \left( \begin{array}{cc} \Lambda_0 & 0 \\ 0 & \Lambda_1 \end{array} \right) . \]
Solution.
\[ \left( V \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \right)^H A \left( V \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \right) \]
    =    < (XY)^H = Y^H X^H >
\[ \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right)^H V^H A V \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \]
    =    < V^H A V = diag( A_{00}, A_{11} ) >
\[ \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right)^H \left( \begin{array}{cc} A_{00} & 0 \\ 0 & A_{11} \end{array} \right) \left( \begin{array}{cc} V_{00} & 0 \\ 0 & V_{11} \end{array} \right) \]
    =    < partitioned matrix-matrix multiplication >
\[ \left( \begin{array}{cc} V_{00}^H A_{00} V_{00} & 0 \\ 0 & V_{11}^H A_{11} V_{11} \end{array} \right) \]
    =    < V_{00}^H A_{00} V_{00} = Λ_0; V_{11}^H A_{11} V_{11} = Λ_1 >
\[ \left( \begin{array}{cc} \Lambda_0 & 0 \\ 0 & \Lambda_1 \end{array} \right) . \]
The point of this last exercise is that if at some point the QR algorithm yields a block diagonal matrix, then the algorithm can proceed to find the spectral decompositions of the blocks on the diagonal, updating the matrix, V, in which the eigenvectors are accumulated.
Now, since it is the last column of V^(k) that converges fastest to an eigenvector, eventually we expect A^(k) computed as part of the QR algorithm to be of the form
\[ A^{(k)} = \left( \begin{array}{cc} A_{00}^{(k)} & f_{01}^{(k)} \\ f_{01}^{(k)\,T} & \alpha_{m-1,m-1}^{(k)} \end{array} \right) , \]
where f_{01}^(k) is small. In other words,
\[ A^{(k)} \approx \left( \begin{array}{cc} A_{00}^{(k)} & 0 \\ 0 & \alpha_{m-1,m-1}^{(k)} \end{array} \right) . \]
Once f_{01}^(k) is small enough, the algorithm can continue with A_{00}^(k). The problem is thus deflated to a smaller problem.
What criteria should we use to deflate? If the active matrix is m × m, for now we use the criterion
\[ \| f_{01}^{(k)} \|_1 \leq \epsilon_{\rm mach} \left( | \alpha_{0,0}^{(k)} | + \cdots + | \alpha_{m-1,m-1}^{(k)} | \right) . \]
The idea is that if the magnitudes of the off-diagonal elements of the last row are small relative to the eigenvalues, then they can be considered to be zero. The sum of the absolute values of the diagonal elements is an estimate of the sizes of the eigenvalues. We will refine this criterion later.
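In code, the test might look like the following MATLAB fragment (ours; Ak holds the current active matrix, here seeded with an arbitrary nearly diagonal example):

    % Sketch: deflation test for the last row/column of the active matrix Ak.
    Ak = diag( [ 4 3 2 1 ] ) + 1e-16 * randn( 4 );   % nearly diagonal test matrix
    m = size( Ak, 1 );
    f01 = Ak( m, 1:m-1 )';                    % off-diagonal entries of the last row
    if norm( f01, 1 ) <= eps * sum( abs( diag( Ak ) ) )
        Ak = Ak( 1:m-1, 1:m-1 );              % deflate: continue with the leading block
    end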
Homework 10.2.3.2 Copy SimpleShiftedQRAlg.m into SimpleShiftedQRAlgWithDeflation.m and modify it to add deflation.
function [ Ak, V ] = SimpleShiftedQRAlgWithDeflation( A, maxits, illustrate, delay )
YouTube: https://www.youtube.com/watch?v=c1zJG3T0D44
The QR algorithms that we have discussed incur the following approximate costs per iteration for an m × m Hermitian matrix.
• A → QR (QR factorization): 4/3 m³ flops.
• If the eigenvectors are desired (which is usually the case), the update V := V Q requires an additional 2m³ flops.
Since O(m) iterations are typically required, the bottom line is that the computation requires O(m⁴) flops. All other factorizations we have encountered so far require at most O(m³) flops. Generally O(m⁴) is considered prohibitively expensive. We need to do better!
YouTube: https://www.youtube.com/watch?v=ETQbxnweKok
In this section, we see that if A(0) is a tridiagonal matrix, then so are all A(k) . This
reduces the cost of each iteration of the QR algorithm from O(m3 ) flops to O(m) flops if
only the eigenvalues are computed and O(m2 ) flops if the eigenvectors are also desired. Thus,
if matrix A is first reduced to a tridiagonal matrix via (unitary) similarity transformations,
then the cost of finding its eigenvalues and eigenvectors is reduced from O(m4 ) to O(m3 )
flops. Fortunately, there is an algorithm for reducing a matrix to tridiagonal form that
requires O(m3 ) operations.
The basic algorithm for reducing a Hermitian matrix to tridiagonal form, overwriting the
original matrix with the result, can be explained as follows. We assume that A is stored only
in the lower triangular part of the matrix and that only the diagonal and subdiagonal of
the tridiagonal matrix is computed, overwriting those parts of A. Finally, the Householder
vectors used to zero out parts of A can overwrite the entries that they annihilate (set to
zero), much like we did when computing the Householder QR factorization.
Recall that in Subsubsection 3.3.3.3, we introduced the function
\[ \left[ \left( \begin{array}{c} \rho \\ u_2 \end{array} \right) , \tau \right] := \mbox{Housev1} \left( \left( \begin{array}{c} \chi_1 \\ x_2 \end{array} \right) \right) \]
to compute the vector u = ( 1 ; u_2 ) that reflects x into ± ‖x‖_2 e_0, so that
\[ \left( I - \frac{1}{\tau} \left( \begin{array}{c} 1 \\ u_2 \end{array} \right) \left( \begin{array}{c} 1 \\ u_2 \end{array} \right)^H \right) \left( \begin{array}{c} \chi_1 \\ x_2 \end{array} \right) = \underbrace{\pm \| x \|_2 e_0}_{\rho} . \]
We also reintroduce the notation H(x) for the transformation I − (1/τ) u u^H, where u and τ are computed by Housev1(x).
We now describe an algorithm for reducing a Hermitian matrix to tridiagonal form:
• Partition A → \left( \begin{array}{cc} \alpha_{11} & \star \\ a_{21} & A_{22} \end{array} \right). Here the ⋆ denotes a part of a matrix that is neither stored nor updated.
• Update [a_{21}, τ] := Housev1(a_{21}). This overwrites the first element of a_{21} with ±‖a_{21}‖_2 and the remainder with all but the first element of the Householder vector u. Implicitly, the elements below the first element equal zero in the updated matrix A.
• Update
A_{22} := H(a_{21}) A_{22} H(a_{21}).
Since A_{22} is Hermitian both before and after the update, only the lower triangular part of the matrix needs to be updated.
◊ ◊ ◊ ◊ ◊ ◊ ◊ 0 0 0 ◊ ◊ 0 0 0
◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ ◊ 0 0
◊ ◊ ◊ ◊ ◊ ≠æ 0 ◊ ◊ ◊ ◊ ≠æ 0 ◊ ◊ ◊ ◊
◊ ◊ ◊ ◊ ◊ 0 ◊ ◊ ◊ ◊ 0 0 ◊ ◊ ◊
◊ ◊ ◊ ◊ ◊ 0 ◊ ◊ ◊ ◊ 0 0 ◊ ◊ ◊
Original matrix First iteration Second iteration
◊ ◊ 0 0 0 ◊ ◊ 0 0 0
◊ ◊ ◊ 0 0 ◊ ◊ ◊ 0 0
≠æ 0 ◊ ◊ ◊ 0 ≠æ 0 ◊ ◊ ◊ 0
0 0 ◊ ◊ ◊ 0 0 ◊ ◊ ◊
0 0 0 ◊ ◊ 0 0 0 ◊ ◊
Third iteration
Figure 10.3.1.1 An illustration of the reduction of a Hermitian matrix to tridiagonal form.
The ◊s denote nonzero elements in the matrix. The gray entries above the diagonal are not
actually updated.
The update of A22 warrants closer scrutiny:
This formulation has two advantages: it requires fewer computations and it does not generate
an intermediate result that is not Hermitian. An algorithm that implements all these insights
is given in Figure 10.3.1.2.
[A, t] := TriRed-unb(A, t)
Partition A → ( A_TL  A_TR ; A_BL  A_BR ), t → ( t_T ; t_B )
    where A_TL is 0 × 0 and t_T has 0 elements
while m(A_TL) < m(A) − 2
    Repartition ( A_TL  A_TR ; A_BL  A_BR ) → ( A_00  a_01  A_02 ; a_10^T  α_11  a_12^T ; A_20  a_21  A_22 ),  ( t_T ; t_B ) → ( t_0 ; τ_1 ; t_2 )
    [a_21, τ_1] := Housev1(a_21)
    u_21 = a_21 with first element replaced with 1
    Update A_22 := H(u_21) A_22 H(u_21) via the steps
        y_21 := A_22 u_21                                  (Hermitian matrix-vector multiply!)
        β := u_21^H y_21 / 2
        w_21 := (y_21 − β u_21/τ_1)/τ_1
        A_22 := A_22 − tril(u_21 w_21^H + w_21 u_21^H)     (Hermitian rank-2 update)
    Continue with ( A_TL  A_TR ; A_BL  A_BR ) ← ( A_00  a_01  A_02 ; a_10^T  α_11  a_12^T ; A_20  a_21  A_22 ),  ( t_T ; t_B ) ← ( t_0 ; τ_1 ; t_2 )
endwhile
Figure 10.3.1.2 Basic algorithm for reduction of a Hermitian matrix to tridiagonal form.
During the first iteration, when updating the (m − 1) × (m − 1) matrix A_22, the bulk of the computation is in the computation of y_21 := A_22 u_21, at 2(m − 1)² flops, and A_22 := A_22 − tril(u_21 w_21^H + w_21 u_21^H), at 2(m − 1)² flops. The total cost for reducing the m × m matrix A to tridiagonal form is therefore approximately
\[ \sum_{k=0}^{m-1} 4 ( m - k - 1 )^2 \mbox{ flops} \approx \frac{4}{3} m^3 \mbox{ flops} . \]
An alternative way of updating A_22 is to compute
\[
\begin{array}{rcl}
A_{22} & := & ( I - \frac{1}{\tau_1} u_{21} u_{21}^H ) A_{22} ( I - \frac{1}{\tau_1} u_{21} u_{21}^H ) \\
& = & \underbrace{( A_{22} - \frac{1}{\tau_1} u_{21} \underbrace{u_{21}^H A_{22}}_{y_{21}^H} )}_{B_{22}} ( I - \frac{1}{\tau_1} u_{21} u_{21}^H ) \\
& = & B_{22} - \frac{1}{\tau_1} \underbrace{B_{22} u_{21}}_{x_{21}} u_{21}^H .
\end{array}
\]
Estimate the cost of this alternative approach. What other disadvantage(s) does this ap-
proach have?
Solution. During the kth iteration, for k = 0, 1, . . . , m ≠ 1 the costs for the various steps
are as follows:
Ponder This 10.3.1.2 Propose a postprocess that converts the off-diagonal elements of a tridiagonal Hermitian matrix to real values. The postprocess must be equivalent to applying a unitary similarity transformation so that eigenvalues are preserved.
You may want to start by looking at
\[ A = \left( \begin{array}{cc} \alpha_{0,0} & \bar\alpha_{1,0} \\ \alpha_{1,0} & \alpha_{1,1} \end{array} \right) , \]
where the diagonal elements are real-valued and the off-diagonal elements are complex-valued. Then move on to
\[ A = \left( \begin{array}{ccc} \alpha_{0,0} & \bar\alpha_{1,0} & 0 \\ \alpha_{1,0} & \alpha_{1,1} & \bar\alpha_{2,1} \\ 0 & \alpha_{2,1} & \alpha_{2,2} \end{array} \right) . \]
What is the pattern?
What is the pattern?
Homework 10.3.1.3 You may want to start by executing git pull to update your directory Assignments.
In directory Assignments/Week10/matlab/, you will find the following files:
• Housev1.m: An implementation of the function Housev1, mentioned in the unit.
• TriRed.m: A code skeleton for a function that reduces a Hermitian matrix to a tridiagonal matrix. Only the lower triangular part of the input and output are stored.
[ T, t ] = TriRed( A, t )
returns the diagonal and first subdiagonal of the tridiagonal matrix in T, stores the Householder vectors below the first subdiagonal, and returns the scalars τ in vector t.
• TriFromBi.m: A function that takes the diagonal and first subdiagonal in the input matrix and returns the tridiagonal matrix that they define.
T = TriFromBi( A )
• test_TriRed.m: A script that tests TriRed.
With these resources, you are to complete TriRed by implementing the algorithm in Figure 10.3.1.2.
Be sure to look at the hint!
Hint. If A holds Hermitian matrix A, storing only the lower triangular part, then Ax is implemented in MATLAB as
( tril( A ) + tril( A, -1 )' ) * x;
Updating only the lower triangular part of array A with A := A − B is accomplished by
A = A - tril( B );
Solution.
• Assignments/Week10/answers/TriRed.m.
YouTube: https://www.youtube.com/watch?v=XAvioT6ALAg
We now introduce another important class of orthogonal matrices known as Givens' rotations. Actually, we have seen these before, in Subsubsection 2.2.5.1, where we simply called them rotations. It is how they are used that makes them Givens' rotations.
Given a vector x = \left( \begin{array}{c} \chi_1 \\ \chi_2 \end{array} \right) ∈ ℝ², there exists an orthogonal matrix G such that G^T x = \left( \begin{array}{c} \pm \| x \|_2 \\ 0 \end{array} \right). The Householder transformation is one example of such a matrix G. An alternative is the Givens' rotation: G = \left( \begin{array}{cc} \gamma & -\sigma \\ \sigma & \gamma \end{array} \right) where γ² + σ² = 1. (Notice that γ and σ can be thought of as the cosine and sine of an angle.) Then
\[ G^T G = \left( \begin{array}{cc} \gamma & -\sigma \\ \sigma & \gamma \end{array} \right)^T \left( \begin{array}{cc} \gamma & -\sigma \\ \sigma & \gamma \end{array} \right) = \left( \begin{array}{cc} \gamma & \sigma \\ -\sigma & \gamma \end{array} \right) \left( \begin{array}{cc} \gamma & -\sigma \\ \sigma & \gamma \end{array} \right) = \left( \begin{array}{cc} \gamma^2 + \sigma^2 & -\gamma\sigma + \gamma\sigma \\ \gamma\sigma - \gamma\sigma & \gamma^2 + \sigma^2 \end{array} \right) = \left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \end{array} \right) , \]
where γ² + σ² = 1.
The question, then, is how to compute γ and σ from χ_1 and χ_2 so that G^T x = ( ‖x‖_2 ; 0 ).
Solution. Take γ = χ_1/‖x‖_2 and σ = χ_2/‖x‖_2; then γ² + σ² = (χ_1² + χ_2²)/‖x‖_2² = 1 and
\[ \left( \begin{array}{cc} \gamma & -\sigma \\ \sigma & \gamma \end{array} \right)^T \left( \begin{array}{c} \chi_1 \\ \chi_2 \end{array} \right) = \left( \begin{array}{cc} \gamma & \sigma \\ -\sigma & \gamma \end{array} \right) \left( \begin{array}{c} \chi_1 \\ \chi_2 \end{array} \right) = \left( \begin{array}{c} ( \chi_1^2 + \chi_2^2 ) / \| x \|_2 \\ ( \chi_1 \chi_2 - \chi_1 \chi_2 ) / \| x \|_2 \end{array} \right) = \left( \begin{array}{c} \| x \|_2 \\ 0 \end{array} \right) . \]
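A small MATLAB helper (ours) that computes γ and σ as in the solution, guarding against the zero vector:

    % Sketch: compute a Givens rotation G = [ gamma -sigma; sigma gamma ]
    % such that G' * [ chi1; chi2 ] = [ norm([chi1;chi2]); 0 ].
    function [ gamma, sigma ] = givens_sketch( chi1, chi2 )
        nrm = norm( [ chi1; chi2 ] );
        if nrm == 0
            gamma = 1;  sigma = 0;    % nothing to rotate
        else
            gamma = chi1 / nrm;
            sigma = chi2 / nrm;
        end
    end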
Remark 10.3.2.1 We only discuss real-valued Givens' rotations and how they transform real-valued vectors, since the output of our reduction to tridiagonal form, after postprocessing, yields a real-valued tridiagonal symmetric matrix.
Ponder This 10.3.2.2 One could use 2 × 2 Householder transformations (reflectors) instead of Givens' rotations. Why is it better to use Givens' rotations in this situation?
YouTube: https://www.youtube.com/watch?v=_IgDCL7OPdU
Now, consider the 4 × 4 tridiagonal matrix
\[ \left( \begin{array}{cccc}
\alpha_{0,0} & \alpha_{0,1} & 0 & 0 \\
\alpha_{1,0} & \alpha_{1,1} & \alpha_{1,2} & 0 \\
0 & \alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) . \]
From \left( \begin{array}{c} \alpha_{0,0} \\ \alpha_{1,0} \end{array} \right), one can compute γ_{1,0} and σ_{1,0} so that
\[ \left( \begin{array}{cc} \gamma_{1,0} & -\sigma_{1,0} \\ \sigma_{1,0} & \gamma_{1,0} \end{array} \right)^T \left( \begin{array}{c} \alpha_{0,0} \\ \alpha_{1,0} \end{array} \right) = \left( \begin{array}{c} \widehat\alpha_{0,0} \\ 0 \end{array} \right) . \]
Then
\[ \left( \begin{array}{cccc}
\widehat\alpha_{0,0} & \widehat\alpha_{0,1} & \widehat\alpha_{0,2} & 0 \\
0 & \widehat\alpha_{1,1} & \widehat\alpha_{1,2} & 0 \\
0 & \alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) = \left( \begin{array}{cccc}
\gamma_{1,0} & \sigma_{1,0} & 0 & 0 \\
-\sigma_{1,0} & \gamma_{1,0} & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{array} \right) \left( \begin{array}{cccc}
\alpha_{0,0} & \alpha_{0,1} & 0 & 0 \\
\alpha_{1,0} & \alpha_{1,1} & \alpha_{1,2} & 0 \\
0 & \alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) . \]
Next, from \left( \begin{array}{c} \widehat\alpha_{1,1} \\ \alpha_{2,1} \end{array} \right), one can compute γ_{2,1} and σ_{2,1} so that
\[ \left( \begin{array}{cc} \gamma_{2,1} & -\sigma_{2,1} \\ \sigma_{2,1} & \gamma_{2,1} \end{array} \right)^T \left( \begin{array}{c} \widehat\alpha_{1,1} \\ \alpha_{2,1} \end{array} \right) = \left( \begin{array}{c} \widehat{\widehat\alpha}_{1,1} \\ 0 \end{array} \right) . \]
Then
\[ \left( \begin{array}{cccc}
\widehat\alpha_{0,0} & \widehat\alpha_{0,1} & \widehat\alpha_{0,2} & 0 \\
0 & \widehat{\widehat\alpha}_{1,1} & \widehat{\widehat\alpha}_{1,2} & \widehat\alpha_{1,3} \\
0 & 0 & \widehat\alpha_{2,2} & \widehat\alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) = \left( \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & \gamma_{2,1} & \sigma_{2,1} & 0 \\
0 & -\sigma_{2,1} & \gamma_{2,1} & 0 \\
0 & 0 & 0 & 1
\end{array} \right) \left( \begin{array}{cccc}
\widehat\alpha_{0,0} & \widehat\alpha_{0,1} & \widehat\alpha_{0,2} & 0 \\
0 & \widehat\alpha_{1,1} & \widehat\alpha_{1,2} & 0 \\
0 & \alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) . \]
Finally, from \left( \begin{array}{c} \widehat\alpha_{2,2} \\ \alpha_{3,2} \end{array} \right), one can compute γ_{3,2} and σ_{3,2} so that
\[ \left( \begin{array}{cc} \gamma_{3,2} & -\sigma_{3,2} \\ \sigma_{3,2} & \gamma_{3,2} \end{array} \right)^T \left( \begin{array}{c} \widehat\alpha_{2,2} \\ \alpha_{3,2} \end{array} \right) = \left( \begin{array}{c} \widehat{\widehat\alpha}_{2,2} \\ 0 \end{array} \right) . \]
Then
\[ \left( \begin{array}{cccc}
\widehat\alpha_{0,0} & \widehat\alpha_{0,1} & \widehat\alpha_{0,2} & 0 \\
0 & \widehat{\widehat\alpha}_{1,1} & \widehat{\widehat\alpha}_{1,2} & \widehat\alpha_{1,3} \\
0 & 0 & \widehat{\widehat\alpha}_{2,2} & \widehat{\widehat\alpha}_{2,3} \\
0 & 0 & 0 & \widehat\alpha_{3,3}
\end{array} \right) = \left( \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & \gamma_{3,2} & \sigma_{3,2} \\
0 & 0 & -\sigma_{3,2} & \gamma_{3,2}
\end{array} \right) \left( \begin{array}{cccc}
\widehat\alpha_{0,0} & \widehat\alpha_{0,1} & \widehat\alpha_{0,2} & 0 \\
0 & \widehat{\widehat\alpha}_{1,1} & \widehat{\widehat\alpha}_{1,2} & \widehat\alpha_{1,3} \\
0 & 0 & \widehat\alpha_{2,2} & \widehat\alpha_{2,3} \\
0 & 0 & \alpha_{3,2} & \alpha_{3,3}
\end{array} \right) . \]
The matrix Q is the orthogonal matrix that results from multiplying the different Givens' rotations together:
\[ Q = \left( \begin{array}{cccc}
\gamma_{1,0} & -\sigma_{1,0} & 0 & 0 \\
\sigma_{1,0} & \gamma_{1,0} & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{array} \right) \left( \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & \gamma_{2,1} & -\sigma_{2,1} & 0 \\
0 & \sigma_{2,1} & \gamma_{2,1} & 0 \\
0 & 0 & 0 & 1
\end{array} \right) \left( \begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & \gamma_{3,2} & -\sigma_{3,2} \\
0 & 0 & \sigma_{3,2} & \gamma_{3,2}
\end{array} \right) . \quad (10.3.1) \]
and Q R
—0,0 —0,1 —0,2 ··· —0,m≠2 —0,m≠1
c d
c —1,0 —1,1 —1,2 ··· —1,m≠2 —1,m≠1 d
c d
c
c 0 —2,1 —2,2 ··· —2,m≠2 —2,m≠1 d
d
B= c 0 0 —3,2 ··· —3,m≠2 —3,m≠1 d.
c d
c .. .. .. .. .. .. d
c
a . . . . . .
d
b
0 0 0 · · · —m≠1,m≠2 —m≠1,m≠1
Notice that AQ = QB and hence
1 2
A q0 q1 q2 · · · qm≠2 qm≠1
1 2
= q0 q1 q 2 · · · qm≠2 qm≠1
Q R
—0,0 —0,1 —0,2 · · · —0,m≠2 —0,m≠1
c d
c —1,0 —1,1 —1,2 · · · —1,m≠2 —1,m≠1 d
c d
c 0 —2,1 —2,2 · · · —2,m≠1 d
c d
c 0 0 —3,2 · · · —3,m≠2 —3,m≠1 d.
c d
c . .. .. .. ..
c . ... d
d
a . . . . . b
0 0 0 · · · —m≠1,m≠2 —m≠1,m≠1
Equating the first column on the left and right, we notice that
Next,
—1,0 q1 = Aq0 ≠ —0,0 q0 = q̃1 .
Since Îq1 Î2 = 1 (it is a column of a unitary matrix) and —1,0 is assumed to be positive, then
we know that
—1,0 = Îq̃1 Î2 .
Finally,
q1 = q̃1 /—1,0 .
The point is that the first column of B and second column of Q are prescribed by the first
column of Q and the fact that B has positive elements on the first subdiagonal. In this way,
it can be successively argued that, one by one, each column of Q and each column of B are
prescribed. ⌅
Homework 10.3.4.1 Give all the details of the above proof.
Solution. Assume that q1 , . . . , qk and the column indexed with k ≠ 1 of B have been shown
to be uniquely determined under the stated assumptions. We now show that then qk+1 and
the column indexed by k of B are uniquely determined. (This is the inductive step in the
proof.) Then
Aqk = —0,k q0 + —1,k q1 + · · · + —k,k qk + —k+1,k qk+1 .
We can determine —0,k through —k,k by observing that
for j = 0, . . . , k. Then
—k+1,k = Îq̃k+1 Î2
and then
qk+1 = q̃k+1 /—k+1,k .
This way, the columns of Q and B can be determined, one-by-one.
Ponder This 10.3.4.2 Notice the similarity between the above proof and the proof of the
existence and uniqueness of the QR factorization!
This can be brought out by observing that
A B
1 2 1 1 0 2
q0 A
0 q0 q1 q2 · · · qm≠2 qm≠1
Q R
1 —0,0 —0,1 —0,2 ··· —0,m≠2 —0,m≠1
c 0 —1,0
c d
—1,1 —1,2 ··· —1,m≠2 —1,m≠1 d
c d
1 2c 0 0 —2,1 —2,2 ··· —2,m≠1 d
= q0 q1 q2 · · · qm≠2 qm≠1 cc 0 0 0 —3,2 ··· —3,m≠2 —3,m≠1
d
d.
c d
c . .. .. .. .. ..
c . .. d
a . . . . . . .
d
b
0 0 0 0 · · · —m≠1,m≠2 —m≠1,m≠1
YouTube: https://www.youtube.com/watch?v=RSm_Mqi0aSA
In the last unit, we described how, when A(k) is tridiagonal, the steps
A(k) æ Q(k) R(k)
A(k+1) := R(k) Q(k)
of an unshifted QR algorithm can be staged as the computation and application of a se-
quence of Givens’ rotations. Obviously, one could explicitly form A(k) ≠ µk I, perform these
computations with the resulting matrix, and then add µk I to the result to compute
A(k) ≠ µk I æ Q(k) R(k)
A(k+1) := R(k) Q(k) + µk I.
The Francis QR Step combines these separate steps into a single one, in the process casting
all computations in terms of unitary similarity transformations, which ensures numerical
stability.
Consider the 4 ◊ 4 tridiagonal matrix
Q R
–0,0 –0,1 0 0
c
c –1,0 –1,1 –1,2 0 d
d
c d ≠ µI
a 0 –2,1 –2,2 –2,3 b
0 0 –3,2 –3,3
A B
–0,0 ≠ µ
The first Givens’ rotation is computed from , yielding “1,0 and ‡1,0 so that
–1,0
A BT A B
“1,0 ≠‡1,0 –0,0 ≠ µ
‡1,0 “1,0 –1,0
has a zero second entry. Now, to preserve eigenvalues, any orthogonal matrix that is applied
from the left must also have its transpose applied from the right. Let us compute
Q R
˜ 0,0 –
– ‚ 1,0 –
‚ 2,0 0
c
c –
‚ 1,0 –
‚ 1,1 –
‚ 1,2 0 d
d
c d
a –
‚ 2,0 –
‚ 2,1 –2,2 –2,3 b
0 0 –3,2 –3,3
Q RQ RQ R
“1,0 ‡1,0 0 0 –0,0 –0,1 0 0 “1,0 ≠‡1,0 0 0
c ≠‡
c 1,0 “1,0 0 0 dc
dc –1,0 –1,1 –1,2 0 dc
dc ‡1,0 “1,0 0 0 d
d
=c dc dc d.
a 0 0 1 0 ba 0 –2,1 –2,2 –2,3 ba 0 0 1 0 b
0 0 0 1 0 0 –3,2 –3,3 0 0 0 1
This is known as
A "introducing
B the bulge."
–
‚ 1,0
Next, from , one can compute “2,0 and ‡2,0 so that
–
‚ 2,0
A BT A B A B
“2,0 ≠‡2,0 –
‚ 1,0 ˜ 1,0
–
= .
‡2,0 “2,0 –
‚ 2,0 0
Then
Q R
˜ 0,0
– ˜ 1,0 0
– 0
c d
c
c ˜ 1,0
– ˜ 1,1 –
– ‚
‚ 2,1 –
‚ 3,1 d
d
c
a 0 ‚
–‚ 2,1 –‚ 2,2 –
‚ 2,3 d
b
0 –‚ 3,1 –‚ 3,2 –3,3
Q RQ RQ R
1 0 0 0 ˜ 0,0 –
– ‚ 1,0 –
‚ 2,0 0 1 0 0 0
c 0 “2,0 ‡2,0 0 dc –
‚ 1,0 –
‚ 1,1 –
‚ 1,2 0 dc 0 “2,0 ≠‡2,0 0 d
=c
c
dc
dc
dc
dc
d
d
a 0 ≠‡2,0 “2,0 0 ba –
‚ 2,0 –
‚ 2,1 –2,2 –2,3 ba 0 ‡2,0 “2,0 0 b
0 0 0 1 0 0 –3,2 –3,3 0 0 0 1
again preserves eigenvalues. Finally, from
A B
–
‚ 2,1
,
–
‚ 3,1
yielding a tridiagonal matrix. The process of transforming the matrix that results from introducing the bulge (the nonzero element α̂_{2,0}) back into a tridiagonal matrix is commonly referred to as "chasing the bulge." Moving the bulge one row and column down the matrix is illustrated in Figure 10.3.5.1. The process of determining the first Givens' rotation, introducing the bulge, and chasing the bulge is known as a Francis Implicit QR step. An algorithm for this is given in Figure 10.3.5.2.
Figure 10.3.5.1 Illustration of how the bulge is chased one row and column forward in the
matrix.
T := ChaseBulge(T )
Q R
TT L ı ı
c d
T æ a TM L TM M ı b
0 TBM TBR
TT L is 0 ◊ 0 and TM M is 3 ◊ 3
while m(TBR ) > 0 Q R
Q R
T00 ı 0 0 0
c d
TT L ı 0 c tT10 ·11 ı 0 0 d
c c d
a TM L TM M ı dbæc
c 0 t21 T22 ı 0 d
d
0 TBM TBR c
a 0 0 tT32 ·33 ı d
b
0A 0 B 0 t43 T44 A B
·21 ·21
Compute (“, ‡) s.t. GT“,‡ t21 = , and assign t21 :=
0 0
T22 := GT“,‡ T22 G“,‡
tT32 := tT32 G“,‡ (not performed
Q
during final step) R
Q R
T00 ı 0 0 0
c T d
TT L ı 0 c t10 ·11 ı 0 0 d
c d c d
a TM L TM M ı b Ω c 0 t21 T22 ı
c 0 d
d
0 a 0 0 tT32 ·33 ı
c d
TBM TBR b
0 0 0 t43 T44
endwhile
Figure 10.3.5.2 Algorithm for "chasing the bulge" that, given a tridiagonal matrix with an
additional nonzero –2,0 element, reduces the given matrix back to a tridiagonal matrix.
The described process has the net result of updating A(k+1) = QT A(k) Q(k) , where Q is
the orthogonal matrix that results from multiplying the different Givens’ rotations together:
Q RQ RQ R
“1,0 ≠‡1,0 0 0 1 0 0 0 1 0 0 0
c
c ‡1,0 “1,0 0 0 dc
dc 0 “2,0 ≠‡2,0 0 dc
dc 0 1 0 0 d
d
Q=c dc dc d.
a 0 0 1 0 ba 0 ‡2,0 “2,0 0 ba 0 0 “3,1 ≠‡3,1 b
0 0 0 1 0 0 0 1 0 0 ‡3,1 “3,1
is exactly the same first column had Q been computed as in Subsection 10.3.3 (10.3.1). Thus,
by the Implicit Q Theorem, the tridiagonal matrix that results from this approach is equal
to the tridiagonal matrix that would be computed by applying the QR factorization from
that section to A ≠ µI, A ≠ µI æ QR followed by the formation of RQ + µI using the
algorithm for computing RQ in Subsection 10.3.3.
Remark 10.3.5.3 In Figure 10.3.5.2, we use a variation of the notation we have encountered when presenting many of our algorithms, including most recently the reduction to tridiagonal form. The fact is that when implementing the implicitly shifted QR algorithm, it is best to do so by explicitly indexing into the matrix. The tridiagonal matrix is typically stored as just two vectors: one for the diagonal and one for the subdiagonal.
Homework 10.3.5.1 A typical step when "chasing the bulge" one row and column further
down the matrix involves the computation
Q R
–i≠1,i≠1 ◊ ◊ 0
c –
c ‚ –
‚ ◊ 0 d
d
d=
i,i≠1 i,i
c
a 0 –
‚ i+1,i –
‚ i+1,i+1 ◊ b
0 –
‚ i+2,i –
‚ i+2,i+1 –i+2,i+2
Q RQ RQ R
1 0 0 0 –i≠1,i≠1 ◊ ◊ 0 1 0 0 0
c 0 “i ‡i 0 d c –i,i≠1
d c –i,i ◊ 0 dc 0 “i ≠‡i 0 d
c dc d
c dc dc d
a 0 ≠‡i “i 0 b a –i+i,i≠1 –i+1,i –i+1,i+1 ◊ ba 0 ‡i “i 0 b
0 0 0 1 0 0 –i+2,i+1 –i+2,i+2 0 0 0 1
Give a strategy (or formula) for computing
Q R
c –
‚ i,i≠1 –
‚ i,i d
c d
c d
a –
‚ i+1,i –
‚ i+1,i+1 b
–
‚ i+2,i –
‚ i+2,i+1
Solution. Since the subscripts will drive us crazy, let’s relabel, add one of the entries above
the diagonal, and drop the subscripts on “ and ‡:
Q R
◊ ◊ ◊ 0
c ‚ 0 d
c
c ‘‚ Ÿ‚ ⁄ d
d=
c
a 0 ⁄‚ µ‚ ◊ d
b
0 ‰‚ ‚
 ◊
Q RQ RQ R
1 0 0 0 ◊ ◊ ◊ 0 1 0 0 0
c
c 0 “ ‡ 0 d c
dc ‘ Ÿ ⁄ 0 dc
dc 0 “ ≠‡ 0 d
d
c dc dc d
a 0 ≠‡ “ 0 b a „ ⁄ µ ◊ ba 0 ‡ “ 0 b
0 0 0 1 0 0 Â ◊ 0 0 0 1
With this, the way I would compute the desired results is via the steps
• ‘‚ := “‘ + ‡„
A B CA BA BD A B
Ÿ ‚
‚ ⁄ “ ‡ Ÿ ⁄ “ ≠‡
• ‚ µ :=
⁄ ‚ ≠‡ “ ⁄ µ ‡ “
• ‰‚ := ‡Â
‚ := “‰
Translating this to the update of the actual entries is straight forward.
Ponder This 10.3.5.2 Write a routine that performs one Francis implicit QR step. Use it
to write an implicitly shifted QR algorithm.
YouTube: https://www.youtube.com/watch?v=fqiex-FQ-JU
YouTube: https://www.youtube.com/watch?v=53XcY9IQDU0
The last unit shows how one iteration of the QR algorithm can be performed on a
tridiagonal matrix by implicitly shifting and then "chasing the bulge." All that is left to
complete the algorithm is to note that
• The shift µk can be chosen to equal –m≠1,m≠1 (the last element on the diagonal, which
tends to converge to the eigenvalue smallest in magnitude). In practice, choosing the
shift to be an eigenvalue of the bottom-right 2 ◊ 2 matrix works better. This is known
as the Wilkinson shift.
• In practice, it is observed that O(m) iterations are needed. Thus, a QR algorithm with a tridiagonal matrix that accumulates eigenvectors requires O(m³) computation.
Thus, the total cost of computing the eigenvalues and eigenvectors is O(m³).
• If an element on the subdiagonal becomes zero (or very small), and hence the corresponding element of the superdiagonal, then
\[ T = \left( \begin{array}{cc} T_{00} & 0 \\ 0 & T_{11} \end{array} \right) , \]
where T_{00} and T_{11} are themselves tridiagonal, and
¶ The computation can continue separately with T00 and T11 .
¶ One can pick the shift from the bottom-right of T00 as one continues finding the eigenvalues of T00, thus accelerating that part of the computation.
¶ One can pick the shift from the bottom-right of T11 as one continues finding the
eigenvalues of T11 , thus accelerating that part of the computation.
¶ One must continue to accumulate the eigenvectors by applying the rotations to
the appropriate columns of Q.
¶ Because of the connection between the QR algorithm and the Inverse Power
Method, subdiagonal entries near the bottom-right of T are more likely to con-
verge to a zero, so most deflation will happen there.
¶ A question becomes when an element on the subdiagonal, ·i+1,i can be considered
to be zero. The answer is when |·i+1,i | is small relative to |·i | and |·i+1,i+1 |. A
typical condition that is used is
Ò
|·i+1,i | Æ ‘mach |·i,i | + |·i+1,i+1 |.
10.4 Enrichments
10.4.1 QR algorithm among the most important algorithms of the
20th century
An article published in SIAM News, a publication of the Society for Industrial and Applied
Mathermatics, lists the QR algorithm among the ten most important algorithms of the 20th
century [10]:
• Barry A. Cipra, The Best of the 20th Century: Editors Name Top 10 Algorithms,
SIAM News, Volume 33, Number 4, 2000.
10.4.2 Who was John Francis

Dear Colleagues,

John Francis was born in 1934 in London and currently lives in Hove, near
Brighton. His residence is about a quarter mile from the sea; he is a
widower. In 1954, he worked at the National Research Development Corp
(NRDC) and attended some lectures given by Christopher Strachey.
In 1955, 56 he was a student at Cambridge but did not complete a degree.
He then went back to NRDC as an assistant to Strachey where he got
involved in flutter computations and this led to his work on QR.

After leaving NRDC in 1961, he worked at the Ferranti Corp and then at the
University of Sussex. Subsequently, he had positions with various
industrial organizations and consultancies. He is now retired. His
interests were quite general and included Artificial Intelligence,
computer languages, systems engineering. He has not returned to numerical
computation.

He was surprised to learn there are many references to his work and
that the QR method is considered one of the ten most important
algorithms of the 20th century. He was unaware of such developments as
TeX and Math Lab. Currently he is working on a degree at the Open
University.

John Francis did remarkable work and we are all in his debt. Along with
the conjugate gradient method, it provided us with one of the basic tools
of numerical analysis.

Gene Golub
this is within a constant factor of the lower bound for computing these eigenvectors since
the eigenvectors constitute O(m2 ) data that must be written upon the completion of the
computation.
When computing the eigenvalues and eigenvectors of a dense Hermitian matrix, MRRR
can replace the implicitly shifted QR algorithm for finding the eigenvalues and eigenvectors
of the tridiagonal matrix. The overall steps then become
A æ QA T QH
A
where T is a tridiagonal real valued matrix. The matrix QA is not explicitly formed
but instead the Householder vectors that were computed as part of the reduction to
tridiagonal form are stored.
T → Q_T D Q_T^T.
The details of that method go beyond the scope of this note. We refer the interested reader
to
• [3] Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn, A Parallel Eigensolver
for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations,
SIAM Journal on Scientific Computing, 2005
Remark 10.4.5.1 An important feature of MRRR is that it can be used to find a subset of
eigenvectors. This is in contrast to the QR algorithm, which computes all eigenvectors.
10.5 Wrap Up
10.5.1 Additional homework
Homework 10.5.1.1 You may want to do a new "git pull" to update directory Assignments.
In Assignments/Week10/matlab you will find the files
With this,
2. Write a function
10.5.2 Summary
We have noticed that typos are uncovered relatively quickly once we release the material.
Because we "cut and paste" the summary from the materials in this week, we are delaying
adding the summary until most of these typos have been identified.
Week 11
Computing the SVD
11.1 Opening
11.1.1 Linking the Singular Value Decomposition to the Spectral
Decomposition
YouTube: https://www.youtube.com/watch?v=LaYzn2x_Z8Q
Week 2 introduced us to the Singular Value Decomposition (SVD) of a matrix. For any
matrix A ∈ C^{m×n}, there exist unitary matrix U ∈ C^{m×m}, unitary matrix V ∈ C^{n×n}, and
Σ ∈ R^{m×n} of the form
$$
\Sigma =
\begin{pmatrix}
\Sigma_{TL} & 0_{r \times (n-r)} \\
0_{(m-r) \times r} & 0_{(m-r) \times (n-r)}
\end{pmatrix},
\quad \text{with } \Sigma_{TL} = \operatorname{diag}( \sigma_0, \ldots, \sigma_{r-1} )
\text{ and } \sigma_0 \geq \sigma_1 \geq \cdots \geq \sigma_{r-1} > 0,
\qquad (11.1.1)
$$
such that A = U Σ V^H, the SVD of matrix A. We can correspondingly partition
U = ( U_L  U_R ) and V = ( V_L  V_R ), where U_L and V_L have r columns, in which case
$$
A = U_L \Sigma_{TL} V_L^H
$$
equals the Reduced Singular Value Decomposition. We did not present practical algorithms
for computing this very important result in Week 2, because we did not have the theory
and practical insights in place to do so. With our discussion of the QR algorithm in the
last week, we can now return to the SVD and present the fundamentals that underlie its
computation.
Homework 11.1.1.2 Let A ∈ C^{m×n} and A = U Σ V^H its SVD, where Σ has the structure
indicated in (11.1.1). Give the Spectral Decomposition of the matrix A A^H.
Solution.
$$
A A^H = ( U \Sigma V^H )( U \Sigma V^H )^H = U \Sigma \Sigma^T U^H
= U
\begin{pmatrix}
\Sigma_{TL}^2 & 0_{r \times (m-r)} \\
0_{(m-r) \times r} & 0_{(m-r) \times (m-r)}
\end{pmatrix}
U^H.
$$
The last two homeworks show how to compute the Spectral Decomposition of A^H A or A A^H
from the SVD of matrix A. We already discovered practical algorithms for computing the
Spectral Decomposition in the last week. What we really want to do is to turn this around:
How do we compute the SVD of A from the Spectral Decomposition of AH A and/or AAH ?
11.1.2 Overview
• 11.1 Opening
• 11.4 Enrichments
• 11.5 Wrap Up
• Transform the implicitly shifted QR algorithm into the implicitly shifted bidiagonal
QR algorithm.
YouTube: https://www.youtube.com/watch?v=aSvGPY09b48
Let’s see if we can turn the discussion from Subsection 11.1.1 around: Given the Spectral
Decomposition of AH A, how can we extract the SVD of A?
Homework 11.2.1.1 Let A ∈ C^{m×m} be nonsingular and A^H A = Q D Q^H, the Spectral
Decomposition of A^H A. Give a formula for U, V, and Σ so that A = U Σ V^H is the SVD of
A. (Notice that A is square.)
Solution. Since A is nonsingular, so is A^H A, and hence D has positive real values on its
diagonal. If we take V = Q and Σ = D^{1/2}, then
$$
A = U \Sigma V^H = U D^{1/2} Q^H,
$$
so that
$$
U = A V \Sigma^{-1} = A Q D^{-1/2}.
$$
We can check that this U is unitary:
$$
U^H U = ( A Q D^{-1/2} )^H ( A Q D^{-1/2} )
= D^{-1/2} Q^H A^H A Q D^{-1/2}
= D^{-1/2} D D^{-1/2}
= I.
$$
The final detail is that the Spectral Decomposition does not require the diagonal elements of
D to be ordered from largest to smallest. This can be easily fixed by permuting the columns
of Q and, correspondingly, the diagonal elements of D.
YouTube: https://www.youtube.com/watch?v=sAbXHD4TMSE
Not all matrices are square and nonsingular. In particular, we are typically interested in
the SVD of matrices where m > n. Let’s examine how to extract the SVD from the Spectral
Decomposition of AH A for such matrices.
Homework 11.2.1.2 Let A ∈ C^{m×n} have full column rank and let A^H A = Q D Q^H, the
Spectral Decomposition of A^H A. Give a formula for the Reduced SVD of A.
Solution. We notice that if A has full column rank, then its Reduced Singular Value
Decomposition is given by A = U_L Σ V^H, where U_L ∈ C^{m×n}, Σ ∈ R^{n×n}, and V ∈ C^{n×n}.
Importantly, A^H A is nonsingular, and D has positive real values on its diagonal. If we take
V = Q and Σ = D^{1/2}, then
$$
U_L = A V \Sigma^{-1} = A Q D^{-1/2},
$$
where, clearly, Σ = D^{1/2} is nonsingular. We can easily verify that U_L has orthonormal
columns:
$$
U_L^H U_L = ( A Q D^{-1/2} )^H ( A Q D^{-1/2} )
= D^{-1/2} Q^H A^H A Q D^{-1/2}
= D^{-1/2} D D^{-1/2}
= I.
$$
As before, the final detail is that the Spectral Decomposition does not require the diagonal
elements of D to be ordered from largest to smallest. This can be easily fixed by permuting
the columns of Q and, correspondingly, the diagonal elements of D.
The last two homeworks give us a first glimpse at a practical procedure for computing
the (Reduced) SVD from the Spectral Decomposition, for the simpler case where A has full
column rank.
• Form B = AH A.
• Compute the Spectral Decomposition B = QDQH via, for example, the QR algorithm.
• Permute the columns of Q and diagonal elements of D so that the diagonal elements
are ordered from largest to smallest. If P is the permutation matrix such that P DP T
reorders the diagonal of D appropriately, then
$$
\begin{array}{l}
A^H A \\
\quad = \quad \langle \text{Spectral Decomposition} \rangle \\
Q D Q^H \\
\quad = \quad \langle \text{insert identities} \rangle \\
Q \underbrace{P^T P}_{I} \, D \, \underbrace{P^T P}_{I} \, Q^H \\
\quad = \quad \langle \text{associativity} \rangle \\
( Q P^T )( P D P^T )( P Q^H ) \\
\quad = \quad \langle\ ( B C )^H = C^H B^H\ \rangle \\
( Q P^T )( P D P^T )( Q P^T )^H.
\end{array}
$$
With these insights, we find the Reduced SVD of a matrix with linearly independent columns.
If in addition A is square (and hence nonsingular), then U = U_L and A = U Σ V^H is its SVD.
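A minimal MATLAB sketch of this procedure (the function name is ours; it assumes A has full column rank, so that D has positive diagonal elements):

function [ U, Sigma, V ] = ReducedSVDviaNormalEquations( A )
% Computes the Reduced SVD of A from the Spectral Decomposition of A'*A.
% Not recommended when A is ill-conditioned (see the discussion that follows).
  B = A' * A;
  [ Q, D ] = eig( ( B + B' ) / 2 );       % symmetrize to guard against roundoff
  [ d, p ] = sort( diag( D ), 'descend' );
  Q = Q(:,p);                             % reorder eigenvalues and eigenvectors
  Sigma = diag( sqrt( d ) );
  V = Q;
  U = A * Q / Sigma;                      % U = A V Sigma^{-1}
end

For complex A, the prime (') is the conjugate transpose, so the same sketch applies with V^H in place of V^T.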
Let us now treat the problem in full generality.
Homework 11.2.1.3 Let A ∈ C^{m×n} be of rank r and
$$
A^H A =
\begin{pmatrix} Q_L & Q_R \end{pmatrix}
\begin{pmatrix}
D_{TL} & 0_{r \times (n-r)} \\
0_{(n-r) \times r} & 0_{(n-r) \times (n-r)}
\end{pmatrix}
\begin{pmatrix} Q_L & Q_R \end{pmatrix}^H
$$
its Spectral Decomposition, where Q_L has r columns and the diagonal elements of D_{TL} are
positive and ordered from largest to smallest. Then the Reduced SVD of A is given by
$$
A = U_L \Sigma_{TL} V_L^H = U_L D_{TL}^{1/2} Q_L^H,
\qquad \text{with } V_L = Q_L,\ \Sigma_{TL} = D_{TL}^{1/2},\ \text{and } U_L = A V_L \Sigma_{TL}^{-1} = A Q_L D_{TL}^{-1/2}.
$$
We can verify that U_L has orthonormal columns:
$$
\begin{array}{l}
U_L^H U_L \\
\quad = \quad ( A Q_L D_{TL}^{-1/2} )^H ( A Q_L D_{TL}^{-1/2} ) \\
\quad = \quad D_{TL}^{-1/2} Q_L^H A^H A Q_L D_{TL}^{-1/2} \\
\quad = \quad D_{TL}^{-1/2} Q_L^H
\begin{pmatrix} Q_L & Q_R \end{pmatrix}
\begin{pmatrix} D_{TL} & 0_{r \times (n-r)} \\ 0_{(n-r) \times r} & 0_{(n-r) \times (n-r)} \end{pmatrix}
\begin{pmatrix} Q_L & Q_R \end{pmatrix}^H Q_L D_{TL}^{-1/2} \\
\quad = \quad D_{TL}^{-1/2}
\begin{pmatrix} I & 0 \end{pmatrix}
\begin{pmatrix} D_{TL} & 0_{r \times (n-r)} \\ 0_{(n-r) \times r} & 0_{(n-r) \times (n-r)} \end{pmatrix}
\begin{pmatrix} I & 0 \end{pmatrix}^H D_{TL}^{-1/2} \\
\quad = \quad D_{TL}^{-1/2} D_{TL} D_{TL}^{-1/2} \\
\quad = \quad I.
\end{array}
$$
Although the approaches discussed above give us a means by which to compute the (Reduced)
SVD that is mathematically sound, their Achilles heel is that they hinge on forming
A^H A. While beyond the scope of this course, the conditioning of computing a Spectral
Decomposition of a Hermitian matrix is dictated by the condition number of the matrix,
much like solving a linear system is. We recall from Subsection 4.2.5 that we avoid using
the Method of Normal Equations to solve the linear least squares problem when a matrix
is ill-conditioned. Similarly, we try to avoid computing the SVD from AH A. The problem
here is even more acute: it is often the case that A is (nearly) rank deficient (for example,
in situations where we desire a low rank approximation of a given matrix) and hence it is
frequently the case that the condition number of A is very unfavorable. The question thus
becomes, how can we avoid computing AH A while still benefiting from the insights in this
unit?
YouTube: https://www.youtube.com/watch?v=SXP12WwhtJA
The first observation that leads to a practical algorithm is that matrices for which we
wish to compute the SVD are often tall and skinny, by which we mean that they have many
more rows than they have columns, and it is the Reduced SVD of this matrix that is desired.
The methods we will develop for computing the SVD are based on the implicitly shifted QR
algorithm that was discussed in Subsection 10.3.5, which requires O(n^3) computation when
applied to an n × n matrix. Importantly, the leading n^3 term has a very large constant
relative to, say, the cost of a QR factorization of that same matrix.
Rather than modifying the QR algorithm to work with a tall and skinny matrix, we start
by computing its QR factorization, A = QR. After this, the SVD of the smaller, n ◊ n sized,
matrix R is computed. The following homework shows how the Reduced SVD of A can be
extracted from Q and the SVD of R.
Homework 11.2.2.1 Let A œ Cm◊n , with m Ø n, and A = QR be its QR factorization
where, for simplicity, we assume that n ◊ n upper triangular matrix R is nonsingular. If
R = \hat{U} \hat{\Sigma} \hat{V}^H is the SVD of R, give the Reduced SVD of A.
Solution.
$$
A = Q R = Q \hat{U} \hat{\Sigma} \hat{V}^H
= \underbrace{( Q \hat{U} )}_{U} \, \underbrace{\hat{\Sigma}}_{\Sigma} \, \underbrace{\hat{V}^H}_{V^H}.
$$
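This translates into a very short MATLAB sketch (the function name is ours; in practice the SVD of R would itself be computed via the bidiagonal QR algorithm developed below, but here we simply call svd):

function [ U, Sigma, V ] = ReducedSVDviaQR( A )
% Reduced SVD of a tall-and-skinny A (m >= n), computed by first factoring A = QR.
  [ Q, R ] = qr( A, 0 );          % economy-size QR: Q is m x n, R is n x n
  [ Uhat, Sigma, V ] = svd( R );  % SVD of the small n x n matrix
  U = Q * Uhat;
end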
YouTube: https://www.youtube.com/watch?v=T0NYsbdaC78
While it would be nice if the upper triangular structure of R was helpful in computing
its SVD, it is actually the fact that that matrix is square and small (if n ≪ m) that is
significant. For this reason, we now assume that we are interested in finding the SVD of a
square matrix A, and ignore the fact that that matrix may be triangular.
YouTube: https://www.youtube.com/watch?v=sGBD0-PSMN8
Here are some more observations, details of which will become clear in the next units:
A = Q_A B Q_A^H,
• We don’t want to explicitly form B T B because the condition number of B equals the
condition number of the original problem (since they are related via unitary transfor-
mations).
• In the next units, we will find that we can again employ the Implicit Q Theorem to
compute the SVD of B, inspired by the implicitly shifted QR algorithm. The algorithm
we develop again casts all updates to B in terms of unitary transformations, yielding
a highly accurate algorithm.
Putting these observations together yields a practical methodology for computing the Re-
duced SVD of a matrix.
YouTube: https://www.youtube.com/watch?v=2OW5Yi6QOdY
Homework 11.2.3.1 Let B ∈ R^{m×m} be a bidiagonal matrix:
$$
B =
\begin{pmatrix}
\beta_{0,0} & \beta_{0,1} & 0 & \cdots & 0 & 0 \\
0 & \beta_{1,1} & \beta_{1,2} & \cdots & 0 & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \beta_{m-2,m-2} & \beta_{m-2,m-1} \\
0 & 0 & 0 & \cdots & 0 & \beta_{m-1,m-1}
\end{pmatrix}.
$$
This introduces zeroes below the first entry in the first column, as illustrated by
$$
\begin{pmatrix}
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times
\end{pmatrix}
$$
The Householder vector that introduced the zeroes is stored over those zeroes.
Next, we introduce zeroes in the first row of this updated matrix.
• The matrix is still partitioned as
$$
A \rightarrow
\begin{pmatrix}
\alpha_{11} & a_{12}^T \\
0 & A_{22}
\end{pmatrix},
$$
where the zeroes have been overwritten with u_{21}.
• We compute [u_{12}, ρ_1] := Housev1( (a_{12}^T)^T ). The first element of u_{12} now holds ± ‖(a_{12}^T)^T‖_2
and the rest of the elements define the Householder transformation that introduces ze-
roes in (a_{12}^T)^T below the first element. We store u_{12}^T in a_{12}^T.
• After setting the first entry of u_{12} explicitly to one, we update A_{22} := A_{22} H( u_{12}, ρ_1 ).
This introduces zeroes to the right of the first entry of a_{12}^T, as illustrated by
$$
\begin{pmatrix}
\times & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times
\end{pmatrix}
$$
The Householder vector that introduced the zeroes is stored over those zeroes.
The algorithm continues this with the updated A22 as illustrated in Figure 11.2.3.1.
$$
\begin{pmatrix}
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times \\
0 & \times & \times & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \times & 0 & 0 \\
0 & 0 & \times & \times & \times \\
0 & 0 & \times & \times & \times \\
0 & 0 & \times & \times & \times
\end{pmatrix}
$$
Original matrix, first iteration, and second iteration (left to right).
$$
\longrightarrow
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \times & 0 & 0 \\
0 & 0 & \times & \times & 0 \\
0 & 0 & 0 & \times & \times \\
0 & 0 & 0 & \times & \times
\end{pmatrix}
\longrightarrow
\begin{pmatrix}
\times & \times & 0 & 0 & 0 \\
0 & \times & \times & 0 & 0 \\
0 & 0 & \times & \times & 0 \\
0 & 0 & 0 & \times & \times \\
0 & 0 & 0 & 0 & \times
\end{pmatrix}
$$
Third and fourth iterations (left to right).
Figure 11.2.3.1 An illustration of the reduction of a square matrix to bidiagonal form. The
◊s denote nonzero elements in the matrix.
Ponder This 11.2.3.2 Fill in the details for the above described algorithm that reduces a
square matrix to bidiagonal form. In particular:
• For the update
$$
\begin{pmatrix} a_{12}^T \\ A_{22} \end{pmatrix} :=
H\left( \begin{pmatrix} 1 \\ u_{21} \end{pmatrix}, \tau_1 \right)
\begin{pmatrix} a_{12}^T \\ A_{22} \end{pmatrix},
$$
describe explicitly how a_{12}^T and A_{22} are updated.
• For the update A_{22} := A_{22} H( u_{12}, ρ_1 ), describe explicitly how A_{22} is updated. (Hint:
look at Homework 10.3.1.1.)
• BiRed.m: A code skeleton for a function that reduces a square matrix to bidiagonal
form.
[ B, t, r ] = BiRed( A, t, r )
returns the diagonal and first superdiagonal of the bidiagonal matrix in B, stores the
Householder vectors below the subdiagonal and above the first superdiagonal, and
returns the scalars τ and ρ in vectors t and r.
• BiFromB.m: A function that extracts the bidiagonal matrix from matrix B, which also
has the Householder vector information in it.
Bbi = BiFromB( B )
These resources give you the tools to implement and test the reduction to bidiagonal form.
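For concreteness, here is a minimal MATLAB sketch of the reduction to bidiagonal form. It is not ALAFF's BiRed: it applies the Householder transformations explicitly and accumulates them into unitary matrices, rather than storing the Householder vectors in the matrix, and the helper below merely mimics what Housev1 computes.

function [ B, QA, QB ] = BiRedSketch( A )
% Reduces a square real matrix A to upper bidiagonal B with A = QA * B * QB'.
  m = size( A, 1 );
  QA = eye( m );  QB = eye( m );
  for k = 1:m-1
    % Householder transformation from the left zeroes A(k+1:m,k).
    [ u, tau ] = Housev1Sketch( A(k:m,k) );
    A(k:m,k:m) = A(k:m,k:m) - (u/tau) * ( u' * A(k:m,k:m) );
    QA(:,k:m)  = QA(:,k:m)  - ( QA(:,k:m) * u ) * (u'/tau);
    if k < m-1
      % Householder transformation from the right zeroes A(k,k+2:m).
      [ u, tau ] = Housev1Sketch( A(k,k+1:m)' );
      A(k:m,k+1:m) = A(k:m,k+1:m) - ( A(k:m,k+1:m) * u ) * (u'/tau);
      QB(:,k+1:m)  = QB(:,k+1:m)  - ( QB(:,k+1:m) * u ) * (u'/tau);
    end
  end
  B = triu( tril( A, 1 ) );     % keep only the diagonal and first superdiagonal
end

function [ u, tau ] = Housev1Sketch( x )
% Computes u (with u(1) = 1) and tau so that ( I - u*u'/tau ) * x = -+ ||x||_2 e_1.
  u = x;
  if x(1) >= 0, u(1) = x(1) + norm( x ); else, u(1) = x(1) - norm( x ); end
  if u(1) ~= 0, u = u / u(1); end
  tau = ( 1 + u(2:end)'*u(2:end) ) / 2;
end

A quick check: for A0 = randn( 5 ) and [ B, QA, QB ] = BiRedSketch( A0 ), the residual norm( A0 - QA*B*QB' ) should be on the order of the machine epsilon.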
YouTube: https://www.youtube.com/watch?v=V2PaGe52ImQ
Converting a (tridiagonal) implicitly shifted QR algorithm into a (bidiagonal) implicitly
shifted QR algorithm now hinges on some key insights, which we will illustrate with a 4 ◊ 4
example.
• The Francis Implicit QR Step would then compute a first Givens' rotation so that
$$
\underbrace{\begin{pmatrix} \gamma_0 & -\sigma_0 \\ \sigma_0 & \gamma_0 \end{pmatrix}^T}_{G_0^T}
\begin{pmatrix} \tau_{0,0} - \tau_{3,3} \\ \tau_{1,0} \end{pmatrix}
=
\begin{pmatrix} \times \\ 0 \end{pmatrix}.
\qquad (11.2.1)
$$
$$
=
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & G_2^T \end{pmatrix}
\underbrace{\left[
\begin{pmatrix} 1 & 0 & 0 \\ 0 & G_1^T & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix}
\times & \times & \times & 0 \\
\times & \times & \times & 0 \\
\times & \times & \times & \times \\
0 & 0 & \times & \times
\end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & G_1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\right]}_{\begin{pmatrix}
\times & \times & 0 & 0 \\
\times & \times & \times & \times \\
0 & \times & \times & \times \\
0 & \times & \times & \times
\end{pmatrix}}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & G_2 \end{pmatrix}
$$
The observation now is that if we can find two sequences of Givens' rotations such that
$$
B^{(k+1)} =
\begin{pmatrix}
\times & \times & 0 & 0 \\
0 & \times & \times & 0 \\
0 & 0 & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}
$$
$$
\begin{array}{l}
\quad =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \hat{G}_2^T \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \hat{G}_1^T & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \hat{G}_0^T & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \\
\qquad \times
\begin{pmatrix}
\beta_{0,0} & \beta_{0,1} & 0 & 0 \\
0 & \beta_{1,1} & \beta_{1,2} & 0 \\
0 & 0 & \beta_{2,2} & \beta_{2,3} \\
0 & 0 & 0 & \beta_{3,3}
\end{pmatrix}
\hspace{3cm} (11.2.2) \\
\qquad \times
\underbrace{
\begin{pmatrix} G_0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tilde{G}_1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \tilde{G}_2 \end{pmatrix}
}_{Q}
\end{array}
$$
If we iterate in this way, we know that the T^{(k)} converge to a diagonal matrix (under mild
conditions). This means that the matrices B^{(k)} converge to a diagonal matrix, Σ_B. If we
accumulate all Givens' rotations into matrices U_B and V_B, then we end up with the SVD of
B:
$$
B = U_B \Sigma_B V_B^T,
$$
modulo, most likely, a reordering of the diagonal elements of Σ_B and a corresponding re-
ordering of the columns of U_B and V_B.
This leaves us with the question of how to find the two sequences of Givens’ rotations
mentioned in (11.2.2).
• We know G_0, which was computed from (11.2.1). Importantly, computing this first
Givens' rotation requires only the elements τ_{0,0}, τ_{1,0}, and τ_{m-1,m-1} of T^{(k)} to be
explicitly formed.
• If we apply it to B (k) , we introduce a bulge:
$$
\begin{pmatrix}
\times & \times & 0 & 0 \\
\times & \times & \times & 0 \\
0 & 0 & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}
=
\begin{pmatrix}
\beta_{0,0} & \beta_{0,1} & 0 & 0 \\
0 & \beta_{1,1} & \beta_{1,2} & 0 \\
0 & 0 & \beta_{2,2} & \beta_{2,3} \\
0 & 0 & 0 & \beta_{3,3}
\end{pmatrix}
\begin{pmatrix} G_0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
$$
• We next compute a Givens' rotation, \hat{G}_0, that changes the nonzero that was introduced
below the diagonal back into a zero:
$$
\begin{pmatrix}
\times & \times & \times & 0 \\
0 & \times & \times & 0 \\
0 & 0 & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}
=
\begin{pmatrix} \hat{G}_0^T & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix}
\times & \times & 0 & 0 \\
\times & \times & \times & 0 \\
0 & 0 & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}.
$$
• This means we now need to chase the bulge that has appeared above the superdiagonal:
$$
\begin{pmatrix}
\times & \times & 0 & 0 \\
0 & \times & \times & 0 \\
0 & \times & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}
=
\begin{pmatrix}
\times & \times & \times & 0 \\
0 & \times & \times & 0 \\
0 & 0 & \times & \times \\
0 & 0 & 0 & \times
\end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \tilde{G}_1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
$$
• We continue like this until the bulge is chased out the end of the matrix.
The net result is an implicitly shifted bidiagonal QR algorithm that is applied directly
to the bidiagonal matrix, maintains the bidiagonal form from one iteration to the next, and
converges to a diagonal matrix that has the singular values of B on its diagonal. Obviously,
deflation can be added to this scheme to further reduce its cost.
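A minimal MATLAB sketch of one such bulge-chasing sweep follows (the function name is ours; it uses MATLAB's planerot to generate Givens' rotations, and the shift mu, for example zero or a Wilkinson-type shift computed from the trailing 2 x 2 of B^T B, is assumed to be supplied by the caller):

function [ B, U, V ] = BidiagonalFrancisSweep( B, mu )
% One implicitly shifted sweep applied to the m x m upper bidiagonal B (m >= 2),
% returning the updated B and orthogonal U, V so that B_in = U * B_out * V'.
  m = size( B, 1 );
  U = eye( m );  V = eye( m );
  % First rotation: determined by the first column of B'*B - mu*I,
  % computed without ever forming B'*B.
  [ G, ~ ] = planerot( [ B(1,1)^2 - mu; B(1,1)*B(1,2) ] );
  B(:,1:2) = B(:,1:2) * G';   V(:,1:2) = V(:,1:2) * G';   % creates a bulge at (2,1)
  for i = 1:m-1
    % Rotation from the left zeroes the bulge below the diagonal ...
    [ G, ~ ] = planerot( B(i:i+1,i) );
    B(i:i+1,:) = G * B(i:i+1,:);   U(:,i:i+1) = U(:,i:i+1) * G';
    if i < m-1
      % ... which creates a bulge above the superdiagonal, at (i,i+2);
      % a rotation from the right chases it back down.
      [ G, ~ ] = planerot( B(i,i+1:i+2)' );
      B(:,i+1:i+2) = B(:,i+1:i+2) * G';   V(:,i+1:i+2) = V(:,i+1:i+2) * G';
    end
  end
end

Repeating such sweeps (with deflation) drives B toward diagonal form, and accumulating the left and right rotations yields U_B and V_B.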
YouTube: https://www.youtube.com/watch?v=OoMPkg994ZE
Given a symmetric 2 ◊ 2 matrix A, with
$$
A =
\begin{pmatrix}
\alpha_{0,0} & \alpha_{0,1} \\
\alpha_{1,0} & \alpha_{1,1}
\end{pmatrix},
$$
a Jacobi rotation is a matrix of the form
$$
J = \begin{pmatrix} \gamma & -\sigma \\ \sigma & \gamma \end{pmatrix},
\quad \gamma^2 + \sigma^2 = 1,
$$
chosen so that J^T A J = Λ is diagonal. We recognize that
A = J Λ J^T
is the Spectral Decomposition of A. The columns of J are eigenvectors of length one and
the diagonal elements of Λ are the eigenvalues.
Ponder This 11.3.1.1 Give a geometric argument that Jacobi rotations exist.
Hint. For this exercise, you need to remember a few things:
• How is a linear transformation, L, translated into the matrix A that represents it,
Ax = L(x)?
• If A is not already diagonal, how can the eigenvectors be chosen so that they have unit
length, first one lies in Quadrant I of the plane, and the other one lies in Quadrant II?
and solve for its roots, which give us the eigenvalues of A. Remember to use the
stable formula for computing the roots of a second degree polynomial, discussed in
Subsection 9.4.1.
• Find an eigenvector associated with one of the eigenvalues, scaling it to have unit
length and to lie in either Quadrant I or Quadrant II. This means that the eigenvector
has the form A B
“
‡
if it lies in Quadrant I or A B
≠‡
“
if it lies in Quadrant II.
This gives us the “ and ‡ that define the Jacobi rotation.
Homework 11.3.1.2 With Matlab, use the eig function to explore the eigenvalues and
eigenvectors of various symmetric matrices:
[ Q, Lambda ] = eig( A )
A = [
-1 2
2 3
]
A = [
2 -1
-1 -2
]
How does the matrix Q relate to a Jacobi rotation? How would Q need to be altered for it
to be a Jacobi rotation?
Solution.
>> A = [
-1 2
2 3
]
A =
-1 2
2 3
Q =
-0.9239 0.3827
0.3827 0.9239
Lambda =
-1.8284 0
0 3.8284
We notice that the columns of Q need to be swapped for it to become a Jacobi rotation:
$$
J = Q \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
= \begin{pmatrix} 0.3827 & -0.9239 \\ 0.9239 & 0.3827 \end{pmatrix}.
$$
>> A = [
2 -1
-1 -2
]
A =
2 -1
-1 -2
Q =
-0.2298 -0.9732
-0.9732 0.2298
Lambda =
-2.2361 0
0 2.2361
We notice that the columns of Q need to be swapped for it to become a Jacobi rotation. If
we follow our "recipe", we also need to negate each column:
$$
J = \begin{pmatrix} 0.9732 & 0.2298 \\ -0.2298 & 0.9732 \end{pmatrix}.
$$
0.2298 0.9732
These solutions are not unique. Another way of creating a Jacobi rotation is to, for
example, scale the first column so that the diagonal elements have the same sign. Indeed,
perhaps that is the easier thing to do:
$$
Q = \begin{pmatrix} -0.9239 & 0.3827 \\ 0.3827 & 0.9239 \end{pmatrix}
\longrightarrow
J = \begin{pmatrix} 0.9239 & 0.3827 \\ -0.3827 & 0.9239 \end{pmatrix}.
$$
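A minimal MATLAB sketch that computes a Jacobi rotation directly, rather than via eig, follows (the function name is ours; the formula picks the root of the underlying quadratic that is smallest in magnitude, in the spirit of the stable root computation of Subsection 9.4.1):

function J = Jacobi2x2( A )
% Computes J = [ gamma -sigma; sigma gamma ] such that J' * A * J is diagonal,
% for a real symmetric 2 x 2 matrix A.
  if A(1,2) == 0
    J = eye( 2 );       % A is already diagonal
    return
  end
  tau = ( A(2,2) - A(1,1) ) / ( 2 * A(1,2) );
  if tau >= 0           % root of t^2 - 2*tau*t - 1 = 0 that is smallest in magnitude
    t = -1 / ( tau + sqrt( 1 + tau^2 ) );
  else
    t =  1 / ( -tau + sqrt( 1 + tau^2 ) );
  end
  gamma = 1 / sqrt( 1 + t^2 );
  sigma = t * gamma;
  J = [ gamma -sigma; sigma gamma ];
end

For A = [ -1 2; 2 3 ] this returns J = [ 0.9239 0.3827; -0.3827 0.9239 ], which agrees with the rotation obtained above by scaling the first column of Q.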
YouTube: https://www.youtube.com/watch?v=mBn7d9jUjcs
The oldest algorithm for computing the eigenvalues and eigenvectors of a (symmetric)
matrix is due to Jacobi and dates back to 1846.
• [22] C. G. J. Jacobi, Über ein leichtes Verfahren, die in der Theorie der Säkular-
störungen vorkommenden Gleichungen numerisch aufzulösen, Crelle’s Journal 30, 51-94
(1846).
If we recall correctly (it has been 30 years since we read the paper in German), the paper
thanks Jacobi’s student Seidel for performing the calculations for a 5 ◊ 5 matrix, related to
the orbits of the planets, by hand...
This is a method that keeps resurfacing, since it parallelizes easily. The operation count
tends to be higher (by a constant factor) than that of reduction to tridiagonal form followed
by the tridiagonal QR algorithm.
Jacobi’s original idea went as follows:
• We find the off-diagonal entry with largest magnitude. Let’s say it is –3,1 .
• We now apply the rotation as a unitary similarity transformation from the left to the
rows of A indexed with 1 and 3, and from the right to columns 1 and 3:
Q R
–0,0 ◊ –0,2 ◊
c
c ◊ ◊ ◊ 0 d
d
c d =
a –2,0 ◊ –2,2 ◊ b
◊ 0 ◊ ◊
Q RQ RQ R
1 0 0 0 –0,0 –0,1 –0,2 –0,3 1 0 0 0
c
c 0 “3,1 0 ‡3,1 dc
dc –1,0 –1,1 –1,2 –1,3 dc
dc 0 “3,1 0 ≠‡3,1 d
d
c dc dc d.
a 0 0 1 0 ba –2,0 –2,1 –2,2 –2,3 ba 0 0 1 0 b
0 ≠‡3,1 0 “3,1 –3,0 –3,1 –3,2 –3,3 0 ‡3,1 0 “3,1
The ◊s here denote elements of the matrix that are changed by the application of the
Jacobi rotation.
• This process repeats, reducing the off-diagonal element that is largest in magnitude to
zero in each iteration.
Notice that each application of the Jacobi rotation is a unitary similarity transformation,
and hence preserves the eigenvalues of the matrix. If this method eventually yields a diagonal
matrix, then the eigenvalues can be found on the diagonal of that matrix. We do not give a
proof of convergence here.
• Seidel.m: A script that lets you apply Jacobi's method to a 5 × 5 matrix, much like
Seidel did by hand. Fortunately, you only indicate the off-diagonal element to zero out.
Matlab then does the rest.
Use this to gain insight into how Jacobi’s method works. You will notice that finding the
off-diagonal element that has largest magnitude is bothersome. You don’t need to get it
right every time.
Once you have found the diagonal matrix, restart the process. This time, zero out the
off-diagonal elements in systematic "sweeps," zeroing the elements in the order.
and repeating this until convergence. A sweep zeroes every off-diagonal element exactly once
(in symmetric pairs).
Solution. Hopefully you noticed that the matrix converges to a diagonal matrix, with the
eigenvalues on its diagonal.
The key insight is that applying a Jacobi rotation to zero an element, α_{i,j}, reduces the
square of the Frobenius norm of the off-diagonal elements of the matrix by 2 α_{i,j}^2. In other
words, let off(A) equal the matrix A but with its diagonal elements set to zero. If J_{i,j} zeroes
out α_{i,j} (and α_{j,i}), then
$$
\| \operatorname{off}( J_{i,j}^T A J_{i,j} ) \|_F^2 = \| \operatorname{off}( A ) \|_F^2 - 2 \alpha_{i,j}^2.
$$
and let J equal the Jacobi rotation that zeroes the element denoted with α_{3,1}:
$$
J^T A J =
\begin{pmatrix}
I & 0 & 0 & 0 & 0 \\
0 & \gamma_{11} & 0 & \sigma_{31} & 0 \\
0 & 0 & I & 0 & 0 \\
0 & -\sigma_{31} & 0 & \gamma_{33} & 0 \\
0 & 0 & 0 & 0 & I
\end{pmatrix}
\begin{pmatrix}
A_{00} & a_{10} & A_{20}^T & a_{30} & A_{40}^T \\
a_{10}^T & \alpha_{11} & a_{21}^T & \alpha_{31} & a_{41}^T \\
A_{20} & a_{21} & A_{22} & a_{32} & A_{42}^T \\
a_{30}^T & \alpha_{31} & a_{32}^T & \alpha_{33} & a_{43}^T \\
A_{40} & a_{41} & A_{42} & a_{43} & A_{44}
\end{pmatrix}
\begin{pmatrix}
I & 0 & 0 & 0 & 0 \\
0 & \gamma_{11} & 0 & -\sigma_{31} & 0 \\
0 & 0 & I & 0 & 0 \\
0 & \sigma_{31} & 0 & \gamma_{33} & 0 \\
0 & 0 & 0 & 0 & I
\end{pmatrix}
$$
$$
=
\begin{pmatrix}
A_{00} & \hat{a}_{10} & A_{20}^T & \hat{a}_{30} & A_{40}^T \\
\hat{a}_{10}^T & \hat{\alpha}_{11} & \hat{a}_{21}^T & 0 & \hat{a}_{41}^T \\
A_{20} & \hat{a}_{21} & A_{22} & \hat{a}_{32} & A_{42}^T \\
\hat{a}_{30}^T & 0 & \hat{a}_{32}^T & \hat{\alpha}_{33} & \hat{a}_{43}^T \\
A_{40} & \hat{a}_{41} & A_{42} & \hat{a}_{43} & A_{44}
\end{pmatrix}
= \hat{A}
$$
and
$$
\| \operatorname{off}( \hat{A} ) \|_F^2 =
\begin{array}{lllll}
\| \operatorname{off}( A_{00} ) \|_F^2 & + \| \hat{a}_{10} \|_F^2 & + \| A_{20}^T \|_F^2 & + \| \hat{a}_{30} \|_F^2 & + \| A_{40}^T \|_F^2 \\
+ \| \hat{a}_{10}^T \|_F^2 & & + \| \hat{a}_{21}^T \|_F^2 & + 0 & + \| \hat{a}_{41}^T \|_F^2 \\
+ \| A_{20} \|_F^2 & + \| \hat{a}_{21} \|_F^2 & + \| \operatorname{off}( A_{22} ) \|_F^2 & + \| \hat{a}_{32} \|_F^2 & + \| A_{42}^T \|_F^2 \\
+ \| \hat{a}_{30}^T \|_F^2 & + 0 & + \| \hat{a}_{32}^T \|_F^2 & & + \| \hat{a}_{43}^T \|_F^2 \\
+ \| A_{40} \|_F^2 & + \| \hat{a}_{41} \|_F^2 & + \| A_{42} \|_F^2 & + \| \hat{a}_{43} \|_F^2 & + \| \operatorname{off}( A_{44} ) \|_F^2
\end{array}
$$
All submatrices in black show up in both ‖off(A)‖_F^2 and ‖off(Â)‖_F^2. The parts of rows and
columns that show up in red and blue are what change. We argue that the sums of the terms
in red are equal for both.
Since a Jacobi rotation is unitary, it preserves the Frobenius norm of the matrix to which
it is applied. Thus, looking at the rows that are modified by applying J^T from the left, we find
that
$$
\begin{array}{l}
\| a_{10}^T \|_F^2 + \| a_{21}^T \|_F^2 + \| a_{41}^T \|_F^2
+ \| a_{30}^T \|_F^2 + \| a_{32}^T \|_F^2 + \| a_{43}^T \|_F^2 \\
\quad = \left\|
\begin{pmatrix} a_{10}^T & a_{21}^T & a_{41}^T \\ a_{30}^T & a_{32}^T & a_{43}^T \end{pmatrix}
\right\|_F^2 \\
\quad = \left\|
\begin{pmatrix} \gamma_{11} & \sigma_{31} \\ -\sigma_{31} & \gamma_{33} \end{pmatrix}
\begin{pmatrix} a_{10}^T & a_{21}^T & a_{41}^T \\ a_{30}^T & a_{32}^T & a_{43}^T \end{pmatrix}
\right\|_F^2 \\
\quad = \left\|
\begin{pmatrix} \hat{a}_{10}^T & \hat{a}_{21}^T & \hat{a}_{41}^T \\ \hat{a}_{30}^T & \hat{a}_{32}^T & \hat{a}_{43}^T \end{pmatrix}
\right\|_F^2 \\
\quad = \| \hat{a}_{10}^T \|_F^2 + \| \hat{a}_{21}^T \|_F^2 + \| \hat{a}_{41}^T \|_F^2
+ \| \hat{a}_{30}^T \|_F^2 + \| \hat{a}_{32}^T \|_F^2 + \| \hat{a}_{43}^T \|_F^2.
\end{array}
$$
Similarly, looking at the columns that are modified by applying J^T from the right, we find
that
$$
\begin{array}{l}
\| a_{10} \|_F^2 + \| a_{30} \|_F^2
+ \| a_{21} \|_F^2 + \| a_{32} \|_F^2
+ \| a_{41} \|_F^2 + \| a_{43} \|_F^2
= \left\|
\begin{pmatrix} a_{10} & a_{30} \\ a_{21} & a_{32} \\ a_{41} & a_{43} \end{pmatrix}
\right\|_F^2 \\
\quad = \left\|
\begin{pmatrix} a_{10} & a_{30} \\ a_{21} & a_{32} \\ a_{41} & a_{43} \end{pmatrix}
\begin{pmatrix} \gamma_{11} & -\sigma_{31} \\ \sigma_{31} & \gamma_{33} \end{pmatrix}
\right\|_F^2
= \left\|
\begin{pmatrix} \hat{a}_{10} & \hat{a}_{30} \\ \hat{a}_{21} & \hat{a}_{32} \\ \hat{a}_{41} & \hat{a}_{43} \end{pmatrix}
\right\|_F^2 \\
\quad = \| \hat{a}_{10} \|_F^2 + \| \hat{a}_{30} \|_F^2
+ \| \hat{a}_{21} \|_F^2 + \| \hat{a}_{32} \|_F^2
+ \| \hat{a}_{41} \|_F^2 + \| \hat{a}_{43} \|_F^2.
\end{array}
$$
• The good news: every time a Jacobi rotation is used to zero an off-diagonal element,
off(A) decreases by twice the square of that element.
• The bad news: a previously introduced zero may become nonzero in the process.
The original algorithm developed by Jacobi searched for the largest (in absolute value)
off-diagonal element and zeroed it, repeating this process until all off-diagonal elements
were small. The problem with this is that searching for the largest off-diagonal element in an
m ◊ m matrix requires O(m2 ) comparisons. Computing and applying one Jacobi rotation as
a similarity transformation requires O(m) flops. For large m this is not practical. Instead, it
can be shown that zeroing the off-diagonal elements by columns (or rows) also converges to
a diagonal matrix. This is known as the column-cyclic Jacobi algorithm. Zeroing out every
pair of off-diagonal elements once is called a sweep. We illustrate this in Figure 11.3.2.1.
Typically only a few sweeps (on the order of five) are needed to converge sufficiently.
Sweep 1

  × 0 × ×     × × 0 ×     × × × 0
  0 × × ×     × × × ×     × × × ×
  × × × ×     0 × × ×     × × × ×
  × × × ×     × × × ×     0 × × ×
 zero (1, 0)  zero (2, 0)  zero (3, 0)

  × × × ×     × × × ×     × × × ×
  × × 0 ×     × × × 0     × × × ×
  × 0 × ×     × × × ×     × × × 0
  × × × ×     × 0 × ×     × × 0 ×
 zero (2, 1)  zero (3, 1)  zero (3, 2)

Sweep 2

  × 0 × ×     × × 0 ×     × × × 0
  0 × × ×     × × × ×     × × × ×
  × × × ×     0 × × ×     × × × ×
  × × × ×     × × × ×     0 × × ×
 zero (1, 0)  zero (2, 0)  zero (3, 0)

  × × × ×     × × × ×     × × × ×
  × × 0 ×     × × × 0     × × × ×
  × 0 × ×     × × × ×     × × × 0
  × × × ×     × 0 × ×     × × 0 ×
 zero (2, 1)  zero (3, 1)  zero (3, 2)

Figure 11.3.2.1 Column-cyclic Jacobi algorithm.
We conclude by noting that the matrix Q such that A = Q QH can be computed by
accumulating all the Jacobi rotations (applying them to the identity matrix).
YouTube: https://www.youtube.com/watch?v=VUktLhUiR7w
Just like the QR algorithm for computing the Spectral Decomposition was modified to
compute the SVD, so can the Jacobi Method for computing the Spectral Decomposition.
The insight is very simple. Let A ∈ R^{m×n} and partition it by columns:
$$
A = \begin{pmatrix} a_0 & a_1 & \cdots & a_{n-1} \end{pmatrix}.
$$
One could form B = A^T A and then compute Jacobi rotations to diagonalize it:
$$
\underbrace{ \cdots J_{3,1}^T J_{2,1}^T }_{Q^T} \, B \, \underbrace{ J_{2,1} J_{3,1} \cdots }_{Q} = D.
$$
We recall that if we order the columns of Q and diagonal elements of D appropriately, then
choosing V = Q and Σ = D^{1/2} yields
$$
A = U \Sigma V^T = U D^{1/2} Q^T
$$
or, equivalently,
$$
A Q = U \Sigma = U D^{1/2}.
$$
This means that if we apply the Jacobi rotations J_{2,1}, J_{3,1}, . . . from the right to A,
then, once B has become (approximately) diagonal, the columns of \hat{A} = ( ( A J_{2,1} ) J_{3,1} ) \cdots are
mutually orthogonal. By scaling them to have length one, setting Σ = diag( ‖\hat{a}_0‖_2, ‖\hat{a}_1‖_2, . . . , ‖\hat{a}_{n-1}‖_2 ),
we find that
$$
U = \hat{A} \Sigma^{-1} = A Q ( D^{1/2} )^{-1}.
$$
The only problem is that in forming B, we may introduce unnecessary error since it squares
the condition number.
Here is a more practical algorithm. We notice that
$$
B = A^T A =
\begin{pmatrix}
a_0^T a_0 & a_0^T a_1 & \cdots & a_0^T a_{n-1} \\
a_1^T a_0 & a_1^T a_1 & \cdots & a_1^T a_{n-1} \\
\vdots & \vdots & & \vdots \\
a_{n-1}^T a_0 & a_{n-1}^T a_1 & \cdots & a_{n-1}^T a_{n-1}
\end{pmatrix}.
$$
We observe that we don't need to form all of B. When it is time to compute J_{i,j}, we need
only compute
$$
\begin{pmatrix} \beta_{i,i} & \beta_{j,i} \\ \beta_{j,i} & \beta_{j,j} \end{pmatrix}
=
\begin{pmatrix} a_i^T a_i & a_j^T a_i \\ a_j^T a_i & a_j^T a_j \end{pmatrix},
$$
from which Ji,j can be computed. By instead applying this Jacobi rotation to B, we observe
that
T
Ji,j BJi,j = Ji,j
T
AT AJi,j = (AJi,j )T (AJi,j )
and hence the Jacobi rotation can instead be used to take linear combinations of the ith and
jth columns of A:
$$
\begin{pmatrix} a_i & a_j \end{pmatrix} :=
\begin{pmatrix} a_i & a_j \end{pmatrix}
\begin{pmatrix} \gamma_{i,j} & -\sigma_{i,j} \\ \sigma_{i,j} & \gamma_{i,j} \end{pmatrix}.
$$
We have thus outlined an algorithm:
• Accumulate the Jacobi rotations into matrix V, by applying them from the right to an
identity matrix:
$$
V = ( ( I J_{2,1} ) J_{3,1} ) \cdots
$$
• Upon completion,
$$
\Sigma = \operatorname{diag}( \| a_0 \|_2, \| a_1 \|_2, \ldots, \| a_{n-1} \|_2 )
$$
and
$$
U = A \Sigma^{-1},
$$
meaning that each column of the updated A is divided by its length.
Obviously, there are variations on this theme. Such methods are known as one-sided Jacobi
methods.
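A minimal MATLAB sketch of a one-sided Jacobi SVD follows (the function name and the fixed number of sweeps are ours; the rotation is the same 2 x 2 computation sketched earlier, applied to the entries of B = A^T A that are formed on the fly, and A is assumed to have full column rank so that no column length is zero):

function [ U, Sigma, V ] = OneSidedJacobiSVD( A, nsweeps )
  n = size( A, 2 );
  V = eye( n );
  for sweep = 1:nsweeps
    for j = 2:n
      for i = 1:j-1
        % The three entries of B = A'*A needed to determine J_{i,j}:
        bii = A(:,i)'*A(:,i);  bjj = A(:,j)'*A(:,j);  bij = A(:,i)'*A(:,j);
        if bij ~= 0
          tau = ( bjj - bii ) / ( 2 * bij );
          if tau >= 0, t = -1/( tau + sqrt( 1 + tau^2 ) );
          else,        t =  1/( -tau + sqrt( 1 + tau^2 ) ); end
          c = 1/sqrt( 1 + t^2 );  s = t*c;
          J = [ c -s; s c ];                 % diagonalizes [ bii bij; bij bjj ]
          A(:,[i j]) = A(:,[i j]) * J;       % linear combinations of columns i and j
          V(:,[i j]) = V(:,[i j]) * J;
        end
      end
    end
  end
  sigma = sqrt( sum( A.^2, 1 ) );            % column lengths
  [ sigma, p ] = sort( sigma, 'descend' );
  A = A(:,p);  V = V(:,p);
  Sigma = diag( sigma );
  U = A * diag( 1 ./ sigma );                % scale columns to unit length
end

After a handful of sweeps, the product U*Sigma*V' agrees with the input matrix to roughly machine precision.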
11.4 Enrichments
11.4.1 Principal Component Analysis
The Spectral Decomposition and Singular Value Decomposition are fundamental to a tech-
nique in data sciences known as Principal Component Analysis (PCA). The following
tutorial makes that connection.
• [44] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph
Elizondo, Families of Algorithms for Reducing a Matrix to Condensed Form, ACM
Transactions on Mathematical Software (TOMS) , Vol. 39, No. 1, 2012.
Bidiagonal, tridiagonal, and upper Hessenberg form are together referred to as condensed
form.
• [42] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring
the Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions
on Mathematical Software (TOMS), Vol. 40, No. 3, 2014.
To our knowledge, this yields the fastest implementation for finding the SVD of a bidiagonal
matrix.
11.5 Wrap Up
11.5.1 Additional homework
No additional homework yet.
11.5.2 Summary
We have noticed that typos are uncovered relatively quickly once we release the material.
Because we "cut and paste" the summary from the materials in this week, we are delaying
adding the summary until most of these typos have been identified.
Week 12
Attaining High Performance
12.1 Opening
12.1.1 Simple Implementation of matrix-matrix multiplication
The current coronavirus crisis hit UT-Austin on March 14, a day we spent quickly making
videos for Week 11. We have not been back to the office, to create videos for Week 12, since
then. We will likely add such videos as time goes on. For now, we hope that the notes
suffice.
Remark 12.1.1.1 The exercises in this unit assume that you have installed the BLAS-like
Library Instantiation Software (BLIS), as described in Subsection 0.2.4.
Let A, B, and C be m ◊ k, k ◊ n, and m ◊ n matrices, respectively. We can expose their
individual entries as
$$
A =
\begin{pmatrix}
\alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,k-1} \\
\alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,k-1} \\
\vdots & \vdots & & \vdots \\
\alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,k-1}
\end{pmatrix},
\quad
B =
\begin{pmatrix}
\beta_{0,0} & \beta_{0,1} & \cdots & \beta_{0,n-1} \\
\beta_{1,0} & \beta_{1,1} & \cdots & \beta_{1,n-1} \\
\vdots & \vdots & & \vdots \\
\beta_{k-1,0} & \beta_{k-1,1} & \cdots & \beta_{k-1,n-1}
\end{pmatrix},
$$
and
$$
C =
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \cdots & \gamma_{0,n-1} \\
\gamma_{1,0} & \gamma_{1,1} & \cdots & \gamma_{1,n-1} \\
\vdots & \vdots & & \vdots \\
\gamma_{m-1,0} & \gamma_{m-1,1} & \cdots & \gamma_{m-1,n-1}
\end{pmatrix}.
$$
The computation C := AB + C, which adds the result of the matrix-matrix multiplication
AB to a matrix C, is defined element-wise as
$$
\gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j} + \gamma_{i,j}
\qquad (12.1.1)
$$
for all 0 Æ i < m and 0 Æ j < n. We add to C because this will make it easier to play with
the orderings of the loops when implementing matrix-matrix multiplication. The following
pseudo-code computes C := AB + C:
for i := 0, . . . , m − 1
  for j := 0, . . . , n − 1
    for p := 0, . . . , k − 1
      γ_{i,j} := α_{i,p} β_{p,j} + γ_{i,j}
    end
  end
end
The outer two loops visit each element of C and the inner loop updates γ_{i,j} with (12.1.1). We
use C programming language macro definitions in order to explicitly index into the matrices,
which are passed as one-dimensional arrays in which the matrices are stored in column-major
order.
Remark 12.1.1.2 For a more complete discussion of how matrices are mapped to memory,
you may want to look at 1.2.1 Mapping matrices to memory in our MOOC titled LAFF-On
Programming for High Performance. If the discussion here is a bit too fast, you may want
to consult the entire Section 1.2 Loop orderings of that course.
#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + i ]   // map beta( i,j ) to array B
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C
to compile, link, and execute it. You can view the performance attained on your computer
with the Matlab Live Script in Assignments/Week12/C/data/Plot_IJP.mlx (Alternatively, read
and execute Assignments/Week12/C/data/Plot_IJP_m.m.)
On Robert’s laptop, Homework 12.1.1.1 yields the graph
as the curve labeled with IJP. The time, in seconds, required to compute matrix-matrix
multiplication as a function of the matrix size is plotted, where m = n = k (each matrix
is square). "Irregularities" in the time required to complete can be attributed to a number
of factors, including that other processes that are executing on the same processor may be
disrupting the computation. One should not be too concerned about those.
The performance of a matrix-matrix multiplication implementation is measured in billions
of floating point operations per second (GFLOPS). We know that it takes 2mnk flops to
compute C := AB + C when C is m ◊ n, A is m ◊ k, and B is k ◊ n. If we measure the
time it takes to complete the computation, T (m, n, k), then the rate at which we compute
is given by
$$
\frac{2 m n k}{T( m, n, k )} \times 10^{-9} \text{ GFLOPS}.
$$
For our implementation, this yields
Again, don’t worry too much about the dips in the curves in this and future graphs. If we
controlled the environment in which we performed the experiments (for example, by making
sure no other compute-intensive programs are running at the time of the experiments), these
would largely disappear.
Remark 12.1.1.4 The Gemm in the name of the routine stands for General Matrix-Matrix
multiplication. Gemm is an acronym that is widely used in scientific computing, with roots
in the Basic Linear Algebra Subprograms (BLAS) interface which we will discuss in Subsec-
tion 12.2.5.
Homework 12.1.1.2 The IJP ordering is one possible ordering of the loops. How many
distinct reorderings of those loops are there?
Answer.
3! = 6.
Solution.
• Once a choice is made for the outer-most loop, there are two choices left for the second
loop.
• Once that choice is made, there is only one choice left for the inner-most loop.
make IPJ
make JIP
...
for each of the implementations and view the resulting performance by making the indicated
changes to the Live Script in Assignments/Week12/C/data/Plot_All_Orderings.mlx (Alterna-
tively, use the script in Assignments/Week12/C/data/Plot_All_Orderings_m.m). If you have
implemented them all, you can test them all by executing
make All_Orderings
Solution.
• Assignments/Week12/answers/Gemm_IPJ.c
• Assignments/Week12/answers/Gemm_JIP.c
• Assignments/Week12/answers/Gemm_JPI.c
• Assignments/Week12/answers/Gemm_PIJ.c
• Assignments/Week12/answers/Gemm_PJI.c
Figure 12.1.1.5 Performance comparison of all different orderings of the loops, on Robert’s
laptop.
Homework 12.1.1.4 In directory Assignments/Week12/C/ execute
make JPI
and view the results with the Live Script in Assignments/Week12/C/data/Plot_Opener.mlx.
(This may take a little while, since the Makefile now specifies that the largest problem to be
executed is m = n = k = 1500.)
Next, change that Live Script to also show the performance of the reference implemen-
tation provided by the BLAS-like Library Instantion Software (BLIS): Change
to
and rerun the Live Script. This adds a plot to the graph for the reference implementation.
What do you observe? Now are you happy with the improvements you made by reordering
the loops?
Solution. On Robert’s laptop:
Left: Plotting only simple implementations. Right: Adding the performance of the
reference implementation provided by BLIS.
Note: the performance in the graph on the left may not exactly match that in the graph
earlier in this unit. My laptop does not always attain the same performance. When a
processor gets hot, it "clocks down." This means the attainable performance goes down.
A laptop is not easy to cool, so one would expect more fluctuation than when using, for
example, a desktop or a server.
Here is a video from our course "LAFF-On Programming for High Performance", which
explains what you observe. (It refers to "Week 1" of that course. It is part of the launch for
that course.)
YouTube: https://www.youtube.com/watch?v=eZaq451nuaE
Remark 12.1.1.6 There are a number of things to take away from the exercises in this unit.
• The ordering of the loops matters. Why? Because it changes the pattern of memory
access. Accessing memory contiguously (what is often called "with stride one") improves
performance.
• Compilers, which translate high-level code written in a language like C, are not the
answer. You would think that they would reorder the loops to improve performance.
The widely-used gcc compiler doesn’t do so very effectively. (Intel’s highly-acclaimed
icc compiler does a much better job, but still does not produce performance that rivals
the reference implementation.)
• One solution is to learn how to optimize yourself. Some of the fundamentals can be
discovered in our MOOC LAFF-On Programming for High Performance. The other
solution is to use highly optimized libraries, some of which we will discuss later in this
week.
12.1.2 Overview
• 12.1 Opening
• 12.4 Enrichments
• 12.5 Wrap Up
• Amortize the movement of data between memory layers over useful computation to
overcome the discrepancy between the speed of floating point computation and the
speed with which data can be moved in and out of main memory.
• Cast matrix factorizations as blocked algorithms that cast most computation in terms
of matrix-matrix operations.
At the bottom of the pyramid is the computer’s main memory. At the top are the
processor’s registers. In between are progressively larger cache memories: the L1, L2, and
L3 caches. (Some processors now have an L4 cache.) To compute, data must be brought
into registers, of which there are only a few. Main memory is very large and very slow. The
strategy for overcoming the cost (in time) of loading data is to amortize that cost over many
computations while it resides in a faster memory layer. The question, of course, is whether
an operation we wish to perform exhibits the opportunity for such amortization.
Ponder This 12.2.1.1 For the processor in your computer, research the number of registers
it has, and the sizes of the various caches.
Solution.
• How many floating point operations are required to compute this operation?
The dot product requires m multiplies and m additions, for a total of 2m flops.
• If the scalar fl and the vectors x and y are initially stored in main memory and (if
necessary) are written back to main memory, how many reads and writes (memory
operations) are required? (Give a reasonably tight lower bound.)
¶ The scalar fl is moved into a register and hence only needs to be read once and
written once.
¶ The m elements of x and m elements of y must be read (but not written).
We conclude that the dot product does not exhibit an opportunity for the reuse of most
data. ⇤
y := αx + y,
• If the scalar α and the vectors x and y are initially stored in main memory and are
written back to main memory (if necessary), how many reads and writes (memops) are
required? (Give a reasonably tight lower bound.)
Solution.
• How many floating point operations are required to compute this operation?
The axpy operation requires m multiplies and m additions, for a total of 2m flops.
• If the scalar α and the vectors x and y are initially stored in main memory and (if
necessary) are written back to main memory, how many reads and writes (memory
operations) are required? (Give a reasonably tight lower bound.)
¶ The scalar α is moved into a register, and hence only needs to be read once. It
does not need to be written back to memory.
¶ The m elements of x are read, and the m elements of y must be read and written.
We conclude that the axpy operation also does not exhibit an opportunity for the reuse of
most data.
The time for performing a floating point operation is orders of magnitude less than that
of moving a floating point number from and to memory. Thus, for an individual dot product
or axpy operation, essentially all time will be spent in moving the data and the attained
performance, in GFLOPS, will be horrible. The important point is that there just isn’t much
reuse of data when executing these kinds of "vector-vector" operations.
Example 12.2.2.1 and Homework 12.2.2.1 appear to suggest that, for example, when
computing a matrix-vector multiplication, one should do so by taking dot products of rows
with the vector rather than by taking linear combinations of the columns, which casts the
computation in terms of axpy operations. It is more complicated than that: the fact that
the algorithm that uses axpys computes with columns of the matrix, which are stored con-
y := Ax + y,
• If the matrix and vectors are initially stored in main memory and are written back
to main memory (if necessary), how many reads and writes (memops) are required?
(Give a reasonably tight lower bound.)
Solution.
• How many floating point operations are required to compute this operation?
y := Ax + y requires m^2 multiplies and m^2 additions, for a total of 2m^2 flops.
• If the matrix and vectors are initially stored in main memory and are written back to
main memory, how many reads and writes (memops) are required? (Give a reasonably
tight lower bound.)
To come up with a reasonably tight lower bound, we observe that every element of A
must be read (but not written). Thus, a lower bound is m^2 memops. The reading and
writing of x and y contribute a lower order term, which we tend to ignore.
While this ratio is better than either the dot product’s or the axpy operation’s, it still does
not look good.
What we notice is that there is a (slighly) better opportunity for reuse of data when
computing a matrix-vector multiplication than there is when computing a dot product or
axpy operation. Can the lower bound on data movement that is given in the solution be
attained? If you bring y into, for example, the L1 cache, then it only needs to be read from
main memory once and is kept in a layer of the memory that is fast enough to keep up with
the speed of floating point computation for the duration of the matrix-vector multiplication.
Thus, it only needs to be read and written once from and to main memory. If we then
compute y := Ax + y by taking linear combinations of the columns of A, staged as axpy
operations, then at the appropriate moment an element of x with which an axpy is performed
can be moved into a register and reused. This approach requires each element of A to be
read once, each element of x to be read once, and each element of y to be read and written
(from and to main memory) once. If the vector y is too large for the L1 cache, then it
can be partitioned into subvectors that do fit. This would require the vector x to be read
into registers multiple times. However, x itself might then be reused from one of the cache
memories.
Homework 12.2.2.3 Consider the rank-1 update
A := xy T + A,
• If the matrix and vectors are initially stored in main memory and are written back
to main memory (if necessary), how many reads and writes (memops) are required?
(Give a reasonably tight lower bound.)
Solution.
• How many floating point operations are required to compute this operation?
A := xy^T + A requires m^2 multiplies and m^2 additions, for a total of 2m^2 flops. (One
multiply and one add per element in A.)
• If the matrix and vectors are initially stored in main memory and are written back to
main memory, how many reads and writes (memops) are required? (Give a reasonably
tight lower bound.)
To come up with a reasonably tight lower bound, we observe that every element of A
must be read and written. Thus, a lower bound is 2m^2 memops. The reading of x and
y contribute a lower order term. These vectors need not be written, since they don’t
change.
• How many floating point operations (flops) are required to compute this operation?
• If the matrices are initially stored in main memory and (if necessary) are written back
to main memory, how many reads and writes (memops) are required? (Give a simple
lower bound.)
Solution.
• How many floating point operations are required to compute this operation?
C := AB + C requires m^3 multiplies and m^3 additions, for a total of 2m^3 flops.
• If the matrices are initially stored in main memory and (if necessary) are written back
to main memory, how many reads and writes (memops) are required? (Give a simple
lower bound.)
To come up with a reasonably tight lower bound, we observe that every element of A,
B, and C must be read and every element of C must be written. Thus, a lower bound
is 4m^2 memops.
C := AB + C,
where A, B, and C are m ◊ m matrices. If m is small enough, then we can read the three
matrices into the L1 cache, perform the operation, and write the updated matrix C back to
memory. In this case,
• During the computation, the matrices are in a fast memory (the L1 cache), which can
keep up with the speed of floating point computation and
• The cost of moving each floating point number from main memory into the L1 cache
is amortized over m/2 floating point computations.
If m is large enough, then the cost of moving the data becomes insignificant. (If carefully
orchestrated, some of the movement of data can even be overlapped with computation, but
that is beyond our discussion.)
We immediately notice there is a tension: m must be small so that all three matrices can
fit in the L1 cache. Thus, this only works for relatively small matrices. However, for small
matrices, the ratio m/2 may not be favorable enough to offset the very slow main memory.
Fortunately, matrix-matrix multiplication can be orchestrated by partitioning the matri-
ces that are involved into submatrices, and computing with these submatrices instead. We
recall that if we partition
$$
C =
\begin{pmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\
\vdots & \vdots & & \vdots \\
C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1}
\end{pmatrix},
\quad
A =
\begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\
\vdots & \vdots & & \vdots \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1}
\end{pmatrix},
$$
and
$$
B =
\begin{pmatrix}
B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\
B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\
\vdots & \vdots & & \vdots \\
B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1}
\end{pmatrix},
$$
where C_{i,j} is m_i × n_j, A_{i,p} is m_i × k_p, and B_{p,j} is k_p × n_j, with \sum_{i=0}^{M-1} m_i = m, \sum_{j=0}^{N-1} n_j = n,
and \sum_{p=0}^{K-1} k_p = k, then
$$
C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j}.
$$
If we choose each mi , nj , and kp small enough, then the submatrices fit in the L1 cache.
This still leaves us with the problem that these sizes must be reasonably small if the ratio
of flops to memops is to be sufficient. The answer to that is to block for multiple levels of
caches.
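A minimal MATLAB sketch of this blocked approach (the block sizes and the function name are ours; in a real implementation the loops would be written in a language like C and the block sizes chosen to match the cache sizes):

function C = BlockedGemm( A, B, C, mb, nb, kb )
% Computes C := A*B + C by partitioning the operands into mb x kb, kb x nb,
% and mb x nb submatrices and computing with those submatrices.
  [ m, k ] = size( A );  n = size( B, 2 );
  for i = 1:mb:m
    is = i:min( i+mb-1, m );
    for j = 1:nb:n
      js = j:min( j+nb-1, n );
      for p = 1:kb:k
        ps = p:min( p+kb-1, k );
        C(is,js) = C(is,js) + A(is,ps) * B(ps,js);   % C_{i,j} := A_{i,p} B_{p,j} + C_{i,j}
      end
    end
  end
end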
leverage parallelism in the architecture. This allows multiple floating point operations to be
performed simultaneous within a single processing core, which is key to high performance
on modern processors.
Current high-performance libraries invariably build upon the insights in the paper
• [49] Kazushige Goto and Robert van de Geijn, Anatomy of High-Performance Matrix
Multiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3: Article
12, May 2008.
• [52] Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly
Instantiating BLAS Functionality, ACM Journal on Mathematical Software, Vol. 41,
No. 3, June 2015.
The algorithm described in both these papers can be captured by the picture in Fig-
ure 12.2.4.1.
• NVIDIA’s cuBLAS.
• SYMV: Updates y := αAx + βy, where A is symmetric and only the upper or lower
triangular part is stored.
• SYR: Updates A := αxx^T + A, where A is symmetric and stored in only the upper or
lower triangular part of array A.
• SYR2: Updates A := α( xy^T + yx^T ) + A, where A is symmetric and stored in only the
upper or lower triangular part of array A.
• GEMM: Updates C := α op_A(A) op_B(B) + βC, where op_A(A) and op_B(B) indicate whether
A and/or B are to be transposed.
YouTube: https://www.youtube.com/watch?v=PJ6ektH977o
• The notes to which this video refers can be found at http://www.cs.utexas.edu/users/
flame/Notes/NotesOnChol.pdf.
• You can find all the implementations that are created during the video in the direc-
tory Assignments/Week12/Chol/. They have been updated slightly since the video
was created in 2011. In particular, the Makefile was changed so that now the BLIS
implementation of the BLAS is used rather than OpenBLAS.
• The Spark tool that is used to generate code skeletons can be found at http://www.cs.
utexas.edu/users/flame/Spark/.
• How many floating point operations (flops) are required to compute this operation?
• If the matrix is initially stored in main memory and is written back to main memory,
how many reads and writes (memops) are required? (Give a simple lower bound.)
Solution.
• How many floating point operations are required to compute this operation?
From Homework 5.2.2.1, we know that this right-looking algorithm requires (2/3)m^3 flops.
• If the matrix is initially stored in main memory and is written back to main memory,
how many reads and writes (memops) are required? (Give a simple lower bound.)
We observe that every element of A must be read and written. Thus, a lower bound is
2m^2 memops.
Unblocked right-looking LU factorization:
• α_{11} := υ_{11} = α_{11} (no-op).
• a_{12}^T := u_{12}^T = a_{12}^T (no-op).
• a_{21} := l_{21} = a_{21} / υ_{11} = a_{21} / α_{11}.
• A_{22} := A_{22} − l_{21} u_{12}^T = A_{22} − a_{21} a_{12}^T.
Blocked right-looking LU factorization:
• A_{11} := L\U_{11} (overwrite A_{11} with its LU factorization).
• A_{12} := U_{12} = L_{11}^{-1} A_{12} (triangular solve with multiple right-hand sides).
• A_{21} := L_{21} = A_{21} U_{11}^{-1} (triangular solve with multiple right-hand sides).
• A_{22} := A_{22} − A_{21} A_{12} (rank-b update).
and hence for each column of the right-hand side, bj , we need to solve a triangular
system, Lxj = bj .
and we recognize that each row, xÂTi , is computed from xÂTi U = ÂbTi or, equivalently,
by solving U T (xÂTi )T = (ÂbTi )T . We observe it is also a triangular solve with multiple
right-hand sides (TRSM).
• The update A22 := A22 ≠ A21 A12 is an instance of C := –AB + —C, where the k (inner)
size is small. This is often referred to as a rank-k update.
In the following homework, you will determine that most computation is now cast in terms
of the rank-k update (matrix-matrix multiplication).
Homework 12.3.2.2 For the algorithm in Figure 12.3.2.3, analyze the (approximate) num-
ber of flops that are performed by the LU factorization of A11 and updates of A21 and A12 ,
aggregated over all iterations. You may assume that the size of A, m, is an integer multiple
of the block size b, so that m = Kb. Next, determine the ratio of the flops spent in the
• A_{12} := L_{11}^{-1} A_{12}: During the kth iteration, A_{00} is (kb) × (kb), A_{11} is b × b, and A_{12} is
b × ((K − k − 1)b). (It helps to draw a picture.) Hence, the total computation spent
in the operation is approximately ((K − k − 1)b) b^2 = (K − k − 1) b^3 flops.
If we sum this over all K iterations, we find that the total equals
$$
\sum_{k=0}^{K-1} \left[ \tfrac{2}{3} b^3 + 2 ( K - k - 1 ) b^3 \right]
= \left[ \tfrac{2}{3} K + 2 \sum_{k=0}^{K-1} ( K - k - 1 ) \right] b^3
= \left[ \tfrac{2}{3} K + 2 \sum_{j=0}^{K-1} j \right] b^3
\approx \left[ \tfrac{2}{3} K + K^2 \right] b^3
= \tfrac{2}{3} b^2 m + b m^2.
$$
Thus, the ratio of the time spent in these operations to the total cost of the LU factorization is
$$
\frac{ \tfrac{2}{3} b^2 m + b m^2 }{ \tfrac{2}{3} m^3 }
= \left( \frac{b}{m} \right)^2 + \frac{3}{2} \frac{b}{m}.
$$
From this last exercise, we learn that if b is fixed and m gets large, essentially all compu-
tation is in the update A22 := A22 ≠ A21 A12 , which we know can attain high performance.
Homework 12.3.2.3 In directory Assignments/Week12/matlab/, you will find the following
files:
• LU_right_looking.m: an implementation of the unblocked right-looking LU factor-
ization.
• LU_blk_right_looking.m: A code skeleton for a function that implements a blocked
LU factorization.
These resources give you the tools to implement and test the blocked LU factorization.
3. Once you get the right answer, try different problem sizes to see how the different
implementations perform. Try m = 500, 1000, 1500, 2000.
On the discussion forum, discuss what you think are some of the reasons behind the
performance you observe.
Hint. You can extract L11 and U11 from A11 with
help /
help \
Notice that your blocked algorithm gets MUCH better performance than does the unblocked
algorithm. However, the native LU factorization of Matlab does much better yet. The call
lu( A ) by Matlab links to a high performance implementation of the LAPACK interface,
which we will discuss later.
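For reference, here is a minimal MATLAB sketch of the blocked right-looking algorithm (the function names are ours, not the ones in the provided skeleton; no pivoting is performed):

function A = LU_blk_right_looking_sketch( A, nb )
% Blocked right-looking LU factorization without pivoting, with block size nb.
  m = size( A, 1 );
  for k = 1:nb:m
    b  = min( nb, m-k+1 );
    r  = k:k+b-1;                                   % current diagonal block A11
    r2 = k+b:m;                                     % remainder: A12, A21, A22
    A(r,r)   = LU_unb( A(r,r) );                    % A11 := { L11 \ U11 }
    L11 = tril( A(r,r), -1 ) + eye( b );
    U11 = triu( A(r,r) );
    A(r,r2)  = L11 \ A(r,r2);                       % A12 := inv(L11) * A12   (TRSM)
    A(r2,r)  = A(r2,r) / U11;                       % A21 := A21 * inv(U11)   (TRSM)
    A(r2,r2) = A(r2,r2) - A(r2,r) * A(r,r2);        % A22 := A22 - A21 * A12  (rank-b update)
  end
end

function A = LU_unb( A )
% Unblocked right-looking LU factorization without pivoting.
  n = size( A, 1 );
  for j = 1:n-1
    A(j+1:n,j)     = A(j+1:n,j) / A(j,j);
    A(j+1:n,j+1:n) = A(j+1:n,j+1:n) - A(j+1:n,j) * A(j,j+1:n);
  end
end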
Homework 12.3.2.4 For this exercise you will implement the unblocked and blocked al-
gorithm in C. To do so, you need to install the BLAS-like Library Instantiation Software
(BLIS) (see Subsubsection 0.2.4.1) and the libflame library (see Subsubsection 0.2.4.2).
Even if you have not previously programmed in C, you should be able to follow along.
In Assignments/Week12/LU/FLAMEC/, you will find the following files:
• LU_unb_var5.c : a skeleton for implementing the unblocked right-looking LU factor-
ization (which we often call "Variant 5").
• driver.c: A "driver" for testing various implementations. This driver creates matrices
of different sizes and checks the performance and correctness of the result.
• Makefile: A "makefile" that compiles, links, and executes. To learn about Makefiles,
you may want to check Wikipedia. (search for "Make" and choose "Make (software)".)
Our libflame library allows a simple translation from algorithms that have been typeset
with the FLAME notation (like those in Figure 12.3.2.1 and Figure 12.3.2.3). The BLAS
functionality discussed earlier in this week is made available through an interface that hides
details like size, stride, etc. A Quick Reference Guide for this interface can be found at
http://www.cs.utexas.edu/users/flame/pubs/FLAMEC-BLAS-Quickguide.pdf.
With these resources, complete
• LU_unb_var5.c and
• LU_blk_var5.c.
You can skip the testing of LU_blk_var5 by changing the appropriate TRUE to FALSE in
driver.c.
Once you have implemented one or both, you can test by executing make test in a
terminal session (provided you are in the correct directory). This will compile and link,
yielding the executable driver.x, and will then execute this driver. The output is redirected
to the file output.m. Here is a typical line in the output:
• The first line reports the performance of the reference implementation (a simple triple-
nested loop implementation of LU factorization without pivoting). The last number is
the rate of computation in GFLOPS.
• The second line reports the performance of the LU factorization without pivoting that
is part of the libflame library. Again, the last number reports the rate of computation
in GFLOPS.
• The last two lines report the rate of performance (middle number) and difference of the
result relative to the reference implementation (last number), for your unblocked and
blocked implementations. It is important to check that the last number is small. For
larger problem sizes, the reference implementation is not executed, and this difference
is not relevant.
The fact that the blocked version shows a larger difference than does the unblocked
is not significant here. Both have roundoff error in the result (as does the reference
implementation) and we cannot tell which is more accurate.
You can cut and past the output.m file into matlab to see the performance data presented
as a graph.
Ponder This 12.3.2.5 In Subsection 5.5.1, we discuss five unblocked algorithms for com-
puting LU factorization. Can you derive the corresponding blocked algorithms? You can use
the materials in Assignments/Week12/LU/FLAMEC/ to implement and test all five.
I + U T^{-1} U^H.
When the original matrix is m × n, this requires O(m b^2) floating point operations to be
performed to compute the upper triangular matrix T in each iteration, which adds O(m n b)
to the total cost of the QR factorization. This is computation that an unblocked algorithm
does not perform. In return, the bulk of the computation is performed much faster, which
in the balance benefits performance. Details can be found in, for example,
• [23] Thierry Joffrain, Tze Meng Low, Enrique S. Quintana-Orti, Robert van de Geijn,
Field G. Van Zee, Accumulating Householder transformations, revisited, ACM Trans-
actions on Mathematical Software, Vol. 32, No 2, 2006.
operations that do not attain high performance. In enrichments in those chapters, we point
to papers that cast some of the computation in terms of matrix-matrix multiplication. Here,
we discuss the basic issues.
When computing the LU, Cholesky, or QR factorization, it is possible to factor a panel
of columns before applying an accumulation of the encountered transformations to
the rest of the matrix. It is this that allows the computation to be mostly cast in terms of
matrix-matrix operations. When computing the reduction to tridiagonal or bidiagonal form,
this is not possible. The reason is that if we have just computed a unitary transformation,
this transformation must be applied to the rest of the matrix both from the left and from
the right. What this means is that the next transformation to be computed depends on an
update that involves the rest of the matrix. This in turn means that inherently a matrix-
vector operation (involving the "rest of the matrix") must be performed at every step, at a
cost of O(m^2) computation per iteration (if the matrix is m × m). The consequence is that O(m^3)
computation is cast in terms of matrix-vector operations, which is of the same order as the
total computation.
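The following MATLAB sketch, ours and restricted to the real symmetric case, illustrates the point: in each iteration of an unblocked reduction to tridiagonal form, a matrix-vector multiplication with the trailing block is unavoidable.

function A = TriRed_unb( A )
% Sketch: unblocked Householder reduction of a real symmetric A to tridiagonal form.
% Every iteration performs a matrix-vector multiply with the trailing block,
% so O(m^2) flops per iteration (O(m^3) in total) are matrix-vector work.
  m = size( A, 1 );
  for j = 1:m-2
    x  = A( j+1:m, j );
    nx = norm( x );
    if nx == 0, continue, end
    s = sign( x( 1 ) );  if s == 0, s = 1; end
    u = x;  u( 1 ) = x( 1 ) + s * nx;        % Householder vector
    tau = ( u' * u ) / 2;                    % so that H = I - u * u' / tau
    A22 = A( j+1:m, j+1:m );
    y = A22 * u / tau;                       % matrix-vector product: the O((m-j)^2) term
    w = y - ( ( u' * y ) / ( 2 * tau ) ) * u;
    A( j+1:m, j+1:m ) = A22 - u * w' - w * u';   % symmetric rank-2 update
    A( j+1:m, j ) = [ -s * nx; zeros( m-j-1, 1 ) ];
    A( j, j+1:m ) = A( j+1:m, j )';
  end
end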
While this is bad news, there is still a way to cast about half of the computation in terms of
matrix-matrix multiplication for the reduction to tridiagonal form. Notice that this means
the computation is sped up by at most a factor of two: even if the part that is cast in terms
of matrix-matrix multiplication takes no time at all relative to the rest of the computation,
this only cuts the time to completion in half.
The reduction to bidiagonal form is trickier yet. It requires the fusing of a matrix-vector
multiplication with a rank-1 update in order to reuse data that is already in cache.
Details can be found in, for example,
• [44] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph
Elizondo, Families of Algorithms for Reducing a Matrix to Condensed Form, ACM
Transactions on Mathematical Software (TOMS) , Vol. 39, No. 1, 2012.
The update of each column is a vector-vector operation, requiring O(m) computation with
O(m) data (if the vectors are of size m). We have reasoned that for such an operation it is
the cost of accessing memory that dominates. In
• [42] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring
the Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions
on Mathematical Software (TOMS), Vol. 40, No. 3, 2014.
we discuss how the rotations from many Francis Steps can be saved up and applied to Q
at the same time. By carefully orchestrating this so that data in cache can be reused, the
performance can be improved to rival that attained by a matrix-matrix operation.
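The following is a minimal MATLAB sketch, ours rather than the restructured algorithm of [42], of one sweep of rotations being applied to the columns of Q. Each rotation touches two columns of length m, performing O(m) flops on O(m) data, which is why memory traffic dominates unless the application of many sweeps is restructured so that data in cache is reused.

m = 6;
Q = eye( m );                       % in practice, the accumulated eigenvector matrix
theta = randn( m-1, 1 );            % hypothetical rotation angles from one sweep
for k = 1:m-1
  c = cos( theta( k ) );  s = sin( theta( k ) );
  Q( :, k:k+1 ) = Q( :, k:k+1 ) * [ c, -s; s, c ];   % vector-vector style update
end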
12.4 Enrichments
12.4.1 BLIS and beyond
One of the strengths of the approach to implementing matrix-matrix multiplication described
in Subsection 12.2.4 is that it can be applied to related operations. A recent talk discusses
some of these.
• Robert van de Geijn and Field Van Zee, "The BLIS Framework: Experiments in
Portability," SIAM Conference on Parallel Processing for Scientific Computing (PP20),
SIAM Activity Group on Supercomputing Best Paper Prize talk, 2020. https://www.
youtube.com/watch?v=1biep1Rh_08
• [40] Robert van de Geijn and Jerrell Watts, SUMMA: Scalable Universal Matrix Mul-
tiplication Algorithm, Concurrency: Practice and Experience, Volume 9, Number 4,
1997.
The techniques described in that paper are generalized in the more recent paper
• [34] Martin D. Schatz, Robert A. van de Geijn, and Jack Poulson, Parallel Matrix
Multiplication: A Systematic Journey, SIAM Journal on Scientific Computing, Volume
38, Issue 6, 2016.
12.5 Wrap Up
12.5.1 Additional homework
No additional homework yet.
12.5.2 Summary
We have noticed that typos are uncovered relatively quickly once we release the material.
Because we "cut and paste" the summary from the materials in this week, we are delaying
adding the summary until most of these typos have been identified.
Appendix A
We have created a document "Advanced Linear Algebra: Are You Ready?" that a learner
can use to self-assess their readiness for a course on numerical linear algebra.
Appendix B
Notation
and
\[
x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix},
\]
where α and χ are the lower case Greek letters "alpha" and "chi," respectively. You will
also notice that in this course we start indexing at zero. We mostly adopt the convention
that scalars are denoted by lower case Greek letters (exceptions include i, j, p, m, n, and k,
which usually denote integer scalars).
Appendix C
Knowledge from Numerical Analysis
truncating every expression within square brackets to three significant digits, to solve
\[ \chi^2 + 25\chi + \gamma = 0 \]
\[
\chi = \left[ \frac{-25 + \sqrt{\left[ [25^2] - [4] \right]}}{2} \right]
     = \left[ \frac{-25 + \sqrt{[625 - 4]}}{2} \right]
     = \left[ \frac{\left[ -25 + \left[ \sqrt{621} \right] \right]}{2} \right]
     = \left[ \frac{[-25 + 24.9]}{2} \right]
     = \left[ \frac{-0.1}{2} \right]
     = -0.05.
\]
Now, if you do this to the full precision of a typical calculator, the answer is instead
approximately −0.040064. The relative error we incurred is, approximately, 0.01/0.04 = 0.25.
What is going on here? The problem comes from the fact that there is error in the 24.9
that is encountered after the square root is taken. Since that number is close in magnitude,
but of opposite sign, to the −25 to which it is added, the result of −25 + 24.9 is mostly error.
This is known as catastrophic cancellation: adding two nearly equal numbers of opposite
sign, at least one of which has some error in it related to roundoff, yields a result with large
relative error.
Now, one can use an alternative formula to compute the root:
\[
\chi = \frac{-\beta + \sqrt{\beta^2 - 4\gamma}}{2}
     = \frac{-\beta + \sqrt{\beta^2 - 4\gamma}}{2} \times \frac{-\beta - \sqrt{\beta^2 - 4\gamma}}{-\beta - \sqrt{\beta^2 - 4\gamma}},
\]
which yields
\[
\chi = \frac{2\gamma}{-\beta - \sqrt{\beta^2 - 4\gamma}}.
\]
Carrying out the computations, rounding intermediate results, yields −0.0401. The relative
error is now 0.00004/0.040064 ≈ 0.001. This avoids catastrophic cancellation because the
two numbers that are now added are of nearly equal magnitude and of the same sign. □
Remark C.0.2.2 The point is: if possible, avoid creating small intermediate results that
amplify into a large relative error in the final result.
Notice that in this example it is not inherently the case that a small relative change in
the input is amplified into a large relative change in the output (as is the case when solving a
linear system with a poorly conditioned matrix). The problem is with the standard formula
that was used. Later we will see that this is an example of an unstable algorithm.
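The following MATLAB sketch, ours, contrasts the two formulas. With β = 25 and γ = 1, double precision masks the effect, so the sketch also tries a much larger (hypothetical) β, for which −β and sqrt(β^2 − 4γ) nearly cancel and the standard formula visibly loses accuracy while the alternative formula does not.

gamma = 1;
for beta = [ 25, 1e8 ]
  d = sqrt( beta^2 - 4*gamma );
  x_standard    = ( -beta + d ) / 2;             % subtracts nearly equal numbers
  x_alternative = ( 2*gamma ) / ( -beta - d );   % adds numbers of the same sign
  fprintf( 'beta = %8.1e: standard = %.15e  alternative = %.15e\n', ...
           beta, x_standard, x_alternative );
end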
Appendix D
GNU Free Documentation License
A “Modified Version” of the Document means any work containing the Document or a
portion of it, either copied verbatim, or with modifications and/or translated into another
language.
A “Secondary Section” is a named appendix or a front-matter section of the Document
that deals exclusively with the relationship of the publishers or authors of the Document
to the Document’s overall subject (or to related matters) and contains nothing that could
fall directly within that overall subject. (Thus, if the Document is in part a textbook of
mathematics, a Secondary Section may not explain any mathematics.) The relationship
could be a matter of historical connection with the subject or with related matters, or of
legal, commercial, philosophical, ethical or political position regarding them.
The “Invariant Sections” are certain Secondary Sections whose titles are designated, as
being those of Invariant Sections, in the notice that says that the Document is released under
this License. If a section does not fit the above definition of Secondary then it is not allowed
to be designated as Invariant. The Document may contain zero Invariant Sections. If the
Document does not identify any Invariant Sections then there are none.
The “Cover Texts” are certain short passages of text that are listed, as Front-Cover
Texts or Back-Cover Texts, in the notice that says that the Document is released under this
License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at
most 25 words.
A “Transparent” copy of the Document means a machine-readable copy, represented in
a format whose specification is available to the general public, that is suitable for revising
the document straightforwardly with generic text editors or (for images composed of pixels)
generic paint programs or (for drawings) some widely available drawing editor, and that is
suitable for input to text formatters or for automatic translation to a variety of formats
suitable for input to text formatters. A copy made in an otherwise Transparent file format
whose markup, or absence of markup, has been arranged to thwart or discourage subsequent
modification by readers is not Transparent. An image format is not Transparent if used for
any substantial amount of text. A copy that is not “Transparent” is called “Opaque”.
Examples of suitable formats for Transparent copies include plain ASCII without markup,
Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD,
and standard-conforming simple HTML, PostScript or PDF designed for human modifica-
tion. Examples of transparent image formats include PNG, XCF and JPG. Opaque formats
include proprietary formats that can be read and edited only by proprietary word processors,
SGML or XML for which the DTD and/or processing tools are not generally available, and
the machine-generated HTML, PostScript or PDF produced by some word processors for
output purposes only.
The “Title Page” means, for a printed book, the title page itself, plus such following
pages as are needed to hold, legibly, the material this License requires to appear in the title
page. For works in formats which do not have any title page as such, “Title Page” means
the text near the most prominent appearance of the work’s title, preceding the beginning of
the body of the text.
The “publisher” means any person or entity that distributes copies of the Document to
the public.
A section “Entitled XYZ” means a named subunit of the Document whose title either
is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in
another language. (Here XYZ stands for a specific section name mentioned below, such
as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the
Title” of such a section when you modify the Document means that it remains a section
“Entitled XYZ” according to this definition.
The Document may include Warranty Disclaimers next to the notice which states that
this License applies to the Document. These Warranty Disclaimers are considered to be
included by reference in this License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has no effect on the
meaning of this License.
2. VERBATIM COPYING. You may copy and distribute the Document in any
medium, either commercially or noncommercially, provided that this License, the copyright
notices, and the license notice saying this License applies to the Document are reproduced
in all copies, and that you add no other conditions whatsoever to those of this License.
You may not use technical measures to obstruct or control the reading or further copying of
the copies you make or distribute. However, you may accept compensation in exchange for
copies. If you distribute a large enough number of copies you must also follow the conditions
in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly
display copies.
cation until at least one year after the last time you distribute an Opaque copy (directly or
through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well
before redistributing any large number of copies, to give them a chance to provide you with
an updated version of the Document.
4. MODIFICATIONS. You may copy and distribute a Modified Version of the Docu-
ment under the conditions of sections 2 and 3 above, provided that you release the Modified
Version under precisely this License, with the Modified Version filling the role of the Doc-
ument, thus licensing distribution and modification of the Modified Version to whoever
possesses a copy of it. In addition, you must do these things in the Modified Version:
A. Use in the Title Page (and on the covers, if any) a title distinct from that of the
Document, and from those of previous versions (which should, if there were any, be
listed in the History section of the Document). You may use the same title as a previous
version if the original publisher of that version gives permission.
B. List on the Title Page, as authors, one or more persons or entities responsible for
authorship of the modifications in the Modified Version, together with at least five of
the principal authors of the Document (all of its principal authors, if it has fewer than
five), unless they release you from this requirement.
C. State on the Title page the name of the publisher of the Modified Version, as the
publisher.
E. Add an appropriate copyright notice for your modifications adjacent to the other copy-
right notices.
F. Include, immediately after the copyright notices, a license notice giving the public
permission to use the Modified Version under the terms of this License, in the form
shown in the Addendum below.
G. Preserve in that license notice the full lists of Invariant Sections and required Cover
Texts given in the Document’s license notice.
I. Preserve the section Entitled “History”, Preserve its Title, and add to it an item stating
at least the title, year, new authors, and publisher of the Modified Version as given
on the Title Page. If there is no section Entitled “History” in the Document, create
one stating the title, year, authors, and publisher of the Document as given on its
Title Page, then add an item describing the Modified Version as stated in the previous
sentence.
J. Preserve the network location, if any, given in the Document for public access to a
Transparent copy of the Document, and likewise the network locations given in the
Document for previous versions it was based on. These may be placed in the “History”
section. You may omit a network location for a work that was published at least four
years before the Document itself, or if the original publisher of the version it refers to
gives permission.
L. Preserve all the Invariant Sections of the Document, unaltered in their text and in
their titles. Section numbers or the equivalent are not considered part of the section
titles.
M. Delete any section Entitled “Endorsements”. Such a section may not be included in
the Modified Version.
If the Modified Version includes new front-matter sections or appendices that qualify as
Secondary Sections and contain no material copied from the Document, you may at your
option designate some or all of these sections as invariant. To do this, add their titles to
the list of Invariant Sections in the Modified Version’s license notice. These titles must be
distinct from any other section titles.
You may add a section Entitled “Endorsements”, provided it contains nothing but en-
dorsements of your Modified Version by various parties — for example, statements of peer
review or that the text has been approved by an organization as the authoritative definition
of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up
to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified
Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added
by (or through arrangements made by) any one entity. If the Document already includes a
cover text for the same cover, previously added by you or by arrangement made by the same
entity you are acting on behalf of, you may not add another; but you may replace the old
one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to
use their names for publicity for or to assert or imply endorsement of any Modified Version.
5. COMBINING DOCUMENTS. You may combine the Document with other docu-
ments released under this License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the Invariant Sections of all of
the original documents, unmodified, and list them all as Invariant Sections of your combined
work in its license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and multiple identical
Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections
with the same name but different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original author or publisher of that
section if known, or else a unique number. Make the same adjustment to the section titles
in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections Entitled “History” in the various
original documents, forming one section Entitled “History”; likewise combine any sections
Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all
sections Entitled “Endorsements”.
disclaimers. In case of a disagreement between the translation and the original version of
this License or a notice or disclaimer, the original version will prevail.
If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “His-
tory”, the requirement (section 4) to Preserve its Title (section 1) will typically require
changing the actual title.
9. TERMINATION. You may not copy, modify, sublicense, or distribute the Document
except as expressly provided under this License. Any attempt otherwise to copy, modify,
sublicense, or distribute it is void, and will automatically terminate your rights under this
License.
However, if you cease all violation of this License, then your license from a particular
copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly
and finally terminates your license, and (b) permanently, if the copyright holder fails to notify
you of the violation by some reasonable means prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is reinstated permanently if
the copyright holder notifies you of the violation by some reasonable means, this is the first
time you have received notice of violation of this License (for any work) from that copyright
holder, and you cure the violation prior to 30 days after your receipt of the notice.
Termination of your rights under this section does not terminate the licenses of parties
who have received copies or rights from you under this License. If your rights have been
terminated and not permanently reinstated, receipt of a copy of some or all of the same
material does not give you any rights to use it.
11. RELICENSING. “Massive Multiauthor Collaboration Site” (or “MMC Site”) means
any World Wide Web server that publishes copyrightable works and also provides prominent
facilities for anybody to edit those works. A public wiki that anybody can edit is an example
of such a server. A “Massive Multiauthor Collaboration” (or “MMC”) contained in the site
means any set of copyrightable works thus published on the MMC site.
“CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0 license published
by Creative Commons Corporation, a not-for-profit corporation with a principal place of
business in San Francisco, California, as well as future copyleft versions of that license
published by that same organization.
“Incorporate” means to publish or republish a Document, in whole or in part, as part of
another Document.
An MMC is “eligible for relicensing” if it is licensed under this License, and if all works
that were first published under this License somewhere other than this MMC, and subse-
quently incorporated in whole or in part into the MMC, (1) had no cover texts or invariant
sections, and (2) were thus incorporated prior to November 1, 2008.
The operator of an MMC Site may republish an MMC contained in the site under CC-
BY-SA on the same site at any time before August 1, 2009, provided the MMC is eligible
for relicensing.
ADDENDUM: How to use this License for your documents. To use this License
in a document you have written, include a copy of the License in the document and put the
following copyright and license notices just after the title page:
If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with. . .
Texts.” line with this:
with the Invariant Sections being LIST THEIR TITLES, with the
Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.
If you have Invariant Sections without Cover Texts, or some other combination of the three,
merge those two alternatives to suit the situation.
If your document contains nontrivial examples of program code, we recommend releasing
these examples in parallel under your choice of free software license, such as the GNU General
Public License, to permit their use in free software.
References
[1] Ed Anderson, Zhaojun Bai, James Demmel, Jack J. Dongarra, Jeremy DuCroz, Ann
Greenbaum, Sven Hammarling, Alan E. McKenney, Susan Ostrouchov, and Danny
Sorensen, LAPACK Users’ Guide, SIAM, Philadelphia, 1992.
[2] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June M. Donato, Jack
Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst,
Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods,
SIAM Press, 1993. [ PDF ]
[3] Paolo Bientinesi, Inderjit S. Dhillon, Robert A. van de Geijn, A Parallel Eigensolver
for Dense Symmetric Matrices Based on Multiple Relatively Robust Representations,
SIAM Journal on Scientific Computing, 2005
[4] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique S. Quintana-Orti,
Robert A. van de Geijn, The science of deriving dense linear algebra algorithms, ACM
Transactions on Mathematical Software (TOMS), 2005.
[5] Paolo Bientinesi, Enrique S. Quintana-Orti, Robert A. van de Geijn, Representing
linear algebra algorithms in code: the FLAME application program interfaces, ACM
Transactions on Mathematical Software (TOMS), 2005
[6] Paolo Bientinesi, Robert A. van de Geijn, Goal-Oriented and Modular Stability Analy-
sis, SIAM Journal on Matrix Analysis and Applications , Volume 32 Issue 1, February
2011.
[7] Paolo Bientinesi, Robert A. van de Geijn, The Science of Deriving Stability Analyses,
FLAME Working Note #33. Aachen Institute for Computational Engineering Sciences,
RWTH Aachen. TR AICES-2008-2. November 2008.
[8] Christian Bischof and Charles Van Loan, The WY Representation for Products of
Householder Matrices, SIAM Journal on Scientific and Statistical Computing, Vol. 8,
No. 1, 1987.
[9] Basic Linear Algebra Subprograms - A Quick Reference Guide, University of Tennessee,
Oak Ridge National Laboratory, Numerical Algorithms Group Ltd.
[10] Barry A. Cipra, The Best of the 20th Century: Editors Name Top 10 Algorithms,
[41] Field G. Van Zee, libflame: The Complete Reference, http://www.lulu.com, 2009. [ free
PDF ]
[42] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring the
Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions on
Mathematical Software (TOMS), Vol. 40, No. 3, 2014. Available free from http://www.
cs.utexas.edu/~flame/web/FLAMEPublications.htm Journal Publication #33. Click on
the title of the paper.
[43] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, Restructuring the
Tridiagonal and Bidiagonal QR Algorithms for Performance, ACM Transactions on
Mathematical Software (TOMS), 2014. Available free from http://www.cs.utexas.edu/
~flame/web/FLAMEPublications.htm Journal Publication #33. Click on the title of the
paper.
[44] Field G. Van Zee, Robert A. van de Geijn, Gregorio Quintana-Ortí, G. Joseph Elizondo,
Families of Algorithms for Reducing a Matrix to Condensed Form, ACM Transactions
on Mathematical Software (TOMS), Vol. 39, No. 1, 2012. Available free from http://www.
cs.utexas.edu/~flame/web/FLAMEPublications.htm Journal Publication #26. Click on
the title of the paper.
[45] H. F. Walker, Implementation of the GMRES method using Householder transforma-
tions, SIAM Journal on Scientific and Statistical Computing, Vol. 9, No. 1, 1988.
[46] Stephen J. Wright, A Collection of Problems for Which Gaussian Elimination with
Partial Pivoting is Unstable, SIAM Journal on Scientific Computing, Vol. 14, No. 1,
1993.
[47] BLAS-like Library Instantiation Software Framework, GitHub repository.
[48] BLIS typed interface, https://github.com/flame/blis/blob/master/docs/BLISTypedAPI.
md.
[49] Kazushige Goto and Robert van de Geijn, Anatomy of High-Performance Matrix Mul-
tiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3: Article 12,
May 2008.
[50] Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn, A Tight
I/O Lower Bound for Matrix Multiplication, arxiv.org:1702.02017v2, 2019. (To appear
in ACM Transactions on Mathematical Software.)
[51] Field G. Van Zee and Tyler M. Smith, Implementing High-performance Complex Ma-
trix Multiplication via the 3M and 4M Methods, ACM Transactions on Mathematical
Software, Vol. 44, No. 1, pp. 7:1-7:36, July 2017.
[52] Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly In-
stantiating BLAS Functionality, ACM Journal on Mathematical Software, Vol. 41,
No. 3, June 2015. You can access this article for free by visiting the Science of High-
Performance Computing group webpage and clicking on the title of Journal Article
39.
Index