Lecture Notes
Math 4377/6308 - Advanced Linear Algebra I
Vaughn Climenhaga
December 3, 2013
The primary text for this course is “Linear Algebra and its Applications”,
second edition, by Peter D. Lax (hereinafter referred to as [Lax]). The
lectures will follow the presentation in this book, and many of the homework
exercises will be taken from it.
You may occasionally find it helpful to have access to other resources
that give a more expanded and detailed presentation of various topics than
is available in Lax’s book or in the lecture notes. To this end I suggest the
following list of external references, which are freely available online.
The books listed above can all be obtained freely via the links provided.
(These links are also on the website for this course.) Another potentially
useful resource is the series of video lectures by Gilbert Strang from MIT’s
Open CourseWare project: http://ocw.mit.edu/courses/mathematics/
18-06-linear-algebra-spring-2010/video-lectures/
If the title seems strange, it may help to be aware that there is a relatively famous
textbook by Sheldon Axler called “Linear Algebra Done Right”, which takes a different
approach to linear algebra than do many other books, including the ones here.
Lecture 1 Monday, Aug. 26
Further reading: [Lax] Ch. 1 (p. 1–4). See also [Bee] p. 317–333; [CDW]
Ch. 5 (p. 79–87); [Hef ] Ch. 2 (p. 76–87); [LNS] Ch. 4 (p. 36–40); [Tre]
Ch. 1 (p. 1–5)
The last example has nested quantifiers: the quantifier “∃” occurs inside
the statement to which “∀” applies. You may find it helpful to interpret such
nested statements as a game between two players. In this example, Player
A has the goal of making the statement x + y = 4 (the innermost statement)
be true, and the game proceeds as follows: first Player B chooses a number
x ∈ R, and then Player A chooses y ∈ R. If Player A’s choice makes it so
that x + y = 4, then Player A wins. The statement in the example is true
because Player A can always win.
To parse such statements it may also help to use parentheses: the state-
ment in Example 1.2 would become “∀x ∈ R (∃y ∈ R such that (x + y = 4))”.
Playing the game described above corresponds to parsing the statement from
the outside in. This is also helpful when finding the negation of the state-
ment (informally, its opposite).
Example 1.3. The negations of the three statements in Example 1.1 are
1. ∀x ∈ R we have x + 2 ≠ 7.
2. ∃x ∈ R such that x + 2 ≠ 7.
Notice the pattern here: working from the outside in, each ∀ is replaced
with ∃, each ∃ is replaced with ∀, and the innermost statement is negated (so
= becomes ≠, for example). You should think through this to understand
why this is the rule.
of a vector space, and is sufficient for many applications, but there are also
many other applications where it is important to take the lessons from that
first course and re-learn them in a more abstract setting.
What do we mean by “a more abstract setting”? The idea is that we
should look at vectors in Rn and the things we did with them, and see
exactly what properties we needed in order to use the various definitions,
theorems, techniques, and algorithms we learned in that setting.
So for the moment, think of a vector as an element of Rn . What can we
do with these vectors? A moment’s thought recalls several things:
1. we can add vectors together;
The properties in the list above are the axioms of a vector space. They
hold for Rn with the usual definition of addition and scalar multiplication.
Indeed, this is in some sense the motivation for this list of axioms: they for-
malise the properties that we know and love for the example of row/column
vectors in Rn . We will see that these properties are in fact enough to let us
do a great deal of work, and that there are plenty of other things besides
Rn that satisfy them.
Remark 1.5. Some textbooks use different font styles or some other typo-
graphic device to indicate that a particular symbol refers to a vector, instead
of a scalar. For example, one may write x or ~x instead of x to indicate an
element of a vector space. By and large we will not do this; rather, plain
lowercase letters will be used to denote both scalars and vectors (although
the same symbol 0 will be used for both the zero vector and the zero scalar). It will always
be clear from context which type of object a letter represents: for example,
in Definition 1.4 it is always specified whether a letter represents a vector
(as in x ∈ X) or a scalar (as in a ∈ R). You should be very careful when
reading and writing mathematical expressions in this course that you are
always aware of whether a particular symbol stands for a scalar, a vector,
or something else.
Before moving on to some examples, we point out that one may also
consider vector spaces over C, the set of complex numbers; here the scalars
may be any complex numbers. In fact, one may consider any field K and
do linear algebra with vector spaces over K. This has many interesting
applications, particularly if K is taken to be a finite field, but these examples
lie beyond the scope of this course, and while we will often say “Let X be
a vector space over the field K”, it will always be the case in our examples
that K is either R or C. Thus we will not trouble ourselves here with the
general abstract notion of a field.
2. 0x = 0 for all x ∈ X.
1.4 Examples
The most familiar examples are the following.
Example 1.9. Let X be the set of all functions x(t) satisfying the differen-
tial equation ẍ + x = 0. If x and y are solutions, then so is x + y; similarly,
if x is a solution then cx is a solution for every c ∈ R. Thus X is a vector
space. If p is the initial position and v is the initial velocity, then the pair
(p, v) completely determines the solution x(t). The correspondence between
the pair (p, v) ∈ R2 and the solution x(t) is an isomorphism between R2 and
X.
Example 1.11. Let F (R, R) be the set of all functions from R → R, with
addition and scalar multiplication defined in the natural way (pointwise) by
(f + g)(x) = f (x) + g(x) and (cf )(x) = c(f (x)). Then F (R, R) is a vector
space. It contains several other interesting vector spaces.
1. Let C(R) be the subset of F (R, R) that contains all continuous func-
tions.
2. Let L1 (R) be the subset of F (R, R) that contains all integrable functions: that is, L1 (R) = {f : R → R | ∫_{−∞}^{∞} |f (x)| dx < ∞}.
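As a computational aside (a minimal Python sketch, not taken from [Lax]; the function names are just for illustration), the pointwise operations that make F (R, R) a vector space can be written out directly.

import math

# Pointwise operations on functions R -> R, as in Example 1.11.
def add(f, g):
    return lambda x: f(x) + g(x)      # (f + g)(x) = f(x) + g(x)

def scale(c, f):
    return lambda x: c * f(x)         # (c f)(x) = c f(x)

h = add(scale(2.0, math.sin), math.cos)   # the function x -> 2 sin x + cos x
print(h(0.0))                             # 1.0, since 2 sin 0 + cos 0 = 1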
Further reading: [Lax] Ch. 1 (p. 4–5); see also [Bee] p. 334–372; [CDW]
Ch. 9–10 (p. 159–173); [Hef ] Ch. 2 (p. 87–108); [LNS] Ch. 4–5 (p. 40–54);
[Tre] Ch. 1 (p. 6–9, 30–31)
x = 1 · x = (1 + 0) · x = 1 · x + 0 · x = x + 0 · x. (2.1)
The first and last equalities use the final axiom (multiplication by the
unit), the second equality uses properties of real numbers, and the
third equality uses the axiom of distributivity. Now by the axiom on
existence of additive inverses, we can add (−x) to both sides and get
0 = x + (−x) = (x + 0 · x) + (−x) = (0 · x + x) + (−x) = 0 · x + (x + (−x)) = 0 · x + 0 = 0 · x,
where the first equality is the property of additive inverses, the second
is from (2.1), the third is from commutativity of addition, the fourth
is from associativity of addition, the fifth is the property of additive
inverses again, and the last equality is the property of the zero vector.
−x = 0 + (−x) = (x + y) + (−x)
= (y + x) + (−x) = y + (x + (−x)) = y + 0 = y,
where the first equality uses the axiom on the zero vector, the second
comes from the equality x + y = 0, the third uses commutativity of
addition, the fourth uses associativity of addition, the fifth uses the
property of additive inverses, and the last once again uses the property
of the zero vector. Armed with this fact on uniqueness, we can now
observe that
2.2 Subspaces
Let’s move on to something a little less bland and more concrete. Recalling
our examples from the previous lecture, we see that it is often the case that
one vector space is contained inside another one. For example, Pn ⊂ Pn+1 .
Or recall Example 1.11:
• F (R, R) = {functions R → R}
• C(R) = {f ∈ V | f is continuous}
• C 1 (R) = {f ∈ V | f is differentiable}
Definition 2.1. When X and V are vector spaces with X ⊂ V , we say that
X is a subspace of V .
In each of these cases one can check that the operations of addition and
multiplication from the ambient vector space (R2 , R3 , or F (R, R)) define a
vector space structure on the given subset, and so it is indeed a subspace.
We omit the details of checking all the axioms, since we are about to learn
a general fact that implies them.
Here is a convenient fact. In general, if we have a set X with two binary
operations (addition and multiplication), and want to check that this is a
vector space, we must verify the list of axioms from the previous lecture.
When X is contained in a vector space V , life is easier: to check that a
non-empty set X ⊂ V is a subspace, we only need to check the following
two conditions:
1. x + y ∈ X for every x, y ∈ X (closure under addition);
2. cx ∈ X for every x ∈ X and c ∈ K (closure under scalar multiplication).
If these two conditions are satisfied, then the fact that the axioms from
the previous lecture hold for X can be quickly deduced from the fact that
they hold for V . For example, since addition is commutative for all pairs
of elements in V , it is certainly commutative for all pairs of elements in
the subset X. Similarly for associativity of addition and multiplication,
distributivity, and multiplication by the unit. The only axioms remaining
are existence of the identity element 0 and additive inverses. To get these,
we recall from the previous lecture that 0 = 0x and −x = (−1)x for any x; since X is closed under scalar multiplication, this gives 0 ∈ X and −x ∈ X for every x ∈ X.
Now the fact that the sets in Example 2.2 are subspaces of R2 , R3 , and
V , respectively, can be easily checked by observing that each of these sets
is closed under addition and scalar multiplication.
1. If (x, y) and (x′, y′) are in X1 and c ∈ R, then (cx + x′, cy + y′) has
(cx + x′) + (cy + y′) = c(x + y) + (x′ + y′) = c · 0 + 0 = 0.
2. If (x, y, z), (x′, y′, z′) ∈ X2 and c ∈ R, then (cx + x′, cy + y′, cz + z′)
has third component equal to cz + z′ = c · 0 + 0 = 0, so it is in X2 .
3. If x, y : R → R are in X3 and c ∈ R, then (d2/dt2)(cx + y) = cẍ + ÿ = −(cx + y), so cx + y ∈ X3 .
4. If f, g ∈ X4 and c ∈ R, then we check that cf + g is 2π-periodic:
(cf + g)(t + 2π) = cf (t + 2π) + g(t + 2π) = cf (t) + g(t) = (cf + g)(t).
The first two examples should be familiar to you from a previous linear
algebra course, since both X1 and X2 are the solution set of a system of
linear equations (in fact, a single linear equation) in Rn . What may not be
immediately apparent is that X3 and X4 are in fact examples of exactly this
same type: we define a condition which is linear in a certain sense, and then
consider all elements of the vector space that satisfy this condition. This
will be made precise later when we consider null spaces (or kernels) of linear
transformations.
Remark 2.4. All of the subspaces considered above are described implicitly;
that is, a condition is given, and then the subspace X is the set of all
elements of V that satisfy this condition. Thus if I give you an element of
V , it is easy for you to check whether or not this element is contained in the
subspace X. On the other hand, it is not necessarily so easy for you to give
me a concrete example of an element of X, or a method for producing all
the elements of X. Such a method would be an explicit description of X,
and the process of solving linear equations may be thought of as the process
of going from an implicit to an explicit description of X.
What does the word ‘reasonable’ mean in the above remark? Well,
consider the following example. The vector space R2 is spanned by the set
{(1, 0), (0, 1), (1, 1)}. But this set is somehow redundant, since the smaller
set {(1, 0), (0, 1)} also spans R2 . So in finding spanning sets, it makes sense
to look for the smallest one, which is somehow the most efficient. You should
recall from your previous linear algebra experience that in the case of Rn ,
the following condition is crucial.
Definition 2.14. A set {x1 , . . . , xk } is linearly dependent if some non-
trivial linear combination yields the zero vector: that is, if there are scalars
c1 , . . . ck ∈ K such that not all of the cj are 0, and c1 x1 + · · · + ck xk = 0. A
set is linearly independent if it is not linearly dependent.
Recall how the notions of span and linear dependence manifest them-
selves in Rn . Given a set S = {x1 , . . . , xk }, the task of checking whether
or not x ∈ span S reduces to solving the non-homogeneous system of lin-
ear equations c1 x1 + · · · + ck xk = x. If this system has a solution, then
x ∈ span S. If it has no solution, then x ∉ span S. (Note that span S is an
explicitly described subspace, and now the difficult task is to check whether
or not a specific vector is included in the subspace, which was the easy task
when the subspace was implicitly defined.)
Similarly, you can check for linear dependence in Rn by writing the
condition c1 x1 + · · · + ck xk = 0 as a system of linear equations, and using
row reduction to see if it has a non-trivial solution. A non-trivial solution
corresponds to a non-trivial representation of 0 as a linear combination of
vectors in S, and implies that S is linearly dependent. If there is only the
trivial solution, then S is linearly independent.
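As a numerical illustration of both checks (a sketch assuming numpy; not part of the original notes): membership in span S is solvability of a linear system, and linear independence is full column rank of the matrix whose columns are the vectors of S.

import numpy as np

# Columns of M are the vectors x1, x2 of S, here in R^3.
M = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 1.0]])
x = np.array([1.0, 3.0, 1.0])                 # is x in span S?

c, _, rank, _ = np.linalg.lstsq(M, x, rcond=None)
print("x in span S:", np.allclose(M @ c, x))  # True, since x = x1 + x2
print("S independent:", rank == M.shape[1])   # True: full column rank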
Example 2.15. Consider the polynomials f1 (x) = x+1, f2 (x) = x2 −2, and
f3 (x) = x + 3 in P2 . These span P2 and are linearly independent. Indeed,
given any polynomial g ∈ P2 given by g(x) = a2 x2 + a1 x + a0 , we can try to
write g = c1 f1 + c2 f2 + c3 f3 by solving
a2 x2 + a1 x + a0 = c1 (x + 1) + c2 (x2 − 2) + c3 (x + 3).
Comparing coefficients we see that this is equivalent to the system
a2 = c2 ,
a1 = c1 + c3 , (2.3)
a0 = c1 − 2c2 + 3c3 ,
which can be easily checked to have a solution (c1 , c2 , c3 ) for every choice
of a0 , a1 , a2 . Similarly, the homogeneous version of this system has only
the trivial solution, which shows that the polynomials f1 , f2 , f3 are linearly
independent.
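Here is a short numerical check of this example (a sketch assuming numpy; the sample polynomial is an arbitrary choice): the coefficient matrix of (2.3) is invertible, so every g ∈ P2 has unique coefficients c1 , c2 , c3 .

import numpy as np

# Columns hold the coefficients of f1 = x+1, f2 = x^2-2, f3 = x+3
# in the basis 1, x, x^2 of P2.
F = np.array([[1.0, -2.0, 3.0],
              [1.0,  0.0, 1.0],
              [0.0,  1.0, 0.0]])
print(np.linalg.det(F))          # 2.0, nonzero: (2.3) always has a unique solution

g = np.array([-5.0, 1.0, 4.0])   # g(x) = 4x^2 + x - 5, coefficients of 1, x, x^2
print(np.linalg.solve(F, g))     # [0. 4. 1.]: g = 0*f1 + 4*f2 + 1*f3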
Lecture 3 Wed. Sep. 4
Bases
Further reading: [Lax] Ch. 1 (p. 5–7); see also [Bee] p. 373–398; [CDW]
Ch. 11 (p. 175–182); [Hef ] Ch. 2 (p. 109–137); [LNS] Ch. 5 (p. 54–59);
[Tre] Ch. 1,2 (p. 6–7, 54–56)
Exercise 3.2. Show that an infinite set S is linearly dependent if and only
if it has a finite subset S 0 that is linearly dependent. Show that an infinite
set S is linearly independent if and only if every finite subset of S is linearly
independent.
Exercise 3.3. Let V be a vector space, let S ⊂ V be spanning, and let L ⊂ V
be linearly independent.
The next result is quite important: it says that linearly independent sets
cannot be bigger than spanning sets.
The base case of the induction is j = 0, which is trivial. For the induction
step, relabel the elements of S and L so that S′ = {y1 , . . . , yj , xj+1 , . . . , xn }.
Because S′ spans V , we can write
1. S is linearly dependent.
c1 x1 + c2 x2 + · · · + cn xn = 0.
x1 = −(c2 /c1 ) x2 − · · · − (cn /c1 ) xn ,
c1 x1 + · · · + cn xn − v = 0,
1. S is linearly independent.
x = a1 v1 + · · · + an vn = b1 w1 + · · · + bm wm
Bases are far from unique – there are many choices of basis possible for
a finite-dimensional space, as illustrated in Theorem 4.7 below. Sometimes,
as in Rn , there is a natural one, which is usually referred to as the standard
basis for that vector space: in Rn , we write ei for the vector whose ith
component is 1, with all other entries 0.
Theorem 3.9. If V has a finite basis, then all bases for V have the same
number of vectors.
Further reading: [Lax] Ch. 1 (p. 5–7); see also [Bee] p. 373–398; [CDW]
Ch. 11 (p. 175–182); [Hef ] Ch. 2 (p. 109–137); [LNS] Ch. 5 (p. 54–59);
[Tre] Ch. 1,2 (p. 6–7, 54–56)
4.1 Dimension
In the last lecture, we hinted that it is useful to have not just spanning sets,
but spanning sets that contain a minimum amount of redundancy. The
definition of basis makes this precise. The spanning property guarantees
that every v ∈ V can be written as a linear combination v = c1 x1 +· · ·+cn xn
of the basis elements, while the linear independence property together with
Proposition 3.6 guarantees that the coefficients c1 , . . . , cn are determined
uniquely by v and x1 , . . . , xn . We will come back to this important point
later in the lecture.
Exercise 4.1. Show that if {x, y} is a basis for X, then so is {x + y, x − y}.
Lemma 4.5. If V has a finite spanning set then it has a finite basis.
V = Y1 ⊕ · · · ⊕ Ym . (4.1)
Proof. We can build a finite basis for W as follows: take any nonzero vector
w1 ∈ W , so {w1 } is linearly independent. If {w1 } spans W then we have
found our basis; otherwise there exists w2 ∈ W \ span{w1 }. By Exercise
3.7, the set {w1 , w2 } is once again linearly independent. We continue in
this manner, obtaining linearly independent sets {w1 , . . . , wk } until we find
a basis.
As it stands, the above argument is not yet a proof, because it is possible
that the procedure continues indefinitely: a priori, it could be the case that
we get linearly independent sets {w1 , . . . , wk } for every k = 1, 2, 3, . . . , with-
out ever obtaining a spanning set. However, because the ambient subspace
V is finite-dimensional, we see that the procedure must terminate: by Propo-
sition 3.4, {w1 , . . . wk } cannot be linearly independent when k > dim V , and
so there exists k such that the procedure terminates and gives us a basis for
W.
This shows that W is finite-dimensional. To show that it has a com-
plement, we observe that by Theorem 4.7, the set {w1 , . . . , wk }, which is a
basis for W and hence linearly independent, can be completed to a basis
β = {w1 , . . . , wk , x1 , . . . , xm } for V . Let X = span{x1 , . . . , xm }. We claim
that every element v ∈ V can be written in a unique way as v = w + x,
where w ∈ W and x ∈ X. To show this, observe that because β is a basis
for V , there are unique scalars a1 , . . . , ak , b1 , . . . , bm ∈ K such that
v = (a1 w1 + · · · + ak wk ) + (b1 x1 + · · · + bm xm ).
Let w = a1 w1 + · · · + ak wk and x = b1 x1 + · · · + bm xm . Then v = w + x. Moreover, by Exercise
4.8 it suffices to check that W ∩ X = {0}. To see this, suppose that there
are coefficients aj , bi such that a1 w1 + · · · + ak wk = b1 x1 + · · · + bm xm . Bringing everything to
one side gives a1 w1 + · · · + ak wk + (−b1 )x1 + · · · + (−bm )xm = 0, and by linear independence of β
we have aj = 0 and bi = 0 for each i, j.
Further reading: [Lax] Ch. 1–2 (p. 7–15); see also [Tre] Ch. 8 (p. 207–214)
Furthermore, show that if v1 ≡ v2 mod Y , then cv1 ≡ cv2 mod Y for all
c ∈ K.
Given v ∈ V , let [v]Y = {w ∈ V | w ≡ v mod Y } be the congruence class
of v modulo Y . This is also sometimes called the coset of Y corresponding
to v.
Exercise 5.3. Given v1 , v2 ∈ V , show that [v1 ]Y = [v2 ]Y if v1 ≡ v2 mod Y ,
and [v1 ]Y ∩ [v2 ]Y = ∅ otherwise.
In Example 5.1, we see that [v]Y is the line through v with slope −1.
Thus every congruence class modulo Y is a line parallel to Y . In fact we see
that [v]Y = v + Y = {v + x | x ∈ Y }. This fact holds quite generally.
We see from this result that every congruence class modulo a subspace
is just a copy of that subspace shifted by some vector. Such a set is called
an affine subspace.
Recall once again our definition of setwise addition, and notice what
happens if we add two congruence classes of Y : given any v, w ∈ V , we have
[v]Y + [w]Y = (v + Y ) + (w + Y ) = (v + w) + Y = [v + w]Y ,
and similarly c[v]Y = c(v + Y ) = cv + Y = [cv]Y for every c ∈ K.
(If these relations are not clear, spend a minute thinking about what they
look like in the example above, where Y is the line through the origin with
slope −1.)
We have just defined a way to add two congruence classes together, and
a way to multiply a congruence class by a scalar. One can show that these
operations satisfy the axioms of a vector space (commutativity, associativity,
etc.), and so the set of congruence classes forms a vector space over K when
equipped with these operations. We denote this vector space by V /Y , and
refer to it as the quotient space of V mod Y .
Exercise 5.5. What plays the role of the zero vector in V /Y ?
In Example 5.1, V /Y is the space of lines in R2 with slope −1. Notice
that any such line is uniquely specified by its y-intercept, and so this is a
1-dimensional vector space, since it is spanned by the single element [(0, 1)]Y .
This shows the spanning property. Now we check for linear independence
by supposing that ck+1 , . . . , cn ∈ K are such that
ck+1 [vk+1 ]Y + · · · + cn [vn ]Y = [0]Y .
This is true if and only if ck+1 vk+1 + · · · + cn vn ∈ Y , in which case there are di such that
ck+1 vk+1 + · · · + cn vn = d1 y1 + · · · + dk yk . But now linear independence of the basis constructed
above shows that cj = 0 for all j. Thus {[vj ]Y | k + 1 ≤ j ≤ n} is a basis for
V /Y .
From this we conclude that dim(V /Y ) = n − k and dim Y = k, so that
their sum is (n − k) + k = n = dim V .
(c`)(x) = c(`(x)).
`s : V → R,    f ↦ f (s),
Integration over other domains gives other linear functionals, and there
are still more possibilities: C(R)′ is a very big vector space.
Proof. Let {v1 , . . . , vn } be a basis for V . Then every v ∈ V has a unique
representation as v = a1 v1 + · · · + an vn , where ai ∈ K. For each 1 ≤ i ≤ n, define
`i ∈ V ′ by
`i (v) = `i (a1 v1 + · · · + an vn ) = ai .
(Alternately, we can define `i by `i (vi ) = 1 and `i (vj ) = 0 for j ≠ i, then
extend by linearity to all of V .) We leave as an exercise the fact that
{`1 , . . . , `n } is a basis for V ′ .
Lecture 6 Mon. Sep. 16
Further reading: [Lax] p. 19–20. See also [Bee] p. 513–581, [CDW] Ch.
6 (p. 89–96), [Hef ] Ch. 3 (p. 173–179), [LNS] Ch. 6 (p. 62–81), [Tre] Ch.
1 (p. 12–18)
Exercise 6.2. Show that every linear map T has the property that T (0V ) =
0W .
Exercise 6.3. Show that T is linear if and only if T (cx + y) = cT (x) + T (y)
for every x, y ∈ V and c ∈ K.
Exercise 6.4. Show that T is linear if and only if for every x1 , . . . , xn ∈ V
and a1 , . . . , an ∈ K, we have
T (a1 x1 + · · · + an xn ) = a1 T (x1 ) + · · · + an T (xn ).    (6.1)
Example 6.5. The linear functionals discussed in the previous lecture are
all examples of linear maps – for these examples we have W = K.
The above examples all look like the sort of thing you will have seen
in your first linear algebra course: defining a function or transformation in
terms of its coordinates, in which case checking for linearity amounts to
confirming that the formula has a specific form, which in turn guarantees
that it can be written in terms of matrix multiplication. The world is much
broader than these examples, however: there are many examples of linear
transformations which are most naturally defined using something other
than a coordinate representation.
Example 6.9. Let V = C 1 (R) be the vector space of all continuously dif-
ferentiable functions, and W = C(R) the vector space of all continuous
functions. Then T : V → W defined by (T f )(x) = (d/dx) f (x) is a linear trans-
formation: recall from calculus that (cf + g)′ = cf ′ + g ′ .
The previous example illustrates the idea at the heart of the operation
of convolution, a linear transformation that is used in many applications,
including image processing, acoustics, data analysis, electrical engineering,
and probability theory. A somewhat more complete description of this op-
eration (which we will not pursue further in this course) is as follows: let
V = W = L2 (R) = {f : R → R | ∫_{−∞}^{∞} f (x)2 dx < ∞}, and fix g ∈ C(R).
Then define T : V → W by (T f )(x) = ∫_a^b f (y)g(x − y) dy.
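A discrete analogue is easy to experiment with (a sketch assuming numpy; this is not the operator defined above, just a finite-dimensional cousin): convolution of finite sequences with a fixed kernel g is a linear map.

import numpy as np

g = np.array([1.0, 2.0, 1.0])          # a fixed kernel

def T(f):
    return np.convolve(f, g)           # convolution with g; linear in f

f1 = np.array([1.0, 0.0, 0.0, 2.0])
f2 = np.array([0.0, 3.0, 1.0, 0.0])
c = 5.0
print(np.allclose(T(c * f1 + f2), c * T(f1) + T(f2)))   # True: T is linear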
which can be checked to define a linear map. Recall that the formulas
for sin(α + θ) and cos(α + θ) can be recovered from the identities
e^{iθ} = cos θ + i sin θ and e^{i(α+θ)} = e^{iα} e^{iθ} .
Proof. Exercise.
By Theorem 6.16, both the range and nullspace of T are subspaces (of W
and V , respectively – be very careful to remember where these things live).
You have encountered both of these subspaces before, in your introduction
to linear algebra: if T : Rn → Rm is a linear transformation defined by
coefficients Aij as in Example 6.12, then the equation T (v) = w describes
a linear system of m equations (one equation for each wi ) in n variables
(v1 , . . . , vn ). The nullspace of T is the set of vectors v for which T (v) = 0:
that is, the set of solutions of the homogeneous system of linear equations
given by Aij . The range RT is the set of w ∈ Rm for which the non-
homogeneous system T (v) = w has a solution. This subspace of Rm was
also characterised as the column space of the matrix A.
Lecture 7 Wed. Sep. 18
Isomorphisms
7.2 Isomorphisms
An important kind of linear transformation is an isomorphism: we say that
T : V → W is an isomorphism if it is linear, 1-1, and onto. If there is an
isomorphism between V and W , we say that V and W are isomorphic.
Thus we may characterise isomorphisms in terms of nullspace and range
as those linear maps for which NT = {0V } and RT = W . (Null space is
trivial and range is everything.)
Exercise 7.5. Show that if T : V → W is an isomorphism, then the inverse
map T −1 : W → V is also an isomorphism.
The notion of isomorphism gives a sense in which various examples of
vector spaces that we have seen so far are “the same”. For instance, let V
be the vector space of column vectors with 2 real components, and let W be
the vector space of row vectors with 2 real components. It seems clear that
V and W are in some sense “the same”, in that both can be described by
prescribing two numbers. More precisely, we can associate to each column
vector ( xy ) the corresponding row vector ( x y ). This association respects
vector addition and scalar multiplication:
1. the sum of the column vectors ( x, y ) and ( x′, y′ ) is the column vector ( x + x′, y + y′ ), which is associated to the row vector ( x+x′ y+y′ ) = ( x y ) + ( x′ y′ );
2. the scalar multiple c · ( x, y ) is the column vector ( cx, cy ), which is associated to the row vector ( cx cy ) = c ( x y ).
T (cv + v ′ ) = (ca1 + a′1 )w1 + · · · + (can + a′n )wn = c(a1 w1 + · · · + an wn ) + (a′1 w1 + · · · + a′n wn ) = cT (v) + T (v ′ ).
Theorem 7.8. Two finite-dimensional vector spaces over the same field K
are isomorphic if and only if they have the same dimension.
(ST )−1 = T −1 S −1
[Diagram: T : V → W ; for ` ∈ W ′ the functional m = T ′ ` ∈ V ′ is given by m = ` ◦ T , so that both m and ` take values in K.]
Example 8.5. Let V = Pn , W = Pn−1 , and let T ∈ L(V, W ) be differenti-
ation. Let ` ∈ W ′ be the linear functional that evaluates a polynomial g at
a specific input t0 , so `(g) = g(t0 ). Then T ′ ` ∈ V ′ is a linear functional on
Pn , given by (T ′ `)(f ) = `(T f ) = `(f ′ ) = f ′ (t0 ).
Example 8.6. Let V = W = R2 . Fix a, b, c, d ∈ R and consider the linear
transformation T : R2 → R2 given by
T ( xy ) = ( ac db ) ( xy ) .
Let `1 (x, y) = x and `2 (x, y) = y be two linear functionals on R2 . The
proof of Theorem 5.10 shows that {`1 , `2 } is a basis for V ′ , so V ′ is also
isomorphic to R2 . Thus we can represent any linear functional ` ∈ V ′ as
` = s`1 + t`2 , where s, t ∈ R are coordinates that depend on `. Indeed, s
and t are determined by
s(`) = `(1, 0), t(`) = `(0, 1), (8.3)
since then linearity implies that
`(x, y) = x`(1, 0)+y`(0, 1) = `1 (x, y)`(1, 0)+`2 (x, y)`(0, 1) = (s`1 +t`2 )(x, y).
so that the linear map T ′ is given in matrix form (relative to the basis `1 , `2 )
by the transpose of the matrix that defined T .
This notation is suggestive of the inner product (or scalar product, or dot
product) that is sometimes used on Rn , but is more general. (Although if Rn
is equipped with an inner product, then identifying the two notations gives
a natural isomorphism between Rn and its dual space.) Using this notation,
we see that the transpose satisfies (T ′ `, v) = (T ′ `)(v) = `(T v) = (`, T v), or
more succinctly,
(T ′ `, v) = (`, T v).    (8.4)
If you have worked with the dot product on Rn before, you may recognise
(8.4) as one of the identities satisfied by the matrix transpose.
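In coordinates this identity is easy to test (a numerical sketch assuming numpy): if T is given by a matrix A and ` by a vector s via `(w) = s · w, then (8.4) says that (Aᵗ s) · v = s · (Av).

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))     # the matrix of T
s = rng.standard_normal(3)          # the functional ell, via ell(w) = s . w
v = rng.standard_normal(3)

lhs = (A.T @ s) @ v                 # (T' ell, v)
rhs = s @ (A @ v)                   # (ell, T v)
print(np.isclose(lhs, rhs))         # True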
Lecture 9 Wed. Sep. 25
Further reading: [Lax] p. 20–22. See also [Bee] p. 544–603, [CDW] Ch.
6, 16 (p. 89–96, 237–246), [Hef ] Ch. 3 (p. 157–172, 180–190), [LNS] Ch. 6
(p. 62–81), [Tre] Ch. 1 (p. 12–18)
Remark 9.2. Note that it is the dimension of the domain V , and not the
dimension of the codomain W , that enters Theorem 9.1. One way of re-
membering this is the following: if X is some larger vector space containing
W , then we can consider T as a linear map V → X, and so we would need
to replace dim W by dim X in any formula like the one in Theorem 9.1.
However, dim NT and dim RT do not change when we replace W with X.
Corollary 9.3. If dim(V ) > dim(W ), then T is not 1-1.
Proof. dim(RT ) ≤ dim(W ), and so dim NT = dim V − dim RT ≥ dim V −
dim W ≥ 1.
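Theorem 9.1 is also easy to check numerically (a sketch assuming numpy, with an arbitrary example matrix): for a matrix representing T : R4 → R3 , the rank plus the dimension of the nullspace equals 4, the dimension of the domain.

import numpy as np

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [0.0, 1.0, 0.0, 1.0]])       # T : R^4 -> R^3

rank = np.linalg.matrix_rank(A)            # dim R_T
_, sv, Vh = np.linalg.svd(A)
null_basis = Vh[np.sum(sv > 1e-12):]       # rows spanning N_T
print(np.allclose(A @ null_basis.T, 0))    # True: these vectors lie in N_T
print(rank, len(null_basis), rank + len(null_basis) == A.shape[1])   # 2 2 True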
Lecture 10 Wed. Oct. 2
Matrices
In other words, T (v) is the vector whose coordinates are given by (10.1).
We see from Theorem 10.1 that the formula (10.1) is not arbitrary, but
follows naturally from linearity of T and our decision to use the entries of
the matrix A to record the partial coefficients of T . The only choice we
made was to record the coefficients of the different vectors T (ej ) as the
columns of the matrix, rather than the rows. This corresponds to a decision
to treat vectors in K n and K m as column vectors, and to have A act by
multiplication on the left. If we had made the other choice, we would work
with row vectors, and multiply from the right.
Given our choice to work with column vectors, it also makes sense to
think of a matrix A as being a “row of column vectors”. Another way of
thinking of (10.2) is as the observation that multiplying A by the standard
basis vector ej returns the jth column of A.
A similar argument to the one above shows that the usual formula for
matrix multiplication is actually a very natural one, as it is the proper way
to encode composition of linear transformations. To see this, consider the
vector spaces K n , K m , and K ` , and let S ∈ L(K n , K m ), T ∈ L(K m , K ` ),
so that we have the diagram
K ` ←−T−− K m ←−S−− K n ,
Proof. Let ej be the standard basis vectors for K n , let di be the standard
basis vectors for K ` , and let ck be the standard basis vectors for K m . The
key is to write T S(ej ) as a linear combination of the vectors di . This can
be done using the formulae for T and S in terms of A and B. First note
that since B is the matrix for S, (10.2) gives
S(ej ) = B1j c1 + · · · + Bmj cm .
Now applying T and using linearity followed by (10.2) for A and T , we get
T S(ej ) = ∑_{k=1}^{m} Bkj T (ck ) = ∑_{k=1}^{m} ∑_{i=1}^{`} Bkj Aik di .    (10.5)
A nice corollary of the above result is the fact that matrix multiplication
is associative – that is, (AB)C = A(BC) whenever the products are defined.
This is immediate from the fact that composition of linear maps is associative
(which is easy to see), and while it could be proved directly using (10.4),
that proof is messier and not as illuminating.
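Both facts can be confirmed numerically (a sketch assuming numpy, with random matrices standing in for arbitrary linear maps):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))    # T : K^3 -> K^2
B = rng.standard_normal((3, 4))    # S : K^4 -> K^3
C = rng.standard_normal((4, 5))
v = rng.standard_normal(4)

print(np.allclose(A @ (B @ v), (A @ B) @ v))    # matrix of T o S is the product A B
print(np.allclose((A @ B) @ C, A @ (B @ C)))    # associativity of the product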
Now we have seen that linear maps from K n to K m can always be
understood in terms of matrix multiplication. What about other vector
spaces? Looking back at Theorems 10.1 and 10.2, we see that the proofs do
not use any special properties of K n apart from the fact that we work with
the standard basis e1 , . . . , en (and similarly for K m , K ` ). In particular, if
we let V, W be any finite-dimensional vector spaces and let e1 , . . . , en be
any basis for V , and d1 , . . . , dm any basis for W , the proof of Theorem
10.1 shows that the matrix A ∈ Mm×n (K) with entries given by (10.2)
determines the linear transformation T via the following version of (10.1):
if v = v1 e1 + · · · + vn en , then
T (v) = ∑_{i=1}^{m} ( ∑_{j=1}^{n} Aij vj ) di .
There is a very important fact to keep in mind here – the matrix A depends
on our choice of basis for V and W . If we choose a different basis, then the
vectors appearing in (10.2) will change, and so the entries of A will change
as well.
T e1 = 0, T e2 = 1 = e1 , T e3 = 2x = 2e2 ,
T c1 = 1 = c1 − c2 , T c2 = 1 = c1 − c2 , T c3 = x = 2c2 − c1 ,
Even for linear maps in K n , where it seems most natural to use the
standard basis e1 , . . . , en , it is often more useful to choose a different basis.
For example, if T is the linear transformation represented by the matrix
A = ( 11 11 ) in the standard basis, we see that taking the column vectors d1 = (1, 1) and d2 = (1, −1)
gives
T d1 = (2, 2) = 2d1 ,    T d2 = 0,
so that relative to the basis d1 , d2 , the map T is given by the rather simpler
matrix ( 20 00 ).
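This computation can be verified directly (a numerical sketch assuming numpy): with Q the matrix whose columns are d1 and d2 , conjugating A by Q produces the diagonal matrix above.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
Q = np.array([[1.0,  1.0],
              [1.0, -1.0]])               # columns are d1 = (1, 1) and d2 = (1, -1)

print(np.linalg.inv(Q) @ A @ Q)           # approximately [[2, 0], [0, 0]]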
Two natural questions present themselves. How do we determine the
relationship between the matrices that appear for different choices of basis?
Given a linear transformation, can we choose a basis in which its matrix
takes on a “nice” form that is easier to work with? These questions will
motivate a great deal of what we do in the rest of this course.
Lecture 11 Mon. Oct. 7
Changing bases
Further reading:
The map Iβ is 1-1 if and only if β is linearly independent, and onto if and
only if β spans V . In particular, it is an isomorphism if and only if β is a
basis.
Let V, W be finite-dimensional vector spaces over K, and choose bases
β for V and γ for W . Let n = dim V and m = dim W , so that
Iβ : K n → V, Iγ : K m → W
(Iγ−1 T Iβ )ej = ∑_{i=1}^{m} Aij di ,
(11.2) becomes the commutative square with [T ]β : K n → K n along the bottom, T : V → V along the top, and Iβ : K n → V as the vertical map on both sides.
The result of Exercise 11.1 in this case says that [T v]β = [T ]β [v]β .
Let β, γ be two bases for V . It is important to understand how [T ]β and
[T ]γ are related, and to do this we must first understand how [v]β and [v]γ
are related for v ∈ V . These are related to v by v = Iβ [v]β = Iγ [v]γ .
We see that [v]γ = Iγ−1 Iβ [v]β . Let Iβγ = Iγ−1 Iβ : this is a linear transformation
from K n to K n , and so is represented by an n × n matrix (relative to the
standard basis). We refer to this matrix as the change-of-coordinates matrix
from β to γ because it has the property that Iβγ [v]β = [v]γ for every v ∈ V .
The following example illustrates how all this works. Let V = P2 , let
β = {f1 , f2 , f3 } be the basis given by f1 (x) = 2 + 3x, f2 (x) = 1 + x, and
f3 (x) = x + x2 , and let γ = {g1 , g2 , g3 } be the standard basis g1 (x) = 1,
g2 (x) = x, g3 (x) = x2 . Let e1 , e2 , e3 be the standard basis for R3 . In order
to write down the matrix Iβγ , we need to compute Iβγ ej for j = 1, 2, 3.
First note that Iβγ ej = Iγ−1 Iβ ej , and that Iβ ej = fj . (This is the def-
inition of Iβ .) Thus Iβγ ej = Iγ−1 (fj ) = [fj ]γ . That is, Iβγ ej is given by the
coordinate representation of fj in terms of the basis γ. In particular, since
f1 = 2g1 + 3g2 , f2 = g1 + g2 , and f3 = g2 + g3 , we see that
[f1 ]γ = (2, 3, 0) ,   [f2 ]γ = (1, 1, 0) ,   [f3 ]γ = (0, 1, 1)   (written as column vectors),
and so
2 1 0
Iβγ = 3 1 1 .
0 0 1
Thus for example, the polynomial p(x) = (2+3x)−2(1+x)−(x+x2 ) = −x2
has
[p]β = (1, −2, −1) ,   [p]γ = (0, 0, −1) ,
and indeed multiplying the matrix Iβγ above by the column vector (1, −2, −1) gives (0, 0, −1), so that Iβγ [p]β = [p]γ .
To go the other way and find Iγβ one needs to compute Iγβ ej = [gj ]β . That
is, one must represent elements of β in terms of γ, and the coefficients of
this representation will give the columns of the matrix. So we want to find
coefficients ai , bi , ci such that
g1 = a1 f1 + a2 f2 + a3 f3 ,
g2 = b1 f1 + b2 f2 + b3 f3 , (11.5)
g3 = c1 f1 + c2 f2 + c3 f3 ,
1 = a1 (2 + 3x) + a2 (1 + x) + a3 (x + x2 ),
x = b1 (2 + 3x) + b2 (1 + x) + b3 (x + x2 ),
x2 = c1 (2 + 3x) + c2 (1 + x) + c3 (x + x2 ).
We conclude that
−1 1 −1
Iγβ = 3 −2 2 .
0 0 1
We note that the product Iγβ Iβγ corresponds to changing from β-coordinates
to γ-coordinates, and then back to β-coordinates, so we expect that the
product is the identity matrix, and indeed a direct computation verifies
that
−1 1 −1 2 1 0 1 0 0
3 −2 2 3 1 1 = 0 1 0 .
0 0 1 0 0 1 0 0 1
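The computations in this example are easily repeated in numpy (an illustrative sketch, not part of the original notes):

import numpy as np

I_bg = np.array([[2.0, 1.0, 0.0],      # I_beta^gamma: columns are [f_j]_gamma
                 [3.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0]])
I_gb = np.array([[-1.0,  1.0, -1.0],   # I_gamma^beta, as computed above
                 [ 3.0, -2.0,  2.0],
                 [ 0.0,  0.0,  1.0]])

p_beta = np.array([1.0, -2.0, -1.0])   # [p]_beta for p(x) = -x^2
print(I_bg @ p_beta)                        # [ 0.  0. -1.] = [p]_gamma
print(np.allclose(I_gb @ I_bg, np.eye(3)))  # True: the matrices are mutually inverse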
The procedure described for the above example works very generally. Given
two bases β = {v1 , . . . , vn } and γ = {w1 , . . . , wn } for a vector space V , let
Aij ∈ K be the unique coefficients such that
vj = ∑_{i=1}^{n} Aij wi    for all j = 1, . . . , n.    (11.6)
Then Iβγ is the n × n matrix whose coefficients are given by Aij – that is, the
jth column of Iβγ is given by the coefficients needed to express vj in terms
of the basis γ.
The principle of “express elements of one basis in terms of the other basis,
then use those coefficients to make a matrix” is easy enough to remember.
In principle one might try to build a matrix by putting the coefficients
found above into a row of the matrix, instead of a column, or by expressing
elements of γ in terms of β, instead of what we did above. To keep things
straight it is useful to remember the following.
2. The purpose of Iβγ is to turn [v]β into [v]γ – that is, the output, the
information that is needed, is on how to express vectors as linear com-
binations of elements of γ. Thus to determine Iβγ , we should express
elements of β in terms of γ, and not vice versa.
(11.7): the commutative square with Iβγ : K n → K n along the top (sending [v]β to [v]γ ), the identity map I : V → V along the bottom (sending v to v), and the vertical maps Iβ and Iγ on the left and right respectively.
Comparing this to (11.2), we see that Iβγ is exactly [I]γβ , the matrix of the
identity transformation relative to the bases β (in the domain) and γ (in
the codomain).
Lecture 12 Wed. Oct. 9
12.1 Conjugacy
[Lax] p. 37–38, [Bee] p. 616–676, [CDW] p. 202–206, [Hef ] p. 241–248,
[LNS] p. 138–143, [Tre] p. 68–73
The previous section lets us describe the relationship between [v]β and
[v]γ when v ∈ V and β, γ are two bases for V . Now we are in a position
to describe the relationship between [T ]β and [T ]γ when T ∈ L(V ). The
following commutative diagram relates these matrices to T itself, and hence
to each other:
[T ]γ : K n → K n runs along the top, T : V → V along the middle, and [T ]β : K n → K n along the bottom; the vertical maps joining the top row to the middle row are Iγ : K n → V , and those joining the bottom row to the middle row are Iβ : K n → V .
Recall that Iβγ = Iγ−1 Iβ , while Iγβ = Iβ−1 Iγ is obtained by following the right-
hand side of the diagram from the top to the bottom. Then because the diagram commutes,
we have
[T ]β = Iγβ [T ]γ Iβγ = (Iβγ )−1 [T ]γ Iβγ . (12.1)
We could also have gotten to (12.1) by observing that [T ]β satisfies the re-
lationship [T v]β = [T ]β [v]β , and that using the change-of-coordinates prop-
erties from the previous section, we have
[T v]β = Iγβ [T v]γ = Iγβ [T ]γ [v]γ = Iγβ [T ]γ Iβγ [v]β .
Definition 12.1. Let V, W be vector spaces and consider two linear maps
S ∈ L(V ) and T ∈ L(W ). We say that S and T are conjugate if there is
an isomorphism J : V → W such that T = JSJ −1 . That is, the following
diagram commutes: S : V → V along the top, T : W → W along the bottom, and J : V → W as the vertical map on both sides.
Example 12.5. Let T ∈ L(R2 ) be the linear operator whose matrix (in the
standard basis, or indeed, in any basis) is A = ( 00 10 ). Then the matrix of T 2
is A2 = 0, and so T 2 is the zero transformation.
4. Let β = {v1 , . . . , vdim V } be the basis for V obtained in this way, and
show that [T ]β is strictly upper-triangular.
• P vi = vi for every 1 ≤ i ≤ k;
• P wi = 0 for every 1 ≤ i ≤ `.
where Ik×k is the k × k identity matrix, 0k×` is the k × ` matrix of all 0s,
and so on. In other words,
[P ]β = diag(1, . . . , 1, 0, . . . , 0),   with k ones followed by ` zeros,
The right hand side is a number, which is a 1×1 matrix, so [`]β must be a 1×n
matrix in order for things to match up. In other words, a linear functional
` ∈ V 0 is represented by a row vector [`]β , and `(v) can be computed by
multiplying the row and column vectors of ` and v relative to β.
Lecture 13 Mon. Oct. 14
Further reading:
Given a basis {v1 , . . . , vm } for W , define T : V ′ → K m by T (`) = (`(v1 ), . . . , `(vm )).
Then W ⊥ = NT . One can show that T is onto (this is an exercise), and
thus RT = K m . In particular, we have dim W ⊥ = dim NT and dim W =
m = dim RT , and so by Theorem 9.1 we have dim W + dim W ⊥ = dim RT +
dim NT = dim V 0 . By Theorem 5.10, this is equal to dim V .
The definition of annihilator also goes from the dual space V ′ to V itself:
given S ⊂ V ′ , we write S ⊥ = {v ∈ V | `(v) = 0 for all ` ∈ S}.
Together with Theorem 13.3, this result immediately implies that the
range of T is the annihilator of the nullspace of T ′ :
RT = (NT ′ )⊥    (13.2)
We can put these results together to conclude that T and T ′ have the
same rank for any finite-dimensional V, W and any T ∈ L(V, W ). Indeed,
Theorems 13.2 and 9.1 yield
The last terms are equal by Theorem 5.10, and the first terms are equal by
Theorem 13.3. Thus we conclude that
This can be interpreted as the statement that row and column rank of a
matrix are the same. Indeed, if T is an m × n matrix, then RT ⊂ K m is the
column space of T , and RT ′ ⊂ K n is the row space of T . The above result
shows that these spaces have the same dimension.
Exercise 13.5. Show that if dim V = dim W and T ∈ L(V, W ), then dim NT =
dim NT ′ .
Of course we are not studying PDEs in this course, but there is a useful
finite-dimensional analogue of this problem. Suppose we discretise the prob-
lem by filling the region G with a square lattice and recording the value of
u at every point of the lattice. Then each point has four neighbours, which
we think of as lying to the north, south, east, and west, at a distance of h.
Writing u0 for the value of u at the central point, and uN , uS , uE , uW for
the values at its neighbours, we have the approximations
uxx ≈ (1/h2 )(uW − 2u0 + uE ),
uyy ≈ (1/h2 )(uN − 2u0 + uS ),
and so the approximate version of the Laplace equation uxx + uyy = 0 is the
linear equation
u0 = (1/4)(uW + uE + uN + uS ).    (13.4)
That is, u solves the discretised equation if and only if its value at every
lattice point is the average of its values at the four neighbours of that point.
We require this to hold whenever all four neighbours uW , uE , uN , and uS
are in G. If any of those neighbours are outside of G, then we require u0 to
take the value prescribed at the nearest boundary point of G.
Writing n for the number of lattice points in G, we see that the discretised
version of the Laplace equation ∆u = 0 is a system of n equations in n
unknowns. We want to answer the following question: for a given choice of
boundary conditions, is there a solution of the discretised Laplace equation
on the interior of G? If so, is that solution unique?
Because the discretised Laplace equation is a system of n linear equa-
tions in n unknowns, this question is equivalent to the following: does the
given non-homogeneous system of equations have a unique solution? We
know that if the homogeneous system has only the trivial solution, then the
non-homogeneous system has a unique solution for any choice of data – in
particular, the discretised Laplace equation has a unique solution on the
interior of G for any choice of boundary conditions.
The homogeneous system for the Laplace equation corresponds to having
the zero boundary condition (u = 0 on the boundary of G). Let M be the
maximum value that u takes at any point in G. We claim that M = 0.
Indeed, if M > 0 then there is some lattice point at the interior of G for
which u0 = M . By (13.4) and the fact that each of the neighbouring points
has u ≤ M , we see that u = M at every neighbouring point. Continuing in
this way, we conclude that u = M on every point, contradicting the fact that
it vanishes on the boundary. Thus u ≤ 0 on all of G. A similar argument
(considering the minimum value) shows that u ≥ 0 on all of G, and so u = 0
is the only solution to the homogeneous problem. This implies that there
is a unique solution to the non-homogeneous problem for every choice of
boundary conditions.
In fact, we have proved something slightly stronger. Laplace’s equation
is a special case of Poisson’s equation ∆u = f , where f : G → R is some
function – examples where f 6= 0 arise in various physical applications.
Upon discretising, (13.4) becomes u0 = (1/4)(uW + uE + uN + uS − f0 ), where
f0 is the value of f at the given point. Thus the situation is the following: if
G contains n lattice points and k of these are on the boundary in the sense
that one of their neighbours lies outside of G, then the non-homogeneous
part of k of the equations specifies the boundary condition, while the non-
homogeneous part of the remaining n − k equations specifies the function
f . The above argument permits both parts to be non-zero, and so it shows
that the discretised version of Poisson’s equation has a unique solution for
every choice of boundary condition and every choice of f .
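To see the discretised problem in action, here is a small sketch (assuming numpy; the grid size and boundary values are arbitrary choices) that assembles and solves the system (13.4) on a square grid whose outer ring of points is treated as the boundary.

import numpy as np

N = 6                                     # grid is N x N; the outer ring is the boundary
boundary = np.zeros((N, N))
boundary[0, :] = 1.0                      # arbitrary boundary condition: u = 1 on the top edge

interior = [(i, j) for i in range(1, N - 1) for j in range(1, N - 1)]
index = {p: k for k, p in enumerate(interior)}
n = len(interior)

A = np.zeros((n, n))
b = np.zeros(n)
for (i, j), k in index.items():           # one equation u0 = (uW+uE+uN+uS)/4 per interior point
    A[k, k] = 1.0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        p = (i + di, j + dj)
        if p in index:                    # interior neighbour: an unknown
            A[k, index[p]] = -0.25
        else:                             # boundary neighbour: a known value
            b[k] += 0.25 * boundary[p]

u = np.linalg.solve(A, b)                 # unique solution, as argued above
print(u.reshape(N - 2, N - 2).round(3))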
Lecture 14 Wed. Oct. 16
Powers of matrices
Further reading:
That is, vn is the second column of the matrix An , and xn is the second
entry of this column, so we have the explicit formula
xn = (An )22 .
This gives us an explicit formula for the nth Fibonacci number, if only we
can find a formula for An .
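Numerically this is easy to explore (a sketch assuming numpy and the standard choice of Fibonacci matrix A = [ 1 1 ; 1 0 ]; the indexing conventions in the part of the lecture not reproduced here may differ slightly):

import numpy as np

A = np.array([[1, 1],
              [1, 0]])
print(np.linalg.matrix_power(A, 10))
# [[89 55]
#  [55 34]]  -- consecutive Fibonacci numbers appear as the entries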
The difficulty, of course, is that it is not clear how to write down a
formula for Ak for an arbitrary matrix A, or even to predict whether the
entries of Ak will grow, shrink, or oscillate in absolute value. For example,
consider the matrices
A = [ 5 7.1 ; −3 −4 ] ,    B = [ 5 6.9 ; −3 −4 ]    (14.1)
(here and below, matrices are written row by row, with semicolons separating the rows). It turns out that
A50 ≈ [ −1476 −3176 ; 1342 2549 ] ,    B50 ≈ [ .004 .008 ; −.003 −.006 ]    (14.2)
Thus despite the fact that A and B have entries that are very close to each
other, the iterates An v and B n v behave very differently: we will eventually
be able to show that if v ∈ R2 \ {0} is any non-zero vector, then ‖An v‖ → ∞
as n → ∞, while ‖B n v‖ → 0.
We could formulate a similar example for a one-dimensional linear map,
which is just multiplication by a real number. Let a = 1.1 and b = 0.9. Then
we see immediately that for any non-zero number x, we have an x → ∞ and
bn x → 0 as n → ∞. In this case, it is clear what the issue is: |a| is larger
than 1, and |b| is smaller than 1.
Can we find a similar criterion for matrices? Is there some property of a
matrix that we can look at and determine what kind of behaviour the iterates
An v will exhibit? This is one question that will motivate our discussion of
eigenvalues and eigenvectors, determinants, and eventually spectral theory.
w, Aw, A2 w, . . . , An w.
There are n + 1 of these vectors, and dim(Cn ) = n, thus these vectors are
linearly dependent. That is, there exist coefficients c0 , . . . , cn ∈ C such that
c0 w + c1 Aw + · · · + cn An w = 0.    (14.3)
Proposition 14.4 shows that over C, every matrix has at least one eigen-
value and eigenvector, but does not show how to compute them in practice,
or how many of them there are. To do this, we need to return to the char-
acterisation of eigenvalues and eigenvectors in Proposition 14.2. If we know
that λ is an eigenvalue, then eigenvectors for λ are elements of the nullspace
of A − λI, which can be found via row reduction. Thus the primary chal-
lenge is to find the eigenvalues – that is, to find λ such that A − λI is
non-invertible.
Thus we have reduced the problem of determining eigenvalues and eigen-
vectors to the problem of determining invertibility of a matrix. This moti-
vates the introduction of the determinant, a quantity which we will spend
the next few lectures studying. We end this lecture with the following ob-
servation: if β = {v1 , . . . , vn } is a basis for V consisting of eigenvectors of a
linear operator T , then for each j = 1, . . . , n we have T vj = λj vj for some
λj ∈ K, so that [T vj ]β = λj ej . This implies that [T ]β is the diagonal matrix
with entries λ1 , . . . , λn . Thus finding a basis of eigenvectors has very strong
consequences.
Exercise 14.5. Consider the matrix A = [ −4 −6 3 ; 2 4 −2 ; −2 −2 1 ]. Show that 0, −1, and
2 are all eigenvalues of A, and find eigenvectors associated to each. Check
that these eigenvectors form a basis β, and find the change-of-coordinates
matrix Q = Iβα , where α is the standard basis for R3 . Verify by direct
computation that Q−1 AQ = diag(0, −1, 2).
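For comparison (a numerical sketch assuming numpy; it does not replace doing the exercise by hand):

import numpy as np

A = np.array([[-4.0, -6.0,  3.0],
              [ 2.0,  4.0, -2.0],
              [-2.0, -2.0,  1.0]])

evals, evecs = np.linalg.eig(A)
print(np.sort(evals.real))                     # approximately [-1.  0.  2.]

Q = evecs                                      # columns are eigenvectors of A
print(np.round(np.linalg.inv(Q) @ A @ Q, 8))   # diagonal matrix of the eigenvalues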
Lecture 15 Mon. Oct. 21
Introduction to determinants
Further reading:
• A is invertible
[A | I] → [ 1 b/a | 1/a 0 ; 0 1 | 0 1/d ] → [ 1 0 | 1/a −b/(ad) ; 0 1 | 0 1/d ]
At this point we can observe that one of two things happens: either ad = bc
and it is impossible to row reduce this to [I | A−1 ], hence A is non-invertible;
or ad 6= bc and we can get to that form via four more row operations (divide
first row by the non-zero value bc − ad, divide second row by the non-zero
value c, swap the rows, and finally subtract a multiple of a row to eliminate
the remaining non-zero term in the left half). In particular, we conclude
that A is invertible if and only if ad − bc 6= 0. Note that this criterion works
for the case c = 0 as well, because in this case it reduces to ad 6= 0, which
was the criterion found in the previous argument.
Exercise 15.2. Complete the row reduction above to show that the aug-
mented matrix [A | I] row reduces to
[ ad − bc 0 | d −b ; 0 ad − bc | −c a ]
and hence A−1 = (1/(ad − bc)) [ d −b ; −c a ].
Thus for 2 × 2 matrices, invertibility can be determined via a formula –
one simply tests whether or not the determinant ad − bc is zero or non-zero.
It is natural to ask if there is a similar quantity for larger matrices, and
we will address this shortly. First we give a geometric interpretation of the
determinant for a 2 × 2 matrix. Recall again that A is invertible if and only
if the columns of A form a basis for R2 , which is true if and only if they
are linearly
independent (because there are 2 columns). Let v = ( ac ) and
w = ( bd ) be the columns of A. Plotting v and w in the plane, we can make
the following geometric observation: v, w are linearly dependent if and only
if they point in the same direction or in opposite directions. Let θ be the
angle between v and w, then the vectors are linearly dependent if and only
if θ = 0 or θ = π, which is equivalent to sin θ = 0.
So, can we compute sin θ in terms of a, b, c, d? Let r = ‖v‖ = √(a2 + c2 )
and s = ‖w‖ = √(b2 + d2 ) be the lengths of the vectors v and w, so that in
polar coordinates we have
v = ( ac ) = ( r cos α, r sin α ) ,    w = ( bd ) = ( s cos β, s sin β ) .
That is, both A and its transpose At have the same determinant. We will
later see that this property continues to hold for larger matrices. For now,
note that this means that the determinant of A gives the area of the par-
allelogram spanned by the rows of A, as well as the one spanned by the
columns.
Finally, we observe that this geometric interpretation can also be inter-
preted in terms of row reduction. Given two row vectors v, w ∈ R2 , it is
easy to see that P (v + tw, w) and P (v, w) have the same area for every
t ∈ R. This corresponds to the fact that we may treat w as the base of the
parallelogram, and the height does not change if we replace v with v + tw
for any t. Thus assuming v and w are in “general position”, we can make
one row reduction to go from the matrix with rows v, w to the matrix with rows v ′ , w, where v ′ = v + tw is of the form
(a, 0), and then we can make one more row reduction to reach the matrix with rows
v ′ , w ′ , where w ′ = w + sv ′ is of the form (0, b). Then P (v, w) = ab, since the
parallelogram spanned by v ′ and w ′ is just a rectangle with sides a and b.
The previous paragraph shows that if we row reduce a 2 × 2 matrix A
to a diagonal form D using only operations of the form “add a multiple of
a row to a different row”, then the product of the diagonal entries of D
gives the area of the parallelogram spanned by the rows of A. A completely
analogous result is true for n × n matrices. In the next section, we consider
the 3 × 3 case and find a formula for the product of these diagonal entries
in terms of the entries of A itself.
Of course, this system may not have any solutions, depending on the values
of w2 , w3 , x2 , x3 , v2 , v3 . For now we assume that these are such that there is
always a unique solution, and worry about degenerate cases later. Although
we work with row vectors here instead of column vectors, we can still solve
this system by using Exercise 15.2, which tells us that
−1
w2 w3 1 x3 −w3
= .
x2 x3 w2 x3 − w3 x2 −x2 w2
Thus
( s t ) = −( v2 v3 ) (1/(w2 x3 − w3 x2 )) [ x3 −w3 ; −x2 w2 ]
= ( v3 x2 − v2 x3 , v2 w3 − v3 w2 ) / (w2 x3 − w3 x2 ).
We conclude that
a = v1 + sw1 + tx1 = v1 + (v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 )
= (v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 ).    (15.3)
Note that the expression for a has the following properties: every term in
the numerator is the product of exactly one entry from each of v, w, x, and
the denominator is the determinant of the 2 × 2 matrix ( wx22 wx33 ), which is
the matrix lying directly below the two entries of A that we transformed to
0 in our row reduction.
We need to repeat the same procedure with w and x to find b, c, but
now it turns out to be simpler. Indeed, w ′ is determined by the requirement
that w ′ = w + sv ′ + tx = (0, b, 0) for some s, t ∈ R (not necessarily the same
as before). This gives
w1 + sa + tx1 = 0
w2 + tx2 = b
w3 + tx3 = 0.
The third equation determines t, the second determines b, and the first
determines s. We get t = −w3 /x3 , hence
b = w2 − (w3 /x3 )x2 = (w2 x3 − w3 x2 )/x3 .
Note that the numerator of b matches the denominator of a. Now we have
row reduced A to [ a 0 0 ; 0 b 0 ; x1 x2 x3 ], and we see immediately that row reducing to
diagonal form does not change the bottom right entry, so c = x3 , and the
original matrix A is row equivalent to the diagonal matrix diag(a, b, c). In
particular, the volume of the parallelepiped spanned by v, w, x is the same
as the volume of the rectangular prism with side lengths a, b, c – that is, it
is given by
abc = [(v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 )] · [(w2 x3 − w3 x2 )/x3 ] · x3
= v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 .    (15.4)
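As a sanity check (a sketch assuming numpy, with random vectors), the expression (15.4) agrees with the determinant numpy computes for the matrix with rows v, w, x:

import numpy as np

rng = np.random.default_rng(2)
v, w, x = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)

formula = (v[0]*w[1]*x[2] - v[0]*w[2]*x[1] + v[2]*w[0]*x[1]
           - v[1]*w[0]*x[2] + v[1]*w[2]*x[0] - v[2]*w[1]*x[0])
print(np.isclose(formula, np.linalg.det(np.array([v, w, x]))))   # True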
So far we cannot say anything rigorously about this quantity, because the
discussion above did not properly account for the possibility that we may
have x3 = 0 or w2 x3 − w3 x2 = 0. Nevertheless, we may tentatively call this
quantity the determinant of A, and see if it does what we expect it to do
– that is, tell us whether or not A is invertible, and help in computing the
inverse. Thus we make the following definition: given a 3 × 3 matrix A with
entries Aij , the determinant of A is the quantity (15.4); writing v, w, x for the rows of A,
det A = v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 .    (15.5)
Lecture 16 Wed. Oct. 23
Further reading:
A = [ v1 v2 v3 ; w1 w2 w3 ; x1 x2 x3 ] = [ A11 A12 A13 ; A21 A22 A23 ; A31 A32 A33 ] ,
where Ãij denotes the 2 × 2 matrix obtained from A by deleting the ith row
and jth column. This formula, defining the determinant of a 3 × 3 matrix
in terms of the determinants of the 2 × 2 matrices Ãij , is called cofactor
expansion, or Laplace expansion. We will later see that a generalisation of
this can be used to define the determinant of an n × n matrix in terms of
determinants of (n − 1) × (n − 1) matrices.
Another interpretation of the formula for a 3×3 determinant is illustrated
by the following diagram, which relates each term of (15.5) to a choice of 3
Pictorially, the positive and negative signs associated to the i, j-term follow
the chessboard pattern shown:
+ − +
− + − (16.4)
+ − +
What about (16.1) itself? How do we determine which terms are positive
and which are negative?
e1 , and e2 . Notice that each of these can be obtained from the identity
matrix via a single row operation of swapping two rows. The identity matrix,
of course, requires zero swaps, while the other two permutation matrices
corresponding to positive terms require two swaps.
Exercise 16.2. Check this last claim, that the matrices with rows e3 , e1 , e2 and with rows e2 , e3 , e1 can be obtained
from I by two row operations of exchanging rows.
The above discussion suggests the following criterion:
In other words, P (v1 , v2 ) is the image of the square with vertices (0, 0),
(1, 0), (0, 1), and (1, 1) under the linear transformation x ↦ xA.
More generally, given v1 , . . . , vn ∈ Rn , the parallelepiped spanned by
v1 , . . . , vn is
P (v1 , . . . , vn ) = {t1 v1 + · · · + tn vn | ti ∈ [0, 1] for all 1 ≤ i ≤ n} = {xA | x ∈ C},
where A is the matrix with rows v1 , . . . , vn (the linear transformation sending ei to vi ), and C is
the n-dimensional cube with side lengths 1 and vertices at 0 and ei for
1 ≤ i ≤ n. (Note that C also has vertices at ei1 + · · · + eik for every
1 ≤ i1 < i2 < · · · < ik ≤ n.)
For 2 × 2 matrices, recall that the determinant of the matrix with rows v1 , v2 is ‖v1 ‖ ‖v2 ‖ sin θ, where θ is the
angle from v1 to v2 . In particular, sin θ is positive if the shortest way from
v1 to v2 is to go counterclockwise, and is negative if the shortest way is
clockwise. Thus the sign of the determinant tells us what the orientation
of the vectors v1 , v2 is relative to each other. We write S(v1 , v2 ) = ±1
depending on whether this orientation is positive or negative.
What about 3 × 3 matrices, or larger? Given a basis v1 , v2 , v3 ∈ R3 , we
can no longer use the words “clockwise” and “counterclockwise” to describe
where these vectors are in relation to each other. Instead we note the follow-
ing: in R2 , the standard basis e1 , e2 is positively oriented, and any positively
oriented basis v1 , v2 can be continuously deformed into the standard basis
without becoming dependent. More precisely, if v1 , v2 is a positively oriented
basis, then there are continuous functions w1 , w2 : [0, 1] → R2 such that
Earlier, we saw that this definition of determinant implies the formula (15.5)
for 3 × 3 matrices, which we rewrote in two different ways, (16.2)–(16.3) and
(16.7). Now we will show that (16.8) and (16.9) force the determinant to
have several properties that in turn force it to be given by a generalisation of
(16.7), the sum over permutations, and that this is equivalent to a recursive
definition following (16.2)–(16.3).
Lecture 17 Mon. Oct. 28
Further reading:
The signed volume D(v1 , . . . , vn ) of the parallelepiped P (v1 , . . . , vn ) should
have the following properties:
1. D(v1 , . . . , vn ) = 0 whenever vi = vj for some i ≠ j.
2. D is linear in each argument.
3. D(e1 , . . . , en ) = 1.
To see the first property, note that if vi = vj for i ≠ j, then the paral-
lelepiped P (v1 , . . . , vn ) lies in an (n − 1)-dimensional subspace of Rn , and
hence has zero volume. To see the third property, note that P (e1 , . . . , en )
is the unit n-cube, and this basis is positively oriented by definition. It only
remains to prove the second property, linearity in each argument.
Given 1 ≤ j ≤ n and v1 , . . . , vn , let V be the (n − 1)-dimensional volume
of the parallelepiped spanned by v1 , . . . , vj−1 , vj+1 , . . . , vn . Given x ∈ Rn ,
let h(x) be the signed distance from x to the subspace in which this paral-
lelepiped lies – that is, h(x) is positive if S(v1 , . . . , vj−1 , x, vj+1 , . . . , vn ) = 1,
and negative otherwise. Then h is a linear function of x, and we have

    D(v1 , . . . , vj−1 , x, vj+1 , . . . , vn ) = h(x)V,

so D is linear in x.
There are two more important properties of determinants that follow
from the three listed above.
Let’s consider this sum for a moment. Every term in the sum is a product
of n entries vij multiplied by the factor D(ej1 , . . . , ejn ). The n entries vij
in any given term have the property that one of them is an entry in v1 , one
is an entry in v2 , and so on. If we consider the n × n matrix A whose rows
are the vectors v1 , . . . , vn , then this corresponds to taking exactly one entry
from each row of A.
We can rewrite (17.1) as

    D(v1 , . . . , vn ) = Σ_f v1,f(1) · · · vn,f(n) D(ef(1) , . . . , ef(n) ),    (17.2)
where the sum is taken over all functions f : {1, . . . , n} → {1, . . . , n}. If f
is not 1-1, then there are indices i ≠ j such that ef (i) = ef (j) , and hence
by property 1, D(ef (1) , . . . , ef (n) ) = 0. Thus the only non-zero terms in
the above sum are those corresponding to 1-1 functions f . A function from
an n-element set to itself is 1-1 if and only if it is onto; in this case it is
called a permutation of the set, as in the 3 × 3 case we discussed earlier.
Let Sn denote the set of all permutations on n symbols; then since the only
functions that contribute to the sum in (17.2) are permutations, we have
    D(v1 , . . . , vn ) = Σ_{π∈Sn} D(eπ(1) , . . . , eπ(n) ) · Π_{i=1}^n vi,π(i)    (17.3)
    det A = Σ_{π∈Sn} sgn π · Π_{i=1}^n Ai,π(i) ,    (17.5)
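Although the sum has n! terms and so is impractical for large n, formula (17.5) is easy to evaluate directly for small matrices. A minimal Python sketch (illustrative only; the matrix is the same arbitrary example used earlier, and the value agrees with the cofactor expansion):

    from itertools import permutations
    from math import prod

    def sign(pi):
        # sgn(pi) = (-1)^(number of inversions)
        n = len(pi)
        inversions = sum(1 for a in range(n) for b in range(a + 1, n) if pi[a] > pi[b])
        return -1 if inversions % 2 else 1

    def det_perm_sum(A):
        # formula (17.5): sum over all permutations pi of sgn(pi) * prod_i A[i][pi(i)]
        n = len(A)
        return sum(sign(pi) * prod(A[i][pi[i]] for i in range(n))
                   for pi in permutations(range(n)))

    A = [[1, 2, 3],
         [0, 4, 5],
         [1, 0, 6]]
    print(det_perm_sum(A))  # 22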
Lecture 18 Mon. Nov. 4
Further reading:
    1  2  3  4
                         (18.1)
    2  4  1  3

(each number in the top row is joined by an arrow to the same number in the
bottom row)
Now count the number of times the arrows cross (in the above example,
there are three crossings). We claim that the number of crossings has the
same parity (even or odd) as the number of transpositions it takes to obtain
π. To see this, choose any way of obtaining π via transpositions: in the
above example, one possibility would be
(1, 2, 3, 4) → (2, 1, 3, 4) → (2, 4, 3, 1) → (2, 4, 1, 3).
Draw a similar diagram to (18.1), where each transposition gets its own row:
    1  2  3  4              (18.2)
    2  1  3  4
    2  4  3  1
    2  4  1  3
¹ For any readers who are familiar with the representation of a permutation in terms
of its cycle structure, notice that we are using a similar-looking notation here to mean a
different thing.
Note that we are careful to draw the arrows in such a way that no more than
two arrows intersect at a time.2 (This is why the second row has a curved
arrow.) This diagram has more intersections than (18.1), but the parity of
intersections is the same. To see that this is true in general, say that i and
j are inverted by π if π(i), π(j) occur in a different order than i, j. That is,
i, j are inverted if i < j and π(i) > π(j), or if i > j and π(i) < π(j).
Proposition 18.1. For any choice of how to draw the arrows in the above
diagram, and any i ≠ j, the arrows corresponding to i and j cross an odd
number of times if i, j are inverted by π, and an even number of times if
they are not inverted.
Proof. Every time the arrows for i, j cross, the order of i, j is reversed. If
this order is reversed an odd number of times, then i, j are inverted. If it is
reversed an even number of times, then i, j are not inverted.
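As a concrete illustration of this parity bookkeeping, here is a short Python sketch (illustrative only) that counts the inversions of the permutation (2, 4, 1, 3) and compares their parity with the number of transpositions used in (18.2):

    def inversions(perm):
        # pairs (i, j) with i < j whose order is reversed by the permutation
        n = len(perm)
        return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

    def apply_transpositions(n, swaps):
        # start from the identity and apply each swap of positions (a, b) in turn
        perm = list(range(n))
        for a, b in swaps:
            perm[a], perm[b] = perm[b], perm[a]
        return perm

    # (1,2,3,4) -> (2,1,3,4) -> (2,4,3,1) -> (2,4,1,3), written with 0-indexed positions
    swaps = [(0, 1), (1, 3), (2, 3)]
    perm = apply_transpositions(4, swaps)
    print(perm)                                  # [1, 3, 0, 2], i.e. (2, 4, 1, 3)
    print(inversions(perm) % 2, len(swaps) % 2)  # 1 1: both odd, matching the claim about parity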
    det(Aᵗ) = Σ_{π∈Sn} sgn π · Π_{i=1}^n (Aᵗ)i,π(i) = Σ_{π∈Sn} sgn π · Π_{i=1}^n Aπ(i),i .

Note that Π_{i=1}^n Aπ(i),i = Π_{j=1}^n Aj,π⁻¹(j) , and since sgn(π⁻¹) = sgn π, we
have

    det(Aᵗ) = Σ_{π∈Sn} sgn(π⁻¹) · Π_{j=1}^n Aj,π⁻¹(j) .
Define C : (K n )n → K by
We claim that the function C has the first two properties described in §17.1,
and that C(e1 , . . . , en ) = det A. Then it will follow from the discussion in
§17.2 that C(b1 , . . . , bn ) = (det A)D(b1 , . . . , bn ), which will suffice to prove
the theorem.
To verify these properties, first observe that if bi = bj for some i ≠ j,
then Abi = Abj , and so the numerator in (18.4) vanishes. This takes care of
property 1. Property 2 follows because Abi depends linearly on bi , and the
right-hand side of (18.4) depends linearly on Abi .
Now we observe that if a1 , . . . , an are the columns of A, then
where the second equality follows from matrix multiplication, and the third
from the definition of determinant.
As indicated above, §17.2 now gives
Proof. (⇐): If A is non-invertible then its columns are not a basis for K n ,
and so they are linearly dependent. Thus Property 5 of the function D
implies that det A = 0.
(⇒): If A is invertible, then there is B ∈ Mn×n such that AB = I, and
so by Theorem 18.7 we have
Corollary 18.9. If two matrices A and B are similar, then det A = det B.
Lecture 19 Wed. Nov. 6
Further reading:
where the first equality is the definition of determinant, and the second
follows from multilinearity of D. To prove (19.1), we need to show that
Note that by Properties 1 and 2 of the function D, the value of D does not
change if we add a multiple of one vector to another: in particular, for every
2 ≤ j ≤ n and every λ ∈ K, we have
Notice that we have set every entry in the ith row and 1st column to 0, ex-
cept for the entry in the place (i, 1), which is now equal to 1. The remaining
entries determine the (n−1)×(n−1) matrix Ãi1 . It follows from multilinear-
ity of determinant that the right-hand side of (19.3) is linear in each column
of Ãi1 , and hence D(ei , a2 , . . . , an ) is as well. Similarly, D(ei , a2 , . . . , an ) = 0
if any two columns of Ãi1 are equal. Because these properties characterise
determinant up to a constant normalising factor, we conclude that
We can go from the order (i, 1, . . . , i−1, i+1, . . . , n) to the order (1, 2, . . . , n)
in i − 1 swaps, and (19.2) follows.
This gives the cofactor (Laplace) expansion along the first column: the
only difference from the 3 × 3 case is that the sum has n terms, instead of 3,
and the matrix Ãi1 is an (n − 1) × (n − 1) matrix, instead of a 2 × 2 matrix.
Because interchanging two columns reverses the sign of the determinant,
we immediately obtain from Theorem 19.1 the formula for cofactor expan-
sion along the jth column as
    det A = Σ_{i=1}^n (−1)^{i+j} Aij det Ãij .
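A recursive Python sketch of this expansion (an illustration only; it expands along the first column at every level, and the 4 × 4 matrix is an arbitrary example):

    def minor(A, i, j):
        # remove row i and column j (0-indexed)
        return [[A[r][c] for c in range(len(A)) if c != j]
                for r in range(len(A)) if r != i]

    def det_cofactor(A):
        # cofactor expansion along the first column, applied recursively
        n = len(A)
        if n == 1:
            return A[0][0]
        return sum((-1) ** i * A[i][0] * det_cofactor(minor(A, i, 0))
                   for i in range(n))

    A = [[2, 0, 1, 3],
         [1, 1, 0, 2],
         [0, 5, 1, 1],
         [4, 0, 0, 1]]
    print(det_cofactor(A))  # -41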
and a similar computation shows that for any i ≠ j, we can replace vi with
vi + λvj without changing the value of D. In particular, if A row reduces to
B using only the first type of row operation – adding a multiple of a row to
another row – then det A = det B.
Furthermore, we know that interchanging two rows multiplies the deter-
minant by −1. Thus we have the following result.
Proposition 19.2. Suppose A can be row reduced to B using only row
operations of Types 1 and 2. Let k be the number of row operations of Type
2 required to carry out this row reduction. Then det A = (−1)k det B.
This gives a more efficient way of computing determinants. If A is non-
invertible, then it row reduces to a matrix with an empty row, and we
conclude that det A = 0. If A is invertible, then it row reduces to a diagonal
matrix diag(λ1 , . . . , λn ), and we conclude that
Exercise 19.3. Show that det diag(λ1 , . . . , λn ) = λ1 λ2 · · · λn .
Using this exercise, we see that (19.4) reduces to
det A = (−1)k λ1 · · · λn .
It can be shown that the number of arithmetic operations involved in carrying out
the above process for an n × n matrix is on the order of n³, which gives a
much more manageable number of computations than a direct application of
the formulas via summing over permutations or cofactor expansion. Indeed,
for n = 10 we have n³ = 1000, while n! ≈ 3.6 × 10⁶, so the two differ by more
than three orders of magnitude. For n = 20 we have n³ = 8000, while
n! ≈ 2.4 × 10¹⁸. At a billion operations a second, it would take about
1/100,000 of a second to carry out 20³ operations, but about 76 years to
carry out 20! operations. Thus
row reduction is a vastly more efficient way of computing large determinants
than the previous formulas we saw.
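A Python sketch of this procedure (illustrative only): reduce to upper triangular form using only row additions and row swaps, track the number k of swaps, and multiply the diagonal entries.

    def det_by_row_reduction(A):
        A = [row[:] for row in A]       # work on a copy
        n = len(A)
        swaps = 0
        for col in range(n):
            # find a row at or below position `col` with a non-zero entry in this column
            pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
            if pivot is None:
                return 0                # A is not invertible
            if pivot != col:
                A[col], A[pivot] = A[pivot], A[col]
                swaps += 1
            for r in range(col + 1, n):
                # Type 1 operation: subtract a multiple of the pivot row
                factor = A[r][col] / A[col][col]
                A[r] = [A[r][c] - factor * A[col][c] for c in range(n)]
        result = -1 if swaps % 2 else 1
        for i in range(n):
            result *= A[i][i]
        return result

    A = [[2, 0, 1, 3],
         [1, 1, 0, 2],
         [0, 5, 1, 1],
         [4, 0, 0, 1]]
    print(det_by_row_reduction(A))  # approximately -41, agreeing with the cofactor expansion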
Exercise 19.4. Suppose A can be row reduced to B via any sequence of row
operations; let k be the number of transpositions (Type 2) involved, and let
p1 , . . . , pm be the non-zero factors by which rows are multiplied in the m
steps of Type 3. Use multilinearity of determinant to show that
and in particular, if A row reduces to the identity through this process, then
−4 9 −14 15
Row operations of the first type reduce this to

    1 −2  3 −12        1 −2  3 −12        1 −2  3 −12
    0  2  1 −41        0  0  5  25        0  0  5  25
    0  4  7 −77   →    0  0 15  55   →    0  0  0 −20  .
    0  1 −2 −33        0  1 −2 −33        0  1 −2 −33

With two more transpositions, this becomes an upper triangular matrix with
diagonal entries 1, 1, 5, −20, and so
Lecture 20 Mon. Nov. 11
Further reading:
= xk det A,
where the first equality is the definition of determinant, the second equality
uses (20.1), the third uses multilinearity of determinant, and the fourth
uses the fact that D vanishes whenever two of its arguments coincide. We
conclude that xk = (det Bk )/(det A), since A is invertible and hence det A ≠ 0.
Notice that the order of the indices is reversed between the left and right-
hand sides of the equation. We conclude that the inverse of a matrix A is
given by the formula
    (A⁻¹)ij = (−1)^{i+j} (det Ãji )/(det A).    (20.3)
For large matrices, this is not a practical method for computing the in-
verse, because it involves computing n2 determinants, and the computation
of inverse via row reduction of [A | I] to [I | A−1 ] is significantly faster.
Nevertheless, (20.3) has important theoretical significance.
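A numerical sketch of (20.3) in Python (illustrative only; the 3 × 3 matrix is an arbitrary invertible example), compared against numpy's built-in inverse:

    import numpy as np

    def minor(A, i, j):
        # delete row i and column j
        return np.delete(np.delete(A, i, axis=0), j, axis=1)

    def inverse_by_cofactors(A):
        n = A.shape[0]
        detA = np.linalg.det(A)
        inv = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                # note the transposed indices: the (i, j) entry uses the minor Ã_ji
                inv[i, j] = (-1) ** (i + j) * np.linalg.det(minor(A, j, i)) / detA
        return inv

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    print(np.allclose(inverse_by_cofactors(A), np.linalg.inv(A)))  # True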
20.2 Trace
[Lax] p. 55–56
The determinant is a function that assigns a scalar value to every n × n
matrix. Another important scalar-valued function is the trace, defined as
    Tr A = Σ_{i=1}^n Aii .    (20.4)
That is, the trace of an n × n matrix A is the sum of the diagonal entries
of A. (This looks like the formula for a determinant of an upper-triangular
matrix, where determinant is the product of the diagonal entries, but this
definition of trace is for any square matrix A, regardless of whether or not
it is upper-triangular.)
Determinant is a multilinear function – it depends linearly on each
row/column of A. Trace, on the other hand, is a linear function of the
entire matrix A. Indeed, given any A, B ∈ Mn×n and c ∈ K, we have
    Tr(cA + B) = Σ_{i=1}^n (cA + B)ii = Σ_{i=1}^n (cAii + Bii ) = c Tr A + Tr B.
Trace is not multiplicative in the same way determinant is: as a general rule,
Tr(AB) ≠ Tr(A) Tr(B). (Indeed, Tr(I) = n for the n × n identity matrix, so
Tr(IB) = Tr(B) ≠ Tr(I) Tr(B) for every B ∈ Mn×n with non-zero trace.)
Nevertheless, trace has the following important property with respect to
multiplication.
Theorem 20.1. Given any A, B ∈ Mn×n , we have Tr(AB) = Tr(BA).
Proof. By the formula for matrix multiplication, we have (AB)ii = Σ_{k=1}^n Aik Bki ,
and so

    Tr(AB) = Σ_{i=1}^n Σ_{k=1}^n Aik Bki .

Similarly, (BA)ii = Σ_{k=1}^n Bik Aki , and so

    Tr(BA) = Σ_{i=1}^n Σ_{k=1}^n Bik Aki .
The two sums are equal (just interchange the names of the indices).
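A quick numerical illustration (arbitrary 2 × 2 matrices) of Theorem 20.1 and of the warning that trace is not multiplicative:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    B = np.array([[0.0, 1.0], [5.0, 2.0]])

    print(np.trace(A @ B), np.trace(B @ A))  # 21.0 21.0: Tr(AB) = Tr(BA)
    print(np.trace(A) * np.trace(B))         # 10.0: so Tr(AB) != Tr(A)Tr(B) here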
We saw in Corollary 18.9 that similar (conjugate) matrices have the same
determinant; this followed from the fact that determinant is multiplicative.
Even though trace is not multiplicative, Theorem 20.1 leads to the analogous
result for similar matrices.
Corollary 20.2. Similar matrices have the same trace.
Proof. Let A, B ∈ Mn×n be such that B = QAQ−1 for some invertible
Q ∈ Mn×n . Then applying Theorem 20.1 to the matrices QA and Q−1 , we
have
Tr B = Tr((QA)(Q−1 )) = Tr((Q−1 )(QA)) = Tr A.
Exercise 20.3. Corollary 20.2 used the fact that Tr(ABC) = Tr(BCA). Is
it always true that Tr(ABC) = Tr(BAC)? If so, prove it. If not, provide a
counterexample.
thus λ1 = 2 has a corresponding eigenvector v1 = (1, −1), and λ2 = 5 has a
corresponding eigenvector v2 = (1, 2).
Note that v1 , v2 are linearly independent, hence they form a basis for
R2 . In particular, any v ∈ R2 can be written as v = a1 v1 + a2 v2 for some
a1 , a2 ∈ R, and then we have
    A^N v = a1 (A^N v1 ) + a2 (A^N v2 ) = a1 λ1^N v1 + a2 λ2^N v2 = a1 2^N v1 + a2 5^N v2 .
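The matrix A of this example is not reproduced above; any 2 × 2 matrix with these eigenpairs behaves the same way, for instance A = ( 3 1 ; 2 4 ), which is used below only as an assumed stand-in. A short numpy sketch of the computation:

    import numpy as np

    A = np.array([[3.0, 1.0], [2.0, 4.0]])   # assumed stand-in with eigenvalues 2 and 5
    v1 = np.array([1.0, -1.0])               # eigenvector for lambda1 = 2
    v2 = np.array([1.0, 2.0])                # eigenvector for lambda2 = 5
    v = np.array([4.0, 5.0])                 # an arbitrary vector

    # write v = a1 v1 + a2 v2 by solving a 2x2 linear system
    a1, a2 = np.linalg.solve(np.column_stack([v1, v2]), v)

    N = 10
    direct = np.linalg.matrix_power(A, N) @ v
    via_eigenbasis = a1 * 2.0**N * v1 + a2 * 5.0**N * v2
    print(np.allclose(direct, via_eigenbasis))  # True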
Lecture 21 Wed. Nov. 13
Diagonalisability
Further reading:
Example 21.2. Let A be the 3 × 3 matrix from Example 20.6, then the
spectrum of A is {1, 2}. The eigenvalue λ = 1 has algebraic and geomet-
ric multiplicities equal to 1, and the eigenvalue λ = 2 has algebraic and
geometric multiplicities equal to 2.
Exercise 21.3. Let A be an upper triangular matrix. Show that the eigen-
values of A are precisely the diagonal entries of A, and that the algebraic
multiplicity of an eigenvalue is the number of times it appears on the diag-
onal of A.
In Exercise 20.7, the eigenvalue λ = 1 has algebraic multiplicity 2 but
geometric multiplicity 1, so the two numbers are not always the same. We
will come back to the relationship between these two numbers later on. First
we establish some general results, beginning with the fact that eigenvectors
for different eigenvalues are linearly independent.
    vm = a1 v1 + · · · + aℓ vℓ ,    (21.1)

where 1 ≤ ℓ < m and aℓ ≠ 0. Consider two equations derived from this one;
multiplying both sides of (21.1) by A and using the eigenvector property
gives

    λm vm = a1 λ1 v1 + · · · + aℓ λℓ vℓ ,    (21.2)

while multiplying both sides of (21.1) by λm gives

    λm vm = a1 λm v1 + · · · + aℓ λm vℓ .    (21.3)
We saw earlier that similar matrices have the same determinant and
trace. They also have the same characteristic polynomial, eigenvalues, and
there is a clear relationship between their eigenspaces.
Theorem 21.6. If A and B are similar via a change-of-coordinates matrix
Q – that is, if B = QAQ−1 – then A and B have the same characteristic
polynomial. Thus every eigenvalue of A is also an eigenvalue of B, with the
same algebraic multiplicity. Moreover, the geometric multiplicities agree,
and if EλA = NA−λI is the eigenspace for A corresponding to λ, then EλA
and EλB are related by EλB = Q(EλA ).
using the fact that similar matrices have the same determinant. This im-
mediately implies that A and B have the same eigenvalues, with the same
algebraic multiplicities. Now given an eigenvalue λ for A and the corre-
sponding eigenspace EλA , we see that for any v ∈ EλA , we have
then Q = Iβα . This is because Q has the property that Qej = vj for each j,
and similarly Q−1 vj = ej . Let D = Q−1 AQ. Then
Because Dej gives the jth column of D, we see that D is diagonal with
entries λ1 , . . . , λn .
21.3 Multiplicities
We observed above that the algebraic and geometric multiplicities can dif-
fer. However, there is a universal relationship between them: the geometric
multiplicity is always less than or equal to the algebraic multiplicity. We
need the following exercise.
Exercise 21.10. Let A be a square matrix with the block form A = ( X Y ; 0 Z ),
where X, Z are square matrices and Y, 0 have the appropriate dimensions.
Show that det(A) = det(X) det(Z).
Lecture 22 Mon. Nov. 18
Further reading:
pA (λ) = det(λI − A) = (λ − λ1 ) · · · (λ − λn )
This gives the desired form for an−1 . For a0 , we observe that the only term
in (22.1) without any factor of t comes from choosing −Aj,π(j) in every term
of the product, and so
    a0 = Σ_{π∈Sn} sgn π · Π_{j=1}^n (−Aj,π(j) ) = (−1)ⁿ det A.
q(A) = c0 I + c1 A + c2 A² + · · · + cd A^d ,
Since the left-hand side is not invertible, it follows that one of the factors
A − νi I is non-invertible – in particular, some νi is an eigenvalue of A.
Because νi is a root of q(t) − λ, we see that q(νi ) − λ = 0, so λ = q(νi ).
Lecture 23 Wed. Nov. 20
Further reading:
Proof of Theorem 23.1. First we recall (20.3), which stated that for an in-
vertible matrix A, we have (A⁻¹)ij = (−1)^{i+j} (det Ãji )/(det A). By using the same
computations as we used in the proof of this formula, we can show that the
matrix B defined by Bij = (−1)^{i+j} det Ãji has the property that
(B is sometimes called the adjugate of A.) We will apply this fact to the
matrices Q(t) = tI − A, where t ∈ K. Let P (t) be the matrix whose entries
are given by
    P(t)ij = (−1)^{i+j} det Q̃(t)ji ;
for some coefficients Ck,ij ∈ K, and then letting Ck ∈ Mn×n be the matrix
with coefficients Ck,ij , we get
    P(t) = Σ_{k=0}^{n−1} Ck t^k .
(Recall that the top coefficient is always 1.) Then since the previous two
equations hold for all t ∈ K, the coefficients of tk must agree, giving
Cn−1 = I,
Cn−2 − Cn−1 A = an−1 I,
Cn−3 − Cn−2 A = an−2 I,
···
C0 − C1 A = a1 I
−C0 A = a0 I.
Now multiply both sides of the first line by An (from the right), both sides
of the second by An−1 , and so on. We get
    Cn−1 A^n = A^n ,
    Cn−2 A^{n−1} − Cn−1 A^n = an−1 A^{n−1} ,
    Cn−3 A^{n−2} − Cn−2 A^{n−1} = an−2 A^{n−2} ,
    ···
    C0 A − C1 A² = a1 A,
    −C0 A = a0 I.
Adding up all these equations, we see that the left side is 0 and the right
side is pA (A).
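A quick numerical check of the resulting statement pA (A) = 0 (the Cayley–Hamilton theorem), using numpy and an arbitrary 3 × 3 matrix:

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

    coeffs = np.poly(A)   # coefficients of p_A(t) = det(tI - A), leading coefficient 1
    n = A.shape[0]
    pA_of_A = sum(c * np.linalg.matrix_power(A, n - k) for k, c in enumerate(coeffs))
    print(np.allclose(pA_of_A, np.zeros((n, n))))  # True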
    λ1,2 = ½ ( Tr A ± √((Tr A)² − 4 det A) )
         = ½ ( a + d ± √(a² + 2ad + d² − 4(ad − bc)) )    (23.3)
         = ½ ( a + d ± √((a − d)² + 4bc) ).
Note that (23.4) is the sum of the diagonal matrix λI and the nilpotent
matrix ( 0 1 ; 0 0 ). It can be shown that this holds for matrices of any size – any
matrix can be written as the sum of a diagonalisable matrix and a nilpotent
matrix.
Theorem 23.2 has an important application: it lets us compute powers
of 2 × 2 matrices with relatively little fuss. Indeed, suppose A ∈ M2×2 is
such that D = Q−1 AQ is diagonal for some invertible Q. Then we have
A = QDQ−1 , and so
    A^N = (QDQ⁻¹)^N = (QDQ⁻¹)(QDQ⁻¹) · · · (QDQ⁻¹)   (N factors)
        = QD^N Q⁻¹ .
    A^N = QJ^N Q⁻¹ = Q(λ^N I + N λ^{N−1} B)Q⁻¹ = Q ( λ^N   N λ^{N−1} ; 0   λ^N ) Q⁻¹ .
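A numerical check of the Jordan-block power formula, with B = ( 0 1 ; 0 0 ) nilpotent (B² = 0) and arbitrary λ and N:

    import numpy as np

    lam, N = 1.5, 7
    B = np.array([[0.0, 1.0], [0.0, 0.0]])
    J = lam * np.eye(2) + B

    direct = np.linalg.matrix_power(J, N)
    formula = lam**N * np.eye(2) + N * lam**(N - 1) * B
    print(np.allclose(direct, formula))  # True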
Lecture 24 Mon. Nov. 25
Further reading:
t² − 2qt cos θ + q² ,
    P = [w | w̄] = [v1 | v2 ] ( i −i ; 1 1 ) = ST,

where T = ( i −i ; 1 1 ) is invertible. Hence S is invertible as well, so v1 , v2 form
a basis for R². Note that v1 = (1/2i)(w − w̄) and v2 = (1/2)(w + w̄).
−1
We claim that B = S AS hasthe form (24.1). Indeed, observing that
S = P T −1 and that T −1 = 12 −i 1
i 1 , we have
a + ib 0
B = S −1 AS = T P −1 AP T −1 = T T −1
0 a − ib
1 i −i a + ib 0 −i 1
=
2 1 1 0 a − ib i 1
1 ai − b −ai − b −i 1
=
2 a + ib a − ib i 1
24.1. WORKING OVER R 129
a −b
=
b a
where λ, µ, q, θ ∈ R.
In fact, one can determine which of the three cases in (24.2) occurs with
a quick look at pA (t) and A itself: the first case occurs if and only if pA (t)
has distinct real roots or is a scalar multiple of the identity matrix; the
second case occurs if and only if pA (t) has a repeated real root and is not a
scalar multiple of the identity; the third case occurs if and only if pA (t) has
complex roots.
The situation for larger matrices is more complicated, but in some
sense mimics the behaviour seen here. We observe that Theorem 24.2 can
also be formulated for linear transformations: if V is a vector space over R
with dimension 2 and T ∈ L(V ) is any linear operator, then there is a basis
β for V such that [T ]β has one of the forms in (24.2).
We saw in the previous section how to compute powers of A when it is
similar to one of the first two cases in (24.2). In fact, if we want to compute
A^N for a real matrix then even when A has complex eigenvalues we can still
use the diagonal form ( λ 0 ; 0 µ ), because the complex parts of Q, D^N , and Q⁻¹
will ultimately all cancel out in the product A^N = QD^N Q⁻¹ . Nevertheless,
it is worth pointing out that powers of the scaled rotation matrix also have
a nice formula.
Proposition 24.3. Writing Rθ = ( cos θ  −sin θ ; sin θ  cos θ ), we have
(qRθ )^N = q^N R_{Nθ} for every N ∈ N.
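A quick numerical check of this formula in numpy (arbitrary q, θ, and N, chosen only for illustration):

    import numpy as np

    def R(theta):
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, theta, N = 1.3, 0.7, 6
    lhs = np.linalg.matrix_power(q * R(theta), N)
    rhs = q**N * R(N * theta)
    print(np.allclose(lhs, rhs))  # True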
This is a system of two first order ODEs, which are coupled. If we decou-
ple them by diagonalising A, then we can solve each independently. The
characteristic polynomial of A is
and so writing D = ( λ 0 ; 0 µ ), this becomes
These can be solved independently. (Note that if the roots are complex,
then we will need to take appropriate linear combinations of the complex
solutions z1,2 to obtain real solutions.)
But what if A does not diagonalise? If (24.5) has a repeated root, then
because A is not a scalar multiple of the identity we know that we will get
A = QJQ⁻¹ , where J = ( λ 1 ; 0 λ ), and so the change of coordinates z = Q⁻¹x turns the system into
ż1 = λz1 + z2 ,
ż2 = λz2 .
In the particular case g(t) = z2 (0)eλt , we see that g(s)e−λs = z2 (0) is con-
stant, and so
z1 (t) = z1 (0)eλt + z2 (0)teλt .
Thus the source of the solution teλt when there is a repeated root is the fact
that an eigenvalue with mg < ma leads to a Jordan block. Another way of
viewing this is the following. Let V be the vector space of smooth (infinitely
many times differentiable) functions from R to R, and let D ∈ L(V ) be
the differentiation operator. Then the ODE ẍ + aẋ + bx = 0 becomes
(D2 + aD + bI)(x) = 0. When there is a repeated root we have
s² + as + b = (s − λ)²  ⇒  D² + aD + bI = (D − λI)² ,
and so solving the ODE becomes a question of finding the null space of
(D − λI)². The null space of D − λI consists of the eigenfunctions of the differentiation
operator with eigenvalue λ, namely all scalar multiples of eλt , but then we must also consider
generalised eigenfunctions, which are in the null space of (D − λI)2 but not
(D − λI). We see that
    d/dt (teλt ) = λ(teλt ) + eλt .
Writing x(t) = eλt and y(t) = teλt , this can be rewritten as Dy = λy + x and Dx = λx.
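A small symbolic check (using sympy) that teλt is annihilated by (D − λI)² but not by D − λI:

    import sympy as sp

    t, lam = sp.symbols('t lam')
    y = t * sp.exp(lam * t)

    first = sp.simplify(sp.diff(y, t) - lam * y)           # (D - lam I) y
    second = sp.simplify(sp.diff(first, t) - lam * first)  # (D - lam I)^2 y
    print(first, second)  # exp(lam*t) 0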
Lecture 25 Mon. Dec. 2
Markov chains
Further reading:
(Graph of the two-city Markov chain, with an arrow for each of the four transitions and its probability.)
The arrows in the graph illustrate the four “transitions” that can occur:
“staying in Houston”; “moving from Houston to Dallas”; “moving from
Dallas to Houston”; “staying in Dallas”. The number with each arrow gives
the probability of that transition occurring.
The circles are called “vertices” of the graph, and the arrows are called
“edges”. If we label the vertices 1 and 2, then the graph drawn above has
the following structure:
P12
P11 1 2 P22
P21
Here Pij is the probability that a person in city i this year will be in city j
next year. We can represent all the information in the graph by writing the
2 × 2 matrix P whose entries are Pij :
    P = ( P11 P12 ; P21 P22 ) = ( .95 .05 ; .1 .9 ).
w1 = .95v1 + .1v2
w2 = .05v1 + .9v2 ,
Thus next year’s population distribution can be obtained from this year’s
by right multiplication by P . Iterating this procedure, the distribution two
years from now will be wP = vP², three years from now will be vP³, and
more generally, the population distribution after n years will be vPⁿ.
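A short numpy sketch of this iteration (the starting distribution is arbitrary); the vectors vPⁿ converge to the stationary distribution (2/3, 1/3):

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
    v = np.array([0.5, 0.5])   # an arbitrary starting distribution

    for n in [1, 5, 20, 100]:
        print(n, v @ np.linalg.matrix_power(P, n))
    # the printed vectors approach (2/3, 1/3), the left eigenvector of P for eigenvalue 1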
A system like the one just described is called a Markov chain. More
precisely, a Markov chain is a collection of states together with a collection
of transition probabilities. The states are usually encoded with the labels
1, 2, . . . , d, and then the transition probabilities Pij are viewed as the entries
of a d × d matrix P , where Pij is the probability of making a transition from
state i to state j. A Markov chain can also be encoded by a directed graph
such as the one above – that is, a collection of vertices and edges, where the
vertices correspond to states of the system, and each edge is an arrow from
one vertex to another labelled with the probability of the corresponding
transition.
Note that every entry Pij lies in the interval [0, 1]. In fact, something
more can be said about the transition probabilities Pij . If we fix i, then
Pi1 + Pi2 represents the probability of going from state i to either of states
1 or 2; similarly, Pi1 + Pi2 + Pi3 is the probability of going from state i to
any of the states 1, 2, 3, and so on. In particular, Σ_{j=1}^d Pij = 1, since we go
somewhere with probability 1.
(Graph of a three-state Markov chain with states "unbonded", "bonded", and
"transformed"; the labelled transition probabilities are .6, .4, .5, .3, .2, and 1.)
and so at the next time step the system is in state 1 with probability Pi1 ,
state 2 with probability Pi2 , and so on. More generally, if the current dis-
tribution of the system is represented by a probability vector v, and the
distribution at the next time step is represented by a probability vector w,
then we have
    wj = P(state j tomorrow)
       = Σ_{i=1}^d P(state j tomorrow given state i today) · P(state i today)
       = Σ_{i=1}^d vi Pij = (vP )j .
i=1
where the first equality is matrix multiplication, the second is just rearrang-
ing terms, the third is because P is a stochastic matrix, and the fourth is
because v is a probability vector.
    e1 = a1 v1 + a2 v2 + a3 v3 ,
    e1 P^n = a1 v1 P^n + a2 v2 P^n + a3 v3 P^n
           = a1 λ1^n v1 + a2 λ2^n v2 + a3 λ3^n v3
           = a1 e3 + (a2 λ2^n v2 + a3 λ3^n v3 ).
As n → ∞, the quantity inside the brackets goes to 0 because |λ2 |, |λ3 | < 1,
and so e1 P n → a1 e3 . Because e1 P n is a probability vector for all n, we see
that a1 = 1, and so e1 P n → e3 . In terms of the original example, this is
the statement that at large times, the molecule has almost certainly been
transformed.
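To see this convergence numerically one needs the transition matrix itself; the matrix used below is an assumption, chosen only to be consistent with the three-state graph above (state 3, "transformed", is absorbing):

    import numpy as np

    # assumed transition matrix for the unbonded / bonded / transformed chain
    P = np.array([[0.6, 0.4, 0.0],
                  [0.3, 0.5, 0.2],
                  [0.0, 0.0, 1.0]])
    e1 = np.array([1.0, 0.0, 0.0])

    for n in [1, 10, 50, 200]:
        print(n, e1 @ np.linalg.matrix_power(P, n))
    # e1 P^n -> e3 = (0, 0, 1): at large times the molecule has almost
    # certainly been transformed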
an eigenvalue, and all other eigenvalues λ have |λ| < 1. Thus writing the
initial probability distribution as v = w + x, where w is a left eigenvector
for 1 and x is a sum of generalised eigenvectors for other eigenvalues, one
gets
    vP^n = wP^n + xP^n = w + xP^n → w;
the term xP n goes to 0 because all other eigenvalues have absolute value
less than 1. Consequently, it is the eigenvector corresponding to the largest
eigenvalue 1 that governs the long-term behaviour of the system. Moreover,
this eigenvector w has the property that w = wP , meaning that w is a
stationary probability distribution for the Markov chain: if w describes the
probability distribution of the system at the present time, then it will also
describe the probability distribution of the system at all future times.
Lecture 26 Wed. Dec. 4
Further reading:
To turn the informal discussion from the end of the last lecture into
a precise result, we need a little more terminology. Consider a Markov
chain with states {1, . . . , d} and transition matrix P ∈ Md×d . Let G be
the corresponding graph, where we only draw edges that correspond to a
transition with a non-zero probability of occurring. (Thus in the example of
the Michaelis–Menten kinetics, we do not draw the edge from “unbonded”
to “transformed”, or the edges from “transformed” to “bonded” or “un-
bonded”.)
Definition 26.1. A path in a directed graph is a sequence of edges such
that each edge starts at the vertex where the previous one terminated. A
Markov chain is irreducible if given any two states i and j, there is a path
that starts at vertex i and terminates at vertex j.
Example 26.2. The Markov chain for the Dallas–Houston example is irre-
ducible, since every transition happens with a non-zero probability, and so
the graph contains all possible edges. The Markov chain for the Michaelis–
Menten kinetics is not irreducible, since there is no path from “transformed”
to either of the other vertices.
Proposition 26.3. A Markov chain is irreducible if and only if its transition
matrix P has the property that for every 1 ≤ i, j ≤ d, there is some n ∈ N
such that (P n )ij > 0.
Proof. (⇒). Given irreducibility, for every i, j there is a path in the graph
that connects i to j. Let the vertices of this path be i0 , i1 , . . . , in , where
i0 = i and in = j. Then because each ik → ik+1 is required to be an edge in
the graph, we must have Pik ik+1 > 0, and so
    (P^n)ij = Σ_{j1=1}^d Σ_{j2=1}^d · · · Σ_{jn−1=1}^d Pij1 Pj1j2 · · · Pjn−1j    (26.1)
            ≥ Pi0i1 Pi1i2 · · · Pin−1in > 0.
(⇐). Given any i, j and n such that (P n )ij > 0, the first line of (26.1) shows
that there exist j1 , . . . , jn−1 such that Pij1 > 0, Pj1 j2 > 0, . . . , Pjn−1 j > 0.
Thus the graph contains an edge from i to j1 , from j1 to j2 , and so on
through to jn−1 to j. This gives a path from i to j, and since this holds for
all i, j, we conclude that the Markov chain is irreducible.
4. Let Q be the d × d matrix whose row vectors are all equal to v. Then
limn→∞ P^n = Q. Equivalently, wP^n → v for every probability vector
w.
Exercise 26.7. Show that for any stochastic matrix, the column vector whose
entries are all equal to 1 is a right eigenvector for the eigenvalue 1.
There is also a version of this theorem that works for irreducible matri-
ces (it is slightly more complicated in this case because P may have other
eigenvalues with |λ| = 1). The Perron–Frobenius theorem tells us that for
a primitive (or more generally, irreducible) Markov chain, the long-term
behaviour is completely governed by the eigenvector corresponding to the
eigenvalue 1.
Example 26.8. Consider a mildly inebriated college student who is watch-
ing the Super Bowl on TV, but doesn’t really care for football, and so at any
given point in time might decide to leave the game and go get pizza, or go
make another gin and tonic (his drink of choice, which he thinks goes really
well with mushroom and anchovy pizza), or step over for a little while to the
outdoor theatre next to his apartment, where the symphony is performing
Beethoven’s seventh. Suppose that every ten minutes, he either moves to a
new location or stays where he is, with transition probabilities as shown in
the following graph.
(Graph of a four-state Markov chain with states "football", "pizza", "gin &
tonic", and "Beethoven"; the labelled transition probabilities include .6, .2, .7,
.4, .3, .9, .5, and several .1's.)
Notice that some transition probabilities are 0: he never stays at the pizza
place for more than ten minutes, and it doesn’t take longer than ten minutes
to make his next gin & tonic; similarly, after either of those he always goes
to either football or Beethoven, and he never goes from Beethoven directly
to football without first getting either pizza or a drink. Nevertheless, one
can check that P 2 has all positive entries, and so P is primitive.
Numerical computations show that the largest eigenvalue (in absolute
value) is 1, and that there are three other (distinct) eigenvalues, each with
|λ| < 1/2. The left eigenvector corresponding to the largest eigenvalue is
v ≈ ( .39  .21  .07  .33 ), so that in the long run the student will spend 39%
of his time watching football, 21% of his time getting pizza, 7% of his time
making himself a gin & tonic, and 33% of his time at the symphony.
It is also instructive to compute some powers of P and compare them to
the matrix Q whose row vectors are all equal to v. We find that
           .42 .20 .08 .3                .39 .21 .07 .33
           .45 .17 .06 .32               .39 .21 .07 .33
    P³ ≈   .31 .20 .05 .44   ,    P⁶ ≈   .38 .21 .07 .34
           .32 .24 .08 .36               .38 .21 .07 .34
Now let v ∈ Rn be a probability vector that encodes our current best guess
as to the relative "importance" of the websites: that is, website i has a
number vi between 0 and 1 attached to it, and Σ_i vi = 1. According to
the above principle (importance being proportional to importance of linking
sites), the “correct” v should have the property that
    vj = Σ_{i=1}^N vi Pij ,
where we think of vi Pij as the amount of the site i’s importance that it
passes to the site j. This equation is just the statement that v = vP , that
is, that v is an eigenvector for P with eigenvalue 1. By the Perron–Frobenius
theorem, there is a unique such eigenvector (as long as P is primitive), and
we can use this to rank the websites: the highest-ranked website is the one
for which vi is largest, and so on.
As an illustration, consider the following very small internet, which only
has five websites.
(Graph with five websites A, B, C, D, E; the links between them are recorded
in the matrix L below.)
The matrix L described above, where Lij is the number of links from i to j,
is
        0 1 0 1 0
        1 0 1 0 1
    L = 0 1 0 0 1
        0 1 0 0 1
        1 0 0 0 0 .
To get the stochastic matrix P , we have to normalise each row of L to be a
probability vector, obtaining
        0    1/2  0    1/2  0
        1/3  0    1/3  0    1/3
    P = 0    1/2  0    0    1/2
        0    1/2  0    0    1/2
        1    0    0    0    0 .
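A short numpy sketch for this five-site example: build P by normalising the rows of L and run the iteration vPⁿ, which converges to the ranking vector v (the approximate values in the comment follow from this particular L):

    import numpy as np

    L = np.array([[0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 0, 0, 0, 0]], dtype=float)
    P = L / L.sum(axis=1, keepdims=True)   # normalise each row to sum to 1

    v = np.full(5, 1 / 5)                  # start from the uniform distribution
    for _ in range(1000):                  # power iteration: v P^n -> v with v P = v
        v = v @ P
    print(np.round(v, 3))  # approximately [0.294 0.265 0.088 0.147 0.206] for sites A-E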
of where the monkey currently is, then wP describes the probability distri-
bution of where he will be after he follows the next link. After n clicks, his
likely location is described by wP n , which by the Perron–Frobenius theorem
converges to the eigenvector v.
In practice, some tweaks to the process described here are needed to deal
with the real internet - for example, if there is a website that doesn’t link
anywhere, then P is not primitive and the monkey would simply get trapped
at that site, even if it’s not really that important a website. Thus one has
to add a small probability of “resetting” at any given time. And there are
other issues as well – but this approach via Markov chains is a very powerful
one and illustrates the basic idea.