Lecture Notes
Math 4377/6308 - Advanced Linear Algebra I
Vaughn Climenhaga
December 3, 2013
The primary text for this course is “Linear Algebra and its Applications”,
second edition, by Peter D. Lax (hereinafter referred to as [Lax]). The
lectures will follow the presentation in this book, and many of the homework
exercises will be taken from it.
You may occasionally find it helpful to have access to other resources
that give a more expanded and detailed presentation of various topics than
is available in Lax’s book or in the lecture notes. To this end I suggest the
following list of external references, which are freely available online.
The books listed above can all be obtained freely via the links provided.
(These links are also on the website for this course.) Another potentially
useful resource is the series of video lectures by Gilbert Strang from MIT’s
Open CourseWare project: http://ocw.mit.edu/courses/mathematics/
18-06-linear-algebra-spring-2010/video-lectures/
If the title seems strange, it may help to be aware that there is a relatively famous
textbook by Sheldon Axler called “Linear Algebra Done Right”, which takes a different
approach to linear algebra than do many other books, including the ones here.
Lecture 1 Monday, Aug. 26
Further reading: [Lax] Ch. 1 (p. 1–4). See also [Bee] p. 317–333; [CDW]
Ch. 5 (p. 79–87); [Hef ] Ch. 2 (p. 76–87); [LNS] Ch. 4 (p. 36–40); [Tre]
Ch. 1 (p. 1–5)
The last example has nested quantifiers: the quantifier “∃” occurs inside
the statement to which “∀” applies. You may find it helpful to interpret such
nested statements as a game between two players. In this example, Player
A has the goal of making the statement x + y = 4 (the innermost statement)
be true, and the game proceeds as follows: first Player B chooses a number
x ∈ R, and then Player A chooses y ∈ R. If Player A’s choice makes it so
that x + y = 4, then Player A wins. The statement in the example is true
because Player A can always win.
To parse such statements it may also help to use parentheses: the state-
ment in Example 1.2 would become “∀x ∈ R (∃y ∈ R such that (x + y = 4))”.
Playing the game described above corresponds to parsing the statement from
the outside in. This is also helpful when finding the negation of the state-
ment (informally, its opposite).
Example 1.3. The negations of the three statements in Example 1.1 are
1. ∀x ∈ R we have x + 2 ≠ 7.
2. ∃x ∈ R such that x + 2 ≠ 7.
Notice the pattern here: working from the outside in, each ∀ is replaced
with ∃, each ∃ is replaced with ∀, and the innermost statement is negated (so
= becomes ≠, for example). You should think through this to understand
why this is the rule.
of a vector space, and is sufficient for many applications, but there are also
many other applications where it is important to take the lessons from that
first course and re-learn them in a more abstract setting.
What do we mean by “a more abstract setting”? The idea is that we
should look at vectors in Rn and the things we did with them, and see
exactly what properties we needed in order to use the various definitions,
theorems, techniques, and algorithms we learned in that setting.
So for the moment, think of a vector as an element of Rn . What can we
do with these vectors? A moment’s thought recalls several things:
1. we can add vectors together;
The properties in the list above are the axioms of a vector space. They
hold for Rn with the usual definition of addition and scalar multiplication.
Indeed, this is in some sense the motivation for this list of axioms: they for-
malise the properties that we know and love for the example of row/column
vectors in Rn . We will see that these properties are in fact enough to let us
do a great deal of work, and that there are plenty of other things besides
Rn that satisfy them.
Remark 1.5. Some textbooks use different font styles or some other typo-
graphic device to indicate that a particular symbol refers to a vector, instead
of a scalar. For example, one may write x or ~x instead of x to indicate an
element of a vector space. By and large we will not do this; rather, plain
lowercase letters will be used to denote both scalars and vectors (although
the same symbol 0 will be used for both the zero vector and the zero scalar). It will always
be clear from context which type of object a letter represents: for example,
in Definition 1.4 it is always specified whether a letter represents a vector
(as in x ∈ X) or a scalar (as in a ∈ R). You should be very careful when
reading and writing mathematical expressions in this course that you are
always aware of whether a particular symbol stands for a scalar, a vector,
or something else.
Before moving on to some examples, we point out that one may also
consider vector spaces over C, the set of complex numbers; here the scalars
may be any complex numbers. In fact, one may consider any field K and
do linear algebra with vector spaces over K. This has many interesting
applications, particularly if K is taken to be a finite field, but these examples
lie beyond the scope of this course, and while we will often say “Let X be
a vector space over the field K”, it will always be the case in our examples
that K is either R or C. Thus we will not trouble ourselves here with the
general abstract notion of a field.
2. 0x = 0 for all x ∈ X.
1.4 Examples
The most familiar examples are the following.
Example 1.9. Let X be the set of all functions x(t) satisfying the differen-
tial equation ẍ + x = 0. If x and y are solutions, then so is x + y; similarly,
if x is a solution then cx is a solution for every c ∈ R. Thus X is a vector
space. If p is the initial position and v is the initial velocity, then the pair
(p, v) completely determines the solution x(t). The correspondence between
the pair (p, v) ∈ R2 and the solution x(t) is an isomorphism between R2 and
X.
Example 1.11. Let F (R, R) be the set of all functions from R → R, with
addition and scalar multiplication defined in the natural way (pointwise) by
(f + g)(x) = f (x) + g(x) and (cf )(x) = c(f (x)). Then F (R, R) is a vector
space. It contains several other interesting vector spaces.
1. Let C(R) be the subset of F (R, R) that contains all continuous func-
tions.
2. Let L1 (R) be the subset of F (R, R) that contains all integrable functions: that is, L1 (R) = {f : R → R | ∫_{−∞}^{∞} |f (x)| dx < ∞}.
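As a computational aside (a minimal Python sketch, not taken from [Lax]; the function names are just for illustration), the pointwise operations that make F (R, R) a vector space can be written out directly.

import math

# Pointwise operations on functions R -> R, as in Example 1.11.
def add(f, g):
    return lambda x: f(x) + g(x)      # (f + g)(x) = f(x) + g(x)

def scale(c, f):
    return lambda x: c * f(x)         # (c f)(x) = c f(x)

h = add(scale(2.0, math.sin), math.cos)   # the function x -> 2 sin x + cos x
print(h(0.0))                             # 1.0, since 2 sin 0 + cos 0 = 1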
Further reading: [Lax] Ch. 1 (p. 4–5); see also [Bee] p. 334–372; [CDW]
Ch. 9–10 (p. 159–173); [Hef ] Ch. 2 (p. 87–108); [LNS] Ch. 4–5 (p. 40–54);
[Tre] Ch. 1 (p. 6–9, 30–31)
x = 1 · x = (1 + 0) · x = 1 · x + 0 · x = x + 0 · x. (2.1)
The first and last equalities use the final axiom (multiplication by the
unit), the second equality uses properties of real numbers, and the
third equality uses the axiom of distributivity. Now by the axiom on
existence of additive inverses, we can add (−x) to both sides and get
0 = x + (−x) = (x + 0 · x) + (−x) = (0 · x + x) + (−x) = 0 · x + (x + (−x)) = 0 · x + 0 = 0 · x,
where the first equality is the property of additive inverses, the second
is from (2.1), the third is from commutativity of addition, the fourth
is from associativity of addition, the fifth is the property of additive
inverses again, and the last equality is the property of the zero vector.
−x = 0 + (−x) = (x + y) + (−x)
= (y + x) + (−x) = y + (x + (−x)) = y + 0 = y,
where the first equality uses the axiom on the zero vector, the second
comes from the equality x + y = 0, the third uses commutativity of
addition, the fourth uses associativity of addition, the fifth uses the
property of additive inverses, and the last once again uses the property
of the zero vector. Armed with this fact on uniqueness, we can now
observe that
2.2 Subspaces
Let’s move on to something a little less bland and more concrete. Recalling
our examples from the previous lecture, we see that it is often the case that
one vector space is contained inside another one. For example, Pn ⊂ Pn+1 .
Or recall Example 1.11:
• F (R, R) = {functions R → R}
• C(R) = {f ∈ V | f is continuous}
• C 1 (R) = {f ∈ V | f is differentiable}
Definition 2.1. When X and V are vector spaces with X ⊂ V , we say that
X is a subspace of V .
In each of these cases one can check that the operations of addition and
multiplication from the ambient vector space (R2 , R3 , or F (R, R)) define a
vector space structure on the given subset, and so it is indeed a subspace.
We omit the details of checking all the axioms, since we are about to learn
a general fact that implies them.
Here is a convenient fact. In general, if we have a set X with two binary
operations (addition and multiplication), and want to check that this is a
vector space, we must verify the list of axioms from the previous lecture.
When X is contained in a vector space V , life is easier: to check that a
non-empty set X ⊂ V is a subspace, we only need to check the following
two conditions:
1. x + y ∈ X for every x, y ∈ X (closure under addition);
2. cx ∈ X for every x ∈ X and c ∈ K (closure under scalar multiplication).
If these two conditions are satisfied, then the fact that the axioms from
the previous lecture hold for X can be quickly deduced from the fact that
they hold for V . For example, since addition is commutative for all pairs
of elements in V , it is certainly commutative for all pairs of elements in
the subset X. Similarly for associativity of addition and multiplication,
distributivity, and multiplication by the unit. The only axioms remaining
are existence of the identity element 0 and additive inverses. To get these,
we recall from the previous lecture that 0 = 0x and −x = (−1)x for any x; since X is closed under scalar multiplication, this gives 0 ∈ X and −x ∈ X for every x ∈ X.
Now the fact that the sets in Example 2.2 are subspaces of R2 , R3 , and
V , respectively, can be easily checked by observing that each of these sets
is closed under addition and scalar multiplication.
1. If (x, y) and (x′, y′) are in X1 and c ∈ R, then (cx + x′, cy + y′) has
(cx + x′) + (cy + y′) = c(x + y) + (x′ + y′) = c · 0 + 0 = 0.
2. If (x, y, z), (x′, y′, z′) ∈ X2 and c ∈ R, then (cx + x′, cy + y′, cz + z′)
has third component equal to cz + z′ = c · 0 + 0 = 0, so it is in X2 .
3. If x, y : R → R are in X3 and c ∈ R, then (d2/dt2)(cx + y) = cẍ + ÿ = −(cx + y), so cx + y ∈ X3 .
4. If f, g ∈ X4 and c ∈ R, then we check that cf + g is 2π-periodic:
(cf + g)(t + 2π) = cf (t + 2π) + g(t + 2π) = cf (t) + g(t) = (cf + g)(t).
The first two examples should be familiar to you from a previous linear
algebra course, since both X1 and X2 are the solution set of a system of
linear equations (in fact, a single linear equation) in Rn . What may not be
immediately apparent is that X3 and X4 are in fact examples of exactly this
same type: we define a condition which is linear in a certain sense, and then
consider all elements of the vector space that satisfy this condition. This
will be made precise later when we consider null spaces (or kernels) of linear
transformations.
Remark 2.4. All of the subspaces considered above are described implicitly;
that is, a condition is given, and then the subspace X is the set of all
elements of V that satisfy this condition. Thus if I give you an element of
V , it is easy for you to check whether or not this element is contained in the
subspace X. On the other hand, it is not necessarily so easy for you to give
me a concrete example of an element of X, or a method for producing all
the elements of X. Such a method would be an explicit description of X,
and the process of solving linear equations may be thought of as the process
of going from an implicit to an explicit description of X.
What does the word ‘reasonable’ mean in the above remark? Well,
consider the following example. The vector space R2 is spanned by the set
{(1, 0), (0, 1), (1, 1)}. But this set is somehow redundant, since the smaller
set {(1, 0), (0, 1)} also spans R2 . So in finding spanning sets, it makes sense
to look for the smallest one, which is somehow the most efficient. You should
recall from your previous linear algebra experience that in the case of Rn ,
the following condition is crucial.
Definition 2.14. A set {x1 , . . . , xk } is linearly dependent if some non-
trivial linear combination yields the zero vector: that is, if there are scalars
c1 , . . . ck ∈ K such that not all of the cj are 0, and c1 x1 + · · · + ck xk = 0. A
set is linearly independent if it is not linearly dependent.
Recall how the notions of span and linear dependence manifest them-
selves in Rn . Given a set S = {x1 , . . . , xk }, the task of checking whether
or not x ∈ span S reduces to solving the non-homogeneous system of lin-
ear equations c1 x1 + · · · + ck xk = x. If this system has a solution, then
x ∈ span S. If it has no solution, then x ∉ span S. (Note that span S is an
explicitly described subspace, and now the difficult task is to check whether
or not a specific vector is included in the subspace, which was the easy task
when the subspace was implicitly defined.)
Similarly, you can check for linear dependence in Rn by writing the
condition c1 x1 + · · · + ck xk = 0 as a system of linear equations, and using
row reduction to see if it has a non-trivial solution. A non-trivial solution
corresponds to a non-trivial representation of 0 as a linear combination of
vectors in S, and implies that S is linearly dependent. If there is only the
trivial solution, then S is linearly independent.
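As a numerical illustration of both checks (a sketch assuming numpy; not part of the original notes): membership in span S is solvability of a linear system, and linear independence is full column rank of the matrix whose columns are the vectors of S.

import numpy as np

# Columns of M are the vectors x1, x2 of S, here in R^3.
M = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 1.0]])
x = np.array([1.0, 3.0, 1.0])                 # is x in span S?

c, _, rank, _ = np.linalg.lstsq(M, x, rcond=None)
print("x in span S:", np.allclose(M @ c, x))  # True, since x = x1 + x2
print("S independent:", rank == M.shape[1])   # True: full column rank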
Example 2.15. Consider the polynomials f1 (x) = x+1, f2 (x) = x2 −2, and
f3 (x) = x + 3 in P2 . These span P2 and are linearly independent. Indeed,
given any polynomial g ∈ P2 given by g(x) = a2 x2 + a1 x + a0 , we can try to
write g = c1 f1 + c2 f2 + c3 f3 by solving
a2 x2 + a1 x + a0 = c1 (x + 1) + c2 (x2 − 2) + c3 (x + 3).
Comparing coefficients we see that this is equivalent to the system
a2 = c2 ,
a1 = c1 + c3 , (2.3)
a0 = c1 − 2c2 + 3c3 ,
which can be easily checked to have a solution (c1 , c2 , c3 ) for every choice
of a0 , a1 , a2 . Similarly, the homogeneous version of this system has only
the trivial solution, which shows that the polynomials f1 , f2 , f3 are linearly
independent.
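Here is a short numerical check of this example (a sketch assuming numpy; the sample polynomial is an arbitrary choice): the coefficient matrix of (2.3) is invertible, so every g ∈ P2 has unique coefficients c1 , c2 , c3 .

import numpy as np

# Columns hold the coefficients of f1 = x+1, f2 = x^2-2, f3 = x+3
# in the basis 1, x, x^2 of P2.
F = np.array([[1.0, -2.0, 3.0],
              [1.0,  0.0, 1.0],
              [0.0,  1.0, 0.0]])
print(np.linalg.det(F))          # 2.0, nonzero: (2.3) always has a unique solution

g = np.array([-5.0, 1.0, 4.0])   # g(x) = 4x^2 + x - 5, coefficients of 1, x, x^2
print(np.linalg.solve(F, g))     # [0. 4. 1.]: g = 0*f1 + 4*f2 + 1*f3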
Lecture 3 Wed. Sep. 4
Bases
Further reading: [Lax] Ch. 1 (p. 5–7); see also [Bee] p. 373–398; [CDW]
Ch. 11 (p. 175–182); [Hef ] Ch. 2 (p. 109–137); [LNS] Ch. 5 (p. 54–59);
[Tre] Ch. 1,2 (p. 6–7, 54–56)
Exercise 3.2. Show that an infinite set S is linearly dependent if and only
if it has a finite subset S 0 that is linearly dependent. Show that an infinite
set S is linearly independent if and only if every finite subset of S is linearly
independent.
Exercise 3.3. Let V be a vector space, let S ⊂ V be spanning, and let L ⊂ V
be linearly independent.
The next result is quite important: it says that linearly independent sets
cannot be bigger than spanning sets.
The base case of the induction is j = 0, which is trivial. For the induction
step, relabel the elements of S and L so that S′ = {y1 , . . . , yj , xj+1 , . . . , xn }.
Because S′ spans V , we can write
1. S is linearly dependent.
c1 x1 + c2 x2 + · · · + cn xn = 0.
x1 = −(c2 /c1 ) x2 − · · · − (cn /c1 ) xn ,
c1 x1 + · · · + cn xn − v = 0,
1. S is linearly independent.
x = a1 v1 + · · · + an vn = b1 w1 + · · · + bm wm
Bases are far from unique – there are many choices of basis possible for
a finite-dimensional space, as illustrated in Theorem 4.7 below. Sometimes,
as in Rn , there is a natural one, which is usually referred to as the standard
basis for that vector space: in Rn , we write ei for the vector whose ith
component is 1, with all other entries 0.
Theorem 3.9. If V has a finite basis, then all bases for V have the same
number of vectors.
Further reading: [Lax] Ch. 1 (p. 5–7); see also [Bee] p. 373–398; [CDW]
Ch. 11 (p. 175–182); [Hef ] Ch. 2 (p. 109–137); [LNS] Ch. 5 (p. 54–59);
[Tre] Ch. 1,2 (p. 6–7, 54–56)
4.1 Dimension
In the last lecture, we hinted that it is useful to have not just spanning sets,
but spanning sets that contain a minimum amount of redundancy. The
definition of basis makes this precise. The spanning property guarantees
that every v ∈ V can be written as a linear combination v = c1 x1 +· · ·+cn xn
of the basis elements, while the linear independence property together with
Proposition 3.6 guarantees that the coefficients c1 , . . . , cn are determined
uniquely by v and x1 , . . . , xn . We will come back to this important point
later in the lecture.
Exercise 4.1. Show that if {x, y} is a basis for X, then so is {x + y, x − y}.
Lemma 4.5. If V has a finite spanning set then it has a finite basis.
V = Y1 ⊕ · · · ⊕ Ym . (4.1)
Proof. We can build a finite basis for W as follows: take any nonzero vector
w1 ∈ W , so {w1 } is linearly independent. If {w1 } spans W then we have
found our basis; otherwise there exists w2 ∈ W \ span{w1 }. By Exercise
3.7, the set {w1 , w2 } is once again linearly independent. We continue in
this manner, obtaining linearly independent sets {w1 , . . . , wk } until we find
a basis.
As it stands, the above argument is not yet a proof, because it is possible
that the procedure continues indefinitely: a priori, it could be the case that
we get linearly independent sets {w1 , . . . , wk } for every k = 1, 2, 3, . . . , with-
out ever obtaining a spanning set. However, because the ambient subspace
V is finite-dimensional, we see that the procedure must terminate: by Propo-
sition 3.4, {w1 , . . . wk } cannot be linearly independent when k > dim V , and
so there exists k such that the procedure terminates and gives us a basis for
W.
This shows that W is finite-dimensional. To show that it has a com-
plement, we observe that by Theorem 4.7, the set {w1 , . . . , wk }, which is a
basis for W and hence linearly independent, can be completed to a basis
β = {w1 , . . . , wk , x1 , . . . , xm } for V . Let X = span{x1 , . . . , xm }. We claim
that every element v ∈ V can be written in a unique way as v = w + x,
where w ∈ W and x ∈ X. To show this, observe that because β is a basis
for V , there are unique scalars a1 , . . . , ak , b1 , . . . , bm ∈ K such that
v = (a1 w1 + · · · + ak wk ) + (b1 x1 + · · · + bm xm ).
Let w = a1 w1 + · · · + ak wk and x = b1 x1 + · · · + bm xm . Then v = w + x. Moreover, by Exercise
4.8 it suffices to check that W ∩ X = {0}. To see this, suppose that there
are coefficients aj , bi such that a1 w1 + · · · + ak wk = b1 x1 + · · · + bm xm . Bringing everything to
one side gives a1 w1 + · · · + ak wk + (−b1 )x1 + · · · + (−bm )xm = 0, and by linear independence of β
we have aj = 0 and bi = 0 for each i, j.
Further reading: [Lax] Ch. 1–2 (p. 7–15); see also [Tre] Ch. 8 (p. 207–214)
Furthermore, show that if v1 ≡ v2 mod Y , then cv1 ≡ cv2 mod Y for all
c ∈ K.
Given v ∈ V , let [v]Y = {w ∈ V | w ≡ v mod Y } be the congruence class
of v modulo Y . This is also sometimes called the coset of Y corresponding
to v.
Exercise 5.3. Given v1 , v2 ∈ V , show that [v1 ]Y = [v2 ]Y if v1 ≡ v2 mod Y ,
and [v1 ]Y ∩ [v2 ]Y = ∅ otherwise.
In Example 5.1, we see that [v]Y is the line through v with slope −1.
Thus every congruence class modulo Y is a line parallel to Y . In fact we see
that [v]Y = v + Y = {v + x | x ∈ Y }. This fact holds quite generally.
We see from this result that every congruence class modulo a subspace
is just a copy of that subspace shifted by some vector. Such a set is called
an affine subspace.
Recall once again our definition of setwise addition, and notice what
happens if we add two congruence classes of Y : given any v, w ∈ V , we have
[v]Y + [w]Y = (v + Y ) + (w + Y ) = (v + w) + Y = [v + w]Y ,
and similarly c[v]Y = c(v + Y ) = cv + Y = [cv]Y for every c ∈ K.
(If these relations are not clear, spend a minute thinking about what they
look like in the example above, where Y is the line through the origin with
slope −1.)
We have just defined a way to add two congruence classes together, and
a way to multiply a congruence class by a scalar. One can show that these
operations satisfy the axioms of a vector space (commutativity, associativity,
etc.), and so the set of congruence classes forms a vector space over K when
equipped with these operations. We denote this vector space by V /Y , and
refer to it as the quotient space of V mod Y .
Exercise 5.5. What plays the role of the zero vector in V /Y ?
In Example 5.1, V /Y is the space of lines in R2 with slope −1. Notice
that any such line is uniquely specified by its y-intercept, and so this is a
1-dimensional vector space, since it is spanned by the single element [(0, 1)]Y .
This shows the spanning property. Now we check for linear independence
by supposing that ck+1 , . . . , cn ∈ K are such that
ck+1 [vk+1 ]Y + · · · + cn [vn ]Y = [0]Y .
This is true if and only if ck+1 vk+1 + · · · + cn vn ∈ Y , in which case there are di such that
ck+1 vk+1 + · · · + cn vn = d1 y1 + · · · + dk yk . But now linear independence of the basis constructed
above shows that cj = 0 for all j. Thus {[vj ]Y | k + 1 ≤ j ≤ n} is a basis for
V /Y .
From this we conclude that dim(V /Y ) = n − k and dim Y = k, so that
their sum is (n − k) + k = n = dim V .
(c`)(x) = c(`(x)).
`s : V → R,    f ↦ f (s),
Integration over other domains gives other linear functionals, and there
are still more possibilities: C(R)′ is a very big vector space.
Proof. Let {v1 , . . . , vn } be a basis for V . Then every v ∈ V has a unique
representation as v = a1 v1 + · · · + an vn , where ai ∈ K. For each 1 ≤ i ≤ n, define
`i ∈ V ′ by
`i (v) = `i (a1 v1 + · · · + an vn ) = ai .
(Alternately, we can define `i by `i (vi ) = 1 and `i (vj ) = 0 for j ≠ i, then
extend by linearity to all of V .) We leave as an exercise the fact that
{`1 , . . . , `n } is a basis for V ′ .
Lecture 6 Mon. Sep. 16
Further reading: [Lax] p. 19–20. See also [Bee] p. 513–581, [CDW] Ch.
6 (p. 89–96), [Hef ] Ch. 3 (p. 173–179), [LNS] Ch. 6 (p. 62–81), [Tre] Ch.
1 (p. 12–18)
Exercise 6.2. Show that every linear map T has the property that T (0V ) =
0W .
Exercise 6.3. Show that T is linear if and only if T (cx + y) = cT (x) + T (y)
for every x, y ∈ V and c ∈ K.
Exercise 6.4. Show that T is linear if and only if for every x1 , . . . , xn ∈ V
and a1 , . . . , an ∈ K, we have
T (a1 x1 + · · · + an xn ) = a1 T (x1 ) + · · · + an T (xn ).    (6.1)
Example 6.5. The linear functionals discussed in the previous lecture are
all examples of linear maps – for these examples we have W = K.
The above examples all look like the sort of thing you will have seen
in your first linear algebra course: defining a function or transformation in
terms of its coordinates, in which case checking for linearity amounts to
confirming that the formula has a specific form, which in turn guarantees
that it can be written in terms of matrix multiplication. The world is much
broader than these examples, however: there are many examples of linear
transformations which are most naturally defined using something other
than a coordinate representation.
Example 6.9. Let V = C 1 (R) be the vector space of all continuously dif-
ferentiable functions, and W = C(R) the vector space of all continuous
functions. Then T : V → W defined by (T f )(x) = (d/dx) f (x) is a linear trans-
formation: recall from calculus that (cf + g)′ = cf ′ + g ′ .
The previous example illustrates the idea at the heart of the operation
of convolution, a linear transformation that is used in many applications,
including image processing, acoustics, data analysis, electrical engineering,
and probability theory. A somewhat more complete description of this op-
eration (which we will not pursue further in this course) is as follows: let
V = W = L2 (R) = {f : R → R | ∫_{−∞}^{∞} f (x)2 dx < ∞}, and fix g ∈ C(R).
Then define T : V → W by (T f )(x) = ∫_a^b f (y)g(x − y) dy.
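A discrete analogue is easy to experiment with (a sketch assuming numpy; this is not the operator defined above, just a finite-dimensional cousin): convolution of finite sequences with a fixed kernel g is a linear map.

import numpy as np

g = np.array([1.0, 2.0, 1.0])          # a fixed kernel

def T(f):
    return np.convolve(f, g)           # convolution with g; linear in f

f1 = np.array([1.0, 0.0, 0.0, 2.0])
f2 = np.array([0.0, 3.0, 1.0, 0.0])
c = 5.0
print(np.allclose(T(c * f1 + f2), c * T(f1) + T(f2)))   # True: T is linear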
which can be checked to define a linear map. Recall that the formulas
for sin(α + θ) and cos(α + θ) can be recovered from the identities
e^{iθ} = cos θ + i sin θ and e^{i(α+θ)} = e^{iα} e^{iθ} .
Proof. Exercise.
By Theorem 6.16, both the range and nullspace of T are subspaces (of W
and V , respectively – be very careful to remember where these things live).
You have encountered both of these subspaces before, in your introduction
to linear algebra: if T : Rn → Rm is a linear transformation defined by
coefficients Aij as in Example 6.12, then the equation T (v) = w describes
a linear system of m equations (one equation for each wi ) in n variables
(v1 , . . . , vn ). The nullspace of T is the set of vectors v for which T (v) = 0:
that is, the set of solutions of the homogeneous system of linear equations
given by Aij . The range RT is the set of w ∈ Rm for which the non-
homogeneous system T (v) = w has a solution. This subspace of Rm was
also characterised as the column space of the matrix A.
Lecture 7 Wed. Sep. 18
Isomorphisms
7.2 Isomorphisms
An important kind of linear transformation is an isomorphism: we say that
T : V → W is an isomorphism if it is linear, 1-1, and onto. If there is an
isomorphism between V and W , we say that V and W are isomorphic.
Thus we may characterise isomorphisms in terms of nullspace and range
as those linear maps for which NT = {0V } and RT = W . (Null space is
trivial and range is everything.)
Exercise 7.5. Show that if T : V → W is an isomorphism, then the inverse
map T −1 : W → V is also an isomorphism.
The notion of isomorphism gives a sense in which various examples of
vector spaces that we have seen so far are “the same”. For instance, let V
be the vector space of column vectors with 2 real components, and let W be
the vector space of row vectors with 2 real components. It seems clear that
V and W are in some sense “the same”, in that both can be described by
prescribing two numbers. More precisely, we can associate to each column
vector ( xy ) the corresponding row vector ( x y ). This association respects
vector addition and scalar multiplication:
1. the sum of the column vectors ( x, y ) and ( x′, y′ ) is the column vector ( x + x′, y + y′ ), which is associated to the row vector ( x+x′ y+y′ ) = ( x y ) + ( x′ y′ );
2. the scalar multiple c · ( x, y ) is the column vector ( cx, cy ), which is associated to the row vector ( cx cy ) = c ( x y ).
T (cv + v ′ ) = (ca1 + a′1 )w1 + · · · + (can + a′n )wn = c(a1 w1 + · · · + an wn ) + (a′1 w1 + · · · + a′n wn ) = cT (v) + T (v ′ ).
Theorem 7.8. Two finite-dimensional vector spaces over the same field K
are isomorphic if and only if they have the same dimension.
(ST )−1 = T −1 S −1
[Diagram: T : V → W ; for ` ∈ W ′ the functional m = T ′ ` ∈ V ′ is given by m = ` ◦ T , so that both m and ` take values in K.]
Example 8.5. Let V = Pn , W = Pn−1 , and let T ∈ L(V, W ) be differenti-
ation. Let ` ∈ W ′ be the linear functional that evaluates a polynomial g at
a specific input t0 , so `(g) = g(t0 ). Then T ′ ` ∈ V ′ is a linear functional on
Pn , given by (T ′ `)(f ) = `(T f ) = `(f ′ ) = f ′ (t0 ).
Example 8.6. Let V = W = R2 . Fix a, b, c, d ∈ R and consider the linear
transformation T : R2 → R2 given by
T ( xy ) = ( ac db ) ( xy ) .
Let `1 (x, y) = x and `2 (x, y) = y be two linear functionals on R2 . The
proof of Theorem 5.10 shows that {`1 , `2 } is a basis for V ′ , so V ′ is also
isomorphic to R2 . Thus we can represent any linear functional ` ∈ V ′ as
` = s`1 + t`2 , where s, t ∈ R are coordinates that depend on `. Indeed, s
and t are determined by
s(`) = `(1, 0), t(`) = `(0, 1), (8.3)
since then linearity implies that
`(x, y) = x`(1, 0)+y`(0, 1) = `1 (x, y)`(1, 0)+`2 (x, y)`(0, 1) = (s`1 +t`2 )(x, y).
so that the linear map T ′ is given in matrix form (relative to the basis `1 , `2 )
by the transpose of the matrix that defined T .
This notation is suggestive of the inner product (or scalar product, or dot
product) that is sometimes used on Rn , but is more general. (Although if Rn
is equipped with an inner product, then identifying the two notations gives
a natural isomorphism between Rn and its dual space.) Using this notation,
we see that the transpose satisfies (T ′ `, v) = (T ′ `)(v) = `(T v) = (`, T v), or
more succinctly,
(T ′ `, v) = (`, T v).    (8.4)
If you have worked with the dot product on Rn before, you may recognise
(8.4) as one of the identities satisfied by the matrix transpose.
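In coordinates this identity is easy to test (a numerical sketch assuming numpy): if T is given by a matrix A and ` by a vector s via `(w) = s · w, then (8.4) says that (Aᵗ s) · v = s · (Av).

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))     # the matrix of T
s = rng.standard_normal(3)          # the functional ell, via ell(w) = s . w
v = rng.standard_normal(3)

lhs = (A.T @ s) @ v                 # (T' ell, v)
rhs = s @ (A @ v)                   # (ell, T v)
print(np.isclose(lhs, rhs))         # True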
Lecture 9 Wed. Sep. 25
Further reading: [Lax] p. 20–22. See also [Bee] p. 544–603, [CDW] Ch.
6, 16 (p. 89–96, 237–246), [Hef ] Ch. 3 (p. 157–172, 180–190), [LNS] Ch. 6
(p. 62–81), [Tre] Ch. 1 (p. 12–18)
Remark 9.2. Note that it is the dimension of the domain V , and not the
dimension of the codomain W , that enters Theorem 9.1. One way of re-
membering this is the following: if X is some larger vector space containing
W , then we can consider T as a linear map V → X, and so we would need
to replace dim W by dim X in any formula like the one in Theorem 9.1.
However, dim NT and dim RT do not change when we replace W with X.
Corollary 9.3. If dim(V ) > dim(W ), then T is not 1-1.
Proof. dim(RT ) ≤ dim(W ), and so dim NT = dim V − dim RT ≥ dim V −
dim W ≥ 1.
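Theorem 9.1 is also easy to check numerically (a sketch assuming numpy, with an arbitrary example matrix): for a matrix representing T : R4 → R3 , the rank plus the dimension of the nullspace equals 4, the dimension of the domain.

import numpy as np

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [0.0, 1.0, 0.0, 1.0]])       # T : R^4 -> R^3

rank = np.linalg.matrix_rank(A)            # dim R_T
_, sv, Vh = np.linalg.svd(A)
null_basis = Vh[np.sum(sv > 1e-12):]       # rows spanning N_T
print(np.allclose(A @ null_basis.T, 0))    # True: these vectors lie in N_T
print(rank, len(null_basis), rank + len(null_basis) == A.shape[1])   # 2 2 True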
Lecture 10 Wed. Oct. 2
Matrices
In other words, T (v) is the vector whose coordinates are given by (10.1).
We see from Theorem 10.1 that the formula (10.1) is not arbitrary, but
follows naturally from linearity of T and our decision to use the entries of
the matrix A to record the partial coefficients of T . The only choice we
made was to record the coefficients of the different vectors T (ej ) as the
columns of the matrix, rather than the rows. This corresponds to a decision
to treat vectors in K n and K m as column vectors, and to have A act by
multiplication on the left. If we had made the other choice, we would work
with row vectors, and multiply from the right.
Given our choice to work with column vectors, it also makes sense to
think of a matrix A as being a “row of column vectors”. Another way of
thinking of (10.2) is as the observation that multiplying A by the standard
basis vector ej returns the jth column of A.
A similar argument to the one above shows that the usual formula for
matrix multiplication is actually a very natural one, as it is the proper way
to encode composition of linear transformations. To see this, consider the
vector spaces K n , K m , and K ` , and let S ∈ L(K n , K m ), T ∈ L(K m , K ` ),
so that we have the diagram
K ` ←−T−− K m ←−S−− K n ,
Proof. Let ej be the standard basis vectors for K n , let di be the standard
basis vectors for K ` , and let ck be the standard basis vectors for K m . The
key is to write T S(ej ) as a linear combination of the vectors di . This can
be done using the formulae for T and S in terms of A and B. First note
that since B is the matrix for S, (10.2) gives
S(ej ) = B1j c1 + · · · + Bmj cm .
Now applying T and using linearity followed by (10.2) for A and T , we get
T S(ej ) = ∑_{k=1}^{m} Bkj T (ck ) = ∑_{k=1}^{m} ∑_{i=1}^{`} Bkj Aik di .    (10.5)
A nice corollary of the above result is the fact that matrix multiplication
is associative – that is, (AB)C = A(BC) whenever the products are defined.
This is immediate from the fact that composition of linear maps is associative
(which is easy to see), and while it could be proved directly using (10.4),
that proof is messier and not as illuminating.
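Both facts can be confirmed numerically (a sketch assuming numpy, with random matrices standing in for arbitrary linear maps):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))    # T : K^3 -> K^2
B = rng.standard_normal((3, 4))    # S : K^4 -> K^3
C = rng.standard_normal((4, 5))
v = rng.standard_normal(4)

print(np.allclose(A @ (B @ v), (A @ B) @ v))    # matrix of T o S is the product A B
print(np.allclose((A @ B) @ C, A @ (B @ C)))    # associativity of the product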
Now we have seen that linear maps from K n to K m can always be
understood in terms of matrix multiplication. What about other vector
spaces? Looking back at Theorems 10.1 and 10.2, we see that the proofs do
not use any special properties of K n apart from the fact that we work with
the standard basis e1 , . . . , en (and similarly for K m , K ` ). In particular, if
we let V, W be any finite-dimensional vector spaces and let e1 , . . . , en be
any basis for V , and d1 , . . . , dm any basis for W , the proof of Theorem
10.1 shows that the matrix A ∈ Mm×n (K) with entries given by (10.2)
determines the linear transformation T via the following version of (10.1):
if v = v1 e1 + · · · + vn en , then
T (v) = ∑_{i=1}^{m} ( ∑_{j=1}^{n} Aij vj ) di .
There is a very important fact to keep in mind here – the matrix A depends
on our choice of basis for V and W . If we choose a different basis, then the
vectors appearing in (10.2) will change, and so the entries of A will change
as well.
T e1 = 0, T e2 = 1 = e1 , T e3 = 2x = 2e2 ,
T c1 = 1 = c1 − c2 , T c2 = 1 = c1 − c2 , T c3 = x = 2c2 − c1 ,
Even for linear maps in K n , where it seems most natural to use the
standard basis e1 , . . . , en , it is often more useful to choose a different basis.
For example, if T is the linear transformation represented by the matrix
A = ( 11 11 ) in the standard basis, we see that taking the column vectors d1 = (1, 1) and d2 = (1, −1)
gives
T d1 = (2, 2) = 2d1 ,    T d2 = 0,
so that relative to the basis d1 , d2 , the map T is given by the rather simpler
matrix ( 20 00 ).
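This computation can be verified directly (a numerical sketch assuming numpy): with Q the matrix whose columns are d1 and d2 , conjugating A by Q produces the diagonal matrix above.

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])
Q = np.array([[1.0,  1.0],
              [1.0, -1.0]])               # columns are d1 = (1, 1) and d2 = (1, -1)

print(np.linalg.inv(Q) @ A @ Q)           # approximately [[2, 0], [0, 0]]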
Two natural questions present themselves. How do we determine the
relationship between the matrices that appear for different choices of basis?
Given a linear transformation, can we choose a basis in which its matrix
takes on a “nice” form that is easier to work with? These questions will
motivate a great deal of what we do in the rest of this course.
Lecture 11 Mon. Oct. 7
Changing bases
Further reading:
The map Iβ is 1-1 if and only if β is linearly independent, and onto if and
only if β spans V . In particular, it is an isomorphism if and only if β is a
basis.
Let V, W be finite-dimensional vector spaces over K, and choose bases
β for V and γ for W . Let n = dim V and m = dim W , so that
Iβ : K n → V, Iγ : K m → W
(Iγ−1 T Iβ )ej = ∑_{i=1}^{m} Aij di ,
(11.2) becomes the commutative square with [T ]β : K n → K n along the bottom, T : V → V along the top, and Iβ : K n → V as the vertical map on both sides.
The result of Exercise 11.1 in this case says that [T v]β = [T ]β [v]β .
Let β, γ be two bases for V . It is important to understand how [T ]β and
[T ]γ are related, and to do this we must first understand how [v]β and [v]γ
are related for v ∈ V . These are related to v by v = Iβ [v]β = Iγ [v]γ .
We see that [v]γ = Iγ−1 Iβ [v]β . Let Iβγ = Iγ−1 Iβ : this is a linear transformation
from K n to K n , and so is represented by an n × n matrix (relative to the
standard basis). We refer to this matrix as the change-of-coordinates matrix
from β to γ because it has the property that Iβγ [v]β = [v]γ for every v ∈ V .
The following example illustrates how all this works. Let V = P2 , let
β = {f1 , f2 , f3 } be the basis given by f1 (x) = 2 + 3x, f2 (x) = 1 + x, and
f3 (x) = x + x2 , and let γ = {g1 , g2 , g3 } be the standard basis g1 (x) = 1,
g2 (x) = x, g3 (x) = x2 . Let e1 , e2 , e3 be the standard basis for R3 . In order
to write down the matrix Iβγ , we need to compute Iβγ ej for j = 1, 2, 3.
First note that Iβγ ej = Iγ−1 Iβ ej , and that Iβ ej = fj . (This is the def-
inition of Iβ .) Thus Iβγ ej = Iγ−1 (fj ) = [fj ]γ . That is, Iβγ ej is given by the
coordinate representation of fj in terms of the basis γ. In particular, since
f1 = 2g1 + 3g2 , f2 = g1 + g2 , and f3 = g2 + g3 , we see that
[f1 ]γ = (2, 3, 0) ,   [f2 ]γ = (1, 1, 0) ,   [f3 ]γ = (0, 1, 1)   (written as column vectors),
and so
2 1 0
Iβγ = 3 1 1 .
0 0 1
Thus for example, the polynomial p(x) = (2+3x)−2(1+x)−(x+x2 ) = −x2
has
[p]β = (1, −2, −1) ,   [p]γ = (0, 0, −1) ,
and indeed multiplying the matrix Iβγ above by the column vector (1, −2, −1) gives (0, 0, −1), so that Iβγ [p]β = [p]γ .
To go the other way and find Iγβ one needs to compute Iγβ ej = [gj ]β . That
is, one must represent elements of β in terms of γ, and the coefficients of
this representation will give the columns of the matrix. So we want to find
coefficients ai , bi , ci such that
g1 = a1 f1 + a2 f2 + a3 f3 ,
g2 = b1 f1 + b2 f2 + b3 f3 , (11.5)
g3 = c1 f1 + c2 f2 + c3 f3 ,
1 = a1 (2 + 3x) + a2 (1 + x) + a3 (x + x2 ),
x = b1 (2 + 3x) + b2 (1 + x) + b3 (x + x2 ),
x2 = c1 (2 + 3x) + c2 (1 + x) + c3 (x + x2 ).
We conclude that
−1 1 −1
Iγβ = 3 −2 2 .
0 0 1
We note that the product Iγβ Iβγ corresponds to changing from β-coordinates
to γ-coordinates, and then back to β-coordinates, so we expect that the
product is the identity matrix, and indeed a direct computation verifies
that
−1 1 −1 2 1 0 1 0 0
3 −2 2 3 1 1 = 0 1 0 .
0 0 1 0 0 1 0 0 1
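The computations in this example are easily repeated in numpy (an illustrative sketch, not part of the original notes):

import numpy as np

I_bg = np.array([[2.0, 1.0, 0.0],      # I_beta^gamma: columns are [f_j]_gamma
                 [3.0, 1.0, 1.0],
                 [0.0, 0.0, 1.0]])
I_gb = np.array([[-1.0,  1.0, -1.0],   # I_gamma^beta, as computed above
                 [ 3.0, -2.0,  2.0],
                 [ 0.0,  0.0,  1.0]])

p_beta = np.array([1.0, -2.0, -1.0])   # [p]_beta for p(x) = -x^2
print(I_bg @ p_beta)                        # [ 0.  0. -1.] = [p]_gamma
print(np.allclose(I_gb @ I_bg, np.eye(3)))  # True: the matrices are mutually inverse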
The procedure described for the above example works very generally. Given
two bases β = {v1 , . . . , vn } and γ = {w1 , . . . , wn } for a vector space V , let
Aij ∈ K be the unique coefficients such that
vj = ∑_{i=1}^{n} Aij wi    for all j = 1, . . . , n.    (11.6)
Then Iβγ is the n × n matrix whose coefficients are given by Aij – that is, the
jth column of Iβγ is given by the coefficients needed to express vj in terms
of the basis γ.
The principle of “express elements of one basis in terms of the other basis,
then use those coefficients to make a matrix” is easy enough to remember.
In principle one might try to build a matrix by putting the coefficients
found above into a row of the matrix, instead of a column, or by expressing
elements of γ in terms of β, instead of what we did above. To keep things
straight it is useful to remember the following.
2. The purpose of Iβγ is to turn [v]β into [v]γ – that is, the output, the
information that is needed, is on how to express vectors as linear com-
binations of elements of γ. Thus to determine Iβγ , we should express
elements of β in terms of γ, and not vice versa.
(11.7): the commutative square with Iβγ : K n → K n along the top (sending [v]β to [v]γ ), the identity map I : V → V along the bottom (sending v to v), and the vertical maps Iβ and Iγ on the left and right respectively.
Comparing this to (11.2), we see that Iβγ is exactly [I]γβ , the matrix of the
identity transformation relative to the bases β (in the domain) and γ (in
the codomain).
Lecture 12 Wed. Oct. 9
12.1 Conjugacy
[Lax] p. 37–38, [Bee] p. 616–676, [CDW] p. 202–206, [Hef ] p. 241–248,
[LNS] p. 138–143, [Tre] p. 68–73
The previous section lets us describe the relationship between [v]β and
[v]γ when v ∈ V and β, γ are two bases for V . Now we are in a position
to describe the relationship between [T ]β and [T ]γ when T ∈ L(V ). The
following commutative diagram relates these matrices to T itself, and hence
to each other:
[T ]γ : K n → K n runs along the top, T : V → V along the middle, and [T ]β : K n → K n along the bottom; the vertical maps joining the top row to the middle row are Iγ : K n → V , and those joining the bottom row to the middle row are Iβ : K n → V .
Recall that Iβγ = Iγ−1 Iβ , while Iγβ = Iβ−1 Iγ is obtained by following the right-
hand side of the diagram from the top to the bottom. Then because the diagram commutes,
we have
[T ]β = Iγβ [T ]γ Iβγ = (Iβγ )−1 [T ]γ Iβγ . (12.1)
We could also have gotten to (12.1) by observing that [T ]β satisfies the re-
lationship [T v]β = [T ]β [v]β , and that using the change-of-coordinates prop-
erties from the previous section, we have
[T v]β = Iγβ [T v]γ = Iγβ [T ]γ [v]γ = Iγβ [T ]γ Iβγ [v]β .
Definition 12.1. Let V, W be vector spaces and consider two linear maps
S ∈ L(V ) and T ∈ L(W ). We say that S and T are conjugate if there is
an isomorphism J : V → W such that T = JSJ −1 . That is, the following
diagram commutes: S : V → V along the top, T : W → W along the bottom, and J : V → W as the vertical map on both sides.
Example 12.5. Let T ∈ L(R2 ) be the linear operator whose matrix (in the
standard basis, or indeed, in any basis) is A = ( 00 10 ). Then the matrix of T 2
is A2 = 0, and so T 2 is the zero transformation.
4. Let β = {v1 , . . . , vdim V } be the basis for V obtained in this way, and
show that [T ]β is strictly upper-triangular.
• P vi = vi for every 1 ≤ i ≤ k;
• P wi = 0 for every 1 ≤ i ≤ `.
where Ik×k is the k × k identity matrix, 0k×` is the k × ` matrix of all 0s,
and so on. In other words,
[P ]β = diag(1, . . . , 1, 0, . . . , 0),   with k ones followed by ` zeros,
The right hand side is a number, which is a 1×1 matrix, so [`]β must be a 1×n
matrix in order for things to match up. In other words, a linear functional
` ∈ V 0 is represented by a row vector [`]β , and `(v) can be computed by
multiplying the row and column vectors of ` and v relative to β.
Lecture 13 Mon. Oct. 14
Further reading:
Given a basis {v1 , . . . , vm } for W , define T : V ′ → K m by T (`) = (`(v1 ), . . . , `(vm )).
Then W ⊥ = NT . One can show that T is onto (this is an exercise), and
thus RT = K m . In particular, we have dim W ⊥ = dim NT and dim W =
m = dim RT , and so by Theorem 9.1 we have dim W + dim W ⊥ = dim RT +
dim NT = dim V 0 . By Theorem 5.10, this is equal to dim V .
The definition of annihilator also goes from the dual space V ′ to V itself:
given S ⊂ V ′ , we write S ⊥ = {v ∈ V | `(v) = 0 for all ` ∈ S}.
Together with Theorem 13.3, this result immediately implies that the
range of T is the annihilator of the nullspace of T ′ :
RT = (NT ′ )⊥    (13.2)
We can put these results together to conclude that T and T ′ have the
same rank for any finite-dimensional V, W and any T ∈ L(V, W ). Indeed,
Theorems 13.2 and 9.1 yield
The last terms are equal by Theorem 5.10, and the first terms are equal by
Theorem 13.3. Thus we conclude that
This can be interpreted as the statement that row and column rank of a
matrix are the same. Indeed, if T is an m × n matrix, then RT ⊂ K m is the
column space of T , and RT ′ ⊂ K n is the row space of T . The above result
shows that these spaces have the same dimension.
Exercise 13.5. Show that if dim V = dim W and T ∈ L(V, W ), then dim NT =
dim NT ′ .
Of course we are not studying PDEs in this course, but there is a useful
finite-dimensional analogue of this problem. Suppose we discretise the prob-
lem by filling the region G with a square lattice and recording the value of
u at every point of the lattice. Then each point has four neighbours, which
we think of as lying to the north, south, east, and west, at a distance of h.
Writing u0 for the value of u at the central point, and uN , uS , uE , uW for
the values at its neighbours, we have the approximations
uxx ≈ (1/h2 )(uW − 2u0 + uE ),
uyy ≈ (1/h2 )(uN − 2u0 + uS ),
and so the approximate version of the Laplace equation uxx + uyy = 0 is the
linear equation
u0 = (1/4)(uW + uE + uN + uS ).    (13.4)
That is, u solves the discretised equation if and only if its value at every
lattice point is the average of its values at the four neighbours of that point.
We require this to hold whenever all four neighbours uW , uE , uN , and uS
are in G. If any of those neighbours are outside of G, then we require u0 to
take the value prescribed at the nearest boundary point of G.
Writing n for the number of lattice points in G, we see that the discretised
version of the Laplace equation ∆u = 0 is a system of n equations in n
unknowns. We want to answer the following question: for a given choice of
boundary conditions, is there a solution of the discretised Laplace equation
on the interior of G? If so, is that solution unique?
Because the discretised Laplace equation is a system of n linear equa-
tions in n unknowns, this question is equivalent to the following: does the
given non-homogeneous system of equations have a unique solution? We
know that if the homogeneous system has only the trivial solution, then the
non-homogeneous system has a unique solution for any choice of data – in
particular, the discretised Laplace equation has a unique solution on the
interior of G for any choice of boundary conditions.
The homogeneous system for the Laplace equation corresponds to having
the zero boundary condition (u = 0 on the boundary of G). Let M be the
maximum value that u takes at any point in G. We claim that M = 0.
Indeed, if M > 0 then there is some lattice point at the interior of G for
which u0 = M . By (13.4) and the fact that each of the neighbouring points
has u ≤ M , we see that u = M at every neighbouring point. Continuing in
this way, we conclude that u = M on every point, contradicting the fact that
it vanishes on the boundary. Thus u ≤ 0 on all of G. A similar argument
(considering the minimum value) shows that u ≥ 0 on all of G, and so u = 0
is the only solution to the homogeneous problem. This implies that there
is a unique solution to the non-homogeneous problem for every choice of
boundary conditions.
In fact, we have proved something slightly stronger. Laplace’s equation
is a special case of Poisson’s equation ∆u = f , where f : G → R is some
function – examples where f 6= 0 arise in various physical applications.
Upon discretising, (13.4) becomes u0 = (1/4)(uW + uE + uN + uS − f0 ), where
f0 is the value of f at the given point. Thus the situation is the following: if
G contains n lattice points and k of these are on the boundary in the sense
that one of their neighbours lies outside of G, then the non-homogeneous
part of k of the equations specifies the boundary condition, while the non-
homogeneous part of the remaining n − k equations specifies the function
f . The above argument permits both parts to be non-zero, and so it shows
that the discretised version of Poisson’s equation has a unique solution for
every choice of boundary condition and every choice of f .
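To see the discretised problem in action, here is a small sketch (assuming numpy; the grid size and boundary values are arbitrary choices) that assembles and solves the system (13.4) on a square grid whose outer ring of points is treated as the boundary.

import numpy as np

N = 6                                     # grid is N x N; the outer ring is the boundary
boundary = np.zeros((N, N))
boundary[0, :] = 1.0                      # arbitrary boundary condition: u = 1 on the top edge

interior = [(i, j) for i in range(1, N - 1) for j in range(1, N - 1)]
index = {p: k for k, p in enumerate(interior)}
n = len(interior)

A = np.zeros((n, n))
b = np.zeros(n)
for (i, j), k in index.items():           # one equation u0 = (uW+uE+uN+uS)/4 per interior point
    A[k, k] = 1.0
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        p = (i + di, j + dj)
        if p in index:                    # interior neighbour: an unknown
            A[k, index[p]] = -0.25
        else:                             # boundary neighbour: a known value
            b[k] += 0.25 * boundary[p]

u = np.linalg.solve(A, b)                 # unique solution, as argued above
print(u.reshape(N - 2, N - 2).round(3))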
Lecture 14 Wed. Oct. 16
Powers of matrices
Further reading:
That is, vn is the second column of the matrix An , and xn is the second
entry of this column, so we have the explicit formula
xn = (An )22 .
This gives us an explicit formula for the nth Fibonacci number, if only we
can find a formula for An .
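Numerically this is easy to explore (a sketch assuming numpy and the standard choice of Fibonacci matrix A = [ 1 1 ; 1 0 ]; the indexing conventions in the part of the lecture not reproduced here may differ slightly):

import numpy as np

A = np.array([[1, 1],
              [1, 0]])
print(np.linalg.matrix_power(A, 10))
# [[89 55]
#  [55 34]]  -- consecutive Fibonacci numbers appear as the entries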
The difficulty, of course, is that it is not clear how to write down a
formula for Ak for an arbitrary matrix A, or even to predict whether the
entries of Ak will grow, shrink, or oscillate in absolute value. For example,
consider the matrices
A = [ 5 7.1 ; −3 −4 ] ,    B = [ 5 6.9 ; −3 −4 ]    (14.1)
(here and below, matrices are written row by row, with semicolons separating the rows). It turns out that
A50 ≈ [ −1476 −3176 ; 1342 2549 ] ,    B50 ≈ [ .004 .008 ; −.003 −.006 ]    (14.2)
Thus despite the fact that A and B have entries that are very close to each
other, the iterates An v and B n v behave very differently: we will eventually
be able to show that if v ∈ R2 \ {0} is any non-zero vector, then ‖An v‖ → ∞
as n → ∞, while ‖B n v‖ → 0.
We could formulate a similar example for a one-dimensional linear map,
which is just multiplication by a real number. Let a = 1.1 and b = 0.9. Then
we see immediately that for any non-zero number x, we have an x → ∞ and
bn x → 0 as n → ∞. In this case, it is clear what the issue is: |a| is larger
than 1, and |b| is smaller than 1.
Can we find a similar criterion for matrices? Is there some property of a
matrix that we can look at and determine what kind of behaviour the iterates
An v will exhibit? This is one question that will motivate our discussion of
eigenvalues and eigenvectors, determinants, and eventually spectral theory.
w, Aw, A2 w, . . . , An w.
There are n + 1 of these vectors, and dim(Cn ) = n, thus these vectors are
linearly dependent. That is, there exist coefficients c0 , . . . , cn ∈ C such that
c0 w + c1 Aw + · · · + cn An w = 0.    (14.3)
Proposition 14.4 shows that over C, every matrix has at least one eigen-
value and eigenvector, but does not show how to compute them in practice,
or how many of them there are. To do this, we need to return to the char-
acterisation of eigenvalues and eigenvectors in Proposition 14.2. If we know
that λ is an eigenvalue, then eigenvectors for λ are elements of the nullspace
of A − λI, which can be found via row reduction. Thus the primary chal-
lenge is to find the eigenvalues – that is, to find λ such that A − λI is
non-invertible.
Thus we have reduced the problem of determining eigenvalues and eigen-
vectors to the problem of determining invertibility of a matrix. This moti-
vates the introduction of the determinant, a quantity which we will spend
the next few lectures studying. We end this lecture with the following ob-
servation: if β = {v1 , . . . , vn } is a basis for V consisting of eigenvectors of a
linear operator T , then for each j = 1, . . . , n we have T vj = λj vj for some
λj ∈ K, so that [T vj ]β = λj ej . This implies that [T ]β is the diagonal matrix
with entries λ1 , . . . , λn . Thus finding a basis of eigenvectors has very strong
consequences.
Exercise 14.5. Consider the matrix A = [ −4 −6 3 ; 2 4 −2 ; −2 −2 1 ]. Show that 0, −1, and
2 are all eigenvalues of A, and find eigenvectors associated to each. Check
that these eigenvectors form a basis β, and find the change-of-coordinates
matrix Q = Iβα , where α is the standard basis for R3 . Verify by direct
computation that Q−1 AQ = diag(0, −1, 2).
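For comparison (a numerical sketch assuming numpy; it does not replace doing the exercise by hand):

import numpy as np

A = np.array([[-4.0, -6.0,  3.0],
              [ 2.0,  4.0, -2.0],
              [-2.0, -2.0,  1.0]])

evals, evecs = np.linalg.eig(A)
print(np.sort(evals.real))                     # approximately [-1.  0.  2.]

Q = evecs                                      # columns are eigenvectors of A
print(np.round(np.linalg.inv(Q) @ A @ Q, 8))   # diagonal matrix of the eigenvalues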
Lecture 15 Mon. Oct. 21
Introduction to determinants
Further reading:
• A is invertible
[A | I] → [ 1 b/a | 1/a 0 ; 0 1 | 0 1/d ] → [ 1 0 | 1/a −b/(ad) ; 0 1 | 0 1/d ]
At this point we can observe that one of two things happens: either ad = bc
and it is impossible to row reduce this to [I | A−1 ], hence A is non-invertible;
or ad 6= bc and we can get to that form via four more row operations (divide
first row by the non-zero value bc − ad, divide second row by the non-zero
value c, swap the rows, and finally subtract a multiple of a row to eliminate
the remaining non-zero term in the left half). In particular, we conclude
that A is invertible if and only if ad − bc 6= 0. Note that this criterion works
for the case c = 0 as well, because in this case it reduces to ad 6= 0, which
was the criterion found in the previous argument.
Exercise 15.2. Complete the row reduction above to show that the aug-
mented matrix [A | I] row reduces to
[ ad − bc 0 | d −b ; 0 ad − bc | −c a ]
and hence A−1 = (1/(ad − bc)) [ d −b ; −c a ].
Thus for 2 × 2 matrices, invertibility can be determined via a formula –
one simply tests whether or not the determinant ad − bc is zero or non-zero.
It is natural to ask if there is a similar quantity for larger matrices, and
we will address this shortly. First we give a geometric interpretation of the
determinant for a 2 × 2 matrix. Recall again that A is invertible if and only
if the columns of A form a basis for R2 , which is true if and only if they
are linearly
independent (because there are 2 columns). Let v = ( ac ) and
w = ( bd ) be the columns of A. Plotting v and w in the plane, we can make
the following geometric observation: v, w are linearly dependent if and only
if they point in the same direction or in opposite directions. Let θ be the
angle between v and w, then the vectors are linearly dependent if and only
if θ = 0 or θ = π, which is equivalent to sin θ = 0.
So, can we compute sin θ in terms of a, b, c, d? Let r = ‖v‖ = √(a2 + c2 )
and s = ‖w‖ = √(b2 + d2 ) be the lengths of the vectors v and w, so that in
polar coordinates we have
v = ( ac ) = ( r cos α, r sin α ) ,    w = ( bd ) = ( s cos β, s sin β ) .
That is, both A and its transpose At have the same determinant. We will
later see that this property continues to hold for larger matrices. For now,
note that this means that the determinant of A gives the area of the par-
allelogram spanned by the rows of A, as well as the one spanned by the
columns.
Finally, we observe that this geometric interpretation can also be inter-
preted in terms of row reduction. Given two row vectors v, w ∈ R2 , it is
easy to see that P (v + tw, w) and P (v, w) have the same area for every
t ∈ R. This corresponds to the fact that we may treat w as the base of the
parallelogram, and the height does not change if we replace v with v + tw
for any t. Thus assuming v and w are in “general position”, we can make
one row reduction to go from the matrix with rows v, w to the matrix with rows v ′ , w, where v ′ = v + tw is of the form
(a, 0), and then we can make one more row reduction to reach the matrix with rows
v ′ , w ′ , where w ′ = w + sv ′ is of the form (0, b). Then P (v, w) = ab, since the
parallelogram spanned by v ′ and w ′ is just a rectangle with sides a and b.
The previous paragraph shows that if we row reduce a 2 × 2 matrix A
to a diagonal form D using only operations of the form “add a multiple of
a row to a different row”, then the product of the diagonal entries of D
gives the area of the parallelogram spanned by the rows of A. A completely
analogous result is true for n × n matrices. In the next section, we consider
the 3 × 3 case and find a formula for the product of these diagonal entries
in terms of the entries of A itself.
Of course, this system may not have any solutions, depending on the values
of w2 , w3 , x2 , x3 , v2 , v3 . For now we assume that these are such that there is
always a unique solution, and worry about degenerate cases later. Although
we work with row vectors here instead of column vectors, we can still solve
this system by using Exercise 15.2, which tells us that
−1
w2 w3 1 x3 −w3
= .
x2 x3 w2 x3 − w3 x2 −x2 w2
Thus
( s t ) = −( v2 v3 ) (1/(w2 x3 − w3 x2 )) [ x3 −w3 ; −x2 w2 ]
= ( v3 x2 − v2 x3 , v2 w3 − v3 w2 ) / (w2 x3 − w3 x2 ).
We conclude that
a = v1 + sw1 + tx1 = v1 + (v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 )
= (v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 ).    (15.3)
Note that the expression for a has the following properties: every term in
the numerator is the product of exactly one entry from each of v, w, x, and
the denominator is the determinant of the 2 × 2 matrix ( wx22 wx33 ), which is
the matrix lying directly below the two entries of A that we transformed to
0 in our row reduction.
We need to repeat the same procedure with w and x to find b, c, but
now it turns out to be simpler. Indeed, w ′ is determined by the requirement
that w ′ = w + sv ′ + tx = (0, b, 0) for some s, t ∈ R (not necessarily the same
as before). This gives
w1 + sa + tx1 = 0
w2 + tx2 = b
w3 + tx3 = 0.
The third equation determines t, the second determines b, and the first
determines s. We get t = −w3 /x3 , hence
b = w2 − (w3 /x3 )x2 = (w2 x3 − w3 x2 )/x3 .
Note that the numerator of b matches the denominator of a. Now we have
row reduced A to [ a 0 0 ; 0 b 0 ; x1 x2 x3 ], and we see immediately that row reducing to
diagonal form does not change the bottom right entry, so c = x3 , and the
original matrix A is row equivalent to the diagonal matrix diag(a, b, c). In
particular, the volume of the parallelepiped spanned by v, w, x is the same
as the volume of the rectangular prism with side lengths a, b, c – that is, it
is given by
abc = [(v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 )/(w2 x3 − w3 x2 )] · [(w2 x3 − w3 x2 )/x3 ] · x3
= v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 .    (15.4)
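As a sanity check (a sketch assuming numpy, with random vectors), the expression (15.4) agrees with the determinant numpy computes for the matrix with rows v, w, x:

import numpy as np

rng = np.random.default_rng(2)
v, w, x = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)

formula = (v[0]*w[1]*x[2] - v[0]*w[2]*x[1] + v[2]*w[0]*x[1]
           - v[1]*w[0]*x[2] + v[1]*w[2]*x[0] - v[2]*w[1]*x[0])
print(np.isclose(formula, np.linalg.det(np.array([v, w, x]))))   # True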
So far we cannot say anything rigorously about this quantity, because the
discussion above did not properly account for the possibility that we may
have x3 = 0 or w2 x3 − w3 x2 = 0. Nevertheless, we may tentatively call this
quantity the determinant of A, and see if it does what we expect it to do
– that is, tell us whether or not A is invertible, and help in computing the
inverse. Thus we make the following definition: given a 3 × 3 matrix A with
entries Aij , the determinant of A is the quantity (15.4); writing v, w, x for the rows of A,
det A = v1 w2 x3 − v1 w3 x2 + v3 w1 x2 − v2 w1 x3 + v2 w3 x1 − v3 w2 x1 .    (15.5)
Lecture 16 Wed. Oct. 23
Further reading:
A = [ v1 v2 v3 ; w1 w2 w3 ; x1 x2 x3 ] = [ A11 A12 A13 ; A21 A22 A23 ; A31 A32 A33 ] ,
where Ãij denotes the 2 × 2 matrix obtained from A by deleting the ith row
and jth column. This formula, defining the determinant of a 3 × 3 matrix
in terms of the determinants of the 2 × 2 matrices Ãij , is called cofactor
expansion, or Laplace expansion. We will later see that a generalisation of
this can be used to define the determinant of an n × n matrix in terms of
determinants of (n − 1) × (n − 1) matrices.
Another interpretation of the formula for a 3×3 determinant is illustrated
by the following diagram, which relates each term of (15.5) to a choice of 3
Pictorially, the positive and negative signs associated to the i, j-term follow
the chessboard pattern shown:
+ − +
− + − (16.4)
+ − +
What about (16.1) itself? How do we determine which terms are positive
and which are negative?
e1 , and e2 . Notice that each of these can be obtained from the identity
matrix via a single row operation of swapping two rows. The identity matrix,
of course, requires zero swaps, while the other two permutation matrices
corresponding to positive terms require two swaps.
Exercise 16.2. Check this last claim, that the matrices with rows e3 , e1 , e2 and with rows e2 , e3 , e1 can be obtained
from I by two row operations of exchanging rows.
The above discussion suggests the following criterion:
In other words, P (v1 , v2 ) is the image of the square with vertices (0, 0),
(1, 0), (0, 1), and (1, 1) under the linear transformation x ↦ xA.
More generally, given v1 , . . . , vn ∈ Rn , the parallelepiped spanned by
v1 , . . . , vn is
P (v1 , . . . , vn ) = {t1 v1 + · · · + tn vn | ti ∈ [0, 1] for all 1 ≤ i ≤ n} = {xA | x ∈ C},
where A is the matrix with rows v1 , . . . , vn (the linear transformation sending ei to vi ), and C is
the n-dimensional cube with side lengths 1 and vertices at 0 and ei for
1 ≤ i ≤ n. (Note that C also has vertices at ei1 + · · · + eik for every
1 ≤ i1 < i2 < · · · < ik ≤ n.)
For 2 × 2 matrices, recall that the determinant of the matrix with rows v1 , v2 is ‖v1 ‖ ‖v2 ‖ sin θ, where θ is the
angle from v1 to v2 . In particular, sin θ is positive if the shortest way from
v1 to v2 is to go counterclockwise, and is negative if the shortest way is
clockwise. Thus the sign of the determinant tells us what the orientation
of the vectors v1 , v2 is relative to each other. We write S(v1 , v2 ) = ±1
depending on whether this orientation is positive or negative.
What about 3 × 3 matrices, or larger? Given a basis v1 , v2 , v3 ∈ R3 , we
can no longer use the words “clockwise” and “counterclockwise” to describe
where these vectors are in relation to each other. Instead we note the follow-
ing: in R2 , the standard basis e1 , e2 is positively oriented, and any positively
oriented basis v1 , v2 can be continuously deformed into the standard basis
without becoming dependent. More precisely, if v1 , v2 is a positively oriented
basis, then there are continuous functions w1 , w2 : [0, 1] → R2 such that
Earlier, we saw that this definition of determinant implies the formula (15.5)
for 3 × 3 matrices, which we rewrote in two different ways, (16.2)–(16.3) and
(16.7). Now we will show that (16.8) and (16.9) force the determinant to
have several properties that in turn force it to be given by a generalisation of
(16.7), the sum over permutations, and that this is equivalent to a recursive
definition following (16.2)–(16.3).
Lecture 17 Mon. Oct. 28
Further reading:
The signed volume D(v1 , . . . , vn ) of the parallelepiped P (v1 , . . . , vn ) should
have the following properties:
1. D(v1 , . . . , vn ) = 0 whenever vi = vj for some i ≠ j.
2. D is linear in each argument.
3. D(e1 , . . . , en ) = 1.
To see the first property, note that if vi = vj for i ≠ j, then the paral-
lelepiped P (v1 , . . . , vn ) lies in an (n − 1)-dimensional subspace of Rn , and
hence has zero volume. To see the third property, note that P (e1 , . . . , en )
is the unit n-cube, and this basis is positively oriented by definition. It only
remains to prove the second property, linearity in each argument.
Given 1 ≤ j ≤ n and v1 , . . . , vn , let V be the (n − 1)-dimensional volume
of the parallelepiped spanned by v1 , . . . , vj−1 , vj+1 , . . . , vn . Given x ∈ Rn ,
let h(x) be the signed distance from x to the subspace in which this paral-
lelepiped lies – that is, h(x) is positive if S(v1 , . . . , vj−1 , x, vj+1 , . . . , vn ) = 1,
and negative otherwise. Then h is a linear function of x, and we have

    D(v1 , . . . , vj−1 , x, vj+1 , . . . , vn ) = h(x)V,

so D is linear in x.
There are two more important properties of determinants that follow
from the three listed above.
Let’s consider this sum for a moment. Every term in the sum is a product
of n entries vij multiplied by the factor D(ej1 , . . . , ejn ). The n entries vij
in any given term have the property that one of them is an entry in v1 , one
is an entry in v2 , and so on. If we consider the n × n matrix A whose rows
are the vectors v1 , . . . , vn , then this corresponds to taking exactly one entry
from each row of A.
We can rewrite (17.1) as

    D(v1 , . . . , vn ) = Σ_f v1,f(1) · · · vn,f(n) D(ef(1) , . . . , ef(n) ),    (17.2)
where the sum is taken over all functions f : {1, . . . , n} → {1, . . . , n}. If f
is not 1-1, then there are indices i ≠ j such that ef (i) = ef (j) , and hence
by property 1, D(ef (1) , . . . , ef (n) ) = 0. Thus the only non-zero terms in
the above sum are those corresponding to 1-1 functions f . A function from
an n-element set to itself is 1-1 if and only if it is onto; in this case it is
called a permutation of the set, as in the 3 × 3 case we discussed earlier.
Let Sn denote the set of all permutations on n symbols; then since the only
functions that contribute to the sum in (17.2) are permutations, we have
    D(v1 , . . . , vn ) = Σ_{π∈Sn} D(eπ(1) , . . . , eπ(n) ) · Π_{i=1}^n vi,π(i)    (17.3)
    det A = Σ_{π∈Sn} sgn π · Π_{i=1}^n Ai,π(i) ,    (17.5)
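Although the sum has n! terms and so is impractical for large n, formula (17.5) is easy to evaluate directly for small matrices. A minimal Python sketch (illustrative only; the matrix is the same arbitrary example used earlier, and the value agrees with the cofactor expansion):

    from itertools import permutations
    from math import prod

    def sign(pi):
        # sgn(pi) = (-1)^(number of inversions)
        n = len(pi)
        inversions = sum(1 for a in range(n) for b in range(a + 1, n) if pi[a] > pi[b])
        return -1 if inversions % 2 else 1

    def det_perm_sum(A):
        # formula (17.5): sum over all permutations pi of sgn(pi) * prod_i A[i][pi(i)]
        n = len(A)
        return sum(sign(pi) * prod(A[i][pi[i]] for i in range(n))
                   for pi in permutations(range(n)))

    A = [[1, 2, 3],
         [0, 4, 5],
         [1, 0, 6]]
    print(det_perm_sum(A))  # 22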
Lecture 18 Mon. Nov. 4
Further reading:
    1  2  3  4
                         (18.1)
    2  4  1  3

(each number in the top row is joined by an arrow to the same number in the
bottom row)
Now count the number of times the arrows cross (in the above example,
there are three crossings). We claim that the number of crossings has the
same parity (even or odd) as the number of transpositions it takes to obtain
π. To see this, choose any way of obtaining π via transpositions: in the
above example, one possibility would be
(1, 2, 3, 4) → (2, 1, 3, 4) → (2, 4, 3, 1) → (2, 4, 1, 3).
Draw a similar diagram to (18.1), where each transposition gets its own row:
    1  2  3  4              (18.2)
    2  1  3  4
    2  4  3  1
    2  4  1  3
¹ For any readers who are familiar with the representation of a permutation in terms
of its cycle structure, notice that we are using a similar-looking notation here to mean a
different thing.
Note that we are careful to draw the arrows in such a way that no more than
two arrows intersect at a time.2 (This is why the second row has a curved
arrow.) This diagram has more intersections than (18.1), but the parity of
intersections is the same. To see that this is true in general, say that i and
j are inverted by π if π(i), π(j) occur in a different order than i, j. That is,
i, j are inverted if i < j and π(i) > π(j), or if i > j and π(i) < π(j).
Proposition 18.1. For any choice of how to draw the arrows in the above
diagram, and any i ≠ j, the arrows corresponding to i and j cross an odd
number of times if i, j are inverted by π, and an even number of times if
they are not inverted.
Proof. Every time the arrows for i, j cross, the order of i, j is reversed. If
this order is reversed an odd number of times, then i, j are inverted. If it is
reversed an even number of times, then i, j are not inverted.
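As a concrete illustration of this parity bookkeeping, here is a short Python sketch (illustrative only) that counts the inversions of the permutation (2, 4, 1, 3) and compares their parity with the number of transpositions used in (18.2):

    def inversions(perm):
        # pairs (i, j) with i < j whose order is reversed by the permutation
        n = len(perm)
        return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])

    def apply_transpositions(n, swaps):
        # start from the identity and apply each swap of positions (a, b) in turn
        perm = list(range(n))
        for a, b in swaps:
            perm[a], perm[b] = perm[b], perm[a]
        return perm

    # (1,2,3,4) -> (2,1,3,4) -> (2,4,3,1) -> (2,4,1,3), written with 0-indexed positions
    swaps = [(0, 1), (1, 3), (2, 3)]
    perm = apply_transpositions(4, swaps)
    print(perm)                                  # [1, 3, 0, 2], i.e. (2, 4, 1, 3)
    print(inversions(perm) % 2, len(swaps) % 2)  # 1 1: both odd, matching the claim about parity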
    det(Aᵗ) = Σ_{π∈Sn} sgn π · Π_{i=1}^n (Aᵗ)i,π(i) = Σ_{π∈Sn} sgn π · Π_{i=1}^n Aπ(i),i .

Note that Π_{i=1}^n Aπ(i),i = Π_{j=1}^n Aj,π⁻¹(j) , and since sgn(π⁻¹) = sgn π, we
have

    det(Aᵗ) = Σ_{π∈Sn} sgn(π⁻¹) · Π_{j=1}^n Aj,π⁻¹(j) .
Define C : (K n )n → K by
We claim that the function C has the first two properties described in §17.1,
and that C(e1 , . . . , en ) = det A. Then it will follow from the discussion in
§17.2 that C(b1 , . . . , bn ) = (det A)D(b1 , . . . , bn ), which will suffice to prove
the theorem.
To verify these properties, first observe that if bi = bj for some i ≠ j,
then Abi = Abj , and so the numerator in (18.4) vanishes. This takes care of
property 1. Property 2 follows because Abi depends linearly on bi , and the
right-hand side of (18.4) depends linearly on Abi .
Now we observe that if a1 , . . . , an are the columns of A, then
where the second equality follows from matrix multiplication, and the third
from the definition of determinant.
As indicated above, §17.2 now gives
Proof. (⇐): If A is non-invertible then its columns are not a basis for K n ,
and so they are linearly dependent. Thus Property 5 of the function D
implies that det A = 0.
(⇒): If A is invertible, then there is B ∈ Mn×n such that AB = I, and
so by Theorem 18.7 we have
Corollary 18.9. If two matrices A and B are similar, then det A = det B.
Lecture 19 Wed. Nov. 6
Further reading:
where the first equality is the definition of determinant, and the second
follows from multilinearity of D. To prove (19.1), we need to show that
Note that by Properties 1 and 2 of the function D, the value of D does not
change if we add a multiple of one vector to another: in particular, for every
2 ≤ j ≤ n and every λ ∈ K, we have
Notice that we have set every entry in the ith row and 1st column to 0, ex-
cept for the entry in the place (i, 1), which is now equal to 1. The remaining
entries determine the (n−1)×(n−1) matrix Ãi1 . It follows from multilinear-
ity of determinant that the right-hand side of (19.3) is linear in each column
of Ãi1 , and hence D(ei , a2 , . . . , an ) is as well. Similarly, D(ei , a2 , . . . , an ) = 0
if any two columns of Ãi1 are equal. Because these properties characterise
determinant up to a constant normalising factor, we conclude that
We can go from the order (i, 1, . . . , i−1, i+1, . . . , n) to the order (1, 2, . . . , n)
in i − 1 swaps, and (19.2) follows.
This gives the cofactor (Laplace) expansion along the first column: the
only difference from the 3 × 3 case is that the sum has n terms, instead of 3,
and the matrix Ãi1 is an (n − 1) × (n − 1) matrix, instead of a 2 × 2 matrix.
Because interchanging two columns reverses the sign of the determinant,
we immediately obtain from Theorem 19.1 the formula for cofactor expan-
sion along the jth column as
    det A = Σ_{i=1}^n (−1)^{i+j} Aij det Ãij .
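A recursive Python sketch of this expansion (an illustration only; it expands along the first column at every level, and the 4 × 4 matrix is an arbitrary example):

    def minor(A, i, j):
        # remove row i and column j (0-indexed)
        return [[A[r][c] for c in range(len(A)) if c != j]
                for r in range(len(A)) if r != i]

    def det_cofactor(A):
        # cofactor expansion along the first column, applied recursively
        n = len(A)
        if n == 1:
            return A[0][0]
        return sum((-1) ** i * A[i][0] * det_cofactor(minor(A, i, 0))
                   for i in range(n))

    A = [[2, 0, 1, 3],
         [1, 1, 0, 2],
         [0, 5, 1, 1],
         [4, 0, 0, 1]]
    print(det_cofactor(A))  # -41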
and a similar computation shows that for any i ≠ j, we can replace vi with
vi + λvj without changing the value of D. In particular, if A row reduces to
B using only the first type of row operation – adding a multiple of a row to
another row – then det A = det B.
Furthermore, we know that interchanging two rows multiplies the deter-
minant by −1. Thus we have the following result.
Proposition 19.2. Suppose A can be row reduced to B using only row
operations of Types 1 and 2. Let k be the number of row operations of Type
2 required to carry out this row reduction. Then det A = (−1)k det B.
This gives a more efficient way of computing determinants. If A is non-
invertible, then it row reduces to a matrix with an empty row, and we
conclude that det A = 0. If A is invertible, then it row reduces to a diagonal
matrix diag(λ1 , . . . , λn ), and we conclude that
Exercise 19.3. Show that det diag(λ1 , . . . , λn ) = λ1 λ2 · · · λn .
Using this exercise, we see that (19.4) reduces to
det A = (−1)k λ1 · · · λn .
It can be shown that the number of arithmetic operations involved in carrying out
the above process for an n × n matrix is on the order of n³, which gives a
much more manageable number of computations than a direct application of
the formulas via summing over permutations or cofactor expansion. Indeed,
for n = 10 we have n³ = 1000, while n! ≈ 3.6 × 10⁶, so the two differ by more
than three orders of magnitude. For n = 20 we have n³ = 8000, while
n! ≈ 2.4 × 10¹⁸. At a billion operations a second, it would take about
1/100,000 of a second to carry out 20³ operations, but about 76 years to
carry out 20! operations. Thus
row reduction is a vastly more efficient way of computing large determinants
than the previous formulas we saw.
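A Python sketch of this procedure (illustrative only): reduce to upper triangular form using only row additions and row swaps, track the number k of swaps, and multiply the diagonal entries.

    def det_by_row_reduction(A):
        A = [row[:] for row in A]       # work on a copy
        n = len(A)
        swaps = 0
        for col in range(n):
            # find a row at or below position `col` with a non-zero entry in this column
            pivot = next((r for r in range(col, n) if A[r][col] != 0), None)
            if pivot is None:
                return 0                # A is not invertible
            if pivot != col:
                A[col], A[pivot] = A[pivot], A[col]
                swaps += 1
            for r in range(col + 1, n):
                # Type 1 operation: subtract a multiple of the pivot row
                factor = A[r][col] / A[col][col]
                A[r] = [A[r][c] - factor * A[col][c] for c in range(n)]
        result = -1 if swaps % 2 else 1
        for i in range(n):
            result *= A[i][i]
        return result

    A = [[2, 0, 1, 3],
         [1, 1, 0, 2],
         [0, 5, 1, 1],
         [4, 0, 0, 1]]
    print(det_by_row_reduction(A))  # approximately -41, agreeing with the cofactor expansion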
Exercise 19.4. Suppose A can be row reduced to B via any sequence of row
operations; let k be the number of transpositions (Type 2) involved, and let
p1 , . . . , pm be the non-zero factors by which rows are multiplied in the m
steps of Type 3. Use multilinearity of determinant to show that
and in particular, if A row reduces to the identity through this process, then
−4 9 −14 15
Row operations of the first type reduce this to

    1 −2  3 −12        1 −2  3 −12        1 −2  3 −12
    0  2  1 −41        0  0  5  25        0  0  5  25
    0  4  7 −77   →    0  0 15  55   →    0  0  0 −20  .
    0  1 −2 −33        0  1 −2 −33        0  1 −2 −33

With two more transpositions, this becomes an upper triangular matrix with
diagonal entries 1, 1, 5, −20, and so
Lecture 20 Mon. Nov. 11
Further reading:
= xk det A,
where the first equality is the definition of determinant, the second equality
uses (20.1), the third uses multilinearity of determinant, and the fourth
uses the fact that D vanishes whenever two of its arguments coincide. We
conclude that xk = (det Bk )/(det A), since A is invertible and hence det A ≠ 0.
Notice that the order of the indices is reversed between the left and right-
hand sides of the equation. We conclude that the inverse of a matrix A is
given by the formula
    (A⁻¹)ij = (−1)^{i+j} (det Ãji )/(det A).    (20.3)
For large matrices, this is not a practical method for computing the in-
verse, because it involves computing n2 determinants, and the computation
of inverse via row reduction of [A | I] to [I | A−1 ] is significantly faster.
Nevertheless, (20.3) has important theoretical significance.
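A numerical sketch of (20.3) in Python (illustrative only; the 3 × 3 matrix is an arbitrary invertible example), compared against numpy's built-in inverse:

    import numpy as np

    def minor(A, i, j):
        # delete row i and column j
        return np.delete(np.delete(A, i, axis=0), j, axis=1)

    def inverse_by_cofactors(A):
        n = A.shape[0]
        detA = np.linalg.det(A)
        inv = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                # note the transposed indices: the (i, j) entry uses the minor Ã_ji
                inv[i, j] = (-1) ** (i + j) * np.linalg.det(minor(A, j, i)) / detA
        return inv

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    print(np.allclose(inverse_by_cofactors(A), np.linalg.inv(A)))  # True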
20.2 Trace
[Lax] p. 55–56
The determinant is a function that assigns a scalar value to every n × n
matrix. Another important scalar-valued function is the trace, defined as
    Tr A = Σ_{i=1}^n Aii .    (20.4)
That is, the trace of an n × n matrix A is the sum of the diagonal entries
of A. (This looks like the formula for a determinant of an upper-triangular
matrix, where determinant is the product of the diagonal entries, but this
definition of trace is for any square matrix A, regardless of whether or not
it is upper-triangular.)
Determinant is a multilinear function – it depends linearly on each
row/column of A. Trace, on the other hand, is a linear function of the
entire matrix A. Indeed, given any A, B ∈ Mn×n and c ∈ K, we have
    Tr(cA + B) = Σ_{i=1}^n (cA + B)ii = Σ_{i=1}^n (cAii + Bii ) = c Tr A + Tr B.
Trace is not multiplicative in the same way determinant is: as a general rule,
Tr(AB) ≠ Tr(A) Tr(B). (Indeed, Tr(I) = n for the n × n identity matrix, so
Tr(IB) = Tr(B) ≠ Tr(I) Tr(B) for every B ∈ Mn×n with non-zero trace.)
Nevertheless, trace has the following important property with respect to
multiplication.
Theorem 20.1. Given any A, B ∈ Mn×n , we have Tr(AB) = Tr(BA).
Proof. By the formula for matrix multiplication, we have (AB)ii = Σ_{k=1}^n Aik Bki ,
and so

    Tr(AB) = Σ_{i=1}^n Σ_{k=1}^n Aik Bki .

Similarly, (BA)ii = Σ_{k=1}^n Bik Aki , and so

    Tr(BA) = Σ_{i=1}^n Σ_{k=1}^n Bik Aki .
The two sums are equal (just interchange the names of the indices).
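A quick numerical illustration (arbitrary 2 × 2 matrices) of Theorem 20.1 and of the warning that trace is not multiplicative:

    import numpy as np

    A = np.array([[1.0, 2.0], [3.0, 4.0]])
    B = np.array([[0.0, 1.0], [5.0, 2.0]])

    print(np.trace(A @ B), np.trace(B @ A))  # 21.0 21.0: Tr(AB) = Tr(BA)
    print(np.trace(A) * np.trace(B))         # 10.0: so Tr(AB) != Tr(A)Tr(B) here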
We saw in Corollary 18.9 that similar (conjugate) matrices have the same
determinant; this followed from the fact that determinant is multiplicative.
Even though trace is not multiplicative, Theorem 20.1 leads to the analogous
result for similar matrices.
Corollary 20.2. Similar matrices have the same trace.
Proof. Let A, B ∈ Mn×n be such that B = QAQ−1 for some invertible
Q ∈ Mn×n . Then applying Theorem 20.1 to the matrices QA and Q−1 , we
have
Tr B = Tr((QA)(Q−1 )) = Tr((Q−1 )(QA)) = Tr A.
Exercise 20.3. Corollary 20.2 used the fact that Tr(ABC) = Tr(BCA). Is
it always true that Tr(ABC) = Tr(BAC)? If so, prove it. If not, provide a
counterexample.
thus λ1 = 2 has a corresponding eigenvector v1 = (1, −1), and λ2 = 5 has a
corresponding eigenvector v2 = (1, 2).
Note that v1 , v2 are linearly independent, hence they form a basis for
R2 . In particular, any v ∈ R2 can be written as v = a1 v1 + a2 v2 for some
a1 , a2 ∈ R, and then we have
    A^N v = a1 (A^N v1 ) + a2 (A^N v2 ) = a1 λ1^N v1 + a2 λ2^N v2 = a1 2^N v1 + a2 5^N v2 .
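The matrix A of this example is not reproduced above; any 2 × 2 matrix with these eigenpairs behaves the same way, for instance A = ( 3 1 ; 2 4 ), which is used below only as an assumed stand-in. A short numpy sketch of the computation:

    import numpy as np

    A = np.array([[3.0, 1.0], [2.0, 4.0]])   # assumed stand-in with eigenvalues 2 and 5
    v1 = np.array([1.0, -1.0])               # eigenvector for lambda1 = 2
    v2 = np.array([1.0, 2.0])                # eigenvector for lambda2 = 5
    v = np.array([4.0, 5.0])                 # an arbitrary vector

    # write v = a1 v1 + a2 v2 by solving a 2x2 linear system
    a1, a2 = np.linalg.solve(np.column_stack([v1, v2]), v)

    N = 10
    direct = np.linalg.matrix_power(A, N) @ v
    via_eigenbasis = a1 * 2.0**N * v1 + a2 * 5.0**N * v2
    print(np.allclose(direct, via_eigenbasis))  # True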
Lecture 21 Wed. Nov. 13
Diagonalisability
Further reading:
Example 21.2. Let A be the 3 × 3 matrix from Example 20.6, then the
spectrum of A is {1, 2}. The eigenvalue λ = 1 has algebraic and geomet-
ric multiplicities equal to 1, and the eigenvalue λ = 2 has algebraic and
geometric multiplicities equal to 2.
Exercise 21.3. Let A be an upper triangular matrix. Show that the eigen-
values of A are precisely the diagonal entries of A, and that the algebraic
multiplicity of an eigenvalue is the number of times it appears on the diag-
onal of A.
In Exercise 20.7, the eigenvalue λ = 1 has algebraic multiplicity 2 but
geometric multiplicity 1, so the two numbers are not always the same. We
will come back to the relationship between these two numbers later on. First
we establish some general results, beginning with the fact that eigenvectors
for different eigenvalues are linearly independent.
    vm = a1 v1 + · · · + aℓ vℓ ,    (21.1)

where 1 ≤ ℓ < m and aℓ ≠ 0. Consider two equations derived from this one;
multiplying both sides of (21.1) by A and using the eigenvector property
gives

    λm vm = a1 λ1 v1 + · · · + aℓ λℓ vℓ ,    (21.2)

while multiplying both sides of (21.1) by λm gives

    λm vm = a1 λm v1 + · · · + aℓ λm vℓ .    (21.3)
We saw earlier that similar matrices have the same determinant and
trace. They also have the same characteristic polynomial, eigenvalues, and
there is a clear relationship between their eigenspaces.
Theorem 21.6. If A and B are similar via a change-of-coordinates matrix
Q – that is, if B = QAQ−1 – then A and B have the same characteristic
polynomial. Thus every eigenvalue of A is also an eigenvalue of B, with the
same algebraic multiplicity. Moreover, the geometric multiplicities agree,
and if EλA = NA−λI is the eigenspace for A corresponding to λ, then EλA
and EλB are related by EλB = Q(EλA ).
using the fact that similar matrices have the same determinant. This im-
mediately implies that A and B have the same eigenvalues, with the same
algebraic multiplicities. Now given an eigenvalue λ for A and the corre-
sponding eigenspace EλA , we see that for any v ∈ EλA , we have
then Q = Iβα . This is because Q has the property that Qej = vj for each j,
and similarly Q−1 vj = ej . Let D = Q−1 AQ. Then
Because Dej gives the jth column of D, we see that D is diagonal with
entries λ1 , . . . , λn .
21.3 Multiplicities
We observed above that the algebraic and geometric multiplicities can dif-
fer. However, there is a universal relationship between them: the geometric
multiplicity is always less than or equal to the algebraic multiplicity. We
need the following exercise.
Exercise 21.10. Let A be a square matrix with the block form A = ( X Y ; 0 Z ),
where X, Z are square matrices and Y, 0 have the appropriate dimensions.
Show that det(A) = det(X) det(Z).
Lecture 22 Mon. Nov. 18
Further reading:
pA (λ) = det(λI − A) = (λ − λ1 ) · · · (λ − λn )
This gives the desired form for an−1 . For a0 , we observe that the only term
in (22.1) without any factor of t comes from choosing −Aj,π(j) in every term
of the product, and so
    a0 = Σ_{π∈Sn} sgn π · Π_{j=1}^n (−Aj,π(j) ) = (−1)ⁿ det A.
q(A) = c0 I + c1 A + c2 A² + · · · + cd A^d ,
Since the left-hand side is not invertible, it follows that one of the factors
A − νi I is non-invertible – in particular, some νi is an eigenvalue of A.
Because νi is a root of q(t) − λ, we see that q(νi ) − λ = 0, so λ = q(νi ).
Lecture 23 Wed. Nov. 20
Further reading:
Proof of Theorem 23.1. First we recall (20.3), which stated that for an in-
vertible matrix A, we have (A⁻¹)ij = (−1)^{i+j} (det Ãji )/(det A). By using the same
computations as we used in the proof of this formula, we can show that the
matrix B defined by Bij = (−1)^{i+j} det Ãji has the property that
(B is sometimes called the adjugate of A.) We will apply this fact to the
matrices Q(t) = tI − A, where t ∈ K. Let P (t) be the matrix whose entries
are given by
    P(t)ij = (−1)^{i+j} det Q̃(t)ji ;
for some coefficients Ck,ij ∈ K, and then letting Ck ∈ Mn×n be the matrix
with coefficients Ck,ij , we get
    P(t) = Σ_{k=0}^{n−1} Ck t^k .
(Recall that the top coefficient is always 1.) Then since the previous two
equations hold for all t ∈ K, the coefficients of tk must agree, giving
Cn−1 = I,
Cn−2 − Cn−1 A = an−1 I,
Cn−3 − Cn−2 A = an−2 I,
···
C0 − C1 A = a1 I
−C0 A = a0 I.
Now multiply both sides of the first line by An (from the right), both sides
of the second by An−1 , and so on. We get
    Cn−1 A^n = A^n ,
    Cn−2 A^{n−1} − Cn−1 A^n = an−1 A^{n−1} ,
    Cn−3 A^{n−2} − Cn−2 A^{n−1} = an−2 A^{n−2} ,
    ···
    C0 A − C1 A² = a1 A,
    −C0 A = a0 I.
Adding up all these equations, we see that the left side is 0 and the right
side is pA (A).
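A quick numerical check of the resulting statement pA (A) = 0 (the Cayley–Hamilton theorem), using numpy and an arbitrary 3 × 3 matrix:

    import numpy as np

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

    coeffs = np.poly(A)   # coefficients of p_A(t) = det(tI - A), leading coefficient 1
    n = A.shape[0]
    pA_of_A = sum(c * np.linalg.matrix_power(A, n - k) for k, c in enumerate(coeffs))
    print(np.allclose(pA_of_A, np.zeros((n, n))))  # True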
    λ1,2 = ½ ( Tr A ± √((Tr A)² − 4 det A) )
         = ½ ( a + d ± √(a² + 2ad + d² − 4(ad − bc)) )    (23.3)
         = ½ ( a + d ± √((a − d)² + 4bc) ).
Note that (23.4) is the sum of the diagonal matrix λI and the nilpotent
matrix ( 0 1 ; 0 0 ). It can be shown that this holds for matrices of any size – any
matrix can be written as the sum of a diagonalisable matrix and a nilpotent
matrix.
Theorem 23.2 has an important application: it lets us compute powers
of 2 × 2 matrices with relatively little fuss. Indeed, suppose A ∈ M2×2 is
such that D = Q−1 AQ is diagonal for some invertible Q. Then we have
A = QDQ−1 , and so
    A^N = (QDQ⁻¹)^N = (QDQ⁻¹)(QDQ⁻¹) · · · (QDQ⁻¹)   (N factors)
        = QD^N Q⁻¹ .
    A^N = QJ^N Q⁻¹ = Q(λ^N I + N λ^{N−1} B)Q⁻¹ = Q ( λ^N   N λ^{N−1} ; 0   λ^N ) Q⁻¹ .
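A numerical check of the Jordan-block power formula, with B = ( 0 1 ; 0 0 ) nilpotent (B² = 0) and arbitrary λ and N:

    import numpy as np

    lam, N = 1.5, 7
    B = np.array([[0.0, 1.0], [0.0, 0.0]])
    J = lam * np.eye(2) + B

    direct = np.linalg.matrix_power(J, N)
    formula = lam**N * np.eye(2) + N * lam**(N - 1) * B
    print(np.allclose(direct, formula))  # True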
Lecture 24 Mon. Nov. 25
Further reading:
t² − 2qt cos θ + q² ,
    P = [w | w̄] = [v1 | v2 ] ( i −i ; 1 1 ) = ST,

where T = ( i −i ; 1 1 ) is invertible. Hence S is invertible as well, so v1 , v2 form
a basis for R². Note that v1 = (1/2i)(w − w̄) and v2 = (1/2)(w + w̄).
−1
We claim that B = S AS hasthe form (24.1). Indeed, observing that
S = P T −1 and that T −1 = 12 −i 1
i 1 , we have
a + ib 0
B = S −1 AS = T P −1 AP T −1 = T T −1
0 a − ib
1 i −i a + ib 0 −i 1
=
2 1 1 0 a − ib i 1
1 ai − b −ai − b −i 1
=
2 a + ib a − ib i 1
24.1. WORKING OVER R 129
a −b
=
b a
where λ, µ, q, θ ∈ R.
In fact, one can determine which of the three cases in (24.2) occurs with
a quick look at pA (t) and A itself: the first case occurs if and only if pA (t)
has distinct real roots or is a scalar multiple of the identity matrix; the
second case occurs if and only if pA (t) has a repeated real root and is not a
scalar multiple of the identity; the third case occurs if and only if pA (t) has
complex roots.
The situation for larger matrices is more complicated, but in some
sense mimics the behaviour seen here. We observe that Theorem 24.2 can
also be formulated for linear transformations: if V is a vector space over R
with dimension 2 and T ∈ L(V ) is any linear operator, then there is a basis
β for V such that [T ]β has one of the forms in (24.2).
We saw in the previous section how to compute powers of A when it is
similar to one of the first two cases in (24.2). In fact, if we want to compute
A^N for a real matrix then even when A has complex eigenvalues we can still
use the diagonal form ( λ 0 ; 0 µ ), because the complex parts of Q, D^N , and Q⁻¹
will ultimately all cancel out in the product A^N = QD^N Q⁻¹ . Nevertheless,
it is worth pointing out that powers of the scaled rotation matrix also have
a nice formula.
Proposition 24.3. Writing Rθ = ( cos θ  −sin θ ; sin θ  cos θ ), we have
(qRθ )^N = q^N R_{Nθ} for every N ∈ N.
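A quick numerical check of this formula in numpy (arbitrary q, θ, and N, chosen only for illustration):

    import numpy as np

    def R(theta):
        return np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])

    q, theta, N = 1.3, 0.7, 6
    lhs = np.linalg.matrix_power(q * R(theta), N)
    rhs = q**N * R(N * theta)
    print(np.allclose(lhs, rhs))  # True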
This is a system of two first order ODEs, which are coupled. If we decou-
ple them by diagonalising A, then we can solve each independently. The
characteristic polynomial of A is
and so writing D = ( λ 0 ; 0 µ ), this becomes
These can be solved independently. (Note that if the roots are complex,
then we will need to take appropriate linear combinations of the complex
solutions z1,2 to obtain real solutions.)
But what if A does not diagonalise? If (24.5) has a repeated root, then
because A is not a scalar multiple of the identity we know that we will get
A = QJQ⁻¹ , where J = ( λ 1 ; 0 λ ), and so the change of coordinates z = Q⁻¹x turns the system into
ż1 = λz1 + z2 ,
ż2 = λz2 .
In the particular case g(t) = z2 (0)eλt , we see that g(s)e−λs = z2 (0) is con-
stant, and so
z1 (t) = z1 (0)eλt + z2 (0)teλt .
Thus the source of the solution teλt when there is a repeated root is the fact
that an eigenvalue with mg < ma leads to a Jordan block. Another way of
viewing this is the following. Let V be the vector space of smooth (infinitely
many times differentiable) functions from R to R, and let D ∈ L(V ) be
the differentiation operator. Then the ODE ẍ + aẋ + bx = 0 becomes
(D2 + aD + bI)(x) = 0. When there is a repeated root we have
s² + as + b = (s − λ)²  ⇒  D² + aD + bI = (D − λI)² ,
and so solving the ODE becomes a question of finding the null space of
(D − λI)². The null space of D − λI consists of the eigenfunctions of the differentiation
operator with eigenvalue λ, namely all scalar multiples of eλt , but then we must also consider
generalised eigenfunctions, which are in the null space of (D − λI)2 but not
(D − λI). We see that
    d/dt (teλt ) = λ(teλt ) + eλt .
Writing x(t) = eλt and y(t) = teλt , this can be rewritten as Dy = λy + x and Dx = λx.
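A small symbolic check (using sympy) that teλt is annihilated by (D − λI)² but not by D − λI:

    import sympy as sp

    t, lam = sp.symbols('t lam')
    y = t * sp.exp(lam * t)

    first = sp.simplify(sp.diff(y, t) - lam * y)           # (D - lam I) y
    second = sp.simplify(sp.diff(first, t) - lam * first)  # (D - lam I)^2 y
    print(first, second)  # exp(lam*t) 0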
Lecture 25 Mon. Dec. 2
Markov chains
Further reading:
(Graph of the two-city Markov chain, with an arrow for each of the four transitions and its probability.)
The arrows in the graph illustrate the four “transitions” that can occur:
“staying in Houston”; “moving from Houston to Dallas”; “moving from
Dallas to Houston”; “staying in Dallas”. The number with each arrow gives
the probability of that transition occurring.
The circles are called “vertices” of the graph, and the arrows are called
“edges”. If we label the vertices 1 and 2, then the graph drawn above has
the following structure:
P12
P11 1 2 P22
P21
Here Pij is the probability that a person in city i this year will be in city j
next year. We can represent all the information in the graph by writing the
2 × 2 matrix P whose entries are Pij :
    P = ( P11 P12 ; P21 P22 ) = ( .95 .05 ; .1 .9 ).
w1 = .95v1 + .1v2
w2 = .05v1 + .9v2 ,
Thus next year’s population distribution can be obtained from this year’s
by right multiplication by P . Iterating this procedure, the distribution two
years from now will be wP = vP², three years from now will be vP³, and
more generally, the population distribution after n years will be vPⁿ.
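A short numpy sketch of this iteration (the starting distribution is arbitrary); the vectors vPⁿ converge to the stationary distribution (2/3, 1/3):

    import numpy as np

    P = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
    v = np.array([0.5, 0.5])   # an arbitrary starting distribution

    for n in [1, 5, 20, 100]:
        print(n, v @ np.linalg.matrix_power(P, n))
    # the printed vectors approach (2/3, 1/3), the left eigenvector of P for eigenvalue 1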
A system like the one just described is called a Markov chain. More
precisely, a Markov chain is a collection of states together with a collection
of transition probabilities. The states are usually encoded with the labels
1, 2, . . . , d, and then the transition probabilities Pij are viewed as the entries
of a d × d matrix P , where Pij is the probability of making a transition from
state i to state j. A Markov chain can also be encoded by a directed graph
such as the one above – that is, a collection of vertices and edges, where the
vertices correspond to states of the system, and each edge is an arrow from
one vertex to another labelled with the probability of the corresponding
transition.
Note that every entry Pij lies in the interval [0, 1]. In fact, something
more can be said about the transition probabilities Pij . If we fix i, then
Pi1 + Pi2 represents the probability of going from state i to either of states
1 or 2; similarly, Pi1 + Pi2 + Pi3 is the probability of going from state i to
any of the states 1, 2, 3, and so on. In particular, Σ_{j=1}^d Pij = 1, since we go
somewhere with probability 1.
(Graph of a three-state Markov chain with states "unbonded", "bonded", and
"transformed"; the labelled transition probabilities are .6, .4, .5, .3, .2, and 1.)
and so at the next time step the system is in state 1 with probability Pi1 ,
state 2 with probability Pi2 , and so on. More generally, if the current dis-
tribution of the system is represented by a probability vector v, and the
distribution at the next time step is represented by a probability vector w,
then we have
    wj = P(state j tomorrow)
       = Σ_{i=1}^d P(state j tomorrow given state i today) · P(state i today)
       = Σ_{i=1}^d vi Pij = (vP )j .
i=1
where the first equality is matrix multiplication, the second is just rearrang-
ing terms, the third is because P is a stochastic matrix, and the fourth is
because v is a probability vector.
    e1 = a1 v1 + a2 v2 + a3 v3 ,
    e1 P^n = a1 v1 P^n + a2 v2 P^n + a3 v3 P^n
           = a1 λ1^n v1 + a2 λ2^n v2 + a3 λ3^n v3
           = a1 e3 + (a2 λ2^n v2 + a3 λ3^n v3 ).
As n → ∞, the quantity inside the brackets goes to 0 because |λ2 |, |λ3 | < 1,
and so e1 P n → a1 e3 . Because e1 P n is a probability vector for all n, we see
that a1 = 1, and so e1 P n → e3 . In terms of the original example, this is
the statement that at large times, the molecule has almost certainly been
transformed.
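To see this convergence numerically one needs the transition matrix itself; the matrix used below is an assumption, chosen only to be consistent with the three-state graph above (state 3, "transformed", is absorbing):

    import numpy as np

    # assumed transition matrix for the unbonded / bonded / transformed chain
    P = np.array([[0.6, 0.4, 0.0],
                  [0.3, 0.5, 0.2],
                  [0.0, 0.0, 1.0]])
    e1 = np.array([1.0, 0.0, 0.0])

    for n in [1, 10, 50, 200]:
        print(n, e1 @ np.linalg.matrix_power(P, n))
    # e1 P^n -> e3 = (0, 0, 1): at large times the molecule has almost
    # certainly been transformed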
an eigenvalue, and all other eigenvalues λ have |λ| < 1. Thus writing the
initial probability distribution as v = w + x, where w is a left eigenvector
for 1 and x is a sum of generalised eigenvectors for other eigenvalues, one
gets
    vP^n = wP^n + xP^n = w + xP^n → w;
the term xP n goes to 0 because all other eigenvalues have absolute value
less than 1. Consequently, it is the eigenvector corresponding to the largest
eigenvalue 1 that governs the long-term behaviour of the system. Moreover,
this eigenvector w has the property that w = wP , meaning that w is a
stationary probability distribution for the Markov chain: if w describes the
probability distribution of the system at the present time, then it will also
describe the probability distribution of the system at all future times.
Lecture 26 Wed. Dec. 4
Further reading:
To turn the informal discussion from the end of the last lecture into
a precise result, we need a little more terminology. Consider a Markov
chain with states {1, . . . , d} and transition matrix P ∈ Md×d . Let G be
the corresponding graph, where we only draw edges that correspond to a
transition with a non-zero probability of occurring. (Thus in the example of
the Michaelis–Menten kinetics, we do not draw the edge from “unbonded”
to “transformed”, or the edges from “transformed” to “bonded” or “un-
bonded”.)
Definition 26.1. A path in a directed graph is a sequence of edges such
that each edge starts at the vertex where the previous one terminated. A
Markov chain is irreducible if given any two states i and j, there is a path
that starts at vertex i and terminates at vertex j.
Example 26.2. The Markov chain for the Dallas–Houston example is irre-
ducible, since every transition happens with a non-zero probability, and so
the graph contains all possible edges. The Markov chain for the Michaelis–
Menten kinetics is not irreducible, since there is no path from “transformed”
to either of the other vertices.
Proposition 26.3. A Markov chain is irreducible if and only if its transition
matrix P has the property that for every 1 ≤ i, j ≤ d, there is some n ∈ N
such that (P n )ij > 0.
Proof. (⇒). Given irreducibility, for every i, j there is a path in the graph
that connects i to j. Let the vertices of this path be i0 , i1 , . . . , in , where
i0 = i and in = j. Then because each ik → ik+1 is required to be an edge in
the graph, we must have Pik ik+1 > 0, and so
    (P^n)ij = Σ_{j1=1}^d Σ_{j2=1}^d · · · Σ_{jn−1=1}^d Pij1 Pj1j2 · · · Pjn−1j    (26.1)
            ≥ Pi0i1 Pi1i2 · · · Pin−1in > 0.
(⇐). Given any i, j and n such that (P n )ij > 0, the first line of (26.1) shows
that there exist j1 , . . . , jn−1 such that Pij1 > 0, Pj1 j2 > 0, . . . , Pjn−1 j > 0.
Thus the graph contains an edge from i to j1 , from j1 to j2 , and so on
through to jn−1 to j. This gives a path from i to j, and since this holds for
all i, j, we conclude that the Markov chain is irreducible.
4. Let Q be the d × d matrix whose row vectors are all equal to v. Then
limn→∞ P^n = Q. Equivalently, wP^n → v for every probability vector
w.
Exercise 26.7. Show that for any stochastic matrix, the column vector whose
entries are all equal to 1 is a right eigenvector for the eigenvalue 1.
There is also a version of this theorem that works for irreducible matri-
ces (it is slightly more complicated in this case because P may have other
eigenvalues with |λ| = 1). The Perron–Frobenius theorem tells us that for
a primitive (or more generally, irreducible) Markov chain, the long-term
behaviour is completely governed by the eigenvector corresponding to the
eigenvalue 1.
Example 26.8. Consider a mildly inebriated college student who is watch-
ing the Super Bowl on TV, but doesn’t really care for football, and so at any
given point in time might decide to leave the game and go get pizza, or go
make another gin and tonic (his drink of choice, which he thinks goes really
well with mushroom and anchovy pizza), or step over for a little while to the
outdoor theatre next to his apartment, where the symphony is performing
Beethoven’s seventh. Suppose that every ten minutes, he either moves to a
new location or stays where he is, with transition probabilities as shown in
the following graph.
(Graph of a four-state Markov chain with states "football", "pizza", "gin &
tonic", and "Beethoven"; the labelled transition probabilities include .6, .2, .7,
.4, .3, .9, .5, and several .1's.)
Notice that some transition probabilities are 0: he never stays at the pizza
place for more than ten minutes, and it doesn’t take longer than ten minutes
to make his next gin & tonic; similarly, after either of those he always goes
to either football or Beethoven, and he never goes from Beethoven directly
to football without first getting either pizza or a drink. Nevertheless, one
can check that P 2 has all positive entries, and so P is primitive.
Numerical computations show that the largest eigenvalue (in absolute
value) is 1, and that there are three other (distinct) eigenvalues, each with
|λ| < 1/2. The left eigenvector corresponding to the largest eigenvalue is
v ≈ ( .39  .21  .07  .33 ), so that in the long run the student will spend 39%
of his time watching football, 21% of his time getting pizza, 7% of his time
making himself a gin & tonic, and 33% of his time at the symphony.
It is also instructive to compute some powers of P and compare them to
the matrix Q whose row vectors are all equal to v. We find that
           .42 .20 .08 .3                .39 .21 .07 .33
           .45 .17 .06 .32               .39 .21 .07 .33
    P³ ≈   .31 .20 .05 .44   ,    P⁶ ≈   .38 .21 .07 .34
           .32 .24 .08 .36               .38 .21 .07 .34
Now let v ∈ Rn be a probability vector that encodes our current best guess
as to the relative "importance" of the websites: that is, website i has a
number vi between 0 and 1 attached to it, and Σ_i vi = 1. According to
the above principle (importance being proportional to importance of linking
sites), the “correct” v should have the property that
    vj = Σ_{i=1}^N vi Pij ,
where we think of vi Pij as the amount of the site i’s importance that it
passes to the site j. This equation is just the statement that v = vP , that
is, that v is an eigenvector for P with eigenvalue 1. By the Perron–Frobenius
theorem, there is a unique such eigenvector (as long as P is primitive), and
we can use this to rank the websites: the highest-ranked website is the one
for which vi is largest, and so on.
As an illustration, consider the following very small internet, which only
has five websites.
(Graph with five websites A, B, C, D, E; the links between them are recorded
in the matrix L below.)
The matrix L described above, where Lij is the number of links from i to j,
is
        0 1 0 1 0
        1 0 1 0 1
    L = 0 1 0 0 1
        0 1 0 0 1
        1 0 0 0 0 .
To get the stochastic matrix P , we have to normalise each row of L to be a
probability vector, obtaining
        0    1/2  0    1/2  0
        1/3  0    1/3  0    1/3
    P = 0    1/2  0    0    1/2
        0    1/2  0    0    1/2
        1    0    0    0    0 .
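A short numpy sketch for this five-site example: build P by normalising the rows of L and run the iteration vPⁿ, which converges to the ranking vector v (the approximate values in the comment follow from this particular L):

    import numpy as np

    L = np.array([[0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 0, 1],
                  [0, 1, 0, 0, 1],
                  [1, 0, 0, 0, 0]], dtype=float)
    P = L / L.sum(axis=1, keepdims=True)   # normalise each row to sum to 1

    v = np.full(5, 1 / 5)                  # start from the uniform distribution
    for _ in range(1000):                  # power iteration: v P^n -> v with v P = v
        v = v @ P
    print(np.round(v, 3))  # approximately [0.294 0.265 0.088 0.147 0.206] for sites A-E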
of where the monkey currently is, then wP describes the probability distri-
bution of where he will be after he follows the next link. After n clicks, his
likely location is described by wP n , which by the Perron–Frobenius theorem
converges to the eigenvector v.
In practice, some tweaks to the process described here are needed to deal
with the real internet - for example, if there is a website that doesn’t link
anywhere, then P is not primitive and the monkey would simply get trapped
at that site, even if it’s not really that important a website. Thus one has
to add a small probability of “resetting” at any given time. And there are
other issues as well – but this approach via Markov chains is a very powerful
one and illustrates the basic idea.