Strang 367-376
A great matrix factorization has been saved for the end of the basic course. UΣV^T joins
with LU from elimination and QR from orthogonalization (Gauss and Gram-Schmidt).
Nobody’s name is attached; A = UΣV^T is known as the “SVD” or the singular value
decomposition. We want to describe it, to prove it, and to discuss its applications,
which are many and growing.
The SVD is closely associated with the eigenvalue-eigenvector factorization QΛQ^T of
a positive definite matrix. The eigenvalues are in the diagonal matrix Λ. The eigenvector
matrix Q is orthogonal (Q^T Q = I) because eigenvectors of a symmetric matrix can be
chosen to be orthonormal. For most matrices that is not true, and for rectangular matrices
it is ridiculous (eigenvalues undefined). But now we allow the Q on the left and the Q^T
on the right to be any two orthogonal matrices U and V^T, not necessarily transposes of
each other. Then every matrix will split into A = UΣV^T.
The diagonal (but rectangular) matrix Σ gets its entries from the eigenvalues of A^T A, not
from A! Those positive entries (also called sigma) will be σ_1, ..., σ_r, the square roots of
the nonzero eigenvalues of A^T A. They are the singular values of A. They fill the first r
places on the main diagonal of Σ when A has rank r. The rest of Σ is zero.
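A minimal NumPy sketch of this relationship. The 2 by 3 matrix is only an illustrative choice, and the variable names are assumptions of this sketch: the singular values returned by `np.linalg.svd` should equal the square roots of the nonzero eigenvalues of A^T A, and Σ is rectangular with those values on its main diagonal.

```python
import numpy as np

# Illustrative 2-by-3 matrix of rank 2 (an assumption, chosen for the demo).
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# Full SVD: U is 2x2, Vt is 3x3, s holds the singular values, largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Build the rectangular "diagonal" Sigma (same shape as A) with s on its diagonal.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Eigenvalues of the symmetric matrix A^T A, sorted largest first to match s.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]

reconstruction_ok = np.allclose(U @ Sigma @ Vt, A)       # A = U Sigma V^T
sigma_squared_ok = np.allclose(s**2, eigvals[:len(s)])   # sigma_i^2 = eigenvalue i
```

Here A^T A is 3 by 3 and has one zero eigenvalue, which is why only its first two eigenvalues are compared with the two singular values.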
With rectangular matrices, the key is almost always to consider A^T A and AA^T.
Remark 1. For positive definite matrices, Σ is Λ and UΣV^T is identical to QΛQ^T. For
other symmetric matrices, any negative eigenvalues in Λ become positive in Σ. For
complex matrices, Σ remains real but U and V become unitary (the complex version of
orthogonal). We take complex conjugates in U^H U = I and V^H V = I and A = UΣV^H.
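Remark 2. U and V give orthonormal bases for all four fundamental subspaces:
    first r columns of U: column space of A
    last m − r columns of U: left nullspace of A
    first r columns of V: row space of A
    last n − r columns of V: nullspace of A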
Remark 3. The SVD chooses those bases in an extremely special way. They are more
than just orthonormal. When A multiplies a column v_j of V, it produces σ_j times a
column of U. That comes directly from AV = UΣ, looked at a column at a time.
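The column-by-column statement Av_j = σ_j u_j can be checked numerically. This is a small sketch with an arbitrarily chosen 2 by 2 matrix (an assumption, not an example from the text); it also confirms that the length of Av_j is σ_j.

```python
import numpy as np

# Arbitrary invertible example matrix (an assumption for illustration).
A = np.array([[2.0, 2.0],
              [-1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)
V = Vt.T  # columns of V are the vectors v_j

# A v_j should equal sigma_j u_j for every j: this is AV = U Sigma, one column at a time.
column_checks = [np.allclose(A @ V[:, j], s[j] * U[:, j]) for j in range(len(s))]

# The length of A v_j should be sigma_j, because v_j and u_j are unit vectors.
length_checks = [np.isclose(np.linalg.norm(A @ V[:, j]), s[j]) for j in range(len(s))]
```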
368 Chapter 6 Positive Definite Matrices
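Remark 4. Eigenvectors of AA^T and A^T A must go into the columns of U and V:
\[ AA^T = (U\Sigma V^T)(V\Sigma^T U^T) = U\Sigma\Sigma^T U^T \quad\text{and similarly}\quad A^T A = V\Sigma^T\Sigma V^T. \]
U must be the eigenvector matrix for AA^T. The eigenvalue matrix in the middle is ΣΣ^T,
which is m by m with σ_1^2, ..., σ_r^2 on the diagonal.
From A^T A = VΣ^TΣV^T, the V matrix must be the eigenvector matrix for A^T A. The
diagonal matrix Σ^TΣ has the same σ_1^2, ..., σ_r^2, but it is n by n.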
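Remark 5. Here is the reason that Av_j = σ_j u_j. Start with A^T A v_j = σ_j^2 v_j and multiply by A:
\[ AA^T A v_j = \sigma_j^2 A v_j. \]
This says that Av_j is an eigenvector of AA^T! We just moved parentheses to (AA^T)(Av_j).
The length of this eigenvector Av_j is σ_j, because
\[ v_j^T A^T A v_j = \sigma_j^2 v_j^T v_j \quad\text{gives}\quad \|Av_j\|^2 = \sigma_j^2. \]
So the unit eigenvector is Av_j/σ_j = u_j.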
We will pick a few important applications, after emphasizing one key point. The SVD is
terrific for numerically stable computations, because U and V are orthogonal matrices.
They never change the length of a vector. Since ‖Ux‖^2 = x^T U^T U x = ‖x‖^2, multiplication
by U cannot destroy the scaling.
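A short sketch of the length-preserving property. The orthogonal matrix here is obtained from a QR factorization of a random matrix, which is just a convenient assumption for producing some orthogonal Q; any orthogonal matrix would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# QR of a random square matrix gives an orthogonal factor Q (demo assumption).
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x = rng.standard_normal(4)

is_orthogonal = np.allclose(Q.T @ Q, np.eye(4))                       # Q^T Q = I
length_preserved = np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))  # ||Qx|| = ||x||
```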
6.3 Singular Value Decomposition 369
1. Image processing Suppose a satellite takes a picture, and wants to send it to Earth.
The picture may contain 1000 by 1000 “pixels”—a million little squares, each with a
definite color. We can code the colors, and send back 1,000,000 numbers. It is better to
find the essential information inside the 1000 by 1000 matrix, and send only that.
Suppose we know the SVD. The key is in the singular values (in Σ). Typically, some
σ’s are significant and others are extremely small. If we keep 20 and throw away 980,
then we send only the corresponding 20 columns of U and V. The other 980 columns
are multiplied in UΣV^T by the small σ’s that are being ignored. We can do the matrix
multiplication as columns times rows:
\[ A = U\Sigma V^T = u_1\sigma_1 v_1^T + u_2\sigma_2 v_2^T + \cdots + u_r\sigma_r v_r^T. \qquad (3) \]
Any matrix is the sum of r matrices of rank 1. If only 20 terms are kept, we send 20
times 2000 numbers instead of a million (25 to 1 compression).
The pictures are really striking, as more and more singular values are included. At
first you see nothing, and suddenly you recognize everything. The cost is in computing
the SVD—this has become much more efficient, but it is expensive for a big matrix.
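The compression idea can be sketched in NumPy as follows. The function names `rank_k_approximation` and `numbers_to_send` are hypothetical helpers introduced here, and the random matrix is only a stand-in for real image data. The storage count assumes each σ_i is absorbed into its column u_i, which matches the count of 20 times 2000 numbers above.

```python
import numpy as np

def rank_k_approximation(A, k):
    """Keep the k largest singular values: the sum of k rank-1 pieces u_i sigma_i v_i^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k columns of U by the sigmas, then multiply by the first k rows of V^T.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def numbers_to_send(m, n, k):
    """k columns of U (length m) plus k columns of V (length n), sigmas absorbed into U."""
    return k * (m + n)

rng = np.random.default_rng(0)
image = rng.standard_normal((60, 40))   # synthetic stand-in for a picture
k = 5
approx = rank_k_approximation(image, k)

# The spectral-norm error of the best rank-k approximation is the first discarded sigma.
s = np.linalg.svd(image, compute_uv=False)
spectral_error = np.linalg.norm(image - approx, 2)

# The text's count: a 1000 by 1000 picture with 20 terms kept.
compression_ratio = (1000 * 1000) / numbers_to_send(1000, 1000, 20)
```

A random matrix has no fast-decaying singular values, so this stand-in compresses poorly; real pictures are the case where a few σ's dominate.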
2. The effective rank The rank of a matrix is the number of independent rows, and
the number of independent columns. That can be hard to decide in computations! In
exact arithmetic, counting the pivots is correct. Real arithmetic can be misleading—but
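discarding small pivots is not the answer. Consider the following (ε is small):
\[ \begin{bmatrix} \varepsilon & 2\varepsilon \\ 1 & 2 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} \varepsilon & 1 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad \begin{bmatrix} \varepsilon & 1 \\ \varepsilon & 1+\varepsilon \end{bmatrix}. \]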
The first has rank 1, although roundoff error will probably produce a second pivot. Both
pivots will be small; how many do we ignore? The second has one small pivot, but we
cannot pretend that its row is insignificant. The third has two pivots and its rank is 2, but
its “effective rank” ought to be 1.
We go to a more stable measure of rank. The first step is to use A^T A or AA^T, which
are symmetric but share the same rank as A. Their eigenvalues (the singular values
squared) are not misleading. Based on the accuracy of the data, we decide on a tolerance
like 10^−6 and count the singular values above it; that count is the effective rank. The
examples above have effective rank 1 (when ε is very small).
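A minimal sketch of this counting rule applied to the three example matrices. The helper name `effective_rank` and the value ε = 10^−9 are assumptions made for the demonstration; the tolerance 10^−6 is the one suggested above.

```python
import numpy as np

def effective_rank(A, tol=1e-6):
    """Count the singular values of A that exceed the tolerance."""
    singular_values = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(singular_values > tol))

eps = 1e-9  # "very small" epsilon, an assumed value for the demo

examples = [
    np.array([[eps, 2 * eps], [1.0, 2.0]]),    # exact rank 1
    np.array([[eps, 1.0], [0.0, 0.0]]),        # one small pivot, but a significant row
    np.array([[eps, 1.0], [eps, 1.0 + eps]]),  # exact rank 2, nearly dependent rows
]

ranks = [effective_rank(M) for M in examples]
```

The third matrix has determinant ε^2, so its second singular value is of order ε^2 and falls far below the tolerance.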
3. Polar decomposition Every nonzero complex number z is a positive number r times
a number e^{iθ} on the unit circle: z = re^{iθ}. That expresses z in “polar coordinates.” If
we think of z as a 1 by 1 matrix, r corresponds to a positive definite matrix and e^{iθ}
corresponds to an orthogonal matrix. More exactly, since e^{iθ} is complex and satisfies
e^{−iθ}e^{iθ} = 1, it forms a 1 by 1 unitary matrix: U^H U = I. We take the complex conjugate
as well as the transpose, for U^H.
The SVD extends this “polar factorization” to matrices of any size:
Every real square matrix can be factored into A = QS, where Q is orthogonal
and S is symmetric positive semidefinite. If A is invertible then S is positive
definite.
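For proof we just insert V^T V = I into the middle of the SVD:
\[ A = U\Sigma V^T = (UV^T)(V\Sigma V^T) = (Q)(S). \]
The factor S = VΣV^T is symmetric and semidefinite (because Σ is). The factor Q = UV^T
is orthogonal, as a product of orthogonal matrices.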
The exercises show how, in the reverse order, S changes but Q remains the same. Both
S and S′ are symmetric positive definite because this A is invertible.
that A must have independent columns.) In the reverse order A = S′Q, the matrix S′ is
the symmetric positive definite square root of AA^T.
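The polar factors can be sketched directly from the SVD: Q = UV^T, S = VΣV^T, and, for the reverse order, S′ = UΣU^T. The function name `polar_decomposition` and the 2 by 2 test matrix are assumptions of this sketch.

```python
import numpy as np

def polar_decomposition(A):
    """Return Q, S, S_prime with A = Q S = S_prime Q, built from the SVD of a square A."""
    U, s, Vt = np.linalg.svd(A)
    Q = U @ Vt                         # orthogonal factor U V^T
    S = Vt.T @ np.diag(s) @ Vt         # V Sigma V^T, symmetric positive semidefinite
    S_prime = U @ np.diag(s) @ U.T     # U Sigma U^T, the factor for the reverse order
    return Q, S, S_prime

# Invertible example matrix (an assumption for illustration), so S is positive definite.
A = np.array([[1.0, -2.0],
              [3.0, -1.0]])
Q, S, S_prime = polar_decomposition(A)
```

Because S′ = UΣU^T, squaring it gives UΣ^2U^T = AA^T, which is why S′ is described as the square root of AA^T.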
That minimum length solution will be called x^+. It is our preferred choice as the best
solution to Ax = b (which had no solution), and also to A^T A x̂ = A^T b (which had too
many). We start with a diagonal example.
The columns all end with zero. In the column space, the closest vector to b = (b_1, b_2, b_3)
is p = (b_1, b_2, 0). The best we can do with Ax = b is to solve the first two equations,
since the third equation is 0 = b_3. That error cannot be reduced, but the errors in the first
two equations will be zero. Then
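Now we face the second difficulty. To make x̂ as short as possible, we choose the
totally arbitrary x̂_3 and x̂_4 to be zero. The minimum length solution is x^+
(A^+ is the pseudoinverse, and x^+ = A^+ b is the shortest solution):
\[ x^+ = \begin{bmatrix} b_1/\sigma_1 \\ b_2/\sigma_2 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/\sigma_1 & 0 & 0 \\ 0 & 1/\sigma_2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}. \qquad (5) \]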
This equation finds x^+, and it also displays the matrix that produces x^+ from b. That
matrix is the pseudoinverse A^+ of our diagonal A. Based on this example, we know Σ^+
3. All solutions of A^T A x̂ = A^T b have the same x_r. That vector is x^+.
The fundamental theorem of linear algebra was in Figure 3.4. Every p in the column
space comes from one and only one vector x_r in the row space. All we are doing is to
choose that vector, x^+ = x_r, as the best solution to Ax = b.
The pseudoinverse in Figure 6.3 starts with b and comes back to x^+. It inverts A where
A is invertible: between row space and column space. The pseudoinverse knocks out
the left nullspace by sending it to zero, and it knocks out the nullspace by choosing x_r as
x^+.
We have not yet shown that there is a matrix A^+ that always gives x^+, but there is.
It will be n by m, because it takes b and p in R^m back to x^+ in R^n. We look at one more
example before finding A^+ in general.
Example 6. Ax = b is −x_1 + 2x_2 + 2x_3 = 18, with a whole plane of solutions.
According to our theory, the shortest solution should be in the row space of A =
[−1 2 2]. The multiple of that row that satisfies the equation is x^+ = (−2, 4, 4). There
are longer solutions like (−2, 5, 3), (−2, 7, 1), or (−6, 3, 3), but they all have nonzero
components from the nullspace. The matrix that produces x^+ from b = [18] is the
pseudoinverse A^+. Whereas A was 1 by 3, this A^+ is 3 by 1:
\[ A^+ = \begin{bmatrix} -1 & 2 & 2 \end{bmatrix}^+ = \begin{bmatrix} -\tfrac{1}{9} \\ \tfrac{2}{9} \\ \tfrac{2}{9} \end{bmatrix} \quad\text{and}\quad A^+[18] = \begin{bmatrix} -2 \\ 4 \\ 4 \end{bmatrix}. \qquad (6) \]
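Example 6 can be reproduced with NumPy's built-in `np.linalg.pinv`. This is only a numerical check of the numbers above, not the method of the text; the variable names are assumptions of this sketch.

```python
import numpy as np

A = np.array([[-1.0, 2.0, 2.0]])   # the 1 by 3 matrix of Example 6
b = np.array([18.0])

A_plus = np.linalg.pinv(A)         # 3 by 1 pseudoinverse
x_plus = A_plus @ b                # shortest solution of Ax = b

# The longer solutions listed in the text: each solves Ax = b but has larger length.
other_solutions = [np.array(v, dtype=float) for v in [(-2, 5, 3), (-2, 7, 1), (-6, 3, 3)]]
all_solve = all(np.allclose(A @ x, b) for x in other_solutions)
all_longer = all(np.linalg.norm(x) > np.linalg.norm(x_plus) for x in other_solutions)
```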
Figure 6.3: The pseudoinverse A^+ inverts A where it can on the column space.
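Example 6 had σ = 3, the square root of the eigenvalue of AA^T = [9]. Here it is again
with Σ and Σ^+:
\[ A = \begin{bmatrix} -1 & 2 & 2 \end{bmatrix} = U\Sigma V^T = \begin{bmatrix} 1 \end{bmatrix} \begin{bmatrix} 3 & 0 & 0 \end{bmatrix} \begin{bmatrix} -\tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \end{bmatrix} \]
\[ V\Sigma^+ U^T = \begin{bmatrix} -\tfrac{1}{3} & \tfrac{2}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & \tfrac{2}{3} & -\tfrac{1}{3} \end{bmatrix} \begin{bmatrix} \tfrac{1}{3} \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 \end{bmatrix} = \begin{bmatrix} -\tfrac{1}{9} \\ \tfrac{2}{9} \\ \tfrac{2}{9} \end{bmatrix} = A^+. \]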
Introduce the new unknown y = V^T x = V^{−1}x, which has the same length as x. Then,
minimizing ‖Ax − b‖ is the same as minimizing ‖Σy − U^T b‖. Now Σ is diagonal and we
know the best y^+. It is y^+ = Σ^+ U^T b so the best x^+ is Vy^+:
\[ x^+ = Vy^+ = V\Sigma^+ U^T b = A^+ b. \]
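The formula A^+ = VΣ^+U^T can be sketched as follows. The function name `pseudoinverse`, the absolute cutoff `tol`, and the test vector b are assumptions of this sketch. Σ^+ is n by m, with 1/σ_i on the diagonal for each nonzero σ_i and zeros elsewhere. The rank-1 matrix is the one from Problem 2.

```python
import numpy as np

def pseudoinverse(A, tol=1e-12):
    """A^+ = V Sigma^+ U^T, inverting only singular values above tol (an assumed cutoff)."""
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    Sigma_plus = np.zeros((n, m))          # transposed shape of Sigma
    for i, sigma in enumerate(s):
        if sigma > tol:
            Sigma_plus[i, i] = 1.0 / sigma
    return Vt.T @ Sigma_plus @ U.T

# Rank-1 matrix: Ax = b usually has no solution and A^T A x = A^T b has many.
A = np.array([[1.0, 4.0],
              [2.0, 8.0]])
b = np.array([1.0, 0.0])   # arbitrary right-hand side (an assumption)

A_plus = pseudoinverse(A)
x_plus = A_plus @ b

# x^+ satisfies the normal equations and lies in the row space (orthogonal to the nullspace).
solves_normal_equations = np.allclose(A.T @ A @ x_plus, A.T @ b)
nullspace_vector = np.array([4.0, -1.0])   # A @ nullspace_vector = 0
in_row_space = np.isclose(x_plus @ nullspace_vector, 0.0)
```

For a rank-1 matrix the result reduces to A^+ = A^T/σ_1^2, which gives a simple independent check: here σ_1^2 = 1 + 16 + 4 + 64 = 85.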
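2. (a) Compute AA^T and its eigenvalues σ_1^2, 0 and unit eigenvectors u_1, u_2.
(b) Choose signs so that Av_1 = σ_1 u_1 and verify the SVD:
\[ \begin{bmatrix} 1 & 4 \\ 2 & 8 \end{bmatrix} = \begin{bmatrix} u_1 & u_2 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 & v_2 \end{bmatrix}^T. \]
(c) Which four vectors give orthonormal bases for C(A), N(A), C(A^T), N(A^T)?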
Problems 3–5 ask for the SVD of matrices of rank 2.
3. Find the SVD from the eigenvectors v_1, v_2 of A^T A and Av_i = σ_i u_i:
\[ \text{Fibonacci matrix} \quad A = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}. \]
4. Use the SVD part of the MATLAB demo eigshow (or Java on the course page
web.mit.edu/18.06) to find the same vectors v1 and v2 graphically.
5. Compute A^T A and AA^T, and their eigenvalues and unit eigenvectors, for
\[ A = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}. \]
\[ A = \sigma_1 u_1 v_1^T + \cdots + \sigma_r u_r v_r^T. \]
You can compute A^+, or find the general solution to A^T A x̂ = A^T b and choose the
solution that is in the row space of A. This problem fits the best plane C + Dt + Ez to
b = 0 and also b = 2 at t = z = 0 (and b = 2 at t = z = 1).
21. Removing zero rows of U leaves A = LU, where the r columns of L span the column
space of A and the r rows of U span the row space. Then A^+ has the explicit formula
U^T(UU^T)^{−1}(L^T L)^{−1}L^T.
Why is A^+ b in the row space with U^T at the front? Why does A^T AA^+ b = A^T b, so
that x^+ = A^+ b satisfies the normal equation as it should?
22. Explain why AA^+ and A^+ A are projection matrices (and therefore symmetric). What
fundamental subspaces do they project onto?
In this section we escape for the first time from linear equations. The unknown x will not
be given as the solution to Ax = b or Ax = λ x. Instead, the vector x will be determined
by a minimum principle.
It is astonishing how many natural laws can be expressed as minimum principles. Just
the fact that heavy liquids sink to the bottom is a consequence of minimizing their potential
energy. And when you sit on a chair or lie on a bed, the springs adjust themselves
so that the energy is minimized. A straw in a glass of water looks bent because light
reaches your eye as quickly as possible. Certainly there are more highbrow examples:
The fundamental principle of structural engineering is the minimization of total energy.1
We have to say immediately that these “energies” are nothing but positive definite
quadratic functions. And the derivative of a quadratic is linear. We get back to the
familiar linear equations, when we set the first derivatives to zero. Our first goal in
this section is to find the minimum principle that is equivalent to Ax = b, and the
minimization equivalent to Ax = λ x. We will be doing in finite dimensions exactly
what the theory of optimization does in a continuous problem, where “first derivatives
= 0” gives a differential equation. In every problem, we are free to solve the linear
equation or minimize the quadratic.
The first step is straightforward: We want to find the “parabola” P(x) whose minimum
occurs when Ax = b. If A is just a scalar, that is easy to do:
\[ \text{The graph of } P(x) = \tfrac{1}{2}Ax^2 - bx \text{ has zero slope when } \frac{dP}{dx} = Ax - b = 0. \]
This point x = A^{−1}b will be a minimum if A is positive. Then the parabola P(x) opens
upward (Figure 6.4). In more dimensions this parabola turns into a parabolic bowl (a
paraboloid). To assure a minimum of P(x), not a maximum or a saddle point, A must be
positive definite!
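A small numerical sketch of the matrix case. It assumes the standard generalization P(x) = ½x^T A x − x^T b of the scalar parabola, which is not written out in the text above, and uses an arbitrarily chosen positive definite A and right-hand side b. The minimizer should coincide with the solution of Ax = b.

```python
import numpy as np

def P(x, A, b):
    """Quadratic P(x) = 1/2 x^T A x - x^T b (assumed matrix form of the scalar parabola)."""
    return 0.5 * x @ A @ x - x @ b

# Symmetric positive definite example (an assumption for illustration).
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])
b = np.array([4.0, 1.0])

x_star = np.linalg.solve(A, b)   # the point where the gradient Ax - b is zero
P_min = P(x_star, A, b)

# Every nearby point should give a larger value of P when A is positive definite.
rng = np.random.default_rng(0)
nearby_values = [P(x_star + 0.1 * rng.standard_normal(2), A, b) for _ in range(100)]
is_minimum = all(value > P_min for value in nearby_values)
```

For these numbers the solution is x = (3, 2), and the minimum value is −½ b^T A^{−1} b = −7.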
1 I am convinced that plants and people also develop in accordance with minimum principles. Perhaps civilization
is based on a law of least action. There must be new laws (and minimum principles) to be found in the social
sciences and life sciences.