To gain insight into the SVD, treat the rows of an n × d matrix A as n points in a
d-dimensional space and consider the problem of finding the best k-dimensional subspace
with respect to the set of points. Here best means minimize the sum of the squares of the
perpendicular distances of the points to the subspace. We begin with a special case of
the problem where the subspace is 1-dimensional, a line through the origin. We will see
later that the best-fitting k-dimensional subspace can be found by k applications of the
best fitting line algorithm. Finding the best fitting line through the origin with respect
to a set of points {xi |1 ≤ i ≤ n} in the plane means minimizing the sum of the squared
distances of the points to the line. Here distance is measured perpendicular to the line.
The problem is called the best least squares fit.
In the best least squares fit, one is minimizing the distance to a subspace. An alter-
native problem is to find the function that best fits some data. Here one variable y is a
function of the variables x1 , x2 , . . . , xd and one wishes to minimize the vertical distance,
i.e., distance in the y direction, to the subspace of the xi rather than minimize the per-
pendicular distance to the subspace being fit to the data.
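To make the distinction concrete, here is a small numerical sketch in Python using NumPy (the synthetic points and variable names below are purely illustrative). The vertical-distance fit of y as a function of x through the origin has slope \sum_i x_i y_i / \sum_i x_i^2, while the perpendicular-distance fit is the line through the origin in the direction of the top right singular vector of the matrix whose rows are the points; the two slopes are close on this data but generally not equal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D points roughly along the direction (1, 2), with noise.
x = rng.normal(size=200)
y = 2 * x + 0.3 * rng.normal(size=200)
A = np.column_stack([x, y])           # rows are points in the plane

# Vertical-distance fit of y = m*x (least squares regression through the origin).
m_vertical = (x @ y) / (x @ x)

# Perpendicular-distance fit: direction of the top right singular vector of A.
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]                            # unit vector spanning the best-fit line
m_perpendicular = v1[1] / v1[0]       # slope of that line

print(m_vertical, m_perpendicular)    # similar, but generally not equal
```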
Figure 4.1: The projection of the point xi onto the line through the origin in the direction of v.
Returning to the best least squares fit problem, consider projecting a point xi onto a line through the origin. Then, by the Pythagorean theorem,

x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2 = (\text{length of projection})^2 + (\text{distance of point to line})^2.
To minimize the sum of the squares of the distances to the line, one could minimize \sum_{i=1}^{n} (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2) minus the sum of the squares of the lengths of the projections of the points onto the line. However, \sum_{i=1}^{n} (x_{i1}^2 + x_{i2}^2 + \cdots + x_{id}^2) is a constant independent of the line, so minimizing the sum of the squares of the distances is equivalent to maximizing the sum of the squares of the lengths of the projections onto the line. Similarly for best-fit subspaces, we could maximize the sum of the squared lengths of the projections onto the subspace instead of minimizing the sum of squared distances to the subspace.
With this in mind, define the first singular vector, v1 , of A, which is a column vector,
as the best fit line through the origin for the n points in d-space that are the rows of A.
Thus
v_1 = \arg\max_{|v|=1} |Av|.

The value σ1(A) = |Av1| is called the first singular value of A. Note that σ1^2(A) is the sum of the squares of the projections of the points onto the line determined by v1.
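As a quick numerical sanity check (a sketch using NumPy on an arbitrary random matrix, not an algorithm from the text), the first right singular vector returned by a standard SVD routine does achieve the maximum of |Av| over unit vectors v:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 8))               # 50 points in 8-dimensional space

# numpy returns right singular vectors as the rows of Vt, ordered by
# decreasing singular value, so v1 = Vt[0] and sigma1 = s[0].
_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1, sigma1 = Vt[0], s[0]

assert np.isclose(np.linalg.norm(A @ v1), sigma1)

# |Av| never exceeds |A v1| = sigma1 for random unit vectors v.
for _ in range(1000):
    v = rng.normal(size=8)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(A @ v) <= sigma1 + 1e-9
```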
The greedy approach to finding the best-fit 2-dimensional subspace for a matrix A takes v1 as the first basis vector for the 2-dimensional subspace and finds the best 2-dimensional
subspace containing v1 . The fact that we are using the sum of squared distances helps.
For every 2-dimensional subspace containing v1 , the sum of squared lengths of the pro-
jections onto the subspace equals the sum of squared projections onto v1 plus the sum
of squared projections along a vector perpendicular to v1 in the subspace. Thus, instead
of looking for the best 2-dimensional subspace containing v1 , look for a unit vector, call
it v2, perpendicular to v1, that maximizes |Av|^2 among all such unit vectors. Using the same greedy strategy to find the best three and higher dimensional subspaces defines v3, v4, ... in a similar manner. This is captured in the following definitions. There is no
a priori guarantee that the greedy algorithm gives the best fit. But, in fact, the greedy
algorithm does work and yields the best-fit subspaces of every dimension.
The second singular vector, v2, is defined by the best fit line perpendicular to v1:

v_2 = \arg\max_{v \perp v_1,\ |v|=1} |Av|.

The value σ2(A) = |Av2| is called the second singular value of A. The third singular vector v3 is defined similarly by

v_3 = \arg\max_{v \perp v_1, v_2,\ |v|=1} |Av|,

and so on. The process stops when we have found singular vectors v1, v2, ..., vr and

\max_{v \perp v_1, v_2, \ldots, v_r,\ |v|=1} |Av| = 0.
If instead of finding v1 that maximized |Av| and then the best fit 2-dimensional
subspace containing v1 , we had found the best fit 2-dimensional subspace, we might have
done better. This is not the case. We now give a simple proof that the greedy algorithm
indeed finds the best subspaces of every dimension.
The proof is by induction on k. The statement holds for k = 1 by the definition of v1. For the inductive step, let Wk be a best-fit k-dimensional subspace and choose an orthonormal basis w1, w2, ..., wk of Wk with wk perpendicular to v1, v2, ..., vk−1; such a choice is possible because the subspace of Wk perpendicular to v1, v2, ..., vk−1 has dimension at least one. Then

|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_{k-1}|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_{k-1}|^2

since Vk−1, the span of v1, v2, ..., vk−1, is an optimal (k − 1)-dimensional subspace by the induction hypothesis. Since wk is perpendicular to v1, v2, ..., vk−1, by the definition of vk, |Awk|^2 ≤ |Avk|^2. Thus

|Aw_1|^2 + |Aw_2|^2 + \cdots + |Aw_k|^2 \le |Av_1|^2 + |Av_2|^2 + \cdots + |Av_k|^2,

so Vk, the span of v1, v2, ..., vk, captures at least as large a sum of squared projections as Wk and is therefore also a best-fit k-dimensional subspace.
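The greedy step can also be sketched numerically (again with NumPy on arbitrary random data; the variable names are ours). Maximizing |Av| over unit vectors perpendicular to v1 is the same as taking the top right singular vector of A after the v1-component of every row has been projected out, and this recovers v2 and σ2:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(40, 6))

_, s, Vt = np.linalg.svd(A, full_matrices=False)
v1 = Vt[0]

# Greedy second step: maximize |Av| over unit v perpendicular to v1.
# Removing the v1-component of each row gives a matrix whose top right
# singular vector is exactly that maximizer.
A_perp = A - np.outer(A @ v1, v1)
_, s_perp, Vt_perp = np.linalg.svd(A_perp, full_matrices=False)
v2_greedy, sigma2_greedy = Vt_perp[0], s_perp[0]

# It matches the second singular vector/value of A (up to sign).
assert np.isclose(sigma2_greedy, s[1])
assert np.isclose(abs(v2_greedy @ Vt[1]), 1.0)
```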
Note that the n-vector Avi is really a list of lengths (with signs) of the projections of
the rows of A onto vi . Think of |Avi | = σi (A) as the “component” of the matrix A along
vi . For this interpretation to make sense, it should be true that adding up the squares of
the components of A along each of the vi gives the square of the “whole content of the
matrix A”. This is indeed the case and is the matrix analogy of decomposing a vector
into its components along orthogonal directions.
Consider one row, say aj , of A. Since v1 , v2 , . . . , vr span the space of all rows of A,
a_j \cdot v = 0 for all v perpendicular to v1, v2, ..., vr. Thus, for each row aj, \sum_{i=1}^{r} (a_j \cdot v_i)^2 = |a_j|^2. Summing over all rows j,

\sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{i=1}^{r} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} \sum_{j=1}^{n} (a_j \cdot v_i)^2 = \sum_{i=1}^{r} |A v_i|^2 = \sum_{i=1}^{r} \sigma_i^2(A).
But \sum_{j=1}^{n} |a_j|^2 = \sum_{j=1}^{n} \sum_{k=1}^{d} a_{jk}^2, the sum of squares of all the entries of A. Thus, the sum of
squares of the singular values of A is indeed the square of the “whole content of A”, i.e.,
the sum of squares of all the entries. There is an important norm associated with this
quantity, the Frobenius norm of A, denoted ||A||F defined as
\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}.
Lemma 4.2 For any matrix A, the sum of the squares of the singular values equals the square of the Frobenius norm. That is, \sum_i \sigma_i^2(A) = \|A\|_F^2.
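Lemma 4.2 is easy to check numerically (a NumPy sketch on an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(30, 7))

s = np.linalg.svd(A, compute_uv=False)     # singular values only

# Sum of squared singular values = squared Frobenius norm
# = sum of squares of all entries of A (Lemma 4.2).
assert np.isclose(np.sum(s**2), np.sum(A**2))
assert np.isclose(np.sum(s**2), np.linalg.norm(A, 'fro')**2)
```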
A matrix A can be described fully by how it transforms the vectors vi . Every vector
v can be written as a linear combination of v1 , v2 , . . . , vr and a vector perpendicular
to all the vi . Now, Av is the same linear combination of Av1 , Av2 , . . . , Avr as v is of
v1 , v2 , . . . , vr . So the Av1 , Av2 , . . . , Avr form a fundamental set of vectors associated
with A. We normalize them to length one by
u_i = \frac{1}{\sigma_i(A)} A v_i.
The vectors u1 , u2 , . . . , ur are called the left singular vectors of A. The vi are called the
right singular vectors. The SVD theorem (Theorem 4.5) will fully explain the reason for
these terms.
Clearly, the right singular vectors are orthogonal by definition. We now show that the left singular vectors are also orthogonal and that A = \sum_{i=1}^{r} \sigma_i u_i v_i^T.

Theorem 4.3 The left singular vectors u1, u2, ..., ur are pairwise orthogonal.
Proof: The proof is by induction on r. For r = 1, there is only one ui so the theorem is
trivially true. For the inductive part consider the matrix
B = A − σ1 u1 v1T .
Note that Bv1 = Av1 − σ1u1(v1^T v1) = σ1u1 − σ1u1 = 0 and that Bv = Av for every v perpendicular to v1. Thus, there is a run of the greedy algorithm that finds that B has right singular vectors v2, v3, ..., vr and corresponding left singular vectors u2, u3, ..., ur. By the induction hypothesis, u2, u3, ..., ur are orthogonal.
It remains to prove that u1 is orthogonal to the other ui. Suppose not, and that for some i ≥ 2, u1^T ui ≠ 0. Without loss of generality assume that u1^T ui > 0; the proof is symmetric for the case where u1^T ui < 0. Now, for infinitesimally small ε > 0, the vector

A\left(\frac{v_1 + \varepsilon v_i}{|v_1 + \varepsilon v_i|}\right) = \frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1+\varepsilon^2}}

has length at least as large as its component along u1, which is

u_1^T\left(\frac{\sigma_1 u_1 + \varepsilon \sigma_i u_i}{\sqrt{1+\varepsilon^2}}\right) = \bigl(\sigma_1 + \varepsilon \sigma_i u_1^T u_i\bigr)\Bigl(1 - \frac{\varepsilon^2}{2} + O(\varepsilon^4)\Bigr) = \sigma_1 + \varepsilon \sigma_i u_1^T u_i - O(\varepsilon^2) > \sigma_1,

a contradiction. Thus, u1, u2, ..., ur are orthogonal.
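A NumPy sketch of the same fact (on arbitrary random data): the columns of U returned by a standard SVD routine are the left singular vectors u_i = A v_i / σ_i(A), and they are orthonormal.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(25, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# u_i * sigma_i equals A v_i, column by column, and U^T U is the identity.
assert np.allclose(U * s, A @ Vt.T)
assert np.allclose(U.T @ U, np.eye(5), atol=1e-10)
```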
4.2 Singular Value Decomposition (SVD)
We first prove a simple lemma stating that two matrices A and B are identical if
Av = Bv for all v. The lemma states that in the abstract, a matrix A can be viewed as
a transformation that maps vector v onto Av.
Lemma 4.4 Matrices A and B are identical if and only if for all vectors v, Av = Bv.
Proof: Clearly, if A = B then Av = Bv for all v. For the converse, suppose that
Av = Bv for all v. Let ei be the vector that is all zeros except for the ith component
which has value 1. Now Aei is the ith column of A and thus A = B if for each i, Aei = Bei .
Theorem 4.5 Let A be an n × d matrix with right singular vectors v1, v2, ..., vr, left singular vectors u1, u2, ..., ur, and corresponding singular values σ1, σ2, ..., σr. Then

A = \sum_{i=1}^{r} \sigma_i u_i v_i^T.

Proof: For each singular vector vj,

A v_j = \sigma_j u_j = \sum_{i=1}^{r} \sigma_i u_i v_i^T v_j,

since vi^T vj is 1 for i = j and 0 otherwise. Any vector v can be expressed as a linear combination of the singular vectors plus a vector perpendicular to all the vi, and both A and \sum_{i=1}^{r} \sigma_i u_i v_i^T map every vector perpendicular to the vi to zero. Hence Av = \sum_{i=1}^{r} \sigma_i u_i v_i^T v for all v, and by Lemma 4.4, A = \sum_{i=1}^{r} \sigma_i u_i v_i^T.
For any matrix A, the sequence of singular values is unique, and if the singular values are all distinct, then the sequence of singular vectors is unique as well, up to sign. However, when several singular values are equal, the corresponding singular vectors are not unique: they span a subspace, and any set of orthonormal vectors spanning this subspace can be used as the singular vectors.
In matrix form, the decomposition reads A = U D V^T, where U is the n × r matrix with columns u1, ..., ur, D is the r × r diagonal matrix with entries σ1, ..., σr, and V^T is the r × d matrix with rows v1^T, ..., vr^T; A itself is n × d.
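In NumPy this factorization (in its "thin" form, a sketch on an arbitrary full-rank random matrix) looks as follows; the shapes match the description above and the product U D V^T reproduces A.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 12, 7
A = rng.normal(size=(n, d))
r = np.linalg.matrix_rank(A)               # equals 7 for generic random data

# Thin SVD: U is n x r, D is r x r diagonal, Vt is r x d.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)
assert U.shape == (n, r) and D.shape == (r, r) and Vt.shape == (r, d)

# A = U D V^T, i.e. the sum over i of sigma_i u_i v_i^T.
assert np.allclose(A, U @ D @ Vt)
assert np.allclose(A, sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r)))
```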
There are two important matrix norms, the Frobenius norm, denoted ||A||_F, and the 2-norm, denoted ||A||_2. The 2-norm of the matrix A is given by

\|A\|_2 = \max_{|v|=1} |Av|;

thus, by the definition of the first singular vector, the 2-norm of A equals σ1(A).
Let

A = \sum_{i=1}^{r} \sigma_i u_i v_i^T

be the SVD of A and, for k between 1 and r, let

A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T

be the sum truncated after k terms. It is clear that Ak has rank k. Furthermore, Ak is the best rank k approximation to A when the error is measured in either the 2-norm or the Frobenius norm.
Lemma 4.6 The rows of Ak are the projections of the rows of A onto the subspace Vk
spanned by the first k singular vectors of A.
Proof: Let a be an arbitrary row vector. Since the vi are orthonormal, the projection of the vector a onto Vk is given by \sum_{i=1}^{k} (a \cdot v_i) v_i^T. Thus, the matrix whose rows are the projections of the rows of A onto Vk is given by \sum_{i=1}^{k} A v_i v_i^T. This last expression simplifies to

\sum_{i=1}^{k} A v_i v_i^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T = A_k.
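A short NumPy sketch (arbitrary random data) of both the construction of Ak and Lemma 4.6: truncating the SVD after k terms gives a rank-k matrix whose rows are the projections of the rows of A onto the span of v1, ..., vk.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
# A_k: keep only the first k terms sigma_i u_i v_i^T of the SVD.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
assert np.linalg.matrix_rank(A_k) == k

# Lemma 4.6: rows of A_k are the projections of the rows of A onto
# span{v_1, ..., v_k}, i.e. A_k = A V_k V_k^T with V_k = Vt[:k].T.
assert np.allclose(A_k, A @ Vt[:k].T @ Vt[:k])
```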
The matrix Ak is the best rank k approximation to A in both the Frobenius and the
2-norm. First we show that the matrix Ak is the best rank k approximation to A in the
Frobenius norm.
Theorem 4.7 For any matrix B of rank at most k,

\|A - A_k\|_F \le \|A - B\|_F.
Proof: Let B minimize ||A − B||_F^2 among all matrices of rank k or less. Let V be the space spanned by the rows of B; the dimension of V is at most k. Since B minimizes ||A − B||_F^2, each row of B must be the projection of the corresponding row of A onto V: otherwise, replacing that row of B with the projection of the corresponding row of A onto V would not change V, and hence not increase the rank of B, but would reduce ||A − B||_F^2. Since each row of B is the projection of the corresponding row of A onto V, it follows that ||A − B||_F^2 is the sum of squared distances of the rows of A to V. By Lemma 4.6, ||A − Ak||_F^2 is the sum of squared distances of the rows of A to Vk, and Vk minimizes this sum over all k-dimensional subspaces. It follows that ||A − Ak||_F ≤ ||A − B||_F.
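A numerical sanity check of this optimality (a NumPy sketch; comparing against a few randomly generated rank-k matrices is of course not a proof): the squared Frobenius error of Ak equals the tail sum of squared singular values, which follows from Lemma 4.2 applied to A − Ak, and random rank-k competitors never beat it.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
err_k = np.linalg.norm(A - A_k, 'fro')

# ||A - A_k||_F^2 is the sum of the squared singular values beyond the k-th ...
assert np.isclose(err_k**2, np.sum(s[k:]**2))

# ... and no randomly generated rank-k matrix B does better.
for _ in range(100):
    B = rng.normal(size=(20, k)) @ rng.normal(size=(k, 10))   # rank k
    assert err_k <= np.linalg.norm(A - B, 'fro') + 1e-9
```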
Next we tackle the 2-norm. We first show that the square of the 2-norm of A − Ak is the square of the (k + 1)st singular value of A.

Lemma 4.8 \|A - A_k\|_2^2 = \sigma_{k+1}^2.

Proof: Let A = \sum_{i=1}^{r} \sigma_i u_i v_i^T be the SVD of A. Then A - A_k = \sum_{i=k+1}^{r} \sigma_i u_i v_i^T. Since A − Ak maps every vector perpendicular to v1, v2, ..., vr to zero, the maximum of |(A − Ak)v| over unit vectors v is attained at some v = \sum_{i=1}^{r} \alpha_i v_i, for which

|(A - A_k)v|^2 = \Bigl|\sum_{i=k+1}^{r} \sigma_i \alpha_i u_i\Bigr|^2 = \sum_{i=k+1}^{r} \sigma_i^2 \alpha_i^2.

The v maximizing this last quantity, subject to the constraint that |v|^2 = \sum_{i=1}^{r} \alpha_i^2 = 1, occurs when α_{k+1} = 1 and the rest of the αi are 0. Thus, \|A - A_k\|_2^2 = \sigma_{k+1}^2, proving the lemma.
Theorem 4.9 Let A be an n × d matrix. For any matrix B of rank at most k,

\|A - A_k\|_2 \le \|A - B\|_2.

Proof: If A has rank k or less, the theorem is trivially true since ||A − Ak||_2 = 0. So assume A has rank greater than k and suppose, for contradiction, that ||A − B||_2 < σ_{k+1}. The null space of B, the set of vectors v with Bv = 0, has dimension at least d − k, while Span{v1, v2, ..., v_{k+1}} has dimension k + 1; since these dimensions sum to more than d, the two subspaces intersect in some nonzero vector z. Scale z so that |z| = 1. We now show that for this vector z, which lies in the space of the first k + 1 singular vectors of A, |(A − B)z| ≥ σ_{k+1}. Hence the 2-norm of A − B is at least σ_{k+1}, contradicting the assumption that ||A − B||_2 < σ_{k+1}. First, since Bz = 0,

\|A - B\|_2^2 \ge |(A - B)z|^2 = |Az|^2.
Since z is in the Span{v1, v2, ..., v_{k+1}},

|Az|^2 = \Bigl|\sum_{i=1}^{n} \sigma_i u_i v_i^T z\Bigr|^2 = \sum_{i=1}^{n} \sigma_i^2 \bigl(v_i^T z\bigr)^2 = \sum_{i=1}^{k+1} \sigma_i^2 \bigl(v_i^T z\bigr)^2 \ge \sigma_{k+1}^2 \sum_{i=1}^{k+1} \bigl(v_i^T z\bigr)^2 = \sigma_{k+1}^2.
It follows that

\|A - B\|_2^2 \ge \sigma_{k+1}^2,

contradicting the assumption that ||A − B||_2 < σ_{k+1}. This proves the theorem.
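The 2-norm statement is just as easy to check numerically (a NumPy sketch on arbitrary random data): the spectral norm of A − Ak equals the (k + 1)st singular value.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# ||A - A_k||_2 = sigma_{k+1}; note s[k] is the (k+1)st singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```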
We now turn to computing the singular value decomposition. There are sophisticated, numerically careful algorithms for this task; we refer the reader to numerical analysis texts for more details. The method we present, called the Power Method, is conceptually simple. The word power refers to taking high powers of the matrix B = AA^T. If the SVD of A is A = \sum_i \sigma_i u_i v_i^T, then by direct multiplication

B = AA^T = \Bigl(\sum_i \sigma_i u_i v_i^T\Bigr)\Bigl(\sum_j \sigma_j v_j u_j^T\Bigr) = \sum_{i,j} \sigma_i \sigma_j u_i (v_i^T v_j) u_j^T = \sum_i \sigma_i^2 u_i u_i^T,
since vi^T vj is the dot product of the two vectors and is zero unless i = j. [Caution: ui uj^T is a matrix and is not zero even for i ≠ j.] Using the same kind of calculation,
B^k = \sum_i \sigma_i^{2k} u_i u_i^T.
As k increases, for i > 1, \sigma_i^{2k}/\sigma_1^{2k} goes to zero and B^k is approximately equal to

\sigma_1^{2k} u_1 u_1^T,

provided that for each i > 1, σi(A) < σ1(A).
This suggests a way of finding σ1 and u1 , by successively powering B. But there are
two issues. First, if there is a significant gap between the first and second singular values
of a matrix, then the above argument applies and the power method will quickly converge
to the first left singular vector. Suppose there is no significant gap. In the extreme case,
there may be ties for the top singular value. Then the above argument does not work. We
overcome this problem in Theorem 4.11 below which states that even with ties, the power
method converges to some vector in the span of those singular vectors corresponding to
the “nearly highest” singular values.
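A minimal sketch of the power method in NumPy (the function name power_method and the use of a fixed number of iterations are our choices, not the text's): instead of forming B^k explicitly, apply B = AA^T to a random unit vector repeatedly, normalizing after each step, exactly as in the expression (AA^T)^k x / |(AA^T)^k x| used in Theorem 4.11 below.

```python
import numpy as np

def power_method(A, iterations=200, rng=None):
    """Approximate the top left singular vector u_1 and singular value sigma_1
    of A by repeatedly applying B = A A^T to a random starting vector."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(size=A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iterations):
        # One multiplication by B = A A^T, done as two matrix-vector
        # products so that B is never formed explicitly.
        x = A @ (A.T @ x)
        x /= np.linalg.norm(x)
    sigma1 = np.linalg.norm(A.T @ x)       # |A^T u_1| = sigma_1
    return x, sigma1

rng = np.random.default_rng(9)
A = rng.normal(size=(30, 12))
u_est, sigma_est = power_method(A, rng=rng)

U, s, _ = np.linalg.svd(A, full_matrices=False)
assert np.isclose(sigma_est, s[0])
assert abs(u_est @ U[:, 0]) > 1 - 1e-9     # converged to +/- u_1
```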
Figure 4.3: The volume of the cylinder of height 1/(20√d) is an upper bound on the volume of the hemisphere below x1 = 1/(20√d).
Lemma 4.10 Let (x1, x2, ..., xd) be a unit d-dimensional vector picked at random. The probability that |x1| ≥ 1/(20√d) is at least 9/10.
Proof: We first show that for a vector v picked at random with |v| ≤ 1, the probability that |v1| ≥ 1/(20√d) is at least 9/10. We then let x = v/|v|; since |v| ≤ 1, this can only increase the magnitude of the first coordinate, so the result follows.
Let α = 1/(20√d). The probability that |v1| ≥ α equals one minus the probability that |v1| ≤ α, and the probability that |v1| ≤ α is equal to the fraction of the volume of the unit sphere with |v1| ≤ α. To upper bound this volume, consider twice the volume of the unit radius cylinder of height α: the portion of the sphere with |v1| ≤ α has volume at most 2αV(d − 1), where V(d − 1) denotes the volume of the unit sphere in d − 1 dimensions, and so

\mathrm{Prob}(|v_1| \le \alpha) \le \frac{2\alpha V(d-1)}{V(d)}.
Now the volume of the unit radius sphere is at least twice the volume of the cylinder of height \frac{1}{\sqrt{d-1}} and radius \sqrt{1 - \frac{1}{d-1}}, or

V(d) \ge \frac{2}{\sqrt{d-1}}\, V(d-1) \left(1 - \frac{1}{d-1}\right)^{\frac{d-1}{2}}.
Using (1 − x)^a ≥ 1 − ax,

V(d) \ge \frac{2}{\sqrt{d-1}}\, V(d-1)\left(1 - \frac{d-1}{2} \cdot \frac{1}{d-1}\right) = \frac{V(d-1)}{\sqrt{d-1}},

and therefore

\mathrm{Prob}(|v_1| \le \alpha) \le \frac{2\alpha V(d-1)}{V(d-1)/\sqrt{d-1}} = 2\alpha\sqrt{d-1} \le \frac{\sqrt{d-1}}{10\sqrt{d}} \le \frac{1}{10}.
Thus the probability that |x1| ≥ 1/(20√d) is at least 9/10.
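A quick Monte Carlo check of Lemma 4.10 (a NumPy sketch; the dimension and number of trials below are arbitrary): sampling random unit vectors by normalizing Gaussian vectors, the fraction with |x1| ≥ 1/(20√d) comfortably exceeds 9/10.

```python
import numpy as np

rng = np.random.default_rng(10)
d, trials = 100, 20_000

# Random unit vectors: normalize standard Gaussian samples.
x = rng.normal(size=(trials, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

frac = np.mean(np.abs(x[:, 0]) >= 1 / (20 * np.sqrt(d)))
print(frac)     # well above 0.9, as Lemma 4.10 guarantees
```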
Theorem 4.11 Let A be an n × d matrix and x a random unit length vector. Let V be the space spanned by the left singular vectors of A corresponding to singular values greater than (1 − ε)σ1. Let k be Ω\bigl(\ln(n/\varepsilon)/\varepsilon\bigr). Let w be the unit vector after k iterations of the power method, namely,

w = \frac{(AA^T)^k x}{\bigl|(AA^T)^k x\bigr|}.

The probability that w has a component of at least ε perpendicular to V is at most 1/10.
Proof: Let

A = \sum_{i=1}^{r} \sigma_i u_i v_i^T

be the SVD of A. If the rank of A is less than n, then complete {u1, u2, ..., ur} into a basis {u1, u2, ..., un} of n-space. Write x in the basis of the ui's as

x = \sum_{i=1}^{n} c_i u_i.

Since (AA^T)^k = \sum_{i=1}^{n} \sigma_i^{2k} u_i u_i^T, it follows that (AA^T)^k x = \sum_{i=1}^{n} \sigma_i^{2k} c_i u_i. For a random unit length vector x picked independent of A, the ui are fixed vectors and picking x at random is equivalent to picking random ci. From Lemma 4.10, |c1| ≥ 1/(20√n) with probability at least 9/10.
Suppose that σ1 , σ2 , . . . , σm are the singular values of A that are greater than or equal
to (1 − ε) σ1 and that σm+1 , . . . , σn are the singular values that are less than (1 − ε) σ1 .
Now

\Bigl|(AA^T)^k x\Bigr|^2 = \Bigl|\sum_{i=1}^{n} \sigma_i^{2k} c_i u_i\Bigr|^2 = \sum_{i=1}^{n} \sigma_i^{4k} c_i^2 \ge \sigma_1^{4k} c_1^2 \ge \frac{1}{400n}\,\sigma_1^{4k},
with probability at least 9/10. Here we used the fact that a sum of positive quantities
is at least as large as its first element and the first element is greater than or equal to