Example 11.7: The eigenvalues of $MM^T$ for our running example must include 58 and 2, because those are the eigenvalues of $M^TM$, as we observed in Section 11.2.1. Since $MM^T$ is a 4 × 4 matrix, it has two other eigenvalues, which must both be 0. The matrix of eigenvectors corresponding to 58, 2, 0, and 0 is shown in Fig. 11.4. ✷

11.2.4 Exercises for Section 11.2


Exercise 11.2.1: Let M be the matrix of data points

$$\begin{pmatrix} 1 & 1 \\ 2 & 4 \\ 3 & 9 \\ 4 & 16 \end{pmatrix}$$

(a) What are $M^TM$ and $MM^T$?

(b) Compute the eigenpairs for $M^TM$.

! (c) What do you expect to be the eigenvalues of $MM^T$?

! (d) Find the eigenvectors of $MM^T$, using your eigenvalues from part (c).

! Exercise 11.2.2: Prove that if M is any matrix, then $M^TM$ and $MM^T$ are symmetric.

11.3 Singular-Value Decomposition


We now take up a second form of matrix analysis that leads to a low-dimensional
representation of a high-dimensional matrix. This approach, called singular-
value decomposition (SVD), allows an exact representation of any matrix, and
also makes it easy to eliminate the less important parts of that representation to
produce an approximate representation with any desired number of dimensions.
Of course the fewer the dimensions we choose, the less accurate will be the
approximation.
We begin with the necessary definitions. Then, we explore the idea that the
SVD defines a small number of “concepts” that connect the rows and columns
of the matrix. We show how eliminating the least important concepts gives us a
smaller representation that closely approximates the original matrix. Next, we
see how these concepts can be used to query the original matrix more efficiently,
and finally we offer an algorithm for performing the SVD itself.

11.3.1 Definition of SVD


Let M be an m × n matrix, and let the rank of M be r. Recall that the rank of
a matrix is the largest number of rows (or equivalently columns) we can choose
for which no nonzero linear combination of the rows is the all-zero vector 0 (we
say a set of such rows or columns is independent). Then we can find matrices
U , Σ, and V as shown in Fig. 11.5 with the following properties:

1. U is an m × r column-orthonormal matrix; that is, each of its columns is a unit vector and the dot product of any two different columns is 0.

2. V is an n × r column-orthonormal matrix. Note that we always use V in its transposed form, so it is the rows of $V^T$ that are orthonormal.

3. Σ is a diagonal matrix; that is, all elements not on the main diagonal are 0. The elements of Σ are called the singular values of M.

[Figure: an m × n matrix M expressed as the product of an m × r matrix U, an r × r diagonal matrix Σ, and an r × n matrix $V^T$.]
Figure 11.5: The form of a singular-value decomposition
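As a concrete check of these properties, the following is a minimal sketch (not from the book) using NumPy's SVD routine on a small hypothetical matrix; the factors are trimmed to the r columns that correspond to nonzero singular values, matching the form of Fig. 11.5.

```python
import numpy as np

# Hypothetical 3x5 matrix of rank 2, used only to illustrate the definitions.
M = np.array([[1., 1., 1., 0., 0.],
              [3., 3., 3., 0., 0.],
              [0., 0., 0., 4., 4.]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # s holds the singular values
r = np.linalg.matrix_rank(M)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]              # keep only the r nonzero singular values

# Properties 1 and 2: the columns of U, and the rows of V^T, are orthonormal.
assert np.allclose(U.T @ U, np.eye(r))
assert np.allclose(Vt @ Vt.T, np.eye(r))
# Property 3: Sigma is diagonal, and U Sigma V^T reproduces M exactly here.
Sigma = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, M)
```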

Example 11.8: Figure 11.6 gives a rank-2 matrix representing ratings of movies by users. In this contrived example there are two “concepts” underlying the movies: science-fiction and romance. All the boys rate only science-fiction, and all the girls rate only romance. It is this existence of two strictly-adhered-to concepts that gives the matrix a rank of 2. That is, we may pick one of the first four rows and one of the last three rows and observe that there is no nonzero linear sum of these rows that is 0. But we cannot pick three independent rows. For example, if we pick rows 1, 2, and 7, then three times the first minus the second, plus zero times the seventh is 0.

We can make a similar observation about the columns. We may pick one of the first three columns and one of the last two columns, and they will be independent, but no set of three columns is independent.
The decomposition of the matrix M from Fig. 11.6 into U , Σ, and V , with
all elements correct to two significant digits, is shown in Fig. 11.7. Since the
rank of M is 2, we can use r = 2 in the decomposition. We shall see how to
compute this decomposition in Section 11.3.6. ✷

         Matrix   Alien   Star Wars   Casablanca   Titanic
Joe         1        1         1           0           0
Jim         3        3         3           0           0
John        4        4         4           0           0
Jack        5        5         5           0           0
Jill        0        0         0           4           4
Jenny       0        0         0           5           5
Jane        0        0         0           2           2
Figure 11.6: Ratings of movies by users

   
$$
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 \\
3 & 3 & 3 & 0 & 0 \\
4 & 4 & 4 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 0 & 0 & 4 & 4 \\
0 & 0 & 0 & 5 & 5 \\
0 & 0 & 0 & 2 & 2
\end{pmatrix}
=
\begin{pmatrix}
.14 & 0 \\
.42 & 0 \\
.56 & 0 \\
.70 & 0 \\
0 & .60 \\
0 & .75 \\
0 & .30
\end{pmatrix}
\begin{pmatrix}
12.4 & 0 \\
0 & 9.5
\end{pmatrix}
\begin{pmatrix}
.58 & .58 & .58 & 0 & 0 \\
0 & 0 & 0 & .71 & .71
\end{pmatrix}
$$

(the four matrices are, from left to right, M, U, Σ, and $V^T$)

Figure 11.7: SVD for the matrix M of Fig. 11.6
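The decomposition in Fig. 11.7 can be reproduced numerically. The sketch below is an illustration, not part of the text; it assumes NumPy, and the library may return the columns of U and V with opposite signs, which is an equally valid SVD.

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.round(s[:2], 1))       # [12.4  9.5] -- the two nonzero singular values
print(np.round(U[:, :2], 2))    # matches U of Fig. 11.7, up to sign
print(np.round(Vt[:2, :], 2))   # matches V^T of Fig. 11.7, up to sign
```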

11.3.2 Interpretation of SVD


The key to understanding what SVD offers is in viewing the r columns of U ,
Σ, and V as representing concepts that are hidden in the original matrix M . In
Example 11.8, these concepts are clear; one is “science fiction” and the other
is “romance.” Let us think of the rows of M as people and the columns of
M as movies. Then matrix U connects people to concepts. For example, the
person Joe, who corresponds to row 1 of M in Fig. 11.6, likes only the concept
science fiction. The value 0.14 in the first row and first column of U is smaller
than some of the other entries in that column, because while Joe watches only
science fiction, he doesn’t rate those movies highly. The second column of the
first row of U is 0, because Joe doesn’t rate romance movies at all.
The matrix V relates movies to concepts. The 0.58 in each of the first three columns of the first row of $V^T$ indicates that the first three movies – The Matrix, Alien, and Star Wars – each are of the science-fiction genre, while the 0’s in the last two columns of the first row say that these movies do not partake of the concept romance at all. Likewise, the second row of $V^T$ tells us that the movies Casablanca and Titanic are exclusively romances.


Finally, the matrix Σ gives the strength of each of the concepts. In our
example, the strength of the science-fiction concept is 12.4, while the strength
of the romance concept is 9.5. Intuitively, the science-fiction concept is stronger
because the data provides more information about the movies of that genre and
the people who like them.
In general, the concepts will not be so clearly delineated. There will be fewer
0’s in U and V , although Σ is always a diagonal matrix and will always have
0’s off the diagonal. The entities represented by the rows and columns of M
(analogous to people and movies in our example) will partake of several different
concepts to varying degrees. In fact, the decomposition of Example 11.8 was
especially simple, since the rank of the matrix M was equal to the desired
number of columns of U , Σ, and V . We were therefore able to get an exact
decomposition of M with only two columns for each of the three matrices U , Σ,
and V; the product $U\Sigma V^T$, if carried out to infinite precision, would be exactly
M . In practice, life is not so simple. When the rank of M is greater than the
number of columns we want for the matrices U , Σ, and V , the decomposition is
not exact. We need to eliminate from the exact decomposition those columns of
U and V that correspond to the smallest singular values, in order to get the best
approximation. The following example is a slight modification of Example 11.8
that will illustrate the point.
         Matrix   Alien   Star Wars   Casablanca   Titanic
Joe         1        1         1           0           0
Jim         3        3         3           0           0
John        4        4         4           0           0
Jack        5        5         5           0           0
Jill        0        2         0           4           4
Jenny       0        0         0           5           5
Jane        0        1         0           2           2

Figure 11.8: The new matrix M ′ , with ratings for Alien by two additional raters

Example 11.9 : Figure 11.8 is almost the same as Fig. 11.6, but Jill and Jane
rated Alien, although neither liked it very much. The rank of the matrix in
Fig. 11.8 is 3; for example the first, sixth, and seventh rows are independent,
but you can check that no four rows are independent. Figure 11.9 shows the
decomposition of the matrix from Fig. 11.8.
We have used three columns for U , Σ, and V because they decompose a
matrix of rank three. The columns of U and V still correspond to concepts.
The first is still “science fiction” and the second is “romance.”
 
$$
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 \\
3 & 3 & 3 & 0 & 0 \\
4 & 4 & 4 & 0 & 0 \\
5 & 5 & 5 & 0 & 0 \\
0 & 2 & 0 & 4 & 4 \\
0 & 0 & 0 & 5 & 5 \\
0 & 1 & 0 & 2 & 2
\end{pmatrix}
=
\begin{pmatrix}
.13 & .02 & -.01 \\
.41 & .07 & -.03 \\
.55 & .09 & -.04 \\
.68 & .11 & -.05 \\
.15 & -.59 & .65 \\
.07 & -.73 & -.67 \\
.07 & -.29 & .32
\end{pmatrix}
\begin{pmatrix}
12.4 & 0 & 0 \\
0 & 9.5 & 0 \\
0 & 0 & 1.3
\end{pmatrix}
\begin{pmatrix}
.56 & .59 & .56 & .09 & .09 \\
.12 & -.02 & .12 & -.69 & -.69 \\
.40 & -.80 & .40 & .09 & .09
\end{pmatrix}
$$

(the four matrices are, from left to right, M′, U, Σ, and $V^T$)

Figure 11.9: SVD for the matrix M ′ of Fig. 11.8

It is harder to explain the third column’s concept, but it doesn’t matter all that much, because
its weight, as given by the third nonzero entry in Σ, is very low compared with
the weights of the first two concepts. ✷

In the next section, we consider eliminating some of the least important concepts. For instance, we might want to eliminate the third concept in Example 11.9, since it really doesn’t tell us much, and the fact that its associated singular value is so small confirms its unimportance.

11.3.3 Dimensionality Reduction Using SVD


Suppose we want to represent a very large matrix M by its SVD components U ,
Σ, and V , but these matrices are also too large to store conveniently. The best
way to reduce the dimensionality of the three matrices is to set the smallest of
the singular values to zero. If we set the s smallest singular values to 0, then
we can also eliminate the corresponding s columns of U and V .

Example 11.10 : The decomposition of Example 11.9 has three singular val-
ues. Suppose we want to reduce the number of dimensions to two. Then we
set the smallest of the singular values, which is 1.3, to zero. The effect on the
expression in Fig. 11.9 is that the third column of U and the third row of $V^T$ are multiplied only by 0’s when we perform the multiplication, so this row and this
column may as well not be there. That is, the approximation to M ′ obtained
by using only the two largest singular values is that shown in Fig. 11.10.
 
$$
\begin{pmatrix}
.13 & .02 \\
.41 & .07 \\
.55 & .09 \\
.68 & .11 \\
.15 & -.59 \\
.07 & -.73 \\
.07 & -.29
\end{pmatrix}
\begin{pmatrix}
12.4 & 0 \\
0 & 9.5
\end{pmatrix}
\begin{pmatrix}
.56 & .59 & .56 & .09 & .09 \\
.12 & -.02 & .12 & -.69 & -.69
\end{pmatrix}
=
\begin{pmatrix}
0.93 & 0.95 & 0.93 & .014 & .014 \\
2.93 & 2.99 & 2.93 & .000 & .000 \\
3.92 & 4.01 & 3.92 & .026 & .026 \\
4.84 & 4.96 & 4.84 & .040 & .040 \\
0.37 & 1.21 & 0.37 & 4.04 & 4.04 \\
0.35 & 0.65 & 0.35 & 4.87 & 4.87 \\
0.16 & 0.57 & 0.16 & 1.98 & 1.98
\end{pmatrix}
$$

Figure 11.10: Dropping the lowest singular value from the decomposition of Fig. 11.9

The resulting matrix is quite close to the matrix M ′ of Fig. 11.8. Ideally, the
entire difference is the result of making the last singular value be 0. However,
in this simple example, much of the difference is due to rounding error caused
by the fact that the decomposition of M ′ was only correct to two significant
digits. ✷
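The truncation in Example 11.10 can be expressed in a few lines. This is a sketch under the assumption that NumPy is available; `truncate_svd` is a hypothetical helper, not anything defined in the text.

```python
import numpy as np

def truncate_svd(M, k):
    """Hypothetical helper: drop all but the k largest singular values,
    along with the corresponding columns of U and V."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# The matrix M' of Fig. 11.8, kept to two of its three singular values.
M_prime = np.array([[1, 1, 1, 0, 0],
                    [3, 3, 3, 0, 0],
                    [4, 4, 4, 0, 0],
                    [5, 5, 5, 0, 0],
                    [0, 2, 0, 4, 4],
                    [0, 0, 0, 5, 5],
                    [0, 1, 0, 2, 2]], dtype=float)

U2, s2, Vt2 = truncate_svd(M_prime, 2)
approx = U2 @ np.diag(s2) @ Vt2     # close to the rank-2 approximation of Fig. 11.10
```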

11.3.4 Why Zeroing Low Singular Values Works


The choice of the lowest singular values to drop when we reduce the number of
dimensions can be shown to minimize the root-mean-square error between the
original matrix M and its approximation. Since the number of entries is fixed,
and the square root is a monotone operation, we can simplify and compare
the Frobenius norms of the matrices involved. Recall that the Frobenius norm of a matrix M, denoted $\|M\|$, is the square root of the sum of the squares of the elements of M. Note that if M is the difference between one matrix and its approximation, then $\|M\|$ is proportional to the RMSE (root-mean-square error) between the matrices.
To explain why choosing the smallest singular values to set to 0 minimizes the RMSE or Frobenius norm of the difference between M and its approximation, let us begin with a little matrix algebra. Suppose M is the product of three matrices, M = PQR. Let $m_{ij}$, $p_{ij}$, $q_{ij}$, and $r_{ij}$ be the elements in row i and column j of M, P, Q, and R, respectively.

How Many Singular Values Should We Retain?


A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ. That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values. In Example 11.10, the total energy is $(12.4)^2 + (9.5)^2 + (1.3)^2 = 245.70$, while the retained energy is $(12.4)^2 + (9.5)^2 = 244.01$. Thus, we have retained over 99% of the energy. However, were we to eliminate the second singular value, 9.5, the retained energy would be only $(12.4)^2/245.70$, or about 63%.
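The rule of thumb in this box can be checked mechanically. The sketch below is an illustration with a hypothetical helper name, not something from the book; it assumes NumPy and returns the smallest number of singular values whose energy reaches the threshold.

```python
import numpy as np

def values_to_retain(singular_values, threshold=0.90):
    """Hypothetical helper: how many of the largest singular values are
    needed to retain the given fraction of the energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

print(values_to_retain([12.4, 9.5, 1.3]))   # 2: the first two values already hold ~99% of the energy
```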

Then the definition of matrix multiplication tells us

$$m_{ij} = \sum_{k}\sum_{\ell} p_{ik}\, q_{k\ell}\, r_{\ell j}$$

Then

$$\|M\|^2 = \sum_{i}\sum_{j} (m_{ij})^2 = \sum_{i}\sum_{j} \Bigl(\sum_{k}\sum_{\ell} p_{ik}\, q_{k\ell}\, r_{\ell j}\Bigr)^2 \tag{11.1}$$

When we square a sum of terms, as we do on the right side of Equation 11.1, we effectively create two copies of the sum (with different indices of summation) and multiply each term of the first sum by each term of the second sum. That is,

$$\Bigl(\sum_{k}\sum_{\ell} p_{ik}\, q_{k\ell}\, r_{\ell j}\Bigr)^2 = \sum_{k}\sum_{\ell}\sum_{n}\sum_{m} p_{ik}\, q_{k\ell}\, r_{\ell j}\, p_{in}\, q_{nm}\, r_{mj}$$

We can thus rewrite Equation 11.1 as

$$\|M\|^2 = \sum_{i}\sum_{j}\sum_{k}\sum_{\ell}\sum_{n}\sum_{m} p_{ik}\, q_{k\ell}\, r_{\ell j}\, p_{in}\, q_{nm}\, r_{mj} \tag{11.2}$$

Now, let us examine the case where P, Q, and R are really the SVD of M. That is, P is a column-orthonormal matrix, Q is a diagonal matrix, and R is the transpose of a column-orthonormal matrix. That is, R is row-orthonormal; its rows are unit vectors and the dot product of any two different rows is 0. To begin, since Q is a diagonal matrix, $q_{k\ell}$ and $q_{nm}$ will be zero unless k = ℓ and n = m. We can thus drop the summations for ℓ and m in Equation 11.2 and set k = ℓ and n = m. That is, Equation 11.2 becomes

$$\|M\|^2 = \sum_{i}\sum_{j}\sum_{k}\sum_{n} p_{ik}\, q_{kk}\, r_{kj}\, p_{in}\, q_{nn}\, r_{nj} \tag{11.3}$$

Next, reorder the summation, so i is the innermost sum. Equation 11.3 has only two factors $p_{ik}$ and $p_{in}$ that involve i; all other factors are constants as far as summation over i is concerned. Since P is column-orthonormal, we know that $\sum_i p_{ik}\, p_{in}$ is 1 if k = n and 0 otherwise. That is, in Equation 11.3 we can set k = n, drop the factors $p_{ik}$ and $p_{in}$, and eliminate the sums over i and n, yielding

$$\|M\|^2 = \sum_{j}\sum_{k} q_{kk}\, r_{kj}\, q_{kk}\, r_{kj} \tag{11.4}$$
Since R is row-orthonormal, $\sum_j r_{kj}\, r_{kj}$ is 1. Thus, we can eliminate the terms $r_{kj}$ and the sum over j, leaving a very simple formula for the Frobenius norm:

$$\|M\|^2 = \sum_{k} (q_{kk})^2 \tag{11.5}$$

Next, let us apply this formula to a matrix M whose SVD is $M = U\Sigma V^T$. Let the ith diagonal element of Σ be $\sigma_i$, and suppose we preserve the first n of the r diagonal elements of Σ, setting the rest to 0. Let $\Sigma'$ be the resulting diagonal matrix. Let $M' = U\Sigma' V^T$ be the resulting approximation to M. Then $M - M' = U(\Sigma - \Sigma')V^T$ is the matrix giving the errors that result from our approximation.

If we apply Equation 11.5 to the matrix $M - M'$, we see that $\|M - M'\|^2$ equals the sum of the squares of the diagonal elements of $\Sigma - \Sigma'$. But $\Sigma - \Sigma'$ has 0 for the first n diagonal elements and $\sigma_i$ for the ith diagonal element, where $n < i \le r$. That is, $\|M - M'\|^2$ is the sum of the squares of the elements of Σ that were set to 0. To minimize $\|M - M'\|^2$, pick those elements to be the smallest in Σ. Doing so gives the least possible value of $\|M - M'\|^2$ under the constraint that we preserve n of the diagonal elements, and it therefore minimizes the RMSE under the same constraint.
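The conclusion can be verified numerically. The following sketch is an illustration using NumPy on a randomly generated matrix (not anything from the text); it checks that the squared Frobenius norm of $M - M'$ equals the sum of the squares of the dropped singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((7, 5))          # an arbitrary test matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
n_keep = 2                               # preserve the n largest singular values
M_approx = U[:, :n_keep] @ np.diag(s[:n_keep]) @ Vt[:n_keep, :]

lhs = np.linalg.norm(M - M_approx, 'fro') ** 2   # ||M - M'||^2
rhs = np.sum(s[n_keep:] ** 2)                    # squares of the zeroed singular values
assert np.isclose(lhs, rhs)
```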

11.3.5 Querying Using Concepts


In this section we shall look at how SVD can help us answer certain queries
efficiently, with good accuracy. Let us assume for example that we have de-
composed our original movie-rating data (the rank-2 data of Fig. 11.6) into the
SVD form of Fig. 11.7. Quincy is not one of the people represented by the
original matrix, but he wants to use the system to know what movies he would
like. He has only seen one movie, The Matrix, and rated it 4. Thus, we can
represent Quincy by the vector q = [4, 0, 0, 0, 0], as if this were one of the rows
of the original matrix.
If we used a collaborative-filtering approach, we would try to compare
Quincy with the other users represented in the original matrix M . Instead,
we can map Quincy into “concept space” by multiplying him by the matrix V of the decomposition. We find qV = [2.32, 0]. (Note that Fig. 11.7 shows $V^T$, while this multiplication requires V.) That is to say, Quincy is high in science-fiction interest, and not at all interested in romance.
We now have a representation of Quincy in concept space, derived from, but
different from his representation in the original “movie space.” One useful thing
we can do is to map his representation back into movie space by multiplying
[2.32, 0] by $V^T$. This product is [1.35, 1.35, 1.35, 0, 0]. It suggests that Quincy
would like Alien and Star Wars, but not Casablanca or Titanic.
Another sort of query we can perform in concept space is to find users similar
to Quincy. We can use V to map all users into concept space. For example,
Joe maps to [1.74, 0], and Jill maps to [0, 5.68]. Notice that in this simple
example, all users are either 100% science-fiction fans or 100% romance fans, so
each vector has a zero in one component. In reality, people are more complex,
and they will have different, but nonzero, levels of interest in various concepts.
In general, we can measure the similarity of users by their cosine distance in
concept space.

Example 11.11 : For the case introduced above, note that the concept vectors
for Quincy and Joe, which are [2.32, 0] and [1.74, 0], respectively, are not the
same, but they have exactly the same direction. That is, their cosine distance
is 0. On the other hand, the vectors for Quincy and Jill, which are [2.32, 0] and
[0, 5.68], respectively, have a dot product of 0, and therefore their angle is 90
degrees. That is, their cosine distance is 1, the maximum possible. ✷
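The concept-space queries above amount to a few matrix-vector products. This sketch assumes the matrix V of Fig. 11.7 and uses NumPy; the helper `cosine_distance` is ours, not the book's.

```python
import numpy as np

# V from Fig. 11.7 (its columns are the science-fiction and romance concepts).
V = np.array([[.58, 0], [.58, 0], [.58, 0], [0, .71], [0, .71]])

q = np.array([4, 0, 0, 0, 0])           # Quincy: rated only The Matrix, giving it 4
concept = q @ V                         # ~[2.32, 0]: high on science fiction
back_to_movies = concept @ V.T          # ~[1.35, 1.35, 1.35, 0, 0]

joe  = np.array([1, 1, 1, 0, 0]) @ V    # ~[1.74, 0]
jill = np.array([0, 0, 0, 4, 4]) @ V    # ~[0, 5.68]

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance(concept, joe))    # ~0: same direction as Joe
print(cosine_distance(concept, jill))   # 1.0: orthogonal to Jill
```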

11.3.6 Computing the SVD of a Matrix


The SVD of a matrix M is strongly connected to the eigenvalues of the symmetric matrices $M^TM$ and $MM^T$. This relationship allows us to obtain the SVD of M from the eigenpairs of the latter two matrices. To begin the explanation, start with $M = U\Sigma V^T$, the expression for the SVD of M. Then

$$M^T = (U\Sigma V^T)^T = (V^T)^T \Sigma^T U^T = V\Sigma^T U^T$$

Since Σ is a diagonal matrix, transposing it has no effect. Thus, $M^T = V\Sigma U^T$.

Now, $M^TM = V\Sigma U^T U\Sigma V^T$. Remember that U is an orthonormal matrix, so $U^TU$ is the identity matrix of the appropriate size. That is,

$$M^TM = V\Sigma^2 V^T$$

Multiply both sides of this equation on the right by V to get

$$M^TM V = V\Sigma^2 V^T V$$

Since V is also an orthonormal matrix, we know that $V^TV$ is the identity. Thus

$$M^TM V = V\Sigma^2 \tag{11.6}$$

Since Σ is a diagonal matrix, $\Sigma^2$ is also a diagonal matrix whose entry in the ith row and column is the square of the entry in the same position of Σ. Now, Equation (11.6) should be familiar. It says that V is the matrix of eigenvectors of $M^TM$ and $\Sigma^2$ is the diagonal matrix whose entries are the corresponding eigenvalues.

Thus, the same algorithm that computes the eigenpairs for $M^TM$ gives us the matrix V for the SVD of M itself. It also gives us the singular values for this SVD; just take the square roots of the eigenvalues for $M^TM$.

Only U remains to be computed, but it can be found in the same way we found V. Start with

$$MM^T = U\Sigma V^T (U\Sigma V^T)^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T$$

Then by a series of manipulations analogous to the above, we learn that

$$MM^T U = U\Sigma^2$$

That is, U is the matrix of eigenvectors of $MM^T$.


A small detail needs to be explained concerning U and V. Each of these matrices has r columns, while $M^TM$ is an n × n matrix and $MM^T$ is an m × m matrix. Both n and m are at least as large as r. Thus, $M^TM$ and $MM^T$ have an additional n − r and m − r eigenpairs, respectively, and these pairs do not show up in U, V, and Σ. Since the rank of M is r, all these other eigenvalues are 0, and they are not useful.
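The construction just described can be sketched in code. The following is an illustration under the assumption that NumPy is available, not an industrial-strength SVD routine: it takes V and Σ from the eigenpairs of $M^TM$ and then recovers U from $MV\Sigma^{-1}$, which is equivalent (up to sign) to taking the eigenvectors of $MM^T$.

```python
import numpy as np

def svd_from_eigenpairs(M, tol=1e-10):
    """Sketch of the construction in Section 11.3.6 (not a production routine)."""
    eigvals, V = np.linalg.eigh(M.T @ M)    # M^T M is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    eigvals, V = eigvals[order], V[:, order]
    keep = eigvals > tol                    # discard the n - r zero eigenvalues
    sigma = np.sqrt(eigvals[keep])          # singular values = square roots of eigenvalues
    V = V[:, keep]
    U = (M @ V) / sigma                     # each column u_i = M v_i / sigma_i
    return U, sigma, V

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [0, 0, 0, 4, 4]], dtype=float)
U, sigma, V = svd_from_eigenpairs(M)
assert np.allclose(U @ np.diag(sigma) @ V.T, M)
```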

11.3.7 Exercises for Section 11.3


Exercise 11.3.1 : In Fig. 11.11 is a matrix M . It has rank 2, as you can see by
observing that the first column plus the third column minus twice the second
column equals 0.
 
$$\begin{pmatrix} 1 & 2 & 3 \\ 3 & 4 & 5 \\ 5 & 4 & 3 \\ 0 & 2 & 4 \\ 1 & 3 & 5 \end{pmatrix}$$

Figure 11.11: Matrix M for Exercise 11.3.1

(a) Compute the matrices $M^TM$ and $MM^T$.

! (b) Find the eigenvalues for your matrices of part (a).

(c) Find the eigenvectors for the matrices of part (a).

(d) Find the SVD for the original matrix M from parts (b) and (c). Note
that there are only two nonzero eigenvalues, so your matrix Σ should have
only two singular values, while U and V have only two columns.

(e) Set your smaller singular value to 0 and compute the one-dimensional
approximation to the matrix M from Fig. 11.11.

(f) How much of the energy of the original singular values is retained by the
one-dimensional approximation?

Exercise 11.3.2 : Use the SVD from Fig. 11.7. Suppose Leslie assigns rating 3
to Alien and rating 4 to Titanic, giving us a representation of Leslie in “movie
space” of [0, 3, 0, 0, 4]. Find the representation of Leslie in concept space. What
does that representation predict about how well Leslie would like the other
movies appearing in our example data?

! Exercise 11.3.3 : Demonstrate that the rank of the matrix in Fig. 11.8 is 3.

! Exercise 11.3.4 : Section 11.3.5 showed how to guess the movies a person
would most like. How would you use a similar technique to guess the people
that would most like a given movie, if all you had were the ratings of that movie
by a few people?

11.4 CUR Decomposition


There is a problem with SVD that does not show up in the running example
of Section 11.3. In large-data applications, it is normal for the matrix M being
decomposed to be very sparse; that is, most entries are 0. For example, a
matrix representing many documents (as rows) and the words they contain (as
columns) will be sparse, because most words are not present in most documents.
Similarly, a matrix of customers and products will be sparse because most
people do not buy most products.
We cannot deal with dense matrices that have millions or billions of rows
and/or columns. However, with SVD, even if M is sparse, U and V will be
dense.⁴ Since Σ is diagonal, it will be sparse, but Σ is usually much smaller
than U and V , so its sparseness does not help.
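A small illustration of this point (ours, not the book's, using a randomly generated matrix): even when M is overwhelmingly zero, the factors U and V returned by an SVD routine are essentially fully dense.

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((200, 100)) < 0.05             # roughly 5% of the entries are nonzero
M = rng.standard_normal((200, 100)) * mask

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.count_nonzero(M) / M.size)              # about 0.05: M is sparse
print(np.mean(np.abs(U) > 1e-12))                # typically close to 1.0: U is dense
print(np.mean(np.abs(Vt) > 1e-12))               # typically close to 1.0: V is dense
```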
In this section, we shall consider another approach to decomposition, called
CUR-decomposition. The merit of this approach lies in the fact that if M is
sparse, then the two large matrices (called C and R for “columns” and “rows”)
analogous to U and V in SVD are also sparse. Only the matrix in the middle
(analogous to Σ in SVD) is dense, but this matrix is small so the density does
not hurt too much.
Unlike SVD, which gives an exact decomposition as long as the parameter r
is taken to be at least as great as the rank of the matrix M , CUR-decomposition
is an approximation no matter how large we make r. There is a theory that
guarantees convergence to M as r gets larger, but typically you have to make r
so large to get, say, within 1% of M that the method becomes impractical. Neverthe-
less, a decomposition with a relatively small value of r has a good probability
of being a useful and accurate decomposition.
⁴ In Fig. 11.7, it happens that U and V have a significant number of 0’s. However, that is an artifact of the very regular nature of our example matrix M and is not the case in general.
