
CS168: The Modern Algorithmic Toolbox

Lecture #9: The Singular Value Decomposition (SVD) and Low-Rank Matrix Approximations
Tim Roughgarden & Gregory Valiant
April 27, 2015

1 Low-Rank Matrix Approximations: Motivation


Consider an n × d matrix A. Perhaps A represents a bunch of data points (one per row), or
perhaps A represents a single object, like a rectangular image (with entries = pixel intensi-
ties). We’ve discussed “dimensionality reduction” for vectors — re-representing vectors in d
dimensions as vectors in k dimensions, with k ≪ d — what’s a good notion of dimensionality
reduction for matrices?
One good answer, explored in this lecture, is to reduce the rank of the matrix. Recall
from your linear algebra class that the following are equivalent definitions for the rank of a
matrix B to be k (any one of the conditions implies the other two):

1. The largest linearly independent subset of columns of B has size k. That is, all d
columns of B arise as linear combinations of only k different n-vectors.

2. The largest linearly independent subset of rows of B has size k. That is, all n rows of
B arise as linear combinations of only k different d-vectors.

3. B can be written as, or “factored into,” the product of a long and skinny (n × k) matrix Yk and a short and long (k × d) matrix ZTk (Figure 1; see also the short numpy sketch after this list). Observe that when B can be written this way, all of its columns are linear combinations of the k columns of Yk, and all of its rows are linear combinations of the k rows of ZTk. (It’s also true that a rank-k matrix in the first two senses can always be written this way; see your linear algebra course for a proof.)
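To make definition 3 concrete, here is a minimal numpy sketch (the sizes, seed, and variable names are illustrative, not from the notes): multiplying a random n × k factor by a random k × d factor produces a matrix whose rank is (at most) k.

```python
import numpy as np

# A minimal sketch of definition 3: build a rank-k matrix as the product of a
# long and skinny (n x k) factor Yk and a short and long (k x d) factor ZkT.
n, d, k = 50, 30, 4
rng = np.random.default_rng(0)
Yk = rng.standard_normal((n, k))    # n x k
ZkT = rng.standard_normal((k, d))   # k x d
B = Yk @ ZkT                        # n x d, but rank at most k

print(np.linalg.matrix_rank(B))     # 4 (with probability 1 over the random draw)
```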

The primary goal of this lecture is to identify the “best” way to approximate a given
matrix A with a rank-k matrix, for a target rank k. Why might you want to do this?

1. Compression. A low-rank approximation provides a (lossy) compressed version of the


matrix. The original matrix A is described by nd numbers, while describing Yk and
ZTk requires only k(n + d) numbers. When k is small relative to n and d, replacing the

Figure 1: Any matrix B of rank k can be decomposed into a long and skinny matrix times
a short and long one.

product of n and d by their sum is a big win. (With an image, n and d are typically
in the 100s. In other applications, n and d might well be in the tens of thousands or
more.)

2. De-noising. If A is a noisy version of some “ground truth” signal that is approximately


low-rank, then passing to a low-rank approximation of the raw data A might throw out
lots of noise and little signal, resulting in a matrix that is actually more informative
than the original.

2 Low-Rank Approximations from PCA


The techniques covered last week can be used to produce low-rank matrix approximations.
Recall the silly example at the beginning of Lecture #7, with a data set of n d-dimensional
vectors xi that turn out to all be multiples of each other. The corresponding matrix A, with
one xi per row, has rank 1. The factorization in Figure 1 just involves taking ZTk to be the
first vector (say), and Yk a list describing what multiple of this vector each other vector is.
That is, the data set can be re-represented, with no loss, by a single d-dimensional vector
and one scalar per data point. When the data points are approximately multiples of the
same vector, they can still be described with high accuracy using such a re-representation.
More generally, for a target rank k, we can ask how to best approximate a data set
as linear combinations of a set of k vectors.
Recall that principal components analysis (PCA) proposes a solution for choosing the k
vectors that “best” represent a data set, namely the eigenvectors of the covariance matrix
AT A. In more detail, here’s how we’d use PCA techniques to produce a rank-k approximation to a matrix A (a short numpy sketch follows the four steps):

1. Preprocess A so that the rows sum to the all-zero vector and, optionally, normalize
each column (like last week).

2. Form the covariance matrix AT A.

3. In the notation of Figure 1, take the k rows of ZTk to be the top k principal components
of A — the k eigenvectors w1 , w2 , . . . , wk of AT A that have the largest eigenvalues.
(These can be computed using the Power Iteration method from last lecture, or other
methods discussed below.)

4. For i = 1, 2, . . . , n, the ith row of the matrix Yk is defined as the projections (⟨xi, w1⟩, . . . , ⟨xi, wk⟩) of xi onto the vectors w1, . . . , wk. This is the best approximation, in terms of Euclidean distance from xi, of xi as a linear combination of w1, . . . , wk.1
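In code, the four steps above might look roughly like the following numpy sketch (a sketch only: the optional column normalization from step 1 is omitted, and the function name is ours).

```python
import numpy as np

def pca_rank_k_approx(A, k):
    """Rank-k approximation of A via the top k eigenvectors of A^T A
    (a sketch of the four steps above; assumes A is a real n x d array)."""
    # Step 1: center so that the rows of A sum to the all-zero vector.
    A = A - A.mean(axis=0)
    # Step 2: form the covariance matrix A^T A (d x d).
    cov = A.T @ A
    # Step 3: the top k eigenvectors w_1, ..., w_k of A^T A.
    # np.linalg.eigh returns eigenvalues in ascending order, so reverse.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs[:, ::-1][:, :k]          # d x k; columns are w_1, ..., w_k
    # Step 4: row i of Y_k holds the projections <x_i, w_1>, ..., <x_i, w_k>.
    Yk = A @ W                           # n x k
    ZkT = W.T                            # k x d
    return Yk @ ZkT                      # the rank-k matrix Y_k Z_k^T of (1)
```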

The above four steps certainly produce a matrix

Yk · ZTk (1)

that has rank only k. But how well does it approximate the original matrix A? Also, it’s
unsatisfying and possibly wasteful to do so many computations involving the covariance
matrix AT A, when all we really care about is the original matrix A. Is there a better way?
Next we discuss a fundamental matrix operation that provides answers to both of these
questions.

3 The Singular Value Decomposition (SVD)


3.1 Definitions
We’ll start with the formal definitions, and then discuss interpretations, applications, and
connections to concepts in previous lectures. A singular value decomposition (SVD) of an
n × d matrix A expresses the matrix as the product of three “simple” matrices:

A = USVT , (2)

where:

1. U is an n × n orthogonal matrix;2

2. V is a d × d orthogonal matrix;

3. S is an n × d diagonal matrix with nonnegative entries, and with the diagonal entries sorted from high to low (as one goes “northwest” to “southeast”).3

1
For example, with k = 2, these values (⟨xi, w1⟩, ⟨xi, w2⟩) are the values that you plotted in Mini-Project #4.
2
Recall that a matrix is orthogonal if its columns (or equivalently, its rows) are orthonormal vectors,
meaning they all have norm 1 and the inner product of any distinct pair of them is 0.
3
When we say that a (not necessarily square) matrix is diagonal, we mean what you’d think: only the
entries of the form (i, i) are allowed to be non-zero.

Figure 2: The singular value decomposition (SVD). Each singular value in S has an associated
left singular vector in U, and right singular vector in V.

Note that in contrast to the decompositions discussed last week, the orthogonal matrices
U and V are not the same — since A need not be square, U and V need not even have the
same dimensions.4
The columns of U are left singular vectors of A. The columns of V (that is, the rows
of VT ) are right singular vectors of A. The entries of S are the singular values of A. Thus
with each singular vector (left or right) there is an associated singular value. The “first” or
“top” singular vector refers to one associated with the largest singular value, and so on. See
Figure 2.
Every matrix A has an SVD. The proof is not deep, but is better covered in a linear algebra
course than here. Geometrically, this fact is kind of amazing: every matrix A, no matter how
weird, is only performing a rotation (multiplication by VT ), scaling plus adding or deleting
dimensions (multiplication by S), followed by a rotation in the range (multiplication by
U). Along the lines of last lecture’s discussion, the SVD is “more or less unique.” The
singular values of a matrix are unique. When a singular value appears multiple times, the
subspaces spanned by the corresponding left and right singular vectors are uniquely defined,
but arbitrary orthonormal bases can be chosen for each.5
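Following the suggestion in footnote 4, here is a minimal numpy example (the 2 × 3 matrix is arbitrary; np.linalg.svd returns the singular values as a 1-D array, so we rebuild the n × d matrix S by hand):

```python
import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 5.0]])           # an arbitrary 2 x 3 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)         # (2, 2) (2,) (3, 3)

print(np.allclose(U @ U.T, np.eye(2)))    # True: U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # True: V is orthogonal

S = np.zeros(A.shape)                     # the n x d diagonal matrix S
np.fill_diagonal(S, s)                    # singular values, sorted high to low
print(np.allclose(U @ S @ Vt, A))         # True: A = U S V^T
```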

3.2 PCA Reduces to SVD


There is an interesting relationship between the SVD and the decompositions we discussed
last week. Recall in previous lectures we used the fact that AT A, as a symmetric d × d
matrix, can be written as AT A = QDQT , where Q is a d × d orthogonal matrix and D is
a d × d diagonal matrix.6 Consider the SVD A = USVT and what its existence means for
4
Even small numerical examples are tedious to do in detail — the orthogonality constraint on singular
vectors ensures that most of the numbers are messy. The easiest way to get a feel for what SVDs look like
is to feed a few small matrices into the SVD subroutine supported by your favorite environment (Matlab,
python’s numpy library, etc.).
5
Also, one can always multiply the ith left and right singular vectors by -1 to get another SVD.
6
Actually, last week we wrote AT A = QT DQ. It doesn’t really matter, but writing AT A = QDQT is much more common, and we do this from now on.

AT A:
A^T A = (USV^T)^T (USV^T) = V S^T (U^T U) S V^T = V D V^T,    (3)

where we have used that U^T U = I (U is orthogonal), and where D is a diagonal matrix with
diagonal entries equal to the squares of the diagonal entries of S (if n < d then the remaining
d − n diagonal entries of D are 0).
Recall from last lecture that if you decompose AT A as QDQT , then the rows of QT are
eigenvectors of AT A. The computation in (3) therefore shows that the rows of VT are the
eigenvectors of AT A. Thus, the right singular vectors of A are the same as the eigenvectors
of AT A. Similarly, the eigenvalues of AT A are the squares of the singular values of A.
Thus PCA reduces to computing the SVD of A (without forming AT A). Recall that the
output of PCA, given a target k, is simply the top k eigenvectors of the covariance matrix
AT A. The SVD USVT of A hands you these eigenvectors on a silver platter — they are
simply the first k rows of VT . This is an alternative to the Power Iteration method discussed
last lecture. So which is better? There is no clear answer; in many cases, either should work
fine, and if performance is critical you’ll want to experiment with both. Certainly the Power
Iteration method, which finds the eigenvectors of AT A one-by-one, looks like a good idea
when you only want the top few eigenvectors. If you want many or all of them, then the SVD
— which gives you all of the eigenvectors, whether you want them or not — is probably the
first thing to try. The running time of typical SVD implementations is O(n^2 d) or O(d^2 n),
whichever is smaller.7 Such implementations have been heavily optimized in most of the
standard libraries.
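The computation in (3) is easy to check numerically. A small sketch (the random matrix and seed are arbitrary; we compare absolute values because of the sign ambiguity noted in footnote 5):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

# Eigen-decomposition of the covariance matrix (eigenvalues come back ascending).
eigvals, Q = np.linalg.eigh(A.T @ A)
eigvals, Q = eigvals[::-1], Q[:, ::-1]         # reorder from largest to smallest

# SVD of A itself, without ever forming A^T A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.allclose(s**2, eigvals))              # eigenvalues of A^T A = squared singular values
print(np.allclose(np.abs(Vt), np.abs(Q.T)))    # rows of V^T = eigenvectors of A^T A (up to sign)
```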

3.3 More on PCA vs. SVD


PCA and SVD are closely related, and in data analysis circles you should be ready for the
terms to be used almost interchangeably. There are differences, however. First, PCA refers
to a data analysis technique, while the SVD is a general operation defined on all matrices. For
example, it doesn’t really make sense to talk about “applying PCA” to a matrix A unless the
rows of A have clear semantics — typically, as data points x1 , . . . , xn in Rd . By contrast, the
SVD (2) is well defined for every matrix A, whatever the semantics for A. In the particular
case where A is a matrix where the rows represent data points, the SVD can be interpreted
as performing the calculations required by PCA.
We can also make more of an “apples vs. apples” comparison in the following way. Let’s
define the “PCA operation” as taking an n × d matrix as input, and possibly a parameter
k, and outputting all (or the top k) eigenvectors of the covariance matrix AT A. The “SVD
operation” takes as input an n × d matrix A and outputs U, S, and VT , where the rows of
VT are the eigenvectors of AT A. Thus the SVD gives strictly more information than PCA,
namely the matrix U.
Is the additional information U provided by SVD useful? In applications where you want
to understand the column structure of A, in addition to the row structure, the answer is
7
We won’t discuss how this is done, instead taking the SVD as a readily available “black box.” Imple-
mentation details are covered in any course on numerical analysis.

“yes.” To see this, let’s review some interpretations of the SVD (2). On the one hand,
the decomposition expresses every row of A as a linear combination of the rows of VT ,
with the rows of US providing the coefficients of these linear combinations. That is, we can
interpret the rows of A in terms of the rows of VT , which is useful when the rows of VT
have interesting semantics. Analogously, the decomposition in (2) expresses the columns of
A as linear combinations of the columns of U, with the coefficients given by the columns
of SVT . So when the columns of U are interpretable, the decomposition gives us a way to
understand the columns of A.
In some applications, we really only care about understanding the rows of A, and the
extra information U provided by the SVD over PCA is irrelevant. In other applications,
both the rows and the columns of A are interesting in their own right. For example:

1. Suppose rows of A are indexed by customers, and the columns by products, with the
matrix entries indicating who likes what. We are interested in understanding the rows,
and in the best-case scenario, the right singular vectors (rows of VT ) are interpretable
as “customer types” or “canonical customers” and the SVD expresses each customer
as a mixture of customer types. For example, perhaps one or both of your instructors
can be understood simply as a mixture of a “math customer,” a “music customer,” and
a “sports customer.” In the ideal case, the left singular vectors (columns of U) can be
interpreted as “product types,” where the “types” are the same as for customers, and
the SVD expresses each product as a mixture of product types (the extent to which a
product appeals to a “math customer,” a “music customer,” etc.).

2. Suppose the matrix represents data about drug interactions, with the rows of A indexed
by proteins or pathways, and the columns by chemicals or drugs. We’re interested in
understanding both proteins and drugs in their own right, as mixtures of a small set
of “basic types.”

In the above two examples, what we really care about is the relationships between two groups
of objects — customers and products, or proteins and drugs — the labeling of one group
as the “rows” of a matrix and the other as the “columns” is arbitrary. In such cases, you
should immediately think of the SVD as a potential tool for better understanding the data.
When the columns of A are not interesting in their own right, PCA already provides the
relevant information.

4 Low-Rank Approximations from the SVD


If we want to best approximate a matrix A by a rank-k matrix, how should we do it? The
SVD gives an elegant and rigorously justified solution. Recall from Section 1 what it means
for a matrix to have rank k — all of the rows are linear combinations of a set of merely k
rows, and all of the columns are linear combinations of merely k columns. Thus choosing a
rank-k matrix boils down to choosing sets of k vectors. What’s a principled way to choose
these? If only we had a representation of the data matrix A as linear combinations of sets

Figure 3: Low rank approximation via SVD. Recall that S is non-zero only on its diagonal,
and the diagonal entries of S are sorted from high to low. Our low rank approximation is
Ak = Uk Sk VkT .

of vectors, with these vectors ordered by “importance,” then we could just keep the k “most
important” vectors. But wait, the SVD gives us exactly such a representation!
Formally, given an n × d matrix A and a target rank k ≥ 1, we produce a rank-k
approximation of A as follows. See also Figure 3.

1. Compute the SVD A = USVT , where U is an n × n orthogonal matrix, S is a


nonnegative n × d diagonal matrix with diagonal entries sorted from high to low, and
VT is a d × d orthogonal matrix.

2. Keep only the top k right singular vectors: set VkT equal to the first k rows of VT (a
k × d matrix).

3. Keep only the top k left singular vectors: set Uk equal to the first k columns of U (an n × k matrix).

4. Keep only the top k singular values: set Sk equal to the first k rows and columns of S
(a k × k matrix), corresponding to the k largest singular values of A.

The computed low-rank approximation is then

Ak = Uk Sk VkT . (4)

Storing the matrices on the right-hand side of (4) takes O(k(n + d)) space, in contrast to the
O(nd) space required to store the original matrix A. This is a big win when k is relatively
small and n and d are relatively large (as in many applications).
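In numpy, steps 1–4 and equation (4) amount to a few lines (a sketch; the function name is ours, and np.linalg.svd already returns the singular values sorted from high to low):

```python
import numpy as np

def svd_rank_k_approx(A, k):
    """The rank-k approximation A_k = U_k S_k V_k^T of equation (4)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]              # top k left singular vectors   (n x k)
    Sk = np.diag(s[:k])        # top k singular values         (k x k)
    VkT = Vt[:k, :]            # top k right singular vectors  (k x d)
    return Uk @ Sk @ VkT

# Storing Uk, s[:k], and VkT takes O(k(n + d)) numbers instead of O(nd).
```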
In the matrix Ak defined in (4), all of the rows are linear combinations of the top k
right singular vectors of A (with coefficients given by the rows of Uk Sk ), and all of the
columns are linear combinations of the top k left singular vectors of A (with coefficients
given by the columns of Sk VkT ). Thus Ak clearly has rank k. It is natural to interpret (4)
as approximating the raw data A in terms of k “concepts” (e.g., “math,” “music,” and
“sports”), where the singular values of Sk express the signal strengths of these concepts,

the rows of VT and columns of U express the “canonical row/column” associated with each
concept (e.g., a customer that likes only music products, or a product liked only by music
customers), and the rows of U (respectively, columns of VT ) approximately express each row
(respectively, column) of A as a linear combination (scaled by Sk ) of the “canonical rows”
(respectively, canonical columns).
Conceptually, this method of producing a low-rank approximation is as clean as could
be imagined: we re-represent A using the SVD, which provides a list of A’s “ingredients,”
ordered by “importance,” and we retain only the k most important ingredients. But is the
result of this elegant computation any good? Also, how does it compare to our previous
method of producing a low-rank approximation via PCA (Section 2)?
The first fact is that the two methods discussed for producing a low-rank approximation
are exactly the same.8

Fact 4.1 The matrix Ak defined in (1) and the matrix Ak defined in (4) are identical.
We won’t prove Fact 4.1, but pause to note its plausibility. Recall that in the PCA-based
solution defined in (1), we defined ZTk to be the top k principal components of A — the first
k eigenvectors of the covariance matrix AT A. As noted in Section 3.2, the right singular
vectors of A (i.e., the rows of VT ) are also the eigenvectors of AT A. Thus, the matrices
ZTk and VkT are identical, both equal to the top k eigenvectors of AT A/top k right singular
vectors of A. Given this, it is not surprising that the two definitions of Ak are the same:
both the matrix Yk in (1) and the matrix Uk Sk in (4) are intuitively defining the linear
combinations of the rows of ZTk and VkT that give the best approximation to A. In the
PCA-based solution in Section 2, this is explicitly how Yk is defined; the SVD encodes the
same linear combinations in the form Uk Sk .
Our second fact justifies our methods by stating that the low-rank approximations they
produce are optimal in a natural sense. The guarantee is in terms of the “Frobenius norm”
of a matrix M, which just means applying the ℓ2 norm to the matrix as if it were a vector:
‖M‖_F = (Σ_{i,j} m_{ij}^2)^{1/2}.

Fact 4.2 For every n × d matrix A, rank target k ≥ 1, and rank-k n × d matrix B,

‖A − Ak‖_F ≤ ‖A − B‖_F,

where Ak is the rank-k approximation (4) derived from the SVD of A.


Intuitively, Fact 4.2 holds because: (i) minimizing the Frobenius norm ‖A − B‖_F is equivalent
to minimizing the average (over i) of the squared Euclidean distances between the ith rows
of A and B; (ii) the SVD uses the same vectors to approximate the rows of A as PCA (the
top eigenvectors of AT A/right singular vectors of A); and (iii) PCA, by definition, chooses
its k vectors to minimize the average squared Euclidean distance between the rows of A and
the k-dimensional subspace of linear combinations of these vectors. The contribution of a
8
We’re assuming that identical preprocessing of A, if any, is done in both cases.

row of A − Ak to the Frobenius norm corresponds exactly to one of these squared Euclidean
distances.
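A quick numerical sanity check of Fact 4.2 (a sketch only, since comparing against one arbitrary rank-k competitor proves nothing; the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((40, 25))
k = 3

# The rank-k truncation A_k from equation (4).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Some other rank-k matrix of the same shape, for comparison.
B = rng.standard_normal((40, k)) @ rng.standard_normal((k, 25))

print(np.linalg.norm(A - Ak, 'fro') <= np.linalg.norm(A - B, 'fro'))   # True, per Fact 4.2
```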

Remark 4.3 (How to Choose k) When producing a low-rank matrix approximation, we’ve
been taking as a parameter the target rank k. But how should k be chosen? In a perfect
world, the eigenvalues of AT A/singular values of A give strong guidance: if the top few
such values are big and the rest are small, then the obvious solution is to take k equal to the
number of big values. In a less perfect world, one takes k as small as possible subject to ob-
taining a useful approximation — of course what “useful” means depends on the application.
Rules of thumb often take the form: choose k such that the sum of the top k eigenvalues is
at least c times as big as the sum of the other eigenvalues, where c is a domain-dependent
constant (like 10, say).
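The rule of thumb at the end of the remark is easy to code up once the singular values are in hand; a hedged sketch (the function name and the default c = 10 are ours):

```python
import numpy as np

def choose_k(singular_values, c=10.0):
    """Smallest k such that the sum of the top k eigenvalues of A^T A
    (i.e., squared singular values of A) is at least c times the sum of
    the rest. A sketch of the rule of thumb; c is domain-dependent."""
    ev = np.sort(np.asarray(singular_values, dtype=float) ** 2)[::-1]
    for k in range(1, len(ev) + 1):
        if ev[:k].sum() >= c * ev[k:].sum():
            return k
    return len(ev)

# Example: singular values 10, 9, 0.5, 0.3 give eigenvalues 100, 81, 0.25, 0.09,
# so choose_k([10, 9, 0.5, 0.3]) returns 2.
```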

Remark 4.4 (Lossy Compression via Truncated Decompositions) Using the SVD to
produce low-rank matrix approximations introduces a useful paradigm for lossy compression
that we’ll exploit further in later lectures. The first step of the paradigm is to re-express the
raw data exactly as a decomposition into several terms (as in (2)). The second step is to
throw away all but the “most important” terms, yielding an approximation of the original data.
This paradigm works well when you can find a representation of the data such that most of
the interesting information is concentrated in just a few components of the decomposition.
The appropriate representation will depend on the data set — though some rules of thumb
can be learned, as we’ll discuss — and of course, messy enough data sets might not admit
any nice representations at all.

5 Recovering Missing Entries via the SVD


This section briefly outlines how the SVD can be used to fill in missing entries of a matrix.9
We’ll also see some more sophisticated techniques for this problem in Week #7.
The input to the problem is an n × d matrix A, except some (10%, say) of the entries
are missing. You’d like to guess what the missing entries are. For a challenging example
with lots of missing entries, if A is a matrix of movie ratings by Netflix customers — and
of course, most people haven’t rated most movies — you’d like to predict what a customer
would rate a particular movie that he/she hasn’t seen yet. Clearly, this is a relevant problem
in the design of a recommendation system, among other applications.
This problem is clearly impossible unless we make assumptions about the “ground truth”
matrix that we’re supposed to recover — otherwise, the missing entries could be anything,
and we have no information about them. A reasonable assumption that makes the problem
more tractable is that the matrix to be recovered is well-approximated by a low-rank matrix.
For intuition, think about the special case where you are given a rank-one matrix — so all
rows are multiples of each other — except a few of the entries are missing. In this case, it is
9
We didn’t have time to cover this in lecture, but see Mini-Project #5.

clear that one can reconstruct the missing entries — if not too many are missing, there will
be only one way of “filling in the blanks” that results in a rank-one matrix.
More generally, if there aren’t too many missing entries, and if the matrix to be recovered
is approximately low rank, then the following application of the SVD can yield a good guess
as to the missing entries (a short numpy sketch follows the two steps below).

1. Fill in the missing entries with suitable default values to obtain a matrix Â. Examples
for default values include zero, the average value of an entry of the matrix, the average
value of the row containing the entry, and the average value of the column containing
the entry. The performance of the method improves with more accurate choices of
default values.

2. Compute the best rank-k approximation to Â. (The usual comments about how to
choose k apply; see also Remark 4.3.)
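A compact sketch of the two-step recipe above (we encode missing entries as np.nan and use the column-average default from step 1; the function name is ours):

```python
import numpy as np

def fill_missing_via_svd(A, k):
    """Guess the missing entries of A (marked np.nan) via a rank-k SVD
    approximation. Step 1 uses column averages as default values; the other
    defaults mentioned above (zero, row average, overall average) work too."""
    missing = np.isnan(A)
    col_means = np.nanmean(A, axis=0)          # average observed value per column
    A_hat = np.where(missing, col_means, A)    # step 1: fill in default values

    U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :] # step 2: best rank-k approximation of A_hat

    return np.where(missing, Ak, A)            # keep observed entries, guess the rest
```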
