CS168: The Modern Algorithmic Toolbox Lecture #9: The Singular Value Decomposition (SVD) and Low-Rank Matrix Approximations
1. The largest linearly independent subset of columns of B has size k. That is, all d
columns of B arise as linear combinations of only k different n-vectors.
2. The largest linearly independent subset of rows of B has size k. That is, all n rows of
B arise as linear combinations of only k different d-vectors.
3. B can be written as, or “factored into,” the product of a long and skinny (n × k) matrix
Y_k and a short and wide (k × d) matrix Z_k^T (Figure 1). Observe that when B can be
written this way, all of its columns are linear combinations of the k columns of Y_k,
and all of its rows are linear combinations of the k rows of Z_k^T. (It’s also true that a
rank-k matrix in the first two senses can always be written this way; see your linear
algebra course for a proof.) A short numpy illustration of this characterization appears
after this list.
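As a concrete illustration of the third characterization, here is a minimal numpy sketch (the variable names and the random example are ours, not part of the lecture notes) showing that the product of an n × k matrix and a k × d matrix has rank at most k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 30, 3

# Build B explicitly as a long-and-skinny matrix Y (n x k) times a
# short-and-wide matrix Zt (k x d), as in Figure 1.
Y = rng.standard_normal((n, k))
Zt = rng.standard_normal((k, d))
B = Y @ Zt

# Every column of B is a linear combination of the k columns of Y, and
# every row of B is a linear combination of the k rows of Zt, so the
# rank of B is at most k (and equals k for generic random factors).
print(np.linalg.matrix_rank(B))  # 3
```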
The primary goal of this lecture is to identify the “best” way to approximate a given
matrix A with a rank-k matrix, for a target rank k. Why might you want to do this?
Figure 1: Any matrix B of rank k can be decomposed into a long and skinny matrix times
a short and wide one.
One reason is compression: as we will see, a rank-k approximation can be stored using roughly
k(n + d) numbers rather than nd, and when k is small, replacing the product of n and d by
their sum is a big win. (With an image, n and d are typically in the 100s. In other applications,
n and d might well be in the tens of thousands or more.)
1. Preprocess A so that the rows sum to the all-zero vector and, optionally, normalize
each column (like last week).
3. In the notation of Figure 1, take the k rows of Z_k^T to be the top k principal components
of A — the k eigenvectors w_1, w_2, . . . , w_k of A^T A that have the largest eigenvalues.
(These can be computed using the Power Iteration method from last lecture, or other
methods discussed below.)
4. For i = 1, 2, . . . , n, define the ith row of the matrix Y_k to be the vector of projections
(⟨x_i, w_1⟩, . . . , ⟨x_i, w_k⟩) of x_i onto the vectors w_1, . . . , w_k, where x_i denotes the ith
row of A. This is the best approximation, in terms of Euclidean distance from x_i, of x_i
as a linear combination of w_1, . . . , w_k.1
The end result of this process is an approximation of A,

    A_k = Y_k · Z_k^T,                                        (1)
that has rank only k. But how well does it approximate the original matrix A? Also, it’s
unsatisfying and possibly wasteful to do so many computations involving the covariance
matrix A^T A, when all we really care about is the original matrix A. Is there a better way?
Next we discuss a fundamental matrix operation that provides answers to both of these
questions.
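Before moving on, here is a short numpy sketch of the recipe above (our own illustration; the helper name pca_rank_k_approx is hypothetical, and we assume A has already been centered as in step 1).

```python
import numpy as np

def pca_rank_k_approx(A, k):
    """Rank-k approximation of A built from the top k eigenvectors of A^T A."""
    # Eigendecomposition of the symmetric matrix A^T A; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)
    W = eigvecs[:, ::-1][:, :k]      # top-k eigenvectors w_1, ..., w_k as columns (d x k)

    Zt = W.T                         # Z_k^T: the top k principal components (k x d)
    Y = A @ W                        # Y_k: row i holds the projections <x_i, w_1>, ..., <x_i, w_k>
    return Y @ Zt                    # the rank-k approximation of equation (1)

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))
A = A - A.mean(axis=0)               # step 1: rows of the preprocessed A sum to the zero vector
Ak = pca_rank_k_approx(A, k=5)
print(np.linalg.matrix_rank(Ak))     # 5
```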
The singular value decomposition (SVD) of an n × d matrix A is a factorization

    A = U S V^T,                                              (2)

where:
1. U is an n × n orthogonal matrix;2
2. V is a d × d orthogonal matrix;
3. S is an n × d diagonal matrix with nonnegative entries, and with the diagonal entries
sorted from high to low (as one goes from “northwest” to “southeast”).3
Footnote 1: For example, with k = 2, these values (⟨x_i, w_1⟩, ⟨x_i, w_2⟩) are the values that
you plotted in Mini-Project #4.
Footnote 2: Recall that a matrix is orthogonal if its columns (or equivalently, its rows) are
orthonormal vectors, meaning they all have norm 1 and the inner product of any distinct pair
of them is 0.
Footnote 3: When we say that a (not necessarily square) matrix is diagonal, we mean what
you’d think: only the entries of the form (i, i) are allowed to be non-zero.
Figure 2: The singular value decomposition (SVD). Each singular value in S has an associated
left singular vector in U and an associated right singular vector in V.
Note that in contrast to the decompositions discussed last week, the orthogonal matrices
U and V are not the same — since A need not be square, U and V need not even have the
same dimensions.4
The columns of U are the left singular vectors of A. The columns of V (that is, the rows
of V^T) are the right singular vectors of A. The entries of S are the singular values of A. Thus
with each singular vector (left or right) there is an associated singular value. The “first” or
“top” singular vector refers to one associated with the largest singular value, and so on. See
Figure 2.
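As a sanity check on these definitions, the following numpy sketch (our illustration, not part of the notes) computes an SVD and verifies the factorization (2); note that np.linalg.svd returns V^T directly, with the singular values sorted from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 4
A = rng.standard_normal((n, d))

# full_matrices=True gives the full n x n and d x d orthogonal matrices.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)        # (6, 6) (4,) (4, 4)

# Rebuild the n x d "diagonal" matrix S from the vector of singular values.
S = np.zeros((n, d))
S[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ S @ Vt))        # True: A = U S V^T
```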
Every matrix A has an SVD. The proof is not deep, but is better covered in a linear algebra
course than here. Geometrically, this fact is kind of amazing: every matrix A, no matter how
weird, is only performing a rotation (multiplication by V^T), a scaling plus adding or deleting
dimensions (multiplication by S), followed by a rotation in the range (multiplication by
U). Along the lines of last lecture’s discussion, the SVD is “more or less unique.” The
singular values of a matrix are unique. When a singular value appears multiple times, the
subspaces spanned by the corresponding left and right singular vectors are uniquely defined,
but arbitrary orthonormal bases can be chosen for each.5
To relate the SVD back to PCA, use the factorization (2) to expand the covariance matrix
A^T A:
    A^T A = (U S V^T)^T (U S V^T) = V S^T (U^T U) S V^T = V S^T S V^T = V D V^T,        (3)

where we used that U^T U = I (because U is orthogonal), and where D = S^T S is a diagonal
matrix whose diagonal entries are the squares of the diagonal entries of S (if n < d, then the
remaining d − n diagonal entries of D are 0).
Recall from last lecture that if you decompose A^T A as Q D Q^T, then the rows of Q^T are
eigenvectors of A^T A. The computation in (3) therefore shows that the rows of V^T are the
eigenvectors of A^T A. Thus, the right singular vectors of A are the same as the eigenvectors
of A^T A. Similarly, the eigenvalues of A^T A are the squares of the singular values of A.
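This relationship is easy to confirm numerically; the sketch below (ours) checks that the squared singular values of a random A match the eigenvalues of A^T A, and that V diagonalizes A^T A.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 8))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]        # eigenvalues of A^T A, largest first

print(np.allclose(s ** 2, eigvals))                # True: eigenvalues = squared singular values
print(np.allclose(Vt @ (A.T @ A) @ Vt.T, np.diag(s ** 2)))  # True: V^T (A^T A) V = D
```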
Thus PCA reduces to computing the SVD of A (without forming A^T A). Recall that the
output of PCA, given a target k, is simply the top k eigenvectors of the covariance matrix
A^T A. The SVD U S V^T of A hands you these eigenvectors on a silver platter — they are
simply the first k rows of V^T. This is an alternative to the Power Iteration method discussed
last lecture. So which is better? There is no clear answer; in many cases, either should work
fine, and if performance is critical you’ll want to experiment with both. Certainly the Power
Iteration method, which finds the eigenvectors of AT A one-by-one, looks like a good idea
when you only want the top few eigenvectors. If you want many or all of them, then the SVD
— which gives you all of the eigenvectors, whether you want them or not — is probably the
first thing to try. The running time of typical SVD implementations is O(n^2 d) or O(d^2 n),
whichever is smaller.7 Such implementations have been heavily optimized in most of the
standard libraries.
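For comparison, here is a minimal power-iteration sketch (ours, not the course starter code) that recovers the top right singular vector by repeatedly applying A.T @ (A @ v), so A^T A is never formed explicitly, alongside the answer from np.linalg.svd.

```python
import numpy as np

def top_right_singular_vector(A, num_iters=1000, seed=0):
    """Power iteration for the top eigenvector of A^T A (= top right singular vector of A)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    for _ in range(num_iters):
        v = A.T @ (A @ v)            # one multiplication by A^T A, without forming it
        v /= np.linalg.norm(v)
    return v

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 30))
v_power = top_right_singular_vector(A)
v_svd = np.linalg.svd(A, full_matrices=False)[2][0]   # first row of V^T

# The two answers agree up to sign; this prints a number close to 0.
print(min(np.linalg.norm(v_power - v_svd), np.linalg.norm(v_power + v_svd)))
```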
Does the SVD give us anything useful beyond what PCA already provides? Often the answer is
“yes.” To see this, let’s review some interpretations of the SVD (2). On the one hand,
the decomposition expresses every row of A as a linear combination of the rows of V^T,
with the rows of U S providing the coefficients of these linear combinations. That is, we can
interpret the rows of A in terms of the rows of V^T, which is useful when the rows of V^T
have interesting semantics. Analogously, the decomposition in (2) expresses the columns of
A as linear combinations of the columns of U, with the coefficients given by the columns
of S V^T. So when the columns of U are interpretable, the decomposition gives us a way to
understand the columns of A.
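A one-line numerical check of the first interpretation (our illustration): each row of A is the corresponding row of U S times V^T.

```python
import numpy as np

A = np.random.default_rng(5).standard_normal((7, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

coeffs = U * s                        # the rows of U S: coefficients of the linear combinations
print(np.allclose(A, coeffs @ Vt))    # True: each row of A is a combination of the rows of V^T
```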
In some applications, we really only care about understanding the rows of A, and the
extra information U provided by the SVD over PCA is irrelevant. In other applications,
both the rows and the columns of A are interesting in their own right. For example:
1. Suppose rows of A are indexed by customers, and the columns by products, with the
matrix entries indicating who likes what. We are interested in understanding the rows,
and in the best-case scenario, the right singular vectors (rows of V^T) are interpretable
as “customer types” or “canonical customers” and the SVD expresses each customer
as a mixture of customer types. For example, perhaps one or both of your instructors
can be understood simply as a mixture of a “math customer,” a “music customer,” and
a “sports customer.” In the ideal case, the left singular vectors (columns of U) can be
interpreted as “product types,” where the “types” are the same as for customers, and
the SVD expresses each product as a mixture of product types (the extent to which a
product appeals to a “math customer,” a “music customer,” etc.).
2. Suppose the matrix represents data about drug interactions, with the rows of A indexed
by proteins or pathways, and the columns by chemicals or drugs. We’re interested in
understanding both proteins and drugs in their own right, as mixtures of a small set
of “basic types.”
In the above two examples, what we really care about is the relationship between two groups
of objects (customers and products, or proteins and drugs); the labeling of one group as the
“rows” of the matrix and the other as the “columns” is arbitrary. In such cases, you
should immediately think of the SVD as a potential tool for better understanding the data.
When the columns of A are not interesting in their own right, PCA already provides the
relevant information.
Figure 3: Low-rank approximation via the SVD. Recall that S is non-zero only on its diagonal,
and the diagonal entries of S are sorted from high to low. Our low-rank approximation is
A_k = U_k S_k V_k^T.
The idea is natural: if we could express A exactly in terms of a collection of vectors, with these
vectors ordered by “importance,” then we could just keep the k “most important” vectors. But
wait, the SVD gives us exactly such a representation!
Formally, given an n × d matrix A and a target rank k ≥ 1, we produce a rank-k
approximation of A as follows. See also Figure 3.

1. Compute the SVD A = U S V^T of A, as in (2).
2. Keep only the top k right singular vectors: set V_k^T equal to the first k rows of V^T (a
k × d matrix).

3. Keep only the top k left singular vectors: set U_k equal to the first k columns of U (an
n × k matrix).

4. Keep only the top k singular values: set S_k equal to the first k rows and columns of S
(a k × k matrix), corresponding to the k largest singular values of A.
The rank-k approximation of A is then defined as

    A_k = U_k S_k V_k^T.                                      (4)
Storing the matrices on the right-hand side of (4) takes O(k(n + d)) space, in contrast to the
O(nd) space required to store the original matrix A. This is a big win when k is relatively
small and n and d are relatively large (as in many applications).
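Here is a numpy sketch of the truncation steps and the storage comparison (our illustration; the helper name svd_rank_k_approx is ours).

```python
import numpy as np

def svd_rank_k_approx(A, k):
    """Return U_k, the singular values in S_k, and V_k^T for the approximation (4)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(6)
n, d, k = 300, 200, 10
A = rng.standard_normal((n, d))

Uk, sk, Vkt = svd_rank_k_approx(A, k)
Ak = (Uk * sk) @ Vkt                          # A_k = U_k S_k V_k^T, as in equation (4)

print(np.linalg.matrix_rank(Ak))              # 10
print(Uk.size + sk.size + Vkt.size, A.size)   # about k(n + d) numbers versus n*d numbers
```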
In the matrix A_k defined in (4), all of the rows are linear combinations of the top k
right singular vectors of A (with coefficients given by the rows of U_k S_k), and all of the
columns are linear combinations of the top k left singular vectors of A (with coefficients
given by the columns of S_k V_k^T). Thus A_k clearly has rank at most k. It is natural to
interpret (4) as approximating the raw data A in terms of k “concepts” (e.g., “math,” “music,”
and “sports”), where the singular values in S_k express the signal strengths of these concepts,
the rows of V^T and columns of U express the “canonical row/column” associated with each
concept (e.g., a customer that likes only music products, or a product liked only by music
customers), and the rows of U (respectively, columns of V^T) approximately express each row
(respectively, column) of A as a linear combination (scaled by S_k) of the “canonical rows”
(respectively, canonical columns).
Conceptually, this method of producing a low-rank approximation is as clean as could
be imagined: we re-represent A using the SVD, which provides a list of A’s “ingredients,”
ordered by “importance,” and we retain only the k most important ingredients. But is the
result of this elegant computation any good? Also, how does it compare to our previous
method of producing a low-rank approximation via PCA (Section 2)?
The first fact is that the two methods discussed for producing a low-rank approximation
are exactly the same.8
Fact 4.1 The matrix A_k defined in (1) and the matrix A_k defined in (4) are identical.
We won’t prove Fact 4.1, but pause to note its plausibility. Recall that in the PCA-based
solution defined in (1), we defined Z_k^T to be the top k principal components of A — the first
k eigenvectors of the covariance matrix A^T A. As noted in Section 3.2, the right singular
vectors of A (i.e., the rows of V^T) are also the eigenvectors of A^T A. Thus, the matrices
Z_k^T and V_k^T are identical, both equal to the top k eigenvectors of A^T A (equivalently,
the top k right singular vectors of A). Given this, it is not surprising that the two definitions
of A_k are the same: both the matrix Y_k in (1) and the matrix U_k S_k in (4) are intuitively
defining the linear combinations of the rows of Z_k^T and V_k^T that give the best
approximation to A. In the PCA-based solution in Section 2, this is explicitly how Y_k is
defined; the SVD encodes the same linear combinations in the form U_k S_k.
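Fact 4.1 is also easy to spot-check numerically. The following self-contained sketch (ours) builds both approximations for a random centered matrix and confirms that they coincide.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((80, 15))
A = A - A.mean(axis=0)               # centered, as assumed by the PCA-based recipe
k = 4

# PCA-based approximation (1): project the rows onto the top-k eigenvectors of A^T A.
W = np.linalg.eigh(A.T @ A)[1][:, ::-1][:, :k]
A_pca = (A @ W) @ W.T

# SVD-based approximation (4): truncate the SVD after k terms.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = (U[:, :k] * s[:k]) @ Vt[:k, :]

print(np.allclose(A_pca, A_svd))     # True, as promised by Fact 4.1
```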
Our second fact justifies our methods by stating that the low-rank approximations they
produce are optimal in a natural sense. The guarantee is in terms of the “Frobenius norm”
of a matrix M, which just means applying the ℓ2 norm to the matrix as if it were a vector:

    ‖M‖_F = sqrt( Σ_{i,j} m_{ij}^2 ).
Fact 4.2 For every n × d matrix A, rank target k ≥ 1, and rank-k n × d matrix B,

    ‖A − A_k‖_F ≤ ‖A − B‖_F.
Intuitively, Fact 4.2 should not surprise you given last lecture’s guarantee for PCA: PCA chooses
the k-dimensional subspace minimizing the sum of squared Euclidean distances between the
rows of A and the subspace, and the contribution of each row of A − A_k to the squared
Frobenius norm corresponds exactly to one of these squared Euclidean distances.
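The sketch below (ours) illustrates Fact 4.2 numerically: the truncated SVD's Frobenius error, which equals the square root of the sum of the discarded squared singular values, is no larger than that of an arbitrary rank-k competitor B.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((60, 40))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]

err_svd = np.linalg.norm(A - Ak, 'fro')
print(np.isclose(err_svd, np.sqrt(np.sum(s[k:] ** 2))))   # the error is the tail of singular values

# Any other rank-k matrix does at least as badly; here B is a random rank-k matrix.
B = rng.standard_normal((60, k)) @ rng.standard_normal((k, 40))
print(err_svd <= np.linalg.norm(A - B, 'fro'))             # True, as guaranteed by Fact 4.2
```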
Remark 4.3 (How to Choose k) When producing a low-rank matrix approximation, we’ve
been taking as a parameter the target rank k. But how should k be chosen? In a perfect
world, the eigenvalues of A^T A/singular values of A give strong guidance: if the top few
such values are big and the rest are small, then the obvious solution is to take k equal to the
number of big values. In a less perfect world, one takes k as small as possible subject to ob-
taining a useful approximation — of course what “useful” means depends on the application.
Rules of thumb often take the form: choose k such that the sum of the top k eigenvalues is
at least c times as big as the sum of the other eigenvalues, where c is a domain-dependent
constant (like 10, say).
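A sketch of this rule of thumb (ours), applied to the eigenvalues of A^T A computed as squared singular values; the helper name choose_k and the synthetic example are ours, and the constant c is domain-dependent.

```python
import numpy as np

def choose_k(A, c=10.0):
    """Smallest k whose top-k eigenvalues of A^T A (squared singular values of A)
    sum to at least c times the sum of the remaining eigenvalues."""
    energies = np.linalg.svd(A, compute_uv=False) ** 2
    total = energies.sum()
    running = np.cumsum(energies)
    for k in range(1, len(energies) + 1):
        if running[k - 1] >= c * (total - running[k - 1]):
            return k
    return len(energies)

# A nearly rank-3 matrix: three strong directions plus a little noise.
rng = np.random.default_rng(9)
A = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 50))
A = A + 0.01 * rng.standard_normal((100, 50))
print(choose_k(A, c=10.0))   # a small value (typically 3) for this nearly rank-3 matrix
```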
Remark 4.4 (Lossy Compression via Truncated Decompositions) Using the SVD to
produce low-rank matrix approximations introduces a useful paradigm for lossy compression
that we’ll exploit further in later lectures. The first step of the paradigm is to re-express the
raw data exactly as a decomposition into several terms (as in (2)). The second step is to
throw away all but the “most important” terms, yielding an approximation of the original data.
This paradigm works well when you can find a representation of the data such that most of
the interesting information is concentrated in just a few components of the decomposition.
The appropriate representation will depend on the data set — though some rules of thumb
can be learned, as we’ll discuss — and of course, messy enough data sets might not admit
any nice representations at all.
For example, if a matrix is known to have rank one and only a few of its entries are missing, it is
intuitively clear that one can reconstruct the missing entries — if not too many are missing, there
will be only one way of “filling in the blanks” that results in a rank-one matrix.
More generally, if there aren’t too many missing entries, and if the matrix to be recovered
is approximately low rank, then the following application of the SVD can yield a good guess
as to the missing entries.
1. Fill in the missing entries with suitable default values to obtain a matrix Â. Examples
for default values include zero, the average value of an entry of the matrix, the average
value of the row containing the entry, and the average value of the column containing
the entry. The performance of the method improves with more accurate choices of
default values.
2. Compute the best rank-k approximation to Â. (The usual comments about how to
choose k apply; see also Remark 4.3.)
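Below is a sketch (ours) of this two-step recipe on a synthetic example: hide a random 20% of the entries of a rank-2 matrix, fill them with column means, and read predictions off the rank-2 approximation. The setup and names are our own, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, k = 100, 40, 2

# Ground-truth low-rank matrix, with roughly 20% of its entries hidden.
M = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
observed = rng.random((n, d)) < 0.8

# Step 1: fill each missing entry with the average of the observed entries in its column.
col_sums = np.where(observed, M, 0.0).sum(axis=0)
col_means = col_sums / np.maximum(observed.sum(axis=0), 1)
A_hat = np.where(observed, M, col_means)

# Step 2: best rank-k approximation of the filled-in matrix; its entries are our guesses.
U, s, Vt = np.linalg.svd(A_hat, full_matrices=False)
M_guess = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Compare the guesses on the hidden entries against the default fill and the truth.
hidden = ~observed
guess_rmse = np.sqrt(np.mean((M_guess[hidden] - M[hidden]) ** 2))
fill_rmse = np.sqrt(np.mean((A_hat[hidden] - M[hidden]) ** 2))
print(guess_rmse, fill_rmse)   # the rank-k guess typically beats the plain column-mean fill
```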