Dimensionality Reduction

SVD and PCA are dimensionality reduction techniques that can be applied to datasets with many attributes. SVD decomposes a matrix into the product of three matrices, revealing underlying linear structures in the data. PCA projects data onto a new set of orthogonal attributes (principal components) that successively capture the most variance. Both techniques aim to represent the data with fewer dimensions while preserving as much information as possible.

SVD and PCA

• Real data usually have thousands, or even millions, of dimensions
  • E.g., web documents, where the dimensionality is the vocabulary of words
  • E.g., the Facebook graph, where the dimensionality is the number of users
• A huge number of dimensions causes problems
  • Data becomes very sparse, and some algorithms become meaningless (e.g., density-based clustering)
  • The complexity of several algorithms depends on the dimensionality, and they become infeasible
• Usually the data can be described with fewer dimensions, without losing much of the meaning of the data
  • The data reside in a space of lower dimensionality
• Essentially, we assume that some of the data is noise, and we can approximate the useful part with a lower-dimensionality space
  • Dimensionality reduction does not just reduce the amount of data; it often brings out the useful part of the data
• We have already seen a form of dimensionality reduction
  • LSH and random projections reduce the dimension while preserving the distances
• SVD is "the Rolls-Royce and the Swiss Army Knife of Numerical Linear Algebra."*
  *Dianne O'Leary, MMDS '06
• We are given n objects and d attributes describing the objects. Each object has d numeric values describing it.
• We will represent the data as an n×d real matrix A.
  • We can now use tools from linear algebra to process the data matrix
• Our goal is to produce a new n×k matrix B such that
  • It preserves as much of the information in the original matrix A as possible
  • It reveals something about the structure of the data in A
Example: document-term matrix
• d terms (e.g., theorem, proof, etc.), n documents
• Aij = frequency of the j-th term in the i-th document
• Goal: find subsets of terms that bring documents together

Example: customer-movie matrix
• d movies, n customers
• Aij = rating of the j-th product by the i-th customer
• Goal: find subsets of movies that capture the behavior of the customers
• We assume that vectors are column vectors
• We use v^T for the transpose of vector v (a row vector)
• Dot product: u^T·v (1×n times n×1 → 1×1)
  • Example: [1, 2, 3]·[4, 1, 2]^T = 12
  • The dot product is the projection of vector v on u (and vice versa)
  • u^T·v = ‖u‖·‖v‖·cos(u, v)
  • If ‖u‖ = 1 (unit vector), then u^T·v is the projection length of v on u
  • Example: [1, 2, 3]·[4, 1, -2]^T = 0: orthogonal vectors
• Orthonormal vectors: two unit vectors that are orthogonal
• An n×m matrix A is a collection of n row vectors and m column vectors
• Matrix-vector multiplication (a short R sketch follows this example)
  • Right multiplication A·u: projection of u onto the row vectors of A, or projection of the row vectors of A onto u
  • Left multiplication u^T·A: projection of u onto the column vectors of A, or projection of the column vectors of A onto u
  • Example:
    [1, 2, 3] · [1 0; 0 1; 0 0] = [1, 2]
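Below is a minimal R sketch of these operations (R is also used for the examples at the end of these notes); the vectors and the 3×2 matrix are the ones from the example above.

# Dot product, norm, projection length, and matrix-vector multiplication
u <- c(1, 2, 3)
v <- c(4, 1, 2)

sum(u * v)                      # dot product u'v = 12
t(u) %*% v                      # the same value, returned as a 1x1 matrix
sqrt(sum(v * v))                # Euclidean norm of v
sum(u * v) / sqrt(sum(v * v))   # projection length of u on v (v scaled to unit length)

# Left multiplication u'A projects u onto the column vectors of A
A <- matrix(c(1, 0,
              0, 1,
              0, 0), nrow = 3, byrow = TRUE)
t(u) %*% A                      # gives [1, 2], as in the example above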
• Row space of A: the set of vectors that can be written as a linear combination of the rows of A
  • All vectors of the form w^T·A
• Column space of A: the set of vectors that can be written as a linear combination of the columns of A
  • All vectors of the form A·w
• Rank of A: the number of linearly independent row (or column) vectors
  • These vectors define a basis for the row (or column) space of A
• In a rank-1 matrix, all columns (or rows) are multiples of the same column (or row) vector, e.g.
    [ 1  2  -1 ]
    [ 2  4  -2 ]
    [ 3  6  -3 ]
  • All rows are multiples of [1, 2, -1]
  • All columns are multiples of [1, 2, 3]^T
• External (outer) product: u·v^T (n×1 times 1×m → n×m)
  • The resulting n×m matrix has rank 1: all rows (or columns) are linearly dependent (see the R check below)
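A small illustrative R check (not part of the original notes): the outer product u·v^T has rank 1 and reproduces the example matrix above.

u <- c(1, 2, 3)
v <- c(1, 2, -1)
A <- u %*% t(v)   # outer product: (3x1) times (1x3) gives a 3x3 matrix
A                 # rows are multiples of v, columns are multiples of u
qr(A)$rank        # the rank is 1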
• (Right) eigenvector of matrix A: a vector v such that A·v = λ·v
  • λ: the eigenvalue of eigenvector v
• A square symmetric matrix A of rank r has r orthonormal eigenvectors u1, u2, …, ur with eigenvalues λ1, λ2, …, λr
• The eigenvectors define an orthonormal basis for the column space of A
• Singular Value Decomposition (SVD):
  A = U·Σ·V^T = [u1 u2 ⋯ ur] · diag(σ1, σ2, …, σr) · [v1 v2 ⋯ vr]^T
  [n×m] = [n×r] [r×r] [r×m]
  r: rank of matrix A

• σ1 ≥ σ2 ≥ ⋯ ≥ σr: the singular values of matrix A (also, the square roots of the eigenvalues of A·A^T and A^T·A)
• u1, u2, …, ur: the left singular vectors of A (also, eigenvectors of A·A^T)
• v1, v2, …, vr: the right singular vectors of A (also, eigenvectors of A^T·A)
• Equivalently, A = σ1·u1·v1^T + σ2·u2·v2^T + ⋯ + σr·ur·vr^T, a sum of rank-1 matrices (see the R check below)
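An illustrative R check of this sum-of-rank-1-terms form, on an arbitrary small matrix (the data are random, purely for the check):

set.seed(1)
A <- matrix(rnorm(20), nrow = 5)    # a 5 x 4 matrix
s <- svd(A)
r <- length(s$d)                    # number of singular values returned

A_rebuilt <- matrix(0, nrow(A), ncol(A))
for (i in 1:r) {
  A_rebuilt <- A_rebuilt + s$d[i] * (s$u[, i] %*% t(s$v[, i]))   # add the i-th rank-1 term
}
max(abs(A - A_rebuilt))             # ~0, up to numerical error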


• Special case: A is a symmetric positive definite matrix
  • A = λ1·u1·u1^T + λ2·u2·u2^T + ⋯ + λr·ur·ur^T
  • λ1 ≥ λ2 ≥ ⋯ ≥ λr ≥ 0: the eigenvalues of A
  • u1, u2, …, ur: the eigenvectors of A
• The left singular vectors are an orthonormal basis for the column space of A
• The right singular vectors are an orthonormal basis for the row space of A
• If A has rank r, then A can be written as the sum of r rank-1 matrices
• There are r "linear components" (trends) in A
  • Linear trend: the tendency of the row vectors of A to align with vector v
  • Strength of the i-th linear trend: ‖A·vi‖ = σi
Example: document-term matrix
• The blue and red rows (columns) are linearly dependent
  A = [document-term matrix shown as a figure]
• There are two prototype documents (vectors of words): blue and red
• To describe the data it is enough to describe the two prototypes and the projection weights for each row
• A is a rank-2 matrix: A = σ1·u1·v1^T + σ2·u2·v2^T
Example: noisy document-term matrix
  A = [document-term matrix shown as a figure]
• There are two prototype documents and words, but they are noisy
  • We now have more than two singular vectors, but the strongest ones are still about the two types
  • By keeping the two strongest singular vectors we obtain most of the information in the data
  ▪ This is a rank-2 approximation of the matrix A
• Rank-k approximation: Ak = Uk·Σk·Vk^T
  [n×d] = [n×k] [k×k] [k×d]
• Uk (Vk): orthogonal matrix containing the top-k left (right) singular vectors of A
• Σk: diagonal matrix containing the top-k singular values of A
• Ak is an approximation of A; in fact, it is the best rank-k approximation of A
• The rank-k approximation matrix Ak produced by the top-k singular vectors of A minimizes the Frobenius norm of the difference with the matrix A (a small R check follows below):
  Ak = arg min_{B: rank(B) = k} ‖A − B‖_F
  where ‖A − B‖_F = √( Σ_{i,j} (Aij − Bij)² )
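A hedged R sketch of this property on synthetic data: build a low-rank matrix, add a little noise, keep the top-k singular vectors, and compare the Frobenius error with the discarded singular values (for the best rank-k approximation the error equals √(σ_{k+1}² + ⋯ + σ_r²)).

set.seed(2)
low_rank <- matrix(rnorm(20), 10, 2) %*% matrix(rnorm(16), 2, 8)   # a rank-2 matrix
A <- low_rank + 0.05 * matrix(rnorm(80), 10, 8)                    # plus a little noise

s <- svd(A)
k <- 2
A_k <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])   # rank-k approximation

sqrt(sum((A - A_k)^2))     # Frobenius norm of the approximation error
sqrt(sum(s$d[-(1:k)]^2))   # the same value, computed from the discarded singular values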
• We can project the row (and column) vectors of the matrix A into a k-dimensional space and preserve most of the information
• (Ideally) The k dimensions reveal latent features/aspects/topics of the term (document) space
• (Ideally) The Ak approximation of matrix A contains all the useful information, and what is discarded is noise
• Rows (columns) are linear combinations of k latent factors
  • E.g., in our extreme document example there are two factors
• Some noise is added to this rank-k matrix, resulting in a higher rank
• SVD retrieves the latent factors (hopefully)

A = U Σ V^T
[figure: the objects × features matrix A decomposed into a significant part and noise]
• Data: users rating movies
  • Sparse and often noisy
• Assumption: there are k basic user profiles, and each user is a linear combination of these profiles
  • E.g., action, comedy, drama, romance
  • Each user is a weighted combination of these profiles
  • The "true" matrix has rank k
• What we observe is a noisy and incomplete version of this matrix, Ã
• The rank-k approximation Ãk is provably close to Ak
• Algorithm: compute Ãk and predict for user u and movie m the value Ãk[u, m]
  • This is model-based collaborative filtering (a sketch in R follows below)
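A sketch of model-based collaborative filtering via a truncated SVD. The function name, the toy ratings matrix, and the choice to fill missing entries with the column (movie) mean before taking the SVD are illustrative assumptions, not prescribed by the notes.

predict_ratings <- function(ratings, k) {
  filled <- ratings
  movie_means <- colMeans(ratings, na.rm = TRUE)
  for (j in seq_len(ncol(filled))) {
    filled[is.na(filled[, j]), j] <- movie_means[j]   # impute missing ratings (one simple choice)
  }
  s <- svd(filled)
  # The rank-k reconstruction is the denoised matrix used for prediction
  s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])
}

# Toy usage: 4 users x 5 movies, NA = not rated
ratings <- matrix(c(5, 4, NA, 1, 1,
                    4, 5, 4, NA, 2,
                    1, NA, 2, 5, 4,
                    2, 1, NA, 4, 5), nrow = 4, byrow = TRUE)
pred <- predict_ratings(ratings, k = 2)
pred[1, 3]   # predicted rating of user 1 for movie 3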
• PCA is a special case of SVD, applied to the mean-centered data matrix (equivalently, an eigendecomposition of the covariance matrix)
• Goal: reduce the dimensionality while preserving the "information in the data"
  • Information in the data: the variability in the data
  • We measure variability using the covariance matrix
• Sample covariance of variables X and Y:
  Cov(X, Y) = (1/n) Σ_i (xi − μX)(yi − μY)
• Given matrix A, remove the mean of each column from the column vectors to get the centered matrix C
• The matrix C^T·C is (up to the 1/n factor) the covariance matrix of the row vectors of A
• We will project the rows of matrix A onto a new set of attributes (dimensions) such that:
  • The attributes have zero covariance to each other (they are orthogonal)
  • Each attribute captures the most remaining variance in the data, while being orthogonal to the existing attributes
  ▪ The first attribute should capture the most variance in the data
• For the centered matrix C, the variance of the rows of C when projected onto vector x is given by σ² = ‖C·x‖²
• The first right singular vector of C maximizes σ² (see the R check below)
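An illustrative R check on synthetic data: after centering the columns, C^T·C matches the covariance matrix up to the 1/n factor, and the variance of the rows projected onto the first right singular vector equals σ1².

set.seed(3)
A <- matrix(rnorm(200), nrow = 50, ncol = 4)
C <- scale(A, center = TRUE, scale = FALSE)        # subtract each column mean
n <- nrow(C)

max(abs(t(C) %*% C / n - cov(A) * (n - 1) / n))    # ~0: same matrix, different normalizer

s <- svd(C)
v1 <- s$v[, 1]                                     # first right singular vector
c(sum((C %*% v1)^2), s$d[1]^2)                     # variance along v1 equals sigma_1^2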
Example: the input is a set of 2-dimensional points. Output:
• 1st (right) singular vector: the direction of maximal variance
• 2nd (right) singular vector: the direction of maximal variance, after removing the projection of the data along the first singular vector
[figure: scatter plot of the 2-d points with the two singular vectors drawn]
• σ1: measures how much of the data variance is explained by the first singular vector
• σ2: measures how much of the data variance is explained by the second singular vector
[figure: the same scatter plot, annotated with σ1 along the first singular vector]
• The variance in the direction of the k-th principal component is given by the corresponding squared singular value σk²
• Singular values can therefore be used to estimate how many components to keep
• Rule of thumb: keep enough components to explain about 85% of the variation (a short R sketch follows below):
  (σ1² + σ2² + ⋯ + σk²) / (σ1² + σ2² + ⋯ + σn²) ≈ 0.85
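A small R sketch of this rule of thumb on synthetic data: keep the smallest k whose singular values explain at least 85% of the total variance.

set.seed(4)
A <- matrix(rnorm(500), nrow = 100, ncol = 5)
C <- scale(A, center = TRUE, scale = FALSE)

s <- svd(C)
explained <- cumsum(s$d^2) / sum(s$d^2)   # cumulative fraction of variance explained
explained
k <- which(explained >= 0.85)[1]          # smallest k reaching the threshold
k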
Example: students and drug use
• A is a students × drugs matrix: rows are students, columns are drugs (ordered from legal to illegal)
  • Aij: the usage by student i of drug j
• Compute A = U·Σ·V^T
• First right singular vector v1:
  • Gives more or less the same weight to all drugs
  • Discriminates heavy from light users
• Second right singular vector v2:
  • Positive values for legal drugs, negative for illegal ones
[figure: scatter plot of students over the axes Drug 1 and Drug 2]
• The chosen vectors are those that minimize the sum of squared differences between the data vectors and their low-dimensional projections
[figure: the 2-d example again, with the 1st (right) singular vector drawn through the points]
• Latent Semantic Indexing (LSI):
  • Apply PCA on the document-term matrix, and index the k-dimensional vectors
  • When a query comes, project it onto the k-dimensional space and compute cosine similarity in this space
  • Principal components capture the main topics and enrich the document representation (a small R sketch follows below)
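A hedged LSI sketch in R: fold the documents and a query into the k-dimensional SVD space and rank documents by cosine similarity. The document-term matrix D, the query q, and the function name lsi_rank are illustrative placeholders, and the sketch uses a plain (uncentered) SVD, as is common for LSI.

lsi_rank <- function(D, q, k) {
  s  <- svd(D)
  Vk <- s$v[, 1:k, drop = FALSE]
  Sk <- diag(s$d[1:k], nrow = k)
  docs_k  <- D %*% Vk %*% solve(Sk)                  # documents in the latent space
  query_k <- as.numeric(t(q) %*% Vk %*% solve(Sk))   # query folded into the same space
  cosine  <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  apply(docs_k, 1, cosine, y = query_k)
}

# Toy usage: 4 documents x 6 terms; the query mentions terms 1 and 2
D <- matrix(c(2, 3, 0, 0, 1, 0,
              1, 2, 0, 0, 0, 0,
              0, 0, 3, 2, 0, 1,
              0, 1, 2, 3, 0, 0), nrow = 4, byrow = TRUE)
q <- c(1, 1, 0, 0, 0, 0)
lsi_rank(D, q, k = 2)   # cosine similarity of each document to the query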
# SVD
dat = seq(1, 240, 2)          # 120 values
X = matrix(dat, ncol = 12)    # a 10 x 12 matrix
s = svd(X)                    # compact SVD: s$u is 10x10, s$d has 10 values, s$v is 12x10
A = diag(s$d)                 # diagonal matrix of the singular values
s$u %*% A %*% t(s$v)          # X = U A V'
# Full SVD: request all left and right singular vectors
dat = seq(1, 240, 2)
X = matrix(dat, ncol = 12)
s = svd(X, nu = nrow(X), nv = ncol(X))   # s$u is 10x10, s$v is 12x12
A = diag(s$d)
A = cbind(A, 0)   # add two zero columns so that A has the same dimensions as X (10 x 12)
A = cbind(A, 0)
s$u %*% A %*% t(s$v)                     # X = U A V'
install.packages("jpeg")
library(jpeg)
tux = readJPEG("tux.jpg")
tux = imagematrix(tux,type='grey')
plot(tux)
reduce <- function(A, dim) {
  # Compute the SVD of A
  sing <- svd(A)
  # Keep only the top 'dim' singular vectors and singular values
  u <- as.matrix(sing$u[, 1:dim])
  v <- as.matrix(sing$v[, 1:dim])
  d <- as.matrix(diag(sing$d)[1:dim, 1:dim])
  # Rebuild the rank-'dim' approximation and wrap it as a grey-level image
  return(imagematrix(u %*% d %*% t(v), type = 'grey'))
}
tux_d = svd(tux)
length(tux_d$d)         # number of singular values of the image matrix
plot(reduce(tux, 1))    # rank-1 approximation
# 90% reduction
plot(reduce(tux, 35))   # rank-35 approximation
# PCA
iris2 = iris[, 1:4]   # assumption: iris2 is the numeric part of the built-in iris data set
pc = princomp(iris2)
summary(pc)           # variance explained by each principal component
pc$scores             # the data projected onto the principal components
pc$loadings           # the principal components (directions)
plot(pc$scores[,2], pc$scores[,1])   # plot the data on the second vs first principal component
