Dimensionality Reduction
Example data matrix A: rows = customers (or documents), columns = products (or terms)
$A_{ij}$ = rating of the j-th product by the i-th customer
There are two prototype documents (vectors of words): blue and red
To describe the data it is enough to describe the two prototypes, and the projection weights for each row:
$A_i = w_{i,1} d_1 + w_{i,2} d_2$
A is a rank-2 matrix
[Figure: document-term matrix A]
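A minimal numpy sketch of this setup (the prototype vectors and weights below are made up for illustration):

```python
import numpy as np

# Hypothetical prototype documents (term weights are made up for illustration)
d_blue = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
d_red  = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Each document (row) is a weighted combination of the two prototypes
W = np.array([[2.0, 0.0],
              [1.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])
A = W @ np.vstack([d_blue, d_red])

# Two prototypes plus per-row weights describe the whole matrix
print(np.linalg.matrix_rank(A))   # 2
```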
$A_k$ is an approximation of A
$A_k$ is the best approximation of A:
The rank-k approximation matrix $A_k$ produced by the top-k singular vectors of A minimizes the Frobenius norm of the difference with the matrix A
$A_k = \arg\min_{B:\ \mathrm{rank}(B)=k} \|A - B\|_F$, where $\|A - B\|_F = \sqrt{\sum_{i,j} (A_{ij} - B_{ij})^2}$
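A small numpy sketch of this property: truncating the SVD to the top-k singular vectors gives the rank-k matrix closest to A in Frobenius norm (the random matrix below is just a stand-in for A):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation

err = np.linalg.norm(A - A_k, 'fro')
# The error equals the square root of the sum of the discarded squared singular values
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```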
We can project the row (and column) vectors
of the matrix A into a k-dimensional space
and preserve most of the information
(Ideally) The k dimensions reveal latent
features/aspects/topics of the term
(document) space.
(Ideally) The rank-k approximation $A_k$ of matrix A contains all the useful information, and what is discarded is noise
Rows (columns) are linear combinations of k
latent factors
E.g., in our extreme document example there are
two factors
Some noise is added to this rank-k matrix
resulting in higher rank
[Figure: data matrix over the objects = rank-k matrix + noise]
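A sketch of this effect on synthetic data: an exactly rank-2 matrix plus entrywise noise becomes full rank, but its truncated SVD stays close to the noiseless matrix (the sizes and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 2

# Exactly rank-2 "true" matrix, then a small amount of noise on every entry
L = rng.standard_normal((50, k)) @ rng.standard_normal((k, 30))
A = L + 0.01 * rng.standard_normal(L.shape)

print(np.linalg.matrix_rank(L), np.linalg.matrix_rank(A))   # 2 vs 30: noise makes it full rank

# Truncated SVD of the noisy matrix stays close to the noiseless one
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(L - A_k, 'fro') / np.linalg.norm(L, 'fro'))   # small relative error
```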
Data: Users rating movies
Sparse and often noisy
Assumption: There are k basic user profiles, and each user
is a linear combination of these profiles
E.g., action, comedy, drama, romance
Each user is a weighted combination of these profiles
The “true” matrix A has rank k
What we observe is a noisy and incomplete version of this matrix, $\tilde{A}$
The rank-k approximation $\tilde{A}_k$ is provably close to $A_k$
Algorithm: compute $\tilde{A}_k$ and predict for user $u$ and movie $m$ the value $\tilde{A}_k(u, m)$
Model-based collaborative filtering
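A toy sketch of this scheme; filling the unobserved entries with column means before the SVD is an assumption made here for illustration, not the only way to handle missing ratings:

```python
import numpy as np

# Toy users x movies ratings, NaN = unobserved
ratings = np.array([[5., 4., np.nan, 1.],
                    [4., np.nan, 1., 1.],
                    [1., 1., 5., np.nan],
                    [np.nan, 1., 4., 5.]])

# Assumed preprocessing: replace missing ratings by the movie (column) mean
col_means = np.nanmean(ratings, axis=0)
filled = np.where(np.isnan(ratings), col_means, ratings)

# Rank-k approximation of the filled matrix
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
pred = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

user, movie = 0, 2
print(pred[user, movie])   # predicted rating for an unobserved (user, movie) pair
```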
PCA is a special case of SVD, applied to the centered data matrix (its principal components are the eigenvectors of the covariance matrix).
Goal: reduce the dimensionality while preserving the
“information in the data”
Information in the data: variability in the data
We measure variability using the covariance matrix.
Sample covariance of variables X and Y: $\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)$
Given matrix A, remove the mean of each column
from the column vectors to get the centered matrix C
The matrix $\frac{1}{n} C^T C$ is the covariance matrix of the row vectors of A.
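A short numpy check of this construction, using a random matrix as a stand-in for A:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 4))

C = A - A.mean(axis=0)              # remove the mean of each column
cov = C.T @ C / A.shape[0]          # covariance of the row vectors of A

# Matches numpy's covariance (columns as variables, 1/n normalization)
print(np.allclose(cov, np.cov(A, rowvar=False, bias=True)))   # True
```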
We will project the rows of matrix A into a new
set of attributes (dimensions) such that:
The attributes have zero covariance to each other
(they are orthogonal)
Each attribute captures the most remaining variance
in the data, while orthogonal to the existing attributes
The first attribute should capture the most variance in the data
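A sketch illustrating these properties: projecting the centered rows onto the right singular vectors yields attributes with (numerically) zero covariance and decreasing variance; the data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))   # correlated columns
C = A - A.mean(axis=0)                                            # centered matrix

U, s, Vt = np.linalg.svd(C, full_matrices=False)
Z = C @ Vt.T                        # new attributes: projections onto right singular vectors

cov_Z = Z.T @ Z / C.shape[0]
print(np.round(cov_Z, 6))           # (near-)diagonal, with decreasing diagonal entries
```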
Output:
1st (right) singular vector: direction of maximal variance
2nd (right) singular vector: direction of maximal variance, orthogonal to the first
σ1: measures how much of the data variance is explained by the first singular vector
[Figure: data points with the 1st and 2nd singular vector directions]
Fraction of the data variance explained by the top-k singular values: $\frac{\sum_{j=1}^{k} \sigma_j^2}{\sum_{j=1}^{n} \sigma_j^2} \approx 0.85$
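Computed from the singular values of the centered matrix, the retained-variance fraction looks like this (synthetic data; the cut-off k is chosen arbitrarily here):

```python
import numpy as np

rng = np.random.default_rng(4)
C = rng.standard_normal((100, 10))       # stand-in for the centered data matrix
C -= C.mean(axis=0)

s = np.linalg.svd(C, compute_uv=False)   # singular values, largest first
k = 2                                    # assumed cut-off for illustration
explained = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(explained)                         # fraction of variance retained by the top-k components
```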
Example: drug use by students
Matrix A: rows = students, columns = drugs (legal and illegal)
$A_{ij}$: usage of student i of drug j
$A = U \Sigma V^T$
First right singular vector
More or less the same weight to all drugs
Discriminates heavy from light users
Second right singular vector
Positive values for legal drugs, negative for illegal
[Scatter plot of the data over two drugs (Drug 1, Drug 2 axes) with the singular vector directions]
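A sketch with a made-up students × drugs matrix (first two columns “legal”, last two “illegal”) showing how the two right singular vectors can be read:

```python
import numpy as np

# rows = students, columns = [legal1, legal2, illegal1, illegal2] (invented values)
A = np.array([[5., 4., 1., 0.],
              [4., 5., 0., 1.],
              [1., 0., 4., 5.],
              [0., 1., 5., 4.],
              [3., 3., 3., 3.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(Vt[0], 2))   # 1st right singular vector: similar weight on all drugs
print(np.round(Vt[1], 2))   # 2nd: opposite signs for legal vs illegal drugs
```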
The chosen vectors are such that they minimize the sum of squared differences between the data vectors and their low-dimensional projections