Chapter 7: Dimensionality Reduction
Mohammed J. Zaki¹ and Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Dimensionality Reduction
The goal of dimensionality reduction is to find a lower-dimensional representation of the data matrix D that avoids the curse of dimensionality.

Given an n × d data matrix, each point x_i = (x_{i1}, x_{i2}, ..., x_{id})^T is a vector in the ambient d-dimensional vector space spanned by the d standard basis vectors e_1, e_2, ..., e_d.

Given any other set of d orthonormal vectors u_1, u_2, ..., u_d, we can re-express each point x as

    x = a_1 u_1 + a_2 u_2 + · · · + a_d u_d = U a

where U is the d × d orthogonal matrix whose ith column comprises the ith basis vector u_i. Thus U^{-1} = U^T, and we have

    a = U^T x
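As a concrete illustration, here is a minimal NumPy sketch of this change of basis; the synthetic data matrix D and the orthonormal basis U (taken from a QR factorization) are illustrative assumptions, not part of the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(10, 3))          # synthetic n x d data matrix (illustrative)

    # Any orthonormal basis works; here we take Q from a QR factorization.
    U, _ = np.linalg.qr(rng.normal(size=(3, 3)))

    x = D[0]                              # one point in the standard basis
    a = U.T @ x                           # coordinates in the new basis: a = U^T x
    x_back = U @ a                        # reconstruct: x = U a

    assert np.allclose(x, x_back)         # U is orthogonal, so U^{-1} = U^T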
Optimal Basis: Projection in Lower Dimensional Space
There are potentially infinite choices for the orthonormal basis vectors. Our goal is to choose an optimal basis that preserves essential information about D.

We are interested in finding the optimal r-dimensional representation of D, with r ≪ d. The projection of x onto the first r basis vectors is given as

    x' = a_1 u_1 + a_2 u_2 + · · · + a_r u_r = ∑_{i=1}^{r} a_i u_i = U_r a_r

where U_r and a_r comprise the first r basis vectors and coordinates, respectively. Also, restricting a = U^T x to the first r terms, we have

    a_r = U_r^T x
    x' = U_r U_r^T x = P_r x

where P_r = U_r U_r^T = ∑_{i=1}^{r} u_i u_i^T is the orthogonal projection matrix for the subspace spanned by the first r basis vectors.
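A small NumPy sketch of the projection, continuing the hypothetical D and U from the previous example; the choice r = 2 is arbitrary and only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(10, 3))
    U, _ = np.linalg.qr(rng.normal(size=(3, 3)))

    r = 2
    U_r = U[:, :r]                         # first r basis vectors (d x r)
    P_r = U_r @ U_r.T                      # orthogonal projection matrix (d x d)

    x = D[0]
    a_r = U_r.T @ x                        # r-dimensional coordinates
    x_proj = P_r @ x                       # projection x' = P_r x

    # P_r is symmetric and idempotent, as an orthogonal projector should be.
    assert np.allclose(P_r @ P_r, P_r)
    assert np.allclose(U_r @ a_r, x_proj)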
Optimal Basis: Error Vector
The error vector is the part of x not captured by the first r basis vectors:

    ε = ∑_{i=r+1}^{d} a_i u_i = x − x'

Since ε lies in the span of the remaining basis vectors, it is orthogonal to the subspace spanned by u_1, ..., u_r.
Iris Data: Optimal One-dimensional Basis
[Figure: Iris data (attributes X1, X2, X3) and the optimal one-dimensional basis vector u_1.]
Iris Data: Optimal 2D Basis
[Figure: Iris data (attributes X1, X2, X3) and the optimal two-dimensional basis spanned by u_1 and u_2.]
Principal Component Analysis
Principal Component: Direction of Most Variance
We seek the unit vector u that maximizes the projected variance of the points. Let D be centered, and let Σ be its covariance matrix.

The projection of x_i on u is given as

    x'_i = ( (u^T x_i) / (u^T u) ) u = (u^T x_i) u = a_i u

Since D is centered, the mean of the projected coordinates is μ_u = 0, and the projected variance is

    σ_u^2 = (1/n) ∑_{i=1}^{n} (a_i − μ_u)^2 = (1/n) ∑_{i=1}^{n} u^T x_i x_i^T u = u^T ( (1/n) ∑_{i=1}^{n} x_i x_i^T ) u = u^T Σ u

We have to find the optimal basis vector u that maximizes the projected variance σ_u^2 = u^T Σ u, subject to the constraint u^T u = 1. Introducing a Lagrange multiplier α for the constraint, the maximization objective is given as

    max_u J(u) = u^T Σ u − α(u^T u − 1)
Principal Component: Direction of Most Variance
Given the objective max_u J(u) = u^T Σ u − α(u^T u − 1), we solve it by setting the derivative of J(u) with respect to u to the zero vector, to obtain

    ∂/∂u ( u^T Σ u − α(u^T u − 1) ) = 0
    that is, 2Σu − 2αu = 0
    which implies Σu = αu

Thus α is an eigenvalue of Σ, with u the corresponding eigenvector. Taking the dot product with u on both sides, the projected variance is

    σ_u^2 = u^T Σ u = u^T (αu) = α u^T u = α

To maximize the projected variance σ_u^2, we thus choose the largest eigenvalue λ_1 of Σ, and the dominant eigenvector u_1 specifies the direction of most variance, also called the first principal component.
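A minimal NumPy sketch of this result, assuming a centered data matrix D (synthetic here, purely for illustration): the dominant eigenvector of the covariance matrix attains a projected variance equal to the largest eigenvalue.

    import numpy as np

    rng = np.random.default_rng(1)
    D = rng.normal(size=(150, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic data
    D = D - D.mean(axis=0)                                     # center the data

    Sigma = (D.T @ D) / len(D)                                 # covariance matrix

    evals, evecs = np.linalg.eigh(Sigma)                       # ascending eigenvalues
    lam1 = evals[-1]                                           # largest eigenvalue
    u1 = evecs[:, -1]                                          # first principal component

    # Projected variance along u1 equals lambda_1.
    a = D @ u1
    assert np.isclose(a.var(), lam1)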
Iris Data: First Principal Component
[Figure: Iris data (attributes X1, X2, X3) with the first principal component u_1.]
Minimum Squared Error Approach
The direction that maximizes the projected variance is also the one that minimizes the average squared error. The mean squared error (MSE) optimization condition is

    MSE(u) = (1/n) ∑_{i=1}^{n} ||ε_i||^2 = (1/n) ∑_{i=1}^{n} ||x_i − x'_i||^2 = ∑_{i=1}^{n} ||x_i||^2 / n − u^T Σ u

Since the first term is fixed for a given dataset D, we see that the direction u_1 that maximizes the variance is also the one that minimizes the MSE. Further,

    ∑_{i=1}^{n} ||x_i||^2 / n = var(D) = tr(Σ) = ∑_{i=1}^{d} σ_i^2

so that MSE(u) = var(D) − u^T Σ u.
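A quick numeric check of this identity, reusing the hypothetical centered D and first principal component u1 from the earlier sketch:

    import numpy as np

    rng = np.random.default_rng(1)
    D = rng.normal(size=(150, 3)) @ np.diag([3.0, 1.0, 0.3])
    D = D - D.mean(axis=0)
    Sigma = (D.T @ D) / len(D)
    u1 = np.linalg.eigh(Sigma)[1][:, -1]

    proj = np.outer(u1, u1) @ D.T                 # projections x'_i as columns
    mse = np.mean(np.sum((D.T - proj) ** 2, axis=0))

    var_D = np.trace(Sigma)                       # total variance var(D) = tr(Sigma)
    assert np.isclose(mse, var_D - u1 @ Sigma @ u1)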
Best 2-dimensional Approximation
The best 2D subspace that captures the most variance in D comprises the eigenvectors u_1 and u_2 corresponding to the largest and second largest eigenvalues λ_1 and λ_2, respectively.

Let U_2 = (u_1  u_2) be the d × 2 matrix whose columns are the first two principal components. Given a point x_i ∈ D, its projected coordinates are computed as

    a_i = U_2^T x_i

Let A denote the projected 2D dataset. The total projected variance for A is given as

    var(A) = u_1^T Σ u_1 + u_2^T Σ u_2 = u_1^T λ_1 u_1 + u_2^T λ_2 u_2 = λ_1 + λ_2

The first two principal components also minimize the mean squared error objective, since

    MSE = (1/n) ∑_{i=1}^{n} ||x_i − x'_i||^2 = var(D) − (1/n) ∑_{i=1}^{n} x_i^T P_2 x_i = var(D) − var(A)
Optimal and Non-optimal 2D Approximations
The optimal subspace maximizes the variance and minimizes the squared error, whereas a non-optimal subspace captures less variance and has a higher mean squared error, as seen from the lengths of the error vectors (line segments).
[Figure: Optimal versus non-optimal 2D approximations of the Iris data (attributes X1, X2, X3), with the error vectors drawn as line segments from each point to its projection onto the subspace spanned by u_1 and u_2.]
Best r -dimensional Approximation
To find the best r-dimensional approximation to D, we compute the eigenvalues of Σ. Because Σ is positive semidefinite, its eigenvalues are non-negative and can be sorted in decreasing order:

    λ_1 ≥ λ_2 ≥ · · · ≥ λ_r ≥ λ_{r+1} ≥ · · · ≥ λ_d ≥ 0

We select the r largest eigenvalues, and their corresponding eigenvectors, to form the best r-dimensional approximation.

Total Projected Variance: Let U_r = (u_1 · · · u_r) be the r-dimensional basis vector matrix, with the projection matrix given as P_r = U_r U_r^T = ∑_{i=1}^{r} u_i u_i^T.

Let A denote the dataset formed by the coordinates of the projected points in the r-dimensional subspace. The projected variance is given as

    var(A) = (1/n) ∑_{i=1}^{n} x_i^T P_r x_i = ∑_{i=1}^{r} u_i^T Σ u_i = ∑_{i=1}^{r} λ_i
Choosing the Dimensionality
One criterion for choosing r is to compute the fraction of the total variance captured by the first r principal components:

    f(r) = (λ_1 + λ_2 + · · · + λ_r) / (λ_1 + λ_2 + · · · + λ_d) = ∑_{i=1}^{r} λ_i / ∑_{i=1}^{d} λ_i = ∑_{i=1}^{r} λ_i / var(D)

Given a certain desired variance threshold, say α, starting from the first principal component we keep adding components, and stop at the smallest value of r for which f(r) ≥ α. In other words, we select the fewest dimensions such that the subspace spanned by those r dimensions captures at least an α fraction (say 0.9) of the total variance.
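A minimal NumPy sketch of this selection rule, assuming the eigenvalues are already sorted in decreasing order; the eigenvalue triple in the usage example is an approximation consistent with the Iris fractions reported later, and alpha = 0.9 is the illustrative threshold from the text.

    import numpy as np

    def choose_dimensionality(eigvals, alpha=0.9):
        """Return the smallest r with f(r) >= alpha, given eigenvalues in decreasing order."""
        lam = np.asarray(eigvals, dtype=float)
        f = np.cumsum(lam) / lam.sum()        # f(r) for r = 1, ..., d
        return int(np.searchsorted(f, alpha) + 1)

    # Approximate Iris eigenvalues (f(1) ~ 0.92, f(2) ~ 0.99, f(3) = 1.0):
    print(choose_dimensionality([3.66, 0.24, 0.06], alpha=0.9))   # -> 1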
Principal Component Analysis: Algorithm
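A minimal NumPy sketch of the PCA procedure described on the preceding slides (center the data, compute the covariance matrix, keep the eigenvectors whose eigenvalues capture at least an α fraction of the variance, and project); this is an illustrative implementation under those assumptions, not the book's exact pseudocode.

    import numpy as np

    def pca(D, alpha=0.95):
        """Project D onto the fewest principal components capturing >= alpha of the variance."""
        D = np.asarray(D, dtype=float)
        Z = D - D.mean(axis=0)                      # center the data
        Sigma = (Z.T @ Z) / len(Z)                  # covariance matrix
        evals, evecs = np.linalg.eigh(Sigma)        # ascending order
        evals, evecs = evals[::-1], evecs[:, ::-1]  # sort in decreasing order
        f = np.cumsum(evals) / evals.sum()          # fraction of total variance
        r = int(np.searchsorted(f, alpha) + 1)      # smallest r with f(r) >= alpha
        U_r = evecs[:, :r]                          # reduced basis (d x r)
        A = Z @ U_r                                 # projected coordinates a_i = U_r^T x_i
        return A, U_r, evals, r

    # Usage on synthetic data:
    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])
    A, U_r, evals, r = pca(X, alpha=0.9)
    print(r, A.shape)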
Iris Principal Components
Covariance matrix:

    Σ = |  0.681  −0.039   1.265 |
        | −0.039   0.187  −0.320 |
        |  1.265  −0.320   3.092 |

    r      1      2      3
    f(r)   0.925  0.985  1.0
[Figure: Iris data (attributes X1, X2, X3) with all three principal components u_1, u_2, u_3.]
Iris Principal Components: Projected Data (2D)
[Figure: Iris data projected onto the first two principal components, with coordinates along u_1 (horizontal axis) and u_2 (vertical axis).]
Geometry of PCA
Geometrically, when r = d, PCA corresponds to an orthogonal change of basis, so that the total variance is captured by the sum of the variances along each of the principal directions u_1, u_2, ..., u_d, and all covariances are zero.

Let U be the d × d orthogonal matrix U = (u_1  u_2  · · ·  u_d), with U^{-1} = U^T, and let

    Σ u_i = λ_i u_i   for all 1 ≤ i ≤ d

so that Σ = U Λ U^T, i.e., the covariance matrix is diagonal in the principal component basis. Assuming Σ is invertible, the equation

    x^T Σ^{-1} x = 1

defines a d-dimensional ellipsoid (or hyper-ellipse). The eigenvectors u_i of Σ, that is, the principal components, are the directions of the principal axes of the ellipsoid. The square roots of the eigenvalues, that is, √λ_i, give the lengths of the semi-axes.
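A small NumPy check of this geometry on a synthetic covariance matrix (purely illustrative): rotating the data into the principal component basis diagonalizes the covariance, and the semi-axis lengths of the ellipsoid are the square roots of the eigenvalues.

    import numpy as np

    rng = np.random.default_rng(3)
    Z = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))   # synthetic data
    Z = Z - Z.mean(axis=0)                                    # center
    Sigma = (Z.T @ Z) / len(Z)

    evals, U = np.linalg.eigh(Sigma)                          # Sigma = U Lambda U^T

    # In the PC basis the covariance becomes the diagonal matrix Lambda.
    A = Z @ U
    Sigma_A = (A.T @ A) / len(A)
    assert np.allclose(Sigma_A, np.diag(evals), atol=1e-10)

    # Semi-axis lengths of the ellipsoid x^T Sigma^{-1} x = 1:
    print(np.sqrt(evals))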
Iris: Elliptic Contours in Standard Basis
[Figure: Iris data with elliptic contours of the covariance ellipsoid shown in the standard basis, along with the principal directions u_1, u_2, u_3.]
Iris: Axis-Parallel Ellipsoid in PC Basis
[Figure: The same ellipsoid is axis-parallel in the principal component basis u_1, u_2, u_3.]
Kernel Principal Component Analysis
Let φ be a function that maps a point x_i in input space to its image φ(x_i) in feature space. Let the points in feature space be centered, and let Σ_φ be the covariance matrix in feature space. The first principal component in feature space corresponds to the dominant eigenvector

    Σ_φ u_1 = λ_1 u_1

where

    Σ_φ = (1/n) ∑_{i=1}^{n} φ(x_i) φ(x_i)^T
Kernel Principal Component Analysis
It can be shown that u_1 = ∑_{i=1}^{n} c_i φ(x_i). That is, the principal component direction in feature space is a linear combination of the transformed points.

The coefficients are captured in the weight vector

    c = (c_1, c_2, · · · , c_n)^T

which satisfies the eigenvector equation for the n × n kernel matrix K:

    K c = nλ_1 c = η_1 c
Kernel Principal Component Analysis
The weight vector c can then be used to find u_1 via u_1 = ∑_{i=1}^{n} c_i φ(x_i).

The only constraint we impose is that u_1 should be normalized to be a unit vector, which implies ||c||^2 = 1/η_1.

We cannot compute the principal direction directly, but we can project any point φ(x) onto the principal direction u_1, as follows:

    u_1^T φ(x) = ∑_{i=1}^{n} c_i φ(x_i)^T φ(x) = ∑_{i=1}^{n} c_i K(x_i, x)
Kernel PCA Algorithm
KernelPCA (D, K, α):

 1. K = { K(x_i, x_j) }_{i,j=1,...,n}                            // compute n × n kernel matrix
 2. K = (I − (1/n) 1_{n×n}) K (I − (1/n) 1_{n×n})                // center the kernel matrix
 3. (η_1, η_2, ..., η_d) = eigenvalues(K)                        // compute eigenvalues
 4. (c_1  c_2  · · ·  c_n) = eigenvectors(K)                     // compute eigenvectors
 5. λ_i = η_i / n for all i = 1, ..., n                          // compute variance for each component
 6. c_i = √(1/η_i) · c_i for all i = 1, ..., n                   // ensure that u_i^T u_i = 1
 7. f(r) = ∑_{i=1}^{r} λ_i / ∑_{i=1}^{d} λ_i, for all r = 1, 2, ..., d   // fraction of total variance
 8. Choose smallest r so that f(r) ≥ α                           // choose dimensionality
 9. C_r = (c_1  c_2  · · ·  c_r)                                 // reduced basis
10. A = { a_i | a_i = C_r^T K_i, for i = 1, ..., n }             // reduced-dimensionality data, where K_i is the ith column of the centered kernel matrix
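A hedged NumPy sketch following these steps, using the homogeneous quadratic kernel that appears in the later figures; the helper name kernel_pca and the synthetic data are illustrative assumptions, and dropping the (numerically) zero eigenvalues before rescaling is a practical addition not spelled out in the listed steps.

    import numpy as np

    def kernel_pca(D, kernel, alpha=0.95):
        """Kernel PCA following the slide's steps; returns the reduced-dimensionality data A."""
        n = len(D)
        K = np.array([[kernel(xi, xj) for xj in D] for xi in D])   # n x n kernel matrix
        C = np.eye(n) - np.ones((n, n)) / n
        K = C @ K @ C                                              # center the kernel matrix
        eta, vecs = np.linalg.eigh(K)                              # ascending eigenvalues
        eta, vecs = eta[::-1], vecs[:, ::-1]                       # sort in decreasing order
        pos = eta > 1e-10                                          # drop numerically zero eigenvalues
        eta, vecs = eta[pos], vecs[:, pos]
        lam = eta / n                                              # variance of each component
        c = vecs / np.sqrt(eta)                                    # rescale so that u_i^T u_i = 1
        f = np.cumsum(lam) / lam.sum()                             # fraction of total variance
        r = int(np.searchsorted(f, alpha) + 1)                     # smallest r with f(r) >= alpha
        return K @ c[:, :r]                                        # a_i = C_r^T K_i for each point

    # Usage with the homogeneous quadratic kernel K(x_i, x_j) = (x_i^T x_j)^2:
    rng = np.random.default_rng(4)
    D = rng.normal(size=(50, 2))
    A = kernel_pca(D, kernel=lambda a, b: (a @ b) ** 2, alpha=0.9)
    print(A.shape)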
Nonlinear Iris Data: PCA in Input Space
[Figure: Nonlinear Iris data in input space (attributes X1, X2) with the linear PCA directions u_1 and u_2.]
Nonlinear Iris Data: Projection onto PCs
[Figure: Nonlinear Iris data projected onto the input-space principal components u_1 and u_2.]
Kernel PCA: 3 PCs (Contours of Constant Projection)
Homogeneous quadratic kernel: K(x_i, x_j) = (x_i^T x_j)^2

[Figure: Contours of constant projection onto the first three kernel principal components, plotted in input space (attributes X1, X2).]
Kernel PCA: Projected Points onto 2 PCs
Homogeneous quadratic kernel: K(x_i, x_j) = (x_i^T x_j)^2

[Figure: Nonlinear Iris data projected onto the first two kernel principal components u_1 and u_2.]
Singular Value Decomposition
In PCA, the covariance matrix is factorized as

    Σ = U Λ U^T

where the orthogonal matrix U contains its eigenvectors, and the diagonal matrix Λ contains its eigenvalues (sorted in decreasing order).

Singular value decomposition (SVD) generalizes this factorization to any matrix. In particular, for an n × d data matrix D with n points and d columns, SVD factorizes D as

    D = L Δ R^T

where L is an n × n orthogonal matrix of left singular vectors, R is a d × d orthogonal matrix of right singular vectors, and Δ is an n × d (rectangular) diagonal matrix of singular values.
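A minimal NumPy illustration of this factorization on a synthetic matrix; np.linalg.svd returns the singular values as a vector, so the rectangular diagonal Δ is rebuilt explicitly.

    import numpy as np

    rng = np.random.default_rng(5)
    D = rng.normal(size=(6, 4))                   # synthetic n x d data matrix

    L, delta, Rt = np.linalg.svd(D)               # full SVD: L is n x n, Rt = R^T is d x d
    Delta = np.zeros_like(D)
    np.fill_diagonal(Delta, delta)                # n x d rectangular diagonal matrix

    assert np.allclose(D, L @ Delta @ Rt)         # D = L Delta R^T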
Reduced SVD
If the rank of D is r ≤ min(n, d), then there are only r nonzero singular values, ordered as follows: δ_1 ≥ δ_2 ≥ · · · ≥ δ_r > 0.

We discard the left and right singular vectors that correspond to zero singular values, to obtain the reduced SVD as

    D = L_r Δ_r R_r^T

where L_r is the n × r matrix of the left singular vectors, R_r is the d × r matrix of the right singular vectors, and Δ_r is the r × r diagonal matrix containing the positive singular values.

The reduced SVD leads directly to the spectral decomposition of D, given as

    D = ∑_{i=1}^{r} δ_i l_i r_i^T

The best rank-q approximation to the original data D is the matrix D_q = ∑_{i=1}^{q} δ_i l_i r_i^T that minimizes the expression ||D − D_q||_F, where ||A||_F = sqrt( ∑_{i=1}^{n} ∑_{j=1}^{d} A(i, j)^2 ) is called the Frobenius norm of A.
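A short NumPy sketch of the best rank-q approximation via the truncated spectral decomposition; q = 2 and the synthetic matrix are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(6)
    D = rng.normal(size=(8, 5))

    L, delta, Rt = np.linalg.svd(D, full_matrices=False)

    q = 2
    D_q = sum(delta[i] * np.outer(L[:, i], Rt[i]) for i in range(q))   # sum of rank-1 terms

    # Equivalently: D_q = L[:, :q] @ np.diag(delta[:q]) @ Rt[:q]
    err = np.linalg.norm(D - D_q, ord="fro")                           # Frobenius-norm error
    print(q, err)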
Connection Between SVD and PCA
Assume D has been centered, and let D = L Δ R^T via SVD. Consider the scatter matrix for D, given as D^T D. We have

    D^T D = (L Δ R^T)^T (L Δ R^T) = R Δ^T L^T L Δ R^T = R (Δ^T Δ) R^T = R Δ_d^2 R^T

where Δ_d^2 is the d × d diagonal matrix defined as Δ_d^2(i, i) = δ_i^2, for i = 1, ..., d.

The covariance matrix of the centered D is given as Σ = (1/n) D^T D, so we get

    D^T D = nΣ = n U Λ U^T = U (nΛ) U^T

The right singular vectors R are therefore the same as the eigenvectors of Σ, and the singular values of D are related to the eigenvalues of Σ as

    nλ_i = δ_i^2, which implies λ_i = δ_i^2 / n, for i = 1, ..., d

Likewise, the left singular vectors in L are the eigenvectors of the n × n matrix D D^T, and the corresponding eigenvalues are given as δ_i^2.
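A quick NumPy check of this correspondence on a synthetic centered matrix (illustrative only): the right singular vectors match the covariance eigenvectors up to sign, and λ_i = δ_i² / n.

    import numpy as np

    rng = np.random.default_rng(7)
    D = rng.normal(size=(100, 3)) @ np.diag([2.0, 1.0, 0.5])
    D = D - D.mean(axis=0)                                  # center the data
    n = len(D)

    Sigma = (D.T @ D) / n
    evals, evecs = np.linalg.eigh(Sigma)
    evals, evecs = evals[::-1], evecs[:, ::-1]              # decreasing order

    _, delta, Rt = np.linalg.svd(D, full_matrices=False)

    assert np.allclose(evals, delta**2 / n)                 # lambda_i = delta_i^2 / n
    # Eigenvectors and right singular vectors agree up to a sign flip per column.
    assert np.allclose(np.abs(Rt), np.abs(evecs.T), atol=1e-8)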
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info

Mohammed J. Zaki¹ and Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil