
2 Principal Component Analysis (PCA)

2.1 EXPLANATION AND WORKING

Principal Component Analysis (PCA) is a feature extraction method and one of the most popular dimensionality reduction techniques. We want to reduce the number of features of the dataset (the dimensionality of the dataset) while preserving as much of the information in the original dataset as possible. PCA solves this problem by combining the input variables into a smaller set of orthogonal (uncorrelated) variables that capture most of the data's variability [1].
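As a quick illustration (a minimal sketch, assuming NumPy and scikit-learn are available; the synthetic data and the choice of two components are illustrative, not part of the text), the whole procedure can be run in a few lines. Note that scikit-learn stores points as rows (an n × d matrix), whereas the derivation below uses a d × n convention.

```python
# Minimal sketch: dimensionality reduction with PCA on synthetic data.
# Assumes NumPy and scikit-learn are installed; the data and the choice of
# two components are illustrative, not taken from the text.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 points with d = 10 features (rows are points)

pca = PCA(n_components=2)             # keep p = 2 orthogonal directions
Y = pca.fit_transform(X)              # low-dimensional representation, shape (200, 2)

print(Y.shape)
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```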
Let the dataset contain a set of n data points denoted by x₁, x₂, ..., xₙ, where each xᵢ is a d-dimensional vector. PCA finds a p-dimensional linear subspace (where p < d, and often p ≪ d) such that the original data points lie mainly on this p-dimensional linear subspace. In practice, we do not usually find a reduced subspace where all the points lie precisely in that subspace. Instead, we try to find the approximate subspace which retains most of the variability of the data. Thus, PCA tries to find the linear subspace in which the data approximately lies.
The linear subspace can be defined by p orthogonal vectors, say U₁, U₂, ..., Uₚ. This linear subspace forms a new coordinate system, and the orthogonal vectors that define it are called the "principal components" [2]. The principal components are a linear transformation of the original features, so there can be no more than d of them. Also, the principal components are perpendicular to each other. However, the hope is that only p (p < d) principal components are needed to approximate the d-dimensional original space. In the case where p = d, the number of dimensions remains the same and there is no reduction.

Let there be n data points denoted by x₁, x₂, ..., xₙ, where each xᵢ is a d-dimensional vector. The goal is to reduce these points and find a mapping y₁, y₂, ..., yₙ, where each yᵢ is a p-dimensional vector (where p < d, and often p ≪ d). That is, the data points x₁, x₂, ..., xₙ ∈ ℝᵈ are mapped to y₁, y₂, ..., yₙ ∈ ℝᵖ.
Let X be a d × n matrix that contains all the data points in the original space, which has to be mapped to a p × n matrix Y that retains the maximum variability of the data points while reducing the number of features used to represent each data point.
Note: In PCA or any variant of PCA, a standardized input matrix is used. So, X
represents the standardized input data matrix, unless otherwise specified.
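A minimal sketch of this standardization step, assuming NumPy and the d × n convention above (each column of X is one data point); the synthetic data is purely illustrative:

```python
# Minimal sketch of the standardization step, using the text's d x n convention
# (each column of X is one data point). Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 100))  # d = 4 features, n = 100 points

mean = X.mean(axis=1, keepdims=True)   # per-feature mean, shape (4, 1)
std = X.std(axis=1, keepdims=True)     # per-feature standard deviation
X_std = (X - mean) / std               # standardized input matrix

print(X_std.mean(axis=1).round(6))     # roughly zero mean per feature
print(X_std.std(axis=1).round(6))      # unit standard deviation per feature
```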
Now let us discuss how PCA solves the problem of dimensionality reduction. PCA is a method based on linear projection. For any linear projection-based technique, given a d-dimensional vector xᵢ, we obtain a low-dimensional representation yᵢ (a p-dimensional vector) such that

yᵢ = Uᵀ xᵢ (2.1)
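The projection in (2.1) can be sketched as follows, assuming NumPy; the matrix U here is an arbitrary orthonormal basis obtained from a QR factorization, used only to illustrate the mapping, and is not yet the set of principal components found by PCA:

```python
# Sketch of the linear projection in (2.1): with p orthonormal directions stacked
# as the columns of a d x p matrix U, each x_i maps to y_i = U^T x_i, i.e. Y = U^T X.
# U here is an arbitrary orthonormal basis (from a QR factorization), purely for
# illustration; it is not yet chosen by PCA.
import numpy as np

d, p, n = 5, 2, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(d, n))                    # d x n data matrix (columns are points)

U, _ = np.linalg.qr(rng.normal(size=(d, p)))   # d x p matrix with orthonormal columns

Y = U.T @ X                                    # p x n low-dimensional representation
print(Y.shape)                                 # (2, 8)
```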


How is the direction of the principal components chosen? The basic idea is to pick
a direction along which data is maximally spread, that is, the direction along which
there is maximum variance [1, 3, 4].
Let us consider the first principal component (the direction along which the data
has maximum variance) to be U₁. We now project the n points from matrix X on U₁,
given by:

U₁ᵀ X (2.2)

So, by definition, we would now like to find the maximum of the variance of U₁ᵀ X. To find that, we solve the optimization problem below:

max var(U₁ᵀ X)
U₁

We know that

var(U₁ᵀ X) = U₁ᵀ Σ U₁ (2.3)

Here, Σ is a sample covariance matrix of the data matrix X. Thus, the optimization
problem now becomes

max U₁ᵀ Σ U₁ (2.4)
U₁

However, U₁ᵀ Σ U₁ is a quadratic function with no fixed upper bound, so (2.4) turns out to be an ill-defined problem. This is because (2.4) depends on the direction as well as the magnitude of U₁. To convert this into a well-defined problem, we need to add a constraint to (2.4).
There are two possible approaches to resolve this problem: either we add a constraint on the direction of U₁ or on the magnitude of U₁. Adding a constraint on the direction of U₁ and trying to calculate max(U₁ᵀ Σ U₁) will still leave an ill-defined problem, because there is no upper bound even after adding the constraint. But if we add a constraint on the magnitude of U₁, that is, if we restrict the magnitude of U₁, let us say U₁ᵀ U₁ = 1, the length of the principal component is fixed. Hence, there is only one direction in which U₁ᵀ Σ U₁ would be maximum. Thus, this problem has an upper bound, and to solve for the upper bound, we are interested in finding the direction of U₁.
Using the second approach, among all fixed-length vectors we search for the direction of maximum variance of the data. Now we have a well-defined problem as follows:

max U₁ᵀ Σ U₁ (2.5)
U₁
subject to U₁ᵀ U₁ = 1
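As an illustrative aside (synthetic data, not from the text), the following sketch shows numerically why the constraint is needed: scaling U₁ inflates the objective without bound, while unit-norm directions keep it bounded:

```python
# Illustrative sketch (synthetic data, not from the text): without the constraint,
# scaling U1 scales the objective U1^T Sigma U1 without bound; with unit-norm
# directions, the objective stays bounded by the largest eigenvalue of Sigma.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 300))
X = X - X.mean(axis=1, keepdims=True)
Sigma = (X @ X.T) / (X.shape[1] - 1)   # sample covariance matrix

u = rng.normal(size=4)
u = u / np.linalg.norm(u)              # a unit-length direction

for c in (1.0, 10.0, 100.0):           # scaling u inflates the unconstrained objective
    v = c * u
    print(c, v @ Sigma @ v)

vals = []
for _ in range(1000):                  # random unit-norm directions stay bounded
    w = rng.normal(size=4)
    w = w / np.linalg.norm(w)
    vals.append(w @ Sigma @ w)
print(max(vals), np.linalg.eigvalsh(Sigma)[-1])   # max never exceeds the top eigenvalue
```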

Generally, when we want to maximize (or minimize) a function subject to a constraint, we use the concept of Lagrange multipliers.
The method of Lagrange multipliers says that there exists λ₁ ∈ ℝ such that the above problem can be rewritten as:

L(U₁, λ₁) = U₁ᵀ Σ U₁ − λ₁ (U₁ᵀ U₁ − 1) (2.6)

To optimize (2.6), we simply take the derivative with respect to U₁ and equate it to 0. This gives:

Σ U₁ − λ₁ U₁ = 0 (2.7)

Thus,

Σ U₁ = λ₁ U₁ (2.8)

Here, Σ is the covariance matrix of X, λ₁ is the dual variable (the Lagrange multiplier), and the vector that we are looking for is U₁.
The direction U₁ obtained by maximizing the variance is the direction of some unit vector that satisfies (2.8). However, this is exactly the definition of an eigenvector of the matrix Σ. Note that a matrix multiplied by one of its eigenvectors gives a vector that is just a scaled version of that eigenvector. So, U₁ is an eigenvector of Σ, and λ₁ is the corresponding eigenvalue.
Now, the question is: which eigenvector should we choose? Σ is a d × d matrix, so there are at most d eigenvectors and eigenvalues.
Now, our objective is to maximize U₁ᵀ Σ U₁, and from (2.8) we know that Σ U₁ = λ₁ U₁ (while the constraint gives U₁ᵀ U₁ = 1):

U₁ᵀ Σ U₁ = U₁ᵀ λ₁ U₁ (2.9)

= λ₁ U₁ᵀ U₁ (2.10)

= λ₁ (2.11)

And since we want to maximize the above quantity, that is, we want to maximize λ₁, it is evident that we have to pick U₁ to be the eigenvector corresponding to the largest eigenvalue of Σ. For U₂, ..., Uₚ we proceed similarly and pick the eigenvectors with the second largest, up to the pth largest, eigenvalues. Thus, the eigenvector of the sample covariance matrix Σ corresponding to the maximum eigenvalue is the first principal component. Similarly, the second principal component is the eigenvector with the second largest eigenvalue, and so on. In the same way, all the principal components can be found just by the eigendecomposition of the covariance matrix [5].
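A minimal sketch of this eigendecomposition route, assuming NumPy; the centered synthetic data matrix and the choice p = 2 are illustrative:

```python
# Minimal sketch: PCA via eigendecomposition of the sample covariance matrix,
# following the derivation above. X is a synthetic, centered d x n matrix and
# p = 2 is an illustrative choice.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500))                   # d = 3 features, n = 500 points
X = X - X.mean(axis=1, keepdims=True)           # center (standardize if needed)

Sigma = (X @ X.T) / (X.shape[1] - 1)            # d x d sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigh: Sigma is symmetric
order = np.argsort(eigvals)[::-1]               # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

p = 2
U = eigvecs[:, :p]                              # first p principal components, d x p
Y = U.T @ X                                     # p x n reduced representation

print(eigvals)                                  # variance along each principal component
```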
Singular Value Decomposition (SVD) can solve the eigendecomposition problem in a computationally efficient manner. SVD routines are much more numerically stable than general eigenvector routines, so we can use SVD to calculate the eigenvectors needed for PCA very efficiently [5, 6]. SVD is a matrix factorization technique which expresses a matrix (in this case, X) as a linear combination of matrices of rank 1. SVD is a stable method because it does not require a positive definite matrix [6].
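A companion sketch of the SVD route, under the same assumptions (NumPy, an illustrative centered d × n matrix). Up to sign, the resulting components match those from the eigendecomposition sketch above:

```python
# Companion sketch: the same principal components via SVD of the centered data
# matrix. For X = U S V^T (X of shape d x n), the columns of U are the
# eigenvectors of the sample covariance matrix and S**2 / (n - 1) are the
# corresponding eigenvalues. Synthetic data as in the previous sketch.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500))
X = X - X.mean(axis=1, keepdims=True)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

p = 2
components = U[:, :p]                  # first p principal components (up to sign)
Y = components.T @ X                   # p x n reduced representation

print(S[:p] ** 2 / (X.shape[1] - 1))   # matches the leading covariance eigenvalues
```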
