Principal Component Analysis (PCA)
Introduction
Working
Graphical Overview
Math Reminder
PCA Process
Algorithm
Uses and Limitations
Abstract
In the present big data era, there is a need to process large amounts of
unlabeled data and to find patterns in it for further use.
Unimportant features need to be discarded so that only the representations
that are actually needed are kept.
High-dimensional data can be converted to low-dimensional data using
different techniques. This dimensionality reduction is important and makes tasks
such as classification, visualization, communication, and storage much easier.
The loss of information should be kept small while mapping data from the
high-dimensional space to the low-dimensional space.
Data Reduction
summarization of data with many (p) variables by a smaller set of
(k) derived (synthetic, composite) variables.
The n × p data matrix A is reduced to an n × k matrix X (k < p).
Data Reduction
“Residual” variation is the information in A that is not retained in X.
Data reduction is a balancing act between clarity of representation and ease
of understanding on one side, and oversimplification (loss of important or
relevant information) on the other.
Principal Component Analysis
(PCA)
takes a data matrix of n objects by p variables, which may be
correlated, and summarizes it by uncorrelated axes (principal
components or principal axes) that are linear combinations of
the original p variables
the first k components display as much as possible of the
variation among objects.
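As a concrete illustration of this definition, here is a minimal sketch (using NumPy on a small invented data matrix; all names and values are my own) of how correlated variables are summarized by uncorrelated components that are linear combinations of the originals:

```python
import numpy as np

# Small invented data matrix: n = 6 objects (rows) by p = 3 variables (columns).
A = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

# Center each variable on its mean.
A_centered = A - A.mean(axis=0)

# Covariance matrix of the p variables (p x p).
C = np.cov(A_centered, rowvar=False)

# Eigenvectors of C are the principal axes; eigenvalues are the variances along them.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]               # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: each principal component is a linear combination of the original variables.
scores = A_centered @ eigvecs

# The components are uncorrelated: their covariance matrix is (numerically) diagonal.
print(np.round(np.cov(scores, rowvar=False), 6))
```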
Working
Geometric Rationale of PCA
objects are represented as a cloud of n points in a
multidimensional space with an axis for each of the p
variables
the centroid of the points is defined by the mean of each
variable
the variance of each variable is the average squared
deviation of its n values around the mean of that variable.
V_i = \frac{1}{n-1} \sum_{m=1}^{n} (X_{im} - \bar{X}_i)^2
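A short numerical check of this variance formula (a sketch with NumPy; the data values are invented):

```python
import numpy as np

# Invented values of one variable X_i over n = 5 objects.
x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])
n = len(x)

# Variance as defined above: average squared deviation around the mean,
# using the n - 1 (sample) denominator.
v_manual = ((x - x.mean()) ** 2).sum() / (n - 1)

# NumPy's var with ddof=1 uses the same n - 1 denominator.
print(v_manual, np.var(x, ddof=1))   # both print 7.0
```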
Geometric Rationale of PCA
degree to which the variables are linearly correlated
is represented by their covariances.
C_{ij} = \frac{1}{n-1} \sum_{m=1}^{n} (X_{im} - \bar{X}_i)(X_{jm} - \bar{X}_j)
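And a matching check of the covariance formula (again a sketch with invented values):

```python
import numpy as np

# Two invented variables observed on the same n = 5 objects.
x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 6.0])
n = len(x)

# Covariance as defined above (n - 1 denominator).
c_manual = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

# np.cov returns the full 2 x 2 covariance matrix; the off-diagonal entry is C_xy.
print(c_manual, np.cov(x, y)[0, 1])   # both print 5.25
```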
Geometric Rationale of PCA
objective of PCA is to rigidly rotate the axes of this
p-dimensional space to new positions (principal axes)
that have the following properties:
ordered such that principal axis 1 has the highest variance, axis 2 has the
next highest variance, ..., and axis p has the lowest variance
covariance among each pair of principal axes is zero (the principal axes are
uncorrelated).
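The "rigid rotation" claim can be checked directly: the matrix of principal axes is orthonormal, so distances between objects are unchanged. A minimal sketch (NumPy, invented data; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # invented data: 50 objects, p = 3 variables
Xc = X - X.mean(axis=0)                 # center on the centroid

C = np.cov(Xc, rowvar=False)
eigvals, R = np.linalg.eigh(C)          # columns of R are the principal axes

# Rigid rotation: R is orthonormal, so R.T @ R is the identity ...
print(np.allclose(R.T @ R, np.eye(3)))  # True

# ... and pairwise distances between objects are preserved by the rotation.
scores = Xc @ R
d_before = np.linalg.norm(Xc[0] - Xc[1])
d_after = np.linalg.norm(scores[0] - scores[1])
print(np.isclose(d_before, d_after))    # True
```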
2D Example of PCA
variables X1 and X2 have positive covariance & each
has a similar variance.
[Figure: scatter plot of Variable X2 against Variable X1, with the centroid marked at X̄1 = 8.35 and X̄2 = 4.91; a second panel shows the same points after the mean of each variable has been subtracted.]
Principal Components are Computed
PC 1 has the highest possible variance (9.88)
PC 2 has a variance of 3.03
PC 1 and PC 2 have zero covariance.
[Figure: the same data plotted on the PC 1 and PC 2 axes.]
The Dissimilarity Measure Used in PCA is
Euclidean Distance
[Figure: scatter plot of Variable X1 and Variable X2 with the PC 1 and PC 2 axes overlaid.]
PC axes are a rigid rotation of the original variables
PC 1 is simultaneously the direction of maximum variance
and a least-squares “line of best fit” (squared distances
of points away from PC 1 are minimized).
[Figure: the same scatter plot of Variable X1 and Variable X2, showing the PC 1 and PC 2 axes as a rigid rotation of the original axes.]
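The least-squares property stated above can be verified numerically: among all directions through the centroid, the one that maximizes the variance of the projections is also the one that minimizes the summed squared perpendicular distances. A sketch (NumPy, invented 2-D data; names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented 2-D data with positive covariance.
X = rng.multivariate_normal([0.0, 0.0], [[9.0, 4.0], [4.0, 4.0]], size=200)
Xc = X - X.mean(axis=0)

def sq_residuals(direction):
    """Sum of squared perpendicular distances of the points from the line
    through the centroid along `direction` (a unit vector)."""
    proj = Xc @ direction
    return np.sum(Xc ** 2) - np.sum(proj ** 2)

# Scan candidate directions by angle.
angles = np.linspace(0, np.pi, 1000)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.var(Xc @ dirs.T, axis=0, ddof=1)
residuals = np.array([sq_residuals(d) for d in dirs])

# The angle maximizing projected variance equals the angle minimizing residuals,
# so the max-variance direction is also the least-squares line of best fit.
print(angles[np.argmax(variances)], angles[np.argmin(residuals)])
```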
Generalization to p-dimensions
if we take the first k principal components, they define the k-
dimensional “hyperplane of best fit” to the point cloud
of the total variance of all p variables:
PCs 1 to k represent the maximum possible proportion of that
variance that can be displayed in k dimensions
i.e. the squared Euclidean distances among points calculated
from their coordinates on PCs 1 to k are the best possible
representation of their squared Euclidean distances in the
full p dimensions.
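One common way to see how much of the total variance the first k components capture is the ratio of the leading eigenvalues to the sum of all eigenvalues. A sketch (NumPy, invented nearly rank-2 data; names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented data: n = 100 objects, p = 5 correlated variables driven by 2 factors.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)

eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]   # descending

# Proportion of the total variance displayed by the first k principal components.
k = 2
print(eigvals[:k].sum() / eigvals.sum())   # close to 1 for this nearly rank-2 data
```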
Performance Measure PCA
Given A · v = λ · v,
A · v - λ · I · v = 0
(A - λ · I) · v = 0
Finding the roots of |A - λ · I| = 0 gives the eigenvalues, and for each of
these eigenvalues there is a corresponding eigenvector.
Example …
Calculating eigenvectors & eigenvalues
If A = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix}, then

|A - \lambda I| = \begin{vmatrix} -\lambda & 1 \\ -2 & -3 - \lambda \end{vmatrix} = \lambda^2 + 3\lambda + 2 = 0,

so the eigenvalues are \lambda_1 = -1 and \lambda_2 = -2.
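A quick numerical check of this worked example (a sketch using NumPy):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])

# Eigenvalues are the roots of |A - λI| = λ² + 3λ + 2 = 0, i.e. -1 and -2;
# each comes with an eigenvector satisfying A·v = λ·v.
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                                       # -1 and -2 (order may vary)
print(np.allclose(A @ eigvecs, eigvecs * eigvals))   # True
```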
R = U · S · Vᵀ
[Figure: block diagram of the decomposition: the samples × variables data matrix R is factored into U (samples × factors), S (factor weights), and Vᵀ (factors × variables); the leading factors are significant, while the remaining factors represent noise.]
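PCA is often computed through this decomposition rather than through the covariance matrix. A minimal sketch (NumPy, invented data; names are mine) of truncating the factorization to keep only the significant factors:

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented samples-by-variables matrix with 2 strong underlying factors plus noise.
R = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(30, 6))
Rc = R - R.mean(axis=0)                       # center the variables

U, s, Vt = np.linalg.svd(Rc, full_matrices=False)

# Keep only the k significant factors; the rest is treated as noise.
k = 2
R_denoised = U[:, :k] * s[:k] @ Vt[:k, :]

# The singular values relate to the PC variances: var_i = s_i**2 / (n - 1).
print(s ** 2 / (len(Rc) - 1))
print(np.abs(Rc - R_denoised).max())          # small reconstruction error
```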
PCA process – STEP 5
FinalData is the final data set, with data items in columns and dimensions
along rows.
What will this give us?
It will give us the original data solely in terms of the vectors we chose.
We have changed our data from being in terms of the axes x and y, and now it
is in terms of our 2 eigenvectors.
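A sketch of this step (NumPy; the names row_feature_vector and row_data_adjust and the numeric values are my own illustration, not the data tabulated below):

```python
import numpy as np

# Mean-adjusted data: one row per dimension (x, y), one column per data item.
row_data_adjust = np.array([[0.7, -1.3, 0.4, 0.1, 1.3],
                            [0.5, -1.2, 1.0, 0.3, 1.1]])

# Eigenvectors of the covariance matrix, one per row, strongest first.
eigvals, eigvecs = np.linalg.eigh(np.cov(row_data_adjust))
row_feature_vector = eigvecs[:, ::-1].T

# FinalData: the data re-expressed in terms of the chosen eigenvectors,
# with data items in columns and dimensions along rows.
final_data = row_feature_vector @ row_data_adjust
print(final_data)
```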
PCA process – STEP 5
FinalData transpose: dimensions along columns
x y
-.827970186 -.175115307
1.77758033 .142857227
-.992197494 .384374989
-.274210416 .130417207
-1.67580142 -.209498461
-.912949103 .175282444
.0991094375 -.349824698
1.14457216 .0464172582
.438046137 .0177646297
1.22382056 -.162675287
PCA process – STEP 5
Reconstruction of original Data
If we reduced the dimensionality, then when reconstructing the data we would
obviously lose the dimensions we chose to discard. In our example, let us
assume that we kept only the x dimension…
Reconstruction of original Data
x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
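A sketch of this reconstruction (NumPy; it reuses the invented values and names from the earlier sketch, not the tabulated example data). With only the first eigenvector kept, the reconstruction recovers the data up to the variation that lived along the discarded component:

```python
import numpy as np

# Mean-adjusted data from the earlier sketch: rows = dimensions, columns = items.
row_data_adjust = np.array([[0.7, -1.3, 0.4, 0.1, 1.3],
                            [0.5, -1.2, 1.0, 0.3, 1.1]])
original_mean = np.array([[1.8], [1.9]])      # invented per-dimension means

eigvals, eigvecs = np.linalg.eigh(np.cov(row_data_adjust))
row_feature_vector = eigvecs[:, ::-1].T       # strongest eigenvector first

# Keep only the first principal component (the "x" dimension of FinalData).
reduced_feature_vector = row_feature_vector[:1, :]
final_data_reduced = reduced_feature_vector @ row_data_adjust

# Reconstruct: project back through the kept eigenvector and add the mean back.
# The rows of row_feature_vector are orthonormal, so the transpose inverts the projection.
row_data_reconstructed = reduced_feature_vector.T @ final_data_reduced + original_mean

print(row_data_reconstructed)   # approximates the original data, minus the PC 2 variation
```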
Algorithm
How do I do a PCA?
Step 1 – Standardize the data
Step 2 – Calculate the covariance matrix
Step 3 – Compute the eigenvalues and eigenvectors
Step 4 – Re-orient the data along the chosen eigenvectors
Step 5 – Plot the re-oriented data
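A compact end-to-end sketch of these five steps (NumPy and Matplotlib, with invented data; all names are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
X = rng.multivariate_normal([5.0, 10.0], [[4.0, 3.0], [3.0, 6.0]], size=100)

# Step 1 – Standardize: zero mean, unit variance per variable.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2 – Calculate the covariance matrix of the standardized variables.
C = np.cov(Z, rowvar=False)

# Step 3 – Compute eigenvalues and eigenvectors, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4 – Re-orient the data: project onto the principal axes.
scores = Z @ eigvecs

# Step 5 – Plot the re-oriented data.
plt.scatter(scores[:, 0], scores[:, 1])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Data re-oriented along the principal axes")
plt.show()
```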
Use of PCA
It is often helpful to use a dimensionality-reduction technique
such as PCA prior to performing machine learning because: