PCA Notes

Principal Component Analysis (PCA) is a method for reducing the dimensionality of large data sets while preserving significant patterns and trends. It transforms a large set of variables into a smaller one, creating principal components that are uncorrelated linear combinations of the original variables. The PCA process involves standardizing the data, computing the covariance matrix, finding eigenvectors and eigenvalues, creating a feature vector, and recasting the data along the principal components axes.

What Is Principal Component Analysis?

Principal component analysis (PCA) is a dimensionality reduction method used in machine learning to simplify a large data set while still maintaining its significant patterns and trends. It does this by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data points much more easily and quickly without extraneous variables to process.
So, to sum up, the idea of PCA is simple: reduce the number of variables of a data set,
while preserving as much information as possible.
What Are Principal Components?
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated and most of the information within the
initial variables is squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until the variance explained by each successive component tails off, as a scree plot of the eigenvalues would show.

An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
Geometrically speaking, principal components represent the directions of the data that
explain a maximal amount of variance, that is to say, the lines that capture most information
of the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries.

How Do You Do a Principal Component Analysis?


1. Standardize the range of continuous initial variables
2. Compute the covariance matrix to identify correlations
3. Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
4. Create a feature vector to decide which principal components to keep
5. Recast the data along the principal components axes
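
Before working through these steps by hand, here is a minimal end-to-end sketch using scikit-learn; the data set is made up purely for illustration, StandardScaler handles Step 1, and PCA performs Steps 2-5 internally.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Made-up data set: 6 observations of 2 variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_std = StandardScaler().fit_transform(X)   # Step 1: standardize
pca = PCA(n_components=1)                   # keep one principal component
X_reduced = pca.fit_transform(X_std)        # Steps 2-5: decompose and recast

print(pca.explained_variance_ratio_)        # share of variance captured by the component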
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each
one of them contributes equally to the analysis.
Mathematically, this can be done by subtracting the mean and dividing by the standard
deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
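
A minimal sketch of this step, assuming a small made-up data set held in a NumPy array whose rows are observations and whose columns are variables:

import numpy as np

# Made-up data set: 6 observations of 2 variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)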
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set are varying from
the mean with respect to each other, or in other words, to see if there is any relationship
between them. Because sometimes, variables are highly correlated in such a way that they
contain redundant information. So, in order to identify these correlations, we compute
the covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions)
that has as entries the covariances associated with all possible pairs of the initial variables.
For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

| Cov(x,x)  Cov(x,y)  Cov(x,z) |
| Cov(y,x)  Cov(y,y)  Cov(y,z) |
| Cov(z,x)  Cov(z,y)  Cov(z,z) |

If we consider two-dimensional data X1 and X2, then the covariance can be calculated as

Cov(X1, X2) = (1 / (N − 1)) Σ (X1i − mean(X1)) (X2i − mean(X2)), summed over i = 1, …, N

where N is the number of observations (data points).

Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main
diagonal (Top left to bottom right) we actually have the variances of each initial variable. And
since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance
matrix are symmetric with respect to the main diagonal, which means that the upper and the
lower triangular portions are equal.
The covariance entries of the matrix tell us about the correlations between the variables.
It’s the sign of the covariance that matters:
• If positive then: the two variables increase or decrease together (correlated)
• If negative then: one increases when the other decreases (Inversely correlated)
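
A minimal sketch of this step, reusing the made-up data and the X_std array from the Step 1 sketch; np.cov with rowvar=False treats each column as a variable:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardize

# p x p covariance matrix of the standardized variables (here p = 2)
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix)   # diagonal entries are variances, off-diagonal entries are covariances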
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify
the principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from
the covariance matrix in order to determine the principal components of the data.
The eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and these are what we call Principal Components. The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried in each Principal Component.
|S − λI| = 0

where S is the covariance matrix, I is the identity matrix, and λ is an eigenvalue.

Solving this characteristic equation gives the eigenvalues λ1 and λ2 (for two-dimensional data). For each eigenvalue, solving (S − λI)u = 0 gives the corresponding eigenvector; the eigenvector belonging to the larger eigenvalue is the first principal component.

Once the eigenvectors u1 and u2 are obtained, the feature vector can be constructed.
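
A minimal sketch of this step, assuming a hypothetical 2×2 covariance matrix; np.linalg.eigh is used because the covariance matrix is symmetric:

import numpy as np

# Hypothetical 2x2 covariance matrix, for illustration only
cov_matrix = np.array([[1.2, 1.1],
                       [1.1, 1.2]])

# Eigenvalues and eigenvectors of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]   # column i is the i-th principal direction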


Step 4: Create a Feature Vector
Computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, we choose whether to keep all these components or to discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only k eigenvectors (components) out of the original p, the final data set will have only k dimensions.
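
A minimal sketch of this step, continuing with the same hypothetical covariance matrix: after sorting the eigenvectors by eigenvalue, the feature vector is simply the first k columns (k is chosen here arbitrarily for illustration):

import numpy as np

cov_matrix = np.array([[1.2, 1.1],
                       [1.1, 1.2]])
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]                 # descending eigenvalue order
eigenvectors = eigenvectors[:, order]

k = 1                                                 # number of components to keep
feature_vector = eigenvectors[:, :k]                  # p x k matrix of kept eigenvectors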

Step 5: Recast the Data Along the Principal Components Axes


In this step, we use the feature vector formed using the eigenvectors of the covariance matrix,
to reorient the data from the original axes to the ones represented by the principal
components (hence the name Principal Components Analysis). This can be done by
multiplying the transpose of the original data set by the transpose of the feature vector.
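
Putting the five steps together, here is a small self-contained sketch in NumPy; the data set is made up for illustration, and the comment on the last projection line shows the equivalent row-oriented form:

import numpy as np

# Full worked sketch: standardize, covariance, eigendecomposition, project
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)              # Step 1
cov_matrix = np.cov(X_std, rowvar=False)                  # Step 2
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)    # Step 3
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
feature_vector = eigenvectors[:, :1]                      # Step 4: keep 1 component
final_data = (feature_vector.T @ X_std.T).T               # Step 5: recast the data
# equivalently: final_data = X_std @ feature_vector
print(final_data)                                          # 6 x 1 reduced data set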
