PCA Explained Step-by-Step
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms, since there are no extraneous variables to process.
So to sum up, the idea of PCA is simple — reduce the number of variables of a data
set, while preserving as much information as possible.
Mathematically, standardization can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.
Once the standardization is done, all the variables will be transformed to the same
scale.
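As a minimal sketch of this step (the array name X and the sample values are illustrative assumptions, not taken from the original article), standardization in Python with NumPy looks like this:

    import numpy as np

    # Hypothetical 2-variable data set: one row per observation, one column per variable.
    X = np.array([[2.5, 2.4],
                  [0.5, 0.7],
                  [2.2, 2.9],
                  [1.9, 2.2],
                  [3.1, 3.0]])

    # Standardize: subtract each variable's mean and divide by its standard deviation.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Every column now has mean 0 and standard deviation 1, i.e. the same scale.
    print(X_std.mean(axis=0), X_std.std(axis=0))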
The covariance matrix is a symmetric matrix whose entries are the covariances of every possible pair of the initial variables. What do these covariances tell us about the correlations between the variables? It’s actually the sign of the covariance that matters: a positive covariance means the two variables increase or decrease together (they are correlated), while a negative covariance means that when one increases the other decreases (they are inversely correlated).
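Continuing the sketch above (X_std is the assumed standardized array from the previous snippet), the covariance matrix and the signs of its entries can be inspected directly:

    # Covariance matrix of the standardized variables (rowvar=False: columns are variables).
    cov_matrix = np.cov(X_std, rowvar=False)
    print(cov_matrix)

    # A positive off-diagonal entry means the two variables tend to increase or decrease
    # together; a negative entry means one tends to increase when the other decreases.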
Principal components are new variables that are constructed as linear combinations
or mixtures of the initial variables. These combinations are done in such a way
that the new variables (i.e., principal components) are uncorrelated and most of
the information within the initial variables is squeezed or compressed into the
first components. So the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until the explained variance tails off in the way a scree plot shows.
Organizing information in principal components this way allows you to reduce dimensionality without losing much information: you discard the components with low information and treat the remaining components as your new variables.
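To see this concentration of information in practice, scikit-learn's PCA exposes the explained_variance_ratio_ attribute, which holds exactly the values a scree plot displays; a minimal sketch, reusing the assumed X_std array from above:

    from sklearn.decomposition import PCA

    # Fit PCA on the standardized data.
    pca = PCA()
    pca.fit(X_std)

    # Fraction of the total variance carried by each principal component,
    # in decreasing order; these are the scree-plot values.
    print(pca.explained_variance_ratio_)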
An important thing to realize here is that the principal components are less interpretable and don’t have any real-world meaning, since they are constructed as linear combinations of the initial variables.
Geometrically speaking, the first principal component is the axis along which the projected data has the largest possible variance. The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance. This continues until a total of p principal components have been calculated, equal to the original number of variables.
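In symbols (a standard formulation, not quoted from the original article): for the standardized data X, the first principal component direction w1 is the unit-length vector that maximizes Var(X·w), the second direction w2 is the unit-length vector orthogonal to w1 that maximizes Var(X·w), and so on, up to wp.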
Without further ado, it is eigenvectors and eigenvalues that are behind all the magic explained above: the eigenvectors of the covariance matrix are the directions of the axes with the most variance (the most information), and these are what we call the principal components. The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.
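A minimal sketch of this step (reusing the assumed cov_matrix from the earlier snippet): NumPy's eigendecomposition gives the principal component directions and the variance each one carries.

    # Eigendecomposition of the covariance matrix; eigh is suited to symmetric matrices.
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # Sort from largest to smallest eigenvalue so that PC1 comes first.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Share of the total variance (information) carried by each principal component.
    print(eigenvalues / eigenvalues.sum())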
Example:
Let’s suppose that our data set is 2-dimensional, with two variables x and y, and that the eigenvectors and eigenvalues of the covariance matrix are as follows:
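For illustration only, assume eigenpairs of roughly this form (hypothetical values, chosen so that the 96%/4% split quoted below works out):

    v1 = [0.68, 0.73]   with   λ1 = 1.28
    v2 = [-0.73, 0.68]  with   λ2 = 0.05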
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the
eigenvector that corresponds to the first principal component (PC1) is v1 and the
one that corresponds to the second component (PC2) is v2.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only k eigenvectors (components) out of the original p, the final data set will have only k dimensions.
Example:
Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.
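In the running Python sketch from earlier (assumed names, not from the original article), these two options correspond to keeping one or both columns of the sorted eigenvector matrix:

    # Keep both eigenvectors: no dimensionality reduction (one column per component).
    W_full = eigenvectors[:, :2]

    # Keep only the first eigenvector v1: the data will be reduced to 1 dimension.
    W_reduced = eigenvectors[:, :1]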
Discarding the eigenvector v2 reduces the dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 carries only 4% of the information, the loss is not important: we still keep the 96% of the information that is carried by v1.
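With the hypothetical eigenvalues used in the illustration above, that split works out as λ1 / (λ1 + λ2) = 1.28 / 1.33 ≈ 0.96 and λ2 / (λ1 + λ2) = 0.05 / 1.33 ≈ 0.04.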
So, as we saw in the example, it’s up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. If you just want to describe your data in terms of new, uncorrelated variables (the principal components) without seeking to reduce dimensionality, leaving out the less significant components is not necessary.
LAST STEP: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set.
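As a final sketch (reusing the assumed X_std and W_reduced from the earlier snippets), this reorientation is a single matrix product; written row-wise, FeatureVectorᵀ × StandardizedDataᵀ is the same operation as StandardizedData × FeatureVector:

    # Project the standardized data onto the kept principal component axes.
    # Equivalent formulations: (W_reduced.T @ X_std.T).T  or simply  X_std @ W_reduced.
    X_pca = X_std @ W_reduced

    # X_pca has one column per kept component (here just PC1): the data recast
    # along the principal component axes.
    print(X_pca)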