Dimensionality Reduction
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so that each of them
contributes equally to the analysis.
If there are large differences between the ranges of the initial variables, the variables with larger ranges will
dominate those with smaller ranges (for example, a variable that ranges between 0 and 100 will
dominate a variable that ranges between 0 and 1), which leads to biased results. Transforming
the data to comparable scales prevents this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each
value of each variable.
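As a quick illustration, here is a minimal sketch of this standardization in Python using NumPy; the array X and its values are made up for the example.

```python
# Minimal sketch of Step 1 (standardization / z-scoring) with NumPy.
# The array X and its contents are illustrative, not taken from the article.
import numpy as np

X = np.array([[90.0, 0.2],   # two variables with very different ranges
              [60.0, 0.9],
              [75.0, 0.4]])

# Subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for every variable
print(X_std.std(axis=0))   # 1 for every variable
```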
Step 2: Covariance Matrix Computation
The aim of this step is to understand how the variables of the input data set vary from the mean with respect
to each other, or in other words, to see whether there is any relationship between them. Sometimes
variables are so highly correlated that they contain redundant information. In order to
identify these correlations, we compute the covariance matrix.
For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3
matrix of this form:
Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), the main diagonal (top left
to bottom right) actually contains the variances of each initial variable. And since covariance is
commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the
main diagonal, which means that the upper and lower triangular portions are equal.
What do the covariances that we have as entries of the matrix tell us about the correlations between
the variables? It is the sign of the covariance that matters:
If positive then: the two variables increase or decrease together (correlated)
If negative then: one increases when the other decreases (inversely correlated)
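Here is a minimal sketch of computing the covariance matrix with NumPy; the synthetic data stands in for an already-standardized data set with 3 variables.

```python
# Minimal sketch of Step 2: computing the covariance matrix.
# Synthetic data is used as a stand-in for standardized variables x, y, z.
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))      # shape (n_samples, n_variables)

cov_matrix = np.cov(X_std, rowvar=False)   # 3x3 covariance matrix

print(cov_matrix)                               # variances on the main diagonal
print(np.allclose(cov_matrix, cov_matrix.T))    # True: the matrix is symmetric
```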
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal
components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.
What you first need to know about eigenvectors and eigenvalues is that they always come in pairs, so that
every eigenvector has an eigenvalue. Also, their number is equal to the number of dimensions of the data.
For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with
3 corresponding eigenvalues.
The eigenvectors of the covariance matrix are actually the directions of the axes along which there is the most
variance (the most information), and these are what we call the principal components.
The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance
carried by each principal component.
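The following sketch (again with synthetic, standardized data) shows how the eigenvalues and eigenvectors of the covariance matrix can be obtained with NumPy and sorted so that the first column corresponds to PC1.

```python
# Minimal sketch of Step 3: eigenvectors/eigenvalues of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 3))      # standardized 3-variable data (synthetic)
cov_matrix = np.cov(X_std, rowvar=False)

# eigh is appropriate because the covariance matrix is symmetric
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort by descending eigenvalue so column 0 is PC1, column 1 is PC2, ...
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)   # amount of variance carried by each principal component
print(eigenvectors)  # each column is the direction (axis) of one component
```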
Principal Component Analysis Example:
Let’s suppose that our data set is 2-dimensional with 2 variables x and y, and that we have computed the
two eigenvalue/eigenvector pairs (λ1, v1) and (λ2, v2) of its covariance matrix.
If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector that
corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second
principal component (PC2) is v2.
After having the principal components, to compute the percentage of variance (information) accounted
for by each component, we divide the eigenvalue of each component by the sum of the eigenvalues. If we
apply this to the example above, we find that PC1 and PC2 carry 96 percent and 4 percent of
the variance of the data, respectively.
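As a quick check of those percentages, here is the same calculation with illustrative eigenvalues chosen only because they reproduce the 96/4 split quoted above (the example's actual numbers are not shown in this excerpt).

```python
# Worked check of the explained-variance percentages.
# lambda1 and lambda2 are assumed, illustrative values for PC1 and PC2.
lambda1, lambda2 = 1.284, 0.049
total = lambda1 + lambda2

print(lambda1 / total)  # ~0.96 -> about 96 percent of the variance (PC1)
print(lambda2 / total)  # ~0.04 -> about 4 percent of the variance (PC2)
```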
Step 4: Feature Vector
In this step, we choose whether to keep all of these components or to discard those of lesser
significance (the ones with low eigenvalues), and form with the remaining ones a matrix of vectors that we
call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we
decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to
keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
Continuing with the example from the previous step, we can either form a feature vector with both
eigenvectors v1 and v2 as its columns, or discard the eigenvector v2, which is the one of lesser significance,
and form a feature vector with v1 only.
Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of
information in the final data set. But given that v2 was carrying only 4 percent of the information, the loss
is therefore not important, and we will still have the 96 percent of the information that is carried by v1.
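Putting the last two steps together, here is a minimal sketch of forming the feature vector by keeping only the top-p eigenvectors; the data and names are illustrative.

```python
# Minimal sketch of Step 4: keep the top-p eigenvectors (sorted by eigenvalue)
# as the columns of the feature vector. Data and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X_std = rng.standard_normal((100, 2))        # standardized 2-variable data (synthetic)
cov_matrix = np.cov(X_std, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]        # descending eigenvalues
eigenvectors = eigenvectors[:, order]

p = 1                                        # keep only PC1, discard PC2
feature_vector = eigenvectors[:, :p]         # shape (2, 1): one column per kept component
print(feature_vector)
```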
Reference link:
Dimensionality Reduction: https://medium.com/nerd-for-tech/dimensionality-reduction-techniques-pca-lca-and-svd-f2a56b097f7c