PCA Explained
PCA can be used when the dimensionality of the input is high (i.e. there are a lot of
input variables).
First, the original input variables stored in X are z-scored such that each original
variable (column of X) has zero mean and unit standard deviation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
from sklearn.preprocessing import StandardScaler
plt.style.use('ggplot')
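Before the plots, here is a minimal sketch of the setup that the later snippets rely on; the variable names X, y, X_new and pca match the code further down, but the exact lines below are assumptions rather than the original code:
iris = datasets.load_iris()
X = iris.data            # 150 samples x 4 features
y = iris.target          # flower class labels (0, 1, 2)

# z-score each column: zero mean, unit standard deviation
X = StandardScaler().fit_transform(X)

# keep the first two principal components
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)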
Let’s plot the data before and after the PCA transform, and color code each point
(sample) using the corresponding class of the flower (y).
fig, axes = plt.subplots(1, 2)
axes[0].scatter(X[:, 0], X[:, 1], c=y)          # first two original (z-scored) features
axes[1].scatter(X_new[:, 0], X_new[:, 1], c=y)  # first two principal components
plt.show()
We can see that in the PCA space the variance is maximized along PC1 (which explains
about 73% of the variance), with PC2 capturing most of the remaining variance (about
23%). Together they explain roughly 96%.
print(pca.explained_variance_ratio_)
# array([0.72962445, 0.22850762])
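To check the combined figure directly, we can sum the two ratios (this extra line is an addition on top of the original snippet):
print(pca.explained_variance_ratio_.sum())
# about 0.958, i.e. roughly 96% of the total variance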
6. Proof that the eigenvalues of the original covariance matrix equal the
variances of the reduced space
Assuming that the original input variables stored in X are z-scored such that each original
variable (column of X) has zero mean and unit standard deviation, we have:
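A sketch of the derivation, assuming the following notation (not taken from the original): C is the sample covariance matrix of the z-scored data, C = V \Lambda V^\top its eigendecomposition, and W the matrix whose columns are the top-k eigenvectors:

C = \frac{1}{n-1} X^\top X = V \Lambda V^\top , \qquad X_{\text{new}} = X W

\operatorname{Cov}(X_{\text{new}}) = \frac{1}{n-1} W^\top X^\top X W = W^\top C W = W^\top V \Lambda V^\top W = \Lambda_k

Since the columns of W are the first k (orthonormal) eigenvectors of C, the product V^\top W selects exactly those columns, so the result \Lambda_k is the diagonal matrix of the k largest eigenvalues. In other words, the variances of the reduced space are exactly the top-k eigenvalues of the original covariance matrix.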
The same result can also be seen empirically by estimating the covariance matrix
of the reduced space:
np.cov(X_new.T)
array([[2.93808505e+00, 4.83198016e-16],
       [4.83198016e-16, 9.20164904e-01]])
We observe that these values (the diagonal entries are the variances; the off-diagonal
entries are numerically zero because the principal components are uncorrelated) are equal
to the eigenvalues of the original covariance matrix, stored in pca.explained_variance_ :
pca.explained_variance_
array([2.93808505, 0.9201649 ])
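As a quick numerical sanity check, we can compare the two directly (this comparison is an addition on top of the original snippets):
print(np.allclose(np.diag(np.cov(X_new.T)), pca.explained_variance_))
# True, as expected from the matching values above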
7. Feature importance
The importance of each feature is reflected by the magnitude of the corresponding
values in the eigenvectors (higher magnitude — higher importance).
print(abs(pca.components_))
From the printed absolute values, we can conclude that features 1, 3 and 4 are the most
important for PC1. Similarly, features 2 and then 1 are the most important for PC2.
To sum up, we look at the absolute values of the components of the eigenvectors
corresponding to the k largest eigenvalues (in sklearn the components are already sorted
by explained variance). The larger these absolute values are, the more a specific feature
contributes to that principal component.
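To make the link between loadings and feature names explicit, one possible view (the DataFrame layout below is a sketch of my own and assumes iris was loaded as in the setup above) is:
loadings = pd.DataFrame(
    abs(pca.components_),
    columns=iris.feature_names,  # original feature names of the iris dataset
    index=['PC1', 'PC2'],        # components sorted by explained variance
)
print(loadings)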
8. The biplot
The biplot is the best way to visualize everything in one figure following a PCA analysis:
the samples projected onto the first two principal components together with the loading
vectors of the original features.
There is an implementation in R, but there is no standard implementation in Python, so
I decided to write my own function for that:
def biplot(score, coeff, labels=None):
    # samples in the PC1-PC2 plane colored by class; one arrow per original feature
    xs, ys = score[:, 0], score[:, 1]
    plt.scatter(xs, ys, c=y)
    for i in range(coeff.shape[0]):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = labels[i] if labels is not None else "Var" + str(i + 1)
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')
    plt.xlabel("PC{}".format(1), size=14)
    plt.ylabel("PC{}".format(2), size=14)
    limx = int(xs.max()) + 1
    limy = int(ys.max()) + 1
    plt.xlim([-limx, limx])
    plt.ylim([-limy, limy])
    plt.grid()
    plt.tick_params(axis='both', which='both', labelsize=14)
Call the function (make sure to first run the initial blocks of code where we load the iris
data and perform the PCA analysis):
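A call consistent with the function sketched above (the exact arguments, and the use of iris.feature_names as labels, are assumptions):
biplot(X_new[:, 0:2], np.transpose(pca.components_[0:2, :]), labels=iris.feature_names)
plt.show()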
We can again verify visually that (a) the variance is maximized along PC1 and (b) features 1,
3 and 4 are the most important for PC1, while features 2 and then 1 are the most important
for PC2.