Reduce Data Dimensionality Using PCA – Python
Introduction
The advancements in Data Science and Machine Learning have made it possible for
us to solve several complex regression and classification problems. However, the
performance of all these ML models depends on the data fed to them. Thus, it is
imperative that we provide our ML models with an optimal dataset. Now, one might
think that the more data we provide to our model, the better it becomes; however,
this is not the case. If we feed our model a dataset with an excessively large
number of features/columns, it gives rise to the problem of overfitting, wherein the
model starts getting influenced by outlier values and noise. This is called the Curse
of Dimensionality.
The following graph shows how model performance changes as the number of
dimensions of the dataset increases. It can be observed that model performance
peaks at an optimal number of dimensions, beyond which it starts decreasing.
Model performance vs. number of dimensions of the dataset
Step-2: Standardize the Dataset
Before applying PCA, we standardize the features of the Iris dataset so that each
feature has zero mean and unit variance. This is done with the StandardScaler class
from sklearn.preprocessing.
Python3
scalar = StandardScaler()
# 'data' is assumed to be the Iris feature DataFrame loaded in Step-1
scaled_data = pd.DataFrame(scalar.fit_transform(data), columns=data.columns)
scaled_data
Output:
Step-3: Check the Correlation between Features without PCA (Optional)
Now, we will check the correlation between the features of our scaled dataset using a
heatmap. For this, we have already imported the seaborn library in Step-1. The
correlation between the features is computed with the corr() function, and the heatmap
is then plotted with the heatmap() function. The colour scale beside the heatmap helps
determine the magnitude of the correlation. In our example, we can clearly see that a
darker shade represents less correlation while a lighter shade represents more
correlation. The diagonal of the heatmap represents the correlation of a feature with
itself, which is always 1.0; thus, the diagonal of the heatmap has the lightest shade.
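A minimal sketch of this step is given below; it assumes that seaborn and matplotlib.pyplot were imported in Step-1 as sns and plt, and it uses the scaled_data DataFrame from Step-2.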
Python3
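# Correlation heatmap of the scaled features (sketch)
# Assumes seaborn (sns) and matplotlib.pyplot (plt) were imported in Step-1
sns.heatmap(scaled_data.corr(), annot=True)
plt.show()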
Output:
Correlation Heatmap of the Iris dataset without PCA
We can observe from the above heatmap that sepal length & petal length, and petal
length & petal width, are highly correlated. Thus, we evidently need to apply
dimensionality reduction. If you are already aware that your dataset needs
dimensionality reduction, you can skip this step.
Step-4: Applying PCA
We will now apply PCA to the scaled dataset. For this, Python offers another built-in
class called PCA, which is present in sklearn.decomposition and which we have already
imported in Step-1. We need to create a PCA object and, while doing so, initialize
n_components, the number of principal components we want in our final dataset. Here,
we have taken n_components = 3, which means our final feature set will have 3 columns.
We fit our scaled data to the PCA object and then transform it, which gives us the
reduced dataset.
Python
# Applying PCA
# Taking the number of principal components as 3
pca = PCA(n_components=3)
pca.fit(scaled_data)
data_pca = pca.transform(scaled_data)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])
data_pca.head()
Output:
PCA Dataset
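Step-5: Check the Correlation between Principal Components after PCA
Finally, we plot the correlation heatmap again, this time on the reduced dataset, to verify that the obtained principal components are uncorrelated. A minimal sketch is given below; as in Step-3, it assumes that seaborn and matplotlib.pyplot were imported in Step-1 as sns and plt.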
Python3
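# Correlation heatmap of the principal components in data_pca (sketch)
# Assumes seaborn (sns) and matplotlib.pyplot (plt) were imported in Step-1
sns.heatmap(data_pca.corr(), annot=True)
plt.show()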
Output:
The above heatmap clearly shows that there is no correlation between the obtained
principal components (PC1, PC2, and PC3). Thus, we have moved from a
higher-dimensional feature space to a lower-dimensional one while ensuring that the
resulting PCs are uncorrelated with each other. Hence, we have accomplished the
objectives of PCA.