Principal Component Analysis using
Python
Tushar B. Kute,
http://tusharkute.com
Dimensionality Reduction
• Dimensionality reduction or dimension reduction is the
process of reducing the number of random variables
under consideration by obtaining a set of principal
variables.
• It can be divided into feature selection and feature
extraction.
– Feature selection approaches try to find a subset of the original variables (also called features or attributes).
– Feature projection or feature extraction transforms the data from the high-dimensional space to a space of fewer dimensions. Both approaches are sketched below.
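As a minimal sketch of the difference, assuming scikit-learn and its bundled Iris data (a stand-in, not a dataset named on this slide): feature selection keeps some of the original columns, while feature extraction derives new ones.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)

    # Feature selection: keep the 2 original features that score best
    X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

    # Feature extraction: project the data onto 2 new derived dimensions
    X_extracted = PCA(n_components=2).fit_transform(X)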
Large Dimensions
• A large number of features in a dataset is one of the factors that affects both the training time and the accuracy of machine learning models. There are several options for dealing with a huge number of features in a dataset.
– Try to train the models on the original set of features, which may take days or weeks if the number of features is very high.
– Reduce the number of variables by merging correlated variables.
– Extract the most important features from the dataset, i.e. those responsible for maximum variance in the output. Different statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and principal component analysis.
Principal Component Analysis
• Principal component analysis, or PCA, is a statistical technique for converting high-dimensional data to low-dimensional data by selecting the most important features, those that capture maximum information about the dataset.
• The features are selected on the basis of the variance they cause in the output.
• The feature that causes the highest variance is the first principal component. The feature responsible for the second-highest variance is considered the second principal component, and so on.
• It is important to mention that principal components do not have any correlation with each other, as the quick check sketched below illustrates.
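One way to see this, sketched here with numpy, scikit-learn, and the bundled Iris data as stand-ins: the correlation matrix of the PCA-transformed data is, up to rounding, the identity.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    components = PCA().fit_transform(X)

    # Off-diagonal correlations between components are ~0
    print(np.round(np.corrcoef(components, rowvar=False), 6))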
Advantages of PCA
• The training time of the algorithms reduces significantly with a smaller number of features.
• It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of scatter plots required to visualize all pairwise relationships would be 100(100−1)/2 = 4950. It is practically impossible to analyze data this way.
Normalization of features
• It is imperative that a feature set is normalized before applying PCA. For instance, if a feature set has data expressed in units of kilograms, light years, or millions, the variance scale in the training set is huge. If PCA is applied to such a feature set, the resultant loadings for features with high variance will also be large, and the principal components will be biased towards features with high variance, leading to false results.
• Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features must be converted into numerical features before PCA can be applied. Both preparation steps are sketched below.
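A minimal sketch of both steps, assuming pandas and scikit-learn and a small hypothetical DataFrame with a categorical 'colour' column:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical DataFrame with one categorical column, 'colour'
    df = pd.DataFrame({'weight_kg': [61.0, 72.5, 80.1],
                       'distance_ly': [4.2, 8.6, 11.9],
                       'colour': ['red', 'green', 'red']})

    # Step 1: convert categorical features to numeric one-hot columns
    df = pd.get_dummies(df, columns=['colour'])

    # Step 2: standardize every feature to zero mean and unit variance
    X = StandardScaler().fit_transform(df)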
Example:
Reading the dataset
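The slides do not reproduce the code, so the sketch below is one plausible version of this step: reading the Iris dataset from the UCI repository (listed under Useful resources) with pandas and separating features from labels.

    import pandas as pd

    # Assumed example dataset: Iris, from the UCI repository
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(url, names=names)

    X = dataset.drop('class', axis=1)   # features
    y = dataset['class']                # labels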
Normalize
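Continuing the sketch: hold out a test set, then standardize the features with scikit-learn's StandardScaler, fitted on the training data only.

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)   # fit on training data only
    X_test = sc.transform(X_test)         # reuse the same scaling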
Apply PCA
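Applying PCA with scikit-learn; leaving n_components unset keeps every component, so their variance can be inspected before deciding how many to retain.

    from sklearn.decomposition import PCA

    pca = PCA()                              # keep all components for now
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)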
Calculate variance
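The fitted PCA object reports the fraction of the total variance captured by each component:

    explained_variance = pca.explained_variance_ratio_
    print(explained_variance)   # one fraction per component; they sum to 1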
Variance plot
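A sketch of the variance plot with matplotlib, showing cumulative explained variance against the number of components:

    import numpy as np
    import matplotlib.pyplot as plt

    plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
    plt.xlabel('Number of principal components')
    plt.ylabel('Cumulative explained variance')
    plt.show()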
Variance Ratio
• The PCA class exposes explained_variance_ratio_, which returns the fraction of the total variance explained by each of the principal components.
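Besides inspecting the ratios directly, scikit-learn's PCA also accepts a float n_components and keeps just enough components to reach that variance fraction. For example, continuing from the scaled training data above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=0.95)   # enough components for 95% of the variance
    X_reduced = pca.fit_transform(X_train)
    print(pca.n_components_)             # how many components were kept
    print(pca.explained_variance_ratio_)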
Principal Components = 1
Principal Components = 2
Principal Components = 3
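The three slides above compare results for one, two, and three retained components. A plausible sketch of that comparison, assuming a random forest classifier (the slides do not name the model) on the scaled train/test split from the earlier steps:

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    for n in (1, 2, 3):
        pca = PCA(n_components=n)
        Xtr = pca.fit_transform(X_train)   # scaled training features
        Xte = pca.transform(X_test)

        clf = RandomForestClassifier(random_state=0)
        clf.fit(Xtr, y_train)
        acc = accuracy_score(y_test, clf.predict(Xte))
        print(n, 'component(s): accuracy =', acc)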
Useful resources
• https://stackabuse.com
• http://archive.ics.uci.edu/ml/index.php
• https://scikit-learn.org
• https://en.wikipedia.org
• www.towardsdatascience.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.github.com
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.
/mITuSkillologies @mitu_group /company/mitu-skillologies /c/MITUSkillologies
Web Resources
http://mitu.co.in
http://tusharkute.com
[email protected]
[email protected]