What Is Dimension Reduction in Data Science - by Farhad Malik - FinTechExplained - Medium
We now have access to large amounts of data. This can tempt us to take
every piece of data that is available and feed it into a forecasting model
to predict our target variable. This article explains the common issues
associated with introducing a large set of features and describes
techniques which we can use to resolve those problems.
It is crucial for every data scientist and machine learning expert to understand
what dimension reduction techniques are and when to use them.
https://radiant-brushlands-42789.herokuapp.com/medium.com/fintechexplained/what-is-dimension-reduction-in-data-science-2aa5547f4d29 1/16
5/4/2021 What Is Dimension Reduction In Data Science? | by Farhad Malik | FinTechExplained | Medium
Furthermore, it takes a much larger space to store a data set with a large
number of features. Moreover, it can get very difficult to analyse and
visualize a data set with a large number of dimensions.
This article outlines the techniques which we can follow to compress our
data set onto a new feature subspace of lower dimensionality. I will also be
providing details of important dimension reduction techniques.
Imagine you want to e-mail a large set of files to your friend. Uploading and
sending the files might take a long time. You can speed up the upload by
zipping the files and e-mailing the zipped file instead. Zipping compresses
a large quantity of data into a smaller equivalent set.
or 3 dimension subspace.
When we have a large set of features and a class label for each sample, our
data is normally distributed and the features are not correlated with each
other, we can use Linear Discriminant Analysis (LDA) to reduce the number
of dimensions. LDA is a generalised version of Fisher’s linear discriminant.
If you want to understand how to enrich features and calculate z-score then
have a look at this article:
This code produces three LDA components for the entire data set.
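As a minimal sketch of how three LDA components could be produced with scikit-learn, consider the snippet below. The synthetic four-class data set from make_classification and the variable names are assumptions made for illustration only. LDA can return at most (number of classes - 1) components, so at least four classes are needed to obtain three.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Assumed stand-in data: 400 samples, 10 features, 4 classes.
X, y = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=5,
    n_redundant=0,
    n_classes=4,
    random_state=0,
)

# Standardise the features (z-scores) before fitting.
X_std = StandardScaler().fit_transform(X)

# LDA is supervised, so it needs the class labels y.
# n_components cannot exceed min(n_classes - 1, n_features), here 3.
lda = LinearDiscriminantAnalysis(n_components=3)
X_lda = lda.fit_transform(X_std, y)

print(X_lda.shape)  # (400, 3)
```

Each row of X_lda is the original sample expressed in the three discriminant directions that best separate the classes.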
Principal Component Analysis (PCA) is a very useful technique that can help
de-noise and detect patterns in data. PCA is used to reduce dimensions in
images, text and speech recognition systems.
The scikit-learn library offers a PCA transformer. This code snippet
illustrates how to create PCA components:
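A minimal sketch, assuming the iris data set bundled with scikit-learn as example input and two components as an arbitrary choice, might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed example data: 150 samples with 4 features.
X, _ = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardise first.
X_std = StandardScaler().fit_transform(X)

# Keep the two directions with the highest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)  # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```

Unlike LDA, PCA is unsupervised: the class labels are not passed to fit_transform.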
Understanding PCA
This section of the article provides an overview of the process:
The PCA technique analyses the entire data set and then finds the
directions of maximum variance.
We need to take the eigenvectors that represent our data set best. These
are the vectors which have the highest eigenvalues.
Remember that the eigenvectors with the largest eigenvalues capture the
highest variance and stay closest to the original data set. Also, the larger
the number of eigenvectors, the slower the computation.
I normally take the top 2–3 eigenvectors to represent the data set.
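The selection of top eigenvectors can be sketched by hand with NumPy. The random data, the choice of k = 2 and the variable names below are assumptions for illustration; in practice scikit-learn's PCA performs an equivalent computation for us.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features, two of them strongly correlated.
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 1. Centre each feature on zero.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the features (5 x 5).
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen decomposition; eigh suits symmetric matrices
#    and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort so that the largest eigenvalue comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Keep the top k eigenvectors and project the data onto them.
k = 2
W = eigvecs[:, :k]
X_reduced = X_centred @ W

print(X_reduced.shape)  # (200, 2)
print(eigvals[:k] / eigvals.sum())  # share of variance kept by each vector
```

The variance of each projected column equals its eigenvalue, which is why ranking by eigenvalue ranks the directions by how much of the data they preserve.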
If we want scikit-learn to give us all of the PCA components so that we can
assess the variance, we initialise PCA with n_components set to None:
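A minimal sketch, again assuming the iris data set as example input, could be:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed example data: 150 samples with 4 features.
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# n_components=None keeps every component.
pca = PCA(n_components=None)
pca.fit(X_std)

# Variance explained by each component, and the running total.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i}: {r:.3f} (cumulative {c:.3f})")
```

Reading the cumulative column is one way to decide how many components to keep, for example the smallest number that reaches a chosen share of the total variance.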
When our data is not linearly separable, we can project it onto a
higher-dimensional feature space in which it becomes linearly separable.
vectors.
Scikit-learn offers a Kernel PCA module. To use Kernel PCA, we can use the
following snippet of code:
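A minimal sketch, assuming the two-moons toy data set and an RBF kernel with gamma=15 (both illustrative choices that would need tuning on real data), might be:

```python
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

# Assumed toy data: two interleaved half-circles that a straight
# line cannot separate in the original 2-D space.
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data to a higher-dimensional
# space; gamma controls the width of the kernel.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)  # (200, 2)
```

Plotting X_kpca coloured by y is a quick way to check whether the chosen kernel and gamma have pulled the two groups apart.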
We now have access to large amounts of data. When we build forecasting
models that are trained on images, sound and/or text, the input feature
sets can end up having a large number of features. This increases storage
space, encourages over-fitting and slows down model training.
Occasionally, features are introduced that end up adding more
noise than expected.
Summary
This article provided an overview of the techniques which we can follow to
compress our data set onto a new feature subspace of lower dimensionality.
It also provided details of important dimension reduction techniques.