Dimensionality Reduction
INTRODUCTION TO DIMENSIONALITY REDUCTION
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing Linear Regression or Logistic Regression models. The following steps are performed to reduce the dimensionality, i.e., to select features:
• First, all n variables of the given dataset are used to train the model.
• The performance of the model is measured.
• One feature is removed at a time, and the model is trained on the remaining n-1 features; this is done n times, and the performance of each reduced model is computed.
• The feature whose removal causes the smallest (or no) change in the performance of the model is dropped, leaving n-1 features.
• The complete process is repeated until no further feature can be dropped.
By fixing a target model performance and a maximum tolerable error rate, this technique lets us determine the optimal number of features required for the machine learning algorithm.
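As an illustration, below is a minimal sketch of backward elimination using scikit-learn's SequentialFeatureSelector with direction="backward"; the synthetic dataset and the choice of retaining 5 features are assumptions made for the example.

    # Backward feature elimination (illustrative sketch).
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    # Hypothetical dataset: 10 features, only some of them informative.
    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=4, random_state=0)

    model = LogisticRegression(max_iter=1000)
    # Start from all features and greedily drop the feature whose removal
    # changes cross-validated performance the least.
    selector = SequentialFeatureSelector(model, n_features_to_select=5,
                                         direction="backward", cv=5)
    selector.fit(X, y)
    print("Selected feature indices:", selector.get_support(indices=True))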
Missing Value Ratio
If a variable in the dataset has too many missing values, we drop it, as it does not carry much useful information. To perform this, we set a threshold level: if the ratio of missing values in a variable exceeds the threshold, the variable is dropped. The lower the threshold, the more aggressively variables are removed.
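A minimal pandas sketch of this filter follows; the small DataFrame and the 50% threshold are assumptions for the example.

    # Missing value ratio filter (illustrative sketch).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, np.nan, 4],
                       "b": [np.nan, np.nan, np.nan, 1],
                       "c": [5, 6, 7, 8]})

    threshold = 0.5                        # drop columns with >50% missing values
    missing_ratio = df.isnull().mean()     # fraction of NaNs per column
    kept = df.loc[:, missing_ratio <= threshold]
    print(kept.columns.tolist())           # ['a', 'c']  ('b' is 75% missing)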
High Correlation Filter
High correlation refers to the case when two variables carry approximately the same information; this redundancy can degrade the performance of the model. The correlation between independent numerical variables is measured by the correlation coefficient, and if this value exceeds a threshold, we can remove one of the two variables from the dataset. When deciding which of the two to drop, we can keep the variable that shows the higher correlation with the target variable.
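The sketch below applies this filter with pandas; the toy DataFrame and the 0.9 threshold are assumptions for the example.

    # High correlation filter (illustrative sketch).
    import pandas as pd

    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                       "x2": [2, 4, 6, 8, 10],   # perfectly correlated with x1
                       "x3": [5, 3, 8, 1, 7]})

    threshold = 0.9
    corr = df.corr().abs()
    # For each pair above the threshold, mark the later column for removal.
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])
    print("Dropping:", sorted(to_drop))    # ['x2']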
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. The algorithm has a built-in feature importance measure, so we do not need to program it separately. In this technique, we generate a large set of trees against the target variable and use the usage statistics of each attribute to find the most relevant subset of features.
The random forest algorithm takes only numerical variables, so categorical input data must first be converted to numeric form using one-hot encoding.
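Below is a minimal scikit-learn sketch that reads off the built-in importances; the synthetic dataset is an assumption for the example.

    # Feature importance from a random forest (illustrative sketch).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=8,
                               n_informative=3, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    # feature_importances_ is the in-built importance measure mentioned above;
    # higher scores indicate features the trees rely on more heavily.
    for idx, score in enumerate(forest.feature_importances_):
        print(f"feature {idx}: {score:.3f}")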
Factor Analysis
Factor analysis is a technique in which variables are grouped according to their correlations with other variables: variables within a group can have a high correlation with each other, but they have a low correlation with variables of other groups.
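A minimal sketch using scikit-learn's FactorAnalysis is shown below; the choice of the Iris data and two factors is an assumption for the example.

    # Factor analysis with scikit-learn (illustrative sketch).
    from sklearn.datasets import load_iris
    from sklearn.decomposition import FactorAnalysis

    X = load_iris().data                   # 4 correlated measurements
    fa = FactorAnalysis(n_components=2, random_state=0)
    X_reduced = fa.fit_transform(X)        # each sample described by 2 factors
    print(X_reduced.shape)                 # (150, 2)
    print(fa.components_.shape)            # factor loadings: (2, 4)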
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, a type of artificial neural network (ANN) whose main aim is to copy its inputs to its outputs. The input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has two main parts:
o Encoder: compresses the input to form the latent-space representation.
o Decoder: reconstructs the output from the latent-space representation.
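As a compact stand-in for a full deep learning implementation, the sketch below trains scikit-learn's MLPRegressor to reproduce its input through a narrow hidden layer; the 2-unit bottleneck and the Iris data are assumptions for the example.

    # Auto-encoder-style sketch: a network trained to reproduce its input
    # through a narrow hidden (latent) layer.
    from sklearn.datasets import load_iris
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(load_iris().data)

    # hidden_layer_sizes=(2,) is the latent-space bottleneck: the first half
    # of the network acts as the encoder, the second half as the decoder.
    autoencoder = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000,
                               random_state=0)
    autoencoder.fit(X, X)                  # target equals the input
    print("Reconstruction R^2:", autoencoder.score(X, X))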
KEY ASPECTS OF DIMENSIONALITY REDUCTION
1) The Curse of Dimensionality
2) Main Approaches for Dimensionality Reduction
3) PCA (Principal Component Analysis)
4) Using Scikit-Learn
5) Randomized PCA
6) Kernel PCA
THE CURSE OF DIMENSIONALITY
Feature Selection
Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.
Three methods are used for feature selection:
1. Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some common techniques of the filter method are listed below, followed by a short sketch:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
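The sketch below applies a chi-square filter with scikit-learn's SelectKBest; the Iris data (whose features are non-negative, as chi2 requires) and k=2 are assumptions for the example.

    # Filter method: chi-square feature selection (illustrative sketch).
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    selector = SelectKBest(score_func=chi2, k=2)
    X_new = selector.fit_transform(X, y)
    print("Chi-square scores:", selector.scores_)
    print("Kept features:", selector.get_support(indices=True))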
2. Wrapper Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model, and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more computationally expensive. Some common techniques of wrapper methods are listed below, followed by a forward-selection sketch:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
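As a minimal sketch of forward selection, scikit-learn's SequentialFeatureSelector can be run with direction="forward"; the Iris data and the choice of two retained features are assumptions for the example.

    # Wrapper method: forward selection around a model (illustrative sketch).
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    # Start from no features and greedily add the feature that improves
    # cross-validated accuracy the most.
    selector = SequentialFeatureSelector(model, n_features_to_select=2,
                                         direction="forward", cv=5)
    selector.fit(X, y)
    print("Selected feature indices:", selector.get_support(indices=True))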
3. Embedded Methods: Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature as part of training itself. Some common techniques of embedded methods are listed below, followed by a LASSO sketch:
o LASSO
o Elastic Net
o Ridge Regression, etc.
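The sketch below illustrates the embedded idea with LASSO: the L1 penalty drives uninformative coefficients to exactly zero during training. The synthetic dataset and alpha value are assumptions for the example.

    # Embedded method: L1-regularized (LASSO) selection (illustrative sketch).
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=10,
                           n_informative=3, noise=5.0, random_state=0)

    lasso = Lasso(alpha=1.0).fit(X, y)
    # Selection falls out of model training: zeroed coefficients are dropped.
    selector = SelectFromModel(lasso, prefit=True)
    print("Kept features:", np.where(selector.get_support())[0])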
Feature Extraction:
Feature extraction is the process of transforming a space with many dimensions into a space with fewer dimensions. This approach is useful when we want to retain the whole information while using fewer resources to process it.
Some common feature extraction techniques are:
a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis
In this section, I will discuss how to find the principal components with a simple solved numerical example.
Problem definition: Given the data in the table below, reduce the dimension from 2 to 1 using the Principal Component Analysis (PCA) algorithm.

Feature   Example 1   Example 2   Example 3   Example 4
X1            4           8          13           7
X2           11           4           5          14
The mean vector of the data is (8, 8.5)^T, and the sample covariance matrix (computed with denominator n - 1, where n = 4 is the number of data points) is

    S = [  14   -11 ]
        [ -11    23 ]

The larger eigenvalue of S is λ1 = (37 + √565)/2 ≈ 30.3849, and its corresponding unit eigenvector is e1 ≈ (0.5574, -0.8303)^T.

Let X_k be the kth sample in the above table (dataset). The first principal component of this example is given by (here "T" denotes the transpose of the matrix)

    PC1(X_k) = e1^T (X_k - mean) ≈ 0.5574 (x_1k - 8) - 0.8303 (x_2k - 8.5)

For example, the first principal component corresponding to the first example is calculated as follows:

    PC1(X_1) ≈ 0.5574 (4 - 8) - 0.8303 (11 - 8.5) ≈ -4.3052

The first principal components of all four examples are summarized below:

Feature                     Example 1   Example 2   Example 3   Example 4
X1                              4           8          13           7
X2                             11           4           5          14
First Principal Component   -4.3052      3.7361      5.6928     -5.1238
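These values can be checked numerically; the minimal NumPy sketch below reproduces the first principal components from the table (up to an arbitrary overall sign of the eigenvector).

    # Verifying the worked PCA example with NumPy (illustrative sketch).
    import numpy as np

    X = np.array([[4.0, 11.0], [8.0, 4.0], [13.0, 5.0], [7.0, 14.0]])

    X_centered = X - X.mean(axis=0)        # subtract the mean vector (8, 8.5)
    S = np.cov(X_centered, rowvar=False)   # covariance matrix [[14, -11], [-11, 23]]

    eigvals, eigvecs = np.linalg.eigh(S)
    e1 = eigvecs[:, np.argmax(eigvals)]    # unit eigenvector of largest eigenvalue
    pc1 = X_centered @ e1                  # first principal component per example
    print(pc1)  # approx. [-4.3052, 3.7361, 5.6928, -5.1238] (sign may flip)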
Kernel PCA
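Kernel PCA applies the kernel trick to PCA so that non-linear structure in the data can be captured. A minimal scikit-learn sketch is shown below; the concentric-circles toy data, the RBF kernel, and the gamma value are assumptions for the example.

    # Kernel PCA with scikit-learn (illustrative sketch).
    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    # Concentric circles are not linearly separable, so plain PCA cannot
    # "unfold" them; an RBF kernel can.
    X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
    X_kpca = kpca.fit_transform(X)
    print(X_kpca.shape)                    # (300, 2)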
Benefits of applying Dimensionality Reduction
Some benefits of applying the dimensionality reduction technique to the given dataset are given below:
o By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of the features of the dataset help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
There are also some disadvantages of applying dimensionality reduction, which are given below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not known in advance.