
HAIMLC501 MathematicsForAIML Lecture 16 Dimensionality Reduction SH2022

The document discusses dimensionality reduction techniques. It begins by defining dimensionality reduction as reducing the number of random variables under consideration to obtain a set of principal variables while maintaining similar information. Some benefits mentioned are compressing data size, speeding up computations, removing redundant features, and allowing visualization of higher-dimensional data. Several techniques are then outlined, including removing features with many missing values, low variance filters, decision trees/random forests, and removing highly correlated features.


The Curse of Dimensions...

"Of all the facts which were presented to us we had to pick just those which we deemed to be essential, and then piece them together in their order, so as to reconstruct this very remarkable chain of events."
– Sherlock Holmes, The Sign of Four

CSDC7013 Mathematics for AIML
Dimensionality Reduction

Amroz K. Siddiqui
Fr. C. Rodrigues Institute of Technology, Vashi.

Amroz K. Siddiqui (Fr. CRIT) Dimensionality Reduction September 27, 2022 1 / 29


Outline

1 Motivation
What is Dimensionality Reduction?
What are the benefits of Dimension Reduction?

2 Dimensionality Reduction Techniques



Motivation

Data explosion
New ways of gathering data
Noise, unnecessary data
More is not always good.
Large amounts of data can sometimes produce worse performance
in data analytics applications.



Motivation

Examples: Ways of collecting data


Political parties capture data by expanding their reach in the field.
Sports generate an ever-growing volume of recorded statistics.
Organizations evaluate their brand value through social media
engagement (comments, likes), followers, and positive and negative
sentiment.
With more variables comes more trouble! To avoid this trouble,
dimension reduction techniques come to the rescue.



Motivation

In machine learning classification problems, there are often too many


factors on the basis of which the final classification is done.
These factors are basically variables called features.
The higher the number of features, the harder it gets to visualize the
training set and then work on it.
Sometimes, most of these features are correlated, and hence
redundant.
This is where dimensionality reduction algorithms come into play.



Motivation

What is Dimensionality Reduction?


Dimension Reduction refers to the process of converting a set of data
having vast dimensions into data with fewer dimensions, ensuring that
it conveys similar information concisely.
Dimensionality Reduction is the process of reducing the number of
random variables under consideration, by obtaining a set of principal
variables.



Example

Email Classification
Candidates for recruitment
Environmental variables
Sports variables



Example

A 3-D classification problem can be hard to visualize.


Whereas a 2-D one can be mapped to a simple 2-dimensional space.
And a 1-D problem to a simple line.





Example

Figure: 2D data converted to 1D data, if both convey the same meaning.



Dimensionality Reduction

In a similar way, we can reduce the n dimensions of a data set to k
dimensions (k < n).
These k dimensions can be directly identified (filtered).
Or can be a combination of dimensions (weighted averages of
dimensions)
Or new dimension(s) that represent existing multiple dimensions well.
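As a toy sketch of the "weighted averages of dimensions" idea, the NumPy example below (the data and weights are illustrative, not from the lecture) collapses two redundant columns into one:

```python
import numpy as np

# Five samples of one length measured in two redundant units
# (roughly metres and inches), i.e. two correlated dimensions.
X = np.array([[1.0, 39.4], [2.0, 78.7], [3.0, 118.1],
              [4.0, 157.5], [5.0, 196.9]])

# A weighted average of the two columns yields a single new dimension
# that carries essentially the same information: n = 2 reduced to k = 1.
w = np.array([0.5, 0.5 / 39.37])   # weights chosen to match the two scales
reduced = X @ w                    # shape (5, 2) -> (5,)
print(reduced)                     # close to [1, 2, 3, 4, 5]
```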



What are the benefits of Dimension Reduction?

It helps in data compressing and reducing the storage space required


It reduces the time required to perform the same computations.
Fewer dimensions mean less computation; fewer dimensions can also
allow the use of algorithms that are unfit for a large number of dimensions.
It removes redundant features. For example: there is no point in
storing a value in two different units (meters and inches).
Reducing the dimensions of data to 2D or 3D may allow us to plot
and visualize it precisely. You can then observe patterns more clearly.
It also helps in noise removal, and as a result we can improve
the performance of models.



Components of Dimensionality Reduction

There are two components of dimensionality reduction:


Feature selection: In this, we try to find a subset of the original set
of variables, or features, to get a smaller subset which can be used to
model the problem. It usually involves three ways:
1 Filter
2 Wrapper
3 Embedded
Feature extraction: This reduces the data in a high-dimensional
space to a lower-dimensional space, i.e. a space with a smaller
number of dimensions.



Dimensionality Reduction Techniques

There are many methods to perform Dimension reduction.


Missing Values
Low Variance
Decision Trees
Random Forest
High Correlation
Backward Feature Elimination
Forward Feature Selection



Dimensionality Reduction Techniques

Missing Values
The most straightforward way to reduce data dimensionality is via the
count of missing values.
Interpolation?
In most cases, for example, if a data column has only 5-10% of the
possible values, it will likely not be useful for the classification of most
records.





Dimensionality Reduction Techniques

Missing Values
The goal, then, becomes to remove those data columns with too
many missing values, i.e. with more missing values in percent than a
given threshold.
Ratio of missing values = number of missing values / total number of
rows
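A minimal pandas sketch of this filter (the DataFrame and the 50% threshold below are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, np.nan, 35],
    "income": [50, np.nan, np.nan, np.nan, np.nan, 60],   # 4/6 missing
    "city":   ["A", "B", "A", "C", "B", "A"],
})

threshold = 0.5                                # maximum tolerated ratio
missing_ratio = df.isna().sum() / len(df)      # missing values / total rows
kept = df.loc[:, missing_ratio <= threshold]   # drop columns above threshold
print(list(kept.columns))                      # ['age', 'city']
```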



Dimensionality Reduction Techniques

Missing Values: Issues to deal with...


If the column with the missing values is crucial in classification or
other algorithms.
If it is not just numerical data.
If different classification (or other) algorithms give different results
after eliminating the missing value column.
If too many columns have missing values.



Dimensionality Reduction Techniques

Low Variance Filter


Another way of measuring how much information a data column has,
is to measure its variance.
If the column cells assume a constant value, the variance would be 0
and the column would be of no help in the discrimination of different
groups of data.



Dimensionality Reduction Techniques

Low Variance Filter


This method calculates each column's variance and removes those
columns whose variance falls below a given threshold.
But this dimensionality reduction method applies only to numerical
columns.
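A pandas sketch of the filter (the data and threshold are illustrative; note also that variance is scale-dependent, so in practice columns are often normalized first):

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [7, 7, 7, 7],          # variance 0: no discriminative power
    "slow":     [5.0, 5.1, 5.0, 5.1],  # variance ~0.003: nearly constant
    "noisy":    [1, 9, 3, 8],
})

threshold = 0.01
variances = df.var()                    # column-wise sample variance
kept = df.loc[:, variances > threshold]
print(list(kept.columns))               # ['noisy']
```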



Dimensionality Reduction Techniques

Decision Trees, Random Forests, Ensembles


Decision tree ensembles, such as random forests, are
useful for feature selection in addition to being effective classifiers.
One approach to dimensionality reduction is to generate a large and
carefully constructed set of trees against a target attribute.
Then use each attribute’s usage statistics to find the most informative
subset of features.



Dimensionality Reduction Techniques

Decision Trees, Random Forests, Ensembles


Generate a set of very shallow trees.
Each tree being trained on a small fraction of the total number of
attributes.
If an attribute is often selected as best split, it is most likely an
informative feature to retain.
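This procedure can be sketched with scikit-learn (assumed available; the dataset is synthetic): many very shallow trees, with each split allowed to consider only a couple of the attributes, and the usage statistics exposed as feature_importances_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which carry class information.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Many very shallow trees; each split considers only 2 candidate attributes.
forest = RandomForestClassifier(n_estimators=200, max_depth=2,
                                max_features=2, random_state=0).fit(X, y)

# Attributes that were often selected as the best split rank highest.
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[:3])   # indices of the most informative features to retain
```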



Dimensionality Reduction Techniques

High Correlation
Often input features are correlated.
That is, they depend on one another and carry similar information.
A data column with values highly correlated to those of another data
column is not going to add very much new information to the existing
pool of input features.
One of the two columns can be removed without decreasing the
amount of information available for future tasks dramatically.
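A pandas sketch of this filter (synthetic data; the 0.95 cutoff is an illustrative assumption): compute pairwise correlations and drop one column of every highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "height_m":  a,
    "height_in": a * 39.37 + rng.normal(scale=0.01, size=200),  # near-duplicate
    "weight":    rng.normal(size=200),
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once,
# and mark the second column of every pair correlated above the cutoff.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop)                       # ['height_in']
reduced = df.drop(columns=to_drop)
```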



Dimensionality Reduction Techniques

Backward Feature Elimination


The Backward Feature Elimination loop performs dimensionality
reduction against a particular machine learning algorithm.
The concept behind the Backward Feature Elimination technique is
quite simple.
In this method, we start with all n dimensions.
And keep on removing one feature at a time.



Dimensionality Reduction Techniques

Backward Feature Elimination


At each iteration, the selected classification algorithm is trained on n
input features.
Then we remove one input feature at a time and train the same
model on n-1 input features n times.
Finally, the input feature whose removal has produced the smallest
increase in the error rate is removed, leaving us with n-1 input
features.



Dimensionality Reduction Techniques

Backward Feature Elimination


The classification is then repeated using n-2 features n-1 times.
And, again, the feature whose removal produces the smallest
disruption in classification performance is removed for good.
This gives us n-2 input features.



Dimensionality Reduction Techniques

Backward Feature Elimination


The algorithm starts with all available n input features and continues
until only one feature is left for classification.
Each iteration k then produces a model trained on n-k features and
an error rate e(k).
Selecting the maximum tolerable error rate, we define the smallest
number of features necessary to reach that classification performance
with the selected machine learning algorithm.
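The loop above can be sketched with scikit-learn (assumed available; the classifier, dataset, and cross-validation setup are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
features = list(range(X.shape[1]))
model = LogisticRegression(max_iter=1000)
history = []                       # (number of features, error rate e(k))

while len(features) > 1:
    # Remove each remaining feature in turn and score the model without it.
    scores = {f: cross_val_score(model,
                                 X[:, [g for g in features if g != f]],
                                 y, cv=5).mean()
              for f in features}
    # Drop for good the feature whose removal hurts performance the least.
    best_to_drop = max(scores, key=scores.get)
    features.remove(best_to_drop)
    history.append((len(features), 1 - scores[best_to_drop]))

print(history)   # choose the smallest n whose error is still tolerable
```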



Dimensionality Reduction Techniques

Backward Feature Elimination: The main drawback...


The main drawback of this technique is the high number of iterations
for very high dimensional data sets, possibly leading to very long
computation times.
Solution: use it after applying other dimensionality reduction techniques.



Dimensionality Reduction Techniques

Forward Feature Selection


The reverse of Backward Feature Elimination is the Forward
Feature Selection method.
In this method, we start with one variable and analyse the performance
of the model as each new variable is added.
The variable selected at each step is the one that yields the greatest
improvement in model performance.
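scikit-learn's SequentialFeatureSelector (assumed available) implements this greedy forward procedure; the dataset and the choice of 3 features below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Start from an empty set and greedily add the feature whose inclusion
# most improves cross-validated performance, until 3 are selected.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward", cv=5).fit(X, y)
print(sfs.get_support(indices=True))   # indices of the selected features
```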

