HAIMLC501 MathematicsForAIML Lecture 16 Dimensionality Reduction SH2022
Amroz K. Siddiqui
– Sherlock Holmes
– The Sign of Four
1 Motivation
What is Dimensionality Reduction?
What are the benefits of Dimensionality Reduction?
Data explosion
New ways of gathering data
Noise, unnecessary data
More is not always better.
Large amounts of data can sometimes lead to worse performance
in data analytics applications.
Email Classification
Candidates for recruitment
Environmental variables
Sports variables
Missing Values
The most straightforward way to reduce data dimensionality is to filter
columns by their count of missing values.
Interpolation?
For example, if a data column has values for only 5-10% of the
rows, it will in most cases not be useful for classifying the majority of
records.
Missing Values
The goal, then, is to remove those data columns with too
many missing values, i.e. columns whose percentage of missing values
exceeds a given threshold.
Ratio of missing values = (number of missing values) / (total number of rows)
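The ratio and threshold rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the table layout (a dict of column lists with None marking a missing value), the helper names missing_ratio and drop_sparse_columns, the sample data, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def missing_ratio(column: List[Optional[object]]) -> float:
    """Ratio of missing values = number of missing values / total number of rows."""
    if not column:
        return 0.0
    return sum(value is None for value in column) / len(column)


def drop_sparse_columns(
    table: Dict[str, List[Optional[object]]], threshold: float
) -> Dict[str, List[Optional[object]]]:
    """Keep only the columns whose missing-value ratio does not exceed the threshold."""
    return {
        name: values
        for name, values in table.items()
        if missing_ratio(values) <= threshold
    }


# Hypothetical toy table; None marks a missing value (an assumption of this sketch).
table = {
    "age": [34, 51, None, 29, 42],
    "income": [None, None, None, 58000, None],
    "city": ["Pune", "Delhi", "Mumbai", None, "Goa"],
}

ratios = {name: missing_ratio(values) for name, values in table.items()}
reduced = drop_sparse_columns(table, threshold=0.5)  # threshold chosen for illustration

print(ratios)
print(list(reduced))
```

In this toy table only "income" is removed, since four of its five rows are missing. The right threshold depends on the data set and the task.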
High Correlation
Often input features are correlated.
That is, they depend on one another and carry similar information.
A data column with values highly correlated to those of another data
column is not going to add very much new information to the existing
pool of input features.
One of the two columns can be removed without dramatically reducing
the amount of information available for future tasks.
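One possible way to act on this is a greedy filter: keep a column only if it is not highly correlated with any column already kept. The sketch below uses the Pearson correlation coefficient as the measure, which is an assumption since the slides do not name one. The helper names pearson and drop_correlated, the sample features, and the 0.9 threshold are likewise hypothetical.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column: treated as uncorrelated in this sketch
    return cov / (norm_x * norm_y)


def drop_correlated(table: Dict[str, List[float]], threshold: float) -> List[str]:
    """Greedily keep columns whose |correlation| with every kept column is below threshold."""
    kept: List[str] = []
    for name in table:
        if all(abs(pearson(table[name], table[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical features: height in centimetres and in inches carry the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [9.0, 6.0, 10.0, 7.0, 8.0],
}

kept = drop_correlated(features, threshold=0.9)  # threshold chosen for illustration
print(kept)
```

Which column of a correlated pair survives depends on the iteration order here; in practice one might prefer the column with fewer missing values or the clearer interpretation.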