Dimensionality Reduction
• Dimensionality reduction is the task of
reducing the number of features in a dataset.
• In machine learning tasks like regression or
classification, there are often too many
variables to work with.
• These variables are also called features.
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the
number of features (or dimensions) in a dataset while
retaining as much information as possible.
• This can be done for a variety of reasons, such as to reduce
the complexity of a model, to improve the performance of a
learning algorithm, or to make it easier to visualize the data.
• There are several techniques for dimensionality reduction,
including principal component analysis (PCA), singular value
decomposition (SVD), and linear discriminant analysis (LDA).
• Each technique uses a different method to project the data
onto a lower-dimensional space while preserving important
information.
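To make this concrete, here is a minimal sketch (assuming scikit-learn and its bundled iris dataset; the dataset and parameter choices are illustrative, not prescribed by the slide) that projects 4-dimensional data onto 2 dimensions with PCA and LDA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# PCA: unsupervised projection onto the directions of largest variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised projection that maximizes class separability
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X.shape, "->", X_pca.shape, "and", X_lda.shape)   # (150, 4) -> (150, 2) and (150, 2)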
Dimensionality Reduction
• Dimensionality reduction is a technique used
to reduce the number of features in a dataset
while retaining as much of the important
information as possible.
• In other words, it is a process of transforming
high-dimensional data into a lower-
dimensional space that still preserves the
essence of the original data.
The Curse of Dimensionality
• In machine learning, high-dimensional data refers to data
with a large number of features or variables.
• The curse of dimensionality is a common problem in
machine learning, where the performance of the model
deteriorates as the number of features increases.
• This is because the complexity of the model grows with
the number of features and, for a fixed number of training
samples, the data becomes increasingly sparse in
high-dimensional space, so it becomes harder to find a
good solution.
• In addition, high-dimensional data can also lead to
overfitting, where the model fits the training data too
closely and does not generalize well to new data.
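A small illustration of this effect (synthetic data, scikit-learn assumed): a k-nearest-neighbour classifier is trained on 5 informative features plus a growing number of pure-noise features. Exact accuracies vary from run to run, but test performance typically degrades as the dimensionality grows.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

for n_noise in [0, 50, 500]:
    # 5 informative features plus n_noise uninformative noise features
    X, y = make_classification(n_samples=300, n_features=5 + n_noise,
                               n_informative=5, n_redundant=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    print(f"{5 + n_noise:4d} features -> test accuracy {clf.score(X_te, y_te):.2f}")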
Dimensionality Reduction Methods
• Dimensionality reduction can help to mitigate
these problems by reducing the complexity of
the model and improving its generalization
performance.
• There are two main approaches to
dimensionality reduction:
– feature selection
– feature extraction
Feature Selection
• Feature selection involves selecting a subset of the original
features that are most relevant to the problem at hand.
• The goal is to reduce the dimensionality of the dataset
while retaining the most important features.
• There are several methods for feature selection, including
filter methods, wrapper methods, and embedded methods.
• Filter methods rank features by their relevance to the
target variable; wrapper methods use model performance
as the criterion for selecting features; and embedded
methods combine feature selection with the model
training process.
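A brief scikit-learn sketch of the three families (the dataset, the estimators, and the choice of 10 features are illustrative assumptions, not requirements):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)      # 30 original features

# Filter: rank features by a univariate score (here the ANOVA F-value)
X_filter = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Wrapper: recursively eliminate features based on a fitted model
X_wrapper = RFE(LogisticRegression(max_iter=5000),
                n_features_to_select=10).fit_transform(X, y)

# Embedded: selection happens during training, e.g. via L1-regularized weights
X_embedded = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)).fit_transform(X, y)

print(X.shape, X_filter.shape, X_wrapper.shape, X_embedded.shape)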
Feature Extraction
• Feature extraction involves creating new features by
combining or transforming the original features.
• The goal is to create a set of features that captures the
essence of the original data in a lower-dimensional space.
• There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant
analysis (LDA), and t-distributed stochastic neighbor
embedding (t-SNE).
• PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving
as much of the variance as possible.
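As a sketch of the variance-preservation idea (assuming scikit-learn's digits dataset; any numeric dataset would do), PCA's explained_variance_ratio_ reports how much of the original variance each retained component keeps:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 64-dimensional images of digits

pca = PCA(n_components=10).fit(X)        # keep only the top 10 components
X_reduced = pca.transform(X)             # 64 features -> 10 features

print(X_reduced.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())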
Why is Dimensionality Reduction important in
Machine Learning and Predictive Modeling?
• An intuitive example of dimensionality reduction is a simple e-mail
classification problem, where we need to classify whether an e-mail is
spam or not.
• This can involve a large number of features, such as whether or not the e-
mail has a generic title, the content of the e-mail, whether the e-mail uses
a template, etc.
• However, some of these features may overlap.
• In another case, a classification problem that relies on both humidity
and rainfall can be collapsed into just one underlying feature, since the
two are highly correlated (a toy sketch of this appears at the end of this slide).
• Hence, we can reduce the number of features in such problems.
• A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2-dimensional space, and a 1-D problem to a
simple line.
• The figure below illustrates this concept: a 3-D feature space is split
into two 2-D feature spaces, and if the features in one of those spaces are
found to be correlated, the number of features can be reduced even further.
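As a toy version of the humidity/rainfall point above (purely synthetic numbers, scikit-learn assumed), two highly correlated features collapse onto essentially a single principal component:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
humidity = rng.uniform(40, 90, size=500)
rainfall = 0.8 * humidity + rng.normal(0, 2, size=500)   # strongly correlated with humidity

X = np.column_stack([humidity, rainfall])
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # the first component carries nearly all the variance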
Components of Dimensionality Reduction