
Dimensionality Reduction

Dimensionality Reduction
• Dimensionality reduction is the task of
reducing the number of features in a dataset.
• In machine learning tasks like regression or
classification, there are often too many
variables to work with.
• These variables are also called features.
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the
number of features (or dimensions) in a dataset while
retaining as much information as possible.
• This can be done for a variety of reasons, such as to reduce
the complexity of a model, to improve the performance of a
learning algorithm, or to make it easier to visualize the data.
• There are several techniques for dimensionality reduction,
including principal component analysis (PCA), singular value
decomposition (SVD), and linear discriminant analysis (LDA).
• Each technique uses a different method to project the data
onto a lower-dimensional space while preserving important
information.
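As a minimal sketch of this idea (assuming NumPy and scikit-learn are available; the data here is synthetic and purely illustrative), a 10-feature dataset can be projected onto 3 principal components:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 samples, 10 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Project onto 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (200, 10)
print(X_reduced.shape)  # (200, 3)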
Dimensionality Reduction
• Dimensionality reduction is a technique used
to reduce the number of features in a dataset
while retaining as much of the important
information as possible.
• In other words, it is a process of transforming
high-dimensional data into a lower-
dimensional space that still preserves the
essence of the original data.
The Curse of Dimensionality
• In machine learning, high-dimensional data refers to data
with a large number of features or variables.
• The curse of dimensionality is a common problem in
machine learning, where the performance of the model
deteriorates as the number of features increases.
• This is because the complexity of the model increases with
the number of features, and it becomes more difficult to
find a good solution.
• In addition, high-dimensional data can also lead to
overfitting, where the model fits the training data too
closely and does not generalize well to new data.
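One way to see the curse of dimensionality empirically (a rough sketch, assuming NumPy) is to measure how pairwise distances between random points concentrate as the number of features grows: the gap between the nearest and farthest point shrinks relative to the distances themselves, which makes distance-based methods less discriminative.

import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # 500 random points in the d-dimensional unit cube
    X = rng.random((500, d))
    # Distances from the first point to all the others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={contrast:.2f}")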
Dimensionality Reduction Methods
• Dimensionality reduction can help to mitigate
these problems by reducing the complexity of
the model and improving its generalization
performance.
• There are two main approaches to
dimensionality reduction:
– feature selection
– feature extraction.
Feature Selection
• Feature selection involves selecting a subset of the original
features that are most relevant to the problem at hand.
• The goal is to reduce the dimensionality of the dataset
while retaining the most important features.
• There are several methods for feature selection, including
filter methods, wrapper methods, and embedded methods.
• Filter methods rank the features based on their relevance
to the target variable, wrapper methods use the model
performance as the criteria for selecting features, and
embedded methods combine feature selection with the
model training process.
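The sketch below (assuming scikit-learn) shows one common representative of each family on a toy classification dataset: SelectKBest as a filter method, RFE as a wrapper method, and the coefficients of an L1-penalised model as an embedded method. The specific choices are illustrative, not the only options.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# Filter: rank features by an ANOVA F-score, keep the top 5
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: recursively eliminate features using model performance
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapper = rfe.fit_transform(X, y)

# Embedded: the L1 penalty drives irrelevant coefficients to zero during training
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = (lasso.coef_ != 0).any(axis=0)

print(X_filter.shape, X_wrapper.shape, selected.sum())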
Feature Extraction
• Feature extraction involves creating new features by
combining or transforming the original features.
• The goal is to create a set of features that captures the
essence of the original data in a lower-dimensional space.
• There are several methods for feature extraction, including
principal component analysis (PCA), linear discriminant
analysis (LDA), and t-distributed stochastic neighbor
embedding (t-SNE).
• PCA is a popular technique that projects the original
features onto a lower-dimensional space while preserving
as much of the variance as possible.
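As a hedged sketch (assuming scikit-learn, with the Iris data used only as an example), the three extraction techniques named above can all be applied to the same labelled dataset; note that LDA is supervised (it uses the class labels), while PCA and t-SNE are unsupervised.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

X_pca = PCA(n_components=2).fit_transform(X)                            # maximises retained variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # maximises class separation
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)          # preserves local neighbourhoods

print(X_pca.shape, X_lda.shape, X_tsne.shape)   # each (150, 2)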
Why is Dimensionality Reduction important in
Machine Learning and Predictive Modeling?
• An intuitive example of dimensionality reduction can be discussed through
a simple e-mail classification problem, where we need to classify whether
the e-mail is spam or not.
• This can involve a large number of features, such as whether or not the e-
mail has a generic title, the content of the e-mail, whether the e-mail uses
a template, etc.
• However, some of these features may overlap.
• Similarly, a classification problem that relies on both humidity
and rainfall can be collapsed into just one underlying feature, since the two
are highly correlated (a small numeric sketch follows this list).
• A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2-dimensional space, and a 1-D problem to a
simple line.
• The figure below illustrates this concept: a 3-D feature space is split
into two 2-D feature spaces, and if the features are found to be correlated, the
number of features can be reduced even further.
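The humidity-and-rainfall point can be made concrete with a short sketch (assuming NumPy and scikit-learn; the data is synthetic): when two features are strongly correlated, a single principal component already explains almost all of the variance, so the pair can be collapsed to one feature with little loss.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
humidity = rng.normal(60, 10, size=500)
rainfall = 2.0 * humidity + rng.normal(0, 3, size=500)   # strongly correlated with humidity
X = np.column_stack([humidity, rainfall])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)   # roughly [0.997, 0.003]: one component captures nearly all the variance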
Components of Dimensionality Reduction
• There are two components of dimensionality reduction:
– Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem.
– It usually involves three approaches:
• Filter
• Wrapper
• Embedded
– Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
• The various methods used for dimensionality
reduction include:
– Principal Component Analysis (PCA)
– Linear Discriminant Analysis (LDA)
– Generalized Discriminant Analysis (GDA)
• Dimensionality reduction may be both linear
and non-linear, depending upon the method
used.
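PCA and LDA are linear projections, while GDA is a kernel-based (non-linear) extension of discriminant analysis. As a hedged illustration of the linear versus non-linear distinction (assuming scikit-learn; KernelPCA is used here merely as a stand-in non-linear method, and the parameter values are illustrative):

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not separable by any linear projection
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=1).fit_transform(X)                                    # linear
X_nonlinear = KernelPCA(n_components=1, kernel="rbf", gamma=10).fit_transform(X)   # non-linear

# The RBF-kernel projection can separate the two circles, which no linear projection can do
print(X_linear.shape, X_nonlinear.shape)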
Principal Component Analysis
• This method was introduced by Karl Pearson.
• It maps data from a higher-dimensional space to a
lower-dimensional space in such a way that the variance
of the data in the lower-dimensional space is maximized.
Principal Component Analysis
• It involves the following steps:
– Construct the covariance matrix of the data.
– Compute the eigenvectors of this matrix.
– Eigenvectors corresponding to the largest
eigenvalues are used to reconstruct a large
fraction of variance of the original data.
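These steps can be written directly in NumPy. The following is a minimal sketch, not a production implementation (the data is centred first, a step the list above leaves implicit):

import numpy as np

def pca(X, k):
    # Centre the data so the covariance matrix is meaningful
    X_centred = X - X.mean(axis=0)
    # Step 1: construct the covariance matrix
    cov = np.cov(X_centred, rowvar=False)
    # Step 2: compute eigenvalues and eigenvectors (eigh: the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 3: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    # Project the data onto the retained components
    return X_centred @ components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)   # (100, 2)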
Principal Component Analysis
• Hence, we are left with a smaller number of
eigenvectors, and some information may have been
lost in the process.
• However, the most important variance should be
retained by the remaining eigenvectors.
Eigenvalues
• In PCA, eigenvalues represent the amount of
variance (spread or variability) captured by each
principal component.
• Each principal component corresponds to an
eigenvalue, and the eigenvalues are arranged in
decreasing order. The higher the eigenvalue, the
more variance the corresponding principal
component explains.
• The sum of all eigenvalues equals the total variance
in the original data.
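A short sketch of these two facts (assuming NumPy; the data is synthetic): the eigenvalues of the covariance matrix, sorted in decreasing order, sum to the total variance, and each eigenvalue divided by that sum is the fraction of variance its component explains.

import numpy as np

X = np.random.default_rng(0).normal(size=(200, 4)) * [5.0, 2.0, 1.0, 0.5]
X_centred = X - X.mean(axis=0)

eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X_centred, rowvar=False)))[::-1]

total_variance = X_centred.var(axis=0, ddof=1).sum()
print(np.isclose(eigenvalues.sum(), total_variance))   # True: eigenvalues sum to the total variance
print(eigenvalues / eigenvalues.sum())                 # fraction of variance per component, in decreasing order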
Eigenvectors
• Eigenvectors are associated with the principal components,
and they indicate the direction of the spread or variability in
the data.
• Each eigenvector points in the direction of maximum variance,
and the magnitude of the eigenvalue associated with that
eigenvector indicates the importance or significance of that
direction.
• The first principal component (associated with the largest
eigenvalue) points in the direction of maximum variance, the
second principal component (associated with the second-
largest eigenvalue) points in the direction of the second-
highest variance, and so on.
• Eigenvectors are used to transform the original data into a new
coordinate system, defined by the principal components.
• Eigenvalues tell us how much variance is
captured by each principal component, and
eigenvectors tell us the direction of the spread
of the data in the new coordinate system.
• The goal of PCA is to reduce the
dimensionality of the data by keeping the
principal components with the highest
eigenvalues, as they represent the most
significant patterns or features in the dataset.
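In scikit-learn these quantities are exposed directly after fitting a PCA model, as the short sketch below shows (the Iris data is used only as an example): components_ holds the eigenvectors (directions) and explained_variance_ the corresponding eigenvalues.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2).fit(X)

print(pca.components_)          # eigenvectors: directions of maximum variance
print(pca.explained_variance_)  # eigenvalues: variance captured along each direction
print(pca.transform(X).shape)   # data expressed in the new coordinate system: (150, 2)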
Variance in PCA
• Variance is a key concept in Principal Component
Analysis (PCA), and its importance lies in
capturing and retaining the most significant
information present in the data.
• In the context of data, variance represents the
amount of information or spread in the dataset.
• High variance indicates that the data points are
more dispersed, covering a wider range of values.
Variance in PCA
• The primary goal of PCA is to reduce the dimensionality
of the data while retaining as much of its original
variability as possible.
• Principal components are derived in such a way that the
first principal component captures the maximum variance
in the data, the second principal component captures the
second-highest variance, and so on.
• By selecting a subset of principal components that
collectively explain a high percentage of the total
variance, you can represent the data in a lower-
dimensional space without losing significant information.
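scikit-learn supports this "keep enough components to explain a given share of the variance" strategy directly, as sketched below; the 0.95 threshold and the digits dataset are arbitrary illustrations.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 features

# Keep the smallest number of components that explains at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())   # >= 0.95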
Variance in PCA
• Variance is a measure of the information
content in each direction (principal
component) of the data.
• By focusing on the directions with the highest
variance, PCA helps in discarding less
important directions and, consequently,
reduces the dimensionality of the data.
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduces the required storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
• Improved Visualization
– High-dimensional data is difficult to visualize; dimensionality
reduction techniques can help in visualizing the data in 2D or 3D,
which aids understanding and analysis (see the sketch after this list).
• Overfitting Prevention
– High dimensional data may lead to overfitting in machine learning
models, which can lead to poor generalization performance.
– Dimensionality reduction can help in reducing the complexity of
the data, and hence prevent overfitting.
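As a hedged example of the visualization point above (assuming scikit-learn and Matplotlib are available, with the Iris data used only as an example), a 4-feature dataset can be projected to 2D and plotted directly:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 4 features per sample
X_2d = PCA(n_components=2).fit_transform(X)  # project to 2D for plotting

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()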
Advantages of Dimensionality Reduction
• Feature Extraction
– Dimensionality reduction can help in extracting important features
from high dimensional data, which can be useful in feature
selection for machine learning models.
• Data Preprocessing
– Dimensionality reduction can be used as a preprocessing step
before applying machine learning algorithms to reduce the
dimensionality of the data and hence improve the performance of
the model.
• Improved Performance
– Dimensionality reduction can help in improving the performance
of machine learning models by reducing the complexity of the
data, and hence reducing the noise and irrelevant information in
the data.
Disadvantages of Dimensionality
Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables,
which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not
enough to define datasets.
• We may not know how many principal components to
keep; in practice, some rules of thumb are applied.
• Interpretability:
– The reduced dimensions may not be easily interpretable, and
it may be difficult to understand the relationship between
the original features and the reduced dimensions.
Disadvantages of Dimensionality
Reduction
• Overfitting
– In some cases, dimensionality reduction may lead to
overfitting, especially when the number of components is
chosen based on the training data.
• Sensitivity to outliers
– Some dimensionality reduction techniques are sensitive to
outliers, which can result in a biased representation of the
data.
• Computational complexity
– Some dimensionality reduction techniques, such as manifold
learning, can be computationally intensive, especially when
dealing with large datasets.
Important points
• Dimensionality reduction is the process of reducing the number of
features in a dataset while retaining as much information as possible.
• This can be done to reduce the complexity of a model, improve the
performance of a learning algorithm, or make it easier to visualize
the data.
• Techniques for dimensionality reduction include: principal
component analysis (PCA), singular value decomposition (SVD), and
linear discriminant analysis (LDA).
• Each technique projects the data onto a lower-dimensional space
while preserving important information.
• Dimensionality reduction is performed during the pre-processing stage,
before building a model, to improve performance.
• It is important to note that dimensionality reduction can also discard
useful information, so care must be taken when applying these
techniques.
