Unit-4 Dimensionality Reduction

This unit provides an overview of dimensionality reduction techniques in machine learning, including feature selection and extraction, Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA), Locally Linear Embedding (LLE), Isomap, and least squares optimization. It discusses the advantages and disadvantages of these techniques, their assumptions, steps for implementation, and various applications in data analysis and modeling. The main goal of these techniques is to reduce the number of features in a dataset while retaining essential information, improving model efficiency and preventing overfitting.

Introduction to Dimensionality Reduction:

While working with machine learning models, we often encounter datasets with a large
number of features. These datasets can lead to problems such as increased computation
time and overfitting. To address these issues, we use dimensionality reduction techniques.

Dimensionality reduction is the process of reducing the number of features (or dimensions) in a dataset while retaining as much information as possible. In other words, it is the process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.
What are Feature Selection and Feature Extraction?
Feature Selection
Feature selection chooses the most relevant features from the dataset without altering
them. It helps remove redundant or irrelevant features, improving model efficiency.

Feature Extraction
Feature extraction involves creating new features by combining or transforming the
original features.
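
As a hedged illustration of the difference, the short sketch below uses scikit-learn: SelectKBest keeps a subset of the original columns (feature selection), while PCA builds new features from combinations of all columns (feature extraction). The Iris dataset and the choice of two features are assumptions made only for this example.

# Minimal sketch contrasting feature selection and feature extraction
# (assumes scikit-learn is installed; dataset and k=2 are illustrative choices)
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 150 samples, 4 original features

# Feature selection: keep the 2 original features with the highest ANOVA F-scores
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: create 2 new features as linear combinations of all 4
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # (150, 2) (150, 2)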

Advantages of Dimensionality Reduction


As seen earlier, high dimensionality makes models inefficient. Let’s now summarize the
key advantages of reducing dimensionality.
 Faster Computation: With fewer features, machine learning algorithms can
process data more quickly. This results in faster model training and testing,
which is particularly useful when working with large datasets.
 Better Visualization: Reducing dimensions makes it easier to visualize data, revealing hidden patterns.
 Prevent Overfitting: With fewer features, models are less likely to memorize
the training data and overfit. This helps the model generalize better to new,
unseen data, improving its ability to make accurate predictions.
Disadvantages of Dimensionality Reduction
 Data Loss & Reduced Accuracy – Some important information may be lost
during dimensionality reduction, potentially affecting model performance.
 Interpretability Challenges – The transformed features (e.g., principal
components) may not have clear meanings, making it harder to understand
relationships in the original data.
 Choosing the Right Components – Deciding how many dimensions to keep is
difficult, as keeping too few may lose valuable information, while keeping too
many can lead to overfitting.

Linear Discriminant Analysis in Machine Learning

When dealing with a high-dimensional dataset, we must apply dimensionality reduction techniques so that we can explore the data and use it for modeling efficiently.

What is Linear Discriminant Analysis?


Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or
Discriminant Function Analysis, is a dimensionality reduction technique primarily utilized
in supervised classification problems. It facilitates the modeling of distinctions between
groups, effectively separating two or more classes. LDA operates by projecting features
from a higher-dimensional space into a lower-dimensional one. In machine learning, LDA
serves as a supervised learning algorithm specifically designed for classification tasks,
aiming to identify a linear combination of features that optimally segregates classes within
a dataset.

For example, suppose we have two classes that we need to separate efficiently. Classes can have multiple features. Using only a single feature to classify them may result in some overlap between the classes, so we keep increasing the number of features for proper classification.

Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that the covariance matrices of
the different classes are equal. It also assumes that the data is linearly separable, meaning
that a linear decision boundary can accurately classify the different classes.
Suppose we have two sets of data points belonging to two different classes that we want to classify. When the data points are plotted on a 2D plane, there is no straight line that can separate the two classes of data points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used, which reduces the 2D graph into a 1D graph in order to maximize the separability between the two classes.

(Figure: a linearly separable dataset)

Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories
and hence, reduces the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:


1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
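
These two criteria are commonly combined into a single objective known as Fisher's criterion. As a brief sketch (standard LDA theory, not part of the original text), LDA chooses the projection direction w that maximizes

J(w) = (w^T S_B w) / (w^T S_W w)

where S_B is the between-class scatter matrix (spread of the class means) and S_W is the within-class scatter matrix (spread within each class).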

(Figure: the new axis and the perpendicular distance between the line and the points)


In the above graph, it can be seen that a new axis (in red) is generated and plotted in the
2D graph such that it maximizes the distance between the means of the two classes and
minimizes the variation within each class. In simple terms, this newly generated axis
increases the separation between the data points of the two classes. After generating this
new axis using the above-mentioned criteria, all the data points of the classes are plotted
on this new axis and are shown in the figure given below.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.

Steps:
1. Calculating mean vectors for each class.
2. Computing within-class and between-class scatter matrices to understand the
distribution and separation of classes.
3. Solving for the eigenvalues and eigenvectors that maximize the between-class variance
relative to the within-class variance. This defines the optimal projection space to
distinguish the classes.
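
A minimal sketch of LDA as a supervised dimensionality reduction step, using scikit-learn; the Iris dataset and the number of components are assumptions made for illustration:

# Minimal LDA sketch with scikit-learn (dataset choice is illustrative only)
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 4 features, 3 classes

# LDA can keep at most (number of classes - 1) components, here 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)          # project data onto the discriminant axes

print(X_lda.shape)                       # (150, 2)
print(lda.explained_variance_ratio_)     # share of between-class variance per axis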

Advantages & Disadvantages of using LDA


Advantages of using LDA
1. It is a simple and computationally efficient algorithm.
2. It can work well even when the number of features is much larger than the
number of training samples.
3. It can handle multicollinearity (correlation between features) in the data.
Disadvantages of LDA
1. It assumes that the data has a Gaussian distribution, which may not always be
the case.
2. It assumes that the covariance matrices of the different classes are equal,
which may not be true in some datasets.
3. It assumes that the data is linearly separable, which may not be the case for
some datasets.
4. It may not perform well in high-dimensional feature spaces.
Applications:
 Facial Recognition
 Medical Diagnostics
 Marketing-Customer Segmentation

Principal Component Analysis (PCA)


Having too many features in data can cause problems like overfitting (good performance on training data but poor on new data), slower computation, and lower accuracy. This is called the curse of dimensionality, where more features exponentially increase the amount of data needed for reliable results.
The explosion of feature combinations makes sampling harder in high-dimensional data and makes tasks like clustering or classification more complex and slow.

PCA works by transforming high-dimensional data into a lower-dimensional space while maximizing the variance (or spread) of the data in the new space. This helps preserve the most important patterns and relationships in the data.
Note: PCA prioritizes the directions where the data varies the most, because more variation means more useful information.

PCA is an unsupervised learning algorithm, meaning it doesn't require prior knowledge of target variables. It's commonly used in exploratory data analysis and machine learning to simplify datasets without losing critical information.
This may sound complicated, so let us understand it with the help of a visual example in which the x-axis (Radius) and y-axis (Area) represent two original features in the dataset.

Principal Components (PCs):


 PC₁ (First Principal Component): The direction along which the data has the
maximum variance. It captures the most important information.
 PC₂ (Second Principal Component): The direction orthogonal (perpendicular) to
PC₁. It captures the remaining variance but is less significant.
The red dashed lines indicate the spread (variance) of the data along different directions. The variance along PC₁ is greater than that along PC₂, which means that PC₁ carries more useful information about the dataset.
 The data points (blue dots) are projected onto PC₁, effectively reducing the
dataset from two dimensions (Radius, Area) to one dimension (PC₁).
 This transformation simplifies the dataset while retaining most of the original
variability.

This visual example explains why PCA selects the direction with the highest variance (PC₁). By removing PC₂, we reduce redundancy while keeping essential information. The transformation helps in data compression, visualization, and improved model performance.

Principal Component Analysis (PCA) Steps:


1. Standardize the Data
o If the dataset features are on different scales, standardize them by
subtracting the mean and dividing by the standard deviation.
2. Compute the Covariance Matrix
o Calculate the covariance matrix for the standardized dataset to understand
feature relationships.
3. Compute Eigenvectors and Eigenvalues
o Determine the eigenvectors and eigenvalues of the covariance matrix.
o Eigenvectors indicate the directions of maximum variance, while
eigenvalues represent the magnitude of variance along those directions.
4. Sort Eigenvectors by Eigenvalues
o Arrange the eigenvectors in descending order based on their corresponding
eigenvalues.
5. Choose Principal Components
o Select the top k eigenvectors (principal components) based on the desired
dimensionality of the reduced dataset.
6. Transform the Data
o Multiply the original standardized data by the selected principal
components to obtain a lower-dimensional representation.
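
The six steps above can be sketched directly in NumPy. This is a minimal illustration (the toy data and the choice of k are assumptions), not a replacement for a library implementation such as sklearn.decomposition.PCA:

import numpy as np

def pca(X, k):
    # 1. Standardize the data (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvectors (directions) and eigenvalues (variance magnitudes)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by eigenvalue in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]
    # 5. Choose the top k principal components
    components = eigenvectors[:, :k]
    # 6. Transform: project the standardized data onto the components
    return X_std @ components

X = np.random.rand(100, 5)       # toy data: 100 samples, 5 features
X_reduced = pca(X, k=2)          # reduced to shape (100, 2)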
Applications of PCA
 Exploratory Data Analysis – Helps in identifying patterns and trends in high-
dimensional data.
 Predictive Modeling – Reduces dimensionality while preserving essential
information, improving model performance.
 Image Compression – Reduces storage and computational costs by representing
images with fewer principal components.
 Genomics & Pattern Recognition – Identifies meaningful patterns in genetic data
for research and medical applications.
 Financial Data Analysis – Detects hidden trends, correlations, and risk factors in
large financial datasets.
 Visualization of Complex Datasets – Projects high-dimensional data into lower
dimensions for better interpretability.

Factor Analysis
Factor analysis is a statistical method used to analyze the relationships among a set of
observed variables by explaining the correlations or covariances between them in terms
of a smaller number of unobserved variables called factors.

Factor analysis, a method within the realm of statistics and part of the general linear
model (GLM), serves to condense numerous variables into a smaller set of factors. By
doing so, it captures the maximum shared variance among the variables and condenses
them into a unified score, which can subsequently be utilized for further analysis. Factor analysis operates under several assumptions:

1. Linearity: The relationships between variables and factors are assumed to be linear.
2. Multivariate Normality: The variables in the dataset should follow a
multivariate normal distribution.
3. No Multicollinearity: Variables should not be highly correlated with each other,
as high multicollinearity can affect the stability and reliability of the factor
analysis results.
4. Adequate Sample Size: Factor analysis generally requires a sufficient sample
size to produce reliable results. The adequacy of the sample size can depend on
factors such as the complexity of the model and the ratio of variables to cases.
5. Homoscedasticity: The variance of the variables should be roughly equal across
different levels of the factors.
6. Uniqueness: Each variable should have unique variance that is not explained by
the factors. This assumption is particularly important in common factor
analysis.
7. Independent Observations: The observations in the dataset should be
independent of each other.
8. Linearity of Factor Scores: The relationship between the observed variables
and the latent factors is assumed to be linear, even though the observed
variables may not be linearly related to each other.
9. Interval or Ratio Scale: Factor analysis typically assumes that the variables are
measured on interval or ratio scales, as opposed to nominal or ordinal scales.
Violation of these assumptions can lead to biased parameter estimates and inaccurate
interpretations of the results. Therefore, it’s important to assess the data for these
assumptions before conducting factor analysis and to consider potential remedies or
alternative methods if the assumptions are not met.

Here are the general steps involved in conducting a factor analysis:


1. Determine the Suitability of Data for Factor Analysis
 Bartlett’s Test: Check the significance level to determine if the correlation
matrix is suitable for factor analysis.
 Kaiser-Meyer-Olkin (KMO) Measure: Verify the sampling adequacy. A value
greater than 0.6 is generally considered acceptable.
2. Choose the Extraction Method
 Principal Component Analysis (PCA): Used when the main goal is data
reduction.
 Principal Axis Factoring (PAF): Used when the main goal is to identify
underlying factors.
3. Factor Extraction
 Use the chosen extraction method to identify the initial factors.
 Extract eigenvalues to determine the number of factors to retain. Factors with
eigenvalues greater than 1 are typically retained in the analysis.
 Compute the initial factor loadings.
4. Determine the Number of Factors to Retain
 Scree Plot: Plot the eigenvalues in descending order to visualize the point
where the plot levels off (the “elbow”) to determine the number of factors to
retain.
 Eigenvalues: Retain factors with eigenvalues greater than 1.
5. Factor Rotation
 Orthogonal Rotation (Varimax, Quartimax): Assumes that the factors are
uncorrelated.
 Oblique Rotation (Promax, Oblimin): Allows the factors to be correlated.
 Rotate the factors to achieve a simpler and more interpretable factor structure.
 Examine the rotated factor loadings.
6. Interpret and Label the Factors
 Analyze the rotated factor loadings to interpret the underlying meaning of each
factor.
 Assign meaningful labels to each factor based on the variables with high
loadings on that factor.
7. Compute Factor Scores (if needed)
 Calculate the factor scores for each individual to represent their value on each
factor.
8. Report and Validate the Results
 Report the final factor structure, including factor loadings and communalities.
 Validate the results using additional data or by conducting a confirmatory
factor analysis if necessary.
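
As a hedged sketch, scikit-learn's FactorAnalysis can carry out the extraction and scoring steps above (it does not provide Bartlett's test or the KMO measure, which would need a separate package such as factor_analyzer). The number of factors and the varimax rotation below are assumptions made for illustration:

# Minimal factor-analysis sketch (number of factors and rotation are assumptions)
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.rand(200, 6)                     # toy data: 200 observations, 6 variables

fa = FactorAnalysis(n_components=2, rotation="varimax")   # rotation requires scikit-learn >= 0.24
fa.fit(X)

loadings = fa.components_.T                    # rows = variables, columns = factors
communalities = (loadings ** 2).sum(axis=1)    # variance of each variable explained by the factors
scores = fa.transform(X)                       # factor scores for each observation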
Let us discuss some of these Factor Analysis terms:
1. Factor Loadings:
 Factor loadings represent the correlations between the
observed variables and the underlying factors in factor
analysis. They indicate the strength and direction of the
relationship between each variable and each factor.
 Squaring a standardized factor loading gives the proportion of variance in the variable that is explained by that factor.
2. Communality:
 Communality is the sum of the squared factor loadings for a given variable across all factors. It measures the proportion of variance in a variable that is explained by all the factors jointly.
 Communality can be interpreted as the reliability of the
variable in the context of the factors being considered.
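For example (a small illustrative calculation, not from the original text): if a variable has standardized loadings of 0.8 on Factor 1 and 0.3 on Factor 2, its communality is 0.8^2 + 0.3^2 = 0.64 + 0.09 = 0.73, meaning the two factors jointly explain 73% of that variable's variance; the remaining 27% is the variable's unique variance.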

Applications of Factor Analysis
 Dimensionality Reduction – Reduces the number of features
while retaining essential information.
 Identifying Latent Constructs – Helps uncover hidden patterns
and relationships within the data.
 Data Summarization – Condenses large datasets into a smaller
set of meaningful components.
 Hypothesis Testing – Assists in validating assumptions by
analyzing underlying data structures.
 Variable Selection – Identifies the most significant features,
improving model efficiency.
 Enhancing Predictive Models – Reduces noise and
multicollinearity, leading to better model performance.

Independent Component Analysis:


Independent Component Analysis is a technique used to separate
mixed signals into their independent sources. The application of ICA
ranges from audio and image processing to biomedical signal analysis.

Independent Component Analysis (ICA) is a statistical and computational technique used in machine learning to separate a multivariate signal into its independent non-Gaussian components. The goal of ICA is to find a linear transformation of the data such that the transformed data is as close to being statistically independent as possible.
[X1, X2, …, Xn] => [Y1, Y2, …, Yn]
where X1, X2, …, Xn are the observed mixed signals and Y1, Y2, …, Yn are the new features: independent components that are independent of each other.
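In the standard formulation (stated here as a brief clarification), the observed signals are modeled as linear mixtures X = A·S of unknown sources S through an unknown mixing matrix A, and ICA estimates an unmixing matrix W such that Y = W·X recovers components that are as statistically independent as possible.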

The heart of ICA lies in the principle of statistical independence: ICA identifies components within mixed signals that are statistically independent of each other.
ICA excels in applications like the "cocktail party problem," where it isolates distinct audio streams amid noise without prior source information.
Assumptions in ICA
1. The first assumption asserts that the source signals (original
signals) are statistically independent of each other.
2. The second assumption is that each source signal has a non-Gaussian distribution.

Steps of Independent Component Analysis (ICA)


1. Centering – Adjusts the data to have a zero mean, ensuring that
analysis focuses on variance rather than absolute values.
2. Whitening – Transforms the data into uncorrelated variables,
making it easier to separate independent components.
3. Independent Component Extraction – Uses iterative
optimization techniques to identify statistically independent
components.
o Often incorporates PCA or Singular Value Decomposition
(SVD) at the beginning to reduce dimensionality, improving
efficiency and robustness.
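
A minimal cocktail-party-style sketch using scikit-learn's FastICA; the synthetic sources and the mixing matrix are assumptions made for illustration:

# Minimal ICA sketch: mix two known sources, then recover them with FastICA
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                       # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))              # source 2: square wave
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                # assumed mixing matrix
              [0.5, 2.0]])
X = S @ A.T                              # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                 # estimated independent components

print(Y.shape)                           # (2000, 2)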

Advantages of Independent Component Analysis (ICA):


 ICA is a powerful tool for separating mixed signals into their
independent components. This is useful in a variety of
applications, such as signal processing, image analysis, and
data compression.
 ICA is a non-parametric approach, which means that it does
not require assumptions about the underlying probability
distribution of the data.
 ICA is an unsupervised learning technique, which means
that it can be applied to data without the need for labeled
examples. This makes it useful in situations where labeled data
is not available.
 ICA can be used for feature extraction, which means that it
can identify important features in the data that can be used for
other tasks, such as classification.

Disadvantages of Independent Component Analysis (ICA):
 ICA assumes that the underlying sources are non-Gaussian,
which may not always be true. If the underlying sources are
Gaussian, ICA may not be effective.
 ICA assumes that the sources are mixed linearly, which may not
always be the case. If the sources are mixed nonlinearly, ICA
may not be effective.
 ICA can be computationally expensive, especially for large
datasets. This can make it difficult to apply ICA to real-world
problems.
 ICA can suffer from convergence issues, which means that it
may not always be able to find a solution. This can be a problem
for complex datasets with many sources.

Locally Linear Embedding:


LLE(Locally Linear Embedding) is an unsupervised approach
designed to transform data from its original high-dimensional
space into a lower-dimensional representation, all while striving
to retain the essential geometric characteristics of the
underlying non-linear feature structure.

Locally Linear Embedding Algorithm


The LLE algorithm can be broken down into several steps:
 Neighborhood Selection: For each data point in the
high-dimensional space, LLE identifies its k-nearest
neighbors. This step is crucial because LLE assumes that
each data point can be well approximated by a linear
combination of its neighbors.
 Weight Matrix Construction: LLE computes a set of
weights for each data point to express it as a linear
combination of its neighbors. These weights are
determined in such a way that the reconstruction error is
minimized. Linear regression is often used to find these
weights.
 Global Structure Preservation: After constructing the
weight matrix, LLE aims to find a lower-dimensional
representation of the data that best preserves the local
linear relationships. It does this by seeking a set of
coordinates in the lower-dimensional space for each data
point that minimizes a cost function. This cost
function evaluates how well each data point can be
represented by its neighbors.
 Output Embedding: Once the optimization process is
complete, LLE provides the final lower-dimensional
representation of the data. This representation captures
the essential structure of the data while reducing its
dimensionality.
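
A minimal sketch using scikit-learn's LocallyLinearEmbedding on a synthetic Swiss-roll manifold; the dataset and the values of n_neighbors and n_components are assumptions made for illustration:

# Minimal LLE sketch (parameter values are illustrative choices)
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)   # 3-D non-linear manifold

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_embedded = lle.fit_transform(X)        # 2-D embedding preserving local linear structure

print(X_embedded.shape)                  # (1000, 2)
print(lle.reconstruction_error_)         # how well neighbors reconstruct each point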

Advantages of LLE
The dimensionality reduction method known as locally linear
embedding (LLE) has many benefits for data processing and
visualization. The following are LLE's main benefits:
 Preservation of Local Structures: LLE is excellent at maintaining local relationships or structures within the data. It successfully captures the inherent geometry of nonlinear manifolds by preserving pairwise distances between nearby data points.
 Handling Non-Linearity: LLE has the ability to capture
nonlinear patterns and structures in the data, in contrast
to linear techniques like Principal Component
Analysis (PCA). When working with complicated, curved,
or twisted datasets, it is especially helpful.
 Dimensionality Reduction: LLE lowers the
dimensionality of the data while preserving its
fundamental properties. Particularly when working with
high-dimensional datasets, this reduction makes data
presentation, exploration, and analysis simpler.

Disadvantages of LLE
 Curse of Dimensionality: LLE can experience the
"curse of dimensionality" when used with extremely high-
dimensional data, just like many other dimensionality
reduction approaches. The number of neighbors required
to capture local interactions rises as dimensionality does,
potentially increasing the computational cost of the
approach.
 Memory and computational Requirements: For big
datasets, creating a weighted adjacency matrix as part of
LLE might be memory-intensive. The eigenvalue
decomposition stage can also be computationally taxing
for big datasets.
 Outliers and Noisy Data: LLE is susceptible to outliers and noisy data points. Outliers can distort the local linear relationships and degrade the quality of the embedding.

Isomap:
Isomap, short for isometric mapping, is a nonlinear dimensionality reduction method used in data analysis and machine learning. It was developed as a substitute for conventional techniques like Principal Component Analysis (PCA) in order to maintain the inherent geometry of high-dimensional data.
Isomap creates a low-dimensional representation, usually a two-
or three-dimensional map, by focusing on the preservation of
pairwise distances between data points.
This technique works especially well for extracting the
underlying structure from large, complex datasets, like those
from speech recognition, image analysis, and biological systems.
Finding patterns and insights in a variety of scientific and
engineering domains is made possible by Isomap's capacity to
highlight the fundamental relationships found in data.

Working of ISOMAP
 Calculate the pairwise distances: The algorithm
starts by calculating the Euclidean distances between the
data points.
 Find nearest neighbors according to these distances: For each data point, its k nearest neighbors are determined from these distances.
 Create a neighborhood graph: each point is connected by edges to its nearest neighbors, creating a graph that represents the data's local structure.
 Calculate geodesic distances: a shortest-path algorithm (such as Floyd-Warshall) computes the shortest paths between all pairs of data points in the neighborhood graph. These shortest-path lengths represent the geodesic distances.
 Perform dimensionality reduction: classical Multidimensional Scaling (MDS) is applied to the geodesic distance matrix, resulting in a low-dimensional embedding of the data.
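
A minimal Isomap sketch with scikit-learn, which internally performs the neighbor search, geodesic-distance computation, and MDS steps described above; the dataset and parameter values are assumptions made for illustration:

# Minimal Isomap sketch (n_neighbors and n_components are illustrative choices)
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=1000, random_state=0)   # 3-D S-shaped manifold

isomap = Isomap(n_neighbors=10, n_components=2)
X_embedded = isomap.fit_transform(X)     # 2-D embedding preserving geodesic distances

print(X_embedded.shape)                  # (1000, 2)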

Advantages and Disadvantages of Isomap:


Advantages:
 Capturing non-linear relationships: Unlike linear dimensionality reduction techniques such as PCA, Isomap is able to capture the underlying non-linear structure of the data.
 Global structure: Isomap aims to preserve the overall relationships between data points, giving a better representation of the entire manifold.
 Globally optimal: the algorithm guarantees a globally optimal solution for the geodesic distances defined on the constructed neighborhood graph.

Disadvantages:
 Computational cost: for large datasets, computing geodesic distances with a shortest-path algorithm such as Floyd-Warshall can be computationally expensive and lead to long run times.
 Sensitive to parameter settings: an incorrect choice of parameters (such as the number of neighbors) may lead to a distorted or misleading embedding.
 Topological complexity: Isomap may not perform well on manifolds that contain holes or other topological complexity, which can lead to inaccurate representations.

Applications of Isomap
 Visualization: High-dimensional data like face images
can be visualized in a lower-dimensional space, enabling
easier exploration and understanding.
 Data exploration: Isomap can help identify clusters and
patterns within the data that are not readily apparent in
the original high-dimensional space.
 Anomaly detection: Outliers that deviate significantly
from the underlying manifold can be identified using
Isomap.
 Machine learning tasks: Isomap can be used as a pre-
processing step for other machine learning tasks, such as
classification and clustering, by improving the
performance and interpretability of the models.
Least Squares Optimization:
 Least squares optimization is a mathematical technique
that minimizes the sum of squared residuals to find the
best-fitting curve for a set of data points.
 It's a type of regression analysis that's often used by
statisticians and traders to identify trends and trading
opportunities

Steps:
1. Determine the equation of the line you believe best fits the data.
 Denote the independent variable values as xi and the dependent ones as yi.
 Calculate the average values of xi and yi, denoted X and Y.
 Presume the equation of the line of best fit as y = mx + c, where m is the slope of the line and c represents the intercept of the line on the Y-axis.
 The slope m and intercept c can be calculated from the following formulas:
m = Σ(xi - X)(yi - Y) / Σ(xi - X)^2
c = Y - mX
Thus, we obtain the line of best fit as y = mx + c.
2. Calculate the residuals (differences) between the observed values and the values predicted by your model.
3. Square each of these residuals and sum them up.
4. Adjust the model (m and c) to minimize this sum.
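
The steps above can be sketched in NumPy; the data points below are assumptions made for illustration, and np.polyfit is shown only as a cross-check:

import numpy as np

# Toy data points (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Step 1: means of the independent and dependent values
X_bar, Y_bar = x.mean(), y.mean()

# Slope and intercept from the least-squares formulas
m = np.sum((x - X_bar) * (y - Y_bar)) / np.sum((x - X_bar) ** 2)
c = Y_bar - m * X_bar

# Steps 2-4: residuals and the sum of squared residuals being minimized
residuals = y - (m * x + c)
sse = np.sum(residuals ** 2)

print(m, c, sse)
print(np.polyfit(x, y, 1))   # cross-check: returns [slope, intercept]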
