Unit 2-2
Manifold learning is a technique for dimensionality reduction used in machine learning that seeks to
preserve the underlying structure of high-dimensional data while representing it in a lower-
dimensional environment. This technique is particularly useful when the data has a non-linear
structure that cannot be adequately captured by linear approaches like Principal Component Analysis
(PCA).
• Makes feature extraction easier, identifies important patterns, and reduces noise.
• Boost the effectiveness of machine learning algorithms by keeping the data’s natural
structure.
• Provide more accurate modeling and forecasting, which is especially helpful when dealing
with data that linear techniques are unable to fully model.
In this post, we will examine four manifold learning algorithms:
• t-SNE (t-distributed Stochastic Neighbor Embedding)
• Isomap
• Locally Linear Embedding (LLE)
• Multi-Dimensional Scaling (MDS)
We will utilize the scikit-learn digits dataset, which contains images of the digits (0-9) encoded as 8×8
pixel arrays. Each image therefore has 64 features that indicate pixel intensities.
Steps: for each algorithm, we load the digits data, fit a 2D embedding, and plot the result.
t-SNE is an effective method for displaying high-dimensional data. It is very helpful for constructing
2D or 3D representations of complicated data. t-SNE is based on the concept of probability
distributions, and it attempts to minimize the divergence between two probability distributions, one
measuring pairwise similarities between data points in high-dimensional space and the other
measuring pairwise similarities between data points in low-dimensional space. t-SNE produces a 2D
or 3D display of the data.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
y = digits.target
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=10)
plt.show()
Output:
A 2D scatter plot of the t-SNE embedding, with points colored by digit class.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
y = digits.target
isomap = Isomap(n_components=2)
X_isomap = isomap.fit_transform(X)
plt.scatter(X_isomap[:, 0], X_isomap[:, 1], c=y, cmap='tab10', s=10)
plt.show()
Output:
A 2D scatter plot of the Isomap embedding, with points colored by digit class.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
y = digits.target
X_lle = LocallyLinearEmbedding(n_components=2).fit_transform(X)
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y, cmap='tab10', s=10)
plt.show()
Output:
A 2D scatter plot of the Locally Linear Embedding, with points colored by digit class.
MDS is a dimensionality reduction approach that is based on the idea of maintaining the pairwise
distances between data points. MDS seeks a lower-dimensional representation of the data that
retains pairwise distances between data points. MDS is very helpful when working with linear data
structures.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.data
y = digits.target
X_mds = MDS(n_components=2, random_state=42).fit_transform(X)
plt.scatter(X_mds[:, 0], X_mds[:, 1], c=y, cmap='tab10', s=10)
plt.show()
Output:
A 2D scatter plot of the Multi-Dimensional Scaling embedding, with points colored by digit class.
Isomap
Understanding and representing complicated data structures is crucial in machine learning, and manifold learning, a subset of unsupervised learning, plays a significant role here. Among manifold learning techniques, ISOMAP (Isometric Mapping) stands out for its prowess in capturing the intrinsic geometry of high-dimensional data, and it has proved particularly effective in situations where linear methods fall short.
ISOMAP is a flexible tool that blends manifold learning and dimensionality reduction with the aim of obtaining a more detailed picture of the underlying structure of the data. This article takes a look at ISOMAP's inner workings and sheds light on its parameters, its functioning, and its implementation with scikit-learn.
Manifold Learning
The idea of an isometric map, which aims to preserve pairwise distances between points, is central to ISOMAP. It seeks a low-dimensional representation of the data while keeping the geodesic distances, the shortest paths along the curved surface of the data manifold, as faithful as possible. This is particularly important in situations where the underlying structure is bent or folded, since traditional methods such as PCA cannot take these nuances into account.
Understanding the distinction between geodesic and Euclidean distances is of vital importance for ISOMAP. The geodesic distance is the length of the shortest path along the curved surface of the manifold, as opposed to the Euclidean distance, which is the straight-line distance measured in the input space. By working with geodesic distances, ISOMAP provides a more precise representation of the data's intrinsic structure.
ISOMAP Parameters
ISOMAP comes with several parameters, each influencing the dimensionality reduction process (a short usage sketch follows the list):
• eigen_solver: Determines the method used for the eigenvalue decomposition. Options include "auto", "arpack", and "dense".
• radius: You can designate a radius within which neighbors are taken into account in place of
using a set number of neighbors. Outside of this range, data points are not regarded as
neighbors.
• tol: tolerance in the eigenvalue solver to attain convergence. While a lower value might
result in a more accurate solution, it might also lengthen the computation time.
• max_iter: The maximum number of times the eigenvalue solver can run. It continues if None
is selected, unless convergence or additional stopping conditions are satisfied.
• path_method: chooses the approximation technique for geodesic distances on the graph.
‘auto’ (automatic selection) and ‘FW’ (Floyd-Warshall algorithm) are available options.
• metric: The nearest neighbor search’s distance metric. ‘Minkowski‘ is the default; however,
‘euclidean’,’manhattan’, and several other options are also available.
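As noted above, here is a minimal usage sketch that maps these parameters onto scikit-learn's Isomap class; the specific values are illustrative assumptions, not tuned recommendations.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)

# Illustrative parameter choices only
iso = Isomap(
    n_components=2,       # target dimensionality
    n_neighbors=10,       # neighborhood size used to build the graph
    eigen_solver="auto",  # 'auto', 'arpack' or 'dense'
    path_method="auto",   # shortest-path method for geodesic distances
    metric="minkowski",   # distance metric for the neighbor search
)
X_iso = iso.fit_transform(X)
print(X_iso.shape)  # (1797, 2)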
Working of ISOMAP
• Calculate pairwise distances: The algorithm starts by computing the Euclidean distances between the data points.
• Find nearest neighbors: For each data point, its k nearest neighbors are determined from these distances.
• Create a neighborhood graph: Each point is connected by edges to its nearest neighbors, producing a graph that represents the data's local structure.
• Calculate geodesic distances: A shortest-path algorithm such as Floyd-Warshall is run over the neighborhood graph for all pairs of data points; the lengths of these shortest paths are taken as the geodesic distances.
• Perform dimensionality reduction: Classical Multi-Dimensional Scaling (MDS) is applied to the geodesic distance matrix, yielding the low-dimensional embedding of the data (see the sketch after this list).
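To make these steps concrete, here is a rough sketch that approximates the pipeline by hand: it builds a k-nearest-neighbor graph, computes graph shortest paths as geodesic distances, and embeds them with metric MDS. It is an illustrative approximation on a small subset (it also assumes the neighborhood graph is connected), not scikit-learn's internal implementation.
• Python3
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

X, _ = load_digits(return_X_y=True)
X = X[:300]  # small subset keeps the all-pairs shortest paths cheap

# Steps 1-3: distances, k nearest neighbors, neighborhood graph
knn = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Step 4: geodesic distances = shortest paths along the graph
geodesic = shortest_path(knn, method="FW", directed=False)

# Step 5: metric MDS on the geodesic distance matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(geodesic)
print(embedding.shape)  # (300, 2)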
Advantages
• Capturing non-linear relationships: Unlike linear dimensionality reduction techniques such as PCA, Isomap is able to capture the underlying non-linear structure of the data.
• Global structure: Isomap aims to preserve the overall relationships between data points, which gives a better representation of the entire manifold.
• Globally optimal: On the constructed neighborhood graph, where the geodesic distances are defined, the algorithm finds a globally optimal embedding.
Disadvantages
• Computational cost: For large datasets, computing geodesic distances with a Floyd-Warshall-style algorithm can be computationally expensive and lead to long run times.
• Sensitive to parameter settings: A poor choice of parameters (for example, the number of neighbors) may lead to a distorted or misleading embedding.
• Topological complexity: Isomap does not perform well on manifolds that contain holes or other topological complexity, which may lead to inaccurate representations.
Applications of Isomap
• Data exploration: Isomap can help identify clusters and patterns within the data that are not
readily apparent in the original high-dimensional space.
• Anomaly detection: Outliers that deviate significantly from the underlying manifold can be
identified using Isomap.
• Machine learning tasks: Isomap can be used as a pre-processing step for other machine
learning tasks, such as classification and clustering, by improving the performance and
interpretability of the models.
LLE (Locally Linear Embedding) is an unsupervised approach designed to transform data from its original high-dimensional space into a lower-dimensional representation, while striving to retain the essential geometric characteristics of the underlying non-linear feature structure. LLE operates in several key steps:
• First, it constructs a nearest-neighbors graph to capture local relationships. It then optimizes weight values for each data point, aiming to minimize the reconstruction error when expressing a point as a linear combination of its neighbors. This weight matrix reflects the strength of the connections between points.
• Finally, it finds low-dimensional coordinates that are best reconstructed by these same weights, so that the local geometry is preserved in the embedding.
As an illustration, consider a Swiss roll dataset, which is inherently non-linear in its high-dimensional space. LLE works to project this complex structure onto a lower-dimensional plane, preserving its distinctive geometric properties throughout the transformation.
Mathematical Implementation of LLE Algorithm
The key idea of LLE is that locally, in the vicinity of each data point, the data lies approximately on a
linear subspace. LLE attempts to unfold or unroll the data while preserving these local linear
relationships.
Minimize: E(W) = ∑i ‖xi − ∑j wij xj‖²
Subject to: ∑j wij = 1 for every data point xi
Where:
• wij are the weights that minimize the reconstruction error for data point xi using its neighbors (wij = 0 whenever xj is not one of the k nearest neighbors of xi). A small numerical sketch of this weight-fitting step is given after this block.
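As referenced above, a minimal NumPy sketch of the weight-fitting step for a single point; the toy data, neighborhood size, and regularization constant are arbitrary illustrative choices.
• Python3
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # toy data: 20 points in 5 dimensions
i, k = 0, 4                    # fit weights for point 0 using 4 neighbors

# k nearest neighbors of x_i (excluding the point itself)
d = np.linalg.norm(X - X[i], axis=1)
nbr = np.argsort(d)[1:k + 1]

# Local Gram matrix C_jl = (x_i - x_j) . (x_i - x_l)
Z = X[i] - X[nbr]
C = Z @ Z.T
C += 1e-3 * np.trace(C) * np.eye(k)   # small regularization for stability

# Solve C w = 1, then rescale so the weights sum to 1
w = np.linalg.solve(C, np.ones(k))
w /= w.sum()

print(w.round(3), round(float(np.linalg.norm(X[i] - w @ X[nbr])), 4))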
It aims to find a lower-dimensional representation of data while preserving local relationships. The
mathematical expression for LLE involves minimizing the reconstruction error of each data point by
expressing it as a weighted sum of its k nearest neighbors‘ contributions. This optimization is subject
to constraints ensuring that the weights sum to 1 for each data point. Locally Linear Embedding (LLE)
is a dimensionality reduction technique used in machine learning and data analysis. It focuses on
preserving local relationships between data points when mapping high-dimensional data to a lower-
dimensional space. Here, we will explain the LLE algorithm and its parameters.
• Neighborhood Selection: For each data point in the high-dimensional space, LLE identifies its
k-nearest neighbors. This step is crucial because LLE assumes that each data point can be
well approximated by a linear combination of its neighbors.
• Weight Matrix Construction: LLE computes a set of weights for each data point to express it
as a linear combination of its neighbors. These weights are determined in such a way that
the reconstruction error is minimized. Linear regression is often used to find these weights.
• Global Structure Preservation: After constructing the weight matrix, LLE aims to find a
lower-dimensional representation of the data that best preserves the local linear
relationships. It does this by seeking a set of coordinates in the lower-dimensional space for
each data point that minimizes a cost function. This cost function evaluates how well each
data point can be represented by its neighbors.
• Output Embedding: Once the optimization process is complete, LLE provides the final lower-
dimensional representation of the data. This representation captures the essential structure
of the data while reducing its dimensionality.
• Dimensionality of Output Space: You can specify the dimensionality of the lower-
dimensional space to which the data will be mapped. This is often chosen based on the
problem’s requirements and the trade-off between computational complexity and
information preservation.
• Distance Metric: LLE relies on a distance metric to define the proximity between data points.
Common choices include Euclidean distance, Manhattan distance, or custom-defined
distance functions. The choice of distance metric can impact the results.
• Regularization (Optional): In some cases, regularization terms are added to the cost function
to prevent overfitting. Regularization can be useful when dealing with noisy data or when the
number of neighbors is high.
• Optimization Algorithm (Optional): LLE often uses optimization techniques like Singular Value Decomposition (SVD) or eigenvector methods to find the lower-dimensional representation. These optimization methods may have their own parameters that can be adjusted (see the sketch after this list).
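As referenced above, a hedged sketch of how these choices appear in scikit-learn's LocallyLinearEmbedding; the parameter values are illustrative rather than tuned.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)

lle = LocallyLinearEmbedding(
    n_neighbors=30,    # neighborhood size used for the local fits
    n_components=2,    # dimensionality of the output space
    reg=1e-3,          # regularization added to the local Gram matrices
    method="standard"  # 'standard', 'modified', 'hessian' or 'ltsa'
)
X_lle = lle.fit_transform(X)
print(X_lle.shape, lle.reconstruction_error_)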
Enhanced computational efficiency with LLE: LLE can be computationally efficient because the weight matrix it constructs is sparse, which often compares favorably with other manifold learning algorithms.
Advantages of LLE
The dimensionality reduction method known as locally linear embedding (LLE) has many benefits for
data processing and visualization. The following are LLE’s main benefits:
• Handling Non-Linearity: LLE has the ability to capture nonlinear patterns and structures in
the data, in contrast to linear techniques like Principal Component Analysis (PCA). When
working with complicated, curved, or twisted datasets, it is especially helpful.
• Dimensionality Reduction: LLE lowers the dimensionality of the data while preserving its
fundamental properties. Particularly when working with high-dimensional datasets, this
reduction makes data presentation, exploration, and analysis simpler.
Disadvantages of LLE
• Curse of Dimensionality: LLE can experience the “curse of dimensionality” when used with
extremely high-dimensional data, just like many other dimensionality reduction approaches.
The number of neighbors required to capture local interactions rises as dimensionality does,
potentially increasing the computational cost of the approach.
• Memory and computational Requirements: For big datasets, creating a weighted adjacency
matrix as part of LLE might be memory-intensive. The eigenvalue decomposition stage can
also be computationally taxing for big datasets.
• Outliers and noisy data: LLE is sensitive to outliers and noisy data points. Outliers can distort the local linear relationships and degrade the quality of the embedding.
Spectral Embedding
Data is projected onto a lower-dimensional subspace using the spectral embedding method, which reduces the dimensionality of the data while retaining some of its original characteristics. It is based on using the eigenvectors of a matrix that encodes the affinity or similarity between the data points. The visualization of high-dimensional data, clustering, manifold learning, and other applications can all benefit from spectral embedding.
The idea of spectral embedding, how it functions, and how to apply it in Python using the scikit-
learn module are all covered in this article. We will also examine some examples of spectral
embedding being used on various datasets and contrast the outcomes with other approaches.
A dimensionality reduction method that is frequently applied in data analysis and machine learning is
called spectral embedding. High-dimensional data can be visualized and clustered with great benefit
from it. Based on spectral graph theory, spectral embedding shares a tight relationship with Principal
Component Analysis (PCA).
The first step in spectral embedding is to represent the data as a graph. There are several methods to
build this graph, including similarity, epsilon, and k-nearest-neighbor, among others. The graph’s
nodes stand in for data points, while the edges connecting them indicate similarities or pairwise
relationships.
The creation of the Laplacian matrix, which encodes the graph’s structure, comes next. Laplacian
matrices come in various forms, but the most widely used type is the unnormalized Laplacian or L.
It can be computed as:
L = D − W
Where,
L = Laplacian matrix
D = diagonal degree matrix, in which each diagonal entry Dii is the sum of the weights of the edges connected to node i
W = weighted adjacency matrix, where Wij represents the similarity or weight between nodes i and j
The eigenvalues and eigenvectors of the Laplacian matrix L must then be calculated. They are obtained by solving the following generalized eigenvalue problem:
L v = λ D v
where
λ = eigenvalues
v = corresponding eigenvectors
Once the eigenvalues and eigenvectors are obtained, dimensionality reduction can be carried out by
choosing the top k eigenvectors that match the lowest k eigenvalues. These k eigenvectors combine
to create a new matrix, Vk .
The eigenvectors are used as the new feature vectors for the data points in order to achieve spectral
embedding. The data points’ coordinates in the lower-dimensional space are determined by the k
eigenvectors in Vk. At this point, every data point is represented by a k-dimensional vector.
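A rough NumPy sketch of these steps on a small k-nearest-neighbor graph; the graph construction and the subset size are illustrative assumptions.
• Python3
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph
from scipy.linalg import eigh

X, _ = load_digits(return_X_y=True)
X = X[:200]

# Symmetric adjacency matrix W from a k-NN graph
W = kneighbors_graph(X, n_neighbors=10, mode="connectivity").toarray()
W = np.maximum(W, W.T)

# Degree matrix D and unnormalized Laplacian L = D - W
D = np.diag(W.sum(axis=1))
L = D - W

# Generalized eigenproblem L v = lambda D v; eigh returns ascending eigenvalues
vals, vecs = eigh(L, D)
embedding = vecs[:, 1:3]   # skip the constant eigenvector (eigenvalue 0)
print(embedding.shape)     # (200, 2)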
We can use the scikit-learn framework and the class named SpectralEmbedding to create a spectral embedding in Python. Several parameters of this class determine how the affinity matrix is built and how the eigenvalue decomposition is carried out (a short example follows the list). These are a few of the parameters:
• affinity: How to construct the affinity matrix. It can be one of {‘nearest_neighbors’, ‘rbf’,
‘precomputed’, ‘precomputed_nearest_neighbors’} or a callable function that takes in a data
matrix and returns an affinity matrix.
• gamma: The kernel coefficient for rbf kernel. If None, gamma will be set to 1/n_features.
• random_state: A pseudo random number generator used for initializing some algorithms.
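As noted above, a short example using scikit-learn's SpectralEmbedding on the digits data; the parameter values are illustrative.
• Python3
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

se = SpectralEmbedding(n_components=2,
                       affinity="nearest_neighbors",  # affinity matrix from a k-NN graph
                       random_state=42)
X_se = se.fit_transform(X)

plt.scatter(X_se[:, 0], X_se[:, 1], c=y, cmap="tab10", s=10)
plt.title("Spectral Embedding")
plt.show()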
There are several advantages of Spectral Embedding. Some of the advantages are:
• Preservation of Local and Global Structure: Both local and global structure in the data can
be effectively preserved using spectral embedding techniques. Both relationships between
close-by data points and relationships between distant data points can be captured by them.
Although spectral embedding techniques have many benefits, they also have several drawbacks and
restrictions:
• Scalability: Spectral embedding does not scale well to large datasets, because building the affinity matrix and solving the eigenvalue problem become expensive. It is better suited to dimensionality reduction and visualization of moderately sized data than to large-scale, high-dimensional data processing.
Multidimensional scaling (MDS) is a dimensionality reduction technique that is used to project high-
dimensional data onto a lower-dimensional space while preserving the pairwise distances between
the data points as much as possible. MDS is based on the concept of distance and aims to find a
projection of the data that minimizes the differences between the distances in the original space and
the distances in the lower-dimensional space.
MDS is commonly used to visualize complex, high-dimensional data, and to identify patterns and
relationships that may not be apparent in the original space. It can be applied to a wide range of data
types, including numerical, categorical, and mixed data. MDS is implemented using numerical
optimization algorithms, such as gradient descent or simulated annealing, to minimize the difference
between the distances in the original and lower-dimensional spaces.
Overall, MDS is a powerful and flexible technique for reducing the dimensionality of high-
dimensional data, and for revealing hidden patterns and relationships in the data. It is widely used in
many fields, including machine learning, data mining, and pattern recognition.
1. MDS is based on the concept of distance and aims to find a projection of the data that
minimizes the differences between the distances in the original space and the distances in
the lower-dimensional space. This allows MDS to preserve the relationships between the
data points, and to highlight patterns and trends that may not be apparent in the original
space.
2. MDS can be applied to a wide range of data types, including numerical, categorical, and
mixed data. This makes MDS a versatile tool that can be used with many different kinds of
data and allows it to handle complex multi-modal data sets.
4. MDS is widely used in many fields, including machine learning, data mining, and pattern
recognition. This makes it a well-established and widely-supported technique that has been
extensively tested and validated, and that has a large and active user community.
Its key features include its ability to handle a wide range of data types, its flexibility and adaptability, and its widespread use and support in many fields.
The mathematical foundation of MDS is the stress function, which measures the difference between the distances in the original space and the distances in the lower-dimensional space. A common form (Kruskal's Stress-1) is:
Stress = √( ∑i<j (dij − d̂ij)² / ∑i<j dij² )
where dij is the distance between data points i and j in the original space, d̂ij is the distance between data points i and j in the lower-dimensional space, and the sums run over all pairs 1 ≤ i < j ≤ n, with n the number of data points. The stress function measures how much the distances in the lower-dimensional space deviate from the distances in the original space and is used to evaluate the quality of the projection.
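A small sketch that computes this stress value for an MDS embedding of a subset of the digits; the subset size is an arbitrary choice to keep the run fast.
• Python3
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist

X, _ = load_digits(return_X_y=True)
X = X[:300]

X_mds = MDS(n_components=2, random_state=0).fit_transform(X)

d_high = pdist(X)      # pairwise distances in the original space
d_low = pdist(X_mds)   # pairwise distances in the embedding

stress1 = np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))
print(f"Kruskal stress-1: {stress1:.3f}")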
Like all techniques, MDS has some limitations and drawbacks that should be considered when using
it to analyze and visualize data.
1. It relies on the distances between the data points to define the projection and does not
consider other types of relationships between the data points, such as correlations or
associations. This means that MDS may not be suitable for data sets that have complex, non-
distance-based relationships, or that have missing or noisy distances.
2. It is sensitive to outliers and noise in the data, which can affect the quality of the projection
and the interpretability of the results. MDS may produce projections that are distorted or
misleading if the data contains outliers or noise, and may not accurately reflect the
underlying structure of the data.
3. It is a global optimization technique, which means that it finds a single projection that is
optimal for the entire data set. This can be problematic for data sets that have complex,
multi-modal structures, or that have multiple clusters or groups of data points, as MDS may
not be able to capture the local structure of the data within each group.
1. MDS is based on the concept of distance and aims to find a projection of the data that minimizes the differences between the distances in the original space and the distances in the lower-dimensional space. In contrast, PCA seeks the projection that maximizes variance in the lower-dimensional space, and t-SNE seeks the embedding that best preserves pairwise similarities (by minimizing the divergence between similarity distributions, as described above). This means that MDS is more focused on preserving the distance relationships between the data points, while PCA and t-SNE are more focused on summarizing the data and finding the most relevant dimensions or local structure.
2. MDS can be applied to a wide range of data types, including numerical, categorical, and
mixed data. In contrast, PCA and t-SNE are more suited to numerical data, and may not be as
effective with categorical or mixed data. This makes MDS a more versatile and flexible
technique and allows it to handle complex, multi-modal data sets.
3. MDS uses numerical optimization algorithms to find the projection that minimizes the stress function and best preserves the pairwise distances between the data points. In contrast, PCA uses linear algebra (an eigendecomposition of the covariance matrix) and t-SNE uses a stochastic, gradient-based optimization of its similarity objective. This means that MDS is a flexible and adaptable technique, and it can find projections that are different from those produced by PCA or t-SNE.
This helps us explore high-dimensional data as well: by mapping it into lower dimensions while retaining the local structures of the dataset, we can get a feel for the data by plotting and visualizing it in a 2D or perhaps 3D plane.
PCA and t-SNE are both unsupervised algorithms used to reduce the dimensionality of a dataset, but they differ in nature. PCA is a deterministic algorithm for reducing the dimensionality of the data, whereas t-SNE is a randomized, non-linear method for mapping high-dimensional data to a lower dimension. The data obtained after reducing the dimensionality via the t-SNE algorithm is generally used for visualization purposes only.
One further advantage of t-SNE is that its output is not affected by outliers, whereas the PCA algorithm is highly affected by them, because the methodologies used in the two algorithms are different: with PCA we try to preserve the variance in the data, while we use the t-SNE algorithm to retain the local structure of the dataset.
t-SNE, a non-linear dimensionality reduction algorithm, finds patterns in the data based on the similarity of data points in feature space; the similarity of points is calculated as the conditional probability that point A would choose point B as its neighbor.
It then tries to minimize the difference between these conditional probabilities (or similarities) in
higher-dimensional and lower-dimensional space for a perfect representation of data points in lower-
dimensional space.
The algorithm computes pairwise conditional probabilities and tries to minimize the sum of the
difference of the probabilities in higher and lower dimensions. This involves a lot of calculations and
computations. So the algorithm takes a lot of time and space to compute. t-SNE has a quadratic time
and space complexity in the number of data points.
Principal Component Analysis(PCA) technique was introduced by the mathematician Karl Pearson in
1901. It works on the condition that while the data in a higher dimensional space is mapped to data
in a lower dimension space, the variance of the data in the lower dimensional space should be
maximum.
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the variables
without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a
new set of variables, smaller than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.
2. The first principal component captures the most variation in the data, but the second
principal component captures the maximum variance that is orthogonal to the first principal
component, and so on.
3. Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression. In data visualization, PCA can be used
to plot high-dimensional data in two or three dimensions, making it easier to interpret. In
feature selection, PCA can be used to identify the most important variables in a dataset. In
data compression, PCA can be used to reduce the size of a dataset without losing important
information.
4. In Principal Component Analysis, it is assumed that the information is carried in the variance
of the features, that is, the higher the variation in a feature, the more information that
features carries.
Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets, making
them easier to understand and work with.
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard
deviation of 1.
Z = (X − μ) / σ
Here, μ is the mean of each feature and σ is its standard deviation.
Covariance measures the strength of joint variability between two or more variables, indicating how
much they change in relation to each other. To find the covariance we can use the formula:
cov(x1, x2) = ∑i=1..n (x1i − x̄1)(x2i − x̄2) / (n − 1)
A X = λ X
for some scalar value λ; then λ is known as an eigenvalue of matrix A, and X is known as the eigenvector of matrix A for the corresponding eigenvalue. The equation can be rewritten as
A X − λ X = 0
(A − λ I) X = 0
where I is the identity matrix of the same shape as matrix A. The above condition holds for a non-zero X only if (A − λ I) is non-invertible (i.e. a singular matrix). That means
∣A − λ I∣ = 0
From this equation, we can find the eigenvalues λ, and the corresponding eigenvectors can then be found using the equation A X = λ X.
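To tie the steps together, here is a compact from-scratch NumPy sketch that standardizes the data, forms the covariance matrix, performs the eigendecomposition, and projects onto the top two principal components (in practice, scikit-learn's PCA class does all of this for you).
• Python3
import numpy as np
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)

# Step 1: standardize (guarding against zero-variance pixels)
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / np.where(sigma == 0, 1, sigma)

# Step 2: covariance matrix of the standardized data
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the first two principal components
X_pca = Z @ eigvecs[:, :2]
print(X_pca.shape, (eigvals[:2] / eigvals.sum()).round(3))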
2. Feature Selection: Principal Component Analysis can be used for feature selection, which is
the process of selecting the most important variables in a dataset. This is useful in machine
learning, where the number of variables can be very large, and it is difficult to identify the
most important variables.
3. Data Visualization: Principal Component Analysis can be used for data visualization. By
reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.
5. Noise Reduction: Principal Component Analysis can be used to reduce the noise in data. By
removing the principal components with low variance, which are assumed to represent
noise, Principal Component Analysis can improve the signal-to-noise ratio and make it easier
to identify the underlying structure in the data.
6. Data Compression: Principal Component Analysis can be used for data compression. By
representing the data using a smaller number of principal components, which capture most
of the variation in the data, PCA can reduce the storage requirements and speed up
processing.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data is
not properly scaled, then PCA may not work well. Therefore, it is important to scale the data
before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal
components selected. Therefore, it is important to carefully select the number of principal
components to retain.
6. Overfitting: Principal Component Analysis can sometimes result in overfitting, which is when
the model fits the training data too well and performs poorly on new data. This can happen if
too many principal components are used or if the model is trained on a small dataset.
Factor analysis, a method within the realm of statistics and part of the general linear model (GLM), serves to condense numerous variables into a smaller set of factors. By doing so, it captures the maximum shared variance among the variables and condenses it into a unified score, which can subsequently be utilized for further analysis. Factor analysis operates under several assumptions: linearity in relationships, absence of multicollinearity among variables, inclusion of relevant variables in the analysis, and genuine correlations between variables and factors. While multiple methods exist, principal component analysis stands out as the most prevalent extraction approach in practice.
In the context of factor analysis, a “factor” refers to an underlying, unobserved variable or latent
construct that represents a common source of variation among a set of observed variables. These
observed variables, also known as indicators or manifest variables, are the measurable variables that
are directly observed or measured in a study.
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. Here are the general steps involved in conducting a factor analysis (a brief code sketch follows the steps):
• Bartlett’s Test: Check the significance level to determine if the correlation matrix is suitable
for factor analysis.
• Kaiser-Meyer-Olkin (KMO) Measure: Verify the sampling adequacy. A value greater than 0.6
is generally considered acceptable.
• Principal Component Analysis (PCA): Used when the main goal is data reduction.
• Principal Axis Factoring (PAF): Used when the main goal is to identify underlying factors.
3. Factor Extraction
• Extract eigenvalues to determine the number of factors to retain. Factors with eigenvalues
greater than 1 are typically retained in the analysis.
• Scree Plot: Plot the eigenvalues in descending order to visualize the point where the plot
levels off (the “elbow”) to determine the number of factors to retain.
5. Factor Rotation
• Orthogonal Rotation (Varimax, Quartimax): Assumes that the factors are uncorrelated.
• Rotate the factors to achieve a simpler and more interpretable factor structure.
• Analyze the rotated factor loadings to interpret the underlying meaning of each factor.
• Assign meaningful labels to each factor based on the variables with high loadings on that
factor.
• Calculate the factor scores for each individual to represent their value on each factor.
• Report the final factor structure, including factor loadings and communalities.
• Validate the results using additional data or by conducting a confirmatory factor analysis if
necessary.
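As mentioned above, a brief hedged sketch using scikit-learn's FactorAnalysis with a varimax rotation; the dataset and the number of factors are illustrative choices, and adequacy tests such as KMO and Bartlett's are not provided by scikit-learn (third-party packages such as factor_analyzer offer them).
• Python3
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)
Z = StandardScaler().fit_transform(X)

# Extract two factors and apply a varimax rotation for interpretability
# (the rotation argument requires a reasonably recent scikit-learn)
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
scores = fa.fit_transform(Z)        # factor scores for each observation

loadings = fa.components_.T          # variables x factors loading matrix
communalities = (loadings ** 2).sum(axis=1)
print(loadings.round(2))
print(communalities.round(2))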
1. Dimensionality Reduction: Factor analysis helps in reducing the number of variables under
consideration by identifying a smaller number of underlying factors that explain the
correlations or covariances among the observed variables. This simplification can make the
data more manageable and easier to interpret.
3. Data Summarization: By condensing the information from multiple variables into a smaller
set of factors, factor analysis provides a more concise summary of the data while retaining as
much relevant information as possible.
4. Hypothesis Testing: Factor analysis can be used to test hypotheses about the underlying
structure of the data. For example, researchers may have theoretical expectations about how
variables should be related to each other, and factor analysis can help evaluate whether
these expectations are supported by the data.
5. Variable Selection: It aids in identifying which variables are most important or relevant for
explaining the underlying factors. This can help in prioritizing variables for further analysis or
for developing more parsimonious models.
6. Improving Predictive Models: Factor analysis can be used as a preprocessing step to improve the performance of predictive models by reducing multicollinearity among predictors and capturing the shared variance among variables more efficiently.
In factor analysis, several terms are commonly used to describe various concepts and components of the analysis. Some of the most common are listed below:
• Extraction Method: The technique used to extract the initial factors from the observed variables (e.g., principal component analysis, maximum likelihood).
• Factor Scores: Scores that represent the value of each factor for each individual observation.
• Factor Rotation Criterion: A rule or criterion used to determine the appropriate rotation method and angle to achieve a simpler and more interpretable factor structure.
1. Factor Loadings:
• Factor loadings represent the correlations between the observed variables and the
underlying factors in factor analysis. They indicate the strength and direction of the
relationship between each variable and each factor.
o Squaring the standardized factor loading gives the “communality,” which
represents the proportion of variance in a variable explained by the factor.
2. Communality:
• Communality is the sum of the squared factor loadings for a given variable across all factors. It measures the proportion of variance in a variable that is explained by all the factors jointly.
3. Spurious Solutions:
4. Uniqueness of a Variable:
• Uniqueness is the portion of a variable's variance that is not explained by the common factors; it equals one minus the variable's communality.
5. Eigenvalues/Characteristic Roots:
• Eigenvalues measure the amount of variation in the total sample accounted for by each factor. They indicate the importance of each factor in explaining the variance in the variables.
• These are the sums of squared loadings associated with each extracted factor. They provide information on how much variance in the variables is accounted for by each factor.
7. Factor Scores:
• Factor scores represent the scores of each case (row) on each factor (column) in the factor analysis. They are computed by multiplying each case's standardized score on each variable by the corresponding factor loading and summing these products.
There are two main types of Factor Analysis used in data science:
Exploratory Factor Analysis (EFA) is used to uncover the underlying structure of a set of observed
variables without imposing preconceived notions about how many factors there are or how the
variables are related to each factor. It explores complex interrelationships among items and aims to
group items that are part of unified concepts or constructs.
• Researchers do not make a priori assumptions about the relationships among factors,
allowing the data to reveal the structure organically.
• Exploratory Factor Analysis (EFA) helps in identifying the number of factors needed to
account for the variance in the observed variables and understanding the relationships
between variables and factors.
Confirmatory Factor Analysis (CFA) is a more structured approach that tests specific hypotheses
about the relationships between observed variables and latent factors based on prior theoretical
knowledge or expectations. It uses structural equation modeling techniques to test a measurement
model, wherein the observed variables are assumed to load onto specific factors.
• Confirmatory Factor Analysis (CFA) assesses the fit of the hypothesized model to the actual
data, examining how well the observed variables align with the proposed factor structure.
• This method allows for the evaluation of relationships between observed variables and
unobserved factors, and it can accommodate measurement error.
• Researchers hypothesize the relationships between variables and factors before conducting
the analysis, and the model is tested against empirical data to determine its validity.
In summary, while Exploratory Factor Analysis (EFA) is more exploratory and flexible, allowing the
data to dictate the factor structure, Confirmatory Factor Analysis (CFA) is more confirmatory, testing
specific hypotheses about how the observed variables are related to latent factors. Both methods are
valuable tools in understanding the underlying structure of data and have their respective strengths
and applications.
1. Principal Component Analysis (PCA):
• It aims to extract factors that account for the maximum possible variance in the observed variables.
• After extraction, the factor model is often rotated for further analysis to enhance
interpretability.
2. Canonical Factor Analysis:
• Also known as Rao's canonical factoring, this method computes a similar model to PCA but uses the principal axis method.
• It seeks factors that have the highest canonical correlation with the observed
variables.
• Canonical factor analysis is not affected by arbitrary rescaling of the data, making it
robust to certain data transformations.
3. Common Factor Analysis:
• Also referred to as Principal Factor Analysis (PFA) or Principal Axis Factoring (PAF).
• This method aims to identify the fewest factors necessary to account for the
common variance (correlation) among a set of variables.
• Unlike PCA, common factor analysis focuses on capturing shared variance rather
than overall variance.
Let's have a closer look at the assumptions of factor analysis, which are as follows:
1. Linearity: The relationships between variables and factors are assumed to be linear.
2. Multivariate Normality: The variables in the dataset should follow a multivariate normal
distribution.
3. No Multicollinearity: Variables should not be highly correlated with each other, as high
multicollinearity can affect the stability and reliability of the factor analysis results.
4. Adequate Sample Size: Factor analysis generally requires a sufficient sample size to produce
reliable results. The adequacy of the sample size can depend on factors such as the
complexity of the model and the ratio of variables to cases.
5. Homoscedasticity: The variance of the variables should be roughly equal across different
levels of the factors.
6. Uniqueness: Each variable should have unique variance that is not explained by the factors.
This assumption is particularly important in common factor analysis.
8. Linearity of Factor Scores: The relationship between the observed variables and the latent
factors is assumed to be linear, even though the observed variables may not be linearly
related to each other.
9. Interval or Ratio Scale: Factor analysis typically assumes that the variables are measured on
interval or ratio scales, as opposed to nominal or ordinal scales.
Violation of these assumptions can lead to biased parameter estimates and inaccurate
interpretations of the results. Therefore, it’s important to assess the data for these assumptions
before conducting factor analysis and to consider potential remedies or alternative methods if the
assumptions are not met.
Kernel PCA: PCA is a linear method, so it works best on datasets that are linearly separable, and it does an excellent job on such datasets. But if we apply it to non-linear datasets, the result may not be the optimal dimensionality reduction. Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature space, where it becomes linearly separable. The idea is similar to that of Support Vector Machines. There are various kernel functions, such as linear, polynomial, and Gaussian (RBF).
Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for nonlinear
dimensionality reduction. It is an extension of the classical Principal Component Analysis (PCA)
algorithm, which is a linear method that identifies the most significant features or components of a
dataset. KPCA applies a nonlinear mapping function to the data before applying PCA, allowing it to
capture more complex and nonlinear relationships between the data points.
In KPCA, a kernel function is used to map the input data to a high-dimensional feature space, where
the nonlinear relationships between the data points can be more easily captured by linear methods
such as PCA. The principal components of the transformed data are then computed, which can be
used for tasks such as data visualization, clustering, or classification.
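A hedged illustration using scikit-learn's KernelPCA on a toy non-linear dataset; the kernel and gamma value are arbitrary choices for demonstration.
• Python3
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric circles: not linearly separable in the input space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA cannot unfold the circles; an RBF kernel mapping can
X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_pca.shape, X_kpca.shape)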
One of the advantages of KPCA over traditional PCA is that it can handle nonlinear relationships
between the input features, which can be useful for tasks such as image or speech recognition. KPCA
can also handle high-dimensional datasets with many features by reducing the dimensionality of the
data while preserving the most important information.
However, KPCA has some limitations, such as the need to choose an appropriate kernel function and
its corresponding parameters, which can be difficult and time-consuming. KPCA can also be
computationally expensive for large datasets, as it requires the computation of the kernel matrix for
all pairs of data points.
Latent Semantic Analysis is a natural language processing method that uses the statistical approach
to identify the association among the words in a document. LSA deals with the following kind of
issue:
Example: "mobile", "phone", "cell phone", and "telephone" are all similar, but if we pose a query like "The cell phone has been ringing", then only the documents containing "cell phone" are retrieved, whereas the documents containing "mobile", "phone", or "telephone" are not retrieved.
Assumptions of LSA:
1. The words which are used in the same context are analogous to each other.
2. The hidden semantic structure of the data is unclear due to the ambiguity of the words
chosen.
Singular Value Decomposition is the statistical method that is used to find the latent(hidden)
semantic structure of words spread across the document.
Let
C = collection of documents
d = number of documents in C
n = number of unique words in C
M = the n × d word-document matrix
The SVD decomposes the matrix M, i.e. the word-to-document matrix, into three matrices as follows:
M = U∑VT
where U contains the word vectors (left singular vectors), ∑ is the diagonal matrix of singular values, and VT contains the document vectors (right singular vectors).
A very significant feature of SVD is that it allows us to truncate contexts that are not necessarily required. The ∑ matrix provides the diagonal values, which represent the significance of each context from highest to lowest. By keeping only the largest of these values we can reduce the dimensions, so SVD can also be used as a dimensionality reduction technique.
Mk = Uk∑kVTK
where
Mk = approximated matrix of M
Uk, ∑k, VTk are the matrices containing only the k contexts from U, ∑, VT respectively
Truncated SVD after selecting k value
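A brief hedged sketch of LSA using a TF-IDF matrix and truncated SVD; the toy corpus and the value of k are illustrative, and note that scikit-learn builds a documents × terms matrix (the transpose of the convention above).
• Python3
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The cell phone has been ringing",
    "My mobile keeps ringing in meetings",
    "The telephone rang twice last night",
    "I bought a new laptop charger",
]

# Word-document structure as a TF-IDF matrix (documents x terms)
M = TfidfVectorizer().fit_transform(docs)

# Truncated SVD keeps only k latent contexts (topics)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(M)    # documents in the latent space

print(doc_vectors.round(2))
print(svd.singular_values_.round(2))  # significance of each kept context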
Independent Component Analysis (ICA) is a statistical and computational technique used in machine
learning to separate a multivariate signal into its independent non-Gaussian components. The goal of
ICA is to find a linear transformation of the data such that the transformed data is as close to being
statistically independent as possible.
The heart of ICA lies in the principle of statistical independence. ICA identify components within
mixed signals that are statistically independent of each other.
It follows from probability theory that two random variables X and Y are statistically independent when the joint probability distribution of the pair is equal to the product of their individual probability distributions, which means that knowing the outcome of one variable does not change the probability of the other outcome:
P(X, Y) = P(X) · P(Y)
or, equivalently, P(X | Y) = P(X).
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.
2. The second assumption is that each source signal exhibits non-Gaussian distributions.
The observed data X is assumed to be a mixture of hidden components S, which are recovered through a linear static transformation represented by the matrix W:
S = W X
The goal is to transform the observed data X in such a way that the resulting hidden components are as independent as possible. Independence is measured by some objective function of the components, and the task is to find the optimal transformation matrix W that maximizes the independence of the hidden components.
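A small hedged example using scikit-learn's FastICA to recover two independent, non-Gaussian sources from their linear mixtures; the sources and the mixing matrix are made up for illustration.
• Python3
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two non-Gaussian, independent sources: a sine wave and a square wave
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Observed signals are linear mixtures of the sources: X = S A^T
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# Estimate the unmixing transformation and recover the sources
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)   # recovered sources (up to scale and order)
print(S_est.shape)             # (2000, 2)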
• ICA is a powerful tool for separating mixed signals into their independent components. This
is useful in a variety of applications, such as signal processing, image analysis, and data
compression.
• ICA is a non-parametric approach, which means that it does not require assumptions about
the underlying probability distribution of the data.
• ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data
is not available.
• ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.
• ICA assumes that the underlying sources are non-Gaussian, which may not always be true. If
the underlying sources are Gaussian, ICA may not be effective.
• ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.
• ICA can be computationally expensive, especially for large datasets. This can make it difficult
to apply ICA to real-world problems.
• ICA can suffer from convergence issues, which means that it may not always be able to find a
solution. This can be a problem for complex datasets with many sources.
Both the techniques are used in signal processing and dimensionality reduction, but they have
different goals.
Principal Component Analysis vs. Independent Component Analysis:
• PCA reduces the dimensions to avoid the problem of overfitting; ICA decomposes the mixed signal into its independent source signals.
• PCA deals with the principal components; ICA deals with the independent components.
• PCA focuses on the mutual orthogonality of the principal components; ICA does not focus on the mutual orthogonality of the components.
• PCA does not focus on the mutual independence of the components; ICA focuses on the mutual independence of the components.
Nonnegative Matrix Factorization is a matrix factorization method where we constrain the matrices
to be nonnegative. In order to understand NMF, we should clarify the underlying intuition between
matrix factorization.
For a matrix A of dimensions m × n, where each element is ≥ 0, NMF can factorize it into two matrices W and H having dimensions m × k and k × n respectively, and these two matrices contain only non-negative elements. Here, matrix A is approximated as:
A ≈ W × H
where W is the m × k features (basis) matrix and H is the k × n coefficients (weights) matrix.
This method is widely used in performing tasks such as feature reduction in Facial Recognition and
for various NLP tasks.
Intuition:
Fig 1 : NMF Intuition
The objective of NMF is dimensionality reduction and feature extraction. So, when we set a lower dimension k, the goal of NMF is to find two matrices W ∈ Rm×k and H ∈ Rk×n having only non-negative elements. (As shown in Fig 1)
Therefore, by using NMF we are able to obtain factorized matrices having significantly lower
dimensions than those of the product matrix. Intuitively, NMF assumes that the original input is
made of a set of hidden features, represented by each column of W matrix and each column
in H matrix represents the ‘coordinates of a data point’ in the matrix W. In simple terms, it contains
the weights associated with matrix W.
In this, each data point that is represented as a column in A, can be approximated by an additive
combination of the non-negative vectors, which are represented as columns in W.
• NMF decomposes a data matrix V into the product of two lower rank matrices W and H so
that V is approximately equal to W times H.
• NMF uses an iterative procedure to modify the initial values of W and H so that the product
approaches V. The procedure terminates when the approximation error converges or the
specified number of iterations is reached
• When the model is applied, an NMF model maps the original data into the new set of attributes (features) discovered by the model (see the sketch below).
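As referenced above, a minimal hedged sketch with scikit-learn's NMF on the digits images (pixel intensities are non-negative); the number of components and the initialization are illustrative choices.
• Python3
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF

# Pixel intensities are non-negative, so the digits are a natural fit for NMF
A, _ = load_digits(return_X_y=True)          # A is (1797, 64), all entries >= 0

k = 16                                       # number of hidden features
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)                   # (1797, 16) weights per image
H = model.components_                        # (16, 64) basis "parts"

print(W.shape, H.shape)
print("Reconstruction error:", round(float(np.linalg.norm(A - W @ H)), 2))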
Real-life example:
Let us consider some real-life examples to understand the working of the NMF algorithm. Let’s take a
case of image processing.
Suppose, we have an input image, having pixels that form matrix A. Using NMF, we factorize it into
two matrices, one containing the facial feature set [Matrix W] and the other containing the
importance of each facial feature in the input image, i.e. the weights [Matrix H]. (As shown in Fig 2.)
Fig 2 : NMF in Image Processing
NMF is used in major applications such as image processing, text mining, spectral data analysis and many more. There is ongoing research on NMF to increase its efficiency and robustness, as well as work on collective factorization, efficient updates of the matrices, and related topics.