Unit 2-2

Manifold learning is a dimensionality reduction technique that preserves the structure of high-dimensional data in lower dimensions, particularly effective for non-linear data. It includes algorithms such as t-SNE, Isomap, Locally Linear Embedding (LLE), and Multi-Dimensional Scaling (MDS), each with unique methods for maintaining data relationships. The document outlines the implementation and advantages of these algorithms, emphasizing their applications in machine learning tasks like visualization, data exploration, and anomaly detection.

Manifold Learning

Manifold learning is a technique for dimensionality reduction used in machine learning that seeks to
preserve the underlying structure of high-dimensional data while representing it in a lower-
dimensional environment. This technique is particularly useful when the data has a non-linear
structure that cannot be adequately captured by linear approaches like Principal Component Analysis
(PCA).

Features of Manifold Learning

• Captures the complex linkages and non-linear relationships in the data, providing a better representation for subsequent analysis.

• Makes feature extraction easier, identifies important patterns, and reduces noise.

• Boosts the effectiveness of machine learning algorithms by keeping the data’s natural structure.

• Provides more accurate modeling and forecasting, which is especially helpful when dealing with data that linear techniques are unable to fully model.

In this section, we will examine four manifold learning algorithms, which are as follows:

• t-Distributed Stochastic Neighbor Embedding(t-SNE)

• Isomap

• Locally Linear Embedding (LLE)

• Multi-Dimensional Scaling(MDS)

We will utilize the scikit-learn digits dataset, which contains images of digits (0-9) encoded as 8×8
pixel arrays. Each image has 64 features that represent the pixel intensities.

Steps:

1. Load the dataset and import the necessary libraries.

2. Make an instance of the manifold learning algorithm.

3. Fit the algorithm to the dataset.

4. Convert the dataset to a lower-dimensional space.

5. Visualize the converted data.

Example 1: t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE is an effective method for displaying high-dimensional data. It is very helpful for constructing
2D or 3D representations of complicated data. t-SNE is based on the concept of probability
distributions, and it attempts to minimize the divergence between two probability distributions, one
measuring pairwise similarities between data points in high-dimensional space and the other
measuring pairwise similarities between data points in low-dimensional space. t-SNE produces a 2D
or 3D display of the data.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the digits dataset (1797 samples, 64 pixel features each)
digits = load_digits()
X = digits.data
y = digits.target

# Project the 64-dimensional data down to 2 dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

# Visualize the embedding, coloring each point by its digit label
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y)
plt.show()

Output:

t-SNE

Example 2: Isomap (Isometric Mapping)


Isomap is a dimensionality reduction approach based on the idea of geodesic distance. While
mapping data points from a higher-dimensional space to a lower-dimensional space, Isomap
attempts to retain the geodesic distances between them. Isomap comes in handy when working
with non-linear data structures.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Reduce to 2 dimensions while preserving geodesic distances
isomap = Isomap(n_components=2)
X_isomap = isomap.fit_transform(X)

# Visualize the embedding, coloring each point by its digit label
plt.scatter(X_isomap[:, 0], X_isomap[:, 1], c=y)
plt.show()

Output:

Isomap

Example 3: LLE (Locally Linear Embedding)


LLE is a dimensionality reduction approach that is built on the idea of preserving the data’s local
structure. LLE attempts to find a lower-dimensional representation of the data that retains the data
points’ local associations. When working with non-linear data structures, LLE is especially beneficial.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Reduce to 2 dimensions while preserving local linear relationships
lle = LocallyLinearEmbedding(n_components=2, random_state=42)
X_lle = lle.fit_transform(X)

# Visualize the embedding, coloring each point by its digit label
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=y)
plt.show()

Output:
Locally Linear Embedding

Example 4: MDS (Multi-Dimensional Scaling)

MDS is a dimensionality reduction approach that is based on the idea of maintaining the pairwise
distances between data points. MDS seeks a lower-dimensional representation of the data that
retains pairwise distances between data points. MDS is very helpful when working with linear data
structures.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target

# Reduce to 2 dimensions while preserving pairwise distances
mds = MDS(n_components=2, random_state=42)
X_mds = mds.fit_transform(X)

# Visualize the embedding, coloring each point by its digit label
plt.scatter(X_mds[:, 0], X_mds[:, 1], c=y)
plt.show()

Output:

Multi-Dimensional Scaling
Isomap

An understanding and representation of complicated data structures are crucial in the field of
machine learning. To achieve this, manifold learning, a subset of unsupervised learning, has a
significant role to play. Among the manifold learning techniques, ISOMAP (Isometric Mapping) stands
out for its prowess in capturing the intrinsic geometry of high-dimensional data. It has proved
particularly effective in situations where linear methods fall short.

ISOMAP is a flexible tool that seamlessly blends manifold learning and dimensionality reduction,
aiming to obtain a more detailed knowledge of the underlying structure of data. This section takes a
look at ISOMAP’s inner workings and sheds light on its parameters, functions, and proper
implementation with scikit-learn.

Isometric mapping is an approach to reducing the dimensionality of data in machine learning.

Manifold Learning

Manifold Learning is an unsupervised method for understanding the underlying structure of complex
data. Basically, it is aimed at capturing the inherent characteristics of high-dimensional datasets and
representing them in a lower-dimensional space. Manifold learning allows the discovery of nonlinear
relationships hidden within data, which is a valuable asset in different applications compared to
linear techniques.

Isometric Mapping Concept

The idea of an isometric map, which aims to preserve pairwise distances between points, is central to
ISOMAP. In doing so, it seeks a low-dimensional representation of the data while keeping the geodesic
distances (the shortest paths along the curved surface of the data manifold) as faithful as possible.
This is particularly important in situations where the underlying structure is bent or folded, since
traditional methods such as PCA are not able to take these nuances into account.

Relation between Geodesic Distances and Euclidean Distances

Understanding the distinction between geodesic and Euclidean distances is of vital importance for
ISOMAP. The geodesic distance considers the shortest path along the curved surface of the manifold,
as opposed to Euclidean distances, which are measured as straight-line distances in the
input space. In order to provide a more precise representation of the data’s internal structure,
ISOMAP exploits these geodesic distances.

ISOMAP Parameters

ISOMAP comes with several parameters, each influencing the dimensionality reduction process:

• n_neighbors: Determines the number of neighbors used to approximate geodesic distances.
Higher values may capture the manifold more faithfully, but they require more
computing power.

• n_components: Determines the number of dimensions in a low dimensional representation.

• eigen_solver: Determines the method used for the eigenvalue decomposition. There are
options such as “auto”, “arpack” and “dense.”
• radius: You can designate a radius within which neighbors are taken into account in place of
using a set number of neighbors. Outside of this range, data points are not regarded as
neighbors.

• tol: tolerance in the eigenvalue solver to attain convergence. While a lower value might
result in a more accurate solution, it might also lengthen the computation time.

• max_iter: The maximum number of iterations for the eigenvalue solver. If None is selected, it
runs until convergence or another stopping condition is satisfied.

• path_method: chooses the approximation technique for geodesic distances on the graph.
‘auto’ (automatic selection) and ‘FW’ (Floyd-Warshall algorithm) are available options.

• neighbors_algorithm: A method for calculating the closest neighbors. ‘auto’, ‘ball_tree’,
‘kd_tree’, and ‘brute’ are among the available options. ‘auto’ selects the best algorithm
according to the input data.

• metric: The nearest neighbor search’s distance metric. ‘Minkowski‘ is the default; however,
‘euclidean’,’manhattan’, and several other options are also available.

Working of ISOMAP

• Calculate the pairwise distances: The algorithm starts by calculating the Euclidean distances
between the data points.

• Find nearest neighbors according to these distances: For each data point, its k nearest
neighbors are determined using these distances.

• Create a neighborhood graph: each point is connected by edges to its nearest
neighbors, which creates a graph that represents the data’s local structure.

• Calculate geodesic distances: The Floyd-Warshall algorithm computes the shortest paths between
all pairs of data points in the neighborhood graph; these shortest path lengths approximate the
geodesic distances.

• Perform dimensionality reduction: Classical Multi-Dimensional Scaling (MDS) is applied to the
geodesic distance matrix, which results in a low-dimensional embedding of the data. A worked
sketch of these steps appears below.
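The sketch below ties these steps to the parameters listed above. It reuses the scikit-learn digits data from the earlier examples; the specific values of n_neighbors and path_method, and the 500-sample subset used to keep the all-pairs shortest-path computation quick, are illustrative choices rather than prescribed settings.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# A 500-sample subset keeps the Floyd-Warshall shortest-path step quick
X = load_digits().data[:500]

# Illustrative settings: 10 neighbors for the graph, Floyd-Warshall ('FW')
# for geodesic distances, automatic nearest-neighbor algorithm selection
isomap = Isomap(n_components=2, n_neighbors=10,
                path_method='FW', neighbors_algorithm='auto')
X_iso = isomap.fit_transform(X)

print(X_iso.shape)                    # (500, 2) low-dimensional embedding
print(isomap.reconstruction_error())  # lower values mean geodesic distances are preserved better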

Advantages and Disadvantages of Isomap

Advantages

• Capturing non-linear relationships: Unlike linear dimensionality reduction techniques such as
PCA, Isomap is able to capture the underlying non-linear structure of the data.

• Global structure: Isomap’s goal is to preserve the overall relationships between data points,
which gives a better representation of the entire manifold.

• Globally optimal: The algorithm guarantees that a globally optimal solution will be found on the
constructed neighborhood graph, on which the geodesic distances are defined.

Disadvantages

• Computational cost: For large datasets, computation of geodesic distances using the
Floyd-Warshall algorithm can be computationally expensive and lead to a long run time.

• Sensitive to parameter settings: Incorrect selection of the parameters may lead to a
distorted or misleading embedding.

• Difficulty with manifolds that have holes or other topological complexity: Isomap may not
perform well on such manifolds, which can lead to inaccurate representations.

Applications of Isomap

• Visualization: High-dimensional data like face images can be visualized in a lower-dimensional
space, enabling easier exploration and understanding.

• Data exploration: Isomap can help identify clusters and patterns within the data that are not
readily apparent in the original high-dimensional space.

• Anomaly detection: Outliers that deviate significantly from the underlying manifold can be
identified using Isomap.

• Machine learning tasks: Isomap can be used as a pre-processing step for other machine
learning tasks, such as classification and clustering, by improving the performance and
interpretability of the models.

LLE (Locally Linear Embedding) is an unsupervised approach designed to transform data from its
original high-dimensional space into a lower-dimensional representation, all while striving to retain
the essential geometric characteristics of the underlying non-linear feature structure. LLE operates in
several key steps:

• Firstly, it constructs a nearest neighbors graph to capture these local relationships. Then, it
optimizes weight values for each data point, aiming to minimize the reconstruction error
when expressing a point as a linear combination of its neighbors. This weight matrix reflects
the strength of connections between points.

• Next, LLE computes a lower-dimensional representation of the data by finding eigenvectors of a
matrix derived from the weight matrix. These eigenvectors represent the most relevant directions
in the reduced space. Users can specify the desired dimensionality for the output space, and LLE
selects the top eigenvectors accordingly.

As an illustration, consider a Swiss roll dataset, which is inherently non-linear in its high-dimensional
space. LLE, in this case, works to project this complex structure onto a lower-dimensional plane,
preserving its distinctive geometric properties throughout the transformation process.

Table of Contents

• Mathematical Implementation of LLE Algorithm

• Locally Linear Embedding Algorithm

• Parameters in LLE Algorithm

• Implementation of Locally Linear Embedding

• Advantages of LLE

• Disadvantages of LLE
Mathematical Implementation of LLE Algorithm

The key idea of LLE is that locally, in the vicinity of each data point, the data lies approximately on a
linear subspace. LLE attempts to unfold or unroll the data while preserving these local linear
relationships.

Here is a mathematical overview of the LLE algorithm:

Minimize:

E(W) = Σi || xi − Σj wij xj ||²

Subject to:

Σj wij = 1 for every data point xi

Where:

• xi represents the i-th data point.

• wij are the weights that minimize the reconstruction error for data point xi using its
neighbors.

It aims to find a lower-dimensional representation of data while preserving local relationships. The
mathematical expression for LLE involves minimizing the reconstruction error of each data point by
expressing it as a weighted sum of its k nearest neighbors‘ contributions. This optimization is subject
to constraints ensuring that the weights sum to 1 for each data point. Locally Linear Embedding (LLE)
is a dimensionality reduction technique used in machine learning and data analysis. It focuses on
preserving local relationships between data points when mapping high-dimensional data to a lower-
dimensional space. Here, we will explain the LLE algorithm and its parameters.

Locally Linear Embedding Algorithm

The LLE algorithm can be broken down into several steps:

• Neighborhood Selection: For each data point in the high-dimensional space, LLE identifies its
k-nearest neighbors. This step is crucial because LLE assumes that each data point can be
well approximated by a linear combination of its neighbors.

• Weight Matrix Construction: LLE computes a set of weights for each data point to express it
as a linear combination of its neighbors. These weights are determined in such a way that
the reconstruction error is minimized. Linear regression is often used to find these weights.

• Global Structure Preservation: After constructing the weight matrix, LLE aims to find a
lower-dimensional representation of the data that best preserves the local linear
relationships. It does this by seeking a set of coordinates in the lower-dimensional space for
each data point that minimizes a cost function. This cost function evaluates how well each
data point can be represented by its neighbors.

• Output Embedding: Once the optimization process is complete, LLE provides the final lower-
dimensional representation of the data. This representation captures the essential structure
of the data while reducing its dimensionality.

Parameters in LLE Algorithm

LLE has a few parameters that influence its behavior:


• k (Number of Neighbors): This parameter determines how many nearest neighbors are
considered when constructing the weight matrix. A larger k captures more global
relationships but may introduce noise. A smaller k focuses on local relationships but can be
sensitive to outliers. Selecting an appropriate value for k is essential for the algorithm’s
success.

• Dimensionality of Output Space: You can specify the dimensionality of the lower-
dimensional space to which the data will be mapped. This is often chosen based on the
problem’s requirements and the trade-off between computational complexity and
information preservation.

• Distance Metric: LLE relies on a distance metric to define the proximity between data points.
Common choices include Euclidean distance, Manhattan distance, or custom-defined
distance functions. The choice of distance metric can impact the results.

• Regularization (Optional): In some cases, regularization terms are added to the cost function
to prevent overfitting. Regularization can be useful when dealing with noisy data or when the
number of neighbors is high.

• Optimization Algorithm (Optional): LLE often uses optimization techniques like Singular
Value Decomposition (SVD) or eigenvector methods to find the lower-dimensional
representation. These optimization methods may have their own parameters that can be
adjusted. A minimal usage sketch with these parameters is shown below.
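As promised above, here is a minimal sketch of these parameters in use. The Swiss roll dataset, the choice of k = 12 neighbors, and the regularization value are illustrative assumptions, not values fixed by the algorithm itself.

• Python3

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding
import matplotlib.pyplot as plt

# A synthetic Swiss roll: 3-D points lying on a curled-up 2-D manifold
X, color = make_swiss_roll(n_samples=1000, random_state=42)

# k = 12 neighbors, 2-D output space, small regularization of the weight fit
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2,
                             reg=1e-3, random_state=42)
X_lle = lle.fit_transform(X)

# The "unrolled" manifold, colored by position along the roll
plt.scatter(X_lle[:, 0], X_lle[:, 1], c=color)
plt.show()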

LLE (Locally Linear Embedding) represents a significant advancement in structural analysis, surpassing
traditional density modeling techniques like local PCA or mixtures of factor analyzers. The
limitation of density models lies in their inability to consistently establish a set of global coordinates
capable of embedding observations across the entire structural manifold. Consequently, they prove
inadequate for tasks such as generating low-dimensional projections of the original dataset. These
models excel only in identifying linear features, as depicted in the image below. However, they fall
short in capturing intricate curved patterns, a capability inherent to LLE.

Enhanced computational efficiency with LLE: LLE offers superior computational efficiency due to its
sparse matrix handling, outperforming other algorithms.

Advantages of LLE

The dimensionality reduction method known as locally linear embedding (LLE) has many benefits for
data processing and visualization. The following are LLE’s main benefits:

• Preservation of Local Structures: LLE is excellent at maintaining the local relationships or
structures in the data. It successfully captures the inherent geometry of nonlinear
manifolds by maintaining pairwise distances between nearby data points.

• Handling Non-Linearity: LLE has the ability to capture nonlinear patterns and structures in
the data, in contrast to linear techniques like Principal Component Analysis (PCA). When
working with complicated, curved, or twisted datasets, it is especially helpful.

• Dimensionality Reduction: LLE lowers the dimensionality of the data while preserving its
fundamental properties. Particularly when working with high-dimensional datasets, this
reduction makes data presentation, exploration, and analysis simpler.

Disadvantages of LLE
• Curse of Dimensionality: LLE can experience the “curse of dimensionality” when used with
extremely high-dimensional data, just like many other dimensionality reduction approaches.
The number of neighbors required to capture local interactions rises as dimensionality does,
potentially increasing the computational cost of the approach.

• Memory and computational Requirements: For big datasets, creating a weighted adjacency
matrix as part of LLE might be memory-intensive. The eigenvalue decomposition stage can
also be computationally taxing for big datasets.

• Outliers and Noisy Data: LLE is susceptible to outliers and noisy data points. Outliers can
distort the local linear relationships and degrade the quality of the embedding.

Spectral Embedding

Data is projected onto a lower-dimensional subspace using the spectral embedding method, which
reduces the dimensionality of the data while retaining some of its original characteristics. It is
predicated on the notion of employing a matrix’s eigenvectors, which stand for the affinity or
resemblance between the data points. The visualization of high-dimensional data, clustering,
manifold learning, and other applications can all benefit from spectral embedding.

The idea of spectral embedding, how it functions, and how to apply it in Python using the scikit-
learn module are all covered in this article. We will also examine some examples of spectral
embedding being used on various datasets and contrast the outcomes with other approaches.

Mathematical Concept of Spectral Embedding

A dimensionality reduction method that is frequently applied in data analysis and machine learning is
called spectral embedding. High-dimensional data can be visualized and clustered with great benefit
from it. Based on spectral graph theory, spectral embedding shares a tight relationship with Principal
Component Analysis (PCA).

The first step in spectral embedding is to represent the data as a graph. There are several methods to
build this graph, including similarity, epsilon, and k-nearest-neighbor, among others. The graph’s
nodes stand in for data points, while the edges connecting them indicate similarities or pairwise
relationships.

The creation of the Laplacian matrix, which encodes the graph’s structure, comes next. Laplacian
matrices come in various forms, but the most widely used type is the unnormalized Laplacian L.
It can be computed as:

L = D − W

Where,

L = Laplacian matrix

D = diagonal degree matrix, where each diagonal entry Dii is the sum of the weights of the edges
connected to node i.

W = weighted adjacency matrix, where Wij represents the similarity or weight between nodes i and j.

The eigenvalues and eigenvectors of the Laplacian matrix L must then be calculated. These can be
obtained by solving the following generalized eigenvalue problem:

L v = λ D v

Where,

λ = eigenvalues

v = corresponding eigenvectors

Once the eigenvalues and eigenvectors are obtained, dimensionality reduction can be carried out by
choosing the top k eigenvectors that match the lowest k eigenvalues. These k eigenvectors combine
to create a new matrix, Vk .

The eigenvectors are used as the new feature vectors for the data points in order to achieve spectral
embedding. The data points’ coordinates in the lower-dimensional space are determined by the k
eigenvectors in Vk. At this point, every data point is represented by a k-dimensional vector.
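As a minimal numerical sketch of these steps (the tiny 4-point affinity matrix below is invented purely for illustration), the Laplacian L = D − W can be built directly and the generalized eigenproblem solved with SciPy:

• Python3

import numpy as np
from scipy.linalg import eigh

# Hypothetical symmetric affinity matrix W for 4 data points
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.7, 0.1],
              [0.1, 0.7, 0.0, 0.9],
              [0.0, 0.1, 0.9, 0.0]])

D = np.diag(W.sum(axis=1))   # diagonal degree matrix
L = D - W                    # unnormalized graph Laplacian

# Solve the generalized eigenvalue problem L v = lambda D v
eigvals, eigvecs = eigh(L, D)

# Skip the trivial constant eigenvector (eigenvalue 0) and keep the next k = 2
k = 2
embedding = eigvecs[:, 1:k + 1]
print(embedding)             # 2-D spectral coordinates for the 4 points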

Parameters of Spectral Embedding

We can use the scikit-learn framework and a class named SpectralEmbedding to create spectral
embedding in Python. Several parameters in this class determine how the affinity matrix is built and
how the eigenvalue decomposition is carried out. These are a few of the parameters:

• n_components: The dimension of the projected subspace.

• affinity: How to construct the affinity matrix. It can be one of {‘nearest_neighbors’, ‘rbf’,
‘precomputed’, ‘precomputed_nearest_neighbors’} or a callable function that takes in a data
matrix and returns an affinity matrix.

• gamma: The kernel coefficient for rbf kernel. If None, gamma will be set to 1/n_features.

• random_state: A pseudo random number generator used for initializing some algorithms.

• eigen_solver: The eigenvalue decomposition strategy to use. It can be one of {‘arpack’,
‘lobpcg’, ‘amg’}. AMG requires pyamg to be installed and can be faster on very large sparse
problems.

• eigen_tol: The stopping criterion for eigendecomposition.

• norm_laplacian: Whether to use the normalized Laplacian or not.

• drop_first: Whether to drop the first eigenvector or not. A short usage sketch with these
parameters is shown below.
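The sketch below uses the digits data and a nearest-neighbors affinity as illustrative assumptions; the remaining parameters are left at their defaults.

• Python3

from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding
import matplotlib.pyplot as plt

digits = load_digits()
X, y = digits.data, digits.target

# 2-D embedding built from a nearest-neighbors affinity graph
se = SpectralEmbedding(n_components=2, affinity='nearest_neighbors',
                       random_state=42)
X_se = se.fit_transform(X)

plt.scatter(X_se[:, 0], X_se[:, 1], c=y)
plt.show()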

Advantages of Spectral Embedding

There are several advantages of Spectral Embedding. Some of the advantages are:

• Preservation of Local and Global Structure: Both local and global structure in the data can
be effectively preserved using spectral embedding techniques. Both relationships between
close-by data points and relationships between distant data points can be captured by them.

• Dimensionality Reduction: Spectral embedding can reduce dimensionality without sacrificing
structural information about the data. It is especially helpful when the data lies on a non-linear
manifold embedded in a higher-dimensional space.
• Non-linearity Handling: Spectral embedding is able to identify non-linear patterns in the
data, in contrast to linear methods such as PCA. Because of this, it can be used with a variety
of datasets when linear approaches might not work.

Disadvantages of Spectral Embedding

Although spectral embedding techniques have many benefits, they also have several drawbacks and
restrictions:

• Sensitivity to Hyperparameters: Hyperparameters, such as the number of eigenvalues or the
distance metric selected, can have a significant impact on the performance of spectral
embedding techniques. The quality of the embedding may be affected by the selection of
these parameters, which can be difficult to tune.

• Computational Complexity: It can take a lot of computing power to calculate eigenvectors
and eigenvalues, particularly for huge datasets. For big data applications, spectral embedding
is less feasible due to its complexity.

• Scalability: Spectral embedding usually does not scale well to very large, high-dimensional
datasets. It works better for dimensionality reduction or visualization than for large-scale
high-dimensional data processing.

Multidimensional scaling (MDS) is a dimensionality reduction technique that is used to project high-
dimensional data onto a lower-dimensional space while preserving the pairwise distances between
the data points as much as possible. MDS is based on the concept of distance and aims to find a
projection of the data that minimizes the differences between the distances in the original space and
the distances in the lower-dimensional space.

MDS is commonly used to visualize complex, high-dimensional data, and to identify patterns and
relationships that may not be apparent in the original space. It can be applied to a wide range of data
types, including numerical, categorical, and mixed data. MDS is implemented using numerical
optimization algorithms, such as gradient descent or simulated annealing, to minimize the difference
between the distances in the original and lower-dimensional spaces.

Overall, MDS is a powerful and flexible technique for reducing the dimensionality of high-
dimensional data, and for revealing hidden patterns and relationships in the data. It is widely used in
many fields, including machine learning, data mining, and pattern recognition.

Features of the Multidimensional Scaling (MDS)

1. MDS is based on the concept of distance and aims to find a projection of the data that
minimizes the differences between the distances in the original space and the distances in
the lower-dimensional space. This allows MDS to preserve the relationships between the
data points, and to highlight patterns and trends that may not be apparent in the original
space.
2. MDS can be applied to a wide range of data types, including numerical, categorical, and
mixed data. This makes MDS a versatile tool that can be used with many different kinds of
data and allows it to handle complex multi-modal data sets.

3. MDS is implemented using numerical optimization algorithms, such as gradient descent or
simulated annealing, to minimize the difference between the distances in the original and
lower-dimensional spaces. This makes MDS a flexible and adaptable technique that can
handle complex, nonlinear data, and can find projections that are different from those
produced by linear techniques, such as principal component analysis (PCA).

4. MDS is widely used in many fields, including machine learning, data mining, and pattern
recognition. This makes it a well-established and widely-supported technique that has been
extensively tested and validated, and that has a large and active user community.

Overall, MDS is a powerful and flexible technique for reducing the dimensionality of high-
dimensional data, and for revealing hidden patterns and relationships in the data. Its key features
include its ability to handle a wide range of data types, its flexibility and adaptability, and its
widespread use and support in many fields.

Breaking down the Math behind Multidimensional Scaling (MDS)

The mathematical foundation of MDS is the stress function, which measures the difference between
the distances in the original space and the distances in the lower-dimensional space. The stress
function (Kruskal’s stress-1) is defined as:

Stress = sqrt( Σi<j (dij − d̂ij)² / Σi<j dij² )

where dij is the distance between data points i and j in the original space, d̂ij is the distance
between data points i and j in the lower-dimensional space, and n is the number of data points. The
stress function is a measure of the deviation of the distances in the lower-dimensional space from
the distances in the original space and is used to evaluate the quality of the projection.
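As a small sketch of how this stress value might be computed by hand (using the stress-1 form written above; the random 10-dimensional data is purely illustrative):

• Python3

import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 10))      # illustrative high-dimensional data

mds = MDS(n_components=2, random_state=42)
X_low = mds.fit_transform(X)

d_high = pdist(X)                  # pairwise distances in the original space
d_low = pdist(X_low)               # pairwise distances in the 2-D embedding

# Kruskal stress-1: relative mismatch between the two sets of distances
stress = np.sqrt(np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2))
print(stress)                      # closer to 0 means distances are better preserved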

Limitations of Multidimensional Scaling (MDS)

Like all techniques, MDS has some limitations and drawbacks that should be considered when using
it to analyze and visualize data.

1. It relies on the distances between the data points to define the projection and does not
consider other types of relationships between the data points, such as correlations or
associations. This means that MDS may not be suitable for data sets that have complex, non-
distance-based relationships, or that have missing or noisy distances.

2. It is sensitive to outliers and noise in the data, which can affect the quality of the projection
and the interpretability of the results. MDS may produce projections that are distorted or
misleading if the data contains outliers or noise, and may not accurately reflect the
underlying structure of the data.

3. It is a global optimization technique, which means that it finds a single projection that is
optimal for the entire data set. This can be problematic for data sets that have complex,
multi-modal structures, or that have multiple clusters or groups of data points, as MDS may
not be able to capture the local structure of the data within each group.

How does Multidimensional Scaling (MDS) compare to other dimensionality reduction techniques?

MDS is commonly compared to other dimensionality reduction techniques, such as principal
component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), to understand
how it differs from these techniques and when it may be more appropriate to use.

1. MDS is based on the concept of distance and aims to find a projection of the data that
minimizes the differences between the distances in the original space and the distances in
the lower-dimensional space. In contrast, PCA and t-SNE are based on the concept of
variance and entropy, respectively, and aim to find a projection of the data that maximizes
the variance or entropy in the lower-dimensional space. This means that MDS is more
focused on preserving the relationships between the data points, while PCA and t-SNE are
more focused on summarizing the data and finding the most relevant dimensions.

2. MDS can be applied to a wide range of data types, including numerical, categorical, and
mixed data. In contrast, PCA and t-SNE are more suited to numerical data, and may not be as
effective with categorical or mixed data. This makes MDS a more versatile and flexible
technique and allows it to handle complex, multi-modal data sets.

3. MDS uses numerical optimization algorithms to find the projection that minimizes the stress
function, and that best preserves the pairwise distances between the data points. In
contrast, PCA and t-SNE use linear algebra and stochastic algorithms, respectively, to find the
projection that maximizes the variance or entropy in the lower-dimensional space. This
means that MDS is a more flexible and adaptable technique, and can find projections that
are different from those produced by PCA or t-SNE.

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction
technique well-suited for embedding high-dimensional data for visualization in a low-dimensional
space of two or three dimensions.

What is Dimensionality Reduction?

Dimensionality reduction represents n-dimensional data (multidimensional data with many features)
in 2 or 3 dimensions. As an example, consider a classification problem, such as whether a student will
play football or not, that relies on both temperature and humidity; these two features can be
collapsed into a single underlying feature, since they are correlated to a high degree. Hence,
we can reduce the number of features in such problems. A 3-D classification problem can be hard to
visualize, whereas a 2-D one can be mapped to a simple 2-dimensional space and a 1-D problem to a
simple line.

What is t-SNE Algorithm?


t-Distributed Stochastic Neighbor Embedding is a dimensionality reduction technique. This algorithm
uses a randomized approach to reduce the dimensionality of the dataset at hand non-linearly. It
focuses on retaining the local structure of the dataset in the lower dimension as well.

This helps us explore high-dimensional data by mapping it into lower dimensions; because the local
structure of the dataset is retained, we can get a feel for the data by plotting it and visualizing it in
a 2D or 3D plane.

What is the difference between PCA and t-SNE algorithm?

PCA and t-SNE are both unsupervised algorithms that are used to reduce the dimensionality of the
dataset. However, PCA is a deterministic algorithm for reducing the dimensionality of the dataset,
whereas the t-SNE algorithm is a randomized non-linear method that maps the high-dimensional
data to a lower dimension. The data obtained after reducing the dimensionality via the t-SNE
algorithm is generally used for visualization purposes only.

One more advantage of using t-SNE is that it is not affected by outliers, while the PCA algorithm is
highly affected by outliers, because the methodologies used in the two algorithms are different.
While we try to preserve the variance in the data using the PCA algorithm, we use the t-SNE
algorithm to retain the local structure of the dataset.

How does t-SNE work?

t-SNE, a non-linear dimensionality reduction algorithm, finds patterns in the data based on the
similarity of data points with features; the similarity of points is calculated as the conditional
probability that point A would choose point B as its neighbor.

It then tries to minimize the difference between these conditional probabilities (or similarities) in the
higher-dimensional and lower-dimensional spaces for a faithful representation of the data points in
the lower-dimensional space.
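As a rough sketch of this conditional-probability idea for a single reference point: the five random 3-D points and the fixed Gaussian bandwidth sigma below are illustrative; real t-SNE tunes sigma separately for each point to match a chosen perplexity.

• Python3

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # five illustrative 3-D data points

i, sigma = 0, 1.0              # reference point and hand-picked bandwidth

# Squared Euclidean distances from point i to every point
sq_d = np.sum((X - X[i]) ** 2, axis=1)

# p_{j|i}: probability that point i would pick point j as its neighbor
p = np.exp(-sq_d / (2 * sigma ** 2))
p[i] = 0.0                     # a point never picks itself
p /= p.sum()
print(p)                       # larger values for points closer to X[0]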

Space and Time Complexity

The algorithm computes pairwise conditional probabilities and tries to minimize the sum of the
difference of the probabilities in higher and lower dimensions. This involves a lot of calculations and
computations. So the algorithm takes a lot of time and space to compute. t-SNE has a quadratic time
and space complexity in the number of data points.

What is Principal Component Analysis(PCA)?

Principal Component Analysis(PCA) technique was introduced by the mathematician Karl Pearson in
1901. It works on the condition that while the data in a higher dimensional space is mapped to data
in a lower dimension space, the variance of the data in the lower dimensional space should be
maximum.

• Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of correlated variables into a set of uncorrelated
variables. PCA is the most widely used tool in exploratory data analysis and in machine
learning for predictive models.

• Principal Component Analysis (PCA) is an unsupervised learning technique used to
examine the interrelations among a set of variables. It is also known as a general factor
analysis where regression determines a line of best fit.

• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the variables
without any prior knowledge of the target variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a
new set of variables, smaller than the original set of variables, retaining most of the sample’s
information, and useful for the regression and classification of data.

Principal Component Analysis

1. Principal Component Analysis (PCA) is a technique for dimensionality reduction that
identifies a set of orthogonal axes, called principal components, that capture the maximum
variance in the data. The principal components are linear combinations of the original
variables in the dataset and are ordered in decreasing order of importance. The total
variance captured by all the principal components is equal to the total variance in the
original dataset.

2. The first principal component captures the most variation in the data, the second
principal component captures the maximum variance that is orthogonal to the first principal
component, and so on.

3. Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression. In data visualization, PCA can be used
to plot high-dimensional data in two or three dimensions, making it easier to interpret. In
feature selection, PCA can be used to identify the most important variables in a dataset. In
data compression, PCA can be used to reduce the size of a dataset without losing important
information.

4. In Principal Component Analysis, it is assumed that the information is carried in the variance
of the features, that is, the higher the variation in a feature, the more information that
features carries.

Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets, making
them easier to understand and work with.

Step-By-Step Explanation of PCA (Principal Component Analysis)

Step 1: Standardization

First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a standard
deviation of 1.

Z=(X−μ)/σ

Here,

• μ is the mean of the independent features, μ = {μ1, μ2, ⋯, μm}

• σ is the standard deviation of the independent features, σ = {σ1, σ2, ⋯, σm}

Step 2: Covariance Matrix Computation

Covariance measures the strength of joint variability between two or more variables, indicating how
much they change in relation to each other. To find the covariance we can use the formula:

cov(x1, x2) = Σi=1..n (x1i − x̄1)(x2i − x̄2) / (n − 1)

The value of covariance can be positive, negative, or zeros.

• Positive: As the x1 increases x2 also increases.

• Negative: As the x1 increases x2 also decreases.

• Zeros: No direct relation

Step 3: Compute Eigenvalues and Eigenvectors of the Covariance Matrix to Identify Principal
Components

Let A be a square n×n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is known as an eigenvalue of matrix A and X is known as
the eigenvector of matrix A for the corresponding eigenvalue.

It can also be written as:

AX − λX = 0  ⟹  (A − λI)X = 0

where I is the identity matrix of the same shape as matrix A. The above condition holds
only if (A − λI) is non-invertible (i.e. a singular matrix). That means,

|A − λI| = 0

From the above equation, we can find the eigenvalues λ, and the corresponding
eigenvectors can then be found using the equation AX = λX.
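These three steps can be sketched directly in NumPy. The random 100 × 5 data matrix and the choice of keeping two components are illustrative; this is a bare-bones walk-through of the equations above, not a full PCA implementation.

• Python3

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))           # illustrative data: 100 samples, 5 features

# Step 1: standardize each feature to mean 0 and standard deviation 1
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features
C = np.cov(Z, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# Sort by decreasing eigenvalue and project onto the top 2 principal components
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_pca = Z @ components
print(X_pca.shape)                      # (100, 2)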

Advantages of Principal Component Analysis

1. Dimensionality Reduction: Principal Component Analysis is a popular technique used
for dimensionality reduction, which is the process of reducing the number of variables in a
dataset. By reducing the number of variables, PCA simplifies data analysis, improves
performance, and makes it easier to visualize data.

2. Feature Selection: Principal Component Analysis can be used for feature selection, which is
the process of selecting the most important variables in a dataset. This is useful in machine
learning, where the number of variables can be very large, and it is difficult to identify the
most important variables.

3. Data Visualization: Principal Component Analysis can be used for data visualization. By
reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.

4. Multicollinearity: Principal Component Analysis can be used to deal with multicollinearity,
which is a common problem in a regression analysis where two or more independent
variables are highly correlated. PCA can help identify the underlying structure in the data and
create new, uncorrelated variables that can be used in the regression model.

5. Noise Reduction: Principal Component Analysis can be used to reduce the noise in data. By
removing the principal components with low variance, which are assumed to represent
noise, Principal Component Analysis can improve the signal-to-noise ratio and make it easier
to identify the underlying structure in the data.

6. Data Compression: Principal Component Analysis can be used for data compression. By
representing the data using a smaller number of principal components, which capture most
of the variation in the data, PCA can reduce the storage requirements and speed up
processing.

7. Outlier Detection: Principal Component Analysis can be used for outlier detection. Outliers
are data points that are significantly different from the other data points in the dataset.
Principal Component Analysis can identify these outliers by looking for data points that are
far from the other points in the principal component space.

Disadvantages of Principal Component Analysis

1. Interpretation of Principal Components: The principal components created by Principal
Component Analysis are linear combinations of the original variables, and it is often difficult
to interpret them in terms of the original variables. This can make it difficult to explain the
results of PCA to others.

2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data is
not properly scaled, then PCA may not work well. Therefore, it is important to scale the data
before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal
components selected. Therefore, it is important to carefully select the number of principal
components to retain.

4. Non-linear Relationships: Principal Component Analysis assumes that the relationships
between variables are linear. However, if there are non-linear relationships between
variables, Principal Component Analysis may not work well.

5. Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number of
variables in the dataset is large.

6. Overfitting: Principal Component Analysis can sometimes result in overfitting, which is when
the model fits the training data too well and performs poorly on new data. This can happen if
too many principal components are used or if the model is trained on a small dataset.

What is Factor Analysis?

Factor analysis, a method within the realm of statistics and part of the general linear model (GLM),
serves to condense numerous variables into a smaller set of factors. By doing so, it captures the
maximum shared variance among the variables and condenses them into a unified score, which can
subsequently be utilized for further analysis. Factor analysis operates under several assumptions:
linearity in relationships, absence of multicollinearity among variables, inclusion of relevant variables
in the analysis, and genuine correlations between variables and factors. While multiple methods
exist, principal component analysis stands out as the most prevalent approach in practice.

What does Factor mean in Factor Analysis?

In the context of factor analysis, a “factor” refers to an underlying, unobserved variable or latent
construct that represents a common source of variation among a set of observed variables. These
observed variables, also known as indicators or manifest variables, are the measurable variables that
are directly observed or measured in a study.

How to do Factor Analysis (Factor Analysis Steps)?

Factor analysis is a statistical method used to describe variability among observed, correlated
variables in terms of a potentially lower number of unobserved variables called factors. Here are the
general steps involved in conducting a factor analysis:

1. Determine the Suitability of Data for Factor Analysis

• Bartlett’s Test: Check the significance level to determine if the correlation matrix is suitable
for factor analysis.

• Kaiser-Meyer-Olkin (KMO) Measure: Verify the sampling adequacy. A value greater than 0.6
is generally considered acceptable.

2. Choose the Extraction Method

• Principal Component Analysis (PCA): Used when the main goal is data reduction.
• Principal Axis Factoring (PAF): Used when the main goal is to identify underlying factors.

3. Factor Extraction

• Use the chosen extraction method to identify the initial factors.

• Extract eigenvalues to determine the number of factors to retain. Factors with eigenvalues
greater than 1 are typically retained in the analysis.

• Compute the initial factor loadings.

4. Determine the Number of Factors to Retain

• Scree Plot: Plot the eigenvalues in descending order to visualize the point where the plot
levels off (the “elbow”) to determine the number of factors to retain.

• Eigenvalues: Retain factors with eigenvalues greater than 1.

5. Factor Rotation

• Orthogonal Rotation (Varimax, Quartimax): Assumes that the factors are uncorrelated.

• Oblique Rotation (Promax, Oblimin): Allows the factors to be correlated.

• Rotate the factors to achieve a simpler and more interpretable factor structure.

• Examine the rotated factor loadings.

6. Interpret and Label the Factors

• Analyze the rotated factor loadings to interpret the underlying meaning of each factor.

• Assign meaningful labels to each factor based on the variables with high loadings on that
factor.

7. Compute Factor Scores (if needed)

• Calculate the factor scores for each individual to represent their value on each factor.

8. Report and Validate the Results

• Report the final factor structure, including factor loadings and communalities.

• Validate the results using additional data or by conducting a confirmatory factor analysis if
necessary. A minimal code sketch of this workflow is shown below.
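As promised above, here is a minimal code sketch of this workflow using scikit-learn's FactorAnalysis. The iris data, the choice of two factors, and the varimax rotation are illustrative assumptions; dedicated packages such as factor_analyzer also expose the KMO measure and Bartlett's test used in step 1.

• Python3

from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Standardize the variables so they are on comparable scales
X = StandardScaler().fit_transform(load_iris().data)

# Extract two factors and apply a varimax rotation for interpretability
fa = FactorAnalysis(n_components=2, rotation='varimax', random_state=42)
scores = fa.fit_transform(X)    # factor scores for each observation

print(fa.components_)           # factor loadings (rows: factors, columns: variables)
print(scores[:5])               # factor scores of the first five observations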

Why do we need Factor Analysis?

Factorial analysis serves several purposes and objectives in statistical analysis:

1. Dimensionality Reduction: Factor analysis helps in reducing the number of variables under
consideration by identifying a smaller number of underlying factors that explain the
correlations or covariances among the observed variables. This simplification can make the
data more manageable and easier to interpret.

2. Identifying Latent Constructs: It allows researchers to identify latent constructs or
underlying factors that may not be directly observable but are inferred from patterns in the
observed data. These latent constructs can represent theoretical concepts, such as
personality traits, attitudes, or socioeconomic status.

3. Data Summarization: By condensing the information from multiple variables into a smaller
set of factors, factor analysis provides a more concise summary of the data while retaining as
much relevant information as possible.

4. Hypothesis Testing: Factor analysis can be used to test hypotheses about the underlying
structure of the data. For example, researchers may have theoretical expectations about how
variables should be related to each other, and factor analysis can help evaluate whether
these expectations are supported by the data.

5. Variable Selection: It aids in identifying which variables are most important or relevant for
explaining the underlying factors. This can help in prioritizing variables for further analysis or
for developing more parsimonious models.

6. Improving Predictive Models: Factor analysis can be used as a preprocessing step to improve
the performance of predictive models by reducing multicollinearity among predictors and
capturing the shared variance among variables more efficiently.

Most Commonly used Terms in Factor Analysis

In factor analysis, several terms are commonly used to describe various concepts and components of
the analysis. Below are some of the most commonly used terms in factor analysis, each with a brief
description:

• Factor: Latent variable representing a group of observed variables that are related and tend to co-occur.

• Factor Loading: Correlation coefficient between the observed variable and the underlying factor.

• Eigenvalue: A value indicating the amount of variance explained by each factor.

• Communalities: The proportion of each observed variable’s variance that can be explained by the factors.

• Extraction Method: The technique used to extract the initial factors from the observed variables (e.g., principal component analysis, maximum likelihood).

• Rotation: A method used to rotate the factors to achieve a simpler and more interpretable factor structure (e.g., Varimax, Promax).

• Factor Matrix: A matrix showing the loadings of observed variables on extracted factors.

• Scree Plot: A plot used to determine the number of factors to retain based on the magnitude of eigenvalues.

• Kaiser-Meyer-Olkin (KMO) Measure: A measure of sampling adequacy, indicating the suitability of data for factor analysis. Values range from 0 to 1, with higher values indicating better suitability.

• Bartlett’s Test: A statistical test used to determine whether the observed variables are intercorrelated enough for factor analysis.

• Factor Rotation: The process of rotating the factors to achieve a simpler and more interpretable factor structure.

• Factor Scores: Scores that represent the value of each factor for each individual observation.

• Factor Variance: The amount of variance in the observed variables explained by each factor.

• Loading Plot: A plot used to visualize the factor loadings of observed variables on the extracted factors.

• Factor Rotation Criterion: A rule or criterion used to determine the appropriate rotation method and angle to achieve a simpler and more interpretable factor structure.

Let us discuss some of these Factor Analysis terms:

1. Factor Loadings:

• Factor loadings represent the correlations between the observed variables and the
underlying factors in factor analysis. They indicate the strength and direction of the
relationship between each variable and each factor.
o Squaring the standardized factor loading gives the “communality,” which
represents the proportion of variance in a variable explained by the factor.

2. Communality:

• Communality is the sum of the squared factor loadings for a given variable across all
factors. It measures the proportion of variance in a variable that is explained by all
the factors jointly.

o Communality can be interpreted as the reliability of the variable in the
context of the factors being considered.

3. Spurious Solutions:

• If the communality of a variable exceeds 1.0, it indicates a spurious solution, which
may result from factors such as a small sample size or extracting too many or too few
factors.

4. Uniqueness of a Variable:

• Uniqueness of a variable represents the variability of the variable minus its
communality. It reflects the proportion of variance in a variable that is not accounted
for by the factors.

5. Eigenvalues/Characteristic Roots:

• Eigenvalues measure the amount of variation in the total sample accounted for by
each factor. They indicate the importance of each factor in explaining the variance in
the variables.

o A higher eigenvalue suggests a more important factor in explaining the data.

6. Extraction Sums of Squared Loadings:

• These are the sums of squared loadings associated with each extracted factor. They
provide information on how much variance in the variables is accounted for by each
factor.

7. Factor Scores:

• Factor scores represent the scores of each case (row) on each factor (column) in the
factor analysis. They are computed by multiplying each case’s standardized score on
each variable by the corresponding factor loading and summing these products.

Types of Factor Analysis

There are two main types of Factor Analysis used in data science:

1. Exploratory Factor Analysis (EFA)

Exploratory Factor Analysis (EFA) is used to uncover the underlying structure of a set of observed
variables without imposing preconceived notions about how many factors there are or how the
variables are related to each factor. It explores complex interrelationships among items and aims to
group items that are part of unified concepts or constructs.
• Researchers do not make a priori assumptions about the relationships among factors,
allowing the data to reveal the structure organically.

• Exploratory Factor Analysis (EFA) helps in identifying the number of factors needed to
account for the variance in the observed variables and understanding the relationships
between variables and factors.

2. Confirmatory Factor Analysis (CFA)

Confirmatory Factor Analysis (CFA) is a more structured approach that tests specific hypotheses
about the relationships between observed variables and latent factors based on prior theoretical
knowledge or expectations. It uses structural equation modeling techniques to test a measurement
model, wherein the observed variables are assumed to load onto specific factors.

• Confirmatory Factor Analysis (CFA) assesses the fit of the hypothesized model to the actual
data, examining how well the observed variables align with the proposed factor structure.

• This method allows for the evaluation of relationships between observed variables and
unobserved factors, and it can accommodate measurement error.

• Researchers hypothesize the relationships between variables and factors before conducting
the analysis, and the model is tested against empirical data to determine its validity.

In summary, while Exploratory Factor Analysis (EFA) is more exploratory and flexible, allowing the
data to dictate the factor structure, Confirmatory Factor Analysis (CFA) is more confirmatory, testing
specific hypotheses about how the observed variables are related to latent factors. Both methods are
valuable tools in understanding the underlying structure of data and have their respective strengths
and applications.

Types of Factor Extraction Methods

Some of the types of factor extraction methods are discussed below:

1. Principal Component Analysis (PCA):

• PCA is a widely used method for factor extraction.

• It aims to extract factors that account for the maximum possible variance in the
observed variables.

• Factor weights are computed to extract successive factors until no further
meaningful variance can be extracted.

• After extraction, the factor model is often rotated for further analysis to enhance
interpretability.

2. Canonical Factor Analysis:

• Also known as Rao’s canonical factoring, this method computes a similar model to
PCA but uses the principal axis method.

• It seeks factors that have the highest canonical correlation with the observed
variables.

• Canonical factor analysis is not affected by arbitrary rescaling of the data, making it
robust to certain data transformations.
3. Common Factor Analysis:

• Also referred to as Principal Factor Analysis (PFA) or Principal Axis Factoring (PAF).

• This method aims to identify the fewest factors necessary to account for the
common variance (correlation) among a set of variables.

• Unlike PCA, common factor analysis focuses on capturing shared variance rather
than overall variance, as the sketch after this list illustrates.
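As a rough illustration of the difference between PCA-style extraction and common factor analysis, the sketch below compares the two decompositions side by side. The use of scikit-learn, the Iris data, and two components are all assumptions made only for the example.

# Minimal sketch: PCA extraction vs common factor analysis (data and component count are illustrative)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, FactorAnalysis

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2).fit(X)              # extracts components that account for maximum total variance
fa = FactorAnalysis(n_components=2).fit(X)    # models shared variance plus per-variable noise

print(pca.explained_variance_ratio_)          # share of total variance captured by each component
print(fa.components_)                         # factor loadings, one row per factor
print(fa.noise_variance_)                     # unique (non-shared) variance left to each variable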

Assumptions of Factor Analysis

Let’s have a closer look onto the assumptions of factorial analysis, that are as follows:

1. Linearity: The relationships between variables and factors are assumed to be linear.

2. Multivariate Normality: The variables in the dataset should follow a multivariate normal
distribution.

3. No Multicollinearity: Variables should not be highly correlated with each other, as high
multicollinearity can affect the stability and reliability of the factor analysis results.

4. Adequate Sample Size: Factor analysis generally requires a sufficient sample size to produce
reliable results. The adequacy of the sample size can depend on factors such as the
complexity of the model and the ratio of variables to cases.

5. Homoscedasticity: The variance of the variables should be roughly equal across different
levels of the factors.

6. Uniqueness: Each variable should have unique variance that is not explained by the factors.
This assumption is particularly important in common factor analysis.

7. Independent Observations: The observations in the dataset should be independent of each
other.

8. Linearity of Factor Scores: The relationship between the observed variables and the latent
factors is assumed to be linear, even though the observed variables may not be linearly
related to each other.

9. Interval or Ratio Scale: Factor analysis typically assumes that the variables are measured on
interval or ratio scales, as opposed to nominal or ordinal scales.

Violation of these assumptions can lead to biased parameter estimates and inaccurate
interpretations of the results. Therefore, it’s important to assess the data for these assumptions
before conducting factor analysis and to consider potential remedies or alternative methods if the
assumptions are not met.
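Two of these points, sampling adequacy and the presence of enough correlation among the variables, are commonly checked before running factor analysis. A minimal sketch is given below; it assumes the third-party factor_analyzer package and uses the Iris data as a stand-in for real measurement data, and the function names should be verified against the installed version of that package.

# Minimal sketch: pre-checks for factor analysis (assumes the factor_analyzer package; data is illustrative)
import pandas as pd
from sklearn.datasets import load_iris
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

df = pd.DataFrame(load_iris().data)                        # stand-in for real survey/measurement data

chi_square, p_value = calculate_bartlett_sphericity(df)    # Bartlett's test: do correlations differ from identity?
kmo_per_variable, kmo_overall = calculate_kmo(df)          # Kaiser-Meyer-Olkin sampling adequacy

print(p_value)       # a small p-value suggests factor analysis may be appropriate
print(kmo_overall)   # values above roughly 0.6 are usually considered adequate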

KERNEL PCA: PCA is a linear method, so it works well when the data lie close to a linear subspace.
If we apply it to datasets with a non-linear structure, the resulting low-dimensional representation
may not be optimal. Kernel PCA uses a kernel function to implicitly project the dataset into a
higher-dimensional feature space, where the structure becomes approximately linear, and then applies
PCA there. The idea is similar to the kernel trick used in Support Vector Machines. Common kernels
include the linear, polynomial, and Gaussian (RBF) kernels.

Kernel Principal Component Analysis (KPCA) is a technique used in machine learning for nonlinear
dimensionality reduction. It is an extension of the classical Principal Component Analysis (PCA)
algorithm, which is a linear method that identifies the most significant features or components of a
dataset. KPCA applies a nonlinear mapping function to the data before applying PCA, allowing it to
capture more complex and nonlinear relationships between the data points.

In KPCA, a kernel function is used to map the input data to a high-dimensional feature space, where
the nonlinear relationships between the data points can be more easily captured by linear methods
such as PCA. The principal components of the transformed data are then computed, which can be
used for tasks such as data visualization, clustering, or classification.

One of the advantages of KPCA over traditional PCA is that it can handle nonlinear relationships
between the input features, which can be useful for tasks such as image or speech recognition. KPCA
can also handle high-dimensional datasets with many features by reducing the dimensionality of the
data while preserving the most important information.

However, KPCA has some limitations, such as the need to choose an appropriate kernel function and
its corresponding parameters, which can be difficult and time-consuming. KPCA can also be
computationally expensive for large datasets, as it requires the computation of the kernel matrix for
all pairs of data points.
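A minimal sketch of kernel PCA with an RBF kernel is shown below. The concentric-circles toy dataset and the gamma value are assumptions chosen only to make the non-linear structure obvious.

# Minimal sketch: kernel PCA on a non-linear toy dataset (dataset and gamma are illustrative choices)
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)              # linear projection: the two rings stay entangled
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)                            # RBF kernel: the rings become linearly separable

print(X_pca.shape, X_kpca.shape)   # (400, 2) (400, 2)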

Latent Semantic Analysis is a natural language processing method that uses the statistical approach
to identify the association among the words in a document. LSA deals with the following kind of
issue:

Example: mobile, phone, cell phone, telephone are all similar but if we pose a query like “The cell
phone has been ringing” then the documents which have “cell phone” are only retrieved whereas
the documents containing the mobile, phone, telephone are not retrieved.

Assumptions of LSA:

1. The words which are used in the same context are analogous to each other.

2. The hidden semantic structure of the data is unclear due to the ambiguity of the words
chosen.

Singular Value Decomposition:

Singular Value Decomposition is the statistical method that is used to find the latent(hidden)
semantic structure of words spread across the document.

Let

C = collection of documents.

d = number of documents.

n = number of unique words in the whole collection.


M = d × n matrix

The SVD decomposes the matrix M, i.e. the word-to-document matrix, into three matrices as follows:

M = U Σ Vᵀ

where

U = distribution of words across the different contexts

Σ = diagonal matrix of the association among (importance of) the contexts

Vᵀ = distribution of contexts across the different documents

Fig: SVD of the d × n matrix M

A very significant feature of SVD is that it allows us to truncate the contexts that we do not
actually need. The Σ matrix provides the diagonal (singular) values, which represent the significance
of the contexts from highest to lowest. By keeping only the largest of these values we can reduce the
dimensions, which is why truncated SVD can also be used as a dimensionality reduction technique.

If we keep only the k largest diagonal values in the Σ matrix, we obtain

Mk = Uk Σk Vkᵀ

where

Mk = the rank-k approximation of M

Uk, Σk, Vkᵀ are the matrices containing only the k retained contexts from U, Σ, Vᵀ respectively.

Fig: Truncated SVD after selecting the k largest singular values
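As a practical illustration, the sketch below builds a small term-document representation and truncates it with scikit-learn's TruncatedSVD. The toy documents and the choice of k = 2 are assumptions made only for the example.

# Minimal sketch: LSA via truncated SVD (toy corpus and k are illustrative assumptions)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "The cell phone has been ringing",
    "My mobile keeps ringing in meetings",
    "The telephone on the desk is broken",
    "I bought a new phone yesterday",
]

tfidf = TfidfVectorizer()
M = tfidf.fit_transform(docs)            # d x n document-term matrix (sparse)

lsa = TruncatedSVD(n_components=2, random_state=0)
doc_contexts = lsa.fit_transform(M)      # each document expressed in k latent contexts

print(M.shape)                           # (4, number of unique terms)
print(doc_contexts.shape)                # (4, 2)
print(lsa.singular_values_)              # the k largest singular values of M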

What is Independent Component Analysis?

Independent Component Analysis (ICA) is a statistical and computational technique used in machine
learning to separate a multivariate signal into its independent non-Gaussian components. The goal of
ICA is to find a linear transformation of the data such that the transformed data is as close to being
statistically independent as possible.

The heart of ICA lies in the principle of statistical independence. ICA identifies components within
mixed signals that are statistically independent of each other.

Statistical Independence Concept:

In probability theory, two random variables X and Y are statistically independent if the joint
probability distribution of the pair is equal to the product of their individual probability
distributions, i.e. P(X, Y) = P(X) P(Y). Equivalently, knowing the outcome of one variable does not
change the probability of the other outcome.

Assumptions in ICA

1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.

2. The second assumption is that each source signal exhibits non-Gaussian distributions.

Mathematical Representation of Independent Component Analysis

The observed random vector is x = (x1, x2, ..., xm)ᵀ, representing the observed data with m
components. The hidden components are represented by the random vector s = (s1, s2, ..., sn)ᵀ,
where n is the number of hidden sources.

Linear Static Transformation

The observed data x is transformed into the hidden components s using a linear static
transformation represented by the matrix W:

s = W x

Here, W = transformation (unmixing) matrix.

The goal is to transform the observed data x in such a way that the resulting hidden components
are independent. The independence is measured by some function F(s1, ..., sn). The task is to find
the optimal transformation matrix W that maximizes the independence of the hidden components,
as the sketch below demonstrates.
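The following is a minimal sketch using scikit-learn's FastICA. The two synthetic source signals and the hand-picked mixing matrix are assumptions made purely for illustration.

# Minimal sketch: separating two mixed signals with FastICA (sources and mixing matrix are illustrative)
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                        # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))               # source 2: square wave (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                 # assumed mixing matrix
              [0.4, 1.0]])
X = S @ A.T                               # observed mixtures, one row per time step

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)              # estimated independent components
W_est = ica.components_                   # estimated unmixing matrix

print(S_est.shape)                        # (2000, 2)
print(W_est.shape)                        # (2, 2)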

Advantages of Independent Component Analysis (ICA):

• ICA is a powerful tool for separating mixed signals into their independent components. This
is useful in a variety of applications, such as signal processing, image analysis, and data
compression.

• ICA is a largely non-parametric approach: beyond the non-Gaussianity assumption, it does not
require a specific parametric model of the underlying probability distribution of the data.

• ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data
is not available.

• ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.

Disadvantages of Independent Component Analysis (ICA):

• ICA assumes that the underlying sources are non-Gaussian, which may not always be true. If
the underlying sources are Gaussian, ICA may not be effective.

• ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.

• ICA can be computationally expensive, especially for large datasets. This can make it difficult
to apply ICA to real-world problems.

• ICA can suffer from convergence issues, which means that it may not always be able to find a
solution. This can be a problem for complex datasets with many sources.

Difference between PCA and ICA

Both techniques are used in signal processing and dimensionality reduction, but they have
different goals.

Principal Component Analysis (PCA) vs Independent Component Analysis (ICA):

• PCA reduces the dimensions to avoid the problem of overfitting, whereas ICA decomposes the
mixed signal into its independent source signals.

• PCA deals with principal components, whereas ICA deals with independent components.

• PCA focuses on maximizing the variance, whereas ICA does not focus on the variance among the
data points.

• PCA focuses on the mutual orthogonality of the principal components, whereas ICA does not
focus on the mutual orthogonality of the components.

• PCA does not focus on the mutual independence of the components, whereas ICA focuses on the
mutual independence of the components.

Non-Negative Matrix Factorization:

Non-negative Matrix Factorization (NMF) is a matrix factorization method in which we constrain the
factor matrices to be non-negative. In order to understand NMF, we should first clarify the intuition
behind matrix factorization.

For a matrix A of dimensions m x n, where each element is ≥ 0, NMF can factorize it into two
matrices W and H having dimensions m x k and k x n respectively, where these two matrices contain
only non-negative elements. Here, matrix A is approximated as:

A ≈ W × H

where,

A -> Original input matrix (approximated by the product of W and H)

W -> Feature (basis) matrix

H -> Coefficient matrix (weights associated with W)

k -> Rank of the low-rank approximation of A (k ≤ min(m, n))

This method is widely used in performing tasks such as feature reduction in Facial Recognition and
for various NLP tasks.

Intuition:
Fig 1 : NMF Intuition

The objective of NMF is dimensionality reduction and feature extraction. So, when we choose a lower
dimension k, the goal of NMF is to find two matrices W ∈ R^(m×k) and H ∈ R^(k×n) having only
non-negative elements. (As shown in Fig 1)

Therefore, by using NMF we are able to obtain factorized matrices having significantly lower
dimensions than the original matrix. Intuitively, NMF assumes that the original input is made up of a
set of hidden features, each represented by a column of the W matrix, while each column of the H
matrix holds the ‘coordinates of a data point’ with respect to W. In simple terms, H contains the
weights associated with the matrix W.

In this view, each data point, represented as a column in A, can be approximated by an additive
combination of the non-negative basis vectors, which are represented as columns in W.

How Does it Work?

• NMF decomposes multivariate data by creating a user-defined number of features. Each
feature is a linear combination of the original attribute set; the coefficients of these linear
combinations are non-negative.

• NMF decomposes a data matrix V into the product of two lower rank matrices W and H so
that V is approximately equal to W times H.

• NMF uses an iterative procedure to modify the initial values of W and H so that the product
approaches V. The procedure terminates when the approximation error converges or the
specified number of iterations is reached (see the sketch after this list).

• During model apply, an NMF model maps the original data into the new set of attributes
(features) discovered by the model.
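A minimal sketch with scikit-learn's NMF is shown below. The random non-negative data matrix, the number of components, and the solver settings are assumptions made only for illustration.

# Minimal sketch: NMF with scikit-learn (data matrix, rank k, and settings are illustrative assumptions)
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((100, 20))                 # non-negative data matrix, 100 samples x 20 attributes

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)                # 100 x 5 feature matrix (non-negative)
H = model.components_                     # 5 x 20 coefficient matrix (non-negative)

V_approx = W @ H                          # low-rank approximation of V
print(model.reconstruction_err_)          # Frobenius-norm error between V and W @ H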

Real-life example:

Let us consider some real-life examples to understand the working of the NMF algorithm. Let’s take a
case of image processing.

Suppose, we have an input image, having pixels that form matrix A. Using NMF, we factorize it into
two matrices, one containing the facial feature set [Matrix W] and the other containing the
importance of each facial feature in the input image, i.e. the weights [Matrix H]. (As shown in Fig 2.)
Fig 2 : NMF in Image Processing

NMF is used in major applications such as image processing, text mining, spectral data analysis and
many more. There is ongoing research on NMF to increase its efficiency and robustness, as well as
work on topics such as collective factorization and efficient updates of the factor matrices.
