Feature Engineering

PCA

PCA (Principal Component Analysis) is used when the data is multivariate and numeric.

PCA is a technique for reducing the dimensionality of numerical data while retaining as much variability as possible. It transforms the original variables into a new set of uncorrelated variables (principal components) that are ordered by the amount of variance they explain in the data.

It is not typically used with categorical or ordinal data directly, as it relies on numerical values to compute variances and covariances.
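
As a minimal sketch of PCA in practice, assuming scikit-learn is available and using a small synthetic dataset (both are illustrative assumptions, not part of the original notes):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # illustrative: 100 samples, 5 numeric features

X_std = StandardScaler().fit_transform(X)   # standardize the features first
pca = PCA(n_components=2)                   # keep the top 2 principal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_)        # variance explained by each component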

Performing PCA (Principal Component Analysis) proceeds as follows (a from-scratch sketch of these steps appears after the list):

• Standardize the data: This step ensures that each feature contributes equally to the analysis, especially if the features are on different scales.
• Generate the covariance matrix / correlation matrix for all the dimensions: After standardizing, compute the covariance or correlation matrix to understand the relationships between the variables.
• Perform eigen decomposition: Compute the eigenvalues and eigenvectors of the covariance or correlation matrix.
• Sort the eigen pairs in descending order of eigenvalues and select the ones with the largest values: Rank the eigenvectors by their corresponding eigenvalues and select the top components that capture the most variance.
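
A minimal NumPy sketch of these four steps; the random data matrix and the choice of two components are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # illustrative data matrix

# Step 1: standardize the data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigen decomposition (eigh suits symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort eigen pairs in descending order and keep the top k
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 2
X_reduced = X_std @ eigenvectors[:, :k]            # project onto the top-k components

print(eigenvalues / eigenvalues.sum())             # fraction of variance per component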

Eigenvectors (principal components) indicate directions of maximum variance, while eigenvalues reflect the amount of variance along those directions. Reducing the dimensionality therefore means discarding the less significant information, which is associated with the smaller eigenvalues.

A scree plot is a graphical tool used in Principal Component Analysis (PCA) to help determine the number of principal components to retain. It visualizes the eigenvalues associated with each principal component and helps to identify the "elbow" point, which indicates the optimal number of components to keep for analysis.
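
A minimal scree-plot sketch with matplotlib; the synthetic data is an assumption for illustration:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.sort(np.linalg.eigvalsh(np.cov(X_std, rowvar=False)))[::-1]

# Scree plot: eigenvalues in descending order; look for the "elbow"
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()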

Factor Analysis
Factor analysis is a statistical technique used to identify
underlying relationships between variables by grouping them
into factors. These factors are latent variables that explain the
patterns of correlations observed in the data.

Factor analysis is often employed to simplify complex datasets, reduce dimensionality, and identify the underlying structure.
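
A minimal sketch using scikit-learn's FactorAnalysis; the synthetic two-factor data and n_components=2 are assumptions for illustration:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Illustrative data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)        # factor scores for each observation
print(fa.components_.shape)         # (2, 6): loading of each variable on each factor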

Assumptions for factor analysis

• Linearity of Correlations
• Normality
• Absence of Multicollinearity: Factor analysis assumes that the variables are not too highly collinear, because highly correlated variables can complicate the extraction of distinct factors.
• Homoscedasticity: The variance should be roughly equal across variables.

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique in linear algebra used for various applications in data analysis, machine learning, and signal processing. SVD decomposes a matrix A into three other matrices, A = U S V^T, where U and V have orthonormal columns and S is a diagonal matrix of non-negative singular values in descending order, providing insights into the structure and properties of the original matrix.
Applications
• Dimensionality Reduction: In techniques like Principal Component Analysis (PCA), SVD is used to reduce the dimensionality of data while preserving as much variance as possible.
• Data Compression: In image compression, SVD helps compress data by retaining only the most significant singular values, yielding a low-rank approximation of the original matrix.
• Noise Reduction: SVD can be used to filter out noise by reconstructing the matrix with only the largest singular values (see the sketch after this list).
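
A minimal NumPy sketch of low-rank reconstruction via SVD; the random matrix and the choice of rank k = 2 are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))                  # illustrative matrix

# Thin SVD: A = U @ diag(s) @ Vt, with singular values s in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Relative reconstruction error shrinks as k grows toward min(m, n)
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))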

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful dimensionality reduction technique used for visualizing high-dimensional data in a lower-dimensional space, typically 2D or 3D. It is particularly useful for exploring and understanding complex datasets, revealing patterns, clusters, and relationships that may not be evident in higher dimensions.

t-SNE is primarily designed for continuous numerical data. For categorical or mixed-type data, pre-processing and encoding are required, which might affect the quality of the results.
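
A minimal sketch with scikit-learn's TSNE; the digits dataset and the perplexity value are assumptions chosen for illustration:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional digit images

# Embed into 2D; perplexity balances local vs. global structure
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5, cmap="tab10")
plt.title("t-SNE embedding of the digits dataset")
plt.show()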

Crowding Problem: In lower dimensions, t-SNE may struggle with the crowding problem, where the data points become too crowded and the representation may not capture global structures well.
