
Unit-3

(Dimensionality reduction): Introduction to Dimensionality reduction, Data set representation, Matrix representation of data set, Covariance of the Data Matrix. Principal component analysis (PCA): Introduction to PCA, Geometric intuition, PCA for dimensionality reduction, PCA Limitations.
Introduction to Dimensionality reduction

• Dimensionality reduction is a technique used in machine learning and data analysis to reduce the number of features or variables under consideration.
• The aim is to simplify the dataset while retaining as much relevant information as possible.
• This is particularly useful when dealing with high-dimensional data, where the number of features is large compared to the number of samples.

There are various methods for dimensionality reduction, including:


1.Feature selection: Selecting a subset of the original features based on specific criteria such
as relevance, importance, or correlation.
2.Feature extraction: Transforming the original features into a lower-dimensional space
using techniques like principal component analysis (PCA), linear discriminant analysis
(LDA), or t-distributed stochastic neighbor embedding (t-SNE). These methods aim to
preserve the most important information while reducing the dimensionality.
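As a rough illustration of the difference between the two approaches, the sketch below (assuming scikit-learn and NumPy are installed; the data matrix X and labels y are made up) keeps a subset of the original columns with a filter-style score, and separately extracts two new components with PCA.

# A minimal sketch of feature selection vs. feature extraction
# (assumes scikit-learn and NumPy are available; X and y are made up).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 original features
y = rng.integers(0, 2, size=100)      # binary labels, needed only for selection

# Feature selection: keep 3 of the original 10 columns (filter method, ANOVA F-score)
X_selected = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Feature extraction: build 2 new features as linear combinations of all 10 (PCA)
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape)   # (100, 3) -- original features, just fewer of them
print(X_extracted.shape)  # (100, 2) -- new derived features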
Data Representation

There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three ways:
  • Filter
  • Wrapper
  • Embedded
• Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.

Matrix representation of data set:
• Matrices serve as a powerful tool for representing datasets.
• Consider a dataset containing information about various individuals, such as age, income, and education level.
• By organizing this data into a matrix, where each row corresponds to an individual and each column represents a different attribute, we create a structured representation that facilitates analysis.
• This tabular arrangement simplifies operations like finding averages and correlations, and performing statistical analyses.
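As a small illustration (the attribute names and values below are made up), such a dataset can be stored as a NumPy matrix in which each row is one individual and each column is one attribute.

import numpy as np

# Rows = individuals, columns = attributes (age, income in thousands, education years).
# The values are illustrative only.
columns = ["age", "income", "education_years"]
X = np.array([
    [25, 40.0, 16],
    [32, 55.5, 18],
    [47, 72.0, 12],
    [51, 60.0, 14],
])

print(X.shape)          # (4, 3): 4 individuals, 3 attributes
print(X.mean(axis=0))   # column-wise averages, one per attribute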
Covariance of the Data Matrix
The variance-covariance matrix is a square matrix whose diagonal elements are the variances of the individual variables and whose off-diagonal elements are the covariances between pairs of variables. The covariance of two variables can take any real value: positive, negative, or zero. A positive covariance suggests that the two variables have a direct relationship, a negative covariance indicates an inverse relationship, and if two variables do not vary together, they have zero covariance.
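A minimal sketch of building a variance-covariance matrix with NumPy (the small data matrix is made up; rowvar=False tells np.cov that the variables are in the columns rather than the rows).

import numpy as np

# Made-up data matrix: 5 samples (rows) x 3 variables (columns).
X = np.array([
    [2.0, 4.1, 0.5],
    [2.8, 3.9, 1.0],
    [3.6, 3.0, 1.4],
    [4.1, 2.2, 2.1],
    [5.0, 1.8, 2.4],
])

# rowvar=False: each column of X is one variable.
C = np.cov(X, rowvar=False)

print(C.shape)        # (3, 3) square variance-covariance matrix
print(np.diag(C))     # diagonal entries: variances of the three variables
print(C[0, 1])        # off-diagonal entry: covariance of variables 0 and 1 (negative here)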
Principal Component Analysis (PCA)
The Principal Component Analysis (PCA) technique was introduced by the mathematician Karl Pearson in 1901. It maps data from a higher-dimensional space to a lower-dimensional space in such a way that the variance of the data in the lower-dimensional space is maximized.
•Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is one of the most widely used tools in exploratory data analysis and in machine learning for predictive models.
•Principal Component Analysis (PCA) is an unsupervised learning technique used to examine the interrelations among a set of variables. It is also related to general factor analysis, where regression determines a line of best fit.
•The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without any prior knowledge of the target variables.
The first principal component captures the most variation in the data, the second principal component captures the maximum remaining variance that is orthogonal to the first principal component, and so on.
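A minimal usage sketch with scikit-learn (the data and the choice of n_components=2 are illustrative, not part of the original example): the data is standardized first, then projected onto the leading principal components, and explained_variance_ratio_ reports how much variance each component captures.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))               # made-up data: 200 samples, 5 features

X_std = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1 per feature

pca = PCA(n_components=2)                   # keep the two leading components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (200, 2)
print(pca.explained_variance_ratio_)        # fraction of variance captured by each component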
Step-By-Step Explanation of PCA
Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0 and a
standard deviation of 1.
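In other words, each value is replaced by its z-score, z = (x − mean) / standard deviation, computed per feature. A minimal NumPy sketch (the small matrix X is made up):

import numpy as np

def standardize(X):
    # z-score each column: subtract the column mean, divide by the column standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9]])  # made-up example
Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column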

Step 2: Covariance Matrix Computation

Covariance measures the strength of joint variability between two or more variables, indicating how much they change in relation to each other. To find the covariance we can use the formula:

cov(x1, x2) = Σ (x1i – x1mean)(x2i – x2mean) / (n – 1)

The value of the covariance can be positive, negative, or zero.
•Positive: as x1 increases, x2 also increases.
•Negative: as x1 increases, x2 decreases.
•Zero: no direct relation.
Step 3: Compute Eigenvalues and Eigenvectors of the Covariance Matrix to Identify Principal Components

Let A be a square n x n matrix and X be a non-zero vector for which

AX = λX

for some scalar value λ. Then λ is known as an eigenvalue of matrix A, and X is known as the eigenvector of matrix A for the corresponding eigenvalue.
It can also be written as:
AX − λX = 0
(A − λI)X = 0
where I is the identity matrix of the same shape as matrix A. The above condition holds for a non-zero X only if (A − λI) is non-invertible (i.e. a singular matrix). That means
∣A − λI∣ = 0
From this equation we can find the eigenvalues λ, and the corresponding eigenvector can then be found using the equation AX = λX.
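In practice the eigenvalues and eigenvectors are computed numerically. A minimal sketch with NumPy (the 2 x 2 matrix below is made up; np.linalg.eigh is used because a covariance matrix is symmetric):

import numpy as np

# Made-up symmetric (covariance-like) matrix A.
A = np.array([[2.0, 1.2],
              [1.2, 3.0]])

# eigh returns eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

print(eigenvalues)          # solutions of |A - lambda*I| = 0
print(eigenvectors[:, -1])  # eigenvector for the largest eigenvalue (the principal direction)

# Check that A X = lambda X holds for the largest eigenvalue.
print(np.allclose(A @ eigenvectors[:, -1], eigenvalues[-1] * eigenvectors[:, -1]))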
Consider the following dataset
x1 2.5 0.5 2.2 1.9 3.1 2.3 2.0 1.0 1.5 1.1

x2 2.4 0.7 2.9 2.2 3.0 2.7 1.6 1.1 1.6 0.9

Step 1: Standardize the Dataset

Mean for x1: x1mean = 1.81
Mean for x2: x2mean = 1.91
We will change the dataset by subtracting the mean from each value.

x1new = x1 – x1mean:  0.69  -1.31  0.39  0.09  1.29  0.49  0.19  -0.81  -0.31  -0.71
x2new = x2 – x2mean:  0.49  -1.21  0.99  0.29  1.09  0.79  -0.31  -0.81  -0.31  -1.01
Step 3: Arrange Eigenvalues
The eigenvector with the highest eigenvalue is the principal component of the dataset. So in this case, the eigenvector corresponding to λ1 (the largest eigenvalue) is the principal component.
{Basically, to complete the numerical we only have to solve up to this step, but if we have to show why that particular eigenvector is chosen, we have to follow steps 4 to 6.}
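A sketch that carries this numerical through with NumPy, under the assumption that only the mean is subtracted (as in the steps above) and that the sample covariance (division by n − 1) is used; the eigenvalues should come out to roughly 0.049 and 1.28, and the eigenvector of the larger one is the first principal component.

import numpy as np

x1 = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
x2 = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
X = np.column_stack([x1, x2])

X_centered = X - X.mean(axis=0)          # subtract the means (1.81 and 1.91)

C = np.cov(X_centered, rowvar=False)     # 2x2 sample covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(C)
print(eigenvalues)                        # the larger one belongs to the principal component
pc1 = eigenvectors[:, np.argmax(eigenvalues)]
print(pc1)                                # first principal component (unit vector)

# Project the centered data onto the first principal component (1-D representation).
scores = X_centered @ pc1
print(scores.shape)                       # (10,)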
Q. Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).

Step-01
The given feature vectors are-

x1 = (2, 1)
x2 = (3, 5)
x3 = (4, 3)
x4 = (5, 6)
x5 = (6, 7)
x6 = (7, 8)
Step-02:
Calculate the mean vector (µ).
µ = ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6) = (4.5, 5)
Step-03:
Subtract mean vector (µ) from the given feature vectors.
x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)
x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)
x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)
x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)
x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)
x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)

Step-04:
Calculate the covariance matrix.
The covariance matrix is given by
Covariance matrix = (m1 + m2 + m3 + m4 + m5 + m6) / 6
where mi = (xi – µ)(xi – µ)^T for each mean-subtracted feature vector.
On adding the above matrices and dividing by 6, we get

Covariance matrix M = | 2.92  3.67 |
                      | 3.67  5.67 |
Step-05:

Calculate the eigen values and eigen vectors of the covariance matrix.
λ is an eigen value for a matrix M if it is a solution of the characteristic equation |M – λI| = 0.

Substituting the covariance matrix M, we get
(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0
16.56 – 2.92λ – 5.67λ + λ² – 13.47 = 0
λ² – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38.

Thus, the two eigen values are λ1 = 8.22 and λ2 = 0.38.

Clearly, the second eigen value is very small compared to the first eigen value, so the second eigen vector can be left out.
The eigen vector corresponding to the greatest eigen value is the principal component for the given data set.
So, we find the eigen vector corresponding to eigen value λ1.

We use the following equation to find the Eigen vector-


MX = λX
where-
•M = Covariance Matrix
•X = Eigen vector
•λ = Eigen value

Substituting the values of M and λ1 into MX = λX, we get
2.92X1 + 3.67X2 = 8.22X1
3.67X1 + 5.67X2 = 8.22X2

On simplification, we get
5.3X1 = 3.67X2 ………(1)
3.67X1 = 2.55X2 ………(2)

From (1) and (2), X1 = 0.69X2

From (2), taking X2 = 3.67 gives X1 = 2.55, so the eigen vector, and hence the principal component for the given data set, is X = (2.55, 3.67), i.e. any vector proportional to (0.69, 1).
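The arithmetic above can be checked numerically. A small sketch (it uses division by N, as in Step-04, so the result should match the hand-computed 2.92, 3.67 and 5.67 entries; the eigh output may differ from the hand result by rounding or by an overall sign):

import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)
M = np.cov(X, rowvar=False, bias=True)     # bias=True: divide by N (= 6), as in Step-04
print(M)                                   # approximately [[2.92, 3.67], [3.67, 5.67]]

eigenvalues, eigenvectors = np.linalg.eigh(M)
print(eigenvalues)                         # roughly 0.38 and 8.2, matching lambda2 and lambda1
print(eigenvectors[:, -1])                 # principal component, proportional to (0.69, 1) up to sign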
PCA Limitations

Principal Component Analysis (PCA) is a dimensionality reduction technique in machine learning, but it has some limitations:
•Assumptions: PCA assumes linear relationships between correlated features, and the components it produces are constrained to be orthogonal to each other.
•Sensitivity to outliers: PCA is sensitive to outliers, which can distort the principal components and affect the accuracy of the results.
•Missing data: PCA assumes that the feature set has no missing values.
•Scale of features: PCA is sensitive to the scale of the features, so features should be standardized first.
•Interpretability: the principal components are linear combinations of the original features and can be difficult to interpret.
•Information loss: PCA always leads to some loss of information when reducing dimensions.
•Categorical features: PCA is only suitable for continuous numerical data, not for categorical or discrete features.
•Non-linear relationships: PCA is not well suited to capturing non-linear relationships.
