
Dimensionality Reduction

Unit-3
Unit 3
• Dimensionality Reduction: Introduction, Subset Selection, Principal Component Analysis (PCA), Factor Analysis, Singular Value Decomposition and Matrix Factorization, Multidimensional Scaling, Linear Discriminant Analysis (LDA)
What is Dimensionality Reduction?
• The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction.

• In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated; for such cases, dimensionality reduction techniques are required.
Dimensionality Reduction…?
• A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information."

• These techniques are widely used in Machine Learning to obtain a better-fitting predictive model when solving classification and regression problems.

• Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality.
Benefits of Dimensionality Reduction..
• By reducing the dimensions of the features, the space
required to store the dataset also gets reduced.
• Less computation (training) time is required with reduced dimensions of features.
• Reduced dimensions of features of the dataset help in
visualizing the data quickly.
• It removes the redundant features (if present).
Two ways of Dimensionality Reduction
• 1. Feature Selection
• 2. Feature Extraction
Feature Selection
• Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset.
General feature-reduction example
• In this example the digit 2 is represented by 64 features (pixels), but many of them are of no importance in deciding the characteristics of the 2, and these are removed first.
Remove features which are of no importance
Feature Selection – 3 Methods
• 1.Filter Method
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.

• 2.Wrapper Method
• Forward Selection
• Backward Selection
• Bi-directional Elimination

• 3.Embedded Method
• LASSO
• Elastic Net
• Ridge Regression, etc.
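As a concrete illustration of the filter approach listed above, here is a minimal scikit-learn sketch; the digits dataset (the 64-pixel example mentioned earlier) and the choices of chi-square scoring and k=20 are my own assumptions for illustration.

```python
# Filter-method feature selection: score each feature independently
# (here with the chi-square test) and keep only the k best ones.
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)        # 64 pixel features per digit image
selector = SelectKBest(score_func=chi2, k=20)
X_reduced = selector.fit_transform(X, y)    # keep the 20 highest-scoring pixels

print(X.shape, "->", X_reduced.shape)       # (1797, 64) -> (1797, 20)
```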
Feature Extraction
• Feature extraction is the process of transforming the space containing many dimensions into a space with fewer dimensions.

• This approach is useful when we want to keep all of the information but use fewer resources while processing it.
Some common feature extraction techniques are:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Kernel PCA
4. Quadratic Discriminant Analysis (QDA), etc.
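A minimal sketch of feature extraction with scikit-learn's PCA, continuing the 64-feature digits example; the dataset and the choice of 10 components are assumptions for illustration.

```python
# Feature extraction: project the 64-dimensional digit images onto a
# smaller set of new axes (principal components) instead of dropping columns.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
pca = PCA(n_components=10)                  # keep 10 new, uncorrelated features
X_new = pca.fit_transform(X)

print(X.shape, "->", X_new.shape)           # (1797, 64) -> (1797, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```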
ML Model design
• Consider the line passing through the samples in the
diagram.
• It (the line) is the model/function/hypothesis generated after the training phase.
The line tries to come as close as possible to all the samples.
Underfitting:
• If we have an underfitted model, this means that we do not have enough parameters to capture the trends in the underlying system.
• In general, an underfitted model fails during testing as well as during training.

Overfitting:
• Here a complex model is built using too many features.
• During the training phase the model works well, but it fails during testing.

• Underfitting and overfitting can be addressed in different ways.
• One of the solutions for overfitting is dimensionality reduction.
• The diagram shows a model that suffers from neither underfitting nor overfitting.
Example to show requirement of Dimensionality reduction

• In this example the important features for deciding the price are town, area and plot size. Features like the number of bathrooms and nearby trees may not be significant, and hence can be dropped.
PCA
• PCA is a method of Dimensionality Reduction.
• PCA is a process of identifying Principal Components of the
samples.
• It tries to address the problem of overfitting.
Example for PCA (from the scikit-learn library)
What does PCA do?
• To address overfitting, reduce the dimension without losing information.
• In this example two dimensions are reduced to a single dimension.
• In general there can be multiple dimensions, and they will be reduced.
• When the data is viewed from one angle, it reduces to a single dimension; the same is shown at the bottom-right corner, and this will be Principal Component 1.
Similarly compute PC2
• The figure shows the representation of PC1 and PC2.
• Like this we can have several principal components, say PC1, PC2, PC3, and so on.
• Of these, PC1 has the top priority (it captures the most variance).
• The principal components are independent and orthogonal: one PC does not depend on another.
Another Example
Example to illustrate the PC
Multiple angles in which picture can be captured
• In the previous slide, the last picture gives the right angle at which to take the picture.
• It means you have to identify a better angle to collect the data without losing much information.
• The angle shown in the last picture captures all the faces, without much overlap and without losing information.
In this example the second one is the best angle to project.
Reference videos:
https://www.youtube.com/watch?v=g-Hb26agBFg
https://www.youtube.com/watch?v=MLaJbA82nzk
Housing Example: more rooms, more the size
Two dimensions are reduced to a single dimension
• PCA is a method of dimensionality reduction.
• The example shows how to convert two dimensions into one dimension.
How to compute PCA?
• Consider the samples given in the table (10 samples):

  X     Y
  2.5   2.4
  0.5   0.7
  2.2   2.9
  1.9   2.2
  3.1   3.0
  2.3   2.7
  2.0   1.6
  1.0   1.1
  1.5   1.6
  1.1   0.9

• Compute the mean of X and the mean of Y independently. A similar computation has to be done for each feature (in this example there are only two features).
• Mean of X = 1.81 and Mean of Y = 1.91
Next step is to compute the covariance matrix.
• Covariance between (x, y) is computed as given below:

  cov(x, y) = Σ (xᵢ − mean(x))(yᵢ − mean(y)) / (n − 1)

• The covariance matrix to be computed is:

  C = [ cov(x,x)  cov(x,y) ]
      [ cov(y,x)  cov(y,y) ]

• Covariance between (x and x) is cov(x, x) = Σ (xᵢ − mean(x))² / (n − 1).
• Similarly compute the covariance between (x,y), (y,x) and (y,y).
• The computed covariance matrix is given in the next slide.
Final covariance matrix (for the data above):

  C = [ 0.6166  0.6154 ]
      [ 0.6154  0.7166 ]
Alternate method to compute the covariance matrix
Consider the mean-centered matrix as A and compute (transpose of A) * A; then divide the resulting matrix by (n − 1) to get the covariance matrix.
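A small numpy sketch of this computation on the ten samples above (the printed values are rounded):

```python
import numpy as np

# The 10 (X, Y) samples from the table above.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

A = data - data.mean(axis=0)        # mean-centered matrix (means: 1.81, 1.91)
cov = A.T @ A / (len(data) - 1)     # same result as np.cov(data, rowvar=False)
print(cov)                          # approx. [[0.6166 0.6154]
                                    #          [0.6154 0.7166]]
```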
Next step is to compute the eigenvalues using the covariance matrix.
If A is the given matrix (in this case the covariance matrix), we can calculate the eigenvalues from the following equation:

  |A − λI| = 0

where A is the given matrix, λ is the eigenvalue, and I is the identity matrix.

Determinant computation and, finally, the eigenvalues:

  det [ 0.6166 − λ    0.6154     ] = 0
      [ 0.6154        0.7166 − λ ]

  (0.6166 − λ)(0.7166 − λ) − (0.6154)² = 0
  λ² − 1.3332 λ + 0.0631 = 0
  λ1 = 1.284, λ2 = 0.0490
• Compute the eigenvector for each eigenvalue.
• Consider the first eigenvalue λ1 = 1.284.
• Solve C·V = λ1·V, i.e. (C − λ1·I)·V = 0, where
• C is the covariance matrix and
• V is the eigenvector to be computed.
Now convert the two-dimensional data to a single dimension.
Final step
• Compute the eigenvector for the second eigenvalue.
• Consider the second eigenvalue λ2 = 0.0490.
• C is the covariance matrix and V is the eigenvector to be computed.
• From (C − λ2·I)·V = 0 we get two linear equations.
• Use any one of the equations; the final result remains the same.

• 0.5674 x1 = −0.6154 y1
• Divide both sides by 0.5674.
• You will get: x1 = −1.0845 y1

• If y1 = 1, then x1 will be −1.0845.
• So in that case (x1, y1) will be (−1.0845, 1). This is the initial eigenvector; it needs normalization to get the final value.

• To normalize, take the square root of the sum of squares of the eigenvector components, and consider this value as 'x'.
• Finally divide each eigenvector component by 'x' to get the final (unit-length) eigenvector.
Eigenvector generated for the eigenvalue 0.0490: normalizing (−1.0845, 1) by x = √(1.0845² + 1²) ≈ 1.475 gives approximately (−0.735, 0.678).
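As a cross-check of the hand computation, numpy's eigendecomposition of the same covariance matrix gives the eigenvalues and unit eigenvectors directly (the sign of an eigenvector may differ from the manual result):

```python
import numpy as np

cov = np.array([[0.6166, 0.6154],
                [0.6154, 0.7166]])      # covariance matrix from the slides above

eigvals, eigvecs = np.linalg.eigh(cov)  # ascending order for symmetric matrices
print(eigvals)                          # approx. [0.049 1.284]
print(eigvecs[:, 0])                    # approx. [-0.735  0.678] (sign may flip),
                                        # matching the normalized (-1.0845, 1)
```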
PCA
Theory – Algorithms – steps explained
Steps/ Functions to perform PCA
• Subtract mean.
• Calculate the covariance matrix.
• Calculate eigenvectors and eigenvalues.
• Select principal components.
• Reduce the data dimension.
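A minimal from-scratch sketch of these five steps in numpy; the function name pca and the choice of eigh (appropriate for the symmetric covariance matrix) are my own.

```python
import numpy as np

def pca(X, k):
    """Reduce X (n_samples x n_features) to k dimensions via the steps above."""
    A = X - X.mean(axis=0)                    # 1. subtract the mean
    C = np.cov(A, rowvar=False)               # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]         # 4. select top-k principal components
    W = eigvecs[:, order[:k]]
    return A @ W                              # 5. project onto k dimensions

# Example: the two-feature dataset from the worked example, reduced to 1-D.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(pca(X, 1))
```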
• Principal component analysis is a form of multivariate statistical analysis and is one method of studying the correlation or covariance structure in a set of measurements on m variables for n observations.

• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms, without extraneous variables to process.

• So to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.
• What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?
• It is actually the sign of the covariance that matters:
• if positive, the two variables increase or decrease together (correlated);
• if negative, one increases when the other decreases (inversely correlated).

• Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute
from the covariance matrix in order to determine the principal components of the data.

Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables.

These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components.

So, the idea is 10-dimensional data gives you 10 principal components, but PCA tries to
put maximum possible information in the first component.

Then maximum remaining information in the second and so on, until having something
like shown in the scree plot below.
• As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set.

• Organizing information in principal components this way will allow you to reduce dimensionality without losing much information, by discarding the components with low information and considering the remaining components as your new variables.

• An important thing to realize here is that the principal components are less interpretable and don't have any real meaning, since they are constructed as linear combinations of the initial variables.
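One common way to decide how many components to keep is to look at the cumulative explained variance (the scree plot in numbers). A small scikit-learn sketch; the 95% threshold and the digits dataset are assumptions for illustration.

```python
# Keep enough principal components to explain, say, 95% of the variance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                                # fit all components to get the ratios
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = np.searchsorted(cumulative, 0.95) + 1         # first k reaching 95%
print(k, "components explain 95% of the variance")
```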
Characteristic Polynomial and characteristic equation
and
Eigen Values and Eigen Vectors
Computation for 2x2 and 3x3 Square Matrix
Eigen Values and Eigen Vectors

The eigenvectors x and eigenvalues λ of a matrix A satisfy

  Ax = λx

If A is an n x n matrix, then x is an n x 1 vector, and λ is a constant.

The equation can be rewritten as (A − λI)x = 0, where I is the n x n identity matrix.
2 x 2 Example: Compute Eigen Values

  A = [ 1  -2 ]     so   A − λI = [ 1 − λ    -2    ]
      [ 3  -4 ]                   [   3    -4 − λ  ]

  det(A − λI) = (1 − λ)(−4 − λ) − (3)(−2)
              = λ² + 3λ + 2

Set λ² + 3λ + 2 = 0.

Then λ = (−3 ± √(9 − 8)) / 2

So the two values of λ are −1 and −2.
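A quick numpy check of this result (the order of the eigenvalues may vary):

```python
import numpy as np

A = np.array([[1, -2],
              [3, -4]])
print(np.linalg.eigvals(A))   # approx. [-1. -2.]
```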
Example 1: Find the eigenvalues and eigenvectors of the matrix

  A = [ -4  -6 ]
      [  3   5 ]

Solution: Let us first derive the characteristic polynomial of A. We get

  A − λI2 = [ -4  -6 ] − λ [ 1  0 ] = [ -4 − λ    -6   ]
            [  3   5 ]     [ 0  1 ]   [   3      5 − λ ]

  |A − λI2| = (−4 − λ)(5 − λ) + 18 = λ² − λ − 2

We now solve the characteristic equation of A:

  λ² − λ − 2 = 0  →  (λ − 2)(λ + 1) = 0  →  λ = 2 or −1

The eigenvalues of A are 2 and −1.
The corresponding eigenvectors are found by using these values of λ in the equation (A − λI2)x = 0. There are many eigenvectors corresponding to each eigenvalue.

For λ = 2
We solve the equation (A − 2I2)x = 0 for x. The matrix (A − 2I2) is obtained by subtracting 2 from the diagonal elements of A. We get

  [ -6  -6 ] [ x1 ] = 0
  [  3   3 ] [ x2 ]

This leads to the system of equations

  −6x1 − 6x2 = 0
   3x1 + 3x2 = 0

giving x1 = −x2. The solutions to this system of equations are x1 = −r, x2 = r, where r is a scalar. Thus the eigenvectors of A corresponding to λ = 2 are nonzero vectors of the form

  v1 = [ x1 ] = x2 [ -1 ] = r [ -1 ]
       [ x2 ]      [  1 ]     [  1 ]

For λ = −1
We solve the equation (A + 1I2)x = 0 for x. The matrix (A + 1I2) is obtained by adding 1 to the diagonal elements of A. We get

  [ -3  -6 ] [ x1 ] = 0
  [  3   6 ] [ x2 ]

This leads to the system of equations

  −3x1 − 6x2 = 0
   3x1 + 6x2 = 0

Thus x1 = −2x2. The solutions to this system of equations are x1 = −2s and x2 = s, where s is a scalar. Thus the eigenvectors of A corresponding to λ = −1 are nonzero vectors of the form

  v2 = [ x1 ] = x2 [ -2 ] = s [ -2 ]
       [ x2 ]      [  1 ]     [  1 ]
• Example 2: Calculate the eigenvalue equation and eigenvalues for the following matrix:

  A = [ 1   0   0 ]
      [ 0  -1   2 ]
      [ 2   0   0 ]

Solution: Let

  A − λI = [ 1 − λ     0       0    ]
           [   0    -1 − λ     2    ]
           [   2       0     0 − λ  ]

We can calculate the eigenvalues from the following equation:

  |A − λI| = 0
  (1 − λ)[(−1 − λ)(−λ) − 0] − 0 + 0 = 0
  λ(1 − λ)(1 + λ) = 0

From this equation, we are able to estimate the eigenvalues, which are λ = 0, 1, −1.
Example 2: Eigenvalues of a 3x3 Matrix

Find the eigenvalues of

  A = [ 1   2   3 ]
      [ 0  -4   2 ]
      [ 0   0   7 ]

Solution:

  A − λIn = [ 1   2   3 ]       [ 1  0  0 ]   [ 1 − λ    2      3   ]
            [ 0  -4   2 ]  − λ  [ 0  1  0 ] = [   0   -4 − λ    2   ]
            [ 0   0   7 ]       [ 0  0  1 ]   [   0      0    7 − λ ]

  det(A − λIn) = 0  →  det [ 1 − λ    2      3   ]
                           [   0   -4 − λ    2   ]  = 0
                           [   0      0    7 − λ ]

  (1 − λ)(−4 − λ)(7 − λ) = 0
  λ = 1, −4, 7
Example 3: Eigenvalues and Eigenvectors
Find the eigenvalues and eigenvectors of the matrix

  A = [ 5  4  2 ]
      [ 4  5  2 ]
      [ 2  2  2 ]

Solution: The matrix A − λI3 is obtained by subtracting λ from the diagonal elements of A. Thus

  A − λI3 = [ 5 − λ    4      2   ]
            [   4    5 − λ    2   ]
            [   2      2    2 − λ ]

The characteristic polynomial of A is |A − λI3|. Using row and column operations to simplify determinants, we get the eigenvalues λ = 10 and λ = 1 (repeated).

Alternate solution: solve any two equations of (A − λI3)x = 0.

• λ2 = 1
Let λ = 1 in (A − λI3)x = 0. We get (A − 1·I3)x = 0:

  [ 4  4  2 ] [ x1 ]
  [ 4  4  2 ] [ x2 ] = 0
  [ 2  2  1 ] [ x3 ]

The solution to this system of equations can be shown to be x1 = −s − t, x2 = s, and x3 = 2t, where s and t are scalars. Thus the eigenspace of λ2 = 1 is the space of vectors of the form

  [ −s − t ]
  [    s   ]
  [   2t   ]

Separating the parameters s and t, we can write

  [ −s − t ]       [ -1 ]       [ -1 ]
  [    s   ]  =  s [  1 ]  +  t [  0 ]
  [   2t   ]       [  0 ]       [  2 ]

Thus the eigenspace of λ = 1 is a two-dimensional subspace of R³ with basis

  { (-1, 1, 0), (-1, 0, 2) }

If an eigenvalue occurs as a k-times repeated root of the characteristic equation, we say that it is of multiplicity k. Thus λ = 10 has multiplicity 1, while λ = 1 has multiplicity 2 in this example.
Linear Discriminant Analysis (LDA)
Data representation vs. Data Classification
Difference between PCA and LDA
• PCA finds the most accurate data representation in a lower-dimensional space.
• It projects the data in the directions of maximum variance.
• However, the directions of maximum variance may be useless for classification.
• In such conditions LDA, which is also called Fisher LDA, works well.
• LDA is similar to PCA, but LDA in addition finds the axis that maximizes the separation between multiple classes.
LDA Algorithm
• PCA is good for dimensionality reduction.
• However, the figure shows how PCA can fail to classify, because it projects the points in the direction that maximizes variance and minimizes the reconstruction error, regardless of class.
• The Fisher Linear Discriminant projects to a line which reduces the dimension and also maintains the class-discriminating information.
Projection of the samples in the second picture is the best.
Describe the algorithm with an example:
• Consider a 2-D dataset:
• C1 = X1 = (x1, x2) = {(4,1), (2,4), (2,3), (3,6), (4,4)}
• C2 = X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}

• A1 and A2 are the mean-centered matrices of C1 and C2.
Step 1: Compute the within-class scatter matrix (Sw)
• Sw = S1 + S2
• S1 is the covariance matrix for class 1 and
• S2 is the covariance matrix for class 2.

• Note: the covariance matrix is to be computed on the mean-centered data.

• For the given example: mean of C1 = (3, 3.6) and
• mean of C2 = (8.4, 7.6)
• S1 = (transpose of the mean-centered data) * (mean-centered data), i.e. X = (transpose of A) * A, then divide X by (n − 1).
Computed values of S1, S2 and Sw
Step 2: Compute the between-class scatter matrix (Sb)
• Mean 1 (M1) = (3, 3.6)
• Mean 2 (M2) = (8.4, 7.6)

• (M1 − M2) = (3 − 8.4, 3.6 − 7.6) = (−5.4, −4.0)
• Sb = (M1 − M2)(M1 − M2)ᵀ
Step 3: Find the best LDA projection vector
• To do this, compute the eigenvalues and the eigenvector for the largest eigenvalue of the matrix given by the product Sw⁻¹ · Sb.
• In this example, the highest eigenvalue is 15.65.
Compute the inverse of Sw, then the product Sw⁻¹ · Sb.
Eigenvector computed for the eigenvalue 15.65.
Step 4: Dimension Reduction
Summary of the Steps
• Step 1 - Computing the within-class and between-class scatter matrices.
• Step 2 - Computing the eigenvectors and their corresponding eigenvalues
for the scatter matrices.
• Step 3 - Sorting the eigenvalues and selecting the top k.
• Step 4 - Creating a new matrix that will contain the eigenvectors mapped
to the k eigenvalues.
• Step 5 - Obtaining new features by taking the dot product of the data and
the matrix from Step 4.
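A minimal numpy sketch of these steps on the worked two-class example above. One assumption: each class scatter matrix is taken as an average over the class (dividing by n), which reproduces the 15.65 eigenvalue quoted earlier; a different scaling of Sw only rescales the eigenvalue and leaves the projection direction unchanged.

```python
import numpy as np

# The two classes from the worked example.
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # (3, 3.6) and (8.4, 7.6)

A1, A2 = X1 - m1, X2 - m2                   # mean-centered matrices
S1 = A1.T @ A1 / len(X1)                    # class-1 scatter (averaged)
S2 = A2.T @ A2 / len(X2)                    # class-2 scatter (averaged)
Sw = S1 + S2                                # within-class scatter

d = (m1 - m2).reshape(-1, 1)                # (-5.4, -4.0)
Sb = d @ d.T                                # between-class scatter

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = eigvecs[:, np.argmax(eigvals)]          # best projection direction
print(eigvals.max())                        # approx. 15.65
print(w / np.linalg.norm(w))                # approx. [0.92, 0.39] (up to sign)

# Step 5: project the samples onto the 1-D discriminant axis.
print(X1 @ w, X2 @ w)
```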
Singular Value Decomposition (SVD)
What is singular value decomposition? Explain with an example.
• The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, A = USVᵀ, where the columns of U and V are orthonormal and the matrix S is diagonal with positive real entries. The SVD is useful in many tasks.
• Calculating the SVD consists of finding the eigenvalues and eigenvectors of AAᵀ and AᵀA.
• The eigenvectors of AᵀA make up the columns of V; the eigenvectors of AAᵀ make up the columns of U.
• Also, the singular values in S are the square roots of the eigenvalues of AAᵀ or AᵀA.
• The singular values are the diagonal entries of the S matrix and are arranged in descending order. The singular values are always real numbers.
• If the matrix A is a real matrix, then U and V are also real.
where:
• U: m×r matrix of the orthonormal eigenvectors of AAᵀ.
• Vᵀ: r×n matrix, the transpose of the n×r matrix containing the orthonormal eigenvectors of AᵀA.
• S: r×r diagonal matrix of the singular values, which are the square roots of the eigenvalues of AAᵀ and AᵀA.
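A short numpy sketch of the SVD; the 2×3 matrix here is my own small example, chosen so the singular values come out as whole numbers.

```python
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]], dtype=float)

U, s, Vt = np.linalg.svd(A)       # A = U @ diag(s) @ Vt, s in descending order
print(s)                          # [5. 3.]

# The singular values are the square roots of the eigenvalues of A @ A.T:
print(np.sqrt(np.linalg.eigvalsh(A @ A.T)))   # [3. 5.] (ascending order)

# Reconstruct A from the factors to confirm the decomposition.
S = np.zeros_like(A)              # 2x3 matrix holding the singular values
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))  # True
```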
End of Unit 3
End of the Syllabus : Pattern Recognition
CS745
Thank you and all the best
