Lecture 15 - 23.09.2024 - Feature Selection


Feature Selection, PCA, SVD
Thivin Anandh, IISc Bangalore
“ What is feature selection?
What is Feature Selection?

• Feature selection is the process of choosing a subset of relevant features from a dataset
to be used in model construction and analysis
• It involves identifying and retaining the most informative and discriminative features
while discarding irrelevant or redundant ones.

Image Credits: https://www.heavy.ai/technical-glossary/feature-selection


Types of Feature Selection

Filter Methods

Wrapper Methods

Embedded Methods
Filter Methods

• Filter methods are feature selection techniques that select features based on
their statistical properties, independent of any specific machine learning
algorithm
• These methods evaluate the relevance of features using statistical measures and
rank them accordingly.

Advantages
• Computationally efficient
• Independence from the ML algorithm
• More interpretable

Examples
• Pearson correlation coefficient
• Chi-square test
• Information gain

Filter Methods – Pearson Correlation Coefficient

• It measures the linear correlation between two continuous variables: r = cov(X, Y) / (σX σY)

• r = 1 -> positive correlation, r = -1 -> negative correlation, r = 0 -> no correlation


• We can remove highly correlated variables from the data

Advantages
• Simple to understand
• Easy to compute on large datasets

Disadvantages
• Sensitive to outliers
• Assumes a linear relationship between the variables
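A minimal sketch of correlation-based filtering, assuming a small synthetic pandas DataFrame (the column names x1, x2, x3 and the 0.9 threshold are illustrative choices, not from the slides):

import numpy as np
import pandas as pd

# Toy data: x2 is almost a copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.98 * x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})

corr = X.corr().abs()                                   # pairwise |Pearson r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)                     # keep one feature from each highly correlated pair
print(to_drop)                                          # expected: ['x2']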


Filter Methods – chi2 test

• The chi-square test is a statistical method used to determine whether there is a significant association between two categorical variables

• χ² = Σi (Oi − Ei)² / Ei , where Oi is the observed frequency and Ei is the expected frequency

Terminologies
• Degrees of freedom
• Critical values
• Null hypothesis

Example: a coin tossed N = 50 times
            Heads   Tails
Expected      25      25
Observed      28      22
Filter Methods – chi2 test

• Features with higher chi-square values and lower p-values (b/w target and feature) are
considered more significant and are retained for further analysis.
• Note: for a contingency table, the expected frequency of each cell is computed as E = (row total × column total) / grand total
Filter Methods – chi2 test

Advantages
• Suitable for categorical data
• Easy to understand and implement
• Non-parametric (no assumptions on the distribution of the data)
• Little dependence on sample size (rule of thumb: at least 30 samples)

Disadvantages
• Limited to categorical data
• Sensitive to very small sample sizes
• Compares only two variables at a time
• Challenging to interpret on large datasets
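As a hedged illustration of chi-square-based filtering, here is a minimal sklearn sketch using SelectKBest with the chi2 score function (the Iris dataset and k=2 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)              # chi2 requires non-negative feature values
selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 highest-scoring features
X_new = selector.fit_transform(X, y)
print(selector.scores_)                        # chi-square statistic per feature
print(selector.pvalues_)                       # lower p-value => stronger association with the target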
Filter Methods – Information Gain

• Information gain is a feature selection method commonly used in decision trees and
related algorithms
• It measures the reduction in entropy of the target variable obtained by splitting the data on a given feature.
• Features with higher information gain are considered more informative and are
retained for further analysis.

Terminologies

• Entropy (similar to the Gini index): H(S) = − Σi pi log2 pi
• Information gain of a feature A: IG(S, A) = H(S) − Σv ( |Sv| / |S| ) H(Sv)


Filter Methods – Information gain

Advantages
• Handles missing values (by ignoring them)
• More interpretable
• Non-parametric
• Works with continuous and categorical values

Disadvantages
• Biased towards features with many categories (too many splits)
• Not suitable for regression (classification only)
• Ignores class distribution (creates problems on imbalanced datasets)
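A small sketch of information-gain-style filtering; sklearn exposes the closely related mutual information estimate via mutual_info_classif (the Iris dataset is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)   # estimated information gain of each feature w.r.t. the target
print(mi)                                        # higher values => more informative features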
Wrapper Methods

• Generate Subsets: Wrapper methods generate different subsets of features from the original
feature set.
• Train Model: Each subset of features is used to train a machine learning model.
• Evaluate Performance: The model's performance is evaluated using a performance metric (e.g.,
accuracy, F1-score).
• Select Best Subset: The subset of features that produces the best performance is selected as the
final feature set.

Types of Wrapper Methods


• Forward Selection: Starts with an empty set of features and iteratively adds features one by one
based on their individual performance until no improvement is observed.
• Backward Elimination: Begins with the full set of features and iteratively removes features one by
one based on their individual performance until no improvement is observed.
• Recursive Feature Elimination (RFE): Selects features by recursively considering smaller and
smaller sets of features until the desired number of features is reached.
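A minimal sketch of forward selection and RFE with sklearn, assuming a logistic regression base model and an arbitrary target of 5 features (both choices are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: add features one by one based on cross-validated performance.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward").fit(X, y)

# Recursive feature elimination: repeatedly drop the weakest feature until 5 remain.
rfe = RFE(model, n_features_to_select=5).fit(X, y)

print(forward.get_support())   # boolean mask of the selected features
print(rfe.support_)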
Wrapper Methods

Advantages
• Model-Centric: Wrapper methods consider the performance of the model when selecting features,
leading to potentially better model performance.
• Feature Interaction: These methods can capture feature interactions that may not be apparent in
individual features alone.

Disadvantages of Wrapper Methods


• Computational Complexity: Wrapper methods can be computationally expensive, especially when
dealing with a large number of features.
• Overfitting: There is a risk of overfitting when using wrapper methods, as they may select features
that perform well on the training data but poorly on unseen data.
Embedded Methods

• Model Training: Embedded methods use machine learning algorithms that inherently perform
feature selection during training.
• Intrinsic Feature Selection: Feature selection is intrinsic to the model's learning algorithm and is
performed automatically during model training.
• Regularization Techniques: Embedded methods often use regularization techniques to penalize the
model for the inclusion of unnecessary or redundant features.

Types of Embedded Methods

• L1 Regularization (Lasso): L1 regularization adds a penalty term to the model's cost function based
on the absolute value of the coefficients, encouraging sparse solutions and automatic feature
selection.
• Tree-Based Methods: Decision tree-based algorithms, such as Random Forest and Gradient
Boosting Machines (GBM), naturally perform feature selection by selecting the most informative
features at each split.
• Elastic Net: Elastic Net is a regularization technique that combines L1 and L2 penalties to achieve
both feature selection and feature grouping.
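A minimal sketch of embedded selection with L1 regularization (the diabetes dataset and alpha=1.0 are illustrative; alpha controls how many coefficients are driven to zero):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # L1 penalties are scale-sensitive

lasso = Lasso(alpha=1.0).fit(X_scaled, y)        # the L1 penalty zeroes out some coefficients
kept = np.flatnonzero(lasso.coef_)               # indices of the features that survive
print(lasso.coef_)
print("selected features:", kept)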
Embedded Methods
Advantages of Embedded Methods
• Efficient Feature Selection: Embedded methods perform feature selection directly during model
training, making them efficient and suitable for large datasets.
• Automatic Selection: Feature selection is intrinsic to the model training process, eliminating the
need for separate feature selection steps.
• Handles Non-Linear Relationships: Embedded methods, especially tree-based methods, can
capture non-linear relationships between features and the target variable.

Disadvantages of Embedded Methods


• Model-Specific: Embedded methods are tightly coupled with specific modeling algorithms, limiting
their flexibility compared to wrapper methods.
• Less Control: Unlike wrapper methods, embedded methods offer less control over the feature
selection process, as it is driven by the model's optimization objective.
Summary – Feature selection

Advantages of Feature Selection

• Improved Model Performance: Feature selection can lead to simpler and more interpretable models
that generalize better to unseen data, resulting in improved performance metrics.
• Reduced Overfitting: By focusing on the most informative features, feature selection can reduce the
risk of overfitting and improve the model's ability to generalize to new data.
• Enhanced Model Interpretability: Selecting a subset of relevant features makes the model more
interpretable and easier to understand for stakeholders and domain experts.
“ Singular Value Decomposition
Singular Value Decomposition

• Singular Value Decomposition is a matrix factorization method that decomposes a matrix A into three separate matrices: A = U Σ Vᵀ

• U: A matrix of orthogonal vectors that represent the left singular vectors of the input
matrix.
• Σ: A diagonal matrix that represents the singular values of the input matrix.
• Vᵀ: The conjugate transpose of a matrix of orthogonal vectors that represent the right singular vectors of the input matrix.
SVD – Geometric Intuition

• Vᵀ represents a rotation matrix that rotates the original coordinate system to a new coordinate system in which the data points are aligned along the axes of greatest variance.
• Σ represents a diagonal matrix that scales the data points along each of the new coordinate axes. The diagonal elements of Σ are known as the singular values, and they represent the amount of variance captured by each coordinate axis.
• U represents another rotation matrix that rotates the scaled result into the output coordinate system.
Types of SVD

• Full SVD
• Reduced SVD

Properties of SVD

• The rank of matrix A is the number of nonzero singular values


• ||A||2 = σ1 and ||A||F = (σ1² + σ2² + … + σr²)^(1/2) , where r is the rank of the matrix
• The nonzero singular values of A are the square roots of the nonzero eigenvalues of AᵀA or AAᵀ.
• If A = Aᵀ, then the singular values of A are the absolute values of the eigenvalues of A.
• It is also useful for performing low rank approximations of the matrix ( as we can
see on PCA )
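A small numpy sketch illustrating the reduced SVD, a rank-k approximation, and the norm properties above (the random 6×4 matrix and k=2 are illustrative):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)        # reduced SVD: A = U @ diag(s) @ Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]             # best rank-k approximation of A

print(np.linalg.matrix_rank(A_k))                       # 2
print(np.linalg.norm(A, 2), s[0])                       # ||A||_2 equals the largest singular value
print(np.linalg.norm(A, "fro"), np.sqrt(np.sum(s**2)))  # ||A||_F equals sqrt of the sum of squared singular values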
“ Principal Component Analysis
PCA – Curse of Dimensionality

• A lot of data is high-dimensional (images, for example)

• Cambridge Analytica claimed they had more than 5000* data points on every US voter (data obtained mostly from FB profiles)
• As the dimensionality increases, Euclidean distances between points in the vector space grow, which makes it difficult to find similar data points.
• In higher dimensions the test data also tends to lie far from the training data, which makes the model prone to overfitting.

*-> 'The Great Hack': Cambridge Analytica is just the tip of the iceberg - Amnesty International
What is PCA?

• Principal Component Analysis finds the lower-dimensional space which preserves the maximum variance
• PCA identifies the axis that accounts for the largest amount of variance in the training set.
• The i-th such axis is identified as the i-th principal component of the data.

Image Reference: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, Géron, 3rd Edition
What are subspaces?

• We can obtain these subspaces, which preserve the most variance, by computing the eigenvectors of the covariance matrix of the given data
• Before that, we need to understand the following terminologies
• Eigenvalues and eigenvectors
• Covariance matrix
Eigenvalues and Eigenvectors

• When a vector x is multiplied by a matrix A, it is linearly transformed into the column space of A
• In most cases, the vector x undergoes both scaling and rotation to reach its final state
• However, for a particular linear transformation A, there are specific vectors v that do not undergo any rotation:
• Av = λv
• Here v is an eigenvector and λ is the eigenvalue corresponding to that eigenvector
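A tiny numpy check of Av = λv (the 2×2 symmetric matrix is an illustrative choice):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)      # columns of eigvecs are the eigenvectors
v = eigvecs[:, 0]
lam = eigvals[0]
print(A @ v, lam * v)                    # the two results coincide: A v = λ v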
Co-variance Matrix

• A covariance matrix is a square matrix that summarizes the covariance between


multiple variables in a dataset.
• The diagonal elements of a covariance matrix represent the variance of each
variable
• The off-diagonal elements represent the covariance between each pair of variables
• A positive covariance between two variables indicates that they tend to move together
• A negative covariance indicates that they tend to move in opposite directions.
• A covariance of zero indicates that the two variables are linearly uncorrelated (not necessarily independent)
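A short sketch of a covariance matrix with numpy (the synthetic two-variable data is illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)   # y tends to move together with x

C = np.cov(np.stack([x, y]))     # rows are variables; C is 2x2
print(C)                         # diagonal: variances, off-diagonal: covariance (positive here)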
How to compute principal components

• Compute the covariance matrix of the given (mean-centered) data A:
• Ã = (x − μ)(x − μ)ᵀ
• Now perform the eigenvalue decomposition of that matrix:
• Ã = P Â P⁻¹ , where Â is the diagonal matrix of eigenvalues
• The columns of P are the eigenvectors, i.e. the principal components of the original data A
• The principal component that captures the highest variance is the eigenvector associated with the largest eigenvalue
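A minimal numpy sketch of these steps: center the data, form the covariance matrix, and take its eigendecomposition (the synthetic 100×3 data and the 2 retained components are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([3.0, 1.0, 0.1])   # three features with very different variances

Xc = X - X.mean(axis=0)                  # center: x - mu
C = Xc.T @ Xc / (len(X) - 1)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh: eigendecomposition of a symmetric matrix
order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue (variance captured)
P = eigvecs[:, order]                    # columns of P = principal components
X_reduced = Xc @ P[:, :2]                # project onto the top 2 components
print(eigvals[order])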
PCA using sklearn

• Import the PCA from sklearn.decomposition module

• Provide the n_components as input to the PCA function to extract “n” principal
modes from the given data
• Use “fit_transform()” to generate the reduced dimensional data
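A minimal sketch of the workflow described above (the Iris dataset and n_components=2 are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)           # extract the first 2 principal modes
X_reduced = pca.fit_transform(X)    # reduced-dimensional data
print(X_reduced.shape)              # (150, 2)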
Sufficient dimension

• By looking at the ".explained_variance_ratio_" attribute we can obtain the variance captured by each principal component
• So, by looking at the cumulative explained variance, we can decide the number of modes that we need for our formulation
• The number of modes and the percentage of variance required are highly dependent on the nature of the task
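A hedged sketch of choosing the number of modes from the cumulative explained variance (the digits dataset and the 95% threshold are illustrative):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                                    # keep all components first
cumvar = np.cumsum(pca.explained_variance_ratio_)     # cumulative explained variance
d = int(np.argmax(cumvar >= 0.95)) + 1                # smallest number of modes reaching 95%
print(d)

# Equivalently, sklearn picks the number of components when a fraction is passed:
X_reduced = PCA(n_components=0.95).fit_transform(X)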
PCA in image compression

• Since we only store the reduced representation of the data and use it to reconstruct the original data, we do not need to store the complete data, which results in a reduction in storage
• This is similar to the idea used in image compression formats like JPEG, which store the coefficients of a cosine transform and use them to reconstruct the pixel values
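A small sketch of this idea with sklearn's PCA, storing a handful of coefficients per image and reconstructing approximate pixels with inverse_transform (the digits dataset and 16 components are illustrative):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 8x8 digit images flattened to 64 pixel features
pca = PCA(n_components=16)                 # store 16 coefficients per image instead of 64 pixels
codes = pca.fit_transform(X)               # compressed representation
X_rec = pca.inverse_transform(codes)       # approximate reconstruction of the pixel values
print(X.shape, codes.shape, X_rec.shape)   # (1797, 64) (1797, 16) (1797, 64)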
Types of PCA

• Randomized PCA (finds a quick approximation of the first d principal components)
  • Criterion for randomized PCA in sklearn: max(m, n) > 500 and n_components < 0.8 · min(m, n)
• Incremental PCA (feeds large data in batches to compute the PCA of a large dataset)
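A brief sketch of both variants in sklearn (the digits dataset, 16 components, and batch_size=200 are illustrative):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, IncrementalPCA

X, _ = load_digits(return_X_y=True)

# Randomized PCA: quick approximation of the first d principal components.
rpca = PCA(n_components=16, svd_solver="randomized", random_state=0)
X_r = rpca.fit_transform(X)

# Incremental PCA: processes the data in batches, useful when it does not fit in memory.
ipca = IncrementalPCA(n_components=16, batch_size=200)
X_i = ipca.fit_transform(X)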
Test your understanding

1. Which of the following is not a common dimensionality reduction technique?
   a) Principal Component Analysis (PCA)  b) t-SNE  c) Linear Discriminant Analysis (LDA)  d) K-means clustering
2. True or False: Dimensionality reduction always results in loss of information.
3. In PCA, eigenvectors with the highest eigenvalues correspond to:
   a) Least important principal components  b) Most important principal components  c) Random noise in the data  d) Outliers in the dataset
4. What is the primary goal of Linear Discriminant Analysis (LDA)?
   a) Maximize variance of the data  b) Minimize within-class scatter  c) Cluster similar data points  d) Identify outliers in the dataset
5. True or False: PCA can be used for feature selection.
Solutions

1. d) K-means clustering
2. False
3. b) Most important principal components
4. b) Minimize within-class scatter
5. False
“ Linear Discriminant Analysis
(LDA)
Linear discriminant analysis

Terminologies

LDA method

Mathematical concept behind LDA

Multi-class LDA
LDA – terminologies

• We will be using two terms frequently


• Mean and scatter
• Each cluster has its own mean and within-class variance, or scatter, shown by s1, s2, s3, s4 in the diagram
• This will be defined for each class.
• The color denotes separate classes
• So here we have labels unlike PCA
• In other words, supervised learning.
LDA

• The end goal is the same as PCA: dimensionality reduction

• But here we want to maximize the distance between the means of the two classes and minimize the "within-class scatter"
• How do we quantify "maximizing the distance between the class means" and "minimizing the within-class scatter" using a single number?
• We use a quantity called the Fisher score. Notice the numerator and the denominator: if the within-class scatter (the denominator) increases, the Fisher score decreases, which signals an undesirable projection.
• Fisher score for two classes, where s is the within-class scatter: J = (m1 − m2)² / (s1² + s2²)
LDA – Fisher score illustration
Example

• Example: choosing the direction onto which to project is important. Both the means and the within-class variances should be accounted for
“ So how do we find out these
directions or discriminants ?
LDA – Finding out the discriminants

• We won't look into the derivation in detail


• Assume a vector v that we want to find. We project the class means onto this vector
• Next we do the same for the within-class variance.
• Then we use the Fisher formula and substitute the above expressions
• We then take the derivative and obtain an expression for the vector v. This vector v is the one that gives maximum class separability with minimum within-class scatter.
LDA – derivation of the discriminants
LDA – derivation of the discriminants

• The discriminants (vectors v) are the eigenvectors of the matrix SW⁻¹ SB, where SW is the within-class scatter matrix and SB is the between-class scatter matrix
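A minimal sketch of LDA as a supervised dimensionality reduction step in sklearn (the Iris dataset is illustrative; with 3 classes at most 2 discriminants exist):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                   # labels are required: LDA is supervised
lda = LinearDiscriminantAnalysis(n_components=2)    # at most (number of classes - 1) discriminants
X_lda = lda.fit_transform(X, y)                     # project onto the discriminant directions
print(X_lda.shape)                                  # (150, 2)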
Summary

• Both LDA and PCA try to reduce dimensions

• PCA looks for the directions along which the data has the most variation.
• LDA maximizes the separation between known categories
• LDA is supervised, PCA is unsupervised.
Test your understanding

1. Which of the following statements about t-SNE is correct?
   a) It's a linear dimensionality reduction technique  b) It's primarily used for visualization in high-dimensional spaces  c) It always preserves global structure of the data  d) It's faster than PCA for large datasets
2. What is the main advantage of using dimensionality reduction techniques?
   a) They always improve model accuracy  b) They can help mitigate the curse of dimensionality  c) They increase the computational complexity  d) They add more features to the dataset
3. In LDA, the number of linear discriminants that can be computed is at most:
   a) Equal to the number of features  b) Equal to the number of classes minus one  c) Equal to the number of samples  d) Unlimited
4. True or False: PCA always requires scaling the data before application.
5. Which technique is more suitable when the goal is to maximize class separability?
   a) PCA  b) LDA  c) Random projection  d) Autoencoder
Solutions

1. b) It's primarily used for visualization in high-dimensional spaces


2. b) They can help mitigate the curse of dimensionality
3. b) Equal to the number of classes minus one
4. True
5. b) LDA
