Lecture 15 - 23.09.2024 - Feature Selection
• Feature selection is the process of choosing a subset of relevant features from a dataset to be used in model construction and analysis.
• It involves identifying and retaining the most informative and discriminative features
while discarding irrelevant or redundant ones.
There are three main categories of feature selection methods:
• Filter Methods
• Wrapper Methods
• Embedded Methods
Filter Methods
• Filter methods are feature selection techniques that select features based on their statistical properties, independent of any specific machine learning algorithm.
• These methods evaluate the relevance of features using statistical measures and rank them accordingly.
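As a concrete illustration (not from the original slides), here is a minimal scikit-learn sketch of a filter method; the dataset, scoring function, and k are illustrative choices:

```python
# A minimal sketch of a filter method: score features with a statistical
# test and keep the top-ranked ones, independent of any downstream model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 best features
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # per-feature ANOVA F-scores; higher = more relevant
print(X_new.shape)        # (150, 2)
```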
Filter Methods – chi2 test
• Features with higher chi-square values and lower p-values (between feature and target) are considered more significant and are retained for further analysis.
• Note: the expected value for tabular (contingency) data is computed using the formula below.
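The formula referenced in the note did not survive extraction; the standard chi-square statistic and the expected count under independence for a contingency table are:

$$\chi^2 = \sum_i \sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$

where $O_{ij}$ is the observed count in cell $(i, j)$, $E_{ij}$ the expected count, and $N$ the grand total.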
• Information gain is a feature selection method commonly used in decision trees and
related algorithms
• It measures the amount of information obtained for a given feature with respect to the
target variable.
• Features with higher information gain are considered more informative and are
retained for further analysis.
Terminologies
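The terminology slide appears to be image-based; for reference, the standard definitions behind information gain are:

$$H(T) = -\sum_i p_i \log_2 p_i, \qquad IG(T, a) = H(T) - \sum_{v \in \text{values}(a)} \frac{|T_v|}{|T|} H(T_v)$$

where $H(T)$ is the entropy of the target $T$, $p_i$ the proportion of class $i$, and $T_v$ the subset of examples with attribute $a = v$.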
Wrapper Methods
• Generate Subsets: Wrapper methods generate different subsets of features from the original feature set.
• Train Model: Each subset of features is used to train a machine learning model.
• Evaluate Performance: The model's performance is evaluated using a performance metric (e.g.,
accuracy, F1-score).
• Select Best Subset: The subset of features that produces the best performance is selected as the
final feature set.
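A minimal sketch of the loop above, using scikit-learn's RFE (recursive feature elimination) as one common wrapper strategy; the dataset, estimator, and feature count are illustrative:

```python
# A minimal sketch of a wrapper method using recursive feature elimination
# (RFE): repeatedly train a model and drop the weakest features.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected
```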
Advantages of Wrapper Methods
• Model-Centric: Wrapper methods consider the performance of the model when selecting features,
leading to potentially better model performance.
• Feature Interaction: These methods can capture feature interactions that may not be apparent in
individual features alone.
Embedded Methods
• Model Training: Embedded methods use machine learning algorithms that inherently perform feature selection during training.
• Intrinsic Feature Selection: Feature selection is intrinsic to the model's learning algorithm and is
performed automatically during model training.
• Regularization Techniques: Embedded methods often use regularization techniques to penalize the
model for the inclusion of unnecessary or redundant features.
• L1 Regularization (Lasso): L1 regularization adds a penalty term to the model's cost function based on the absolute value of the coefficients, encouraging sparse solutions and automatic feature selection (see the sketch after this list).
• Tree-Based Methods: Decision tree-based algorithms, such as Random Forest and Gradient
Boosting Machines (GBM), naturally perform feature selection by selecting the most informative
features at each split.
• Elastic Net: Elastic Net is a regularization technique that combines L1 and L2 penalties to achieve
both feature selection and feature grouping.
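A minimal sketch of the L1 (Lasso) approach mentioned above; the dataset and alpha are illustrative, and in practice the penalty would be tuned by cross-validation:

```python
# A minimal sketch of embedded selection via L1 (Lasso) regularization:
# features whose coefficients are driven to exactly zero are discarded.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
model = Lasso(alpha=0.5)             # illustrative penalty strength
model.fit(X, y)
kept = np.flatnonzero(model.coef_)   # indices of surviving features
print("kept feature indices:", kept)
```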
Advantages of Embedded Methods
• Efficient Feature Selection: Embedded methods perform feature selection directly during model
training, making them efficient and suitable for large datasets.
• Automatic Selection: Feature selection is intrinsic to the model training process, eliminating the
need for separate feature selection steps.
• Handles Non-Linear Relationships: Embedded methods, especially tree-based methods, can
capture non-linear relationships between features and the target variable.
• Improved Model Performance: Feature selection can lead to simpler and more interpretable models
that generalize better to unseen data, resulting in improved performance metrics.
• Reduced Overfitting: By focusing on the most informative features, feature selection can reduce the
risk of overfitting and improve the model's ability to generalize to new data.
• Enhanced Model Interpretability: Selecting a subset of relevant features makes the model more
interpretable and easier to understand for stakeholders and domain experts.
Singular Value Decomposition
• SVD factorizes an input matrix A into three matrices, A = UΣVᵀ, where:
• U: A matrix of orthogonal vectors that represent the left singular vectors of the input
matrix.
• Σ: A diagonal matrix that represents the singular values of the input matrix.
• Vᵀ: The conjugate transpose of a matrix of orthogonal vectors that represent the right singular vectors of the input matrix.
SVD – Geometric Intuition
• U represents a rotation matrix that rotates the original coordinate system to a new
coordinate system in which the data points are aligned along the axes of greatest
variance.
• Σ represents a diagonal matrix that scales the data points along each of the new
coordinate axes. The diagonal elements of Σ are known as the singular values, and
they represent the amount of variance captured by each coordinate axis.
• Vᵀ represents another rotation matrix that rotates the new coordinate system back to the original coordinate system.
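A minimal NumPy sketch (illustrative, not from the slides) that computes a thin SVD and verifies the reconstruction:

```python
# A minimal sketch: computing and verifying the SVD A = U @ diag(s) @ Vt
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin ("economy") SVD
print("singular values:", s)                       # sorted, non-negative
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: exact reconstruction
```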
Types of SVD
Note: “The Great Hack”: Cambridge Analytica is just the tip of the iceberg (Amnesty International).
What is PCA?
• We can obtain these subspaces, which preserve the most variance, by computing the eigenvectors of the covariance matrix of the given data.
• Before that, we need to understand the following terminologies:
• Eigenvalues and eigenvectors
• Covariance matrix
Eigenvalues and Eigenvectors
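The content of this slide appears to be image-based; for reference, the standard definitions are as follows. An eigenvector $v$ (with eigenvalue $\lambda$) of a square matrix $C$, and the sample covariance matrix of an $n \times d$ data matrix $X$ whose columns are mean-centered, satisfy:

$$C v = \lambda v, \qquad C = \frac{1}{n-1} X^\top X$$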
• Provide the n_components as input to the PCA function to extract “n” principal
modes from the given data
• Use “fit_transform()” to generate the reduced dimensional data
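A minimal sketch of the usage described above, with illustrative data and n_components:

```python
# A minimal sketch of the sklearn PCA usage described above
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)            # extract the first 2 principal modes
X_reduced = pca.fit_transform(X)     # reduced-dimensional data
print(X_reduced.shape)               # (150, 2)
```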
Sufficient dimension
• By looking at the “.explained_variance_ratio_” attribute, we can obtain the variance captured by each principal component.
• So by looking at the cumulative explained variance, we can decide the number of modes that we need for our formulation.
• The number of modes and the percentage of variance are highly dependent on the nature of the task.
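A minimal sketch (illustrative dataset and threshold) of picking the number of modes from the cumulative explained variance:

```python
# Sketch: choosing the number of modes from the cumulative explained variance
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                                  # keep all components
cumvar = np.cumsum(pca.explained_variance_ratio_)   # cumulative variance curve
n_modes = int(np.searchsorted(cumvar, 0.95)) + 1    # 95% threshold (task-dependent)
print(f"{n_modes} components capture 95% of the variance")
```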
PCA in image compression
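The content of this slide appears to be image-based; as a sketch of the idea, a rank-k SVD reconstruction compresses an image by storing only k modes (a random array stands in for a real image here):

```python
# Sketch of PCA/SVD-style image compression: keep only the top-k modes
import numpy as np

img = np.random.rand(64, 64)            # stand-in for a grayscale image
U, s, Vt = np.linalg.svd(img, full_matrices=False)
k = 10                                  # number of modes kept (quality knob)
img_k = (U[:, :k] * s[:k]) @ Vt[:k]     # rank-k approximation of the image
print("values stored:", k * (64 + 64 + 1), "vs original", 64 * 64)
```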
Quiz answers:
1. (d) K-means clustering
2. False
3. (b) Most important principal components
4. (b) Minimize within-class scatter
5. False
Linear Discriminant Analysis (LDA)
Terminologies
LDA method
Multi-class LDA
LDA – terminologies
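The LDA slides above appear to be image-based; for reference, the standard Fisher criterion that LDA maximizes, written with the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$, is:

$$J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}$$

Maximizing $J(w)$ finds a projection $w$ that spreads the class means apart while keeping each class compact, which is why “minimize within-class scatter” (quiz answer 4) characterizes the method.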