Lecture 14


Feature Selection

Prof. Subir Kumar Das, Dept. of CSE


Generalization Error
• In supervised learning, the main goal is to use training data to build a
model that will be able to make accurate predictions based on new,
unseen data, which has the same characteristics as the initial training set.
• This is known as generalization.
• Generalization relates to how effectively the concepts learned by a
machine learning model apply to particular examples that were not used
throughout the training.
• To train a machine learning model, the dataset is split into 3 sets: training,
validation, and testing.
• Models are trained on the training data, then compared and tuned using their evaluation results on the validation set; in the end, the performance of the best model is evaluated on the testing set (see the sketch after this list).
• The error rate on new cases is called the generalization error (or out-of-sample error); evaluating the models on the validation set gives an estimate of this error.
• A model’s generalization error (also known as prediction error) can be expressed as the sum of three very different errors: bias error, variance error, and irreducible error.
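As a minimal sketch (assuming scikit-learn; the dataset and model below are illustrative and not part of the lecture), the three-way split and the estimation of the generalization error can look like this:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative dataset and model, not from the lecture.
X, y = load_breast_cancer(return_X_y=True)

# First split off a test set, then carve a validation set out of the remainder
# (roughly 60% train / 20% validation / 20% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# The validation error estimates the generalization (out-of-sample) error and is
# used to compare and tune models; the test set gives the final check.
print("validation error:", 1 - accuracy_score(y_val, model.predict(X_val)))
print("test error:", 1 - accuracy_score(y_test, model.predict(X_test)))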
Bias-Variance Tradeoff
• Bias error results from incorrect assumptions, such as thinking that the data is linear when it is actually quadratic.
• Bias is defined as a systematic error that happens in the machine learning
model as a result of faulty ML assumptions.
• Bias can also be measured as the squared difference between the model’s average prediction and the actual values.
• Models with higher bias will not match the training data closely.
• On the other hand, models with lower bias will fit the training dataset closely.
• Variance, as a generalization error, occurs due to the model’s excessive
sensitivity to small variations in the training data.
• In supervised learning, the model learns from training data.
• So a change in the training data will also affect the model.
• The variance shows the amount by which the performance of the
predictive model will be impacted when evaluating based on the
validation data.
Bias-Variance Tradeoff

• Bias/variance in machine learning relates to the problem of simultaneously minimizing two error sources (bias error and variance error).
• If the model is too simple (e.g., linear model), it will have high bias and low
variance.
• If your model is very complex and has many parameters, it will have low
bias and high variance.
• Decreasing the bias error will increase the variance error, and vice versa.
• This correlation is known as the bias-variance tradeoff.
• Total Error = Bias² + Variance + Irreducible Error
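A rough numeric sketch of this decomposition (the quadratic data-generating function, the noise level, and the deliberately too-simple linear model are all invented for illustration):

import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
f = lambda x: 0.5 * x ** 2        # assumed "true" function (illustrative)
noise_sd = 0.3                    # irreducible noise (assumed)
x_test = np.linspace(-1, 1, 50)

# Fit the same (too simple) linear model on many independent training sets.
preds = []
for _ in range(200):
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, noise_sd, x.size)
    coefs = P.polyfit(x, y, deg=1)            # high-bias, low-variance model
    preds.append(P.polyval(x_test, coefs))
preds = np.array(preds)

bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # Bias^2
variance = np.mean(preds.var(axis=0))                      # Variance
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  irreducible={noise_sd ** 2:.3f}")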
Overfitting
• When a model performs very well for training data but has poor
performance with test data (new data), it is known as overfitting.
• This is like a child who memorized every math problem in the problem book and would struggle when facing problems from anywhere else.
• In this case, the machine learning model learns the details and noise in the
training data such that it negatively affects the performance of the model
on test data.
• If the model is overfitting, even a slight change in the training data will cause the model to change significantly.
• Models that are overfitting usually have low bias and high variance.
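A small illustrative sketch of overfitting (the sine data and the polynomial degree are invented for illustration): a very flexible model fits the training points closely but does much worse on held-out data:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                       # assumed "true" function (illustrative)
x_train = np.sort(rng.uniform(-1, 1, 15))
x_test = np.sort(rng.uniform(-1, 1, 100))
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.2, x_test.size)

coefs = np.polyfit(x_train, y_train, deg=12)      # very flexible: low bias, high variance
mse = lambda y, p: float(np.mean((y - p) ** 2))
print("train MSE:", mse(y_train, np.polyval(coefs, x_train)))   # typically very small
print("test MSE:", mse(y_test, np.polyval(coefs, x_test)))      # typically much larger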



Underfitting
• When a model has not learned the patterns in the training data well and is
unable to generalize well on the new data, it is known as underfitting.
• An underfit model has poor performance on the training data and will
result in unreliable predictions.
• Underfitting occurs when a model is not able to learn enough from
training data, making it difficult to capture the dominating trend (model is
unable to create a mapping between the input and the target variable).
• Machine learning models that underfit tend to have poor performance on both the training and testing sets.
• This is like a child who learned only addition and was not able to solve problems involving the other basic arithmetic operations, either from the math problem book or during the math exam.
• Underfitting models usually have high bias and low variance.
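A companion sketch of underfitting on the same kind of invented data: a constant-only ("degree 0") model cannot capture the trend, so training and test errors are both high:

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                       # same illustrative setup as above
x_train = np.sort(rng.uniform(-1, 1, 15))
x_test = np.sort(rng.uniform(-1, 1, 100))
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.2, x_test.size)

coefs = np.polyfit(x_train, y_train, deg=0)       # too simple: high bias, low variance
mse = lambda y, p: float(np.mean((y - p) ** 2))
print("train MSE:", mse(y_train, np.polyval(coefs, x_train)))   # already high
print("test MSE:", mse(y_test, np.polyval(coefs, x_test)))      # similarly high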



Feature Selection
• Feature Selection is the process of selecting the most important features
to input in machine learning algorithms.
• It helps in picking the most important factors from a bunch of options to
build better models in machine learning.
• Feature selection techniques are employed to reduce the number of
input variables by eliminating redundant or irrelevant features and
narrowing down the set of features to those most relevant to the
machine learning model.
• This process is crucial for several reasons:
• Improving Accuracy — By focusing on relevant data and eliminating
noise, the accuracy of the model improves.
• Reducing Overfitting — Less redundant data means less opportunity for
the model to make decisions based on noise, thereby reducing the risk of
overfitting.
• Reducing Training Time — Fewer data points reduce algorithm
complexity and the amount of time needed to train a model.
• Simplifying Models — Simpler models are easier to interpret and explain,
which is valuable in many applications.
Feature Selection
• Simpler models — simple models are easier to explain; a model that is too complex and unexplainable is of little value.
• Variance reduction — increase the precision of the estimates that can be
obtained for a given simulation
• Avoid the curse of high dimensionality — the curse of dimensionality states that, as dimensionality and the number of features increase, the volume of the feature space grows so fast that the available data become sparse. Feature selection may be used to reduce dimensionality.
• The most common input variable data types include:
• Numerical Variables, such as Integer Variables and Floating Point Variables;
and Categorical Variables, such as Boolean Variables, Ordinal Variables,
and Nominal Variables.
• Popular implementations include scikit-learn’s feature_selection module in Python and comparable feature selection packages in R (see the sketch after this list).
• Selection algorithms are categorized as either supervised, which can be
used for labeled data; or unsupervised, which can be used for unlabeled
data.
• Both categories are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods:
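As a minimal sketch of scikit-learn's feature_selection module (the tiny array below is invented for illustration), VarianceThreshold is an unsupervised filter that drops constant or near-constant features without looking at any label:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Tiny invented matrix: the first column is constant and carries no information.
X = np.array([[0.0, 1.2, 5.0],
              [0.0, 0.9, 5.3],
              [0.0, 1.1, 4.7],
              [0.0, 1.0, 5.0]])

selector = VarianceThreshold()            # default threshold: remove zero-variance features
X_reduced = selector.fit_transform(X)
print(selector.get_support())             # [False  True  True] -> constant first column dropped
print(X_reduced.shape)                    # (4, 2)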
Filter Method

• Filter methods select features based on statistical measures rather than on cross-validation performance.
• A selected metric is applied to identify irrelevant attributes and perform
recursive feature selection.
• The scores from these evaluations are used to choose the input variables.
• Filter methods are either univariate, in which an ordered ranking list of
features is established to inform the final selection of feature subset;
• or multivariate, which evaluates the relevance of the features as a whole,
identifying redundant and irrelevant features.
• Filter methods are generally used as a preprocessing step.
• The selection of features is independent of any machine learning
algorithms.
• For basic guidance, the following correlation measures can be referred to, depending on the types of the input and output variables:



Filter Method

• Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y.
• Its value varies from -1 to +1.
• Pearson’s correlation is given as ρ(X, Y) = cov(X, Y) / (σX · σY).
• LDA: Linear discriminant analysis is used to find a linear combination of
features that characterizes or separates two or more classes (or levels) of a
categorical variable.
• ANOVA: ANOVA stands for Analysis of variance.
• It is similar to LDA except for the fact that it is operated using one or more
categorical independent features and one continuous dependent feature.
• Chi-Square: It is a statistical test applied to the groups of categorical
features to evaluate the likelihood of correlation or association between
them using their frequency distribution.
• One thing that should be kept in mind is that filter methods do not remove multicollinearity.
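A minimal sketch of a filter method in scikit-learn (the iris dataset and k=2 are illustrative): each feature is scored against the class label with the ANOVA F-test, and only the top-k features are kept; chi2 could be substituted for non-negative categorical or count features:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative dataset; k=2 is an arbitrary choice.
X, y = load_iris(return_X_y=True)

# Score every feature independently (univariate filter) and keep the best two.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print("ANOVA F-scores:", selector.scores_.round(1))
print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_new.shape)      # (150, 2)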
Wrapper Method

• Wrapper feature selection methods treat the selection of a set of features as a search problem, in which combinations of features are prepared, evaluated, and compared against other combinations.
• It tries to use a subset of features and train a model using them.
• This method facilitates the detection of possible interactions amongst
variables.
• Wrapper methods focus on feature subsets that will help improve the quality of the results of the learning algorithm used for the selection.
• These methods are usually computationally very expensive.
• Some common examples of wrapper methods are forward feature
selection, backward feature elimination, recursive feature elimination,
etc.
• One of the best ways of implementing feature selection with wrapper methods is to use the Boruta package, which finds the importance of a feature by creating shadow features.
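A minimal sketch of a wrapper-style method in scikit-learn (the estimator, dataset, and number of kept features are illustrative): recursive feature elimination repeatedly fits a model and discards the weakest feature until the desired subset size remains:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative estimator and dataset; keeping 5 features is an arbitrary choice.
X, y = load_breast_cancer(return_X_y=True)

# The wrapped estimator is retrained on each candidate subset, so this is
# far more expensive than a filter method.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)

print("kept feature indices:", rfe.get_support(indices=True))
print("ranking (1 = selected):", rfe.ranking_)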
Embedded Method
• Embedded methods combine the qualities of filter and wrapper methods.
• Embedded feature selection methods integrate feature selection into the learning algorithm itself, so that classification and feature selection are performed simultaneously.
• It’s implemented by algorithms that have their own built-in feature
selection methods.
• The features that will contribute the most to each iteration of the model
training process are carefully extracted.
• Random forest feature selection, decision tree feature selection, and
LASSO feature selection are common embedded methods.
• LASSO and Ridge regression have built-in penalization (regularization) functions that reduce overfitting.
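A minimal sketch of an embedded method in scikit-learn (the diabetes dataset and the alpha value are illustrative): an L1-penalized (LASSO) model drives some coefficients exactly to zero, and SelectFromModel keeps only the features with non-zero coefficients:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Illustrative dataset; alpha=1.0 is chosen only for demonstration.
X, y = load_diabetes(return_X_y=True)

# Feature selection happens as a by-product of fitting the penalized model.
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print("non-zero coefficients:", (lasso.coef_ != 0).sum())
print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", selector.transform(X).shape)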



Advantages of Wrapper Feature Selection
• Wrapper feature selection methods are a family of supervised feature
selection techniques that use a predictive model to evaluate the
importance of different subsets of features based on their predictive
performance.
• Performance-Oriented — Wrapper methods tend to provide the best-
performing feature set for the specific model used, as they are
algorithm-oriented and optimize for the highest accuracy or other
performance metrics.
• Model Interaction — They interact directly with the classifier to assess
feature usefulness, which can lead to better model performance
compared to methods that do not.
• Feature Interactions — These methods can capture interactions
between features that may be missed by simpler filter methods.



Disadvantages of Wrapper Feature Selection
• Computationally Intensive — Wrapper methods are computationally
expensive because they require training and evaluating a model for
each candidate subset of features, which can be time consuming and
resource intensive.
• Risk of Overfitting — There is a higher potential for overfitting the
predictors to the training data, as the method seeks to optimize
performance on the given dataset. This may not generalize well to
unseen data.
• Model Dependency — The feature subsets produced by wrapper
methods are specific to the type of model used for selection, which
means they might not perform as well if applied to a different model.
• Lack of Transparency — Wrapper methods do not provide
explanations for why certain features are selected over others, which
can reduce the interpretability of the model.



Thank You

