Feature Selection
Prof. Subir Kumar Das, Dept. of CSE
Generalization Error
• In supervised learning, the main goal is to use training data to build a model that can make accurate predictions on new, unseen data with the same characteristics as the initial training set.
• This ability is known as generalization.
• Generalization describes how well the concepts learned by a machine learning model apply to examples that were not seen during training.
• To train a machine learning model, the dataset is split into three sets: training, validation, and testing.
• Models are trained on the training data, then compared and tuned using their results on the validation set, and finally the performance of the best model is evaluated on the testing set.
• The error rate on new cases is called the generalization error (or out-of-sample error); evaluating the models on the validation set provides an estimate of this error.
• A model's generalization error (also known as prediction error) can be expressed as the sum of three very different errors: bias error, variance error, and irreducible error.
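The following is a minimal sketch of this split-and-evaluate workflow, assuming scikit-learn; the synthetic dataset, split ratios, and logistic-regression model are illustrative choices, not part of the slides.

# Sketch: estimate the generalization (out-of-sample) error with a
# train/validation/test split. Dataset and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% for the final test set, then 25% of the rest for validation,
# giving roughly a 60/20/20 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The validation error estimates the generalization error and is used for
# model comparison and tuning; the test error is the final check on the best model.
print("validation error:", round(1 - model.score(X_val, y_val), 3))
print("test error:      ", round(1 - model.score(X_test, y_test), 3))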
Bias-Variance Tradeoff
• Bias error results from incorrect assumptions, such as assuming the data is linear when it is actually quadratic.
• Bias is a systematic error in a machine learning model caused by faulty modelling assumptions.
• Bias can also be measured as the average squared difference between the model's predictions and the actual data.
• Models with high bias do not fit the training data well, whereas models with low bias fit the training dataset closely.
• Variance, as a generalization error, arises from the model's excessive sensitivity to small variations in the training data.
• In supervised learning the model learns from the training data, so a change in the training data also changes the model.
• The variance indicates how much the performance of the predictive model shifts when it is evaluated on the validation data.
• Bias and variance in machine learning relate to the problem of simultaneously minimizing two sources of error (bias error and variance error).
• If the model is too simple (e.g., a linear model), it will have high bias and low variance.
• If the model is very complex and has many parameters, it will have low bias and high variance.
• Decreasing the bias error increases the variance error, and vice versa; this relationship is known as the bias-variance tradeoff.
• Total Error = Bias² + Variance + Irreducible Error

Overfitting
• When a model performs very well on the training data but poorly on test data (new data), it is said to be overfitting.
• It is like the child who memorized every problem in the math problem book and struggles when facing problems from anywhere else.
• In this case, the machine learning model learns the details and noise in the training data to the point that it hurts the model's performance on test data.
• If the model is overfitting, even a slight change in the training data will cause the model to change significantly.
• Models that are overfitting usually have low bias and high variance.
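As a hedged illustration (not from the slides), an unconstrained decision tree on noisy synthetic data shows the typical overfitting signature: near-perfect training accuracy and a noticeably lower test accuracy.

# Sketch of overfitting: an unrestricted decision tree memorizes the training
# data (low bias) but is sensitive to it (high variance).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit

print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # clearly lower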
Underfitting
• When a model has not learned the patterns in the training data well and is unable to generalize to new data, it is said to be underfitting.
• An underfit model performs poorly on the training data and produces unreliable predictions.
• Underfitting occurs when a model is not able to learn enough from the training data to capture the dominant trend (it cannot create a mapping between the inputs and the target variable).
• Underfitting models tend to perform poorly on both the training and testing sets.
• It is like the child who learned only addition and could not solve problems involving the other basic arithmetic operations, either from his math problem book or during the math exam.
• Underfitting models usually have high bias and low variance.
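A complementary sketch of underfitting, again with illustrative synthetic data: a straight-line model fitted to a quadratic relationship performs poorly on both the training and test sets.

# Sketch of underfitting: a linear model is too simple for quadratic data,
# so both training and test scores stay low (high bias, low variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)   # quadratic target + noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
linear = LinearRegression().fit(X_train, y_train)

print("train R^2:", round(linear.score(X_train, y_train), 2))  # low
print("test R^2: ", round(linear.score(X_test, y_test), 2))    # also low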
Feature Selection
• Feature selection is the process of selecting the most important features to feed into machine learning algorithms.
• It helps in picking the most important factors from a large set of candidates in order to build better models.
• Feature selection techniques reduce the number of input variables by eliminating redundant or irrelevant features, narrowing the feature set down to those most relevant to the machine learning model.
• This process is important for several reasons:
• Improving accuracy: focusing on relevant data and eliminating noise improves the accuracy of the model.
• Reducing overfitting: less redundant data gives the model less opportunity to make decisions based on noise, reducing the risk of overfitting.
• Reducing training time: fewer input variables reduce algorithm complexity and the time needed to train a model.
• Simplifying models: simpler models are easier to interpret and explain, which is valuable in many applications; a model that is too complex to explain is of limited value.
• Variance reduction: feature selection can increase the precision of the estimates that can be obtained for a given amount of data.
• Avoiding the curse of dimensionality: as the number of features (dimensions) increases, the volume of the feature space grows so fast that the available data become sparse; feature selection can be used to reduce dimensionality.
• The most common input variable data types are numerical variables (such as integer and floating-point variables) and categorical variables (such as Boolean, ordinal, and nominal variables).
• Popular implementations include the sklearn.feature_selection module in Python and comparable packages in R (a minimal example follows this list).
• Selection algorithms are categorized as either supervised, which can be used for labeled data, or unsupervised, which can be used for unlabeled data.
• These techniques are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods.
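As a minimal, illustrative example of removing uninformative input variables with the sklearn.feature_selection module mentioned above (the tiny array and zero-variance threshold are placeholders):

# Sketch: drop constant (zero-variance) features, a simple form of
# eliminating irrelevant input variables.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 2.1, 1.0],
    [0.0, 1.9, 3.0],
    [0.0, 2.0, 2.0],
    [0.0, 2.2, 4.0],
])  # the first column never varies and carries no information

selector = VarianceThreshold(threshold=0.0)   # remove zero-variance columns
X_reduced = selector.fit_transform(X)

print(selector.get_support())   # [False  True  True]
print(X_reduced.shape)          # (4, 2)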
Filter Method
• Filter methods select features based on statistical measures rather than on the cross-validation performance of a feature subset.
• A selected metric is applied to identify irrelevant attributes and to perform recursive feature selection.
• The scores from these evaluations are used to choose the input variables.
• Filter methods are either univariate, in which an ordered ranking of features is established to inform the final selection of the feature subset, or multivariate, in which the relevance of the features is evaluated as a whole, identifying redundant and irrelevant features.
• Filter methods are generally used as a preprocessing step; the selection of features is independent of any machine learning algorithm.
• For basic guidance, the following table of correlation measures by feature and response type can be referred to:

  Feature \ Response   Continuous              Categorical
  Continuous           Pearson's Correlation   LDA
  Categorical          ANOVA                   Chi-Square
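A small univariate-filter sketch for the continuous-feature, continuous-response case (Pearson's correlation), using an illustrative pandas DataFrame and an arbitrary 0.5 cutoff:

# Sketch: rank features by absolute Pearson correlation with the target and
# keep only the strongly correlated ones.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_signal": rng.normal(size=200),
    "feat_noise":  rng.normal(size=200),
})
df["target"] = 2.0 * df["feat_signal"] + rng.normal(scale=0.5, size=200)

correlations = df.drop(columns="target").corrwith(df["target"]).abs()
selected = correlations[correlations > 0.5].index.tolist()

print(correlations.round(2))           # feat_signal high, feat_noise near 0
print("selected features:", selected)  # expected: ['feat_signal']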
Filter Method
• Pearson's Correlation: used as a measure for quantifying the linear dependence between two continuous variables X and Y.
• Its value varies from -1 to +1, and it is given as ρ(X, Y) = cov(X, Y) / (σX σY).
• LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
• ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature.
• Chi-Square: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distributions.
• One thing to keep in mind is that filter methods do not remove multicollinearity.
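A minimal filter-method sketch using scikit-learn's SelectKBest; f_classif implements the ANOVA F-test described above, and chi2 could be substituted for non-negative categorical or count features. The dataset and the value of k are illustrative.

# Sketch: keep the k features with the best univariate test scores.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(score_func=f_classif, k=2)   # ANOVA F-test per feature
X_new = selector.fit_transform(X, y)

print("F-scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_new.shape)                # (150, 2)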
Wrapper Method
• Wrapper feature selection methods treat the selection of a set of features as a search problem: candidate combinations of features are prepared, evaluated, and compared against other combinations.
• A wrapper method trains a model on a candidate subset of features and uses the model's performance to assess that subset.
• This approach makes it possible to detect interactions among variables.
• Wrapper methods search for the feature subsets that most improve the quality of the results of the learning algorithm used for the selection.
• These methods are usually computationally very expensive.
• Common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination (a sketch of the latter follows this list).
• One practical way to implement wrapper-style feature selection is the Boruta package, which estimates the importance of a feature by creating shadow features.
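The following is a hedged sketch of recursive feature elimination with scikit-learn; the logistic-regression estimator, the synthetic data, and the target of four features are illustrative.

# Sketch of a wrapper method: RFE repeatedly fits the model and removes the
# weakest features until the requested number remains.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)

print("selected mask:  ", rfe.support_)
print("feature ranking:", rfe.ranking_)   # 1 marks a selected feature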
Embedded Method
• Embedded methods combine the qualities of filter and wrapper methods.
• Embedded feature selection methods integrate feature selection into the learning algorithm itself, so that model training and feature selection are performed simultaneously.
• They are implemented by algorithms that have their own built-in feature selection mechanisms.
• The features that contribute the most during the model training process are retained.
• Random forest feature selection, decision tree feature selection, and LASSO feature selection are common embedded methods.
• LASSO and ridge regression have built-in penalization terms that reduce overfitting; LASSO's L1 penalty can shrink some coefficients exactly to zero, effectively removing those features.
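As an illustrative embedded-method sketch (the regression dataset and alpha value are placeholders): LASSO's L1 penalty drives some coefficients to exactly zero, and SelectFromModel keeps only the features with non-zero coefficients.

# Sketch of an embedded method: feature selection happens inside model
# training via L1 regularization.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=15, n_informative=5,
                       noise=5.0, random_state=0)

selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)   # fits Lasso internally

print("non-zero coefficients:", int((selector.estimator_.coef_ != 0).sum()))
print("selected feature indices:", selector.get_support(indices=True))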
Advantages of Wrapper Feature Selection
• Wrapper feature selection methods are a family of supervised feature selection techniques that use a predictive model to evaluate the importance of different subsets of features based on their predictive performance (see the sketch after this list).
• Performance-oriented: wrapper methods tend to provide the best-performing feature set for the specific model used, as they are algorithm-oriented and optimize for the highest accuracy or another performance metric.
• Model interaction: they interact directly with the classifier to assess feature usefulness, which can lead to better model performance compared to methods that do not.
• Feature interactions: these methods can capture interactions between features that may be missed by simpler filter methods.
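To make the subset-evaluation idea concrete, here is a hedged sketch using scikit-learn's SequentialFeatureSelector; the k-nearest-neighbours model, the wine dataset, and the five-feature target are illustrative choices.

# Sketch: greedy forward selection, where each candidate feature subset is
# scored by cross-validated performance of the wrapped model.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=5,
    direction="forward",
    cv=5,                    # 5-fold cross-validation per candidate subset
)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))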
Disadvantages of Wrapper Feature Selection
• Computationally intensive: wrapper methods are computationally expensive because they require training and evaluating a model for each candidate subset of features, which can be time-consuming and resource-intensive.
• Risk of overfitting: there is a higher potential for overfitting the predictors to the training data, as the method seeks to optimize performance on the given dataset; the selected subset may not generalize well to unseen data.
• Model dependency: the feature subsets produced by wrapper methods are specific to the type of model used for selection, so they might not perform as well when applied to a different model.
• Lack of transparency: wrapper methods do not provide explanations for why certain features are selected over others, which can reduce the interpretability of the model.