Lecture#10
Feature selection
• Feature selection is one of the core concepts in machine learning
and hugely impacts the performance of your model: the data
features you use to train a model have a huge influence on the
performance you can achieve.
• In machine learning and statistics, feature selection, also known
as variable selection, attribute selection or variable subset
selection, is the process of selecting a subset of relevant features
(variables, predictors) for use in model construction. Feature
selection techniques are used for several reasons:
• Simplification of models, to make them easier to interpret by
researchers/users
• Shorter training times
• Avoidance of the curse of dimensionality
• Improved compatibility of the data with a particular learning model class
Feature selection
• Feature selection techniques should be distinguished from feature
extraction. Feature extraction creates new features from functions of the
original features, whereas feature selection returns a subset of the
features. Feature selection techniques are often used in domains where
there are many features and comparatively few samples (or data points).
• A feature selection algorithm can be seen as the combination of a search
technique for proposing new feature subsets with an evaluation
measure that scores the different feature subsets. The simplest
algorithm is to test every possible subset of features and keep the one
that minimizes the error rate (a brute-force sketch follows).
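As a concrete illustration of this brute-force search, here is a minimal sketch assuming scikit-learn; the iris dataset and logistic regression estimator are illustrative choices, and the approach is only feasible for small feature counts, since the number of candidate subsets grows as 2^n.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, n_features + 1):
    for subset in combinations(range(n_features), k):
        # Score each candidate subset by cross-validated accuracy,
        # which is equivalent to minimizing the error rate.
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(f"Best subset: {best_subset}, CV accuracy: {best_score:.3f}")
```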
• Irrelevant or partially relevant features can negatively impact model
performance.
• Feature selection and data cleaning should be the first and most
important steps of your model design.
Why feature selection
• What is / why feature selection?
• A procedure in machine learning for finding a subset of features
that produces a 'better' model for a given dataset
– Avoids overfitting and achieves better generalization ability
– Reduces the storage requirement and training time
– Improves interpretability
When feature selection is important
• Noisy data
• Lots of low-frequency features
• Multiple types of features used together
• Too many features compared to the number of samples
• Complex models
• Real-scenario samples that are inhomogeneous with the
training & test samples
How to Select Features
• How do you select features, and what are the benefits of
performing feature selection before modeling your data?
Categories
• Single feature evaluation
– Frequency based, mutual information, KL divergence,
Gini index, information gain, Chi-square statistic
• Subset selection methods (sketched below)
– Sequential forward selection
– Sequential backward selection
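Both subset selection strategies are greedy searches. The following is a minimal sketch assuming scikit-learn's SequentialFeatureSelector; the iris dataset, logistic regression estimator, and the choice of keeping two features are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
est = LogisticRegression(max_iter=1000)

# Forward: start from the empty set and greedily add the feature
# that improves the cross-validated score the most.
forward = SequentialFeatureSelector(
    est, n_features_to_select=2, direction="forward", cv=5
).fit(X, y)

# Backward: start from the full set and greedily remove the feature
# whose removal hurts the score the least.
backward = SequentialFeatureSelector(
    est, n_features_to_select=2, direction="backward", cv=5
).fit(X, y)

print("Forward keeps :", forward.get_support(indices=True))
print("Backward keeps:", backward.get_support(indices=True))
```

Forward selection is cheaper when the target subset is small; backward selection can retain features that are only useful jointly, at the cost of starting from the full feature set.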
Single feature evaluation
• Measure the quality of individual features by various
metrics (a scoring sketch follows):
– Frequency based
– Dependence of feature and label (co-occurrence):
mutual information, Chi-square statistic
– Information theory: KL divergence, information gain
– Gini index
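As an illustration of single feature evaluation, here is a minimal sketch assuming scikit-learn: each feature is scored independently against the label with the Chi-square statistic and with mutual information, and the top-k features are kept. The iris dataset and k=2 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)  # features are non-negative, as chi2 requires

# Chi-square statistic: dependence between each feature and the label.
chi2_scores, _ = chi2(X, y)
print("chi2 scores:", chi2_scores)

# Mutual information: an information-theoretic dependence measure.
mi_scores = mutual_info_classif(X, y, random_state=0)
print("MI scores  :", mi_scores)

# Keep the two features with the highest Chi-square scores.
X_top2 = SelectKBest(chi2, k=2).fit_transform(X, y)
print("reduced shape:", X_top2.shape)
```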
Frequency based