Module 5.2 Feature Selection Methods
L.Y., SEM VII, BTech, 2021-22
Feature selection
• Overview
• Sequential Forward selection method
• Sequential Backward selection method
Overview: Feature Selection (FS) / Dimension Reduction
• In many applications, we often encounter a very large
number of potential features that can be used
• Which subset of these features should be used for the best
classification/regression?
• Why do we need a small number of discriminative
features?
1. To improve performance (in terms of speed, predictive
power, simplicity of the model)
2. To visualize the data for model selection
3. To reduce dimensionality and remove noise
4. To reduce computational burden
• Feature Selection is a process that chooses an optimal
subset of features according to a certain criterion
CURSE OF DIMENSIONALITY
The required number of data samples (to achieve the same accuracy)
grows exponentially with the number of features!
In practice, the number of training examples is fixed!
=> the classifier's performance will usually degrade for a large number of
features!
[Figure: classifier accuracy (roughly 74%–90% on the vertical axis) plotted against the number of features used. From http://elpub.scix.net/data/works/att/02-28.content.pdf]
Overview
• Why does accuracy reduce with more
features?
• How does it depend on the specific choice
of features?
• What else changes if we use more features?
• So, how do we choose the right features?
Why accuracy reduces:
• Suppose the best feature set has 20 features. If you add another 5
features, the accuracy of the machine learning model will typically
drop. But you still have the original 20 features! Why does this happen?
Noise / Explosion
• The additional features typically add noise. The machine learning
model will pick up on spurious correlations among features that may
hold in the training set but not in the test set.
• For some ML models, more features mean more parameters to learn
(more NN weights, more decision tree nodes, etc.); the increased
space of possibilities is harder to search.
FEATURE SELECTION
Feature selection:
The problem of selecting the subset of input features on which the
learning algorithm should focus its attention, while ignoring the rest
Wrapper Methods
Wrapper Approach
Advantages
Accuracy: wrappers generally have better recognition rates than filters since
they are tuned to the specific interactions between the classifier and the
features.
Ability to generalize: wrappers have a mechanism to avoid overfitting, since
they typically use cross-validation measures of predictive accuracy.
Disadvantages
Slow execution
FILTER VS WRAPPER APPROACHES (CONT'D)
Filter Approach
Advantages
Fast execution: Filters generally involve a non-iterative computation on the
dataset, which can execute much faster than a classifier training session
Generality: Since filters evaluate the intrinsic properties of the data, rather
than their interactions with a particular classifier, their results exhibit more
generality; the solution will be “good” for a large family of classifiers
Disadvantages
Tendency to select large subsets: filter objective functions are generally
monotonic, so they tend to favor the full feature set, forcing the user to
pick an arbitrary cutoff on the number of features
Feature selection methods
Naïve sequential feature selection
Rank the given N features by their individual performance and select the
top-ranked ones
Disadvantages
• Feature correlation is not considered.
• The best pair of features may not even contain the best individual
feature.
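To make this concrete, here is a minimal Python sketch of the naive ranking approach. J is a hypothetical user-supplied objective function that scores a set of feature indices (for example, cross-validated accuracy of a classifier trained on those features); J and the target size k are illustrative assumptions, not part of the original material.

def naive_selection(n_features, k, J):
    # Score each feature on its own, ignoring correlations between features.
    scores = [(J({i}), i) for i in range(n_features)]
    scores.sort(reverse=True)            # best individual features first
    return {i for _, i in scores[:k]}    # keep the k top-ranked features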
Sequential Forward Selection (SFS)
• Sequential Forward Selection is the simplest greedy
search algorithm
• Starting from the empty set S, sequentially add the
feature F that results in the highest objective
function J when combined with the features that
have already been selected
• Algorithm:
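A minimal Python sketch of this greedy loop, assuming the same kind of user-supplied objective function J and a target subset size k:

def sfs(n_features, k, J):
    # Sequential Forward Selection: start from the empty set and greedily
    # add the single feature that most increases the objective J.
    selected = set()
    remaining = set(range(n_features))
    while len(selected) < k and remaining:
        # Evaluate J with each remaining feature added to the current subset.
        best_f = max(remaining, key=lambda f: J(selected | {f}))
        selected.add(best_f)
        remaining.remove(best_f)
    return selected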
Sequential Forward Selection (SFS)
• SFS performs best when the optimal subset has a small
number of features
• When the search is near the empty set, a large number of
states can be potentially evaluated
• Towards the full set, the region examined by SFS is narrower
since most of the features have already been selected
• The search space is drawn like an ellipse to emphasize
the fact that there are fewer states towards the full or
empty sets
• The main disadvantage of SFS is that it is unable to
remove features that become obsolete after the
addition of other features
SEQUENTIAL FORWARD SELECTION (SFS) (HEURISTIC SEARCH)
Example with four features: x2 is selected in the first step. The candidate
subsets {x2, x1}, {x2, x3}, {x2, x4} are then evaluated, and {x2, x3} is
chosen since J(x2, x3) >= J(x2, xi) for i = 1, 4.
ILLUSTRATION (SFS)
Four features x1, x2, x3, x4. Each subset is shown as a bit string from
0,0,0,0 (empty set) to 1,1,1,1 (full set), where 1 means xi is selected and
0 means xi is not selected; SFS moves upward through this lattice, adding
one feature at a time.
SFS Example
Sequential Backward Selection (SBS)
• Sequential Backward Selection works in the opposite
direction of SFS
• Starting from the full set S, sequentially remove the feature F that
results in the smallest decrease in the value of the objective
function J
• Notice that removal of a feature may actually lead to an increase in
the objective function J. Such functions are said to be non-monotonic
• Algorithm
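A minimal Python sketch of SBS, under the same assumptions as the SFS sketch (a user-supplied objective function J and a target subset size k):

def sbs(n_features, k, J):
    # Sequential Backward Selection: start from the full set and repeatedly
    # drop the feature whose removal hurts the objective J the least.
    selected = set(range(n_features))
    while len(selected) > k:
        worst_f = max(selected, key=lambda f: J(selected - {f}))
        selected.remove(worst_f)
    return selected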
Sequential Backward Selection (SBS)
• SBS works best when the optimal feature subset has
a large number of features, since SBS spends most of
its time visiting large subsets
• The main limitation of SBS is its inability to reevaluate
the usefulness of a feature after it has been
discarded
SEQUENTIAL BACKWARD SELECTION (SBS) (HEURISTIC SEARCH)
Example: when {x2} and {x3} are the remaining candidate singletons, J(x2) is
maximum, so x3 is the worst feature and is removed.
ILLUSTRATION (SBS)
Four features x1, x2, x3, x4. Each subset is shown as a bit string from
1,1,1,1 (full set) to 0,0,0,0 (empty set), where 1 means xi is selected and
0 means xi is not selected; SBS moves downward through this lattice,
removing one feature at a time.
SBS Example
Some generalizations of SFS and SBS
– "Plus-L, minus-R" selection (LRS)
– Sequential floating forward/backward selection
(SFFS and SFBS)
Plus-L Minus-R Selection (LRS)
• Plus-L Minus-R is a generalization of SFS and SBS
• If L>R, LRS starts from the empty set and repeatedly adds ‘L’
features and removes ‘R’ features
• If L<R, LRS starts from the full set and repeatedly removes ‘R’
features followed by ‘L’ feature additions
• LRS attempts to compensate for the weaknesses of SFS
and SBS with some backtracking capabilities
• Its main limitation is the lack of a theory to help choose the
optimal values of L and R
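A minimal Python sketch of the L > R case (search starting from the empty set), again assuming a user-supplied objective function J and a target size k no larger than the number of features; the L < R case is the mirror image starting from the full set:

def lrs(n_features, k, L, R, J):
    # "Plus-L, Minus-R" selection for L > R: alternate L forward (SFS-style)
    # additions with R backward (SBS-style) removals until k features are chosen.
    if L <= R:
        raise ValueError("this sketch assumes L > R (search starts from the empty set)")
    selected = set()
    remaining = set(range(n_features))
    while True:
        for _ in range(L):                 # L forward steps
            if len(selected) == k or not remaining:
                break
            f = max(remaining, key=lambda x: J(selected | {x}))
            selected.add(f)
            remaining.remove(f)
        if len(selected) == k:
            return selected
        for _ in range(R):                 # R backward steps
            if len(selected) <= 1:
                break
            f = max(selected, key=lambda x: J(selected - {x}))
            selected.remove(f)
            remaining.add(f)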
Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
• SFS is performed from the empty set
• SBS is performed from the full set
• To guarantee that SFS and SBS converge to the same solution
• Features already selected by SFS are not removed by SBS
• Features already removed by SBS are not selected by SFS
• Algorithm
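A minimal Python sketch of BDS under the same assumed objective function J. The two constraints above are enforced by letting SFS choose only among features SBS has not yet discarded, and letting SBS drop only features SFS has not yet selected:

def bds(n_features, J):
    # Bidirectional Search: SFS grows `forward` from the empty set while
    # SBS shrinks `backward` from the full set; the two meet in the middle.
    forward = set()                      # features selected so far by SFS
    backward = set(range(n_features))    # features not yet removed by SBS
    while forward != backward:
        # Forward step: add the best feature SBS has not already discarded.
        f = max(backward - forward, key=lambda x: J(forward | {x}))
        forward.add(f)
        if forward == backward:
            break
        # Backward step: remove the least useful feature SFS has not selected.
        r = max(backward - forward, key=lambda x: J(backward - {x}))
        backward.remove(r)
    return forward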
BIDIRECTIONAL SEARCH (BDS)
SBS works downward from the full set {x1, x2, x3, x4} while SFS works upward
from the empty set 𝜙, until the two searches meet.
ILLUSTRATION (BDS)
Four features x1, x2, x3, x4. Each subset is shown as a bit string between
0,0,0,0 (empty set) and 1,1,1,1 (full set), where 1 means xi is selected and
0 means xi is not selected; SFS climbs the lattice from below while SBS
descends from above.
Sequential Floating Selection (SFFS and SFBS)
• Sequential Floating Selection methods are an extension to the LRS
algorithms with flexible backtracking capabilities
• Rather than fixing the values of ‘L’ and ‘R’, these floating methods allow
those values to be determined from the data
• The dimensionality of the subset during the search can be thought of as
"floating" up and down
• There are two floating methods
• Sequential Floating Forward Selection (SFFS) starts from the empty set
• After each forward step, SFFS performs backward steps as long as the objective
function increases
• Sequential Floating Backward Selection (SFBS) starts from the full set
• After each backward step, SFBS performs forward steps as long as the
objective function increases
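A minimal Python sketch of SFFS, assuming the usual user-supplied objective function J and target size k; SFBS is the mirror image starting from the full set. The best_at_size bookkeeping (a backward step is taken only if it beats the best subset previously seen at that smaller size) and the rule of floating back only while more than two features are selected are illustrative choices, not taken from the slides.

def sffs(n_features, k, J):
    # Sequential Floating Forward Selection: after every forward (add) step,
    # keep removing features as long as the smaller subset beats the best
    # subset previously recorded at that size.
    selected = set()
    remaining = set(range(n_features))
    best_at_size = {}                # best J value seen for each subset size
    while len(selected) < k and remaining:
        # Forward step: add the feature that maximizes J.
        f = max(remaining, key=lambda x: J(selected | {x}))
        selected.add(f)
        remaining.remove(f)
        best_at_size[len(selected)] = max(
            best_at_size.get(len(selected), float("-inf")), J(selected))
        # Conditional backward ("floating") steps.
        while len(selected) > 2:
            worst = max(selected, key=lambda x: J(selected - {x}))
            score = J(selected - {worst})
            if score > best_at_size.get(len(selected) - 1, float("-inf")):
                selected.remove(worst)
                remaining.add(worst)
                best_at_size[len(selected)] = score
            else:
                break
    return selected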
Limitations of Feature selection methods