
Feature Selection (5.2)
L.Y., SEM VII, BTech, 2021-22
Feature selection

• Overview
• Sequential Forward selection method
• Sequential Backward selection method
Overview: Feature Selection (FS) / Dimension Reduction
• In many applications, we often encounter a very large
number of potential features that can be used
• Which subset of these features should be used for the best
classification/regression?
• Why do we need a small number of discriminative features?
1. To improve performance (in terms of speed, predictive power,
simplicity of the model)
2. To visualize the data for model selection
3. To reduce dimensionality and remove noise
4. To reduce the computational burden
• Feature Selection is a process that chooses an optimal subset of features
according to a certain criterion
CURSE OF DIMENSIONALITY

 The required number of data samples (to achieve the same accuracy) grows
exponentially with the number of features!
 In practice, the number of training examples is fixed!
=> the classifier’s performance usually degrades for a large number of
features!

In fact, after a certain point, increasing the dimensionality of the problem
by adding new features actually degrades the performance of the classifier.
Overview
• Reasons for performing FS may include:
• removing irrelevant data.
• increasing predictive accuracy of learned models.
• reducing the cost of the data.
• improving learning efficiency, such as reducing storage
requirements and computational cost.
• reducing the complexity of the resulting model
description, improving the understanding of the data
and the model.
Overview: Feature Selection vs. Feature Dimension Reduction
• Both are used for the purpose of reducing feature dimensionality
• Feature Selection: choose the best subset of size m from the available d
features
• Dimensionality reduction: given d features (set Y), extract m new features
(set X) by a linear or non-linear combination of all the d features
• New features obtained by extraction may not have a physical
interpretation/meaning
• Examples of linear feature dimensionality reduction
– Unsupervised: PCA; Supervised: LDA
Overview: Feature Selection vs. Dimensionality
reduction
• Feature Selection
– When classifying novel patterns, only a small number of
features need to be computed (i.e., faster classification).
– The measurement units (length, weight, etc.) of the
features are preserved.
• Dimensionality Reduction
– When classifying novel patterns, all features need to be
computed
– The measurement units (length, weight, etc.) of the
features are lost.
Feature Selection: Why?
The accuracy on all test Web URLs when changing the number of top words for
the category file:

[Figure: line chart of accuracy (y-axis, roughly 74%-90%) versus the number of
top words used for the category file (x-axis).]

The performance drops as the number of predictors increases.

From http://elpub.scix.net/data/works/att/02-28.content.pdf
Overview
• Why does accuracy reduce with more
features?
• How does it depend on the specific choice
of features?
• What else changes if we use more features?
• So, how do we choose the right features?
Why accuracy reduces:
• Suppose the best feature set has 20 features.
If you add another 5 features, typically the
accuracy of the machine learning model may drop. But you still have the
original 20 features! Why does this happen?
Noise / Explosion
• The additional features typically add noise. The machine learning model may
pick up on spurious correlations among features that hold in the training
set but not in the test set.
• For some ML models, more features means more
parameters to learn (more NN weights, more
decision tree nodes, etc…) – the increased space
of possibilities is more difficult to search.
FEATURE SELECTION

 Feature selection:
The problem of selecting a subset of the input features on which the learner
should focus attention, while ignoring the rest

 Humans/animals do that constantly!


FEATURE SELECTION (DEFINITION)

 Given a set of N features, the role of feature selection is to select a
subset of size M (M << N) that leads to the smallest classification/clustering
error.
Feature selection: Search strategy and
objective function
• Feature Subset Selection requires
• A search strategy to select candidate subsets
• An objective function to evaluate these candidates
Search Strategy
• Exhaustive evaluation of feature subsets involves C(N, M) = N!/(M!(N-M)!)
combinations for a fixed value of M, and 2^N combinations if M must be
optimized as well
• This number of combinations is infeasible, even for moderate values of M
and N, so a search procedure must be used in practice
• For example, exhaustive evaluation of 10 out of 20 features involves
184,756 feature subsets; exhaustive evaluation of 10 out of 100 involves more
than 10^13 feature subsets (see the quick check after this slide)
• A search strategy is therefore needed to direct the
FSS process as it explores the space of all possible
combination of features
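As a quick check of these counts, the binomial coefficients can be computed directly (a minimal Python sketch using only the standard library):

    from math import comb

    # Choosing M features out of N for a fixed M: C(N, M) subsets to evaluate
    print(comb(20, 10))    # 184756 subsets for 10 out of 20 features
    print(comb(100, 10))   # 17310309456440 subsets (> 10**13) for 10 out of 100

    # If M itself must also be optimized, every subset is a candidate: 2**N
    print(2 ** 20)         # 1048576 subsets for N = 20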
Search Strategy
• There are many search strategies, which can be grouped into three categories
• Exponential algorithms: These algorithms evaluate a number of subsets that grows
exponentially with the dimensionality of the search space. The most representative
algorithms under this class are
• Exhaustive Search
• Branch and Bound
• Approximate Monotonicity with Branch and Bound
• Beam Search
• Sequential algorithms: These algorithms add or remove features sequentially, but have
a tendency to become trapped in local minima. Representative examples of sequential
search include
• Sequential Forward Selection
• Sequential Backward Selection
• Plus-l Minus-r Selection
• Bidirectional Search
• Sequential Floating Selection
• Randomized algorithms: These algorithms incorporate randomness into their
search procedure to escape local minima. Representative examples are
• Random Generation plus Sequential Selection
• Simulated Annealing
• Genetic Algorithms
Objective Function
• The objective function evaluates candidate subsets and
returns a measure of their “goodness”, a feedback signal
used by the search strategy to select new candidates
• Objective functions are divided into two groups
• Filters: The objective function evaluates feature subsets
by their information content, typically interclass distance,
statistical dependence or information-theoretic
measures
• Wrappers: The objective function is a pattern classifier,
which evaluates feature subsets by their predictive
accuracy (recognition rate on test data) by statistical
resampling or cross-validation
FILTER METHOD
 Filter Methods
 Evaluation is independent of the classification algorithm.
 The objective function evaluates feature subsets by their information
content, typically interclass distance, statistical dependence or
information-theoretic measures.
• Pearson correlation coefficient
• F-score
• Chi-square
• Information content
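A minimal sketch of a filter-style criterion, assuming scikit-learn is available; the Iris data and the ANOVA F-score are chosen purely for illustration:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Filter: score each feature by class separability (ANOVA F-score),
    # independently of any classifier, and keep the top k features.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_reduced = selector.fit_transform(X, y)

    print("F-scores:", np.round(selector.scores_, 1))
    print("Selected feature indices:", selector.get_support(indices=True))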
WRAPPER METHOD
 Wrapper Methods
 Evaluation uses criteria related to the classification algorithm.
 The objective function is a pattern classifier, which evaluates feature
subsets by their predictive accuracy (recognition rate on test data) by
statistical resampling or cross-validation.
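A minimal sketch of a wrapper-style objective function, assuming scikit-learn; the k-NN classifier and the Iris data are placeholders for whatever classifier the wrapper is built around:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    def wrapper_score(subset):
        """Objective J: mean cross-validated accuracy of a classifier
        trained only on the given tuple of feature indices."""
        clf = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(clf, X[:, list(subset)], y, cv=5).mean()

    print(wrapper_score((0, 1)))   # evaluate the candidate subset {x1, x2}
    print(wrapper_score((2, 3)))   # evaluate the candidate subset {x3, x4}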
FILTER VS WRAPPER
APPROACHES

Wrapper Approach
 Advantages
 Accuracy: wrappers generally have better recognition rates than filters,
since they are tuned to the specific interactions between the classifier and
the features.
 Ability to generalize: wrappers have a mechanism to avoid overfitting, since
they typically use cross-validation measures of predictive accuracy.
 Disadvantages
 Slow execution
FILTER VS WRAPPER
APPROACHES (CONT’D)

Filter Approach
 Advantages
 Fast execution: Filters generally involve a non-iterative computation on the
dataset, which can execute much faster than a classifier training session
 Generality: Since filters evaluate the intrinsic properties of the data, rather
than their interactions with a particular classifier, their results exhibit more
generality; the solution will be “good” for a large family of classifiers
 Disadvantages
 Tendency to select large subsets: Filter objective functions are generally
monotonic
Feature selection methods
Naïve sequential feature selection
Sort the given N features in order of their performance

Select the top M features from this sorted list

Disadvantage
• Feature correlation is not considered.
• Best pair of features may not even contain the best individual
feature.
Sequential Forward Selection (SFS)
• Sequential Forward Selection is the simplest greedy
search algorithm
• Starting from the empty set S, sequentially add the
feature F that results in the highest objective
function J when combined with the features that
have already been selected
• Algorithm:
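(The algorithm box from the original slide is not reproduced here; the sketch below is one possible Python realization, where J is any objective function over tuples of feature indices, e.g. the wrapper_score sketched earlier.)

    def sequential_forward_selection(n_features, J, m):
        """Greedy SFS: start from the empty set and repeatedly add the single
        feature whose inclusion gives the highest objective J, until m
        features have been selected."""
        selected, remaining = [], set(range(n_features))
        while len(selected) < m:
            best = max(remaining, key=lambda f: J(tuple(selected + [f])))
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. sequential_forward_selection(4, wrapper_score, m=2)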
Sequential Forward Selection (SFS)
• SFS performs best when the optimal subset has a small
number of features
• When the search is near the empty set, a large number of
states can be potentially evaluated
• Towards the full set, the region examined by SFS is narrower
since most of the features have already been selected
• The search space is drawn like an ellipse to emphasize
the fact that there are fewer states towards the full or
empty sets
• The main disadvantage of SFS is that it is unable to
remove features that become obsolete after the
addition of other features
SEQUENTIAL FORWARD SELECTION (SFS)
(HEURISTIC SEARCH)

 First, the best single feature is selected (i.e., using some criterion
function).
 Then, pairs of features are formed using one of the remaining features and
this best feature, and the best pair is selected.
 Next, triplets of features are formed using one of the remaining features
and these two best features, and the best triplet is selected.
 This procedure continues until a predefined number of features has been
selected. (SFS performs best when the optimal subset is small.)
SEQUENTIAL FORWARD SELECTION (SFS) (HEURISTIC SEARCH)

{x1, x2, x3, x4}

{x2, x3, x1}  {x2, x3, x4}            J(x2, x3, x1) >= J(x2, x3, x4)

{x2, x1}  {x2, x3}  {x2, x4}          J(x2, x3) >= J(x2, xi); i = 1, 4

{x1}  {x2}  {x3}  {x4}                J(x2) >= J(xi); i = 1, 3, 4


ILLUSTRATION (SFS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0

0,0,0,0
ILLUSTRATION (SFS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0 x3

0,0,0,0
ILLUSTRATION (SFS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0 x2, x3

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0 x3

0,0,0,0
ILLUSTRATION (SFS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0 x1,x2, x3

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0 x2, x3

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0 x3

0,0,0,0
SFS Example
Sequential Backward Selection (SBS)
• Sequential Backward Selection works in the opposite
direction of SFS
• Starting from the full set S, sequentially remove the feature F that
results in the smallest decrease in the value of the objective
function J
• Notice that removal of a feature may actually lead to an increase in
the objective function J. Such functions are said to be non-monotonic
• Algorithm
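(As with SFS, the algorithm box is not reproduced here; a minimal Python sketch with the same kind of objective J could look as follows.)

    def sequential_backward_selection(n_features, J, m):
        """Greedy SBS: start from the full feature set and repeatedly remove
        the feature whose removal leaves the highest-scoring subset (i.e. the
        smallest decrease in J), until only m features remain."""
        selected = list(range(n_features))
        while len(selected) > m:
            worst = max(selected,
                        key=lambda f: J(tuple(x for x in selected if x != f)))
            selected.remove(worst)
        return selected

    # e.g. sequential_backward_selection(4, wrapper_score, m=2)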
Sequential Backward Selection (SBS)
• SBS works best when the optimal feature subset has
a large number of features, since SBS spends most of
its time visiting large subsets
• The main limitation of SBS is its inability to reevaluate
the usefulness of a feature after it has been
discarded
SEQUENTIAL BACKWARD SELECTION
(SBS) (HEURISTIC SEARCH)

 First, the objective function is computed for all N features.
 Then, each feature is deleted one at a time, the criterion function is
computed for all subsets with N-1 features, and the worst feature is
discarded.
 Next, each feature among the remaining N-1 features is deleted one at a
time, and the worst feature is discarded to form a subset with N-2 features.
 This procedure continues until a predefined number of features is left.
(SBS performs best when the optimal subset is large.)
SEQUENTIAL BACKWARD SELECTION
(SBS) (HEURISTIC SEARCH)

{x1, x2, x3, x4}

{x2, x3, x4}  {x1, x3, x4}  {x1, x2, x3}    J(x1, x2, x3) is maximum; x4 is the worst feature

{x2, x3}  {x1, x3}  {x1, x2}                J(x2, x3) is maximum; x1 is the worst feature

{x2}  {x3}                                  J(x2) is maximum; x3 is the worst feature
ILLUSTRATION (SBS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0

0,0,0,0
ILLUSTRATION (SBS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0 x1, x2, x3

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0

0,0,0,0
ILLUSTRATION (SBS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0 x1, x2, x3

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0 x2, x3

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0

0,0,0,0
ILLUSTRATION (SBS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0 x1, x2, x3

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0 x2, x3

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0 x2

0,0,0,0
SBS Example
Some generalizations of SFS and SBS
– Plus-L, Minus-R selection (LRS)
– Sequential floating forward/backward selection
(SFFS and SFBS)
Plus-L Minus-R Selection (LRS)
• Plus-L Minus-R is a generalization of SFS and SBS
• If L>R, LRS starts from the empty set and repeatedly adds ‘L’
features and removes ‘R’ features
• If L<R, LRS starts from the full set and repeatedly removes ‘R’
features followed by ‘L’ feature additions
• LRS attempts to compensate for the weaknesses of SFS
and SBS with some backtracking capabilities
• Its main limitation is the lack of a theory to help choose the optimal
values of L and R
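A rough sketch of the L > R case (an illustrative reading of LRS, not the canonical formulation; J is an objective over tuples of feature indices as in the earlier sketches):

    def plus_l_minus_r_selection(n_features, J, m, L=2, R=1):
        """LRS with L > R: repeatedly perform L greedy forward steps followed
        by R greedy backward steps, until m features are selected."""
        assert L > R, "this sketch grows from the empty set, so L must exceed R"
        selected = []
        while len(selected) < m:
            for _ in range(L):   # "plus L": add the best remaining feature
                remaining = [f for f in range(n_features) if f not in selected]
                if not remaining:
                    break
                selected.append(max(remaining,
                                    key=lambda f: J(tuple(selected + [f]))))
            for _ in range(R):   # "minus R": drop the least useful feature
                if len(selected) <= 1:
                    break
                worst = max(selected,
                            key=lambda f: J(tuple(x for x in selected if x != f)))
                selected.remove(worst)
        return selected[:m]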
Bidirectional Search (BDS)
• BDS applies SFS and SBS simultaneously:
• SFS is performed from the empty set
• SBS is performed from the full set
• To guarantee that SFS and SBS converge to the same solution
• Features already selected by SFS are not removed by SBS
• Features already removed by SBS are not selected by SFS
• Algorithm
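(The algorithm figure is not reproduced; the sketch below is one way to realize the two constraints above, again with J as the objective function.)

    def bidirectional_search(n_features, J, m):
        """BDS sketch: alternate SFS steps on a growing set and SBS steps on a
        shrinking set. Features selected by SFS are never removed by SBS, and
        features removed by SBS are never selected by SFS."""
        forward = []                          # grown by SFS
        backward = list(range(n_features))    # shrunk by SBS
        while len(forward) < m:
            # SFS step: add the best feature that SBS has not removed yet
            candidates = [f for f in backward if f not in forward]
            best = max(candidates, key=lambda f: J(tuple(forward + [f])))
            forward.append(best)
            # SBS step: remove the least useful feature not selected by SFS
            droppable = [f for f in backward if f not in forward]
            if droppable and len(backward) > m:
                worst = max(droppable,
                            key=lambda f: J(tuple(x for x in backward if x != f)))
                backward.remove(worst)
        return forward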
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

SFS
𝜙
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

{x1} {x2} {x3} {x4} J(x2) is maximum


x2 is selected

SFS
𝜙
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

{x2, x3, x4} {x2, x1, x4} {x2, x1, x3}

{x1} {x2} {x3} {x4} J(x2) is maximum


x2 is selected

SFS
𝜙
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

J(x2, x1, x4) is maximum


{x2, x3, x4} {x2, x1, x4} {x2, x1, x3} x3 is removed

{x1} {x2} {x3} {x4} J(x2) is maximum


x2 is selected

SFS
𝜙
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

J(x2, x1, x4) is maximum


{x2, x3, x4} {x2, x1, x4} {x2, x1, x3} x3 is removed

{x2, x1} {x2, x4}

{x1} {x2} {x3} {x4} J(x2) is maximum


x2 is selected

SFS
𝜙
BIDIRECTIONAL SEARCH (BDS)

SBS
{x1, x2, x3, x4}

J(x2, x1, x4) is maximum


{x2, x3, x4} {x2, x1, x4} {x2, x1, x3} x3 is removed

{x2, x1} {x2, x4} J(x2, x4) is maximum


x4 is selected

{x1} {x2} {x3} {x4} J(x2) is maximum


x2 is selected

SFS
𝜙
ILLUSTRATION (BDS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0

0,0,0,0
ILLUSTRATION (BDS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0 x2

0,0,0,0
ILLUSTRATION (BDS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0    x2, x1, x4

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0    x2

0,0,0,0
ILLUSTRATION (BDS)
Four Features – x1, x2, x3, x4
1,1,1,1
1-xi is selected; 0-xi is not selected

0,1,1,1 1,0,1,1 1,1,0,1 1,1,1,0    x2, x1, x4

0,0,1,1 0,1,0,1 1,0,0,1 0,1,1,0 1,0,1,0 1,1,0,0    x2, x4

0,0,0,1 0,0,1,0 0,1,0,0 1,0,0,0    x2

0,0,0,0
Sequential Floating Selection (SFFS and SFBS)
• Sequential Floating Selection methods are an extension to the LRS
algorithms with flexible backtracking capabilities
• Rather than fixing the values of ‘L’ and ‘R’, these floating methods allow
those values to be determined from the data
• The dimensionality of the subset during the search can be thought of as
“floating” up and down
• There are two floating methods
• Sequential Floating Forward Selection (SFFS) starts from the empty set
• After each forward step, SFFS performs backward steps as long as the objective
function increases
• Sequential Floating Backward Selection (SFBS) starts from the full set
• After each backward step, SFBS performs forward steps as long as the
objective function increases
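Floating selection is implemented, for example, in the third-party mlxtend package; a hedged sketch (assuming mlxtend and scikit-learn are installed; parameter names follow mlxtend's documented API):

    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # forward=True, floating=True -> SFFS; forward=False, floating=True -> SFBS
    sffs = SFS(KNeighborsClassifier(n_neighbors=3),
               k_features=2,
               forward=True,
               floating=True,
               scoring='accuracy',
               cv=5).fit(X, y)

    print(sffs.k_feature_idx_, sffs.k_score_)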
Limitations of Feature selection methods

• Unclear how to tell in advance whether feature selection will work
• The only known way is to try it, but for very high-dimensional data (at
least half a million features) it helps most of the time
• How many features to select?
• Perform cross-validation (see the sketch below)
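One common recipe, sketched here with scikit-learn (the filter criterion, classifier and dataset are placeholders): sweep the number of selected features and compare the cross-validated accuracy of the downstream model.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    X, y = load_iris(return_X_y=True)

    for k in range(1, X.shape[1] + 1):
        # Keeping selection inside the pipeline applies it per CV fold (no leakage).
        pipe = make_pipeline(SelectKBest(f_classif, k=k),
                             KNeighborsClassifier(n_neighbors=3))
        print(k, cross_val_score(pipe, X, y, cv=5).mean())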
