Feature Selection Methods

The document discusses feature selection, which aims to select a subset of important features for classification. It covers filter and wrapper methods, search strategies like sequential forward and backward selection, and evaluation criteria. Genetic algorithms can also be used for randomized feature selection.

Feature Selection



Feature Selection
• Given a set of n features, the goal of feature selection is to select a
subset of d features (d < n) in order to minimize the classification error.


• Why perform feature selection?

– Data interpretation / knowledge discovery (insight into which factors
are most representative of your problem)
– Curse of dimensionality (the amount of data needed grows exponentially
with the number of features, O(2^n))

• Fundamentally different from dimensionality reduction (we will
discuss next time), which is based on feature combinations (i.e.,
feature extraction).
Feature Selection vs.
Dimensionality Reduction
• Feature Selection
– When classifying novel patterns, only a small number of features
need to be computed (i.e., faster classification).
– The measurement units (length, weight, etc.) of the features are
preserved.

• Dimensionality Reduction (next time)


– When classifying novel patterns, all features need to be computed.
– The measurement units (length, weight, etc.) of the features are
lost.
Feature Selection Steps

• Feature selection is an
optimization problem.
– Step 1: Search the space of
possible feature subsets.

– Step 2: Pick the subset that is
optimal or near-optimal with
respect to some objective
function.
Feature Selection Steps (cont’d)

Search strategies
– Optimal
– Heuristic

Evaluation strategies
– Filter methods
– Wrapper methods
Evaluation Strategies
• Filter Methods
– Evaluation is independent of
the classification algorithm.

– The objective function
evaluates feature subsets by
their information content,
typically interclass distance,
statistical dependence, or
information-theoretic
measures.
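
As a concrete illustration (an assumption, not part of the slides), a filter criterion can be as simple as ranking features by their mutual information with the class label. A minimal sketch in Python using scikit-learn and a synthetic dataset:

# A minimal filter-method sketch: score each feature by mutual information
# with the class label, independently of any classifier, and keep the top d.
# The dataset and the choice of measure are purely illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

scores = mutual_info_classif(X, y, random_state=0)   # information content per feature
d = 5
selected = np.argsort(scores)[::-1][:d]              # indices of the top-d features
print("selected features:", selected)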
Evaluation Strategies

• Wrapper Methods
– Evaluation uses criteria
related to the
classification algorithm.

– The objective function is a
pattern classifier, which
evaluates feature subsets by
their predictive accuracy
(recognition rate on test data),
estimated by statistical
resampling or cross-validation.
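
A minimal wrapper-style objective function, sketched under illustrative assumptions (scikit-learn, a k-NN classifier, synthetic data); any classifier and resampling scheme could be substituted:

# A minimal wrapper-method sketch: the objective function trains an actual
# classifier on a candidate subset and returns its cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

def wrapper_criterion(subset):
    """Predictive accuracy (5-fold cross-validation) using only `subset`."""
    return cross_val_score(KNeighborsClassifier(), X[:, list(subset)], y, cv=5).mean()

print(wrapper_criterion([0, 1, 2]))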
Filter vs. Wrapper Approaches
Search Strategies
• Assuming n features, an exhaustive search would
require:
– Examining all C(n, d) = n! / (d!(n−d)!) possible subsets of size d.

– Selecting the subset that performs the best according to the
criterion function.

• The number of subsets grows combinatorially, making
exhaustive search impractical.
• In practice, heuristics are used to speed up the search, but
they cannot guarantee optimality.
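
A sketch of exhaustive search over all C(n, d) subsets, assuming some criterion function J(subset) to maximize (for example, the wrapper criterion sketched earlier); it is only feasible for small n and d:

# Exhaustive search: evaluate every d-element subset of {0, ..., n-1} and
# return the best one under the supplied criterion J.
from itertools import combinations

def exhaustive_search(n, d, J):
    return max(combinations(range(n), d), key=J)

# Toy usage with an illustrative criterion (here simply the sum of indices):
print(exhaustive_search(n=6, d=3, J=sum))   # -> (3, 4, 5)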

Naïve Search
• Sort the given n features in order of their probability of
correct recognition.

• Select the top d features from this sorted list.

• Disadvantage
– Correlation among features is not considered.
– The best pair of features may not even contain the best
individual feature.
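
A small illustration of this pitfall, using hypothetical data: x0 and x1 are useless individually but perfect together (y = XOR(x0, x1)), while x2 is the best single feature, so a one-feature-at-a-time ranking keeps x2 and misses the pair that actually solves the problem:

# Naive-search pitfall: individual rankings ignore feature interactions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 1000)
x1 = rng.integers(0, 2, 1000)
y = x0 ^ x1                                      # target is the XOR of x0 and x1
x2 = np.where(rng.random(1000) < 0.7, y, 1 - y)  # agrees with y only 70% of the time
X = np.column_stack([x0, x1, x2])

def acc(cols):
    return cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5).mean()

print("individual accuracies:", [round(acc([i]), 2) for i in range(3)])  # x2 is best alone
print("pair {x0, x1}:", round(acc([0, 1]), 2))                           # near-perfect together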
Sequential forward selection (SFS)
(heuristic search)
• First, the best single feature is selected (i.e.,
using some criterion function).
• Then, pairs of features are formed using one of
the remaining features and this best feature, and
the best pair is selected.
• Next, triplets of features are formed using one
of the remaining features and these two best
features, and the best triplet is selected.
• This procedure continues until a predefined
number of features are selected.

SFS performs
best when the
optimal subset is
small.
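
A minimal SFS sketch, assuming a criterion function J(subset) that returns a score to maximize (for example, a wrapper criterion such as cross-validated accuracy):

# Sequential forward selection: greedily grow the subset from the empty set.
def sfs(n_features, d, J):
    selected, remaining = [], set(range(n_features))
    while len(selected) < d:
        # Add the remaining feature that improves the criterion the most.
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Example usage: sfs(n_features=20, d=5, J=wrapper_criterion)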
Example
(Figure omitted.) Results of sequential forward feature selection for classification of a
satellite image using 28 features. The x-axis shows the classification accuracy (%) and the
y-axis shows the features added at each iteration (the first iteration is at the bottom).
The highest accuracy value is shown with a star.
Sequential backward selection (SBS)
(heuristic search)
• First, the criterion function is computed for all n
features.
• Then, each feature is deleted one at a time, the
criterion function is computed for all subsets with
n-1 features, and the worst feature is discarded.
• Next, each feature among the remaining n-1 is
deleted one at a time, and the worst feature is
discarded to form a subset with n-2 features.
• This procedure continues until a predefined
number of features are left.
SBS performs
best when the
optimal subset is
large.
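
A minimal SBS sketch, again assuming a criterion function J(subset) to maximize:

# Sequential backward selection: greedily shrink the subset from the full set.
def sbs(n_features, d, J):
    selected = list(range(n_features))   # start from the full feature set
    while len(selected) > d:
        # "Worst" feature = the one whose removal leaves the highest criterion value.
        worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
        selected.remove(worst)
    return selected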
Example
(Figure omitted.) Results of sequential backward feature selection for classification of a
satellite image using 28 features. The x-axis shows the classification accuracy (%) and the
y-axis shows the features removed at each iteration (the first iteration is at the top).
The highest accuracy value is shown with a star.
Bidirectional Search (BDS)
• BDS applies SFS and SBS
simultaneously:
– SFS is performed from the
empty set.
– SBS is performed from the
full set.
• To guarantee that SFS and SBS
converge to the same
solution:
– Features already selected by
SFS are not removed by SBS.
– Features already removed by
SBS are not added by SFS.
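
A minimal BDS sketch, assuming a criterion J(subset); the strict alternation of one forward and one backward step is an illustrative choice, but it enforces the two constraints above so that both searches meet at the same subset:

# Bidirectional search: SFS grows `forward`, SBS shrinks `backward`, and each
# step is constrained by the other search's decisions.
def bds(n_features, J):
    forward = []                         # grown by SFS-style steps
    backward = list(range(n_features))   # shrunk by SBS-style steps
    while len(forward) < len(backward):
        # SFS step: only features not yet removed by SBS may be added.
        candidates = [f for f in backward if f not in forward]
        forward.append(max(candidates, key=lambda f: J(forward + [f])))
        if len(forward) == len(backward):
            break
        # SBS step: only features not yet selected by SFS may be removed.
        candidates = [f for f in backward if f not in forward]
        worst = max(candidates, key=lambda f: J([g for g in backward if g != f]))
        backward.remove(worst)
    return forward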
Limitations of SFS and SBS

• The main limitation of SFS is that it is unable to
remove features that become non-useful after the
addition of other features.
• The main limitation of SBS is its inability to
reevaluate the usefulness of a feature after it has
been discarded.
• We will examine some generalizations of SFS and
SBS:
– "Plus-L, minus-R" selection (LRS)
– Sequential floating forward/backward selection (SFFS and
SFBS)
“Plus-L, minus-R” selection (LRS)
• A generalization of SFS and SBS
– If L > R, LRS starts from the empty set and repeatedly:
• Adds L features
• Removes R features
– If L < R, LRS starts from the full set and repeatedly:
• Removes R features
• Adds L features

Its main limitation is the lack of a
theory to help choose the optimal
values of L and R.
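
A minimal LRS sketch for the L > R case (starting from the empty set), assuming a criterion J(subset); for simplicity it runs whole add-L / remove-R cycles, so the final subset may slightly overshoot d:

# "Plus-L, minus-R" selection: the subset grows by L - R features per cycle.
def lrs(n_features, d, L, R, J):
    assert L > R, "this sketch covers only the L > R (start-from-empty) variant"
    selected, remaining = [], set(range(n_features))
    while len(selected) < d:
        for _ in range(L):               # plus-L step: greedily add features
            if not remaining:
                break
            best = max(remaining, key=lambda f: J(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        for _ in range(R):               # minus-R step: greedily drop features
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            selected.remove(worst)
            remaining.add(worst)
    return selected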
Sequential floating forward/backward
selection (SFFS and SFBS)
• An extension to LRS:
– Rather than fixing the values of L and R, floating methods
determine these values from the data.
– The dimensionality of the subset during the search can be
thought of as "floating" up and down.

• Two floating methods:


– Sequential floating forward selection (SFFS)
– Sequential floating backward selection (SFBS)

P. Pudil, J. Novovicova, J. Kittler, "Floating search methods in feature
selection," Pattern Recognition Letters 15 (1994) 1119–1125.
Sequential floating forward selection
(SFFS)
• Sequential floating forward selection (SFFS) starts from
the empty set.
• After each forward step, SFFS performs backward steps
as long as the objective function increases.
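
A minimal SFFS sketch, assuming a criterion J(subset) to maximize; following the idea in Pudil et al. of comparing against the best subset of each size found so far, a backward step is taken only if the reduced subset beats that record, which keeps the search from cycling:

# Sequential floating forward selection: forward steps with conditional
# backward steps that are accepted only when they set a new best for that size.
def sffs(n_features, d, J):
    selected, remaining = [], set(range(n_features))
    best_at_size = {}                    # best criterion value seen for each subset size
    while len(selected) < d:
        # Forward step: add the best remaining feature.
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        best_at_size[len(selected)] = max(best_at_size.get(len(selected), float("-inf")),
                                          J(selected))
        # Conditional backward steps (the "floating" part).
        while len(selected) > 2:
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            reduced = [g for g in selected if g != worst]
            if J(reduced) > best_at_size.get(len(reduced), float("-inf")):
                selected = reduced
                remaining.add(worst)
                best_at_size[len(reduced)] = J(reduced)
            else:
                break
    return selected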
Sequential floating backward selection
(SFBS)

• Sequential floating backward selection (SFBS) starts
from the full set.

• After each backward step, SFBS performs forward steps
as long as the objective function increases.
Feature Selection using GAs
(randomized search)

• GAs provide a simple, general, and powerful framework
for feature selection.

(Diagram omitted.) Data → Feature Extraction → Feature Subset → Classifier, with a
Feature Selection (GA) module choosing the feature subset.
Feature Selection Using GAs
(cont’d)
• Binary encoding: 1 means "choose feature" and 0 means "do not choose
feature".

(Diagram omitted.) A binary chromosome of length N, one bit per feature (positions 1 … N).

• Fitness evaluation (to be maximized):

Fitness = w1 × accuracy + w2 × #zeros

where accuracy is the classification accuracy measured on a validation set,
#zeros is the number of features not selected, and w1 >> w2.
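
A sketch of this fitness function under illustrative assumptions (scikit-learn, a k-NN classifier, a synthetic dataset, and w1 = 1.0, w2 = 0.01 as arbitrary weights with w1 >> w2):

# GA fitness for feature selection: accuracy on a validation set dominates,
# and the number of zero bits gently rewards smaller subsets.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(chromosome, w1=1.0, w2=0.01):
    cols = np.flatnonzero(chromosome)            # indices of the chosen features
    if cols.size == 0:
        return 0.0                               # an empty subset gets no reward
    clf = KNeighborsClassifier().fit(X_tr[:, cols], y_tr)
    accuracy = clf.score(X_val[:, cols], y_val)
    return w1 * accuracy + w2 * np.sum(chromosome == 0)

# Example: evaluate one random chromosome (a GA would evaluate a whole population).
print(fitness(np.random.default_rng(0).integers(0, 2, X.shape[1])))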
Feature Selection Summary
• Feature selection has the two-fold advantage of providing some interpretation
of the data and making the learning problem easier.

• Finding the global optimum is impractical in most situations, so we rely on
heuristics instead (greedy/random search).

• Filtering is fast and general but can pick a large number of features.

• Wrapping accounts for model bias but is MUCH slower due to
training multiple models.
