
WRAPPER METHOD – FORWARD AND BACKWARD SELECTION

WHY DIMENSIONALITY REDUCTION?

◼ It is easy and convenient to collect data
◼ Data is not collected only for data mining
◼ Data accumulates at an unprecedented speed
◼ Data pre-processing is an important part of effective machine learning and data mining
◼ Dimensionality reduction is an effective approach to downsizing data

WHY DIMENSIONALITY REDUCTION?

◼ Most machine learning and data mining techniques may not be effective for high-dimensional data
◼ Curse of dimensionality
◼ The intrinsic dimension may be small

WHY DIMENSIONALITY REDUCTION?

◼ Visualization: projection of high-dimensional data onto 2D or 3D
◼ Data compression: efficient storage and retrieval
◼ Noise removal: positive effect on query accuracy

MOTIVATION

◼ Especially when dealing with a large number of variables, there is a need for dimensionality reduction!
◼ Dimensionality reduction can significantly improve a learning algorithm's performance!

MAJOR TECHNIQUES OF DIMENSIONALITY REDUCTION

◼ Feature Selection
◼ Feature Extraction (Reduction)

FEATURE EXTRACTION VS SELECTION

◼ Feature extraction
◼ All original features are used, and they are transformed
◼ The transformed features are linear/nonlinear combinations of the original features
◼ Feature selection
◼ Only a subset of the original features is selected

FEATURE SELECTION

◼ Feature selection: the problem of selecting some subset of a set of input features upon which to focus attention, while ignoring the rest
◼ Humans/animals do this constantly!

FEATURE SELECTION (DEF.)

◼ Given a set of N features, the role of feature selection is to select a subset of size M (M < N) that leads to the smallest classification/clustering error.

WHY FEATURE SELECTION? WHY NOT FEATURE EXTRACTION?

❖ You may want to extract meaningful rules from your classifier
◼ When you transform or project, the measurement units (length, weight, etc.) of your features are lost
❖ Features may not be numeric
◼ A typical situation in the machine learning domain

MOTIVATIONAL EXAMPLE FROM BIOLOGY

Monkeys performing a classification task with four candidate features: eye separation, eye height, mouth height, nose length

Diagnostic features:
- Eye separation
- Eye height

Non-diagnostic features:
- Mouth height
- Nose length

FEATURE SELECTION METHODS

❖ Feature selection is an optimization problem.
◼ Search the space of possible feature subsets.
◼ Pick the subset that is optimal or near-optimal with respect to an objective function.

WRAPPER, FILTER AND EMBEDDED METHODS

◼ The value of a feature is related to a model-construction method. Three classes of methods:
1. Wrapper methods are built "around" a specific predictive model and score candidate subsets by its error rate
2. Filter methods use a proxy measure instead of the error rate to score a feature subset
3. Embedded methods perform feature selection as an integral part of the model construction process

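As a quick orientation, the three classes map onto standard scikit-learn utilities. The snippet below is a minimal sketch, assuming scikit-learn >= 0.24 and a synthetic dataset; the particular choices (ANOVA F-score filter, logistic-regression wrapper, L1-regularized embedded selector) are illustrative, not prescribed by the slides.

```python
# Minimal sketch: the three classes of methods with scikit-learn (assumed >= 0.24).
# Dataset and all parameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       SequentialFeatureSelector, f_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

# 1. Filter: score features with a proxy measure (ANOVA F-score), no model in the loop.
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# 2. Wrapper: search subsets "around" a specific predictive model, scoring each
#    candidate subset by that model's cross-validated accuracy.
wrapper = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                    n_features_to_select=4, direction="forward")
X_wrapper = wrapper.fit_transform(X, y)

# 3. Embedded: selection happens inside model training (L1 drives weights to zero).
embedded = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
X_embedded = embedded.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```

The remainder of these slides focuses on the wrapper family and on the search strategies that drive it.
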
TOP-DOWN AND BOTTOM-UP METHODS

◼ In a bottom-up method one gradually adds the ranked features in the order of
their individual discrimination power and stops when the error rate stops
decreasing
◼ In a top-down truncation method one starts with the complete set of
features and progressively eliminates features while searching for the optimal
performance point
FEATURE SELECTION METHODS

❖ Feature selection is an optimization problem.
◼ Search the space of possible feature subsets.
◼ Pick the subset that is optimal or near-optimal with respect to a certain criterion.

Search strategies:
◼ Optimum
◼ Heuristic
◼ Randomized

Evaluation strategies:
◼ Filter methods
◼ Wrapper methods

EVALUATION STRATEGIES

❖ Filter Methods
◼ Evaluation is independent of the classification algorithm.
◼ The objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence, or information-theoretic measures.

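As an illustration of such a classifier-independent criterion, the sketch below scores each feature by a Fisher-style ratio of between-class to within-class variance (an interclass-distance measure); the function name and normalization are illustrative choices, not part of the slides.

```python
import numpy as np

def fisher_score(X, y):
    """Filter-style criterion: per-feature ratio of between-class to within-class
    variance (higher = better class separation). No classifier is involved."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += len(Xc) * Xc.var(axis=0)
    return between / (within + 1e-12)   # small constant avoids division by zero

# scores = fisher_score(X, y); higher entries mark more informative features
```
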
EVALUATION STRATEGIES

❖ Wrapper Methods
◼ Evaluation uses criteria related to the classification algorithm.
◼ The objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data), estimated by statistical resampling or cross-validation.

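A wrapper objective can be sketched as a function J(subset) that returns the cross-validated accuracy of a concrete classifier restricted to the candidate features; the k-NN classifier and the 5-fold cross-validation below are arbitrary illustrative choices.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset, estimator=None, cv=5):
    """Wrapper-style criterion: cross-validated accuracy of a concrete classifier
    trained only on the candidate feature subset."""
    if len(subset) == 0:                  # an empty subset carries no information
        return 0.0
    estimator = estimator or KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()
```
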
FILTER VS WRAPPER APPROACHES

Wrapper Approach
❖ Advantages
◼ Accuracy: wrappers generally achieve better recognition rates than filters, since they are tuned to the specific interactions between the classifier and the features.
◼ Ability to generalize: wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy.
❖ Disadvantages
◼ Slow execution

FILTER VS WRAPPER APPROACHES (CONT’D)

Filter Approach
❖ Advantages
◼ Fast execution: filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session
◼ Generality: since filters evaluate the intrinsic properties of the data rather than their interactions with a particular classifier, their results exhibit more generality; the solution will be "good" for a large family of classifiers
❖ Disadvantages
◼ Tendency to select large subsets: filter objective functions are generally monotonic

SEARCH STRATEGIES

Four features – x1, x2, x3, x4 (1 = xi is selected; 0 = xi is not selected)

                        1,1,1,1

        0,1,1,1     1,0,1,1     1,1,0,1     1,1,1,0

  0,0,1,1   0,1,0,1   0,1,1,0   1,0,0,1   1,0,1,0   1,1,0,0

        0,0,0,1     0,0,1,0     0,1,0,0     1,0,0,0

                        0,0,0,0
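
Enumerating this lattice exhaustively is the optimum search strategy: all 2^N - 1 non-empty subsets are evaluated, which is feasible only for small N. A minimal sketch, reusing a wrapper-style criterion J(·) as in the earlier evaluation example:

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset):                      # wrapper criterion, as sketched earlier
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def exhaustive_search(X, y):
    """Optimum search: evaluate every non-empty node of the subset lattice,
    i.e. 2**N - 1 candidates for N features."""
    n = X.shape[1]
    best_subset, best_score = None, float("-inf")
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            score = J(X, y, subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

X, y = load_iris(return_X_y=True)         # 4 features -> 15 candidate subsets
print(exhaustive_search(X, y))
```

With N = 30 this already means over a billion candidate subsets, which is why the heuristic strategies below matter.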


NAÏVE SEARCH

❖ Sort the given N features in order of their probability of correct recognition.
❖ Select the top M features from this sorted list.
❖ Disadvantages
◼ Feature correlation is not considered.
◼ The best pair of features may not even contain the best individual feature.

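A minimal sketch of naive search, assuming a precomputed vector of per-feature criterion values (e.g., individual recognition rates, or the Fisher-style scores sketched earlier):

```python
import numpy as np

def naive_search(scores, M):
    """Naive search: rank features by their individual criterion value and keep
    the top M. Feature correlation is ignored, hence the caveats above."""
    ranked = np.argsort(scores)[::-1]     # indices of best-scoring features first
    return ranked[:M]

# e.g. with the fisher_score() sketch shown earlier:
#   selected = naive_search(fisher_score(X, y), M=2)
```
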
SEQUENTIAL FORWARD SELECTION (SFS) (HEURISTIC SEARCH)

❖ First, the best single feature is selected (i.e., using some criterion function).
❖ Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
❖ Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
❖ This procedure continues until a predefined number of features is selected.

SFS performs best when the optimal subset is small.

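A compact SFS sketch, using a cross-validated wrapper criterion J(·) as before (classifier and fold count are illustrative):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset):                      # wrapper criterion, as sketched earlier
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def sfs(X, y, n_select):
    """Sequential Forward Selection: greedily add the feature that maximizes
    J until n_select features have been chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        best = max(remaining, key=lambda f: J(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```
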
SEQUENTIAL FORWARD SELECTION (SFS) (HEURISTIC SEARCH)

Example run, from single features up to the full set:

  {x1} {x2} {x3} {x4}                    J(x2) >= J(xi), i = 1, 3, 4   →  select x2
  {x2, x1} {x2, x3} {x2, x4}             J(x2, x3) >= J(x2, xi), i = 1, 4   →  add x3
  {x2, x3, x1} {x2, x3, x4}              J(x2, x3, x1) >= J(x2, x3, x4)   →  add x1
  {x1, x2, x3, x4}

ILLUSTRATION (SFS)

Four features – x1, x2, x3, x4 (1 = xi is selected; 0 = xi is not selected)

On the subset lattice shown earlier, SFS climbs one level per step, each time keeping the best subset found at that level:

  0,0,0,0  →  0,0,1,0 {x3}  →  0,1,1,0 {x2, x3}  →  1,1,1,0 {x1, x2, x3}

SEQUENTIAL BACKWARD SELECTION (SBS) (HEURISTIC SEARCH)

❖ First, the criterion function is computed for all n features.
❖ Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded.
❖ Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features.
❖ This procedure continues until a predefined number of features is left.

SBS performs best when the optimal subset is large.

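A compact SBS sketch, mirroring the SFS one and reusing the same illustrative wrapper criterion:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset):                      # wrapper criterion, as sketched earlier
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def sbs(X, y, n_select):
    """Sequential Backward Selection: start from the full set and repeatedly
    discard the worst feature (the one whose removal hurts J the least)."""
    selected = list(range(X.shape[1]))
    while len(selected) > n_select:
        worst = max(selected,
                    key=lambda f: J(X, y, [g for g in selected if g != f]))
        selected.remove(worst)
    return selected
```
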
SEQUENTIAL BACKWARD SELECTION (SBS) (HEURISTIC SEARCH)

Example run, from the full set down to a single feature:

  {x1, x2, x3, x4}
  {x2, x3, x4} {x1, x3, x4} {x1, x2, x4} {x1, x2, x3}    J(x1, x2, x3) is maximum  →  x4 is the worst feature, discard it
  {x2, x3} {x1, x3} {x1, x2}                             J(x2, x3) is maximum  →  x1 is the worst feature, discard it
  {x2} {x3}                                              J(x2) is maximum  →  x3 is the worst feature, discard it

ILLUSTRATION (SBS)

Four features – x1, x2, x3, x4 (1 = xi is selected; 0 = xi is not selected)

On the same lattice, SBS descends one level per step, each time discarding the worst feature:

  1,1,1,1  →  1,1,1,0 {x1, x2, x3}  →  0,1,1,0 {x2, x3}  →  0,1,0,0 {x2}

BIDIRECTIONAL SEARCH (BDS) (HEURISTIC SEARCH)

◼ BDS applies SFS and SBS simultaneously (see the sketch below):
◼ SFS is performed from the empty set
◼ SBS is performed from the full set
◼ To guarantee that SFS and SBS converge to the same solution:
◼ Features already selected by SFS are not removed by SBS
◼ Features already removed by SBS are not selected by SFS

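One way to turn these rules into code is sketched below: the forward set grows by SFS steps, the backward set shrinks by SBS steps, each step respects the two constraints, and the search stops when the two sets coincide. The strict one-step alternation is an assumption of this sketch, not something the slide prescribes.

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset):                      # wrapper criterion, as sketched earlier
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def bds(X, y):
    """Bidirectional Search: alternate one SFS step (growing `forward` from the
    empty set) and one SBS step (shrinking `backward` from the full set) under
    the two consistency constraints, until the two sets coincide."""
    forward = []                              # grown by SFS
    backward = list(range(X.shape[1]))        # shrunk by SBS
    while set(forward) != set(backward):
        # SFS step: only features not yet removed by SBS may be added
        candidates = [f for f in backward if f not in forward]
        forward.append(max(candidates, key=lambda f: J(X, y, forward + [f])))
        if set(forward) == set(backward):
            break
        # SBS step: only features not yet selected by SFS may be removed
        removable = [f for f in backward if f not in forward]
        worst = max(removable,
                    key=lambda f: J(X, y, [g for g in backward if g != f]))
        backward.remove(worst)
    return forward
```
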
BIDIRECTIONAL SEARCH (BDS)

Example run, with SFS working upward from the empty set and SBS working downward from the full set {x1, x2, x3, x4}:

  SFS step 1:  {x1} {x2} {x3} {x4}                     J(x2) is maximum  →  x2 is selected
  SBS step 1:  {x2, x3, x4} {x2, x1, x4} {x2, x1, x3}  J(x2, x1, x4) is maximum  →  x3 is removed
  SFS step 2:  {x2, x1} {x2, x4}                       J(x2, x4) is maximum  →  x4 is selected
  SBS step 2:  x1 is the only remaining removable feature; SFS and SBS converge on {x2, x4}

ILLUSTRATION (BDS)

Four features – x1, x2, x3, x4 (1 = xi is selected; 0 = xi is not selected)

On the lattice, SFS and SBS approach each other from opposite ends and meet in the middle:

  SFS:  0,0,0,0  →  0,1,0,0 {x2}  →  0,1,0,1 {x2, x4}
  SBS:  1,1,1,1  →  1,1,0,1 {x1, x2, x4}  →  0,1,0,1 {x2, x4}

FEATURE SELECTION METHODS

Filter:
  All Features  →  Filter  →  Selected Features  →  Supervised Learning Algorithm  →  Classifier

Wrapper:
  All Features  →  Feature Subset Search  ⇄  Feature Evaluation Criterion  ⇄  Supervised Learning Algorithm
  (the criterion value is fed back to the search)
  →  Selected Features  →  Supervised Learning Algorithm  →  Classifier

Search method: sequential forward search
  A      B      C      D
  A,B    B,C    B,D
  A,B,C  B,C,D

Search method: sequential backward elimination
  ABC    ABD    ACD    BCD
  AB     AD     BD
  A      D

(Introduction to Machine Learning and Data Mining, Carla Brodley)

“PLUS-L, MINUS-R” SELECTION (LRS) (HEURISTIC SEARCH)

❖ A generalization of SFS and SBS
◼ If L > R, LRS starts from the empty set and:
◼ Repeatedly adds L features
◼ Repeatedly removes R features
◼ If L < R, LRS starts from the full set and:
◼ Repeatedly removes R features
◼ Repeatedly adds L features

❖ LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking capability (see the sketch below).

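A sketch of the L > R case (start from the empty set, repeatedly add L features with SFS steps, then drop R with SBS steps); the final trimming of any overshoot is an assumption of this sketch:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def J(X, y, subset):                      # wrapper criterion, as sketched earlier
    return cross_val_score(KNeighborsClassifier(n_neighbors=3),
                           X[:, list(subset)], y, cv=5).mean()

def lrs(X, y, L, R, n_select):
    """'Plus-L, Minus-R' selection for L > R: repeatedly add the best L features
    (SFS steps), then drop the worst R (SBS steps), until n_select remain."""
    assert L > R, "for L < R, start from the full set and swap the two phases"
    selected = []
    while len(selected) < n_select:
        for _ in range(L):                        # plus-L: forward (SFS) steps
            remaining = [f for f in range(X.shape[1]) if f not in selected]
            if not remaining:
                break
            selected.append(max(remaining, key=lambda f: J(X, y, selected + [f])))
        if len(selected) >= n_select or len(selected) == X.shape[1]:
            break                                 # target reached or no room left
        for _ in range(R):                        # minus-R: backward (SBS) steps
            worst = max(selected,
                        key=lambda f: J(X, y, [g for g in selected if g != f]))
            selected.remove(worst)
    while len(selected) > n_select:               # trim any overshoot with SBS steps
        worst = max(selected,
                    key=lambda f: J(X, y, [g for g in selected if g != f]))
        selected.remove(worst)
    return selected
```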