Overview: Feature selection reduces the number of features in a dataset in order to improve machine learning models. It is needed when a dataset has too many features, which can reduce accuracy and require large training sets. Feature selection methods evaluate features individually and in combination in order to choose an optimal subset of informative features. Correlation-based ranking is commonly used but is not comprehensive, because it does not consider how features interact. More rigorous methods evaluate all possible feature subsets, which becomes computationally infeasible with many features. Forward selection grows the feature set incrementally to overcome this problem.


Feature Selection

These slides are compiled from different resources on the web.


Why

• Too many features:
  – Require large training databases
  – Increase training time
[Figure: "The accuracy of all test Web URLs when changing the number of top words for category file". Y-axis: Accuracy (74%-90%); x-axis: number of top words for category file (top 10, top 20, ...).]

David Corne, and Nick Taylor, Heriot-Watt University - [email protected]


These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
From http://elpub.scix.net/data/works/att/02-28.content.pdf
Slide Credit: David Corne, and Nick Taylor
It is quite easy to find many more cases in papers where experiments show that accuracy goes down when more features are used.

Slide Credit: David Corne, and Nick Taylor


• Why does accuracy reduce with more features?
• How does it depend on the specific choice of features?
• What else changes if we use more features?
• So, how do we choose the right features?
Why accuracy reduces:
• Note: suppose the best feature set has 20 features. If you add another 5 features, the accuracy of the learned model will typically drop. But you still have the original 20 features! Why does this happen?
Noise / Spurious Correlations /
Explosion
• The additional features typically add noise. Machine
learning will pick up on spurious correlations, that might
be true in the training set, but not in the test set.
• For some ML methods, more features means more
parameters to learn (more NN weights, more decision tree
nodes, etc…) – the increased space of possibilities is more
difficult to search.
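To see the spurious-correlation effect concretely, here is a minimal sketch (assuming scikit-learn and NumPy are available, and using a synthetic dataset rather than any dataset from these slides): the same learner is trained twice, once on the informative features alone and once with a block of pure-noise features appended, and the held-out accuracy typically drops in the second case.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=0, random_state=0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])  # append 200 pure-noise columns

for name, data in [("20 original features", X), ("plus 200 noise features", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, test_size=0.5, random_state=1)
    acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    # the tree can split on noise columns that happen to correlate with y in the
    # training half; those splits do not generalise to the test half
    print(name, "test accuracy:", round(acc, 2))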
Feature selection therefore helps by:
– removing irrelevant data
– increasing the predictive accuracy of learned models
– reducing the cost of the data
– improving learning efficiency, e.g. reducing storage requirements and computational cost
– reducing the complexity of the resulting model description, improving understanding of the data and the model
What to do?

• Feature Selection:
  – a process that chooses an optimal subset of the original features according to a certain criterion
• Dimensionality Reduction:
  – transforms the data from a high-dimensional space to a low-dimensional space
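A small sketch of the difference between the two options (scikit-learn assumed; the breast-cancer dataset is just a convenient stand-in): feature selection keeps 10 of the original 30 columns, while dimensionality reduction (here PCA) replaces them with 10 new columns that are linear combinations of all 30.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)              # 569 examples, 30 features

X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)  # subset of the original columns
X_reduced = PCA(n_components=10).fit_transform(X)                         # 10 new, transformed columns

print(X.shape, X_selected.shape, X_reduced.shape)        # (569, 30) (569, 10) (569, 10)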
Feature Selection: What

You have some data, and you want to use it to


build a classifier, so that you can predict
something (e.g. likelihood of cancer)

Slide Credit: David Corne, and Nick Taylor


Feature Selection: What

You have some data, and you want to use it to


build a classifier, so that you can predict something
(e.g. likelihood of cancer)

The data has 10,000 fields (features)

Slide Credit: David Corne, and Nick Taylor


Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict something
(e.g. likelihood of cancer)

The data has 10,000 fields (features)

you need to cut it down to 1,000 fields before


you try machine learning. Which 1,000?

Slide Credit: David Corne, and Nick Taylor


Feature Selection: What
You have some data, and you want to use it to
build a classifier, so that you can predict something
(e.g. likelihood of cancer)

The data has 10,000 fields (features)

you need to cut it down to 1,000 fields before


you try machine learning. Which 1,000?
The process of choosing the 1,000 fields to use is called
Feature Selection
Slide Credit: David Corne, and Nick Taylor
Datasets with many features

Gene expression datasets (~10,000 features)


http://www.ncbi.nlm.nih.gov/sites/entrez?db=gds

Proteomics data (~20,000 features)


http://www.ebi.ac.uk/pride/

Slide Credit: David Corne, and Nick Taylor


Feature selection methods

Slide Credit: David Corne, and Nick Taylor


Correlation-based feature ranking
It is indeed used often, by practitioners (who
perhaps don’t understand the issues
involved in FS)

It is actually fine for certain datasets.


It is not even considered in Dash & Liu’s
survey.
A made-up dataset
f1 f2 f3 f4 … class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
Correlated with the class
f1 f2 f3 f4 … class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
uncorrelated with the class /
seemingly random
f1 f2 f3 f4 … class
0.4 0.6 0.4 0.6 1
0.2 0.4 1.6 -0.6 1
0.5 0.7 1.8 -0.8 1
0.7 0.8 0.2 0.9 2
0.9 0.8 1.8 -0.7 2
0.5 0.5 0.6 0.5 2
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
Correlation-based FS reduces the dataset to this:
f1 f2 … class
0.4 0.6 1
0.2 0.4 1
0.5 0.7 1
0.7 0.8 2
0.9 0.8 2
0.5 0.5 2
But column 5 below (f3 + f4) is perfectly correlated with the class!
f1 f2 f3 f4 f3+f4 class
0.4 0.6 0.4 0.6 1 1
0.2 0.4 1.6 -0.6 1 1
0.5 0.7 1.8 -0.8 1 1
0.7 0.8 0.2 0.9 1.1 2
0.9 0.8 1.8 -0.7 1.1 2
0.5 0.5 0.6 0.5 1.1 2
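The point is easy to reproduce on the toy dataset above (a minimal sketch, assuming NumPy is available): ranked by correlation with the class, f1 and f2 come out on top and f3, f4 look weak on their own, yet the sum f3 + f4 is perfectly correlated.

import numpy as np

# the made-up dataset from the slides: columns f1..f4, last column is the class
data = np.array([
    [0.4, 0.6, 0.4,  0.6, 1],
    [0.2, 0.4, 1.6, -0.6, 1],
    [0.5, 0.7, 1.8, -0.8, 1],
    [0.7, 0.8, 0.2,  0.9, 2],
    [0.9, 0.8, 1.8, -0.7, 2],
    [0.5, 0.5, 0.6,  0.5, 2],
])
X, y = data[:, :4], data[:, 4]

for i in range(4):                      # correlation of each single feature with the class
    print(f"corr(f{i + 1}, class) = {np.corrcoef(X[:, i], y)[0, 1]:.2f}")
print(f"corr(f3 + f4, class) = {np.corrcoef(X[:, 2] + X[:, 3], y)[0, 1]:.2f}")  # prints 1.00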
Good FS Methods therefore:
• Need to consider how well features work
together
• As we have noted before, if you take 100 features that are each well correlated with the class, they may simply be strongly correlated with each other, and so provide no more information than just one of them
`Complete’ methods
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try every
subset of k features, and choose the
best!
The number of subsets is N! / k!(N−k)!
What is this when N is 100 and k is 5?
David Corne, and Nick Taylor, Heriot-Watt University - [email protected]
These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
`Complete’ methods
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try
every subset of k features, and choose
the best!
The number of subsets is N! / k!(N−k)!
What is this when N is 100 and k is 5?
75,287,520 -- almost nothing
`Complete’ methods
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try every
subset of k features, and choose the best!
The number of subsets is N! / k!(N−k)!
What is this when N is 10,000 and k is 100?

David Corne, and Nick Taylor, Heriot-Watt University - [email protected]


These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
`Complete’ methods
Original dataset has N features
You want to use a subset of k features
A complete FS method means: try every
subset of k features, and choose the best!
The number of subsets is N! / k!(N−k)!
What is this when N is 10,000 and k is 100?

Actually it is about 6.5 × 10^241 -- roughly 10^242

(there are only around 10^80 atoms in the universe)
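Both counts above can be checked directly in Python (a small sketch; math.comb requires Python 3.8 or newer):

import math

print(math.comb(100, 5))                    # 75287520 -- still searchable
n_subsets = math.comb(10_000, 100)          # every way to pick 100 features out of 10,000
print(f"about {n_subsets:.1e} -- a {len(str(n_subsets))}-digit number")   # about 6.5e+241, 242 digits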
Feature Search Space
`forward’ methods

These methods `grow’ a set S of features:

1. S starts empty.
2. Find the best feature to add (by checking which one gives the best performance on a test set when combined with S).
3. If overall performance has improved, return to step 2; else stop.

Slide Credit: David Corne, and Nick Taylor


Forward selection illustrated
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

Selected feature set {}


Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

Selected feature set {}


Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

65%
Selected feature set {}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

58%
Selected feature set {}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

54%
Selected feature set {}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

72%
Selected feature set {}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

64%
Selected feature set {}
Etc

F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …
65% 58% 54% 72% 64% 61% 62% 25% 49% ….

Selected feature set {}


Add the winning feature to the selected
feature set
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

65% 58% 54% 72% 64% 61% 62% 25% 49% ….


Selected feature set {F4}
We have completed one ‘round’ of forward
selection
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

65% 58% 54% 72% 64% 61% 62% 25% 49% ….


Selected feature set {F4}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

Selected feature set {F4}


Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

61%
Selected feature set {F4}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

59%
Selected feature set {F4}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

58%
Selected feature set {F4}
Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

Selected feature set {F4}


Test each feature in turn to find out which
works best with current feature set …
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …

66%
Selected feature set {F4}
Etc

F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …
61% 59% 58% (F4: already selected) 66% 68% 75% 47% 49% ….

Selected feature set {F4}


Add the winning feature to the selected
feature set
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …
61% 59% 58% (F4: already selected) 66% 68% 75% 47% 49% ….

Selected feature set {F4, F7}


We have completed the second ‘round’ of
forward selection
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …
61% 59% 58% (F4: already selected) 66% 68% 75% 47% 49% ….

Selected feature set {F4, F7}


Continue…
adding one feature after each round,
until overall accuracy starts to reduce
F1 F2 F3 F4 F5 F6 F7 F8 F9 Etc

2 5 65 67 2 2 12 2 234 …
1 2 4 5 13 1 1 43 12 …
4 3 43 2 4 6 2 2 1 …
5 4 2 3 5 5 13 1 2 …
3 5 1 4 7 3 4 6 13 …
2 2 6 5 7 1 5 4 4 …
1 3 4 4 55 4 7 55 43 …
61% 59% 58% (F4: already selected) 66% 68% 75% 47% 49% ….

Selected feature set {F4, F7}
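The rounds illustrated above fit in a few lines of code. This is a minimal sketch: evaluate(subset) is a stand-in (an assumption, not something defined in these slides) for "train on the training set using only those features and return accuracy on the test set".

def forward_selection(all_features, evaluate):
    """Greedy forward selection: grow the set while accuracy keeps improving."""
    selected, best_score = [], 0.0
    while True:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        # one 'round': score every remaining feature combined with the current set
        scores = {f: evaluate(selected + [f]) for f in candidates}
        winner = max(scores, key=scores.get)
        if scores[winner] <= best_score:      # adding the winner no longer helps
            break
        selected.append(winner)               # e.g. F4 after round 1, F7 after round 2
        best_score = scores[winner]
    return selected, best_score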


`backward’ methods

These methods remove features one by one:

1. S starts with the full feature set.
2. Find the best feature to remove (by checking which removal from S gives the best performance on a test set).
3. If overall performance has improved, return to step 2; else stop.

Slide Credit: David Corne, and Nick Taylor
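Both greedy directions are also available off the shelf; for example, scikit-learn's SequentialFeatureSelector (version 0.24 or newer) implements them, although it stops at a requested number of features rather than when accuracy starts to drop. A sketch, with the dataset and learner chosen arbitrarily:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=10,
                                direction="backward",      # or "forward"
                                cv=5)                      # subsets scored by cross-validation
sfs.fit(X, y)
print(sfs.get_support(indices=True))                       # indices of the 10 surviving features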


– Bidirectional Generation (BG): begins the search in both directions, performing SFG (sequential forward generation) and SBG (sequential backward generation) concurrently. The searches stop in two cases: (1) when one search finds the best subset of m features before it reaches the exact middle of the search space, or (2) when both searches reach the middle of the search space. It takes advantage of both SFG and SBG.

– Random Generation (RG): starts the search in a random direction. The choice of adding or removing a feature is a random decision. RG tries to avoid getting stuck in a local optimum by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the resulting feature subset cannot be specified in advance.
Selection Criteria
– Information Measures.
  • Information measures the uncertainty of the receiver when he or she receives a message.
  • Shannon's entropy: H(C) = − Σ_c p(c) log2 p(c)
  • Information gain of a feature F: IG(C, F) = H(C) − H(C | F)
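A small sketch of the two quantities (standard definitions, computed from scratch so nothing beyond the Python standard library is assumed):

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(C, F) = H(C) - H(C | F) for a discrete feature."""
    n = len(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [c for f, c in zip(feature_values, labels) if f == v]
        h_cond += len(subset) / n * entropy(subset)     # weighted H(C | F = v)
    return entropy(labels) - h_cond

print(entropy([1, 1, 1, 2, 2, 2]))                                            # 1.0 bit
print(information_gain(["a", "a", "a", "b", "b", "b"], [1, 1, 1, 2, 2, 2]))   # 1.0 (F determines C)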
Selection Criteria
– Distance Measures.
  • Measures of separability, discrimination or divergence. The most typical ones are derived from the distance between the class-conditional density functions.
Selection Criteria
– Dependence Measures.
  • Also known as measures of association or correlation.
  • Their main goal is to quantify how strongly two variables are correlated or associated with each other, such that knowing the value of one of them we can predict the value of the other.
  • Pearson correlation coefficient: r = Σ_i (x_i − x̄)(y_i − ȳ) / sqrt( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )
Selection Criteria
– Consistency Measures.
  • They attempt to find a minimum number of features that separate the classes as well as the full set of features can.
  • They aim to achieve P(C | FullSet) = P(C | SubSet).
  • An inconsistency is defined as two examples with the same inputs (the same feature values) but different output values (different classes, in classification).
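A minimal sketch of this criterion (pandas assumed; the tiny DataFrame is made up for illustration): for a candidate feature subset, group examples with identical feature values and count how many fall outside their group's majority class.

import pandas as pd

def inconsistency_count(df, feature_subset, class_col="class"):
    groups = df.groupby(feature_subset)[class_col]
    majority = groups.agg(lambda s: s.value_counts().max())    # size of the majority class per group
    return int((groups.size() - majority).sum())               # rows that disagree with their group

df = pd.DataFrame({"f1": [0, 0, 1, 1], "f2": [1, 1, 0, 0], "class": [1, 2, 2, 2]})
print(inconsistency_count(df, ["f1", "f2"]))    # 1: the first two rows share inputs but differ in class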
Selection Criteria
– Accuracy Measures: this form of evaluation relies on the classifier or learner itself. Among the various possible subsets of features, the subset that yields the best predictive accuracy is chosen.
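A sketch of this wrapper-style criterion (scikit-learn assumed; the two column subsets are arbitrary examples): candidate subsets are compared by the cross-validated accuracy of the learner that will actually be used.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression())

for name, cols in [("subset A (columns 0-3)", [0, 1, 2, 3]),
                   ("subset B (columns 20-23)", [20, 21, 22, 23])]:
    acc = cross_val_score(clf, X[:, cols], y, cv=5).mean()   # mean accuracy over 5 folds
    print(name, "->", round(acc, 3))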
