
CS-464

Chapter 5: Feature Selection


(Slides based on material by Mehmet Koyutürk, Öznur
Taştan and Mark Craven)
Feature Selection
• The objective in classification/regression is to learn
a function that relates values of features to values
of outcome variable(s)
– Often, we are presented with many features
– Not all of these features are relevant

• Feature Selection is the task of identifying an
“optimal” (informally speaking) set of features
that are useful for accurately predicting the
outcome variable
Motivation for Feature Selection
• Accuracy
– Getting rid of irrelevant features can help learn better
predictive models by reducing confusion
• Generalizability
– Models with fewer features have lower complexity, so they
are less prone to overfitting
• Interpretability
– Identifying a small set of features can help us understand
the mechanism of the relationship between the features and
the outcome variable(s)
• Efficiency
– With a smaller number of features, learning and prediction
may take less time/space
Three Main Approaches
1. Treat feature selection as a separate task
• Filtering-based feature selection
• Wrapper-based feature selection

2. Embed feature selection into the task of learning a model
• Regularization

3. Do not select features; instead, construct new features that
effectively represent combinations of original features
• Dimensionality reduction
Feature Selection as a Separate Task
Filtering

Score Features → Rank Features Based on Score → Select Top k Features → Train Model

Scoring:
• Scores do not represent prediction performance, since no
validation is done at this stage
• Do NOT use validation/test samples to compute scores

Choosing k:
• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g.,
use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k (using
prediction performance as the criterion)
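A minimal sketch of this filtering pipeline in scikit-learn (the dataset, the mutual-information score, and k = 20 are illustrative assumptions, not choices made in the slides):

# Filter-based feature selection: score on the training split only,
# then apply the fitted selector to the test split (no validation/test leakage).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score features by mutual information and keep the top k = 20 (k chosen heuristically)
selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
print("Test accuracy with top-20 features:", model.score(X_test_sel, y_test))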
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty about the value of the outcome variable
upon observation of the value of the feature
– Already discussed

• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)

• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
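As a small illustration of the variance/frequency heuristics (the thresholds used below are arbitrary assumptions, not values from the slides):

# Variance/frequency-based filtering (illustrative thresholds).
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_extraction.text import CountVectorizer

# Continuous features: drop columns whose variance is (near) zero
X = np.random.RandomState(0).normal(size=(200, 50))
X[:, 0] = 1.0                                   # a constant feature carries no information
X_reduced = VarianceThreshold(threshold=1e-3).fit_transform(X)
print(X.shape, "->", X_reduced.shape)           # (200, 50) -> (200, 49)

# Discrete (text) features: drop terms that are too rare or too frequent
docs = ["feature selection in text", "text classification with rare terms",
        "rare terms can mislead the classifier", "feature selection helps"]
vectorizer = CountVectorizer(min_df=2, max_df=0.75)   # keep terms in >= 2 docs and <= 75% of docs
X_docs = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))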
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a
high-dimensional feature vector:
• Each dimension corresponds to a term
• Many dimensions correspond to rare words
• Rare words can mislead the classifier

• Rare misleading features are called noise features

• Eliminating noise features from the representation increases
the efficiency and effectiveness of text classification
Noisy Features
• A noise feature is one that increases the classification error on
new data.

• Suppose you are doing topic classification. One class is China.

• A rare term, say arachnocentric, has no information about
documents about China, but all instances of arachnocentric in the
training data happen to occur in documents related to China.

• The learner might produce a classifier that misassigns test
documents containing arachnocentric to China.

• Such an incorrect generalization from an accidental property of
the training set is an example of overfitting.
Feature Selection
• There are 2^N possible feature subsets of N features

• Even if you fix the feature subset size to M, there are still
C(N, M) = N! / (M! (N − M)!) candidate subsets

• This number of combinations is infeasible to enumerate, even
for moderate M

• A search strategy is therefore needed to direct the
feature selection process as it explores the space of
all possible combinations of features
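A quick sanity check of how fast these counts grow (N = 1000 and M = 10 are arbitrary illustrative values):

# Size of the feature-subset search space (N and M are made-up examples).
import math

N, M = 1000, 10
print(2 ** N)            # all possible subsets: roughly 10^301
print(math.comb(N, M))   # subsets of fixed size M = 10: still roughly 10^23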

Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty in the class
upon observation of the value of the feature
• Chi-square test – a statistical test that compares the
frequencies of a term between different classes

• Rank features, try models that include the top k
features as you increase k

• These methods are based on the rationale:
– good feature subsets contain features highly correlated
with (predictive of) the class
Information
• Information: reduction in uncertainty (amount of surprise in
the outcome)

I(X = x) = log2 (1 / p(x)) = −log2 p(x)

• If the probability of an event is small and the event
nevertheless happens, the information is large:

Observing that the outcome of a coin flip is heads:
I = −log2 (1/2) = 1 bit

Observing that the outcome of a die roll is 6:
I = −log2 (1/6) ≈ 2.58 bits

Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value:

H(X) = − Σ_x p(x) log2 p(x)

(the summation is over all possible values x of the random
variable)

• Entropy is a measure of uncertainty

[Figure: the entropy of a binary random variable
as a function of the probability of a success]
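A short sketch of the binary entropy function shown in the figure (plain Python; the probability grid is arbitrary):

# H(p) = -p*log2(p) - (1-p)*log2(1-p): maximal uncertainty (1 bit) at p = 0.5.
import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):                  # x*log2(x) -> 0 as x -> 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")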
Mutual Information
• Mutual information I(X,Y) is the reduction of uncertainty in
one variable upon observation of the other variable
• Mutual information is a measure of statistical dependency between
two random variables
Mutual Information
• The mutual information between a feature and the class
label measures the amount by which the uncertainty in the
class is decreased by knowledge of the feature. Compute the
mutual information (MI) of term t and class c.
• Below, U is a random variable that takes values et = 1 (the
document contains term t) and et = 0 (the document does not
contain t)
• C is a random variable that takes values ec = 1 (the document
is in class c) and ec = 0 (the document is not in class c)

• Definition:

I(U; C) = Σ_{et ∈ {0,1}} Σ_{ec ∈ {0,1}} P(U = et, C = ec) log2 [ P(U = et, C = ec) / (P(U = et) P(C = ec)) ]

http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the
term’s distribution is the same in the class as it is in the
collection as a whole), then MI is 0

• MI is maximum if the term is a perfect indicator for class
membership (i.e., the term is present in a document if and
only if the document is in the class)

How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:

I(U; C) = (N11/N) log2 (N N11 / (N1. N.1)) + (N01/N) log2 (N N01 / (N0. N.1))
        + (N10/N) log2 (N N10 / (N1. N.0)) + (N00/N) log2 (N N00 / (N0. N.0))

• N11: number of documents that contain t (et = 1) and are in c (ec = 1)
• N10: number of documents that contain t (et = 1) and are not in c (ec = 0)
• N01: number of documents that do not contain t (et = 0) and are in c (ec = 1)
• N00: number of documents that do not contain t (et = 0) and are not in c (ec = 0)
• Marginals: N1. = N10 + N11, N0. = N00 + N01, N.1 = N01 + N11, N.0 = N00 + N10
• N = N00 + N01 + N10 + N11

Mutual Information Example
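The worked example from the original slide is not recoverable from this text; as a stand-in, here is a small sketch that plugs hypothetical document counts (made up for illustration) into the formula above:

# Mutual information of a term and a class from document counts
# (the counts passed at the bottom are invented, not the slide's example).
import math

def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00      # docs containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00      # docs in / not in the class
    total = 0.0
    for n_tc, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                           (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_tc > 0:                     # treat 0 * log(...) as 0
            total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return total

print(mutual_information(n11=80, n10=20, n01=100, n00=800))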

Why Feature Selection Helps

t-statistic
• We have n1 and n2 samples from the two classes, respectively

• For each feature, let x̄1, s1 be the sample mean and variance of
the first class, and x̄2, s2 be those of the second

• One common form of the statistic is the standardized difference
of the class means: t = (x̄1 − x̄2) / sqrt(s1/n1 + s2/n2)

• The distribution of t approaches the normal distribution as the
number of samples grows
• We can set a threshold on the t-statistic for a feature to be
selected, based on the t-distribution
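A minimal sketch of t-statistic-based filtering with SciPy (the dataset and the p < 0.01 threshold are arbitrary assumptions):

# Per-feature two-sample (Welch) t-test used as a filter score.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

# Compare each feature's values in class 0 vs. class 1
t_stats, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)

# Keep features whose class means differ significantly (threshold chosen arbitrarily)
selected = np.where(p_values < 0.01)[0]
print("Selected features:", selected)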
Wrapper Methods
• Frame the feature selection task as a search
problem

• Evaluate each feature set by using the prediction
performance of the learning algorithm on that
feature set
– Cross-validation

• How to search the exponential space of feature
sets?
Searching for Feature Sets

state = set of features

start state = empty (forward selection)
or full (backward elimination)

operators = add/subtract a feature

scoring function = cross-validation accuracy using the learning
method on a given state’s feature set
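A minimal sketch of greedy forward selection under this search formulation (the base learner, the dataset, and the stopping size of 5 features are arbitrary assumptions; scikit-learn's SequentialFeatureSelector implements the same idea):

# Greedy forward selection with cross-validation accuracy as the scoring function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)
model = LogisticRegression(max_iter=1000)

selected = []                            # start state: empty feature set
candidates = set(range(X.shape[1]))
for _ in range(5):                       # stop after selecting 5 features (arbitrary)
    # operator: try adding each remaining feature, keep the one with the best CV score
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in candidates}
    best = max(scores, key=scores.get)
    selected.append(best)
    candidates.remove(best)
    print(f"added feature {best}, CV accuracy = {scores[best]:.3f}")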
Forward Selection
Forward Selection
Backward Elimination
Forward Selection vs. Backward Elimination
Embedded Methods (Regularization)

• Instead of explicitly selecting features, bias the
learning process towards using a small number
of features

• Key idea: the objective function has two parts
• A term representing error minimization (model fit)
• A term that “shrinks” parameters toward 0
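In symbols, a generic form of this two-part objective (the notation below is illustrative, not taken from the slide):

% Regularized objective: data-fit term plus a penalty weighted by lambda >= 0
\[
\min_{\beta}\;
\underbrace{\sum_{i=1}^{n} L\bigl(y_i, f(x_i;\beta)\bigr)}_{\text{model fit}}
\;+\;
\lambda\,\underbrace{\Omega(\beta)}_{\text{shrinks }\beta\text{ toward }0},
\qquad
\Omega(\beta)=\lVert\beta\rVert_2^{2}\ \text{(ridge)}
\quad\text{or}\quad
\lVert\beta\rVert_1\ \text{(lasso)}
\]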
Ridge Regression
• Linear regression: minimize the residual sum of squares

RSS(β) = Σ_i (y_i − β0 − Σ_j β_j x_ij)²

• Penalty term (L2 norm of the coefficients) added:

minimize RSS(β) + λ Σ_j β_j²
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be set to 0

• Add the L1 norm of the coefficients as the penalty term:

minimize RSS(β) + λ Σ_j |β_j|

– Why does this result in more coefficients being set to 0,
effectively performing feature selection?
Ridge Regression vs. LASSO
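The comparison figure from the original slide is not in the extracted text; as a stand-in, a small scikit-learn sketch (dataset and alpha values are arbitrary) showing that lasso sets many coefficients exactly to 0 while ridge only shrinks them:

# Ridge vs. lasso fitted on the same data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge: nonzero coefficients =", np.sum(ridge.coef_ != 0))   # typically all 50
print("lasso: nonzero coefficients =", np.sum(lasso.coef_ != 0))   # typically close to 5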
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
