
CS-464

Chapter 5: Feature Selection


(Slides based on material by Mehmet Koyutürk, Öznur
Taştan and Mark Craven)
Feature Selection
• The objective in classification/regression is to learn
a function that relates values of features to values
of outcome variable(s)
– Often, we are presented with many features
– Not all of these features are relevant

• Feature Selection is the task of identifying an
“optimal” (informally speaking) set of features
that are useful for accurately predicting the
outcome variable
Motivation for Feature Selection
• Accuracy
– Getting rid of irrelevant features can help learn better
predictive models by reducing confusion
• Generalizability
– Models with fewer features have lower complexity, so they
are less prone to overfitting
• Interpretability
– Identifying a small set of features can help us understand
the mechanism of the relationship between the features and
the outcome variable(s)
• Efficiency
– With a smaller number of features, learning and prediction
may take less time/space
Three Main Approaches
1. Treat feature selection as a separate task
• Filtering-based feature selection
• Wrapper-based feature selection

2. Embed feature selection into the task of learning a model
• Regularization

3. Do not select features; instead, construct new features that
effectively represent combinations of original features
• Dimensionality reduction
Feature Selection as a Separate Task
Filtering

Score Features → Rank Features Based on Score → Select Top k Features → Train Model

Scoring:
• Scores do not represent prediction performance, since no
validation is done at this stage
• Do NOT use validation/test samples to compute scores

Choosing k:
• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g.,
use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k (using
prediction performance as the criterion)
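A minimal sketch of this filtering pipeline in scikit-learn (the dataset, the mutual-information score, and k = 20 are illustrative assumptions, not choices made in the slides):

# Filter-based feature selection: score on the training split only,
# then apply the fitted selector to the test split (no validation/test leakage).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Score features by mutual information and keep the top k = 20 (k chosen heuristically)
selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

model = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
print("Test accuracy with top-20 features:", model.score(X_test_sel, y_test))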
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty about the value of the outcome variable
upon observation of the value of the feature
– Already discussed

• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)

• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
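As a small illustration of the variance/frequency heuristics (the thresholds used below are arbitrary assumptions, not values from the slides):

# Variance/frequency-based filtering (illustrative thresholds).
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_extraction.text import CountVectorizer

# Continuous features: drop columns whose variance is (near) zero
X = np.random.RandomState(0).normal(size=(200, 50))
X[:, 0] = 1.0                                   # a constant feature carries no information
X_reduced = VarianceThreshold(threshold=1e-3).fit_transform(X)
print(X.shape, "->", X_reduced.shape)           # (200, 50) -> (200, 49)

# Discrete (text) features: drop terms that are too rare or too frequent
docs = ["feature selection in text", "text classification with rare terms",
        "rare terms can mislead the classifier", "feature selection helps"]
vectorizer = CountVectorizer(min_df=2, max_df=0.75)   # keep terms in >= 2 docs and <= 75% of docs
X_docs = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))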
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a
high-dimensional feature vector:
• Each dimension corresponds to a term
• Many dimensions correspond to rare words
• Rare words can mislead the classifier

• Rare misleading features are called noise features

• Eliminating noise features from the representation increases
the efficiency and effectiveness of text classification
Noisy Features
• A noise feature is one that increases the classification error on
new data.

• Suppose you are doing topic classification. One class is China.

• A rare term, say arachnocentric, has no information about
documents about China, but all instances of arachnocentric in the
training data happen to occur in documents related to China.

• The learner might produce a classifier that misassigns test
documents containing arachnocentric to China.

• Such an incorrect generalization from an accidental property of
the training set is an example of overfitting.
Feature Selection
• There are 2^N possible feature subsets of N features

• Even if you fix the feature subset size to M, there are still
C(N, M) = N! / (M! (N − M)!) candidate subsets

• This number of combinations is infeasible to enumerate, even
for moderate M

• A search strategy is therefore needed to direct the
feature selection process as it explores the space of
all possible combinations of features
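A quick sanity check of how fast these counts grow (N = 1000 and M = 10 are arbitrary illustrative values):

# Size of the feature-subset search space (N and M are made-up examples).
import math

N, M = 1000, 10
print(2 ** N)            # all possible subsets: roughly 10^301
print(math.comb(N, M))   # subsets of fixed size M = 10: still roughly 10^23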

Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty in the class
upon observation of the value of the feature
• Chi-square test – a statistical test that compares the
frequencies of a term between different classes

• Rank features, try models that include the top k
features as you increase k

• These methods are based on the rationale:
– good feature subsets contain features highly correlated
with (predictive of) the class
Information
• Information: reduction in uncertainty (amount of surprise in
the outcome)

I(X = x) = log2 (1 / p(x)) = −log2 p(x)

• If the probability of an event is small and the event
nevertheless happens, the information is large:

Observing that the outcome of a coin flip is heads:
I = −log2 (1/2) = 1 bit

Observing that the outcome of a die roll is 6:
I = −log2 (1/6) ≈ 2.58 bits

Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value:

H(X) = − Σ_x p(x) log2 p(x)

(the summation is over all possible values x of the random
variable)

• Entropy is a measure of uncertainty

[Figure: the entropy of a binary random variable
as a function of the probability of a success]
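A short sketch of the binary entropy function shown in the figure (plain Python; the probability grid is arbitrary):

# H(p) = -p*log2(p) - (1-p)*log2(1-p): maximal uncertainty (1 bit) at p = 0.5.
import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):                  # x*log2(x) -> 0 as x -> 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")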
Mutual Information
• Mutual information I(X,Y) is the reduction of uncertainty in
one variable upon observation of the other variable
• Mutual information is a measure of statistical dependency between
two random variables
Mutual Information
• The mutual information between a feature and the class
label measures the amount by which the uncertainty in the
class is decreased by knowledge of the feature. Compute the
mutual information (MI) of term t and class c.
• Below, U is a random variable that takes values et = 1 (the
document contains term t) and et = 0 (the document does not
contain t)
• C is a random variable that takes values ec = 1 (the document
is in class c) and ec = 0 (the document is not in class c)

• Definition:

I(U; C) = Σ_{et ∈ {0,1}} Σ_{ec ∈ {0,1}} P(U = et, C = ec) log2 [ P(U = et, C = ec) / (P(U = et) P(C = ec)) ]

http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the
term’s distribution is the same in the class as it is in the
collection as a whole), then MI is 0

• MI is maximum if the term is a perfect indicator for class
membership (i.e., the term is present in a document if and
only if the document is in the class)

How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:

I(U; C) = (N11/N) log2 (N N11 / (N1. N.1)) + (N01/N) log2 (N N01 / (N0. N.1))
        + (N10/N) log2 (N N10 / (N1. N.0)) + (N00/N) log2 (N N00 / (N0. N.0))

• N11: number of documents that contain t (et = 1) and are in c (ec = 1)
• N10: number of documents that contain t (et = 1) and are not in c (ec = 0)
• N01: number of documents that do not contain t (et = 0) and are in c (ec = 1)
• N00: number of documents that do not contain t (et = 0) and are not in c (ec = 0)
• Marginals: N1. = N10 + N11, N0. = N00 + N01, N.1 = N01 + N11, N.0 = N00 + N10
• N = N00 + N01 + N10 + N11

Mutual Information Example
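The worked example from the original slide is not recoverable from this text; as a stand-in, here is a small sketch that plugs hypothetical document counts (made up for illustration) into the formula above:

# Mutual information of a term and a class from document counts
# (the counts passed at the bottom are invented, not the slide's example).
import math

def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00      # docs containing / not containing the term
    n_1, n_0 = n11 + n01, n10 + n00      # docs in / not in the class
    total = 0.0
    for n_tc, n_t, n_c in [(n11, n1_, n_1), (n10, n1_, n_0),
                           (n01, n0_, n_1), (n00, n0_, n_0)]:
        if n_tc > 0:                     # treat 0 * log(...) as 0
            total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return total

print(mutual_information(n11=80, n10=20, n01=100, n00=800))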

Why Feature Selection Helps

t-statistic
• We have n1 and n2 samples from the two classes, respectively

• For each feature, let x̄1, s1 be the sample mean and variance of
the first class, and x̄2, s2 be those of the second

• One common form of the statistic is the standardized difference
of the class means: t = (x̄1 − x̄2) / sqrt(s1/n1 + s2/n2)

• The distribution of t approaches the normal distribution as the
number of samples grows
• We can set a threshold on the t-statistic for a feature to be
selected, based on the t-distribution
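A minimal sketch of t-statistic-based filtering with SciPy (the dataset and the p < 0.01 threshold are arbitrary assumptions):

# Per-feature two-sample (Welch) t-test used as a filter score.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=50, n_informative=5, random_state=0)

# Compare each feature's values in class 0 vs. class 1
t_stats, p_values = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)

# Keep features whose class means differ significantly (threshold chosen arbitrarily)
selected = np.where(p_values < 0.01)[0]
print("Selected features:", selected)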
Wrapper Methods
• Frame the feature selection task as a search
problem

• Evaluate each feature set by using the prediction
performance of the learning algorithm on that
feature set
– Cross-validation

• How to search the exponential space of feature
sets?
Searching for Feature Sets

state = set of features

start state = empty (forward selection)
or full (backward elimination)

operators = add/subtract a feature

scoring function = cross-validation accuracy using the learning
method on a given state’s feature set
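A minimal sketch of greedy forward selection under this search formulation (the base learner, the dataset, and the stopping size of 5 features are arbitrary assumptions; scikit-learn's SequentialFeatureSelector implements the same idea):

# Greedy forward selection with cross-validation accuracy as the scoring function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)
model = LogisticRegression(max_iter=1000)

selected = []                            # start state: empty feature set
candidates = set(range(X.shape[1]))
for _ in range(5):                       # stop after selecting 5 features (arbitrary)
    # operator: try adding each remaining feature, keep the one with the best CV score
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in candidates}
    best = max(scores, key=scores.get)
    selected.append(best)
    candidates.remove(best)
    print(f"added feature {best}, CV accuracy = {scores[best]:.3f}")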
Forward Selection
Forward Selection
Backward Elimination
Forward Selection vs. Backward Elimination
Embedded Methods (Regularization)

• Instead of explicitly selecting features, bias the
learning process towards using a small number
of features

• Key idea: the objective function has two parts
• A term representing error minimization (model fit)
• A term that “shrinks” parameters toward 0
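In symbols, a generic form of this two-part objective (the notation below is illustrative, not taken from the slide):

% Regularized objective: data-fit term plus a penalty weighted by lambda >= 0
\[
\min_{\beta}\;
\underbrace{\sum_{i=1}^{n} L\bigl(y_i, f(x_i;\beta)\bigr)}_{\text{model fit}}
\;+\;
\lambda\,\underbrace{\Omega(\beta)}_{\text{shrinks }\beta\text{ toward }0},
\qquad
\Omega(\beta)=\lVert\beta\rVert_2^{2}\ \text{(ridge)}
\quad\text{or}\quad
\lVert\beta\rVert_1\ \text{(lasso)}
\]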
Ridge Regression
• Linear regression: minimize the residual sum of squares

RSS(β) = Σ_i (y_i − β0 − Σ_j β_j x_ij)²

• Penalty term (L2 norm of the coefficients) added:

minimize RSS(β) + λ Σ_j β_j²
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be set to 0

• Add the L1 norm of the coefficients as the penalty term:

minimize RSS(β) + λ Σ_j |β_j|

– Why does this result in more coefficients being set to 0,
effectively performing feature selection?
Ridge Regression vs. LASSO
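The comparison figure from the original slide is not in the extracted text; as a stand-in, a small scikit-learn sketch (dataset and alpha values are arbitrary) showing that lasso sets many coefficients exactly to 0 while ridge only shrinks them:

# Ridge vs. lasso fitted on the same data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("ridge: nonzero coefficients =", np.sum(ridge.coef_ != 0))   # typically all 50
print("lasso: nonzero coefficients =", np.sum(lasso.coef_ != 0))   # typically close to 5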
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
