CS464_Ch5_FeatureSelection
Filtering pipeline: Score Features → Rank Features Based on Score → Select Top k Features → Train Model
• Scores do not represent prediction performance since no validation is done at this stage
• Do NOT use validation/test samples to compute the scores
• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g., use features with statistically significant scores)
• Cross-validation can be used to select an optimal value of k, using prediction performance as the criterion (as in the sketch below)
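A minimal sketch of this pipeline, assuming scikit-learn (the slides do not prescribe a library): features are scored and the top k kept inside a cross-validated pipeline, and k is chosen by prediction performance on the training folds only.

```python
# Choosing k for a filter method by cross-validation (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)

best_k, best_score = None, -np.inf
for k in (5, 10, 20, 40):
    # Putting selection inside the pipeline keeps the held-out CV folds
    # out of the scoring step, so no validation data leaks into the scores.
    pipe = Pipeline([
        ("select", SelectKBest(mutual_info_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"chosen k = {best_k} (CV accuracy {best_score:.3f})")
```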
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty about the value of the outcome variable upon observing the value of the feature
– Already discussed
• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between observed and expected counts across classes (discrete features; related to mutual information)
• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
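A rough sketch of computing several of these scores per feature (mutual information, chi-square, and a variance threshold), assuming scikit-learn; the data are toy counts, for illustration only. The t-statistic is sketched further below.

```python
# Per-feature filter scores on nonnegative count-style toy data (scikit-learn assumed).
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif, VarianceThreshold

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30)).astype(float)   # toy count features
y = rng.integers(0, 2, size=200)                        # toy binary labels

mi_scores = mutual_info_classif(X, y, discrete_features=True)  # mutual information
chi2_scores, p_values = chi2(X, y)                              # chi-square statistic

# Variance/frequency: drop features whose variance is (near) zero.
keep_mask = VarianceThreshold(threshold=0.1).fit(X).get_support()

print(mi_scores[:5], chi2_scores[:5], keep_mask[:5])
```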
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a high-dimensional feature vector:
– Each dimension corresponds to a term
– Many dimensions correspond to rare words
– Rare words can mislead the classifier
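As a small illustration of this representation, assuming scikit-learn's CountVectorizer (any bag-of-words tokenizer would do):

```python
# One column per vocabulary term; each document becomes a row of term counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cheap watches now", "meeting schedule attached", "cheap meeting rooms"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # the terms (dimensions)
print(X.toarray())                          # document-by-term count matrix
```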
Noisy Features
• A noise feature is one that increases the classification error on
new data.
Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty about the class upon observing the value of the feature
• Chi-square test – a statistical test that compares the frequencies of a term across different classes
Information
• The information provided by observing an event with probability p(x):
I(X = x) = log2(1 / p(x)) = -log2 p(x)
• If the probability of an event is small and it nevertheless happens, the information gained is large
Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value
• Entropy is a measure of uncertainty
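A tiny numeric check of these two definitions (base-2 logarithms assumed, so information is measured in bits):

```python
import math

def information(p):
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Entropy of a discrete distribution: expected self-information."""
    return sum(p * information(p) for p in probs if p > 0)

print(information(0.5))          # 1 bit: a fair coin flip
print(information(0.01))         # ~6.64 bits: a rare event carries more information
print(entropy([0.5, 0.5]))       # 1 bit: maximal uncertainty for two outcomes
print(entropy([0.99, 0.01]))     # ~0.08 bits: a nearly certain outcome
```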
– Definition: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html
Mutual Information
• If a term's occurrence is independent of the class (i.e., the term's distribution within the class is the same as in the collection as a whole), then its MI is 0
How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we actually use is computed from the following document counts:
– N10: number of documents that contain t (et = 1) and are not in c (ec = 0)
– N11: number of documents that contain t (et = 1) and are in c (ec = 1)
– N01: number of documents that do not contain t (et = 0) and are in c (ec = 1)
– N00: number of documents that do not contain t (et = 0) and are not in c (ec = 0)
– N = N00 + N01 + N10 + N11
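The maximum-likelihood formula referred to above (as given in the linked IR-book chapter) can be evaluated directly from these four counts; a sketch, with hypothetical counts used purely for illustration:

```python
# MLE estimate of MI between term occurrence and class membership,
# computed from the four document counts defined above.
import math

def mutual_information(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00   # docs in / not in the class

    def term(n_joint, n_row, n_col):
        # Each term is (N_joint/N) * log2(N * N_joint / (N_row * N_col));
        # empty cells contribute zero.
        if n_joint == 0:
            return 0.0
        return (n_joint / n) * math.log2(n * n_joint / (n_row * n_col))

    return (term(n11, n1_, n_1) + term(n01, n0_, n_1) +
            term(n10, n1_, n_0) + term(n00, n0_, n_0))

# Hypothetical counts, for illustration only:
print(mutual_information(n11=30, n10=70, n01=20, n00=880))
```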
Mutual Information Example
Why Feature Selection Helps
t-statistic
• We have n1 and n2 samples from the two classes, respectively
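The slide's formula is not reproduced here; one standard form is the two-sample (Welch) t-statistic, (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2). A sketch of scoring one continuous feature this way, assuming SciPy:

```python
# Score a feature by how strongly its mean differs between the two classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(loc=0.0, scale=1.0, size=40)   # feature values in class 1 (n1 = 40)
x2 = rng.normal(loc=0.8, scale=1.0, size=60)   # feature values in class 2 (n2 = 60)

t, p = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's two-sample t-test
print(abs(t))   # larger |t| suggests the feature separates the classes better
```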
Wrapper-Based Selection
• Treat selection as a search over feature subsets (states)
– Operators: add/subtract a feature
– Scoring function: cross-validation accuracy using the learning method on a given state's feature set
Forward Selection
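A minimal forward-selection sketch, assuming scikit-learn (which also ships SequentialFeatureSelector for this): it implements the wrapper search above by starting from the empty set and greedily applying the "add a feature" operator while cross-validation accuracy improves.

```python
# Greedy forward selection with cross-validation accuracy as the scoring function.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=5,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Score every candidate state reachable by the "add a feature" operator.
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break                      # no single addition improves the CV score
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```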
Backward Elimination
Forward Selection vs. Backward Elimination
Embedded Methods (Regularization)
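One common embedded method is L1 regularization: an L1 penalty drives some coefficients exactly to zero during training, so feature selection happens as part of fitting the model rather than as a separate step. A sketch, assuming scikit-learn:

```python
# L1-regularized logistic regression as an embedded feature selector.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])   # indices of features with nonzero weight
print("features kept by the L1 penalty:", selected)
```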