Feature Selection and Dimensionality Reduction
What is Feature Selection (or Variable Selection)?
• Real datasets often contain thousands of features, many of them irrelevant or redundant. Irrelevant and redundant features may confuse learners.
• Reduces Overfitting: Less redundant data means less opportunity to make decisions based on
noise.
• Reduces Training Time: fewer features reduce model complexity, so algorithms train faster.
• Especially when dealing with a large number of variables, there is a need for dimensionality reduction.
Filter Methods
• Filter methods score each feature by a statistical measure of its relationship with the target variable, independently of any learning algorithm. Common tests include:
• LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
• ANOVA: ANOVA stands for Analysis of Variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal.
• Chi-Square: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distributions.
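As a concrete illustration of filter-style selection, here is a minimal sketch assuming scikit-learn is available; the dataset and the value of k are illustrative choices, not taken from the slides. SelectKBest with f_classif scores each feature by the ANOVA F-test and keeps the highest-scoring ones:

```python
# Filter-method sketch: score each feature against the target with the
# ANOVA F-test and keep the k highest-scoring features.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 4 numeric features, 3 classes

selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 best features
X_reduced = selector.fit_transform(X, y)

print("F-scores per feature:", np.round(selector.scores_, 2))
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_reduced.shape)           # (150, 2)
```

Swapping chi2 in for f_classif gives the chi-square filter, which is appropriate for non-negative (e.g., count) features.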
Chi-Squared test
• Example: We would like to determine the relevance of pitch type (a feature with 3 values: good, medium, bad) to the performance of a baseball team (a target with three classes: Wins, Draws, Losses). The test is applied to the observed frequency distribution of these values in the dataset.
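The slide's observed frequency table is not reproduced here, so the sketch below uses purely hypothetical counts to show how the test works; scipy.stats.chi2_contingency returns the chi-square statistic and the p-value for independence between pitch type and match result:

```python
# Chi-squared test of independence between a categorical feature (pitch type)
# and the target (match result), using a contingency table of observed counts.
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies (rows: good/medium/bad pitch,
# columns: Wins/Draws/Losses) -- the slide's actual table is not shown here.
observed = np.array([
    [40, 10, 10],   # good pitch
    [20, 15, 15],   # medium pitch
    [10, 15, 25],   # bad pitch
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# A small p-value suggests pitch type and result are not independent,
# i.e. the feature is likely relevant to the target.
```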
Wrapper Methods
• In wrapper methods, we try a subset of features and train a model using it. Based on the inferences we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.
• Some common examples of wrapper methods are forward feature selection, backward feature
elimination, recursive feature elimination, etc.
• Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves its performance.
• Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, as long as this improves the performance of the model. We repeat this until no improvement is observed on removing a feature.
• Recursive Feature Elimination: a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted, and then ranks the features by the order of their elimination (see the sketch below).
Forward Selection
Recursive Feature Elimination
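A minimal scikit-learn sketch of the three wrapper strategies above (forward selection, backward elimination, and recursive feature elimination); the estimator, dataset, and number of selected features are illustrative assumptions, not specified by the slides:

```python
# Wrapper-method sketch: forward selection, backward elimination, and
# recursive feature elimination (RFE) around a single estimator.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SequentialFeatureSelector

X, y = load_wine(return_X_y=True)          # 13 features, 3 classes
estimator = RandomForestClassifier(n_estimators=50, random_state=0)

# Forward selection: start with no features, greedily add the most helpful one.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start with all features, drop the least useful one.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

# RFE: fit, discard the weakest feature, refit on the rest, and repeat;
# features are then ranked by the order in which they were eliminated.
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)

print("Forward :", forward.get_support(indices=True))
print("Backward:", backward.get_support(indices=True))
print("RFE     :", rfe.get_support(indices=True), "ranking:", rfe.ranking_)
```

Note that each candidate subset is evaluated by actually training and cross-validating the estimator, which is why these methods are far more expensive than the filter tests above.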
Embedded Methods
• Embedded methods combine the qualities of filter and wrapper methods: feature selection is built into the learning algorithm itself, as in Lasso regularization or tree-based feature importances.
Filter vs. Wrapper Methods
• The main differences between the filter and wrapper methods for feature selection are:
• Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
• Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods are computationally very expensive.
• Filter methods use statistical tests to evaluate a subset of features, while wrapper methods use cross-validation.
• Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods can usually find a better-performing subset.
• Using the subset of features from wrapper methods makes the model more prone to overfitting than using the subset of features from filter methods.
Principal Component Analysis
(Dimensionality Reduction)
Applications of PCA
• Data Visualization/Presentation
• Data Compression
• Noise Reduction
• Data Classification
• Trend Analysis
• Factor Analysis
Data Presentation
• Example: 53 blood and urine measurements (wet chemistry) from 65 people (33 alcoholics, 32 non-alcoholics).
• Matrix format (first rows and columns shown):
      H-WBC   H-RBC   H-Hgb    H-Hct    H-MCV     H-MCH    H-MCHC
A1    8.0000  4.8200  14.1000  41.0000  85.0000   29.0000  34.0000
A2    7.3000  5.0200  14.7000  43.0000  86.0000   29.0000  34.0000
A3    4.3000  4.4800  14.1000  41.0000  91.0000   32.0000  35.0000
A4    7.5000  4.4700  14.9000  45.0000  101.0000  33.0000  33.0000
A5    7.3000  5.5200  15.4000  46.0000  84.0000   28.0000  33.0000
A6    6.9000  4.8600  16.0000  47.0000  97.0000   33.0000  34.0000
A7    7.8000  4.6800  14.7000  43.0000  92.0000   31.0000  34.0000
• Spectral format: [figure: all measurements plotted against measurement index]
• [Figures: univariate plot (H-Bands vs. person), bivariate plot (C-LDH vs. C-Triglycerides), trivariate plot (M-EPI vs. C-LDH vs. C-Triglycerides)]
Data Presentation
• Is there a better presentation than the original coordinate axes?
• Do we need a 53-dimensional space to view the data?
• How do we find the 'best' low-dimensional space that conveys maximum useful information?
• One answer: Find “Principal Components”
Principal Component Analysis (PCA)
• PCA converts a set of observations of possibly correlated variables into a set of
values of linearly uncorrelated variables called principal components
• Takes an 𝑛 × 𝑝 data matrix of possibly correlated axes and summarizes it by uncorrelated axes.
• The first k components display as much as possible of the variation among
objects.
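A short sketch of this in code, assuming scikit-learn; the data here is random and merely stands in for an n × p measurement matrix such as the 65 × 53 example above:

```python
# PCA sketch: summarize an n x p matrix of possibly correlated variables by a
# few uncorrelated principal components ordered by explained variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 53))              # synthetic stand-in: 65 people x 53 measurements

X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to the scale of each variable
pca = PCA(n_components=2)                  # keep the first k = 2 components
scores = pca.fit_transform(X_std)          # 65 x 2 matrix, e.g. for a 2-D scatter plot

print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("Projected shape:", scores.shape)
```

Standardizing first is a common design choice because PCA is driven by variance, so variables measured on large scales would otherwise dominate the components.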
Geometric Rationale of PCA
• Idea:
• Given data points in a d-dimensional space,
• project into lower dimensional space while preserving
as much information as possible
– E.g., find the best planar approximation to 3-D data
– E.g., find the best 12-D approximation to 10⁴-D data
• In particular, choose projection that minimizes
squared error in reconstructing original data
PCA: Algorithm
• Step 1: Get the data and subtract the mean of each dimension.
• Step 2: Calculate the covariance matrix.
• Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.
• Step 4: Choose components: order the eigenvectors by decreasing eigenvalue and keep the top ones as the feature vector.
• Step 5: Derive the new data set by projecting the mean-adjusted data onto the chosen components.
PCA Example – STEP 5
• Transformed data: the tutorial's 2-D example data expressed in terms of its two principal components.
x y
-.827970186 -.175115307
1.77758033 .142857227
-.992197494 .384374989
-.274210416 .130417207
-1.67580142 -.209498461
-.912949103 .175282444
.0991094375 -.349824698
1.14457216 .0464172582
.438046137 .0177646297
1.22382056 -.162675287
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
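For reference, a from-scratch sketch of the five steps applied to the 2-D example data of the tutorial linked above; the numeric input values are taken from that tutorial, and the projected output should match the transformed data table (possibly up to a sign flip of the eigenvectors):

```python
# From-scratch PCA on the 2-D example data from the cited tutorial.
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])                 # 10 x 2 data matrix

# Step 1: subtract the mean of each dimension.
mean = data.mean(axis=0)                       # [1.81, 1.91]
adjusted = data - mean

# Steps 2-3: covariance matrix and its eigenvectors / eigenvalues.
cov = np.cov(adjusted, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order

# Step 4: order components by decreasing eigenvalue.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]                 # columns = principal components

# Step 5: project the mean-adjusted data onto the components.
transformed = adjusted @ components
print(np.round(transformed, 6))                # matches the table above
                                               # (possibly up to a sign flip)
```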
Reconstruction of original Data
• If we reduce the dimensionality, then when reconstructing the data we obviously lose the dimensions we chose to discard. In our example, let us assume that we kept only the x dimension (the first principal component)…
Reconstruction of original Data
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
• Transformed data using only the first principal component:
x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
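A companion sketch of the reconstruction step, again using the tutorial's 2-D example data: keeping only the first principal component, mapping the scores back into the original space recovers the data only approximately, since the variation along the discarded component is lost:

```python
# Lossy reconstruction: keep only the first principal component and map back.
import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])

mean = data.mean(axis=0)
adjusted = data - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(adjusted, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)].reshape(-1, 1)   # top component (2 x 1)

scores = adjusted @ pc1                 # the single "x" column shown above
reconstructed = scores @ pc1.T + mean   # back to the original 2-D space

print(np.round(reconstructed, 3))       # close to the original data, but the
                                        # variation along PC2 has been lost
```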
References and useful links
• http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf
• https://www.cs.cmu.edu/~elaw/papers/pca.pdf
• https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues