Feature Selection and Dimensionality Reduction
What is Feature Selection (or Variable Selection)?

• The problem of selecting a subset of a learning algorithm's input variables on which it should focus attention, while ignoring the rest. It is one form of dimensionality reduction.

Why Feature Selection?
• Naïve theoretical view: more features mean more information and more discriminating power. In practice this is often not the case, for several reasons.

• Real datasets can have thousands of features, many of them irrelevant or redundant. Irrelevant and redundant features may confuse learners.

• Reduces overfitting: less redundant data means less opportunity to make decisions based on noise.

• Improves accuracy: less misleading data means modeling accuracy improves.

• Reduces training time: fewer features mean lower algorithmic complexity, so algorithms train faster.

• Especially when dealing with a large number of variables, there is a need for dimensionality reduction.

• Feature selection can significantly improve a learning algorithm's performance.


The Curse of Dimensionality
• A classifier's performance usually degrades when the number of features becomes large.
• The number of samples required to achieve the same accuracy grows exponentially with the number of variables.
• In practice, the number of training examples is fixed, which limits how many features can be used effectively.
Filter Methods

• The selection of features is independent of any machine learning algorithm.
• Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.
• "Correlation" is used loosely here; which coefficient or test is appropriate depends on the data types of the feature and the outcome variable.
• Note that filter methods do not remove multicollinearity, so you must deal with multicollinearity among features before training models on your data.
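A minimal sketch of a filter method, assuming scikit-learn and its built-in breast-cancer dataset (the dataset, the choice of the ANOVA F-test, and k = 10 are illustrative assumptions, not part of the slides): each feature is scored with a univariate statistical test and only the top-scoring features are kept, independently of any learner.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative dataset; any (X, y) classification data would do.
X, y = load_breast_cancer(return_X_y=True)

# Score every feature with the ANOVA F-test and keep the 10 best.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)       # (569, 10): only the selected columns remain
print(selector.scores_[:5])   # per-feature F-statistics used for ranking
```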
Statistical Tests
• Pearson's Correlation: It is used as a measure for quantifying the linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation is given as ρ(X, Y) = cov(X, Y) / (σ_X σ_Y).

• LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.

• ANOVA: ANOVA stands for analysis of variance. It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature. It provides a statistical test of whether the means of several groups are equal or not.

• Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
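A minimal sketch of two of these tests using SciPy; the small arrays below are made up purely to show the function calls and are not from the slides.

```python
import numpy as np
from scipy import stats

# Two continuous variables (made-up values): Pearson's r lies in [-1, +1].
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 4.1, 4.8])
r, p_value = stats.pearsonr(x, y)
print(r, p_value)

# ANOVA: test whether the means of several groups are equal.
group_a = [3.1, 2.9, 3.4, 3.0]
group_b = [4.0, 4.2, 3.8, 4.1]
f_stat, p_anova = stats.f_oneway(group_a, group_b)
print(f_stat, p_anova)
```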
Chi-Squared test
• Example: We would like to determine the relevance of pitch type (a feature with 3 values: good, medium, bad) to the performance of a baseball team (a target with three classes: Wins, Draws, Losses). The observed frequencies from the dataset are arranged in a 3×3 contingency table of pitch type versus result.

• To find the expected frequencies, we assume independence of the rows and columns.
• To get the expected frequency corresponding to the 11 at top left, we take the row total (21) and the column total (30), multiply them, and then divide by the overall total (75).
• So the expected frequency is: 21 × 30 / 75 = 8.4
Chi-Squared test
• We compute the expected frequencies for every entry in the table in the same way.

• The number of degrees of freedom for an m-by-n table is (m-1)(n-1), so in this case (3-1)(3-1) = 2 × 2 = 4.

• In statistics, the degrees of freedom (DF) indicate the number of independent values that can vary in an analysis without breaking any constraints.
Chi-Squared test
• To calculate χ², we then sum (observed − expected)² / expected over all cells of the table.
Chi-Squared test
• The tabulated 95% value of χ² with 4 degrees of freedom is 9.49, so the value of χ² that we obtained (6.70) is not significant at the 5% level.

• We therefore conclude that there is no evidence that the state of the pitch affects the performance of the team.

• P(χ² > 9.49) = 0.05 under independence, so for χ² > 9.49 the assumption of independence can be rejected with 95% confidence.
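A minimal sketch of this test with SciPy. The 3×3 observed table below is hypothetical (the slide's full table is not reproduced here), but it is built so that the top-left cell is 11, its row total is 21, its column total is 30, and the grand total is 75, matching the worked example; the critical value for 4 degrees of freedom comes out as 9.49.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: rows = pitch type, columns = Wins/Draws/Losses.
observed = np.array([[11,  6,  4],
                     [ 8, 11,  8],
                     [11,  8,  8]])

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(expected[0, 0])                 # 21 * 30 / 75 = 8.4, as on the slide
print(dof)                            # (3-1) * (3-1) = 4

critical = stats.chi2.ppf(0.95, df=4)
print(critical)                       # ≈ 9.49
print(chi2 > critical)                # reject independence only if chi2 exceeds this
```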
Wrapper Methods

• In wrapper methods, we try a subset of features and train a model using them. Based on the inferences we draw from that model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually computationally very expensive.

• Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.

• Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration we add the feature that best improves the model, until adding a new variable no longer improves performance.

• Backward Elimination: In backward elimination, we start with all the features and remove the least significant feature at each iteration, provided this improves the performance of the model. We repeat this until no improvement is observed on removing a feature.

• Recursive Feature Elimination: It is a greedy optimization algorithm that aims to find the best-performing feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each iteration. It constructs the next model with the remaining features until all the features are exhausted, and then ranks the features based on the order of their elimination.
Forward Selection
Recursive Feature Elimination
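A minimal sketch of recursive feature elimination with scikit-learn (the breast-cancer dataset, the logistic-regression estimator, and n_features_to_select = 5 are illustrative assumptions): the estimator is refit repeatedly and the weakest feature is dropped each round.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

estimator = LogisticRegression(max_iter=5000)         # model used to weigh the features
rfe = RFE(estimator, n_features_to_select=5, step=1)  # eliminate one feature per iteration
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 5 features that survived
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```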
Embedded Methods

• Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.
• Example: Lasso regression performs L1 regularization, which adds a penalty equal to the sum of the absolute values of the coefficients.
• Other examples of embedded methods are ridge regression, regularized trees, the memetic algorithm, and random multinomial logit.
Lasso regression
• $J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \lambda \lVert w \rVert_1$

Note that if w_1 is zero here, it means that we are not considering feature 1 when determining the prediction.
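A minimal sketch of Lasso as an embedded method with scikit-learn (the diabetes dataset and alpha = 1.0 are illustrative assumptions; alpha plays the role of λ above): the L1 penalty drives some coefficients exactly to zero, dropping those features from the prediction.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # put features on a comparable scale

lasso = Lasso(alpha=1.0)                # alpha is the regularization strength (lambda)
lasso.fit(X, y)

print(lasso.coef_)                      # entries equal to 0 mark features the model ignores
```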
Difference between Filter and Wrapper methods

• The main differences between the filter and wrapper methods for feature selection are:
• Filter methods measure the relevance of features by their correlation with the dependent variable, while wrapper methods measure the usefulness of a subset of features by actually training a model on it.
• Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods are computationally very expensive.
• Filter methods use statistical tests to evaluate a subset of features, while wrapper methods use cross-validation.
• Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods search the space of subsets directly and can often find a better one.
• Using the subset of features from a wrapper method makes the model more prone to overfitting than using the subset of features from a filter method.
Principal Component Analysis
(Dimensionality Reduction)
Applications of PCA
• Data Visualization/Presentation
• Data Compression
• Noise Reduction
• Data Classification
• Trend Analysis
• Factor Analysis
Data Presentation
• Example: 53 blood and urine measurements (wet chemistry) from 65 people (33 alcoholics, 32 non-alcoholics).

• Matrix format (first rows and columns shown):

       H-WBC   H-RBC   H-Hgb    H-Hct    H-MCV    H-MCH   H-MCHC
  A1   8.0000  4.8200  14.1000  41.0000   85.0000 29.0000 34.0000
  A2   7.3000  5.0200  14.7000  43.0000   86.0000 29.0000 34.0000
  A3   4.3000  4.4800  14.1000  41.0000   91.0000 32.0000 35.0000
  A4   7.5000  4.4700  14.9000  45.0000  101.0000 33.0000 33.0000
  A5   7.3000  5.5200  15.4000  46.0000   84.0000 28.0000 33.0000
  A6   6.9000  4.8600  16.0000  47.0000   97.0000 33.0000 34.0000
  A7   7.8000  4.6800  14.7000  43.0000   92.0000 31.0000 34.0000
  A8   8.6000  4.8200  15.8000  42.0000   88.0000 33.0000 37.0000
  A9   5.1000  4.7100  14.0000  43.0000   92.0000 30.0000 32.0000

• Spectral format: each person's measurements plotted as a curve against measurement index (figure not reproduced).

Data Presentation

• Univariate view: one measurement (H-Bands) plotted per person.
• Bivariate view: C-LDH plotted against C-Triglycerides.
• Trivariate view: M-EPI plotted against C-LDH and C-Triglycerides.
(Scatter plots not reproduced.)
Data Presentation
• Is there a better presentation than the original coordinate axes?
• Do we need a 53-dimensional space to view the data?
• How do we find the 'best' low-dimensional space that conveys maximum useful information?
• One answer: find "Principal Components"
Principal Component Analysis (PCA)
• PCA converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
• It takes an n × p data matrix with possibly correlated axes and summarizes it by uncorrelated axes.
• The first k components display as much of the variation among objects as possible.
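A minimal sketch with scikit-learn, assuming a NumPy array X of shape (n_samples, n_features) such as the 65 × 53 blood/urine matrix above (the random placeholder data and the choice k = 2 are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(65, 53))           # placeholder for the real 65 x 53 measurement matrix

pca = PCA(n_components=2)               # keep the first k = 2 principal components
scores = pca.fit_transform(X)           # 65 x 2 coordinates on the uncorrelated axes

print(scores.shape)
print(pca.explained_variance_ratio_)    # fraction of the variance captured by each component
```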
Geometric Rationale of PCA

• The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties:
– they are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance;
– the covariance between each pair of principal axes is zero (the principal axes are uncorrelated).
PCA

• Project the data onto a space in which variance is maximized and error is minimized.

• Orthogonal projection of the data onto a lower-dimensional linear space that:
– maximizes the variance of the projected data (the purple line in the figure);
– minimizes the mean squared distance between the data points and their projections (the sum of the blue lines).
PCA

• Idea:
• Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible.
– E.g., find the best planar approximation to 3-D data.
– E.g., find the best 12-D approximation to 10⁴-D data.
• In particular, choose the projection that minimizes the squared error in reconstructing the original data.
PCA: Algorithm
PCA Example –STEP 1

• Subtract the mean from each of the data dimensions: all the x values have x̄ subtracted and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.
• Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations. The variance and covariance values are not affected by the mean value.
PCA Example –STEP 1
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

DATA:              ZERO MEAN DATA:
  x     y            x      y
 2.5   2.4          .69    .49
 0.5   0.7        -1.31  -1.21
 2.2   2.9          .39    .99
 1.9   2.2          .09    .29
 3.1   3.0         1.29   1.09
 2.3   2.7          .49    .79
 2.0   1.6          .19   -.31
 1.0   1.1         -.81   -.81
 1.5   1.6         -.31   -.31
 1.1   0.9         -.71  -1.01
PCA Example –STEP 1
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
PCA Example –STEP 2
• Calculate the covariance matrix:

  cov = [ .616555556  .615444444
          .615444444  .716555556 ]

• Since the off-diagonal elements of this covariance matrix are positive, we should expect the x and y variables to increase together.
PCA Example –STEP 3

• Calculate the eigenvectors and eigenvalues of


the covariance matrix
eigenvalues = .0490833989
1.28402771
eigenvectors = -.735178656 -.677873399
.677873399 -.735178656
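A minimal NumPy sketch reproducing STEPs 1–3 for this data (the variable names are mine; the printed numbers match the slides, though eigenvector signs and ordering may differ from the listing above):

```python
import numpy as np

# Original 2-D data from the tutorial (STEP 1)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Z = X - X.mean(axis=0)                    # STEP 1: subtract the mean of each column

cov = np.cov(Z, rowvar=False)             # STEP 2: sample covariance matrix
print(cov)                                # ≈ [[0.6166, 0.6154], [0.6154, 0.7166]]

eigvals, eigvecs = np.linalg.eigh(cov)    # STEP 3: eigen-decomposition (symmetric matrix)
print(eigvals)                            # ≈ [0.0491, 1.2840]
print(eigvecs)                            # columns are the eigenvectors (sign may flip)
```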
PCA Example –STEP 3
• http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
• The eigenvectors are plotted as diagonal dotted lines on the plot.
• Note that they are perpendicular to each other.
• Note that one of the eigenvectors goes through the middle of the points, like a line of best fit.
• The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line but are off to the side of it by some amount.
PCA Example –STEP 4
• Reduce dimensionality and form a feature vector.
• The eigenvector with the highest eigenvalue is the principal component of the data set.
• In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.

• Once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest.
• This gives the components in order of significance.
PCA Example –STEP 4
• Now, if you like, you can decide to ignore the components of lesser significance.

• You do lose some information, but if the eigenvalues are small, you don't lose much.

• If you have p dimensions in your data:
• calculate the p eigenvectors and eigenvalues;
• choose only the first k eigenvectors;
• the final data set has only k dimensions.
PCA Example –STEP 4
• Feature Vector
  FeatureVector = (eig_1 eig_2 eig_3 ... eig_n)
  We can either form a feature vector with both of the eigenvectors:
      [ -.677873399  -.735178656
        -.735178656   .677873399 ]
  or we can choose to leave out the smaller, less significant component and keep only a single column:
      [ -.677873399
        -.735178656 ]
PCA Example –STEP 5
• Deriving the new data:
• FinalData = RowFeatureVector × RowZeroMeanData
• RowFeatureVector is the matrix with the eigenvectors in its rows (the eigenvector columns transposed), with the most significant eigenvector at the top.
• RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in the columns, with each row holding a separate dimension.
PCA Example –STEP 5
FinalData transpose: dimensions along columns

x y
-.827970186 -.175115307
1.77758033 .142857227
-.992197494 .384374989
-.274210416 .130417207
-1.67580142 -.209498461
-.912949103 .175282444
.0991094375 -.349824698
1.14457216 .0464172582
.438046137 .0177646297
1.22382056 -.162675287
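A minimal NumPy sketch of this step for the tutorial data (done row-wise, so the result is the FinalData transpose shown above; variable names are mine):

```python
import numpy as np

# Zero-mean data from STEP 1, one point per row
Z = np.array([[ .69,  .49], [-1.31, -1.21], [ .39,  .99], [ .09,  .29],
              [1.29, 1.09], [ .49,  .79], [ .19, -.31], [-.81, -.81],
              [-.31, -.31], [-.71, -1.01]])

# Eigenvectors as rows, most significant first (STEP 3/4)
row_feature_vector = np.array([[-.677873399, -.735178656],
                               [-.735178656,  .677873399]])

# FinalData = RowFeatureVector x RowZeroMeanData, computed here per data point
final_data = Z @ row_feature_vector.T
print(final_data[0])    # ≈ [-0.82797, -0.17512], the first row of the table above
```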
PCA Example –STEP 5
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
Reconstruction of original Data
• If we reduced the dimensionality then, obviously, when reconstructing the data we would lose the dimensions we chose to discard. In our example, let us assume that we kept only the first (x) column of the transformed data…
Reconstruction of original Data
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf

x
-.827970186
1.77758033
-.992197494
-.274210416
-1.67580142
-.912949103
.0991094375
1.14457216
.438046137
1.22382056
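A minimal NumPy sketch of the reconstruction from the single retained component (the means 1.81 and 1.91 are computed from the original data; variable names are mine): each score is mapped back through the kept eigenvector and the mean is added back, giving points that lie on the principal axis.

```python
import numpy as np

# Single-component scores kept from STEP 5 (the x column above)
scores = np.array([-.827970186, 1.77758033, -.992197494, -.274210416, -1.67580142,
                   -.912949103, .0991094375, 1.14457216, .438046137, 1.22382056])

v1 = np.array([-.677873399, -.735178656])    # the retained (most significant) eigenvector
mean_xy = np.array([1.81, 1.91])             # column means of the original data

# RowOriginalData ≈ RowFeatureVector^T x FinalData, then add the mean back in
reconstructed = np.outer(scores, v1) + mean_xy
print(reconstructed[0])   # ≈ [2.37, 2.52], an approximation of the original point (2.5, 2.4)
```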
References and useful links

• http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf
• https://www.cs.cmu.edu/~elaw/papers/pca.pdf
• https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
