
The 5 Feature Selection Algorithms every Data Scientist should know
Bonus: What makes a good footballer great?

Rahul Agarwal
Jul 27, 2019 · 7 min read

Data Science is the study of algorithms.

I grapple with many algorithms on a day-to-day basis, so I thought of listing some of the most common and most used ones in this new DS Algorithm series.

How many times has it happened that you create a lot of features and then need to come up with ways to reduce their number?

We sometimes end up using correlation or tree-based methods to find out the important features.

Can we add some structure to it?


This post is about some of the most common feature
selection techniques one can use while working with
data.

Why Feature Selection?


Before we proceed, we need to answer this question: why don't we give all the features to the ML algorithm and let it decide which features are important?

So there are three reasons why we don’t:

1. Curse of dimensionality — Overfitting



If we have more columns in the data than the number of rows, we will be able to fit our training data perfectly, but that won't generalize to new samples. And thus we learn absolutely nothing.

2. Occam’s Razor:

We want our models to be simple and explainable. We lose explainability when we have a lot of features.

3. Garbage In Garbage out:

Most of the time, we will have many non-informative features, for example Name or ID variables. Poor-quality input will produce poor-quality output.
Also, a large number of features makes a model bulky, time-consuming, and harder to implement in production.

So what do we do? We select only useful features.

Fortunately, Scikit-learn has made it pretty easy for us to do feature selection. There are a lot of ways in which we can think of feature selection, but most feature selection methods can be divided into three major buckets:

• Filter-based: We specify some metric and filter features based on it. Examples of such metrics are correlation and chi-square.

• Wrapper-based: Wrapper methods consider the selection of a set of features as a search problem. Example: Recursive Feature Elimination.

• Embedded: Embedded methods use algorithms that have built-in feature selection methods. For instance, Lasso and RF have their own feature selection methods.

So, enough of theory; let us start with our five feature selection methods.

We will try to do this using a dataset to understand it better.

I am going to be using a football player dataset to find out what makes a good player great.
Don't worry if you don't understand football terminology. I will try to keep it to a minimum.

Here is the Kaggle Kernel with the code to try out yourself.

Some simple Data Preprocessing


We have done some basic preprocessing, such as removing nulls and one-hot encoding, and converted the problem to a classification problem using:
y = traindf['Overall']>=87

Here we use High Overall as a proxy for a great player.
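Here is a minimal sketch of that preprocessing; the file name and the categorical columns below are just illustrative assumptions, and the exact steps in the Kaggle Kernel may differ:

import pandas as pd

# Hypothetical file and column names, for illustration only
player_df = pd.read_csv("fifa19_players.csv")

# Keep the numerical attributes plus a few categorical ones, and drop rows with nulls
numcols = player_df.select_dtypes(include="number").columns.tolist()
catcols = ["Preferred Foot", "Position", "Body Type"]   # assumed to exist in the data
player_df = player_df[numcols + catcols].dropna()

# One-hot encode the categorical columns
traindf = pd.get_dummies(player_df, columns=catcols)

# A "great" player is approximated by a high Overall rating
y = traindf["Overall"] >= 87
X = traindf.drop(columns=["Overall"])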

Our dataset (X) has 223 columns.

1. Pearson Correlation

This is a filter-based method.


We check the absolute value of Pearson's correlation between the target and the numerical features in our dataset and keep the top n features according to this criterion.
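Here is a small sketch of how this can be done with pandas and NumPy; the helper name cor_selector and the choice of keeping 30 features are illustrative rather than part of the method:

import numpy as np

def cor_selector(X, y, num_feats):
    # Absolute Pearson correlation of each feature with the target
    cor_list = [abs(np.corrcoef(X[col], y)[0, 1]) for col in X.columns]
    cor_list = np.nan_to_num(cor_list)            # constant columns give NaN
    # Boolean mask for the num_feats features with the highest |correlation|
    top_idx = np.argsort(cor_list)[-num_feats:]
    cor_support = np.zeros(len(X.columns), dtype=bool)
    cor_support[top_idx] = True
    cor_feature = X.columns[cor_support].tolist()
    return cor_support, cor_feature

cor_support, cor_feature = cor_selector(X, y, num_feats=30)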

2. Chi-Squared
This is another filter-based method.

In this method, we calculate the chi-squared statistic between the target and each numerical variable and only select the variables with the highest chi-squared values.

Let us create a small example of how we calculate the chi-squared statistic for a sample.

So let’s say we have 75 Right-Forwards in our dataset and 25


Non-Right-Forwards. We observe that 40 of the Right-Forwards
are good, and 35 are not good. Does this signify that the player
being right forward affects the overall performance?

Observed and Expected Counts

We calculate the chi-squared value:

To do this, we first find out the values we would expect to fall in each bucket if the two categorical variables were indeed independent.

This is simple. For each cell, we multiply the row sum by the column sum and divide by the total number of observations.
So the expected value for the Good and Not-Right-Forward bucket = 25 (row sum) * 60 (column sum) / 100 (total observations) = 15.

Why is this expected? Since 25% of the players in the data are Not-Right-Forwards, we would expect 25% of the 60 good players we observed to fall in that cell. Thus 15 players.

Then we could just use the below formula to sum over all four cells:
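The formula referred to is the standard chi-squared statistic, chi² = Σ (Observed − Expected)² / Expected, summed over the cells. Filling in the remaining cells from the totals above (20 good and 5 not-good Non-Right-Forwards observed; expected counts of 45, 30, 15 and 10), we get chi² = (40−45)²/45 + (35−30)²/30 + (20−15)²/15 + (5−10)²/10 ≈ 5.56, which, with one degree of freedom, exceeds the usual 5% critical value of 3.84, so position and rating do not look independent here.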
I won’t show it here, but the chi-squared statistic also works in a
hand-wavy way with non-negative numerical and categorical
features.

We can get chi-squared features from our dataset as:
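For example, with scikit-learn's SelectKBest. Since chi2 requires non-negative inputs, the features are min-max scaled first; k=30 is an illustrative choice:

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 needs non-negative values, so scale the features to [0, 1] first
X_norm = MinMaxScaler().fit_transform(X)

chi_selector = SelectKBest(chi2, k=30)
chi_selector.fit(X_norm, y)

chi_support = chi_selector.get_support()       # boolean mask of selected features
chi_feature = X.columns[chi_support].tolist()  # names of the selected columns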

3. Recursive Feature Elimination


This is a wrapper-based method. As I said before, wrapper methods consider the selection of a set of features as a search problem.

From sklearn Documentation:

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

As you would have guessed, we could use any estimator with this method. In this case, we use LogisticRegression, and the RFE observes the coef_ attribute of the LogisticRegression object.
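Here is a sketch of that setup; n_features_to_select=30, step=10 and the scaling are illustrative choices rather than part of RFE itself:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Scale the features so the logistic regression coefficients are comparable
X_norm = MinMaxScaler().fit_transform(X)

rfe_selector = RFE(estimator=LogisticRegression(max_iter=1000),
                   n_features_to_select=30, step=10)
rfe_selector.fit(X_norm, y)

rfe_support = rfe_selector.get_support()
rfe_feature = X.columns[rfe_support].tolist()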

4. Lasso: SelectFromModel

This is an Embedded method. As said before, Embedded methods use algorithms that have built-in feature selection methods.

For example, Lasso and RF have their own feature selection methods. The Lasso regularizer forces a lot of feature weights to be zero.

Here we use Lasso to select variables.
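Because our target here is binary, a sketch of this step would use an L1-penalized (Lasso-style) logistic regression inside scikit-learn's SelectFromModel; the solver and C value below are illustrative assumptions:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X_norm = MinMaxScaler().fit_transform(X)

# The L1 penalty drives many coefficients to exactly zero;
# SelectFromModel keeps the features with non-zero weights
embedded_lr_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0))
embedded_lr_selector.fit(X_norm, y)

embedded_lr_support = embedded_lr_selector.get_support()
embedded_lr_feature = X.columns[embedded_lr_support].tolist()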

5. Tree-based: SelectFromModel
This is an Embedded method. As said before, Embedded
methods use algorithms that have built-in feature selection
methods.

We can also use RandomForest to select features based on feature importance.

We calculate feature importance using node impurities in each decision tree. In a Random Forest, the final feature importance is the average of the feature importances across all decision trees.

We could also have used LightGBM or an XGBoost object, as long as it has a feature_importances_ attribute.
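Here is a sketch with a RandomForestClassifier; n_estimators=100 and the "median" threshold are illustrative choices:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Keep features whose impurity-based importance is above the median importance
embedded_rf_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median")
embedded_rf_selector.fit(X, y)

embedded_rf_support = embedded_rf_selector.get_support()
embedded_rf_feature = X.columns[embedded_rf_support].tolist()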

Bonus
Why use one, when we can have all?

The answer is that sometimes it won't be possible with a lot of data and a time crunch.

But whenever possible, why not do this?


We check whether a feature is selected by all the methods, as in the sketch below. In this case, as we can see, Reactions and LongPassing are excellent attributes to have in a highly rated player. And as expected, Ballcontrol and Finishing occupy the top spots too.
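Here is one way to run that check, assuming the boolean support masks from the sketches above (cor_support, chi_support, rfe_support, embedded_lr_support, embedded_rf_support) are available:

import pandas as pd

# Count, for every feature, how many of the methods selected it
feature_selection_df = pd.DataFrame({
    "Feature": X.columns,
    "Pearson": cor_support,
    "Chi-2": chi_support,
    "RFE": rfe_support,
    "Lasso (L1)": embedded_lr_support,
    "Random Forest": embedded_rf_support,
})
feature_selection_df["Total"] = feature_selection_df.iloc[:, 1:].sum(axis=1)
feature_selection_df = feature_selection_df.sort_values(
    ["Total", "Feature"], ascending=False)
print(feature_selection_df.head(10))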

Conclusion
Feature engineering and feature selection are critical parts of
any machine learning pipeline.

We strive for accuracy in our models, and one cannot get to good accuracy without revisiting these pieces again and again.

In this article, I tried to explain some of the most used feature selection techniques as well as my workflow when it comes to feature selection.

I also tried to provide some intuition into these methods, but you should probably look into them further and try to incorporate them into your work.

Do read my post on feature engineering too if you are interested.
If you want to learn more about Data Science, I would like to call out this excellent course by Andrew Ng. This was the one that got me started. Do check it out.

Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me at Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.
