Published in Towards Data Science
sampath kumar gajawada
Oct 19, 2019 · 7 min read
ANOVA for Feature Selection in Machine Learning
Applications of ANOVA in Feature selection
Photo by Fahrul Azmi
The biggest challenge in machine learning is selecting the best features to train the
model. We need only the features which are highly dependent on the response variable.
But what if the response variable is continuous and the predictor is categorical?
ANOVA (Analysis of Variance) helps us complete the job of selecting the best features.
In this article, I will take you through
a. Impact of Variance
b. F-Distribution
c. ANOVA
d. One Way ANOVA with example
Impact of Variance
Variance is a measure of the spread between numbers in a variable. It measures
how far each number lies from the mean of the variable.
The variance of a feature indicates how much information it can carry about the response
variable. If the variance is very low, the feature is nearly constant, so it can have little
impact on the response, and vice versa.
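This idea can be sketched with numpy. The two feature arrays below are hypothetical values I made up for illustration; they are not from the article's dataset:

```python
import numpy as np

# Hypothetical feature columns: one nearly constant, one spread out
study_time = np.array([2.0, 2.0, 2.1, 2.0, 1.9, 2.0])  # low variance
failures = np.array([0.0, 3.0, 1.0, 4.0, 0.0, 2.0])     # high variance

# Variance: average squared distance of each value from the mean
print(np.var(study_time))  # close to 0 -> carries little information
print(np.var(failures))    # much larger -> worth keeping as a candidate
```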
F-Distribution
A probability distribution generally used for the analysis of variance. The F-test based
on it assumes the hypotheses:
H0: The two variances are equal
H1: The two variances are not equal
Degrees of Freedom
Degrees of freedom refers to the maximum number of logically independent values,
which have the freedom to vary. In simple words, it can be defined as the total number
of observations minus the number of independent constraints imposed on the
observations.
df = N - 1, where N is the sample size
F- Value
It is the ratio of two chi-square distributed variables, each divided by its degrees of freedom.
F = (χ₁² / df₁) / (χ₂² / df₂)
Let’s simplify the above equation and check how it can be useful for analyzing variance.
A chi-square variable divided by its degrees of freedom corresponds to a sample variance,
so the F value reduces to a ratio of two sample variances:

F = s₁² / s₂²

In the real world we always deal with samples, so comparing variances this way is how the
test is carried out in practice.
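The ratio of sample variances can be sketched directly. A minimal sketch with two hypothetical normal samples (the seed and sample sizes are my choices, not from the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(loc=0, scale=1, size=30)  # hypothetical sample 1
b = rng.normal(loc=0, scale=1, size=25)  # hypothetical sample 2

# F value: ratio of the two sample variances (ddof=1 gives the sample variance)
F = np.var(a, ddof=1) / np.var(b, ddof=1)

# Compare against the F-distribution with (n1 - 1, n2 - 1) degrees of freedom
p = 2 * min(stats.f.cdf(F, len(a) - 1, len(b) - 1),
            stats.f.sf(F, len(a) - 1, len(b) - 1))
print(F, p)
```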
image from https://newonlinecourses.science.psu.edu
In the above figure, we can observe that the shape of the F-distribution depends on the
degrees of freedom.
ANOVA
Analysis of Variance is a statistical method used to check whether the means of two or
more groups are significantly different from each other. It assumes the hypotheses:
H0: The means of all groups are equal.
H1: At least one group mean is different.
How is a comparison of means transformed into a comparison of variances?
Consider two distributions and their behavior in the figure below.
Behavior of distributions
From the above figure, we can say that if the distributions overlap or are close, the grand
mean will be similar to the individual group means, whereas if the distributions are far
apart, the grand mean and the individual means differ by a larger distance.
This difference reflects the variation between the groups, since the values in each group
are different. So in ANOVA, we compare between-group variability to within-group variability.
ANOVA uses an F-test to check whether there is any significant difference between the
groups. If there is no significant difference between the groups, that is, all the group
means are equal, the ANOVA F-ratio will be close to 1.
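This F-test is available in scipy as f_oneway. A minimal sketch on three made-up groups with similar means (the numbers are hypothetical, not the article's data):

```python
from scipy.stats import f_oneway

# Hypothetical grades for three similar groups (values made up)
mother = [14, 15, 13, 16, 14, 15]
father = [14, 13, 15, 14, 16, 13]
other = [15, 14, 14, 13, 15, 14]

F, p = f_oneway(mother, father, other)
# Similar group means -> small F ratio and a large p-value,
# i.e. no significant difference between the groups
print(F, p)
```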
One Way ANOVA with example
1. One Way ANOVA tests the relationship between a categorical predictor and a continuous
response.
2. Here we check whether the groups of the categorical feature have equal means with
respect to the continuous response.
3. If the group means are equal, this feature has no impact on the response and need not
be considered for model training.
Let’s consider a school dataset having data about student’s performance. We have to
predict the final grade of the student based on features like age, guardian, study time,
failures, activities, etc.
Using One Way ANOVA, let us determine whether the guardian has any impact on the
final grade. Below is the data.
Student final grades by the guardian
We can see guardian ( mother, father, other ) as columns and student final grade in
rows.
Steps to perform One Way ANOVA
1. Define Hypothesis
2. Calculate the Sum of Squares
3. Determine degrees of freedom
4. F-value
5. Accept or Reject the Null Hypothesis
Define Hypothesis
H0: All levels or groups in guardian have equal means.
H1: At least one group mean is different.
Calculate the Sum of Squares
The sum of squares is a statistical technique used to measure the dispersion of data
points. It measures deviation from the mean and can be written as

SS = Σ (x − x̄)²
As stated in ANOVA, we have to perform an F-test to check whether there is any difference
between the groups, by comparing the variance between the groups with the variance
within the groups. This can be done using sums of squares, with the following definitions.
Total Sum of Squares
The distance between each observed point x and the grand mean x̄ is x − x̄. If you
calculate this distance for each data point, square each distance, and add up all the
squared distances, you get

SST = Σ (xᵢ − x̄)²
Between-Groups Sum of Squares
The distance between each group mean g and the grand mean x̄ is g − x̄. Proceeding as
for the total sum of squares, but weighting each squared distance by the group size n,
we get

SSB = Σ nⱼ (gⱼ − x̄)²
Within-Groups Sum of Squares
The distance between each observed value x within a group and its group mean g is
x − g. Proceeding as for the total sum of squares, we get

SSW = Σⱼ Σᵢ (xᵢⱼ − gⱼ)²

The total sum of squares = Between sum of squares + Within sum of squares, i.e.,
SST = SSB + SSW.
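The decomposition of the total sum of squares can be verified numerically. A minimal sketch with made-up group samples (the group names match the example, but the values are hypothetical):

```python
import numpy as np

# Hypothetical grade samples per guardian group (values made up)
groups = {
    "mother": np.array([14.0, 15.0, 13.0, 16.0]),
    "father": np.array([12.0, 11.0, 13.0, 12.0]),
    "other": np.array([15.0, 16.0, 14.0, 17.0]),
}

all_x = np.concatenate(list(groups.values()))
grand_mean = all_x.mean()

sst = ((all_x - grand_mean) ** 2).sum()                                    # total
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())  # between
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups.values())            # within

print(sst, ssb + ssw)  # the two sides of SST = SSB + SSW should match
```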
Determine degrees of freedom
We already discussed the definition of degrees of freedom; now we will calculate it for
between groups and within groups.
1. Since we have 3 groups ( mother, father, other) degrees of freedom for Between
groups can be given as (3–1) = 2.
2. Having 18 samples in each group, Degrees of freedom for within groups will be the
sum of degrees of freedom of all groups that is (18–1) + (18–1) + (18–1) = 51.
F-value
Since we are comparing the variance between the groups with the variance within the
groups, the F value is given as

F = (SSB / df_between) / (SSW / df_within)

Calculating the sums of squares and the F value, here is the summary.
ANOVA table
Accept or reject the Null Hypothesis
With 95% confidence (alpha = 0.05), df1 = 2, and df2 = 51, the critical F value from the
F table is 3.179, while the calculated F value is 18.49.
F test
In the above figure, we see that the calculated F value falls in the rejection region,
beyond the critical value. So we reject the Null Hypothesis.
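The F-table lookup can be reproduced with scipy's F-distribution object, as a quick sketch:

```python
from scipy.stats import f

# Critical F value at alpha = 0.05 with df1 = 2 and df2 = 51
critical = f.ppf(0.95, 2, 51)
print(round(critical, 3))  # ~3.179, matching the F-table value in the text

# The calculated F value of 18.49 exceeds it, so H0 is rejected
print(18.49 > critical)
```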
To conclude, as the null hypothesis is rejected, variance exists between the groups,
which indicates that the guardian has an impact on the student's final score. So we will
include this feature for model training.
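The whole feature-selection decision can be scripted end to end. A sketch assuming a pandas DataFrame with hypothetical guardian and final_grade columns (the column names and values are made up, not the real school dataset):

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical student data (column names and values assumed for illustration)
df = pd.DataFrame({
    "guardian": ["mother", "mother", "father", "father", "other", "other",
                 "mother", "father", "other", "mother", "father", "other"],
    "final_grade": [15, 16, 10, 11, 13, 12, 17, 9, 14, 16, 10, 13],
})

# Group the continuous response by each level of the categorical feature
samples = [g["final_grade"].values for _, g in df.groupby("guardian")]
F, p = f_oneway(*samples)

# Keep the feature only if the group means differ significantly
keep = p < 0.05
print(F, p, keep)
```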
Using One Way ANOVA, we can check a single predictor against the response and
determine the relationship. But what if you have two predictors? Then we use Two Way
ANOVA, and if there are more than two features, we go for multi-factor ANOVA.
Using two-way or multi-factor ANOVA, we can check relationships on a response such as
1. Does the guardian impact the final student grade?
2. Do the student activities impact the final student grade?
3. Do the guardian and student activities together impact the final grade?
Drawing the above conclusions with a single test is always interesting, right ?? I am on
the way to writing an article on two-way and multi-factor ANOVA and will make it even
more interesting.
Here we dealt with a continuous response and a categorical predictor. If both the
response and the predictor are categorical, please check my article Chi-Square Test for
Feature Selection in Machine Learning.
Chi-Square Test for Feature Selection in Machine learning
We always wonder where the Chi-Square test is useful in machine
learning and how this test makes difference.Feature…
towardsdatascience.com
Hope you enjoyed it !! Stay tuned !!! Please do comment on any queries or suggestions !!!!