ANOVA Test in Python
This tutorial covers a data analysis topic: we will discuss the Analysis of Variance
(ANOVA) in detail, along with the process of carrying it out in the Python programming
language. ANOVAs are commonly used in psychology studies.
We will see how to carry out an ANOVA with the help of the SciPy library, by computing
it "by hand" in Python, and by using pyvttbl and Statsmodels.
The ANOVA test checks whether there is a difference in the means somewhere in the model
(that is, whether there is an overall effect); however, it does not tell us where the
difference lies (if there is one). We can locate the difference between the groups by
conducting post hoc tests.
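However, in order to perform any test, we first have to define the null and alternative
hypotheses:
Null hypothesis: the means of all the groups are equal.
Alternative hypothesis: at least one group mean differs from the others.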
We perform an ANOVA test by comparing two types of variation: the variation between the
sample means and the variation within each of the samples. The one-way ANOVA test
statistic is built from these two quantities.
The output of the ANOVA formula, the F statistic (also known as the F-ratio), compares
the variability among the samples with the variability within the samples across
multiple sets of data.
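We can write the formula for the one-way ANOVA test as follows:
F = MST / MSE
Where,
F = the ANOVA test statistic (the F-ratio)
MST = the mean sum of squares due to treatment (variation between the sample means)
MSE = the mean sum of squares due to error (variation within the samples)
An ANOVA table conventionally reports these components in the following format, where k
is the number of groups and N is the total number of observations:
Source of variation | Sum of squares | Degrees of freedom | Mean square | F
Between groups (treatment) | SS(treatment) | k - 1 | MST = SS(treatment) / (k - 1) | MST / MSE
Within groups (error) | SS(error) | N - k | MSE = SS(error) / (N - k) |
Total | SS(treatment) + SS(error) | N - 1 | |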
Usually, if the p-value associated with the F statistic is smaller than 0.05, the null
hypothesis is rejected in favor of the alternative hypothesis. When the null hypothesis
is rejected, we can say that the means of the groups are not all equal.
Note: If no real difference is present among the tested groups (that is, the null
hypothesis holds), the F-ratio of the ANOVA test will be close to 1.
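The F-ratio described above can be computed directly from the between-group and within-group mean squares. The following is a minimal sketch rather than a definitive implementation: the helper name one_way_f and the three sample lists are made up for illustration, and the result is cross-checked against scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats


def one_way_f(groups):
    """Return (F, p) for a one-way ANOVA computed from sums of squares.

    Hypothetical helper written for this illustration.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Variation between the sample means (treatment)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ms_between = ss_between / (k - 1)

    # Variation within each sample (error)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_within = ss_within / (n_total - k)

    f_stat = ms_between / ms_within
    # Upper-tail probability of the F distribution with (k-1, N-k) degrees of freedom
    p_value = stats.f.sf(f_stat, k - 1, n_total - k)
    return f_stat, p_value


# Made-up samples, used only to exercise the function
group_a = [4.1, 5.0, 5.6, 4.8, 5.2]
group_b = [5.9, 6.3, 5.5, 6.8, 6.1]
group_c = [4.9, 5.1, 5.4, 5.0, 5.3]

f_hand, p_hand = one_way_f([group_a, group_b, group_c])
f_scipy, p_scipy = stats.f_oneway(group_a, group_b, group_c)

print("by hand:", f_hand, p_hand)
print("SciPy:  ", f_scipy, p_scipy)
```

Both approaches should agree up to floating-point error, which makes the hand computation a convenient way to check one's understanding of the formula.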
The ANOVA test relies on the following assumptions:
1. The observations are obtained randomly and independently from the population defined
by the factor levels.
2. The data for every level of the factor is normally distributed.
3. Case Independence: The sample cases must be independent of each other.
4. Variance Homogeneity: The variance within each of the groups needs to be
approximately equal.
We can test the assumption of variance homogeneity with tests like the Brown-Forsythe
Test or Levene's Test. We can check the normality of the score distributions with
histograms, kurtosis or skewness values, a Q-Q plot, or tests like Kolmogorov-Smirnov
and Shapiro-Wilk. The assumption of independence is assessed from the study design.
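As a rough sketch of how these checks might look in SciPy, the snippet below runs Levene's test, its median-centered Brown-Forsythe variant, and the Shapiro-Wilk test. The three groups are randomly generated stand-in data (an assumption made for this example), not the data used later in this tutorial.

```python
import numpy as np
from scipy import stats

# Randomly generated stand-in groups (assumption: three groups of 30 observations)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mean, scale=2.0, size=30) for mean in (70.0, 72.0, 75.0)]

# Homogeneity of variance:
# center="mean" gives the original Levene test,
# center="median" gives the Brown-Forsythe variant.
lev_stat, lev_p = stats.levene(*groups, center="mean")
bf_stat, bf_p = stats.levene(*groups, center="median")
print("Levene p-value:        ", lev_p)
print("Brown-Forsythe p-value:", bf_p)

# Normality within each group (Shapiro-Wilk)
shapiro_p = []
for index, group in enumerate(groups, start=1):
    w_stat, p_value = stats.shapiro(group)
    shapiro_p.append(p_value)
    print("Group", index, "Shapiro-Wilk p-value:", p_value)
```

In each case, a p-value above the chosen alpha (for example 0.05) means the test found no evidence against the corresponding assumption.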
It is worth noting that the ANOVA test is not robust to violations of the assumption
of independence. In contrast, even if the assumptions of normality or homogeneity are
violated, we can often still conduct the test and trust the findings.
The outputs of the ANOVA test are, however, unacceptable if the assumption of
independence is violated. Usually, the analysis is considered robust to violations of
homogeneity if we have equal-sized groups, and proceeding with the ANOVA test despite
violations of normality is usually fine if we have a large sample size.
Understanding the Types of ANOVA Tests
The ANOVA Tests can be classified into three major types. These types are shown below:
An Analysis of Variance Test that has only one independent variable is known as the
One-way ANOVA Test.
For instance, we can assess the differences in the cases of Coronavirus by Country,
where Country is a single independent variable with multiple categories to compare.
An Analysis of Variance Test that has two independent variables is known as a Two-way
ANOVA test. This test is also known as a Factorial ANOVA Test.
For example, expanding the above example, a two-way ANOVA can examine the difference
in the cases of Coronavirus (the dependent variable) by Age Group (the first independent
variable) and Gender (the second independent variable). The two-way ANOVA can also be
used to examine the interaction between these two independent variables.
An interaction means that the effect of one independent variable is not the same across
all classes of the other independent variable.
For example, the older age group may have more cases of Coronavirus overall than the
younger age group; however, the size of this difference could vary between countries in
Europe and countries in Asia.
An ANOVA Test will provide us a single (univariate) F-value; however, a MANOVA Test
will provide us a multivariate F-value.
The two-way ANOVA test with Replication is carried out when two groups and the members
of those groups are performing multiple tasks.
For instance, suppose that a vaccine for Coronavirus is still under development. Doctors are
performing two different treatments in order to cure two groups of patients infected by the
virus.
The two-way ANOVA test without Replication is carried out when we have only one group,
and we are double-testing that same group.
For instance, suppose that the vaccine has been developed successfully, and the researchers
are testing one set of volunteers before and after they have been vaccinated in order to
observe whether the vaccination is working properly or not.
Because the ANOVA test does not tell us which groups differ, the researcher uses post
hoc tests in order to check which groups are different from each other.
Post hoc tests are essentially t-tests that inspect the mean differences among the
groups. Several multiple comparison procedures are available to control the Type I
error rate, including the Bonferroni, Dunnett, Scheffé, and Tukey tests.
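To make the idea of controlling the Type I error rate concrete, here is a small sketch of Bonferroni-corrected pairwise t-tests. The helper bonferroni_pairwise and the three samples are hypothetical and exist only for this illustration; the correction simply multiplies each raw p-value by the number of comparisons.

```python
from itertools import combinations

from scipy import stats


def bonferroni_pairwise(groups, alpha=0.05):
    """Run all pairwise t-tests and apply the Bonferroni correction.

    `groups` maps a group label to a list of observations.
    Hypothetical helper written for this illustration.
    """
    pairs = list(combinations(sorted(groups), 2))
    results = []
    for first, second in pairs:
        t_stat, p_raw = stats.ttest_ind(groups[first], groups[second])
        # Bonferroni: scale the p-value by the number of comparisons, capped at 1
        p_adj = min(1.0, p_raw * len(pairs))
        results.append((first, second, p_raw, p_adj, bool(p_adj < alpha)))
    return results


# Made-up samples, used only to exercise the function
samples = {
    "diet1": [3.0, 3.5, 2.8, 3.2, 3.1],
    "diet2": [3.4, 3.9, 3.0, 3.6, 3.3],
    "diet3": [5.9, 6.2, 6.0, 5.7, 6.1],
}

results = bonferroni_pairwise(samples)
for first, second, p_raw, p_adj, reject in results:
    print(first, "vs", second, "raw p:", p_raw, "adjusted p:", p_adj, "reject:", reject)
```

Bonferroni is the simplest of the listed procedures and tends to be conservative; the Tukey test is usually preferred when all pairwise comparisons are of interest.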
In the rest of this tutorial, we will work through only the one-way ANOVA test using the
Python programming language.
In order to begin working with the ANOVA test, let us import some necessary libraries and
modules for the project.
Syntax:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import seaborn as sns
import numpy as np
import pandas.tseries
plt.style.use('fivethirtyeight')
The Hypothesis
"For every diet, the mean of the people's weights is the same."
In the following problem, we will use a Diet dataset designed by the University of
Sheffield. The dataset contains a binary gender variable, coded as 1 for Male and 0 for
Female.
Syntax:
mydata = pd.read_csv('Diet_Dataset.csv')
Once we have successfully imported the dataset, let us print some data to get a sense of it.
Example -
print(mydata.head())
Output:
Now let us print the total number of rows present in the dataset.
Example -
Now, we have to check whether any values are missing in the dataset. We can do this
by using the following syntax.
Example -
print(mydata.gender.unique())
# displaying the person(s) having missing value in gender column
print(mydata[mydata.gender == ' '])
Output:
We can observe that two entries contain missing values in the 'gender' column.
Now let us find the total percentage of missing values in the dataset.
Example -
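A minimal sketch of this calculation is shown here on a small made-up DataFrame rather than on the real dataset. It assumes, as in the check above, that a missing gender is stored as a single space; the column values themselves are invented for illustration.

```python
import pandas as pd

# Small made-up stand-in for the dataset; ' ' marks a missing gender
df = pd.DataFrame({
    "gender": [" ", "0", "1", "0", " ", "1", "0", "1"],
    "Diet": [2, 1, 1, 3, 2, 3, 1, 2],
    "weight6weeks": [60.0, 54.2, 71.5, 58.9, 103.0, 68.4, 62.1, 75.3],
})

n_rows = len(df)

# A value counts as missing if it is empty after stripping whitespace
missing_mask = df["gender"].str.strip() == ""
n_missing = int(missing_mask.sum())
pct_missing = 100 * n_missing / n_rows

print("Rows:", n_rows)
print("Missing gender values:", n_missing)
print("Percentage missing: {:.2f}%".format(pct_missing))
```

On the real data, the same expression applied to mydata would give the share of rows with a missing gender.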
Output:
In the following step, we will plot a graph using the distplot() function to understand
the weight distribution in the sample data. Let us consider the following snippet of code.
Example -
Output:
We can also plot a distribution plot for each gender in the dataset. The syntax for this
is as follows:
Example -
Output:
We can also use the following function to display the distribution plot for each gender.
Example:
def infergender(x):
    # Map the dataset's gender codes (stored as strings) to readable labels
    if x == '1':
        return 'Male'

    if x == '0':
        return 'Female'

    return 'Other'

def showdistribution(df, gender, column, group):
    # Overlay one weight distribution per group member on a single figure
    f, ax = plt.subplots(figsize=(11, 9))
    plt.title('Weight Distribution for {} on each {}'.format(gender, column))
    for groupmember in group:
        # Note: distplot() is deprecated since seaborn 0.11; histplot(..., kde=True) replaces it
        sns.distplot(df[df[column] == groupmember].weight6weeks, label='{}'.format(groupmember))
    plt.legend()
    plt.show()

uniquediet = mydata.Diet.unique()
uniquegender = mydata.gender.unique()

for gender in uniquegender:
    # Skip rows whose gender is missing (stored as a single space)
    if gender != ' ':
        showdistribution(mydata[mydata.gender == gender], infergender(gender), 'Diet', uniquediet)
Output:
Graph 1:
Graph 2:
Now, we will calculate the mean, median, non-zero count, and standard deviation according
to the 'gender' column using the snippet of code given below:
Example -
# Select the weight column first so that only it is aggregated
print(mydata.groupby('gender')['weight6weeks'].agg(
    ['mean', 'median', np.count_nonzero, 'std']
))
Output:
As we can observe, we have estimated the required statistical measurements on the basis of
gender. We can also classify these statistical measurements on the basis of gender as well as
diet.
Example -
# Same statistics, now grouped by both gender and diet
print(mydata.groupby(['gender', 'Diet'])['weight6weeks'].agg(
    ['mean', 'median', np.count_nonzero, 'std']
))
Output:
We can observe that there is a slight difference in weight between the diets for females;
however, the diet doesn't seem to affect males.
The ANOVA test checks whether the hypothesis stated earlier (the mean weight is the same
for every diet) holds.
We first set a confidence level of 95%, which implies that we accept an error rate of
only 5% (alpha = 0.05).
Example -
Output:
In the case of males, we cannot reject the null hypothesis at the 95% confidence level
because the p-value is larger than the value of alpha, i.e., 0.05 < 0.512784. Thus, no
statistically significant difference is found in the weights of males across the three
types of diet.
In the case of females, since the p-value PR(>F) is below the error rate, i.e., 0.05 >
0.010566, we can reject the null hypothesis. This indicates that we can be fairly
confident that there is a difference in weight for females across the diets.
So, we now know that diet has an effect for females; however, we do not yet know which
diets differ from each other. To find out, we will perform a post hoc analysis with the
help of the Tukey HSD (Honestly Significant Difference) test.
Example -
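A Tukey HSD comparison can be sketched with the pairwise_tukeyhsd function from Statsmodels, as shown below. The weights are made up purely to demonstrate the call, so the pattern of rejections printed here says nothing about the real dataset.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up weights for three diets (assumption, not the real dataset)
weights = {
    1: [60.1, 61.3, 59.8, 60.7, 61.0, 60.4],
    2: [62.0, 63.1, 61.5, 62.4, 62.9, 61.8],
    3: [66.2, 67.0, 65.8, 66.5, 67.3, 66.1],
}
df = pd.DataFrame(
    [(diet, weight) for diet, values in weights.items() for weight in values],
    columns=["Diet", "weight6weeks"],
)

# Compare every pair of diets while controlling the family-wise error rate
result = pairwise_tukeyhsd(
    endog=df["weight6weeks"].to_numpy(),
    groups=df["Diet"].to_numpy(),
    alpha=0.05,
)
print(result.summary())
```

Each row of the summary corresponds to one pair of groups, and the reject column indicates whether the null hypothesis of equal means is rejected for that pair.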
Output:
As we can observe from the above output, we can reject the null hypothesis only for the
comparison between the 1st and 3rd types of diet, which means that a statistically
significant difference in weight is present between diet 1 and diet 3.