A Complete Guide To Hypothesis Testing For Data Scientists Using Python - by Rashida Nasrin Sucky - Oct, 2020 - Towards Data Science
https://towardsdatascience.com/a-complete-guide-to-hypothesis-testing-for-data-scientists-using-python-69f670e6779e 1/14
19/10/2020 A Complete Guide to Hypothesis Testing for Data Scientists Using Python | by Rashida Nasrin Sucky | Oct, 2020 | Towards Data S…
Hypothesis testing is an important part of statistics and data analysis. Most of the time
it is practically not possible to take data from a total population. In that case, we take a
sample and make estimations or claims about the total population. These assumptions
or claims are hypotheses. Hypothesis testing is the process to test if there is evidence to
reject that hypothesis.
In this article, we are going to cover the hypothesis testing of the population
proportion, the difference in population proportion, population or sample mean and
the difference in the sample mean.
I will explain the process of hypothesis testing step by step for all the four categories
individually with examples.
I used a Jupyter Notebook environment for this exercise. If you do not have one, feel
free to use any notebook or IDE of your choice.
A Google Colab notebook will work well too. Google Colab is a hosted notebook, and the
common libraries used here are preinstalled in it.
This is the most basic hypothesis test. Most of the time we do not have a specific
fixed value for comparison, but when we do, this is the simplest test to run. I
am going to start with a one-proportion hypothesis test.
I used the Heart dataset from Kaggle for this demonstration. Please feel free to
download the dataset for your practice. Here I import the packages and the dataset:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats.distributions as dist

df = pd.read_csv('Heart.csv')
df.head()
Source: Author
The last column of the dataset is ‘AHD’. It indicates whether the person has heart disease. The
research question for this section is,
“The population proportion of Ireland having heart disease is 42%. Are more
people suffering from heart disease in the US?”
In this problem, the null hypothesis is that the population proportion having heart disease
in the US is less than or equal to 42%. If we can reject the boundary case of exactly 42% in
favor of a larger proportion, every smaller value is rejected as well. So, I state the null
hypothesis as equal to 42%.
The alternative hypothesis is that the population proportion of the US having heart
disease is more than 42%.
Let’s see if we can find the evidence to reject the null hypothesis.
Step 2: Assume that the dataset above is a representative sample from the population
of the US. So, calculate the sample proportion having heart disease, which is our
estimate of the US population proportion.
p_us = len(df[df['AHD']=='Yes'])/len(df)
The proportion of the sample having heart disease is 0.46 or 46%. This
is more than the 42% stated in the null hypothesis.
But the question is whether it is significantly more than 42%. If we took a different simple
random sample, the observed sample proportion (46%) could be different.
To find out if the observed sample proportion is significantly more than the null
hypothesis value, perform a hypothesis test.
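The test statistic is (best estimate - hypothesized estimate) / standard error, where the
standard error under the null hypothesis is sqrt(p0 * (1 - p0) / n).
In this formula, p0 is 0.42 (according to the null hypothesis) and n is the
sample size. Now calculate the standard error and the test statistic: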
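The standard error and test statistic described here can be sketched as follows. This is a minimal illustration rather than the author's exact code: the function name one_proportion_z and the sample size n=100 in the usage line are assumptions made for the example, while p0 = 0.42 and the 0.46 sample proportion come from the text.

```python
import math

def one_proportion_z(p_hat, p0, n):
    """Return (standard error under the null, z test statistic)."""
    # Null standard error: uses the hypothesized proportion p0, not p_hat.
    se = math.sqrt(p0 * (1 - p0) / n)
    # Test statistic: (best estimate - hypothesized estimate) / standard error.
    z = (p_hat - p0) / se
    return se, z

# Illustrative call: n=100 is a made-up sample size, not the size of Heart.csv.
se_demo, z_demo = one_proportion_z(p_hat=0.46, p0=0.42, n=100)
print(round(se_demo, 4), round(z_demo, 4))
```

With the real data, n would be len(df) and p_hat would be p_us.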
Find the test statistic using the formula above:
# Best estimate
be = p_us
# Hypothesized estimate
he = 0.42
# Standard error under the null hypothesis
se = np.sqrt(he * (1 - he) / len(df))
test_stat = (be - he)/se
This test statistic is also called the z-score. It measures how many null standard errors
the sample proportion (46% or 0.46) lies above the hypothesized value. You can find the
p-value from a z-table or compute it in Python:
pvalue = 2*dist.norm.cdf(-np.abs(test_stat))
The p-value is 0.1718. This formula gives a two-sided p-value. Because the alternative
hypothesis here is one-sided (more than 42%) and the test statistic is positive, the
one-sided p-value is half of that, about 0.086.
Either way, the p-value is bigger than our chosen significance level of 0.05. So, we cannot
reject the null hypothesis. That means we do not have enough evidence that the proportion
having heart disease in the US is higher than the 42% reported for Ireland.
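Because the research question is directional, it can help to see how the choice of tail changes the p-value for the same z-score. Below is a minimal sketch using only the standard library. NormalDist().cdf plays the role of dist.norm.cdf, and the helper name p_value and the example z-score of 1.5 are assumptions for illustration.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """p-value of a z statistic for 'two-sided', 'larger', or 'smaller'."""
    cdf = NormalDist().cdf  # standard normal CDF
    if alternative == "two-sided":
        return 2 * cdf(-abs(z))
    if alternative == "larger":    # Ha: parameter > hypothesized value
        return 1 - cdf(z)
    if alternative == "smaller":   # Ha: parameter < hypothesized value
        return cdf(z)
    raise ValueError("alternative must be 'two-sided', 'larger', or 'smaller'")

z_example = 1.5  # arbitrary z-score chosen for the demo
two_sided = p_value(z_example)
upper_tail = p_value(z_example, "larger")
print(two_sided, upper_tail)
```

For a positive z-score, the upper-tail p-value is exactly half of the two-sided one.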
Here, we are going to test if the population proportion of females with heart
disease is different from the population proportion of males with heart disease.
Step 1: Set up the null hypothesis, alternative hypothesis, and significance level.
Here, we want to check if there is any difference between the population proportion of
males and females having heart disease. We will start with the assumption that there is
no difference.
Ho: p1 - p2 = 0
This is our null hypothesis. Here, p1 is the population proportion of females with heart
disease and p2 is the population proportion of males having heart disease.
Ha: p1 - p2 != 0
Step 2: Prepare a chart that shows the population proportion of males and females
with heart disease and the total male and female population.
Image by Author
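A table like the one described in Step 2 could be built along these lines. This is a sketch under assumptions: the toy rows are made up (they are not the Heart data), the helper name proportion_table is hypothetical, and the column names 'Gender' and 'AHD' with values 'Female'/'Male' and 'Yes'/'No' are taken from how the article's code refers to them.

```python
import pandas as pd

# Toy stand-in for the Heart data: made-up rows, NOT the real dataset.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "AHD":    ["Yes",    "No",     "No",     "Yes",  "Yes",  "No",   "No"],
})

def proportion_table(frame, group_col="Gender", outcome_col="AHD", positive="Yes"):
    """Per-group proportion with the positive outcome, plus the group size."""
    has_outcome = frame[outcome_col] == positive
    table = has_outcome.groupby(frame[group_col]).agg(["mean", "size"])
    return table.rename(columns={"mean": "proportion", "size": "total_pop"})

summary = proportion_table(toy)
print(summary)
```

The 'proportion' column gives the two sample proportions, and 'total_pop' gives n1 and n2 for the standard error.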
We will use the same formula for the test statistic as before. The best estimate is
p1 - p2, the difference between the sample proportions of females and males with
heart disease.
The standard error for the difference between two proportions is calculated with the formula below:
se = sqrt(p * (1 - p) * (1/n1 + 1/n2))
Here, p is the overall proportion of the sample with heart disease. n1 and n2
are the numbers of females and males in the sample.
Now, use this standard error and calculate the test statistic.
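The whole two-proportion calculation can be sketched as one function. This is an illustration under stated assumptions, not the author's code: the function name two_proportion_ztest and the counts in the usage line are made up, and the standard error uses the pooled proportion described above.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for p1 - p2 = 0 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)          # overall proportion in the sample
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se                        # hypothesized difference is 0
    p_val = 2 * NormalDist().cdf(-abs(z))     # both tails
    return z, p_val

# Made-up counts for illustration only (not taken from Heart.csv).
z_demo, p_demo = two_proportion_ztest(x1=40, n1=100, x2=30, n2=100)
print(round(z_demo, 3), round(p_demo, 3))
```

Here x1 and x2 are the numbers of females and males with heart disease, and n1 and n2 are the group sizes.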
The calculated test_statistic is -0.296. That means that the observed difference in
sample proportions is 0.296 estimated standard errors below the hypothesized value.
pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))
The p-value is 0.7675. That means that if the null hypothesis is true, we would see a
difference at least as extreme as the one we observed more than 76% of the time.
In other words, the p-value is bigger than the significance level (0.1). So, we do not
have enough evidence to reject the null hypothesis.
The population proportion of males with heart disease is not significantly different
from the population proportion of females with heart disease.
“Check if the mean RestBP is greater than 135”. Here, RestBP is resting blood
pressure. We have a RestBP column in the DataFrame. Let’s solve this problem step by
step.
We need to find out if the mean RestBP is greater than 135. Let’s assume that the mean
RestBP is less than or equal to 135.
So, the null hypothesis can be that the mean RestBP is 135. Because if we can show
that the mean RestBP is greater than 135, it is automatically greater than 134 or 130.
If we find enough evidence to reject the null hypothesis, we can conclude that the mean
RestBP is greater than 135. This is the alternative hypothesis for this example.
Ho: mu = 135
Ha: mu > 135
We will check if we can reject the null hypothesis using a significance level of 0.05.
I collected this dataset from Kaggle. I was not involved in collecting the data. For
demonstration purposes, just assume that this is a simple random sample. To check the
second assumption (that the data are approximately normal), plot the data and have a look
at the distribution.
import seaborn as sns
sns.histplot(df.RestBP, kde=True)
Image by Author
The good news is, we do not need to worry about the normality of the data, because we
have a large enough sample size (more than 25 observations).
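The reason a large sample relaxes the normality requirement is that sample means settle around the population mean with a spread close to sigma/sqrt(n), even when the data are skewed. The simulation below is a small sketch of that idea. Everything in it is an assumption for illustration: the Exponential(1) population, the sample size of 30, and the 2000 repetitions have nothing to do with the Heart data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, reps):
    """Means of `reps` samples of size n from a skewed Exponential(1) population."""
    return [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

means = sample_means(n=30, reps=2000)
center = statistics.fmean(means)   # should sit near the population mean, 1.0
spread = statistics.stdev(means)   # should sit near 1/sqrt(30), about 0.18
print(round(center, 2), round(spread, 2))
```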
std = df.RestBP.std()
n = len(df)
se = std/np.sqrt(n)
# Best estimate
be = df.RestBP.mean()
# Hypothesized estimate
he = 135
test_statistic = (be - he)/se
The test statistic came out to be -3.27. Look at the formula for the test statistic. The
numerator measures the distance between the sample mean and the hypothesized mean, and the
denominator is the standard error.
So, this test_statistic means the sample mean is 3.27 standard errors below the
hypothesized mean.
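pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))
This two-sided p-value is 0.001, which is less than the significance level (0.05).
If the null hypothesis were true, a sample mean at least this far from 135 in either
direction would occur only about 0.1% of the time, so the mean RestBP does appear to
differ from 135.
Note the direction, though. The test statistic is negative, so the sample mean lies below
135. For the one-sided alternative mu > 135, the p-value is the upper-tail probability,
1 - dist.norm.cdf(test_statistic), which is close to 1. So, based on this sample data, we
cannot reject the null hypothesis in favor of mu > 135. The evidence points to a mean
RestBP below 135.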
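For a directional question like this one, the tail used for the p-value matters as much as the test statistic. The sketch below shows a one-sample z-test with an explicit alternative. It is an illustration under assumptions: the function name one_sample_mean_ztest and the readings are made up (they are not the RestBP column), and NormalDist().cdf stands in for dist.norm.cdf.

```python
import math
from statistics import NormalDist, mean, stdev

def one_sample_mean_ztest(values, mu0, alternative="larger"):
    """z-test of the mean of `values` against mu0; returns (z, p-value)."""
    se = stdev(values) / math.sqrt(len(values))   # sample std / sqrt(n)
    z = (mean(values) - mu0) / se
    cdf = NormalDist().cdf
    if alternative == "larger":       # Ha: mu > mu0
        return z, 1 - cdf(z)
    if alternative == "smaller":      # Ha: mu < mu0
        return z, cdf(z)
    return z, 2 * cdf(-abs(z))        # any other value: two-sided

# Made-up readings for illustration: 40 values cycling through 125..134.
readings = [125 + (i % 10) for i in range(40)]
z_demo, p_upper = one_sample_mean_ztest(readings, mu0=135)
_, p_two_sided = one_sample_mean_ztest(readings, mu0=135, alternative="two-sided")
print(round(z_demo, 2), p_upper, p_two_sided)
```

When the sample mean falls below the hypothesized value, the two-sided p-value can be tiny while the upper-tail p-value stays near 1, so only the two-sided test rejects.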
males.
As a null hypothesis, start with the claim that the mean RestBP of females and the
mean RestBP of males are the same. So the difference between these two means will be
zero.
The alternative hypothesis is that these two means are not the same. Let’s perform the test
with a 10% significance level.
Both the male and female groups have enough observations in this dataset. So,
checking for the normality of the data is not required.
The formula for the test statistic is the same as before. But the formula for the standard
error is different:
se = sqrt(s1^2/n1 + s2^2/n2)
Here s1 and s2 are the sample standard deviations of the female and male groups
respectively. n1 and n2 are the sample sizes of the female and male groups. Now,
calculate the standard error:
pop_fe = df[df.Gender=='Female'].dropna()
pop_male = df[df.Gender=='Male'].dropna()
std_fe = pop_fe.RestBP.std()
std_male = pop_male.RestBP.std()
se = np.sqrt(std_fe**2/len(pop_fe) + std_male**2/len(pop_male))
The test_statistic is 1.086. For reference, the observed difference in means,
‘mu_diff’, is 2.52.
As we are testing if the means are different from each other, this is a two-tailed test.
The p-value is the probability that the test statistic is either less than -1.086 or greater
than 1.086.
pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))
The p-value is about 0.277. That means there is approximately a 27.7% probability of seeing
a result at least as extreme as the observed one when the null hypothesis is true.
In other words, the p-value is much bigger than the significance level. So, we fail
to reject the null hypothesis.
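The two-sample calculation can be sketched from summary statistics alone. This is an illustration under assumptions: the function name two_mean_ztest and the summary numbers in the first call are made up, not computed from Heart.csv. The last lines evaluate the article's two-tailed formula at the reported test statistic of 1.086, which comes out at roughly 0.28.

```python
import math
from statistics import NormalDist

def two_mean_ztest(mean1, s1, n1, mean2, s2, n2):
    """Two-sided z-test for mu1 - mu2 = 0 from summary statistics."""
    mu_diff = mean1 - mean2
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # unpooled standard error
    z = mu_diff / se
    p_val = 2 * NormalDist().cdf(-abs(z))     # both tails
    return mu_diff, se, z, p_val

# Made-up summary statistics for illustration (not computed from Heart.csv).
mu_diff, se, z, p_val = two_mean_ztest(mean1=10, s1=3, n1=36, mean2=9, s2=4, n2=64)
print(round(z, 3), round(p_val, 3))

# Two-tailed p-value for the test statistic reported in the text.
p_reported = 2 * NormalDist().cdf(-abs(1.086))
print(round(p_reported, 3))
```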
The final inference is, based on the observed difference between the mean RestBP of
females and the mean RestBP of males, we cannot support the idea that there is a
significant difference between the two means.
Conclusion
I explained the four most common types of research questions in this article with
working examples. I hope you will be able to use hypothesis testing in decision making
from now on.