
A Complete Guide To Hypothesis Testing For Data Scientists Using Python - by Rashida Nasrin Sucky - Oct, 2020 - Towards Data Science

The document is a guide to hypothesis testing for data scientists using Python. It discusses hypothesis testing for one population proportion, the difference in population proportions, population or sample mean, and the difference in sample means. For each, it provides the steps to define the null and alternative hypotheses, calculate relevant statistics like the test statistic and p-value, and make a conclusion. Sample code is included to demonstrate a hypothesis test on whether more people in the US have heart disease compared to Ireland using a heart disease dataset.


19/10/2020 A Complete Guide to Hypothesis Testing for Data Scientists Using Python | by Rashida Nasrin Sucky | Oct, 2020


https://towardsdatascience.com/a-complete-guide-to-hypothesis-testing-for-data-scientists-using-python-69f670e6779e 1/14

Photo by Jaroslav Devia on Unsplash

A Complete Guide to Hypothesis Testing for Data Scientists Using Python

Explained Clearly with Sample Research Questions, Solution Steps, and Complete Codes

Rashida Nasrin Sucky 1 day ago · 11 min read

Hypothesis testing is an important part of statistics and data analysis. Most of the time
it is not practical to collect data from an entire population. In that case, we take a
sample and make estimates or claims about the total population. These claims are
hypotheses. Hypothesis testing is the process of checking whether the sample gives
enough evidence to reject a hypothesis.

Hypothesis testing is normally done on proportions and means.

In this article, we are going to cover hypothesis testing for one population
proportion, the difference between two population proportions, one population mean,
and the difference between two population means.

I will explain the process of hypothesis testing step by step for all four categories
individually, with examples.

I used a Jupyter Notebook environment for this exercise. If you do not have that, feel
free to use any notebook or IDE of your choice.

A Google Colab notebook will work well too. The common libraries used here are
preinstalled in it.

Hypothesis Testing for One Proportion


This is the most basic hypothesis test. Most of the time we do not have a specific
fixed value for comparison. But when we do, this is the simplest test to run. I
am going to start with a one-proportion hypothesis test.

I used the Heart dataset from Kaggle for this demonstration. Please feel free to
download the dataset for your practice. Here I import the packages and the dataset:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import scipy.stats.distributions as dist

df = pd.read_csv('Heart.csv')
df.head()

Source: Author

The last column of the dataset is ‘AHD’. It records whether the person has heart
disease. The research question for this section is,

“In Ireland, 42% of the population has heart disease. Is the proportion of
people with heart disease higher in the US?”

Now, let's answer this research question step by step.

Step 1: define the null hypothesis and alternative hypothesis.

In this problem, the null hypothesis is that the population proportion having heart
disease in the US is less than or equal to 42%. Testing against the boundary value of
exactly 42% also covers every smaller value, so I state the null hypothesis as equal to 42%.

And the alternative hypothesis is that the population proportion of the US having heart
disease is more than 42%.


Ho: p = 0.42 #null hypothesis
Ha: p > 0.42 #alternative hypothesis

Let’s see if we can find the evidence to reject the null hypothesis.

Step 2: Assume that the dataset above is a representative sample from the population
of the US. So, calculate the sample proportion having heart disease.

p_us = len(df[df['AHD']=='Yes'])/len(df)

The proportion of the sample having heart disease is 0.46 or 46%. This
percentage is more than the 42% stated in the null hypothesis.

But the question is whether it is significantly more than 42%. If we take a different
simple random sample, the observed sample proportion (46%) can be different.

To find out if the observed sample proportion is significantly more than the value in
the null hypothesis, perform a hypothesis test.
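To see why a single observed proportion cannot be taken at face value, the following minimal sketch repeatedly draws simple random samples from a made-up population and records each sample proportion. The population proportion (0.46), the sample size (303), and the number of repetitions are assumptions chosen only for illustration; they are not taken from the Heart dataset. Only the Python standard library is used.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_P = 0.46     # assumed population proportion (illustrative only)
N = 303           # assumed sample size (illustrative only)
N_SAMPLES = 1000  # how many independent samples to draw


def sample_proportion(p, n):
    """Draw one simple random sample of size n and return its share of 'Yes'."""
    return sum(random.random() < p for _ in range(n)) / n


props = [sample_proportion(TRUE_P, N) for _ in range(N_SAMPLES)]

print("smallest sample proportion:", min(props))
print("largest sample proportion: ", max(props))
print("spread (standard deviation):", statistics.stdev(props))
```

The spread printed at the end is an empirical version of the standard error used in the next step.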

Step 3: Calculate the Test Statistic:

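Here is the formula for the test statistic:

test statistic = (best estimate - hypothesized estimate) / standard error

We use this formula for the standard error:

SE = sqrt(p0 * (1 - p0) / n)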

In this formula, p0 is 0.42 (according to the null hypothesis) and n is the sample
size. Now calculate the standard error and the test statistic:

se = np.sqrt(0.42 * (1-0.42) / len(df))


Find the test statistic using the formula above:

#Best estimate
be = p_us
#hypothesized estimate
he = 0.42
test_stat = (be - he)/se

The test statistic comes out to 1.3665.

Step 4: Calculate the p-value

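This test statistic is also called the z-score. You can look up the p-value in a z-table
or compute it in Python. Because the alternative hypothesis is one-sided (p > 0.42), the
p-value is the upper-tail probability:

pvalue = dist.norm.sf(test_stat)

The p-value is about 0.0859. The two-sided formula,
2*dist.norm.cdf(-np.abs(test_stat)), gives twice that, 0.1718. Separately, the test
statistic tells us that the sample proportion (46% or 0.46) is 1.3665 null standard
errors above the hypothesized value of 0.42.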

Step 5: Infer the conclusion from the p-value

Consider the significance level alpha to be 5% or 0.05. A significance level of 5%
means that, when the null hypothesis is true, we accept a 5% chance of rejecting it by
mistake. We reject the null hypothesis only if the p-value is below alpha.
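What a 5% significance level controls can be checked by simulation. The sketch below generates many samples from a population in which the null hypothesis (p = 0.42) is exactly true and counts how often a one-sided z-test at alpha = 0.05 rejects it anyway. The sample size of 303 and the number of trials are illustrative assumptions, and `statistics.NormalDist` from the standard library stands in for scipy so the example runs without extra installs.

```python
import math
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the run is repeatable

P0 = 0.42        # proportion under the null hypothesis
N = 303          # assumed sample size (illustrative only)
ALPHA = 0.05     # significance level
N_TRIALS = 2000  # number of simulated studies

se = math.sqrt(P0 * (1 - P0) / N)  # standard error under the null

rejections = 0
for _ in range(N_TRIALS):
    # Simulate one sample from a population where the null is exactly true.
    p_hat = sum(random.random() < P0 for _ in range(N)) / N
    z = (p_hat - P0) / se
    p_value = 1 - NormalDist().cdf(z)  # one-sided: Ha is p > P0
    if p_value < ALPHA:
        rejections += 1

false_positive_rate = rejections / N_TRIALS
print("share of true null hypotheses rejected:", false_positive_rate)
```

The printed share should land near alpha, which is the sense in which the significance level is the accepted rate of false rejections.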

Here the p-value is bigger than our chosen significance level of 0.05. So, we cannot
reject the null hypothesis. That means this sample does not give enough evidence that
the population proportion having heart disease in the US is greater than the 42% in Ireland.
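The steps of this section can be collected into one small helper. This is a sketch under stated assumptions, not the code used in the article: the function name `one_proportion_ztest` and its `alternative` argument are hypothetical, the counts (139 "Yes" out of 303 rows, about 46%) are illustrative, and `statistics.NormalDist` replaces scipy. Since the alternative hypothesis here is one-sided (p > 0.42), the example asks for the upper-tail p-value.

```python
import math
from statistics import NormalDist


def one_proportion_ztest(successes, n, p0, alternative="larger"):
    """Return (z, p_value) for H0: p = p0 using the normal approximation."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    cdf = NormalDist().cdf
    if alternative == "larger":       # Ha: p > p0
        p_value = 1 - cdf(z)
    elif alternative == "smaller":    # Ha: p < p0
        p_value = cdf(z)
    elif alternative == "two-sided":  # Ha: p != p0
        p_value = 2 * cdf(-abs(z))
    else:
        raise ValueError("alternative must be 'larger', 'smaller', or 'two-sided'")
    return z, p_value


# Illustrative counts: 139 'Yes' out of 303 rows (about 46%).
z, p_value = one_proportion_ztest(139, 303, 0.42, alternative="larger")
print(f"z = {z:.4f}, one-sided p-value = {p_value:.4f}")
```

Passing `alternative="two-sided"` doubles the smaller tail, which is the right choice only when the research question asks about a difference in either direction.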

Hypothesis Tests for the Difference in Two Proportions


Comparative tests are conducted much more frequently than one-proportion
hypothesis tests. A two-sample test of proportions assesses whether
the population proportion of some trait differs between two subgroups.

Here, we are going to test if the population proportion of females with heart
disease is different from the population proportion of males with heart disease.

Step 1: Set up the null hypothesis, alternative hypothesis, and significance level.


Here, we want to check if there is any difference between the population proportion of
males and females having heart disease. We will start with the assumption that there is
no difference.

Ho: p1 - p2 = 0

This is our null hypothesis. Here, p1 is the population proportion of females with heart
disease and p2 is the population proportion of males having heart disease.

What could be the alternative hypothesis?

The alternative hypothesis can be, there is a difference.

Ha: p1 - p2 != 0

Let’s use the significance level of 0.1 or 10%.

Step 2: Prepare a table that shows the sample proportion of males and females
with heart disease and the total number of males and females in the sample.

df['Gender'] = df.Sex.replace({1: "Male", 0: "Female"})
p = df.groupby("Gender")['AHD'].agg([lambda z: np.mean(z=='Yes'), "size"])
p.columns = ["HeartDisease", 'Total']
p

Image by Author

Step 3: Calculate the test statistic

We will use the same formula for the test statistic as before. The best estimate is
p1 - p2. Here, p1 is the population proportion of females with heart disease and p2 is
the population proportion of males with heart disease.

#Best estimate is p1 - p2. Get p1 and p2 from the table p above
p_fe = p.HeartDisease.Female
p_male = p.HeartDisease.Male

The standard error for two population proportions is calculated with the formula below:

SE = sqrt(p * (1 - p) * (1/n1 + 1/n2))

Here, p is the overall proportion in the sample with heart disease (p_us from the
previous example). n1 and n2 are the numbers of females and males in the sample.

#overall sample proportion, calculated in the previous example
#(a separate name keeps the table p intact)
p_pooled = p_us
n1 = p.Total.Female
n2 = p.Total.Male
se = np.sqrt(p_pooled*(1-p_pooled)*(1/n1 + 1/n2))

Now, use this standard error and calculate the test statistic.

#Calculate the best estimate
be = p_fe - p_male
#Calculate the hypothesized estimate
#Our null hypothesis is p1 - p2 = 0
he = 0
#Calculate the test statistic
test_statistic = (be - he)/se

The calculated test_statistic is -0.296. That means the observed difference in
sample proportions is 0.296 estimated standard errors below the hypothesized value of 0.

Step 4: Calculate the p-value

pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))

The p-value is 0.7675. That means, if the null hypothesis is true, we would see a
difference at least as extreme as the one observed about 76.75% of the time.


In other words, the p-value is bigger than the significance level (0.1). So, we do not
have enough evidence to reject the null hypothesis.

The population proportion of males with heart disease is not significantly different
from the population proportion of females with heart disease.
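The two-proportion procedure above can be sketched as a single function that works from raw counts. The function name `two_proportion_ztest` and the counts (40 of 100 versus 45 of 110) are hypothetical and are not taken from the Heart dataset; `statistics.NormalDist` is used in place of scipy so the block has no third-party dependencies. The standard error uses the pooled proportion, as in the formula of this section.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 - p2 = 0 using the pooled standard error.

    x1, x2 are the counts with the trait; n1, n2 are the group sizes.
    Returns (z, p_value).
    """
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # overall proportion across both groups
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # both tails
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_ztest(40, 100, 45, 110)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

Working from counts rather than precomputed proportions keeps the pooled proportion consistent with the two group proportions by construction.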

Hypothesis Testing for One Mean


This is a simple hypothesis testing process. We can perform this test if we have a
specific fixed mean value to compare against. Let’s work on an example to understand
the process.

This is the research question:

“Check if the mean RestBP is greater than 135”. Here, RestBP is resting blood
pressure. We have a RestBP column in the DataFrame. Let’s solve this problem step by
step.

Step 1: State the hypothesis

We need to find out if the mean RestBP is greater than 135. We start by assuming the
opposite: the mean RestBP is less than or equal to 135.

So, the null hypothesis can be that the mean RestBP is 135. If the data give enough
evidence that the mean is greater than 135, they also give evidence that it is greater
than any smaller value such as 134 or 130.

If we find enough evidence to reject the null hypothesis, we can accept that the mean
RestBP is greater than 135. This is the alternative hypothesis for this example.

Ho: mu = 135
Ha: mu > 135

We will check if we can reject the null hypothesis using a significance level of 0.05.

Step 2: Check the assumptions

There are two assumptions:

1. The sample should be a simple random sample.

2. The data need to be normally distributed.



I collected this dataset from Kaggle. I was not involved in collecting the data. For the
demonstration purpose, just assume that this is a simple random sample. To check the
second assumption, plot the data, and have a look at the distribution.

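import seaborn as sns
#histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(df.RestBP, kde=True)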

Image by Author

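The distribution is not exactly normal. But it is close to normal.

The good news is, we do not need to worry much about the normality of the data, because
we have a large enough sample size (more than 25 observations). With a large sample,
the sample mean is approximately normally distributed even when the data are not.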
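The claim that a large sample makes the normality of the raw data less important can be illustrated with a simulation. The sketch below draws samples from a clearly skewed distribution (exponential with rate 1, an assumption made only for this illustration), standardizes each sample mean, and checks how often it falls within 1.96 standard errors of the true mean. If sample means behave approximately normally, that share should be close to 95%. The sample size of 303 and the number of repetitions are also illustrative.

```python
import math
import random

random.seed(2)  # fixed seed so the run is repeatable

N = 303           # assumed sample size (illustrative only)
N_SAMPLES = 2000  # number of simulated samples
TRUE_MEAN = 1.0   # mean of an exponential distribution with rate 1
TRUE_SD = 1.0     # standard deviation of the same distribution

se = TRUE_SD / math.sqrt(N)  # standard error of the sample mean

inside = 0
for _ in range(N_SAMPLES):
    # Skewed raw data: exponential, not normal.
    sample = [random.expovariate(1.0) for _ in range(N)]
    sample_mean = sum(sample) / N
    z = (sample_mean - TRUE_MEAN) / se
    if abs(z) <= 1.96:
        inside += 1

coverage = inside / N_SAMPLES
print("share of sample means within 1.96 standard errors:", coverage)
```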

Step 3: Calculate the test statistic

Here is the formula to calculate the test statistic:

test statistic = (sample mean - hypothesized mean) / standard error

First, calculate the standard error using the formula below:

SE = S / sqrt(n)

Here, S is the sample standard deviation and n is the sample size.


std = df.RestBP.std()
n = len(df)
se = std/np.sqrt(n)

Now, use this standard error to find the test statistic:

#Best estimate
be = df.RestBP.mean()
#Hypothesized estimate
he = 135
test_statistic = (be - he)/se

The test statistic comes out to -3.27. Look at the formula for the test statistic. The
numerator measures the distance between the sample mean and the hypothesized mean, and
the denominator is the standard error.

So, this test_statistic means the sample mean is 3.27 standard errors below the
hypothesized mean.

Step 4: Infer the conclusion from the test statistic

Convert this test_statistic to a p-value to see if this difference is unusual or
not. Because the alternative hypothesis is one-sided (mu > 135), the p-value is the
upper-tail probability:

pvalue = dist.norm.sf(test_statistic)

The test statistic is negative, so the p-value is about 0.9995, far above the
significance level (0.05).

So, we cannot reject the null hypothesis. A sample mean below 135 gives no evidence
that the mean RestBP is greater than 135.

The two-sided formula, 2*dist.norm.cdf(-np.abs(test_statistic)), gives 0.001 here.
That answers a different question, whether the mean RestBP differs from 135 in either
direction. For that question we would reject the null hypothesis at the 0.05 level,
because the sample mean is significantly lower than 135.
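The direction of the alternative hypothesis matters a great deal when the sample mean falls on the opposite side of the hypothesized value. The sketch below computes a one-mean z-test from summary statistics and returns the p-value for each possible alternative. The function name `one_mean_ztest` is hypothetical, and the summary values (mean 131.7, standard deviation 17.6, n = 303) are rounded, illustrative numbers chosen to give a statistic near the -3.27 reported above; they are not read from the dataset. `statistics.NormalDist` stands in for scipy.

```python
import math
from statistics import NormalDist


def one_mean_ztest(sample_mean, sample_sd, n, mu0):
    """z-test of H0: mu = mu0 from summary statistics.

    Returns the z statistic and the p-value for each alternative.
    """
    se = sample_sd / math.sqrt(n)  # standard error of the mean
    z = (sample_mean - mu0) / se
    cdf = NormalDist().cdf
    return {
        "z": z,
        "p_greater": 1 - cdf(z),           # Ha: mu > mu0
        "p_less": cdf(z),                  # Ha: mu < mu0
        "p_two_sided": 2 * cdf(-abs(z)),   # Ha: mu != mu0
    }


# Illustrative summary statistics (assumed, not computed from the data).
result = one_mean_ztest(sample_mean=131.7, sample_sd=17.6, n=303, mu0=135)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With a negative statistic, the p-value for "greater than" is close to 1 while the two-sided p-value is small, so the choice of formula has to follow the alternative hypothesis stated in Step 1.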

Hypothesis Testing for the Difference in Mean


For this example, we will use the same data, the RestBP column. But this time we test if
there is any difference between the mean RestBP of females and the mean RestBP of
males.

Step 1: State the hypothesis

As a null hypothesis, start with the claim that the mean RestBP of females and the
mean RestBP of males are the same. So the difference between these two means will be
zero.

The alternative hypothesis is, these two means are not the same. Let’s perform the test
with a 10% significance level.

Ho: mu_female - mu_male = 0


Ha: mu_female - mu_male != 0

Both the male and female groups have large enough samples in this dataset. So,
checking for the normality of the data is not required.

Step 2: Calculate the test statistic

The formula for the test statistic is the same as before. But the formula for the standard
error is different:

SE = sqrt(s1^2/n1 + s2^2/n2)

Here s1 and s2 are the sample standard deviations of the female and male groups
respectively. n1 and n2 are the sample sizes of the female and male groups. Now,
calculate the standard error:

pop_fe = df[df.Gender=='Female'].dropna()
pop_male = df[df.Gender=='Male'].dropna()
std_fe = pop_fe.RestBP.std()
std_male = pop_male.RestBP.std()
se = np.sqrt(std_fe**2/len(pop_fe) + std_male**2/len(pop_male))

Use the standard error to get the test statistic.


#calculate the best estimate
mu_fe = pop_fe.RestBP.mean() #Mean RestBP for females
mu_male = pop_male.RestBP.mean() #Mean RestBP for males
mu_diff = mu_fe - mu_male
#hypothesized estimate
mu_diff_hyp = 0 #null hypothesis: difference of two means = zero
test_statistic = (mu_diff - mu_diff_hyp)/se

The test_statistic is 1.086. For reference, the observed difference in means,
‘mu_diff’, is 2.52.

As we are testing whether the means differ from each other, this is a two-tailed test.

The p-value is the probability that the test statistic is either less than -1.086 or
greater than 1.086.

Step 3: Infer the conclusions from the test statistic

Calculate the p-value from this test statistic in python:

pvalue = 2*dist.norm.cdf(-np.abs(test_statistic))

The p-value comes out to 0.277. As this is a two-tailed test, the formula above
already adds both tails:

p(z < -1.086) = 0.1387

p(z > 1.086) = 0.1387

p-value = 0.1387 + 0.1387 ≈ 0.277

That means there is approximately a 27.7% probability of seeing the observed result or
a more extreme one when the null hypothesis is true.

In other words, the p-value is much bigger than the significance level (0.1). So, we
fail to reject the null hypothesis.

The final inference is, based on the observed difference between the mean RestBP of
females and the mean RestBP of males, we cannot support the idea that there is a
significant difference between the two means.
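The arithmetic of this section can be sketched on two short lists so each number is easy to check by hand. The function name `two_mean_ztest` and the readings are made up for illustration and are far smaller than the samples a normal approximation needs; the point is only to show the unpooled standard error and how the two tails add up to the two-sided p-value. `statistics.NormalDist` is used instead of scipy.

```python
import math
from statistics import NormalDist, mean, stdev


def two_mean_ztest(sample1, sample2):
    """Two-sided z-test of H0: mu1 - mu2 = 0 with an unpooled standard error.

    Returns (z, lower_tail, upper_tail, p_value).
    """
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(stdev(sample1) ** 2 / n1 + stdev(sample2) ** 2 / n2)
    z = (mean(sample1) - mean(sample2)) / se
    normal = NormalDist()
    lower_tail = normal.cdf(-abs(z))     # P(Z < -|z|)
    upper_tail = 1 - normal.cdf(abs(z))  # P(Z > |z|)
    return z, lower_tail, upper_tail, lower_tail + upper_tail


# Made-up resting blood pressure readings, for illustration only.
female = [120, 130, 140, 150]
male = [110, 120, 130, 140, 150]

z, lower_tail, upper_tail, p_value = two_mean_ztest(female, male)
print(f"z = {z:.4f}")
print(f"lower tail = {lower_tail:.4f}, upper tail = {upper_tail:.4f}")
print(f"two-sided p-value = {p_value:.4f}")
```

The two tails are equal by symmetry, so their sum matches the `2*cdf(-abs(z))` form used throughout the article.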

Conclusion

I explained the four most common types of research questions in this article with
working examples. I hope you will be able to use hypothesis testing in your decision
making from now on.
