Hypothesis Testing in Python
z-scores
James Chapman
Curriculum Manager, DataCamp
A/B testing
In 2013, Electronic Arts (EA) released SimCity 5.
mean_comp_samp = stack_overflow['converted_comp'].mean()
# 119574.71738168952

mean_comp_hyp = 110000

std_error = 5607.997577378606   # standard error from the bootstrap distribution

z_score = (mean_comp_samp - mean_comp_hyp) / std_error
# 1.7073326529796957
Determine whether sample statistics are close to or far away from expected (or "hypothesized") values
Criminal trials
Two possible true states:
1. Defendant committed the crime
2. Defendant did not commit the crime
Prosecution must present evidence "beyond reasonable doubt" for a guilty verdict
The null hypothesis (H0 ) is the existing idea
The alternative hypothesis (HA ) is the new "challenger" idea of the researcher
1"Naught" is British English for "zero". For historical reasons, "H-naught" is the international convention for
pronouncing the null hypothesis.
If the evidence from the sample is "significant" that HA is true, reject H0 ; otherwise, choose H0
Test                             Tails
alternative different from null  two-tailed
alternative greater than null    right-tailed
alternative less than null       left-tailed
prop_child_samp = 0.39141972578505085   # sample proportion who coded first as a child

prop_child_hyp = 0.35

std_error = 0.010351057228878566        # standard error from the bootstrap distribution

z_score = (prop_child_samp - prop_child_hyp) / std_error
# 4.001497129152506

p_value
# 3.1471479512323874e-05
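Using the values printed above, the right-tailed p-value can be reproduced with only the standard library's `NormalDist` (no SciPy needed); this is a sketch that takes the bootstrap standard error as given:

```python
from statistics import NormalDist

# Values taken from the slide above
prop_child_samp = 0.39141972578505085  # sample proportion
prop_child_hyp = 0.35                  # hypothesized proportion
std_error = 0.010351057228878566       # bootstrap standard error

z_score = (prop_child_samp - prop_child_hyp) / std_error
p_value = 1 - NormalDist().cdf(z_score)  # right-tailed: P(Z >= z)
print(z_score, p_value)
```

The result matches the slide's p-value of about 3.15e-05.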
p-value recap
p-values quantify the strength of evidence against the null hypothesis
Small p-value → reject the null hypothesis
Large p-value → fail to reject the null hypothesis
p_value = 3.1471479512323874e-05

p_value <= alpha
# True
Reject H0 in favor of HA
import numpy as np
lower = np.quantile(first_code_boot_distn, 0.025)
upper = np.quantile(first_code_boot_distn, 0.975)
print((lower, upper))
(0.37063246351172047, 0.41132242370632466)
Possible errors, depending on whether H0 or HA is actually true:
False positives are Type I errors; false negatives are Type II errors.
A false positive (Type I) error: concluding that data scientists started coding as children at a higher rate, when they didn't
A false negative (Type II) error: failing to conclude that data scientists started coding as children at a higher rate, when they did
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable
Are users who first programmed as a child compensated higher than those that started as adults?
H0 : μchild = μadult
H0 : μchild − μadult = 0
HA : The mean compensation (in USD) is greater for those that coded first as a child compared to those that coded first as an adult.
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64
x̄ - a sample mean
x̄child - sample mean compensation for coding first as a child
x̄adult - sample mean compensation for coding first as an adult
x̄child − x̄adult - a test statistic
z-score - a (standardized) test statistic
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64

s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()

age_first_code_cut
adult    271546.521729
child    255585.240115
Name: converted_comp, dtype: float64

n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()

age_first_code_cut
adult    1376
child     885
Name: converted_comp, dtype: int64
import numpy as np

# xbar, s, and n are the grouped mean, std, and count series computed above
numerator = xbar['child'] - xbar['adult']
denominator = np.sqrt(s['child'] ** 2 / n['child'] + s['adult'] ** 2 / n['adult'])
t_stat = numerator / denominator
# 1.8699313316221844
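As a sanity check, the same t-statistic can be reproduced from the printed group statistics alone, using only the standard library (the numbers below are copied from the outputs above):

```python
import math

# Group summary statistics, copied from the printed pandas output
xbar_child, xbar_adult = 132419.570621, 111313.311047
s_child, s_adult = 255585.240115, 271546.521729
n_child, n_adult = 885, 1376

t_stat = (xbar_child - xbar_adult) / math.sqrt(
    s_child ** 2 / n_child + s_adult ** 2 / n_adult)
print(t_stat)  # ≈ 1.87
```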
t-distributions
The t statistic follows a t-distribution
t-distributions have a parameter named degrees of freedom, or df
They look like normal distributions, but with fatter tails
df = nchild + nadult − 2
HA : The mean compensation (in USD) is greater for those that coded first as a child compared to those that coded first as an adult
If p ≤ α then reject H0 .
SE(x̄child − x̄adult ) ≈ √(s²child / nchild + s²adult / nadult)
z-statistic: needed when using one sample statistic to estimate a population parameter
t-statistic: needed when using multiple sample statistics to estimate a population parameter
t_stat
# 1.8699313316221844

degrees_of_freedom = n_child + n_adult - 2
# 2259

p_value
# 0.030811302165157595
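A sketch of how that p-value follows from the t statistic and the degrees of freedom, assuming SciPy is available (the right tail matches the "greater than" alternative above):

```python
from scipy.stats import t

t_stat = 1.8699313316221844
degrees_of_freedom = 885 + 1376 - 2  # n_child + n_adult - 2 = 2259

# Right-tailed test: P(T >= t_stat) under the t-distribution
p_value = 1 - t.cdf(t_stat, df=degrees_of_freedom)
print(p_value)  # ≈ 0.0308
```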
Evidence that Stack Overflow data scientists who started coding as a child earn more.
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626
1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
H0 : μ2008 − μ2012 = 0

x̄diff = -2.877109041242944
df = ndiff − 1

New hypotheses:
H0 : μdiff = 0
HA : μdiff < 0

degrees_of_freedom = n_diff - 1
# 99

p_value
# 9.572537285272411e-08
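A runnable sketch of the same kind of paired, left-tailed test using scipy.stats.ttest_rel; the arrays below are synthetic stand-ins for the county percentages, not the real dataset:

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic paired data (hypothetical numbers, built so that
# percent_08 tends to be a few points below percent_12)
rng = np.random.default_rng(42)
percent_12 = rng.uniform(30, 80, size=100)
percent_08 = percent_12 - rng.normal(2.9, 5.0, size=100)

# H0: mu_diff = 0 vs HA: mu_diff < 0, where diff = percent_08 - percent_12
res = ttest_rel(percent_08, percent_12, alternative="less")
print(res.statistic, res.pvalue)
```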
Selected columns from the pingouin.ttest() output:

            BF10  power
T-test  1.323e+05    1.0

1 Details on the returns from pingouin.ttest() are available in the pingouin API docs at https://pingouin-stats.org/generated/pingouin.ttest.html#pingouin.ttest.

            BF10     power
T-test  1.323e+05  0.696338

           power
T-test  0.454972

Running an unpaired t-test on paired data increases the chance of false negative errors
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()
alpha = 0.2
pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")
p-value = 0.001315 < α
At least two categories have significantly different compensation
Chapter 1 recap
Is a claim about an unknown population proportion feasible?
Calculate a p-value to decide
Now, calculate the test statistic without using the bootstrap distribution
z = (p̂ − p0 ) / √(p0 × (1 − p0 ) / n)

Only uses sample information (p̂ and n) and the hypothesized parameter (p0 )
x̄ estimates the population mean
s estimates the population standard deviation
Since s is calculated from x̄, there is extra uncertainty in our estimate of the parameter
t-distribution - fatter tails than a normal distribution
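The fatter tails can be seen numerically, assuming SciPy: beyond the same cutoff, a t-distribution with few degrees of freedom carries more probability than the standard normal:

```python
from scipy.stats import norm, t

cutoff = 2.0
normal_tail = norm.sf(cutoff)   # P(Z > 2) for the standard normal
t_tail = t.sf(cutoff, df=5)     # P(T > 2) for a t-distribution with 5 df

print(normal_tail, t_tail)  # the t tail is larger
```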
alpha = 0.01
stack_overflow['age_cat'].value_counts(normalize=True)

Under 30       0.535604
At least 30    0.464396
Name: age_cat, dtype: float64

p_hat = 0.5356037151702786   # sample proportion under thirty

p_0 = 0.50

n = len(stack_overflow)
# 2261
import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator
# 3.385911440783663
from scipy.stats import norm

Two-tailed ("not equal"):
p_value = norm.cdf(-z_score) + 1 - norm.cdf(z_score)
p_value = 2 * (1 - norm.cdf(z_score))
# 0.0007094227368100725

Left-tailed ("less than"):
p_value = norm.cdf(z_score)

Right-tailed ("greater than"):
p_value = 1 - norm.cdf(z_score)

p_value <= alpha
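The three tail calculations can be checked without SciPy using the standard library's NormalDist (the z value is copied from above):

```python
from statistics import NormalDist

z_score = 3.385911440783663
cdf = NormalDist().cdf(z_score)

two_tailed = 2 * (1 - cdf)   # "not equal"
left_tailed = cdf            # "less than"
right_tailed = 1 - cdf       # "greater than"

print(two_tailed)  # ≈ 0.000709
```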
Comparing two proportions
H0 : Proportion of hobbyist users is the same for those under thirty as those at least thirty
H0 : p≥30 − p<30 = 0
HA : Proportion of hobbyist users is different for those under thirty compared to those at least thirty
HA : p≥30 − p<30 ≠ 0
alpha = 0.05
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)

age_cat      hobbyist
At least 30  Yes         0.773333
             No          0.226667
Under 30     Yes         0.843105
             No          0.156895
Name: hobbyist, dtype: float64

n = stack_overflow.groupby("age_cat")['hobbyist'].count()

age_cat
At least 30    1050
Under 30       1211
Name: hobbyist, dtype: int64

The test statistic uses the "Yes" proportions (0.773333 and 0.843105) and the group sizes (1050 and 1211).
z_score
# -4.223718652693034
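That z-score comes from the pooled-proportion standard error; a sketch using the rounded proportions and counts printed above (so the result agrees to a few decimal places):

```python
import math

p_1, n_1 = 0.773333, 1050  # At least 30
p_2, n_2 = 0.843105, 1211  # Under 30

# Pooled proportion under H0: p_1 = p_2
p_pooled = (n_1 * p_1 + n_2 * p_2) / (n_1 + n_2)
std_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_1 + 1 / n_2))
z_score = (p_1 - p_2) / std_error
print(z_score)  # ≈ -4.22
```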
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()

age_cat      hobbyist
At least 30  Yes          812
             No           238
Under 30     Yes         1021
             No           190
Name: hobbyist, dtype: int64

from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=[812, 1021], nobs=[1050, 1211],
                                     alternative="two-sided")
# (-4.223691463320559, 2.403330142685068e-05)
Revisiting the proportion test
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat      hobbyist
At least 30  Yes          812
             No           238
Under 30     Yes         1021
             No           190
Name: hobbyist, dtype: int64
(-4.223691463320559, 2.403330142685068e-05)
alpha = 0.1
Assuming independence, how far away are the observed results from the expected values?
Degrees of freedom:
(2 − 1) ∗ (5 − 1) = 4
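Assuming SciPy, chi2_contingency on any 2×5 table reports exactly those degrees of freedom; the counts below are made up purely to show the shape:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x5 contingency table (age_cat rows, job_sat columns)
table = np.array([[60, 80, 90, 70, 50],
                  [55, 95, 100, 65, 45]])

stat, p_value, dof, expected = chi2_contingency(table)
print(dof)  # (2 - 1) * (5 - 1) = 4
```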
1Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the
data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
Purple links
How do you feel when you discover that you've already visited the top resource?
purple_link_counts = stack_overflow['purple_link'].value_counts()
purple_link_counts = purple_link_counts.rename_axis('purple_link')\
.reset_index(name='n')\
.sort_values('purple_link')
   purple_link            n
2  Amused               368
3  Annoyed              263
0  Hello, old friend   1225
1  Indifferent          405
H0 : The sample matches the hypothesized distribution
χ² measures how far the observed results are from the expectations in each group
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
color='red', label='Observed')
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
color='blue', label='Hypothesized')
plt.legend()
plt.show()
Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)
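That result reproduces with scipy.stats.chisquare, feeding the observed counts and the expected counts implied by the hypothesized proportions (1/6, 1/6, 1/2, 1/6):

```python
from scipy.stats import chisquare

# Observed counts and hypothesized proportions, in matching order
observed = [368, 263, 1225, 405]   # Amused, Annoyed, Hello old friend, Indifferent
n_total = sum(observed)            # 2261
expected = [n_total / 6, n_total / 6, n_total / 2, n_total / 6]

result = chisquare(f_obs=observed, f_exp=expected)
print(result)  # statistic ≈ 44.598
```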
Randomness
Assumption: the samples are random subsets of larger populations
Consequence if violated: sample is not representative of the population

Independence of observations
Consequence if violated: increased chance of false negative/positive error

Large sample size
Consequence if violated: wider confidence intervals
Large sample size checks:
t-tests: n ≥ 30 (one sample); n1 ≥ 30, n2 ≥ 30 (two samples)
proportion tests: n × p̂ ≥ 10 (one sample); n1 × p̂1 ≥ 10 (two samples)
Revisit data collection to check for randomness, independence, and sample size
Parametric tests
z-test, t-test, and ANOVA are all parametric tests
Assume a normal distribution
alpha = 0.01
import pingouin
pingouin.ttest(x=repub_votes_potus_08_12_small['repub_percent_08'],
y=repub_votes_potus_08_12_small['repub_percent_12'],
paired=True,
alternative="less")
repub_votes_small['diff'] = (repub_votes_small['repub_percent_08'] -
                             repub_votes_small['repub_percent_12'])
print(repub_votes_small)
repub_votes_small['abs_diff'] = repub_votes_small['diff'].abs()
print(repub_votes_small)
Incorporate the sum of the ranks for negative and positive differences
T_minus = 1 + 4 + 5 + 2 + 3
T_plus = 0
W = np.min([T_minus, T_plus])
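Assuming SciPy, wilcoxon automates the ranking; with five made-up pairs whose differences are all negative (mirroring T_plus = 0 in the worked example), the reported statistic is 0:

```python
from scipy.stats import wilcoxon

# Hypothetical paired samples; every x - y difference is negative
x = [35.2, 41.0, 38.5, 44.1, 37.9]
y = [37.1, 43.2, 39.3, 45.3, 40.6]

# Left-tailed test: the statistic is T_plus, the rank sum of positive differences
res = wilcoxon(x, y, alternative="less")
print(res.statistic, res.pvalue)
```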
Wilcoxon-Mann-Whitney test
Also known as the Mann-Whitney U test
Roughly equivalent to a t-test performed on the ranks of the numeric input
age_vs_comp_wide = age_vs_comp.pivot(columns='age_first_code_cut',
values='converted_comp')
import pingouin
pingouin.mwu(x=age_vs_comp_wide['child'],
y=age_vs_comp_wide['adult'],
alternative='greater')
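SciPy's mannwhitneyu is an equivalent call if pingouin isn't available; a sketch on hypothetical compensation values (made-up numbers, not the survey data):

```python
from scipy.stats import mannwhitneyu

# Hypothetical compensation samples (made-up numbers)
child = [120000, 95000, 140000, 110000, 150000]
adult = [90000, 85000, 100000, 105000, 92000]

# Right-tailed: HA is that child-coders' compensation tends to be higher
res = mannwhitneyu(child, adult, alternative="greater")
print(res.statistic, res.pvalue)
```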
alpha = 0.01
pingouin.kruskal(data=stack_overflow,
dv='converted_comp',
between='job_sat')
Course recap
Chapter 1, Chapter 2, Chapter 3, Chapter 4
Bayesian statistics
Bayesian Data Analysis in Python
Applications
Customer Analytics and A/B Testing in Python