What is statistics?
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
What is statistics?
The field of statistics - the practice and study of collecting and analyzing data
A summary statistic - a fact about or summary of some data
INTRODUCTION TO STATISTICS IN PYTHON
What can statistics do?
How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment system?
How many occupants will your hotel have? How can you optimize occupancy?
How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?
A/B tests: Which ad is more effective in getting people to purchase a product?
INTRODUCTION TO STATISTICS IN PYTHON
What can't statistics do?
Why is Game of Thrones so popular?
Instead...
Are series with more violent scenes viewed by more people?
But...
Even so, this can't tell us if more violent scenes lead to more views
INTRODUCTION TO STATISTICS IN PYTHON
Types of statistics
Descriptive statistics
Describe and summarize data
E.g., 50% of friends drive to work, 25% take the bus, 25% bike

Inferential statistics
Use a sample of data to make inferences about a larger population
E.g., what percent of people drive to work?
INTRODUCTION TO STATISTICS IN PYTHON
Types of data
Numeric (Quantitative)
Continuous (Measured): airplane speed, time spent waiting in line
Discrete (Counted): number of pets, number of packages shipped

Categorical (Qualitative)
Nominal (Unordered): married/unmarried, country of residence
Ordinal (Ordered)
INTRODUCTION TO STATISTICS IN PYTHON
Categorical data can be represented as numbers
Nominal (Unordered)
Married/unmarried (1/0)
Country of residence (1, 2, ...)

Ordinal (Ordered)
Strongly disagree (1)
Somewhat disagree (2)
Neither agree nor disagree (3)
Somewhat agree (4)
Strongly agree (5)
INTRODUCTION TO STATISTICS IN PYTHON
Why does data type matter?
Summary statistics Plots
import numpy as np
np.mean(car_speeds['speed_mph'])
40.09062
INTRODUCTION TO STATISTICS IN PYTHON
Why does data type matter?
Summary statistics Plots
demographics['marriage_status'].value_counts()
single 188
married 143
divorced 124
dtype: int64
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Measures of center
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Mammal sleep data
print(msleep)
name genus vore order ... sleep_cycle awake brainwt bodywt
1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230
INTRODUCTION TO STATISTICS IN PYTHON
Histograms
INTRODUCTION TO STATISTICS IN PYTHON
How long do mammals in this dataset typically sleep?
What's a typical value?
Where is the center of the data?
Mean
Median
Mode
INTRODUCTION TO STATISTICS IN PYTHON
Measures of center: mean
                 name  sleep_total
1             Cheetah         12.1
2          Owl monkey         17.0
3     Mountain beaver         14.4
4     Greater short-t...       14.9
5                 Cow          4.0
..                ...          ...

Mean sleep time = (12.1 + 17.0 + 14.4 + 14.9 + ...) / 83 = 10.43

import numpy as np
np.mean(msleep['sleep_total'])

10.43373
INTRODUCTION TO STATISTICS IN PYTHON
Measures of center: median
msleep['sleep_total'].sort_values()

29     1.9
30     2.7
22     2.9
9      3.0
23     3.1
      ...
19    18.0
61    18.1
36    19.4
21    19.7
42    19.9

msleep['sleep_total'].sort_values().iloc[41]

10.1

np.median(msleep['sleep_total'])

10.1
INTRODUCTION TO STATISTICS IN PYTHON
Measures of center: mode
Most frequent value

msleep['sleep_total'].value_counts()

12.5    4
10.1    3
14.9    2
11.0    2
8.4     2
       ...
14.3    1
17.0    1
Name: sleep_total, Length: 65, dtype: int64

msleep['vore'].value_counts()

herbi      32
omni       20
carni      19
insecti     5
Name: vore, dtype: int64

import statistics
statistics.mode(msleep['vore'])

'herbi'
INTRODUCTION TO STATISTICS IN PYTHON
Adding an outlier
msleep[msleep['vore'] == 'insecti']
name genus vore order sleep_total
22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4
INTRODUCTION TO STATISTICS IN PYTHON
Adding an outlier
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])
mean 16.53
median 18.9
Name: sleep_total, dtype: float64
INTRODUCTION TO STATISTICS IN PYTHON
Adding an outlier
msleep[msleep['vore'] == 'insecti']
name genus vore order sleep_total
22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4
84 Mystery insectivore ... insecti ... 0.0
INTRODUCTION TO STATISTICS IN PYTHON
Adding an outlier
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])
mean 13.22
median 18.1
Name: sleep_total, dtype: float64
Mean: 16.5 → 13.2
Median: 18.9 → 18.1
INTRODUCTION TO STATISTICS IN PYTHON
Which measure to use?
INTRODUCTION TO STATISTICS IN PYTHON
Skew
Left-skewed Right-skewed
INTRODUCTION TO STATISTICS IN PYTHON
Which measure to use?
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Measures of spread
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
What is spread?
INTRODUCTION TO STATISTICS IN PYTHON
Variance
Average squared distance from each data point to the data's mean
INTRODUCTION TO STATISTICS IN PYTHON
Calculating variance
1. Subtract mean from each data point

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)

0    1.666265
1    6.566265
2    3.966265
3    4.466265
4   -6.433735
      ...

2. Square each distance

sq_dists = dists ** 2
print(sq_dists)

0     2.776439
1    43.115837
2    15.731259
3    19.947524
4    41.392945
       ...
INTRODUCTION TO STATISTICS IN PYTHON
Calculating variance
3. Sum squared distances

sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)

1624.065542

4. Divide by number of data points - 1

variance = sum_sq_dists / (83 - 1)
print(variance)

19.805677

Use np.var()

np.var(msleep['sleep_total'], ddof=1)

19.805677

Without ddof=1, population variance is calculated instead of sample variance:

np.var(msleep['sleep_total'])

19.567055
INTRODUCTION TO STATISTICS IN PYTHON
Standard deviation
np.sqrt(np.var(msleep['sleep_total'], ddof=1))
4.450357
np.std(msleep['sleep_total'], ddof=1)
4.450357
INTRODUCTION TO STATISTICS IN PYTHON
Mean absolute deviation
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
np.mean(np.abs(dists))
3.566701
Standard deviation vs. mean absolute deviation
Standard deviation squares distances, penalizing longer distances more than shorter ones.
Mean absolute deviation penalizes each distance equally.
One isn't better than the other, but SD is more common than MAD.
INTRODUCTION TO STATISTICS IN PYTHON
Quantiles
0.5 quantile = median

np.quantile(msleep['sleep_total'], 0.5)

10.1
Quartiles:
np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])
array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])
INTRODUCTION TO STATISTICS IN PYTHON
Boxplots use quartiles
import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()
INTRODUCTION TO STATISTICS IN PYTHON
Quantiles using np.linspace()
np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])
array([ 1.9 , 6.24, 9.48, 11.14, 14.4 , 19.9 ])
np.linspace(start, stop, num)
np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))
array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])
INTRODUCTION TO STATISTICS IN PYTHON
Interquartile range (IQR)
Height of the box in a boxplot
np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
5.9
from scipy.stats import iqr
iqr(msleep['sleep_total'])
5.9
INTRODUCTION TO STATISTICS IN PYTHON
Outliers
Outlier: data point that is substantially different from the others
How do we know what a substantial difference is? A data point is an outlier if:
data < Q1 − 1.5 × IQR or
data > Q3 + 1.5 × IQR
INTRODUCTION TO STATISTICS IN PYTHON
Finding outliers
from scipy.stats import iqr
bodywt_iqr = iqr(msleep['bodywt'])
lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * bodywt_iqr
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * bodywt_iqr
msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]
name vore sleep_total bodywt
4 Cow herbi 4.0 600.000
20 Asian elephant herbi 3.9 2547.000
22 Horse herbi 2.9 521.000
...
INTRODUCTION TO STATISTICS IN PYTHON
All in one go
msleep['bodywt'].describe()
count 83.000000
mean 166.136349
std 786.839732
min 0.005000
25% 0.174000
50% 1.670000
75% 41.750000
max 6654.000000
Name: bodywt, dtype: float64
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
What are the
chances?
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Measuring chance
What's the probability of an event?
P(event) = (# ways event can happen) / (total # of possible outcomes)

Example: a coin flip

P(heads) = (1 way to get heads) / (2 possible outcomes) = 1/2 = 50%
INTRODUCTION TO STATISTICS IN PYTHON
Assigning salespeople
INTRODUCTION TO STATISTICS IN PYTHON
Assigning salespeople
P(Brian) = 1/4 = 25%
INTRODUCTION TO STATISTICS IN PYTHON
Sampling from a DataFrame
print(sales_counts)

     name  n_sales
0    Amir      178
1   Brian      128
2  Claire       75
3  Damian       69

sales_counts.sample()

    name  n_sales
1  Brian      128

sales_counts.sample()

     name  n_sales
2  Claire       75
INTRODUCTION TO STATISTICS IN PYTHON
Setting a random seed
np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128

np.random.seed(10)
sales_counts.sample()

    name  n_sales
1  Brian      128
INTRODUCTION TO STATISTICS IN PYTHON
A second meeting
Sampling without replacement
INTRODUCTION TO STATISTICS IN PYTHON
A second meeting
P(Claire) = 1/3 ≈ 33%
INTRODUCTION TO STATISTICS IN PYTHON
Sampling twice in Python
sales_counts.sample(2)
name n_sales
1 Brian 128
2 Claire 75
INTRODUCTION TO STATISTICS IN PYTHON
Sampling with replacement
INTRODUCTION TO STATISTICS IN PYTHON
Sampling with replacement
P(Claire) = 1/4 = 25%
INTRODUCTION TO STATISTICS IN PYTHON
Sampling with/without replacement in Python
sales_counts.sample(5, replace = True)
name n_sales
1 Brian 128
2 Claire 75
1 Brian 128
3 Damian 69
0 Amir 178
INTRODUCTION TO STATISTICS IN PYTHON
Independent events
Two events are independent if the probability of the second event isn't affected by the outcome of the first event.
Sampling with replacement = each pick is independent
INTRODUCTION TO STATISTICS IN PYTHON
Dependent events
Two events are dependent if the probability of the second event is affected by the outcome of the first event.
Sampling without replacement = each pick is dependent
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Discrete
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Rolling the dice
INTRODUCTION TO STATISTICS IN PYTHON
Choosing salespeople
INTRODUCTION TO STATISTICS IN PYTHON
Probability distribution
Describes the probability of each possible outcome in a scenario
Expected value: mean of a probability distribution
Expected value of a fair die roll =
(1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5
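As a minimal sketch, the same expected value can be computed in Python, assuming the die DataFrame with number and prob columns used later in this lesson:

import numpy as np
# E(X) = sum of each outcome times its probability
np.sum(die['number'] * die['prob'])

3.5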
INTRODUCTION TO STATISTICS IN PYTHON
Visualizing a probability distribution
INTRODUCTION TO STATISTICS IN PYTHON
Probability = area
P(die roll ≤ 2) = ?
INTRODUCTION TO STATISTICS IN PYTHON
Probability = area
P(die roll ≤ 2) = 1/3
INTRODUCTION TO STATISTICS IN PYTHON
Uneven die
Expected value of uneven die roll =
(1 × 1/6) + (2 × 0) + (3 × 1/3) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.67
INTRODUCTION TO STATISTICS IN PYTHON
Visualizing uneven probabilities
INTRODUCTION TO STATISTICS IN PYTHON
Adding areas
P(uneven die roll ≤ 2) = ?
INTRODUCTION TO STATISTICS IN PYTHON
Adding areas
P(uneven die roll ≤ 2) = 1/6
INTRODUCTION TO STATISTICS IN PYTHON
Discrete probability distributions
Describe probabilities for discrete outcomes
Fair die Uneven die
Discrete uniform distribution
INTRODUCTION TO STATISTICS IN PYTHON
Sampling from discrete distributions
print(die)

   number      prob
0       1  0.166667
1       2  0.166667
2       3  0.166667
3       4  0.166667
4       5  0.166667
5       6  0.166667

np.mean(die['number'])

3.5

rolls_10 = die.sample(10, replace=True)
print(rolls_10)

   number      prob
0       1  0.166667
0       1  0.166667
4       5  0.166667
1       2  0.166667
0       1  0.166667
0       1  0.166667
5       6  0.166667
5       6  0.166667
...
INTRODUCTION TO STATISTICS IN PYTHON
Visualizing a sample
rolls_10['number'].hist(bins=np.linspace(1,7,7))
plt.show()
INTRODUCTION TO STATISTICS IN PYTHON
Sample distribution vs. theoretical distribution
Sample of 10 rolls Theoretical probability distribution
np.mean(rolls_10['number']) = 3.0
np.mean(die['number']) = 3.5
INTRODUCTION TO STATISTICS IN PYTHON
A bigger sample
Sample of 100 rolls Theoretical probability distribution
np.mean(rolls_100['number']) = 3.4
np.mean(die['number']) = 3.5
INTRODUCTION TO STATISTICS IN PYTHON
An even bigger sample
Sample of 1000 rolls Theoretical probability distribution
np.mean(rolls_1000['number']) = 3.48
np.mean(die['number']) = 3.5
INTRODUCTION TO STATISTICS IN PYTHON
Law of large numbers
As the size of your sample increases, the sample mean will approach the expected value (simulated in the sketch below the table).
Sample size Mean
10 3.00
100 3.40
1000 3.48
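A minimal simulation sketch of this, assuming the die DataFrame from earlier:

import numpy as np
for n in [10, 100, 1000]:
    rolls = die.sample(n, replace=True)
    # the sample mean drifts toward the expected value of 3.5 as n grows
    print(n, np.mean(rolls['number']))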
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Continuous
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Waiting for the bus
INTRODUCTION TO STATISTICS IN PYTHON
Continuous uniform distribution
INTRODUCTION TO STATISTICS IN PYTHON
Probability still = area
P (4 ≤ wait time ≤ 7) = ?
INTRODUCTION TO STATISTICS IN PYTHON
Probability still = area
P (4 ≤ wait time ≤ 7) = 3 × 1/12 = 3/12
INTRODUCTION TO STATISTICS IN PYTHON
Uniform distribution in Python
P (wait time ≤ 7)
from scipy.stats import uniform
uniform.cdf(7, 0, 12)
0.5833333
INTRODUCTION TO STATISTICS IN PYTHON
"Greater than" probabilities
P (wait time ≥ 7) = 1 − P (wait time ≤ 7)
from scipy.stats import uniform
1 - uniform.cdf(7, 0, 12)
0.4166667
INTRODUCTION TO STATISTICS IN PYTHON
P (4 ≤ wait time ≤ 7)
INTRODUCTION TO STATISTICS IN PYTHON
P (4 ≤ wait time ≤ 7)
from scipy.stats import uniform
uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12)
0.25
INTRODUCTION TO STATISTICS IN PYTHON
Total area = 1
P (0 ≤ wait time ≤ 12) = ?
INTRODUCTION TO STATISTICS IN PYTHON
Total area = 1
P (0 ≤ wait time ≤ 12) = 12 × 1/12 = 1
INTRODUCTION TO STATISTICS IN PYTHON
Generating random numbers according to uniform
distribution
from scipy.stats import uniform
uniform.rvs(0, 5, size=10)
array([1.89740094, 4.70673196, 0.33224683, 1.0137103 , 2.31641255,
3.49969897, 0.29688598, 0.92057234, 4.71086658, 1.56815855])
INTRODUCTION TO STATISTICS IN PYTHON
Other continuous distributions
INTRODUCTION TO STATISTICS IN PYTHON
Other special types of distributions
Normal distribution Exponential distribution
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The binomial
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Coin flipping
INTRODUCTION TO STATISTICS IN PYTHON
Binary outcomes
INTRODUCTION TO STATISTICS IN PYTHON
A single flip
binom.rvs(# of coins, probability of heads/success, size=# of trials)
1 = heads, 0 = tails
from scipy.stats import binom
binom.rvs(1, 0.5, size=1)
array([1])
INTRODUCTION TO STATISTICS IN PYTHON
One flip many times
binom.rvs(1, 0.5, size=8)
array([0, 1, 1, 0, 1, 0, 1, 1])
INTRODUCTION TO STATISTICS IN PYTHON
Many flips one time
binom.rvs(8, 0.5, size=1)
array([5])
INTRODUCTION TO STATISTICS IN PYTHON
Many flips many times
binom.rvs(3, 0.5, size=10)
array([0, 3, 2, 1, 3, 0, 2, 2, 0, 0])
INTRODUCTION TO STATISTICS IN PYTHON
Other probabilities
binom.rvs(3, 0.25, size=10)
array([1, 1, 1, 1, 0, 0, 2, 0, 1, 0])
INTRODUCTION TO STATISTICS IN PYTHON
Binomial distribution
Probability distribution of the number of
successes in a sequence of independent
trials
E.g. Number of heads in a sequence of coin
flips
Described by n and p
n: total number of trials
p: probability of success
INTRODUCTION TO STATISTICS IN PYTHON
What's the probability of 7 heads?
P (heads = 7)
# binom.pmf(num heads, num trials, prob of heads)
binom.pmf(7, 10, 0.5)
0.1171875
INTRODUCTION TO STATISTICS IN PYTHON
What's the probability of 7 or fewer heads?
P (heads ≤ 7)
binom.cdf(7, 10, 0.5)
0.9453125
INTRODUCTION TO STATISTICS IN PYTHON
What's the probability of more than 7 heads?
P (heads > 7)
1 - binom.cdf(7, 10, 0.5)
0.0546875
INTRODUCTION TO STATISTICS IN PYTHON
Expected value
Expected value = n × p
Expected number of heads out of 10 flips = 10 × 0.5 = 5
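A quick simulation sketch (not from the slides) to check this:

from scipy.stats import binom
import numpy as np
# the average of many binomial draws approaches n * p = 10 * 0.5 = 5
np.mean(binom.rvs(10, 0.5, size=10000))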
INTRODUCTION TO STATISTICS IN PYTHON
Independence
The binomial distribution is a probability distribution of the number of successes in a sequence of independent trials.
If trials are not independent, the probabilities of later trials are altered by earlier outcomes, and the binomial distribution does not apply!
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The normal
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
What is the normal distribution?
INTRODUCTION TO STATISTICS IN PYTHON
Symmetrical
INTRODUCTION TO STATISTICS IN PYTHON
Area = 1
INTRODUCTION TO STATISTICS IN PYTHON
Curve never hits 0
INTRODUCTION TO STATISTICS IN PYTHON
Described by mean and standard deviation
Mean: 20
Standard deviation: 3
Standard normal distribution
Mean: 0
Standard deviation: 1
INTRODUCTION TO STATISTICS IN PYTHON
Areas under the normal distribution
68% falls within 1 standard deviation
INTRODUCTION TO STATISTICS IN PYTHON
Areas under the normal distribution
95% falls within 2 standard deviations
INTRODUCTION TO STATISTICS IN PYTHON
Areas under the normal distribution
99.7% falls within 3 standard deviations
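These areas can be checked with scipy (a sketch using the standard normal):

from scipy.stats import norm
norm.cdf(1) - norm.cdf(-1)   # ≈ 0.68, within 1 standard deviation
norm.cdf(2) - norm.cdf(-2)   # ≈ 0.95, within 2 standard deviations
norm.cdf(3) - norm.cdf(-3)   # ≈ 0.997, within 3 standard deviations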
INTRODUCTION TO STATISTICS IN PYTHON
Lots of histograms look normal
Normal distribution
Women's heights from NHANES (mean: 161 cm, standard deviation: 7 cm)
INTRODUCTION TO STATISTICS IN PYTHON
Approximating data with the normal distribution
INTRODUCTION TO STATISTICS IN PYTHON
What percent of women are shorter than 154 cm?
from scipy.stats import norm
norm.cdf(154, 161, 7)
0.158655
16% of women in the survey are shorter than
154 cm
INTRODUCTION TO STATISTICS IN PYTHON
What percent of women are taller than 154 cm?
from scipy.stats import norm
1 - norm.cdf(154, 161, 7)
0.841345
INTRODUCTION TO STATISTICS IN PYTHON
What percent of women are 154-157 cm?
norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)
INTRODUCTION TO STATISTICS IN PYTHON
What percent of women are 154-157 cm?
norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)
0.1252
INTRODUCTION TO STATISTICS IN PYTHON
What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)
169.97086
INTRODUCTION TO STATISTICS IN PYTHON
What height are 90% of women taller than?
norm.ppf((1-0.9), 161, 7)
152.029
INTRODUCTION TO STATISTICS IN PYTHON
Generating random numbers
# Generate 10 random heights
norm.rvs(161, 7, size=10)
array([155.5758223 , 155.13133235, 160.06377097, 168.33345778,
165.92273375, 163.32677057, 165.13280753, 146.36133538,
149.07845021, 160.5790856 ])
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The central limit
theorem
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Rolling the dice 5 times
die = pd.Series([1, 2, 3, 4, 5, 6])
# Roll 5 times
samp_5 = die.sample(5, replace=True)
print(samp_5)
array([3, 1, 4, 1, 1])
np.mean(samp_5)
2.0
INTRODUCTION TO STATISTICS IN PYTHON
Rolling the dice 5 times
# Roll 5 times and take mean
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)
4.4
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)
3.8
INTRODUCTION TO STATISTICS IN PYTHON
Rolling the dice 5 times 10 times
Repeat 10 times:
1. Roll 5 times
2. Take the mean

sample_means = []
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)

[3.8, 4.0, 3.8, 3.6, 3.2, 4.8, 2.6, 3.0, 2.6, 2.0]
INTRODUCTION TO STATISTICS IN PYTHON
Sampling distributions
Sampling distribution of the sample mean
INTRODUCTION TO STATISTICS IN PYTHON
100 sample means
sample_means = []
for i in range(100):
sample_means.append(np.mean(die.sample(5, replace=True)))
INTRODUCTION TO STATISTICS IN PYTHON
1000 sample means
sample_means = []
for i in range(1000):
sample_means.append(np.mean(die.sample(5, replace=True)))
INTRODUCTION TO STATISTICS IN PYTHON
Central limit theorem
The sampling distribution of a statistic becomes closer to the normal distribution as the
number of trials increases.
* Samples should be random and independent
INTRODUCTION TO STATISTICS IN PYTHON
Standard deviation and the CLT
sample_sds = []
for i in range(1000):
sample_sds.append(np.std(die.sample(5, replace=True)))
INTRODUCTION TO STATISTICS IN PYTHON
Proportions and the CLT
sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])
sales_team.sample(10, replace=True)
array(['Claire', 'Damian', 'Brian', 'Damian', 'Damian', 'Amir', 'Amir', 'Amir',
'Amir', 'Damian'], dtype=object)
sales_team.sample(10, replace=True)
array(['Brian', 'Amir', 'Brian', 'Claire', 'Brian', 'Damian', 'Claire', 'Brian',
'Claire', 'Claire'], dtype=object)
INTRODUCTION TO STATISTICS IN PYTHON
Sampling distribution of proportion
INTRODUCTION TO STATISTICS IN PYTHON
Mean of sampling distribution
# Estimate expected value of die
np.mean(sample_means)

3.48

# Estimate proportion of "Claire"s
np.mean(sample_props)

0.26

Estimate characteristics of unknown underlying distribution
More easily estimate characteristics of large populations
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
The Poisson
distribution
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Poisson processes
Events appear to happen at a certain rate,
but completely at random
Examples
Number of animals adopted from an
animal shelter per week
Number of people arriving at a
restaurant per hour
Number of earthquakes in California per
year
Time unit is irrelevant, as long as you use
the same unit when talking about the same
situation
INTRODUCTION TO STATISTICS IN PYTHON
Poisson distribution
Probability of some # of events occurring over a fixed period of time
Examples
Probability of ≥ 5 animals adopted from an animal shelter per week
Probability of 12 people arriving at a restaurant per hour
Probability of < 20 earthquakes in California per year
INTRODUCTION TO STATISTICS IN PYTHON
Lambda (λ)
λ = average number of events per time interval
Average number of adoptions per week = 8
INTRODUCTION TO STATISTICS IN PYTHON
Lambda is the distribution's peak
INTRODUCTION TO STATISTICS IN PYTHON
Probability of a single value
If the average number of adoptions per week is 8, what is P (# adoptions in a week = 5)?
from scipy.stats import poisson
poisson.pmf(5, 8)
0.09160366
INTRODUCTION TO STATISTICS IN PYTHON
Probability of less than or equal to
If the average number of adoptions per week is 8, what is P (# adoptions in a week ≤ 5)?
from scipy.stats import poisson
poisson.cdf(5, 8)
0.1912361
INTRODUCTION TO STATISTICS IN PYTHON
Probability of greater than
If the average number of adoptions per week is 8, what is P (# adoptions in a week > 5)?
1 - poisson.cdf(5, 8)
0.8087639
If the average number of adoptions per week is 10, what is P (# adoptions in a week > 5)?
1 - poisson.cdf(5, 10)
0.932914
INTRODUCTION TO STATISTICS IN PYTHON
Sampling from a Poisson distribution
from scipy.stats import poisson
poisson.rvs(8, size=10)
array([ 9, 9, 8, 7, 11, 3, 10, 6, 8, 14])
INTRODUCTION TO STATISTICS IN PYTHON
The CLT still applies!
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
More probability
distributions
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Exponential distribution
Probability of time between Poisson events
Examples
Probability of > 1 day between adoptions
Probability of < 10 minutes between restaurant arrivals
Probability of 6-8 months between earthquakes
Also uses lambda (rate)
Continuous (time)
INTRODUCTION TO STATISTICS IN PYTHON
Customer service requests
On average, one customer service ticket is created every 2 minutes
λ = 0.5 customer service tickets created each minute
INTRODUCTION TO STATISTICS IN PYTHON
Lambda in exponential distribution
INTRODUCTION TO STATISTICS IN PYTHON
Expected value of exponential distribution
In terms of rate (Poisson):
λ = 0.5 requests per minute
In terms of time between events (exponential):
1/λ = 1/0.5 = 2, i.e. one request every 2 minutes on average
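A quick simulation sketch (not from the slides): the average of many simulated waits approaches 1/λ.

from scipy.stats import expon
import numpy as np
# scale = 1/λ = 2, so simulated wait times average about 2 minutes
np.mean(expon.rvs(scale=2, size=10000))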
INTRODUCTION TO STATISTICS IN PYTHON
How long until a new request is created?
from scipy.stats import expon
# scale = 1/λ = 1/0.5 = 2

P(wait < 1 min) =

expon.cdf(1, scale=2)

0.3934693402873666

P(wait > 4 min) =

1 - expon.cdf(4, scale=2)

0.1353352832366127

P(1 min < wait < 4 min) =

expon.cdf(4, scale=2) - expon.cdf(1, scale=2)

0.4711953764760207
INTRODUCTION TO STATISTICS IN PYTHON
(Student's) t-distribution
Similar shape as the normal distribution
INTRODUCTION TO STATISTICS IN PYTHON
Degrees of freedom
Has parameter degrees of freedom (df) which affects the thickness of the tails
Lower df = thicker tails, higher standard deviation
Higher df = closer to normal distribution
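As a sketch of the tail behavior (using scipy):

from scipy.stats import t, norm
norm.cdf(-3)        # tail probability under the normal
t.cdf(-3, df=3)     # much more probability in the tail at low df
t.cdf(-3, df=30)    # closer to the normal at higher df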
INTRODUCTION TO STATISTICS IN PYTHON
Log-normal distribution
Variable whose logarithm is normally distributed (see the sketch after the examples)
Examples:
Length of chess games
Adult blood pressure
Number of hospitalizations in the 2003
SARS outbreak
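A minimal sketch, assuming scipy's lognorm, where s is the standard deviation of the underlying normal:

from scipy.stats import lognorm
import numpy as np
samples = lognorm.rvs(s=1, size=1000)
# the log of log-normal samples is approximately normal: mean ≈ 0, sd ≈ 1
print(np.mean(np.log(samples)), np.std(np.log(samples)))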
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Correlation
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Relationships between two variables
x = explanatory/independent variable
y = response/dependent variable
INTRODUCTION TO STATISTICS IN PYTHON
Correlation coefficient
Quantifies the linear relationship between two variables
Number between -1 and 1
Magnitude corresponds to strength of relationship
Sign (+ or -) corresponds to direction of relationship
INTRODUCTION TO STATISTICS IN PYTHON
Magnitude = strength of relationship
0.99 (very strong relationship)
INTRODUCTION TO STATISTICS IN PYTHON
Magnitude = strength of relationship
0.99 (very strong relationship) 0.75 (strong relationship)
INTRODUCTION TO STATISTICS IN PYTHON
Magnitude = strength of relationship
0.56 (moderate relationship)
INTRODUCTION TO STATISTICS IN PYTHON
Magnitude = strength of relationship
0.56 (moderate relationship) 0.21 (weak relationship)
INTRODUCTION TO STATISTICS IN PYTHON
Magnitude = strength of relationship
0.04 (no relationship) Knowing the value of x doesn't tell us
anything about y
INTRODUCTION TO STATISTICS IN PYTHON
Sign = direction
0.75: as x increases, y increases -0.75: as x increases, y decreases
INTRODUCTION TO STATISTICS IN PYTHON
Visualizing relationships
import seaborn as sns
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()
INTRODUCTION TO STATISTICS IN PYTHON
Adding a trendline
import seaborn as sns
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
INTRODUCTION TO STATISTICS IN PYTHON
Computing correlation
msleep['sleep_total'].corr(msleep['sleep_rem'])
0.751755
msleep['sleep_rem'].corr(msleep['sleep_total'])
0.751755
INTRODUCTION TO STATISTICS IN PYTHON
Many ways to calculate correlation
Used in this course: Pearson product-moment correlation (r)
Most common

r = Σ (xi − x̄)(yi − ȳ) / (σx × σy), summing over i = 1, ..., n

x̄ = mean of x
σx = standard deviation of x
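A sketch of this computation in numpy (dividing by n and using population standard deviations, which makes the result match pandas' .corr(); rows with missing values are dropped first):

import numpy as np
df = msleep[['sleep_total', 'sleep_rem']].dropna()
x, y = df['sleep_total'], df['sleep_rem']
r = np.sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) * np.std(x) * np.std(y))
print(r)  # ≈ 0.75, matching msleep['sleep_total'].corr(msleep['sleep_rem'])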
Variations on this formula:
Kendall's tau
Spearman's rho
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Correlation caveats
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Non-linear relationships
r = 0.18
INTRODUCTION TO STATISTICS IN PYTHON
Non-linear relationships
What we see: What the correlation coefficient sees:
INTRODUCTION TO STATISTICS IN PYTHON
Correlation only accounts for linear relationships
Correlation shouldn't be used blindly
Always visualize your data
df['x'].corr(df['y'])
0.081094
INTRODUCTION TO STATISTICS IN PYTHON
Mammal sleep data
print(msleep)
name genus vore order ... sleep_cycle awake brainwt bodywt
1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230
INTRODUCTION TO STATISTICS IN PYTHON
Body weight vs. awake time
msleep['bodywt'].corr(msleep['awake'])
0.3119801
INTRODUCTION TO STATISTICS IN PYTHON
Distribution of body weight
INTRODUCTION TO STATISTICS IN PYTHON
Log transformation
msleep['log_bodywt'] = np.log(msleep['bodywt'])
sns.lmplot(x='log_bodywt',
y='awake',
data=msleep,
ci=None)
plt.show()
msleep['log_bodywt'].corr(msleep['awake'])
0.5687943
INTRODUCTION TO STATISTICS IN PYTHON
Other transformations
Log transformation ( log(x) )
Square root transformation ( sqrt(x) )
Reciprocal transformation ( 1 / x )
Combinations of these, e.g.:
log(x) and log(y)
sqrt(x) and 1 / y
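A minimal sketch with made-up data (df, x, and y are hypothetical, not course data):

import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [1, 4, 9, 16, 25], 'y': [1.0, 0.5, 0.33, 0.25, 0.2]})
df['sqrt_x'] = np.sqrt(df['x'])   # sqrt(x) transformation
df['inv_y'] = 1 / df['y']         # 1 / y transformation
print(df['sqrt_x'].corr(df['inv_y']))  # near 1: the transformed relationship is linear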
INTRODUCTION TO STATISTICS IN PYTHON
Why use a transformation?
Certain statistical methods rely on variables having a linear relationship
Correlation coefficient
Linear regression
Introduction to Linear Modeling in Python
INTRODUCTION TO STATISTICS IN PYTHON
Correlation does not imply causation
x is correlated with y does not mean x causes y
INTRODUCTION TO STATISTICS IN PYTHON
Confounding
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Design of
experiments
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Vocabulary
Experiment aims to answer: What is the effect of the treatment on the response?
Treatment: explanatory/independent variable
Response: response/dependent variable
E.g.: What is the effect of an advertisement on the number of products purchased?
Treatment: advertisement
Response: number of products purchased
INTRODUCTION TO STATISTICS IN PYTHON
Controlled experiments
Participants are assigned by researchers to either treatment group or control group
Treatment group sees advertisement
Control group does not
Groups should be comparable so that causation can be inferred
If groups are not comparable, this could lead to confounding (bias)
Treatment group average age: 25
Control group average age: 50
Age is a potential confounder
INTRODUCTION TO STATISTICS IN PYTHON
The gold standard of experiments will use...
Randomized controlled trial
Participants are assigned to treatment/control randomly, not based on any other
characteristics
Choosing randomly helps ensure that groups are comparable
Placebo
Resembles treatment, but has no effect
Participants will not know which group they're in
In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug
itself and not the idea of receiving the drug
INTRODUCTION TO STATISTICS IN PYTHON
The gold standard of experiments will use...
Double-blind trial
Person administering the treatment/running the study doesn't know whether the
treatment is real or a placebo
Prevents bias in the response and/or analysis of results
Fewer opportunities for bias = more reliable conclusion about causation
INTRODUCTION TO STATISTICS IN PYTHON
Observational studies
Participants are not assigned randomly to groups
Participants assign themselves, usually based on pre-existing characteristics
Many research questions are not conducive to a controlled experiment
You can't force someone to smoke or have a disease
You can't make someone have certain past behavior
Establish association, not causation
Effects can be confounded by factors that got certain people into the control or
treatment group
There are ways to control for confounders to get more reliable conclusions about
association
INTRODUCTION TO STATISTICS IN PYTHON
Longitudinal vs. cross-sectional studies
Longitudinal study
Participants are followed over a period of time to examine effect of treatment on response
Effect of age on height is not confounded by generation
More expensive, results take longer

Cross-sectional study
Data on participants is collected from a single snapshot in time
Effect of age on height is confounded by generation
Cheaper, faster, more convenient
INTRODUCTION TO STATISTICS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Congratulations!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
Maggie Matsui
Content Developer, DataCamp
Overview
Chapter 1
What is statistics?
Measures of center
Measures of spread

Chapter 2
Measuring chance
Probability distributions
Binomial distribution

Chapter 3
Normal distribution
Central limit theorem
Poisson distribution

Chapter 4
Correlation
Controlled experiments
Observational studies
INTRODUCTION TO STATISTICS IN PYTHON
Build on your skills
Introduction to Linear Modeling in Python
INTRODUCTION TO STATISTICS IN PYTHON
Congratulations!
I N T R O D U C T I O N T O S TAT I S T I C S I N P Y T H O N
A tale of two
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Swedish motor insurance data
Each row represents one geographic region in Sweden.
There are 63 rows.

n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     ...                ...
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Descriptive statistics
import pandas as pd
print(swedish_motor_insurance.mean())
n_claims 22.904762
total_payment_sek 98.187302
dtype: float64
print(swedish_motor_insurance['n_claims'].corr(swedish_motor_insurance['total_payment_sek']))
0.9128782350234068
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
What is regression?
Statistical models to explore the relationship between a response variable and some explanatory variables.
Given values of explanatory variables, you can predict the values of the response variable.

n_claims  total_payment_sek
     108              392.5
      19               46.2
      13               15.7
     124              422.2
      40              119.4
     200                ???
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Jargon
Response variable (a.k.a. dependent variable)
The variable that you want to predict.
Explanatory variables (a.k.a. independent variables)
The variables that explain how the response variable will change.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Linear regression and logistic regression
Linear regression
The response variable is numeric.
Logistic regression
The response variable is logical.
Simple linear/logistic regression
There is only one explanatory variable.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing pairs of variables
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x="n_claims",
y="total_payment_sek",
data=swedish_motor_insurance)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Adding a linear trend line
sns.regplot(x="n_claims",
y="total_payment_sek",
data=swedish_motor_insurance,
ci=None)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Course flow
Chapter 1
Visualizing and fitting linear regression models.
Chapter 2
Making predictions from linear regression models and understanding model coefficients.
Chapter 3
Assessing the quality of the linear regression model.
Chapter 4
Same again, but with logistic regression models
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Python packages for regression
statsmodels
Optimized for insight (focus in this course)
scikit-learn
Optimized for prediction (focus in other DataCamp courses)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Fitting a linear
regression
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Straight lines are defined by two things
Intercept
The y value at the point when x is zero.
Slope
The amount the y value increases if you increase x by one.
Equation
y = intercept + slope ∗ x
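A tiny numeric sketch (made-up numbers, not course data):

intercept, slope = 2.0, 3.0
x = 10
y = intercept + slope * x
print(y)  # 32.0: start at 2, go up 3 for each unit of x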
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Estimating the intercept
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Estimating the slope
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Running a model
from statsmodels.formula.api import ols
mdl_payment_vs_claims = ols("total_payment_sek ~ n_claims",
data=swedish_motor_insurance)
mdl_payment_vs_claims = mdl_payment_vs_claims.fit()
print(mdl_payment_vs_claims.params)
Intercept 19.994486
n_claims 3.413824
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Interpreting the model coefficients
Intercept 19.994486
n_claims 3.413824
dtype: float64
Equation
total_payment_sek = 19.99 + 3.41 ∗ n_claims
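As a sketch, these coefficients answer the earlier "???" row (n_claims = 200):

total_payment_sek = 19.994486 + 3.413824 * 200
print(total_payment_sek)  # ≈ 702.76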
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Categorical
explanatory
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Fish dataset
Each row represents one fish.
There are 128 rows in the dataset.
There are 4 species of fish:
Common Bream
European Perch
Northern Pike
Common Roach

species  mass_g
Bream     242.0
Perch       5.9
Pike      200.0
Roach      40.0
...         ...
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing 1 numeric and 1 categorical variable
import matplotlib.pyplot as plt
import seaborn as sns
sns.displot(data=fish,
x="mass_g",
col="species",
col_wrap=2,
bins=9)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Summary statistics: mean mass by species
summary_stats = fish.groupby("species")["mass_g"].mean()
print(summary_stats)
species
Bream 617.828571
Perch 382.239286
Pike 718.705882
Roach 152.050000
Name: mass_g, dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Linear regression
from statsmodels.formula.api import ols
mdl_mass_vs_species = ols("mass_g ~ species", data=fish).fit()
print(mdl_mass_vs_species.params)
Intercept 617.828571
species[T.Perch] -235.589286
species[T.Pike] 100.877311
species[T.Roach] -465.778571
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Model with or without an intercept
From previous slide, model with intercept:

mdl_mass_vs_species = ols(
    "mass_g ~ species", data=fish).fit()
print(mdl_mass_vs_species.params)

Intercept           617.828571
species[T.Perch]   -235.589286
species[T.Pike]     100.877311
species[T.Roach]   -465.778571

The coefficients are relative to the intercept: 617.83 − 235.59 = 382.24!

Model without an intercept:

mdl_mass_vs_species = ols(
    "mass_g ~ species + 0", data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000

In the case of a single categorical variable, the coefficients are the means.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Making predictions
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The fish dataset: bream
bream = fish[fish["species"] == "Bream"]
print(bream.head())
species mass_g length_cm
0 Bream 242.0 23.2
1 Bream 290.0 24.0
2 Bream 340.0 23.9
3 Bream 363.0 26.3
4 Bream 430.0 26.5
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Plotting mass vs. length
sns.regplot(x="length_cm",
y="mass_g",
data=bream,
ci=None)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Running the model
mdl_mass_vs_length = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_mass_vs_length.params)
Intercept -1035.347565
length_cm 54.549981
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Data on explanatory values to predict
If I set the explanatory variables to these values,
what value would the response variable have?
explanatory_data = pd.DataFrame({"length_cm": np.arange(20, 41)})
length_cm
0 20
1 21
2 22
3 23
4 24
5 25
...
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Call predict()
print(mdl_mass_vs_length.predict(explanatory_data))
0 55.652054
1 110.202035
2 164.752015
3 219.301996
4 273.851977
...
16 928.451749
17 983.001730
18 1037.551710
19 1092.101691
20 1146.651672
Length: 21, dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Predicting inside a DataFrame
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(20, 41)}
)
prediction_data = explanatory_data.assign(
    mass_g=mdl_mass_vs_length.predict(explanatory_data)
)
print(prediction_data)

    length_cm       mass_g
0          20    55.652054
1          21   110.202035
2          22   164.752015
3          23   219.301996
4          24   273.851977
..        ...          ...
16         36   928.451749
17         37   983.001730
18         38  1037.551710
19         39  1092.101691
20         40  1146.651672
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Showing predictions
import matplotlib.pyplot as plt
import seaborn as sns
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
ci=None,
data=bream,)
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
color="red",
marker="s")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Extrapolating
Extrapolating means making predictions
outside the range of observed data.
little_bream = pd.DataFrame({"length_cm": [10]})
pred_little_bream = little_bream.assign(
mass_g=mdl_mass_vs_length.predict(little_bream))
print(pred_little_bream)
length_cm mass_g
0 10 -489.847756
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Working with model
objects
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
.params attribute
from statsmodels.formula.api import ols
mdl_mass_vs_length = ols("mass_g ~ length_cm", data = bream).fit()
print(mdl_mass_vs_length.params)
Intercept -1035.347565
length_cm 54.549981
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.fittedvalues attribute
Fitted values: predictions on the original dataset

print(mdl_mass_vs_length.fittedvalues)

or equivalently:

explanatory_data = bream["length_cm"]
print(mdl_mass_vs_length.predict(explanatory_data))

0      230.211993
1      273.851977
2      268.396979
3      399.316934
4      410.226930
          ...
30     873.901768
31     873.901768
32     939.361745
33    1004.821722
34    1037.551710
Length: 35, dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.resid attribute
Residuals: actual response values minus predicted response values

print(mdl_mass_vs_length.resid)

or equivalently:

print(bream["mass_g"] - mdl_mass_vs_length.fittedvalues)

0    11.788007
1    16.148023
2    71.603021
3   -36.316934
4    19.773070
       ...
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.summary()
mdl_mass_vs_length.summary()
OLS Regression Results
==============================================================================
Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6
Date: Thu, 29 Oct 2020 Prob (F-statistic): 1.22e-16
Time: 13:23:21 Log-Likelihood: -199.35
No. Observations: 35 AIC: 402.7
Df Residuals: 33 BIC: 405.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1035.3476 107.973 -9.589 0.000 -1255.020 -815.676
length_cm 54.5500 3.539 15.415 0.000 47.350 61.750
==============================================================================
Omnibus: 7.314 Durbin-Watson: 1.478
Prob(Omnibus): 0.026 Jarque-Bera (JB): 10.857
Skew: -0.252 Prob(JB): 0.00439
Kurtosis: 5.682 Cond. No. 263.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Regression to the
mean
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The concept
Response value = fitted value + residual
"The stuff you explained" + "the stuff you couldn't explain"
Residuals exist due to problems in the model and fundamental randomness
Extreme cases are often due to randomness
Regression to the mean means extreme cases don't persist over time
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Pearson's father son dataset
1078 father/son pairs
Do tall fathers have tall sons?

father_height_cm  son_height_cm
           165.2          151.8
           160.7          160.6
           165.0          160.9
           167.0          159.5
           155.3          163.3
             ...            ...

1 Adapted from https://www.rdocumentation.org/packages/UsingR/topics/father.son
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Scatter plot
fig = plt.figure()
sns.scatterplot(x="father_height_cm",
y="son_height_cm",
data=father_son)
plt.axline(xy1=(150, 150),
slope=1,
linewidth=2,
color="green")
plt.axis("equal")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Adding a regression line
fig = plt.figure()
sns.regplot(x="father_height_cm",
y="son_height_cm",
data=father_son,
ci = None,
line_kws={"color": "black"})
plt.axline(xy1 = (150, 150),
slope=1,
linewidth=2,
color="green")
plt.axis("equal")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Running a regression
mdl_son_vs_father = ols("son_height_cm ~ father_height_cm",
data = father_son).fit()
print(mdl_son_vs_father.params)
Intercept 86.071975
father_height_cm 0.514093
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Making predictions
really_tall_father = pd.DataFrame(
    {"father_height_cm": [190]})
mdl_son_vs_father.predict(really_tall_father)

183.7

really_short_father = pd.DataFrame(
    {"father_height_cm": [150]})
mdl_son_vs_father.predict(really_short_father)

163.2
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Transforming
variables
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Perch dataset
perch = fish[fish["species"] == "Perch"]
print(perch.head())
species mass_g length_cm
55 Perch 5.9 7.5
56 Perch 32.0 12.5
57 Perch 40.0 13.8
58 Perch 51.5 15.0
59 Perch 70.0 15.7
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
It's not a linear relationship
sns.regplot(x="length_cm",
y="mass_g",
data=perch,
ci=None)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Bream vs. perch
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Plotting mass vs. length cubed
perch["length_cm_cubed"] = perch["length_cm"] ** 3
sns.regplot(x="length_cm_cubed",
y="mass_g",
data=perch,
ci=None)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Modeling mass vs. length cubed
perch["length_cm_cubed"] = perch["length_cm"] ** 3
mdl_perch = ols("mass_g ~ length_cm_cubed", data=perch).fit()
mdl_perch.params
Intercept -0.117478
length_cm_cubed 0.016796
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Predicting mass vs. length cubed
explanatory_data = pd.DataFrame({"length_cm_cubed": np.arange(10, 41, 5) ** 3,
"length_cm": np.arange(10, 41, 5)})
prediction_data = explanatory_data.assign(
mass_g=mdl_perch.predict(explanatory_data))
print(prediction_data)
length_cm_cubed length_cm mass_g
0 1000 10 16.678135
1 3375 15 56.567717
2 8000 20 134.247429
3 15625 25 262.313982
4 27000 30 453.364084
5 42875 35 719.994447
6 64000 40 1074.801781
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Plotting mass vs. length cubed
fig = plt.figure()
sns.regplot(x="length_cm_cubed", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm_cubed", y="mass_g",
                color="red", marker="s")

fig = plt.figure()
sns.regplot(x="length_cm", y="mass_g",
            data=perch, ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm", y="mass_g",
                color="red", marker="s")
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Facebook advertising dataset
How advertising works
1. Pay Facebook to show ads.
2. People see the ads ("impressions").
3. Some people who see it, click it.

spent_usd  n_impressions  n_clicks
     1.43           7350         1
     1.82          17861         2
     1.25           4259         1
     1.29           4133         1
     4.77          15615         3
      ...            ...       ...

936 rows
Each row represents 1 advert
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Plot is cramped
sns.regplot(x="spent_usd",
y="n_impressions",
data=ad_conversion,
ci=None)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Square root vs square root
ad_conversion["sqrt_spent_usd"] = np.sqrt(
ad_conversion["spent_usd"])
ad_conversion["sqrt_n_impressions"] = np.sqrt(
ad_conversion["n_impressions"])
sns.regplot(x="sqrt_spent_usd",
y="sqrt_n_impressions",
data=ad_conversion,
ci=None)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Modeling and predicting
mdl_ad = ols("sqrt_n_impressions ~ sqrt_spent_usd", data=ad_conversion).fit()
explanatory_data = pd.DataFrame({"sqrt_spent_usd": np.sqrt(np.arange(0, 601, 100)),
"spent_usd": np.arange(0, 601, 100)})
prediction_data = explanatory_data.assign(sqrt_n_impressions=mdl_ad.predict(explanatory_data),
n_impressions=mdl_ad.predict(explanatory_data) ** 2)
print(prediction_data)
sqrt_spent_usd spent_usd sqrt_n_impressions n_impressions
0 0.000000 0 15.319713 2.346936e+02
1 10.000000 100 597.736582 3.572890e+05
2 14.142136 200 838.981547 7.038900e+05
3 17.320508 300 1024.095320 1.048771e+06
4 20.000000 400 1180.153450 1.392762e+06
5 22.360680 500 1317.643422 1.736184e+06
6 24.494897 600 1441.943858 2.079202e+06
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Quantifying model
fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Bream and perch models
Bream Perch
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Coefficient of determination
Sometimes called "r-squared" or "R-squared".
The proportion of the variance in the response variable that is predictable from the
explanatory variable
1 means a perfect fit
0 means the worst possible fit
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.summary()
Look at the value titled "R-Squared"
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_bream.summary())
# Some lines of output omitted
OLS Regression Results
Dep. Variable: mass_g R-squared: 0.878
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 237.6
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.rsquared attribute
print(mdl_bream.rsquared)
0.8780627095147174
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
It's just correlation squared
coeff_determination = bream["length_cm"].corr(bream["mass_g"]) ** 2
print(coeff_determination)
0.8780627095147173
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Residual standard error (RSE)
A "typical" di erence between a prediction and an observed response
It has the same unit as the response variable.
MSE = RSE²
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.mse_resid attribute
mse = mdl_bream.mse_resid
print('mse: ', mse)
mse: 5498.555084973521
rse = np.sqrt(mse)
print("rse: ", rse)
rse: 74.15224261594197
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating RSE: residuals squared
residuals_sq = mdl_bream.resid ** 2
print("residuals sq: \n", residuals_sq)

residuals sq:
0     138.957118
1     260.758635
2    5126.992578
3    1318.919660
4     390.974309
         ...
30    2125.047026
31    6576.923291
32     206.259713
33     889.335096
34    7665.302003
Length: 35, dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating RSE: sum of residuals squared
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
print("resid sum of sq :",
      resid_sum_of_sq)

resid sum of sq : 181452.31780412616
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating RSE: degrees of freedom
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
print("deg freedom: ", deg_freedom)

deg freedom: 33

Degrees of freedom equals the number of observations minus the number of model coefficients.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating RSE: square root of ratio
residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
rse = np.sqrt(resid_sum_of_sq/deg_freedom)
print("rse :", rse)

rse : 74.15224261594197
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Interpreting RSE
mdl_bream has an RSE of 74.
The difference between predicted bream masses and observed bream masses is typically about 74 g.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Root-mean-square error (RMSE)
RSE:

residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
deg_freedom = len(bream.index) - 2
rse = np.sqrt(resid_sum_of_sq/deg_freedom)
print("rse :", rse)

rse : 74.15224261594197

RMSE:

residuals_sq = mdl_bream.resid ** 2
resid_sum_of_sq = sum(residuals_sq)
n_obs = len(bream.index)
rmse = np.sqrt(resid_sum_of_sq/n_obs)
print("rmse :", rmse)

rmse : 72.00244396727619
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Visualizing model fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Residual properties of a good fit
Residuals are normally distributed
The mean of the residuals is zero
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Bream and perch again
Bream: the "good" model Perch: the "bad" model
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit() mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Residuals vs. fitted
Bream Perch
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Q-Q plot
Bream Perch
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Scale-location plot
Bream Perch
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
residplot()
sns.residplot(x="length_cm", y="mass_g", data=bream, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
qqplot()
from statsmodels.api import qqplot
qqplot(data=mdl_bream.resid, fit=True, line="45")
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Scale-location plot
model_norm_residuals_bream = mdl_bream.get_influence().resid_studentized_internal
model_norm_residuals_abs_sqrt_bream = np.sqrt(np.abs(model_norm_residuals_bream))
sns.regplot(x=mdl_bream.fittedvalues, y=model_norm_residuals_abs_sqrt_bream, ci=None, lowess=True)
plt.xlabel("Fitted values")
plt.ylabel("Sqrt of abs val of stdized residuals")
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Outliers, leverage,
and influence
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Roach dataset
roach = fish[fish['species'] == "Roach"]
print(roach.head())
species mass_g length_cm
35 Roach 40.0 12.9
36 Roach 69.0 16.5
37 Roach 78.0 17.5
38 Roach 87.0 18.2
39 Roach 120.0 18.6
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Which points are outliers?
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Extreme explanatory values
roach["extreme_l"] = ((roach["length_cm"] < 15) |
(roach["length_cm"] > 26))
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
data=roach)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Response values away from the regression line
roach["extreme_m"] = roach["mass_g"] < 1
fig = plt.figure()
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
hue="extreme_l",
style="extreme_m",
data=roach)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Leverage and influence
Leverage is a measure of how extreme the explanatory variable values are.
Influence measures how much the model would change if you left the observation out of the
dataset when modeling.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
.get_influence() and .summary_frame()
mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
summary_roach = mdl_roach.get_influence().summary_frame()
roach["leverage"] = summary_roach["hat_diag"]
print(roach.head())
species mass_g length_cm leverage
35 Roach 40.0 12.9 0.313729
36 Roach 69.0 16.5 0.125538
37 Roach 78.0 17.5 0.093487
38 Roach 87.0 18.2 0.076283
39 Roach 120.0 18.6 0.068387
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Cook's distance
Cook's distance is the most common measure of influence.
roach["cooks_dist"] = summary_roach["cooks_d"]
print(roach.head())
species mass_g length_cm leverage cooks_dist
35 Roach 40.0 12.9 0.313729 1.074015
36 Roach 69.0 16.5 0.125538 0.010429
37 Roach 78.0 17.5 0.093487 0.000020
38 Roach 87.0 18.2 0.076283 0.001980
39 Roach 120.0 18.6 0.068387 0.006610
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Most influential roaches
print(roach.sort_values("cooks_dist", ascending = False))
species mass_g length_cm leverage cooks_dist
35 Roach 40.0 12.9 0.313729 1.074015 # really short roach
54 Roach 390.0 29.5 0.394740 0.365782 # really long roach
40 Roach 0.0 19.0 0.061897 0.311852 # roach with zero mass
52 Roach 290.0 24.0 0.099488 0.150064
51 Roach 180.0 23.6 0.088391 0.061209
.. ... ... ... ... ...
43 Roach 150.0 20.4 0.050264 0.000257
44 Roach 145.0 20.5 0.050092 0.000256
42 Roach 120.0 19.4 0.056815 0.000199
47 Roach 160.0 21.1 0.050910 0.000137
37 Roach 78.0 17.5 0.093487 0.000020
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Removing the most influential roach
roach_not_short = roach[roach["length_cm"] != 12.9]
sns.regplot(x="length_cm",
y="mass_g",
data=roach,
ci=None,
line_kws={"color": "green"})
sns.regplot(x="length_cm",
y="mass_g",
data=roach_not_short,
ci=None,
line_kws={"color": "red"})
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Why you need
logistic regression
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
0            0.3993247                  -0.5158691
1            -0.4297957                 0.6780654
0            3.7383122                  0.4082544
0            0.6032289                  -0.6990435
...          ...                        ...
response     length of relationship     recency of activity

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Churn vs. recency: a linear model
mdl_churn_vs_recency_lm = ols("has_churned ~ time_since_last_purchase",
data=churn).fit()
print(mdl_churn_vs_recency_lm.params)
Intercept 0.490780
time_since_last_purchase 0.063783
dtype: float64
intercept, slope = mdl_churn_vs_recency_lm.params
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing the linear model
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=churn)
plt.axline(xy1=(0, intercept),
slope=slope)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Zooming out
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=churn)
plt.axline(xy1=(0,intercept),
slope=slope)
plt.xlim(-10, 10)
plt.ylim(-0.2, 1.2)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
What is logistic regression?
Another type of generalized linear model.
Used when the response variable is logical.
The responses follow a logistic (S-shaped) curve.
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Logistic regression using logit()
from statsmodels.formula.api import logit
mdl_churn_vs_recency_logit = logit("has_churned ~ time_since_last_purchase",
data=churn).fit()
print(mdl_churn_vs_recency_logit.params)
Intercept -0.035019
time_since_last_purchase 0.269215
dtype: float64
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing the logistic model
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)
plt.axline(xy1=(0,intercept),
slope=slope,
color="black")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Zooming out
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Predictions and odds
ratios
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The regplot() predictions
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Making predictions
mdl_recency = logit("has_churned ~ time_since_last_purchase",
data = churn).fit()
explanatory_data = pd.DataFrame(
{"time_since_last_purchase": np.arange(-1, 6.25, 0.25)})
prediction_data = explanatory_data.assign(
has_churned = mdl_recency.predict(explanatory_data))
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Adding point predictions
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)
sns.scatterplot(x="time_since_last_purchase",
y="has_churned",
data=prediction_data,
color="red")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Getting the most likely outcome
prediction_data = explanatory_data.assign(
has_churned = mdl_recency.predict(explanatory_data))
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing most likely outcome
sns.regplot(x="time_since_last_purchase",
y="has_churned",
data=churn,
ci=None,
logistic=True)
sns.scatterplot(x="time_since_last_purchase",
y="most_likely_outcome",
data=prediction_data,
color="red")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Odds ratios
Odds ratio is the probability of something happening divided by the probability that it doesn't.

odds_ratio = probability / (1 − probability)

For example, with a probability of 0.25:

odds_ratio = 0.25 / (1 − 0.25) = 1/3
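
As a quick numeric check of this formula (illustrative only, not part of the course code):

probability = 0.25
odds_ratio = probability / (1 - probability)
print(odds_ratio)

0.3333333333333333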
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating odds ratio
prediction_data["odds_ratio"] = prediction_data["has_churned"] /
(1 - prediction_data["has_churned"])
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing odds ratio
sns.lineplot(x="time_since_last_purchase",
y="odds_ratio",
data=prediction_data)
plt.axhline(y=1,
linestyle="dotted")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing log odds ratio
sns.lineplot(x="time_since_last_purchase",
y="odds_ratio",
data=prediction_data)
plt.axhline(y=1,
linestyle="dotted")
plt.yscale("log")
plt.show()
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Calculating log odds ratio
prediction_data["log_odds_ratio"] = np.log(prediction_data["odds_ratio"])
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
All predictions together
time_since_last_purchase  has_churned  most_likely_outcome  odds_ratio  log_odds_ratio
0                         0.491        0                    0.966       -0.035
2                         0.623        1                    1.654       0.503
4                         0.739        1                    2.834       1.042
6                         0.829        1                    4.856       1.580
...                       ...          ...                  ...         ...
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Comparing scales
Scale                Are values easy     Are changes easy    Is precise?
                     to interpret?       to interpret?
Probability          ✔                   ✘                   ✔
Most likely outcome  ✔✔                  ✔                   ✘
Odds ratio           ✔                   ✘                   ✔
Log odds ratio       ✘                   ✔                   ✔
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Quantifying logistic
regression fit
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The four outcomes
predicted false predicted true
actual false correct false positive
actual true false negative correct
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Confusion matrix: counts of outcomes
actual_response = churn["has_churned"]
predicted_response = np.round(mdl_recency.predict())
outcomes = pd.DataFrame({"actual_response": actual_response,
"predicted_response": predicted_response})
print(outcomes.value_counts(sort=False))
actual_response predicted_response
0 0.0 141
1.0 59
1 0.0 111
1.0 89
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Visualizing the confusion matrix
conf_matrix = mdl_recency.pred_table()
print(conf_matrix)

[[141. 59.]
 [111. 89.]]

[[true negative, false positive],
 [false negative, true positive]]

from statsmodels.graphics.mosaicplot import mosaic
mosaic(conf_matrix)
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Accuracy
Accuracy is the proportion of correct
acc = (TN + TP) / (TN + TP + FN + FP)
predictions.
print(acc)
TN + TP
accuracy =
TN + FN + FP + TP 0.575
[[141., 59.],
[111., 89.]]
TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Sensitivity
Sensitivity is the proportion of actual positives that were correctly predicted.

sensitivity = TP / (FN + TP)

[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

sens = TP / (FN + TP)
print(sens)

0.445
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Specificity
Specificity is the proportion of actual negatives that were correctly predicted.

specificity = TN / (TN + FP)

[[141., 59.],
 [111., 89.]]

TN = conf_matrix[0,0]
TP = conf_matrix[1,1]
FN = conf_matrix[1,0]
FP = conf_matrix[0,1]

spec = TN / (TN + FP)
print(spec)

0.705
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Congratulations
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
You learned things
Chapter 1
Fit a simple linear regression
Interpret coefficients

Chapter 2
Make predictions
Regression to the mean
Transforming variables

Chapter 3
Quantifying model fit
Outliers, leverage, and influence

Chapter 4
Fit a simple logistic regression
Make predictions
Get performance from confusion matrix
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Multiple explanatory variables
Intermediate Regression with statsmodels in Python
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Unlocking advanced skills
Generalized Linear Models in Python
Introduction to Predictive Analytics in
Python
Linear Classifiers in Python
INTRODUCTION TO REGRESSION WITH STATSMODELS IN PYTHON
Happy learning!
I N T R O D U C T I O N T O R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Parallel slopes linear
regression
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The previous course
This course assumes knowledge from Introduction to Regression with statsmodels in Python
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
From simple regression to multiple regression
Multiple regression is a regression model with more than one explanatory variable.
More explanatory variables can give more insight and better predictions.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The course contents
Chapter 1
"Parallel slopes" regression

Chapter 2
Interactions
Simpson's Paradox

Chapter 3
More explanatory variables
How linear regression works

Chapter 4
Multiple logistic regression
The logistic distribution
How logistic regression works
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The fish dataset
mass_g  length_cm  species
242.0   23.2       Bream
5.9     7.5        Perch
200.0   30.0       Pike
40.0    12.9       Roach

Each row represents a fish
mass_g is the response variable
1 numeric, 1 categorical explanatory variable
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
One explanatory variable at a time
from statsmodels.formula.api import ols

mdl_mass_vs_length = ols("mass_g ~ length_cm",
                         data=fish).fit()
print(mdl_mass_vs_length.params)

Intercept -536.223947
length_cm 34.899245
dtype: float64

1 intercept coefficient
1 slope coefficient

mdl_mass_vs_species = ols("mass_g ~ species + 0",
                          data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream] 617.828571
species[Perch] 382.239286
species[Pike] 718.705882
species[Roach] 152.050000
dtype: float64

1 intercept coefficient for each category
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Both variables at the same time
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0",
data=fish).fit()
print(mdl_mass_vs_both.params)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64
1 slope coefficient
1 intercept coefficient for each category
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Comparing coefficients
print(mdl_mass_vs_length.params)

Intercept -536.223947
length_cm 34.899245

print(mdl_mass_vs_species.params)

species[Bream] 617.828571
species[Perch] 382.239286
species[Pike] 718.705882
species[Roach] 152.050000

print(mdl_mass_vs_both.params)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualization: 1 numeric explanatory variable
import matplotlib.pyplot as plt
import seaborn as sns
sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualization: 1 categorical explanatory variable
sns.boxplot(x="species",
y="mass_g",
data=fish,
showmeans=True)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualization: both explanatory variables
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554

ic_bream, ic_perch, ic_pike, ic_roach, sl = coeffs

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)

plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Predicting parallel
slopes
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The prediction workflow
import pandas as pd
import numpy as np

expl_data_length = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(expl_data_length)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction workflow
[A, B, C] x [1, 2] ==> [A1, B1, C1, A2, B2, C2]

from itertools import product
product(["A", "B", "C"], [1, 2])

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

expl_data_both = pd.DataFrame(p,
                              columns=['length_cm',
                                       'species'])
print(expl_data_both)

    length_cm species
0           5   Bream
1           5   Roach
2           5   Perch
3           5    Pike
4          10   Bream
5          10   Roach
6          10   Perch
...
41         55   Roach
42         55   Perch
43         55    Pike
44         60   Bream
45         60   Roach
46         60   Perch
47         60    Pike
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction workflow
Predict mass_g from length_cm only:

prediction_data_length = expl_data_length.assign(
    mass_g = mdl_mass_vs_length.predict(expl_data_length)
)

    length_cm     mass_g
0           5  -361.7277
1          10  -187.2315
2          15   -12.7353
3          20   161.7610
4          25   336.2572
5          30   510.7534
...  # number of rows: 12

Predict mass_g from both explanatory variables:

prediction_data_both = expl_data_both.assign(
    mass_g = mdl_mass_vs_both.predict(expl_data_both)
)

    length_cm species     mass_g
0           5   Bream  -459.3991
1           5   Roach  -513.9350
2           5   Perch  -500.4501
3           5    Pike  -876.6133
4          10   Bream  -246.5563
5          10   Roach  -301.0923
...  # number of rows: 48
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualizing the predictions
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")
sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)
sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Manually calculating predictions for linear regression
coeffs = mdl_mass_vs_length.params
print(coeffs)

Intercept -536.223947
length_cm 34.899245

intercept, slope = coeffs

explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"]
)
print(prediction_data)

    length_cm       mass_g
0           5  -361.727721
1          10  -187.231494
2          15   -12.735268
3          20   161.760959
4          25   336.257185
5          30   510.753412
...
9          50  1208.738318
10         55  1383.234545
11         60  1557.730771
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Manually calculating predictions for multiple
regression
coeffs = mdl_mass_vs_both.params
print(coeffs)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
ic_bream, ic_perch, ic_pike, ic_roach, slope = coeffs
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
np.select()
conditions = [
condition_1,
condition_2,
# ...
condition_n
]
choices = [list_of_choices] # same length as conditions
np.select(conditions, choices)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Choosing an intercept with np.select()
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]

choices = [ic_bream, ic_perch, ic_pike, ic_roach]

intercept = np.select(conditions, choices)
print(intercept)

[ -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46
  -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46
  -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46
  -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46
  -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46
  -672.24 -726.78 -713.29 -1089.46 -672.24 -726.78 -713.29 -1089.46]
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The final prediction step
prediction_data = explanatory_data.assign(
    intercept = np.select(conditions, choices),
    mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)

    length_cm species  intercept     mass_g
0           5   Bream  -672.2419  -459.3991
1           5   Roach  -726.7778  -513.9350
2           5   Perch  -713.2929  -500.4501
3           5    Pike -1089.4561  -876.6133
4          10   Bream  -672.2419  -246.5563
5          10   Roach  -726.7778  -301.0923
6          10   Perch  -713.2929  -287.6073
7          10    Pike -1089.4561  -663.7705
8          15   Bream  -672.2419   -33.7136
...
40         55   Bream  -672.2419  1669.0286
41         55   Roach  -726.7778  1614.4927
42         55   Perch  -713.2929  1627.9776
43         55    Pike -1089.4561  1251.8144
44         60   Bream  -672.2419  1881.8714
45         60   Roach  -726.7778  1827.3354
46         60   Perch  -713.2929  1840.8204
47         60    Pike -1089.4561  1464.6572
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Compare to .predict()
mdl_mass_vs_both.predict(explanatory_data)

0     -459.3991
1     -513.9350
2     -500.4501
3     -876.6133
4     -246.5563
5     -301.0923
...
43    1251.8144
44    1881.8714
45    1827.3354
46    1840.8204
47    1464.6572
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Assessing model
performance
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Model performance metrics
Coefficient of determination (R-squared): how well the linear regression line fits the
observed values.
Larger is better.
Residual standard error (RSE): the typical size of the residuals.
Smaller is better.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Getting the coefficient of determination
print(mdl_mass_vs_length.rsquared)
0.8225689502644215
print(mdl_mass_vs_species.rsquared)
0.25814887709499157
print(mdl_mass_vs_both.rsquared)
0.9200433561156649
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Adjusted coefficient of determination
Adding more explanatory variables always increases R².
Too many explanatory variables causes overfitting.
The adjusted coefficient of determination penalizes more explanatory variables:

adj_r_squared = 1 − (1 − R²) * (n_obs − 1) / (n_obs − n_var − 1)

The penalty is noticeable when R² is small, or when n_var is a large fraction of n_obs.
In statsmodels, it's contained in the rsquared_adj attribute.
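
A minimal check of this formula against statsmodels, assuming mdl_mass_vs_length and fish from the surrounding slides (n_var is 1 here, since length_cm is the only explanatory variable):

r_sq = mdl_mass_vs_length.rsquared
n_obs = len(fish.index)
n_var = 1

# Manual calculation of the penalized formula above
adj_r_sq = 1 - (1 - r_sq) * (n_obs - 1) / (n_obs - n_var - 1)
print(adj_r_sq)                          # manual calculation
print(mdl_mass_vs_length.rsquared_adj)   # both prints should agree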
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Getting the adjusted coefficient of determination
print("rsq_length: ", mdl_mass_vs_length.rsquared)
print("rsq_adj_length: ", mdl_mass_vs_length.rsquared_adj)
rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121
print("rsq_species: ", mdl_mass_vs_species.rsquared)
print("rsq_adj_species: ", mdl_mass_vs_species.rsquared_adj)
rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722
print("rsq_both: ", mdl_mass_vs_both.rsquared
print("rsq_adj_both: ", mdl_mass_vs_both.rsquared_adj)
rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Getting the residual standard error
rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
print("rse_length: ", rse_length)
rse_length: 152.12092835414788
rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)
rse_species: 313.5501156682592
rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)
rse_both: 103.35563303966488
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Models for each
category
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Four categories
print(fish["species"].unique())
array(['Bream', 'Roach', 'Perch', 'Pike'], dtype=object)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Splitting the dataset
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Four models
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit() mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
print(mdl_bream.params) print(mdl_perch.params)
Intercept -1035.3476 Intercept -619.1751
length_cm 54.5500 length_cm 38.9115
mdl_pike = ols("mass_g ~ length_cm", data=pike).fit() mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
print(mdl_pike.params) print(mdl_roach.params)
Intercept -1540.8243 Intercept -329.3762
length_cm 53.1949 length_cm 23.3193
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Explanatory data
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(explanatory_data)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Making predictions
prediction_data_bream = explanatory_data.assign(
    mass_g = mdl_bream.predict(explanatory_data),
    species = "Bream")

prediction_data_perch = explanatory_data.assign(
    mass_g = mdl_perch.predict(explanatory_data),
    species = "Perch")

prediction_data_pike = explanatory_data.assign(
    mass_g = mdl_pike.predict(explanatory_data),
    species = "Pike")

prediction_data_roach = explanatory_data.assign(
    mass_g = mdl_roach.predict(explanatory_data),
    species = "Roach")
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Concatenating predictions
prediction_data = pd.concat([prediction_data_bream,
                             prediction_data_roach,
                             prediction_data_perch,
                             prediction_data_pike])

    length_cm       mass_g species
0           5  -762.597660   Bream
1          10  -489.847756   Bream
2          15  -217.097851   Bream
3          20    55.652054   Bream
4          25   328.401958   Bream
5          30   601.151863   Bream
...
3          20  -476.926955    Pike
4          25  -210.952626    Pike
5          30    55.021703    Pike
6          35   320.996032    Pike
7          40   586.970362    Pike
8          45   852.944691    Pike
9          50  1118.919020    Pike
10         55  1384.893349    Pike
11         60  1650.867679    Pike
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualizing predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Adding in your predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species",
ci=None,
legend=False)
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Coefficient of determination
mdl_fish = ols("mass_g ~ length_cm + species", print(mdl_bream.rsquared_adj)
data=fish).fit()
0.874
print(mdl_fish.rsquared_adj)
print(mdl_perch.rsquared_adj)
0.917
0.917
print(mdl_pike.rsquared_adj)
0.941
print(mdl_roach.rsquared_adj)
0.815
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Residual standard error
print(np.sqrt(mdl_fish.mse_resid))

103

print(np.sqrt(mdl_bream.mse_resid))

74.2

print(np.sqrt(mdl_perch.mse_resid))

100

print(np.sqrt(mdl_pike.mse_resid))

120

print(np.sqrt(mdl_roach.mse_resid))

38.2
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
One model with an
interaction
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
What is an interaction?
In the fish dataset
Different fish species have different mass-to-length ratios.
The effect of length on the expected mass is different for different species.
More generally
The effect of one explanatory variable on the expected response changes depending on the
value of another explanatory variable.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Specifying interactions
No interactions:

response ~ explntry1 + explntry2
mass_g ~ length_cm + species

With interactions (implicit):

response ~ explntry1 * explntry2
mass_g ~ length_cm * species

With interactions (explicit):

response ~ explntry1 + explntry2 + explntry1:explntry2
mass_g ~ length_cm + species + length_cm:species
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Running the model
mdl_mass_vs_both = ols("mass_g ~ length_cm * species", data=fish).fit()
print(mdl_mass_vs_both.params)
Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Easier to understand coefficients
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Familiar numbers
print(mdl_mass_vs_both_inter.params)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

print(mdl_bream.params)

Intercept -1035.3476
length_cm 54.5500
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Making predictions
with interactions
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The model with the interaction
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0",
data=fish).fit()
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
print(prediction_data)

    length_cm species      mass_g
0           5   Bream   -762.5977
1           5   Roach   -212.7799
2           5   Perch   -424.6178
3           5    Pike  -1274.8499
4          10   Bream   -489.8478
5          10   Roach    -96.1836
6          10   Perch   -230.0604
7          10    Pike  -1008.8756
8          15   Bream   -217.0979
...
40         55   Bream   1964.9014
41         55   Roach    953.1833
42         55   Perch   1520.9556
43         55    Pike   1384.8933
44         60   Bream   2237.6513
45         60   Roach   1069.7796
46         60   Perch   1715.5129
47         60    Pike   1650.8677
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualizing the predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Manually calculating the predictions
coeffs = mdl_mass_vs_both_inter.params
print(coeffs)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

(ic_bream, ic_perch, ic_pike, ic_roach,
 slope_bream, slope_perch, slope_pike, slope_roach) = coeffs
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Manually calculating the predictions
conditions = [
explanatory_data["species"] == "Bream",
explanatory_data["species"] == "Perch",
explanatory_data["species"] == "Pike",
explanatory_data["species"] == "Roach"
]
ic_choices = [ic_bream, ic_perch, ic_pike, ic_roach]
intercept = np.select(conditions, ic_choices)
slope_choices = [slope_bream, slope_perch, slope_pike, slope_roach]
slope = np.select(conditions, slope_choices)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Manually calculating the predictions
Manual calculation:

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"])
print(prediction_data)

Using .predict():

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
print(prediction_data)

Both approaches give the same output:

    length_cm species      mass_g
0           5   Bream   -762.5977
1           5   Roach   -212.7799
2           5   Perch   -424.6178
3           5    Pike  -1274.8499
4          10   Bream   -489.8478
5          10   Roach    -96.1836
...
43         55    Pike   1384.8933
44         60   Bream   2237.6513
45         60   Roach   1069.7796
46         60   Perch   1715.5129
47         60    Pike   1650.8677
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Simpson's Paradox
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
A most ingenious paradox!
Simpson's Paradox occurs when the trend of a model on the whole dataset is very different
from the trends shown by models on subsets of the dataset.
trend = slope coefficient
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Synthetic Simpson data
x         y         group
62.24344  70.60840  D
52.33499  14.70577  B
56.36795  46.39554  C
66.80395  66.17487  D
66.53605  89.24658  E
62.38129  91.45260  E

5 groups of data, labeled "A" to "E"

1 https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Linear regressions
Whole dataset:

mdl_whole = ols("y ~ x",
                data=simpsons_paradox).fit()
print(mdl_whole.params)

Intercept -38.554
x 1.751

By group:

mdl_by_group = ols("y ~ group + group:x + 0",
                   data=simpsons_paradox).fit()
print(mdl_by_group.params)

groupA     groupB     groupC     groupD     groupE
32.5051    67.3886    99.6333    132.3932   123.8242
groupA:x   groupB:x   groupC:x   groupD:x   groupE:x
-0.6266    -1.0105    -0.9940    -0.9908    -0.5364
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Plotting the whole dataset
sns.regplot(x="x",
y="y",
data=simpsons_paradox,
ci=None)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Plotting by group
sns.lmplot(x="x",
y="y",
data=simpsons_paradox,
hue="group",
ci=None)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Reconciling the difference
Good advice
If possible, try to plot the dataset.
Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.
More good advice
Articulate a question before you start modeling.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Test score example
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Infectious disease example
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Reconciling the difference
Usually (but not always) the grouped model contains more insight.
Are you missing explanatory variables?
Context is important.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Simpson's paradox in real datasets
The paradox is usually less obvious.
You may see a zero slope rather than a complete change in direction.
It may not appear in every group.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Two numeric
explanatory
variables
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Visualizing three numeric variables
3D scatter plot
2D scatter plot with response as color
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Another column for the fish dataset
species mass_g length_cm height_cm
Bream 1000 33.5 18.96
Bream 925 36.2 18.75
Roach 290 24.0 8.88
Roach 390 29.5 9.48
Perch 1100 39.0 12.80
Perch 1000 40.2 12.60
Pike 1250 52.0 10.69
Pike 1650 59.0 10.81
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
3D scatter plot
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
2D scatter plot, color for response
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Modeling with two numeric explanatory variables
mdl_mass_vs_both = ols("mass_g ~ length_cm + height_cm",
data=fish).fit()
print(mdl_mass_vs_both.params)
Intercept -622.150234
length_cm 28.968405
height_cm 26.334804
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction flow
from itertools import product length_cm height_cm mass_g
0 5 2 -424.638603
length_cm = np.arange(5, 61, 5) 1 5 4 -371.968995
height_cm = np.arange(2, 21, 2) 2 5 6 -319.299387
3 5 8 -266.629780
p = product(length_cm, height_cm) 4 5 10 -213.960172
.. ... ... ...
explanatory_data = pd.DataFrame(p, 115 60 12 1431.971694
columns=["length_cm", 116 60 14 1484.641302
"height_cm"]) 117 60 16 1537.310909
118 60 18 1589.980517
119 60 20 1642.650125
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both.predict(explanatory_data))
[120 rows x 3 columns]
print(prediction_data)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Including an interaction
mdl_mass_vs_both_inter = ols("mass_g ~ length_cm * height_cm",
data=fish).fit()
print(mdl_mass_vs_both_inter.params)
Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435
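
To see how these four coefficients combine, here is an illustrative manual prediction for one hypothetical fish (the length and height values are made up for the example):

coeffs = mdl_mass_vs_both_inter.params
length, height = 30, 10  # hypothetical fish

# intercept + both main effects + the interaction term
mass_g = (coeffs["Intercept"]
          + coeffs["length_cm"] * length
          + coeffs["height_cm"] * height
          + coeffs["length_cm:height_cm"] * length * height)
print(mass_g)  # matches mdl_mass_vs_both_inter.predict() for the same values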
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction flow with an interaction
length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p,
columns=["length_cm",
"height_cm"])
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
More than two
explanatory
variables
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
From last time
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Faceting by species
grid = sns.FacetGrid(data=fish,
col="species",
hue="mass_g",
col_wrap=2,
palette="plasma")
grid.map(sns.scatterplot,
"length_cm",
"height_cm")
plt.show()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Faceting by species
It's possible to use more than one
categorical variable for faceting
Beware of faceting overuse
Plotting becomes harder with an increasing number of variables
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Different levels of interaction
No interactions
ols("mass_g ~ length_cm + height_cm + species + 0", data=fish).fit()
Two-way interactions between pairs of variables:

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

Three-way interaction between all three variables:

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
All the interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Only two-way interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
    data=fish).fit()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The prediction flow
mdl_mass_vs_all = ols(
    "mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
species = fish["species"].unique()

p = product(length_cm, height_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_all.predict(explanatory_data))

     length_cm  height_cm species       mass_g
0            5          2   Bream  -570.656437
1            5          2   Roach    31.449145
2            5          2   Perch    43.789984
3            5          2    Pike   271.270093
4            5          4   Bream  -451.127405
..         ...        ...     ...          ...
475         60         18    Pike  2690.346384
476         60         20   Bream  1531.618475
477         60         20   Roach  2621.797668
478         60         20   Perch  3041.931709
479         60         20    Pike  2926.352397

[480 rows x 4 columns]
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
How linear
regression works
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
The standard simple linear regression plot
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualizing residuals
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
A metric for the best fit
The simplest idea (which doesn't work)
Take the sum of all the residuals.
Some residuals are negative.
The next simplest idea (which does work)
Take the square of each residual, and add up those squares.
This is called the sum of squares.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
A detour into numerical optimization
A line plot of a quadratic equation
x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10
xy_data = pd.DataFrame({"x": x,
"y": y})
sns.lineplot(x="x",
y="y",
data=xy_data)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Using calculus to solve the equation
y = x² − x + 10

∂y/∂x = 2x − 1
0 = 2x − 1
x = 0.5
y = 0.5² − 0.5 + 10 = 9.75

Not all equations can be solved like this.
You can let Python figure it out.
Don't worry if this doesn't make sense; you won't need it for the exercises.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
minimize()
from scipy.optimize import minimize

def calc_quadratic(x):
    y = x ** 2 - x + 10
    return y

minimize(fun=calc_quadratic,
         x0=3)

      fun: 9.75
 hess_inv: array([[0.5]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 2
     njev: 3
   status: 0
  success: True
        x: array([0.49999998])
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
A linear regression algorithm
Define a function to calculate the sum of squares metric.

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    # More calculation!

Call minimize() to find coefficients that minimize this function, as in the sketch below.

minimize(
    fun=calc_sum_of_squares,
    x0=[0, 0]
)
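
A minimal runnable version of this sketch, assuming the bream DataFrame from the earlier slides; the "# More calculation!" step is filled in with the sum of squared residuals:

import numpy as np
from scipy.optimize import minimize

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    y_pred = intercept + slope * bream["length_cm"]
    # Square each residual, then add up the squares
    return np.sum((bream["mass_g"] - y_pred) ** 2)

result = minimize(fun=calc_sum_of_squares, x0=[0, 0])
print(result.x)  # should be close to mdl_bream.params: roughly -1035.3 and 54.55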
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Multiple logistic
regression
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
0            0.3993247                  -0.5158691
1            -0.4297957                 0.6780654
0            3.7383122                  0.4082544
0            0.6032289                  -0.6990435
...          ...                        ...
response     length of relationship     recency of activity

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
logit()
from statsmodels.formula.api import logit
logit("response ~ explanatory", data=dataset).fit()
logit("response ~ explanatory1 + explanatory2", data=dataset).fit()
logit("response ~ explanatory1 * explanatory2", data=dataset).fit()
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
The four outcomes
predicted false predicted true
actual false correct false positive
actual true false negative correct
conf_matrix = mdl_logit.pred_table()
print(conf_matrix)
[[102. 98.]
[ 53. 147.]]
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Prediction flow
from itertools import product
explanatory1 = some_values
explanatory2 = some_values
p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
columns=["explanatory1",
"explanatory2"])
prediction_data = explanatory_data.assign(
    has_churned = mdl_logit.predict(explanatory_data))
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Visualization
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])
sns.scatterplot(...
data=churn,
hue="has_churned",
...)
sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
The logistic
distribution
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Gaussian probability density function (PDF)
from scipy.stats import norm
x = np.arange(-4, 4.05, 0.05)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)
sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Gaussian cumulative distribution function (CDF)
x = np.arange(-4, 4.05, 0.05)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)
sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Gaussian inverse CDF
p = np.arange(0.001, 1, 0.001)
gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)
sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Logistic PDF
from scipy.stats import logistic
x = np.arange(-4, 4.05, 0.05)
logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)
sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Logistic distribution
Logistic distribution CDF is also called the logistic function.

cdf(x) = 1 / (1 + exp(−x))

Logistic distribution inverse CDF is also called the logit function.

inverse_cdf(p) = log(p / (1 − p))
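
As a quick numerical check of both identities (an illustrative sketch; scipy's logistic.ppf is the inverse CDF):

import numpy as np
from scipy.stats import logistic

x = 1.5
print(logistic.cdf(x), 1 / (1 + np.exp(-x)))  # logistic function, two ways

p = 0.75
print(logistic.ppf(p), np.log(p / (1 - p)))   # logit function, two ways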
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
How logistic
regression works
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
Sum of squares doesn't work
np.sum((y_pred - y_actual) ** 2)
y_actual is always 0 or 1 .
y_pred is between 0 and 1 .
There is a better metric than sum of squares.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Likelihood
y_pred * y_actual
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Likelihood
y_pred * y_actual + (1 - y_pred) * (1 - y_actual)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Likelihood
np.sum(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))
When y_actual = 1
y_pred * 1 + (1 - y_pred) * (1 - 1) = y_pred
When y_actual = 0
y_pred * 0 + (1 - y_pred) * (1 - 0) = 1 - y_pred
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Log-likelihood
Computing likelihood involves adding many very small numbers, leading to numerical error.
Log-likelihood is easier to compute.
log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
Maximizing either metric gives the same coefficients.
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Negative log-likelihood
Maximizing log-likelihood is the same as minimizing negative log-likelihood.
-np.sum(log_likelihoods)
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Logistic regression algorithm
def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # More calculation!
from scipy.optimize import minimize
minimize(
fun=calc_neg_log_likelihood,
x0=[0, 0]
)
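
A minimal runnable version of this sketch, assuming the churn DataFrame from the earlier slides; the "# More calculation!" step is filled in with the negative log-likelihood, and logistic.cdf converts the linear predictor into a predicted probability:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import logistic

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # Predicted probability of churning for each customer
    y_pred = logistic.cdf(intercept + slope * churn["time_since_last_purchase"])
    y_actual = churn["has_churned"]
    log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    return -np.sum(log_likelihood)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
print(result.x)  # should be close to the logit() params: roughly -0.035 and 0.269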
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Let's practice!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Congratulations!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Maarten Van den Broeck
Content Developer at DataCamp
You learned things
Chapter 1
Fit/visualize/predict/assess parallel slopes

Chapter 2
Interactions between explanatory variables
Simpson's Paradox

Chapter 3
Extend to many explanatory variables
Implement linear regression algorithm

Chapter 4
Logistic regression with multiple explanatory variables
Logistic distribution
Implement logistic regression algorithm
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
There is more to learn
Training and testing sets
Cross validation
P-values and significance
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Advanced regression
Generalized Linear Models in Python
Introduction to Predictive Analytics in Python
Linear Classifiers in Python
Machine Learning with Tree-Based Models in Python
INTERMEDIATE REGRESSION WITH STATSMODELS IN PYTHON
Have fun regressing!
I N T E R M E D I AT E R E G R E S S I O N W I T H S TAT S M O D E L S I N P Y T H O N
Sampling and point
estimates
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Estimating the population of France
A census asks every household how many
people live there.
SAMPLING IN PYTHON
There are lots of people in France
Censuses are really expensive!
SAMPLING IN PYTHON
Sampling households
Cheaper to ask a small number of households
and use statistics to estimate the population
Working with a subset of the whole population
is called sampling
SAMPLING IN PYTHON
Population vs. sample
The population is the complete dataset
Doesn't have to refer to people
Typically, don't know what the whole population is
The sample is the subset of data you calculate on
SAMPLING IN PYTHON
Coffee rating dataset
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
90.58 NA Ethiopia 8.67 8.83 8.67 8.50 8.42
89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
... ... ... ... ... ... ... ...
73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83
Each row represents 1 coffee
1338 rows
We'll treat this as the population
SAMPLING IN PYTHON
Points vs. flavor: population
pts_vs_flavor_pop = coffee_ratings[["total_cup_points", "flavor"]]
total_cup_points flavor
0 90.58 8.83
1 89.92 8.67
2 89.75 8.50
3 89.00 8.58
4 88.83 8.50
... ... ...
1333 78.75 7.58
1334 78.08 7.67
1335 77.17 7.33
1336 75.08 6.83
1337 73.75 6.67
[1338 rows x 2 columns]
SAMPLING IN PYTHON
Points vs. flavor: 10 row sample
pts_vs_flavor_samp = pts_vs_flavor_pop.sample(n=10)
total_cup_points flavor
1088 80.33 7.17
1157 79.67 7.42
1267 76.17 7.33
506 83.00 7.67
659 82.50 7.42
817 81.92 7.50
1050 80.67 7.42
685 82.42 7.50
1027 80.92 7.25
62 85.58 8.17
[10 rows x 2 columns]
SAMPLING IN PYTHON
Python sampling for Series
Use .sample() for pandas DataFrames and Series
cup_points_samp = coffee_ratings['total_cup_points'].sample(n=10)
1088 80.33
1157 79.67
1267 76.17
... ...
685 82.42
1027 80.92
62 85.58
Name: total_cup_points, dtype: float64
SAMPLING IN PYTHON
Population parameters & point estimates
A population parameter is a calculation made on the population dataset
import numpy as np
np.mean(pts_vs_flavor_pop['total_cup_points'])
82.15120328849028
A point estimate or sample statistic is a calculation made on the sample dataset
np.mean(cup_points_samp)
81.31800000000001
SAMPLING IN PYTHON
Point estimates with pandas
pts_vs_flavor_pop['flavor'].mean()
7.526046337817639
pts_vs_flavor_samp['flavor'].mean()
7.485000000000001
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Convenience
sampling
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
The Literary Digest election prediction
Prediction: Landon gets 57%; Roosevelt gets 43%
Actual results: Landon got 38%; Roosevelt got 62%
Sample not representative of population, causing sample bias
Collecting data by the easiest method is called convenience sampling
SAMPLING IN PYTHON
Finding the mean age of French people
Survey 10 people at Disneyland Paris
Mean age of 24.6 years
Will this be a good estimate for all of
France?
1 Image by Sean MacEntee
SAMPLING IN PYTHON
How accurate was the survey?
Year  Average French Age
1975  31.6
1985  33.6
1995  36.2
2005  38.9
2015  41.2

24.6 years is a poor estimate
People who visit Disneyland aren't representative of the whole population
SAMPLING IN PYTHON
Convenience sampling coffee ratings
coffee_ratings["total_cup_points"].mean()
82.15120328849028
coffee_ratings_first10 = coffee_ratings.head(10)
coffee_ratings_first10["total_cup_points"].mean()
89.1
SAMPLING IN PYTHON
Visualizing selection bias
import matplotlib.pyplot as plt
import numpy as np
coffee_ratings["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
coffee_ratings_first10["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
SAMPLING IN PYTHON
Distribution of a population and of a convenience
sample
Population: Convenience sample:
SAMPLING IN PYTHON
Visualizing selection bias for a random sample
coffee_sample = coffee_ratings.sample(n=10)
coffee_sample["total_cup_points"].hist(bins=np.arange(59, 93, 2))
plt.show()
SAMPLING IN PYTHON
Distribution of a population and of a simple random
sample
Population: Random Sample:
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Pseudo-random
number generation
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
What does random mean?
{adjective} made, done, happening, or chosen without method or conscious decision.
1 Oxford Languages
SAMPLING IN PYTHON
True random numbers
Generated from physical processes, like flipping coins
Hotbits uses radioactive decay
RANDOM.ORG uses atmospheric noise
True randomness is expensive
1 https://www.fourmilab.ch/hotbits 2 https://www.random.org
SAMPLING IN PYTHON
Pseudo-random number generation
Pseudo-random number generation is cheap and fast
Next "random" number calculated from previous "random" number
The first "random" number calculated from a seed
The same seed value yields the same random numbers
SAMPLING IN PYTHON
Pseudo-random number generation example
seed = 1
calc_next_random(seed)  # returns 3, say
calc_next_random(3)     # returns 2
calc_next_random(2)     # and so on: each call feeds the previous output forward
SAMPLING IN PYTHON
Random number generating functions
Prefix each function with numpy.random , such as numpy.random.beta()
function            distribution
.beta               Beta
.binomial           Binomial
.chisquare          Chi-squared
.exponential        Exponential
.f                  F
.gamma              Gamma
.geometric          Geometric
.hypergeometric     Hypergeometric
.lognormal          Lognormal
.negative_binomial  Negative binomial
.normal             Normal
.poisson            Poisson
.standard_t         t
.uniform            Uniform
SAMPLING IN PYTHON
Visualizing random numbers
randoms = np.random.beta(a=2, b=2, size=5000)
randoms
array([0.6208281 , 0.73216171, 0.44298403, ...,
0.13411873, 0.52198411, 0.72355098])
plt.hist(randoms, bins=np.arange(0, 1, 0.05))
plt.show()
SAMPLING IN PYTHON
Random number seeds
np.random.seed(20000229)
np.random.normal(loc=2, scale=1.5, size=2)
array([-0.59030264,  1.87821258])
np.random.normal(loc=2, scale=1.5, size=2)
array([2.52619561, 4.9684949 ])
Setting the same seed again and rerunning the calls reproduces exactly the same arrays.
SAMPLING IN PYTHON
Using a different seed
np.random.seed(20000229)
np.random.normal(loc=2, scale=1.5, size=2)
array([-0.59030264,  1.87821258])
np.random.normal(loc=2, scale=1.5, size=2)
array([2.52619561, 4.9684949 ])

np.random.seed(20041004)
np.random.normal(loc=2, scale=1.5, size=2)
array([1.09364337, 4.55285159])
np.random.normal(loc=2, scale=1.5, size=2)
array([2.67038916, 2.36677492])

A different seed produces a different sequence of random numbers.
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Simple random and
systematic sampling
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Simple random sampling
SAMPLING IN PYTHON
Simple random sampling of coffees
SAMPLING IN PYTHON
Simple random sampling with pandas
coffee_ratings.sample(n=5, random_state=19000113)
total_cup_points variety country_of_origin aroma flavor \
437 83.25 None Colombia 7.92 7.75
285 83.83 Yellow Bourbon Brazil 7.92 7.50
784 82.08 None Colombia 7.50 7.42
648 82.58 Caturra Colombia 7.58 7.50
155 84.58 Caturra Colombia 7.42 7.67
aftertaste body balance
437 7.25 7.83 7.58
285 7.33 8.17 7.50
784 7.42 7.67 7.42
648 7.42 7.67 7.42
155 7.75 8.08 7.83
SAMPLING IN PYTHON
Systematic sampling
SAMPLING IN PYTHON
Systematic sampling - defining the interval
sample_size = 5
pop_size = len(coffee_ratings)
print(pop_size)
1338
interval = pop_size // sample_size
print(interval)
267
SAMPLING IN PYTHON
Systematic sampling - selecting the rows
coffee_ratings.iloc[::interval]
total_cup_points variety country_of_origin aroma flavor aftertaste \
0 90.58 None Ethiopia 8.67 8.83 8.67
267 83.92 None Colombia 7.83 7.75 7.58
534 82.92 Bourbon El Salvador 7.50 7.50 7.75
801 82.00 Typica Taiwan 7.33 7.50 7.17
1068 80.50 Other Taiwan 7.17 7.17 7.17
body balance
0 8.50 8.42
267 7.75 7.75
534 7.92 7.83
801 7.50 7.33
1068 7.17 7.25
SAMPLING IN PYTHON
The trouble with systematic sampling
coffee_ratings_with_id = coffee_ratings.reset_index()
coffee_ratings_with_id.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
Systematic sampling is only safe if we don't see a pattern in this scatter plot
SAMPLING IN PYTHON
Making systematic sampling safe
shuffled = coffee_ratings.sample(frac=1)
shuffled = shuffled.reset_index(drop=True).reset_index()
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()
Shuffling rows + systematic sampling is the same as simple random sampling
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Stratified and
weighted random
sampling
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Coffees by country
top_counts = coffee_ratings['country_of_origin'].value_counts()
top_counts.head(6)
country_of_origin
Mexico 236
Colombia 183
Guatemala 181
Brazil 132
Taiwan 75
United States (Hawaii) 73
dtype: int64
1 The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.
SAMPLING IN PYTHON
Filtering for 6 countries
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
top_counted_subset = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[top_counted_subset]
SAMPLING IN PYTHON
Counts of a simple random sample
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)
coffee_ratings_samp['country_of_origin'].value_counts(normalize=True)
country_of_origin
Mexico 0.250000
Guatemala 0.204545
Colombia 0.181818
Brazil 0.181818
United States (Hawaii) 0.102273
Taiwan 0.079545
dtype: float64
SAMPLING IN PYTHON
Comparing proportions
Population:
Mexico                    0.268182
Colombia                  0.207955
Guatemala                 0.205682
Brazil                    0.150000
Taiwan                    0.085227
United States (Hawaii)    0.082955
Name: country_of_origin, dtype: float64

10% simple random sample:
Mexico                    0.250000
Guatemala                 0.204545
Colombia                  0.181818
Brazil                    0.181818
United States (Hawaii)    0.102273
Taiwan                    0.079545
Name: country_of_origin, dtype: float64
SAMPLING IN PYTHON
Proportional stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=0.1, random_state=2021)
coffee_ratings_strat['country_of_origin'].value_counts(normalize=True)
Mexico 0.272727
Guatemala 0.204545
Colombia 0.204545
Brazil 0.147727
Taiwan 0.090909
United States (Hawaii) 0.079545
Name: country_of_origin, dtype: float64
SAMPLING IN PYTHON
Equal counts stratified sampling
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
.sample(n=15, random_state=2021)
coffee_ratings_eq['country_of_origin'].value_counts(normalize=True)
Taiwan 0.166667
Brazil 0.166667
United States (Hawaii) 0.166667
Guatemala 0.166667
Mexico 0.166667
Colombia 0.166667
Name: country_of_origin, dtype: float64
SAMPLING IN PYTHON
Weighted random sampling
Specify weights to adjust the relative probability of a row being sampled
import numpy as np
coffee_ratings_weight = coffee_ratings_top.copy()  # copy so the original isn't modified
condition = coffee_ratings_weight['country_of_origin'] == "Taiwan"
coffee_ratings_weight['weight'] = np.where(condition, 2, 1)
coffee_ratings_weight = coffee_ratings_weight.sample(frac=0.1, weights="weight")
SAMPLING IN PYTHON
Weighted random sampling results
10% weighted sample:
coffee_ratings_weight['country_of_origin'].value_counts(normalize=True)
Brazil 0.261364
Mexico 0.204545
Guatemala 0.204545
Taiwan 0.170455
Colombia 0.090909
United States (Hawaii) 0.068182
Name: country_of_origin, dtype: float64
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Cluster sampling
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Stratified sampling vs. cluster sampling
Stratified sampling
Split the population into subgroups
Use simple random sampling on every subgroup
Cluster sampling
Use simple random sampling to pick some subgroups
Use simple random sampling on only those subgroups
SAMPLING IN PYTHON
Varieties of coffee
varieties_pop = list(coffee_ratings['variety'].unique())
[None, 'Other', 'Bourbon', 'Catimor',
'Ethiopian Yirgacheffe','Caturra',
'SL14', 'Sumatra', 'SL34', 'Hawaiian Kona',
'Yellow Bourbon', 'SL28', 'Gesha', 'Catuai',
'Pacamara', 'Typica', 'Sumatra Lintong',
'Mundo Novo', 'Java', 'Peaberry', 'Pacas',
'Mandheling', 'Ruiru 11', 'Arusha',
'Ethiopian Heirlooms', 'Moka Peaberry',
'Sulawesi', 'Blue Mountain', 'Marigojipe',
'Pache Comun']
SAMPLING IN PYTHON
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)
['Hawaiian Kona', 'Bourbon', 'SL28']
SAMPLING IN PYTHON
Stage 2: sampling each group
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_ratings_cluster = coffee_ratings[variety_condition]
coffee_ratings_cluster['variety'] = coffee_ratings_cluster['variety'].cat.remove_unused_categories()
coffee_ratings_cluster.groupby("variety")\
.sample(n=5, random_state=2021)
SAMPLING IN PYTHON
Stage 2 output
total_cup_points variety country_of_origin ...
variety
Bourbon 575 82.83 Bourbon Guatemala
560 82.83 Bourbon Guatemala
524 83.00 Bourbon Guatemala
1140 79.83 Bourbon Guatemala
318 83.67 Bourbon Brazil
Hawaiian Kona 1291 73.67 Hawaiian Kona United States (Hawaii)
1266 76.25 Hawaiian Kona United States (Hawaii)
488 83.08 Hawaiian Kona United States (Hawaii)
461 83.17 Hawaiian Kona United States (Hawaii)
117 84.83 Hawaiian Kona United States (Hawaii)
SL28 137 84.67 SL28 Kenya
452 83.17 SL28 Kenya
224 84.17 SL28 Kenya
66 85.50 SL28 Kenya
559 82.83 SL28 Kenya
SAMPLING IN PYTHON
Multistage sampling
Cluster sampling is a type of multistage sampling
Can have > 2 stages
E.g., countrywide surveys may sample states, counties, cities, and neighborhoods
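For instance, a three-stage sketch, assuming a hypothetical survey DataFrame with "state" and "county" columns:
import random
# Stage 1: randomly pick 5 states
states_samp = random.sample(list(survey["state"].unique()), k=5)
stage1 = survey[survey["state"].isin(states_samp)]
# Stage 2: randomly pick 10 counties within those states
counties_samp = random.sample(list(stage1["county"].unique()), k=10)
stage2 = stage1[stage1["county"].isin(counties_samp)]
# Stage 3: simple random sampling of respondents within those counties
final_sample = stage2.sample(n=100)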
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing
sampling methods
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Review of sampling techniques - setup
top_counted_countries = ["Mexico", "Colombia", "Guatemala",
"Brazil", "Taiwan", "United States (Hawaii)"]
subset_condition = coffee_ratings['country_of_origin'].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[subset_condition]
coffee_ratings_top.shape
(880, 8)
SAMPLING IN PYTHON
Review of simple random sampling
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)
coffee_ratings_srs.shape
(293, 8)
SAMPLING IN PYTHON
Review of stratified sampling
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
.sample(frac=1/3, random_state=2021)
coffee_ratings_strat.shape
(293, 8)
SAMPLING IN PYTHON
Review of cluster sampling
import random
top_countries_samp = random.sample(top_counted_countries, k=2)
top_condition = coffee_ratings_top['country_of_origin'].isin(top_countries_samp)
coffee_ratings_cluster = coffee_ratings_top[top_condition]
coffee_ratings_cluster['country_of_origin'] = coffee_ratings_cluster['country_of_origin']\
.cat.remove_unused_categories()
coffee_ratings_clust = coffee_ratings_cluster.groupby("country_of_origin")\
.sample(n=len(coffee_ratings_top) // 6)
coffee_ratings_clust.shape
(292, 8)
SAMPLING IN PYTHON
Calculating mean cup points
Population:
coffee_ratings_top['total_cup_points'].mean()
81.94700000000002

Simple random sample:
coffee_ratings_srs['total_cup_points'].mean()
81.95982935153583

Stratified sample:
coffee_ratings_strat['total_cup_points'].mean()
81.92566552901025

Cluster sample:
coffee_ratings_clust['total_cup_points'].mean()
82.03246575342466
SAMPLING IN PYTHON
Mean cup points by country: simple random
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Simple random sample:
coffee_ratings_srs.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.414878
Colombia                  82.925536
Guatemala                 82.045385
Mexico                    81.100714
Taiwan                    81.744333
United States (Hawaii)    82.008000
Name: total_cup_points, dtype: float64
SAMPLING IN PYTHON
Mean cup points by country: stratified
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Stratified sample:
coffee_ratings_strat.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.499773
Colombia                  83.288197
Guatemala                 81.727667
Mexico                    80.994684
Taiwan                    81.846800
United States (Hawaii)    81.051667
Name: total_cup_points, dtype: float64
SAMPLING IN PYTHON
Mean cup points by country: cluster
Population:
coffee_ratings_top.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Brazil                    82.405909
Colombia                  83.106557
Guatemala                 81.846575
Mexico                    80.890085
Taiwan                    82.001333
United States (Hawaii)    81.820411
Name: total_cup_points, dtype: float64

Cluster sample:
coffee_ratings_clust.groupby("country_of_origin")['total_cup_points'].mean()
country_of_origin
Colombia    83.128904
Mexico      80.936027
Name: total_cup_points, dtype: float64
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Relative error of
point estimates
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Sample size is number of rows
len(coffee_ratings.sample(n=300))
300

len(coffee_ratings.sample(frac=0.25))
334
SAMPLING IN PYTHON
Various sample sizes
coffee_ratings['total_cup_points'].mean()
82.15120328849028
coffee_ratings.sample(n=10)['total_cup_points'].mean()
83.027
coffee_ratings.sample(n=100)['total_cup_points'].mean()
82.4897
coffee_ratings.sample(n=1000)['total_cup_points'].mean()
82.1186
SAMPLING IN PYTHON
Relative errors
Population parameter:
population_mean = coffee_ratings['total_cup_points'].mean()
Point estimate:
sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()
Relative error as a percentage:
rel_error_pct = 100 * abs(population_mean-sample_mean) / population_mean
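The errors DataFrame used on the next slide is not built in these slides; one plausible construction is to sweep over sample sizes (a sketch):
import pandas as pd
sample_sizes = list(range(10, len(coffee_ratings) + 1, 10))
rel_errors = []
for sample_size in sample_sizes:
    sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()
    rel_errors.append(100 * abs(population_mean - sample_mean) / population_mean)
errors = pd.DataFrame({"sample_size": sample_sizes,
                       "relative_error": rel_errors})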
SAMPLING IN PYTHON
Relative error vs. sample size
import matplotlib.pyplot as plt
errors.plot(x="sample_size",
y="relative_error",
kind="line")
plt.show()
Properties:
Really noisy, particularly for small samples
The decrease is initially steep, then flattens
Relative error decreases to zero (when the sample size equals the population size)
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Creating a sampling
distribution
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Same code, different answer
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.53066666666668
coffee_ratings.sample(n=30)['total_cup_points'].mean()
82.68
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.97566666666667
coffee_ratings.sample(n=30)['total_cup_points'].mean()
81.675
SAMPLING IN PYTHON
Same code, 1000 times
mean_cup_points_1000 = []
for i in range(1000):
mean_cup_points_1000.append(
coffee_ratings.sample(n=30)['total_cup_points'].mean()
)
print(mean_cup_points_1000)
[82.11933333333333, 82.55300000000001, 82.07266666666668, 81.76966666666667,
...
82.74166666666666, 82.45033333333335, 81.77199999999999, 82.8163333333333]
SAMPLING IN PYTHON
Distribution of sample means for size 30
import matplotlib.pyplot as plt
plt.hist(mean_cup_points_1000, bins=30)
plt.show()
A sampling distribution is a distribution of
replicates of point estimates.
SAMPLING IN PYTHON
Different sample sizes
Sample size: 6
Sample size: 150
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Approximate
sampling
distributions
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
4 dice
dice = expand_grid(  # expand_grid() builds all combinations (helper defined in the course)
  {'die1': [1, 2, 3, 4, 5, 6],
   'die2': [1, 2, 3, 4, 5, 6],
   'die3': [1, 2, 3, 4, 5, 6],
   'die4': [1, 2, 3, 4, 5, 6]
  }
)

      die1  die2  die3  die4
0        1     1     1     1
1        1     1     1     2
2        1     1     1     3
3        1     1     1     4
4        1     1     1     5
...    ...   ...   ...   ...
1291     6     6     6     2
1292     6     6     6     3
1293     6     6     6     4
1294     6     6     6     5
1295     6     6     6     6

[1296 rows x 4 columns]
SAMPLING IN PYTHON
Mean roll
dice['mean_roll'] = (dice['die1'] +
                     dice['die2'] +
                     dice['die3'] +
                     dice['die4']) / 4
print(dice)

      die1  die2  die3  die4  mean_roll
0        1     1     1     1       1.00
1        1     1     1     2       1.25
2        1     1     1     3       1.50
3        1     1     1     4       1.75
4        1     1     1     5       2.00
...    ...   ...   ...   ...        ...
1291     6     6     6     2       5.00
1292     6     6     6     3       5.25
1293     6     6     6     4       5.50
1294     6     6     6     5       5.75
1295     6     6     6     6       6.00

[1296 rows x 5 columns]
SAMPLING IN PYTHON
Exact sampling distribution
dice['mean_roll'] = dice['mean_roll'].astype('category')
dice['mean_roll'].value_counts(sort=False).plot(kind="bar")
SAMPLING IN PYTHON
The number of outcomes increases fast
n_dice = list(range(1, 101))
n_outcomes = []
for n in n_dice:
n_outcomes.append(6**n)
import pandas as pd
outcomes = pd.DataFrame(
{"n_dice": n_dice,
"n_outcomes": n_outcomes})
outcomes.plot(x="n_dice",
y="n_outcomes",
kind="scatter")
plt.show()
SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np
np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
SAMPLING IN PYTHON
Simulating the mean of four dice rolls
import numpy as np
sample_means_1000 = []
for i in range(1000):
sample_means_1000.append(
np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
)
print(sample_means_1000)
[3.25, 3.25, 1.75, 2.0, 2.0, 1.0, 1.0, 2.75, 2.75, 2.5, 3.0, 2.0, 2.75,
...
1.25, 2.0, 2.5, 2.5, 3.75, 1.5, 1.75, 2.25, 2.0, 1.5, 3.25, 3.0, 3.5]
SAMPLING IN PYTHON
Approximate sampling distribution
plt.hist(sample_means_1000, bins=20)
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Standard errors and
the Central Limit
Theorem
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Sampling distribution of mean cup points
Sample size: 5 Sample size: 20
Sample size: 80 Sample size: 320
SAMPLING IN PYTHON
Consequences of the central limit theorem
Averages of independent samples have approximately normal distributions.
As the sample size increases,
The distribution of the averages gets closer to being normally distributed
The width of the sampling distribution gets narrower
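A quick simulation illustrates both consequences (a sketch using the coffee data):
import numpy as np
for n in [5, 20, 80, 320]:
    sample_means = [coffee_ratings.sample(n=n)['total_cup_points'].mean()
                    for _ in range(1000)]
    print(n, np.std(sample_means, ddof=1))  # the spread shrinks as n grows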
SAMPLING IN PYTHON
Population & sampling distribution means
coffee_ratings['total_cup_points'].mean()
82.15120328849028

Use np.mean() on each approximate sampling distribution:

Sample size    Mean sample mean
5              82.18420719999999
20             82.1558634
80             82.14510154999999
320            82.154017925
SAMPLING IN PYTHON
Population & sampling distribution standard deviations
coffee_ratings['total_cup_points'].std(ddof=0)
2.685858187306438

Specify ddof=0 when calling .std() on populations
Specify ddof=1 when calling np.std() on samples or sampling distributions

Sample size    Std dev sample mean
5              1.1886358227738543
20             0.5940321141669805
80             0.2934024263916487
320            0.13095083089190876
SAMPLING IN PYTHON
Population standard deviation over square root sample size
Sample size Std dev sample mean Calculation Result
5 1.1886358227738543 2.685858187306438 / sqrt(5) 1.201
20 0.5940321141669805 2.685858187306438 / sqrt(20) 0.601
80 0.2934024263916487 2.685858187306438 / sqrt(80) 0.300
320 0.13095083089190876 2.685858187306438 / sqrt(320) 0.150
SAMPLING IN PYTHON
Standard error
Standard deviation of the sampling distribution
Important tool in understanding sampling variability
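For example, the standard error for samples of size 20 can be estimated from a simulated sampling distribution (a sketch; the result should land near the 0.594 in the table above):
import numpy as np
sample_means = [coffee_ratings.sample(n=20)['total_cup_points'].mean()
                for _ in range(1000)]
std_error = np.std(sample_means, ddof=1)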
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Introduction to
bootstrapping
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
With or without
Sampling without replacement:
Sampling with replacement ("resampling"):
SAMPLING IN PYTHON
Simple random sampling without replacement
Population: Sample:
SAMPLING IN PYTHON
Simple random sampling with replacement
Population: Resample:
SAMPLING IN PYTHON
Why sample with replacement?
coffee_ratings : a sample of a larger population of all coffees
Each coffee in our sample represents many different hypothetical population coffees
Sampling with replacement is a proxy
SAMPLING IN PYTHON
Coffee data preparation
coffee_focus = coffee_ratings[["variety", "country_of_origin", "flavor"]]
coffee_focus = coffee_focus.reset_index()
index variety country_of_origin flavor
0 0 None Ethiopia 8.83
1 1 Other Ethiopia 8.67
2 2 Bourbon Guatemala 8.50
3 3 None Ethiopia 8.58
4 4 Other Ethiopia 8.50
... ... ... ... ...
1333 1333 None Ecuador 7.58
1334 1334 None Ecuador 7.67
1335 1335 None United States 7.33
1336 1336 None India 6.83
1337 1337 None Vietnam 6.67
[1338 rows x 4 columns]
SAMPLING IN PYTHON
Resampling with .sample()
coffee_resamp = coffee_focus.sample(frac=1, replace=True)
index variety country_of_origin flavor
1140 1140 Bourbon Guatemala 7.25
57 57 Bourbon Guatemala 8.00
1152 1152 Bourbon Mexico 7.08
621 621 Caturra Thailand 7.50
44 44 SL28 Kenya 8.08
... ... ... ... ...
996 996 Typica Mexico 7.33
1090 1090 Bourbon Guatemala 7.33
918 918 Other Guatemala 7.42
249 249 Caturra Colombia 7.67
467 467 Caturra Colombia 7.50
[1338 rows x 4 columns]
SAMPLING IN PYTHON
Repeated coffees
coffee_resamp["index"].value_counts()

658     5
167     4
363     4
357     4
1047    4
       ..
771     1
770     1
766     1
764     1
0       1
Name: index, Length: 868, dtype: int64
SAMPLING IN PYTHON
Missing coffees
num_unique_coffees = len(coffee_resamp.drop_duplicates(subset="index"))
868
len(coffee_ratings) - num_unique_coffees
470
SAMPLING IN PYTHON
Bootstrapping
The opposite of sampling from a
population
Sampling: going from a population to a
smaller sample
Bootstrapping: building up a theoretical
population from the sample
Bootstrapping use case:
Develop understanding of sampling
variability using a single sample
SAMPLING IN PYTHON
Bootstrapping process
1. Make a resample of the same size as the original sample
2. Calculate the statistic of interest for this bootstrap sample
3. Repeat steps 1 and 2 many times
The resulting statistics are bootstrap statistics, and they form a bootstrap distribution
SAMPLING IN PYTHON
Bootstrapping coffee mean flavor
import numpy as np
mean_flavors_1000 = []
for i in range(1000):
mean_flavors_1000.append(
np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
)
SAMPLING IN PYTHON
Bootstrap distribution histogram
import matplotlib.pyplot as plt
plt.hist(mean_flavors_1000)
plt.show()
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Comparing
sampling and
bootstrap
distributions
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Coffee focused subset
coffee_sample = coffee_ratings[["variety", "country_of_origin", "flavor"]]\
.reset_index().sample(n=500)
index variety country_of_origin flavor
132 132 Other Costa Rica 7.58
51 51 None United States (Hawaii) 8.17
42 42 Yellow Bourbon Brazil 7.92
569 569 Bourbon Guatemala 7.67
.. ... ... ... ...
643 643 Catuai Costa Rica 7.42
356 356 Caturra Colombia 7.58
494 494 None Indonesia 7.58
169 169 None Brazil 7.81
[500 rows x 4 columns]
SAMPLING IN PYTHON
The bootstrap of mean coffee flavors
import numpy as np
mean_flavors_5000 = []
for i in range(5000):
mean_flavors_5000.append(
np.mean(coffee_sample.sample(frac=1, replace=True)['flavor'])
)
bootstrap_distn = mean_flavors_5000
SAMPLING IN PYTHON
Mean flavor bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(bootstrap_distn, bins=15)
plt.show()
SAMPLING IN PYTHON
Sample, bootstrap distribution, population means
Sample mean:
coffee_sample['flavor'].mean()
7.5132200000000005

Estimated population mean:
np.mean(bootstrap_distn)
7.513357731999999

True population mean:
coffee_ratings['flavor'].mean()
7.526046337817639
SAMPLING IN PYTHON
Interpreting the means
Bootstrap distribution mean:
Usually close to the sample mean
May not be a good estimate of the population mean
Bootstrapping cannot correct biases from sampling
SAMPLING IN PYTHON
Sample sd vs. bootstrap distribution sd
Sample standard deviation:
coffee_sample['flavor'].std()
0.3540883911928703

Estimated population standard deviation?
np.std(bootstrap_distn, ddof=1)
0.015768474367958217
SAMPLING IN PYTHON
Sample, bootstrap dist'n, pop'n standard deviations
Sample standard deviation:
coffee_sample['flavor'].std()
0.3540883911928703

Estimated population standard deviation:
standard_error = np.std(bootstrap_distn, ddof=1)
standard_error * np.sqrt(500)
0.3525938058821761

True standard deviation:
coffee_ratings['flavor'].std(ddof=0)
0.34125481224622645

Standard error is the standard deviation of the statistic of interest
Standard error times square root of sample size estimates the population standard deviation
SAMPLING IN PYTHON
Interpreting the standard errors
Estimated standard error → standard deviation of the bootstrap distribution for a sample statistic
Population std. dev ≈ Std. Error × √Sample size
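This relationship can be checked numerically against the values above (a sketch):
import numpy as np
standard_error = np.std(bootstrap_distn, ddof=1)
print(standard_error * np.sqrt(500))         # approx. 0.3526
print(coffee_ratings['flavor'].std(ddof=0))  # approx. 0.3413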
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Confidence intervals
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Confidence intervals
"Values within one standard deviation of the mean" includes a large number of values from
each of these distributions
We'll define a related concept called a confidence interval
SAMPLING IN PYTHON
Predicting the weather
Rapid City, South Dakota in the United
States has the least predictable weather
Our job is to predict the high temperature
there tomorrow
SAMPLING IN PYTHON
Our weather prediction
Point estimate = 47°F (8.3°C)
Range of plausible high temperature values = 40 to 54°F (4.4 to 12.8°C)
SAMPLING IN PYTHON
We just reported a confidence interval!
40 to 54°F is a confidence interval
Sometimes written as 47°F (40°F, 54°F) or 47°F [40°F, 54°F]
... or, 47 ± 7°F
7°F is the margin of error
SAMPLING IN PYTHON
Bootstrap distribution of mean flavor
import matplotlib.pyplot as plt
plt.hist(coffee_boot_distn, bins=15)
plt.show()
SAMPLING IN PYTHON
Mean of the resamples
import numpy as np
np.mean(coffee_boot_distn)
7.513452892
SAMPLING IN PYTHON
Mean plus or minus one standard deviation
np.mean(coffee_boot_distn)
7.513452892
np.mean(coffee_boot_distn) - np.std(coffee_boot_distn, ddof=1)
7.497385709174466
np.mean(coffee_boot_distn) + np.std(coffee_boot_distn, ddof=1)
7.529520074825534
SAMPLING IN PYTHON
Quantile method for confidence intervals
np.quantile(coffee_boot_distn, 0.025)
7.4817195
np.quantile(coffee_boot_distn, 0.975)
7.5448805
SAMPLING IN PYTHON
Inverse cumulative distribution function
PDF: The bell curve
CDF: integrate to get area under bell curve
Inv. CDF: flip x and y axes
Implemented in Python with
from scipy.stats import norm
norm.ppf(quantile, loc=0, scale=1)
SAMPLING IN PYTHON
Standard error method for confidence interval
point_estimate = np.mean(coffee_boot_distn)
7.513452892
std_error = np.std(coffee_boot_distn, ddof=1)
0.016067182825533724
from scipy.stats import norm
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print((lower, upper))
(7.481961792328933, 7.544943991671067)
SAMPLING IN PYTHON
Let's practice!
SAMPLING IN PYTHON
Congratulations!
SAMPLING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Recap
Chapter 1 Chapter 3
Sampling basics Sample size and population parameters
Selection bias Creating sampling distributions
Pseudo-random numbers Approximate vs. actual sampling dist'ns
Central limit theorem
Chapter 2 Chapter 4
Simple random sampling Bootstrapping from a single sample
Systematic sampling Standard error
Stratified sampling Confidence intervals
Cluster sampling
SAMPLING IN PYTHON
The most important things
The std. deviation of a bootstrap statistic is a good approximation of the standard error
Can assume bootstrap distributions are normally distributed for confidence intervals
SAMPLING IN PYTHON
What's next?
Experimental Design in Python and Customer Analytics and A/B Testing in Python
Hypothesis Testing in Python
Foundations of Probability in Python and Bayesian Data Analysis in Python
SAMPLING IN PYTHON
Happy learning!
SAMPLING IN PYTHON
Hypothesis tests and
z-scores
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
A/B testing
In 2013, Electronic Arts (EA) released
SimCity 5
They wanted to increase pre-orders of the
game
They used A/B testing to test different
advertising scenarios
This involves splitting users into control and
treatment groups
1 Image credit: "Electronic Arts" by majaX1 CC BY-NC-SA 2.0
HYPOTHESIS TESTING IN PYTHON
Retail webpage A/B test
Control: Treatment:
HYPOTHESIS TESTING IN PYTHON
A/B test results
The treatment group (no ad) got 43.4% more purchases than the control group (with ad)
Intuition that "showing an ad would increase sales" was false
Was this result statistically significant or just chance?
Need EA's data to determine this
Techniques from Sampling in Python + this course to do so
HYPOTHESIS TESTING IN PYTHON
Stack Overflow Developer Survey 2020
import pandas as pd
print(stack_overflow)
respondent age_1st_code ... age hobbyist
0 36.0 30.0 ... 34.0 Yes
1 47.0 10.0 ... 53.0 Yes
2 69.0 12.0 ... 25.0 Yes
3 125.0 30.0 ... 41.0 Yes
4 147.0 15.0 ... 28.0 No
... ... ... ... ... ...
2259 62867.0 13.0 ... 33.0 Yes
2260 62882.0 13.0 ... 28.0 Yes
[2261 rows x 8 columns]
HYPOTHESIS TESTING IN PYTHON
Hypothesizing about the mean
A hypothesis:
The mean annual compensation of the population of data scientists is $110,000
The point estimate (sample statistic):
mean_comp_samp = stack_overflow['converted_comp'].mean()
119574.71738168952
HYPOTHESIS TESTING IN PYTHON
Generating a bootstrap distribution
import numpy as np
# Step 3. Repeat steps 1 & 2 many times, appending to a list
so_boot_distn = []
for i in range(5000):
so_boot_distn.append(
# Step 2. Calculate point estimate
np.mean(
# Step 1. Resample
stack_overflow.sample(frac=1, replace=True)['converted_comp']
)
)
1 Bootstrap distributions are taught in Chapter 4 of Sampling in Python
HYPOTHESIS TESTING IN PYTHON
Visualizing the bootstrap distribution
import matplotlib.pyplot as plt
plt.hist(so_boot_distn, bins=50)
plt.show()
HYPOTHESIS TESTING IN PYTHON
Standard error
std_error = np.std(so_boot_distn, ddof=1)
5607.997577378606
HYPOTHESIS TESTING IN PYTHON
z-scores
standardized value = (value − mean) / standard deviation

z = (sample stat − hypothesized parameter value) / standard error
HYPOTHESIS TESTING IN PYTHON
z = (sample stat − hypothesized parameter value) / standard error
stack_overflow['converted_comp'].mean()
119574.71738168952
mean_comp_hyp = 110000
std_error
5607.997577378606
z_score = (mean_comp_samp - mean_comp_hyp) / std_error
1.7073326529796957
HYPOTHESIS TESTING IN PYTHON
Testing the hypothesis
Is 1.707 a high or low number?
This is the goal of the course!
HYPOTHESIS TESTING IN PYTHON
Testing the hypothesis
Is 1.707 a high or low number?
This is the goal of the course!
Hypothesis testing use case:
Determine whether sample statistics are close to or far away from expected (or
"hypothesized" values)
HYPOTHESIS TESTING IN PYTHON
Standard normal (z) distribution
Standard normal distribution: normal distribution with mean = 0 and standard deviation = 1
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
p-values
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Criminal trials
Two possible true states:
1. Defendant committed the crime
2. Defendant did not commit the crime
Two possible verdicts:
1. Guilty
2. Not guilty
Initially the defendant is assumed to be not guilty
Prosecution must present evidence "beyond reasonable doubt" for a guilty verdict
HYPOTHESIS TESTING IN PYTHON
Age of first programming experience
age_first_code_cut classifies when Stack Overflow user first started programming
"adult" means they started at 14 or older
"child" means they started before 14
Previous research: 35% of software developers started programming as children
Is there evidence that a greater proportion of data scientists started programming as children?
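The slides don't show how age_first_code_cut was built; one plausible construction from the age_1st_code column (an assumption, sketched here):
import numpy as np
# cutoff of 14 follows the definitions above; column derivation is assumed
stack_overflow['age_first_code_cut'] = np.where(
    stack_overflow['age_1st_code'] < 14, "child", "adult")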
HYPOTHESIS TESTING IN PYTHON
Definitions
A hypothesis is a statement about an unknown population parameter
A hypothesis test is a test of two competing hypotheses
The null hypothesis (H0 ) is the existing idea
The alternative hypothesis (HA ) is the new "challenger" idea of the researcher
For our problem:
H0 : The proportion of data scientists starting programming as children is 35%
HA : The proportion of data scientists starting programming as children is greater than 35%
1"Naught" is British English for "zero". For historical reasons, "H-naught" is the international convention for
pronouncing the null hypothesis.
HYPOTHESIS TESTING IN PYTHON
Criminal trials vs. hypothesis testing
Either HA or H0 is true (not both)
Initially, H0 is assumed to be true
The test ends in either "reject H0 " or "fail to reject H0 "
If the evidence from the sample is "significant" that HA is true, reject H0 , else choose H0
Significance level is "beyond a reasonable doubt" for hypothesis testing
HYPOTHESIS TESTING IN PYTHON
One-tailed and two-tailed tests
Hypothesis tests check if the sample statistics
lie in the tails of the null distribution
Test Tails
alternative different from null two-tailed
alternative greater than null right-tailed
alternative less than null left-tailed
HA : The proportion of data scientists starting
programming as children is greater than 35%
This is a right-tailed test
HYPOTHESIS TESTING IN PYTHON
p-values
p-values: probability of obtaining a result at least as extreme
as the observed one, assuming the null hypothesis is true
Large p-value, large support for H0
Statistic likely not in the tail of the null
distribution
Small p-value, strong evidence against H0
Statistic likely in the tail of the null
distribution
"p" in p-value → probability
"small" means "close to zero"
HYPOTHESIS TESTING IN PYTHON
Calculating the z-score
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
0.39141972578505085
prop_child_hyp = 0.35
std_error = np.std(first_code_boot_distn, ddof=1)
0.010351057228878566
z_score = (prop_child_samp - prop_child_hyp) / std_error
4.001497129152506
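first_code_boot_distn is not generated on this slide; it follows the same bootstrap recipe as before (a sketch):
first_code_boot_distn = []
for i in range(5000):
    first_code_boot_distn.append(
        # resample, then compute the proportion who started as children
        (stack_overflow.sample(frac=1, replace=True)
         ['age_first_code_cut'] == "child").mean()
    )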
HYPOTHESIS TESTING IN PYTHON
Calculating the p-value
norm.cdf() is normal CDF from scipy.stats .
Left-tailed test → use norm.cdf() .
Right-tailed test → use 1 - norm.cdf() .
from scipy.stats import norm
1 - norm.cdf(z_score, loc=0, scale=1)
3.1471479512323874e-05
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Statistical
significance
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
p-value recap
p-values quantify evidence against the null hypothesis
Large p-value → fail to reject null hypothesis
Small p-value → reject null hypothesis
Where is the cutoff point?
HYPOTHESIS TESTING IN PYTHON
Significance level
The significance level of a hypothesis test (α) is the threshold point for "beyond a
reasonable doubt"
Common values of α are 0.2 , 0.1 , 0.05 , and 0.01
If p ≤ α, reject H0 , else fail to reject H0
α should be set prior to conducting the hypothesis test
HYPOTHESIS TESTING IN PYTHON
Calculating the p-value
alpha = 0.05
prop_child_samp = (stack_overflow['age_first_code_cut'] == "child").mean()
prop_child_hyp = 0.35
std_error = np.std(first_code_boot_distn, ddof=1)
z_score = (prop_child_samp - prop_child_hyp) / std_error
p_value = 1 - norm.cdf(z_score, loc=0, scale=1)
3.1471479512323874e-05
HYPOTHESIS TESTING IN PYTHON
Making a decision
alpha = 0.05
print(p_value)
3.1471479512323874e-05
p_value <= alpha
True
Reject H0 in favor of HA
HYPOTHESIS TESTING IN PYTHON
Confidence intervals
For a significance level of α, it's common to choose a confidence interval level of 1 - α
α = 0.05 → 95% confidence interval
import numpy as np
lower = np.quantile(first_code_boot_distn, 0.025)
upper = np.quantile(first_code_boot_distn, 0.975)
print((lower, upper))
(0.37063246351172047, 0.41132242370632466)
HYPOTHESIS TESTING IN PYTHON
Types of errors
                     Truly didn't commit crime    Truly committed crime
Verdict not guilty   correct                      they got away with it
Verdict guilty       wrongful conviction          correct

            actual H0          actual HA
chosen H0   correct            false negative
chosen HA   false positive     correct
False positives are Type I errors; false negatives are Type II errors.
HYPOTHESIS TESTING IN PYTHON
Possible errors in our example
If p ≤ α, we reject H0:
A false positive (Type I) error occurs if data scientists didn't actually start coding as children at a higher rate
If p > α, we fail to reject H0:
A false negative (Type II) error occurs if data scientists actually did start coding as children at a higher rate
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Performing t-tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Two-sample problems
Compare sample statistics across groups of a variable
converted_comp is a numerical variable
age_first_code_cut is a categorical variable with two levels ( "child" and "adult" )
Are users who first programmed as a child compensated more highly than those who started as adults?
HYPOTHESIS TESTING IN PYTHON
Hypotheses
H0 : The mean compensation (in USD) is the same for those that coded first as a child and
those that coded first as an adult.
H0 : μchild = μadult
H0 : μchild − μadult = 0
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult.
HA : μchild > μadult
HA : μchild − μadult > 0
HYPOTHESIS TESTING IN PYTHON
Calculating groupwise summary statistics
stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()
age_first_code_cut
adult 111313.311047
child 132419.570621
Name: converted_comp, dtype: float64
HYPOTHESIS TESTING IN PYTHON
Test statistics
Sample mean estimates the population mean
x̄ - a sample mean
x̄_child - sample mean compensation for coding first as a child
x̄_adult - sample mean compensation for coding first as an adult
x̄_child − x̄_adult - a test statistic
z-score - a (standardized) test statistic
HYPOTHESIS TESTING IN PYTHON
Standardizing the test statistic
z = (sample stat − population parameter) / standard error

t = (difference in sample stats − difference in population parameters) / standard error

t = ((x̄_child − x̄_adult) − (μ_child − μ_adult)) / SE(x̄_child − x̄_adult)
HYPOTHESIS TESTING IN PYTHON
Standard error
SE(x̄_child − x̄_adult) ≈ √(s_child² / n_child + s_adult² / n_adult)

s is the standard deviation of the variable
n is the sample size (number of observations/rows in sample)
HYPOTHESIS TESTING IN PYTHON
Assuming the null hypothesis is true
t = ((x̄_child − x̄_adult) − (μ_child − μ_adult)) / SE(x̄_child − x̄_adult)

Under H0, μ_child − μ_adult = 0, so

t = (x̄_child − x̄_adult) / SE(x̄_child − x̄_adult)
  = (x̄_child − x̄_adult) / √(s_child² / n_child + s_adult² / n_adult)
HYPOTHESIS TESTING IN PYTHON
Calculations assuming the null hypothesis is true
xbar = stack_overflow.groupby('age_first_code_cut')['converted_comp'].mean()

age_first_code_cut
adult    111313.311047
child    132419.570621
Name: converted_comp, dtype: float64

s = stack_overflow.groupby('age_first_code_cut')['converted_comp'].std()

age_first_code_cut
adult    271546.521729
child    255585.240115
Name: converted_comp, dtype: float64

n = stack_overflow.groupby('age_first_code_cut')['converted_comp'].count()

age_first_code_cut
adult    1376
child     885
Name: converted_comp, dtype: int64
HYPOTHESIS TESTING IN PYTHON
Calculating the test statistic
t = (x̄_child − x̄_adult) / √(s_child² / n_child + s_adult² / n_adult)

import numpy as np
# Extract per-group scalars from the groupby results on the previous slide
xbar_child, xbar_adult = xbar['child'], xbar['adult']
s_child, s_adult = s['child'], s['adult']
n_child, n_adult = n['child'], n['adult']
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator
1.8699313316221844
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Calculating p-values
from t-statistics
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
t-distributions
The t statistic follows a t-distribution
t-distributions have a parameter named degrees of freedom, or df
They look like normal distributions, with fatter tails
HYPOTHESIS TESTING IN PYTHON
Degrees of freedom
Larger degrees of freedom → t-distribution
gets closer to the normal distribution
Normal distribution → t-distribution with
infinite df
Degrees of freedom: maximum number of
logically independent values in the data
sample
HYPOTHESIS TESTING IN PYTHON
Calculating degrees of freedom
Dataset has 5 independent observations
Four of the values are 2, 6, 8, and 5
The sample mean is 5
The last value must be 4
Here, there are 4 degrees of freedom
df = n_child + n_adult − 2
HYPOTHESIS TESTING IN PYTHON
Hypotheses
H0 : The mean compensation (in USD) is the same for those that coded first as a child and
those that coded first as an adult
HA : The mean compensation (in USD) is greater for those that coded first as a child
compared to those that coded first as an adult
Use a right-tailed test
HYPOTHESIS TESTING IN PYTHON
Significance level
α = 0.1
If p ≤ α then reject H0 .
HYPOTHESIS TESTING IN PYTHON
Calculating p-values: one proportion vs. a value
from scipy.stats import norm
1 - norm.cdf(z_score)

z-statistic: needed when using one sample statistic to estimate a population parameter
t-statistic: needed when using multiple sample statistics to estimate a population parameter,
as in SE(x̄_child − x̄_adult) ≈ √(s_child² / n_child + s_adult² / n_adult)
HYPOTHESIS TESTING IN PYTHON
Calculating p-values: two means from different groups
numerator = xbar_child - xbar_adult
denominator = np.sqrt(s_child ** 2 / n_child + s_adult ** 2 / n_adult)
t_stat = numerator / denominator
1.8699313316221844
degrees_of_freedom = n_child + n_adult - 2
2259
HYPOTHESIS TESTING IN PYTHON
Calculating p-values: two means from different groups
Use t-distribution CDF not normal CDF
from scipy.stats import t
1 - t.cdf(t_stat, df=degrees_of_freedom)
0.030811302165157595
Evidence that Stack Overflow data scientists who started coding as a child earn more.
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Paired t-tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
US Republican presidents dataset
state county repub_percent_08 repub_percent_12
0 Alabama Hale 38.957877 37.139882
1 Arkansas Nevada 56.726272 58.983452
2 California Lake 38.896719 39.331367
3 California Ventura 42.923190 45.250693
.. ... ... ... ...
96 Wisconsin La Crosse 37.490904 40.577038
97 Wisconsin Lafayette 38.104967 41.675050
98 Wyoming Weston 76.684241 83.983328
99 Alaska District 34 77.063259 40.789626
[100 rows x 4 columns]
100 rows; each row represents county-level votes in a presidential election.
1 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ
HYPOTHESIS TESTING IN PYTHON
Hypotheses
Question: Was the percentage of Republican candidate votes lower in 2008 than 2012?
H0 : μ2008 − μ2012 = 0
HA : μ2008 − μ2012 < 0
Set α = 0.05 significance level.
Data is paired → each voter percentage refers to the same county
We want to capture this county-level pairing in the model
HYPOTHESIS TESTING IN PYTHON
From two samples to one
sample_data = repub_votes_potus_08_12
sample_data['diff'] = sample_data['repub_percent_08'] - sample_data['repub_percent_12']
import matplotlib.pyplot as plt
sample_data['diff'].hist(bins=20)
HYPOTHESIS TESTING IN PYTHON
Calculate sample statistics of the difference
xbar_diff = sample_data['diff'].mean()
-2.877109041242944
HYPOTHESIS TESTING IN PYTHON
Revised hypotheses
Old hypotheses:
H0: μ_2008 − μ_2012 = 0
HA: μ_2008 − μ_2012 < 0

New hypotheses:
H0: μ_diff = 0
HA: μ_diff < 0

t = (x̄_diff − μ_diff) / √(s_diff² / n_diff)
df = n_diff − 1
HYPOTHESIS TESTING IN PYTHON
Calculating the p-value
t = (x̄_diff − μ_diff) / √(s_diff² / n_diff),  df = n_diff − 1

n_diff = len(sample_data)
100

s_diff = sample_data['diff'].std()

t_stat = (xbar_diff - 0) / np.sqrt(s_diff ** 2 / n_diff)
-5.601043121928489

degrees_of_freedom = n_diff - 1
99

from scipy.stats import t
p_value = t.cdf(t_stat, df=n_diff - 1)
9.572537285272411e-08
HYPOTHESIS TESTING IN PYTHON
Testing differences between two means using ttest()
import pingouin
pingouin.ttest(x=sample_data['diff'],
y=0,
alternative="less")
T dof alternative p-val CI95% cohen-d \
T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.560104
BF10 power
T-test 1.323e+05 1.0
1Details on Returns from pingouin.ttest() are available in the API docs for pingouin at https://pingouin-
stats.org/generated/pingouin.ttest.html#pingouin.ttest.
HYPOTHESIS TESTING IN PYTHON
ttest() with paired=True
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=True,
alternative="less")
T dof alternative p-val CI95% cohen-d \
T-test -5.601043 99 less 9.572537e-08 [-inf, -2.02] 0.217364
BF10 power
T-test 1.323e+05 0.696338
HYPOTHESIS TESTING IN PYTHON
Unpaired ttest()
pingouin.ttest(x=sample_data['repub_percent_08'],
y=sample_data['repub_percent_12'],
paired=False, # The default
alternative="less")
T dof alternative p-val CI95% cohen-d BF10 \
T-test -1.536997 198 less 0.062945 [-inf, 0.22] 0.217364 0.927
power
T-test 0.454972
Unpaired t-tests on paired data increase the chance of false negative errors
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
ANOVA tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Job satisfaction: 5 categories
stack_overflow['job_sat'].value_counts()
Very satisfied 879
Slightly satisfied 680
Slightly dissatisfied 342
Neither 201
Very dissatisfied 159
Name: job_sat, dtype: int64
HYPOTHESIS TESTING IN PYTHON
Visualizing multiple distributions
Is mean annual compensation different for
different levels of job satisfaction?
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x="converted_comp",
y="job_sat",
data=stack_overflow)
plt.show()
HYPOTHESIS TESTING IN PYTHON
Analysis of variance (ANOVA)
A test for differences between groups
alpha = 0.2
pingouin.anova(data=stack_overflow,
dv="converted_comp",
between="job_sat")
Source ddof1 ddof2 F p-unc np2
0 job_sat 4 2256 4.480485 0.001315 0.007882
0.001315 < α
At least two categories have significantly different compensation
HYPOTHESIS TESTING IN PYTHON
Pairwise tests
μvery dissatisfied ≠ μslightly dissatisfied
μvery dissatisfied ≠ μneither
μvery dissatisfied ≠ μslightly satisfied
μvery dissatisfied ≠ μvery satisfied
μslightly dissatisfied ≠ μneither
μslightly dissatisfied ≠ μslightly satisfied
μslightly dissatisfied ≠ μvery satisfied
μneither ≠ μslightly satisfied
μneither ≠ μvery satisfied
μslightly satisfied ≠ μvery satisfied
Set significance level to α = 0.2.
HYPOTHESIS TESTING IN PYTHON
pairwise_tests()
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="none")
Contrast A B Paired Parametric ... dof alternative p-unc BF10 hedges
0 job_sat Slightly satisfied Very satisfied False True ... 1478.622799 two-sided 0.000064 158.564 -0.192931
1 job_sat Slightly satisfied Neither False True ... 258.204546 two-sided 0.484088 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied False True ... 187.153329 two-sided 0.215179 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied False True ... 569.926329 two-sided 0.969491 0.074 -0.002719
4 job_sat Very satisfied Neither False True ... 328.326639 two-sided 0.097286 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied False True ... 221.666205 two-sided 0.455627 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied False True ... 821.303063 two-sided 0.002166 7.43 0.173247
7 job_sat Neither Very dissatisfied False True ... 321.165726 two-sided 0.585481 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied False True ... 367.730081 two-sided 0.547406 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied False True ... 247.570187 two-sided 0.259590 0.197 0.119131
[10 rows x 11 columns]
HYPOTHESIS TESTING IN PYTHON
As the number of groups increases...
HYPOTHESIS TESTING IN PYTHON
Bonferroni correction
pingouin.pairwise_tests(data=stack_overflow,
dv="converted_comp",
between="job_sat",
padjust="bonf")
Contrast A B ... p-unc p-corr p-adjust BF10 hedges
0 job_sat Slightly satisfied Very satisfied ... 0.000064 0.000638 bonf 158.564 -0.192931
1 job_sat Slightly satisfied Neither ... 0.484088 1.000000 bonf 0.114 -0.068513
2 job_sat Slightly satisfied Very dissatisfied ... 0.215179 1.000000 bonf 0.208 -0.145624
3 job_sat Slightly satisfied Slightly dissatisfied ... 0.969491 1.000000 bonf 0.074 -0.002719
4 job_sat Very satisfied Neither ... 0.097286 0.972864 bonf 0.337 0.120115
5 job_sat Very satisfied Very dissatisfied ... 0.455627 1.000000 bonf 0.126 0.063479
6 job_sat Very satisfied Slightly dissatisfied ... 0.002166 0.021659 bonf 7.43 0.173247
7 job_sat Neither Very dissatisfied ... 0.585481 1.000000 bonf 0.135 -0.058537
8 job_sat Neither Slightly dissatisfied ... 0.547406 1.000000 bonf 0.118 0.055707
9 job_sat Very dissatisfied Slightly dissatisfied ... 0.259590 1.000000 bonf 0.197 0.119131
[10 rows x 11 columns]
HYPOTHESIS TESTING IN PYTHON
More methods
padjust : string
Method used for testing and adjustment of pvalues.
'none' : no correction [default]
'bonf' : one-step Bonferroni correction
'sidak' : one-step Sidak correction
'holm' : step-down method using Bonferroni adjustments
'fdr_bh' : Benjamini/Hochberg FDR correction
'fdr_by' : Benjamini/Yekutieli FDR correction
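Any of these can be passed as the padjust argument; for example, the step-down Holm method (a sketch):
pingouin.pairwise_tests(data=stack_overflow,
                        dv="converted_comp",
                        between="job_sat",
                        padjust="holm")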
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
One-sample
proportion tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Chapter 1 recap
Is a claim about an unknown population proportion feasible?
1. Estimate the standard error of the sample statistic from a bootstrap distribution
2. Compute a standardized test statistic
3. Calculate a p-value
4. Decide which hypothesis makes more sense
Now, calculate the test statistic without using the bootstrap distribution
HYPOTHESIS TESTING IN PYTHON
Standardized test statistic for proportions
p: population proportion (unknown population parameter)
p̂: sample proportion (sample statistic)
p₀: hypothesized population proportion

z = (p̂ − mean(p̂)) / SE(p̂) = (p̂ − p) / SE(p̂)

Assuming H0 is true, p = p₀, so

z = (p̂ − p₀) / SE(p̂)
HYPOTHESIS TESTING IN PYTHON
Simplifying the standard error calculations
SE(p̂) = √(p₀ × (1 − p₀) / n)

→ Under H0, SE(p̂) depends only on the hypothesized p₀ and the sample size n

Assuming H0 is true,

z = (p̂ − p₀) / √(p₀ × (1 − p₀) / n)

Only uses sample information (p̂ and n) and the hypothesized parameter (p₀)
HYPOTHESIS TESTING IN PYTHON
Why z instead of t?
t = (x̄_child − x̄_adult) / √(s_child² / n_child + s_adult² / n_adult)

s is calculated from x̄
x̄ estimates the population mean
s estimates the population standard deviation
→ increased uncertainty in our estimate of the parameter
t-distribution - fatter tails than a normal distribution
p̂ only appears in the numerator, so z-scores are fine
HYPOTHESIS TESTING IN PYTHON
Stack Overflow age categories
H0 : Proportion of Stack Overflow users under thirty = 0.5
HA : Proportion of Stack Overflow users under thirty ≠ 0.5
alpha = 0.01
stack_overflow['age_cat'].value_counts(normalize=True)
Under 30 0.535604
At least 30 0.464396
Name: age_cat, dtype: float64
HYPOTHESIS TESTING IN PYTHON
Variables for z
p_hat = (stack_overflow['age_cat'] == 'Under 30').mean()
0.5356037151702786
p_0 = 0.50
n = len(stack_overflow)
2261
HYPOTHESIS TESTING IN PYTHON
Calculating the z-score
z = (p̂ − p₀) / √(p₀ × (1 − p₀) / n)
import numpy as np
numerator = p_hat - p_0
denominator = np.sqrt(p_0 * (1 - p_0) / n)
z_score = numerator / denominator
3.385911440783663
HYPOTHESIS TESTING IN PYTHON
Calculating the p-value
from scipy.stats import norm

Left-tailed ("less than"):
p_value = norm.cdf(z_score)

Right-tailed ("greater than"):
p_value = 1 - norm.cdf(z_score)

Two-tailed ("not equal"):
p_value = norm.cdf(-z_score) + 1 - norm.cdf(z_score)
p_value = 2 * (1 - norm.cdf(z_score))
0.0007094227368100725

p_value <= alpha
True
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Two-sample
proportion tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Comparing two proportions
H0 : Proportion of hobbyist users is the same for those under thirty as those at least thirty
H0 : p≥30 − p<30 = 0
HA : Proportion of hobbyist users is different for those under thirty compared to those at least thirty
HA : p≥30 − p<30 ≠ 0
alpha = 0.05
HYPOTHESIS TESTING IN PYTHON
Calculating the z-score
z-score equation for a proportion test:

z = ((p̂_≥30 − p̂_<30) − 0) / SE(p̂_≥30 − p̂_<30)

Standard error equation:

SE(p̂_≥30 − p̂_<30) = √(p̂ × (1 − p̂) / n_≥30 + p̂ × (1 − p̂) / n_<30)

p̂ → weighted mean of p̂_≥30 and p̂_<30:

p̂ = (n_≥30 × p̂_≥30 + n_<30 × p̂_<30) / (n_≥30 + n_<30)

Only requires p̂_≥30, p̂_<30, n_≥30, and n_<30 from the sample to calculate the z-score
HYPOTHESIS TESTING IN PYTHON
Getting the numbers for the z-score
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)
age_cat hobbyist
At least 30 Yes 0.773333
No 0.226667
Under 30 Yes 0.843105
No 0.156895
Name: hobbyist, dtype: float64
n = stack_overflow.groupby("age_cat")['hobbyist'].count()
age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64
HYPOTHESIS TESTING IN PYTHON
Getting the numbers for the z-score
p_hats = stack_overflow.groupby("age_cat")['hobbyist'].value_counts(normalize=True)
age_cat hobbyist
At least 30 Yes 0.773333
No 0.226667
Under 30 Yes 0.843105
No 0.156895
Name: hobbyist, dtype: float64
p_hat_at_least_30 = p_hats[("At least 30", "Yes")]
p_hat_under_30 = p_hats[("Under 30", "Yes")]
print(p_hat_at_least_30, p_hat_under_30)
0.773333 0.843105
HYPOTHESIS TESTING IN PYTHON
Getting the numbers for the z-score
n = stack_overflow.groupby("age_cat")['hobbyist'].count()
age_cat
At least 30 1050
Under 30 1211
Name: hobbyist, dtype: int64
n_at_least_30 = n["At least 30"]
n_under_30 = n["Under 30"]
print(n_at_least_30, n_under_30)
1050 1211
HYPOTHESIS TESTING IN PYTHON
Getting the numbers for the z-score
p_hat = ((n_at_least_30 * p_hat_at_least_30 + n_under_30 * p_hat_under_30) /
         (n_at_least_30 + n_under_30))
std_error = np.sqrt(p_hat * (1-p_hat) / n_at_least_30 +
p_hat * (1-p_hat) / n_under_30)
z_score = (p_hat_at_least_30 - p_hat_under_30) / std_error
print(z_score)
-4.223718652693034
HYPOTHESIS TESTING IN PYTHON
Proportion tests using proportions_ztest()
stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64
n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
from statsmodels.stats.proportion import proportions_ztest
z_score, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")
(-4.223691463320559, 2.403330142685068e-05)
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Chi-square test of
independence
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Revisiting the proportion test
age_by_hobbyist = stack_overflow.groupby("age_cat")['hobbyist'].value_counts()
age_cat hobbyist
At least 30 Yes 812
No 238
Under 30 Yes 1021
No 190
Name: hobbyist, dtype: int64
from statsmodels.stats.proportion import proportions_ztest
n_hobbyists = np.array([812, 1021])
n_rows = np.array([812 + 238, 1021 + 190])
stat, p_value = proportions_ztest(count=n_hobbyists, nobs=n_rows,
alternative="two-sided")
(-4.223691463320559, 2.403330142685068e-05)
HYPOTHESIS TESTING IN PYTHON
Independence of variables
Previous hypothesis test result: evidence that hobbyist and age_cat are associated
Statistical independence - proportion of successes in the response variable is the same
across all categories of the explanatory variable
HYPOTHESIS TESTING IN PYTHON
Test for independence of variables
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x='hobbyist',
y='age_cat', correction=False)
print(stats)
test lambda chi2 dof pval cramer power
0 pearson 1.000000 17.839570 1.0 0.000024 0.088826 0.988205
1 cressie-read 0.666667 17.818114 1.0 0.000024 0.088773 0.988126
2 log-likelihood 0.000000 17.802653 1.0 0.000025 0.088734 0.988069
3 freeman-tukey -0.500000 17.815060 1.0 0.000024 0.088765 0.988115
4 mod-log-likelihood -1.000000 17.848099 1.0 0.000024 0.088848 0.988236
5 neyman -2.000000 17.976656 1.0 0.000022 0.089167 0.988694
χ² statistic = 17.839570 = (−4.223691463320559)² = (z-score)²
HYPOTHESIS TESTING IN PYTHON
Job satisfaction and age category
stack_overflow['age_cat'].value_counts()

Under 30       1211
At least 30    1050
Name: age_cat, dtype: int64

stack_overflow['job_sat'].value_counts()

Very satisfied           879
Slightly satisfied       680
Slightly dissatisfied    342
Neither                  201
Very dissatisfied        159
Name: job_sat, dtype: int64
HYPOTHESIS TESTING IN PYTHON
Declaring the hypotheses
H0 : Age categories are independent of job satisfaction levels
HA : Age categories are not independent of job satisfaction levels
alpha = 0.1
Test statistic denoted χ2
Assuming independence, how far away are the observed results from the expected values?
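This distance can also be computed by hand from the expected and observed tables that pingouin.chi2_independence returns (a sketch; it should match the Pearson χ² on the next slide):
import pingouin
expected, observed, stats = pingouin.chi2_independence(
    data=stack_overflow, x="job_sat", y="age_cat")
# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()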
HYPOTHESIS TESTING IN PYTHON
Exploratory visualization: proportional stacked bar plot
props = stack_overflow.groupby('job_sat')['age_cat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)
HYPOTHESIS TESTING IN PYTHON
Exploratory visualization: proportional stacked bar plot
HYPOTHESIS TESTING IN PYTHON
Chi-square independence test
import pingouin
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="job_sat", y="age_cat")
print(stats)
test lambda chi2 dof pval cramer power
0 pearson 1.000000 5.552373 4.0 0.235164 0.049555 0.437417
1 cressie-read 0.666667 5.554106 4.0 0.235014 0.049563 0.437545
2 log-likelihood 0.000000 5.558529 4.0 0.234632 0.049583 0.437871
3 freeman-tukey -0.500000 5.562688 4.0 0.234274 0.049601 0.438178
4 mod-log-likelihood -1.000000 5.567570 4.0 0.233854 0.049623 0.438538
5 neyman -2.000000 5.579519 4.0 0.232828 0.049676 0.439419
Degrees of freedom:
(No. of response categories − 1) × (No. of explanatory categories − 1)
(2 − 1) × (5 − 1) = 4
HYPOTHESIS TESTING IN PYTHON
Swapping the variables?
props = stack_overflow.groupby('age_cat')['job_sat'].value_counts(normalize=True)
wide_props = props.unstack()
wide_props.plot(kind="bar", stacked=True)
HYPOTHESIS TESTING IN PYTHON
Swapping the variables?
HYPOTHESIS TESTING IN PYTHON
chi-square both ways
expected, observed, stats = pingouin.chi2_independence(data=stack_overflow, x="age_cat", y="job_sat")
print(stats[stats['test'] == 'pearson'])
test lambda chi2 dof pval cramer power
0 pearson 1.0 5.552373 4.0 0.235164 0.049555 0.437417
Ask: Are the variables X and Y independent?
Not: Is variable X independent from variable Y?
HYPOTHESIS TESTING IN PYTHON
What about direction and tails?
The χ² statistic sums squared distances between observed and expected counts, so it can never be negative
chi-square tests are almost always right-tailed 1
1Left-tailed chi-square tests are used in statistical forensics to detect if a fit is suspiciously good because the
data was fabricated. Chi-square tests of variance can be two-tailed. These are niche uses, though.
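A sketch of where that right-tail p-value comes from, using the statistic and degrees of freedom from the job satisfaction test earlier:
from scipy.stats import chi2

# Right-tail area above the observed statistic under the chi-square distribution
p_value = chi2.sf(5.552373, df=4)
print(p_value)  # ~0.235, matching the pearson pval in the table above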
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Chi-square
goodness of fit tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Purple links
How do you feel when you discover that you've already visited the top resource?
purple_link_counts = stack_overflow['purple_link'].value_counts()
purple_link_counts = purple_link_counts.rename_axis('purple_link')\
.reset_index(name='n')\
.sort_values('purple_link')
         purple_link     n
2             Amused   368
3            Annoyed   263
0  Hello, old friend  1225
1        Indifferent   405
HYPOTHESIS TESTING IN PYTHON
Declaring the hypotheses
import pandas as pd
hypothesized = pd.DataFrame({
    'purple_link': ['Amused', 'Annoyed', 'Hello, old friend', 'Indifferent'],
    'prop': [1/6, 1/6, 1/2, 1/6]})

         purple_link      prop
0             Amused  0.166667
1            Annoyed  0.166667
2  Hello, old friend  0.500000
3        Indifferent  0.166667

H0 : The sample matches the hypothesized distribution
HA : The sample does not match the hypothesized distribution
χ² measures how far observed results are from expectations in each group
alpha = 0.01
HYPOTHESIS TESTING IN PYTHON
Hypothesized counts by category
n_total = len(stack_overflow)
hypothesized["n"] = hypothesized["prop"] * n_total
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
HYPOTHESIS TESTING IN PYTHON
Visualizing counts
import matplotlib.pyplot as plt
plt.bar(purple_link_counts['purple_link'], purple_link_counts['n'],
color='red', label='Observed')
plt.bar(hypothesized['purple_link'], hypothesized['n'], alpha=0.5,
color='blue', label='Hypothesized')
plt.legend()
plt.show()
HYPOTHESIS TESTING IN PYTHON
Chi-square goodness of fit test
print(hypothesized)
purple_link prop n
0 Amused 0.166667 376.833333
1 Annoyed 0.166667 376.833333
2 Hello, old friend 0.500000 1130.500000
3 Indifferent 0.166667 376.833333
from scipy.stats import chisquare
chisquare(f_obs=purple_link_counts['n'], f_exp=hypothesized['n'])
Power_divergenceResult(statistic=44.59840778416629, pvalue=1.1261810719413759e-09)
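A minimal sketch of the statistic computed by hand, with the counts copied from the two tables above:
import numpy as np

observed = np.array([368, 263, 1225, 405])                          # purple_link_counts
expected = np.array([376.833333, 376.833333, 1130.5, 376.833333])   # hypothesized

# Sum of squared differences, each scaled by the expected count
statistic = ((observed - expected) ** 2 / expected).sum()
print(statistic)  # ~44.598, matching the chisquare() result above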
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Assumptions in
hypothesis testing
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Randomness
Assumption
The samples are random subsets of larger
populations
Consequence
Sample is not representative of population
How to check this
Understand how your data was collected
Speak to the data collector/domain expert
1 Sampling techniques are discussed in "Sampling in Python".
HYPOTHESIS TESTING IN PYTHON
Independence of observations
Assumption
Each observation (row) in the dataset is independent
Consequence
Increased chance of false negative/positive error
How to check this
Understand how your data was collected
HYPOTHESIS TESTING IN PYTHON
Large sample size
Assumption
The sample is big enough to mitigate uncertainty, so that the Central Limit Theorem applies
Consequence
Wider confidence intervals
Increased chance of false negative/positive errors
How to check this
It depends on the test
HYPOTHESIS TESTING IN PYTHON
Large sample size: t-test
One sample: at least 30 observations in the sample (n ≥ 30)
Two samples: at least 30 observations in each sample (n1 ≥ 30, n2 ≥ 30)
Paired samples: at least 30 pairs of observations across the samples
(number of rows in the data ≥ 30)
ANOVA: at least 30 observations in each sample (ni ≥ 30 for all values of i)
n: sample size; ni : sample size for group i
HYPOTHESIS TESTING IN PYTHON
Large sample size: proportion tests
One sample: the numbers of successes and failures in the sample are each
greater than or equal to 10
n × p̂ ≥ 10 and n × (1 − p̂) ≥ 10
Two samples: the numbers of successes and failures in each sample are each
greater than or equal to 10
n1 × p̂1 ≥ 10, n2 × p̂2 ≥ 10, n1 × (1 − p̂1 ) ≥ 10, n2 × (1 − p̂2 ) ≥ 10
n: sample size; p̂: proportion of successes in sample
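A quick sketch of checking the two-sample condition, with made-up sample sizes and proportions:
# Hypothetical samples: n1, n2 observations with success proportions p1_hat, p2_hat
n1, p1_hat = 240, 0.12
n2, p2_hat = 310, 0.09

ok = all([n1 * p1_hat >= 10, n1 * (1 - p1_hat) >= 10,
          n2 * p2_hat >= 10, n2 * (1 - p2_hat) >= 10])
print(ok)  # True: enough successes and failures in each sample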
HYPOTHESIS TESTING IN PYTHON
Large sample size: chi-square tests
The number of successes in each group is greater than or equal to 5
ni × p̂i ≥ 5 for all values of i
The number of failures in each group is greater than or equal to 5
ni × (1 − p̂i ) ≥ 5 for all values of i
ni : sample size for group i
p̂i : proportion of successes in sample group i
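In terms of the contingency table of counts, this says every successes/failures cell needs at least 5 observations. A sketch with hypothetical counts:
import numpy as np

counts = np.array([[12, 48],   # group 1: successes, failures
                   [ 7, 33]])  # group 2: successes, failures
print((counts >= 5).all())     # True: condition met for every cell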
HYPOTHESIS TESTING IN PYTHON
Sanity check
If the bootstrap distribution doesn't look normal, assumptions likely aren't valid
Revisit data collection to check for randomness, independence, and sample size
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Parametric tests
z-test, t-test, and ANOVA are all parametric tests
Assume a normal distribution
Require sufficiently large sample sizes
HYPOTHESIS TESTING IN PYTHON
Smaller Republican votes data
print(repub_votes_small)
state county repub_percent_08 repub_percent_12
80 Texas Red River 68.507522 69.944817
84 Texas Walker 60.707197 64.971903
33 Kentucky Powell 57.059533 61.727293
81 Texas Schleicher 74.386503 77.384464
93 West Virginia Morgan 60.857614 64.068711
HYPOTHESIS TESTING IN PYTHON
Results with pingouin.ttest()
5 pairs is not enough to meet the sample size condition for the paired t-test:
At least 30 pairs of observations across the samples.
alpha = 0.01
import pingouin
pingouin.ttest(x=repub_votes_small['repub_percent_08'],
               y=repub_votes_small['repub_percent_12'],
               paired=True,
               alternative="less")
T dof alternative p-val CI95% cohen-d BF10 power
T-test -5.875753 4 less 0.002096 [-inf, -2.11] 0.500068 26.468 0.239034
HYPOTHESIS TESTING IN PYTHON
Non-parametric tests
Non-parametric tests avoid the parametric assumptions and conditions
Many non-parametric tests use ranks of the data
x = [1, 15, 3, 10, 6]
from scipy.stats import rankdata
rankdata(x)
array([1., 5., 2., 4., 3.])
HYPOTHESIS TESTING IN PYTHON
Non-parametric tests
Non-parametric tests are more reliable than parametric tests for small sample sizes and
when data isn't normally distributed
Wilcoxon signed-rank test
Developed by Frank Wilcoxon in 1945
One of the first non-parametric procedures
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 1)
Works on the ranked absolute differences between the pairs of data
repub_votes_small['diff'] = (repub_votes_small['repub_percent_08'] -
                             repub_votes_small['repub_percent_12'])
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff
80 Texas Red River 68.507522 69.944817 -1.437295
84 Texas Walker 60.707197 64.971903 -4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 2)
Works on the ranked absolute differences between the pairs of data
repub_votes_small['abs_diff'] = repub_votes_small['diff'].abs()
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 3)
Works on the ranked absolute differences between the pairs of data
from scipy.stats import rankdata
repub_votes_small['rank_abs_diff'] = rankdata(repub_votes_small['abs_diff'])
print(repub_votes_small)
state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0
HYPOTHESIS TESTING IN PYTHON
Wilcoxon signed-rank test (Step 4)
state county repub_percent_08 repub_percent_12 diff abs_diff rank_abs_diff
80 Texas Red River 68.507522 69.944817 -1.437295 1.437295 1.0
84 Texas Walker 60.707197 64.971903 -4.264705 4.264705 4.0
33 Kentucky Powell 57.059533 61.727293 -4.667760 4.667760 5.0
81 Texas Schleicher 74.386503 77.384464 -2.997961 2.997961 2.0
93 West Virginia Morgan 60.857614 64.068711 -3.211097 3.211097 3.0
Incorporate the sum of the ranks for negative and positive differences
import numpy as np
T_minus = 1 + 4 + 5 + 2 + 3    # sum of ranks for negative differences (all five here)
T_plus = 0                     # no positive differences
W = np.min([T_minus, T_plus])  # test statistic: W = 0
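The same W, computed from the columns built in Steps 1-3 rather than by hand:
# Sum the ranks separately for negative and positive differences
T_minus = repub_votes_small.loc[repub_votes_small['diff'] < 0, 'rank_abs_diff'].sum()
T_plus = repub_votes_small.loc[repub_votes_small['diff'] > 0, 'rank_abs_diff'].sum()
W = min(T_minus, T_plus)
print(W)  # 0.0, matching the W-val from pingouin.wilcoxon() on the next slide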
HYPOTHESIS TESTING IN PYTHON
Implementation with pingouin.wilcoxon()
alpha = 0.01
pingouin.wilcoxon(x=repub_votes_small['repub_percent_08'],
                  y=repub_votes_small['repub_percent_12'],
                  alternative="less")
W-val alternative p-val RBC CLES
Wilcoxon 0.0 less 0.03125 -1.0 0.72
Fail to reject H0 , since 0.03125 > 0.01
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Non-parametric
ANOVA and
unpaired t-tests
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Wilcoxon-Mann-Whitney test
Also known as the Mann-Whitney U test
A t-test on the ranks of the numeric input (see the sketch below)
Works on unpaired data
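A minimal sketch of the "t-test on ranks" idea with made-up toy data (illustrative only; the real test uses rank-based distributions rather than the t-distribution):
import numpy as np
import pingouin
from scipy.stats import rankdata

x = np.array([4.5, 6.1, 5.2, 7.8, 6.6])
y = np.array([3.9, 4.1, 5.0, 4.4])

# Rank all values together, then split the ranks back into the two groups
pooled_ranks = rankdata(np.concatenate([x, y]))
rank_x, rank_y = pooled_ranks[:len(x)], pooled_ranks[len(x):]

print(pingouin.ttest(x=rank_x, y=rank_y))  # similar conclusion to pingouin.mwu(x=x, y=y)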
HYPOTHESIS TESTING IN PYTHON
Wilcoxon-Mann-Whitney test setup
age_vs_comp = stack_overflow[['converted_comp', 'age_first_code_cut']]
age_vs_comp_wide = age_vs_comp.pivot(columns='age_first_code_cut',
values='converted_comp')
age_first_code_cut adult child
0 77556.0 NaN
1 NaN 74970.0
2 NaN 594539.0
... ... ...
2258 NaN 97284.0
2259 NaN 72000.0
2260 NaN 180000.0
[2261 rows x 2 columns]
HYPOTHESIS TESTING IN PYTHON
Wilcoxon-Mann-Whitney test
alpha = 0.01
import pingouin
pingouin.mwu(x=age_vs_comp_wide['child'],
y=age_vs_comp_wide['adult'],
alternative='greater')
U-val alternative p-val RBC CLES
MWU 744365.5 greater 1.902723e-19 -0.222516 0.611258
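An equivalent call via scipy, assuming age_vs_comp_wide from the previous slide (scipy needs the NaNs dropped explicitly):
from scipy.stats import mannwhitneyu

stat, p = mannwhitneyu(age_vs_comp_wide['child'].dropna(),
                       age_vs_comp_wide['adult'].dropna(),
                       alternative='greater')
print(stat, p)  # same U statistic and p-value as pingouin.mwu() above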
HYPOTHESIS TESTING IN PYTHON
Kruskal-Wallis test
The Kruskal-Wallis test is to the Wilcoxon-Mann-Whitney test as ANOVA is to the t-test
alpha = 0.01
pingouin.kruskal(data=stack_overflow,
dv='converted_comp',
between='job_sat')
Source ddof1 H p-unc
Kruskal job_sat 4 72.814939 5.772915e-15
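An equivalent computation via scipy, assuming stack_overflow is loaded:
from scipy.stats import kruskal

# One array of compensation values per job satisfaction level
groups = [grp['converted_comp'].dropna()
          for _, grp in stack_overflow.groupby('job_sat')]
stat, p = kruskal(*groups)
print(stat, p)  # H ~ 72.8, p ~ 5.8e-15, matching pingouin.kruskal() above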
HYPOTHESIS TESTING IN PYTHON
Let's practice!
HYPOTHESIS TESTING IN PYTHON
Congratulations!
HYPOTHESIS TESTING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
Course recap
Chapter 1
Workflow for testing proportions vs. a hypothesized value
False negative/false positive errors

Chapter 2
Testing differences in sample means between two groups using t-tests
Extending this to more than two groups using ANOVA and pairwise t-tests

Chapter 3
Testing differences in sample proportions between two groups using proportion tests
Using chi-square independence/goodness of fit tests

Chapter 4
Reviewing assumptions of parametric hypothesis tests
Examining non-parametric alternatives when assumptions aren't valid
HYPOTHESIS TESTING IN PYTHON
More courses
Inference
Statistics Fundamentals with Python skill track
Bayesian statistics
Bayesian Data Analysis in Python
Applications
Customer Analytics and A/B Testing in Python
HYPOTHESIS TESTING IN PYTHON
Congratulations!
HYPOTHESIS TESTING IN PYTHON