Intermediate Regression With Statsmodels in Python

Linear regression and logistic regression are the two most widely used statistical models and act like master keys, unlocking the secrets hidden in datasets. In this course, you’ll build on the skills you gained in "Introduction to Regression in Python with statsmodels", as you learn about linear and logistic regression with multiple explanatory variables.


Parallel slopes linear regression

Maarten Van den Broeck, Content Developer at DataCamp
The previous course
This course assumes knowledge from Introduction to Regression with statsmodels in Python

From simple regression to multiple regression
Multiple regression is a regression model with more than one explanatory variable.

More explanatory variables can give more insight and better predictions.
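This claim can be sketched with plain NumPy least squares: on a hypothetical toy dataset (not the course's fish data), adding a second explanatory variable never worsens the in-sample fit.

```python
import numpy as np

# Hypothetical toy data: y depends on both x1 and x2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 1.0 + 3.0 * x1 + 2.0 * x2

def sum_sq_resid(X, y):
    """Sum of squared residuals of the least squares fit of y on X."""
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    return float(resid @ resid)

# Simple regression: intercept + x1 only.
X_simple = np.column_stack([np.ones_like(x1), x1])
# Multiple regression: intercept + x1 + x2.
X_multiple = np.column_stack([np.ones_like(x1), x1, x2])

rss_simple = sum_sq_resid(X_simple, y)
rss_multiple = sum_sq_resid(X_multiple, y)
# Adding x2 cannot worsen the in-sample fit; here it captures y exactly.
```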

The course contents
Chapter 1: "Parallel slopes" regression

Chapter 2: Interactions; Simpson's Paradox

Chapter 3: More explanatory variables; How linear regression works

Chapter 4: Multiple logistic regression; The logistic distribution; How logistic regression works

The fish dataset
mass_g  length_cm  species
242.0   23.2       Bream
5.9     7.5        Perch
200.0   30.0       Pike
40.0    12.9       Roach

Each row represents a fish. mass_g is the response variable. There is 1 numeric and 1 categorical explanatory variable.
One explanatory variable at a time
from statsmodels.formula.api import ols

mdl_mass_vs_length = ols("mass_g ~ length_cm",
                         data=fish).fit()
print(mdl_mass_vs_length.params)

Intercept   -536.223947
length_cm     34.899245
dtype: float64

1 intercept coefficient, 1 slope coefficient

mdl_mass_vs_species = ols("mass_g ~ species + 0",
                          data=fish).fit()
print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000
dtype: float64

1 intercept coefficient for each category

Both variables at the same time
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0",
data=fish).fit()

print(mdl_mass_vs_both.params)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64

1 slope coefficient

1 intercept coefficient for each category

Comparing coefficients
print(mdl_mass_vs_length.params)

Intercept   -536.223947
length_cm     34.899245

print(mdl_mass_vs_species.params)

species[Bream]    617.828571
species[Perch]    382.239286
species[Pike]     718.705882
species[Roach]    152.050000

print(mdl_mass_vs_both.params)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554

Visualization: 1 numeric explanatory variable
import matplotlib.pyplot as plt
import seaborn as sns

sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)

plt.show()

Visualization: 1 categorical explanatory variable
sns.boxplot(x="species",
y="mass_g",
data=fish,
showmeans=True)

Visualization: both explanatory variables
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream]    -672.241866
species[Perch]    -713.292859
species[Pike]    -1089.456053
species[Roach]    -726.777799
length_cm           42.568554

ic_bream, ic_perch, ic_pike, ic_roach, sl = coeffs

sns.scatterplot(x="length_cm",
                y="mass_g",
                hue="species",
                data=fish)

plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

Let's practice!

Predicting parallel slopes
The prediction workflow
import pandas as pd
import numpy as np

expl_data_length = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(expl_data_length)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60

The prediction workflow
[A, B, C] x [1, 2] ==> [A1, B1, C1, A2, B2, C2]

from itertools import product
product(["A", "B", "C"], [1, 2])

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

expl_data_both = pd.DataFrame(p,
                              columns=['length_cm',
                                       'species'])
print(expl_data_both)

    length_cm species
0           5   Bream
1           5   Roach
2           5   Perch
3           5    Pike
4          10   Bream
5          10   Roach
6          10   Perch
...
41         55   Roach
42         55   Perch
43         55    Pike
44         60   Bream
45         60   Roach
46         60   Perch
47         60    Pike

The prediction workflow
Predict mass_g from length_cm only:

prediction_data_length = expl_data_length.assign(
    mass_g = mdl_mass_vs_length.predict(expl_data_length)
)

    length_cm     mass_g
0           5  -361.7277
1          10  -187.2315
2          15   -12.7353
3          20   161.7610
4          25   336.2572
5          30   510.7534
...  # number of rows: 12

Predict mass_g from both explanatory variables:

prediction_data_both = expl_data_both.assign(
    mass_g = mdl_mass_vs_both.predict(expl_data_both)
)

    length_cm species     mass_g
0           5   Bream  -459.3991
1           5   Roach  -513.9350
2           5   Perch  -500.4501
3           5    Pike  -876.6133
4          10   Bream  -246.5563
5          10   Roach  -301.0923
...  # number of rows: 48

Visualizing the predictions
plt.axline(xy1=(0, ic_bream), slope=sl, color="blue")
plt.axline(xy1=(0, ic_perch), slope=sl, color="green")
plt.axline(xy1=(0, ic_pike), slope=sl, color="red")
plt.axline(xy1=(0, ic_roach), slope=sl, color="orange")

sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)

sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)

Manually calculating predictions for linear regression
coeffs = mdl_mass_vs_length.params
print(coeffs)

Intercept   -536.223947
length_cm     34.899245

intercept, slope = coeffs

explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})

prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"]
)

print(prediction_data)

    length_cm       mass_g
0           5  -361.727721
1          10  -187.231494
2          15   -12.735268
3          20   161.760959
4          25   336.257185
5          30   510.753412
...
9          50  1208.738318
10         55  1383.234545
11         60  1557.730771

Manually calculating predictions for multiple regression
coeffs = mdl_mass_vs_both.params
print(coeffs)

species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554

ic_bream, ic_perch, ic_pike, ic_roach, slope = coeffs

np.select()
conditions = [
condition_1,
condition_2,
# ...
condition_n
]

choices = [list_of_choices] # same length as conditions

np.select(conditions, choices)
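A minimal runnable sketch, reusing rounded intercepts from the fitted model earlier in the chapter (the example species array is hypothetical):

```python
import numpy as np

species = np.array(["Bream", "Perch", "Pike", "Roach", "Perch"])

conditions = [
    species == "Bream",
    species == "Perch",
    species == "Pike",
    species == "Roach",
]
# Intercepts rounded from the fitted model earlier in the chapter.
choices = [-672.24, -713.29, -1089.46, -726.78]

# Each element takes the choice of the first condition that is True.
intercept = np.select(conditions, choices)
print(intercept)
```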

Choosing an intercept with np.select()
conditions = [
    explanatory_data["species"] == "Bream",
    explanatory_data["species"] == "Perch",
    explanatory_data["species"] == "Pike",
    explanatory_data["species"] == "Roach"
]

choices = [ic_bream, ic_perch, ic_pike, ic_roach]

intercept = np.select(conditions, choices)
print(intercept)

[ -672.24  -726.78  -713.29 -1089.46  -672.24  -726.78  -713.29 -1089.46
 ...
  -672.24  -726.78  -713.29 -1089.46]

The final prediction step
prediction_data = explanatory_data.assign(
    intercept = np.select(conditions, choices),
    mass_g = intercept + slope * explanatory_data["length_cm"])

print(prediction_data)

    length_cm species  intercept     mass_g
0           5   Bream  -672.2419  -459.3991
1           5   Roach  -726.7778  -513.9350
2           5   Perch  -713.2929  -500.4501
3           5    Pike -1089.4561  -876.6133
4          10   Bream  -672.2419  -246.5563
5          10   Roach  -726.7778  -301.0923
6          10   Perch  -713.2929  -287.6073
7          10    Pike -1089.4561  -663.7705
8          15   Bream  -672.2419   -33.7136
...
40         55   Bream  -672.2419  1669.0286
41         55   Roach  -726.7778  1614.4927
42         55   Perch  -713.2929  1627.9776
43         55    Pike -1089.4561  1251.8144
44         60   Bream  -672.2419  1881.8714
45         60   Roach  -726.7778  1827.3354
46         60   Perch  -713.2929  1840.8204
47         60    Pike -1089.4561  1464.6572

Compare to .predict()
mdl_mass_vs_both.predict(explanatory_data)

0    -459.3991
1    -513.9350
2    -500.4501
3    -876.6133
4    -246.5563
5    -301.0923
...
43 1251.8144
44 1881.8714
45 1827.3354
46 1840.8204
47 1464.6572

Let's practice!

Assessing model performance
Model performance metrics
Coefficient of determination (R-squared): how well the linear regression line fits the
observed values. Larger is better.

Residual standard error (RSE): the typical size of the residuals. Smaller is better.

Getting the coefficient of determination
print(mdl_mass_vs_length.rsquared)

0.8225689502644215

print(mdl_mass_vs_species.rsquared)

0.25814887709499157

print(mdl_mass_vs_both.rsquared)

0.9200433561156649

Adjusted coefficient of determination
More explanatory variables increase R².
Too many explanatory variables cause overfitting.

The adjusted coefficient of determination penalizes more explanatory variables:

R̄² = 1 − (1 − R²) × (n_obs − 1) / (n_obs − n_var − 1)

The penalty is noticeable when R² is small, or when n_var is a large fraction of n_obs.

In statsmodels, it's contained in the rsquared_adj attribute.
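The formula can be checked numerically against the printed values below (a sketch; the observation count of 128 is inferred from those numbers, not stated on the slide):

```python
def adjusted_r_squared(r_squared, n_obs, n_var):
    """Penalize R-squared by the number of explanatory variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_var - 1)

# Length-only model: 1 explanatory variable, assuming 128 observations.
rsq_adj = adjusted_r_squared(0.8225689502644215, 128, 1)
print(rsq_adj)
```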

Getting the adjusted coefficient of determination
print("rsq_length: ", mdl_mass_vs_length.rsquared)
print("rsq_adj_length: ", mdl_mass_vs_length.rsquared_adj)

rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121

print("rsq_species: ", mdl_mass_vs_species.rsquared)


print("rsq_adj_species: ", mdl_mass_vs_species.rsquared_adj)

rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722

print("rsq_both: ", mdl_mass_vs_both.rsquared)


print("rsq_adj_both: ", mdl_mass_vs_both.rsquared_adj)

rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857

Getting the residual standard error
rse_length = np.sqrt(mdl_mass_vs_length.mse_resid)
print("rse_length: ", rse_length)

rse_length: 152.12092835414788

rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)

rse_species: 313.5501156682592

rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)

rse_both: 103.35563303966488

Let's practice!

Models for each category
Four categories
print(fish["species"].unique())

array(['Bream', 'Roach', 'Perch', 'Pike'], dtype=object)

Splitting the dataset
bream = fish[fish["species"] == "Bream"]
perch = fish[fish["species"] == "Perch"]
pike = fish[fish["species"] == "Pike"]
roach = fish[fish["species"] == "Roach"]

Four models
mdl_bream = ols("mass_g ~ length_cm", data=bream).fit()
print(mdl_bream.params)

Intercept   -1035.3476
length_cm      54.5500

mdl_perch = ols("mass_g ~ length_cm", data=perch).fit()
print(mdl_perch.params)

Intercept   -619.1751
length_cm     38.9115

mdl_pike = ols("mass_g ~ length_cm", data=pike).fit()
print(mdl_pike.params)

Intercept   -1540.8243
length_cm      53.1949

mdl_roach = ols("mass_g ~ length_cm", data=roach).fit()
print(mdl_roach.params)

Intercept   -329.3762
length_cm     23.3193

Explanatory data
explanatory_data = pd.DataFrame(
    {"length_cm": np.arange(5, 61, 5)})
print(explanatory_data)

    length_cm
0           5
1          10
2          15
3          20
4          25
5          30
6          35
7          40
8          45
9          50
10         55
11         60

Making predictions
prediction_data_bream = explanatory_data.assign(
    mass_g = mdl_bream.predict(explanatory_data),
    species = "Bream")

prediction_data_perch = explanatory_data.assign(
    mass_g = mdl_perch.predict(explanatory_data),
    species = "Perch")

prediction_data_pike = explanatory_data.assign(
    mass_g = mdl_pike.predict(explanatory_data),
    species = "Pike")

prediction_data_roach = explanatory_data.assign(
    mass_g = mdl_roach.predict(explanatory_data),
    species = "Roach")

Concatenating predictions
prediction_data = pd.concat([prediction_data_bream,
                             prediction_data_roach,
                             prediction_data_perch,
                             prediction_data_pike])

    length_cm       mass_g species
0           5  -762.597660   Bream
1          10  -489.847756   Bream
2          15  -217.097851   Bream
3          20    55.652054   Bream
4          25   328.401958   Bream
5          30   601.151863   Bream
...
3          20  -476.926955    Pike
4          25  -210.952626    Pike
5          30    55.021703    Pike
6          35   320.996032    Pike
7          40   586.970362    Pike
8          45   852.944691    Pike
9          50  1118.919020    Pike
10         55  1384.893349    Pike
11         60  1650.867679    Pike

Visualizing predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)
plt.show()

Adding in your predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)

sns.scatterplot(x="length_cm",
                y="mass_g",
                data=prediction_data,
                hue="species",
                legend=False)

plt.show()

Coefficient of determination
mdl_fish = ols("mass_g ~ length_cm + species",
               data=fish).fit()
print(mdl_fish.rsquared_adj)

0.917

print(mdl_bream.rsquared_adj)

0.874

print(mdl_perch.rsquared_adj)

0.917

print(mdl_pike.rsquared_adj)

0.941

print(mdl_roach.rsquared_adj)

0.815

Residual standard error
print(np.sqrt(mdl_fish.mse_resid))

103

print(np.sqrt(mdl_bream.mse_resid))

74.2

print(np.sqrt(mdl_perch.mse_resid))

100

print(np.sqrt(mdl_pike.mse_resid))

120

print(np.sqrt(mdl_roach.mse_resid))

38.2

Let's practice!

One model with an interaction
What is an interaction?
In the fish dataset
Different fish species have different mass to length ratios.

The effect of length on the expected mass is different for different species.

More generally
The effect of one explanatory variable on the expected response changes depending on the
value of another explanatory variable.

Specifying interactions
No interactions
response ~ explntry1 + explntry2
mass_g ~ length_cm + species

With interactions (implicit)
response ~ explntry1 * explntry2
mass_g ~ length_cm * species

With interactions (explicit)
response ~ explntry1 + explntry2 + explntry1:explntry2
mass_g ~ length_cm + species + length_cm:species

Running the model
mdl_mass_vs_both = ols("mass_g ~ length_cm * species", data=fish).fit()

print(mdl_mass_vs_both.params)

Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307

Easier to understand coefficients
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0", data=fish).fit()

print(mdl_mass_vs_both_inter.params)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

Familiar numbers
print(mdl_mass_vs_both_inter.params)

species[Bream]             -1035.3476
species[Perch]              -619.1751
species[Pike]              -1540.8243
species[Roach]              -329.3762
species[Bream]:length_cm      54.5500
species[Perch]:length_cm      38.9115
species[Pike]:length_cm       53.1949
species[Roach]:length_cm      23.3193

print(mdl_bream.params)

Intercept   -1035.3476
length_cm      54.5500

Let's practice!

Making predictions with interactions
The model with the interaction
mdl_mass_vs_both_inter = ols("mass_g ~ species + species:length_cm + 0",
data=fish).fit()

print(mdl_mass_vs_both_inter.params)

species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193

The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
species = fish["species"].unique()

p = product(length_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

print(prediction_data)

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
6          10   Perch  -230.0604
7          10    Pike -1008.8756
8          15   Bream  -217.0979
...
40         55   Bream  1964.9014
41         55   Roach   953.1833
42         55   Perch  1520.9556
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677

Visualizing the predictions
sns.lmplot(x="length_cm",
y="mass_g",
data=fish,
hue="species",
ci=None)

sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")

plt.show()

Manually calculating the predictions
coeffs = mdl_mass_vs_both_inter.params
print(coeffs)

species[Bream]             -1035.3476
species[Perch]              -619.1751
species[Pike]              -1540.8243
species[Roach]              -329.3762
species[Bream]:length_cm      54.5500
species[Perch]:length_cm      38.9115
species[Pike]:length_cm       53.1949
species[Roach]:length_cm      23.3193

(ic_bream, ic_perch, ic_pike, ic_roach,
 slope_bream, slope_perch, slope_pike, slope_roach) = coeffs

Manually calculating the predictions
conditions = [
explanatory_data["species"] == "Bream",
explanatory_data["species"] == "Perch",
explanatory_data["species"] == "Pike",
explanatory_data["species"] == "Roach"
]

ic_choices = [ic_bream, ic_perch, ic_pike, ic_roach]


intercept = np.select(conditions, ic_choices)

slope_choices = [slope_bream, slope_perch, slope_pike, slope_roach]


slope = np.select(conditions, slope_choices)

Manually calculating the predictions
prediction_data = explanatory_data.assign(
    mass_g = intercept + slope * explanatory_data["length_cm"])

print(prediction_data)

    length_cm species     mass_g
0           5   Bream  -762.5977
1           5   Roach  -212.7799
2           5   Perch  -424.6178
3           5    Pike -1274.8499
4          10   Bream  -489.8478
5          10   Roach   -96.1836
...
43         55    Pike  1384.8933
44         60   Bream  2237.6513
45         60   Roach  1069.7796
46         60   Perch  1715.5129
47         60    Pike  1650.8677

This matches the .predict() results:

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

Let's practice!

Simpson's Paradox
A most ingenious paradox!
Simpson's Paradox occurs when the trend of a model on the whole dataset is very different
from the trends shown by models on subsets of the dataset.

trend = slope coefficient

Synthetic Simpson data
       x        y  group
62.24344 70.60840      D
52.33499 14.70577      B
56.36795 46.39554      C
66.80395 66.17487      D
66.53605 89.24658      E
62.38129 91.45260      E

5 groups of data, labeled "A" to "E"

https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox

Linear regressions
Whole dataset:

mdl_whole = ols("y ~ x",
                data=simpsons_paradox).fit()
print(mdl_whole.params)

Intercept   -38.554
x             1.751

By group:

mdl_by_group = ols("y ~ group + group:x + 0",
                   data=simpsons_paradox).fit()
print(mdl_by_group.params)

 groupA    groupB    groupC    groupD    groupE
32.5051   67.3886   99.6333  132.3932  123.8242
groupA:x  groupB:x  groupC:x  groupD:x  groupE:x
-0.6266   -1.0105   -0.9940   -0.9908   -0.5364

Plotting the whole dataset
sns.regplot(x="x",
y="y",
data=simpsons_paradox,
ci=None)

Plotting by group
sns.lmplot(x="x",
y="y",
data=simpsons_paradox,
hue="group",
ci=None)

Reconciling the difference
Good advice
If possible, try to plot the dataset.

Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.

More good advice


Articulate a question before you start modeling.

Test score example

Infectious disease example

Reconciling the difference
Usually (but not always) the grouped model contains more insight.

Are you missing explanatory variables?

Context is important.

Simpson's paradox in real datasets
The paradox is usually less obvious.

You may see a zero slope rather than a complete change in direction.

It may not appear in every group.

Let's practice!

Two numeric explanatory variables
Visualizing three numeric variables
3D scatter plot

2D scatter plot with response as color

Another column for the fish dataset
species mass_g length_cm height_cm
Bream 1000 33.5 18.96
Bream 925 36.2 18.75
Roach 290 24.0 8.88
Roach 390 29.5 9.48
Perch 1100 39.0 12.80
Perch 1000 40.2 12.60
Pike 1250 52.0 10.69
Pike 1650 59.0 10.81

3D scatter plot

2D scatter plot, color for response
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

Modeling with two numeric explanatory variables
mdl_mass_vs_both = ols("mass_g ~ length_cm + height_cm",
data=fish).fit()

print(mdl_mass_vs_both.params)

Intercept -622.150234
length_cm 28.968405
height_cm 26.334804

The prediction flow
from itertools import product

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_both.predict(explanatory_data))

print(prediction_data)

     length_cm  height_cm       mass_g
0            5          2  -424.638603
1            5          4  -371.968995
2            5          6  -319.299387
3            5          8  -266.629780
4            5         10  -213.960172
..         ...        ...          ...
115         60         12  1431.971694
116         60         14  1484.641302
117         60         16  1537.310909
118         60         18  1589.980517
119         60         20  1642.650125

[120 rows x 3 columns]

Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")

plt.show()

Including an interaction
mdl_mass_vs_both_inter = ols("mass_g ~ length_cm * height_cm",
data=fish).fit()

print(mdl_mass_vs_both_inter.params)

Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435

The prediction flow with an interaction
length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)

p = product(length_cm, height_cm)

explanatory_data = pd.DataFrame(p,
columns=["length_cm",
"height_cm"])

prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))

Plotting the predictions
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")

plt.show()

Let's practice!

More than two explanatory variables
From last time
sns.scatterplot(x="length_cm",
y="height_cm",
data=fish,
hue="mass_g")

Faceting by species
grid = sns.FacetGrid(data=fish,
col="species",
hue="mass_g",
col_wrap=2,
palette="plasma")

grid.map(sns.scatterplot,
"length_cm",
"height_cm")

plt.show()

Faceting by species
It's possible to use more than one categorical variable for faceting.

Beware of faceting overuse.

Plotting becomes harder with an increasing number of variables.

Different levels of interaction
No interactions:

ols("mass_g ~ length_cm + height_cm + species + 0", data=fish).fit()

Two-way interactions between pairs of variables:

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

Three-way interaction between all three variables:

ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()

All the interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + "
    "length_cm:height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

Only two-way interactions
ols("mass_g ~ length_cm + height_cm + species + "
    "length_cm:height_cm + length_cm:species + height_cm:species + 0",
    data=fish).fit()

same as

ols("mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
    data=fish).fit()

The prediction flow
mdl_mass_vs_all = ols(
    "mass_g ~ length_cm * height_cm * species + 0",
    data=fish).fit()

length_cm = np.arange(5, 61, 5)
height_cm = np.arange(2, 21, 2)
species = fish["species"].unique()

p = product(length_cm, height_cm, species)

explanatory_data = pd.DataFrame(p,
                                columns=["length_cm",
                                         "height_cm",
                                         "species"])

prediction_data = explanatory_data.assign(
    mass_g = mdl_mass_vs_all.predict(explanatory_data))

     length_cm  height_cm species       mass_g
0            5          2   Bream  -570.656437
1            5          2   Roach    31.449145
2            5          2   Perch    43.789984
3            5          2    Pike   271.270093
4            5          4   Bream  -451.127405
..         ...        ...     ...          ...
475         60         18    Pike  2690.346384
476         60         20   Bream  1531.618475
477         60         20   Roach  2621.797668
478         60         20   Perch  3041.931709
479         60         20    Pike  2926.352397

[480 rows x 4 columns]

Let's practice!

How linear regression works
The standard simple linear regression plot

Visualizing residuals

A metric for the best fit
The simplest idea (which doesn't work)
Take the sum of all the residuals.

Some residuals are negative.

The next simplest idea (which does work)


Take the square of each residual, and add up those squares.

This is called the sum of squares.
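With hypothetical numbers, the cancellation problem and the sum-of-squares fix look like this:

```python
# Hypothetical observed values and fitted values from a candidate line.
y_actual = [8.0, 9.0, 18.0, 19.0, 28.0]
y_fitted = [6.4, 11.4, 16.4, 21.4, 26.4]

residuals = [a - f for a, f in zip(y_actual, y_fitted)]

# Raw residuals cancel: positive and negative errors hide each other.
sum_resid = sum(residuals)

# Squaring first keeps every residual's contribution positive.
sum_sq = sum(r ** 2 for r in residuals)
```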

A detour into numerical optimization
A line plot of a quadratic equation

x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10

xy_data = pd.DataFrame({"x": x,
"y": y})

sns.lineplot(x="x",
y="y",
data=xy_data)

Using calculus to solve the equation
y = x² − x + 10

dy/dx = 2x − 1

Setting the derivative to zero: 0 = 2x − 1, so x = 0.5.

y = 0.5² − 0.5 + 10 = 9.75

Not all equations can be solved like this.

You can let Python figure it out.

Don't worry if this doesn't make sense, you won't need it for the exercises.

minimize()
from scipy.optimize import minimize

def calc_quadratic(x):
    y = x ** 2 - x + 10
    return y

minimize(fun=calc_quadratic,
         x0=3)

      fun: 9.75
 hess_inv: array([[0.5]])
      jac: array([0.])
  message: 'Optimization terminated successfully.'
     nfev: 6
      nit: 2
     njev: 3
   status: 0
  success: True
        x: array([0.49999998])

A linear regression algorithm
Define a function to calculate the sum of squares metric.

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    # More calculation!

Call minimize() to find coefficients that minimize this function.

minimize(
    fun=calc_sum_of_squares,
    x0=[0, 0]
)
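Putting the pieces together, here is a minimal end-to-end sketch on made-up data generated from the line y = 2 + 3x (the names x_actual and y_actual are illustrative, not from the course datasets):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, noise-free data from the line y = 2 + 3x
x_actual = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_actual = 2 + 3 * x_actual

def calc_sum_of_squares(coeffs):
    intercept, slope = coeffs
    y_pred = intercept + slope * x_actual
    # Sum of squared residuals: the metric minimize() will drive down
    return np.sum((y_pred - y_actual) ** 2)

result = minimize(fun=calc_sum_of_squares, x0=[0, 0])
print(result.x)  # approximately [2. 3.]
```

Because the data were generated from a known line, the optimizer recovers the intercept and slope that were used to create them.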



Let's practice!
Multiple logistic regression
Bank churn dataset
has_churned  time_since_first_purchase  time_since_last_purchase
          0                  0.3993247                -0.5158691
          1                 -0.4297957                 0.6780654
          0                  3.7383122                 0.4082544
          0                  0.6032289                -0.6990435
        ...                        ...                       ...
   response     length of relationship       recency of activity

1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn



logit()
from statsmodels.formula.api import logit

logit("response ~ explanatory", data=dataset).fit()

logit("response ~ explanatory1 + explanatory2", data=dataset).fit()

logit("response ~ explanatory1 * explanatory2", data=dataset).fit()



The four outcomes
              predicted false   predicted true
actual false          correct   false positive
actual true    false negative          correct

conf_matrix = mdl_logit.pred_table()

print(conf_matrix)

[[102. 98.]
[ 53. 147.]]
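The counts above can be turned into summary metrics. This sketch uses the same confusion matrix with standard definitions of accuracy, sensitivity, and specificity (these metrics are not shown in the slides):

```python
import numpy as np

# Confusion matrix from pred_table(): rows = actual, columns = predicted
conf_matrix = np.array([[102., 98.],
                        [53., 147.]])

TN, FP = conf_matrix[0]  # actual false: correct, false positive
FN, TP = conf_matrix[1]  # actual true: false negative, correct

accuracy = (TN + TP) / conf_matrix.sum()
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)

print(accuracy)     # 0.6225
print(sensitivity)  # 0.735
print(specificity)  # 0.51
```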



Prediction flow
from itertools import product

explanatory1 = some_values
explanatory2 = some_values

p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
columns=["explanatory1",
"explanatory2"])
prediction_data = explanatory_data.assign(
    has_churned = mdl_logit.predict(explanatory_data))
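To make the product() step concrete, here is a small sketch with hypothetical grids of values (explanatory1 and explanatory2 are placeholders, not course variables):

```python
import pandas as pd
from itertools import product

# Hypothetical grids of explanatory values
explanatory1 = [0, 1, 2]
explanatory2 = [-1, 1]

# product() yields every combination: 3 x 2 = 6 rows
p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p, columns=["explanatory1",
                                            "explanatory2"])
print(explanatory_data.shape)  # (6, 2)
```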



Visualization
prediction_data["most_likely_outcome"] = np.round(prediction_data["has_churned"])

sns.scatterplot(...
data=churn,
hue="has_churned",
...)

sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)



Let's practice!
The logistic distribution
Gaussian probability density function (PDF)
from scipy.stats import norm

x = np.arange(-4, 4.05, 0.05)

gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)

sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)



Gaussian cumulative distribution function (CDF)
x = np.arange(-4, 4.05, 0.05)

gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)

sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)





Gaussian inverse CDF
p = np.arange(0.001, 1, 0.001)

gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)

sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)
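As a quick sanity check (not in the slides), the inverse CDF undoes the CDF:

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.5, 0.0, 2.0])

# ppf() is the inverse of cdf(), so the round trip returns x
round_trip = norm.ppf(norm.cdf(x))
print(np.allclose(round_trip, x))  # True
```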



Logistic PDF
from scipy.stats import logistic

x = np.arange(-4, 4.05, 0.05)

logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)

sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)



Logistic distribution
Logistic distribution CDF is also called the logistic function.

cdf(x) = 1 / (1 + exp(−x))

Logistic distribution inverse CDF is also called the logit function.

inverse_cdf(p) = log(p / (1 − p))
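Both formulas can be checked numerically against scipy.stats.logistic (a quick verification sketch, not part of the exercises):

```python
import numpy as np
from scipy.stats import logistic

x = np.array([-2.0, 0.0, 1.5])
p = np.array([0.1, 0.5, 0.9])

# The CDF matches the logistic function 1 / (1 + exp(-x))
cdf_matches = np.allclose(logistic.cdf(x), 1 / (1 + np.exp(-x)))

# The inverse CDF matches the logit function log(p / (1 - p))
inv_cdf_matches = np.allclose(logistic.ppf(p), np.log(p / (1 - p)))

print(cdf_matches, inv_cdf_matches)  # True True
```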



Let's practice!
How logistic regression works
Sum of squares doesn't work
np.sum((y_pred - y_actual) ** 2)

y_actual is always 0 or 1.

y_pred is between 0 and 1.

There is a better metric than sum of squares.



Likelihood
y_pred * y_actual

y_pred * y_actual + (1 - y_pred) * (1 - y_actual)

np.sum(y_pred * y_actual + (1 - y_pred) * (1 - y_actual))

When y_actual = 1

y_pred * 1 + (1 - y_pred) * (1 - 1) = y_pred

When y_actual = 0

y_pred * 0 + (1 - y_pred) * (1 - 0) = 1 - y_pred
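The two cases can be verified with a few made-up predictions (these values are illustrative, not from the churn dataset):

```python
import numpy as np

# Made-up predicted probabilities and actual outcomes
y_pred = np.array([0.2, 0.8, 0.6])
y_actual = np.array([0, 1, 1])

# Each term picks y_pred when y_actual is 1, and 1 - y_pred when it is 0
likelihoods = y_pred * y_actual + (1 - y_pred) * (1 - y_actual)
print(likelihoods)  # [0.8 0.8 0.6]
```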



Log-likelihood
Computing likelihood involves adding many very small numbers, leading to numerical error.

Log-likelihood is easier to compute.

log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)

Both equations give the same answer.
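"Same answer" here means the log of each likelihood term equals the corresponding log-likelihood term, which a quick check confirms (again with made-up values):

```python
import numpy as np

y_pred = np.array([0.2, 0.8, 0.6])
y_actual = np.array([0, 1, 1])

likelihood = y_pred * y_actual + (1 - y_pred) * (1 - y_actual)
log_likelihood = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)

# Taking the log of each likelihood term gives the log-likelihood terms
print(np.allclose(np.log(likelihood), log_likelihood))  # True
```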



Negative log-likelihood
Maximizing log-likelihood is the same as minimizing negative log-likelihood.

-np.sum(log_likelihoods)



Logistic regression algorithm
def calc_neg_log_likelihood(coeffs):
intercept, slope = coeffs
# More calculation!

from scipy.optimize import minimize

minimize(
fun=calc_neg_log_likelihood,
x0=[0, 0]
)
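Here is a minimal end-to-end sketch of this algorithm on made-up binary data (x_actual and y_actual are illustrative, not course data; the logistic function fills in the "More calculation!" step):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up binary data: the outcome is mostly 1 for larger x
x_actual = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y_actual = np.array([0, 0, 1, 0, 1, 1])

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # The logistic function maps the linear predictor to a probability
    y_pred = 1 / (1 + np.exp(-(intercept + slope * x_actual)))
    log_likelihoods = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    # Minimizing the negative sum maximizes the log-likelihood
    return -np.sum(log_likelihoods)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
intercept, slope = result.x
print(slope > 0)  # True: the predicted probability rises with x
```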



Let's practice!
Congratulations!
You learned things
Chapter 1
Fit/visualize/predict/assess parallel slopes

Chapter 2
Interactions between explanatory variables
Simpson's Paradox

Chapter 3
Extend to many explanatory variables
Implement linear regression algorithm

Chapter 4
Logistic regression with multiple explanatory variables
Logistic distribution
Implement logistic regression algorithm



There is more to learn
Training and testing sets

Cross validation

P-values and significance



Advanced regression
Generalized Linear Models in Python

Introduction to Predictive Analytics in Python

Linear Classifiers in Python

Machine Learning with Tree-Based Models in Python



Have fun regressing!