Intermediate Regression With Statsmodels in Python
Simpson's Paradox
Chapter 3: More explanatory variables
Chapter 4: Multiple logistic regression
print(mdl_mass_vs_both.params)
species[Bream] -672.241866
species[Perch] -713.292859
species[Pike] -1089.456053
species[Roach] -726.777799
length_cm 42.568554
dtype: float64
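The coefficients above come from a "parallel slopes" model: each species gets its own intercept, but all share one length slope. A minimal sketch of how such a model is fit, using synthetic stand-in data (the real fish dataset is not shown here, so the fitted numbers will differ; the shape of the output matches):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical stand-in for the fish dataset used on the slides.
rng = np.random.default_rng(42)
fish = pd.DataFrame({
    "species": np.repeat(["Bream", "Perch", "Pike", "Roach"], 20),
    "length_cm": rng.uniform(10, 50, 80),
})
fish["mass_g"] = 40 * fish["length_cm"] - 700 + rng.normal(0, 50, 80)

# "+ 0" drops the global intercept, so each species gets its own intercept
# while all species share a single length_cm slope.
mdl_mass_vs_both = ols("mass_g ~ length_cm + species + 0", data=fish).fit()
print(mdl_mass_vs_both.params)  # four species intercepts + one slope
```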
print(mdl_mass_vs_species.params)
species[Bream] 617.828571
species[Perch] 382.239286
species[Pike] 718.705882
species[Roach] 152.050000
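A quick sanity check on the species-only model: with "+ 0", its coefficients are exactly the per-species mean masses. A sketch on made-up data:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
fish = pd.DataFrame({
    "species": np.repeat(["Bream", "Perch", "Pike", "Roach"], 10),
    "mass_g": rng.uniform(100, 900, 40),
})

mdl_mass_vs_species = ols("mass_g ~ species + 0", data=fish).fit()
group_means = fish.groupby("species")["mass_g"].mean()

# Each coefficient equals the corresponding group's mean mass.
print(mdl_mass_vs_species.params)
print(group_means)
```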
sns.regplot(x="length_cm",
y="mass_g",
data=fish,
ci=None)
plt.show()
sns.scatterplot(x="length_cm",
y="mass_g",
hue="species",
data=fish)
sns.scatterplot(x="length_cm",
y="mass_g",
color="black",
data=prediction_data)
np.select(conditions, choices)
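np.select can recreate the parallel-slopes prediction by hand: pick the intercept matching each row's species, then add the shared slope times length. A sketch using the coefficients printed earlier (species and lengths here are made up):

```python
import numpy as np

# Coefficients from the parallel slopes model shown above.
intercepts = {"Bream": -672.241866, "Perch": -713.292859,
              "Pike": -1089.456053, "Roach": -726.777799}
slope = 42.568554

species = np.array(["Bream", "Pike", "Roach"])
length_cm = np.array([30.0, 40.0, 20.0])

# One boolean condition per species, one intercept per condition.
conditions = [species == s for s in intercepts]
choices = [intercepts[s] for s in intercepts]
intercept = np.select(conditions, choices)

mass_g = intercept + slope * length_cm
print(mass_g)
```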
print(mdl_mass_vs_length.rsquared)
0.8225689502644215
print(mdl_mass_vs_species.rsquared)
0.25814887709499157
print(mdl_mass_vs_both.rsquared)
0.9200433561156649
R̄² = 1 − (1 − R²) × (n_obs − 1) / (n_obs − n_var − 1)
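The adjusted R-squared penalty is easy to apply directly in code. Taking the length-only model's R-squared from the output above, and assuming n_obs = 128 rows and n_var = 1 explanatory variable (assumed values, chosen because they reproduce the printed adjusted figure):

```python
r_squared = 0.8225689502644215  # mdl_mass_vs_length.rsquared, from the slides
n_obs = 128   # assumed number of observations
n_var = 1     # one explanatory variable (length_cm)

# Adjusted R-squared penalizes extra explanatory variables.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_var - 1)
print(adj_r_squared)
```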
rsq_length: 0.8225689502644215
rsq_adj_length: 0.8211607673300121
rsq_species: 0.25814887709499157
rsq_adj_species: 0.24020086605696722
rsq_both: 0.9200433561156649
rsq_adj_both: 0.9174431400543857
rse_length: 152.12092835414788
rse_species = np.sqrt(mdl_mass_vs_species.mse_resid)
print("rse_species: ", rse_species)
rse_species: 313.5501156682592
rse_both = np.sqrt(mdl_mass_vs_both.mse_resid)
print("rse_both: ", rse_both)
rse_both: 103.35563303966488
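Residual standard error is the square root of the residual mean squared error: the sum of squared residuals divided by the residual degrees of freedom (observations minus fitted coefficients). A numpy-only sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical actual and fitted values from some regression.
actual = np.array([200.0, 250.0, 300.0, 420.0, 500.0])
fitted = np.array([210.0, 240.0, 320.0, 400.0, 510.0])
n_coeffs = 2  # e.g. one intercept + one slope

residuals = actual - fitted
deg_freedom = len(actual) - n_coeffs
rse = np.sqrt(np.sum(residuals ** 2) / deg_freedom)
print(rse)  # same quantity statsmodels exposes as np.sqrt(mdl.mse_resid)
```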
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species",
legend=False)
plt.show()
print(mdl_fish.rsquared_adj)
0.917
print(mdl_bream.rsquared_adj)
0.874
print(mdl_perch.rsquared_adj)
0.917
print(mdl_pike.rsquared_adj)
0.941
print(mdl_roach.rsquared_adj)
0.815
print(np.sqrt(mdl_fish.mse_resid))
103
print(np.sqrt(mdl_bream.mse_resid))
74.2
print(np.sqrt(mdl_perch.mse_resid))
100
print(np.sqrt(mdl_pike.mse_resid))
120
print(np.sqrt(mdl_roach.mse_resid))
38.2
The effect of length on the expected mass is different for different species.
More generally
The effect of one explanatory variable on the expected response changes depending on the
value of another explanatory variable.
print(mdl_mass_vs_both_inter.params)
Intercept -1035.3476
species[T.Perch] 416.1725
species[T.Pike] -505.4767
species[T.Roach] 705.9714
length_cm 54.5500
length_cm:species[T.Perch] -15.6385
length_cm:species[T.Pike] -1.3551
length_cm:species[T.Roach] -31.2307
print(mdl_mass_vs_both_inter.params)
species[Bream] -1035.3476
species[Perch] -619.1751
species[Pike] -1540.8243
species[Roach] -329.3762
species[Bream]:length_cm 54.5500
species[Perch]:length_cm 38.9115
species[Pike]:length_cm 53.1949
species[Roach]:length_cm 23.3193
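With the interaction model, prediction by hand uses a per-species intercept and a per-species slope. A sketch using the coefficients printed above (the helper function is hypothetical, not from the slides):

```python
# Coefficients from the interaction model shown above.
intercepts = {"Bream": -1035.3476, "Perch": -619.1751,
              "Pike": -1540.8243, "Roach": -329.3762}
slopes = {"Bream": 54.5500, "Perch": 38.9115,
          "Pike": 53.1949, "Roach": 23.3193}

def predict_mass(species, length_cm):
    """Expected mass for one fish under the interaction model."""
    return intercepts[species] + slopes[species] * length_cm

print(predict_mass("Bream", 30))
print(predict_mass("Roach", 20))
```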
sns.scatterplot(x="length_cm",
y="mass_g",
data=prediction_data,
hue="species")
plt.show()
print(prediction_data)
        x         y  group
 62.24344  70.60840      D
 52.33499  14.70577      B
 56.36795  46.39554      C
 66.80395  66.17487      D
 66.53605  89.24658      E
 62.38129  91.45260      E
1 https://www.rdocumentation.org/packages/datasauRus/topics/simpsons_paradox
print(mdl_whole.params)
print(mdl_by_group.params)
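Simpson's Paradox can be reproduced in a few lines of synthetic data: each group trends downward, but the group centers line up so the pooled trend is upward. A numpy-only sketch (np.polyfit returns the slope first for a degree-1 fit):

```python
import numpy as np

x_parts, y_parts, group_slopes = [], [], []
for center in [0.0, 10.0, 20.0]:
    x = center + np.linspace(-2, 2, 20)
    y = center - 1.5 * (x - center)  # downward trend within each group
    x_parts.append(x)
    y_parts.append(y)
    group_slopes.append(np.polyfit(x, y, 1)[0])

x_all = np.concatenate(x_parts)
y_all = np.concatenate(y_parts)
whole_slope = np.polyfit(x_all, y_all, 1)[0]

print(whole_slope)    # positive: the pooled data trends up
print(group_slopes)   # all negative: each group trends down
```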
Common advice
You can't choose the best model in general – it depends on the dataset and the question you
are trying to answer.
Context is important.
You may see a zero slope rather than a complete change in direction.
print(mdl_mass_vs_both.params)
Intercept -622.150234
length_cm 28.968405
height_cm 26.334804
print(prediction_data)
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
print(mdl_mass_vs_both_inter.params)
Intercept 159.107480
length_cm 0.301426
height_cm -78.125178
length_cm:height_cm 3.545435
from itertools import product
p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p,
columns=["length_cm",
"height_cm"])
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_both_inter.predict(explanatory_data))
sns.scatterplot(x="length_cm",
y="height_cm",
data=prediction_data,
hue="mass_g",
legend=False,
marker="s")
plt.show()
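The explanatory grid in this pipeline comes from itertools.product, which pairs every length value with every height value. A self-contained sketch with hypothetical grid values:

```python
from itertools import product

import numpy as np
import pandas as pd

length_cm = np.arange(5, 61, 5)   # hypothetical grid: 12 lengths
height_cm = np.arange(2, 21, 2)   # hypothetical grid: 10 heights

# Every (length, height) combination becomes one row.
p = product(length_cm, height_cm)
explanatory_data = pd.DataFrame(p, columns=["length_cm", "height_cm"])

print(explanatory_data.shape)
```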
grid.map(sns.scatterplot,
"length_cm",
"height_cm")
plt.show()
Two-way interactions:
ols(
"mass_g ~ length_cm + height_cm + species + "
"length_cm:height_cm + length_cm:species + height_cm:species + 0",
data=fish).fit()
same as
ols(
"mass_g ~ (length_cm + height_cm + species) ** 2 + 0",
data=fish).fit()
All interactions, including the three-way term:
ols(
"mass_g ~ length_cm + height_cm + species + "
"length_cm:height_cm + length_cm:species + height_cm:species + "
"length_cm:height_cm:species + 0",
data=fish).fit()
same as
ols(
"mass_g ~ length_cm * height_cm * species + 0",
data=fish).fit()
prediction_data = explanatory_data.assign(
mass_g = mdl_mass_vs_all.predict(explanatory_data))
x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10
xy_data = pd.DataFrame({"x": x,
"y": y})
sns.lineplot(x="x",
y="y",
data=xy_data)
Setting the derivative to zero: 0 = 2x − 1, so x = 0.5.
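That calculus result can be checked numerically by finding the grid point with the smallest y in the curve generated above:

```python
import numpy as np

x = np.arange(-4, 5, 0.1)
y = x ** 2 - x + 10

# The grid point with the smallest y sits at the turning point.
x_min = x[np.argmin(y)]
print(x_min)  # near 0.5, matching the calculus
```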
1 https://www.rdocumentation.org/packages/bayesQR/topics/Churn
conf_matrix = mdl_logit.pred_table()
print(conf_matrix)
[[102. 98.]
[ 53. 147.]]
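From that confusion matrix you can read off the standard performance metrics. A sketch using the numbers printed above (pred_table's layout: rows are actual 0/1, columns are predicted 0/1):

```python
import numpy as np

conf_matrix = np.array([[102., 98.],
                        [53., 147.]])

TN, FP = conf_matrix[0]  # actual 0: true negatives, false positives
FN, TP = conf_matrix[1]  # actual 1: false negatives, true positives

accuracy = (TN + TP) / conf_matrix.sum()
sensitivity = TP / (FN + TP)  # share of actual churners caught
specificity = TN / (TN + FP)  # share of non-churners correctly kept

print(accuracy, sensitivity, specificity)
```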
explanatory1 = some_values
explanatory2 = some_values
p = product(explanatory1, explanatory2)
explanatory_data = pd.DataFrame(p,
columns=["explanatory1",
"explanatory2"])
prediction_data = explanatory_data.assign(
has_churned = mdl_logit.predict(explanatory_data))
sns.scatterplot(...
data=churn,
hue="has_churned",
...)
sns.scatterplot(...
data=prediction_data,
hue="most_likely_outcome",
...)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x)}
)
sns.lineplot(x="x",
y="gauss_pdf",
data=gauss_dist)
gauss_dist = pd.DataFrame({
"x": x,
"gauss_pdf": norm.pdf(x),
"gauss_cdf": norm.cdf(x)}
)
sns.lineplot(x="x",
y="gauss_cdf",
data=gauss_dist)
gauss_dist_inv = pd.DataFrame({
"p": p,
"gauss_inv_cdf": norm.ppf(p)}
)
sns.lineplot(x="p",
y="gauss_inv_cdf",
data=gauss_dist_inv)
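norm.ppf is the inverse of norm.cdf, so applying one after the other returns the original value. A quick check:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 13)

# Inverse CDF undoes the CDF: the round trip recovers x.
roundtrip = norm.ppf(norm.cdf(x))
print(np.max(np.abs(roundtrip - x)))  # essentially zero
```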
logistic_dist = pd.DataFrame({
"x": x,
"log_pdf": logistic.pdf(x)}
)
sns.lineplot(x="x",
y="log_pdf",
data=logistic_dist)
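The logistic distribution is the one behind logistic regression: its CDF is the sigmoid function that maps the linear predictor onto a probability. A quick check against scipy:

```python
import numpy as np
from scipy.stats import logistic

x = np.linspace(-6, 6, 25)
sigmoid = 1 / (1 + np.exp(-x))

# logistic.cdf is exactly the sigmoid (with default loc=0, scale=1).
print(np.max(np.abs(logistic.cdf(x) - sigmoid)))
```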
y_actual is always 0 or 1.
When y_actual = 1, the likelihood contribution is y_pred, so the log-likelihood is np.log(y_pred).
When y_actual = 0, the likelihood contribution is 1 - y_pred, so the log-likelihood is np.log(1 - y_pred).
-np.sum(log_likelihoods)
from scipy.optimize import minimize

minimize(
fun=calc_neg_log_likelihood,
x0=[0, 0]
)
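Putting the pieces together, logistic regression can be fit by minimizing the negative log-likelihood directly. A self-contained sketch on synthetic data (names mirror the slides, but the data and the clipping for numerical stability are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: the probability of the outcome rises with x.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 500)
true_prob = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y_actual = (rng.uniform(size=500) < true_prob).astype(float)

def calc_neg_log_likelihood(coeffs):
    intercept, slope = coeffs
    # Sigmoid of the linear predictor; clipped to avoid log(0).
    y_pred = np.clip(1 / (1 + np.exp(-(intercept + slope * x))),
                     1e-12, 1 - 1e-12)
    # Likelihood is y_pred when y_actual = 1, 1 - y_pred when y_actual = 0.
    log_likelihoods = np.log(y_pred) * y_actual + np.log(1 - y_pred) * (1 - y_actual)
    return -np.sum(log_likelihoods)

result = minimize(fun=calc_neg_log_likelihood, x0=[0, 0])
print(result.x)  # estimated intercept and slope
```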
Next steps: cross validation.