Lab: Linear Regression
• Load Datasets
• 3.6.2 Simple Linear Regression
• 3.6.3 Multiple Linear Regression
• 3.6.4 Interaction Terms
• 3.6.5 Non-linear Transformations of the Predictors
• 3.6.6 Qualitative Predictors
## perform imports and set-up
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot') # emulate pretty R-style plots
Load Datasets
# Load the Boston housing data set from scikit-learn
from sklearn.datasets import load_boston
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target
3.6.2 Simple Linear Regression
# Plot MEDV against LSTAT
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(boston_df.LSTAT.values, boston_df.MEDV.values,
           facecolors='none', edgecolors='b', label="data");
ax.set_xlabel('LSTAT');
ax.set_ylabel('MEDV');

# Regress MEDV onto LSTAT with an intercept
X = sm.add_constant(boston_df.LSTAT.values)
linear_results = sm.OLS(boston_df.MEDV.values, X).fit()
print(linear_results.summary())
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     601.6
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           5.08e-88
Time:                        10:05:12   Log-Likelihood:                -1641.5
No. Observations:                 506   AIC:                             3287.
Df Residuals:                     504   BIC:                             3295.
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         34.5538      0.563     61.415      0.000        33.448    35.659
LSTAT         -0.9500      0.039    -24.528      0.000        -1.026    -0.874
==============================================================================
Omnibus:                      137.043   Durbin-Watson:                   0.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              291.373
Skew:                           1.453   Prob(JB):                     5.36e-64
Kurtosis:                       5.319   Cond. No.                         29.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Overlay the fitted regression line on the data
ax.plot(boston_df.LSTAT.values, linear_results.fittedvalues.values,
        'k-', label='linear fit')
ax.legend(loc='best');
plt.xlabel('LSTAT');
plt.ylabel('MEDV');
Diagnostic Plots for Linear Model
# Create plots of residuals
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

# RESIDUALS
# The fitted results object contains both fitted values and residuals
fitted_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values
ax1.scatter(fitted_values, residuals, facecolors='none', edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
# STUDENTIZED RESIDUALS
# To assess outliers we look at the studentized residuals, found in the
# 10th column of the data array returned by summary_table
from statsmodels.stats.outliers_influence import summary_table
st, data, ss2 = summary_table(linear_results, alpha=0.05)
studentized_residuals = data[:,10]
ax2.scatter(fitted_values, studentized_residuals, facecolors='none', edgecolors='b');
ax2.set_xlabel('fitted values');
ax2.set_ylabel('studentized residuals');
# We can also examine the leverages to identify points that may unduly
# alter the regression line
from statsmodels.stats.outliers_influence import OLSInfluence
leverage = OLSInfluence(linear_results).hat_matrix_diag
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(leverage, studentized_residuals,facecolors='none',
edgecolors='b');
ax.set_xlabel('Leverage');
ax.set_ylabel('Studentized Residuals');
3.6.3 Multiple Linear Regression
Here we estimate MEDV using multiple linear regression. In the first example we
regress MEDV onto LSTAT and AGE.
# create our design matrix using the LSTAT and AGE predictors
X = sm.add_constant(boston_df[['LSTAT','AGE']])
multi_results = sm.OLS(boston_df.MEDV, X).fit()
print('Model parameters:', multi_results.params, sep='\n')
Model parameters:
const 33.222761
LSTAT -1.032069
AGE 0.034544
dtype: float64
Now we will perform the regression over all 13 predictors in the Boston housing
dataset.
# create our design matrix using all the predictors (the last column is MEDV)
X = sm.add_constant(boston_df.iloc[:,0:-1])
full_results = sm.OLS(boston_df.MEDV, X).fit()
print(full_results.summary())
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.95e-135
Time:                        10:05:13   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.4911      5.104      7.149      0.000        26.462    46.520
CRIM          -0.1072      0.033     -3.276      0.001        -0.171    -0.043
ZN             0.0464      0.014      3.380      0.001         0.019     0.073
INDUS          0.0209      0.061      0.339      0.735        -0.100     0.142
CHAS           2.6886      0.862      3.120      0.002         0.996     4.381
NOX          -17.7958      3.821     -4.658      0.000       -25.302   -10.289
RM             3.8048      0.418      9.102      0.000         2.983     4.626
AGE            0.0008      0.013      0.057      0.955        -0.025     0.027
DIS           -1.4758      0.199     -7.398      0.000        -1.868    -1.084
RAD            0.3057      0.066      4.608      0.000         0.175     0.436
TAX           -0.0123      0.004     -3.278      0.001        -0.020    -0.005
PTRATIO       -0.9535      0.131     -7.287      0.000        -1.211    -0.696
B              0.0094      0.003      3.500      0.001         0.004     0.015
LSTAT         -0.5255      0.051    -10.366      0.000        -0.625    -0.426
==============================================================================
Omnibus:                      178.029   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              782.015
Skew:                           1.521   Prob(JB):                    1.54e-170
Kurtosis:                       8.276   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are strong multicollinearity or other numerical problems.
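3.6.4 Interaction Terms

The cell that produced the interaction summary below did not survive conversion; in statsmodels such a model is typically written with the formula API, e.g. smf.ols('MEDV ~ LSTAT * AGE', data=boston_df). A self-contained sketch of the `*` operator on synthetic data (the frame and coefficients here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Boston frame with a true interaction effect
rng = np.random.RandomState(2)
df = pd.DataFrame({'LSTAT': rng.uniform(1, 35, 300),
                   'AGE': rng.uniform(0, 100, 300)})
df['MEDV'] = (36 - 1.4 * df.LSTAT - 0.001 * df.AGE
              + 0.004 * df.LSTAT * df.AGE + rng.normal(scale=3, size=300))

# 'LSTAT * AGE' expands to LSTAT + AGE + LSTAT:AGE, i.e. both main
# effects plus the interaction term
interaction_results = smf.ols('MEDV ~ LSTAT * AGE', data=df).fit()
print(interaction_results.params.index.tolist())
```

The coefficient names in the summary below (Intercept, LSTAT, AGE, LSTAT:AGE) are exactly what this expansion produces.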
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.556
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     209.3
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           4.86e-88
Time:                        10:05:13   Log-Likelihood:                -1635.0
No. Observations:                 506   AIC:                             3278.
Df Residuals:                     502   BIC:                             3295.
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     36.0885      1.470     24.553      0.000        33.201    38.976
LSTAT         -1.3921      0.167     -8.313      0.000        -1.721    -1.063
AGE           -0.0007      0.020     -0.036      0.971        -0.040     0.038
LSTAT:AGE      0.0042      0.002      2.244      0.025         0.001     0.008
==============================================================================
Omnibus:                      135.601   Durbin-Watson:                   0.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              296.955
Skew:                           1.417   Prob(JB):                     3.29e-65
Kurtosis:                       5.461   Cond. No.                     6.88e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.88e+03. This might indicate that there are strong multicollinearity or other numerical problems.
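3.6.5 Non-linear Transformations of the Predictors

The cell behind the quadratic summary below is also missing; the coefficient name I(LSTAT ** 2) in the table indicates it used patsy's I() wrapper inside a formula. A self-contained sketch of I() on synthetic data (illustrative frame and coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a genuine quadratic relationship
rng = np.random.RandomState(3)
df = pd.DataFrame({'LSTAT': rng.uniform(1, 35, 300)})
df['MEDV'] = (43 - 2.3 * df.LSTAT + 0.043 * df.LSTAT ** 2
              + rng.normal(scale=3, size=300))

# I(...) makes patsy evaluate the expression as plain Python, so LSTAT**2
# enters the design matrix as an additional predictor
results = smf.ols('MEDV ~ LSTAT + I(LSTAT ** 2)', data=df).fit()
print(results.params.index.tolist())
```

Without I(), the ** operator would be interpreted by the formula language itself rather than as arithmetic.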
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          1.56e-112
Time:                        10:05:13   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        42.8620      0.872     49.149      0.000        41.149    44.575
LSTAT            -2.3328      0.124    -18.843      0.000        -2.576    -2.090
I(LSTAT ** 2)     0.0435      0.004     11.628      0.000         0.036     0.051
=================================================================================
Omnibus:                      107.006   Durbin-Watson:                   0.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              228.388
Skew:                           1.128   Prob(JB):                     2.55e-50
Kurtosis:                       5.397   Cond. No.                     1.13e+03
=================================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The near-zero p-value for the quadratic term suggests an improved model. We will
plot the fit and perform some diagnostics.
# Refit the quadratic model summarized above and overlay it on the data
import statsmodels.formula.api as smf
quadratic_results = smf.ols('MEDV ~ LSTAT + I(LSTAT ** 2)', data=boston_df).fit()
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(boston_df.LSTAT.values, boston_df.MEDV.values, facecolors='none', edgecolors='b', label='data');
order = np.argsort(boston_df.LSTAT.values)
ax.plot(boston_df.LSTAT.values[order], quadratic_results.fittedvalues.values[order], 'k-', label='quadratic fit');
ax.legend(loc='best');
plt.xlabel('LSTAT');
plt.ylabel('MEDV');
Diagnostic tests of quadratic estimate
# import anova function
from statsmodels.stats.api import anova_lm
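anova_lm compares two nested fits with an F-test on the extra terms. A self-contained sketch on synthetic data (in the lab this would be called on the linear and quadratic Boston fits; names here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.api import anova_lm

# Synthetic data with real curvature, so the quadratic term matters
rng = np.random.RandomState(4)
df = pd.DataFrame({'x': rng.uniform(0, 10, 200)})
df['y'] = 1.0 + 2.0 * df.x - 0.3 * df.x ** 2 + rng.normal(size=200)

linear = smf.ols('y ~ x', data=df).fit()
quadratic = smf.ols('y ~ x + I(x ** 2)', data=df).fit()

# Row 1 carries the F-statistic and p-value for adding the quadratic term
table = anova_lm(linear, quadratic)
print(table)
```

A small p-value in the second row says the larger model fits significantly better than the nested one.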
# Plot the residuals against the fitted values for the linear model
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
linear_fit_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values
ax1.scatter(linear_fit_values, residuals, facecolors='none', edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
ax1.set_title('Linear Model Residuals');
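The next summary is for a fifth-order polynomial in LSTAT; the missing cell presumably built the formula from I(LSTAT ** k) terms. One way to construct such a formula programmatically, shown self-contained on synthetic data (illustrative names):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic frame standing in for the Boston data
rng = np.random.RandomState(5)
df = pd.DataFrame({'LSTAT': rng.uniform(1, 35, 300)})
df['MEDV'] = 40 - 1.5 * df.LSTAT + rng.normal(scale=3, size=300)

# Assemble 'MEDV ~ I(LSTAT ** 1) + ... + I(LSTAT ** 5)' as a string
terms = ' + '.join('I(LSTAT ** {})'.format(k) for k in range(1, 6))
poly_results = smf.ols('MEDV ~ ' + terms, data=df).fit()
print(len(poly_results.params))  # intercept plus five polynomial terms -> 6
```

Note the very large condition number reported below: raw polynomial columns are highly collinear, which is why orthogonal polynomials are often preferred.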
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     214.2
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          8.73e-122
Time:                        10:57:12   Log-Likelihood:                -1550.6
No. Observations:                 506   AIC:                             3113.
Df Residuals:                     500   BIC:                             3139.
Df Model:                           5
=================================================================================
                    coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
Intercept        67.6997      3.604     18.783      0.000        60.618    74.781
I(LSTAT ** 1)   -11.9911      1.526     -7.859      0.000       -14.989    -8.994
I(LSTAT ** 2)     1.2728      0.223      5.703      0.000         0.834     1.711
I(LSTAT ** 3)    -0.0683      0.014     -4.747      0.000        -0.097    -0.040
I(LSTAT ** 4)     0.0017      0.000      4.143      0.000         0.001     0.003
I(LSTAT ** 5)  -1.632e-05   4.42e-06    -3.692      0.000      -2.5e-05 -7.63e-06
=================================================================================
Omnibus:                      144.085   Durbin-Watson:                   0.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              494.545
Skew:                           1.292   Prob(JB):                    4.08e-108
Kurtosis:                       7.096   Cond. No.                     1.37e+08
=================================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+08. This might indicate that there are strong multicollinearity or other numerical problems.
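3.6.6 Qualitative Predictors

Here the lab switches to the Carseats data (the load cell, presumably reading ISLR's Carseats.csv, is missing from this conversion); the head() fragment below shows a few of its qualitative columns. Patsy dummy-codes string columns automatically, which is where coefficient names like ShelveLoc[T.Good] come from. A self-contained sketch on a synthetic frame (illustrative data and coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for Carseats with one categorical predictor
rng = np.random.RandomState(6)
n = 300
df = pd.DataFrame({'ShelveLoc': rng.choice(['Bad', 'Good', 'Medium'], size=n),
                   'Price': rng.uniform(50, 150, size=n)})
effect = df.ShelveLoc.map({'Bad': 0.0, 'Good': 4.8, 'Medium': 1.9})
df['Sales'] = 7.0 + effect - 0.05 * df.Price + rng.normal(size=n)

# Patsy takes the first level alphabetically ('Bad') as the baseline, so
# ShelveLoc[T.Good] and ShelveLoc[T.Medium] are offsets relative to 'Bad'
results = smf.ols('Sales ~ ShelveLoc + Price', data=df).fit()
print(results.params.index.tolist())
```

The T. in the names marks treatment (dummy) coding relative to the omitted baseline level.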
   Education Urban   US
1         17   Yes  Yes
2         10   Yes  Yes
3         12   Yes  Yes
4         14   Yes  Yes
5         13   Yes   No
Sales ~ CompPrice + Income + Advertising + Population + Price + ShelveLoc
        + Age + Education + Urban + US + Income:Advertising + Price:Age
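The formula above mixes main effects with ':' interactions. Unlike '*', the ':' operator adds only the product term itself. A self-contained sketch (synthetic frame, illustrative coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Carseats Income/Advertising interaction
rng = np.random.RandomState(7)
df = pd.DataFrame({'Income': rng.uniform(20, 120, 200),
                   'Advertising': rng.uniform(0, 30, 200)})
df['Sales'] = (5 + 0.01 * df.Income + 0.1 * df.Advertising
               + 0.001 * df.Income * df.Advertising + rng.normal(size=200))

# ':' contributes just Income:Advertising; the main effects must be
# listed separately, as they are in the lab's formula
results = smf.ols('Sales ~ Income + Advertising + Income:Advertising',
                  data=df).fit()
print(results.params.index.tolist())
```

Writing 'Income * Advertising' instead would produce the same four-term design here, since it expands to the main effects plus the interaction.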
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     210.0
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.14e-166
Time:                        14:38:28   Log-Likelihood:                -564.67
No. Observations:                 400   AIC:                             1157.
Df Residuals:                     386   BIC:                             1213.
Df Model:                          13
=======================================================================================
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept               6.5756      1.009      6.519      0.000         4.592     8.559
ShelveLoc[T.Good]       4.8487      0.153     31.724      0.000         4.548     5.149
ShelveLoc[T.Medium]     1.9533      0.126     15.531      0.000         1.706     2.201
Urban[T.Yes]            0.1402      0.112      1.247      0.213        -0.081     0.361
US[T.Yes]              -0.1576      0.149     -1.058      0.291        -0.450     0.135
CompPrice               0.0929      0.004     22.567      0.000         0.085     0.101
Income                  0.0109      0.003      4.183      0.000         0.006     0.016
Advertising             0.0702      0.023      3.107      0.002         0.026     0.115
Population              0.0002      0.000      0.433      0.665        -0.001     0.001
Price                  -0.1008      0.007    -13.549      0.000        -0.115    -0.086
Age                    -0.0579      0.016     -3.633      0.000        -0.089    -0.027
Education              -0.0209      0.020     -1.063      0.288        -0.059     0.018
Income:Advertising      0.0008      0.000      2.698      0.007         0.000     0.001
Price:Age               0.0001      0.000      0.801      0.424        -0.000     0.000
=======================================================================================
Omnibus:                        1.281   Durbin-Watson:                   2.047
Prob(Omnibus):                  0.527   Jarque-Bera (JB):                1.147
Skew:                           0.129   Prob(JB):                        0.564
Kurtosis:                       3.050   Cond. No.                     1.31e+05
=======================================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+05. This might indicate that there are strong multicollinearity or other numerical problems.