
ISLR-Python: Chapter 3 -- Lab Linear Regression
• Load Datasets
• 3.6.2 Simple Linear Regression
• 3.6.3 Multiple Linear Regression
• 3.6.4 Interaction Terms
• 3.6.5 Non-linear Transformations of the Predictors
• 3.6.6 Qualitative Predictors
## perform imports and set-up
import numpy as np
import pandas as pd
import statsmodels.api as sm

from scipy import stats


from matplotlib import pyplot as plt
from sklearn.datasets import load_boston  # the boston data set is part of sklearn
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline
plt.style.use('ggplot') # emulate pretty r-style plots

Load Datasets
# Load Boston housing data set
boston = load_boston()

# Transform the data into a dataframe for analysis

# combine the predictors and responses for a dataframe


predictors = boston.data
response = boston.target
boston_data = np.column_stack([predictors,response])

# now get the column names of the data frame


col_names = np.append(boston.feature_names, 'MEDV')

# create the data frame


boston_df = pd.DataFrame(boston_data, columns = col_names)
boston_df.head()
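Note: load_boston was removed from scikit-learn in version 1.2. If the import above fails, a commonly suggested fallback is to read the data directly from the original CMU StatLib source; the sketch below is an assumption-laden alternative (it assumes that mirror is still reachable, and the feature names would have to be listed by hand).

# fallback sketch: rebuild the same predictor/response arrays without load_boston
raw = pd.read_csv('http://lib.stat.cmu.edu/datasets/boston', sep=r'\s+', skiprows=22, header=None)
boston_predictors = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])  # the 13 predictor columns
boston_response = raw.values[1::2, 2]                                      # MEDV
# the feature names (CRIM, ZN, ..., LSTAT) are not in the raw file and must be supplied manually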

# Load the carseats data set


carseats_df = pd.read_csv('../data/Carseats.csv',index_col = 0)
3.6.2 Simple Linear Regression
We will regress MEDV (median home value) onto LSTAT (the percentage of the population with lower socioeconomic status). We will do this using two methods -- scipy.stats.linregress and statsmodels -- and look at several diagnostics that describe the quality of the fit.

Regress MEDV onto LSTAT (scipy)


# Create a figure to plot our data and OLS estimate.
fig, ax = plt.subplots(figsize=(8,6))

ax.scatter(boston_df.LSTAT.values,
boston_df.MEDV.values,facecolors='none', edgecolors='b',\
label="data");
ax.set_ylabel('MEDV');
ax.set_xlabel('LSTAT');

# call scipy's linregress, returning fit coefficients and simple statistics
beta1, beta0, r_value, p_value, stderr = stats.linregress(boston_df.LSTAT.values,
                                                          boston_df.MEDV.values)

# add the estimation to the data plot
ax.plot(boston_df.LSTAT.values, beta0 + beta1*boston_df.LSTAT.values,
        color='r', label="OLS");
ax.legend(loc='best');

# print the regression estimates returned from scipy


print('beta= [',round(beta0,3),',', round(beta1,3),']')
print('R={0:.3f}, p_value={1:.3f}, stderr={2:.3f}'.format(r_value, p_value, stderr))

beta= [ 34.554 , -0.95 ]
R=-0.738, p_value=0.000, stderr=0.039
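As a quick consistency check, the squared correlation returned by scipy, (-0.738)^2 ≈ 0.545, matches (up to rounding) the R-squared of 0.544 reported by statsmodels for the same fit below.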
Regress MEDV onto LSTAT (statsmodels)
# Another method is to use the package statsmodels.

# Create a design matrix


# set the independent variable as the LSTAT
X = boston_df.LSTAT
# We add a constant for the intercept term
X = sm.add_constant(X)

# set the dependent variable


Y = boston_df.MEDV

# create the model instance and fit


linear_model = sm.OLS(Y,X)
linear_results = linear_model.fit()
# data about the model is stored in summary
print(linear_results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.543
Method:                 Least Squares   F-statistic:                     601.6
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           5.08e-88
Time:                        10:05:12   Log-Likelihood:                -1641.5
No. Observations:                 506   AIC:                             3287.
Df Residuals:                     504   BIC:                             3295.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         34.5538      0.563     61.415      0.000      33.448    35.659
LSTAT         -0.9500      0.039    -24.528      0.000      -1.026    -0.874
==============================================================================
Omnibus:                      137.043   Durbin-Watson:                   0.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              291.373
Skew:                           1.453   Prob(JB):                     5.36e-64
Kurtosis:                       5.319   Cond. No.                         29.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

# statsmodels includes functions for getting the prediction and confidence intervals.

# Get the prediction interval for each x
prstd, iv_l, iv_u = wls_prediction_std(linear_results)

# statsmodels also includes the confidence intervals for the fitted values (i.e. the mean fits),
# but they are buried in the results of the summary_table function.
from statsmodels.stats.outliers_influence import summary_table

# call the summary table at a (1-alpha)*100% confidence interval level
simpleTable, data, column_names = summary_table(linear_results, alpha=0.05)

# data contains the confidence intervals we want, but we need to make sure
# we get the right columns, so let's print off the names
print('column_names: ', column_names)

# Get confidence intervals for a given x


predicted_mean_ci_low, predicted_mean_ci_high = data[:,4:6].T

column_names: ['Obs', 'Dep Var\nPopulation', 'Predicted\nValue', 'Std Error\nMean Predict',
'Mean ci\n95% low', 'Mean ci\n95% upp', 'Predict ci\n95% low', 'Predict ci\n95% upp',
'Residual', 'Std Error\nResidual', 'Student\nResidual', "Cook's\nD"]
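Newer versions of statsmodels expose the same intervals more directly through the fitted results object; a small sketch (assuming a reasonably recent statsmodels) using get_prediction:

# mean confidence intervals and prediction intervals straight from the results object
pred_frame = linear_results.get_prediction().summary_frame(alpha=0.05)
# 'mean_ci_lower'/'mean_ci_upper' bound the mean fit; 'obs_ci_lower'/'obs_ci_upper' bound a new observation
print(pred_frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']].head())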

# Create a plot showing the data, the OLS estimate, and the prediction and confidence intervals
fig, ax = plt.subplots(figsize=(8,6))

# get numpy array values from dataframe


x = boston_df.LSTAT.values
y = boston_df.MEDV.values

# Plot the data


ax.scatter(x, y, facecolors='none', edgecolors='b', label="data")
# plot the models fitted values
ax.plot(x, linear_results.fittedvalues, 'g', label="OLS")

# plot the high and low prediction intervals


ax.plot(x, iv_u, color='0.75',label="Prediction Interval")
ax.plot(x, iv_l, color='0.75')

# plot the high and low mean confidence intervals


ax.plot(x,predicted_mean_ci_low, 'r', label="Predicted Mean CI")
ax.plot(x,predicted_mean_ci_high,'r')

ax.legend(loc='best');

plt.xlabel('LSTAT');
plt.ylabel('MEDV');
Diagnostic Plots for Linear Model
# Create plots of residuals
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

# RESIDUALS
# The results contain the residuals
fitted_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values

# Plot the residual for each fitted value


ax1.scatter(fitted_values, residuals, facecolors='none',
edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
# The residual plot indicates significant non-linearity (a U-shaped pattern is clear)

# STUDENTIZED RESIDUALS
# To assess outliers we will look at the studentized residuals. These are in the data array
# returned from summary_table (column index 10)
studentized_residuals = data[:,10]

# Plot the studentized residuals


ax2.scatter(fitted_values,studentized_residuals, facecolors='none',
edgecolors='b');
ax2.set_ylabel('Studentized Residuals');
ax2.set_xlabel('fitted values');

# |studentized residual| > 3 are generally considered outliers

# We can also examine the leverages to identify points that may unduly alter the regression line
from statsmodels.stats.outliers_influence import OLSInfluence
leverage = OLSInfluence(linear_results).hat_matrix_diag  # leverage = diagonal of the hat matrix

fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(leverage, studentized_residuals,facecolors='none',
edgecolors='b');
ax.set_xlabel('Leverage');
ax.set_ylabel('Studentized Residuals');
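As a small sketch using the arrays computed above, the rule-of-thumb cutoffs can be applied directly (the 2*(p+1)/n leverage threshold is a common convention and not part of the original lab):

# flag outliers (|studentized residual| > 3) and high-leverage points (leverage > 2*(p+1)/n)
n, p = len(boston_df), 1   # p = number of predictors in the simple regression
outlier_idx = np.where(np.abs(studentized_residuals) > 3)[0]
high_leverage_idx = np.where(leverage > 2 * (p + 1) / n)[0]
print('outliers:', outlier_idx)
print('high leverage points:', high_leverage_idx)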
3.6.3 Multiple Linear Regression
Here we will estimate MEDV using multiple linear regression. In the first example we will regress MEDV onto the predictors LSTAT and AGE.
# create our design matrix using LSTAT and AGE predictors
X = sm.add_constant(boston_df[['LSTAT','AGE']])

# set the dependent variable


Y = boston_df.MEDV

fig = plt.figure(figsize=(12, 8))


ax = fig.add_subplot(111, projection='3d',azim=-60, elev=5)

# create the model instance and estimate


model = sm.OLS(Y,X)
estimate = model.fit()
# data about the model is stored in summary
print('Model parameters:\n', estimate.params[:])

# Plot the data


ax.scatter(X.loc[:,'LSTAT'], X.loc[:,'AGE'], Y.values,
facecolors=(0,0,0,0),\
edgecolor='k', depthshade=True);
ax.set_xlabel('LSTAT');
ax.set_ylabel('AGE')
ax.set_zlabel('MEDV')

# Plot the OLS estimate


# create a grid of points
xx1, xx2 = np.meshgrid(np.linspace(X.LSTAT.min(), X.LSTAT.max(), 100),
                       np.linspace(X.AGE.min(), X.AGE.max(), 100))

# plot the plane by evaluating the parameters over the grid


Z = estimate.params[0] + estimate.params[1] * xx1 + estimate.params[2] * xx2
# plot plane
surf = ax.plot_surface(xx1, xx2, Z, cmap=plt.cm.RdBu_r, alpha=0.75,
linewidth=0)

Model parameters:
const 33.222761
LSTAT -1.032069
AGE 0.034544
dtype: float64
Now we will perform the regression over all 13 predictors in the boston housing
dataset.
# create our design matrix using all the predictors (the last column of boston_df is MEDV)
X = sm.add_constant(boston_df.iloc[:,0:-1])

# create the model instance and estimate


model = sm.OLS(Y,X)
est = model.fit()
# data about the model is stored in summary
print(est.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     108.1
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.95e-135
Time:                        10:05:13   Log-Likelihood:                -1498.8
No. Observations:                 506   AIC:                             3026.
Df Residuals:                     492   BIC:                             3085.
Df Model:                          13
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         36.4911      5.104      7.149      0.000      26.462    46.520
CRIM          -0.1072      0.033     -3.276      0.001      -0.171    -0.043
ZN             0.0464      0.014      3.380      0.001       0.019     0.073
INDUS          0.0209      0.061      0.339      0.735      -0.100     0.142
CHAS           2.6886      0.862      3.120      0.002       0.996     4.381
NOX          -17.7958      3.821     -4.658      0.000     -25.302   -10.289
RM             3.8048      0.418      9.102      0.000       2.983     4.626
AGE            0.0008      0.013      0.057      0.955      -0.025     0.027
DIS           -1.4758      0.199     -7.398      0.000      -1.868    -1.084
RAD            0.3057      0.066      4.608      0.000       0.175     0.436
TAX           -0.0123      0.004     -3.278      0.001      -0.020    -0.005
PTRATIO       -0.9535      0.131     -7.287      0.000      -1.211    -0.696
B              0.0094      0.003      3.500      0.001       0.004     0.015
LSTAT         -0.5255      0.051    -10.366      0.000      -0.625    -0.426
==============================================================================
Omnibus:                      178.029   Durbin-Watson:                   1.078
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              782.015
Skew:                           1.521   Prob(JB):                    1.54e-170
Kurtosis:                       8.276   Cond. No.                     1.51e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

# Compute all the variance inflation factors
from statsmodels.stats.outliers_influence import variance_inflation_factor

VIFs = [(predictor, variance_inflation_factor(X.values, i))
        for i, predictor in enumerate(list(X))]  # list(X) returns the column names of the dataframe

print('Variance Inflation Factors')


for tup in VIFs:
print('{:10}'.format(tup[0]), '{:.3f}'.format(tup[1]))

Variance Inflation Factors


const 585.425
CRIM 1.773
ZN 2.299
INDUS 3.991
CHAS 1.074
NOX 4.395
RM 1.934
AGE 3.101
DIS 3.957
RAD 7.481
TAX 9.008
PTRATIO 1.799
B 1.346
LSTAT 2.938
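AGE has by far the largest p-value in the full fit (0.955), so one may wish to refit without it, as the ISLR R lab does. A minimal sketch using the formula interface (the same smf module imported again in the next section):

# sketch: refit on all predictors except AGE
import statsmodels.formula.api as smf
predictors_no_age = [c for c in boston_df.columns if c not in ('MEDV', 'AGE')]
est_no_age = smf.ols('MEDV ~ ' + ' + '.join(predictors_no_age), data=boston_df).fit()
print(est_no_age.params)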

3.6.4 Interaction Terms


Statsmodels uses the patsy package to convert formulas to matrices for fitting. This
allows for easy implementation of arbitrary functions of the predictors.
# import the statsmodels formula API (which uses patsy under the hood)
import statsmodels.formula.api as smf
# Construct model and fit
model = smf.ols(formula='MEDV ~ LSTAT*AGE', data=boston_df)
estimate = model.fit()
print(estimate.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.556
Model:                            OLS   Adj. R-squared:                  0.553
Method:                 Least Squares   F-statistic:                     209.3
Date:                Fri, 24 Jun 2016   Prob (F-statistic):           4.86e-88
Time:                        10:05:13   Log-Likelihood:                -1635.0
No. Observations:                 506   AIC:                             3278.
Df Residuals:                     502   BIC:                             3295.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     36.0885      1.470     24.553      0.000      33.201    38.976
LSTAT         -1.3921      0.167     -8.313      0.000      -1.721    -1.063
AGE           -0.0007      0.020     -0.036      0.971      -0.040     0.038
LSTAT:AGE      0.0042      0.002      2.244      0.025       0.001     0.008
==============================================================================
Omnibus:                      135.601   Durbin-Watson:                   0.965
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              296.955
Skew:                           1.417   Prob(JB):                     3.29e-65
Kurtosis:                       5.461   Cond. No.                     6.88e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.88e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

3.6.5 Non-linear Transformations of the Predictors


The statsmodels formula API (via patsy) also allows us to include non-linear terms such as I(LSTAT**2).
model = smf.ols('MEDV ~ LSTAT + I(LSTAT**2)', data=boston_df)
quadratic_results = model.fit()
print(quadratic_results.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.641
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     448.5
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          1.56e-112
Time:                        10:05:13   Log-Likelihood:                -1581.3
No. Observations:                 506   AIC:                             3169.
Df Residuals:                     503   BIC:                             3181.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                    coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept        42.8620      0.872     49.149      0.000      41.149    44.575
LSTAT            -2.3328      0.124    -18.843      0.000      -2.576    -2.090
I(LSTAT ** 2)     0.0435      0.004     11.628      0.000       0.036     0.051
==============================================================================
Omnibus:                      107.006   Durbin-Watson:                   0.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              228.388
Skew:                           1.128   Prob(JB):                     2.55e-50
Kurtosis:                       5.397   Cond. No.                     1.13e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The near-zero p-value for the quadratic term suggests an improved model. We will plot the fit and perform some diagnostics.
fig, ax = plt.subplots(figsize=(8,6))

# get numpy array values from dataframe


x = boston_df.LSTAT.values
y = boston_df.MEDV.values

# Plot the data


ax.scatter(x, y, facecolors='none', edgecolors='b', label="data");
# plot the models fitted values
ax.plot(x, quadratic_results.fittedvalues, 'g',
marker='o',linestyle='none', label="OLS");

ax.legend(loc='best');

plt.xlabel('LSTAT');
plt.ylabel('MEDV');
Diagnostic tests of quadratic estimate
# import anova function
from statsmodels.stats.api import anova_lm

# perform the hypothesis test comparing the nested models
# (see the regression section of https://en.wikipedia.org/wiki/F-test)
anova_table = anova_lm(linear_results, quadratic_results)
print(anova_table)

   df_resid           ssr  df_diff     ss_diff           F        Pr(>F)
0       504  19472.381418        0         NaN         NaN           NaN
1       503  15347.243158        1  4125.13826  135.199822  7.630116e-28
The F-statistic is 135 with a p-value of ~0, indicating a large difference in the unexplained variance of the two models. This is not too surprising given the plot above. Now consider the residuals.
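As a sanity check, the F-statistic can be recomputed by hand from the residual sums of squares in the ANOVA table above:

# F = ((RSS_linear - RSS_quadratic) / df_diff) / (RSS_quadratic / df_resid_quadratic)
rss_linear, rss_quadratic = 19472.381418, 15347.243158
F = ((rss_linear - rss_quadratic) / 1) / (rss_quadratic / 503)
print(F)   # ~135.2, matching anova_lm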
# Create plots of residuals
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,6))

# RESIDUALS OF LINEAR MODEL


# The results contain the residuals
linear_fit_values = linear_results.fittedvalues.values
residuals = linear_results.resid.values

# Plot the residual for each fitted value for the linear model
ax1.scatter(linear_fit_values, residuals, facecolors='none',
edgecolors='b');
ax1.set_xlabel('fitted values');
ax1.set_ylabel('residuals');
ax1.set_title('Linear Model Residuals')

# RESIDUALS OF QUADRATIC MODEL


# The results contain the residuals
quadratic_fit_values = quadratic_results.fittedvalues.values
quadratic_residuals = quadratic_results.resid.values
ax2.scatter(quadratic_fit_values, quadratic_residuals,
facecolors='none', edgecolors='b');
ax2.set_title('Quadratic Model Residuals');
We can also try higher order polynomial fits:
formula = 'MEDV ~ 1 + ' + ' + '.join('I(LSTAT**{})'.format(i) for i in
range(1, 6))
print(formula)
model = smf.ols(formula, data=boston_df)
order_5_results = model.fit()
print(order_5_results.summary())

MEDV ~ 1 + I(LSTAT**1) + I(LSTAT**2) + I(LSTAT**3) + I(LSTAT**4) + I(LSTAT**5)
                            OLS Regression Results
==============================================================================
Dep. Variable:                   MEDV   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     214.2
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          8.73e-122
Time:                        10:57:12   Log-Likelihood:                -1550.6
No. Observations:                 506   AIC:                             3113.
Df Residuals:                     500   BIC:                             3139.
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                    coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept        67.6997      3.604     18.783      0.000      60.618    74.781
I(LSTAT ** 1)   -11.9911      1.526     -7.859      0.000     -14.989    -8.994
I(LSTAT ** 2)     1.2728      0.223      5.703      0.000       0.834     1.711
I(LSTAT ** 3)    -0.0683      0.014     -4.747      0.000      -0.097    -0.040
I(LSTAT ** 4)     0.0017      0.000      4.143      0.000       0.001     0.003
I(LSTAT ** 5)  -1.632e-05   4.42e-06     -3.692      0.000    -2.5e-05 -7.63e-06
==============================================================================
Omnibus:                      144.085   Durbin-Watson:                   0.987
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              494.545
Skew:                           1.292   Prob(JB):                    4.08e-108
Kurtosis:                       7.096   Cond. No.                     1.37e+08
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

3.6.6 Qualitative Predictors


# Examine the head of the carseats data set
carseats_df.head()

   Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  Education Urban   US
1   9.50        138      73           11         276    120       Bad   42         17   Yes  Yes
2  11.22        111      48           16         260     83      Good   65         10   Yes  Yes
3  10.06        113      35           10         269     80    Medium   59         12   Yes  Yes
4   7.40        117     100            4         466     97    Medium   55         14   Yes  Yes
5   4.15        141      64            3         340    128       Bad   38         13   Yes   No

# Construct the formula with two interaction terms


formula = 'Sales ~' + '+'.join(carseats_df.iloc[:, 1:].columns.tolist() +
                               ['Income:Advertising'] + ['Price:Age'])
print(formula)
print()

model = smf.ols(formula, data=carseats_df)


carseat_results = model.fit()
print(carseat_results.summary())

Sales ~CompPrice+Income+Advertising+Population+Price+ShelveLoc+Age+Education+Urban+US+Income:Advertising+Price:Age

                            OLS Regression Results
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.876
Model:                            OLS   Adj. R-squared:                  0.872
Method:                 Least Squares   F-statistic:                     210.0
Date:                Fri, 24 Jun 2016   Prob (F-statistic):          6.14e-166
Time:                        14:38:28   Log-Likelihood:                -564.67
No. Observations:                 400   AIC:                             1157.
Df Residuals:                     386   BIC:                             1213.
Df Model:                          13
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|    [95.0% Conf. Int.]
---------------------------------------------------------------------------------------
Intercept               6.5756      1.009      6.519      0.000       4.592     8.559
ShelveLoc[T.Good]       4.8487      0.153     31.724      0.000       4.548     5.149
ShelveLoc[T.Medium]     1.9533      0.126     15.531      0.000       1.706     2.201
Urban[T.Yes]            0.1402      0.112      1.247      0.213      -0.081     0.361
US[T.Yes]              -0.1576      0.149     -1.058      0.291      -0.450     0.135
CompPrice               0.0929      0.004     22.567      0.000       0.085     0.101
Income                  0.0109      0.003      4.183      0.000       0.006     0.016
Advertising             0.0702      0.023      3.107      0.002       0.026     0.115
Population              0.0002      0.000      0.433      0.665      -0.001     0.001
Price                  -0.1008      0.007    -13.549      0.000      -0.115    -0.086
Age                    -0.0579      0.016     -3.633      0.000      -0.089    -0.027
Education              -0.0209      0.020     -1.063      0.288      -0.059     0.018
Income:Advertising      0.0008      0.000      2.698      0.007       0.000     0.001
Price:Age               0.0001      0.000      0.801      0.424      -0.000     0.000
=======================================================================================
Omnibus:                        1.281   Durbin-Watson:                   2.047
Prob(Omnibus):                  0.527   Jarque-Bera (JB):                1.147
Skew:                           0.129   Prob(JB):                        0.564
Kurtosis:                       3.050   Cond. No.                     1.31e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.31e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

# The default coding for categoricals in patsy is treatment (dummy) coding with the first
# level as the reference; in this case 'Bad' is the baseline shelf location, so its dummy is 0.
# More contrast codings can be found in the patsy documentation, e.g.
# from patsy.contrasts import Treatment
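A minimal sketch of choosing a different reference level directly in the formula (using 'Good' as the baseline here is purely illustrative and not part of the original lab):

# refit a small model with 'Good' as the reference level for ShelveLoc
model_ref = smf.ols("Sales ~ C(ShelveLoc, Treatment(reference='Good')) + Price",
                    data=carseats_df)
print(model_ref.fit().params)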
