
Tutorial 4 - Multiple Linear Regression


The brake horsepower developed by an automobile engine on a dynamometer is thought to be a
function of the engine speed in revolutions per minute (rpm), the road octane number of the fuel,
and the engine compression. An experiment is run and the data can be found in "Tutorial 4
data.xlsx"

In [31]: #libraries
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
import pandas as pd
import scipy.stats   # import the stats submodule explicitly so scipy.stats.f is available

Part 1 - Multiple Linear Regression on Dataset


Below, the data has been imported and separated into independent (x) and dependent variables
(y).

Create the data matrix X.

First create a vector of ones using np.ones((rows,columns)).

Next, combine that vector with x using np.hstack((vector,x)). This built-in function stacks your two
matrices horizontally as long as they have the same number of rows. Print the result to verify your
data matrix.

In [15]: df = pd.read_excel("Tutorial 4 data.xlsx").to_numpy()



#if your dataset begins in row 1 of the spreadsheet, the titles become headers and the data starts at index 0
x=df[0:12,2:5]
y=df[0:12,1]

#make a vector of ones and then the data matrix X

xvectors=np.ones((12,1))
xvector=np.hstack((xvectors,x))

Solve for the parameters of the multiple regression model and print the result.

$\beta = (X^T X)^{-1} X^T y$
Transpose of matrix X: X.T
Multiply matrices together: X@y
Inverse of a matrix: np.linalg.inv()

In [16]: #solve for beta and print the values here


beta=np.linalg.inv(xvector.T@xvector)@xvector.T@y
print(beta)

[-2.66031212e+02 1.07132079e-02 3.13480626e+00 1.86740943e+00]
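As a quick sanity check (not part of the tutorial prompt), NumPy's least-squares solver should recover the same parameters; a minimal sketch, assuming xvector and y from the cells above:

# check the normal-equation solution with a direct least-squares solve
beta_check, _, _, _ = np.linalg.lstsq(xvector, y, rcond=None)
print(beta_check)   # should match beta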

Part 2 - Residual Analysis


Use the parameters found in Part 1 to solve for $\hat{y}$:

$\hat{y} = X\beta$

Next, solve for the residuals and plot your results on subplots:

$e = y - \hat{y}$

The first subplot is set up for you below. We want to plot the residuals vs. the independent and
dependent variables (4 plots total), so we make a 2 x 2 grid of subplots. The layout='constrained'
option provides space to label each axis.

In [18]: # solve for yhat and residuals here


yhat=xvector@beta
residuals=y-yhat
print(yhat)
print(residuals)

#this is code for making the subplot and plotting the first residual plot.
fig,axs=plt.subplots(2,2,layout='constrained')
axs[0,0].scatter(x[:,0],residuals)
axs[0,0].set(xlabel='rpm',ylabel='residuals')

axs[0,1].scatter(x[:,1],residuals)
axs[0,1].set(xlabel='Road Octane Number',ylabel='residuals')

axs[1,0].scatter(x[:,2],residuals)
axs[1,0].set(xlabel='Compression',ylabel='residuals')

axs[1,1].scatter(y,residuals)
axs[1,1].set(xlabel='Brake Horsepower',ylabel='residuals')
plt.show()

[224.26871063 225.32824692 240.95847559 218.86255836 207.44420245
 267.10824644 243.78632464 237.12456005 235.90671156 221.12574828
 222.12606905 233.96014603]
[ 0.73128937 -13.32824692 -11.95847559 3.13744164 11.55579755
10.89175356 2.21367536 -0.12456005 -2.90671156 2.87425172
0.87393095 -3.96014603]

Are there any patterns in the residuals? Are there any outliers? Is there a better plot we can use to
determine this?

YES! Calculate the standardized residuals and determine if there are any outliers.

$d_i = \frac{e_i}{\sqrt{MSE}}$

In [25]: #Calculate standardized residuals here


SSE=np.sum((y-yhat)**2)      # sum of squared errors
print(SSE)
MSE=SSE/(12-4)               # n - p = 12 observations - 4 parameters
print(MSE)
sqrtMSE=np.sqrt(MSE)
print(sqrtMSE)
di=residuals/sqrtMSE         # standardized residuals
print(di)

#You can copy your subplot from a previous section and quickly modify it to plot the standardized residuals.
fig,axs=plt.subplots(2,2,layout='constrained')
axs[0,0].scatter(x[:,0],di)
axs[0,0].set(xlabel='rpm',ylabel='Standardized Residuals')

axs[0,1].scatter(x[:,1],di)
axs[0,1].set(xlabel='Road Octane Number',ylabel='Standardized Residuals')

axs[1,0].scatter(x[:,2],di)
axs[1,0].set(xlabel='Compression',ylabel='Standardized Residuals')

axs[1,1].scatter(y,di)
axs[1,1].set(xlabel='Brake Horsepower',ylabel='Standardized Residuals')
plt.show()

621.2650617774844
77.65813272218556
8.812385189163349
[ 0.08298427 -1.51244489 -1.35700782 0.35602638 1.31131326 1.23595977
0.25120048 -0.01413466 -0.32984391 0.32616047 0.09917076 -0.44938413]
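If you prefer a programmatic check to reading the plots, a minimal sketch (assuming di holds the standardized residuals computed above) is to flag any observation with |d_i| > 3:

# flag observations whose standardized residual exceeds 3 in magnitude
outliers = np.where(np.abs(di) > 3)[0]
print(outliers)   # an empty array means no outliers by the |d_i| > 3 rule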

Based on the figures above, there are no outliers, since no observation has $|d_i| > 3$.

Part 3 - Model Analysis


In [27]: ybar=np.mean(y)
SSR=np.sum((yhat-ybar)**2)    # regression sum of squares
MSR=SSR/(4-1)                 # p - 1 = 3 regression degrees of freedom
F=MSR/MSE
print(F)

table=[["Regression",3,SSR,MSR,F],["Residual Error",8,SSE,MSE],["Total",11,SSR+SSE]]
col2names=['DF','SS','MS','F']
print(tabulate(table,headers=col2names))

11.11596363636129
DF SS MS F
-------------- ---- -------- -------- ------
Regression 3 2589.73 863.245 11.116
Residual Error 8 621.265 77.6581
Total 11 3211
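Although the tutorial does not ask for it, the same sums of squares give the coefficient of determination; a minimal sketch, assuming SSR and SSE from the cells above:

# R^2 and adjusted R^2 from the ANOVA sums of squares
SST = SSR + SSE
R2 = SSR / SST
R2_adj = 1 - (SSE / (12 - 4)) / (SST / (12 - 1))
print(R2, R2_adj)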

Calculate the p-value for your ANOVA table and print the result below. You can solve for a p-value
using the built-in function

p_value = 1 - scipy.stats.f.cdf(F, dfn, dfd)   # F value, DOF in numerator, DOF in denominator

In [33]: #solve for the p-value here


p_value=1-(scipy.stats.f.cdf(F,3,8))
print(p_value)


0.0031699790971878583

What is your conclusion from your ANOVA Table and p-value?

At this point, have you determined that each regressor variable is significant in the model?

In [ ]: # The p-value (~0.0032) is well below 0.05, so from the ANOVA table we conclude the
# regression is significant overall. This does not mean each regressor variable is
# significant; individual t-tests on the coefficients are needed to determine that
# (a sketch follows below).
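Such a check is not part of the tutorial, but a minimal sketch of the per-coefficient t-tests, assuming beta, MSE, and xvector from the cells above (the term names are illustrative):

# per-coefficient t-tests: se(beta_j) = sqrt(MSE * C_jj), where C = (X^T X)^{-1}
C = np.linalg.inv(xvector.T @ xvector)
se = np.sqrt(MSE * np.diag(C))
t_stats = beta / se
p_vals = 2 * (1 - scipy.stats.t.cdf(np.abs(t_stats), 12 - 4))   # n - p = 8 DOF
print(tabulate(zip(['Intercept', 'rpm', 'Octane', 'Compression'], beta, t_stats, p_vals),
               headers=['Term', 'Coefficient', 't', 'p-value']))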
