LGT2425 Lecture 3 Part II (Notes)

This document provides an overview of multiple linear regression analysis. It discusses estimating a multiple regression model using the least squares method to minimize the sum of squared errors. It also covers calculating R-squared, adjusted R-squared, and using Excel's regression tool to estimate a multiple regression model using data from Butler Trucking Company on miles traveled, deliveries, and travel time. Finally, it discusses checking the assumptions of a multiple regression model by examining residual plots.


LGT2425 INTRODUCTION

TO BUSINESS ANALYTICS
Lecture 3: Linear Regression (Part II)

1
Multiple regression model
■ Multiple regression model

y = β0 + β1x1 + β2x2 + … + βqxq + ε
■ y = Dependent variable
■ x1, x2,…,xq = Independent variables
■ β0, β1,…, βq = Parameters (βi represents the change in the mean value of
the dependent variable y that corresponds to a one-unit increase in
the independent variable xi, holding the values of all other
independent variables constant)
■ ε = Error term (accounts for the variability in y that cannot be
explained by the linear effect of the q independent variables)
2
Estimated multiple regression model

■ Estimated multiple regression model

ŷ = b0 + b1x1 + b2x2 + … + bqxq
3
Least squares method and multiple regression

■ The least squares method is used to develop the estimated multiple
regression equation: find the values b0, b1, b2, …, bq that satisfy

min Σ (yi − ŷi)² = min Σ ei²   (sums over i = 1, …, n)

4
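The least squares fit can be sketched in a few lines of Python with NumPy. This is a minimal illustration using only the first five of the 300 Butler Trucking observations shown later in these notes, so the coefficients will not match the slides' full-sample output.

```python
import numpy as np

# First five Butler Trucking observations (miles x1, deliveries x2, time y)
X = np.array([[100.0, 4.0],
              [ 50.0, 3.0],
              [100.0, 4.0],
              [100.0, 2.0],
              [ 50.0, 2.0]])
y = np.array([9.3, 4.8, 8.9, 6.5, 4.2])

# Prepend a column of ones for the intercept b0, then solve
# min sum (y_i - yhat_i)^2 for b = (b0, b1, b2)
A = np.column_stack([np.ones(len(y)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

e = y - A @ b        # residuals e_i = y_i - yhat_i
sse = float(e @ e)   # the sum of squared errors the method minimizes
```

With an intercept in the model, the least squares residuals always sum to zero, which is a quick sanity check on the fit.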
Multiple regression model

5
Extension of Butler Trucking Company
■ Butler Trucking Company
– The estimated simple linear regression equation is
ŷi =1.2739+0.0678xi
– The linear effect of the number of miles traveled explains
66.41% of the variability in travel time in the sample data
(r2=0.6641)
– 33.59% of the variability in sample travel times remains
unexplained
– The managers want to consider adding one or more
independent variables, such as number of deliveries, to
the model to explain some of the remaining variability in
the dependent variable
– 300 observations are used this time
6
Assignment Miles (x1) Deliveries (x2) Time (y)
1 100.0 4.0 9.3
2 50.0 3.0 4.8
3 100.0 4.0 8.9
4 100.0 2.0 6.5
5 50.0 2.0 4.2

290 85.0 2.0 7.8


291 75.0 2.0 6.5
292 70.0 2.0 6.1
293 75.0 4.0 7.2
294 70.0 6.0 8.9
295 95.0 6.0 10.9
296 50.0 4.0 7.2
297 50.0 1.0 3.5
298 85.0 2.0 8.0
299 100.0 2.0 7.8
300 65.0 6.0 10.0

7
Estimated multiple regression model

■ Estimated multiple linear regression with two independent variables

ŷ = b0 + b1x1 + b2x2

■ ŷ = estimated mean travel time


■ x1=distance travelled
■ x2=number of deliveries
■ SST, SSR, SSE and r2 are computed

8
Excel’s regression tool

9
Excel’s regression tool

10
Excel’s regression tool

ŷ = 0.1273 + 0.0672x1 + 0.6900x2 (r2 = SSR/SST = 915.5161/1120.1032 = 81.73%)

11
Calculate SSR, SST and r2
Assignment Miles (x1) Deliveries (x2) Time (y) Predicted y Mean y (y − ȳ)² (SST term) (ŷ − ȳ)² (SSR term)
1 100.0 4.0 9.3 9.6055 7.2840 4.0643 5.3894
2 50.0 3.0 4.8 5.5564 7.2840 6.1703 2.9845
3 100.0 4.0 8.9 9.6055 7.2840 2.6115 5.3894
4 100.0 2.0 6.5 8.2255 7.2840 0.6147 0.8864
5 50.0 2.0 4.2 4.8664 7.2840 9.5111 5.8447

290 85.0 2.0 7.8 7.2178 7.2840 0.2663 0.0044


291 75.0 2.0 6.5 6.5460 7.2840 0.6147 0.5447
292 70.0 2.0 6.1 6.2101 7.2840 1.4019 1.1534
293 75.0 4.0 7.2 7.9260 7.2840 0.0071 0.4121
294 70.0 6.0 8.9 8.9700 7.2840 2.6115 2.8428
295 95.0 6.0 10.9 10.6496 7.2840 13.0755 11.3272
296 50.0 4.0 7.2 6.2464 7.2840 0.0071 1.0766
297 50.0 1.0 3.5 4.1764 7.2840 14.3187 9.6570
298 85.0 2.0 8.0 7.2178 7.2840 0.5127 0.0044
299 100.0 2.0 7.8 8.2255 7.2840 0.2663 0.8864
300 65.0 6.0 10.0 8.6341 7.2840 7.3767 1.8229
Mean y 7.2840 Total 1120.1032 915.5161

12
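The r² reported above is just the ratio of the two column totals in the table; a quick arithmetic check:

```python
# Column totals from the table above
sst = 1120.1032   # total sum of squares
ssr = 915.5161    # regression sum of squares

r2 = ssr / sst    # coefficient of determination, ~0.8173
sse = sst - ssr   # unexplained variation left over
```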
Adjusted r2

■ r2 never decreases when a new x variable is added to the model


■ This can be a disadvantage when comparing models
■ What is the net effect of adding a new variable?
– We lose a degree of freedom when a new x variable is added
– Did the new x variable add enough explanatory power to offset the loss of one degree
of freedom?

r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1)

(where n = sample size, k = number of independent variables)
– Interpreted as the percentage of the total sum of squares that can be explained by
using the estimated regression equation adjusted for the number of x variables used
– Smaller than r2
– Useful in comparing among models
13
r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1)

Adjusted r² = 1 − [(1 − 0.8173)(300 − 1)/(300 − 2 − 1)] = 0.8161 = 81.61%

14
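The worked example above translates directly into code:

```python
n, k = 300, 2   # sample size and number of independent variables
r2 = 0.8173

# Adjusted r-squared penalizes the lost degree of freedom per added variable
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

As expected, the adjusted value (0.8161) is slightly smaller than r² itself.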
Estimated multiple regression model

[Figure: estimated multiple regression model plotted against the independent variables X1 and X2]

15
Inference and regression

Simple linear regression model Multiple regression model


16
Inference and regression
■ Statistical inference
– Process of making estimates and drawing conclusions about one
or more characteristics of a population (the value of one or more
parameters) through the analysis of sample data drawn from the
population
■ Inference is commonly used to estimate and draw conclusions on
– The regression parameters β0, β1,…, βq
– The mean value and/or the predicted value of the dependent
variable y for specific values of the independent variables x1,
x2,…,xq
■ Consider both hypothesis testing and interval estimation
17
Inference and regression
■ Three regression conditions
– The population of potential error terms ε is normally distributed with
a mean of 0
– The population of potential error terms ε has a constant variance
– The values of ε are statistically independent
■ The errors must satisfy these conditions in order for inferences about the regression parameters to be valid
■ How to check?
– Residual plots to check for violations of regression conditions
– Residuals vs. ŷi
– Residuals vs. Xi

18
■ When using a normal probability plot, normally distributed errors will fall approximately along a straight line

[Normal probability plot: percent vs. residual, with points lying close to a straight line]

19
[Residual histograms: symmetrically distributed around 0 vs. not symmetrically distributed around 0]

20
[Residual plots against x: non-constant variance vs. constant variance]

21
[Residual plots against x: not independent vs. independent]

Independent errors show no trend: the residuals contain no remaining pattern that the model could have explained.

22
When residuals do not meet conditions

■ An important independent variable has been omitted


■ The functional form of the model is inadequate to explain the
relationship between the independent variables and the dependent
variable

23
Excel’s regression tool

24
Scatter chart of residuals (e) and values of the
independent variable (xi)

25
Excel’s regression tool

26
27
Scatter chart of residuals (e) and predicted
values of the dependent variable (ŷ)

28
Inference and regression
■ Testing individual regression parameters
– T-test
– To determine whether statistically significant relationships exist
between the dependent variable y and each of the independent
variables
– If βj=0, there is no linear relationship between the dependent
variable y and the independent variable xj
– If βj≠0, there is a linear relationship between y and xj

29
Inference and regression
■ Use a t test to test the hypothesis that a regression parameter equals zero
– Sbj is the estimated standard deviation of bj
– As the magnitude of t increases (as t deviates from zero in either
direction), we are more likely to reject the hypothesis that the
regression parameter βj=0

tSTAT = (bj − 0)/Sbj   (df = n − k − 1)

30
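Using the Miles coefficient and standard error from the Excel output shown in these notes, the test statistic is a one-line computation:

```python
# Coefficient and standard error for Miles from the Excel output
b1  = 0.06718172
sb1 = 0.002454979

# Tests H0: beta_1 = 0 with df = n - k - 1 = 297
t_stat = (b1 - 0) / sb1   # matches the 27.3655 reported on the slides
```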
[Excel output: sample size n, coefficients bj, and standard errors Sbj]

31
32
H0: βj = 0;  H1: βj ≠ 0
d.f. = 300 − 2 − 1 = 297, α = 0.05
Critical value (Excel): =T.INV.2T(0.05,297) → tα/2 = 1.97

From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001

The test statistic for each variable falls in the rejection region (|tSTAT| > 1.97; p-values < 0.05).
Decision: reject H0 for each variable.
Conclusion: there is evidence that both Miles and Deliveries affect travel time at α = 0.05.

33
34
35
H0: βj = 0;  H1: βj ≠ 0
d.f. = 300 − 2 − 1 = 297, α = 0.01
Critical value (Excel): =T.INV.2T(0.01,297) → tα/2 = 2.59

From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001

The test statistic for each variable falls in the rejection region (|tSTAT| > 2.59; p-values < 0.01).
Decision: reject H0 for each variable.
Conclusion: there is evidence that both Miles and Deliveries affect travel time at α = 0.01.

36
P-value
■ P-value
– The probability of obtaining a test statistic equal to or more
extreme (< or >) than the observed sample value given H0 is true
– H0 is there is no linear relationship between the dependent
variable y and the independent variable
– The p-value is also called the observed level of significance
– Smallest value of α for which H0 can be rejected
■ Compare the p-value with α
– If p-value < α, reject H0
– If p-value ≥ α, do not reject H0
– If the p-value is low then H0 must go

37
Excel: =T.DIST.2T(D18,297) returns the two-sided p-value for tSTAT = 27.3655; the statistic lies far beyond ±tα/2, deep in the rejection region.

38
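Excel's T.DIST.2T gives the exact two-sided p-value from the t distribution. With 297 degrees of freedom the t distribution is very close to the standard normal, so a sketch with Python's standard library can approximate it (an approximation, not Excel's exact computation):

```python
from statistics import NormalDist

t_stat = 27.3655

# Two-sided p-value: probability of a statistic at least this extreme
# under H0, approximated with the standard normal (df = 297 is large)
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
```

The result is effectively zero, consistent with the "p-value < 0.0001" reported on the slides.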
Inference and regression
■ Confidence interval
– An estimate of a population parameter that provides an interval
believed to contain the value of the parameter at some level of
confidence

bj ± tα/2 Sbj

■ Confidence level
– Indicates how frequently interval estimates based on samples of
the same size taken from the same population using identical
sampling techniques will contain the true value of the parameter
we are estimating
– 1 - α (level of significance)
39
[Excel output: coefficients bj and standard errors Sbj]

40
[Excel output: n, k, and the confidence level used]

41
For Miles, upper 95%
=0.06718172+1.968*0.002454979=0.0720

Lower 95%
=0.06718172-1.968*0.002454979=0.0624

bj ± tα/2 Sbj  →  0.0624 ≤ β1 ≤ 0.0720
You have 95% confidence that this interval correctly estimates the
relationship between these variables.

From a hypothesis-testing viewpoint, because this confidence interval does
not include 0, you can conclude that the regression coefficient (β1) has a
significant effect.
42
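The 95% interval for the Miles coefficient worked above can be reproduced directly:

```python
# Miles coefficient, its standard error, and t_{alpha/2} for 95% confidence
b1, sb1 = 0.06718172, 0.002454979
t_crit = 1.968   # =T.INV.2T(0.05, 297) in Excel

lower = b1 - t_crit * sb1   # ~0.0624
upper = b1 + t_crit * sb1   # ~0.0720
```

Because zero lies outside [lower, upper], the interval leads to the same conclusion as the t test at α = 0.05.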
Inference and regression
■ Testing for an overall regression relationship
■ Use an F test based on the F probability distribution
– H0: β1=β2=…= βq=0 (no linear relationship)
– H1: at least one βi ≠ 0 (at least one independent variable affects y)

FSTAT = MSR/MSE = (SSR/k) / (SSE/(n − k − 1))
43
P-value for the F Test

FSTAT = MSR/MSE = (SSR/k) / (SSE/(n − k − 1))

FSTAT = [915.5160626/2]/[204.5871374/(300-2-1)]=457.7580313/0.68884558=664.5292419

44
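The FSTAT arithmetic above, step by step:

```python
# Sums of squares from the regression output
ssr, sse = 915.5160626, 204.5871374
n, k = 300, 2

msr = ssr / k             # mean square regression = 457.758...
mse = sse / (n - k - 1)   # mean square error = 0.6888...
f_stat = msr / mse        # ~664.53, as on the slide
```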
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = 0.05 and α = 0.01;  df1 = 2, df2 = 297

Critical values (Excel):
=F.INV.RT(0.05,2,297) → F0.05 = 3.03
=F.INV.RT(0.01,2,297) → F0.01 = 4.68

Since the FSTAT test statistic falls in the rejection region at both levels, reject H0. There is evidence that at least one independent variable affects y.
45
46
47
Inference and regression
■ Non-significant independent variables
– If practical experience dictates that the non-significant independent variable
has a relationship with the dependent variable, the independent variable
should be kept in the model
– If the model sufficiently explains the dependent variable without the non-significant
independent variable, then consider rerunning the regression
without the non-significant independent variable (results may change)
– The appropriate treatment of the inclusion or exclusion of the y-intercept when
b0 is not statistically significant may require special consideration
– Regression through the origin should not be forced unless there are strong a
priori reasons for believing that the dependent variable is equal to zero when
the values of all independent variables in the model are equal to zero

48
Categorical independent variables

■ Butler Trucking Company and Rush Hour


– Dependent variable: travel time (y)
– Independent variables: miles traveled (x1) and number of
deliveries (x2)
– Categorical variable/dummy variable: rush hour (x3)
■ x3=0 if an assignment did not include travel on the congested
segment of highway during afternoon rush hour
■ x3=1 if an assignment included travel on the congested
segment of highway during afternoon rush hour

49
Categorical independent variables

ei = yi − ŷi (a positive residual means the actual value is larger than the predicted value)

50
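The residual definition above is a direct subtraction; using the actual and predicted travel times for the first assignment from the table on slide 12:

```python
# Actual and predicted travel times for assignment 1 (from slide 12)
y_actual = 9.3
y_hat = 9.6055

e = y_actual - y_hat   # negative: the model over-predicted this trip
```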
Categorical independent variables

ŷ = −0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3


51
Categorical independent variables
■ The model estimates that travel time increases by
– 0.0672 hour for every increase of 1 mile traveled, holding constant the number
of deliveries and whether the driving assignment route requires the driver to
travel on the congested segment of a highway during the afternoon rush hour
period
– 0.6735 hour for every delivery, holding constant the number of miles traveled
and whether the driving assignment route requires the driver to travel on the
congested segment of a highway during the afternoon rush hour period
– 0.9980 hour if the driving assignment route requires the driver to travel on the
congested segment of a highway during the afternoon rush hour period, holding
constant the number of miles traveled and the number of deliveries
■ r2=0.8838 indicates that the regression model explains approximately 88.38% of the
variability in travel time for the driving assignments in the sample

52
Categorical independent variables

■ When x3 = 0: ŷ = −0.3302 + 0.0672x1 + 0.6735x2

■ When x3 = 1: ŷ = (−0.3302 + 0.9980) + 0.0672x1 + 0.6735x2 = 0.6678 + 0.0672x1 + 0.6735x2
53
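The dummy variable's effect is easy to see in code: the two cases above differ only by the constant b3. A sketch using the notes' coefficients and a hypothetical 100-mile, 4-delivery assignment:

```python
# Coefficients from the notes' rush-hour model
b0, b1, b2, b3 = -0.3302, 0.0672, 0.6735, 0.9980

def predict(miles, deliveries, rush_hour):
    """Estimated mean travel time; rush_hour is the 0/1 dummy x3."""
    return b0 + b1 * miles + b2 * deliveries + b3 * rush_hour

# Hypothetical assignment, with and without rush-hour travel
off_peak = predict(100, 4, 0)
rush     = predict(100, 4, 1)
```

The two predictions differ by exactly b3 = 0.9980 hours, which is how a dummy variable shifts the intercept without changing the slopes.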
Categorical independent variables

■ If a categorical variable has k levels, k-1 dummy variables are required


■ Suppose a manufacturer of vending machines organized the sales
territories for a particular state into three regions: A, B, and C
■ Suppose the managers believe sales region is one of the important
factors in predicting the number of units sold

54
Categorical independent variables

ŷ = b0 + b1x1 + b2x2
Region A: ŷ = b0 + b1(0) + b2(0) = b0
Region B: ŷ = b0 + b1(1) + b2(0) = b0 + b1
Region C: ŷ = b0 + b1(0) + b2(1) = b0 + b2

55
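The three-region coding can be sketched as a lookup of k − 1 = 2 dummies, with Region A as the baseline (all coefficient values below are hypothetical, just to exercise the coding):

```python
# Dummy coding for a 3-level categorical variable: Region A is the
# baseline, so both dummies are zero for it
REGION_DUMMIES = {"A": (0, 0), "B": (1, 0), "C": (0, 1)}

def predict_units(b0, b1, b2, region):
    """Estimated mean units sold for a territory in the given region."""
    x1, x2 = REGION_DUMMIES[region]
    return b0 + b1 * x1 + b2 * x2
```

For example, with hypothetical estimates b0 = 10, b1 = 3, b2 = −2, the predictions are b0 for Region A, b0 + b1 for Region B, and b0 + b2 for Region C, matching the table above.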