LGT2425 Lecture 3 Part II (Notes)
Lecture 3: Linear Regression (Part II)
Multiple regression model
■ y = β0 + β1x1 + β2x2 + … + βqxq + ε
■ y = Dependent variable
■ x1, x2, …, xq = Independent variables
■ β0, β1, …, βq = Parameters (βi represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xi, holding the values of all other independent variables constant)
■ ε = Error term (accounts for the variability in y that cannot be explained by the linear effect of the q independent variables)
Estimated multiple regression model
■ ŷ = b0 + b1x1 + b2x2 + … + bqxq
Least squares method and multiple regression
■ The least squares method chooses b0, b1, …, bq to minimize Σ(yi − ŷi)²
Extension of Butler Trucking Company
■ Butler Trucking Company
– The estimated simple linear regression equation is ŷi = 1.2739 + 0.0678xi
– The linear effect of the number of miles traveled explains
66.41% of the variability in travel time in the sample data
(r2=0.6641)
– 33.59% of the variability in sample travel times remains
unexplained
– The managers want to consider adding one or more
independent variables, such as number of deliveries, to
the model to explain some of the remaining variability in
the dependent variable
– 300 observations are used this time
Assignment Miles (x1) Deliveries (x2) Time (y)
1 100.0 4.0 9.3
2 50.0 3.0 4.8
3 100.0 4.0 8.9
4 100.0 2.0 6.5
5 50.0 2.0 4.2
Estimated multiple regression model
■ ŷ = b0 + b1x1 + b2x2
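The least squares fit behind this equation can be sketched in pure Python on the five sample assignments shown above. The full data set has 300 rows, so the coefficients here will differ from the lecture's Excel output; this only illustrates the mechanics of solving the normal equations.

```python
# Minimal sketch: fit y-hat = b0 + b1*x1 + b2*x2 by least squares on the
# five sample assignments (Miles, Deliveries, Time). Illustrative only:
# the lecture's coefficients come from all 300 observations.

def fit_least_squares(rows):
    """Solve the normal equations (X'X)b = X'y by Gaussian elimination."""
    X = [[1.0, x1, x2] for x1, x2, _ in rows]
    y = [t for _, _, t in rows]
    # Build X'X (3x3) and X'y (3x1)
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)]
         for i in range(3)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]
    # Forward elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    coef = [0.0] * 3
    for i in (2, 1, 0):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef

rows = [(100.0, 4.0, 9.3), (50.0, 3.0, 4.8), (100.0, 4.0, 8.9),
        (100.0, 2.0, 6.5), (50.0, 2.0, 4.2)]
b0, b1, b2 = fit_least_squares(rows)

# Least squares guarantees SSE is no larger than the mean-only model's SST
pred = [b0 + b1 * x1 + b2 * x2 for x1, x2, _ in rows]
sse = sum((t - p) ** 2 for (_, _, t), p in zip(rows, pred))
ybar = sum(t for _, _, t in rows) / len(rows)
sst = sum((t - ybar) ** 2 for _, _, t in rows)
print(b0, b1, b2, sse <= sst)
```

In practice Excel (or any statistics package) solves the same normal equations; this sketch just makes the computation visible.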
Excel’s regression tool
Calculate SSR, SST and r2
Assignment  Miles (x1)  Deliveries (x2)  Time (y)  Predicted ŷ  Mean ȳ  (y − ȳ)²  (ŷ − ȳ)²
1  100.0  4.0  9.3  9.6055  7.2840  4.0643  5.3894
2  50.0  3.0  4.8  5.5564  7.2840  6.1703  2.9845
3  100.0  4.0  8.9  9.6055  7.2840  2.6115  5.3894
4  100.0  2.0  6.5  8.2255  7.2840  0.6147  0.8864
5  50.0  2.0  4.2  4.8664  7.2840  9.5111  5.8447
(the last two columns hold each observation’s contribution to SST and SSR)
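The per-assignment SST and SSR contributions in the table can be reproduced directly. Note that ȳ = 7.2840 and the predicted values come from the model fitted to all 300 observations; only the first five assignments are shown, and results match the table to rounding.

```python
# Sketch: per-observation contributions to SST and SSR for the five
# assignments shown in the notes.

ybar = 7.2840                                      # mean of y over all 300 rows
y     = [9.3,    4.8,    8.9,    6.5,    4.2]      # observed travel time
y_hat = [9.6055, 5.5564, 9.6055, 8.2255, 4.8664]   # predicted travel time

sst_i = [(yi - ybar) ** 2 for yi in y]       # (yi - ybar)^2, contribution to SST
ssr_i = [(pi - ybar) ** 2 for pi in y_hat]   # (yhat_i - ybar)^2, contribution to SSR

for i, (a, b) in enumerate(zip(sst_i, ssr_i), start=1):
    print(i, round(a, 4), round(b, 4))
```

Summing these contributions over all 300 observations gives the SST and SSR used for r2 = SSR/SST.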
Adjusted r2
■ r2adj = 1 − (1 − r2)(n − 1)/(n − k − 1)
(where n = sample size, k = number of independent variables)
– Interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation, adjusted for the number of x variables used
– Smaller than r2
– Useful in comparing among models
■ r2adj = 1 − (1 − r2)(n − 1)/(n − k − 1)
■ Adjusted r2 = 1 − [(1 − 0.8173)(300 − 1)/(300 − 2 − 1)] = 0.8161 = 81.61%
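The adjusted r2 computation above can be checked in a few lines:

```python
# Sketch of the adjusted r^2 computation:
# r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.8173   # r^2 from the Excel output
n = 300       # sample size
k = 2         # number of independent variables (Miles, Deliveries)

r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 4))  # → 0.8161
```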
Estimated multiple regression model
[Figure: estimated regression plane of ŷ over X1 and X2]
Inference and regression
■ When using a normal probability plot, normal errors will approximately display in a straight line
[Figure: normal probability plot, Percent vs. Residual]
[Figure: residual plots, symmetrically distributed around 0 vs. not symmetrically distributed around 0]
[Figure: plots of Y vs. x and residuals vs. x]
[Figure: residuals vs. X, independent vs. not independent error patterns]
When residuals do not meet conditions
Excel’s regression tool
Scatter chart of residuals (e) and the independent variable (xi)
Excel’s regression tool
Scatter chart of residuals (e) and predicted
values of the dependent variable (ŷ)
Inference and regression
■ Testing individual regression parameters
– t test
– To determine whether statistically significant relationships exist between the dependent variable y and each of the independent variables
– If βj = 0, there is no linear relationship between the dependent variable y and the independent variable xj
– If βj ≠ 0, there is a linear relationship between y and xj
Inference and regression
■ Use a t test to test the hypothesis that a regression parameter βj equals zero
– Sbj is the estimated standard deviation of bj
– As the magnitude of t increases (as t deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter βj = 0
■ tSTAT = (bj − 0)/Sbj (df = n − k − 1)
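Using the Miles coefficient and standard error quoted later in the notes (bj = 0.06718172, Sbj = 0.002454979), the tSTAT works out as follows:

```python
# Sketch: tSTAT for the Miles coefficient, using the values from the
# Excel output quoted in these notes.

bj  = 0.06718172    # estimated coefficient for Miles
sbj = 0.002454979   # its estimated standard error

t_stat = (bj - 0) / sbj
print(round(t_stat, 4))  # → 27.3655
```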
H0: βj = 0; H1: βj ≠ 0; d.f. = 300 - 2 - 1 = 297
From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001
P-value
■ P-value
– The probability of obtaining a test statistic equal to or more extreme (< or >) than the observed sample value, given that H0 is true
– Here H0 states that there is no linear relationship between the dependent variable y and the independent variable
– The p-value is also called the observed level of significance
– The smallest value of α for which H0 can be rejected
■ Compare the p-value with α
– If p-value < α, reject H0
– If p-value ≥ α, do not reject H0
– “If the p-value is low, then H0 must go”
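The decision rule can be written as a tiny comparison. The slide only reports "p-value < 0.0001", so the values below are that upper bound, not the exact p-values:

```python
# Sketch of the decision rule: compare each coefficient's p-value with α.
alpha = 0.05
p_values = {"Miles": 0.0001, "Deliveries": 0.0001}  # upper bounds from the slide

for name, p in p_values.items():
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(name, decision)
```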
Excel: =T.DIST.2T(D18,297)
[Figure: two-tailed rejection regions beyond ±tα/2; the observed tSTAT = ±27.3655 falls in the rejection region, so reject H0]
Inference and regression
■ Confidence interval
– An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence
– bj ± tα/2 Sbj
■ Confidence level
– Indicates how frequently interval estimates based on samples of the same size, taken from the same population using identical sampling techniques, will contain the true value of the parameter we are estimating
– 1 − α (where α is the level of significance)
■ The width of the interval bj ± tα/2 Sbj depends on Sbj, the sample size n, the number of independent variables k, and the confidence level
For Miles, upper 95% = 0.06718172 + 1.968*0.002454979 = 0.0720
Lower 95% = 0.06718172 - 1.968*0.002454979 = 0.0624
bj ± tα/2 Sbj gives 0.0624 ≤ β1 ≤ 0.0720
You have 95% confidence that this interval correctly estimates the relationship between these variables.
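The 95% interval above can be reproduced directly, using 1.968 as the t critical value with 297 degrees of freedom, as on the slide:

```python
# Sketch of the 95% confidence interval for the Miles coefficient:
# bj ± t(α/2) * Sbj

bj  = 0.06718172      # estimated coefficient for Miles
sbj = 0.002454979     # its estimated standard error
t_crit = 1.968        # ≈ t(0.025) with 297 degrees of freedom

lower = bj - t_crit * sbj
upper = bj + t_crit * sbj
print(f"{lower:.4f} ≤ β1 ≤ {upper:.4f}")  # → 0.0624 ≤ β1 ≤ 0.0720
```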
■ FSTAT = MSR/MSE = (SSR/k)/[SSE/(n − k − 1)]
P-value for the F test
■ FSTAT = MSR/MSE = (SSR/k)/[SSE/(n − k − 1)]
■ FSTAT = [915.5160626/2]/[204.5871374/(300 − 2 − 1)] = 457.7580313/0.68884558 = 664.5292419
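The FSTAT computation above, step by step:

```python
# Sketch of the overall F test statistic from the slide.

ssr = 915.5160626   # regression sum of squares
sse = 204.5871374   # error sum of squares
n, k = 300, 2

msr = ssr / k               # mean square regression: 457.758...
mse = sse / (n - k - 1)     # mean square error: 0.6888...
f_stat = msr / mse
print(round(f_stat, 4))  # → 664.5292
```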
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = 0.05 or α = 0.01; df1 = 2, df2 = 297
Critical values: F0.05 = 3.03 (Excel: =F.INV.RT(0.05,2,297)) and F0.01 = 4.68 (Excel: =F.INV.RT(0.01,2,297))
[Figure: F distribution with "do not reject H0" region below the critical value and rejection region above it]
Since the FSTAT test statistic is in the rejection region, reject H0. There is evidence that at least one independent variable affects y.
Inference and regression
■ Non-significant independent variables
– If practical experience dictates that the non-significant independent variable has a relationship with the dependent variable, the independent variable should be kept in the model
– If the model sufficiently explains the dependent variable without the non-significant independent variable, then consider rerunning the regression without the non-significant independent variable (results may change)
– The appropriate treatment of the inclusion or exclusion of the y-intercept when b0 is not statistically significant may require special consideration
– Regression through the origin should not be forced unless there are strong a priori reasons for believing that the dependent variable is equal to zero when the values of all independent variables in the model are equal to zero
Categorical independent variables
■ With a dummy variable x3 added to the model: ŷ = b0 + b1x1 + b2x2 + b3x3
■ When x3 = 0: ŷ = b0 + b1x1 + b2x2
■ When x3 = 1: ŷ = b0 + b1x1 + b2x2 + b3 = (b0 + b3) + b1x1 + b2x2
Categorical independent variables
■ ŷ = b0 + b1x1 + b2x2
Region A: ŷ = b0 + b1(0) + b2(0) = b0
Region B: ŷ = b0 + b1(1) + b2(0) = b0 + b1
Region C: ŷ = b0 + b1(0) + b2(1) = b0 + b2
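The dummy-variable coding for the three regions can be sketched as below. The coefficient values are hypothetical (the notes give no numbers for this example); only the encoding pattern comes from the slide:

```python
# Sketch: a three-level region variable encoded with two dummies.
# Region A is the baseline; b0, b1, b2 are hypothetical coefficients.

b0, b1, b2 = 10.0, 2.5, -1.0  # hypothetical fitted values

def predict(region):
    """x1 = 1 for Region B, x2 = 1 for Region C; Region A has x1 = x2 = 0."""
    x1 = 1 if region == "B" else 0
    x2 = 1 if region == "C" else 0
    return b0 + b1 * x1 + b2 * x2

print(predict("A"), predict("B"), predict("C"))  # → 10.0 12.5 9.0
```

A categorical variable with m levels always needs m − 1 dummies; including all m would make the columns linearly dependent with the intercept.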