05 Linear Regression 2
MULTIPLE REGRESSION MODEL
y = β0 + β1x1 + β2x2 + … + βqxq + ε
■ y = dependent variable
■ x1, x2, …, xq = independent variables
■ β0, β1, …, βq = parameters (βj represents the change in the mean value of the dependent variable y for a one-unit increase in the independent variable xj, holding the values of all other independent variables constant)
■ ε = error term (accounts for the variability in y that cannot be explained by the linear effect of the q independent variables)
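The least squares estimates of these parameters can be computed from the normal equations. A minimal pure-Python sketch on a tiny synthetic dataset (the data, the `solve3` helper, and all names are illustrative, not from the course materials):

```python
# Least-squares estimation of y = b0 + b1*x1 + b2*x2 via the normal
# equations (X'X) b = X'y. The dataset is synthetic, generated from the
# exact relationship y = 2 + 3*x1 + 0.5*x2, so least squares should
# recover those coefficients.

def solve3(A, v):
    """Solve a 3x3 linear system A b = v by Gaussian elimination."""
    A = [row[:] for row in A]
    v = v[:]
    n = 3
    for i in range(n):
        # partial pivoting for numerical stability
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        v[i], v[p] = v[p], v[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            v[r] -= f * v[i]
    b = [0.0] * n
    for i in range(n - 1, -1, -1):
        b[i] = (v[i] - sum(A[i][c] * b[c] for c in range(i + 1, n))) / A[i][i]
    return b

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [2 + 3 * a + 0.5 * b for a, b in zip(x1, x2)]

# Design matrix with an intercept column of ones
X = [[1.0, a, b] for a, b in zip(x1, x2)]

# Normal equations: (X'X) b = X'y
XtX = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(3)]
       for i in range(3)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]

b0, b1, b2 = solve3(XtX, Xty)
print(b0, b1, b2)  # recovers approximately 2, 3, 0.5
```

In practice the same estimates come from Excel's Regression tool or a statistics library; the point here is only that the bj minimize the sum of squared residuals.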
ESTIMATED MULTIPLE REGRESSION MODEL
ŷ = b0 + b1x1 + b2x2 + … + bqxq
LEAST SQUARES METHOD AND MULTIPLE REGRESSION
With two independent variables: ŷ = b0 + b1x1 + b2x2
EXCEL’S REGRESSION TOOL
INFERENCE AND REGRESSION
Check residual plots before inference.
[Residual plots: residuals vs. x showing non-constant vs. constant variance, and residual patterns showing non-independent vs. independent residuals]
WHEN RESIDUALS DO NOT MEET CONDITIONS
SCATTER CHART OF RESIDUALS AND PREDICTED VALUES OF THE DEPENDENT VARIABLE
INFERENCE AND REGRESSION
Hypothesis testing
■ Statistical inference
– The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population
■ Inference is commonly used to estimate and draw conclusions on
– The regression parameters β0, β1, …, βq
– The mean value and/or the predicted value of the dependent variable y for specific values of the independent variables x1, x2, …, xq
■ Consider both hypothesis testing and interval estimation
HYPOTHESIS TESTING
■ Claim: The population mean age is 50
■ H0: μ = 50, H1: μ ≠ 50
■ Sample the population and find the sample mean
■ Suppose the sample mean (x̄) age was 20
■ This is significantly lower than the claimed population mean age of 50
■ If the null hypothesis were true, the probability of getting such a different sample mean would be very small
■ A sample mean of 20 is very unlikely if the population mean were 50
■ You conclude that the population mean must not be 50
[Two-tailed test: regions of rejection in both tails, separated from the non-rejection region by the critical values]
INFERENCE AND REGRESSION
tSTAT = (bj − 0) / Sbj   (df = n − k − 1)
where bj is the estimated regression coefficient and Sbj is its standard error
H0: βj = 0
H1: βj ≠ 0
d.f. = 300 − 2 − 1 = 297
From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001
■ p-value
– The probability of obtaining a test statistic equal to or more extreme (< or >) than the observed sample value, given that H0 is true (no linear relationship)
– The p-value is also called the observed level of significance
– The smallest value of α for which H0 can be rejected
■ Compare the p-value with α
Excel: =T.DIST.2T(D18, 297)
[t distribution with df = 297; observed tSTAT = ±27.3655]
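The two-tailed p-value Excel returns with =T.DIST.2T can be approximated with only the Python standard library. A sketch (for df = 297 the t distribution is very close to the standard normal, so the normal CDF is used as an approximation; the function name is illustrative):

```python
# Two-tailed p-value for a t statistic, mirroring Excel's
# =T.DIST.2T(x, df). For large df (here 297) the t distribution is
# close to the standard normal, so NormalDist is a good stdlib
# approximation; for the exact t value use a stats library.
from statistics import NormalDist

def p_value_two_tailed(t_stat):
    # P(|T| >= |t_stat|) under H0, normal approximation for large df
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Miles: tSTAT = 27.3655 -> p-value far below 0.0001, so reject H0
print(p_value_two_tailed(27.3655) < 0.0001)  # True

# A small t statistic, by contrast, gives a large p-value
print(round(p_value_two_tailed(0.5), 3))
```

This matches the decision rule on the slide: the p-values for Miles and Deliveries are below any reasonable α, so H0: βj = 0 is rejected for both coefficients.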
INFERENCE AND REGRESSION
Interval estimation
■ Confidence interval
– An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence

bj ± tα/2 Sbj

■ Confidence level
– Indicates how frequently interval estimates based on samples of the same size, taken from the same population using identical sampling techniques, will contain the true value of the parameter we are estimating
– Equals 1 − α, where α is the level of significance
For Miles, the 95% confidence interval is bj ± tα/2 Sbj:
tα/2 = T.INV.2T(0.05, 297) = 1.968
Lower 95% = 0.06718172 − 1.968 × 0.002454979 = 0.0624
Upper 95% = 0.06718172 + 1.968 × 0.002454979 = 0.0720
0.0624 ≤ β1 ≤ 0.0720
You have 95% confidence that this interval correctly estimates the relationship between these variables. Since the interval does not contain zero, Miles has a significant effect.
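The interval arithmetic above can be reproduced directly (coefficient, standard error, and critical value taken from the slide's Excel output; variable names are illustrative):

```python
# 95% confidence interval for the Miles coefficient: bj ± t(alpha/2) * Sbj,
# using t(0.025, df = 297) = 1.968 from Excel's =T.INV.2T(0.05, 297).
b_miles = 0.06718172   # estimated coefficient for Miles
s_miles = 0.002454979  # its standard error
t_crit = 1.968         # t(0.025) with 297 degrees of freedom

lower = b_miles - t_crit * s_miles
upper = b_miles + t_crit * s_miles
print(round(lower, 4), round(upper, 4))  # 0.0624 0.072

# The interval does not contain zero, so Miles has a significant effect
print(lower > 0)  # True
```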
INFERENCE AND REGRESSION
F test
FSTAT = MSR / MSE
where MSR = SSR / k and MSE = SSE / (n − k − 1)

p-value for the F test
FSTAT = [915.5160626 / 2] / [204.5871374 / (300 − 2 − 1)] = 457.7580313 / 0.68884558 = 664.5292419
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = 0.05 and α = 0.01; df1 = 2, df2 = 297
Critical values: F0.05 = 3.03 (Excel: =F.INV.RT(0.05, 2, 297)) and F0.01 = 4.68 (Excel: =F.INV.RT(0.01, 2, 297))
[F distributions: "Do not reject H0" region below each critical value, rejection region above it]
Since the FSTAT test statistic is in the rejection region, reject H0. There is evidence that not all the coefficients of the independent variables in the model are equal to zero.
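The FSTAT computation and decision above can be checked in a few lines (SSR, SSE, n, and k are the values from the slides; variable names are illustrative):

```python
# F test for overall model significance:
# FSTAT = MSR / MSE with MSR = SSR/k and MSE = SSE/(n - k - 1).
SSR = 915.5160626   # regression sum of squares
SSE = 204.5871374   # error sum of squares
n, k = 300, 2       # sample size and number of independent variables

MSR = SSR / k               # mean square regression
MSE = SSE / (n - k - 1)     # mean square error, df = 297
F_stat = MSR / MSE
print(round(F_stat, 2))  # 664.53

# FSTAT far exceeds both critical values (F0.05 = 3.03, F0.01 = 4.68),
# so H0: beta1 = beta2 = 0 is rejected at either significance level.
print(F_stat > 4.68)  # True
```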
MULTICOLLINEARITY
■ Correlation among the independent variables; for example, miles traveled and gasoline consumed are strongly related: as one goes up, the other goes up
The primary consequence of multicollinearity is that it increases the variances and standard errors of the regression estimates of β0, β1, β2, …, βq and of the predicted values of the dependent variable, so inference based on these estimates is less precise than it should be.
A TEST OF MULTICOLLINEARITY
■ As a rule of thumb, multicollinearity is a potential problem if the absolute value of the sample correlation coefficient exceeds 0.7 for any two of the independent variables
■ Correlation coefficient in Excel: =CORREL(array1, array2)
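Excel's CORREL can be mirrored with a short Pearson-correlation function. The sketch below screens two made-up variables against the 0.7 rule of thumb (the data and function name are illustrative, not from the course materials):

```python
# Sample Pearson correlation coefficient, equivalent to Excel's
# =CORREL(array1, array2), used to screen pairs of independent
# variables for multicollinearity (|r| > 0.7 rule of thumb).
from math import sqrt

def correl(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example: miles traveled and gasoline consumed move together
miles = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]
gasoline = [4.4, 3.3, 4.0, 5.6, 2.3, 3.9, 4.1, 3.4, 5.5, 3.8]

r = correl(miles, gasoline)
print(abs(r) > 0.7)  # True: including both predictors risks multicollinearity
```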
CATEGORICAL INDEPENDENT VARIABLES
■ x3 = 0 if an assignment did not include travel on the congested segment of highway during afternoon rush hour (x3 = 1 if it did)
ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3
■ When x3 = 0: ŷ = –0.3302 + 0.0672x1 + 0.6735x2
■ When x3 = 1: ŷ = (–0.3302 + 0.9980) + 0.0672x1 + 0.6735x2 = 0.6678 + 0.0672x1 + 0.6735x2
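The effect of the dummy variable can be verified numerically: holding x1 and x2 fixed, switching x3 from 0 to 1 changes ŷ by exactly the dummy's coefficient, 0.9980 (the example x1 and x2 values below are illustrative):

```python
# Prediction with the estimated equation that includes the dummy x3:
# y-hat = -0.3302 + 0.0672*x1 + 0.6735*x2 + 0.9980*x3

def predict(x1, x2, x3):
    return -0.3302 + 0.0672 * x1 + 0.6735 * x2 + 0.9980 * x3

x1, x2 = 100.0, 3.0  # illustrative values for miles and deliveries

no_rush_hour = predict(x1, x2, 0)  # x3 = 0: no congested-segment travel
rush_hour = predict(x1, x2, 1)     # x3 = 1: includes congested segment

# The shift equals the dummy coefficient, 0.9980
print(round(rush_hour - no_rush_hour, 4))  # 0.998
```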
CATEGORICAL INDEPENDENT VARIABLES
■ Example with three categories: Region A, Region B, Region C (a categorical variable with three levels is represented by two dummy variables)
APPLICATION 1
https://www.emerald.com/insight/content/doi/10.1108/01443579910287064/full/html
DATABASES