05 Linear Regression 2

The document discusses multiple linear regression analysis. It covers estimating multiple regression models using least squares methods and Excel's regression tool. It also discusses checking regression assumptions by examining residual plots, hypothesis testing of regression parameters using t-tests, and determining statistical significance.

MM3425 BUSINESS ANALYTICS

LINEAR REGRESSION PART II

MULTIPLE REGRESSION MODEL

■ y = dependent variable
■ x1, x2,…,xq = independent variables
■ β0, β1,…, βq = parameters (βi represents the change in the mean value of the dependent
variable y that corresponds to a one-unit increase in the independent variable xi, holding
the values of all other independent variables constant)
■ ε = error term (accounts for the variability in y that cannot be explained by
the linear effect of the q independent variables)
ESTIMATED MULTIPLE REGRESSION MODEL

LEAST SQUARES METHOD AND MULTIPLE REGRESSION

■ The least squares method is used to develop the estimated
multiple regression equation, finding the coefficients
b0, b1, b2,…, bq that satisfy

min Σi=1..n (yi − ŷi)² = min Σi=1..n ei²
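The minimization above can be sketched in a few lines of pure Python by solving the normal equations (AᵀA)b = Aᵀy. This is a minimal illustration, not the course's workflow (the deck uses Excel's Regression tool); the data below are made up so that the fit can be checked by eye.

```python
# Minimal pure-Python least squares for multiple regression via the
# normal equations (A^T A) b = A^T y. Illustrative only; the deck's
# estimates come from Excel's Regression tool.

def solve(M, v):
    """Solve M b = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (A[r][n] - sum(A[r][c] * b[c] for c in range(r + 1, n))) / A[r][r]
    return b

def fit_least_squares(X, y):
    """Return [b0, b1, ..., bq] minimizing the sum of squared residuals."""
    A = [[1.0] + list(map(float, row)) for row in X]   # column of 1s for b0
    p = len(A[0])
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(p)] for i in range(p)]
    Aty = [sum(a[i] * yi for a, yi in zip(A, y)) for i in range(p)]
    return solve(AtA, Aty)

# Noise-free data generated from y = 1 + 2*x1 + 3*x2, so the fit
# should recover those coefficients exactly (up to rounding).
X = [[1, 1], [2, 1], [3, 2], [4, 3], [5, 5]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
b0, b1, b2 = fit_least_squares(X, y)
```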

MULTIPLE REGRESSION MODEL

INFERENCE AND REGRESSION


Simple linear regression model Multiple regression model


EXTENSION OF BUTLER TRUCKING COMPANY

■ Butler Trucking Company


– The estimated simple linear regression equation is
ŷi =1.2739+0.0678xi
– The linear effect of the number of miles traveled explains
66.41% of the variability in travel time in the sample
data (r2=0.6641)
– 33.59% of the variability in sample travel times remains
unexplained
– The managers want to consider adding one or more
independent variables, such as number of deliveries, to
the model to explain some of the remaining variability in
the dependent variable

– 300 observations are used this time


Assignment Miles (x1) Deliveries (x2) Time (y)
1 100.0 4.0 9.3
2 50.0 3.0 4.8
3 100.0 4.0 8.9
4 100.0 2.0 6.5
5 50.0 2.0 4.2

290 85.0 2.0 7.8


291 75.0 2.0 6.5
292 70.0 2.0 6.1
293 75.0 4.0 7.2
294 70.0 6.0 8.9
295 95.0 6.0 10.9
296 50.0 4.0 7.2
297 50.0 1.0 3.5
298 85.0 2.0 8.0
299 100.0 2.0 7.8
300 65.0 6.0 10.0
ESTIMATED MULTIPLE REGRESSION MODEL

■ Estimated multiple linear regression with two independent


variables

ŷ = b0 + b1x1 + b2x2

■ ŷ = estimated mean travel time


■ x1= distance travelled
■ x2= number of deliveries

■ SST, SSR, SSE and r2 are computed


EXCEL’S REGRESSION TOOL

ŷ = 0.1273 + 0.0672x1 + 0.6900x2 (r² = SSR/SST = 915.5161/1120.1032 = 81.73%)
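The r² on this slide can be checked directly from the two sums of squares reported in the Excel output:

```python
# Checking the slide's r² from its sums of squares: r² = SSR/SST.
SSR = 915.5161   # regression sum of squares (from the Excel output)
SST = 1120.1032  # total sum of squares
r_squared = SSR / SST   # ≈ 0.8173, i.e. 81.73%
```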


ADJUSTED R2

■ r2 never decreases when a new x variable is added to the model


■ This can be a disadvantage when comparing models
■ What is the net effect of adding a new variable?
– We lose a degree of freedom when a new x variable is added
– Did the new x variable add enough explanatory power to offset the loss of one degree of
freedom?

Adjusted r² = 1 − (1 − r²)(n − 1)/(n − k − 1)
(where n = sample size, k = number of independent variables)

– Interpreted as the percentage of the total sum of squares that can be explained by using
the estimated regression equation, adjusted for the number of x variables used
– Smaller than r2
– Useful in comparing among models
Adjusted r2 = 1 − [(1 − 0.8173)(300 − 1)/(300 − 2 − 1)] = 0.8161 = 81.61%
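The adjusted r² computation above can be reproduced in a couple of lines:

```python
# The adjusted r² formula from this slide, applied to the Butler Trucking
# two-variable fit (r² = 0.8173, n = 300, k = 2).
def adjusted_r2(r2, n, k):
    """Adjusted r² = 1 - (1 - r²)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.8173, n=300, k=2)   # ≈ 0.8161, matching the slide
```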

Check residual plots before inference
INFERENCE AND REGRESSION

■ Three regression conditions


– The population of potential error terms ε is normally distributed with a mean of 0
– The population of potential error terms ε has a constant variance
– The values of ε are statistically independent
■ The errors must satisfy these conditions in order for inferences to be valid
■ How to check?
– Residual plots to check for violations of regression conditions
– Residuals vs. ŷ
– Residuals vs. xi
(Residual plot sketches:
– residuals symmetrically distributed around 0 vs. not symmetrically distributed around 0
– constant variance vs. non-constant variance
– independent vs. not independent residuals)
WHEN RESIDUALS DO NOT MEET CONDITIONS

■ An important independent variable has been omitted


■ The functional form of the model is inadequate to explain the
relationships between the independent variables and the
dependent variable

EXCEL’S REGRESSION TOOL

SCATTER CHART OF RESIDUALS AND PREDICTED VALUES OF
THE DEPENDENT VARIABLE

EXCEL’S REGRESSION TOOL

SCATTER CHART OF RESIDUALS AND PREDICTED VALUES OF
THE DEPENDENT VARIABLE

Hypothesis testing
INFERENCE AND REGRESSION

■ Statistical inference
– Process of making estimates and drawing conclusions about one or more
characteristics of a population (the value of one or more
parameters) through the analysis of sample data drawn from the
population
■ Inference is commonly used to estimate and draw conclusions on
– The regression parameters β0,β1,…,βq
– The mean value and/or the predicted value of the dependent variable y
for specific values of the independent variables x1,x2,…,xq
■ Consider both hypothesis testing and interval estimation
HYPOTHESIS TESTING

■ If you can assume that a defendant is either innocent or guilty,
under which assumption is it easier to prove that he is guilty?
■ Claim: The population mean age is 50
■ H0: μ = 50, H1: μ ≠ 50
■ Sample the population and find sample mean

■ Suppose the sample mean (x̄) age was 20
■ This is significantly lower than the claimed mean population age of 50
■ If the null hypothesis were true, the probability of getting such a different sample mean
would be very small
■ Getting a sample mean of 20 would be very unlikely if the population mean were 50
■ You conclude that the population mean must not be 50

■ You reject the null hypothesis (H0: μ = 50)


■ If the sample mean is close to the assumed population mean
– H0 is not rejected
■ If the sample mean is far from the assumed population mean
– H0 is rejected
■ How far is “far enough” to reject H0?
– The critical value of a test statistic is determined for decision making

(Figure: two-tailed test with a region of rejection in each tail, separated from the
non-rejection region by the critical values)
INFERENCE AND REGRESSION

■ Testing individual regression parameters


– t-test
– To determine whether statistically significant relationships exist between
the dependent variable y and each of the independent variables xj
– If βj=0, there is no linear relationship between the dependent variable y
and the independent variable xj
– If βj≠0, there is a linear relationship between y and xj

INFERENCE AND REGRESSION

■ Use a t test to test the hypothesis that a regression parameter βj is equal to zero


– Sbj is the estimated standard deviation of bj
– As the magnitude of t increases (as t deviates from zero in either direction),
we are more likely to reject the hypothesis that the regression parameter βj=0

tSTAT = (bj − 0) / Sbj    (df = n – k – 1)
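The t statistic is just the coefficient over its standard error; using the Miles values reported in the deck's Excel output (b1 = 0.06718172, Sb1 = 0.002454979) reproduces the tSTAT shown on the next slides:

```python
# tSTAT = (bj - 0)/Sbj with df = n - k - 1, using the Miles coefficient
# and standard error from the Excel output reported in this deck.
def t_stat(bj, s_bj):
    return bj / s_bj

t_miles = t_stat(0.06718172, 0.002454979)   # ≈ 27.3655, matching the slide
df = 300 - 2 - 1                            # 297
```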
H0: βj = 0, H1: βj ≠ 0
d.f. = 300 − 2 − 1 = 297, α = 0.05
Excel: =T.INV.2T(0.05, 297) gives tα/2 = 1.97

From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001

The test statistic for each variable falls in the rejection region (p-values < 0.05)
Decision: Reject H0 for each variable
Conclusion: There is evidence that both Miles and Deliveries affect travel time at α = 0.05
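The two-tailed decision rule used here can be sketched as a one-line comparison against the critical value (1.97 for α = 0.05 with 297 degrees of freedom, per the Excel formula above):

```python
# Two-tailed decision rule: reject H0 when |tSTAT| exceeds the critical
# value t(alpha/2). Critical value and t statistics are from this slide.
def reject_h0(t_stat, t_crit):
    return abs(t_stat) > t_crit

t_crit_05 = 1.97   # =T.INV.2T(0.05, 297)
decisions = {
    "Miles": reject_h0(27.3655, t_crit_05),
    "Deliveries": reject_h0(23.3731, t_crit_05),
}
```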
H0: βj = 0, H1: βj ≠ 0
d.f. = 300 − 2 − 1 = 297, α = 0.01
Excel: =T.INV.2T(0.01, 297) gives tα/2 = 2.59

From the Excel output:
For Miles, tSTAT = 27.3655, with p-value < 0.0001
For Deliveries, tSTAT = 23.3731, with p-value < 0.0001

The test statistic for each variable falls in the rejection region (p-values < 0.01)
Decision: Reject H0 for each variable
Conclusion: There is evidence that both Miles and Deliveries affect travel time at α = 0.01
P-VALUE

■ p-value
– The probability of obtaining a test statistic equal to or more extreme (< or
>) than the observed sample value, given H0 is true (no linear relationship)
– The p-value is also called the observed level of significance
– Smallest value of α for which H0 can be rejected
■ Compare the p-value with α

– If p-value < α, reject H 0

– If p-value ≥ α, do not reject H0

– If the p-value is low then H0 must go


/2=0.005 /2=0.005

Do not reject H Reject H0


Reject H0
-t α/2
0
tα/2
0
-2.59 2.59

Excel
=T.DIST.2T(D18,297)
-27.3655 0 27.3655

42
Interval estimation
INFERENCE AND REGRESSION

■ Confidence interval
– An estimate of a population parameter that provides an interval
believed to contain the value of the parameter at some level of
confidence

bj ± tα/2 Sbj
■ Confidence level
– Indicates how frequently interval estimates based on samples of the
same size taken from the same population using identical sampling
techniques will contain the true value of the parameter we are
estimating
– 1 − α (where α is the level of significance)
For Miles, with tα/2 = T.INV.2T(0.05, 297) = 1.968:
Lower 95% = 0.06718172 − 1.968 × 0.002454979 = 0.0624
Upper 95% = 0.06718172 + 1.968 × 0.002454979 = 0.0720

bj ± tα/2 Sbj gives 0.0624 ≤ β1 ≤ 0.0720

You have 95% confidence that this interval correctly estimates the
relationship between these variables.
From a hypothesis-testing viewpoint, because this confidence interval does
not include 0, you can conclude that the regression coefficient (β1) has a
significant effect.
F test
INFERENCE AND REGRESSION

■ Testing for an overall regression relationship


■ Use an F test based on the F probability distribution
– H0: β1=β2=…= βq=0 (no linear relationship)
– H1: at least one βi ≠0 (at least one independent variable affects y)

FSTAT = MSR / MSE = [SSR / k] / [SSE / (n − k − 1)]
p-value for the F Test

FSTAT = MSR / MSE = [SSR / k] / [SSE / (n − k − 1)]

FSTAT = [915.5160626/2]/[204.5871374/(300 − 2 − 1)] = 457.7580313/0.68884558 = 664.5292419
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
df1 = 2, df2 = 297

α = 0.05: F0.05 = 3.03 (Excel: =F.INV.RT(0.05, 2, 297))
α = 0.01: F0.01 = 4.68 (Excel: =F.INV.RT(0.01, 2, 297))

Since the FSTAT test statistic falls in the rejection region at either level, reject H0.
There is evidence that at least one independent variable affects y.
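The F computation and decision above can be checked from the sums of squares:

```python
# Overall F test: FSTAT = (SSR/k) / (SSE/(n - k - 1)), compared against
# the critical values reported on this slide (from Excel's F.INV.RT).
SSR, SSE = 915.5160626, 204.5871374
n, k = 300, 2
f_stat = (SSR / k) / (SSE / (n - k - 1))   # ≈ 664.53
reject_at_05 = f_stat > 3.03   # F0.05 with (2, 297) df
reject_at_01 = f_stat > 4.68   # F0.01 with (2, 297) df
```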


INFERENCE AND REGRESSION

■ Non-significant independent variables


– If practical experience dictates that the non-significant independent variable
has a relationship with the dependent variable, the independent variable
should be kept in the model
– If the model sufficiently explains the dependent variable without the non-
significant independent variable, then consider rerunning the
regression without the non-significant independent variable (results
may change)
– The appropriate treatment of the inclusion or exclusion of the y-intercept when b0
is not statistically significant may require special consideration
– Regression through the origin should not be forced unless there are strong a priori
reasons for believing that the dependent variable is equal to zero when
the values of all independent variables in the model are equal to zero
Multicollinearity
MULTICOLLINEARITY

■ Multicollinearity refers to the correlation among the independent variables in


multiple regression analysis
■ What will happen if independent variables are correlated?
– In t tests for the significance of individual parameters, the difficulty caused by
multicollinearity is that it is possible to conclude that a parameter associated
with one of the multicollinear independent variables is not significantly different
from zero when the independent variable actually has a strong relationship
with the dependent variable
■ This problem is avoided when there is little correlation among the independent
variables

Example: miles traveled and gasoline consumed are strongly related; as miles
traveled (X) goes up, gasoline consumed (Y) goes up.
The primary consequence of multicollinearity is that
it increases the variances and standard errors of the
regression estimates of β0, β1, β2, …, βq and of the predicted
values of the dependent variable, and so inference based on
these estimates is less precise than it should be.
A TEST OF MULTICOLLINEARITY

■ As a rule of thumb, multicollinearity is a
potential problem if the absolute value of the
sample correlation coefficient exceeds
0.7 for any two of the independent
variables
■ Correlation coefficient in Excel
■ =CORREL(array1, array2)

 rMiles, Gasoline Consumption=0.9571 >0.7


– Miles and Gasoline Consumption are
collinear
– Include either Miles or Gasoline
Consumption

 rMiles, Deliveries = 0.0258 < 0.7
– Miles and Deliveries are not collinear
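The rule-of-thumb check above amounts to computing a sample correlation coefficient (what Excel's =CORREL returns) and comparing its absolute value to 0.7. A minimal sketch with illustrative data (the numbers below are made up, not the Butler Trucking sample):

```python
# Rule-of-thumb multicollinearity check: flag any pair of independent
# variables whose sample correlation exceeds 0.7 in absolute value.
# Data are illustrative, not the actual Butler Trucking observations.
def pearson_r(xs, ys):
    """Sample correlation coefficient (same computation as Excel's CORREL)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def collinear(xs, ys, threshold=0.7):
    return abs(pearson_r(xs, ys)) > threshold

miles = [100, 50, 100, 100, 50, 85, 75]
gasoline = [10.2, 5.1, 9.8, 10.5, 4.9, 8.6, 7.4]   # tracks miles closely
deliveries = [4, 3, 4, 2, 2, 2, 2]                  # largely unrelated to miles
```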
Categorical variable
CATEGORICAL INDEPENDENT VARIABLES

🞍 Butler Trucking Company and Rush Hour


– Dependent variable: travel time (y)
– Independent variables: miles traveled (x1) and number of deliveries (x2)
– Categorical variable/dummy variable: rush hour (x3)

🞍 x3=0 if an assignment did not include travel on the congested segment of highway during
afternoon rush hour

🞍 x3=1 if an assignment included travel on the congested segment of highway during


afternoon rush hour
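The rush-hour coding described above maps a yes/no attribute to a 0/1 dummy. A tiny sketch, where the `rush_hour` field name on each assignment record is an assumption made for illustration:

```python
# Dummy coding for the rush-hour categorical variable: x3 = 1 if the
# assignment includes the congested highway segment during afternoon
# rush hour, else 0. The `rush_hour` field name is hypothetical.
def encode_rush_hour(assignment):
    return 1 if assignment["rush_hour"] else 0

a1 = {"miles": 100.0, "deliveries": 4, "rush_hour": True}
a2 = {"miles": 50.0, "deliveries": 2, "rush_hour": False}
x3_a1 = encode_rush_hour(a1)   # 1
x3_a2 = encode_rush_hour(a2)   # 0
```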
CATEGORICAL INDEPENDENT VARIABLES

ei = yi − ŷi (a positive residual means the actual value is larger than the predicted value)
CATEGORICAL INDEPENDENT VARIABLES

ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3
CATEGORICAL INDEPENDENT VARIABLES

ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3


■ The model estimates that travel time increases by
– 0.0672 hour for every increase of 1 mile traveled, holding constant the number of
deliveries and whether the driving assignment route requires the driver to travel
on the congested segment of a highway during the afternoon rush hour period
– 0.6735 hour for every delivery, holding constant the number of miles
traveled and whether the driving assignment route requires the driver to
travel on the congested segment of a highway during the afternoon rush
hour period
– 0.9980 hour if the driving assignment route requires the driver to travel on the
congested segment of a highway during the afternoon rush hour period,
holding constant the number of miles traveled and the number of deliveries
■ r2 = 0.8838 indicates that the regression model explains approximately 88.38% of the
variability in travel time for the driving assignments in the sample
CATEGORICAL INDEPENDENT VARIABLES

■ When x3 = 0: ŷ = −0.3302 + 0.0672x1 + 0.6735x2
■ When x3 = 1: ŷ = (−0.3302 + 0.9980) + 0.0672x1 + 0.6735x2 = 0.6678 + 0.0672x1 + 0.6735x2
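The effect of the dummy is a pure intercept shift: for any fixed miles and deliveries, setting x3 = 1 raises the predicted travel time by exactly the dummy's coefficient. A quick check using the fitted equation from this deck:

```python
# Fitted Butler Trucking model with the rush-hour dummy (from this deck).
# Flipping x3 from 0 to 1 shifts the prediction by the dummy coefficient.
def predict(x1, x2, x3):
    return -0.3302 + 0.0672 * x1 + 0.6735 * x2 + 0.9980 * x3

shift = predict(80, 3, 1) - predict(80, 3, 0)   # 0.9980 hours
```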
CATEGORICAL INDEPENDENT VARIABLES

■ If a categorical variable has k levels, k-1 dummy variables are required


■ Suppose a manufacturer of vending machines organized the sales
territories for a particular state into three regions: A, B, and C
■ Suppose the managers believe sales region is one of the important factors in
predicting the number of units sold

CATEGORICAL INDEPENDENT VARIABLES

Region A

Region B

Region C

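A categorical variable with k = 3 levels (regions A, B, C) needs k − 1 = 2 dummies, as stated above. Which level serves as the baseline is a modeling choice; in this sketch region A is the (assumed) baseline, so it is coded as all zeros:

```python
# k - 1 dummy coding for a 3-level categorical variable (regions A, B, C).
# Region A is taken as the baseline here; that choice is an assumption.
def region_dummies(region):
    return {"B": int(region == "B"), "C": int(region == "C")}

dummies = {r: region_dummies(r) for r in ("A", "B", "C")}
```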
APPLICATION 1

https://www.sciencedirect.com/science/article/pii/S2212567112001888

Which variables are significant?
APPLICATION 2

https://www.emerald.com/insight/content/doi/10.1108/01443579910287064/full/html
DATABASES

◾ HKSAR Government: https://data.gov.hk/en/


◾ Statista (PolyU login required): https://www-statista-com.ezproxy.lb.polyu.edu.hk/
◾ World bank: https://data.worldbank.org/
◾ Aviation: https://www.bts.gov/
◾ Shipping (PolyU login required): https://www.lib.polyu.edu.hk/databases/shipping-intelligence-network-individual-title-varies
◾ HK Air Traffic Statistics: https://www.cad.gov.hk/english/statistics.html
◾ Kaggle Dataset: https://www.kaggle.com/datasets
