QMDS 202 Data Analysis and Modeling
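The sample regression equation is
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
where $b_1, b_2, \dots, b_k$ are the sample linear regression coefficients of $x_1, x_2, \dots, x_k$ respectively and $b_0$ is the constant of the equation.
For $k = 2$, the sample regression equation is $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$, where $b_0$, $b_1$, and $b_2$ can be found by solving a system of three normal equations:
$$\sum y = n b_0 + b_1 \sum x_1 + b_2 \sum x_2$$
$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2 + b_2 \sum x_1 x_2$$
$$\sum x_2 y = b_0 \sum x_2 + b_1 \sum x_1 x_2 + b_2 \sum x_2^2$$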
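Example 1

| Obs. | $x_1$ | $x_2$ | $y$ | $x_1 y$ | $x_2 y$ | $x_1 x_2$ | $x_1^2$ | $x_2^2$ | $\hat{y}$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 200 | 100 | 100 | 20000 | 200 | 1 | 40000 | 82.99 |
| 2 | 5 | 700 | 300 | 1500 | 210000 | 3500 | 25 | 490000 | 305.20 |
| 3 | 8 | 800 | 400 | 3200 | 320000 | 6400 | 64 | 640000 | 394.73 |
| 4 | 6 | 400 | 200 | 1200 | 80000 | 2400 | 36 | 160000 | 241.55 |
| 5 | 3 | 100 | 100 | 300 | 10000 | 300 | 9 | 10000 | 95.92 |
| 6 | 10 | 600 | 400 | 4000 | 240000 | 6000 | 100 | 360000 | 379.61 |
| Total | 33 | 2800 | 1500 | 10300 | 880000 | 18800 | 235 | 1700000 | 1500 |

n = 6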
By solving the above system of normal equations, we should find the following:
b0 = 6.397
b1 = 20.492
b2 = 0.280
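The solution can be checked numerically. The following is a minimal sketch, not part of the original notes: it builds the three normal equations from the Example 1 data and solves them with Cramer's rule. The helper names `dot`, `det3`, and `cramer` are illustrative choices, and only the Python standard library is assumed.

```python
# Example 1 data (n = 6), as listed in the table.
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
y = [100, 300, 400, 200, 100, 400]
n = len(y)


def dot(u, v):
    """Sum of element-wise products, e.g. dot(x1, y) is the sum of x1*y."""
    return sum(a * b for a, b in zip(u, v))


# Coefficient matrix and right-hand side of the three normal equations.
A = [
    [n, sum(x1), sum(x2)],
    [sum(x1), dot(x1, x1), dot(x1, x2)],
    [sum(x2), dot(x1, x2), dot(x2, x2)],
]
rhs = [sum(y), dot(x1, y), dot(x2, y)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cramer(matrix, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(matrix)
    solution = []
    for j in range(3):
        # Replace column j with the right-hand side.
        replaced = [row[:j] + [b[i]] + row[j + 1:] for i, row in enumerate(matrix)]
        solution.append(det3(replaced) / d)
    return solution


b0, b1, b2 = cramer(A, rhs)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Cramer's rule is used here only because the system is 3 by 3; for larger k a general linear solver would be the more practical choice.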
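$$\sum (y_i - \hat{y}_i)^2 = (17.01)^2 + (-5.2)^2 + (5.27)^2 + (-41.55)^2 + (4.08)^2 + (20.39)^2 = 2502.954$$
$$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}} = \sqrt{\frac{2502.954}{6 - 2 - 1}} = 28.88$$
Note: $s$ is the point estimate of $\sigma$ (the standard deviation of the error variable $\varepsilon$).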
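The arithmetic behind $s$ can be reproduced in a few lines. This is a small sketch, assuming the observed $y$ values and the rounded fitted values $\hat{y}$ from the Example 1 table; the variable names are illustrative.

```python
import math

# Observed and fitted values from the Example 1 table (fitted values are rounded).
y = [100, 300, 400, 200, 100, 400]
y_hat = [82.99, 305.20, 394.73, 241.55, 95.92, 379.61]
n, k = len(y), 2

# Residuals e_i = y_i - y_hat_i and their sum of squares.
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
sse = sum(e ** 2 for e in residuals)

# Standard error of estimate with n - k - 1 degrees of freedom.
s = math.sqrt(sse / (n - k - 1))
print(f"SSE = {sse:.3f}, s = {s:.2f}")
```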
Testing the Validity of the Model: The Analysis of Variance (ANOVA) Test
Let's consider a simple linear regression model:
[Figure: scatter plot of y against x, with $\bar{y} = \sum y / n$ = the mean of y.]
$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
$y_i - \bar{y}$ = total deviations
$\hat{y}_i - \bar{y}$ = deviations of the estimated values from the mean
$y_i - \hat{y}_i$ = error deviations = $e_i$
$e_i = y_i - \hat{y}_i$ = the residual of the $i$th data point
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
that is, SST = SSR + SSE.
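The decomposition of the total variation can be checked on Example 1. The sketch below assumes the rounded fitted values from the Example 1 table, so the two sides agree only up to rounding error rather than exactly.

```python
# Observed and (rounded) fitted values from the Example 1 table.
y = [100, 300, 400, 200, 100, 400]
y_hat = [82.99, 305.20, 394.73, 241.55, 95.92, 379.61]
y_bar = sum(y) / len(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation

print(f"SST = {sst:.1f}")
print(f"SSR + SSE = {ssr:.1f} + {sse:.1f} = {ssr + sse:.1f}")
```

The small gap between SST and SSR + SSE comes from the fitted values being rounded to two decimals; with unrounded fitted values the identity holds exactly for a least-squares fit with an intercept.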
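| Source | df | SS | MS |
|---|---|---|---|
| Regression | k | SSR | MSR |
| Error | n - k - 1 | SSE | MSE |
| Total | n - 1 | SST | |

MSR = SSR / k
MSE = SSE / (n - k - 1)
Note: MSE = $s^2$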
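As a sketch of how the ANOVA quantities fit together, the snippet below fills in the table for Example 1 and forms the ratio F = MSR / MSE. It assumes the rounded sums of squares SSR = 92497 and SSE = 2503 quoted for Example 1 in these notes. The critical value 9.55 is the standard F-table entry for $\alpha = 0.05$ with (2, 3) degrees of freedom; it is an assumption brought in from a table, not a figure stated in the notes, and the F value printed is computed here rather than quoted.

```python
n, k = 6, 2
ssr, sse = 92497.0, 2503.0    # rounded sums of squares for Example 1
sst = ssr + sse

msr = ssr / k                 # mean square for regression, df = k
mse = sse / (n - k - 1)       # mean square for error, df = n - k - 1
f_stat = msr / mse

# Assumed table value: upper 5% point of F with (k, n - k - 1) = (2, 3) df.
f_crit = 9.55

print(f"{'Source':<12}{'df':>4}{'SS':>10}{'MS':>12}")
print(f"{'Regression':<12}{k:>4}{ssr:>10.0f}{msr:>12.2f}")
print(f"{'Error':<12}{n - k - 1:>4}{sse:>10.0f}{mse:>12.2f}")
print(f"{'Total':<12}{n - 1:>4}{sst:>10.0f}")
print(f"F = {f_stat:.2f}, critical value = {f_crit}")
print("Reject H0" if f_stat > f_crit else "Do not reject H0")
```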
The ANOVA test of the regression model in Example 1:
$$F = \frac{MSR}{MSE}$$
The t-test for X1:
$$TS = \frac{20.492 - 0}{5.882} = 3.48 > 3.182 \;\Rightarrow\; \text{Reject } H_0$$
p-value approach:
p-value = 0.04 < $\alpha$ = 0.05 $\Rightarrow$ Reject $H_0$
The slope $\beta_1$ is significant, that is, there is a meaningful relationship between $X_1$ and $Y$.
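The t-test for X2:
$H_0$: $X_2$ is not a significant independent variable ($\beta_2 = 0$)
$H_1$: $X_2$ is a significant independent variable ($\beta_2 \neq 0$)
$\alpha = 0.05$
$\alpha / 2 = 0.025$
df = n - k - 1 = 6 - 2 - 1 = 3
Critical values = ±3.182
Reject $H_0$ if TS < -3.182 or TS > 3.182
$$TS = \frac{b_2 - \beta_2}{S_{b_2}} = \frac{b_2 - 0}{S_{b_2}}$$
where $S_{b_2}$ = estimated standard deviation of $b_2$
$$TS = \frac{0.280 - 0}{0.069} = 4.089 > 3.182 \;\Rightarrow\; \text{Reject } H_0$$
(The value 4.089 is obtained from the unrounded $b_2$ and $S_{b_2}$; the rounded figures shown give about 4.06.)
p-value approach:
p-value = 0.026 < $\alpha$ = 0.05 $\Rightarrow$ Reject $H_0$
$X_2$ is also a significant independent variable.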
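The notes quote the standard deviations $S_{b_1}$ and $S_{b_2}$ without showing where they come from. The sketch below is one way to reproduce them for the two-variable case. It assumes the usual least-squares result that, for $k = 2$, $S_{b_1}^2 = MSE \cdot S_{22} / D$ and $S_{b_2}^2 = MSE \cdot S_{11} / D$, where $S_{jk}$ are sums of products of deviations from the means and $D = S_{11} S_{22} - S_{12}^2$. These formulas and the helper name `centered_sum` are not taken from the notes.

```python
import math

# Example 1 data.
x1 = [1, 5, 8, 6, 3, 10]
x2 = [200, 700, 800, 400, 100, 600]
y = [100, 300, 400, 200, 100, 400]
n, k = len(y), 2


def centered_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
s1y, s2y, syy = centered_sum(x1, y), centered_sum(x2, y), centered_sum(y, y)
d = s11 * s22 - s12 ** 2

# Slopes from the normal equations written in deviation form.
b1 = (s1y * s22 - s2y * s12) / d
b2 = (s2y * s11 - s1y * s12) / d

# Error variance estimate: SSE = SST - SSR, MSE = SSE / (n - k - 1).
sse = syy - b1 * s1y - b2 * s2y
mse = sse / (n - k - 1)

# Estimated standard deviations of the slopes (assumed k = 2 formulas).
se_b1 = math.sqrt(mse * s22 / d)
se_b2 = math.sqrt(mse * s11 / d)

t_crit = 3.182  # two-sided critical value for alpha = 0.05, df = 3 (from the notes)
ts = {}
for name, b, se in (("X1", b1, se_b1), ("X2", b2, se_b2)):
    ts[name] = (b - 0) / se
    decision = "reject H0" if abs(ts[name]) > t_crit else "do not reject H0"
    print(f"{name}: b = {b:.3f}, S_b = {se:.3f}, TS = {ts[name]:.3f} -> {decision}")
```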
If the model contains insignificant independent variables (the p-values of some regression coefficients are greater than $\alpha$), we should remove the most insignificant variable (the one with the highest p-value) and run the regression again using only the remaining variables. We then inspect the p-values of the coefficients in the new model and repeat the procedure, if necessary, until all the p-values are less than $\alpha$.
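The removal procedure can be sketched as a loop. The version below is an illustration under stated assumptions, not the notes' own method: to stay within the Python standard library it compares $|TS|$ with a two-sided critical value instead of comparing p-values with $\alpha$. Within one fitted model all coefficients share the same degrees of freedom, so the smallest $|TS|$ corresponds to the highest p-value. The function names `invert`, `fit`, and `backward_eliminate` and the small critical-value table are hypothetical.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit(columns, y):
    """Least-squares fit with an intercept; returns coefficients and t statistics."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    inv = invert(xtx)
    coef = [sum(inv[a][c] * xty[c] for c in range(p)) for a in range(p)]
    sse = sum((y[i] - sum(X[i][a] * coef[a] for a in range(p))) ** 2
              for i in range(n))
    mse = sse / (n - p)  # n - p equals n - k - 1
    t_stats = [coef[a] / math.sqrt(mse * inv[a][a]) for a in range(p)]
    return coef, t_stats


def backward_eliminate(data, y, t_crit):
    """Drop the least significant variable until all remaining ones are significant."""
    names = list(data)
    while names:
        _, t_stats = fit([data[name] for name in names], y)
        df = len(y) - len(names) - 1
        # Index 0 of t_stats is the intercept, so variable j sits at j + 1.
        worst = min(range(len(names)), key=lambda j: abs(t_stats[j + 1]))
        if abs(t_stats[worst + 1]) > t_crit[df]:
            break  # every remaining variable is significant
        names.pop(worst)
    return names


# Example 1 data; critical values are two-sided t-table entries for alpha = 0.05.
data = {
    "x1": [1, 5, 8, 6, 3, 10],
    "x2": [200, 700, 800, 400, 100, 600],
}
y = [100, 300, 400, 200, 100, 400]
t_crit = {3: 3.182, 4: 2.776}

kept = backward_eliminate(data, y, t_crit)
print("Variables kept:", kept)
```

For Example 1 both variables pass the test at df = 3, so nothing is removed; the loop only shortens the list when some $|TS|$ falls below the critical value for the current model.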
The Coefficient of Multiple Determination ($R^2$)
$$R^2 = \frac{SSR}{SST} = \frac{\text{explained variation}}{\text{total variation}}$$
In Example 1,
$$R^2 = \frac{92497}{92497 + 2503} = 0.974$$
The adjusted $R^2$ is
$$R^2_{adj} = 1 - \frac{SSE / (n - k - 1)}{SST / (n - 1)} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$
If n is considerably larger than k, the actual and adjusted $R^2$ values will be similar. But if SSE is quite different from 0 and k is large compared to n, the actual and adjusted values of $R^2$ will differ substantially.
In Example 1,
$$R^2_{adj} = 1 - \frac{2502.636 / 3}{95000 / 5} = 1 - \frac{834.212}{19000} = 0.956$$
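Both forms of the adjusted $R^2$ formula can be evaluated side by side as a quick check. This sketch assumes the Example 1 values SSE = 2502.636 and SST = 95000 used in the calculation above; the variable names are illustrative.

```python
n, k = 6, 2
sse, sst = 2502.636, 95000.0

# Coefficient of multiple determination, written as 1 - SSE/SST (equal to SSR/SST).
r2 = 1 - sse / sst

# Adjusted R^2, first from the mean squares, then from R^2 itself.
r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
r2_adj_alt = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(f"R^2 = {r2:.3f}")
print(f"adjusted R^2 = {r2_adj:.3f} (alternative form: {r2_adj_alt:.3f})")
```

With only six observations and two independent variables, the adjustment lowers $R^2$ from about 0.974 to about 0.956.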