Chapter 17
Introduction
Building a house costs about $75 per square foot.
Most lots sell for $25,000.
\[ \text{House cost} = 25{,}000 + 75(\text{Size}) \]
[Figure: house cost plotted against house size.]
However, house costs vary even among houses of the same size! Since costs behave unpredictably, we add a random component.
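A minimal sketch of what adding a random component means, using the $25,000 lot price and $75 per square foot from the text. The normally distributed error and its standard deviation (`error_sd`) are assumptions made only for illustration; the slides do not give them.

```python
import random

LOT_PRICE = 25_000       # dollars, from the text
COST_PER_SQ_FT = 75      # dollars per square foot, from the text


def expected_cost(size_sq_ft):
    """Deterministic part of the model: 25,000 + 75 * Size."""
    return LOT_PRICE + COST_PER_SQ_FT * size_sq_ft


def simulated_cost(size_sq_ft, rng, error_sd=10_000):
    """Deterministic part plus a random error term.

    error_sd is a hypothetical value, not taken from the slides.
    """
    return expected_cost(size_sq_ft) + rng.gauss(0, error_sd)


rng = random.Random(17)  # fixed seed so the run is repeatable
same_size_costs = [simulated_cost(2_000, rng) for _ in range(5)]

print(expected_cost(2_000))                   # same for every 2,000 sq ft house
print([round(c) for c in same_size_costs])    # varies from house to house
```

Every simulated house has the same size, yet the costs differ, which is the behavior the random component is meant to capture.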
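Let us compare two lines.
The second line is horizontal.
[Figure: two candidate lines drawn through the data points (1, 2), (2, 4), (3, 1.5), and (4, 3.2); the vertical axis is marked at 1, 2, 2.5, 3, and 4.]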
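One way to compare two candidate lines is to add up the squared vertical differences between the points and each line. The sketch below does that for the four plotted points. The equations of the lines are not recoverable from the figure, so `first_line` (y = x) and `horizontal_line` (y = 2.5) are assumptions chosen for illustration.

```python
# The four data points shown in the figure.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]


def first_line(x):
    """Assumed first candidate line: y = x."""
    return x


def horizontal_line(x):
    """Assumed second candidate line: horizontal at y = 2.5."""
    return 2.5


def sum_squared_differences(line, data):
    """Sum of squared vertical distances between the data and a line."""
    return sum((y - line(x)) ** 2 for x, y in data)


ssd_first = sum_squared_differences(first_line, points)
ssd_horizontal = sum_squared_differences(horizontal_line, points)

print(f"first line:      {ssd_first:.2f}")
print(f"horizontal line: {ssd_horizontal:.2f}")
```

Under these assumed lines, the one with the smaller sum of squared differences fits the four points better.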
To calculate the estimates of the line coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
\[ b_1 = \frac{\operatorname{cov}(X,Y)}{s_X^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} \]
The regression equation that estimates the equation of the first-order linear model is:
\[ \hat{Y} = b_0 + b_1 X \]
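The two formulas translate directly into code. The following is a minimal sketch, not a library routine; the function name `least_squares_line` and the four sample points are chosen here for illustration.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = cov(X,Y) / s_X^2 and b0 = Ybar - b1 * Xbar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sample covariance and sample variance, both with n - 1 in the denominator.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Small illustrative data set.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]

b0, b1 = least_squares_line(xs, ys)
print(f"Y-hat = {b0:.2f} + {b1:.2f} X")
```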
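The Simple Linear Regression Line
\[ \bar{Y} = 14{,}822.823; \qquad \operatorname{cov}(X,Y) = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} = -2{,}712{,}511 \]
where n = 100.
\[ b_1 = \frac{\operatorname{cov}(X,Y)}{s_X^2} = \frac{-2{,}712{,}511}{43{,}528{,}690} = -.06232 \]
\[ b_0 = \bar{Y} - b_1 \bar{X} = 14{,}822.82 - (-.06232)(36{,}009.45) = 17{,}067 \]
\[ \hat{Y} = b_0 + b_1 X = 17{,}067 - .0623X \]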
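The arithmetic can be checked from the summary statistics alone. In this sketch the covariance is entered as negative, which agrees with the Odometer coefficient of -0.0623 in the regression output; the variable names are chosen here for illustration.

```python
# Summary statistics quoted in the example (n = 100 cars).
x_bar = 36_009.45      # mean odometer reading
y_bar = 14_822.823     # mean price
var_x = 43_528_690     # sample variance of X
cov_xy = -2_712_511    # sample covariance of X and Y

b1 = cov_xy / var_x          # slope
b0 = y_bar - b1 * x_bar      # intercept

print(f"b1 = {b1:.5f}")
print(f"b0 = {b0:,.0f}")
```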
• Solution – continued
– Using the computer (Xm17-02)
\[ \hat{Y} = 17{,}067 - .0623X \]

Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       303.1
Observations         100

ANOVA
              df    SS          MS          F         Significance F
Regression    1     16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000
Interpreting the Linear Regression Equation
\[ \hat{Y} = 17{,}067 - .0623X \]
[Figure: Odometer Line Fit Plot, Price (13,000 to 16,000) against Odometer. The intercept 17,067 corresponds to Odometer = 0, where there are no data.]
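Read as a prediction rule, the slope is the change in estimated price per additional unit on the odometer. The sketch below evaluates the fitted equation at two readings inside the range of the listed data. The readings 40,000 and 41,000 are chosen here for illustration, and the units (dollars and miles) are an assumption, since they are not stated.

```python
B0 = 17_067     # intercept from the regression output
B1 = -0.0623    # slope (Odometer coefficient) from the regression output


def predicted_price(odometer):
    """Estimated price from the fitted line Y-hat = 17,067 - .0623 X."""
    return B0 + B1 * odometer


price_40k = predicted_price(40_000)
price_41k = predicted_price(41_000)

print(f"{price_40k:,.0f}")               # estimated price at 40,000
print(f"{price_41k - price_40k:,.1f}")   # change for 1,000 additional units
```

The intercept itself should not be read as the price of a car with a zero odometer reading, because the data contain no observations near zero.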
From the first three assumptions we have: Y is normally distributed with mean \( E(Y) = \beta_0 + \beta_1 X \) and a constant standard deviation \( \sigma_\varepsilon \).
[Figure: distributions of Y centered on the line \( \beta_0 + \beta_1 X \) at \( X_1, X_2, X_3 \).]
Assessing the Model
• The least squares method produces a regression line whether or not there is a linear relationship between X and Y.
• Consequently, it is important to assess how well
the linear model fits the data.
• Several methods are used to assess the model.
All are based on the sum of squares for errors,
SSE.
Sum of Squares for Errors
– This is the sum of squared differences between the points and the regression line.
– It can serve as a measure of how well the line fits the data. SSE is defined by
\[ \text{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 . \]
– A shortcut formula
\[ \text{SSE} = (n-1)\left[ s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2} \right] \]
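The definition and the shortcut give the same number when the line is the least squares line. The following sketch checks this on a small data set chosen here for illustration.

```python
from statistics import mean

# Small illustrative data set.
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)

# Least squares coefficients.
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

# SSE from the definition: sum of squared residuals.
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# SSE from the shortcut formula.
sse_shortcut = (n - 1) * (var_y - cov_xy ** 2 / var_x)

print(f"direct:   {sse_direct:.4f}")
print(f"shortcut: {sse_shortcut:.4f}")
```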
Standard Error of Estimate
– The mean error is equal to zero.
– If \( \sigma_\varepsilon \) is small, the errors tend to be close to zero (close to the mean error). Then the model fits the data well.
– Therefore, we can use \( \sigma_\varepsilon \) as a measure of the suitability of using a linear model.
– An estimator of \( \sigma_\varepsilon \) is given by \( s_\varepsilon \), the standard error of estimate:
\[ s_\varepsilon = \sqrt{\frac{\text{SSE}}{n-2}} \]
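• Example 17.3
– Calculate the standard error of estimate for Example 17.2, and describe what it tells you about the model fit.
• Solution
\[ s_Y^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1} = 259{,}996 \quad \text{(calculated before)} \]
\[ \text{SSE} = (n-1)\left[ s_Y^2 - \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2} \right] = 99\left[ 259{,}996 - \frac{(-2{,}712{,}511)^2}{43{,}528{,}690} \right] = 9{,}005{,}450 \]
\[ s_\varepsilon = \sqrt{\frac{\text{SSE}}{n-2}} = \sqrt{\frac{9{,}005{,}450}{98}} = 303.13 \]
It is hard to assess the model based on \( s_\varepsilon \), even when compared with the mean value of Y:
\[ s_\varepsilon = 303.1, \qquad \bar{y} = 14{,}823 \]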
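The solution can be reproduced from the summary statistics. In this sketch the computed SSE differs slightly from 9,005,450 because the inputs are rounded; the variable names are chosen here for illustration.

```python
from math import sqrt

n = 100
var_y = 259_996        # sample variance of Y
var_x = 43_528_690     # sample variance of X
cov_xy = -2_712_511    # sample covariance of X and Y
y_bar = 14_823         # mean of Y, as quoted in the solution

# Shortcut formula for the sum of squares for errors.
sse = (n - 1) * (var_y - cov_xy ** 2 / var_x)

# Standard error of estimate.
s_eps = sqrt(sse / (n - 2))

# Size of the standard error relative to the mean of Y.
ratio = s_eps / y_bar

print(f"SSE   = {sse:,.0f}")
print(f"s     = {s_eps:.1f}")
print(f"s / y = {ratio:.3f}")
```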
Testing the Slope
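– When no linear relationship exists between two variables, the regression line should be horizontal, so we test \( H_0\colon \beta_1 = 0 \) against \( H_1\colon \beta_1 \neq 0 \).
\[ b_1 = -.0623, \qquad s_\varepsilon = 303.1 \]
\[ s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_X^2}} = \frac{303.1}{\sqrt{(99)(43{,}528{,}690)}} = .00462 \]
\[ t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0623 - 0}{.00462} = -13.49 \]
– The rejection region is \( t > t_{.025} \) or \( t < -t_{.025} \) with \( \nu = n - 2 = 98 \). Approximately, \( t_{.025} = 1.984 \).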
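The test statistic and the decision can be sketched as follows. The critical value 1.984 is the approximate one quoted above rather than one computed from a t distribution, and the variable names are chosen here for illustration.

```python
from math import sqrt

n = 100
b1 = -0.0623           # estimated slope
s_eps = 303.1          # standard error of estimate
var_x = 43_528_690     # sample variance of X
t_crit = 1.984         # approximate t(.025) with 98 degrees of freedom

# Standard error of the slope.
s_b1 = s_eps / sqrt((n - 1) * var_x)

# Test statistic under the hypothesized slope of 0.
t_stat = (b1 - 0) / s_b1

# Two-tail rejection region: t > t_crit or t < -t_crit.
reject_h0 = abs(t_stat) > t_crit

print(f"s_b1 = {s_b1:.5f}")
print(f"t    = {t_stat:.2f}")
print(f"reject H0: {reject_h0}")
```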
Xm17-02
• Using the computer

Price    Odometer
14636    37388
14122    44758
14016    45833
15590    30862
15568    31705
14718    34010
14470    45854
15690    19057
15072    40149
14802    40237
15190    32359
14660    43533
15612    32744
15610    34470
14634    37720
14632    41350
15740    24469

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       303.1
Observations         100

ANOVA
              df    SS          MS          F         Significance F
Regression    1     16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

              Coefficients    Standard Error    t Stat    P-value
Intercept     17067           169               100.97    0.0000
Odometer      -0.0623         0.0046            -13.49    0.0000

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
Coefficient of Determination
– To measure the strength of the linear relationship we use the coefficient of determination:
\[ R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2 \, s_Y^2} \quad (\text{or, } r_{XY}^2); \]
\[ \text{or, } R^2 = 1 - \frac{\text{SSE}}{\sum (Y_i - \bar{Y})^2} \quad \text{(see p. 18 above)} \]
• To understand the significance of this coefficient note:
Total variation in Y = Variation explained by the regression line + Unexplained variation (error)
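• \( R^2 \) measures the proportion of the variation in Y that is explained by the variation in X.
\[ R^2 = 1 - \frac{\text{SSE}}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (Y_i - \bar{Y})^2 - \text{SSE}}{\sum (Y_i - \bar{Y})^2} = \frac{\text{SSR}}{\sum (Y_i - \bar{Y})^2} \]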
\[ R^2 = \frac{[\operatorname{cov}(X,Y)]^2}{s_X^2 \, s_Y^2} = \frac{[-2{,}712{,}511]^2}{(43{,}528{,}688)(259{,}996)} = .6501 \]
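Both expressions for the coefficient of determination can be evaluated side by side. This sketch uses the summary statistics from the calculation above and the SSE and total sum of squares from the ANOVA table; the variable names are chosen here for illustration.

```python
cov_xy = -2_712_511      # sample covariance of X and Y
var_x = 43_528_688       # sample variance of X, as written in the calculation
var_y = 259_996          # sample variance of Y
sse = 9_005_450          # residual sum of squares (ANOVA table)
total_ss = 25_739_561    # total sum of squares (ANOVA table)

# Covariance form.
r2_from_cov = cov_xy ** 2 / (var_x * var_y)

# Sum of squares form.
r2_from_sse = 1 - sse / total_ss

print(f"{r2_from_cov:.4f}")
print(f"{r2_from_sse:.4f}")
```

The two values agree to four decimal places; any remaining difference comes from rounding in the inputs.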
– Using the computer
From the regression output we have

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.8063
R Square             0.6501
Adjusted R Square    0.6466
Standard Error       303.1
Observations         100

ANOVA
              df    SS          MS          F         Significance F
Regression    1     16734111    16734111    182.11    0.0000
Residual      98    9005450     91892
Total         99    25739561

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
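The entries of the output are tied together by a few identities, which the sketch below recomputes from the three sums of squares. The adjusted R Square formula used here is the standard one for a single predictor; it is not stated in the slides, so treat it as an assumption.

```python
from math import sqrt

n = 100
ssr = 16_734_111       # regression sum of squares
sse = 9_005_450        # residual sum of squares
total = 25_739_561     # total sum of squares

df_regression = 1
df_residual = n - 2

# Mean squares and the F statistic.
msr = ssr / df_regression
mse = sse / df_residual
f_stat = msr / mse

# Regression statistics.
r_square = ssr / total
multiple_r = sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)   # assumed formula
standard_error = sqrt(mse)

print(f"MS residual       = {mse:,.0f}")
print(f"F                 = {f_stat:.2f}")
print(f"Multiple R        = {multiple_r:.4f}")
print(f"R Square          = {r_square:.4f}")
print(f"Adjusted R Square = {adjusted_r_square:.4f}")
print(f"Standard Error    = {standard_error:.1f}")
```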