Chapter 10
Simple Linear Regression and Correlation
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
Learning Objectives
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable
[Scatter plot: Sales against Advertising. Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.]
The scatter of points tends to be distributed around a positively sloped straight line. The pairs of values of advertising expenditures and sales are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.
[Figure: example scatter plots showing various possible patterns of Y against X.]
Model Building

The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

Statistical model = systematic component + random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
The population simple linear regression model:

Y = β0 + β1·X + ε

Nonrandom or systematic component: β0 + β1·X
Random component: ε

where
Y is the dependent variable, the variable we wish to explain or predict;
X is the independent variable, also called the predictor variable;
ε is the error term, the only random component in the model, and thus the only source of randomness in Y;
β0 is the intercept of the systematic component of the regression relationship;
β1 is the slope of the systematic component.

The conditional mean of Y: E[Y|X] = β0 + β1·X
E[Y_i] = β0 + β1·X_i

Actual observed values of Y differ from the expected value by an unexplained or random error:

Y_i = E[Y_i] + ε_i
    = β0 + β1·X_i + ε_i

[Figure: the population regression line with intercept β0 and slope β1, and observed points scattered around it.]
Assumptions of the Simple Linear Regression Model:

• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε_i.
• The errors ε_i are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).

[Figure: identical normal distributions of errors, all centered on the regression line E[Y] = β0 + β1·X.]
The estimated regression equation:

Y = b0 + b1·X + e

where b0 estimates the intercept of the population regression line, β0;
b1 estimates the slope of the population regression line, β1;
and e stands for the observed errors, the residuals from fitting the estimated regression line b0 + b1·X to a set of n points.

The estimated regression line:

Ŷ = b0 + b1·X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
[Figure: Errors in Regression. For an observed data point (X_i, Y_i), the fitted regression line Ŷ = b0 + b1·X gives the predicted value Ŷ_i; the error is e_i = Y_i − Ŷ_i.]
The least squares regression line is the line that minimizes the SSE with respect to the estimates b0 and b1.

The normal equations:

Σy = n·b0 + b1·Σx
Σxy = b0·Σx + b1·Σx²

[Figure: the SSE surface as a function of b0 and b1; at the least-squares solution, SSE is minimized with respect to both b0 and b1.]

Sums of squares and cross products:

SS_X = Σ(x − x̄)² = Σx² − (Σx)²/n
SS_Y = Σ(y − ȳ)² = Σy² − (Σy)²/n
SS_XY = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

b1 = SS_XY / SS_X
b0 = ȳ − b1·x̄
Example 10-1

Miles    Dollars   Miles²        Miles × Dollars
1211     1802      1466521       2182222
1345     2405      1809025       3234725
1422     2005      2022084       2851110
1687     2511      2845969       4236057
1849     2332      3418801       4311868
2026     2305      4104676       4669930
2133     3016      4549689       6433128
2253     3385      5076009       7626405
2400     3090      5760000       7416000
2468     3694      6091024       9116792
2699     3371      7284601       9098329
2806     3998      7873636       11218388
3082     3555      9498724       10956510
3209     4692      10297681      15056628
3466     4244      12013156      14709704
3643     5298      13271449      19300614
3852     4801      14837904      18493452
4033     5147      16265089      20757852
4267     5738      18207288      24484046
4498     6420      20232004      28877160
4533     6059      20548088      27465448
4804     6426      23078416      30870504
5090     6321      25908100      32173890
5233     7026      27384288      36767056
5439     6964      29582720      37877196
79,448   106,605   293,426,946   390,185,014

SS_X = Σx² − (Σx)²/n = 293,426,946 − (79,448)²/25 = 40,947,557.84

SS_XY = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4

b1 = SS_XY / SS_X = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26

b0 = ȳ − b1·x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85
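The hand computation can be reproduced in a few lines of plain Python (a sketch; the variable names are ours, the data are the 25 pairs from the table):

```python
# Least-squares estimates for Example 10-1, using the SS_X and SS_XY
# shortcut formulas from the text (pure Python, no libraries).
miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
         2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
         4533, 4804, 5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
           3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
           6059, 6426, 6321, 7026, 6964]

n = len(miles)
sum_x, sum_y = sum(miles), sum(dollars)
sum_x2 = sum(x * x for x in miles)
sum_xy = sum(x * y for x, y in zip(miles, dollars))

ss_x = sum_x2 - sum_x ** 2 / n          # SS_X = sum x^2 - (sum x)^2 / n
ss_xy = sum_xy - sum_x * sum_y / n      # SS_XY = sum xy - (sum x)(sum y) / n

b1 = ss_xy / ss_x                       # slope estimate
b0 = sum_y / n - b1 * sum_x / n         # intercept: b0 = y-bar - b1 * x-bar
print(b1, b0)
```

This recovers b1 ≈ 1.2553 and b0 ≈ 274.85, matching the values computed above.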
The standard error of b0 (intercept):

s(b0) = s·√( Σx² / (n·SS_X) )

where s = √MSE.

Example 10-1:
s(b0) = 318.158·√( 293,426,946 / (25 × 40,947,557.84) ) = 170.338

The standard error of b1 (slope):

s(b1) = s / √SS_X

Example 10-1:
s(b1) = 318.158 / √40,947,557.84 = 0.04972
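A quick numerical check of these two standard errors (a sketch using the values quoted above):

```python
import math

# Standard errors of b0 and b1 for Example 10-1, from s = sqrt(MSE),
# the sum of x^2, and SS_X quoted in the text.
n = 25
s = 318.158                   # sqrt(MSE)
sum_x2 = 293_426_946          # sum of x^2
ss_x = 40_947_557.84          # SS_X

s_b0 = s * math.sqrt(sum_x2 / (n * ss_x))   # standard error of the intercept
s_b1 = s / math.sqrt(ss_x)                  # standard error of the slope
print(s_b0, s_b1)
```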
A 95% confidence interval for the slope, β1 (Example 10-1):

b1 ± t(0.025, 25−2)·s(b1) = 1.25533 ± (2.069)(0.04972)
                          = 1.25533 ± 0.10287
                          = [1.15246, 1.35820]

Since 0 is not in this interval, 0 is not a possible value of the regression slope at the 95% confidence level.
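The interval can be reproduced directly (a sketch; t = 2.069 is the tabulated critical value used in the text):

```python
# 95% confidence interval for the slope in Example 10-1:
# b1 +/- t(0.025, 23) * s(b1).
b1 = 1.25533
t_crit = 2.069        # t critical value, 23 degrees of freedom
s_b1 = 0.04972        # standard error of the slope

half_width = t_crit * s_b1
ci = (b1 - half_width, b1 + half_width)
print(ci)             # zero is not inside this interval
```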
10-5 Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by ρ, can take on any value from −1 to 1.

ρ = −1      indicates a perfect negative linear relationship
−1 < ρ < 0  indicates a negative linear relationship
ρ = 0       indicates no linear relationship
0 < ρ < 1   indicates a positive linear relationship
ρ = 1       indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.

Illustrations of Correlation
[Figure: scatter plots of Y against X for ρ = −1, ρ = 0, and ρ = 1.]
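For the Example 10-1 data, the sample correlation follows the analogous formula r = SS_XY / √(SS_X·SS_Y); a sketch using the sums of squares computed earlier (taking SS_Y equal to SST from the ANOVA table):

```python
import math

# Sample correlation for Example 10-1 from the sums of squares.
ss_xy = 51_402_852.4
ss_x = 40_947_557.84
ss_y = 66_855_898.0          # SS_Y = SST from the ANOVA table

r = ss_xy / math.sqrt(ss_x * ss_y)
print(r, r ** 2)             # r^2 matches the coefficient of determination
```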
A hypothesis test for the existence of a linear relationship between X and Y:

H0: β1 = 0
H1: β1 ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

t(n−2) = b1 / s(b1)

where b1 is the least-squares estimate of the regression slope and s(b1) is the standard error of b1. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.
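Applied to Example 10-1 (a sketch with the estimates quoted earlier), the test statistic lands far beyond the critical value t(0.025, 23) = 2.069, so H0: β1 = 0 is rejected:

```python
# Slope t statistic for Example 10-1.
b1 = 1.25533
s_b1 = 0.04972

t_stat = b1 / s_b1
print(t_stat)        # compare with the critical value 2.069
```

Note that t² ≈ 637.5, which agrees with the F ratio in the ANOVA table below; for simple regression the slope t test and the F test are equivalent.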
Total Deviation = Unexplained Deviation + Explained Deviation:

Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²

SST = SSE + SSR

[Figure: the split of SST into SSE and SSR for r² = 0, r² = 0.50, and r² = 0.90; r² = SSR/SST is the fraction of the variation in Y explained by the regression.]

Example 10-1:

r² = SSR/SST = 64,527,736.8 / 66,855,898.0 = 0.96518

[Scatter plot of Dollars against Miles with the fitted regression line.]
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST

Example 10-1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F Ratio   p Value
Regression            64,527,736.8     1                    64,527,736.8   637.47    0.000
Error                 2,328,161.2      23                   101,224.4
Total                 66,855,898.0     24
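The Example 10-1 row of the table can be reproduced from SSR, SSE, and n (a sketch):

```python
# ANOVA quantities for Example 10-1.
ssr = 64_527_736.8
sse = 2_328_161.2
n = 25

msr = ssr / 1            # regression mean square, 1 degree of freedom
mse = sse / (n - 2)      # error mean square, n - 2 degrees of freedom
f_ratio = msr / mse
print(mse, f_ratio)
```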
[Residual plots: residuals plotted against x or ŷ, and against time, centered on 0, used to check for model inadequacies.]
• Point Prediction
  A single-valued estimate of Y for a given value of X, obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval
  For a value of Y given a value of X:
    Variation in regression line estimate
    Variation of points around regression line
  For an average value of Y given a value of X:
    Variation in regression line estimate
[Figure: confidence bands for E[Y|X] and prediction bands for Y around the regression line; the prediction interval additionally reflects the variation of points around the line.]
A (1 − α)100% prediction interval for Y:

ŷ ± t(α/2)·s·√( 1 + 1/n + (x − x̄)²/SS_X )

Example 10-1 (X = 4,000):

5,296.05 ± (2.069)(318.16)·√( 1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84 )
= 5,296.05 ± 676.62 = [4,619.43, 5,972.67]
A (1 − α)100% confidence interval for the conditional mean of Y, E[Y|X]:

ŷ ± t(α/2)·s·√( 1/n + (x − x̄)²/SS_X )

Example 10-1 (X = 4,000):

5,296.05 ± (2.069)(318.16)·√( 1/25 + (4,000 − 3,177.92)²/40,947,557.84 )
= 5,296.05 ± 156.48 = [5,139.57, 5,452.53]
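Both intervals at X = 4,000 can be reproduced from the quantities quoted earlier (a sketch; s = 318.158 and t = 2.069 are the values used in the text):

```python
import math

# Prediction interval for Y and confidence interval for E[Y|X]
# at X = 4,000 in Example 10-1.
n = 25
x = 4000
x_bar = 79_448 / n                   # 3,177.92
ss_x = 40_947_557.84
s = 318.158                          # sqrt(MSE)
t_crit = 2.069                       # t(0.025, 23)
y_hat = 274.85 + 1.255333776 * x     # point prediction, about 5,296

pred_half = t_crit * s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / ss_x)
conf_half = t_crit * s * math.sqrt(1 / n + (x - x_bar) ** 2 / ss_x)
print(y_hat - pred_half, y_hat + pred_half)   # prediction interval for Y
print(y_hat - conf_half, y_hat + conf_half)   # confidence interval for E[Y|X]
```

The prediction interval is wider because it includes the extra "1 +" term for the variation of individual points around the line.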
• The Case of Independent Random Variables:

For independent random variables X1, X2, …, Xn, the expected value of the sum is given by:

E(X1 + X2 + … + Xn) = E(X1) + E(X2) + … + E(Xn)

For independent random variables X1, X2, …, Xn, the variance of the sum is given by:

V(X1 + X2 + … + Xn) = V(X1) + V(X2) + … + V(Xn)
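These additivity rules can be verified by exhaustive enumeration on small discrete distributions (a sketch with made-up pmfs, chosen only for illustration):

```python
from itertools import product

# Two independent discrete random variables (hypothetical pmfs).
x1 = {0: 0.5, 1: 0.5}                 # fair coin
x2 = {1: 0.2, 2: 0.3, 3: 0.5}         # arbitrary three-point distribution

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

# Exact distribution of the sum X1 + X2 under independence.
sum_pmf = {}
for (v1, p1), (v2, p2) in product(x1.items(), x2.items()):
    sum_pmf[v1 + v2] = sum_pmf.get(v1 + v2, 0) + p1 * p2

print(mean(sum_pmf), mean(x1) + mean(x2))   # means agree
print(var(sum_pmf), var(x1) + var(x2))      # variances agree
```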
• The covariance between two random variables X1 and X2 is given by:

Cov(X1, X2) = E{[X1 − E(X1)][X2 − E(X2)]}

• A simpler measure of covariance is given by:

Cov(X1, X2) = ρ·SD(X1)·SD(X2), where ρ is the correlation between X1 and X2.
• The Case of Dependent Random Variables with Weights:

For dependent random variables X1, X2, …, Xn, with respective weights α1, α2, …, αn, the variance of the sum is given by:

V(α1·X1 + α2·X2 + … + αn·Xn) = α1²·V(X1) + α2²·V(X2) + … + αn²·V(Xn) + 2·α1·α2·Cov(X1, X2) + … + 2·αn−1·αn·Cov(Xn−1, Xn)
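The identity can be checked on a small made-up joint distribution (the joint pmf and the weights below are arbitrary, chosen only for illustration):

```python
# Verify V(a1*X1 + a2*X2) = a1^2 V(X1) + a2^2 V(X2) + 2 a1 a2 Cov(X1, X2)
# by exhaustive enumeration of a hypothetical joint pmf.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}  # P(X1=v1, X2=v2)
a1, a2 = 2.0, -1.5                                            # arbitrary weights

e1 = sum(v1 * p for (v1, v2), p in joint.items())
e2 = sum(v2 * p for (v1, v2), p in joint.items())
var1 = sum((v1 - e1) ** 2 * p for (v1, v2), p in joint.items())
var2 = sum((v2 - e2) ** 2 * p for (v1, v2), p in joint.items())
cov = sum((v1 - e1) * (v2 - e2) * p for (v1, v2), p in joint.items())

# Direct variance of the weighted sum W = a1*X1 + a2*X2 ...
ew = sum((a1 * v1 + a2 * v2) * p for (v1, v2), p in joint.items())
vw = sum((a1 * v1 + a2 * v2 - ew) ** 2 * p for (v1, v2), p in joint.items())

# ... equals the weighted-variance formula.
formula = a1 ** 2 * var1 + a2 ** 2 * var2 + 2 * a1 * a2 * cov
print(vw, formula)
```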