COMPLETE
BUSINESS
STATISTICS
by
AMIR D. ACZEL
&
JAYAVEL SOUNDERPANDIAN
7th edition.
Prepared by Lloyd Jaisingh, Morehead State
University
Chapter 10
Simple Linear Regression and Correlation
McGraw-Hill/Irwin Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
10 Simple Linear Regression and Correlation
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression
10 LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given
instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random
variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable
10 LEARNING OBJECTIVES (continued)
After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check if the assumptions about the
regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression
10-1 Using Statistics
• Regression refers to the statistical technique of modeling the
relationship between variables.
• In simple linear regression, we model the relationship
between two variables.
• One of the variables, denoted by Y, is called the dependent
variable and the other, denoted by X, is called the
independent variable.
• The model we will use to depict the relationship between X and
Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
10-1 Using Statistics
[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
Examples of Other Scatterplots
[Figure: six example scatterplots showing other possible patterns of relationship between X and Y.]
Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.
A statistical model separates the systematic component of a relationship from the random component:
Data = Systematic component + Random errors (the statistical model)
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
10-2 The Simple Linear Regression
Model
The population simple linear regression model:
Y = β₀ + β₁X + ε
where β₀ + β₁X is the nonrandom or systematic component and ε is the random component, and where:
Y is the dependent variable, the variable we wish to explain or predict;
X is the independent variable, also called the predictor variable;
ε is the error term, the only random component in the model, and thus the only source of randomness in Y;
β₀ is the intercept of the systematic component of the regression relationship;
β₁ is the slope of the systematic component.
The conditional mean of Y: E[Y|X] = β₀ + β₁X
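To make the components concrete, here is a minimal Python sketch (not from the text; the values β₀ = 275, β₁ = 1.25, and σ = 300 are invented for illustration) that draws observations from the population model with X held fixed:

import random

# Hypothetical population parameters, chosen only for illustration.
beta0, beta1, sigma = 275.0, 1.25, 300.0
x_values = [1000, 2000, 3000, 4000, 5000]   # X is treated as fixed, not random

for x in x_values:
    eps = random.gauss(0, sigma)   # epsilon ~ N(0, sigma^2): the only random component
    ey = beta0 + beta1 * x         # conditional mean E[Y|X] = beta0 + beta1*X
    y = ey + eps                   # observed Y = systematic part + random error
    print(f"x={x}  E[Y|X]={ey:.1f}  y={y:.1f}")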
Picturing the Simple Linear
Regression Model
[Figure: Regression Plot showing the line E[Y] = β₀ + β₁X with intercept β₀ and slope β₁, and an observed point Y_i lying off the line by the error ε_i.]
The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
E[Y_i] = β₀ + β₁X_i
Actual observed values of Y differ from the expected value by an unexplained or random error:
Y_i = E[Y_i] + ε_i = β₀ + β₁X_i + ε_i
Assumptions of the Simple Linear
Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term ε_i.
• The errors ε_i are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).
[Figure: identical normal distributions of errors, all centered on the regression line E[Y] = β₀ + β₁X.]
10-3 Estimation: The Method of Least
Squares
Estimation of a simple linear regression relationship involves finding
estimated or predicted values of the intercept and slope of the linear
regression line.
The estimated regression equation:
Y = b₀ + b₁X + e
where b₀ estimates the intercept of the population regression line, β₀;
b₁ estimates the slope of the population regression line, β₁;
and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.
The estimated regression line:
Ŷ = b₀ + b₁X
where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
Fitting a Regression Line
[Figure: four panels: the data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized.]
Errors in Regression
[Figure: the observed data point (X_i, Y_i), the fitted regression line Ŷ = b₀ + b₁X, and the predicted value Ŷ_i of Y for X_i.]
Error: e_i = Y_i - Ŷ_i
Least Squares Regression
The sum of squared errors in regression is:
SSE = Σ e_i² = Σ (y_i - ŷ_i)²   (summed over i = 1, …, n)
The least squares regression line is the line that minimizes the SSE with respect to the estimates b₀ and b₁.
The normal equations:
Σ y_i = n·b₀ + b₁·Σ x_i
Σ x_i·y_i = b₀·Σ x_i + b₁·Σ x_i²
[Figure: the SSE surface over candidate values of b₀ and b₁; at the least squares estimates, SSE is minimized with respect to both b₀ and b₁.]
Sums of Squares, Cross Products,
and Least Squares Estimators
Sums of Squares and Cross Products:
SS_x = Σ(x - x̄)² = Σx² - (Σx)²/n
SS_y = Σ(y - ȳ)² = Σy² - (Σy)²/n
SS_xy = Σ(x - x̄)(y - ȳ) = Σxy - (Σx)(Σy)/n
Least squares regression estimators:
b₁ = SS_XY / SS_X
b₀ = ȳ - b₁·x̄
Example 10-1
Miles   Dollars   Miles²      Miles×Dollars
1211    1802      1466521     2182222
1345    2405      1809025     3234725
1422    2005      2022084     2851110
1687    2511      2845969     4236057
1849    2332      3418801     4311868
2026    2305      4104676     4669930
2133    3016      4549689     6433128
2253    3385      5076009     7626405
2400    3090      5760000     7416000
2468    3694      6091024     9116792
2699    3371      7284601     9098329
2806    3998      7873636     11218388
3082    3555      9498724     10956510
3209    4692      10297681    15056628
3466    4244      12013156    14709704
3643    5298      13271449    19300614
3852    4801      14837904    18493452
4033    5147      16265089    20757852
4267    5738      18207288    24484046
4498    6420      20232004    28877160
4533    6059      20548088    27465448
4804    6426      23078416    30870504
5090    6321      25908100    32173890
5233    7026      27384288    36767056
5439    6964      29582720    37877196
Total   79,448    106,605     293,426,946   390,185,014

SS_x = Σx² - (Σx)²/n = 293,426,946 - (79,448)²/25 = 40,947,557.84
SS_xy = Σxy - (Σx)(Σy)/n = 390,185,014 - (79,448)(106,605)/25 = 51,402,852.4
b₁ = SS_XY / SS_X = 51,402,852.4 / 40,947,557.84 = 1.255333776 ≈ 1.26
b₀ = ȳ - b₁·x̄ = 106,605/25 - (1.255333776)(79,448/25) = 274.85
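The same estimates can be cross-checked with a short Python sketch of the arithmetic above (an illustration, not the book's spreadsheet template), starting from the column totals:

n = 25
sum_x, sum_y = 79_448, 106_605
sum_x2, sum_xy = 293_426_946, 390_185_014

ss_x = sum_x2 - sum_x**2 / n          # SS_x = Σx² - (Σx)²/n
ss_xy = sum_xy - sum_x * sum_y / n    # SS_xy = Σxy - (Σx)(Σy)/n
b1 = ss_xy / ss_x                     # slope estimate
b0 = sum_y / n - b1 * sum_x / n       # intercept: ȳ - b₁·x̄
print(b1, b0)                         # ≈ 1.255333776 and ≈ 274.85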
Template (partial output) that can be
used to carry out a Simple Regression
Template (continued) that can be used
to carry out a Simple Regression
Template (continued) that can be used
to carry out a Simple Regression
Residual Analysis. The plot shows the absence of a relationship
between the residuals and the X-values (miles).
Template (continued) that can be used
to carry out a Simple Regression
Note: The normal probability plot is approximately linear. This
would indicate that the normality assumption for the errors has not
been violated.
Total Variance and Error Variance
[Figure: two scatterplots. Left: what you see when looking at the total variation of Y. Right: what you see when looking along the regression line at the error variance of Y.]
10-4 Error Variance and the Standard
Errors of Regression Estimators
Degrees of Freedom in Regression:
df = (n - 2)   (n total observations less one degree of freedom for each parameter estimated, b₀ and b₁)
Square and sum all regression errors to find SSE:
SSE = Σ(Y - Ŷ)² = SS_Y - (SS_XY)²/SS_X = SS_Y - b₁·SS_XY
An unbiased estimator of σ², denoted by s²:
MSE = SSE/(n - 2)
Example 10-1:
SSE = SS_Y - b₁·SS_XY = 66,855,898 - (1.255333776)(51,402,852.4) = 2,328,161.2
MSE = SSE/(n - 2) = 2,328,161.2/23 = 101,224.4
s = √MSE = √101,224.4 = 318.158
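A quick Python check of the error-variance computations (a sketch using the Example 10-1 sums of squares):

import math

ss_y, ss_xy = 66_855_898, 51_402_852.4
b1, n = 1.255333776, 25

sse = ss_y - b1 * ss_xy    # SSE = SS_Y - b₁·SS_XY
mse = sse / (n - 2)        # MSE: unbiased estimator of σ²
s = math.sqrt(mse)         # standard error of estimate
print(sse, mse, s)         # ≈ 2,328,161.2, 101,224.4, 318.158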
Standard Errors of Estimates in
Regression
The standard error of b₀ (intercept):
s(b₀) = s·√(Σx²/(n·SS_X)),  where s = √MSE
The standard error of b₁ (slope):
s(b₁) = s/√SS_X
Example 10-1:
s(b₀) = 318.158·√(293,426,946/((25)(40,947,557.84))) = 170.338
s(b₁) = 318.158/√40,947,557.84 = 0.04972
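The same standard-error arithmetic as a Python sketch, with the Example 10-1 values:

import math

s, n = 318.158, 25
sum_x2, ss_x = 293_426_946, 40_947_557.84

se_b0 = s * math.sqrt(sum_x2 / (n * ss_x))   # s(b₀) = s·√(Σx²/(n·SS_X))
se_b1 = s / math.sqrt(ss_x)                  # s(b₁) = s/√SS_X
print(se_b0, se_b1)                          # ≈ 170.338 and ≈ 0.04972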
Confidence Intervals for the
Regression Parameters
A (1 - α)100% confidence interval for β₀:
b₀ ± t(α/2, n-2)·s(b₀)
A (1 - α)100% confidence interval for β₁:
b₁ ± t(α/2, n-2)·s(b₁)
Example 10-1, 95% confidence intervals:
b₀ ± t(0.025, 23)·s(b₀) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [-77.58, 627.28]
b₁ ± t(0.025, 23)·s(b₁) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]
Since 0 is not in the interval for β₁, 0 is not a possible value of the regression slope at the 95% confidence level.
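Both intervals in a Python sketch; the critical value t(0.025, 23) = 2.069 is taken from a t table, as on the slide:

b0, se_b0 = 274.85, 170.338
b1, se_b1 = 1.25533, 0.04972
t_crit = 2.069                                       # t(0.025, 23)

ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)   # ≈ [-77.58, 627.28]
ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # ≈ [1.15246, 1.35820]
print(ci_b0, ci_b1)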
Template (partial output) that can be used
to obtain Confidence Intervals for β₀ and β₁
10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation, denoted by ρ, can take on any value from -1 to 1:
ρ = -1 indicates a perfect negative linear relationship
-1 < ρ < 0 indicates a negative linear relationship
ρ = 0 indicates no linear relationship
0 < ρ < 1 indicates a positive linear relationship
ρ = 1 indicates a perfect positive linear relationship
The absolute value of ρ indicates the strength or exactness of the relationship.
Illustrations of Correlation
[Figure: six scatterplots illustrating ρ = -1, ρ = 0, and ρ = 1 (top row) and ρ = -0.8, ρ = 0, and ρ = 0.8 (bottom row).]
Covariance and Correlation
The covariance of two random variables X and Y:
Cov(X, Y) = E[(X - μ_X)(Y - μ_Y)]
where μ_X and μ_Y are the population means of X and Y respectively.
The population correlation coefficient:
ρ = Cov(X, Y)/(σ_X·σ_Y)
The sample correlation coefficient*:
r = SS_XY/√(SS_X·SS_Y)
Example 10-1:
r = 51,402,852.4/√((40,947,557.84)(66,855,898)) = 51,402,852.4/52,321,943.29 = 0.9824
*Note: if ρ < 0 then b₁ < 0; if ρ = 0 then b₁ = 0; if ρ > 0 then b₁ > 0.
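A one-line Python check of the sample correlation for Example 10-1:

import math

ss_xy, ss_x, ss_y = 51_402_852.4, 40_947_557.84, 66_855_898

r = ss_xy / math.sqrt(ss_x * ss_y)   # r = SS_XY/√(SS_X·SS_Y)
print(r)                             # ≈ 0.9824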
Hypothesis Tests for the Correlation
Coefficient
H₀: ρ = 0 (no linear relationship)
H₁: ρ ≠ 0 (some linear relationship)
Test statistic:
t(n-2) = r/√((1 - r²)/(n - 2))
Example 10-1:
t(23) = 0.9824/√((1 - 0.9651)/(25 - 2)) = 0.9824/0.0389 = 25.25
Since t(0.005, 23) = 2.807 < 25.25, H₀ is rejected at the 1% level.
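The test statistic as a Python sketch (the result differs slightly from the slide's 25.25 only because the slide rounds intermediate values):

import math

r, n = 0.9824, 25

t_stat = r / math.sqrt((1 - r**2) / (n - 2))   # t = r/√((1 - r²)/(n - 2))
print(t_stat)                                  # ≈ 25.2, far above t(0.005, 23) = 2.807: reject H0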
10-6 Hypothesis Tests about the
Regression Relationship
[Figure: three scatterplots in which no linear relationship exists: constant Y, unsystematic variation, and a nonlinear relationship.]
A hypothesis test for the existence of a linear relationship between X and Y:
H₀: β₁ = 0
H₁: β₁ ≠ 0
Test statistic for the existence of a linear relationship between X and Y:
t(n-2) = b₁/s(b₁)
where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁.
When the null hypothesis is true, the statistic has a t distribution with n - 2 degrees of freedom.
Hypothesis Tests for the Regression
Slope
Example 10-1:
H₀: β₁ = 0
H₁: β₁ ≠ 0
t(n-2) = b₁/s(b₁) = 1.25533/0.04972 = 25.25
Since t(0.005, 23) = 2.807 < 25.25, H₀ is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:
H₀: β₁ = 1
H₁: β₁ ≠ 1
t(n-2) = (b₁ - 1)/s(b₁) = (1.24 - 1)/0.21 = 1.14
Since t(0.05, 58) = 1.671 > 1.14, H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
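Both slope tests as a Python sketch, using the estimates and standard errors quoted above:

# Example 10-1: H0: beta1 = 0 vs. H1: beta1 != 0
b1, se_b1 = 1.25533, 0.04972
t1 = (b1 - 0) / se_b1        # ≈ 25.25 > t(0.005, 23) = 2.807: reject H0

# Example 10-4: H0: beta1 = 1 vs. H1: beta1 != 1
b1_4, se_b1_4 = 1.24, 0.21
t2 = (b1_4 - 1) / se_b1_4    # ≈ 1.14 < t(0.05, 58) = 1.671: do not reject H0
print(t1, t2)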
10-7 How Good is the Regression?
The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
Each deviation of y from its mean splits into an unexplained part and an explained part:
(y - ȳ) = (y - ŷ) + (ŷ - ȳ)
Total deviation = Unexplained deviation (error) + Explained deviation (regression)
[Figure: a data point, the fitted regression line, and the mean ȳ, showing the total deviation decomposed into the unexplained and explained deviations.]
Summing the squared deviations over all observations:
Σ(y - ȳ)² = Σ(y - ŷ)² + Σ(ŷ - ȳ)²
SST = SSE + SSR
r² = SSR/SST = 1 - SSE/SST, the percentage of total variation explained by the regression.
The Coefficient of Determination
[Figure: three scatterplots illustrating r² = 0 (SST = SSE), r² = 0.50 (SST split between SSE and SSR), and r² = 0.90 (SSR makes up most of SST).]
Example 10-1:
r² = SSR/SST = 64,527,736.8/66,855,898 = 0.96518
[Figure: scatterplot of Dollars versus Miles for Example 10-1 with the fitted regression line.]
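A Python check that the two expressions for r² agree for Example 10-1:

ssr, sst = 64_527_736.8, 66_855_898
sse = sst - ssr

r2 = ssr / sst          # r² = SSR/SST
r2_alt = 1 - sse / sst  # equivalently, 1 - SSE/SST
print(r2, r2_alt)       # ≈ 0.96518 both ways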
10-8 Analysis-of-Variance Table and
an F Test of the Regression Model
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n - 2                MSE
Total                 SST              n - 1                MST

Example 10-1:
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F Ratio   p-Value
Regression            64,527,736.8     1                    64,527,736.8   637.47    0.000
Error                 2,328,161.2      23                   101,224.4
Total                 66,855,898.0     24
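A Python sketch that rebuilds the Example 10-1 ANOVA quantities from SSR and SSE:

ssr, sse, n = 64_527_736.8, 2_328_161.2, 25
sst = ssr + sse

msr = ssr / 1         # regression mean square (df = 1)
mse = sse / (n - 2)   # error mean square (df = n - 2)
f_ratio = msr / mse   # F = MSR/MSE
print(sst, msr, mse, f_ratio)   # F ≈ 637.47, matching the table above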
Template (partial output) that displays Analysis of
Variance and an F Test of the Regression Model
10-9 Residual Analysis and Checking
for Model Inadequacies
[Figure: four residual plots.]
• Homoscedasticity: the residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes.
• Residuals that exhibit a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.
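A minimal Python sketch of a residual-versus-X plot; it assumes matplotlib is installed and uses a handful of the Example 10-1 observations with the fitted line Ŷ = 274.85 + 1.2553X:

import matplotlib.pyplot as plt

x = [1211, 1422, 1849, 2468, 3082, 3852, 4533, 5439]   # miles (subset of Example 10-1)
y = [1802, 2005, 2332, 3694, 3555, 4801, 6059, 6964]   # dollars
residuals = [yi - (274.85 + 1.2553 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)        # residuals against X
plt.axhline(0, linestyle="--")   # zero reference line
plt.xlabel("Miles (X)")
plt.ylabel("Residual")
plt.show()                       # a patternless horizontal band suggests no inadequacy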
Normal Probability Plot of the
Residuals
Flatter than Normal
Normal Probability Plot of the
Residuals
More Peaked than Normal
Normal Probability Plot of the
Residuals
Positively Skewed
Normal Probability Plot of the
Residuals
Negatively Skewed
10-10 Use of the Regression Model
for Prediction
• Point Prediction
A single-valued estimate of Y for a given value of X obtained by
inserting the value of X in the estimated regression equation.
• Prediction Interval
For a value of Y given a value of X: reflects both the variation in the regression line estimate and the variation of points around the regression line.
For an average value of Y given a value of X: reflects only the variation in the regression line estimate.
Errors in Predicting E[Y|X]
[Figure: two panels. 1) Uncertainty about the slope of the regression line: upper and lower limits on the slope around the regression line. 2) Uncertainty about the intercept of the regression line: upper and lower limits on the intercept.]
Prediction Interval for E[Y|X]
[Figure: the regression line with the prediction band for E[Y|X].]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.
Additional Error in Predicting Individual
Value of Y
[Figure: two panels. Left: the regression line with the prediction band for an individual value of Y, widened by 3) the variation around the regression line. Right: the narrower prediction band for E[Y|X].]
Prediction Interval for a Value of Y
A (1 - α)100% prediction interval for Y:
ŷ ± t(α/2)·s·√(1 + 1/n + (x - x̄)²/SS_X)
Example 10-1 (X = 4,000):
{274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1 + 1/25 + (4,000 - 3,177.92)²/40,947,557.84)
= 5,296.05 ± 676.62 = [4,619.43, 5,972.67]
Prediction Interval for the Average
Value of Y
A (1 - α)100% prediction interval for E[Y|X]:
ŷ ± t(α/2)·s·√(1/n + (x - x̄)²/SS_X)
Example 10-1 (X = 4,000):
{274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1/25 + (4,000 - 3,177.92)²/40,947,557.84)
= 5,296.05 ± 156.48 = [5,139.57, 5,452.53]
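Both intervals at X = 4,000 can be reproduced with a short Python sketch (values from Example 10-1; t(0.025, 23) = 2.069 from a t table):

import math

b0, b1 = 274.85, 1.2553
s, n = 318.16, 25
x_bar, ss_x = 3_177.92, 40_947_557.84
t_crit = 2.069
x0 = 4_000

y_hat = b0 + b1 * x0                            # point prediction ≈ 5,296.05
d2 = (x0 - x_bar) ** 2 / ss_x
half_y = t_crit * s * math.sqrt(1 + 1/n + d2)   # individual Y: ≈ 676.6
half_mean = t_crit * s * math.sqrt(1/n + d2)    # average Y, E[Y|X]: ≈ 156.5
print(y_hat - half_y, y_hat + half_y)           # ≈ [4,619.4, 5,972.7]
print(y_hat - half_mean, y_hat + half_mean)     # ≈ [5,139.6, 5,452.5]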
Template Output with Prediction
Intervals
10-11 The Excel Solver Method for
Regression
The Solver macro available in Excel can also be used to conduct a simple linear regression. See the text for instructions.
Using Minitab Fitted-Line Plot for
Regression
[Figure: Minitab fitted line plot of Y versus X. Fitted line: Y = -0.8465 + 1.352 X; S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%.]