
COMPLETE BUSINESS STATISTICS
by AMIR D. ACZEL & JAYAVEL SOUNDERPANDIAN
7th edition

Prepared by Lloyd Jaisingh, Morehead State University

Chapter 10
Simple Linear Regression and Correlation

McGraw-Hill/Irwin Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.

10 Simple Linear Regression and Correlation


• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression

10 LEARNING OBJECTIVES

After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable

10 LEARNING OBJECTIVES (continued)

After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check whether the assumptions of the regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression

10-1 Using Statistics

• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable, and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y is a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.

10-1 Using Statistics

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y), with advertising on the x-axis and sales on the y-axis]

This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.

Examples of Other Scatterplots

[Figure: Six example scatterplots showing different possible patterns in the relationship between X and Y]

Model Building

The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component:

    Data = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR), and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

10-2 The Simple Linear Regression Model

The population simple linear regression model:

    $Y = \beta_0 + \beta_1 X + \varepsilon$

where $\beta_0 + \beta_1 X$ is the nonrandom or systematic component and $\varepsilon$ is the random component, and

• Y is the dependent variable, the variable we wish to explain or predict
• X is the independent variable, also called the predictor variable
• $\varepsilon$ is the error term, the only random component in the model, and thus the only source of randomness in Y
• $\beta_0$ is the intercept of the systematic component of the regression relationship
• $\beta_1$ is the slope of the systematic component

The conditional mean of Y:

    $E[Y \mid X] = \beta_0 + \beta_1 X$
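To make the structure of the model concrete, here is a minimal Python sketch (assuming NumPy is available) that simulates observations from it; the parameter values $\beta_0 = 5$, $\beta_1 = 2$, and $\sigma = 1.5$ are invented for illustration and are not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters (illustration only)
beta0, beta1, sigma = 5.0, 2.0, 1.5

x = np.linspace(0, 10, 50)            # fixed values of the independent variable X
eps = rng.normal(0.0, sigma, x.size)  # errors: normal with mean 0, variance sigma^2
y = beta0 + beta1 * x + eps           # Y = beta0 + beta1*X + epsilon

print(y[:5])
```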

Picturing the Simple Linear Regression Model

[Figure: Regression plot showing the population regression line $E[Y] = \beta_0 + \beta_1 X$, with intercept $\beta_0$, slope $\beta_1$, and an observed value $Y_i$ deviating from $E[Y_i]$ by the error $\varepsilon_i$]

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

    $E[Y_i] = \beta_0 + \beta_1 X_i$

Actual observed values of Y differ from the expected value by an unexplained or random error:

    $Y_i = E[Y_i] + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon_i$.
• The errors $\varepsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$. The errors are uncorrelated (not related) across successive observations. That is: $\varepsilon \sim N(0, \sigma^2)$.

[Figure: The regression line $E[Y] = \beta_0 + \beta_1 X$ with identical normal distributions of errors, all centered on the regression line]

10-3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

    $Y = b_0 + b_1 X + e$

where $b_0$ estimates the intercept of the population regression line, $\beta_0$; $b_1$ estimates the slope of the population regression line, $\beta_1$; and $e$ stands for the observed errors, the residuals from fitting the estimated regression line $b_0 + b_1 X$ to a set of n points.

The estimated regression line:

    $\hat{Y} = b_0 + b_1 X$

where $\hat{Y}$ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.

Fitting a Regression Line

[Figure: Four plots: the data; three errors from an arbitrary fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized]

Errors in Regression

[Figure: The fitted regression line $\hat{Y} = b_0 + b_1 X$, an observed data point $Y_i$, and the predicted value $\hat{Y}_i$ of Y at $X_i$]

The error (residual) at each observed point is the difference between the observed value and the value predicted by the fitted regression line:

    $e_i = Y_i - \hat{Y}_i$

Least Squares Regression

The sum of squared errors in regression is:

    $SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The least squares regression line is the one that minimizes the SSE with respect to the estimates $b_0$ and $b_1$.

The normal equations:

    $\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i$

    $\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$

[Figure: SSE as a surface over the $(b_0, b_1)$ plane; at the least squares estimates, SSE is minimized with respect to both $b_0$ and $b_1$]

Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

    $SS_x = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$

    $SS_y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$

    $SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$

Least squares regression estimators:

    $b_1 = \frac{SS_{XY}}{SS_X}$

    $b_0 = \bar{y} - b_1 \bar{x}$

Example 10-1

Miles (x)   Dollars (y)   Miles²        Miles × Dollars
1211        1802          1,466,521     2,182,222
1345        2405          1,809,025     3,234,725
1422        2005          2,022,084     2,851,110
1687        2511          2,845,969     4,236,057
1849        2332          3,418,801     4,311,868
2026        2305          4,104,676     4,669,930
2133        3016          4,549,689     6,433,128
2253        3385          5,076,009     7,626,405
2400        3090          5,760,000     7,416,000
2468        3694          6,091,024     9,116,792
2699        3371          7,284,601     9,098,329
2806        3998          7,873,636     11,218,388
3082        3555          9,498,724     10,956,510
3209        4692          10,297,681    15,056,628
3466        4244          12,013,156    14,709,704
3643        5298          13,271,449    19,300,614
3852        4801          14,837,904    18,493,452
4033        5147          16,265,089    20,757,851
4267        5738          18,207,289    24,484,046
4498        6420          20,232,004    28,877,160
4533        6059          20,548,089    27,465,447
4804        6426          23,078,416    30,870,504
5090        6321          25,908,100    32,173,890
5233        7026          27,384,289    36,767,058
5439        6964          29,582,721    37,877,196
Totals:     79,448        106,605       293,426,946   390,185,014

Using these totals (n = 25):

    $SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 293{,}426{,}946 - \frac{79{,}448^2}{25} = 40{,}947{,}557.84$

    $SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 390{,}185{,}014 - \frac{(79{,}448)(106{,}605)}{25} = 51{,}402{,}852.4$

    $b_1 = \frac{SS_{XY}}{SS_X} = \frac{51{,}402{,}852.4}{40{,}947{,}557.84} = 1.255333776 \approx 1.26$

    $b_0 = \bar{y} - b_1 \bar{x} = \frac{106{,}605}{25} - (1.255333776)\left(\frac{79{,}448}{25}\right) = 274.85$
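A minimal Python sketch (assuming NumPy) of these estimators applied to the Example 10-1 data; it should reproduce the slide's values of $b_1 \approx 1.2553$ and $b_0 \approx 274.85$:

```python
import numpy as np

# Example 10-1 data: miles traveled (x) and dollars charged (y), n = 25
x = np.array([1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
              2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
              4533, 4804, 5090, 5233, 5439], dtype=float)
y = np.array([1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
              3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
              6059, 6426, 6321, 7026, 6964], dtype=float)

n = x.size
SS_x = np.sum(x**2) - np.sum(x)**2 / n             # 40,947,557.84
SS_xy = np.sum(x * y) - np.sum(x) * np.sum(y) / n  # 51,402,852.4

b1 = SS_xy / SS_x              # slope:     about 1.2553
b0 = y.mean() - b1 * x.mean()  # intercept: about 274.85
print(b1, b0)
```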

Template (partial output) that can be used to carry out a Simple Regression

[Figure: Spreadsheet template output for the simple regression of Example 10-1]

Template (continued) that can be used to carry out a Simple Regression

[Figure: Spreadsheet template output, continued]

Template (continued) that can be used to carry out a Simple Regression

[Figure: Residual plot from the template]

Residual analysis: the plot shows the absence of a relationship between the residuals and the X-values (miles).

Template (continued) that can be used to carry out a Simple Regression

[Figure: Normal probability plot of the residuals from the template]

Note: The normal probability plot is approximately linear. This would indicate that the normality assumption for the errors has not been violated.

Total Variance and Error Variance

[Figure: Two plots of the same data. Left: what you see when looking at the total variation of Y. Right: what you see when looking along the regression line at the error variance of Y.]

10-4 Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression:

    df = n − 2

(n total observations, less one degree of freedom for each parameter estimated, $b_0$ and $b_1$.)

Square and sum all regression errors to find SSE:

    $SSE = \sum (Y - \hat{Y})^2 = SS_Y - \frac{(SS_{XY})^2}{SS_X} = SS_Y - b_1 SS_{XY}$

An unbiased estimator of $\sigma^2$, denoted by $s^2$:

    $MSE = \frac{SSE}{n - 2}$

Example 10-1:

    $SSE = SS_Y - b_1 SS_{XY} = 66{,}855{,}898 - (1.255333776)(51{,}402{,}852.4) = 2{,}328{,}161.2$

    $MSE = \frac{SSE}{n - 2} = \frac{2{,}328{,}161.2}{23} = 101{,}224.4$

    $s = \sqrt{MSE} = \sqrt{101{,}224.4} = 318.158$
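A quick Python check of these computations, plugging in the sums of squares quoted on the slide:

```python
import math

# Sums of squares for Example 10-1 (from the slides)
SS_Y, SS_XY, b1, n = 66_855_898.0, 51_402_852.4, 1.255333776, 25

SSE = SS_Y - b1 * SS_XY  # about 2,328,161.2
MSE = SSE / (n - 2)      # about 101,224.4
s = math.sqrt(MSE)       # about 318.158
print(SSE, MSE, s)
```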

Standard Errors of Estimates in Regression

The standard error of $b_0$ (intercept):

    $s(b_0) = s \sqrt{\frac{\sum x^2}{n \, SS_X}}$

where $s = \sqrt{MSE}$.

The standard error of $b_1$ (slope):

    $s(b_1) = \frac{s}{\sqrt{SS_X}}$

Example 10-1:

    $s(b_0) = 318.158 \sqrt{\frac{293{,}426{,}946}{(25)(40{,}947{,}557.84)}} = 170.338$

    $s(b_1) = \frac{318.158}{\sqrt{40{,}947{,}557.84}} = 0.04972$
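The same standard-error computations as a short Python sketch, using the Example 10-1 quantities from the slides:

```python
import math

# Example 10-1 quantities (from the slides)
s, n = 318.158, 25
SS_X, sum_x2 = 40_947_557.84, 293_426_946.0

s_b0 = s * math.sqrt(sum_x2 / (n * SS_X))  # about 170.338
s_b1 = s / math.sqrt(SS_X)                 # about 0.04972
print(s_b0, s_b1)
```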

Confidence Intervals for the Regression Parameters

A (1 − α)100% confidence interval for $\beta_0$:

    $b_0 \pm t_{\alpha/2,\,(n-2)} \, s(b_0)$

A (1 − α)100% confidence interval for $\beta_1$:

    $b_1 \pm t_{\alpha/2,\,(n-2)} \, s(b_1)$

Example 10-1, 95% confidence intervals:

    $b_0 \pm t_{0.025,\,23}\, s(b_0) = 274.85 \pm (2.069)(170.338) = 274.85 \pm 352.43 = [-77.58,\ 627.28]$

    $b_1 \pm t_{0.025,\,23}\, s(b_1) = 1.25533 \pm (2.069)(0.04972) = 1.25533 \pm 0.10287 = [1.15246,\ 1.35820]$

Since the interval for $\beta_1$ does not contain 0, zero is not a plausible value of the regression slope at the 95% level.
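A Python sketch of these intervals (assuming SciPy for the t critical value); the inputs are the Example 10-1 estimates from the earlier slides:

```python
from scipy import stats

# Example 10-1 estimates and standard errors (from the slides)
b0, s_b0 = 274.85, 170.338
b1, s_b1 = 1.25533, 0.04972
n, alpha = 25, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # about 2.069

ci_b0 = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)  # about [-77.58, 627.28]
ci_b1 = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # about [1.1525, 1.3582]
print(ci_b0, ci_b1)
```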

Template (partial output) that can be used to obtain Confidence Intervals for $\beta_0$ and $\beta_1$

[Figure: Spreadsheet template output showing confidence intervals for the regression coefficients]

10-5 Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by $\rho$, can take on any value from −1 to 1:

    $\rho = -1$ indicates a perfect negative linear relationship
    $-1 < \rho < 0$ indicates a negative linear relationship
    $\rho = 0$ indicates no linear relationship
    $0 < \rho < 1$ indicates a positive linear relationship
    $\rho = 1$ indicates a perfect positive linear relationship

The absolute value of $\rho$ indicates the strength or exactness of the relationship.

Illustrations of Correlation

[Figure: Six scatterplots illustrating correlations of $\rho = -1$, $\rho = 0$, $\rho = 1$, $\rho = -0.8$, $\rho = 0$, and $\rho = 0.8$]

Covariance and Correlation

The covariance of two random variables X and Y:

    $Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$

where $\mu_X$ and $\mu_Y$ are the population means of X and Y, respectively.

The population correlation coefficient:

    $\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$

The sample correlation coefficient*:

    $r = \frac{SS_{XY}}{\sqrt{SS_X SS_Y}}$

Example 10-1:

    $r = \frac{51{,}402{,}852.4}{\sqrt{(40{,}947{,}557.84)(66{,}855{,}898)}} = \frac{51{,}402{,}852.4}{52{,}321{,}943.29} = 0.9824$

*Note: If $\rho < 0$, then $b_1 < 0$; if $\rho = 0$, then $b_1 = 0$; if $\rho > 0$, then $b_1 > 0$.
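As a quick Python check, using the Example 10-1 sums of squares from the slides:

```python
import math

# Sums of squares for Example 10-1 (from the slides)
SS_XY, SS_X, SS_Y = 51_402_852.4, 40_947_557.84, 66_855_898.0

r = SS_XY / math.sqrt(SS_X * SS_Y)  # about 0.9824
print(r)
```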

Hypothesis Tests for the Correlation Coefficient

$H_0$: $\rho = 0$ (no linear relationship)
$H_1$: $\rho \neq 0$ (some linear relationship)

Test statistic:

    $t_{(n-2)} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$

Example 10-1:

    $t = \frac{0.9824}{\sqrt{\frac{1 - 0.9651}{25 - 2}}} = \frac{0.9824}{0.0389} = 25.25$

    $t_{0.005,\,23} = 2.807 < 25.25$, so $H_0$ is rejected at the 1% level.
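A Python sketch of this test for Example 10-1 (assuming SciPy for the critical value):

```python
import math
from scipy import stats

r, n = 0.9824, 25

t_stat = r / math.sqrt((1 - r**2) / (n - 2))  # about 25.25
t_crit = stats.t.ppf(1 - 0.005, df=n - 2)     # 2.807 for a 1% two-tailed test
print(t_stat, t_crit, abs(t_stat) > t_crit)   # True: reject H0
```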

10-6 Hypothesis Tests about the Regression Relationship

[Figure: Three scatterplots for which the population slope could be zero: constant Y, unsystematic variation, and a nonlinear relationship]

A hypothesis test for the existence of a linear relationship between X and Y:

    $H_0$: $\beta_1 = 0$
    $H_1$: $\beta_1 \neq 0$

Test statistic for the existence of a linear relationship between X and Y:

    $t_{(n-2)} = \frac{b_1}{s(b_1)}$

where $b_1$ is the least-squares estimate of the regression slope and $s(b_1)$ is the standard error of $b_1$. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Hypothesis Tests for the Regression Slope

Example 10-1:

    $H_0$: $\beta_1 = 0$;  $H_1$: $\beta_1 \neq 0$

    $t = \frac{b_1}{s(b_1)} = \frac{1.25533}{0.04972} = 25.25$

    $t_{0.005,\,23} = 2.807 < 25.25$

$H_0$ is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:

    $H_0$: $\beta_1 = 1$;  $H_1$: $\beta_1 \neq 1$

    $t = \frac{b_1 - 1}{s(b_1)} = \frac{1.24 - 1}{0.21} = 1.14$

    $t_{0.05,\,58} = 1.671 > 1.14$

$H_0$ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
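A small Python helper (a sketch, assuming SciPy) that runs this two-tailed t-test; the two calls reproduce Examples 10-1 and 10-4 above (for Example 10-4, n = 60 is inferred from the stated 58 degrees of freedom):

```python
from scipy import stats

def slope_t_test(b1, s_b1, beta1_null, n, alpha):
    """Two-tailed t-test of H0: beta1 = beta1_null vs. H1: beta1 != beta1_null."""
    t_stat = (b1 - beta1_null) / s_b1
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t_stat, t_crit, abs(t_stat) > t_crit  # True means reject H0

print(slope_t_test(1.25533, 0.04972, 0.0, 25, 0.01))  # Example 10-1: reject H0
print(slope_t_test(1.24, 0.21, 1.0, 60, 0.10))        # Example 10-4: do not reject H0
```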

10-7 How Good is the Regression?

The coefficient of determination, $r^2$, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

[Figure: Decomposition of the total deviation of an observation into explained and unexplained deviations around the regression line]

    $(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$
    Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all observations:

    $\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$
    SST = SSE + SSR

The coefficient of determination is the percentage of total variation explained by the regression:

    $r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

The Coefficient of Determination

[Figure: Three scatterplots illustrating $r^2 = 0$, $r^2 = 0.50$, and $r^2 = 0.90$, each showing how SST is partitioned into SSE and SSR]

Example 10-1:

    $r^2 = \frac{SSR}{SST} = \frac{64{,}527{,}736.8}{66{,}855{,}898} = 0.96518$

[Figure: Fitted-line plot of Dollars versus Miles for Example 10-1]
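A one-line Python check of the two equivalent forms of $r^2$, using the Example 10-1 sums of squares:

```python
# ANOVA sums of squares for Example 10-1 (from the slides)
SSR, SST = 64_527_736.8, 66_855_898.0
SSE = SST - SSR

r_squared = SSR / SST  # about 0.96518
assert abs(r_squared - (1 - SSE / SST)) < 1e-12  # the two forms agree
print(r_squared)
```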

10-8 Analysis-of-Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n − 2                MSE
Total                 SST              n − 1                MST

Example 10-1:

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F Ratio   p-Value
Regression            64,527,736.8     1                    64,527,736.8   637.47    0.000
Error                 2,328,161.2      23                   101,224.4
Total                 66,855,898.0     24
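A Python sketch of the Example 10-1 ANOVA computations (assuming SciPy for the F-distribution p-value):

```python
from scipy import stats

# Example 10-1 sums of squares and sample size (from the slides)
SSR, SSE, n = 64_527_736.8, 2_328_161.2, 25

MSR = SSR / 1                      # regression mean square (1 df)
MSE = SSE / (n - 2)                # error mean square (n - 2 df)
F = MSR / MSE                      # about 637.47
p_value = stats.f.sf(F, 1, n - 2)  # essentially 0.000
print(F, p_value)
```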

Template (partial output) that displays Analysis of Variance and an F Test of the Regression Model

[Figure: Spreadsheet template output showing the ANOVA table and F test]

10-9 Residual Analysis and Checking for Model Inadequacies

[Figure: Four residual plots, each with residuals plotted against x (or ŷ) around a zero line]

• Homoscedasticity: the residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals changes as x changes.
• The residuals exhibit a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.
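A minimal matplotlib sketch of such a residual plot; the residuals here are synthetic, generated only to illustrate what a well-behaved (homoscedastic) plot looks like:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic fitted x-values and residuals (replace with your own)
rng = np.random.default_rng(1)
x = np.linspace(1000, 5500, 25)
residuals = rng.normal(0, 300, x.size)  # homoscedastic: no pattern expected

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x (or fitted y)")
plt.ylabel("Residual")
plt.title("Residual plot: look for funnels, trends, or curves")
plt.show()
```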

Normal Probability Plot of the Residuals

[Figures: Normal probability plots of residuals illustrating four departures from normality: flatter than normal, more peaked than normal, positively skewed, and negatively skewed]
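A sketch of how such a plot can be produced with scipy.stats.probplot; the residuals here are synthetic stand-ins for the residuals from an actual fit:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Synthetic residuals (replace with the residuals from your fit)
rng = np.random.default_rng(2)
residuals = rng.normal(0, 318, 25)

# The plot should be roughly linear if the errors are normal
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```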

10-10 Use of the Regression Model for Prediction

• Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X into the estimated regression equation.
• Prediction interval for a value of Y given a value of X, which accounts for:
    - variation in the regression line estimate
    - variation of points around the regression line
• Prediction interval for the average value of Y given a value of X, which accounts for:
    - variation in the regression line estimate

Errors in Predicting E[Y|X]

[Figure: Two plots. Left: (1) uncertainty about the slope of the regression line, bounded by upper and lower limits on the slope. Right: (2) uncertainty about the intercept of the regression line, bounded by upper and lower limits on the intercept.]

Prediction Interval for E[Y|X]

[Figure: Regression line with the prediction band for E[Y|X]]

• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.

Additional Error in Predicting an Individual Value of Y

[Figure: Left: (3) variation of individual points around the regression line. Right: the prediction band for an individual value of Y, which is wider than the prediction band for E[Y|X].]

Prediction Interval for a Value of Y

A (1 − α)100% prediction interval for Y:

    $\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$

Example 10-1 (X = 4,000):

    $\{274.85 + (1.2553)(4{,}000)\} \pm (2.069)(318.16) \sqrt{1 + \frac{1}{25} + \frac{(4{,}000 - 3{,}177.92)^2}{40{,}947{,}557.84}}$

    $= 5{,}296.05 \pm 676.62 = [4{,}619.43,\ 5{,}972.67]$

Prediction Interval for the Average Value of Y

A (1 − α)100% prediction interval for E[Y | X]:

    $\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$

Example 10-1 (X = 4,000):

    $\{274.85 + (1.2553)(4{,}000)\} \pm (2.069)(318.16) \sqrt{\frac{1}{25} + \frac{(4{,}000 - 3{,}177.92)^2}{40{,}947{,}557.84}}$

    $= 5{,}296.05 \pm 156.48 = [5{,}139.57,\ 5{,}452.53]$
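A Python sketch (assuming SciPy) computing both intervals at X = 4,000 from the Example 10-1 quantities; note that the only difference between them is the extra 1 under the square root for an individual Y:

```python
import math
from scipy import stats

# Example 10-1 quantities (from the slides)
b0, b1, s = 274.85, 1.2553, 318.16
n, x_bar, SS_X = 25, 3177.92, 40_947_557.84
x0, alpha = 4000.0, 0.05

y_hat = b0 + b1 * x0                           # point prediction, about 5,296.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # about 2.069
core = 1 / n + (x0 - x_bar) ** 2 / SS_X

half_mean = t_crit * s * math.sqrt(core)       # for E[Y|X]: about 156.48
half_indiv = t_crit * s * math.sqrt(1 + core)  # for an individual Y: about 676.62

print((y_hat - half_indiv, y_hat + half_indiv))  # about [4,619, 5,973]
print((y_hat - half_mean, y_hat + half_mean))    # about [5,140, 5,453]
```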

Template Output with Prediction Intervals

[Figure: Spreadsheet template output showing prediction intervals]

10-11 The Excel Solver Method for Regression

The Solver macro available in Excel can also be used to conduct a simple linear regression. See the text for instructions.

Using a Minitab Fitted-Line Plot for Regression

[Figure: Minitab fitted line plot, Y = −0.8465 + 1.352 X, with S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%]
