
Introduction to Probability

and Statistics
Twelfth Edition

Robert J. Beaver • Barbara M. Beaver • William Mendenhall

Presentation designed and written by:


Barbara M. Beaver
Copyright ©2006 Brooks/Cole
A division of Thomson Learning, Inc.
Introduction to Probability
and Statistics
Twelfth Edition

Chapter 12
Linear Regression and
Correlation
Some graphic screen captures from Seeing Statistics®
Some images © 2001-(current year) www.arttoday.com
Introduction
• In Chapter 11, we used ANOVA to investigate the
effect of various factor-level combinations
(treatments) on a response x.
• Our objective was to see whether the treatment
means were different.
• In Chapters 12 and 13, we investigate a response y
which is affected by various independent variables,
xi.
• Our objective is to use the information provided by
the xi to predict the value of y.

Example
• Let y be a student’s college achievement,
measured by his/her GPA. This might be a
function of several variables:
– x1 = rank in high school class
– x2 = high school’s overall rating
– x3 = high school GPA
– x4 = SAT scores
• We want to predict y using knowledge of x1, x2,
x3 and x4.
Example
• Let y be the monthly sales revenue for a
company. This might be a function of several
variables:
– x1 = advertising expenditure
– x2 = time of year
– x3 = state of economy
– x4 = size of inventory
• We want to predict y using knowledge of x1, x2,
x3 and x4.
Some Questions
• Which of the independent variables are
useful and which are not?
• How could we create a prediction equation
to allow us to predict y using knowledge of
x1, x2, x3 etc?
• How good is this prediction?
We start with the simplest case, in which the
response y is a function of a single independent
variable, x.
A Simple Linear Model
• In Chapter 3, we used the equation of
a line to describe the relationship between y
and x for a sample of n pairs, (x, y).
• If we want to describe the relationship
between y and x for the whole population,
there are two models we can choose
• Deterministic Model: y = α + βx
• Probabilistic Model:
– y = deterministic model + random error
– y = α + βx + ε
A Simple Linear Model
• Since the bivariate measurements that we
observe do not generally fall exactly on a
straight line, we choose to use:
• Probabilistic Model:
– y = α + βx + ε
– E(y) = α + βx
Points deviate from the line of means by an amount
ε, where ε has a normal distribution with mean 0 and
variance σ².
The Random Error
• The line of means, E(y) = α + βx, describes the
average value of y for any fixed value of x.
• The population of measurements is generated
as y deviates from the population line by ε. We
estimate α and β using sample information.
The Method of
Least Squares
• The equation of the best-fitting line
is calculated using a set of n pairs (xi, yi).
• We choose our estimates a and b to
estimate α and β so that the vertical
distances of the points from the line
are minimized.

Best fitting line: ŷ = a + bx
Choose a and b to minimize
SSE = Σ(y − ŷ)² = Σ(y − a − bx)²
Least Squares Estimators
Calculate the sums of squares:

Sxx = Σx² − (Σx)²/n      Syy = Σy² − (Σy)²/n
Sxy = Σxy − (Σx)(Σy)/n

Best fitting line: ŷ = a + bx, where
b = Sxy/Sxx  and  a = ȳ − b x̄
Example
The table shows the math achievement test scores
for a random sample of n = 10 college freshmen,
along with their final calculus grades.
Student 1 2 3 4 5 6 7 8 9 10
Math test, x 39 43 21 64 57 47 28 75 34 52
Calculus grade, y 65 78 52 82 92 89 73 98 56 75
Use your calculator to find
the sums and sums of squares:
Σx = 460      Σy = 760
Σx² = 23634   Σy² = 59816
Σxy = 36854
x̄ = 46        ȳ = 76
Example
Sxx = 23634 − (460)²/10 = 2474
Syy = 59816 − (760)²/10 = 2056
Sxy = 36854 − (460)(760)/10 = 1894

b = Sxy/Sxx = 1894/2474 = .76556  and
a = ȳ − b x̄ = 76 − .76556(46) = 40.78

Best fitting line: ŷ = 40.78 + .77x
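The hand calculations above can be checked with a short script; a sketch in Python using only the raw data from the table (the variable names are mine, not the text's):

```python
# Math test scores (x) and calculus grades (y) for the n = 10 freshmen
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
n = len(x)

# Sums of squares, exactly as on the slide
Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

# Least squares estimates: b = Sxy/Sxx, a = y-bar - b * x-bar
b = Sxy / Sxx
a = sum(y)/n - b * sum(x)/n

print(Sxx, Syy, Sxy)              # 2474.0 2056.0 1894.0
print(round(b, 5), round(a, 2))   # 0.76556 40.78
```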
The Analysis of Variance
• The total variation in the experiment is
measured by the total sum of squares:

Total SS = Syy = Σ(y − ȳ)²
The Total SS is divided into two parts:
SSR (sum of squares for regression):
measures the variation explained by using x in
the model.
SSE (sum of squares for error): measures the
leftover variation not explained by x.
The Analysis of Variance
We calculate
SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
    = 2056 − 1449.9741 = 606.0259
The ANOVA Table
Total df = n − 1
Regression df = 1               MSR = SSR/(1)
Error df = n − 1 − 1 = n − 2    MSE = SSE/(n − 2)

Source       df      SS        MS           F
Regression   1       SSR       SSR/(1)      MSR/MSE
Error        n − 2   SSE       SSE/(n − 2)
Total        n − 1   Total SS
The Calculus Problem
SSR = (Sxy)²/Sxx = (1894)²/2474 = 1449.9741
SSE = Total SS − SSR = Syy − (Sxy)²/Sxx
    = 2056 − 1449.9741 = 606.0259

Source       df   SS          MS          F
Regression   1    1449.9741   1449.9741   19.14
Error        8    606.0259    75.7532
Total        9    2056.0000
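The ANOVA decomposition in this table can be reproduced in a few lines; a sketch that takes the sums of squares from the earlier slides as given:

```python
# ANOVA decomposition for the calculus example, using the sums of
# squares computed earlier: Sxx = 2474, Syy = 2056, Sxy = 1894, n = 10
Sxx, Syy, Sxy, n = 2474.0, 2056.0, 1894.0, 10

total_ss = Syy                 # Total SS = Syy
ssr = Sxy**2 / Sxx             # variation explained by x
sse = total_ss - ssr           # leftover, unexplained variation
mse = sse / (n - 2)            # best estimate of sigma^2
f = (ssr / 1) / mse            # F = MSR/MSE with 1 and n - 2 df

print(round(ssr, 4), round(sse, 4))   # 1449.9741 606.0259
print(round(mse, 4), round(f, 2))     # 75.7532 19.14
```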
Testing the Usefulness
of the Model
• The first question to ask is whether the
independent variable x is of any use in
predicting y.
• If it is not, then the value of y does not change,
regardless of the value of x. This implies that
the slope of the line, β, is zero.

H0: β = 0  versus  Ha: β ≠ 0
Testing the
Usefulness of the Model
• The test statistic is a function of b, our best
estimate of β. Using MSE as the best estimate
of the random variation σ², we obtain a t
statistic.

Test statistic: t = (b − 0) / √(MSE/Sxx), which has a
t distribution with df = n − 2, or a confidence
interval: b ± t_α/2 √(MSE/Sxx)
The Calculus Problem
• Is there a significant relationship between
the calculus grades and the test scores at the
5% level of significance?

H0: β = 0  versus  Ha: β ≠ 0
t = (b − 0)/√(MSE/Sxx) = .7656/√(75.7532/2474) = 4.38
Reject H0 when |t| > 2.306. Since t = 4.38 falls into
the rejection region, H0 is rejected.
There is a significant linear relationship between the
calculus grades and the test scores for the population
of college freshmen.
The F Test
• You can test the overall usefulness of the
model using an F test. If the model is
useful, MSR will be large compared to
the unexplained variation, MSE.
To test H0: the model is useful in predicting y
Test Statistic: F = MSR/MSE
Reject H0 if F > F_α with 1 and n − 2 df.
This test is exactly equivalent to the t-test,
with t² = F.
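Both statistics are easy to verify for the calculus example; a sketch that takes the (rounded) values of b, MSE and Sxx from the preceding slides as given:

```python
import math

# Slope test for the calculus example; b, MSE, Sxx are the
# rounded values from the preceding slides
b, mse, Sxx, n = 0.7656, 75.7532, 2474.0, 10

t = (b - 0) / math.sqrt(mse / Sxx)   # t statistic, n - 2 = 8 df
f = t ** 2                           # the equivalent F statistic: t^2 = F

print(round(t, 2))       # 4.38
print(abs(t) > 2.306)    # True: reject H0 at the 5% level
```

Note that squaring t reproduces (up to rounding) the F statistic from the ANOVA table.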
Minitab Output
Regression Analysis: y versus x
The regression equation is y = 40.8 + 0.766 x
Predictor  Coef    SE Coef  T     P
Constant   40.784  8.507    4.79  0.001
x          0.7656  0.1750   4.38  0.002

S = 8.70363  R-Sq = 70.5%  R-Sq(adj) = 66.8%

Analysis of Variance
Source          DF  SS      MS      F      P
Regression      1   1450.0  1450.0  19.14  0.002
Residual Error  8   606.0   75.8
Total           9   2056.0

Note the least squares regression line, the test of
H0: β = 0, the regression coefficients a and b,
MSE, and t² = F.
Measuring the Strength
of the Relationship
• If the independent variable x is useful in
predicting y, you will want to know how well
the model fits.
• The strength of the relationship between x and y
can be measured using:

Correlation coefficient: r = Sxy / √(Sxx Syy)
Coefficient of determination:
r² = (Sxy)² / (Sxx Syy) = SSR / Total SS
Measuring the Strength
of the Relationship
• Since Total SS = SSR + SSE, r² measures
 the proportion of the total variation in the
responses that can be explained by using the
independent variable x in the model.
 the percent reduction in the total variation by
using the regression equation rather than just
using the sample mean ȳ to estimate y.

For the calculus problem, r² = SSR/Total SS = .705
or 70.5%. The model is working well!
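The two equivalent forms of r² can be confirmed numerically for the calculus example; a sketch using the sums of squares from the earlier slides:

```python
import math

# r and r^2 for the calculus example, from Sxx, Syy, Sxy
Sxx, Syy, Sxy = 2474.0, 2056.0, 1894.0
ssr = Sxy**2 / Sxx                    # explained variation (SSR)

r = Sxy / math.sqrt(Sxx * Syy)        # correlation coefficient
r_sq = r**2                           # coefficient of determination

print(round(r_sq, 3))                 # 0.705, i.e. 70.5% explained
print(abs(r_sq - ssr / Syy) < 1e-9)   # True: r^2 = SSR / Total SS
```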
Interpreting a
Significant Regression
• Even if you do not reject the null hypothesis
that the slope of the line equals 0, it does not
necessarily mean that y and x are unrelated.
• Type II error—falsely declaring that the slope is
0 and that x and y are unrelated.
• It may happen that y and x are perfectly related
in a nonlinear way.
Some Cautions
• You may have fit the wrong model.
• Extrapolation—predicting values of y outside
the range of the fitted data.
• Causality—Do not conclude that x causes y.
There may be an unknown variable at work!
Checking the
Regression Assumptions
•Remember that the results of a regression
analysis are only valid when the necessary
assumptions have been satisfied.
1. The relationship between x and y is linear,
given by y = α + βx + ε.
2. The random error terms ε are independent and,
for any value of x, have a normal distribution
with mean 0 and variance σ².
Diagnostic Tools
•We use the same diagnostic tools used in
Chapter 11 to check the normality
assumption and the assumption of equal
variances.
1. Normal probability plot of residuals
2. Plot of residuals versus fit or
residuals versus variables
Residuals
•The residual error is the “leftover”
variation in each data point after the
variation explained by the regression model
has been removed.
• Residual = yi − ŷi = yi − (a + bxi)
• If all assumptions have been met, these
residuals should be normal, with mean 0
and variance σ².
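For the calculus example, the residuals can be computed directly from the data and the fitted line; a quick sketch (the rounded a and b come from the earlier slides):

```python
# Residuals for the calculus example, using the fitted line a + b*x
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]
a, b = 40.78424, 0.76556          # estimates from the earlier slides

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Least squares residuals always sum to zero (up to rounding)
print(round(abs(sum(residuals)), 4))   # 0.0
```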
Normal Probability Plot

If the normality assumption is valid, the plot
should resemble a straight line, sloping upward
to the right.
If not, you will often see the pattern fail in
the tails of the graph.

[Figure: Normal Probability Plot of the Residuals
(response is y), percent versus residual]
Residuals versus Fits

If the equal variance assumption is valid,
the plot should appear as a random
scatter around the zero center line.
If not, you will see a pattern in the residuals.

[Figure: Residuals Versus the Fitted Values
(response is y), residual versus fitted value]
Estimation and Prediction
• Once you have
 determined that the regression line is useful
 used the diagnostic plots to check for
violation of the regression assumptions,
• you are ready to use the regression line to
 estimate the average value of y for a
given value of x
 predict a particular value of y for a
given value of x.
Estimation and Prediction
Estimating a
particular value of y
when x = x0

Estimating the
average value of y
when x = x0

Estimation and Prediction
• The best estimate of either E(y) or y for
a given value x = x0 is
ŷ = a + bx0
• Particular values of y are more difficult to
predict, requiring a wider range of values in the
prediction interval.
Estimation and Prediction
To estimate the average value of y when x = x0:
ŷ ± t_α/2 √( MSE (1/n + (x0 − x̄)²/Sxx) )

To predict a particular value of y when x = x0:
ŷ ± t_α/2 √( MSE (1 + 1/n + (x0 − x̄)²/Sxx) )
The Calculus Problem
• Estimate the average calculus grade for
students whose achievement score is 50 with a
95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06
ŷ ± 2.306 √( 75.7532 (1/10 + (50 − 46)²/2474) )
79.06 ± 6.55, or 72.51 to 85.61.
The Calculus Problem
• Estimate the calculus grade for a particular
student whose achievement score is 50 with a
95% confidence interval.

Calculate ŷ = 40.78424 + .76556(50) = 79.06
ŷ ± 2.306 √( 75.7532 (1 + 1/10 + (50 − 46)²/2474) )
79.06 ± 21.11, or 57.95 to 100.17.
Notice how much wider this interval is!
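Both intervals follow directly from the formulas two slides back; a sketch for x0 = 50, taking MSE, Sxx, x̄, a, b and the critical value t_.025 = 2.306 (8 df) from the slides:

```python
import math

# 95% CI for E(y) and 95% PI for y at x0 = 50, calculus example
mse, Sxx, n, xbar = 75.7532, 2474.0, 10, 46.0
a, b = 40.78424, 0.76556
t_crit = 2.306                  # t_.025 with n - 2 = 8 df

x0 = 50
y_hat = a + b * x0              # point estimate for both intervals

ci_half = t_crit * math.sqrt(mse * (1/n + (x0 - xbar)**2 / Sxx))
pi_half = t_crit * math.sqrt(mse * (1 + 1/n + (x0 - xbar)**2 / Sxx))

print(round(y_hat, 2))     # 79.06
print(round(ci_half, 2))   # 6.55  -> CI (72.51, 85.61)
print(round(pi_half, 2))   # 21.11 -> PI (57.95, 100.17)
```

The extra "1 +" inside the prediction interval's square root is what makes it so much wider than the confidence interval.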
Minitab Output
Confidence and prediction intervals when x = 50:

Predicted Values for New Observations
New Obs  Fit    SE Fit  95.0% CI        95.0% PI
1        79.06  2.84    (72.51, 85.61)  (57.95, 100.17)

Values of Predictors for New Observations
New Obs  x
1        50.0

[Figure: Fitted Line Plot, y = 40.78 + 0.7656 x,
with 95% CI and 95% PI bands;
S = 8.70363, R-Sq = 70.5%, R-Sq(adj) = 66.8%]
Green prediction bands are always wider than red
confidence bands. Both intervals are narrowest
when x = x̄.
Correlation Analysis
• The strength of the relationship between x and y is
measured using the coefficient of correlation:

Correlation coefficient: r = Sxy / √(Sxx Syy)

• Recall from Chapter 3 that
(1) −1 ≤ r ≤ 1  (2) r and b have the same sign
(3) r ≈ 0 means no linear relationship
(4) r ≈ 1 or −1 means a strong (+) or (−)
relationship
Example
The table shows the heights and weights of
n = 10 randomly selected college football
players.
Player 1 2 3 4 5 6 7 8 9 10
Height, x 73 71 75 72 72 75 67 69 71 69
Weight, y 185 175 200 210 190 195 150 170 180 175

Use your calculator to find the sums and sums of
squares:
Sxy = 328   Sxx = 60.4   Syy = 2610
r = 328 / √((60.4)(2610)) = .8261
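The same r can be computed from the raw heights and weights in the table; a sketch (variable names are mine):

```python
import math

# Heights (x) and weights (y) for the n = 10 football players
x = [73, 71, 75, 72, 72, 75, 67, 69, 71, 69]
y = [185, 175, 200, 210, 190, 195, 150, 170, 180, 175]
n = len(x)

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
Syy = sum(yi**2 for yi in y) - sum(y)**2 / n
Sxy = sum(xi*yi for xi, yi in zip(x, y)) - sum(x)*sum(y) / n

r = Sxy / math.sqrt(Sxx * Syy)   # correlation coefficient

print(round(Sxy, 1), round(Sxx, 1), round(Syy, 1))  # 328.0 60.4 2610.0
print(round(r, 4))                                  # 0.8261
```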
Football Players
[Figure: Scatterplot of Weight vs Height]

r = .8261
Strong positive correlation: as the player's
height increases, so does his weight.
Some Correlation Patterns
• Use the Exploring Correlation applet to
explore some correlation patterns:
– r = 0: no correlation
– r = .931: strong positive correlation
– r = 1: linear relationship
– r = −.67: weaker negative correlation
Inference using r
• The population coefficient of correlation is
called ρ ("rho"). We can test for a significant
correlation between x and y using a t test:

To test H0: ρ = 0 versus Ha: ρ ≠ 0
Test Statistic: t = r √((n − 2)/(1 − r²))
Reject H0 if t > t_α/2 or t < −t_α/2 with n − 2 df.
This test is exactly equivalent to the t-test for
the slope β.
Example
Is there a significant positive correlation
between weight and height in the population
of all college football players?

H0: ρ = 0 versus Ha: ρ > 0
Test Statistic: t = r √((n − 2)/(1 − r²))
 = .8261 √(8/(1 − .8261²)) = 4.15
Use the t-table with n − 2 = 8 df to
bound the p-value as p-value < .005. There is a
significant positive correlation.
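The test statistic above is a one-line computation; a sketch using r = .8261 and n = 10 from the preceding slides:

```python
import math

# t test for H0: rho = 0 vs Ha: rho > 0, football example
r, n = 0.8261, 10

t = r * math.sqrt((n - 2) / (1 - r**2))   # n - 2 = 8 df

print(round(t, 2))   # 4.15
```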
Key Concepts
I. A Linear Probabilistic Model
1. When the data exhibit a linear relationship, the
appropriate model is y = α + βx + ε.
2. The random error ε has a normal distribution with
mean 0 and variance σ².
II. Method of Least Squares
1. Estimates a and b, for α and β, are chosen to
minimize SSE, the sum of the squared deviations
about the regression line, ŷ = a + bx.
2. The least squares estimates are b = Sxy/Sxx and
a = ȳ − b x̄.
Key Concepts
III. Analysis of Variance
1. Total SS = SSR + SSE, where Total SS = Syy and
SSR = (Sxy)²/Sxx.
2. The best estimate of σ² is MSE = SSE/(n − 2).
IV. Testing, Estimation, and Prediction
1. A test for the significance of the linear regression,
H0: β = 0, can be implemented using one of two
test statistics:
t = b/√(MSE/Sxx)  or  F = MSR/MSE
Key Concepts
2. The strength of the relationship between x and y can be
measured using
r² = SSR/Total SS
which gets closer to 1 as the relationship gets stronger.
3. Use residual plots to check for nonnormality, inequality of
variances, and an incorrectly fit model.
4. Confidence intervals can be constructed to estimate the
intercept α and slope β of the regression line and to estimate
the average value of y, E(y), for a given value of x.
5. Prediction intervals can be constructed to predict a
particular observation, y, for a given value of x. For a given
x, prediction intervals are always wider than confidence
intervals.
Key Concepts
V. Correlation Analysis
1. Use the correlation coefficient to measure the
relationship between x and y when both variables are
random:
r = Sxy / √(Sxx Syy)
2. The sign of r indicates the direction of the
relationship; r near 0 indicates no linear relationship,
and r near 1 or −1 indicates a strong linear
relationship.
3. A test of the significance of the correlation
coefficient is identical to the test of the slope β.
