DATAENG Lesson 10 Simple Linear Regression and Correlation
Sixth Edition
Douglas C. Montgomery George C. Runger
Chapter 11
Simple Linear Regression and Correlation
Copyright © 2014 John Wiley & Sons, Inc. All rights reserved.
DATAENG
(Engineering Data Analysis)
11-1: Empirical Models
• Many problems in engineering and science involve
exploring the relationships between two or more
variables.
• Regression analysis is a statistical technique that
is very useful for these types of problems.
• For example, in a chemical process, suppose that
the yield of the product is related to the process-
operating temperature.
• Regression analysis can be used to build a model
to predict yield at a given temperature level.
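$$\hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad (11\text{-}2)$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad (11\text{-}3)$$

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i, \quad i = 1, 2, \ldots, n$$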
$$S_{xy} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}$$
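$$n = 20, \quad \sum_{i=1}^{20} x_i = 23.92, \quad \sum_{i=1}^{20} y_i = 1{,}843.21, \quad \bar{x} = 1.1960, \quad \bar{y} = 92.1605$$

$$\sum_{i=1}^{20} y_i^2 = 170{,}044.5321, \quad \sum_{i=1}^{20} x_i^2 = 29.2892, \quad \sum_{i=1}^{20} x_i y_i = 2{,}214.6566$$

$$S_{xx} = \sum_{i=1}^{20} x_i^2 - \frac{\left(\sum_{i=1}^{20} x_i\right)^2}{20} = 29.2892 - \frac{(23.92)^2}{20} = 0.68088$$

and

$$S_{xy} = \sum_{i=1}^{20} x_i y_i - \frac{\left(\sum_{i=1}^{20} x_i\right)\left(\sum_{i=1}^{20} y_i\right)}{20} = 2{,}214.6566 - \frac{(23.92)(1{,}843.21)}{20} = 10.17744$$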
Sec 11-2 Simple Linear Regression 9
EXAMPLE 11-1 Oxygen Purity - continued
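Therefore, the least squares estimates of the slope and intercept are

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{10.17744}{0.68088} = 14.94748$$

and

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 92.1605 - (14.94748)(1.196) = 74.28331$$

The fitted simple linear regression model (with the coefficients reported to
three decimal places) is

$$\hat{y} = 74.283 + 14.947x$$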
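The arithmetic of Example 11-1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and it starts from the summary sums quoted in the example rather than from the 20 raw observations, which are not reproduced here.

```python
# Summary quantities quoted in Example 11-1 (oxygen purity data).
n = 20
sum_x = 23.92
sum_y = 1843.21
sum_x2 = 29.2892
sum_xy = 2214.6566

# Corrected sum of squares and corrected sum of cross-products.
s_xx = sum_x2 - sum_x**2 / n
s_xy = sum_xy - sum_x * sum_y / n

# Least squares estimates of the slope and intercept.
x_bar = sum_x / n
y_bar = sum_y / n
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

print(f"Sxx = {s_xx:.5f}, Sxy = {s_xy:.5f}")
print(f"slope = {beta1_hat:.3f}, intercept = {beta0_hat:.3f}")
```

Rounded to three decimal places, the slope and intercept agree with the coefficients of the fitted model.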
$$SS_E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
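The error sum of squares is easiest to see on raw data. The sketch below fits a line to a small, entirely hypothetical data set (the five points and the function names are made up for illustration), computes the residuals, and forms SS_E. It also divides by n - 2, the usual unbiased estimator of the error variance in simple linear regression.

```python
def fit_line(x, y):
    """Return (beta0_hat, beta1_hat) from the least squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


def error_sum_of_squares(x, y):
    """Return (SS_E, residuals) for the fitted least squares line."""
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return sum(e ** 2 for e in residuals), residuals


# Hypothetical data, not taken from the oxygen purity example.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

ss_e, residuals = error_sum_of_squares(x, y)
sigma2_hat = ss_e / (len(x) - 2)  # estimate of the error variance

print(f"SS_E = {ss_e:.4f}, sigma^2 estimate = {sigma2_hat:.4f}")
```

Because the fitted line includes an intercept, the residuals sum to zero up to rounding error, which is a quick sanity check on any implementation.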
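$$H_0\colon \beta_1 = \beta_{1,0} \qquad H_1\colon \beta_1 \neq \beta_{1,0}$$

$$H_0\colon \beta_0 = \beta_{0,0} \qquad H_1\colon \beta_0 \neq \beta_{0,0}$$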
An appropriate test statistic would be
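$$T_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\hat{\sigma}^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right]}} = \frac{\hat{\beta}_0 - \beta_{0,0}}{se(\hat{\beta}_0)} \qquad (11\text{-}7)$$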
$$H_0\colon \beta_1 = 0 \qquad H_1\colon \beta_1 \neq 0$$
Practical Interpretation: Since the reference value of t is $t_{0.005,18} = 2.88$, the value of the test
statistic is very far into the critical region, implying that $H_0\colon \beta_1 = 0$ should be rejected. There is
strong evidence to support this claim. The P-value for this test is $P \simeq 1.23 \times 10^{-9}$. This was
obtained manually with a calculator.
Table 11-2 presents the Minitab output for this problem. Notice that the t-statistic value for the
slope is computed as 11.35 and that the reported P-value is P = 0.000. Minitab also reports the
t-statistic for testing the hypothesis $H_0\colon \beta_0 = 0$. This statistic is computed from Equation 11-7,
with $\beta_{0,0} = 0$, as $t_0 = 46.62$. Clearly, then, the hypothesis that the intercept is zero is rejected.
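Both t-statistics can be reproduced approximately from quantities quoted in this chapter. The sketch below assumes the error variance estimate of 1.18 reported in Table 11-2 and the rounded coefficients of the fitted model, so small differences from the Minitab values are expected. The variable names are illustrative.

```python
import math

# Quantities quoted for the oxygen purity data.
n = 20
x_bar = 1.1960
s_xx = 0.68088
sigma2_hat = 1.18      # error variance estimate (Table 11-2)
beta0_hat = 74.283
beta1_hat = 14.947

# Standard errors of the slope and intercept.
se_beta1 = math.sqrt(sigma2_hat / s_xx)
se_beta0 = math.sqrt(sigma2_hat * (1 / n + x_bar**2 / s_xx))

# Test statistics for H0: beta1 = 0 and H0: beta0 = 0.
t_slope = (beta1_hat - 0.0) / se_beta1
t_intercept = (beta0_hat - 0.0) / se_beta0

t_crit = 2.88  # t_{0.005,18}, the reference value used in the text
print(f"t (slope) = {t_slope:.2f}, t (intercept) = {t_intercept:.2f}")
print("reject H0: beta1 = 0" if abs(t_slope) > t_crit else "fail to reject")
```

With these rounded inputs the slope statistic comes out near 11.35 and the intercept statistic near 46.6, both far beyond the reference value.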
Symbolically,
SST = SSR + SSE (11-9)
SSE = error sum of squares
SSR = regression sum of squares
SST = total corrected sum of squares
EXAMPLE 11-3 Oxygen Purity ANOVA We will use the analysis of variance
approach to test for significance of regression using the oxygen purity data
model from Example 11-1. Recall that $SS_T = 173.38$, $\hat{\beta}_1 = 14.947$, $S_{xy} = 10.17744$,
and $n = 20$. The regression sum of squares is

$$SS_R = \hat{\beta}_1 S_{xy} = (14.947)(10.17744) = 152.13$$
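The remaining entries of the analysis of variance follow from the identity SST = SSR + SSE. The sketch below assumes the usual degrees of freedom for simple linear regression (1 for regression, n - 2 for error) and forms the ratio F0 = MSR / MSE. Because it starts from the rounded slope, the results differ slightly from the values quoted in the text.

```python
# Quantities recalled in Example 11-3.
ss_t = 173.38
beta1_hat = 14.947
s_xy = 10.17744
n = 20

ss_r = beta1_hat * s_xy   # regression sum of squares
ss_e = ss_t - ss_r        # error sum of squares, from SST = SSR + SSE

ms_r = ss_r / 1           # 1 degree of freedom for regression
ms_e = ss_e / (n - 2)     # n - 2 degrees of freedom for error
f0 = ms_r / ms_e          # test statistic for significance of regression

print(f"SSR = {ss_r:.2f}, SSE = {ss_e:.2f}, MSE = {ms_e:.2f}, F0 = {f0:.1f}")
```

In simple linear regression this F statistic equals the square of the t-statistic for the slope, which is consistent with a slope t-value of 11.35.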
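$$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \;\le\; \beta_1 \;\le\; \hat{\beta}_1 + t_{\alpha/2,\,n-2}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \qquad (11\text{-}11)$$

$$\hat{\beta}_0 - t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \;\le\; \beta_0 \;\le\; \hat{\beta}_0 + t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right]} \qquad (11\text{-}12)$$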
11-5: Confidence Intervals
EXAMPLE 11-4 Oxygen Purity Confidence Interval on the Slope We will
find a 95% confidence interval on the slope of the regression line using the data
in Example 11-1. Recall that $\hat{\beta}_1 = 14.947$, $S_{xx} = 0.68088$, and $\hat{\sigma}^2 = 1.18$ (see Table
11-2). Then, from Equation 11-11 we find

$$\hat{\beta}_1 - t_{0.025,18}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}} \;\le\; \beta_1 \;\le\; \hat{\beta}_1 + t_{0.025,18}\sqrt{\frac{\hat{\sigma}^2}{S_{xx}}}$$

or

$$14.947 - 2.101\sqrt{\frac{1.18}{0.68088}} \;\le\; \beta_1 \;\le\; 14.947 + 2.101\sqrt{\frac{1.18}{0.68088}}$$

This simplifies to

$$12.181 \le \beta_1 \le 17.713$$
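The interval in Example 11-4 can be reproduced numerically. The helper below is a minimal sketch with a hypothetical function name. It takes the critical value 2.101 from the example rather than computing it, so it needs only the standard library.

```python
import math


def slope_confidence_interval(beta1_hat, sigma2_hat, s_xx, t_crit):
    """Return (lower, upper) limits from Equation 11-11."""
    half_width = t_crit * math.sqrt(sigma2_hat / s_xx)
    return beta1_hat - half_width, beta1_hat + half_width


# Values from Example 11-4; 2.101 is t_{0.025,18}.
lower, upper = slope_confidence_interval(14.947, 1.18, 0.68088, 2.101)
print(f"{lower:.3f} <= beta1 <= {upper:.3f}")
```

The interval excludes zero, which agrees with the rejection of H0: β1 = 0 in the t-test on the slope.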
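Definition
A $100(1 - \alpha)\%$ confidence interval about the mean response at the value of
$x = x_0$, say $\mu_{Y|x_0}$, is given by

$$\hat{\mu}_{Y|x_0} - t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\le\; \mu_{Y|x_0} \;\le\; \hat{\mu}_{Y|x_0} + t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \qquad (11\text{-}13)$$

where $\hat{\mu}_{Y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$ is computed from the fitted regression model.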
We will construct a 95% confidence interval about the mean response for the
data in Example 11-1. The fitted model is $\hat{\mu}_{Y|x_0} = 74.283 + 14.947x_0$, and the
95% confidence interval on $\mu_{Y|x_0}$ is found from Equation 11-13 as

$$\hat{\mu}_{Y|x_0} \pm 2.101\sqrt{1.18\left[\frac{1}{20} + \frac{(x_0 - 1.1960)^2}{0.68088}\right]}$$

Suppose that we are interested in predicting mean oxygen purity when
$x_0 = 1.00\%$. Then

$$\hat{\mu}_{Y|x_{1.00}} = 74.283 + 14.947(1.00) = 89.23$$
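The interval on the mean response can be evaluated at any hydrocarbon level. The sketch below hard-codes the quantities quoted for the oxygen purity data. The function name and constants are illustrative, and the critical value 2.101 is taken from the text rather than computed.

```python
import math

# Quantities quoted for the oxygen purity data.
N = 20
X_BAR = 1.1960
S_XX = 0.68088
SIGMA2_HAT = 1.18
T_CRIT = 2.101  # t_{0.025,18}


def mean_response_interval(x0):
    """Return (lower, upper) 95% limits on the mean response at x0."""
    mu_hat = 74.283 + 14.947 * x0
    half_width = T_CRIT * math.sqrt(
        SIGMA2_HAT * (1 / N + (x0 - X_BAR) ** 2 / S_XX)
    )
    return mu_hat - half_width, mu_hat + half_width


lower, upper = mean_response_interval(1.00)
print(f"{lower:.2f} <= mean purity at x0 = 1.00 <= {upper:.2f}")
```

The half-width depends on the squared distance of x0 from the sample mean of x, so the interval is narrowest at x0 = 1.1960 and widens as x0 moves away from it.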
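$$\hat{y}_0 - t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \;\le\; Y_0 \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-2}\sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]} \qquad (11\text{-}14)$$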
$$89.23 - 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.1960)^2}{0.68088}\right]} \;\le\; Y_0 \;\le\; 89.23 + 2.101\sqrt{1.18\left[1 + \frac{1}{20} + \frac{(1.00 - 1.1960)^2}{0.68088}\right]}$$

which simplifies to

$$86.83 \le y_0 \le 91.63$$
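The prediction interval differs from the interval on the mean response only by the extra 1 inside the square root, which accounts for the variability of a single future observation. A minimal sketch, with an illustrative function name and default arguments set to the quantities quoted for the oxygen purity data:

```python
import math


def prediction_interval(x0, n=20, x_bar=1.1960, s_xx=0.68088,
                        sigma2_hat=1.18, t_crit=2.101):
    """Return (lower, upper) 95% prediction limits for a new observation at x0."""
    y0_hat = 74.283 + 14.947 * x0
    half_width = t_crit * math.sqrt(
        sigma2_hat * (1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    )
    return y0_hat - half_width, y0_hat + half_width


lower, upper = prediction_interval(1.00)
print(f"{lower:.2f} <= y0 <= {upper:.2f}")
```

At the same x0 the prediction interval is always wider than the confidence interval on the mean response.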
Table 11-4 presents the observed and predicted values of y at each value
of x from this data set, along with the corresponding residual. These values
were computed using Minitab and show the number of decimal places
typical of computer output.
A normal probability plot of the residuals is shown in Fig. 11-10. Since the
residuals fall approximately along a straight line in the figure, we conclude
that there is no severe departure from normality.
The residuals are also plotted against the predicted value ŷi in Fig. 11-11
and against the hydrocarbon levels xi in Fig. 11-12. These plots do not
indicate any serious model inadequacies.
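$\mu_X$ and $\sigma_X^2$ are the mean and variance of X, and $\rho$ is the correlation coefficient between Y and
X. Recall that the correlation coefficient is defined as

$$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \qquad (11\text{-}15)$$

where $\sigma_{XY}$ is the covariance between Y and X.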
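The conditional distribution of Y for a given value of X = x is

$$f_{Y|x}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_{Y|x}} \exp\left[-\frac{1}{2}\left(\frac{y - \beta_0 - \beta_1 x}{\sigma_{Y|x}}\right)^2\right] \qquad (11\text{-}16)$$

where

$$\beta_0 = \mu_Y - \mu_X \rho \frac{\sigma_Y}{\sigma_X} \qquad (11\text{-}17)$$

$$\beta_1 = \frac{\sigma_Y}{\sigma_X}\rho \qquad (11\text{-}18)$$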
11-8: Correlation
It is possible to draw inferences about the correlation coefficient in this model.
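The estimator of $\rho$ is the sample correlation coefficient

$$R = \frac{\sum_{i=1}^{n} Y_i (X_i - \bar{X})}{\left[\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2\right]^{1/2}} = \frac{S_{XY}}{(S_{XX}\, SS_T)^{1/2}} \qquad (11\text{-}19)$$

Note that

$$\hat{\beta}_1 = \left(\frac{SS_T}{S_{XX}}\right)^{1/2} R \qquad (11\text{-}20)$$

$$R^2 = \hat{\beta}_1^2 \frac{S_{XX}}{S_{YY}} = \frac{\hat{\beta}_1 S_{XY}}{SS_T} = \frac{SS_R}{SS_T}$$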
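The relationships between R, the slope estimate, and the sums of squares can be verified numerically. The sketch below uses a small hypothetical data set (not from the text) and illustrative variable names.

```python
import math

# Hypothetical paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_t = sum((yi - y_bar) ** 2 for yi in y)   # total corrected sum of squares
s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))

r = s_xy / math.sqrt(s_xx * ss_t)           # Equation 11-19
beta1_hat = s_xy / s_xx                     # least squares slope
ss_r = beta1_hat * s_xy                     # regression sum of squares

print(f"R = {r:.4f}")
print(f"slope = {beta1_hat:.4f}, sqrt(SST/Sxx) * R = {math.sqrt(ss_t / s_xx) * r:.4f}")
print(f"R^2 = {r**2:.4f}, SSR/SST = {ss_r / ss_t:.4f}")
```

The two printed pairs agree, illustrating Equation 11-20 and the identity R² = SSR/SST.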
It is often useful to test the hypotheses

$$H_0\colon \rho = 0 \qquad H_1\colon \rho \neq 0$$

$$T_0 = \frac{R\sqrt{n-2}}{\sqrt{1 - R^2}} \qquad (11\text{-}21)$$
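Equation 11-21 is straightforward to code. The function below is a minimal sketch with a hypothetical name. The inputs in the usage line are made-up values, not data from the text. The resulting statistic would be compared with a t critical value with n - 2 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """Return T0 = R * sqrt(n - 2) / sqrt(1 - R^2) for testing H0: rho = 0."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 2:
        raise ValueError("n must exceed 2")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical sample correlation and sample size.
t0 = correlation_t_statistic(0.5, 27)
print(f"t0 = {t0:.3f}")
```

The guard clauses reflect that the statistic is undefined when |R| = 1 or when fewer than three observations are available.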
$$H_0\colon \rho = \rho_0 \qquad H_1\colon \rho \neq \rho_0$$
Figure 11-13 Scatter plot of wire bond strength versus wire length, Example 11-8.
Analysis of Variance
Source            DF       SS       MS        F       P
Regression         1   5885.9   5885.9   615.08   0.000
Residual Error    23    220.1      9.6
Total             24   6105.9
Example 11-8 (continued)
Now $S_{xx} = 698.56$ and $S_{xy} = 2027.7132$, and the sample correlation coefficient
is

$$r = \frac{S_{xy}}{\left[S_{xx}\, SS_T\right]^{1/2}} = \frac{2027.7132}{\left[(698.560)(6105.9)\right]^{1/2}} = 0.9818$$
$$H_0\colon \rho = 0 \qquad H_1\colon \rho \neq 0$$
which reduces to

$$0.9585 \le \rho \le 0.9921$$
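The numbers in Example 11-8 can be checked from the quantities quoted for the wire bond data. The sketch below rests on three assumptions that are not stated in this excerpt: the sample size n = 25 is inferred from the 24 total degrees of freedom in the analysis of variance table, the confidence level is taken to be 95% (z = 1.96), and the interval on ρ is assumed to come from Fisher's z-transformation, tanh(arctanh r ± z/√(n − 3)).

```python
import math

# Quantities quoted for the wire bond strength data.
s_xx = 698.56
s_xy = 2027.7132
ss_t = 6105.9
ss_r = 5885.9
n = 25            # assumed: total DF of 24 in the ANOVA table, plus 1

# Sample correlation coefficient and the t-statistic for H0: rho = 0.
r = s_xy / math.sqrt(s_xx * ss_t)
t0 = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# R^2 computed two ways: from r and from the ANOVA sums of squares.
r_squared_from_anova = ss_r / ss_t

# Approximate interval on rho via Fisher's z-transformation (assumed method).
z_crit = 1.96     # assumed: 95% confidence level
half_width = z_crit / math.sqrt(n - 3)
lower = math.tanh(math.atanh(r) - half_width)
upper = math.tanh(math.atanh(r) + half_width)

print(f"r = {r:.4f}, t0 = {t0:.1f}")
print(f"r^2 = {r**2:.4f}, SSR/SST = {r_squared_from_anova:.4f}")
print(f"{lower:.4f} <= rho <= {upper:.4f}")
```

Under these assumptions the limits match the interval quoted above to four decimal places, and the t-statistic is large enough to reject H0: ρ = 0 at any conventional significance level.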
Figure 11-14 Plot of DC output y versus wind velocity x for the windmill data.
Figure 11-15 Plot of residuals ei versus fitted values ŷi for the windmill data.
Figure 11-17 Plot of residuals versus fitted values ŷi for the transformed model for the windmill data.
Figure 11-18 Normal probability plot of the residuals for the transformed model for the windmill data.
A plot of the residuals from the transformed model versus ŷi is shown in Figure 11-17. This plot does not
reveal any serious problem with inequality of variance. The normal probability plot, shown in Figure 11-18,
gives a mild indication that the errors come from a distribution with heavier tails than the normal (notice the
slight upward and downward curve at the extremes). This normal probability plot has the z-score value
plotted on the horizontal axis. Since there is no strong signal of model inadequacy, we conclude that the
transformed model is satisfactory.
Sec 11-9 Transformation and Logistic Regression 60
Useful Intrinsically Linear Functions
Important Terms & Concepts of Chapter 11
• Analysis of variance test in regression
• Confidence interval on mean response
• Correlation coefficient
• Empirical model
• Confidence intervals on model parameters
• Intrinsically linear model
• Least squares estimation of regression model parameters
• Logistic regression
• Model adequacy checking
• Odds ratio
• Prediction interval on a future observation
• Regression analysis
• Residual plots
• Residuals
• Scatter diagram
• Simple linear regression model
• Standard error
• Statistical test on model parameters
• Transformations