Simple Linear Regression (Solutions To Exercises)
Chapter 5
On a machine that folds plastic film, the temperature may be varied in the range
130-185 °C. To obtain, if possible, a model for the influence of temperature on
the fold thickness, n = 12 related sets of values of temperature and fold
thickness were measured, as illustrated in the following figure:
[Figure: scatter plot of the measured thickness (vertical axis, ticks from 90 to 130) against temperature.]
1) β̂ 0 = 0, β̂ 1 = −0.9, σ̂ = 36
2) β̂ 0 = 0, β̂ 1 = 0.9, σ̂ = 3.6
3) β̂ 0 = 252, β̂ 1 = −0.9, σ̂ = 3.6
4) β̂ 0 = −252, β̂ 1 = −0.9, σ̂ = 36
5) β̂ 0 = 252, β̂ 1 = −0.9, σ̂ = 36
Solution
First of all, the only plausible intercept (β̂0) among the ones given in the answers is
252, which leaves answers 3 and 5. The slope estimate of −0.9 in these two options
looks reasonable. We just need to decide whether the estimated standard deviation
of the error, se = σ̂, is 3.6 or 36. From the figure it is clear that the points do NOT
have an average vertical distance to the line of around 36, so 3.6 must be the correct
number and hence the correct answer is answer 3: β̂0 = 252, β̂1 = −0.9, σ̂ = 3.6.
Solution
The proportion of variation explained must be pretty high, so 0 can be ruled out.
Answers 1 and 4 are also ruled out since the correlation is clearly negative. This
narrows the possibilities down to answers 3 and 5. And since the correlation is NOT
exactly −1 (in which case the observations would lie exactly on the line), the correct
answer is:
a) Calculate the 95% confidence interval for the slope in the usual linear
regression model, which expresses the life time as a linear function of the
temperature.
Solution
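Either one could do all the regression computations by hand to find β̂1 = −5.3133 and
then use the formula for the confidence interval for β1 in Method 5.15,
\[
\hat{\beta}_1 \pm t_{1-\alpha/2} \cdot \hat{\sigma}_{\beta_1}
  = \hat{\beta}_1 \pm t_{1-\alpha/2} \cdot \hat{\sigma} \sqrt{\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\]
or one could run the regression in R: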
D <- data.frame(t=c(10,20,30,40,50,60,70,80,90),
y=c(420,365,285,220,176,117,69,34,5))
fit <- lm(y ~ t, data=D)
summary(fit)
Call:
lm(formula = y ~ t, data = D)
Residuals:
Min 1Q Median 3Q Max
-21.02 -12.62 -9.16 17.71 29.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 453.556 14.394 31.5 8.4e-09 ***
t -5.313 0.256 -20.8 1.5e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
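and use the fact that what is known as the "standard error for the slope" can be
read off directly from the R output as
\[
\hat{\sigma}_{\beta_1} = \hat{\sigma} \sqrt{\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = 0.2558,
\]
together with the relevant t-quantile (with ν = n − 2 = 7 degrees of freedom):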
qt(.975,7)
[1] 2.365
-5.31+c(-1,1)*qt(.975,7)*0.2558
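As a cross-check of the hand computation above, here is a minimal sketch that builds the interval from the coefficient table and compares it with R's built-in `confint()`. The variable names `est`, `se`, `ci.manual` and `ci.builtin` are introduced here for illustration only. Both versions should give an interval of roughly −5.92 to −4.71.

```r
# Life time data from the exercise
D <- data.frame(t = c(10, 20, 30, 40, 50, 60, 70, 80, 90),
                y = c(420, 365, 285, 220, 176, 117, 69, 34, 5))
fit <- lm(y ~ t, data = D)

# Slope estimate and its standard error from the coefficient table
est <- coef(summary(fit))["t", "Estimate"]
se  <- coef(summary(fit))["t", "Std. Error"]

# 95% confidence interval by hand: estimate +/- t-quantile * standard error
ci.manual <- est + c(-1, 1) * qt(0.975, df = nrow(D) - 2) * se

# The same interval from the built-in helper
ci.builtin <- confint(fit, "t", level = 0.95)

ci.manual
ci.builtin
```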
5%?
Solution
Since the confidence interval does not include 0, a relationship between life time
and temperature can be documented. Also, the p-value is 1.5 · 10^(−7) < 0.05 = α,
which gives strong evidence against the null hypothesis.
Chapter 5 5.3 YIELD OF CHEMICAL PROCESS 8
\[
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma_\varepsilon^2), \quad i = 1, \dots, 5
\]
Solution
D <- data.frame(x=c(0,25,50,75,100),
y=c(14,38,54,76,95))
fit <- lm(y ~ x, data=D)
summary(fit)
Call:
lm(formula = y ~ x, data = D)
Residuals:
1 2 3 4 5
-1.4 2.6 -1.4 0.6 -0.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.4000 1.4967 10.3 0.002 **
x 0.8000 0.0244 32.7 0.000063 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Alternatively one could use hand calculations and use the formula in Theorem 5.12
for the t-test of the null hypothesis: H0 : β 1 = 0.
The relevant test statistic and p-value can be read off in the R output as 32.7 and
0.000063. So the answer is:
Yes, as the relevant test statistic and p-value are resp. 32.7 and 0.00006 < 0.05 = α.
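The hand calculation mentioned above can be sketched in R as follows. This is only an illustration of the usual least squares formulas, with variable names (`Sxx`, `beta1`, `se.beta1`, `t.obs`) chosen here; it should reproduce the test statistic of about 32.7 from the summary output.

```r
# Yield data from the exercise
D <- data.frame(x = c(0, 25, 50, 75, 100),
                y = c(14, 38, 54, 76, 95))
n <- nrow(D)

# Least squares estimates
Sxx   <- sum((D$x - mean(D$x))^2)
beta1 <- sum((D$x - mean(D$x)) * (D$y - mean(D$y))) / Sxx
beta0 <- mean(D$y) - beta1 * mean(D$x)

# Residual standard deviation with n - 2 degrees of freedom
res   <- D$y - (beta0 + beta1 * D$x)
sigma <- sqrt(sum(res^2) / (n - 2))

# t-test of H0: beta1 = 0
se.beta1 <- sigma / sqrt(Sxx)
t.obs    <- beta1 / se.beta1
p.value  <- 2 * (1 - pt(abs(t.obs), df = n - 2))

c(beta0 = beta0, beta1 = beta1, t.obs = t.obs, p.value = p.value)
```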
Solution
We use the formula in Equation (5-59) for the confidence limits of the line (the
expected value of Yi for a value xnew):
\[
\hat{\beta}_0 + \hat{\beta}_1 x_{\text{new}} \pm t_{1-\alpha/2} \, \hat{\sigma} \sqrt{\frac{1}{n} + \frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}}.
\]
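A minimal sketch of how this formula can be evaluated in R, compared against `predict()` with `interval = "confidence"`. The value `x.new <- 50` is a hypothetical choice made only for illustration, since the value asked for in the exercise is not shown here; the 95% level is likewise an assumption.

```r
# Yield data from the exercise
D <- data.frame(x = c(0, 25, 50, 75, 100),
                y = c(14, 38, 54, 76, 95))
fit <- lm(y ~ x, data = D)

x.new <- 50  # ASSUMPTION: hypothetical value, not taken from the exercise
n     <- nrow(D)
Sxx   <- sum((D$x - mean(D$x))^2)
sigma <- summary(fit)$sigma

# Point on the line and half-width of the confidence interval for the line
center     <- sum(coef(fit) * c(1, x.new))
half.width <- qt(0.975, df = n - 2) * sigma *
  sqrt(1 / n + (x.new - mean(D$x))^2 / Sxx)
ci.manual  <- center + c(-1, 1) * half.width

# Same interval from predict()
ci.builtin <- predict(fit, newdata = data.frame(x = x.new),
                      interval = "confidence", level = 0.95)

ci.manual
ci.builtin
```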
Solution
We use the basic definition of finding a quantile (from Definition 1.7) and the upper
quartile is q0.75 (see Definition 1.8). We set n = 5, p = 0.75, so
np = 3.75
In the manufacturing of a plastic material, it is believed that the cooling time has
an influence on the impact strength. Therefore a study is carried out in which
plastic material impact strength is determined for 4 different cooling times. The
results of this experiment are shown in the following table:
a) What is the 95% confidence interval for the slope of the regression model,
expressing the impact strength as a linear function of the cooling time?
Chapter 5 5.4 PLASTIC MATERIAL 13
Solution
The easiest way to get to the confidence interval is to use the standard error for the
slope (σ̂β1 or denoted with SEβ1 ) given in the R output:
x <- c(15,25,35,40)
y <- c(42.1,36.0,31.8,28.7)
summary(lm(y ~ x))
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4
0.2814 -0.6051 0.4085 -0.0847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.639 0.878 56.5 0.00031 ***
x -0.521 0.029 -18.0 0.00308 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The standard error for the slope is σ̂β1 = 0.029 (also known as the standard deviation
of the sampling distribution of β̂1). Then find the relevant t-quantile with ν = 2
degrees of freedom (either of):
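The two equivalent ways of getting the quantile, and the resulting interval, might look as follows. This is a sketch; the names `t.upper`, `t.lower` and `ci` are chosen here for illustration. The interval should come out at roughly −0.65 to −0.40.

```r
# Cooling time (x) and impact strength (y) from the exercise
x <- c(15, 25, 35, 40)
y <- c(42.1, 36.0, 31.8, 28.7)
fit <- lm(y ~ x)

# Either quantile can be used; they differ only in sign
t.upper <- qt(0.975, df = 2)
t.lower <- qt(0.025, df = 2)

# Slope estimate and standard error from the coefficient table
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# 95% confidence interval for the slope
ci <- est + c(t.lower, t.upper) * se
ci
```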
b) Can you conclude that there is a relation between the impact strength and
the cooling time at significance level α = 5%?
Solution
The relevant p-value can be read off directly from the summary output: 0.00308, and
we can conclude: Yes, as the relevant p-value is 0.00308, which is smaller than 0.05.
c) For a similar plastic material the tabulated value for the linear relation
between temperature and impact strength (i.e. the slope) is −0.30. If the
following hypothesis is tested (at level α = 0.05)
H0 : β1 = −0.30
H1 : β1 ≠ −0.30
with the usual t-test statistic for such a test, what is the range (for t) within
which the hypothesis is accepted?
Solution
The so-called critical values for the t-statistic with ν = 2 degrees of freedom are
±t0.975 = ±4.303 (in R: qt(0.975,2) gives the positive value and qt(0.025,2) the
negative one). So the answer becomes:
[−4.303, 4.303].
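Going one step beyond what the question asks, the observed test statistic for H0: β1 = −0.30 can be compared with this range. The sketch below assumes the usual statistic (estimate minus hypothesized value, divided by the standard error); the names `beta1.0`, `t.obs` and `reject` are chosen here for illustration.

```r
# Cooling time (x) and impact strength (y) from the exercise
x <- c(15, 25, 35, 40)
y <- c(42.1, 36.0, 31.8, 28.7)
fit <- lm(y ~ x)

beta1.0 <- -0.30  # hypothesized slope under H0
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

# Observed t-statistic and comparison with the acceptance range
t.obs  <- (est - beta1.0) / se
t.crit <- qt(0.975, df = 2)
reject <- abs(t.obs) > t.crit

c(t.obs = t.obs, t.crit = t.crit)
reject
```

With these data the statistic falls well outside [−4.303, 4.303], so under the stated assumptions the hypothesis would be rejected.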
Chapter 5 5.5 WATER POLUTION 15
a) What are the parameter estimates for the three unknown parameters in
the usual linear regression model: 1) The intercept (β 0 ), 2) the slope (β 1 )
and 3) error standard deviation (σ)?
Solution
Call:
lm(formula = concentration ~ distance, data = D)
Residuals:
1 2 3 4 5
0.324 -0.488 0.100 -0.032 0.096
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.664 0.365 31.96 0.000067 ***
distance -0.244 0.055 -4.43 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Given the knowledge of the R-output structure, the three values can be read off
directly from the output.
So the correct answer is: β̂ 0 = 11.7, β̂ 1 = −0.244 and SEσ̂ = σ̂ = 0.348.
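The value of σ̂ is reported by R as the "Residual standard error". As a sketch of where it comes from, it can be recomputed from the five residuals printed in the output above, using n − 2 degrees of freedom:

```r
# Residuals as printed in the R output above
res <- c(0.324, -0.488, 0.100, -0.032, 0.096)
n   <- length(res)

# Estimated error standard deviation: sqrt(SSE / (n - 2))
sigma <- sqrt(sum(res^2) / (n - 2))
sigma
```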
Solution
The amount of variation in the model output (Y) explained by the input variable
(x) can be found from the squared correlation, which can be read off directly from the
output as "Multiple R-squared". So the correct answer is: R2 = 86.8%. (This is actually
an estimate of the variation in concentration which can be explained by distance,
since it is what we found with the particular data at hand. If the sample was taken
again, this value would vary. We should really calculate a confidence interval
for R2 to understand how accurate this estimate is!)
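In simple linear regression the same number can also be recovered from the t-statistic of the slope, via the identity R² = t² / (t² + n − 2). The sketch below uses the rounded value t = −4.43 from the output above, so it only agrees with 86.8% up to rounding.

```r
# Slope t-statistic (rounded, from the R output above) and sample size
t.obs <- -4.43
n     <- 5

# Identity valid for simple linear regression
r2 <- t.obs^2 / (t.obs^2 + n - 2)
r2
```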
Solution
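The wanted number is estimated by the point on the line (using xnew = 7)
−0.244 · 7 + 11.664 = 9.96,
and the confidence interval is given by
\[
9.96 \pm t_{0.975}(3) \cdot \hat{\sigma} \sqrt{\frac{1}{5} + \frac{(7 - 6)^2}{S_{xx}}},
\]
where Sxx can be found from the sample standard deviation of the distances: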
sd(D$distance)
[1] 3.162
and thus
\[
S_{xx} = (n - 1) \cdot s_x^2 = 4 \cdot 3.162^2 = 40.
\]
This could also have been found by
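A minimal sketch of such a direct computation, followed by the numerical interval. Note the assumption: the `distance` vector below is hypothetical, chosen only because it reproduces the mean of 6 and the standard deviation of 3.162 used above; the original data are not shown in this section. The values 11.664, −0.244 and σ̂ = 0.348 are taken from the solution above.

```r
# ASSUMPTION: hypothetical distances with mean 6 and sd 3.162,
# standing in for D$distance, which is not shown here
distance <- c(2, 4, 6, 8, 10)
n <- length(distance)

# Sxx directly from its definition
Sxx <- sum((distance - mean(distance))^2)

# Values read off the R output / solution above
sigma <- 0.348
y.hat <- 11.664 - 0.244 * 7

# 95% confidence interval for the line at x.new = 7
half.width <- qt(0.975, df = n - 2) * sigma *
  sqrt(1 / n + (7 - mean(distance))^2 / Sxx)
ci <- y.hat + c(-1, 1) * half.width

Sxx
ci
```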
When purifying drinking water you can use a so-called membrane filtration.
In an experiment one wishes to examine the relationship between the pressure
drop across a membrane and the flux (flow per area) through the membrane.
We observe the following 10 related values of pressure (x) and flux (y):
Observation      1     2     3     4     5     6     7     8     9    10
Pressure (x)  1.02  2.08  2.89  4.01  5.32  5.83  7.26  7.96  9.11  9.99
Flux (y)      1.15  0.85  1.56  1.72  4.32  5.07  5.00  5.31  6.17  7.04
D <- data.frame(
pressure=c(1.02,2.08,2.89,4.01,5.32,5.83,7.26,7.96,9.11,9.99),
flux=c(1.15,0.85,1.56,1.72,4.32,5.07,5.00,5.31,6.17,7.04)
)
a) What is the estimated empirical correlation between pressure and flux?
Also give an interpretation of the correlation.
Chapter 5 5.6 MEMBRANE PRESSURE DROP 19
Solution
D <- data.frame(
pressure=c(1.02,2.08,2.89,4.01,5.32,5.83,7.26,7.96,9.11,9.99),
flux=c(1.15,0.85,1.56,1.72,4.32,5.07,5.00,5.31,6.17,7.04)
)
fit <- lm(flux ~ pressure, data=D)
summary(fit)
Call:
lm(formula = flux ~ pressure, data = D)
Residuals:
Min 1Q Median 3Q Max
-0.989 -0.318 -0.140 0.454 1.046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1886 0.4417 -0.43 0.68
pressure 0.7225 0.0706 10.23 0.0000072 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient of determination (see Theorem 5.25) can be read off the R output
to be 0.929. The sign of the correlation is the same as the sign of the slope, which
can be read off to be positive (β̂1 = 0.7225), so the correlation is
\[
\hat{\rho} = r = \sqrt{0.929} = 0.964.
\]
So the empirical correlation is 0.964, and thus flux is found to increase with
increasing pressure.
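The same number can be obtained directly with `cor()`. The short sketch below also checks that its square matches the coefficient of determination from the fit.

```r
# Membrane data from the exercise
D <- data.frame(
  pressure = c(1.02, 2.08, 2.89, 4.01, 5.32, 5.83, 7.26, 7.96, 9.11, 9.99),
  flux     = c(1.15, 0.85, 1.56, 1.72, 4.32, 5.07, 5.00, 5.31, 6.17, 7.04)
)
fit <- lm(flux ~ pressure, data = D)

# Empirical correlation and coefficient of determination
r  <- cor(D$pressure, D$flux)
r2 <- summary(fit)$r.squared

c(r = r, r.squared = r^2, R2.from.fit = r2)
```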
b) What is a 90% confidence interval for the slope β 1 in the usual regression
model?
Solution
We use the formula for the confidence interval for the slope (β1, see Method 5.15),
and simply realize that the correct t-quantile to use is t1−0.05(8) = t0.95(8) = 1.860
(in R: qt(0.95,8)); the other values we read off the summary output.
So the confidence interval is: 0.7225 ± 1.860 · 0.0706.
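Evaluated numerically, this gives an interval of roughly 0.59 to 0.85. A minimal sketch, with `ci.manual` and `ci.builtin` as illustrative names, comparing the hand formula with `confint()` at level 0.90:

```r
# Membrane data from the exercise
D <- data.frame(
  pressure = c(1.02, 2.08, 2.89, 4.01, 5.32, 5.83, 7.26, 7.96, 9.11, 9.99),
  flux     = c(1.15, 0.85, 1.56, 1.72, 4.32, 5.07, 5.00, 5.31, 6.17, 7.04)
)
fit <- lm(flux ~ pressure, data = D)

# Slope estimate and standard error from the coefficient table
est <- coef(summary(fit))["pressure", "Estimate"]
se  <- coef(summary(fit))["pressure", "Std. Error"]

# 90% confidence interval: the 0.95 quantile leaves 5% in each tail
ci.manual  <- est + c(-1, 1) * qt(0.95, df = nrow(D) - 2) * se
ci.builtin <- confint(fit, "pressure", level = 0.90)

ci.manual
ci.builtin
```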
Solution
The squared correlation, r2 = 0.929, expresses the proportion of variation explained
by the model. This means that 1 − 0.929 = 0.071 expresses the proportion of variation
left unexplained by the model.
d) Can you at significance level α = 0.05 reject the hypothesis that the line
passes through (0, 0)?
Solution
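One way to read this question: the line passes through (0, 0) exactly when β0 = 0, so the relevant test is the one in the intercept row of the summary output shown earlier for these data. The sketch below, with `p.intercept` as an illustrative name, extracts that p-value (about 0.68); since it is far above 0.05, the hypothesis would not be rejected under this reading.

```r
# Membrane data from the exercise
D <- data.frame(
  pressure = c(1.02, 2.08, 2.89, 4.01, 5.32, 5.83, 7.26, 7.96, 9.11, 9.99),
  flux     = c(1.15, 0.85, 1.56, 1.72, 4.32, 5.07, 5.00, 5.31, 6.17, 7.04)
)
fit <- lm(flux ~ pressure, data = D)

# p-value for H0: beta0 = 0 (intercept row of the coefficient table)
p.intercept <- coef(summary(fit))["(Intercept)", "Pr(>|t|)"]
p.intercept

# TRUE would mean rejection at the 5% level
p.intercept < 0.05
```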
e) A confidence interval for the line at three different pressure levels:
x_new^A = 3.5, x_new^B = 5.0 and x_new^C = 9.5 will look as follows:
\[
\hat{\beta}_0 + \hat{\beta}_1 \cdot x_{\text{new}}^{U} \pm C_U
\]
where U then is either A, B or C. Write the constants C_U in increasing
order.
Solution
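The formula for the confidence limits of β0 + β1 xnew includes the following term:
\[
\frac{(x_{\text{new}} - \bar{x})^2}{S_{xx}}
\]
and this is the ONLY term in C_U that makes C_U different between the three Us. And
since x̄ = 5.547 it is clear that
\[
|x_{\text{new}}^{B} - 5.547| = 0.547 < |x_{\text{new}}^{A} - 5.547| = 2.047 < |x_{\text{new}}^{C} - 5.547| = 3.953
\]
and hence
\[
(x_{\text{new}}^{B} - 5.547)^2 < (x_{\text{new}}^{A} - 5.547)^2 < (x_{\text{new}}^{C} - 5.547)^2.
\]
So C_B < C_A < C_C.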
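The ordering can be confirmed numerically. In the sketch below the 95% level is an assumption (the ordering does not depend on it), and `C.U` is an illustrative name for the half-widths.

```r
# Membrane data from the exercise
D <- data.frame(
  pressure = c(1.02, 2.08, 2.89, 4.01, 5.32, 5.83, 7.26, 7.96, 9.11, 9.99),
  flux     = c(1.15, 0.85, 1.56, 1.72, 4.32, 5.07, 5.00, 5.31, 6.17, 7.04)
)
fit <- lm(flux ~ pressure, data = D)

n     <- nrow(D)
x.bar <- mean(D$pressure)
Sxx   <- sum((D$pressure - x.bar)^2)
x.new <- c(A = 3.5, B = 5.0, C = 9.5)

# Half-widths of the confidence interval for the line
# (ASSUMPTION: 95% level, chosen only for illustration)
C.U <- qt(0.975, df = n - 2) * summary(fit)$sigma *
  sqrt(1 / n + (x.new - x.bar)^2 / Sxx)

sort(C.U)
```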
Chapter 5 5.7 MEMBRANE PRESSURE DROP (MATRIX FORM) 22
a) Find parameter values, standard errors, t-test statistics, and p-values for
the standard hypothesis tests.
D <- data.frame(
pressure=c(1.02,2.08,2.89,4.01,5.32,5.83,7.26,7.96,9.11,9.99),
flux=c(1.15,0.85,1.56,1.72,4.32,5.07,5.00,5.31,6.17,7.04)
)
Solution
D <- data.frame(
pressure=c(1.02,2.08,2.89,4.01,5.32,5.83,7.26,7.96,9.11,9.99),
flux=c(1.15,0.85,1.56,1.72,4.32,5.07,5.00,5.31,6.17,7.04)
)
fit <- lm(flux ~ pressure, data=D)
summary(fit)
Call:
lm(formula = flux ~ pressure, data = D)
Residuals:
Min 1Q Median 3Q Max
-0.989 -0.318 -0.140 0.454 1.046
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1886 0.4417 -0.43 0.68
pressure 0.7225 0.0706 10.23 0.0000072 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The parameter estimates are given in the first column, the standard errors in the
second column, the t-test statistics in the third column and the p-values of
the standard hypothesis tests in the last column.
Solution
[,1]
[1,] -0.1886
[2,] 0.7225
## Collection in a table
analysis.table <- cbind(beta, se.beta, t.obs, p.value)
analysis.table
se.beta
[1,] -0.1886 0.44171 -0.4269 0.680696710
[2,] 0.7225 0.07064 10.2269 0.000007177
## Done!!
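Only fragments of the matrix computation survive above. The following is a sketch of how the same table could be produced with the standard matrix formulas (β̂ = (XᵀX)⁻¹Xᵀy, with covariance σ̂²(XᵀX)⁻¹). It is not necessarily the original code: the names `beta`, `se.beta`, `t.obs` and `p.value` follow the surviving fragment, while `X`, `XtX.inv`, `e` and `s2` are chosen here.

```r
# Membrane data from the exercise
D <- data.frame(
  pressure = c(1.02, 2.08, 2.89, 4.01, 5.32, 5.83, 7.26, 7.96, 9.11, 9.99),
  flux     = c(1.15, 0.85, 1.56, 1.72, 4.32, 5.07, 5.00, 5.31, 6.17, 7.04)
)

# Design matrix (column of ones and pressure) and response
y <- D$flux
X <- cbind(1, D$pressure)
n <- nrow(X)
p <- ncol(X)

# Parameter estimates
XtX.inv <- solve(t(X) %*% X)
beta    <- XtX.inv %*% t(X) %*% y

# Residual variance and standard errors
e       <- y - X %*% beta
s2      <- sum(e^2) / (n - p)
se.beta <- sqrt(diag(s2 * XtX.inv))

# t-statistics and two-sided p-values
t.obs   <- beta / se.beta
p.value <- 2 * (1 - pt(abs(t.obs), df = n - p))

cbind(beta, se.beta, t.obs, p.value)
```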
Chapter 5 5.8 INDEPENDENCE AND CORRELATION 25
a) Show that \( S_{xx} = \frac{n \cdot (n+1)}{12 \cdot (n-1)} \).
Solution
x̄ becomes
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} \frac{i-1}{n-1}
        = \frac{1}{n(n-1)} \sum_{i=1}^{n} (i-1)
        = \frac{1}{n(n-1)} \left( \frac{n(n+1)}{2} - n \right)
        = \frac{1}{2}.
\]
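A quick numerical sanity check of both results, assuming the design x_i = (i − 1)/(n − 1), i = 1, ..., n, as read from the sum in the derivation above. The helper `check.layout` is a hypothetical name introduced here.

```r
# Compare the empirical mean and Sxx with the closed-form expressions
check.layout <- function(n) {
  x <- (seq_len(n) - 1) / (n - 1)   # equidistant points on [0, 1]
  c(n       = n,
    x.bar   = mean(x),
    Sxx     = sum((x - mean(x))^2),
    formula = n * (n + 1) / (12 * (n - 1)))
}

res <- t(sapply(c(3, 5, 10, 50), check.layout))
res
```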
Solution
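\[
\begin{aligned}
\rho_n(\hat{\beta}_0, \hat{\beta}_1)
  &= \frac{\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1)}{\sqrt{V(\hat{\beta}_0)\, V(\hat{\beta}_1)}}
   = -\frac{\sigma^2 \bar{x} / S_{xx}}{\sqrt{\sigma^4 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \frac{1}{S_{xx}}}} \\
  &= -\frac{\bar{x} / S_{xx}}{\frac{1}{S_{xx}} \sqrt{\frac{S_{xx}}{n} + \bar{x}^2}}
   = -\frac{\bar{x}}{\sqrt{\frac{S_{xx}}{n} + \bar{x}^2}}.
\end{aligned}
\]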
Notice that the correlation is not a function of the variance (σ2 ), but only a function
of the independent variables. Now insert the values of x̄ and Sxx
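\[
\begin{aligned}
\rho_n(\hat{\beta}_0, \hat{\beta}_1)
  &= -\frac{1}{2 \sqrt{\frac{n+1}{12(n-1)} + \frac{1}{4}}}
   = -\frac{1}{2 \sqrt{\frac{n+1+3(n-1)}{12(n-1)}}} \\
  &= -\frac{1}{2 \sqrt{\frac{2n-1}{6(n-1)}}}
   = -\frac{\sqrt{6(n-1)}}{2 \sqrt{2n-1}} \\
  &= -\frac{1}{2} \sqrt{\frac{6(n-1)}{2(n-1/2)}}
   = -\frac{\sqrt{3}}{2} \sqrt{\frac{n-1}{n-1/2}},
\end{aligned}
\]
which converges to \( -\frac{\sqrt{3}}{2} \) for n → ∞.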
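The closed form can be checked against the correlation computed directly from (XᵀX)⁻¹, since σ² cancels. A minimal sketch, again assuming the design x_i = (i − 1)/(n − 1); `rho.formula` and `rho.matrix` are hypothetical helper names.

```r
# Closed-form correlation between the intercept and slope estimators
rho.formula <- function(n) -sqrt(3) / 2 * sqrt((n - 1) / (n - 1 / 2))

# Same correlation from the covariance matrix, proportional to (X'X)^{-1}
rho.matrix <- function(n) {
  x <- (seq_len(n) - 1) / (n - 1)
  X <- cbind(1, x)
  cov2cor(solve(t(X) %*% X))[1, 2]
}

n.values <- c(3, 10, 100, 1000)
tab <- cbind(n       = n.values,
             matrix  = sapply(n.values, rho.matrix),
             formula = sapply(n.values, rho.formula),
             limit   = -sqrt(3) / 2)
tab
```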
Solution
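\[
\bar{x} = \frac{1}{2},
\]
and
\[
S_{xx}^{\text{new}} = \sum_{i=1}^{k} \left( 0 - \frac{1}{2} \right)^2 + \sum_{i=k+1}^{2k} \left( 1 - \frac{1}{2} \right)^2
  = \frac{k}{4} + \frac{k}{4} = \frac{k}{2} = \frac{n}{4}.
\]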
Solution
\[
\frac{S_{xx}}{S_{xx}^{\text{new}}} = \frac{n(n+1)}{12(n-1)} \cdot \frac{4}{n} = \frac{n+1}{3(n-1)} < 1 \quad \text{for } n > 2.
\]
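A small numerical comparison of the two layouts for a few even n, as a sketch. It assumes the equidistant design x_i = (i − 1)/(n − 1) for the original layout and half the points at 0 and half at 1 for the new one; `compare.layouts` is a hypothetical helper name.

```r
# Compare Sxx for the equidistant layout and the two-point layout
compare.layouts <- function(n) {
  stopifnot(n %% 2 == 0)                 # new layout needs n = 2k
  x.old <- (seq_len(n) - 1) / (n - 1)    # equidistant points on [0, 1]
  x.new <- rep(c(0, 1), each = n / 2)    # k points at 0, k points at 1
  Sxx <- function(x) sum((x - mean(x))^2)
  c(n       = n,
    Sxx.old = Sxx(x.old),
    Sxx.new = Sxx(x.new),
    ratio   = Sxx(x.old) / Sxx(x.new),
    formula = (n + 1) / (3 * (n - 1)))
}

cmp <- t(sapply(c(4, 10, 100), compare.layouts))
cmp
```

The ratio stays below 1 and approaches 1/3 as n grows, so the two-point layout gives an Sxx up to about three times larger.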
e) What is the consequence for the parameter variance in the two layouts?
Solution
The larger Sxx for the new layout implies that the parameter variance is smaller for
the new layout (given that the data come from the same model).
Solution
The smaller parameter variance for the new layout would suggest that we should
use this layout. However, we would not be able to check that data is in fact generated
by a linear model. Consider e.g. data generated by the model
\[
y_i = \beta_0 + \beta_1 x_i^2 + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),
\]