Midterm2021R1 Sol PDF

Uploaded by

vanessalaucode
STAT 404 Midterm Exam

September–December, 2021.
Instructor: Jiahua Chen

Total marks: 70

• Put your name and student ID on the upper-right corner of every sheet.

• You must write your solutions on paper (not on a device). Use a camera or
scanner to upload your written solutions to Canvas as a PDF.

• Correct answers are usually short. Please be brief and answer the questions
with complete sentences.

• R is allowed only for manual calculations like on the assignments. Answers
obtained using one-line R commands will not be accepted.

• Submit the R commands you used as a runnable script in a separate .txt
file. It must be clear which question the code corresponds to. If the code
is taken from somewhere, cite the source of the code.

• Common notations and model assumptions are assumed in the problems
unless otherwise specified. Use the conventional 5% level for tests, two-sided
alternative hypotheses, and the 95% confidence level unless otherwise specified.

1. [6] Name and explain (in complete sentences) the three principles in
design of experiments that were emphasized in STAT 404 lectures.
Use your own discretion to decide how much to write based on the
following example:
Random variable: a function defined on sample space. It provides a
numerical summary of the output of an experiment.
Answer
The three principles in design of experiments are as follows:

1. Replication: applying the same treatment to several experimental
units. Its purpose is to reduce the influence of the experimental
error to better isolate the treatment effects.
2. Randomization: allocating treatments to units in random order.
Its purpose is to reduce the influence of unknown variables that
may have an impact on the response.
3. Blocking: partitioning units into blocks based on similar characteristics
and comparing treatments within blocks. Its purpose is
to reduce the influence of known variables not of interest that may
have an impact on the response.
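
As a small illustration of how the three principles fit together, the sketch below builds a randomized block layout in R. The treatment labels, block names, and seed are hypothetical and are not part of the exam.

```r
set.seed(404)  # hypothetical seed, only for reproducibility

treatments <- c("A", "B", "C")   # hypothetical treatment labels
blocks <- paste0("block", 1:4)   # hypothetical blocks of similar units

# Within each block, every treatment is applied once, in random order.
design <- do.call(rbind, lapply(blocks, function(b) {
  data.frame(block = b,
             unit = seq_along(treatments),
             treatment = sample(treatments))
}))

# Rows are blocks, columns are treatments: each cell should equal 1.
counts <- table(design$block, design$treatment)
print(counts)
```

Each treatment is compared within every block (blocking), is assigned to units in random order (randomization), and is applied four times overall (replication).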

2. Consider a two-sample problem based on self-reported body temperatures
of 15 male and 20 female students. Suppose that the equal
variance assumption holds. We wish to test the equal mean temperature
null against the one-sided alternative that male students have a
higher mean body temperature at nominal level 4%.
A student mistakenly uses the critical value from the 4% level two-sided
t-test as the critical value in their one-sided t-test.

(a) [2] What is the actual size of the type I error of the student’s test?
Answer
The actual size of the type I error is 2% because the critical value
in the 4% level two-sided t-test is the 98% quantile of the t
distribution with 15 + 20 − 2 = 33 degrees of freedom. Under the null
hypothesis, the probability that the test statistic exceeds the 98%
quantile is 0.02 (2%).
(b) [2] If the alternative hypothesis holds in the specific application,
is the power of the student’s test higher or lower than the correct
one-sided t-test?
Answer
If the alternative hypothesis holds, the power of the student’s test
is lower than the correct one-sided test because the probability of
the test statistic being larger than the 98% quantile critical value
is smaller than the probability of it being larger than the correct
96% quantile critical value.
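
The two answers above can be checked numerically with a short sketch. The degrees of freedom follow from the sample sizes in the question; the non-centrality value `ncp` used for the power comparison is a hypothetical choice, since the question does not specify the true mean difference.

```r
df <- 15 + 20 - 2                     # pooled two-sample t-test degrees of freedom

crit_wrong <- qt(1 - 0.04 / 2, df)    # two-sided 4% critical value (98% quantile)
crit_right <- qt(1 - 0.04, df)        # correct one-sided 4% critical value (96% quantile)

# Actual type I error of each one-sided rejection rule under H0
size_wrong <- pt(crit_wrong, df, lower.tail = FALSE)
size_right <- pt(crit_right, df, lower.tail = FALSE)

# Power under a hypothetical alternative (non-centrality parameter assumed to be 2)
ncp <- 2
power_wrong <- pt(crit_wrong, df, ncp = ncp, lower.tail = FALSE)
power_right <- pt(crit_right, df, ncp = ncp, lower.tail = FALSE)

c(size_wrong = size_wrong, size_right = size_right)
c(power_wrong = power_wrong, power_right = power_right)
```

Because the mistaken critical value is larger, its rejection region is a subset of the correct one, so its power is lower for any alternative in the stated direction.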

Reminder: answer these questions in complete sentences and
explain your answer.

3. A linear regression model assumes that the response values in an experiment
can be expressed as
$$y_i = x_i^\top \beta + \epsilon_i$$
for $i = 1, 2, \ldots, n$, where $x_i^\top \beta$ is the expected value of $y_i$ and the $\epsilon_i$'s are
iid $N(0, \sigma^2)$.
We collected data on gender ($x_1$, coded binary as $-1$ (F) and $1$ (M)),
length of right foot ($x_2$, in cm), and response height ($y$, in cm). The
data are posted as Midterm2021Q3.txt on Canvas for ease of numerical
computation.
Note that in this case, $x = (x_0, x_1, x_2)^\top$ and $\beta = (\beta_0, \beta_1, \beta_2)^\top$.
You can use the following code to load the design matrix X
and response vector Y into R:

# Load the data (gender is already encoded as -1, 1)
data = read.delim("Midterm2021Q3.txt", sep = ",")
# Design matrix: intercept column, gender, foot length
X = cbind(1, data$gender, data$foot.length)
Y = data$height

(a) [4] For the given data set, obtain the least squares estimates of β.
Answer
The estimates of the coefficients $(\beta_0, \beta_1, \beta_2)^\top$ are given by
$$\hat\beta = (X^\top X)^{-1} X^\top Y = (169.460, 5.946, 0.404)^\top.$$
R code:
invXTX = solve(t(X) %*% X)
beta.hat = invXTX %*% t(X) %*% Y
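
As a sanity check on the normal-equations formula, the sketch below compares it with `lm()` on simulated stand-in data. The exam file Midterm2021Q3.txt is not reproduced here, so the sample values, coefficients, and seed are all hypothetical.

```r
set.seed(404)  # hypothetical seed

# Simulated stand-in data (NOT the exam data): 20 females (-1), 15 males (1)
n <- 35
gender <- rep(c(-1, 1), times = c(20, 15))
foot <- rnorm(n, mean = 25, sd = 2)
height <- 160 + 6 * gender + 0.5 * foot + rnorm(n, sd = 6)

# Least squares via the normal equations (X'X) beta = X'Y
X <- cbind(1, gender, foot)
beta_normal <- solve(t(X) %*% X, t(X) %*% height)

# The same fit through lm(), which uses a QR decomposition internally
fit <- lm(height ~ gender + foot)

cbind(normal_equations = as.numeric(beta_normal), lm = unname(coef(fit)))
```

The two columns should agree up to floating-point error, which is one way to confirm a manual matrix computation without relying on a one-line command for the answer itself.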

(b) [4] Estimate the error variance $\sigma^2$ (use the method given in class).
Answer
The estimate of the error variance $\sigma^2$ is given by the residual mean
sum of squares
$$\hat\sigma^2 = \frac{(Y - X\hat\beta)^\top (Y - X\hat\beta)}{N - k - 1} = 34.0188.$$

R code:
N = nrow(X); k = ncol(X)-1
resid = Y - X %*% beta.hat
MSS.error = sum(resid^2) / (N-k-1)
(c) [4] Estimate the variance matrix of β̂.
Answer
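The variance matrix of $\hat\beta$ is given by
$$\mathrm{Var}(\hat\beta) = (X^\top X)^{-1} X^\top \mathrm{Var}(Y) X (X^\top X)^{-1} = \sigma^2 (X^\top X)^{-1}.$$
Thus, an estimate is given by
$$\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1} = \begin{pmatrix} 24.3488 & 1.1793 & -0.8704 \\ 1.1793 & 1.0305 & -0.0408 \\ -0.8704 & -0.0408 & 0.0324 \end{pmatrix}.$$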
R code:
var.beta = solve(t(X) %*% X) * MSS.error
(d) [4] Construct 95%, two-sided individual (not simultaneous) CIs
for β1 and β2 (effects of gender and foot-length). Hint: remember
the general recipe for CI construction.
Answer
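Consider a two-sided t-test with $H_0: \beta_i = 0$ against $H_1: \beta_i \neq 0$ for
$i \in \{1, 2\}$. The test statistic is
$$T_i = \frac{\hat\beta_i}{\mathrm{SE}(\hat\beta_i)},$$
which follows a $t_{N-k-1}$ distribution under $H_0$. More generally,
$(\hat\beta_i - \beta_i)/\mathrm{SE}(\hat\beta_i)$ follows a $t_{N-k-1}$ distribution, so
the CI for $\beta_1$ is given by
$$\hat\beta_1 \pm t_{0.975,32} \times \mathrm{SE}(\hat\beta_1) = (3.8783, 8.0138)$$
and the CI for $\beta_2$ is given by
$$\hat\beta_2 \pm t_{0.975,32} \times \mathrm{SE}(\hat\beta_2) = (0.03684, 0.77038).$$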

R code:

# beta1
beta.hat[1+1] - qt(1-0.05/2, N-k-1)*var.beta[2,2]^.5;
beta.hat[1+1] + qt(1-0.05/2, N-k-1)*var.beta[2,2]^.5

# beta2
beta.hat[1+2] - qt(1-0.05/2, N-k-1)*var.beta[3,3]^.5;
beta.hat[1+2] + qt(1-0.05/2, N-k-1)*var.beta[3,3]^.5
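
The manual recipe (estimate, plus or minus a t quantile times the standard error) can be cross-checked against `vcov()` and `confint()`. The sketch below does this on simulated stand-in data; the data-generating values and seed are hypothetical, not the exam data.

```r
set.seed(404)  # hypothetical seed

# Simulated stand-in data (NOT the exam data)
n <- 35
gender <- rep(c(-1, 1), times = c(20, 15))
foot <- rnorm(n, mean = 25, sd = 2)
height <- 160 + 6 * gender + 0.5 * foot + rnorm(n, sd = 6)

fit <- lm(height ~ gender + foot)
X <- model.matrix(fit)
df_resid <- n - ncol(X)                      # N - k - 1 with k = 2 covariates

# Manual estimates of sigma^2 and Var(beta.hat)
sigma2_hat <- sum(residuals(fit)^2) / df_resid
V_manual <- sigma2_hat * solve(crossprod(X))

# Manual 95% individual confidence intervals
se <- sqrt(diag(V_manual))
tq <- qt(1 - 0.05 / 2, df_resid)
ci_manual <- cbind(lower = coef(fit) - tq * se, upper = coef(fit) + tq * se)

# Built-in equivalents for comparison
ci_builtin <- confint(fit, level = 0.95)
max(abs(V_manual - vcov(fit)))
max(abs(ci_manual - ci_builtin))
```

Both maximum absolute differences should be at the level of floating-point error.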

(e) [4] Estimate the variance of β̂2 − β̂1 (both LS estimators).


Answer
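Note that

$$\mathrm{Var}(\hat\beta_2 - \hat\beta_1) = \mathrm{Var}((0, -1, 1)\hat\beta) = (0, -1, 1)\,\mathrm{Var}(\hat\beta)\,(0, -1, 1)^\top.$$

Thus,

$$\widehat{\mathrm{Var}}(\hat\beta_2 - \hat\beta_1) = (0, -1, 1)\,\widehat{\mathrm{Var}}(\hat\beta)\,(0, -1, 1)^\top = 1.1445.$$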

R code (can also compute manually):


a = c(0, -1, 1)
t(a) %*% var.beta %*% a
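
The quadratic form can also be expanded by hand as Var(β̂1) + Var(β̂2) − 2 Cov(β̂1, β̂2). The sketch below checks that expansion using the rounded variance matrix reported in part (c); the object names `V`, `quad`, and `expanded` are introduced here for illustration.

```r
# Rounded estimate of Var(beta.hat) copied from part (c)
V <- matrix(c(24.3488,  1.1793, -0.8704,
               1.1793,  1.0305, -0.0408,
              -0.8704, -0.0408,  0.0324),
            nrow = 3, byrow = TRUE)

a <- c(0, -1, 1)                      # contrast picking out beta2 - beta1

quad <- drop(t(a) %*% V %*% a)        # quadratic form a' V a
expanded <- V[2, 2] + V[3, 3] - 2 * V[2, 3]   # variance-of-a-difference formula

c(quad = quad, expanded = expanded)
```

Both routes give 1.0305 + 0.0324 + 2(0.0408) = 1.1445, matching the answer above.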

4. Consider a hypothetical one-way layout comparing k = 5 treatments.
The response values are given as follows:

yy1 = c(11.44, 11.15, 10.94, 12.38, 9.24)
yy2 = c(10.40, 13.16, 13.53, 11.54, 12.39, 12.59, 11.50, 11.81)
yy3 = c(15.66, 16.22, 13.68, 13.81, 13.09, 15.51, 14.74, 14.57, 14.52)
yy4 = c(10.04, 8.76, 8.26, 7.92)
yy5 = c(11.20, 12.74, 10.97, 13.80, 11.75, 10.77, 13.51)

(a) [4] Compute the treatment sum of squares, SS(trt).


Answer
By definition,
$$\mathrm{SS(trt)} = \sum_{i=1}^{5} n_i (\bar y_{i\cdot} - \bar y_{\cdot\cdot})^2 = 108.4607.$$

R code:
y = c(yy1, yy2, yy3, yy4, yy5)
trt =
as.factor(rep(1:5, c(
length(yy1),
length(yy2),
length(yy3),
length(yy4),
length(yy5)
)))
anova_data = data.frame(y = y, trt = trt)
bar.y = mean(anova_data$y)
bar.yi = tapply(anova_data$y, INDEX = anova_data$trt, FUN = mean)
n = length(y)
ni = summary(anova_data$trt)
k = length(ni)
ss.trt = sum(ni * (bar.yi - bar.y) ^ 2)

(b) [4] Compute the error sum of squares (also called residual), SS(err).
Answer
By definition,
$$\mathrm{SS(err)} = \mathrm{SS(total)} - \mathrm{SS(trt)} = \sum_{i=1}^{5} \sum_{j=1}^{n_i} (y_{ij} - \bar y_{\cdot\cdot})^2 - \mathrm{SS(trt)} = 32.55049.$$

R code:
ss.total = sum((y - bar.y)^2)
ss.error = ss.total - ss.trt

(c) [4] Complete the one-way layout ANOVA table.

Source     DF        SS      MSS        F
Treatment
Error
Total

Answer

Source     DF        SS      MSS        F
Treatment   4  108.4607  27.1152  23.3245
Error      28   32.5505   1.1625
Total      32  141.0112

R code:
mss.trt = ss.trt / (k - 1)
mss.error = ss.error / (n - k)
f.test = mss.trt / mss.error
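
The hand-built table can be cross-checked against R's built-in ANOVA. The exam asks for manual calculations, so the sketch below is offered only as a verification step; the object names `fit` and `tab` are introduced here.

```r
# Data from the question
yy1 <- c(11.44, 11.15, 10.94, 12.38, 9.24)
yy2 <- c(10.40, 13.16, 13.53, 11.54, 12.39, 12.59, 11.50, 11.81)
yy3 <- c(15.66, 16.22, 13.68, 13.81, 13.09, 15.51, 14.74, 14.57, 14.52)
yy4 <- c(10.04, 8.76, 8.26, 7.92)
yy5 <- c(11.20, 12.74, 10.97, 13.80, 11.75, 10.77, 13.51)

groups <- list(yy1, yy2, yy3, yy4, yy5)
y <- unlist(groups)
trt <- factor(rep(seq_along(groups), times = lengths(groups)))

# One-way ANOVA table: rows "trt" and "Residuals"
fit <- lm(y ~ trt)
tab <- anova(fit)
print(tab)
```

The `Sum Sq`, `Mean Sq`, and `F value` columns should reproduce the Treatment and Error rows of the table above.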

5. Continuing the last problem:

(a) [4] Test the hypothesis that all treatment means are equal at the
5% level. Be sure to state the null and alternative hypotheses,
and state what the test statistic is and its reference distribution.
Then, clearly state your conclusions.
Answer
Let $\mu_i$ be the true mean response value of treatment $i$. The null
hypothesis is
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_5$$
and the alternative hypothesis is that at least two of $\{\mu_i : i = 1, 2, \ldots, 5\}$
are not equal. The test statistic is
$$F = \frac{\mathrm{MSS(trt)}}{\mathrm{MSS(err)}},$$
which has an $F_{k-1,n-k} = F_{4,28}$ distribution under $H_0$. The observed
value is $F_{\mathrm{obs}} = 23.3245$, and the p-value is

$$\text{p-value} = P(F > F_{\mathrm{obs}} \mid H_0) = 1.4 \times 10^{-8}.$$

Since the p-value is below 0.05, we reject the null hypothesis that all
treatment means are equal at the 5% significance level.
R code:
pf(f.test, k-1, n-k, lower.tail = FALSE)

(b) [4] Construct simultaneous 95% confidence intervals for mean differences
using Tukey’s method, but present only the first three
(that is, 1 vs 2; 1 vs 3; 2 vs 3).
Answer
The lower and upper limits of the 95% Tukey confidence intervals
for the first three are in the following table:

         lower bound  upper bound
1 vs 2        -2.876        0.706
1 vs 3        -5.366       -1.862
2 vs 3        -4.056       -1.003

R code:
cv.tukey = qtukey(0.95, k, n - k) / sqrt(2)
diff.means = c(bar.yi[1]-bar.yi[2],
bar.yi[1]-bar.yi[3],
bar.yi[2]-bar.yi[3])
se = sqrt(mss.error * c((1 / ni[1] + 1 / ni[2]),
(1 / ni[1] + 1 / ni[3]),
(1 / ni[2] + 1 / ni[3])))
(diff.means - se * cv.tukey)
(diff.means + se * cv.tukey)
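
These intervals can be cross-checked with `TukeyHSD()`, which applies the same Tukey-Kramer adjustment for unequal group sizes. Note that `TukeyHSD()` labels each comparison as later group minus earlier group (for example "2-1"), so its signs are the reverse of the table above. This is a verification sketch only, not the manual calculation the exam asks for.

```r
# Data from the previous problem
yy1 <- c(11.44, 11.15, 10.94, 12.38, 9.24)
yy2 <- c(10.40, 13.16, 13.53, 11.54, 12.39, 12.59, 11.50, 11.81)
yy3 <- c(15.66, 16.22, 13.68, 13.81, 13.09, 15.51, 14.74, 14.57, 14.52)
yy4 <- c(10.04, 8.76, 8.26, 7.92)
yy5 <- c(11.20, 12.74, 10.97, 13.80, 11.75, 10.77, 13.51)

groups <- list(yy1, yy2, yy3, yy4, yy5)
y <- unlist(groups)
trt <- factor(rep(seq_along(groups), times = lengths(groups)))

# Simultaneous 95% intervals for all pairwise differences
tk <- TukeyHSD(aov(y ~ trt), "trt", conf.level = 0.95)$trt

# Rows are named "j-i" and report mean_j - mean_i
first_three <- tk[c("2-1", "3-1", "3-2"), c("diff", "lwr", "upr")]
print(round(first_three, 3))
```

Flipping the sign and swapping the bounds of each row (for example, row "2-1" with bounds (lwr, upr) becomes 1 vs 2 with bounds (-upr, -lwr)) recovers the table above.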

(c) [3] Find the estimated effects of the 5 treatments and the estimated
error variance (i.e., τ̂j and σ̂ 2 ).
Answer
The estimated treatment effects are given by

$$\hat\tau_i = \bar y_{i\cdot} - \bar y_{\cdot\cdot}.$$

The values (rounded to 3 decimal places) are −1.200, −0.115,
2.414, −3.485, and −0.124 for treatments 1 to 5, respectively.
The estimated error variance is

$$\hat\sigma^2 = \mathrm{MSS(err)} = 1.1625.$$

R code:
hat.taui = bar.yi - bar.y
hat.sigma2 = mss.error

(d) [4] Suppose the true effects of the 5 treatments are the same as
the estimates found in part (c), but the true error variance is ten
times larger than the estimate. What is the power of the test for
H0 : τ1 = · · · = τ5 ?
Answer
Let $\delta = \sum_{i=1}^{k} n_i (\tau_i - \bar\tau)^2 / \sigma^2$, where $\bar\tau = N^{-1} \sum_{i=1}^{k} n_i \tau_i$. The null
hypothesis $H_0: \tau_1 = \cdots = \tau_5$ is the same as $H_0: \delta = 0$, and the
alternative hypothesis is $H_1: \delta > 0$.

When the true effects are given as in part (c) and $\sigma^2 = 10\hat\sigma^2$, we
have $\delta = 9.3298$. The power is computed as

$$\text{power} = P(F > F_{0.95,k-1,n-k} \mid \delta = 9.3298),$$

where $F_{0.95,k-1,n-k}$ is the 95% quantile of an F distribution with
degrees of freedom 4 and 28.
When $\delta > 0$, the test statistic follows a non-central F distribution
with non-centrality parameter $\delta$. The power is found to be 0.5985.
R code:
bar.tau = sum(ni*hat.taui)/n
delta = sum(ni*(hat.taui-bar.tau)^2) / (10*hat.sigma2)
qq = qf(0.95, k-1, n-k)
pf(qq, k-1, n-k, ncp = delta, lower.tail = FALSE)
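
A Monte Carlo check of the non-central F calculation is sketched below. It uses the rounded effects from part (c) and ten times the rounded MSS(err), so its non-centrality parameter differs slightly from 9.3298; the seed, the number of replications, and the helper `one_f()` are hypothetical choices.

```r
set.seed(404)  # hypothetical seed

ni  <- c(5, 8, 9, 4, 7)                              # group sizes from the data
tau <- c(-1.200, -0.115, 2.414, -3.485, -0.124)      # rounded effects from part (c)
sigma2 <- 10 * 1.1625                                # ten times the estimated error variance

k <- length(ni)
n <- sum(ni)
trt <- factor(rep(seq_len(k), times = ni))
mu <- rep(tau, times = ni)                           # mean of each observation (grand mean set to 0)
crit <- qf(0.95, k - 1, n - k)

# Simulate one data set and return its one-way ANOVA F statistic
one_f <- function() {
  y <- mu + rnorm(n, sd = sqrt(sigma2))
  group_means <- tapply(y, trt, mean)
  ss_trt <- sum(ni * (group_means - mean(y))^2)
  ss_err <- sum((y - group_means[trt])^2)
  (ss_trt / (k - 1)) / (ss_err / (n - k))
}

power_sim <- mean(replicate(20000, one_f()) > crit)

# Analytic power from the non-central F distribution
tau_bar <- sum(ni * tau) / n
delta <- sum(ni * (tau - tau_bar)^2) / sigma2
power_exact <- pf(crit, k - 1, n - k, ncp = delta, lower.tail = FALSE)

c(simulated = power_sim, analytic = power_exact)
```

The simulated rejection rate should agree with the analytic power up to Monte Carlo error.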

6. Continuing the last problem:

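(a) [4] Suppose the random effects model is more suitable instead so
that the variance of the treatment effect is $\sigma_\tau^2$.
Obtain an expression for $E[(\bar y_{1\cdot} - \bar y_{2\cdot})^2]$ in terms of $\sigma_\tau^2$ and $\sigma^2$.
Answer
Since $\bar y_{1\cdot}$ and $\bar y_{2\cdot}$ have the same mean and they are independent,
with $\mathrm{Var}(\bar y_{i\cdot}) = \sigma_\tau^2 + \sigma^2 / n_i$, we have

$$E[(\bar y_{1\cdot} - \bar y_{2\cdot})^2] = \mathrm{Var}(\bar y_{1\cdot}) + \mathrm{Var}(\bar y_{2\cdot}) = 2\sigma_\tau^2 + (1/n_1 + 1/n_2)\sigma^2.$$

This gives us the required expression.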


(b) [4] Based on your result in (a), provide a sensible estimate of $\sigma_\tau^2$.
Use your estimate $\hat\sigma^2$ from (Q5c) for $\sigma^2$.
Answer
A sensible estimator and the estimate of $\sigma_\tau^2$ are

$$\hat\sigma_\tau^2 = \frac{(\bar y_{1\cdot} - \bar y_{2\cdot})^2 - (1/n_1 + 1/n_2)\,\mathrm{MSS(err)}}{2} = 0.3997.$$
R code:
((bar.yi[1]-bar.yi[2])^2 - (1/ni[1]+1/ni[2])*hat.sigma2) / 2
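
The expectation in (a) and the resulting moment estimator in (b) can be checked by simulation under a random effects model. In the sketch below the variance components, overall mean, seed, and number of replications are hypothetical, and the true $\sigma^2$ is plugged in where the exam answer uses MSS(err).

```r
set.seed(404)  # hypothetical seed

sigma_tau2 <- 0.4   # hypothetical variance of the random treatment effects
sigma2 <- 1.2       # hypothetical error variance
n1 <- 5; n2 <- 8    # sizes of groups 1 and 2 in the data
mu <- 12            # hypothetical overall mean
reps <- 50000

# Independent random effects for the two treatments
a1 <- rnorm(reps, sd = sqrt(sigma_tau2))
a2 <- rnorm(reps, sd = sqrt(sigma_tau2))

# Group means: overall mean + random effect + average of the errors
ybar1 <- mu + a1 + rowMeans(matrix(rnorm(reps * n1, sd = sqrt(sigma2)), reps, n1))
ybar2 <- mu + a2 + rowMeans(matrix(rnorm(reps * n2, sd = sqrt(sigma2)), reps, n2))

d2 <- (ybar1 - ybar2)^2
theory <- 2 * sigma_tau2 + (1 / n1 + 1 / n2) * sigma2

# Moment estimator from (b), here using the true sigma2 in place of MSS(err)
sigma_tau2_hat <- (d2 - (1 / n1 + 1 / n2) * sigma2) / 2

c(simulated = mean(d2), theory = theory, mean_estimate = mean(sigma_tau2_hat))
```

The average of the estimator is close to the assumed $\sigma_\tau^2$, although any single estimate based on one pair of group means is highly variable and can even be negative.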

7. [5] Someone claims that the ratio of height over foot-length is larger
for males than females.
Using a randomization test with the data we collected (Midterm2021Q3.txt
from Q3), provide your opinion on this claim as a statistician.
Note: computing all permutations is infeasible due to the size
of this data set. You may estimate the p-value via simulations.
Answer
See Assignment 3.
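
The Assignment 3 solution is not reproduced here. As a rough sketch of what a simulation-based randomization test could look like, the code below uses simulated stand-in data (not Midterm2021Q3.txt), takes the difference in mean ratio (male minus female) as the test statistic, and estimates a one-sided p-value from random relabelings of gender. The data-generating values, seed, number of permutations `B`, and helper `stat()` are all hypothetical.

```r
set.seed(404)  # hypothetical seed

# Simulated stand-in data (NOT the exam data): 20 females (-1), 15 males (1)
n_f <- 20; n_m <- 15
gender <- rep(c(-1, 1), times = c(n_f, n_m))
foot <- rnorm(n_f + n_m, mean = 25, sd = 2)
height <- 160 + 6 * gender + 0.5 * foot + rnorm(n_f + n_m, sd = 6)
ratio <- height / foot

# Test statistic: mean ratio for males minus mean ratio for females
stat <- function(g) mean(ratio[g == 1]) - mean(ratio[g == -1])
obs <- stat(gender)

# Randomly relabel gender B times to approximate the randomization distribution
B <- 5000
perm <- replicate(B, stat(sample(gender)))

# One-sided p-value estimate; the +1 terms count the observed labeling itself
p_value <- (1 + sum(perm >= obs)) / (B + 1)
p_value
```

A small p-value would support the claim that the ratio is larger for males, and a large one would indicate that the data do not provide evidence for it.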
