Statistics GIDP Ph.D. Qualifying Exam Methodology: January 10, 9:00am-1:00pm
1. A manufacturer suspects that the batches of raw material furnished by her supplier differ
significantly in antibiotic content. There are a large number of batches currently in the
warehouse. Five of these are randomly selected for study. A biologist makes five
determinations on each batch and obtains the following data:
Region
Design NE NW SE SW
1 225 350 220 375
2 400 525 390 550
3 275 340 200 320
(d) Can you check whether there exists an interaction between the Design factor and the Region
factor? If no, explain why not. If yes, write your statistical model, state the hypotheses,
analyze the data, and report your findings.
Treatment Replicate
A B C Combination I II III
- - - (1) 22 31 25
+ - - a 32 43 29
- + - b 35 34 50
+ + - ab 55 47 46
- - + c 44 45 38
+ - + ac 40 37 36
- + + bc 60 50 54
+ + + abc 39 41 47
(c) Use the analysis of variance to confirm your informal conclusions from part (b).
(d) Write down a regression model for predicting tool life (in hours) based on the results of this
experiment.
(e) What levels of A, B, and C would you recommend using? Justify your answer.
(a) Examine the two predictor variables X1 and X2 to determine if multicollinearity is a concern
with these data.
(b) To assuage any concerns over multicollinearity, apply a ridge regression analysis to these
data. Begin with a ridge trace plot and use it to identify a plausible value for the ridge tuning
parameter c.
[Hint: if your plot covers sufficient values of c, the stock traceplot() function in the R
genridge package will draw vertical lines at plausible values, known as the modified Hoerl-
Kennard-Baldwin (HKB) and Lawless-Wang (LW) estimators. Decide if either could be a
useful choice here and if so, select it.]
(c) Using the tuning parameter, c, selected in part (b), calculate the ridge-regression predicted
values. From these find the raw residuals and construct a residual plot. Comment on any
patterns that may appear.
5. Consider the simple linear regression model: $Y_i \sim$ indep. $N(\beta_0 + \beta_1 X_i, \sigma^2)$, $i = 1, \ldots, n$, where
X is a non-stochastic predictor variable. Suppose interest exists in estimating the X value(s)
at which $E[Y] = 0$. Let this target parameter be $\theta$.
(b) State the maximum likelihood estimator (MLE) of $\theta$. Call this $\hat{\theta}$.
(c) Recall from Section 5.5.4 in Casella & Berger that the Delta Method can be used to determine
the asymptotic features of a function of random variables. In particular, for a bivariate
random vector $[U_1\ U_2]^T$ and a differentiable function $g(u_1,u_2)$, where $E[U_j] = \mu_j$ ($j = 1,2$),
$$E[g(U_1,U_2)] \approx g(\mu_1,\mu_2) + \sum_{j=1}^{2} \frac{\partial g(\mu_1,\mu_2)}{\partial u_j}\, E(U_j - \mu_j)$$
and
$$\mathrm{Var}[g(U_1,U_2)] \approx \sum_{j=1}^{2} \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_j} \right\}^2 \mathrm{Var}(U_j) + 2 \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_1} \right\} \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_2} \right\} \mathrm{Cov}(U_1,U_2).$$
1. A manufacturer suspects that the batches of raw material furnished by her supplier differ
significantly in antibiotic content. There are a large number of batches currently in the
warehouse. Five of these are randomly selected for study. A biologist makes five
determinations on each batch and obtains the following data:
Source DF SS MS F P
Batch 4 0.096976 0.024244 5.54 0.004
Error 20 0.087600 0.004380
Total 24 0.184576
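As a quick numerical check (a sketch in R using only the table values above; the variance-component step assumes the five determinations per batch noted in the problem), the F statistic, its P-value, and the usual method-of-moments estimate of the batch variance component are:
MSbatch = 0.024244; MSE = 0.004380
MSbatch / MSE                                # F = 5.54, as in the table
pf( MSbatch/MSE, 4, 20, lower.tail=FALSE )   # P-value, approx. 0.004
( MSbatch - MSE ) / 5                        # estimated batch-to-batch variance component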
Region
Design NE NW SE SW
1 225 350 220 375
2 400 525 390 550
3 275 340 200 320
The raw residual plot and normal QQ-plot (not reproduced here) do not suggest any unusual
patterns. Thus the normality assumption is reasonable.
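A minimal R sketch of these diagnostics, assuming the one-way fit is stored in a hypothetical lm object batch.fit:
plot( resid(batch.fit) ~ fitted(batch.fit), pch=19 ); abline( h=0 )   # raw residual plot
qqnorm( resid(batch.fit) ); qqline( resid(batch.fit) )                # normal QQ-plot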
(d) Can you check whether there exists an interaction between the Design factor and the
Region factor? If no, explain why not. If yes, write your statistical model, state the
hypotheses, analyze the data, and report your findings.
Yes. Use Tukey's 1-degree-of-freedom test, with model $y_{ij} = \mu + \tau_i + \alpha_j + \gamma\tau_i\alpha_j + \varepsilon_{ij}$. The
hypotheses are
$H_0\colon \gamma = 0$
$H_1\colon \gamma \neq 0$
The following SAS output shows that the interaction is not significant (P-value = 0.4741).
Therefore the additive model from above is appropriate.
Source DF Type III SS Mean Square F Value Pr > F
design 2 821.240880 410.620440 0.61 0.5808
region 3 1099.093874 366.364625 0.54 0.6745
q 1 404.852146 404.852146 0.60 0.4741
SAS code for this analysis (the GLM steps shown here reproduce the output above):
data one;
input design region $ y;
datalines;
1 NE 225
1 NW 350
1 SE 220
1 SW 375
2 NE 400
2 NW 525
2 SE 390
2 SW 550
3 NE 275
3 NW 340
3 SE 200
3 SW 320
;
* additive fit, saving the fitted values as pred in data set myout;
proc glm data=one;
  class design region;
  model y = design region;
  output out=myout p=pred;
run;
* square the fitted values to form Tukey's 1-df regressor q;
data two;
set myout;
q=pred*pred;
run;
* refit with q, whose F-test is the 1-df test for nonadditivity;
proc glm data=two;
  class design region;
  model y = design region q;
run;
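For comparison, a minimal R version of the same test (a sketch; the data frame and object names are illustrative):
# assemble the Design x Region data shown above
d = data.frame( design = factor(rep(1:3, each=4)),
                region = factor(rep(c("NE","NW","SE","SW"), times=3)),
                y = c(225,350,220,375, 400,525,390,550, 275,340,200,320) )
add.fit = lm( y ~ design + region, data=d )    # additive fit
d$q = fitted(add.fit)^2                        # Tukey's 1-df regressor
anova( lm(y ~ design + region + q, data=d) )   # F-test for q matches P = 0.4741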
Treatment Replicate
A B C Combination I II III
- - - (1) 22 31 25
+ - - a 32 43 29
- + - b 35 34 50
+ + - ab 55 47 46
- - + c 44 45 38
+ - + ac 40 37 36
- + + bc 60 50 54
+ + + abc 39 41 47
(c) Use the analysis of variance to confirm your informal conclusions from part (b).
The ANOVA table below confirms the significance of factors B and C and of the AC interaction:
their P-values are 0.0001, 0.0077, and 0.0012, respectively.
Source DF Type III SS Mean Square F Value Pr > F
A 1 0.6666667 0.6666667 0.02 0.8837
B 1 770.6666667 770.6666667 25.55 0.0001
A*B 1 16.6666667 16.6666667 0.55 0.4681
C 1 280.1666667 280.1666667 9.29 0.0077
A*C 1 468.1666667 468.1666667 15.52 0.0012
B*C 1 48.1666667 48.1666667 1.60 0.2245
A*B*C 1 28.1666667 28.1666667 0.93 0.3483
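A minimal R sketch that reproduces this table from the data listed above (names are illustrative):
# 2^3 factorial in standard order, three replicates
d = expand.grid( A=c(-1,1), B=c(-1,1), C=c(-1,1) )   # rows: (1), a, b, ab, c, ac, bc, abc
d = d[ rep(1:8, times=3), ]
d$life = c( 22,32,35,55,44,40,60,39,    # replicate I
            31,43,34,47,45,37,50,41,    # replicate II
            25,29,50,46,38,36,54,47 )   # replicate III
summary( aov(life ~ A*B*C, data=d) )    # 1-df terms match the Type III table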
(d) Write down a regression model for predicting tool life (in hours) based on the results of this
experiment.
Since the AC interaction is significant, Factor A must be retained in the model even though it
is not significant by itself. The prediction equation then becomes
$$\widehat{\mathrm{Life}} = 40.8333 + 0.16667\,x_A + 5.66667\,x_B + 3.41667\,x_C - 4.41667\,x_A x_C$$
(e) What levels of A, B, and C would you recommend using? Justify your answer.
To maximize tool life, we recommend the high levels of factors B and C and the low level of
factor A (see the numerical check below). This is because
(i) all three main factors have a positive effect,
(ii) the AC interaction has a negative effect, so A and C should be set at opposite levels, and
(iii) the coefficient of factor C is larger in magnitude than that of factor A, so C is the one to set high.
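As a numerical check (arithmetic from the prediction equation above, not part of the original solution), the fitted mean life at the recommended settings $(x_A, x_B, x_C) = (-1, +1, +1)$ is
$$40.8333 - 0.16667 + 5.66667 + 3.41667 + 4.41667 \approx 54.17 \text{ hours},$$
the largest fitted value among the eight factor-level combinations, and close to the observed bc cell mean $(60+50+54)/3 \approx 54.67$.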
* create the two- and three-factor interaction columns from the coded factors;
data inter;
set Q3;
AB=A*B; AC=A*C; BC=B*C; ABC=A*B*C;
run;
We see that some multicollinearity may exist between X1 and X3 and between X2 and X3 (no
surprise in either case). A further check is available via the sample correlations among the
predictor variables:
cor( cbind(X1,X2,X3) )
X1 X2 X3
X1 1.0000000 0.4425075 0.8883073
X2 0.4425075 1.0000000 0.7890032
X3 0.8883073 0.7890032 1.0000000
where we see large correlations with the interaction term X3.
While informative, the above analysis is subjective. A more objective, definitive check
employs the variance inflation factors (VIFs):
require( car )
vif( lm(Y ~ X1+X2+X3) )
X1 X2 X3
29.35675 16.40282 62.54291
where clearly max{VIFk} > 10. (Also, the mean VIF is 36.1008, which is above the target
bound of 6.) Multicollinearity is a clear problem with these data.
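As a cross-check of what vif() reports (a minimal sketch of the definition, using the same variables as above), each VIF is $1/(1 - R_k^2)$, where $R_k^2$ comes from regressing the k-th predictor on the remaining ones:
# VIF for X1 from first principles
R2.1 = summary( lm(X1 ~ X2 + X3) )$r.squared
1/(1 - R2.1)    # should reproduce the 29.35675 reported above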
(b) To assuage any concerns over multicollinearity, apply a ridge regression analysis to these
data. Begin with a ridge trace plot and use it to identify a plausible value for the ridge
tuning parameter c.
[Hint: if your plot covers sufficient values of c, the stock traceplot() function in the R
genridge package will draw vertical lines at plausible values, known as the modified Hoerl-
Kennard-Baldwin (HKB) and Lawless-Wang (LW) estimators. Decide if either could be a
useful choice here and if so, select it.]
For a ridge regression fit, start by centering the response variable to $U_i = Y_i - \bar{Y}$ and
standardizing the predictors to Z1, Z2, and Z3, each with zero mean and unit variance:
U = scale( Y, scale=FALSE )   # center the response only
Z1 = scale( X1 )              # center and scale each predictor
Z2 = scale( X2 )
Z3 = scale( X3 )
Now load the genridge package, choose a range for c (here 0 < c < 10) and construct the
traceplot using the ridge() function:
require( genridge )
c = seq( from=.01,to=10,by=.01 )
traceplot( ridge(U~Z1+Z2+Z3, lambda=c) )
The stock traceplot suggests that the Lawless-Wang (LW) estimate for c is far enough
along to where the traces start to flatten and stabilize. So choose cLW. The exact value can
be found as the $kLW attribute in the ridge() object:
print( ridge(U~Z1+Z2+Z3,lambda=c)$kLW )
[1] 3.531847
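For comparison, the modified HKB estimate is stored alongside it in the same object (attribute name $kHKB, assumed here by analogy with the $kLW attribute above):
print( ridge(U~Z1+Z2+Z3,lambda=c)$kHKB )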
(c) Using the tuning parameter, c, selected in part (b), calculate the ridge-regression predicted
values. From these find the raw residuals and construct a residual plot. Comment on any
patterns that may appear.
The LW estimate is c = 3.531847. The final ridge regression object is then constructed via:
cLW = ridge(U~Z1+Z2+Z3,lambda=c)$kLW
college.ridge = ridge( U~Z1+Z2+Z3, lambda=cLW )
Fitted values and residuals can be found via direct calculation. First identify the ridge
regression coefficients (use the $coef attribute of the ridge() object, but select its
components carefully!) and then proceed from there:
Zmtx = as.matrix( cbind(Z1,Z2,Z3) )
Uhat = Zmtx %*% college.ridge$coef[,1:3]
RawResid = U - Uhat
plot( RawResid ~ Uhat, pch=19 ); abline( h=0 )
The plot shows possible decreasing variability with increasing response (i.e. variance
heterogeneity), or perhaps a decreasing trend suggestive of a possible hidden/unidentified
predictor variable. Either way, deeper diagnostic study of these data is warranted.
5. Consider the simple linear regression model: $Y_i \sim$ indep. $N(\beta_0 + \beta_1 X_i, \sigma^2)$, $i = 1, \ldots, n$, where
X is a non-stochastic predictor variable. Suppose interest exists in estimating the X
value(s) at which $E[Y] = 0$. Let this target parameter be $\theta$.
(b) State the maximum likelihood estimator (MLE) of $\theta$. Call this $\hat{\theta}$.
(c) Recall from Section 5.5.4 in Casella & Berger that the Delta Method can be used to
determine the asymptotic features of a function of random variables. In particular, for a
bivariate random vector $[U_1\ U_2]^T$ and a differentiable function $g(u_1,u_2)$, where $E[U_j] = \mu_j$
($j = 1,2$),
$$E[g(U_1,U_2)] \approx g(\mu_1,\mu_2) + \sum_{j=1}^{2} \frac{\partial g(\mu_1,\mu_2)}{\partial u_j}\, E(U_j - \mu_j)$$
and
$$\mathrm{Var}[g(U_1,U_2)] \approx \sum_{j=1}^{2} \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_j} \right\}^2 \mathrm{Var}(U_j) + 2 \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_1} \right\} \left\{ \frac{\partial g(\mu_1,\mu_2)}{\partial u_2} \right\} \mathrm{Cov}(U_1,U_2).$$
Let $g(\beta_0,\beta_1) = \theta = -\beta_0/\beta_1$. We know that the MLEs for $\beta_j$ are unbiased, such that $E[\hat{\beta}_j] = \beta_j$
($j = 0,1$). Then from the Delta Method we see
(i)
$$E[\hat{\theta}] = E[-\hat{\beta}_0/\hat{\beta}_1] \approx g(\beta_0,\beta_1) + \sum_{j=0}^{1} \frac{\partial g(\beta_0,\beta_1)}{\partial \beta_j}\, E(\hat{\beta}_j - \beta_j) = -\beta_0/\beta_1 + \sum_{j=0}^{1} \frac{\partial g(\beta_0,\beta_1)}{\partial \beta_j}\,(0) = -\beta_0/\beta_1 = \theta.$$
(This is actually true in general: under the conditions of the Delta Method, $E[\hat{\beta}_0/\hat{\beta}_1] \approx \beta_0/\beta_1$.
See Example 5.5.27 in Casella & Berger.)
(ii) From the Delta Method,
$$\mathrm{Var}[\hat{\theta}] = \mathrm{Var}[-\hat{\beta}_0/\hat{\beta}_1] = (-1)^2\,\mathrm{Var}[\hat{\beta}_0/\hat{\beta}_1] \approx \left(\frac{\beta_0}{\beta_1}\right)^2 \left\{ \frac{\mathrm{Var}[\hat{\beta}_0]}{\beta_0^2} + \frac{\mathrm{Var}[\hat{\beta}_1]}{\beta_1^2} - \frac{2\,\mathrm{Cov}[\hat{\beta}_0,\hat{\beta}_1]}{\beta_0\beta_1} \right\}.$$
Now let $SS_x = \sum(X_i - \bar{X})^2$. We know $\mathrm{Var}[\hat{\beta}_0] = \sigma^2(n^{-1}SS_x + \bar{X}^2)/SS_x$, $\mathrm{Var}[\hat{\beta}_1] = \sigma^2/SS_x$, and
$\mathrm{Cov}[\hat{\beta}_0,\hat{\beta}_1] = -\sigma^2\bar{X}/SS_x$. This allows for some simplification:
$$\mathrm{Var}[\hat{\theta}] \approx \frac{\sigma^2}{\beta_1^2 SS_x}\left\{ \frac{SS_x}{n} + \bar{X}^2 + \frac{\beta_0^2}{\beta_1^2} + \frac{2\bar{X}\beta_0}{\beta_1} \right\} = \frac{\sigma^2}{\beta_1^2 SS_x}\left\{ \frac{SS_x}{n} + \bar{X}^2 + \theta^2 - 2\bar{X}\theta \right\} = \frac{\sigma^2}{\beta_1^2 SS_x}\left\{ \frac{SS_x}{n} + (\bar{X} - \theta)^2 \right\}.$$
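As an illustrative check (a small simulation sketch, not part of the original solution; all parameter values below are arbitrary choices), the empirical variance of $\hat{\theta} = -\hat{\beta}_0/\hat{\beta}_1$ can be compared with the delta-method formula:
set.seed(1)
n = 50; beta0 = 2; beta1 = -0.5; sigma = 1
X = runif( n, 0, 10 ); SSx = sum( (X - mean(X))^2 )
theta.hat = replicate( 5000, {
  Y = beta0 + beta1*X + rnorm(n, sd=sigma)
  b = coef( lm(Y ~ X) )
  -b[1]/b[2]                                # MLE of theta for this sample
} )
var( theta.hat )                            # empirical variance
theta = -beta0/beta1
(sigma^2/(beta1^2*SSx)) * (SSx/n + (mean(X)-theta)^2)   # delta-method approximation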
6. Assume a simple linear regression-through-the-origin model: $Y_i \sim$ indep. $N(\beta_1 X_i, \sigma^2)$, $i = 1,
\ldots, n$. Equation (4.14) in Kutner et al. shows that the maximum likelihood estimator (MLE)
of $\beta_1$ is
$$b_1 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{j=1}^{n} X_j^2}$$
with standard error
$$s\{b_1\} = \sqrt{\frac{MSE}{\sum_{k=1}^{n} X_k^2}},$$
where $MSE = \sum_{i=1}^{n} (Y_i - b_1 X_i)^2/(n-1)$ is an unbiased estimator of $\sigma^2$ (with one estimated
parameter, the error degrees of freedom are $n-1$). Starting from the basic
principles of the Working-Hotelling-Scheffé (WHS) construction, derive the $1-\alpha$ WHS
simultaneous confidence band for the mean response $E[Y]$ with this model. What is the
"shape" of this band?
The WHS band will take the form $\hat{Y}_h \pm W\, s\{\hat{Y}_h\}$, where the WHS critical point is based on
$W^2 = p\,F(1-\alpha;\, p,\, n-p)$. Here $p = 1$, while $\hat{Y}_h = b_1 X_h$ with standard error
$$s\{\hat{Y}_h\} = \sqrt{X_h^2\,\frac{MSE}{\sum_{k=1}^{n} X_k^2}} = |X_h|\, s\{b_1\}$$
for all $X_h$. Thus the band will be $b_1 X_h \pm W\,|X_h|\, s\{b_1\}$, where $W^2 = F(1-\alpha;\,1,\,n-1)$.
But recall that $F(1-\alpha;\,1,\,n-1) = t^2(1-\tfrac{\alpha}{2};\,n-1)$, so this simplifies to
$$b_1 X_h \pm t\!\left(1-\tfrac{\alpha}{2};\, n-1\right) |X_h|\, s\{b_1\}$$
for all $X_h$. Note the form of this "band": a pair of straight lines that intersect at the origin.
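A minimal R sketch of this band (illustrative data; all object names are hypothetical):
set.seed(2)
X = runif( 30, 0, 5 ); Y = 2*X + rnorm(30)
b1 = sum(X*Y)/sum(X^2)                        # MLE through the origin
MSE = sum( (Y - b1*X)^2 )/(length(X) - 1)     # df = n - 1 (one parameter)
s.b1 = sqrt( MSE/sum(X^2) )
W = sqrt( qf(0.95, 1, length(X)-1) )          # WHS critical point, alpha = 0.05
Xh = seq( 0, 5, length=100 )
plot( Y ~ X, pch=19 ); lines( Xh, b1*Xh )
lines( Xh, b1*Xh + W*abs(Xh)*s.b1, lty=2 )    # upper limit: a straight line
lines( Xh, b1*Xh - W*abs(Xh)*s.b1, lty=2 )    # lower limit: through the origin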