
MS3252 Midterm Practice Questions

I. True or False (2pts each)


Note: If a statement is not always true, it is regarded as a “false” statement.
1. A useful tool for assessing the appropriateness of model assumptions is a residuals
versus fitted values plot; if the model assumptions hold, this should resemble a null
plot (i.e. the data scatters around the zero line randomly).
TRUE

2. To study the relationship between cholesterol and patient height and weight, researchers
consider a regression model E(Y | X) = β0 + β1 X, where Y = LDL cholesterol in mg/dL,
and X = BMI = weight in kg / (height in m)²; in the terminology of this course, BMI is
the predictor, and height and weight are the two independent variables.
FALSE

3. If the coefficient of determination for the regression y∼x1 is 0.60, so 60% of the
variation in y is explained by its linear association with x1, then the coefficient of
determination for the regression y∼x1+x2 will be at least 0.60.
TRUE

4. Comparing the two regression models y∼x1 and y∼x1+x2, their SST are different.
FALSE

5. The sum of squares due to regression, SSR = SST − SSE = Σᵢ₌₁ⁿ (Ŷi − Ȳ)², represents
the variation in y that remains unexplained after accounting for the relationship
between y and x hypothesized by the regression model.
FALSE

6. Consider a regression model E(Y | X1, X2, X3) = β0 + β1 X1 + β2 X2 + β3 X3. If all of
r12, r13 and r23 are not that high, where rij denotes the pairwise correlation between
Xi and Xj, then we can conclude that no multicollinearity exists.
FALSE

7. Consider testing the hypothesis H0 : Reduced Model versus H1 : Full Model based
on the test statistic F = [(SSEreduced − SSEfull)/(dfreduced − dffull)] / [SSEfull/dffull].
The denominator degrees of freedom is the residual degrees of freedom for the full model,
while the numerator degrees of freedom is the number of additional parameters in the full
model over the reduced model.
TRUE

8. Consider a regression model of y∼x1, testing whether or not the slope coefficient
is zero is equivalent to testing whether or not the correlation coefficient between y
and x1 is zero.
TRUE

9. In a linear regression model, the variance function is assumed to be a constant
function of the response.
FALSE

10. For linear regression, the adjusted R2 is non-decreasing when more predictors are
added to the model, because more variation of the response is being explained.
FALSE

II. Problem Solving
Note: Show your steps! You may lose points for not justifying your answers. If the desired
degree of freedom is not available in the t or F tables, round it off to the closest available.
1. In an experiment, a metal ball is released from rest from different heights near the
ground surface and allowed to free fall in vacuum. The durations (in s) taken for the
ball to reach the ground (time) and its speeds (in m/s) when it reaches the ground
(speed) are recorded. The sample variance of speed is 62.45004 and that of time
is 0.6507599, while the correlation between them is 0.9930025. The R output of the
linear regression model speed∼time is given below, but some figures are missing.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3645 0.5389 0.676 0.5
time ___A__ ___B__ ___C__ ______

Residual standard error: 0.9379 on 99 degrees of freedom


Multiple R-squared: ___D__, Adjusted R-squared: ___E__
F-statistic: ___F__ on 1 and 99 DF, p-value: ______
(a) What is the sample size of this data?
n = 101.

(b) Find the values of A, B, and C.



A = b1 = r (SY / SX) = 0.9930025 √(62.45004 / 0.6507599) = 9.727614.

B = Sb1 = Se / √((n − 1) SX²) = 0.9379 / √(100 × 0.6507599) = 0.1162642.

C = b1 / Sb1 = 9.727614 / 0.1162642 = 83.66818.
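
These figures are easy to check numerically; the following is a minimal R sketch that reproduces A, B and C using only the sample statistics quoted in the question (nothing else is assumed).

n  <- 101
r  <- 0.9930025
vY <- 62.45004      # sample variance of speed
vX <- 0.6507599     # sample variance of time
se <- 0.9379        # residual standard error

A <- r * sqrt(vY / vX)         # slope estimate b1, about 9.727614
B <- se / sqrt((n - 1) * vX)   # standard error of b1, about 0.1162642
C <- A / B                     # t value, about 83.66818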

(c) Interpret carefully the meaning of A in the model.


The estimated slope coefficient A equals 9.727614, which means that, on average,
the speed of the ball when it reaches the ground increases by 9.727614 m/s for
each additional 1 s of fall time.

(d) Find the values of D and E.


D = R² = r² = 0.9930025² = 0.986054.

E = adjusted R² = 1 − (1 − R²)(100/99) = 0.9859131.
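
The same two figures follow directly in R from r and the degrees of freedom (a quick sketch, again using only values stated above).

r <- 0.9930025
n <- 101
D <- r^2                               # 0.986054
E <- 1 - (1 - D) * (n - 1) / (n - 2)   # 0.9859131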

(e) Find the F-statistic of the model, i.e. the value of F. State the corresponding
null and alternative hypotheses, degrees of freedom, and conclude the test at
5% significance level.
• H0 : βtime = 0 vs H1 : βtime ≠ 0.
• The F-statistic is F = C² = 7000.364.
• With d.f. 1 and 99, the critical value is between 3.92 and 4.00.

• Reject the null, i.e. time is influential on speed.
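
For reference, the exact critical value can be obtained in R rather than bracketed from the table (a sketch; the bracket 3.92 to 4.00 is all the question requires).

Fstat <- 83.66818^2                   # C^2 = 7000.364
crit  <- qf(0.95, df1 = 1, df2 = 99)  # about 3.94, between 3.92 and 4.00
Fstat > crit                          # TRUE, so reject H0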

(f) Construct a 95% confidence interval for the slope coefficient of time.

b1 ± t0.025,99 Sb1 = 9.727614 ± 1.9842(0.1162642) = [9.496923, 9.958305].
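
A short R sketch of the same interval, using the exact t quantile in place of the tabulated 1.9842:

b1  <- 9.727614
sb1 <- 0.1162642
b1 + c(-1, 1) * qt(0.975, df = 99) * sb1   # roughly [9.4969, 9.9583]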

(g) Give a point estimate of the mean speed when reaching the ground for a journey
taking 5 s.
Ŷ = 0.3645 + 9.727614(5) = 49.00257(m/s).

(h) Given the sample mean of time is 4.564794, compute Sm² for time = 5. Hence,
find the 95% confidence interval for the above point estimate.

Sm² = Se² [1/n + (5 − X̄)² / ((n − 1) SX²)] = 0.01126972.
C.I. = Ŷ ± t0.025,99 Sm = [48.79193, 49.21321].
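
The variance of the estimated mean response and the interval can be verified in R as follows (a sketch using only the quantities given in the question).

n    <- 101
se   <- 0.9379
vX   <- 0.6507599
xbar <- 4.564794
yhat <- 0.3645 + 9.727614 * 5                           # 49.00257
Sm2  <- se^2 * (1 / n + (5 - xbar)^2 / ((n - 1) * vX))  # 0.01126972
yhat + c(-1, 1) * qt(0.975, df = 99) * sqrt(Sm2)        # roughly [48.792, 49.213]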

2. The R output of the linear regression model y∼x1 is given below.


Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9781 0.1405 6.963 1.43e-07
x1 -0.3204 0.1042 -3.076 0.00465

Residual standard error: 0.7691 on 28 degrees of freedom


Multiple R-squared: 0.2525, Adjusted R-squared: 0.2259
F-statistic: ____ on 1 and 28 DF, p-value: ________
Reproduce the following ANOVA table, i.e. find the values of c1 to c8.
Df Sum Sq Mean Sq F value Pr(>F)
x1 c1 c2 c3 c4 c5
Residuals c6 c7 c8

c1 = 1
c6 = 28
c5 = 0.00465
c4 = (−3.076)² = 9.461776
c8 = 0.7691² = 0.5915148
c7 = c6 × c8 = 16.56241
c3 = c4 × c8 = 5.596781
c2 = c1 × c3 = 5.596781
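
As a check, the whole ANOVA table can be rebuilt in R from the printed coefficient output alone (a sketch; only figures shown in the output are used).

t_x1 <- -3.076
se_resid <- 0.7691
df_model <- 1; df_resid <- 28

c4 <- t_x1^2            # F value             = 9.461776
c8 <- se_resid^2        # residual mean sq    = 0.5915148
c7 <- df_resid * c8     # residual sum of sq  = 16.56241
c3 <- c4 * c8           # regression mean sq  = 5.596781
c2 <- df_model * c3     # regression sum of sq = 5.596781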

3. The R output of the linear regression model y∼x1+x2 is given below.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.98498 0.13363 7.371 6.28e-08
x1 -0.32333 0.09907 -3.264 0.00298
x2 0.39379 0.19797 1.989 0.05690

Residual standard error: 0.7315 on 27 degrees of freedom


Multiple R-squared: 0.3481, Adjusted R-squared: 0.2998
F-statistic: 7.208 on 2 and 27 DF, p-value: 0.003102
(a) What is the sample size?
n = 30.

(b) Write down the estimated regression equation.


ŷ = 0.98498 − 0.32333x1 + 0.39379x2.

(c) Interpret the estimated slope coefficient of x1 carefully.


We expect y to decrease by 0.32333 unit on average for each unit increase in
x1 while x2 remains unchanged.

(d) Test whether or not the slope coefficient of x1 is zero. State your hypotheses,
test statistic, degree of freedom, and conclusion at 5% significance level.
• H0 : β1 = 0 vs H1 : β1 ≠ 0, where β1 is the true slope coefficient of x1.
• The t-statistic is −3.264.
• The d.f. is 27, C.V. is 2.0518, and p-value is 0.00298.
• Reject the null, i.e. x1 is influential on y in the presence of x2.

(e) Construct a 95% confidence interval for the slope coefficient of x1.

b1 ± t0.025,27 Sb1 = −0.32333 ± 2.0518(0.09907) = [−0.5266018, −0.1200582].

(f) Give a point estimate of y for a new observation with x1= 1 and x2= −1.
Also, find the corresponding 95% confidence interval if Sp² = 0.6005.

ŷ = 0.98498 − 0.32333(1) + 0.39379(−1) = 0.26786.

Prediction Interval = ŷ ± t0.025,27 Sp = [−1.3221, 1.8578].
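
A short R sketch of the same prediction interval, assuming only the given Sp² = 0.6005:

yhat <- 0.98498 - 0.32333 * 1 + 0.39379 * (-1)   # 0.26786
Sp   <- sqrt(0.6005)
yhat + c(-1, 1) * qt(0.975, df = 27) * Sp        # roughly [-1.3221, 1.8578]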

(g) Perform a partial F-test between the models y∼x1 and y∼x1+x2. State your
hypotheses, test statistic, degree of freedom, and conclusion at 5% significance
level.
• H0 : β2 = 0 vs H1 : β2 ≠ 0, where β2 is the true slope coefficient of x2.
• The F-statistic is 1.989² = 3.956121.
• The d.f. are {1, 27}, C.V. is 4.21, and p-value is 0.05690.
• Do not reject the null, i.e. x2 is not influential on y in the presence of x1.
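
The partial F statistic here is simply the squared t statistic of x2, and the exact critical value can be checked in R (with the raw data available, anova() on the two fitted models would give the same test).

Fstat <- 1.989^2                       # 3.956121
crit  <- qf(0.95, df1 = 1, df2 = 27)   # about 4.21
Fstat > crit                           # FALSE, so do not reject H0
# anova(lm(y ~ x1), lm(y ~ x1 + x2))   # equivalent test, given the data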

4. Consider a response y and four candidate predictors x1,x2,x3,x4. The following
table presents some statistics using different sets of variables as predictors.
predictors rsquare adjr cp aic
1 0 0 9.715913 47.22907
x1 0.03243409 -0.002121838 10.492631 48.23992
x2 0.16076178 0.130788991 5.652635 43.97125
x3 0.01834042 -0.016718846 11.024187 48.67375
x4 0.09851366 0.066317717 8.000380 46.11776
x1 x2 0.19107243 0.131151865 6.509442 44.86770
x1 x3 0.05421051 -0.015847975 11.671314 49.55702
x1 x4 0.12791266 0.063313600 8.891570 47.12310
x2 x3 0.16268818 0.100665082 7.579979 45.90231
x2 x4 0.30322294 0.251609826 2.279583 40.39038
x3 x4 0.12804384 0.063454496 8.886622 47.11859
x1 x2 x3 0.19425379 0.101283073 8.389454 46.74948
x1 x2 x4 0.32970944 0.252368225 3.280620 41.22775
x1 x3 x4 0.16140194 0.064640623 9.628491 47.94836
x2 x3 x4 0.30882514 0.229074195 4.068291 42.14820
x1 x2 x3 x4 0.33714981 0.231093776 5.000000 42.89289
(a) Using the all possible regression procedure with R2 as the criterion, which is
the best model and why?
y∼x1+x2+x3+x4, because it has the highest R2.

(b) Consider the forward selection procedure. List all the models that we shall
compare in the first step. Which model will you select to proceed and why?
y∼1, y∼x1, y∼x2, y∼x3, and y∼x4. We choose y∼x2 to proceed because it
has the lowest AIC.

(c) Continued on part (b), list all the models that we shall compare in the second
step. Which model will you select to proceed and why?
y∼x2, y∼x1+x2, y∼x2+x3, and y∼x2+x4. We choose y∼x2+x4 to proceed
because it has the lowest AIC.

(d) Continued on part (c), proceed in a similar manner for the third step and
onwards. Hence, which is the best model using the forward selection procedure
and why?
In the third step, we compare y∼x2+x4, y∼x1+x2+x4, and y∼x2+x3+x4. As
the current model y∼x2+x4 has the lowest AIC, we terminate the procedure
and claim it is the best model.
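
If the raw data were available, say in a hypothetical data frame dat with columns y, x1, x2, x3, x4, the same forward selection could be run automatically in R with step(), which compares candidate models by AIC exactly as in parts (b)-(d).

null_model <- lm(y ~ 1, data = dat)                            # start from the intercept-only model
step(null_model, scope = ~ x1 + x2 + x3 + x4, direction = "forward")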
