pastPaper2024Spring Assm02
Time: 12:30-14:30
This question paper has 21 pages; the appendices are from page NINE.
Instructions: This paper has SIX questions. Make sure to complete all subquestions.
Student ID:
Student Name:
DO NOT TURN OVER THE PAGE UNTIL YOU ARE TOLD TO DO SO.
Question 2: Multiple Linear Regression (MLR) [Total: 30 marks]
To answer Question 2 (Q2), refer to Appendix 1: Codes and Outputs for Q2.
The dataset for Question 2 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes variables in the dataset for Question 2.
An economist took a survey and wanted to determine the effect of the predictors nox (X1), rm
(X2), and lstat (X3) on the response medv (Y). Establish the following multiple linear regression
(MLR) model to characterize their relationship, based on 50 samples:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon. \qquad (1)$$
2.1 Fit a multiple linear regression model relating Y to all three mentioned regressors.
[2 marks]
2.2 Construct the analysis-of-variance table based on model (1) and the code and output in Appendix 1.
[4 marks]
What is the purpose of the ANOVA here? Explain it in words and write down the null
hypothesis. [(2+2) marks]
2.3 Use the t-test to assess the contribution of rm (X2 ) to the model by the p-value method
(α = 0.05). [4 marks]
2.4 Based on the ANOVA table, use the partial F test to assess the contribution of rm (X2 )
to the model by the critical region approach (α = 0.05). [5 marks]
How is this partial F statistic related to the t-test for β2 calculated in Subquestion 2.3?
[3 marks]
Question 3: Multicollinearity [Total: 10 marks]
To answer Question 3 (Q3), refer to Appendix 2: Codes and Outputs for Q3.
The dataset for Question 3 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 3.
The economist collected a dataset to examine the effect of the predictors indus (X1), nox (X2),
rm (X3), age (X4), rad (X5), tax (X6), ptratio (X7), black (X8), lstat (X9) and medv (X10) on
the response crim (Y), and to check whether there is multicollinearity among the predictors.
3.2 In general, what are the reasons that may cause multicollinearity? Give at least two
reasons.
[2 marks]
3.4 How does one deal with multicollinearity? Name two specific methods.
[2 marks]
Question 4: Residual Analysis [Total: 10 marks]
To answer Question 4 (Q4), refer to Appendix 3: Codes and Outputs for Q4.
The dataset for Question 4 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 4.
Suppose that medv (Y) is the response variable to the predictors crim (X1), rm (X2), tax (X3)
and lstat (X4). The model for the data is expressible as model (2).
4.1 If there is a large difference between the ordinary residual and the PRESS residual, what
can one learn about the model? [1 mark]
4.2 According to the normal probability plot of the residuals (Figure: Q4-1), does there seem
to be any problem with the normality assumption? Explain. [(1+2) marks]
4.3 Interpret the plot of the residuals against the fitted values (Figure: Q4-2). [(1+2) marks]
4.4 Figure: Q4-3 plots residuals versus an extra predictor X5 . Does this plot indicate that
the model (2) will be improved by adding the predictor X5 ? Explain.
[(1+2) marks]
Question 5: Variable Selection [Total: 10 marks]
To answer Question 5 (Q5), refer to Appendix 4: Codes and Outputs for Q5.
The dataset for Question 5 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 5.
Suppose that medv (Y) is the response variable that relates to the predictors crim (X1), indus
(X2), nox (X3), rm (X4), age (X5), rad (X6), tax (X7), ptratio (X8), black (X9) and lstat (X10)
by multiple linear regression. We now employ variable selection methods to obtain reduced
models to fit the data. The sample size is 50.
5.1 Based on the output of method 1, which selection method (forward, backward, or
stepwise) was used to select the appropriate model? Explain. [(1+2) marks]
5.2 Based on the output of method 2, which criterion (AIC or BIC) is the selection method
based on? Explain. [(1+2) marks]
5.3 Give the selected fitted model from subquestion 5.1. [1 mark]
5.4 Based on Figure: Q5-1, determine which of the explanatory variables should be included
in the regression model using the all possible regressions approach and BIC. Explain.
[(1+2) marks]
Question 6: Random Effects Model [Total: 30 marks]
To answer Question 6 (Q6), refer to Appendix 5: Codes and Outputs for Q6.
The table below describes the variables in the dataset for Question 6.
Variable      Description
Instructor    name of the leader
Town          the location where the sodium was collected
Sodium        the weight of the sodium (milligrams)
Suppose that we have a dataset containing the weight of sodium in different towns measured by
different instructors.
The Laird-Ware form based on the dataset is a random-effects one-way ANOVA model:
$$y_{ij} = \beta_0 + b_{0i} + \epsilon_{ij}, \quad i = 1, 2, \quad j = 1, \dots, n_i, \quad \sum_{i=1}^{2} n_i = 60, \qquad (3)$$
where the observation $y_{ij}$ is the weight of the $j$th of the $n_i$ sodium samples in the
$i$th town, $\beta_0$ is the fixed-effect coefficient, which represents the general average weight of the
sodium, $b_{0i}$ is the deviation of the weight of the sodium in town $i$ from the general mean,
and $\epsilon_{ij}$ is the deviation of the weight of the $j$th sodium sample in town $i$ from the town mean.
We assume $b_{0i} \overset{\text{i.i.d.}}{\sim} N(0, \gamma_0^2)$ and $\epsilon_{ij} \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$.
6.1 Describe the nested structure and give a brief reason. [(1+1) marks]
6.2 What is the first level and what is the second level, respectively? [(1+1) marks]
6.3 What are the ML and REML estimates of the parameters (β0, γ0 and σ) in model (3)
based on the given dataset, respectively? [(6+6) marks]
6.4 How many “clusters” ( = level-2 units) are in model (3)? Is the difference between the
ML and REML estimates important? Explain. [(1+2+2) marks]
6.5 Determine the intra-class correlation based on the results of the REML method.
[4 marks]
6.6 How much does the variation between different towns affect the total variation of sodium
based on the results of the REML method? [2 marks]
6.7 What are the 95% confidence intervals of σ, β0 and γ0 in model (3) based on the REML method?
[3 marks]
END
Appendices: Codes and Outputs
Appendix 1: Codes and Outputs for Q2
Appendix 2: Codes and Outputs for Q3
Appendix 3: Codes and Outputs for Q4
Figure: Q4-1 (normal probability plot; axis labels: probability, Residuals$t_i)
Figure: Q4-2
Figure: Q4-3 (Residuals vs Predictor X5)
Appendix 4: Codes and Outputs for Q5
Figure: Q5-1
Appendix 5: Codes and Outputs for Q6
AMA3602 Solutions
Solutions and Marking Scheme
Solution of Question 2
2.1 The fitted model is Ŷ = 70.52308 + 0.88285X1 + 0.18603X2 + 0.10950X3.
2.2
$$F_0 = \frac{MS_R}{MS_{Res}} = \frac{42395}{49} = 865.2$$
$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} > 1000$$
3.2
x1 and x2: 0.92; x1 and x3: -0.88
3.3
(1) The VIFs are 11.484663, 7.327303, 2.264535, 4.574577, 1.140066.
(2) There is a multicollinearity problem because one VIF exceeds 10.
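The check in (2) can be sketched in a few lines of Python. This is only an illustration: the VIF values are those listed in (1), the threshold of 10 is the rule of thumb applied in (2), and the variable names are hypothetical. The implied R² values follow from the standard identity VIF_j = 1/(1 − R_j²).

```python
# VIF values as reported in part (1) of the solution.
vifs = [11.484663, 7.327303, 2.264535, 4.574577, 1.140066]

# Rule of thumb used in part (2): a VIF above 10 signals multicollinearity.
THRESHOLD = 10.0

# Keep the (1-based) position and value of every VIF above the threshold.
flagged = [(i + 1, v) for i, v in enumerate(vifs) if v > THRESHOLD]

# VIF_j = 1 / (1 - R_j^2), so the implied R_j^2 is 1 - 1 / VIF_j.
implied_r2 = [1.0 - 1.0 / v for v in vifs]

for position, value in flagged:
    print(f"VIF #{position} = {value:.6f} exceeds {THRESHOLD}")
```

Only the first VIF is flagged, which agrees with the conclusion in (2).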
3.4
Effects:
(1) Strong multicollinearity between x1 and x2 results in large variances and covariances
for the LSE of the regression coefficients:
as $|r_{12}| \to 1$, $\mathrm{Var}(\hat{\beta}_j) = C_{jj}\sigma^2 \to \infty$ and $\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2) = C_{12}\sigma^2 \to \pm\infty$;
(2) Multicollinearity also tends to produce LSE $\hat{\beta}_j$ that are too large in absolute value;
(3) When strong multicollinearity is present, LS will generally produce poor estimates of
the individual model parameters.
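Effect (1) can be illustrated numerically. The sketch below assumes the simplest setting, which is not stated in the solution itself: two regressors in correlation (unit-length) form, so that X'X = [[1, r12], [r12, 1]] and its inverse has C11 = 1/(1 − r12²) and C12 = −r12/(1 − r12²). The function name is hypothetical, and 0.92 is the x1-x2 correlation quoted in Solution 3.2.

```python
def c_matrix_entries(r12):
    """Return (C11, C12) of (X'X)^{-1} when X'X = [[1, r12], [r12, 1]]."""
    det = 1.0 - r12 ** 2  # determinant of the 2x2 correlation matrix
    return 1.0 / det, -r12 / det


# Var(beta_hat_j) = C_jj * sigma^2 and Cov(beta_hat_1, beta_hat_2) = C_12 * sigma^2,
# so both grow without bound in magnitude as |r12| approaches 1.
for r12 in (0.0, 0.5, 0.92, 0.99, 0.999):
    c11, c12 = c_matrix_entries(r12)
    print(f"r12 = {r12:<5}  C11 = {c11:10.3f}  C12 = {c12:10.3f}")
```

Under this assumption the sign of C12 is opposite to the sign of r12, which is why the covariance can diverge to either +∞ or −∞.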
Solution of Question 4
4.1
Backward.
Reason: direction = 'backward' is set in the function step().
4.2
BIC.
Reason: k = log(N) is set in the function step().
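The role of k can be sketched as follows. The penalised criterion has the generic form −2·logLik + k·(number of parameters); k = 2 gives AIC and k = log(N) gives BIC. The snippet is a Python illustration rather than the R code from the appendix, the function name is hypothetical, and N = 50 is the sample size stated in Question 5.

```python
import math


def information_criterion(log_likelihood, n_params, k):
    """Generic penalised criterion: -2 * logLik + k * (number of parameters)."""
    return -2.0 * log_likelihood + k * n_params


n = 50               # sample size stated in Question 5
k_aic = 2.0          # penalty per parameter for AIC
k_bic = math.log(n)  # penalty per parameter for BIC

print(f"AIC penalty per parameter: {k_aic:.3f}")
print(f"BIC penalty per parameter: {k_bic:.3f}")
```

Because log(50) is larger than 2, the BIC penalises each extra parameter more heavily than the AIC at this sample size and so tends to favour smaller models.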
4.3
4.4
Choose x1, x3, x4, x5.
Reason: x1, x3, x4, x5 appear in the top row (smallest BIC value) of the BIC plot.
4.5
Choose x1, x2, x3, x4, x5.
Reason: this model has the smallest Cp value.
Solution of Question 6
5a.1
Empty model, since it contains no predictors
5a.2
2 levels.
The first level:
$$Y_{ij} = \alpha_{0i} + \epsilon_{ij}$$
The second level:
$$\alpha_{0i} = \gamma_{00} + u_{0i}$$
5a.3
$$\mathrm{Var}(Y_{ij}) = d^2 + \sigma^2$$
5a.4
b0i represents the deviation of the (Aβ1−42) value in education attainment i from the
general mean; ϵij represents the deviation of patient j's (Aβ1−42) value in education
attainment i from the education attainment mean.
5a.5
REML: 3351/(3351+244469) = 0.01352191 = 1.35% (fit3:ML)
Explanation of the calculated ICC ρ:
1. $d^2 / \mathrm{Var}(Y_{ij})$: the proportion of variation in individuals' (Aβ1−42) values due to
differences between education attainments;
2. $\mathrm{cor}(Y_{ij}, Y_{ij'})$: the correlation between the (Aβ1−42) values of two individuals from
the same education attainment.
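The ICC arithmetic above can be reproduced with a short Python sketch. The two variance components are the values used in the solution (3351 for the between-group variance d² and 244469 for the residual variance σ²); the function and variable names are illustrative only.

```python
def intraclass_correlation(between_var, within_var):
    """ICC = between-group variance / (between-group + within-group variance)."""
    return between_var / (between_var + within_var)


d2 = 3351.0        # between-group variance component d^2 from the solution
sigma2 = 244469.0  # residual (within-group) variance sigma^2 from the solution

icc = intraclass_correlation(d2, sigma2)
print(f"ICC = {icc:.8f} ({icc:.2%})")
```

The result agrees with the 0.01352191 (about 1.35%) reported in the solution.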
5a.6
β0: [466.6411, 929.0779], d: [0.0000, 394.0216], σ: [429.1349, 578.1808]
5b.1
5b.2