pastPaper2024Spring Assm02

The document is an examination paper for the Applied Linear Models course at The Hong Kong Polytechnic University, detailing the structure and content of the exam, which includes six questions on topics such as Multiple Linear Regression, Multicollinearity, Residual Analysis, Variable Selection, and Random Effects Model. Each question has specific sub-questions that require analysis of datasets related to housing in Boston and sodium weight measurements in different towns. The exam is open-book and spans 21 pages, with instructions for answering provided.

THE HONG KONG POLYTECHNIC UNIVERSITY

Department of Applied Mathematics

Subject Code: AMA3602

Subject Title: Applied Linear Models

Date: 7th May, 2024

Time: 12:30-14:30

Time Allowed: 2 hours

This question paper has 21 pages; the appendices are from page NINE.

The examination is open-book.

Please write your answers on the answer sheets provided.

Do not write your answers on this paper.

Instructions: This paper has SIX questions. Make sure to complete all subquestions.

Subject Examiner: Dr. Catherine LIU

Student ID:

Student Name:

DO NOT TURN OVER THE PAGE UNTIL YOU ARE TOLD TO DO SO.
Question 2: Multiple Linear Regression (MLR) [Total: 30 marks]
To answer Question 2 (Q2), refer to Appendix 1: Codes and Outputs for Q2.
The dataset for Question 2 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 2.

Notation  Variable  Description
Y         medv      median value of owner-occupied homes in $1000s
X1        nox       nitrogen oxides concentration (parts per 10 million)
X2        rm        average number of rooms per dwelling
X3        lstat     lower status of the population (percent)

An economist took a survey and wanted to determine the effect of the predictors nox (X1 ), rm
(X2 ), and lstat (X3 ) on the response medv (Y ). Establish the following multiple linear regression
(MLR) model to characterize their relationship, based on 50 samples,

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi , i = 1, . . . , 50. (1)

2.1 Fit a multiple linear regression model relating Y to all three mentioned regressors.
[2 marks]

2.2 Construct the analysis-of-variance table based on the model (1) and code and output.
[4 marks]

What is the purpose of the ANOVA here? Explain it in words and write down the null
hypothesis. [(2+2) marks]

2.3 Use the t-test to assess the contribution of rm (X2 ) to the model by the p-value method
(α = 0.05). [4 marks]

2.4 Based on the ANOVA table, use the partial F test to assess the contribution of rm (X2 )
to the model by the critical region approach (α = 0.05). [5 marks]

How is this partial F statistic related to the t-test for β2 calculated in Subquestion 2.3?
[3 marks]
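
For a test on a single coefficient, the partial F statistic and the t statistic carry the same information. The identity below is the standard result, written here for model (1) with n = 50 observations and p = 4 parameters (so n − p = 46 residual degrees of freedom); se(β̂2) denotes the standard error reported in the regression output.

```latex
F_0 = \frac{SS_R(\beta_2 \mid \beta_0, \beta_1, \beta_3)/1}{MS_{Res}}
    = \left( \frac{\hat{\beta}_2}{\operatorname{se}(\hat{\beta}_2)} \right)^{2}
    = t_0^{2},
\qquad
F_{\alpha,\,1,\,46} = t_{\alpha/2,\,46}^{2}
```

Both tests therefore reach the same decision at a given α.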

2.5 Express the null hypothesis of interest H0 : β1 − β2 = 0, β2 + β3 = 0 as the general linear
hypothesis H0 : Tβ = 0. [2 marks]

Test H0 : Tβ = 0 vs H1 : Tβ ≠ 0 at the significance level α = 0.05 using the critical
region approach. [6 marks]
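
As an illustration of the general linear hypothesis form, and assuming the coefficient vector is ordered as β = (β0, β1, β2, β3)′ to match model (1), the two constraints can be stacked row by row into a 2 × 4 matrix T. This is a sketch of the notation only; the test itself relies on the outputs in Appendix 1.

```latex
T\beta =
\begin{pmatrix}
0 & 1 & -1 & 0 \\
0 & 0 & 1 & 1
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
=
\begin{pmatrix} \beta_1 - \beta_2 \\ \beta_2 + \beta_3 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \end{pmatrix}
```

Each row of T encodes one constraint, so the number of linearly independent rows (here r = 2) is the numerator degrees of freedom of the resulting F statistic.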

Question 3: Multicollinearity [Total: 10 marks]
To answer Question 3 (Q3), refer to Appendix 2: Codes and Outputs for Q3.
The dataset for Question 3 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 3.

Notation  Variable  Description
Y         crim      per capita crime rate by town
X1        indus     proportion of non-retail business acres per town
X2        nox       nitrogen oxides concentration (parts per 10 million)
X3        rm        average number of rooms per dwelling
X4        age       proportion of owner-occupied units built prior to 1940
X5        rad       index of accessibility to radial highways
X6        tax       full-value property-tax rate per $10,000
X7        ptratio   pupil-teacher ratio
X8        black     1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
X9        lstat     lower status of the population (percent)
X10       medv      median value of owner-occupied homes in $1000s

The economist collected a dataset to examine the effect of the predictors indus (X1 ), nox (X2 ),
rm (X3 ), age (X4 ), rad (X5 ), tax (X6 ), ptratio (X7 ), black (X8 ), lstat (X9 ) and medv (X10 ) on
the response crim (Y ) and examined if there is multicollinearity among the predictors.

3.1 Determine the condition number of X^T X. [2 marks]

Is there any evidence of multicollinearity? [1 mark]

3.2 In general, what are the reasons that may cause multicollinearity? Give at least two
reasons.
[2 marks]

3.3 Determine the VIFs. [2 marks]

Based on the VIFs, is there any evidence of multicollinearity? [1 mark]

3.4 How does one deal with multicollinearity? Name two specific methods.
[2 marks]

Question 4: Residual Analysis [Total: 10 marks]
To answer Question 4 (Q4), refer to Appendix 3: Codes and Outputs for Q4.
The dataset for Question 4 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 4.

Notation  Variable  Description
Y         medv      median value of owner-occupied homes in $1000s
X1        crim      per capita crime rate by town
X2        rm        average number of rooms per dwelling
X3        tax       full-value property-tax rate per $10,000
X4        lstat     lower status of the population (percent)

Suppose that medv (Y ) is the response variable for the predictors crim (X1 ), rm (X2 ), tax (X3 )
and lstat (X4 ). The model for the data can be expressed as

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + β4 X4i + εi , i = 1, ..., 50. (2)

4.1 If there is a large difference between the ordinary residual and the PRESS residual, what
can one learn about the model? [1 mark]

4.2 According to the normal probability plot of the residuals (Figure: Q4-1), does there seem
to be any problem with the normality assumption? Explain. [(1+2) marks]

4.3 Interpret the plot of the residuals against the fitted values (Figure: Q4-2). [(1+2) marks]

4.4 Figure: Q4-3 plots residuals versus an extra predictor X5 . Does this plot indicate that
the model (2) will be improved by adding the predictor X5 ? Explain.
[(1+2) marks]

Question 5: Variable Selection [Total: 10 marks]
To answer Question 5 (Q5), refer to Appendix 4: Codes and Outputs for Q5.
The dataset for Question 5 contains information collected by the U.S. Census Service concerning
housing in the area of Boston, Mass.
The table below describes the variables in the dataset for Question 5.

Notation  Variable  Description
Y         medv      median value of owner-occupied homes in $1000s
X1        crim      per capita crime rate by town
X2        indus     proportion of non-retail business acres per town
X3        nox       nitrogen oxides concentration (parts per 10 million)
X4        rm        average number of rooms per dwelling
X5        age       proportion of owner-occupied units built prior to 1940
X6        rad       index of accessibility to radial highways
X7        tax       full-value property-tax rate per $10,000
X8        ptratio   pupil-teacher ratio
X9        black     1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town
X10       lstat     lower status of the population (percent)

Suppose that the medv (Y ) is the response variable that relates to predictors crim (X1 ), indus
(X2 ), nox (X3 ), rm (X4 ), age (X5 ), rad (X6 ), tax (X7 ), ptratio (X8 ), black (X9 ) and lstat (X10 ),
by multiple linear regression. Now we employ variable selection methods to obtain reduced
models to fit the data. The sample size is 50.

5.1 Based on the outputs method 1, what is the selection method (forward, backward, or
stepwise) used to select the appropriate model? Explain. [(1+2) marks]

5.2 Based on the outputs method 2, which criterion (AIC or BIC) is the selection method
based on? Explain. [(1+2)marks]

5.3 Give the selected fitted model from subquestion 5.1. [1 mark]

5.4 Based on Figure: Q5-1, determine which of the explanatory variables should be included
in the regression model using the all possible regressions approach and BIC. Explain.
[(1+2) marks]

Question 6: Random Effects Model [Total: 30 marks]
To answer Question 6 (Q6), refer to Appendix 5: Codes and Outputs for Q6.
The table below describes the variables in the dataset for Question 6.

Variable    Description
Instructor  name of the leader
Town        the location where the sodium was collected
Sodium      the weight of the sodium (milligrams)

Suppose that we have a dataset containing the weight of sodium in different towns measured by
different instructors.
The Laird-Ware form based on the dataset is the random-effects one-way ANOVA
model below:

$$y_{ij} = \beta_0 + b_{0i} + \epsilon_{ij}, \quad i = 1, 2, \quad j = 1, \dots, n_i, \quad \sum_{i=1}^{2} n_i = 60, \qquad (3)$$

where observation yij is the weight of the jth of the ni sodium samples in the
ith town, β0 is the fixed-effect coefficient, which represents the general average weight of the
sodium, b0i is the deviation of the weight of the sodium in town i from the general mean,
and ϵij is the deviation of the weight of the jth sodium sample in town i from the town mean.
We assume b0i ∼ i.i.d. N (0, γ0²) and ϵij ∼ i.i.d. N (0, σ²).

6.1 Describe the nested structure and give a brief reason. [(1+1) marks]

6.2 What is the first level and what is the second level, respectively? [(1+1) marks]

6.3 What are the ML and REML estimates of the parameters (β0 , γ0 and σ) in model (3)
based on the given dataset, respectively? [(6+6) marks]

6.4 How many “clusters” ( = level-2 units) are in model (3)? Is the difference between the
ML and REML estimates important? Explain. [(1+2+2) marks]

6.5 Determine the intra-class correlation based on the results of the REML method.
[4 marks]

6.6 How much does the variation between different towns affect the total variation of sodium
based on the results of the REML method? [2 marks]

6.7 What are the 95% confidence intervals of σ, β0 and γ0 in model (3) based on the REML method?
[3 marks]

END

Appendices: Codes and Outputs

Appendix 0: Summary of the Boston Housing Dataset

Appendix 1: Codes and Outputs for Q2

Appendix 2: Codes and Outputs for Q3

Appendix 3: Codes and Outputs for Q4

Figure: Q4-1 (normal probability plot; axes: residuals t_i and probability)

Figure: Q4-2 (residuals against fitted values)

Figure: Q4-3 (Residuals vs Predictor X5)

Appendix 4: Codes and Outputs for Q5

Figure: Q5-1

Appendix 5: Codes and Outputs for Q6

AMA3602 Solutions
Solutions and Marking Scheme

Solution of Question 2
2.1 The fitted model is Ŷ = 70.52308 + 0.88285X1 + 0.18603X2 + 0.10950X3 .
2.2

Source of Variation  Sum of Squares  Degrees of Freedom  Mean Squares  F0
Regression           127184          3                   42395         873.63
Residuals            20187           416                 49
Total                147371          419
2.3
H0 : β1 = β2 = β3 = 0 vs H1 : βj ≠ 0 for at least one j, j = 1, 2, 3; α = 0.05
An F-statistic is used to test the significance of regression. Let X be the random sample
and x be the observed value of X.
The critical region is C = {x : F0(x) > F0.05,3,416}, where F0.05,3,416 is the critical value,
that is, the 95% quantile of an F distribution with 3 and 416 degrees of freedom.
Under the null hypothesis H0 ,

$$F_0(x) = \frac{MS_R}{MS_{Res}} = \frac{42395}{49} = 865.2$$

$$F_{\alpha,k,n-k-1} = F_{0.05,3,416} = 2.626 < 865.2 = F_0(x)$$

Therefore, we reject the null hypothesis in favor of H1 , i.e., the average math score y is
related to at least one of the average reading score x1 , the district average income x2 and the
percentage of English learners x3 .
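
The arithmetic of this overall F-test can be checked with a short script. The sketch below is a minimal illustration in Python (the paper's own outputs come from R); `overall_f` is a hypothetical helper name, and the critical value 2.626 is quoted from the solution rather than computed.

```python
def overall_f(ss_reg, df_reg, ss_res, df_res):
    """Return (MSR, MSRes, F0) for the significance-of-regression test."""
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return ms_reg, ms_res, ms_reg / ms_res


# Sums of squares and degrees of freedom from the ANOVA table in 2.2.
ms_reg, ms_res, f0 = overall_f(127184, 3, 20187, 416)

F_CRIT = 2.626  # F_{0.05,3,416}, quoted from the solution (not computed here)
reject_h0 = f0 > F_CRIT

print(f"MSR = {ms_reg:.1f}, MSRes = {ms_res:.2f}, F0 = {f0:.2f}, reject H0: {reject_h0}")
```

With the unrounded MSRes the ratio comes out near 873.6, in line with the F0 column of the ANOVA table; the value 865.2 arises when MSRes is rounded to 49 before dividing. Either way the statistic is far above the critical value.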
2.4
The 95% confidence interval on the mean response is (652.7, 654.03).
2.5
(1) H0 : β1 = β2 = 0 vs H1 : βj ≠ 0 for at least one j, j = 1, 2
To test this hypothesis, we calculate the extra sum of squares due to β 1 and β 2 ,

SSR(β1, β2 | β0, β3) = SSR(β0, β1, β2, β3) − SSR(β0, β3)

SSR − SSR(β0, β3) = 127184 − 47660 = 79523.99


Let X be the random sample and x be the observed value of X.
The critical region is C = {x : F0(x) > F0.05,3,416}, where F0.05,3,416 is the critical value,
i.e. the 95% quantile of an F distribution with 3 and 416 degrees of freedom.
Under the null hypothesis H0 , given the observations, we can obtain the realization of
our F-test statistic, which follows an F distribution with 3 and 416 degrees of freedom.

$$F_0(x) = \frac{SS_R(\beta_1, \beta_2 \mid \beta_0, \beta_3, \beta_4)/r}{MS_{Res}} = \frac{79523.99/3}{49} = 540.9796, \qquad F_0 \overset{H_0}{\sim} F_{3,416}$$

$$F_{\alpha,r,n-p} = F_{0.05,3,416} = 2.626 < 540.9796 = F_0(x)$$



Therefore, we reject the null hypothesis in favor of H1 , i.e., the regressors x1 and
x2 contribute significantly to the model given that x3 and x4 are already in the model.
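
The extra-sum-of-squares calculation above can be reproduced numerically. The following is a minimal sketch, assuming the rounded inputs printed in the solution (SSR of the full model 127184, SSR of the reduced model 47660, r = 3 and MSRes = 49 exactly as used there); `partial_f` is a hypothetical helper name.

```python
def partial_f(ss_reg_full, ss_reg_reduced, r, ms_res):
    """Return (extra sum of squares, partial F statistic).

    The statistic is (SSR_full - SSR_reduced) / r divided by MSRes.
    """
    extra_ss = ss_reg_full - ss_reg_reduced
    return extra_ss, (extra_ss / r) / ms_res


# Inputs copied from the solution; r and MSRes are used as stated there.
extra_ss, f0 = partial_f(127184, 47660, 3, 49)

F_CRIT = 2.626  # F_{0.05,3,416}, quoted from the solution
print(f"extra SS = {extra_ss}, F0 = {f0:.4f}, reject H0: {f0 > F_CRIT}")
```

The rounded inputs give an extra sum of squares of 79524 rather than 79523.99, a difference that does not affect the reported F0 to four decimals.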
(2) The full model:
y = Xβ + ε
The reduced model:
Solution of Question 3
3.1
(1) The condition number of X′X is κ = 4084.215.
(2) There is severe multicollinearity since the condition number

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} > 1000$$
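
The decision rule can be sketched in a few lines of Python. This is an illustration only: the eigenvalues in the example call are hypothetical, the function names are made up for the sketch, and the lower cutoff of 100 is a commonly quoted rule of thumb that is assumed here rather than stated in the solution (which uses only the threshold 1000).

```python
def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue of X'X."""
    lam_max, lam_min = max(eigenvalues), min(eigenvalues)
    if lam_min <= 0:
        raise ValueError("X'X must have strictly positive eigenvalues")
    return lam_max / lam_min


def classify(kappa):
    """Rule-of-thumb reading of a condition number (cutoffs are assumptions)."""
    if kappa < 100:
        return "no serious multicollinearity"
    if kappa <= 1000:
        return "moderate to strong multicollinearity"
    return "severe multicollinearity"


# Hypothetical eigenvalues, for illustration only.
print(condition_number([4.0, 2.0, 0.5]))

# Condition number reported in the solution.
print(classify(4084.215))
```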

3.2
x1 and x2 : 0.92 x1 and x3 : -0.88

3.3
(1) The VIFs are 11.484663, 7.327303, 2.264535, 4.574577, 1.140066.
(2) There are problems with multicollinearity because one VIF exceeds 10.
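
A short check of the cutoff, using the five values listed above. The sketch assumes the values are listed in predictor order and uses the standard relation VIF_j = 1/(1 − R_j²), where R_j² comes from regressing the jth predictor on the others; `implied_r_squared` is a hypothetical helper name.

```python
# VIFs as listed in the solution (order assumed to follow the predictors).
vifs = [11.484663, 7.327303, 2.264535, 4.574577, 1.140066]


def implied_r_squared(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary R^2."""
    return 1.0 - 1.0 / vif


# Flag predictors whose VIF exceeds the cutoff of 10 used in the solution.
flagged = [(j, v) for j, v in enumerate(vifs, start=1) if v > 10]

for j, v in flagged:
    print(f"predictor {j}: VIF = {v}, implied R^2 = {implied_r_squared(v):.4f}")
```

A VIF just above 10 corresponds to an auxiliary R² of about 0.9, meaning roughly 90% of that predictor's variation is explained by the other predictors.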

3.4
Effects:
(1) Strong multicollinearity between x1 and x2 results in large variances and covariances
for the LSE of the regression coefficients;
as |r12| → 1, Var(β̂j) = cjj σ² → ∞ and Cov(β̂1, β̂2) = c12 σ² → ±∞;
(2) Multicollinearity also tends to produce LSE β̂j that are too large in absolute value;
(3) When strong multicollinearity is present, LS will generally produce poor estimates of
the individual model parameters.
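
The variance inflation in effect (1) can be made concrete for the two-regressor case. The sketch below assumes both regressors are in unit-length (correlation) scaling, so that X′X is the 2 × 2 correlation matrix and its inverse has diagonal entries 1/(1 − r12²) and off-diagonal entries −r12/(1 − r12²); the function name is hypothetical.

```python
def inverse_correlation_entries(r12):
    """Return (c11, c12) of the inverse of [[1, r12], [r12, 1]]."""
    if abs(r12) >= 1:
        raise ValueError("the correlation matrix is singular when |r12| = 1")
    denom = 1.0 - r12 ** 2
    return 1.0 / denom, -r12 / denom


# Var(beta_hat_j) = c_jj * sigma^2 grows without bound as |r12| approaches 1.
for r12 in (0.0, 0.5, 0.92, 0.99):
    c11, c12 = inverse_correlation_entries(r12)
    print(f"r12 = {r12:4.2f}: c11 = {c11:8.3f}, c12 = {c12:8.3f}")
```

With r12 = 0.92, the correlation reported for x1 and x2 in 3.2, the variance multiplier c11 is already about 6.5 times its value for uncorrelated regressors.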
Solution of Question 4
4.1
Backward.
direction = "backward" in the function step().
4.2
BIC.
k = log( N ) in the function step().
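
The role of the penalty multiplier k can be illustrated outside R. The sketch below assumes the criterion for a Gaussian linear model has the form n·log(RSS/n) + k·(number of parameters), which is how the quantity minimised by step() is understood here; the residual sums of squares are hypothetical, and only the sample size n = 50 comes from the paper.

```python
import math


def step_criterion(rss, n, n_params, k=2.0):
    """Penalised criterion n*log(RSS/n) + k*n_params (k=2: AIC, k=log(n): BIC)."""
    return n * math.log(rss / n) + k * n_params


n = 50  # sample size stated in Question 5

# Hypothetical (RSS, number of parameters) pairs, for illustration only.
candidates = {"larger model": (390.0, 6), "smaller model": (430.0, 4)}

chosen = {}
for label, penalty in (("AIC", 2.0), ("BIC", math.log(n))):
    scores = {
        name: step_criterion(rss, n, p, k=penalty)
        for name, (rss, p) in candidates.items()
    }
    chosen[label] = min(scores, key=scores.get)
    print(label, {m: round(s, 2) for m, s in scores.items()}, "->", chosen[label])
```

Because log(50) is about 3.9, each extra parameter costs almost twice as much under BIC as under AIC, so BIC tends to select the smaller model when the improvement in fit is modest.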

4.3

ŷ = −127 + 0.2201x1 + 0.1301x3 − 0.09765x4 + 0.001436x5

4.4
Choose x1 , x3 , x4 , x5 .
Reason: x1 , x3 , x4 , x5 are in the top row (smallest BIC value) of the BIC plot.
4.5
Choose x1 , x2 , x3 , x4 , x5 .
Reason: this model has the smallest Cp value.
Solution of Question 6
5a.1
Empty model, since it contains no predictors

5a.2
2 levels.
The first level:
Yij = α0i + ϵij
The second level:
α0i = γ00 + u0i

5a.3
Var (Yij ) = d2 + σ2
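
The variance decomposition follows from substituting the second level into the first. A short derivation, assuming u0i and ϵij are independent with Var(u0i) = d² and Var(ϵij) = σ² (d being the symbol used for the random-intercept standard deviation in 5a.6):

```latex
\begin{aligned}
Y_{ij} &= \gamma_{00} + u_{0i} + \epsilon_{ij}, \\
\operatorname{Var}(Y_{ij}) &= \operatorname{Var}(u_{0i}) + \operatorname{Var}(\epsilon_{ij}) = d^{2} + \sigma^{2}, \\
\operatorname{Cov}(Y_{ij}, Y_{ij'}) &= \operatorname{Var}(u_{0i}) = d^{2}, \qquad j \neq j'.
\end{aligned}
```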
5a.4
b0i represents the deviation of the (Aβ1 − 42) value in education attainment i from the
general mean; ϵij represents the deviation of patient j’s (Aβ1 − 42) value in education
attainment i from the education attainment mean.
5a.5
REML: 3351/(3351+244469) = 0.01352191 = 1.35% (fit3:ML)
Explain the calculated ICC ρ:
1. d²/Var(Yij): the proportion of variation in individuals’ (Aβ1 − 42) value due to
differences between education attainments;

2. cor(Yij , Yij′): the correlation between the (Aβ1 − 42) values of two individuals from
the same education attainment.
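
The ICC arithmetic in 5a.5 can be verified directly. A minimal sketch, assuming 3351 is the estimated between-group variance d² and 244469 the estimated residual variance σ², as the ratio in the solution implies; the function name is hypothetical.

```python
def intraclass_correlation(var_between, var_within):
    """ICC = between-group variance / (between-group + within-group variance)."""
    total = var_between + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_between / total


# Variance components as they appear in the ratio of 5a.5.
icc = intraclass_correlation(3351, 244469)
print(f"ICC = {icc:.8f} ({100 * icc:.2f}%)")
```

An ICC this small indicates that almost all of the variation lies within groups rather than between them.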

5a.6
β 0 : [466.6411, 929.0779], d: [0.0000, 394.0216], σ: [429.1349, 578.1808]
5b.1

α0i = γ00 + u0i (the random intercept)
α1i = γ10 (the constant slope)                (1)

5b.2

Yij = β0 + β1 x1ij + b0i + ϵij , i = 1, 2, j = 1, 2, ..., 27 (2)

where b0i ∼ i.i.d. N (0, d0²) and ϵij ∼ i.i.d. N (0, σ²). The fixed-effect coefficients β0 and β1
represent the intercept and slope, respectively.
5b.3
12733/247139 = 0.05152161 = 5.15% (fit2:REML)
5b.4
fit2: β̂0 = 693.291, β̂1 = −2.341, d0 = 112.8, σ = 497.1
