ST201 2024
ST201
Statistical Models and Data Analysis
Instructions to candidates
This paper contains FOUR questions. Answer THREE questions. Question 1 is compulsory, and
two of the other three questions need to be answered. Question 1 is worth 40 marks, and each
of Questions 2-4 is worth 30 marks. The total mark is 100. The numbers in brackets beside
each question indicate the marks available for that part of the question.
Candidates must not take the question paper away with them after the examination. Please
place the question paper within your answer booklet at the end of the examination.
The goal is to understand how the price of a used car depends on its characteristics.
(a) Based on the above descriptions of the variables, explain their variable types. [4 marks]
(b) The variable Age is believed to be a key predictor of the sale price. We explore the
relationship between Age and Price.
i. Output 1(a) gives two scatter plots. Panel (a) plots Age versus Price, and Panel
(b) plots Age versus log(Price). Based on these two plots, comment on the
relationship between Age and Price. [3 marks]
ii. If you are asked to fit a simple linear regression model to understand the
relationship between Price and Age, would you choose Price or log(Price) as the
response variable? Give the reasons for your choice. [4 marks]
iii. Output 1(b) gives the R output for a simple linear regression model, which
regresses log(Price) onto Age. Interpret the estimated regression coefficient of
Age. [3 marks]
(c) The sale price of a car is believed to also depend on whether the seller is an individual
or a dealer.
i. Output 1(c) gives a boxplot for log(Price) versus Seller. Comment on the
relationship between Price and Seller. [2 marks]
ii. Output 1(d) gives the R output for a simple linear regression model, which
regresses log(Price) onto Seller. Interpret the estimated regression coefficient. [2 marks]
(d) We then run multiple linear regression models, regressing log(Price) onto car
characteristics.
i. Output 1(e) gives the R output for a multiple linear regression model, which
includes all the variables. We notice that the regression coefficient for Year is not
obtained, as indicated by “NA”. Explain why this happens. Give the name of this
issue and suggest a way to handle the issue. [4 marks]
ii. Output 1(f) gives the R output for a multiple linear regression model, which
includes all the available variables except for Year. Interpret the R² value of the
obtained model. [3 marks]
iii. Give the formula for calculating the adjusted R² value and explain why this value
Output 1(a)
Output 1(b)
Call:
lm(formula = log(Price) ~ Age, data = data.train)
Residuals:
     Min       1Q   Median       3Q      Max
-2.17813 -0.40370 -0.01431  0.28945  3.14327
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.283008   0.021939  240.80   <2e-16 ***
Age         -0.138769   0.002449  -56.67   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
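As a worked illustration of reading a slope from a log-response model, the sketch below converts the Age estimate in Output 1(b) into an approximate multiplicative and percentage change in Price per one-unit increase in Age. It assumes the fitted model is adequate and that `log` is the natural logarithm (the default in R); the names `b_age`, `multiplier` and `pct_change` are introduced here for illustration only.

```r
# Estimated slope for Age, copied from Output 1(b)
b_age <- -0.138769

# A one-unit increase in Age adds b_age to log(Price),
# so Price itself is multiplied by exp(b_age).
multiplier <- exp(b_age)
pct_change <- (multiplier - 1) * 100

round(c(multiplier = multiplier, pct_change = pct_change), 3)
```

The same back-transformation applies to any coefficient in a model whose response is log(Price), including the coefficient of a dummy variable.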
Output 1(d)
Call:
lm(formula = log(Price) ~ Seller, data = data.train)
Residuals:
    Min      1Q  Median      3Q     Max
-3.1828 -0.5247 -0.0139  0.5167  2.8199
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       4.60070    0.02725  168.85   <2e-16 ***
SellerIndividual -0.55617    0.03149  -17.66   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Output 1(e)
Call:
lm(formula = log(Price) ~ ., data = data.train)
Coefficients:
         (Intercept)                 Age               Usage          FuelDiesel
           9.903e+00          -1.141e-01          -3.052e-07           6.123e-01
             FuelLPG          FuelPetrol    SellerIndividual  TransmissionManual
          -4.340e-02           9.409e-02          -2.008e-01          -7.662e-01
      Owner4_or_more              Owner2              Owner3                Year
          -1.106e-01          -5.020e-02          -9.845e-02                  NA
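One plausible reason for the NA in Output 1(e), assuming Age is computed as a fixed reference year minus Year, is that the two columns are perfectly collinear. The sketch below reproduces that behaviour on simulated data. The reference year 2021, the variable names and the data themselves are hypothetical and are not taken from the exam data set.

```r
set.seed(1)
n <- 50

# Simulated data: Age is an exact linear function of Year (assumption)
Year <- sample(2000:2020, n, replace = TRUE)
Age  <- 2021 - Year                      # hypothetical reference year
logPrice <- 5 - 0.1 * Age + rnorm(n, sd = 0.3)

# With an intercept, Age and Year carry the same information,
# so lm() cannot estimate both and reports NA for the later term.
fit <- lm(logPrice ~ Age + Year)
coef(fit)

# alias() lists the exact linear dependency that lm() detected
alias(fit)
```

Dropping one of the two redundant columns from the formula gives a model in which every coefficient can be estimated.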
Output 1(f)
Call:
lm(formula = log(Price) ~ . - Year, data = data.train)
Residuals:
    Min      1Q  Median      3Q     Max
-1.7556 -0.2997 -0.0005  0.2941  2.4130
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         9.903e+00  8.545e-02 115.894  < 2e-16 ***
Age                -1.141e-01  2.381e-03 -47.917  < 2e-16 ***
Usage              -3.052e-07  2.095e-07  -1.457  0.14532
FuelDiesel          6.123e-01  8.113e-02   7.547 5.67e-14 ***
FuelLPG            -4.340e-02  1.375e-01  -0.316  0.75236
FuelPetrol          9.409e-02  8.118e-02   1.159  0.24651
SellerIndividual   -2.008e-01  1.990e-02 -10.091  < 2e-16 ***
TransmissionManual -7.662e-01  2.730e-02 -28.070  < 2e-16 ***
Owner4_or_more     -1.106e-01  6.144e-02  -1.800  0.07201 .
Owner2             -5.020e-02  2.071e-02  -2.424  0.01539 *
Owner3             -9.845e-02  3.548e-02  -2.775  0.00555 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
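The adjusted R² for a linear model such as the one in Output 1(f) is commonly computed from R², the sample size n and the number of predictors p (excluding the intercept). The helper below is a minimal sketch of that standard formula. The function name `adjusted_r2` and the values of `r2` and `n` in the example call are made up, because neither quantity appears in the output above; only p = 10 is taken from the ten non-intercept terms listed there.

```r
# Adjusted R-squared from R-squared, sample size n and number of
# predictors p (p excludes the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Hypothetical inputs for illustration only: r2 and n are invented,
# p = 10 matches the ten non-intercept terms in Output 1(f).
adjusted_r2(r2 = 0.70, n = 1000, p = 10)
```

Because the factor (n - 1) / (n - p - 1) exceeds 1 whenever p > 0, the adjusted value is always below the unadjusted R² when R² < 1.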
Output 1(g)
Output 2(a)
Output 2(b)
(a) Suppose that a researcher tries to understand whether continuously following a healthy
diet routine (measured by variable Diet) can reduce the risk of having diabetes.
i. Explain the concept of a confounding variable in the current context. [3 marks]
ii. Give an example of a potential confounder variable. Explain your answer. [2 marks]
(b) Output 3(a) gives the result of logistic regression, regressing Diabetes on the rest of
the variables.
i. Construct a 95% confidence interval for the coefficient for Diet and interpret this
confidence interval. You may need to use the supplied Murdoch & Barnes
Statistical Tables. [4 marks]
ii. Give the formula for calculating the deviance residuals and explain what the
deviance residuals can be used for. [4 marks]
iii. Given an observation whose health diagnostic measurements are given in Output
3(b), predict whether they have diabetes. Explain your answer. [3 marks]
(c) Suppose you include some polynomial terms of age and fit a new model based on
the training set. The result is shown in Output 3(c). Compare this model with the
one in Output 3(a) using the likelihood ratio test. Write down the test statistic, the
reference distribution, and the p-value. You may need to use the supplied Murdoch &
Barnes Statistical Tables. [3 marks]
(d) We now evaluate the two logistic regression models in Output 3(a) and Output 3(c)
in terms of their performance on a validation set. Output 3(d) gives two ROC curves,
with the solid one from the model in Output 3(a) and the dashed one from the model
in Output 3(c).
i. Explain how each point on an ROC curve is obtained. [4 marks]
ii. Give the name of a statistic that can be used for comparing the two models
based on their ROC curves. [3 marks]
(e) Linear discriminant analysis can also be used for predicting Diabetes. Describe the
assumptions of linear discriminant analysis. [4 marks]
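For the deviance residuals mentioned in part (b) ii, the sketch below writes out the usual definition for a binary response and checks it against `residuals(fit, type = "deviance")`. The data are simulated purely for the check and have nothing to do with the diabetes data; `deviance_residuals` is a name chosen here, not a built-in function.

```r
# Deviance residuals for a 0/1 response, given fitted probabilities p_hat.
deviance_residuals <- function(y, p_hat) {
  # Contribution of each observation to -2 * log-likelihood
  d2 <- -2 * (y * log(p_hat) + (1 - y) * log(1 - p_hat))
  sign(y - p_hat) * sqrt(d2)
}

# Simulated data (not the exam data), used only to compare with glm()
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = binomial)

manual <- deviance_residuals(y, fitted(fit))
all.equal(unname(manual), unname(residuals(fit, type = "deviance")))

# The squared deviance residuals add up to the residual deviance
c(sum(manual^2), deviance(fit))
```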
Output 3(a)
Call:
glm(formula = Diabetes ~ ., family = binomial, data = train)
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -7.8797516  0.7521290 -10.477  < 2e-16 ***
Glucose        0.0327814  0.0039500   8.299  < 2e-16 ***
BloodPressure -0.0078977  0.0060381  -1.308    0.191
SkinThickness  0.0074762  0.0075869   0.985    0.324
Insulin       -0.0008366  0.0009917  -0.844    0.399
BMI            0.0687947  0.0164366   4.185 2.85e-05 ***
Age            0.0388568  0.0091365   4.253 2.11e-05 ***
Diet           0.0708684  0.2294000   0.309    0.757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
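A Wald-type 95% interval for the Diet coefficient can be formed from the estimate and standard error in Output 3(a). The sketch below assumes the usual large-sample normal approximation and uses `qnorm(0.975)` where the exam would use the tabulated critical value. The variable names are illustrative.

```r
# Estimate and standard error for Diet, copied from Output 3(a)
est <- 0.0708684
se  <- 0.2294000

# Normal critical value for a two-sided 95% interval (about 1.96)
z <- qnorm(0.975)

# Interval on the log-odds scale, then on the odds-ratio scale
ci_log_odds   <- est + c(-1, 1) * z * se
ci_odds_ratio <- exp(ci_log_odds)

round(rbind(log_odds = ci_log_odds, odds_ratio = ci_odds_ratio), 3)
```

An interval on the log-odds scale that contains 0 corresponds to an odds-ratio interval that contains 1.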
Output 3(b)
Output 3(c)
Call:
glm(formula = Diabetes ~ . + I(Age^2) + I(Age^3), family = binomial,
data = train)
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.089e+01  3.776e+00  -2.885 0.003915 **
Glucose        3.442e-02  4.099e-03   8.398  < 2e-16 ***
BloodPressure -7.921e-03  6.272e-03  -1.263 0.206629
SkinThickness  6.539e-03  7.678e-03   0.852 0.394411
Insulin       -7.388e-04  1.019e-03  -0.725 0.468357
BMI            5.824e-02  1.694e-02   3.438 0.000586 ***
Age            1.583e-01  2.984e-01   0.531 0.595745
Diet           8.956e-02  2.356e-01   0.380 0.703893
I(Age^2)       7.940e-04  7.479e-03   0.106 0.915450
I(Age^3)      -3.579e-05  5.924e-05  -0.604 0.545737
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
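The mechanics of a likelihood ratio test between two nested logistic regressions, such as the models in Output 3(a) and Output 3(c), can be sketched as follows. The residual deviances of the exam models are not shown in the outputs above, so this example uses simulated data with made-up variable names and coefficients. Only the structure (a base model plus squared and cubed age terms) mirrors the exam.

```r
set.seed(42)
n <- 500

# Simulated predictors and response (hypothetical, not the exam data)
age     <- runif(n, 21, 70)
glucose <- rnorm(n, mean = 120, sd = 30)
y <- rbinom(n, 1, plogis(-6 + 0.03 * glucose + 0.04 * age))

# Smaller model and larger model with polynomial terms of age
fit0 <- glm(y ~ glucose + age, family = binomial)
fit1 <- glm(y ~ glucose + age + I(age^2) + I(age^3), family = binomial)

# Test statistic: drop in residual deviance; df: number of added terms
lr_stat <- deviance(fit0) - deviance(fit1)
df      <- df.residual(fit0) - df.residual(fit1)

# Upper-tail probability under the chi-squared reference distribution
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)
c(lr_stat = lr_stat, df = df, p_value = p_value)
```

The call `anova(fit0, fit1, test = "Chisq")` reports the same statistic and p-value in tabular form.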
Output 3(d)
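Each point on an ROC curve corresponds to one classification threshold applied to the predicted probabilities. The base-R sketch below computes the false positive rate and true positive rate for every candidate threshold. The function name `roc_points` and the toy validation values are invented for illustration and are not the data behind Output 3(d).

```r
# One (FPR, TPR) pair per classification threshold.
# prob: predicted probabilities of the positive class; truth: 0/1 labels.
roc_points <- function(prob, truth) {
  thresholds <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
  pts <- sapply(thresholds, function(th) {
    pred <- as.integer(prob >= th)   # positive if at or above the threshold
    c(threshold = th,
      fpr = sum(pred == 1 & truth == 0) / sum(truth == 0),
      tpr = sum(pred == 1 & truth == 1) / sum(truth == 1))
  })
  t(pts)   # one row per threshold
}

# Toy validation set (made-up values)
prob  <- c(0.9, 0.8, 0.3, 0.1)
truth <- c(1, 0, 1, 0)
pts <- roc_points(prob, truth)
pts
```

Plotting the `fpr` column against the `tpr` column and joining the points traces the curve from (0, 0) to (1, 1).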
(a) A regression model is trained based on a training set. Let the obtained regression
function be f̂, which maps a p-dimensional vector to a real value. Suppose you are
given a test dataset, denoted by (x̃_i, ỹ_i), i = 1, ..., m. Write down the formula for
calculating the test mean squared error based on the test dataset. [3 marks]
(b) Explain the meanings of the following terms in a prediction problem:
• bias,
• variance,
• irreducible error.
Explain how the test error can be decomposed based on these quantities. [5 marks]
(c) A regression tree is trained on the car sales data in Questions 1 and 2, and the result
is shown in Output 4(a).
i. You are given a test observation whose predictor vector is shown in Output 4(b).
Predict the value of Y for this test observation. [3 marks]
ii. Explain why tree pruning is typically needed. [4 marks]
iii. Output 4(c) gives a plot from performing a ten-fold cross-validation for tree
pruning. Explain the meaning of tree size on the X-axis of the plot. [3 marks]
iv. Based on Output 4(c), what tree size would you choose? Explain your answer. [3 marks]
(d) Bagging and random forest are further applied to the data. Output 4(d) gives
out-of-bag error for the bagging and random forest results.
i. Explain the meaning of m in the legend in Output 4(d). [3 marks]
ii. Explain the concept of out-of-bag error. [4 marks]
iii. Based on this output, which method would you choose for making predictions?
Explain your answer. [2 marks]
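The test mean squared error described in part (a) averages the squared differences between the test responses and the predictions from the trained regression function. The sketch below illustrates the calculation with a simple linear model standing in for f̂. The training and test sets are simulated and all names are hypothetical.

```r
set.seed(1)

# Simulated training and test sets (hypothetical, not the car sales data)
train <- data.frame(x = runif(100))
train$y <- 1 + 2 * train$x + rnorm(100, sd = 0.5)
test <- data.frame(x = runif(50))
test$y <- 1 + 2 * test$x + rnorm(50, sd = 0.5)

# Train on the training set only; lm() plays the role of the fitted function
fit <- lm(y ~ x, data = train)

# Predict on the test set and average the squared prediction errors
pred <- predict(fit, newdata = test)
test_mse <- mean((test$y - pred)^2)
test_mse
```

The test observations play no part in fitting, which is what distinguishes this quantity from the training mean squared error.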
Output 4(b)
Output 4(c)