ST201 2024

The Spring 2024 ST201 exam consists of four questions, of which candidates answer three, including one compulsory question. The exam covers statistical models and data analysis, using data on used car prices and on diabetes, with an emphasis on regression analysis and the interpretation of results. Candidates are supplied with statistical tables and graph paper, and may use calculators during the 2-hour writing period.


Spring 2024 Exam

ST201
Statistical Models and Data Analysis

Suitable for all candidates

Instructions to candidates

This paper contains FOUR questions. Answer THREE questions. Question 1 is compulsory, and
two of the other three questions need to be answered. Question 1 is worth 40 marks, and each
of Questions 2-4 is worth 30 marks. The total mark is 100. The numbers in brackets beside
each question indicate the marks available for that part of the question.

Candidates must not take the question paper away with them after the examination. Please
place the question paper within your answer booklet at the end of the examination.

Time allowed:          Reading Time: None
                       Writing Time: 2 hours
You are supplied with: Murdoch & Barnes Statistical Tables, 4th edition
                       Graph paper
You may also use:      No additional materials
Calculators:           Calculators are allowed in this exam

©LSE ST 2024/ST201 Page 1 of 12


1. Consider data on used cars for a car market in 2021. The data has 3,457 observations,
each corresponding to a used car on sale. The data contains the following eight variables.

• Year: Year when the car was first bought.
• Age: Number of years that the car has been used. It is calculated as Age = 2021 - Year,
where Year is the year when the car was first bought.
• Price: Price (in US dollars) at which the car is being sold.
• Usage: The mileage (km).
• Fuel: Fuel type of car (petrol/diesel/CNG/LPG).
• Seller: Whether the seller is an individual or a dealer (Individual/Dealer).
• Transmission: Gear transmission of the car (Automatic/Manual).
• Owner: Number of previous owners of the car (1/2/3/4 or more).

The goal is to understand how the price of a used car depends on its characteristics.

(a) Based on the above descriptions of the variables, explain their variable types. [4 marks]
(b) The variable Age is believed to be a key predictor of the sale price. We explore the
relationship between Age and Price.
i. Output 1(a) gives two scatter plots. Panel (a) plots Age versus Price, and Panel (b)
plots Age versus log(Price). Based on these two plots, comment on the relationship
between Age and Price. [3 marks]
ii. If you are asked to fit a simple linear regression model to understand the relationship
between Price and Age, would you choose Price or log(Price) as the response variable?
Give the reasons for your choice. [4 marks]
iii. Output 1(b) gives the R output for a simple linear regression model, which regresses
log(Price) onto Age. Interpret the estimated regression coefficient of Age. [3 marks]
(c) The sale price of a car is believed to also depend on whether the seller is an individual
or a dealer.
i. Output 1(c) gives a boxplot for log(Price) versus Seller. Comment on the relationship
between Price and Seller. [2 marks]
ii. Output 1(d) gives the R output for a simple linear regression model, which regresses
log(Price) onto Seller. Interpret the estimated regression coefficient. [2 marks]
(d) We then run multiple linear regression models, regressing log(Price) onto car
characteristics.
i. Output 1(e) gives the R output for a multiple linear regression model, which includes
all the variables. We notice that the regression coefficient for Year is not obtained,
as indicated by “NA”. Explain why this happens. Give the name of this issue and suggest
a way to handle the issue. [4 marks]
ii. Output 1(f) gives the R output for a multiple linear regression model, which includes
all the available variables except for Year. Interpret the R² value of the obtained
model. [3 marks]
iii. Give the formula for calculating the adjusted R² value and explain why this value is
needed in addition to the R² value. [3 marks]

iv. Explain how dummy variables are created for the variable Owner in the model in
Output 1(f) and interpret the corresponding estimated coefficients. [4 marks]
v. Predict the sale price of a car in this market in 2021. The information about this
car is given in Output 1(g). [4 marks]
vi. Add an interaction term between Age and Seller into the regression model. Write
down the model equation and interpret the regression coefficient for the interaction
term. [4 marks]

Output 1(a)

Output 1(b)

> res1 = lm(log(Price)~Age, data = data.train)
> summary(res1)

Call:
lm(formula = log(Price) ~ Age, data = data.train)

Residuals:
     Min       1Q   Median       3Q      Max
-2.17813 -0.40370 -0.01431  0.28945  3.14327

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.283008   0.021939  240.80   <2e-16 ***
Age         -0.138769   0.002449  -56.67   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6035 on 3455 degrees of freedom
Multiple R-squared: 0.4817, Adjusted R-squared: 0.4816
F-statistic: 3211 on 1 and 3455 DF, p-value: < 2.2e-16
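
As a hedged aside on reading a slope when the response is on the log scale: under the fitted line in Output 1(b), a one-year increase in Age multiplies the fitted Price by a constant factor. The arithmetic below uses only the estimates printed above. The symbol $\hat\beta_{\text{Age}}$ is notation introduced here for illustration, and log is assumed to be the natural logarithm, as in R.

```latex
\[
\widehat{\log(\text{Price})} = 5.283008 - 0.138769 \times \text{Age}
\quad\Longrightarrow\quad
\frac{\widehat{\text{Price}}(\text{Age}+1)}{\widehat{\text{Price}}(\text{Age})}
= e^{\hat\beta_{\text{Age}}} = e^{-0.138769} \approx 0.870 .
\]
```

On this reading, each additional year of age is associated with a fitted price that is roughly 13% lower.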

Output 1(c)

Output 1(d)

> res2 = lm(log(Price)~Seller, data = data.train)
> summary(res2)

Call:
lm(formula = log(Price) ~ Seller, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max
-3.1828 -0.5247 -0.0139  0.5167  2.8199

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       4.60070    0.02725  168.85   <2e-16 ***
SellerIndividual -0.55617    0.03149  -17.66   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8028 on 3455 degrees of freedom
Multiple R-squared: 0.08283, Adjusted R-squared: 0.08257
F-statistic: 312 on 1 and 3455 DF, p-value: < 2.2e-16
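
A brief sketch of how a dummy-variable coefficient on the log scale can be read, using only the numbers in Output 1(d). It assumes that Dealer is the baseline level, which is inferred from the fact that only SellerIndividual appears in the output.

```latex
\[
\widehat{\log(\text{Price})} =
\begin{cases}
4.60070 & \text{Seller = Dealer (assumed baseline)},\\
4.60070 - 0.55617 = 4.04453 & \text{Seller = Individual},
\end{cases}
\qquad
e^{-0.55617} \approx 0.573 .
\]
```

Back on the original scale, the fitted price for an individual seller is therefore about 0.57 times that for a dealer, a ratio that is most naturally read as comparing typical (geometric-mean) prices rather than arithmetic means.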

Output 1(e)

> res3 = lm(log(Price)~., data = data.train)
> res3

Call:
lm(formula = log(Price) ~ ., data = data.train)

Coefficients:
         (Intercept)                 Age               Usage          FuelDiesel
           9.903e+00          -1.141e-01          -3.052e-07           6.123e-01
             FuelLPG          FuelPetrol    SellerIndividual  TransmissionManual
          -4.340e-02           9.409e-02          -2.008e-01          -7.662e-01
      Owner4_or_more              Owner2              Owner3                Year
          -1.106e-01          -5.020e-02          -9.845e-02                  NA

Output 1(f)

> res4 = lm(log(Price)~ .-Year, data = data.train)
> summary(res4)

Call:
lm(formula = log(Price) ~ . - Year, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max
-1.7556 -0.2997 -0.0005  0.2941  2.4130

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         9.903e+00  8.545e-02 115.894  < 2e-16 ***
Age                -1.141e-01  2.381e-03 -47.917  < 2e-16 ***
Usage              -3.052e-07  2.095e-07  -1.457  0.14532
FuelDiesel          6.123e-01  8.113e-02   7.547 5.67e-14 ***
FuelLPG            -4.340e-02  1.375e-01  -0.316  0.75236
FuelPetrol          9.409e-02  8.118e-02   1.159  0.24651
SellerIndividual   -2.008e-01  1.990e-02 -10.091  < 2e-16 ***
TransmissionManual -7.662e-01  2.730e-02 -28.070  < 2e-16 ***
Owner4_or_more     -1.106e-01  6.144e-02  -1.800  0.07201 .
Owner2             -5.020e-02  2.071e-02  -2.424  0.01539 *
Owner3             -9.845e-02  3.548e-02  -2.775  0.00555 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4736 on 3446 degrees of freedom
Multiple R-squared: 0.6816, Adjusted R-squared: 0.6807
F-statistic: 737.8 on 10 and 3446 DF, p-value: < 2.2e-16
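
As a consistency check on the last lines of Output 1(f), the usual adjusted R² formula can be evaluated with the printed values. Here $n = 3457$ is the number of observations stated in Question 1 and $p = 10$ is the number of slope coefficients in the output, so that $n - p - 1 = 3446$ matches the residual degrees of freedom. The symbols $n$ and $p$ are notation introduced for this sketch.

```latex
\[
R^2_{\text{adj}}
= 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.6816) \times \frac{3456}{3446}
\approx 0.6807 .
\]
```

The result agrees with the Adjusted R-squared value reported by R. The penalty factor $(n-1)/(n-p-1)$ grows with $p$, which is why the adjusted value need not increase when predictors are added.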

Output 1(g)

Age  Usage   Fuel    Seller      Transmission  Owner  Year
9    100000  Diesel  Individual  Manual        1      2012
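
A minimal sketch of how a fitted value could be assembled from the coefficients in Output 1(f) for the car in Output 1(g). It assumes that the baseline levels are CNG, Dealer, Automatic and Owner = 1 (inferred from which dummy variables appear in the output), that log is the natural logarithm, and it uses the rounded coefficients as printed.

```latex
\begin{align*}
\widehat{\log(\text{Price})}
&= 9.903 - 0.1141 \times 9 - 3.052\times 10^{-7} \times 100000
   + 0.6123 - 0.2008 - 0.7662 + 0 \\
&= 9.903 - 1.0269 - 0.0305 + 0.6123 - 0.2008 - 0.7662 \\
&\approx 8.491, \\
\widehat{\text{Price}} &\approx e^{8.491} \approx 4.87 \times 10^{3}.
\end{align*}
```

The terms are, in order: intercept, Age, Usage, FuelDiesel, SellerIndividual, TransmissionManual, and the Owner baseline. Simply exponentiating the fitted log price ignores any retransformation correction, so the figure is best read as an approximate typical price in the units of Price.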

2. In this question, we continue to investigate the data in Question 1.

(a) Consider the multiple linear regression model in Output 1(f).
i. Output 2(a) gives the residual plot for this model. Name two assumptions assessed
by this plot and explain these two assumptions. [4 marks]
ii. Is there evidence of the violation of these two assumptions based on the plot in
Output 2(a)? Explain your answer. [4 marks]
iii. Explain the concept of an outlier and give the name of the statistic that is used to
detect outliers. [3 marks]
iv. Explain the concept of a high-leverage observation. [2 marks]
v. Output 2(b) shows a plot for the hat values. Are there observations that have
high leverage? Explain your answer. [2 marks]
(b) We perform model selection with the model in Output 1(f) being the full model.
i. Suppose that we use the backward stepwise selection procedure. How many
models will be searched through? [3 marks]
ii. Suppose that we use the best subset selection procedure. How many models
will be searched through? [3 marks]
iii. Name three evaluation criteria that can be used to compare the candidate models.
[3 marks]
iv. For each criterion you give in your answer to Question 2(b)iii, how would you
choose the model? Do you choose the model with the largest or smallest criterion
value? [3 marks]
v. Explain the concept of overfitting. [3 marks]
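
For orientation on the search sizes in part (b), the standard textbook counts can be written in terms of a generic number of candidate predictors $p$. The value of $p$ is deliberately left open here: it depends on whether the dummy variables of a categorical predictor are entered as a group, which the paper does not state, and conventions also differ on whether the intercept-only model is counted.

```latex
\[
\text{Best subset: } \sum_{k=0}^{p} \binom{p}{k} = 2^{p},
\qquad
\text{Backward stepwise: } 1 + \sum_{k=0}^{p-1} (p - k) = 1 + \frac{p(p+1)}{2}.
\]
```

In the backward count, the sum covers the $p - k$ candidate deletions considered after $k$ predictors have already been removed, and the leading 1 is the full model that the procedure starts from.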

Output 2(a)

Output 2(b)

3. In this question, we analyse a dataset on diabetes. The goal is two-fold. First, we hope
to understand the relationship between having diabetes and diagnostic measurements
of health conditions, such as blood pressure and body mass index. Second, we hope
to predict whether a patient has diabetes based on the diagnostic measurements. The
following variables are available.

• Glucose: Two-hour plasma glucose concentration in an oral glucose tolerance test (mg/dL)
• BloodPressure: Diastolic blood pressure (mmHg)
• SkinThickness: Triceps skin fold thickness (mm)
• Insulin: Two-hour serum insulin (μU/ml)
• BMI: Body mass index (weight in kg / (height in m)²)
• Age: Age (years)
• Diet: A measure of whether the person continuously followed a healthy diet routine;
1 if yes and 0 otherwise
• Diabetes: 1 if the person has diabetes and 0 otherwise

(a) Suppose that a researcher tries to understand whether continuously following a healthy
diet routine (measured by variable Diet) can reduce the risk of having diabetes.
i. Explain the concept of a confounding variable in the current context. [3 marks]
ii. Give an example of a potential confounder variable. Explain your answer. [2 marks]
(b) Output 3(a) gives the result of logistic regression, regressing Diabetes on the rest of
the variables.
i. Construct a 95% confidence interval for the coefficient for Diet and interpret this
confidence interval. You may need to use the supplied Murdoch & Barnes Statistical
Tables. [4 marks]
ii. Give the formula for calculating the deviance residuals and explain what the
deviance residuals can be used for. [4 marks]
iii. Given an observation whose health diagnostic measurements are given in Output
3(b), predict whether they have diabetes. Explain your answer. [3 marks]
(c) Suppose you include some polynomial terms of age and fit a new model based on
the training set. The result is shown in Output 3(c). Compare this model with the
one in Output 3(a) using the likelihood ratio test. Write down the test statistic, the
reference distribution, and the p-value. You may need to use the supplied Murdoch &
Barnes Statistical Tables. [3 marks]
(d) We now evaluate the two logistic regression models in Output 3(a) and Output 3(c)
in terms of their performance on a validation set. Output 3(d) gives two ROC curves,
with the solid one from the model in Output 3(a) and the dashed one from the model
in Output 3(c).
i. Explain how each point on an ROC curve is obtained. [4 marks]
ii. Give the name of a statistic that can be used for comparing the two models
based on their ROC curves. [3 marks]
(e) Linear discriminant analysis can also be used for predicting Diabetes. Describe the
assumptions of linear discriminant analysis. [4 marks]
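
For reference, one common way to write the deviance residual of a binary logistic regression is sketched below. The symbols $y_i$ (observed 0/1 response), $\hat{p}_i$ (fitted probability) and $d_i$ are notation introduced here, not taken from the paper.

```latex
\[
d_i = \operatorname{sign}(y_i - \hat{p}_i)\,
\sqrt{-2\left[\, y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \right]},
\qquad
\sum_{i=1}^{n} d_i^{2} = \text{residual deviance}.
\]
```

Each $d_i^2$ is the contribution of observation $i$ to the residual deviance reported by R, so observations with large $|d_i|$ are the ones the model fits poorly.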

Output 3(a)

> res.glm1 = glm(Diabetes~., family = binomial, data = train)
> summary(res.glm1)

Call:
glm(formula = Diabetes ~ ., family = binomial, data = train)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -7.8797516  0.7521290 -10.477  < 2e-16 ***
Glucose        0.0327814  0.0039500   8.299  < 2e-16 ***
BloodPressure -0.0078977  0.0060381  -1.308    0.191
SkinThickness  0.0074762  0.0075869   0.985    0.324
Insulin       -0.0008366  0.0009917  -0.844    0.399
BMI            0.0687947  0.0164366   4.185 2.85e-05 ***
Age            0.0388568  0.0091365   4.253 2.11e-05 ***
Diet           0.0708684  0.2294000   0.309    0.757
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 793.94 on 613 degrees of freedom
Residual deviance: 607.04 on 606 degrees of freedom
AIC: 623.04
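
A hedged sketch of a Wald-type 95% interval for the Diet coefficient, using the estimate and standard error printed in Output 3(a) and the standard normal quantile 1.96. The notation $\hat\beta_{\text{Diet}}$ is introduced here for illustration.

```latex
\begin{align*}
\hat\beta_{\text{Diet}} \pm 1.96 \times \mathrm{SE}(\hat\beta_{\text{Diet}})
&= 0.0708684 \pm 1.96 \times 0.2294000 \\
&= 0.0708684 \pm 0.4496 \\
&\approx (-0.379,\ 0.520), \\
\text{odds-ratio scale: } (e^{-0.379},\ e^{0.520}) &\approx (0.68,\ 1.68).
\end{align*}
```

The interval contains 0 (equivalently, the odds-ratio interval contains 1), which is consistent with the large p-value reported for Diet.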

Output 3(b)

Glucose  BloodPressure  SkinThickness  Insulin  BMI   Age  Diabetes  Diet
148      72             35             0        33.6  50   1         0
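
A minimal sketch of how a fitted probability could be computed for the observation in Output 3(b) from the coefficients in Output 3(a). The symbols $\hat\eta$ (linear predictor) and $\hat{p}$ (fitted probability) are notation introduced here, and the 0.5 classification threshold mentioned afterwards is an assumption, since the paper does not fix one.

```latex
\begin{align*}
\hat\eta &= -7.8797516 + 0.0327814(148) - 0.0078977(72) + 0.0074762(35) \\
&\quad - 0.0008366(0) + 0.0687947(33.6) + 0.0388568(50) + 0.0708684(0) \\
&\approx 0.919, \\
\hat{p} &= \frac{1}{1 + e^{-\hat\eta}} \approx \frac{1}{1 + 0.399} \approx 0.715 .
\end{align*}
```

With a threshold of 0.5, a fitted probability of about 0.715 would lead to predicting Diabetes = 1 for this observation.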

Output 3(c)

> res.glm2 = glm(Diabetes~.+ I(Age^2)+I(Age^3), family = binomial, data = train)
> summary(res.glm2)

Call:
glm(formula = Diabetes ~ . + I(Age^2) + I(Age^3), family = binomial,
    data = train)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.089e+01  3.776e+00  -2.885 0.003915 **
Glucose        3.442e-02  4.099e-03   8.398  < 2e-16 ***
BloodPressure -7.921e-03  6.272e-03  -1.263 0.206629
SkinThickness  6.539e-03  7.678e-03   0.852 0.394411
Insulin       -7.388e-04  1.019e-03  -0.725 0.468357
BMI            5.824e-02  1.694e-02   3.438 0.000586 ***
Age            1.583e-01  2.984e-01   0.531 0.595745
Diet           8.956e-02  2.356e-01   0.380 0.703893
I(Age^2)       7.940e-04  7.479e-03   0.106 0.915450
I(Age^3)      -3.579e-05  5.924e-05  -0.604 0.545737
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 793.94 on 613 degrees of freedom
Residual deviance: 577.61 on 604 degrees of freedom
AIC: 597.61
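
The two residual deviances printed in Output 3(a) and Output 3(c) can be combined into a likelihood ratio statistic as sketched below. The symbols $D_0$, $D_1$ and $G$ are notation introduced here. For a chi-squared distribution with 2 degrees of freedom the upper tail probability has the closed form $e^{-x/2}$, which is used in the last step.

```latex
\begin{align*}
G &= D_0 - D_1 = 607.04 - 577.61 = 29.43, \\
\text{df} &= 606 - 604 = 2, \qquad G \sim \chi^2_2 \text{ under the smaller model}, \\
p\text{-value} &= P(\chi^2_2 > 29.43) = e^{-29.43/2} \approx 4.1 \times 10^{-7}.
\end{align*}
```

From printed tables one would typically only be able to bound this, for example as a p-value far below 0.001.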

Output 3(d)

4. Consider a regression problem with a continuous response variable $Y$ and a $p$-dimensional
predictor vector $X = (X_1, \ldots, X_p)^\top$.

(a) A regression model is trained based on a training set. Let the obtained regression
function be $\hat{f}$, which maps a $p$-dimensional vector to a real value. Suppose you are
given a test dataset, denoted by $(\tilde{x}_i, \tilde{y}_i)$, $i = 1, \ldots, m$. Write down the
formula for calculating the test mean squared error based on the test dataset. [3 marks]
(b) Explain the meanings of the following terms in a prediction problem:
• bias,
• variance,
• irreducible error.
Explain how the test error can be decomposed based on these quantities. [5 marks]
(c) A regression tree is trained on the car sales data in Questions 1 and 2, and the result
is shown in Output 4(a).
i. You are given a test observation whose predictor vector is shown in Output 4(b).
Predict the value of Y for this test observation. [3 marks]
ii. Explain why tree pruning is typically needed. [4 marks]
iii. Output 4(c) gives a plot from performing a ten-fold cross-validation for tree
pruning. Explain the meaning of tree size on the X-axis of the plot. [3 marks]
iv. Based on Output 4(c), what tree size would you choose? Explain your answer.
[3 marks]
(d) Bagging and random forest are further applied to the data. Output 4(d) gives
out-of-bag error for the bagging and random forest results.
i. Explain the meaning of m in the legend in Output 4(d). [3 marks]
ii. Explain the concept of out-of-bag error. [4 marks]
iii. Based on this output, which method would you choose for making predictions?
Explain your answer. [2 marks]
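
For reference, the quantities in parts (a) and (b) are commonly written as below, using the notation of the question: $\hat{f}$ is the fitted regression function and $(\tilde{x}_i, \tilde{y}_i)$, $i = 1, \ldots, m$, are the test observations. The point $x_0$, the response $Y_0$ at that point, and the noise term $\varepsilon$ are notation introduced for this sketch, with the expectation taken over both the training data and the new response.

```latex
\begin{align*}
\text{Test MSE} &= \frac{1}{m} \sum_{i=1}^{m} \bigl( \tilde{y}_i - \hat{f}(\tilde{x}_i) \bigr)^{2}, \\
E\bigl[ (Y_0 - \hat{f}(x_0))^{2} \bigr]
&= \operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
 + \bigl[ \operatorname{Bias}\bigl(\hat{f}(x_0)\bigr) \bigr]^{2}
 + \operatorname{Var}(\varepsilon).
\end{align*}
```

The first two terms on the right depend on the learning method, while $\operatorname{Var}(\varepsilon)$ is the irreducible error that no choice of $\hat{f}$ can remove.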

Output 4(a)

Output 4(b)

Age  Price     Usage  Fuel    Seller      Transmission  Owner  Year
14   821.9178  70000  Petrol  Individual  Manual        1      2007

Output 4(c)

Output 4(d)
