Statistical Modelling and Inference Assignment 4: Spring 2 2022

Statistical Modelling and Inference

Assignment 4
Spring 2 2022

ASSIGNMENT CHECKLIST
• Have you shown all of your working, including probability notation where necessary?
• Have you included all R output and plots to support your answers where necessary?
• Have you included all of your R code?
• Have you made sure that all plots and tables each have a caption?
• Is the submission a single pdf file - correctly orientated, easy to read? If not, penalties apply.
• Assignments emailed instead of submitted by the online submission on BB will not be marked and will
receive zero.
• Q1-Q3 may be typed or hand written and scanned in as a pdf

Q1
This question may be typed or hand written and scanned in as a pdf.
The aim of this question is to familiarise you with linearising models and dealing with transformations in linear regression.
The saturated growth equation is useful for modelling the growth of some animal species. It can be written as
$$ Y = \frac{\alpha x}{\delta + x}, $$

where Y is a physical measurement of animal size, x represents time, and α and δ are unknown parameters.
(a) Linearise the saturated growth equation.
[2 marks]
(b) Find α̂ and δ̂ based on the least-squares estimate of the parameters of the linearised model.
[2 marks]
(c) Explain why the approach in part (b) is different to fitting the saturated model directly using the
method of least squares. Hint: What is the function we are minimizing in each case?
[3 marks]
(d) A study was conducted in 2002 to investigate the growth of harbour seals. Carcasses of seals that drifted
ashore were collected. Their lengths (in cm) were measured and their ages (in years) were estimated
from their teeth. The linearised saturated growth model was fitted to the data using R and the following
output was obtained. Note that some entries in the outputs in this question are missing and you are to
calculate them. The suffix .trans indicates that a transformation was applied to the variable.
seal.lm <- lm(Length.trans ~ Age.trans)
summary(seal.lm)

##
## Call:
## lm(formula = Length.trans ~ Age.trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.986e-04 -1.731e-04 -2.973e-05 1.127e-04 5.092e-04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.075e-03 9.825e-05 61.830 < 2e-16 ***
## Age.trans ????????? 2.897e-04 8.423 1.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0002495 on 13 degrees of freedom
## Multiple R-squared: 0.8451, Adjusted R-squared: 0.8332
## F-statistic: 70.94 on 1 and 13 DF, p-value: 1.267e-06
i) How many observations are in the data?
[1 mark]
ii) A few observations from the data are given below. For the following observations, write down the
response vector y and the design matrix X for the linearised model.
[2 marks]
Length Age
489 4
476 3
513 7
382 1

iii) Using Part (b) and the R output above, calculate α̂ and δ̂ from the saturated growth curve for this data,
and give the best fitting saturated growth curve. [6 marks]
iv) Find the 90% confidence interval for the intercept term in the linearised model.
[4 marks]
v) The following output gives the 90% prediction interval for Length.trans of a seal aged 6 years, based
on the linearised model. Find the corresponding 90% prediction interval for the length of a seal
aged 6 years.
[5 marks]
x0 <- tibble(Age.trans=??)
predict(seal.lm, newdata = x0, interval = "prediction", level = 0.9)

##           fit         lwr         upr
## 1 0.006481724 0.006023172 ???????????
[Question total: 25]

Solutions:

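(a) Taking the reciprocal of both sides gives
$$ Y = \frac{\alpha x}{\delta + x} \quad\Longrightarrow\quad \frac{1}{Y} = \frac{\delta + x}{\alpha x} = \frac{\delta}{\alpha}\left(\frac{1}{x}\right) + \frac{1}{\alpha}, $$
that is,
$$ Y^* = \beta_1 x^* + \beta_0, $$
where $Y^* = 1/Y$, $x^* = 1/x$, $\beta_0 = 1/\alpha$ and $\beta_1 = \delta/\alpha$.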

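(b) Let $\hat\beta_0$ and $\hat\beta_1$ denote the least-squares estimates of $\beta_0$ and $\beta_1$ respectively. From Part (a), we have that
$1/\hat\alpha = \hat\beta_0$ and $\hat\delta/\hat\alpha = \hat\beta_1$. Hence,
$$ \hat\alpha = \frac{1}{\hat\beta_0} \quad\text{and}\quad \hat\delta = \hat\alpha\hat\beta_1 = \frac{\hat\beta_1}{\hat\beta_0}. $$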

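(c) With linearization, least-squares estimation minimizes
$$ Q(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i^* - \beta_1 x_i^* - \beta_0\right)^2 = \sum_{i=1}^{n} \left(\frac{1}{y_i} - \frac{1}{\alpha} - \frac{\delta}{\alpha}\,\frac{1}{x_i}\right)^2. $$

Whereas without linearization, we directly minimize
$$ Q(\alpha, \delta) = \sum_{i=1}^{n} \left(y_i - \frac{\alpha x_i}{\delta + x_i}\right)^2. $$

These two functions are different: the first measures squared errors on the 1/Y scale and the second on the Y scale, so they generally have different minimizers. The error structure has also changed after linearization.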
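The difference between the two criteria in (c) can be seen numerically. The sketch below is only an illustration: it uses the four observations quoted in part (d)(ii), not the full seal data, and the function names Q_lin and Q_direct are labels introduced here. It fits the linearised model with lm(), then starts a direct least-squares search from those estimates.

```r
# Four (Length, Age) observations quoted in Q1(d)(ii); not the full data set
length_cm <- c(489, 476, 513, 382)
age_yr    <- c(4, 3, 7, 1)

# Criterion minimised after linearisation: squared errors on the 1/Y scale
Q_lin <- function(alpha, delta) {
  sum((1 / length_cm - 1 / alpha - (delta / alpha) * (1 / age_yr))^2)
}

# Criterion minimised when fitting the saturated curve directly: errors on the Y scale
Q_direct <- function(alpha, delta) {
  sum((length_cm - alpha * age_yr / (delta + age_yr))^2)
}

# Linearised fit: ordinary least squares of 1/Length on 1/Age, then back-transform
lin_fit   <- lm(I(1 / length_cm) ~ I(1 / age_yr))
b         <- unname(coef(lin_fit))
alpha_lin <- 1 / b[1]
delta_lin <- b[2] / b[1]

# Direct fit: numerically minimise Q_direct, starting from the linearised estimates
direct <- optim(c(alpha_lin, delta_lin), function(p) Q_direct(p[1], p[2]))

# Compare both criteria at both sets of estimates
rbind(
  linearised = c(Q_lin = Q_lin(alpha_lin, delta_lin),
                 Q_direct = Q_direct(alpha_lin, delta_lin)),
  direct     = c(Q_lin = Q_lin(direct$par[1], direct$par[2]),
                 Q_direct = direct$value)
)
```

Each set of estimates does at least as well as the other on its own criterion, which is the sense in which the two approaches answer different minimisation problems.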
(d) i) From the output, we know that n − p = 13 and p − 1 = 1. Hence, n = 15.
ii)
$$ y^* = \begin{pmatrix} 1/489 \\ 1/476 \\ 1/513 \\ 1/382 \end{pmatrix}, \quad X^* = \begin{pmatrix} 1 & 1/4 \\ 1 & 1/3 \\ 1 & 1/7 \\ 1 & 1/1 \end{pmatrix} $$
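As a check on the layout of the response vector and design matrix in (ii), here is a minimal base-R sketch built from the four quoted observations. The variable names are introduced here for illustration, and the resulting coefficients will not match the summary output above, which uses all 15 observations.

```r
# Four (Length, Age) observations from the table in (ii)
length_cm <- c(489, 476, 513, 382)
age_yr    <- c(4, 3, 7, 1)

# Transformed response and covariate for the linearised model
y_star    <- 1 / length_cm
age_trans <- 1 / age_yr

# Design matrix: a column of ones for the intercept, then the transformed age
X_star <- cbind(Intercept = 1, Age.trans = age_trans)

# Least-squares estimate from the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X_star) %*% X_star, t(X_star) %*% y_star)

# The same estimate via lm(), for comparison
fit <- lm(y_star ~ age_trans)
cbind(normal_equations = drop(beta_hat), lm = unname(coef(fit)))
```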

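iii) From the R output, we have $\hat\beta_0 = 6.075 \times 10^{-3}$ and $\hat\beta_1 = 8.423 \times (2.897 \times 10^{-4}) = 2.44 \times 10^{-3}$.
Hence, $\hat\alpha = \dfrac{1}{6.075 \times 10^{-3}} = 164.6091$ and $\hat\delta = \dfrac{2.44 \times 10^{-3}}{6.075 \times 10^{-3}} \approx 0.4017$. The fitted model is
$$ Y = \frac{164.6091x}{0.4017 + x}. $$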
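The arithmetic in (iii) can be reproduced in a few lines of R. This is a sketch using only the numbers printed in the summary output; the missing slope is recovered as t value times standard error, and growth_curve is a helper name introduced here.

```r
# Values read from the summary(seal.lm) output
beta0_hat <- 6.075e-03   # intercept estimate
se_beta1  <- 2.897e-04   # standard error of the Age.trans slope
t_beta1   <- 8.423       # t value of the Age.trans slope

# The missing estimate: t value = estimate / standard error
beta1_hat <- t_beta1 * se_beta1

# Back-transform to the parameters of the saturated growth curve
alpha_hat <- 1 / beta0_hat
delta_hat <- beta1_hat / beta0_hat

# Fitted saturated growth curve as a function of age in years
growth_curve <- function(x) alpha_hat * x / (delta_hat + x)

round(c(beta1_hat = beta1_hat, alpha_hat = alpha_hat, delta_hat = delta_hat), 4)
```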

iv) The critical value in this case is $t_{13,0.05} \approx 1.7709$ (using qt(0.95, 13) in R). From the R output,
we have $\hat\beta_0 = 6.075 \times 10^{-3}$ and $SE(\hat\beta_0) = 9.825 \times 10^{-5}$. It follows that

$$ CI = \hat\beta_0 \pm t_{13,0.05}\, SE(\hat\beta_0) = 6.075 \times 10^{-3} \pm 1.7709 \times 9.825 \times 10^{-5} \approx (0.0059, 0.0062). $$
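The interval in (iv) can be checked directly from the printed estimate and standard error. A minimal sketch, with variable names chosen here for illustration:

```r
# Values read from the summary(seal.lm) output
beta0_hat <- 6.075e-03
se_beta0  <- 9.825e-05
df_resid  <- 13

# A 90% two-sided interval leaves 5% in each tail
t_crit <- qt(0.95, df = df_resid)

# Estimate plus or minus critical value times standard error
ci_beta0 <- beta0_hat + c(-1, 1) * t_crit * se_beta0
round(ci_beta0, 4)
```

If the fitted object seal.lm were available, confint(seal.lm, level = 0.9) should give the same interval up to rounding of the printed values.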


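v) A prediction interval from predict() is symmetric about the fitted value, so the upper limit of the PI for Length.trans can be deduced using

$$ upr = fit + (fit - lwr) = 0.006481724 + (0.006481724 - 0.006023172) = 0.006940276. $$

We can back-transform the PI for Length.trans to obtain an approximate PI for Length. Because Length = 1/Length.trans is a decreasing transformation, the limits swap:
$$ PI_{Length} = \left(\frac{1}{upr}, \frac{1}{lwr}\right) = \left(\frac{1}{0.006940276}, \frac{1}{0.006023172}\right) = (144.0865, 166.0255). $$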
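The back-transformation in (v) is easy to get the wrong way round, so a short numerical sketch may help. It uses only the fit and lwr values printed by predict(); the variable names are introduced here.

```r
# Values printed by predict(..., interval = "prediction", level = 0.9)
fit <- 0.006481724
lwr <- 0.006023172

# The interval is symmetric about the fit on the transformed (1/Length) scale
upr <- fit + (fit - lwr)

# Taking reciprocals reverses the order of the limits
pi_length <- c(lower = 1 / upr, upper = 1 / lwr)
round(pi_length, 4)
```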

Q2
This question may be typed or hand written and scanned in as a pdf.
A survey, conducted in a city by a local Internet provider, collected data for the following variables from 35
randomly selected households.
• Y : number of hours of Internet used in a week
• X1 : number of children in the household
• X2 : net household income (in dollars) in the previous week
• X3 : number of years of formal education of the highest income earner in the household
The values for the first 4 households are shown below.

Household 1 2 3 4
Y 12.8 9.4 14 15.6
X1 1.0 0.0 2 3.0
X2 1590.0 968.0 732 780.0
X3 15.0 11.0 12 13.0

Consider the multiple regression model


$$ M : Y = X\beta + \varepsilon, $$
where $Y_i \sim N(\eta_i, \sigma^2)$ independently for $i = 1, 2, \dots, n$ and $\eta = X\beta$.
(a) Write down the dimensions of $Y$, $X$, $\beta$, and $\varepsilon$.
[4 marks]
(b) Write down β in full, and the first four rows of X and y (the vector of observed values).
[3 marks]
[Question total: 7]

Solutions:

(a) There are n = 35 subjects (households) and p = 4 parameters (β0 , β1 , β2 , β3 ).


• Y is 35 × 1,
• ε is 35 × 1,
• β is 4 × 1, and
• X is 35 × 4.
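(b) We have
$$ \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}, \quad X = \begin{pmatrix} 1 & 1 & 1590 & 15 \\ 1 & 0 & 968 & 11 \\ 1 & 2 & 732 & 12 \\ 1 & 3 & 780 & 13 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}, \quad y = \begin{pmatrix} 12.8 \\ 9.4 \\ 14.0 \\ 15.6 \\ \vdots \end{pmatrix}. $$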
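The first four rows of X and y can also be generated in R, which is a quick way to confirm the column order. This sketch uses only the four households shown in the question; the data frame name households is introduced here.

```r
# The first four households from the table in the question
households <- data.frame(
  Y  = c(12.8, 9.4, 14.0, 15.6),
  X1 = c(1, 0, 2, 3),
  X2 = c(1590, 968, 732, 780),
  X3 = c(15, 11, 12, 13)
)

# model.matrix() adds the intercept column of ones automatically
X <- model.matrix(Y ~ X1 + X2 + X3, data = households)
y <- households$Y

X
dim(X)  # 4 rows here; the full survey would give 35 rows and the same 4 columns
```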

Q3
This question may be typed or hand written and scanned in as a pdf.
Suppose $y_1, y_2, \dots, y_n$ are independent exponential observations with probability density function
$$ f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y \ge 0,\ \theta > 0. $$

(a) Write down the log-likelihood $\ell(\theta; y)$.
[2 marks]
(b) Hence, find the maximum likelihood estimate θ̂.
[2 marks]
(c) Find the Fisher information Iθ .
[3 marks]
[Question total: 7]

Solutions:

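(a) The likelihood function is
$$ L(\theta, y) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-y_i/\theta} = \frac{1}{\theta^n} e^{-\sum y_i/\theta}. $$
Hence the log-likelihood is
$$ \ell(\theta, y) = \log(L(\theta, y)) = -n \log\theta - \frac{\sum y_i}{\theta}. $$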
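(b) The score function is
$$ S(\theta, y) = \frac{\partial \ell}{\partial \theta} = \frac{-n}{\theta} + \frac{\sum y_i}{\theta^2}. $$
Setting $S(\theta, y)$ equal to zero and solving gives the MLE
$$ \hat\theta = \frac{\sum y_i}{n} = \bar{y}. $$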

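(c) The Fisher information is defined as
$$ I_\theta = E\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right]. $$

The second derivative of the log-likelihood is
$$ \frac{\partial^2 \ell}{\partial \theta^2} = \frac{n}{\theta^2} - \frac{2\sum y_i}{\theta^3}. $$
Therefore, since $E(Y_i) = \theta$,
$$ E\left[-\frac{\partial^2 \ell}{\partial \theta^2}\right] = 2\sum_{i=1}^{n} \frac{E(Y_i)}{\theta^3} - \frac{n}{\theta^2} = \frac{2n\theta}{\theta^3} - \frac{n}{\theta^2} = \frac{n}{\theta^2}. $$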
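The results of (b) and (c) can be sanity-checked numerically. The sketch below is an illustration on simulated data: the sample size, the true value theta_true = 2 and the seed are arbitrary assumptions made here, not part of the assignment. It maximises the log-likelihood numerically and compares the result with the sample mean, then evaluates the negative second derivative at the MLE.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
theta_true <- 2      # assumed true mean for the simulation
n <- 200             # assumed sample size
y <- rexp(n, rate = 1 / theta_true)

# Log-likelihood from part (a)
loglik <- function(theta) -n * log(theta) - sum(y) / theta

# Numerical maximiser of the log-likelihood over a wide interval
theta_hat <- optimize(loglik, interval = c(0.01, 50), maximum = TRUE,
                      tol = 1e-10)$maximum

# Negative second derivative from part (c), evaluated at theta = ybar
y_bar    <- mean(y)
obs_info <- -(n / y_bar^2 - 2 * sum(y) / y_bar^3)

c(theta_hat = theta_hat, y_bar = y_bar,
  obs_info = obs_info, n_over_theta2 = n / y_bar^2)
```

At the MLE the observed information coincides with n/θ² evaluated at θ̂, which agrees with the closed form derived in (c).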

Q4
The aim of this question is to explore the concepts of ANCOVA and the three types of
regression models associated with it: identical, parallel, and separate. You will also get to sharpen
your skills by performing some analysis of contrasts.
An agriculture company has developed three new variants of a fast-growing crop, and wishes to investigate
their sensitivity to soil salinity. Seeds of each plant type were planted in different sites with varying levels of
soil salinity (measured in decisiemens per meter (dS/m)). The height (in units of 10 cm) of each crop was measured
15 days after planting. The data is given in crop.csv, available from BB. It contains the following
variables for 85 crops:

Variable Details
Crop_no Unique identifier for each crop
Type Variant of crop (Type I, II, or III)
Salinity Salinity level of the soil (dS/m)
Height Height of the crop (10cm)

(a) Produce a scatterplot of Height against Salinity level by crop type. Describe the relationship.
[3 marks]
(b) Fit the following linear models in R:
• Model 1 – identical regression lines: Height on Salinity level
• Model 2 – parallel regression lines: Height on Type and Salinity level
• Model 3 – separate regression lines: Height on Type and Salinity level with interaction
Consider Model 3 (separate regression lines). Write down the line of best fit for Height on Salinity level
for each of Type I, Type II, and Type III crop separately.
[3 marks]
(c) Test for a statistically significant interaction term in Model 3 (separate regression lines) at the 5%
significance level. Clearly state the null and alternative hypotheses, the test statistic, the P-value, and
your conclusion.
[2 marks]
(d) Calculate the BIC for all three models. Which model has the best fit according to BIC?
[2 marks]
(e) Assess the assumptions for Model 3 (separate regression lines) using appropriate diagnostic plots.
[4 marks]
(f) The company is interested in the difference between crop types II and III. Based on Model 3 (separate
regression lines), calculate the estimated difference in mean height between Type II and Type III
crops (that is, µ̂Type II − µ̂Type III ). Then calculate the corresponding 95% confidence interval. [Hint:
Review the lecture on contrasts.]
[3 marks]
[Question total: 17]

Solutions:

(a) See the code and output attached at the end for the scatterplot.
There is a weak, negative, and non-linear relationship between Height and Salinity level. The lines for
different types of crops appear to have different slopes and intercepts.
(b) See the output for separate regression.
For type I crop:
Height = 5.7874 − 0.2902 × Salinity
For type II crop:
Height = (5.7874 + 2.9467) + (−0.2902 − 1.6772) × Salinity
= 8.7341 − 1.9674 × Salinity

For type III crop:
Height = (5.7874 + 22.4455) + (−0.2902 − 7.9971) × Salinity
= 28.2329 − 8.2873 × Salinity
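The three lines in (b) follow mechanically from the coefficient table of Model 3. A small sketch of that bookkeeping, using the estimates printed in summary(separate); the object names coefs and lines_by_type are introduced here:

```r
# Estimates copied from the summary(separate) output
coefs <- c("(Intercept)"    =  5.7874,
           "Salinity"       = -0.2902,
           "Type2"          =  2.9467,
           "Type3"          = 22.4455,
           "Salinity:Type2" = -1.6772,
           "Salinity:Type3" = -7.9971)

# Type I is the baseline; Types II and III add their own offsets
lines_by_type <- data.frame(
  Type      = c("I", "II", "III"),
  Intercept = coefs[["(Intercept)"]] + c(0, coefs[["Type2"]], coefs[["Type3"]]),
  Slope     = coefs[["Salinity"]] +
              c(0, coefs[["Salinity:Type2"]], coefs[["Salinity:Type3"]])
)
lines_by_type
```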
(c) The full model (i.e., Model 3, separate regression lines) is

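$$ \text{Height}_i = \beta_0 + \beta_1 \times \text{Salinity}_i + \beta_2 \times \text{TypeII} + \beta_3 \times \text{TypeIII} + \beta_4 \times \text{Salinity}_i \times \text{TypeII} + \beta_5 \times \text{Salinity}_i \times \text{TypeIII} + \varepsilon_i $$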

We are testing H0 : β4 = β5 = 0 vs H1 : at least one of β4 and β5 is not zero, where β4 and β5 are the
coefficients of interaction terms in Model 3.
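See the output for anova(parallel, separate).
The test statistic is F = 24.234 on 2 and 79 degrees of freedom and the associated P-value is $6.2 \times 10^{-9}$. Since the P-value is below 0.05, there is sufficient
evidence to reject H0 at the 5% significance level: the interaction between Salinity and Type is statistically significant.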
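The F statistic in (c) can be rebuilt from the residual sums of squares in the ANOVA table, which makes explicit what the nested-model comparison computes. A sketch using the printed values; the variable names are introduced here:

```r
# Residual sums of squares and residual degrees of freedom from the ANOVA table
rss_parallel <- 764.76; df_parallel <- 81   # reduced model (no interaction)
rss_separate <- 473.97; df_separate <- 79   # full model (with interaction)

# Extra sum of squares per extra parameter, relative to the full-model variance
df_extra <- df_parallel - df_separate
f_stat   <- ((rss_parallel - rss_separate) / df_extra) /
            (rss_separate / df_separate)

# Upper-tail probability under the F distribution with (2, 79) degrees of freedom
p_value <- pf(f_stat, df_extra, df_separate, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```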
(d) From R, the BIC for the three models are as follows:
• Model 1 (identical regression lines): 448.3398
• Model 2 (parallel regression lines): 450.1701
• Model 3 (separate regression lines): 418.3894
Model 3 (separate regression lines) has the lowest BIC value and hence is preferred.
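For a Gaussian linear model the BIC can be written in terms of the residual sum of squares, which shows how the fit and the parameter count trade off. The sketch below assumes the usual convention that the error variance counts as one extra parameter (this matches the df column of the BIC() output) and uses the RSS values from the ANOVA table; bic_lm is a helper name introduced here. The RSS for Model 1 is not printed in the output, so only Models 2 and 3 are reproduced.

```r
n <- 85  # number of crops

# BIC = -2 * logLik + k * log(n), with the Gaussian log-likelihood
# evaluated at the ML variance estimate RSS / n
bic_lm <- function(rss, n, n_coef) {
  k <- n_coef + 1                      # regression coefficients plus sigma^2
  n * log(rss / n) + n * (1 + log(2 * pi)) + k * log(n)
}

bic_parallel <- bic_lm(rss = 764.76, n = n, n_coef = 4)
bic_separate <- bic_lm(rss = 473.97, n = n, n_coef = 6)

round(c(parallel = bic_parallel, separate = bic_separate), 2)
```

Small differences from the BIC() output in the later decimal places come from the rounding of the printed RSS values.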
(e)
• Linearity: The Residuals vs Fitted plot (top left) shows a non-linear pattern, so this assumption is not
reasonable.
• Homoscedasticity: The plot of standardised residuals against fitted values (bottom left) shows unequal spread moving
from left to right, so this assumption is also not reasonable.
• Normality: The Normal Q-Q plot (top right) shows the points are reasonably close to the line, with
deviation at the ends. The normality assumption is questionable.
• Independence: The height of one crop should not affect the height of another crop, so this assumption is
reasonable.
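(f) The difference between the means of Type II and Type III crops can be expressed as a linear combination
of $\hat\beta$. Let X be the design matrix and $\beta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5)^\top$ be the coefficients for the intercept,
Salinity, factor(Type)2, factor(Type)3, Salinity × factor(Type)2 and Salinity × factor(Type)3,
respectively. At a given salinity level, the Type II mean is $\beta_0 + \beta_2 + (\beta_1 + \beta_4)\,\text{Salinity}$ and the Type III mean is $\beta_0 + \beta_3 + (\beta_1 + \beta_5)\,\text{Salinity}$, so setting
$$ \lambda = (0, 0, 1, -1, \text{Salinity}, -\text{Salinity})^\top $$
gives the difference in mean height between Type II and Type III crops as $\lambda^\top \beta$.
To obtain an estimate of this difference and the associated confidence interval (CI), we substitute Salinity
with its estimated marginal mean (i.e., the mean of this covariate) and compute the CI using
$$ CI = \lambda^\top \hat\beta \pm t_{n-p,\alpha/2}\, s_e \sqrt{\lambda^\top (X^\top X)^{-1} \lambda}, $$
where $s_e$ is the residual standard error.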

This yields (see the R output)


µ̂Type II − µ̂Type III = −4.77
and
CI = (−6.56, −2.97).
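The contrast formula in (f) can be coded directly with coef() and vcov(), since vcov() returns the estimated covariance matrix of the coefficients, s² (XᵀX)⁻¹. The sketch below is an illustration under stated assumptions: the data are simulated here as a stand-in for crop.csv (they are not the assignment data), and contrast_ci is a helper name introduced for this example.

```r
# Hypothetical simulated data standing in for crop.csv (NOT the assignment data)
set.seed(42)
sim <- data.frame(
  Salinity = runif(60, min = 1, max = 4),
  Type     = factor(rep(1:3, each = 20))
)
sim$Height <- 6 - 0.3 * sim$Salinity + rnorm(60)

fit <- lm(Height ~ Salinity * Type, data = sim)

# Confidence interval for the linear combination lambda' beta
contrast_ci <- function(fit, lambda, level = 0.95) {
  est    <- sum(lambda * coef(fit))
  se     <- sqrt(drop(t(lambda) %*% vcov(fit) %*% lambda))
  t_crit <- qt(1 - (1 - level) / 2, df = df.residual(fit))
  c(estimate = est, lower = est - t_crit * se, upper = est + t_crit * se)
}

# Coefficient order: (Intercept), Salinity, Type2, Type3,
# Salinity:Type2, Salinity:Type3
s_bar  <- mean(sim$Salinity)
lambda <- c(0, 0, 1, -1, s_bar, -s_bar)

result <- contrast_ci(fit, lambda)
result
```

With the assignment objects, calling contrast_ci(separate, c(0, 0, 1, -1, mean(crop$Salinity), -mean(crop$Salinity))) should agree with the emmeans output up to rounding, assuming emmeans evaluates the covariate at its sample mean, which is its default behaviour.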

Load in the data
library(tidyverse) # provides read_csv(), the %>% pipe, mutate() and ggplot()
# The here package builds the file path relative to the project root; don't be put off by it.
crop <- read_csv(here::here("data/crop.csv"), col_types = cols())
head(crop) # Check it has read in properly

## # A tibble: 6 x 3
## Height Salinity Type
## <dbl> <dbl> <dbl>
## 1 6.87 2.63 1
## 2 3.8 3.05 1
## 3 3.43 1.55 1
## 4 3.36 3.89 1
## 5 6.21 3.37 1
## 6 8.05 3.34 1
• Type has read in as a double, let’s change that to a factor.
crop <- crop %>%
  mutate(Type = as.factor(Type))

Get a scatter plot


crop %>%
  ggplot(aes(x = Salinity, y = Height, colour = Type)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE) +
  labs(y = "Height (cm)", x = "Salinity level", col = "Type of crop")

[Figure: scatterplot of Height (cm) against Salinity level, coloured by Type of crop (1, 2, 3), with a fitted least-squares line for each type.]

Fit the 3 different ANCOVA models
identical <- lm(Height ~ Salinity, data = crop)
parallel <- lm(Height ~ Salinity + Type, data = crop)
separate <- lm(Height ~ Salinity * Type, data = crop)

summary(separate)

##
## Call:
## lm(formula = Height ~ Salinity * Type, data = crop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7630 -1.6844 -0.1351 1.7698 6.6924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.7874 1.6536 3.500 0.000768 ***
## Salinity -0.2902 0.5982 -0.485 0.628884
## Type2 2.9467 2.3131 1.274 0.206433
## Type3 22.4455 3.2230 6.964 8.75e-10 ***
## Salinity:Type2 -1.6772 1.0095 -1.661 0.100612
## Salinity:Type3 -7.9971 1.1517 -6.944 9.56e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.449 on 79 degrees of freedom
## Multiple R-squared: 0.4959, Adjusted R-squared: 0.464
## F-statistic: 15.54 on 5 and 79 DF, p-value: 1.257e-10

Test for a significant interaction term


For this, we compare the parallel regressions to the separate regressions using an ANOVA:
anova(parallel, separate)

## Analysis of Variance Table
##
## Model 1: Height ~ Salinity + Type
## Model 2: Height ~ Salinity * Type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 81 764.76
## 2 79 473.97 2 290.79 24.234 6.206e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Obtain the BIC


BIC(identical, parallel, separate)

## df BIC
## identical 3 448.3398
## parallel 5 450.1701
## separate 7 418.3894

Assumption plots:
plot(separate)

[Figure: the four diagnostic plots for lm(Height ~ Salinity * Type): Residuals vs Fitted (residuals against fitted values); Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location (standardized residuals against fitted values); Residuals vs Leverage (standardized residuals against leverage, with Cook's distance contours). Observations 10, 17 and 83 are labelled in the first three panels, and 17, 76 and 83 in the last.]

The hard one: contrasts


# Load the emmeans library
library(emmeans)
# Create the emmeans object: estimated marginal means from the separate model,
# broken up by crop type.
crop_emmeans <- emmeans(separate, ~ Type)

## NOTE: Results may be misleading due to involvement in interactions


# Create the contrast object
# This is saying we want to compare type II with type III

crop_contrast <- contrast(crop_emmeans,method=list("Type_II-III"=c(0,1,-1)))
crop_contrast

## contrast estimate SE df t.ratio p.value
## Type_II-III -4.77 0.903 79 -5.280 <.0001
# get the CI
confint(crop_contrast)

## contrast estimate SE df lower.CL upper.CL
## Type_II-III -4.77 0.903 79 -6.56 -2.97
##
## Confidence level used: 0.95
[Assignment total: 56]
