
STAT4027 Assignment 3

Lewis Hastie

2023-10-06

Question 1

a)

y <- c(0, 0, 0, 2, 5, 1, 5, 14, 3, 19, 3, 14, 22)
t <- c(25, 37, 41, 42, 94, 16, 63, 126, 5, 31, 7, 24, 36)
g <- as.factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))

data_q1 <- data.frame(y = y, t = t, g = g)

# Fit the Poisson model
glm_pois <- glm(y ~ offset(log(t)) + g, family = poisson)
summary(glm_pois)

##
## Call:
## glm(formula = y ~ offset(log(t)) + g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 124.138 on 12 degrees of freedom
## Residual deviance: 17.676 on 11 degrees of freedom
## AIC: 57.94
##
## Number of Fisher Scoring iterations: 5

# Fit the negative binomial model (glm.nb comes from the MASS package)
library(MASS)
glm_nb <- glm.nb(y ~ offset(log(t)) + g)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached

summary(glm_nb)

##
## Call:
## glm.nb(formula = y ~ offset(log(t)) + g, init.theta = 86681.79322,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(86681.79) family taken to be 1)
##
## Null deviance: 124.128 on 12 degrees of freedom
## Residual deviance: 17.675 on 11 degrees of freedom
## AIC: 59.94
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 86682
## Std. Err.: 2544076
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -53.94

mu_var1 = cbind(mean(y/t), var(y/t))

knitr::kable(mu_var1, col.names = c("Mean", "Variance"), caption = '$y/t$ summary table')

Table 1: y/t summary table

Mean Variance
0.245362 0.0731728

Our Poisson model demonstrates the better fit, as its AIC of 57.94 is lower than the 59.94 of our NB model. This is because the data do not exhibit overdispersion: from the summary table we can see that the variance is not larger than the mean. Furthermore, the theta estimate in our negative binomial model is very large, with a very large standard error, again indicating that the Poisson model is suitable.
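
As a quick supplementary check, the Pearson dispersion statistic for the Poisson fit can be computed directly (a minimal sketch; values close to 1 are consistent with no overdispersion):

# Pearson dispersion statistic: sum of squared Pearson residuals over the residual df.
sum(residuals(glm_pois, type = "pearson")^2) / df.residual(glm_pois)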

b)

# Without t information ?
glm_pois_2 <- glm(y ~ g, family = poisson)
summary(glm_pois_2)

##
## Call:
## glm(formula = y ~ g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.1925 6.321 2.61e-10 ***
## g1 1.2850 0.2312 5.559 2.71e-08 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 107.246 on 12 degrees of freedom
## Residual deviance: 72.966 on 11 degrees of freedom
## AIC: 113.23
##
## Number of Fisher Scoring iterations: 6

# Fit the negative binomial model


glm_nb <- glm.nb(y ~ g)
summary(glm_nb)

##
## Call:
## glm.nb(formula = y ~ g, init.theta = 0.9387015914, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.4126 2.948 0.00319 **
## g1 1.2850 0.6322 2.033 0.04208 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.9387) family taken to be 1)
##
## Null deviance: 19.423 on 12 degrees of freedom
## Residual deviance: 15.151 on 11 degrees of freedom
## AIC: 79.027
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.939
## Std. Err.: 0.489
##
## 2 x log-likelihood: -73.027

mu_var2 = cbind(mean(y), var(y))

knitr::kable(mu_var2, col.names = c("Mean", "Variance"), caption = '$y$ summary table')

Table 2: y summary table

Mean Variance
6.769231 59.52564

In this instance the NB model has the better AIC, with values of 79.027 vs. 113.23 for the NB and Poisson models respectively. This makes intuitive sense, as the theta estimate in the NB model is small with a small standard error, suggesting that an NB model is suitable and that, when there is no information about t_i, the dispersion is larger than the mean.
This is supported by the data: the summary table indicates significant overdispersion, as the variance is far larger than the mean, and as such the negative binomial model is preferred.
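
A likelihood ratio test comparing the two fits points the same way (a minimal sketch; under the null the dispersion parameter sits on the boundary of the parameter space, so the chi-squared p-value is conservative):

# LRT of the Poisson model (theta -> infinity) against the negative binomial.
lrt <- 2 * (logLik(glm_nb) - logLik(glm_pois_2))
pchisq(as.numeric(lrt), df = 1, lower.tail = FALSE)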

c)

i)

To find our moment estimators, we equate the theoretical mean with the sample mean, and the theoretical variance with the sample variance.
For our mean,

E(Y) = ȳ    (1)
(1 − π)λ = ȳ    (2)
λ = ȳ/(1 − π)    (3)

Now our variance,

Var(Y) = s²    (4)
λ(1 − π)(1 + πλ) = s²    (5)

Subbing in our lambda from our first equation yields

[ȳ/(1 − π)] × (1 − π)[1 + ȳπ/(1 − π)] = s²    (6)
1 + ȳπ/(1 − π) = s²/ȳ    (7)
1 − π + ȳπ = s²/ȳ − (s²/ȳ)π    (8)
(s²/ȳ)π − π + ȳπ = s²/ȳ − 1    (9)
π(s²/ȳ − 1 + ȳ) = s²/ȳ − 1    (10)
π̂ = (s²/ȳ − 1)/(s²/ȳ − 1 + ȳ)    (11)
  = (s² − ȳ)/(s² + ȳ² − ȳ)    (12)

Subbing this expression for π into our mean expression we have,


λ = ȳ/(1 − π)    (13)
  = ȳ ÷ [(s² + ȳ² − ȳ − (s² − ȳ))/(s² + ȳ² − ȳ)]    (14)
  = ȳ ÷ [ȳ²/(s² + ȳ² − ȳ)]    (15)
λ̂ = (s² + ȳ² − ȳ)/ȳ    (16)

as required.
Thus we can find the estimates for our data.

ybar <- mean(y)
y_var <- var(y)

pi <- (y_var - ybar)/(y_var + ybar^2 - ybar)
lambda <- ybar/(1 - pi)

Thus our λ̂ = 14.5627914 and our π̂ = 0.5351694.

ii)

To find our MLE estimators, first note that n0 denotes the number of zeros in the sample.
The likelihood is the product over the n0 zero observations and the n − n0 non-zero observations:

L(λ, π) = [π + (1 − π)e^(−λ)]^(n0) × ∏_{i=1}^{n−n0} (1 − π) λ^(y_i) e^(−λ)/y_i!    (17)

Taking the logarithm we have,

ℓ(λ, π) = n0 ln[π + (1 − π)e^(−λ)] + Σ_{i=1}^{n−n0} ln[(1 − π) λ^(y_i) e^(−λ)/y_i!]    (18)
        = n0 ln[π + (1 − π)e^(−λ)] + (n − n0) ln(1 − π) + Σ_{i=1}^{n−n0} ln[λ^(y_i) e^(−λ)/y_i!]    (19)
        = n0 ln[π + (1 − π)e^(−λ)] + (n − n0) ln(1 − π) + ln(λ) Σ_{i=1}^{n−n0} y_i − (n − n0)λ − Σ_{i=1}^{n−n0} ln(y_i!)    (20)

In order to prove the required results, we will employ a method known as profile likelihood estimation, in
which we first find the MLE of π(λ), and then substitute this result into ℓ to find the MLE of λ.
Taking the derivative with respect to π yields,

∂ℓ/∂π = n0(1 − e^(−λ))/[π + (1 − π)e^(−λ)] − (n − n0)/(1 − π)    (21)

Setting the derivative to zero and solving,

(n − n0)/(1 − π) = n0(1 − e^(−λ))/[π + (1 − π)e^(−λ)]    (22)
[(n − n0)π + (n − n0)(1 − π)e^(−λ)]/(1 − π) = n0(1 − e^(−λ))    (23)
(n − n0)π/(1 − π) + (n − n0)e^(−λ) = n0(1 − e^(−λ))    (24)
(n − n0)π/(1 − π) = n0 − n e^(−λ)    (25)
(n − n0)π = (n0 − n e^(−λ)) − π(n0 − n e^(−λ))    (26)
π(n0 − n e^(−λ) + n − n0) = n0 − n e^(−λ)    (27)
π = (n0 − n e^(−λ))/(n − n e^(−λ))    (28)
  = (n0/n − e^(−λ))/(1 − e^(−λ))    (29)

We now substitute this value back into ℓ. Let r0 = n0/n, and note that 1 − π = (1 − r0)/(1 − e^(−λ)). Then, dropping the additive constant −Σ ln(y_i!),

ℓ = n0 ln[(r0 − e^(−λ))/(1 − e^(−λ)) + e^(−λ)(1 − r0)/(1 − e^(−λ))] + (n − n0) ln[(1 − r0)/(1 − e^(−λ))] + nȳ ln(λ) − (n − n0)λ    (30)
  = n0 ln(r0) + (n − n0)[ln(1 − r0) − ln(1 − e^(−λ))] + nȳ ln(λ) − (n − n0)λ    (31)

Taking the derivative with respect to λ,

∂ℓ/∂λ = −(n − n0)e^(−λ)/(1 − e^(−λ)) + nȳ/λ − (n − n0)    (33)

Setting to zero and solving,

0 = −(n − n0)e^(−λ)/(1 − e^(−λ)) + nȳ/λ − (n − n0)    (35)
nȳ/λ = (n − n0) + (n − n0)e^(−λ)/(1 − e^(−λ))    (36)
nȳ/λ = (n − n0)/(1 − e^(−λ))    (37)
nȳ(1 − e^(−λ)) = λ(n − n0)    (38)
ȳ(1 − e^(−λ̂)) = λ̂(1 − n0/n)    (39)

As required. Notice now our expression for π can be written as,

π = (r0 − e^(−λ))/(1 − e^(−λ)) = 1 + (r0 − 1)/(1 − e^(−λ))    (40)

Now our solution for λ̂ implies,

ȳ/λ̂ = (1 − r0)/(1 − e^(−λ̂))    (41)

Hence our result for π,


π̂ = 1 − ȳ/λ̂    (42)
Now for our hurdle model, we have the log-likelihood (from the lecture notes), where n1 = n − n0 is the number of non-zero observations:

ℓ(π, λ) = n0 ln(π) + n1 ln(1 − π) + ln(λ) Σ_{i=1}^{n1} y_i − n1 λ − Σ_{i=1}^{n1} ln(y_i!) − n1 ln(1 − e^(−λ))    (43)

Taking the derivative with respect to π yields,

∂ℓ/∂π = n0/π − n1/(1 − π)    (44)

Setting equal to zero,

0 = n0/π − n1/(1 − π)    (45)
n1/(1 − π) = n0/π    (46)
π(n0 + n1) = n0    (47)
π̂ = n0/n    (48)

as required.
To calculate the MLE of λ, we differentiate the log-likelihood with respect to λ.

∂ℓ/∂λ = nȳ/λ − n1 − n1 × e^(−λ)/(1 − e^(−λ))    (49)

Setting to zero,

0 = nȳ/λ − n1 − n1 × e^(−λ)/(1 − e^(−λ))    (50)
n1 [1 + e^(−λ)/(1 − e^(−λ))] = nȳ/λ    (51)
λ(n1/n) = ȳ(1 − e^(−λ))    (52)
λ(1 − n0/n) = ȳ(1 − e^(−λ))    (53)

as required.
We can now solve using the uniroot function,

y_bar <- mean(y)
n <- length(y)
n_0 <- sum(y == 0)

lambda_MLE <- function(lambda){
  lambda*(1 - n_0/n) - y_bar*(1 - exp(-lambda))
}

lamh <- uniroot(lambda_MLE, lower = 1, upper = 50)$root
pi_hat <- 1 - y_bar/lamh

pi_hat_hurdle <- n_0/n

Thus for our zero-inflated Poisson model we have an estimated λ̂ = 8.7986718 and an estimated π̂ = 0.2306531. For our hurdle model, we have λ̂_Hurdle = 8.7986718 and π̂_Hurdle = 0.2307692.
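
As a sanity check on these estimates (a minimal sketch using the objects computed above): at the MLE, the score equations force the fitted ZIP probability of a zero to equal the observed zero proportion n0/n.

# P(Y = 0) under the fitted ZIP model is pi + (1 - pi) * exp(-lambda);
# this should reproduce the observed proportion of zeros.
c(fitted_zero = pi_hat + (1 - pi_hat) * exp(-lamh), observed_zero = n_0/n)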

Question 2

a)

library(knitr)      # kable()
library(kableExtra) # kable_styling()
library(ggplot2)    # plots below

dat <- read.table("~/Desktop/R/4027/Assignment 3/Epilepsy.txt", header = T)
dat <- dat[,1:4]

# Calculating over all times and over all treatment groups
agg0 <- data.frame(mean = mean(dat$y), var = var(dat$y))

# Calculating over all times for each treatment group
agg1 <- do.call(data.frame, aggregate(y ~ type, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

# Calculating for each time over all treatment groups
agg2 <- do.call(data.frame, aggregate(y ~ time, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

# Calculating for each time and treatment group
agg3 <- do.call(data.frame, aggregate(y ~ time + type, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

agg0 |> kable(caption = 'Mean and Variance across all treatments and times') |> kable_styling(latex_options = 'hold_position')

Table 3: Mean and Variance across all treatments and times

mean var
8.271186 152.607

agg1 |> kable(caption = 'Mean and Variance for each treatment across all times') |> kable_styling(latex_options = 'hold_position')

Table 4: Mean and Variance for each treatment across all times

type y.mean y.var
0 8.607143 107.9344
1 7.967742 193.9664

agg2 |> kable(caption = 'Mean and Variance across all treatments for each time') |> kable_styling(latex_options = 'hold_position')

Table 5: Mean and Variance across all treatments for each time

time y.mean y.var
1 8.949153 220.08358
2 8.355932 103.78492
3 8.440678 200.18177
4 7.338983 92.88311

agg3 |> kable(caption = 'Mean and Variance for each treatment and time') |> kable_styling(latex_options = 'hold_position')

Table 6: Mean and Variance for each treatment and time

time type y.mean y.var
1 0 9.357143 102.75661
2 0 8.285714 66.65608
3 0 8.785714 215.28571
4 0 8.000000 57.92593
1 1 8.580645 332.71828
2 1 8.419355 140.65161
3 1 8.129032 193.04946
4 1 6.741936 126.66452

We can see from the mean and variance values across all aggregations that the data are overdispersed. Mean values for treatment 1 are less than those for treatment 0 when aggregated across time. The mean values generally decrease over the time points when aggregated over treatment, and the mean values for treatment 1 are generally less than those for treatment 0 at each time point (with the exception of time point 2).


b)

#test <- flexmix(formula = y ~ time + type | pat, data = dat, k = 1, model = FLXglm(family = "poisson"))
#summary(test)
#
#parameters(test)
#
#test2 <- flexmix(formula = y ~ time + type | pat, data = dat, k = 2, model = FLXglm(family = "poisson")
#summary(test2)
#
#parameters(test2)

Model 1

# Model 1
m <- length(unique(dat$pat))
param_m1 <- c(1, 0.1, 0.1)

logl_m1 <- function(par){
  b0 = par[1]; b1 = par[2]; b2 = par[3]
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu = exp(b0 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l[i] = prod(exp(-mu)*mu^yp/factorial(yp))  # likelihood contribution of patient i
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m1 <- optim(param_m1, logl_m1, method = "BFGS", hessian = T)

OIm = solve(mixreg_m1$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m1$par; estm = rbind(parm, sem)
colnames(estm) = c("b0", "b1", "b2")

estm |> kable(caption = 'M1')

Table 7: M1

b0 b1 b2
parm 2.2940093 -0.0573954 -0.0772055
sem 0.0588292 0.0202729 0.0452718

AICm=mixreg_m1$value*2+length(parm)*2

Thus for model 1 we have an AIC value of 3280.348.


Model 2


m <- length(unique(dat$pat))
param_m2 <- c(1, 1, 0.5, 0.5, 0.7)

# Model 2
logl_m2 <- function(par){
  b01 = par[1]; b02 = par[2]; b1 = par[3]; b2 = par[4]; pi = par[5]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i] + (1-pi)*l2[i]  # two-component mixture with different intercepts
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m2 <- optim(param_m2, logl_m2, method = "BFGS", hessian = T)

OIm = solve(mixreg_m2$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m2$par; estm = rbind(parm, sem)
colnames(estm) = c("b01", "b02", "b1", "b2", "pi")

estm |> kable(caption = 'M2')

Table 8: M2

b01 b02 b1 b2 pi
parm 1.409746 3.2040306 -0.0574200 0.3080130 0.7976866
sem 0.072124 0.0638979 0.0202733 0.0521591 0.0525467

AICm=mixreg_m2$value*2+length(parm)*2

Thus for model 2 we have an AIC value of 1927.981.


Model 3

m <- length(unique(dat$pat))
param_m3 <- c(1, 1, 0.5, 0.5, 0.5, 0.7)

# Model 3
logl_m3 <- function(par){
  b01 = par[1]; b02 = par[2]; b11 = par[3]; b12 = par[4]; b2 = par[5]; pi = par[6]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu1 = exp(b01 + b11*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b12*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i] + (1-pi)*l2[i]  # mixture with component-specific intercepts and time slopes
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m3 <- optim(param_m3, logl_m3, method = "BFGS", hessian = T)

OIm = solve(mixreg_m3$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m3$par; estm = rbind(parm, sem)
colnames(estm) = c("b01", "b02", "b11", "b12", "b2", "pi")

estm |> kable(caption = 'M3')

Table 9: M3

b01 b02 b11 b12 b2 pi
parm 1.441878 3.1781987 -0.0715436 -0.0475024 0.3098578 0.7972647
sem 0.089736 0.0764258 0.0317597 0.0265506 0.0498577 0.0524799

AICm=mixreg_m3$value*2+length(parm)*2

Thus for model 3 we have an AIC value of 1929.648.
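
To make the comparison explicit, the three AIC values can be collected into a single table (a minimal sketch using the optim objects above; each $value is the minimised negative log-likelihood):

# AIC = 2 * (negative log-likelihood) + 2 * (number of parameters)
data.frame(model = c("M1", "M2", "M3"),
           AIC = c(2*mixreg_m1$value + 2*length(mixreg_m1$par),
                   2*mixreg_m2$value + 2*length(mixreg_m2$par),
                   2*mixreg_m3$value + 2*length(mixreg_m3$par))) |>
  kable(caption = 'AIC comparison')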

c)

Based on the above AIC values, we proceed with model 2, as it has the smallest AIC.

parm <- mixreg_m2$par
b01 = parm[1]; b02 = parm[2]; b1 = parm[3]; b2 = parm[4]; pi = parm[5]

ind = rep(0, m)  # posterior group-membership probabilities
l1 = rep(0, m)
l2 = rep(0, m)
l = rep(0, m)

for (i in 1:m) {
  sub_dat = dat[dat$pat == i,]
  yp = sub_dat$y
  mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
  mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
  l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
  l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
  ind[i] = pi*l1[i]/(pi*l1[i] + (1-pi)*l2[i])  # posterior probability of component 1
}

# Assigning groups to the dataframe
for (i in 1:nrow(dat)){
  dat$ind[i] <- round(ind[dat$pat[i]])
}

# Group memberships
membership <- unique(dat[, colnames(dat) %in% c('pat', 'ind')])
rownames(membership) <- NULL
membership

## pat ind
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 0
## 6 6 1
## 7 7 1
## 8 8 0
## 9 9 1
## 10 10 1
## 11 11 0
## 12 12 1
## 13 13 1
## 14 14 0
## 15 15 0
## 16 16 1
## 17 17 1
## 18 18 0
## 19 19 1
## 20 20 1
## 21 21 1
## 22 22 1
## 23 23 1


## 24 24 1
## 25 25 0
## 26 26 1
## 27 27 1
## 28 28 0
## 29 29 1
## 30 30 1
## 31 31 1
## 32 32 1
## 33 33 1
## 34 34 1
## 35 35 0
## 36 36 1
## 37 37 1
## 38 38 1
## 39 39 1
## 40 40 1
## 41 41 1
## 42 42 1
## 43 43 0
## 44 44 1
## 45 45 1
## 46 46 1
## 47 47 1
## 48 48 1
## 49 49 0
## 50 50 1
## 51 51 1
## 52 52 1
## 53 53 0
## 54 54 1
## 55 55 1
## 56 56 1
## 57 57 1
## 58 58 1
## 59 59 1

# Group Sizes
grp_size <- data.frame(aggregate(pat ~ ind + type, data = dat, FUN = function(x) length(unique(x))))

grp_size |> kable(col.names = c('Group', 'Treatment', 'Number'), caption = 'Group Sizes by treatment')

Table 10: Group Sizes by treatment

Group Treatment Number
0 0 8
1 0 20
0 1 4
1 1 27
# observed means and variances
obs <- do.call(data.frame, aggregate(y ~ time + type + ind, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

obs$ind <- as.character(obs$ind)
obs$type <- as.character(obs$type)

obs |> kable(col.names = c('Time', 'Treatment', 'Group', 'Mean', 'Variance'), caption = 'Observed Mean and Variance')

Table 11: Observed Mean and Variance

Time Treatment Group Mean Variance
1 0 0 20.500000 161.428571
2 0 0 18.875000 45.267857
3 0 0 22.125000 528.982143
4 0 0 18.000000 52.000000
1 1 0 38.250000 1826.916667
2 1 0 26.750000 656.250000
3 1 0 36.000000 590.000000
4 1 0 26.750000 585.583333
1 0 1 4.900000 13.357895
2 0 1 4.050000 11.944737
3 0 1 3.450000 6.155263
4 0 1 4.000000 4.210526
1 1 1 4.185185 17.618234
2 1 1 5.703704 27.216524
3 1 1 4.000000 17.461538
4 1 1 3.777778 7.871795

Plotting the information below,

b01 = parm[1]; b02 = parm[2]; b1 = parm[3]; b2 = parm[4]; pi = parm[5]

pred_df <- expand.grid(time = levels(factor(dat$time)), type = levels(factor(dat$type)), group = levels(factor(dat$ind)))

prediction_function <- function(pred_df){
  beta_int = ifelse(pred_df$group == 0, b01, b02)
  pred_df$mean = exp(beta_int + b1*as.numeric(as.character(pred_df$time)) + b2*as.numeric(as.character(pred_df$type)))
  pred_df
}

predicted <- prediction_function(pred_df)

#predicted$variance <- NaN
#predicted$plotting <- 'Predicted'
#obs$plotting <- 'Observed'
#names(obs) <- names(predicted)

names(obs) <- c('time', 'type', 'group', 'y.mean', 'y.var')

merged_df <- merge(obs, predicted, by = c("time", "type", "group"), all.x = TRUE)

ggplot(merged_df, aes(x = time, y = mean, color = type, group = group, linetype = group)) +
  geom_line() +
  geom_point(aes(y = y.mean)) +
  facet_wrap(~ type, labeller = 'label_both') +
  labs(colour = 'Treatment', title = 'Fitted mean lines by group and treatment')

[Figure: "Fitted mean lines by group and treatment" — fitted means (lines, by group 0/1) and observed means (points) against time (1–4), faceted by treatment type (0, 1).]

ggplot(merged_df, aes(x = time, y = y.var, color = type, group = group)) +
  geom_point() +
  facet_wrap(~ group, labeller = 'label_both', scale = 'free') +
  labs(title = 'Observed Variance by Group and Treatment', colour = 'treatment')

[Figure: "Observed Variance by Group and Treatment" — observed variances (points) against time (1–4), coloured by treatment, faceted by group (0, 1) with free y-scales.]

Question 3

a)

library(glmmTMB)  # provides the Salamanders data and glmmTMB()

dat3 <- Salamanders  # built-in data

dat3$spp <- factor(dat3$spp, levels = c("PR","EC-A","GP","DF","DM","EC-L","DES-L"))  # base level PR
dat3$mined <- factor(dat3$mined, levels = c("no","yes"))  # base level no

summary(dat3)

## site mined cover sample DOP


## R-1 : 28 no :336 Min. :-1.59152 Min. :1.00 Min. :-2.1984
## R-2 : 28 yes:308 1st Qu.:-0.69629 1st Qu.:1.75 1st Qu.:-0.3018
## R-3 : 28 Median :-0.04974 Median :2.50 Median :-0.0916
## R-4 : 28 Mean : 0.00000 Mean :2.50 Mean : 0.0000
## R-5 : 28 3rd Qu.: 0.59682 3rd Qu.:3.25 3rd Qu.: 0.0000
## R-6 : 28 Max. : 1.88993 Max. :4.00 Max. : 3.1691
## (Other):476
## Wtemp DOY spp count
## Min. :-3.0234 Min. :-2.7122 PR :92 Min. : 0.000
## 1st Qu.:-0.6139 1st Qu.:-0.5653 EC-A :92 1st Qu.: 0.000
## Median : 0.0370 Median :-0.0590 GP :92 Median : 0.000

## Mean : 0.0000 Mean : 0.0000 DF :92 Mean : 1.323
## 3rd Qu.: 0.6032 3rd Qu.: 0.9739 DM :92 3rd Qu.: 2.000
## Max. : 2.2094 Max. : 1.4600 EC-L :92 Max. :36.000
## DES-L:92

# mean/var overall
mean_count <- mean(dat3$count)
var_count <- var(dat3$count)
num_zero <- sum(dat3$count == 0)

# by spp
agg_q3 <- do.call(data.frame, aggregate(count ~ spp, data = dat3, FUN = function(x) c(mean = mean(x), var = var(x), sum = sum(x), zero = mean(x == 0))))

names(agg_q3) <- c('Species', 'Mean', 'Variance', 'Sum', 'Zeroes')

agg_q3 |> kable(caption = 'Mean, Variance, Proportion of Zeroes by Species')

Table 12: Mean, Variance, Proportion of Zeroes by Species

Species Mean Variance Sum Zeroes
PR 0.2934783 0.6711658 27 0.8478261
EC-A 0.5434783 1.3717152 50 0.7717391
GP 1.1739130 3.7935977 108 0.5869565
DF 1.2717391 3.5407310 117 0.5108696
DM 1.4782609 4.7357860 136 0.5108696
EC-L 2.1847826 21.2511945 201 0.5108696
DES-L 2.3152174 10.2402054 213 0.4673913

ggplot(data = dat3, aes(x = count)) +
  geom_histogram() +
  labs(x = "Count", title = "Distribution of Count by Species") +
  facet_wrap(~spp)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: "Distribution of Count by Species" — histograms of count (0–36), faceted by species (PR, EC-A, GP, DF, DM, EC-L, DES-L), each dominated by zero counts.]

From the summary output we can see that there is a large number of zeroes, as the median count is zero. Furthermore, there are some extreme values: the third quartile is 2, while the maximum value is 36.
We have an overall mean of 1.3229814 and variance of 6.9468427: thus there is evidence of overdispersion, as the variance is larger than the mean. There are 387 zeros overall, indicating significant zero inflation, as 60.09 percent of the data are zeros.
From the mean and variance table by species, we can see that this pattern continues when the counts are stratified by species. All species show some level of overdispersion, with some, such as DES-L and EC-L, being significantly overdispersed.
The distributions of counts by species likewise show significant zero inflation.
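
A rough comparison of the observed zero proportion with what a single Poisson distribution with the same mean would predict supports this (a minimal sketch that ignores the covariates):

# Under Poisson(mean(count)), P(Y = 0) = exp(-mean); the observed zero
# proportion is far higher, consistent with zero inflation.
c(observed = mean(dat3$count == 0), poisson_expected = exp(-mean(dat3$count)))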

b)

library(performance)  # check_overdispersion(), check_zeroinflation(), rmse()

m10 <- glm(count ~ spp + mined + cover + Wtemp + DOY, data = dat3, family = 'poisson')
summary(m10)

##
## Call:
## glm(formula = count ~ spp + mined + cover + Wtemp + DOY, family = "poisson",
## data = dat3)
##
## Coefficients:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58172 0.19354 -3.006 0.002649 **
## sppEC-A 0.61619 0.23882 2.580 0.009878 **
## sppGP 1.38629 0.21517 6.443 1.17e-10 ***
## sppDF 1.46634 0.21350 6.868 6.51e-12 ***
## sppDM 1.61682 0.21069 7.674 1.67e-14 ***
## sppEC-L 2.00747 0.20497 9.794 < 2e-16 ***
## sppDES-L 2.06546 0.20428 10.111 < 2e-16 ***
## minedyes -2.31284 0.12028 -19.229 < 2e-16 ***
## cover -0.23784 0.04137 -5.749 8.99e-09 ***
## Wtemp -0.04532 0.04047 -1.120 0.262831
## DOY 0.13311 0.03694 3.604 0.000314 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 1265.8 on 633 degrees of freedom
## AIC: 2011.1
##
## Number of Fisher Scoring iterations: 6

check_overdispersion(m10)

## # Overdispersion test
##
## dispersion ratio = 2.872
## Pearson’s Chi-Squared = 1818.083
## p-value = < 0.001

## Overdispersion detected.

check_zeroinflation(m10)

## # Check for zero-inflation


##
## Observed zeros: 387
## Predicted zeros: 302
## Ratio: 0.78

## Model is underfitting zeros (probable zero-inflation).

p_m10 <- 1 - pchisq(m10$deviance, m10$df.residual)

From the overdispersion and zero-inflation tests, it is clear that there is significant overdispersion. Furthermore, the model is underfitting zeroes, indicating that there is zero inflation. The covariates included in the model are all significant at the α = 0.05 level, with the exception of Wtemp. All the species effects spp and the variable DOY have positive effects, while mined, cover, and Wtemp have negative effects.
We can perform a hypothesis test using the residual deviance to compare the proposed model against the saturated model. The resulting p-value of 0 gives us evidence to reject the fit of this model and conclude that it fails to capture a significant amount of the variation in the data. The AIC value is 2011.10.

m20 <- glm.nb(count ~ spp + mined + cover + Wtemp + DOY, data = dat3)
summary(m20)

##
## Call:
## glm.nb(formula = count ~ spp + mined + cover + Wtemp + DOY, data = dat3,
## init.theta = 0.8358320528, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.60776 0.24071 -2.525 0.0116 *
## sppEC-A 0.59798 0.30798 1.942 0.0522 .
## sppGP 1.26602 0.29022 4.362 1.29e-05 ***
## sppDF 1.58841 0.28452 5.583 2.37e-08 ***
## sppDM 1.67338 0.28324 5.908 3.46e-09 ***
## sppEC-L 1.84542 0.28092 6.569 5.06e-11 ***
## sppDES-L 2.06604 0.27838 7.422 1.16e-13 ***
## minedyes -2.17571 0.17211 -12.642 < 2e-16 ***
## cover -0.14609 0.07724 -1.891 0.0586 .
## Wtemp -0.06120 0.06960 -0.879 0.3793
## DOY 0.03227 0.06508 0.496 0.6199
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.8358) family taken to be 1)
##
## Null deviance: 886.88 on 643 degrees of freedom
## Residual deviance: 549.00 on 633 degrees of freedom
## AIC: 1695.2
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.836
## Std. Err.: 0.109
##
## 2 x log-likelihood: -1671.220

#check_overdispersion(m20)
#check_zeroinflation(m20)

p_m20 <- 1 - pchisq(m20$deviance, m20$df.residual)

For the negative binomial model, all species except EC-A were found to be significant at the α = 0.05 level; cover, Wtemp, and DOY were not significant. The species variables spp and DOY all have positive effects, while cover, Wtemp, and mined have negative effects.
We can again perform a hypothesis test using the residual deviance, where the null hypothesis is that the proposed model provides an adequate fit. The resulting p-value of 0.993 indicates that we retain the null and conclude that this model is a good fit. The AIC value is 1695.22.

Based on the two models' AIC values, we conclude that the negative binomial model fits the data better, with an AIC of 1695.22 compared to 2011.10.

c)

library(pscl)  # zeroinfl() and vuong()

# Poisson Model
M1 <- zeroinfl(count ~ spp + mined + cover + Wtemp + DOY | mined + DOY, data = dat3, dist = "poisson")
summary(M1)

##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.3780 -0.5153 -0.3735 0.1763 8.9950
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20706 0.21994 -0.941 0.34648
## sppEC-A 0.91739 0.28844 3.180 0.00147 **
## sppGP 1.27219 0.24118 5.275 1.33e-07 ***
## sppDF 1.36875 0.23971 5.710 1.13e-08 ***
## sppDM 1.49724 0.23681 6.323 2.57e-10 ***
## sppEC-L 1.88908 0.23089 8.182 2.79e-16 ***
## sppDES-L 1.90127 0.23022 8.259 < 2e-16 ***
## minedyes -1.21981 0.15256 -7.996 1.29e-15 ***
## cover -0.25251 0.04417 -5.717 1.08e-08 ***
## Wtemp -0.08794 0.04229 -2.079 0.03758 *
## DOY 0.22235 0.04166 5.337 9.44e-08 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0079 0.1628 -6.190 6.03e-10 ***
## minedyes 2.0690 0.2532 8.173 3.01e-16 ***
## DOY 0.3021 0.1220 2.476 0.0133 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -872.9 on 14 Df

# Negative Binomial Model


M2 <- zeroinfl(count ~ spp + mined + cover + Wtemp + DOY| mined + DOY, data = dat3, dist = "negbin")
summary(M2)

##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "negbin")

##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.98598 -0.47031 -0.34711 0.08008 9.13671
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.43709 0.24233 -1.804 0.07128 .
## sppEC-A 0.72928 0.30496 2.391 0.01679 *
## sppGP 1.34014 0.27830 4.815 1.47e-06 ***
## sppDF 1.44799 0.27444 5.276 1.32e-07 ***
## sppDM 1.63121 0.27301 5.975 2.30e-09 ***
## sppEC-L 1.85662 0.26833 6.919 4.54e-12 ***
## sppDES-L 2.03587 0.26654 7.638 2.20e-14 ***
## minedyes -1.27181 0.21830 -5.826 5.68e-09 ***
## cover -0.21341 0.07198 -2.965 0.00303 **
## Wtemp -0.07159 0.06823 -1.049 0.29411
## DOY 0.13679 0.07081 1.932 0.05337 .
## Log(theta) 0.53650 0.25754 2.083 0.03724 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9214 0.4708 -4.081 4.48e-05 ***
## minedyes 2.5793 0.4764 5.415 6.14e-08 ***
## DOY 0.3298 0.1833 1.800 0.0719 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Theta = 1.71
## Number of iterations in BFGS optimization: 24
## Log-likelihood: -820.9 on 15 Df

# Testing the ZIP model against the intercept-only (null) ZIP model
M0 <- update(M1, . ~ 1)
pchisq(2*(logLik(M1) - logLik(M0)), df = 12, lower.tail = FALSE) # df = df(M1) - df(M0)

## 'log Lik.' 1.144508e-73 (df=14)

M0 <- update(M2, . ~ 1)
pchisq(2 * (logLik(M2) - logLik(M0)), df = 12, lower.tail = FALSE) # df = df(M2) - df(M0)

## 'log Lik.' 4.722142e-50 (df=15)

# Vuong test: ordinary Poisson vs. ZIP
vuong(m10, M1)

## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -5.062120 model2 > model1 2.0731e-07

## AIC-corrected -4.937309 model2 > model1 3.9604e-07
## BIC-corrected -4.658501 model2 > model1 1.5926e-06

# Vuong test: ordinary negative binomial vs. ZINB
vuong(m20, M2)

## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -2.6118644 model2 > model1 0.0045025
## AIC-corrected -2.0794122 model2 > model1 0.0187897
## BIC-corrected -0.8899942 model2 > model1 0.1867345

We draw the following conclusions from the significance tests. Both the ZIP and ZINB models (M1 and M2) were significant when compared to their intercept-only null models. The ZIP model M1 showed a significant improvement in fit over the ordinary Poisson model m10 across all p-values (raw and adjusted). The ZINB model was significant according to the raw and AIC-corrected p-values, but not according to the BIC-corrected p-value.
For our ZIP model M1, both the mined and DOY variables were significant predictors of inflated zeroes in the logit component: mined = yes increases the chance of an inflated zero relative to mined = no, and a later (scaled) day of year also increases the chance of an excess zero.
In the ZIP count model, the significance of the effects remained the same as in the ordinary Poisson model, with the exception of Wtemp, which was significant in the ZIP model (M1) but not in m10. The directions of the effects remained the same, though most decreased in magnitude, with the exception of the EC-A species effect and the effect of DOY.
In the ZINB model, the significance of effects changed: the species EC-A and the variable cover were significant in M2 but not in m20. As with the ZIP model, the directions of the effects remained the same, though most decreased in magnitude, with the exception of the EC-A and GP species effects and the effect of DOY.
For the logit component of our ZINB model M2, only the mined variable was a significant predictor of inflated zeroes at the α = 0.05 level, with mined = yes increasing the chance of an inflated zero relative to mined = no.
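
The logit coefficients above can also be read as odds ratios for membership in the structural-zero component (a minimal sketch; coef() with model = "zero" extracts the zero-inflation coefficients from a pscl zeroinfl fit):

# Odds ratios for a structural zero in the ZIP model M1.
exp(coef(M1, model = "zero"))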

d)

# random effect Poisson model
rand_poi <- glmmTMB(count ~ spp + mined + cover + Wtemp + DOY + (1|site), ziformula = ~ mined + DOY, family = poisson, data = dat3)

summary(rand_poi)

## Family: poisson ( log )


## Formula: count ~ spp + mined + cover + Wtemp + DOY + (1 | site)
## Zero inflation: ~mined + DOY
## Data: dat3
##
## AIC BIC logLik deviance df.resid
## 1772.6 1839.6 -871.3 1742.6 629

##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## site (Intercept) 0.03681 0.1918
## Number of obs: 644, groups: site, 23
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24940 0.23162 -1.077 0.28158
## sppEC-A 0.89219 0.28639 3.115 0.00184 **
## sppGP 1.27423 0.24136 5.279 1.30e-07 ***
## sppDF 1.34859 0.24070 5.603 2.11e-08 ***
## sppDM 1.51613 0.23749 6.384 1.73e-10 ***
## sppEC-L 1.90730 0.23164 8.234 < 2e-16 ***
## sppDES-L 1.89647 0.23076 8.218 < 2e-16 ***
## minedyes -1.28806 0.21929 -5.874 4.26e-09 ***
## cover -0.21068 0.08040 -2.620 0.00878 **
## Wtemp -0.11529 0.05623 -2.050 0.04032 *
## DOY 0.22158 0.04366 5.075 3.88e-07 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0336 0.1663 -6.214 5.17e-10 ***
## minedyes 1.9774 0.2893 6.835 8.20e-12 ***
## DOY 0.3207 0.1269 2.527 0.0115 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

rand_genpois <- glmmTMB(count ~ spp + mined + cover + Wtemp + DOY + (1|site), ziformula = ~ mined + DOY, family = genpois, data = dat3)

summary(rand_genpois)

## Family: genpois ( log )


## Formula: count ~ spp + mined + cover + Wtemp + DOY + (1 | site)
## Zero inflation: ~mined + DOY
## Data: dat3
##
## AIC BIC logLik deviance df.resid
## 1653.0 1724.5 -810.5 1621.0 628
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## site (Intercept) 0.2065 0.4545
## Number of obs: 644, groups: site, 23
##
## Dispersion parameter for genpois family (): 2.76
##
## Conditional model:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59389 0.31164 -1.906 0.0567 .
## sppEC-A 0.62279 0.34289 1.816 0.0693 .
## sppGP 1.40011 0.30214 4.634 3.59e-06 ***
## sppDF 1.50234 0.30063 4.997 5.81e-07 ***
## sppDM 1.65442 0.29708 5.569 2.56e-08 ***
## sppEC-L 1.87994 0.29420 6.390 1.66e-10 ***
## sppDES-L 2.05522 0.28942 7.101 1.24e-12 ***
## minedyes -1.99625 0.35244 -5.664 1.48e-08 ***
## cover -0.14094 0.14768 -0.954 0.3399
## Wtemp -0.05037 0.08576 -0.587 0.5570
## DOY 0.16226 0.07328 2.214 0.0268 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7911 0.8150 -3.425 0.000616 ***
## minedyes 1.2695 0.8535 1.488 0.136879
## DOY 0.9849 0.5588 1.762 0.078008 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

result_func <- function(x){
  cbind(model = deparse(substitute(x)), rmse = rmse(x), aic = AIC(x))
}

results_q3 <- as.data.frame(rbind(result_func(m10),
                                  result_func(m20),
                                  result_func(M1),
                                  result_func(M2),
                                  result_func(rand_poi),
                                  result_func(rand_genpois)))

results_q3 |> kable()

model rmse aic
m10 2.1873 2011.10
m20 2.2491 1695.22
M1 2.1979 1773.75
M2 2.2170 1671.79
rand_poi 2.1808 1772.55
rand_genpois 2.1638 1652.97

Based on RMSE, the models rank as follows: the ZIGP mixed model performed best, followed by the ZIP mixed model, then the Poisson GLM, the ZIP model, the ZINB model, and finally the negative binomial model.
In terms of AIC, the ZIGP mixed model performed best, followed by the ZINB model and then the negative binomial model. The worst-performing models were the ZIP mixed model, followed by the ZIP model and then the Poisson model. Note that all the models accounting for overdispersion performed better in terms of AIC.
For the ZIP mixed model (rand_poi), the significance of all the effects remained the same as in M1, and mined and DOY remained significant predictors of excess zeroes in the logit portion of the model.
For the ZIGP mixed model (rand_genpois), the species indicator EC-A and the variable cover are not significant, while DOY is significant at the α = 0.05 level, in contrast to M2. Furthermore, unlike in M2, the mined variable was not a significant predictor of inflated zero counts in the logit part of the model.

e)

Zero-inflated negative binomial model

# This is the cover case
scover <- rep(seq(min(dat3$cover), max(dat3$cover), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

predcover <- data.frame(sspp, smined, scover, zero, zero)
colnames(predcover) <- c("spp","mined","cover","Wtemp","DOY")

predcover$phat = predict(M2, predcover)

# This is the Wtemp case
sWtemp <- rep(seq(min(dat3$Wtemp), max(dat3$Wtemp), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

predWtemp <- data.frame(sspp, smined, zero, sWtemp, zero)
colnames(predWtemp) <- c("spp","mined","cover","Wtemp","DOY")

predWtemp$phat = predict(M2, predWtemp)

# This is the DOY case
sdoy <- rep(seq(min(dat3$DOY), max(dat3$DOY), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

preddoy <- data.frame(sspp, smined, zero, zero, sdoy)
colnames(preddoy) <- c("spp","mined","cover","Wtemp","DOY")

preddoy$phat = predict(M2, preddoy)

# Plotting the curves
ggplot(data = predcover, aes(x = cover, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count', x = 'Cover')

[Figure: predicted Count against Cover by species (spp), faceted by mined: no / mined: yes.]

ggplot(data = predWtemp, aes(x = Wtemp, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count')

[Figure: predicted Count against Wtemp by species (spp), faceted by mined: no / mined: yes.]

ggplot(data = preddoy, aes(x = DOY, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count')

[Figure: predicted Count against DOY by species (spp), faceted by mined: no / mined: yes.]

We can see clearly from the three plots that belonging to a site where mountaintop-removal coal mining occurred severely depletes the overall salamander counts compared to unmined sites. From the first plot, we can see that as the scaled number of cover objects in the stream increases, the overall counts decrease. Similarly, as the scaled water temperature increases, the counts of salamanders decrease. As the scaled day of year increases, the overall counts of salamanders increase; however, this pattern does not hold where mountaintop-removal coal mining has taken place, in which case the predicted count decreases over the year.

Question 4

y <- c(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3)
n <- c(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
seed <- factor(c(1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
root <- factor(c(1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0))
yc <- cbind(y,n-y)

# Interaction Model
m1 <- glm(yc ~ seed*root, family = binomial(link=logit))

# No Interaction Model
m2 <- glm(yc ~ seed + root, family = binomial(link=logit))

summary(m1)

##

## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4

summary(m2)

##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4

# test whether the change in deviance is significant, df = 1
p <- 1 - pchisq(m2$deviance - m1$deviance, 1)

To test whether our interaction effect is significant, we conduct a chi-square test on the difference of the deviances. As the difference in residual degrees of freedom between the two models is 1, the reference chi-square distribution has 1 degree of freedom. This yields a p-value of 0.0114, so we reject the null hypothesis that the simpler (additive) model is adequate and conclude that the model including the interaction term is preferred.
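
The same comparison can be obtained directly with anova() (a minimal sketch; the deviance difference and p-value should match the manual calculation above):

# Analysis of deviance comparing the additive and interaction models.
anova(m2, m1, test = "Chisq")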

b)

summary(m1)

##
## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4

pred <- data.frame(seed = factor(c(1)), root = factor(c(1)))
p <- predict(m1, newdata = pred, type = 'response')

res_p <- resid(m1, type = "pearson")
res_p[1]

## 1
## -1.396088

m1$fitted.values

## 1 2 3 4 5 6 7 8
## 0.3639706 0.3639706 0.3639706 0.3639706 0.3639706 0.6813559 0.6813559 0.6813559
## 9 10 11 12 13 14 15 16
## 0.6813559 0.6813559 0.6813559 0.3983740 0.3983740 0.3983740 0.3983740 0.3983740
## 17 18 19 20 21
## 0.5319149 0.5319149 0.5319149 0.5319149 0.5319149

Hence the probability that a seed of type (i, j) = (1, 1) germinates is 0.3639706.

(10 - p*39)/sqrt((p*39)*(39 - p*39)/39) # Pearson residual computed from the formula

## 1
## -1.396088

res_p

## 1 2 3 4 5 6
## -1.39608767 0.11451055 -1.49681854 2.16456258 0.93358010 0.79894098
## 7 8 9 10 11 12
## 0.64358638 1.50298177 -0.82617835 -1.88994181 0.67998020 0.83034028
## 13 14 15 16 17 18
## -0.72767376 -1.21769581 1.54477216 -1.62746694 -1.95715471 0.05993344
## 19 20 21
## -0.35032452 1.36731648 -0.54795962

c)

We need to find p111, p100, p110, p101, p010, p011 using our binomial model. Note that the first two indices correspond to the combination of seed and root, while the third index indicates whether or not the seed germinated. We can find the corresponding probabilities using the predict function.

pred <- data.frame(seed = factor(c(1, 1, 0)), root = factor(c(1, 0, 1)))

prob <- predict(m1, newdata = pred, type = 'response')
names(prob) <- c('p_111', 'p_101', 'p_011')
p_111 <- prob[1]; p_101 <- prob[2]; p_011 <- prob[3]

ln1 <- log( (p_111*(1-p_101))/((1 - p_111)*p_101) )
ln2 <- log( (p_111*(1-p_011))/((1-p_111)*p_011) )

unname(ln1 - ln2)

## [1] -1.172255

d)

We need to find,

P(X11 = 1, X10 = 0 | X11 + X10 = 1) = P(X11 = 1, X10 = 0)/P(X11 + X10 = 1)    (54)
= P(X11 = 1, X10 = 0)/[P(X11 = 1, X10 = 0) + P(X11 = 0, X10 = 1)]    (55)
= p111 × p100/(p111 × p100 + p110 × p101)    (56)

Thus using our seed+root model,

summary(m2)

##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4

pred <- data.frame(seed = factor(c(1, 1, 0)), root = factor(c(1, 0, 1)))

prob <- predict(m2, newdata = pred, type = 'response')
names(prob) <- c('p_111', 'p_101', 'p_011')
p_111 <- prob[1]; p_101 <- prob[2]; p_011 <- prob[3]

answer <- unname((p_111*(1-p_101)/(p_111*(1-p_101) + (1-p_111)*p_101)))

Thus we have our result, P (X11 = 1, X10 = 0|X11 + X10 = 1) = 0.2564028.

e)

Now, under the null hypothesis H0 : β2 = 0,

logit(P(X11 = 1)) = logit(P(X10 = 1)) = β0 + β1    (57)

which implies that P(X11 = 1) = P(X10 = 1), and hence P(X11 = 0) = P(X10 = 0).


Thus we have,

P (X11 = 1, X10 = 0) = P (X11 = 1)P (X10 = 0) = P (X10 = 1)P (X11 = 0) = P (X10 = 1, X11 = 0) (58)

From this we can conclude that,

P (X11 = 1, X10 = 0|X11 + X10 = 1) = P (X10 = 1, X11 = 0|X11 + X10 = 1) (59)

Now these two events are complementary and together exhaust the conditional sample space. Thus we have,

P(X11 = 1, X10 = 0 | X11 + X10 = 1) + P(X10 = 1, X11 = 0 | X11 + X10 = 1) = 1    (60)
2 P(X11 = 1, X10 = 0 | X11 + X10 = 1) = 1    (61)
P(X11 = 1, X10 = 0 | X11 + X10 = 1) = 1/2    (62)

A possible test for the situation presented in the assignment sheet is the binomial test. This is because, conditioning on X11 + X10 = 1, we have P(X11 = 1) = P(X10 = 1), so the discordant counts n10 and n01 are expected to be roughly equal. Under the null hypothesis H0 : β2 = 0, the count n10 is binomially distributed with success probability 1/2,

P(X11 ≤ n01 | X11 + X10 = n01 + n10) = Σ_{i=0}^{n01} C(n01 + n10, i) (1/2)^(n01 + n10)    (63)

Furthermore, as the success probability is 0.5, the binomial distribution is symmetric, which makes a two-sided test straightforward: we calculate the probability of a result at least as extreme as the observed n01 and multiply this value by two.
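
In practice such a test can be carried out with binom.test(); a minimal sketch with hypothetical discordant counts n10 and n01 (placeholder values, not taken from the assignment data):

# Exact binomial (sign-type) test of H0: beta_2 = 0 based only on the
# discordant counts; n10 and n01 here are hypothetical.
n10 <- 12; n01 <- 5
binom.test(n10, n10 + n01, p = 0.5, alternative = "two.sided")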

Question 5

a)

We will prove the result by obtaining the marginal distribution of Y by integrating out the random variable
P.
We have that,

f(y) = ∫₀¹ f(y, p) dp    (64)
     = ∫₀¹ f(y|p) f(p) dp    (65)
     = ∫₀¹ C(n, y) p^y (1 − p)^(n−y) × p^(a−1)(1 − p)^(b−1)/B(a, b) dp    (66)
     = C(n, y) [1/B(a, b)] ∫₀¹ p^(y+a−1)(1 − p)^(n−y+b−1) dp    (67)
     = C(n, y) [B(a + y, n + b − y)/B(a, b)] ∫₀¹ p^(y+a−1)(1 − p)^(n−y+b−1)/B(a + y, n + b − y) dp    (68)
     = C(n, y) B(a + y, n + b − y)/B(a, b)    (69)

where C(n, y) denotes the binomial coefficient.

The integral evaluates to one because the integrand is the pdf of a Beta(a + y, n + b − y) distribution integrated over its entire domain.
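
A quick numerical check of the derived pmf (a minimal sketch with arbitrary a, b and n): the probabilities should sum to one over y = 0, ..., n.

# Beta-binomial pmf from (69): C(n, y) * B(a + y, n + b - y) / B(a, b).
a <- 2; b <- 3; n <- 10
y <- 0:n
sum(choose(n, y) * beta(a + y, n + b - y) / beta(a, b))  # should equal 1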

b)

We will first deduce the result for the mean. We have that,

E(Y) = E_Be[E_Bi(Y|P)]    (70)
     = E_Be[np]    (71)
     = n E_Be[p]    (72)
     = n a/(a + b)    (73)

We will now deduce the variance in two parts. We have,

Var(Y) = E[Var(Y|P)] + Var[E(Y|P)]    (74)

We will first simplify the right-hand term,

Var[E(Y|P)] = Var[np]    (75)
            = n² Var(p)    (76)
            = n² ab/[(a + b)²(a + b + 1)]    (77)

Now,

E[Var(Y|P)] = E[np(1 − p)]    (78)
            = n E[p − p²]    (79)
            = n [E(p) − E(p²)]    (80)

Now,

E(p²) = Var(p) + E(p)²    (81)
      = ab/[(a + b)²(a + b + 1)] + [a/(a + b)]²    (82)
      = [ab + a²(a + b + 1)]/[(a + b)²(a + b + 1)]    (83)

Thus,

n [E(p) − E(p²)] = n [a/(a + b) − (ab + a²(a + b + 1))/((a + b)²(a + b + 1))]    (84)
                 = n [a(a + b)(a + b + 1) − ab − a²(a + b + 1)]/[(a + b)²(a + b + 1)]    (85)

Thus we have,

Var(Y) = E[Var(Y|P)] + Var[E(Y|P)]    (86)
= n [a(a + b)(a + b + 1) − ab − a²(a + b + 1)]/[(a + b)²(a + b + 1)] + n² ab/[(a + b)²(a + b + 1)]    (87)
= n a[(a + b)(a + b + 1) − b − a(a + b + 1) + nb]/[(a + b)²(a + b + 1)]    (88)
= n a[b(n − 1) + b(a + b + 1)]/[(a + b)²(a + b + 1)]    (89)
= n [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (90)

Now we wish to show that the beta-binomial distribution allows higher dispersion than the binomial distribution with p = a/(a + b). Let X be a binomial random variable with this success probability. We have,

Var(X) = np(1 − p) = n [a/(a + b)][b/(a + b)]    (91)

Now when n = 1, the variances of the binomial and the beta-binomial distributions coincide: Var(Y) = Var(X). When n > 1, we have,

 
n−1
1+ >1 (92)
(a + b + 1)
 
a b n−1 a b
n 1+ >n (93)
a+ba+b (a + b + 1) a+ba+b
V ar(Y ) > V ar(X) (94)

Thus the beta-binomial distribution allows for higher dispersion.
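
The mean and variance formulas can be verified by simulation (a minimal sketch with arbitrary a, b and n):

# Simulate Y by drawing p ~ Beta(a, b) and then Y | p ~ Binomial(n, p).
set.seed(1)
a <- 2; b <- 3; n <- 10
p <- rbeta(1e5, a, b)
ysim <- rbinom(1e5, n, p)
c(sim_var = var(ysim),
  formula_var = n*(a/(a + b))*(b/(a + b))*(1 + (n - 1)/(a + b + 1)),
  binomial_var = n*(a/(a + b))*(b/(a + b)))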

c)

We have the delta method approximation Var[g(Y)] ≈ σ²[g′(µ)]². We now wish to find σ², µ, and g′(µ).
First, g′(p),

 
g(p) = ln[p/(1 − p)]    (95)
     = ln(p) − ln(1 − p)    (96)
g′(p) = 1/p + 1/(1 − p)    (97)
      = 1/[p(1 − p)]    (98)

Now σ² = Var(Y/n),

Var(Y/n) = (1/n²) Var(Y)    (99)
         = (1/n²) × n [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (100)
         = (1/n) [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (101)

Now µ = E(Y/n) = a/(a + b), hence,

g′(µ) = 1/{[a/(a + b)][b/(a + b)]}    (102)
      = (a + b)²/(ab)    (103)

Thus we have,

Var[g(Y)] ≈ σ²[g′(µ)]²    (104)
= [(a + b)²/(ab)]² × (1/n)[a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (105)
= [(a + b)²/(ab)]² × [(1/n) ab/(a + b)² + (ab/(a + b)²)(1 − 1/n)/(a + b + 1)]    (106)

Now as n → ∞,

Var[g(Y)] → [(a + b)⁴/(ab)²] × ab/[(a + b)²(a + b + 1)]    (107)
          = (a + b)²/[ab(a + b + 1)]    (108)

as required.
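
As a rough numerical check (a minimal sketch; the delta method is first-order, so agreement is approximate and improves as a and b grow), the limiting formula can be compared with the exact variance of logit(P) for P ~ Beta(a, b), which equals trigamma(a) + trigamma(b):

# Delta-method limit vs. exact Var(logit(P)) for P ~ Beta(a, b).
a <- 20; b <- 20
c(delta = (a + b)^2/(a*b*(a + b + 1)), exact = trigamma(a) + trigamma(b))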
