Assignment3 Finaldraft
Lewis Hastie
2023-10-06
Question 1
a)
##
## Call:
## glm(formula = y ~ offset(log(t)) + g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 124.138 on 12 degrees of freedom
## Residual deviance: 17.676 on 11 degrees of freedom
## AIC: 57.94
##
## Number of Fisher Scoring iterations: 5
Table 1: y/t summary table
Mean Variance
0.245362 0.0731728
summary(glm_nb)
##
## Call:
## glm.nb(formula = y ~ offset(log(t)) + g, init.theta = 86681.79322,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for Negative Binomial(86681.79) family taken to be 1)
##
## Null deviance: 124.128 on 12 degrees of freedom
## Residual deviance: 17.675 on 11 degrees of freedom
## AIC: 59.94
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 86682
## Std. Err.: 2544076
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -53.94
The Poisson model provides the better fit: its AIC of 57.94 is lower than the AIC of 59.94 for the negative binomial (NB) model. The data show no evidence of overdispersion, as the summary table shows that the variance is not larger than the mean. Furthermore, the estimated theta in the NB model is very large, with a very large standard error, so the NB model effectively collapses to the Poisson model, which is therefore adequate.
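The two AIC values can be reproduced by hand from the reported log-likelihood. The sketch below assumes that the Poisson and NB fits share the same maximised log-likelihood to the reported precision (which the two AIC values imply) and that the parameter counts are 2 and 3; the variable names are illustrative only.

```r
# Value copied from the NB summary above ("2 x log-likelihood").
two_loglik <- -53.94
k_pois <- 2   # intercept and g1
k_nb   <- 3   # intercept, g1 and theta

# AIC = -2 * log-likelihood + 2 * number of parameters
aic_pois <- -two_loglik + 2 * k_pois
aic_nb   <- -two_loglik + 2 * k_nb
c(poisson = aic_pois, negbin = aic_nb)
```

Under these assumptions the gap of 2 between the models is exactly the penalty for estimating theta, so the NB model buys no improvement in likelihood.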
b)
# Without t information ?
glm_pois_2 <- glm(y ~ g, family = poisson)
summary(glm_pois_2)
##
## Call:
## glm(formula = y ~ g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.1925 6.321 2.61e-10 ***
## g1 1.2850 0.2312 5.559 2.71e-08 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 107.246 on 12 degrees of freedom
## Residual deviance: 72.966 on 11 degrees of freedom
## AIC: 113.23
##
## Number of Fisher Scoring iterations: 6
##
## Call:
## glm.nb(formula = y ~ g, init.theta = 0.9387015914, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.4126 2.948 0.00319 **
## g1 1.2850 0.6322 2.033 0.04208 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.9387) family taken to be 1)
##
## Null deviance: 19.423 on 12 degrees of freedom
## Residual deviance: 15.151 on 11 degrees of freedom
## AIC: 79.027
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.939
## Std. Err.: 0.489
##
## 2 x log-likelihood: -73.027
Table 2: y summary table
Mean Variance
6.769231 59.52564
In this instance the NB model has the better fit, with an AIC of 79.027 compared with 113.23 for the Poisson model. This makes intuitive sense: the theta value in the NB model is small with a small standard error, suggesting that the NB model is suitable, and that when the information about t_i is omitted, the variance is larger than the mean.
This is supported by the data: the summary table shows substantial overdispersion, as the variance is much larger than the mean, and so the negative binomial model is preferred.
c)
i)
To find our moment estimators, we equate the theoretical mean with the sample mean, and the theoretical variance with the sample variance.
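For our mean,
\begin{align}
E(Y) &= \bar{y} \tag{1}\\
(1-\pi)\lambda &= \bar{y} \tag{2}\\
\lambda &= \frac{\bar{y}}{1-\pi} \tag{3}
\end{align}
\begin{align}
\mathrm{Var}(Y) &= s^2 \tag{4}\\
\lambda(1-\pi)(1+\pi\lambda) &= s^2 \tag{5}\\
\frac{\bar{y}}{1-\pi}\times(1-\pi)\left(1+\frac{\bar{y}\pi}{1-\pi}\right) &= s^2 \tag{6}\\
1+\frac{\bar{y}\pi}{1-\pi} &= \frac{s^2}{\bar{y}} \tag{7}\\
1-\pi+\bar{y}\pi &= \frac{s^2}{\bar{y}}-\frac{s^2}{\bar{y}}\pi \tag{8}\\
\frac{s^2}{\bar{y}}\pi-\pi+\bar{y}\pi &= \frac{s^2}{\bar{y}}-1 \tag{9}\\
\pi\left(\frac{s^2}{\bar{y}}-1+\bar{y}\right) &= \frac{s^2}{\bar{y}}-1 \tag{10}\\
\hat{\pi} &= \left(\frac{s^2}{\bar{y}}-1\right)\Big/\left(\frac{s^2}{\bar{y}}-1+\bar{y}\right) \tag{11}\\
&= \frac{s^2-\bar{y}}{s^2+\bar{y}^2-\bar{y}} \tag{12}\\
\lambda &= \frac{\bar{y}}{1-\pi} \tag{13}\\
&= \bar{y}\Big/\frac{s^2+\bar{y}^2-\bar{y}-(s^2-\bar{y})}{s^2+\bar{y}^2-\bar{y}} \tag{14}\\
&= \bar{y}\Big/\frac{\bar{y}^2}{s^2+\bar{y}^2-\bar{y}} \tag{15}\\
\hat{\lambda} &= \frac{s^2+\bar{y}^2-\bar{y}}{\bar{y}} \tag{16}
\end{align}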
as required.
Thus we can find the estimates for our data.
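As a minimal sketch, the moment estimators (12) and (16) can be evaluated directly from the sample moments. The values below are taken from Table 2, on the assumption that part c) uses the same counts y; the variable names are hypothetical.

```r
# Sample moments from Table 2 (assumed to be the data used in part c).
ybar <- 6.769231
s2   <- 59.52564

pi_mom     <- (s2 - ybar) / (s2 + ybar^2 - ybar)   # equation (12)
lambda_mom <- (s2 + ybar^2 - ybar) / ybar          # equation (16)

# Plugging the estimates back into the theoretical moments
# should return the sample mean and variance.
mean_check <- (1 - pi_mom) * lambda_mom
var_check  <- lambda_mom * (1 - pi_mom) * (1 + pi_mom * lambda_mom)
c(pi = pi_mom, lambda = lambda_mom, mean = mean_check, var = var_check)
```

The back-substitution is a cheap way to confirm the algebra in (1) to (16) without relying on the printed estimates.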
ii)
To find our MLE estimators, first note that $n_0$ represents the number of zeros in the sample.
We have our likelihood, which is the product of $n_0$ zero terms and $n - n_0$ non-zero terms:
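\begin{align}
L(\lambda,\pi) &= \left[\pi+(1-\pi)e^{-\lambda}\right]^{n_0}\times\prod_{i=1}^{n-n_0}(1-\pi)\frac{\lambda^{y_i}e^{-\lambda}}{y_i!} \tag{17}\\
\ell(\lambda,\pi) &= n_0\ln\left[\pi+(1-\pi)e^{-\lambda}\right]+\sum_{i=1}^{n-n_0}\ln\left[(1-\pi)\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\right] \tag{18}\\
&= n_0\ln\left[\pi+(1-\pi)e^{-\lambda}\right]+(n-n_0)\ln(1-\pi)+\sum_{i=1}^{n-n_0}\ln\left[\frac{\lambda^{y_i}e^{-\lambda}}{y_i!}\right] \tag{19}\\
&= n_0\ln\left[\pi+(1-\pi)e^{-\lambda}\right]+(n-n_0)\ln(1-\pi)+\ln(\lambda)\sum_{i=1}^{n-n_0}y_i-(n-n_0)\lambda-\sum_{i=1}^{n-n_0}\ln(y_i!) \tag{20}
\end{align}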
In order to prove the required results, we will employ a method known as profile likelihood estimation, in
which we first find the MLE of π(λ), and then substitute this result into ℓ to find the MLE of λ.
Taking the derivative with respect to π yields,
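\begin{align}
\frac{\partial\ell}{\partial\pi} &= \frac{n_0(1-e^{-\lambda})}{\pi+(1-\pi)e^{-\lambda}}-\frac{n-n_0}{1-\pi} \tag{21}\\
\frac{n-n_0}{1-\pi} &= \frac{n_0(1-e^{-\lambda})}{\pi+(1-\pi)e^{-\lambda}} \tag{22}\\
\frac{(n-n_0)\pi+(n-n_0)(1-\pi)e^{-\lambda}}{1-\pi} &= n_0(1-e^{-\lambda}) \tag{23}\\
\frac{(n-n_0)\pi}{1-\pi}+(n-n_0)e^{-\lambda} &= n_0(1-e^{-\lambda}) \tag{24}\\
\frac{(n-n_0)\pi}{1-\pi} &= n_0-ne^{-\lambda} \tag{25}\\
(n-n_0)\pi &= (n_0-ne^{-\lambda})-\pi(n_0-ne^{-\lambda}) \tag{26}\\
\pi(n_0-ne^{-\lambda}+n-n_0) &= n_0-ne^{-\lambda} \tag{27}\\
\pi &= \frac{n_0-ne^{-\lambda}}{n-ne^{-\lambda}} \tag{28}\\
&= \frac{n_0/n-e^{-\lambda}}{1-e^{-\lambda}} \tag{29}
\end{align}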
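We now substitute this value back into $\ell$, letting $r_0 = n_0/n$, so that $1-\pi = (1-r_0)/(1-e^{-\lambda})$,
\begin{align}
\ell &= n_0\ln\left[\frac{r_0-e^{-\lambda}}{1-e^{-\lambda}}+\frac{1-r_0}{1-e^{-\lambda}}e^{-\lambda}\right]+(n-n_0)\ln\left[\frac{1-r_0}{1-e^{-\lambda}}\right]+n\ln(\lambda)\bar{y}-(n-n_0)\lambda \tag{30}\\
&= n_0\ln(r_0)+(n-n_0)\left[\ln(1-r_0)-\ln(1-e^{-\lambda})\right]+n\ln(\lambda)\bar{y}-(n-n_0)\lambda \tag{31}
\end{align}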
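Setting to zero and solving,
\begin{align}
\pi &= \frac{r_0-e^{-\lambda}}{1-e^{-\lambda}} = 1+\frac{r_0-1}{1-e^{-\lambda}} \tag{40}\\
\frac{\bar{y}}{\hat{\lambda}} &= \frac{1-r_0}{1-e^{-\hat{\lambda}}} \tag{41}\\
\hat{\pi} &= 1-\frac{\bar{y}}{\hat{\lambda}} \tag{42}
\end{align}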
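The intermediate steps between (31) and (40) are not shown above. The following is a sketch of how (41) and (42) can be obtained from the profile log-likelihood (31); it is one possible route, not necessarily the one originally taken.

```latex
\begin{align*}
\frac{d\ell}{d\lambda}
  &= -(n-n_0)\frac{e^{-\lambda}}{1-e^{-\lambda}} + \frac{n\bar{y}}{\lambda} - (n-n_0) = 0 \\
\frac{n\bar{y}}{\lambda}
  &= (n-n_0)\left(1+\frac{e^{-\lambda}}{1-e^{-\lambda}}\right)
   = \frac{n-n_0}{1-e^{-\lambda}} \\
\frac{\bar{y}}{\hat{\lambda}}
  &= \frac{1-r_0}{1-e^{-\hat{\lambda}}} \\
\hat{\pi}
  &= 1+\frac{r_0-1}{1-e^{-\hat{\lambda}}}
   = 1-\frac{\bar{y}}{\hat{\lambda}}
\end{align*}
```

The last line substitutes the third line into the expression for $\pi$ in terms of $\lambda$.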
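Now for our hurdle model, we have the log-likelihood equation (from lecture notes)
\begin{align}
\ell(\pi,\lambda) &= n_0\ln\pi+n_1\ln(1-\pi)+\ln\lambda\sum_{i=1}^{n_1}y_i-n_1\lambda-\sum_{i=1}^{n_1}\ln(y_i!)-n_1\ln(1-e^{-\lambda}) \tag{43}\\
\frac{\partial\ell}{\partial\pi} &= \frac{n_0}{\pi}-\frac{n_1}{1-\pi} \tag{44}\\
0 &= \frac{n_0}{\pi}-\frac{n_1}{1-\pi} \tag{45}\\
\frac{n_1}{1-\pi} &= \frac{n_0}{\pi} \tag{46}\\
\pi(n_0+n_1) &= n_0 \tag{47}\\
\hat{\pi} &= \frac{n_0}{n} \tag{48}
\end{align}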
as required.
To calculate the MLE of λ, we differentiate the log-likelihood with respect to λ.
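\begin{align}
\frac{\partial\ell}{\partial\lambda} &= \frac{n\bar{y}}{\lambda}-n_1-n_1\times\frac{e^{-\lambda}}{1-e^{-\lambda}} \tag{49}
\end{align}
Setting to zero,
\begin{align}
0 &= \frac{n\bar{y}}{\lambda}-n_1-n_1\times\frac{e^{-\lambda}}{1-e^{-\lambda}} \tag{50}\\
n_1\left(1+\frac{e^{-\lambda}}{1-e^{-\lambda}}\right) &= \frac{n\bar{y}}{\lambda} \tag{51}\\
\frac{n_1}{n}\lambda &= \bar{y}(1-e^{-\lambda}) \tag{52}\\
\lambda\left(1-\frac{n_0}{n}\right) &= \bar{y}(1-e^{-\lambda}) \tag{53}
\end{align}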
as required.
We can now solve numerically using the uniroot function.
For our zero-inflated Poisson model we obtain λ̂ = 8.7986718 and π̂ = 0.2306531. For our hurdle model, we have λ̂Hurdle = 8.7986718 and π̂Hurdle = 0.2307692.
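The root-finding step is not shown, so here is a minimal sketch of how equation (53) could be solved with uniroot. The inputs are assumptions inferred from the output above rather than read from the data: n = 13 (12 null degrees of freedom plus one), n0 = 3 (from π̂Hurdle = 0.2307692 = 3/13) and a total count of 88 (mean 6.769231 times 13).

```r
# Inferred inputs (assumptions, see the paragraph above).
n    <- 13
n0   <- 3
ybar <- 88 / n
r0   <- n0 / n

# Equation (53) written as g(lambda) = 0.
score <- function(lambda) lambda * (1 - r0) - ybar * (1 - exp(-lambda))

# lambda = 0 is a trivial root, so the search interval starts away from zero.
lambda_hat <- uniroot(score, interval = c(1, 50), tol = 1e-10)$root

pi_zip    <- 1 - ybar / lambda_hat   # equation (42)
pi_hurdle <- r0                      # equation (48)
c(lambda = lambda_hat, pi_zip = pi_zip, pi_hurdle = pi_hurdle)
```

Equations (41) and (53) are the same condition on λ, which is why the ZIP and hurdle models share the same λ̂ while their π̂ values differ slightly.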
Question 2
a)
agg2 <- do.call(data.frame, aggregate(y ~ time, data = dat, FUN = function(x) c(mean = mean(x), var = va
agg0 |> kable(caption = 'Mean and Variance across all treatments and times') |> kable_styling(latex_opti
mean var
8.271186 152.607
agg1 |> kable(caption = 'Mean and Variance for each treatment across all times') |> kable_styling(latex_
Table 4: Mean and Variance for each treatment across all times
agg2 |> kable(caption = 'Mean and Variance across all treatments for each time') |> kable_styling(latex
Table 5: Mean and Variance across all treatments for each time
agg3 |> kable(caption = 'Mean and Variance for each treatment and time') |> kable_styling(latex_options
We can see from the mean and variance values across all aggregates that our data are overdispersed. Mean values for treatment 1 are lower than those for treatment 0 when aggregated across time. The mean values decrease at each time point when aggregated over treatment, and the mean values for treatment 1 are generally lower than for treatment 0 at each time point (with the exception of time point 3).
Table 7: M1
b0 b1 b2
parm 2.2940093 -0.0573954 -0.0772055
sem 0.0588292 0.0202729 0.0452718
b)
#test <- flexmix(formula = y ~ time + type | pat, data = dat, k = 1, model = FLXglm(family = "poisson"))
#summary(test)
#
#parameters(test)
#
#test2 <- flexmix(formula = y ~ time + type | pat, data = dat, k = 2, model = FLXglm(family = "poisson")
#summary(test2)
#
#parameters(test2)
Model 1
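# Model 1: single-component Poisson regression
m <- length(unique(dat$pat))
param_m1 <- c(1, 0.1, 0.1)  # starting values for b0, b1, b2
# Negative log-likelihood, accumulated patient by patient
logl_m1 <- function(par){
  b0 = par[1]; b1 = par[2]; b2 = par[3]
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu = exp(b0 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l[i] = prod(exp(-mu)*mu^yp/factorial(yp))
  }
  ll = sum(log(l))
  return(-ll)
}
# Assumed fitting step (not shown in the original listing): minimise with optim and keep the Hessian
mixreg_m1 <- optim(param_m1, logl_m1, hessian = TRUE)
# Standard errors from the inverse of the observed information
OIm=solve(mixreg_m1$hessian); sem=sqrt(diag(OIm))
parm=mixreg_m1$par; estm=rbind(parm,sem)
colnames(estm)=c("b0","b1","b2")
AICm=mixreg_m1$value*2+length(parm)*2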
Table 8: M2
b01 b02 b1 b2 pi
parm 1.409746 3.2040306 -0.0574200 0.3080130 0.7976866
sem 0.072124 0.0638979 0.0202733 0.0521591 0.0525467
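# Model 2: two-component mixture with component-specific intercepts
m <- length(unique(dat$pat))
param_m2 <- c(1, 1, 0.5, 0.5, 0.7)  # starting values for b01, b02, b1, b2, pi
logl_m2 <- function(par){
  b01 = par[1]; b02 = par[2]; b1 = par[3]; b2 = par[4]; pi = par[5]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    # Component means: missing from the original listing, inferred from the parameter names
    mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i]+(1-pi)*l2[i]
  }
  ll = sum(log(l))
  return(-ll)
}
# Assumed fitting step (not shown in the original listing)
mixreg_m2 <- optim(param_m2, logl_m2, hessian = TRUE)
OIm=solve(mixreg_m2$hessian); sem=sqrt(diag(OIm))
parm=mixreg_m2$par; estm=rbind(parm,sem)
colnames(estm)=c("b01", "b02", "b1", "b2", "pi")
AICm=mixreg_m2$value*2+length(parm)*2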
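Table 9: M3
# Model 3: two-component mixture with component-specific intercepts and time slopes
m <- length(unique(dat$pat))
param_m3 <- c(1, 1, 0.5, 0.5, 0.5, 0.7)  # starting values for b01, b02, b11, b12, b2, pi
logl_m3 <- function(par){
  b01 = par[1]; b02 = par[2]; b11 = par[3]; b12 = par[4]; b2 = par[5]; pi = par[6]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    # Component means: missing from the original listing, inferred from the parameter names
    mu1 = exp(b01 + b11*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b12*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i]+(1-pi)*l2[i]
  }
  ll = sum(log(l))
  return(-ll)
}
# Assumed fitting step (not shown in the original listing)
mixreg_m3 <- optim(param_m3, logl_m3, hessian = TRUE)
OIm=solve(mixreg_m3$hessian); sem=sqrt(diag(OIm))
parm=mixreg_m3$par; estm=rbind(parm,sem)
colnames(estm)=c("b01", "b02", "b11", "b12", "b2", "pi")
AICm=mixreg_m3$value*2+length(parm)*2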
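The mixture likelihoods above multiply Poisson probabilities with prod() and factorial(), which can underflow or overflow for longer series or larger counts. The sketch below shows a log-scale alternative for the two-component model with shared slopes. It runs on simulated data: dat_sim, nll_mix and all parameter values are hypothetical stand-ins, and the mixing weight is passed through plogis() so the optimiser cannot leave (0, 1), which is a design choice of this sketch rather than something the original code does.

```r
# Simulated stand-in for `dat`; every name and value here is hypothetical.
set.seed(1)
m <- 30
dat_sim <- data.frame(pat = rep(1:m, each = 4), time = rep(1:4, times = m))
dat_sim$type <- rep(rbinom(m, 1, 0.5), each = 4)
z <- rep(rbinom(m, 1, 0.7), each = 4)   # latent component, 1 = first component
dat_sim$y <- rpois(nrow(dat_sim), exp(ifelse(z == 1, 1.4, 3.2) - 0.05 * dat_sim$time))

# Negative log-likelihood of the two-component mixture, computed on the log scale.
nll_mix <- function(par, d) {
  b01 <- par[1]; b02 <- par[2]; b1 <- par[3]; b2 <- par[4]
  p <- plogis(par[5])                    # mixing weight kept inside (0, 1)
  eta <- b1 * d$time + b2 * d$type
  # Per-patient log-likelihood under each component
  ll1 <- tapply(dpois(d$y, exp(b01 + eta), log = TRUE), d$pat, sum)
  ll2 <- tapply(dpois(d$y, exp(b02 + eta), log = TRUE), d$pat, sum)
  a <- log(p) + ll1
  b <- log1p(-p) + ll2
  mx <- pmax(a, b)
  -sum(mx + log(exp(a - mx) + exp(b - mx)))   # log-sum-exp per patient
}

start <- c(1, 2, 0, 0, qlogis(0.7))
fit <- optim(start, nll_mix, d = dat_sim, hessian = TRUE)
fit$value
```

Because the weight is estimated on the logit scale here, its standard error from the Hessian is also on that scale and would need the delta method to be compared with the pi column in Table 8.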
c)
Based on the AIC values above, we proceed with model 2, as it has the smallest AIC value.
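# Posterior probability of belonging to component 1 under model 2
b01 = parm[1]; b02 = parm[2]; b1 = parm[3]; b2 = parm[4]; pi = parm[5]
l1 = rep(0, m)
l2 = rep(0, m)
ind = rep(0, m)
for (i in 1:m) {
  sub_dat = dat[dat$pat == i,]
  # Component means: missing from the original listing, inferred from the parameter names
  mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
  mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
  yp = sub_dat$y
  l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
  l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
  ind[i] = pi*l1[i]/(pi*l1[i]+(1-pi)*l2[i])
}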
# Group memberships
membership <- unique(dat[, colnames(dat) %in% c('pat', 'ind')])
rownames(membership) <- NULL
membership
## pat ind
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 0
## 6 6 1
## 7 7 1
## 8 8 0
## 9 9 1
## 10 10 1
## 11 11 0
## 12 12 1
## 13 13 1
## 14 14 0
## 15 15 0
## 16 16 1
## 17 17 1
## 18 18 0
## 19 19 1
## 20 20 1
## 21 21 1
## 22 22 1
## 23 23 1
Table 10: Group Sizes by treatment
## 24 24 1
## 25 25 0
## 26 26 1
## 27 27 1
## 28 28 0
## 29 29 1
## 30 30 1
## 31 31 1
## 32 32 1
## 33 33 1
## 34 34 1
## 35 35 0
## 36 36 1
## 37 37 1
## 38 38 1
## 39 39 1
## 40 40 1
## 41 41 1
## 42 42 1
## 43 43 0
## 44 44 1
## 45 45 1
## 46 46 1
## 47 47 1
## 48 48 1
## 49 49 0
## 50 50 1
## 51 51 1
## 52 52 1
## 53 53 0
## 54 54 1
## 55 55 1
## 56 56 1
## 57 57 1
## 58 58 1
## 59 59 1
# Group Sizes
grp_size <- data.frame(aggregate(pat ~ ind + type, data = dat, FUN = function(x) length(unique(x))))
grp_size |> kable(col.names = c('Group', 'Treatment', 'Number'), caption = 'Group Sizes by treatment')
# observed means and variances
obs <- do.call(data.frame, aggregate(y ~ time + type + ind , data = dat, FUN = function(x) c(mean = mean
obs |> kable(col.names = c('Time', 'Treatment', 'Group', 'Mean', 'Variance'), caption = 'Observed Mean a
ggplot(merged_df, aes(x = time, y = mean, color = type, group = group, linetype = group)) +
geom_line() +
geom_point(aes(y = y.mean)) +
facet_wrap(~ type, labeller = 'label_both') +
labs(colour = 'Treatment', title = 'Fitted mean lines by group and treatment')
[Figure: Fitted mean lines by group and treatment. x-axis: time; y-axis: mean; one panel per treatment (0, 1); line type by group (0, 1).]
[Figure: Observed Variance by Group and Treatment. x-axis: time; y-axis: y.var; panels group: 0 and group: 1; colour by treatment (0, 1).]
Question 3
a)
summary(dat3)
Table 12: Mean, Variance, Proportion of Zeroes by Species
# mean/var overall
mean_count <- mean(dat3$count)
var_count <- var(dat3$count)
num_zero <- sum(dat3$count == 0)
# by spp
agg_q3 <- do.call(data.frame, aggregate(count ~ spp, data = dat3, FUN = function(x) c(mean = mean(x), va
[Figure: Distribution of Count by Species. Histograms for PR, EC-A, GP, DF, DM, EC-L and DES-L; x-axis: Count; y-axis: count.]
From the summary output we can see that there is a large number of zeroes, as the median is zero. Furthermore, there are some extreme values, as the third quartile is 2 while the maximum value is 36.
We have an overall mean of 1.3229814 and a variance of 6.9468427: thus there is evidence of overdispersion, as the variance is larger than the mean. There are 387 zeros overall, indicating substantial zero inflation, as 60.0931677 percent of the observations are zeros.
From the mean and variance table by species, we can see that this pattern continues when the counts are stratified by species. All species show some level of overdispersion, with some species, such as DES-L and EC-L, being strongly overdispersed.
From the distributions of counts by species, we see that there is substantial zero inflation.
b)
m10 <- glm(count ~ spp + mined + cover + Wtemp + DOY, data = dat3, family = 'poisson')
summary(m10)
##
## Call:
## glm(formula = count ~ spp + mined + cover + Wtemp + DOY, family = "poisson",
## data = dat3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58172 0.19354 -3.006 0.002649 **
## sppEC-A 0.61619 0.23882 2.580 0.009878 **
## sppGP 1.38629 0.21517 6.443 1.17e-10 ***
## sppDF 1.46634 0.21350 6.868 6.51e-12 ***
## sppDM 1.61682 0.21069 7.674 1.67e-14 ***
## sppEC-L 2.00747 0.20497 9.794 < 2e-16 ***
## sppDES-L 2.06546 0.20428 10.111 < 2e-16 ***
## minedyes -2.31284 0.12028 -19.229 < 2e-16 ***
## cover -0.23784 0.04137 -5.749 8.99e-09 ***
## Wtemp -0.04532 0.04047 -1.120 0.262831
## DOY 0.13311 0.03694 3.604 0.000314 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 1265.8 on 633 degrees of freedom
## AIC: 2011.1
##
## Number of Fisher Scoring iterations: 6
check_overdispersion(m10)
## # Overdispersion test
##
## dispersion ratio = 2.872
## Pearson’s Chi-Squared = 1818.083
## p-value = < 0.001
## Overdispersion detected.
check_zeroinflation(m10)
From the overdispersion and zero-inflation tests, it is clear that there is significant overdispersion. Furthermore, the model is underfitting zeroes, indicating that there is zero inflation. The covariates included in the model are all significant at the α = 0.05 level, with the exception of Wtemp. All the species effects spp and the variable DOY were found to have positive effects, while the variables mined, cover, and Wtemp were found to have negative effects.
We can perform a hypothesis test using the residual deviance to compare the proposed model with the saturated model. The resulting p-value is effectively 0, so we reject the hypothesis that the model fits adequately and conclude that it fails to capture a significant amount of variation in the data. The AIC value is 2011.1001221.
m20 <- glm.nb(count ~ spp + mined + cover + Wtemp + DOY, data = dat3)
summary(m20)
##
## Call:
## glm.nb(formula = count ~ spp + mined + cover + Wtemp + DOY, data = dat3,
## init.theta = 0.8358320528, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.60776 0.24071 -2.525 0.0116 *
## sppEC-A 0.59798 0.30798 1.942 0.0522 .
## sppGP 1.26602 0.29022 4.362 1.29e-05 ***
## sppDF 1.58841 0.28452 5.583 2.37e-08 ***
## sppDM 1.67338 0.28324 5.908 3.46e-09 ***
## sppEC-L 1.84542 0.28092 6.569 5.06e-11 ***
## sppDES-L 2.06604 0.27838 7.422 1.16e-13 ***
## minedyes -2.17571 0.17211 -12.642 < 2e-16 ***
## cover -0.14609 0.07724 -1.891 0.0586 .
## Wtemp -0.06120 0.06960 -0.879 0.3793
## DOY 0.03227 0.06508 0.496 0.6199
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.8358) family taken to be 1)
##
## Null deviance: 886.88 on 643 degrees of freedom
## Residual deviance: 549.00 on 633 degrees of freedom
## AIC: 1695.2
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.836
## Std. Err.: 0.109
##
## 2 x log-likelihood: -1671.220
#check_overdispersion(m20)
#check_zeroinflation(m20)
For the negative binomial model, all species except for EC-A were found to be significant at the α = 0.05 level. cover, Wtemp, and DOY were not found to be significant. The species effects spp and DOY were all found to be positive, while the effects of cover, Wtemp, and mined were found to be negative.
We can again perform a hypothesis test using the residual deviance, where the null hypothesis is that the proposed model fits as well as the saturated model. The resulting p-value is 0.9929618, so we retain the null and conclude that this model is a good fit. The AIC value is 1695.220271.
Based on the two model AIC values, we can conclude that the negative binomial model fits the data better,
as it has a lower AIC value of 1695.220271 compared to 2011.1001221.
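The two residual-deviance tests and the dispersion ratio can be reproduced from the printed summaries alone. This is a sketch using the rounded values shown above, so the last digits may differ slightly from the unrounded results; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom copied from summary(m10) and summary(m20).
p_pois <- pchisq(1265.8, df = 633, lower.tail = FALSE)
p_nb   <- pchisq(549.00, df = 633, lower.tail = FALSE)

# Pearson dispersion ratio for the Poisson fit: Pearson chi-squared / residual df.
disp <- 1818.083 / 633

c(p_pois = p_pois, p_nb = p_nb, dispersion = disp)
```

A small p-value here signals lack of fit relative to the saturated model, which is why the Poisson model is rejected and the negative binomial model is retained.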
c)
# Poisson Model
M1 <- zeroinfl(count ~ spp + mined + cover + Wtemp + DOY| mined + DOY, data = dat3, dist = "poisson")
summary(M1)
##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.3780 -0.5153 -0.3735 0.1763 8.9950
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20706 0.21994 -0.941 0.34648
## sppEC-A 0.91739 0.28844 3.180 0.00147 **
## sppGP 1.27219 0.24118 5.275 1.33e-07 ***
## sppDF 1.36875 0.23971 5.710 1.13e-08 ***
## sppDM 1.49724 0.23681 6.323 2.57e-10 ***
## sppEC-L 1.88908 0.23089 8.182 2.79e-16 ***
## sppDES-L 1.90127 0.23022 8.259 < 2e-16 ***
## minedyes -1.21981 0.15256 -7.996 1.29e-15 ***
## cover -0.25251 0.04417 -5.717 1.08e-08 ***
## Wtemp -0.08794 0.04229 -2.079 0.03758 *
## DOY 0.22235 0.04166 5.337 9.44e-08 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0079 0.1628 -6.190 6.03e-10 ***
## minedyes 2.0690 0.2532 8.173 3.01e-16 ***
## DOY 0.3021 0.1220 2.476 0.0133 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -872.9 on 14 Df
##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.98598 -0.47031 -0.34711 0.08008 9.13671
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.43709 0.24233 -1.804 0.07128 .
## sppEC-A 0.72928 0.30496 2.391 0.01679 *
## sppGP 1.34014 0.27830 4.815 1.47e-06 ***
## sppDF 1.44799 0.27444 5.276 1.32e-07 ***
## sppDM 1.63121 0.27301 5.975 2.30e-09 ***
## sppEC-L 1.85662 0.26833 6.919 4.54e-12 ***
## sppDES-L 2.03587 0.26654 7.638 2.20e-14 ***
## minedyes -1.27181 0.21830 -5.826 5.68e-09 ***
## cover -0.21341 0.07198 -2.965 0.00303 **
## Wtemp -0.07159 0.06823 -1.049 0.29411
## DOY 0.13679 0.07081 1.932 0.05337 .
## Log(theta) 0.53650 0.25754 2.083 0.03724 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9214 0.4708 -4.081 4.48e-05 ***
## minedyes 2.5793 0.4764 5.415 6.14e-08 ***
## DOY 0.3298 0.1833 1.800 0.0719 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Theta = 1.71
## Number of iterations in BFGS optimization: 24
## Log-likelihood: -820.9 on 15 Df
M0 <- update(M2, . ~ 1)
pchisq(2 * (logLik(M2) - logLik(M0)), df = 12, lower.tail=FALSE) # df is from df M2 - df M0
## AIC-corrected -4.937309 model2 > model1 3.9604e-07
## BIC-corrected -4.658501 model2 > model1 1.5926e-06
We have the following conclusions based on the output of our significance tests. Both the ZIP and the ZINB models M1 and M2 were found to be significant when compared to the null model. The ZIP model M1 was found to give a significant improvement in fit over the regular Poisson model m10 across all p-values (raw and adjusted). The ZINB model was found to be significant according to the raw and AIC-corrected p-values, but not according to the BIC-corrected p-value.
For our ZIP model M1, in the logit part of the model both the mined and DOY variables were found to be significant predictors of inflated zeroes: mined = yes increases the chance of inflated zeroes relative to the mined = no group, and a later day of year increases the chance of excess zeroes.
In the ZIP model, the overall significance of effects remained the same, with the exception of the variable Wtemp, which was significant in the ZIP model (M1) but not in the regular Poisson model (m10). The direction of the effects remained the same; most of the effects decreased in magnitude, with the exception of the EC-A species effect and the effects of cover, Wtemp, and DOY.
In the ZINB model, the overall significance of effects changed: the species EC-A and the variable cover were found to be significant in M2, but they were not in m20. As with the ZIP model, the direction of effects remained the same. The DF, DM, DES-L, and mined effects decreased in magnitude, while the EC-A, GP, and EC-L species effects and the effects of cover, Wtemp, and DOY increased.
For our ZINB model M2, in the logit part of the model only the mined variable was found to be a significant predictor of inflated zeroes at the α = 0.05 level, with mined = yes increasing the chance of inflated zeroes relative to the mined = no group.
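For a quick comparison on the same scale as part b), the AIC of each zero-inflated model can be computed from the log-likelihood and degrees of freedom printed in its summary. This is a sketch using the rounded log-likelihoods shown above, so the results are accurate only to about one decimal place.

```r
# Log-likelihoods and parameter counts copied from summary(M1) and summary(M2).
loglik <- c(ZIP = -872.9, ZINB = -820.9)
n_par  <- c(ZIP = 14, ZINB = 15)
aic    <- -2 * loglik + 2 * n_par

# AIC values of the non-inflated models, as reported in part b).
aic_all <- c(Poisson = 2011.1, NB = 1695.2, aic)
aic_all
names(which.min(aic_all))
```

On these figures the zero-inflated models improve on their non-inflated counterparts, with the ZINB model having the smallest AIC of the four.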
d)
summary(rand_poi)
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## site (Intercept) 0.03681 0.1918
## Number of obs: 644, groups: site, 23
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24940 0.23162 -1.077 0.28158
## sppEC-A 0.89219 0.28639 3.115 0.00184 **
## sppGP 1.27423 0.24136 5.279 1.30e-07 ***
## sppDF 1.34859 0.24070 5.603 2.11e-08 ***
## sppDM 1.51613 0.23749 6.384 1.73e-10 ***
## sppEC-L 1.90730 0.23164 8.234 < 2e-16 ***
## sppDES-L 1.89647 0.23076 8.218 < 2e-16 ***
## minedyes -1.28806 0.21929 -5.874 4.26e-09 ***
## cover -0.21068 0.08040 -2.620 0.00878 **
## Wtemp -0.11529 0.05623 -2.050 0.04032 *
## DOY 0.22158 0.04366 5.075 3.88e-07 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0336 0.1663 -6.214 5.17e-10 ***
## minedyes 1.9774 0.2893 6.835 8.20e-12 ***
## DOY 0.3207 0.1269 2.527 0.0115 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
rand_genpois <- glmmTMB(count ~ spp + mined + cover + Wtemp + DOY + (1|site), ziformula= ~ mined + DOY,
summary(rand_genpois)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59389 0.31164 -1.906 0.0567 .
## sppEC-A 0.62279 0.34289 1.816 0.0693 .
## sppGP 1.40011 0.30214 4.634 3.59e-06 ***
## sppDF 1.50234 0.30063 4.997 5.81e-07 ***
## sppDM 1.65442 0.29708 5.569 2.56e-08 ***
## sppEC-L 1.87994 0.29420 6.390 1.66e-10 ***
## sppDES-L 2.05522 0.28942 7.101 1.24e-12 ***
## minedyes -1.99625 0.35244 -5.664 1.48e-08 ***
## cover -0.14094 0.14768 -0.954 0.3399
## Wtemp -0.05037 0.08576 -0.587 0.5570
## DOY 0.16226 0.07328 2.214 0.0268 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7911 0.8150 -3.425 0.000616 ***
## minedyes 1.2695 0.8535 1.488 0.136879
## DOY 0.9849 0.5588 1.762 0.078008 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
those in M1. The mined and DOY variables were found to be significant in generating excessive zeroes in the
logit portion of the model, when compared to our model M1.
For the ZIGP mixed model (rand_genpois), the species indicator EC-A and the cover variable are not significant, while the DOY variable is significant at the α = 0.05 level, in contrast to M2. Furthermore, unlike in M2, the mined variable was not found to be significant in inflating zero counts in the logit part of the model.
e)
predcover$phat = predict(M2,predcover)
predWtemp$phat = predict(M2,predWtemp)
preddoy$phat = predict(M2,preddoy)
[Figure: predicted Count against scaled Cover, one line per species (spp: DES-L, DF, DM, EC-A, EC-L, GP, PR), in panels mined: no and mined: yes.]
[Figure: predicted Count against scaled Wtemp, one line per species (spp: DES-L, DF, DM, EC-A, EC-L, GP, PR), in panels mined: no and mined: yes.]
[Figure: predicted Count against scaled DOY, one line per species (spp: DES-L, DF, DM, EC-A, EC-L, GP, PR), in panels mined: no and mined: yes.]
We can see clearly from the three plots that belonging to a site where mountain top removal coal mining occurred severely depletes the overall salamander counts. From the first plot, we can see that as the scaled number of cover objects in the stream increases, the overall counts decrease. Similarly, as the scaled water temperature increases, the counts of salamanders decrease. As the scaled day of year increases, the overall counts of salamanders increase; however, this pattern is not seen where mountain top removal coal mining has taken place, in which case the count decreases throughout the year.
Question 4
y <- c(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3)
n <- c(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
seed <- factor(c(1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
root <- factor(c(1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0))
yc <- cbind(y,n-y)
# Interaction Model
m1 <- glm(yc ~ seed*root, family = binomial(link=logit))
# No Interaction Model
m2 <- glm(yc ~ seed + root, family = binomial(link=logit))
summary(m1)
##
## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4
summary(m2)
##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4
To test whether the interaction effect is significant, we conduct a chi-square test on the difference of the deviances. As the two models differ by 1 degree of freedom, the reference distribution has 1 degree of freedom. This yields a p-value of 0.0113601, so we reject the null hypothesis that the simpler model without interaction is adequate, and conclude that the model including the interaction term is preferred.
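The code for this test is not shown, so the following is a sketch of one way to carry it out, refitting both models from the data listed above. The names dev_diff, df_diff and p_value are illustrative.

```r
y <- c(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3)
n <- c(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
seed <- factor(c(1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
root <- factor(c(1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0))
yc <- cbind(y, n - y)

m1 <- glm(yc ~ seed * root, family = binomial(link = "logit"))  # with interaction
m2 <- glm(yc ~ seed + root, family = binomial(link = "logit"))  # without interaction

# Likelihood ratio (deviance difference) test for the interaction term.
dev_diff <- deviance(m2) - deviance(m1)
df_diff  <- df.residual(m2) - df.residual(m1)
p_value  <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
c(deviance_difference = dev_diff, df = df_diff, p = p_value)
```

The same comparison is available through anova(m2, m1, test = "Chisq"), which reports the deviance difference and p-value in one table.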
b)
summary(m1)
##
## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4
res_p[1]
## 1
## -1.396088
m1$fitted.values
## 1 2 3 4 5 6 7 8
## 0.3639706 0.3639706 0.3639706 0.3639706 0.3639706 0.6813559 0.6813559 0.6813559
## 9 10 11 12 13 14 15 16
## 0.6813559 0.6813559 0.6813559 0.3983740 0.3983740 0.3983740 0.3983740 0.3983740
## 17 18 19 20 21
## 0.5319149 0.5319149 0.5319149 0.5319149 0.5319149
Hence the probability that a seed with type i, j = (1, 1) germinates is 0.3639706.
## 1
## -1.396088
res_p
## 1 2 3 4 5 6
## -1.39608767 0.11451055 -1.49681854 2.16456258 0.93358010 0.79894098
## 7 8 9 10 11 12
## 0.64358638 1.50298177 -0.82617835 -1.88994181 0.67998020 0.83034028
## 13 14 15 16 17 18
## -0.72767376 -1.21769581 1.54477216 -1.62746694 -1.95715471 0.05993344
## 19 20 21
## -0.35032452 1.36731648 -0.54795962
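The object res_p is not defined in the code shown; its values are consistent with Pearson residuals of m1. As a sketch under that assumption, they can be computed by hand as (y - n p) / sqrt(n p (1 - p)) and compared with the built-in extractor. The names res_manual and res_builtin are illustrative.

```r
y <- c(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3)
n <- c(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
seed <- factor(c(1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
root <- factor(c(1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0))

m1 <- glm(cbind(y, n - y) ~ seed * root, family = binomial(link = "logit"))

# Pearson residual: observed minus expected count, over the binomial standard deviation.
p_hat <- fitted(m1)
res_manual  <- (y - n * p_hat) / sqrt(n * p_hat * (1 - p_hat))
res_builtin <- residuals(m1, type = "pearson")
round(res_manual[1], 6)
```

For the first observation this uses y = 10, n = 39 and the fitted probability 0.3639706 reported above.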
c)
We need to find $p_{111}, p_{100}, p_{110}, p_{101}, p_{010}, p_{011}$ using our binomial model. Note that the first two indices correspond to the combination of seed and root, while the third index indicates whether or not the seed germinated. We can find the corresponding probabilities using the predict function.
unname(ln1 - ln2)
## [1] -1.172255
d)
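We need to find,
\begin{align}
P(X_{11}=1, X_{10}=0 \mid X_{11}+X_{10}=1) &= \frac{P(X_{11}=1, X_{10}=0)}{P(X_{11}+X_{10}=1)} \tag{54}\\
&= \frac{P(X_{11}=1, X_{10}=0)}{P(X_{11}=1, X_{10}=0)+P(X_{11}=0, X_{10}=1)} \tag{55}\\
&= \frac{p_{111}\times p_{100}}{p_{111}\times p_{100}+p_{110}\times p_{101}} \tag{56}
\end{align}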
summary(m2)
##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4
e)
P (X11 = 1, X10 = 0) = P (X11 = 1)P (X10 = 0) = P (X10 = 1)P (X11 = 0) = P (X10 = 1, X11 = 0) (58)
Now these two results are complementary and comprise the entire sample space. Thus we have,
A possible test for the situation presented in the assignment sheet is the binomial test. This is because, conditioning on $X_{11} + X_{10} = 1$, we have that $P(X_{11} = 1) = P(X_{10} = 1)$. The counts $n_{10}$ and $n_{01}$ are thus expected to be roughly equal. Under the null hypothesis $H_0 : \beta_2 = 0$, our counts $n_{10}$ are binomially distributed with success probability 1/2,
\begin{align}
P(X_{11} \le n_{01} \mid X_{11}+X_{10} = n_{01}+n_{10}) = \sum_{i=0}^{n_{01}} \binom{n_{01}+n_{10}}{i}\left(\frac{1}{2}\right)^{n_{01}+n_{10}} \tag{63}
\end{align}
Furthermore, as our probability of success is 0.5, the binomial distribution is symmetric, so a two-sided test is straightforward: we calculate the probability of a count at least as extreme as the observed $n_{01}$ in one tail and multiply this value by two.
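As an illustration of this test, the sketch below evaluates the tail probability in (63) for hypothetical counts and doubles the smaller tail. The values of n01 and n10 are invented for the example and do not come from the assignment data.

```r
# Hypothetical discordant counts, for illustration only.
n01 <- 3
n10 <- 9
N   <- n01 + n10

# Lower tail from equation (63): sum of binomial coefficients times (1/2)^N.
p_lower <- sum(choose(N, 0:n01)) * 0.5^N
# Upper tail P(X >= n01), needed to pick the smaller of the two tails.
p_upper <- pbinom(n01 - 1, N, 0.5, lower.tail = FALSE)

# Two-sided p-value by doubling the smaller tail (valid because p = 0.5 is symmetric).
p_two <- min(1, 2 * min(p_lower, p_upper))
c(lower = p_lower, two_sided = p_two)
```

For a success probability of 0.5 this doubling agrees with binom.test(n01, N, p = 0.5)$p.value, which can serve as a cross-check.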
Question 5
a)
We will prove the result by obtaining the marginal distribution of Y by integrating out the random variable
P.
We have that,
\begin{align}
f(y) &= \int_0^1 f(y,p)\,dp \tag{64}\\
&= \int_0^1 f(y \mid p)f(p)\,dp \tag{65}\\
&= \int_0^1 \binom{n}{y}p^y(1-p)^{n-y}\times\frac{p^{a-1}(1-p)^{b-1}}{B(a,b)}\,dp \tag{66}\\
&= \binom{n}{y}\frac{1}{B(a,b)}\int_0^1 p^{y+a-1}(1-p)^{n-y+b-1}\,dp \tag{67}\\
&= \binom{n}{y}\frac{B(a+y,n+b-y)}{B(a,b)}\int_0^1\frac{p^{y+a-1}(1-p)^{n-y+b-1}}{B(a+y,n+b-y)}\,dp \tag{68}\\
&= \binom{n}{y}\frac{B(a+y,n+b-y)}{B(a,b)} \tag{69}
\end{align}
The last step holds because the integral evaluates to one: the integrand is the pdf of a Beta(a + y, n + b − y) distribution integrated over its entire domain.
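A quick numerical check of (69) is to confirm that the probabilities sum to one and that a single value matches direct numerical integration of (66). The parameter values and the helper name dbetabinom are hypothetical choices for this sketch.

```r
# Beta-binomial pmf from equation (69); the function name is illustrative.
dbetabinom <- function(y, n, a, b) {
  choose(n, y) * beta(a + y, n + b - y) / beta(a, b)
}

n <- 10; a <- 2; b <- 3      # hypothetical parameter values
pmf   <- dbetabinom(0:n, n, a, b)
total <- sum(pmf)

# Equation (66) for y = 4, integrated numerically over p.
num <- integrate(function(p) dbinom(4, n, p) * dbeta(p, a, b), 0, 1)$value
c(total = total, closed_form = pmf[5], numerical = num)
```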
b)
We will first deduce the result for the mean. We have that,
V ar(Y ) = E [V ar(Y |P )] + V ar [E(Y |P )] (74)
Now,
Now,
Thus,
\begin{align}
n\left[E(p)-E(p^2)\right] &= n\left[\frac{a}{a+b}-\frac{ab+a^2(a+b+1)}{(a+b)^2(a+b+1)}\right] \tag{84}\\
&= n\,\frac{a(a+b)(a+b+1)-ab-a^2(a+b+1)}{(a+b)^2(a+b+1)} \tag{85}
\end{align}
Thus we have,
Now we wish to show that the beta-binomial distribution allows a higher dispersion than the binomial distribution with $p = \frac{a}{a+b}$. Let X follow the binomial distribution with this probability. We have,
\begin{align}
\mathrm{Var}(X) = np(1-p) = n\frac{a}{a+b}\frac{b}{a+b} \tag{91}
\end{align}
Now when n = 1, the variances of the binomial and beta-binomial distributions coincide, $\mathrm{Var}(Y) = \mathrm{Var}(X)$. When n > 1, we have,
\begin{align}
1+\frac{n-1}{a+b+1} &> 1 \tag{92}\\
n\frac{a}{a+b}\frac{b}{a+b}\left(1+\frac{n-1}{a+b+1}\right) &> n\frac{a}{a+b}\frac{b}{a+b} \tag{93}\\
\mathrm{Var}(Y) &> \mathrm{Var}(X) \tag{94}
\end{align}
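The mean and variance results can be checked numerically by summing over the beta-binomial pmf. This is a sketch with hypothetical parameter values; the closed-form variance is the left-hand side of (93).

```r
n <- 10; a <- 2; b <- 3      # hypothetical parameter values
y <- 0:n
pmf <- choose(n, y) * beta(a + y, n + b - y) / beta(a, b)

# Moments computed directly from the pmf.
mean_y <- sum(y * pmf)
var_y  <- sum(y^2 * pmf) - mean_y^2

# Closed forms: beta-binomial variance and the matching binomial variance.
p <- a / (a + b)
var_formula <- n * p * (1 - p) * (1 + (n - 1) / (a + b + 1))
var_binom   <- n * p * (1 - p)
c(mean = mean_y, var = var_y, formula = var_formula, binomial = var_binom)
```

With these values the inflation factor 1 + (n − 1)/(a + b + 1) equals 2.5, so the beta-binomial variance is two and a half times the binomial one.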
c)
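We have the delta theorem $\mathrm{Var}[g(Y)] \approx \sigma^2[g'(\mu)]^2$. Now we wish to find $\sigma^2$, $\mu$ and $g'(\mu)$.
First $g'(\mu)$,
\begin{align}
g(p) &= \ln\left(\frac{p}{1-p}\right) \tag{95}\\
&= \ln(p)-\ln(1-p) \tag{96}\\
g'(p) &= \frac{1}{p}+\frac{1}{1-p} \tag{97}\\
&= \frac{1}{p(1-p)} \tag{98}
\end{align}
Now $\sigma^2 = \mathrm{Var}\left(\frac{y}{n}\right)$,
\begin{align}
\mathrm{Var}\left(\frac{y}{n}\right) &= \frac{1}{n^2}\mathrm{Var}(Y) \tag{99}\\
&= \frac{1}{n^2}\times n\frac{a}{a+b}\frac{b}{a+b}\left(1+\frac{n-1}{a+b+1}\right) \tag{100}\\
&= \frac{1}{n}\frac{a}{a+b}\frac{b}{a+b}\left(1+\frac{n-1}{a+b+1}\right) \tag{101}
\end{align}
Now $\mu = E\left(\frac{y}{n}\right) = \frac{a}{a+b}$, hence,
\begin{align}
g'(\mu) &= 1\Big/\left(\frac{a}{a+b}\cdot\frac{b}{a+b}\right) \tag{102}\\
&= \frac{(a+b)^2}{ab} \tag{103}
\end{align}
Thus we have,
\begin{align}
\mathrm{Var}[g(Y)] &\approx \sigma^2[g'(\mu)]^2 \tag{104}\\
&= \left[\frac{(a+b)^2}{ab}\right]^2\times\frac{1}{n}\frac{a}{a+b}\frac{b}{a+b}\left(1+\frac{n-1}{a+b+1}\right) \tag{105}\\
&= \left[\frac{(a+b)^2}{ab}\right]^2\times\left[\frac{1}{n}\frac{ab}{(a+b)^2}+\frac{ab}{(a+b)^2}\frac{1-1/n}{a+b+1}\right] \tag{106}
\end{align}
Now as $n \to \infty$,
\begin{align}
\mathrm{Var}[g(Y)] &\to \frac{(a+b)^4}{(ab)^2}\cdot\frac{ab}{(a+b)^2(a+b+1)} \tag{107}\\
&= \frac{(a+b)^2}{ab(a+b+1)} \tag{108}
\end{align}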
as required.
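The limit can be illustrated numerically by evaluating the finite-n expression (105) for increasing n and comparing it with (108). The parameter values and the function name delta_var are hypothetical.

```r
a <- 2; b <- 3               # hypothetical parameter values

# Delta-theorem variance for finite n, equation (105).
delta_var <- function(n) {
  sigma2 <- (1 / n) * (a / (a + b)) * (b / (a + b)) * (1 + (n - 1) / (a + b + 1))
  gprime <- (a + b)^2 / (a * b)
  sigma2 * gprime^2
}

# Limiting value, equation (108).
limit <- (a + b)^2 / (a * b * (a + b + 1))

rbind(n = c(10, 1e3, 1e6), variance = sapply(c(10, 1e3, 1e6), delta_var))
limit
```

The finite-n values approach the limit from above, since the term in 1/n in (106) is positive.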