
STAT4027 Assignment 3

Lewis Hastie

2023-10-06

Question 1

a)

y <- c(0, 0, 0, 2, 5, 1, 5, 14, 3, 19, 3, 14, 22)
t <- c(25, 37, 41, 42, 94, 16, 63, 126, 5, 31, 7, 24, 36)
g <- as.factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1))

data_q1 <- data.frame(y = y, t = t, g = g)

# Fit the Poisson model
glm_pois <- glm(y ~ offset(log(t)) + g, family = poisson)
summary(glm_pois)

##
## Call:
## glm(formula = y ~ offset(log(t)) + g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 124.138 on 12 degrees of freedom
## Residual deviance: 17.676 on 11 degrees of freedom
## AIC: 57.94
##
## Number of Fisher Scoring iterations: 5

# Fit the negative binomial model (glm.nb comes from the MASS package)
library(MASS)
glm_nb <- glm.nb(y ~ offset(log(t)) + g)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace > : iteration limit reached

summary(glm_nb)

##
## Call:
## glm.nb(formula = y ~ offset(log(t)) + g, init.theta = 86681.79322,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8000 0.1925 -14.549 <2e-16 ***
## g1 2.2761 0.2312 9.847 <2e-16 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(86681.79) family taken to be 1)
##
## Null deviance: 124.128 on 12 degrees of freedom
## Residual deviance: 17.675 on 11 degrees of freedom
## AIC: 59.94
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 86682
## Std. Err.: 2544076
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -53.94

mu_var1 = cbind(mean(y/t), var(y/t))

knitr::kable(mu_var1, col.names = c("Mean", "Variance"), caption = '$y/t$ summary table')

Table 1: y/t summary table

Mean Variance
0.245362 0.0731728

Our Poisson model demonstrates the better fit, as its AIC of 57.94 is lower than the 59.94 of our NB model. This is because the data do not exhibit overdispersion: from the summary table we can see that the variance is not larger than the mean. Furthermore, the theta estimate in our negative binomial model is very large, with a very large standard error, again indicating that the Poisson model is suitable.
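
As a quick supplementary check, the Pearson dispersion statistic for the Poisson fit can be computed directly (a minimal sketch; values close to 1 are consistent with no overdispersion):

# Pearson dispersion statistic: sum of squared Pearson residuals over the residual df.
sum(residuals(glm_pois, type = "pearson")^2) / df.residual(glm_pois)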

b)

# Without t information ?
glm_pois_2 <- glm(y ~ g, family = poisson)
summary(glm_pois_2)

##
## Call:
## glm(formula = y ~ g, family = poisson)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.1925 6.321 2.61e-10 ***
## g1 1.2850 0.2312 5.559 2.71e-08 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 107.246 on 12 degrees of freedom
## Residual deviance: 72.966 on 11 degrees of freedom
## AIC: 113.23
##
## Number of Fisher Scoring iterations: 6

# Fit the negative binomial model


glm_nb <- glm.nb(y ~ g)
summary(glm_nb)

##
## Call:
## glm.nb(formula = y ~ g, init.theta = 0.9387015914, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2164 0.4126 2.948 0.00319 **
## g1 1.2850 0.6322 2.033 0.04208 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.9387) family taken to be 1)
##
## Null deviance: 19.423 on 12 degrees of freedom
## Residual deviance: 15.151 on 11 degrees of freedom
## AIC: 79.027
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.939
## Std. Err.: 0.489
##
## 2 x log-likelihood: -73.027

mu_var2 = cbind(mean(y), var(y))

knitr::kable(mu_var2, col.names = c("Mean", "Variance"), caption = '$y$ summary table')

Table 2: y summary table

Mean Variance
6.769231 59.52564

In this instance the NB model has the better AIC, with values of 79.027 vs. 113.23 for the NB and Poisson models respectively. This makes intuitive sense, as the theta estimate in the NB model is small with a small standard error, suggesting that an NB model is suitable and that, when there is no information about t_i, the dispersion is larger than the mean.
This is supported by the data: the summary table indicates significant overdispersion, as the variance is far larger than the mean, and as such the negative binomial model is preferred.
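
A likelihood ratio test comparing the two fits points the same way (a minimal sketch; under the null the dispersion parameter sits on the boundary of the parameter space, so the chi-squared p-value is conservative):

# LRT of the Poisson model (theta -> infinity) against the negative binomial.
lrt <- 2 * (logLik(glm_nb) - logLik(glm_pois_2))
pchisq(as.numeric(lrt), df = 1, lower.tail = FALSE)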

c)

i)

To find our moment estimators, we equate the theoretical mean with the sample mean, and the theoretical variance with the sample variance.
For our mean,

E(Y) = ȳ    (1)
(1 − π)λ = ȳ    (2)
λ = ȳ/(1 − π)    (3)

Now our variance,

Var(Y) = s²    (4)
λ(1 − π)(1 + πλ) = s²    (5)

Subbing in our lambda from our first equation yields

[ȳ/(1 − π)] × (1 − π)[1 + ȳπ/(1 − π)] = s²    (6)
1 + ȳπ/(1 − π) = s²/ȳ    (7)
1 − π + ȳπ = s²/ȳ − (s²/ȳ)π    (8)
(s²/ȳ)π − π + ȳπ = s²/ȳ − 1    (9)
π(s²/ȳ − 1 + ȳ) = s²/ȳ − 1    (10)
π̂ = (s²/ȳ − 1)/(s²/ȳ − 1 + ȳ)    (11)
  = (s² − ȳ)/(s² + ȳ² − ȳ)    (12)

Subbing this expression for π into our mean expression we have,


λ = ȳ/(1 − π)    (13)
  = ȳ ÷ [(s² + ȳ² − ȳ − (s² − ȳ))/(s² + ȳ² − ȳ)]    (14)
  = ȳ ÷ [ȳ²/(s² + ȳ² − ȳ)]    (15)
λ̂ = (s² + ȳ² − ȳ)/ȳ    (16)

as required.
Thus we can find the estimates for our data.

ybar <- mean(y)
y_var <- var(y)

pi <- (y_var - ybar)/(y_var + ybar^2 - ybar)
lambda <- ybar/(1 - pi)

Thus our λ̂ = 14.5627914 and our π̂ = 0.5351694.

ii)

To find our MLE estimators, first note that n0 denotes the number of zeros in the sample.
The likelihood is the product over the n0 zero observations and the n − n0 non-zero observations:

L(λ, π) = [π + (1 − π)e^(−λ)]^(n0) × ∏_{i=1}^{n−n0} (1 − π) λ^(y_i) e^(−λ)/y_i!    (17)

Taking the logarithm we have,

ℓ(λ, π) = n0 ln[π + (1 − π)e^(−λ)] + Σ_{i=1}^{n−n0} ln[(1 − π) λ^(y_i) e^(−λ)/y_i!]    (18)
        = n0 ln[π + (1 − π)e^(−λ)] + (n − n0) ln(1 − π) + Σ_{i=1}^{n−n0} ln[λ^(y_i) e^(−λ)/y_i!]    (19)
        = n0 ln[π + (1 − π)e^(−λ)] + (n − n0) ln(1 − π) + ln(λ) Σ_{i=1}^{n−n0} y_i − (n − n0)λ − Σ_{i=1}^{n−n0} ln(y_i!)    (20)

In order to prove the required results, we will employ a method known as profile likelihood estimation, in
which we first find the MLE of π(λ), and then substitute this result into ℓ to find the MLE of λ.
Taking the derivative with respect to π yields,

∂ℓ/∂π = n0(1 − e^(−λ))/[π + (1 − π)e^(−λ)] − (n − n0)/(1 − π)    (21)

Setting the derivative to zero and solving,

(n − n0)/(1 − π) = n0(1 − e^(−λ))/[π + (1 − π)e^(−λ)]    (22)
[(n − n0)π + (n − n0)(1 − π)e^(−λ)]/(1 − π) = n0(1 − e^(−λ))    (23)
(n − n0)π/(1 − π) + (n − n0)e^(−λ) = n0(1 − e^(−λ))    (24)
(n − n0)π/(1 − π) = n0 − n e^(−λ)    (25)
(n − n0)π = (n0 − n e^(−λ)) − π(n0 − n e^(−λ))    (26)
π(n0 − n e^(−λ) + n − n0) = n0 − n e^(−λ)    (27)
π = (n0 − n e^(−λ))/(n − n e^(−λ))    (28)
  = (n0/n − e^(−λ))/(1 − e^(−λ))    (29)

We now substitute this value back into ℓ. Let r0 = n0/n, and note that 1 − π = (1 − r0)/(1 − e^(−λ)). Then, dropping the additive constant −Σ ln(y_i!),

ℓ = n0 ln[(r0 − e^(−λ))/(1 − e^(−λ)) + e^(−λ)(1 − r0)/(1 − e^(−λ))] + (n − n0) ln[(1 − r0)/(1 − e^(−λ))] + nȳ ln(λ) − (n − n0)λ    (30)
  = n0 ln(r0) + (n − n0)[ln(1 − r0) − ln(1 − e^(−λ))] + nȳ ln(λ) − (n − n0)λ    (31)

Taking the derivative with respect to λ,

∂ℓ/∂λ = −(n − n0)e^(−λ)/(1 − e^(−λ)) + nȳ/λ − (n − n0)    (33)

Setting to zero and solving,

0 = −(n − n0)e^(−λ)/(1 − e^(−λ)) + nȳ/λ − (n − n0)    (35)
nȳ/λ = (n − n0) + (n − n0)e^(−λ)/(1 − e^(−λ))    (36)
nȳ/λ = (n − n0)/(1 − e^(−λ))    (37)
nȳ(1 − e^(−λ)) = λ(n − n0)    (38)
ȳ(1 − e^(−λ̂)) = λ̂(1 − n0/n)    (39)

As required. Notice now our expression for π can be written as,

π = (r0 − e^(−λ))/(1 − e^(−λ)) = 1 + (r0 − 1)/(1 − e^(−λ))    (40)

Now our solution for λ̂ implies,

ȳ/λ̂ = (1 − r0)/(1 − e^(−λ̂))    (41)

Hence our result for π,


π̂ = 1 − ȳ/λ̂    (42)
Now for our hurdle model, we have the log-likelihood (from the lecture notes), where n1 = n − n0 is the number of non-zero observations:

ℓ(π, λ) = n0 ln(π) + n1 ln(1 − π) + ln(λ) Σ_{i=1}^{n1} y_i − n1 λ − Σ_{i=1}^{n1} ln(y_i!) − n1 ln(1 − e^(−λ))    (43)

Taking the derivative with respect to π yields,

∂ℓ/∂π = n0/π − n1/(1 − π)    (44)

Setting equal to zero,

0 = n0/π − n1/(1 − π)    (45)
n1/(1 − π) = n0/π    (46)
π(n0 + n1) = n0    (47)
π̂ = n0/n    (48)

as required.
To calculate the MLE of λ, we differentiate the log-likelihood with respect to λ.

∂ℓ/∂λ = nȳ/λ − n1 − n1 × e^(−λ)/(1 − e^(−λ))    (49)

Setting to zero,

0 = nȳ/λ − n1 − n1 × e^(−λ)/(1 − e^(−λ))    (50)
n1 [1 + e^(−λ)/(1 − e^(−λ))] = nȳ/λ    (51)
λ(n1/n) = ȳ(1 − e^(−λ))    (52)
λ(1 − n0/n) = ȳ(1 − e^(−λ))    (53)

as required.
We can now solve using the uniroot function,

y_bar <- mean(y)
n <- length(y)
n_0 <- sum(y == 0)

lambda_MLE <- function(lambda){
  lambda*(1 - n_0/n) - y_bar*(1 - exp(-lambda))
}

lamh <- uniroot(lambda_MLE, lower = 1, upper = 50)$root
pi_hat <- 1 - y_bar/lamh

pi_hat_hurdle <- n_0/n

Thus for our zero-inflated Poisson model we have an estimated λ̂ = 8.7986718 and an estimated π̂ = 0.2306531. For our hurdle model, we have λ̂_Hurdle = 8.7986718 and π̂_Hurdle = 0.2307692.
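
As a sanity check on these estimates (a minimal sketch using the objects computed above): at the MLE, the score equations force the fitted ZIP probability of a zero to equal the observed zero proportion n0/n.

# P(Y = 0) under the fitted ZIP model is pi + (1 - pi) * exp(-lambda);
# this should reproduce the observed proportion of zeros.
c(fitted_zero = pi_hat + (1 - pi_hat) * exp(-lamh), observed_zero = n_0/n)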

Question 2

a)

library(knitr)      # kable()
library(kableExtra) # kable_styling()
library(ggplot2)    # plots below

dat <- read.table("~/Desktop/R/4027/Assignment 3/Epilepsy.txt", header = T)
dat <- dat[,1:4]

# Calculating over all times and over all treatment groups
agg0 <- data.frame(mean = mean(dat$y), var = var(dat$y))

# Calculating over all times for each treatment group
agg1 <- do.call(data.frame, aggregate(y ~ type, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

# Calculating for each time over all treatment groups
agg2 <- do.call(data.frame, aggregate(y ~ time, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

# Calculating for each time and treatment group
agg3 <- do.call(data.frame, aggregate(y ~ time + type, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

agg0 |> kable(caption = 'Mean and Variance across all treatments and times') |> kable_styling(latex_options = 'hold_position')

Table 3: Mean and Variance across all treatments and times

mean var
8.271186 152.607

agg1 |> kable(caption = 'Mean and Variance for each treatment across all times') |> kable_styling(latex_options = 'hold_position')

Table 4: Mean and Variance for each treatment across all times

type y.mean y.var
0 8.607143 107.9344
1 7.967742 193.9664

agg2 |> kable(caption = 'Mean and Variance across all treatments for each time') |> kable_styling(latex_options = 'hold_position')

Table 5: Mean and Variance across all treatments for each time

time y.mean y.var
1 8.949153 220.08358
2 8.355932 103.78492
3 8.440678 200.18177
4 7.338983 92.88311

agg3 |> kable(caption = 'Mean and Variance for each treatment and time') |> kable_styling(latex_options = 'hold_position')

Table 6: Mean and Variance for each treatment and time

time type y.mean y.var
1 0 9.357143 102.75661
2 0 8.285714 66.65608
3 0 8.785714 215.28571
4 0 8.000000 57.92593
1 1 8.580645 332.71828
2 1 8.419355 140.65161
3 1 8.129032 193.04946
4 1 6.741936 126.66452

We can see from the mean and variance values across all aggregations that the data are overdispersed. Mean values for treatment 1 are less than those for treatment 0 when aggregated across time. The mean values generally decrease over the time points when aggregated over treatment, and the mean values for treatment 1 are generally less than those for treatment 0 at each time point (with the exception of time point 2).


b)

#test <- flexmix(formula = y ~ time + type | pat, data = dat, k = 1, model = FLXglm(family = "poisson"))
#summary(test)
#
#parameters(test)
#
#test2 <- flexmix(formula = y ~ time + type | pat, data = dat, k = 2, model = FLXglm(family = "poisson")
#summary(test2)
#
#parameters(test2)

Model 1

# Model 1
m <- length(unique(dat$pat))
param_m1 <- c(1, 0.1, 0.1)

logl_m1 <- function(par){
  b0 = par[1]; b1 = par[2]; b2 = par[3]
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu = exp(b0 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l[i] = prod(exp(-mu)*mu^yp/factorial(yp))  # likelihood contribution of patient i
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m1 <- optim(param_m1, logl_m1, method = "BFGS", hessian = T)

OIm = solve(mixreg_m1$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m1$par; estm = rbind(parm, sem)
colnames(estm) = c("b0", "b1", "b2")

estm |> kable(caption = 'M1')

Table 7: M1

b0 b1 b2
parm 2.2940093 -0.0573954 -0.0772055
sem 0.0588292 0.0202729 0.0452718

AICm=mixreg_m1$value*2+length(parm)*2

Thus for model 1 we have an AIC value of 3280.348.


Model 2


m <- length(unique(dat$pat))
param_m2 <- c(1, 1, 0.5, 0.5, 0.7)

# Model 2
logl_m2 <- function(par){
  b01 = par[1]; b02 = par[2]; b1 = par[3]; b2 = par[4]; pi = par[5]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i] + (1-pi)*l2[i]  # two-component mixture with different intercepts
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m2 <- optim(param_m2, logl_m2, method = "BFGS", hessian = T)

OIm = solve(mixreg_m2$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m2$par; estm = rbind(parm, sem)
colnames(estm) = c("b01", "b02", "b1", "b2", "pi")

estm |> kable(caption = 'M2')

Table 8: M2

b01 b02 b1 b2 pi
parm 1.409746 3.2040306 -0.0574200 0.3080130 0.7976866
sem 0.072124 0.0638979 0.0202733 0.0521591 0.0525467

AICm=mixreg_m2$value*2+length(parm)*2

Thus for model 2 we have an AIC value of 1927.981.


Model 3

m <- length(unique(dat$pat))
param_m3 <- c(1, 1, 0.5, 0.5, 0.5, 0.7)

# Model 3
logl_m3 <- function(par){
  b01 = par[1]; b02 = par[2]; b11 = par[3]; b12 = par[4]; b2 = par[5]; pi = par[6]
  l1 = rep(0, m)
  l2 = rep(0, m)
  l = rep(0, m)
  for (i in 1:m){
    sub_dat = dat[dat$pat == i,]
    mu1 = exp(b01 + b11*sub_dat$time + b2*sub_dat$type)
    mu2 = exp(b02 + b12*sub_dat$time + b2*sub_dat$type)
    yp = sub_dat$y
    l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
    l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
    l[i] = pi*l1[i] + (1-pi)*l2[i]  # mixture with component-specific intercepts and time slopes
  }
  ll = sum(log(l))
  return(-ll)
}

mixreg_m3 <- optim(param_m3, logl_m3, method = "BFGS", hessian = T)

OIm = solve(mixreg_m3$hessian); sem = sqrt(diag(OIm))
parm = mixreg_m3$par; estm = rbind(parm, sem)
colnames(estm) = c("b01", "b02", "b11", "b12", "b2", "pi")

estm |> kable(caption = 'M3')

Table 9: M3

b01 b02 b11 b12 b2 pi
parm 1.441878 3.1781987 -0.0715436 -0.0475024 0.3098578 0.7972647
sem 0.089736 0.0764258 0.0317597 0.0265506 0.0498577 0.0524799

AICm=mixreg_m3$value*2+length(parm)*2

Thus for model 3 we have an AIC value of 1929.648.
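
To make the comparison explicit, the three AIC values can be collected into a single table (a minimal sketch using the optim objects above; each $value is the minimised negative log-likelihood):

# AIC = 2 * (negative log-likelihood) + 2 * (number of parameters)
data.frame(model = c("M1", "M2", "M3"),
           AIC = c(2*mixreg_m1$value + 2*length(mixreg_m1$par),
                   2*mixreg_m2$value + 2*length(mixreg_m2$par),
                   2*mixreg_m3$value + 2*length(mixreg_m3$par))) |>
  kable(caption = 'AIC comparison')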

c)

Based on the above AIC values, we proceed with model 2, as it has the smallest AIC.

parm <- mixreg_m2$par
b01 = parm[1]; b02 = parm[2]; b1 = parm[3]; b2 = parm[4]; pi = parm[5]

ind = rep(0, m)  # posterior group-membership probabilities
l1 = rep(0, m)
l2 = rep(0, m)
l = rep(0, m)

for (i in 1:m) {
  sub_dat = dat[dat$pat == i,]
  yp = sub_dat$y
  mu1 = exp(b01 + b1*sub_dat$time + b2*sub_dat$type)
  mu2 = exp(b02 + b1*sub_dat$time + b2*sub_dat$type)
  l1[i] = prod(exp(-mu1)*(mu1)^yp/factorial(yp))
  l2[i] = prod(exp(-mu2)*(mu2)^yp/factorial(yp))
  ind[i] = pi*l1[i]/(pi*l1[i] + (1-pi)*l2[i])  # posterior probability of component 1
}

# Assigning groups to the dataframe
for (i in 1:nrow(dat)){
  dat$ind[i] <- round(ind[dat$pat[i]])
}

# Group memberships
membership <- unique(dat[, colnames(dat) %in% c('pat', 'ind')])
rownames(membership) <- NULL
membership

## pat ind
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 0
## 6 6 1
## 7 7 1
## 8 8 0
## 9 9 1
## 10 10 1
## 11 11 0
## 12 12 1
## 13 13 1
## 14 14 0
## 15 15 0
## 16 16 1
## 17 17 1
## 18 18 0
## 19 19 1
## 20 20 1
## 21 21 1
## 22 22 1
## 23 23 1


## 24 24 1
## 25 25 0
## 26 26 1
## 27 27 1
## 28 28 0
## 29 29 1
## 30 30 1
## 31 31 1
## 32 32 1
## 33 33 1
## 34 34 1
## 35 35 0
## 36 36 1
## 37 37 1
## 38 38 1
## 39 39 1
## 40 40 1
## 41 41 1
## 42 42 1
## 43 43 0
## 44 44 1
## 45 45 1
## 46 46 1
## 47 47 1
## 48 48 1
## 49 49 0
## 50 50 1
## 51 51 1
## 52 52 1
## 53 53 0
## 54 54 1
## 55 55 1
## 56 56 1
## 57 57 1
## 58 58 1
## 59 59 1

# Group Sizes
grp_size <- data.frame(aggregate(pat ~ ind + type, data = dat, FUN = function(x) length(unique(x))))

grp_size |> kable(col.names = c('Group', 'Treatment', 'Number'), caption = 'Group Sizes by treatment')

Table 10: Group Sizes by treatment

Group Treatment Number
0 0 8
1 0 20
0 1 4
1 1 27
# observed means and variances
obs <- do.call(data.frame, aggregate(y ~ time + type + ind, data = dat, FUN = function(x) c(mean = mean(x), var = var(x))))

obs$ind <- as.character(obs$ind)
obs$type <- as.character(obs$type)

obs |> kable(col.names = c('Time', 'Treatment', 'Group', 'Mean', 'Variance'), caption = 'Observed Mean and Variance')

Table 11: Observed Mean and Variance

Time Treatment Group Mean Variance
1 0 0 20.500000 161.428571
2 0 0 18.875000 45.267857
3 0 0 22.125000 528.982143
4 0 0 18.000000 52.000000
1 1 0 38.250000 1826.916667
2 1 0 26.750000 656.250000
3 1 0 36.000000 590.000000
4 1 0 26.750000 585.583333
1 0 1 4.900000 13.357895
2 0 1 4.050000 11.944737
3 0 1 3.450000 6.155263
4 0 1 4.000000 4.210526
1 1 1 4.185185 17.618234
2 1 1 5.703704 27.216524
3 1 1 4.000000 17.461538
4 1 1 3.777778 7.871795

Plotting the information below,

b01 = parm[1]; b02 = parm[2]; b1 = parm[3]; b2 = parm[4]; pi = parm[5]

pred_df <- expand.grid(time = levels(factor(dat$time)), type = levels(factor(dat$type)), group = levels(factor(dat$ind)))

prediction_function <- function(pred_df){
  beta_int = ifelse(pred_df$group == 0, b01, b02)
  pred_df$mean = exp(beta_int + b1*as.numeric(as.character(pred_df$time)) + b2*as.numeric(as.character(pred_df$type)))
  pred_df
}

predicted <- prediction_function(pred_df)

#predicted$variance <- NaN
#predicted$plotting <- 'Predicted'
#obs$plotting <- 'Observed'
#names(obs) <- names(predicted)

names(obs) <- c('time', 'type', 'group', 'y.mean', 'y.var')

merged_df <- merge(obs, predicted, by = c("time", "type", "group"), all.x = TRUE)

ggplot(merged_df, aes(x = time, y = mean, color = type, group = group, linetype = group)) +
  geom_line() +
  geom_point(aes(y = y.mean)) +
  facet_wrap(~ type, labeller = 'label_both') +
  labs(colour = 'Treatment', title = 'Fitted mean lines by group and treatment')

[Figure: "Fitted mean lines by group and treatment" — fitted means (lines, by group 0/1) and observed means (points) against time (1–4), faceted by treatment type (0, 1).]

ggplot(merged_df, aes(x = time, y = y.var, color = type, group = group)) +
  geom_point() +
  facet_wrap(~ group, labeller = 'label_both', scale = 'free') +
  labs(title = 'Observed Variance by Group and Treatment', colour = 'treatment')

[Figure: "Observed Variance by Group and Treatment" — observed variances (points) against time (1–4), coloured by treatment, faceted by group (0, 1) with free y-scales.]

Question 3

a)

library(glmmTMB)  # provides the Salamanders data and glmmTMB()

dat3 <- Salamanders  # built-in data

dat3$spp <- factor(dat3$spp, levels = c("PR","EC-A","GP","DF","DM","EC-L","DES-L"))  # base level PR
dat3$mined <- factor(dat3$mined, levels = c("no","yes"))  # base level no

summary(dat3)

## site mined cover sample DOP


## R-1 : 28 no :336 Min. :-1.59152 Min. :1.00 Min. :-2.1984
## R-2 : 28 yes:308 1st Qu.:-0.69629 1st Qu.:1.75 1st Qu.:-0.3018
## R-3 : 28 Median :-0.04974 Median :2.50 Median :-0.0916
## R-4 : 28 Mean : 0.00000 Mean :2.50 Mean : 0.0000
## R-5 : 28 3rd Qu.: 0.59682 3rd Qu.:3.25 3rd Qu.: 0.0000
## R-6 : 28 Max. : 1.88993 Max. :4.00 Max. : 3.1691
## (Other):476
## Wtemp DOY spp count
## Min. :-3.0234 Min. :-2.7122 PR :92 Min. : 0.000
## 1st Qu.:-0.6139 1st Qu.:-0.5653 EC-A :92 1st Qu.: 0.000
## Median : 0.0370 Median :-0.0590 GP :92 Median : 0.000

## Mean : 0.0000 Mean : 0.0000 DF :92 Mean : 1.323
## 3rd Qu.: 0.6032 3rd Qu.: 0.9739 DM :92 3rd Qu.: 2.000
## Max. : 2.2094 Max. : 1.4600 EC-L :92 Max. :36.000
## DES-L:92

# mean/var overall
mean_count <- mean(dat3$count)
var_count <- var(dat3$count)
num_zero <- sum(dat3$count == 0)

# by spp
agg_q3 <- do.call(data.frame, aggregate(count ~ spp, data = dat3, FUN = function(x) c(mean = mean(x), var = var(x), sum = sum(x), zero = mean(x == 0))))

names(agg_q3) <- c('Species', 'Mean', 'Variance', 'Sum', 'Zeroes')

agg_q3 |> kable(caption = 'Mean, Variance, Proportion of Zeroes by Species')

Table 12: Mean, Variance, Proportion of Zeroes by Species

Species Mean Variance Sum Zeroes
PR 0.2934783 0.6711658 27 0.8478261
EC-A 0.5434783 1.3717152 50 0.7717391
GP 1.1739130 3.7935977 108 0.5869565
DF 1.2717391 3.5407310 117 0.5108696
DM 1.4782609 4.7357860 136 0.5108696
EC-L 2.1847826 21.2511945 201 0.5108696
DES-L 2.3152174 10.2402054 213 0.4673913

ggplot(data = dat3, aes(x = count)) +
  geom_histogram() +
  labs(x = "Count", title = "Distribution of Count by Species") +
  facet_wrap(~spp)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: "Distribution of Count by Species" — histograms of count (0–36), faceted by species (PR, EC-A, GP, DF, DM, EC-L, DES-L), each dominated by zero counts.]

From the summary output we can see that there is a large number of zeroes, as the median count is zero. Furthermore, there are some extreme values: the third quartile is 2, while the maximum value is 36.
We have an overall mean of 1.3229814 and variance of 6.9468427: thus there is evidence of overdispersion, as the variance is larger than the mean. There are 387 zeros overall, indicating significant zero inflation, as 60.09 percent of the data are zeros.
From the mean and variance table by species, we can see that this pattern continues when the counts are stratified by species. All species show some level of overdispersion, with some, such as DES-L and EC-L, being significantly overdispersed.
The distributions of counts by species likewise show significant zero inflation.
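
A rough comparison of the observed zero proportion with what a single Poisson distribution with the same mean would predict supports this (a minimal sketch that ignores the covariates):

# Under Poisson(mean(count)), P(Y = 0) = exp(-mean); the observed zero
# proportion is far higher, consistent with zero inflation.
c(observed = mean(dat3$count == 0), poisson_expected = exp(-mean(dat3$count)))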

b)

library(performance)  # check_overdispersion(), check_zeroinflation(), rmse()

m10 <- glm(count ~ spp + mined + cover + Wtemp + DOY, data = dat3, family = 'poisson')
summary(m10)

##
## Call:
## glm(formula = count ~ spp + mined + cover + Wtemp + DOY, family = "poisson",
## data = dat3)
##
## Coefficients:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.58172 0.19354 -3.006 0.002649 **
## sppEC-A 0.61619 0.23882 2.580 0.009878 **
## sppGP 1.38629 0.21517 6.443 1.17e-10 ***
## sppDF 1.46634 0.21350 6.868 6.51e-12 ***
## sppDM 1.61682 0.21069 7.674 1.67e-14 ***
## sppEC-L 2.00747 0.20497 9.794 < 2e-16 ***
## sppDES-L 2.06546 0.20428 10.111 < 2e-16 ***
## minedyes -2.31284 0.12028 -19.229 < 2e-16 ***
## cover -0.23784 0.04137 -5.749 8.99e-09 ***
## Wtemp -0.04532 0.04047 -1.120 0.262831
## DOY 0.13311 0.03694 3.604 0.000314 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2120.7 on 643 degrees of freedom
## Residual deviance: 1265.8 on 633 degrees of freedom
## AIC: 2011.1
##
## Number of Fisher Scoring iterations: 6

check_overdispersion(m10)

## # Overdispersion test
##
## dispersion ratio = 2.872
## Pearson’s Chi-Squared = 1818.083
## p-value = < 0.001

## Overdispersion detected.

check_zeroinflation(m10)

## # Check for zero-inflation


##
## Observed zeros: 387
## Predicted zeros: 302
## Ratio: 0.78

## Model is underfitting zeros (probable zero-inflation).

p_m10 <- 1 - pchisq(m10$deviance, m10$df.residual)

From the overdispersion and zero-inflation tests, it is clear that there is significant overdispersion. Furthermore, the model is underfitting zeroes, indicating that there is zero inflation. The covariates included in the model are all significant at the α = 0.05 level, with the exception of Wtemp. All the species effects spp and the variable DOY have positive effects, while mined, cover, and Wtemp have negative effects.
We can perform a hypothesis test using the residual deviance to compare the proposed model against the saturated model. The resulting p-value of 0 gives us evidence to reject the fit of this model and conclude that it fails to capture a significant amount of the variation in the data. The AIC value is 2011.10.

m20 <- glm.nb(count ~ spp + mined + cover + Wtemp + DOY, data = dat3)
summary(m20)

##
## Call:
## glm.nb(formula = count ~ spp + mined + cover + Wtemp + DOY, data = dat3,
## init.theta = 0.8358320528, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.60776 0.24071 -2.525 0.0116 *
## sppEC-A 0.59798 0.30798 1.942 0.0522 .
## sppGP 1.26602 0.29022 4.362 1.29e-05 ***
## sppDF 1.58841 0.28452 5.583 2.37e-08 ***
## sppDM 1.67338 0.28324 5.908 3.46e-09 ***
## sppEC-L 1.84542 0.28092 6.569 5.06e-11 ***
## sppDES-L 2.06604 0.27838 7.422 1.16e-13 ***
## minedyes -2.17571 0.17211 -12.642 < 2e-16 ***
## cover -0.14609 0.07724 -1.891 0.0586 .
## Wtemp -0.06120 0.06960 -0.879 0.3793
## DOY 0.03227 0.06508 0.496 0.6199
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for Negative Binomial(0.8358) family taken to be 1)
##
## Null deviance: 886.88 on 643 degrees of freedom
## Residual deviance: 549.00 on 633 degrees of freedom
## AIC: 1695.2
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.836
## Std. Err.: 0.109
##
## 2 x log-likelihood: -1671.220

#check_overdispersion(m20)
#check_zeroinflation(m20)

p_m20 <- 1 - pchisq(m20$deviance, m20$df.residual)

For the negative binomial model, all species except EC-A were found to be significant at the α = 0.05 level; cover, Wtemp, and DOY were not significant. The species variables spp and DOY all have positive effects, while cover, Wtemp, and mined have negative effects.
We can again perform a hypothesis test using the residual deviance, where the null hypothesis is that the proposed model provides an adequate fit. The resulting p-value of 0.993 indicates that we retain the null and conclude that this model is a good fit. The AIC value is 1695.22.

Based on the two models' AIC values, we conclude that the negative binomial model fits the data better, with an AIC of 1695.22 compared to 2011.10.

c)

library(pscl)  # zeroinfl() and vuong()

# Poisson Model
M1 <- zeroinfl(count ~ spp + mined + cover + Wtemp + DOY | mined + DOY, data = dat3, dist = "poisson")
summary(M1)

##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.3780 -0.5153 -0.3735 0.1763 8.9950
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20706 0.21994 -0.941 0.34648
## sppEC-A 0.91739 0.28844 3.180 0.00147 **
## sppGP 1.27219 0.24118 5.275 1.33e-07 ***
## sppDF 1.36875 0.23971 5.710 1.13e-08 ***
## sppDM 1.49724 0.23681 6.323 2.57e-10 ***
## sppEC-L 1.88908 0.23089 8.182 2.79e-16 ***
## sppDES-L 1.90127 0.23022 8.259 < 2e-16 ***
## minedyes -1.21981 0.15256 -7.996 1.29e-15 ***
## cover -0.25251 0.04417 -5.717 1.08e-08 ***
## Wtemp -0.08794 0.04229 -2.079 0.03758 *
## DOY 0.22235 0.04166 5.337 9.44e-08 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0079 0.1628 -6.190 6.03e-10 ***
## minedyes 2.0690 0.2532 8.173 3.01e-16 ***
## DOY 0.3021 0.1220 2.476 0.0133 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Number of iterations in BFGS optimization: 20
## Log-likelihood: -872.9 on 14 Df

# Negative Binomial Model


M2 <- zeroinfl(count ~ spp + mined + cover + Wtemp + DOY| mined + DOY, data = dat3, dist = "negbin")
summary(M2)

##
## Call:
## zeroinfl(formula = count ~ spp + mined + cover + Wtemp + DOY | mined +
## DOY, data = dat3, dist = "negbin")

##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.98598 -0.47031 -0.34711 0.08008 9.13671
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.43709 0.24233 -1.804 0.07128 .
## sppEC-A 0.72928 0.30496 2.391 0.01679 *
## sppGP 1.34014 0.27830 4.815 1.47e-06 ***
## sppDF 1.44799 0.27444 5.276 1.32e-07 ***
## sppDM 1.63121 0.27301 5.975 2.30e-09 ***
## sppEC-L 1.85662 0.26833 6.919 4.54e-12 ***
## sppDES-L 2.03587 0.26654 7.638 2.20e-14 ***
## minedyes -1.27181 0.21830 -5.826 5.68e-09 ***
## cover -0.21341 0.07198 -2.965 0.00303 **
## Wtemp -0.07159 0.06823 -1.049 0.29411
## DOY 0.13679 0.07081 1.932 0.05337 .
## Log(theta) 0.53650 0.25754 2.083 0.03724 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9214 0.4708 -4.081 4.48e-05 ***
## minedyes 2.5793 0.4764 5.415 6.14e-08 ***
## DOY 0.3298 0.1833 1.800 0.0719 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Theta = 1.71
## Number of iterations in BFGS optimization: 24
## Log-likelihood: -820.9 on 15 Df

# Testing the ZIP model against the intercept-only (null) ZIP model
M0 <- update(M1, . ~ 1)
pchisq(2*(logLik(M1) - logLik(M0)), df = 12, lower.tail = FALSE) # df = df(M1) - df(M0)

## 'log Lik.' 1.144508e-73 (df=14)

M0 <- update(M2, . ~ 1)
pchisq(2 * (logLik(M2) - logLik(M0)), df = 12, lower.tail = FALSE) # df = df(M2) - df(M0)

## 'log Lik.' 4.722142e-50 (df=15)

# Vuong test: ordinary Poisson vs. ZIP
vuong(m10, M1)

## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -5.062120 model2 > model1 2.0731e-07

## AIC-corrected -4.937309 model2 > model1 3.9604e-07
## BIC-corrected -4.658501 model2 > model1 1.5926e-06

# Vuong test: ordinary negative binomial vs. ZINB
vuong(m20, M2)

## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -2.6118644 model2 > model1 0.0045025
## AIC-corrected -2.0794122 model2 > model1 0.0187897
## BIC-corrected -0.8899942 model2 > model1 0.1867345

We draw the following conclusions from the significance tests. Both the ZIP and ZINB models (M1 and M2) were significant when compared to their intercept-only null models. The ZIP model M1 showed a significant improvement in fit over the ordinary Poisson model m10 across all p-values (raw and adjusted). The ZINB model was significant according to the raw and AIC-corrected p-values, but not according to the BIC-corrected p-value.
For our ZIP model M1, both the mined and DOY variables were significant predictors of inflated zeroes in the logit component: mined = yes increases the chance of an inflated zero relative to mined = no, and a later (scaled) day of year also increases the chance of an excess zero.
In the ZIP count model, the significance of the effects remained the same as in the ordinary Poisson model, with the exception of Wtemp, which was significant in the ZIP model (M1) but not in m10. The directions of the effects remained the same, though most decreased in magnitude, with the exception of the EC-A species effect and the effect of DOY.
In the ZINB model, the significance of effects changed: the species EC-A and the variable cover were significant in M2 but not in m20. As with the ZIP model, the directions of the effects remained the same, though most decreased in magnitude, with the exception of the EC-A and GP species effects and the effect of DOY.
For the logit component of our ZINB model M2, only the mined variable was a significant predictor of inflated zeroes at the α = 0.05 level, with mined = yes increasing the chance of an inflated zero relative to mined = no.
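
The logit coefficients above can also be read as odds ratios for membership in the structural-zero component (a minimal sketch; coef() with model = "zero" extracts the zero-inflation coefficients from a pscl zeroinfl fit):

# Odds ratios for a structural zero in the ZIP model M1.
exp(coef(M1, model = "zero"))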

d)

# random effect Poisson model
rand_poi <- glmmTMB(count ~ spp + mined + cover + Wtemp + DOY + (1|site), ziformula = ~ mined + DOY, family = poisson, data = dat3)

summary(rand_poi)

## Family: poisson ( log )


## Formula: count ~ spp + mined + cover + Wtemp + DOY + (1 | site)
## Zero inflation: ~mined + DOY
## Data: dat3
##
## AIC BIC logLik deviance df.resid
## 1772.6 1839.6 -871.3 1742.6 629

##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## site (Intercept) 0.03681 0.1918
## Number of obs: 644, groups: site, 23
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24940 0.23162 -1.077 0.28158
## sppEC-A 0.89219 0.28639 3.115 0.00184 **
## sppGP 1.27423 0.24136 5.279 1.30e-07 ***
## sppDF 1.34859 0.24070 5.603 2.11e-08 ***
## sppDM 1.51613 0.23749 6.384 1.73e-10 ***
## sppEC-L 1.90730 0.23164 8.234 < 2e-16 ***
## sppDES-L 1.89647 0.23076 8.218 < 2e-16 ***
## minedyes -1.28806 0.21929 -5.874 4.26e-09 ***
## cover -0.21068 0.08040 -2.620 0.00878 **
## Wtemp -0.11529 0.05623 -2.050 0.04032 *
## DOY 0.22158 0.04366 5.075 3.88e-07 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0336 0.1663 -6.214 5.17e-10 ***
## minedyes 1.9774 0.2893 6.835 8.20e-12 ***
## DOY 0.3207 0.1269 2.527 0.0115 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

rand_genpois <- glmmTMB(count ~ spp + mined + cover + Wtemp + DOY + (1|site), ziformula = ~ mined + DOY, family = genpois, data = dat3)

summary(rand_genpois)

## Family: genpois ( log )


## Formula: count ~ spp + mined + cover + Wtemp + DOY + (1 | site)
## Zero inflation: ~mined + DOY
## Data: dat3
##
## AIC BIC logLik deviance df.resid
## 1653.0 1724.5 -810.5 1621.0 628
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## site (Intercept) 0.2065 0.4545
## Number of obs: 644, groups: site, 23
##
## Dispersion parameter for genpois family (): 2.76
##
## Conditional model:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.59389 0.31164 -1.906 0.0567 .
## sppEC-A 0.62279 0.34289 1.816 0.0693 .
## sppGP 1.40011 0.30214 4.634 3.59e-06 ***
## sppDF 1.50234 0.30063 4.997 5.81e-07 ***
## sppDM 1.65442 0.29708 5.569 2.56e-08 ***
## sppEC-L 1.87994 0.29420 6.390 1.66e-10 ***
## sppDES-L 2.05522 0.28942 7.101 1.24e-12 ***
## minedyes -1.99625 0.35244 -5.664 1.48e-08 ***
## cover -0.14094 0.14768 -0.954 0.3399
## Wtemp -0.05037 0.08576 -0.587 0.5570
## DOY 0.16226 0.07328 2.214 0.0268 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Zero-inflation model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7911 0.8150 -3.425 0.000616 ***
## minedyes 1.2695 0.8535 1.488 0.136879
## DOY 0.9849 0.5588 1.762 0.078008 .
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

result_func <- function(x){
  cbind(model = deparse(substitute(x)), rmse = rmse(x), aic = AIC(x))
}

results_q3 <- as.data.frame(rbind(result_func(m10),
                                  result_func(m20),
                                  result_func(M1),
                                  result_func(M2),
                                  result_func(rand_poi),
                                  result_func(rand_genpois)))

results_q3 |> kable()

model rmse aic
m10 2.1873 2011.10
m20 2.2491 1695.22
M1 2.1979 1773.75
M2 2.2170 1671.79
rand_poi 2.1808 1772.55
rand_genpois 2.1638 1652.97

Based on RMSE, the models rank as follows: the ZIGP mixed model performed best, followed by the ZIP mixed model, then the Poisson GLM, the ZIP model, the ZINB model, and finally the negative binomial model.
In terms of AIC, the ZIGP mixed model performed best, followed by the ZINB model and then the negative binomial model. The worst-performing models were the ZIP mixed model, followed by the ZIP model and then the Poisson model. Note that all the models accounting for overdispersion performed better in terms of AIC.
For the ZIP mixed model (rand_poi), the significance of all the effects remained the same as in M1, and mined and DOY remained significant predictors of excess zeroes in the logit portion of the model.
For the ZIGP mixed model (rand_genpois), the species indicator EC-A and the variable cover are not significant, while DOY is significant at the α = 0.05 level, in contrast to M2. Furthermore, unlike in M2, the mined variable was not a significant predictor of inflated zero counts in the logit part of the model.

e)

Zero-inflated negative binomial model

# This is the cover case
scover <- rep(seq(min(dat3$cover), max(dat3$cover), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

predcover <- data.frame(sspp, smined, scover, zero, zero)
colnames(predcover) <- c("spp","mined","cover","Wtemp","DOY")

predcover$phat = predict(M2, predcover)

# This is the Wtemp case
sWtemp <- rep(seq(min(dat3$Wtemp), max(dat3$Wtemp), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

predWtemp <- data.frame(sspp, smined, zero, sWtemp, zero)
colnames(predWtemp) <- c("spp","mined","cover","Wtemp","DOY")

predWtemp$phat = predict(M2, predWtemp)

# This is the DOY case
sdoy <- rep(seq(min(dat3$DOY), max(dat3$DOY), length.out = 50), 14)
sspp <- rep(rep(c("PR","EC-A","GP","DF","DM","EC-L","DES-L"), each = 50), 2)
zero <- rep(0, 350); smined <- rep(c("no","yes"), each = 350)

preddoy <- data.frame(sspp, smined, zero, zero, sdoy)
colnames(preddoy) <- c("spp","mined","cover","Wtemp","DOY")

preddoy$phat = predict(M2, preddoy)

# Plotting the curves
ggplot(data = predcover, aes(x = cover, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count', x = 'Cover')

[Figure: predicted Count against Cover by species (spp), faceted by mined: no / mined: yes.]

ggplot(data = predWtemp, aes(x = Wtemp, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count')

[Figure: predicted Count against Wtemp by species (spp), faceted by mined: no / mined: yes.]

ggplot(data = preddoy, aes(x = DOY, y = phat, colour = spp, group = spp)) +
  geom_line() +
  facet_wrap(~ mined, labeller = 'label_both', scales = 'free') +
  labs(y = 'Count')

[Figure: predicted Count against DOY by species (spp), faceted by mined: no / mined: yes.]

We can see clearly from the three plots that belonging to a site where mountaintop-removal coal mining occurred severely depletes the overall salamander counts compared to unmined sites. From the first plot, we can see that as the scaled number of cover objects in the stream increases, the overall counts decrease. Similarly, as the scaled water temperature increases, the counts of salamanders decrease. As the scaled day of year increases, the overall counts of salamanders increase; however, this pattern does not hold where mountaintop-removal coal mining has taken place, in which case the predicted count decreases over the year.

Question 4

y <- c(10,23,23,26,17,5,53,55,32,46,10,8,10,8,23,0,3,22,15,32,3)
n <- c(39,62,81,51,39,6,74,72,51,79,13,16,30,28,45,4,12,41,30,51,7)
seed <- factor(c(1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0))
root <- factor(c(1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0))
yc <- cbind(y,n-y)

# Interaction Model
m1 <- glm(yc ~ seed*root, family = binomial(link=logit))

# No Interaction Model
m2 <- glm(yc ~ seed + root, family = binomial(link=logit))

summary(m1)

##

## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4

summary(m2)

##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4

# test whether the change in deviance is significant, df = 1
p <- 1 - pchisq(m2$deviance - m1$deviance, 1)

To test whether our interaction effect is significant, we conduct a chi-square test on the difference of the deviances. As the difference in residual degrees of freedom between the two models is 1, the reference chi-square distribution has 1 degree of freedom. This yields a p-value of 0.0114, so we reject the null hypothesis that the simpler (additive) model is adequate and conclude that the model including the interaction term is preferred.
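
The same comparison can be obtained directly with anova() (a minimal sketch; the deviance difference and p-value should match the manual calculation above):

# Analysis of deviance comparing the additive and interaction models.
anova(m2, m1, test = "Chisq")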

b)

summary(m1)

##
## Call:
## glm(formula = yc ~ seed * root, family = binomial(link = logit))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1278 0.1688 0.757 0.44880
## seed1 0.6322 0.2100 3.010 0.00261 **
## root1 -0.5401 0.2498 -2.162 0.03062 *
## seed1:root1 -0.7781 0.3064 -2.539 0.01111 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 33.278 on 17 degrees of freedom
## AIC: 117.87
##
## Number of Fisher Scoring iterations: 4

pred <- data.frame(seed = factor(c(1)), root = factor(c(1)))
p <- predict(m1, newdata = pred, type = 'response')

res_p <- resid(m1, type = "pearson")
res_p[1]

## 1
## -1.396088

m1$fitted.values

## 1 2 3 4 5 6 7 8
## 0.3639706 0.3639706 0.3639706 0.3639706 0.3639706 0.6813559 0.6813559 0.6813559
## 9 10 11 12 13 14 15 16
## 0.6813559 0.6813559 0.6813559 0.3983740 0.3983740 0.3983740 0.3983740 0.3983740
## 17 18 19 20 21
## 0.5319149 0.5319149 0.5319149 0.5319149 0.5319149

Hence the probability that a seed of type (i, j) = (1, 1) germinates is 0.3639706.

(10 - p*39)/sqrt((p*39)*(39 - p*39)/39) # Pearson residual computed from the formula

## 1
## -1.396088

res_p

## 1 2 3 4 5 6
## -1.39608767 0.11451055 -1.49681854 2.16456258 0.93358010 0.79894098
## 7 8 9 10 11 12
## 0.64358638 1.50298177 -0.82617835 -1.88994181 0.67998020 0.83034028
## 13 14 15 16 17 18
## -0.72767376 -1.21769581 1.54477216 -1.62746694 -1.95715471 0.05993344
## 19 20 21
## -0.35032452 1.36731648 -0.54795962

c)

We need to find p111, p100, p110, p101, p010, p011 using our binomial model. Note that the first two indices correspond to the combination of seed and root, while the third index indicates whether or not the seed germinated. We can find the corresponding probabilities using the predict function.

pred <- data.frame(seed = factor(c(1, 1, 0)), root = factor(c(1, 0, 1)))

prob <- predict(m1, newdata = pred, type = 'response')
names(prob) <- c('p_111', 'p_101', 'p_011')
p_111 <- prob[1]; p_101 <- prob[2]; p_011 <- prob[3]

ln1 <- log( (p_111*(1-p_101))/((1 - p_111)*p_101) )
ln2 <- log( (p_111*(1-p_011))/((1-p_111)*p_011) )

unname(ln1 - ln2)

## [1] -1.172255

d)

We need to find,

P(X11 = 1, X10 = 0 | X11 + X10 = 1) = P(X11 = 1, X10 = 0)/P(X11 + X10 = 1)    (54)
= P(X11 = 1, X10 = 0)/[P(X11 = 1, X10 = 0) + P(X11 = 0, X10 = 1)]    (55)
= p111 × p100/(p111 × p100 + p110 × p101)    (56)

Thus using our seed+root model,

summary(m2)

##
## Call:
## glm(formula = yc ~ seed + root, family = binomial(link = logit))
##
## Coefficients:

## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3643 0.1428 2.550 0.0108 *
## seed1 0.2705 0.1547 1.748 0.0804 .
## root1 -1.0647 0.1442 -7.383 1.55e-13 ***
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98.719 on 20 degrees of freedom
## Residual deviance: 39.686 on 18 degrees of freedom
## AIC: 122.28
##
## Number of Fisher Scoring iterations: 4

pred <- data.frame(seed = factor(c(1, 1, 0)), root = factor(c(1, 0, 1)))

prob <- predict(m2, newdata = pred, type = 'response')
names(prob) <- c('p_111', 'p_101', 'p_011')
p_111 <- prob[1]; p_101 <- prob[2]; p_011 <- prob[3]

answer <- unname((p_111*(1-p_101)/(p_111*(1-p_101) + (1-p_111)*p_101)))

Thus we have our result, P (X11 = 1, X10 = 0|X11 + X10 = 1) = 0.2564028.

e)

Now, under the null hypothesis H0 : β2 = 0,

logit(P(X11 = 1)) = logit(P(X10 = 1)) = β0 + β1    (57)

which implies that P(X11 = 1) = P(X10 = 1), and hence P(X11 = 0) = P(X10 = 0).


Thus we have,

P (X11 = 1, X10 = 0) = P (X11 = 1)P (X10 = 0) = P (X10 = 1)P (X11 = 0) = P (X10 = 1, X11 = 0) (58)

From this we can conclude that,

P (X11 = 1, X10 = 0|X11 + X10 = 1) = P (X10 = 1, X11 = 0|X11 + X10 = 1) (59)

Now these two events are complementary and together exhaust the conditional sample space. Thus we have,

P(X11 = 1, X10 = 0 | X11 + X10 = 1) + P(X10 = 1, X11 = 0 | X11 + X10 = 1) = 1    (60)
2 P(X11 = 1, X10 = 0 | X11 + X10 = 1) = 1    (61)
P(X11 = 1, X10 = 0 | X11 + X10 = 1) = 1/2    (62)

A possible test for the situation presented in the assignment sheet is the binomial test. This is because, conditioning on X11 + X10 = 1, we have P(X11 = 1) = P(X10 = 1), so the discordant counts n10 and n01 are expected to be roughly equal. Under the null hypothesis H0 : β2 = 0, the count n10 is binomially distributed with success probability 1/2,

P(X11 ≤ n01 | X11 + X10 = n01 + n10) = Σ_{i=0}^{n01} C(n01 + n10, i) (1/2)^(n01 + n10)    (63)

Furthermore, as the success probability is 0.5, the binomial distribution is symmetric, which makes a two-sided test straightforward: we calculate the probability of a result at least as extreme as the observed n01 and multiply this value by two.
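
In practice such a test can be carried out with binom.test(); a minimal sketch with hypothetical discordant counts n10 and n01 (placeholder values, not taken from the assignment data):

# Exact binomial (sign-type) test of H0: beta_2 = 0 based only on the
# discordant counts; n10 and n01 here are hypothetical.
n10 <- 12; n01 <- 5
binom.test(n10, n10 + n01, p = 0.5, alternative = "two.sided")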

Question 5

a)

We will prove the result by obtaining the marginal distribution of Y by integrating out the random variable
P.
We have that,

f(y) = ∫₀¹ f(y, p) dp    (64)
     = ∫₀¹ f(y|p) f(p) dp    (65)
     = ∫₀¹ C(n, y) p^y (1 − p)^(n−y) × p^(a−1)(1 − p)^(b−1)/B(a, b) dp    (66)
     = C(n, y) [1/B(a, b)] ∫₀¹ p^(y+a−1)(1 − p)^(n−y+b−1) dp    (67)
     = C(n, y) [B(a + y, n + b − y)/B(a, b)] ∫₀¹ p^(y+a−1)(1 − p)^(n−y+b−1)/B(a + y, n + b − y) dp    (68)
     = C(n, y) B(a + y, n + b − y)/B(a, b)    (69)

where C(n, y) denotes the binomial coefficient.

The integral evaluates to one because the integrand is the pdf of a Beta(a + y, n + b − y) distribution integrated over its entire domain.
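
A quick numerical check of the derived pmf (a minimal sketch with arbitrary a, b and n): the probabilities should sum to one over y = 0, ..., n.

# Beta-binomial pmf from (69): C(n, y) * B(a + y, n + b - y) / B(a, b).
a <- 2; b <- 3; n <- 10
y <- 0:n
sum(choose(n, y) * beta(a + y, n + b - y) / beta(a, b))  # should equal 1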

b)

We will first deduce the result for the mean. We have that,

E(Y) = E_Be[E_Bi(Y|P)]    (70)
     = E_Be[np]    (71)
     = n E_Be[p]    (72)
     = n a/(a + b)    (73)

We will now deduce the variance in two parts. We have,

Var(Y) = E[Var(Y|P)] + Var[E(Y|P)]    (74)

We will first simplify the right-hand term,

Var[E(Y|P)] = Var[np]    (75)
            = n² Var(p)    (76)
            = n² ab/[(a + b)²(a + b + 1)]    (77)

Now,

E[Var(Y|P)] = E[np(1 − p)]    (78)
            = n E[p − p²]    (79)
            = n [E(p) − E(p²)]    (80)

Now,

E(p²) = Var(p) + E(p)²    (81)
      = ab/[(a + b)²(a + b + 1)] + [a/(a + b)]²    (82)
      = [ab + a²(a + b + 1)]/[(a + b)²(a + b + 1)]    (83)

Thus,

n [E(p) − E(p²)] = n [a/(a + b) − (ab + a²(a + b + 1))/((a + b)²(a + b + 1))]    (84)
                 = n [a(a + b)(a + b + 1) − ab − a²(a + b + 1)]/[(a + b)²(a + b + 1)]    (85)

Thus we have,

Var(Y) = E[Var(Y|P)] + Var[E(Y|P)]    (86)
= n [a(a + b)(a + b + 1) − ab − a²(a + b + 1)]/[(a + b)²(a + b + 1)] + n² ab/[(a + b)²(a + b + 1)]    (87)
= n a[(a + b)(a + b + 1) − b − a(a + b + 1) + nb]/[(a + b)²(a + b + 1)]    (88)
= n a[b(n − 1) + b(a + b + 1)]/[(a + b)²(a + b + 1)]    (89)
= n [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (90)

Now we wish to show that the beta-binomial distribution allows higher dispersion than the binomial distribution with p = a/(a + b). Let X be a binomial random variable with this success probability. We have,

Var(X) = np(1 − p) = n [a/(a + b)][b/(a + b)]    (91)

Now when n = 1, the variances of the binomial and the beta-binomial distributions coincide: Var(Y) = Var(X). When n > 1, we have,

 
n−1
1+ >1 (92)
(a + b + 1)
 
a b n−1 a b
n 1+ >n (93)
a+ba+b (a + b + 1) a+ba+b
V ar(Y ) > V ar(X) (94)

Thus the beta-binomial distribution allows for higher dispersion.
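
The mean and variance formulas can be verified by simulation (a minimal sketch with arbitrary a, b and n):

# Simulate Y by drawing p ~ Beta(a, b) and then Y | p ~ Binomial(n, p).
set.seed(1)
a <- 2; b <- 3; n <- 10
p <- rbeta(1e5, a, b)
ysim <- rbinom(1e5, n, p)
c(sim_var = var(ysim),
  formula_var = n*(a/(a + b))*(b/(a + b))*(1 + (n - 1)/(a + b + 1)),
  binomial_var = n*(a/(a + b))*(b/(a + b)))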

c)

We have the delta method approximation Var[g(Y)] ≈ σ²[g′(µ)]². We now wish to find σ², µ, and g′(µ).
First, g′(p),

 
g(p) = ln[p/(1 − p)]    (95)
     = ln(p) − ln(1 − p)    (96)
g′(p) = 1/p + 1/(1 − p)    (97)
      = 1/[p(1 − p)]    (98)

Now σ² = Var(Y/n),

Var(Y/n) = (1/n²) Var(Y)    (99)
         = (1/n²) × n [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (100)
         = (1/n) [a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (101)

Now µ = E(Y/n) = a/(a + b), hence,

g′(µ) = 1/{[a/(a + b)][b/(a + b)]}    (102)
      = (a + b)²/(ab)    (103)

Thus we have,

Var[g(Y)] ≈ σ²[g′(µ)]²    (104)
= [(a + b)²/(ab)]² × (1/n)[a/(a + b)][b/(a + b)][1 + (n − 1)/(a + b + 1)]    (105)
= [(a + b)²/(ab)]² × [(1/n) ab/(a + b)² + (ab/(a + b)²)(1 − 1/n)/(a + b + 1)]    (106)

Now as n → ∞,

Var[g(Y)] → [(a + b)⁴/(ab)²] × ab/[(a + b)²(a + b + 1)]    (107)
          = (a + b)²/[ab(a + b + 1)]    (108)

as required.
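
As a rough numerical check (a minimal sketch; the delta method is first-order, so agreement is approximate and improves as a and b grow), the limiting formula can be compared with the exact variance of logit(P) for P ~ Beta(a, b), which equals trigamma(a) + trigamma(b):

# Delta-method limit vs. exact Var(logit(P)) for P ~ Beta(a, b).
a <- 20; b <- 20
c(delta = (a + b)^2/(a*b*(a + b + 1)), exact = trigamma(a) + trigamma(b))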
