R Session GLMs 2024
Now let’s analyse these data to see if risk of crop damage is affected by distance from the nearest forest
patch.
plot(damaged/total ~ distance, data=dat)
Let’s first see why a simple linear model often does not work with such data!
## Let’s first try out a simple linear model
mod <- lm(damaged/total ~ distance, data=dat)
summary(mod) ## model results
plot(mod) ## pattern in variance is not good
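One concrete symptom can be sketched on simulated data: a straight line fitted to proportions can predict values outside the range 0 to 1. The data frame `sim`, the seed, and the coefficients used to generate the data are all made up for illustration and are not the crop damage data.

```r
## Simulated example only: hypothetical data, not the crop damage data
set.seed(1)
sim <- data.frame(distance = seq(0, 12, length.out = 40), total = 20)
## true probability of damage declines logistically with distance
sim$damaged <- rbinom(nrow(sim), size = sim$total,
                      prob = plogis(2.5 - 0.6 * sim$distance))

mod.sim <- lm(damaged/total ~ distance, data = sim)

## predicted proportions, including distances at and beyond the far end of the data
pred <- predict(mod.sim, newdata = data.frame(distance = c(0, 6, 12, 15)))
pred
range(pred)   ## a proportion must lie between 0 and 1; check whether these do
```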
Generalised linear models extend ordinary regression models to include some specific types of non-normally
distributed errors and specific types of non-linearity. For the crop damage study, we want to model the
probability of damage P as a logistic function of distance D.
Advanced Ecological Statistics
KI, Jan 2024
$$P = \frac{e^{b_0 + b_1 D}}{1 + e^{b_0 + b_1 D}}$$
This is a non-linear relationship that has to be linearised before we can fit it as a GLM.
We take the log of the odds P/(1-P), also called the logit of P, which gives us:
$$\ln\left(\frac{P}{1 - P}\right) = b_0 + b_1 D$$
The right hand side terms form the linear predictor (the sum of the effects of predictor variables). The left
hand term is the logit of P and this is the link function which relates P (the probability of damage) to the
linear predictor. This is a generalized linear model with binomial errors and a logit link function.
We then take the result which gives parameters estimated in terms of the link function and then back
transform to predict P, the probability of damage, in the original scale. To back transform, rewrite the above
equation as an exponential function and simplify for P and you’ll get
$$P = \frac{e^{b_0 + b_1 D}}{1 + e^{b_0 + b_1 D}}$$
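As a quick numerical check of this back-transformation, the sketch below uses base R's `qlogis()` (the logit) and `plogis()` (the inverse logit). The values of `b0`, `b1` and `D` are arbitrary illustrative numbers, not estimates from the crop damage data.

```r
## Arbitrary illustrative coefficients (assumed values, not fitted estimates)
b0 <- 1.5
b1 <- -0.4
D  <- c(0, 2, 5, 10)

eta <- b0 + b1 * D                    ## linear predictor, on the logit scale
P   <- exp(eta) / (1 + exp(eta))      ## back-transformed probability

all.equal(P, plogis(eta))             ## plogis() is the inverse logit
all.equal(log(P / (1 - P)), eta)      ## the logit of P recovers the linear predictor
all.equal(qlogis(P), eta)             ## qlogis() is the logit
```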
Let’s carry out this analysis: We specify the linear predictor, binomial error structure, and logit link which
linearises our hypothesized logistic relationship.
dat$undamaged <- dat$total - dat$damaged
# calculate the number of “failures”
# to provide to the glm function
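The call that creates `result` does not appear in the lines above. Here is a minimal sketch of what such a call usually looks like, assuming the response is supplied as a two-column matrix of successes and failures. The simulated `dat` below is hypothetical and only stands in for the real data so that the block runs on its own.

```r
## Hypothetical stand-in for the crop damage data (simulated, for illustration)
set.seed(42)
dat <- data.frame(distance = runif(30, 0, 10), total = 25)
dat$damaged   <- rbinom(30, size = dat$total,
                        prob = plogis(2.6 - 0.55 * dat$distance))
dat$undamaged <- dat$total - dat$damaged

## binomial GLM: cbind(successes, failures) on the left, logit link
result <- glm(cbind(damaged, undamaged) ~ distance,
              family = binomial(link = "logit"), data = dat)
coef(result)   ## intercept and slope on the logit scale
```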
Here are the results from our GLM, which were stored in the object called result:
summary(result)
Here is a plot of the data on the proportion damaged with predicted values added to it:
plot(damaged/total ~ distance, data=dat)
points(predict(result, type="response") ~ distance, data = dat, col="blue", pch=16)
# one way of adding back-transformed
# predicted values;
# the function predict can give you predicted values
# both in the original scale and in the scale of
# the linear predictor
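To make the two prediction scales concrete, the sketch below compares `predict(..., type = "link")` with `predict(..., type = "response")`. It runs on simulated stand-in data: `d`, `fit` and the generating coefficients are hypothetical names and values, not the crop damage data.

```r
## Simulated stand-in data (hypothetical), so the block is self-contained
set.seed(7)
d <- data.frame(distance = runif(30, 0, 10), total = 25)
d$damaged <- rbinom(30, size = d$total, prob = plogis(2.6 - 0.55 * d$distance))
fit <- glm(cbind(damaged, total - damaged) ~ distance, family = binomial, data = d)

p.link <- predict(fit, type = "link")       ## scale of the linear predictor (logit)
p.resp <- predict(fit, type = "response")   ## original scale (probability)

## back-transforming the logits gives the probabilities
all.equal(unname(plogis(p.link)), unname(p.resp))

## predictions on a regular grid of distances give a smooth fitted curve
grid <- data.frame(distance = seq(0, 10, by = 0.5))
grid$p <- predict(fit, newdata = grid, type = "response")
```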
Let’s check that we understand the coefficients and how to use them to plot the fitted curve to the
data. The coefficients (intercept b0 and slope b1) of the linear predictor are given in terms of the link function
(logit P). That is,
logit(P) = b0 + b1*distance
         = 2.635 - 0.553*distance
To get predicted values in the original scale, we can plug in the predicted logits into the logistic function as
follows:
Probability of damage, P = exp(2.635 - 0.553*distance) / (1 + exp(2.635 - 0.553*distance))
Here is a plot of the data on the proportion damaged with predicted values (back transformed as above)
added to it:
plot(damaged/total ~ distance, data=dat)
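The plot call above draws only the data. One way to draw the back-transformed curve is sketched below, using the coefficient estimates quoted earlier. The function name `p.damage` is made up here, and the plotting range of 0 to 10 is a guess rather than a value taken from the data.

```r
## Back-transformation written as a function of distance,
## using the coefficient estimates quoted above
p.damage <- function(distance) {
  exp(2.635 - 0.553 * distance) / (1 + exp(2.635 - 0.553 * distance))
}

p.damage(c(0, 5, 10))   ## predicted probability of damage at three distances

## curve() evaluates the function over a range of x values;
## with add = TRUE it would draw on top of an existing plot of the data
curve(p.damage(x), from = 0, to = 10,
      xlab = "distance", ylab = "probability of damage")
```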
We need to check model fit. Standardised residuals should show no pattern in variance, and there should be no
outliers and no small set of points making a disproportionately large contribution to the fit.
plot(result)
We also need to check for overdispersion. An easy rule of thumb is that, in the model output, the residual
deviance and the residual df should be similar; i.e., the ratio of residual deviance to residual df should be
roughly 1. A ratio above 1.5 is something to start worrying about seriously. The problem is that when there is
overdispersion, p values and confidence intervals cannot be trusted: p values are likely to be too small and
confidence intervals too narrow, leading us to wrongly conclude that there is a statistically detectable
effect when there isn’t!
summary(result) ## overdispersion not a problem in this example
See Zuur et al. 2007 and Crawley 2007 for possible ways to deal with overdispersion.
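The rule of thumb can be computed directly from a fitted model, as sketched below on simulated stand-in data (`d` and `fit` are hypothetical). The quasibinomial refit is shown only as one commonly used option; it is an assumption here, not necessarily what those texts recommend for a given data set.

```r
## Simulated stand-in data (hypothetical) so that the block runs on its own
set.seed(3)
d <- data.frame(distance = runif(40, 0, 10), total = 25)
d$damaged <- rbinom(40, size = d$total, prob = plogis(2.6 - 0.55 * d$distance))
fit <- glm(cbind(damaged, total - damaged) ~ distance, family = binomial, data = d)

## rule of thumb: residual deviance / residual df should be roughly 1
disp.ratio <- deviance(fit) / df.residual(fit)
disp.ratio

## one commonly used option when the ratio is well above 1:
## refit with a quasibinomial family, which estimates a dispersion parameter
## and scales the standard errors accordingly (coefficients are unchanged)
fit.q <- update(fit, family = quasibinomial)
summary(fit.q)$dispersion
```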
on territories – on dung piles and off dung piles within a territory. Seed presence or absence in a plot is the
response variable. The predictors are territory type (two levels – lek versus dispersed), location within
territory (on versus off dung piles), distance to the nearest Prosopis (if this distance is small, we would
expect males to have increased access to Prosopis which would increase the chances of seeds being found on
that male’s territory), season (early breeding season versus late breeding season).
drop1(result, test="Chisq") ## carries out single term deletions for a given model
## one way to carry out likelihood ratio tests to
## assess statistical significance of terms in a model
Let’s assess the relative importance of the different predictors by calculating the change in the predicted
probability of seed presence for a given change in each predictor, keeping other predictors constant
(categorical variables at one level and continuous variables at the median value).
## difference in seed presence between lek and dispersed territories
seed.lek <- -2.475833 + 0.213729 - 0.001447 *median(seed$Nearest.tree)
seed.dispersed <- -2.475833 - 0.001447 * median(seed$Nearest.tree)
seed.d.p <- exp(seed.dispersed)/(1+exp(seed.dispersed))
seed.l.p <- exp(seed.lek)/(1+exp(seed.lek))
c(seed.d.p, seed.l.p) ## very small difference between lek and dispersed territories
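The same back-transformation can be written with `plogis()`, base R's inverse logit. In this sketch the coefficient values are those quoted above, their labels are inferred from how they are used there, and `med.tree` is a made-up placeholder for `median(seed$Nearest.tree)`, since the `seed` data are not available here.

```r
## Coefficients as quoted above; the median distance is a made-up placeholder
b.int  <- -2.475833   ## intercept (dispersed territory)
b.lek  <-  0.213729   ## effect of lek territory
b.tree <- -0.001447   ## slope for distance to nearest tree
med.tree <- 100       ## hypothetical value standing in for median(seed$Nearest.tree)

logit.dispersed <- b.int + b.tree * med.tree
logit.lek       <- b.int + b.lek + b.tree * med.tree

## plogis(x) is equivalent to exp(x) / (1 + exp(x))
p <- c(dispersed = plogis(logit.dispersed), lek = plogis(logit.lek))
p
p[["lek"]] - p[["dispersed"]]   ## difference in predicted probability
```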
mating benefits (the number of females present). To test this idea, we wish to examine the relationship
between male display rate and the number of females present on their territory.
Y = exp(b0 + b1X)
log(Y) = b0 + b1X
The right hand side terms form the linear predictor (the sum of the effects of predictor variables). The left
hand side term is the natural logarithm of Y and this is the link function which relates Y to the linear predictor.
So, with the error structure, linear predictor and link function identified, we can now fit the GLM:
mod.glm <- glm(display ~ females, data=dis, family=poisson(link="log"))
summary(mod.glm) ## Model results, also check for overdispersion
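Because the link is the log, the coefficients act multiplicatively on the original scale. The sketch below illustrates this on simulated stand-in data: the data frame `dis` is hypothetical, named only to match the call above, and the generating coefficients are arbitrary.

```r
## Simulated stand-in data (hypothetical), named to match the call above
set.seed(11)
dis <- data.frame(females = rpois(50, lambda = 3))
dis$display <- rpois(50, lambda = exp(0.5 + 0.2 * dis$females))

mod.glm <- glm(display ~ females, data = dis, family = poisson(link = "log"))

## coefficients are on the log scale; exponentiating gives multiplicative effects:
## exp(slope) is the factor by which the expected display rate changes per extra female
exp(coef(mod.glm))

## back-transformed predictions for 0, 2 and 4 females
predict(mod.glm, newdata = data.frame(females = c(0, 2, 4)), type = "response")

## overdispersion rule of thumb: residual deviance / residual df
deviance(mod.glm) / df.residual(mod.glm)
```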
Let’s formulate our model. The questions we wish to address will guide the terms we include in the model.
Our sample size will also constrain the number of terms we can include. Our main aims are to evaluate how
the difference in species richness in grazed and ungrazed plots changes with rainfall, soil available nitrogen
and soil available phosphorus. So, we include grazing treatment, rainfall, N, P, interaction of treatment with
rainfall, interaction of treatment with N, and interaction of treatment with P. Since theory suggests that
species richness is likely to decrease at high rainfall (high productivity) we also include a quadratic term for
rainfall and an interaction between treatment and the quadratic term. The model thus includes 11 parameters
to be estimated. We do not include two-way interactions between rainfall and N and P, first because there
were no strong a priori reasons to do so, and secondly because our sample size constrains us from doing so
(the ratio of sample size and the number of parameters to be estimated is already just less than 5). So, here’s
our model:
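The model call itself is missing from the lines above. Below is a sketch of a model containing the terms just listed, assuming the variable names used later in this session (`sp.rich`, `gr.trt`, `rainfall`, `n`, `p`, data frame `grazing`). The simulated data, including the rainfall units, are hypothetical and only make the block runnable. With a two-level treatment factor this sketch estimates 10 coefficients, so it may not match the original model exactly.

```r
## Hypothetical simulated data using the variable names from this session
set.seed(5)
grazing <- data.frame(
  gr.trt   = factor(rep(c("grazed", "ungrazed"), each = 25)),
  rainfall = runif(50, 0.2, 1.2),   ## hypothetical units (metres per year)
  n        = runif(50, 1, 10),
  p        = runif(50, 1, 10)
)
grazing$sp.rich <- rpois(50, lambda = 12)

## main effects, quadratic rainfall term, and treatment interactions
mod1 <- glm(sp.rich ~ gr.trt + rainfall + I(rainfall^2) + n + p +
              gr.trt:rainfall + gr.trt:I(rainfall^2) + gr.trt:n + gr.trt:p,
            data = grazing, family = poisson)
length(coef(mod1))   ## number of coefficients in this sketch
```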
Now, we carry out the main tests of interest, i.e., whether the interaction terms are important. We delete each
interaction in turn and examine the change in model fit.
mod2 <- update(mod1, ~.- gr.trt:p) # removing the gr.trt:p interaction term
mod3 <- update(mod1, ~.- gr.trt:n) # delete the gr.trt:n interaction term
## test for change in model fit with each of these interaction terms deleted
anova(mod2, mod1, test="Chisq") ## Likelihood ratio test
anova(mod3, mod1, test="Chisq")
anova(mod4, mod1, test="Chisq")
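To show what these likelihood ratio tests compute, the sketch below reproduces the `anova()` p value by hand from the change in deviance. It uses simulated data: `d`, `full` and `reduced` are hypothetical names, and this is not the grazing data set.

```r
## Hypothetical simulated data (not the grazing data set)
set.seed(9)
d <- data.frame(trt = factor(rep(c("grazed", "ungrazed"), each = 30)),
                x   = runif(60, 0, 5))
d$y <- rpois(60, lambda = exp(1.5 + 0.1 * d$x))

full    <- glm(y ~ trt * x, data = d, family = poisson)
reduced <- update(full, ~ . - trt:x)     ## drop the interaction

lrt <- anova(reduced, full, test = "Chisq")
lrt

## the same test by hand: change in deviance referred to a chi-squared distribution
dev.change <- deviance(reduced) - deviance(full)
df.change  <- df.residual(reduced) - df.residual(full)
pchisq(dev.change, df = df.change, lower.tail = FALSE)
```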
None of the interaction terms are statistically significant. Using the modelling strategy of carrying out only
very cautious model simplification by removing only statistically non-significant interaction terms and
retaining all main effects whether significant or not, our final model is
mod.fin <- glm(sp.rich ~ gr.trt + rainfall + I(rainfall^2) + n + p, data = grazing,
family = poisson)
Aggressive model simplification can result in uncovering spurious relationships (the p-hacking or data dredging
problem discussed a lot in recent literature). Therefore, it is best to avoid extensive model simplification.
Returning to the questions we initially posed, what do you infer from this model about the effect of grazing on
richness and the modulation of this effect by resource availability?
Some other ways to look at the relative magnitudes of the effects of different predictors are 1) to examine the
change in sp.rich for a given sensible change in each predictor, for example a change spanning the middle 50%
of its values (from the 25th to the 75th percentile), holding all the others constant; 2) to scale all continuous
predictors so that the model coefficients are directly comparable; 3) to examine the change in model fit (e.g.,
pseudo-R-squared) when each predictor is removed from the model.
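The three approaches can be sketched as follows, on simulated data with only two continuous predictors. The data frame `d`, the model `fit`, and the helper `pseudo.r2` are hypothetical. The deviance-based pseudo-R-squared used here (1 minus residual deviance over null deviance) is one of several definitions and is an assumption, not necessarily the one intended above.

```r
## Hypothetical simulated data (not the grazing data set)
set.seed(21)
d <- data.frame(rainfall = runif(60, 200, 1200), n = runif(60, 1, 10))
d$sp.rich <- rpois(60, lambda = exp(2 + 0.0005 * d$rainfall + 0.03 * d$n))
fit <- glm(sp.rich ~ rainfall + n, data = d, family = poisson)

## 1) predicted richness at the 25th and 75th percentiles of rainfall,
##    with n held at its median
q  <- quantile(d$rainfall, c(0.25, 0.75))
nd <- data.frame(rainfall = as.numeric(q), n = median(d$n))
pred <- predict(fit, newdata = nd, type = "response")
pred[2] / pred[1]       ## proportional change in expected richness

## 2) scaled predictors: coefficients are per standard deviation
##    and so directly comparable
fit.sc <- glm(sp.rich ~ scale(rainfall) + scale(n), data = d, family = poisson)
coef(fit.sc)

## 3) change in fit when a predictor is dropped,
##    here via a deviance-based pseudo-R-squared
pseudo.r2 <- function(m) 1 - deviance(m) / m$null.deviance
pseudo.r2(fit) - pseudo.r2(update(fit, ~ . - n))
```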