R Session GLMs 2024


Advanced Ecological Statistics

KI, Jan 2024

Fitting and Interpreting Generalised Linear Models in R

A simple example with binomial errors


Here is the ecological question. Crop damage by wild pigs is a major cause of conflict and we are interested
in investigating the factors that make farms vulnerable to damage by wild pigs. We suspect that one of the
main factors is how close the village is to a forested patch (that wild pigs can use as a refuge from which to
visit crop fields at night). So we collect data from many villages (village = independent sample) situated at
different distances from forests. We ask a sample of farmers from each village whether their farms have
experienced damage or not. We wish to analyse these data to see if risk of crop damage is affected by
distance from the nearest forest patch.

Let’s read in the data:


dat <- read.csv("example3.csv")
head(dat)

Now let’s analyse these data to see if risk of crop damage is affected by distance from the nearest forest
patch.
plot(damaged/total ~ distance, data=dat)

Let’s first see why a simple linear model often does not work with such data!
## Let’s first try out a simple linear model
mod <- lm(damaged/total ~ distance, data=dat)
summary(mod) ## model results
plot(mod) ## pattern in variance is not good

plot(damaged/total ~ distance, data=dat, ylim=c(-0.2,1))


abline(coef(mod))
## adding in model predictions shows that model does not provide good fit and
## even predicts negative probability of crop damage!
Now let’s analyse these data in a more appropriate way to see if the probability of being damaged is affected
by distance.

Let’s now use generalized linear models.


Generalised linear models extend ordinary regression models to include some specific types of non-normally
distributed errors and specific types of non-linearity. They are well established, efficient and powerful
models.
## Let’s now fit a GLM with binomial errors
To describe the deterministic part of the relationship we will use a very simple expected relationship.
Proportions are bounded by 0 and 1 and a simple expectation is that they should show a logistic relationship
(S shaped relationship). To describe the stochastic part of our statistical model, since we are dealing with
proportions that are bounded and since these data represent a process where there are two outcomes,
experiencing damage or not, we can use a binomial probability distribution. To fit such a model, we turn to
generalised linear models.

For the crop damage study, we want to model the probability of damage P as a logistic function of distance D.


P = exp(b0 + b1*D) / (1 + exp(b0 + b1*D))

This is a non-linear relationship that has to be linearised before we can apply GLMs.

We take the log of the odds, also called the logit of P, which gives us:
ln(P / (1 - P)) = b0 + b1*D

The right hand side terms form the linear predictor (the sum of the effects of predictor variables). The left
hand term is the logit of P and this is the link function which relates P (the probability of damage) to the
linear predictor. This is a generalized linear model with binomial errors and a logit link function.

We then take the result, which gives parameters estimated on the scale of the link function, and
back-transform to predict P, the probability of damage, on the original scale. To back-transform, exponentiate
both sides of the above equation and solve for P, which gives
P = exp(b0 + b1*D) / (1 + exp(b0 + b1*D))
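
As a quick numerical check of the back-transformation described above, the sketch below compares the logistic function written out by hand with R's built-in plogis() (inverse logit) and qlogis() (logit). The coefficient values and distances are made up for illustration; they are not estimates from the crop damage data.

```r
# Hypothetical coefficients and distances, for illustration only
b0 <- 2
b1 <- -0.5
D  <- c(0, 2, 4, 8)

eta <- b0 + b1 * D                      # linear predictor (logit scale)
p_manual  <- exp(eta) / (1 + exp(eta))  # logistic function written out
p_builtin <- plogis(eta)                # built-in inverse logit

all.equal(p_manual, p_builtin)          # the two back-transformations agree
all.equal(qlogis(p_builtin), eta)       # taking the logit recovers the linear predictor
```

Whatever the value of the linear predictor, the back-transformed values stay strictly between 0 and 1, which is the property the simple linear model lacked.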

Let’s carry out this analysis: We specify the linear predictor, binomial error structure, and logit link which
linearises our hypothesized logistic relationship.
dat$undamaged <- dat$total - dat$damaged   # calculate the number of "failures"
                                           # to provide to the glm function

result <- glm(cbind(damaged, undamaged) ~ distance, data = dat,
              family = binomial(link = "logit"))
We give the function glm the response variable, one or more predictor variables, (optionally a data frame in
which the variables are stored), the family of probability distributions that is appropriate for our data, and the
link function to use. The response variable is given as a matrix of successes (damaged) and failures
(undamaged). Several other link functions are possible for each distribution. Type ?family to look these
up.
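
The two-column response is not the only way to hand grouped binomial data to glm. As a minimal sketch on simulated data (the data frame sim and its values are invented here and only mimic the structure of example3.csv), the same model can be fitted by giving the proportion damaged as the response and the number of farmers sampled as weights:

```r
# Simulated stand-in for the village data (hypothetical values)
set.seed(1)
sim <- data.frame(distance = seq(0.5, 10, by = 0.5))
sim$total     <- 20
sim$damaged   <- rbinom(nrow(sim), size = sim$total,
                        prob = plogis(2.5 - 0.5 * sim$distance))
sim$undamaged <- sim$total - sim$damaged

# Form used in the handout: matrix of successes and failures
fit_counts <- glm(cbind(damaged, undamaged) ~ distance, data = sim,
                  family = binomial(link = "logit"))

# Alternative form: proportion as response, group size as weights
fit_props <- glm(damaged / total ~ distance, weights = total, data = sim,
                 family = binomial(link = "logit"))

# Both specifications give the same estimates
all.equal(coef(fit_counts), coef(fit_props), tolerance = 1e-6)
```

Leaving out the weights in the second form would treat every village as if it carried the same amount of information, regardless of how many farmers were asked.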

Now here are the results from our glm which were stored in the object called result
summary(result)

Here is information on confidence intervals


confint(result)

Here is a plot of the data on the proportion damaged with predicted values added to it:
plot(damaged/total ~ distance, data=dat)
points(predict(result, type="response") ~ distance, data = dat, col="blue", pch=16)
# one way of adding back-transformed
# predicted values;
# the function predict can give you predicted values
# both in the original scale and in the scale of
# the linear predictor
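
The predict function can also be given new predictor values, which is handy for drawing a smooth fitted curve with an approximate confidence band. The sketch below is one way this might be done, on simulated data (sim, fit and grid are hypothetical names, and the ±1.96 standard error band is a Wald-type approximation computed on the link scale and then back-transformed):

```r
# Simulated stand-in for the village data (hypothetical values)
set.seed(2)
sim <- data.frame(distance = seq(0.5, 10, by = 0.5), total = 20)
sim$damaged <- rbinom(nrow(sim), size = sim$total,
                      prob = plogis(2.5 - 0.5 * sim$distance))
fit <- glm(cbind(damaged, total - damaged) ~ distance, data = sim,
           family = binomial)

# Predict on a fine grid of distances, on the scale of the linear predictor
grid <- data.frame(distance = seq(0, 10, length.out = 100))
pr   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform the fit and the approximate 95% limits to probabilities
grid$p   <- plogis(pr$fit)
grid$lwr <- plogis(pr$fit - 1.96 * pr$se.fit)
grid$upr <- plogis(pr$fit + 1.96 * pr$se.fit)
head(grid)
# lines(p ~ distance, data = grid) would add the curve to an existing plot
```

Building the interval on the link scale and back-transforming afterwards keeps the limits inside 0 and 1.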

Let’s check that we understand the coefficients and how to use the coefficients to plot the fitted curve to the
data. The coefficients (intercept b0 and slope b1) of the linear predictor are given in terms of the link function
(logit P). That is,


logit P = b0 + b1*distance
        = 2.635 - 0.553*distance

To get predicted values in the original scale, we can plug in the predicted logits into the logistic function as
follows:

Probability of damage, P = exp(b0 + b1*D) / (1 + exp(b0 + b1*D))
= exp(logit P) / (1 + exp(logit P))
Or in other words,

Prob of damage, P
=exp(2.635 - 0.553*distance)/(1+exp(2.635 - 0.553*distance))

Here is a plot of the data on the proportion damaged with predicted values (back transformed as above)
added to it:
plot(damaged/total ~ distance, data=dat)

logit.p <- 2.635 - 0.553*dat$distance
p.damage <- exp(logit.p) / (1+exp(logit.p))   # back transform from logit of P to P
points(p.damage ~ distance, data = dat, col="blue", pch=16)   # add to plot
curve(exp(coef(result)[1] + coef(result)[2]*x) / (1 + exp(coef(result)[1] + coef(result)[2]*x)),
      add=TRUE, col="blue")   # adding the estimated curve

We need to check model fit. Standardised residuals should show no pattern in variance, and there should be no
outliers and no small set of points making a disproportionately large contribution to the fit.
plot(result)

We also need to check for overdispersion. An easy rule of thumb is that, in the model output, the residual
deviance and the residual df should be similar; i.e., the ratio of residual deviance to residual df should be
roughly 1. A ratio > 1.5 is something to start worrying about seriously. The problem is that when there is
overdispersion, p values and confidence intervals cannot be trusted: p values are likely to be too small and
confidence intervals too narrow, leading us to wrongly conclude that there is a statistically detectable effect
when there isn’t!
summary(result) ## overdispersion not a problem in this example

See Zuur et al. 2007 and Crawley 2007 for possible ways to deal with overdispersion.
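
The rule of thumb above can be computed directly rather than read off the summary. The sketch below uses simulated data (sim and fit are hypothetical stand-ins for dat and result) and also shows the Pearson-based version of the ratio. The quasi-binomial refit at the end is one commonly used response to overdispersion; it is included here as an assumption about what one might try, not as the approach prescribed in this session.

```r
# Simulated stand-in for the village data (hypothetical values)
set.seed(4)
sim <- data.frame(distance = seq(0.5, 10, by = 0.5), total = 20)
sim$damaged <- rbinom(nrow(sim), size = sim$total,
                      prob = plogis(2.5 - 0.5 * sim$distance))
fit <- glm(cbind(damaged, total - damaged) ~ distance, data = sim,
           family = binomial)

# Ratio of residual deviance to residual df (the rule of thumb)
disp_dev <- deviance(fit) / df.residual(fit)

# Pearson chi-square divided by residual df (an alternative estimate)
disp_pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
c(deviance = disp_dev, pearson = disp_pearson)

# Possible remedy (assumption): a quasi-binomial fit estimates the dispersion
# and inflates the standard errors accordingly; coefficients stay the same
fit_q <- update(fit, family = quasibinomial)
summary(fit_q)$dispersion
```

The dispersion reported by the quasi-binomial summary is the Pearson-based ratio, so the two numbers should match.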

Example of a GLM with binary response variable


Let’s look at a data set on blackbuck territorial behaviour and how that might facilitate the dispersal of seeds
of a woody invasive (from Jadeja et al. 2013 Oikos). This study examined whether territorial males through
their dung deposition behaviour shape seed dispersal patterns. Territorial males through their restricted
movement and marking behaviour were expected to concentrate dung and therefore seeds on to territories.
Let’s examine whether seed deposition varies between males following two different mating strategies:
lekking and dispersed territoriality. Also, we would expect seed deposition to be greater on dung piles, scent
marks created by territorial males, than off dung piles within a territory. The data come from plots that were laid
on territories – on dung piles and off dung piles within a territory. Seed presence or absence in a plot is the
response variable. The predictors are territory type (two levels – lek versus dispersed), location within
territory (on versus off dung piles), distance to the nearest Prosopis (if this distance is small, we would
expect males to have increased access to Prosopis which would increase the chances of seeds being found on
that male’s territory), season (early breeding season versus late breeding season).

seed <- read.csv("seed deposition.csv")


str(seed)

result <- glm(Seed.P.A. ~ TType + TTlocation + Nearest.tree + Season,
              family = "binomial", data = seed)
summary(result) ## model results

plot(result) ## difficult to check residuals with binary data

Let’s make sure we understand the coefficients


exp(coef(result)) ## odds ratio

drop1(result, test="Chisq") ## carries out single term deletions for a given model
## one way to carry out likelihood ratio tests to
## assess statistical significance of terms in a model
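
To see what an exponentiated coefficient means, the sketch below fits a binary GLM to simulated data (the variables location, dist and present are invented and only loosely mimic the seed data) and checks that the odds ratio for a factor level equals the ratio of the predicted odds at the two levels, with the other predictor held fixed.

```r
# Simulated binary data (hypothetical variables and effect sizes)
set.seed(42)
n   <- 200
sim <- data.frame(location = factor(sample(c("off", "on"), n, replace = TRUE)),
                  dist     = runif(n, 0, 300))
sim$present <- rbinom(n, 1, plogis(-2 + 1.2 * (sim$location == "on") -
                                     0.002 * sim$dist))

fit <- glm(present ~ location + dist, family = binomial, data = sim)
odds_ratios <- exp(coef(fit))
odds_ratios

# Predicted probabilities on and off dung piles at the same distance
newd <- data.frame(location = factor(c("off", "on"), levels = levels(sim$location)),
                   dist     = 100)
p    <- predict(fit, newdata = newd, type = "response")
odds <- p / (1 - p)

# Ratio of the two odds equals the exponentiated coefficient for "on"
or_from_predictions <- unname(odds[2] / odds[1])
all.equal(or_from_predictions, unname(odds_ratios["locationon"]))
```

Because the model is additive on the logit scale, this ratio is the same whatever distance is chosen, whereas the difference in probabilities is not.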

Let’s assess the relative importance of the different predictors by calculating the change in seed presence for
a given change in each predictor, keeping the other predictors constant (categorical variables at one level and
continuous variables at the median value)
## difference in seed presence between lek and dispersed territories
seed.lek <- -2.475833 + 0.213729 - 0.001447 *median(seed$Nearest.tree)
seed.dispersed <- -2.475833 - 0.001447 * median(seed$Nearest.tree)
seed.d.p <- exp(seed.dispersed)/(1+exp(seed.dispersed))
seed.l.p <- exp(seed.lek)/(1+exp(seed.lek))
c(seed.d.p, seed.l.p) ## very small difference between lek and dispersed territories

## difference in seed presence on and off dungpiles


seed.dung <- -2.475833 + 1.270969 - 0.001447 *median(seed$Nearest.tree)
seed.offdung <- -2.475833 - 0.001447 * median(seed$Nearest.tree)
seed.dung.p <- exp(seed.dung)/(1+exp(seed.dung))
seed.offdung.p <- exp(seed.offdung)/(1+exp(seed.offdung))
c(seed.dung.p, seed.offdung.p) ## substantial difference

## difference in seed presence between the 1st and 3rd quartiles of distance to
## nearest tree (i.e., across the middle 50% of distances)


summary(seed$Nearest.tree)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 2.70 44.03 96.27 136.50 239.80 342.30
seed.nearest.1 <- -2.475833 - 0.001447*44.03
seed.nearest.3 <- -2.475833 - 0.001447*239.8
seed.n3.p <- exp(seed.nearest.3)/(1+exp(seed.nearest.3))
seed.n1.p <- exp(seed.nearest.1)/(1+exp(seed.nearest.1))
c(seed.n3.p, seed.n1.p) ## very small difference

## complete it for season
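
Typing coefficients in by hand works, but it is easy to mistype a value. An alternative sketch, on simulated data with invented variable names (ttype, location, nearest, present), is to let predict do the arithmetic on a small data frame in which each categorical predictor is varied and the continuous predictor is held at its median:

```r
# Simulated binary data (hypothetical variables and effect sizes)
set.seed(7)
n   <- 300
sim <- data.frame(ttype    = factor(sample(c("dispersed", "lek"), n, replace = TRUE)),
                  location = factor(sample(c("off", "on"), n, replace = TRUE)),
                  nearest  = runif(n, 2, 340))
sim$present <- rbinom(n, 1, plogis(-2.5 + 0.2 * (sim$ttype == "lek") +
                                     1.3 * (sim$location == "on") -
                                     0.0015 * sim$nearest))

fit <- glm(present ~ ttype + location + nearest, family = binomial, data = sim)

# All combinations of the categorical predictors, distance held at its median
newd   <- expand.grid(ttype    = levels(sim$ttype),
                      location = levels(sim$location),
                      nearest  = median(sim$nearest))
newd$p <- predict(fit, newdata = newd, type = "response")
newd

# Same value as the by-hand calculation for the reference levels
p_by_hand <- plogis(coef(fit)["(Intercept)"] +
                    coef(fit)["nearest"] * median(sim$nearest))
all.equal(unname(p_by_hand), as.numeric(newd$p[1]))
```

The first row of newd holds both factors at their reference levels, which is why it matches the by-hand value built from the intercept and the distance term alone.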

Example with Poisson errors


Here is the ecological question. We measure the display behaviour of male antelope in the presence of
varying numbers of females (potential mates). We predict that males adjust their displaying to potential
mating benefits (the number of females present). To test this idea, we wish to examine the relationship
between male display rate and the number of females present on their territory.

dis <- read.csv("display_rate.csv") ## here are the data

plot(display ~ females, data=dis) ## Plot of the data

## Let’s first try fitting a simple linear model


mod <- lm(display ~ females, data = dis)
summary(mod) ## parameter estimates and other model results
plot(mod) ## model-checking

plot(display ~ females, data=dis, xlim=c(1,10), ylim=c(-10,30)) ## Plot of the data


abline(coef(mod))
## Worrying! Model predicts negative display rates which is impossible

## Fitting a GLM instead


Generalised linear models extend ordinary regression models to include some specific types of non-normally
distributed errors and specific types of non-linearity. For our data, we need a probability distribution that is
discrete, does not take negative values, and is good at describing an observation process involving the
number of occurrences of an event in a fixed unit (of time or space), where each occurrence is independent of
the others. The Poisson distribution is appropriate particularly when the event of interest is rare. As we just
saw, one challenge when dealing with counts is that it does not make sense to think of negative counts. So,
any statistical model has to make sure that it does not try to predict values less than 0. One solution is to
assume that the expected response Y is an exponential function of the predictor variables:

Y = exp(b0 + b1X)

Then if we consider the natural logarithm of Y, we get:

log(Y) = b0 + b1X

The right hand side terms form the linear predictor (the sum of the effects of predictor variables). The left
hand side term is the natural logarithm of Y and this is the link function which relates Y to the linear predictor.
So, with the error structure, linear predictor and link function identified, we can now fit the GLM:
mod.glm <- glm(display ~ females, data=dis, family=poisson(link="log"))
summary(mod.glm) ## Model results, also check for overdispersion

plot(display ~ females, data=dis, xlim=c(1,10), ylim=c(-10,30)) ## Plot of the data


curve(exp(coef(mod.glm)[1]+coef(mod.glm)[2]*x), add=TRUE) ## fits well
plot(mod.glm) ## model checking
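
With a log link, the slope acts multiplicatively on the expected count: each additional unit of the predictor multiplies the expected response by exp(b1). The sketch below illustrates this on simulated counts (sim, fit and the effect sizes are invented; they are not the display rate data).

```r
# Simulated count data (hypothetical values)
set.seed(3)
sim <- data.frame(females = rep(1:10, each = 5))
sim$display <- rpois(nrow(sim), lambda = exp(0.3 + 0.25 * sim$females))

fit <- glm(display ~ females, data = sim, family = poisson(link = "log"))

# Multiplicative change in expected display rate per additional female
rate_ratio <- exp(coef(fit)["females"])
rate_ratio

# Check: ratio of predicted means for 5 versus 4 females
mu <- predict(fit, newdata = data.frame(females = c(4, 5)), type = "response")
all.equal(unname(mu[2] / mu[1]), unname(rate_ratio))
```

The same ratio is obtained for any pair of values one unit apart, and the predicted means can never be negative.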

Another Poisson example with multiple predictors


Let’s bring in data on plant species richness (adapted from Anderson et al. 2007, Ecology 88:1191-1201).
The data consist of measurements of species richness in 4 x 4 m plots that were placed in 8 different sites that
varied in rainfall. At each site, 3 plots were fenced (excluding large grazers) and 3 plots were open to grazers.
Soil nitrogen and phosphorus were also measured in each plot. We are interested in two questions: does
grazing increase plant species richness, and is this effect of grazing on plant species richness modulated by
the availability of different resources (moisture and soil nutrients) to plants?


grazing <- read.csv("grazing_sprich.csv")


head(grazing)
summary(grazing)

Let’s formulate our model. The questions we wish to address will guide the terms we include in the model.
Our sample size will also constrain the number of terms we can include. Our main aims are to evaluate how
the difference in species richness in grazed and ungrazed plots changes with rainfall, soil available nitrogen
and soil available phosphorus. So, we include grazing treatment, rainfall, N, P, interaction of treatment with
rainfall, interaction of treatment with N, and interaction of treatment with P. Since theory suggests that
species richness is likely to decrease at high rainfall (high productivity) we also include a quadratic term for
rainfall and an interaction between treatment and the quadratic term. The model thus includes 10 parameters
to be estimated (the intercept, five main-effect terms and four interaction terms). We do not include two-way
interactions between rainfall and N and P, first because there
were no strong a priori reasons to do so, and secondly because our sample size constrains us from doing so
(the ratio of sample size and the number of parameters to be estimated is already just less than 5). So, here’s
our model:

mod1 <- glm(sp.rich ~ gr.trt + rainfall + I(rainfall^2) + n + p + gr.trt:rainfall +
            gr.trt:I(rainfall^2) + gr.trt:n + gr.trt:p, data = grazing, family = poisson)

summary(mod1) ## model results, also check for overdispersion


cor.test(fitted(mod1), grazing$sp.rich) ## the correlation between observed and
## predicted can be used in a GLM to judge a model’s
## predictive power; analogous to using r-squared for
## ordinary least squares models

## There are also several pseudo r squared measures available


## McFadden’s pseudo-r squared is a simple and frequently used measure
## McFadden’s pseudo-r squared is calculated as
## 1 – (log likelihood of the current model / log likelihood of the null model)
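
A minimal sketch of the McFadden calculation described in the comments above, on simulated Poisson data (sim, fit and null are hypothetical names; with the real data, mod1 would take the place of fit). The null model contains only an intercept.

```r
# Simulated count data (hypothetical values)
set.seed(10)
sim   <- data.frame(x = runif(60, 0, 10))
sim$y <- rpois(60, lambda = exp(0.5 + 0.2 * sim$x))

fit  <- glm(y ~ x, data = sim, family = poisson)
null <- update(fit, . ~ 1)           # intercept-only (null) model

# McFadden's pseudo-r squared: 1 - logLik(current) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden
```

Values closer to 1 indicate a larger improvement over the null model, but the measure is not on the same scale as r-squared from ordinary least squares and should not be read as a proportion of variance explained.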

plot(residuals(mod1) ~ predict(mod1, type="link")); abline(h=0)


## standardized residuals should show homoscedasticity

Now, we carry out the main tests of interest, i.e., whether the interaction terms are important. We delete each
interaction in turn and examine the change in model fit.
mod2 <- update(mod1, ~.- gr.trt:p) # removing the gr.trt:p interaction term

mod3 <- update(mod1, ~.- gr.trt:n) # delete the gr.trt:n interaction term

mod4 <- update(mod1, ~.- gr.trt:rainfall - gr.trt:I(rainfall^2))


## delete interaction terms with rainfall

## test for change in model fit with each of these interaction terms deleted
anova(mod2, mod1, test="Chisq") ## Likelihood ratio test
anova(mod3, mod1, test="Chisq")
anova(mod4, mod1, test="Chisq")
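
To make clear what these likelihood ratio tests compute, the sketch below reproduces one by hand on simulated data (sim, full, reduced and the effect sizes are invented). The test statistic is the difference in residual deviance between the two nested models, compared against a chi-square distribution with degrees of freedom equal to the number of parameters removed.

```r
# Simulated richness-like counts (hypothetical values)
set.seed(11)
n   <- 48
sim <- data.frame(trt  = factor(rep(c("g", "u"), each = n / 2)),
                  rain = runif(n, 40, 90))
sim$rich <- rpois(n, lambda = exp(2 + 0.2 * (sim$trt == "u") + 0.01 * sim$rain))

full    <- glm(rich ~ trt * rain, data = sim, family = poisson)
reduced <- update(full, ~ . - trt:rain)     # drop the interaction

# Likelihood ratio test by hand
lr_stat  <- deviance(reduced) - deviance(full)
df_diff  <- df.residual(reduced) - df.residual(full)
p_manual <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

# Same test via anova
tab <- anova(reduced, full, test = "Chisq")
c(manual = p_manual, anova = tab[2, "Pr(>Chi)"])
```

For Poisson and binomial families the dispersion is fixed at 1, which is why the deviance difference can be referred directly to the chi-square distribution.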

None of the interaction terms are statistically significant. Using the modelling strategy of carrying out only
very cautious model simplification by removing only statistically non-significant interaction terms and
retaining all main effects whether significant or not, our final model is
mod.fin <- glm(sp.rich ~ gr.trt + rainfall + I(rainfall^2) + n + p, data = grazing,
family = poisson)


Aggressive model simplification can result in uncovering spurious relationships (the p hacking/data dredging
problem discussed a lot in recent literature). Therefore, it is best to avoid extensive model simplification.

summary(mod.fin) ## model results


drop1(mod.fin, test="Chisq") ## likelihood ratio tests for terms in the model
## can do the same by writing models with and without
## a term and doing anova(model1, model2)
plot(mod.fin) ## model criticism

Returning to the questions we initially posed, what do you infer from this model about the effect of grazing on
richness and the modulation of this effect by resource availability?

Examine model predictions


plot(sp.rich ~ rainfall, data = grazing, col=ifelse(gr.trt=="g","darkgreen","blue"),
pch=19)

## add the predicted relationship between species richness and rainfall


## separately for grazed and ungrazed plots.
## Because we used a log link in our Poisson regression,
## log y = b0 + b1*gr.trt + b2*rainfall + b3*rainfall^2 +b4*n + b5*p,
## we need to back transform (exponentiate) to get predictions
## in the raw scale

curve(exp(coef(mod.fin)["(Intercept)"] + coef(mod.fin)["rainfall"]*x +
          coef(mod.fin)["I(rainfall^2)"]*x^2),
      40, 90, add=TRUE, lwd=2, col="darkgreen")   ## grazed plots (reference level)

curve(exp(coef(mod.fin)["(Intercept)"] + coef(mod.fin)["gr.trtu"] +
          coef(mod.fin)["rainfall"]*x + coef(mod.fin)["I(rainfall^2)"]*x^2),
      40, 90, add=TRUE, lwd=2, col="blue")        ## ungrazed plots
## note: the n and p terms are left out of these curves, i.e. n and p are set to 0

## Model predictions describe the data fairly well

Some other ways to look at the relative magnitudes of the effects of different predictors are 1) to examine the
change in sp.rich for a given sensible change in predictor variables, for example a change spanning the middle
50% of a predictor’s values (from the 25th to the 75th percentile), holding all the others constant; 2) to scale
all continuous predictors so that the model coefficients are directly comparable; 3) to examine the change in
model fit (e.g., pseudo-r squared) when each predictor is removed from the model.
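
Option 2 can be sketched as follows on simulated data (sim, rain, nitro and the effect sizes are invented). Each continuous predictor is centred and divided by its standard deviation before fitting, so each coefficient then describes the change in log expected richness per one standard deviation of that predictor.

```r
# Simulated richness-like counts (hypothetical values)
set.seed(12)
n   <- 48
sim <- data.frame(rain = runif(n, 40, 90), nitro = runif(n, 0.5, 3))
sim$rich <- rpois(n, lambda = exp(1.5 + 0.01 * sim$rain + 0.2 * sim$nitro))

# Standardise the continuous predictors (mean 0, sd 1)
sim$rain_z  <- as.numeric(scale(sim$rain))
sim$nitro_z <- as.numeric(scale(sim$nitro))

fit_raw <- glm(rich ~ rain + nitro,     data = sim, family = poisson)
fit_z   <- glm(rich ~ rain_z + nitro_z, data = sim, family = poisson)

coef(fit_raw)   # per unit of rainfall / nitrogen: not directly comparable
coef(fit_z)     # per standard deviation: directly comparable

# Scaling changes the coefficients but not the fitted values
all.equal(fitted(fit_raw), fitted(fit_z), tolerance = 1e-6)
```

Each scaled slope is simply the raw slope multiplied by the standard deviation of its predictor, so the model itself is unchanged; only the units of the coefficients differ.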
