Advanced Research Skills: Glms Ii Binomial Family
Advanced Research Skills: Glms Ii Binomial Family
GLMs II
Binomial Family
Olivier MISSA, [email protected]
Advanced Research Skills
2
Outline
Continue our Introduction to Generalized Linear Models.
In this lecture:
Illustrate the use of GLMs for
proportion and
binary data.
3
Binary & Proportion data tend to follow
the Binomial distribution
The Canonical link of this glm family
is the logit function:
The variance reaches a maximum for intermediate values
of p and a minimum at either 0% or 100%.
Reminder
|
|
.
|
\
|
) 1 (
log
p
p
( ) p p n Var = 1
p n Mean=
4
In R, binary/proportion data can be entered
into a model as a response in three different ways:
as a numeric vector
(holding the number or proportion of successes)
as a logical vector or a factor
(TRUE or the first factor level will be considered successes).
as a two-column matrix
(the first column holding the number of successes and
the second column the number of failures).
Three ways to work with binary data
5
Toxicity to tobacco budworm (moth) of different doses
of trans-cypermethrin.
Batches of 20 moths (of each sex) were put in contact
for three days with increasing doses of the pyrethroid.
1
st
Example
Dose (micrograms)
Sex 1 2 4 8 16 32
Male 1 4 9 13 18 20
Female 0 2 6 10 12 16
Number of dead moths
out of 20 tested
> (dose <- rep(2^(0:5), 2))
[1] 1 2 4 8 16 32 1 2 4 8 16 32
> numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)
> (sex <- factor( rep( c("M","F"), c(6,6) ) ))
[1] M M M M M M F F F F F F
Levels: F M
> SF <- cbind(numdead, numalive=20-numdead)
6
1
st
Example
> modb <- glm(SF ~ sex*dose, family=binomial)
> summary(modb)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.71578 0.32233 -5.323 1.02e-07 ***
sexM -0.21194 0.51523 -0.411 0.68082
dose 0.11568 0.02379 4.863 1.16e-06 ***
sexM:dose 0.18156 0.06692 2.713 0.00666 **
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.876 on 11 degrees of freedom
Residual deviance: 18.164 on 8 degrees of freedom
AIC: 56.275
What is modelled is the proportion of successes
=
(
|
|
.
|
\
|
+ =
n
i
i
i
i i i i
y n
y n
y n y y y D
1
ln ) ( ) / ln( 2
7
1
st
Example
> ldose <- log2(dose)
> modb2 <- glm(SF ~ sex*ldose, family=binomial)
> summary(modb2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.9935 0.5527 -5.416 6.09e-08 ***
sexM 0.1750 0.7783 0.225 0.822
ldose 0.9060 0.1671 5.422 5.89e-08 ***
sexM:ldose 0.3529 0.2700 1.307 0.191
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.8756 on 11 degrees of freedom
Residual deviance: 4.9937 on 8 degrees of freedom
AIC: 43.104
8
1
st
Example
> drop1(modb2, test="Chisq")
Single term deletions
Model:
SF ~ sex * ldose
Df Deviance AIC LRT Pr(Chi)
<none> 4.994 43.104
sex:ldose 1 6.757 42.867 1.763 0.1842
> modb3 <- update(modb2, ~. sex:ldose)
> summary(modb3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***
sexM 1.1007 0.3558 3.093 0.00198 **
ldose 1.0642 0.1311 8.119 4.70e-16 ***
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.876 on 11 degrees of freedom
Residual deviance: 6.757 on 9 degrees of freedom
AIC: 42.867
9
1
st
Example
> drop1(modb3, test="Chisq")
Single term deletions
Model:
SF ~ sex + ldose
Df Deviance AIC LRT Pr(Chi)
<none> 6.757 42.867
sex 1 16.984 51.094 10.227 0.001384 **
ldose 1 118.799 152.909 112.042 < 2.2e-16 ***
> shapiro.test(residuals(modb3), type="deviance")
Shapiro-Wilk normality test
data: residuals(modb3, type = "deviance")
W = 0.9666, p-value = 0.8725
10
1
st
Example
> par(mfrow=c(2,2))
> plot(modb3)
11
1
st
Example
> plot( c(0,1) ~ c(1,32), type="n", log="x",
xlab="dose", ylab="Probability")
> text(dose, numdead/20, labels=as.character(sex) )
> ld <- seq(0,32,0.5)
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
sex=factor(rep("M", length(ld)), levels=levels(sex))),
type="response") )
> lines (ld, predict(modb3, data.frame(ldose=log2(ld),
sex=factor(rep("F", length(ld)), levels=levels(sex))),
type="response"), lty=2, col="red" )
12
1
st
Example
> modbp <- glm(SF ~ sex*ldose,
family=binomial(link="probit"))
> AIC(modbp)
[1] 41.87836
> modbc <- glm(SF ~ sex*ldose,
family=binomial(link="cloglog"))
> AIC(modbc)
[1] 43.8663
> AIC(modb3)
[1] 42.86747
13
> summary(modb3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4732 0.4685 -7.413 1.23e-13 ***
sexM 1.1007 0.3558 3.093 0.00198 **
ldose 1.0642 0.1311 8.119 4.70e-16 ***
---
> exp(modb3$coeff) ## careful it may be misleading
(Intercept) sexM ldose
0.031019 3.006400 2.898560 ## odds ration: p / (1-p)
> exp(modb3$coeff[1]+modb3$coeff[2]) ## odds for males
(Intercept)
0.09325553
1
st
Example
logit scale
|
|
.
|
\
|
) 1 (
log
p
p
Every doubling of the dose will lead
to an increase in the odds of dying
over surviving by a factor of 2.899
14
Erythrocyte Sedimentation Rate in a group of patients.
Two groups : <20 (healthy) or >20 (ill) mm/hour
Q: Is it related to globulin & fibrinogen level in the blood ?
2
nd
Example
> data("plasma", package="HSAUR")
> str(plasma)
'data.frame': 32 obs. of 3 variables:
$ fibrinogen: num 2.52 2.56 2.19 2.18 3.41 2.46 3.22 2.21 ...
$ globulin : int 38 31 33 31 37 36 38 37 39 41 ...
$ ESR : Factor w/ 2 levels "ESR < 20","ESR > 20": 1 1 ...
> summary(plasma)
fibrinogen globulin ESR
Min. :2.090 Min. :28.00 ESR < 20:26
1st Qu.:2.290 1st Qu.:31.75 ESR > 20: 6
Median :2.600 Median :36.00
Mean :2.789 Mean :35.66
3rd Qu.:3.167 3rd Qu.:38.00
Max. :5.060 Max. :46.00
15
2
nd
Example
> stripchart(globulin ~ ESR, vertical=T, data=plasma,
xlab="Erythrocyte Sedimentation Rate (mm/hr)",
ylab="Globulin blood level", method="jitter" )
> stripchart(fibrinogen ~ ESR, vertical=T, data=plasma,
xlab="Erythrocyte Sedimentation Rate (mm/hr)",
ylab="Fibrinogen blood level", method="jitter" )
16
2
nd
Example
> mod1 <- glm(ESR~fibrinogen, data=plasma, family=binomial)
> summary(mod1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8451 2.7703 -2.471 0.0135 *
fibrinogen 1.8271 0.9009 2.028 0.0425 *
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30.885 on 31 degrees of freedom
Residual deviance: 24.840 on 30 degrees of freedom
AIC: 28.840
> mod2 <- glm(ESR~fibrinogen+globulin, data=plasma,
family=binomial)
> AIC(mod2)
[1] 28.97111
factor
17
2
nd
Example
> anova(mod1, mod2, test="Chisq")
Analysis of Deviance Table
Model 1: ESR ~ fibrinogen
Model 2: ESR ~ fibrinogen + globulin
Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 30 24.8404
2 29 22.9711 1 1.8692 0.1716
> summary(mod2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.7921 5.7963 -2.207 0.0273 *
fibrinogen 1.9104 0.9710 1.967 0.0491 *
globulin 0.1558 0.1195 1.303 0.1925
---
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 30.885 on 31 degrees of freedom
Residual deviance: 22.971 on 29 degrees of freedom
AIC: 28.971
The difference in terms of
Deviance between these
models is not significant,
which leads us to select
the least complex model
18
2
nd
Example
> shapiro.test(residuals(mod1, type="deviance"))
Shapiro-Wilk normality test
data: residuals(mod1, type = "deviance")
W = 0.6863, p-value = 5.465e-07
> par(mfrow=c(2,2))
> plot(mod1)