Logistic Regression Diagnostics, Splines and Interactions
Sandy Eckel
seckel@jhsph.edu
19 May 2007
Logistic Regression Diagnostics
Graphs to check assumptions
How to look at the data? Binary Y and binary (or categorical) X
How to look at the data? Binary Y and continuous X
[Figure: two panels against age of child (months, 0 to 80). One shows actual breastfeeding status (0/1); the other shows a lowess smoother of the probability of breastfeeding, bandwidth = .9]
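A lowess curve for a binary outcome can be approximated with a short local-regression routine. The following is a minimal sketch in plain Python, not the routine used to draw the slide's figure: it fits a tricube-weighted straight line around each point and omits the robustness iterations that full lowess implementations usually add. The function name `lowess_fit` and the toy data are illustrative assumptions; `frac` plays the role of the bandwidth (.9 on the slide).

```python
import math

def lowess_fit(x, y, frac=0.9):
    """Smooth y against x with tricube-weighted local linear fits."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))   # number of neighbours per window
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1] or 1e-12      # window half-width: k-th smallest distance
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]   # tricube weights
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # local line evaluated at x0
    return fitted

# Made-up toy data: child's age in months and breastfed status (1 = yes, 0 = no).
age = [1, 3, 6, 9, 12, 18, 24, 36, 48, 60]
bf = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
smooth = lowess_fit(age, bf, frac=0.9)
```

Because each local fit is a straight line, the smoothed values are not forced to stay between 0 and 1; they are a visual aid for judging the shape of the relationship, not fitted probabilities from a model.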
Assumptions of logistic regression
Two assumptions:
L – the model fits the data
I – the observations are all independent
How can we assess our model?
L – the model fits the data
3 methods for assessing model fit
“Look” at the data
Binary or categorical predictors: tables
Do you see a need for interaction?
Continuous predictors: lowess curves
Do you see a need for interaction or splines?
Graph observed probability vs. the predicted probability
[Figure: lowess smooth of observed vs. predicted probability of breastfeeding, Pr(bf); bandwidth = .8]
Assess model fit: Method 3
χ² Test of Goodness of Fit
Method 3:
χ² Test of Goodness of Fit
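The slides do not say which χ² test is meant. One common choice for logistic regression is the Hosmer-Lemeshow test, which groups observations by predicted probability and compares observed with expected event counts in each group. The sketch below assumes that reading; the function name, the grouping rule, and the toy data are illustrative and are not taken from the lecture.

```python
def hosmer_lemeshow_stat(y, p_hat, groups=10):
    """Return the Hosmer-Lemeshow chi-square statistic and its degrees of freedom."""
    pairs = sorted(zip(p_hat, y), key=lambda t: t[0])   # order by predicted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        observed = sum(yi for _, yi in chunk)   # observed number of events
        expected = sum(pi for pi, _ in chunk)   # expected number of events
        size = len(chunk)
        # Assumes 0 < expected < size, i.e. predictions are not all 0 or all 1.
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat, groups - 2

# Made-up outcomes and predicted probabilities, three groups of four.
p_toy = [0.25] * 4 + [0.5] * 4 + [0.75] * 4
y_toy = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0]
stat, df = hosmer_lemeshow_stat(y_toy, p_toy, groups=3)
```

A large statistic relative to a χ² distribution with `df` degrees of freedom suggests that predicted and observed probabilities disagree.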
Summary: logistic regression model diagnostics
How do we add flexibility in logistic regression?
Interaction is used to allow different effects (difference in log odds ratio) for different groups
Example: Back to breastfeeding example
Secondary predictors:
Child’s age (0 to 76 months)
Mother’s age (17 to 52) – need to center
Number of children (parity) (1 to 14) – need to center
Model A: gender
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) \;\Rightarrow\; \log\frac{p}{1-p} = -0.37 + 0.04(\text{Gender}) \]
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0389756 .1873558 0.21 0.835 -.3282351 .4061863
(Intercept) | -.3692173 .1281411 -2.88 0.004 -.6203693 -.1180653
------------------------------------------------------------------------------
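To read the Model A output on the odds and probability scales, the coefficients can be exponentiated and passed through the inverse logit. This is a small sketch using the coefficients printed above; the helper name `inv_logit` is an illustrative choice, and gender is taken to be coded 1=F, 0=M as stated later in the slides.

```python
import math

# Coefficients copied from the Model A output above.
b0 = -0.3692173   # intercept: log odds for the gender = 0 group
b1 = 0.0389756    # gender: difference in log odds for gender = 1

def inv_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

for gender in (0, 1):
    log_odds = b0 + b1 * gender
    print(f"gender={gender}: odds={math.exp(log_odds):.2f}, p={inv_logit(log_odds):.2f}")

odds_ratio = math.exp(b1)   # odds ratio comparing gender = 1 with gender = 0
print(f"odds ratio for gender: {odds_ratio:.2f}")
```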
Model B:
gender and mother's age
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Agemom} - 25) \]
\[ \Rightarrow\; \log\frac{p}{1-p} = -0.16 + 0.06(\text{Gender}) - 0.06(\text{Agemom} - 25) \]
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0620916 .1907094 0.33 0.745 -.311692 .4358751
age_momc | -.0615396 .0156442 -3.93 0.000 -.0922016 -.0308776
(Intercept) | -.1573215 .13957 -1.13 0.260 -.4308736 .1162307
------------------------------------------------------------------------------
[Figure: two panels of lowess curves against age of mother (years, 20 to 50), bandwidth = .9]
Possible modification – add a spline
\[ (\text{Agemom} - 25)^+ = \begin{cases} 0 & \text{if Agemom} \le 25 \\ \text{Agemom} - 25 & \text{if Agemom} > 25 \end{cases} \]
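The spline term is simply the positive part of the centered age, so it can be built with a one-line helper. A minimal sketch follows; the function name `hinge` and the example ages are hypothetical, while the variable names `age_momc` and `age_mom_sp` follow the Stata output in these slides.

```python
def hinge(x, knot):
    """Return the linear spline term (x - knot)+, which is zero at or below the knot."""
    return max(0.0, x - knot)

# Hypothetical mothers' ages; each gets a centered term and a spline term.
ages = [17, 22, 25, 30, 52]
age_momc = [a - 25 for a in ages]            # (Agemom - 25)
age_mom_sp = [hinge(a, 25) for a in ages]    # (Agemom - 25)+
```

Entering both terms in the model lets the slope for mother's age change at 25 while keeping the fitted line continuous at the knot.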
Model C:
gender and mother's age with spline
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Agemom} - 25) + \beta_3(\text{Agemom} - 25)^+ \]
\[ \Rightarrow\; \log\frac{p}{1-p} = -0.55 + 0.08(\text{Gender}) - 0.25(\text{Agemom} - 25) + 0.23(\text{Agemom} - 25)^+ \]
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0821887 .1928521 0.43 0.670 -.2957946 .4601719
age_momc | -.2467804 .0627557 -3.93 0.000 -.3697794 -.1237814
age_mom_sp | .2306511 .074613 3.09 0.002 .0844122 .3768899
(Intercept) | -.5487527 .1888302 -2.91 0.004 -.9188531 -.1786522
------------------------------------------------------------------------------
Model C: Interpretation
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Agemom} - 25) + \beta_3(\text{Agemom} - 25)^+ \]
Breastfeeding example conclusion
For boys and girls with mothers under 25 years of
age, the odds that the mother will breastfeed the
child decrease by a factor of
exp(β2)=exp(-0.25)=0.78
for each additional year of mother’s age
(95% CI: 0.69, 0.88)
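The odds ratio and its interval come straight from exponentiating the Model C coefficient and its confidence limits. The sketch below reproduces that arithmetic and also shows the point estimate for mothers over 25, exp(β2+β3), which follows from the model equation. A confidence interval for that second quantity would need the covariance of the two estimates, which the output shown here does not include, so none is computed.

```python
import math

# Values copied from the Model C output.
b2 = -0.2467804                     # age_momc: slope for (Agemom - 25)
b3 = 0.2306511                      # age_mom_sp: change in slope above age 25
ci_b2 = (-0.3697794, -0.1237814)    # 95% CI for b2 on the log odds scale

or_under_25 = math.exp(b2)                         # per-year odds ratio below 25
ci_under_25 = tuple(math.exp(v) for v in ci_b2)    # CI on the odds ratio scale
or_over_25 = math.exp(b2 + b3)                     # per-year odds ratio above 25

print(f"under 25: OR={or_under_25:.2f}, 95% CI=({ci_under_25[0]:.2f}, {ci_under_25[1]:.2f})")
print(f"over 25:  OR={or_over_25:.2f}")
```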
Model D: gender and number of children (parity)
Logit estimates Number of obs = 472
LR chi2(2) = 9.99
Prob > chi2 = 0.0068
Log likelihood = -315.01027 Pseudo R2 = 0.0156
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0622939 .1894771 0.33 0.742 -.3090744 .4336622
parityc | -.1180777 .0384221 -3.07 0.002 -.1933837 -.0427718
(Intercept) | -.8009664 .1937284 -4.13 0.000 -1.180667 -.4212659
------------------------------------------------------------------------------
Sketch of Model D
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Parity} - 8) \]
[Figure: two panels of log odds of breastfeeding against # of kids mother had born alive (0 to 15), bandwidth = .9]
Model E: gender, parity,
and parity spline
Logit estimates Number of obs = 472
LR chi2(3) = 14.18
Prob > chi2 = 0.0027
Log likelihood = -312.91444 Pseudo R2 = 0.0222
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0666432 .1903717 0.35 0.726 -.3064785 .439765
parityc | -.1718923 .0465719 -3.69 0.000 -.2631716 -.080613
parity_sp | .3281222 .1562619 2.10 0.036 .0218545 .6343899
(Intercept) | -1.045415 .2291123 -4.56 0.000 -1.494466 -.5963627
------------------------------------------------------------------------------
Sketch of Model E
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Parity} - 8) + \beta_3(\text{Parity} - 8)^+ \]
[Figure: log odds of breastfeeding against parity, with a knot at parity = 8, by baby’s gender (1=F, 0=M)]
Understanding the equation
Write separate equations by parity group
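Because the spline term is zero below the knot, Model E reduces to two straight-line equations. The version below is worked out from the Model E definition using rounded coefficients from the output above; assigning parity = 8 to the second group is a convention, since both equations agree at that point.

\[
\begin{aligned}
\text{Parity} < 8:&\quad \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Parity}-8) \\
&\quad \approx -1.05 + 0.07(\text{Gender}) - 0.17(\text{Parity}-8) \\
\text{Parity} \ge 8:&\quad \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + (\beta_2+\beta_3)(\text{Parity}-8) \\
&\quad \approx -1.05 + 0.07(\text{Gender}) + 0.16(\text{Parity}-8)
\end{aligned}
\]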
Problem with the parity spline
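The new variable
\[ (\text{Gender})\times(\text{Parity}-8)^+ = \begin{cases} 0 & \text{for boys} \\ 0 & \text{for parity} < 8 \\ \text{Parity}-8 & \text{for girls with parity} \ge 8 \end{cases} \]
where Gender is the baby’s gender (1=F, 0=M)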
Model F:
spline + interaction with spline
Logit estimates Number of obs = 472
LR chi2(4) = 21.75
Prob > chi2 = 0.0002
Log likelihood = -309.12925 Pseudo R2 = 0.0340
------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .1806766 .1953877 0.92 0.355 -.2022763 .5636294
parityc | -.1737844 .0473172 -3.67 0.000 -.2665244 -.0810445
parity_sp | .734593 .2786475 2.64 0.008 .1884539 1.280732
parity_sp_~r | -.8665087 .3966433 -2.18 0.029 -1.643915 -.0891021
(Intercept) | -1.106983 .2343301 -4.72 0.000 -1.566261 -.647704
------------------------------------------------------------------------------
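Sketch of Model F
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Parity} - 8) + \beta_3(\text{Parity} - 8)^+ + \beta_4(\text{Gender})\times(\text{Parity} - 8)^+ \]
[Figure: log odds of breastfeeding against parity, with a knot at parity = 8, by baby’s gender (1=F, 0=M)]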
Understanding the equation
Write separate equations by parity and gender
\[ \log(\text{odds}) = -1.11 + 0.18(\text{Gender}) - 0.17(\text{Parity}-8) + 0.73(\text{Parity}-8)^+ - 0.87(\text{Gender})\times(\text{Parity}-8)^+ \]
where Gender is the baby’s gender (1=F, 0=M)
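Splitting the Model F equation by gender and parity group amounts to adding up the coefficients that are active in each group. The sketch below does this with the coefficients from the Model F output; the function names `log_odds` and `parity_slope` are illustrative, not from the lecture.

```python
import math

# Coefficients copied from the Model F output.
b0, b1 = -1.106983, 0.1806766                    # intercept, gender (1=F, 0=M)
b2, b3, b4 = -0.1737844, 0.734593, -0.8665087    # parityc, parity_sp, gender x parity_sp

def log_odds(gender, parity):
    """Log odds of breastfeeding under Model F."""
    sp = max(0, parity - 8)                      # (Parity - 8)+
    return b0 + b1 * gender + b2 * (parity - 8) + b3 * sp + b4 * gender * sp

def parity_slope(gender, above_knot):
    """Change in log odds per additional child for one gender/parity group."""
    return b2 + ((b3 + b4 * gender) if above_knot else 0.0)

for gender, label in ((0, "boys"), (1, "girls")):
    for above in (False, True):
        s = parity_slope(gender, above)
        side = "parity >= 8" if above else "parity < 8"
        print(f"{label}, {side}: slope={s:.2f}, odds ratio={math.exp(s):.2f}")
```

Below the knot both genders share the slope β2; above it boys have slope β2+β3 and girls have β2+β3+β4.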
Interpretation – Model F
Is the difference in the log odds ratio for parity by gender statistically significant?
\[ \log\frac{p}{1-p} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Parity} - 8) + \beta_3(\text{Parity} - 8)^+ + \beta_4(\text{Gender})\times(\text{Parity} - 8)^+ \]
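The question comes down to whether β4 differs from zero, which the Wald z statistic in the Model F output addresses. As a sketch, the z value and two-sided p-value can be recomputed from the coefficient and standard error reported there, using only the standard normal tail.

```python
import math

# Values copied from the parity_sp_~r row of the Model F output.
b4 = -0.8665087      # coefficient on Gender x (Parity - 8)+
se_b4 = 0.3966433    # its standard error

z = b4 / se_b4
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided standard normal tail probability

print(f"z = {z:.2f}, p = {p_value:.3f}")
```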
Conclusion – Model F
For children whose mothers have fewer than 8 children,
the odds that the mother will breastfeed the child are
about the same for boys and girls and decrease by a
factor of exp(β2)=0.84 for each additional child
(95% CI: 0.77, 0.92).
This relationship is significantly different for both boys
and girls whose mothers have more than 8 children:
For boys whose mothers have more than 8 children,
the odds that the mother will breastfeed increase by
a factor of exp{β2+β3}=1.75 for each additional child
(95% CI: 1.05, 2.93).
For girls whose mothers have more than 8 children,
the odds that the mother will breastfeed decrease by
a factor of exp{β2+β3+β4}=0.74 for each additional
child (95% CI: 0.40, 1.37).
Comparing the models
                                    Odds Ratio for Model
Variables                    A       B       C       D       E       F
Reference*                0.69    0.85    0.58    0.45    0.35    0.33
Gender                    1.04    1.06    1.09    1.06    1.07    1.20
Age-25                            0.94    0.78
(Age-25)+                                 1.26
Parity-8                                          0.89    0.84    0.84
(Parity-8)+                                               1.39    2.08
(Gender)x(Parity-8)+                                              0.42
Deviance                 640.0   623.5   613.5   630.0   625.8   618.3
*The table value for the reference group is the odds, not the odds ratio
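For nested models, the drop in deviance can be turned into a likelihood ratio test. The sketch below does this for the nested sequence D, E, F, each of which adds one parameter to the previous model, using the log likelihoods printed in the Stata output (deviance = -2 × log likelihood). The function name `lr_test_1df` is illustrative, and the shortcut for the p-value applies only to a difference of one degree of freedom.

```python
import math

# Log likelihoods copied from the output for Models D, E and F.
loglik = {"D": -315.01027, "E": -312.91444, "F": -309.12925}

def lr_test_1df(ll_small, ll_big):
    """Likelihood ratio test for nested models that differ by one parameter."""
    stat = 2 * (ll_big - ll_small)              # equals the drop in deviance
    p_value = math.erfc(math.sqrt(stat / 2))    # upper tail of chi-square with 1 df
    return stat, p_value

for small, big in (("D", "E"), ("E", "F")):
    stat, p = lr_test_1df(loglik[small], loglik[big])
    print(f"{small} vs {big}: LR chi2(1) = {stat:.2f}, p = {p:.4f}")
```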
Comparing the models
Summary of lecture 16
Logistic regression assumptions
L – the model fits the data
I – the observations are all independent
Logistic regression diagnostics
“Look” at the data: tables or logits of lowess curves
Graph observed probability vs. the predicted probability
Use the χ² Test of Goodness of Fit to assess the predicted probabilities
Splines and interactions add flexibility to the model
When comparing nested models, a table of
the coefficients and their CI’s, or
the odds ratios and their CI’s
helps the reader quickly compare models
Two models not nested in one another cannot be directly compared
One can identify a new parent model by comparing statistical significance