0% found this document useful (0 votes)
18 views

Logistic Regression Diagnostics, Splines and Interactions: Sandy Eckel Seckel@jhsph - Edu 19 May 2007

This document discusses logistic regression diagnostics and methods for assessing model fit. It describes using graphs and tables to check the assumptions of logistic regression models. Specifically, it recommends using lowess curves to visualize relationships between a binary outcome and continuous predictors. Three methods are presented for assessing model fit: examining graphs and tables, comparing observed and predicted probabilities, and using a goodness-of-fit test. The document also discusses adding splines and interactions to logistic regression models to allow for flexible relationships between predictors and the outcome. An example using breastfeeding data is presented.

Uploaded by

Borja Martinez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views

Logistic Regression Diagnostics, Splines and Interactions: Sandy Eckel Seckel@jhsph - Edu 19 May 2007

This document discusses logistic regression diagnostics and methods for assessing model fit. It describes using graphs and tables to check the assumptions of logistic regression models. Specifically, it recommends using lowess curves to visualize relationships between a binary outcome and continuous predictors. Three methods are presented for assessing model fit: examining graphs and tables, comparing observed and predicted probabilities, and using a goodness-of-fit test. The document also discusses adding splines and interactions to logistic regression models to allow for flexible relationships between predictors and the outcome. An example using breastfeeding data is presented.

Uploaded by

Borja Martinez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

Lecture 16:

Logistic regression diagnostics,


splines and interactions

Sandy Eckel
[email protected]

19 May 2007

1
Logistic Regression Diagnostics
Graphs to check assumptions

 Recall: Graphing was used to check the


assumptions of linear regression
 Graphing binary outcomes for logistic
regression is not as straightforward as
graphing a continuous outcome for linear
regression
 Several methods have been developed to
visualize the logistic regression model for use
in checking the assumptions
 Tables
 Graphs with lowess curves
2
Nepali breastfeeding study
Example: data
 Breastfeeding tends to be protective for numerous
infant health risks
 A study was conducted in Nepal to evaluate the odds
of breastfeeding using a number of possible factors

 Outcome: breastfeeding (1=yes, 0=no)


 Primary predictor: baby’s gender (1=F, 0=M)
 Secondary predictors:
 Child’s age (0 to 76 months)
 Mother’s age (17 to 52)
 Number of children (parity) (1 to 14)

3
How to look at the data? Binary Y and
Binary (or categorical) X

 Breastfeeding vs. baby’s gender


 both binary
 make a table!

 This method would work for any binary or


categorical predictor

4
How to look at the data? Binary Y and
Continuous X

 Breastfeeding vs. child’s age


 Breastfeeding is binary
 Child’s age is continuous
 Could make child’s age categorical or binary
by
 breaking it at the quartiles
 defining groups by years
e.g. <1 year, 1 year, 2-3 years, 4+ years
 then use tables
 Or, we could graph the relationship
5
How to look at the data? Binary Y and
Continuous X
A scatter plot

1
.8

Actual
.6
breast fed

breastfeeding
.4 .2
0

0 20 40 60 80
age of child (months)

This isn’t very informative…how can we fix this?


6
How to look at the data? Binary Y and
Continuous X
Allow a smoothed relationship
 The “lowess” command is a smoothed
graph
 It’s like a window has been pulled
across the graph
 at each moment, the probability of a 1
within the window is graphed
 as the window moves, the probability of a 1
is shown as a line
 changing the width of the window yields
different levels of smoothing
7
How to look at the data? Binary Y and
Continuous X
A scatter plot with Lowess curve

Lowess smoother
1
.8

Probability of
breast fed
.6

breastfeeding
.4 .2
0

0 20 40 60 80
age of child (months)
bandwidth = .9

Much more informative! Now we can talk about how the


probability of breastfeeding changes with child’s age
- We want this to look like a nice ‘logistic’ curve 8
Checking form of the model
 Lowess allows us to visualize how the probability of
our outcome varies by a certain predictor
 We really want to graph log[p/(1-p)], because that
function is assumed to be linear in logistic regression
 Get the lowess smooth of the probability and then you can
transform the smoothed probability to the log odds scale
 Plot the `smoothed’ log odds versus the continuous covariate
of interest
 This relation should look linear
 By looking at lowess plots within key subgroups, we
can detect whether the relationship varies across
covariates
 Looking at these plots helps us decide if interactions
or splines are needed in the model

9
Assumptions of logistic regression

Two assumptions:
 L – the model fits the data
 I – the observations are all independent

 Independence still cannot be assessed


graphically; must know how the data
were collected

10
How can we assess our model ?
L – the model fits the data
3 methods for assessing model fit
 “Look” at the data
 Binary or categorical predictors: tables
 Do you see a need for interaction?
 Continuous predictors: lowess curves
 Do you see a need for interaction or splines?
 Graph observed probability vs. the
predicted probability

 Use the X2 Test of Goodness of Fit to


assess the predicted probabilities
11
Assess model fit : Method 2
Graphing observed vs. predicted probabilities
 Run the model
 Save the predicted probability of breastfeeding for each
child
 Plot observed vs predicted probabilities
Lowess smoother
If the relationship is
close to a straight line
1

 the predicted and


.8

observed probabilities
Observed

are almost the same


.6
breast fed

 the model fits the data


.4

very well
.2

If not, try to add more


X’s, splines or
0

bandwidth = .8
.2 .4
Pr(bf)
.6 .8 1
interactions
Predicted
12
Assess model fit : Method 3
X2 Test of Goodness of Fit

 Run the model


X2 Test of Goodness of Fit
 Breaks data groups of equal size
 Compares observed and predicted
numbers of observations in each group
with a X2 test
(also called the Hosmer-Lemeshow X2 Test)

 H0: the model fits the observed data well


 We want p>0.05 so we don’t reject H0

13
Method 3:
X2 Test of Goodness of Fit

 p = 0.20 > α = 0.05

 Fail to reject H0; conclude that the model fits


the data reasonably well

 Conclusion matches the other methods


 Scatter plots showed same relationship as model
 the observed and predicted probabilities matched
 method 2: straight line
 the observed and predicted data matched
 method 3: p>0.05

14
Summary: logistic regression model diagnostics

 There are no easy graphs for looking at binary


outcome data
 use lowess
 split according to binary/categorical covariates to
see how relationship between outcome and
primary predictor varies
 Assessing model fit: 3 methods
 look at tables and graphs
 compare graph of observed vs. predicted p
 X2 Test of Goodness of Fit: want large p-value

15
How do we add
Flexibility in logistic regression?

Same methods as in linear regression!


 Splines
 are used to allow the “line” to bend

 Interaction
 is used to allow different effects (difference
in log odds ratio) for different groups

16
Example: Back to breastfeeding example

 Outcome: breastfeeding (1=yes, 0=no)

 Primary predictor: gender (1=F, 0=M)

 Secondary predictors:
 Child’s age (0 to 76 months)
 Mother’s age (17 to 52) – need to center
 Number of children (parity) (1 to 14) – need to
center

17
Model A: gender

 p   p 
log  = β0 + β1 (Gender ) ⇒ log  = -0.37 + 0.04(Gender )
1− p  1− p 

Logit estimates Number of obs = 472


LR chi2(1) = 0.04
Prob > chi2 = 0.8352
Log likelihood = -319.98468 Pseudo R2 = 0.0001

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0389756 .1873558 0.21 0.835 -.3282351 .4061863
(Intercept) | -.3692173 .1281411 -2.88 0.004 -.6203693 -.1180653
------------------------------------------------------------------------------

baby’s gender (1=F, 0=M)

18
Model B:
gender and mother's age
 p 
log  = β0 + β1 (Gender ) + β2 ( Agemom − 25)
1− p 
 p 
⇒ log  = -0.16 + 0.06(Gender ) + -0.06( Agemom − 25)
1− p 

Logit estimates Number of obs = 472


LR chi2(2) = 16.50
Prob > chi2 = 0.0003
Log likelihood = -311.75482 Pseudo R2 = 0.0258

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0620916 .1907094 0.33 0.745 -.311692 .4358751
age_momc | -.0615396 .0156442 -3.93 0.000 -.0922016 -.0308776
(Intercept) | -.1573215 .13957 -1.13 0.260 -.4308736 .1162307
------------------------------------------------------------------------------

baby’s gender (1=F, 0=M)


19
Possible modification – add a spline

 A plot of the log odds of the lowess smooth of


breastfeeding versus mother’s age reveals
 There may be a bend in the line at approximately
mother’s age = 25
 We’ll add a spline for mother’s age>25
Lowess smoother
Logit transformed smooth
Boys Girls
6
4
breast fed
2
0
-2

20 30 40 50 20 30 40 50
age of mother (years)
bandwidth = .9

20
Possible modification – add a spline

 For mother’s age > 25


 we center mother’s age at 25 also, for
convenience
 The spline is a new variable:

(agemom – 25)+
= 0 if age < 25
= (agemom – 25) if age >25

21
Model C:
gender and mother's age with spline
 p 
log  = β0 + β1 (Gender ) + β2 ( Agemom − 25) + β3 ( Agemom − 25) +
1− p 
 p 
⇒ log  = -0.55 + 0.08(Gender ) + -0.25( Agemom − 25) + 0.23( Agemom − 25) +
1− p 

Logit estimates Number of obs = 472


LR chi2(3) = 26.49
Prob > chi2 = 0.0000
Log likelihood = -306.76341 Pseudo R2 = 0.0414

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0821887 .1928521 0.43 0.670 -.2957946 .4601719
age_momc | -.2467804 .0627557 -3.93 0.000 -.3697794 -.1237814
age_mom_sp | .2306511 .074613 3.09 0.002 .0844122 .3768899
(Intercept) | -.5487527 .1888302 -2.91 0.004 -.9188531 -.1786522
------------------------------------------------------------------------------

baby’s gender (1=F, 0=M)


22
Understanding the equation
Write separate equations by age group

log(odds) = -0.55 + 0.08(Gender)– 0.25(Age-25)


+ 0.23(Age-25)+

 For those with mothers under 25


-0.55 + 0.08(Gender) – 0.25(Age-25)

 For those with mothers over 25


-0.55+0.08(Gender)–0.25(Age-25) + 0.23(Age-25)
= -0.55 + 0.08(Gender)+(-0.25+0.23)(Age-25)
= -0.55 + 0.08(Gender)+ -0.02 (Age-25)

23
Model C: Interpretation
 p 
log  = β0 + β1 (Gender ) + β 2 ( Agemom − 25) + β3 ( Agemom − 25) +
1− p 

 β0: The log odds of breastfeeding for boys with


25-year-old mothers is -0.55 baby’s gender (1=F, 0=M)

 β1: Adjusting for mother’s age, the log odds


ratio of breastfeeding for girls vs. boys is 0.08

 β2: Adjusting for gender, the log odds ratio of


breastfeeding corresponding to a one year
difference in mother’s age for mothers
under 25 years is -0.25
24
Model C: Interpretation
 p 
log  = β0 + β1 (Gender ) + β2 ( Agemom − 25) + β3 ( Agemom − 25) +
1− p 
 β2+β3: Adjusting for gender, the log odds ratio
of breastfeeding corresponding to a one year
difference in mother’s age for mothers over
25 years is -0.25 + 0.23
 β3: Adjusting for gender, the difference in the
log odds ratio of breastfeeding corresponding
to a one year difference in mother’s age for
mothers over 25 years compared with mothers
under 25 years is 0.23
Tough both to put in words and to understand,
can be easier to understand mathematically! 25
Model C: Is the difference in the log odds ratio for
mother’s age statistically significant?
 p 
log  = β0 + β1 (Gender ) + β2 ( Agemom − 25) + β3 ( Agemom − 25) +
1− p 

 H0: β3 = 0 in the population


 i.e., the change in slope is 0, and the line does
not bend in the population
 One variable added: use the Wald test
 Z=3.09, p=0.002, CI for β3 = (0.08, 0.38)
 Reject H0
 Conclude that Model C is better than Model B

26
Breastfeeding example conclusion
 For boys and girls with mothers under 25 years of
age, the odds that the mother will breastfeed the
child decreases by a factor of
exp(β2)=exp(-.24)=0.78
for each additional year of mother’s age
(95% CI: 0.69, 0.88)

 This relationship is significantly different for boys and


girls with mothers over 25 years of age:
 for these children, the odds that the mother will
breastfeed the child is approximately the same for each
year of mother’s age; the odds decreases by a factor of
only exp(β2+β3)=0.98 for each additional year of
mother’s age (95% CI: 0.95, 1.02)

27
Model D: gender and number of children (parity)
Logit estimates Number of obs = 472
LR chi2(2) = 9.99
Prob > chi2 = 0.0068
Log likelihood = -315.01027 Pseudo R2 = 0.0156

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0622939 .1894771 0.33 0.742 -.3090744 .4336622
parityc | -.1180777 .0384221 -3.07 0.002 -.1933837 -.0427718
(Intercept) | -.8009664 .1937284 -4.13 0.000 -1.180667 -.4212659
------------------------------------------------------------------------------
Sketch of Model D

breastfeeding
 p 
log odds of

log  = β0 + β1 (Gender )
1− p 
+ β 2 ( Parity − 8)

baby’s gender (1=F, 0=M)


8 Parity 28
Asessing the relationship in the data
 The relationship between logit(bf) and parity is very
different for boys and girls
 Mothers of more children tend to
 breastfeed boys more
 breastfeed girls less  The relationship is about
the same for boys and
Lowess smoother girls whose mothers
Logit transformed smooth
Boys Girls
have about 8 or fewer
kids
2

 Could add a spline


and an interaction
0

term for only parity


breast fed

> 8 so that the


-2

slopes only differ


then
-4

 First we’ll just add a


spline
-6

0 5 10 15 0 5 10 15
# of kids mother had born alive
bandwidth = .9 29
Model E: gender, parity,
and parity spline
Logit estimates Number of obs = 472
LR chi2(3) = 14.18
Prob > chi2 = 0.0027
Log likelihood = -312.91444 Pseudo R2 = 0.0222

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .0666432 .1903717 0.35 0.726 -.3064785 .439765
parityc | -.1718923 .0465719 -3.69 0.000 -.2631716 -.080613
parity_sp | .3281222 .1562619 2.10 0.036 .0218545 .6343899
(Intercept) | -1.045415 .2291123 -4.56 0.000 -1.494466 -.5963627
------------------------------------------------------------------------------
Sketch of Model E

breastfeeding
 p 
log odds of
log  = β0 + β1 (Gender )
1− p 
+ β 2 ( Parity − 8)
+ β3 ( Parity − 8) +
8 Parity
baby’s gender (1=F, 0=M) 30
Understanding the equation
Write separate equations by parity group

log(odds) = -1.05 + 0.07(Gender) – 0.17(Parity-8) +


0.33(Parity-8)+

 For those with mothers with less than 8


children
-1.05 + 0.07(Gender) – 0.17(Parity-8)

 For those with mothers with at least 8 children


-1.05 + 0.07(Gender) – 0.17(Parity-8) + 0.33(Parity-8)
= -1.05 + 0.07(Gender) + (-0.17+0.33)(Parity-8)
= -1.05 + 0.07(Gender) + 0.16(Parity-8)

31
Problem with the parity spline

 Model E forces the “slope” to be the same for


boys and girls
 The lowess curve suggests slope should differ
for boys and girls whose mothers had more
than around 8 children
 Add an interaction term between the spline and
gender
 that allows the slope to differ by gender only for
those whose mothers have 8 or more children

32
The new variable

 Gender = 0 for boys


 (Parity – 8)+ = 0 for children of low
parity families

 (Gender)x(Parity – 8)+
baby’s gender (1=F, 0=M)
= 0 for boys
= 0 for parity < 8
= (Parity – 8) for girls with parity >=8

33
Model F:
spline + interaction with spline
Logit estimates Number of obs = 472
LR chi2(4) = 21.75
Prob > chi2 = 0.0002
Log likelihood = -309.12925 Pseudo R2 = 0.0340

------------------------------------------------------------------------------
bf | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | .1806766 .1953877 0.92 0.355 -.2022763 .5636294
parityc | -.1737844 .0473172 -3.67 0.000 -.2665244 -.0810445
parity_sp | .734593 .2786475 2.64 0.008 .1884539 1.280732
parity_sp_~r | -.8665087 .3966433 -2.18 0.029 -1.643915 -.0891021
(Intercept) | -1.106983 .2343301 -4.72 0.000 -1.566261 -.647704
------------------------------------------------------------------------------

Sketch of Model F
 p 
breastfeeding
log  = β0 + β1 (Gender ) log odds of
1− p 
+ β2 ( Parity − 8) + β3 ( Parity − 8) +
+ β4Gender × ( Parity − 8) + baby’s gender (1=F, 0=M)
8 Parity 34
Understanding the equation
Write separate equations by parity and gender
log(odds) = -1.11 + 0.18(Gender) – 0.17(Parity-8) + 0.73(Parity-8)+
- 0.87(Gender)x(Parity-8)+
baby’s gender (1=F, 0=M)

 For those with mothers with less than 8 children


-1.11 + 0.18(Gender) – 0.17(Parity-8)

 For boys with mothers with at least 8 children


-1.11 + 0.18(Gender) – 0.17(Parity-8) + 0.73(Parity-8)
= -1.11 + (-0.17+0.73)(Parity-8)
 For girls with mothers with at least 8 children
-1.11 + 0.18(Gender) – 0.17(Parity-8) + 0.73(Parity-8)
- 0.87(Gender)x(Parity-8)
= (-1.11 + 0.18) +(-0.17 + 0.73 - 0.87)(Parity-8)

35
Interpretation – Model F

 exp(β0): The odds of breastfeeding for boys


of mothers with 8 children is exp(-1.11) =
0.33
 exp(β1): Adjusting for mother’s parity, the
odds ratio of breastfeeding for girls vs. boys is
1.20 for children of mothers with less than 8
children
 exp(β2): Adjusting for gender, the odds ratio
of breastfeeding corresponding to a one child
difference in parity for mothers with fewer
than 8 children is .84
36
Interpretation – Model F

 exp(β2+β3): Among boys, the odds


ratio of breastfeeding corresponding to
a one child difference in parity for
mothers with at least 8 children is 1.75

 exp(β2+β3+β4): Among girls, the


odds ratio of breastfeeding
corresponding to a one child difference
in parity for mothers with at least 8
children is 0.74
37
Interpretation – Model F
 Complicated to interpret the components on their
own – read on your own if you want!

 exp(β3): The odds ratio of breastfeeding corresponding to


a one child difference in parity is 2.08 times higher for
boys whose mothers have at least 8 children than
for boys whose mothers have fewer than 8 children
 exp(β3+β4): The odds ratio of breastfeeding corresponding
to a one child difference in parity is 0.74 times lower for
girls whose mothers have at least 8 children than for
girls whose mothers have fewer than 8 children
 exp(β4): The odds ratio of breastfeeding corresponding to
a one child difference in parity is 0.42 times lower for
boys whose mothers have at least 8 children than
for girls whose mothers have at least 8 children

38
Is the difference in the log odds ratio for parity by gender
statistically significant?

 H0: β4 = 0 in the population


 i.e. the change in slope for parity > 8 is the same for boys and
girls in the population
 One variable added: use the Wald test
 Z=-2.18, p=0.029, CI for exp(β3) = (0.19, 0.91)
 Reject H0
 Conclude that Model F is better than Model E

 p 
log  = β0 + β1 (Gender )
1− p 
+ β2 ( Parity − 8) + β3 ( Parity − 8) +
+ β4Gender × ( Parity − 8) + 39
Conclusion – Model F
 For children whose mothers have fewer than 8 children,
the odds that the mother will breastfeed the child is
about the same for boys and girls and decreases by a
factor of exp(β2)=0.84 for each additional year of
mother’s age (95% CI: 0.77, 0.92).
 This relationship is significantly different for both boys
and girls whose mothers have more than 8 children:
 For boys whose mothers have more than 8 children,
the odds that the mother will breastfeed increases by
a factor of exp{β2+β3}=1.75 for each additional year
of mother’s age (95% CI: 1.05, 2.93).
 For girls whose mothers have more than 8 children,
the odds that the mother will breastfeed decreases by
a factor of exp{β2+β3+β4}=0.74 for each additional
year of mother’s age (95% CI: 0.40, 1.37).
40
Comparing the models
Odds Ratio for Model
Variables A B C D E F
Reference* 0.69 0.85 0.58 0.45 0.35 0.33
Gender 1.04 1.06 1.09 1.06 1.07 1.20
Age-25 0.94 0.78
(Age-25)+ 1.26
Parity – 8 0.89 0.84 0.84
(Parity-8)+ 1.39 2.08
(Gender)x
0.42
(Parity-8)+
Deviance 640.0 623.5 613.5 630.0 625.8 618.3
41
*The table value for the reference group is the odds, not the odds ratio
Comparing the models

 Models C and F are both nested in


Model A
 Models C and F cannot be directly
compared to one another, but we can
see which has a smaller p-value when
compared to Model A
 C vs. A: X2 = 26.5 with 2 df
 F vs. A: X2 = 21.7 with 3 df
 Both p-values are very small <.0001, but the p-
value for model C is slightly smaller
42
What next?

 Model C improves prediction beyond gender


alone (Model A) more than Model F.
 Model C should be the next parent model, and
we should test the new variables in Model F to
see if they continue to improve prediction
within the context of Model C
 When a tentative final model is identified, the
assumptions of logistic regression should be
checked

43
Summary of lecture 16
 Logistic regression assumptions
 L – the model fits the data
 I – the observations are all independent
 Logistic regression diagnostics
 “Look” at the data: tables or logits of lowess curves
 Graph observed probability vs. the predicted probability
 Use the X2 Test of Goodness of Fit to assess the predicted
probabilities
 Splines and interactions add flexibility to the model
 When comparing nested models, a table of
 the coefficients and their CI’s, or
 the odds ratios and their CI’s
helps the reader quickly compare models
 Two models not nested in one another cannot be directly
compared
 One can identify a new parent model by comparing statistical
44
significance

You might also like