Practical Guide to Logistic Regression

Joseph M. Hilbe
Jet Propulsion Laboratory
California Institute of Technology, USA
This book is aimed at the working analyst or researcher who finds that they need some guidance when modeling binary response data. It is also of value for those who have not used logistic regression in the past, and who are not familiar with how it is to be implemented. I assume, however, that the reader has taken a basic course in statistics, including instruction on applying linear regression to study data. It is sufficient if you have learned this on your own. There are a number of excellent books and free online tutorials related to regression that can provide this background.

I think of this book as a basic guidebook, as well as a tutorial between you and me. I have spent many years teaching logistic regression, using logistic-based models in research, and writing books and articles about the subject. I have applied logistic regression in a wide variety of contexts—for medical and health outcomes research, in ecology, fisheries, astronomy, transportation, insurance, economics, recreation, sports, and in a number of other areas. Since 2003, I have also taught both the month-long Logistic Regression and Advanced Logistic Regression courses for Statistics.com, a comprehensive online statistical education program. Throughout this process I have learned what the stumbling blocks and problem areas are for most analysts when using logistic regression to model data. Since those taking my courses are located at research sites and universities throughout the world, I have been able to gain a rather synoptic view of the methodology and of its use in research in a wide variety of applications.

In this volume, I share with you my experiences in using logistic regression, and aim to provide you with the fundamental logic of the model and its appropriate application. I have written it to be the book I wish I had read when first learning about the model. It is much smaller and more concise than my 656-page Logistic Regression Models (Chapman & Hall/CRC, 2009), which is a general reference to the full range of logistic-based models. Rather, this book focuses on how best to understand the key points of the basic logistic regression model and how to use it properly to model a binary response variable. I do not discuss the esoteric details of estimation or provide detailed analysis of the literature regarding various modeling strategies in this volume, but rather I focus on the most important features of the logistic model—how to construct a logistic model, how to interpret coefficients and odds ratios, how to predict probabilities based on the model, and how to evaluate the model as to its fit. I also provide a final chapter on Bayesian logistic regression, with an overview of how it differs from the traditional frequentist approach. An important component of our examination of Bayesian modeling will be a step-by-step guide through JAGS code for modeling real German health outcomes data. The reader should be able to attain a basic understanding of how Bayesian logistic regression models can be developed and interpreted—and be able to develop their own models using the explanation in the book as a guideline.

Resources for learning how to model slightly more complicated models will also be provided—where to go for the next step. Bayesian modeling is having a continually increasing role in research, and every analyst should at least become acquainted with how to understand this class of models, and with how to program basic Bayesian logistic models when doing so is advisable.

R statistical software is used to display all but one statistical model discussed in the book—exact logistic regression. Otherwise R is used for all data management, models, postestimation fit analyses, tests, and graphics related to our discussion of logistic regression in the book. SAS and Stata code for all examples is provided at the conclusion of each chapter. Complete Stata and SAS code and output, including graphics and tables, is provided on the book's web site. R code is also provided on the book's web site, as well as in the LOGIT package posted on CRAN.

R is used in the majority of newly published texts on statistics, as well as for examples in most articles found in statistics journals published since 2005. R is open source, meaning that it is possible for users to inspect the actual code used in the analysis and modeling process. It is also free, costing nothing to download to one's computer. A host of free resources is available for learning R, and blogs exist that can be used to ask others how to perform various operations. It is currently the most popular statistical software worldwide; hence, it makes sense to use it for examples in this relatively brief monograph on logistic regression. But as indicated, SAS and Stata users have the complete code to replicate all of the R examples in the text itself. The code is in both printed format as well as electronic format for immediate download and use.

A caveat: Keep in mind that when copying code from a PDF document, or even from a document using a different font from that which is compatible with R or Stata, you will likely find that a few characters need to be retyped in order to execute successfully. For example, when pasting program code from a PDF or Word document into the R editor, characters such as "quotation marks" and "minus signs" may not convert properly. To remedy this, you need to retype the quotation or minus sign in the code you are using.

It is also important to remember that this monograph is not about R, or any specific statistical software package. We will foremost be interested in the logic of logistic modeling. The examples displayed aim to clarify the modeling process. The R language, although popular and powerful, is nevertheless tricky. It is easy to make mistakes, and R is rather unforgiving when you do. I therefore give some space to explaining the R code used in the modeling and evaluative process when the code may not be clear. The goal is to provide you with code you can use directly, or adapt as needed, in order to make your modeling tasks both easier and more productive.

I have chosen to provide Stata code at the end of each chapter since Stata is one of the most popular and to my mind powerful statistical packages on the
Author
distribution function or PDF. The analyst does not usually observe the entire range of data defined by the underlying PDF, called the population data, but rather observes a random sample from the underlying data. If the sample of data is truly representative of the population data, the sample data will be described by the same PDF as the population data, and have the same values of its parameters, which are initially unknown.

Parameters define the specific mean or location (shape) and perhaps scale of the PDF that best describes the population data, as well as the distribution of the random sample from the population. A statistical model is the relationship between the parameters of the underlying PDF of the population data and the estimates made by an analyst of those parameters.

Regression is one of the most common ways of estimating the true parameters in as unbiased a manner as possible. That is, regression is typically used to establish an accurate model of the population data. Measurement error can creep into the calculations at nearly every step, and the random sample we are testing may not fully resemble the underlying population of data, nor its true parameters. The regression modeling process is a method used to understand and control the uncertainty inherent in estimating the true parameters of the distribution describing the population data. This is important since the predictions we make from a model are assumed to come from this population.

Finally, there is typically only a limited range of PDFs which analysts use to describe the population data, from which the data we are analyzing is assumed to be derived. If the variable we are modeling, called the response term (y), is binary (0,1), then we will want to use a Bernoulli probability distribution to describe the data. The Bernoulli distribution, as we discuss in more detail in the next section, consists of a series of 1s and 0s. If the variable we wish to model is continuous and appears normally distributed, then we assume that it can be best modeled using a Gaussian (normal) distribution. This is a pretty straightforward relationship. Other probability distributions commonly used in modeling are the lognormal, binomial, exponential, Poisson, negative binomial, gamma, inverse Gaussian, and beta PDFs. Mixtures of distributions are also constructed to describe data. The lognormal, negative binomial, and beta binomial distributions are such mixture distributions—but they are nevertheless completely valid PDFs and have the same basic assumptions as do other PDFs.

I should also mention that probability distributions do not all have the same parameters. The Bernoulli, exponential, and Poisson distributions are single-parameter distributions, and models directly based on them are single-parameter models. That parameter is the mean or location parameter. The normal, lognormal, gamma, inverse Gaussian, beta, beta binomial, binomial, and negative binomial distributions are two-parameter models. The first four of these are continuous distributions with mean (shape) and scale (variability) parameters. The binomial, beta, and beta binomial distributions will be discussed later when discussing grouped logistic regression.

The catch in this is that a probability distribution has various assumptions. If these assumptions are violated, the estimates we make of the parameters are biased, and may be incorrect. Statisticians have worked out a number of adjustments for what may be called "violations of distributional assumptions," which are important for an analyst to use when modeling data exhibiting problems. I'll mention these assumptions shortly, and we will address them in more detail as we progress through the book.

I fully realize that the above description of a statistical model—of a parametric statistical model—is not the way we normally understand the modeling process, and it may be a bit confusing. But it is in general the way statisticians think of statistical modeling, and is the basis of the frequency-based tradition of statistical modeling. Keep these relationships in mind as we describe logistic regression.

1.2 BASICS OF LOGISTIC REGRESSION MODELING

Logistic regression is foremost used to model a binary (0,1) variable based on one or more other variables, called predictors. The binary variable being modeled is generally referred to as the response variable, or the dependent variable. I shall use the term "response" for the variable being modeled since it has now become the preferred way of designating it. For a model to fit the data well, it is assumed that

• The predictors are uncorrelated with one another.
• They are significantly related to the response.
• The observations or data elements of a model are also uncorrelated.

As discussed in the previous section, the response is also assumed to fit closely to an underlying probability distribution from which the response is a theoretical sample. The goal of a model is to estimate the true parameter(s) of the underlying PDF of the model based on the response as adjusted by its predictors. In the case of logistic regression, the response is binary (0,1) and follows a Bernoulli probability distribution. Since the Bernoulli distribution is a subset of the more general binomial distribution, logistic regression is recognized as a member of the binomial family of regression models. A comprehensive analysis of these relationships is provided in Hilbe (2009).
One of the nice features of presenting the log-likelihood function in exponential form is that we may abstract from it a link function as well as the mean and variance functions of the underlying Bernoulli distribution. The link function, which I'll discuss shortly, is whatever follows the y of the first term of the right-hand side of Equation 1.5. Here it is log(p/(1 − p)). The mean of the distribution can be obtained as the derivative of the negative of the second term with respect to the link. The second derivative yields the variance. For the Bernoulli distribution, these values are

    Mean = μ = p
    Variance = V(μ) = p(1 − p) = μ(1 − μ)

The linear predictor of the logistic model is

    x_i b = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip    (1.7)

However, the fitted or predicted value of the logistic model is based on the link function, log(μ/(1 − μ)). In order to establish a linear relationship of the predicted value, μ, and the linear predictor, we have the following relationship:

    ln(μ_i / (1 − μ_i)) = x_i b = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip    (1.8)
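As a small illustration of these quantities in R (this snippet is not from the book's own code; the probability 0.35 is an arbitrary value chosen for the example):

p <- 0.35                               # an arbitrary probability
c(mean = p, variance = p * (1 - p))     # mu and mu(1 - mu)
xb <- qlogis(p)                         # the logit link: log(p/(1 - p))
plogis(xb)                              # the inverse link; recovers 0.35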
> mylogit <- irls_logit(died ~ hmo + white, data=medpar)
> mylogit
$coef
X(Intercept)         Xhmo       Xwhite
 -0.92618620  -0.01224648   0.30338724

Call: glm(formula = y ~ x, family = binomial, data = xdta)

Coefficients:
(Intercept)            x
      1.099       -1.504

Degrees of Freedom: 8 Total (i.e. Null); 7 Residual
Null Deviance:     12.37
Residual Deviance: 11.23    AIC: 15.23

More complete model results can be obtained by assigning the model a name, and then summarizing it with the summary function. We will name the model logit2.

> logit2 <- glm(y~ x, family = binomial, data = xdta)
> summary(logit2)

Call:
glm(formula = y ~ x, family = binomial, data = xdta)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6651 -1.0108  0.7585  0.7585  1.3537

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
x             -1.504      1.472  -1.022    0.307

(Dispersion parameter for binomial family taken to be 1)

The confidence intervals must be calculated separately. To obtain model-based standard errors, we use the confint.default function. Using the confint function produces what are called profile confidence intervals. We shall discuss these later in Chapter 2, Section 2.3.

> summary(logit2 <- glm(y~ x, family = binomial, data = xdta))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
x             -1.504      1.472  -1.022    0.307
. .
Null deviance: 12.365 on 8 degrees of freedom
Residual deviance: 11.229 on 7 degrees of freedom
AIC: 15.229

There are a number of ancillary statistics which are associated with modeling data with logistic regression. I will show how to do this as we progress, and functions and scripts for all logistic statistics, fit tests, graphics, and tables are provided on the book's web site, as well as in the LOGIT package that accompanies this book. The LOGIT package will also have the data, functions, and scripts for the second edition of Logistic Regression Models (Hilbe, 2016).

For now we will focus on the meaning of the single binary predictor model. The coefficient of predictor x is −1.504077. A coefficient is a slope. It is the amount of the rate of change in y based on a one-unit change in x. When x is binary, it is the amount of change in y when x moves from 0 to 1 in value. But what is changed?

Recall that the linear predictor, xb, of a logistic model is defined as log(μ/(1 − μ)). This expression is called the log-odds or logit. It is the logistic link function, and is the basis for interpreting logistic model coefficients. The interpretation of x is that when x changes from 0 to 1, the log-odds of
2.2 PREDICTIONS, PROBABILITIES,

Let us now check the relationship of x to o, noting the values of o for the two values of x.

> se <- sqrt(diag(vcov(logit2)))  # coefficient SE
> delta <- or*se                  # delta method, SE of OR
> ortab <- data.frame(or, delta)
> round(ortab, 4)
                or  delta
(Intercept) 3.0000 3.4641
x           0.2222 0.3271

The traditional logistic regression model we are discussing here is based on a frequency interpretation of statistics. As such the confidence intervals must be interpreted in the same manner. If the coefficient of a logistic model predictor has a p-value under 0.05, the associated confidence interval will not include zero. The interpretation is

of sources. One of the earliest adjustments made to standard errors was called scaling. R's glm function provides built-in scaling of binomial and Poisson regression standard errors through the use of the quasibinomial and quasipoisson options. Scaled standard errors are produced as the product of the model standard errors and the square root of the Pearson dispersion statistic. Coefficients are left unchanged. Scaling is discussed in detail in Chapter 3, Section 3.4.1. We may use the function on the medpar data:

> data(medpar) # assumes library(COUNT) or library(LOGIT) loaded
> smlogit <- glm(died ~ white + los + factor(type),
      family = binomial, data = medpar)
> summary(smlogit)
reference level, then level 2 is interpreted with reference to level 1. Level 3 is also interpreted with reference to level 1. Level 1 is the default reference level for both R's glm function and Stata's regression commands. SAS uses the highest level as the default reference. Here it would be level 3.

It is advised to use either the lowest or highest level as the reference, in particular whichever of the two has the most observations. But of even more importance, the reference level should be chosen which makes most sense for the data being modeled.

You may let the software define your levels, or you may create them yourself. If there is the likelihood that levels may have to be combined, then it may be wise to create separate indicator variables for the levels. First though, let us let the software create internal indicator variables, which are dropped at the conclusion of the display to screen.

> summary(logit3 <- glm( died ~ factor(type), family = binomial,
      data = medpar))
. . .
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.74924    0.06361 -11.779  < 2e-16 ***
factor(type)2  0.31222    0.14097   2.215  0.02677 *
factor(type)3  0.62407    0.21419   2.914  0.00357 **
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1911.1 on 1492 degrees of freedom
AIC: 1917.1

Note how the factor function excluded factor type1 (elective) from the output. It is the reference level though and is used to interpret both type2 (urgent) and type3 (emergency). I shall exponentiate the coefficients of type2 and type3 in order to better interpret the model. Both will be interpreted as odds ratios, with the denominator of the ratio being the reference level.

> exp(coef(logit3))
  (Intercept) factor(type)2 factor(type)3
    0.4727273     1.3664596     1.8665158

The interpretation is

• Urgent admission patients have a near 37% greater odds of dying in the hospital than do elective admissions.
• Emergency admission patients have a near 87% greater odds of dying in the hospital than do elective admissions.

Analysts many times find that they must change the reference levels of a categorical predictor. This may be done with the following code. We will change from the default reference level 1 to a reference level 3 using the relevel function.

> medpar$type <- factor(medpar$type)
> medpar$type <- relevel(medpar$type, ref=3)
> logit4 <- glm( died~factor(type), family=binomial,
      data=medpar)
> exp(coef(logit4))
  (Intercept) factor(type)1 factor(type)2
    0.8823529     0.5357576     0.7320911

Interpretation changes to read

• Elective patients have about half the odds of dying in the hospital than do emergency patients.
• Urgent patients have about three quarters of the odds of dying in the hospital than do emergency patients.

I mentioned that indicator or dummy variables can be created by hand, and levels merged if necessary. This occurs when, for example, the level 2 coefficient (or odds ratio) is not significant compared to reference level 1. We see this with the model where type = 3 is the reference level. From looking at the models, it appears that levels 2 and 3 may not be statistically different from one another, and may be merged. I caution you from concluding this though since we may want to adjust the standard errors, resulting in changed p-values, for extra correlation in the data, or for some other reason we shall discuss in Chapter 4. However, on the surface it appears that patients who were admitted as urgent are not significantly different from emergency patients with respect to death while hospitalized.

I mentioned before that combining levels is required if two levels do not significantly differ from one another. In fact, when the emergency level of type is the reference, level 2 (urgent) does not appear to be significant, indicating that type levels 2 and 3 might be combined. With R this can be done as

> table(medpar$type)

   1    2    3
1134  265   96

> medpar$type[medpar$type==3] <- 2  # reclassify level 3 as level 2
> table(medpar$type)
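A hedged sketch of the by-hand alternative just mentioned, building an indicator for the merged urgent/emergency level from a fresh copy of medpar (the object names urgemer and logit5 are ours, not the book's):

data(medpar)                                            # fresh copy of the data
medpar$urgemer <- as.numeric(medpar$type %in% c(2, 3))  # urgent or emergency admission
logit5 <- glm(died ~ urgemer, family = binomial, data = medpar)
exp(coef(logit5))   # odds of death, urgent/emergency versus elective admissions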
acronym for Length of Stay, referring to nights in the hospital. los ranges from 1 to 116. A cubic spline is used to smooth the shape of the distribution of los. This is accomplished by using the s() operator.

> summary(medpar$los)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   4.000   8.000   9.854  13.000 116.000

> library(mgcv)
> diedgam <- gam(died ~ s(los), family = binomial, data = medpar)
> summary(diedgam)
. . .
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.69195    0.05733  -12.07   <2e-16 ***

[Figure: plot of the fitted smooth s(los, 7.42) against los]

2.5.3 Centering

A continuous predictor whose lowest value is not close to 0 should likely be centered. For example, we use the badhealth data from the COUNT package.

> data(badhealth)
> head(badhealth)

> table(badhealth$badh)

   0    1
1015  112

> summary(badhealth$age)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
 20.00   28.00  35.00  37.23   46.00  60.00
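As a brief illustration of the centering just described (a sketch; the chapter's SAS code refers to the centered variable as cage, so that name is reused here, while cmod is our own name):

badhealth$cage <- badhealth$age - mean(badhealth$age)   # center age at its mean
summary(cmod <- glm(badh ~ cage, family = binomial, data = badhealth))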
> summary(logit7 <- glm(died ~ white, family = binomial,
      data = medpar))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9273     0.1969  -4.710 2.48e-06 ***
white         0.3025     0.2049   1.476     0.14
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1920.6 on 1493 degrees of freedom
AIC: 1924.6

> exp(coef(logit7))
(Intercept)       white
  0.3956044   1.3532548

White patients have a 35% greater odds of death while hospitalized than do nonwhite patients.

LINEAR PREDICTOR
> etab <- predict(logit7)

FITTED VALUE; PROBABILITY THAT DIED == 1
> fitb <- logit7$fitted.value

TABULATION OF PROBABILITIES
> table(fitb)
fitb
0.283464566929547 0.348684210526331
              127              1368

1368 white patients have an approximate 0.349 probability of dying within the hospital. Nonwhite patients have some 0.283 probability of dying. Since the predictor is binary, there are only two predicted values.

Let us model died on los, a continuous predictor.

> summary(logit8 <- glm(died ~ los, family = binomial,
      data = medpar))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.361707   0.088436  -4.090 4.31e-05 ***
los         -0.030483   0.007691  -3.964 7.38e-05 ***

Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1904.6 on 1493 degrees of freedom
AIC: 1908.6

> exp(coef(logit8))
(Intercept)         los
  0.6964864   0.9699768

> etac <- predict(logit8)
> fitc <- logit8$fitted.value
> summary(fitc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01988 0.31910 0.35310 0.34310 0.38140 0.40320

The predicted values of died given los range from 0.02 to 0.40.

If we wish to determine the probability of death while hospitalized for a patient who has stayed in the hospital for 20 days, multiply the coefficient on los by 20 and add the intercept to obtain the linear predictor for los at 20 days. Apply the inverse logit link to obtain the predicted probability.

> xb20 <- -0.361707 - 0.030483*20
> mu20 <- 1/(1 + exp(-xb20))
> mu20
[1] 0.2746081

The probability is 0.275. A patient who stays in the hospital for 20 days has a 27% probability of dying while hospitalized—given a specific disease from this data.

2.6.2 Prediction Confidence Intervals

We next calculate the standard error of the linear predictor. We use the predict function with the type = "link" and se.fit = TRUE options to place the predictions on the scale of the linear predictor, and to guarantee that the lpred object is in fact the standard error of the linear prediction.

> lpred <- predict(logit8, newdata = medpar, type = "link",
      se.fit = TRUE)

Now we calculate the 95% confidence interval of the linear predictor. As mentioned earlier, we assume that both sides of the distribution are used in
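The excerpt breaks off here, but a minimal sketch of the calculation being described, using logit8 and lpred as defined above, might look like the following (this is our reconstruction, not the book's verbatim code):

predict(logit8, newdata = data.frame(los = 20), type = "response")  # about 0.2746
loci <- lpred$fit - qnorm(.975) * lpred$se.fit   # lower bound, link scale
upci <- lpred$fit + qnorm(.975) * lpred$se.fit   # upper bound, link scale
summary(1/(1 + exp(-loci)))                      # back-transform to probabilities
summary(1/(1 + exp(-upci)))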
/* Section 2.3 */
*Build the logistic model - covb option provides var-cov matrix;
proc genmod data=xdta descending;
  model y=x / dist=binomial link=logit covb;
run;

*Use SAS interactive matrix language;
proc iml;
  vcov={1.33333 -1.33333,
        -1.33333 2.16667};

*Refer to the code in section 1.4 to import and print medpar dataset;
*Generate the frequency table of type and output the dataset;
proc freq data=medpar;
  tables type / out=freq;
run;

*Build the logistic model with class;
proc genmod data=medpar descending;
  class type (ref='1') / param = ref;
  model died=type / dist=binomial link=logit;
*Build the logistic model with centered age;
proc genmod data=badhealth1 descending;
  model badh=cage / dist=binomial link=logit;
run;

*Standardize age and output the sage dataset;
proc standard data=badhealth mean=0 std=1 out=sage;
  var age;
run;

*Graph scatter plot;
proc sgplot data=cl;
  scatter x=los y=mu;
  scatter x=los y=loci;
  scatter x=los y=upci;
run;
*Refer to proc freq in section 2.4 to generate the frequency table;

*Build the logistic model and output model prediction;
proc genmod data=medpar descending;
  model died=white / dist=binomial link=logit;
  output out=etac pred=fitc;
run;

*Refer to proc means in section 2.5 to summarize fitc;

*Create a dataset to make calculations;
data prob;
  xb20=-0.3617 - 0.0305*20;
  mu20=1/(1+exp(-xb20));
run;

*Print the variable mu20;
proc print data=prob;
  var mu20;
run;

*Build the logistic model and output confidence intervals;
proc genmod data=medpar descending;
  model died=los / dist=binomial link=logit;
  output out=cl pred=mu lower=loci upper=upci;
run;

2.2
. glm y x, fam(bin) nolog nohead
. di 1.098612 - 1.504077*1
. di 1.098612 - 1.504077*0
. predict xb, xb
. predict mu
. gen o = mu/(1-mu)
. gen or = .6666667/3 if o < 1
. replace or = o if or==.
. gen coef = log(or)
. l y x xb mu o or coef

2.3
. glm y x, fam(bin) nolog nohead
. estat vce
. glm y x, fam(bin) nolog nohead scale(x2)
. glm y x, fam(bin) nolog nohead eform
. di normal(-abs(_b[x]/_se[x]))*2           // p-value for x
. di normal(-abs(_b[_cons]/_se[_cons]))*2   // p-value for intercept
. use medpar, clear
. glm died white los i.type, fam(bin) nolog nohead
. glm died white los i.type, fam(bin) nolog nohead eform

2.4
. use medpar, clear
. list in 1/6
. tab type
the other terms in the model are held constant. When the logistic regression term is exponentiated, interpretation is given in terms of an odds ratio, rather than log-odds. We can see this in Equation 3.2 below, which results by exponentiating each side of Equation 3.1.

    μ_i / (1 − μ_i) = e^(β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip)    (3.2)

or

    μ_i / (1 − μ_i) = exp(β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip)    (3.3)

An example will help clarify what is meant when interpreting a logistic regression model. Let's use data from the social sciences regarding whether a person identifies themselves as religious. Our main interest will be in assessing how level of education affects religiosity. We'll also adjust by gender (male), age, and whether the person in the study has children (kids). There are 601 subjects in the study, so there is no concern about sample size. The data are in the edrelig data set.

A study subject's level of education is a categorical variable with three fairly equal-sized levels: AA, BA, and MA/PhD. All subjects have achieved at least an associate's degree at a 2-year institution. A tabulation of the educlevel predictor is shown below, together with the top six values of all variables in the data.

> data(edrelig)
> head(edrelig)
  male age kids educlevel religious
1    1  37    0    MA/PhD         0
2    0  27    0        AA         1
3    1  27    0    MA/PhD         0
4    0  32    1        AA         0
5    0  27    1        BA         0
6    1  57    1    MA/PhD         1

> table(edrelig$educlevel)

    AA     BA MA/PhD
   205    204    192

Male and kids are both binary predictors, having values of 0 and 1. 1 indicates (most always) that the name of the predictor is the case. For instance, the binary predictor male is 1 = male and 0 = female. Kids = 1 if the subject has children, and 0 if they have no children. Age is a categorical variable with levels as 5-year age groups. The range is from 17 to 57. I will interpret age, however, as a continuous predictor, with each ascending age as a 5-year period.

We model the data as before, but simply add more predictors in the model. The categorical educlevel predictor is factored into its three levels, with the lowest level, AA, as the reference. It is not displayed in model output.

> summary(ed1 <- glm(religious ~ age + male + kids + factor(educlevel),
+     family = binomial, data = edrelig))

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)              -1.43522    0.32996  -4.350 1.36e-05 ***
age                       0.03983    0.01036   3.845 0.000121 ***
male                      0.18997    0.18572   1.023 0.306381
kids                      0.12393    0.21037   0.589 0.555790
factor(educlevel)BA      -0.47231    0.20822  -2.268 0.023313 *
factor(educlevel)MA/PhD  -0.49543    0.22621  -2.190 0.028513 *
---
Null deviance: 822.21 on 600 degrees of freedom
Residual deviance: 792.84 on 595 degrees of freedom
AIC: 804.84

The odds ratios are obtained by:

> or <- exp(coef(ed1))
> round(or,4)
            (Intercept)                     age                    male
                 0.2381                  1.0406                  1.2092
                   kids     factor(educlevel)BA factor(educlevel)MA/PhD
                 1.1319                  0.6236                  0.6093

Or we can view the entire table of odds ratio estimates and associated statistics using the code developed in the previous chapter.

> coef <- ed1$coef
> se <- sqrt(diag(vcov(ed1)))
> zscore <- coef / se
> or <- exp(coef)
> delta <- or * se
> pvalue <- 2*pnorm(abs(zscore),lower.tail=FALSE)
> loci <- coef - qnorm(.975) * se
> upci <- coef + qnorm(.975) * se
> ortab <- data.frame(or, delta, zscore, pvalue, exp(loci), exp(upci))
> round(ortab, 4)
                            or  delta  zscore pvalue exp.loci. exp.upci.
(Intercept)             0.2381 0.0786 -4.3497 0.0000    0.1247    0.4545
age                     1.0406 0.0108  3.8449 0.0001    1.0197    1.0620
male                    1.2092 0.2246  1.0228 0.3064    0.8403    1.7402
kids                    1.1319 0.2381  0.5891 0.5558    0.7495    1.7096
factor(educlevel)BA     0.6236 0.1298 -2.2683 0.0233    0.4146    0.9378
factor(educlevel)MA/PhD 0.6093 0.1378 -2.1902 0.0285    0.3911    0.9493
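The same calculation can be wrapped in a small helper function. The LOGIT package provides a function for this purpose (toOR, used later in the book); the sketch below is only a hand-rolled approximation of the same idea, with or_table as our own name:

or_table <- function(model) {
  coef   <- coef(model)
  se     <- sqrt(diag(vcov(model)))
  zscore <- coef / se
  or     <- exp(coef)
  delta  <- or * se                                 # delta-method SE of the OR
  pvalue <- 2 * pnorm(abs(zscore), lower.tail = FALSE)
  loci   <- coef - qnorm(.975) * se
  upci   <- coef + qnorm(.975) * se
  round(data.frame(or, delta, zscore, pvalue,
                   exp.loci = exp(loci), exp.upci = exp(upci)), 4)
}
or_table(ed1)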
observation in the model. This means that a y replaces every μ in the log-likelihood function.

where L is the model log-likelihood, k is the number of parameter estimates in the model, and n is the number of observations in the model. For logistic regression, parameter estimates are the same as predictors, including the intercept. Using the medpar data set described earlier, we model died on

> data(medpar)
> summary(mymod <- glm(died ~ white + hmo + los + factor(type),
+     family = binomial,
+     data = medpar))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.720149   0.219073  -3.287  0.00101 **
white        0.303663   0.209120   1.452  0.14647

Logistic Model Pearson Chi2 GOF Statistic (based on the Bernoulli distribution):

    Chi2 = Σ_{i=1}^{n} (y_i − μ_i)² / (μ_i(1 − μ_i))

AICn – AIC divided by n

> aicn <- mymod$aic/(mymod$df.null + 1)
> aicn
[1] 1.266322

3.3.2 Finite Sample

Finite sample AIC was designed to compare logistic models. It is rarely used in reports, but is important to know. It is defined as:

    FAIC = −2{[L − k − k(k + 1)] / (n − k − 1)} / n    (3.16)
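A hedged sketch of computing these quantities by hand for mymod, assuming the objects defined above (the FAIC line follows Equation 3.16 as reconstructed here):

L <- as.numeric(logLik(mymod))   # model log-likelihood
k <- length(coef(mymod))         # parameter estimates, including the intercept
n <- nrow(medpar)                # observations in the model
aic  <- -2*L + 2*k               # same value as mymod$aic
aicn <- aic / n                  # AIC divided by n
faic <- -2*((L - k - k*(k + 1)) / (n - k - 1)) / n   # Equation 3.16
c(aic = aic, aicn = aicn, faic = faic)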
> coefse <- data.frame(coef, se)
> coefse
                     coef         se
(Intercept)   -0.72014852 0.21907288
white          0.30366254 0.20912002
hmo            0.02720413 0.15124225
los           -0.03719338 0.00779851
factor(type)2  0.41787319 0.14431763
factor(type)3  0.93381912 0.22941205

Next we create the Pearson dispersion statistic and multiply se above by its square root.

> pr <- resid(mymod, type = "pearson")
> pchi2 <- sum(residuals(mymod, type = "pearson")^2)
> disp <- pchi2/mymod$df.residual
> scse <- se*sqrt(disp)
> newcoefse <- data.frame( coef, se, scse)
> newcoefse
                     coef         se        scse
(Intercept)   -0.72014852 0.21907288 0.221301687
white          0.30366254 0.20912002 0.211247566
hmo            0.02720413 0.15124225 0.152780959
los           -0.03719338 0.00779851 0.007877851
factor(type)2  0.41787319 0.14431763 0.145785892
factor(type)3  0.93381912 0.22941205 0.231746042

We can now check to see if the quasibinomial "family" option produces scaled standard errors.

> summary(qmymod <- glm(died ~ white + hmo + los + factor(type),
+     family = quasibinomial,
+     data = medpar))
. . .
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.720149   0.221302  -3.254  0.00116 **
white          0.303663   0.211248   1.437  0.15079
hmo            0.027204   0.152781   0.178  0.85870
los           -0.037193   0.007878  -4.721 2.56e-06 ***
factor(type)2  0.417873   0.145786   2.866  0.00421 **
factor(type)3  0.933819   0.231746   4.029 5.87e-05 ***
---
(Dispersion parameter for quasibinomial family taken to be 1.020452)

Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1881.2 on 1489 degrees of freedom
AIC: NA

The standard errors displayed in the quasibinomial model are identical to the scaled standard errors we created by hand. Remember that there is no true quasibinomial GLM family. Quasibinomial is not a separate PDF. It is simply an operation to provide scaled standard errors on a binomial model such as logistic regression.

When an analyst models a logistic regression with scaled standard errors, the resultant standard errors will be identical to model-based standard errors if there are no distributional problems with the data. In other words, a logistic model is not adversely affected if standard errors are scaled when they do not need it.

A caveat when using R's quasibinomial family: p-values are based on t and not z as they should be. As a result a predictor p-value may be >0.05 and its confidence interval not include 0. Our toOR function used with quasibinomial models provides correct values. To see this occur, model the grouped quasibinomial model: sick <- c(77,19,47,48,16,31); cases <- c(458,147,494,384,127,464); feed <- c(1,2,3,1,2,3); gender <- c(1,1,1,0,0,0).

3.4.2 Robust or Sandwich Variance Estimators

Scaling was the foremost method of adjusting standard errors for many years—until analysts began to use what are called robust or sandwich standard errors. Like scaling, using robust standard errors only affects the model when there are problems with the model-based standard errors. If there are none, then the robust standard error reduces to the model-based errors. Many statisticians recommend that robust or sandwich standard errors be used as a default.

I shall use the same data to model a logistic regression with sandwich or robust standard errors. The sandwich package must be installed and loaded before being able to create sandwich standard errors.

> library(sandwich)
> rmymod <- glm(died ~ white + hmo + los + factor(type),
      family = binomial, data = medpar)
> rse <- sqrt(diag(vcovHC(rmymod, type = "HC0")))

The robust standard errors are stored in rse. We'll add those to the table of standard errors we have been expanding.
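Continuing the running table, a sketch (our code, following the pattern used earlier in the chapter) of adding rse and forming robust intervals:

newcoefse$rse <- rse
round(newcoefse, 4)                              # coef, se, scse, rse side by side
rloci <- coef - qnorm(.975) * rse                # robust lower bound
rupci <- coef + qnorm(.975) * rse                # robust upper bound
round(data.frame(or = exp(coef),
                 exp.rloci = exp(rloci),
                 exp.rupci = exp(rupci)), 4)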
manner based on the levels of another predictor. Suppose that the response term is death and we have predictors white and los. These are variables in the medpar data. If we believe that the probability of death based on length of stay in the hospital varies by racial classification, then we need to incorporate an interaction term of white × los into the model. The main effects only model is:

> summary(y0 <- glm(died~ white + los, family = binomial,
      data = medpar))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.598683   0.213268  -2.807    0.005 **
white        0.252681   0.206552   1.223    0.221
los         -0.029987   0.007704  -3.893 9.92e-05 ***

Note that los is significant, but white is not. Let's create an interaction of white and los called wxl. We insert it into the model, making sure to include the main effects terms as well.

> wxl <- medpar$white * medpar$los

The interaction term is significant. It makes no difference if the main effects terms are significant or not. Only the interaction term is interpreted for this model. We calculate the odds ratios of the interaction of white and los from 1 to 40 as:

    OR_interaction = exp(β_white + β_wxl * los[1:40])    (3.19)

That is, we add the slope of the binary predictor to the product of the slope of the interaction and the value(s) of the continuous predictor, exponentiating the whole.

# Odds ratios of death for a white patient for length of stay 1–40 days.
# Note that odds of death decreases with length of stay.
> ior <- exp(0.77092 + (-0.04776*1:40))
> ior
 [1] 2.0609355 1.9648187 1.8731846 1.7858241 1.7025379 1.6231359 1.5474370
 [8] 1.4752685 1.4064658 1.3408718 1.2783370 1.2187186 1.1618807 1.1076936
[15] 1.0560336 1.0067829 0.9598291 0.9150652 0.8723889 0.8317029 0.7929144
[22] 0.7559349 0.7206800 0.6870694 0.6550262 0.6244775 0.5953535 0.5675877
[29] 0.5411169 0.5158806 0.4918212 0.4688839 0.4470164 0.4261687 0.4062933
[36] 0.3873448 0.3692800 0.3520578 0.3356387 0.3199854

Interactions for Binary × Binary, Binary × Categorical, Binary × Continuous, Categorical × Categorical, Categorical × Continuous, and Continuous × Continuous may be developed, as well as three-level interactions. See Hilbe (2009) for a thorough analysis of interactions. For now, keep in mind that when incorporating an interaction term into your model, be sure to include the terms making up the interaction in the model, but don't worry about their interpretation or significance. Interpret the interaction based on levels of particular values of the terms. When LOS is 14, we may interpret the odds ratio of the interaction term as:

White patients who were in the hospital for 14 days have some 10% greater odds of death than do non-white patients who were in the hospital for 14 days.

*Refer to the code in section 1.4 to import and print edrelig dataset;
*Refer to proc freq in section 2.4 to generate the frequency table;
*Build logistic model and obtain odds ratio & covariance matrix;
proc genmod data = edrelig descending;
  class educlevel (ref = 'AA') / param = ref;
  model religious = age male kids educlevel/dist = binomial
        link = logit covb;
  estimate "Intercept" Intercept 1 / exp;
  estimate "Age" age 1 / exp;
  estimate "Male" male 1 / exp;
  estimate "Kid" kids 1 / exp;
  estimate "BA" educlevel 1 0 / exp;
  estimate "MA/PhD" educlevel 0 1 / exp;
run;

*Refer to proc iml in section 2.3 and the full code is provided online;
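Returning to the R interaction example above: the fit of the interaction model itself is not shown in this excerpt. A hedged sketch of what it presumably looks like, with y1 as our own name; the values 0.77092 and -0.04776 used in the ior calculation appear to be the white and wxl coefficients from such a fit:

y1 <- glm(died ~ white + los + wxl, family = binomial, data = medpar)
summary(y1)
ior <- exp(coef(y1)["white"] + coef(y1)["wxl"] * 1:40)   # Equation 3.19
round(ior, 4)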
    output;
  end;
run;

3.2
. glm y x, fam(bin) nolog nohead
. di 1.098612 - 1.504077*1
. di 1.098612 - 1.504077*0
. predict xb, xb
. predict mu
. gen o = mu/(1-mu)
. gen or = .6666667/3 if o < 1
. replace or = o if or==.
. gen coef = log(or)
. l y x xb mu o or coef

3.3
. use medpar, clear
. qui glm died white hmo los i.type, fam(bin)
. estat ic
. abic

3.4
. glm died white hmo los i.type, fam(bin) scale(x2) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(robust) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(boot) nolog nohead

3.5
. glm died white los, fam(bin) nolog nohead
. gen wxl = white*los
. glm died white los wxl, fam(bin) nolog nohead
. glm died white los wxl, fam(bin) nolog nohead eform

4 Testing and Fitting a Logistic Model

4.1 CHECKING LOGISTIC MODEL FIT

4.1.1 Pearson Chi2 Goodness-of-Fit Test

I earlier mentioned that the Pearson Chi2 statistic, when divided by the residual degrees of freedom, provides a check on the correlation in the data. The idea is to observe if the result is much above the value of 1.0. That is, a well-fitted model should have the values of the Pearson Chi2 statistic and residual degrees of freedom closely the same. The closer in value, the better the fit.

    Pearson Chi2 / Residual dof ≈ 1.0

This test, as we shall later discuss, is extremely useful for evaluating extra dispersion in grouped logistic models, but for the observation-based models we are now discussing it is not. A large discrepancy from the value of 1, though, does indicate general extra dispersion or extra correlation in the data, for which use of sandwich or scaled standard errors is an appropriate remedy.

A traditional Pearson Chi2 goodness-of-fit (GOF) test, however, is commonly used to assess model fit. It does this by leaving the value of the Pearson Chi2 statistic alone, considering it instead to be Chi2 distributed with
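A sketch of both the dispersion check and the traditional Pearson Chi2 GOF test in R, for a model such as mymod (object names assumed from earlier examples):

pr    <- resid(mymod, type = "pearson")
pchi2 <- sum(pr^2)                          # Pearson Chi2 statistic
disp  <- pchi2 / df.residual(mymod)         # dispersion; should be near 1.0
pval  <- pchisq(pchi2, df = df.residual(mymod), lower.tail = FALSE)
c(pearson.chi2 = pchi2, dispersion = disp, gof.pvalue = pval)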
TABLE 4.1 Residuals for Bernoulli logistic regression
Raw    r    y − μ

Residual analysis for logistic models is usually based on what are known as n-asymptotics. However, some statisticians suggest that residuals should be based on m-asymptotically formatted data. Data in observation-based form, that is, one observation or case per line, are in n-asymptotic format. The datasets we have been using thus far for examples are in n-asymptotic form. m-asymptotic data occurs when observations with the same values for all

TABLE 4.2 Residual code
mu <- mymod$fitted.value    # predicted probability; fit

There are several ways to reduce the three variable subset of the medpar data to m-asymptotic form. I will show a way that maintains the died response variable, which is renamed dead due to it not being a binary variable, and then show how to duplicate the above table.

> data(medpar)
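A hedged sketch of one way to perform this reduction; the grouping variables are assumed here to be white, hmo, and los, which may not match the book's own choice of the three-variable subset:

data(medpar)
dead <- aggregate(died ~ white + hmo + los, data = medpar, FUN = sum)     # deaths per pattern
n    <- aggregate(died ~ white + hmo + los, data = medpar, FUN = length)  # cases per pattern
mdat <- data.frame(dead[, c("white", "hmo", "los")],
                   dead = dead$died, n = n$died)
head(mdat)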
[Figure 4.1: squared standardized deviance residuals (stdr^2) plotted against mu]

FIGURE 4.1 Squared standardized deviance versus mu.

use of Anscombe residuals versus mu, or the predicted probability that the response is equal to 1. Anscombe residuals adjust the residuals so that they are as normally distributed as possible. This is important when using 2, or 4 when the residual is squared, as a criterion for specifying an observation as an outlier. It is the 95% criterion so commonly used by statisticians for determining statistical significance. Figure 4.2 is not much different from Figure 4.1 when squared standardized deviance residuals are used in the graph. The Anscombe plot is preferred.

> plot(mu, ans^2)
> abline(h = 4, lty = "dotted")

[Figure 4.2: squared Anscombe residuals (ans^2) plotted against mu]

A leverage or influence plot (Figure 4.3) may be constructed as:

> plot(stpr, hat)
> abline(v=0, col="red")

[Figure 4.3: leverage plot, hat values plotted against stpr]

FIGURE 4.3 Leverage plot.

Large hat values indicate covariate patterns that differ from average covariate patterns. Values on the horizontal extremes are high residuals. Values that are high on the hat scale, and low on the residual scale; that is, high in the middle and close to the zero-line, do not fit the model well. They are also difficult to detect as influential when using other graphics. There are some seven observations that fit this characterization. They can be identified by selecting hat values greater than 0.4 and squared residual values of |2|.

A wide variety of graphics may be constructed from the residuals given in Table 4.2. See Hilbe (2009), Bilger and Loughin (2015), Smithson and Merkle (2014), and Collett (2003) for examples.

4.1.4 Conditional Effects Plot

A nice feature of logistic regression is its ability to allow an analyst to plot the predicted probability of an outcome on a continuous predictor, factored across
determines the optimal probability value with which to separate predicted versus observed successes (1) or failures (0).

For an example we shall continue with the model used for residual analysis earlier in the chapter. It is based on the medpar data, with died as the response variable and white, hmo, los and levels of type as the predictors. We then obtain predicted probabilities that died == 1, which is the definition of mu. The goal is then to determine how well the predicted probabilities actually predict classification as died == 1, and how well they predict died == 0. Analysts are not only interested in correct prediction though, but also in such issues as what percentage of times the predictor incorrectly classifies the outcome. I advise the reader to remember though that logistic models that classify well are not always well-fitted models. If your interest is strictly to produce the best classification scheme, do not be as much concerned about model fit. In keeping with this same logic, a well-fitted logistic model may not clearly differentiate the two levels of the response. It's valuable if a model accomplishes both fit and classification power, but it need not be the case.

Now to our example model:

> mymod <- glm(died ~ white + hmo + los + factor(type),
      family=binomial,
      data=medpar)

> mu <- predict(mymod, type="response")

> mean(medpar$died)
[1] 0.3431438

Analysts traditionally use the mean of the predicted value as the cut point. Values greater than 0.3431438 should predict that died == 1; values lower should predict died == 0. For confusion matrices, the mean of the response, or mean of the prediction, will be a better cut point than the default 0.5 value set with most software. If the response variable being modeled has substantially more or fewer 1s than 0s, a 0.5 cut point will produce terrible results. I shall provide a better criterion for the cut point shortly, but the mean is a good default criterion.

Analysts can use the percentage at which levels of died relate to mu being greater or less than 0.3431438 to calculate such statistics as specificity and sensitivity. These are terms that originate in epidemiology, although tests like the ROC statistic and curve were first derived in signal theory. Using our example, we have patients who died (D) and those who did not (~D). The probability of being predicted to die given that the patient has died is called model sensitivity. The probability of being predicted to stay alive, given the fact that the patient remained alive, is referred to as model specificity. In epidemiology, the term sensitivity refers to the probability of testing positive for having a disease given that the patient in fact has the disease. Specificity refers to when a patient tests negative for a disease when they in fact do not have the disease. The term false positive refers to when a patient tests positive for a disease even though they do not have it. False negatives happen when a patient tests negative for a disease, even though they actually have it. These are all important statistics in classification analysis, but model sensitivity and specificity are generally regarded as the most important results. However, false positive and false negative are used with the main statistics for creating the ROC curve. Each of these statistics can easily be calculated from a confusion matrix. All three of these classification tools intimately relate with one another.

The key point is that determining the correct cut point provides the grounds for correctly predicting the above statistics, given an estimated model. The cut point is usually close to the mean of the predicted values, but is not usually the same value as the mean. Another way of determining the proper cut point is to choose a point at which the specificity and sensitivity are closest in values. As you will see though, formulae have been designed to find the optimal cut point, which is usually at or near the site where the sensitivity and specificity are the closest.

The Sensitivity-Specificity (S-S) plot and ROC plot and tests are components of the ROC_test function. The classification or confusion matrix is displayed using the confusion_stat function. Both of these functions are part of the LOGIT package on CRAN. When LOGIT has been loaded into memory the functions are automatically available to the analyst.

> library(LOGIT)
> data(medpar)
> mymod <- glm(died ~ los + white + hmo + factor(type),
      family=binomial, data=medpar)

We shall start with the S–S plot, which is typically used to establish the cut point used in ROC and confusion matrix tests. The cut point used in ROC_test is based on Youden's J statistic (Youden, 1950). The optimal cut point is defined as the threshold that maximizes the distance to the identity (diagonal) line of the ROC curve. The optimality criterion is based on:

    max(sensitivities + specificities)

Other criteria have been suggested in the literature. Perhaps the most noted alternative is:

    min((1 - sensitivities)^2 + (1 - specificities)^2)

Both criteria give remarkably close cut points.
Note that a cutoff value of 0.3615 is used for the AUC statistic. Given that died indicates that a patient died while hospitalized, the AUC statistic can be interpreted as follows: The estimated probability is 0.61 that patients who die have a higher probability of death (higher mu) than patients who are alive. This value is very low. A ROC statistic of 0.5 indicates that the model has no predictive power at all. For our model there is some predictive power, but not a lot.

4.2.3 Confusion Matrix

The traditional logistic regression classification table is given by the so-called confusion matrix of correctly and incorrectly predicted fitted values. The matrix may be obtained following the use of the previous options of ROC_test by typing

> confusion_stat(out1$Predicted, out1$Observed)

A confusion matrix of values is immediately displayed on screen, together with values for correctly predicted (accuracy), sensitivity, and specificity. The cut point from the S–S plot is used as the confusion matrix cut point.

$matrix
      obs
pred      0    1   Sum
  0     794  293  1087
  1     188  220   408
  Sum   982  513  1495

$statistics
 Accuracy Sensitivity Specificity
0.6782609   0.4288499   0.8085540

Other statistics that can be drawn from the confusion matrix and that can be of value in classification analysis are listed below. Recall from earlier discussion that D = patient died while in hospital (outcome = 1) and ~D = patient did not die in hospital (outcome = 0).

Positive predictive value : 220/408 = 0.5392157 = 53.92%
Negative predictive value : 794/1087 = 0.7304508 = 73.05%
False-positive rate for true ~D : 188/982 = 0.1914460 = 19.14%
False-positive rate for true D : 293/513 = 0.5711501 = 57.12%
False-positive rate for classified positives : 188/408 = 0.4607843 = 46.08%
False-negative rate for classified negatives : 293/1087 = 0.2695482 = 26.95%

An alternative way in which confusion matrices have been constructed is based on the closeness in sensitivity and specificity values. That is, either the analyst or an algorithm determines when the sensitivity and specificity values are closest, and then constructs a matrix based on the implications of those values. An example of this method can be made from the PresenceAbsence package and function. The cut point, or threshold, is 0.351, which is not much different from the cut point of 0.3638 we used in the ROC_test function. The difference in matrix values and the associated sensitivity and specificity values are rather marked though. I added the marginals to provide an easier understanding of various ancillary statistics which may be generated from the confusion matrix.

> library(PresenceAbsence)
> mymod <- glm(died ~ white + hmo + los + factor(type),
+     family=binomial, data=medpar)
> mu <- predict(mymod, type="response")
> cmxdf <- data.frame(id=1:nrow(medpar), died=as.vector(medpar$died),
+     pred=as.vector(mu))
> cmx(cmxdf, threshold=0.351, which.model=1)   # a function in PresenceAbsence

           Observed
predicted      1    0  Total
    1        292  378    670
    0        221  604    825
    Total    513  982   1495

The correctly predicted value, or accuracy, is (292 + 604)/1495 or 59.93%. Sensitivity is 292/(292 + 221) or 56.72% and specificity is 604/(378 + 604) or 61.51%. Note that the sensitivity (56.72%) and specificity (61.51%) are fairly close in values—they are as close as we can obtain. If we use the same algorithm with the cut point of 0.363812 calculated by the S–S plot using the criterion described at the beginning of this section, the values are

> cmx(cmxdf, threshold=0.363812, which.model=1)

           observed
predicted      1    0  Total
    1        252  233    485
    0        261  749   1010
    Total    513  982   1495
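For reference, the ancillary statistics listed above are simple arithmetic on the first confusion matrix; a sketch with the cell counts hard-coded from the printed output:

tp <- 220; fp <- 188; fn <- 293; tn <- 794
c(ppv  = tp/(tp + fp),    # positive predictive value: 220/408
  npv  = tn/(tn + fn),    # negative predictive value: 794/1087
  fpr  = fp/(fp + tn),    # false-positive rate for true ~D: 188/982
  fprD = fn/(fn + tp))    # false-positive rate for true D: 293/513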
If we divide up the response probability space into 12 divisions the results appear as:

> HLChi12 <- HLTest(obj = mymod, g = 12)
> HLChi12

Hosmer and Lemeshow goodness-of-fit test with 12 bins

data: mymod
X2 = 84.8001, df = 10, p-value = 5.718e-14

> cbind(HL$observed, round(HL$expect, digits = 1))
                Y0 Y1 Y0hat Y1hat
[0.0219,0.246]  87 38  99.6  25.4
(0.246,0.278]   94 31  92.0  33.0
(0.278,0.297]   97 37  95.1  38.9
(0.297,0.313]  103 38  97.5  43.5
(0.313,0.329]  101 35  91.8  44.2
(0.329,0.343]   77 26  68.2  34.8
(0.343,0.354]  113 38  98.1  52.9
(0.354,0.362]   75 15  57.5  32.5
(0.362,0.38]    79 55  84.0  50.0
(0.38,0.391]    35 80  70.4  44.6
(0.391,0.454]   62 58  69.0  51.0
(0.454,0.618]   59 62  58.7  62.3

The Chi2 test again indicates that the model is ill fitted.

In order to show how different code can result in different results, I used code for the H–L test in Hilbe (2009). Rather than groups defined and displayed by range, they are calculated as ranges, but the mean of the groups is displayed in output. The number of observations in each group is also given. This code will develop three H–L tables, with 8, 10, and 12 groups. The 12 group table is displayed below.

> medpar2 <- na.omit(medpar)   # drop obs with missing value(s)
> hlGOF.test(medpar2$died, predict(mymod, medpar2,
      type='response'), breaks=12)

For # Cuts = 12  # Data = 1495
Cut # Total #Patterns # Resp. # Pred. Mean Resp. Mean Pred.
  1    125        61      38   25.39    0.30400    0.20311
  2    124        24      31   32.72    0.25000    0.26384
  3    125        14      35   36.16    0.28000    0.28929
  4    124        15      34   38.05    0.27419    0.30689
  5    125        11      31   40.10    0.24800    0.32079
  6    125         9      33   41.76    0.26400    0.33409
  7    124         7      29   43.16    0.23387    0.34806
  8    125         5      26   44.80    0.20800    0.35843
  9    124        11      44   46.08    0.35484    0.37160
 10    125        10      89   48.33    0.71200    0.38660
 11    124        20      59   52.32    0.47581    0.42191
 12    125        32      64   64.14    0.51200    0.51310

Total # Data: 1495  Total over cuts: 1495
Chisq: 91.32444  d.f.: 10  p-value: 0.00000

The p-value again tells us that the model is not well fitted. The statistics are similar, but not identical to, the table shown earlier. The H–L test is a nice summary test to use on a logistic model, but interpret it with care.

4.4 MODELS WITH UNBALANCED DATA AND PERFECT PREDICTION

When the data set you wish to model has few observations and few predictors, and the predictors are categorical in nature, it is possible that perfect prediction exists between the predictors and response. That is, for a given covariate pattern only one outcome occurs. Maximum likelihood estimation does not work well in such circumstances. One or more of the coefficients become very large, and standard errors may explode to huge sizes as well. Coefficient values may also be displayed with no value given. When this occurs it is nearly always the case that perfect prediction exists in the data.

Consider a real data set consisting of HIV drug data. The response is given as the number of patients in a study who became infected with HIV. There are two predictors, cd4 and cd8, each with three levels: 0, 1, and 2. The data is weighted by the number of cases having the same pattern of covariates; that is, with the values of cd4 and cd8 the same.

The data, called hiv, is downloaded into R's memory from its original format as a Stata data set.

> library(Hmisc)
> data(hivlgold)
> hiv
  infec cases cd4 cd8
1     0     3   0   0
2     0     8   1   1
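Before continuing, a tiny synthetic illustration (not from the book) of the perfect-prediction problem just described; the single predictor separates the outcome exactly, and glm warns while the coefficient and its standard error blow up:

yy <- c(0, 0, 0, 1, 1, 1)
xx <- c(0, 0, 0, 1, 1, 1)     # xx separates yy perfectly
summary(glm(yy ~ xx, family = binomial))   # huge coefficient and standard error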
prediction in the model. When that occurs penalized logistic regression should be used—as we discussed in the previous section.
For an example of exact logistic regression, I shall use Arizona hospital data collected in 1991. The data consist of a random sample of heart procedures referred to as CABG and PTCA. CABG is an acronym meaning coronary artery bypass grafting surgery, and PTCA refers to percutaneous transluminal coronary angioplasty. It is a nonsurgical method of placing a type of balloon into a coronary artery in order to clear blockage caused by cholesterol. It is a substantially less severe procedure than CABG. We will model the probability of death within 48 h of the procedure on 34 patients who sustained either a CABG or PTCA. The variable procedure is 1 for CABG and 0 for PTCA. It is adjusted in the model by the type of admission. Type = 1 is an emergency or urgent admit, and 0 is an elective admission. Other variables in the data are not used. Patients are all older than 65.
It is clear from the tabulation that more patients died having a CABG than with a PTCA. A table of died on type of admission is displayed as:

> table(azheart$died, azheart$type)
died      Elective Emer/Urg
  Survive       17       11
  Died           4        2

First we shall use a logistic regression to model died on procedure and type. The model results are displayed in terms of odds ratios and associated statistics.

> exlr <- glm(died ~ procedure + type, family=binomial, data=azheart)
> toOR(exlr)
                    or    delta  zscore pvalue exp.loci. exp.upci.
(Intercept)     0.0389   0.0482 -2.6170 0.0089    0.0034    0.4424
procedureCABG  12.6548  15.7958  2.0334 0.0420    1.0959  146.1267
typeEmer/Urg    1.7186   1.9296  0.4823 0.6296    0.1903   15.5201

Note that there appears to be a statistically significant relationship between the probability of death and type of procedure (p = 0.0420). Type of admission does not contribute to the model. Given the size of the data, and adjusting for the possibility of correlation in the data, we next model the same data as a "quasibinomial" model. Earlier in the book I indicated that the quasibinomial option is nothing more than scaling (multiplying) the logistic model standard errors by the square root of the Pearson dispersion statistic.
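In R this scaling can be requested with the quasibinomial family; a minimal sketch (the object name exq is illustrative, and the output is not reproduced here):

# Sketch: same model with standard errors scaled by the square root
# of the Pearson dispersion (the "quasibinomial" option described above)
exq <- glm(died ~ procedure + type, family = quasibinomial, data = azheart)
summary(exq)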
Exact logistic regression                 Number of obs =        34
                                          Model score   =  5.355253
                                          Pr >= score   =    0.0864

            Odds Ratio   Suff.   2*Pr(Suff.)    [95% Conf. Interval]
procedure     10.33644       5       0.0679     .8880104   589.8112
type          1.656699       2       1.0000     .1005901   28.38678

The results show that procedure is not a significant predictor of died at the p = 0.05 criterion. This should not be surprising. Note that the odds ratio
I tend to structure grouped data in this manner. But as long as an analyst is consistent, there is no difference in the methods. What is important to remember is that if there are only two binary variables in a table, y and x, and if y is the response variable to be modeled, then it is placed as the left-most column with 2^p levels, where p is the number of binary variables; in this case 2^2 = 4.
The data in grouped format are modeled as a frequency weighted regression. Since y is binary, it will be modeled as a logistic regression, although it also may be modeled as a probit, complementary loglog, or loglog regression. The key is to enter the counts as a frequency weight.

> y <- c(1,1,0,0)
> x <- c(1,0,1,0)
> count <- c(8,6,5,4)
> mydata <- data.frame(y,x,count)

> mymodel <- glm(y ~ x, weights=count, family=binomial, data=mydata)
> summary(mymodel)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.40547    0.64550   0.628     0.53
x            0.06454    0.86120   0.075     0.94

Null deviance: 30.789 on 3 degrees of freedom
Residual deviance: 30.783 on 2 degrees of freedom
AIC: 34.783

The logistic coefficients are x = 0.06454 and the intercept is 0.40547. Exponentiation gives the following values:

> exp(coef(mymodel))
(Intercept)           x
   1.500000    1.066667

To check the above calculations the odds ratio may be calculated directly from the original table data as well. Recall that the odds ratio of predictor x is the ratio of the odds of y = 1 divided by the odds of y = 0. The odds of y = 1 is the ratio of x = 1 to x = 0 when y = 1, and the odds of y = 0 is the ratio of x = 1 to x = 0 when y = 0. Therefore,

            x
            0   1
   y   0    4   5
       1    6   8

> (8/5)/(6/4)
[1] 1.066667

which is the value calculated as x above. Recalling our discussion earlier in the text, the intercept odds is the denominator of the ratio we just calculated to determine the odds ratio of x.

> 6/4
[1] 1.5

which confirms the calculation from R.
When tables are more complex the same logic used in creating the 2 × 2 table remains. For instance, consider a table of summary data that relates the pass–failure rate among males and females in an introduction to statistics course at Noclue University. The goal is to determine if studying for the final, or going to a party or just sleeping instead, had a bearing on passing. There are 18 males and 18 females, for a class of 36.

                       Gender
               Female                 Male
           sleep  party  study   sleep  party  study
Grade fail     3      4      2       2      4      3
      pass     2      1      6       3      2      4

The data have a binary response, Grade, with levels of Fail and Pass; Gender has two levels (Female and Male); and student Type has three levels (sleep, party, and study). I suggest that the response of interest, Pass, be given the value of 1, with Fail assigned 0. For Gender, Female = 0 and Male = 1. For Type: Sleep = 1, Party = 2, and Study = 3. Multiply the levels for the total number of levels or groups in the data: 2 * 2 * 3 = 12. The response variable then will have six 0s and six 1s. When a table has predictors with more than two levels, I recommend using the 0,1 format for setting up the data for analysis.
A binary variable will split its values between the next higher level. Therefore, Gender will have alternating 0s and 1s for each half of Grade. Since Type has three levels, 1–2–3 is assigned for each level of Gender. Finally, assign the appropriate count value to each pattern of variables. The first level represents Grade = Fail; Gender = Female; Type = Sleep. We move from the upper left of the top row across the columns of the row, then move to the next row.
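Following that recipe, the grouped data can be entered and modeled in R roughly as follows; this is only a sketch (the object names pf and pfmod are illustrative), with the counts read off the table above:

# 0,1 setup of the Noclue University table (12 covariate patterns)
grade  <- c(0,0,0,0,0,0, 1,1,1,1,1,1)   # 0 = Fail, 1 = Pass
gender <- c(0,0,0,1,1,1, 0,0,0,1,1,1)   # 0 = Female, 1 = Male
type   <- c(1,2,3,1,2,3, 1,2,3,1,2,3)   # 1 = Sleep, 2 = Party, 3 = Study
count  <- c(3,4,2,2,4,3, 2,1,6,3,2,4)   # cell counts, top row first
pf <- data.frame(grade, gender, type, count)

# Frequency-weighted logistic regression, as in the 2 x 2 example above
pfmod <- glm(grade ~ gender + factor(type), weights = count,
             family = binomial, data = pf)
summary(pfmod)

This parallels the Stata command glm grade gender i.type [fw=count], fam(bin) given in the chapter's Stata code.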
run;

*Square the standardized deviance residual;
data stats1;
set stats;
stresdev2=stresdev**2;
run;

*Plot the square of standardized deviance residuals and mu;
proc gplot data=stats1;
symbol v=circle color=black;
plot stresdev2*pred / vref=4 cvref=red;
run;

*Plot the leverage and std Pearson residual;
proc gplot data=stats1;
symbol v=circle color=black;
plot leverage*streschi / href=0 chref=red;
run;

*Sort the dataset;
proc sort data=medpar out=medpar1;
by white hmo los type;
run;

*Calculate the sum of the dead;
proc means data=medpar1 sum;
by white hmo los type;
var died;
output out=summary sum=dead;
run;

*Create a new variable alive;
data summary1;
set summary;
alive=_freq_-dead;
drop _type_ _freq_;
run;

*Refer to proc print in section 2.2 to print dataset summary1;

*Build the logistic model with numeric variables;
proc genmod data=medpar descending;
model died=los type/dist=binomial link=logit;
run;

k2=-0.8714+(-0.0376)*los+0.4386*2;
r2=1/(1+exp(-k2));
k3=-0.8714+(-0.0376)*los+0.4386*3;
r3=1/(1+exp(-k3));
run;

*Graph the conditional effects plot;
proc sgplot data=effect;
scatter x=los y=r1;
scatter x=los y=r2;
scatter x=los y=r3;
xaxis label='LOS';
yaxis label='Type of Admission' grid values=(0 to 0.4 by 0.1);
title 'P[Death] within 48 hr admission';
run;

/* Section 4.2 */

*Build the logistic model and output model prediction;
proc genmod data=medpar descending;
class type (ref='1') / param=ref;
model died=white hmo los type / dist=binomial link=logit;
output out=fit pred=mu;
run;

*Refer to proc means in section 2.5 to calculate the mean;

*Build the logistic model and output classification table & ROC curve;
proc logistic data=medpar descending plots(only)=ROC;
class type (ref='1') / param=ref;
model died=white hmo los type / outroc=ROCdata ctable pprob=(0 to 1 by 0.0025);
ods output classification=ctable;
run;

*Sensitivity and specificity plot;
symbol1 interpol=join color=vibg height=0.1 width=2;
symbol2 interpol=join color=depk height=0.1 width=2;
axis1 label=("Probability") order=(0 to 1 by 0.25);
axis2 label=(angle=90 "Sensitivity Specificity %") order=(0 to 100 by 25);
proc gplot data=ctable;
plot sensitivity*problevel specificity*problevel / overlay haxis=axis1 vaxis=axis2 legend;
run;

*Approximate cutoff point can be found when sensitivity and specificity are closest/equal in the classification table;
. mean died
. logit died white hmo los i.type, nolog
. lsens, genprob(cut) gensens(sen) genspec(spec)
. lroc
. estat classification, cut(.351)

4.3
. estat gof, table group(10)
. estat gof, table group(12)

4.4
. use hiv1gold
. list
. glm infec i.cd4 i.cd8 [fw=cases], fam(bin)
. firthlogit infec i.cd4 i.cd8 [fw=cases], nolog

4.5
. use azcabgptca34
. list in 1/6
. table died procedure
. table died type
. glm died procedure type, fam(bin) nolog
. glm died procedure type, fam(bin) scale(x2) nolog
. exlogistic died procedure type, nolog

4.6
. use pgmydata
. glm y x [fw=count], fam(bin) nolog
. glm y x [fw=count], fam(bin) nolog eform
. use phmydata2
. glm grade gender i.type [fw=count], fam(bin) nolog nohead
. glm grade gender i.type [fw=count], fam(bin) nolog nohead eform


5  Grouped Logistic Regression


5.1 THE BINOMIAL PROBABILITY DISTRIBUTION FUNCTION

Grouped logistic regression is based on the binomial probability distribution. Recall that standard logistic regression is based on the Bernoulli distribution, which is a subset of the binomial. As such, the standard logistic model is a subset of the grouped. The key concept involved is the binomial probability distribution function (PDF), which is defined as:

f(y; p, n) = \binom{n}{y} p^{y} (1 - p)^{n - y}    (5.1)

f(y; p, n) = \exp\left[ y \ln\!\left(\frac{p}{1 - p}\right) + n \ln(1 - p) + \ln\binom{n}{y} \right]    (5.2)
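As a quick numerical check (not from the book), both forms return the same probability; for example, with n = 10, y = 3, and p = 0.4 each evaluates to about 0.215:

# Binomial PDF (5.1) via dbinom(), and the exponential-family form (5.2)
y <- 3; n <- 10; p <- 0.4
dbinom(y, size = n, prob = p)                                # 0.2149908
exp(y * log(p / (1 - p)) + n * log(1 - p) + lchoose(n, y))   # 0.2149908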
> x3 <- c(1,1,1,1,1,1,1,0,0,0)
> obser <- data.frame(y,x1,x2,x3)
> xx1 <- glm(y ~ x1 + x2 + x3, family = binomial, data = obser)
> summary(xx1)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

Null deviance: 13.46 on 9 degrees of freedom
Residual deviance: 12.05 on 6 degrees of freedom
AIC: 20.05

Grouped Data

> y <- c(1,1,2,0,2,0)
> cases <- c(3,1,2,1,2,1)
> x1 <- c(1,1,0,0,1,0)
> x2 <- c(0,1,0,1,0,1)
> x3 <- c(1,1,1,1,0,0)
> grp <- data.frame(y,cases,x1,x2,x3)
> grp$noty <- grp$cases - grp$y
> xx2 <- glm( cbind(y, noty) ~ x1 + x2 + x3, family = binomial, data = grp)
> summary(xx2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 9.6411 on 5 degrees of freedom
Residual deviance: 8.2310 on 2 degrees of freedom
AIC: 17.853

The coefficients, standard errors, z values, and p-values are identical. However, the ancillary deviance and AIC statistics differ due to the number of observations in each model. But the information in the two data sets is the same. This point is important to remember.
Note that the response variable is cbind(y, noty) rather than y as in the standard model. R users tend to prefer having the response be formatted in terms of two columns of data—one for the number of 1s for a given covariate pattern, and the second for the number of 0s (not 1s). It is the only logistic regression software I know of that allows this manner of formatting the binomial response. However, one can create a variable representing the cbind(y, noty) and run it as a single term response. The results will be identical.

> grp2 <- cbind(grp$y, grp$noty)
> summary(xx3 <- glm( grp2 ~ x1 + x2 + x3, family = binomial, data = grp))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

In a manner more similar to that used in other statistical packages, the binomial denominator, cases, may be employed directly into the response—but only if it is also used as a weighting variable. The following code produces the same output as above,

> summary(xx4 <- glm( y/cases ~ x1 + x2 + x3, family = binomial,
          weights = cases, data = grp))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

The advantage of using this method is that the analyst does not have to create the noty variable. The downside is that some postestimation functions do not accept being based on a weighted model. Be aware that there are alternatives and use the one that works best for your purposes. The cbind() response appears to be the most popular, and seems to be used more in published research.
Stata and SAS use the grouping variable, for example cases, as the variable n in the binomial formulae listed in the last section and as given in the example directly above. The binomial response can be thought of as y = numerator and cases = denominator. Of course these term names will differ depending on the data being modeled. Check the end of this chapter for how Stata and SAS handle the binomial denominator.
Analysts can actually create a model that specifically adds an extra parameter that adjusts for the extra correlation or overdispersion in the data. For the Poisson model, the negative binomial model serves this purpose. It is a two-parameter model. The beta binomial is a two-parameter logistic model, with the extra heterogeneity parameter adjusting for extra correlation in the data. Other two- and three-parameter models have also been developed to account for Poisson and binomial overdispersion, but they need not concern us here (Hilbe, 2014). We shall discuss the beta binomial later in this chapter.
How is binomial overdispersion identified? The easiest way is by using the Pearson dispersion statistic. Let us view the dispersion statistic on the grouped binomial model we created above from observation data.

> P__disp(bin)
Pearson Chi2 = 6.630003
Dispersion   = 3.315001

Any value of the dispersion greater than 1 indicates extra variation in the data. That is, it indicates more variation than is allowed by the binomial PDF which underlies the model. Recall that the dispersion statistic is the Pearson statistic divided by the residual degrees of freedom, which is defined as the number of observations in the model less the number of coefficients (predictors, intercept, extra parameters). Multiplying the standard error of each predictor in a grouped logistic model by the square root of the dispersion produces a quasibinomial grouped logistic model; it adjusts the standard errors of the model. Sandwich and bootstrapped standard errors may be used as well to adjust for overdispersed grouped logistic models.
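For reference, the dispersion displayed above can be computed directly from a fitted glm object. A minimal sketch follows; it is not the P__disp function used above, and bin is the grouped binomial model fitted earlier:

# Pearson dispersion = Pearson chi-square / residual degrees of freedom
pr   <- residuals(bin, type = "pearson")
chi2 <- sum(pr^2)
c(Pearson.Chi2 = chi2, Dispersion = chi2 / bin$df.residual)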
A caveat should be given regarding the identification of overdispersed data. I mentioned that for grouped logistic models a dispersion statistic greater than 1 indicates overdispersion, or unaccounted-for variation in the data. However, there are times that models appear to be overdispersed, but are in fact not. A grouped logistic model dispersion statistic may be greater than 1, but the model data can itself be adjusted to eliminate the perceived overdispersion. Apparent overdispersion occurs in the following conditions:

Apparent Overdispersion
• The model is missing a needed predictor.
• The model requires one or more interactions of predictors.
• A predictor needs to be transformed to a different scale; log(x).
• The link is misspecified (the data should be modeled as probit or cloglog).
• There are existing outliers in the data.

Examples of how these indicators of apparent overdispersion affect logistic models are given in Hilbe (2009).

Guideline
If a grouped logistic model has a dispersion statistic greater than 1, check each of the 5 indicators of apparent overdispersion to determine if applying them reduces the dispersion to approximately 1. If it does, the data are not truly overdispersed. Adjust the model accordingly. If the dispersion statistic of a grouped logistic model is less than 1, the data is under-dispersed. This type of extra-dispersion is more rare, and is usually dealt with by scaling or using robust SEs.


5.4 MODELING AND INTERPRETATION OF GROUPED LOGISTIC REGRESSION

Modeling and interpreting grouped logistic models is the same as for binary response, or observation-based, models. The graphics that one develops will be a bit different from the ones developed that are based on a binary response model. Using the mylgg model we developed in Chapter 4, Section 4.1.3 when discussing residual analysis, we shall plot the same leverage versus standardized Pearson residuals (Figure 5.1) and standardized deviance residuals versus mu (Figure 5.2) as done in Chapter 4. However, this time the standardized residuals in Figure 5.2 are not squared. For a binary response model, squaring the standardized residuals provides for an easier interpretation. Note the difference due to the grouped format of the data.

> fit <- glm( cbind(dead, alive) ~ white + hmo + los + factor(type),
       family = binomial, data = mylgg)
> mu <- fit$fitted.value               # predicted probability
> hat <- hatvalues(fit)                # hat matrix diagonal
> dr <- resid(fit, type = "deviance")  # deviance residuals
> pr <- resid(fit, type = "pearson")   # Pearson residuals
> stdr <- dr/sqrt(1-hat)               # standardized deviance
> stpr <- pr/sqrt(1-hat)               # standardized Pearson

> plot(stpr, hat)                      # leverage plot
> abline(v = 0, col = "red")

The interpretation of the hat statistics is the same as in Chapter 4. In Figure 5.2, notice the more scattered nature of the standardized deviance residuals. This is due to the variety of covariate patterns. Covariate patterns higher than the line at 2 are outliers, and do not fit the model.
The log-likelihood function for the binomial model can then be expressed, with subscripts, as:

L(\mu_i; y_i, n_i) = \sum_{i=1}^{n} \left\{ y_i \ln(\mu_i) + (n_i - y_i) \ln(1 - \mu_i) + \ln\Gamma(n_i + 1) - \ln\Gamma(y_i + 1) - \ln\Gamma(n_i - y_i + 1) \right\}    (5.8)

The beta distribution is used as the basis of modeling proportional data. That is, beta data is constrained between 0 and 1—and can be thought of in this context as the proportion obtained by dividing the binomial numerator by the denominator. The beta PDF is given below in terms of two shape parameters, a and b, although there are a number of different parameterizations. … binomial logistic regression function is analogous to a Poisson, or perhaps a negative binomial model.
The mean and variance of the beta PDF may be given as:

E(y) = \frac{a}{a + b} = \mu \qquad V(y) = \frac{ab}{(a + b)^2 (a + b + 1)}    (5.12)

As mentioned before, the beta-binomial distribution is a mixture of the binomial and beta distributions. The binomial parameter, μ, is distributed as beta, which adjusts for extra-binomial correlation in the data. The mixture can be obtained by multiplying the two distributions.

f(y; \mu, a, b) = f(y; \mu, n)\, f(\mu; a, b)    (5.13)

The result is the beta-binomial probability distribution.

Beta Binomial

> library(gamlss)
> summary(mybb <- gamlss(cbind(survive, died) ~ age + sex + class03,
          data = titanicgrp, family = BB))

                   Estimate Std. Error t value Pr(>|t|)
(Intercept)           1.498     0.6814   2.199 0.063855
ageadults            -2.202     0.8205  -2.684 0.031375
sexman               -2.177     0.6137  -3.547 0.009377
class032nd class      2.018     0.8222   2.455 0.043800
class031st class      2.760     0.8558   3.225 0.014547

Sigma link function: log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.801     0.7508  -2.399  0.03528
. . .
Global Deviance: 73.80329
AIC:             85.80329
SBC:             88.71273

Notice that the AIC statistic is reduced from 157.77 for the grouped logistic model to 85.80 for the beta-binomial model. This is a substantial improvement in model fit. The heterogeneity or dispersion parameter, sigma, is 0.165.

Sigma  [gamlss's sigma is log(sigma)]
> exp(-1.801)
[1] 0.1651337

Odds ratios for the beta binomial are inflated compared to the grouped logit, but the p-values are closely the same.

> exp(coef(mybb))
     (Intercept)        ageadults           sexman class032nd class
       4.4738797        0.1105858        0.1133972        7.5253615
class031st class
      15.8044343

I also calculated robust or sandwich standard errors for the beta-binomial model. 2nd class and age resulted in nonsignificant p-values. This result is the same as given with the above robust grouped logit model. gamlss does not work well with sandwich estimators; the calculations were done using Stata. See the book's web site for results.
The beta-binomial model is preferred to the single parameter logistic model. However, extra correlation still needs to be checked and adjusted. We should check for an interactive effect between age and sex, and between both age and sex and class 1. I shall leave that as an exercise for the reader. It appears, though, from looking at the model main effects only, that females holding 1st and 2nd class tickets stood the best odds of survival on the Titanic. If they were female children, they stood even better odds. 3rd class ticket holders, and in particular 3rd class male passengers, fared the worst. It should be noted that 1st class rooms were very expensive, with the best going for some US$100,000 in 2015 equivalent purchasing power.
The beta binomial is an important model, and should be considered for all overdispersed logistic models. In addition, for binomial models with probit and complementary loglog links, or with excess zero response values, Stata's betabin and zibbin commands (Hardin and Hilbe, 2013) have options for these models. Perhaps these capabilities will be made available to R users in the near future. The generalized binomial model is another function suitable for modeling overdispersed grouped logistic models. The model is available in Stata (Hardin and Hilbe, 2007) and SAS (Morel and Neerchal, 2012).


SAS CODE

/* Section 5.2 */

*Refer to data step in section 2.1 if manually input obser dataset;

*Build the logistic model;
proc genmod data = obser descending;
model y = x1 x2 x3 / dist = binomial link = logit;
run;

*Refer to data step in section 2.1 if manually input grp dataset;

*Build the logistic model;
proc genmod data = grp descending;
5.4
. use phmylgg
. gen cases = dead + alive
. glm dead white hmo los i.type, fam(bin cases)
. predict mu
. predict hat, hat
. predict dev, deviance
. gen stdev = dev/sqrt(1-hat)
. predict stpr, rstandard
. scatter stpr hat
. gen stdev2 = stdev^2
. scatter stdev2 mu

5.5
. use titanicgrp
. list
. glm died age sex b3.class, fam(bin cases) nolog
. glm, eform
. glm died age sex b3.class, fam(bin cases) vce(robust) nolog
. betabin died age sex b3.class, n(cases) nolog
. gen died = cases - survive


6  Bayesian Logistic Regression


6.1 A BRIEF OVERVIEW OF BAYESIAN METHODOLOGY
Bayesian methodology would likely not be recognized by the person who is
regarded as the founder of the tradition. Thomas Bayes (1702–1761) was a
British Presbyterian country minister and amateur mathematician who had a
passing interest in what was called inverse probability. Bayes wrote a paper
on the subject, but it was never submitted for publication. He died without
anyone knowing of its existence. Richard Price, a friend of Bayes, discovered
the paper when going through Bayes’s personal effects. Realizing its impor-
tance, he managed to have it published in the Royal Society’s Philosophical
Transactions in 1764. The method was only accepted as a curiosity and was
largely forgotten until Pierre-Simon Laplace, generally recognized as the
leading mathematician worldwide during this period, discovered it several
decades later and began to employ its central thesis to problems of probability.
However, how Bayes’s inverse probability was employed during this time is
quite different from how analysts currently apply it to regression modeling. For
those who are interested in the origins of Bayesian thinking, and its relation-
ship to the development of probability and statistics in general, I recommend
reading Weisberg (2014) or McGrayne (2011).
Inverse probability is simple in theory. Suppose that we know from epide-
miological records that the probability of a person having certain symptoms S
given that they have disease D is 0.8. This relationship may be symbolized as
Pr(S|D) = 0.8. However, most physicians want to know the probability of having
the disease if a patient displays these symptoms, or Pr(D|S). In order to find this
> dim(R84)
[1] 3874 15


6.2 EXAMPLES: BAYESIAN LOGISTIC REGRESSION

6.2.1 Bayesian Logistic Regression Using R

For an example we shall model the 1984 German health reform data, rwm1984. Our variable of interest is a patient's work status. If they are not working, outwork = 1; if they are employed or are otherwise working, outwork = 0. The predictors we use to understand outwork are:

docvis : The number of visits made to a physician during the year, from 0 to 121.
female : 1 = female; 0 = male.
kids : 1 = has children; 0 = no children.
age : age, from 25 to 64.

The data are first loaded and renamed R84. We shall view the data, including other variables in the data set.

> library(MCMCpack)
> library(LOGIT)
> data(rwm1984)
> R84 <- rwm1984

# DATA PROFILE
> head(R84)
  docvis hospvis edlevel age outwork female married kids hhninc educ self
1      1       0       3  54       0      0       1    0  3.050 15.0    0
2      0       0       1  44       1      1       1    0  3.050  9.0    0
3      0       0       1  58       1      1       0    0  1.434 11.0    0
4      7       2       1  64       0      0       0    0  1.500 10.5    0
5      6       0       3  30       1      0       0    0  2.400 13.0    0
6      9       0       3  26       1      0       0    0  1.050 13.0    0
  edlevel1 edlevel2 edlevel3 edlevel4
1        0        0        1        0
2        1        0        0        0

The response variable, outwork, has 1420 1s and 2454 0s, a ratio of 1s to 0s of 0.5786.

> table(R84$outwork)
   0    1
2454 1420

Other characteristics of the data to be modeled, including the centering of both continuous predictors, are given as follows:

# SUMMARIES OF THE TWO CONTINUOUS VARIABLES
> summary(R84$docvis)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   1.000   3.163   4.000 121.000

> summary(R84$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     25      35      44      44      54      64

# CENTER BOTH CONTINUOUS PREDICTORS
> R84$cage <- R84$age - mean(R84$age)
> R84$cdoc <- R84$docvis - mean(R84$docvis)

We shall first model the data based on a standard logistic regression, and then by a logistic regression with the standard errors scaled by the square root of the Pearson dispersion. The scaled logistic model, as discussed in the previous chapter, is sometimes referred to as a "quasibinomial" model. We model both to determine if there is extra variability in the data that may require adjustments. The tables of coefficients for each model are not displayed below, but are stored in myg and myq, respectively. I shall use the toOR function to display the odds ratios and associated statistics of both models in close proximity. The analyst should inspect the delta (SEs) values to determine if they differ from each other by much. If they do, then there is variability in the data. A scaled logistic model, or other adjusted models, should be used on the data, including a Bayesian model. Which model we use depends on what we think is the source of the extra correlation.
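The code that builds the two models is not shown in this excerpt; judging from the summary(myg) output displayed later in the chapter, it presumably looks roughly like the following sketch (the formula is assumed from the predictors used later):

# Sketch: standard and scaled ("quasibinomial") logistic models whose
# odds ratios are compared with toOR(); not printed in this excerpt
myg <- glm(outwork ~ cdoc + female + kids + cage,
           family = binomial, data = R84)
myq <- glm(outwork ~ cdoc + female + kids + cage,
           family = quasibinomial, data = R84)

After fitting, toOR(myg) and toOR(myq) would display the odds-ratio tables the text compares.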
Compare the output above for the noninformative prior with SAS output on the same data and model. The results are remarkably similar.

POSTERIOR SUMMARIES
                                STANDARD              PERCENTILES
PARAMETER    N         MEAN     DEVIATION    25%       50%       75%
Intercept    100,000   -2.0140  0.0815       -2.0686   -2.0134   -1.9586
Cdoc         100,000    0.0247  0.00632       0.0204    0.0246    0.0289
Female       100,000    2.2605  0.0832        2.2043    2.2602    2.3166
Kids         100,000    0.3596  0.0907        0.2981    0.3590    0.4207
Cage         100,000    0.0545  0.00418       0.0516    0.0545    0.0573

POSTERIOR INTERVALS
PARAMETER    ALPHA    EQUAL-TAIL INTERVAL      HPD INTERVAL
Intercept    0.050    -2.1755   -1.8557        -2.1710   -1.8520
Cdoc         0.050     0.0124    0.0373         0.0124    0.0373
Female       0.050     2.0989    2.4242         2.0971    2.4220
Kids         0.050     0.1831    0.5382         0.1838    0.5386
Cage         0.050     0.0463    0.0628         0.0464    0.0628

number of different models. Of course, our example will be to show its use in creating a Bayesian logistic model.
First, make sure you have installed JAGS on your computer. It is freeware, as is R. JAGS is similar to WinBUGS and OpenBUGS, which can also be run as standalone packages or within the R environment. JAGS is many times preferred by those in the hard sciences like physics, astronomy, ecology, biology, and so forth since it is command-line driven, and written in C++ for speed. WinBUGS and OpenBUGS are written in Pascal, which tends to run slower than C++ implementations, but can be run within the standalone WinBUGS or OpenBUGS environments, which include menus, help, and so forth. The BUGS programs are more user-friendly. Both OpenBUGS and JAGS are also able to run on a variety of platforms, which is advantageous to many users. In fact, WinBUGS is no longer being developed or supported. The developers are putting all of their attention to OpenBUGS. Lastly, and what I like about it, when JAGS is run from within R, the program actually appears as if it is just another R package. I do not feel as if I am using an outside program.
To start it is necessary to have JAGS in R's path, and the R2jags package needs to be installed and loaded. For the first JAGS example you also should bring two functions contained in jhbayes.R into memory using the source function.

> library(R2jags)
> source("c://Rfiles/jhbayes.R")   # or wherever you store R files; book's web site

The code in Table 6.1 is specific to the model we have been working with in the previous section. However, as you can see, it is easily adaptable for other logistic models. With a change in the log-likelihood, it can also be used with other distributions and can be further amended to incorporate random effects, mixed effects, and a host of other models.
Let us walk through the code in Table 6.1. Doing so will make it much easier for you to use it for other modeling situations. The top two lines

X <- model.matrix(~ cdoc + female + kids + cage,
                  data = R84)
K <- ncol(X)

create a matrix of predictors, X, from the model R84, and a variable, K, which contains the number of predictors contained in X. A column of 1s for the intercept is also generated by model.matrix().
The next code segment is logit.data, although we may call it anything we wish. logit.data is a list of the components of the JAGS model we are

[Figure: trace and density plots of the posterior samples for cdoc, female, kids, and cage; N = 100,000 iterations]

TABLE 6.1 JAGS code for Bayesian logistic model
X <- model.matrix(~ cdoc + female + kids + cage,
                  data = R84)
K <- ncol(X)
logit.data <- list(Y = R84$outwork,
                   N = nrow(R84),
                   X = X,
                   K = K,
                   LogN = log(nrow(R84)),
                   b0 = rep(0, K),
                   B0 = diag(0.00001, K)
)
sink("LOGIT.txt")
cat("
model{
    # Priors
    beta ~ dmnorm(b0[], B0[,])

    # Likelihood
    for (i in 1:N){
        Y[i] ~ dbern(p[i])
        logit(p[i]) <- max(-20, min(20, eta[i]))
        eta[i] <- inprod(beta[], X[i,])
        LLi[i] <- Y[i] * log(p[i]) +
                  (1 - Y[i]) * log(1 - p[i])
    }
    LogL <- sum(LLi[1:N])
    AIC <- -2 * LogL + 2 * K
    BIC <- -2 * LogL + LogN * K
}
", fill = TRUE)
sink()

# INITIAL VALUES - BETAS AND SIGMAS
inits <- function () {
    list(
        beta = rnorm(K, 0, 0.1)
    ) }

params <- c("beta", "LogL", "AIC", "BIC")

# JAGS
J0 <- jags(data = logit.data,
           inits = inits,
           parameters = params,
           model.file = "LOGIT.txt",
           n.thin = 10,
           n.chains = 3,
           n.burnin = 40000,
           n.iter = 50000)

# OUTPUT DISPLAYED
out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K), "LogL", "AIC", "BIC"))
round(myB, 4)
sampling values are discarded before values are kept in the posterior distribution. The initial values can vary widely, and skew the results. If all of the early values were kept, the mean of the posterior distribution could be severely biased. Discarding a sizeable number of early values helps guarantee a better posterior. Finally, n.iter specifies how many values are kept for the posterior distribution, after thinning and discarding of burn-in values.

J0 <- jags(data = logit.data,
           inits = inits,
           parameters = params,
           model.file = "LOGIT.txt",
           n.thin = 10,
           n.chains = 3,
           n.burnin = 40000,
           n.iter = 50000)

After running the jags function, which we have called J0, typing J0 on the R command line will provide raw model results. The final code in Table 6.1 provides nicer looking output. The source code in jhbayes.R is relevant at this point. jhbayes.R consists of two small functions from the Zuur support package, MCMCSupportHighstat.R, which comes with Zuur, Hilbe and Ieno (2013) and is available for other books by Zuur as well. The posterior means, or betas, the log-likelihood function, and AIC and BIC statistics are displayed, together with their standard errors and outer 0.025 "credible set." We specified that only four decimal digits are displayed. BUGSoutput and MyBUGSOutput are parts of the R2jags package:

out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K), "LogL", "AIC", "BIC"))
round(myB, 4)

The Bayesian logistic model results are listed in the table below.

> round(myB, 4)
              mean      se       2.5%      97.5%
beta[1]    -2.0193  0.0824    -2.1760    -1.8609
beta[2]     0.0245  0.0063     0.0128     0.0370
beta[3]     2.2569  0.0843     2.0922     2.4216
beta[4]     0.3685  0.0904     0.1920     0.5415
beta[5]     0.0545  0.0042     0.0466     0.0626
LogL    -1961.6258  1.5178 -1965.4037 -1959.5816
AIC      3933.2517  3.0357  3929.1632  3940.8074
BIC      3964.5619  3.0357  3960.4734  3972.1176

Compare the above statistics with the summary table of myg, which was the model as estimated using the glm function. Note that the AIC values are statistically identical. This output also matches the SAS results displayed earlier, estimated using noninformative priors.

> summary(myg)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.010276   0.081087 -24.792  < 2e-16 ***
cdoc         0.024432   0.006263   3.901 9.57e-05 ***
female       2.256804   0.082760  27.269  < 2e-16 ***
kids         0.357976   0.089962   3.979 6.92e-05 ***
cage         0.054379   0.004159  13.075  < 2e-16 ***
---
Null deviance: 5091.1 on 3873 degrees of freedom
Residual deviance: 3918.2 on 3869 degrees of freedom
AIC: 3928.2

A comparison of the frequency-based standard logistic regression and our two Bayesian models without informative priors reveals nearly identical values. Note that using two entirely different methods of estimation—maximum likelihood and sampling—results in the same values. This tells us that these estimation procedures are valid ways of estimating the true underlying parameter values of the distribution theoretically generating the data.

> round(cbind(coef(myg), Bcoef, myB[1:K,1]), 4)
                       Bcoef
(Intercept) -2.0103 -2.0131 -2.0193
cdoc         0.0244  0.0246  0.0245
female       2.2568  2.2592  2.2569
kids         0.3580  0.3575  0.3685
cage         0.0544  0.0544  0.0545

The example above did not employ an informative prior. For instance, we could have provided information that reflected our knowledge that docvis has between 40% and 50% zero counts. We compounded the problem since docvis was centered, becoming cdoc. The centered values for when docvis = 0 are −3.162881. They are −2.162881 when docvis = 1. We can therefore set up a prior that we expect 40%–50% zero counts when cdoc is less than −3.


6.2.3 Bayesian Logistic Regression with Informative Priors

In a regression model the focus is on placing priors on parameters in order to develop adjusted posterior parameter values. For example, we could set a prior on the coefficient of cdoc such that we are 75% confident that the coefficient
BIC      5.024e+03  5.024e+03  5.025e+03  5.027e+03  5.031e+03
LogL    -2.511e+03 -2.509e+03 -2.508e+03 -2.508e+03 -2.508e+03
beta.0  -6.282e-01 -5.725e-01 -5.476e-01 -5.230e-01 -4.682e-01
beta.1   3.839e-02  4.629e-02  5.050e-02  5.481e-02  6.325e-02
beta.2  -1.112e+01 -9.996e-01 -7.193e-03  9.644e-01  1.070e+01

in errors. Even copying code from my own saved Word and PDF documents to R's editor caused problems. Many times I had to retype quotation marks, minus signs, and several other symbols in order for R to run properly. I also should advise you that when in the R editor, it may be wise to "run" long stretches of code in segments. That is, rather than select the entire program code, select and run segments of it. I have had students, and those who have purchased books of mine that include R code, email me that they cannot run the code. I advise them to run it in segments. Nearly always they email back that they now have no problems. Of course, at times in the past there have indeed been errors in the code, but know that the code in this book has all been successfully run multiple times. Make sure that the proper libraries and data have been installed and loaded before executing code.
There is a lot of information in the book. However, I did not discuss issues such as missing values, survey analysis, validation, endogeny, and latent class models. These are left for my comprehensive book titled Logistic Regression Models (2009, Chapman & Hall), which is over 650 pages in length. A forthcoming second edition will include both Stata and R code in the text, with SAS code as it is with this book. Bayesian logistic regression will be more thoroughly examined, with Bayesian analysis of grouped, ordered, multinomial, hierarchical, and other related models addressed.
I primarily wrote this book to go with a month-long web-based course I teach with Statistics.com. I have taught the course with them since 2003, three classes a year, and continually get questions and feedback from researchers, analysts, and professors from around the world. I have also taught logistic regression and given workshops on it for over a quarter of a century. In this book, I have tried to address the most frequent concerns and problem areas that practicing analysts have informed me about. I feel confident that anyone reading carefully through this relatively brief monograph will come away from it with a solid knowledge of how to use logistic regression—both observation based and grouped. For those who wish to learn more after going through this book, I recommend my Logistic Regression Models (2009, 2016 in preparation). I also recommend Bilder and Loughin (2015), which uses R code for examples, Collett (2003), Dohoo et al. (2012), and, for nicely written shorter books dealing with logistic regression and the GLM in general, Dobson and Barnett (2008), Hardin and Hilbe (2013), and Smithson and Merkle (2014). Hosmer et al. (2013) is also a fine reference book on the subject, but there is no code provided with the book. The other recommended books have code to support examples, which I very much believe assists the learning process.
I invite readers of this book to email me their comments and suggestions about it. The book's site, works.bepress.com/joseph_hilbe/, has the data sets used in the book in various formats, and all of the code used in the book in electronic format. Both SAS and Stata code and output are also provided. I should also mention that Hilbe et al. (2016) will provide a clear analysis of Bayesian modeling as applied to astronomical data.


References

Bilder, C.R. and Loughin, T.M. 2015. Analysis of Categorical Data with R. Boca Raton, FL: Chapman & Hall/CRC.
Christensen, R., Johnson, W., Branscum, A. and Hanson, T.E. 2011. Bayesian Ideas and Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
Collett, D. 2003. Modeling Binary Data, 2nd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Cowles, M.K. 2013. Applied Bayesian Statistics. New York, NY: Springer.
De Souza, R.S., Cameron, E., Killedar, M., Hilbe, J., Vilatia, R., Maio, U., Biffi, V., Riggs, J.D. and Ciardi, B., for the COIN Collaboration. 2015. The overlooked potential of generalized linear models in astronomy—I: Binomial regression and numerical simulations. Astronomy & Computing, DOI: 10.1016/j.ascom.2015.04.002.
Dobson, A.J. and Barnett, A.G. 2008. An Introduction to Generalized Linear Models, 3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Dohoo, I., Martin, W. and Stryhn, H. 2012. Methods in Epidemiological Research. Charlottetown, PEI, CA: VER.
Firth, D. 1993. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. 2014. Bayesian Data Analysis, 3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bernardo, J.M., Berger, J.O., Dawid, A.P. and Smith, A.F.M. (eds.), Bayesian Statistics 4. Oxford, UK: Clarendon Press.
Hardin, J.W. and Hilbe, J.M. 2007. Generalized Linear Models and Extensions, 2nd Edn. College Station, TX: Stata Press.
Hardin, J.W. and Hilbe, J.M. 2013. Generalized Linear Models and Extensions, 3rd Edn. College Station, TX: Stata Press/CRC (4th edition due out in late 2015 or early 2016).
Hardin, J.W. and Hilbe, J.M. 2014. Estimation and testing of binomial and beta-binomial regression models with and without zero inflation. Stata Journal 14(2): 292–303.
Heinze, G. and Schemper, M. 2002. A solution to the problem of separation in logistic regression. Statistics in Medicine 21, 2409–2419.
Hilbe, J.M. 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M. 2011. Negative Binomial Regression, 2nd Edn. Cambridge, UK: Cambridge University Press.
Hilbe, J.M. 2014. Modeling Count Data. New York, NY: Cambridge University Press.
Hilbe, J.M. and Robinson, A.P. 2013. Methods of Statistical Model Estimation. Boca Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M., de Souza, R.S. and Ishida, E. 2016. Bayesian Models for Astrophysical Data: Using R/JAGS and Python/Stan. Cambridge, UK: Cambridge University Press.
Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. 2013. Applied Logistic Regression, 3rd Edn. Hoboken, NJ: Wiley.