
Practical Guide to Logistic Regression

Joseph M. Hilbe
Jet Propulsion Laboratory
California Institute of Technology, USA

and

Arizona State University, USA



3.3 Information Criterion Tests 58
3.3.1 Akaike Information Criterion 58
3.3.2 Finite Sample 59
3.3.3 Bayesian Information Criterion 60
3.3.4 Other Information Criterion Tests 60
3.4 The Model Fitting Process: Adjusting Standard Errors 61
3.4.1 Scaling Standard Errors 61
3.4.2 Robust or Sandwich Variance Estimators 63
3.4.3 Bootstrapping 64
3.5 Risk Factors, Confounders, Effect Modifiers, and Interactions 65
SAS Code 67
Stata Code 70

4 Testing and Fitting a Logistic Model 71
4.1 Checking Logistic Model Fit 71
4.1.1 Pearson Chi2 Goodness-of-Fit Test 71
4.1.2 Likelihood Ratio Test 72
4.1.3 Residual Analysis 73
4.1.4 Conditional Effects Plot 79
4.2 Classification Statistics 81
4.2.1 S–S Plot 84
4.2.2 ROC Analysis 84
4.2.3 Confusion Matrix 86
4.3 Hosmer–Lemeshow Statistic 88
4.4 Models with Unbalanced Data and Perfect Prediction 91
4.5 Exact Logistic Regression 93
4.6 Modeling Table Data 96
SAS Code 101
Stata Code 105

5 Grouped Logistic Regression 107
5.1 The Binomial Probability Distribution Function 107
5.2 From Observation to Grouped Data 109
5.3 Identifying and Adjusting for Extra Dispersion 113
5.4 Modeling and Interpretation of Grouped Logistic Regression 115
5.5 Beta-Binomial Regression 117
SAS Code 123
Stata Code 125

6 Bayesian Logistic Regression 127
6.1 A Brief Overview of Bayesian Methodology 127
6.2 Examples: Bayesian Logistic Regression 130
6.2.1 Bayesian Logistic Regression Using R 130
6.2.2 Bayesian Logistic Regression Using JAGS 137
6.2.3 Bayesian Logistic Regression with Informative Priors 143
SAS Code 147
Stata Code 148
Concluding Comments 149

References 151

This book is aimed at the working analyst or researcher who finds that they need some guidance when modeling binary response data. It is also of value for those who have not used logistic regression in the past, and who are not familiar with how it is to be implemented. I assume, however, that the reader has taken a basic course in statistics, including instruction on applying linear regression to study data. It is sufficient if you have learned this on your own. There are a number of excellent books and free online tutorials related to regression that can provide this background.

I think of this book as a basic guidebook, as well as a tutorial between you and me. I have spent many years teaching logistic regression, using logistic-based models in research, and writing books and articles about the subject. I have applied logistic regression in a wide variety of contexts—for medical and health outcomes research, in ecology, fisheries, astronomy, transportation, insurance, economics, recreation, sports, and in a number of other areas. Since 2003, I have also taught both the month-long Logistic Regression and Advanced Logistic Regression courses for Statistics.com, a comprehensive online statistical education program. Throughout this process I have learned what the stumbling blocks and problem areas are for most analysts when using logistic regression to model data. Since those taking my courses are located at research sites and universities throughout the world, I have been able to gain a rather synoptic view of the methodology and of its use in research in a wide variety of applications.

In this volume, I share with you my experiences in using logistic regression, and aim to provide you with the fundamental logic of the model and its appropriate application. I have written it to be the book I wish I had read when first learning about the model. It is much smaller and more concise than my 656-page Logistic Regression Models (Chapman & Hall/CRC, 2009), which is a general reference to the full range of logistic-based models. Rather, this book focuses on how best to understand the key points of the basic logistic regression model and how to use it properly to model a binary response variable. I do not discuss the esoteric details of estimation or provide detailed analysis of the literature regarding various modeling strategies in this volume, but rather I focus on the most important features of the logistic model—how to construct a logistic model, how to interpret coefficients and odds ratios, how to predict probabilities based on the model, and how to evaluate the model as to its fit. I also provide a final chapter on Bayesian logistic regression, providing an overview of how it differs from the traditional frequentist tradition. An important component of our examination of Bayesian modeling will be a step-by-step guide through JAGS code for modeling real German health outcomes data. The reader should be able to attain a basic understanding of how Bayesian logistic regression models can be developed and interpreted—and be able to develop their own models using the explanation in the book as a guideline.

Resources for how to learn how to model slightly more complicated models will be provided—where to go for the next step. Bayesian modeling is having a continually increasing role in research, and every analyst should at least become acquainted with how to understand this class of models, and with how to program basic Bayesian logistic models when doing so is advisable.

R statistical software is used to display all but one statistical model discussed in the book—exact logistic regression. Otherwise R is used for all data management, models, postestimation fit analyses, tests, and graphics related to our discussion of logistic regression in the book. SAS and Stata code for all examples is provided at the conclusion of each chapter. Complete Stata and SAS code and output, including graphics and tables, is provided on the book's web site. R code is also provided on the book's web site, as well as in the LOGIT package posted on CRAN.

R is used in the majority of newly published texts on statistics, as well as for examples in most articles found in statistics journals published since 2005. R is open ware, meaning that it is possible for users to inspect the actual code used in the analysis and modeling process. It is also free, costing nothing to download into one's computer. A host of free resources is available to learn R, and blogs exist that can be used to ask others how to perform various operations. It is currently the most popular statistical software worldwide; hence, it makes sense to use it for examples in this relatively brief monograph on logistic regression. But as indicated, SAS and Stata users have the complete code to replicate all of the R examples in the text itself. The code is in both printed format as well as electronic format for immediate download and use.

A caveat: Keep in mind that when copying code from a PDF document, or even from a document using a different font from that which is compatible with R or Stata, you will likely find that a few characters need to be retyped in order to successfully execute. For example, when pasting program code from a PDF or word document into the R editor, characters such as "quotation marks" and "minus signs" may not convert properly. To remedy this, you need to retype the quotation or minus sign in the code you are using.

It is also important to remember that this monograph is not about R, or any specific statistical software package. We will foremost be interested in the logic of logistic modeling. The examples displayed are aimed to clarify the modeling process. The R language, although popular and powerful, is nevertheless tricky. It is easy to make mistakes, and R is rather unforgiving when you do. I therefore give some space to explaining the R code used in the modeling and evaluative process when the code may not be clear. The goal is to provide you with code you can use directly, or adapt as needed, in order to make your modeling tasks both easier and more productive.

I have chosen to provide Stata code at the end of each chapter since Stata is one of the most popular and, to my mind, powerful statistical packages on the
Author

Joseph M. Hilbe (1944–) is a Solar System Ambassador with NASA's Jet Propulsion Laboratory, California Institute of Technology, an adjunct professor of statistics at Arizona State University, and emeritus professor at the University of Hawaii. He is currently president of the International Astrostatistics Association, is an elected Fellow of the American Statistical Association, is an Elected Member of the International Statistical Institute and Full Member of the American Astronomical Society. Professor Hilbe is one of the leading statisticians in modeling discrete and longitudinal data, and has authored a number of books in these areas including best sellers, Logistic Regression Models (Chapman & Hall/CRC, 2009), two editions of Negative Binomial Regression (Cambridge University Press, 2007, 2011), and Modeling Count Data (Cambridge University Press, 2014).

Other statistics books by Joseph M. Hilbe:

Generalized Linear Models and Extensions (2001, 2007, 2013, 2016; with J. Hardin)
Methods of Statistical Model Estimation (2013; with A. Robinson)
Generalized Estimating Equations (2003, 2013; with J. Hardin)
Quasi-Least Squares Regression (2014; with J. Shults)
R for Stata Users (2010; with R. Muenchen)
A Beginner's Guide to GLM and GLMM with R: A Frequentist and Bayesian Perspective for Ecologists (2013; with A. Zuur and E. Ieno)
Astrostatistical Challenges for the New Astronomy (2013)
Practical Predictive Analytics and Decisioning Systems for Medicine (2015; with L. Miner, P. Bolding, M. Goldstein, T. Hill, R. Nesbit, N. Walton, and G. Miner)
Solutions Manual for Logistic Regression Models (2009)


distribution function or PDF. The analyst does not usually observe the entire range of data defined by the underlying PDF, called the population data, but rather observes a random sample from the underlying data. If the sample of data is truly representative of the population data, the sample data will be described by the same PDF as the population data, and have the same values of its parameters, which are initially unknown.

Parameters define the specific mean or location (shape) and perhaps scale of the PDF that best describes the population data, as well as the distribution of the random sample from the population. A statistical model is the relationship between the parameters of the underlying PDF of the population data and the estimates made by an analyst of those parameters.

Regression is one of the most common ways of estimating the true parameters in as unbiased a manner as possible. That is, regression is typically used to establish an accurate model of the population data. Measurement error can creep into the calculations at nearly every step, and the random sample we are testing may not fully resemble the underlying population of data, nor its true parameters. The regression modeling process is a method used to understand and control the uncertainty inherent in estimating the true parameters of the distribution describing the population data. This is important since the predictions we make from a model are assumed to come from this population.

Finally, there are typically only a limited range of PDFs which analysts use to describe the population data, from which the data we are analyzing is assumed to be derived. If the variable we are modeling, called the response term (y), is binary (0,1), then we will want to use a Bernoulli probability distribution to describe the data. The Bernoulli distribution, as we discuss in more detail in the next section, consists of a series of 1s and 0s. If the variable we wish to model is continuous and appears normally distributed, then we assume that it can be best modeled using a Gaussian (normal) distribution. This is a pretty straightforward relationship. Other probability distributions commonly used in modeling are the lognormal, binomial, exponential, Poisson, negative binomial, gamma, inverse Gaussian, and beta PDFs. Mixtures of distributions are also constructed to describe data. The lognormal, negative binomial, and beta binomial distributions are such mixture distributions—but they are nevertheless completely valid PDFs and have the same basic assumptions as do other PDFs.

I should also mention that probability distributions do not all have the same parameters. The Bernoulli, exponential, and Poisson distributions are single parameter distributions, and models directly based on them are single parameter models. That parameter is the mean or location parameter. The normal, lognormal, gamma, inverse Gaussian, beta, beta binomial, binomial, and negative binomial distributions are two parameter models. The first four of these are continuous distributions with mean (shape) and scale (variability) parameters. The binomial, beta, and beta binomial distributions will be discussed later when discussing grouped logistic regression.

The catcher in this is that a probability distribution has various assumptions. If these assumptions are violated, the estimates we make of the parameters are biased, and may be incorrect. Statisticians have worked out a number of adjustments for what may be called "violations of distributional assumptions," which are important for an analyst to use when modeling data exhibiting problems. I'll mention these assumptions shortly, and we will address them in more detail as we progress through the book.

I fully realize that the above description of a statistical model—of a parametric statistical model—is not the way we normally understand the modeling process, and it may be a bit confusing. But it is in general the way statisticians think of statistical modeling, and is the basis of the frequency-based tradition of statistical modeling. Keep these relationships in mind as we describe logistic regression.

1.2  BASICS OF LOGISTIC REGRESSION MODELING

Logistic regression is foremost used to model a binary (0,1) variable based on one or more other variables, called predictors. The binary variable being modeled is generally referred to as the response variable, or the dependent variable. I shall use the term "response" for the variable being modeled since it has now become the preferred way of designating it. For a model to fit the data well, it is assumed that

The predictors are uncorrelated with one another.
That they are significantly related to the response.
That the observations or data elements of a model are also uncorrelated.

As discussed in the previous section, the response is also assumed to fit closely to an underlying probability distribution from which the response is a theoretical sample. The goal of a model is to estimate the true parameter(s) of the underlying PDF of the model based on the response as adjusted by its predictors. In the case of logistic regression, the response is binary (0,1) and follows a Bernoulli probability distribution. Since the Bernoulli distribution is a subset of the more general binomial distribution, logistic regression is recognized as a member of the binomial family of regression models. A comprehensive analysis of these relationships is provided in Hilbe (2009).
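As a quick illustration of this relationship, the sketch below simulates a binary response from a Bernoulli distribution whose probability is the inverse logit of a known linear predictor, and then fits the model with R's glm function. This example is not from the book; the seed, sample size, and "true" coefficients (0.75 and -1.25) are arbitrary values chosen only to show that the fitted model recovers the parameters of the underlying PDF.

> set.seed(2016)                           # arbitrary seed, for reproducibility
> x <- rbinom(5000, size = 1, prob = 0.5)  # a binary predictor
> eta <- 0.75 - 1.25*x                     # assumed true linear predictor
> p <- 1/(1 + exp(-eta))                   # inverse logit gives the Bernoulli probability
> y <- rbinom(5000, size = 1, prob = p)    # Bernoulli (binomial with n = 1) response
> coef(glm(y ~ x, family = binomial))      # estimates should be close to 0.75 and -1.25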

One of the nice features of presenting the log-likelihood function in exponential form is that we may abstract from it a link function as well as the mean and variance functions of the underlying Bernoulli distribution. The link function, which I'll discuss shortly, is whatever follows the y of the first term of the right-hand side of Equation 1.5. Here it is log(p/(1 − p)). The mean of the distribution can be obtained as the derivative of the negative of the second term with respect to the link. The second derivative yields the variance. For the Bernoulli distribution, these values are

Mean = \mu = p

Variance = V(\mu) = p(1 - p) = \mu(1 - \mu)

In the case of Bernoulli-based logistic regression, the mean is symbolized as μ (mu) and variance as μ(1 − μ). The above link and log-likelihood functions are many times expressed in terms of μ as well. It is important to note that strictly speaking the estimated p or μ should be symbolized as p̂ and μ̂, respectively. p and μ are typically reserved for the true population values. However, for ease of interpretation, I will use the symbol μ in place of μ̂ throughout the book.

I should also mention that for grouped logistic regression, which we address in Chapter 5, μ and p are not the same, with μ defined as n ⋅ p. But I'll delay making this distinction until we begin discussing grouped models.

Let us look at a logistic regression and how it differs from normal or ordinary linear regression. Recall that a regression attempts to understand a response variable on the basis of one or more predictors or explanatory variables. This is usually symbolized as

\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}    (1.6)

where y-hat, or ŷ, is the sum of the terms in the regression. The sum of regression terms is also referred to as the linear predictor, or xb. Each βx is a term indicating the value of a predictor, x, and its coefficient, β. In linear regression, which is based in matrix form on the Gaussian or normal probability distribution, ŷ is the predicted value of the regression model as well as the linear predictor. j indicates the number of predictors in a model. There is a linear relationship between the predicted or fitted values of the model and the terms on the right-hand side of Equation 1.6—the linear predictor: ŷ = xb. This is not the case for logistic regression.

The linear predictor of the logistic model is

x_i b = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}    (1.7)

However, the fitted or predicted value of the logistic model is based on the link function, log(μ/(1 − μ)). In order to establish a linear relationship of the predicted value, μ, and the linear predictor, we have the following relationship:

\ln\left(\frac{\mu_i}{1 - \mu_i}\right) = x_i b = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}    (1.8)

where μ, like p, is the probability that the response value y is equal to 1. It can also be thought of as the probability of the presence or occurrence of some characteristic, while 1 − p can be thought of as the probability of the absence of that characteristic. Notice that μ/(1 − μ), or p/(1 − p), is the formula for odds. The odds of something occurring is the probability of its success or presence divided by the probability of its failure or absence, 1 − p. If μ = 0.7, (1 − μ) = 0.3. μ + (1 − μ) always equals 1. The log of the odds has been called by statisticians the logit function, from which the term logistic regression derives.

In order to determine μ on the basis of the linear predictor, xb, we solve the logit function for μ, without displaying subscripts, as

\mu = \frac{\exp(xb)}{1 + \exp(xb)} = \frac{1}{1 + \exp(-xb)}    (1.9)

The equations in (1.9) above are very important, and will be frequently used in our later discussion. Once a logistic model is solved, we may calculate the linear predictor, xb, and then apply either equation to determine the predicted value, μ, for each observation in the model.

1.4  METHODS OF ESTIMATION

Maximum likelihood estimation, MLE, is the standard method used by statisticians for estimating the parameter estimates of a logistic model. Other

> library(LOGIT)
> data(medpar)
> head(medpar)
  los hmo white died age80 type provnum
1   4   0     1    0     0    1  030001
2   9   1     1    0     0    1  030001
3   3   1     1    1     1    1  030001
4   9   0     1    0     0    1  030001
5   1   0     1    1     1    1  030001
6   4   0     1    1     0    1  030001

We may run the model using the following code:

> mylogit <- irls_logit(died ~ hmo + white, data=medpar)
> mylogit
$coef
X(Intercept)         Xhmo       Xwhite
 -0.92618620  -0.01224648   0.30338724

$se
X(Intercept)       Xhmo     Xwhite
   0.1973903  0.1489251  0.2051795

Just typing the model name we assigned, mylogit, displays the coefficients and standard errors of the model. We can make a table of estimates, standard errors, z-statistic, p-value, and confidence intervals by using the code:

> coef <- mylogit$coef
> se <- mylogit$se
> zscore <- coef / se
> pvalue <- 2*pnorm(abs(zscore),lower.tail=FALSE)
> loci <- coef - 1.96 * se
> upci <- coef + 1.96 * se
> coeftab <- data.frame(coef, se, zscore, pvalue, loci, upci)
> round(coeftab, 4)
                coef     se  zscore pvalue    loci    upci
X(Intercept) -0.9262 0.1974 -4.6922 0.0000 -1.3131 -0.5393
Xhmo         -0.0122 0.1489 -0.0822 0.9345 -0.3041  0.2796
Xwhite        0.3034 0.2052  1.4786 0.1392 -0.0988  0.7055

Running the same data using R's glm function produces the following output. I have deleted some ancillary output.

> glmlogit <- glm(died ~ hmo + white, family=binomial,
    data=medpar)
> summary(glmlogit)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.92619    0.19739  -4.692  2.7e-06 ***
hmo         -0.01225    0.14893  -0.082    0.934
white        0.30339    0.20518   1.479    0.139
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1920.6 on 1492 degrees of freedom
AIC: 1926.6

The confidence intervals must be calculated separately. To obtain model-based standard errors, we use the confint.default function. Using the confint function produces what are called profile confidence intervals. We shall discuss these later in Chapter 2, Section 2.3.

> confint.default(glmlogit)
                  2.5 %     97.5 %
(Intercept) -1.31306417 -0.5393082
hmo         -0.30413424  0.2796413
white       -0.09875728  0.7055318

Again, I have displayed a full logistic regression model output to show where we are headed in our discussion of logistic regression. The output is very similar to that of ordinary linear regression. Interpretation, however, is different. How coefficients, standard errors, and so forth are to be interpreted will concern us in the following chapters.

SAS CODE

/* Section 1.4 */
*Import medpar as a temporary dataset;
proc import datafile="c:\data\medpar.dta" out=medpar
dbms=dta replace;

> upci <- coef + 1.96 * se
> coeftab <- data.frame(coef, se, zscore, pvalue, loci, upci)
> round(coeftab, 4)
                coef     se  zscore pvalue    loci   upci
X(Intercept)  1.0986 1.1547  0.9514 0.3414 -1.1646 3.3618
Xx           -1.5041 1.4720 -1.0218 0.3069 -4.3891 1.3810

The coefficient or slope of x is −1.5041 with a standard error of 1.472. The intercept value is 1.0986. The intercept is the value of the model when the value of x is zero.

Using R's glm function, the above data may be modeled using logistic regression as

> glm(y~ x, family = binomial, data = xdta)

Call: glm(formula = y ~ x, family = binomial, data = xdta)

Coefficients:
(Intercept)            x
      1.099       -1.504

Degrees of Freedom: 8 Total (i.e. Null); 7 Residual
Null Deviance:     12.37
Residual Deviance: 11.23   AIC: 15.23

More complete model results can be obtained by assigning the model a name, and then summarizing it with the summary function. We will name the model logit2.

> logit2 <- glm(y~ x, family = binomial, data = xdta)
> summary(logit2)

Call:
glm(formula = y ~ x, family = binomial, data = xdta)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.6651 -1.0108  0.7585  0.7585  1.3537

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
x             -1.504      1.472  -1.022    0.307

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 12.365 on 8 degrees of freedom
Residual deviance: 11.229 on 7 degrees of freedom
AIC: 15.229

Model-based confidence intervals may be displayed by

> confint.default(logit2)
                2.5 %   97.5 %
(Intercept) -1.164557 3.361782
x           -4.389065 1.380910

A more efficient way of displaying a logistic regression using R is to encapsulate the summary function around the regression. It will be the way I typically display example results using R.

> summary(logit2 <- glm(y~ x, family = binomial, data = xdta))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
x             -1.504      1.472  -1.022    0.307
. .
Null deviance: 12.365 on 8 degrees of freedom
Residual deviance: 11.229 on 7 degrees of freedom
AIC: 15.229

There are a number of ancillary statistics which are associated with modeling data with logistic regression. I will show how to do this as we progress, and functions and scripts for all logistic statistics, fit tests, graphics, and tables are provided on the book's web site, as well as in the LOGIT package that accompanies this book. The LOGIT package will also have the data, functions and scripts for the second edition of Logistic Regression Models (Hilbe, 2016).

For now we will focus on the meaning of the single binary predictor model. The coefficient of predictor x is −1.504077. A coefficient is a slope. It is the amount of the rate of change in y based on a one-unit change in x. When x is binary, it is the amount of change in y when x moves from 0 to 1 in value. But what is changed?

Recall that the linear predictor, xb, of a logistic model is defined as log(μ/(1 − μ)). This expression is called the log-odds or logit. It is the logistic link function, and is the basis for interpreting logistic model coefficients. The interpretation of x is that when x changes from 0 to 1, the log-odds of

2.2  PREDICTIONS, PROBABILITIES, AND ODDS RATIOS

I mentioned before that unlike linear regression, the model linear predictors and fitted values differ for logistic regression. If μ is understood as the predicted mean, or fitted value:

Linear regression      μ = x′β
Logistic regression    μ = exp(x′β)/(1 + exp(x′β))   or   μ = 1/(1 + exp(−x′β))

For the logistic model, μ is defined as the probability that y = 1, where y is the symbol for the model response term.

> logit2 <- glm( y ~ x, family = binomial, data = xdta)
> coef(logit2)
(Intercept)           x
   1.098612   -1.504077

LINEAR PREDICTOR WHEN X = 1
> 1.098612 - 1.504077*1
[1] -0.405465

LINEAR PREDICTOR WHEN X = 0
> 1.098612 - 1.504077*0
[1] 1.098612

We use R's post-glm function for calculating the linear predictor. The code below generates linear predictor values for all observations in the model. Remember that R has several ways that certain important statistics can be obtained.

> xb <- logit2$linear.predictors

The inverse logistic link function is used to calculate μ.

> mu <- 1/(1 + exp(-xb))

From the predicted probability that y = 1, or μ, the odds for each level of x may be calculated.

> o <- mu/(1-mu)

Let us now check the relationship of x to o, noting the values of o for the two values of x.

> check_o <- data.frame(x,o)
> round(check_o, 3)
  x     o
1 0 3.000
2 1 0.667
3 1 0.667
4 1 0.667
5 0 3.000
6 0 3.000
7 1 0.667
8 0 3.000
9 1 0.667

Recall that the odds ratio of x is the ratio of x = 1/x = 0. The odds of the intercept is the value of o when x = 0. In order to obtain the odds ratio of x when x = 1, we divide 0.667/3. So that we do not have rounding problems with the calculations, o = 0.667 will be indicated as o < 1. We will create a variable called or that retains the odds-intercept value (x = 0) or 3.0 and selectively changes each value of o < 1 to 0.667/3. The corresponding model coefficient may be determined by logging each value of or.

> or <- o
> or[or< 1] <- (.6666667/3)
> coeff <- log(or)

Finally we shall create a table of statistics, including all of the relevant values we have just calculated.

> data1 <- data.frame(y,x,xb,mu,o,or,coeff)
> round(data1,4)
  y x      xb   mu      o     or   coeff
1 1 0  1.0986 0.75 3.0000 3.0000  1.0986
2 1 1 -0.4055 0.40 0.6667 0.2222 -1.5041
3 0 1 -0.4055 0.40 0.6667 0.2222 -1.5041
4 0 1 -0.4055 0.40 0.6667 0.2222 -1.5041
5 1 0  1.0986 0.75 3.0000 3.0000  1.0986
6 0 0  1.0986 0.75 3.0000 3.0000  1.0986
7 0 1 -0.4055 0.40 0.6667 0.2222 -1.5041
8 1 0  1.0986 0.75 3.0000 3.0000  1.0986
9 1 1 -0.4055 0.40 0.6667 0.2222 -1.5041

What we find is that from the model linear predictor and probabilities we

of sources. One of the earliest adjustments made to standard errors was called scaling. R's glm function provides built-in scaling of binomial and Poisson regression standard errors through the use of the quasibinomial and quasipoisson options. Scaled standard errors are produced as the product of the model standard errors and square root of the Pearson dispersion statistic. Coefficients are left unchanged. Scaling is discussed in detail in Chapter 3, Section 3.4.1.

> summary(logitsc <- glm( y ~ x, family = quasibinomial, data = xdta))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.099      1.309   0.839    0.429
x             -1.504      1.669  -0.901    0.397

(Dispersion parameter for quasibinomial family taken to be 1.285715)

I will explain more about the Pearson statistic, the Pearson dispersion, scaling, and other ways of adjusting standard errors when we discuss model fit. However, the scaled standard error for x in the above model logitsc is calculated from model logit2 by

> 1.471959 * sqrt(1.285715)
[1] 1.669045

based on the formula I described. The dispersion statistic is displayed in the final line of the quasibinomial model output above. Regardless, many analysts advise that standard errors be adjusted by default. If data are not excessively correlated, scaled standard errors, for example, reduce to model standard errors.

The standard errors of odds ratios cannot be abstracted from a variance–covariance matrix. One calculates odds ratio standard errors using what statisticians call the delta method. See Hilbe (2009, 2016) for details. When the delta method is used for odds ratios, as well as for risk or rate ratios, the calculation is simple.

SEOR = exp(β) * SEcoef

Standard errors of odds ratios are calculated by multiplying the odds ratio by the coefficient standard error. Starting from the logit2 model, odds ratios and their corresponding standard errors may be calculated by

> logit2 <- glm( y ~ x, family = binomial, data = xdta)
> coef <- logit2$coefficients            # coefficients
> or <- exp(logit2$coefficients)         # odds ratios
> se <- sqrt(diag(vcov(logit2)))         # coefficient SE
> delta <- or*se                         # delta method, SE of OR
> ortab <- data.frame(or, delta)
> round(ortab, 4)
                or  delta
(Intercept) 3.0000 3.4641
x           0.2222 0.3271

2.3.2  z Statistics

The z statistic is the ratio of a coefficient to its standard error.

> zscore <- coef/se

The reason this statistic is called z is due to its assumption as being normally distributed. For linear regression models, we use the t statistic instead. The z statistic for odds ratio models is identical to that of standard coefficient models. Large values of z typically indicate a predictor that significantly contributes to the model; that is, to the understanding of the response.

2.3.3  p-Values

The p-value of a logistic model is usually misinterpreted. It is also typically given more credence than it should have. First, though, let us look at how it is calculated.

> pvalue <- 2*pnorm(abs(zscore),lower.tail=FALSE)

The p-value is a two-tail test of the z statistic. It tests the null hypothesis that the associated coefficient value is 0. More exactly, p is the probability of obtaining a coefficient value at least as extreme as the observed coefficient given the assumption that β = 0. The smaller the p-value, the more likely β ≠ 0. The standard "level of significance" for most studies is p = 0.05. Values of less than 0.05 indicate that the null hypothesis of no relationship between the predictor and response is false. That is, p-values less than 0.05 indicate that the predictor significantly contributes to the model. Values greater than 0.05 indicate that the null hypothesis has not been rejected and that the predictor does not contribute to the model.

A cutoff of 0.05 means that one out of every 20 times the coefficient on average will not reject the null hypothesis; that is, that the coefficient is in fact not significant when we thought it was. For many scientific disciplines,

We may use the function on medpar data

> data(medpar)   # assumes library(COUNT) or library(LOGIT) loaded
> smlogit <- glm(died ~ white + los + factor(type),
    family = binomial, data = medpar)
> summary(smlogit)
. . .
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.716364   0.218040  -3.285  0.00102 **
white          0.305238   0.208926   1.461  0.14402
los           -0.037226   0.007797  -4.775 1.80e-06 ***
factor(type)2  0.416257   0.144034   2.890  0.00385 **
factor(type)3  0.929994   0.228411   4.072 4.67e-05 ***

> toOR(smlogit)
                  or  delta  zscore  pvalue exp.loci. exp.upci.
(Intercept)   0.4885 0.1065 -3.2855  0.0010    0.3186    0.7490
white         1.3569 0.2835  1.4610  0.1440    0.9010    2.0436
los           0.9635 0.0075 -4.7747  0.0000    0.9488    0.9783
factor(type)2 1.5163 0.2184  2.8900  0.0039    1.1433    2.0109
factor(type)3 2.5345 0.5789  4.0716  0.0000    1.6198    3.9657

Confidence intervals are very important when interpreting a logistic model, as well as any regression model. By looking at the low and high range of a predictor's confidence interval an analyst can determine if the predictor contributes to the model.

Remember that a regression p-value is an assessment of whether we may "significantly" reject the null hypothesis that the coefficient (β) is equal to 0. If the confidence interval of a predictor includes 0, then we cannot be significantly sure that the coefficient is not really 0 in value. For odds ratios, since the confidence intervals are exponentiations of the coefficient confidence intervals, having the range of the confidence interval include 1 is evidence that the null hypothesis has not been rejected. The confidence intervals for the logit2 odds ratio model above both include 1—0.0124123 to 3.9788526 and 0.3120602 to 28.84059. Note that the p-values for both x and the intercept are approximately 0.3. 0.3 far exceeds the 0.05 criterion of significance.

How is the confidence interval to be interpreted? If zero is not within the lower and upper limits of the confidence interval of a coefficient, we cannot conclude that we are 95% sure that the coefficient is "significant"; that is, that the associated p-value is truly under 0.05. Many analysts interpret confidence intervals in such a manner, but they should not.

The traditional logistic regression model we are discussing here is based on a frequency interpretation of statistics. As such the confidence intervals must be interpreted in the same manner. If the coefficient of a logistic model predictor has a p-value under 0.05, the associated confidence interval will not include zero. The interpretation is

Wald Confidence Intervals
If we repeat the modeling analysis a very large number of times, the true coefficient would be within the range of the lower and upper levels of the confidence interval 95 times out of 100.

I earlier mentioned that the use of confint() following R's glm displays profile confidence intervals. confint.default() produces standard confidence intervals, based on the normal distribution. Profile confidence intervals are based on the Chi2 distribution. Profile confidence intervals are particularly important to use when there are relatively few observations in the model, as well as when the data are unbalanced. For example, if a logistic model has 30 observations, but the response variable consists of 26 ones and only 4 zeros, the data are unbalanced. Ideally a logistic response variable should have relatively equal numbers of 1s to 0s. Likewise, if a binary predictor has nearly all 1s or 0s, the model is unbalanced, and adjustments may need to be made to the model.

In any case, profile confidence intervals are derived as the inverse of the likelihood ratio test defined as

Likelihood ratio test = −2{L_reduced − L_full}

This is a test we will use later when assessing the significance of adding, or dropping, a predictor or group of predictors from a model. The log-likelihood of a model with all of the predictors is subtracted from the log-likelihood of a model with fewer predictors. The result is multiplied by "−2." The significance of the test is based on the Chi2 distribution, whose arguments are the likelihood ratio test statistic and degrees of freedom. The degrees of freedom consists of how many predictors there are between the full and reduced models. If a single predictor is being evaluated, there is one degree of freedom. The likelihood ratio test is preferred to the standard Wald assessment based on regression coefficient or odds ratio p-values. We shall discuss the test further in Chapter 4, Section 4.2.

For now you need only know that profile confidence intervals are the inversion of the likelihood ratio test. The statistic is not simple to produce by hand, but easy to display using the confint function. It should be noted that when the predictors are significant and the logistic model is well fit,

reference level, then level 2 is interpreted with reference to level 1. Level 3 is also interpreted with reference to level 1. Level 1 is the default reference level for both R's glm function and Stata's regression commands. SAS uses the highest level as the default reference. Here it would be level 3.

It is advised to use either the lowest or highest level as the reference, in particular whichever of the two has the most observations. But of even more importance, the reference level should be chosen which makes most sense for the data being modeled.

You may let the software define your levels, or you may create them yourself. If there is the likelihood that levels may have to be combined, then it may be wise to create separate indicator variables for the levels. First though, let us let the software create internal indicator variables, which are dropped at the conclusion of the display to screen.

> summary(logit3 <- glm( died ~ factor(type), family = binomial,
    data = medpar))
. . .
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.74924    0.06361 -11.779  < 2e-16 ***
factor(type)2  0.31222    0.14097   2.215  0.02677 *
factor(type)3  0.62407    0.21419   2.914  0.00357 **
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1911.1 on 1492 degrees of freedom
AIC: 1917.1

Note how the factor function excluded factor type1 (elective) from the output. It is the reference level though and is used to interpret both type2 (urgent) and type3 (emergency). I shall exponentiate the coefficients of type2 and type3 in order to better interpret the model. Both will be interpreted as odds ratios, with the denominator of the ratio being the reference level.

> exp(coef(logit3))
  (Intercept) factor(type)2 factor(type)3
    0.4727273     1.3664596     1.8665158

The interpretation is

• Urgent admission patients have a near 37% greater odds of dying in the hospital than do elective admissions.
• Emergency admission patients have a near 87% greater odds of dying in the hospital than do elective admissions.

Analysts many times find that they must change the reference levels of a categorical predictor. This may be done with the following code. We will change from the default reference level 1 to a reference level 3 using the relevel function.

> medpar$type <- factor(medpar$type)
> medpar$type <- relevel(medpar$type, ref=3)
> logit4 <- glm( died~factor(type), family=binomial,
    data=medpar)
> exp(coef(logit4))
  (Intercept) factor(type)1 factor(type)2
    0.8823529     0.5357576     0.7320911

Interpretation changes to read

• Elective patients have about half the odds of dying in the hospital than do emergency patients.
• Urgent patients have about three quarters of the odds of dying in the hospital than do emergency patients.

I mentioned that indicator or dummy variables can be created by hand, and levels merged if necessary. This occurs when, for example, the level 2 coefficient (or odds ratio) is not significant compared to reference level 1. We see this with the model where type = 3 is the reference level. From looking at the models, it appears that levels 2 and 3 may not be statistically different from one another, and may be merged. I caution you from concluding this though since we may want to adjust the standard errors, resulting in changed p-values, for extra correlation in the data, or for some other reason we shall discuss in Chapter 4. However, on the surface it appears that patients who were admitted as urgent are not significantly different from emergency patients with respect to death while hospitalized.

I mentioned before that combining levels is required if two levels do not significantly differ from one another. In fact, when the emergency level of type is the reference, level 2 (urgent) does not appear to be significant, indicating that type levels 2 and 3 might be combined. With R this can be done as

> table(medpar$type)

   1    2    3
1134  265   96

> medpar$type[medpar$type==3] <- 2   # reclassify level 3 as level 2
> table(medpar$type)

acronym for Length of Stay, referring to nights in the hospital. los ranges from 1 to 116. A cubic spline is used to smooth the shape of the distribution of los. This is accomplished by using the s operator.

> summary(medpar$los)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   4.000   8.000   9.854  13.000 116.000

> library(mgcv)
> diedgam <- gam(died ~ s(los), family = binomial, data = medpar)
> summary(diedgam)
. . .
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.69195    0.05733  -12.07   <2e-16 ***

Approximate significance of smooth terms:
         edf Ref.df Chi.sq p-value
s(los) 7.424  8.292  116.8  <2e-16 ***

R-sq.(adj) = 0.0873   Deviance explained = 6.75%
UBRE = 0.21064   Scale est. = 1   n = 1495

> plot(diedgam)

[Figure 2.1 plots the fitted smooth s(los, 7.42) against los (0 to 120).]
FIGURE 2.1  GAM model of los.

Note that no other predictors are in this model. Adding others may well alter the shape of the splines. The edf statistic indicates the "effective degrees of freedom." It is a value that determines the shape of the curves. An edf of 1 indicates a straight line; 8 and higher is a highly curved shape. The graph has an edf of 7.424, which is rather high. See Zuur (2012) for a complete analysis of GAM using R.

If this was all the data I had to work with, based on the change of slope points in Figure 2.1, I would be tempted to factor los into four intervals with three slopes at 10, 52, and 90. Each of the four levels would be part of a categorical predictor with the lowest level as the reference. If the slopes differ considerably across levels, we should use it for modeling the effect of los rather than model the continuous predictor.

2.5.3  Centering

A continuous predictor whose lowest value is not close to 0 should likely be centered. For example, we use the badhealth data from the COUNT package.

> data(badhealth)
> head(badhealth)
  numvisit badh age
1       30    0  58
2       20    0  54
3       16    0  44
4       20    0  57
5       15    0  33
6       15    0  28

badh is a binary variable, and indicates that a patient has "bad health," whatever that may mean. numvisit, or number of visits to the doctor during the year 1984, and age, are continuous variables. Number of visits ranges from 0 to 40, and the age range of patients is from 20 to 60.

> table(badhealth$badh)

   0    1
1015  112

> summary(badhealth$age)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 20.00   28.00   35.00   37.23   46.00   60.00
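Before fitting the model it helps to see what centering and standardizing actually do to a predictor. The lines below are a minimal sketch rather than code from the text; the variable names cage and sage are illustrative (the SAS code at the end of this chapter uses similar names), and the model call simply reuses glm as before. Centering shifts the intercept to the mean age without changing the slope.

> badhealth$cage <- badhealth$age - mean(badhealth$age)     # centered age: 0 = mean age
> badhealth$sage <- badhealth$cage / sd(badhealth$age)      # standardized age: in SD units
> summary(badhealth$cage)                                   # now roughly centered on 0
> coef(glm(badh ~ cage, family = binomial, data = badhealth))  # slope unchanged; intercept at mean age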

> summary(logit7 <- glm(died ~ white, family = binomial,
    data = medpar))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9273     0.1969  -4.710 2.48e-06 ***
white         0.3025     0.2049   1.476     0.14
---
Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1920.6 on 1493 degrees of freedom
AIC: 1924.6

> exp(coef(logit7))
(Intercept)       white
  0.3956044   1.3532548

White patients have a 35% greater odds of death while hospitalized than do nonwhite patients.

LINEAR PREDICTOR
> etab <- predict(logit7)

FITTED VALUE; PROBABILITY THAT DIED==1
> fitb <- logit7$fitted.value

TABULATION OF PROBABILITIES
> table(fitb)
fitb
0.283464566929547 0.348684210526331
              127              1368

1368 white patients have an approximate 0.349 probability of dying within the hospital. Nonwhite patients have some 0.283 probability of dying. Since the predictor is binary, there are only two predicted values.

Let us model died on los, a continuous predictor.

> summary(logit8 <- glm(died ~ los, family = binomial,
    data = medpar))

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.361707   0.088436  -4.090 4.31e-05 ***
los         -0.030483   0.007691  -3.964 7.38e-05 ***

Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1904.6 on 1493 degrees of freedom
AIC: 1908.6

> exp(coef(logit8))
(Intercept)         los
  0.6964864   0.9699768

> etac <- predict(logit8)
> fitc <- logit8$fitted.value
> summary(fitc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.01988 0.31910 0.35310 0.34310 0.38140 0.40320

The predicted values of died given los range from 0.02 to 0.40.

If we wish to determine the probability of death while hospitalized for a patient who has stayed in the hospital for 20 days, multiply the coefficient on los by 20, add the intercept to obtain the linear predictor for los at 20 days. Apply the inverse logit link to obtain the predicted probability.

> xb20 <- -0.361707 - 0.030483*20
> mu20 <- 1/(1 + exp(-xb20))
> mu20
[1] 0.2746081

The probability is 0.275. A patient who stays in the hospital for 20 days has a 27% probability of dying while hospitalized—given a specific disease from this data.

2.6.2  Prediction Confidence Intervals

We next calculate the standard error of the linear predictor. We use the predict function with the type = "link" and se.fit = TRUE options to place the predictions on the scale of the linear predictor, and to guarantee that the lpred object is in fact the standard error of the linear prediction.

> lpred <- predict(logit8, newdata = medpar, type = "link",
    se.fit = TRUE)

Now we calculate the 95% confidence interval of the linear predictor. As mentioned earlier, we assume that both sides of the distribution are used in

*Statistics of deviance residual;
proc means data=residual min q1 median q3 max maxdec=4;
var deviance;
run;

*Generate a table of y by x;
proc freq data=xdta;
tables y*x / norow nocol nocum nopercent;
run;

*Expb option provides the odds ratio;
proc genmod data=xdta descending;
model y=x / dist=binomial link=logit;
estimate "Intercept" Intercept 1 / exp;
estimate "x" x 1 / exp;
run;

/* Section 2.2 */

*Refer to proc genmod in section 2.1 to build the logistic model;

*Create a dataset to make calculations;
data data1;
set xdta;
if x=1 then xb=1.0986-1.5041*1;
else if x=0 then xb=1.0986-1.5041*0;
mu=1/(1+exp(-xb));
o=mu/(1-mu);
or=o;
if or < 1 then or=0.667/3;
coeff=log(or);
format mu 4.2 o or xb coeff 7.4;
run;

*Print the dataset;
proc print data=data1;
var x o;
run;

*Print the whole dataset;
proc print data=data1;
run;

/* Section 2.3 */

*Build the logistic model- covb option provides var-cov matrix;
proc genmod data=xdta descending;
model y=x / dist=binomial link=logit covb;
run;

*Use SAS interactive matrix language;
proc iml;
vcov={1.33333 -1.33333,
-1.33333 2.16667};
se=sqrt(diag(vcov));
print se;
quit;

*Logistic regression with OIM standard error;
proc surveylogistic data=xdta;
model y(event='1')=x;
run;

*Refer to proc genmod in section 2.3 to obtain var-cov matrix;

*Calculations of odds ratio and model statistics;
proc iml;
vcov={1.33333 -1.33333,
-1.33333 2.16667};
coef={1.0986, -1.5041};
or=exp(coef);
se=sqrt(diag(vcov));
ose=se*or;
print or [format = 7.4] ose [format = 7.4];

zscore=coef/se;
delta=ose;
z=zscore[,+];
pvalue=2*(1-probnorm((abs(z))));
print z pvalue;

se1=se[,+];
loci=coef-quantile('normal', 0.975)*se1;
upci=coef+quantile('normal', 0.975)*se1;
expl=exp(loci);
expu=exp(upci);
print or [format=7.4] delta [format=7.4] z [format=7.4]
pvalue [format=7.4] expl [format=7.4] expu [format=7.4];
quit;

*Clparm=both provides both PL and Wald confidence intervals;
proc logistic data=xdta descending;
model y=x / clparm=both;
run;

/* Section 2.4 */

*Refer to the code in section 1.4 to import and print medpar dataset;

*Generate the frequency table of type and output the dataset;
proc freq data=medpar;
tables type / out=freq;
run;

*Build the logistic model with class;
proc genmod data=medpar descending;
class type (ref='1') / param = ref;
model died=type / dist=binomial link=logit;

*Build the logistic model;
proc genmod data=badhealth descending;
model badh=age / dist=binomial link=logit;
run;

*Build the logistic model with centered age;
proc genmod data=badhealth1 descending;
model badh=cage / dist=binomial link=logit;
run;

*Standardize age and output the sage dataset;
proc standard data=badhealth mean=0 std=1 out=sage;
var age;
run;

*Build the logistic model with standardized age;
proc genmod data=sage descending;
model badh=age / dist=binomial link=logit;
run;

/* Section 2.6 */

*Build the logistic model and output model prediction;
proc genmod data=medpar descending;
model died=white / dist=binomial link=logit;
output out=etab pred=fitb;
run;

*Refer to proc freq in section 2.4 to generate the frequency table;

*Build the logistic model and output model prediction;
proc genmod data=medpar descending;
model died=white / dist=binomial link=logit;
output out=etac pred=fitc;
run;

*Refer to proc means in section 2.5 to summarize fitc;

*Create a dataset to make calculations;
data prob;
xb20=-0.3617 - 0.0305*20;
mu20=1/(1+exp(-xb20));
run;

*Print the variable mu20;
proc print data=prob;
var mu20;
run;

*Build the logistic model and output confidence intervals;
proc genmod data=medpar descending;
model died=los / dist=binomial link=logit;
output out=cl pred=mu lower=loci upper=upci;
run;

*Summary for confidence intervals;
proc means data=cl min q1 median mean q3 max maxdec=5;
var loci mu upci;
run;

*Graph scatter plot;
proc sgplot data=cl;
scatter x=los y=mu;
scatter x=los y=loci;
scatter x=los y=upci;
run;

STATA CODE

2.1
. use xdta
. list
. glm y x, fam(bin) nolog
. table y x
. tab y x
. glm y x, fam(bin) eform nolog nohead

2.2
. glm y x, fam(bin) nolog nohead
. di 1.098612 - 1.504077*1
. di 1.098612 - 1.504077*0
. predict xb, xb
. predict mu
. gen o = mu/(1-mu)
. gen or = .6666667/3 if o < 1
. replace or = o if or==.
. gen coef = log(or)
. l y x xb mu o or coef

2.3
. glm y x, fam(bin) nolog nohead
. estat vce
. glm y x, fam(bin) nolog nohead scale(x2)
. glm y x, fam(bin) nolog nohead eform
. di normal(-abs(_b[x]/_se[x]))*2           // p-value for x
. di normal(-abs(_b[_cons]/_se[_cons]))*2   // p-value for intercept
. use medpar, clear
. glm died white los i.type, fam(bin) nolog nohead
. glm died white los i.type, fam(bin) nolog nohead eform

2.4
. use medpar, clear
. list in 1/6
. tab type

the other terms in the model are held constant. When the logistic regression term is exponentiated, interpretation is given in terms of an odds ratio, rather than log-odds. We can see this in Equation 3.2 below, which results by exponentiating each side of Equation 3.1.

\frac{\mu_i}{1 - \mu_i} = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}}    (3.2)

or

\frac{\mu_i}{1 - \mu_i} = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip})    (3.3)

An example will help clarify what is meant when interpreting a logistic regression model. Let's use data from the social sciences regarding the relationship of whether a person identifies themselves as religious. Our main interest will be in assessing how level of education affects religiosity. We'll also adjust by gender (male), age, and whether the person in the study has children (kids). There are 601 subjects in the study, so there is no concern about sample size. The data are in the edrelig data set.

A study subject's level of education is a categorical variable with three fairly equal-sized levels: AA, BA, and MA/PhD. All subjects have achieved at least an associate's degree at a 2-year institution. A tabulation of the educlevel predictor is shown below, together with the top six values of all variables in the data.

> data(edrelig)
> head(edrelig)
  male age kids educlevel religious
1    1  37    0    MA/PhD         0
2    0  27    0        AA         1
3    1  27    0    MA/PhD         0
4    0  32    1        AA         0
5    0  27    1        BA         0
6    1  57    1    MA/PhD         1

> table(edrelig$educlevel)

    AA     BA MA/PhD
   205    204    192

Male and kids are both binary predictors, having values of 0 and 1. 1 indicates (most always) that the name of the predictor is the case. For instance, the binary predictor male is 1 = male and 0 = female. Kids = 1 if the subject has children, and 0 if they have no children. Age is a categorical variable with levels as 5-year age groups. The range is from 17 to 57. I will interpret age, however, as a continuous predictor, with each ascending age as a 5-year period. We model the data as before, but simply add more predictors in the model. The categorical educlevel predictor is factored into its three levels, with the lowest level, AA, as the reference. It is not displayed in model output.

> summary(ed1 <- glm(religious ~ age + male + kids + factor(educlevel),
+     family = binomial, data = edrelig))

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)              -1.43522    0.32996  -4.350 1.36e-05 ***
age                       0.03983    0.01036   3.845 0.000121 ***
male                      0.18997    0.18572   1.023 0.306381
kids                      0.12393    0.21037   0.589 0.555790
factor(educlevel)BA      -0.47231    0.20822  -2.268 0.023313 *
factor(educlevel)MA/PhD  -0.49543    0.22621  -2.190 0.028513 *
---
Null deviance: 822.21 on 600 degrees of freedom
Residual deviance: 792.84 on 595 degrees of freedom
AIC: 804.84

The odds ratios are obtained by:

> or <- exp(coef(ed1))
> round(or,4)
(Intercept)         age        male
     0.2381      1.0406      1.2092
       kids   factor(educlevel)BA   factor(educlevel)MA/PhD
     1.1319                0.6236                    0.6093

Or we can view the entire table of odds ratio estimates and associated statistics using the code developed in the previous chapter.

> coef <- ed1$coef
> se <- sqrt(diag(vcov(ed1)))
> zscore <- coef / se
> or <- exp(coef)
> delta <- or * se
> pvalue <- 2*pnorm(abs(zscore),lower.tail=FALSE)
> loci <- coef - qnorm(.975) * se
> upci <- coef + qnorm(.975) * se
> ortab <- data.frame(or, delta, zscore, pvalue, exp(loci), exp(upci))
> round(ortab, 4)
                             or  delta  zscore pvalue exp.loci. exp.upci.
(Intercept)              0.2381 0.0786 -4.3497 0.0000    0.1247    0.4545
age                      1.0406 0.0108  3.8449 0.0001    1.0197    1.0620
male                     1.2092 0.2246  1.0228 0.3064    0.8403    1.7402
kids                     1.1319 0.2381  0.5891 0.5558    0.7495    1.7096
factor(educlevel)BA      0.6236 0.1298 -2.2683 0.0233    0.4146    0.9378
factor(educlevel)MA/PhD  0.6093 0.1378 -2.1902 0.0285    0.3911    0.9493
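Since this block of code is reused throughout the book, it is convenient to wrap it in a small function. The sketch below is one way of doing so; the name or_table is made up for this illustration and is not the toOR function supplied with the LOGIT package, although it produces a similar display.

> or_table <- function(model) {
+   coef   <- model$coef
+   se     <- sqrt(diag(vcov(model)))
+   zscore <- coef / se
+   or     <- exp(coef)
+   delta  <- or * se
+   pvalue <- 2*pnorm(abs(zscore), lower.tail = FALSE)
+   loci   <- coef - qnorm(.975) * se
+   upci   <- coef + qnorm(.975) * se
+   round(data.frame(or, delta, zscore, pvalue,
+                    exp.loci = exp(loci), exp.upci = exp(upci)), 4)
+ }
> or_table(ed1)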

observation in the model. This means that a y replaces every μ in the log-likelihood function.

$$D = 2\sum_{i=1}^{n}\{L(y_i; y_i) - L(\mu_i; y_i)\} \tag{3.5}$$

The Bernoulli deviance is expressed as:

Logistic Model Deviance Statistic:

$$D = 2\sum_{i=1}^{n}\left\{ y_i \ln\!\left(\frac{y_i}{\mu_i}\right) + (1-y_i)\ln\!\left(\frac{1-y_i}{1-\mu_i}\right) \right\} \tag{3.6}$$

For GLM models, including logistic regression, the deviance statistic is the basis of model convergence. Since the GLM estimating algorithm is iterative, convergence is achieved when the difference between two deviance values reaches a threshold criterion, usually set at 0.000001. When convergence is achieved, the values of the coefficients, of mu, eta, and other statistics are at their optimal values.

The other main use of the deviance statistic is as a goodness-of-fit test. The "residual" deviance is the value of D that can be calculated following model convergence. Each observation will have a calculated value of Di as y*ln(y/μ) + (1 − y)*ln[(1 − y)/(1 − μ)]. Sum the Ds across all observations and multiply by 2—that's the deviance statistic. The value of the deviance statistic for an intercept only model; that is, a model with no predictors, is called the null deviance. The null degrees of freedom is the total observations in the model minus the intercept, or n − 1. The residual degrees of freedom is n minus the number of predictors in the model, including the intercept. For the example model, there are 601 observations and six predictors: age, male, kids, educlevel(BA), educlevel(MA/PhD), and the intercept. The reference level is not counted. The null deviance degrees of freedom (dof) is 600 and the residual dof is 595. A traditional fit statistic we shall discuss in the next chapter is based on the Chi2 distribution of the deviance with a dof of the residual deviance.

The Pearson Chi2 goodness-of-fit statistic is defined as the square of the raw residual divided by the variance statistic. The raw residual is the value of the response, y, minus the mean (μ). The Bernoulli variance function for the logistic regression model is μ(1 − μ). Therefore,

Pearson Chi2 GOF Statistic:

$$\sum_{i=1}^{n}\frac{(y_i - \mu_i)^2}{V(\mu_i)} \tag{3.7}$$

Logistic Model Pearson Chi2 GOF Statistic (based on the Bernoulli distribution):

$$\sum_{i=1}^{n}\frac{(y_i - \mu_i)^2}{\mu_i(1-\mu_i)} \tag{3.8}$$

The degrees of freedom for the Pearson statistic are the same as for the deviance. For count models, the dispersion statistic is defined as the Pearson Chi2 statistic divided by the residual dof. Values greater than 1 indicate possible overdispersion. The same is the case with grouped logistic models—a topic we shall discuss in Chapter 5. The deviance dispersion can also be used for binomial models—again a subject to which we shall later return.

I mentioned earlier that raw residuals are defined as "y − μ." All other residuals are adjustments to this basic residual. The Pearson residual, for example, is defined as:

Pearson Residual:

$$\frac{y - \mu}{\sqrt{\mu(1-\mu)}} \tag{3.9}$$

It is important to know that the sum of the squared Pearson residuals is the Pearson Chi2 statistic:

$$\text{Pearson Chi2 statistic} = \sum_{i=1}^{n}\left(\frac{y_i - \mu_i}{\sqrt{\mu_i(1-\mu_i)}}\right)^2 \tag{3.10}$$

In fact, the way programmers calculate the Pearson Chi2 statistic is by summing the squared Pearson residuals.

> pr <- resid(ed1, type = "pearson")    # calculates Pearson residuals
> pchi2 <- sum(residuals(ed1, type = "pearson")^2)
> disp <- pchi2/ed1$df.residual
> c(pchi2, disp)
[1] 600.179386   1.008705

Unfortunately neither the Pearson Chi2 statistic nor the Pearson dispersion is directly available from R. Strangely though, the Pearson dispersion is used to generate what are called quasibinomial models; that is, logistic models with too much or too little correlation in the data. See Hilbe (2009) and Hilbe and Robinson (2013) for a detailed discussion of this topic.
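The calculation above is easily wrapped in a small helper. A sketch of such a helper (similar in spirit to the P__disp() function used later in the book, though not identical to it) is:

> # Pearson Chi2 and dispersion statistics from a fitted logistic glm
> p_disp <- function(fit) {
+   pr   <- resid(fit, type = "pearson")   # Pearson residuals
+   chi2 <- sum(pr^2)                      # Pearson Chi2 statistic
+   c(Pearson.Chi2 = chi2, Dispersion = chi2/fit$df.residual)
+ }
> p_disp(ed1)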

3.3  INFORMATION CRITERION TESTS

Information criterion tests are single statistics by which analysts may compare models. Models with lower values of the same information criterion are considered better fitted models. A number of information tests have been published, but only a few are frequently used in research reports.

3.3.1  Akaike Information Criterion

The Akaike information criterion (AIC) test, named after Japanese statistician Hirotsugu Akaike (1927–2009), is perhaps the most well-known and well-used information statistic in current research. What may seem surprising to many readers is that there is a plethora of journal articles detailing studies proving how poor the AIC test is in assessing which of two models is the better fitted. Even Akaike himself later developed another criterion which he preferred to the original. However, it is his original 1973 version that is used by most researchers and that is found in most journals to assess comparative model fit. The traditional AIC statistic is found in two versions:

$$\text{AIC} = -2L + 2k \quad\text{or}\quad -2(L - k) \tag{3.14}$$

or

$$\text{AIC} = \frac{-2L + 2k}{n} \quad\text{or}\quad \frac{-2(L - k)}{n} \tag{3.15}$$

where L is the model log-likelihood, k is the number of parameter estimates in the model, and n is the number of observations in the model. For logistic regression, parameter estimates are the same as predictors, including the intercept. Using the medpar data set described earlier, we model died on white, hmo, los, and levels of type:

> data(medpar)
> summary(mymod <- glm(died ~ white + hmo + los + factor(type),
+    family = binomial,
+    data = medpar))

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.720149   0.219073  -3.287  0.00101 **
white          0.303663   0.209120   1.452  0.14647
hmo            0.027204   0.151242   0.180  0.85725
los           -0.037193   0.007799  -4.769 1.85e-06 ***
factor(type)2  0.417873   0.144318   2.896  0.00379 **
factor(type)3  0.933819   0.229412   4.070 4.69e-05 ***

Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1881.2 on 1489 degrees of freedom
AIC: 1893.2

Using R and the values of the log-likelihood and the number of predictors, we may calculate the AIC as:

> -2*( -940.5755 -6)
[1] 1893.151

This is the same value that is displayed in the glm output. It should be noted that of all the information criteria that have been formulated, this version of the AIC is the only one that does not adjust the log-likelihood by n, the number of observations in the model. All others adjust by some variation of the number of predictors and observations. If the AIC is used to compare models where n differs (which normally should not be the case), then the test will be mistaken. Using the version of AIC where the statistic is divided by n is then preferable—and similar to that of other criterion tests. The AIC statistic, captured from the postestimation statistics following the execution of glm, is displayed below, as is AICn. These statistics are also part of the modelfit function described below.

AIC – from postestimation statistics
> mymod$aic
[1] 1893.151

AICn – AIC divided by n
> aicn <- mymod$aic/(mymod$df.null + 1)
> aicn
[1] 1.266322

3.3.2  Finite Sample

Finite sample AIC was designed to compare logistic models. It is rarely used in reports, but is important to know. It is defined as:

$$\text{FAIC} = \frac{-2\{L - k - [k(k+1)/(n - k - 1)]\}}{n} \tag{3.16}$$
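A minimal sketch of these calculations taken directly from the fitted model object (k counts the intercept; FAIC follows Equation 3.16 as given above):

> ll <- as.numeric(logLik(mymod))   # model log-likelihood
> k  <- length(coef(mymod))         # number of parameter estimates
> n  <- mymod$df.null + 1           # number of observations
> aic  <- -2*(ll - k)               # traditional AIC
> aicn <- aic/n                     # AIC divided by n
> faic <- -2*(ll - k - k*(k + 1)/(n - k - 1))/n   # finite sample AIC
> c(AIC = aic, AICn = aicn, FAIC = faic)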

> coefse <- data.frame(coef, se)
> coefse
                      coef          se
(Intercept)   -0.72014852  0.21907288
white          0.30366254  0.20912002
hmo            0.02720413  0.15124225
los           -0.03719338  0.00779851
factor(type)2  0.41787319  0.14431763
factor(type)3  0.93381912  0.22941205

Next we create Pearson dispersion statistics and multiply their square root by se above.

> pr <- resid(mymod, type = "pearson")
> pchi2 <- sum(residuals(mymod, type = "pearson")^2)
> disp <- pchi2/mymod$df.residual
> scse <- se*sqrt(disp)
> newcoefse <- data.frame( coef, se, scse)
> newcoefse
                      coef          se        scse
(Intercept)   -0.72014852  0.21907288 0.221301687
white          0.30366254  0.20912002 0.211247566
hmo            0.02720413  0.15124225 0.152780959
los           -0.03719338  0.00779851 0.007877851
factor(type)2  0.41787319  0.14431763 0.145785892
factor(type)3  0.93381912  0.22941205 0.231746042

We can now check to see if the quasibinomial "family" option produces scaled standard errors.

> summary(qmymod <- glm(died ~ white + hmo + los + factor(type),
+    family = quasibinomial,
+    data = medpar))

. . .
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.720149   0.221302  -3.254  0.00116 **
white          0.303663   0.211248   1.437  0.15079
hmo            0.027204   0.152781   0.178  0.85870
los           -0.037193   0.007878  -4.721 2.56e-06 ***
factor(type)2  0.417873   0.145786   2.866  0.00421 **
factor(type)3  0.933819   0.231746   4.029 5.87e-05 ***

(Dispersion parameter for quasibinomial family taken to be 1.020452)

Null deviance: 1922.9 on 1494 degrees of freedom
Residual deviance: 1881.2 on 1489 degrees of freedom
AIC: NA

The standard errors displayed in the quasibinomial model are identical to the scaled standard errors we created by hand. Remember that there is no true quasibinomial GLM family. Quasibinomial is not a separate PDF. It is simply an operation to provide scaled standard errors on a binomial model such as logistic regression.

When an analyst models a logistic regression with scaled standard errors, the resultant standard errors will be identical to model-based standard errors if there are no distributional problems with the data. In other words, a logistic model is not adversely affected if standard errors are scaled when they do not need it.

A caveat when using R's quasibinomial family: p-values are based on t and not z as they should be. As a result a predictor p-value may be >0.05 and its confidence interval not include 0. Our toOR function used with quasibinomial models provides correct values. To see this occur, model the grouped quasibinomial model: sick <- c(77,19,47,48,16,31); cases <- c(458,147,494,384,127,464); feed <- c(1,2,3,1,2,3); gender <- c(1,1,1,0,0,0).

3.4.2  Robust or Sandwich Variance Estimators

Scaling was the foremost method of adjusting standard errors for many years—until analysts began to use what are called robust or sandwich standard errors. Like scaling, using robust standard errors only affects the model when there are problems with the model-based standard errors. If there are none, then the robust standard errors reduce to the model-based errors. Many statisticians recommend that robust or sandwich standard errors be used as a default.

I shall use the same data to model a logistic regression with sandwich or robust standard errors. The sandwich package must be installed and loaded before being able to create sandwich standard errors.

> library(sandwich)
> rmymod <- glm(died ~ white + hmo + los + factor(type),
+    family = binomial, data = medpar)
> rse <- sqrt(diag(vcovHC(rmymod, type = "HC0")))

The robust standard errors are stored in rse. We'll add those to the table of standard errors we have been expanding.
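Assuming the coef, se, and scse objects created above are still in memory, the expanded comparison table might be assembled as follows (a sketch):

> # model-based, scaled, and robust (sandwich) standard errors side by side
> allse <- data.frame(coef, se, scse, rse)
> round(allse, 4)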

manner based on the levels of another predictor. Suppose that the response > ior <- exp(0.77092 + (-0.04776*1:40))
> ior
term is death and we have predictors white and los. These are variables in [1] 2.0609355 1.9648187 1.8731846 1.7858241 1.7025379 1.6231359 1.5474370
the medpar data. If we believe that the probability of death based on length [8] 1.4752685 1.4064658 1.3408718 1.2783370 1.2187186 1.1618807 1.1076936
[15] 1.0560336 1.0067829 0.9598291 0.9150652 0.8723889 0.8317029 0.7929144
of stay in the hospital varies by racial classification, then we need to incor- [22] 0.7559349 0.7206800 0.6870694 0.6550262 0.6244775 0.5953535 0.5675877
porate an interaction term of white × los into the model. The main effects [29] 0.5411169 0.5158806 0.4918212 0.4688839 0.4470164 0.4261687 0.4062933
only model is: [36] 0.3873448 0.3692800 0.3520578 0.3356387 0.3199854

> summary(y0 <- glm(died~ white + los, family = binomial, Interactions for Binary  × Binary. Binary × Categorical, Binary ×
data  = medpar)) Continuous, Categorical  × Categorical, Categorical × Continuous, and
Continuous × Continuous may be developed, as well as three-level interac-
Coefficients:
tions. See Hilbe (2009) for a thorough analysis of interactions. For now, keep
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.598683 0.213268 -2.807 0.005 ** in mind that when incorporating an interaction term into your model, be sure
white 0.252681 0.206552 1.223 0.221 to include the terms making up the interaction in the model, but don’t worry
los -0.029987 0.007704 -3.893 9.92e-05 *** about their interpretation or significance. Interpret the interaction based on
levels of particular values of the terms. When LOS is 14, we may interpret the
Note that los is significant, but white is not. Let’s create an interaction of odds ratio of the interaction term as:
white and los called wxl. We insert it into the model, making sure to include
the main effects terms as well. White patients who were in the hospital for 14 days have a some 10% greater
odds of death than do non-white patients who were in the hospital for 14
> wxl <- medpar$white * medpar$los days.

> summary(y1 <- glm(died~ white + los + wxl,


family = binomial, data = medpar))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.04834 0.28024 -3.741 0.000183 *** SAS CODE
white 0.77092 0.29560 2.608 0.009107 **
los 0.01002 0.01619 0.619 0.535986
wxl -0.04776 0.01829 -2.611 0.009035 ** /* Section 3.1 */

The interaction term is significant. It makes no difference if the main *Refer to the code in section 1.4 to import and print edrelig dataset;
effects terms are significant or not. Only the interaction term is interpreted *Refer to proc freq in section 2.4 to generate the frequency table;
*Build logistic model and obtain odds ratio & covariance matrix;
for this model. We calculate the odds ratios of the interaction of white and los proc genmod data = edrelig descending;
from 1 to 40 as: class educlevel (ref = ‘AA’) / param = ref;
model religious = age male kids educlevel/dist = binomial
link = logit covb;
OR interaction = exp(β white + β wxl * los[1 : 40]) (3.19) estimate “Intercept” Intercept 1 / exp;
estimate “Age” age 1 / exp;
estimate “Male” male 1 / exp;
That is, we add the slope of the binary predictor to the product of the slope estimate “Kid” kids 1 / exp;
of the interaction and the value(s) of the continuous predictor, exponentiating estimate “BA” educlevel 1 0 / exp;
estimate “MA/PhD” educlevel 0 1 / exp;
the whole. run;

# Odds ratios of death for a white patient for length of stay 1–40 days. *Refer to proc iml in section 2.3 and the full code is provided
# Note that odds of death decreases with length of stay. online;

output;
end;
run;

*Refer to proc print in section 2.2 to print dataset ior;

STATA CODE

3.1
. use edrelig, clear
. glm religious age male kids i.educlevel, fam(bin) nolog nohead eform
. glm religious age male kids i.educlevel, fam(bin) nolog eform

3.2
. di e(deviance)    // deviance
. di e(deviance_p)  // Pearson Chi2
. di e(dispers_p)   // Pearson dispersion
. di e(ll)          // log-likelihood
. gen loglike = e(ll)
. scalar loglik = e(ll)
. di loglik
. predict h, hat
. sum h     // hat matrix diagonal
. predict stpr, pear stand
. sum stpr  // stand. Pearson residual
. predict stdr, dev stand
. sum stdr  // stand. deviance residual

3.3
. use medpar, clear
. qui glm died white hmo los i.type, fam(bin)
. estat ic
. abic

3.4
. glm died white hmo los i.type, fam(bin) scale(x2) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(robust) nolog nohead
. glm died white hmo los i.type, fam(bin) vce(boot) nolog nohead

3.5
. glm died white los, fam(bin) nolog nohead
. gen wxl = white*los
. glm died white los wxl, fam(bin) nolog nohead
. glm died white los wxl, fam(bin) nolog nohead eform

4  Testing and Fitting a Logistic Model

4.1  CHECKING LOGISTIC MODEL FIT

4.1.1  Pearson Chi2 Goodness-of-Fit Test

I earlier mentioned that the Pearson Chi2 statistic, when divided by the residual degrees of freedom, provides a check on the correlation in the data. The idea is to observe if the result is much above the value of 1.0. That is, a well-fitted model should have the values of the Pearson Chi2 statistic and residual degrees of freedom closely the same. The closer in value, the better the fit.

$$\frac{\text{Pearson Chi2}}{\text{Residual dof}} \approx 1.0$$

This test, as we shall later discuss, is extremely useful for evaluating extra dispersion in grouped logistic models, but for the observation-based models we are now discussing it is not. A large discrepancy from the value of 1, though, does indicate general extra dispersion or extra correlation in the data, for which use of sandwich or scaled standard errors is an appropriate remedy.

A traditional Pearson Chi2 goodness-of-fit (GOF) test, however, is commonly used to assess model fit. It does this by leaving the value of the Pearson Chi2 statistic alone, considering it instead to be Chi2 distributed with

TABLE 4.1  Residuals for Bernoulli logistic regression

Raw                r = y − μ
Pearson            r^p = (y − μ)/√(μ(1 − μ))
Deviance           r^d = √(2 ln(1/μ)) if y = 1;  √(2 ln(1/(1 − μ))) if y = 0
Stand. Pearson     r^sp = r^p/√(1 − h)
Stand. deviance    r^sd = r^d/√(1 − h)
Likelihood         r^l = sgn(y − μ) √(h(r^p)² + (1 − h)(r^d)²)
Anscombe           r^A = {A(y) − A(μ)}/{μ(1 − μ)}^(1/6)
                   where A(z) = Beta(2/3, 2/3) × IncompleteBeta(z, 2/3, 2/3), z = (y; μ), and
                   Beta(2/3, 2/3) = 2.05339. When z = 1, the function reduces to the Beta (see Hilbe, 2009).
Cook's distance    r^CD = h(r^p)²/{C(1 − h)²}, C = number of coefficients
Delta Pearson      ΔChi2 = (r^sp)²
Delta deviance     ΔDev = (r^d)² + h(r^sp)²
Delta beta         Δβ = h(r^p)²/(1 − h)²

y <- medpar$died ; mu <- mymod$fitted.value
a <- .666667 ; b <- .666667
ibeta <- function(x,a,b){ pbeta(x,a,b)*beta(a,b) }
A <- ibeta(y,a,b) ; B <- ibeta(mu,a,b)
ans <- (A-B)/ (mu*(1-mu))^(1/6)

> summary(ans)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.5450 -1.0090 -0.9020 -0.0872  1.5310  3.2270

TABLE 4.2  Residual code

mu <- mymod$fitted.value             # predicted probability; fit
r <- medpar$died - mu                # raw residual
dr <- resid(mymod, type="deviance")  # deviance resid
pr <- resid(mymod, type="pearson")   # Pearson resid
hat <- hatvalues(mymod)              # hat matrix diagonal
stdr <- dr/sqrt(1-hat)               # standardized deviance
stpr <- pr/sqrt(1-hat)               # standardized Pearson
deltadev <- dr^2 + hat*stpr^2        # Δ deviance
deltaChi2 <- stpr^2                  # Δ Pearson
deltaBeta <- (pr^2*hat/(1-hat)^2)    # Δ beta
ncoef <- length(coef(mymod))         # number coefficients
# Cook's distance
cookD <- (pr^2 * hat) / ((1-hat)^2 * ncoef * summary(mymod)$dispersion)

Residual analysis for logistic models is usually based on what are known as n-asymptotics. However, some statisticians suggest that residuals should be based on m-asymptotically formatted data. Data in observation-based form, that is, one observation or case per line, are in n-asymptotic format. The datasets we have been using thus far for examples are in n-asymptotic form. m-asymptotic data occur when observations with the same values for all predictors are considered as a single observation, but with an extra variable in the data signifying the number of n-asymptotic observations having the same covariate pattern. For example, let us consider the 1495 observation medpar data for which we have kept only white, hmo, and type as predictors for died. There are 1495 observations for the data in this format. In m-asymptotic format the data appears as:

   white hmo type   m
1      0   0    1  72
2      0   0    2  33
3      0   0    3  10
4      0   1    1   8
5      0   1    2   4
6      1   0    1 857
7      1   0    2 201
8      1   0    3  83
9      1   1    1 197
10     1   1    2  27
11     1   1    3   3

There are several ways to reduce the three variable subset of the medpar data to m-asymptotic form. I will show a way that maintains the died response variable, which is renamed dead due to it not being a binary variable, and then show how to duplicate the above table.

> data(medpar)

FIGURE 4.1  Squared standardized deviance versus mu.

Analysts commonly use the plot of the square of the standardized deviance residuals versus mu to check for outliers in a fitted logistic model. Values in the plot greater than 4 are considered outliers. The values on the vertical axis are in terms of standard deviations of the residual. The horizontal axis shows predicted probabilities. All figures here are based on the medpar data.

Another good way of identifying outliers based on a residual graph is by use of Anscombe residuals versus mu, or the predicted probability that the response is equal to 1. Anscombe residuals adjust the residuals so that they are as normally distributed as possible. This is important when using 2, or 4 when the residual is squared, as a criterion for specifying an observation as an outlier. It is the 95% criterion so commonly used by statisticians for determining statistical significance. Figure 4.2 is not much different from Figure 4.1 when squared standardized deviance residuals are used in the graph. The Anscombe plot is preferred.

> plot(mu, ans^2)
> abline(h = 4, lty = "dotted")

FIGURE 4.2  Anscombe versus mu plot. Values >4 are outliers.

A leverage or influence plot (Figure 4.3) may be constructed as:

> plot(stpr, hat)
> abline(v=0, col="red")

FIGURE 4.3  Leverage plot.

Large hat values indicate covariate patterns that differ from average covariate patterns. Values on the horizontal extremes are high residuals. Values that are high on the hat scale, and low on the residual scale, that is, high in the middle and close to the zero-line, do not fit the model well. They are also difficult to detect as influential when using other graphics. There are some seven observations that fit this characterization. They can be identified by selecting hat values greater than 0.4 and squared residual values of |2|.

A wide variety of graphics may be constructed from the residuals given in Table 4.2. See Hilbe (2009), Bilger and Loughin (2015), Smithson and Merkle (2014), and Collett (2003) for examples.

4.1.4  Conditional Effects Plot

A nice feature of logistic regression is its ability to allow an analyst to plot the predicted probability of an outcome on a continuous predictor, factored across

determines the optimal probability value with which to separate predicted ver- for having a disease given that the patient in fact has the disease. Specificity
sus observed successes (1) or failures (0). refers to when a patient tests negative for a disease when they in fact do not
For an example we shall continue with the model used for residual analy- have the disease. Terms such as false positive refers to when a patient tests
sis earlier in the chapter. It is based on the medpar data, with died as the positive for a disease even though they do not have it. False negatives happen
response variable and white, hmo, los and levels of type as the predictors. We when a patient tests negative for a disease, even though they actually have it.
then obtain predicted probabilities that died = =1, which is the definition of These are all important statistics in classification analysis, but model sensi-
mu. The goal is then to determine how well the predicted probabilities actu- tivity and specificity are generally regarded as the most important results.
ally predict classification as died = =1, and how well they predict died = =0. However, false positive and false negative are used with the main statistics
Analysts are not only interested in correct prediction though, but also in such for creating the ROC curve. Each of these statistics can easily be calculated
issues as what percentage of times does the predictor incorrectly classify the from a confusion matrix. All three of these classification tools intimately
outcome. I advise the reader to remember though that logistic models that clas- relate with one another.
sify well are not always well-fitted models. If your interest is strictly to produce The key point is that determining the correct cut point provides the
the best classification scheme, do not be as much concerned about model fit. In grounds for correctly predicting the above statistics, given an estimated model.
keeping with this same logic, a well-fitted logistic model may not clearly dif- The cut point is usually close to the mean of the predicted values, but is not
ferentiate the two levels of the response. It’s valuable if a model accomplishes usually the same value as the mean. Another way of determining the proper
both fit and classification power, but it need not be the case. cut point is to choose a point at which the specificity and sensitivity are clos-
Now to our example model: est in values. As you will see though, formulae have been designed to find the
optimal cut point, which is usually at or near the site where the sensitivity and
> mymod <- glm(died ~ white + hmo + los + factor(type), specificity are the closest.
family=binomial, The Sensitivity-Specificity (S-S) plot and ROC plot and tests are com-
data=medpar)
ponents of the ROC_test function. The classification or confusion matrix is
> mu <- predict(mymod, type=”response”)
displayed using the confusion_stat function. Both of these functions are
> mean(medpar$died) part of the LOGIT package on CRAN. When LOGIT has been loaded into
[1] 0.3431438 memory the functions are automatically available to the analyst.

Analysts traditionally use the mean of the predicted value as the cut point. > library(LOGIT)
Values greater than 0.3431438 should predict that died = =1; values lower > data(medpar)
should predict died = =0. For confusion matrices, the mean of the response, > mymod <- glm(died ~ los + white + hmo + factor(type),
or mean of the prediction, will be a better cut point than the default 0.5 value family=binomial, data=medpar)
set with most software. If the response variable being modeled has substan-
We shall start with the S–S plot, which is typically used to establish the
tially more or less 1’s than 0’s, a 0.5 cut point will produce terrible results. I
cut point used in ROC and confusion matrix tests. The cut point used in ROC_
shall provide a better criterion for the cut point shortly, but the mean is a good
test is based on Youden’s J statistic (Youden, 1950). The optimal cut point is
default criterion.
defined as the threshold that maximizes the distance to the identity (diagnonal)
Analysts can use the percentage at which levels of died relate to mu being
line of the ROC curve. The optimality criterion is based on:
greater or less than 0.3431438 to calculate such statistics as specificity and
sensitivity. These are terms that originate in epidemiology, although tests like max(sensitivities + specificities)
the ROC statistic and curve were first derived in signal theory. Using our
example, we have patients who died (D) and those who did not (~D). The Other criteria have been suggested in the literature. Perhaps the most
probability of being predicted to die given that the patient has died is called noted alternative is:
model sensitivity. The probability of being predicted to stay alive, given the
min((1 - sensitivities)^2 + (1- specificities)^2)
fact that the patient remained alive is referred to as model specificity. In epi-
demiology, the term sensitivity refers to the probability of testing positive Both criteria give remarkably close cut points.

Note that a cutoff value of 0.3615 is used for the AUC statistic. Given that died indicates that a patient died while hospitalized, the AUC statistic can be interpreted as follows: The estimated probability is 0.61 that patients who die have a higher probability of death (higher mu) than patients who are alive. This value is very low. A ROC statistic of 0.5 indicates that the model has no predictive power at all. For our model there is some predictive power, but not a lot.

4.2.3  Confusion Matrix

The traditional logistic regression classification table is given by the so-called confusion matrix of correctly and incorrectly predicted fitted values. The matrix may be obtained following the use of the previous options of ROC_test by typing

> confusion_stat(out1$Predicted,out1$Observed)

A confusion matrix of values is immediately displayed on screen, together with values for correctly predicted (accuracy), sensitivity, and specificity. The cut point from the S–S plot is used as the confusion matrix cut point.

$matrix
      obs
pred      0    1  Sum
  0     794  293 1087
  1     188  220  408
  Sum   982  513 1495

$statistics
   Accuracy Sensitivity Specificity
  0.6782609   0.4288499   0.8085540

Other statistics that can be drawn from the confusion matrix and that can be of value in classification analysis are listed below. Recall from earlier discussion that D = patient died while in hospital (outcome = 1) and ~D = patient did not die in hospital (outcome = 0).

Positive predictive value: 220/408 = 0.5392157 = 53.92%
Negative predictive value: 794/1087 = 0.7304508 = 73.05%
False-positive rate for true ~D: 188/982 = 0.1914460 = 19.14%
False-positive rate for true D: 293/513 = 0.5711501 = 57.12%
False-positive rate for classified positives: 188/408 = 0.4607843 = 46.08%
False-negative rate for classified negatives: 293/1087 = 0.2695482 = 26.95%

An alternative way in which confusion matrices have been constructed is based on the closeness in sensitivity and specificity values. That is, either the analyst or an algorithm determines when the sensitivity and specificity values are closest, and then constructs a matrix based on the implications of those values. An example of this method can be made from the PresenceAbsence package and function. The cut point, or threshold, is 0.351, which is not much different from the cut point of 0.3638 we used in the ROC_test function. The difference in matrix values and the associated sensitivity and specificity values are rather marked though. I added the marginals to provide an easier understanding of various ancillary statistics which may be generated from the confusion matrix.

> library(PresenceAbsence)
> mymod <- glm(died ~ white + hmo + los + factor(type),
+    family=binomial, data=medpar)
> mu <- predict(mymod, type="response")
> cmxdf <- data.frame(id=1:nrow(medpar), died=as.vector(medpar$died),
+    pred=as.vector(mu))
> cmx(cmxdf, threshold=0.351, which.model=1)   # a function in PresenceAbsence

           observed
predicted     1    0  Total
    1       292  378    670
    0       221  604    825
    Total   513  982   1495

The correctly predicted value, or accuracy, is (292 + 604)/1495 or 59.93%. Sensitivity is 292/(292 + 221) or 56.72% and specificity is 604/(378 + 604) or 61.51%. Note that the sensitivity (56.72%) and specificity (61.51%) are fairly close in values—they are as close as we can obtain. If we use the same algorithm with the cut point of 0.363812 calculated by the S–S plot using the criterion described at the beginning of this section, the values are

> cmx(cmxdf, threshold=0.363812, which.model=1)

           observed
predicted     1    0  Total
    1       252  233    485
    0       261  749   1010
    Total   513  982   1495

If we divide up the response probability space into 12 divisions the results appear as:

> HLChi12 <- HLTest(obj = mymod, g = 12)
> HLChi12

Hosmer and Lemeshow goodness-of-fit test with 12 bins

data: mymod
X2 = 84.8001, df = 10, p-value = 5.718e-14

> cbind(HL$observed, round(HL$expect, digits = 1))
                Y0 Y1 Y0hat Y1hat
[0.0219,0.246]  87 38  99.6  25.4
(0.246,0.278]   94 31  92.0  33.0
(0.278,0.297]   97 37  95.1  38.9
(0.297,0.313]  103 38  97.5  43.5
(0.313,0.329]  101 35  91.8  44.2
(0.329,0.343]   77 26  68.2  34.8
(0.343,0.354]  113 38  98.1  52.9
(0.354,0.362]   75 15  57.5  32.5
(0.362,0.38]    79 55  84.0  50.0
(0.38,0.391]    35 80  70.4  44.6
(0.391,0.454]   62 58  69.0  51.0
(0.454,0.618]   59 62  58.7  62.3

The Chi2 test again indicates that the model is ill fitted.

In order to show how different code can result in different results, I used code for the H–L test in Hilbe (2009). Rather than groups defined and displayed by range, they are calculated as ranges, but the mean of the groups is displayed in the output. The number of observations in each group is also given.

This code will develop three H–L tables, with 8, 10, and 12 groups. The 12 group table is displayed below.

> medpar2 <- na.omit(medpar)   # drop obs with missing value(s)
> hlGOF.test(medpar2$died, predict(mymod, medpar2,
+    type='response'), breaks=12)

For # Cuts = 12  # Data = 1495

Cut  # Total  #Patterns  # Resp.  # Pred.  Mean Resp.  Mean Pred.
  1      125         61       38    25.39     0.30400     0.20311
  2      124         24       31    32.72     0.25000     0.26384
  3      125         14       35    36.16     0.28000     0.28929
  4      124         15       34    38.05     0.27419     0.30689
  5      125         11       31    40.10     0.24800     0.32079
  6      125          9       33    41.76     0.26400     0.33409
  7      124          7       29    43.16     0.23387     0.34806
  8      125          5       26    44.80     0.20800     0.35843
  9      124         11       44    46.08     0.35484     0.37160
 10      125         10       89    48.33     0.71200     0.38660
 11      124         20       59    52.32     0.47581     0.42191
 12      125         32       64    64.14     0.51200     0.51310

Total # Data: 1495   Total over cuts: 1495
Chisq: 91.32444   d.f.: 10   p-value: 0.00000

The p-value again tells us that the model is not well fitted. The statistics are similar, but not identical to the table shown earlier. The H–L test is a nice summary test to use on a logistic model, but interpret it with care.

4.4  MODELS WITH UNBALANCED DATA AND PERFECT PREDICTION

When the data set you wish to model has few observations and few predictors, and the predictors are categorical in nature, it is possible that perfect prediction exists between the predictors and response. That is, for a given covariate pattern only one outcome occurs. Maximum likelihood estimation does not work well in such circumstances. One or more of the coefficients become very large, and standard errors may explode to huge sizes as well. Coefficient values may also be displayed with no value given. When this occurs it is nearly always the case that perfect prediction exists in the data.

Consider a real data set consisting of HIV drug data. The response is given as the number of patients in a study who became infected with HIV. There are two predictors, cd4 and cd8, each with three levels: 0, 1, and 2. The data are weighted by the number of cases having the same pattern of covariates; that is, with the values of cd4 and cd8 the same.

The data, called hiv, is downloaded into R's memory from its original format as a Stata data set.

> library(Hmisc)
> data(hivlgold)
> hiv
  infec cases cd4 cd8
1     0     3   0   0
2     0     8   1   1

prediction in the model. When that occurs penalized logistic regression should be used—as we discussed in the previous section.

For an example of exact logistic regression, I shall use Arizona hospital data collected in 1991. The data consist of a random sample of heart procedures referred to as CABG and PTCA. CABG is an acronym meaning coronary artery bypass grafting surgery and PTCA refers to percutaneous transluminal coronary angioplasty. PTCA is a nonsurgical method of placing a type of balloon into a coronary artery in order to clear blockage caused by cholesterol. It is a substantially less severe procedure than CABG. We will model the probability of death within 48 h of the procedure on 34 patients who sustained either a CABG or PTCA. The variable procedure is 1 for CABG and 0 for PTCA. It is adjusted in the model by the type of admission. Type = 1 is an emergency or urgent admit, and 0 is an elective admission. Other variables in the data are not used. Patients are all older than 65.

> data(azcabgptca34)
> head(azheart)
     died procedure age gender los     type
1    Died      CABG  65   Male  10 Elective
2 Survive      CABG  69   Male   7 Emer/Urg
3 Survive      PTCA  76 Female   7 Emer/Urg
4 Survive      CABG  65   Male   8 Elective
5 Survive      PTCA  69   Male   1 Elective
6 Survive      CABG  67   Male   7 Emer/Urg

A cross-tabulation of died on procedure is given as:

> library(Hmisc)
> table(azheart$died, azheart$procedure)

          PTCA CABG
  Survive   19    9
  Died       1    5

It is clear from the tabulation that more patients died having a CABG than with a PTCA. A table of died on type of admission is displayed as:

> table(azheart$died, azheart$type)

          Elective Emer/Urg
  Survive       17       11
  Died           4        2

First we shall use a logistic regression to model died on procedure and type. The model results are displayed in terms of odds ratios and associated statistics.

> exlr <- glm(died ~ procedure + type, family=binomial,
+    data=azheart)
> toOR(exlr)
                   or   delta  zscore pvalue exp.loci. exp.upci.
(Intercept)    0.0389  0.0482 -2.6170 0.0089    0.0034    0.4424
procedureCABG 12.6548 15.7958  2.0334 0.0420    1.0959  146.1267
typeEmer/Urg   1.7186  1.9296  0.4823 0.6296    0.1903   15.5201

Note that there appears to be a statistically significant relationship between the probability of death and type of procedure (p = 0.0420). Type of admission does not contribute to the model. Given the size of the data and adjusting for the possibility of correlation in the data we next model the same data as a "quasibinomial" model. Earlier in the book I indicated that the quasibinomial option is nothing more than scaling (multiplying) the logistic model standard errors by the square root of the Pearson dispersion statistic.

> exlr1 <- glm(died ~ procedure + type, family=quasibinomial,
+    data=azheart)
> toOR(exlr1)
                   or   delta  zscore pvalue exp.loci. exp.upci.
(Intercept)    0.0389  0.0478 -2.6420 0.0082    0.0035    0.4324
procedureCABG 12.6548 15.6466  2.0528 0.0401    1.1216  142.7874
typeEmer/Urg   1.7186  1.9114  0.4869 0.6264    0.1943   15.2007

The p-value of procedure is further reduced by scaling. The same is the case when the standard errors are adjusted by sandwich or robust variance estimators (not shown). We might accept procedure as a significant predictor of the probability of death—if it were not for the small sample size. If we took another sample from the population of procedures would we have similar results? It is wise to set aside a validation sample to test our primary model. But suppose that we do not have access to additional data? We subject the data to modeling with an exact logistic regression. The Stata code and output are given below.

. exlogistic died procedure type, nolog

Exact logistic regression           Number of obs =        34
                                    Model score   =  5.355253
                                    Pr >= score   =    0.0864

       died  Odds Ratio  Suff.  2*Pr(Suff.)  [95% Conf. Interval]
  procedure    10.33644      5       0.0679   .8880104   589.8112
       type    1.656699      2       1.0000   .1005901   28.38678

The results show that procedure is not a significant predictor of died at the p = 0.05 criterion. This should not be surprising. Note that the odds ratio

I tend to structure grouped data in this manner. But as long as an analyst is consistent, there is no difference in the methods. What is important to remember is that if there are only two binary variables in a table, y and x, and if y is the response variable to be modeled, then it is placed as the left-most column with 2^p levels. p is the number of binary variables, in this case 2^2 = 4.

The data in grouped format are modeled as a frequency weighted regression. Since y is binary, it will be modeled as a logistic regression, although it also may be modeled as a probit, complementary loglog, or loglog regression. The key is to enter the counts as a frequency weight.

> y <- c(1,1,0,0)
> x <- c(1,0,1,0)
> count <- c(8,6,5,4)
> mydata <- data.frame(y,x,count)

> mymodel <- glm(y ~ x, weights=count, family=binomial,
+    data=mydata)
> summary(mymodel)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.40547    0.64550   0.628     0.53
x            0.06454    0.86120   0.075     0.94

Null deviance: 30.789 on 3 degrees of freedom
Residual deviance: 30.783 on 2 degrees of freedom
AIC: 34.783

The logistic coefficients are x = 0.06454 and the intercept is 0.40547. Exponentiation gives the following values:

> exp(coef(mymodel))
(Intercept)           x
   1.500000    1.066667

To check the above calculations the odds ratio may be calculated directly from the original table data as well. Recall that the odds ratio of predictor x is the ratio of the odds of y = 1 divided by the odds of y = 0. The odds of y = 1 is the ratio of x = 1 to x = 0 when y = 1, and the odds of y = 0 is the ratio of x = 1 to x = 0 when y = 0.

           x
           0   1
  y   0    4   5
      1    6   8

> (8/5)/(6/4)
[1] 1.066667

which is the value calculated as x above. Recalling our discussion earlier in the text, the intercept odds is the denominator of the ratio we just calculated to determine the odds ratio of x.

> 6/4
[1] 1.5

which confirms the calculation from R.

When tables are more complex the same logic used in creating the 2 × 2 table remains. For instance, consider a table of summary data that relates the pass–failure rate among males and females in an introductory statistics course at Noclue University. The goal is to determine if studying for the final, or going to a party or just sleeping instead, had a bearing on passing. There are 18 males and 18 females, for a class of 36.

                         Gender
                Female                  Male
          sleep  party  study    sleep  party  study
Grade
   fail       3      4      2        2      4      3
   pass       2      1      6        3      2      4

The data have a binary response, Grade, with levels of Fail and Pass; Gender has two levels (Female and Male) and student Type has three levels (sleep, party, and study). I suggest that the response of interest, Pass, be given the value of 1, with Fail assigned 0. For Gender, Female = 0 and Male = 1. Type: Sleep = 1, Party = 2, and Study = 3. Multiply the levels for the total number of levels or groups in the data: 2 * 2 * 3 = 12. The response variable then will have six 0s and six 1s. When a table has predictors with more than two levels, I recommend using the 0,1 format for setting up the data for analysis.

A binary variable will split its values between the next higher level. Therefore, Gender will have alternating 0s and 1s for each half of Grade. Since Type has three levels, 1–2–3 is assigned for each level of Gender. Finally, assign the appropriate count value to each pattern of variables. The first level represents Grade = Fail; Gender = Female; Type = Sleep. We move from the upper left of the top row across the columns of the row, then move to the next row.
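A sketch of the layout just described, with the counts entered from the table row by row (an illustration, not the data construction used in the text; the object names here are arbitrary):

> grade  <- c(0,0,0,0,0,0, 1,1,1,1,1,1)    # Fail = 0, Pass = 1
> gender <- c(0,0,0,1,1,1, 0,0,0,1,1,1)    # Female = 0, Male = 1
> type   <- c(1,2,3,1,2,3, 1,2,3,1,2,3)    # Sleep = 1, Party = 2, Study = 3
> count  <- c(3,4,2,2,4,3, 2,1,6,3,2,4)    # cell counts from the table
> pgdata <- data.frame(grade, gender, type, count)
> summary(glm(grade ~ gender + factor(type), weights = count,
+    family = binomial, data = pgdata))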

run;

*Square the standardized deviance residual;
data stats1;
set stats;
stresdev2=stresdev**2;
run;

*Plot the square of standardized deviance residuals and mu;
proc gplot data=stats1;
symbol v=circle color=black;
plot stresdev2*pred / vref=4 cvref=red;
run;

*Plot the leverage and std Pearson residual;
proc gplot data=stats1;
symbol v=circle color=black;
plot leverage*streschi / href=0 chref=red;
run;

*Sort the dataset;
proc sort data=medpar out=medpar1;
by white hmo los type;
run;

*Calculate the sum of the dead;
proc means data=medpar1 sum;
by white hmo los type;
var died;
output out=summary sum=dead;
run;

*Create a new variable alive;
data summary1;
set summary;
alive=_freq_-dead;
drop _type_ _freq_;
run;

*Refer to proc print in section 2.2 to print dataset summary1;

*Build the logistic model with numeric variables;
proc genmod data=medpar descending;
model died=los type/dist=binomial link=logit;
run;

*Output the los;
proc freq data=medpar;
tables los/out=los;
run;

*Prepare for the conditional effects plot;
data effect;
set los;
k1=-0.8714+(-0.0376)*los+0.4386*1;
r1=1/(1+exp(-k1));
k2=-0.8714+(-0.0376)*los+0.4386*2;
r2=1/(1+exp(-k2));
k3=-0.8714+(-0.0376)*los+0.4386*3;
r3=1/(1+exp(-k3));
run;

*Graph the conditional effects plot;
proc sgplot data=effect;
scatter x=los y=r1;
scatter x=los y=r2;
scatter x=los y=r3;
xaxis label='LOS';
yaxis label='Type of Admission' grid values=(0 to 0.4 by 0.1);
title 'P[Death] within 48 hr admission';
run;

/* Section 4.2 */

*Build the logistic model and output model prediction;
proc genmod data=medpar descending;
class type (ref='1') / param=ref;
model died=white hmo los type / dist=binomial link=logit;
output out=fit pred=mu;
run;

*Refer to proc means in section 2.5 to calculate the mean;

*Build the logistic model and output classification table & ROC curve;
proc logistic data=medpar descending plots(only)=ROC;
class type (ref='1') / param=ref;
model died=white hmo los type / outroc=ROCdata ctable pprob=(0 to 1 by 0.0025);
ods output classification=ctable;
run;

*Sensitivity and specificity plot;
symbol1 interpol=join color=vibg height=0.1 width=2;
symbol2 interpol=join color=depk height=0.1 width=2;
axis1 label=("Probability") order=(0 to 1 by 0.25);
axis2 label=(angle=90 "Sensitivity Specificity %") order=(0 to 100 by 25);
proc gplot data=ctable;
plot sensitivity*problevel specificity*problevel /
overlay haxis=axis1 vaxis=axis2 legend;
run;

*Approximate cutoff point can be found when sensitivity and specificity
are closest/equal in the classification table;

/* Section 4.3 */

*Lackfit option provides the Hosmer-Lemeshow GOF test;
proc logistic data=medpar descending;
class type (ref='1') / param=ref;
model died=white hmo los type / lackfit;
run;

/* Section 4.4 */

. mean died
. logit died white hmo los i.type, nolog
. lsens, genprob(cut) gensens(sen) genspec(spec)
. lroc
. estat classification, cut(.351)

4.3
. estat gof, table group(10)
. estat gof, table group(12)

4.4
. use hiv1gold
. list
. glm infec i.cd4 i.cd8 [fw=cases], fam(bin)
. firthlogit infec i.cd4 i.cd8 [fw=cases], nolog

4.5
. use azcabgptca34
. list in 1/6
. table died procedure
. table died type
. glm died procedure type, fam(bin) nolog
. glm died procedure type, fam(bin) scale(x2) nolog
. exlogistic died procedure type, nolog

4.6
. use pgmydata
. glm y x [fw=count], fam(bin) nolog
. glm y x [fw=count], fam(bin) nolog eform
. use phmydata2
. glm grade gender i.type [fw=count], fam(bin) nolog nohead
. glm grade gender i.type [fw=count], fam(bin) nolog nohead eform

5  Grouped Logistic Regression

5.1  THE BINOMIAL PROBABILITY DISTRIBUTION FUNCTION

Grouped logistic regression is based on the binomial probability distribution. Recall that standard logistic regression is based on the Bernoulli distribution, which is a subset of the binomial. As such, the standard logistic model is a subset of the grouped. The key concept involved is the binomial probability distribution function (PDF), which is defined as:

$$f(y; p, n) = \binom{n}{y} p^{y} (1-p)^{n-y} \tag{5.1}$$

The product sign is assumed to be in front of the right-hand side terms. In exponential family form, the above expression becomes:

$$f(y; p, n) = \exp\left\{ y \ln\!\left(\frac{p}{1-p}\right) + n \ln(1-p) + \ln\binom{n}{y} \right\} \tag{5.2}$$

The symbol n represents the number of observations in a given covariate pattern. We have discussed covariate patterns before when dealing with residuals in the last chapter. For Bernoulli response logistic models, the model is estimated on the basis of observations. Only when analyzing the fit of the

> x3 <- c(1,1,1,1,1,1,1,0,0,0)
> obser <- data.frame(y,x1,x2,x3)
> xx1 <- glm(y ~ x1 + x2 + x3, family = binomial, data = obser)
> summary(xx1)
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

Null deviance: 13.46 on 9 degrees of freedom
Residual deviance: 12.05 on 6 degrees of freedom
AIC: 20.05

Grouped Data

> y <- c(1,1,2,0,2,0)
> cases <- c(3,1,2,1,2,1)
> x1 <- c(1,1,0,0,1,0)
> x2 <- c(0,1,0,1,0,1)
> x3 <- c(1,1,1,1,0,0)
> grp <- data.frame(y,cases,x1,x2,x3)
> grp$noty <- grp$cases - grp$y
> xx2 <- glm( cbind(y, noty) ~ x1 + x2 + x3, family = binomial, data = grp)
> summary(xx2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 9.6411 on 5 degrees of freedom
Residual deviance: 8.2310 on 2 degrees of freedom
AIC: 17.853

The coefficients, standard errors, z values, and p-values are identical. However, the ancillary deviance and AIC statistics differ due to the number of observations in each model. But the information in the two data sets is the same. This point is important to remember.

Note that the response variable is cbind(y, noty) rather than y as in the standard model. R users tend to prefer having the response be formatted in terms of two columns of data—one for the number of 1s for a given covariate pattern, and the second for the number of 0s (not 1s). It is the only logistic regression software I know of that allows this manner of formatting the binomial response. However, one can create a variable representing the cbind(y, noty) and run it as a single term response. The results will be identical.

> grp2 <- cbind(grp$y, grp$noty)
> summary(xx3 <- glm( grp2 ~ x1 + x2 + x3, family = binomial, data = grp))

. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

In a manner more similar to that used in other statistical packages, the binomial denominator, cases, may be employed directly into the response—but only if it is also used as a weighting variable. The following code produces the same output as above,

> summary(xx4 <- glm( y/cases ~ x1 + x2 + x3, family = binomial,
+    weights = cases, data = grp))
. . .
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2050     1.8348   0.657    0.511
x1            0.1714     1.4909   0.115    0.908
x2           -1.5972     1.6011  -0.998    0.318
x3           -0.5499     1.5817  -0.348    0.728

The advantage of using this method is that the analyst does not have to create the noty variable. The downside is that some postestimation functions do not accept being based on a weighted model. Be aware that there are alternatives and use the one that works best for your purposes. The cbind() response appears to be the most popular, and seems to be used more in published research.

Stata and SAS use the grouping variable, for example, cases, as the variable n in the binomial formulae listed in the last section and as given in the example directly above. The binomial response can be thought of as y = numerator and cases = denominator. Of course these term names will differ depending on the data being modeled. Check the end of this chapter for how Stata and SAS handle the binomial denominator.
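As a quick sanity check on the claim that the grouped specifications are equivalent, the coefficient vectors can be compared directly (a sketch):

> all.equal(coef(xx2), coef(xx3))   # cbind() response vs. matrix response
> all.equal(coef(xx2), coef(xx4))   # cbind() response vs. proportion + weights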

Analysts can actually create a model that specifically adds an extra parameter to the model that adjusts for the extra correlation or overdispersion in the data. For the Poisson model, the negative binomial model serves this purpose. It is a two-parameter model. The beta binomial is a two-parameter logistic model, with the extra heterogeneity parameter adjusting for extra correlation in the data. Other two- and three-parameter models have also been developed to account for Poisson and binomial overdispersion, but they need not concern us here (Hilbe, 2014). We shall discuss the beta binomial later in this chapter.

How is binomial overdispersion identified? The easiest way is by using the Pearson dispersion statistic. Let us view the dispersion statistic on the grouped binomial model we created above from observation data.

> P__disp(bin)
Pearson Chi2 = 6.630003
Dispersion   = 3.315001

Any value of the dispersion greater than 1 indicates extra variation in the data. That is, it indicates more variation than is allowed by the binomial PDF which underlies the model. Recall that the dispersion statistic is the Pearson statistic divided by the residual degrees of freedom, which is defined as the number of observations in the model less coefficients (predictors, intercept, extra parameters). The product of the square root of the dispersion by the standard error of each predictor in a grouped logistic model produces a quasibinomial grouped logistic model. It adjusts the standard errors of the model. Sandwich and bootstrapped standard errors may be used as well to adjust for overdispersed grouped logistic models.

A caveat should be given regarding the identification of overdispersed data. I mentioned that for grouped logistic models a dispersion statistic greater than 1 indicates overdispersion, or unaccounted-for variation in the data. However, there are times that models appear to be overdispersed, but are in fact not. A grouped logistic model dispersion statistic may be greater than 1, but the model data can itself be adjusted to eliminate the perceived overdispersion. Apparent overdispersion occurs in the following conditions:

Apparent Overdispersion
• The model is missing a needed predictor.
• The model requires one or more interactions of predictors.
• A predictor needs to be transformed to a different scale; log(x).
• The link is misspecified (the data should be modeled as probit or cloglog).
• There are existing outliers in the data.

Examples of how these indicators of apparent overdispersion affect logistic models are given in Hilbe (2009).

Guideline
If a grouped logistic model has a dispersion statistic greater than 1, check each of the 5 indicators of apparent overdispersion to determine if applying them reduces the dispersion to approximately 1. If it does, the data are not truly overdispersed. Adjust the model accordingly. If the dispersion statistic of a grouped logistic model is less than 1, the data are under-dispersed. This type of extra-dispersion is more rare, and is usually dealt with by scaling or using robust SEs.

5.4  MODELING AND INTERPRETATION OF GROUPED LOGISTIC REGRESSION

Modeling and interpreting grouped logistic models is the same as for binary response, or observation-based, models. The graphics that one develops will be a bit different from the ones based on a binary response model. Using the mylgg model we developed in Chapter 4, Section 4.1.3 when discussing residual analysis, we shall plot the same leverage versus standardized Pearson residuals (Figure 5.1) and standardized deviance residuals versus mu (Figure 5.2) as done in Chapter 4. However, this time the standardized residuals in Figure 5.2 are not squared. For a binary response model, squaring the standardized residuals provides for an easier interpretation. Note the difference due to the grouped format of the data.

> fit <- glm( cbind(dead, alive) ~ white + hmo + los + factor(type),
+    family = binomial, data = mylgg)
> mu <- fit$fitted.value                 # predicted probability
> hat <- hatvalues(fit)                  # hat matrix diagonal
> dr <- resid(fit, type = "deviance")    # deviance residuals
> pr <- resid(fit, type = "pearson")     # Pearson residuals
> stdr <- dr/sqrt(1-hat)                 # standardized deviance
> stpr <- pr/sqrt(1-hat)                 # standardized Pearson

> plot(stpr, hat)                        # leverage plot
> abline(v = 0, col = "red")

The interpretation of the hat statistics is the same as in Chapter 4. In Figure 5.2, notice the more scattered nature of the standardized deviance residuals. This is due to the variety of covariate patterns. Covariate patterns higher than the line at 2 are outliers, and do not fit the model.
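The companion plot of standardized deviance residuals against mu (Figure 5.2) is presumably produced along these lines, with a reference line at 2 for flagging poorly fitting covariate patterns (a sketch):

> plot(mu, stdr)                         # standardized deviance residuals vs mu
> abline(h = 2, col = "red", lty = "dotted")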

The log-likelihood function for the binomial model can then be expressed, with subscripts, as:

$$\mathcal{L}(\mu_i; y_i, n_i) = \sum_{i=1}^{n}\left\{ y_i \ln(\mu_i) + (n_i - y_i)\ln(1-\mu_i) + \ln\Gamma(n_i+1) - \ln\Gamma(y_i+1) - \ln\Gamma(n_i - y_i + 1)\right\} \quad (5.8)$$

The beta distribution is used as the basis of modeling proportional data. That is, beta data are constrained between 0 and 1—and can be thought of in this context as the proportion obtained by dividing the binomial numerator by the denominator. The beta PDF is given below in terms of two shape parameters, a and b, although there are a number of different parameterizations.

Beta PDF

$$f(y; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, y^{a-1}(1-y)^{b-1} \quad (5.9)$$

where a is the number of successes and b is the number of failures. The initial term in the function is the normalization constant, comprised of gamma functions.

The above function can also be parameterized in terms of μ. Since we plan on having the binomial parameter, μ, itself distributed as beta, we can parameterize the beta PDF as:

$$f(\mu) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1}(1-\mu)^{b-1} \quad (5.10)$$

Notice that the kernel of the beta distribution is similar to that of the binomial kernel.

$$\mu^{y}(1-\mu)^{n-y} \sim \mu^{a-1}(1-\mu)^{b-1} \quad (5.11)$$

Even the coefficients of the beta and binomial are similar in structure. In probability theory such a relationship is termed conjugate. The beta distribution is conjugate to the binomial. This is a very useful property when mixing distributions, since it generally allows for easier estimation. Conjugacy plays a particularly important role in Bayesian modeling where a prior conjugate (beta) distribution of a model coefficient, which is considered to be a random variable, is mixed with the (binomial) likelihood to form a beta posterior distribution.

The mean and variance of the beta PDF may be given as:

$$E(y) = \frac{a}{a+b} = \mu \qquad V(y) = \frac{ab}{(a+b)^2(a+b+1)} \quad (5.12)$$

As mentioned before, the beta-binomial distribution is a mixture of the binomial and beta distributions. The binomial parameter, μ, is distributed as beta, which adjusts for extra-binomial correlation in the data. The mixture can be obtained by multiplying the two distributions.

$$f(y; \mu, a, b) = f(y; \mu, n)\, f(y; \mu, a, b) \quad (5.13)$$

The result is the beta-binomial probability distribution.

Beta Binomial

$$f(y; \mu, a, b) = \frac{\Gamma(a+b)\,\Gamma(n+1)}{\Gamma(a)\Gamma(b)\,\Gamma(y+1)\,\Gamma(n-y+1)}\; \mu^{\,y+a-1}(1-\mu)^{\,n-y+b-1} \quad (5.14)$$

An alternative parameterization may be given in terms of μ and σ, with μ = a/(a + b).

$$f(y; \mu, \sigma) = \frac{\Gamma(n+1)}{\Gamma(y+1)\,\Gamma(n-y+1)}\;\frac{\Gamma\!\left(\tfrac{1}{\sigma}\right)\Gamma\!\left(y+\tfrac{\mu}{\sigma}\right)\Gamma\!\left(n-y+\tfrac{1-\mu}{\sigma}\right)}{\Gamma\!\left(n+\tfrac{1}{\sigma}\right)\Gamma\!\left(\tfrac{\mu}{\sigma}\right)\Gamma\!\left(\tfrac{1-\mu}{\sigma}\right)} \quad (5.15)$$

with y = 0, 1, 2, …, n, and 0 < μ < 1, and σ > 0.

Under this parameterization, the mean and variance of the beta binomial are:

$$E(Y) = n\mu \qquad V(Y) = n\mu(1-\mu)\left[1 + (n-1)\frac{\sigma}{1+\sigma}\right] \quad (5.16)$$

This is the parameterization that is used in R's gamlss function (Rigby and Stasinopoulos, 2005) and in the Stata betabin command (Hardin and Hilbe, 2014).

For an example, we shall use the 1912 Titanic shipping disaster passenger data. In grouped format, the data are called titanicgrp. The predictors of the model are:

Age: 1 = adult; 0 = child
Sex: 1 = male; 0 = female
Class: 1st class, 2nd class, 3rd class (we make 3rd class the reference)
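Before turning to the beta-binomial fit, it helps to see the grouped logistic model it will be compared against. The following is a sketch (not taken from the text), assuming titanicgrp contains the counts survive and died together with age, sex, and class03, the names used in the gamlss call below:

# Grouped (binomial) logistic regression on the Titanic data -- a sketch
mylogit <- glm(cbind(survive, died) ~ age + sex + class03,
               family = binomial, data = titanicgrp)
summary(mylogit)
exp(coef(mylogit))                                              # odds ratios
sum(residuals(mylogit, type = "pearson")^2) / mylogit$df.residual   # dispersion statistic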
binomial logistic regression function is analogous to a Poisson, or perhaps a negative binomial model.

Beta Binomial

> library(gamlss)
> summary(mybb <- gamlss(cbind(survive,died) ~ age + sex + class03,
                         data = titanicgrp, family = BB))

                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          1.498     0.6814   2.199 0.063855
ageadults           -2.202     0.8205  -2.684 0.031375
sexman              -2.177     0.6137  -3.547 0.009377
class032nd class     2.018     0.8222   2.455 0.043800
class031st class     2.760     0.8558   3.225 0.014547

Sigma link function: log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -1.801     0.7508  -2.399  0.03528

. . .

Global Deviance: 73.80329
            AIC: 85.80329
            SBC: 88.71273

Notice that the AIC statistic is reduced from 157.77 for the grouped logistic model to 85.80 for the beta-binomial model. This is a substantial improvement in model fit. The heterogeneity or dispersion parameter, sigma, is 0.165.

Sigma [gamlss's sigma is log(sigma)]
> exp(-1.801)
[1] 0.1651337

Odds ratios for the beta binomial are inflated compared to the grouped logit, but the p-values are nearly the same.

> exp(coef(mybb))
     (Intercept)        ageadults           sexman class032nd class
       4.4738797        0.1105858        0.1133972        7.5253615
class031st class
      15.8044343

I also calculated robust or sandwich standard errors for the beta-binomial model. 2nd class and age resulted in nonsignificant p-values. This result is the same as given with the above robust grouped logit model. gamlss does not work well with sandwich estimators; the calculations were done using Stata. See the book's web site for results.

The beta-binomial model is preferred to the single parameter logistic model. However, extra correlation still needs to be checked and adjusted. We should check for an interactive effect between age and sex, and between both age and sex and class 1. I shall leave that as an exercise for the reader. It appears, though, from looking at the model main effects only, that females holding 1st and 2nd class tickets stood the best odds of survival on the Titanic. If they were female children, they stood even better odds. 3rd class ticket holders, and in particular 3rd class male passengers, fared the worst. It should be noted that 1st class rooms were very expensive, with the best going for some US$100,000 in 2015 equivalent purchasing power.

The beta binomial is an important model, and should be considered for all overdispersed logistic models. In addition, for binomial models with probit and complementary loglog links, or with excess zero response values, Stata's betabin and zibbin commands (Hardin and Hilbe, 2013) have options for these models. Perhaps these capabilities will be made available to R users in the near future. The generalized binomial model is another function suitable for modeling overdispersed grouped logistic models. The model is available in Stata (Hardin and Hilbe, 2007) and SAS (Morel and Neerchal, 2012).

SAS CODE

/* Section 5.2 */

*Refer to data step in section 2.1 if manually input obser dataset;
*Build the logistic model;
proc genmod data = obser descending;
  model y = x1 x2 x3 / dist = binomial link = logit;
run;

*Refer to data step in section 2.1 if manually input grp dataset;
*Build the logistic model;
proc genmod data = grp descending;
5.4

. use phmylgg
. gen cases = dead + alive
. glm dead white hmo los i.type, fam(bin cases)
. predict mu
. predict hat, hat
. predict dev, deviance
. gen stdev = dev/sqrt(1-hat)
. predict stpr, rstandard
. scatter stpr hat
. gen stdev2 = stdev^2
. scatter stdev2 mu

5.5

. use titanicgrp
. list
. gen died = cases-survive
. glm died age sex b3.class, fam(bin cases) nolog
. glm, eform
. glm died age sex b3.class, fam(bin cases) vce(robust) nolog
. betabin died age sex b3.class, n(cases) nolog

6

Bayesian Logistic Regression

6.1  A BRIEF OVERVIEW OF BAYESIAN METHODOLOGY
Bayesian methodology would likely not be recognized by the person who is regarded as the founder of the tradition. Thomas Bayes (1702–1761) was a British Presbyterian country minister and amateur mathematician who had a passing interest in what was called inverse probability. Bayes wrote a paper on the subject, but it was never submitted for publication. He died without anyone knowing of its existence. Richard Price, a friend of Bayes, discovered the paper when going through Bayes's personal effects. Realizing its importance, he managed to have it published in the Royal Society's Philosophical Transactions in 1764. The method was only accepted as a curiosity and was largely forgotten until Pierre-Simon Laplace, generally recognized as the leading mathematician worldwide during this period, discovered it several decades later and began to employ its central thesis to problems of probability. However, how Bayes's inverse probability was employed during this time is quite different from how analysts currently apply it to regression modeling. For those who are interested in the origins of Bayesian thinking, and its relationship to the development of probability and statistics in general, I recommend reading Weisberg (2014) or McGrayne (2011).

Inverse probability is simple in theory. Suppose that we know from epidemiological records that the probability of a person having certain symptoms S given that they have disease D is 0.8. This relationship may be symbolized as Pr(S|D) = 0.8. However, most physicians want to know the probability of having the disease if a patient displays these symptoms, or Pr(D|S). In order to find this
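The inversion itself is a direct application of Bayes' theorem. As a purely illustrative sketch (the prevalence Pr(D) = 0.01 and overall symptom rate Pr(S) = 0.10 below are hypothetical numbers, not values from the text):

$$\Pr(D \mid S) = \frac{\Pr(S \mid D)\,\Pr(D)}{\Pr(S)} = \frac{0.8 \times 0.01}{0.10} = 0.08$$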

Cambridge University in the United Kingdom. More will be mentioned about the BUGS packages and JAGS at the start of Chapter 6, Section 6.2.2. Stata 14 was released on April 7, 2015, well after this text was written. Stata now has full Bayesian capabilities. I was able to include Stata code at the end of this chapter for Bayesian logistic models with noninformative and Cauchy priors.

6.2  EXAMPLES: BAYESIAN LOGISTIC REGRESSION

6.2.1  Bayesian Logistic Regression Using R

For an example we shall model the 1984 German health reform data, rwm1984. Our variable of interest is a patient's work status. If they are not working, outwork = 1; if they are employed or are otherwise working, outwork = 0. The predictors we use to understand outwork are:

docvis: The number of visits made to a physician during the year, from 0 to 121.
female: 1 = female; 0 = male.
kids: 1 = has children; 0 = no children.
age: age, from 25 to 64.

The data are first loaded and the data are renamed R84. We shall view the data, including other variables in the data set.

> library(MCMCpack)
> library(LOGIT)
> data(rwm1984)
> R84 <- rwm1984

# DATA PROFILE
> head(R84)
  docvis hospvis edlevel age outwork female married kids hhninc educ self
1      1       0       3  54       0      0       1    0  3.050 15.0    0
2      0       0       1  44       1      1       1    0  3.050  9.0    0
3      0       0       1  58       1      1       0    0  1.434 11.0    0
4      7       2       1  64       0      0       0    0  1.500 10.5    0
5      6       0       3  30       1      0       0    0  2.400 13.0    0
6      9       0       3  26       1      0       0    0  1.050 13.0    0
  edlevel1 edlevel2 edlevel3 edlevel4
1        0        0        1        0
2        1        0        0        0
3        1        0        0        0
4        1        0        0        0
5        0        0        1        0
6        0        0        1        0

The data have 3874 observations and 15 variables.

> dim(R84)
[1] 3874 15

The response variable, outwork, has 1420 1s and 2454 0s, for a mean of 0.3665.

> table(R84$outwork)
   0    1
2454 1420

Other characteristics of the data to be modeled, including the centering of both continuous predictors, are given as follows:

# SUMMARIES OF THE TWO CONTINUOUS VARIABLES
> summary(R84$docvis)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   0.000   1.000   3.163   4.000 121.000

> summary(R84$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     25      35      44      44      54      64

# CENTER BOTH CONTINUOUS PREDICTORS
> R84$cage <- R84$age - mean(R84$age)
> R84$cdoc <- R84$docvis - mean(R84$docvis)

We shall first model the data based on a standard logistic regression, and then by a logistic regression with the standard errors scaled by the square root of the Pearson dispersion. The scaled logistic model, as discussed in the previous chapter, is sometimes referred to as a "quasibinomial" model. We model both to determine if there is extra variability in the data that may require adjustments. The tables of coefficients for each model are not displayed below, but are stored in myg and myq, respectively. I shall use the toOR function to display the odds ratios and associated statistics of both models in close proximity. The analyst should inspect the delta (SEs) values to determine if they differ from each other by much. If they do, then there is variability in the data. A scaled logistic model, or other adjusted models, should be used on the data, including a Bayesian model. Which model we use depends on what we think is the source of the extra correlation.
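The two fits referred to above are not displayed on these pages; the following sketch shows what they might look like, using the object names myg and myq from the text (the exact call details are assumptions):

# Standard and scaled ("quasibinomial") logistic models -- a sketch
myg <- glm(outwork ~ cdoc + female + kids + cage,
           family = binomial, data = R84)
myq <- glm(outwork ~ cdoc + female + kids + cage,
           family = quasibinomial, data = R84)
toOR(myg)   # odds ratios and delta (SE) values; toOR() is from the LOGIT package
toOR(myq)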
2. Quantiles for each variable:

                 2.5%      25%      50%      75%    97.5%
(Intercept) -2.17202 -2.06720 -2.01287 -1.95843 -1.85674
cdoc         0.01267  0.02034  0.02454  0.02883  0.03704
female       2.09550  2.20357  2.25912  2.31470  2.42444
kids         0.18391  0.29680  0.35791  0.41714  0.53193
cage         0.04630  0.05164  0.05442  0.05723  0.06255

Compare the output above for the noninformative prior with SAS output on the same data and model. The results are remarkably similar.

POSTERIOR SUMMARIES

                                 STANDARD             PERCENTILES
PARAMETER       N       MEAN     DEVIATION      25%       50%       75%
Intercept    100,000   −2.0140    0.0815      −2.0686   −2.0134   −1.9586
Cdoc         100,000    0.0247    0.00632      0.0204    0.0246    0.0289
Female       100,000    2.2605    0.0832       2.2043    2.2602    2.3166
Kids         100,000    0.3596    0.0907       0.2981    0.3590    0.4207
Cage         100,000    0.0545    0.00418      0.0516    0.0545    0.0573

POSTERIOR INTERVALS

PARAMETER    ALPHA    EQUAL-TAIL INTERVAL       HPD INTERVAL
Intercept    0.050    −2.1755    −1.8557     −2.1710    −1.8520
Cdoc         0.050     0.0124     0.0373      0.0124     0.0373
Female       0.050     2.0989     2.4242      2.0971     2.4220
Kids         0.050     0.1831     0.5382      0.1838     0.5386
Cage         0.050     0.0463     0.0628      0.0464     0.0628

Although interpretations differ, the posterior mean values are analogous to maximum likelihood coefficients, the standard deviations are like standard errors, and the 2.5% and 97.5% quantiles are somewhat similar to confidence intervals. Here Bayesians refer to the external quantiles as "credible sets" or sometimes as either "credible intervals" or "posterior intervals."

Remember that each parameter is considered to be randomly distributed, and not fixed as is assumed when data are being modeled using standard frequentist-based maximum likelihood methods. As such Bayesians attempt to develop a distribution for each posterior parameter, the mean of each is the Bayesian logistic beta. The plots on the right-hand side of Figure 6.1 display

[Figure 6.1 shows trace plots (left) and density plots (right) for each parameter: (Intercept), cdoc, female, kids, and cage; each panel is based on 100,000 iterations.]

FIGURE 6.1  R trace and density plots of model with noninformative priors.
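The quantile and summary output above is what summary() prints for an MCMCpack posterior object. A sketch of the kind of call that produces it, using MCMCpack's MCMClogit (the burn-in value here is an assumption; the 100,000 retained draws match the N shown in the output):

library(MCMCpack)
myb <- MCMClogit(outwork ~ cdoc + female + kids + cage,
                 data = R84, burnin = 5000, mcmc = 100000)
summary(myb)   # empirical means, SDs, and the quantiles shown above
plot(myb)      # trace and density plots, as in Figure 6.1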
number of different models. Of course, our example will be to show its use in creating a Bayesian logistic model.

First, make sure you have installed JAGS to your computer. It is freeware, as is R. JAGS is similar to WinBUGS and OpenBUGS, which can also be run as standalone packages or within the R environment. JAGS is many times preferred by those in the hard sciences like physics, astronomy, ecology, biology, and so forth since it is command-line driven, and written in C++ for speed. WinBUGS and OpenBUGS are written in Pascal, which tends to run slower than C++ implementations, but can be run within the standalone WinBUGS or OpenBUGS environments, which include menus, help, and so forth. The BUGS programs are more user-friendly. Both OpenBUGS and JAGS are also able to run on a variety of platforms, which is advantageous to many users. In fact, WinBUGS is no longer being developed or supported. The developers are putting all of their attention to OpenBUGS. Lastly, and what I like about it, when JAGS is run from within R, the program actually appears as if it is just another R package. I do not feel as if I am using an outside program.

To start it is necessary to have JAGS in R's path, and the R2jags package needs to be installed and loaded. For the first JAGS example you also should bring two functions contained in jhbayes.R into memory using the source function.

> library(R2jags)
> source("c://Rfiles/jhbayes.R")  # or where you store R files; book's website

The code in Table 6.1 is specific to the model we have been working with in the previous section. However, as you can see, it is easily adaptable for other logistic models. With a change in the log-likelihood, it can also be used with other distributions and can be further amended to incorporate random effects, mixed effects, and a host of other models.

Let us walk through the code in Table 6.1. Doing so will make it much easier for you to use it for other modeling situations.

The top two lines

X <- model.matrix(~ cdoc + female + kids + cage,
                  data = R84)
K <- ncol(X)

create a matrix of predictors, X, from the model R84, and a variable, K, which contains the number of predictors contained in X. A column of 1s for the intercept is also generated by model.matrix().

The next code segment is logit.data, although we may call it anything we wish. logit.data is a list of the components of the JAGS model we are

TABLE 6.1  JAGS code for Bayesian logistic model

X <- model.matrix(~ cdoc + female + kids + cage,
                  data = R84)
K <- ncol(X)
logit.data <- list(Y = R84$outwork,
                   N = nrow(R84),
                   X = X,
                   K = K,
                   LogN = log(nrow(R84)),
                   b0 = rep(0, K),
                   B0 = diag(0.00001, K)
                   )
sink("LOGIT.txt")

cat("
model{
    # Priors
    beta ~ dmnorm(b0[], B0[,])

    # Likelihood
    for (i in 1:N){
        Y[i] ~ dbern(p[i])
        logit(p[i]) <- max(-20, min(20, eta[i]))
        eta[i] <- inprod(beta[], X[i,])
        LLi[i] <- Y[i] * log(p[i]) +
                  (1 - Y[i]) * log(1 - p[i])
    }
    LogL <- sum(LLi[1:N])
    AIC <- -2 * LogL + 2 * K
    BIC <- -2 * LogL + LogN * K
}
", fill = TRUE)
sink()

# INITIAL VALUES - BETAS AND SIGMAS
inits <- function () {
    list(
        beta = rnorm(K, 0, 0.1)
    )  }

params <- c("beta", "LogL", "AIC", "BIC")

# JAGS
J0 <- jags(data = logit.data,
           inits = inits,
           parameters = params,
           model.file = "LOGIT.txt",
           n.thin = 10,
           n.chains = 3,
           n.burnin = 40000,
           n.iter = 50000)

# OUTPUT DISPLAYED
out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K), "LogL", "AIC", "BIC"))
round(myB, 4)
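Once the jags() call in Table 6.1 has finished, the fit can be inspected with the usual R2jags tools; a brief sketch (not part of the Table 6.1 listing):

print(J0)       # posterior means, sds, quantiles, Rhat, and effective sample sizes
traceplot(J0)   # trace plots for each monitored parameter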
sampling values are discarded before values are kept in the posterior distribution. The initial values can vary widely, and skew the results. If all of the early values were kept, the mean of the posterior distribution could be severely biased. Discarding a sizeable number of early values helps guarantee a better posterior. Finally, the n.iter specifies how many values are kept for the posterior distribution, after thinning and discarding of burn-in values.

J0 <- jags(data = logit.data,
           inits = inits,
           parameters = params,
           model.file = "LOGIT.txt",
           n.thin = 10,
           n.chains = 3,
           n.burnin = 40000,
           n.iter = 50000)

After running the jags function, which we have called J0, typing J0 on the R command-line will provide raw model results. The final code in Table 6.1 provides nicer looking output. The source code in jhbayes.R is relevant at this point. jhbayes.R consists of two small functions from the Zuur support package, MCMCSupportHighstat.R, which comes with Zuur, Hilbe and Ieno (2013) and is available for other books by Zuur as well. The posterior means, or betas, the log-likelihood function, and AIC and BIC statistics are displayed, together with their standard errors and outer 0.025 "credible set." We specified that only four decimal digits are displayed. BUGSoutput and MyBUGSOutput are parts of the R2jags package:

out <- J0$BUGSoutput
myB <- MyBUGSOutput(out, c(uNames("beta", K),
                           "LogL", "AIC", "BIC"))
round(myB, 4)

The Bayesian logistic model results are listed in the table below.

> round(myB, 4)
              mean      se       2.5%      97.5%
beta[1]    -2.0193  0.0824    -2.1760    -1.8609
beta[2]     0.0245  0.0063     0.0128     0.0370
beta[3]     2.2569  0.0843     2.0922     2.4216
beta[4]     0.3685  0.0904     0.1920     0.5415
beta[5]     0.0545  0.0042     0.0466     0.0626
LogL    -1961.6258  1.5178 -1965.4037 -1959.5816
AIC      3933.2517  3.0357  3929.1632  3940.8074
BIC      3964.5619  3.0357  3960.4734  3972.1176

Compare the above statistics with the summary table of myg, which was the model as estimated using the glm function. Note that the AIC values are statistically identical. This output also matches the SAS results displayed, estimated using noninformative priors.

> summary(myg)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.010276   0.081087 -24.792  < 2e-16 ***
cdoc         0.024432   0.006263   3.901 9.57e-05 ***
female       2.256804   0.082760  27.269  < 2e-16 ***
kids         0.357976   0.089962   3.979 6.92e-05 ***
cage         0.054379   0.004159  13.075  < 2e-16 ***
---
    Null deviance: 5091.1  on 3873  degrees of freedom
Residual deviance: 3918.2  on 3869  degrees of freedom
AIC: 3928.2

A comparison of the frequency-based standard logistic regression and our two Bayesian models without informative priors reveals nearly identical values. Note that using two entirely different methods of estimation—maximum likelihood and sampling—results in the same values. This tells us that these estimation procedures are valid ways of estimating the true underlying parameter values of the distribution theoretically generating the data.

> round(cbind(coef(myg), Bcoef, myB[1:K,1]), 4)
                              Bcoef
(Intercept) -2.0103 -2.0131 -2.0193
cdoc         0.0244  0.0246  0.0245
female       2.2568  2.2592  2.2569
kids         0.3580  0.3575  0.3685
cage         0.0544  0.0544  0.0545

The example above did not employ an informative prior. For instance, we could have provided information that reflected our knowledge that docvis has between 40% and 50% zero counts. We compounded the problem since docvis was centered, becoming cdoc. The centered values for when docvis = 0 are −3.162881. They are −2.162881 when docvis = 1. We can therefore set up a prior that we expect 40%–50% zero counts when cdoc is less than −3.

6.2.3  Bayesian Logistic Regression with Informative Priors

In a regression model the focus is on placing priors on parameters in order to develop adjusted posterior parameter values. For example, we could set a prior on the coefficient of cdoc such that we are 75% confident that the coefficient
  BIC     5.024e+03  5.024e+03  5.025e+03  5.027e+03  5.031e+03
  LogL   -2.511e+03 -2.509e+03 -2.508e+03 -2.508e+03 -2.508e+03
  beta.0 -6.282e-01 -5.725e-01 -5.476e-01 -5.230e-01 -4.682e-01
  beta.1  3.839e-02  4.629e-02  5.050e-02  5.481e-02  6.325e-02
  beta.2 -1.112e+01 -9.996e-01 -7.193e-03  9.644e-01  1.070e+01

With normal priors, the output is displayed as:

#1. Priors
beta.0 ~ dnorm(0, 0.00001)
beta.1 ~ dnorm(0, 0.00001)
beta.2 ~ dnorm(0, 0.00001)

1. Empirical mean and standard deviation for each variable, plus standard error of the mean:

              Mean        SD      Naive SE  Time-series SE
AIC      5.020e+03  1.983282  5.121e-03       7.678e-03
BIC      5.026e+03  1.983282  5.121e-03       7.678e-03
LogL    -2.509e+03  0.991641  2.560e-03       3.839e-03
beta.0  -5.471e-01  0.044242  1.142e-04       5.766e-04
beta.1   5.058e-02  0.006379  1.647e-05       2.113e-05
beta.2  -1.546e-01  7.041635  1.818e-02       1.412e-01

2. Quantiles for each variable:

              2.5%        25%        50%        75%      97.5%
AIC      5.018e+03  5.018e+03  5.019e+03  5.020e+03  5.025e+03
BIC      5.024e+03  5.024e+03  5.025e+03  5.027e+03  5.031e+03
LogL    -2.511e+03 -2.509e+03 -2.508e+03 -2.508e+03 -2.508e+03
beta.0  -6.255e-01 -5.722e-01 -5.473e-01 -5.223e-01 -4.660e-01
beta.1   3.829e-02  4.625e-02  5.051e-02  5.484e-02  6.333e-02
beta.2  -1.235e+01 -1.022e+00 -2.184e-02  9.511e-01  9.943e+00

Notice that the values of the distributional means for each parameter—intercept, cdoc, and cage—differ, as do other associated statistics. The prior has indeed changed the model. What this means is that we can provide our model with a substantial amount of additional information about the predictors used in our logistic model. Generally speaking, it is advisable to have a prior that is distributionally compatible with the distribution of the parameter having the prior. The subject is central to Bayesian modeling, but it takes us beyond the level of this book. My recommendations for taking the next step in Bayesian modeling include Zuur et al. (2013), Cowles (2013), and Lunn et al. (2013). More advanced but thorough texts are Christensen et al. (2011) and Gelman et al. (2014). There are many other excellent texts as well. I should also mention that Hilbe et al. (2016) will provide a clear analysis of Bayesian modeling as applied to astronomical data.

SAS CODE

/* Section 6.2 */
*Refer to the code in section 1.4 to import and print rwm1984 dataset;
*Refer to proc freq in section 2.4 to generate the frequency table;

*Summary for continuous variables;
proc means data=rwm1984 min q1 median mean q3 max maxdec=3;
  var docvis age;
  output out=center mean=;
run;

*Create the macro variables;
proc sql;
  select age into: meanage from center;
  select docvis into: meandoc from center;
quit;

*Center the continuous variables;
data R84;
  set rwm1984;
  cage=age-&meanage;
  cdoc=docvis-&meandoc;
run;

*Build the logistic model and obtain odds ratio & statistics;
proc genmod data=R84 descending;
  model outwork=cdoc female kids cage / dist=binomial link=logit;
  estimate "Intercept" Intercept 1 / exp;
  estimate "Cdoc" cdoc 1 / exp;
  estimate "Female" female 1 / exp;
  estimate "Kids" kids 1 / exp;
  estimate "Cage" cage 1 / exp;
run;

*Build the quasibinomial logistic model;
proc glimmix data=R84;
  model outwork (event='1')=cdoc female kids cage / dist=binary link=logit solution covb;
  random _RESIDUAL_;
run;

*Refer to proc iml in section 2.3 and the full code is provided online;
in errors. Even copying code from my own saved Word and PDF documents to R's editor caused problems. Many times I had to retype quotation marks, minus signs, and several other symbols in order for R to run properly. I also should advise you that when in the R editor, it may be wise to "run" long stretches of code in segments. That is, rather than select the entire program code, select and run segments of it. I have had students, and those who have purchased books of mine that include R code, email me that they cannot run the code. I advise them to run it in segments. Nearly always they email back that they now have no problems. Of course, at times in the past there have indeed been errors in the code, but know that the code in this book has all been successfully run multiple times. Make sure that the proper libraries and data have been installed and loaded before executing code.

There is a lot of information in the book. However, I did not discuss issues such as missing values, survey analysis, validation, endogeny, and latent class models. These are left for my comprehensive book titled Logistic Regression Models (2009, Chapman & Hall), which is over 650 pages in length. A forthcoming second edition will include both Stata and R code in the text with SAS code as it is with this book. Bayesian logistic regression will be more thoroughly examined, with Bayesian analysis of grouped, ordered, multinomial, hierarchical, and other related models addressed.

I primarily wrote this book to go with a month-long web-based course I teach with Statistics.com. I have taught the course with them since 2003, three classes a year, and continually get questions and feedback from researchers, analysts, and professors from around the world. I have also taught logistic regression and given workshops on it for over a quarter of a century. In this book, I have tried to address the most frequent concerns and problem areas that practicing analysts have informed me about. I feel confident that anyone reading carefully through this relatively brief monograph will come away from it with a solid knowledge of how to use logistic regression—both observation based and grouped. For those who wish to learn more after going through this book, I recommend my Logistic Regression Models (2009, 2016 in preparation). I also recommend Bilder and Loughin (2015), which uses R code for examples, Collett (2003), Dohoo et al. (2012), and for nicely written shorter books dealing with the logistic regression and GLM in general, Dobson and Barnett (2008), Hardin and Hilbe (2013), and Smithson and Merkle (2014). Hosmer et al. (2013) is also a fine reference book on the subject, but there is no code provided with the book. The other recommended books have code to support examples, which I very much believe assists the learning process.

I invite readers of this book to email me their comments and suggestions about it: hilbe//works.bepress.com/joseph_hilbe/ has the data sets used in the book in various formats, and all of the code used in the book in electronic format. Both SAS and Stata code and output are also provided.

References

Bilder, C.R. and Loughin, T.M. 2015. Analysis of Categorical Data with R. Boca Raton, FL: Chapman & Hall/CRC.
Christensen, R., Johnson, W., Branscum, A. and Hanson, T.E. 2011. Bayesian Ideas and Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.
Collett, D. 2003. Modeling Binary Data, 2nd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Cowles, M.K. 2013. Applied Bayesian Statistics. New York, NY: Springer.
De Souza, R.S., Cameron, E., Killedar, M., Hilbe, J., Vilatia, R., Maio, U., Biffi, V., Riggs, J.D. and Ciardi, B., for the COIN Collaboration. 2015. The overlooked potential of generalized linear models in astronomy—I: Binomial regression and numerical simulations, Astronomy & Computing, DOI: 10.1016/j.ascom.2015.04.002.
Dobson, A.J. and Barnett, A.G. 2008. An Introduction to Generalized Linear Models, 3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Dohoo, I., Martin, W. and Stryhn, H. 2012. Methods in Epidemiological Research. Charlottetown, PEI, CA: VER.
Firth, D. 1993. Bias reduction of maximum likelihood estimates, Biometrika 80, 27–28.
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B. 2014. Bayesian Data Analysis, 3rd Edn. Boca Raton, FL: Chapman & Hall/CRC.
Geweke, J. 1992. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (eds.), Bayesian Statistics, 4th Edn. Oxford, UK: Clarendon Press.
Hardin, J.W. and Hilbe, J.M. 2007. Generalized Linear Models and Extensions, 2nd Edn. College Station, TX: Stata Press.
Hardin, J.W. and Hilbe, J.M. 2013. Generalized Linear Models and Extensions, 3rd Edn. College Station, TX: Stata Press/CRC (4th edition due out in late 2015 or early 2016).
Hardin, J.W. and Hilbe, J.M. 2014. Estimation and testing of binomial and beta-binomial regression models with and without zero inflation, Stata Journal 14(2): 292–303.
Heinze, G. and Schemper, M. 2002. A solution to the problem of separation in logistic regression. Statistics in Medicine 21, 2409–2419.
Hilbe, J.M. 2009. Logistic Regression Models. Boca Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M. 2011. Negative Binomial Regression, 2nd Edn. Cambridge, UK: Cambridge University Press.
Hilbe, J.M. 2014. Modeling Count Data. New York, NY: Cambridge University Press.
Hilbe, J.M. and Robinson, A.P. 2013. Methods of Statistical Model Estimation. Boca Raton, FL: Chapman & Hall/CRC.
Hilbe, J.M., de Souza, R.S. and Ishida, E. 2016. Bayesian Models for Astrophysical Data: Using R/JAGS and Python/Stan. Cambridge, UK: Cambridge University Press.
Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. 2013. Applied Logistic Regression, 3rd Edn. Hoboken, NJ: Wiley.
