
American Journal of Epidemiology, Vol. 187, No. 4
DOI: 10.1093/aje/kwx299
Advance Access publication: August 17, 2017
© The Author(s) 2018. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: [email protected].

Practice of Epidemiology

Separation in Logistic Regression: Causes, Consequences, and Control

Mohammad Ali Mansournia, Angelika Geroldinger*, Sander Greenland, and Georg Heinze
* Correspondence to Dr. Angelika Geroldinger, Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090 Vienna, Austria (e-mail: angelika.geroldinger@gmx.at).

Initially submitted November 23, 2016; accepted for publication August 3, 2017.

Separation is encountered in regression models with a discrete outcome (such as logistic regression) where the covariates perfectly predict the outcome. It is most frequent under the same conditions that lead to small-sample and sparse-data bias, such as presence of a rare outcome, rare exposures, highly correlated covariates, or covariates with strong effects. In theory, separation will produce infinite estimates for some coefficients. In practice, however, separation may be unnoticed or mishandled because of software limits in recognizing and handling the problem and in notifying the user. We discuss causes of separation in logistic regression and describe how common software packages deal with it. We then describe methods that remove separation, focusing on the same penalized-likelihood techniques used to address more general sparse-data problems. These methods improve accuracy, avoid software problems, and allow interpretation as Bayesian analyses with weakly informative priors. We discuss likelihood penalties, including some that can be implemented easily with any software package, and their relative advantages and disadvantages. We provide an illustration of ideas and methods using data from a case-control study of contraceptive practices and urinary tract infection.

logistic regression; maximum likelihood; penalized likelihood; separation; small samples; sparse data

Abbreviations: LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; UTI, urinary tract infection.

Logistic regression is a standard method for estimating adjusted odds ratios. Logistic models are almost always fitted with maximum likelihood (ML) software, which provides valid statistical inferences if the model is approximately correct and the sample is large enough (e.g., at least 4–5 subjects per parameter at each level of the outcome). Nonetheless, ML estimation can break down with small or sparse data sets, an exposure or outcome that is uncommon in the data, or large underlying effects, especially with combinations of these problems (1–6). In these cases, ML estimators are not even approximately unbiased, and ML estimates of finite odds ratios may be infinite. These infinite estimates can be viewed as arising from separation of the outcomes by the covariates (7, 8). Numerically, there are two types of separation: With complete separation, the outcome of each subject in the data set can be perfectly predicted, while with quasicomplete separation this is possible only for a subset of the subjects. We have explored the problem and its solutions, and we have illustrated them in the analysis of a case-control study relating contraceptive practices to urinary tract infection (UTI). Throughout, we assume that the separation represents a sampling artefact of the data rather than a causal necessity (e.g., having entered a covariate that is a sufficient cause of the outcome).

CAUSES OF SEPARATION

Foxman et al. (9) conducted a case-control study of contraceptives and UTI among college women. We used their data, distributed with the software package LogXact (Cytel, Cambridge, Massachusetts), consisting of 437 observations. Table 1 cross-tabulates 1 of the 9 evaluated covariates, use of diaphragm or cervical cap, with the outcome variable, UTI. The combination of rare exposure and strong relationship to the outcome causes 1 of the 4 cell counts to be zero. This leads to an infinite estimate. However, this reflects 0 noninfected women among only 7 subjects using diaphragms, leading to extreme imprecision, even for proportions (e.g., the exact 95% confidence interval for the frequency of UTI in this small subgroup is 59%–100%).

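The quoted interval can be reproduced with the exact (Clopper-Pearson) binomial confidence interval; a minimal R sketch (ours, not from the article, which used LogXact):

    # Exact 95% CI for the proportion infected among the 7 diaphragm users (7 of 7 infected)
    binom.test(x = 7, n = 7)$conf.int
    # approximately 0.59 to 1.00, the 59%-100% interval cited above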

Table 1. Diaphragm Use and Urinary Tract Infection in the Data Reported by Foxman et al.,a 1997

                        Urinary Tract Infection
    Diaphragm Use          Yes          No
    Yes                      7           0
    No                     140         290

    a Foxman et al. (9).

Using continuous covariates can give rise to similar problems. Consider as an example the study of Salama et al. (10), in which endothelin-1 serum expression in lung transplant recipients is used as predictor of primary graft dysfunction. Logistic regression estimates the odds ratio, relating a 1-unit increase in log endothelin-1 expression to primary graft dysfunction, by maximizing the probability of the observed outcomes given the model (i.e., by maximizing the likelihood). All 64 subjects with a log endothelin-1 expression of 5.05 or more had primary graft dysfunction, while all the other 41 subjects, with log endothelin-1 serum expression of 4.42 or less, had normal graft function. Thus, the likelihood is maximized if the former subjects are assigned predicted probabilities of 1 while the latter are assigned predicted probabilities of 0. Web Figure 1 in Web Appendix 1 (available at https://academic.oup.com/aje) illustrates how the estimate of the log odds ratio β1 iteratively approaches this extreme. With each iteration, the estimated curve becomes steeper until the estimated probabilities become totally separated and "jump" from 0 for all log endothelin-1 values below 4.42 to 1 for all values above 4.42. The algorithm is on track to maximize the likelihood, but it has to stop when regression coefficients become numerically too large for the software to handle.

Separation can also arise as a consequence of the joint effects of continuous covariates, even if no covariate achieves separation alone. Consider the urinary incontinence data reported by Potter (11), with 3 predictors x1, x2, x3 for treatment success. While models using each covariate or each pair of covariates do not show irregularities, fitting a model with all 3 covariates results in infinite coefficient estimates for each covariate (data not shown). A 3-dimensional plot of these covariates, distinguishing the 2 outcome states, reveals that the data points with or without treatment success can be separated by a plane in 3-dimensional space, defined by −112.3x1 − 165.3x2 + 21.02x3 = 5.4 (Figure 1). More precisely, the expression −112.3x1 − 165.3x2 + 21.02x3 is less than or equal to 5.4 for all subjects with treatment failure and is greater than or equal to 5.4 for all with treatment success. Some subjects in both outcome categories have values equal to exactly 5.4 (i.e., fall on the separation plane), resulting in quasicomplete separation. By contrast, complete separation describes the case where a separating plane need not contain any data point (see Albert and Anderson (7) and Web Appendix 2).

Figure 1. Illustration of data separation for the data from Potter (11), 2005. The axes correspond to the 3 covariates. Treatment success is marked in black and failure in gray. Plots (A) and (B) differ only in the angle of view. The data are an example of quasicomplete separation (i.e., there is a plane (with equation −112.3x1 − 165.3x2 + 21.02x3 = 5.4) that separates data points with different outcomes but with observations of both outcomes lying exactly on the plane).
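The Potter data are not reproduced here, but the behavior is easy to provoke with artificial data (a toy sketch of ours, not from the article). Under quasicomplete separation, R's glm typically reports only the warning "fitted probabilities numerically 0 or 1 occurred" while the slope estimate and its standard error blow up:

    # Toy quasicomplete separation: y = 0 whenever x < 0 and y = 1 whenever x > 0,
    # with both outcomes observed at the boundary value x = 0
    x <- c(-2, -1, 0, 0, 1, 2)
    y <- c( 0,  0, 0, 1, 1, 1)
    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients  # slope and standard error are numerically huge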


HOW DO SOFTWARE PACKAGES DEAL WITH SEPARATION?

In the presence of either type of separation, differences in fitting algorithms may lead to important differences in the estimates. To illustrate differences in the results using default settings, we analyzed the UTI example with all 9 covariates in the model. The command LOGISTIC REGRESSION in SPSS (version 22; SPSS Inc., Chicago, Illinois), the glm function in R (version 3.2.2; R Foundation for Statistical Computing, Vienna, Austria), and the LOGISTIC procedure in SAS (version 9.4; SAS Institute, Inc., Cary, North Carolina) reported log odds ratio (standard error) estimates for the variable diaphragm use of 20.9 (14,002.9), 16.2 (807), and 15.1 (771.7), respectively (see Table 2). SAS reported an odds ratio of >999.999 with a Wald 95% confidence interval (estimate ± 1.96 standard errors) of <0.001 to >999.999. SPSS stated that the estimation terminated at iteration number 20 because maximum iterations had been reached.

Why are these results so different? All software packages use an iterative algorithm to find the coefficients that maximize the log-likelihood (see Web Appendix 3 or Cole et al. (12)). They differ mainly in the way they determine whether the maximum has been reached. Convergence may be declared if, at an additional iteration step, the log-likelihood or the coefficients (log odds ratios) change little. For numerical reasons, with each consecutive iteration, the log-likelihood will increase, although possibly by only a very small amount, and the coefficient changes will almost never attain a value of exactly 0.

Therefore, programs need a definition of "little change." Different criteria (e.g., an absolute value smaller than 10^−5, 10^−8, or 10^−10) will lead to different estimates. If there is no separation, the ML estimates will be finite, and these differences in convergence criteria will usually be irrelevant. However, with separation even large changes in the coefficients lead only to little changes in the log-likelihood (i.e., the log-likelihood is almost flat over a vast range). Figure 2 illustrates this situation for the UTI example; increasing the log odds ratio from 5 to 10, for example, does not improve the model fit in terms of log-likelihood. Thus, different default log-likelihood convergence criteria, combined with the possibly large iteration-to-iteration changes in the estimate (due to the flat log-likelihood), imply that the reported estimates may vary substantially between software packages.

Figure 2. Profile likelihood (on logarithmic scale) for the log odds ratio β of diaphragm use in univariate analysis of the data from Foxman et al. (9), 1997. For each value of β, the profile likelihood is obtained by maximizing the likelihood as a function of the intercept given β.

The flatness of the log-likelihood implies that the standard errors will be very large and Wald confidence intervals extremely wide and uninformative, practically ranging from minus to plus infinity (13). Multicollinearity of predictors is another problem that could lead to a flat log-likelihood and wide Wald confidence intervals. However, with multicollinearity alone a unique maximum of the log-likelihood still exists. Therefore, the researcher should be alerted by very large coefficient estimates accompanied by extremely wide Wald confidence intervals.

Unlike multicollinearity, separation does not as seriously affect likelihood-ratio based statistics (12) such as profile-likelihood confidence intervals. Profile-likelihood confidence intervals are usually not a default output of software packages, but they can often be obtained by additional commands. In case of separation, these confidence intervals have a finite and an infinite limit, reflecting asymmetry of the log-likelihood (see also Figure 2). Although profile-likelihood confidence intervals may be too narrow with sparse data (6), they are better than the default Wald confidence intervals, which in case of separation do not even exist.
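In R, for example, both interval types are available for a fitted binomial glm, so the comparison takes two lines (a sketch; fit denotes any fitted logistic model):

    confint.default(fit)  # Wald intervals: estimate +/- 1.96 standard errors
    confint(fit)          # profile-likelihood intervals (computed via MASS profiling)
    # Widely discrepant limits for a covariate signal sparsity or separation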
Large differences between Wald and profile-likelihood confidence intervals (which are available in SAS, Stata (StataCorp LP, College Station, Texas), and R) are a good indicator of sparsity in general and separation in particular. Table 2 illustrates those differences with the UTI data example. The LOGIT command in Stata, version 14, drops the diaphragm variable from the model and fits a model with the 8 other covariates for those with no diaphragm use (n = 430); it does not estimate the diaphragm odds ratio but instead provides the warning "diaphragm!=0 predicts success perfectly." For covariates other than diaphragm use, all software packages give the same finite log odds ratios and standard errors. Generally, 3 cases must be distinguished for such nonseparation-causing covariates: First, their estimates might not change much with the software package and stopping criteria. Second, if instead 1 covariate causes complete separation, all other log odds ratios would become inestimable (nonidentifiable), and packages usually report arbitrary values. Complete separation is indicated by a model likelihood of 1 (log-likelihood of 0). Third, even with only quasicomplete separation, there can be cases where other log odds ratios cannot be estimated. In Web Appendix 4 we illustrate this with an artificial example of quasicomplete separation: For the data specified in Web Table 1 and illustrated in Web Figure 2, the separation plane is unique and the glm function in R and the LOGISTIC procedure in SAS yield the same log odds ratios for the nonseparation-causing covariate. However, if we slightly modify the data in Web Table 1 such that the separation plane is not unique anymore, but there is still quasicomplete separation (Web Figure 3), then different software packages yield different results for the nonseparation-causing covariate. All 3 cases outlined above are caused by data sparsity, and thus the solutions based on penalization that we explain below will improve not only the estimate for the separation-causing covariate but also the estimates of other odds ratios.

SOLUTIONS TO SEPARATION

Encountering separation in a data set, one should first clarify whether the problem can be removed by a sensible revision of the data (8). Typical examples of modeling strategies that often give rise to separation are the categorization of continuous variables or the use of many categories for nominal variables. In the following, we assume that separation cannot be removed by revising the data.


Table 2. Estimates of the Effect of Diaphragm Use on Urinary Tract Infection, Adjusted for 8 Additional Covariates, in the Data Reported by Foxman et al.,a 1997

    Method                                    Log Odds Ratio   95% CI                  Odds Ratio
    ML with SPSS 22 (Wald CIs)                20.9             (−27,424.3, 27,466.2)   1,235,862,779
    ML with R 3.2.2 (Wald CIs)                16.2             (−1,565.5, 1,597.9)     11,157,742
    ML with SAS 9.4 (Wald CIs)b               15.1             (−1,497.4, 1,527.7)     3,753,745
    ML with SAS 9.4 (PL CIs)c                 15.1             (0.9, )                 3,753,745
    ML with Stata 14d
    Exact logistic regression (exact CIs)e    2                (1.2, infinity)         7.3
    Firth penalization (PL CIs)f              2.6              (0.3, 7.5)              13.2
    Cauchy(0,2.5) priors (Wald CIs)g          2.8              (−0.2, 5.8)             15.8
    log-F(1,1) priors (PL CIs)h               2.5              (0.3, 7.4)              12.3
    Ridge regressioni                         2.5                                      12
    LASSO regressionj                         3.3                                      28.2

Abbreviations: CI, confidence interval; LASSO, least absolute shrinkage and selection operator; ML, maximum likelihood; PL, profile likelihood.
a Foxman et al. (9).
b Odds ratio reported as >999.999.
c Fitted in SAS with the parameter CLPARM = PL in the model statement; no value is provided for the infinite upper limit; odds ratio is reported as >999.999.
d Drops the covariate from the model and fits a model for those who do not use a diaphragm.
e Fitted in SAS using the EXACT statement in PROC LOGISTIC; median unbiased point estimate with exact confidence interval; needed a larger amount of memory than the 2 GB available with the default settings of SAS.
f Fitted in SAS using FIRTH in the MODEL statement of PROC LOGISTIC. The Wald confidence interval for the odds ratio (0.5, 352.9) is far from the profile-likelihood confidence interval and includes parity. SAS also provides a Wald P value of 0.123.
g Fitted in R using Cauchy(0,2.5) priors for coefficients and a Cauchy(0,10) prior for the intercept in the function bayesglm from the package arm (25).
h Fitted in R using ML estimation on augmented data; intercept not included in the prior.
i,j Fitted in R using the package glmnet (34). The penalty parameter was chosen by cross-validation with the deviance as decision criterion; confidence intervals are not supplied. The LASSO model retained all 9 covariates.

Stata has a simple built-in solution to the problem: restriction. For example, in the UTI example it automatically restricts analysis to subjects who were not exposed to diaphragm use, stating that the outcome of exposed subjects can already be predicted perfectly by the diaphragm variable, and removes the variable and the exposed subjects from model estimation. If the model is correctly specified, this approach prevents confounding by the omitted covariate. Nonetheless, it is inefficient insofar as it attributes all outcomes of the left-out subjects entirely to the separation-causing covariate, without allowing that other covariates might have contributed to those outcomes as well. It also does not address the general problem of data sparsity, which causes overestimation of coefficients and variances from ML even if no separation occurred (14). The solutions below address sparse-data bias and construct confidence intervals for the separating coefficient from the full data. These confidence intervals are important because the Stata approach does not answer the original question about diaphragm use, given that we regard the observed separation as reflecting nothing more than bad luck in sampling: using diaphragms is neither necessary nor sufficient for UTI, thus ruling out an infinite odds ratio.

One might address separation using exact logistic regression (15). Exact logistic regression was developed to obtain tests of regression coefficients for which the type I error probabilities are guaranteed not to exceed the nominal levels. Inversion of these exact tests yields confidence intervals with analogous properties. Exact logistic regression can provide finite "median unbiased" point estimates even when the ML estimate is infinite. Unfortunately, these estimates can behave unexpectedly with extremely sparse data (16, 17). Exact confidence intervals and P values are exact only in the sense that they are derived from exact distributions. Because of discreteness, standard exact methods tend to produce excessively large P values and excessively wide, conservative confidence intervals (6). Further disadvantages of exact statistics seriously limit their use in practice (e.g., they are computationally intensive, especially with large sample sizes or many covariates, and may break down completely, especially with continuous covariates). In the UTI example, exact logistic regression with all 9 covariates gives a median unbiased log odds ratio estimate of 2 for diaphragm use (Table 2). The corresponding exact 95% confidence interval for the odds ratio ranges from 1.2 to infinity, favoring a positive association but with no precision.

Likelihood modifications can provide better solutions. If the log-likelihood to be maximized is slightly modified by adding a suitable "penalty" term, infinite coefficient estimates can be prevented. Several such penalties have been suggested and were motivated by different aims. Some penalties reduce bias or mean-squared error of estimates. Others are motivated from a Bayesian perspective, incorporating information that infinite coefficients are wrong, as in the diaphragm-UTI example (18). Standard additive log-likelihood modifications parallel prior distributions for the coefficients that are unimodal and symmetric about 0.
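In symbols (our notation, not the article's): writing l(β) for the log-likelihood, the methods below maximize a penalized version

    \ell^{*}(\beta) = \ell(\beta) + p(\beta), \quad \text{with, e.g.,}\quad
    p(\beta) = \tfrac{1}{2}\log\det I(\beta) \ \text{(Firth)}, \quad
    p(\beta) = -\lambda \sum_{j} \beta_j^{2} \ \text{(ridge)}, \quad
    p(\beta) = -\lambda \sum_{j} |\beta_j| \ \text{(LASSO)},

where I(β) is the Fisher information matrix and λ ≥ 0 is a tuning constant. Bayesian variants take p(β) to be the log of a prior density (e.g., Cauchy or log-F), so that maximizing l* yields posterior modes.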


Solution via Firth penalization

Firth suggested a likelihood penalty that reduces the bias of ML estimators in generalized linear models (2, 19) and solves the separation problem in logistic regression (16). Simulation studies find that it reduces bias and mean squared error relative to ML. Notably, Rainey and McCaskey (20) report greatly reduced mean squared error for Firth's method. Further numerical comparisons of Firth's and other methods under small event rates can be found in a recent report by Puhr et al. (21), who proposed a modification of Firth penalization that further reduces mean squared error. Others, such as Rahman and Sultana (22), have presented simulations that appear biased by omission of separated samples (23, 24).

For inference, Heinze and Schemper (16) and Heinze (6) proposed the use of profile penalized likelihood confidence intervals and demonstrated their superior coverage and power compared with Wald-type estimates, even in pathologically sparse-data situations.
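In R, Firth's method with profile penalized likelihood intervals is implemented in the package logistf; a minimal sketch with hypothetical variable names:

    library(logistf)  # Firth-penalized logistic regression (16)
    fit <- logistf(y ~ x1 + x2, data = dat)  # 'dat', 'y', 'x1', 'x2' are placeholders
    summary(fit)      # estimates with profile penalized likelihood CIs (the default)
    exp(coef(fit))    # finite odds-ratio estimates even under separation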
Solution via Cauchy priors

Gelman et al. (25) proposed using independent Cauchy prior distributions with scale inflated by a factor of 2.5 for each coefficient. They investigated its performance on several data sets from different fields, using cross-validation. Their approach is fully automated in the R package arm (25).
Solution via log-F(1,1) priors

Another way to penalize the likelihood or include prior information is to use data augmentation priors (i.e., priors represented by pseudo-observations added to the data). This representation makes the analysis simplify to ML estimation using the augmented data. In particular, Greenland et al. (14, 26–29) and Discacciati et al. (30) recommend log-F(m,m) or normal priors for logistic regression, suggesting the log-F(1,1) prior when there is very little known about a coefficient (28). They also strongly advise using profile-likelihood to obtain confidence intervals and P values in sparse situations such as separation (14, 29, 30).
Ridge and LASSO regression

The methods discussed above are examples of what are sometimes called regularization methods, which remove certain pathologies such as infinite estimates. In particular, they are examples of shrinkage methods, which on average pull the original estimates toward zero. Other examples include ridge and least absolute shrinkage and selection operator (LASSO) regression, which penalize the log-likelihood by subtracting a multiple of the sum of squared or absolute coefficients, respectively, excluding the intercept from the sum (31, 32). These correspond respectively to penalization using normal and double-exponential (Laplace) prior densities.

A crucial step is the choice of the multiplier for the sum (known as the "tuning parameter"). Typical implementations estimate it from the data (e.g., by applying cross-validation (33)), as in the R package glmnet (34). With separation, however, such estimation often yields a multiplier equal to 0 and thus reverts to ordinary ML, resulting in infinite estimates. In contrast, if the multiplier can be set in advance, both methods yield finite coefficients.

Because their primary aim is accurate prediction, which is achieved by introducing bias of log odds ratios towards zero, one cannot compute valid confidence intervals from ridge regression using the ordinary Wald method, while LASSO will usually drop confounders and so will not smoothly estimate coefficients in a causal outcome model (thus invalidating the confidence intervals). Ridge regression with a known tuning parameter is, however, easily duplicated by data augmentation by recognizing the parameter as the precision (inverse variance) of a zero-centered normal prior. This allows one to specify the parameter from prior information about coefficient sizes and then compute profile-likelihood intervals from standard software using data augmentation (29).
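A sketch using glmnet (34), assuming a numeric covariate matrix X and a 0/1 outcome vector y (both placeholders); alpha = 0 gives ridge and alpha = 1 gives LASSO, and lambda is the tuning multiplier just discussed:

    library(glmnet)
    cv  <- cv.glmnet(X, y, family = "binomial", alpha = 0)  # tune ridge by CV (deviance)
    fit <- glmnet(X, y, family = "binomial", alpha = 0, lambda = cv$lambda.min)
    coef(fit)  # shrunken log odds ratios; no confidence intervals are supplied
    # Under separation, cross-validation may drive lambda toward 0 (back to ML);
    # fixing lambda in advance guarantees finite coefficients.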
Puhr et al. (21) compared these proposals to deal with the separation problem in logistic regression with rare events. They observed that the original and modified Firth penalization and Cauchy priors gave very good point-estimate and confidence interval accuracy, closely followed by the log-F(1,1) prior. To facilitate use of these methods, Web Appendix 5 provides instructions for their implementation by means of univariate analysis of the UTI study.

Table 2 shows point and interval estimates of adjusted odds ratios for the UTI example. There were clear differences between ML and penalized estimation, and between penalization methods that use Wald confidence intervals (Cauchy priors) and those using profile-likelihood confidence intervals (Firth and log-F(1,1) priors) (14, 16, 29, 30). The latter confidence intervals can accommodate asymmetry of the penalized log-likelihood, which occurs as a result of separation.

The inflated Cauchy prior led to a larger odds ratio than the Firth and log-F(1,1) methods. The undershrinkage of infinite ML estimates for this Cauchy prior is expected given that its 95% prior interval extends into the trillions, whereas the log-F(1,1) interval extends "only" to 648 (28). In situations in which such huge effects are clearly absurd (e.g., dietary surveys), one can use stronger penalizations: For example, using a log-F(2,2) prior (95% prior odds ratio limits 1/39, 39), the diaphragm odds ratio would be 6.4 (profile-likelihood limits: 1.1, 124.5).


DISCUSSION

We have explained why results from software packages may differ when separation occurs, and how researchers should be alerted to sparsity and separation by huge log odds ratio estimates accompanied by huge standard errors. The usual cause of separation is extreme sparsity in the observed data (i.e., presence of groups of subjects that are defined by their covariate values in which only events or only nonevents are observed (8, 14, 28, 35)). Sparsity is common in small samples but also arises in large data sets when the number of parameters is high relative to sample size or there is an exposure with very low frequency or a very strong effect on the outcome (as in the UTI example). Thus, while larger sample sizes are less likely to encounter separation, bias due to sparsity may remain (14).

We recommend first running ML analyses to detect problems, which should be reported so that readers can see the effect of further adjustments such as penalization. Even if the separation-causing covariate is not of interest, the ML estimates for the remaining covariates derived by omitting subjects (as in Stata) may still suffer sparse-data bias. Therefore, we advise also presenting analyses extended to all subjects using penalized estimation (14). We have reviewed several such methods, which can be derived both from Bayesian and frequentist shrinkage considerations. Superiority of these methods to ML in sparse-data analyses, especially with separation, has been seen in several simulation studies (16, 21, 25).

When the accuracy of risk prediction is central, methods that focus on prediction error (such as LASSO and traditional ridge regression) are useful (33, 36). In contrast, when the parameters are the target, it seems more appropriate to use methods designed to minimize error in their estimation, such as coefficient penalization as discussed above. See Hastie et al. (37, p. 223) and Shmueli (38) for a deeper discussion of "predictive" versus "explanatory" modeling.

A crucial difference among the above methods is their different sensitivity to coding of covariates. This can be seen by changing covariate scales (e.g., with height expressed in inches instead of centimeters, the coefficient refers to a 1-inch increase instead of a 1-cm increase, and so will be 2.54 times larger (1 inch = 2.54 cm)). For ordinary ML and the Firth method, the inch coefficient will indeed be 2.54 times the centimeter coefficient. This is also true for the Cauchy prior in the R package arm because it scales covariates to standard deviation units, and the cm standard deviation is 2.54 times the inch standard deviation. Unfortunately, the standard deviation is estimated from the data set itself, so that different data sets will lead to different standard deviation units and hence to differences in results that are irrelevant to any biological effect (39, 40). For this reason, the log-F method does not use standard deviation units, so the prior information it contains depends on the coefficient units, and the results from using different units will not have a simple relation; nonetheless, unlike the results from the R package arm, the log-F results will not depend on the covariate sample standard deviation.

The UTI example shows how point estimates from different penalties may look similar for practical purposes. Nonetheless, Wald and profile-likelihood confidence intervals can look rather different, which is unsurprising given that the Wald (but not the profile-likelihood) confidence interval assumes symmetric estimate distributions, an assumption that is violated in sparse data. Along with small samples, very large (true) odds ratios are a major cause of separation. This possibility is better reflected in profile penalized likelihood confidence intervals, which often extend out to very large values, and has also been seen in simulation studies and theory. We thus strongly recommend profile-likelihood limits be used with the methods discussed here; they are now available in most major software.

We have focused on ordinary logistic regression. Nonetheless, separation and related sparse-data problems can also occur with other discrete-outcome models—including probit, proportional hazards, and multinomial, ordinal, and conditional logistic regressions—and can be addressed by penalization (17, 26, 27, 29, 41, 42).

ACKNOWLEDGMENTS

Author affiliations: Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran (Mohammad Ali Mansournia); Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria (Angelika Geroldinger, Georg Heinze); Department of Epidemiology, Fielding School of Public Health, University of California Los Angeles, Los Angeles, California (Sander Greenland); and Department of Statistics, University of California Los Angeles, Los Angeles, California (Sander Greenland).

This work was supported by the Austrian Science Fund (FWF) (award I 2276).

We thank Dr. Foxman et al. for allowing us to use the urinary tract infection data, and 2 anonymous reviewers for comments that led to improvements on an earlier version of the manuscript.

Conflict of interest: none declared.

REFERENCES

1. Hirji KF, Tsiatis AA, Mehta CR. Median unbiased estimation for binary data. Am Stat. 1989;43(1):7–11.
2. Firth D. Bias reduction of maximum likelihood estimates. Biometrika. 1993;80(1):27–38.
3. Schaefer RL. Bias correction in maximum likelihood logistic regression. Stat Med. 1983;2(1):71–78.
4. Bull SB, Greenwood CM, Hauck WW. Jackknife bias reduction for polychotomous logistic regression. Stat Med. 1997;16(5):545–560.
5. Cordeiro GM, Cribari-Neto F. On bias reduction in exponential and non-exponential family regression models. Commun Stat Simul Comput. 1998;27(2):485–500.
6. Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat Med. 2006;25(24):4216–4226.
7. Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika. 1984;71(1):1–10.
8. Allison P. Convergence problems in logistic regression. In: Altman M, Gill G, McDonald MP, eds. Numerical Issues in Statistical Computing for the Social Scientist. Hoboken, NJ: John Wiley & Sons; 2004:238–252.
9. Foxman B, Marsh J, Gillespie B, et al. Condom use and first-time urinary tract infection. Epidemiology. 1997;8(6):637–641.
10. Salama M, Andrukhova O, Hoda MA, et al. Concomitant endothelin-1 overexpression in lung transplant donors and recipients predicts primary graft dysfunction. Am J Transplant. 2010;10(3):628–636.
11. Potter DM. A permutation test for inference in logistic regression with small- and moderate-sized data sets. Stat Med. 2005;24(5):693–708.


12. Cole SR, Chu HT, Greenland S. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am J Epidemiol. 2014;179(2):252–260.
13. Vaeth M. On the use of Wald's test in exponential families. Int Stat Rev. 1985;53(2):199–214.
14. Greenland S, Mansournia MA, Altman DG. Sparse data bias: a problem hiding in plain sight. BMJ. 2016;352:i1981.
15. Agresti A. Categorical Data Analysis. 3rd ed. Hoboken, NJ: John Wiley & Sons; 2013.
16. Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002;21(16):2409–2419.
17. Heinze G, Puhr R. Bias-reduced and separation-proof conditional logistic regression with small or sparse data sets. Stat Med. 2010;29(7–8):770–777.
18. Lesaffre E, Lawson AB. Bayesian Biostatistics. Chichester, UK: John Wiley & Sons; 2012.
19. Kosmidis I, Firth D. Bias reduction in exponential family nonlinear models. Biometrika. 2009;96(4):793–804.
20. Rainey C, McCaskey K. Estimating Logit Models with Small Samples. carlislerainey.com/papers/small.pdf. Accessed July 7, 2017.
21. Puhr R, Heinze G, Nold M, et al. Firth's logistic regression with rare events: accurate effect estimates and predictions? Stat Med. 2017;36(14):2302–2317.
22. Rahman MS, Sultana M. Performance of Firth- and logF-type penalized methods in risk prediction for small or sparse binary data. BMC Med Res Methodol. 2017;17(1):33.
23. Heinze G. Comment on "Bias reduction in conditional logistic regression". Stat Med. 2011;30(12):1466–1467.
24. Heinze G. Comment on "A comparative study of the bias corrected estimates in logistic regression". Stat Methods Med Res. 2012;21(6):660–661.
25. Gelman A, Jakulin A, Pittau MG, et al. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Stat. 2008;2(4):1360–1383.
26. Greenland S. Generalized conjugate priors for Bayesian analysis of risk and survival regressions. Biometrics. 2003;59(1):92–99.
27. Greenland S. Bayesian perspectives for epidemiological research. II. Regression analysis. Int J Epidemiol. 2007;36(1):195–202.
28. Greenland S, Mansournia MA. Penalization, bias reduction, and default priors in logistic and related categorical and survival regressions. Stat Med. 2015;34(23):3133–3143.
29. Sullivan SG, Greenland S. Bayesian regression in SAS software. Int J Epidemiol. 2013;42(1):308–317.
30. Discacciati A, Orsini N, Greenland S. Approximate Bayesian logistic regression via penalized likelihood by data augmentation. Stata J. 2015;15(3):712–736.
31. Le Cessie S, Van Houwelingen J. Ridge estimators in logistic regression. J R Stat Soc Ser C Appl Stat. 1992;41(1):191–201.
32. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267–288.
33. Steyerberg E. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York, NY: Springer-Verlag; 2008.
34. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
35. Greenland S, Schwartzbaum JA, Finkle WD. Problems due to small samples and sparse data in conditional logistic regression analysis. Am J Epidemiol. 2000;151(5):531–539.
36. Harrell F. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. New York, NY: Springer-Verlag; 2015.
37. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2nd ed. New York, NY: Springer-Verlag; 2009.
38. Shmueli G. To explain or to predict? Stat Sci. 2010;25(3):289–310.
39. Greenland S, Maclure M, Schlesselman JJ, et al. Standardized regression coefficients: a further critique and review of some alternatives. Epidemiology. 1991;2(5):387–392.
40. Greenland S, Schlesselman JJ, Criqui MH. The fallacy of employing standardized regression coefficients and correlations as measures of effect. Am J Epidemiol. 1986;123(2):203–208.
41. Heinze G, Schemper M. A solution to the problem of monotone likelihood in Cox regression. Biometrics. 2001;57(1):114–119.
42. Santos Silva JMC, Tenreyro S. Poisson: some convergence issues. Stata J. 2011;11(2):207–212.
