Logistic Regression
School of Physical Education and Sports - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Abstract
Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After definition of the technique, the basic interpretation of the results is highlighted and then some special issues are discussed.
Key words: regression analysis; logistic regression; odds ratio; variable selection
Introduction
One of the previous topics in Lessons in biostatistics presented the calculation, usage and interpretation of the odds ratio statistic and demonstrated the simplicity of the odds ratio in clinical practice (1). The example used there was from a fictional study comparing the effects of two drug treatments for Staphylococcus aureus (SA) endocarditis. The original data are reproduced in Table 1. Following (1), the odds ratio (OR) of death for patients on standard treatment can be calculated as (152 × 103) / (248 × 17) = 3.71, meaning that patients on standard treatment have a chance (odds) of dying 3.71 times that of patients on the new treatment. For more detailed information about basic OR interpretation, please see McHugh (1).

However, a more complex problem can arise when, instead of the association between one explanatory variable and one response variable (e.g., type of treatment and death), we are interested in the joint relationship between two or more explanatory variables and the response variable. Let us suppose we are now interested in the relationship between age and death in the same group of SA endocarditis patients. Table 2 presents the new fictional data. Remember that these are not real data and that the relationships described here are not meant to reflect any real associations.

Again, we can calculate an OR as (120 × 134) / (217 × 49) = 1.51, meaning that the chance of death of a younger individual (between 30 and 45 years old) is about 1.5 times the chance of death of an older individual (between 46 and 60 years old). Now we have two variables related to the event of interest (death) in individuals with SA endocarditis. But in the presence of more than one explanatory variable, separately testing each independent variable against the response variable introduces bias into the research (2). Performing multiple tests on the same data inflates the alpha, thus increasing Type I error rates while missing possible confounding effects. So, how do we know
                          Standard treatment   New treatment   Totals   OR
Older         Died                43                  6           49
(46-60 yrs)   Survived           100                 34          134    2.44
              Totals             143                 40          183
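The odds ratios quoted so far can be checked with a few lines of Python. This is only a minimal sketch: the helper name `odds_ratio` is illustrative, and the count of 17 deaths under the new treatment is an assumption, chosen because it reproduces the reported OR of 3.71 and matches the 169 total deaths (120 + 49) implied by the age data.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a: events in the group of interest,      b: events in the comparison group
    c: non-events in the group of interest,  d: non-events in the comparison group
    """
    return (a * d) / (b * c)

# Treatment: standard vs new (17 new-treatment deaths is an assumption, see above)
or_treatment = odds_ratio(152, 17, 248, 103)

# Age: younger vs older
or_age = odds_ratio(120, 49, 217, 134)

# Treatment within the older stratum (the table fragment above)
or_treatment_older = odds_ratio(43, 6, 100, 34)

print(round(or_treatment, 2), round(or_age, 2), round(or_treatment_older, 2))
# prints: 3.71 1.51 2.44
```

Note that the treatment OR in the older stratum (2.44) differs from the crude OR (3.71), which is the kind of discrepancy that motivates modeling both variables together.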
confounding effects, as was demonstrated in the example above when the effect of treatment on death probability was partially hidden by the effect of age.

A logistic regression will model the chance of an outcome based on individual characteristics. Because chance is a ratio, what will actually be modeled is the logarithm of the chance, given by:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m \qquad (2)$$

where $\pi$ indicates the probability of an event (e.g., death in the previous example), and the $\beta_i$ are the regression coefficients associated with the reference group and the $x_i$ explanatory variables. At this point, an important concept must be highlighted. The reference group, represented by $\beta_0$, is constituted by those individuals presenting the reference level of each and every variable $x_1, \dots, x_m$. To illustrate, considering our previous example, these are the older individuals who received the new treatment. Later, we will discuss how to set the reference level.

Logistic regression step-by-step

Let us apply a logistic regression to the example described before to see how it works and how to interpret the results. Let us build a logistic regression model that includes all explanatory variables (age and treatment). This kind of model, with all variables included, is called a "full model" or a "saturated model" and is the best starting option if you have a good sample size and a small number of variables to include (issues about sample size, variable inclusion and selection, and others will be discussed in the next section; for now, we will keep it as simple as possible).

The result of our model can be seen below, in Table 4.

TABLE 4. Results from multivariate logistic regression model containing all explanatory variables (full model).

Now all we have to do is interpret this output, beginning with the intercept term, which corresponds to our $\beta_0$. Taking the exponential of $\beta_0$, we have the mean odds of death of individuals in the reference category. So, $\exp(\beta_0) = \exp(-2.121) = 0.12$ is the chance of death among those individuals who are older and received the new treatment. A small difference in interpretation appears when we go to the next coefficients. Individuals who also received the new treatment but are younger have a mean chance of death $\exp(\beta_1) = \exp(0.454) = 1.58$ times the chance of reference individuals. Similarly, older individuals who received the standard treatment have a mean chance of death $\exp(\beta_2) = \exp(1.333) = 3.79$ times the chance of reference individuals. But what if individuals are younger and received the standard treatment? Then we have to calculate $\exp(\beta_1 + \beta_2) = \exp(1.787) = 5.97$ times the mean chance of reference individuals.

These are the basics of logistic regression interpretation. However, some issues appear during the analysis, and solutions are not always readily available. In the next section we will discuss how to deal with them.

Logistic regression pitfalls

Odds and probabilities

First, it is imperative to understand that odds and probabilities, although sometimes used as synonyms, are not the same. Probability is the ratio between the number of events favorable to some outcome and the total number of events. On the other hand, odds are the ratio between probabilities: the probability of an event favorable to an outcome and the probability of an event against
Sperandei S. Logistic regression analysis
the same outcome. Probability is constrained between zero and one, and odds are constrained between zero and infinity. An odds ratio is the ratio between two odds. The importance of this is that a large odds ratio (OR) can represent a small probability and vice versa. Let us go back to our example to make this point clear.

The reference group (older individuals receiving the new treatment) showed a chance of death approximately equal to 0.12. Using:

$$\text{probability} = \frac{\text{chance}}{1 + \text{chance}} \qquad (3)$$

it can be shown that the mean probability of death of this group is 0.11. Knowing that the mean chance of death in the group of younger individuals who received the new treatment is 1.58 times the mean chance of the reference group, the chance of death in this group can be estimated as 1.58 × 0.12 = 0.19 or, using Equation 3 above, a probability of death equal to 0.16. Similarly, the mean chance of death of an older individual receiving the standard treatment is 3.79 times that of the reference group, which means a chance of death equal to 3.79 × 0.12 = 0.45, or a probability of death equal to 0.31. Finally, younger individuals receiving the standard treatment have a chance of death equal to 5.97 × 0.12 = 0.72, or a probability of death equal to 0.42.

Therefore, as demonstrated, a large OR only means that the chance of a particular group is much greater than that of the reference group. But if the chance of the reference group is small, even a large OR can still indicate a small probability.

Continuous explanatory variables or variables with more than two levels

Now it is time to think about what to do if explanatory variables are not binomial, as before. When an explanatory variable is multinomial, we must build n-1 binary variables (called dummy variables) for it, where n indicates the number of levels of the variable. A dummy variable is just a variable that assumes the value one if the subject presents the specified category and zero otherwise. For instance, a variable named "satisfaction" that presents three levels ("Low", "Medium" and "High") needs to be represented by two dummy variables ($x_1$ and $x_2$) in the model. The individuals at the reference level, let's say "Low", will present zeros in both dummy variables (Equation 4a), while individuals with "Medium" satisfaction will have a one in $x_1$ and a zero in $x_2$ (Equation 4b). The opposite will occur with individuals with "High" satisfaction (Equation 4c). Usually, statistical software does this automatically and the reader does not have to worry about it.
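$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 0 + \beta_2 \times 0 = \beta_0 \qquad (4a)$$

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 1 + \beta_2 \times 0 = \beta_0 + \beta_1 \qquad (4b)$$

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 0 + \beta_2 \times 1 = \beta_0 + \beta_2 \qquad (4c)$$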
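The same 0/1 coding underlies the two binary variables of the worked example, so the group-by-group odds and probabilities discussed under "Odds and probabilities" can be recomputed from the coefficients quoted in the text. The sketch below is illustrative only: the constant and function names are assumptions, the coding (x1 = 1 for younger individuals, x2 = 1 for standard treatment) follows the interpretation given for the full model, and the coefficients are the rounded values -2.121, 0.454 and 1.333.

```python
import math
from itertools import product

# Rounded coefficients quoted in the text for the full model
BETA_0 = -2.121   # intercept: older individuals on the new treatment (reference group)
BETA_1 = 0.454    # dummy x1 = 1 for younger individuals
BETA_2 = 1.333    # dummy x2 = 1 for standard treatment

def odds_and_probability(x1, x2):
    """Return (odds, probability) of death for the dummy values x1 and x2."""
    log_odds = BETA_0 + BETA_1 * x1 + BETA_2 * x2   # linear predictor (Equation 2)
    odds = math.exp(log_odds)
    probability = odds / (1 + odds)                  # odds to probability (Equation 3)
    return odds, probability

results = {}
for x1, x2 in product((0, 1), repeat=2):
    results[(x1, x2)] = odds_and_probability(x1, x2)
    odds, probability = results[(x1, x2)]
    print(f"x1={x1} x2={x2}  odds={odds:.2f}  probability={probability:.2f}")
```

Up to rounding, the four lines should reproduce the odds and probabilities reported for the reference group, the younger group on the new treatment, the older group on standard treatment, and the younger group on standard treatment.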
While the interpretation of outputs from multinomial explanatory variables is straightforward and follows that of binomial explanatory variables, the interpretation of continuous variables is a bit more complex. The $\exp(\beta)$ of a continuous variable represents the multiplicative change in the chance of an event associated with each unit increment in the explanatory variable. For instance, the variable "Age" in our previous example, if instead of being binomial (older × younger) it were continuous, would produce the following result (Table 5).

TABLE 5. Results from multivariate logistic regression model containing all explanatory variables (full model), using AGE as a continuous variable.

The first thing to point out is that the "Age2" coefficient (Age here taken as a continuous variable) is now negative. This occurs because the older the individual (in years), the smaller the chance of death. If we take $\exp(-0.294) = 0.75$, it shows us that for each additional year of age the chance of dying of SA endocarditis decreases by 25%. The intercept now represents individuals who received the new treatment and are "zero years old". Take extra care when interpreting logistic regression results using continuous explanatory variables.

Variable inclusion and selection

A major problem when building a logistic model is to select which variables to include. Researchers usually collect as many variables as possible in their research instrument, then put all of them into the model and try to find something "significant". This approach increases the likelihood of two situations. First, one or more variables are statistically "significant", but the researcher has no theory to link the "significant" variable to the event of interest being modeled. Remember that you are working with samples and spurious results can occur. The second situation is that a model with more variables has less statistical power. So, if there is an association between one explanatory variable and the occurrence of an event, the researcher can miss this effect because saturated models (those that contain all possible explanatory variables) are not sensitive enough to detect it. So the researcher must be very cautious with the selection of variables to include in the model.

We can start a regression using either a full (saturated) model or a null (empty) model, which starts only with the intercept term. In the first case, variables need to be dropped one by one, preferably dropping the least significant one. This is the preferred strategy simply because it is easier to handle, while the second requires all candidate variables to be tested at each step in order to select the best one to include. On the other hand, if too many variables are included at once in a full model, significant variables could be dropped due to low statistical power, as mentioned above.

As a rule, if we have a large sample size, let's say at least ten individuals per variable, we can try to include all explanatory variables in the full model. However, if we have a limited sample size in relation to the number of candidate variables, a pre-selection should be performed instead. One way to do that is to test all variables beforehand, using models with just one explanatory variable at a time (univariate models), and afterwards include in the multivariate model all variables that have shown a relaxed P-value (for instance, P < 0.25). There is no reason to worry about a rigorous P-value criterion at this stage, because this is just a pre-selection strategy and no inference will derive from this step. This relaxed P-value criterion will allow reducing the initial number of variables in the model while reducing the risk of missing important variables (4,5).

There is some debate about the appropriate strategy for variable selection (6), and the last one is just one among others. It is easy and intuitive. More elaborate methods are available, but whatever the method, it is very important that researchers are aware of the procedure applied and do not just press some buttons on software.

Reference group setup

There are some variables for which the reference level is almost automatically determined. For instance, for our response variable named "Result", for which the outcomes are "died" and "survived", the reference level is almost always set to "survived", since the interest is focused on variables associated with the outcome, death.
On the other hand, some variables have no clear reference level but present ordered levels, and the reference level will usually be one of the endpoints or, less frequently, the central level. This is the case of variables assessed using Likert scales, a psychometric scale commonly used in research that employs questionnaires (for instance, the degree of satisfaction with some product scaled as "satisfied", "neither satisfied nor unsatisfied" or "unsatisfied"). However, some variables have no ordered levels and no clear reference level. This can occur with geographic region. And then the question arises: what region should I use as reference?

The answer is that there is no answer... However, reference level selection can change the model estimation in some cases. It is important to remember that all results (and significant effects) presented are relative to the reference level. To make this point clearer, let's see an example. In a nationwide survey about the occurrence of diabetic ketoacidosis, an individual's geographic region was found to be significantly related to the probability of diabetic ketoacidosis at the onset of diabetes (7). In this work, the north/northeast region was set as reference and the southeast region was the only one to be statistically different relative to the reference. The results, showing just the region variable, are below (Table 6).

If we instead use the Middle-West as the reference level, the following result emerges (again, only geographic region is shown) (Table 7).

Finally, if we use the southeast region as reference level, we obtain the following results (Table 8).
TABLE 6. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). North/Northeast region used as reference level.
TABLE 7. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Middle-West region used as reference level.
TABLE 8. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Southeast region used as reference level.
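To see why every estimate is "relative to the reference level", note that for a single categorical variable the odds ratio of a level against a new reference equals its odds ratio against the old reference divided by the old-reference odds ratio of the new reference level. The sketch below illustrates this with invented numbers: the odds ratios are hypothetical and are not the values of Tables 6 to 8, and the function name `change_reference` is an assumption made for the example.

```python
import math

# Hypothetical odds ratios relative to the North/Northeast reference level.
# Invented for illustration only; NOT the values reported in Tables 6 to 8.
or_vs_north = {
    "North/Northeast": 1.00,
    "Middle-West": 1.20,
    "Southeast": 0.60,
}

def change_reference(odds_ratios, new_reference):
    """Re-express odds ratios relative to another level of the same variable."""
    base = odds_ratios[new_reference]
    return {level: value / base for level, value in odds_ratios.items()}

or_vs_southeast = change_reference(or_vs_north, "Southeast")
for level, value in or_vs_southeast.items():
    # The coefficient is the log of the odds ratio against the chosen reference
    print(f"{level}: OR = {value:.2f}, coefficient = {math.log(value):.3f}")
```

Only the point estimates can be converted this way. Standard errors, confidence intervals and P-values depend on which pair of levels is being compared, so they have to come from the refitted model, which is why a level can be "significant" against one reference and not against another.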