Lessons in biostatistics

Understanding logistic regression analysis


Sandro Sperandei

School of Physical Education and Sports - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Corresponding author: ssperandei@gmail.com

Abstract
Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is to avoid confounding effects by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After the definition of the technique, the basic interpretation of the results is highlighted and then some special issues are discussed.
Key words: regression analysis; logistic regression; odds ratio; variable selection

Received: September 06, 2013 Accepted: November 26, 2013

Introduction
One of the previous topics in Lessons in biostatistics presented the calculation, usage and interpretation of the odds ratio statistic and clearly demonstrated the simplicity of the odds ratio in clinical practice (1). The example used then was from a fictional study comparing the effects of two drug treatments for Staphylococcus aureus (SA) endocarditis. The original data are reproduced in Table 1. Following (1), the odds ratio (OR) of death for patients using the standard treatment can be calculated as (152 × 103) / (248 × 17) = 3.71, meaning that patients on the standard treatment have a chance of death 3.71 times that of patients under the new treatment. For more detailed information about basic OR interpretation, please see McHugh (1).

However, a more complex problem can arise when, instead of the association between one explanatory variable and one response variable (e.g., type of treatment and death), we are interested in the joint relationship between two or more explanatory variables and the response variable. Let us suppose we are now interested in the relationship between age and death in the same group of SA endocarditis patients. Table 2 presents the new fictional data. Remember that these data are not real and that the relationships described here are not meant to reflect any real associations.

Again, we can calculate an OR as (120 × 134) / (217 × 49) = 1.51, meaning that the chance of death of a younger individual (between 30 and 45 years old) is about 1.5 times the chance of death of an older individual (between 46 and 60 years old). Now we have two variables related to the event of interest (death) in individuals with SA endocarditis. But in the presence of more than one explanatory variable, separately testing each independent variable against the response variable introduces bias into the research (2). Performing multiple tests on the same data inflates the alpha, thus increasing Type I error rates while missing possible confounding effects. So, how do we know

Biochemia Medica 2014;24(1):12-8 http://dx.doi.org/10.11613/BM.2014.003

©Copyright by Croatian Society of Medical Biochemistry and Laboratory Medicine. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc-nd/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Sperandei S. Logistic regression analysis

TABLE 1. Results from fictional endocarditis treatment study by McHugh (1).

            Standard treatment   New treatment   Totals
Died               152                 17          169
Survived           248                103          351
Totals             400                120          520

TABLE 2. Results from fictional endocarditis treatment study by McHugh looking at age (1).

            Younger (30-45 yrs)   Older (46-60 yrs)   Totals
Died               120                   49             169
Survived           217                  134             351
Totals             337                  183             520

whether the treatment effect on the endocarditis outcome is being masked by the effect of age? Let us take a look at the treatment effect as stratified by age (Table 3).

As Table 3 illustrates, the impact of treatment is higher in younger individuals, because the OR in the younger patients subgroup is higher than in the older patients subgroup. Therefore, it would be incorrect to simply look at the treatment results without considering the impact of age. The simplest way to solve this problem is to calculate some form of "weighted" OR (i.e., Mantel-Haenszel OR (3)), using Equation 1 below, where nᵢ is the sample size of age class i, and a, b, c and d are the table cells, as presented by McHugh (1).

$$\mathrm{OR}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i} = \frac{\dfrac{43 \times 34}{183} + \dfrac{109 \times 69}{337}}{\dfrac{100 \times 6}{183} + \dfrac{148 \times 11}{337}} = 3.74 \qquad (1)$$

It means that the weighted chance of death associated with the standard treatment is 3.74 times the chance of death of individuals taking the new treatment. However, as the number of explanatory variables increases, the complexity of these calculations can become nearly impossible to handle. Additionally, the Mantel-Haenszel OR, like the simple OR, admits only categorical explanatory variables. For instance, to use a continuous variable like age we need to set a breaking point to categorize it (in our case, arbitrarily set at 45 years old) and cannot use the real age. Determining breaking points is not always easy! But there is a better approach: using logistic regression instead.

Definition

Logistic regression works very similarly to linear regression, but with a binomial response variable. The greatest advantage when compared to the Mantel-Haenszel OR is the fact that you can use continuous explanatory variables and it is easier to handle more than two explanatory variables simultaneously. Although apparently trivial, this last characteristic is essential when we are interested in the impact of various explanatory variables on the response variable. If we look at multiple explanatory variables independently, we ignore the covariance among variables and are subjected to

TABLE 3. Effect of treatment on endocarditis stratified by age.

Age group             Outcome    Standard treatment   New treatment   Totals   OR
Older (46-60 yrs)     Died               43                  6           49
                      Survived          100                 34          134    2.44
                      Totals            143                 40          183
Younger (30-45 yrs)   Died              109                 11          120
                      Survived          148                 69          217    4.62
                      Totals            257                 80          337
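
The stratum-specific odds ratios of Table 3 and the weighted odds ratio of Equation 1 can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the cell labels (a = died on standard treatment, b = died on new treatment, c = survived on standard treatment, d = survived on new treatment) are an assumption chosen so that the arithmetic matches the numbers printed in the article.

```python
# Counts from Table 3, one tuple per age stratum:
# (a, b, c, d) = (died/standard, died/new, survived/standard, survived/new)
strata = {
    "older (46-60 yrs)": (43, 6, 100, 34),
    "younger (30-45 yrs)": (109, 11, 148, 69),
}


def odds_ratio(a, b, c, d):
    """Simple odds ratio of one 2x2 table: (a * d) / (b * c)."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Weighted odds ratio across strata: sum(a*d/n) / sum(b*c/n)."""
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


stratum_or = {name: odds_ratio(*cells) for name, cells in strata.items()}
or_mh = mantel_haenszel_or(strata.values())

for name, value in stratum_or.items():
    print(f"{name}: OR = {value:.2f}")
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

Rounded to two decimals, the results agree with the values reported in Table 3 (2.44 and 4.62) and with the 3.74 of Equation 1.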


confounding effects, as was demonstrated in the example above, where the effect of treatment on the probability of death was partially hidden by the effect of age.

A logistic regression will model the chance of an outcome based on individual characteristics. Because chance is a ratio, what will actually be modeled is the logarithm of the chance, given by:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m \qquad (2)$$

where π indicates the probability of an event (e.g., death in the previous example), and βᵢ are the regression coefficients associated with the reference group and the xᵢ explanatory variables. At this point, an important concept must be highlighted. The reference group, represented by β₀, is constituted by those individuals presenting the reference level of each and every variable x₁ to xₘ. To illustrate, in our previous example these are the older individuals who received the new treatment. Later, we will discuss how to set the reference level.

Logistic regression step-by-step

Let us apply a logistic regression to the example described before to see how it works and how to interpret the results. Let us build a logistic regression model that includes all explanatory variables (age and treatment). This kind of model, with all variables included, is called a "full model" or a "saturated model" and is the best starting option if you have a good sample size and a small number of variables to include (issues about sample size, variable inclusion and selection and others will be discussed in the next section; for now, we will keep it as simple as possible). The result of our model can be seen below, in Table 4.

Now all we have to do is interpret this output, beginning with the intercept term, which corresponds to our β₀. Taking the exponential of β₀ we have the mean odds of death of individuals in the reference category. So, exp(β₀) = exp(-2.121) = 0.12 is the chance of death among those individuals who are older and received the new treatment. A small difference in interpretation appears when we go to the next coefficients. Individuals who also received the new treatment but are younger have a mean chance of death exp(β₁) = exp(0.454) = 1.58 times the chance of reference individuals. Similarly, older individuals who received the standard treatment have a mean chance of death exp(β₂) = exp(1.333) = 3.79 times the chance of reference individuals. But what if individuals are younger and received the standard treatment? Then we have to calculate exp(β₁ + β₂) = exp(1.787) = 5.97 times the mean chance of reference individuals.

These are the basics of logistic regression interpretation. However, some issues appear during the analysis and solutions are not always readily available. In the next section we will discuss how to deal with them.

Logistic regression pitfalls

Odds and probabilities

First, it is imperative to understand that odds and probabilities, although sometimes used as synonyms, are not the same. Probability is the ratio between the number of events favorable to some outcome and the total number of events. On the other hand, odds are the ratio between probabilities: the probability of an event favorable to an outcome and the probability of an event against

TABLE 4. Results from multivariate logistic regression model containing all explanatory variables (full model).

Term                        β estimate   Standard error   P value
Intercept (β₀)                -2.121         0.303        <0.001
Age: Younger (β₁)              0.454         0.207         0.028
Treatment: Standard (β₂)       1.333         0.283        <0.001
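
As a rough check of how the coefficients in Table 4 translate into the chances (odds) and probabilities quoted in the text, the sketch below adds up the relevant coefficients for each of the four groups, exponentiates the sum, and applies the odds-to-probability conversion of Equation 3. The variable and function names are illustrative only and do not come from any statistical package.

```python
import math

# Coefficients as printed in Table 4 (reference group: older, new treatment)
beta0, beta_younger, beta_standard = -2.121, 0.454, 1.333


def odds_to_probability(odds):
    """Equation 3: probability = odds / (1 + odds)."""
    return odds / (1 + odds)


# Log-odds of each group: intercept plus the coefficients that apply to it
log_odds_by_group = {
    "older, new treatment (reference)": beta0,
    "younger, new treatment": beta0 + beta_younger,
    "older, standard treatment": beta0 + beta_standard,
    "younger, standard treatment": beta0 + beta_younger + beta_standard,
}

results = {}
for group, log_odds in log_odds_by_group.items():
    odds = math.exp(log_odds)
    results[group] = (odds, odds_to_probability(odds))
    print(f"{group}: odds = {odds:.2f}, probability = {results[group][1]:.2f}")
```

Computing each group's odds directly from the summed coefficients avoids the small rounding drift that appears when a rounded odds ratio is multiplied by a rounded reference odds by hand.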


the same outcome. Probability is constrained between zero and one, and odds are constrained between zero and infinity. The odds ratio, in turn, is the ratio between two odds. The importance of this is that a large odds ratio (OR) can represent a small probability and vice versa. Let us go back to our example to make this point clear.

The reference group (older individuals receiving the new treatment) showed a chance of death approximately equal to 0.12. Using:

$$\text{probability} = \frac{\text{chance}}{1 + \text{chance}} \qquad (3)$$

it can be shown that the mean probability of death in this group is 0.11. Knowing that the mean chance of death in the group of younger individuals who received the new treatment is 1.58 times the mean chance of the reference group, the chance of death in this group can be estimated as 1.58 × 0.12 = 0.19 or, using Equation 3 above, a probability of death equal to 0.16. Similarly, the mean chance of death of an older individual receiving the standard treatment is 3.79 times that of the reference group, which means a chance of death equal to 3.79 × 0.12 = 0.45 or a probability of death equal to 0.31. Finally, younger individuals receiving the standard treatment have a chance of death equal to 5.97 × 0.12 = 0.72 or a probability of death equal to 0.42.

Therefore, as demonstrated, a large OR only means that the chance of a particular group is much greater than that of the reference group. But if the chance of the reference group is small, even a large OR can still indicate a small probability.

Continuous explanatory variables or variables with more than two levels

Now it is time to think about what to do if explanatory variables are not binomial, as before. When an explanatory variable is multinomial, we must build n-1 binary variables (called dummy variables) for it, where n indicates the number of levels of the variable. A dummy variable is just a variable that assumes the value one if the subject presents the specified category and zero otherwise. For instance, a variable named "satisfaction" that presents three levels ("Low", "Medium" and "High") needs to be represented by two dummy variables (x₁ and x₂) in the model. The individuals at the reference level, let's say "Low", will present zeros in both dummy variables (Equation 4a), while individuals with "Medium" satisfaction will have a one in x₁ and a zero in x₂ (Equation 4b). The opposite will occur with individuals with "High" satisfaction (Equation 4c). Usually, statistical software does this automatically and the reader does not have to worry about it.

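$$
\begin{aligned}
\text{Low:} \quad & \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 0 + \beta_2 \times 0 = \beta_0 && \text{(a)} \\
\text{Medium:} \quad & \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 1 + \beta_2 \times 0 = \beta_0 + \beta_1 && \text{(b)} \\
\text{High:} \quad & \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \times 0 + \beta_2 \times 1 = \beta_0 + \beta_2 && \text{(c)}
\end{aligned}
\qquad (4)
$$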

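The dummy coding behind Equation 4 can be sketched in plain Python. This only illustrates the idea and is not how any particular statistical package implements it; the helper names and the numeric coefficients used in the last lines are hypothetical.

```python
# Levels of the "satisfaction" variable; the first one is the reference level
levels = ["Low", "Medium", "High"]


def dummy_code(value, levels):
    """Return the n-1 dummy variables (x1, x2, ...) for one observation."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One 0/1 indicator per non-reference level
    return [1 if value == level else 0 for level in levels[1:]]


def linear_predictor(value, beta0, betas):
    """Log-odds as in Equation 4: beta0 + beta1 * x1 + beta2 * x2."""
    x = dummy_code(value, levels)
    return beta0 + sum(b * xi for b, xi in zip(betas, x))


coding = {value: dummy_code(value, levels) for value in levels}
for value, x in coding.items():
    print(value, x)

# Hypothetical coefficients, only to show which terms survive for each level
example = {value: linear_predictor(value, 1.0, [0.5, -0.25]) for value in levels}
```

With "Low" as the reference level, both dummies are zero and only the intercept remains, which is why every other coefficient is read relative to that level.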
While the interpretation of outputs from multinomial explanatory variables is straightforward and follows that of binomial explanatory variables, the interpretation of continuous variables is a bit more complex. The exp(β) of a continuous variable represents the multiplicative change in the chance of an event for each unit increment in the explanatory variable. For instance, if the variable "Age" in our previous example were continuous instead of binomial (older × younger), the model would produce the following result (Table 5).

TABLE 5. Results from multivariate logistic regression model containing all explanatory variables (full model), using AGE as a continuous variable.

Term                        β estimate   Standard error   P value
Intercept (β₀)                 9.039         1.513        <0.001
Age2 (β₁)                     -0.294         0.041        <0.001
Treatment: Standard (β₂)       2.229         0.297        <0.001

The first thing to point out is that the "Age2" coefficient (Age here taken as a continuous variable) is now negative. This occurs because the older the individual (in years), the smaller the chance of death. Taking exp(-0.294) = 0.75 shows that for each additional year of age the chance of dying of SA endocarditis decreases by 25%. The intercept now represents individuals who received the new treatment and are "zero years old". Take extra care when interpreting logistic regression results using continuous explanatory variables.

Variables inclusion and selection

A major problem when building a logistic model is selecting which variables to include. Researchers usually collect as many variables as possible in their research instrument, then put all of them into the model and try to find something "significant". This approach makes two situations more likely. First, one or more variables are statistically "significant", but the researcher has no theory to link the "significant" variable to the event of interest being modeled. Remember that you are working with samples and spurious results can occur. The second situation is that a model with more variables has less statistical power. So, if there is an association between one explanatory variable and the occurrence of an event, the researcher can miss this effect because saturated models (those that contain all possible explanatory variables) are not sensitive enough to detect it. So the researcher must be very cautious with the selection of variables to include in the model.

We can start a regression using either a full (saturated) model or a null (empty) model, which starts only with the intercept term. In the first case, variables need to be dropped one by one, preferably dropping the least significant one first. This is the preferred strategy simply because it is easier to handle, while the second requires all candidate variables to be tested at each step in order to select the best one to include. On the other hand, if too many variables are included at once in a full model, significant variables could be dropped due to low statistical power, as mentioned above.

As a rule, if we have a large sample size, let's say at least ten individuals per variable, we can try to include all explanatory variables in the full model. However, if we have a limited sample size in relation to the number of candidate variables, a pre-selection should be performed instead. One way to do that is to test all variables beforehand, using models with just one explanatory variable at a time (univariate models), and afterwards include in the multivariate model all variables that have shown a relaxed P-value (for instance, P < 0.25). There is no reason to worry about a rigorous P-value criterion at this stage, because this is just a pre-selection strategy and no inference will derive from this step. This relaxed P-value criterion will allow reducing the initial number of variables in the model while reducing the risk of missing important variables (4,5).

There is some debate about the appropriate strategy for variable selection (6), and the one described above is just one of them. It is easy and intuitive. More elaborate methods are available, but whatever the method, it is very important that researchers are aware of the procedure applied and do not just press some buttons in the software.

Reference group setup

There are some explanatory variables for which the reference level is almost automatically determined. For instance, for our response variable named "Result", for which the outcomes are "died" and "survived", the reference level is almost always set to "survived", since the interest is focused on variables associated with the outcome, death.


On the other hand, some variables have no clear reference level but present ordered levels, and the reference level will usually be one of the endpoints or, less frequently, the central level. This is the case of variables assessed using Likert scales, a psychometric scale commonly used in research that employs questionnaires (for instance, the degree of satisfaction with some product scaled as "satisfied", "neither satisfied nor unsatisfied" or "unsatisfied"). However, some variables have no ordered levels and no clear reference level. This can occur with geographic region. And then the question arises: what region should I use as reference?

The answer is that there is no answer... However, reference level selection can change the model estimation in some cases. It is important to remember that all results (and significant effects) presented are relative to the reference level. To make this point clearer, let's see an example. In a nationwide survey about the occurrence of diabetic ketoacidosis, an individual's geographic region was found to be significantly related to the probability of diabetic ketoacidosis at the onset of diabetes (7). In this work, the north/northeast region was set as reference and the southeast region was the only one to be statistically different relative to the reference. The results, showing just the region variable, are below (Table 6).

If we instead use Middle-West as the reference level, the following result emerges (again, only geographic region is shown) (Table 7). Finally, if we use the southeast region as the reference level, we obtain the following results (Table 8).

TABLE 6. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). North/Northeast region used as reference level.

Term                        β estimate   Standard error    OR     P value
Intercept (β₀)                -1.92          0.19           -     <0.001
Region: South (β₁)            -0.09          0.11          0.92    0.405
Region: Middle-West (β₂)       0.18          0.16          1.19    0.267
Region: Southeast (β₃)         0.36          0.09          1.43   <0.001

TABLE 7. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Middle-West region used as reference level.

Term                        β estimate   Standard error    OR     P value
Intercept (β₀)                -1.75          0.22           -     <0.001
Region: South (β₁)            -0.26          0.16          0.77    0.104
Region: North/NE (β₂)         -0.18          0.16          0.84    0.267
Region: Southeast (β₃)         0.18          0.15          1.20    0.237

TABLE 8. Relationship between geographic region and ketoacidosis prevalence in Brazil (data from (7)). Southeast region used as reference level.

Term                        β estimate   Standard error    OR     P value
Intercept (β₀)                -1.56          0.18           -     <0.001
Region: South (β₁)            -0.45          0.09          0.64   <0.001
Region: North/NE (β₂)         -0.36          0.09          0.70   <0.001
Region: Middle-West (β₃)      -0.18          0.15          0.83    0.237
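
The point estimates in Tables 7 and 8 follow from Table 6 by simple arithmetic: changing the reference level subtracts the new reference's coefficient from every level of that variable and adds it to the intercept. The sketch below illustrates this with the Table 6 values; the function name is hypothetical. It recovers only the coefficients. Standard errors and P values cannot be obtained this way, because they depend on the covariance between the estimates, which the tables do not report.

```python
import math

# Point estimates from Table 6 (reference: North/Northeast, coefficient 0 by definition)
intercept = -1.92
betas = {"North/NE": 0.0, "South": -0.09, "Middle-West": 0.18, "Southeast": 0.36}


def change_reference(intercept, betas, new_reference):
    """Re-express the coefficients of one categorical variable relative to another level."""
    shift = betas[new_reference]
    new_betas = {level: beta - shift for level, beta in betas.items()}
    return intercept + shift, new_betas


rebased = {ref: change_reference(intercept, betas, ref)
           for ref in ("Middle-West", "Southeast")}

for ref, (new_intercept, new_betas) in rebased.items():
    print(f"Reference: {ref} (intercept = {new_intercept:.2f})")
    for level, beta in new_betas.items():
        if level != ref:
            print(f"  {level}: beta = {beta:.2f}, OR = {math.exp(beta):.2f}")
```

With Southeast as reference the coefficients match Table 8. With Middle-West as reference the arithmetic gives -0.27 for South and -1.74 for the intercept, where Table 7 prints -0.26 and -1.75; the gap is presumably due to rounding in the published coefficients. The tables also show why significance can change: each P value tests a different contrast against the chosen reference.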


So, it is important to pay attention to the setup of the reference levels. If there is no apparent rule derived from the data itself or from prior knowledge about the variable values, one recommendation that remains is to select a reference level that meets a minimum sample size, to allow adequate statistical power. Another recommendation that will make interpretation easier is to choose categories with the same relationship to the event of interest. If you believe that older individuals have a smaller probability of dying and that people receiving the new treatment are less likely to die, put these two categories as reference. You can also use the opposite and set younger individuals and the standard treatment as reference. But choosing older individuals and the standard treatment, although possible and not wrong, will make the interpretation of the results more difficult.

Conclusion

Logistic regression is a powerful tool, especially in epidemiologic studies, allowing multiple explanatory variables to be analyzed simultaneously while reducing the effect of confounding factors. However, researchers must pay attention to model building, and avoid just feeding software with raw data and going straight to the results. Some difficult decisions in model building will depend entirely on the researcher's expertise in the field.

Acknowledgements

The author would like to thank Dr. Arianne Reis and Dr. Marcelo Ribeiro-Alves for the valuable manuscript revision and suggestions.

Potential conflict of interest

None declared.

References

1. McHugh ML. The odds ratio: calculation, usage, and interpretation. Biochem Med 2009;19:120-6. http://dx.doi.org/10.11613/BM.2009.011.
2. Simundic AM. Bias in research. Biochem Med 2013;23:12-5. http://dx.doi.org/10.11613/BM.2013.003.
3. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959;22:719-48.
4. Bendel RB, Afifi AA. Comparison of Stopping Rules in Forward "Stepwise" Regression. J Am Stat Assoc 1977;72:46-53.
5. Costanza MC, Afifi AA. Comparison of Stopping Rules in Forward Stepwise Discriminant Analysis. J Am Stat Assoc 1979;74:777-85. http://dx.doi.org/10.1080/01621459.1979.10481030.
6. Greenland S. Modeling and variable selection in epidemiologic analysis. Am J Public Health 1989;79:340-9. http://dx.doi.org/10.2105/AJPH.79.3.340.
7. Negrato CA, Cobas RA, Gomes MB. Temporal changes in the diagnosis of type 1 diabetes by diabetic ketoacidosis in Brazil: a nationwide survey. Diabet Med 2012;29:1142-7. http://dx.doi.org/10.1111/j.1464-5491.2012.03590.x.



Copyright of Biochemia Medica is the property of Biochemia Medica and its content may not
be copied or emailed to multiple sites or posted to a listserv without the copyright holder's
express written permission. However, users may print, download, or email articles for
individual use.
