Logistic Regression
Michael P. LaValley, PhD
surgery,3 an evaluation of the relationship between the TaqIB genotype and risk of cardiovascular disease in a meta-analysis,4 and an examination of the relationship between lipoprotein abnormalities and the incidence of diabetes.5

The Logistic Regression Model
The logistic regression model has its basis in the odds of a 2-level outcome of interest. For simplicity, I assume that we have designated one of the outcome levels the event of interest and in the following text will simply call it the event. The odds of the event is the ratio of the probability of the event happening divided by the probability of the event not happening. Odds often are used for gambling, and “even odds” (odds=1) correspond to the event happening half the time. This would be the case for rolling an even number on a single die. The odds for rolling a number <5 would be 2 because rolling a number <5 is twice as likely as rolling a 5 or 6. Symmetry in the odds is found by taking the reciprocal, and the odds of rolling at least a 5 would be 0.5 (=1/2).

The logistic regression model takes the natural logarithm of the odds as a regression function of the predictors. With 1 predictor, X, this takes the form $\ln[\text{odds}(Y=1)] = \beta_0 + \beta_1 X$, [...]

[...] are the natural estimates from the model and attempts to transform these to relative risks can distort the results.10

A useful way to think of the odds ratio is that 100 times the odds ratio minus 1, ie, 100×(odds ratio−1), gives the percent change in the odds of the event corresponding to a 1-unit increase in X. If this value is negative, then the odds of the event decrease with increasing values of X; if positive, the odds increase. This percentage change is the same for any 1-unit increase in X because of the assumed linearity between X and the logarithm of the odds in the regression model above. For some continuous predictors, this assumption may not match the data,11 in which case careful checking of the model results is required. For example, if the logarithm of the odds against the predictor X has a U shape (both low and high values have large odds of the outcome relative to the intermediate values) and the model assumes a linear (straight line) pattern, then goodness-of-fit checking should show that the model and the data are not compatible. In such a case, splitting the predictor values into categories and using dummy variables to code for the categories may improve the fit.1 Other methods such as splines also may be used to lessen the assumption of linearity.12
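
The odds arithmetic described in this section can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original analysis; the function names and the two odds ratio values are hypothetical and chosen only for illustration. It reproduces the die examples and the 100×(odds ratio−1) conversion.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


def percent_change_in_odds(odds_ratio):
    """Percent change in the odds per 1-unit increase in X: 100 x (OR - 1)."""
    return 100 * (odds_ratio - 1)


# Die examples from the text, kept exact by using Fraction
odds_even = odds(Fraction(3, 6))        # even number: even odds
odds_below_5 = odds(Fraction(4, 6))     # number < 5
odds_at_least_5 = odds(Fraction(2, 6))  # 5 or 6: reciprocal of the odds above

print(odds_even, odds_below_5, odds_at_least_5)  # 1 2 1/2

# Hypothetical odds ratios (assumptions, not values from the article)
print(percent_change_in_odds(1.25))  # 25.0: odds rise by 25% per unit of X
print(percent_change_in_odds(0.75))  # -25.0: odds fall by 25% per unit of X
```

A negative result corresponds to an odds ratio below 1, which matches the interpretation given above for decreasing odds.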
From the Department of Biostatistics, Boston University School of Public Health, Boston, Mass.
Correspondence to Dr Michael P. LaValley, Department of Biostatistics, Boston University School of Public Health, 715 Albany St, Crosstown Center
Room 322, Boston, MA 02118. E-mail [email protected]
(Circulation. 2008;117:2395-2399.)
© 2008 American Heart Association, Inc.
Circulation is available at http://circ.ahajournals.org DOI: 10.1161/CIRCULATIONAHA.106.682658
2396 Circulation May 6, 2008
Downloaded from http://ahajournals.org by on July 16, 2024

When adjusted values are needed, more predictors can be added to the right side of the regression equation above, along with corresponding regression coefficients ($\beta$). In this case, the odds ratio value for X would be adjusted for the other predictors in the model. The equation above, 100×(odds ratio−1), would then be interpreted as the percent change in the odds corresponding to a 1-unit increase in X while holding all other predictors fixed. The selection of appropriate predictors to reduce confounding and to improve the precision of estimates is done similarly for logistic regression and for linear regression; guidelines can be found in many statistical textbooks.1,2,12

Unlike linear regression, there is no formula for the estimates of $\beta$ for logistic regression. Finding the best estimates requires repeatedly improving approximate estimates until stability is reached. This is done easily on a computer, and there are many statistical software packages that perform logistic regression, but it makes logistic regression less understandable and more of a “black box” approach for many researchers.

Angina in the Framingham Heart Study
To illustrate the use of logistic regression, I use data from the Framingham Heart Study13 that are available for teaching purposes from the National Heart, Lung, and Blood Institute (http://www.nhlbi.nih.gov/resources/deca/teaching.htm). These data include subjects at the 1956 Framingham examination, considered to be the baseline, with 24 years of follow-up. Here, I analyze the event of development of new angina pectoris during the follow-up. Subjects with prevalent angina at the 1956 examination are excluded from the data, and only measures from the 1956 examination are used as predictors. Not all subjects have complete 24-year follow-up because some died or left the study before 1980. Use of survival analysis methods to account for varying length of follow-up14 would be appropriate for a more definitive study of these data.

The predictor of main interest in my analysis is the measure of serum total cholesterol (mg/dL), and I consider adjusting for the sex of the subject, current smoking (yes or no), presence of diabetes (yes or no), age (years), body mass index (kg/m²), and ventricular heart rate (bpm). All of the analyses were done with SAS version 9.1 (SAS Institute Inc, Cary, NC).

After those with prevalent angina are removed, 4287 subjects remain, and 578 subjects (13.5%) developed new angina during the follow-up. At the 1956 examination, 56.8% of subjects were women, 49.5% were current smokers, and 2.9% had diabetes. The mean total cholesterol was 236.7 mg/dL (limits, 107 to 696 mg/dL), mean age was 49.6 years (limits, 32 to 70 years), mean body mass index was 25.8 kg/m² (limits, 15.5 to 56.8 kg/m²), and mean heart rate was 75.9 bpm (limits, 44 to 143 bpm).

Table 1 gives the unadjusted and adjusted odds ratios for a difference of 1 SD (44.622 mg/dL) of cholesterol on the occurrence of new angina during the follow-up. In the unadjusted model, cholesterol is the only predictor; in the adjusted model, sex, current smoking, presence of diabetes, age, body mass index, and heart rate also are included. In the unadjusted model, there is a 41.2% increase in the odds of angina with each 1-SD increase in total cholesterol, and there is a 40.4% increase in the adjusted model. Often, there is greater discrepancy between adjusted and unadjusted estimates. So, in these data, there is little confounding of the effect of cholesterol as a result of the other predictors in the adjusted model.

From the adjusted model, the odds of angina are increased 42% for men compared with women, and increased body mass index and decreased heart rate increase the odds of angina. The effects of current smoking, the presence of diabetes, and age are not larger than could be due to chance in these data (P>0.05).

In a data set with fewer cases of angina, the confidence interval for the adjusted result could be wider owing to increasing the variability of the estimates when more predictors are used than the data would support. A rule of thumb for stability of the estimates from logistic regression is to have at least 10 events (or nonevents, whichever is rarer in the data) per predictor in the model or, more precisely, per degree of freedom used in the model.15 Because there are about 83 cases of angina for each predictor in the adjusted model, the results are quite stable.

Goodness of Fit
One aspect of the results of logistic regression that is not described in the preceding section is how well the model
Table 2. Hosmer and Lemeshow Test Results for Unadjusted and Adjusted Logistic Regression Models

Predicted Probability   Unadjusted Model                  Adjusted Model
Ranking Group           Observed         Expected         Observed         Expected
                        Angina Cases, n  Angina Cases, n  Angina Cases, n  Angina Cases, n
1 (Lowest)              26               34.4             22               23.2
2                       26               40.1             31               31.6
3                       53               43.8             41               38.4
4                       55               49.2             37               44.4
5                       60               52.0             56               50.4
6                       57               57.4             63               56.0
7                       69               61.3             57               62.8
8                       65               65.6             70               71.2
9                       72               75.4             83               83.1
10 (Highest)            90               93.8             112              111.1
χ²                      13.6                              4.0
P                       0.094                             0.854

Hosmer and Lemeshow test results for the prediction of angina in the Framingham data. Columns 2 and 3 show the observed and expected numbers of angina cases by group for the unadjusted model. Columns 4 and 5 show the observed and expected numbers of angina cases by group for the adjusted model. χ² test statistics (on 8 df) and the probability values are shown for each model.
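
The probability values in Table 2 follow from the χ² statistics and their 8 df. The following is a minimal sketch in Python, not the SAS computation used for the article; the function name is hypothetical, and the closed-form tail formula it uses holds only for an even number of degrees of freedom.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable when df is even.

    For df = 2k the tail probability has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


# Test statistics as printed in Table 2 (rounded to 1 decimal place)
p_unadjusted = chi2_upper_tail_even_df(13.6, 8)
p_adjusted = chi2_upper_tail_even_df(4.0, 8)
print(round(p_unadjusted, 3), round(p_adjusted, 3))
```

The results, roughly 0.093 and 0.857, are close to the tabulated 0.094 and 0.854; small differences are expected because the statistics in the table are rounded.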
agrees with the observed data. This is called the goodness of fit of the model. The odds ratio values given above describe the model as it is applied to the data. If the model and the data are not in good agreement, then these odds ratios are not very meaningful.16 Several authors have pointed out that although goodness of fit is crucial for the assessment of the validity of logistic regression results in medical research, it often is not included in published articles.16–18

Goodness of fit is usually evaluated in 2 parts. The first step is to generate global measures of how well the model fits the whole set of observations; the second step is to evaluate individual observations to see whether any are problematic for the regression model.1 Some global measures of goodness of fit include R² measures for logistic regression; the c statistic, a measure of how well the model can be used to discriminate subjects having the event from subjects not having the event; and a test of model calibration developed by Hosmer and Lemeshow.19 The second part of evaluating goodness of fit is focused on looking for outliers and influence points and may be useful for seeing whether linearity in the model is reasonable.

The R² measures for logistic regression mimic the widely used R² measure from linear regression, which gives the fraction of the variability in the outcome that is explained by the model. However, logistic regression R² does not have such intuitive explanation, and values tend to be close to 0 even for models that fit well. Because there is an upper bound for the basic logistic regression R², a rescaled R² is usually also presented showing the fraction of the upper bound that is attained. In the logistic regressions predicting angina, the model containing only cholesterol as a predictor had an R² of 0.015 with a rescaled R² of 0.0275. The model containing 7 predictors had an R² of 0.0304 and a rescaled R² of 0.0555. The adjusted model has larger R² values, but it is difficult to judge whether the difference is large enough to be important.

The c statistic measures how well the model can discriminate between observations at different levels of the outcome. It is the same as the area under the receiver-operating characteristic curve,20 formed by taking the predicted values from the regression model as a diagnostic test for the event in the data. The minimum value of c is 0.5; the maximum is 1.0. In their textbook, Hosmer and Lemeshow1 consider c values of 0.7 to 0.8 to show acceptable discrimination, values of 0.8 to 0.9 to indicate excellent discrimination, and values of ≥0.9 to show outstanding discrimination (page 162). The c statistic value is 0.603 in the unadjusted model for angina and 0.643 in the adjusted model, both below the threshold for acceptable discrimination.

The Hosmer and Lemeshow test evaluates whether the logistic regression model is well calibrated so that probability predictions from the model reflect the occurrence of events in the data. Obtaining a significant result on the test would indicate that the model is not well calibrated, so the fit is not good. For this test, subjects are grouped by their percentile of predicted probability of having the event according to the model: group 1 has subjects with predicted probabilities in the 1st to 10th percentiles, group 2 has subjects with predicted probabilities in the 11th to 20th percentiles, and so on. If the observed and expected numbers of events are very different in any group, then the model is judged not to fit. Observed and expected values for the groups in the unadjusted and adjusted models for angina are shown in Table 2. The unadjusted model has a borderline-significant (P=0.094) test result, indicating possible problems with the model fit. In the adjusted model, the test finds less evidence of lack of fit (P=0.854). Inspection of Table 2 shows that the adjusted model has much better agreement between observed and expected numbers of angina events, especially for groups with low percentages of expected events, ie, in subjects with relatively low cholesterol.

Problematic points are those that are either outliers, data values for which the observed value and the model prediction are in poor agreement, or influence points, observations with
an unexpectedly large impact on model results. Checking for problematic observations is done by plotting residuals against predicted values, the model estimate of the probability that a subject will have the event.21 Outliers are observations with large residuals, and in logistic regression, several residuals have been developed. Here, I use the relatively simple Pearson residual, which is the difference between the observed and expected outcomes for an observation divided by the square root of the variability of the expected outcome. Logistic regression residual plots look different from those from linear regression because the residuals fall on 2 curves, 1 for each outcome level. Pearson residuals >3 and <−3 [...]

[...] angina rates and low predicted probability of angina in the regression model for these subjects creates large residuals, and these are the points in the upper left region of the Figure. A substantial number of these subjects have residual values >3 and might be considered outliers.

So, although we cannot reject that the adjusted model fits the data according to the Hosmer and Lemeshow test, the R² and c values are still rather low. In addition, the Figure makes it clear that there are some subjects with low cholesterol who develop angina and are not well fit by the model. There are also some subjects with very high cholesterol who may have excessive influence on the model estimates. As a sensitivity [...]
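
The Pearson residual described above can be sketched in a few lines. The following is a minimal illustration in Python, assuming the usual binary-outcome form in which the variability of the expected outcome is p(1−p); the function name and the predicted probabilities are hypothetical and are not taken from the Framingham data.

```python
import math


def pearson_residual(y, p):
    """Pearson residual for a binary outcome y (0 or 1) and predicted probability p.

    Observed minus expected outcome, divided by the square root of the
    binomial variance p * (1 - p) of the expected outcome.
    """
    return (y - p) / math.sqrt(p * (1 - p))


# Hypothetical predicted probabilities, for illustration only
for p in (0.05, 0.10, 0.30):
    with_event = pearson_residual(1, p)     # subject who has the event
    without_event = pearson_residual(0, p)  # subject who does not
    print(p, round(with_event, 2), round(without_event, 2))
```

Under this form, subjects with the event always have positive residuals and subjects without it always have negative residuals, which gives the 2 curves in the residual plot. A subject who has the event despite a predicted probability below 0.10 has a residual above 3, which is why events among subjects with low predicted probability appear as large residuals.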
2. Kirkwood BR, Sterne JAC. Essential Medical Statistics. Oxford, UK: Blackwell Science Ltd; 2003.
3. Blankstein R, Ward RP, Arnsdorf M, Jones B, Lou YB, Pine M. Female gender is an independent predictor of operative mortality after coronary artery bypass graft surgery: contemporary analysis of 31 Midwestern hospitals. Circulation. 2005;112(suppl):I-323–I-327.
4. Boekholdt SM, Sacks FM, Jukema JW, Shepherd J, Freeman DJ, McMahon AD, Cambien F, Nicaud V, de Grooth GJ, Talmud PJ, Humphries SE, Miller GJ, Eiriksdottir G, Gudnason V, Kauma H, Kakko S, Savolainen MJ, Arca M, Montali A, Liu S, Lanz HJ, Zwinderman AH, Kuivenhoven JA, Kastelein JJ. Cholesteryl ester transfer protein TaqIB variant, high-density lipoprotein cholesterol levels, cardiovascular risk, and efficacy of pravastatin treatment: individual patient meta-analysis of 13,677 subjects. Circulation. 2005;111:278–287.
5. Festa A, Williams K, Hanley AJ, Otvos JD, Goff DC, Wagenknecht LE, Haffner SM. Nuclear magnetic resonance lipoprotein abnormalities in prediabetic subjects in the Insulin Resistance Atherosclerosis Study. Circulation. 2005;111:3465–3472.
6. Bland JM, Altman DG. Statistics notes: the odds ratio. BMJ. 2000;320:1468.
7. Breslow NE, Day NE. Statistical methods in cancer research, volume I: the analysis of case-control studies. IARC Sci Publ. 1980:5–338.
8. Holcomb WL Jr, Chaiworapongsa T, Luke DA, Burgdorf KD. An odd measure of risk: use and misuse of the odds ratio. Obstet Gynecol. 2001;98:685–688.
9. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998;316:989–991.
10. McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003;157:940–943.
11. Lee J. An insight on the use of multiple logistic regression analysis to estimate association between risk factor and disease occurrence. Int J Epidemiol. 1986;15:22–29.
12. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer-Verlag; 2001.
13. Fox CS, Pencina MJ, Meigs JB, Vasan RS, Levitzky YS, D’Agostino RB Sr. Trends in the incidence of type 2 diabetes mellitus from the 1970s to the 1990s: the Framingham Heart Study. Circulation. 2006;113:2914–2918.
14. Hosmer DW, Lemeshow S. Applied Survival Analysis. New York, NY: John Wiley & Sons; 1999.
15. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49:1373–1379.
16. Hosmer DW, Taber S, Lemeshow S. The importance of assessing the fit of logistic regression models: a case study. Am J Public Health. 1991;81:1630–1635.
17. Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol. 2001;54:979–985.
18. Bender R, Grouven U. Logistic regression models used in medical research are poorly presented. BMJ. 1996;313:628.
19. Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Commun Stat. 1980;A10:1043–1069.
20. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press; 2003.
21. Friendly M. Visualizing Categorical Data. Cary, NC: SAS Institute Inc; 2000.
22. Bender R, Grouven U. Ordinal logistic regression in medical research. J R Coll Physicians Lond. 1997;31:546–551.
23. Harrell FE Jr, Margolis PA, Gove S, Mason KE, Mulholland EK, Lehmann D, Muhe L, Gatchalian S, Eichenwald HF. Development of a clinical prediction model for an ordinal outcome: the World Health Organization Multicentre Study of Clinical Signs and Etiological Agents of Pneumonia, Sepsis and Meningitis in Young Infants: WHO/ARI Young Infant Multicentre Study Group. Stat Med. 1998;17:909–944.
24. Scott SC, Goldberg MS, Mayo NE. Statistical assessment of ordinal outcomes in comparative studies. J Clin Epidemiol. 1997;50:45–55.
25. Cannon MJ, Warner L, Taddei JA, Kleinbaum DG. What can go wrong when you assume that correlated data are independent: an illustration from the evaluation of a childhood health intervention in Brazil. Stat Med. 2001;20:1461–1467.
26. Lipsitz SR, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Stat Med. 1994;13:1149–1163.
27. Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology. Cambridge, UK: Cambridge University Press; 2003.

KEY WORDS: angina ■ epidemiology ■ risk factors ■ statistics