
Statistical Primer for Cardiovascular Research

Logistic Regression
Michael P. LaValley, PhD

Like contingency table analyses and χ² tests, logistic regression allows the analysis of dichotomous or binary outcomes with 2 mutually exclusive levels [1]. However, logistic regression permits the use of continuous or categorical predictors and provides the ability to adjust for multiple predictors. This makes logistic regression especially useful for analysis of observational data when adjustment is needed to reduce the potential bias resulting from differences in the groups being compared [2].

Use of standard linear regression for a 2-level outcome can produce very unsatisfactory results. Predicted values for some covariate values are likely to be either above the upper level (usually 1) or below the lower level of the outcome (usually 0). In addition, the validity of linear regression depends on the variability of the outcome being the same for all values of the predictors. This assumption of constant variability does not match the behavior of a 2-level outcome. So, linear regression is not adequate for such data, and logistic regression has been developed to fill this gap.

Some recent examples of use of logistic regression in Circulation include the assessment of gender as a predictor of operative mortality after coronary artery bypass grafting surgery [3], an evaluation of the relationship between the TaqIB genotype and risk of cardiovascular disease in a meta-analysis [4], and an examination of the relationship between lipoprotein abnormalities and the incidence of diabetes [5].

The Logistic Regression Model

The logistic regression model has its basis in the odds of a 2-level outcome of interest. For simplicity, I assume that we have designated one of the outcome levels the event of interest and in the following text will simply call it the event. The odds of the event is the ratio of the probability of the event happening divided by the probability of the event not happening. Odds often are used for gambling, and "even odds" (odds = 1) correspond to the event happening half the time. This would be the case for rolling an even number on a single die. The odds for rolling a number <5 would be 2 because rolling a number <5 is twice as likely as rolling a 5 or 6. Symmetry in the odds is found by taking the reciprocal, and the odds of rolling at least a 5 would be 0.5 (= 1/2).

The logistic regression model takes the natural logarithm of the odds as a regression function of the predictors. With 1 predictor, X, this takes the form

ln[odds(Y = 1)] = β₀ + β₁X,

where ln stands for the natural logarithm, Y is the outcome and Y = 1 when the event happens (versus Y = 0 when it does not), β₀ is the intercept term, and β₁ represents the regression coefficient, the change in the logarithm of the odds of the event with a 1-unit change in the predictor X. The difference in the logarithms of 2 values is equal to the logarithm of the ratio of the 2 values, so by taking the exponential of β₁, we obtain the ratio of the odds (the odds ratio) corresponding to a 1-unit change in X.

Odds ratios often are used in the analysis of 2-by-2 contingency tables [6] and case-control studies [7]. The odds ratio is sometimes confused with the relative risk, which is the ratio of probabilities rather than odds. Only when the probability of the event is very low can the odds ratio be considered a good approximation to the relative risk [2]. The odds ratio is more extreme (farther from 1) than the relative risk, which leads to exaggeration of the effect of a predictor when it is misinterpreted as a relative risk [8]. In many settings, the relative risk is preferred over the odds ratio because it addresses the more readily understood probability of the event rather than its odds [9]. However, logistic regression results are typically presented as odds ratios because these are the natural estimates from the model and attempts to transform these to relative risks can distort the results [10].

A useful way to think of the odds ratio is that 100 times the odds ratio minus 1, ie, 100 × (odds ratio − 1), gives the percent change in the odds of the event corresponding to a 1-unit increase in X. If this value is negative, then the odds of the event decrease with increasing values of X; if positive, the odds increase. This percentage change is the same for any 1-unit increase in X because of the assumed linearity between X and the logarithm of the odds in the regression model above. For some continuous predictors, this assumption may not match the data [11], in which case careful checking of the model results is required. For example, if the logarithm of the odds plotted against the predictor X has a U shape (both low and high values have large odds of the outcome relative to the intermediate values) and the model assumes a linear (straight line) pattern, then goodness-of-fit checking should show that the model and the data are not compatible. In such a case, splitting the predictor values into categories and using dummy variables to code for the categories may improve the fit [1]. Other methods such as splines also may be used to lessen the assumption of linearity [12].
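
The quantities introduced in this section (odds, the log-odds model, the odds ratio, and its percent-change reading) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the probabilities used to compare the odds ratio with the relative risk (0.40 vs 0.20, and 0.02 vs 0.01) are made-up values, not data from this article.

```python
import math
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)


def probability_from_log_odds(beta0, beta1, x):
    """Invert ln[odds(Y = 1)] = beta0 + beta1 * x to recover P(Y = 1)."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))


def odds_ratio_from_coefficient(beta1):
    """Odds ratio for a 1-unit increase in X: exp(beta1)."""
    return math.exp(beta1)


def percent_change_in_odds(odds_ratio_value):
    """100 x (odds ratio - 1): percent change in odds per 1-unit increase."""
    return 100 * (odds_ratio_value - 1)


def relative_risk(p_exposed, p_unexposed):
    """Ratio of probabilities."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of odds."""
    return odds(p_exposed) / odds(p_unexposed)


# Die examples from the text, kept exact with fractions.
odds_even = odds(Fraction(3, 6))        # rolling an even number
odds_below_5 = odds(Fraction(4, 6))     # rolling 1 to 4 versus 5 or 6
odds_at_least_5 = odds(Fraction(2, 6))  # reciprocal of the previous odds

# Hypothetical probabilities (NOT from the article): a common and a rare event.
common = (0.40, 0.20)
rare = (0.02, 0.01)

print("die odds:", odds_even, odds_below_5, odds_at_least_5)
print("common event: RR =", relative_risk(*common), "OR =", odds_ratio(*common))
print("rare event:   RR =", relative_risk(*rare), "OR =", odds_ratio(*rare))
```

With the common-event probabilities the odds ratio is noticeably larger than the relative risk, whereas with the rare-event probabilities the two nearly coincide, which is the point made above about when the odds ratio approximates the relative risk.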

From the Department of Biostatistics, Boston University School of Public Health, Boston, Mass.
Correspondence to Dr Michael P. LaValley, Department of Biostatistics, Boston University School of Public Health, 715 Albany St, Crosstown Center
Room 322, Boston, MA 02118. E-mail [email protected]
(Circulation. 2008;117:2395-2399.)
© 2008 American Heart Association, Inc.
Circulation is available at http://circ.ahajournals.org DOI: 10.1161/CIRCULATIONAHA.106.682658

2396 Circulation May 6, 2008

Table 1. Unadjusted and Adjusted Odds Ratios for Development of Angina

                             Unadjusted                            Adjusted
Predictor                    Odds Ratio   95% CI        P          Odds Ratio   95% CI        P
Cholesterol (1 SD)           1.412        1.297–1.537   <0.001     1.404        1.284–1.535   <0.001
Sex (men vs women)                                                 1.415        1.173–1.705   <0.001
Current smoking                                                    1.035        0.854–1.255   0.728
Diabetes                                                           1.437        0.891–2.320   0.138
Age (10 y)                                                         1.088        0.973–1.216   0.139
Body mass index (1 SD)                                             1.299        1.190–1.419   <0.001
Heart rate (1 SD)                                                  0.867        0.788–0.953   0.0031

Odds ratios, 95% CIs, and probability values for predictors of angina in the Framingham data. Columns 2 through 4 present results from the unadjusted model; columns 5 through 7 show results from the adjusted model. The respective SDs for cholesterol, body mass index, and heart rate are 44.622 mg/dL, 4.077 kg/m², and 12.033 bpm.
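
The entries in Table 1 can be read with a little arithmetic, sketched here in Python. The helper names are hypothetical. Two assumptions are worth flagging: re-expressing a per-SD odds ratio for a different increment relies on the model's linearity on the log-odds scale, and the check on the confidence limits assumes the intervals were computed symmetrically on the log-odds scale, which is common practice but is not stated in the article.

```python
import math

# Values copied from Table 1 and its footnote.
SD_CHOLESTEROL_MG_DL = 44.622
OR_CHOLESTEROL_UNADJUSTED = 1.412
CI_CHOLESTEROL_UNADJUSTED = (1.297, 1.537)
OR_HEART_RATE_ADJUSTED = 0.867


def percent_change_in_odds(odds_ratio):
    """100 x (odds ratio - 1): percent change in the odds per unit of the predictor."""
    return 100 * (odds_ratio - 1)


def rescale_odds_ratio(odds_ratio, old_unit, new_unit):
    """Odds ratio for a change of new_unit, given the odds ratio per old_unit.

    Valid only under the model's assumption that the log odds are linear in X.
    """
    return math.exp(math.log(odds_ratio) * new_unit / old_unit)


def log_scale_midpoint(lower, upper):
    """Geometric mean of the CI limits.

    ASSUMPTION: if the interval is symmetric on the log-odds scale, this
    should land close to the reported odds ratio.
    """
    return math.sqrt(lower * upper)


# Cholesterol: 1.412 per SD corresponds to a 41.2% increase in the odds.
cholesterol_change = percent_change_in_odds(OR_CHOLESTEROL_UNADJUSTED)

# Heart rate: an odds ratio below 1 gives a negative percent change.
heart_rate_change = percent_change_in_odds(OR_HEART_RATE_ADJUSTED)

# The same cholesterol effect expressed per 10 mg/dL instead of per SD.
or_per_10_mg_dl = rescale_odds_ratio(
    OR_CHOLESTEROL_UNADJUSTED, SD_CHOLESTEROL_MG_DL, 10
)

midpoint = log_scale_midpoint(*CI_CHOLESTEROL_UNADJUSTED)

print(f"cholesterol: {cholesterol_change:.1f}% change in odds per SD")
print(f"heart rate:  {heart_rate_change:.1f}% change in odds per SD")
print(f"cholesterol odds ratio per 10 mg/dL: {or_per_10_mg_dl:.3f}")
print(f"geometric midpoint of the unadjusted CI: {midpoint:.3f}")
```

The per-10 mg/dL figure is a convenience for readers who think in clinical units; it carries no information beyond the per-SD odds ratio in the table.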

When adjusted values are needed, more predictors can be added to the right side of the regression equation above, along with corresponding regression coefficients (β). In this case, the odds ratio value for X would be adjusted for the other predictors in the model. The expression above, 100 × (odds ratio − 1), would then be interpreted as the percent change in the odds corresponding to a 1-unit increase in X while holding all other predictors fixed. The selection of appropriate predictors to reduce confounding and to improve the precision of estimates is done similarly for logistic regression and for linear regression; guidelines can be found in many statistical textbooks [1,2,12].

Unlike linear regression, there is no closed-form formula for the estimates of β for logistic regression. Finding the best estimates requires repeatedly improving approximate estimates until stability is reached. This is done easily on a computer, and there are many statistical software packages that perform logistic regression, but it makes logistic regression less understandable and more of a "black box" approach for many researchers.

Angina in the Framingham Heart Study

To illustrate the use of logistic regression, I use data from the Framingham Heart Study [13] that are available for teaching purposes from the National Heart, Lung, and Blood Institute (http://www.nhlbi.nih.gov/resources/deca/teaching.htm). These data include subjects at the 1956 Framingham examination, considered to be the baseline, with 24 years of follow-up. Here, I analyze the event of development of new angina pectoris during the follow-up. Subjects with prevalent angina at the 1956 examination are excluded from the data, and only measures from the 1956 examination are used as predictors. Not all subjects have complete 24-year follow-up because some died or left the study before 1980. Use of survival analysis methods to account for varying length of follow-up [14] would be appropriate for a more definitive study of these data.

The predictor of main interest in my analysis is the measure of serum total cholesterol (mg/dL), and I consider adjusting for the sex of the subject, current smoking (yes or no), presence of diabetes (yes or no), age (years), body mass index (kg/m²), and ventricular heart rate (bpm). All of the analyses were done with SAS version 9.1 (SAS Institute Inc, Cary, NC).

After those with prevalent angina are removed, 4287 subjects remain, and 578 subjects (13.5%) developed new angina during the follow-up. At the 1956 examination, 56.8% of subjects were women, 49.5% were current smokers, and 2.9% had diabetes. The mean total cholesterol was 236.7 mg/dL (limits, 107 to 696 mg/dL), mean age was 49.6 years (limits, 32 to 70 years), mean body mass index was 25.8 kg/m² (limits, 15.5 to 56.8 kg/m²), and mean heart rate was 75.9 bpm (limits, 44 to 143 bpm).

Table 1 gives the unadjusted and adjusted odds ratios for a difference of 1 SD (44.622 mg/dL) of cholesterol on the occurrence of new angina during the follow-up. In the unadjusted model, cholesterol is the only predictor; in the adjusted model, sex, current smoking, presence of diabetes, age, body mass index, and heart rate also are included. In the unadjusted model, there is a 41.2% increase in the odds of angina with each 1-SD increase in total cholesterol, and there is a 40.4% increase in the adjusted model. Often, there is greater discrepancy between adjusted and unadjusted estimates. So, in these data, there is little confounding of the effect of cholesterol as a result of the other predictors in the adjusted model.

From the adjusted model, the odds of angina are increased 42% for men compared with women, and increased body mass index and decreased heart rate increase the odds of angina. The effects of current smoking, the presence of diabetes, and age are not larger than could be due to chance in these data (P > 0.05).

In a data set with fewer cases of angina, the confidence interval for the adjusted result could be wider owing to increasing the variability of the estimates when more predictors are used than the data would support. A rule of thumb for stability of the estimates from logistic regression is to have at least 10 events (or nonevents, whichever is rarer in the data) per predictor in the model (more precisely, per degree of freedom used in the model) [15]. Because there are about 83 cases of angina for each predictor in the adjusted model (578 cases for 7 predictors), the results are quite stable.

Goodness of Fit

One aspect of the results of logistic regression that is not described in the preceding section is how well the model
agrees with the observed data. This is called the goodness of fit of the model. The odds ratio values given above describe the model as it is applied to the data. If the model and the data are not in good agreement, then these odds ratios are not very meaningful [16]. Several authors have pointed out that although goodness of fit is crucial for the assessment of the validity of logistic regression results in medical research, it often is not included in published articles [16–18].

Goodness of fit is usually evaluated in 2 parts. The first step is to generate global measures of how well the model fits the whole set of observations; the second step is to evaluate individual observations to see whether any are problematic for the regression model [1]. Some global measures of goodness of fit include R² measures for logistic regression; the c statistic, a measure of how well the model can be used to discriminate subjects having the event from subjects not having the event; and a test of model calibration developed by Hosmer and Lemeshow [19]. The second part of evaluating goodness of fit is focused on looking for outliers and influence points and may be useful for seeing whether linearity in the model is reasonable.

The R² measures for logistic regression mimic the widely used R² measure from linear regression, which gives the fraction of the variability in the outcome that is explained by the model. However, logistic regression R² does not have such an intuitive explanation, and values tend to be close to 0 even for models that fit well. Because there is an upper bound for the basic logistic regression R², a rescaled R² is usually also presented showing the fraction of the upper bound that is attained. In the logistic regressions predicting angina, the model containing only cholesterol as a predictor had an R² of 0.015 with a rescaled R² of 0.0275. The model containing 7 predictors had an R² of 0.0304 and a rescaled R² of 0.0555. The adjusted model has larger R² values, but it is difficult to judge whether the difference is large enough to be important.

The c statistic measures how well the model can discriminate between observations at different levels of the outcome. It is the same as the area under the receiver-operating characteristic curve [20], formed by taking the predicted values from the regression model as a diagnostic test for the event in the data. The minimum value of c is 0.5; the maximum is 1.0. In their textbook, Hosmer and Lemeshow [1] consider c values of 0.7 to 0.8 to show acceptable discrimination, values of 0.8 to 0.9 to indicate excellent discrimination, and values of ≥0.9 to show outstanding discrimination (page 162). The c statistic value is 0.603 in the unadjusted model for angina and 0.643 in the adjusted model, both below the threshold for acceptable discrimination.

The Hosmer and Lemeshow test evaluates whether the logistic regression model is well calibrated so that probability predictions from the model reflect the occurrence of events in the data. Obtaining a significant result on the test would indicate that the model is not well calibrated, so the fit is not good. For this test, subjects are grouped by their percentile of predicted probability of having the event according to the model: group 1 has subjects with predicted probabilities in the 1st to 10th percentiles, group 2 has subjects with predicted probabilities in the 11th to 20th percentiles, and so on. If the observed and expected numbers of events are very different in any group, then the model is judged not to fit. Observed and expected values for the groups in the unadjusted and adjusted models for angina are shown in Table 2. The unadjusted model has a borderline-significant (P = 0.094) test result, indicating possible problems with the model fit. In the adjusted model, the test finds less evidence of lack of fit (P = 0.854). Inspection of Table 2 shows that the adjusted model has much better agreement between observed and expected numbers of angina events, especially for groups with low percentages of expected events, ie, in subjects with relatively low cholesterol.

Table 2. Hosmer and Lemeshow Test Results for Unadjusted and Adjusted Logistic Regression Models

                         Unadjusted Model                     Adjusted Model
Predicted Probability    Observed Angina   Expected Angina    Observed Angina   Expected Angina
Ranking Groups           Cases, n          Cases, n           Cases, n          Cases, n
1 (Lowest)               26                34.4               22                23.2
2                        26                40.1               31                31.6
3                        53                43.8               41                38.4
4                        55                49.2               37                44.4
5                        60                52.0               56                50.4
6                        57                57.4               63                56.0
7                        69                61.3               57                62.8
8                        65                65.6               70                71.2
9                        72                75.4               83                83.1
10 (Highest)             90                93.8               112               111.1
χ²                       13.6                                 4.0
P                        0.094                                0.854

Hosmer and Lemeshow test results for the prediction of angina in the Framingham data. Columns 2 and 3 show the observed and expected numbers of angina cases by group for the unadjusted model. Columns 4 and 5 show the observed and expected numbers of angina cases by group for the adjusted model. χ² test statistics (on 8 df) and the probability values are shown for each model.

Problematic points are those that are either outliers, data values for which the observed value and the model prediction are in poor agreement, or influence points, observations with
an unexpectedly large impact on model results. Checking for problematic observations is done by plotting residuals against predicted values, the model estimate of the probability that a subject will have the event [21]. Outliers are observations with large residuals, and in logistic regression, several residuals have been developed. Here, I use the relatively simple Pearson residual, which is the difference between the observed and expected outcomes for an observation divided by the square root of the variability (variance) of the expected outcome. Logistic regression residual plots look different from those from linear regression because the residuals fall on 2 curves, 1 for each outcome level. Pearson residuals >3 and <−3 would be considered potential problems, although for large data sets we should expect some values beyond those limits.

There also are several measures of influence for logistic regression. Here, I use the logistic regression version of Cook's distance, which provides a measure of how much the model estimates change when each point is removed. Neither outliers nor influence points should be discarded automatically, but having knowledge of their presence can be used for targeted data checking and cleaning, or sensitivity analyses.

The Figure is a residual plot for the adjusted model. The horizontal axis shows the predicted probability of angina for each observation; the vertical axis shows the Pearson residual. The size of the plotted circle is proportional to the Cook's distance for the observation. The higher curve is of subjects who developed angina, and the lower curve is of subjects who did not. Because the number of subjects who developed angina is smaller, their observations are generally more influential, and their circles tend to be larger. From the Figure, we can identify several possible problems. First, there are 2 observations with predicted probabilities of angina between 0.75 and 0.80. These come from 2 subjects with unusually high cholesterol values (600 and 696 mg/dL). The subject with 696 mg/dL did not develop angina, making a rather poor fit to the model and the most influential observation in these data, shown by having the largest circle. There are also subjects who developed angina despite having a very low predicted probability in the model. The low predicted probabilities for these subjects were primarily due to low cholesterol values. The mismatch between the observed angina rates and low predicted probability of angina in the regression model for these subjects creates large residuals, and these are the points in the upper left region of the Figure. A substantial number of these subjects have residual values >3 and might be considered outliers.

Figure. Residual plot from the adjusted model for angina in the Framingham data. The horizontal axis shows the predicted probability of angina; vertical axis, the value of the Pearson residual. The size of the plotted circle is proportional to the influence of an observation.

So, although we cannot reject the hypothesis that the adjusted model fits the data according to the Hosmer and Lemeshow test, the R² and c values are still rather low. In addition, the Figure makes it clear that there are some subjects with low cholesterol who develop angina and are not well fit by the model. There are also some subjects with very high cholesterol who may have excessive influence on the model estimates. As a sensitivity analysis, we might want to remove subjects with cholesterol of ≥600 mg/dL and see if the model results change substantially. We also might consider adding more predictors or allowing a nonlinear effect of cholesterol to see if we can better predict angina for subjects with low cholesterol levels.

Extensions to the Logistic Regression Model

Here, I have considered only outcomes with 2 levels, but there are extensions to the logistic regression model that allow analysis of outcomes with ≥3 ordered levels such as no pain, moderate pain, or severe pain. Such data often are analyzed with proportional odds logistic regression [22], although other models also are possible [23,24]. Multinomial logistic regression may be used if the outcome consists of ≥3 unordered categories [1]. The standard form of logistic regression presented here also presumes that observations are independent. This would not be the case for longitudinal or clustered data, and analyzing such data as independent could give misleading conclusions [25]. Methods such as generalized estimating equations [26] or random-effects models [27] can be used for such data. Finally, survival analysis methods [14] provide an extension for studies in which subjects have been followed up for events across extended and varying follow-up times.

Disclosures

None.

References

1. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York, NY: John Wiley & Sons, Inc; 2000.
2. Kirkwood BR, Sterne JAC. Essential Medical Statistics. Oxford, UK: Blackwell Science Ltd; 2003.
3. Blankstein R, Ward RP, Arnsdorf M, Jones B, Lou YB, Pine M. Female gender is an independent predictor of operative mortality after coronary artery bypass graft surgery: contemporary analysis of 31 Midwestern hospitals. Circulation. 2005;112(suppl):I-323–I-327.
4. Boekholdt SM, Sacks FM, Jukema JW, Shepherd J, Freeman DJ, McMahon AD, Cambien F, Nicaud V, de Grooth GJ, Talmud PJ, Humphries SE, Miller GJ, Eiriksdottir G, Gudnason V, Kauma H, Kakko S, Savolainen MJ, Arca M, Montali A, Liu S, Lanz HJ, Zwinderman AH, Kuivenhoven JA, Kastelein JJ. Cholesteryl ester transfer protein TaqIB variant, high-density lipoprotein cholesterol levels, cardiovascular risk, and efficacy of pravastatin treatment: individual patient meta-analysis of 13,677 subjects. Circulation. 2005;111:278–287.
5. Festa A, Williams K, Hanley AJ, Otvos JD, Goff DC, Wagenknecht LE, Haffner SM. Nuclear magnetic resonance lipoprotein abnormalities in prediabetic subjects in the Insulin Resistance Atherosclerosis Study. Circulation. 2005;111:3465–3472.
6. Bland JM, Altman DG. Statistics notes: the odds ratio. BMJ. 2000;320:1468.
7. Breslow NE, Day NE. Statistical methods in cancer research, volume I: the analysis of case-control studies. IARC Sci Publ. 1980:5–338.
8. Holcomb WL Jr, Chaiworapongsa T, Luke DA, Burgdorf KD. An odd measure of risk: use and misuse of the odds ratio. Obstet Gynecol. 2001;98:685–688.
9. Davies HT, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998;316:989–991.
10. McNutt LA, Wu C, Xue X, Hafner JP. Estimating the relative risk in cohort studies and clinical trials of common outcomes. Am J Epidemiol. 2003;157:940–943.
11. Lee J. An insight on the use of multiple logistic regression analysis to estimate association between risk factor and disease occurrence. Int J Epidemiol. 1986;15:22–29.
12. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer-Verlag; 2001.
13. Fox CS, Pencina MJ, Meigs JB, Vasan RS, Levitzky YS, D'Agostino RB Sr. Trends in the incidence of type 2 diabetes mellitus from the 1970s to the 1990s: the Framingham Heart Study. Circulation. 2006;113:2914–2918.
14. Hosmer DW, Lemeshow S. Applied Survival Analysis. New York, NY: John Wiley & Sons; 1999.
15. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49:1373–1379.
16. Hosmer DW, Taber S, Lemeshow S. The importance of assessing the fit of logistic regression models: a case study. Am J Public Health. 1991;81:1630–1635.
17. Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol. 2001;54:979–985.
18. Bender R, Grouven U. Logistic regression models used in medical research are poorly presented. BMJ. 1996;313:628.
19. Hosmer DW, Lemeshow S. A goodness-of-fit test for the multiple logistic regression model. Commun Stat. 1980;A10:1043–1069.
20. Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, UK: Oxford University Press; 2003.
21. Friendly M. Visualizing Categorical Data. Cary, NC: SAS Institute Inc; 2000.
22. Bender R, Grouven U. Ordinal logistic regression in medical research. J R Coll Physicians Lond. 1997;31:546–551.
23. Harrell FE Jr, Margolis PA, Gove S, Mason KE, Mulholland EK, Lehmann D, Muhe L, Gatchalian S, Eichenwald HF. Development of a clinical prediction model for an ordinal outcome: the World Health Organization Multicentre Study of Clinical Signs and Etiological Agents of Pneumonia, Sepsis and Meningitis in Young Infants: WHO/ARI Young Infant Multicentre Study Group. Stat Med. 1998;17:909–944.
24. Scott SC, Goldberg MS, Mayo NE. Statistical assessment of ordinal outcomes in comparative studies. J Clin Epidemiol. 1997;50:45–55.
25. Cannon MJ, Warner L, Taddei JA, Kleinbaum DG. What can go wrong when you assume that correlated data are independent: an illustration from the evaluation of a childhood health intervention in Brazil. Stat Med. 2001;20:1461–1467.
26. Lipsitz SR, Kim K, Zhao L. Analysis of repeated categorical data using generalized estimating equations. Stat Med. 1994;13:1149–1163.
27. Twisk JWR. Applied Longitudinal Data Analysis for Epidemiology. Cambridge, UK: Cambridge University Press; 2003.

KEY WORDS: angina ■ epidemiology ■ risk factors ■ statistics
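
As a supplementary sketch for the Goodness of Fit section, the Hosmer and Lemeshow χ² values in Table 2 can be approximately reproduced from the observed and expected counts alone. The code below is an illustration, not the SAS procedure used for the article. It assumes the usual grouped form of the statistic, the sum over groups of (observed − expected)² / [expected × (1 − expected / group size)], and it assumes 10 equal groups of about 425 subjects each, a hypothetical figure because the article does not report the group sizes. The function names are likewise made up for this example.

```python
import math

# Observed and expected angina cases per ranking group, copied from Table 2.
TABLE_2 = {
    "unadjusted": {
        "observed": [26, 26, 53, 55, 60, 57, 69, 65, 72, 90],
        "expected": [34.4, 40.1, 43.8, 49.2, 52.0, 57.4, 61.3, 65.6, 75.4, 93.8],
    },
    "adjusted": {
        "observed": [22, 31, 41, 37, 56, 63, 57, 70, 83, 112],
        "expected": [23.2, 31.6, 38.4, 44.4, 50.4, 56.0, 62.8, 71.2, 83.1, 111.1],
    },
}

# ASSUMPTION: the article does not give group sizes; roughly one tenth of the
# analyzed subjects is used here as a stand-in.
ASSUMED_GROUP_SIZE = 425


def hosmer_lemeshow_statistic(observed, expected, group_size):
    """Sum over groups of (O - E)^2 / (E * (1 - E / n))."""
    total = 0.0
    for o, e in zip(observed, expected):
        variance = e * (1 - e / group_size)  # binomial variance of the count
        total += (o - e) ** 2 / variance
    return total


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no extra library is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only handles positive even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


results = {}
for model, counts in TABLE_2.items():
    statistic = hosmer_lemeshow_statistic(
        counts["observed"], counts["expected"], ASSUMED_GROUP_SIZE
    )
    # 10 groups give 10 - 2 = 8 degrees of freedom, as stated under Table 2.
    p_value = chi2_upper_tail_even_df(statistic, df=8)
    results[model] = (statistic, p_value)
    print(f"{model}: chi-square = {statistic:.2f}, P = {p_value:.3f}")
```

Under these assumptions the computed values come out close to the 13.6 and 4.0 reported in Table 2; small differences are expected because the true group sizes and unrounded expected counts are not available.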
