Austin 2004
Abstract
Objectives: Automated variable selection methods are frequently used to determine the independent predictors of an outcome. The
objective of this study was to determine the reproducibility of logistic regression models developed using automated variable selection
methods.
Study Design and Setting: An initial set of 29 candidate variables was considered for predicting mortality after acute myocardial
infarction (AMI). We drew 1,000 bootstrap samples from a dataset consisting of 4,911 patients admitted to hospital with an AMI. Using
each bootstrap sample, logistic regression models predicting 30-day mortality were obtained using backward elimination, forward selection,
and stepwise selection. The agreement between the different model selection methods and the agreement across the 1,000 bootstrap samples
were compared.
Results: Using 1,000 bootstrap samples, backward elimination identified 940 unique models for predicting mortality. Similar results
were obtained for forward and stepwise selection. Three variables were identified as independent predictors of mortality among all bootstrap
samples. Over half the candidate prognostic variables were identified as independent predictors in less than half of the bootstrap samples.
Conclusion: Automated variable selection methods result in models that are unstable and not reproducible. The variables selected as
independent predictors are sensitive to random fluctuations in the data. © 2004 Elsevier Inc. All rights reserved.
Keywords: Regression models; Multivariate analysis; Variable selection; Logistic regression; Acute myocardial infarction; Epidemiology
trials, whereas others included patients from the general AMI population. Second, the studies differed in terms of the variables collected a priori as potential predictors of mortality. Third, the studies differed in terms of the statistical methods used to model mortality. However, despite these differences between studies, different studies identified different variables as independent predictors of mortality after AMI.

Investigators developing models to predict mortality need to maintain a balance between including too many variables and model parsimony [2,3]. Omitting important prognostic factors results in a systematic mis-estimation of the regression coefficients and biased prediction, and including too many predictors results in the loss of precision in the estimation of the regression coefficients and the predictions of new responses [2]. Researchers frequently use automated variable selection methods, such as backward elimination or forward variable selection techniques, to identify independent predictors of mortality or for developing parsimonious regression models. Automated variable selection methods have been used in several studies examining the independent predictors of mortality after AMI [4–7].

The purposes of the current study were (1) to determine the degree to which random variability in a dataset can result in different variables being identified as independent predictors of mortality after an AMI (this allows one to assess the reproducibility or stability of models obtained using automated model selection methods) and (2) to compare the agreement between different automated model selection methods.

1.1. Model selection methods

Multiple automated variable selection methods have been developed. The three most commonly used methods are backward elimination, forward selection, and stepwise selection. Miller [8,9] and Hocking [10] provide comprehensive overviews of model selection methods. We briefly summarize these methods. Backward elimination begins with a full model consisting of all candidate predictor variables. Variables are sequentially eliminated from the model until a pre-specified stopping rule is satisfied. At a given step of the elimination process, the variable whose elimination would result in the smallest decrease in a summary measure is eliminated. Possible summary measures are deviance or R2. The most common stopping rule is that all variables that remain in the model are significant at a pre-specified significance level.

Forward selection begins with the empty model. Variables are added sequentially to a model until a predefined stopping rule is satisfied. At a given step of the selection process, the variable whose addition would result in the greatest increase in the summary measure is added to the model. A typical stopping rule is that if any added variable would not be significant at a predefined significance level, then no further variables are added to the model.

Stepwise selection is a variation of forward selection. At each step of the variable selection process, after a variable has been added to the model, variables are allowed to be eliminated from the model. For instance, if the significance of a given predictor is above a specified threshold, it is eliminated from the model. The iterative process is ended when a pre-specified stopping rule is satisfied.

Statisticians have several concerns about the use of automated variable selection methods: (1) it results in values of R2 that are biased high [11,12], (2) it results in estimated standard errors that are biased low [13], (3) the results are dependent upon the correlation between the predictor variables [14], and (4) the ordinary test statistics upon which these methods are based were intended for testing pre-specified hypotheses [13]. These results have been demonstrated in the context of linear regression estimated using ordinary least squares. The impact of automated model selection methods on logistic regression models needs to be more fully examined.

2. Methods

2.1. Data sources

The Ontario Myocardial Infarction Database (OMID) is a population-based database of patients admitted with a most responsible diagnosis of AMI in the province of Ontario. It consists of patients discharged between 1 April 1992 and 31 March 2002. The OMID database was constructed by linking together Ontario's major health care administrative databases. Details on the construction of the OMID database are provided elsewhere [15,16]. One of the limitations of the OMID database is the paucity of detailed clinical data. To address this limitation, detailed clinical data were collected on a random sample of 6,015 patients discharged from 57 Ontario hospitals between 1 April 1999 and 31 March 2001 by retrospective chart review to supplement the OMID administrative data. Data on patient history, cardiac risk factors, comorbid conditions and vascular history, vital signs, and laboratory tests were collected for this sample of patients.

2.2. Statistical methods

We chose a list of candidate variables that would be examined for their association with mortality within 30 days of AMI admission. These variables included demographic characteristics (age and gender), presenting signs and symptoms (cardiogenic shock and acute pulmonary edema), classical cardiac risk factors (diabetes, history of cerebrovascular accident or transient ischemic attack [CVA/TIA], history of hyperlipidemia, hypertension, family history of heart disease, and smoking history), comorbid conditions and vascular history (angina, asthma, coagulopathy, cancer, chronic liver disease, chronic congestive heart failure, dementia/Alzheimer's, depression, hyperthyroid, peptic ulcer disease, peripheral arterial disease, renal disease, previous MI, previous percutaneous transluminal coronary angioplasty [PTCA], previous coronary artery bypass graft [CABG] surgery, and aortic stenosis), vital signs on admission (systolic and diastolic blood pressure, heart rate, and respiratory rate), laboratory test results—hematology (hemoglobin, white blood count, and international normalized ratio), and laboratory test results—chemistry (sodium, potassium, glucose, urea, creatinine, total cholesterol, HDL and LDL cholesterol, and triglycerides).

P.C. Austin, J.V. Tu / Journal of Clinical Epidemiology 57 (2004) 1138–1146

Each of the above variables was examined for its univariate association with 30-day mortality. For categorical variables, a chi-squared test was used to determine the statistical significance of the association between the variable and mortality within 30 days of admission. Dichotomous risk factors were assumed to be absent unless their presence was explicitly documented in the patient's medical record. Risk factors or conditions whose prevalence was <1% in the population of patients were excluded from further analyses. For continuous variables, a univariate logistic regression predicting 30-day mortality was fit to determine the statistical significance of the association between the variable and mortality. Continuous variables that were missing for more than 10% of the population were excluded from further analyses. Exclusion of patients with missing data on continuous variables resulted in a final sample size of 4,911 for the multivariate analyses.

Variables that were significantly associated with 30-day mortality with a significance level of P < .25 were selected for possible inclusion in multivariate logistic regression models to predict 30-day mortality. We chose P = .25 as the threshold for including variables in the multivariate model because this has been suggested elsewhere as an appropriate threshold [17]. In sensitivity analyses, we also explored the use of P = .10 and P = .05 as alternate thresholds. Table 1 contains each variable identified above, its prevalence (for categorical variables), the proportion of patients with missing data (continuous variables), and the statistical significance of its univariate association with 30-day mortality.

Table 1. Candidate variables and their univariate association with 30-day mortality

A. Categorical variables

Variable                                  Prevalence   30-day mortality rate(a)   P value
Female sex                                35.3%        14.7%                      <.0001
Patient history — admission
  Acute pulmonary edema                   5.7%         25.9%                      <.0001
  Cardiogenic shock                       2.2%         67.4%                      <.0001
Cardiac risk factors
  Diabetes                                24.9%        13.8%                      <.0001
  Hypertension                            44.3%        11.1%                      .6962
  Smoking history                         34.2%        7.1%                       <.0001
  CVA/TIA                                 9.6%         18.5%                      <.0001
  Hyperlipidemia                          31.0%        7.0%                       <.0001
  Family history of heart disease         30.3%        4.2%                       <.0001
Comorbid conditions and vascular history
  Angina                                  31.3%        12.1%                      .0466
  Cancer                                  3.7%         17.3%                      .0017
  Dementia/Alzheimer's                    3.8%         35.7%                      <.0001
  Peptic ulcer disease                    4.9%         11.8%                      .6100
  Previous MI                             21.9%        14.7%                      <.0001
  Asthma                                  4.8%         10.7%                      .8962
  Liver disease                           0.6%         16.2%                      .2877
  Depression                              6.7%         16.8%                      <.0001
  Peripheral arterial disease             6.8%         16.9%                      <.0001
  Previous PTCA                           3.6%         5.5%                       .0095
  Coagulopathy                            0.3%         5.0%                       .7169
  Congestive heart failure (chronic)      5.6%         30.7%                      <.0001
  Hyperthyroid                            1.7%         12.6%                      .5772
  Renal disease (dialysis dependent)      0.6%         18.2%                      .1662
  Previous CABG                           6.9%         12.4%                      .3263
  Stenosis (aortic)                       1.5%         24.4%                      <.0001

B. Continuous variables

Variable                                  Percent missing   Odds ratio(b)   P value
Age                                       0.0%              1.080           <.0001
Vital signs on admission
  Systolic blood pressure                 0.6%              0.977           <.0001
  Diastolic blood pressure                0.8%              0.967           <.0001
  Heart rate                              0.9%              1.013           <.0001
  Respiratory rate                        9.6%              1.079           <.0001
Laboratory tests (hematology)
  Hemoglobin                              1.9%              0.971           <.0001
  White blood count                       1.9%              1.071           <.0001
  International normalized ratio          17.6%             1.464           .0006
Laboratory tests (chemistry)
  Sodium                                  2.1%              0.951           <.0001
  Potassium                               2.3%              1.861           <.0001
  Glucose                                 4.2%              1.064           <.0001
  Urea                                    7.1%              1.125           <.0001
  Creatinine                              2.9%              1.007           <.0001
  Total cholesterol                       43.0%             0.627           .1108
  HDL cholesterol                         52.8%             1.000           1.0000
  LDL cholesterol                         55.1%             0.585           <.0001
  Triglycerides                           43.9%             0.834           .0236

(a) Overall 30-day mortality in the cohort was 10.9%.
(b) Odds ratio is the relative change in the odds of 30-day mortality with a one-unit increase in the predictor variable.

We used bootstrap methods to examine the stability of models predicting 30-day mortality after AMI [18]. First, all observations with any missing data on the continuous predictors were eliminated. From this sample, we chose 1,000 bootstrap samples. A bootstrap sample is a sample of the same size as the original dataset chosen with replacement. Thus, a given subject in the original cohort may occur multiple times, only once, or not at all in a specific bootstrap sample. Once we chose a bootstrap sample, we used three different variable reduction methods to arrive at a final regression model for determining the variables that are significant independent predictors of 30-day mortality. First, we used backward elimination with a threshold of P = .05 for eliminating a variable from the model. Second, we used forward model selection with a threshold of P = .05 for selecting a variable for inclusion in the model. Third, we used stepwise model selection with thresholds of P = .05
for variable selection and for variable elimination. Thus, for a given bootstrap sample, we obtained three final models: one obtained using backward elimination, one obtained using forward variable selection, and one obtained using stepwise variable selection. For each model, we noted which variables had been selected and compared the results across the three variable selection methods. We repeated this process using the 1,000 bootstrap samples. We determined the proportion of regression models in which each of the candidate variables was retained, and we determined the proportion of bootstrap samples in which there was agreement between the different model selection methods. Finally, we determined the distribution of the number of variables selected for the final model using each of the three model selection methods.

We repeated the above analysis using only variables that were significantly associated with 30-day mortality in a univariate analysis at the P < .10 level and again at the P < .05 level. Finally, we repeated the primary analysis to examine the impact of using clinical judgment in association with automated variable selection methods. To do so, we made several clinical judgments. First, we identified three variables a priori that we felt to be the strongest predictors of AMI mortality. These variables were age, the presence of cardiogenic shock at presentation, and systolic blood pressure on admission. These three variables were forced into each regression model using backward, forward, and stepwise model selection. Second, two of the variables, creatinine and urea levels, are, to a certain degree, surrogates for one another and are usually correlated with each other. Thus, the decision was made to consider only creatinine levels as a candidate and to exclude urea from the regression models. Similarly, the decision was made to exclude diastolic blood pressure due to the inclusion of systolic blood pressure. Finally, three of the variables (history of previous myocardial infarction [MI], previous PTCA, and previous CABG) are markers for the presence or severity of previous coronary artery disease. The decision was made to retain only one of these variables (history of previous MI) and to exclude the other two as potential independent predictors of mortality.

3. Results

Backward model selection resulted in 940 unique regression models in the 1,000 bootstrap samples. Eight hundred eighty-nine models were chosen only once, 45 models were chosen twice, 3 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using backward selection. Forward model selection resulted in 932 unique regression models in the 1,000 bootstrap samples. Eight hundred seventy-nine models were chosen only once, 41 models were chosen twice, 9 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using forward selection. Stepwise model selection resulted in 936 unique regression models in the 1,000 bootstrap samples. Eight hundred eighty-six models were chosen only once, 39 models were chosen twice, 8 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using stepwise selection. Over the 1,000 bootstrap samples, the variables selected by backward selection agreed with those selected by forward selection in 83.4% of the bootstrap samples. The model determined by backward selection agreed with that determined by stepwise selection in 90.1% of the bootstrap samples. Finally, the model selected by forward selection agreed with that determined by stepwise selection in 91.5% of the bootstrap samples.

The number of times that each variable was selected using each of the three variable selection techniques is depicted in Fig. 1. Age, systolic blood pressure, and shock at presentation were identified as independent predictors of mortality in all 1,000 models using each of the three variable selection methods. Glucose level was identified as an independent predictor of mortality in at least 99.5% of the bootstrap samples using each of the three methods. White blood count was similarly identified in at least 98.7% of the bootstrap samples using backward selection, forward selection, and stepwise selection. Urea was identified as an independent predictor of mortality in 91.0%, 94.8%, and 91.8% of the bootstrap samples using backward selection, forward selection, and stepwise selection, respectively. The remaining 23 variables were selected in <90% of the bootstrap samples using each model selection method. Eighteen of the 29 variables were identified as independent predictors of mortality in less than half of the bootstrap samples using backward selection. Six variables (cancer, sodium, diastolic blood pressure, diabetes, smoking status, and history of previous MI) were selected in less than 10% of the bootstrap models using each variable selection method.

Six variables were identified as independent predictors of mortality in fewer than 10% of the bootstrap samples using backward elimination (cancer, sodium, diastolic blood pressure, diabetes, smoking status, and history of previous MI). At least one of these six variables was identified as an independent predictor in 37.3% of the bootstrap models using backward elimination. Twelve variables were identified as independent predictors of mortality in <20% of the bootstrap samples using backward elimination (peripheral arterial disease, stenosis, angina, CVA/TIA, hyperlipidemia, and hemoglobin, in addition to the above six variables). However, in 77.8% of the bootstrap samples, at least one of these 12 variables was identified as an independent predictor of mortality using backward selection. Comparable results were obtained for forward and stepwise selection.

The number of variables selected in the 1,000 bootstrap replications using each variable selection method is depicted in Fig. 2. Each model selection method resulted in a final model with between 8 and 19 variables in the 1,000 bootstrap samples. Furthermore, the distribution of the
[Figure: bar chart; vertical axis, number of times variable chosen (0 to 1,000); one bar per candidate variable for backward, forward, and stepwise selection.]
Fig. 1. Number of times each variable was selected by automated variable selection methods.
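The resampling-and-selection procedure described in the Methods, whose per-variable counts Fig. 1 summarizes, can be sketched in a few lines. This is a hedged illustration, not the authors' code (the original analysis would have fit logistic regressions in a statistical package): `fit_pvalues_on` is a hypothetical stand-in that fits a model on a bootstrap sample and returns a P value per variable, and only the backward-elimination branch is shown.

```python
import random
from collections import Counter

def backward_eliminate(candidates, fit_pvalues, alpha=0.05):
    """Drop the least significant variable until every remaining
    variable is significant at alpha (the stopping rule in Section 1.1)."""
    model = list(candidates)
    while model:
        pvals = fit_pvalues(model)          # P value for each variable in the model
        worst = max(model, key=lambda v: pvals[v])
        if pvals[worst] < alpha:            # all remaining variables significant
            break
        model.remove(worst)
    return frozenset(model)

def bootstrap_stability(rows, candidates, fit_pvalues_on, n_boot=1000, seed=0):
    """Resample with replacement and count how often each variable survives."""
    rng = random.Random(seed)
    selection_freq = Counter()              # per-variable counts (as in Fig. 1)
    unique_models = Counter()               # per-model counts (940 unique models, etc.)
    for _ in range(n_boot):
        # a bootstrap sample: same size as the data, drawn with replacement,
        # so a subject may appear multiple times, once, or not at all
        sample = [rng.choice(rows) for _ in range(len(rows))]
        chosen = backward_eliminate(candidates,
                                    lambda m: fit_pvalues_on(sample, m))
        unique_models[chosen] += 1
        selection_freq.update(chosen)
    return selection_freq, unique_models
```

Forward and stepwise selection would replace `backward_eliminate` with the corresponding add-only and add-then-drop loops; `unique_models` is the quantity that reveals hundreds of distinct models across the 1,000 replicates.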
number of variables in the resultant models is approximately normally distributed. Using backward selection, the majority of the resultant models (77.1%) had between 11 and 14 variables that were identified as independent predictors of mortality. For forward and stepwise selection, 78.3% and 78.2% of the resultant models had between 11 and 14 variables identified as independent predictors of mortality, respectively.

Within each bootstrap sample, we determined the number of variables identified as independent predictors of mortality

[Figure: grouped histogram; horizontal axis, number of variables selected (8 to 19); vertical axis, percent of bootstrap samples; series: backward, forward, and stepwise selection.]
Fig. 2. Number of variables in final model.
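The between-method comparisons behind the agreement percentages and Fig. 3 reduce to simple set and count operations on the per-sample selections. A minimal sketch, assuming (hypothetically) that `models_a` and `models_b` hold the variable sets chosen by two methods, one frozenset per bootstrap sample:

```python
from collections import Counter

def compare_selection_methods(models_a, models_b):
    """Fraction of bootstrap samples in which two methods chose the identical
    variable set, and the distribution of differences in model size (Fig. 3)."""
    n = len(models_a)
    identical = sum(a == b for a, b in zip(models_a, models_b)) / n
    size_diff = Counter(len(a) - len(b) for a, b in zip(models_a, models_b))
    return identical, size_diff
```

Applied to backward versus forward selection, `identical` would be the 83.4% agreement reported above, and `size_diff` the distribution plotted in Fig. 3.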
[Figure: grouped histogram; horizontal axis, difference in number of variables (−2 to 3); vertical axis, percent of bootstrap samples; series: backward vs. forward, backward vs. stepwise, forward vs. stepwise.]
Fig. 3. Differences in number of variables identified as independent predictors.
using each of the three variable selection methods. We then determined the difference in the number of variables identified using each of the three selection methods. Results are reported in Fig. 3. In 85.6% of the bootstrap samples, backward selection and forward selection identified the same number of variables as being independent predictors of mortality. However, in 6.3% of the bootstrap samples, backward selection identified more independent predictors of mortality than did forward selection. In six bootstrap samples, backward selection identified three more independent predictors of mortality than did forward selection. In 91.9% of the bootstrap samples, backward selection and stepwise selection identified the same number of variables as being independent predictors of mortality. However, in 7.0% of the bootstrap samples, backward selection identified more independent predictors of mortality than did stepwise selection. In seven bootstrap samples, backward selection identified three more independent predictors of mortality than did stepwise selection. In 91.8% of the bootstrap samples, forward selection and stepwise selection identified the same number of variables as being independent predictors of mortality. However, in 8.2% of the bootstrap samples, forward selection identified more independent predictors of mortality than did stepwise selection. In seven bootstrap samples, forward selection identified two more independent predictors of mortality than did stepwise selection.

As a sensitivity analysis, we included as candidate variables for the variable selection methods only variables whose association with mortality in a univariate analysis was significant at the P = .10 or P = .05 level. This resulted in the same variables being considered as candidate variables. This was due to the fact that the risk factors whose prevalence was at least 1% and that were significant at P = .25 were also significant at the .10 and .05 levels (Table 1).

When clinical judgment was used in association with automated model selection methods, results similar to those described above were obtained. Creatinine, glucose, and white blood count were identified as independent predictors of mortality in 99.9%, 99.9%, and 98.5% of bootstrap samples, respectively, using backward elimination. Sixteen of 26 variables were identified as significant in fewer than half of the bootstrap models using backward elimination. Despite using clinical judgment in variable selection, backward elimination identified 845 unique subsets of variables. No model was identified more than seven times in the 1,000 bootstrap samples.

As a final sensitivity analysis, we created a random variable for inclusion in the automated variable selection process. This is an adaptation of an approach suggested by Miller [9]. For each subject, we generated a random variable from a standard normal distribution. This value was generated to be independent of the outcome and all other variables. We then included this randomly generated noise variable in the backward, forward, and stepwise model selection processes. This variable was included in 16.3% of the models selected using backward elimination. It was selected in 15.9% and 15.8% of the models selected using forward and stepwise selection, respectively. Nineteen variables were selected more frequently than this randomly generated noise variable; however, 10 variables were selected less frequently than this noise variable (angina, hyperlipidemia, CVA/TIA, hemoglobin, cancer, diastolic blood pressure, diabetes, sodium, smoker, and previous AMI). Thus, a randomly generated variable was identified as an independent predictor
of mortality more often than 34.5% of the clinical characteristics considered in the current study.

4. Discussion

Using detailed clinical data on 4,911 AMI patients, we have highlighted several issues concerning the use of automated model selection methods. First, the variables identified as independent predictors of AMI mortality using automated variable selection methods are highly dependent on random fluctuations in the subjects included in the dataset. Using 1,000 bootstrap samples, backward selection identified 940 distinct subsets of variables as independent predictors of mortality. Furthermore, no regression model was chosen more than four times. Similar results were obtained for forward and stepwise model selection. Second, the majority of candidate variables were identified as independent predictors of mortality in only a minority of models. This is despite the fact that the candidate variables were all plausible predictors of AMI mortality. Three variables were identified as independent predictors of mortality in all bootstrap models; an additional three variables were identified as independent predictors of mortality in at least 90% of the models chosen using backward selection, and a further one variable was identified as an independent predictor of mortality in at least 80% of the models. Third, for a given model selection method, there was substantial variability in the number of variables that were identified as independent predictors of mortality across the 1,000 bootstrap replications. Fourth, backward selection and forward selection methods agreed with one another on the independent predictors of mortality in 83.4% of the bootstrap samples. Thus, our findings explain, in part, why different studies of AMI mortality consistently identify different predictors of mortality.

The use of automated variable selection methods is controversial, with many statisticians having reservations about their use. There are multiple reasons for this. Flack and Chang [11], using simulations, demonstrated that a large proportion of selected variables are truly independent of the outcome and that the resultant model's R2 is upwardly biased. A similar study, also using simulation methods, determined that between 20% and 74% of variables selected are noise variables that are unrelated to the outcome [14]. Furthermore, the number of noise variables included increased as the number of candidate variables increased. Similarly, using simulations, Murtaugh [2] reported that the probability of correctly identifying variables was inversely proportional to the number of variables under consideration. Furthermore, coefficient estimates obtained using automated variable selection methods are biased away from zero, and P values associated with testing the statistical significance of each independent variable are biased low. Thus, the magnitude of the association between each selected variable and the outcome is larger than is warranted by the data, and the statistical significance of the association is overstated. These results were demonstrated in the context of linear regression using ordinary least squares. We have demonstrated that similar concerns are valid for logistic regression models.

Henderson and Velleman [19] caution against the blind use of automated variable selection methods, arguing for the use of an interactive model-building approach. They argue that the data analyst must bring subject-specific knowledge to the model-building process. Furthermore, automated model-building methods can mask problems with multicollinearity, nonlinearity, and observations with high leverage.

There are at least three reasons for constructing regression models. The first is primarily epidemiologic in focus. In this setting, one is interested in assessing the association between an exposure variable and an outcome of interest. Such an assessment must take into account possible confounders and effect modifiers [20]. In such a setting, a structured approach to modeling can be used because the researcher has a primary hypothesis guiding the model-building process. Furthermore, the variables identified a priori as possible confounders or effect modifiers can be informed by the data analyst's previous experience and knowledge of the research field. The second reason for constructing regression models is for predictive purposes. In cardiovascular research, one frequently wants to predict the likelihood of an adverse outcome such as mortality. In such instances, one is more concerned with predictive accuracy than with the variables that are entered in the model. Model development should incorporate methods that involve data splitting [21] or bootstrap assessments of predictive accuracy [18]. The third reason for constructing regression models is hypothesis generation or exploratory analysis. In this setting, one is interested in determining the independent predictors of an event. Automated model selection methods are frequently used because the model-building process is not guided by a clear hypothesis that one is interested in testing. In clinical research, it is important to determine which variables are associated with a worse prognosis. This allows appropriate risk stratification and can allow clinicians to provide more appropriate medical care.

In the first setting described above, the use of automated variable selection methods is inappropriate and unnecessary because the researcher is guided by a clear hypothesis and should use a structured approach to assessing the effects of confounders and effect modifiers. In the second setting described above, automated variable selection methods are not inappropriate if used in conjunction with cross-validation or data-splitting methods to determine predictive accuracy in an independent validation dataset. The resultant model is likely to contain important independent predictors of mortality and noise variables mistakenly identified as predictors of mortality. However, because the objective of the model is prediction and not hypothesis testing, this is not a major limitation. The use of automated variable selection methods in this context would be problematic if variables that were
mistakenly identified as independent predictors of the outcome were expensive or difficult to obtain. In a related article, the authors [22] examine the use of backward variable elimination in conjunction with bootstrap methods to develop predictive models. In the third setting described above, the use of automated variable selection methods is likely to result in the greatest problems. The resultant model will likely contain variables that are true predictors of the outcome and variables that have mistakenly been identified as predictors of the outcome. By interpreting the model in isolation, one cannot assess which variables fall into each of the two categories. To draw more robust conclusions, we suggest several possible strategies: (1) using the bootstrap methods described in the current study to assess the strength of the evidence that a given predictor is an independent predictor of mortality, (2) comparing the final model with other models reported in the literature to assess the consistency of the findings, and (3) validating the model in an independent dataset.

Altman and Andersen [23] used bootstrap sampling to assess the stability of a Cox regression model fit to 216 patients enrolled in a clinical trial. Using 17 candidate variables, they demonstrated that stepwise selection identified between 4 and 10 variables as independent predictors of mortality. Furthermore, the frequency with which variables were included in the models ranged from a high of 100% for one variable to a low of 6% for another variable. Eleven variables were identified as significant predictors in less than half of the bootstrap models.

Despite theoretical concerns about the validity of automated model selection methods, such methods are frequently used in the clinical literature. Studies in the literature identify different variables as independent predictors of mortality

modeling should not be separated from subject-matter expertise. Regression modeling should be informed by clinical knowledge and not be treated as a "black box." Third, when automated variable selection methods are used in an exploratory analysis to determine which variables are associated with the outcome of interest, we recommend that bootstrap methods similar to those outlined in the current manuscript be used to determine the strength of the evidence that a given variable truly is an independent predictor of the outcome.

Acknowledgments

The Institute for Clinical Evaluative Sciences is supported in part by a grant from the Ontario Ministry of Health and Long-Term Care. The opinions, results and conclusions are those of the authors and no endorsement by the Ministry of Health and Long-Term Care or by the Institute for Clinical Evaluative Sciences is intended or should be inferred. Financial support for this study was provided in part by a grant from the Canadian Institutes of Health Research (CIHR) to the Canadian Cardiovascular Outcomes Research Team (CCORT). Dr. Austin is supported in part by a New Investigator Award from the CIHR. Dr. Tu is supported by a Canada Research Chair in Health Services Research.

References

[1] Krumholz HM, Chen J, Wang Y, Radford MJ, Chen YT, Marciniak TA. Comparing AMI mortality among hospitals in patients 65 years and older: evaluating methods of risk adjustment. Circulation 1999;99:
2986–92.
after AMI. Part of this is due to differences in the inclusion/ [2] Murtaugh PA. Methods of variable selection in regression model-
exclusion criteria upon which the different cohorts were ing. Commun Stat Simulation Computation 1998;27:711–34.
based. We have identified that an additional reason for this [3] Wears RL, Lewis RJ. Statistical models and Occam’s razor. Acad Emerg
is that small degrees of random variation in one dataset can Med 1999;6:93–4.
[4] The Multicenter Postinfarction Research Group. Risk stratification
have a substantial influence on the variables that are identi-
and survival after myocardial infarction. N Engl J Med 1983;309:
fied as independent predictors of mortality and on the number 331–6.
of variables that are identified as independent predictors of [5] Suarez C, Herrera M, Vera A, Torrado E, Ferriz J, Arboleda JA.
mortality. Thus, it is likely that no one regression model Prediction on admission of in-hospital mortality in patients older
estimated on one dataset can conclusively identify the inde- than 70 years with acute myocardial infarction. Chest 1995;108:83–8.
pendent predictors of mortality. Such a model will likely [6] Henning R, Wedel H, Wilhelmsen L. Mortality risk estimation by
multivariate statistical analyses. Acta Med Scand 1975;586(Suppl):
include variables that truly are independently associated with 14–31.
mortality. However, such a model will also most likely in- [7] Dubois C, Pierard LA, Albert A, Smeets J, Demoulin J, Boland J,
clude spurious variables that are not true independent pre- Kulbertus HE. Short-term risk stratification at admission based on
dictors of mortality. Furthermore, our study demonstrated that simple clinical data in acute myocardial infarction. Am J Cardiol
by changing the method by which variables are selected, 1988;61:216–9.
[8] Miller AJ. Selection of subsets of regression variables. J R Stat Soc
two investigators, working with the same data, may identify [Ser A] 1984;147:389–425.
different regression models. [9] Miller A. Subset selection in regression. 2nd edition. Boca Raton
In conclusion, we make several recommendations. First, (FL): Chapman & Hall/CRC; 2002.
investigators need to be aware of the limitations of using [10] Hocking RR. The analysis and selection of variables in linear regres-
automated variable selection methods. When using these sion. Biometrics 1976;32:1–49.
[11] Flack VF, Chang PC. Frequency of selecting noise variables in subset
methods, there is a strong likelihood that variables that regression analysis: a simulation study. Am Stat 1987;14:84–6.
are truly independent of the outcome will be identified as [12] Copas JB, Long T. Estimating the residual variance in orthogonal
independent predictors of the outcome. Second, statistical regression with variable selection. Statistician 1991;40:51–9.
[13] Harrell FE Jr. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag; 2001.
[14] Derksen S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol 1992;45:265–82.
[15] Tu JV, Naylor CD, Austin P. Temporal changes in the outcomes of acute myocardial infarction in Ontario, 1992–1996. Can Med Assoc J 1999;161:1257–61.
[16] Tu JV, Austin P, Naylor CD, Iron K, Zhang H. Acute myocardial infarction outcomes in Ontario. In: Naylor CD, Slaughter PM, editors. Cardiovascular health services in Ontario: an ICES atlas. Toronto, Canada: Institute for Clinical Evaluative Sciences; 1999. p. 83–110.
[17] Hosmer DW, Lemeshow S. Applied logistic regression. New York: John Wiley & Sons; 1989.
[18] Efron B, Tibshirani R. An introduction to the bootstrap. London: Chapman & Hall; 1993.
[19] Henderson HV, Velleman PF. Building multiple regression models interactively. Biometrics 1981;37:391–411.
[20] Rothman KJ, Greenland S. Modern epidemiology. 2nd edition. Philadelphia: Lippincott Williams & Wilkins; 1998.
[21] Picard RR, Berk KN. Data splitting. Am Stat 1990;44:140–7.
[22] Austin PC, Tu JV. Bootstrap methods for developing predictive models. Am Stat 2004;58:131–7.
[23] Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med 1989;8:771–83.