Frank E. Harrell, Jr.
Regression Modeling Strategies
With Applications to Linear Models, Logistic Regression, and Survival Analysis
With 141 Figures
Springer
Contents
Preface . . . vii
Typographical Conventions . . . xxiii
1 Introduction . . . 1
1.1 Hypothesis Testing, Estimation, and Prediction . . . 1
1.2 Examples of Uses of Predictive Multivariable Modeling . . . 3
1.3 Planning for Modeling . . . 4
1.3.1 Emphasizing Continuous Variables . . . 6
1.4 Choice of the Model . . . 6
1.5 Further Reading . . . 8
2 General Aspects of Fitting Regression Models . . . 11
2.1 Notation for Multivariable Regression Models . . . 11
2.2 Model Formulations . . . 12
2.3 Interpreting Model Parameters . . . 13
2.3.1 Nominal Predictors . . . 14
2.3.2 Interactions . . . 14
2.3.3 Example: Inference for a Simple Model . . . 15
2.4 Relaxing Linearity Assumption for Continuous Predictors . . . 16
2.4.1 Simple Nonlinear Terms . . . 16
2.4.2 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations . . . 18
2.4.3 Cubic Spline Functions . . . 19
2.4.4 Restricted Cubic Splines . . . 20
2.4.5 Choosing Number and Position of Knots . . . 23
2.4.6 Nonparametric Regression . . . 24
2.4.7 Advantages of Regression Splines over Other Methods . . . 26
2.5 Recursive Partitioning: Tree-Based Models . . . 26
2.6 Multiple Degree of Freedom Tests of Association . . . 27
2.7 Assessment of Model Fit . . . 29
2.7.1 Regression Assumptions . . . 29
2.7.2 Modeling and Testing Complex Interactions . . . 32
2.7.3 Fitting Ordinal Predictors . . . 34
2.7.4 Distributional Assumptions . . . 35
2.8 Further Reading . . . 36
2.9 Problems . . . 37
3 Missing Data . . . 41
3.1 Types of Missing Data . . . 41
3.2 Prelude to Modeling . . . 42
3.3 Missing Values for Different Types of Response Variables . . . 43
3.4 Problems with Simple Alternatives to Imputation . . . 43
3.5 Strategies for Developing Imputation Algorithms . . . 44
3.6 Single Conditional Mean Imputation . . . 47
3.7 Multiple Imputation . . . 47
3.8 Summary and Rough Guidelines . . . 48
3.9 Further Reading . . . 50
3.10 Problems . . . 51
4 Multivariable Modeling Strategies . . . 53
4.1 Prespecification of Predictor Complexity Without Later Simplification . . . 53
4.2 Checking Assumptions of Multiple Predictors Simultaneously . . . 56
4.3 Variable Selection . . . 56
4.4 Overfitting and Limits on Number of Predictors . . . 60
4.5 Shrinkage . . . 61
4.6 Collinearity . . . 64
4.7 Data Reduction . . . 66
4.7.1 Variable Clustering . . . 66
4.7.2 Transformation and Scaling Variables Without Using Y . . . 67
4.7.3 Simultaneous Transformation and Imputation . . . 69
4.7.4 Simple Scoring of Variable Clusters . . . 70
4.7.5 Simplifying Cluster Scores . . . 72
4.7.6 How Much Data Reduction Is Necessary? . . . 73
4.8 Overly Influential Observations . . . 74
4.9 Comparing Two Models . . . 77
4.10 Summary: Possible Modeling Strategies . . . 79
4.10.1 Developing Predictive Models . . . 79
4.10.2 Developing Models for Effect Estimation . . . 82
4.10.3 Developing Models for Hypothesis Testing . . . 83
4.11 Further Reading . . . 84
5 Resampling, Validating, Describing, and Simplifying the Model . . . 87
5.1 The Bootstrap . . . 87
5.2 Model Validation . . . 90
5.2.1 Introduction . . . 90
5.2.2 Which Quantities Should Be Used in Validation? . . . 91
5.2.3 Data-Splitting . . . 91
5.2.4 Improvements on Data-Splitting: Resampling . . . 93
5.2.5 Validation Using the Bootstrap . . . 94
5.3 Describing the Fitted Model . . . 97
5.4 Simplifying the Final Model by Approximating It . . . 98
5.4.1 Difficulties Using Full Models . . . 98
5.4.2 Approximating the Full Model . . . 99
5.5 Further Reading . . . 101
6 S-Plus Software . . . 105
6.1 The S Modeling Language . . . 106
6.2 User-Contributed Functions . . . 107
6.3 The Design Library . . . 108
6.4 Other Functions . . . 119
6.5 Further Reading . . . 120
7 Case Study in Least Squares Fitting and Interpretation of a Linear Model . . . 121
7.1 Descriptive Statistics . . . 122
7.2 Spending Degrees of Freedom/Specifying Predictor Complexity . . . 127
7.3 Fitting the Model Using Least Squares . . . 128
7.4 Checking Distributional Assumptions . . . 131
7.5 Checking Goodness of Fit . . . 135
7.6 Overly Influential Observations . . . 135
7.7 Test Statistics and Partial R^2 . . . 136
7.8 Interpreting the Model . . . 137
7.9 Problems . . . 142
8 Case Study in Imputation and Data Reduction . . . 147
8.1 Data . . . 147
8.2 How Many Parameters Can Be Estimated? . . . 150
8.3 Variable Clustering . . . 151
8.4 Single Imputation Using Constants or Recursive Partitioning . . . 154
8.5 Transformation and Single Imputation Using transcan . . . 157
8.6 Data Reduction Using Principal Components . . . 160
8.7 Detailed Examination of Individual Transformations . . . 168
8.8 Examination of Variable Clusters on Transformed Variables . . . 169
8.9 Transformation Using Nonparametric Smoothers . . . 170
8.10 Multiple Imputation . . . 172
8.11 Further Reading . . . 175
8.12 Problems . . . 176
9 Overview of Maximum Likelihood Estimation . . . 179
9.1 General Notions - Simple Cases . . . 179
9.2 Hypothesis Tests . . . 183
9.2.1 Likelihood Ratio Test . . . 183
9.2.2 Wald Test . . . 184
9.2.3 Score Test . . . 184
9.2.4 Normal Distribution - One Sample . . . 185
9.3 General Case . . . 186
9.3.1 Global Test Statistics . . . 187
9.3.2 Testing a Subset of the Parameters . . . 187
9.3.3 Which Test Statistics to Use When . . . 189
9.3.4 Example: Binomial - Comparing Two Proportions . . . 190
9.4 Iterative ML Estimation . . . 192
9.5 Robust Estimation of the Covariance Matrix . . . 193
9.6 Wald, Score, and Likelihood-Based Confidence Intervals . . . 194
9.7 Bootstrap Confidence Regions . . . 195
9.8 Further Use of the Log Likelihood . . . 202
9.8.1 Rating Two Models, Penalizing for Complexity . . . 202
9.8.2 Testing Whether One Model Is Better than Another . . . 203
9.8.3 Unitless Index of Predictive Ability . . . 203
9.8.4 Unitless Index of Adequacy of a Subset of Predictors . . . 205
9.9 Weighted Maximum Likelihood Estimation . . . 206
9.10 Penalized Maximum Likelihood Estimation . . . 207
9.11 Further Reading . . . 210
9.12 Problems . . . 212
10 Binary Logistic Regression . . . 215
10.1 Model . . . 215
10.1.1 Model Assumptions and Interpretation of Parameters . . . 217
10.1.2 Odds Ratio, Risk Ratio, and Risk Difference . . . 220
10.1.3 Detailed Example . . . 221
10.1.4 Design Formulations . . . 227
10.2 Estimation . . . 228
10.2.1 Maximum Likelihood Estimates . . . 228
10.2.2 Estimation of Odds Ratios and Probabilities . . . 228
10.3 Test Statistics . . . 229
10.4 Residuals . . . 230
10.5 Assessment of Model Fit . . . 230
10.6 Collinearity . . . 244
10.7 Overly Influential Observations . . . 245
10.8 Quantifying Predictive Ability . . . 247
10.9 Validating the Fitted Model . . . 249
10.10 Describing the Fitted Model . . . 253
10.11 S-PLUS Functions . . . 257
10.12 Further Reading . . . 264
10.13 Problems . . . 265
11 Logistic Model Case Study 1: Predicting Cause of Death . . . 269
11.1 Preparation for Modeling . . . 269
11.2 Regression on Principal Components, Cluster Scores, and Pretransformations . . . 271
11.3 Fit and Diagnostics for a Full Model, and Interpreting Pretransformations . . . 276
11.4 Describing Results Using a Reduced Model . . . 285
11.5 Approximating the Full Model Using Recursive Partitioning . . . 291
11.6 Validating the Reduced Model . . . 294
12 Logistic Model Case Study 2: Survival of Titanic Passengers . . . 299
12.1 Descriptive Statistics . . . 300
12.2 Exploring Trends with Nonparametric Regression . . . 305
12.3 Binary Logistic Model With Casewise Deletion of Missing Values . . . 305
12.4 Examining Missing Data Patterns . . . 312
12.5 Single Conditional Mean Imputation . . . 316
12.6 Multiple Imputation . . . 320
12.7 Summarizing the Fitted Model . . . 322
12.8 Problems . . . 326
13 Ordinal Logistic Regression . . . 331
13.1 Background . . . 331
13.2 Ordinality Assumption . . . 332
13.3 Proportional Odds Model . . . 333
13.3.1 Model . . . 333
13.3.2 Assumptions and Interpretation of Parameters . . . 333
13.3.3 Estimation . . . 334
13.3.4 Residuals . . . 334
13.3.5 Assessment of Model Fit . . . 335
13.3.6 Quantifying Predictive Ability . . . 335
13.3.7 Validating the Fitted Model . . . 337
13.3.8 S-PLUS Functions . . . 337
13.4 Continuation Ratio Model . . . 338
13.4.1 Model . . . 338
13.4.2 Assumptions and Interpretation of Parameters . . . 338
13.4.3 Estimation . . . 339
13.4.4 Residuals . . . 339
13.4.5 Assessment of Model Fit . . . 339
13.4.6 Extended CR Model . . . 339
13.4.7 Role of Penalization in Extended CR Model . . . 340
13.4.8 Validating the Fitted Model . . . 341
13.4.9 S-PLUS Functions . . . 341
13.5 Further Reading . . . 342
13.6 Problems . . . 342
14 Case Study in Ordinal Regression, Data Reduction, and Penalization . . . 345
14.1 Response Variable . . . 346
14.2 Variable Clustering . . . 347
14.3 Developing Cluster Summary Scores . . . 349
14.4 Assessing Ordinality of Y for each X, and Unadjusted Checking of PO and CR Assumptions . . . 351
14.5 A Tentative Full Proportional Odds Model . . . 352
14.6 Residual Plots . . . 355
14.7 Graphical Assessment of Fit of CR Model . . . 357
14.8 Extended Continuation Ratio Model . . . 357
14.9 Penalized Estimation . . . 359
14.10 Using Approximations to Simplify the Model . . . 364
14.11 Validating the Model . . . 367
14.12 Summary . . . 369
14.13 Further Reading . . . 371
14.14 Problems . . . 371
15 Models Using Nonparametric Transformations of X and Y . . . 375
15.1 Background . . . 375
15.2 Generalized Additive Models . . . 376
15.3 Nonparametric Estimation of Y-Transformation . . . 376
15.4 Obtaining Estimates on the Original Scale . . . 377
15.5 S-PLUS Functions . . . 378
15.6 Case Study . . . 379
16 Introduction to Survival Analysis . . . 389
16.1 Background . . . 389
16.2 Censoring, Delayed Entry, and Truncation . . . 391
16.3 Notation, Survival, and Hazard Functions . . . 392
16.4 Homogeneous Failure Time Distributions . . . 398
16.5 Nonparametric Estimation of S and Λ . . . 400
16.5.1 Kaplan-Meier Estimator . . . 400
16.5.2 Altschuler-Nelson Estimator . . . 403
16.6 Analysis of Multiple Endpoints . . . 404
16.6.1 Competing Risks . . . 404
16.6.2 Competing Dependent Risks . . . 405
16.6.3 State Transitions and Multiple Types of Nonfatal Events . . . 406
16.6.4 Joint Analysis of Time and Severity of an Event . . . 407
16.6.5 Analysis of Multiple Events . . . 407
16.7 S-PLUS Functions . . . 408
16.8 Further Reading . . . 410
16.9 Problems . . . 411
17 Parametric Survival Models . . . 413
17.1 Homogeneous Models (No Predictors) . . . 413
17.1.1 Specific Models . . . 413
17.1.2 Estimation . . . 414
17.1.3 Assessment of Model Fit . . . 416
17.2 Parametric Proportional Hazards Models . . . 417
17.2.1 Model . . . 417
17.2.2 Model Assumptions and Interpretation of Parameters . . . 418
17.2.3 Hazard Ratio, Risk Ratio, and Risk Difference . . . 419
17.2.4 Specific Models . . . 421
17.2.5 Estimation . . . 422
17.2.6 Assessment of Model Fit . . . 423
17.3 Accelerated Failure Time Models . . . 426
17.3.1 Model . . . 426
17.3.2 Model Assumptions and Interpretation of Parameters . . . 427
17.3.3 Specific Models . . . 427
17.3.4 Estimation . . . 428
17.3.5 Residuals . . . 429
17.3.6 Assessment of Model Fit . . . 430
17.3.7 Validating the Fitted Model . . . 434
17.4 Buckley-James Regression Model . . . 435
17.5 Design Formulations . . . 435
17.6 Test Statistics . . . 435
17.7 Quantifying Predictive Ability . . . 436
17.8 S-PLUS Functions . . . 436
17.9 Further Reading . . . 441
17.10 Problems . . . 441
18 Case Study in Parametric Survival Modeling and Model Approximation . . . 443
18.1 Descriptive Statistics . . . 443
18.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model . . . 448
18.3 Summarizing the Fitted Model . . . 454
18.4 Internal Validation of the Fitted Model Using the Bootstrap . . . 454
18.5 Approximating the Full Model . . . 458
18.6 Problems . . . 464
19 Cox Proportional Hazards Regression Model . . . 465
19.1 Model . . . 465
19.1.1 Preliminaries . . . 465
19.1.2 Model Definition . . . 466
19.1.3 Estimation of β . . . 466
19.1.4 Model Assumptions and Interpretation of Parameters . . . 468
19.1.5 Example . . . 468
19.1.6 Design Formulations . . . 470
19.1.7 Extending the Model by Stratification . . . 470
19.2 Estimation of Survival Probability and Secondary Parameters . . . 472
19.3 Test Statistics . . . 474
19.4 Residuals . . . 476
19.5 Assessment of Model Fit . . . 476
19.5.1 Regression Assumptions . . . 477
19.5.2 Proportional Hazards Assumption . . . 483
19.6 What to Do When PH Fails . . . 489
19.7 Collinearity . . . 491
19.8 Overly Influential Observations . . . 492
19.9 Quantifying Predictive Ability . . . 492
19.10 Validating the Fitted Model . . . 493
19.10.1 Validation of Model Calibration . . . 493
19.10.2 Validation of Discrimination and Other Statistical Indexes . . . 494
19.11 Describing the Fitted Model . . . 496
19.12 S-PLUS Functions . . . 499
19.13 Further Reading . . . 506
20 Case Study in Cox Regression . . . 509
20.1 Choosing the Number of Parameters and Fitting the Model . . . 509
20.2 Checking Proportional Hazards . . . 513
20.3 Testing Interactions . . . 516
20.4 Describing Predictor Effects . . . 517
20.5 Validating the Model . . . 517
20.6 Presenting the Model . . . 519
20.7 Problems . . . 522
Appendix . . . 523
References . . . 527
Index . . . 559