Regression Modeling Strategies: Frank E. Harrell, Jr.

This document provides an overview and summary of the book "Regression Modeling Strategies" by Frank E. Harrell Jr. It discusses the use of regression modeling for linear models, logistic regression, and survival analysis. The book covers topics such as model formulation, interpreting model parameters, assessing model fit and assumptions, handling missing data, variable selection, and model validation. It also provides case studies demonstrating various modeling strategies.

Uploaded by

Cipriana Gîrbea

Frank E. Harrell, Jr.

Regression Modeling Strategies
With Applications to Linear Models, Logistic Regression, and Survival Analysis

With 141 Figures

Springer

Contents

Preface
Typographical Conventions

1 Introduction
1.1 Hypothesis Testing, Estimation, and Prediction
1.2 Examples of Uses of Predictive Multivariable Modeling
1.3 Planning for Modeling
1.3.1 Emphasizing Continuous Variables
1.4 Choice of the Model
1.5 Further Reading

2 General Aspects of Fitting Regression Models
2.1 Notation for Multivariable Regression Models
2.2 Model Formulations
2.3 Interpreting Model Parameters
2.3.1 Nominal Predictors
2.3.2 Interactions
2.3.3 Example: Inference for a Simple Model
2.4 Relaxing Linearity Assumption for Continuous Predictors
2.4.1 Simple Nonlinear Terms
2.4.2 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
2.4.3 Cubic Spline Functions
2.4.4 Restricted Cubic Splines
2.4.5 Choosing Number and Position of Knots
2.4.6 Nonparametric Regression
2.4.7 Advantages of Regression Splines over Other Methods
2.5 Recursive Partitioning: Tree-Based Models
2.6 Multiple Degree of Freedom Tests of Association
2.7 Assessment of Model Fit
2.7.1 Regression Assumptions
2.7.2 Modeling and Testing Complex Interactions
2.7.3 Fitting Ordinal Predictors
2.7.4 Distributional Assumptions
2.8 Further Reading
2.9 Problems

3 Missing Data
3.1 Types of Missing Data
3.2 Prelude to Modeling
3.3 Missing Values for Different Types of Response Variables
3.4 Problems with Simple Alternatives to Imputation
3.5 Strategies for Developing Imputation Algorithms
3.6 Single Conditional Mean Imputation
3.7 Multiple Imputation
3.8 Summary and Rough Guidelines
3.9 Further Reading
3.10 Problems

4 Multivariable Modeling Strategies
4.1 Prespecification of Predictor Complexity Without Later Simplification
4.2 Checking Assumptions of Multiple Predictors Simultaneously
4.3 Variable Selection
4.4 Overfitting and Limits on Number of Predictors
4.5 Shrinkage
4.6 Collinearity
4.7 Data Reduction
4.7.1 Variable Clustering
4.7.2 Transformation and Scaling Variables Without Using Y
4.7.3 Simultaneous Transformation and Imputation
4.7.4 Simple Scoring of Variable Clusters
4.7.5 Simplifying Cluster Scores
4.7.6 How Much Data Reduction Is Necessary?
4.8 Overly Influential Observations
4.9 Comparing Two Models
4.10 Summary: Possible Modeling Strategies
4.10.1 Developing Predictive Models
4.10.2 Developing Models for Effect Estimation
4.10.3 Developing Models for Hypothesis Testing
4.11 Further Reading

5 Resampling, Validating, Describing, and Simplifying the Model
5.1 The Bootstrap
5.2 Model Validation
5.2.1 Introduction
5.2.2 Which Quantities Should Be Used in Validation?
5.2.3 Data-Splitting
5.2.4 Improvements on Data-Splitting: Resampling
5.2.5 Validation Using the Bootstrap
5.3 Describing the Fitted Model
5.4 Simplifying the Final Model by Approximating It
5.4.1 Difficulties Using Full Models
5.4.2 Approximating the Full Model
5.5 Further Reading

6 S-PLUS Software
6.1 The S Modeling Language
6.2 User-Contributed Functions
6.3 The Design Library
6.4 Other Functions
6.5 Further Reading

7 Case Study in Least Squares Fitting and Interpretation of a Linear Model
7.1 Descriptive Statistics
7.2 Spending Degrees of Freedom/Specifying Predictor Complexity
7.3 Fitting the Model Using Least Squares
7.4 Checking Distributional Assumptions
7.5 Checking Goodness of Fit
7.6 Overly Influential Observations
7.7 Test Statistics and Partial R^2
7.8 Interpreting the Model
7.9 Problems

8 Case Study in Imputation and Data Reduction
8.1 Data
8.2 How Many Parameters Can Be Estimated?
8.3 Variable Clustering
8.4 Single Imputation Using Constants or Recursive Partitioning
8.5 Transformation and Single Imputation Using transcan
8.6 Data Reduction Using Principal Components
8.7 Detailed Examination of Individual Transformations
8.8 Examination of Variable Clusters on Transformed Variables
8.9 Transformation Using Nonparametric Smoothers
8.10 Multiple Imputation
8.11 Further Reading
8.12 Problems

9 Overview of Maximum Likelihood Estimation
9.1 General Notions - Simple Cases
9.2 Hypothesis Tests
9.2.1 Likelihood Ratio Test
9.2.2 Wald Test
9.2.3 Score Test
9.2.4 Normal Distribution - One Sample
9.3 General Case
9.3.1 Global Test Statistics
9.3.2 Testing a Subset of the Parameters
9.3.3 Which Test Statistics to Use When
9.3.4 Example: Binomial - Comparing Two Proportions
9.4 Iterative ML Estimation
9.5 Robust Estimation of the Covariance Matrix
9.6 Wald, Score, and Likelihood-Based Confidence Intervals
9.7 Bootstrap Confidence Regions
9.8 Further Use of the Log Likelihood
9.8.1 Rating Two Models, Penalizing for Complexity
9.8.2 Testing Whether One Model Is Better than Another
9.8.3 Unitless Index of Predictive Ability
9.8.4 Unitless Index of Adequacy of a Subset of Predictors
9.9 Weighted Maximum Likelihood Estimation
9.10 Penalized Maximum Likelihood Estimation
9.11 Further Reading
9.12 Problems

10 Binary Logistic Regression
10.1 Model
10.1.1 Model Assumptions and Interpretation of Parameters
10.1.2 Odds Ratio, Risk Ratio, and Risk Difference
10.1.3 Detailed Example
10.1.4 Design Formulations
10.2 Estimation
10.2.1 Maximum Likelihood Estimates
10.2.2 Estimation of Odds Ratios and Probabilities
10.3 Test Statistics
10.4 Residuals
10.5 Assessment of Model Fit
10.6 Collinearity
10.7 Overly Influential Observations
10.8 Quantifying Predictive Ability
10.9 Validating the Fitted Model
10.10 Describing the Fitted Model
10.11 S-PLUS Functions
10.12 Further Reading
10.13 Problems

11 Logistic Model Case Study 1: Predicting Cause of Death
11.1 Preparation for Modeling
11.2 Regression on Principal Components, Cluster Scores, and Pretransformations
11.3 Fit and Diagnostics for a Full Model, and Interpreting Pretransformations
11.4 Describing Results Using a Reduced Model
11.5 Approximating the Full Model Using Recursive Partitioning
11.6 Validating the Reduced Model

12 Logistic Model Case Study 2: Survival of Titanic Passengers
12.1 Descriptive Statistics
12.2 Exploring Trends with Nonparametric Regression
12.3 Binary Logistic Model With Casewise Deletion of Missing Values
12.4 Examining Missing Data Patterns
12.5 Single Conditional Mean Imputation
12.6 Multiple Imputation
12.7 Summarizing the Fitted Model
12.8 Problems

13 Ordinal Logistic Regression
13.1 Background
13.2 Ordinality Assumption
13.3 Proportional Odds Model
13.3.1 Model
13.3.2 Assumptions and Interpretation of Parameters
13.3.3 Estimation
13.3.4 Residuals
13.3.5 Assessment of Model Fit
13.3.6 Quantifying Predictive Ability
13.3.7 Validating the Fitted Model
13.3.8 S-PLUS Functions
13.4 Continuation Ratio Model
13.4.1 Model
13.4.2 Assumptions and Interpretation of Parameters
13.4.3 Estimation
13.4.4 Residuals
13.4.5 Assessment of Model Fit
13.4.6 Extended CR Model
13.4.7 Role of Penalization in Extended CR Model
13.4.8 Validating the Fitted Model
13.4.9 S-PLUS Functions
13.5
13.6

14 Case Study in Ordinal Regression, Data Reduction, and Penalization
14.1 Response Variable
14.2 Variable Clustering
14.3 Developing Cluster Summary Scores
14.4 Assessing Ordinality of Y for each X, and Unadjusted Checking of PO and CR Assumptions
14.5 A Tentative Full Proportional Odds Model
14.6 Residual Plots
14.7 Graphical Assessment of Fit of CR Model
14.8 Extended Continuation Ratio Model
14.9 Penalized Estimation
14.10 Using Approximations to Simplify the Model
14.11 Validating the Model
14.12 Summary
14.13 Further Reading
14.14 Problems

15 Models Using Nonparametric Transformations of X and Y
15.1 Background
15.2 Generalized Additive Models
15.3 Nonparametric Estimation of Y-Transformation
15.4 Obtaining Estimates on the Original Scale
15.5 S-PLUS Functions
15.6 Case Study

16 Introduction to Survival Analysis
16.1 Background
16.2 Censoring, Delayed Entry, and Truncation
16.3 Notation, Survival, and Hazard Functions
16.4 Homogeneous Failure Time Distributions
16.5 Nonparametric Estimation of S and Λ
16.5.1 Kaplan-Meier Estimator
16.5.2 Altschuler-Nelson Estimator
16.6 Analysis of Multiple Endpoints
16.6.1 Competing Risks
16.6.2 Competing Dependent Risks
16.6.3 State Transitions and Multiple Types of Nonfatal Events
16.6.4 Joint Analysis of Time and Severity of an Event
16.6.5 Analysis of Multiple Events
16.7 S-PLUS Functions
16.8 Further Reading
16.9 Problems

17 Parametric Survival Models
17.1 Homogeneous Models (No Predictors)
17.1.1 Specific Models
17.1.2 Estimation
17.1.3 Assessment of Model Fit
17.2 Parametric Proportional Hazards Models
17.2.1 Model
17.2.2 Model Assumptions and Interpretation of Parameters
17.2.3 Hazard Ratio, Risk Ratio, and Risk Difference
17.2.4 Specific Models
17.2.5 Estimation
17.2.6 Assessment of Model Fit
17.3 Accelerated Failure Time Models
17.3.1 Model
17.3.2 Model Assumptions and Interpretation of Parameters
17.3.3 Specific Models
17.3.4 Estimation
17.3.5 Residuals
17.3.6 Assessment of Model Fit
17.3.7 Validating the Fitted Model
17.4 Buckley-James Regression Model
17.5 Design Formulations
17.6 Test Statistics
17.7 Quantifying Predictive Ability
17.8 S-PLUS Functions
17.9 Further Reading
17.10 Problems

18 Case Study in Parametric Survival Modeling and Model Approximation
18.1 Descriptive Statistics
18.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
18.3 Summarizing the Fitted Model
18.4 Internal Validation of the Fitted Model Using the Bootstrap
18.5 Approximating the Full Model
18.6 Problems

19 Cox Proportional Hazards Regression Model
19.1 Model
19.1.1 Preliminaries
19.1.2 Model Definition
19.1.3 Estimation of β
19.1.4 Model Assumptions and Interpretation of Parameters
19.1.5 Example
19.1.6 Design Formulations
19.1.7 Extending the Model by Stratification
19.2 Estimation of Survival Probability and Secondary Parameters
19.3 Test Statistics
19.4 Residuals
19.5 Assessment of Model Fit
19.5.1 Regression Assumptions
19.5.2 Proportional Hazards Assumption
19.6 What to Do When PH Fails
19.7 Collinearity
19.8 Overly Influential Observations
19.9 Quantifying Predictive Ability
19.10 Validating the Fitted Model
19.10.1 Validation of Model Calibration
19.10.2 Validation of Discrimination and Other Statistical Indexes
19.11 Describing the Fitted Model
19.12 S-PLUS Functions
19.13 Further Reading

20 Case Study in Cox Regression
20.1 Choosing the Number of Parameters and Fitting the Model
20.2 Checking Proportional Hazards
20.3 Testing Interactions
20.4 Describing Predictor Effects
20.5 Validating the Model
20.6 Presenting the Model
20.7 Problems

Appendix

References

Index
